<h1><center>SINr : preprocessing text and training a SINr model</center></h1>
GitHub page: https://github.com/SINr-Embeddings/sinr/tree/main

Documentation: https://sinr-embeddings.github.io/sinr/_build/html/modules.html

Publications:
-  (https://hal.science/hal-03197434). Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez,
   Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse
   Interpretable Node Representations is not a Sin!. Advances in
   Intelligent Data Analysis XIX, 19th International Symposium on
   Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal.
   pp. 325-337.
-  (https://hal.science/hal-03770444). Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier.
   Are Embedding Spaces Interpretable? Results of an Intrusion Detection
   Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille,
   France.
-  (https://hal.science/hal-04321407). Simon Guillot, Thibault Prouteau, Nicolas Dugué.
   Sparser is better: one step closer to word embedding interpretability.
   IWCS 2023, Nancy, France.
-  (https://hal.science/hal-04398742). Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau.
   Filtering communities in word co-occurrence networks to foster the
   emergence of meaning. Complex Networks 2023, Menton, France.

In this notebook:
- How to preprocess a text corpus with the SINr library
- How to build a co-occurrence matrix from the preprocessed text
- How to train a SINr model
- How to create a SINrVectors object (to explore and evaluate the model)
- How to sparsify the model for better interpretability
- How to filter dimensions with the SINr-filtered method
- How to load an existing SINrVectors object

For examples of how to manipulate and evaluate models, see the notebooks sinrvec_en (English) or sinrvec_fr (French).

In [2]:
import nltk # For textual resources

import sinr.text.preprocess as ppcs
from sinr.text.cooccurrence import Cooccurrence
from sinr.text.pmi import pmi_filter
import sinr.graph_embeddings as ge

## Textual corpus

In [3]:
# Get a textual corpus
# For example, texts from the Project Gutenberg electronic text archive,
# hosted at http://www.gutenberg.org/
nltk.download('gutenberg')
gutenberg = nltk.corpus.gutenberg # NLTK's small selection of Project Gutenberg texts
# Write the raw text of the whole selection to a single file
with open("my_corpus.txt", "w") as file:
    file.write(gutenberg.raw())

[nltk_data] Downloading package gutenberg to
[nltk_data]     /lium/home/aberanger/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


## Preprocess the corpus

In [None]:
# If required, download and install the spaCy model used for preprocessing
!python -m spacy download en_core_web_lg

In [4]:
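# Convert the raw text file to a VRT file written in the current directory (".")
vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB,
                                      ppcs.Corpus.LANGUAGE_EN,
                                      "my_corpus.txt"),
                          ".", n_jobs=8)
vrt_maker.do_txt_to_vrt()
# Extract the sentences from the VRT file (min_freq sets the minimum word frequency)
sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20)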

2024-06-13 11:23:04,445 - do_txt_to_vrt - INFO - 256893lines to preprocess


  0%|          | 0/256893 [00:00<?, ?it/s]

2024-06-13 11:33:59,336 - do_txt_to_vrt - INFO - VRT-style file written in /export/home/lium/aberanger/sinr/notebooks/my_corpus.vrt


  0%|          | 0/3066063 [00:00<?, ?it/s]

## Construct cooccurrence matrix

In [5]:
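c = Cooccurrence()
# Count co-occurrences in the preprocessed sentences with a window of 5
c.fit(sentences, window=5)
# Filter the co-occurrence matrix with PMI
c.matrix = pmi_filter(c.matrix)
# Save the matrix so that it can be loaded when training the model
c.save("my_cooc_matrix.pk")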
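
As a rough illustration of what this step computes, the sketch below counts windowed co-occurrences and drops weakly associated pairs in pure Python. It is not the SINr implementation: the helper names `cooccurrence_counts` and `positive_pmi_pairs` are hypothetical, `window` is assumed to mean "the next `window` tokens", and the filter is assumed to keep only pairs with positive pointwise mutual information (PMI).

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that appear within `window` tokens (hypothetical helper)."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens of the sentence.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1
    return pair_counts


def positive_pmi_pairs(pair_counts):
    """Keep only the pairs whose PMI is positive (assumed filtering rule)."""
    total = sum(pair_counts.values())
    word_counts = Counter()
    for (a, b), n in pair_counts.items():
        word_counts[a] += n
        word_counts[b] += n
    kept = {}
    for (a, b), n in pair_counts.items():
        # Each unordered pair stands for two cells of a symmetric matrix,
        # so the matrix total is 2 * total.
        p_ab = n / (2 * total)
        p_a = word_counts[a] / (2 * total)
        p_b = word_counts[b] / (2 * total)
        if math.log(p_ab / (p_a * p_b)) > 0:
            kept[(a, b)] = n
    return kept


# Toy corpus: two tight word pairs linked by a single weak co-occurrence.
toy_sentences = [["cat", "purrs"]] * 3 + [["dog", "barks"]] * 3 + [["cat", "dog"]]
toy_counts = cooccurrence_counts(toy_sentences, window=5)
toy_filtered = positive_pmi_pairs(toy_counts)
```

In this toy corpus, "cat" and "dog" co-occur once while each is strongly tied to another word, so that pair gets a negative PMI and is dropped, while the two strong pairs are kept.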

## Train a SINr model

In [6]:
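# Build the graph from the saved co-occurrence matrix
sinr = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk")
# Detect communities in the graph (gamma is the Louvain parameter reported in the output)
commu = sinr.detect_communities(gamma=10)
# Extract the embeddings from the detected communities
sinr.extract_embeddings(commu)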

2024-06-13 11:35:09,926 - load_from_cooc_pkl - INFO - Building Graph.
2024-06-13 11:35:09,930 - load_pkl_text - INFO - Loading cooccurrence matrix and dictionary.
2024-06-13 11:35:09,982 - load_pkl_text - INFO - Finished loading data.
2024-06-13 11:35:10,350 - load_from_cooc_pkl - INFO - Finished building graph.
2024-06-13 11:35:10,356 - detect_communities - INFO - Detecting communities.
2024-06-13 11:35:10,550 - detect_communities - INFO - Finished detecting communities.
2024-06-13 11:35:10,554 - extract_embeddings - INFO - Extracting embeddings.
2024-06-13 11:35:10,556 - extract_embeddings - INFO - Applying NFM.


Gamma for louvain : 10
Communities detected in 0.16205 [s]
solution properties:
-------------------  -----------
# communities        678
min community size     2
max community size    53
avg. community size    6.43658
modularity             0.0823232
-------------------  -----------


2024-06-13 11:35:10,561 - get_nfm_embeddings - INFO - Starting NFM
2024-06-13 11:35:14,750 - extract_embeddings - INFO - NFM successfully applied.
2024-06-13 11:35:14,752 - extract_embeddings - INFO - Finished extracting embeddings.
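
To give an intuition of how communities can become embedding dimensions, here is a minimal sketch in pure Python. It assumes that each dimension corresponds to one detected community and that a word's value on it is the share of its edge weight going to that community. This is a simplification for illustration only: the NFM step named in the log output is not reproduced, and `node_recall` is a hypothetical helper, not a SINr function.

```python
from collections import defaultdict


def node_recall(edges, communities):
    """Share of each node's edge weight that goes to each community (hypothetical helper).

    edges: dict mapping undirected (u, v) pairs to weights.
    communities: dict mapping each node to its community id.
    """
    weight_to_community = defaultdict(lambda: defaultdict(float))
    degree = defaultdict(float)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
        # The edge contributes to u's link with v's community, and vice versa.
        weight_to_community[u][communities[v]] += w
        weight_to_community[v][communities[u]] += w
    return {
        node: {c: w / degree[node] for c, w in per_community.items()}
        for node, per_community in weight_to_community.items()
    }


# Toy graph: two communities of two words, linked by one weak edge.
toy_edges = {("cat", "purrs"): 3.0, ("dog", "barks"): 3.0, ("cat", "dog"): 1.0}
toy_communities = {"cat": 0, "purrs": 0, "dog": 1, "barks": 1}
toy_embeddings = node_recall(toy_edges, toy_communities)
```

Here "cat" sends three quarters of its edge weight to its own community and one quarter to the other, so its sparse vector has two non-zero values, while "purrs" is connected to a single community.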


## Construct a SINrVectors to work with the model

In [7]:
sinr_vec = ge.InterpretableWordsModelBuilder(sinr,
                                             'my_sinr_vectors_name',
                                             n_jobs=8,
                                             n_neighbors=25).build()
sinr_vec.save('./models/my_sinrvec.pk')

## Sparsify word vectors for better interpretability and performance

Sparsifying word vectors can improve both performance and interpretability. You can try different sparsity thresholds and compare the results on the similarity task and on DistRatio. These evaluations are available in the SINr library; see the notebooks sinrvec_en or sinrvec_fr for examples.

For more information about the sparsity of word embeddings:
-  (https://hal.science/hal-04321407). Simon Guillot, Thibault Prouteau, Nicolas Dugué.
   Sparser is better: one step closer to word embedding interpretability.
   IWCS 2023, Nancy, France.

In [None]:
sinr_vec.sparsify(100)
# Save your sparse model
sinr_vec.save('./models/my_spars_sinrvec.pk')
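
The idea behind sparsification can be sketched on a single vector. The example below assumes that sparsifying with a threshold `k` means keeping the `k` largest values of each word vector and setting the others to zero; `sparsify_top_k` is a hypothetical helper written for this illustration, not the library's code.

```python
def sparsify_top_k(vector, k):
    """Zero out all but the k largest values of a vector (hypothetical helper, k >= 0)."""
    if k >= len(vector):
        return list(vector)
    # Indices of the k largest values; ties go to the lowest index.
    keep = set(sorted(range(len(vector)), key=lambda i: -vector[i])[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(vector)]


toy_vector = [0.05, 0.40, 0.00, 0.30, 0.10, 0.15]
sparse_vector = sparsify_top_k(toy_vector, k=2)
```

With `k=2`, only the two strongest dimensions of the toy vector stay active, which makes it easier to see which dimensions characterize the word.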

## Use the SINr-filtered method for better performance, better interpretability, and a smaller memory footprint

SINr-filtered is a method that filters the dimensions of the model according to their number of non-zero values. It relies on the similarity task: it computes the similarity score for different filtering thresholds and selects the best ones. It is better to sparsify the vectors before applying this method. The thresholds it finds depend on the size of the model and on the number of non-zero values in its dimensions.

For more information, see:
-  (https://hal.science/hal-04398742). Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau.
   Filtering communities in word co-occurrence networks to foster the
   emergence of meaning. Complex Networks 2023, Menton, France.
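
The filtering principle described above can be sketched on a small matrix. The example assumes that a dimension is kept when its number of non-zero values lies between the two thresholds, bounds included; `filter_dimensions_by_nnz` is a hypothetical helper, and the selection of the thresholds through the similarity task is not shown.

```python
def filter_dimensions_by_nnz(vectors, threshold_min, threshold_max):
    """Drop dimensions whose non-zero count falls outside the bounds (hypothetical helper).

    vectors: list of equal-length rows, one per word.
    Returns the filtered rows and the indices of the kept dimensions.
    """
    n_dims = len(vectors[0])
    # Number of words with a non-zero value on each dimension.
    nnz = [sum(1 for row in vectors if row[d] != 0) for d in range(n_dims)]
    kept = [d for d in range(n_dims) if threshold_min <= nnz[d] <= threshold_max]
    return [[row[d] for d in kept] for row in vectors], kept


# Four words, four dimensions with 4, 1, 2 and 3 non-zero values respectively.
toy_matrix = [
    [0.5, 0.0, 0.2, 0.1],
    [0.3, 0.0, 0.0, 0.4],
    [0.2, 0.9, 0.0, 0.6],
    [0.7, 0.0, 0.1, 0.0],
]
filtered_matrix, kept_dims = filter_dimensions_by_nnz(
    toy_matrix, threshold_min=2, threshold_max=3
)
```

The first dimension is active for every word and the second for a single word, so both are removed, and the model keeps only the two dimensions in between.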

#### How to calculate the thresholds for a model

In [34]:
low_threshold, high_threshold = sinr_vec.dim_nnz_thresholds(step=10, diff_tol=0.001)

Minimum of non zero values in dimensions : 36
Maximum of non zero values in dimensions : 2053
Mean similarity of the model with all dimensions (MEN, WS353, SCWS, SimLex-999) : 0.15279358314888308

10 : 0.1528 20 : 0.1528 30 : 0.1528 40 : 0.1577 50 : 0.1583 60 : 0.158 70 : 0.1585 80 : 0.1621 90 : 0.163 100 : 0.159 110 : 0.1589 120 : 0.157 130 : 0.1587 140 : 0.1533 150 : 0.1594 160 : 0.1595 170 : 0.1543 180 : 0.1611 190 : 0.1602 200 : 0.1546 210 : 0.1681 220 : 0.1586 230 : 0.1688 240 : 0.1681 250 : 0.1583 260 : 0.1681 270 : 0.1641 280 : 0.1638 290 : 0.1651 300 : 0.1701 310 : 0.1705 320 : 0.1642 330 : 0.166 340 : 0.1651 350 : 0.1665 360 : 0.1657 370 : 0.1667 380 : 0.1672 390 : 0.1661 400 : 0.1662 410 : 0.1704 420 : 0.1676 430 : 0.1718 440 : 0.1663 450 : 0.1612 460 : 0.1614 470 : 0.161 480 : 0.173 490 : 0.1673 500 : 0.168 510 : 0.1623 520 : 0.1681 530 : 0.1728 540 : 0.172 550 : 0.1715 560 : 0.1716 570 : 0.1698 580 : 0.1663 590 : 0.1766 600 : 0.172 610 : 0.1725 620 : 0.1673 630 : 0.1723 640

#### How to filter the model

In [43]:
sinr_vec.remove_communities_dim_nnz(threshold_min=low_threshold, threshold_max=high_threshold)

  0%|          | 0/678 [00:00<?, ?it/s]

## Load an existing SINrVectors object

In [42]:
sinr_vec = ge.SINrVectors('my_sinr_vectors_name')
sinr_vec.load('./models/my_spars_sinrvec.pk')