## **3. Statistical weighting** (TF-IDF / Okapi BM25)  

https://stackoverflow.com/questions/46580932/calculate-tf-idf-using-sklearn-for-n-grams-in-python  
http://scikit-learn.sourceforge.net/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer  
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://pypi.org/project/rank-bm25/

In [65]:
path = '../04-filtrage/output/'
acteur = 'chum'
sous_corpus = False
tag = ''

### **Read the vocabulary** (terms retained during preprocessing)

In [66]:
from pandas import *

if sous_corpus:
    file_path = path + acteur + '/' + tag + '/' + tag + '_terms.csv' # for the lemmatized terms, use the file ending in -lemmatized.csv

else:
    file_path = path + acteur + '/' + acteur + '_terms.csv' # for the lemmatized terms, use the file ending in -lemmatized.csv
    
with open(file_path, encoding='utf-8') as f:
    csv = read_csv(f)

In [67]:
vocabulaire = [t.lower() for t in list(csv['Expression'])]

In [68]:
print('On a un vocabulaire de {} formes.'.format(len(vocabulaire)))
vocabulaire

On a un vocabulaire de 580 formes.


['centre de recherche',
 'fondation du chum',
 'enseignement et académie',
 'répertoire enseignement',
 'comité des usagers',
 "politique d'approvisionnement",
 'réseau québécois',
 'commissaire local',
 'local aux plaintes',
 'commissaire local aux plaintes',
 'regroupement des retraités',
 'retraités du chum',
 'patients répertoire',
 'réseau québécois de la télésanté',
 'recherche innovation',
 "affilié à l'université",
 'regroupement des retraités du chum',
 'bibliothèque comité',
 'retraités du chum ruisss',
 'chum confidentialité avertissement',
 'chum ruisss chum',
 'ruisss chum',
 'approvisionnement regroupement des retraités',
 'centre de recherche innovation',
 'académie centre de recherche',
 'académie centre',
 'chum ruisss',
 'approvisionnement regroupement',
 'fondation du chum bénévolat',
 'bénévolat bibliothèque comité',
 'patients répertoire enseignement',
 'formations laboratoires',
 'comité des usagers commissaire',
 'plaintes politique',
 'commissaire local aux plai

### **Read the corpus**

In [69]:
import os, re
from pandas import *

# Build the path to the corpus file.
# os.path.join is used directly: "from os import path" would overwrite the
# string variable `path` defined in the first cell.
if sous_corpus:
    base_path = os.path.join('../03-corpus/2-sous-corpus/', acteur, tag)

else:
    base_path = os.path.join('../03-corpus/2-data/1-fr/', acteur + '.csv')

# read_csv opens the file itself, so no explicit open() is needed
data = read_csv(base_path, encoding='utf-8')
text = data['text'].tolist()
# Normalize each document: trim newlines, lowercase, straighten apostrophes, drop digits
corpus = [re.sub(r'\d', '', t.strip('\n').lower().replace('’', '\'')) for t in text]

In [70]:
corpus = corpus[:round(len(corpus))]

nb_docs = len(corpus)

print("On a donc un corpus de {} documents.".format(nb_docs))

On a donc un corpus de 2299 documents.


### **Apply the preprocessing**
If the terms passed as the vocabulary are lemmatized, set the lem parameter to True when calling nlp(corpus).  
sklearn's TfidfVectorizer extracts the n-grams itself, filters out function words, and computes the TF-IDF of our terms of interest.  
If the vocabulary terms were lemmatized, the corpus passed to the vectorizer must therefore be lemmatized as well; otherwise the n-grams built from the corpus cannot match the vocabulary entries.
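The need for the vocabulary and the corpus to share the same surface form can be illustrated with a small sketch. It assumes, as scikit-learn's word analyzer does, that n-grams are built by joining consecutive tokens with a single space. The helper names (tokenize, word_ngrams, matched_terms) and the toy sentences are hypothetical and not part of the notebook.

```python
import re

# Same pattern as the RegexpTokenizer used in this notebook
TOKEN_PATTERN = r"\w\'|\w+"

def tokenize(doc):
    """Split a document into tokens; elided forms such as "d'" become their own token."""
    return re.findall(TOKEN_PATTERN, doc)

def word_ngrams(tokens, min_n=2, max_n=5):
    """Hypothetical helper: space-joined n-grams of consecutive tokens."""
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def matched_terms(doc, vocabulary):
    """Vocabulary entries that occur verbatim among the document's n-grams."""
    grams = set(word_ngrams(tokenize(doc)))
    return sorted(t for t in vocabulary if t in grams)

vocabulary = ["centre de recherche", "comité des usagers", "politique d'approvisionnement"]

# Exact surface form: the term is found
print(matched_terms("le centre de recherche du chum", vocabulary))
# Plural surface form: no match against the singular vocabulary entry
print(matched_terms("les centres de recherche du chum", vocabulary))
# "d'" is a separate token, so the rebuilt n-gram is "politique d' approvisionnement"
print(matched_terms("la politique d'approvisionnement", vocabulary))
```

The same mismatch arises between a lemmatized vocabulary and an unlemmatized corpus, which is why both must go through the same preprocessing.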

In [71]:
import nltk
from nltk.tokenize import RegexpTokenizer
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer

def nlp(corpus, lem=False): 
    # Tokenization, shared by both branches (the lemmatization branch needs the tokens too)
    tokenizer = RegexpTokenizer(r"\w\'|\w+")
    tokens = [tokenizer.tokenize(doc) for  doc in corpus]

    if not lem:
        len_corpus = len(nltk.flatten(tokens))
        print("Avec le RegExpTokenizer, notre corpus contient {} tokens.".format(len_corpus))

        return tokens

    else:
        # POS tagging: rebuild each document as a string, re-attaching elided forms ("l' " -> "l'")
        input = [" ".join(nltk.flatten(doc)).replace("' ", "'") for doc in tokens]
        import treetaggerwrapper
        tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')

        path = '../04-filtrage/mapping_treeTagger_lefff.csv'

        with open(path) as f:
            csv = read_csv(f)

        treeTag = [term for term in csv['TreeTagger'].tolist()] 
        lefff = [term for term in csv['Lefff'].tolist()]

        mapping = {term : lefff[treeTag.index(term)] for term in treeTag}

        tagged= [tagger.tag_text(doc) for doc in input]

        tuples_doc = []
        for doc in tagged:
            tuples = []
            for t in doc:
                token = t.split('\t')[0]
                pos = mapping[t.split('\t')[1]]

                tuples.append([token, pos])
            tuples_doc.append(tuples)

        # Lemmatization
        lemmatizer = FrenchLefffLemmatizer()
        docs_lemmas = []

        for doc in tuples_doc:
            doc_lemma = []
            for t in doc:
                term_lemmatized = ""
                if(lemmatizer.lemmatize(t[0], t[1]) == []):
                    term_lemmatized = lemmatizer.lemmatize(t[0])
                else:
                    term_lemmatized = lemmatizer.lemmatize(t[0], t[1])[0][0] # [0][0] keeps the lemma alone rather than the (lemma, pos) pair
            
                if len(term_lemmatized) >1 :
                    doc_lemma.append(term_lemmatized)
            docs_lemmas.append(doc_lemma)

        docs_lemmas = [" ".join(doc) for doc in docs_lemmas]

        return docs_lemmas

In [72]:
corpus = nlp(corpus)

Avec le RegExpTokenizer, notre corpus contient 868105 tokens.


In [73]:
file_path = '../04-filtrage/mwe_stopwords.txt'

with open (file_path, 'r', encoding='utf-8') as f:
    mwe_sw = [t.lower().strip('\n') for t in f.readlines()]

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

# max_df: ignore terms that appear in more than 85% of documents
# min_df: ignore terms that appear in fewer than 10% of documents (min_df=0.1 in the call below)
# vocabulary = vocabulaire

# Without using the vocabulary
# tfidf = TfidfVectorizer(min_df=0.1, stop_words=None, ngram_range=(2,4), max_df=0.85, use_idf=True, max_features=200)

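# The corpus is already tokenized by nlp(), so the vectorizer's tokenizer returns each document unchanged
def identity_tokenizer(text):
    return text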

# vocabulary = vocabulaire
tfidf = TfidfVectorizer(vocabulary = vocabulaire, tokenizer=identity_tokenizer, ngram_range=(2,5), use_idf=True, lowercase=False, stop_words= mwe_sw)
tfs = tfidf.fit_transform(corpus)
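
To make the weights stored in tfs easier to interpret, here is a minimal pure-Python sketch of the computation. It assumes TfidfVectorizer's default settings: raw term counts, smoothed idf ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document row. The function tfidf_rows and the toy documents are hypothetical; each document is represented as a list of already extracted n-grams.

```python
import math

def tfidf_rows(docs_ngrams, vocabulary):
    """Hypothetical helper: one L2-normalized TF-IDF row per document."""
    n_docs = len(docs_ngrams)
    # Document frequency: number of documents containing each term
    df = {t: sum(1 for d in docs_ngrams if t in d) for t in vocabulary}
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for d in docs_ngrams:
        raw = [d.count(t) * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))
        # A document with no vocabulary term keeps an all-zero row
        rows.append([w / norm if norm else 0.0 for w in raw])
    return rows

vocabulary = ["centre de recherche", "imagerie médicale"]
docs_ngrams = [
    ["centre de recherche", "centre de recherche", "imagerie médicale"],
    ["centre de recherche"],
    [],
]

rows = tfidf_rows(docs_ngrams, vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

Because of the row normalization, a document in which only one vocabulary term occurs gets a weight of 1.0 for that term regardless of how often it appears, so weights close to 1 indicate a term that dominates its document rather than a frequent term.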



In [75]:
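features_names = tfidf.get_feature_names_out()
# Positional document indices: corpus.index(n) is quadratic and maps duplicate documents to the same index
corpus_index = list(range(len(corpus)))

import pandas as pd
# Rows are documents, columns are vocabulary terms
df = pd.DataFrame(tfs.toarray(), index=corpus_index, columns=features_names)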

In [76]:
df

| | centre de recherche | fondation du chum | enseignement et académie | répertoire enseignement | comité des usagers | politique d'approvisionnement | réseau québécois | commissaire local | local aux plaintes | commissaire local aux plaintes | ... | nouvelle demande | professeur à l'université | biologie moléculaire | équipes du crchum | temps de pandémie | innovations thérapeutiques | facteurs de risque | imagerie médicale | étude publiée | crchum professeur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.100364 | 0.033455 | 0.033455 | 0.033455 | 0.033455 | 0.0 | 0.033455 | 0.033455 | 0.033455 | 0.033455 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.566032 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 1 | 0.207102 | 0.103551 | 0.103551 | 0.103551 | 0.103551 | 0.0 | 0.103551 | 0.103551 | 0.103551 | 0.103551 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 2 | 0.153857 | 0.051286 | 0.051286 | 0.051286 | 0.051286 | 0.0 | 0.051286 | 0.051286 | 0.051286 | 0.051286 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 3 | 0.073406 | 0.073406 | 0.073406 | 0.073406 | 0.073406 | 0.0 | 0.073406 | 0.073406 | 0.073406 | 0.073406 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 4 | 0.136307 | 0.136307 | 0.136307 | 0.136307 | 0.136307 | 0.0 | 0.136307 | 0.136307 | 0.136307 | 0.136307 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2294 | 0.077019 | 0.077019 | 0.077019 | 0.077019 | 0.077019 | 0.0 | 0.077019 | 0.077019 | 0.077019 | 0.077019 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 2295 | 0.116786 | 0.116786 | 0.116786 | 0.116786 | 0.116786 | 0.0 | 0.116786 | 0.116786 | 0.116786 | 0.116786 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 2296 | 0.135278 | 0.045093 | 0.045093 | 0.045093 | 0.045093 | 0.0 | 0.045093 | 0.045093 | 0.045093 | 0.045093 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 2297 | 0.078642 | 0.078642 | 0.078642 | 0.078642 | 0.078642 | 0.0 | 0.078642 | 0.078642 | 0.078642 | 0.078642 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |


In [77]:
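from pathlib import Path

base_path = '../05-transformation/' + acteur + '/'

if sous_corpus:
    # Sub-corpus results go to their own sub-folder
    base_path = base_path + tag + '/'
    titre = tag

else:
    titre = acteur

# Create the output folder (including the sub-corpus folder, if any) before writing
Path(base_path).mkdir(parents=True, exist_ok=True)

df.to_csv(base_path + titre + '_matrice-TFIDF.csv')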

In [78]:
terms_weighted = []
rows, cols = tfs.nonzero()
for row, col in zip(rows,cols):
    terms_weighted.append([features_names[col], tfs[row,col]])

terms_weighted = DataFrame(terms_weighted, columns=['Terme', 'TF-IDF'])
terms_weighted.sort_values(["TF-IDF"], 
                    axis=0,
                    ascending=[False], 
                    inplace=True)

In [79]:
terms_weighted

| | Terme | TF-IDF |
|---|---|---|
| 102418 | innove action | 0.990927 |
| 78714 | fibrillation auriculaire | 0.988783 |
| 40050 | chum ssss | 0.986652 |
| 116462 | aide médicale | 0.982739 |
| 63524 | fibrillation auriculaire | 0.981036 |
| ... | ... | ... |
| 76169 | approvisionnement regroupement | 0.002026 |
| 76168 | fondation du chum bénévolat | 0.002026 |
| 76177 | regroupement des retraités du chum | 0.002026 |
| 76188 | répertoire enseignement | 0.002026 |


In [80]:
termes = set(features_names)

# Keep, for each term, its highest TF-IDF weight across all documents
liste_filtre = {term: df[term].max() for term in termes}

In [81]:
termes_tries = pd.DataFrame(liste_filtre.items(), columns=['Terme', 'TF-IDF'])
termes_tries.sort_values(["TF-IDF"], 
                    axis=0,
                    ascending=[False], 
                    inplace=True)

termes_tries.to_csv(base_path + titre + '_weighting_TF-IDF.csv')

## **Okapi BM25**
https://hal.archives-ouvertes.fr/hal-00760158 

In [82]:
from rank_bm25 import BM25Okapi

In [83]:
bm25 = BM25Okapi(corpus)

In [84]:
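# Tokenize the vocabulary terms with the same pattern as the corpus,
# so that elided forms such as "d'" match the corpus tokens
query_tokenizer = RegexpTokenizer(r"\w\'|\w+")
tokenized_queries = [query_tokenizer.tokenize(t) for t in vocabulaire]

features_names = list(vocabulaire)
# Positional document indices: corpus.index(n) is quadratic and maps duplicate documents to the same index
corpus_index = list(range(len(corpus)))

# One row of BM25 scores per term, then transpose so that rows are documents and columns are terms
tab = [bm25.get_scores(query) for query in tokenized_queries]
df = pd.DataFrame(tab, index=features_names, columns=corpus_index).transpose()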
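
As a reading aid for these scores: BM25 treats each query as a bag of tokens and sums one contribution per token, so a multiword term is not scored as a contiguous phrase. The sketch below is a textbook-style Okapi BM25 in pure Python, not rank_bm25's own code. The function name bm25_scores, the toy documents, and the non-negative idf variant ln((N - n + 0.5) / (n + 0.5) + 1) are assumptions made for illustration; the library's idf handling may differ. The values k1=1.5 and b=0.75 are commonly used defaults.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Hypothetical helper: one BM25 score per document for a tokenized query."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / n_docs
    # Number of documents containing each token
    doc_freq = Counter()
    for d in tokenized_docs:
        doc_freq.update(set(d))

    scores = []
    for d in tokenized_docs:
        counts = Counter(d)
        score = 0.0
        for q in query_tokens:
            n_q = doc_freq.get(q, 0)
            if n_q == 0:
                continue  # token unseen in the corpus: no contribution
            # Assumed idf variant, kept non-negative by the "+ 1"
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)
            f = counts[q]
            # Term-frequency saturation with document-length normalization
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    ["centre", "de", "recherche", "du", "chum"],
    ["recherche", "au", "centre", "hospitalier"],   # same words, not the phrase
    ["fondation", "du", "chum"],
]
scores = bm25_scores(["centre", "de", "recherche"], docs)
print(scores)
```

The second document never contains the phrase "centre de recherche", yet it still receives a positive score because two of the query tokens occur in it. This is a difference from the TF-IDF matrix, where only exact n-gram matches count.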

In [None]:
# Keep, for each term, its highest BM25 score across all documents
terms_weighted = {term: df[term].max() for term in df}

In [None]:
tab = DataFrame(terms_weighted.items(), columns=['Terme', 'OkapiBM25'])
tab.sort_values(["OkapiBM25"], 
                    axis=0,
                    ascending=[False], 
                    inplace=True)

In [None]:
tab.to_csv(base_path + titre + '_weighting_OKapiBM25.csv')