## **3. Statistical weighting** (TF-IDF)  

https://stackoverflow.com/questions/46580932/calculate-tf-idf-using-sklearn-for-n-grams-in-python  
http://scikit-learn.sourceforge.net/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer  
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
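
As a reminder of what the weights computed in this section mean, the sketch below recomputes TF-IDF by hand on a toy corpus and compares the result with TfidfVectorizer. It assumes scikit-learn's default settings (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, L2 normalization of each document row). The three documents and the two-term vocabulary are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy data (hypothetical, not taken from the corpus)
docs_jouet = [
    "santé publique santé publique",
    "santé publique eau potable",
    "santé publique",
]
vocab_jouet = ["santé publique", "eau potable"]

# Weights as computed by scikit-learn
poids_sklearn = TfidfVectorizer(
    vocabulary=vocab_jouet, ngram_range=(2, 2)
).fit_transform(docs_jouet).toarray()

# Same computation by hand
comptes = CountVectorizer(
    vocabulary=vocab_jouet, ngram_range=(2, 2)
).fit_transform(docs_jouet).toarray().astype(float)

n_docs = comptes.shape[0]
df_termes = (comptes > 0).sum(axis=0)             # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df_termes)) + 1  # smoothed idf
brut = comptes * idf                              # tf * idf
poids_manuels = brut / np.linalg.norm(brut, axis=1, keepdims=True)  # L2 norm per document

print(np.allclose(poids_sklearn, poids_manuels))
```

Because each row is normalized, a score close to 1 means the term dominates the vocabulary terms found in that document, not that it is frequent in absolute terms.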


In [345]:
path = '/Users/camilledemers/Documents/05-transformation/'
acteur = 'inspq'
sous_corpus = False
tag = ''

### **Read the vocabulary** (terms retained during preprocessing)

In [346]:
from pandas import DataFrame, read_csv

if sous_corpus:
    file_path = path + acteur + '/' + tag + '/' + tag + '_terms.csv' # for lemmatized terms: terms-lemmatized.csv

else:
    file_path = path + acteur + '/' + acteur + '_terms.csv' # for lemmatized terms: terms-lemmatized.csv
    
with open(file_path) as f:
    csv = read_csv(f)

In [347]:
vocabulaire = [t.lower() for t in list(csv['Terme'])]

In [348]:
print('The vocabulary contains {} forms.'.format(len(vocabulaire)))
vocabulaire

The vocabulary contains 14717 forms.


['marie claude',
 'covid 19',
 'santé publique',
 'marie ève',
 'jean françois',
 'marie hélène',
 'comité scientifique',
 'veille scientifique',
 'anne marie',
 'jean pierre',
 'santé environnementale',
 'agressions sexuelles',
 'groupe scientifique',
 'santé au travail',
 'politiques publiques',
 'groupe de travail',
 'marie josée',
 'environnement bâti',
 'changements climatiques',
 'santé des voyageurs',
 'consommation d alcool',
 'comité d experts',
 'infections nosocomiales',
 'habitudes de vie',
 'saine alimentation',
 'mode de vie',
 'prévention des traumatismes',
 'vie actif',
 'mode de vie actif',
 'facultés affaiblies',
 'prévention de la violence',
 'st laurent',
 'st pierre',
 'jean philippe',
 'prud homme',
 'assurance qualité',
 'maladies chroniques',
 'eau potable',
 'santé publique du québec',
 'publique du québec',
 'maladies infectieuses',
 'inégalités sociales',
 'gestion des risques',
 'développement des compétences',
 'inégalités sociales de santé',
 'sociales de 

### **Read the corpus**

In [349]:
import os, shutil, re
from pathlib import Path

# Change the directory
if sous_corpus:
    path = '/Users/camilledemers/Documents/03-corpus/2-sous-corpus/'
    path = path + acteur + '/' + tag +'/'

else: 
    path = '/Users/camilledemers/Documents/03-corpus/2-data/1-fr/'
    path = path + acteur + '/'

os.chdir(path)
len(os.listdir())


corpus = []

for file in os.listdir():
    # Skip files ending in '-corpus_FR.txt' and files with 'PDF' in their name
    if file.endswith(".txt") and not file.endswith('-corpus_FR.txt') and 'PDF' not in file:
        file_path = path + file
        
        with open(file_path, 'r', encoding = "UTF-8") as f:
            data = f.readlines()
            # The text is read from the second line of each file; digits are removed
            # and typographic apostrophes are replaced with straight ones
            text = re.sub(r'\d', '', data[1].strip('\n').lower().replace('’', '\''))
            # Collapse the runs of spaces left behind by the removed digits
            text = re.sub(r' {2,}', ' ', text)
            corpus.append(text)

In [350]:
count_pdf = 0
for file in os.listdir(): 
    if 'PDF' in file:
        count_pdf +=1

print('There were {} PDF documents in the folder; they are not processed for now.'.format(count_pdf))

There were 0 PDF documents in the folder; they are not processed for now.


In [351]:
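# No-op as written: round(len(corpus)) is the full length, so the slice keeps every document
corpus = corpus[:round(len(corpus))]

nb_docs = len(corpus)

print("The corpus therefore contains {} documents.".format(nb_docs))

The corpus therefore contains 4342 documents.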


### **Apply the preprocessing**
Apply this step only if the terms passed as the vocabulary are lemmatized.  
sklearn's TfidfVectorizer extracts the n-grams itself and computes TF-IDF only for the terms of the vocabulary it is given, so function words and any other n-gram outside the vocabulary are ignored.  
If the vocabulary terms were lemmatized, the corpus passed to the vectorizer must be lemmatized as well; otherwise the inflected forms in the text will not match the lemmatized terms.
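
The sketch below illustrates why the vocabulary and the corpus need the same treatment. It uses a made-up sentence and a hand-written "lemmatized" version of it (an assumption for illustration, not the output of the lemmatizer used in this notebook), and scores a single lemmatized term against both.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical lemmatized vocabulary and example sentences
vocabulaire_lemmatise = ["changement climatique"]
doc_brut = ["les changements climatiques menacent la santé"]
doc_lemmatise = ["le changement climatique menacer la santé"]  # lemmatized by hand

def poids(docs):
    """Return the TF-IDF matrix of docs restricted to the lemmatized vocabulary."""
    vec = TfidfVectorizer(vocabulary=vocabulaire_lemmatise, ngram_range=(2, 4))
    return vec.fit_transform(docs).toarray()

poids_brut = poids(doc_brut)            # plural form in the text: the term is not found
poids_lemmatise = poids(doc_lemmatise)  # lemmatized text: the term is found

print(poids_brut[0, 0], poids_lemmatise[0, 0])
```

With the raw text, the term gets a zero weight even though the document is clearly about it, which is the mismatch the preprocessing step is meant to avoid.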

In [352]:
import nltk
import treetaggerwrapper
from nltk.tokenize import RegexpTokenizer
from pandas import read_csv
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer

def nlp(corpus): 
    # Tokenization: elided forms (l', d', ...) are kept as their own tokens
    tokenizer = RegexpTokenizer(r"\w\'|\w+")

    tokens = [tokenizer.tokenize(doc) for doc in corpus]
    len_corpus = len(nltk.flatten(tokens))
    print("With the RegexpTokenizer, the corpus contains {} tokens.".format(len_corpus))

    # POS tagging: rebuild each document as a string, reattaching elided forms to the next word
    docs_text = [" ".join(doc).replace("' ", "'") for doc in tokens]
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')

    # Mapping from TreeTagger tags to the tags passed to the Lefff lemmatizer
    path = '/Users/camilledemers/Documents/04-filtrage/mapping_treeTagger_lefff.csv'

    with open(path) as f:
        csv = read_csv(f)

    treeTag = csv['TreeTagger'].tolist()
    lefff = csv['Lefff'].tolist()

    mapping = {term : lefff[treeTag.index(term)] for term in treeTag}

    tagged = [tagger.tag_text(doc) for doc in docs_text]

    tuples_doc = []
    for doc in tagged:
        tuples = []
        for t in doc:
            # Each tagged token is a tab-separated string: token, POS tag, lemma
            fields = t.split('\t')
            token = fields[0]
            pos = mapping[fields[1]]

            tuples.append([token, pos])
        tuples_doc.append(tuples)

    # Lemmatization
    lemmatizer = FrenchLefffLemmatizer()
    docs_lemmas = []

    for doc in tuples_doc:
        doc_lemma = []
        for token, pos in doc:
            lemmas = lemmatizer.lemmatize(token, pos)
            if lemmas == []:
                # No lemma found for this POS: fall back to the default lemmatization
                term_lemmatized = lemmatizer.lemmatize(token)
            else:
                term_lemmatized = lemmas[0][0] # [0][0] keeps the lemma alone rather than (lemma, pos)
        
            # Drop one-character lemmas
            if len(term_lemmatized) > 1:
                doc_lemma.append(term_lemmatized)
        docs_lemmas.append(doc_lemma)

    docs_lemmas = [" ".join(doc) for doc in docs_lemmas]

    return docs_lemmas

In [353]:
corpus = nlp(corpus)

With the RegexpTokenizer, the corpus contains 39223816 tokens.


In [331]:
from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=0.85: ignore terms that appear in more than 85% of documents
# min_df=0.1: ignore terms that appear in fewer than 10% of documents
# vocabulary=vocabulaire: score only the terms of the vocabulary

# Without the vocabulary
# tfidf = TfidfVectorizer(min_df=0.1, stop_words=None, ngram_range=(2,4), max_df=0.85, use_idf=True, max_features=200)

tfidf = TfidfVectorizer(vocabulary = vocabulaire, ngram_range=(2,4), use_idf=True)
tfs = tfidf.fit_transform(corpus)
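
One point worth checking when a fixed vocabulary is used: by default, TfidfVectorizer only keeps tokens of two or more characters (its default token_pattern is r"(?u)\b\w\w+\b"). Vocabulary entries that contain a one-letter word, such as 'consommation d alcool' or 'comité d experts' in the list above, may therefore never be matched. The sketch below shows the effect on a made-up sentence; the alternative token_pattern is an assumption for illustration, not a setting used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs_exemple = ["la consommation d alcool augmente"]  # made-up sentence
vocab_exemple = ["consommation d alcool"]

# Default tokenizer: tokens need at least two characters, so "d" is dropped
# and the 3-gram "consommation d alcool" is never produced
par_defaut = TfidfVectorizer(vocabulary=vocab_exemple, ngram_range=(2, 4))
poids_defaut = par_defaut.fit_transform(docs_exemple).toarray()

# Assumed alternative: accept one-character tokens as well
avec_lettres = TfidfVectorizer(vocabulary=vocab_exemple, ngram_range=(2, 4),
                               token_pattern=r"(?u)\b\w+\b")
poids_lettres = avec_lettres.fit_transform(docs_exemple).toarray()

print(poids_defaut[0, 0], poids_lettres[0, 0])
```

Rows of the TF-IDF matrix that are zero everywhere are a good place to look for this kind of tokenization mismatch between the vocabulary and the vectorizer.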

In [332]:
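features_names = tfidf.get_feature_names_out()
# Label columns by document position; corpus.index(n) returns the first match,
# so duplicated documents would all receive the same label
corpus_index = list(range(len(corpus)))

import pandas as pd
df = pd.DataFrame(tfs.T.toarray(), index=features_names, columns=corpus_index)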

In [333]:
df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3052 | 3053 | 31 | 3055 | 3056 | 3057 | 3058 | 3059 | 3060 | 3061 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| centre de recherche | 0.05106 | 0.099895 | 0.200438 | 0.169988 | 0.070864 | 0.120084 | 0.052021 | 0.096184 | 0.214922 | 0.144782 | ... | 0.193288 | 0.166656 | 0.045396 | 0.096184 | 0.066776 | 0.113547 | 0.142047 | 0.05079 | 0.032664 | 0.095711 |
| politique d'approvisionnement | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| réseau québécois | 0.05106 | 0.033298 | 0.100219 | 0.056663 | 0.070864 | 0.120084 | 0.052021 | 0.096184 | 0.107461 | 0.048261 | ... | 0.064429 | 0.055552 | 0.045396 | 0.096184 | 0.066776 | 0.113547 | 0.047349 | 0.05079 | 0.032664 | 0.023928 |
| commissaire local | 0.05106 | 0.033298 | 0.100219 | 0.056663 | 0.070864 | 0.120084 | 0.052021 | 0.096184 | 0.107461 | 0.048261 | ... | 0.064429 | 0.055552 | 0.045396 | 0.096184 | 0.066776 | 0.113547 | 0.047349 | 0.05079 | 0.032664 | 0.023928 |
| réseaux sociaux | 0.05106 | 0.033298 | 0.100219 | 0.056663 | 0.070864 | 0.120084 | 0.052021 | 0.096184 | 0.107461 | 0.048261 | ... | 0.064429 | 0.055552 | 0.045396 | 0.096184 | 0.066776 | 0.113547 | 0.047349 | 0.05079 | 0.032664 | 0.023928 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| patient chum | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| investigator chum | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| chirurgie cardiaque | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| crchum researcher | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |


In [334]:
from pathlib import Path

path = '/Users/camilledemers/Documents/05-transformation/' + acteur + '/'
Path(path).mkdir(parents=True, exist_ok=True)

if sous_corpus:
    path = path + tag + '/'
    titre = tag

else:
    titre = acteur

df.to_csv(path + titre + '_matrice-TFIDF.csv')

In [335]:
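terms_weighted = []
# Walk the stored entries of the sparse matrix once instead of indexing tfs[row, col] for each pair
coo = tfs.tocoo()
for col, value in zip(coo.col, coo.data):
    if value != 0:
        terms_weighted.append([features_names[col], value])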

In [336]:
terms_weighted = DataFrame(terms_weighted, columns=['Terme', 'TF-IDF'])
terms_weighted.sort_values(["TF-IDF"], 
                    axis=0,
                    ascending=[False], 
                    inplace=True)

In [337]:
terms_weighted

| | Terme | TF-IDF |
|---|---|---|
| 126924 | code promo | 0.993444 |
| 124303 | code promo | 0.993444 |
| 138028 | innove action | 0.990688 |
| 108362 | fibrillation auriculaire | 0.988591 |
| 142009 | liver disease | 0.988339 |
| ... | ... | ... |
| 53805 | chum bénévolat bibliothèque | 0.003463 |
| 53804 | bénévolat bibliothèque | 0.003463 |
| 53803 | bénévolat bibliothèque comité | 0.003463 |
| 53802 | bibliothèque comité | 0.003463 |


In [338]:
termes= set(features_names)

liste_filtre = {term:0 for term in termes}

In [339]:
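# Keep, for each term, its highest TF-IDF score across all documents.
# Terms that never occur in the corpus have no score and get NaN.
max_par_terme = terms_weighted.groupby('Terme')['TF-IDF'].max()
liste_filtre = {term: max_par_terme.get(term, float('nan')) for term in liste_filtre}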

In [340]:
termes_tries = pd.DataFrame(liste_filtre.items(), columns=['Terme', 'TF-IDF'])
termes_tries.sort_values(["TF-IDF"], 
                    axis=0,
                    ascending=[False], 
                    inplace=True)

termes_tries.to_csv(path + titre + '_weighting_TF-IDF.csv')