## **4. Keywords Clustering** 
We will compare different models implemeting each of these parameters:
- K-Means vs Expectation maximization VS Agglomerative algorithm
- One-hot vs Sentence transformers embedding
- [Distance measure (Euclidean VS Cosine)][?]
- Number extracted features (25%, 50%, 75%, 100% of total number of features)
- Number of clusters (50, 100, 150, 200)

*En grande partie basé sur le tutoriel suivant* :   
https://colab.research.google.com/drive/1HHNFjKlip1AaFIuvvn0AicWyv6egLOZw?usp=sharing#scrollTo=zhP1daroRzRV    
(Une approche à base de Word embedding - on pourrait utiliser les scores TF-IDF ou OKapi pour les traits discriminants plutôt que la fréquence (voir plus bas))

In [20]:
from pandas import *
from sklearn.cluster import KMeans
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn import metrics
import matplotlib.pyplot as plt
from collections import Counter

In [21]:
algorithmes = ['K-means', 'Expectation-Maximization', 'AgglomerativeClustering']
embeddings = ['One-Hot', 'Sentence transformers']
features = [25, 50, 100, 150, 200]

results = []
for algorithme in algorithmes:
    for embedding in embeddings:
            results.append(\
            {'algorithme' : algorithme,\
                'embedding': embedding, \
                'N features': None, \
                'Score Silhouette': None})


# On va remplir ce dictionnaire avec les bons scores au fur et à mesure qu'on expérimente
results = DataFrame(results)
results

Unnamed: 0,algorithme,embedding,N features,K (nb clusters),Score Silhouette
0,K-means,One-Hot,,,
1,K-means,Sentence transformers,,,
2,Expectation-Maximization,One-Hot,,,
3,Expectation-Maximization,Sentence transformers,,,
4,AgglomerativeClustering,One-Hot,,,
5,AgglomerativeClustering,Sentence transformers,,,


In [22]:
def add_results(algo, embed, dist, n_f, k, silhouette, results=results):
    results.loc[ \
        (results['algorithme'] == algo) & \
        (results['embedding'] == embed), 'N features'] = n_f

    results.loc[ \
        (results['algorithme'] == algo) & \
        (results['embedding'] == embed), 'distance'] = dist

    results.loc[ \
        (results['algorithme'] == algo) & \
        (results['embedding'] == embed), 'K (nb clusters)'] = k

    results.loc[ \
        (results['algorithme'] == algo) & \
        (results['embedding'] == embed), 'Score Silhouette'] = silhouette

    results=results[['algorithme', 'embedding', 'N features', 'distance', 'K (nb clusters)', 'Score Silhouette']]

    return results

**Importer la liste de termes candidats avec leur fréquence**

In [23]:
import glob
import pandas as pd

# get data file names
path ='../05-transformation'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True).drop(columns=["Unnamed: 0", 'Structure syntaxique', 'LLR', 'TF (sklearn)', 'DF (sklearn)', 'TF-IDF', 'OkapiBM25', 'Terme formatté'])

In [24]:
big_frame

Unnamed: 0,Corpus,Terme,Fréquence (TF),Fréquence documentaire (DF),isMeSHTerm,isTaxoTerm
0,ciusss_centresud,clinique de cognition,24,6,False,False
1,ciusss_centresud,problèmes liés,42,30,False,False
2,ciusss_centresud,usagers du sud-ouest-verdun,24,24,False,False
3,ciusss_centresud,accès aux services secteurs des faubourgs,58,58,False,False
4,ciusss_centresud,gynécologie,20,14,False,True
...,...,...,...,...,...,...
12646,pinel,statut temps complet permanent salaire,19,19,False,False
12647,pinel,acquisition,75,70,False,False
12648,pinel,philippe-pinel,233,94,False,False
12649,pinel,statut,138,66,False,False


On va additionner les fréquences brutes (TF) et fréquences documentaires (DF) de chaque terme dans les différents corpus ; ensuite on va travailler avec ces valeurs uniquement.

In [25]:
big_frame['Fréquence totale (TF)'] = big_frame.groupby(['Terme'])['Fréquence (TF)'].transform('sum')
big_frame['Fréquence documentaire totale (DF)'] = big_frame.groupby(['Terme'])['Fréquence documentaire (DF)'].transform('sum')

big_frame

Unnamed: 0,Corpus,Terme,Fréquence (TF),Fréquence documentaire (DF),isMeSHTerm,isTaxoTerm,Fréquence totale (TF),Fréquence documentaire totale (DF)
0,ciusss_centresud,clinique de cognition,24,6,False,False,24,6
1,ciusss_centresud,problèmes liés,42,30,False,False,42,30
2,ciusss_centresud,usagers du sud-ouest-verdun,24,24,False,False,24,24
3,ciusss_centresud,accès aux services secteurs des faubourgs,58,58,False,False,58,58
4,ciusss_centresud,gynécologie,20,14,False,True,20,14
...,...,...,...,...,...,...,...,...
12646,pinel,statut temps complet permanent salaire,19,19,False,False,19,19
12647,pinel,acquisition,75,70,False,False,133,124
12648,pinel,philippe-pinel,233,94,False,False,233,94
12649,pinel,statut,138,66,False,False,450,254


In [26]:
big_frame = big_frame.drop(columns = ['Corpus', 'Fréquence (TF)', 'Fréquence documentaire (DF)'])

big_frame = big_frame.drop_duplicates(subset=['Terme'])
big_frame

Unnamed: 0,Terme,isMeSHTerm,isTaxoTerm,Fréquence totale (TF),Fréquence documentaire totale (DF)
0,clinique de cognition,False,False,24,6
1,problèmes liés,False,False,42,30
2,usagers du sud-ouest-verdun,False,False,24,24
3,accès aux services secteurs des faubourgs,False,False,58,58
4,gynécologie,False,True,20,14
...,...,...,...,...,...
12641,vie professionnelle favorisant la formation,False,False,68,68
12643,jours de congé,False,False,69,61
12645,jour d'utilisation,False,False,61,61
12646,statut temps complet permanent salaire,False,False,19,19


In [27]:
big_frame['TF + DF'] = big_frame['Fréquence totale (TF)'] + big_frame['Fréquence documentaire totale (DF)']
big_frame['Terme'] = big_frame['Terme'].astype(str)

*Embedding : One-hot encoding*  
> One Hot encoding is a representation of categorical variables as binary vectors. Each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.

*Créer une fonction avec N en paramètres = nombre de features retenus souhaités*

In [28]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("(\w+\'|\w+-\w+|\(|\)|\w+)")

file_path = "../04-filtrage/stopwords.txt"
with open(file_path, 'r', encoding="utf-8") as f:
    stopwords = [t.lower().strip('\n') for t in f.readlines()]

def to_tokens(kw, min_chars=2):
    tokens = tokenizer.tokenize(str(kw)) # split the string into a list of words
    tokens = [word for word in tokens if len(word) > min_chars] 
    tokens = [str(word) for word in tokens if word not in stopwords] 
    
    tokens = set(tokens) # to remove duplicates
    tokens = sorted(tokens) # converts our set back to a list and sorts words in alphabetical order
    return tokens

In [29]:
keywords_oh = big_frame

In [30]:
keywords_oh["tokens"] = keywords_oh["Terme"].apply(lambda x: to_tokens(
    x,
    min_chars=2,
)).astype(str)

# Test - Seulement retenir des n-grammes où n est au-dessus de 2
keywords_oh["len"] = keywords_oh["tokens"].apply(lambda x : len(x))
keywords_oh = keywords_oh[keywords_oh['len'] > 1].drop(columns=["len"])

In [31]:
keywords_oh

Unnamed: 0,Terme,isMeSHTerm,isTaxoTerm,Fréquence totale (TF),Fréquence documentaire totale (DF),TF + DF,tokens
0,clinique de cognition,False,False,24,6,30,"['clinique', 'cognition']"
1,problèmes liés,False,False,42,30,72,"['liés', 'problèmes']"
2,usagers du sud-ouest-verdun,False,False,24,24,48,"['sud-ouest', 'usagers', 'verdun']"
3,accès aux services secteurs des faubourgs,False,False,58,58,116,"['accès', 'faubourgs', 'secteurs', 'services']"
4,gynécologie,False,True,20,14,34,['gynécologie']
...,...,...,...,...,...,...,...
12641,vie professionnelle favorisant la formation,False,False,68,68,136,"['favorisant', 'formation', 'professionnelle',..."
12643,jours de congé,False,False,69,61,130,"['congé', 'jours']"
12645,jour d'utilisation,False,False,61,61,122,"['jour', 'utilisation']"
12646,statut temps complet permanent salaire,False,False,19,19,38,"['complet', 'permanent', 'salaire', 'statut', ..."


In [32]:
# Cette fontion prend en paramètres :
# - un DataFrame qui contient un champ où sont consignés des vecteurs contenant les tokens de chaque mot-clé
# - le nombre de features maximal qui doit être retenu pour constituer le plongement lexical (embedding)

# Elle retourne le dataframe où une colonne 'vector' a été ajoutée avec le bon nombre de features

def features_embeddings(df, n_features):
    vocab = sorted(set(df["tokens"].explode()))
    len(vocab)

    counter = Counter(df["tokens"].explode().to_list())
    vocab = []

    # Ici, ça pourrait être intéressant de retenir sur la base du score TF-IDF ou OKapi
    for key,value in counter.most_common(n_features):
        vocab.append(key)

    return vocab

vocab = features_embeddings(keywords_oh, 50)

In [33]:
def to_vector(keyword,vocab):
    vector = []
    for word in vocab:
        if word in keyword:
            vector.append(1)
        else:
            vector.append(0)
    return vector
    
keywords_oh["vector"] = keywords_oh["tokens"].apply(lambda x: to_vector(x,vocab))
keywords_oh

Unnamed: 0,Terme,isMeSHTerm,isTaxoTerm,Fréquence totale (TF),Fréquence documentaire totale (DF),TF + DF,tokens,vector
0,clinique de cognition,False,False,24,6,30,"['clinique', 'cognition']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,problèmes liés,False,False,42,30,72,"['liés', 'problèmes']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,usagers du sud-ouest-verdun,False,False,24,24,48,"['sud-ouest', 'usagers', 'verdun']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,accès aux services secteurs des faubourgs,False,False,58,58,116,"['accès', 'faubourgs', 'secteurs', 'services']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,gynécologie,False,True,20,14,34,['gynécologie'],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...,...,...,...,...,...
12641,vie professionnelle favorisant la formation,False,False,68,68,136,"['favorisant', 'formation', 'professionnelle',...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
12643,jours de congé,False,False,69,61,130,"['congé', 'jours']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
12645,jour d'utilisation,False,False,61,61,122,"['jour', 'utilisation']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
12646,statut temps complet permanent salaire,False,False,19,19,38,"['complet', 'permanent', 'salaire', 'statut', ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


*Embedding : Sentence transformers*  

> "A **transformer** is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part  of the input data.
Transformers are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional  training parallelization allows training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks."   
  
(https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))


In [34]:
keywords_st = big_frame[['Terme', 'isMeSHTerm',	'isTaxoTerm',	'Fréquence totale (TF)',	'Fréquence documentaire totale (DF)',	'TF + DF']]

In [35]:
# On va utiliser un modèle BERT/sentence transformers (fr) pour extraire nos embeddings plutôt que des simples one-hot encoding
from sentence_transformers import SentenceTransformer
model =  SentenceTransformer("dangvantuan/sentence-camembert-base")



In [37]:
sentences = keywords_st['Terme'].tolist()
embeddings_st = model.encode(sentences)

keywords_st

KeyboardInterrupt: 

#### **K-means clustering** (*sklearn*)


*Un premier essai sur nos one-hot embeddings*

In [39]:
def results_kmeans(df, embedding, embedding_str):
    K = range(20, round(len(vocab)))
    silhouette_scores = []

    # On commence par tester différentes valeurs de k pour trouver celle pour laquelle le score Silhouette est
    # le plus élevé
    for k in K:
        X = embedding
        km = KMeans(n_clusters=k, init='k-means++', algorithm='elkan', max_iter=200, n_init=1).fit(X)

        # Run LSA
        # Since LSA/SVD results are not normalized,
        # we redo the normalization to improve the k-means result.
        svd = TruncatedSVD(n_components= round(len(vocab)/2))
        normalizer = Normalizer(copy=False)
        lsa = make_pipeline(svd, normalizer)

        X = lsa.fit_transform(X)
        #### FIN LSA
        km.fit(X)

        labels = km.labels_
        silhouette_scores.append([k, metrics.silhouette_score(X, labels)])

    df = DataFrame(silhouette_scores, columns=['Nombre de clusters (k)', 'Score Silhouette'])
    true_k = int(df[df['Score Silhouette'] == df['Score Silhouette'].max()]['Nombre de clusters (k)'])
    resultats = df[df['Score Silhouette'] == df['Score Silhouette'].max()]

    print("Score Silhouette")
    print("On va regrouper nos termes en " + str(true_k) + " clusters.")

    # On stocke le score dans notre tableau de résultats
    algorithme = 'K-means'
    distance = 'Euclidean'
    features = len(vocab)

    ### On reroule ensuite avec la valeur retenue et on va stocker les résultats dans un CSV
    X = embedding
    km = KMeans(n_clusters=true_k, init='k-means++', algorithm='elkan', max_iter=200, n_init=1).fit(X)

    # Run LSA
    # Since LSA/SVD results are not normalized,
    # we redo the normalization to improve the k-means result.
    svd = TruncatedSVD(n_components=round(len(vocab)/2))
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)
    km.fit(X)

    labels = km.labels_
    df["kmeans"] = km.labels_


    # Pour mieux interpréter, on assigne un label significatif à nos clusters
    # On retient le terme pour chaque cluster dont la valeur TF + DF est la plus élevée
    current_labels = set(km.labels_.tolist())

    desired_labels = {x : None for x in current_labels} # (on initialise à None)

    for label in current_labels:
        cluster = df[df["kmeans"] == label]
        max_freq = cluster['TF + DF'].max()
        new_label = cluster[cluster['TF + DF'] == max_freq]['Terme'].values[0]

        desired_labels[label] = new_label

    df['Cluster_kmeans_euclidean'] = df['kmeans'].map(desired_labels)

    df.sort_values(["Cluster_kmeans_euclidean"], 
            axis=0,
            ascending=[False], 
            inplace=True)

    df = df[['Cluster_kmeans_euclidean', 'Terme', 'Fréquence totale (TF)', 'Fréquence documentaire totale (DF)', 'TF + DF']]
    df = df.sort_values(['Cluster_kmeans_euclidean', 'Fréquence totale (TF)', 'Fréquence documentaire totale (DF)'],
                ascending = [True, False, False])


    # On stocke les résultats dans un CSV
    base_path = '../06-clustering/'
    file_path = base_path + algorithme + '_' + embedding_str + '_' + distance + '_' + str(features) + '.csv'
    df.to_csv(file_path)

    return resultats

In [40]:
embeddings_oh = keywords_oh["vector"].to_list()
resultats_km = results_kmeans(keywords_oh, embeddings_oh, 'One-Hot')
resultats_km

Score Silhouette
On va regrouper nos termes en 49 clusters.


ValueError: Length of values (11837) does not match length of index (30)

In [None]:
# On stocke le score dans notre tableau de résultats
algorithme = 'K-means'
embedding = 'One-Hot'
distance = 'Euclidean'
features = len(vocab)
true_k = resultats_km['Nombre de clusters (k)'].values[0]
score = resultats_km['Score Silhouette'].values[0]

add_results(algorithme, embedding, distance, features, true_k, score)

In [None]:
resultats_km = results_kmeans(keywords_st, embeddings_st, 'Sentence transformers')
resultats_km

In [None]:
# On stocke le score dans notre tableau de résultats
algorithme = 'K-means'
embedding = 'Sentence transformers'
distance = 'Euclidean'
features = len(vocab)
true_k = resultats_km['Nombre de clusters (k)'].values[0]
score = resultats_km['Score Silhouette'].values[0]

add_results(algorithme, embedding, distance, features, true_k, score)

*Reprendre ici si ça fonctionne*

#### **K-means clustering** (*sklearn*)


*Un deuxième essai, cette fois sur sur nos transformers embeddings*

In [None]:
K = range(20,len(vocab))
#Sum_of_squared_distances = []
silhouette_scores = []

for k in K:
    #true_k = int(input())
    X = embeddings
    kmeans = KMeans(n_clusters=k, init='k-means++', algorithm='elkan', max_iter=200, n_init=1).fit(X)
    #Sum_of_squared_distances.append(kmeans.inertia_)

    # Run LSA
    # Since LSA/SVD results are not normalized,
    # we redo the normalization to improve the k-means result.
    svd = TruncatedSVD(n_components=round(k/4))
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)
    kmeans.fit(X)

    labels = kmeans.labels_
    #keywords["Cluster"] = list(kmeans.labels_)

    #original_space_centroids = svd.inverse_transform(X)
    #order_centroids = original_space_centroids.argsort()[:, ::-1]
    silhouette_scores.append([k, metrics.silhouette_score(X, labels)])

df = DataFrame(silhouette_scores, columns=['Nombre de clusters (k)', 'Score Silhouette'])
true_k = int(df[df['Score Silhouette'] == df['Score Silhouette'].max()]['Nombre de clusters (k)'])

# print("Méthode Elbow")
# plt.plot(K, Sum_of_squared_distances, 'bx-')
# plt.xlabel('k')
# plt.ylabel('Sum_of_squared_distances')
# plt.title('Elbow Method For Optimal k')
# plt.show()

print("Score Silhouette")
print("On va regrouper nos termes en " + str(true_k) + " clusters.")
df

In [None]:
algorithme = 'K-means'
embedding = 'Sentence transformers'
distance = 'Euclidean'
features = len(vocab)

add_results(algorithme, embedding, distance, features)

In [None]:
X = embeddings
kmeans = KMeans(n_clusters=true_k, init='k-means++', algorithm='elkan', max_iter=200, n_init=1).fit(X)

# Run LSA
# Since LSA/SVD results are not normalized,
# we redo the normalization to improve the k-means result.
svd = TruncatedSVD(n_components=true_k)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X = lsa.fit_transform(X)
kmeans.fit(X)

labels = kmeans.labels_
keywords_st["kmeans"] = list(kmeans.labels_)

keywords_st


In [None]:
current_labels = set(kmeans.labels_.tolist())

desired_labels = {x : None for x in current_labels} # (on initialise à None)

for label in current_labels:
    cluster = keywords_st[keywords_st["kmeans"] == label]
    max_freq = cluster['TF + DF'].max()
    new_label = cluster[cluster['TF + DF'] == max_freq]['Terme'].values[0]

    desired_labels[label] = new_label

keywords_st['Cluster_kmeans_euclidean'] = keywords_st['kmeans'].map(desired_labels)

In [None]:
#keywords = keywords[['Terme', 'Fréquence (TF)', 'Fréquence documentaire (DF)', 'Cluster']]
keywords_st.sort_values(["Cluster_kmeans_euclidean"], 
        axis=0,
        ascending=[False], 
        inplace=True)

In [None]:
#keywords = keywords.drop(columns=['TF + DF', 'tokens', 'vector', 'kmeans'])

keywords_st = keywords_st[['Corpus', 'Cluster_kmeans_euclidean', 'kmeans', 'Terme', 'Fréquence (TF)', 'Fréquence documentaire (DF)', 'TF + DF']]
keywords_st = keywords_st.sort_values(['Cluster_kmeans_euclidean', 'Fréquence (TF)', 'Fréquence documentaire (DF)'],
              ascending = [True, False, False])

keywords_st

In [None]:
base_path = '../06-clustering/'
file_path = base_path + '_KMeans_transformers_euclidean.csv'
keywords_st.to_csv(file_path)

In [None]:
keywords_st.groupby("Cluster_kmeans_euclidean")["Terme"].count()

keywords_st

### **K-means clustering** (*NLTK*)

Le but d'utiliser NLTK est de pouvoir prendre la distance cosinus entre les vecteurs plutôt que la distance Euclidienne, pour les embeddings basés sur le sentence transformer.

In [None]:
from nltk import cluster
from nltk.cluster import KMeansClusterer
from nltk.cluster import cosine_distance
import nltk
import numpy as np
from numpy import array, ndarray
  
from sklearn import metrics

In [None]:
# initialise the clusterer (will also assign the vectors to clusters)

K = range(20,len(vocab))
silhouette_scores = []
for k in K:
    X = embeddings
    clusterer = cluster.KMeansClusterer(true_k, distance=cosine_distance, avoid_empty_clusters=True, repeats=25)
    # Run LSA
    # Since LSA/SVD results are not normalized,
    # we redo the normalization to improve the result.
    svd = TruncatedSVD(n_components=round(k/4))
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)
    X = lsa.fit_transform(X)
    labels = clusterer.cluster(X, assign_clusters= True)
    silhouette_scores.append([k, metrics.silhouette_score(X, labels)])

df = DataFrame(silhouette_scores, columns=['Nombre de clusters (k)', 'Score Silhouette'])
true_k = int(df[df['Score Silhouette'] == df['Score Silhouette'].max()]['Nombre de clusters (k)'])

df[df['Nombre de clusters (k)'] == true_k]
df

In [None]:
algorithme = 'K-means'
embedding = 'Sentence transformers'
distance = 'cosine'
features = len(vocab)

add_results(algorithme, embedding, distance, features)

results

In [None]:
keywords_oh

In [None]:
keywords_st["kmeans_cosine"] = list(labels)

In [None]:
current_labels = set(labels)

desired_labels = {x : None for x in current_labels} # (on initialise à None)

for label in current_labels:
    cluster = keywords_st[keywords_st["kmeans_cosine"] == label]
    max_freq = cluster['TF + DF'].max()
    new_label = cluster[cluster['TF + DF'] == max_freq]['Terme'].values[0]

    desired_labels[label] = new_label

keywords_st['Cluster_kmeans_cosine'] = keywords_st['kmeans_cosine'].map(desired_labels)

keywords_st

### **EM clustering** (*sklearn*)

> L'algorithme espérance-maximisation (en anglais expectation-maximization algorithm, souvent abrégé EM), est un algorithme itératif qui permet de trouver les paramètres du maximum de vraisemblance d'un modèle probabiliste lorsque ce dernier dépend de variables latentes non observables. 

(https://fr.wikipedia.org/wiki/Algorithme_esp%C3%A9rance-maximisation)

**sklearn.mixture GaussianMixture**  
https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/  
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

*Encore une fois, on fait un premier essai sur nos one-hot embeddings*

In [None]:
from sklearn.mixture import GaussianMixture

K = range(20,len(vocab))
silhouette_scores = []
for k in K:
    X = keywords_oh["vector"].to_list()
    gmm = GaussianMixture(n_components=k, init_params='k-means++').fit(X)
    # Run LSA
    # Since LSA/SVD results are not normalized,
    # we redo the normalization to improve the result.
    svd = TruncatedSVD(n_components=k)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)
    X = lsa.fit_transform(X)
    gmm.fit(X)
    labels = gmm.predict(X)
    silhouette_scores.append([k, metrics.silhouette_score(X, labels)])

df = DataFrame(silhouette_scores, columns=['Nombre de clusters (k)', 'Score Silhouette'])
true_k = int(df[df['Score Silhouette'] == df['Score Silhouette'].max()]['Nombre de clusters (k)'])

df[df['Nombre de clusters (k)'] == true_k]

In [None]:
algorithme = 'Expectation-Maximization'
embedding = 'One-Hot'
distance = None
features = len(vocab)

add_results(algorithme, embedding, distance, features)

results

In [None]:
keywords_oh["E-M"] = list(labels)
current_labels = set(labels)

desired_labels = {x : None for x in current_labels} # (on initialise à None)

for label in current_labels:
    cluster = keywords_oh[keywords_oh["E-M"] == label]
    max_freq = cluster['TF + DF'].max()
    new_label = cluster[cluster['TF + DF'] == max_freq]['Terme'].values[0]

    desired_labels[label] = new_label

keywords_oh['Cluster_E-M'] = keywords_oh['E-M'].map(desired_labels)

keywords_oh

In [None]:
vectors = keywords_oh["vector"].tolist()

pca = PCA(n_components=2).fit(vectors)
pca_2d = pca.transform(vectors)

plt.scatter(pca_2d[:,0], pca_2d[:,1], c=keywords_oh["GMM"], s=keywords_oh["TF + DF"])

In [None]:
current_labels = set(labels)

desired_labels = {x : None for x in current_labels} # (on initialise à None)

for label in current_labels:
    cluster = keywords_oh[keywords_oh["GMM"] == label]
    max_freq = cluster['TF + DF'].max()
    new_label = cluster[cluster['TF + DF'] == max_freq]['Terme'].values[0]

    desired_labels[label] = new_label

keywords_oh['Cluster_GMM'] = keywords_oh['GMM'].map(desired_labels)

In [None]:
#keywords_oh = keywords_oh[['Cluster_kmeans_euclidean', 'Cluster_GMM', 'Terme', 'Fréquence (TF)', 'Fréquence documentaire (DF)', 'TF + DF']]
keywords_oh = keywords_oh[['Corpus', 'Cluster_kmeans_euclidean', 'Cluster_GMM', 'Terme', 'Fréquence (TF)', 'Fréquence documentaire (DF)', 'TF + DF', 'isMeSHTerm', 'isTaxoTerm', 'vector', 'kmeans']]
keywords_oh

*Deuxième essai, cette fois sur les sentence embeddings / transformers*

In [None]:
K = range(20,len(vocab))
silhouette_scores = []
for k in K:
    X = embeddings
    gmm = GaussianMixture(n_components=k, init_params='k-means++').fit(X)
    # Run LSA
    # Since LSA/SVD results are not normalized,
    # we redo the normalization to improve the result.
    svd = TruncatedSVD(n_components=round(k/4))
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)
    X = lsa.fit_transform(X)
    gmm.fit(X)
    labels = gmm.predict(X)
    silhouette_scores.append([k, metrics.silhouette_score(X, labels)])

df = DataFrame(silhouette_scores, columns=['Nombre de clusters (k)', 'Score Silhouette'])
true_k = int(df[df['Score Silhouette'] == df['Score Silhouette'].max()]['Nombre de clusters (k)'])

df[df['Nombre de clusters (k)'] == true_k]

In [None]:
algorithme = 'E-M'
embedding = 'Sentence transformers'
distance = None
features = len(vocab)

add_results(algorithme, embedding, distance, features)
results

[...] to do : ajouter le EM clustering / sklearn avec le sentence embeddings

En fait je ne suis pas convaincue, parce que dans tous les cas ça ne donne jamais d'aussi bons résultats avec les sentence embeddings quant au scoresilhouette ; peut-être plus tard, mais peut-être pas une priorité.

### **EM clustering** (*NLTK*)

*Premier essai, sur les one-hot embeddings / bag-of-words*

Pas prioritaire non plus puisqu'on a déjà le EM Clustering fonctionnel avec celui de sk-learn

In [None]:
# from nltk import cluster
# from nltk.cluster import KMeansClusterer, euclidean_distance

# #On initialise sur les kmeans
# vectors = [array(f) for f in keywords_oh['vector'].tolist()]

# clusterer = KMeansClusterer(true_k, euclidean_distance, initial_means=None, repeats=10)
# means = clusterer.cluster(vectors, True, trace=True)

# ##########
# clusterer = cluster.EMClusterer(means, bias=0.1)
# clusters = clusterer.cluster(vectors, True, trace=True)



*Deuxième essai, cette fois sur les sentence embeddings / transformers*  
(même chose)

In [None]:
###########

### **Agglomerative clustering** (*NLTK / sklearn*)
> In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:  
> - Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
> - Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.    
  
(https://en.wikipedia.org/wiki/Hierarchical_clustering)

**sklearn AgglomerativeClustering / one-hot embeddings**  
(le clusterer agglomératif de sklearn, qui permet d'utiliser la distance Euclidienne, mais pas celui d'NLTK)*

In [None]:
from sklearn.cluster import AgglomerativeClustering

K = range(20,50)
Sum_of_squared_distances = []
silhouette_scores = []

for k in K:
    X = keywords_oh['vector'].tolist()
    clustering = AgglomerativeClustering().fit(X)

    #Sum_of_squared_distances.append(km.inertia_)

    # Run LSA
    # Since LSA/SVD results are not normalized,
    # we redo the normalization to improve the k-means result.
    svd = TruncatedSVD(n_components=round(k/4))
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)

    clusters = clustering.labels_
    #keywords["Cluster"] = list(kmeans.labels_)

    #original_space_centroids = svd.inverse_transform(X)
    #order_centroids = original_space_centroids.argsort()[:, ::-1]
    silhouette_scores.append([k, metrics.silhouette_score(X, clusters)])

df = DataFrame(silhouette_scores, columns=['Nombre de clusters (k)', 'Score Silhouette'])
true_k = int(df[df['Score Silhouette'] == df['Score Silhouette'].max()]['Nombre de clusters (k)'])

# print("Méthode Elbow")
# plt.plot(K, Sum_of_squared_distances, 'bx-')
# plt.xlabel('k')
# plt.ylabel('Sum_of_squared_distances')
# plt.title('Elbow Method For Optimal k')
# plt.show()

print("Score Silhouette")
print("On va regrouper nos termes en " + str(true_k) + " clusters.")
df

In [None]:
algorithme = 'AgglomerativeClustering'
embedding = 'One-Hot'
distance = 'Euclidean'
features = len(vocab)

add_results(algorithme, embedding, distance, features)

results

**NLTK Group Average Agglomerative Clustering (GAAC) / Sentence transformers embeddings**

In [None]:
from nltk.cluster import GAAClusterer

K = range(20,50)
Sum_of_squared_distances = []
silhouette_scores = []

for k in K:
    X = embeddings
    clusterer = GAAClusterer(k)

    #Sum_of_squared_distances.append(km.inertia_)

    # Run LSA
    # Since LSA/SVD results are not normalized,
    # we redo the normalization to improve the k-means result.
    svd = TruncatedSVD(n_components=round(k/4))
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)

    clusters = clusterer.cluster(X, True)
    #keywords["Cluster"] = list(kmeans.labels_)

    #original_space_centroids = svd.inverse_transform(X)
    #order_centroids = original_space_centroids.argsort()[:, ::-1]
    silhouette_scores.append([k, metrics.silhouette_score(X, clusters)])

df = DataFrame(silhouette_scores, columns=['Nombre de clusters (k)', 'Score Silhouette'])
true_k = int(df[df['Score Silhouette'] == df['Score Silhouette'].max()]['Nombre de clusters (k)'])

# print("Méthode Elbow")
# plt.plot(K, Sum_of_squared_distances, 'bx-')
# plt.xlabel('k')
# plt.ylabel('Sum_of_squared_distances')
# plt.title('Elbow Method For Optimal k')
# plt.show()

print("Score Silhouette")
print("On va regrouper nos termes en " + str(true_k) + " clusters.")
df

In [None]:
algorithme = 'AgglomerativeClustering'
embedding = 'Sentence transformers'
distance = 'Cosine'
features = len(vocab)

add_results(algorithme, embedding, distance, features)

results

In [None]:
from nltk.cluster import GAAClusterer

# use a set of tokens with 2D indices
vectors = embeddings

# test the GAAC clusterer with 4 clusters
clusterer = GAAClusterer(50)
clusters = clusterer.cluster(vectors, True)

keywords_st["GAAC"] = clusters

current_labels = set(clusters)

desired_labels = {x : None for x in current_labels} # (on initialise à None)

for label in current_labels:
    cluster = keywords_oh[keywords_st["GAAC"] == label]
    max_freq = cluster['TF + DF'].max()
    new_label = cluster[cluster['TF + DF'] == max_freq]['Terme'].values[0]

    desired_labels[label] = new_label

keywords_st['Cluster_GAAC'] = keywords_st['GAAC'].map(desired_labels)


In [None]:
keywords_st

### **Résultats**

In [None]:
results