## **4. Keywords Clustering**
BERT Sentence transformers embeddings | Community Detection Algorithm

We will compare models implementing different combinations of these parameters:
- **Accuracy** : ranging from 65% to 100%
- **Minimum size of clusters** : 2 to 1000

We will choose the set of parameters that maximizes accuracy while minimizing the resulting number of orphan terms (terms not assigned to any cluster).
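
For intuition: in the clustering cells of this notebook, the accuracy percentage is divided by 100 and passed as the `threshold` of the community detection step, so 75% means two terms need a cosine similarity of at least 0.75 to be grouped. The sketch below uses made-up 3-dimensional vectors instead of real sentence embeddings; `cosine_similarity` and `toy_embeddings` are illustrative names that do not appear in the notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for sentence embeddings (assumption, not model output)
toy_embeddings = {
    "santé mentale": [0.9, 0.1, 0.0],
    "santé publique": [0.8, 0.3, 0.1],
    "prise de sang": [0.0, 0.2, 0.9],
}

cluster_accuracy = 75               # percentage, as in the notebook
threshold = cluster_accuracy / 100  # value compared against the cosine similarity

sim_close = cosine_similarity(toy_embeddings["santé mentale"], toy_embeddings["santé publique"])
sim_far = cosine_similarity(toy_embeddings["santé mentale"], toy_embeddings["prise de sang"])

print(f"close pair: {sim_close:.3f}, grouped: {sim_close >= threshold}")
print(f"far pair:   {sim_far:.3f}, grouped: {sim_far >= threshold}")
```

Raising the threshold toward 1.0 keeps only near-identical terms together, which is why tighter clusters tend to come with more orphan terms.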

Online tutorial followed :  
https://colab.research.google.com/github/searchsolved/search-solved-public-seo/blob/main/search_engine_journal/SEJ_Semantic_Clustering_Tool_by_LeeFootSEO.ipynb

____________________________________________________________________________________________________________________

### **Embedding** : BERT Sentence Transformers

> "A **transformer** is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part  of the input data.
Transformers are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional  training parallelization allows training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks."   
  
(https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))


In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
model =  SentenceTransformer("dangvantuan/sentence-camembert-base")



In [2]:
cluster_accuracy = 65  # 0-100 (100 = very tight clusters, but higher percentage of no_cluster groups)
min_cluster_size = 2  # set the minimum size of cluster groups. (Lower number = tighter groups)


accuracy = [65, 70, 75, 80, 85, 90, 95, 100]
size = [2, 5, 10, 25, 50, 100, 500, 1000]
 
results= []
for a in accuracy:
    for s in size:
        results.append({'Accuracy' : a, 'Minimum cluster size': s, 'Number of orphans': None})


# We want to select the parameters that maximize accuracy and minimize the number of orphans
results = pd.DataFrame(results)
results

Unnamed: 0,Accuracy,Minimum cluster size,Number of orphans
0,65,2,
1,65,5,
2,65,10,
3,65,25,
4,65,50,
...,...,...,...
59,100,25,
60,100,50,
61,100,100,
62,100,500,
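
The same 8 × 8 grid can also be generated without nested loops. This is a minimal standard-library sketch; `grid` is an illustrative name, and the resulting list of dicts could be passed to `pd.DataFrame` exactly as `results` is above.

```python
from itertools import product

accuracy = [65, 70, 75, 80, 85, 90, 95, 100]
size = [2, 5, 10, 25, 50, 100, 500, 1000]

# One dict per (accuracy, size) pair, in the same order as the nested loops
grid = [
    {"Accuracy": a, "Minimum cluster size": s, "Number of orphans": None}
    for a, s in product(accuracy, size)
]

print(len(grid))
print(grid[0])
print(grid[-1])
```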


In [3]:
file_path = '../06-clustering/candidate_terms.csv'
with open(file_path, encoding='utf-8') as f:
    df = pd.read_csv(f).drop(columns=["Unnamed: 0"])
    df['Terme'] = df['Terme'].astype('str')
    df['TF + DF'] = df['TF'] + df['DF']

# store the data
cluster_name_list = []
corpus_sentences_list = []
df_all = []

corpus_set = set(df['Terme'])
corpus_set_all = corpus_set
cluster = True

df

Unnamed: 0,Corpora,Terme,Structure syntaxique,Forme lemmatisée,isMeSHTerm,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF
0,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",services sociaux,NOM ADJ,service social,True,D012947,Services sociaux et travail social (activité),Social Work,True,1674.908057,40189,15418,1.000000,26.361483,55607
1,"['chum', 'chuqc', 'cisss_ca', 'cisss_cotenord'...",santé publique,NOM ADJ,santé public,True,D011634,Santé publique,Public Health,True,1572.987576,32510,11194,1.000000,18.947189,43704
2,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",santé mentale,NOM ADJ,santé mental,True,D008603,Santé mentale,Mental Health,True,1579.080827,13229,4795,1.000000,24.142062,18024
3,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé,NOM PRP DET:ART NOM,ministère de le santé,False,,,,False,1411.378559,10741,7142,0.553007,21.734060,17883
4,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé et des services,NOM PRP DET:ART NOM KON PRP:det NOM,ministère de le santé et des service,False,,,,False,1107.888160,10560,7061,0.553007,31.111185,17621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12609,['santeestrie'],centre de crise,NOM PRP NOM,centre de crise,False,,,,True,84.828851,3,3,0.136894,8.582629,6
12610,['inesss'],carcinome rénal,NOM ADJ,carcinome rénal,True,D002292,Néphrocarcinome,"Carcinoma, Renal Cell",False,61.813956,3,3,0.141561,18.359316,6
12611,['inesss'],durée de vie,NOM PRP NOM,durée de vie,True,D008136,Longévité,Longevity,False,434.083437,3,3,0.184051,21.441105,6
12612,['laval_sante'],commotion cérébrale,NOM ADJ,commotion cérébral,True,D001924,Commotion de l'encéphale,Brain Concussion,True,388.209903,3,3,0.557684,13.725435,6


In [4]:
from nltk.tokenize import RegexpTokenizer
regex = r"[\w+-]+|\([\s+\w+\d+-]+\)|\w+|\w"  # raw string, so the backslashes reach the regex engine unchanged
tokenizex = RegexpTokenizer(regex)

file_path = "../04-filtrage/stopwords.txt"
with open(file_path, 'r', encoding="utf-8") as f:
    stop = [t.lower().strip('\n') for t in f.readlines()]

def to_tokens(kw, min_chars=2):
    tokens = tokenizex.tokenize(str(kw)) # split the string into a list of words
    tokens = [word for word in tokens if len(word) > min_chars] 
    tokens = [str(word) for word in tokens if word not in stop] 
    
    tokens = set(tokens) # to remove duplicates
    tokens = sorted(tokens) # converts our set back to a list and sorts words in alphabetical order
    return tokens

df["tokens"] = df["Terme"].apply(lambda x: to_tokens(
    x,
    min_chars=3,
))
df

Unnamed: 0,Corpora,Terme,Structure syntaxique,Forme lemmatisée,isMeSHTerm,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,tokens
0,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",services sociaux,NOM ADJ,service social,True,D012947,Services sociaux et travail social (activité),Social Work,True,1674.908057,40189,15418,1.000000,26.361483,55607,"[services, sociaux]"
1,"['chum', 'chuqc', 'cisss_ca', 'cisss_cotenord'...",santé publique,NOM ADJ,santé public,True,D011634,Santé publique,Public Health,True,1572.987576,32510,11194,1.000000,18.947189,43704,"[publique, santé]"
2,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",santé mentale,NOM ADJ,santé mental,True,D008603,Santé mentale,Mental Health,True,1579.080827,13229,4795,1.000000,24.142062,18024,"[mentale, santé]"
3,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé,NOM PRP DET:ART NOM,ministère de le santé,False,,,,False,1411.378559,10741,7142,0.553007,21.734060,17883,"[ministère, santé]"
4,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé et des services,NOM PRP DET:ART NOM KON PRP:det NOM,ministère de le santé et des service,False,,,,False,1107.888160,10560,7061,0.553007,31.111185,17621,"[ministère, santé, services]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12609,['santeestrie'],centre de crise,NOM PRP NOM,centre de crise,False,,,,True,84.828851,3,3,0.136894,8.582629,6,"[centre, crise]"
12610,['inesss'],carcinome rénal,NOM ADJ,carcinome rénal,True,D002292,Néphrocarcinome,"Carcinoma, Renal Cell",False,61.813956,3,3,0.141561,18.359316,6,"[carcinome, rénal]"
12611,['inesss'],durée de vie,NOM PRP NOM,durée de vie,True,D008136,Longévité,Longevity,False,434.083437,3,3,0.184051,21.441105,6,[durée]
12612,['laval_sante'],commotion cérébrale,NOM ADJ,commotion cérébral,True,D001924,Commotion de l'encéphale,Brain Concussion,True,388.209903,3,3,0.557684,13.725435,6,"[commotion, cérébrale]"
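
To see why `durée de vie` ends up as `[durée]` in the table above, the tokenization can be reproduced on a few terms with the standard `re` module. This is a sketch under assumptions: `to_tokens_sketch` is an illustrative name, `re.findall` stands in for the NLTK tokenizer, and `STOPWORDS` is a small hypothetical list, not the content of `stopwords.txt`.

```python
import re

# Same pattern as the notebook, written as a raw string
TOKEN_PATTERN = re.compile(r"[\w+-]+|\([\s+\w+\d+-]+\)|\w+|\w")

# Hypothetical stopword list; the notebook loads its own from stopwords.txt
STOPWORDS = {"de", "la", "le", "des", "et"}

def to_tokens_sketch(term, min_chars=3, stopwords=STOPWORDS):
    """Tokenize, drop short tokens and stopwords, deduplicate, sort."""
    tokens = TOKEN_PATTERN.findall(term)
    tokens = [t for t in tokens if len(t) > min_chars and t not in stopwords]
    return sorted(set(tokens))

print(to_tokens_sketch("ministère de la santé et des services"))
print(to_tokens_sketch("durée de vie"))
print(to_tokens_sketch("hébergement (chsld)"))
```

With `min_chars=3`, only tokens of four characters or more survive, so `vie` is dropped by the length filter even though it is not a stopword. A parenthesized acronym such as `(chsld)` is kept as a single token by the second alternative of the pattern.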


In [5]:
# Keep looping over the still-unclustered terms until a pass creates no new cluster.
# Returns the number of orphan terms so it can be stored in the results grid.
def community_clusters(cluster_accuracy, min_cluster_size, corpus_set=corpus_set):
    threshold = cluster_accuracy / 100  # percentage -> similarity threshold between 0 and 1
    all_terms = set(corpus_set)
    unclustered = set(all_terms)
    clustered = set()  # local state, so successive grid-search runs do not contaminate each other
    while unclustered:
        corpus_sentences = list(unclustered)
        check_len = len(corpus_sentences)

        corpus_embeddings = model.encode(corpus_sentences, batch_size=256, show_progress_bar=True, convert_to_tensor=True)
        clusters = util.community_detection(corpus_embeddings, min_community_size=min_cluster_size, threshold=threshold, init_max_size=len(corpus_embeddings))

        # loop variables renamed so they no longer shadow the loop condition
        for cluster_id, members in enumerate(clusters):
            print("\nCluster {}, #{} Elements ".format(cluster_id + 1, len(members)))

            for sentence_id in members:
                print("\t", corpus_sentences[sentence_id])
                clustered.add(corpus_sentences[sentence_id])

        unclustered = all_terms - clustered
        print("Total Unclustered Keywords: ", len(unclustered))
        # stop once a pass leaves the set of unclustered terms unchanged
        if check_len == len(unclustered):
            break
    return len(unclustered)
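
The idea behind threshold-based community detection can be illustrated on toy data. The sketch below is a simplified pure-Python version written for this explanation, not the `sentence_transformers` implementation; `toy_community_detection`, the terms, and the 2-dimensional vectors are all made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_community_detection(vectors, threshold, min_community_size):
    """Greedy sketch: one candidate community per item, largest first, no overlaps."""
    n = len(vectors)
    # Candidate community of item i = every item at least `threshold` similar to it
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Keep the largest candidates first and skip items that are already assigned
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_community_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities

# Made-up terms and vectors (assumption: real embeddings have hundreds of dimensions)
terms = ["santé mentale", "santé publique", "services de santé",
         "prise de sang", "analyse de sang", "îles"]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.95, 0.0], [0.1, 1.0], [0.0, 0.9], [-1.0, -1.0]]

communities = toy_community_detection(vectors, threshold=0.75, min_community_size=2)
clustered_ids = {i for community in communities for i in community}
orphans = [term for i, term in enumerate(terms) if i not in clustered_ids]

for number, community in enumerate(communities, start=1):
    print(f"Cluster {number}:", [terms[i] for i in community])
print("Orphans:", orphans)
```

A term that has no neighbour above the threshold never forms a community of the minimum size, which is how orphan terms arise.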

In [None]:
for a in accuracy:
    for s in size:
        score = community_clusters(a, s)
        results.loc[((results['Accuracy'] == a) & (results['Minimum cluster size'] == s)), \
        'Number of orphans'] = score

In [None]:
# Next, we want to plot accuracy, cluster size and number of orphans in order to pick
# the combination of values that maximizes accuracy and minimizes the number of orphan terms

file_path = '../06-Clustering/results_BERT_communityDetection.csv'
results.to_csv(file_path)

In [None]:
########################################
# Based on the plot, we choose our parameters
########################################
import seaborn as sns
import numpy as np
results['Number of orphans'] = pd.to_numeric(results['Number of orphans'])
results_pivot = results.pivot(index= 'Accuracy', columns='Minimum cluster size', values = 'Number of orphans')

fig = sns.heatmap(results_pivot, annot=True, fmt="d", cmap='binary').get_figure()
file_path = '../06-Clustering/figure_BERT_communityDetection.tiff'
fig.savefig(file_path) 
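
The heatmap is read by eye in this notebook. One possible way to make the trade-off explicit is sketched below: keep only the combinations whose share of orphans stays under a tolerance, then take the highest accuracy among them. Everything here is an assumption made for illustration: the orphan counts, `total_terms`, the 30% tolerance and the `pick_parameters` helper are hypothetical and are not the author's selection rule.

```python
# Hypothetical grid-search rows (made-up orphan counts, for illustration only)
rows = [
    {"Accuracy": 65, "Minimum cluster size": 2, "Number of orphans": 1200},
    {"Accuracy": 75, "Minimum cluster size": 2, "Number of orphans": 2100},
    {"Accuracy": 75, "Minimum cluster size": 5, "Number of orphans": 3000},
    {"Accuracy": 85, "Minimum cluster size": 2, "Number of orphans": 5200},
    {"Accuracy": 95, "Minimum cluster size": 5, "Number of orphans": 9800},
]
total_terms = 12000      # hypothetical corpus size
max_orphan_share = 0.30  # assumed tolerance: at most 30% of terms left unclustered

def pick_parameters(rows, total_terms, max_orphan_share):
    """Highest accuracy within the orphan budget; ties broken by fewest orphans."""
    eligible = [r for r in rows if r["Number of orphans"] / total_terms <= max_orphan_share]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r["Accuracy"], -r["Number of orphans"]))

best = pick_parameters(rows, total_terms, max_orphan_share)
print(best)
```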

____________________________________________________________________
Parameter selection
____________________________________________________________________

In [15]:
############################
# Parameter selection

cluster_accuracy = 75
min_cluster_size = 5

############################

threshold = cluster_accuracy / 100  # percentage -> similarity threshold between 0 and 1
while True:
    corpus_sentences = list(corpus_set)
    check_len = len(corpus_sentences)

    corpus_embeddings = model.encode(corpus_sentences, batch_size=256, show_progress_bar=True, convert_to_tensor=True)
    clusters = util.community_detection(corpus_embeddings, min_community_size=min_cluster_size, threshold=threshold, init_max_size=len(corpus_embeddings))

    # loop variables renamed so they no longer shadow the loop condition
    for cluster_id, members in enumerate(clusters):
        cluster_name = "Cluster {}, #{} Elements ".format(cluster_id + 1, len(members))
        print("\n" + cluster_name)

        for sentence_id in members:
            print("\t", corpus_sentences[sentence_id])
            corpus_sentences_list.append(corpus_sentences[sentence_id])
            cluster_name_list.append(cluster_name)

    df_new = pd.DataFrame(None)
    df_new['Cluster Name'] = cluster_name_list
    df_new["Terme"] = corpus_sentences_list

    df_all.append(df_new)
    have = set(df_new["Terme"])

    corpus_set = corpus_set_all - have
    remaining = len(corpus_set)
    print("Total Unclustered Keywords: ", remaining)
    # stop once a pass leaves the set of unclustered terms unchanged
    if check_len == remaining:
        break

Batches: 100%|██████████| 12/12 [00:45<00:00,  3.78s/it]

Total Unclustered Keywords:  3055





## **4. Keywords Clustering**
One-Hot embeddings | K-Means Expectation maximization

https://colab.research.google.com/drive/1HHNFjKlip1AaFIuvvn0AicWyv6egLOZw?usp=sharing#scrollTo=Ya0TkMAJYvAM

In [7]:
# make a new dataframe from the list of dataframes and merge it back into the original df
df_new = pd.concat(df_all)
df = df.merge(df_new.drop_duplicates('Terme'), how='left', on="Terme")

df['Cluster Name'] = df['Cluster Name'].fillna("zzz_no_cluster")

In [8]:
df

Unnamed: 0,Corpora,Terme,Structure syntaxique,Forme lemmatisée,isMeSHTerm,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,tokens,Cluster Name
0,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",services sociaux,NOM ADJ,service social,True,D012947,Services sociaux et travail social (activité),Social Work,True,1674.908057,40189,15418,1.000000,26.361483,55607,"[services, sociaux]","Cluster 30, #24 Elements"
1,"['chum', 'chuqc', 'cisss_ca', 'cisss_cotenord'...",santé publique,NOM ADJ,santé public,True,D011634,Santé publique,Public Health,True,1572.987576,32510,11194,1.000000,18.947189,43704,"[publique, santé]","Cluster 67, #15 Elements"
2,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",santé mentale,NOM ADJ,santé mental,True,D008603,Santé mentale,Mental Health,True,1579.080827,13229,4795,1.000000,24.142062,18024,"[mentale, santé]","Cluster 1, #261 Elements"
3,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé,NOM PRP DET:ART NOM,ministère de le santé,False,,,,False,1411.378559,10741,7142,0.553007,21.734060,17883,"[ministère, santé]","Cluster 2, #170 Elements"
4,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé et des services,NOM PRP DET:ART NOM KON PRP:det NOM,ministère de le santé et des service,False,,,,False,1107.888160,10560,7061,0.553007,31.111185,17621,"[ministère, santé, services]","Cluster 29, #24 Elements"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12609,['santeestrie'],centre de crise,NOM PRP NOM,centre de crise,False,,,,True,84.828851,3,3,0.136894,8.582629,6,"[centre, crise]","Cluster 998, #3 Elements"
12610,['inesss'],carcinome rénal,NOM ADJ,carcinome rénal,True,D002292,Néphrocarcinome,"Carcinoma, Renal Cell",False,61.813956,3,3,0.141561,18.359316,6,"[carcinome, rénal]",zzz_no_cluster
12611,['inesss'],durée de vie,NOM PRP NOM,durée de vie,True,D008136,Longévité,Longevity,False,434.083437,3,3,0.184051,21.441105,6,[durée],"Cluster 63, #16 Elements"
12612,['laval_sante'],commotion cérébrale,NOM ADJ,commotion cérébral,True,D001924,Commotion de l'encéphale,Brain Concussion,True,388.209903,3,3,0.557684,13.725435,6,"[commotion, cérébrale]","Cluster 682, #4 Elements"


In [9]:
# Give each cluster a meaningful label
# Two options:
#   1 - If all terms of a cluster share a token, the cluster is named after that token
#   2 - Otherwise, the cluster is named after its term with the highest TF + DF sum

labels = set(df['Cluster Name'].tolist())
desired_labels = {x : None for x in labels} # (initialized to None)
for label in labels:
    cluster = df[df['Cluster Name'] == label]
    common_tokens = list(set.intersection(*map(set, cluster['tokens'].tolist())))
    if common_tokens:
        # option 1: a token shared by every term of the cluster
        desired_labels[label] = common_tokens[0]
    else:
        # option 2: the term with the highest TF + DF in the cluster
        max_freq = cluster['TF + DF'].max()
        desired_labels[label] = cluster[cluster['TF + DF'] == max_freq]['Terme'].values[0]

df['Cluster'] = df['Cluster Name'].map(desired_labels)
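
The two labeling options can be checked on a pair of small clusters. The sketch below is a standalone illustration using plain Python structures rather than the DataFrame; the terms, tokens and TF + DF values are copied from rows displayed in this notebook's outputs, while `label_cluster` and the shortened cluster keys are illustrative names.

```python
# Toy clusters built from rows shown in this notebook: (term, tokens, TF + DF)
clusters = {
    "Cluster 139": [
        ("municipalités et évènements", ["municipalités", "évènements"], 122),
        ("registre des évènements", ["registre", "évènements"], 243),
    ],
    "Cluster 569": [
        ("alimentation alternative", ["alimentation", "alternative"], 252),
        ("maison alternative", ["alternative", "maison"], 129),
        ("normes disponibles alternatives", ["alternatives", "disponibles", "normes"], 214),
        ("protocole alternatives", ["alternatives", "protocole"], 120),
    ],
}

def label_cluster(rows):
    """Shared token if one exists, otherwise the term with the highest TF + DF."""
    common = set.intersection(*(set(tokens) for _, tokens, _ in rows))
    if common:
        return sorted(common)[0]  # sorted() only makes the pick deterministic
    return max(rows, key=lambda row: row[2])[0]

labels = {name: label_cluster(rows) for name, rows in clusters.items()}
print(labels)
```

The second cluster has no shared token because `alternative` and `alternatives` are distinct strings, so it falls back to its most frequent term.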

In [10]:
# move the cluster and keyword columns to the front
col = df.pop("Terme")
df.insert(1, col.name, col)

col = df.pop('Cluster')
df.insert(1, col.name, col)

df.sort_values(["Cluster", "Terme"], ascending=[True, True], inplace=True)

In [11]:
df

Unnamed: 0,Corpora,Cluster,Terme,Structure syntaxique,Forme lemmatisée,isMeSHTerm,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,tokens,Cluster Name
11510,['inesss'],(cancer colorectal métastatique),abevmy (cancer colorectal métastatique),NOM PUN NOM ADJ ADJ PUN,abevmy (cancer colorectal métastatique ),False,,,,False,267.978979,39,39,0.008728,10.536159,78,"[(cancer colorectal métastatique), abevmy]","Cluster 1310, #2 Elements"
11439,['inesss'],(cancer colorectal métastatique),aybintio (cancer colorectal métastatique),NOM PUN NOM ADJ ADJ PUN,aybintio (cancer colorectal métastatique ),False,,,,False,289.445249,39,39,0.008728,12.443722,78,"[(cancer colorectal métastatique), aybintio]","Cluster 1310, #2 Elements"
800,"['cisss_iles', 'cisss_outaouais', 'ciusss_at',...",(chsld),durée (chsld),NOM PUN NOM PUN,durée (chsld ),False,,,,False,525.643231,524,455,0.452913,23.949148,979,"[(chsld), durée]","Cluster 1863, #2 Elements"
5988,['ciusss_mcq'],(chsld),hébergement (chsld),NOM PUN NOM PUN,hébergement (chsld ),False,,,,False,798.167800,85,83,0.531501,16.015980,168,"[(chsld), hébergement]","Cluster 1863, #2 Elements"
8218,['ciusss_centresud'],(crds),demandes (crds),NOM PUN NOM PUN,demande (crds ),False,,,,False,603.821792,65,56,0.122942,9.633273,121,"[(crds), demandes]","Cluster 1639, #2 Elements"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1180,['msss'],évolution,tableau synthèse de l'évolution des données,NOM NOM PRP DET:ART NOM PRP:det NOM,tableau synthèse de le évolution des donnée,False,,,,False,-inf,366,366,0.104796,23.143103,732,"[données, synthèse, tableau, évolution]","Cluster 1722, #2 Elements"
6986,['ciusss_cn'],évènements,municipalités et évènements,NOM KON NOM,municipalité et évènements,False,,,,False,1387.580036,69,53,0.456666,9.356082,122,"[municipalités, évènements]","Cluster 139, #2 Elements"
4397,['msss'],évènements,registre des évènements,NOM PRP:det NOM,registre des évènements,False,,,,False,-inf,131,112,0.395966,14.591255,243,"[registre, évènements]","Cluster 139, #2 Elements"
5376,['cisss_iles'],îles,cisss des îles,NOM PRP:det NOM,cisss des île,False,,,,False,1079.751879,101,39,1.000000,6.522820,140,"[cisss, îles]","Cluster 1629, #2 Elements"


In [12]:
file_path = '../06-clustering/clusters_BERT_communityDetection.csv'
df.drop(columns=['Cluster Name', 'tokens']).to_csv(file_path, index=False)

In [13]:
uncluster_percent = (remaining / len(df)) * 100
clustered_percent = 100 - uncluster_percent
print(clustered_percent,"% of rows clustered successfully!")

75.78087838909148 % of rows clustered successfully!
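
As a quick check of the figure above, the percentage can be recomputed from the orphan count printed by the clustering cell. The row count of 12,614 is an assumption: it is inferred from the printed percentage, since the displayed tables are truncated.

```python
remaining = 3055    # "Total Unclustered Keywords" printed by the clustering cell
total_rows = 12614  # assumption: row count inferred from the printed percentage

uncluster_percent = remaining / total_rows * 100
clustered_percent = 100 - uncluster_percent

print(f"{uncluster_percent:.2f}% unclustered, {clustered_percent:.2f}% clustered")
```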


## **Visualisation**
https://inside-machinelearning.com/en/efficient-sentences-embedding-visualization-tsne/

In [14]:
# First, choose the clusters we want to visualize (add a column = Visualize Y/N)
clusters = pd.DataFrame(set(df['Cluster'].tolist()), columns=['Cluster'])

file_path = '../06-clustering/clusters_to_visualize.csv'
clusters.to_csv(file_path)

__________________________________________________________________________________________
Manual indexing step
__________________________________________________________________________________________
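
The next cell reads the file back with `sep=';'` and filters on a `Visualize (Y/N)` column, which suggests the exported file was edited by hand (for instance in a spreadsheet) in between. The sketch below shows that filtering step with the standard `csv` module on a hypothetical file; the sample rows are examples, not the content of the real `clusters_to_visualize.csv`.

```python
import csv
import io

# Hypothetical contents of the hand-edited file (rows are examples, not the real file)
edited_csv = io.StringIO(
    "Cluster;Visualize (Y/N)\n"
    "cardiologie;Y\n"
    "centre intégré universitaire;Y\n"
    "îles;N\n"
    "asthme;Y\n"
)

# Keep only the clusters flagged with "Y"
reader = csv.DictReader(edited_csv, delimiter=";")
selected = [row["Cluster"] for row in reader if row["Visualize (Y/N)"] == "Y"]
print(selected)
```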

In [16]:
with open(file_path, encoding = 'utf-8') as f:
    clusters = pd.read_csv(f, sep=';')
    clusters = clusters[clusters['Visualize (Y/N)'] == 'Y']['Cluster'].tolist()

clusters

['hémato-oncologie',
 'alimentation alternative',
 'cardiologie',
 'philippe-pinel',
 'jeunesse (dpj)',
 'prise de sang',
 'rhumatisme',
 'travailleurs immunosupprimés',
 'psychothérapie',
 'asthme',
 'anticoagulants',
 'fumée',
 'infections nosocomiales',
 'jeunes adultes',
 'résidence privée',
 'intoxications',
 'grossesse gynéchologie',
 'dépressif',
 'autosoins',
 'malentendants',
 'cyberdépendance',
 'paralysie',
 'infections',
 'traumatologie',
 'littérature',
 'palliatifs',
 'infirmière clinicienne',
 'génériques',
 'dermatite',
 'cancer colorectal',
 'insuffisance cardiaque',
 'risques infectieux',
 'infections transmissibles',
 'implant cochléaire',
 'maternité',
 'maladie de parkinson',
 'grand-parentalité',
 'centre intégré universitaire',
 'médicaments biosimilaires',
 'environnement',
 'défavorisation',
 'prise en charge médicale',
 'enceintes',
 'déficience',
 'développement des enfants',
 'allergie',
 'déficience visuelle',
 'pollution',
 'vieillissement',
 'reconstructi

In [17]:
# To speed up the t-SNE fit, keep only the terms that belong to the selected clusters.
to_visualize = df.query('Cluster in @clusters')
to_visualize

Unnamed: 0,Corpora,Cluster,Terme,Structure syntaxique,Forme lemmatisée,isMeSHTerm,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,tokens,Cluster Name
2931,['ciusss_at'],alimentation alternative,alimentation alternative,NOM ADJ,alimentation alternatif,False,,,,False,-inf,168,84,0.055972,2.610970,252,"[alimentation, alternative]","Cluster 569, #4 Elements"
5314,"['ciusss_at', 'ciusss_nordmtl']",alimentation alternative,maison alternative,NOM ADJ,maison alternatif,False,,,,False,1047.188893,103,26,1.000000,8.291842,129,"[alternative, maison]","Cluster 569, #4 Elements"
5124,['inesss'],alimentation alternative,normes disponibles alternatives,NOM ADJ ADJ,norme disponible alternatif,False,,,,False,-inf,107,107,0.059649,27.873693,214,"[alternatives, disponibles, normes]","Cluster 569, #4 Elements"
8757,['ciusss_nordmtl'],alimentation alternative,protocole alternatives,NOM ADJ,protocole alternatif,False,,,,False,525.792755,60,60,0.169313,9.155886,120,"[alternatives, protocole]","Cluster 569, #4 Elements"
9933,['ciusss_cn'],allaitement,allaitement alimentation vaccination,NOM NOM NOM,allaitement alimentation vaccination,False,,,,False,751.448129,51,51,0.152911,7.691492,102,"[alimentation, allaitement, vaccination]","Cluster 1065, #3 Elements"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7075,['ciusss_cn'],vieillissement,cours du vieillissement développement,NOM PRP:det NOM NOM,cour du vieillissement développement,False,,,,False,858.896962,67,64,0.091229,21.338439,131,"[cours, développement, vieillissement]","Cluster 230, #7 Elements"
3413,['ciusss_cn'],vieillissement,excellence sur le vieillissement,NOM PRP DET:ART NOM,excellence sur le vieillissement,False,,,,False,-inf,156,72,0.384555,6.390874,228,"[excellence, vieillissement]","Cluster 230, #7 Elements"
6398,['ciusss_cn'],vieillissement,langage au cours du vieillissement,NOM PRP:det NOM PRP:det NOM,langage au cour du vieillissement,False,,,,False,1385.428784,79,64,0.243677,10.763848,143,"[cours, langage, vieillissement]","Cluster 230, #7 Elements"
7061,['ciusss_cn'],vieillissement,langage au cours du vieillissement développement,NOM PRP:det NOM PRP:det NOM NOM,langage au cour du vieillissement développement,False,,,,False,858.896962,67,64,0.091229,9.018718,131,"[cours, développement, langage, vieillissement]","Cluster 230, #7 Elements"


In [18]:
from sklearn.manifold import TSNE

sentences = to_visualize['Terme'].tolist()
X = model.encode(sentences)  # already a 2-D NumPy array, one row per term

# project the embeddings to 2 dimensions for plotting
X_embedded = TSNE(n_components=2).fit_transform(X)



In [19]:
df_embeddings = pd.DataFrame(X_embedded)
df_embeddings = df_embeddings.rename(columns={0:'x',1:'y'})
df_embeddings = df_embeddings.assign(label=to_visualize.Cluster.values)

In [20]:
df_embeddings = df_embeddings.assign(text=to_visualize.Terme.values)

In [21]:
import plotly.express as px
import plotly

fig = px.scatter(df_embeddings, x='x', y='y', color='label', labels={'color': 'label'}, 
hover_data=['text'], title = 'Quebec Health Care System Terminology Embedding Visualization')
fig.show()

# html file
file_path = '../06-clustering/BERT_communityDetection_visualization_sample.html'
plotly.offline.plot(fig, filename=file_path)


'../06-clustering/BERT_communityDetection_visualization_sample.html'