## **4. Keywords Clustering**
BERT Sentence transformers embeddings | Community Detection Algorithm

We will compare models implementing different combinations of these parameters:
- **Accuracy** (the similarity threshold passed to community detection): ranging from 65% to 100%
- **Minimum size of clusters**: 2 to 1000

We will choose the set of parameters that maximizes accuracy and minimizes the resulting number of orphan terms (terms not assigned to any cluster).
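
The two criteria pull in opposite directions, so a single "best" combination may not exist. One possible way to narrow the choice is to keep only the combinations that no other combination beats on both criteria. The sketch below illustrates that idea; the result rows are invented for illustration and `pareto_front` is a hypothetical helper, not part of this notebook.

```python
# Hypothetical grid-search results: (accuracy %, minimum cluster size, orphan count).
# These numbers are invented; they are not outputs of this notebook.
results_sketch = [
    (65, 2, 120),
    (75, 2, 300),
    (75, 5, 900),
    (85, 2, 1500),
    (85, 5, 1400),
]

def pareto_front(rows):
    """Keep rows that no other row beats on both criteria
    (higher or equal accuracy AND fewer or equal orphans, one of them strictly)."""
    front = []
    for acc, size, orphans in rows:
        dominated = any(
            a >= acc and o <= orphans and (a > acc or o < orphans)
            for a, _, o in rows
        )
        if not dominated:
            front.append((acc, size, orphans))
    return front

best = pareto_front(results_sketch)
print(best)
```

The remaining combinations still have to be compared by hand, for instance on a heatmap of orphan counts.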

Online tutorial followed :  
https://colab.research.google.com/github/searchsolved/search-solved-public-seo/blob/main/search_engine_journal/SEJ_Semantic_Clustering_Tool_by_LeeFootSEO.ipynb

____________________________________________________________________________________________________________________

### **Embedding** : BERT Sentence Transformers

> "A **transformer** is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part  of the input data.
Transformers are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional  training parallelization allows training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks."   
  
(https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))
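
The model turns each term into a vector, and the clustering step compares those vectors by cosine similarity: the "accuracy" threshold is a lower bound on that similarity. Here is a minimal sketch of the comparison, using invented 3-dimensional vectors in place of real sentence embeddings (which have several hundred dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for embeddings (invented values)
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 0.0, 0.9]   # points in almost the same direction as emb_a
emb_c = [0.0, 1.0, 0.0]   # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

With a threshold of 0.75, `emb_a` and `emb_b` would be considered similar and `emb_a` and `emb_c` would not.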


In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
model =  SentenceTransformer("dangvantuan/sentence-camembert-base")



In [None]:
accuracy = [65, 70, 75, 80, 85, 90, 95, 100]
size = [2, 5, 10, 25, 50, 100, 500, 1000]
 
results= []
for a in accuracy:
    for s in size:
        results.append({'Accuracy' : a, 'Minimum cluster size': s, 'Number of orphans': None})


# We want to select the parameters that maximize accuracy and minimize the number of orphans
results

In [2]:
file_path = '../04-clustering/final_candidate-terms.csv'
with open(file_path, encoding='utf-8') as f:
    df = pd.read_csv(f, sep=';')
    df['Terme'] = df['Terme'].astype('str')
    df['TF + DF'] = df['TF'] + df['DF']

# store the data
cluster_name_list = []
corpus_sentences_list = []
df_all = []

corpus_set = set(df['Terme'])
corpus_set_all = corpus_set
cluster = True

df

Unnamed: 0,Corpora,Terme,Structure syntaxique,Forme lemmatisée,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF
0,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",services sociaux,NOM ADJ,service social,D012947,Services sociaux et travail social (activité),Social Work,VRAI,1674.908057,40189,15418,1.000000,26.361483,55607
1,"['chum', 'chuqc', 'cisss_ca', 'cisss_cotenord'...",santé publique,NOM ADJ,santé public,D011634,Santé publique,Public Health,VRAI,1572.987576,32510,11194,1.000000,18.947189,43704
2,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",santé mentale,NOM ADJ,santé mental,D008603,Santé mentale,Mental Health,VRAI,1579.080827,13229,4795,1.000000,24.142062,18024
3,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé,NOM PRP DET:ART NOM,ministère de le santé,,,,FAUX,1411.378559,10741,7142,0.553007,21.734060,17883
4,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé et des services sociaux,NOM PRP DET:ART NOM KON PRP:det NOM ADJ,ministère de le santé et des service social,,,,FAUX,1364.201342,10543,7060,0.553007,26.835420,17603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4673,['santeestrie'],centre de crise,NOM PRP NOM,centre de crise,,,,VRAI,84.828851,3,3,0.136894,8.582629,6
4674,['inesss'],carcinome rénal,NOM ADJ,carcinome rénal,D002292,Néphrocarcinome,"Carcinoma, Renal Cell",FAUX,61.813956,3,3,0.141561,18.359316,6
4675,['inesss'],durée de vie,NOM PRP NOM,durée de vie,D008136,Longévité,Longevity,FAUX,434.083437,3,3,0.184051,21.441105,6
4676,['laval_sante'],commotion cérébrale,NOM ADJ,commotion cérébral,D001924,Commotion de l'encéphale,Brain Concussion,VRAI,388.209903,3,3,0.557684,13.725435,6


In [3]:
from nltk.tokenize import RegexpTokenizer
regex = r"[\w+-]+|\([\s+\w+\d+-]+\)|\w+|\w"  # raw string, so the backslashes reach the regex engine unchanged
tokenizex = RegexpTokenizer(regex)

file_path = "../02-filtrage/stopwords.txt"
with open(file_path, 'r', encoding="utf-8") as f:
    stop = [t.lower().strip('\n') for t in f.readlines()]

def to_tokens(kw, min_chars=2):
    tokens = tokenizex.tokenize(str(kw)) # split the string into a list of words
    tokens = [word for word in tokens if len(word) > min_chars] 
    tokens = [str(word) for word in tokens if word not in stop] 
    
    tokens = set(tokens) # to remove duplicates
    tokens = sorted(tokens) # converts our set back to a list and sorts words in alphabetical order
    return tokens

df["tokens"] = df["Terme"].apply(lambda x: to_tokens(
    x,
    min_chars=3,
))
df

Unnamed: 0,Corpora,Terme,Structure syntaxique,Forme lemmatisée,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,tokens
0,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",services sociaux,NOM ADJ,service social,D012947,Services sociaux et travail social (activité),Social Work,VRAI,1674.908057,40189,15418,1.000000,26.361483,55607,"[services, sociaux]"
1,"['chum', 'chuqc', 'cisss_ca', 'cisss_cotenord'...",santé publique,NOM ADJ,santé public,D011634,Santé publique,Public Health,VRAI,1572.987576,32510,11194,1.000000,18.947189,43704,"[publique, santé]"
2,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",santé mentale,NOM ADJ,santé mental,D008603,Santé mentale,Mental Health,VRAI,1579.080827,13229,4795,1.000000,24.142062,18024,"[mentale, santé]"
3,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé,NOM PRP DET:ART NOM,ministère de le santé,,,,FAUX,1411.378559,10741,7142,0.553007,21.734060,17883,"[ministère, santé]"
4,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé et des services sociaux,NOM PRP DET:ART NOM KON PRP:det NOM ADJ,ministère de le santé et des service social,,,,FAUX,1364.201342,10543,7060,0.553007,26.835420,17603,"[ministère, santé, services, sociaux]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4673,['santeestrie'],centre de crise,NOM PRP NOM,centre de crise,,,,VRAI,84.828851,3,3,0.136894,8.582629,6,"[centre, crise]"
4674,['inesss'],carcinome rénal,NOM ADJ,carcinome rénal,D002292,Néphrocarcinome,"Carcinoma, Renal Cell",FAUX,61.813956,3,3,0.141561,18.359316,6,"[carcinome, rénal]"
4675,['inesss'],durée de vie,NOM PRP NOM,durée de vie,D008136,Longévité,Longevity,FAUX,434.083437,3,3,0.184051,21.441105,6,[durée]
4676,['laval_sante'],commotion cérébrale,NOM ADJ,commotion cérébral,D001924,Commotion de l'encéphale,Brain Concussion,VRAI,388.209903,3,3,0.557684,13.725435,6,"[commotion, cérébrale]"
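
To see what the tokenization rule does on individual terms, the following sketch reproduces it with the standard `re` module only. It assumes that `re.findall` with the same pattern behaves like the `RegexpTokenizer` used above, and it replaces the stop-word file with a small hypothetical set, so it is an approximation of the cell above and not a drop-in replacement.

```python
import re

TOKEN_PATTERN = r"[\w+-]+|\([\s+\w+\d+-]+\)|\w+|\w"
# Hypothetical stand-in for the contents of stopwords.txt
STOP_WORDS = {"pour", "dans", "avec"}

def to_tokens_sketch(term, min_chars=3):
    """Split a term, drop short tokens and stop words, deduplicate, sort."""
    tokens = re.findall(TOKEN_PATTERN, str(term))
    kept = {t for t in tokens if len(t) > min_chars and t not in STOP_WORDS}
    return sorted(kept)

print(to_tokens_sketch("ministère de la santé et des services sociaux"))
print(to_tokens_sketch("durée de vie"))
print(to_tokens_sketch("réadaptation en milieu scolaire (prms)"))
```

Because only tokens strictly longer than `min_chars` are kept, three-letter words such as "vie" or "des" disappear, while a parenthesized acronym such as "(prms)" survives as a single token.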


In [None]:
# Keep looping until no more clusters are created
def community_clusters(cluster_accuracy, min_cluster_size, corpus_set=corpus_set):
    threshold = cluster_accuracy / 100  # community_detection expects a value between 0 and 1
    all_terms = set(corpus_set)
    remaining_terms = set(corpus_set)
    # Local accumulators: each call starts from scratch, so successive
    # grid-search runs do not contaminate each other's orphan counts
    cluster_names = []
    clustered_terms = []

    while remaining_terms:
        corpus_sentences = list(remaining_terms)
        check_len = len(corpus_sentences)

        corpus_embeddings = model.encode(corpus_sentences, batch_size=256, show_progress_bar=True, convert_to_tensor=True)
        clusters = util.community_detection(corpus_embeddings, min_community_size=min_cluster_size, threshold=threshold)

        for cluster_id, members in enumerate(clusters):
            cluster_name = "Cluster {}, N = {} Elements ".format(cluster_id + 1, len(members))
            print("\n" + cluster_name)

            for sentence_id in members:
                print("\t", corpus_sentences[sentence_id])
                clustered_terms.append(corpus_sentences[sentence_id])
                cluster_names.append(cluster_name)

        df_all.append(pd.DataFrame({'Cluster Name': cluster_names, 'Terme': clustered_terms}))

        remaining_terms = all_terms - set(clustered_terms)
        print("Total Unclustered Keywords: ", len(remaining_terms))
        # Stop once a pass assigns no additional term
        if len(remaining_terms) == check_len:
            break

    return len(remaining_terms)
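
To show how the threshold and the minimum cluster size drive the number of orphans, here is a deliberately simplified, pure-Python community detection on toy 2-D vectors. It is a sketch of the general idea (group points whose similarity reaches the threshold, largest groups first, without overlap). It is not the code of `util.community_detection`, and all vectors and function names are invented for illustration.

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def simple_communities(vectors, threshold, min_size):
    n = len(vectors)
    # Candidate community of each point: every point at least `threshold` similar to it
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cos_sim(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    # Largest candidates first; a point can belong to one community only
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_size:
            communities.append(fresh)
            assigned.update(fresh)
    orphans = [i for i in range(n) if i not in assigned]
    return communities, orphans

# Two tight groups (indices 0-2 and 3-4) and one point halfway between them (index 5)
toy = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1], [0.1, 0.9], [0.7, 0.7]]

strict = simple_communities(toy, threshold=0.9, min_size=2)
strict_big = simple_communities(toy, threshold=0.9, min_size=3)
loose = simple_communities(toy, threshold=0.7, min_size=2)
print(strict, strict_big, loose, sep="\n")
```

On this toy data, raising the minimum size turns the small group into orphans, while lowering the threshold removes every orphan at the cost of merging both groups into one.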

In [None]:
for row in results:
    acc = row['Accuracy']
    min_size = row['Minimum cluster size']  # distinct name, so the `size` list is not overwritten
    score = community_clusters(acc, min_size)
    print(acc, min_size, score)
    row['Number of orphans'] = score

pd.DataFrame(results)

In [None]:
# Next, we want to show accuracy, cluster size, and number of orphans in a table/chart, in order to pick
# the combination of values that maximizes accuracy and minimizes the number of orphan terms

file_path = '../04-Clustering/results_BERT_communityDetection.csv'
pd.DataFrame(results).to_csv(file_path)

In [None]:
########################################
# Based on the chart, we choose our parameters
########################################
import seaborn as sns
import numpy as np
results = pd.DataFrame(results)
results['Number of orphans'] = pd.to_numeric(results['Number of orphans'])
results_pivot = results.pivot(index= 'Accuracy', columns='Minimum cluster size', values = 'Number of orphans')

fig = sns.heatmap(results_pivot, annot=True, fmt="d", cmap='binary').get_figure()
file_path = '../04-Clustering/figure_BERT_communityDetection.tiff'
fig.savefig(file_path) 

____________________________________________________________________
Parameter selection
____________________________________________________________________

In [4]:
############################
# Parameter selection

cluster_accuracy = 75
min_cluster_size = 2

############################

# community_detection expects a similarity threshold between 0 and 1
cluster_accuracy = cluster_accuracy / 100


model =  SentenceTransformer("dangvantuan/sentence-camembert-base")
corpus_sentences = list(set(df['Terme'].tolist()))
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

# Use the parameters chosen above rather than hard-coded values
clusters = util.community_detection(corpus_embeddings, min_community_size=min_cluster_size, threshold=cluster_accuracy)
# Start from empty lists so that earlier runs do not leak into this assignment
corpus_sentences_list = []
cluster_name_list = []

for i, cluster in enumerate(clusters):
    print("\nCluster {}".format(i+1))
    for sentence_id in cluster:
        print("\t", corpus_sentences[sentence_id])
        corpus_sentences_list.append(corpus_sentences[sentence_id])
        cluster_name_list.append(i+1)

df_new = pd.DataFrame(None)
df_new['Cluster'] = cluster_name_list
df_new["Terme"] = corpus_sentences_list

Batches: 100%|██████████| 74/74 [01:06<00:00,  1.12it/s]



Cluster 1
	 image corporelle
	 innovations technologiques
	 membres supérieurs
	 dépression majeure récurrente
	 cuisson adéquate
	 milieux communautaires
	 fauteuils roulants
	 troubles mentaux
	 symptômes semblables
	 grippe saisonnière
	 indication reconnue
	 environnement bâti
	 développement global
	 fauteuil roulant
	 planification globale
	 capteurs intelligents
	 vélo adapté
	 blessures morales
	 symptômes physiques
	 usagers actifs
	 maladie héréditaire
	 tests rapides
	 rapports annuels
	 ventilation assistée
	 organismes publics
	 sclérose latérale
	 domicile régulier
	 impacts potentiels
	 inégalités sociales
	 approche adaptée
	 températures internes
	 pesticides commerciaux
	 équipements spécialisés
	 assurances collectives
	 besoins spécifiques
	 littérature grise
	 hôpital neurologique
	 médecine légale
	 données ouvertes
	 infection modérée
	 dépistage sanguin
	 maladies animales
	 chambres individuelles
	 médecine personnalisée
	 organismes reconnus
	 cliniques spéci


In [5]:
# merge the cluster assignments back into the original df
df = df.merge(df_new.drop_duplicates('Terme'), how='left', on="Terme")
# terms that were not assigned to any cluster (orphans) get cluster 0
df['Cluster'] = df['Cluster'].fillna(0)
df

Unnamed: 0,Corpora,Terme,Structure syntaxique,Forme lemmatisée,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,tokens,Cluster
0,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",services sociaux,NOM ADJ,service social,D012947,Services sociaux et travail social (activité),Social Work,VRAI,1674.908057,40189,15418,1.000000,26.361483,55607,"[services, sociaux]",19.0
1,"['chum', 'chuqc', 'cisss_ca', 'cisss_cotenord'...",santé publique,NOM ADJ,santé public,D011634,Santé publique,Public Health,VRAI,1572.987576,32510,11194,1.000000,18.947189,43704,"[publique, santé]",62.0
2,"['chum', 'chuqc', 'chusj', 'cisss_ca', 'cisss_...",santé mentale,NOM ADJ,santé mental,D008603,Santé mentale,Mental Health,VRAI,1579.080827,13229,4795,1.000000,24.142062,18024,"[mentale, santé]",1.0
3,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé,NOM PRP DET:ART NOM,ministère de le santé,,,,FAUX,1411.378559,10741,7142,0.553007,21.734060,17883,"[ministère, santé]",2.0
4,"['chuqc', 'chusj', 'cisss_ca', 'cisss_cotenord...",ministère de la santé et des services sociaux,NOM PRP DET:ART NOM KON PRP:det NOM ADJ,ministère de le santé et des service social,,,,FAUX,1364.201342,10543,7060,0.553007,26.835420,17603,"[ministère, santé, services, sociaux]",41.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4673,['santeestrie'],centre de crise,NOM PRP NOM,centre de crise,,,,VRAI,84.828851,3,3,0.136894,8.582629,6,"[centre, crise]",305.0
4674,['inesss'],carcinome rénal,NOM ADJ,carcinome rénal,D002292,Néphrocarcinome,"Carcinoma, Renal Cell",FAUX,61.813956,3,3,0.141561,18.359316,6,"[carcinome, rénal]",361.0
4675,['inesss'],durée de vie,NOM PRP NOM,durée de vie,D008136,Longévité,Longevity,FAUX,434.083437,3,3,0.184051,21.441105,6,[durée],37.0
4676,['laval_sante'],commotion cérébrale,NOM ADJ,commotion cérébral,D001924,Commotion de l'encéphale,Brain Concussion,VRAI,388.209903,3,3,0.557684,13.725435,6,"[commotion, cérébrale]",192.0


In [6]:
df['Cluster'] = df['Cluster'].astype(int)

In [7]:
df = df.sort_values(by='Cluster')
df

Unnamed: 0,Corpora,Terme,Structure syntaxique,Forme lemmatisée,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,tokens,Cluster
4677,['iucpq'],tomographie par émission de positrons,NOM PRP NOM PRP NOM,tomographie par émission de positron,D049268,Tomographie par émission de positons,Positron-Emission Tomography,FAUX,389.442395,2,1,0.317463,21.257671,3,"[positrons, tomographie, émission]",0
3041,['inesss'],otsuka can traitement de la granulomatose,NOM ADJ NOM PRP DET:ART NOM,otsuka can traitement de le granulomatose,,,,FAUX,1281.107919,66,66,0.007977,13.378159,132,"[granulomatose, otsuka, traitement]",0
3042,['inesss'],otsuka can pour le traitement du prurit,NOM ADJ PRP DET:ART NOM PRP:det NOM,otsuka can pour le traitement du prurit,,,,FAUX,1285.200130,66,66,0.007977,9.666518,132,"[otsuka, prurit, traitement]",0
3043,['inesss'],traitement des polypes,NOM PRP:det NOM,traitement des polype,,,,FAUX,864.174602,66,66,0.007977,21.305386,132,"[polypes, traitement]",0
3047,['inesss'],jardiance empagliflozine,NOM NOM,jardiance empagliflozine,,,,FAUX,1413.152237,66,66,0.007977,12.931803,132,"[empagliflozine, jardiance]",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4256,['ciusss_mcq'],programmes et services en tsa,NOM KON NOM PRP NOM,programme et service en tsa,,,,FAUX,463.935789,31,31,0.176838,22.535160,62,"[programmes, services]",794
2844,['inesss'],porteurs du chromosome de philadelphie,NOM PRP:det NOM PRP NOM,porteur du chromosome de philadelphie,,,,FAUX,1380.901573,66,66,0.007977,9.400217,132,"[chromosome, philadelphie, porteurs]",795
3746,['inesss'],chromosome de philadelphie,NOM PRP NOM,chromosome de philadelphie,D010677,Chromosome Philadelphie,Philadelphia Chromosome,FAUX,-inf,78,78,0.377333,14.257910,156,"[chromosome, philadelphie]",795
2720,['inesss'],traitement topique du psoriasis en plaques,NOM ADJ PRP:det NOM PRP NOM,traitement topique du psoriasis en plaque,,,,FAUX,1212.747304,69,69,0.138644,6.109468,138,"[plaques, psoriasis, topique, traitement]",796


In [8]:
# Assign a meaningful label to each cluster
# Two options:
#   1 - If the terms of a cluster share a token, the cluster is named after that token
#   2 - Otherwise, use the term with the highest TF + DF sum

labels = set(df['Cluster'].tolist())
desired_labels = {x : None for x in labels} # (initialized to None)
for label in labels:
    d = df[df['Cluster'] == label]['tokens'].tolist()
    new_label = list(set.intersection(*map(set,d)))
    try:
        desired_labels[label] = new_label[0]
    except IndexError:  # no token is shared by every term of the cluster
        cluster = df[df["Cluster"] == label]
        max_freq = cluster['TF + DF'].max()
        new_label = cluster[cluster['TF + DF'] == max_freq]['Terme'].values
        desired_labels[label] = new_label[0]

df['Cluster label'] = df['Cluster'].map(desired_labels)
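
The labeling rule can be checked on two small clusters without pandas. In the sketch below, `label_cluster` is a hypothetical helper; the terms and TF + DF values come from this notebook's table, and the token lists are written by hand following the tokenization rule (tokens longer than three characters). One assumption differs from the cell above: when several tokens are shared, the sketch picks the alphabetically first one so that the result is deterministic.

```python
def label_cluster(rows):
    """rows: list of (term, tokens, tf_plus_df) tuples for a single cluster."""
    shared = set.intersection(*(set(tokens) for _, tokens, _ in rows))
    if shared:
        # Option 1: a token common to every term of the cluster
        return sorted(shared)[0]
    # Option 2: the term with the highest TF + DF
    return max(rows, key=lambda row: row[2])[0]

philadelphia = [
    ("porteurs du chromosome de philadelphie", ["chromosome", "philadelphie", "porteurs"], 132),
    ("chromosome de philadelphie", ["chromosome", "philadelphie"], 156),
]
case_studies = [
    ("gestion des cas", ["gestion"], 103),
    ("études de cas", ["études"], 473),
]

print(label_cluster(philadelphia))
print(label_cluster(case_studies))
```

The second cluster has no shared token because "cas" is too short to survive tokenization, so the fallback on TF + DF applies.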

In [9]:
# move the term column to the front (position 1)
col = df.pop("Terme")
df.insert(1, col.name, col)

#col = df.pop('Cluster')
#df.insert(1, col.name, col)

In [10]:
df.sort_values(["Cluster label"], inplace=True)

In [11]:
file_path = '../04-clustering/clusters_BERT_communityDetection.csv'
df.drop(columns=['Cluster', 'tokens']).sort_values(["Cluster label"]).to_csv(file_path, index=False)

In [12]:
df = df[['Corpora', 'Cluster label', 'Terme', 'Structure syntaxique', 'Forme lemmatisée', 'MeSHID', 'MesH_prefLabel_fr', 'MesH_prefLabel_en', 'isTaxoTerm', 'Log Likelihood', 'TF', 'DF', 'TF*IDF', 'OKapiBM25', 'TF + DF', 'Cluster']]
df

Unnamed: 0,Corpora,Cluster label,Terme,Structure syntaxique,Forme lemmatisée,MeSHID,MesH_prefLabel_fr,MesH_prefLabel_en,isTaxoTerm,Log Likelihood,TF,DF,TF*IDF,OKapiBM25,TF + DF,Cluster
4158,['chusj'],(prms),réadaptation en milieu scolaire (prms),NOM PRP NOM ADJ PUN NOM PUN,réadaptation en milieu scolaire (prms ),,,,FAUX,353.698897,34,25,0.124964,10.093660,59,585
4168,['chusj'],(prms),milieu scolaire (prms),NOM ADJ PUN NOM PUN,milieu scolaire (prms ),,,,FAUX,353.698897,34,25,0.124964,7.005629,59,585
3444,['ciusss_cn'],abandon du tabagisme,centres d'abandon du tabagisme (cat),NOM PRP NOM PRP:det NOM PUN NOM PUN,centre de abandon du tabagisme (cat ),,,,FAUX,512.497260,58,40,0.393765,24.889385,98,60
2698,"['cisss_lanaudiere', 'sante_mtl']",abandon du tabagisme,produits du tabac,NOM PRP:det NOM,produit du tabac,D062789,Produits du tabac,Tobacco Products,FAUX,149.729452,99,56,0.802091,8.765813,155,60
1122,"['ciusss_centreouest', 'ciusss_cn', 'ciusss_es...",abandon du tabagisme,abandon du tabagisme,NOM PRP:det NOM,abandon du tabagisme,,,,FAUX,509.948523,182,113,0.992365,21.372872,295,60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4165,"['chuqc', 'cisss_lanaudiere']",études,diplôme d'études professionnelles (dep),NOM PRP NOM ADJ PUN NOM PUN,diplôme de étude professionnel (dep ),,,,FAUX,158.001869,34,23,0.220414,20.861424,57,778
2782,"['cisss_gaspesie', 'santeestrie']",études de cas,gestion des cas,NOM PRP:det NOM,gestion des cas,,,,FAUX,483.357950,67,36,0.398623,12.489009,103,555
1544,['ciusss_cn'],études de cas,études de cas,NOM PRP NOM,étude de cas,D002363,Présentations de cas,Case Reports,FAUX,-inf,259,214,0.245519,26.339661,473,555
2115,['cisss_iles'],îles,cisss des îles,NOM PRP:det NOM,cisss des île,,,,FAUX,1079.751879,101,39,1.000000,6.522820,140,621


## **Visualisation**
https://inside-machinelearning.com/en/efficient-sentences-embedding-visualization-tsne/

In [14]:
# First, choose the clusters to visualize (add a column = Visualize Y/N)
clusters = pd.DataFrame(set(df['Cluster label'].tolist()), columns=['Cluster label'])

file_path = '../04-clustering/clusters_to_visualize.csv'
clusters.to_csv(file_path)

__________________________________________________________________________________________
Manual indexing step
__________________________________________________________________________________________

In [17]:
with open(file_path, encoding = 'utf-8') as f:
    clusters = pd.read_csv(f, sep=';')
    clusters = clusters[clusters['Visualize (Y/N)'] == 'Y']['Cluster label'].tolist()

clusters

['assurance maladie',
 'médecin de famille',
 'réadaptation en dépendance',
 'services aux aînés',
 'psoriasis',
 'maternité',
 'cannabis',
 'développement des enfants',
 'soins infirmiers',
 'schizophrénie',
 'soins pédiatriques',
 'traumatisme craniocérébral',
 "pratique d'activités physiques",
 'grossesse gynéchologie',
 'maladie pulmonaire',
 'surdoses',
 'sexualité et contraception',
 'violence conjugale',
 'grossesse et accouchement',
 'cancer',
 'programmes et interventions',
 'prévention des traumatismes',
 'contrôle des infections',
 'maladie de parkinson',
 'abandon du tabagisme',
 'imagerie médicale',
 'thrombose',
 'habitudes de vie',
 'inhalothérapie',
 'violence interpersonnelle',
 'bébés',
 'alcool et drogues',
 'médecine de famille (gmf)',
 'institut de cardiologie',
 'insuffisance cardiaque',
 'prévention des maladies cardiaques',
 'allaitement maternel',
 'déficience visuelle',
 'chimiothérapie']

In [34]:
# To speed up the t-SNE computation, keep only the terms that belong to the selected clusters.
to_visualize = df[df['Cluster label'].isin(clusters)]
to_visualize = to_visualize.drop(columns=['Cluster']).rename(columns = {'Cluster label' : 'Cluster'})

In [35]:
import numpy as np
from sklearn.manifold import TSNE

sentences = to_visualize['Terme'].tolist()
X = list(model.encode(sentences))

X_embedded = TSNE(n_components=2).fit_transform(X)


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



In [39]:
df_embeddings = pd.DataFrame(X_embedded)
df_embeddings = df_embeddings.rename(columns={0:'x',1:'y'})
df_embeddings = df_embeddings.assign(label=to_visualize.Cluster.values)

In [40]:
df_embeddings = df_embeddings.assign(text=to_visualize.Terme.values)

In [41]:
import plotly.express as px
import plotly

fig = px.scatter(df_embeddings, x='x', y='y', color='label', labels={'color': 'label'}, 
hover_data=['text'], title = 'Quebec Health Care System Terminology Embedding Visualization')
fig.show()

# html file
file_path = '../04-clustering/BERT_communityDetection_visualization_sample.html'
plotly.offline.plot(fig, filename=file_path)


'../04-clustering/BERT_communityDetection_visualization_sample.html'