## **4. Keywords Clustering** 
We will compare different models implemeting each of these parameters:
- **Algorithm**: K-Means vs Expectation maximization VS Agglomerative
- **Embedding** : One-hot vs Sentence transformers
- **Number of clusters** : ranging from N/10, N/2 the total number of terms)

In [18]:
import pandas as pd

*One-hot embedding*

> One Hot encoding is a representation of categorical variables as binary vectors. Each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.

In [19]:
base_path = '../05-transformation/'
acteur = 'chum'
file_path = base_path + acteur + '_weighting_OKapiBM25.csv'
with open(file_path, encoding='utf-8') as f:
    keywords = pd.read_csv(f)[['Terme', 'Fréquence (TF)', 'Fréquence documentaire (DF)']]

keywords['TF + DF'] = keywords['Fréquence (TF)'] + keywords['Fréquence documentaire (DF)']

keywords = keywords.rename(columns={'Terme': 'Keyword'})
keywords

Unnamed: 0,Keyword,Fréquence (TF),Fréquence documentaire (DF),TF + DF
0,chirurgiens du canada,46,46,92
1,réunions hebdomadaires,86,86,172
2,centre hospitalier de l'université,115,88,203
3,activité de développement professionnel,43,43,86
4,centre de recherche du centre,78,68,146
...,...,...,...,...
179,professeur au département,34,26,60
180,chercheurs du crchum,37,30,67
181,recherche chirurgie,33,32,65
182,calendrier des conférences,40,40,80


In [20]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("(\w+\'|\w+-\w+|\(|\)|\w+)")

file_path = "../04-filtrage/stopwords.txt"
with open(file_path, 'r', encoding="utf-8") as f:
    stop = [t.lower().strip('\n') for t in f.readlines()]

def to_tokens(kw, min_chars=2):
    tokens = tokenizer.tokenize(str(kw)) # split the string into a list of words
    tokens = [word for word in tokens if len(word) > min_chars] 
    tokens = [str(word) for word in tokens if word not in stop] 
    
    tokens = set(tokens) # to remove duplicates
    tokens = sorted(tokens) # converts our set back to a list and sorts words in alphabetical order
    return tokens

keywords["tokens"] = keywords["Keyword"].apply(lambda x: to_tokens(
    x,
    min_chars=3,
))
keywords


In [21]:
keywords["tokens"] = keywords["Keyword"].apply(lambda x: to_tokens(
    x,
    min_chars=3,
))
keywords

Unnamed: 0,Keyword,Fréquence (TF),Fréquence documentaire (DF),TF + DF,tokens
0,chirurgiens du canada,46,46,92,"[canada, chirurgiens]"
1,réunions hebdomadaires,86,86,172,"[hebdomadaires, réunions]"
2,centre hospitalier de l'université,115,88,203,"[centre, hospitalier, université]"
3,activité de développement professionnel,43,43,86,"[activité, développement, professionnel]"
4,centre de recherche du centre,78,68,146,"[centre, recherche]"
...,...,...,...,...,...
179,professeur au département,34,26,60,"[département, professeur]"
180,chercheurs du crchum,37,30,67,"[chercheurs, crchum]"
181,recherche chirurgie,33,32,65,"[chirurgie, recherche]"
182,calendrier des conférences,40,40,80,"[calendrier, conférences]"


In [22]:
vocab = sorted(set(keywords["tokens"].explode()))
len(vocab)

156

In [23]:
def to_vector(keyword,vocab):
    """
    Calculates vector of keyword on given vocabulary.

    Returns vector as a list of values.  
    """
    vector = []
    for word in vocab:
        if word in keyword:
            vector.append(1)
        else:
            vector.append(0)
    return vector

keywords["vector"] = keywords["tokens"].apply(lambda x: to_vector(x,vocab))
embeddings_oh = keywords['vector'].tolist()

*Sentence transformers embedding*

> "A **transformer** is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part  of the input data.
Transformers are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional  training parallelization allows training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks."   

In [24]:
from sentence_transformers import SentenceTransformer, models
import torch

# On va utiliser un modèle BERT/sentence transformers (fr) pour extraire nos embeddings plutôt que des simples one-hot encoding
model =  SentenceTransformer("dangvantuan/sentence-camembert-base")

sentences = keywords['Keyword'].tolist()
embeddings_st = model.encode(sentences, convert_to_numpy=True)



### **Expérimentation**  
On va préparer un tableau avec les différents paramètres possibles ; on va y stocker les scores correspondant au fur et à mesure.
On va retenir les paramètres qui minimisent le score observé.

In [25]:
algorithmes = ['K-Means', 'Expectation-Maximization', 'AgglomerativeClustering']
embeddings = ['One-Hot', 'Sentence transformers']

nb_termes = len(keywords['Keyword'].tolist())
clusters = rg = range(round(nb_termes/5), len(set(tuple(x) for x in keywords['Keyword'].tolist())))

results = []
for algorithme in algorithmes:
    for embedding in embeddings:
        for cluster in clusters:
            results.append(\
                {'algorithme' : algorithme,\
                'embedding': embedding, \
                'K (nb clusters)' : cluster, \
                'Score': None}
            )


# On va remplir ce dictionnaire avec les bons scores au fur et à mesure qu'on expérimente
results = pd.DataFrame(results)
results

Unnamed: 0,algorithme,embedding,K (nb clusters),Score
0,K-Means,One-Hot,37,
1,K-Means,One-Hot,38,
2,K-Means,One-Hot,39,
3,K-Means,One-Hot,40,
4,K-Means,One-Hot,41,
...,...,...,...,...
877,AgglomerativeClustering,Sentence transformers,179,
878,AgglomerativeClustering,Sentence transformers,180,
879,AgglomerativeClustering,Sentence transformers,181,
880,AgglomerativeClustering,Sentence transformers,182,


### **Déterminer K**   
**Score "intersection"** *(lower is better)*  
Pour chaque cluster, on cherche s'il existe une intersection entre les termes qui le constituent ; pour chaque cluster pour lequel il n'existe pas d'intersection,
on ajoute 1 au score ; au final, on va retenir la plus petite valeur de k pour laquelle le score = 0 (si elle existe) 

**Score "orphelins"** *(lower is better)*  
Pour chaque cluster qui ne contient qu'un seul terme (*ie* orphelin), on ajoute 1 au score ; au final, on va retenir la plus petite valeur de k pour laquelle le score = 0 (si elle existe)  

-----------------------------------------------------------------------------------

*Pour combiner les deux scores*  
On pourrait faire la somme du score "intersection" et du score "orphelin" et retenir la valeur k pour laquelle ce score est la plus faible ; ou encore, la valeur k la plus faible pour laquelle les deux scores = 0 (si elle existe).  

On pourrait aussi privilégier un des deux scores, retenir d'abord la plus petite valeur de k pour laquelle le score = 0, puis ensuite retenir la valeur minimale associée à l'autre score.

### *KMeans*

In [40]:
from sklearn.cluster import KMeans

def test_KMeans(embed, keywords=keywords, results=results):
    if embed == 'One-Hot':
        X = embeddings_oh
        
    elif embed == 'Sentence transformers':
        X = embeddings_st

    rg = range(round(len(X)/5), len(set(tuple(x) for x in X)))
    scores = [{'algorithme' : 'K-means', 'embedding' : embed, 'k' : x, 'score_intersection' : None, 'score_orphelin': None} for x in rg]

    for k in range(len(scores)):
        scores[k]['score_intersection'] = 0
        scores[k]['score_orphelin'] = 0

        kmeans = KMeans(n_clusters = scores[k]['k'], init='k-means++', algorithm='elkan', random_state=0, n_init=1, max_iter=200).fit(X)
        keywords["kmeans"] = list(kmeans.labels_)

        labels = set(kmeans.labels_.tolist())
        for label in labels:
            d = keywords[keywords['kmeans'] == label]['tokens'].tolist()
            new_label = list(set.intersection(*map(set,d)))

            # Si on ne trouve pas d'intersection entre les termes d'un même cluster, on ajoute 1 au score intersection ; 
            # au final, on va retenir la plus petite valeur de k pour laquelle le score = 0 (si elle existe) 
            if(len(new_label) == 0):
                scores[k]['score_intersection'] += 1

            ## Si le cluster ne contient qu'un seul terme, on ajoute 1 au score orphelin
            if(len(d) == 1):
                scores[k]['score_orphelin'] += 1

    tab_scores = pd.DataFrame.from_records(scores)
    tab_scores['Score'] = tab_scores['score_intersection'] + tab_scores['score_orphelin']

    tab_scores.sort_values(['score_intersection'])

    try:
        # On retient un sous-ensemble de valeurs pour lesquelles le score intersection est nul (on sait qu'il y a toujours une intersection entre nos clusters)
        candidats =  tab_scores[tab_scores['score_intersection'] == 0]

        # Parmis ces valeurs, on retient celle pour laquelle le score_orphelin est le plus bas
        k = candidats[candidats['score_orphelin'] == candidats['score_orphelin'].min()]['k'].values[0]

    except Exception as e:
        k = tab_scores[tab_scores['Score'] == tab_scores['Score'].min()]['k'].values[0]
        print("NOT OK")

    print("K = " + str(k))

    kmeans = KMeans(n_clusters = k, init='k-means++', algorithm='elkan', random_state=0, n_init=1, max_iter=200).fit(X)
    keywords["kmeans"] = list(kmeans.labels_)

    labels = set(kmeans.labels_.tolist())
    desired_labels = {x : None for x in labels} # (on initialise à None)
    for label in labels:
        d = keywords[keywords['kmeans'] == label]['tokens'].tolist()
        new_label = list(set.intersection(*map(set,d)))

        try:
            desired_labels[label] = new_label[0]

        except Exception as e:
            cluster = keywords[keywords["kmeans"] == label]
            max_freq = cluster['TF + DF'].max()
            new_label = cluster[cluster['TF + DF'] == max_freq]['Keyword'].values
            desired_labels[label] = new_label[0]

    keywords['Cluster'] = keywords['kmeans'].map(desired_labels)

    keywords.sort_values(["Cluster"], 
            axis=0,
            ascending=[False], 
            inplace=True)
    
    keywords = keywords[['Cluster', 'Keyword', 'Fréquence (TF)', 'vector']]

    results.loc[((results['algorithme'] == 'K-Means') & \
                (results['K (nb clusters)'] == k) & \
                (results['embedding'] == embed)), 'Score'] = tab_scores['Score'] == tab_scores['Score'].min()
    
    return keywords


In [41]:
keywords

Unnamed: 0,Keyword,Fréquence (TF),Fréquence documentaire (DF),TF + DF,tokens,vector,kmeans,Cluster,gmm,gaac
88,besoin d'assistance immédiate,912,912,1824,"[assistance, besoin, immédiate]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",4,projet,112,1
44,recherche du québec,66,71,137,"[québec, recherche]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4,projet,90,1
69,santé mentale,74,38,112,"[mentale, santé]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",27,projet,13,1
133,projets de recherche,114,80,194,"[projets, recherche]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",13,projet,153,1
54,carrefour de l'innovation,315,282,597,"[carrefour, innovation]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,projet,43,1
...,...,...,...,...,...,...,...,...,...,...
153,immunopathologie professeur,32,65,97,"[immunopathologie, professeur]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,chaires,163,0
140,recherche en santé,166,123,289,"[recherche, santé]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,chaires,94,0
41,chum axe,932,932,1864,[chum],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4,chaires,108,0
5,recherche du chum,1140,1077,2217,"[chum, recherche]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",23,chaires,164,0


*One-Hot embeddings*

In [42]:
km_onehot = test_KMeans('One-Hot')
km_onehot.drop(columns=['vector'])

NOT OK
K = 37


Unnamed: 0,Cluster,Keyword,Fréquence (TF)
151,services du crchum,immunopathologie professeure,33
121,services du crchum,services du crchum,89
27,service d'urgence,service d'urgence,912
37,service d'urgence,chaires de recherche,33
26,service d'urgence,travaux de recherche,49
...,...,...,...
29,assistance immédiate,cellules cancéreuses,59
53,assistance immédiate,chercheur professionnel de la santé,120
22,assistance immédiate,intelligence artificielle,60
11,assistance immédiate,titulaire de la chaire,48


*Sentence transformers embeddings*

In [43]:
km_st = test_KMeans('Sentence transformers')
km_st.drop(columns=['vector'])

NOT OK
K = 37


Unnamed: 0,Cluster,Keyword,Fréquence (TF)
98,type,diabète de type,36
75,système immunitaire,système immunitaire,93
42,système immunitaire,soins de santé,35
13,système immunitaire,directeur scientifique,35
68,système immunitaire,professeure agrégée,62
...,...,...,...
139,assistance immédiate,cancer de la prostate,94
64,assistance immédiate,conférenciers scientifiques,86
138,assistance immédiate,portée stratégique,32
16,assistance immédiate,département de radiologie,101


### *Expectation-Maximization*

In [30]:
from sklearn.mixture import GaussianMixture

def test_EM(embed, keywords=keywords, results=results):
    if embed == 'One-Hot':
        X = keywords["vector"].to_list()
        
    elif embed == 'Sentence transformers':
        X = embeddings_st

    rg = range(round(len(X)/5), len(set(tuple(x) for x in X)))
    scores = [{'algorithme' : 'Expectation-Maximization', 'embedding' : embed, 'k' : x, 'score_intersection' : None, 'score_orphelin': None} for x in rg]

    for k in range(len(scores)):
        scores[k]['score_intersection'] = 0
        scores[k]['score_orphelin'] = 0

        gmm = GaussianMixture(n_components=scores[k]['k'], init_params='k-means++', covariance_type='diag').fit(X) # diag pour gérer MemoryError
        keywords["gmm"] = list(gmm.predict(X))

        labels = gmm.predict(X)

        labels = set(labels)
        for label in labels:
            d = keywords[keywords['gmm'] == label]['tokens'].tolist()
            new_label = list(set.intersection(*map(set,d)))

            # Si on ne trouve pas d'intersection entre les termes d'un même cluster, on ajoute 1 au score intersection ; 
            # au final, on va retenir la plus petite valeur de k pour laquelle le score = 0 (si elle existe) 
            if(len(new_label) == 0):
                scores[k]['score_intersection'] += 1

            ## Si le cluster ne contient qu'un seul terme, on ajoute 1 au score orphelin
            if(len(d) == 1):
                scores[k]['score_orphelin'] += 1

    tab_scores = pd.DataFrame.from_records(scores)
    tab_scores['Score'] = tab_scores['score_intersection'] + tab_scores['score_orphelin']

    tab_scores.sort_values(['score_intersection'])

    try:
        # On retient un sous-ensemble de valeurs pour lesquelles le score intersection est nul (on sait qu'il y a toujours une intersection entre nos clusters)
        candidats =  tab_scores[tab_scores['score_intersection'] == 0]

        # Parmis ces valeurs, on retient celle pour laquelle le score_orphelin est le plus bas
        k = candidats[candidats['score_orphelin'] == candidats['score_orphelin'].min()]['k'].values[0]

    except Exception as e:
        k = tab_scores[tab_scores['Score'] == tab_scores['Score'].min()]['k'].values[0]
        print("NOT OK")

    print("K = " + str(k))


    gmm = GaussianMixture(n_components=k, init_params='k-means++', covariance_type='diag').fit(X) # diag pour gérer MemoryError
    keywords["gmm"] = list(gmm.predict(X))

    labels = set(list(gmm.predict(X)))
    desired_labels = {x : None for x in labels} # (on initialise à None)
    for label in labels:
        d = keywords[keywords['gmm'] == label]['tokens'].tolist()
        new_label = list(set.intersection(*map(set,d)))

        try:
            desired_labels[label] = new_label[0]

        except Exception as e:
            cluster = keywords[keywords["gmm"] == label]
            max_freq = cluster['TF + DF'].max()
            new_label = cluster[cluster['TF + DF'] == max_freq]['Keyword'].values
            desired_labels[label] = new_label[0]

    keywords['Cluster'] = keywords['gmm'].map(desired_labels)

    keywords.sort_values(["Cluster"], 
            axis=0,
            ascending=[False], 
            inplace=True)

    keywords.sort_values('Cluster')

    keywords = keywords[['Cluster', 'Keyword', 'Fréquence (TF)', 'vector']]

    results.loc[((results['algorithme'] == 'Expectation-Maximization') & \
                (results['K (nb clusters)'] == k) & \
                (results['embedding'] == embed)), 'Score'] = tab_scores['Score'] == tab_scores['Score'].min()
    return keywords

In [44]:
results

Unnamed: 0,algorithme,embedding,K (nb clusters),Score
0,K-Means,One-Hot,37,True
1,K-Means,One-Hot,38,
2,K-Means,One-Hot,39,
3,K-Means,One-Hot,40,
4,K-Means,One-Hot,41,
...,...,...,...,...
877,AgglomerativeClustering,Sentence transformers,179,
878,AgglomerativeClustering,Sentence transformers,180,
879,AgglomerativeClustering,Sentence transformers,181,
880,AgglomerativeClustering,Sentence transformers,182,


*One-Hot embedding*

In [31]:
em_onehot = test_EM('One-Hot')
em_onehot.drop(columns=['vector'])

K = 81


Unnamed: 0,Cluster,Keyword,Fréquence (TF)
90,université,université de montréal titulaire,69
109,université,université mcgill,71
98,type,diabète de type,36
136,titulaire,professeure titulaire,71
108,titulaire,professeur titulaire,166
...,...,...,...
88,assistance,besoin d'assistance immédiate,912
38,assistance,assistance immédiate,912
82,anne-marie,anne-marie mes-masson,43
50,alzheimer,maladie d'alzheimer,38


*Sentence transformers embedding*

In [32]:
em_st = test_EM('Sentence transformers')
em_st.drop(columns=['vector'])

K = 183


Unnamed: 0,Cluster,Keyword,Fréquence (TF)
98,type,diabète de type,36
26,travaux,travaux de recherche,49
90,titulaire,université de montréal titulaire,69
113,titulaire,chum professeur titulaire,44
136,titulaire,professeure titulaire,71
...,...,...,...
183,amphithéâtre,amphithéâtre du crchum,48
50,alzheimer,maladie d'alzheimer,38
45,agrégé,professeur agrégé,150
6,adjointe,professeure adjointe,132


### *Agglomerative Clustering*

In [33]:
from sklearn.cluster import AgglomerativeClustering

In [36]:
from sklearn.mixture import GaussianMixture

def test_GAAC(embed, keywords=keywords):
    if embed == 'One-Hot':
        X = keywords["vector"].to_list()
        
    elif embed == 'Sentence transformers':
        X = embeddings_st

    rg = range(round(len(X)/5), len(set(tuple(x) for x in X)))
    scores = [{'algorithme' : 'Agglomerative Clustering', 'embedding' : embed, 'k' : x, 'score_intersection' : None, 'score_orphelin': None} for x in rg]

    for k in range(len(scores)):
        scores[k]['score_intersection'] = 0
        scores[k]['score_orphelin'] = 0

        model = AgglomerativeClustering().fit(X)
        labels = model.labels_
        keywords["gaac"] = labels
        


        labels = set(labels)
        for label in labels:
            d = keywords[keywords['gaac'] == label]['tokens'].tolist()
            new_label = list(set.intersection(*map(set,d)))

            # Si on ne trouve pas d'intersection entre les termes d'un même cluster, on ajoute 1 au score intersection ; 
            # au final, on va retenir la plus petite valeur de k pour laquelle le score = 0 (si elle existe) 
            if(len(new_label) == 0):
                scores[k]['score_intersection'] += 1

            ## Si le cluster ne contient qu'un seul terme, on ajoute 1 au score orphelin
            if(len(d) == 1):
                scores[k]['score_orphelin'] += 1

    tab_scores = pd.DataFrame.from_records(scores)
    tab_scores['Score'] = tab_scores['score_intersection'] + tab_scores['score_orphelin']

    tab_scores.sort_values(['score_intersection'])

    try:
        # On retient un sous-ensemble de valeurs pour lesquelles le score intersection est nul (pour qu'il y ait toujours une intersection entre nos clusters)
        candidats =  tab_scores[tab_scores['score_intersection'] == 0]

        # Parmis ces valeurs, on retient celle pour laquelle le score_orphelin est le plus bas
        k = candidats[candidats['score_orphelin'] == candidats['score_orphelin'].min()]['k'].values[0]

    except Exception as e:
        k = tab_scores[tab_scores['Score'] == tab_scores['Score'].min()]['k'].values[0]
        print("NOT OK")

    print("K = " + str(k))


    model = AgglomerativeClustering().fit(X)
    labels = model.labels_
    keywords["gaac"] = labels
    desired_labels = {x : None for x in labels} # (on initialise à None)
    for label in labels:
        d = keywords[keywords['gmm'] == label]['tokens'].tolist()
        new_label = list(set.intersection(*map(set,d)))

        try:
            desired_labels[label] = new_label[0]

        except Exception as e:
            cluster = keywords[keywords["gaac"] == label]
            max_freq = cluster['TF + DF'].max()
            new_label = cluster[cluster['TF + DF'] == max_freq]['Keyword'].values
            desired_labels[label] = new_label[0]

    keywords['Cluster'] = keywords['gaac'].map(desired_labels)

    keywords.sort_values(["Cluster"], 
            axis=0,
            ascending=[False], 
            inplace=True)

    keywords = keywords[['Cluster', 'Keyword', 'Fréquence (TF)', 'vector']]

    results.loc[((results['algorithme'] == 'AgglomerativeClustering') & \
                (results['K (nb clusters)'] == k) & \
                (results['embedding'] == embed)), 'Score'] = tab_scores['Score'] == tab_scores['Score'].min()
                
    return keywords

*One-Hot embedding*

In [37]:
gaac_onehot = test_GAAC('One-Hot')
gaac_onehot.drop(columns=['vector'])

NOT OK
K = 37


Unnamed: 0,Cluster,Keyword,Fréquence (TF)
162,projet,axe de recherche imagerie,93
163,projet,chaire de recherche du canada,130
40,projet,axe imagerie,37
122,projet,recherche imagerie,106
47,projet,fonds de recherche du québec,61
...,...,...,...
89,chaires,recherche cancer,203
105,chaires,axe de recherche cancer,177
123,chaires,axe cancer,37
139,chaires,cancer de la prostate,94


*Sentence transformers embedding*

In [38]:
gaac_st = test_GAAC('Sentence transformers')
gaac_st.drop(columns=['vector'])

NOT OK
K = 37


Unnamed: 0,Cluster,Keyword,Fréquence (TF)
88,projet,besoin d'assistance immédiate,912
44,projet,recherche du québec,66
69,projet,santé mentale,74
133,projet,projets de recherche,114
54,projet,carrefour de l'innovation,315
...,...,...,...
153,chaires,immunopathologie professeur,32
140,chaires,recherche en santé,166
41,chaires,chum axe,932
5,chaires,recherche du chum,1140


In [39]:
results

Unnamed: 0,algorithme,embedding,K (nb clusters),Score
0,K-Means,One-Hot,37,
1,K-Means,One-Hot,38,
2,K-Means,One-Hot,39,
3,K-Means,One-Hot,40,
4,K-Means,One-Hot,41,
...,...,...,...,...
877,AgglomerativeClustering,Sentence transformers,179,
878,AgglomerativeClustering,Sentence transformers,180,
879,AgglomerativeClustering,Sentence transformers,181,
880,AgglomerativeClustering,Sentence transformers,182,


Reference  
https://colab.research.google.com/drive/1HHNFjKlip1AaFIuvvn0AicWyv6egLOZw?usp=sharing#scrollTo=aNZQMs7xZzgv  
https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)