## **4. Clustering** (sentence_transformers)  

# Semantic Keyword Cluster Tool V1
by [LeeFootSEO](https://twitter.com/LeeFootSEO) February 2022

https://colab.research.google.com/github/searchsolved/search-solved-public-seo/blob/main/search_engine_journal/SEJ_Semantic_Clustering_Tool_by_LeeFootSEO.ipynb#scrollTo=X4p40j7Xng5O  

https://huggingface.co/Sahajtomar/french_semantic

In [173]:
path = '../05-transformation/'
acteur = 'msss'
tag = 'dependances'

if tag:
    csv_file = acteur + '_' + tag + '_weighting_OKapiBM25.csv'

else:
    csv_file = acteur + '_weighting_OKapiBM25.csv'

In [174]:
# %pip install sentence_transformers==2.2.0
# %pip install chardet
# %pip install detect_delimiter

In [175]:
import sys
import time
import pandas as pd
import chardet
import codecs
from detect_delimiter import detect

from sentence_transformers import SentenceTransformer, util

# File Types

*   Script expects a column called 'Keyword'
*   Recommend No More Than 50K Rows
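
The two requirements above can be checked before any embedding work starts. The following is a minimal sketch, assuming the keywords are already in a pandas DataFrame; the helper name `check_keyword_frame` and its `max_rows` parameter are illustrative and not part of the original tool.

```python
import pandas as pd

def check_keyword_frame(frame, max_rows=50_000):
    """Return True if the frame is within the recommended size.

    Raises ValueError when the expected 'Keyword' column is missing.
    """
    if "Keyword" not in frame.columns:
        raise ValueError("expected a column called 'Keyword'")
    return len(frame) <= max_rows

# Toy frame standing in for the real keyword file
toy = pd.DataFrame({"Keyword": ["jours ouvrables", "réinsertion sociale"]})
print(check_keyword_frame(toy))
```

In this notebook the source CSV stores the terms in a `Collocation` column, which is renamed to `Keyword` after loading, so a check like this would run after that rename.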

# Set Cluster Accuracy & Minimum Cluster Size

In [176]:
cluster_accuracy = 25  # 0-100 (100 = very tight clusters, but higher percentage of no_cluster groups)
min_cluster_size = 5  # set the minimum size of cluster groups. (Lower number = tighter groups)
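
The `cluster_accuracy` setting is later divided by 100 and passed to `util.community_detection` as its `threshold`, which is compared against the cosine similarity between keyword embeddings. The sketch below illustrates that mapping with made-up 2-D vectors in place of real embeddings; the helpers `to_threshold` and `cosine_similarity` are hypothetical names introduced only for this example.

```python
import math

def to_threshold(cluster_accuracy):
    """Convert a 0-100 accuracy setting into a 0-1 similarity threshold."""
    if not 0 <= cluster_accuracy <= 100:
        raise ValueError("cluster_accuracy must be between 0 and 100")
    return cluster_accuracy / 100

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

threshold = to_threshold(25)
toy_a = [1.0, 0.0]
toy_b = [1.0, 1.0]  # 45 degrees away from toy_a
toy_c = [0.0, 1.0]  # orthogonal to toy_a

# A pair counts as "similar enough" when its cosine similarity reaches the threshold
print(cosine_similarity(toy_a, toy_b) >= threshold)
print(cosine_similarity(toy_a, toy_c) >= threshold)
```

With a setting of 25, the first pair passes and the orthogonal pair does not; raising the setting narrows what counts as similar.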

# Choose a Sentence Transformer
Download Pre-Trained Models: https://www.sbert.net/docs/pretrained_models.html

In [177]:
transformer = 'all-mpnet-base-v2'  # provides the best quality - not much data here, so the slower model is fine
#transformer = 'all-MiniLM-L6-v2'  # 5 times faster and still offers good quality

In [178]:
with open(path + csv_file, encoding='utf-8') as f:
    df = pd.DataFrame(pd.read_csv(f)['Collocation'])
    df.columns = ['Keyword']

df

Unnamed: 0,Keyword
0,indicateurs de gestion utilisés par le ministè...
1,règlement sur la certification des ressources ...
2,indicateurs de gestion utilisés par le ministère
3,répertoire des indicateurs de gestion en santé
4,indicateurs de gestion en santé et services
...,...
259,jours ouvrables
260,réinsertion sociale
261,ressource certifiée
262,jours approche


In [179]:
count_rows = len(df)
if count_rows > 50_000:
    print("WARNING: You May Experience Crashes When Processing Over 50,000 Keywords at Once. Please consider smaller batches!")
print("Uploaded Keyword CSV File Successfully!")
df

Uploaded Keyword CSV File Successfully!


Unnamed: 0,Keyword
0,indicateurs de gestion utilisés par le ministè...
1,règlement sur la certification des ressources ...
2,indicateurs de gestion utilisés par le ministère
3,répertoire des indicateurs de gestion en santé
4,indicateurs de gestion en santé et services
...,...
259,jours ouvrables
260,réinsertion sociale
261,ressource certifiée
262,jours approche


In [180]:
# store the data
cluster_name_list = []
corpus_sentences_list = []
df_all = []

corpus_set = set(df['Keyword'])
corpus_set_all = corpus_set
cluster = True


# Clustering Keywords - This can take a while!

In [181]:
# keep looping until a pass creates no new clusters

# convert the 0-100 setting to a 0-1 similarity threshold without overwriting it,
# so that re-running this cell does not divide by 100 a second time
threshold = cluster_accuracy / 100
model = SentenceTransformer(transformer)

while cluster:

    corpus_sentences = list(corpus_set)
    check_len = len(corpus_sentences)

    corpus_embeddings = model.encode(corpus_sentences, batch_size=256, show_progress_bar=True, convert_to_tensor=True)
    clusters = util.community_detection(corpus_embeddings, min_community_size=min_cluster_size, threshold=threshold, init_max_size=len(corpus_embeddings))

    # use distinct loop names so the `cluster` flag tested by the while loop is not overwritten
    for cluster_idx, community in enumerate(clusters):
        cluster_label = "Cluster {}, #{} Elements ".format(cluster_idx + 1, len(community))
        print("\n" + cluster_label)

        for sentence_id in community:
            print("\t", corpus_sentences[sentence_id])
            corpus_sentences_list.append(corpus_sentences[sentence_id])
            cluster_name_list.append(cluster_label)

    df_new = pd.DataFrame(None)
    df_new['Cluster Name'] = cluster_name_list
    df_new["Keyword"] = corpus_sentences_list

    df_all.append(df_new)
    have = set(df_new["Keyword"])

    corpus_set = corpus_set_all - have
    remaining = len(corpus_set)
    print("Total Unclustered Keywords: ", remaining)
    if check_len == remaining:
        break

Batches: 100%|██████████| 2/2 [00:05<00:00,  2.74s/it]



Cluster 1, #246 Elements 
	 services sociaux ciusss
	 services sociaux ententes
	 services sociaux
	 services sociaux cisss
	 santé et services sociaux
	 services sociaux ententes de gestion
	 services sociaux pour les personnes en situation
	 réseau de la santé et des services sociaux
	 santé et services
	 gestion en santé et services sociaux
	 accès aux services
	 services sociaux msss
	 service de soutien
	 accès aux services de santé
	 services de santé
	 service en clsc dans les délais
	 service en clsc dans les délais établis
	 établissements du réseau de la santé et des services sociaux
	 ministère de la santé et des services sociaux
	 réseau de la santé et des services
	 gestion en santé et services
	 service en clsc
	 indicateurs de gestion en santé et services
	 ressource publique services
	 services en dépendance
	 établissements du réseau de la santé et des services
	 services sociaux rassemble
	 services de réadaptation
	 ressource publique services de réadaptation
	 ciss

Batches: 100%|██████████| 1/1 [00:00<00:00,  3.08it/s]



Cluster 1, #6 Elements 
	 thérapie individuelle
	 intervention précoce
	 désintoxication en externe réinsertion
	 matière de consommation et de pratique des jha
	 alcool de cannabis
	 toxicomanie pour les problèmes de jeu pathologique
Total Unclustered Keywords:  12


Batches: 100%|██████████| 1/1 [00:00<00:00,  7.06it/s]

Total Unclustered Keywords:  12
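
To make the clustering step less of a black box, here is a simplified, pure-Python sketch of threshold-based community detection. It is an illustration written for this notebook, not the code of `util.community_detection`, whose implementation differs in its details; the function name `toy_community_detection` and the 2-D vectors are assumptions standing in for real sentence embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_community_detection(vectors, threshold, min_size):
    """Greedy threshold-based grouping (illustrative only)."""
    candidates = []
    for centre in vectors:
        # Every vector close enough to this one forms a candidate group
        members = [j for j, other in enumerate(vectors) if cosine(centre, other) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)

    # Prefer the largest candidate groups, then keep only groups
    # that share no member with a group that was already accepted
    candidates.sort(key=len, reverse=True)
    communities, taken = [], set()
    for members in candidates:
        if taken.isdisjoint(members):
            communities.append(sorted(members))
            taken.update(members)
    return communities

toy_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],  # three vectors pointing roughly along x
    [0.0, 1.0], [0.1, 0.9],                # two vectors pointing roughly along y
    [-1.0, -1.0],                          # an outlier
]
result = toy_community_detection(toy_vectors, threshold=0.9, min_size=2)
print(result)  # [[0, 1, 2], [3, 4]]
```

The outlier at index 5 belongs to no group. In the notebook, keywords left over in this way are fed back into the next pass of the `while` loop until a pass produces nothing new.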





In [182]:
# make a new dataframe from the list of dataframes and merge it back into the original df
df_new = pd.concat(df_all)
df = df.merge(df_new.drop_duplicates('Keyword'), how='left', on="Keyword")

In [183]:
# rename the clusters to the shortest keyword in the cluster
# Could we instead rename each cluster after its most frequent keyword?

df['Length'] = df['Keyword'].astype(str).map(len)
df = df.sort_values(by="Length", ascending=True)

df['Cluster Name'] = df.groupby('Cluster Name')['Keyword'].transform('first')
df.sort_values(['Cluster Name', "Keyword"], ascending=[True, True], inplace=True)

df['Cluster Name'] = df['Cluster Name'].fillna("zzz_no_cluster")

del df['Length']
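
The naming cell above raises the question of labelling each cluster with its most frequent keyword rather than its shortest one. One possible approach is sketched below on toy data. It assumes a per-keyword `Frequency` column, which is hypothetical: this notebook only keeps the `Collocation` column from the weighting file, so a frequency or weight column would have to be carried over first.

```python
import pandas as pd

# Toy data; the 'Frequency' column is hypothetical and not produced by this notebook
toy = pd.DataFrame({
    "Cluster Name": ["c1", "c1", "c1", "c2", "c2"],
    "Keyword": ["services sociaux", "santé et services sociaux", "ciss",
                "thérapie individuelle", "intervention précoce"],
    "Frequency": [120, 45, 3, 10, 25],
})

# For each cluster, find the row with the highest frequency and take its keyword
top_rows = toy.loc[toy.groupby("Cluster Name")["Frequency"].idxmax()]
name_map = top_rows.set_index("Cluster Name")["Keyword"]

# Replace the placeholder cluster labels with the chosen keyword
toy["Cluster Name"] = toy["Cluster Name"].map(name_map)
print(toy[["Cluster Name", "Keyword"]])
```

Here `c1` becomes "services sociaux" and `c2` becomes "intervention précoce", whereas the shortest-keyword rule would have named `c1` after "ciss".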

In [184]:
# move the cluster and keyword columns to the front
col = df.pop("Keyword")
df.insert(0, col.name, col)

col = df.pop('Cluster Name')
df.insert(0, col.name, col)

df.sort_values(["Cluster Name", "Keyword"], ascending=[True, True], inplace=True)

In [185]:
uncluster_percent = (remaining / count_rows) * 100
clustered_percent = 100 - uncluster_percent
print(clustered_percent,"% of rows clustered successfully!")

95.45454545454545 % of rows clustered successfully!
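
The percentage above divides `remaining`, a count of unique unclustered keywords, by `count_rows`, a count of rows, so the two may not line up exactly if the input contains duplicate keywords. As a cross-check, the share can also be computed directly from the final labels. This is a minimal sketch on toy data, assuming the `zzz_no_cluster` placeholder assigned earlier:

```python
import pandas as pd

# Toy frame shaped like the notebook's final output
toy = pd.DataFrame({
    "Cluster Name": ["services sociaux", "services sociaux",
                     "zzz_no_cluster", "thérapie individuelle"],
    "Keyword": ["services sociaux", "services de santé",
                "jours approche", "thérapie individuelle"],
})

# The mean of a boolean column is the fraction of True values
clustered_share = (toy["Cluster Name"] != "zzz_no_cluster").mean() * 100
print(f"{clustered_share:.1f}% of rows clustered")
```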


In [186]:
if tag:
    file_output = acteur + '_' + tag + '_clusters.csv'
else:
    file_output = acteur + '_clusters.csv'
df.to_csv('../06-clustering/' + file_output, index=False)