---------------------------------------------------------------
## VI. USING BERTOPIC TO ASSIGN A TOPIC TO EACH ROW
--------------------------------------------------------------

To obtain thematically coherent clusters, we use the BERTopic library as a first unsupervised classification method.

In [15]:
import pandas as pd

from nltk.tokenize import sent_tokenize, word_tokenize
from sentence_transformers import SentenceTransformer

from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

from scipy.cluster import hierarchy as sch

In [11]:
# Load the dataset
df = pd.read_csv('../data/processed/processed_data.csv')

---
### 1. Preparing BERTopic
---

In [12]:
# Retrieve all the texts
texts = df['text']

In [13]:
# Split each text into sentences, then flatten the result into a single list
sentences = [sent_tokenize(text) for text in texts]
sentences = [sentence for doc in sentences for sentence in doc]
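
The two list comprehensions above first split every document into sentences and then flatten the nested result into one list. Below is a minimal sketch of that pattern on toy data. The regex-based `split_sentences` helper is a hypothetical stand-in for NLTK's `sent_tokenize`, used only so the example runs without tokenizer data; it is much cruder than the real tokenizer.

```python
import re
from itertools import chain


def split_sentences(text):
    """Hypothetical, naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Toy documents (not taken from the dataset)
docs = ["Storm warning issued. Stay indoors!", "Roads are closed."]

# One list of sentences per document
nested = [split_sentences(doc) for doc in docs]

# Flatten into a single list of sentences, as in the cell above
flat = list(chain.from_iterable(nested))
```

`chain.from_iterable` does the same job as the nested comprehension; which one to use is a matter of taste.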

---
### 2. Using BERTopic
---

We chose the "all-mpnet-base-v2" model, which performs well on English text: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

In [14]:
# Step 1 - Extract the embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(texts, show_progress_bar=True)




We do not show every configuration tried during the project; a single iteration is presented here, with the parameters given below.

In [19]:
# UMAP and HDBSCAN parameters, in order: n_neighbors, n_components, min_dist, min_cluster_size
params = [350,350,0.0,175]
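
A positional list works, but it is easy to mix up which index holds which setting. As a sketch of an alternative, the values can be given names. `ClusterParams` and its `suffix` method are hypothetical and not part of the original notebook; the field names simply mirror the UMAP and HDBSCAN arguments the values are passed to.

```python
from typing import NamedTuple


class ClusterParams(NamedTuple):
    """Hypothetical container; field names mirror the UMAP/HDBSCAN arguments."""

    n_neighbors: int
    n_components: int
    min_dist: float
    min_cluster_size: int

    def suffix(self) -> str:
        # Join the values with underscores, e.g. for use in an output file name
        return "_".join(str(value) for value in self)


named_params = ClusterParams(350, 350, 0.0, 175)
```

Because a `NamedTuple` is still a tuple, `named_params[0]` and `named_params.n_neighbors` return the same value, so code indexing by position keeps working.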

In [20]:
# Step 2 - Dimensionality reduction
umap_model = UMAP(n_neighbors=params[0], n_components=params[1], min_dist=params[2], metric='cosine')

# Step 3 - Cluster the reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=params[3], metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize the topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create the topic representations
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune the topic representations with a `bertopic.representation` model
representation_model = KeyBERTInspired()

# Assemble all the steps into a single model
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
)

posts = df['text'].values

topics, probs = topic_model.fit_transform(posts)
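
Step 5 scores the words of each cluster with a class-based TF-IDF. As a rough illustration of the idea, the sketch below applies the weighting described in the BERTopic documentation (term frequency in the topic multiplied by log(1 + A / frequency across all topics), with A the average number of words per topic) to hypothetical toy data. It is a simplified sketch, not the library's implementation, which differs in details such as normalisation.

```python
import math
from collections import Counter

# Toy clusters: each topic is the concatenation of its documents (hypothetical data)
topic_docs = {
    0: "flood flood storm rescue",
    1: "senate vote senate storm",
}

# Term frequencies per topic, and across all topics
term_counts = {topic: Counter(text.split()) for topic, text in topic_docs.items()}
total_counts = Counter()
for counts in term_counts.values():
    total_counts.update(counts)

# A = average number of words per topic
avg_words = sum(total_counts.values()) / len(term_counts)


def ctfidf(topic):
    """Score = term frequency in the topic * log(1 + A / frequency across all topics)."""
    return {
        term: tf * math.log(1 + avg_words / total_counts[term])
        for term, tf in term_counts[topic].items()
    }


scores = {topic: ctfidf(topic) for topic in term_counts}

# Highest-scoring word of each topic
top_terms = {topic: max(s, key=s.get) for topic, s in scores.items()}
```

Words shared by several topics (here "storm") are down-weighted, while words concentrated in one topic rise to the top of its representation.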

In [21]:
# Visualize the topics
topic_model.visualize_topics()

In [22]:
# Build the agglomerative hierarchical clustering of the topics
linkage_function = lambda x: sch.linkage(x, 'single', optimal_ordering=True)
hierarchical_topics = topic_model.hierarchical_topics(texts, linkage_function=linkage_function)

topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
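
The `'single'` method passed to `sch.linkage` merges, at each step, the two clusters whose closest members are nearest to each other. The pure-Python sketch below illustrates that rule on hypothetical one-dimensional points. It is a naive illustration of single linkage, not what SciPy runs internally, and the `single_linkage` helper is introduced here only for the example.

```python
from itertools import combinations


def single_linkage(points):
    """Naive agglomerative clustering on 1-D points; returns the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = smallest distance between any two members
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                abs(x - y) for x in clusters[pair[0]] for y in clusters[pair[1]]
            ),
        )
        distance = min(abs(x - y) for x in clusters[a] for y in clusters[b])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), distance))
        # Replace the two merged clusters by their union
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges


merges = single_linkage([0.0, 1.0, 5.0, 5.5])
```

The merge distances grow as more distant groups are joined, which is what the heights in a dendrogram represent.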



In [23]:
# Extract the document information and display it
extract = topic_model.get_document_info(posts)
extract

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,Dank is it a tornado n Raleigh car blowincg n ...,0,0_hurricane_hurricanes_fema_irma,"[hurricane, hurricanes, fema, irma, evacuees, ...","[Nice YouTube video, but what's your point? We...",hurricane - hurricanes - fema - irma - evacuee...,1.000000,False
1,@smoak_queen 'I'm going to be in so much troub...,0,0_hurricane_hurricanes_fema_irma,"[hurricane, hurricanes, fema, irma, evacuees, ...","[Nice YouTube video, but what's your point? We...",hurricane - hurricanes - fema - irma - evacuee...,1.000000,False
2,@CSAresu American Tragedy http://t.co/SDmrzG...,0,0_hurricane_hurricanes_fema_irma,"[hurricane, hurricanes, fema, irma, evacuees, ...","[Nice YouTube video, but what's your point? We...",hurricane - hurricanes - fema - irma - evacuee...,1.000000,False
3,How to Survive a Dust Storm http://t.co/0yL3yT...,0,0_hurricane_hurricanes_fema_irma,"[hurricane, hurricanes, fema, irma, evacuees, ...","[Nice YouTube video, but what's your point? We...",hurricane - hurricanes - fema - irma - evacuee...,0.980731,False
4,I SCREAMED 'WHATS A CHONCe' http://t.co/GXYivs...,0,0_hurricane_hurricanes_fema_irma,"[hurricane, hurricanes, fema, irma, evacuees, ...","[Nice YouTube video, but what's your point? We...",hurricane - hurricanes - fema - irma - evacuee...,1.000000,False
...,...,...,...,...,...,...,...,...
65197,Bigoted North Carolina Governor Pat McCrory is...,10,10_legislation_transgender_discrimination_abor...,"[legislation, transgender, discrimination, abo...",[(Reuters) - A Mississippi measure that would ...,legislation - transgender - discrimination - a...,0.873265,False
65198,WASHINGTON (Reuters) - Individuals who steal c...,-1,-1_trump_obama_presidential_president,"[trump, obama, presidential, president, donald...","[Donald Trump actually won, despite the popula...",trump - obama - presidential - president - don...,0.000000,False
65199,WASHINGTON (Reuters) - Republicans in the U.S....,6,6_obamacare_senate_repeal_congress,"[obamacare, senate, repeal, congress, medicaid...",[WASHINGTON (Reuters) - U.S. Republican senato...,obamacare - senate - repeal - congress - medic...,0.676768,False
65200,People are scratching their heads at an event ...,-1,-1_trump_obama_presidential_president,"[trump, obama, presidential, president, donald...","[Donald Trump actually won, despite the popula...",trump - obama - presidential - president - don...,0.000000,False
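
In the table above, documents assigned to topic `-1` (with probability 0.0) are the ones the clustering step left as outliers. A quick check worth running on such an output is how large that outlier group is compared with the real topics. The sketch below does this on a short hypothetical list of topic ids, not on the actual results.

```python
from collections import Counter

# Hypothetical topic assignments, one per document (-1 marks outliers)
sample_topics = [0, 0, 6, -1, 10, -1, 0, 6]

counts = Counter(sample_topics)

# Share of documents that were not assigned to any topic
outlier_share = counts[-1] / len(sample_topics)

# Largest topic, ignoring the outlier group
largest_topic, largest_size = max(
    ((topic, n) for topic, n in counts.items() if topic != -1),
    key=lambda item: item[1],
)
```

On the real data, the same counts could be computed from the `Topic` column of the extracted table.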


In [25]:
# Save the dataset, tagged with its parameters, for later use
paramss = '_'.join(str(p) for p in params)

# Save as CSV
extract.to_csv(f'../data/processed/bertopics{paramss}.csv', index=False)

# Save as Excel
extract.to_excel(f'../data/processed/bertopics{paramss}.xlsx', index=False)