# Topic modeling : LDA

In [1]:
from jyquickhelper import add_notebook_menu
add_notebook_menu()

## Dataset : Grand Débat National (Great national debate)

The aim of this exercise is to become familiar with text mining and with topic models such as LDA. Topic modeling is particularly useful for open-ended survey questions, where it lets us explore the range of topics that respondents raise. We will use the French "Grand Débat National" dataset, which contains the full set of responses to the [Grand Débat National](https://granddebat.fr/), the public debate organized by President Macron. The purpose of the debate was to better understand the needs and opinions of the French people following the Yellow Vests protests. The results are available as [open data](https://granddebat.fr/pages/donnees-ouvertes).

## 1. Import data

**Question 1 :** Download one of the ecological transition csv files and load the content into a pandas dataframe. Name this variable `raw_data`

In [2]:
import pandas as pd
import os

# Load the CSV file into a DataFrame
file_path = 'REPONSE_ECOLOGIE.csv'  
raw_data = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(raw_data.head(10))




  reference                                              title  \
0       2-4                              transition écologique   
1       2-5                                   La surpopulation   
2       2-6                                             climat   
3       2-7                                  POLLUTION AIR EAU   
4       2-8                               Economie vs Ecologie   
5       2-9                 égalité territoriale de traitement   
6      2-11  Nous sommes les gardiens de la terre et des pa...   
7      2-12                            Pollution de la planete   
8      2-13  imposer une écotaxe aux compagnies maritimes e...   
9      2-14  Je ne souhaite pas répondre au questionnaire d...   

                                            authorId           authorType  \
0  VXNlcjoxMTQwMTc0YS0xZTFmLTExZTktOTRkMi1mYTE2M2...  Citoyen / Citoyenne   
1  VXNlcjpjOWYxZWQ1NS0xYzEwLTExZTktOTRkMi1mYTE2M2...  Citoyen / Citoyenne   
2  VXNlcjozZjlhNzAwOS0xYTc2LTExZTktOTRkMi1



We will focus on the last question: ``Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?`` ("Are there other points about the ecological transition on which you would like to comment?"). We hope that the LDA model will help us identify the topics on which the responses focus. Let's take a look at the data:

In [3]:

# List all column names to find the exact name of the question
print(raw_data.columns)

Index(['reference', 'title', 'authorId', 'authorType', 'authorZipCode',
       'Quel est aujourd'hui pour vous le problème concret le plus important dans le domaine de l'environnement ?',
       'Que faudrait-il faire selon vous pour apporter des réponses à ce problème ?',
       ' Par rapport à votre mode de chauffage actuel, pensez-vous qu'il existe des solutions alternatives plus écologiques ?',
       'Si oui, que faudrait-il faire pour vous convaincre ou vous aider à changer de mode de chauffage ?',
       'Avez-vous pour vos déplacements quotidiens la possibilité de recourir à des solutions de mobilité alternatives à la voiture individuelle comme les transports en commun, le covoiturage, l'auto-partage, le transport à la demande, le vélo, etc. ?',
       'Si oui, que faudrait-il faire pour vous convaincre ou vous aider à utiliser ces solutions alternatives ?',
       'Si non, quelles sont les solutions de mobilité alternatives que vous souhaiteriez pouvoir utiliser ?',
       'Et

In [4]:
for i, col in enumerate(raw_data.columns):
    print(f"{i}: {col[:50]}")


0: reference
1: title
2: authorId
3: authorType
4: authorZipCode
5: Quel est aujourd'hui pour vous le problème concret
6: Que faudrait-il faire selon vous pour apporter des
7:  Par rapport à votre mode de chauffage actuel, pen
8: Si oui, que faudrait-il faire pour vous convaincre
9: Avez-vous pour vos déplacements quotidiens la poss
10: Si oui, que faudrait-il faire pour vous convaincre
11: Si non, quelles sont les solutions de mobilité alt
12: Et qui doit selon vous se charger de vous proposer
13: Que pourrait faire la France pour faire partager s
14: Y a-t-il d'autres points sur la transition écologi
15: QUXVlc3Rpb246MTU0 - Avez-vous pour vos déplacement
16: QUXVlc3Rpb246MTU1 - Si oui, que faudrait-il faire 
17: QUXVlc3Rpb246MjA3 - Si non, quelles sont les solut
18: QUXVlc3Rpb246MTU3 - Et qui doit selon vous se char
19: QUXVlc3Rpb246MTU4 - Que pourrait faire la France p
20: QUXVlc3Rpb246MTU5 - Y a-t-il d'autres points sur l


In [7]:
# Define the question of interest
question = "Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?"

# Display the first 10 responses to the specified question
responses = raw_data[question].head(10)
print(responses)


0                                              Inconnu
1                                              Inconnu
2                                              Inconnu
3                                              Inconnu
4    une aide significative pour de l'éolien ou du ...
5    rien il n'existe pas de solution fiable actuel...
6    Primes et visites à domicile pour étude de la ...
7     Aide incitative simple sans condition de revenus
8    j'ai déjà réduit ma conso electrique en instal...
9                                              Inconnu
Name: Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?, dtype: object


As the preview shows, many answers are missing: as with any open-ended question, respondents choose whether or not to write a comment. Here the missing answers appear as the placeholder `Inconnu` ("unknown"). A cleanup step is necessary.
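To gauge how much cleanup is needed, one can count the placeholder and empty answers before filtering them out. The sketch below uses a small hypothetical DataFrame (`toy`) with a shortened column name (`answer`) in place of the real file. It assumes, as the preview suggests, that unanswered questions show up as the string `Inconnu` rather than as `NaN`.

```python
import pandas as pd

# Hypothetical stand-in for raw_data, with a shortened column name.
toy = pd.DataFrame({
    "answer": ["Inconnu", None, "Aide incitative simple", "  ", "isoler les logements"],
})

# Treat NaN, blank strings and the "Inconnu" placeholder as missing.
cleaned = toy["answer"].fillna("").str.strip()
is_missing = cleaned.eq("") | cleaned.str.lower().eq("inconnu")

print("missing answers:", int(is_missing.sum()))

# Keep only the rows that carry an actual answer.
kept = toy.loc[~is_missing].reset_index(drop=True)
print(kept["answer"].tolist())
```

Applied to the real data, the same mask would use the full question text as the column name.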

## 2. Clean and vectorize documents

Before training our LDA model, we need to tokenize our text. We will use the [spaCy](https://spacy.io/) library, since we only need some basic preprocessing. We load the small French pipeline `fr_core_news_sm`, which provides the tokenizer, the stop-word list and the lemmatizer used below.

In [23]:
import spacy
# Load the French language model
nlp = spacy.load('fr_core_news_sm')





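Let's remove all the rows from the dataframe that don't have an answer for our question (the `NaN` values). This new dataframe will be called ``texts``. Note that `dropna` only removes true `NaN` values: rows that hold the `Inconnu` placeholder are strings and are therefore kept.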

In [24]:
# Remove rows that don't have an answer for the specified question
texts = raw_data.dropna(subset=[question])

# Display the new DataFrame
print(texts.head())
print(f"Number of rows in the new DataFrame: {texts.shape[0]}")

  reference                  title  \
0       2-4  transition écologique   
1       2-5       La surpopulation   
2       2-6                 climat   
3       2-7      POLLUTION AIR EAU   
4       2-8   Economie vs Ecologie   

                                            authorId           authorType  \
0  VXNlcjoxMTQwMTc0YS0xZTFmLTExZTktOTRkMi1mYTE2M2...  Citoyen / Citoyenne   
1  VXNlcjpjOWYxZWQ1NS0xYzEwLTExZTktOTRkMi1mYTE2M2...  Citoyen / Citoyenne   
2  VXNlcjozZjlhNzAwOS0xYTc2LTExZTktOTRkMi1mYTE2M2...  Citoyen / Citoyenne   
3  VXNlcjozOWQwNzJjNC0xZDEwLTExZTktOTRkMi1mYTE2M2...  Citoyen / Citoyenne   
4  VXNlcjo3M2YxN2NlZS0xZDRiLTExZTktOTRkMi1mYTE2M2...  Citoyen / Citoyenne   

  authorZipCode  \
0       97231.0   
1       57000.0   
2       34140.0   
3       17400.0   
4       35430.0   

  Quel est aujourd'hui pour vous le problème concret le plus important dans le domaine de l'environnement ?  \
0                                            Inconnu                              

First preprocessing pass with spaCy:

In [25]:
# Extract the relevant text column into a list for processing
text_data = texts[question].tolist()

In [26]:
spacy_docs = list(nlp.pipe(text_data))

We now have a list of spaCy documents. We will transform each spaCy document into a list of tokens, working with lemmas rather than the original word forms. This allows our model to generalize better.

Here is the full list of preprocessing steps:

- remove all **words of 3 characters or fewer**,
- remove all **stop words**,
- **lemmatize** the remaining words, and
- convert them to **lowercase**.
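The filtering rule can be checked in isolation, without loading a spaCy model. In the sketch below, `Tok` is a hypothetical stand-in that carries only the three token attributes the rule needs (`text`, `lemma_`, `is_stop`), and the sample tokens and their lemmas are written by hand for illustration.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for the three spaCy token attributes used here."""
    text: str
    lemma_: str
    is_stop: bool


def preprocess(tokens):
    """Keep lowercased lemmas of tokens longer than 3 characters that are not stop words."""
    return [t.lemma_.lower() for t in tokens if len(t.text) > 3 and not t.is_stop]


sample = [
    Tok("Les", "le", True),          # stop word: dropped
    Tok("aides", "aide", False),     # kept, as its lemma
    Tok("pour", "pour", True),       # stop word: dropped
    Tok("gaz", "gaz", False),        # exactly 3 characters: dropped
    Tok("Isoler", "Isoler", False),  # kept, lowercased
]
print(preprocess(sample))  # ['aide', 'isoler']
```

The `gaz` case shows the boundary: with `len(token.text) > 3`, three-letter words are removed as well.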

In [29]:
# Extract lemmas for tokens longer than 3 characters and not stop words
docs = [[token.lemma_.lower() for token in doc if len(token.text) > 3 and not token.is_stop] for doc in spacy_docs]

# Print the first three processed documents to verify
print(docs[:3])


[['inconnu'], ['inconnu'], ['inconnu']]


To preserve some word order in our modeling, we will take frequent bigrams into account. For this, we will use the [Gensim](https://radimrehurek.com/gensim/) library, an excellent NLP library for topic modeling.

Here is the chosen process: 

- We first identify the frequent bigrams in the corpus, 
- then we add them to the list of tokens for the documents in which they appear. This means that the bigrams will not be in their correct position in the text, but this is not a problem: topic models are bag-of-words models that ignore the position of words anyway.
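The two steps above can be sketched without Gensim. This is a simplified illustration, not what `Phrases` does internally: it keeps every adjacent pair that reaches a raw count threshold, whereas Gensim also applies a scoring function. The function name `add_frequent_bigrams` and the toy documents are hypothetical.

```python
from collections import Counter


def add_frequent_bigrams(docs, min_count=2):
    """Append 'w1_w2' to a document for each adjacent pair seen at least min_count times in the corpus."""
    # Step 1: count adjacent token pairs over the whole corpus.
    counts = Counter((a, b) for doc in docs for a, b in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    # Step 2: append the frequent bigrams to the documents in which they occur.
    augmented = []
    for doc in docs:
        extra = [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if (a, b) in frequent]
        augmented.append(doc + extra)
    return augmented


toy_docs = [
    ["aide", "financier", "isolation"],
    ["aide", "financier", "chauffage"],
    ["mode", "chauffage"],
]
augmented = add_frequent_bigrams(toy_docs, min_count=2)
print(augmented[0])  # ['aide', 'financier', 'isolation', 'aide_financier']
```

The bigram lands at the end of the token list rather than at its original position, which is harmless for a bag-of-words model.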

In [30]:
from gensim.models import Phrases

bigram = Phrases(docs, min_count=10)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token: 
            docs[idx].append(token)

Let's move on to the last, Gensim-specific preprocessing steps. First, we create a dictionary representation of the documents. This dictionary maps each word to a unique identifier and lets us build a bag-of-words representation of each document: the identifiers of the words it contains, together with their frequencies. We can also remove the rarest and the most common words from the vocabulary, which improves the quality of the model and speeds up training. The minimum frequency (`no_below`) is an absolute number of documents; the maximum frequency (`no_above`) is the proportion of documents in which a word may appear.
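To make the mechanics concrete, here is a library-free sketch of a vocabulary filtered by document frequency and of the resulting bag-of-words vectors. The helper names (`build_vocab`, `doc2bow`) and the toy corpus are hypothetical, and Gensim's own `Dictionary` assigns identifiers and rounds thresholds in its own way, so this only illustrates the idea.

```python
from collections import Counter


def build_vocab(docs, no_below=2, no_above=0.5):
    """Map each kept token to an integer id, filtering by document frequency."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: i for i, tok in enumerate(kept)}


def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [
    ["inconnu"],
    ["inconnu"],
    ["inconnu"],
    ["aide", "chauffage", "aide"],
    ["aide", "isolation"],
    ["chauffage", "bois"],
]
token2id = build_vocab(toy_docs, no_below=2, no_above=0.4)
print(token2id)                        # {'aide': 0, 'chauffage': 1}
print(doc2bow(toy_docs[3], token2id))  # [(0, 2), (1, 1)]
print(doc2bow(toy_docs[0], token2id))  # []
```

In this toy corpus `inconnu` appears in half of the documents and is filtered out as too common, so the documents that contain only that token end up with an empty representation.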

In [31]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
print('Number of unique words in original documents :', len(dictionary))

dictionary.filter_extremes(no_below=3, no_above=0.25)
print('Number of unique words after removing rare and common words :', len(dictionary))

print("Example representation of document 3 :", dictionary.doc2bow(docs[2]))

Number of unique words in original documents : 21800
Number of unique words after removing rare and common words : 8560
Example representation of document 3 : []


Next, we create the bag-of-words representation of each document in the corpus (see the [doc2bow](https://radimrehurek.com/gensim/corpora/dictionary.html) method):

In [32]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

## 3. Topic Modeling with LDA

Now it's time to train our LDA! To do this, we use the following parameters: 

- **corpus**: the bag-of-words representations of our documents
- **id2word**: the mapping from word identifiers to words (our `dictionary`)
- **num_topics**: the number of topics the model should identify (let's set it to <font color = "red"><b>10</b></font>)
- **chunksize**: the number of documents the model sees on each update (let's set it to <font color = "red"><b>1,000</b></font>)
- **passes**: the number of times the model sees the whole corpus during training (let's set it to <font color = "red"><b>5</b></font>)
- **random_state**: a seed to ensure reproducibility (let's set it to <font color = "red"><b>1</b></font>)

On a corpus of this size, training usually takes one or two minutes.

**Question 2 :** Train the LDA model with the parameters listed above.

In [33]:

from gensim.models import LdaModel


# Set LDA model parameters
num_topics = 10
chunksize = 1000
passes = 5
random_state = 1

# Train the LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    chunksize=chunksize,
    passes=passes,
    random_state=random_state
)

# Print every topic (-1 means all of them) with its top words
for idx, topic in lda_model.print_topics(-1):
    print(f'Topic: {idx}\nWords: {topic}\n')


Topic: 0
Words: 0.114*"chauffage" + 0.043*"mode" + 0.038*"changer" + 0.037*"mode_chauffage" + 0.023*"écologique" + 0.019*"électrique" + 0.018*"falloir" + 0.017*"chaudière" + 0.014*"système" + 0.013*"place"

Topic: 1
Words: 0.072*"isolation" + 0.045*"logement" + 0.041*"maison" + 0.028*"falloir" + 0.026*"isoler" + 0.024*"propriétaire" + 0.018*"thermique" + 0.018*"voir" + 0.016*"énergétique" + 0.016*"mieux"

Topic: 2
Words: 0.065*"bois" + 0.064*"chauffage" + 0.048*"collectif" + 0.039*"installer" + 0.035*"immeuble" + 0.029*"géothermie" + 0.025*"chauffe" + 0.024*"copropriété" + 0.021*"poêle" + 0.020*"toit"

Topic: 3
Words: 0.203*"aide" + 0.136*"financier" + 0.090*"aide_financier" + 0.021*"financement" + 0.020*"incitation" + 0.017*"prêt" + 0.016*"investissement" + 0.015*"taux" + 0.014*"important" + 0.014*"condensation"

Topic: 4
Words: 0.061*"solution" + 0.037*"information" + 0.023*"proposer" + 0.021*"technique" + 0.020*"exister" + 0.019*"entreprise" + 0.018*"meilleur" + 0.017*"bon" + 0.017*

## 4. Results and visualization

**Question 3 :** Let's see what the model has learned. To do this, let's display the ten most characteristic words for each topic.

In [35]:
for (topic, words) in lda_model.print_topics():
    print("***********")
    print("* topic", topic+1, "*")
    print("***********")
    print(topic+1, ":", words)
    print()

***********
* topic 1 *
***********
1 : 0.114*"chauffage" + 0.043*"mode" + 0.038*"changer" + 0.037*"mode_chauffage" + 0.023*"écologique" + 0.019*"électrique" + 0.018*"falloir" + 0.017*"chaudière" + 0.014*"système" + 0.013*"place"

***********
* topic 2 *
***********
2 : 0.072*"isolation" + 0.045*"logement" + 0.041*"maison" + 0.028*"falloir" + 0.026*"isoler" + 0.024*"propriétaire" + 0.018*"thermique" + 0.018*"voir" + 0.016*"énergétique" + 0.016*"mieux"

***********
* topic 3 *
***********
3 : 0.065*"bois" + 0.064*"chauffage" + 0.048*"collectif" + 0.039*"installer" + 0.035*"immeuble" + 0.029*"géothermie" + 0.025*"chauffe" + 0.024*"copropriété" + 0.021*"poêle" + 0.020*"toit"

***********
* topic 4 *
***********
4 : 0.203*"aide" + 0.136*"financier" + 0.090*"aide_financier" + 0.021*"financement" + 0.020*"incitation" + 0.017*"prêt" + 0.016*"investissement" + 0.015*"taux" + 0.014*"important" + 0.014*"condensation"

***********
* topic 5 *
***********
5 : 0.061*"solution" + 0.037*"information"
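The strings printed above are convenient to read but awkward to process further. As a sketch, a small regular expression can turn one of them into `(word, weight)` pairs; `parse_topic` is a hypothetical helper, and the example string is copied from the start of topic 4 above. Gensim's `LdaModel.show_topic` returns such pairs directly, so parsing is mainly useful when only the printed text is available.

```python
import re

# One term looks like 0.203*"aide"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn '0.114*"chauffage" + 0.043*"mode"' into [('chauffage', 0.114), ('mode', 0.043)]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


example = '0.203*"aide" + 0.136*"financier" + 0.090*"aide_financier"'
pairs = parse_topic(example)
print(pairs)  # [('aide', 0.203), ('financier', 0.136), ('aide_financier', 0.09)]
```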

Another way to inspect topics is to **visualize** them, for example with the [pyLDAvis](https://github.com/bmabey/pyLDAvis) library. pyLDAvis shows how prevalent each topic is in the corpus, how similar the topics are to one another, and which words matter most for each topic. It is important to pass ``sort_topics=False`` in the call to pyLDAvis; otherwise the topics are ordered differently than in Gensim. The visualization may take a few minutes to load.

**Question 5 :** Visualize the topics with pyLDAvis.

In [38]:
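import warnings

import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # pyLDAvis >= 3.0
except ImportError:
    import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases

pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning)

gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)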

Finally, let's look at the topics that the model recognizes in some of the individual documents. Here we see how LDA tends to assign a high probability to a small number of topics for each document, making its results highly interpretable.

In [39]:
# Display only the first 8 documents.
# Iterate over the response strings (text_data), not over the DataFrame,
# whose iteration would yield column names instead of answers.
for i, (text, doc) in enumerate(zip(text_data[:8], docs[:8]), start=1):
    print("***********")
    print("* doc", i, "  *")
    print("***********")
    print(text)
    # Keep only the topics whose probability exceeds 0.1
    print([(topic+1, prob) for (topic, prob) in lda_model[dictionary.doc2bow(doc)] if prob > 0.1])
    print()

***********
* doc 1   *
***********
Inconnu
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 2   *
***********
Inconnu
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 3   *
***********
Inconnu
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 4   *
***********
Inconnu
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 5   *
***********
une aide significative pour de l'éolien ou du ...
[(3, 0.13749978), (4, 0.2625059), (6, 0.26250297), (7, 0.2624795)]

***********
* doc 6   *
***********
rien il n'existe pas de solution fiable actuel...
[(5, 0.2853956), (7, 0.110025786), (9, 0.53455114)]

***********
* doc 7   *
***********
Primes et visites à domicile pour étude de la ...
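The flat `0.1` distributions in the first four outputs are what one would expect for documents that contain no in-vocabulary tokens (document 3 was shown earlier to have an empty bag-of-words). With nothing observed, the inferred distribution falls back to the normalized prior. The sketch below assumes Gensim's default symmetric prior of `1 / num_topics` per topic; `empty_doc_topics` and `find_empty` are hypothetical helpers, and `toy_corpus` is made up.

```python
def empty_doc_topics(num_topics, alpha=None):
    """Topic distribution implied by the prior alone, when no words are observed."""
    if alpha is None:
        # Assumed default: symmetric prior, 1 / num_topics for every topic.
        alpha = [1.0 / num_topics] * num_topics
    total = sum(alpha)
    return [a / total for a in alpha]


def find_empty(corpus):
    """Indexes of the documents whose bag-of-words is empty."""
    return [i for i, bow in enumerate(corpus) if not bow]


print(empty_doc_topics(10))

toy_corpus = [[], [], [(0, 2), (3, 1)], []]
print(find_empty(toy_corpus))  # [0, 1, 3]
```

Running `find_empty` on the real `corpus` would show how many documents carry no usable signal and could be dropped before training.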

Many collections of unstructured text come without labels. Topic models such as LDA are a useful technique for discovering the most important topics in such documents. **Gensim** makes it easy to learn these topics, and **pyLDAvis** presents the results in a visually appealing way. Together, they form a powerful toolkit for understanding what is inside large document sets and for exploring subsets of related texts. While the results are often quite revealing on their own, they can also serve as a starting point, for example for a labeling exercise for supervised text classification. In sum, topic models belong in every data scientist's toolbox as a very quick way to gain insight into large document collections.