# Topic Modeling
In this Python notebook, we perform topic modeling with Python packages available in this research field.

## Required Software
* Java (JDK >= 16): [Java SE Development Kit](https://www.oracle.com/java/technologies/downloads/)
* Apache Ant (version >= 1.10.10): [a build tool with built-in tasks to compile, assemble, test and run Java applications](https://ant.apache.org/bindownload.cgi)
* MALLET 2.0.8: [MAchine Learning for LanguagE Toolkit](https://mallet.cs.umass.edu/download.php)

## Required Python packages
For this notebook, required Python packages are:
* `nltk`: the [Natural Language Toolkit](https://www.nltk.org/)
* `gensim/3.8.3`: the ["*fastest library for training of vector embeddings*"](https://radimrehurek.com/gensim_3.8.3/)
* `spacy`: [Industrial-Strength Natural Language Processing](https://spacy.io/)
* `pyLDAvis/3.3.1`: [Python library for interactive topic model visualization](https://pypi.org/project/pyLDAvis/)

In [None]:
# In case you need to install the above packages in your environment:
# !pip install matplotlib numpy pandas click==7.1.2
# !pip install nltk gensim==3.8.3 spacy pyLDAvis==3.3.1

In [None]:
# Also required - the spaCy English pipeline (en_core_web_sm)
# !python -m spacy download en_core_web_sm

In [None]:
# Standard and scientific packages
import os
import re
import numpy as np
import pandas as pd
from pprint import pprint
from pathlib import Path
import json

# NLTK - Natural Language Toolkit
import nltk
nltk.download('stopwords')  # Only required on the first execution

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, LdaModel, LdaMulticore

# spaCy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  # don't skip this
import matplotlib.pyplot as plt

# Enable logging for gensim - optional
import logging
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Loading the data
* Load stop words from NLTK

In [None]:
# NLTK stop words
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
#stop_words = stopwords.words('french')

# See the default list
print('Default list:', stop_words)

# Add your custom stop words
stop_words.extend([])

# See the final list of stop words
print('\nFinal list:', stop_words)
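
Once the list is final, removing stop words is a membership test per token. The following is a minimal sketch, assuming a tiny hand-written stop word list and made-up tokens in place of the NLTK list (both are placeholders, not the notebook's data). Converting the list to a set is an optional choice that speeds up each lookup.

```python
# Placeholder stop words; the notebook uses stopwords.words('english') instead
example_stop_words = ['the', 'of', 'and', 'in']
example_stop_words.extend(['et', 'al'])  # custom additions, as in the cell above

# A set gives constant-time membership tests; a list is scanned linearly
stop_set = set(example_stop_words)

# Made-up tokens for illustration
tokens = ['the', 'evolution', 'of', 'language', 'et', 'al']
kept = [tok for tok in tokens if tok not in stop_set]
print(kept)
```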

* Get the list of filenames

In [None]:
# Specify the folder and extension of the text files
txt_folder = Path('data/').rglob('*.txt')
# Alternative folder name: collect the paths of all your text files
#txt_folder = Path('donnee/').rglob('*.txt')

files = list(txt_folder)  # Gather the paths for all text files in a list
print(files[:3], '...', files[-3:])  # Print first 3 and last 3 filenames

* Create a dictionary that will populate a Pandas DataFrame with two columns:
  * `target_names`: the filename without its path
  * `content`: the original text of the file, joined into a single line
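
The structure described above can be sketched without any real data. The example below is only an illustration: it writes two placeholder files into a temporary folder, then fills a dictionary of the same shape. The names `example_papers` and the file contents are assumptions made for the sketch, and it joins lines with `splitlines()` rather than `readlines()`, so no newline characters remain.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with two small text files (placeholder data)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'a.txt').write_text('first line\nsecond line\n', encoding='utf-8')
    (root / 'sub').mkdir()
    (root / 'sub' / 'b.txt').write_text('only line\n', encoding='utf-8')

    example_papers = {'target_names': [], 'content': []}
    # rglob() also searches subfolders; sorted() makes the order deterministic
    for path in sorted(root.rglob('*.txt')):
        example_papers['target_names'].append(path.name)
        # Join the lines with spaces so each file becomes a single line
        lines = path.read_text(encoding='utf-8').splitlines()
        example_papers['content'].append(' '.join(lines))

print(example_papers)
```

Each key holds one list, and the two lists stay aligned by position, which is what lets `pd.DataFrame.from_dict()` turn them into two columns.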

In [None]:
# Create a dictionary that maps every filename to its text
papers = {'target_names': [], 'content': []}

for i, name in enumerate(files):
    basename = os.path.basename(name)

    # Print every 10th filename to show progress
    if i % 10 == 0:
        print(f'Reading {basename} ...')

    # The context manager closes the file even if reading fails
    with open(name, 'r', encoding='utf-8') as f:
        papers['target_names'].append(basename)
        papers['content'].append(' '.join(f.readlines()))

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(papers)
print(f'Total: {len(df)} rows. Here are the first five:')
df.head()

## Cleaning the text data

In [None]:
# Take the text content as a list of strings (one string per file)
data = papers['content']

# Remove roman numerals (raw strings keep the regex backslashes intact)
data = [re.sub(r'[MDCLXVI]+(\.|\b\w\n)', ' ', sentence) for sentence in data]

# Collapse newlines and other whitespace runs into a single space
data = [re.sub(r'\s+', ' ', sentence) for sentence in data]

# Remove distracting single quotes (optional)
#data = [re.sub("'", "", sentence) for sentence in data]

print('First cleaned sentence:', data[0])
print('\nLast cleaned sentence:', data[-1])
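
To see what the two substitutions do, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the data). Note that the first pattern also removes any run of the capital letters M, D, C, L, X, V, I that is followed by a period, so initials such as "D." would be removed as well.

```python
import re

# Made-up sample containing a roman numeral, a newline, a double space and a tab
text = "Chapter IV.\nIt  was\tlate."

# Step 1: replace roman numerals followed by a period with a space
no_numerals = re.sub(r'[MDCLXVI]+(\.|\b\w\n)', ' ', text)

# Step 2: collapse every whitespace run (spaces, tabs, newlines) into one space
cleaned = re.sub(r'\s+', ' ', no_numerals)
print(cleaned)
```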

In [None]:
# Remove the punctuation and collect all the individual words
def sentences_to_words(sentences):
    """
    Generator - For each sentence, return a processed list of words
    
    Returns:
    -------
    Each sentence processed by gensim.utils.simple_preprocess(), which
    removes the punctuation and collects all the individual words.
    """
    for sentence in sentences:
        # simple_preprocess() lowercases and tokenizes, which drops punctuation;
        # deacc=True additionally strips accent marks from the tokens
        yield simple_preprocess(sentence, deacc=True)

# Create a list of lists of words - one list of words per sentence
data_words = list(sentences_to_words(data))

print('First list of words:', data_words[0])
print('\nLast list of words:', data_words[-1])
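
For readers without Gensim at hand, the sketch below approximates what `simple_preprocess(..., deacc=True)` does using only the standard library. It is an assumption-laden imitation, not Gensim's code: the function name `approx_simple_preprocess` is hypothetical, and the length limits of 2 and 15 characters reflect Gensim's documented defaults as we understand them.

```python
import re
import unicodedata

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stdlib approximation of gensim.utils.simple_preprocess(deacc=True)."""
    text = text.lower()
    # Strip accent marks: decompose characters, then drop combining marks
    decomposed = unicodedata.normalize('NFD', text)
    text = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    # Keep runs of letters/underscores; punctuation and digits act as separators
    tokens = re.findall(r'[^\W\d]+', text)
    # Discard tokens that are too short, too long, or start with an underscore
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith('_')]

words = approx_simple_preprocess("Caf\u00e9-goers read 3 papers, a day!")
print(words)
```

The single-letter token "a" and the digit "3" disappear, which is why very short words never reach the later steps.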

## Topic Modeling
We will start by using:
* Gensim's [Phrases class](https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases) - an instance of it "detects phrases based on collocation counts"
* Gensim's [Phraser class](https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phraser) - an alias of [FrozenPhrases](https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.FrozenPhrases) which "cuts down memory consumption of Phrases, by discarding model state not strictly needed for the phrase detection task".
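
To give an intuition for "collocation counts", the sketch below scores word pairs on a toy corpus. It assumes Gensim's default scorer follows the formula `(bigram_count - min_count) / count(a) / count(b) * vocab_size`, and that a pair is joined when its score exceeds `threshold`. The toy sentences, the helper `phrase_score` and the way `vocab_size` is computed are illustrative assumptions, not Gensim internals.

```python
from collections import Counter

# Toy corpus (made up): "new york" co-occurs often, "big new" only once
sentences = [
    ['new', 'york', 'is', 'big'],
    ['i', 'love', 'new', 'york'],
    ['new', 'york', 'never', 'sleeps'],
    ['a', 'big', 'new', 'idea'],
]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))

def phrase_score(a, b, min_count, vocab_size):
    """Collocation score: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

# Assumption: the vocabulary counts both single words and word pairs
vocab_size = len(unigrams) + len(bigrams)

score_ny = phrase_score('new', 'york', min_count=1, vocab_size=vocab_size)
score_bn = phrase_score('big', 'new', min_count=1, vocab_size=vocab_size)
print(f'new york: {score_ny:.2f}, big new: {score_bn:.2f}')
```

A frequent pair of otherwise rare words gets a high score, so raising `threshold` keeps only the strongest collocations.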

In [None]:
# Build the bigram and trigram models - higher threshold => fewer phrases
bigram = gensim.models.phrases.Phrases(data_words, min_count=4, threshold=8)
trigram = gensim.models.phrases.Phrases(bigram[data_words], threshold=8)

# Faster way to get a sentence identified as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See a trigram example (index 90 assumes at least 91 documents)
print(trigram_mod[bigram_mod[data_words[90]]])

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    Uses the global spaCy pipeline `nlp`, which must be loaded before calling.
    See https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
# Remove stop words
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Form trigrams
data_words_trigrams = make_trigrams(data_words_bigrams)

# Load the spaCy English pipeline, disabling the parser and NER components (for efficiency)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# For French text, load the French pipeline instead
#nlp = spacy.load('fr_core_news_sm', disable=['parser', 'ner'])
#nlp = spacy.load("fr_core_news_sm")

# Do lemmatization keeping only nouns, adjectives, verbs and adverbs
data_lemmatized = lemmatization(data_words_trigrams,
                                allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Index 90 assumes at least 91 documents
print(data_lemmatized[90])

In [None]:
# Create the dictionary (mapping between tokens and integer ids)
id2word = corpora.Dictionary(data_lemmatized)

# Bag-of-words corpus: (token_id, count) pairs for each document
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

In [None]:
# Readable format of the corpus (first 10 terms of the first 4 documents)
[[(id2word[token_id], freq) for token_id, freq in cp[:10]] for cp in corpus[:4]]
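
The bag-of-words representation built above can be imitated in a few lines of plain Python. This is a sketch of the idea only: the documents are made up, `token2id` and `bow_corpus` are hypothetical names, and Gensim's own id assignment may order tokens differently.

```python
from collections import Counter

# Made-up lemmatized documents
docs = [['topic', 'model', 'topic'], ['model', 'corpus']]

# Assign an integer id to each distinct token, in order of first appearance
# (Gensim's Dictionary may assign ids differently; only the idea is shown)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Bag of words: sorted (token_id, count) pairs for each document
bow_corpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]
print(bow_corpus)
```

Word order is discarded at this point; only how often each token occurs in each document is kept.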

In [None]:
start = 2   # Minimum number of topics to try
limit = 11  # Exclusive upper bound: the largest model has at most limit - 1 topics
step = 2    # Step between successive numbers of topics
# Update this to the location of your MALLET executable; expanduser() resolves '~'
mallet_path = os.path.expanduser('~/mallet-2.0.8/bin/mallet')

model_list = []
coherence_values = []

for num_topics in range(start, limit, step):
    model = gensim.models.wrappers.LdaMallet(
        mallet_path=mallet_path,
        corpus=corpus,
        num_topics=num_topics,
        id2word=id2word)
    model_list.append(model)

    coherencemodel = CoherenceModel(
        model=model,
        texts=data_lemmatized,
        dictionary=id2word,
        coherence='c_v')
    coherence_values.append(coherencemodel.get_coherence())


In [None]:
# Show the coherence graph

x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# Pass a list: a bare string in parentheses is iterated character by character
plt.legend(["coherence_values"], loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m,
          " has Coherence Value of", round(cv, 4))

In [None]:
# Select the model you judge best and print its topics
# Python indexes from 0, so model_list[3] is the fourth model
# (8 topics with the start and step values set earlier)
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
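
Because the models are stored in a list, it helps to convert between a list index and a number of topics. The sketch below uses placeholder coherence values (invented for illustration, not results from this notebook) to show both directions. Picking the highest score is only one possible heuristic; inspecting the topics by eye remains a judgment call.

```python
start, limit, step = 2, 11, 2
topic_counts = list(range(start, limit, step))  # number of topics per model

# Placeholder scores for illustration only; not results from this notebook
example_coherence = [0.31, 0.38, 0.42, 0.45, 0.44]

# Index of the highest score, then the matching number of topics
best_index = max(range(len(example_coherence)),
                 key=example_coherence.__getitem__)
best_num_topics = topic_counts[best_index]

# Reverse lookup: list index of the model trained with 8 topics
index_of_8 = (8 - start) // step
print(best_index, best_num_topics, index_of_8)
```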

In [None]:
# Now run just that model with the exact number of topics you want
ldamallet = gensim.models.wrappers.LdaMallet(
    mallet_path, corpus=corpus, num_topics=8, id2word=id2word)

In [None]:
# Show topics
pprint(ldamallet.show_topics(formatted=False))

# Compute the coherence score
coherence_model_ldamallet = CoherenceModel(
    model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

In [None]:
# Visualize the topics
# pyLDAvis cannot use the LdaMallet model directly, so convert it
# to an LdaModel first, as per
# https://radimrehurek.com/gensim/models/wrappers/ldamallet.html
# A "by hand" version of this conversion can be found at
# https://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/

lda_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(
    ldamallet, gamma_threshold=0.01, iterations=20)

In [None]:
# This cell is commented out because it crashes the notebook
# pyLDAvis.enable_notebook()
# vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
# vis

In [None]:
def format_topics_sentences(ldamodel=ldamallet, corpus=corpus, texts=df):
    # Collect one row per document and build the DataFrame once at the end
    # (DataFrame.append() is deprecated and was removed in pandas 2.0)
    rows = []

    # Get the main topic in each document
    for doc_topics in ldamodel[corpus]:
        # Dominant topic = the (topic, proportion) pair with the highest proportion
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])

    sent_topics_df = pd.DataFrame(
        rows,
        columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the original text to the end of the output
    contents = texts
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
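
The core of this function is picking, for each document, the topic with the largest proportion. A minimal sketch with made-up `(topic_id, proportion)` pairs (the values and the name `doc_topic_rows` are assumptions for illustration):

```python
# Made-up (topic_id, proportion) pairs for two documents
doc_topic_rows = [
    [(0, 0.10), (1, 0.65), (2, 0.25)],
    [(0, 0.50), (1, 0.20), (2, 0.30)],
]

# For each document, keep the pair with the highest proportion
dominant = [max(row, key=lambda pair: pair[1]) for row in doc_topic_rows]
print(dominant)
```

Using `max()` avoids sorting the full list when only the top entry is needed.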

In [None]:
df_topic_sents_keywords = format_topics_sentences(
    ldamodel=ldamallet, corpus=corpus, texts=df)

# Rename the columns for display
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = [
    'Document number',
    'Dominant_Topic',
    'Topic_Perc_Contrib',
    'Keywords',
    'file_name',
    'Text']

In [None]:
# Show
df_dominant_topic