# Topic Modeling Le Matin

## Importing the data

In [3]:
import pandas as pd

In [None]:
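# error_bad_lines=False was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is its replacement.
le_matin = pd.read_csv('le_matin_articles.csv', engine='python', on_bad_lines='skip')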

In [5]:
le_matin.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,category,content
0,0,0.0,activites-royales,"Le porte-parole du Palais Royal a annoncé, jeu..."
1,1,1.0,activites-royales,"Sa Majesté le Roi Mohammed VI, que Dieu L'assi..."
2,2,2.0,activites-royales,"Sa Majesté le Roi Mohammed VI, que Dieu L'assi..."
3,3,3.0,activites-royales,Sa Majesté le Roi Mohammed VI a adressé un mes...
4,4,4.0,activites-royales,The origin web server timed out responding to ...


## Installing the packages

In [None]:
!pip install nltk
!pip install gensim

## Importing the libraries

In [7]:
import numpy as np

In [8]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import CoherenceModel

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
nltk.download('omw-1.4')

## Preprocessing function

* remove the stopwords in gensim's STOPWORDS set, which is an English list (this, that, where...)
* drop tokens of three characters or fewer
* lemmatize each remaining token with WordNet, then stem it with the French Snowball stemmer

In [10]:
stemmer = SnowballStemmer('french')

In [11]:
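def lemmatize_stemming(text):
    # WordNet is an English lexicon, so the lemmatizer leaves most French tokens unchanged;
    # the French Snowball stemmer does most of the normalization.
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='n'))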

In [12]:
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # Keep tokens longer than 3 characters that are not in gensim's (English) stopword set.
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
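
Because gensim's STOPWORDS set is an English list, French function words such as "dans" or "pour" are not removed by the stopword test. The sketch below shows the same filter with a French list. It is only an illustration: FRENCH_STOPWORDS and filter_tokens are hypothetical names, the set is a small hand-written sample rather than a complete list, and no stemming is applied.

```python
import re

# Illustrative subset only; a real run would need a complete French stopword list.
FRENCH_STOPWORDS = frozenset({
    "avec", "dans", "pour", "cette", "sont", "plus", "leur", "nous", "vous", "mais",
})


def filter_tokens(text, stopwords=FRENCH_STOPWORDS, min_len=4):
    """Lowercase, tokenize on runs of letters, then drop stopwords and short tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]


sample = "Le gouvernement travaille avec les régions pour cette réforme"
print(filter_tokens(sample))
```

With a complete list, the same set could be passed to the stopword test in preprocess instead of the English one.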

## Preprocessing the data

* drop the rows whose content is NaN
* apply the preprocessing function to every article

In [13]:
le_matin.dropna(subset = ["content"], inplace=True)

In [14]:
processed_docs = [preprocess(doc) for doc in le_matin['content']]

In [15]:
processed_docs[10][:10]

['voic',
 'communiqu',
 'cabinet',
 'royal',
 'suit',
 'trist',
 'nouvel',
 'déces',
 'cheikh',
 'khalif']

## Storing the preprocessed data

* we build a gensim Dictionary, which maps each distinct token to an integer id and records how many documents contain it

In [16]:
dictionary = gensim.corpora.Dictionary(processed_docs)

## Cleaning the dictionary

* remove rare tokens that appear in fewer than 15 documents
* remove frequent tokens that appear in more than 10% of the documents
* of the remaining tokens, keep at most the 100,000 most frequent

In [17]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
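
Both thresholds passed to filter_extremes are measured in document frequency (the number of documents that contain a token), not in total occurrences. The idea can be sketched in plain Python on a toy corpus. The helper filter_by_doc_frequency is hypothetical and written for illustration; it is not gensim's implementation.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below, no_above, keep_n=None):
    """Return the tokens whose document frequency lies within the given bounds."""
    # Count each token once per document, however often it occurs there.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent survivors first; ties are broken alphabetically.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n] if keep_n is not None else kept


toy_docs = [
    ["roi", "roi", "message"],
    ["roi", "startup"],
    ["roi", "startup", "forum"],
    ["roi", "forum", "forum"],
]
# "roi" is in every document (too frequent), "message" in only one (too rare).
print(filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.5))
```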

## Conversion to bag-of-words

* each document is converted into a list of (token id, number of occurrences) pairs: the bag-of-words format

In [18]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [19]:
for i in range(10) :
    print("Word {} (\"{}\") appears {} time.".format(bow_corpus[10][i][0], dictionary[bow_corpus[10][i][0]], bow_corpus[10][i][1]))

Word 1 ("altess") appears 1 time.
Word 11 ("dieu") appears 4 time.
Word 14 ("famill") appears 2 time.
Word 18 ("glorif") appears 1 time.
Word 23 ("hériti") appears 2 time.
Word 24 ("illustr") appears 2 time.
Word 40 ("peupl") appears 3 time.
Word 42 ("princ") appears 2 time.
Word 44 ("préserv") appears 1 time.
Word 47 ("royal") appears 1 time.
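
Each line printed above corresponds to one (token id, count) pair of document 10. A minimal pure-Python sketch of that conversion follows, assuming a hand-written token2id mapping with made-up ids; to_bow is a hypothetical helper, and in the notebook the real mapping is built and stored by the gensim Dictionary.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"dieu": 0, "peupl": 1, "royal": 2}  # hypothetical ids for illustration
doc = ["dieu", "royal", "dieu", "peupl", "cabinet", "dieu"]
# "cabinet" is not in the mapping, so it is ignored, as a filtered-out token would be.
print(to_bow(doc, token2id))
```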


## Running LDA

* LdaMulticore trains on several CPU cores in parallel to reduce the run time

* num_topics: number of topics to extract from the corpus

* id2word: mapping from word ids (integers) to words (strings)

* passes: number of training passes over the corpus

In [20]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics = 10, id2word = dictionary, passes = 1000)

In [21]:
topics = []
for idx, topic in lda_model.print_topics(-1) :
    print("Topic: {} -> Words: {}".format(idx, topic))
    topics.append(topic)

Topic: 0 -> Words: 0.039*"alopec" + 0.025*"médic" + 0.022*"cheveux" + 0.019*"cultur" + 0.017*"fréquent" + 0.017*"repouss" + 0.017*"poil" + 0.014*"agad" + 0.014*"corp" + 0.013*"américain"
Topic: 1 -> Words: 0.042*"logist" + 0.026*"digitalis" + 0.024*"transport" + 0.020*"salon" + 0.015*"variol" + 0.014*"sing" + 0.014*"adapt" + 0.014*"professionnel" + 0.009*"urgenc" + 0.009*"tedros"
Topic: 2 -> Words: 0.071*"cybersécur" + 0.044*"hackathon" + 0.035*"lmp" + 0.021*"startup" + 0.020*"numer" + 0.020*"météorolog" + 0.019*"orag" + 0.018*"security" + 0.018*"cyb" + 0.018*"enjeux"
Topic: 3 -> Words: 0.041*"taux" + 0.020*"heur" + 0.015*"mainten" + 0.015*"immun" + 0.015*"majest" + 0.014*"instant" + 0.012*"plateform" + 0.012*"réclam" + 0.012*"médical" + 0.011*"royal"
Topic: 4 -> Words: 0.033*"franklin" + 0.019*"personnag" + 0.019*"américain" + 0.018*"britann" + 0.018*"incarn" + 0.018*"franco" + 0.017*"pierr" + 0.017*"épisod" + 0.017*"michael" + 0.016*"benjamin"
Topic: 5 -> Words: 0.006*"format" + 0.00

## Coherence of the topic model

* Coherence measures score how semantically similar the top-weighted words of a topic are to each other
* They help separate topics that are semantically interpretable from topics that are artifacts of statistical inference
* As a rule of thumb, a coherence (c_v) between roughly 0.4 and 0.7 suggests a reasonable LDA model, while much higher values are suspicious and worth checking

In [22]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.4774890245987187
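
To make the idea of a coherence measure concrete, the sketch below computes the simpler UMass coherence from document co-occurrence counts. This is a different measure from the c_v score reported above and its values are not on the same scale. The helper umass_coherence is hypothetical, averages over word pairs, and assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average of log((D(w_i, w_j) + 1) / D(w_i)) over pairs of top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    # combinations() preserves list order, so w_i is always ranked above w_j.
    for w_i, w_j in combinations(top_words, 2):
        scores.append(math.log((doc_count(w_i, w_j) + 1) / doc_count(w_i)))
    return sum(scores) / len(scores)


toy_docs = [
    ["cheveux", "poil", "médic"],
    ["cheveux", "poil"],
    ["trafic", "aéroport"],
    ["cheveux", "trafic"],
]
coherent = umass_coherence(["cheveux", "poil"], toy_docs)
mixed = umass_coherence(["cheveux", "aéroport"], toy_docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition all coherence measures share.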


## Storing the results

In [23]:
all_topic_model = []
for topic in topics:
    # Each term looks like 0.039*"alopec": split on '*' rather than slicing at fixed offsets.
    topic_model = []
    for term in topic.split(' + '):
        weight, word = term.split('*', 1)
        topic_model.append((weight, word.strip('"')))
    all_topic_model.append(topic_model)

In [24]:
df_topic_model = pd.DataFrame(all_topic_model)
df_topic_model.index = [f"Topic {i + 1}" for i in range(len(df_topic_model))]

In [25]:
df_topic_model

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Topic 1,"(0.039, alopec)","(0.025, médic)","(0.022, cheveux)","(0.019, cultur)","(0.017, fréquent)","(0.017, repouss)","(0.017, poil)","(0.014, agad)","(0.014, corp)","(0.013, américain)"
Topic 2,"(0.042, logist)","(0.026, digitalis)","(0.024, transport)","(0.020, salon)","(0.015, variol)","(0.014, sing)","(0.014, adapt)","(0.014, professionnel)","(0.009, urgenc)","(0.009, tedros)"
Topic 3,"(0.071, cybersécur)","(0.044, hackathon)","(0.035, lmp)","(0.021, startup)","(0.020, numer)","(0.020, météorolog)","(0.019, orag)","(0.018, security)","(0.018, cyb)","(0.018, enjeux)"
Topic 4,"(0.041, taux)","(0.020, heur)","(0.015, mainten)","(0.015, immun)","(0.015, majest)","(0.014, instant)","(0.012, plateform)","(0.012, réclam)","(0.012, médical)","(0.011, royal)"
Topic 5,"(0.033, franklin)","(0.019, personnag)","(0.019, américain)","(0.018, britann)","(0.018, incarn)","(0.018, franco)","(0.017, pierr)","(0.017, épisod)","(0.017, michael)","(0.016, benjamin)"
Topic 6,"(0.006, format)","(0.006, conseil)","(0.005, relat)","(0.005, gouvern)","(0.005, régional)","(0.004, commun)","(0.004, domain)","(0.004, initi)","(0.004, professionnel)","(0.004, univers)"
Topic 7,"(0.006, vari)","(0.006, enfant)","(0.005, femm)","(0.005, omicron)","(0.004, risqu)","(0.004, médecin)","(0.004, prix)","(0.004, expliqu)","(0.004, sanitair)","(0.004, faut)"
Topic 8,"(0.022, hybrid)","(0.022, aircross)","(0.016, conduit)","(0.016, styl)","(0.016, confort)","(0.015, motoris)","(0.013, conducteur)","(0.013, litr)","(0.013, bord)","(0.013, moteur)"
Topic 9,"(0.029, rir)","(0.025, afric)","(0.025, startup)","(0.019, talent)","(0.019, spectacl)","(0.018, forum)","(0.018, catégor)","(0.018, char)","(0.013, francophon)","(0.012, nomm)"
Topic 10,"(0.026, trafic)","(0.019, euro)","(0.018, perform)","(0.018, nador)","(0.017, west)","(0.015, aéroport)","(0.015, banqu)","(0.015, commercial)","(0.014, sal)","(0.014, régional)"


In [26]:
df_topic_model.to_csv('topic_model_morocco_world_news.csv')

## Visualizing the results

In [None]:
!pip install pyLDAvis

In [None]:
import pyLDAvis.gensim_models

In [None]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, dictionary)