## Step 1: Importing the dataset

### The dataset is imported from the sklearn library

In [2]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(shuffle = True)

### The articles are already grouped into 20 different topics

In [3]:
print(list(newsgroups.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


### Sample article from our dataset

In [4]:
newsgroups.data[11]

'From: david@terminus.ericsson.se (David Bold)\nSubject: Re: Question for those with popular morality\nReply-To: david@terminus.ericsson.se\nDistribution: world\nOrganization: Camtec Electronics (Ericsson), Leicester, England\nLines: 77\nNntp-Posting-Host: bangkok\n\nIn article 17570@freenet.carleton.ca, ad354@Freenet.carleton.ca (James Owens) writes:\n>\n>In a previous article, david@terminus.ericsson.se (David Bold) says:\n>\n>>\n>>I don\'t mean to be rude, but I think that you\'ve got hold of the wrong\n>>end of a different stick...\n>>\n>>David\n>\n>I had a look at your posting again and I see what you mean!  I was so\n>intent on explaining how Jung thought we could be more moral than God that\n>I overlooked your main line of thought.\n>\n>You seem to be saying that, God being unknowable, His morality is unknowable.\n\nYep, that\'s pretty much it. I\'m not a Jew but I understand that this is the\nJewish way of thinking. However, the Jews believe that the Covenant between\nYHWH and 

## Step 2: Data preprocessing


### Importing the required modules

In [5]:
import numpy as np
np.random.seed(400)

In [6]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

In [7]:
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### Preprocessing function

In [8]:
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a noun, then reduce it to its stem.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='n'))

def preprocess(text):
    result = []
    # simple_preprocess lowercases and tokenizes the text.
    for token in simple_preprocess(text):
        # Drop stop words and tokens of 3 characters or fewer.
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

### Preprocessing example

In [9]:
doc_sample = 'I really like using Python for topic modeling. Thank you for the internship !'

print("Original document : ")
words = []
for word in doc_sample.split(' ') :
    words.append(word)
print(words)

print("\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document : 
['I', 'really', 'like', 'using', 'Python', 'for', 'topic', 'modeling.', 'Thank', 'you', 'for', 'the', 'internship', '!']

Tokenized and lemmatized document: 
['like', 'python', 'topic', 'model', 'thank', 'internship']


### Preprocessing the dataset

In [10]:
processed_docs = []

for doc in newsgroups.data :
    processed_docs.append(preprocess(doc))

print(processed_docs[11])

['david', 'terminus', 'ericsson', 'david', 'bold', 'subject', 'question', 'popular', 'moral', 'repli', 'david', 'terminus', 'ericsson', 'distribut', 'world', 'organ', 'camtec', 'electron', 'ericsson', 'leicest', 'england', 'line', 'nntp', 'post', 'host', 'bangkok', 'articl', 'freenet', 'carleton', 'freenet', 'carleton', 'jame', 'owen', 'write', 'previous', 'articl', 'david', 'terminus', 'ericsson', 'david', 'bold', 'say', 'mean', 'rude', 'think', 'hold', 'wrong', 'differ', 'stick', 'david', 'look', 'post', 'mean', 'intent', 'explain', 'jung', 'thought', 'moral', 'overlook', 'main', 'line', 'thought', 'say', 'unknow', 'moral', 'unknow', 'pretti', 'understand', 'jewish', 'think', 'jew', 'believ', 'coven', 'yhwh', 'patriarch', 'abraham', 'mose', 'case', 'establish', 'moral', 'code', 'follow', 'mankind', 'jew', 'decid', 'boundari', 'fall', 'understand', 'sadduce', 'believ', 'torah', 'requir', 'pharise', 'ancestor', 'modern', 'judaism', 'believ', 'torah', 'avail', 'interpret', 'lead', 'unde

## Step 3: Storing the words of our dataset

### Creating a dictionary that indexes each word and tracks how often it occurs

In [11]:
dictionary = gensim.corpora.Dictionary(processed_docs)
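
To make the dictionary's role more concrete, here is a minimal pure-Python sketch of the two things it tracks: an integer id for each unique token, and the number of documents each token appears in. The `toy_docs` data and all variable names are illustrative assumptions, and the way ids are assigned here (order of first appearance) may differ from what gensim does internally.

```python
from collections import Counter

# Hypothetical pre-tokenized documents standing in for processed_docs.
toy_docs = [
    ["space", "orbit", "launch"],
    ["space", "nasa", "space"],
    ["hockey", "team", "launch"],
]

# Assign an integer id to each unique token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Document frequency: in how many documents does each token appear?
# set(doc) ensures a token repeated inside one document counts once.
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

print(token2id)
print(doc_freq["space"], doc_freq["orbit"])
```

The document frequency, rather than the raw number of occurrences, is the quantity that matters for the filtering step that follows.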

### Filtering the dictionary

* we remove rare words that appear in fewer than 15 documents
* we remove overly frequent words that appear in more than 10% of the documents
* finally, we keep only the 100,000 most frequent of the remaining words

In [12]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
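
The three thresholds are easier to reason about on a tiny corpus. The sketch below reimplements the same kind of rule in plain Python, assuming that both bounds are expressed in terms of document frequency. The function `filter_by_doc_freq` and the `toy_docs` data are hypothetical names introduced for illustration; this is not gensim's code.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above, keep_n):
    """Return the tokens kept by a filter_extremes-style rule (illustrative)."""
    # Count in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus size; convert it to a document count.
    max_docs = no_above * len(docs)
    kept = [
        token for token, df in doc_freq.items()
        if no_below <= df <= max_docs
    ]
    # Keep only the keep_n tokens with the highest document frequency
    # (ties broken alphabetically so the result is deterministic).
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

toy_docs = [
    ["the", "space", "orbit"],
    ["the", "space", "nasa"],
    ["the", "hockey", "team"],
    ["the", "hockey", "goal"],
]

# "the" is in every document (too frequent); "orbit", "nasa", "team" and
# "goal" are each in a single document (too rare).
print(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5, keep_n=10))
# → ['hockey', 'space']
```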

### Converting each document into (word id, occurrence count) pairs (bag-of-words format)

In [13]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
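
As a rough illustration of the bag-of-words format, the following sketch converts a token list into sorted `(token_id, count)` pairs and skips tokens that are not in the vocabulary. The helper `to_bow` and the small `token2id` mapping are assumptions made for this example.

```python
from collections import Counter

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs.

    Tokens missing from token2id are skipped.
    """
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary and document.
token2id = {"space": 0, "orbit": 1, "nasa": 2}
doc = ["space", "nasa", "space", "rocket"]

# "rocket" is not in the vocabulary, so it is dropped.
print(to_bow(doc, token2id))
# → [(0, 2), (2, 1)]
```

Word order is lost in this representation: only the identity of each word and its count within the document remain.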

### Preview of the bag-of-words for a preprocessed document

In [14]:
bow_doc_x = bow_corpus[11]

for i in range(len(bow_doc_x)) :
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], dictionary[bow_doc_x[i][0]], bow_doc_x[i][1]))

Word 20 ("small") appears 1 time.
Word 28 ("base") appears 1 time.
Word 86 ("feel") appears 1 time.
Word 91 ("heard") appears 1 time.
Word 97 ("life") appears 2 time.
Word 103 ("opinion") appears 1 time.
Word 125 ("suppos") appears 2 time.
Word 152 ("pretti") appears 1 time.
Word 153 ("requir") appears 2 time.
Word 171 ("clear") appears 1 time.
Word 172 ("code") appears 8 time.
Word 190 ("previous") appears 1 time.
Word 196 ("understand") appears 3 time.
Word 208 ("argument") appears 1 time.
Word 216 ("consid") appears 1 time.
Word 233 ("hand") appears 1 time.
Word 235 ("hard") appears 1 time.
Word 243 ("later") appears 1 time.
Word 249 ("modern") appears 1 time.
Word 267 ("second") appears 1 time.
Word 292 ("direct") appears 1 time.
Word 305 ("thought") appears 2 time.
Word 310 ("appl") appears 1 time.
Word 312 ("avail") appears 1 time.
Word 333 ("interfac") appears 1 time.
Word 347 ("refer") appears 1 time.
Word 403 ("troubl") appears 1 time.
Word 409 ("wrong") appears 3 time.
Word 4

## Step 4: Running LDA

### We train our LDA model, keeping in mind that:

* LdaMulticore: uses multiple CPU cores to reduce training time

* num_topics: number of topics to extract from the corpus

* id2word: mapping from word ids (integers) to words (strings)

* passes: number of training passes over the corpus

* workers: number of worker processes to use (by default, gensim uses the number of available cores minus one)

* alpha: controls the sparsity of the document-topic distribution

  - large alpha = each document contains a mixture of all the topics
    - documents look similar to one another
  - small alpha = each document contains a mixture of very few topics

* eta: controls the sparsity of the topic-word distribution

  - large eta = each topic contains a mixture of most of the words
    - topics look similar to one another
  - small eta = each topic contains a mixture of only a few words
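
In LDA, alpha and eta are concentration parameters of Dirichlet priors, so their effect can be illustrated by sampling from a symmetric Dirichlet distribution. The sketch below uses only the standard library and measures the average weight of the dominant topic in sampled mixtures. The helper names, the sample count, and the tested values are assumptions chosen for illustration; this is not how gensim performs inference.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, num_topics=8, num_samples=2000, seed=0):
    """Average weight of the dominant topic across sampled mixtures."""
    rng = random.Random(seed)
    return sum(
        max(sample_dirichlet(concentration, num_topics, rng))
        for _ in range(num_samples)
    ) / num_samples

# A smaller concentration should yield mixtures dominated by one topic;
# a larger one should yield mixtures spread more evenly over all topics.
for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 3))
```

The same reasoning applies to eta, with words taking the place of topics and topics taking the place of documents.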

In [15]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics = 8, id2word = dictionary, passes = 10)

### For each topic, we look at its most representative words and their relative weights

In [None]:
topics = []
for idx, topic in lda_model.print_topics(-1) :
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
    topics.append(topic)

## Storing the results of our topic modeling

In [23]:
import pandas as pd

In [77]:
all_topic_model = []
for topic in topics:
    # Each topic string looks like '0.012*"word" + 0.010*"other" + ...'
    terms = topic.split(' + ')
    topic_model = []
    for term in terms[:10]:
        # Split each term into its weight and its quoted word.
        weight, word = term.split('*', 1)
        topic_model.append((weight, word.strip('"')))
    all_topic_model.append(topic_model)
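
A regular expression is another way to extract the (weight, word) pairs, and it also makes it easy to convert the weights to floats. The sketch below assumes topic strings of the form `weight*"word" + weight*"word" + ...`; the `sample_topic` string is made up for illustration and is not output from the model trained above.

```python
import re

# Hypothetical topic string in the '<weight>*"<word>" + ...' format.
sample_topic = '0.012*"game" + 0.010*"team" + 0.008*"player"'

# Capture the numeric weight and the quoted word of each term.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Return a list of (weight, word) pairs with weights as floats."""
    return [(float(w), word) for w, word in TERM_PATTERN.findall(topic_string)]

print(parse_topic(sample_topic))
# → [(0.012, 'game'), (0.01, 'team'), (0.008, 'player')]
```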

In [76]:
df_topic_model = pd.DataFrame(all_topic_model)
df_topic_model.rename(index = {0: "Topic 1", 1: "Topic 2", 2: "Topic 3", 3: "Topic 4", 4: "Topic 5", 5: "Topic 6", 6: "Topic 7", 7: "Topic 8"}, inplace = True)
df_topic_model.rename(columns = {0: 'Word 1', 1: 'Word 2', 2: 'Word 3', 3: 'Word 4', 4: 'Word 5', 5: 'Word 6', 6: 'Word 7', 7: 'Word 8', 8: 'Word 9', 9: 'Word 10'}, inplace = True)

In [78]:
df_topic_model.to_csv('topic_model.csv')