<h2> Work on Topic Modeling </h2>
<h3> Authors: CHOI Iatin, KAO Weiting, EL KHALIL Marwan, KHIEU Thomas, MAHON Matthew </h3>
<p> For this NLP work, we used the dataset provided for the exercise. The raw dataset was unusable as provided, so we first had to use VBA to make it readable by Python.</p>

#Excel VBA 
<br>
Sub AppendToExistingOnLeft()
<br>
Dim c As Range
<br>
For Each c In Selection
<br>
If c.Value <> "" Then c.Value = "https://" & c.Value
<br>
Next
<br>
End Sub
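
For readers without Excel, the same preparation step can be sketched in Python. This is a minimal sketch, not the code we ran: the function name `add_https_prefix` and the sample values are hypothetical, and unlike the macro it leaves values that already carry a scheme unchanged.

```python
def add_https_prefix(values):
    """Prepend "https://" to every non-empty value, like the VBA macro.

    Assumption: values that already start with http:// or https:// are
    left unchanged (the macro itself does not make this check).
    """
    result = []
    for value in values:
        value = value.strip()
        if not value:
            result.append(value)  # keep empty cells empty
        elif value.startswith(("http://", "https://")):
            result.append(value)  # already a full URL
        else:
            result.append("https://" + value)
    return result


# Hypothetical sample values standing in for the selected Excel cells
print(add_https_prefix(["www.example.com/page", "", "https://example.org"]))
```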

Once the data was in a usable state, we ran the code available at this address: https://nbviewer.jupyter.org/github/ThomasKhieu/Python/blob/master/NLPscraping.ipynb
<br>
This code was used to scrape the 70k rows of URLs in the file. Unfortunately, we only scraped 50k rows; the remaining 20k were too slow to process in the Python program.
Out of the 50k rows, only ~33k were usable in our Python program.
<br>
For the scraping part, we only scraped the < h1 > tag of each web page. We believe that, because of digital marketing, the page title should contain the most accurate information in order to be relevant for Google and other search engines, so we did not scrape the whole page content.
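
To illustrate what extracting only the heading means, here is a minimal sketch based on Python's standard `html.parser` module. It is not the scraping code linked above: the class `H1Extractor`, the helper `extract_h1` and the sample HTML are hypothetical, and no network request is made.

```python
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collect the text content of every <h1> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            # Collapse runs of whitespace and newlines into single spaces
            text = " ".join("".join(self._buffer).split())
            self.headings.append(text)
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self._buffer.append(data)


def extract_h1(html):
    """Return the list of <h1> texts found in an HTML string."""
    parser = H1Extractor()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical page content, used only for illustration
sample_html = (
    "<html><body><h1>\n   Papillon : <b>coloriages</b>\n</h1>"
    "<p>page body</p></body></html>"
)
print(extract_h1(sample_html))
```

Collapsing whitespace at extraction time would also remove the leading newline and spaces that are visible in several of the scraped titles.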

This second approach makes better use of machine learning than the first one while running fewer iterations for the passes.

In [1]:
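#import data
import pandas as pd
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False,
# which was deprecated in pandas 1.3 and removed in pandas 2.0
data = pd.read_csv('output.csv', on_bad_lines='skip')
# .copy() avoids a SettingWithCopyWarning when the 'index' column is added
data_text = data[['Page Title']].copy()
data_text['index'] = data_text.index
documents = data_text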

In [2]:
#quick look at the data
print(len(documents))
print(documents[:5])

33686
                                          Page Title  index
0  \n            Pôle-Emploi : les différentes ca...      0
1  Ne pas déclarer ses revenus à pôle emploi [Rés...      1
2  \n            AAH 2019 : montant et plafonds d...      2
3  \n            Allocation de rentrée scolaire (...      3
4  \nPrise d'otages à Blagnac : l'homme retranché...      4


In [3]:
#nltk and gensim import
import ssl
import gensim
import nltk
import numpy as np
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# Workaround for SSL certificate errors in nltk.download():
# this disables HTTPS certificate verification, so use it with care
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

np.random.seed(2019)  # fixed seed for reproducibility
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/Phoenix/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
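#stemmer setup
stemmer = SnowballStemmer("french")
def lemmatize_stemming(text):
    # Note: WordNetLemmatizer is an English lemmatizer, so it is not expected
    # to lemmatize French tokens correctly; the French Snowball stemmer follows
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Note: gensim's STOPWORDS is an English list; tokens of 3 characters
        # or fewer are dropped as well
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result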
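
Since gensim's `STOPWORDS` is an English list, French function words such as "pour" or "avec" are not removed by the filter in this cell. A French-aware filter could look like the minimal sketch below. It is not what we ran: `FRENCH_STOPWORDS` is a deliberately tiny, hand-picked and hypothetical sample (a real run would need a complete French list), and the regex tokenizer `tokenize` only stands in for `simple_preprocess`.

```python
import re

# Hypothetical, deliberately tiny sample; not a complete French stop-word list
FRENCH_STOPWORDS = frozenset({
    "pour", "avec", "dans", "plus", "elle", "vous", "leur", "entre",
    "quel", "comment", "sans", "chez", "cette", "mais", "sont",
})


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (accents kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def preprocess_french(text, min_len=4):
    """Drop French stop words and tokens shorter than min_len characters."""
    return [
        token for token in tokenize(text)
        if token not in FRENCH_STOPWORDS and len(token) >= min_len
    ]


print(preprocess_french("Les meilleures recettes pour cuisiner avec des pommes"))
```

The `min_len=4` default mirrors the `len(token) > 3` test used in the cell; stemming would still be applied afterwards.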

In [5]:
#preprocessing document preview
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
print(doc_sample.split(' '))
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['Papillon', ':', 'coloriages']


 tokenized and lemmatized document: 
['papillon', 'coloriag']


In [6]:
#saving processed results
documents = documents.dropna(subset=['Page Title'])
processed_docs = documents['Page Title'].map(preprocess)
processed_docs[:10]

0           [pôl, emploi, différent, catégor, chômeur]
1         [déclar, revenus, pôl, emploi, résolu, ferm]
2                            [mont, plafond, ressourc]
3                              [alloc, rentr, scolair]
4    [pris, otag, blagnac, homm, retranch, ouvr, po...
5                                    [dessert, maison]
6                                     [meteo, bordeau]
7                              [horoscop, chien, jour]
8                            [horoscop, chien, demain]
9                                     [programm, soir]
Name: Page Title, dtype: object

In [7]:
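#bag-of-words
dictionary = gensim.corpora.Dictionary(processed_docs)
# preview the first 11 (id, token) pairs; items() is used because the
# Python 2-style iteritems() alias is not available in every gensim version
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break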

0 catégor
1 chômeur
2 différent
3 emploi
4 pôl
5 déclar
6 ferm
7 revenus
8 résolu
9 mont
10 plafond


In [8]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
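
The call above keeps tokens that appear in at least 15 documents and in at most 50% of the documents, then keeps the 100000 most frequent of those. The plain-Python sketch below illustrates that documented behaviour on a toy corpus; it is not gensim's implementation, and the function name `surviving_tokens` and the toy data are hypothetical.

```python
from collections import Counter


def surviving_tokens(docs, no_below, no_above, keep_n):
    """Return the set of tokens that pass the three filters.

    no_below: minimum number of documents a token must appear in
    no_above: maximum fraction of documents a token may appear in
    keep_n:   maximum number of tokens kept, most frequent first
    """
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [
        token for token, df in doc_freq.items()
        if no_below <= df <= max_docs
    ]
    # Keep only the keep_n tokens with the highest document frequency
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return set(kept[:keep_n])


# Hypothetical toy corpus of already preprocessed titles
toy_docs = [
    ["meteo", "bordeau"],
    ["meteo", "paris"],
    ["horoscop", "jour"],
    ["horoscop", "demain"],
    ["meteo", "jour"],
    ["recet", "maison"],
]
print(sorted(surviving_tokens(toy_docs, no_below=2, no_above=0.4, keep_n=10)))
```

In this toy corpus "meteo" is dropped for being too frequent (3 of 6 documents) and the tokens seen only once are dropped for being too rare.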

In [9]:
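bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# bow_corpus is a plain list indexed by position, whereas processed_docs keeps
# the DataFrame labels after dropna(), so position 4310 may not be label 4310
bow_corpus[4310]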

[(398, 1), (399, 1), (1316, 1)]

In [10]:
bow_doc_4310 = bow_corpus[4310]
for word_id, freq in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(word_id, dictionary[word_id], freq))

Word 398 ("coloriag") appears 1 time.
Word 399 ("imprim") appears 1 time.
Word 1316 ("lapin") appears 1 time.


In [11]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.7612892254940364), (1, 0.648412457581353)]
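
With its default settings, `TfidfModel` weights each term by its count in the document times log2(N / df), where N is the number of documents and df the number of documents containing the term, and then scales each document vector to unit length. The sketch below reproduces that scheme in plain Python on a toy corpus; the function name `tfidf_vectors` and the data are hypothetical, and it is an illustration rather than gensim's code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """TF-IDF with weight = tf * log2(N / df), followed by L2 normalisation."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {
            token: tf * math.log2(n_docs / doc_freq[token])
            for token, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {token: w / norm for token, w in weights.items()}
        vectors.append(weights)
    return vectors


# Hypothetical toy corpus of already preprocessed titles
toy_docs = [
    ["meteo", "bordeau"],
    ["meteo", "paris"],
    ["horoscop", "jour"],
    ["horoscop", "demain"],
]
for vector in tfidf_vectors(toy_docs):
    print(vector)
```

Rarer terms receive the larger weight: in the first toy document "bordeau" (one document) outweighs "meteo" (two documents).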


In [12]:
#lda model training
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [13]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.040*"comment" + 0.033*"toujour" + 0.026*"linternaut" + 0.025*"cuisin" + 0.022*"ferm" + 0.022*"résolu" + 0.016*"nouveau" + 0.015*"jour" + 0.012*"sylv" + 0.011*"victim"
Topic: 1 
Words: 0.026*"mort" + 0.021*"fill" + 0.016*"leur" + 0.015*"petit" + 0.014*"euros" + 0.012*"amour" + 0.012*"trop" + 0.012*"parad" + 0.011*"entre" + 0.011*"moin"
Topic: 2 
Words: 0.094*"franc" + 0.046*"meteo" + 0.030*"météo" + 0.028*"port" + 0.019*"miss" + 0.012*"jennif" + 0.012*"jad" + 0.012*"contr" + 0.011*"roug" + 0.011*"test"
Topic: 3 
Words: 0.040*"femm" + 0.036*"plus" + 0.030*"meghan" + 0.026*"markl" + 0.023*"journal" + 0.022*"san" + 0.020*"toujour" + 0.019*"pour" + 0.018*"bel" + 0.017*"harry"
Topic: 4 
Words: 0.045*"pour" + 0.028*"recet" + 0.026*"meilleur" + 0.023*"maison" + 0.015*"pierr" + 0.013*"deux" + 0.011*"noël" + 0.011*"peopl" + 0.010*"avec" + 0.010*"semain"
Topic: 5 
Words: 0.034*"hallyday" + 0.028*"johnny" + 0.025*"elle" + 0.022*"avec" + 0.018*"laetici" + 0.017*"quel" + 0.015*"st

In [14]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
lda_model_tfidf.save('model100.gensim')
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.024*"toujour" + 0.023*"linternaut" + 0.020*"dan" + 0.017*"johnny" + 0.016*"résultat" + 0.016*"hallyday" + 0.010*"horoscop" + 0.008*"françois" + 0.008*"amour" + 0.008*"quel"
Topic: 1 Word: 0.013*"chez" + 0.013*"coloriag" + 0.010*"paris" + 0.010*"pour" + 0.010*"pass" + 0.009*"pomm" + 0.009*"terr" + 0.009*"miss" + 0.009*"cuisin" + 0.008*"château"
Topic: 2 Word: 0.057*"meteo" + 0.023*"femm" + 0.019*"journal" + 0.016*"avec" + 0.016*"toujour" + 0.012*"famill" + 0.008*"jour" + 0.008*"pour" + 0.007*"comm" + 0.007*"trop"
Topic: 3 Word: 0.030*"macron" + 0.021*"brigitt" + 0.018*"franc" + 0.017*"emmanuel" + 0.012*"rob" + 0.010*"pierr" + 0.010*"définit" + 0.010*"pour" + 0.009*"mar" + 0.009*"plus"
Topic: 4 Word: 0.018*"pour" + 0.010*"laetici" + 0.009*"apres" + 0.009*"résolu" + 0.009*"ferm" + 0.009*"symptôm" + 0.009*"elle" + 0.008*"hallyday" + 0.008*"brun" + 0.008*"laur"
Topic: 5 Word: 0.026*"ferm" + 0.024*"résolu" + 0.012*"erreur" + 0.011*"pour" + 0.011*"avant" + 0.010*"star" + 0.00

In [15]:
processed_docs[4310]

['papillon', 'coloriag']

In [16]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.7749902009963989	 
Topic: 0.019*"kat" + 0.017*"princ" + 0.016*"middleton" + 0.015*"erreur" + 0.014*"saint" + 0.014*"famill" + 0.013*"coloriag" + 0.012*"jean" + 0.012*"michel" + 0.011*"mariag"

Score: 0.02500164695084095	 
Topic: 0.026*"mort" + 0.021*"fill" + 0.016*"leur" + 0.015*"petit" + 0.014*"euros" + 0.012*"amour" + 0.012*"trop" + 0.012*"parad" + 0.011*"entre" + 0.011*"moin"

Score: 0.02500123716890812	 
Topic: 0.072*"résolu" + 0.069*"ferm" + 0.027*"fait" + 0.019*"comment" + 0.016*"vous" + 0.014*"pour" + 0.013*"dat" + 0.011*"derni" + 0.009*"vent" + 0.009*"heur"

Score: 0.0250010397285223	 
Topic: 0.040*"comment" + 0.033*"toujour" + 0.026*"linternaut" + 0.025*"cuisin" + 0.022*"ferm" + 0.022*"résolu" + 0.016*"nouveau" + 0.015*"jour" + 0.012*"sylv" + 0.011*"victim"

Score: 0.025001026690006256	 
Topic: 0.061*"dan" + 0.031*"avec" + 0.029*"pour" + 0.028*"grand" + 0.023*"mond" + 0.015*"vous" + 0.015*"coup" + 0.014*"résultat" + 0.014*"neymar" + 0.013*"rat"

Score: 0.025000985711

The test document gets its highest score (about 0.77) for a topic whose top terms include "coloriag", which matches its content; the other topics shown stay near 0.025.

In [17]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.5128396153450012	 
Topic: 0.013*"chez" + 0.013*"coloriag" + 0.010*"paris" + 0.010*"pour" + 0.010*"pass" + 0.009*"pomm" + 0.009*"terr" + 0.009*"miss" + 0.009*"cuisin" + 0.008*"château"

Score: 0.2871292233467102	 
Topic: 0.024*"toujour" + 0.023*"linternaut" + 0.020*"dan" + 0.017*"johnny" + 0.016*"résultat" + 0.016*"hallyday" + 0.010*"horoscop" + 0.008*"françois" + 0.008*"amour" + 0.008*"quel"

Score: 0.025006739422678947	 
Topic: 0.026*"ferm" + 0.024*"résolu" + 0.012*"erreur" + 0.011*"pour" + 0.011*"avant" + 0.010*"star" + 0.009*"quel" + 0.008*"comment" + 0.007*"chang" + 0.007*"avec"

Score: 0.025004452094435692	 
Topic: 0.030*"macron" + 0.021*"brigitt" + 0.018*"franc" + 0.017*"emmanuel" + 0.012*"rob" + 0.010*"pierr" + 0.010*"définit" + 0.010*"pour" + 0.009*"mar" + 0.009*"plus"

Score: 0.025004034861922264	 
Topic: 0.015*"plus" + 0.014*"maison" + 0.011*"nouvel" + 0.009*"pourquoi" + 0.009*"fil" + 0.008*"jean" + 0.008*"mort" + 0.008*"jam" + 0.008*"ferm" + 0.008*"résolu"

Score: 

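The TF-IDF model also ranks a topic containing "coloriag" first, though with a lower score (about 0.51) and a second topic at about 0.29.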

In [18]:
#Testing model on unseen document
unseen_document = 'Meghan, épouse du prince Harry, a donné naissance à un garçon'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.3743020296096802	 Topic: 0.036*"macron" + 0.030*"pour" + 0.025*"paris" + 0.023*"brigitt" + 0.022*"coup"
Score: 0.3279981315135956	 Topic: 0.040*"femm" + 0.036*"plus" + 0.030*"meghan" + 0.026*"markl" + 0.023*"journal"
Score: 0.21016591787338257	 Topic: 0.019*"kat" + 0.017*"princ" + 0.016*"middleton" + 0.015*"erreur" + 0.014*"saint"
Score: 0.01250864565372467	 Topic: 0.034*"hallyday" + 0.028*"johnny" + 0.025*"elle" + 0.022*"avec" + 0.018*"laetici"
Score: 0.01250577624887228	 Topic: 0.040*"comment" + 0.033*"toujour" + 0.026*"linternaut" + 0.025*"cuisin" + 0.022*"ferm"
Score: 0.012504009529948235	 Topic: 0.045*"pour" + 0.028*"recet" + 0.026*"meilleur" + 0.023*"maison" + 0.015*"pierr"
Score: 0.012503988109529018	 Topic: 0.072*"résolu" + 0.069*"ferm" + 0.027*"fait" + 0.019*"comment" + 0.016*"vous"
Score: 0.012503846548497677	 Topic: 0.061*"dan" + 0.031*"avec" + 0.029*"pour" + 0.028*"grand" + 0.023*"mond"
Score: 0.012503818608820438	 Topic: 0.094*"franc" + 0.046*"meteo" + 0.030*"mété

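On this unseen sentence, the model spreads most of the probability over three topics: a Macron-related topic ranks first (about 0.37), the Meghan Markle topic second (about 0.33) and the Kate Middleton topic third (about 0.21). The model therefore does surface topics related to Meghan and the royal family, although the top-ranked topic is not the expected one.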
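
One way to make this reading explicit is to report the top-ranked topic together with its margin over the runner-up. The sketch below does this for the three highest scores printed above, rounded to three decimals. The helper `dominant_topic` is hypothetical, and the labels are built from the top words of each topic because the output does not show the topic ids.

```python
def dominant_topic(topic_scores):
    """Return (best_label, best_score, margin over the second-best score)."""
    ranked = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best_score, best_score - runner_up


# Scores rounded from the cell output; labels are top words, not topic ids
scores = [
    ("femm / meghan / markl", 0.328),
    ("macron / paris / brigitt", 0.374),
    ("kat / princ / middleton", 0.210),
]
label, score, margin = dominant_topic(scores)
print(label, score, round(margin, 3))
```

A small margin signals that the assignment is uncertain and that the second topic deserves a look as well.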

In [19]:
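import pyLDAvis
# Note: in recent pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis
# model100.gensim is the LDA model trained on the TF-IDF corpus (cell 14)
lda100 = gensim.models.ldamodel.LdaModel.load('model100.gensim')
lda_display100 = gensimvis.prepare(lda100, bow_corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display100)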



As part of our comparison, the visualization shows the term frequencies for each topic. From it, we read the following topics in our dataset:
<br>
<br>
Topic 1: Weather
<br>
Topic 2: L'Internaute website
<br>
Topic 3: Weather
<br>
Topic 4: Macron and his wife
<br>
Topic 5: Forum related
<br>
Topic 6: Forum related
<br>
Topic 7: Weather
<br>
Topic 8: Kate and Harry's wedding
<br>
Topic 9: Meghan Markle and Prince Harry
<br>
Topic 10: New house
<br>
<br>

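We can see that we still miss information. We believe most of this is due to the scraping method, which is questionable (only the < h1 > tag was kept), and to the use of French as the language of our dataset: the stop-word list and the lemmatizer used above are English resources, so common French words such as "pour" and "avec" remain in the topics.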