<h2> Work on Topic Modeling </h2>
<h3> Authors: CHOI Iatin, KAO Weiting, EL KHALIL Marwan, KHIEU Thomas, MAHON Matthew </h3>
<p> For this NLP work, we used the dataset provided for the exercise. The raw dataset was not usable as delivered, so we first had to use VBA to make it readable for Python.</p>


#Excel VBA
<br>
<br>
Sub AppendToExistingOnLeft()
<br>
Dim c As Range
<br>
For Each c In Selection
<br>
If c.Value <> "" Then c.Value = "https://" & c.Value
<br>
Next
<br>
End Sub
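
For readers without Excel, the same prefixing step can be sketched in Python. This is only an illustration of what the macro does to the selected cells; the function name `add_https_prefix` and the sample values are hypothetical and not part of our workflow.

```python
def add_https_prefix(values):
    """Prefix every non-empty cell with 'https://', like the VBA macro."""
    return ["https://" + v if v != "" else v for v in values]

# Hypothetical cell values; empty cells are left untouched.
cells = ["www.example.com/page", "", "example.org"]
prefixed = add_https_prefix(cells)
print(prefixed)
```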

Once the data was in a usable state, we ran the code available at this address: https://nbviewer.jupyter.org/github/thomaskhieu/Python-NLP/blob/master/NLPScrapingFINAL.ipynb
<br>
This code was used to scrape the 70k rows of URLs in the file.
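
The scraping notebook is not reproduced here. As a rough idea of the step that turns one downloaded page into plain text, here is a minimal sketch using only the standard library. It is not the code from the linked notebook: the names `VisibleTextParser` and `html_to_text` and the sample HTML are assumptions made for illustration, and the download itself is left out.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page content standing in for one scraped URL.
sample_html = (
    "<html><head><title>Recettes</title><style>p{color:red}</style></head>"
    "<body><p>Cuisine du jour</p><script>var x = 1;</script></body></html>"
)
page_text = html_to_text(sample_html)
print(page_text)
```

Writing one such line per URL would give a file with one document per line, which is the shape the rest of this notebook expects from "output.csv".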

In [1]:
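#Tokenizer process
import spacy
from spacy.lang.fr import French
#Fails early if the French model is not installed; its pipeline is not used below
spacy.load("fr_core_news_sm")
#Blank French pipeline: tokenization only
parser = French()
def tokenize(text):
    """Return the lower-cased tokens of text, skipping whitespace tokens."""
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        lda_tokens.append(token.lower_)
    return lda_tokens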

In [2]:
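#Lemmatization helpers based on NLTK's WordNet
import nltk
import ssl
#Workaround for SSL certificate errors during nltk.download:
#use an unverified HTTPS context when the ssl module provides one
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download('wordnet')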
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to /Users/Phoenix/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
#stopwords
nltk.download('stopwords')
fr_stop = set(nltk.corpus.stopwords.words('french'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Phoenix/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
#text preparation
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in fr_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
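
To make the effect of the length and stop-word filters concrete, here is a small stand-alone sketch. It replaces the spaCy tokenizer with a simple regular expression and uses a three-word stop list, so `simple_tokenize`, `demo_prepare` and `DEMO_STOP` are illustrative stand-ins, not the functions used in this notebook, and the lemmatization step is left out.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's French list.
DEMO_STOP = {"cette", "comme", "aussi"}

def simple_tokenize(text):
    """Lower-case the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

def demo_prepare(text, stop_words=DEMO_STOP, min_len=5):
    """Apply the same two filters as prepare_text_for_lda, without lemmas."""
    tokens = simple_tokenize(text)
    tokens = [t for t in tokens if len(t) >= min_len]  # same as len(token) > 4
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

demo_tokens = demo_prepare("Cette recette de cuisine est aussi une recette de saison")
print(demo_tokens)
```

Short words such as "de" and "est" are dropped by the length filter alone, which is why the stop-word list matters mostly for longer function words.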

In [5]:
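#opening the file: each line of output.csv is treated as one document
text_data = []
#encoding is assumed to be UTF-8; adjust if the file was exported differently
with open('output.csv', encoding='utf-8') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        text_data.append(tokens)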

In [6]:
#dictionary creation then conversion to bag-of-words
from gensim import corpora
import pickle
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
#save the corpus and dictionary so the model can be reloaded later
with open('corpus.pkl', 'wb') as corpus_file:
    pickle.dump(corpus, corpus_file)
dictionary.save('dictionary.gensim')
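
The bag-of-words conversion can be illustrated without gensim. The sketch below is a simplified stand-in for `corpora.Dictionary` and `doc2bow`: the helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and the integer ids it assigns will not match the ones gensim produces.

```python
from collections import Counter

def build_vocabulary(documents):
    """Give each distinct token an integer id, in order of first appearance."""
    token_to_id = {}
    for doc in documents:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

# Two toy documents, already tokenized.
docs = [["recette", "cuisine", "recette"], ["ouragan", "cuisine"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(d, vocab) for d in docs]
print(vocab)
print(bows)
```

Word order is discarded at this point: the model only sees which terms occur in a document and how often.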

In [7]:
#topic finder
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=100)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.047*"santé" + 0.042*"actualités" + 0.040*"européennes" + 0.038*"article"')
(1, '0.215*"moment" + 0.024*"fermé" + 0.023*"définition" + 0.021*"résolu"')
(2, '0.102*"cuisine" + 0.012*"audio" + 0.012*"recettes" + 0.010*"météo"')
(3, '0.193*"newsletter" + 0.014*"ouragan" + 0.012*"rédaction" + 0.012*"réponses"')
(4, '0.104*"people" + 0.062*"sommaire" + 0.031*"france" + 0.019*"plateformes"')


Here we can see 5 topics extracted from our "output.csv" file. They appear to relate to Johnny Hallyday (a French singer), the weather, forum threads (solved thread, closed thread), Emmanuel Macron (the French president) and his wife, plus a fifth topic that is harder to interpret.
Five topics are too few to separate the themes cleanly, so we run the code one more time with 10 topics to see the trends.
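
When comparing runs, it can help to turn the printed topic strings into (word, weight) pairs. The helper below is a minimal sketch for that purpose; `parse_topic` is a hypothetical name, and the input is one of the lines printed by the 5-topic model.

```python
import re

def parse_topic(topic_string):
    """Turn '0.215*"moment" + 0.024*"fermé"' into [(word, weight), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# One line of output copied from the 5-topic run.
printed = (1, '0.215*"moment" + 0.024*"fermé" + 0.023*"définition" + 0.021*"résolu"')
topic_id, terms = printed[0], parse_topic(printed[1])
top_word = max(terms, key=lambda pair: pair[1])[0]
print(topic_id, terms, top_word)
```

gensim can also return the pairs directly through `show_topics(formatted=False)`, which avoids parsing strings when the model object is still in memory.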

In [8]:
#new document
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(1485, 1), (11157, 1)]
[(0, 0.08966007), (1, 0.63650686), (2, 0.09407576), (3, 0.08984695), (4, 0.08991033)]
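
The distribution printed above can be reduced to a single dominant topic. The sketch below does this for the values shown in the output; the variable names are illustrative.

```python
# Topic distribution copied from the output above.
doc_topics = [(0, 0.08966007), (1, 0.63650686), (2, 0.09407576),
              (3, 0.08984695), (4, 0.08991033)]

# The dominant topic is the pair with the largest probability.
dominant_topic, dominant_weight = max(doc_topics, key=lambda pair: pair[1])

# The probabilities should add up to roughly 1.
total_weight = sum(weight for _, weight in doc_topics)
print(dominant_topic, round(dominant_weight, 3), round(total_weight, 3))
```

Note that the bag-of-words for this English sentence contains only two dictionary entries, so the assignment to topic 1 rests on very little evidence.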


In [9]:
NUM_TOPICS = 10
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=100)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.018*"cette" + 0.016*"victimes" + 0.014*"lanta" + 0.014*"florence"')
(1, '0.294*"newsletter" + 0.085*"européennes" + 0.079*"election" + 0.068*"sondages"')
(2, '0.033*"ouragan" + 0.029*"après" + 0.027*"maman" + 0.019*"conditions"')
(3, '0.345*"cuisine" + 0.072*"définition" + 0.043*"plateformes" + 0.043*"assistance"')
(4, '0.191*"people" + 0.114*"sommaire" + 0.022*"travail" + 0.021*"grève"')
(5, '0.026*"hallyday" + 0.025*"johnny" + 0.020*"comme" + 0.016*"recette"')
(6, '0.389*"moment" + 0.086*"actualités" + 0.078*"article" + 0.040*"france"')
(7, '0.039*"réponses" + 0.038*"aussi" + 0.025*"identité" + 0.024*"foucault"')
(8, '0.102*"santé" + 0.046*"fermé" + 0.041*"résolu" + 0.037*"comment"')
(9, '0.025*"lettre" + 0.021*"photo" + 0.020*"autres" + 0.020*"piste"')


We now have more topics to work with. How do they look in a visualization?

In [10]:
#Visualization
import pyLDAvis
import pyLDAvis.gensim as gensimvis
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
#sort_topics=False keeps the model's topic order in the display
lda_display10 = gensimvis.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)



The visualization lets us see the term frequencies for each topic. From it, we read the following topics in our dataset:
<br>
<br>
Topic 1: Cooking
<br>
Topic 2: Florence from Koh-Lanta
<br>
Topic 3: European Elections
<br>
Topic 4: Hurricane
<br>
Topic 5: Cooking
<br>
Topic 6: Work related strikes
<br>
Topic 7: Johnny Hallyday
<br>
Topic 8: French Weather
<br>
Topic 9: Forum related
<br>
Topic 10: Forum related to healthcare
<br>
<br>
The second notebook will serve as a comparison to this one.
One pain point to highlight is that the dataset is in French, whereas the NLTK tools we used are geared toward English. Some NLTK components, such as the French stop-word list, do support French, but we believe the result is suboptimal compared with what an English dataset would have given. Other factors may also limit the accuracy of the topic definitions; we discuss them in the second notebook.