<h2> Work on Topic Modeling </h2>
<h3> Authors: CHOI Iatin, KAO Weiting, EL KHALIL Marwan, KHIEU Thomas, MAHON Matthew </h3>
<p> For this NLP work, we used the dataset provided for the exercise. The raw dataset was not usable as delivered, so we first had to use VBA to make it readable by Python.</p>


#Excel VBA
<br>
<br>
Sub AppendToExistingOnLeft()
<br>
Dim c As Range
<br>
For Each c In Selection
<br>
If c.Value <> "" Then c.Value = "https://" & c.Value
<br>
Next
<br>
End Sub
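
For readers who prefer to stay in Python, the macro above can be mirrored in a few lines. This is only a sketch of the same rule (prefix every non-empty cell with `https://`); the function name `add_https_prefix` and the sample values are illustrative and do not come from the original workbook.

```python
def add_https_prefix(cells):
    """Prepend 'https://' to every non-empty cell; empty cells stay empty."""
    return ["https://" + cell if cell != "" else cell for cell in cells]


# Hypothetical sample values standing in for one spreadsheet column.
column = ["www.example.com/page", "", "example.org"]
print(add_https_prefix(column))
```

Like the macro, this does not check whether a cell already starts with a scheme, so it should only be run once on the raw column.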

Once the data was in a usable state, we ran the code available at this address: https://nbviewer.jupyter.org/github/ThomasKhieu/Python/blob/master/NLPscraping.ipynb
<br>
This code was used to scrape the 70k rows of URLs in the file. Unfortunately, we only scraped 50k rows; the remaining 20k were too slow to process in the Python program.
Out of the 50k rows, only ~33k were usable by our Python program.
For the scraping itself, we only collected the < h1 > tag of each webpage. We believe that, because of digital marketing, the page title should carry the most accurate information so that the page is relevant for Google and other search engines; we therefore did not scrape the whole page content.
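
To make the "h1 only" choice concrete, here is a minimal sketch of extracting heading text from an HTML string with the standard library. It is not the code from the linked notebook: the class `H1Extractor`, the helper `extract_h1`, and the sample HTML are assumptions made for illustration, and downloading the page is left out.

```python
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collect the text found inside <h1> elements of an HTML document."""

    def __init__(self):
        super().__init__()
        self._depth = 0     # greater than 0 while inside an <h1> element
        self._parts = []    # text fragments of the heading being read
        self.headings = []  # finished heading strings

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.headings.append(text)
                self._parts = []

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_h1(html):
    """Return the list of <h1> texts found in an HTML string."""
    parser = H1Extractor()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical page content, for illustration only.
sample = "<html><body><h1>Météo <span>Paris</span></h1><p>Texte</p></body></html>"
print(extract_h1(sample))
```

Text nested in inline tags inside the heading is kept, while everything outside the heading is ignored, which matches the idea of treating the title as the summary of the page.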

In [1]:
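# Tokenizer: split French text into lowercase tokens, masking URLs and @handles
import spacy
from spacy.lang.fr import French
spacy.load("fr_core_news_sm")  # return value unused; this only checks that the French model is installed
parser = French()  # blank French pipeline, used here for tokenization only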
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

In [2]:
#Import word meaning with NLTK
import nltk
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
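def get_lemma(word):
    # wn.morphy looks the word up in WordNet (English by default) and returns
    # None when it finds nothing; in that case the word is kept unchanged.
    lemma = wn.morphy(word)
    return word if lemma is None else lemma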
    
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to /Users/Phoenix/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Stop words (NLTK's English list)
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Phoenix/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
#text preparation
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
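
The preparation step above filters with an English stop word list, while the scraped titles are French. The sketch below shows the same length and stop word filter applied with a French list. The set `FR_STOP_SAMPLE` is a small hand-picked sample used for illustration, not a complete list, and `filter_tokens` is a hypothetical helper. In practice a full list could be taken from `nltk.corpus.stopwords.words('french')` or `spacy.lang.fr.stop_words.STOP_WORDS`, whose exact contents are not assumed here.

```python
# Illustrative, hand-picked sample: NOT a complete French stop word list.
FR_STOP_SAMPLE = {
    "après", "avant", "toujours", "comment",
    "pourquoi", "contre", "cette", "votre",
}


def filter_tokens(tokens, stop_words, min_length=5):
    """Keep tokens with at least min_length characters that are not stop words."""
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]


tokens = ["après", "macron", "toujours", "meteo", "paris", "le"]
print(filter_tokens(tokens, FR_STOP_SAMPLE))
```

`min_length=5` reproduces the `len(token) > 4` condition used in `prepare_text_for_lda`.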

In [5]:
#opening the file
text_data = []
with open('output.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        text_data.append(tokens)

In [6]:
# Build the dictionary, convert each document to bag-of-words, then save both
import pickle
from gensim import corpora
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
with open('corpus.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')
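
For readers unfamiliar with the bag-of-words format, the following dependency-free sketch shows the kind of structure this step produces: each document becomes a list of `(token_id, count)` pairs. It is not gensim's implementation; `build_vocabulary` and `to_bow` are hypothetical helpers, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_bow(tokens, vocabulary):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())


docs = [["meteo", "paris", "meteo"], ["macron", "paris"]]
vocab = build_vocabulary(docs)
print(vocab)
print(to_bow(["paris", "meteo", "meteo", "inconnu"], vocab))
```

Tokens that are not in the vocabulary are dropped here, so a new document written in words the corpus has never seen would be reduced to very few pairs.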

In [7]:
#topic finder
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=80)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.012*"après" + 0.010*"avant" + 0.008*"maison" + 0.007*"contre"')
(1, '0.022*"france" + 0.015*"macron" + 0.014*"hallyday" + 0.014*"toujours"')
(2, '0.036*"résolu" + 0.034*"fermé" + 0.011*"comment" + 0.008*"faire"')
(3, '0.018*"meteo" + 0.013*"meghan" + 0.011*"markle" + 0.011*"paris"')
(4, '0.009*"pourquoi" + 0.009*"cuisine" + 0.008*"mariage" + 0.008*"recettes"')


Here we can see 5 topics extracted from our "output.csv" file. We appear to have topics related to Johnny Hallyday (a French singer), the weather, something that looks forum-related (solved threads, closed threads), Emmanuel Macron (the French president) and his wife, and a blurry fifth topic.
So far we lack the information needed to define more topics, so we will run the code once more with 10 topics to see the trends.
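
When comparing runs with different topic counts, it can help to turn the printed topic strings back into data. The sketch below parses strings in the format shown in the output above and checks which words two topics share. The helpers `parse_topic` and `shared_words` are hypothetical and not part of gensim; the two sample strings are copied from the 5-topic output.

```python
import re

# Matches one weight*"word" term in a printed topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn '0.022*"france" + 0.015*"macron"' into [('france', 0.022), ('macron', 0.015)]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


def shared_words(topic_a, topic_b):
    """Return the set of words that appear in both topic strings."""
    words_a = {word for word, _ in parse_topic(topic_a)}
    words_b = {word for word, _ in parse_topic(topic_b)}
    return words_a & words_b


topic_1 = '0.022*"france" + 0.015*"macron" + 0.014*"hallyday" + 0.014*"toujours"'
topic_3 = '0.018*"meteo" + 0.013*"meghan" + 0.011*"markle" + 0.011*"paris"'
print(parse_topic(topic_1))
print(shared_words(topic_1, topic_3))
```

With only four words printed per topic, overlap is a coarse signal; printing more words per topic would make such a comparison more informative.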

In [8]:
#new document
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(877, 1)]
[(0, 0.100017056), (1, 0.10001691), (2, 0.100016974), (3, 0.10001707), (4, 0.599932)]
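
The output above can be read as follows: only one token of the English test sentence exists in the dictionary (id 877), so the model has very little evidence to work with. The arithmetic below is a rough sketch of why the result is close to 0.1 for four topics and 0.6 for one. It assumes gensim's default symmetric prior (`alpha = 1 / num_topics`) and that the single known token is attributed entirely to topic 4; the real inference is iterative, so this is an approximation, and `expected_topic_mix` is a hypothetical helper.

```python
def expected_topic_mix(num_topics, tokens_per_topic, alpha=None):
    """Normalise (alpha + token count) per topic as a rough document-topic estimate."""
    if alpha is None:
        alpha = 1.0 / num_topics  # assumed symmetric prior
    weights = [alpha + tokens_per_topic.get(k, 0) for k in range(num_topics)]
    total = sum(weights)
    return [w / total for w in weights]


# One known token, assumed to be fully assigned to topic 4.
mix = expected_topic_mix(5, {4: 1})
print([round(p, 3) for p in mix])

# The dominant topic is the one with the highest probability.
dominant = max(range(len(mix)), key=lambda k: mix[k])
print(dominant)
```

Under these assumptions the weights are 0.2 for topics 0 to 3 and 1.2 for topic 4, which sum to 2.0. The high score for topic 4 therefore reflects a single matching word rather than a confident classification.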


In [9]:
# Same model, this time with 10 topics
NUM_TOPICS = 10
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=80)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.078*"résolu" + 0.071*"fermé" + 0.022*"comment" + 0.014*"cette"')
(1, '0.017*"femme" + 0.017*"contre" + 0.013*"nouveau" + 0.012*"quelle"')
(2, '0.030*"après" + 0.029*"comment" + 0.023*"monde" + 0.021*"macron"')
(3, '0.024*"prince" + 0.021*"fille" + 0.016*"grand" + 0.014*"ligne"')
(4, '0.035*"toujours" + 0.025*"coupe" + 0.021*"femmes" + 0.014*"journal"')
(5, '0.027*"hallyday" + 0.022*"johnny" + 0.021*"meghan" + 0.018*"markle"')
(6, '0.064*"france" + 0.014*"premier" + 0.014*"laurent" + 0.014*"euro"')
(7, '0.022*"votre" + 0.020*"recettes" + 0.019*"emmanuel" + 0.016*"couple"')
(8, '0.057*"meteo" + 0.016*"français" + 0.014*"horoscope" + 0.014*"immobilier"')
(9, '0.036*"paris" + 0.029*"météo" + 0.022*"cuisine" + 0.020*"photo"')


We now have more topics to work with. How would they render in a visualization?

In [12]:
# Visualization
import pyLDAvis
import pyLDAvis.gensim as gensimvis  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.x
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = gensimvis.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)

