## Assignment:

Using the models.ldamodel module from the gensim library, run topic modeling over the corpus. Explore different numbers of topics (varying from 5 to 50), and settle for the parameter which returns topics that you consider to be meaningful at first sight.

### Finding a compromise

The goal is to do topic modeling over all the mails. In other words, we have to find recurring topics or themes that appear in the conversations.
There are several ways to analyse the mail contents, starting with these two "naive" ones:
- put all the extracted mails in a single document
- put each extracted mail in a separate document

But both approaches have major drawbacks:
- topic modeling on a single document would only surface the most frequent words, so the result would be much the same as a word cloud
- many mails are very short, sometimes only a few words, so topic analysis on them would not be very meaningful

So we have to find a compromise: build multiple documents, each of them long enough to be analysed.
One of the best options would be to rebuild the entire conversations from the mail history, so that we can extract the main topics of each conversation. While this makes sense, obtaining the conversations is quite time-consuming.
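
As a rough illustration of what rebuilding conversations could look like, the sketch below groups mails by a normalized subject line. It assumes each mail comes with a subject string, which is not part of the data loaded in this notebook: the `mails` list, the `normalize_subject` helper and the reply-prefix pattern are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical input: (subject, tokens) pairs. The notebook's data has no
# subject field, so this structure is an assumption for illustration only.
mails = [
    ("Libya update", ["libya", "secur", "report"]),
    ("RE: Libya update", ["thank", "call"]),
    ("Fw: RE: Libya update", ["meet", "tomorrow"]),
    ("Schedule", ["room", "arriv"]),
]

# One or more leading "Re:", "Fw:" or "Fwd:" prefixes, in any letter case.
REPLY_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Strip leading reply/forward prefixes and lowercase the rest."""
    return REPLY_PREFIX.sub("", subject).strip().lower()

# One document per conversation: concatenate the tokens of all its mails.
conversations = defaultdict(list)
for subject, tokens in mails:
    conversations[normalize_subject(subject)].extend(tokens)

documents = list(conversations.values())
print(len(documents))
```

Real threads would need more care (changed subjects, quoted history inside the body), which is part of why this route is time-consuming.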

What we will do here is simply put each mail in a separate document, excluding mails that are too short to be analysed.

### Extracting keywords

In [204]:
import pandas as pd
from gensim import corpora, models
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re # regular expressions
import matplotlib.pyplot as plt
import numpy as np

In [205]:
# Reuse the data from question 1, which has already gone through a lot of cleaning operations
emails_cleaned = pd.read_pickle("ilovepickefiles_stemming.pickle")

In [206]:
emails_cleaned.head()

Unnamed: 0,TokenizedText
0,[]
1,"[chris, steven]"
2,"[cairo, condemn, final]"
3,"[meet, right, wing, extremist, behind, anti, m..."
4,"[anti, muslim, film, director, hide, follow, l..."


During topic modeling, we still see some words that don't really fit in any topic (e.g. "would"), so we intentionally remove some of them.

In [207]:
ignore_list = ['00', '10', '15', '30', 'also', 'would']
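
One way to find candidates for such an ignore list, rather than spotting them by eye, is to count in how many mails each token appears: tokens present in most mails carry little topical information. The sketch below is only an illustration on toy data. The `document_frequency` helper, `toy_mails` and the 0.5 threshold are assumptions, not part of the original analysis.

```python
from collections import Counter

def document_frequency(docs):
    """Return, for each token, the fraction of documents containing it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    return {token: count / n_docs for token, count in counts.items()}

# Toy data standing in for the tokenized mails.
toy_mails = [
    ["would", "meet", "state", "depart"],
    ["would", "call", "israel", "peac"],
    ["would", "also", "senat", "vote"],
    ["also", "state", "offic", "room"],
]

df = document_frequency(toy_mails)
# Tokens present in more than half of the toy mails are ignore-list candidates.
candidates = sorted(token for token, freq in df.items() if freq > 0.5)
print(candidates)
```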

In [208]:
#for mail in emails_cleaned.TokenizedText

We collect the mails into a list of token lists to prepare the corpus for analysis, excluding mails that are too short.

In [209]:
min_mail_size = [2, 3, 4, 5, 6, 10, 15, 20, 50]
print("Total number of mail: " + str(emails_cleaned.size))
for i in min_mail_size:
    text = []
    for mail in emails_cleaned.TokenizedText:
        if (len(mail) >= i):
            text.append(mail)
    ratio = len(text) / emails_cleaned.size * 100
    print("Mails with at least " + str(i) + " tokens represent " + str(ratio).zfill(4) + " % of the total.")

Total number of mail: 13002
Mails with at least 2 tokens represent 71.9350869097 % of the total.
Mails with at least 3 tokens represent 56.8527918782 % of the total.
Mails with at least 4 tokens represent 47.0389170897 % of the total.
Mails with at least 5 tokens represent 40.1169050915 % of the total.
Mails with at least 6 tokens represent 35.0176895862 % of the total.
Mails with at least 10 tokens represent 21.8120289186 % of the total.
Mails with at least 15 tokens represent 15.3976311337 % of the total.
Mails with at least 20 tokens represent 11.8289493924 % of the total.
Mails with at least 50 tokens represent 5.4606983541 % of the total.


We choose to keep mails with at least 5 tokens: such mails can contain sentences that make sense, and we still keep 40 % of the mails. This is about 5200 mails, so we should be able to extract some topics from them.

In [210]:
MIN_MAIL_SIZE = 5

In [211]:
text = []
for mail in emails_cleaned.TokenizedText:
    # Take only mails that are long enough
    if len(mail) >= MIN_MAIL_SIZE:
        # Remove unwanted words (new list, so the stored mail is not modified)
        mail_filtered = [word for word in mail if word not in ignore_list]
        text.append(mail_filtered)
ratio = len(text) / emails_cleaned.size * 100
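
A note on this filtering step: removing elements from a Python list while iterating over it is easy to get wrong, because each removal shifts the remaining elements and the loop then skips the one that follows. The toy example below (with a shortened, hypothetical `ignore_list`) contrasts in-place removal with building a new list.

```python
ignore_list = ['also', 'would']

# In-place removal while iterating: the second of two consecutive
# ignored words is skipped, because the list shifts under the loop.
mail_in_place = ['would', 'also', 'meet', 'state']
for word in mail_in_place:
    if word in ignore_list:
        mail_in_place.remove(word)
print(mail_in_place)

# Building a new list checks every token exactly once.
mail = ['would', 'also', 'meet', 'state']
filtered = [word for word in mail if word not in ignore_list]
print(filtered)
```

The first loop leaves `'also'` in the list, while the list comprehension removes both ignored words and leaves the original `mail` untouched.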

Now we convert the words of every mail into numbers, each number corresponding to a word. In other words, we convert our list of mails into a corpus, so that we can run topic modeling on it.

In [212]:
text_dictionary = corpora.Dictionary(text)
corpus = [text_dictionary.doc2bow(t) for t in text] 
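
To make the conversion concrete, here is a pure-Python sketch of the idea behind a dictionary and a bag-of-words corpus: every distinct token gets an integer id, and each mail becomes a list of (id, count) pairs. This only mimics the concept on toy data. It is not gensim's implementation (the ids gensim assigns may differ), and `build_vocabulary` and `to_bow` are hypothetical helper names.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

# Toy data standing in for the filtered mails.
toy_mails = [
    ["meet", "state", "meet"],
    ["state", "depart", "call"],
]
token2id = build_vocabulary(toy_mails)
toy_corpus = [to_bow(doc, token2id) for doc in toy_mails]
print(token2id)
print(toy_corpus)
```

Word order is lost in this representation: only which words occur in a mail, and how often, is passed on to the topic model.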

Now it is time to do the modeling. We will vary the number of topics to look for a consistent result.
Let's try different values: first 5, then 10, 25 and finally 50 topics:

In [213]:
def show_topics(lda_model):
    for i in range(lda_model.num_topics):
        topic_words = [word for word, _ in lda_model.show_topic(i, topn = 15)]
        print('Topic ' + str(i+1) + ': ', end = ' ')
        for word in topic_words:
            print(word, end = ' ')
        print("")

In [214]:
lda_model = models.LdaMulticore(corpus, id2word = text_dictionary, num_topics = 5)
show_topics(lda_model)

Topic 1:  state call said offic time secretari presid want work 2009 clinton depart hous meet like 
Topic 2:  state work depart said obama govern presid peopl right offic like american secur foreign secretari 
Topic 3:  state secretari offic time meet depart work 2010 presid govern year hous need obama nation 
Topic 4:  state obama call presid secretari year hous meet work time american peopl think said want 
Topic 5:  state right 2010 parti obama year time like american presid call foreign meet issu work 


In [215]:
lda_model = models.LdaMulticore(corpus, id2word = text_dictionary, num_topics = 10)
show_topics(lda_model)

Topic 1:  state obama presid time secretari call offic american meet said first talk secur hous depart 
Topic 2:  state said obama work presid american like depart secretari govern nation 2010 time polici 2009 
Topic 3:  state work want time nation israel call need 2010 meet secretari like obama year report 
Topic 4:  state hous call american time said depart offic secretari presid peopl work govern meet obama 
Topic 5:  state obama presid work like offic said senat 2009 hous right govern time year need 
Topic 6:  state american like govern year time 2009 last work want presid think 2010 women know 
Topic 7:  call state right said obama secretari work presid 2010 time want need today know like 
Topic 8:  state depart secretari offic said time 2009 nation meet polit rout privat peopl hous presid 
Topic 9:  state secretari depart work offic clinton hous meet time like presid said 2009 2010 obama 
Topic 10:  state offic secretari depart time meet senat room said issu work obama presid 201

In [216]:
lda_model = models.LdaMulticore(corpus, id2word = text_dictionary, num_topics = 25)
show_topics(lda_model)

Topic 1:  state need secretari peopl meet presid american govern said time last obama offic 2015 depart 
Topic 2:  state call work time obama secur year like depart talk want meet israel think 2009 
Topic 3:  state israel call secur work talk 2009 obama polici said time nation right american peac 
Topic 4:  state presid obama 2010 secretari offic meet american polit hous time said call percent nation 
Topic 5:  state call obama make meet time cheryl 2009 depart israel american mill presid 2015 secretari 
Topic 6:  call said state work time want need talk peopl think polit presid today hous govern 
Topic 7:  offic depart state meet secretari call room work time nation said arriv offici confer presid 
Topic 8:  offic state call secretari meet 2010 time said american depart presid democrat like republican parti 
Topic 9:  secretari offic depart state time call arriv meet rout hous room washington privat white nation 
Topic 10:  obama senat democrat state presid right polit republican like

In [217]:
lda_model = models.LdaMulticore(corpus, id2word = text_dictionary, num_topics = 50)
show_topics(lda_model)

Topic 1:  obama american clinton presid back democrat think last right like peopl time offici state polit 
Topic 2:  secretari state offic depart time meet obama presid call hous clinton talk govern white secur 
Topic 3:  state obama hous today said senat israel vote presid work 2010 offic meet polit iran 
Topic 4:  state secretari said like 2010 offic presid time call work room depart good hous year 
Topic 5:  call state like american meet work time right email 2009 think korea north year want 
Topic 6:  state said presid secretari clinton peopl time american work obama first year like call want 
Topic 7:  state work percent presid call obama republican right said american secretari democrat time like come 
Topic 8:  state said call hous presid time foreign today govern work know want leader unit first 
Topic 9:  time israel work polici said take talk state need american right make parti call want 
Topic 10:  state secretari offic depart meet 2010 room work hous presid time like obama

### Observations

First, note that there is some unwanted word cropping, a side effect of the stemming done during cleaning ("secretary" becomes "secretari"), but the topics are still readable and this shouldn't give totally different results.

The goal was to group words that relate to the same topic. The results are not conclusive: regardless of the number of topics, the same words keep reappearing: "obama", "state", "secretari", "call"... It is difficult to give distinct names to many of the topics, because they look so much alike.
What we can tell for sure is that an "administrative" topic is recurrent: state, secretari, call, obama, offic... The result isn't very exciting!
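
The impression that the topics all look alike can be quantified with a simple overlap measure. The sketch below computes the Jaccard similarity between the top-15 word lists of Topic 1 and Topic 2 from the 5-topic run above. The `jaccard` helper is an illustration added here, not something computed in the original analysis.

```python
def jaccard(words_a, words_b):
    """Size of the intersection divided by the size of the union of two word lists."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Top words copied from the 5-topic output above.
topic_1 = ("state call said offic time secretari presid want work 2009 "
           "clinton depart hous meet like").split()
topic_2 = ("state work depart said obama govern presid peopl right offic "
           "like american secur foreign secretari").split()

shared = sorted(set(topic_1) & set(topic_2))
print(shared)
print(round(jaccard(topic_1, topic_2), 2))
```

Eight of the fifteen top words are shared between these two topics, which is consistent with the difficulty of telling them apart.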