# 01 - Preprocessing
>Generate a word cloud based on the raw corpus -- I recommend you to use the Python word_cloud library. With the help of nltk (already available in your Anaconda environment), implement a standard text pre-processing pipeline (e.g., tokenization, stopword removal, stemming, etc.) and generate a new word cloud. Discuss briefly the pros and cons (if any) of the two word clouds you generated.

In [None]:
import pandas as pd
import numpy as np
from os import path
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import gensim
%matplotlib inline
%load_ext autoreload
%autoreload 2


## 1. Loading the emails and retrieving the raw text
Let us first load the DataFrame and see what we are dealing with.

In [None]:
email_df = pd.read_csv("hillary-clinton-emails/Emails.csv")

Before going any further, let us explore these emails a little, so that we know what we are working with. Below we print how many emails there are and which fields the DataFrame contains.

In [None]:
print(email_df.shape)
print(email_df.columns)
email_df.head()

The text we are interested in sits mostly in two columns of this large *Emails.csv* file: `ExtractedSubject` and `ExtractedBodyText`. We merge these two columns into a single one (replacing *NaN* with an empty string) and use this new column to draw our first word cloud.

In [None]:

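# Join subject and body with a space so the last subject word and the first body word stay separate.
email_df['Text'] = email_df['ExtractedSubject'].fillna('') + ' ' + email_df['ExtractedBodyText'].fillna('')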
email_text = list(email_df['Text'])

`WordCloud` takes a single string as input, so we first join our list into one string. We can then display the word cloud of Hillary Clinton's emails, built from the raw text.

In [None]:
#Merging everything into one big chunk of text.
emails_str = ' '.join(email_text)
wordcloud = WordCloud().generate(emails_str)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Many of the displayed words are not very *relevant* to understanding the contents of the emails; they are not filtered out because they are specific to email vocabulary. We should remove words like *Re* and *Fw*, as well as other relatively meaningless words like *pm*, *new*, *also*, ..., in order to obtain something more representative of the content of the emails. This is what we do now with a text preprocessing pipeline.
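
To check which words dominate the raw text, one can count token frequencies directly. The snippet below is a minimal sketch on a toy list of strings (`sample_emails` is a hypothetical stand-in for `email_text`, and `top_words` is a helper introduced here). It is not how `WordCloud` counts words internally, only a quick way to see why header words such as *Re* and *FW* stand out.

```python
from collections import Counter

# Hypothetical stand-in for `email_text`: a few raw subject/body strings.
sample_emails = [
    "Re: Schedule FW: call at 3 pm",
    "FW: Re: new schedule",
    "Re: State Department call",
]

def top_words(texts, n=3):
    """Return the n most frequent whitespace-separated tokens, lowercased."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(n)

print(top_words(sample_emails))
```

On the real corpus, the same call on `email_text` would give a rough list of candidates for the custom stop words.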

## 2. Preprocessing pipeline

### a. Tokenization 
We choose to use the *WordPunctTokenizer*, which, according to its documentation, will:
> Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp `\w+|[^\w\s]+`.

You can see below how it splits the text into words and numbers, and leaves punctuation such as **.**, **,** and **:** as separate tokens. The next step is simply to remove those, as they carry no useful information. We will also remove all numeric tokens.

In [None]:
from nltk.tokenize import WordPunctTokenizer

email_text_processed_token = [WordPunctTokenizer().tokenize(email) for email in email_text]

In [None]:
print(email_text[0:3])
print(email_text_processed_token[0:3])

We see that we still have to remove the numbers and punctuation. To this end, we will use the very useful `isalpha()` function, which 
>returns true if all characters in the string are alphabetic and there is at least one character, false otherwise.

In [None]:
email_text_processed_token = [[word.lower() for word in email if word.isalpha()]
                        for email in email_text_processed_token]

print(email_text[0:3])
print(email_text_processed_token[0:3])
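
As a small self-contained illustration of the two steps above, the sketch below applies the regular expression quoted in the tokenizer description with the standard `re` module, then filters with `isalpha()`. The helper name `tokenize_and_clean` and the sample sentence are made up for this example; the notebook itself relies on nltk's `WordPunctTokenizer`.

```python
import re

# Pattern quoted in the WordPunctTokenizer description above.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def tokenize_and_clean(text):
    """Split text into word/punctuation tokens, then keep lowercased alphabetic tokens."""
    tokens = WORD_PUNCT.findall(text)
    return [tok.lower() for tok in tokens if tok.isalpha()]

sample = "FW: Meeting at 3:30, Room 12"
print(WORD_PUNCT.findall(sample))   # words, numbers and punctuation as separate tokens
print(tokenize_and_clean(sample))   # only lowercased alphabetic tokens remain
```

Note that `isalpha()` also drops mixed tokens such as *g20*, since they contain digits.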

We got the desired tokenization of our data. We now have to remove all the stop words from the text.

### b. Stopwords removal

In order to remove the common words from our text, we load a *stopwords* dictionary, to which we add our custom stop words, in particular *fw*, *re*, *pm* and the like. This was an iterative process; what we show below is the resulting list, with all the words we thought should be filtered.

Note that it is necessary to download the *stopwords* dictionaries in order to make them available.

In [None]:
nltk.download("stopwords")

Just below, we look at the common English words that we will remove from our emails, together with our custom stop words.

In [None]:
stop_list = ['re','fw','h','pm','docx']
stop_list.extend(stopwords.words('english'))
print(stop_list)

Let us now remove those stop words from our email corpus. 

In [None]:
email_text_processed = [[word for word in email if word not in stop_list]
                        for email in email_text_processed_token]

The result below shows that words like *FW*, numbers and common words have been removed, which is what we wanted.

In [None]:
print(email_text[0:3])
print(email_text_processed[0:3])

Before moving on to the last step of our preprocessing, namely the stemming, we want to display an intermediary word cloud, to see whether this processing improved the "quality" of the word cloud. 

In [None]:
#Merging everything into one big chunk of text.
emails_str = ' '.join([' '.join(email) for email in email_text_processed])
wordcloud = WordCloud().generate(emails_str)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

This word cloud improves on the previous one in several ways. First of all, we removed many meaningless words that were omnipresent in the emails (*Re:*, *FW:* and others), so we see more clearly what the emails are actually about. One of the most prominent words is *state*, which is expected since Hillary Clinton was Secretary of State, and the word also refers to the states of the United States. We also find *secretary*, *president*, *obama* and *office*, which are all very common. The only downside is that we lost the capital letters, because lowercasing is needed to match the stop words. (It could be done otherwise, but the work would end up being much more complicated.)

### c. Stemming 

The last step is to reduce the words to their *common* root, i.e. map *scientist* and *science* to the same root (e.g. *sci*). We end up with many *roots* that are not proper words, as will become clear below.

In [None]:
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()

email_text_processed_st = [[st.stem(word) for word in email]
                        for email in email_text_processed]

In [None]:
print(email_text_processed_st[0:3])

In [None]:
#Merging everything into one big chunk of text.
emails_str = ' '.join([' '.join(email) for email in email_text_processed_st])
wordcloud = WordCloud().generate(emails_str)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Here we find roughly the same topics as above, but in their *stemmed* form. This makes it hard to tell exactly which words were reduced to each stem and is, in our opinion, not the best way of visualising the content of the emails. Stemming can also create strange associations: for instance, *secretary* and *secret* could end up with the same stem. Our preferred word cloud is the second one, which highlights the content of the emails quite effectively.
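
To make the kind of collision mentioned above concrete, here is a deliberately naive suffix-stripping sketch. It is **not** the Lancaster algorithm used in the notebook: the suffix list and the helpers `toy_stem` and `group_by_stem` are hypothetical and only illustrate how unrelated words can fall onto the same stem.

```python
from collections import defaultdict

# Hypothetical suffix list, for illustration only; real stemmers use far richer rules.
SUFFIXES = ("ary", "ing", "ed", "s")

def toy_stem(word, min_len=3):
    """Strip the first matching suffix, keeping at least min_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word

def group_by_stem(words):
    """Map each stem to the set of original words that reduce to it."""
    groups = defaultdict(set)
    for word in words:
        groups[toy_stem(word)].add(word)
    return dict(groups)

groups = group_by_stem(["secretary", "secret", "secrets", "meeting", "state"])
print(groups)
```

With this toy rule set, *secretary*, *secret* and *secrets* all end up under the stem *secret*, which is exactly the ambiguity a stemmed word cloud cannot show.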

# 02 - Finding the mentions of world countries
> Find all the mentions of world countries in the whole corpus, using the `pycountry` utility (HINT: remember that there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.) Perform sentiment analysis on every email message using the demo methods in the `nltk.sentiment.util` module. Aggregate the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level) that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo methods from the sentiment analysis module -- can you find substantial differences?

We first import all the countries into a list and count the occurrences of each country in each email. It is not easy to find a dictionary that contains the demonym of each country (like *swiss* for *Switzerland* or *american* for the *United States*), so we work with the non-stemmed version of the text. Stemming would only help if we had such demonyms for every country: *afghan* and *afghanistan* may well reduce to the same stem, but *american* and *United States* certainly will not. Using the stemmed version would therefore introduce a bias towards the countries whose name and demonym share a stem. This is why we choose **NOT** to work with the stemmed version of the emails.

First of all, we load the countries, and store their lowercase version into a list.

In [None]:
import pycountry
countries = [country.name.lower() for country in list(pycountry.countries)]

Before properly discussing sentiment analysis, we want to see which countries are mentioned most often in the emails and plot the resulting histogram. Below, we count how often each country name appears in the emails, iterating over all of them. We perform the sentiment analysis in the same loop, but discuss it a bit later on. First, we load the lexicons we need to detect whether a sentence is positive or negative.

In [None]:
nltk.download('opinion_lexicon')
nltk.download('vader_lexicon')

Now that we have these lexicons, we wanted to work directly with the `nltk` sentiment analysis methods, but their return types did not fit our needs. We therefore slightly rewrote the two main methods we use, namely `demo_vader_instance` and `demo_liu_hu_lexicon`, so that they return what we need. The original code can be found on [this](http://www.nltk.org/_modules/nltk/sentiment/util.html#demo_vader_instance) page. We will discuss a bit later on which one we used and why.

In [None]:
def demo_vader_instance(text):
    """
    Build a Vader sentiment analyzer.

    Unlike the original nltk demo, nothing is printed: the analyzer itself is
    returned so that the caller can run `polarity_scores` on it.

    @param text: unused, kept to mirror the signature of the nltk demo.
    @return vader_analyzer : a SentimentIntensityAnalyzer instance
    """
    from nltk.sentiment import SentimentIntensityAnalyzer
    vader_analyzer = SentimentIntensityAnalyzer()
    return vader_analyzer

def demo_liu_hu_lexicon(sentence,positive,negative,tokenizer):
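    """
    Basic example of sentiment classification using Liu and Hu opinion lexicon.
    This function simply counts the number of positive and negative words
    in the sentence and classifies it depending on which polarity is more represented.
    Words that do not appear in the lexicon are considered as neutral.

    :param sentence: a sentence whose polarity has to be classified.
    :param positive: collection of positive words from the opinion lexicon.
    :param negative: collection of negative words from the opinion lexicon.
    :param tokenizer: tokenizer exposing a `tokenize(str)` method.
    :return: 1 if positive words dominate, -1 if negative words dominate, 0 otherwise.
    """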

    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]
    pos = len([y for y in tokenized_sent if (len(y)>1) and (y in positive)])
    neg = len([y for y in tokenized_sent if (len(y)>1) and (y in negative)])

    return (pos>neg)-(neg>pos)

Before going further, let us just see a small example of the way each method works, and what type of result it returns.

In [None]:
# Extract a given email
str_ = ' '.join(email_text_processed[5])

# Print it and perform sentiment analysis
print(str_)
print("\nSubjectivity analysis: ",end="\t")
nltk.sentiment.util.demo_sent_subjectivity(str_)
print("Liu Hu lexicon :",end="\t")
nltk.sentiment.util.demo_liu_hu_lexicon(str_)
print("Vader Demo :", end="\t")
print(demo_vader_instance(str_).polarity_scores(str_))

The classification is negative here, which matches the clearly negative content of the email. Note something that could be a problem later on: the emails contain quite a few spelling mistakes and abbreviations. Fixing them would take far too much time.

Then, in the next cell, we run the sentiment analysis proper and count the occurrences of every country in the emails. The `frequency` dictionary stores the number of emails in which each country appears, while the `pola` dictionary stores the sum of the polarities (-1, 0 or 1) of the emails in which a given country appears.

Note that we choose to use the Vader sentiment analyser: the Liu Hu one, which is often more precise, is also much slower, making the task nearly endless even though there are only around 8000 emails to process. The choice could have been different had we run the Liu Hu analyser once on the whole corpus and stored the result.
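
Part of the cost of a lexicon-based classifier is the membership test performed for every token: if the lexicon is held in a list-like object, each lookup scans it, whereas a `set` gives constant-time lookups on average. The sketch below shows the counting rule with sets; the two miniature lexicons are hypothetical stand-ins for the real opinion lexicon, and `lexicon_polarity` is a name introduced here.

```python
# Hypothetical miniature lexicons; the real ones come from nltk's opinion_lexicon.
positive_words = {"good", "great", "support"}
negative_words = {"bad", "attack", "crisis"}

def lexicon_polarity(tokens, positive, negative):
    """Return 1, -1 or 0 depending on which polarity has more matching tokens."""
    pos = sum(1 for tok in tokens if tok in positive)
    neg = sum(1 for tok in tokens if tok in negative)
    # Booleans behave as 0/1, so this is the sign of (pos - neg).
    return (pos > neg) - (neg > pos)

print(lexicon_polarity(["great", "support", "crisis"], positive_words, negative_words))
```

With the real lexicons, the same idea would amount to wrapping them once, e.g. `set(opinion_lexicon.positive())`, before the loop. We have not measured how much this would speed up the Liu Hu approach on this corpus.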

We also only work at the granularity of a **whole** email. This may lead to an imprecise polarity classification. Indeed, imagine an email whose content is extremely positive, except around the mention of a country, where it is very negative. The total would still be positive, even though the country is mentioned negatively. An improvement would be to split the email into sentences, perform sentiment analysis on each of them, and look at the result for the sentences surrounding the mention of a country. This would not be simple to implement, and we did not take the time to do it. Because of this issue, we will probably end up with many more **positive rankings than negative ones**.
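
For reference, the sentence-level idea could look roughly like the sketch below. It is only an outline under several assumptions: sentences are split with a naive regular expression, the text still contains its punctuation (so it would have to run on the raw emails rather than on `email_text_processed`), and `toy_score` is a made-up scorer standing in for a real sentiment function.

```python
import re

def sentences_mentioning(text, country):
    """Return the sentences of `text` that contain `country` as a whole word."""
    # Naive split after ., ! or ?; real emails may need a proper sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(country) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

def local_polarity(text, country, score):
    """Average `score(sentence)` over the sentences mentioning `country`; None if there are none."""
    hits = sentences_mentioning(text, country)
    if not hits:
        return None
    return sum(score(s) for s in hits) / len(hits)

# Hypothetical scorer used only for this illustration.
def toy_score(sentence):
    return -1.0 if "crisis" in sentence.lower() else 1.0

email = "Great meeting today. The crisis in Libya is getting worse. Thanks to all!"
print(local_polarity(email, "libya", toy_score))
```

Here only the middle sentence contributes to the score for *libya*, so the positive sentences around it no longer mask the negative mention.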

In [None]:
from nltk.corpus import opinion_lexicon
from nltk.tokenize import treebank

frequency = {}
pola = {}
positive = opinion_lexicon.positive()
negative = opinion_lexicon.negative()

# Needed for another Sentiment Analysis Approach.
tokenizer = treebank.TreebankWordTokenizer()

print("Running sentiment analysis")
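# Build the Vader analyzer once; re-creating it for every email reloads the lexicon each time.
vader_analyzer = demo_vader_instance('')
for i,email in enumerate(email_text_processed):

    # Progress indicator
    if i%100 ==0 :
        print(i,end=", ")
    
    #Reaggregating the list into one string of words.
    email_str = ' '.join(email)
    polarity = vader_analyzer.polarity_scores(email_str)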
    
    pos = polarity['pos']
    neg = polarity['neg']
    #Computing the polarity
    pol = 1*(pos>neg) - 1*(neg>pos)
    #pol = demo_liu_hu_lexicon(email_str,positive,negative,tokenizer)
    
    #Assigning the values for each country.
    for country in countries:
        if country in email_str:
            if country in frequency :
                frequency[country] += 1                
                pola[country] += pol
            else  :
                frequency[country] = 1
                pola[country] = pol
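
One caveat of a plain substring test such as `country in email_str` is that a short country name can also match inside a longer word (for example *oman* inside *woman*, or *niger* inside *nigeria*). A possible alternative, sketched below with a hypothetical helper `mentions`, is to require word boundaries. We did not re-run the analysis with it, so the figures discussed in this notebook come from the substring test.

```python
import re

def mentions(country, text):
    """True if `country` occurs in `text` as a whole word or phrase (case-insensitive)."""
    return re.search(r"\b" + re.escape(country) + r"\b", text, re.IGNORECASE) is not None

text = "the woman flew from romania to the united states"
print("oman" in text, mentions("oman", text))   # substring test vs. whole-word test
print(mentions("united states", text))          # multi-word names still match
```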


After computing the frequency of appearance of each country as well as its "cumulative" polarity, we need to rescale the polarity between -1 and 1 for it to be meaningful. We simply divide the polarity of a country by its frequency of appearance.

In [None]:
for country in countries:
    if country in frequency:
        pola[country] = pola[country]/frequency[country]

Then, we define a DataFrame `df_pola`, which stores the polarity as well as the frequency of appearance of each country.

In [None]:
df_pola = pd.DataFrame.from_dict(pola, orient='index')
df_pola = df_pola.rename(columns={0:"Polarity"}) 
df_pola['Frequency'] = df_pola.index.copy()
df_pola['Frequency'] = df_pola['Frequency'].map(frequency).astype(int)

df_pola = df_pola.sort_values(['Polarity'], ascending = False)
df_pola.head()

We already see from the head of `df_pola` that many countries appear only a very small number of times. Also, a lot of countries are classified positively, due to the issue described above. Let us focus on the countries that are mentioned negatively, i.e. that have a polarity smaller than 0.

In [None]:
df_pola[df_pola.Polarity <0]

We see that there are actually very few of them! Those mentioned negatively are also not mentioned very frequently, with the exception of *Uganda* and *Serbia*. We will see below that the overwhelming majority of frequently mentioned countries are mentioned very positively, due to this bias of our approach.

Let us observe below the *Frequency* at which the countries are mentioned and their *Polarity*, both ranked. We only keep countries that are mentioned at least 20 times, otherwise the plot would become overcrowded. This also helps us filter out non-significant cases: countries that are rarely mentioned are more likely to be wrongly classified, since their score may come from the sentiment analysis of a single email.

In [None]:
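# Most frequently mentioned countries (at least 20 mentions), by decreasing frequency
(df_pola[df_pola['Frequency'] >= 20])['Frequency'].sort_values(ascending=False).plot(kind='bar')
plt.show()

# Polarity of the same countries, by decreasing polarity (df_pola is already sorted)
(df_pola[df_pola['Frequency'] >= 20])['Polarity'].plot(kind='bar')
plt.show()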
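
The task statement also asks for bars coloured by polarity level. One way to get there, sketched below, is to map each polarity to an RGB tuple. The colour scale (red for -1, grey for 0, green for +1) and the helper name `polarity_to_color` are our own assumptions, not something prescribed by the exercise.

```python
def polarity_to_color(polarity):
    """Map a polarity in [-1, 1] to an (r, g, b) tuple: red for -1, grey for 0, green for +1."""
    p = max(-1.0, min(1.0, polarity))  # clamp to the expected range
    if p >= 0:
        # Move from grey towards pure green as p grows.
        return (0.5 - 0.5 * p, 0.5 + 0.5 * p, 0.5 - 0.5 * p)
    # Move from grey towards pure red as p decreases.
    return (0.5 - 0.5 * p, 0.5 + 0.5 * p, 0.5 + 0.5 * p)

# Hypothetical polarities for three countries.
sample = {"a": 1.0, "b": 0.0, "c": -1.0}
print({name: polarity_to_color(p) for name, p in sample.items()})
```

The resulting list of tuples could then be handed to the bar plot through its `color` argument, assuming the plotting backend accepts one colour per bar.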

We see that the countries which are the most often talked about are:
- Haiti
- Israel
- United States
- Pakistan
- Libya
- Afghanistan

Nothing too surprising, except maybe that those emails talk more about Haiti and Israel than about the United States itself.

Turning to the polarity ranking, the most *positively* mentioned countries (among those mentioned at least 20 times) are *Indonesia* and *India*. The United States is mentioned positively as well, but only ranks around tenth. The countries seen in the most neutral way (polarity close to zero) are *France*, *Yemen*, *Libya* and *Germany*. Note finally that none of these countries is mentioned negatively, due to the problem of scoring whole emails.

There seem to be significant differences between the countries mentioned positively and those mentioned more neutrally. This does not seem to be linked to the frequency at which the countries appear, as *Libya*, which is neutral in the sentiment analysis, is mentioned very often. But there are probably some significant differences that our overly simplistic approach cannot reveal.

Before moving on to the next topic, let us save the DataFrame for later reuse.

In [None]:
df_pola.to_csv('polarity_vader.csv')

# 03 - Topic Modelling
> Using the `models.ldamodel` module from the gensim library, run topic modeling over the corpus. Explore different numbers of topics (varying from 5 to 50), and settle for the parameter which returns topics that you consider to be meaningful at first sight.

First, we process the text as in the Sentiment Analysis part, except that we filter out some more words that contribute little to identifying a topic. We ran the topic modelling several times and removed words that made topics look similar without carrying meaning (common verbs, adverbs, ...) and that the earlier stop word list had not filtered.

In [None]:
new_stop_list = ['would','today','u','e','thx','said','see','new','one','mr','like','us','call','fyi','yet','also','work','time','get','want','like','w','en',
                 'get','talk', 'b', 'think','tomorrow','know','j','g','good','sent','qqdr','let', 'gov','com','case','two','year', 'back','going','need','next',
                 'last','still','go','best','sure']
email_text_processed = [[word.lower() for word in email if word.lower() not in new_stop_list]
                        for email in email_text_processed]

from collections import defaultdict
frequency = defaultdict(int)
for email in email_text_processed:
    for token in email:
        frequency[token] += 1


emails = [[token for token in email if frequency[token] > 1]
         for email in email_text_processed]

We now perform the LDA topic modelling. Note that we will not print the bags of most common words in order to compare them, but rather use the *pyLDAvis* library, which shows in 2D how close two clusters are to each other (in terms of common words). We then look for the number of topics which yields the most distinct clusters. Below, we fit a model for several numbers of topics and store the resulting models in a list, which we visualise later on.

In [None]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

dictionary = gensim.corpora.Dictionary(emails)
# Convert each email (a list of words) to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples
corpus = [dictionary.doc2bow(email) for email in emails]

num_topic =  [5,10,15,25,35,40,50]
ldamodel = []
for topic in num_topic :
    print("Considering",topic,"topics")
    ldamodel += [gensim.models.ldamodel.LdaModel(corpus,num_topics=topic,id2word = dictionary)]#, passes=1)
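
To make the bag-of-words representation more tangible, here is a toy re-implementation in plain Python. It only mimics the idea: the helpers `build_vocabulary` and `to_bow` are hypothetical, and the ids they assign are not guaranteed to match those of a gensim `Dictionary`.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of `doc` that are in `vocab`."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

docs = [["state", "department", "state"], ["haiti", "state"]]
vocab = build_vocabulary(docs)
print(vocab)
print([to_bow(d, vocab) for d in docs])
```

Each email thus becomes a short list of (id, count) pairs, which is the input format the LDA model consumes.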


In [None]:
vis_data = gensimvis.prepare(ldamodel[0], corpus, dictionary)
pyLDAvis.display(vis_data)

In [None]:
vis_data = gensimvis.prepare(ldamodel[1], corpus, dictionary)
pyLDAvis.display(vis_data)

In [None]:
vis_data = gensimvis.prepare(ldamodel[2], corpus, dictionary)
pyLDAvis.display(vis_data)

In [None]:
vis_data = gensimvis.prepare(ldamodel[3], corpus, dictionary)
pyLDAvis.display(vis_data)

In [None]:
for i, bag in enumerate(ldamodel[5].print_topics(num_words=8,num_topics = 50)):
    print("Cluster ", i, ": ", bag)
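
Besides the visual check, a rough numeric indicator of how distinct the topics are could be the average overlap between their top words. The sketch below uses the Jaccard similarity on hypothetical top-word lists. This is our own ad hoc indicator, not a metric provided by gensim or pyLDAvis, and we did not use it to select the number of topics.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mean_topic_overlap(topics):
    """Average pairwise Jaccard similarity between the top-word lists of all topics."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical top words for three topics.
topics = [
    ["state", "department", "secretary"],
    ["state", "haiti", "aid"],
    ["israel", "peace", "talks"],
]
print(mean_topic_overlap(topics))
```

A lower value means the topics share fewer top words. Comparing it across the models stored in `ldamodel` would require extracting the top words of each topic first.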

# 04 - Communication graph
> BONUS: build the communication graph (unweighted and undirected) among the different email senders and recipients using the `NetworkX` library. Find communities in this graph with community.best_partition(G) method from the community detection module. Print the most frequent 20 words used by the email authors of each community. Do these word lists look similar to what you've produced at step 3 with LDA? Can you identify clear discussion topics for each community? Discuss briefly the obtained results.

Let us load the `EmailReceivers.csv` file, which contains the recipients of the emails, and join the `Id` and `SenderPersonId` columns of `email_df` with `receiver_df` to get the edges of our graph. Note that we need to drop NaNs, as we cannot keep an edge that has only one known vertex.

In [None]:
receiver_df = pd.read_csv('hillary-clinton-emails/EmailReceivers.csv')
receiver_df = receiver_df[['EmailId','PersonId']].set_index('EmailId')

graph_df = email_df[['Id','SenderPersonId']]
graph_df = graph_df.set_index('Id')

graph_df = graph_df.join(receiver_df)
graph_df.dropna(inplace=True)
graph_df = graph_df.astype(int)

graph_df.columns = ['SenderId', 'ReceiverId']
graph_df.head()

In [None]:
print(graph_df.min())
print(graph_df.max())

In [None]:
import community
import networkx as nx
import matplotlib.pyplot as plt

# Build an undirected graph with one edge per (sender, receiver) pair.
# Note: `from_pandas_edgelist` is the NetworkX >= 2.0 name; older versions call it `from_pandas_dataframe`.
G = nx.from_pandas_edgelist(graph_df, 'SenderId', 'ReceiverId')
#first compute the best partition
partition = community.best_partition(G)

#drawing
size = float(len(set(partition.values())))
pos = nx.spring_layout(G)
count = 0.
for com in set(partition.values()) :
    count = count + 1.
    list_nodes = [nodes for nodes in partition.keys()
                                if partition[nodes] == com]
    nx.draw_networkx_nodes(G, pos, list_nodes, node_size = 20,
                                node_color = str(count / size))


nx.draw_networkx_edges(G,pos, alpha=0.5)
plt.show()



In [None]:
persons = np.sort(pd.concat([graph_df['SenderId'], graph_df['ReceiverId']]).unique())

In [None]:
persons_df = pd.DataFrame({'Persons': persons})

In [None]:
persons_df['Partition'] = persons_df['Persons'].map(partition)

In [None]:
persons_df['Partition'].unique()
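
The remaining step of the task, listing the most frequent words per community, is not carried out in this notebook. A minimal sketch of how it could be done is given below, on toy inputs: `sender_ids`, `token_lists` and the small `partition` dictionary are hypothetical stand-ins for the sender column, the processed emails and the output of `community.best_partition`.

```python
from collections import Counter, defaultdict

def top_words_by_community(sender_ids, token_lists, partition, n=20):
    """Return {community: [(word, count), ...]} with the n most frequent words per community."""
    counters = defaultdict(Counter)
    for sender, tokens in zip(sender_ids, token_lists):
        community_id = partition.get(sender)
        if community_id is None:  # sender not present in the graph
            continue
        counters[community_id].update(tokens)
    return {com: counter.most_common(n) for com, counter in counters.items()}

# Hypothetical toy inputs.
partition = {1: 0, 2: 0, 3: 1}
sender_ids = [1, 2, 3, 4]
token_lists = [["haiti", "aid"], ["haiti", "relief"], ["israel", "talks"], ["ignored"]]
print(top_words_by_community(sender_ids, token_lists, partition, n=2))
```

The per-community lists obtained this way could then be compared by eye with the LDA topics of section 03.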