# 01 - Preprocessing
>Generate a word cloud based on the raw corpus -- I recommend you to use the Python word_cloud library. With the help of nltk (already available in your Anaconda environment), implement a standard text pre-processing pipeline (e.g., tokenization, stopword removal, stemming, etc.) and generate a new word cloud. Discuss briefly the pros and cons (if any) of the two word clouds you generated.

In [None]:
import pandas as pd
import numpy as np
from os import path
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import gensim

## 1. Loading the emails and retrieving the raw text
The text we are interested in is mostly in two fields of the large *emails.csv* file: the `ExtractedSubject` and `ExtractedBodyText` columns. We merge these two columns into a single column (replacing *NaN* with an empty string) and use this column to produce our first word cloud.

In [None]:
email_df = pd.read_csv("hillary-clinton-emails/emails.csv")

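# Join subject and body with a space so that the last word of the subject
# and the first word of the body are not fused into one token.
email_df['Text'] = email_df['ExtractedSubject'].fillna('') + ' ' + email_df['ExtractedBodyText'].fillna('')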
email_text = list(email_df['Text'])

`WordCloud` expects a single string as input, so we first join our list into one string. We can then display the word cloud of Hillary Clinton's emails, built from the raw text of the mails.

In [None]:
#Merging everything into one big chunk of text.
emails_str = ' '.join(email_text)
wordcloud = WordCloud().generate(emails_str)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Many of the words displayed are not very *relevant* to understanding the contents of the emails; they are specific to email vocabulary and are not filtered out. We should manually remove words like *Re* and *Fw*, as well as other relatively meaningless words like *pm*, *new*, *also*, ... in order to obtain something more representative of the content of the emails. This is what we will do now with a text preprocessing pipeline.

## 2. Preprocessing pipeline

### a. Tokenization 
We use the *WordPunctTokenizer*, whose documentation describes it as follows:
> Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp `\w+|[^\w\s]+`.

You can see below how it splits the data into words and numbers and leaves punctuation such as **.**, **,** and **:** as separate tokens. The next step is simply to remove those, as they provide no useful information about the data. We will also remove all numeric tokens.
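To make the quoted regular expression concrete, here is a minimal sketch that applies the same pattern with the standard `re` module. The helper name `word_punct_tokenize` and the sample string are made up for illustration; the notebook itself relies on NLTK's tokenizer.

```python
import re

# Same pattern as the one quoted from the tokenizer's documentation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return WORD_PUNCT.findall(text)

sample = "FW: Meeting at 3:30pm, re: Haiti."  # made-up example string
tokens = word_punct_tokenize(sample)
print(tokens)
# ['FW', ':', 'Meeting', 'at', '3', ':', '30pm', ',', 're', ':', 'Haiti', '.']
```

Note that `\w` also matches digits, which is why `30pm` comes out as a single token.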

In [None]:
from nltk.tokenize import WordPunctTokenizer

# Instantiate the tokenizer once and reuse it for every email
tokenizer = WordPunctTokenizer()
email_text_processed_token = [tokenizer.tokenize(email) for email in email_text]

In [None]:
print(email_text[0:3])
print(email_text_processed_token[0:3])

We see that we still have to remove the numbers and punctuation. To this end, we will use the very useful `isalpha()` function, which 
>returns true if all characters in the string are alphabetic and there is at least one character, false otherwise.
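As a small illustration of this filter, the sketch below runs it on a hand-written token list (the tokens and the helper name `keep_alpha_lower` are assumptions, not taken from the corpus).

```python
def keep_alpha_lower(tokens):
    """Keep purely alphabetic tokens and lowercase them."""
    return [tok.lower() for tok in tokens if tok.isalpha()]

# Hand-written tokens, as a punctuation-aware tokenizer might produce them.
sample_tokens = ['FW', ':', 'Meeting', 'at', '3', ':', '30pm', ',', 're', ':', 'Haiti', '.']
cleaned = keep_alpha_lower(sample_tokens)
print(cleaned)
# ['fw', 'meeting', 'at', 're', 'haiti']
```

One consequence worth keeping in mind: mixed tokens such as `30pm` are dropped entirely, not reduced to `pm`.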

In [None]:
email_text_processed_token = [[word.lower() for word in email if word.isalpha()]
                        for email in email_text_processed_token]

print(email_text[0:3])
print(email_text_processed_token[0:3])

We got the desired tokenization of our data. We now have to remove all the stop words from the text.

### b. Stopwords removal

In order to remove the common words from our text, we load a *stopwords* dictionary, to which we add our custom stop words, in particular *fw*, *re*, *pm* and the like. This is an iterative process; what we show below is the dictionary resulting from it, with all the words we thought should be filtered.
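One possible way to drive this iterative process is to look at the most frequent tokens after each pass and decide which of them carry no meaning. The sketch below assumes tokenized emails as lists of strings; `top_tokens` and the toy data are hypothetical.

```python
from collections import Counter

def top_tokens(tokenized_emails, n=10):
    """Return the n most frequent tokens across all emails."""
    counts = Counter(tok for email in tokenized_emails for tok in email)
    return counts.most_common(n)

# Toy data standing in for the tokenized corpus.
toy_emails = [
    ['re', 'meeting', 'pm'],
    ['fw', 're', 'schedule'],
    ['re', 'call', 'pm'],
]
candidates = top_tokens(toy_emails, n=2)
print(candidates)
# [('re', 3), ('pm', 2)]
```

Tokens that rank high but say nothing about the content are natural candidates for the custom stop list.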

Note that it is necessary to download the *stopwords* dictionaries in order to make them available.

In [None]:
nltk.download("stopwords")

Just below, we list the common English words that we will remove from our emails, together with our custom stop words.

In [None]:
stop_list = ['re','fw','h','pm','docx']
stop_list.extend(stopwords.words('english'))
print(stop_list)

Let us now remove those stop words from our email corpus. 

In [None]:
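# A set gives constant-time membership tests, which matters on a corpus this size
stop_set = set(stop_list)
email_text_processed = [[word for word in email if word not in stop_set]
                        for email in email_text_processed_token]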

The result below shows that words like *FW*, numbers and common words have been removed, which is what we wanted.

In [None]:
print(email_text[0:3])
print(email_text_processed[0:3])

Before moving on to the last step of our preprocessing, namely the stemming, we want to display an intermediary word cloud, to see whether this processing improved the "quality" of the word cloud. 

In [None]:
#Merging everything into one big chunk of text.
emails_str = ' '.join([' '.join(email) for email in email_text_processed])
wordcloud = WordCloud().generate(emails_str)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

This word cloud improves on the previous one in several ways. First, we removed many meaningless words that were omnipresent in the emails (*Re:*, *FW:* and others), so we see more clearly what the topics of the emails actually are. One of the most prominent words is *state*, which is expected: Hillary Clinton was Secretary of State, and the word also refers to the states of the United States. We also find *secretary*, *president*, *obama* and *office*, which are all very common words. The only downside is that we lost the capital letters in the process, since lowercasing was needed to match the stop words. (It could be done otherwise, but the work would end up being more complicated.)
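As a sketch of what "otherwise" could look like for the stop-word step alone, one can compare the lowercased token against the stop list while keeping the original token. The helper name and the toy stop list are assumptions; any later step that expects lowercase tokens would still need adapting.

```python
def remove_stopwords_keep_case(tokens, stop_words):
    """Drop stop words case-insensitively but keep the surviving tokens' case."""
    stop_set = {word.lower() for word in stop_words}
    return [tok for tok in tokens if tok.lower() not in stop_set]

toy_stop_words = ['re', 'fw', 'the', 'of']  # toy list, not the NLTK one
sample_tokens = ['FW', 'Secretary', 'of', 'State', 'the', 'Haiti']
kept = remove_stopwords_keep_case(sample_tokens, toy_stop_words)
print(kept)
# ['Secretary', 'State', 'Haiti']
```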

### c. Stemming 

The last step is to reduce words to their *common* root, i.e. to map *scientist* and *science* to the same root (e.g. *sci*). Many of the resulting *roots* are not proper words, as will become clear below.

In [None]:
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()

email_text_processed_st = [[st.stem(word) for word in email]
                        for email in email_text_processed]

In [None]:
print(email_text_processed_st[0:3])

In [None]:
#Merging everything into one big chunk of text.
emails_str = ' '.join([' '.join(email) for email in email_text_processed_st])
wordcloud = WordCloud().generate(emails_str)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Here we find roughly the same topics as above, but in their *stemmed* versions. This makes it hard to tell exactly which words were reduced to each stem and is, in our opinion, not the best way of visualising the content of the emails. Stemming can also create strange associations: for instance, *secretary* and *secret* could end up with the same stem. Our preferred word cloud is the second one, which seems to highlight the content of the emails quite effectively.
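One possible way to keep the grouping effect of stemming while still displaying readable words is to replace each stem by its most frequent original form. The sketch below is only an illustration: `toy_stem` is a hypothetical stand-in for a real stemmer (in the notebook, `st.stem` could be passed instead), and the word list is made up.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    if word.endswith('s') and len(word) > 3:
        return word[:-1]
    return word

def representative_forms(words, stem):
    """Map each stem to the most frequent original word that produced it."""
    forms = defaultdict(Counter)
    for word in words:
        forms[stem(word)][word] += 1
    return {s: counter.most_common(1)[0][0] for s, counter in forms.items()}

words = ['state', 'states', 'state', 'emails', 'email', 'emails']
mapping = representative_forms(words, toy_stem)
print(mapping)
# {'state': 'state', 'email': 'emails'}

# Words to feed to the word cloud: one readable form per stem.
display_words = [mapping[toy_stem(w)] for w in words]
```

This does not remove false merges such as *secretary* and *secret*; it only makes the merged groups easier to read.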

# 02 - Finding the mentions of world countries
> Find all the mentions of world countries in the whole corpus, using the `pycountry` utility (HINT: remember that there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.) Perform sentiment analysis on every email message using the demo methods in the `nltk.sentiment.util` module. Aggregate the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level) that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo methods from the sentiment analysis module -- can you find substantial differences?

We first import all the countries into a list and count the occurrences of each country in each email. Note that it is not easy to find a dictionary containing the names of the inhabitants of each country (like *swiss* for *Switzerland* or *american* for the *United States*). We therefore work with the non-stemmed version of the text, as stemming would only help if we had the inhabitants' names for every country: *afghan* and *afghanistan* may reduce to the same stem, but *american* and *United States* certainly will not. Using the stemmed version would introduce a bias towards the countries whose name and inhabitants' name share a stem. This is why we choose **NOT** to work with the stemmed version of the emails.

First of all, we load the countries, and store their lowercase version into a list.

In [None]:
import pycountry
countries = [country.name.lower() for country in list(pycountry.countries)]
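The task hint mentions several surface forms per country (Switzerland, switzerland, CH). The notebook only matches lowercase names; as a hedged sketch of how codes could be added, the example below builds an alias map from hand-written records. The records stand in for `pycountry` entries (assumption: each real entry exposes at least a name, a 2-letter code and a 3-letter code), and both helper names are hypothetical.

```python
# Hand-written sample records standing in for pycountry entries.
SAMPLE_RECORDS = [
    {"name": "Switzerland", "alpha_2": "CH", "alpha_3": "CHE"},
    {"name": "Haiti", "alpha_2": "HT", "alpha_3": "HTI"},
]

def build_alias_map(records):
    """Map every surface form (lowercase name, upper-case codes) to the country name."""
    aliases = {}
    for rec in records:
        aliases[rec["name"].lower()] = rec["name"]
        # Codes stay upper-case so that ordinary lowercase words do not match them.
        aliases[rec["alpha_2"]] = rec["name"]
        aliases[rec["alpha_3"]] = rec["name"]
    return aliases

def countries_mentioned(tokens, aliases):
    """Return the set of countries whose name or code appears among the tokens."""
    found = set()
    for tok in tokens:
        if tok in aliases:              # exact match: codes and lowercase names
            found.add(aliases[tok])
        elif tok.lower() in aliases:    # case-insensitive match on full names
            found.add(aliases[tok.lower()])
    return found

aliases = build_alias_map(SAMPLE_RECORDS)
found = countries_mentioned(["Meeting", "in", "Switzerland", "CH", "ch", "haiti"], aliases)
print(sorted(found))
# ['Haiti', 'Switzerland']
```

This sketch needs tokens that have not been lowercased, and it only handles single-token forms; multi-word names would need matching on the joined text instead.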

Before doing the sentiment analysis proper, we want to see which countries are mentioned most often in the emails and plot the resulting histogram. Below, we count the number of emails in which each country name appears, iterating over all of them. We use a `defaultdict` instead of a plain `dict` to store the results, since it automatically initialises missing keys.

In [None]:
from collections import defaultdict
frequency = defaultdict(int)
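for email in email_text_processed:
    # Pad with spaces so that only whole-word matches are counted;
    # a bare substring test would also match a name inside a longer word.
    email_str = ' ' + ' '.join(email) + ' '
    for country in countries:
        if ' ' + country + ' ' in email_str:
            frequency[country] += 1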
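When counting country names in joined text, a plain substring test can match a short name inside an unrelated word. The sketch below contrasts it with a word-boundary regular expression; the function name `mentions` and the sample sentence are made up for illustration.

```python
import re

def mentions(country, text):
    """True if `country` occurs in `text` as a whole word or phrase."""
    pattern = r'\b' + re.escape(country) + r'\b'
    return re.search(pattern, text) is not None

text = 'the woman met the delegation from united states'  # made-up sentence

substring_hit = 'oman' in text                 # True: matches inside 'woman'
whole_word_hit = mentions('oman', text)        # False: no standalone 'oman'
phrase_hit = mentions('united states', text)   # True: multi-word names still work
print(substring_hit, whole_word_hit, phrase_hit)
# True False True
```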

In order to visualise the results, we convert our dict into a `DataFrame` and plot the countries that appear in at least *20* emails (otherwise the plot would become too crowded).

In [None]:
df = pd.DataFrame.from_dict(frequency, orient='index')
df = df.sort_values([0], ascending = False)

#Displaying the most appearing countries
df.loc[df[0] >= 20].plot(kind='bar')
plt.show()

We see that the countries most often talked about are:
- Haiti
- Israel
- United States
- Pakistan
- Libya
- Afghanistan

Nothing too surprising, except maybe that those emails talk more about Haiti and Israel than about the United States itself.

In [None]:
from nltk.sentiment.util import demo_vader_instance
nltk.download('vader_lexicon')

In [None]:
str_ = ' '.join(email_text_processed_token[14])
print(str_)
demo_vader_instance(str_)
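The task also asks for the polarity to be aggregated by country, a step the notebook does not reach. The sketch below shows one possible aggregation under stated assumptions: `score` is any function returning one number per text, and `toy_score` with its toy emails is purely illustrative, not a real sentiment model. In practice `score` could wrap `SentimentIntensityAnalyzer().polarity_scores(text)['compound']` from `nltk.sentiment.vader`.

```python
from collections import defaultdict

def aggregate_polarity(emails, countries, score):
    """Average the score of every email mentioning a country, sorted ascending."""
    scores = defaultdict(list)
    for tokens in emails:
        text = ' ' + ' '.join(tokens) + ' '
        # Whole-word match via space padding
        mentioned = [c for c in countries if ' ' + c + ' ' in text]
        if not mentioned:
            continue
        value = score(text)
        for country in mentioned:
            scores[country].append(value)
    means = {c: sum(v) / len(v) for c, v in scores.items()}
    return sorted(means.items(), key=lambda item: item[1])

def toy_score(text):
    """Illustrative scorer only: +1 per 'good', -1 per 'bad'."""
    words = text.split()
    return words.count('good') - words.count('bad')

toy_emails = [
    ['good', 'news', 'haiti'],
    ['bad', 'news', 'libya'],
    ['bad', 'bad', 'libya', 'haiti'],
]
ranking = aggregate_polarity(toy_emails, ['haiti', 'libya'], toy_score)
print(ranking)
# [('libya', -1.5), ('haiti', -0.5)]
```

The sorted list of `(country, mean_score)` pairs can be passed to a bar plot, ordered and coloured by polarity.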

# 03 - Topic Modelling


The processing is analogous to what was done in the Sentiment Analysis part, except that we filter out a few more words that contribute little to determining a topic.

In [None]:
new_stop_list = ['would','today','u','e','thx','said','see']
email_text_processed = [[word.lower() for word in email if word.lower() not in new_stop_list]
                        for email in email_text_processed]

In [None]:
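# Count how often each token occurs in the whole corpus
frequency = defaultdict(int)
for email in email_text_processed:
    for token in email:
        frequency[token] += 1

# Keep only the tokens that occur more than once
emails = [[token for token in email if frequency[token] > 1]
          for email in email_text_processed]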

Perform the LDA topic modelling and print the results.

In [None]:
dictionary = gensim.corpora.Dictionary(emails)
# Convert each email to its bag-of-words representation:
# a list of (word_id, word_frequency) 2-tuples
corpus = [dictionary.doc2bow(email) for email in emails]

# Try 5, 10, ..., 50 topics
for num_topics in range(5, 55, 5):
    print("\t\t\t CONSIDERING", num_topics, "TOPICS")
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # print_topics shows at most 20 topics by default, so request all of them
    for i, bag in enumerate(ldamodel.print_topics(num_topics=num_topics, num_words=8)):
        print("Cluster ", i, ": ", bag)
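To make the bag-of-words representation used above concrete, here is a minimal standard-library sketch of the idea. The helpers `build_vocab` and `to_bow` are hypothetical and the documents are toy data; the ids they assign need not match the ones gensim's `Dictionary` would produce.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

docs = [['state', 'department', 'state'], ['haiti', 'aid', 'state']]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
# {'state': 0, 'department': 1, 'haiti': 2, 'aid': 3}
print(bows)
# [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Word order is discarded in this representation; only the per-document counts are passed on to the topic model.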