# 05 - Taming Text

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words
from email.parser import FeedParser
%matplotlib inline

## Part 1: Word cloud

### Email corpus extraction

As we are not interested in every detail (sender, receiver, etc.), our first task is to extract the body of the emails.

First, let's read the Emails.csv file.

In [None]:
mails = pd.read_csv("hillary-clinton-emails/Emails.csv")

In [None]:
mails.head()

Now we have two options: take the raw text or use the extracted body. The raw text is messy (further inspection confirmed this), and since the bodies have already been extracted for us, we are going to use the extracted bodies.

In [None]:
emailsBody = mails['ExtractedBodyText'].dropna()
print(len(emailsBody))
emailsBody.head()

We then lowercase everything: no important information is lost, and it makes the text more uniform.

In [None]:
emailsBody = emailsBody.apply(lambda x: x.lower())

We need the data as a single string, so we first flatten all the cells into a single array and then join that array into a single string.

In [None]:
textArray = emailsBody.values.flatten()
textString = ' '.join(textArray)

The first method is straightforward: generate a cloud directly from the unprocessed, untokenized string.

In [None]:
cloud = WordCloud().generate(textString)
# cloud = WordCloud(max_font_size=40).generate(textString)  # limits the size of the biggest word

In [None]:
plt.imshow(cloud)
plt.axis('off')
plt.show()

This word cloud shows what we could expect from such emails (state, office, Obama, ...) but also many noisy words that carry little relevant information, such as 'will', 'said', 'call', etc. The next steps are intended to improve the cloud.
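To check which words dominate before any cleaning, a plain frequency count can serve as a rough stand-in for what the cloud draws. This is a minimal sketch using only the standard library; `sample_text` is a made-up string, not data from the corpus.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for textString
sample_text = "the state office will call the state department and the office"

# Split on runs of word characters and count occurrences
word_counts = Counter(re.findall(r"\w+", sample_text))

# The most frequent words are the ones a cloud would tend to draw largest
top_words = word_counts.most_common(3)
print(top_words)
```

Running the same count on the real string would be one way to pick candidate noise words by number rather than by eye.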

### Tokenization & stop words

We will now tokenize the entire string. `RegexpTokenizer` does the bulk of the tokenization, while `stop_words` gets rid of uninformative words like "the" and "and". We also remove some uninformative dominant words we identified in the previous cloud.

In [None]:
# tokens = nltk.word_tokenize(textString)  # alternative tokenizer, not used here
tokenizer = RegexpTokenizer(r'\w+')
allTokens = pd.Series(tokenizer.tokenize(textString))
allTokens = allTokens[~allTokens.isin({'will', 'pm', 'said', 'call'})]
print("Number of tokens: ", len(allTokens))

In [None]:
en_stop = get_stop_words('en')
stopTokens = allTokens[~allTokens.isin(en_stop)]
print("Number of tokens after stop word filtering: ", len(stopTokens))

In [None]:
print("Remaining tokens: ", round(len(stopTokens) / len(allTokens) * 100), "%")

The stop-word filtering got rid of 39% of the tokens, which is a sizeable share.
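The two filtering steps can be traced on a toy sentence. The sketch below uses `re.findall` with the same `\w+` pattern instead of NLTK, and a tiny hand-written stop list; `toy_stop_words` and `sample_text` are assumptions made for illustration, and the real list from `get_stop_words('en')` is much longer.

```python
import re

# Hypothetical stop-word list, far shorter than the real one
toy_stop_words = {"the", "and", "to", "is"}

# Hypothetical sample sentence, already lowercased
sample_text = "the secretary is traveling to the state department"

# Same pattern as the tokenizer above: runs of word characters
sample_tokens = re.findall(r"\w+", sample_text)

# Keep only tokens that are not stop words
kept_tokens = [tok for tok in sample_tokens if tok not in toy_stop_words]

# Share of tokens removed by the filter
removed_share = 1 - len(kept_tokens) / len(sample_tokens)
print(kept_tokens)
print(f"Removed: {removed_share:.0%}")
```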

Now that the stop-word list got rid of unnecessary data for us, we'll merge the filtered tokens back into a string and use wordcloud again.

In [None]:
cleanString = ' '.join(stopTokens)

In [None]:
cleanCloud = WordCloud().generate(cleanString)

In [None]:
print("Raw")
plt.imshow(cloud)
plt.axis('off')
plt.show()
print("Without stopwords")
plt.imshow(cleanCloud)
plt.axis('off')
plt.show()

It is indeed better: removing some dominant words allowed more relevant ones to appear, and it looks pretty nice.

### Stemming

The next processing step is stemming. For this purpose we are going to use Porter's algorithm, which is already implemented in NLTK's `PorterStemmer`.

In [None]:
stemmer = PorterStemmer()

In [None]:
stemTokens = stopTokens.apply(stemmer.stem)

In [None]:
tokens = stemTokens #for next part

In [None]:
stopStemStr = ' '.join(stemTokens)
stemCloud = WordCloud().generate(stopStemStr)

In [None]:
print("Without stopwords")
plt.imshow(cleanCloud)
plt.axis('off')
plt.show()
print("Without stopwords + stemming")
plt.imshow(stemCloud)
plt.axis('off')
plt.show()

The result here is a bit different: the stemming transformed some of the words, but now we have a more accurate representation of the kind of vocabulary used in these emails. For example, 'work' was not even there in the previous cloud, and some irrelevant words such as 'also', 'one' and 'us' lost some of their weight.
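One plausible reason a word like 'work' only shows up after stemming is that its inflected forms are then counted together. The sketch below illustrates this with NLTK's `PorterStemmer`; the `word_forms` list is invented for the example and is not taken from the corpus.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word forms, chosen for illustration
word_forms = ["work", "works", "worked", "working"]

# Each inflected form is reduced to the same stem
stems = [stemmer.stem(word) for word in word_forms]
print(stems)
```

Four separate low-frequency words become a single token with the combined frequency, which is enough to change its size in the cloud.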

## Part 3: Topic modeling

In [None]:
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary

The `Dictionary` class assigns an ID to each distinct token and keeps track of word counts.
`doc2bow` converts a document (a list of tokens) into a list of (word_id, word_frequency) tuples.

In [None]:
dictionary = Dictionary([tokens])  # token -> ID mapping; the whole corpus is one document
corpus = [dictionary.doc2bow(text) for text in [tokens]]  # bag of words
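To make the bag-of-words representation concrete, here is a standard-library sketch of the same idea. It is not gensim's implementation: `doc_tokens` is a made-up document, and the IDs are assigned in first-seen order, which may differ from the IDs gensim assigns.

```python
from collections import Counter

# Hypothetical tokenized document
doc_tokens = ["state", "office", "state", "call", "office", "state"]

# Assign an integer ID to each distinct token (first-seen order here;
# gensim's own ID assignment may differ)
token_to_id = {}
for tok in doc_tokens:
    token_to_id.setdefault(tok, len(token_to_id))

# Bag of words: sorted list of (token_id, count) pairs
bow = sorted((token_to_id[tok], n) for tok, n in Counter(doc_tokens).items())
print(token_to_id)
print(bow)
```

Word order is discarded: only which tokens occur, and how often, is kept.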

We will now fit an LDA model for each number of topics from 5 to 49, then print a few topics from each model.

In [None]:
lda = [None]*45
for topics in range(5,50):
    idx = topics - 5
    lda[idx] = LdaModel(corpus, num_topics=topics, id2word = dictionary)

In [None]:
for i in range(len(lda)):
    # Index with the loop variable i; idx still holds the last value from the training loop
    print(lda[i].print_topics(num_topics=3, num_words=3))

Now that the algorithm has generated models for every number of topics, we observe that a lot of topics contain a single letter. This is due to the previous processing, most likely the `\w+` tokenizer splitting contractions and possessives such as "don't" into "don" and "t". Let's redo it without these noise letters.

In [None]:
tokens2 = tokens[~tokens.isin(['s', 't'])]
len(tokens) - len(tokens2)
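The cell above removes only 's' and 't'. A more general filter, sketched here on a plain list, would drop every single-character token. `sample_tokens` is invented for the example; on the pandas Series used in this notebook, `tokens[tokens.str.len() > 1]` should do the same, assuming all tokens are strings.

```python
# Hypothetical stemmed tokens, including single-letter leftovers
sample_tokens = ["state", "s", "offic", "t", "call", "u", "depart"]

# Keep only tokens longer than one character
long_tokens = [tok for tok in sample_tokens if len(tok) > 1]

print(len(sample_tokens) - len(long_tokens), "tokens dropped")
print(long_tokens)
```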

In [None]:
dictionary = Dictionary([tokens2])  # token -> ID mapping; the whole corpus is one document
corpus = [dictionary.doc2bow(text) for text in [tokens2]]  # bag of words

In [None]:
lda2 = [None] * 45
for topics in range(5,50):
    idx = topics - 5
    lda2[idx] = LdaModel(corpus, num_topics=topics, id2word = dictionary)

In [None]:
for i in range(len(lda2)):
    # Index with the loop variable i; idx still holds the last value from the training loop
    print(lda2[i].print_topics(num_topics=3, num_words=3))

This is better. The first thing we notice is that all topics contain the word 'state', and it definitely seems that even with 5 topics we have consistent words in them.