# 05 - Topic Modeling - Exercise 3

In [6]:
import pandas as pd
import numpy as np
import gensim

In [7]:
### ONLY RUN ONCE ####
# import nltk
# nltk.download('wordnet')

Topic modeling tends to give better results when each document is long enough. To avoid documents that are too short, we **group the emails by sender** (using the SenderPersonId feature) and treat the concatenated ExtractedSubject + ExtractedBodyText of all emails from a given sender as one document. We also discard emails where both ExtractedSubject and ExtractedBodyText are NaN.

Another option would be to group the emails by subject. However, the ExtractedSubject column has a considerably larger number of NaNs than the SenderPersonId column, so we chose to group the emails by sender.
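
The comparison of missing values between columns can be checked with a few lines of pandas. The sketch below uses a small made-up frame (the rows are hypothetical; only the column names come from the dataset), so the counts it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names come from the notebook.
toy = pd.DataFrame({
    "SenderPersonId": [80.0, 80.0, 32.0, np.nan],
    "ExtractedSubject": ["FW: schedule", np.nan, np.nan, "Re: call"],
    "ExtractedBodyText": ["See attached.", "Call me.", np.nan, "Thanks."],
})

# number of NaNs per column
nan_counts = toy.isna().sum()
print(nan_counts)

# rows where both the subject and the body are missing (the ones we discard)
both_missing = toy[["ExtractedSubject", "ExtractedBodyText"]].isna().all(axis=1)
print(int(both_missing.sum()))
```

Running the same two checks on the `emails` DataFrame loaded below would give the actual counts.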

We then apply our text **pre-processing pipeline** to each document: tokenization, removal of stopwords and of tokens containing digits, lemmatization, and removal of tokens that are too short.

In [8]:
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# create DataFrame
path_emails = 'hillary-clinton-emails/Emails.csv'
emails = pd.read_csv(path_emails, usecols = ['SenderPersonId', 'ExtractedSubject', 'ExtractedBodyText'])

# drop records where both ExtractedSubject and ExtractedBodyText are NaN
emails.dropna(axis=0, thresh=1, subset=['ExtractedSubject', 'ExtractedBodyText'], inplace=True)
emails.fillna(' ', inplace = True)

# group emails by sender
emails_raw = []

for _, group in emails.groupby('SenderPersonId'):
    # concatenate the subject and body of every email sent by this sender
    grouped_emails = ' '.join(subject + ' ' + body
                              for subject, body in zip(group['ExtractedSubject'], group['ExtractedBodyText']))
    emails_raw.append(grouped_emails)

# tokenization
email_tokens = []

for email in emails_raw:
    email_tokens.append(regexp_tokenize(email, pattern=r'\w+'))

# remove stopwords and digits/numbers
stop_words = set(stopwords.words('english'))

# also consider as stopwords those words typically related to emails
stopwords_emails = ['fyi', 'fm', 'am', 'pm', 'n\'t', 'sent', 'from', 'to', 'subject', 'fw', 'fwd', 'fvv',
                    'cc', 'bcc', 'attachments', 're', 'date', 'html', 'php']

stop_words.update(stopwords_emails)

email_clean_tokens = []

for email in email_tokens:
    clean_tokens = [token for token in email if (token.lower() not in stop_words
                                                 and not any(char.isdigit() for char in token))]
    email_clean_tokens.append(clean_tokens)

# lemmatization
wnl = WordNetLemmatizer()

email_lemma = []

# tokens are also lowercased in this step
for email in email_clean_tokens:
    lemma = [wnl.lemmatize(token.lower()) for token in email]
    email_lemma.append(lemma)
    
# removal of tokens that are too short (single characters)
email_clean = []

for email in email_lemma:
    clean = [token for token in email if len(token) > 1]
    email_clean.append(clean)
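
To make the effect of each filtering step easier to see, here is a minimal sketch that runs a simplified version of the pipeline on one made-up sentence. It is an illustration under stated assumptions, not the cell above: it uses only the standard library, the stopword set `toy_stop_words` is a tiny hypothetical stand-in for the NLTK list, and lemmatization is skipped.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# NLTK's English list plus the email-specific terms.
toy_stop_words = {"the", "to", "from", "at", "in", "re", "fw", "pm"}

def preprocess(text):
    # tokenization: runs of word characters, as with the r'\w+' pattern
    tokens = re.findall(r"\w+", text)
    # drop stopwords and any token that contains a digit
    tokens = [t for t in tokens
              if t.lower() not in toy_stop_words
              and not any(ch.isdigit() for ch in t)]
    # lowercase (lemmatization is omitted in this sketch)
    tokens = [t.lower() for t in tokens]
    # drop single-character tokens
    return [t for t in tokens if len(t) > 1]

result = preprocess("FW: Re: Meeting at 3 PM in room B12 to discuss Haiti plan B")
print(result)  # ['meeting', 'room', 'discuss', 'haiti', 'plan']
```

Note that `B12` is removed entirely because it contains digits, and the trailing `B` is removed by the length filter.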

We initialize a dictionary using the pre-processed corpus. The dictionary encapsulates the mapping between the words in those documents and their integer IDs. We then convert the collection of words in each document to its bag-of-words representation.

In [9]:
# create dictionary (mapping between words and IDs)
dictionary = gensim.corpora.Dictionary(email_clean)

# bag-of-words representation of each document in the corpus
corpus = [dictionary.doc2bow(email) for email in email_clean]
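
As a rough illustration of what the dictionary and the bag-of-words conversion produce, the sketch below rebuilds both ideas with the standard library on two made-up documents. The names `token2id` and `to_bow` are hypothetical helpers, and gensim may assign different integer IDs than this first-appearance scheme; only the shape of the result (a list of `(token_id, count)` pairs per document) is meant to match.

```python
from collections import Counter

# Two made-up, already pre-processed documents.
docs = [["state", "call", "state"], ["call", "meeting"]]

# Assign an integer ID to each distinct token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count how often each known token ID occurs in the document.
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)    # {'state': 0, 'call': 1, 'meeting': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation: only the counts per token survive, which is all that LDA uses.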

We perform topic modeling over the corpus by running an LDA model. We train the model with different numbers of topics and display the 10 most significant words of each topic.

In [10]:
number_topics = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

for n in number_topics:
    lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=n, id2word=dictionary, passes=10)
   
    print('#### Model with', n, 'topics ####')
   
    topic_id = 1
   
    for topic in lda_model.show_topics(num_topics=n, num_words=10, log=False, formatted=False):
        
        string_words = ''
        
        for word in topic[1]:
            string_words += (word[0] + ' ')
        
        print('topic #', topic_id, ':', string_words)

        topic_id += 1

#### Model with 5 topics ####
topic # 1 : call state woman said clinton talk would one also time 
topic # 2 : obama would one american president party said time new republican 
topic # 3 : state call would gov work see also time department get 
topic # 4 : district state blair cherie see would blackberry senate good wireless 
topic # 5 : office secretary state room meeting department arrive en route depart 
#### Model with 10 topics ####
topic # 1 : district senate blackberry handheld wireless vote great good mikulski state 
topic # 2 : state cheryl haiti mill gov cdm call would see clinton 
topic # 3 : state clinton department one said policy foreign secretary would time 
topic # 4 : obama state would one american president said party new time 
topic # 5 : palau state united uighur brother military year bahtiyar would email 
topic # 6 : obama health woman care insurance republican mandate could people would 
topic # 7 : state call secretary office meeting time department gov room tomo

Picking a small number of topics might generate topics whose most significant words don't have much in common. Picking a large number of topics might lead to redundant topics (i.e. topics whose most significant words are similar). There is no exact rule for picking the number of topics, but in this case **25 topics** seems to be a good compromise.
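
Redundancy between topics can also be eyeballed numerically. The sketch below is one possible heuristic, not the method used in this notebook: it computes the Jaccard overlap between the top-word lists of three topics copied from the 5-topic output above and flags pairs above a cut-off. The `threshold` value is a hypothetical choice.

```python
def jaccard(words_a, words_b):
    """Share of distinct words two topics have in common (0 = disjoint, 1 = identical)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Top words copied from the 5-topic output above.
topics = {
    1: "call state woman said clinton talk would one also time".split(),
    2: "obama would one american president party said time new republican".split(),
    3: "state call would gov work see also time department get".split(),
}

# Hypothetical cut-off; tune it by inspecting the pairs it flags.
threshold = 0.3

overlaps = {
    (i, j): jaccard(topics[i], topics[j])
    for i in topics for j in topics if i < j
}
flagged = [pair for pair, score in overlaps.items() if score >= threshold]
print(overlaps)
print(flagged)  # [(1, 3)]
```

Topics 1 and 3 share five of their fifteen distinct top words, so they are flagged as similar, which matches the impression from reading the lists. Generic words such as "would" and "time" inflate the overlap, so extending the stopword list would make this check more informative.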