<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dependencies" data-toc-modified-id="Dependencies-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dependencies</a></span></li><li><span><a href="#Goal" data-toc-modified-id="Goal-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Goal</a></span></li><li><span><a href="#Functions" data-toc-modified-id="Functions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Functions</a></span></li><li><span><a href="#Tokenise-Email-Bodies" data-toc-modified-id="Tokenise-Email-Bodies-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Tokenise Email Bodies</a></span><ul class="toc-item"><li><span><a href="#Bigrams" data-toc-modified-id="Bigrams-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Bigrams</a></span></li><li><span><a href="#Lemmatisation" data-toc-modified-id="Lemmatisation-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Lemmatisation</a></span></li><li><span><a href="#Bigrams" data-toc-modified-id="Bigrams-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Bigrams</a></span></li></ul></li><li><span><a href="#Construct-the-Corpus-and-LDA-model" data-toc-modified-id="Construct-the-Corpus-and-LDA-model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Construct the Corpus and LDA model</a></span><ul class="toc-item"><li><span><a href="#construct-the-LDA-model" data-toc-modified-id="construct-the-LDA-model-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>construct the LDA model</a></span><ul class="toc-item"><li><span><a href="#Perplexity-and-Coherence" data-toc-modified-id="Perplexity-and-Coherence-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Perplexity and Coherence</a></span></li></ul></li></ul></li><li><span><a href="#Finding-Best-Number-of-Topics" data-toc-modified-id="Finding-Best-Number-of-Topics-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Finding Best Number of Topics</a></span></li><li><span><a href="#Sandbox" data-toc-modified-id="Sandbox-7"><span 
class="toc-item-num">7&nbsp;&nbsp;</span>Sandbox</a></span></li></ul></div>

# Dependencies
`pandas`

`seaborn` 

`spacy`

`python -m spacy download en` - English model for spaCy

`gensim`

`pyLDAvis`


In [1]:
# getting a deprecation warning that is likely fixed by an update, but will suppress it for now
import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

# Goal
To analyse the body of the Enron emails and conduct topic analysis

In [2]:
#progress bars
from tqdm import tqdm

import pickle
from os.path import relpath

# wrangling
import pandas as pd

# lemmatization
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
# import string

# gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models.phrases import Phrases, Phraser
from gensim.models.ldamodel import LdaModel
# plotting
import matplotlib.pyplot as plt
import seaborn as sb
import pyLDAvis.gensim


    
sb.set(style="whitegrid") # to show plots well in darktheme 

In [3]:
data = pd.read_csv(relpath("../data/email_fields.csv"))["Body"]
data.describe()
# data = data.tolist()

# subset for speed in testing
data = data.sample(100000, random_state=1).tolist()



# Functions

In [4]:
def sent_to_words(sentences):
    """
    Breaks each document into a list of lowercase word tokens.
    
    Arg:
        sentences (iter of str) : The documents (e.g. email bodies) to be broken down
        
    Yields:
        (list) : the words of one document
    """
#     return [simple_preprocess(sentence, deacc=True) for sentence in tqdm(sentences)]
    
    for sentence in tqdm(sentences):
        # simple_preprocess lowercases, tokenises and drops punctuation; deacc=True also strips accent marks
        yield simple_preprocess(sentence, deacc=True)

def remove_stopwords(texts, stop_words):
    """
    Removes the stop words from tokenised texts using a specified set of stop words.
    
    Args:
        texts (iter) : texts to have stop words removed from. Each text should be split by word and given as a list.
        stop_words (set) : stop words for the language of the `texts`.
        
    Returns:
        (list) : list of texts (each a list of words) with stop words removed
    """
#     for body in texts:
#         yield([word for word in body if word not in stop_words])
    return [[word for word in body if word not in stop_words] for body in texts]
    
def make_bigrams(words, min_count = 5, threshold = 10):
    """
    Takes tokenised texts and returns them with frequent word pairs joined into bigrams.
    
    Args: taken from `gensim.models.phrases.Phrases` documentation.
        words (iter) :  can be simply a list, but for larger corpora, 
                        consider a generator that streams
                        the sentences directly from disk/network, 
                        See :class:`~gensim.models.word2vec.BrownCorpus`, 
                        :class:`~gensim.models.word2vec.Text8Corpus` 
                        or :class:`~gensim.models.word2vec.LineSentence` for such examples.
        min_count (float), optional : Ignore all words and bigrams with total 
                            collected count lower than this value.
                            Defaults to 5.
        threshold (float), optional : Represent a score threshold for forming the phrases 
                            (higher means fewer phrases). A phrase of words `a` followed 
                            by `b` is accepted if the score of the phrase is greater than threshold.  
                            Heavily depends on concrete scoring-function, see the `scoring` parameter.
    
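    Returns :
        (list) : the texts with detected bigrams joined by "_"
        
    """
    words = list(words)  # materialise so that a generator can be iterated twice
    bigrams = Phrases(words, min_count=min_count, threshold=threshold)
    return [list(bigrams[body]) for body in tqdm(words)]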
    
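def make_trigrams(words, min_count = 5, threshold = 10):
    """
    Takes tokenised texts and returns them with frequent bigrams and trigrams joined.
    
    Args: taken from `gensim.models.phrases.Phrases` documentation.
        words (iter) :  can be simply a list, but for larger corpora, 
                        consider a generator that streams
                        the sentences directly from disk/network, 
                        See :class:`~gensim.models.word2vec.BrownCorpus`, 
                        :class:`~gensim.models.word2vec.Text8Corpus` 
                        or :class:`~gensim.models.word2vec.LineSentence` for such examples.
        min_count (float), optional : Ignore all words and bigrams with total 
                            collected count lower than this value.
                            Defaults to 5.
        threshold (float), optional : Represent a score threshold for forming the phrases 
                            (higher means fewer phrases). A phrase of words `a` followed 
                            by `b` is accepted if the score of the phrase is greater than threshold.  
                            Heavily depends on concrete scoring-function, see the `scoring` parameter.
    Returns:
        (list) : the texts with detected bigrams and trigrams joined by "_"
    """
    words = list(words)  # materialise so that a generator can be iterated more than once
    bigrams = Phrases(words, min_count=min_count, threshold=threshold)
    # second pass over the bigrammed texts so that (bigram, word) pairs can merge into trigrams
    trigrams = Phrases(bigrams[words], min_count=min_count, threshold=threshold)
    return [list(trigrams[bigrams[body]]) for body in tqdm(words)]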
    
    
    
def lemmatisation(text, nlp, allowed_postags = ("NOUN", "ADJ", "VERB", "ADV")):
    """
    Lemmatises texts, keeping only the allowed parts of speech.  See: 'https://spacy.io/api/annotation' for more info on `allowed_postags`
    
    Args:
        text (iter) : tokenised texts to be lemmatised, each given as a list of words.
        nlp : the loaded `spacy` language model to use.
        allowed_postags (iter) : parts of speech to keep; tokens with any other tag are dropped.
       
    Returns:
        (list) : list of texts, each a list of lemmas
    """
    text_out = []
    for words in tqdm(text):
        body = nlp(" ".join(words))
        text_out.append([token.lemma_ for token in body if token.pos_ in allowed_postags])
    return text_out





# Tokenise Email Bodies
Break the email bodies into words to prepare them for analysis.  In general, emails which share more words will be about similar topics. "Rare" or infrequently used words are more likely to carry important information than words which show up in every email, e.g. "the" does not say much about the content of an email, but "investment" does.
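
To make the intuition about rare words concrete, here is a minimal sketch that counts the share of documents each word appears in. The three example emails are invented for illustration and are not taken from the Enron data.

```python
from collections import Counter

# Hypothetical mini-corpus, not real Enron emails
emails = [
    "the investment committee approved the plan",
    "the meeting is moved to friday",
    "please review the investment summary",
]

# Document frequency: number of emails in which each word occurs at least once
doc_freq = Counter()
for email in emails:
    doc_freq.update(set(email.split()))

n_docs = len(emails)
for word in ("the", "investment", "friday"):
    # A share close to 1.0 means the word appears almost everywhere and says little
    print(word, doc_freq[word] / n_docs)
```

A word that occurs in every email cannot help to tell the emails apart, which is the reason stop words are removed later on.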

In [6]:
%pprint # disable pretty print to keep things a bit more compact

Pretty printing has been turned OFF




In [7]:
# split email bodies into their component words.
data_words = sent_to_words(data)

# memory management 
# del data

finished constructing word lists
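
Note that `sent_to_words` is a generator, so the call above does no work until something iterates over `data_words`, and the result can only be consumed once. The sketch below shows that behaviour with a stand-in tokeniser (`str.split` instead of `simple_preprocess`); the function name `lazy_tokens` is hypothetical.

```python
def lazy_tokens(sentences, log):
    """Yield one token list per sentence, recording when work actually happens."""
    for sentence in sentences:
        log.append(sentence)          # only runs once the generator is iterated
        yield sentence.split()        # stand-in for simple_preprocess

log = []
tokens = lazy_tokens(["hello world", "satellite phone"], log)
calls_before = len(log)               # nothing has been tokenised yet

first_pass = list(tokens)             # iteration triggers the work
second_pass = list(tokens)            # the generator is now exhausted

print(calls_before, first_pass, second_pass)
```

This is probably why the tokenising progress bar only shows up later, in the cell that first iterates over `data_words`.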




## Lemmatisation
In order to simplify the data and amplify the signal from words with the same stem, e.g. is, were, am = be, the data needs to be lemmatised.

To do so I will use `spaCy`, which has the benefit of also identifying which part of speech each lemma is, e.g. noun, verb etc.

I take the major parts of speech I believe contribute to topics: nouns, adjectives, verbs and adverbs.  There is a case to be made for also taking proper nouns so as to include people, and this may be worth adding later. Stop words are also removed using the stop word list from `spaCy`.
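
As an illustration of why lemmatisation amplifies the signal, the sketch below collapses inflected forms with a tiny hand-written lookup table and then drops stop words. The table and the stop word set are assumptions made for the example; the notebook itself relies on spaCy's lemmatiser and stop word list.

```python
from collections import Counter

# Toy lemma table standing in for spaCy's lemmatiser (illustrative only)
LEMMAS = {"is": "be", "were": "be", "am": "be", "phones": "phone", "helped": "help"}
TOY_STOP_WORDS = {"be", "the", "i"}

tokens = ["i", "am", "helped", "the", "phones", "were", "phone", "is", "help"]

# Map every token to its lemma, leaving unknown tokens unchanged
lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
# Drop lemmas that carry no topical information
kept = [lem for lem in lemmas if lem not in TOY_STOP_WORDS]

print(Counter(kept))
```

Before lemmatisation "phone" and "phones" would be counted as two unrelated words; afterwards their counts are pooled.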

In [11]:
stop_words = set(STOP_WORDS)  # copy, so that later updates do not modify spaCy's own set
stop_words



{'nowhere', 'a', 'through', 'latterly', 'who', 'one', 'go', 'put', 'anyhow', 'after', 'when', 'herself', 'quite', 'whereby', "'ve", '‘re', 'everyone', 'get', 'whereas', 'where', 'via', 'against', 'out', 'thru', 'everywhere', 'the', 'still', 'always', 'them', 'five', 'himself', 'indeed', 'several', 'also', 'noone', 'almost', 'whither', 'whom', 'enough', 'thus', 'nor', "'re", 'than', 'being', 'it', 'moreover', 'becomes', 'nevertheless', 'except', 'once', 'give', 'will', 'is', 'his', 'their', '’d', 'ten', 'never', 'not', 'under', 'n‘t', 'be', 'least', 'show', 'due', '’ve', 'other', 'amount', 'seems', 'formerly', 'down', 're', 'whereupon', 'nine', 'he', 'n’t', 'bottom', 'but', 'everything', 'further', 'thence', 'more', '’s', 'sixty', 'only', 'are', 'wherein', 'hereafter', '’ll', 'empty', 'should', 'both', 'alone', 'two', 'whose', 'why', 'third', 'again', 'become', 'someone', 'often', 'that', "'d", 'of', 'besides', 'has', 'sometime', 'various', 'much', 'may', 'six', 'now', 'whenever', 'with

In [12]:
# add use case specific words to exclude 
words_to_add = {"etc", "subject", "com", "forward", "cc", "from", "edu"}
stop_words.update(words_to_add)
stop_words




In [13]:
no_stop_words = remove_stopwords(texts = data_words, stop_words = stop_words)

100%|██████████| 100000/100000 [01:20<00:00, 1242.43it/s]


In [14]:
with open("../data/no_stop_words.p", "wb") as f:
    pickle.dump(no_stop_words, f)



In [15]:
# memory management
del data
del data_words
no_stop_words[200]



['second', 'thanks', 'lot', 'dfmartha', 'benner', 'pmto', 'don', 'nelson', 'et', 'enron', 'enron', 'norm', 'ruiz', 'et', 'enron', 'enroncc', 'rod', 'williams', 'et', 'enron', 'enron', 'bcc', 'drew', 'fossum', 'et', 'enron', 'satellite', 'phonewant', 'thank', 'help', 'obtaining', 'satellite', 'phone', 'usage', 'short', 'notice', 'able', 'come', 'hesitation', 'parts', 'nice', 'good', 'people', 'work', 'helpful', 'professional', 'situation', 'novice', 'phone', 'thing', 'helpful', 'step', 'way', 'thank', 'martha', 'benner']

In [16]:
# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load('en', disable=['parser', 'ner'])



## Bigrams
Bigrams are pairs of words which are often used together and as such can almost be considered one word for the analysis, e.g. "Chinese food".  This can be extended to n-grams, but that is not done in this analysis.

Using the `Phrases` defaults (`min_count=5`, `threshold=10`) for the time being, but these may be adjusted later.
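
For reference, gensim's default scorer for `Phrases` follows the formula from Mikolov et al. The sketch below re-implements that formula in plain Python to show how `min_count` and `threshold` interact. The counts and vocabulary size are invented, and the function name `default_phrase_score` is hypothetical.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score of the bigram (a, b): (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

threshold = 10
vocab_size = 10_000  # assumed vocabulary size

# Both words are rare and almost always appear together
rare_pair = default_phrase_score(count_a=40, count_b=50, count_ab=30, vocab_size=vocab_size)
# Both words are common and only occasionally adjacent
common_pair = default_phrase_score(count_a=2_000, count_b=3_000, count_ab=30, vocab_size=vocab_size)

# Only pairs scoring above the threshold are joined into a bigram
print(rare_pair > threshold, common_pair > threshold)
```

Raising `threshold` therefore keeps only pairs whose words co-occur far more often than their individual frequencies would suggest.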

In [17]:
words_bigrams = make_bigrams(no_stop_words)
# memory management
del no_stop_words

100%|██████████| 100000/100000 [00:55<00:00, 1799.73it/s]


In [18]:
with open("../data/words_bigrams.p", "wb") as f:
    pickle.dump(words_bigrams, f)



In [None]:
words_lemmatised = lemmatisation(words_bigrams, nlp = nlp, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
# memory management
del words_bigrams

 60%|██████    | 60383/100000 [12:06<08:19, 79.31it/s] 

In [None]:
with open("../data/words_lemmatised.p", "wb") as f:
    pickle.dump(words_lemmatised, f)

In [None]:
# words_lemmatised

In [None]:
with open("../data/words_lemmatised.p", "rb") as f:
    words_lemmatised = pickle.load(f)
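
The dump-then-load pattern used here acts as a checkpoint, so that the slow lemmatisation step need not be repeated in every session. One possible way to wrap it is sketched below; `load_or_compute` and `slow_step` are hypothetical names, not part of the notebook.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Load a pickled result from `path` if it exists, otherwise compute and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []
def slow_step():
    calls.append(1)                      # track how often the expensive step runs
    return [["satellite", "phone"], ["thank", "help"]]

cache_path = os.path.join(tempfile.mkdtemp(), "words_lemmatised.p")
first = load_or_compute(cache_path, slow_step)    # computes and writes the cache
second = load_or_compute(cache_path, slow_step)   # reads the cache instead
print(len(calls), first == second)
```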

# Construct the Corpus and LDA model

In [None]:
# Create the corpus dictionary
dictionary = corpora.Dictionary(words_lemmatised)

# Corpus is just the `words_lemmatised` var that has been made above
# Term document frequency
corpus = [dictionary.doc2bow(text) for text in words_lemmatised]
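
Conceptually, `Dictionary` assigns an integer id to each distinct token and `doc2bow` turns a document into sparse `(token_id, count)` pairs. A minimal plain-Python sketch of the same idea follows, using made-up documents; the ids gensim actually assigns may differ.

```python
from collections import Counter

docs = [["phone", "satellite", "phone"], ["thank", "help", "phone"]]

# Assign ids in order of first appearance (gensim may order ids differently)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus)
```

Word order is discarded in this representation, which is why bigrams are joined into single tokens before this step.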

## construct the LDA model

In [None]:
lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=20,
                     random_state=100,
                     update_every=1,
                     chunksize=100,
                     passes=10,
                     alpha='auto',
                     per_word_topics=True)

In [None]:
with open("../data/lda_test.p", "wb") as f:
    pickle.dump(lda_model, f)

In [None]:
lda_model.print_topics()

### Perplexity and Coherence

In [None]:
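# log_perplexity returns the per-word likelihood bound, not the perplexity itself.
# Perplexity is 2 ** (-bound), so a bound closer to zero means a lower (better) perplexity.
print('Per-word likelihood bound: %f' % lda_model.log_perplexity(corpus))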
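
gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; according to the gensim documentation the perplexity is `2 ** (-bound)`. A small sketch of the conversion, with invented bound values and a hypothetical helper name:

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) to perplexity."""
    return 2 ** (-per_word_bound)

# Hypothetical bounds for two models; a bound closer to zero gives a lower perplexity
for bound in (-8.0, -7.0):
    print(bound, bound_to_perplexity(bound))
```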

In [None]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model = lda_model, texts= words_lemmatised, 
                                     dictionary = dictionary, coherence = 'c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: %f' % coherence_lda)

In [None]:
pyLDAvis.enable_notebook(sort=True)
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

# Finding Best Number of Topics
Many of the smaller topics are grouped closely together in the visualisation above, so I will train models across a range of topic counts and use the coherence score to find the best number of topics.

In [None]:
def calc_coherence(corpus, dictionary, texts, limit, start = 2, step = 2):
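    """
    Compute coherence of LDA models for increasing number of topics.
    
    Args:
        corpus (list) : bag-of-words corpus, as produced by `dictionary.doc2bow`.
        dictionary (gensim.corpora.Dictionary) : mapping between token ids and tokens.
        texts (list) : tokenised texts the corpus was built from (needed for `c_v` coherence).
        limit (int) : Upper bound on the number of topics (exclusive, as in `range`).
        start (int) : How many topics to start with. 
                        Defaults to 2.
        step (int) : How many topics to increase by for each iteration. 
                        Defaults to 2.
                        
    Returns:
        (list) : [model_list, coherence_list] -> the trained models and their corresponding coherence values, in the same order.
    """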
    model_list = []
    coherence_list = []
    
    for topics in tqdm(range(start, limit, step)):
        model = LdaModel(corpus=corpus,
                         id2word=dictionary,
                         num_topics=topics,
                         random_state=100,
                         update_every=1,
                         chunksize=100,
                         passes=10,
                         alpha='auto',
                         per_word_topics=True)
        model_list.append(model)
        
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_score = coherence_model.get_coherence()
        coherence_list.append(coherence_score)
        
    return [model_list, coherence_list]

In [None]:
limit = 20
start = 2
step = 2


model_list, coherence_list = calc_coherence(corpus = corpus, dictionary = dictionary, texts = words_lemmatised,
                                           limit = limit, start = start, step = step)

In [None]:
x = range(start, limit, step)
fig = plt.figure()
ax = plt.plot(x, coherence_list)
plt.xticks(x)
plt.xlabel("Num Topics")
plt.ylabel("Coherence Score")
plt.show()

In [None]:
with open("../data/model_list.p", "wb") as f:
    pickle.dump(model_list, f)
with open("../data/coherence_list.p", "wb") as f:
    pickle.dump(coherence_list, f)

From the graph above we can judge where adding more topics stops improving the coherence score, i.e. the point of diminishing returns.

In [None]:
# index of max coherence
max_index = coherence_list.index(max(coherence_list))
optimum_model = model_list[max_index]
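
Since the models were trained for `range(start, limit, step)` topics, the position of the best score has to be mapped back to a topic count. A sketch with made-up coherence values (the real scores come from `calc_coherence`):

```python
start, limit, step = 2, 20, 2
topic_counts = list(range(start, limit, step))   # 2, 4, ..., 18

# Hypothetical coherence scores, one per entry in topic_counts
scores = [0.41, 0.47, 0.52, 0.55, 0.54, 0.56, 0.53, 0.51, 0.50]

best_index = scores.index(max(scores))           # first position of the highest score
best_num_topics = topic_counts[best_index]

print(best_index, best_num_topics)
```

With scores this close together, the highest value is not necessarily the most useful model, so it may be worth inspecting the neighbouring topic counts as well.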


In [None]:
pyLDAvis.enable_notebook(sort=True)
vis = pyLDAvis.gensim.prepare(optimum_model, corpus, dictionary)
pyLDAvis.display(vis)

# Sandbox
Code testing ground. Ignore all below

In [None]:
break