# LDA Model experiments

The purpose of this notebook is to demonstrate building an LDA model. Example code is also provided to search for the optimal number of topics. The actual code for execution can be found in the src/models directory.

## Build dictionaries
This notebook relies on preprocessed emails. Preprocessing performs the following steps:
1. Collapse extra whitespace.
1. Remove email addresses and URLs from the bodies.
1. Expand contractions.
1. Tokenize the body.
1. Lemmatize the tokens.
1. Remove stopwords from the tokens.
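
The steps above can be sketched with the standard library alone. This is a toy illustration, not the project's preprocessing code: the contraction table, lemma lookup, stopword list, regular expressions and the lowercasing step are all assumptions made for the example, and a real pipeline would normally rely on an NLP library for tokenizing and lemmatizing.

```python
import re

# Toy stand-ins (assumptions for illustration only)
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}
LEMMAS = {"meetings": "meeting", "sending": "send", "notes": "note"}
STOPWORDS = {"the", "a", "is", "to", "do", "not", "it", "we", "are", "and"}

EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def preprocess(body):
    text = re.sub(r"\s+", " ", body).strip()           # 1. collapse extra whitespace
    text = EMAIL_RE.sub(" ", text)                      # 2. remove email addresses
    text = URL_RE.sub(" ", text)                        #    and URLs
    text = text.lower()                                 # (assumed) case folding
    for short, full in CONTRACTIONS.items():            # 3. expand contractions
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)                # 4. tokenize
    tokens = [LEMMAS.get(t, t) for t in tokens]         # 5. lemmatize via lookup table
    return [t for t in tokens if t not in STOPWORDS]    # 6. remove stopwords

sample = ("We're   sending the meetings notes.\n"
          "Don't reply to bob@example.com, see http://example.com/notes")
tokens = preprocess(sample)
print(tokens)
```

The order matters in this sketch: addresses and URLs are removed before tokenizing so that their fragments never reach the token list.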

The preprocessed emails are stored as JSON files intended to be loaded by the Email_Forensic_Processor class. However, the json module can be used directly to load the files and extract the relevant data.
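
As a rough sketch of loading one such file directly: the keys `body_tokens` and `body_pos_tokens` are the ones this notebook reads, but the record contents, the file name and the `load_mail` helper name are made up for the example, and the real files may well carry further fields.

```python
import json
import os
import tempfile

def load_mail(filename):
    """Load one preprocessed email (a JSON file) and return it as a dict."""
    with open(filename, "r") as input_file:
        return json.load(input_file)

# Hypothetical record: only the two keys used in this notebook are shown.
record = {
    "body_tokens": ["meeting", "schedule"],
    "body_pos_tokens": ["Meeting", "Schedule"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1.json")       # made-up file name
    with open(path, "w") as output_file:
        json.dump(record, output_file)
    email_dict = load_mail(path)

body_tokens = email_dict["body_tokens"]
print(body_tokens)
```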

The first step in building an LDA model is to build dictionaries. Preprocessing has already prepared the words of interest. The dictionaries are now compiled and each email is converted to a Bag of Words (BOW), from which TF-IDF can be computed for model building.
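
To make the dictionary, BOW and TF-IDF relationship concrete, here is a small pure-Python sketch on three made-up documents. It is not what gensim runs internally: it uses the textbook weight `count * log2(N / df)` and skips the per-document normalization that gensim's `TfidfModel` applies by default. In the notebook itself, `Dictionary.doc2bow` produces the BOW vectors.

```python
import math
from collections import Counter

# Made-up tokenized documents (stand-ins for preprocessed email bodies)
docs = [
    ["gas", "price", "gas"],
    ["price", "meeting"],
    ["meeting", "schedule", "meeting"],
]

# Dictionary: map each distinct token to an integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Bag of Words: each document becomes a sorted list of (token_id, count) pairs
bow_docs = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

# Document frequency: in how many documents each token id occurs
doc_freq = Counter(token_id for bow in bow_docs for token_id, _ in bow)
n_docs = len(docs)

def tfidf(bow):
    """Weight each count by log2(N / df); tokens found in every document score 0."""
    return [(token_id, count * math.log2(n_docs / doc_freq[token_id]))
            for token_id, count in bow]

tfidf_docs = [tfidf(bow) for bow in bow_docs]
print(bow_docs[0])
print(tfidf_docs[0])
```

In the first document "gas" occurs twice and appears nowhere else, so it outweighs "price", which is shared with another document.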



In [4]:
# Topic analysis
import os
import re
#import eflp
#import nlpfunct
from gensim.corpora import Dictionary
import json


#import importlib
#importlib.reload(nlpfunct)
#importlib.reload(eflp)


# Set the location of the directory used for processing
maildir_path = os.path.join('..','..','data', 'processed', 'maildir')
subdir = os.path.join(maildir_path,'allen-p','inbox')  # Mailbox folder to analyse

#new_mail = eflp.Summary()
dictionary_common = Dictionary()   # Dictionary based on common words
dictionary_pos = Dictionary()      # Dictionary based on Part of Speech tagging


limit = 10000   # Maximum number of emails to process

# A helper function which loads a JSON email file and returns it as a dictionary
def loadMail(filename):
    with open(filename, "r") as inputFile:
        return json.load(inputFile)



#tokenized_docs = []
bow_docs_common = []
texts_common = []
bow_docs_pos = []
texts_pos = []
print("Building dictionaries\n")

# Build a dictionary and BOW for every email, stopping once `limit` emails are processed
count = 0
for root, dirs, files in os.walk(subdir):
    if count >= limit:
        break                              # Also stop walking further directories
    for file in files:
        if file.startswith('.'):           # Skip hidden files created by the OS
            continue
        if count >= limit:
            break
        email_file = os.path.join(root,file)
        #new_mail.initMail(email_file)
        email_dict = loadMail(email_file)

        # Build a dictionary and BOW from common token documents
        texts_common.append(email_dict['body_tokens'])
        dictionary_common.add_documents([email_dict['body_tokens']])
        bow_docs_common.append(dictionary_common.doc2bow(email_dict['body_tokens']))

        # Build a specialised POS dictionary and BOW
        texts_pos.append(email_dict['body_pos_tokens'])
        dictionary_pos.add_documents([email_dict['body_pos_tokens']])
        bow_docs_pos.append(dictionary_pos.doc2bow(email_dict['body_pos_tokens']))

        count += 1
print("Finished building dictionaries.\n\n")

print(len(bow_docs_common))
print(len(bow_docs_pos))

Building dictionaries

Finished building dictionaries.


66
66


In [6]:
# Use the bag of words representation of all the email bodies, and learn num_topics using LDA
corpus_common = bow_docs_common
corpus_pos = bow_docs_pos
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 50
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
# id2token is filled lazily, so access one item first to populate it.
temp = dictionary_common[0]  # This is only to "load" the dictionary.
temp = dictionary_pos[0]  # This is only to "load" the dictionary.

id2word_common = dictionary_common.id2token
id2word_pos = dictionary_pos.id2token

print('Training topics for common model\n')
model_common = LdaModel(
    corpus=corpus_common,
    id2word=id2word_common,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

print('Training topics for POS model\n')
model_pos = LdaModel(
    corpus=corpus_pos,
    id2word=id2word_pos,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)



# Print the top topics for subjective evaluation
print("Finding top topics for common model")
top_topics_common = model_common.top_topics(corpus_common) #, num_words=20)
#print("Finding top topics for pos model")
#top_topics_pos = model_pos.top_topics(corpus_pos) #, num_words=20)

print("Computing topic coherence")
# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence_common = sum([t[1] for t in top_topics_common]) / num_topics
#avg_topic_coherence_pos = sum([t[1] for t in top_topics_pos]) / num_topics

print('Average topic coherence for common BOW: %.4f.' % avg_topic_coherence_common)
#print('Average topic coherence for POS BOW: %.4f.' % avg_topic_coherence_pos)


from pprint import pprint
print("Common BOW topics")
pprint(top_topics_common)

#print("POS BOW topics")
#pprint(top_topics_pos)




Training topics for common model

Training topics for POS model

Finding top topics for common model
Computing topic coherence
Average topic coherence for common BOW: -1.6609.
Common BOW topics
[([(0.0003923108, 'appliance'),
   (0.0003923108, 'appreciate'),
   (0.0003923108, 'arolyne'),
   (0.0003923108, 'arrive'),
   (0.0003923108, 'arrow'),
   (0.0003923108, 'art'),
   (0.0003923108, 'ashton'),
   (0.0003923108, 'asysportsteenstravel'),
   (0.0003923108, 'attenti'),
   (0.0003923108, 'audio'),
   (0.0003923108, 'aust'),
   (0.0003923108, 'b'),
   (0.0003923108, 'bag'),
   (0.0003923108, 'beautifully'),
   (0.0003923108, 'beauty'),
   (0.0003923108, 'behind'),
   (0.0003923108, 'belo'),
   (0.0003923108, 'ance'),
   (0.0003923108, 'bestseller'),
   (0.0003923108, 'app')],
  -0.16244934774971095),
 ([(0.0003923108, 'appliance'),
   (0.0003923108, 'appreciate'),
   (0.0003923108, 'arolyne'),
   (0.0003923108, 'arrive'),
   (0.0003923108, 'arrow'),
   (0.0003923108, 'art'),
   (0.000392

In [8]:
# Use the bag of words representation of all the email bodies, and learn num_topics using LDA
corpus_common = bow_docs_common
corpus_pos = bow_docs_pos
# Train LDA model.
from gensim.models import LdaModel
from gensim.models import CoherenceModel

# Set training parameters.
#num_topics = 2
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
# id2token is filled lazily, so access one item first to populate it.
temp = dictionary_common[0]  # This is only to "load" the dictionary.
temp = dictionary_pos[0]  # This is only to "load" the dictionary.

id2word_common = dictionary_common.id2token
id2word_pos = dictionary_pos.id2token

num_topics_list = list(range(2,52))
# Evaluate every 10th candidate (2, 12, 22, 32, 42), as in the recorded output
num_topics_subset = num_topics_list[::10]

for num_topics in num_topics_subset:

    model_common = LdaModel(
        corpus=corpus_common,
        id2word=id2word_common,
        chunksize=chunksize,
        alpha='auto',
        eta='auto',
        iterations=iterations,
        num_topics=num_topics,
        passes=passes,
        eval_every=eval_every
    )

    #print('Training topics for POS model\n')
    model_pos = LdaModel(
        corpus=corpus_pos,
        id2word=id2word_pos,
        chunksize=chunksize,
        alpha='auto',
        eta='auto',
        iterations=iterations,
        num_topics=num_topics,
        passes=passes,
        eval_every=eval_every
    )



    # Print the top topics for subjective evaluation
    #print("Finding top topics for common model")
    #top_topics_common = model_common.top_topics(corpus_common) #, num_words=20)
    #print("Finding top topics for pos model")
    #top_topics_pos = model_pos.top_topics(corpus_pos) #, num_words=20)

    #print("Computing topic coherence")
    # Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
    #avg_topic_coherence_common = sum([t[1] for t in top_topics_common]) / num_topics
    #avg_topic_coherence_pos = sum([t[1] for t in top_topics_pos]) / num_topics
    coherence_model_lda_pos = CoherenceModel(model=model_pos, texts=texts_pos, dictionary=dictionary_pos, coherence='c_v')
    coherence_model_lda_common = CoherenceModel(model=model_common, texts=texts_common, dictionary=dictionary_common, coherence='c_v')

    coherence_lda_common = coherence_model_lda_common.get_coherence()
    coherence_lda_pos = coherence_model_lda_pos.get_coherence()

    #print('\nCoherence Score common: ', coherence_lda_common)

    print('Number of topics: ',num_topics,' Average topic coherence for COMMON BOW: %.4f.' % coherence_lda_common)
    #print('Average topic coherence for POS BOW: %.4f.' % avg_topic_coherence_pos)

    #coherence_lda_pos = coherence_model_lda_pos.get_coherence()
    #print('\nCoherence Score common: ', coherence_lda_pos)

    print('Number of topics: ',num_topics,' Average topic coherence for POS BOW: %.4f.\n' % coherence_lda_pos)


    #from pprint import pprint
    #print("Common BOW topics")
    #pprint(top_topics_common)

    #print("POS BOW topics")
    #pprint(top_topics_pos)





Number of topics:  2  Average topic coherence for COMMON BOW: 0.3459.
Number of topics:  2  Average topic coherence for POS BOW: 0.4271.

Number of topics:  12  Average topic coherence for COMMON BOW: 0.4592.
Number of topics:  12  Average topic coherence for POS BOW: 0.4462.

Number of topics:  22  Average topic coherence for COMMON BOW: 0.5691.
Number of topics:  22  Average topic coherence for POS BOW: 0.5328.

Number of topics:  32  Average topic coherence for COMMON BOW: 0.5597.
Number of topics:  32  Average topic coherence for POS BOW: 0.5883.

Number of topics:  42  Average topic coherence for COMMON BOW: 0.5554.
Number of topics:  42  Average topic coherence for POS BOW: 0.5210.
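
One simple way to turn these scores into a choice is to take the topic count with the highest coherence. The sketch below copies the scores printed above into plain dictionaries; the `best_num_topics` helper is a name made up for this example, and picking the maximum is only one possible selection rule, not necessarily the one used in src/models.

```python
# c_v coherence scores copied from the output above, keyed by number of topics
coherence_common = {2: 0.3459, 12: 0.4592, 22: 0.5691, 32: 0.5597, 42: 0.5554}
coherence_pos = {2: 0.4271, 12: 0.4462, 22: 0.5328, 32: 0.5883, 42: 0.5210}

def best_num_topics(scores):
    """Return the number of topics with the highest coherence score."""
    return max(scores, key=scores.get)

best_common = best_num_topics(coherence_common)
best_pos = best_num_topics(coherence_pos)

print('Best number of topics for COMMON BOW: %d (%.4f)' % (best_common, coherence_common[best_common]))
print('Best number of topics for POS BOW: %d (%.4f)' % (best_pos, coherence_pos[best_pos]))
```

Because LDA training is stochastic and the corpus here is small (66 emails), close scores such as 22 versus 32 topics for the common BOW may swap order between runs, so repeating the search with several random seeds would give a steadier picture.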



In [3]:
print(top_topics_pos)


[([(0.000435161, 'Choose'), (0.000435161, 'Children'), (0.000435161, 'Families'), (0.000435161, 'Christmas'), (0.000435161, 'Christopher'), (0.000435161, 'Classic'), (0.000435161, 'Cooking'), (0.000435161, 'County'), (0.000435161, 'Crafts'), (0.000435161, 'Cultivating'), (0.000435161, 'De'), (0.000435161, 'Delights'), (0.000435161, 'Diane'), (0.000435161, 'Dick'), (0.000435161, 'DocsLarge'), (0.000435161, 'Editor'), (0.000435161, 'Encyclopedia'), (0.000435161, 'Charlevoix'), (0.000435161, 'Blueprint'), (0.000435161, 'Body')], -0.06931471798999445), ([(0.000435161, 'Choose'), (0.000435161, 'Children'), (0.000435161, 'Families'), (0.000435161, 'Christmas'), (0.000435161, 'Christopher'), (0.000435161, 'Classic'), (0.000435161, 'Cooking'), (0.000435161, 'County'), (0.000435161, 'Crafts'), (0.000435161, 'Cultivating'), (0.000435161, 'De'), (0.000435161, 'Delights'), (0.000435161, 'Diane'), (0.000435161, 'Dick'), (0.000435161, 'DocsLarge'), (0.000435161, 'Editor'), (0.000435161, 'Encyclopedi