# JSC Assignment 3: Natural Language Processing

In this scenario, you're beginning an investigation into fraud and electricity price manipulation at Enron.  Your job is to provide a rough overview of the situation and recommendations for who to investigate at what times on what general topics.


## Questions your report should answer:

### What value could you add with this data?  For example:
 - Find explicit evidence of fraud
 - Identify which groups of people were working closely together
 - Help understand major changes in behavior and the corresponding times.
 - Suggest who to investigate further.

### How will you measure success?  Identify:
 - What the ultimate goal is, even if it's hard to measure (justice?  number of people prosecuted, years behind bars, future crimes not committed because of deterrence)
 - Suggest some proxies for that goal that are easier to quantify, e.g. explicit admissions of guilt, evidence of collusion.
 - Say how you'll choose your models (e.g. looking at the topics yourself, checking for major world events that should show up in your analysis)

### Look at the data.
  - Try to find outliers / problems in data.
  - Give a sense of the amount of relevant data you have to support each prediction.  How many emails per person are in the dataset?
  - Look for signs of mass deletion / data hiding.  Did any two people communicate regularly by email, then suddenly stop?
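
The two checks above (emails per person, and pairs who suddenly go quiet) can be sketched in plain Python. This is only an illustration on invented records: the tuple layout, the thresholds `min_messages` and `min_gap_days`, and the function names are assumptions, not part of the dataset or of any library.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical toy records: (sender, recipient, date sent).
emails = [
    ("a@enron.com", "b@enron.com", date(2001, 1, 5)),
    ("a@enron.com", "b@enron.com", date(2001, 2, 7)),
    ("a@enron.com", "b@enron.com", date(2001, 3, 9)),
    ("a@enron.com", "c@enron.com", date(2001, 3, 12)),
    ("c@enron.com", "a@enron.com", date(2001, 11, 20)),
]

def emails_per_sender(records):
    """Number of emails sent by each address."""
    return Counter(sender for sender, _, _ in records)

def silent_pairs(records, min_messages=3, min_gap_days=90):
    """Pairs with at least min_messages emails whose last email falls
    at least min_gap_days before the end of the dataset."""
    last_day = max(day for _, _, day in records)
    by_pair = defaultdict(list)
    for sender, recipient, day in records:
        # Direction is ignored: a->b and b->a count as the same pair.
        by_pair[frozenset((sender, recipient))].append(day)
    flagged = []
    for pair, days in by_pair.items():
        if len(days) >= min_messages and (last_day - max(days)).days >= min_gap_days:
            flagged.append((tuple(sorted(pair)), max(days)))
    return sorted(flagged)

counts = emails_per_sender(emails)
flagged = silent_pairs(emails)
print(counts.most_common())
print(flagged)
```

A flagged pair is only a prompt for a closer look; people also stop emailing because they changed jobs or left the company.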

### Brainstorm complementary sources of data.
  - E.g. Data from other companies, other concurrent investigations, SEC filings, stock records

### Brainstorm a comprehensive list of factors that could affect the recorded data, or the trends your models will capture.
  - What are all the factors that could conceivably influence the emails?  E.g. people hacking each other's accounts, retroactively editing emails, suspicion of future investigation, different languages being spoken, people using code words for illegal activities
  - This section just needs to be a list of all the factors that we might conceivably want to model or know in order to improve our topic model.  The purpose is to make sure that when you're making your model, you're keeping in mind what a limited representation it is of a complicated reality.  10-20 factors should be sufficient.  This doesn't have to be a formal model.

### Propose a model/approach staircase (a series of increasingly sophisticated models)
  
  - You don't have to implement every model you propose!
  - It doesn't have to be a strict staircase (each model doesn't have to incorporate the one before it).
  - You can propose models that you don't know how to implement.
  - Ideally, differences between results will make it clear which inputs / parts of the model are important.
  - You need to implement as many model refinements or variants as there are members of your group.
  - Suggested approaches:
      - 1) Simple keyword searches
      - Run a topic model on the whole corpus, then
         - 2) See how the topic use varies with time
         - 3) Break topic use down by person and time
      - 4) Do a simple network analysis to see who wrote to whom.
      - 5) Run a time-varying topic model and see how the top words in each topic changed.

You should do 2 of these, plus one for every member of your group.
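
As a rough starting point for approaches 1 and 4, the sketch below runs a keyword search and counts who wrote to whom. It uses invented emails and an illustrative keyword list; the dictionary keys, `KEYWORDS`, and both function names are assumptions made for the example, and real recipient fields may need more careful parsing.

```python
import re
from collections import Counter

# Hypothetical toy emails standing in for rows of the dataframe.
emails = [
    {"sender": "a@enron.com", "recipients": "b@enron.com, c@enron.com",
     "text": "Please shred the documents before the audit."},
    {"sender": "b@enron.com", "recipients": "a@enron.com",
     "text": "Lunch on Tuesday?"},
    {"sender": "a@enron.com", "recipients": "b@enron.com",
     "text": "The audit team arrives Monday."},
]

KEYWORDS = ["audit", "shred"]  # illustrative only

def keyword_hits(records, keywords):
    """Indices of emails whose body contains any keyword as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    return [i for i, rec in enumerate(records) if pattern.search(rec["text"])]

def edge_counts(records):
    """Count directed (sender, recipient) edges for a simple network view."""
    edges = Counter()
    for rec in records:
        for recipient in rec["recipients"].split(","):
            recipient = recipient.strip()
            if recipient:
                edges[(rec["sender"], recipient)] += 1
    return edges

hits = keyword_hits(emails, KEYWORDS)
edges = edge_counts(emails)
print(hits)
print(edges.most_common())
```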

### Make and explain recommendations
  - For each of your recommendations, try to provide specific emails / events that support that recommendation.
  - Include one or two sanity checks, e.g. that known major events showed up in your analysis.
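
One possible sanity check is to count mentions of a keyword per month and confirm that the peak lines up with a known event; Enron filed for bankruptcy on 2 December 2001, so mentions of "bankruptcy" should peak around then. The messages below are invented for illustration, and `monthly_mentions` is a hypothetical helper, not an existing function.

```python
from collections import Counter
from datetime import date

# Invented toy messages (date, body); in the notebook these would come
# from the `date` and `text` columns.
messages = [
    (date(2001, 10, 17), "Questions about the quarterly earnings restatement"),
    (date(2001, 11, 29), "Is bankruptcy a real possibility?"),
    (date(2001, 12, 3), "Bankruptcy filing announced yesterday"),
    (date(2001, 12, 4), "What the bankruptcy means for employees"),
]

def monthly_mentions(records, keyword):
    """Count messages per (year, month) whose body contains the keyword."""
    counts = Counter()
    for day, body in records:
        if keyword.lower() in body.lower():
            counts[(day.year, day.month)] += 1
    return counts

mentions = monthly_mentions(messages, "bankruptcy")
peak_month, peak_count = mentions.most_common(1)[0]
print(peak_month, peak_count)
```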


Assignments will be graded according to [these rubrics](https://jsc370.github.io/2020/assignment_rubrics.html)

In [None]:
import re
import string
import logging
import warnings
from collections import Counter

import numpy as np
import pandas as pd

import nltk
nltk.download("stopwords")
nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

!pip install gensim
!pip install pyLDAvis
import gensim
from gensim import corpora, models
from gensim.models import ldamodel

import pyLDAvis
# import the adapters as submodules so that the name `gensim`
# keeps pointing at the gensim package rather than pyLDAvis.gensim
import pyLDAvis.gensim
import pyLDAvis.sklearn
from pyLDAvis import sklearn as sklearn_lda

warnings.filterwarnings("ignore")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/24/38/6d81eff34c84c9158d3b7c846bff978ac88b0c2665548941946d3d591158/pyLDAvis-3.2.2.tar.gz (1.7MB)
Collecting funcy
  Downloading https://files.pythonhosted.org/packages/66/89/479de0afbbfb98d1c4b887936808764627300208bb771fcd823403645a36/funcy-1.15-py2.py3-none-any.whl
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-3.2.2-py2.py3-none-any.whl size=135593 sha256=971cfa5346360e5d5dc4663fd21ba9f6ca37b8eb18f6b4c6adf20225d6005a55
  Stored in directory: /root/.cache/pip/wheels/74/df/b6/97234c8446a43be05c9a8687ee0db1f1b5ade5f27729187eae


In [None]:
!wget https://raw.githubusercontent.com/JSC370/jsc370.github.io/master/enron_top_exe_mails.csv


--2021-02-27 01:46:08--  https://raw.githubusercontent.com/JSC370/jsc370.github.io/master/enron_top_exe_mails.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7935976 (7.6M) [text/plain]
Saving to: ‘enron_top_exe_mails.csv’


2021-02-27 01:46:09 (55.1 MB/s) - ‘enron_top_exe_mails.csv’ saved [7935976/7935976]



In [None]:
enron = pd.read_csv("enron_top_exe_mails.csv")
df = pd.DataFrame(enron)

In [None]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | file | message | text | date | senders | recipients | subject |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | beck-s/all_documents/330. | Message-ID: <6734979.1075849819614.JavaMail.ev... | ['', 'Arthur Andersen in conjunction with Hype... | 2001-01-31 05:35:00-08:00 | From: robert.a.seekely@us.arthurandersen.com | To: laura.g.ware@us.arthurandersen.com | Performance Management |
| 1 | 1 | beck-s/discussion_threads/295. | Message-ID: <9563779.1075849841009.JavaMail.ev... | ['', 'Arthur Andersen in conjunction with Hype... | 2001-01-31 05:35:00-08:00 | From: robert.a.seekely@us.arthurandersen.com | To: laura.g.ware@us.arthurandersen.com | Performance Management |
| 2 | 2 | dasovich-j/all_documents/247. | Message-ID: <13169797.1075842938250.JavaMail.e... | ['\tagrubb@calnurses.org, mcgee.nora@epamail.e... | 1999-12-06 23:47:00-08:00 | From: jeffrey.l.walker@us.arthurandersen.com | To: perin@sirius.com | Re: dinner next tuesday |
| 3 | 3 | dasovich-j/all_documents/249. | Message-ID: <14388889.1075842938392.JavaMail.e... | ['\tagrubb@calnurses.org, mcgee.nora@epamail.e... | 1999-12-07 20:44:00-08:00 | From: jeffrey.l.walker@us.arthurandersen.com | To: lnewmansciarrino@elanpharma.com | RE: dinner next tuesday |
| 4 | 4 | dasovich-j/all_documents/301. | Message-ID: <21305507.1075842940032.JavaMail.e... | ['X-To: pat.scatena@intel.com', 'X-cc: agrubb@... | 2000-01-16 14:48:00-08:00 | From: jeffrey.l.walker@us.arthurandersen.com | To: pat.scatena@intel.com | RE: Porlock Vale in 2000 |


In [None]:
df = df.drop("Unnamed: 0", axis=1)

In [None]:
# text preprocessing

# the text column holds the string form of a Python list;
# strip the surrounding brackets, quotes, commas and spaces
df["text"] = df["text"].apply(lambda x: x.strip("[]"))
df["text"] = df["text"].apply(lambda x: x.strip("'',"))
df["text"] = df["text"].apply(lambda x: x.strip(", '',"))


# remove the "From:" prefix (and the space after it) from sender addresses
def remove_from(text):
    return re.sub(r"From:", "", text).strip()

# remove the "To:" prefix (and the space after it) from recipient addresses
def remove_to(text):
    return re.sub(r"To:", "", text).strip()

# remove the single quotes left over from the list representation
def remove_obj(text):
    return re.sub(r"'", "", text)

# define a function to remove punctuation from text
def remove_punct(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

# remove numbers from the mail body
def remove_numbers(text):
    return re.sub(r"\d+", "", text)


# applying all defined functions
df["senders"] = df["senders"].apply(remove_from)
df["recipients"] = df["recipients"].apply(remove_to)
df["text"] = df["text"].apply(remove_obj)
df["text"] = df["text"].apply(remove_punct)
df["text"] = df["text"].apply(remove_numbers)
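
The `text` column appears to store each mail as the string form of a Python list of lines. As a hedged alternative to stripping brackets and quotes by hand, the sketch below parses that string with `ast.literal_eval` and then applies the same punctuation and digit removal. It assumes the cells are valid list literals and falls back to the raw string otherwise; `clean_body` and the sample string are made up for illustration.

```python
import ast
import re
import string

def clean_body(raw):
    """Parse a stringified list of lines and return one cleaned string."""
    try:
        lines = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        lines = [raw]  # not a Python literal: treat it as plain text
    if not isinstance(lines, (list, tuple)):
        lines = [raw]  # a literal, but not a list of lines
    text = " ".join(str(line) for line in lines)
    # Delete punctuation and digits, as the notebook's helpers do.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

sample = "['', 'Meeting at 10:30 on 5/4.', 'Thanks, Ken']"
cleaned = clean_body(sample)
print(cleaned)
```

Parsing the list first keeps line boundaries as spaces, so words from adjacent lines are not glued together when the quotes and commas are removed.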


In [None]:
# counting the k most frequent words in the mail bodies

def get_stop_words():
    stop = set(stopwords.words('english'))

    # header tokens left in the body after punctuation removal ("X-From" -> "XFrom");
    # stored in lowercase because words are lowercased before the lookup
    stop.update(["xfrom", "xto", "xcc", "xbcc", "xfolder"])

    return stop

def getTopKWords(df, kwords):

    stop = get_stop_words()
    counter = Counter()

    mails = df['text'].values

    for mail in mails:
        counter.update([word.lower()
                        for word
                        in re.findall(r'\w+', mail)
                        if word.lower() not in stop and len(word) > 2])
    topk = counter.most_common(kwords)
    return topk




In [None]:
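# creating a dataframe of only those mails received by the co-founder of Enron, Kenneth Lay
# regex=False matches the address literally; .copy() avoids a SettingWithCopyWarning on later column assignments
cf_df = df[df["recipients"].str.contains("kenneth.lay@enron.com", regex=False)].copy()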

In [None]:
# finding the top 50 words used in the mails received by Kenneth Lay
co_founder_top_words = getTopKWords(cf_df, 50)
co_founder_top_words

[('enron', 2825),
 ('would', 1916),
 ('ken', 1140),
 ('know', 1066),
 ('time', 967),
 ('company', 962),
 ('please', 958),
 ('business', 922),
 ('lay', 904),
 ('new', 803),
 ('like', 766),
 ('energy', 728),
 ('may', 698),
 ('meeting', 692),
 ('one', 663),
 ('email', 661),
 ('houston', 626),
 ('subject', 604),
 ('xfilename', 588),
 ('also', 587),
 ('call', 584),
 ('information', 560),
 ('xorigin', 553),
 ('best', 539),
 ('thank', 529),
 ('message', 523),
 ('get', 518),
 ('thanks', 516),
 ('could', 500),
 ('need', 494),
 ('work', 493),
 ('years', 484),
 ('well', 476),
 ('employees', 472),
 ('people', 460),
 ('last', 440),
 ('see', 439),
 ('let', 438),
 ('make', 437),
 ('many', 425),
 ('good', 424),
 ('sent', 423),
 ('next', 411),
 ('want', 407),
 ('year', 406),
 ('layk', 396),
 ('help', 393),
 ('management', 384),
 ('market', 376),
 ('attached', 376)]

## Latent Dirichlet Allocation (LDA) topic modeling using gensim.models

In [None]:
# creating the documents for the gensim dictionary
# we are using the top 50 words from the mails received by the co-founder;
# each word becomes a one-token document
corp = [[word] for word, _ in co_founder_top_words]

In [None]:
corp

[['enron'],
 ['would'],
 ['ken'],
 ['know'],
 ['time'],
 ['company'],
 ['please'],
 ['business'],
 ['lay'],
 ['new'],
 ['like'],
 ['energy'],
 ['may'],
 ['meeting'],
 ['one'],
 ['email'],
 ['houston'],
 ['subject'],
 ['xfilename'],
 ['also'],
 ['call'],
 ['information'],
 ['xorigin'],
 ['best'],
 ['thank'],
 ['message'],
 ['get'],
 ['thanks'],
 ['could'],
 ['need'],
 ['work'],
 ['years'],
 ['well'],
 ['employees'],
 ['people'],
 ['last'],
 ['see'],
 ['let'],
 ['make'],
 ['many'],
 ['good'],
 ['sent'],
 ['next'],
 ['want'],
 ['year'],
 ['layk'],
 ['help'],
 ['management'],
 ['market'],
 ['attached']]

In [None]:
# creating the gensim dictionary (word <-> id mapping) from the relevant keywords
id2word = corpora.Dictionary(corp)

In [None]:
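# tokenizing mail body text so that it is useful for topic modeling
# use a regular expression tokenizer to tokenize
# note: tokens keep their original case while id2word holds lowercase words,
# so capitalized occurrences (e.g. "Enron") are not counted by doc2bow
tokenizer = RegexpTokenizer(r'\w+')
cf_df = cf_df.assign(text=cf_df['text'].apply(tokenizer.tokenize))
clean_mails = cf_df['text']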

In [None]:
# creating the bag-of-words corpus (term counts per mail)
corpus = [id2word.doc2bow(text) for text in clean_mails]
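
A bag-of-words conversion only counts tokens that exist in the dictionary, and the match is case-sensitive. The plain-Python sketch below mimics that behavior (it is not gensim's implementation) to show how a lowercase vocabulary misses capitalized tokens unless the tokens are lowercased first. `vocab`, `to_bow`, and the token list are invented for the example.

```python
from collections import Counter

# Hypothetical lowercase vocabulary, like one built from lowercased top words.
vocab = {"enron": 0, "meeting": 1, "ken": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for tokens found in token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

tokens = ["Enron", "meeting", "with", "Ken", "at", "enron"]

bow_raw = to_bow(tokens, vocab)                         # case-sensitive match
bow_lower = to_bow([t.lower() for t in tokens], vocab)  # lowercase first
print(bow_raw)
print(bow_lower)
```

In the raw version, "Enron" and "Ken" are silently dropped, which would shrink the apparent weight of those words in any model trained on the result.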

In [None]:

# finding tf_idf scores with corpus
tfidf = models.TfidfModel(corpus)
tf_corpus = tfidf[corpus]
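
To make the tf-idf weighting concrete, here is a small pure-Python sketch. It assumes the weighting that gensim's `TfidfModel` uses by default as far as I understand it: raw term count times log2(N / document frequency), followed by L2 normalization of each document vector. Treat that as an assumption to check against the gensim documentation; the `tfidf` function and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by count * log2(N / df), then L2-normalize per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        term_counts = Counter(doc)
        vec = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in vec.items()})
    return weighted

docs = [["enron", "meeting"], ["enron", "market", "market"]]
weights = tfidf(docs)
print(weights)
```

A term that occurs in every document (here "enron") gets weight zero, which is one reason very common words can nearly vanish from topics trained on tf-idf input.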

In [None]:

# calling LDA model
# we are modeling for 5 topics
LDA_gen = ldamodel.LdaModel(corpus=tf_corpus,
                            id2word=id2word,
                            num_topics=5,
                            random_state=100,
                            update_every=1,
                            chunksize=3,
                            passes=3,
                            alpha='symmetric',
                            iterations=100,
                            per_word_topics=True)

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)

In [None]:
print(LDA_gen.print_topics(num_words=5))  # printing the top 5 keywords in each of the 5 topics

[(0, '0.132*"know" + 0.129*"well" + 0.117*"company" + 0.104*"let" + 0.099*"people"'), (1, '0.239*"may" + 0.230*"market" + 0.208*"want" + 0.181*"message" + 0.119*"sent"'), (2, '0.566*"meeting" + 0.401*"please" + 0.002*"ken" + 0.001*"enron" + 0.001*"make"'), (3, '0.106*"would" + 0.075*"call" + 0.067*"could" + 0.065*"need" + 0.064*"also"'), (4, '0.262*"time" + 0.192*"new" + 0.180*"make" + 0.120*"year" + 0.102*"thanks"')]
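
The printed topic strings are easier to compare once they are split into (word, weight) pairs. The sketch below parses one of the strings from the output above with a regular expression; `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# Topic 2 as printed in the output above.
topic = '0.566*"meeting" + 0.401*"please" + 0.002*"ken" + 0.001*"enron" + 0.001*"make"'

def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

parsed = parse_topic(topic)
top_two_share = sum(weight for _, weight in parsed[:2])
print(parsed)
print(round(top_two_share, 3))
```

For this topic the first two words carry almost all of the listed weight, which suggests the topic is closer to a two-word cluster than to a coherent theme.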


In [None]:
# Topics visualization using pyLDAvis
pyLDAvis.enable_notebook()

In [None]:
# topic visualization
titles = pyLDAvis.gensim.prepare(LDA_gen, tf_corpus, dictionary=LDA_gen.id2word)
titles