This notebook works with text data containing emails from Enron employees. The Enron scandal is a famous fraud case: Enron employees covered up the company's poor financial position, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific phrases, such as "sell enron stock".

In [None]:
!pip install pyLDAvis



# Import packages

In [None]:
import pandas as pd
import numpy as np
import string
import nltk
import gensim
import pyLDAvis.gensim
import warnings
warnings.filterwarnings('ignore')


from nltk import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

from gensim import corpora

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
df = pd.read_csv('enron_emails_clean.csv')

# Cleaning the text data
1. Tokenization
   * Split the text into sentences and the sentences into words
   * Transform everything to lowercase
   * Remove punctuation
2. Remove all stopwords
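
As a rough illustration of these steps on a single sentence, here is a minimal sketch that uses only the standard library. The sample sentence and the tiny `toy_stop` set are made up for the example; the notebook itself uses NLTK's English stopword list and applies the steps in a slightly different order.

```python
import string

# Hypothetical sample text and a tiny stopword set, for illustration only
sample = "Please SELL the Enron stock, now!"
toy_stop = {"the", "please", "now"}

# 1. Lowercase the text and tokenize on whitespace
tokens = sample.lower().split()

# 2. Strip punctuation characters from each token
table = str.maketrans("", "", string.punctuation)
tokens = [t.translate(table) for t in tokens]

# 3. Drop stopwords and any tokens that became empty
tokens = [t for t in tokens if t and t not in toy_stop]

print(tokens)
```

Only the content-bearing words survive, which is what the topic model should see.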

In [None]:
# Define stopwords to exclude
stop = set(stopwords.words('english'))
stop.update(("to", "cc", "subject", "http", "from", "sent", "ect", "u", "fwd", "www", "com", 'html'))

# Define punctuation characters to exclude
exclude = set(string.punctuation)

* A few more steps remain before the data is clean, such as lemmatizing the words and stemming the verbs. The verbs in the email data are already stemmed, so only lemmatization remains. We do that next with the `lemma` object.

In [None]:
# Instantiate the WordNet lemmatizer from nltk
lemma = WordNetLemmatizer()

# Use the previously defined variables stop and exclude to finish off the function:
# strip trailing whitespace using rstrip, then exclude stopwords, digits and punctuation.
# Finally lemmatize the words and assign the result to normalized
def clean(text, stop):
    text = str(text).rstrip()
    stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(i for i in stop_free if i not in exclude)
    normalized = " ".join(lemma.lemmatize(i) for i in punc_free.split())      
    return normalized

In [None]:
# Clean the emails in df and print results
# Apply the function clean(text,stop) on each line of text data in our dataframe.
text_clean=[]
for text in df['clean_content']:
    text_clean.append(clean(text, stop).split())

In [None]:
text_clean[0][:10]

['investools',
 'advisory',
 'free',
 'digest',
 'trusted',
 'investment',
 'advice',
 'unsubscribe',
 'free',
 'newsletter']

* The data is now fully cleaned: the text is split into words, stopwords and punctuation are removed, and the words are lemmatized.
* We are now ready to run a topic model on this data, which is the next step towards detecting possible signs of fraud.

## Topic Modelling on fraud

### Create dictionary and corpus

* To run an LDA topic model, we first need to define the dictionary and the corpus, since both are inputs to the model.

In [None]:
# Define the dictionary
dictionary = corpora.Dictionary(text_clean)

# Define the corpus
# doc2bow converts each tokenized document into bag-of-words format:
# each row is now a list of (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in text_clean]

In [None]:
print(dictionary)

Dictionary(31490 unique tokens: ['account', 'accurate', 'acquiring', 'acre', 'address']...)


In [None]:
corpus[0][:10]

[(0, 2),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 6),
 (6, 1),
 (7, 2),
 (8, 4),
 (9, 1)]
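
Each pair in the output above is `(token id, count)`: the id indexes into the dictionary, and the count is how often that token occurs in the document. The following is a plain-Python sketch of the same idea on a toy corpus. It is not gensim's implementation, gensim may assign ids in a different order, and the helper names `build_vocab` and `to_bow` are hypothetical.

```python
from collections import Counter

# Toy tokenized documents, made up for illustration
docs = [["sell", "enron", "stock", "sell"],
        ["enron", "meeting", "stock"]]

def build_vocab(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc)
    return sorted(counts.items())

vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bow)
```

In the first toy document, "sell" appears twice, so its id is paired with a count of 2.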

## LDA model

* Now it's time to build the LDA model. Using the dictionary and corpus, we are ready to discover which topics are present in the Enron emails.

In [None]:
# Define the LDA model
# Build the LDA model from gensim models, by inserting the corpus and dictionary.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=5)

# Save the topics and top 5 words
topics = ldamodel.print_topics(num_words=5)

# Print the results
for topic in topics:
    print(topic)

(0, '0.021*"enron" + 0.008*"message" + 0.007*"stock" + 0.006*"original" + 0.006*"pm"')
(1, '0.016*"td" + 0.013*"font" + 0.006*"size" + 0.006*"tr" + 0.006*"e"')
(2, '0.023*"enron" + 0.016*"bakernet" + 0.015*"company" + 0.013*"employee" + 0.010*"million"')
(3, '0.049*"enron" + 0.013*"hou" + 0.008*"company" + 0.006*"said" + 0.006*"corp"')
(4, '0.011*"enron" + 0.005*"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" + 0.005*"company" + 0.004*"one" + 0.004*"new"')


* We have now created a first topic model on the Enron email data. However, the printed word lists alone do not give us enough information to find a topic that might point to signs of fraud.

In [None]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)

In [None]:
pyLDAvis.display(lda_display)

* Have a look at topics 1, 4 and 5 from the LDA model on the Enron email data.
* Topic 5 seems to discuss the employee share option program, and its top terms ("Please", "Thanks", "mail", "contact") suggest that it consists largely of conversational email.
* Topic 4 seems to be more related to general news around Enron ("stock", "million", "market", "fund" and so on).

## Finding fraudsters based on topic

* Now we are going to link the results from the topic model back to the original data. 
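
For each email, the model returns a list of `(topic id, probability)` pairs, and the dominant topic is the pair with the highest probability. A minimal sketch of that selection step, using hand-written distributions rather than real model output (the numbers and the helper name `dominant_topic` are assumptions for illustration):

```python
# Hand-written per-document topic distributions, for illustration only
doc_topics = [
    [(0, 0.10), (2, 0.85), (4, 0.05)],
    [(1, 0.40), (3, 0.60)],
]

def dominant_topic(distribution):
    """Return the (topic id, probability) pair with the largest probability."""
    return max(distribution, key=lambda pair: pair[1])

dominant = [dominant_topic(d) for d in doc_topics]
print(dominant)
```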


In [None]:
def get_topic_details(ldamodel, corpus):
    # Collect the dominant topic and its probability for every document.
    # Rows are gathered in a list because DataFrame.append was removed in pandas 2.0.
    rows = []
    for doc_topics in ldamodel[corpus]:
        # The dominant topic is the (topic, probability) pair with the highest probability
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        rows.append([float(topic_num), prop_topic])
    return pd.DataFrame(rows, columns=['Dominant_Topic', '% Score'])

In [None]:
# Run get_topic_details function and check the results
topic_details_df = get_topic_details(ldamodel, corpus)

In [None]:
topic_details_df.head()

Unnamed: 0,Dominant_Topic,% Score
0,4.0,0.976538
1,3.0,0.998935
2,0.0,0.784948
3,2.0,0.993503
4,2.0,0.9934


In [None]:
topic_details_df.tail()

Unnamed: 0,Dominant_Topic,% Score
1676,4.0,0.730687
1677,0.0,0.533567
1678,3.0,0.844016
1679,4.0,0.992028
1680,0.0,0.790291


* Concatenate, column-wise, the results of the previously defined function `get_topic_details()` with the text data stored in `contents`, and inspect the results.

In [None]:
# Add original text to topic details in a dataframe
contents = pd.DataFrame({'Original text': text_clean})
topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)

In [None]:
topic_details.sort_values(by=['% Score'], ascending=False).head(10).head()

Unnamed: 0,Dominant_Topic,% Score,Original text
442,2.0,0.999963,"[pleased, send, web, based, e, mail, alert, pr..."
154,3.0,0.999957,"[joint, venture, enron, meeting, belies, offic..."
135,3.0,0.999953,"[lawyer, agree, order, safeguard, document, ho..."
107,3.0,0.999907,"[sample, article, original, message, schmidt, ..."
849,4.0,0.999877,"[original, message, received, thu, aug, cdt, e..."


In [None]:
topic_details.sort_values(by=['% Score'], ascending=False).head(10).tail()

Unnamed: 0,Dominant_Topic,% Score,Original text
1042,1.0,0.999847,"[forwarded, richard, shapiro, na, enron, pm, p..."
1252,1.0,0.999802,"[attention, reading, message, mail, reader, ca..."
271,4.0,0.999792,"[nancy, sheed, nancy, sheed, businessweek, pm,..."
820,1.0,0.999639,"[title, gas, index, intercontentalexchange, ti..."
211,4.0,0.999623,"[opinionjournal, best, web, today, january, ja..."


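* Create a flag with the `np.where()` function that marks all content with topic 3 as the dominant topic with a 1, and everything else with a 0. Topic 3 in the visualization corresponds to `Dominant_Topic == 2.0`, because gensim numbers topics from 0 while pyLDAvis numbers them from 1.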
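
`np.where(condition, x, y)` returns `x` where the condition holds and `y` elsewhere, so it turns a boolean test into a 0/1 flag column. A small sketch on made-up values (the array below is not taken from the model output):

```python
import numpy as np

# Made-up dominant-topic indices for five documents
dominant_topic = np.array([4.0, 3.0, 0.0, 2.0, 2.0])

# 1 where the dominant topic index equals 2.0, otherwise 0
flag = np.where(dominant_topic == 2.0, 1, 0)
print(flag.tolist())
```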

In [None]:
# Create flag for text most strongly associated with topic index 2
topic_details['flag'] = np.where((topic_details['Dominant_Topic'] == 2.0), 1, 0)

In [None]:
topic_details_1 = topic_details[topic_details.flag == 1]

In [None]:
topic_details_1.sort_values(by=['% Score'], ascending=False).head(10)

Unnamed: 0,Dominant_Topic,% Score,Original text,flag
442,2.0,0.999963,"[pleased, send, web, based, e, mail, alert, pr...",1
109,2.0,0.998311,"[original, message, ken, lay, office, chairman...",1
14,2.0,0.998249,"[today, announced, plan, merge, dynegy, major,...",1
155,2.0,0.998249,"[today, announced, plan, merge, dynegy, major,...",1
38,2.0,0.998249,"[today, announced, plan, merge, dynegy, major,...",1
129,2.0,0.998249,"[today, announced, plan, merge, dynegy, major,...",1
181,2.0,0.998249,"[today, announced, plan, merge, dynegy, major,...",1
49,2.0,0.998248,"[today, announced, plan, merge, dynegy, major,...",1
1473,2.0,0.998248,"[today, announced, plan, merge, dynegy, major,...",1
13,2.0,0.998248,"[today, announced, plan, merge, dynegy, major,...",1


* You have now flagged all emails most strongly associated with topic 3, which seems to cover internal conversation about Enron stock options. You are a true detective. These exercises show that text mining and topic modeling can be a useful tool for fraud detection.