## Taking a look at the Enron labeled data

Code borrowed from:
https://github.com/shoreason/enron-topic-modeling/blob/master/enron_lda.ipynb

In [2]:
import numpy as np
import pandas as pd
import vocab as vocabulary
import collections
import utils
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as score
from wordcloud import WordCloud, STOPWORDS
from collections import defaultdict

In [3]:
import glob
import fileinput
import shutil

In [4]:
datadir = "/data/SuperMod/emails.csv"

In [5]:
enrondata = pd.read_csv(datadir)

In [6]:
enrondata.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [7]:
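def parse_raw_message(raw_message):
    """Extract the 'from', 'to' and 'body' fields from one raw email string.

    Known limitation: any line containing ':' is treated as a header, so body
    lines with a colon (times, quoted headers) are dropped, and the remaining
    body lines are joined without a separator.
    """
    lines = raw_message.split('\n')
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            # No colon: treat the line as part of the body.
            message += line.strip()
            email['body'] = message
        else:
            # Colon present: treat the line as a "Key: value" header.
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email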
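
The parser above treats every line containing a colon as a header, which is why the printed bodies further down lose lines such as timestamps and run words together. As a possible alternative, here is a minimal sketch using Python's standard `email` module, which splits headers from body at the first blank line. The helper name `parse_with_stdlib` and the demo message are illustrative and not part of the original notebook; the sketch assumes non-multipart, plain-text messages.

```python
from email import message_from_string


def parse_with_stdlib(raw_message):
    """Split a raw message into from/to/body.

    Hypothetical helper, not part of the original notebook.
    """
    msg = message_from_string(raw_message)
    return {
        'from': msg.get('From', ''),
        'to': msg.get('To', ''),
        # For a non-multipart message, get_payload() returns the body as a string.
        'body': msg.get_payload(),
    }


# Made-up demo message in the same header/body layout as the Enron files.
demo_message = (
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "Subject: forecast\n"
    "\n"
    "Here is our forecast\n"
    "Call me at 3:00 PM\n"
)
parsed = parse_with_stdlib(demo_message)
print(parsed['body'])
```

Unlike the line-by-line approach, the body line containing `3:00 PM` is kept, and line breaks in the body are preserved.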

In [8]:
def parse_into_emails(messages):
    emails = [parse_raw_message(message) for message in messages]
    return {
        'body': map_to_list(emails, 'body'),
        'to': map_to_list(emails, 'to'),
        'from_': map_to_list(emails, 'from')
    }

In [9]:
def map_to_list(emails, key):
    results = []
    for email in emails:
        if key not in email:
            results.append('')
        else:
            results.append(email[key])
    return results

In [11]:
email_df = pd.DataFrame(parse_into_emails(enrondata.message))
print(email_df.head())

                                                body                    from_  \
0                               Here is our forecast  phillip.allen@enron.com   
1  Traveling to have a business meeting takes the...  phillip.allen@enron.com   
2                     test successful.  way to go!!!  phillip.allen@enron.com   
3  Randy,Can you send me a schedule of the salary...  phillip.allen@enron.com   
4                                                     phillip.allen@enron.com   

                        to  
0     tim.belden@enron.com  
1  john.lavorato@enron.com  
2   leah.arsdall@enron.com  
3    randall.gay@enron.com  
4     greg.piper@enron.com  


In [None]:
517398

In [22]:
print(email_df.iloc[13,0])

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 10/09/2000Richard BurchfieldPhillip,Below is the issues & to do list as we go forward with documenting therequirements for consolidated physical/financial positions and transporttrade capture. What we need to focus on is the first bullet in Allan's list;the need for a single set of requirements. Although the meeting with Keith,on Wednesday,  was informative the solution of creating a infinitely dynamicconsolidated position screen, will be extremely difficult and timeconsuming.  Throughout the meeting on Wednesday, Keith alluded to theinability to get consensus amongst the traders on the presentation of theconsolidated position, so the solution was to make it so that a trader canarrange the position screen to their liking (much like Excel). What needs tohappen on Monday from 3 - 5 is a effort to design a desired layout for theconsolidated position screen, this is critical. This does not excludebuilding a capability to create

In [21]:
print(enrondata.iloc[13,1])

Message-ID: <2707340.1075855687584.JavaMail.evans@thyme>
Date: Mon, 9 Oct 2000 07:00:00 -0700 (PDT)
From: phillip.allen@enron.com
To: keith.holst@enron.com
Subject: Consolidated positions: Issues & To Do list
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Keith Holst
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 10/09/2000 
02:00 PM ---------------------------


Richard Burchfield
10/06/2000 06:59 AM
To: Phillip K Allen/HOU/ECT@ECT
cc: Beth Perlman/HOU/ECT@ECT 
Subject: Consolidated positions: Issues & To Do list

Phillip,
 Below is the issues & to do list as we go forward with documenting the 
requirements for consolidated physical/financial positions and transport 
trade capture. What we need to focus on is the first bullet in Allan's list; 
the need for a single set of requireme

In [12]:
email_df.body

0                                      Here is our forecast
1         Traveling to have a business meeting takes the...
2                            test successful.  way to go!!!
3         Randy,Can you send me a schedule of the salary...
4                                                          
5         Greg,How about either next Tuesday or Thursday...
6         Phillip Allen (pallen@enron.com)Mike Grigsby (...
7                                                          
8         I don't think these are required by the ISP2. ...
9         ---------------------- Forwarded by Phillip K ...
10        Mr. Buckner,For delivered gas behind San Diego...
11        Lucy,Open them and save in the rentroll folder...
12        ---------------------- Forwarded by Phillip K ...
13        ---------------------- Forwarded by Phillip K ...
14        Dave,Here are the names of the west desk membe...
15                          Paula,35 million is finePhillip
16        ---------------------- Forward

In [11]:
testfile = "/data/SuperMod/enron_with_categories/1/82353.txt"
testlabel = "/data/SuperMod/enron_with_categories/1/82353.cats"
with open(testfile, 'r') as readfile:
    sample = readfile.read()
with open(testlabel, 'r') as readlabel:
    samplelabel = readlabel.read()
print(sample)
print("***********************")
print(samplelabel)

Message-ID: <3524436.1075863727537.JavaMail.evans@thyme>
Date: Tue, 12 Feb 2002 05:07:17 -0800 (PST)
From: m..presto@enron.com
To: fgiffels@hgp-inc.com
Subject: RE: Confidential Contact data and RFI
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Presto, Kevin M. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=KPRESTO>
X-To: 'Fred W. Giffels' <fgiffels@HGP-Inc.com>
X-cc: 
X-bcc: 
X-Folder: \Kevin_Presto_Mar2002_1\Presto, Kevin M.\Sent Items
X-Origin: Presto-K
X-FileName: kpresto (Non-Privileged).pst

Any nuclear specific info?
 

Kevin Presto 
UBS Warburg Energy 
kevin.presto@ubswenergy.com 
Phone:  713-853-5035 
Fax:  713-646-8272 

-----Original Message-----
From: Fred W. Giffels [mailto:fgiffels@HGP-Inc.com]
Sent: Tuesday, February 12, 2002 7:04 AM
To: Presto, Kevin M.
Cc: Dan Salter
Subject: Re: Confidential Contact data and RFI


Confidential
Kevin
I know you would not bid the entire 1000Mwe, we are trying to come up with a proxy to assist in 

In [16]:

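# Root of the labeled data, inferred from the test file path used earlier.
# Each numbered subfolder holds <id>.txt / <id>.cats pairs.
basedir = "/data/SuperMod/enron_with_categories"

rows = []
for folder in range(1, 8):
    folder_add = basedir + '/' + str(folder)

    for filename in glob.glob(folder_add + '/*.txt'):
        # The label file shares the email's name, with a .cats extension.
        label_filename = filename[:-4] + '.cats'
        # File id: the part between the folder path and the .txt extension.
        file = filename[len(folder_add) + 1:-4]

        with open(filename, 'r') as readfile:
            email = readfile.read()
        with open(label_filename, 'r') as readlabel:
            label = readlabel.read()

        rows.append([str(folder), file, email, label])

# Build the frame once instead of appending row by row with .loc.
combined_df = pd.DataFrame(rows, columns=['foldername', 'file', 'email', 'label'])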

In [19]:
combined_df.head()

Unnamed: 0,foldername,file,email,label
0,1,229024,Message-ID: <22676486.1075853122206.JavaMail.e...,"1,1,1\n3,8,1\n"
1,1,219048,Message-ID: <30008704.1075852472248.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n2,13,1\n3,10,1\n4,10,1\n"
2,1,173960,Message-ID: <27385216.1075846164025.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n3,1,1\n3,6,1\n"
3,1,174397,Message-ID: <104094.1075846176879.JavaMail.eva...,"1,1,2\n2,1,2\n2,2,2\n2,4,2\n3,1,2\n"
4,1,115139,Message-ID: <23486926.1075842966554.JavaMail.e...,"1,1,2\n2,1,1\n2,6,2\n3,1,1\n3,6,2\n"


In [28]:
test_label = combined_df.label[1]

In [34]:
findemotion = lambda x: [label.split(',') for label in x.split('\n') if (label != '' and label[0] == '4')]

In [36]:
combined_df['emotion_label'] = combined_df.label.map(findemotion)

In [38]:
combined_df['emotion_label'].head(10)

0                        []
1              [[4, 10, 1]]
2                        []
3                        []
4                        []
5                        []
6                        []
7                        []
8    [[4, 3, 2], [4, 9, 2]]
9              [[4, 10, 2]]
Name: emotion_label, dtype: object

In [39]:
emotion_only = lambda x: [label.split(',')[1] for label in x.split('\n') if (label != '' and label[0] == '4')]

In [41]:
combined_df['emotion_only'] = combined_df.label.map(emotion_only)
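
The two lambdas above filter on the first character of each label line and keep the fields as strings. A slightly more defensive variant is sketched below: it converts each line to a tuple of integers and compares the whole first field. The function name `parse_cats` and the `top_level` parameter are hypothetical; the only assumption about the `.cats` format is what the outputs above show, namely three comma-separated integers per line with 4 as the top-level code used for emotion.

```python
def parse_cats(label_text, top_level=None):
    """Parse the text of a .cats file into tuples of ints.

    Hypothetical helper, not part of the original notebook. Each non-empty
    line is expected to hold three comma-separated integers; the first is
    the top-level category.
    """
    rows = []
    for line in label_text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = tuple(int(part) for part in line.split(','))
        # Keep every row, or only those under the requested top-level category.
        if top_level is None or fields[0] == top_level:
            rows.append(fields)
    return rows


# Label string copied from row 8 of the table above.
demo_label = "1,1,2\n2,11,2\n3,6,2\n4,3,2\n4,9,2\n"
emotion_rows = parse_cats(demo_label, top_level=4)
emotion_codes = [row[1] for row in emotion_rows]
print(emotion_rows)   # [(4, 3, 2), (4, 9, 2)]
print(emotion_codes)  # [3, 9]
```

With the real data this could be applied as `combined_df.label.map(lambda text: parse_cats(text, top_level=4))`.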

In [30]:
test_email = combined_df.email[1]

In [31]:
test_email

'Message-ID: <30008704.1075852472248.JavaMail.evans@thyme>\nDate: Fri, 10 Aug 2001 15:40:25 -0700 (PDT)\nFrom: ray.alvarez@enron.com\nTo: dwatkiss@bracepatt.com, dfergus@brobeck.com\nSubject: CONFIDENTIAL Attached file\nCc: d..steffes@enron.com, robert.frank@enron.com\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nBcc: d..steffes@enron.com, robert.frank@enron.com\nX-From: Alvarez, Ray </O=ENRON/OU=NA/CN=RECIPIENTS/CN=NOTESADDR/CN=EBE4476B-2D94882A-86256A14-75FF3B>\nX-To: dwatkiss@bracepatt.com, dfergus@brobeck.com\nX-cc: Steffes, James D. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSTEFFE>, Frank, Robert </O=ENRON/OU=NA/CN=RECIPIENTS/CN=RFRANK>\nX-bcc: \nX-Folder: \\JSTEFFE (Non-Privileged)\\Steffes, James D.\\California Issues\nX-Origin: Steffes-J\nX-FileName: JSTEFFE (Non-Privileged).pst\n\nThis is the negative CTC claim.  Transmission and other charges we owe have been set off; ie this is a net amount and the gross CTC is much larger- on the o

In [42]:
combined_df.head(10)

Unnamed: 0,foldername,file,email,label,emotion_label,emotion_only
0,1,229024,Message-ID: <22676486.1075853122206.JavaMail.e...,"1,1,1\n3,8,1\n",[],[]
1,1,219048,Message-ID: <30008704.1075852472248.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n2,13,1\n3,10,1\n4,10,1\n","[[4, 10, 1]]",[10]
2,1,173960,Message-ID: <27385216.1075846164025.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n3,1,1\n3,6,1\n",[],[]
3,1,174397,Message-ID: <104094.1075846176879.JavaMail.eva...,"1,1,2\n2,1,2\n2,2,2\n2,4,2\n3,1,2\n",[],[]
4,1,115139,Message-ID: <23486926.1075842966554.JavaMail.e...,"1,1,2\n2,1,1\n2,6,2\n3,1,1\n3,6,2\n",[],[]
5,1,232926,Message-ID: <31251032.1075853199944.JavaMail.e...,"1,1,2\n3,6,2\n3,12,2\n",[],[]
6,1,120790,Message-ID: <2573675.1075843395513.JavaMail.ev...,"1,1,1\n2,2,1\n3,2,1\n",[],[]
7,1,174208,Message-ID: <7780541.1075846171179.JavaMail.ev...,"1,1,2\n2,1,2\n2,2,2\n3,1,2\n",[],[]
8,1,125706,Message-ID: <9573777.1075863621029.JavaMail.ev...,"1,1,2\n2,11,2\n3,6,2\n4,3,2\n4,9,2\n","[[4, 3, 2], [4, 9, 2]]","[3, 9]"
9,1,232795,Message-ID: <27422646.1075853196172.JavaMail.e...,"1,1,2\n1,8,2\n2,2,2\n3,7,2\n3,8,2\n4,10,2\n","[[4, 10, 2]]",[10]


In [46]:
combined_df['emotion_cnt'] = combined_df.emotion_label.map(len)

In [47]:
combined_df.head(10)

Unnamed: 0,foldername,file,email,label,emotion_label,emotion_only,emotion_cnt
0,1,229024,Message-ID: <22676486.1075853122206.JavaMail.e...,"1,1,1\n3,8,1\n",[],[],0
1,1,219048,Message-ID: <30008704.1075852472248.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n2,13,1\n3,10,1\n4,10,1\n","[[4, 10, 1]]",[10],1
2,1,173960,Message-ID: <27385216.1075846164025.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n3,1,1\n3,6,1\n",[],[],0
3,1,174397,Message-ID: <104094.1075846176879.JavaMail.eva...,"1,1,2\n2,1,2\n2,2,2\n2,4,2\n3,1,2\n",[],[],0
4,1,115139,Message-ID: <23486926.1075842966554.JavaMail.e...,"1,1,2\n2,1,1\n2,6,2\n3,1,1\n3,6,2\n",[],[],0
5,1,232926,Message-ID: <31251032.1075853199944.JavaMail.e...,"1,1,2\n3,6,2\n3,12,2\n",[],[],0
6,1,120790,Message-ID: <2573675.1075843395513.JavaMail.ev...,"1,1,1\n2,2,1\n3,2,1\n",[],[],0
7,1,174208,Message-ID: <7780541.1075846171179.JavaMail.ev...,"1,1,2\n2,1,2\n2,2,2\n3,1,2\n",[],[],0
8,1,125706,Message-ID: <9573777.1075863621029.JavaMail.ev...,"1,1,2\n2,11,2\n3,6,2\n4,3,2\n4,9,2\n","[[4, 3, 2], [4, 9, 2]]","[3, 9]",2
9,1,232795,Message-ID: <27422646.1075853196172.JavaMail.e...,"1,1,2\n1,8,2\n2,2,2\n3,7,2\n3,8,2\n4,10,2\n","[[4, 10, 2]]",[10],1


In [52]:
combined_df.emotion_cnt.value_counts()

0    1376
1     253
2      46
3       6
4       2
5       1
Name: emotion_cnt, dtype: int64

### Most of the emails are neutral: only 308 of 1,684 have one or more emotion labels
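
The figure in the heading can be checked directly from the `value_counts()` output. The short sketch below hard-codes those counts (the dictionary name `emotion_cnt_counts` is illustrative) and computes the number and share of emails with at least one emotion label.

```python
# Counts copied from the value_counts() output above:
# number of emotion labels per email -> number of emails.
emotion_cnt_counts = {0: 1376, 1: 253, 2: 46, 3: 6, 4: 2, 5: 1}

total_emails = sum(emotion_cnt_counts.values())
labeled_emails = total_emails - emotion_cnt_counts[0]
share = labeled_emails / total_emails

print(total_emails, labeled_emails, round(share, 3))
```

On the live frame, the same share could be obtained with `(combined_df.emotion_cnt > 0).mean()`.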

In [61]:
emotion_dict = defaultdict(int)
for i in combined_df.emotion_only:
    if i != []:
        for j in i:
            emotion_dict[int(j)] += 1

In [62]:
emotion_dict

defaultdict(int,
            {1: 12,
             2: 20,
             3: 22,
             4: 20,
             5: 13,
             6: 21,
             7: 13,
             8: 7,
             9: 18,
             10: 128,
             11: 28,
             12: 38,
             13: 8,
             14: 3,
             15: 3,
             16: 10,
             17: 2,
             18: 1,
             19: 9})

### Emotion labels
4.1 jubilation  
4.2 hope / anticipation  
4.3 humor  
4.4 camaraderie  
4.5 admiration  
4.6 gratitude  
4.7 friendship / affection  
4.8 sympathy / support  
4.9 sarcasm    
4.10 secrecy / confidentiality  
4.11 worry / anxiety  
4.12 concern  
4.13 competitiveness / aggressiveness  
4.14 triumph / gloating  
4.15 pride  
4.16 anger / agitation  
4.17 sadness / despair  
4.18 shame  
4.19 dislike / scorn  
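
To read the counts without looking up each code, the list above can be turned into a lookup table. The sketch below hard-codes both the names and the counts printed earlier; `EMOTION_NAMES`, `emotion_counts` and `named_counts` are illustrative names, not part of the original notebook.

```python
# Names copied from the "Emotion labels" list above (codes 4.1 to 4.19).
EMOTION_NAMES = {
    1: 'jubilation', 2: 'hope / anticipation', 3: 'humor', 4: 'camaraderie',
    5: 'admiration', 6: 'gratitude', 7: 'friendship / affection',
    8: 'sympathy / support', 9: 'sarcasm', 10: 'secrecy / confidentiality',
    11: 'worry / anxiety', 12: 'concern',
    13: 'competitiveness / aggressiveness', 14: 'triumph / gloating',
    15: 'pride', 16: 'anger / agitation', 17: 'sadness / despair',
    18: 'shame', 19: 'dislike / scorn',
}

# Counts copied from the emotion_dict output above.
emotion_counts = {
    1: 12, 2: 20, 3: 22, 4: 20, 5: 13, 6: 21, 7: 13, 8: 7, 9: 18, 10: 128,
    11: 28, 12: 38, 13: 8, 14: 3, 15: 3, 16: 10, 17: 2, 18: 1, 19: 9,
}

# Pair each count with its name and sort from most to least frequent.
named_counts = sorted(
    ((EMOTION_NAMES[code], n) for code, n in emotion_counts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, n in named_counts[:3]:
    print(name, n)
```

In the notebook itself, `emotion_counts` would simply be the `emotion_dict` built above.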

## Comment:
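Strongly negative emotions are rare: anger / agitation (10), dislike / scorn (9), sadness / despair (2), and shame (1) account for 22 of the 376 emotion labels, while secrecy / confidentiality alone accounts for 128. Milder worry / anxiety (28) and concern (38) are more common. This is understandable, as professionals tend to keep work email professional.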

## Try loading and saving data

In [43]:
combined_df.to_csv('/data/SuperMod/enron_with_categories/CombinedData.csv')

In [44]:
loadagain = pd.read_csv('/data/SuperMod/enron_with_categories/CombinedData.csv')

In [45]:
loadagain

Unnamed: 0.1,Unnamed: 0,foldername,file,email,label,emotion_label,emotion_only
0,0,1,229024,Message-ID: <22676486.1075853122206.JavaMail.e...,"1,1,1\n3,8,1\n",[],[]
1,1,1,219048,Message-ID: <30008704.1075852472248.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n2,13,1\n3,10,1\n4,10,1\n","[['4', '10', '1']]",['10']
2,2,1,173960,Message-ID: <27385216.1075846164025.JavaMail.e...,"1,1,1\n2,1,1\n2,2,1\n3,1,1\n3,6,1\n",[],[]
3,3,1,174397,Message-ID: <104094.1075846176879.JavaMail.eva...,"1,1,2\n2,1,2\n2,2,2\n2,4,2\n3,1,2\n",[],[]
4,4,1,115139,Message-ID: <23486926.1075842966554.JavaMail.e...,"1,1,2\n2,1,1\n2,6,2\n3,1,1\n3,6,2\n",[],[]
5,5,1,232926,Message-ID: <31251032.1075853199944.JavaMail.e...,"1,1,2\n3,6,2\n3,12,2\n",[],[]
6,6,1,120790,Message-ID: <2573675.1075843395513.JavaMail.ev...,"1,1,1\n2,2,1\n3,2,1\n",[],[]
7,7,1,174208,Message-ID: <7780541.1075846171179.JavaMail.ev...,"1,1,2\n2,1,2\n2,2,2\n3,1,2\n",[],[]
8,8,1,125706,Message-ID: <9573777.1075863621029.JavaMail.ev...,"1,1,2\n2,11,2\n3,6,2\n4,3,2\n4,9,2\n","[['4', '3', '2'], ['4', '9', '2']]","['3', '9']"
9,9,1,232795,Message-ID: <27422646.1075853196172.JavaMail.e...,"1,1,2\n1,8,2\n2,2,2\n3,7,2\n3,8,2\n4,10,2\n","[['4', '10', '2']]",['10']
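
Two things stand out in the reloaded frame: the saved index comes back as an extra `Unnamed: 0` column, and the list columns come back as strings such as `"['3', '9']"`. One possible way to avoid both is sketched below on a small made-up frame, using an in-memory buffer instead of the real CSV path. `index=False` skips writing the index, and `ast.literal_eval` passed through `converters` turns the list strings back into lists. The names `demo_df`, `buffer` and `restored` are illustrative.

```python
import ast
import io

import pandas as pd

# Small stand-in for combined_df with one list-valued column.
demo_df = pd.DataFrame({
    'file': ['219048', '125706'],
    'emotion_only': [['10'], ['3', '9']],
})

# Write without the index so no "Unnamed: 0" column appears on reload.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

# Keep file ids as strings and parse the list column back into Python lists.
restored = pd.read_csv(
    buffer,
    dtype={'file': str},
    converters={'emotion_only': ast.literal_eval},
)
print(restored.emotion_only[1])
```

For larger or nested data, a format that stores Python objects natively (for example `DataFrame.to_pickle`) may be simpler than parsing strings on load.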
