<h1> Advanced Learning for Text and Graph Data </h1>
<b> Université Paris-Saclay - Master M2 Data Science - February/March 2017</b> <br>
<i> Students : Peter Martigny & Mehdi Miah </i> <br>
# First part  : create new structures and clean the data

## Open the data 

In [1]:
#import some libraries
import pandas as pd
import numpy as np
import random
import time
import matplotlib.pyplot as plt
%matplotlib inline

#access to the data
path_to_data = "../data/"

# path where we will store the results and clean datasets
path_to_results = "../results/"

In [2]:
# We have to give the types because otherwise, 'mid' may be an int64

#training set
training_set = pd.read_csv(path_to_data + 'training_set.csv')
training_info = pd.read_csv(path_to_data + 'training_info.csv', 
                            dtype = {'mid': object, 'date': object, 'body': object, 'recipients' : object})

#test
test_set = pd.read_csv(path_to_data + 'test_set.csv')
test_info = pd.read_csv(path_to_data + 'test_info.csv',
                        dtype = {'mid': object, 'date': object, 'body': object})
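
To illustrate why the types are given explicitly, here is a minimal sketch on a small in-memory CSV (the two rows are made up for the example, they are not read from the data files). Without `dtype`, pandas infers an integer column for `mid`; with `dtype = {'mid': object}` the ids stay strings, which is convenient because the `mids` column of `training_set` stores ids as space-separated text.

```python
import io
import pandas as pd

# toy CSV content, only for illustration
csv_text = "mid,date\n60,2000-07-25 08:14:00\n66,2000-08-03 02:56:00\n"

# type inferred by pandas: 'mid' is read as integers
inferred = pd.read_csv(io.StringIO(csv_text))

# type forced by the caller: 'mid' is kept as strings
forced = pd.read_csv(io.StringIO(csv_text), dtype={'mid': object})

print(inferred['mid'].dtype, forced['mid'].dtype)
print(forced['mid'].tolist())
```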

In [3]:
#Let's print all the shapes
#training
print('Shape of training_set : %.0f x %.0f' %(training_set.shape))
print('Shape of training_info : %.0f x %.0f' %(training_info.shape))

#test
print('Shape of test_set : %.0f x %.0f' %(test_set.shape))
print('Shape of test_info : %.0f x %.0f' %(test_info.shape))


Shape of training_set : 125 x 2
Shape of training_info : 43613 x 4
Shape of test_set : 125 x 2
Shape of test_info : 2362 x 3


Let's visualize some data : 

In [4]:
#visualize the first rows : each row corresponds to a sender in the train dataset
training_set.head()

Unnamed: 0,sender,mids
0,karen.buckley@enron.com,158713 158697 200301 158679 278595 298162 2002...
1,amr.ibrahim@enron.com,215241 3437 215640 3506 191790 3517 3520 3562 ...
2,andrea.ring@enron.com,270705 270706 270707 270708 270709 270710 2707...
3,sylvia.hu@enron.com,111444 111422 183084 111412 111347 110883 1105...
4,phillip.platter@enron.com,327074 327384 327385 264443 274124 274125 2741...


In [5]:
#visualise the first rows : each row corresponds to a mail in the train dataset
training_info.head()

Unnamed: 0,mid,date,body,recipients
0,60,2000-07-25 08:14:00,Legal has been assessing the risks of doing bl...,robert.badeer@enron.com murray.o neil@enron.co...
1,66,2000-08-03 02:56:00,Attached is a spreadsheet to estimate export f...,kim.ward@enron.com robert.badeer@enron.com mur...
2,74,2000-08-15 05:37:00,Kevin/Bob: Here is a quick rundown on the cons...,robert.badeer@enron.com john.massey@enron.com ...
3,80,2000-08-20 14:12:00,check this out and let everyone know what s up...,robert.badeer@enron.com jeff.richter@enron.com
4,83,2000-08-22 08:17:00,Further to your letter to us (addressed to Mr....,pgillman@schiffhardin.com kamarlantes@calpx.co...


In [6]:
#visualize the first rows : each row corresponds to a sender in the test dataset
test_set.head()

Unnamed: 0,sender,mids
0,karen.buckley@enron.com,298389 332383 298390 284071 366982 81773 81791...
1,amr.ibrahim@enron.com,48260 48465 50344 48268 50330 48237 189979 189...
2,andrea.ring@enron.com,366364 271168 271172 271167 271189
3,sylvia.hu@enron.com,134931 134856 233549 233517 134895 233584 3736...
4,phillip.platter@enron.com,274220 274225 274215 274223 274214 274207 2742...


In [7]:
#visualise the first rows : each row corresponds to a mail in the test dataset
test_info.head()

Unnamed: 0,mid,date,body
0,1577,2001-11-19 06:59:51,Note: Stocks of heating oil are very high for...
1,1750,2002-03-05 08:46:57,"Kevin Hyatt and I are going for ""sghetti"" at S..."
2,1916,2002-02-13 14:17:39,This was forwarded to me and it is funny. - Wi...
3,2094,2002-01-22 11:33:56,I will be in to and happy to assist too. I ma...
4,2205,2002-01-11 07:12:19,Thanks. I needed a morning chuckle.


We have two datasets each for training and test:
- $\texttt{training_set}$ and $\texttt{test_set}$ list, for each sender, the ids of all the mails written by this employee;
- $\texttt{training_info}$ and $\texttt{test_info}$ describe each mail: the date, the content and the list of recipients (only in the training dataset).

## Clean the data

### Clean the date

In [8]:
all_dates = list(training_info.date) + list(test_info.date)

In [9]:
all_years = set([date[:4] for date in all_dates])
all_years

{'0001', '0002', '1998', '1999', '2000', '2001', '2002'}

In [10]:
min(all_dates)

'0001-08-26 22:16:36'

Some dates are clearly corrupted: the year reads '0001' or '0002' instead of '2001' or '2002'.

In [11]:
def clean_date(date):
    '''
    Clean the date by replacing a leading '0', if any, by '2'
    
    == Input ====
    date         (string) date under the yyyy-mm-dd HH:MM:SS format
    == Output ====
    date         (string) date under the yyyy-mm-dd HH:MM:SS format
    '''
    if date[:1] == '0': #starts with '0xxx'
        date = '2' + date[1:]
    return date   

In [52]:
# Some examples
date = '0001-08-26 22:16:36'
print('Previous date : ', date, '\t Cleaned date : ', clean_date(date))

date = '2001-04-14 14:24:42'
print('Previous date : ', date, '\t Cleaned date : ', clean_date(date))

Previous date :  0001-08-26 22:16:36 	 Cleaned date :  2001-08-26 22:16:36
Previous date :  2001-04-14 14:24:42 	 Cleaned date :  2001-04-14 14:24:42


In [13]:
training_info['date'] = training_info.date.apply(clean_date)
test_info['date'] = test_info.date.apply(clean_date)

In [14]:
#sanity check
min(training_info.date)

'1998-12-21 05:29:00'
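
As a possible alternative to `apply(clean_date)`, the same correction can be sketched with vectorized pandas string methods. The two dates below are the ones used in the examples above; the variable names are only illustrative. Once cleaned, the strings can also be parsed into timestamps, assuming every date follows the `yyyy-mm-dd HH:MM:SS` format.

```python
import pandas as pd

# toy dates, taken from the examples above
dates = pd.Series(['0001-08-26 22:16:36', '2001-04-14 14:24:42'])

# replace a leading '0' by '2'; other dates are left untouched
cleaned = dates.str.replace(r'^0', '2', regex=True)

# once cleaned, the strings can be parsed as timestamps
parsed = pd.to_datetime(cleaned, format='%Y-%m-%d %H:%M:%S')

print(cleaned.tolist())
print(parsed.dt.year.tolist())
```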

### Replace `\t` symbols

In [22]:
training_info['body'][67]

'gngr713-853-7751----- Forwarded by Ginger Dernehl/NA/Enron on 03/14/2001 10:45 AM -----\tKathleen Sullivan\t03/13/2001 10:23 PM\t\t \t\t To: Ginger Dernehl/NA/Enron@Enron\t\t cc: \t\t Subject: New York Regulatory SummaryGinger,Attached is a memo that Howard Fromer, Sarah Novosel and I recently drafted summarizing the regulatory environment in New York. Could you please distribute it to the Government Affairs group.  Thanks,Kathleen'

One can notice that the mail contains tab characters (`\t`). These should be removed so that they do not pollute the text fed to the models.

In [23]:
def replace_t(text):
    return(text.replace('\t', ' '))

In [31]:
# Some examples

#a text which was dirty
print('When the text is cleaned : \n%s\n' %replace_t(training_info['body'][67]))

#another one, which was clean
print('When the text was not dirty : \n%s' %training_info['body'][1])
print('After removing \\t : \n%s' %replace_t(training_info['body'][1]))

When the text is cleaned : 
gngr713-853-7751----- Forwarded by Ginger Dernehl/NA/Enron on 03/14/2001 10:45 AM ----- Kathleen Sullivan 03/13/2001 10:23 PM      To: Ginger Dernehl/NA/Enron@Enron   cc:    Subject: New York Regulatory SummaryGinger,Attached is a memo that Howard Fromer, Sarah Novosel and I recently drafted summarizing the regulatory environment in New York. Could you please distribute it to the Government Affairs group.  Thanks,Kathleen

When the text was not dirty : 
Attached is a spreadsheet to estimate export fees.
After removing \t : 
Attached is a spreadsheet to estimate export fees.


In [32]:
training_info['body_rt'] = training_info.body.apply(replace_t)
test_info['body_rt'] = test_info.body.apply(replace_t)
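
`replace_t` turns each tab into one space, so a run of tabs becomes a run of spaces, as in the cleaned mail above. The `\w+` tokenizer applied later ignores whitespace, so this is harmless for the rest of the notebook. If a compact string is wanted anyway, a minimal sketch could collapse every whitespace run into a single space; `collapse_whitespace` is a hypothetical helper, not part of the original code.

```python
import re

def collapse_whitespace(text):
    # hypothetical helper: any run of spaces/tabs/newlines becomes one space
    return re.sub(r'\s+', ' ', text).strip()

# fragment inspired by the forwarded mail shown above
sample = 'PM\t\t \t\t To: Ginger\t\t cc: \t\t Subject: Summary'
print(collapse_whitespace(sample))
```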

### Remove the forwarded part

In [12]:
cpt_forward = 0
for row in range(training_info.shape[0]):
    cpt_forward += 'Forwarded by' in training_info['body_rt'][row]
    
print('%.0f mails are forwarded amongst %.0f (%.2f%%).' %(cpt_forward, training_info.shape[0], cpt_forward/training_info.shape[0]*100))    

8351 mails are forwarded amongst 43613 (19.15%).
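
The row-by-row loop above can also be written with a vectorized string test. This is only a sketch on a made-up three-row dataframe (`toy` is not the real `training_info`), using `regex=False` so that `'Forwarded by'` is matched as a plain substring.

```python
import pandas as pd

# toy data standing in for training_info
toy = pd.DataFrame({'body_rt': ['----- Forwarded by A ----- hello',
                                'hello',
                                'FYI ----- Forwarded by B -----']})

# boolean Series: True when the mail contains the marker
is_forwarded = toy['body_rt'].str.contains('Forwarded by', regex=False)
n_forwarded = int(is_forwarded.sum())

print('%d mails are forwarded amongst %d (%.2f%%).'
      % (n_forwarded, len(toy), 100 * is_forwarded.mean()))
```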


Now, it is time to remove the metadata.

In [33]:
def remove_forward(text):
    '''
    Remove forwarding/reply headers: keep the text written before each
    header block and everything after the last 'Subject:'
    '''
    if 'Subject:' not in text:
        return text
    else: #there is at least one 'Subject:' header
        new_text = ''
        last_part = text.split('Subject:')[-1]
        for part in text.split('Subject:'):
            #keep only what precedes the header block
            if 'Forwarded by' in part:
                new_text += part.split('Forwarded by')[0]
            elif 'From:' in part:
                new_text += part.split('From:')[0]
        return new_text + last_part
            

In [35]:
print('After removing metadata : \n%s' % remove_forward(training_info['body'][67]))

After removing metadata : 
gngr713-853-7751-----  New York Regulatory SummaryGinger,Attached is a memo that Howard Fromer, Sarah Novosel and I recently drafted summarizing the regulatory environment in New York. Could you please distribute it to the Government Affairs group.  Thanks,Kathleen


In [36]:
training_info['body_rf'] = training_info.body_rt.apply(remove_forward)
test_info['body_rf'] = test_info.body_rt.apply(remove_forward)
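
To make the behaviour of `remove_forward` easier to follow, here is the same logic run on a short made-up message. The function is copied in condensed form so that the block runs on its own; the message is invented for the illustration.

```python
def remove_forward(text):
    # condensed copy of the function defined above
    if 'Subject:' not in text:
        return text
    parts = text.split('Subject:')
    new_text = ''
    for part in parts:
        if 'Forwarded by' in part:
            new_text += part.split('Forwarded by')[0]
        elif 'From:' in part:
            new_text += part.split('From:')[0]
    return new_text + parts[-1]

# invented message with one forwarding header
toy = ('See below. ----- Forwarded by A on 01/01/2001 ----- '
       'To: B Subject: Budget Please review.')
cleaned = remove_forward(toy)
print(cleaned)
```

The header between `Forwarded by` and `Subject:` disappears, while the dashes written before `Forwarded by` and the subject line itself are kept. A message without any `Subject:` is returned unchanged.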

In [37]:
training_info.head()

Unnamed: 0,mid,date,body,recipients,body_rt,body_rf
0,60,2000-07-25 08:14:00,Legal has been assessing the risks of doing bl...,robert.badeer@enron.com murray.o neil@enron.co...,Legal has been assessing the risks of doing bl...,Legal has been assessing the risks of doing bl...
1,66,2000-08-03 02:56:00,Attached is a spreadsheet to estimate export f...,kim.ward@enron.com robert.badeer@enron.com mur...,Attached is a spreadsheet to estimate export f...,Attached is a spreadsheet to estimate export f...
2,74,2000-08-15 05:37:00,Kevin/Bob: Here is a quick rundown on the cons...,robert.badeer@enron.com john.massey@enron.com ...,Kevin/Bob: Here is a quick rundown on the cons...,Kevin/Bob: Here is a quick rundown on the cons...
3,80,2000-08-20 14:12:00,check this out and let everyone know what s up...,robert.badeer@enron.com jeff.richter@enron.com,check this out and let everyone know what s up...,check this out and let everyone know what s up...
4,83,2000-08-22 08:17:00,Further to your letter to us (addressed to Mr....,pgillman@schiffhardin.com kamarlantes@calpx.co...,Further to your letter to us (addressed to Mr....,Further to your letter to us (addressed to Mr....


In [38]:
#Sanity check
cpt_forward = 0
for row in range(training_info.shape[0]):
    cpt_forward += 'Forwarded by' in training_info['body_rf'][row]

print('%.0f mails are forwarded amongst %.0f (%.2f%%).' %(cpt_forward, training_info.shape[0], cpt_forward/training_info.shape[0]*100))    

124 mails are forwarded amongst 43613 (0.28%).


Nearly `20`% of the mails contained forwarding metadata; after cleaning, only about `0.3`% still do.

### Text mining to simplify the data

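Pre-processing is an important part of text analysis. Here, the text is lowercased, tokenized (which drops the punctuation) and stripped of English stopwords; stemming is implemented but left disabled.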

In [39]:
#tokenizer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') #to remove punctuations

#stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

#stopwords
from nltk.corpus import stopwords
all_stopwords = set(stopwords.words("english")) #a set makes the membership test faster

In [40]:
def clean_body(body):
    lower = body.lower()
    token = tokenizer.tokenize(lower)
    #stemmed = stem_tokens(token)
    del_sw = [word for word in token if word not in all_stopwords]
    return ' '.join(del_sw)

In [41]:
#An example
text = 'I have a dream that one day this nation will rise up and live out the true meaning of its creed: \
We hold these truths to be self-evident, that all men are created equal. \
I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners \
will be able to sit down together at the table of brotherhood.'

print(clean_body(text))

dream one day nation rise live true meaning creed hold truths self evident men created equal dream one day red hills georgia sons former slaves sons former slave owners able sit together table brotherhood
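
The three steps of `clean_body` can be sketched without NLTK, which may help to see what each one does. This is an approximation under stated assumptions: `re.findall(r'\w+', ...)` plays the role of the `RegexpTokenizer`, and `toy_stopwords` is a tiny hypothetical list, not the NLTK English stopword list.

```python
import re

# small hypothetical stopword list, NOT the NLTK one
toy_stopwords = {'we', 'these', 'to', 'be', 'that', 'all', 'are'}

def clean_body_sketch(body):
    # 1) lowercase, 2) keep runs of word characters, 3) drop stopwords
    tokens = re.findall(r'\w+', body.lower())
    return ' '.join(word for word in tokens if word not in toy_stopwords)

sentence = 'We hold these truths to be self-evident, that all men are created equal.'
print(clean_body_sketch(sentence))
```

Note that the hyphen in "self-evident" is treated as punctuation, so the word is split into two tokens, as in the output above.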


In [42]:
#Apply
training_info['clean_body'] = training_info.body_rf.apply(clean_body)
test_info['clean_body'] = test_info.body_rf.apply(clean_body)

In [43]:
training_info.head()

Unnamed: 0,mid,date,body,recipients,body_rt,body_rf,clean_body
0,60,2000-07-25 08:14:00,Legal has been assessing the risks of doing bl...,robert.badeer@enron.com murray.o neil@enron.co...,Legal has been assessing the risks of doing bl...,Legal has been assessing the risks of doing bl...,legal assessing risks block forward trades fin...
1,66,2000-08-03 02:56:00,Attached is a spreadsheet to estimate export f...,kim.ward@enron.com robert.badeer@enron.com mur...,Attached is a spreadsheet to estimate export f...,Attached is a spreadsheet to estimate export f...,attached spreadsheet estimate export fees
2,74,2000-08-15 05:37:00,Kevin/Bob: Here is a quick rundown on the cons...,robert.badeer@enron.com john.massey@enron.com ...,Kevin/Bob: Here is a quick rundown on the cons...,Kevin/Bob: Here is a quick rundown on the cons...,kevin bob quick rundown consultants sam wehn t...
3,80,2000-08-20 14:12:00,check this out and let everyone know what s up...,robert.badeer@enron.com jeff.richter@enron.com,check this out and let everyone know what s up...,check this out and let everyone know what s up...,check let everyone know
4,83,2000-08-22 08:17:00,Further to your letter to us (addressed to Mr....,pgillman@schiffhardin.com kamarlantes@calpx.co...,Further to your letter to us (addressed to Mr....,Further to your letter to us (addressed to Mr....,letter us addressed mr tim belden dated august...


## Create handy structures

All the code is in the file handy_structures.py

In [44]:
from handy_structures import * #our own functions

### 1 - Structure giving all email ids for each individual

In [45]:
#Get all mails sent by everybody
emails_ids_per_sender_training = get_mids_per_sender(training_set)
emails_ids_per_sender_test = get_mids_per_sender(test_set)

In [46]:
emails_ids_per_sender_training['shona.wilson@enron.com'][:10]

['375912',
 '375913',
 '369345',
 '375914',
 '375915',
 '375916',
 '375917',
 '369358',
 '369938',
 '369993']
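
The implementation of `get_mids_per_sender` lives in handy_structures.py and is not shown here. A minimal sketch of what such a function might look like, assuming the `sender` and `mids` columns displayed earlier (the name `get_mids_per_sender_sketch` and the toy rows are hypothetical):

```python
import pandas as pd

def get_mids_per_sender_sketch(dataset):
    # map each sender to the list of mail ids (kept as strings)
    return {sender: mids.split()
            for sender, mids in zip(dataset['sender'], dataset['mids'])}

# toy data with the same columns as training_set
toy_set = pd.DataFrame({'sender': ['a@enron.com', 'b@enron.com'],
                        'mids': ['10 11 12', '20']})
mids_per_sender = get_mids_per_sender_sketch(toy_set)
print(mids_per_sender['a@enron.com'])
```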

### 2 - Dataframe containing all data

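The information about each mail is split across two datasets: the `_set` file gives the sender and the `_info` file gives the content. It is more convenient to merge them into a single dataframe with one row per mail, for training and for test. Cross-validation and prediction can then be done with less risk of mismatching senders and mails.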

In [47]:
# Let's make a dataframe with all info for both training and test
    # number of rows  = #mails
    # columns : mid + date + body + sender + (recipients)
    
#create a dataframe which contains the senders of each mail (from training_set and test_set)
training_df = transform_dataset(training_info, training_set)
test_df = transform_dataset(test_info, test_set)


In [48]:
training_df.head()

Unnamed: 0,mid,sender,date,body,recipients,body_rt,body_rf,clean_body
0,9716,michelle.cash@enron.com,1998-12-21 05:29:00,"Brent,Attached is a form indemnification agree...",brent.hendry@enron.com mark.e.taylor@enron.com,"Brent,Attached is a form indemnification agree...","Brent,Attached is a form indemnification agree...",brent attached form indemnification agreement ...
1,7830,christian.yoder@enron.com,1999-03-02 07:30:00,"As you are already aware, the West desk is ma...",elizabeth.sager@enron.com mark.e.taylor@enron....,"As you are already aware, the West desk is ma...","As you are already aware, the West desk is ma...",already aware west desk making efforts introdu...
2,90523,larry.f.campbell@enron.com,1999-05-03 09:44:00,Go for it!George Robinson04/27/99 04:14 PMTo: ...,butch.russell@enron.com george.robinson@enron.com,Go for it!George Robinson04/27/99 04:14 PMTo: ...,NNG Gomez Water DisposalI began discussions r...,nng gomez water disposali began discussions re...
3,89618,larry.f.campbell@enron.com,1999-05-05 05:27:00,Just a short message to apprise everyone of th...,william.kendrick@enron.com rick.cates@enron.co...,Just a short message to apprise everyone of th...,Just a short message to apprise everyone of th...,short message apprise everyone cleaning result...
4,89475,larry.f.campbell@enron.com,1999-05-05 05:27:00,Just a short message to apprise everyone of th...,william.kendrick@enron.com rick.cates@enron.co...,Just a short message to apprise everyone of th...,Just a short message to apprise everyone of th...,short message apprise everyone cleaning result...


In [49]:
test_df.head()

Unnamed: 0,mid,sender,date,body,body_rt,body_rf,clean_body
0,284098,jonathan.mckay@enron.com,2001-11-02 05:25:29,"How is everyone.....mother, child.........fath...","How is everyone.....mother, child.........fath...","How is everyone.....mother, child.........fath...",everyone mother child father hope everything w...
1,272008,dutch.quigley@enron.com,2001-11-02 05:34:55,-----Original Message-----From: \tWesner-Soon...,-----Original Message-----From: Wesner-Soong...,-----Original Message----- FW: Contracts that...,original message fw contracts need xpitted out...
2,49273,james.d.steffes@enron.com,2001-11-02 05:57:55,Janine -Ok for you to cover the whole country....,Janine -Ok for you to cover the whole country....,Janine -Ok for you to cover the whole country....,janine ok cover whole country forgotten discus...
3,71901,kim.ward@enron.com,2001-11-02 06:10:47,when?,when?,when?,
4,82354,barry.tycholiz@enron.com,2001-11-02 06:17:44,WOW.... I am positive that your beautiful wife...,WOW.... I am positive that your beautiful wife...,WOW.... I am positive that your beautiful wife...,wow positive beautiful wife sign hauling three...
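
`transform_dataset` is also defined in handy_structures.py and not reproduced in this notebook. The sketch below shows one way such a merge could be done, under the assumptions that `mid` is stored as a string in both tables and that rows are sorted by date (as the outputs above suggest). The function name and the toy data are hypothetical.

```python
import pandas as pd

def transform_dataset_sketch(info, dataset):
    # one (mid, sender) pair per mail, from the space-separated 'mids'
    pairs = [(mid, sender)
             for sender, mids in zip(dataset['sender'], dataset['mids'])
             for mid in mids.split()]
    senders = pd.DataFrame(pairs, columns=['mid', 'sender'])
    # attach the sender to each mail, then sort chronologically
    merged = info.merge(senders, on='mid', how='left')
    return merged.sort_values('date').reset_index(drop=True)

# toy data with the same columns as the *_info and *_set files
toy_info = pd.DataFrame({'mid': ['1', '2', '3'],
                         'date': ['2001-01-02', '2000-05-05', '2001-03-03'],
                         'body': ['a', 'b', 'c']})
toy_set = pd.DataFrame({'sender': ['x@enron.com', 'y@enron.com'],
                        'mids': ['1 3', '2']})
toy_df = transform_dataset_sketch(toy_info, toy_set)
print(toy_df)
```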


In [28]:
#Save
training_df.to_csv(path_to_results + 'training_df.csv', index = False)
test_df.to_csv(path_to_results + 'test_df.csv', index = False)

### 3 - Address books containing all recipients and number of emails sent per individual

In [50]:
address_books = get_address_books(training_info, training_set)


In [51]:
#An example
address_books['sara.shackleton@enron.com'][:10]

[('mark.e.taylor@enron.com', 416),
 ('susan.bailey@enron.com', 331),
 ('tana.jones@enron.com', 325),
 ('brent.hendry@enron.com', 285),
 ('kaye.ellis@enron.com', 267),
 ('stephanie.panus@enron.com', 255),
 ('carol.clair@enron.com', 254),
 ('marie.heard@enron.com', 246),
 ('samantha.boyd@enron.com', 236),
 ('tanya.rohauer@enron.com', 194)]
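
The output above suggests that an address book is a list of `(recipient, count)` pairs sorted by decreasing count. The real `get_address_books` is in handy_structures.py; the following is only a sketch of how such a structure might be built with `collections.Counter`. The filter on `'@'` is an assumption, added because some recipient strings contain stray spaces (for example `murray.o neil@enron.com`).

```python
from collections import Counter
import pandas as pd

def get_address_books_sketch(info, dataset):
    # recipients string of each mail, indexed by mail id
    recipients_of = dict(zip(info['mid'], info['recipients']))
    books = {}
    for sender, mids in zip(dataset['sender'], dataset['mids']):
        counts = Counter()
        for mid in mids.split():
            # assumption: keep only tokens that look like addresses
            counts.update(r for r in recipients_of.get(mid, '').split()
                          if '@' in r)
        # most_common() sorts by decreasing number of mails
        books[sender] = counts.most_common()
    return books

# toy data with the same columns as training_info and training_set
toy_info = pd.DataFrame({'mid': ['1', '2'],
                         'recipients': ['a@enron.com b@enron.com',
                                        'a@enron.com']})
toy_set = pd.DataFrame({'sender': ['s@enron.com'], 'mids': ['1 2']})
toy_books = get_address_books_sketch(toy_info, toy_set)
print(toy_books['s@enron.com'])
```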