<h1> Advanced Learning for Text and Graph Data </h1>
<b> Université Paris-Saclay - Master M2 Data Science - February/March 2017</b> <br>
<i> Students : Peter Martigny & Mehdi Miah </i> <br>
# First part: create new structures

## Open the data 

In [1]:
# import some libraries
import os
import pandas as pd
import numpy as np
import random
import time
import matplotlib.pyplot as plt
%matplotlib inline

# path to the raw data (os.path.join keeps the separators portable across operating systems)
path_to_data = os.path.join("..", "data", "")

# path where we will store the results and the cleaned datasets
path_to_results = os.path.join("..", "results", "")

In [2]:
# We force the column types: otherwise pandas may infer 'mid' as int64,
# whereas the mids listed in training_set / test_set are handled as strings

#training set
training_set = pd.read_csv(path_to_data + 'training_set.csv')
training_info = pd.read_csv(path_to_data + 'training_info.csv', 
                            dtype = {'mid': object, 'date': object, 'body': object, 'recipients' : object})

#test
test_set = pd.read_csv(path_to_data + 'test_set.csv')
test_info = pd.read_csv(path_to_data + 'test_info.csv',
                        dtype = {'mid': object, 'date': object, 'body': object, 'recipients' : object})
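To illustrate why the `dtype` argument matters in the cell above, here is a minimal sketch on a made-up two-row CSV (the rows and the variable names `csv_text`, `inferred` and `forced` are illustrative assumptions, not part of the project). Without a dtype, pandas parses a purely numeric `mid` column as integers; with `dtype={'mid': object}` the ids stay strings, which makes them directly comparable with the ids obtained by splitting the `mids` column.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for training_info.csv (made-up rows)
csv_text = (
    "mid,date,body\n"
    "60,2000-07-25 08:14:00,hello\n"
    "66,2000-08-03 02:56:00,world\n"
)

# Default behaviour: pandas infers an integer dtype for 'mid'
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing object keeps every id as a Python string
forced = pd.read_csv(io.StringIO(csv_text), dtype={"mid": object})

print(inferred["mid"].dtype)
print(forced["mid"].tolist())
```

Keeping the ids as strings on both sides avoids silent mismatches when the two tables are joined later.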

Let's look at the first rows of each dataset:

In [3]:
# first rows: each row corresponds to a sender in the training dataset
training_set.head()

Unnamed: 0,sender,mids
0,karen.buckley@enron.com,158713 158697 200301 158679 278595 298162 2002...
1,amr.ibrahim@enron.com,215241 3437 215640 3506 191790 3517 3520 3562 ...
2,andrea.ring@enron.com,270705 270706 270707 270708 270709 270710 2707...
3,sylvia.hu@enron.com,111444 111422 183084 111412 111347 110883 1105...
4,phillip.platter@enron.com,327074 327384 327385 264443 274124 274125 2741...


In [4]:
# first rows: each row corresponds to a mail in the training dataset
training_info.head()

Unnamed: 0,mid,date,body,recipients
0,60,2000-07-25 08:14:00,Legal has been assessing the risks of doing bl...,robert.badeer@enron.com murray.o neil@enron.co...
1,66,2000-08-03 02:56:00,Attached is a spreadsheet to estimate export f...,kim.ward@enron.com robert.badeer@enron.com mur...
2,74,2000-08-15 05:37:00,Kevin/Bob: Here is a quick rundown on the cons...,robert.badeer@enron.com john.massey@enron.com ...
3,80,2000-08-20 14:12:00,check this out and let everyone know what s up...,robert.badeer@enron.com jeff.richter@enron.com
4,83,2000-08-22 08:17:00,Further to your letter to us (addressed to Mr....,pgillman@schiffhardin.com kamarlantes@calpx.co...


In [5]:
# first rows: each row corresponds to a sender in the test dataset
test_set.head()

Unnamed: 0,sender,mids
0,karen.buckley@enron.com,298389 332383 298390 284071 366982 81773 81791...
1,amr.ibrahim@enron.com,48260 48465 50344 48268 50330 48237 189979 189...
2,andrea.ring@enron.com,366364 271168 271172 271167 271189
3,sylvia.hu@enron.com,134931 134856 233549 233517 134895 233584 3736...
4,phillip.platter@enron.com,274220 274225 274215 274223 274214 274207 2742...


In [6]:
# first rows: each row corresponds to a mail in the test dataset
test_info.head()

Unnamed: 0,mid,date,body
0,1577,2001-11-19 06:59:51,Note: Stocks of heating oil are very high for...
1,1750,2002-03-05 08:46:57,"Kevin Hyatt and I are going for ""sghetti"" at S..."
2,1916,2002-02-13 14:17:39,This was forwarded to me and it is funny. - Wi...
3,2094,2002-01-22 11:33:56,I will be in to and happy to assist too. I ma...
4,2205,2002-01-11 07:12:19,Thanks. I needed a morning chuckle.


We have two datasets for each of training and test:
- $\texttt{training_set}$ and $\texttt{test_set}$ list, for each sender, the ids of all the mails they sent;
- $\texttt{training_info}$ and $\texttt{test_info}$ describe each mail: the date, the body and the list of recipients (recipients are only available in the training dataset).

## Create handy structures

The helper functions used below are defined in the file handy_structures.py.

In [7]:
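# our own helper functions (explicit names instead of a wildcard import)
from handy_structures import get_mids_per_sender, transform_dataset, get_address_books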

### 1 - Structure giving all email ids for each individual

In [8]:
# Get the ids of all the mails sent by each sender
emails_ids_per_sender_training = get_mids_per_sender(training_set)
emails_ids_per_sender_test = get_mids_per_sender(test_set)

In [9]:
emails_ids_per_sender_training['shona.wilson@enron.com'][:10]

['375912',
 '375913',
 '369345',
 '375914',
 '375915',
 '375916',
 '375917',
 '369358',
 '369938',
 '369993']
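Since handy_structures.py is not reproduced here, the following is only a minimal sketch of what a helper like `get_mids_per_sender` might do, assuming the `sender` and `mids` columns shown earlier, with `mids` holding space-separated ids. The name `get_mids_per_sender_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def get_mids_per_sender_sketch(dataset):
    """Map each sender to the list of their mail ids, kept as strings.

    Hypothetical stand-in for get_mids_per_sender; assumes the columns
    'sender' and 'mids' (space-separated ids).
    """
    return {
        row.sender: row.mids.split()
        for row in dataset.itertuples(index=False)
    }


# Toy data with made-up addresses and ids
toy_set = pd.DataFrame({
    "sender": ["a@example.com", "b@example.com"],
    "mids": ["10 11 12", "20"],
})

toy_mids = get_mids_per_sender_sketch(toy_set)
print(toy_mids["a@example.com"])
```

A plain dictionary of lists keeps the lookup by sender cheap and matches the shape of the output displayed above.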

### 2 - Dataframe containing all data

The senders and the mail details live in two separate tables, so it is more convenient for analysis to merge them into a single dataframe with one row per mail. Cross-validation and prediction can then work on one table, which reduces the risk of mismatching mails and senders.

In [10]:
# Build one dataframe per split with all the information about each mail:
#   - number of rows = number of mails
#   - columns: mid, sender, date, body (+ recipients for the training set)
# The sender of each mail comes from training_set / test_set
training_df = transform_dataset(training_info, training_set)
test_df = transform_dataset(test_info, test_set)

In [11]:
# Save the merged dataframes
training_df.to_csv(path_to_results + 'training_df.csv', index = False)
test_df.to_csv(path_to_results + 'test_df.csv', index = False)

In [12]:
training_df.head()

Unnamed: 0,mid,sender,date,body,recipients
0,47361,enron_update@concureworkplace.com,0001-08-26 22:16:36,The following reports have been waiting for yo...,kimberly.watson@enron.com
1,47362,enron_update@concureworkplace.com,0001-08-27 22:21:02,The following reports have been waiting for yo...,kimberly.watson@enron.com
2,47363,enron_update@concureworkplace.com,0001-08-28 22:25:35,The following reports have been waiting for yo...,kimberly.watson@enron.com
3,45909,enron_update@concureworkplace.com,0001-09-13 22:24:08,Employee Name: Kimberly WatsonReport Name: E...,kimberly.watson@enron.com
4,82030,enron_update@concureworkplace.com,0001-09-17 09:24:00,The following expense report is ready for appr...,barry.tycholiz@enron.com


In [13]:
test_df.head()

Unnamed: 0,mid,sender,date,body
0,284098,jonathan.mckay@enron.com,2001-11-02 05:25:29,"How is everyone.....mother, child.........fath..."
1,272008,dutch.quigley@enron.com,2001-11-02 05:34:55,-----Original Message-----From: \tWesner-Soon...
2,49273,james.d.steffes@enron.com,2001-11-02 05:57:55,Janine -Ok for you to cover the whole country....
3,71901,kim.ward@enron.com,2001-11-02 06:10:47,when?
4,82354,barry.tycholiz@enron.com,2001-11-02 06:17:44,WOW.... I am positive that your beautiful wife...
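The merge described in this section could be sketched as follows. This is not the code of `transform_dataset` (which lives in handy_structures.py and is not shown); it is a hypothetical `transform_dataset_sketch` that assumes the column names seen above and sorts by date because the displayed rows appear to be in chronological order.

```python
import pandas as pd


def transform_dataset_sketch(info, senders):
    """Return one row per mail with its sender attached.

    Hypothetical stand-in for transform_dataset; assumes 'sender'/'mids'
    in `senders` and a 'mid' column in `info`.
    """
    # One row per (mid, sender): split the space-separated ids, then explode
    pairs = senders.assign(mid=senders["mids"].str.split()).explode("mid")
    pairs = pairs[["mid", "sender"]].assign(mid=lambda d: d["mid"].astype(str))

    # Make sure both join keys are strings before merging
    info = info.assign(mid=info["mid"].astype(str))
    merged = pairs.merge(info, on="mid", how="inner")

    # Sorting by date is an assumption based on the outputs shown above
    return merged.sort_values("date").reset_index(drop=True)


# Toy data with made-up addresses, ids and bodies
toy_set = pd.DataFrame({
    "sender": ["a@example.com", "b@example.com"],
    "mids": ["10 11", "20"],
})
toy_info = pd.DataFrame({
    "mid": ["20", "10", "11"],
    "date": ["2001-01-03 00:00:00", "2001-01-02 00:00:00", "2001-01-01 00:00:00"],
    "body": ["third", "second", "first"],
})

toy_df = transform_dataset_sketch(toy_info, toy_set)
print(toy_df)
```

Casting both `mid` columns to strings before the merge guards against the int64 inference issue mentioned when the CSV files were loaded.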


### 3 - Address books containing all recipients and number of emails sent per individual

In [14]:
address_books = get_address_books(training_info, training_set)

In [15]:
address_books['sally.beck@enron.com'][:10]

[('patti.thompson@enron.com', 171),
 ('greg.piper@enron.com', 111),
 ('beth.apollo@enron.com', 95),
 ('brent.price@enron.com', 79),
 ('leslie.reeves@enron.com', 77),
 ('richard.causey@enron.com', 72),
 ('louise.kitchen@enron.com', 72),
 ('bob.hall@enron.com', 55),
 ('mike.jordan@enron.com', 55),
 ('sheila.glover@enron.com', 55)]
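The address book above pairs each recipient with the number of mails the sender addressed to them, most frequent first. A minimal sketch of how such a structure could be built is given below; `get_address_books_sketch` is a hypothetical stand-in (the real `get_address_books` is in handy_structures.py and is not shown), and keeping only tokens that contain "@" is an assumption made here to skip fragments of malformed addresses.

```python
from collections import Counter

import pandas as pd


def get_address_books_sketch(info, senders):
    """Map each sender to a list of (recipient, count), most frequent first.

    Hypothetical stand-in for get_address_books; assumes 'mid' and
    'recipients' in `info`, and 'sender' and 'mids' in `senders`.
    """
    # Lookup table: mail id (string) -> space-separated recipients
    recipients_of = dict(zip(info["mid"].astype(str), info["recipients"]))

    books = {}
    for row in senders.itertuples(index=False):
        counts = Counter()
        for mid in row.mids.split():
            recipients = recipients_of.get(mid)
            if isinstance(recipients, str):
                # Assumption: a valid address always contains '@'
                counts.update(r for r in recipients.split() if "@" in r)
        books[row.sender] = counts.most_common()
    return books


# Toy data with made-up addresses and ids
toy_set = pd.DataFrame({"sender": ["a@example.com"], "mids": ["1 2 3"]})
toy_info = pd.DataFrame({
    "mid": ["1", "2", "3"],
    "recipients": [
        "x@example.com y@example.com",
        "x@example.com",
        "x@example.com z@example.com",
    ],
})

toy_books = get_address_books_sketch(toy_info, toy_set)
print(toy_books["a@example.com"])
```

`Counter.most_common()` returns the pairs sorted by decreasing count, which matches the ordering of the list displayed for sally.beck@enron.com.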