# Official public work with private tools

The controversy over Hillary Clinton's emails hit the headlines in 2015. The goal of this project is to get a precise view of her network based on the emails publicly available on [Kaggle](https://www.kaggle.com/kaggle/hillary-clinton-emails). Unfortunately, as one might expect, the database is dirty. Our work is divided into the following tasks:

1. Clean the data and recover missing values
2. Enrich the data
3. Go back to step 1 if needed

At this point, drawing a basic graph should be possible.

4. Develop a system to cluster a document into a category
5. Show these categories in the final graph.

As a reminder, the goals of this milestone are as follows:

1. Handle the data in its size.

2. Understand the data (formats, distribution, missing values, correlations, etc.)

3. Consider a way to enrich, filter, transform the data according to your needs.

4. Update our plan in a reasonable way:
    
    4.1. Reflecting our improved knowledge after data acquaintance
    
    4.2. Discuss how the data suits our project needs
    
    4.3. Discuss the methods that we will use, and provide the essential mathematical details
    
5. Show that the plan for analysis is reasonable, and discuss the choices that were considered but finally dropped


## Previously

As discussed in the first Milestone, we expected data cleaning (text processing, handling missing values, etc.) to be a major task of this Milestone. We estimated that it would represent about 60-80% of the overall work. Unsurprisingly, this expectation turned out to be true: the database is dirty, really dirty. Although a lot of preprocessing seems to have already been done on the `rawText` of the `emails` table, its quality is poor. For example, in order to construct Hillary's network we need the exact `receiver` and `sender` of every email, yet we found some obvious mistakes in the correspondence between aliases and `personId`. Below, we explain in more detail the major difficulties we encountered and how we handled them.

As said above, this Milestone focuses essentially on cleaning and preprocessing the `rawText`. Moreover, given the questions we want to answer, we have to clean and process the data for the following features:

1. 'To' and 'From' (who receives and sends emails)

2. 'Date' (the precise date each email was sent)

3. Content of the emails (extracted as well as possible, then preprocessed and cleaned)

These three steps are crucial.

We present the plan.

## Plan

1. Data exploration 

    1.1. How are `aliases` and `personId` related?

    1.2. How clean are `MetadataTo` ('To') and `MetadataFrom` ('From')? Are there missing values?
    
    1.3. What is the behaviour of the content of `rawText`? (is it messy, multi-emails, etc.)
        
2. Data Cleaning and Processing

    2.1. Clean the `aliases` and `personId` relation
    
    2.2. Construct clean `To` and `From` features by processing and NaN filling
    
    2.3. Text cleaning for `rawText` (remove inappropriate lines)
    
    2.4. Text processing (lowercasing, stop-word removal, removal of short sentences, stemming, etc.)
    
    2.5. Feature engineering
           
        2.5.0. Emails time distribution

        2.5.1. Map emails to countries/regions
        
        2.5.2. Word-frequency
        
        2.5.3. Search for themes
        
3. Pre-results

    3.1. Adjacency matrix (Hillary's network) construction
    
    .....
        
4. Milestone 3 (to be done): graph visualization, map visualization


## Data exploration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, date, time
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

pd.options.mode.chained_assignment = None  # default='warn', Mutes warnings when copying a slice from a DataFrame.



In [2]:
data_folder = "../data/"

The data is available in two formats, CSV and SQL, which means that the tables are related to one another. We will see later that the entity-relationship structure is not as clean as it should be.
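
Since the tables are meant to reference each other, a quick referential-integrity check is a cheap first test. The sketch below runs on small made-up frames, not on the real files; the column names `Id` and `PersonId` follow the tables used in this notebook, while the helper `orphan_keys` is hypothetical.

```python
import pandas as pd

def orphan_keys(child, child_col, parent, parent_col):
    """Return the sorted values of child[child_col] with no match in parent[parent_col]."""
    mask = ~child[child_col].isin(parent[parent_col])
    return sorted(child.loc[mask, child_col].unique().tolist())

# Toy data (made up for illustration)
persons_toy = pd.DataFrame({"Id": [1, 2, 3], "Name": ["a", "b", "c"]})
aliases_toy = pd.DataFrame({"Id": [1, 2, 3],
                            "Alias": ["x", "y", "z"],
                            "PersonId": [1, 2, 9]})

# PersonId 9 does not exist in persons_toy
print(orphan_keys(aliases_toy, "PersonId", persons_toy, "Id"))  # [9]
```

An empty list would mean that every foreign key points to an existing row.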

In [3]:
#Data extraction from the csv-files
emails_raw = pd.read_csv(data_folder + 'Emails.csv')
persons = pd.read_csv(data_folder + 'Persons.csv')
aliases = pd.read_csv(data_folder + 'Aliases.csv')
email_receivers = pd.read_csv(data_folder + 'EmailReceivers.csv')

In [4]:
print("emails:", emails_raw.shape)
print("persons:", persons.shape)
print("aliases:", aliases.shape)
print("email receivers:", email_receivers.shape)

emails: (7945, 22)
persons: (513, 2)
aliases: (850, 3)
email receivers: (9306, 3)


Let us first explore what we have. The extraction was organized around a few fields; let us check whether we can rely on them.

In [5]:
persons.dtypes

Id       int64
Name    object
dtype: object

In [6]:
aliases.dtypes

Id           int64
Alias       object
PersonId     int64
dtype: object

## 1.1. How are `aliases` and `personId` related?


In [7]:
aliases.sort_values(by='PersonId')

Unnamed: 0,Id,Alias,PersonId
0,1,111th congress,1
1,2,agna usemb kabul afghanistan,2
2,3,ap,3
3,4,asuncion,4
4,5,alec,5
5,6,dupuy alex,6
6,7,american beverage association,7
7,8,mayock andrew,8
9,10,shapiroa@state.gov,9
8,9,shapiro andrew j,9


In [8]:
aliases.Alias

0                     111th congress
1       agna usemb kabul afghanistan
2                                 ap
3                           asuncion
4                               alec
5                         dupuy alex
6      american beverage association
7                      mayock andrew
8                   shapiro andrew j
9                 shapiroa@state.gov
10                slaughter annmarie
11              slaughter anne marie
12               slaughter annemarie
13              slaughtera@state.gov
14                      lake anthony
15               valenzuela arturo a
16            valenzuelaaa@state.gov
17                        kimoon ban
18                      obama barack
19                         president
20           bam@mikulski.senate.gov
21                      mikulski bam
22           mikulski bam (mikulski)
23          mikulski bam (mitkulski)
24            mikulskibam (mikulski)
25                     betsy.ebeling
26                     ebeling betsy
2

In [9]:
aliases.PersonId.is_unique

False

In [10]:
aliases.Alias.is_unique

True

### Remarks and Issues

In `aliases`, `Alias` is unique while `PersonId` is not: several aliases, often obvious variants of the same name, point to a single person. We have to handle this in order to have an efficient database.

Moreover, `Id` in the `persons` dataframe is supposed to identify each person uniquely, but this is not the case. For instance, `Id` 512 and `Id` 513 both correspond to the same person, and there are several similar examples. We have to create a base that is unique per person, or at least as close to that as possible.
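
One possible way to spot such duplicates, not the method used in this notebook, is to compare names through a canonical key. The sketch below lowercases each name, drops punctuation and sorts the tokens; the helpers `name_key` and `duplicate_groups` and the toy ids are hypothetical.

```python
import re
from collections import defaultdict

def name_key(name):
    """Canonical key: lowercase, drop punctuation, sort the tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))

def duplicate_groups(id_to_name):
    """Group the person ids whose names share the same canonical key."""
    groups = defaultdict(list)
    for pid, name in id_to_name.items():
        groups[name_key(name)].append(pid)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

# Toy example (made-up ids)
toy_persons = {1: "Slaughter, Anne-Marie", 2: "anne marie slaughter", 3: "Lake Anthony"}
print(duplicate_groups(toy_persons))  # [[1, 2]]
```

This only catches reordered or differently punctuated names; a variant such as "annemarie" would need a fuzzier comparison.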

In [11]:
dico_name={}
for i in aliases.PersonId.unique():
    dico_name.update({i:aliases[aliases.PersonId==i].Alias.tolist()})
dico_name

{1: ['111th congress'],
 2: ['agna usemb kabul afghanistan'],
 3: ['ap'],
 4: ['asuncion'],
 5: ['alec'],
 6: ['dupuy alex'],
 7: ['american beverage association'],
 8: ['mayock andrew'],
 9: ['shapiro andrew j', 'shapiroa@state.gov'],
 10: ['slaughter annmarie',
  'slaughter anne marie',
  'slaughter annemarie',
  'slaughtera@state.gov',
  'annemarie slaughter'],
 11: ['lake anthony'],
 12: ['valenzuela arturo a', 'valenzuelaaa@state.gov'],
 13: ['kimoon ban'],
 14: ['obama barack', 'president'],
 15: ['bam@mikulski.senate.gov',
  'mikulski bam',
  'mikulski bam (mikulski)',
  'mikulski bam (mitkulski)',
  'mikulskibam (mikulski)'],
 16: ['betsy.ebeling',
  'ebeling betsy',
  'betsyebeling',
  'betsyebeling1050',
  'betsyebelin'],
 17: ['clinton william j', 'dad'],
 18: ['biography'],
 19: ['klehr bonnie'],
 20: ['brian'],
 21: ['bstrider', 'strider burns', 'burns strider', 'burns strider b6'],
 22: ['capricia marshall',
  'marshall capricia',
  'marshall capricia p',
  'capriciamarsh

Some inconsistencies remain: a few evidently similar aliases have different `PersonId` values (e.g. Jacob Sullivan). We decide to ignore this issue for now and will treat it in more depth in Milestone 3.

For the next tasks, and particularly for building Hillary's network, we first associate each `MetadataTo` and `MetadataFrom` with its `PersonId` by matching on `Alias`. After that, we will focus on the X persons who communicate the most with Hillary and represent the corresponding network.
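
As an illustration of the "X persons who communicate the most" idea, the sketch below counts, from a list of (sender, receiver) id pairs, how often each person exchanges an email with a given person. The function name `top_correspondents` and the toy ids are assumptions made for this example.

```python
from collections import Counter

def top_correspondents(pairs, person_id, k):
    """Return the k persons exchanging the most emails with person_id."""
    counts = Counter()
    for sender, receiver in pairs:
        if sender == person_id and receiver != person_id:
            counts[receiver] += 1
        elif receiver == person_id and sender != person_id:
            counts[sender] += 1
    return counts.most_common(k)

# Toy (sender, receiver) pairs with made-up ids
toy_pairs = [(1, 2), (2, 1), (3, 1), (1, 3), (1, 2), (4, 5)]
print(top_correspondents(toy_pairs, person_id=1, k=2))  # [(2, 3), (3, 2)]
```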

## 1.2. How clean are `MetadataTo` ('To') and `MetadataFrom` ('From')? Are there missing values?

### 2.1. Clean the `aliases` and `personId` relation

### 2.2. Construct clean `To` and `From` features by processing and NaN filling
    

As said above, these two features are problematic because the data is really dirty. Even though the `Aliases` and `Persons` tables are provided, some correspondences are missing. Keeping this in mind for the next Milestone, we focus in this section on handling missing values for these fields. Our goal is then to map each `MetadataTo` to its corresponding `PersonId`.

The different steps are as follows:

1. For missing MetadataTo/SenderPersonId, find them in the rawText

2. Map MetadataTo to its PersonId
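
The first step relies on our `Extractor` module, whose code is not shown here. To give an idea of what such an extraction could look like, here is a minimal sketch that assumes the sender appears on a line starting with `From:` in the raw text. The function `sender_alias_sketch`, the pattern and the toy text are hypothetical and are not the actual implementation.

```python
import re

# Assumed pattern: a line starting with "From:" followed by the alias
FROM_LINE = re.compile(r"^From:[ \t]*(.+)$", re.MULTILINE)

def sender_alias_sketch(raw_text):
    """Return the text after the first 'From:' header, or None if absent."""
    if not isinstance(raw_text, str):  # e.g. NaN for a missing raw text
        return None
    match = FROM_LINE.search(raw_text)
    if match is None:
        return None
    alias = match.group(1).strip()
    return alias or None

# Made-up raw text for illustration
toy_text = "UNCLASSIFIED\nFrom: Doe, Jane <doej@example.org>\nSent: ...\n"
print(sender_alias_sketch(toy_text))  # Doe, Jane <doej@example.org>
```

The returned string would still have to be normalized before being compared with the `Alias` column.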



In [12]:
# We only keep the fields which appear in the itemization above
emails = emails_raw[['Id', 'MetadataSubject', 'SenderPersonId', 'MetadataTo', 'MetadataDateSent', 'MetadataDateReleased', 'MetadataCaseNumber', 'MetadataDocumentClass', 'RawText']]

emails.head()

Unnamed: 0,Id,MetadataSubject,SenderPersonId,MetadataTo,MetadataDateSent,MetadataDateReleased,MetadataCaseNumber,MetadataDocumentClass,RawText
0,1,WOW,87.0,H,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,,H,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,CHRIS STEVENS,32.0,;H,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...
3,4,CAIRO CONDEMNATION - FINAL,32.0,H,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...
4,5,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,80.0,"Abedin, Huma",2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,B6\nUNCLASSIFIED\nU.S. Department of State\nCa...


In [13]:
# These are our own modules; they contain the implementation of the interface we describe in the notebook.
from extractor import Extractor
from process import Process

## Find missing SenderPersonId and map it to PersonId

In [14]:
# SenderPersonId
def fetch_from_alias(alias_raw, content_raw):
    if alias_raw is not None:
        alias_found = Process.alias(alias_raw)
        person_id = aliases[aliases.Alias.str.strip() == alias_found].PersonId.values
        if len(person_id) == 1:
            return person_id[0]
    return np.nan

def fetch_from_pid(personId, content_raw):
    if not np.isnan(personId):
        return personId
    alias_extracted = Extractor.sender_alias(content_raw)
    return fetch_from_alias(alias_extracted, content_raw)

emails['from'] = emails.apply(lambda row: fetch_from_pid(row['SenderPersonId'], row['RawText']), axis=1)
nb_nan = emails.SenderPersonId.isna().sum()
nb_from_recover = nb_nan - emails['from'].isna().sum()

emails.drop('SenderPersonId', axis=1, inplace=True)

print("We manage to recover %d out of %d NaN for the sender_id." % (nb_from_recover, nb_nan))

We manage to recover 11 out of 157 NaN for the sender_id.


### How many NaN can we handle?

In [15]:
# MetadataTo
counter_not_nan = 0
counter_recover = 0
emails['to'] = np.nan

for i in range(emails.shape[0]):
    if not isinstance(emails.MetadataTo[i], str):
        continue
    counter_not_nan += 1
    alias = Process.alias(str(emails.MetadataTo[i]))
    person_id = aliases[aliases.Alias.str.strip() == alias].PersonId.values
    if len(person_id) == 1:
        emails.iat[i, -1] = person_id[0]
        counter_recover += 1
    else:
        print(emails.MetadataTo[i]) # could not match a person

emails.drop('MetadataTo', axis=1, inplace=True)
print('***')
print("We manage to compute %d out of %d for the receiver_id." % (counter_recover, counter_not_nan))

michele.flournoy
Axelrod_D
Terry.Duffy
glantz.
rosemarie.howe ;H
cheryl.mills ;H
rrh.interiors
mh.interiors
H;preines
H;preines
Abedin, Huma; H
Abedin, Huma; H
Ki-moon, Ban
Sullivan, Jake; H
Etats-Unis D'Amerique
Etat-Unis D'Amerique
Duk-soo, Han
Duk-soo, Han
Betsy.Ebeling
***
We manage to compute 7671 out of 7690 for the receiver_id.


The `MetadataTo` field contains many `NaN`s, and some elements (listed above) could not be matched to a person. Not all hope is lost, because we can try to extract this value from the `RawText` field.
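
Several of the unmatched values listed above contain more than one recipient separated by `;`. One possible way to handle them, which is not what the notebook does, is to split the string before looking up each alias. A minimal sketch, where `split_recipients` is a hypothetical helper:

```python
def split_recipients(metadata_to):
    """Split a raw 'To' string on ';' and drop the empty pieces."""
    parts = [part.strip() for part in metadata_to.split(";")]
    return [part for part in parts if part]

# Values taken from the MetadataTo column shown above
for raw in ["Abedin, Huma; H", ";H", "H;preines"]:
    print(repr(raw), "->", split_recipients(raw))
# 'Abedin, Huma; H' -> ['Abedin, Huma', 'H']
# ';H' -> ['H']
# 'H;preines' -> ['H', 'preines']
```

Each piece could then go through the same alias lookup, at the cost of deciding which recipient becomes the `to` and which ones become `cc`.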

## Proceed similarly for MetadataTo

In [16]:
def fetch_alias(to, content_raw):
    # Keep the receiver id when it is already known
    if not np.isnan(to):
        return to
    # Otherwise try to extract the receiver alias from the raw text
    alias_extracted = Extractor.destination_alias(content_raw)
    return fetch_from_alias(alias_extracted, content_raw)

# Count the missing values before filling them
nb_nan = emails.to.isna().sum()
emails['to'] = emails.apply(lambda row: fetch_alias(row['to'], row['RawText']), axis=1)
nb_to_recover = nb_nan - emails.to.isna().sum()

print("We manage to recover %d out of %d NaN for the receiver." % (nb_to_recover, nb_nan))

We manage to recover 30 out of 274 NaN for the receiver.


Before this step, 274 emails had no receiver (7945 emails minus the 7671 receivers matched above). Extracting the receiver from `RawText` recovers only a few of them: 244 emails still have no receiver.

There is one more thing we can do: use the table `email_receivers`. This table does not distinguish between the person who received the email directly and those who received it through the 'cc' option.

Our first guess is that the first row for a given email is always the direct receiver and the others, if any, are the cc. Let's check if this assumption holds.

In [17]:
nb_tot = 0
nb_correct = 0
for i in range(emails.shape[0]):
    if not np.isnan(emails.to[i]):
        pids = email_receivers[email_receivers.EmailId == emails.Id[i]].PersonId.values
        if len(pids) > 1:
            nb_tot += 1
            if pids[0] == emails.to[i]:
                nb_correct += 1

print("%d / %d." % (nb_correct, nb_tot))

795 / 1244.


There are 1244 emails for which we know the receiver AND for which there are at least 2 rows in `email_receivers`. Among these 1244, the first row is the receiver 795 times (about 64%). This number is not huge, but it is not small either. Assuming this ratio also holds for the emails with a missing receiver, adopting this strategy is better than choosing randomly.
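
The comparison with a random choice can be made explicit. The short computation below reuses the two counts printed above; the random baseline assumes that exactly one of the k >= 2 candidate rows is the direct receiver, which is an assumption of this sketch.

```python
nb_correct, nb_tot = 795, 1244  # counts printed by the cell above

ratio = nb_correct / nb_tot
print(f"first-row heuristic: {ratio:.1%}")  # first-row heuristic: 63.9%

# With k >= 2 candidate rows, a uniform random pick is right with
# probability 1/k, hence at most 50%.
random_upper_bound = 1 / 2
print(ratio > random_upper_bound)  # True
```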

## Find receiver as Cc

In [18]:
def fetch_pid_through_email_receiver(eid, previous_to):
    if np.isnan(previous_to):
        pids = email_receivers[email_receivers.EmailId == eid].PersonId.values
        if len(pids) >= 1:
            return pids[0]
    return previous_to

nb_nan_before = emails.to.isna().sum()
emails['to'] = emails.apply(lambda row: fetch_pid_through_email_receiver(row['Id'], row['to']), axis=1)
nb_nan_after = emails.to.isna().sum()
recover = nb_nan_before - nb_nan_after
print("%d out of %d NaN." % (recover, nb_nan_before))

15 out of 244 NaN.


Using the strategy described earlier, we fill in 15 more rows with a value for the field `to`.

Now that we are done with the 'from' and 'to', we can work on the 'cc'. We use the table `email_receivers` and drop the rows whose `EmailId` and `PersonId` combination already appears in the table `emails`.

In [19]:
table_cc = pd.merge(emails, email_receivers, left_on='Id', right_on='EmailId')
table_cc = table_cc[table_cc.to != table_cc.PersonId][['EmailId', 'PersonId']]

In [20]:
print(len(table_cc))
print(len(email_receivers))

1539
9306


As one would expect, we get a subset of the initial set.

Finally, we have our new structure. But this is only the beginning.

In [21]:
emails

Unnamed: 0,Id,MetadataSubject,MetadataDateSent,MetadataDateReleased,MetadataCaseNumber,MetadataDocumentClass,RawText,from,to
0,1,WOW,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...,87.0,80.0
1,2,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...,194.0,80.0
2,3,CHRIS STEVENS,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...,32.0,80.0
3,4,CAIRO CONDEMNATION - FINAL,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...,32.0,80.0
4,5,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,B6\nUNCLASSIFIED\nU.S. Department of State\nCa...,80.0,81.0
5,6,MEET THE RIGHT-WING EXTREMIST BEHIND ANTI-MUSL...,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,B6\nUNCLASSIFIED\nU.S. Department of State\nCa...,80.0,185.0
6,7,"ANTI-MUSLIM FILM DIRECTOR IN HIDING, FOLLOWING...",2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...,32.0,80.0
7,8,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,B6\nUNCLASSIFIED\nU.S. Department of State\nCa...,80.0,81.0
8,9,SECRETARY'S REMARKS,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...,87.0,80.0
9,10,MORE ON LIBYA,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,F-2015-04841,HRC_Email_296,UNCLASSIFIED\nU.S. Department of State\nCase N...,,80.0


## 1.3. What is the behaviour of the content of `rawText`? (is it messy, multi-emails, etc.)

### 2.3. Text cleaning for `rawText` (remove inappropriate lines)

### 2.4. Text processing (lowercasing, stop-word removal, removal of short sentences, stemming, etc.)
    
We remark that the `rawText` fields are not clean at all. It is quite difficult to identify the real content of an email, or to know whether a record is a single email or an exchange of emails.

To ease the tasks of this Milestone, we first assume that one `rawText` is one exchange and that all of its content is used together in the analysis (we do not split an exchange email by email). Second, we need to process the data to get the core of the content. We proceed as follows:

## Text pre-processing

Now that we have extracted the relevant information, it is time to clean the content. Since the database contains fewer than 10,000 emails, we need to preprocess the data effectively. We need to remove the common words and common sentences which appear in almost every email. This is necessary to get better results when running machine learning algorithms on the data.

This could have been enough, but the quality of the database is really poor, so we need to perform more actions. We limit ourselves to:

1. Remove lines which start with a frequent sequence (e.g. "Case No...", "U.S. Department of State", ...)
2. Replace upper case with lower case
3. Remove the email addresses
4. Tokenization
5. Remove punctuation except '.', which separates sentences
6. Remove stop words
7. [Stemming and Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) (reducing words: "car" and "cars" should not be considered as different entities, for example)
8. Ignore a sentence if it contains fewer than 4 words
9. Join the sentences to rebuild the email
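
The steps above can be sketched with the standard library only. This is not the code of our `Process.content` function: the stop-word list and the boilerplate prefixes below are tiny assumed samples, and step 7 (stemming and lemmatization) is left out because it requires an NLP library.

```python
import re

# Assumptions of this sketch: a tiny stop-word list and a few boilerplate prefixes
STOP_WORDS = {"i", "me", "the", "a", "an", "at", "this", "was", "is", "to",
              "of", "and", "for", "because", "these"}
BOILERPLATE_PREFIXES = ("case no", "u.s. department of state", "unclassified")
EMAIL_RE = re.compile(r"\S+@\S+")

def clean_content(raw_text, min_words=4):
    # Steps 1 and 2: lowercase, then drop empty and boilerplate lines
    lines = [line.strip().lower() for line in raw_text.splitlines()]
    lines = [line for line in lines
             if line and not line.startswith(BOILERPLATE_PREFIXES)]
    text = " ".join(lines)
    # Step 3: remove email addresses
    text = EMAIL_RE.sub(" ", text)
    # Step 5: drop punctuation, keeping '.' as the sentence separator
    text = re.sub(r"[^a-z0-9.\s]", " ", text)
    sentences = []
    for sentence in text.split("."):
        # Steps 4 and 6: tokenize on whitespace and remove stop words
        words = [word for word in sentence.split() if word not in STOP_WORDS]
        # Step 8: ignore sentences with fewer than min_words words
        if len(words) >= min_words:
            sentences.append(" ".join(words))
    # Step 9: join the sentences back into one string
    return ". ".join(sentences) + ("." if sentences else "")

sample = ("I like flowers. This was useless, I write completely stupid stuff "
          "because I suck for these exercises")
print(clean_content(sample))
# useless write completely stupid stuff suck exercises.
```

Without step 7 the plural "exercises" is kept as is, whereas a stemmer or lemmatizer would reduce it.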

In [22]:
# You can test the function here
lemmatize = True
sentences = "I like flowers. This was useless, I write completely stupid stuff because I suck for these exercises"
print(Process.content(sentences, not lemmatize))
print("***")
print(Process.content(sentences, lemmatize))

useless write complet stupid stuff suck exercis.
***
useless write completely stupid stuff suck exercise.


In [23]:
# This may take a while...
emails['content'] = emails.RawText.map(Process.content)

In [24]:
# testing results manually
idx = 200
print(emails.content[idx])
print('***')
print(emails.RawText[idx])

thursday januari 5 2012 5 22 pm. h latest intel libyan conflict leader militia. thank alway happi new year. sourc sourc direct access libyan nation transit council well. highest level european govern western intellig secur servic. last week decemb 2011 first week 2012 libya prime minist. abdurrahim el keib presid mustafa abdul jalil engag seri emerg plan. meet attempt deal specif issu threaten stabil new nation. transit council ntc govern. accord extrem sensit sourc speak strict. confid paramount among issu question disarm reward region. militia bore major fight regim muammar al qaddafi well. relat issu find minist senior administr new govern. individu note four occas begin. decemb 23 2011 group angri militiamen came el keib offic demand better. treatment clear messag support role islam law remov former qaddafi. sourc comment opinion sensit sourc el keib genuin concern. situat could spiral control threaten regim. unrest stir general abdel hakim alamin belhaj conserv islamist. stay new 

In [25]:
# now we can drop the raw content
emails = emails.drop('RawText', axis=1)

### 2.5. Feature engineering and data understanding
    
        2.5.0. Emails time distribution
        
        2.5.1. Map emails to countries/regions
        
        2.5.2. Word-frequency
        
        2.5.3. Search for themes
        
The goal of this part is to provide a first step in feature engineering. According to our needs for the Milestone tasks, we will first present an algorithm that maps emails to countries, regions, persons, etc. as well as possible. Second, we provide some basic tools to better understand the subjects discussed, the time distribution, etc. Finally, we provide some ideas for finding recurrent themes.
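
Two of these basic tools can be sketched with the standard library: a word-frequency count over the cleaned contents, and a count of emails per month from ISO 8601 dates such as those in `MetadataDateSent`. The function names and the toy contents are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

def word_frequencies(contents, k):
    """Return the k most frequent words over a list of cleaned contents."""
    counts = Counter()
    for content in contents:
        counts.update(content.replace(".", " ").split())
    return counts.most_common(k)

def emails_per_month(iso_dates):
    """Count emails per 'YYYY-MM' from ISO 8601 date strings."""
    months = Counter()
    for value in iso_dates:
        months[datetime.fromisoformat(value).strftime("%Y-%m")] += 1
    return dict(sorted(months.items()))

# Made-up cleaned contents
toy_contents = ["libya militia council.", "libya council meeting."]
print(word_frequencies(toy_contents, 2))

# Dates in the format of the MetadataDateSent column
toy_dates = ["2012-09-12T04:00:00+00:00",
             "2011-03-03T05:00:00+00:00",
             "2012-09-12T04:00:00+00:00"]
print(emails_per_month(toy_dates))  # {'2011-03': 1, '2012-09': 2}
```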

### 3. Pre-results

    3.1. Adjacency matrix (Hillary's network) construction
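
A minimal sketch of how such an adjacency matrix could be built from the `from` and `to` columns is given below. It works on a plain list of (sender, receiver) pairs; the function `adjacency_matrix` and the toy pairs are assumptions of this example, and pairs with a missing end (`None` or `NaN`) are skipped.

```python
import numpy as np

def _missing(value):
    """True for None and for NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value

def adjacency_matrix(pairs):
    """Build a directed, weighted adjacency matrix from (sender, receiver) pairs.

    Returns the sorted node ids and a matrix A where A[i, j] is the number
    of emails sent from ids[i] to ids[j].
    """
    clean = [(s, r) for s, r in pairs if not _missing(s) and not _missing(r)]
    ids = sorted({node for pair in clean for node in pair})
    index = {node: i for i, node in enumerate(ids)}
    matrix = np.zeros((len(ids), len(ids)), dtype=int)
    for sender, receiver in clean:
        matrix[index[sender], index[receiver]] += 1
    return ids, matrix

# Toy pairs with illustrative ids; the last two have a missing sender
toy_pairs = [(87, 80), (32, 80), (32, 80), (80, 81), (None, 80), (float("nan"), 80)]
ids, A = adjacency_matrix(toy_pairs)
print(ids)  # [32, 80, 81, 87]
print(A)
```

Keeping the matrix directed preserves who wrote to whom; adding it to its transpose would give the undirected version.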