In [3]:
import pandas as pd
import networkx as nx

We will be using the networkx package: http://networkx.readthedocs.io/en/networkx-1.11/index.html

### Load the Enron dataset into pandas
Background on the Enron company: https://en.wikipedia.org/wiki/Enron. You can find the data [here](https://drive.google.com/file/d/1R8TW3CUN5UM_EtQT923EtyCmICPX2NZ8/view?usp=sharing).

# Data cleaning and preparation

In [4]:
# use "=" as the separator so the email text is not confused with the column separator
df_enron = pd.read_csv("data/enron.csv", sep="=")
df_enron.head()

Unnamed: 0.1,Unnamed: 0,from,to,date,text,day_of_week
0,1,shelley.corman@enron.com,"jean.adams@enron.com, william.aldinger@enron.c...",20 Nov 2001 15,Please Send to ETS Gas Logistics Staff (Exclud...,Tue
1,5,frank.ermis@enron.com,mike.grigsby@enron.com,14 Feb 2001 02,no,Wed
2,6,joan.veselack@enron.com,"jwhited@columbiaenergygroup.com, chris.germany...",22 Mar 2000 08,Per my voicemail message to John.I talked to S...,Wed
3,7,sharon.crawford@enron.com,"chris.gaffney@enron.com, mark.powell@enron.com...",27 Oct 2000 04,Attached is Stikeman Elliott's derivatives upd...,Fri
4,9,chris.germany@enron.com,"victor.lamadrid@enron.com, edward.terry@enron....",25 Sep 2000 05,FOM New Power numbers.---------------------- F...,Mon


In [70]:
df_enron['day_of_week'] = df_enron['day_of_week'].str.strip()

In [72]:
df_enron[df_enron['day_of_week']=='Sat']

Unnamed: 0.1,Unnamed: 0,from,to,date,text,day_of_week,link
395,590,chris.germany@enron.com,andrea.ring@enron.com,10 Mar 2001 07,"I created deal 665150, TP2 buy from Gulf 1 at ...",Sat,chris.germany - andrea.ring
411,607,pete.davis@enron.com,pete.davis@enron.com,21 Apr 2001 10,Start Date: 4/21/01; HourAhead hour: 18; No a...,Sat,pete.davis - pete.davis
423,626,jeffrey.shankman@enron.com,jennifer.burns@enron.com,9 Dec 2000 06,put on calendar---------------------- Forwarde...,Sat,jeffrey.shankman - jennifer.burns
524,791,janel.guerrero@enron.com,"james.steffes@enron.com, richard.shapiro@enron...",31 Mar 2001 05,"I gave Marchris my comments, but if you've got...",Sat,janel.guerrero - james.steffes - richard.shapiro
555,842,peggy.hedstrom@enron.com,sally.beck@enron.com,5 Aug 2000 02,I wanted to update you on a few items for the ...,Sat,peggy.hedstrom - sally.beck
...,...,...,...,...,...,...,...
33504,48941,david.delainey@enron.com,colleen.sullivan@enron.com,28 Oct 2000 05,"Colleen, thanks for the update - as another pa...",Sat,david.delainey - colleen.sullivan
33756,49305,ernie@enron.com,sara.shackleton@enron.com,17 Feb 2001 23,You are scheduled to attend: Harassment Avoida...,Sat,ernie - sara.shackleton
33938,49579,pete.davis@enron.com,pete.davis@enron.com,27 Oct 2001 12,Start Date: 10/27/01; HourAhead hour: 15; No ...,Sat,pete.davis - pete.davis
33947,49593,jeffrey.shankman@enron.com,harry.arora@enron.com,9 Dec 2000 06,Thanks for your help \n\t\n\t\n\tFrom: Harr...,Sat,jeffrey.shankman - harry.arora


## Task
Keep only the part of each email address before the `@` (the domain is not important for the analysis) and add a `link` column that concatenates the sender and the receiver.

In [41]:
def get_email_name(x):
    """Return the part before '@' of each address in a comma-separated list, joined by ' - '."""
    # drop stray newline/tab sequences left over from multi-line recipient lists
    names = [email.split('@')[0].replace('\n\t', '') for email in x.split(', ')]
    return ' - '.join(names)

In [80]:
df_enron['link'] = df_enron['from'].apply(get_email_name) + ' - ' + df_enron['to'].apply(get_email_name)
# df_enron[df_enron['link'] == 'a..bibi - sonya.johnson - dl-dji']
df_enron

Unnamed: 0.1,Unnamed: 0,from,to,date,text,day_of_week,link
0,1,shelley.corman@enron.com,"jean.adams@enron.com, william.aldinger@enron.c...",20 Nov 2001 15,Please Send to ETS Gas Logistics Staff (Exclud...,Tue,shelley.corman - jean.adams - william.aldinger...
1,5,frank.ermis@enron.com,mike.grigsby@enron.com,14 Feb 2001 02,no,Wed,frank.ermis - mike.grigsby
2,6,joan.veselack@enron.com,"jwhited@columbiaenergygroup.com, chris.germany...",22 Mar 2000 08,Per my voicemail message to John.I talked to S...,Wed,joan.veselack - jwhited - chris.germany - scot...
3,7,sharon.crawford@enron.com,"chris.gaffney@enron.com, mark.powell@enron.com...",27 Oct 2000 04,Attached is Stikeman Elliott's derivatives upd...,Fri,sharon.crawford - chris.gaffney - mark.powell ...
4,9,chris.germany@enron.com,"victor.lamadrid@enron.com, edward.terry@enron....",25 Sep 2000 05,FOM New Power numbers.---------------------- F...,Mon,chris.germany - victor.lamadrid - edward.terry...
...,...,...,...,...,...,...,...
34019,49691,david.delainey@enron.com,robert.virgo@enron.com,9 Feb 2001 10,"Bob, please give me an update on our progress ...",Fri,david.delainey - robert.virgo
34020,49692,james.centilli@enron.com,tracy.geaccone@enron.com,17 Oct 2001 07,"The project can not be justified, if you have ...",Wed,james.centilli - tracy.geaccone
34021,49693,charles.weldon@enron.com,"bryan.hull@enron.com, a..martin@enron.com, joe...",12 Sep 2001 10,"-----Original Message-----From: \tSchlueter, ...",Wed,charles.weldon - bryan.hull - a..martin - joe....
34022,49694,vince.kaminski@enron.com,toni.graham@enron.com,9 Oct 2000 04,"Toni,FYI.Vince---------------------- Forwarded...",Mon,vince.kaminski - toni.graham


## Task
Group the dataset at the link level (variables `from` and `to`) and create the following features for each link (this is the input for the networkx package):

+ count of emails
+ count of emails sent over the weekend
+ average length of text

These features will be used as edge weights.

In [59]:
df_enron.groupby(['link']).size().reset_index(name='email_counts')

Unnamed: 0,link,email_counts
0,'todd'.delahoussaye - susan.bailey,1
1,a..bibi - sonya.johnson - dl-dji,1
2,a..connor - a..connor - r..brackett - stephani...,1
3,a..garcia - sara.shackleton,2
4,a..gomez - louise.kitchen,1
...,...,...
18546,zimin.lu - vince.kaminski,2
18547,zimin.lu - vince.kaminski - stinson.gibner,3
18548,zimin.lu - zhiyong.wei,1
18549,zimin.lu - zhiyong.wei - zhiyun.yang,1


In [79]:
df_enron.groupby('link')['day_of_week'].apply(lambda x: (x.isin(['Sat', 'Sun'])).sum()).reset_index(name='count')

Unnamed: 0,link,count
0,'todd'.delahoussaye - susan.bailey,0
1,a..bibi - sonya.johnson - dl-dji,0
2,a..connor - a..connor - r..brackett - stephani...,0
3,a..garcia - sara.shackleton,0
4,a..gomez - louise.kitchen,0
...,...,...
18546,zimin.lu - vince.kaminski,0
18547,zimin.lu - vince.kaminski - stinson.gibner,0
18548,zimin.lu - zhiyong.wei,0
18549,zimin.lu - zhiyong.wei - zhiyun.yang,0
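The two cells above cover the email count and the weekend count; the average text length is not shown. A minimal sketch of computing all three features in one pass with named aggregation follows. The toy rows are made up, and the output column names `email_count`, `weekend_count`, and `avg_text_len` are assumptions, not names used by the notebook.

```python
import pandas as pd

# Toy stand-in for df_enron (hypothetical rows, same column names as the notebook).
df_toy = pd.DataFrame({
    "link": ["a - b", "a - b", "b - c"],
    "day_of_week": ["Sat", "Mon", "Sun"],
    "text": ["hello", "hi", "see you"],
})

# Helper columns keep the aggregation itself simple.
df_toy["is_weekend"] = df_toy["day_of_week"].isin(["Sat", "Sun"])
df_toy["text_len"] = df_toy["text"].str.len()

# One row per link with the three features (output names are assumptions).
link_features = (
    df_toy.groupby("link")
    .agg(
        email_count=("text_len", "size"),
        weekend_count=("is_weekend", "sum"),
        avg_text_len=("text_len", "mean"),
    )
    .reset_index()
)
print(link_features)
```

The same pattern should carry over to `df_enron`, provided `text` contains strings in every row.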


## Social network analysis (SNA)

## Task
Create a directed graph with the networkx package. Use the networkx function `from_pandas_edgelist` with the count of emails, the count of weekend emails, and the average text length as edge attributes (parameter `edge_attr`).
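One possible shape for this step is sketched below on a made-up edge table. It assumes one row per sender-recipient pair, so emails with several recipients would first need to be split into separate rows. The feature column names are hypothetical, and `from_pandas_edgelist` is only available in newer networkx releases (2.x and later), not in the 1.11 version linked at the top.

```python
import networkx as nx
import pandas as pd

# Hypothetical per-edge feature table: one row per (sender, recipient) pair.
edges = pd.DataFrame({
    "from": ["a", "a", "b"],
    "to": ["b", "c", "c"],
    "email_count": [2, 1, 3],
    "weekend_count": [1, 0, 0],
    "avg_text_len": [3.5, 10.0, 7.0],
})

# Build a directed graph; each listed column becomes an edge attribute.
G = nx.from_pandas_edgelist(
    edges,
    source="from",
    target="to",
    edge_attr=["email_count", "weekend_count", "avg_text_len"],
    create_using=nx.DiGraph(),
)

# Edge attributes are stored per edge and can be read back by name.
print(G["a"]["b"])
```

Passing `create_using=nx.DiGraph()` is what keeps the direction of each email; without it the function builds an undirected graph.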


## Basic analysis: graph properties
Find the number of nodes and edges, the average degree, and the number of connected components.
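A minimal sketch of these measurements on a toy directed graph (the nodes and edges are made up). For a directed graph, networkx distinguishes weakly and strongly connected components, so both are shown; which one the task intends is left open here.

```python
import networkx as nx

# Toy directed graph: a 3-cycle plus a separate pair of nodes.
G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")])

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()

# Average total degree (in-degree + out-degree), i.e. 2 * edges / nodes.
avg_degree = sum(deg for _, deg in G.degree()) / n_nodes

# Weak components ignore edge direction; strong components respect it.
n_weak = nx.number_weakly_connected_components(G)
n_strong = nx.number_strongly_connected_components(G)

print(n_nodes, n_edges, avg_degree, n_weak, n_strong)
```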

### Degree distribution: find the people who wrote directly to the largest number of people
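In a directed graph where an edge points from sender to recipient, the number of distinct people someone wrote to is that node's out-degree. A small sketch on made-up data:

```python
import networkx as nx

# Toy graph: an edge u -> v means "u wrote to v".
G = nx.DiGraph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])

# Sort (node, out-degree) pairs from highest to lowest and keep the top 10.
top_senders = sorted(G.out_degree(), key=lambda pair: pair[1], reverse=True)[:10]

print(top_senders)
```

Using `G.in_degree()` instead would rank people by how many distinct senders wrote to them.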

## Node centralities - find the top 10 people for each case

#### Find the people who could reach all other people the fastest

#### Find the most important people: those through whom the most information flows

#### Find the people who are best connected to other well-connected people
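One common reading of these three questions is closeness, betweenness, and eigenvector centrality, in that order. The sketch below uses a made-up toy graph, and `top_n` is a hypothetical helper. Note that networkx computes closeness on a directed graph from incoming distances, so the graph is reversed to measure how quickly a person can reach others.

```python
import networkx as nx

def top_n(scores, n=10):
    """Return the n (node, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy graph: "hub" exchanges emails with everyone, and a also writes to b.
G = nx.DiGraph([
    ("hub", "a"), ("a", "hub"),
    ("hub", "b"), ("b", "hub"),
    ("hub", "c"), ("c", "hub"),
    ("a", "b"),
])

# Reverse the graph so closeness is based on outgoing paths.
closeness = nx.closeness_centrality(G.reverse())
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)

for name, scores in [("closeness", closeness),
                     ("betweenness", betweenness),
                     ("eigenvector", eigenvector)]:
    print(name, top_n(scores, 3))
```

On the full email graph, `eigenvector_centrality` may need a larger `max_iter` or may fail to converge; that behaviour depends on the graph structure.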


## Communities

#### Cluster the users into communities using the function `k_clique_communities`
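A minimal sketch on a made-up graph. Clique finding in networkx works on undirected graphs, so the directed graph is converted first; the choice `k = 3` is an assumption for illustration only.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy directed graph: two triangles joined by the single edge c -> d.
G = nx.DiGraph([
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("f", "d"),
])

# Drop edge directions, since clique percolation needs an undirected graph.
U = G.to_undirected()

# Each community is a set of nodes covered by adjacent 3-cliques.
communities = [sorted(c) for c in k_clique_communities(U, 3)]

print(communities)
```

Nodes that belong to no clique of size `k` are left out of every community, so a larger `k` gives fewer and tighter groups.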

#### Compute the clustering coefficient for each node: nodes with low clustering coefficients mark the places where the network usually falls apart
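A small sketch of the computation on a made-up graph. It uses the undirected version of the graph for simplicity; that is an assumption here, not a requirement stated by the notebook.

```python
import networkx as nx

# Toy graph: a triangle a-b-c with one extra node d hanging off c.
G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
U = G.to_undirected()

# Fraction of each node's neighbour pairs that are themselves connected.
coefficients = nx.clustering(U)

# Nodes sorted from the lowest coefficient (weakest local cohesion) upwards.
weakest_first = sorted(coefficients.items(), key=lambda kv: kv[1])

print(weakest_first)
print(nx.average_clustering(U))
```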

## (Stretch) 
### Visualizations

#### Visualize the network
Drawing the graph with all nodes might be slow on our computers. Feel free to experiment with a smaller sample of the nodes.
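One way to get a smaller sample is to keep only the highest-degree nodes and draw the induced subgraph. The sketch below assumes matplotlib is installed for the drawing part; `sample_top_degree` and `draw_sample` are hypothetical helper names, and the toy graph is made up.

```python
import networkx as nx

def sample_top_degree(G, n):
    """Return a copy of the subgraph induced by the n highest-degree nodes."""
    top_nodes = sorted(G.nodes, key=lambda node: G.degree[node], reverse=True)[:n]
    return G.subgraph(top_nodes).copy()

def draw_sample(G, n=50, seed=0):
    """Draw the n highest-degree nodes with a spring layout (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so sampling works without it
    sub = sample_top_degree(G, n)
    pos = nx.spring_layout(sub, seed=seed)
    nx.draw_networkx(sub, pos=pos, node_size=50, with_labels=False)
    plt.axis("off")
    plt.show()

# Toy graph to show the sampling step.
G = nx.DiGraph([("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b"), ("d", "e")])
sub = sample_top_degree(G, 3)
print(sorted(sub.nodes), sub.number_of_edges())
```

Calling `draw_sample(G, n=50)` on the full email graph would then render only the 50 best-connected people; a random sample of nodes is another option, though it tends to produce a sparser picture.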