# Multi-table Datasets - ENRON Archive

## 1. Data import

Connect to the file 'assets/datasets/enron.db' using one of these methods:

- sqlite3 python package
- pandas.read_sql
- SQLite Manager Firefox extension

Take a look at the database and query the `sqlite_master` table. How many tables are there in the database?

> Answer:
There are 3 tables:
- MessageBase
- RecipientBase
- EmployeeBase

In [2]:
import sqlite3
conn = sqlite3.connect('../../assets/datasets/enron.db')
cur = conn.cursor()
results = cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
results.fetchall()

[(u'MessageBase',), (u'RecipientBase',), (u'EmployeeBase',)]
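The same listing can also be obtained with `pandas.read_sql`, the second method mentioned above. The sketch below is only an illustration: it runs against a throwaway in-memory database with three empty placeholder tables (an assumption made so the example is self-contained), not against `enron.db` itself.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for enron.db; the single `id` column is a placeholder.
conn = sqlite3.connect(":memory:")
for table in ("MessageBase", "RecipientBase", "EmployeeBase"):
    conn.execute("CREATE TABLE {} (id INTEGER)".format(table))

# sqlite_master holds one row per schema object; keep only the tables.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", con=conn)
print(tables)
print(len(tables), "tables")
```

Pointing `sqlite3.connect` at the real database file should return the three table names as a one-column DataFrame.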

Query the `sqlite_master` table to retrieve the schema of the `EmployeeBase` table.

1. What fields are there?
1. What's the type of each of them?

In [25]:
fields = cur.execute("SELECT sql FROM sqlite_master WHERE type='table' and name ='EmployeeBase';").fetchall()
print(fields[0][0])

fields = cur.execute("SELECT sql FROM sqlite_master WHERE type='table' and name ='MessageBase';").fetchall()
print(fields[0][0])

fields = cur.execute("SELECT sql FROM sqlite_master WHERE type='table' and name ='RecipientBase';").fetchall()
print(fields[0][0])

CREATE TABLE EmployeeBase (
                  [eid] INTEGER,
  [name] TEXT,
  [department] TEXT,
  [longdepartment] TEXT,
  [title] TEXT,
  [gender] TEXT,
  [seniority] TEXT
                  
                  )
CREATE TABLE MessageBase (
    mid INTEGER,
    filename TEXT,
    unix_time INTEGER,
    subject TEXT,
    from_eid INTEGER,
    
    PRIMARY KEY(mid ASC),
    FOREIGN KEY(from_eid) REFERENCES Employee(eid)
)
CREATE TABLE RecipientBase (
    mid INTEGER,
    rno INTEGER,
    to_eid INTEGER,
    
    PRIMARY KEY(mid ASC, rno ASC)
    FOREIGN KEY(mid) REFERENCES Message(mid)
    FOREIGN KEY(to_eid) REFERENCES Employee(eid)
)
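As an alternative to reading the `CREATE TABLE` statements by eye, SQLite's `PRAGMA table_info` returns one row per column with its name and declared type. The sketch below recreates the `EmployeeBase` layout shown above in an in-memory database (an assumption, so the example runs without `enron.db`).

```python
import sqlite3

# In-memory stand-in that mirrors the EmployeeBase schema printed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EmployeeBase ("
    "eid INTEGER, name TEXT, department TEXT, longdepartment TEXT, "
    "title TEXT, gender TEXT, seniority TEXT)"
)

# Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk).
columns = conn.execute("PRAGMA table_info(EmployeeBase)").fetchall()
schema = [(name, col_type) for _cid, name, col_type, _notnull, _default, _pk in columns]
for name, col_type in schema:
    print(name, col_type)
```

Running the same `PRAGMA` against the real connection would answer both questions (field names and types) for any of the three tables.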


1. Print the first 5 rows of EmployeeBase table
1. Print the first 5 rows of MessageBase table
1. Print the first 5 rows of RecipientBase table

**Hint:** use `SELECT` and `LIMIT`.

In [33]:
q = "SELECT * FROM EmployeeBase LIMIT 5"
results = cur.execute(q).fetchall()
for row in results:
    print(row)

(1, u'John Arnold', u'Forestry', u'ENA Gas Financial', u'VP Trading', u'Male', u'Senior')
(2, u'Harry Arora', u'Forestry', u'ENA East Power', u'VP Trading', u'Male', u'Senior')
(3, u'Robert Badeer', u'Forestry', u'ENA West Power', u'Mgr Trading', u'Male', u'Junior')
(4, u'Susan Bailey', u'Legal', u'ENA Legal', u'Specialist Legal', u'Female', u'Junior')
(5, u'Eric Bass', u'Forestry', u'ENA Gas Texas', u'Trader', u'Male', u'Junior')
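The cell above previews only `EmployeeBase`. One possible way to cover all three tables is to loop over the table names, as in this sketch. The in-memory tables and their rows are made up for illustration and carry only a few of the real columns.

```python
import sqlite3

# Toy in-memory tables standing in for enron.db (7 made-up rows each).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EmployeeBase (eid INTEGER, name TEXT)")
conn.execute("CREATE TABLE MessageBase (mid INTEGER, from_eid INTEGER)")
conn.execute("CREATE TABLE RecipientBase (mid INTEGER, rno INTEGER, to_eid INTEGER)")
conn.executemany("INSERT INTO EmployeeBase VALUES (?, ?)",
                 [(i, "employee %d" % i) for i in range(1, 8)])
conn.executemany("INSERT INTO MessageBase VALUES (?, ?)",
                 [(i, 1) for i in range(1, 8)])
conn.executemany("INSERT INTO RecipientBase VALUES (?, ?, ?)",
                 [(i, 1, 2) for i in range(1, 8)])

previews = {}
for table in ("EmployeeBase", "MessageBase", "RecipientBase"):
    # Table names cannot be bound as ? parameters, so they are formatted in
    # from this fixed tuple (never from user input).
    previews[table] = conn.execute("SELECT * FROM {} LIMIT 5".format(table)).fetchall()
    print(table)
    for row in previews[table]:
        print(row)
```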


Import each of the 3 tables into a pandas DataFrame.

In [49]:
import pandas as pd
EmployeeBase = pd.read_sql('SELECT * FROM EmployeeBase;',con = conn)
MessageBase = pd.read_sql('SELECT * FROM MessageBase;',con = conn)
RecipientBase = pd.read_sql('SELECT * FROM RecipientBase;',con = conn)

In [39]:
MessageBase.head()

|   | mid | filename | unix_time | subject | from_eid |
|---|-----|----------|-----------|---------|----------|
| 0 | 1 | taylor-m/sent/11 | 910930020 | Cd$ CME letter | 138 |
| 1 | 2 | taylor-m/sent/17 | 911459940 | Indemnification | 138 |
| 2 | 3 | taylor-m/sent/18 | 911463840 | Re: Indemnification | 138 |
| 3 | 4 | taylor-m/sent/23 | 911874180 | Re: Coral Energy, L.P. | 138 |
| 4 | 5 | taylor-m/sent/27 | 912396120 | Bankruptcy Code revisions | 138 |


## 2. Data Exploration

Use the 3 dataframes to answer the following questions:

1. How many employees are there in the company?
1. How many messages are there in the database?
1. Convert the timestamp column in the messages. When was the oldest message sent? And the newest?
1. Some messages are sent to more than one recipient. Group the messages by message ID (`mid`) and count the number of recipients. Then look at the distribution of recipient counts.
    - How many messages have only one recipient?
    - How many messages have >= 5 recipients?
    - What's the highest number of recipients?
    - Who sent the message with the highest number of recipients?
1. Plot the distribution of recipient counts using Bokeh.

In [64]:
# 1
print(EmployeeBase.shape[0])
# 2
print(MessageBase.shape[0])
# 3
# unix_time is in seconds since the epoch; pd.datetime is no longer available in recent pandas
MessageBase['time'] = pd.to_datetime(MessageBase['unix_time'], unit='s')
print(MessageBase['time'].min())
print(MessageBase['time'].max())
# 4
counts = RecipientBase.groupby('mid')['to_eid'].count().value_counts()
print(counts)

156
21635
1998-11-13 04:07:00
2002-06-21 14:37:34
1     14985
2      2962
3      1435
4       873
5       711
6       180
7       176
8        61
13       57
11       47
12       33
10       29
15       28
9        24
14       11
16        9
21        2
17        2
57        2
22        1
52        1
20        1
55        1
19        1
24        1
18        1
49        1
Name: to_eid, dtype: int64
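The distribution above can be read off for the first sub-questions, but the per-message counts also answer them directly, including who sent the message with the most recipients. The following is a minimal sketch on small made-up DataFrames that reuse the real column names (`mid`, `to_eid`, `from_eid`, `eid`, `name`); the rows themselves are invented.

```python
import pandas as pd

# Toy stand-ins for the real tables (made-up rows, same column names).
EmployeeBase = pd.DataFrame({"eid": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
MessageBase = pd.DataFrame({"mid": [10, 11, 12], "from_eid": [1, 2, 1]})
RecipientBase = pd.DataFrame({
    "mid":    [10, 11, 11, 12, 12, 12, 12, 12],
    "rno":    [1, 1, 2, 1, 2, 3, 4, 5],
    "to_eid": [2, 1, 3, 2, 3, 2, 3, 2],
})

# Number of recipients per message, indexed by mid.
per_message = RecipientBase.groupby("mid")["to_eid"].count()

single = (per_message == 1).sum()      # messages with exactly one recipient
five_plus = (per_message >= 5).sum()   # messages with five or more recipients
largest = per_message.max()            # highest number of recipients

# Look up the sender(s) of the message(s) that reach the largest count.
top_mids = per_message[per_message == largest].index
top = MessageBase[MessageBase["mid"].isin(top_mids)].merge(
    EmployeeBase, left_on="from_eid", right_on="eid")

print(single, five_plus, largest)
print(top[["mid", "name"]])
```

Selecting every message that ties for the maximum, rather than a single `idxmax`, matters here because the distribution above shows more than one message at some of the largest recipient counts.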


In [67]:
# 5
from collections import Counter
counts = Counter(RecipientBase.groupby('mid')['to_eid'].count())


from bokeh.plotting import figure,show,output_notebook
output_notebook()

x = [i[0] for i in counts.most_common()]
y = [i[1] for i in counts.most_common()]
left_border = [val-0.5 for val in x]
right_border = [val+0.5 for val in x]


p= figure(title="Message Recipients",tools='',x_axis_label='# of recipients',y_axis_label='Counts')
p.quad(top=y,left=left_border,right=right_border,bottom=0,line_color='black')
show(p)

# New cell: repeat the plot without the five most frequent recipient counts

x = [i[0] for i in counts.most_common()[5:]]  # recipient counts, skipping the five most frequent
y = [i[1] for i in counts.most_common()[5:]]  # number of messages for each remaining count
left_border = [val-0.5 for val in x]
right_border = [val+0.5 for val in x]


p= figure(title="Message Recipients",tools='',x_axis_label='# of recipients',y_axis_label='Counts')
p.quad(top=y,left=left_border,right=right_border,bottom=0,line_color='black')
show(p)

The second plot drops the five most frequent recipient counts, which rescales the y-axis so the tail of the distribution becomes visible.

## 3. Data Merging

Use the pandas merge function to combine the information in the 3 dataframes to answer the following questions:

1. Are there more male or female employees?
1. How is gender distributed across departments?
1. Who sends more emails, men or women?
1. What's the average number of emails sent by each gender?
1. Are there more Juniors or Seniors?
1. Who sends more emails, Juniors or Seniors?
1. Which department sends the most emails? How does that relate to the number of employees in the department?
1. Who are the top 3 senders of emails (the people who sent the most emails)?

In [71]:
EmployeeBase.gender.value_counts()

Male      113
Female     43
Name: gender, dtype: int64

In [74]:
EmployeeBase.gender.value_counts() / EmployeeBase.gender.count()

Male      0.724359
Female    0.275641
Name: gender, dtype: float64

In [75]:
EmployeeBase.groupby('department')['gender'].value_counts() / EmployeeBase.groupby('department')['gender'].count()

department  gender
Forestry    Male      0.833333
            Female    0.166667
Legal       Female    0.520000
            Male      0.480000
Other       Male      0.718310
            Female    0.281690
Name: gender, dtype: float64

In [76]:
df = pd.merge(EmployeeBase,MessageBase, left_on='eid',right_on='from_eid' )
df.gender.value_counts() /df.gender.count()

Male      0.593529
Female    0.406471
Name: gender, dtype: float64
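The shares above describe sent messages overall. For the average number of emails sent per employee of each gender, one possible approach is to count messages per sender first and then average within each gender. The sketch below uses made-up rows with the real column names, and it assumes that employees with no sent mail should count as zero rather than be dropped.

```python
import pandas as pd

# Toy stand-ins (made-up rows, same column names as the real tables).
EmployeeBase = pd.DataFrame({
    "eid": [1, 2, 3, 4],
    "gender": ["Female", "Male", "Male", "Female"],
})
MessageBase = pd.DataFrame({
    "mid": [10, 11, 12, 13, 14, 15],
    "from_eid": [1, 1, 1, 1, 2, 3],
})

# Messages sent per employee; employees who sent nothing get 0.
sent = MessageBase.groupby("from_eid")["mid"].count()
per_employee = EmployeeBase.assign(
    n_sent=EmployeeBase["eid"].map(sent).fillna(0).astype(int))

# Average number of sent messages within each gender.
avg_sent = per_employee.groupby("gender")["n_sent"].mean()
print(avg_sent)
```

An inner `pd.merge` followed by a group-by would silently exclude employees who never sent a message, which would inflate the averages.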

In [77]:
EmployeeBase.seniority.value_counts()

Junior    82
Senior    74
Name: seniority, dtype: int64

In [79]:
df.seniority.value_counts()

Senior    12439
Junior     9196
Name: seniority, dtype: int64

In [100]:
df.groupby('seniority')['name'].value_counts()[:3]

seniority  name           
Junior     Tana Jones         1379
           Sara Shackleton    1142
           Chris Germany       443
Name: name, dtype: int64
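Note that the cell above groups by seniority first, so the `[:3]` slice only reaches into the first group (`Junior`) and would miss any Senior employee who sent more. A minimal sketch of a ranking across all senders follows. The frame is a made-up stand-in for the merged `df`, with one row per sent message.

```python
import pandas as pd

# Toy stand-in for the merged frame `df` (one row per sent message).
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Bob", "Bob", "Cy", "Dee", "Dee", "Dee", "Dee"],
    "seniority": ["Junior"] * 3 + ["Senior"] * 2 + ["Junior"] + ["Senior"] * 4,
})

# Count messages per sender across the whole frame, then keep the three largest.
top_senders = df["name"].value_counts().head(3)
print(top_senders)
```

If names are not guaranteed to be unique, counting on `eid` and joining the name back afterwards would be the safer variant.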

In [90]:
df.department.value_counts() / df.department.count()

Legal       0.480518
Other       0.316709
Forestry    0.202773
Name: department, dtype: float64
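To relate each department's share of sent email to its size, the message counts can be divided by the headcount. The sketch below shows the idea on invented rows that reuse the department labels from the output above.

```python
import pandas as pd

# Toy stand-ins (made-up rows, same column names as the real tables).
EmployeeBase = pd.DataFrame({
    "eid": [1, 2, 3, 4],
    "department": ["Legal", "Forestry", "Forestry", "Other"],
})
MessageBase = pd.DataFrame({
    "mid": [10, 11, 12, 13, 14],
    "from_eid": [1, 1, 1, 2, 4],
})

df = pd.merge(EmployeeBase, MessageBase, left_on="eid", right_on="from_eid")

sent_by_dept = df["department"].value_counts()            # messages sent per department
headcount = EmployeeBase["department"].value_counts()     # employees per department

# The division aligns on department labels; a department with no sent mail would show NaN.
sent_per_employee = (sent_by_dept / headcount).sort_values(ascending=False)
print(sent_per_employee)
```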

In [68]:
EmployeeBase.head(1)

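|   | eid | name | department | longdepartment | title | gender | seniority |
|---|-----|------|------------|----------------|-------|--------|-----------|
| 0 | 1 | John Arnold | Forestry | ENA Gas Financial | VP Trading | Male | Senior |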


In [69]:
MessageBase.head(1)

|   | mid | filename | unix_time | subject | from_eid | time |
|---|-----|----------|-----------|---------|----------|------|
| 0 | 1 | taylor-m/sent/11 | 910930020 | Cd$ CME letter | 138 | 1998-11-13 04:07:00 |


In [70]:
RecipientBase.head(1)

|   | mid | rno | to_eid |
|---|-----|-----|--------|
| 0 | 1 | 1 | 59 |


Answer the following questions regarding received messages:

- Who is receiving more emails? Men or Women?
- Who is receiving more emails? Juniors or Seniors?
- Which department is receiving the most emails? How does that relate to the number of employees in the department?
- Who are the top 5 receivers of emails? (people who received the most emails)
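For the received-message questions, the join key changes from `from_eid` to `to_eid`: each row of `RecipientBase` is one delivery, so merging it with `EmployeeBase` gives one row per received message. The following is a sketch on made-up rows using the real column names.

```python
import pandas as pd

# Toy stand-ins (made-up rows, same column names as the real tables).
EmployeeBase = pd.DataFrame({
    "eid": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "gender": ["Female", "Male", "Male"],
})
RecipientBase = pd.DataFrame({
    "mid": [10, 11, 11, 12, 12],
    "rno": [1, 1, 2, 1, 2],
    "to_eid": [2, 1, 2, 2, 3],
})

# One row per received message, annotated with the recipient's attributes.
received = pd.merge(RecipientBase, EmployeeBase, left_on="to_eid", right_on="eid")

by_gender = received["gender"].value_counts()           # received messages per gender
top_receivers = received["name"].value_counts().head(5)  # up to five most frequent recipients
print(by_gender)
print(top_receivers)
```

The same `received` frame can be grouped by `seniority` or `department` once those columns are present, mirroring the sent-message analysis.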

Which employees sent the most 'mass' emails?
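The exercise does not define what counts as a 'mass' email. The sketch below assumes a hypothetical cutoff of five or more recipients (`MASS_THRESHOLD`, borrowed from the earlier `>= 5` question) and uses invented rows, so treat it as one possible reading rather than the intended answer.

```python
import pandas as pd

MASS_THRESHOLD = 5  # hypothetical cutoff for a 'mass' email; adjust as needed

# Toy stand-ins (made-up rows, same column names as the real tables).
EmployeeBase = pd.DataFrame({"eid": [1, 2], "name": ["Ann", "Bob"]})
MessageBase = pd.DataFrame({"mid": [10, 11, 12, 13], "from_eid": [1, 1, 2, 2]})
recipient_counts = {10: 5, 11: 6, 12: 1, 13: 5}  # mid -> number of recipients
rows = [(mid, rno, 99) for mid, n in recipient_counts.items() for rno in range(1, n + 1)]
RecipientBase = pd.DataFrame(rows, columns=["mid", "rno", "to_eid"])

# Messages whose recipient count reaches the cutoff.
per_message = RecipientBase.groupby("mid")["to_eid"].count()
mass_mids = per_message[per_message >= MASS_THRESHOLD].index

# Attach sender names and count mass messages per sender.
mass = MessageBase[MessageBase["mid"].isin(mass_mids)].merge(
    EmployeeBase, left_on="from_eid", right_on="eid")
mass_senders = mass["name"].value_counts()
print(mass_senders)
```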

Keep exploring the dataset. Which other questions would you ask?

Work in pairs. Give each other a challenge and try to solve it.