**Table of contents**<a id='toc0_'></a>    
- [Enron email analysis](#toc1_)    
  - [💡 Do it yourself](#toc1_1_)    
  - [💡 Do it yourself](#toc1_2_)    
- [References/Acknowledgments](#toc2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Enron email analysis](#toc0_)

In [None]:
# Your BFF is back, plus `re` for the regular expressions used below
import pandas as pd
import re

In [None]:
# Read our dataset and get an idea of what it looks like
enron = pd.read_csv('enron.csv')
display(enron.shape)
enron.head()

In [None]:
# What does a message look like?
print(enron.iloc[0]['raw message'])

Each message has a sender (`From:`), a subject (`Subject:`), CC, BCC, the date (`Date:`) and the body (`body:`). We can therefore parse the dataset so that it has one column for each of these pieces of information.

In [None]:
# Get the sender(s) of the message
def get_sender(message):
    # Raw string so that `\w` reaches the regex engine unchanged
    return re.findall(r'From: [\w@.]+ ', message)

In [None]:
# Apply to dataframe
enron['From'] = enron['raw message'].apply(get_sender)
enron

The result is a list, and each match still carries the `From: ` prefix. With capture groups we can keep only the address from the first match:

In [None]:
# Let's do it better: keep only the address (group 2 of the first match)
def get_sender(message):
    return re.findall(r'(From: )([\w@.-]+)( )', message)[0][1]

In [None]:
enron['From'] = enron['raw message'].apply(get_sender)
enron

What if there's no email at all?

In [None]:
def get_sender(message):
    try:
        out = re.findall(r'(From: )([\w@.-]+)( )', message)[0][1]
    except (IndexError, TypeError):
        # No match (IndexError) or a non-string value such as NaN (TypeError)
        out = ''
    return out

In [None]:
enron['From'] = enron['raw message'].apply(get_sender)
enron
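As an alternative to catching the exception, `re.search` returns `None` when nothing matches, which can be checked directly. The following is a minimal sketch, not part of the original lesson: the name `get_sender_safe` and the sample messages are hypothetical, and the pattern stops at the first character that is not part of an address rather than requiring a trailing space.

```python
import re

# Compile once; group 1 captures only the address after "From: "
SENDER_PATTERN = re.compile(r'From: ([\w@.-]+)')

def get_sender_safe(message):
    """Return the first sender address, or '' if there is none."""
    match = SENDER_PATTERN.search(message)
    return match.group(1) if match else ''

# Hypothetical sample messages, for illustration only
with_sender = 'From: alice@example.com\nTo: bob@example.com\nSubject: Hello'
without_sender = 'Subject: Hello\n\nNo headers here'

print(get_sender_safe(with_sender))
print(repr(get_sender_safe(without_sender)))
```

With the DataFrame from this notebook, it could be applied in the same way as before: `enron['raw message'].apply(get_sender_safe)`.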

## <a id='toc1_1_'></a>[💡 Do it yourself](#toc0_)

Following a similar logic, extract the recipient column!

In [None]:
# Your code here

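# Solution (first attempt): grab everything between `To:` and `Subject:`
def get_receiver(message):
    # Note: `.` does not match newlines by default, so this only matches
    # when `To:` and `Subject:` appear on the same line
    to_list = re.findall('To:.*Subject:', message)
    if len(to_list) > 0:
        out = to_list
    else:
        out = ''
    return out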

In [None]:
enron['To'] = enron['raw message'].apply(get_receiver)
enron


In [None]:
# Solution (refined): capture only the first recipient's address
def get_receiver(message):
    to_list = re.findall(r'(To: )([\w@.-]+)([ ,])', message)
    if len(to_list) > 0:
        out = to_list[0][1]
    else:
        out = ''
    return out

In [None]:
enron['To'] = enron['raw message'].apply(get_receiver)
enron
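The solution above keeps only the first recipient. A message can have several, so here is a minimal sketch that collects every address on the `To:` line. The function name `get_all_receivers` and the sample message are hypothetical, and the sketch assumes all recipients sit on a single line; real headers may wrap long recipient lists across lines, which this does not handle.

```python
import re

def get_all_receivers(message):
    """Return every address found on the first line starting with 'To: '."""
    # re.MULTILINE lets ^ and $ anchor at line boundaries,
    # so headers such as 'X-To:' are not picked up by mistake
    match = re.search(r'^To: (.*)$', message, flags=re.MULTILINE)
    if match is None:
        return []
    return re.findall(r'[\w.-]+@[\w.-]+', match.group(1))

# Hypothetical sample message, for illustration only
sample = 'From: a@x.com\nTo: b@x.com, c@y.org\nSubject: hi'
print(get_all_receivers(sample))
```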


In [None]:
print(enron.iloc[3]['raw message'])

Now let's get the date in a column:

In [None]:
# Check raw message again
print(enron.iloc[0]['raw message'])

We see the date is formatted like: {`Day of the week` (3 letters)}, {`Day`} {`Month` (3 letters)} {`Year` (4 digits)} {`Hours`}:{`Minutes`}:{`Seconds`} {`Time zone` (+/- 4 digits)} ({`Timezone name`})

In [None]:
date_pattern = r'Date: \w{3}, \d{1,2} \w{3} \d{4}'
enron['Date'] = enron['raw message'].apply(lambda x: re.findall(date_pattern, x)[0])
enron

In [None]:
# Let's remove the `Date: ` prefix by keeping only the second group
date_pattern = r'(Date: )(\w{3}, \d{1,2} \w{3} \d{4})'
enron['Date'] = enron['raw message'].apply(lambda x: re.findall(date_pattern, x)[0][1])
enron

In [None]:
# Let's also remove the day of the week by keeping only the third group
date_pattern = r'(Date: )(\w{3}, )(\d{1,2} \w{3} \d{4})'
enron['Date'] = enron['raw message'].apply(lambda x: re.findall(date_pattern, x)[0][2])
enron
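The extracted dates are still plain strings. As a possible next step, they can be parsed into real date objects so they can be sorted or compared. This is a minimal sketch using only the standard library; the name `get_date` is hypothetical, and `%b` assumes English month abbreviations (the default C/English locale).

```python
import re
from datetime import datetime

# Group 1 captures "day month year", skipping the prefix and the weekday
DATE_PATTERN = re.compile(r'Date: \w{3}, (\d{1,2} \w{3} \d{4})')

def get_date(message):
    """Return the message date as a datetime, or None if no date is found."""
    match = DATE_PATTERN.search(message)
    if match is None:
        return None
    # %d = day of month, %b = abbreviated month name, %Y = 4-digit year
    return datetime.strptime(match.group(1), '%d %b %Y')

sample = 'Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)'
print(get_date(sample))
```

Returning `None` instead of indexing with `[0]` also avoids an `IndexError` on messages that have no `Date:` header.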

Let's also find potential names by looking for the following pattern: {`First Name`} {`Last Name`}

In [None]:
def names_mentioned_narrow_down(message):
    return re.findall('[A-Z][a-z]+ [A-Z][a-z]+', message)

**Notes:**
- This time we don't use `\w` as we know that names do not have digits (unless you're `X AE A-XII`, formerly known as `X Æ A-12`)
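- We can define ranges of characters to search for, such as `[a-z]`
- We can specify the capitalization of the range we're interested in: `[A-Z]`, `[a-z]`, or `[A-Za-z]` for both (avoid `[A-z]`, which also matches the punctuation characters between `Z` and `a` in ASCII)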
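A short check of how character ranges behave. Ranges follow ASCII order, so `[A-z]` covers more than letters; the sample string below is made up for illustration.

```python
import re

sample = 'a_Z[^'

# [A-z] spans every ASCII character from 'A' (65) to 'z' (122),
# which includes the punctuation [ \ ] ^ _ ` between 'Z' and 'a'
loose = re.findall(r'[A-z]', sample)

# [A-Za-z] lists the two letter ranges explicitly
strict = re.findall(r'[A-Za-z]', sample)

print(loose)   # ['a', '_', 'Z', '[', '^']
print(strict)  # ['a', 'Z']
```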

In [None]:
enron['names_mentioned'] = enron['raw message'].apply(names_mentioned_narrow_down)
enron

## <a id='toc1_2_'></a>[💡 Do it yourself](#toc0_)

Now find the emails mentioned!

In [None]:
# Your code here
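One possible solution, given as a sketch rather than the definitive answer. The function name `emails_mentioned` and the sample text are hypothetical, and the pattern is deliberately simple: it will miss some valid addresses and accept some invalid ones.

```python
import re

def emails_mentioned(message):
    """Return every substring that looks like an email address."""
    # local part, '@', domain, then a final '.' followed by word characters
    return re.findall(r'[\w.-]+@[\w.-]+\.\w+', message)

# Hypothetical sample text, for illustration only
sample = 'Contact john.doe@enron.com or jane@example.org.'
print(emails_mentioned(sample))
```

It can be applied to the DataFrame like the other extractors: `enron['raw message'].apply(emails_mentioned)`.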

We can also extract any phone numbers that appear in our message, as US numbers typically have this pattern: `###-###-####`

In [None]:
def phone_nr_mentioned(message):
    # Three digits, three digits, four digits, separated by hyphens
    return re.findall(r'[0-9]{3}-[0-9]{3}-[0-9]{4}', message)

In [None]:
enron['phone_nr_mentioned'] = enron['raw message'].apply(phone_nr_mentioned)
enron
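A digit pattern on its own can also match inside longer runs of digits, such as order or account numbers. Below is a minimal sketch that adds `\b` word boundaries to guard against that, assuming North American style `###-###-####` numbers. The function name `find_phone_numbers` and the sample numbers are made up.

```python
import re

# \b requires a word boundary on both sides, so the pattern
# does not match in the middle of a longer digit sequence
PHONE_PATTERN = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')

def find_phone_numbers(message):
    """Return all ###-###-#### style numbers in the message."""
    return PHONE_PATTERN.findall(message)

# Hypothetical sample text, for illustration only
sample = 'Call 555-123-4567, ref 1234-567-89012'
print(find_phone_numbers(sample))
```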

# <a id='toc2_'></a>[References/Acknowledgments](#toc0_)

This lesson is adapted from David Henriques, with a few edits.