# Stephanie's Working Notebook

## Acquire
`email` library documentation: [link](https://docs.python.org/3/library/email.html)<br>
Data source: [link](https://www.kaggle.com/wcukierski/enron-email-dataset)

In [1]:
# standard python imports
import numpy as np
import pandas as pd

# imports from python emails library
from email.parser import Parser

In [2]:
# reading csv to df
df = pd.read_csv('../emails.csv')

df.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [3]:
# getting size of df
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns of data.')

There are 517401 rows and 2 columns of data.


In [4]:
# looking at the first message to begin the process of parsing out the message text
df.message[0]

"Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\nDate: Mon, 14 May 2001 16:39:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: tim.belden@enron.com\nSubject: \nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Phillip K Allen\nX-To: Tim Belden <Tim Belden/Enron@EnronXGate>\nX-cc: \nX-bcc: \nX-Folder: \\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail\nX-Origin: Allen-P\nX-FileName: pallen (Non-Privileged).pst\n\nHere is our forecast\n\n "

### The message has some key components that we can parse out using Python's `email` library
- `datetime`
- `sender`
- `recipient`
- `subject`
- `message content` 
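
A minimal sketch of pulling these five pieces out of one raw message with the standard-library `email` package. The `parse_components` helper, its dictionary keys, and the shortened sample string (based on the first message above) are illustrative assumptions, not code from the project.

```python
from email.parser import Parser

def parse_components(raw_message):
    """Hypothetical helper: return the key parts of one raw email string."""
    msg = Parser().parsestr(raw_message)
    return {
        'datetime': msg['Date'],       # header value, still a plain string
        'sender': msg['From'],
        'recipient': msg['To'],
        'subject': msg['Subject'],     # empty string when the subject is blank
        'content': msg.get_payload(),  # body text for a non-multipart message
    }

# shortened copy of the first message, keeping only the headers used here
raw = (
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "Subject: \n"
    "\n"
    "Here is our forecast\n"
)
parts = parse_components(raw)
print(parts)
```

Headers that are missing entirely come back as `None`, so a later cleaning step would need to treat `None` and `''` separately.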

In [5]:
# looking to see if pattern matches in other messages
print(df.message[1])
print()
print(df.message[23])
print()
print(df.message[54])

Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Subject: Re:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Traveling to have a business meeting takes the fun out of the trip.  Especially if you have to prepare a presentation.  I would suggest holding the business plan meetings here then take a trip without any formal business meetings.  I would even try and get some honest opinions on whether a trip is even desired or necessary.

As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.  Too often the

### Something else to possibly look at later: cc and bcc recipients, as well as forwarding
Whether there were any, how many, who they were, and how many times a message was forwarded
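
One possible way to explore this later, sketched with the standard library only. It assumes cc and bcc recipients appear in `Cc` and `Bcc` headers when present, and that each forward leaves a dashed "Forwarded by" banner in the body like the ones seen in this dataset. The `forwarding_summary` helper and the sample message are hypothetical.

```python
import re
from email.parser import Parser
from email.utils import getaddresses

def forwarding_summary(raw_message):
    """Hypothetical helper: cc/bcc addresses and a rough forward count."""
    msg = Parser().parsestr(raw_message)
    body = msg.get_payload()
    # get_all returns every matching header, or the default when none exist
    cc = [addr for _, addr in getaddresses(msg.get_all('Cc', [])) if addr]
    bcc = [addr for _, addr in getaddresses(msg.get_all('Bcc', [])) if addr]
    # assumption: one dashed "Forwarded by" banner per forward
    n_forwards = len(re.findall(r'-{2,}\s*Forwarded by', body))
    return {'cc': cc, 'bcc': bcc, 'n_forwards': n_forwards}

# made-up message that imitates the banner format seen in the data
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Cc: c@example.com, d@example.com\n"
    "Subject: Wow\n"
    "\n"
    "---------------------- Forwarded by A on 09/06/2000 \n"
    "10:49 AM ---------------------------\n"
    "\n"
    "---------------------- Forwarded by B on 09/06/2000 09:45 \n"
    "AM ---------------------------\n"
    "body text\n"
)
summary = forwarding_summary(sample)
print(summary)
```

Counting banners is only a heuristic: replies that quote a forwarded message would be counted too.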

In [6]:
# saving one test message to a variable to write the code for a function that will loop through all messages
message = df.message[63]

# function step 1 | getting full message data
message = Parser().parsestr(message)

# looking at output Parsed message object
message

<email.message.Message at 0x7fce964d3340>

In [7]:
# looking at printout of Parsed message object
print(message)

Message-ID: <1776521.1075855688576.JavaMail.evans@thyme>
Date: Wed, 6 Sep 2000 04:46:00 -0700 (PDT)
From: phillip.allen@enron.com
To: thomas.martin@enron.com, mike.grigsby@enron.com, keith.holst@enron.com,
	jay.reitmeyer@enron.com, frank.ermis@enron.com
Subject: Wow
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Thomas A Martin, Mike Grigsby, Keith Holst, Jay Reitmeyer, Frank Ermis
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 09/06/2000 
10:49 AM ---------------------------


Jeff Richter
09/06/2000 07:39 AM
To: Phillip K Allen/HOU/ECT@ECT
cc:  
Subject: Wow


---------------------- Forwarded by Jeff Richter/HOU/ECT on 09/06/2000 09:45 
AM ---------------------------
To: Mike Swerzbin/HOU/ECT@ECT, Robert Badeer/HOU/ECT@ECT, Sean 
Crandall/PDX/ECT@ECT, Tim Belden/HOU/ECT@ECT, Jeff 

In [8]:
# looking at the message dtype
type(message)

email.message.Message

### We can use the [email.message.Message](https://docs.python.org/3/library/email.compat32-message.html) documentation to further parse the message info

In [9]:
# parsed message text content
text = message.get_payload()
text

'---------------------- Forwarded by Phillip K Allen/HOU/ECT on 09/06/2000 \n10:49 AM ---------------------------\n\n\nJeff Richter\n09/06/2000 07:39 AM\nTo: Phillip K Allen/HOU/ECT@ECT\ncc:  \nSubject: Wow\n\n\n---------------------- Forwarded by Jeff Richter/HOU/ECT on 09/06/2000 09:45 \nAM ---------------------------\nTo: Mike Swerzbin/HOU/ECT@ECT, Robert Badeer/HOU/ECT@ECT, Sean \nCrandall/PDX/ECT@ECT, Tim Belden/HOU/ECT@ECT, Jeff Richter/HOU/ECT@ECT, John \nM Forney/HOU/ECT@ECT, Matt Motley/PDX/ECT@ECT, Tom Alonso/PDX/ECT@ECT, Mark \nFischer/PDX/ECT@ECT\ncc:  \nSubject: Wow\n\n\n---------------------- Forwarded by Tim Belden/HOU/ECT on 09/06/2000 07:27 AM \n---------------------------\n   \n\tEnron Capital & Trade Resources Corp.\n\t\n\tFrom:  Kevin M Presto                           09/05/2000 01:59 PM\n\t\n\nTo: Tim Belden/HOU/ECT@ECT\ncc: Rogers Herndon/HOU/ECT@ect, John Zufferli/HOU/ECT@ECT, Lloyd \nWill/HOU/ECT@ECT, Doug Gilbert-Smith/Corp/Enron@ENRON, Mike \nSwerzbin/HOU/ECT

In [10]:
# looking at text printout
print(text)

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 09/06/2000 
10:49 AM ---------------------------


Jeff Richter
09/06/2000 07:39 AM
To: Phillip K Allen/HOU/ECT@ECT
cc:  
Subject: Wow


---------------------- Forwarded by Jeff Richter/HOU/ECT on 09/06/2000 09:45 
AM ---------------------------
To: Mike Swerzbin/HOU/ECT@ECT, Robert Badeer/HOU/ECT@ECT, Sean 
Crandall/PDX/ECT@ECT, Tim Belden/HOU/ECT@ECT, Jeff Richter/HOU/ECT@ECT, John 
M Forney/HOU/ECT@ECT, Matt Motley/PDX/ECT@ECT, Tom Alonso/PDX/ECT@ECT, Mark 
Fischer/PDX/ECT@ECT
cc:  
Subject: Wow


---------------------- Forwarded by Tim Belden/HOU/ECT on 09/06/2000 07:27 AM 
---------------------------
   
	Enron Capital & Trade Resources Corp.
	
	From:  Kevin M Presto                           09/05/2000 01:59 PM
	

To: Tim Belden/HOU/ECT@ECT
cc: Rogers Herndon/HOU/ECT@ect, John Zufferli/HOU/ECT@ECT, Lloyd 
Will/HOU/ECT@ECT, Doug Gilbert-Smith/Corp/Enron@ENRON, Mike 
Swerzbin/HOU/ECT@ECT 
Subject: Wow

Do not underestim

In [11]:
# parsing sender
sender = message['From']
sender

'phillip.allen@enron.com'

In [12]:
# parsing recipient
recip = message['To']
recip

'thomas.martin@enron.com, mike.grigsby@enron.com, keith.holst@enron.com, \n\tjay.reitmeyer@enron.com, frank.ermis@enron.com'

### `recip` will be something to possibly look at post MVP
- number of recipients
- frequency of certain recipient groups
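
A rough sketch of how the recipient count could be derived from the `To` header shown above, using `email.utils.getaddresses` from the standard library. The `recipient_addresses` helper and the idea of treating a group as a `frozenset` of addresses are assumptions for illustration.

```python
from email.utils import getaddresses

def recipient_addresses(to_header):
    """Hypothetical helper: split a To header into a list of addresses."""
    if not to_header:
        return []
    # collapse the folded header (newline + tab) into single spaces
    unfolded = ' '.join(to_header.split())
    return [addr.lower() for _, addr in getaddresses([unfolded]) if addr]

# same value as `recip` above
recip = ('thomas.martin@enron.com, mike.grigsby@enron.com, keith.holst@enron.com, '
         '\n\tjay.reitmeyer@enron.com, frank.ermis@enron.com')

addresses = recipient_addresses(recip)
n_recipients = len(addresses)
# an order-independent key that could be counted across messages
recipient_group = frozenset(addresses)
print(n_recipients, sorted(recipient_group))
```

Because a `frozenset` is hashable, the group keys could be tallied with `collections.Counter` to see which sets of recipients occur most often.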

In [13]:
# parsing message date
date = message['Date']
date

'Wed, 6 Sep 2000 04:46:00 -0700 (PDT)'
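
The date comes back as a plain string. As a small sketch, the standard library's `email.utils.parsedate_to_datetime` can turn it into a timezone-aware `datetime`. The variable names `parsed` and `parsed_utc` are illustrative.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# same value as `date` above
date = 'Wed, 6 Sep 2000 04:46:00 -0700 (PDT)'

# the -0700 offset is kept; the trailing "(PDT)" comment is ignored
parsed = parsedate_to_datetime(date)

# converting to UTC gives every message a common reference for sorting
parsed_utc = parsed.astimezone(timezone.utc)
print(parsed.isoformat(), parsed_utc.isoformat())
```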

In [14]:
# # running dataframe message contents through a for loop that will use the above code to parse contents and save to lists

# # empty lists to be appended in for loop and added to df later
# content = []
# date = []
# sender = []

# # for loop 
# for string in df.message:
#     message = Parser().parsestr(string)
#     msg_content = message.get_payload()
#     msg_date = message['Date']
#     msg_sender = message['From']
#     # adding to empty lists
#     content.append(msg_content)
#     date.append(msg_date)
#     sender.append(msg_sender)

### Skipping ahead and using code from the `acquire_pj.py` file in PJ's folder

In [15]:
def acquire_emails():
    df = pd.read_csv('../emails.csv')

    bodies = []
    dates = []

    # loop through email messages
    for i in df.message:
        # parse and set message to email data type
        headers = Parser().parsestr(i)
        # get the body text of the email
        body = headers.get_payload()
        # get the date from email
        date = headers['Date']
        # append date and body text to lists
        bodies.append(body)
        dates.append(date)

    # Insert the parsed lists into our original dataframe as new columns
    df.insert(1, "content", bodies)
    df.insert(1, "date", dates)

    return df

In [16]:
raw_data = pd.read_csv('raw_data.csv')

raw_data.head()

Unnamed: 0.1,Unnamed: 0,file,date,content,message
0,0,allen-p/_sent_mail/1.,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",Here is our forecast\n\n,Message-ID: <18782981.1075855378110.JavaMail.e...
1,1,allen-p/_sent_mail/10.,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",Traveling to have a business meeting takes the...,Message-ID: <15464986.1075855378456.JavaMail.e...
2,2,allen-p/_sent_mail/100.,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",test successful. way to go!!!,Message-ID: <24216240.1075855687451.JavaMail.e...
3,3,allen-p/_sent_mail/1000.,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)","Randy,\n\n Can you send me a schedule of the s...",Message-ID: <13505866.1075863688222.JavaMail.e...
4,4,allen-p/_sent_mail/1001.,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",Let's shoot for Tuesday at 11:45.,Message-ID: <30922949.1075863688243.JavaMail.e...


### For prepare, using `stem` for the MVP because it is faster
Also saving stop word removal for post-MVP work
- lowercase
- dtypes
- nulls

In [17]:
import prepare_steph as prepare

In [18]:
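def basic_clean(string):
    '''
    This function takes in a string and returns it normalized:
    accented characters reduced to ASCII, other non-ASCII characters
    dropped, punctuation removed, and everything lowercased.
    Relies on the `unicodedata` and `re` imports in the cell below.
    '''
    # decompose accented characters, then drop anything that is not ASCII
    string = unicodedata.normalize('NFKD', string)\
            .encode('ascii', 'ignore')\
            .decode('utf-8', 'ignore')
    # remove everything except word characters and whitespace, then lowercase
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string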

In [19]:
raw_data.content[54]

'why is aeco basis so low on the list?  Is NWPL mapped differently than AECO?  \nWhat about the correlation to Nymex on AECO?'

In [20]:
import unicodedata
import re
import json
import os

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import acquire_pj
from time import strftime

from sklearn.model_selection import train_test_split

basic_clean(raw_data.content[54])

'why is aeco basis so low on the list  is nwpl mapped differently than aeco  \nwhat about the correlation to nymex on aeco'

In [21]:
print(basic_clean(raw_data.content[54]))

why is aeco basis so low on the list  is nwpl mapped differently than aeco  
what about the correlation to nymex on aeco


In [22]:
raw_data.head(10)['content'].apply(basic_clean)

0                            here is our forecast\n\n 
1    traveling to have a business meeting takes the...
2                           test successful  way to go
3    randy\n\n can you send me a schedule of the sa...
4                     lets shoot for tuesday at 1145  
5    greg\n\n how about either next tuesday or thur...
6    please cc the following distribution list with...
7                      any morning between 10 and 1130
8    1 login  pallen pw ke9davis\n\n i dont think t...
9     forwarded by phillip k allenhouect on 1016200...
Name: content, dtype: object

In [23]:
# # caching raw data (index=False avoids writing an extra "Unnamed: 0" column)
# raw_data.to_csv('raw_data.csv', index=False)

In [None]:
prepare.clean_emails(raw_data)