**Import required libraries**

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

**Read the csv file**

In [2]:
emails = pd.read_csv('email.csv')

**Extract Sender, Receiver, Subject and Message Body for each raw email**  

email['from'] = 'sender1@hcl.com'  
email['to'] = 'receiver1@hcl.com'  
email['subject'] = 'subject1'  
email['body'] = 'first message body'

In [3]:
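def parse_raw_email(raw_email):
    """Split a raw email into its 'from', 'to', 'subject' and 'body' fields."""
    lines = raw_email.split('\n')
    email = {}
    message = ''
    keys_to_extract = ['from', 'to', 'subject']

    for line in lines:
        if ':' not in line:
            # Lines without a colon are treated as body text. They are
            # concatenated without a separator, so line breaks are lost.
            message += line.strip()
            email['body'] = message
        else:
            # Lines with a colon are treated as "Key: value" headers; any
            # key other than from/to/subject (including body lines that
            # happen to contain a colon) is discarded.
            pairs = line.split(':', 1)
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email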
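
*Because the parser above treats every line containing a colon as a header, body lines such as "Time: 3pm" are dropped. A minimal alternative sketch using the standard library `email` package is shown below. The function name `parse_with_stdlib` and the sample message are hypothetical, and the sketch assumes plain, non-multipart messages whose headers are separated from the body by a blank line.*

```python
from email import message_from_string


def parse_with_stdlib(raw_email):
    """Parse headers and body with the standard library (assumes non-multipart mail)."""
    msg = message_from_string(raw_email)
    return {
        # Header lookup is case-insensitive; '' is returned when a header is missing.
        'from': msg.get('From', ''),
        'to': msg.get('To', ''),
        'subject': msg.get('Subject', ''),
        # For a non-multipart message get_payload() returns the body as one string.
        'body': msg.get_payload(),
    }


# Hypothetical sample message, not taken from email.csv.
sample = (
    "From: sender1@hcl.com\n"
    "To: receiver1@hcl.com\n"
    "Subject: subject1\n"
    "\n"
    "first message body\n"
    "Time: 3pm"
)
parsed = parse_with_stdlib(sample)
print(parsed)
```

*Unlike `parse_raw_email`, this keeps line breaks and colon-containing lines in the body, so the resulting word counts may differ from the outputs shown in this notebook.*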

**Create a list of values for each extracted field (from, to, subject, body)**  

results[0] = 'sender1@hcl.com'  
results[1] = 'sender2@hcl.com'  
results[2] = 'sender3@hcl.com'  

In [4]:
def map_to_list(emails, key):
    results = []
    for email in emails:
        if key not in email:
            results.append('')
        else:
            results.append(email[key])
    return results
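
*The same mapping can be written more compactly with `dict.get`, which returns a default when the key is missing. This is only an equivalent sketch of the function above; the two sample dictionaries are hypothetical.*

```python
def map_to_list(emails, key):
    """Return email[key] for every parsed email, or '' when the key is missing."""
    return [email.get(key, '') for email in emails]


# Hypothetical parsed emails; the second one has no subject.
parsed = [
    {'from': 'sender1@hcl.com', 'subject': 'subject1'},
    {'from': 'sender2@hcl.com'},
]
senders = map_to_list(parsed, 'from')
subjects = map_to_list(parsed, 'subject')
print(senders)
print(subjects)
```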

**Create a dictionary of column lists suitable for a pandas DataFrame**  

{  
    'from': ['sender1@hcl.com', 'sender2@hcl.com', 'sender3@hcl.com'],  
    'to': ['receiver1@hcl.com', 'receiver2@hcl.com', 'receiver3@hcl.com'],  
    'subject': ['subject1', 'subject2', 'subject3'],  
    'body': ['message1', 'message2', 'message3']  
}

In [5]:
def map_extracted_fields(messages):
    emails = [parse_raw_email(message) for message in messages]
    return {
        'from': map_to_list(emails, 'from'),
        'to': map_to_list(emails, 'to'),
        'subject': map_to_list(emails, 'subject'),
        'body': map_to_list(emails, 'body')
    }

**Create the Dataframe using Pandas**  

d = {'col1': [1, 2], 'col2': [3, 4]}  
df = pd.DataFrame(data=d)  
df

|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 3    |
| 1 | 2    | 4    |

In [6]:
email_df = pd.DataFrame(map_extracted_fields(emails.message))
email_df

|  | from | to | subject | body |
|--|------|----|---------|------|
| 0 | phillip.allen@enron.com | tim.belden@enron.com |  | Here is our forecast |
| 1 | phillip.allen@enron.com | john.lavorato@enron.com | Re: | Traveling to have a business meeting takes the... |
| 2 | phillip.allen@enron.com | leah.arsdall@enron.com | Re: test | test successful. way to go!!! |
| 3 | phillip.allen@enron.com | randall.gay@enron.com |  | Randy,Can you send me a schedule of the salary... |
| 4 | phillip.allen@enron.com | greg.piper@enron.com | Re: Hello |  |
| ... | ... | ... | ... | ... |
| 9995 | eric.bass@enron.com | Brian Hoskins/HOU/ECT@ECT | Evite: Super Bowl Party | shes pretty sexy, huh? are we getting togethe... |
| 9996 | eric.bass@enron.com | danielles@jonesgranger.com |  | i copied your idea - and it screwed up your name! |
| 9997 | eric.bass@enron.com | Eric Bass/HOU/ECT@ECT | 9912 Texas Financial Liquidations | ---------------------------Eric,Just a reminde... |
| 9998 | eric.bass@enron.com | lwbthemarine@bigplanet.com |  | did you buy any enron in the 60s? |
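
*Building one list per column is one way to feed pandas. As an alternative sketch, `pd.DataFrame` also accepts the list of per-email dictionaries directly; missing keys become NaN and can be replaced with empty strings. The sample records and the name `records_df` are hypothetical.*

```python
import pandas as pd

# Hypothetical parsed emails; the second one has no subject or body.
parsed = [
    {'from': 'sender1@hcl.com', 'to': 'receiver1@hcl.com',
     'subject': 'subject1', 'body': 'first message body'},
    {'from': 'sender2@hcl.com', 'to': 'receiver2@hcl.com'},
]

columns = ['from', 'to', 'subject', 'body']

# Passing `columns` fixes the column order; keys absent from a record
# show up as NaN, which fillna('') turns into empty strings.
records_df = pd.DataFrame(parsed, columns=columns).fillna('')
print(records_df)
```

*Empty strings matter here because the vectorizers used later expect every document to be a string.*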


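**Instantiate the Count, TF and TF-IDF vectorizers**

*use_idf=False - disable IDF weighting so that TfidfVectorizer returns normalized term frequencies only*  
*max_df=0.50 - ignore terms that appear in more than 50% of the documents*  
*min_df=2 - ignore terms that appear in fewer than 2 documents*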

In [7]:
countvectorizer = CountVectorizer(stop_words='english')
tfvectorizer = TfidfVectorizer(stop_words='english', use_idf=False)
tfidfvectorizer = TfidfVectorizer(stop_words='english', max_df=0.50, min_df=2)
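
*To make the effect of `max_df` and `min_df` concrete, the sketch below applies the same document-frequency rule in plain Python. It is an illustration of the thresholds, not scikit-learn's implementation: the helper `filter_vocabulary`, the whitespace tokenizer and the four toy documents are all assumptions.*

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df=0.50):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term count).
        doc_freq.update(set(doc.lower().split()))
    # max_df is a proportion of documents, min_df an absolute document count.
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


# Hypothetical toy corpus.
docs = [
    "gas price forecast",
    "gas meeting today",
    "gas price meeting",
    "gas schedule",
]
kept = filter_vocabulary(docs)
print(kept)
```

*Here "gas" appears in all four documents and is dropped by `max_df`, while "forecast", "today" and "schedule" appear in only one document each and are dropped by `min_df`.*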

**Convert email body to matrix**

In [8]:
count_matrix = countvectorizer.fit_transform(email_df.body)
tf_matrix = tfvectorizer.fit_transform(email_df.body)
tfidf_matrix = tfidfvectorizer.fit_transform(email_df.body)
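
*For reference, the weights stored in these matrices can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document vector is scaled to unit length. The helper names and the toy corpus are hypothetical.*

```python
import math
from collections import Counter


def l2_normalize(weights):
    """Scale a {term: weight} dict to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)


def tf_vector(tokens):
    """Raw term counts, L2-normalized (what use_idf=False produces)."""
    return l2_normalize(dict(Counter(tokens)))


def tfidf_vector(tokens, corpus):
    """Term counts multiplied by smoothed IDF, then L2-normalized."""
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = count * idf
    return l2_normalize(weights)


# Hypothetical, already tokenized toy corpus.
corpus = [
    ["business", "meeting", "trip"],
    ["business", "gas", "price"],
    ["business", "business", "trip"],
]
tf = tf_vector(corpus[2])
tfidf = tfidf_vector(corpus[2], corpus)
print(tf)
print(tfidf)
```

*In the last document "business" occurs in every document and therefore gets the minimum IDF of 1, so after normalization the rarer term "trip" carries relatively more weight in the TF-IDF vector than in the TF vector.*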

**Retrieve the terms found in corpus**

In [9]:
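# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement.
count_feats = countvectorizer.get_feature_names_out()
tf_feats = tfvectorizer.get_feature_names_out()
tfidf_feats = tfidfvectorizer.get_feature_names_out()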

**Display Vectorizer output in a Dataframe**

*This will display the 10 highest-scoring words in an individual email*

*Count - number of times the word appeared in an email*  
*TF - term frequency of the word in an email, L2-normalized (the TfidfVectorizer default)*

In [10]:
count = 10
email_index = 1

print("\n\nEmail Message\n")
print(email_df.body[email_index])

# Each matrix row is one email; transpose it into a single column indexed by term.
print("\n\nCount Vectorizer\n")
count_df = pd.DataFrame(count_matrix[email_index].T.toarray(), index=count_feats, columns=["Count"])
count_df = count_df.sort_values('Count', ascending=False)
print(count_df.head(count))

print("\n\nTF Vectorizer\n")
tf_df = pd.DataFrame(tf_matrix[email_index].T.toarray(), index=tf_feats, columns=["TF"])
tf_df = tf_df.sort_values('TF', ascending=False)
print(tf_df.head(count))



Email Message

Traveling to have a business meeting takes the fun out of the trip.  Especially if you have to prepare a presentation.  I would suggest holding the business plan meetings here then take a trip without any formal business meetings.  I would even try and get some honest opinions on whether a trip is even desired or necessary.As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.  Too often the presenter speaks and the others are quiet just waiting for their turn.   The meetings might be better if held in a round table discussion format.My suggestion for where to go is Austin.  Play golf and rent a ski boat and jet ski's.  Flying somewhere takes too much time.


Count Vectorizer

             Count
meetings         4
business         4
trip             3
takes            2
ski              2
try              2
turn             1
presenter        1
better      

*This will display the top 10 words for the entire dataset, ranked by their mean TF-IDF score across all emails*

In [11]:
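# Average each column on the sparse matrix directly; calling toarray() first
# would allocate a dense (n_emails x n_terms) array.
tfidf_means = np.asarray(tfidf_matrix.mean(axis=0)).ravel()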

top_ids = np.argsort(tfidf_means)[::-1][:count]
top_feats = [(tfidf_feats[i], tfidf_means[i]) for i in top_ids]
df_top_feats = pd.DataFrame(top_feats, columns=['words', 'score'])

df_top_feats

|  | words | score |
|--|-------|-------|
| 0 | enron | 0.049938 |
| 1 | com | 0.037819 |
| 2 | ect | 0.028561 |
| 3 | message | 0.022798 |
| 4 | hou | 0.019696 |
| 5 | original | 0.019511 |
| 6 | phillip | 0.01887 |
| 7 | thanks | 0.014263 |
| 8 | john | 0.014109 |
| 9 | gas | 0.013763 |
