# Email Classification - Testing

*The main goal of this project is to perform **information extraction** on emails and **classify** them. This document walks through the steps involved in testing the email classification and the **modules and packages** used at each step.*

The steps involved in testing email classification are as follows:

    1. Preprocessing
    2. Feature Engineering
    3. Machine Learning

## Importing the test data
To test the email classification system, the file containing the emails and the number of rows to read from it are set using the variables below.

The *"file_path"* variable gives the **path of the emails csv file** to be processed when testing the email classification.

The *"rows"* variable gives the **number of rows** to read from the csv file.

In [19]:
file_path = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Data/emails.csv"
rows = 10

## 1. Preprocessing

Preprocessing consists of the following steps:

    1.1. Read the CSV File
    1.2. Split the Emails according to the email headers
    1.3. Elimination of forwarded content from email body

## 1.1. Read the CSV file
Here, the test data file is read using the ***read_file*** function in the ***read_csv*** module under the ***Preprocessing*** package.

### read_file function
#### Description:
Reads the csv file after checking its extension. If the file does not have a csv extension, it throws an exception asking for a file with a csv extension.
#### Parameters:
Accepts two parameters: the path of the file and the number of rows to read.
#### Returns:
Returns the contents of the csv file as a dataframe.
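
The extension check and the row limit described above can be sketched as follows. This is only an illustration under stated assumptions: `read_rows` is a hypothetical name, the exception type and message are guesses, and the sketch uses the standard library and returns a list of dictionaries, whereas the project's `read_file` returns a dataframe.

```python
import csv
from itertools import islice
from pathlib import Path


def read_rows(path, rows):
    """Hypothetical stand-in for read_file: check the extension, then read `rows` records."""
    if Path(path).suffix.lower() != ".csv":
        # The exception type and wording are assumptions; the source only says an exception is thrown.
        raise ValueError("Please upload a file with a .csv extension")
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # islice stops after `rows` records, so a large file is never read in full.
        return list(islice(reader, rows))
```

Checking the extension before opening the file means a wrong upload fails early, before any parsing work is done.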

In [20]:
from Code.Preprocessing.read_csv import read_file
emails = read_file(file_path,rows)
emails.head(5)

CSV imported successfully 



Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


## 1.2. Splitting the emails
The ***Preprocess*** class in the ***split_emails*** module under the ***Preprocessing*** package splits the testing emails based on their headers.

### class Preprocess
### preprocessing_emails function
#### Description:
Splits the testing emails based on the headers present within the email data.
#### Parameters:
Accepts two parameters: a dataframe and the name of the column that holds the email data.
#### Returns:
Returns the input testing emails with each header in its own column of the dataframe.
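
To illustrate what splitting by headers means for a single message, here is a minimal sketch that uses Python's standard `email` package. It is not the project's `Preprocess` implementation; `split_message` is a hypothetical helper, and the sample message is a shortened, made-up example in the same shape as the data.

```python
from email import message_from_string


def split_message(raw):
    """Hypothetical helper: return one dict holding every header plus the body."""
    msg = message_from_string(raw)
    fields = dict(msg.items())                # header name -> header value
    fields["email_body"] = msg.get_payload()  # text after the first blank line
    return fields


# Shortened sample message; the Message-ID value is invented for the example.
raw = (
    "Message-ID: <1.JavaMail.evans@thyme>\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "\n"
    "Here is our forecast\n"
)
fields = split_message(raw)
print(fields["From"], "|", fields["email_body"].strip())
```

Applied to every row of the message column, a helper like this would produce one column per header, which is the shape of the output shown for this step.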

In [21]:
from Code.Preprocessing.split_emails import Preprocess
information = Preprocess()
information.preprocessing_emails(emails,'message')
emails.head(5)

Unnamed: 0,file,message,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,email_body,Sender,Receiver
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Tim Belden <Tim Belden/Enron@EnronXGate>,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,phillip.allen@enron.com,tim.belden@enron.com
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,phillip.allen@enron.com,john.lavorato@enron.com
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Leah Van Arsdall,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,phillip.allen@enron.com,leah.arsdall@enron.com
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Randall L Gay,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",phillip.allen@enron.com,randall.gay@enron.com
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Greg Piper,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,phillip.allen@enron.com,greg.piper@enron.com


## 1.3. Elimination of forwarded content from testing emails body 
The original body of each testing email is separated from the forwarded content using the ***clean_emails*** function of the ***CleanEmails*** class in the ***clean_content*** module under the ***Preprocessing*** package.

### clean_emails function
#### Description:
Eliminates the forwarded content, retains the original body of the testing emails, and checks the spelling of the content.
#### Parameters:
The ***CleanEmails*** class is constructed with a dataframe and a column name; ***clean_emails*** itself is called without arguments.
#### Returns:
Returns only the original body of the testing emails, with corrected spelling, in a column of the dataframe.
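
The forwarded-content removal can be sketched with a regular expression. This is an assumption-laden illustration, not the `CleanEmails` code: `strip_forwarded` is a hypothetical helper, the marker pattern is inferred from the "Forwarded by" line visible in the sample data, and the spelling correction and tokenisation are not covered here.

```python
import re

# Assumed marker: a run of dashes followed by "Forwarded by", as seen in the sample data.
FORWARD_MARKER = re.compile(r"-{2,}\s*Forwarded by", re.IGNORECASE)


def strip_forwarded(body):
    """Hypothetical helper: keep only the text that precedes the first forward marker."""
    match = FORWARD_MARKER.search(body)
    original = body[:match.start()] if match else body
    return original.strip()


sample = "See below.\n\n---------------------- Forwarded by Someone\nolder text"
print(strip_forwarded(sample))
```

With this approach, an email whose body starts with the forward marker is left with an empty original body.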

In [22]:
from Code.Preprocessing.clean_content import CleanEmails
clean_class = CleanEmails(emails, 'message')
clean_class.clean_emails()
emails.head(5)

Unnamed: 0,file,message,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,...,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,email_body,Sender,Receiver,tokens,clean_text
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,phillip.allen@enron.com,tim.belden@enron.com,"(Here, is, our, forecast, \n\n )",forecast
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,1.0,text/plain; charset=us-ascii,7bit,...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,phillip.allen@enron.com,john.lavorato@enron.com,"(Traveling, to, have, a, business, meeting, ta...",travel business meeting take fun trip especial...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,phillip.allen@enron.com,leah.arsdall@enron.com,"(test, successful, ., , way, to, go, !, !, !)",test successful way
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",phillip.allen@enron.com,randall.gay@enron.com,"(Randy, ,, \n\n , Can, you, send, me, a, sched...",randy send schedule salary level scheduling gr...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,phillip.allen@enron.com,greg.piper@enron.com,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )",let shoot tuesday


## 2. Feature Engineering
In this step, named entities and their entity links are extracted from the testing emails using the trained NER (Named Entity Recognition) and NEL (Named Entity Linking) model. This uses the
***ner_df*** function in the ***ner_frame*** module under the ***Testing*** package.

### ner_df function
#### Description:
Loads the trained NER & NEL model and retrieves the entities and their entity links from the testing emails.
#### Parameters:
Accepts two parameters: the path to an existing trained NER & NEL model and the testing emails.
#### Returns:
Returns the entities found in the testing emails along with their entity links.
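
The NER_Info column in the output holds one dictionary per email, keyed by entity label, with (entity text, link identifier) pairs as values. The sketch below shows one way such a dictionary could be assembled from recognised entities. `group_entities` is a hypothetical helper, the label list covers only labels visible in the truncated output, and the input triples are taken from different sample rows purely for illustration.

```python
def group_entities(triples, labels=("Person", "Product", "Time")):
    """Hypothetical helper: bucket (text, label, link_id) triples by label."""
    info = {label: [] for label in labels}
    for text, label, link_id in triples:
        # A label outside the expected set is still kept rather than dropped.
        info.setdefault(label, []).append((text, link_id))
    return info


found = [("Randy", "Person", "Q10044"), ("Greg", "Person", "Q10043")]
info = group_entities(found)
print(info["Person"])
```

Starting from a dictionary with every expected label present keeps the structure uniform across emails, so emails without entities still carry empty lists.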

In [23]:
model = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Derived_Outputs/Custom_Spacy_Model_NER_NEL"

In [24]:
from Code.Testing.ner_frame import ner_df
ner_df(model, emails)

Unnamed: 0,file,message,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,...,X-Folder,X-Origin,X-FileName,email_body,Sender,Receiver,tokens,clean_text,Conent_NER,NER_Info
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,phillip.allen@enron.com,tim.belden@enron.com,"(Here, is, our, forecast, \n\n )",forecast,"(Here, is, our, forecast, \n\n )","{'Person': [], 'Product': [], 'Time': [], 'Dat..."
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,1.0,text/plain; charset=us-ascii,7bit,...,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,phillip.allen@enron.com,john.lavorato@enron.com,"(Traveling, to, have, a, business, meeting, ta...",travel business meeting take fun trip especial...,"(Traveling, to, have, a, business, meeting, ta...","{'Person': [], 'Product': [], 'Time': [], 'Dat..."
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,phillip.allen@enron.com,leah.arsdall@enron.com,"(test, successful, ., , way, to, go, !, !, !)",test successful way,"(test, successful, ., , way, to, go, !, !, !)","{'Person': [], 'Product': [], 'Time': [], 'Dat..."
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",phillip.allen@enron.com,randall.gay@enron.com,"(Randy, ,, \n\n , Can, you, send, me, a, sched...",randy send schedule salary level scheduling gr...,"(Randy, ,, \n\n , Can, you, send, me, a, sched...","{'Person': [('Randy', 'Q10044'), ('Phillip', '..."
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,phillip.allen@enron.com,greg.piper@enron.com,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )",let shoot tuesday,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )","{'Person': [], 'Product': [], 'Time': [], 'Dat..."
5,allen-p/_sent_mail/1002.,Message-ID: <30965995.1075863688265.JavaMail.e...,<30965995.1075863688265.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 04:17:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Greg,\n\n How about either next Tuesday or Thu...",phillip.allen@enron.com,greg.piper@enron.com,"(Greg, ,, \n\n , How, about, either, next, Tue...",grew tuesday thursday philip,"(Greg, ,, \n\n , How, about, either, next, Tue...","{'Person': [('Greg', 'Q10043'), ('Phillip', 'Q..."
6,allen-p/_sent_mail/1003.,Message-ID: <16254169.1075863688286.JavaMail.e...,<16254169.1075863688286.JavaMail.evans@thyme>,"Tue, 22 Aug 2000 07:44:00 -0700 (PDT)",phillip.allen@enron.com,"david.l.johnson@enron.com, john.shafer@enron.com",,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Please cc the following distribution list with...,phillip.allen@enron.com,"john.shafer@enron.com,david.l.johnson@enron.com","(Please, cc, the, following, distribution, lis...",follow distribution list update philip allen m...,"(Please, cc, the, following, distribution, lis...","{'Person': [('Phillip Allen', 'Q10007'), ('Mik..."
7,allen-p/_sent_mail/1004.,Message-ID: <17189699.1075863688308.JavaMail.e...,<17189699.1075863688308.JavaMail.evans@thyme>,"Fri, 14 Jul 2000 06:59:00 -0700 (PDT)",phillip.allen@enron.com,joyce.teixeira@enron.com,Re: PRC review - phone calls,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,any morning between 10 and 11:30,phillip.allen@enron.com,joyce.teixeira@enron.com,"(any, morning, between, 10, and, 11:30)",morning,"(any, morning, between, 10, and, 11:30)","{'Person': [], 'Product': [], 'Time': [], 'Dat..."
8,allen-p/_sent_mail/101.,Message-ID: <20641191.1075855687472.JavaMail.e...,<20641191.1075855687472.JavaMail.evans@thyme>,"Tue, 17 Oct 2000 02:26:00 -0700 (PDT)",phillip.allen@enron.com,mark.scott@enron.com,Re: High Speed Internet Access,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,1. login: pallen pw: ke9davis\n\n I don't thi...,phillip.allen@enron.com,mark.scott@enron.com,"(1, ., login, :, , pallen, pw, :, ke9davis, \...",login fallen think require is static address s...,"(1, ., login, :, , pallen, pw, :, ke9davis, \...","{'Person': [], 'Product': [], 'Time': [], 'Dat..."
9,allen-p/_sent_mail/102.,Message-ID: <30795301.1075855687494.JavaMail.e...,<30795301.1075855687494.JavaMail.evans@thyme>,"Mon, 16 Oct 2000 06:44:00 -0700 (PDT)",phillip.allen@enron.com,zimam@enron.com,FW: fixed forward or other Collar floor gas pr...,1.0,text/plain; charset=us-ascii,7bit,...,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,phillip.allen@enron.com,zimam@enron.com,"(----------------------, Forwarded, by, Philli...",,"(----------------------, Forwarded, by, Philli...","{'Person': [('Phillip K Allen', 'Q10001'), ('B..."


## 3. Machine learning

The following machine learning steps label and cluster the emails as part of testing:

    3.1. Activity Labelling
    3.2. Clustering

## 3.1. Activity Labelling
A document similarity model labels the emails based on generic email phrases. It uses the
***Activity*** class in the ***similarity*** module under the ***Testing*** package.

### Activity class
#### Description:
Compares the generic email phrases with the input testing emails and labels each email with an activity.
#### Parameters:
Accepts one parameter: the path to the csv file of generic email phrases.

### activity_entity function
#### Description:
Creates vectors for both the generic phrases and the input testing emails by splitting them into words and characters, then computes the cosine similarity between them.
#### Parameters:
Accepts one parameter: the input testing emails.
#### Returns:
Returns the activity labels for the input testing emails.
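
The idea of labelling by cosine similarity can be sketched with plain word counts. This is a simplified illustration, not the `Activity` class: it uses word-level counts only (no character features), the phrase list is hypothetical because the contents of Email_phrases.csv are not shown here, and `label_activity` is an invented name.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def label_activity(text, phrases):
    """Return the label of the most similar phrase, or None when nothing overlaps."""
    vector = Counter(text.lower().split())
    scored = [(cosine(vector, Counter(p.lower().split())), label) for p, label in phrases]
    best_score, best_label = max(scored, default=(0.0, None))
    return best_label if best_score > 0 else None


# Hypothetical phrases and labels, chosen only for the example.
phrases = [("please send", "REQUEST"), ("here is the", "INFORMATION")]
print(label_activity("please send the report", phrases))
```

In this sketch, an empty text has no overlap with any phrase and therefore receives no label.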

In [25]:
from Code.Testing.similarity import Activity
inp_phrases = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Data/Email_phrases.csv"
extraction_2 = Activity(inp_phrases)
extraction_2.activity_entity(emails)

Unnamed: 0,file,message,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,...,X-Origin,X-FileName,email_body,Sender,Receiver,tokens,clean_text,Conent_NER,NER_Info,activity
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,phillip.allen@enron.com,tim.belden@enron.com,"(Here, is, our, forecast, \n\n )",forecast,"(Here, is, our, forecast, \n\n )","{'Person': [], 'Product': [], 'Time': [], 'Dat...",INFORMATION
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,phillip.allen@enron.com,john.lavorato@enron.com,"(Traveling, to, have, a, business, meeting, ta...",travel business meeting take fun trip especial...,"(Traveling, to, have, a, business, meeting, ta...","{'Person': [], 'Product': [], 'Time': [], 'Dat...",EXPECTED
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,test successful. way to go!!!,phillip.allen@enron.com,leah.arsdall@enron.com,"(test, successful, ., , way, to, go, !, !, !)",test successful way,"(test, successful, ., , way, to, go, !, !, !)","{'Person': [], 'Product': [], 'Time': [], 'Dat...",EXPECTED
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",phillip.allen@enron.com,randall.gay@enron.com,"(Randy, ,, \n\n , Can, you, send, me, a, sched...",randy send schedule salary level scheduling gr...,"(Randy, ,, \n\n , Can, you, send, me, a, sched...","{'Person': [('Randy', 'Q10044'), ('Phillip', '...",CLARIFICATION
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,phillip.allen@enron.com,greg.piper@enron.com,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )",let shoot tuesday,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )","{'Person': [], 'Product': [], 'Time': [], 'Dat...",REQUEST
5,allen-p/_sent_mail/1002.,Message-ID: <30965995.1075863688265.JavaMail.e...,<30965995.1075863688265.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 04:17:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,"Greg,\n\n How about either next Tuesday or Thu...",phillip.allen@enron.com,greg.piper@enron.com,"(Greg, ,, \n\n , How, about, either, next, Tue...",grew tuesday thursday philip,"(Greg, ,, \n\n , How, about, either, next, Tue...","{'Person': [('Greg', 'Q10043'), ('Phillip', 'Q...",CLARIFICATION
6,allen-p/_sent_mail/1003.,Message-ID: <16254169.1075863688286.JavaMail.e...,<16254169.1075863688286.JavaMail.evans@thyme>,"Tue, 22 Aug 2000 07:44:00 -0700 (PDT)",phillip.allen@enron.com,"david.l.johnson@enron.com, john.shafer@enron.com",,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,Please cc the following distribution list with...,phillip.allen@enron.com,"john.shafer@enron.com,david.l.johnson@enron.com","(Please, cc, the, following, distribution, lis...",follow distribution list update philip allen m...,"(Please, cc, the, following, distribution, lis...","{'Person': [('Phillip Allen', 'Q10007'), ('Mik...",EXPECTED
7,allen-p/_sent_mail/1004.,Message-ID: <17189699.1075863688308.JavaMail.e...,<17189699.1075863688308.JavaMail.evans@thyme>,"Fri, 14 Jul 2000 06:59:00 -0700 (PDT)",phillip.allen@enron.com,joyce.teixeira@enron.com,Re: PRC review - phone calls,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,any morning between 10 and 11:30,phillip.allen@enron.com,joyce.teixeira@enron.com,"(any, morning, between, 10, and, 11:30)",morning,"(any, morning, between, 10, and, 11:30)","{'Person': [], 'Product': [], 'Time': [], 'Dat...",INFORMATION
8,allen-p/_sent_mail/101.,Message-ID: <20641191.1075855687472.JavaMail.e...,<20641191.1075855687472.JavaMail.evans@thyme>,"Tue, 17 Oct 2000 02:26:00 -0700 (PDT)",phillip.allen@enron.com,mark.scott@enron.com,Re: High Speed Internet Access,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,1. login: pallen pw: ke9davis\n\n I don't thi...,phillip.allen@enron.com,mark.scott@enron.com,"(1, ., login, :, , pallen, pw, :, ke9davis, \...",login fallen think require is static address s...,"(1, ., login, :, , pallen, pw, :, ke9davis, \...","{'Person': [], 'Product': [], 'Time': [], 'Dat...",INFORMATION
9,allen-p/_sent_mail/102.,Message-ID: <30795301.1075855687494.JavaMail.e...,<30795301.1075855687494.JavaMail.evans@thyme>,"Mon, 16 Oct 2000 06:44:00 -0700 (PDT)",phillip.allen@enron.com,zimam@enron.com,FW: fixed forward or other Collar floor gas pr...,1.0,text/plain; charset=us-ascii,7bit,...,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,phillip.allen@enron.com,zimam@enron.com,"(----------------------, Forwarded, by, Philli...",,"(----------------------, Forwarded, by, Philli...","{'Person': [('Phillip K Allen', 'Q10001'), ('B...",


## 3.2. Clustering
This step assigns the input testing emails to clusters using the trained clustering model. It uses the
***test_cluster_model*** function in the ***classification_test*** module under the ***Testing*** package.

### test_cluster_model function
#### Description:
Transforms the text into vectors and feeds them to the trained clustering model.
#### Parameters:
Accepts two parameters: the path to a trained clustering model and the input testing emails.
#### Returns:
Returns the clusters for the input testing emails.
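
As a rough illustration of assigning new text to an existing cluster, the sketch below turns a text into a count vector and picks the nearest centroid. It assumes a centroid-based model, which the k-means file name used in this step suggests but the document does not confirm. The vocabulary and centroids are made up by hand, and `vectorize` and `nearest_cluster` are hypothetical helpers rather than the project's code.

```python
import math


def vectorize(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def nearest_cluster(vector, centroids):
    """Index of the centroid with the smallest Euclidean distance to the vector."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))


# Hand-picked vocabulary and centroids, standing in for what a trained model would store.
vocabulary = ["forecast", "schedule", "tuesday"]
centroids = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

cluster = nearest_cluster(vectorize("let shoot tuesday", vocabulary), centroids)
print(cluster)
```

The same vocabulary that was used when the model was trained has to be used at test time, otherwise the vector positions no longer line up with the centroids.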

In [26]:
from Code.Testing.classification_test import test_cluster_model
model = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Pre_loaded_inputs_and_outputs/kmeans_model.sav"
test_cluster_model(model,emails)

Unnamed: 0,file,message,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,...,X-FileName,email_body,Sender,Receiver,tokens,clean_text,Conent_NER,NER_Info,activity,cluster
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,pallen (Non-Privileged).pst,Here is our forecast\n\n,phillip.allen@enron.com,tim.belden@enron.com,"(Here, is, our, forecast, \n\n )",forecast,"(Here, is, our, forecast, \n\n )","{'Person': [], 'Product': [], 'Time': [], 'Dat...",INFORMATION,0
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,1.0,text/plain; charset=us-ascii,7bit,...,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,phillip.allen@enron.com,john.lavorato@enron.com,"(Traveling, to, have, a, business, meeting, ta...",travel business meeting take fun trip especial...,"(Traveling, to, have, a, business, meeting, ta...","{'Person': [], 'Product': [], 'Time': [], 'Dat...",EXPECTED,0
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,test successful. way to go!!!,phillip.allen@enron.com,leah.arsdall@enron.com,"(test, successful, ., , way, to, go, !, !, !)",test successful way,"(test, successful, ., , way, to, go, !, !, !)","{'Person': [], 'Product': [], 'Time': [], 'Dat...",EXPECTED,0
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",phillip.allen@enron.com,randall.gay@enron.com,"(Randy, ,, \n\n , Can, you, send, me, a, sched...",randy send schedule salary level scheduling gr...,"(Randy, ,, \n\n , Can, you, send, me, a, sched...","{'Person': [('Randy', 'Q10044'), ('Phillip', '...",CLARIFICATION,0
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,Let's shoot for Tuesday at 11:45.,phillip.allen@enron.com,greg.piper@enron.com,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )",let shoot tuesday,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )","{'Person': [], 'Product': [], 'Time': [], 'Dat...",REQUEST,1
5,allen-p/_sent_mail/1002.,Message-ID: <30965995.1075863688265.JavaMail.e...,<30965995.1075863688265.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 04:17:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,"Greg,\n\n How about either next Tuesday or Thu...",phillip.allen@enron.com,greg.piper@enron.com,"(Greg, ,, \n\n , How, about, either, next, Tue...",grew tuesday thursday philip,"(Greg, ,, \n\n , How, about, either, next, Tue...","{'Person': [('Greg', 'Q10043'), ('Phillip', 'Q...",CLARIFICATION,1
6,allen-p/_sent_mail/1003.,Message-ID: <16254169.1075863688286.JavaMail.e...,<16254169.1075863688286.JavaMail.evans@thyme>,"Tue, 22 Aug 2000 07:44:00 -0700 (PDT)",phillip.allen@enron.com,"david.l.johnson@enron.com, john.shafer@enron.com",,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,Please cc the following distribution list with...,phillip.allen@enron.com,"john.shafer@enron.com,david.l.johnson@enron.com","(Please, cc, the, following, distribution, lis...",follow distribution list update philip allen m...,"(Please, cc, the, following, distribution, lis...","{'Person': [('Phillip Allen', 'Q10007'), ('Mik...",EXPECTED,0
7,allen-p/_sent_mail/1004.,Message-ID: <17189699.1075863688308.JavaMail.e...,<17189699.1075863688308.JavaMail.evans@thyme>,"Fri, 14 Jul 2000 06:59:00 -0700 (PDT)",phillip.allen@enron.com,joyce.teixeira@enron.com,Re: PRC review - phone calls,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,any morning between 10 and 11:30,phillip.allen@enron.com,joyce.teixeira@enron.com,"(any, morning, between, 10, and, 11:30)",morning,"(any, morning, between, 10, and, 11:30)","{'Person': [], 'Product': [], 'Time': [], 'Dat...",INFORMATION,0
8,allen-p/_sent_mail/101.,Message-ID: <20641191.1075855687472.JavaMail.e...,<20641191.1075855687472.JavaMail.evans@thyme>,"Tue, 17 Oct 2000 02:26:00 -0700 (PDT)",phillip.allen@enron.com,mark.scott@enron.com,Re: High Speed Internet Access,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,1. login: pallen pw: ke9davis\n\n I don't thi...,phillip.allen@enron.com,mark.scott@enron.com,"(1, ., login, :, , pallen, pw, :, ke9davis, \...",login fallen think require is static address s...,"(1, ., login, :, , pallen, pw, :, ke9davis, \...","{'Person': [], 'Product': [], 'Time': [], 'Dat...",INFORMATION,0
9,allen-p/_sent_mail/102.,Message-ID: <30795301.1075855687494.JavaMail.e...,<30795301.1075855687494.JavaMail.evans@thyme>,"Mon, 16 Oct 2000 06:44:00 -0700 (PDT)",phillip.allen@enron.com,zimam@enron.com,FW: fixed forward or other Collar floor gas pr...,1.0,text/plain; charset=us-ascii,7bit,...,pallen.nsf,---------------------- Forwarded by Phillip K ...,phillip.allen@enron.com,zimam@enron.com,"(----------------------, Forwarded, by, Philli...",,"(----------------------, Forwarded, by, Philli...","{'Person': [('Phillip K Allen', 'Q10001'), ('B...",,0


In [28]:
emails_out = emails[['From','To','Subject','message','Date','activity','cluster','NER_Info']]

In [29]:
emails_out

Unnamed: 0,From,To,Subject,message,Date,activity,cluster,NER_Info
0,phillip.allen@enron.com,tim.belden@enron.com,,Message-ID: <18782981.1075855378110.JavaMail.e...,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",INFORMATION,0,"{'Person': [], 'Product': [], 'Time': [], 'Dat..."
1,phillip.allen@enron.com,john.lavorato@enron.com,Re:,Message-ID: <15464986.1075855378456.JavaMail.e...,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",EXPECTED,0,"{'Person': [], 'Product': [], 'Time': [], 'Dat..."
2,phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,Message-ID: <24216240.1075855687451.JavaMail.e...,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",EXPECTED,0,"{'Person': [], 'Product': [], 'Time': [], 'Dat..."
3,phillip.allen@enron.com,randall.gay@enron.com,,Message-ID: <13505866.1075863688222.JavaMail.e...,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",CLARIFICATION,0,"{'Person': [('Randy', 'Q10044'), ('Phillip', '..."
4,phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,Message-ID: <30922949.1075863688243.JavaMail.e...,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",REQUEST,1,"{'Person': [], 'Product': [], 'Time': [], 'Dat..."
5,phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,Message-ID: <30965995.1075863688265.JavaMail.e...,"Thu, 31 Aug 2000 04:17:00 -0700 (PDT)",CLARIFICATION,1,"{'Person': [('Greg', 'Q10043'), ('Phillip', 'Q..."
6,phillip.allen@enron.com,"david.l.johnson@enron.com, john.shafer@enron.com",,Message-ID: <16254169.1075863688286.JavaMail.e...,"Tue, 22 Aug 2000 07:44:00 -0700 (PDT)",EXPECTED,0,"{'Person': [('Phillip Allen', 'Q10007'), ('Mik..."
7,phillip.allen@enron.com,joyce.teixeira@enron.com,Re: PRC review - phone calls,Message-ID: <17189699.1075863688308.JavaMail.e...,"Fri, 14 Jul 2000 06:59:00 -0700 (PDT)",INFORMATION,0,"{'Person': [], 'Product': [], 'Time': [], 'Dat..."
8,phillip.allen@enron.com,mark.scott@enron.com,Re: High Speed Internet Access,Message-ID: <20641191.1075855687472.JavaMail.e...,"Tue, 17 Oct 2000 02:26:00 -0700 (PDT)",INFORMATION,0,"{'Person': [], 'Product': [], 'Time': [], 'Dat..."
9,phillip.allen@enron.com,zimam@enron.com,FW: fixed forward or other Collar floor gas pr...,Message-ID: <30795301.1075855687494.JavaMail.e...,"Mon, 16 Oct 2000 06:44:00 -0700 (PDT)",,0,"{'Person': [('Phillip K Allen', 'Q10001'), ('B..."


In [30]:
emails_out.to_csv("Email_Class_Final_Output.csv")