# Email Classification - Training

*The main goal of this project is to perform **information extraction** on emails and **classify** them. This document walks through the email classification process and the **modules and packages** associated with it.*  

The steps involved in training email classification are as follows:

    1. Preprocessing
    2. Feature Engineering
    3. Machine Learning

## Importing the file
First, the file containing the emails and the number of rows to read from it are declared in the variables below.

The *"file"* variable holds the **path of the emails csv file** to be processed for email classification.

The *"rows"* variable holds the **number of rows** to read from the csv file.

In [7]:
file = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Data/emails.csv"
rows = 10

## 1. Preprocessing

Preprocessing consists of the following steps:

    1.1. Read the CSV File
    1.2. Split the Emails according to the email headers
    1.3. Export the email body column to "txt" for the Manual Annotation
    1.4. Elimination of forwarded content from email body

## 1.1. Read the CSV file
Here, the imported file is read using the ***read_file*** function available in ***read_csv*** module under ***Preprocessing*** package.

### read_file function
#### Description:
Reads the csv file after checking its extension; if the file is not a csv, it raises an exception asking for a file with the csv extension.
#### Parameters:
Accepts two parameters: the path of the file and the number of rows to read.
#### Returns:
Returns the contents of the csv file as a dataframe.

In [9]:
from Code.Preprocessing.read_csv import read_file
emails = read_file(file,rows)
emails.head(5)

CSV imported successfully 



Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...
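
The check-then-read behaviour described for *read_file* can be sketched with pandas. This is a minimal sketch, not the project's implementation: the function name `read_csv_rows`, the use of `ValueError`, and the default `rows=None` (read every row) are assumptions made for this example.

```python
from pathlib import Path

import pandas as pd


def read_csv_rows(path, rows=None):
    """Hypothetical stand-in for read_file: validate the extension, then read."""
    # Reject anything that does not end in .csv before touching the file.
    if Path(path).suffix.lower() != ".csv":
        raise ValueError(f"Please upload a file with the csv extension, got: {path}")
    # nrows=None tells pandas to read the whole file.
    frame = pd.read_csv(path, nrows=rows)
    print("CSV imported successfully")
    return frame
```

Checking the extension first keeps the error message specific; a file that has the right extension but malformed content would still fail inside `pd.read_csv`.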


## 1.2. Splitting the emails
The ***Preprocess*** class in ***split_emails*** module under ***Preprocessing*** package is developed to split the emails based on headers.  

### class Preprocess
### preprocessing_emails function
#### Description:
Splits the emails based on the headers present within the email data.
#### Parameters:
Accepts two parameters: a dataframe and the name of the data column that holds the raw emails.
#### Returns:
Returns the dataframe with each email header in its own column.

In [10]:
from Code.Preprocessing.split_emails import Preprocess
information = Preprocess()
analysis = information.preprocessing_emails(emails, 'message')
analysis.head(5)

Unnamed: 0,file,message,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,email_body,Sender,Receiver
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Tim Belden <Tim Belden/Enron@EnronXGate>,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,phillip.allen@enron.com,tim.belden@enron.com
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,phillip.allen@enron.com,john.lavorato@enron.com
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Leah Van Arsdall,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,phillip.allen@enron.com,leah.arsdall@enron.com
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Randall L Gay,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",phillip.allen@enron.com,randall.gay@enron.com
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Greg Piper,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,phillip.allen@enron.com,greg.piper@enron.com
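
For a single message, header splitting of this kind can be sketched with Python's standard `email` package. The helper name `split_email` and the sample message are made up for illustration, and the *Preprocess* class may do this differently; only the `email_body` key mirrors a column name from the output above.

```python
from email import message_from_string


def split_email(raw_message):
    """Return one dict per email: every header plus the body text."""
    msg = message_from_string(raw_message)
    # Each header name becomes a key, much like a dataframe column.
    row = {name: value for name, value in msg.items()}
    # For a plain-text (non-multipart) message the payload is the body string.
    row["email_body"] = msg.get_payload()
    return row


# Made-up sample in the same header/blank-line/body layout as the raw column.
sample = (
    "Message-ID: <1.example@thyme>\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "Subject: Forecast\n"
    "\n"
    "Here is our forecast\n"
)
parsed = split_email(sample)
print(parsed["From"], "->", parsed["To"])
```

Applying such a helper to every value of the raw column and building a dataframe from the resulting dicts would give one column per header.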


## 1.3. Export the email body column to text file for the Manual Annotations
The *email_body* column of the dataframe is exported as a text file for manual annotation using the ***create_textfile*** function in the ***create_textfile_content*** module under the ***Preprocessing*** package.
<a id="create_textfile"></a>
### create_textfile function
#### Description:
Writes the content of a dataframe column to a text file.
#### Parameters:
Accepts one parameter: a column of the dataframe.
#### Returns:
Saves a text file with the content of the column to the local directory.

In [4]:
from Code.Preprocessing.create_textfile_content import create_textfile
create_textfile(analysis['email_body'])
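
The export step can be pictured with a small standard-library sketch. The name `write_column_to_text`, the explicit output path, and the one-email-per-line layout are assumptions for this example; the source does not say how *create_textfile* names or lays out its file.

```python
from pathlib import Path


def write_column_to_text(values, out_path):
    """Write one entry per line; works with a list or a pandas Series."""
    # Collapse newlines inside an email so each email stays on a single line.
    lines = ["" if value is None else str(value).replace("\n", " ") for value in values]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Keeping one email per line makes it easy to match an annotation back to its row number, at the cost of losing the original line breaks.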

## 1.4. Elimination of forwarded content from email body 
The original body of the email is segregated from the forwarded content using the ***clean_emails*** method of the ***CleanEmails*** class in the ***clean_content*** module under the ***Preprocessing*** package.

### clean_emails function
#### Description:
Eliminates the forwarded content, retains the original body of the email, and checks the spelling of the content.
#### Parameters:
The ***CleanEmails*** class accepts two parameters: a dataframe and a data column name.
#### Returns:
Returns the dataframe with only the original body of each email, with corrected spelling, in a column of its own.

The output of clean_emails function is exported as a text file using the [create_textfile](#create_textfile) function.

In [5]:
from Code.Preprocessing.clean_content import CleanEmails
clean_class = CleanEmails(analysis,'message')
clean_nlp_content = clean_class.clean_emails()
clean_nlp_content.head(10)

Unnamed: 0,file,message,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,...,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,email_body,Sender,Receiver,tokens,clean_text
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",phillip.allen@enron.com,tim.belden@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,phillip.allen@enron.com,tim.belden@enron.com,"(Here, is, our, forecast, \n\n )",forecast
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",phillip.allen@enron.com,john.lavorato@enron.com,Re:,1.0,text/plain; charset=us-ascii,7bit,...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,phillip.allen@enron.com,john.lavorato@enron.com,"(Traveling, to, have, a, business, meeting, ta...",travel business meeting take fun trip especial...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",phillip.allen@enron.com,leah.arsdall@enron.com,Re: test,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,phillip.allen@enron.com,leah.arsdall@enron.com,"(test, successful, ., , way, to, go, !, !, !)",test successful way
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",phillip.allen@enron.com,randall.gay@enron.com,,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",phillip.allen@enron.com,randall.gay@enron.com,"(Randy, ,, \n\n , Can, you, send, me, a, sched...",randy send schedule salary level scheduling gr...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,phillip.allen@enron.com,greg.piper@enron.com,"(Let, 's, shoot, for, Tuesday, at, 11:45, ., )",let shoot tuesday
5,allen-p/_sent_mail/1002.,Message-ID: <30965995.1075863688265.JavaMail.e...,<30965995.1075863688265.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 04:17:00 -0700 (PDT)",phillip.allen@enron.com,greg.piper@enron.com,Re: Hello,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Greg,\n\n How about either next Tuesday or Thu...",phillip.allen@enron.com,greg.piper@enron.com,"(Greg, ,, \n\n , How, about, either, next, Tue...",grew tuesday thursday philip
6,allen-p/_sent_mail/1003.,Message-ID: <16254169.1075863688286.JavaMail.e...,<16254169.1075863688286.JavaMail.evans@thyme>,"Tue, 22 Aug 2000 07:44:00 -0700 (PDT)",phillip.allen@enron.com,"david.l.johnson@enron.com, john.shafer@enron.com",,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Please cc the following distribution list with...,phillip.allen@enron.com,"john.shafer@enron.com,david.l.johnson@enron.com","(Please, cc, the, following, distribution, lis...",follow distribution list update philip allen m...
7,allen-p/_sent_mail/1004.,Message-ID: <17189699.1075863688308.JavaMail.e...,<17189699.1075863688308.JavaMail.evans@thyme>,"Fri, 14 Jul 2000 06:59:00 -0700 (PDT)",phillip.allen@enron.com,joyce.teixeira@enron.com,Re: PRC review - phone calls,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,any morning between 10 and 11:30,phillip.allen@enron.com,joyce.teixeira@enron.com,"(any, morning, between, 10, and, 11:30)",morning
8,allen-p/_sent_mail/101.,Message-ID: <20641191.1075855687472.JavaMail.e...,<20641191.1075855687472.JavaMail.evans@thyme>,"Tue, 17 Oct 2000 02:26:00 -0700 (PDT)",phillip.allen@enron.com,mark.scott@enron.com,Re: High Speed Internet Access,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,1. login: pallen pw: ke9davis\n\n I don't thi...,phillip.allen@enron.com,mark.scott@enron.com,"(1, ., login, :, , pallen, pw, :, ke9davis, \...",login fallen think require is static address s...
9,allen-p/_sent_mail/102.,Message-ID: <30795301.1075855687494.JavaMail.e...,<30795301.1075855687494.JavaMail.evans@thyme>,"Mon, 16 Oct 2000 06:44:00 -0700 (PDT)",phillip.allen@enron.com,zimam@enron.com,FW: fixed forward or other Collar floor gas pr...,1.0,text/plain; charset=us-ascii,7bit,...,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,phillip.allen@enron.com,zimam@enron.com,"(----------------------, Forwarded, by, Philli...",


In [9]:
create_textfile(clean_nlp_content['clean_text'])
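
The forwarded-content step can be sketched with a regular expression. In the output above, the last email's body starts with a "Forwarded by" separator and its *clean_text* is empty, which is consistent with keeping only the text before the first separator. The pattern and the name `strip_forwarded` are assumptions, and the tokenisation and spell-checking that *clean_emails* also performs are left out.

```python
import re

# Assumed separators: a run of dashes followed by "Forwarded by" or "Original Message".
FORWARD_MARKER = re.compile(r"-{2,}\s*(Forwarded by|Original Message)", re.IGNORECASE)


def strip_forwarded(body):
    """Keep only the text written before the first forwarded-content separator."""
    match = FORWARD_MARKER.search(body)
    original = body if match is None else body[: match.start()]
    return original.strip()


print(repr(strip_forwarded("See below.\n\n-----Original Message-----\nFrom: someone")))
```

An email that consists only of forwarded material comes back as an empty string, so downstream steps should expect empty rows.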

## 2. Feature Engineering

Feature engineering consists of the following steps:

    2.1. Named Entity Recognition
    2.2. Named Entity Linking

## 2.1. Named Entity Recognition 
A neural model specific to the training data is trained to extract named entities using the ***train_spacy*** function in the ***training_ner*** module under the ***Feature_Engineering*** package.

### train_spacy function
#### Description:
Using the manually annotated data as training data, trains a blank spaCy neural model to extract named entities from email data.
#### Parameters:
Accepts two parameters: the training data (a manually annotated pickle file) and the number of iterations.
#### Returns:
Returns the training data, kept for evaluation purposes, and a neural network model trained specifically on that data.

In [12]:
from Code.Feature_Engineering.training_ner import train_spacy
import pickle
ner_pickle_input = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Data/ner_train_data.pickle"
ner_iterations = 20
_, nlp = train_spacy(ner_pickle_input,ner_iterations)
ner_model = input("Provide the name for NER model")
nlp.to_disk(ner_model)
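
Before training, it can help to sanity-check the annotated data. The sketch below assumes the pickle holds the commonly used spaCy-style list of `(text, {"entities": [(start, end, label), ...]})` tuples; the source does not state the format, and `check_annotations` is a hypothetical helper, not part of the project.

```python
def check_annotations(train_data):
    """Return (example index, label, reason) for every suspicious entity span."""
    problems = []
    for index, (text, annotation) in enumerate(train_data):
        previous_end = 0
        # Sort spans by start offset so overlaps show up as neighbours.
        for start, end, label in sorted(annotation.get("entities", [])):
            if not (0 <= start < end <= len(text)):
                problems.append((index, label, "offsets outside text"))
            elif start < previous_end:
                problems.append((index, label, "overlaps previous entity"))
            else:
                previous_end = end
    return problems


# Made-up examples: the first is valid, the second has an out-of-range span.
examples = [
    ("Tim Belden works at Enron", {"entities": [(0, 10, "PERSON"), (20, 25, "ORG")]}),
    ("Hello Greg", {"entities": [(6, 20, "PERSON")]}),
]
print(check_annotations(examples))
```

Out-of-range or overlapping spans are a frequent result of manual annotation, so a quick check like this can catch them before a training run is spent on bad data.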

## 2.2. Named Entity Linking 
A neural model specific to the training data is trained to link the existing named entities in the data, using the ***NelTraining*** class in the ***training_nel*** module under the ***Feature_Engineering*** package.

###  NelTraining class
#### Description:
Uses the trained NER model to generate the knowledge base and trains the named entity linking model.
#### Parameters:
Accepts one parameter: the location of the trained NER model.

In [12]:
from Code.Feature_Engineering.training_nel import NelTraining
import pandas as pd

cus_ner = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Derived_Outputs/Custom_NER"
cus_ner_path = NelTraining(cus_ner)

## 2.2.1 creating_knowledge
###  creating_knowledge function
#### Description:
Generates the knowledge base from the NER training data by running that data through the custom NER model.
#### Parameters:
Accepts one parameter: the location of the NER training data pickle file.
#### Returns:
Produces a .csv file with entities and their frequencies.

In [14]:
cus_ner_path.creating_knowledge(ner_pickle_input)
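
The "entities and frequencies" output can be illustrated with a short standard-library sketch. It assumes the entity strings have already been extracted by the NER model; the function name, the column names `entity` and `frequency`, and the `;` delimiter (chosen because the knowledge base file is later read with `sep=';'`) are assumptions rather than the project's actual layout.

```python
import csv
from collections import Counter


def write_entity_frequencies(entities, out_path):
    """Count each entity string and write the counts to a ;-separated csv."""
    # Ignore blank strings and surrounding whitespace when counting.
    counts = Counter(entity.strip() for entity in entities if entity.strip())
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=";")
        writer.writerow(["entity", "frequency"])
        # most_common() lists the most frequent entities first.
        writer.writerows(counts.most_common())
    return counts
```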

## 2.2.2 settingup_knowledgebase
###  settingup_knowledgebase function
#### Description:
Imports the knowledge base data to set up entities and aliases using the vocabulary from the trained NER model, and trains named entity linking on top of the trained NER model.
#### Parameters:
Accepts two parameters: the location of the knowledge base file and the NER training data with QIDs.
#### Returns:
Produces two pickle files, one with the knowledge base dump and one with the knowledge base vocab, and a neural model trained specifically on the training data.

In [13]:
kb_input = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Pre_loaded_inputs_and_outputs/Knowledge_Base_demo.csv"
names =  pd.read_csv(kb_input, sep=';')
training_nel_data = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Pre_loaded_inputs_and_outputs/Training_NER_demo.csv"
train_data_2 =  pd.read_csv(training_nel_data, sep=';')
cus_ner_path.settingup_knowledgebase(names,train_data_2)

.csv
None
CSV imported successfully 

.csv
None
CSV imported successfully 



## 3. Machine Learning
An unsupervised machine learning model specific to the training data is generated to cluster the data automatically, using the ***TrainML*** class in the ***Training_ML*** module under the ***Machine_Learning*** package.

###  TrainML Class
#### Description:
Trains a machine learning model that generates clusters from the preprocessed training data.
#### Parameters:
Accepts two parameters: a dataframe and a data column.

In [3]:
from Code.Machine_Learning.Training_ML import TrainML
from Code.Preprocessing.read_csv import read_file

ml_input = "/Users/Sangireddy Siva/Dropbox/CaseStudy1/New Production code/Scripts_V14/Pre_loaded_inputs_and_outputs/cleaned_emails.csv"
ml_data = read_file(ml_input)
print(ml_data.head(5))
cust_ml_model = TrainML(ml_data,'clean_text')

CSV imported successfully 

                                          clean_text
0                                           forecast
1  travel business meeting take fun trip especial...
2                                test successful way
3  randy send schedule salary level scheduling gr...
4                                  let shoot tuesday


## 3.1 Training Clustering model
###  train_ml function
#### Description:
Takes the dataframe from the TrainML class and generates the clustering model, choosing the number of clusters from the elbow strength of the inertia scores.
#### Parameters:
Accepts three parameters while training the model: random state, number of iterations, and the number of top words to display.
#### Default parameters:
random state = 42, iterations = 100, top words = 20
#### Returns:
Produces one pickle file with the trained clustering model specific to the training data.

In [4]:
cust_ml_model.train_ml()

Clustering using KMeans...
Please wait for model training..........
clusters are 2
Clustering model has been successfully trained
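
The source does not define "elbow strength". One common reading, assumed in the sketch below, is the second difference of the inertia curve: how much the drop in inertia slows down at a given number of clusters. The function `pick_elbow` and the inertia values are made up for illustration and are not taken from *train_ml*.

```python
def pick_elbow(inertias, first_k=1):
    """Pick k where the inertia curve bends most sharply.

    inertias[i] is the inertia of a model fitted with first_k + i clusters.
    """
    if len(inertias) < 3:
        raise ValueError("need inertia scores for at least three values of k")
    strengths = {}
    for i in range(1, len(inertias) - 1):
        drop_before = inertias[i - 1] - inertias[i]
        drop_after = inertias[i] - inertias[i + 1]
        # A large positive value means the improvement flattens out after this k.
        strengths[first_k + i] = drop_before - drop_after
    best_k = max(strengths, key=strengths.get)
    return best_k, strengths


# Made-up inertia scores for k = 1..5.
best_k, strengths = pick_elbow([100.0, 40.0, 35.0, 32.0, 30.0])
print("clusters are", best_k)
```

In practice the inertia values would come from fitting one clustering model per candidate k; the first and last candidates can never be chosen because the second difference is undefined at the ends of the curve.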
