## Machine Learning: Email Classification


Each user had a chance to classify their emails by filing them in labeled folders. Here I build a classifier for the emails of a single employee, using the folder names that the employee chose as labels. The models are trained on the subject and content of each email, starting with the Multinomial Naive Bayes algorithm.

I chose to work with the emails of user 'kaminski-v'. Vince Kaminski is the top email user in the database, with 28465 emails sent and received. One of his email accounts, "vince.kaminski@enron.com", sent 14368 emails, the second most among all users. He also appears to have labeled his emails more carefully than the other top users, which is why I chose him.

I start with the cleaned data that I prepared in the previous sections of the project. 


- **Part 1: Extracting labels from data**
- **Part 2: Tokenization & Cleaning**
- **Part 3: Machine Learning: Multinomial Naive Bayes, n-gram Naive Bayes, Logistic Regression**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tokenization and cleaning 
import re
from nltk.tokenize.regexp import RegexpTokenizer
# ENGLISH_STOP_WORDS is importable from .text; the old .stop_words module is gone in newer scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Machine Learning: Bag of words, Multinomial Naive Bayes 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# TF-IDF, Logistic Regression and n-gram will be imported on the spot

In [2]:
# reading the preprocessed data frame
df = pd.read_csv('out.csv', index_col='Message-ID', low_memory=False)
df.head()

Message-ID,Bcc,Cc,Date,From,Subject,To,X-FileName,X-Folder,X-From,X-Origin,X-To,X-bcc,X-cc,content,user
<18782981.1075855378110.JavaMail.evans@thyme>,,,2001-05-14 23:39:00,phillip.allen@enron.com,,tim.belden@enron.com,pallen (Non-Privileged).pst,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Phillip K Allen,Allen-P,Tim Belden <Tim Belden/Enron@EnronXGate>,,,Here is our forecast\r\n\r\n,allen-p
<15464986.1075855378456.JavaMail.evans@thyme>,,,2001-05-04 20:51:00,phillip.allen@enron.com,Re:,john.lavorato@enron.com,pallen (Non-Privileged).pst,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Phillip K Allen,Allen-P,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,Traveling to have a business meeting takes the...,allen-p
<24216240.1075855687451.JavaMail.evans@thyme>,,,2000-10-18 10:00:00,phillip.allen@enron.com,Re: test,leah.arsdall@enron.com,pallen.nsf,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Phillip K Allen,Allen-P,Leah Van Arsdall,,,test successful. way to go!!!,allen-p
<13505866.1075863688222.JavaMail.evans@thyme>,,,2000-10-23 13:13:00,phillip.allen@enron.com,,randall.gay@enron.com,pallen.nsf,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Phillip K Allen,Allen-P,Randall L Gay,,,"Randy,\r\n\r\n Can you send me a schedule of t...",allen-p
<30922949.1075863688243.JavaMail.evans@thyme>,,,2000-08-31 12:07:00,phillip.allen@enron.com,Re: Hello,greg.piper@enron.com,pallen.nsf,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Phillip K Allen,Allen-P,Greg Piper,,,Let's shoot for Tuesday at 11:45.,allen-p


## Part 1: Data Cleaning & Wrangling - Extracting Labels

- Building a new dataframe from emails sent and received by the top email user 'kaminski-v' (Vince Kaminski)
- Keeping the labels that are not too general and have about 200 emails or more.
- Removing the unwanted columns from the data frame.
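
As a minimal sketch of the label extraction step, the snippet below takes the last component of a backslash-separated folder path and lowercases it. The helper name `folder_to_label` is hypothetical and used only for illustration; the two example paths are copied from the `X-Folder` listing in this notebook.

```python
def folder_to_label(x_folder: str) -> str:
    """Return the lowercased last component of a backslash-separated path."""
    # '\\' in a Python string literal is a single backslash character
    return x_folder.lower().split('\\')[-1]


# Raw strings keep the backslashes literal
example_paths = [
    r"\Vincent_Kaminski_Jun2001_1\Notes Folders\All documents",
    r"\vkamins\Deleted Items",
]
labels = [folder_to_label(p) for p in example_paths]
print(labels)  # ['all documents', 'deleted items']
```

Folders that share a final name but live under different parent directories collapse into the same label, which is why the counts per label are larger than the counts per full folder path.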

In [3]:
df_ml = df[df['user'] == 'kaminski-v']
print('Number of emails sent or received by top email user (kaminski-v):', len(df_ml))

Number of emails sent or received by top email user (kaminski-v): 28465


In [4]:
# The X-Folder directories (header) which we will use for labeling
print(df_ml.groupby('X-Folder')['X-Folder'].count().sort_values(ascending=False)[:20])

X-Folder
\Vincent_Kaminski_Jun2001_1\Notes Folders\All documents                 5066
\Vincent_Kaminski_Jun2001_2\Notes Folders\Discussion threads            3980
\Vincent_Kaminski_Jun2001_4\Notes Folders\'sent mail                    2574
\Vincent_Kaminski_Jun2001_3\Notes Folders\Sent                          2573
\Vincent_Kaminski_Jun2001_6\Notes Folders\All documents                 2108
\Vincent_Kaminski_Jun2001_7\Notes Folders\Discussion threads            1570
\Vincent_Kaminski_Jun2001_8\Notes Folders\'sent mail                     890
\Vincent_Kaminski_Jun2001_8\Notes Folders\Sent                           890
\VKAMINS (Non-Privileged)\Kaminski, Vince J\Sent Items                   827
\vkamins\Deleted Items                                                   691
\VKAMINS (Non-Privileged)\Kaminski, Vince J\Deleted Items                572
\Vince_Kaminski_Jun2001_10\Sent Items                                    498
\Vincent_Kaminski_Jun2001_5\Notes Folders\C:\Mangmt\Group\Managemen

In [5]:
# extracting the final folder name from the directory path
# the separator is a single backslash, which is written '\\' in a Python string literal
df_ml = df_ml.copy()  # work on an explicit copy so pandas does not raise SettingWithCopyWarning
df_ml['X-Folder'] = df_ml['X-Folder'].astype(str)  # some of the folder values were floats
df_ml['labels'] = df_ml['X-Folder'].map(lambda x: x.lower().split('\\')[-1])


In [6]:
print(df_ml.groupby('labels')['labels'].count().sort_values(ascending=False))

labels
all documents         7174
discussion threads    5550
'sent mail            3464
sent                  3463
deleted items         1792
sent items            1696
management             802
inbox                  560
resumes                547
projects               379
universities           367
personal               281
ene_ect                270
conferences            223
notes inbox            223
london                 195
techmemos              187
rice                   175
australia              112
var                    110
eci                    109
ut                      76
stanford                76
credit                  65
calendar                60
evaluation              57
risk                    54
ei                      51
consultants             44
sites                   30
cera                    28
esai                    25
rac                     24
ees                     23
weather                 21
presentations           21
gpg                  

In [7]:
# labels with about 200 emails or more that are not too general.
# I did not select 'personal' because it contains a lot of emails in Polish
labels_to_keep = ['management', 'resumes', 'universities',
                  'projects', 'conferences', 'london', 'ene_ect']
print('Categorical labels:', labels_to_keep)

Categorical labels: ['management', 'resumes', 'universities', 'projects', 'conferences', 'london', 'ene_ect']


In [8]:
# removing rows with unwanted labels
df_ml = df_ml[df_ml['labels'].isin(labels_to_keep)]

In [9]:
print(df_ml.groupby('labels')['labels'].count().sort_values(ascending=False))

labels
management      802
resumes         547
projects        379
universities    367
ene_ect         270
conferences     223
london          195
Name: labels, dtype: int64


- To avoid unbalanced training data, which would bias the model toward the more frequent labels, I downsample every category to the size of the smallest one: 195 emails per label.

In [10]:
# sampling 195 emails per label so that all categories have equal size
df_ml_r = pd.concat([df_ml.loc[df_ml['labels'] == lab, :].sample(195, random_state=42)
                     for lab in labels_to_keep])

In [11]:
# Check the sampled data
df_ml = df_ml_r
del df_ml_r
print(df_ml.groupby('labels')['labels'].count().sort_values(ascending=False))

labels
universities    195
resumes         195
projects        195
management      195
london          195
ene_ect         195
conferences     195
Name: labels, dtype: int64


In [12]:
# adding a column for subject + content
# some of the emails have no subject line. Without filling the nan the result of the addition becomes null.
df_ml['Subject'] = df_ml['Subject'].fillna('')
df_ml['text'] = df_ml['Subject'] + '\n' + df_ml['content']

In [13]:
# removing the extra columns
df_ml = df_ml[['labels', 'Subject', 'content', 'text']].reset_index()
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1365 entries, 0 to 1364
Data columns (total 5 columns):
Message-ID    1365 non-null object
labels        1365 non-null object
Subject       1365 non-null object
content       1365 non-null object
text          1365 non-null object
dtypes: object(5)
memory usage: 53.4+ KB


- Keeping only the 7 labels that are not too general, with 195 emails each, leaves only 1365 emails for classification.

## Part 2: Tokenization & Cleaning
- Removing digits from the text
- Tokenization (list of tokens)
- Removing stop words
- Removing the runs of underscores ("____") that some emails use as blank fill-in lines
- Concatenating the list of tokens into a string for CountVectorizer
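
The steps above can be sketched on a single string with the standard library alone. This is an illustration rather than the notebook's pipeline: `STOP_WORDS` is a tiny stand-in (an assumption) for scikit-learn's `ENGLISH_STOP_WORDS`, the helper name `clean_text` is hypothetical, and the sketch lowercases tokens before the stop-word check so that capitalized stop words are dropped as well.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the scikit-learn set)
STOP_WORDS = {"the", "to", "from", "is", "at", "and", "of"}
# Same token pattern as the RegexpTokenizer used in this notebook
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def clean_text(text: str) -> str:
    """Apply the cleaning steps to one string and return a space-joined string."""
    no_digits = re.sub(r"\d+", "", text)                  # remove digits
    tokens = [t.lower() for t in TOKEN_RE.findall(no_digits)]  # tokenize, lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    tokens = [t.replace("_", " ").strip() for t in tokens]  # drop underscore filler
    tokens = [t for t in tokens if t]                     # discard emptied tokens
    return " ".join(tokens)


sample = "Meeting at 10:30 ____ The London office is hiring 2 analysts"
cleaned = clean_text(sample)
print(cleaned)  # meeting london office hiring analysts
```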

In [14]:
# removing digits
df_ml['text_nonum'] = df_ml['text'].map(lambda x: re.sub(r'\d+', '',x))

In [15]:
# tokenizing to remove stop words. This converts the text to a list of tokens
tokenizer = RegexpTokenizer(r'(?u)\b\w\w+\b')
df_ml['text_token'] = df_ml['text_nonum'].map(lambda x: tokenizer.tokenize(x)) # class of each element: list

In [16]:
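# removing the stop words and lowercasing
# note: the stop-word check runs on the original casing, so capitalized stop words (e.g. 'The')
# pass this filter; testing word.lower() instead would drop them at this step
df_ml['text_token'] = df_ml['text_token'].map(lambda x: [word.lower() for word in x if word not in (ENGLISH_STOP_WORDS)])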

In [17]:
# removing some common words in emails
words_to_go = ['to', 'from', 'am', 'pm']
df_ml['text_token'] = df_ml['text_token'].map(lambda x: [word for word in x if word not in (words_to_go)])

In [18]:
# to remove ____... from the text
df_ml['text_token'] = df_ml['text_token'].map(lambda x: [word.replace('_', ' ') for word in x])

In [19]:
# making a string from a list of tokens so that it can work with CountVectorizer
df_ml['text_str'] = df_ml['text_token'].map(lambda x: ' '.join(x))

## Part 3: Machine Learning 

I used the bag of words model for feature extraction from text.

In the bag of words model, each document is represented as a vector. Each element of the vector holds a statistic about one word of the vocabulary, such as presence/absence (1/0), a count (an integer), or some other weight. All vectors have the same length because every document shares the vocabulary of the full collection of documents, which is called the corpus. In the usual sklearn style, a set of documents then becomes a sparse matrix whose rows represent documents and whose columns represent the features (words) of the vocabulary. Note that the bag of words treatment does not preserve the order of words, only their frequency.
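
To make the document-term matrix concrete, here is a small standard-library sketch with a made-up three-document corpus (the documents are assumptions for illustration). It builds a dense count matrix, whereas `CountVectorizer` returns a sparse one.

```python
from collections import Counter

# Toy corpus (hypothetical documents, already cleaned and lowercased)
docs = ["energy risk model", "risk management meeting", "model risk risk"]
tokenized = [d.split() for d in docs]

# Shared vocabulary across the whole corpus, in alphabetical order
vocabulary = sorted({w for doc in tokenized for w in doc})

# One row per document, one column per vocabulary word, entries are counts
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[w] for w in vocabulary])

print(vocabulary)  # ['energy', 'management', 'meeting', 'model', 'risk']
for row in matrix:
    print(row)
```

The third document yields `[0, 0, 0, 1, 2]`: the word order "model risk risk" is lost and only the counts remain.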

For classification I used Multinomial Naive Bayes with two different methods for vectorization, n-gram naive bayes, and logistic regression. I used cross validation to tune the hyperparameters for all of the models.
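
As a rough sketch of what k-fold cross validation does with the training data, the helper below (hypothetical name `k_fold_indices`) splits sample indices into k contiguous folds. Each fold serves once as the validation set while the remaining folds are used for training. `GridSearchCV` repeats this for every hyperparameter value and averages the validation scores; for classifiers it uses stratified folds by default, which this simplified sketch does not reproduce.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```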

### 3.1. Multinomial Naive Bayes

Naive Bayes is a family of classifiers that assume the features are conditionally independent given the class. Multinomial Naive Bayes is the member of that family that models the feature counts of each class with a multinomial distribution.

The main assumption of Naive Bayes is that the features are conditionally independent given the class. A discriminative word may be strongly tied to one class, so words are generally not independent of each other across the whole corpus. Conditional independence only requires that, within a given class, the presence of one word is independent of the other words that appear in that class.
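
The sketch below shows how the conditional independence assumption turns into a classifier: the score of a class is its log prior plus the sum of per-word log likelihoods. It is a simplified, standard-library illustration on made-up data, not the scikit-learn implementation. The function names `train_mnb` and `predict_mnb` and the toy documents are assumptions; `alpha` plays the role of the additive (Laplace) smoothing parameter.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = sorted({w for d in docs for w in d})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for c, n_c in class_doc_counts.items():
        total = sum(word_counts[c].values())
        log_prior = math.log(n_c / len(docs))
        # Additive smoothing: (count + alpha) / (total + alpha * |vocabulary|)
        log_lik = {w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
                   for w in vocab}
        model[c] = (log_prior, log_lik)
    return model


def predict_mnb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, (log_prior, log_lik) in model.items():
        # Independence given the class: word log likelihoods simply add up.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(scores, key=scores.get)


train_docs = [["resume", "candidate", "interview"],
              ["candidate", "resume", "position"],
              ["conference", "speaker", "agenda"],
              ["conference", "agenda", "registration"]]
train_labels = ["resumes", "resumes", "conferences", "conferences"]

model = train_mnb(train_docs, train_labels, alpha=1.0)
print(predict_mnb(model, ["interview", "candidate"]))  # resumes
print(predict_mnb(model, ["agenda", "speaker"]))       # conferences
```

With `alpha=1`, a word never seen in a class still gets a small non-zero probability, so a single unseen word cannot force the class score to minus infinity.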


In [20]:
# Making the Bag of Words and the labels!
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(df_ml.text_str)
X = vectorizer.transform(df_ml.text_str)
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
print('First 100 features: \n', feature_names[:100].tolist())
print('...')
print('Last 100 features: \n', feature_names[-100:].tolist())

First 100 features: 
 ['aa', 'aadedeji', 'aaldous', 'aanalysis', 'aaro', 'ab', 'abahy', 'abb', 'abdul', 'abdullah', 'abhay', 'abilit', 'abilities', 'ability', 'abitibi', 'able', 'abler', 'abliged', 'abn', 'abo', 'aboard', 'aboriginal', 'abou', 'about', 'above', 'abrams', 'abreast', 'abridged', 'abroad', 'abs', 'absence', 'absolutely', 'absorb', 'absorbed', 'absorbing', 'abstract', 'abstracts', 'abu', 'abundance', 'abuse', 'ac', 'aca', 'academe', 'academia', 'academic', 'academically', 'academics', 'academy', 'acadrep', 'acc', 'accelerate', 'accelerated', 'accelerating', 'acceleration', 'accenture', 'accept', 'acceptable', 'acceptance', 'accepted', 'accepting', 'accepts', 'acces', 'access', 'accessenergy', 'accessibility', 'accessible', 'accessing', 'accident', 'acclaimed', 'accommodate', 'accommodates', 'accommodating', 'accommodation', 'accommodations', 'accomodate', 'accomodating', 'accomodation', 'accomodative', 'accompanied', 'accompanies', 'accompany', 'accomplish', 'accomplished'

In [21]:
print('Number of features:', len(vectorizer.vocabulary_))

Number of features: 13676


In [22]:
print('Shape of the sparse matrix: ', X.shape)
print('Non-Zero occurrences: ', X.nnz)
# Percentage of non-zero values
density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print('Percentage of non-zero values: %', density)

Shape of the sparse matrix:  (1365, 13676)
Non-Zero occurrences:  124627
Percentage of non-zero values: % 0.6676062554974518


###### So far
- 1365 emails labeled in 7 different categories are vectorized.
- There are 13676 features, with min_df = 1 (keeping every word that appears in at least one email).
- Some of the words in this bag are in Polish, not English.

In [23]:
# y labels
y = df_ml['labels'].values # df_ml.labels.values would not work, '.labels' might be considered an attribute

In [24]:
# Splitting the data for training and test
Xlr, Xtest, ylr, ytest = train_test_split(X, y, test_size=0.2, random_state=5)

# Naive Bayes Multinomial
MNB = MultinomialNB(alpha=1) # cross-validation will be performed later
MNB.fit(Xlr, ylr)
print('Accuracy Score for Training Set (MNB): {}'.format(accuracy_score(ylr, MNB.predict(Xlr))))
print('Accuracy Score for Test Set (MNB): {}'.format(accuracy_score(ytest, MNB.predict(Xtest))))

Accuracy Score for Training Set (MNB): 0.9230769230769231
Accuracy Score for Test Set (MNB): 0.7289377289377289


In [25]:
prediction = MNB.predict(X)
# print(prediction)

In [26]:
# name of the classes, see dir(MNB) to see available methods
print ('Name of classes:\n', MNB.classes_)
print('Number of samples encountered for each class during training:')
print(MNB.class_count_) # in the training phase

Name of classes:
 ['conferences' 'ene_ect' 'london' 'management' 'projects' 'resumes'
 'universities']
Number of samples encountered for each class during training:
[ 153.  147.  154.  162.  159.  162.  155.]


### Cross validation to tune alpha (smoothing parameter) of MNB

In [27]:
# Tune the hyperparameter of MNB using GridSearchCV

alpha_space = [1, 1.5, 2, 5, 10, 50]
param_grid = {'alpha': alpha_space}

mnb = MultinomialNB()

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

# Instantiate the GridSearchCV object
mnb_cv = GridSearchCV(mnb, param_grid, scoring='accuracy', cv=5)

# Fit it to the training data
mnb_cv.fit(X_train,y_train)

# Print the optimal parameters and best score
# Parameter setting that gave the best results on the hold out data
print("Tuned alpha parameter: {}".format(mnb_cv.best_params_)) 
# Mean cross-validated score of the best_estimator
print("Mean cross-validated accuracy of tuned model for the hold out data: {}".format(mnb_cv.best_score_)) 

Tuned alpha parameter: {'alpha': 1}
Mean cross-validated accuracy of tuned model for the hold out data: 0.7380952380952381


In [28]:
# Model accuracy using the best alpha, which turned out to be equal to the default alpha = 1
print('Accuracy Score for Training Set (MNB-CV): {}'.format(accuracy_score(y_train, mnb_cv.predict(X_train))))
print('Accuracy Score for Test Set (MNB-CV): {}'.format(accuracy_score(y_test, mnb_cv.predict(X_test))))

Accuracy Score for Training Set (MNB-CV): 0.9230769230769231
Accuracy Score for Test Set (MNB-CV): 0.7289377289377289


- Let's try another vectorization in place of the count vectorizer: TF-IDF. TF-IDF stands for "term frequency-inverse document frequency" and is a method for emphasizing words that occur frequently in a given document while de-emphasizing words that occur in many documents.

### TF-IDF
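
A small standard-library sketch of the weighting, on a made-up corpus. It uses a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which to my understanding corresponds to the `TfidfVectorizer` defaults. The helper name `tfidf` and the documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return (vocabulary, idf weights, L2-normalized TF-IDF rows)."""
    n = len(tokenized_docs)
    vocab = sorted({w for d in tokenized_docs for w in d})
    # Document frequency: number of documents that contain each word
    df = Counter(w for d in tokenized_docs for w in set(d))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in tokenized_docs:
        tf = Counter(d)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty documents
        rows.append([v / norm for v in raw])
    return vocab, idf, rows


docs = [["enron", "meeting", "agenda"],
        ["enron", "resume", "candidate"],
        ["enron", "meeting", "london"]]
vocab, idf, rows = tfidf(docs)

# 'enron' occurs in every document, so it gets the lowest weight
first = dict(zip(vocab, rows[0]))
print({w: round(first[w], 3) for w in ["enron", "meeting", "agenda"]})
```

In the first document all three words occur once, yet "agenda" (one document) ends up with a higher weight than "meeting" (two documents), which in turn outweighs "enron" (all documents).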

In [29]:
# TF-IDF weighting instead of word counts
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(min_df=1, stop_words='english') # some stop words have been previously removed
Xtfidf = tfidfvectorizer.fit_transform(df_ml.text_str)
print('Number of features:', len(tfidfvectorizer.vocabulary_))

Number of features: 13479


In [30]:
# MNB with TF-IDF weighting instead of word counts + Cross Validation

mnb_tfidf = MultinomialNB()
alpha_space = [1, 1.5, 2, 5, 10, 50]
param_grid = {'alpha': alpha_space}


# Create train and test sets
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(Xtfidf, y, test_size=0.2, random_state=5)

# Instantiate the GridSearchCV object
mnb_cv_tfidf = GridSearchCV(mnb_tfidf, param_grid, scoring='accuracy', cv=5)

# Fit it to the training data
mnb_cv_tfidf.fit(X_train_tfidf,y_train_tfidf)

# Print the optimal parameters and best score
# Parameter setting that gave the best results on the hold out data
print("Tuned alpha parameter: {}".format(mnb_cv_tfidf.best_params_)) 
# Mean cross-validated score of the best_estimator
print("Mean cross-validated accuracy of tuned model for the hold out data: {}".format(mnb_cv_tfidf.best_score_)) 

Tuned alpha parameter: {'alpha': 1}
Mean cross-validated accuracy of tuned model for the hold out data: 0.7490842490842491


In [31]:
# Model accuracy using the best alpha which turned out to be same as the default alpha = 1 
print('Accuracy Score for Training Set (MNB-TFIDF-CV): {}'\
      .format(accuracy_score(y_train_tfidf, mnb_cv_tfidf.predict(X_train_tfidf))))
print('Accuracy Score for Test Set (MNB-TFIDF-CV): {}'\
      .format(accuracy_score(y_test_tfidf, mnb_cv_tfidf.predict(X_test_tfidf))))

Accuracy Score for Training Set (MNB-TFIDF-CV): 0.9276556776556777
Accuracy Score for Test Set (MNB-TFIDF-CV): 0.7472527472527473


### 3.2. n-gram Feature Multinomial Naive Bayes

N-grams are sequences of n adjacent words. In a simple n-gram language model, the probability of a word conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.) is described by a categorical distribution. An n-gram of size 1 is called a unigram, size 2 a bigram (or digram), and size 3 a trigram. Here the n-grams serve as additional bag of words features: `ngram_range=(1,2)` keeps unigrams and bigrams, and `ngram_range=(1,3)` adds trigrams.
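
For illustration, the hypothetical helper `word_ngrams` below lists the word n-grams of a token sequence for every size from `n_min` to `n_max`, which is the kind of feature set that `ngram_range=(1, 2)` asks `CountVectorizer` to build. The token list is made up.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["risk", "management", "group"]
features = word_ngrams(tokens, 1, 2)
print(features)
# ['risk', 'management', 'group', 'risk management', 'management group']
```

Every extra n-gram size adds many rare features, which is one plausible reason the training accuracy rises for the bigram and trigram models while the test accuracy does not.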

In [32]:
# bi-gram feature Multinomial Naive Bayes + Cross Validation
vectorizer = CountVectorizer(ngram_range=(1,2))
X_ngram = vectorizer.fit_transform(df_ml.text_str)

x_ngram_train, x_ngram_test, y_ngram_train, y_ngram_test = train_test_split(X_ngram, y, test_size=0.2, random_state=5)

nb_ngram = MultinomialNB()
alpha_space = [1, 2, 5, 10, 50]
param_grid = {'alpha': alpha_space}
nb_ngram_CV = GridSearchCV(nb_ngram, param_grid, scoring='accuracy', cv=5)

nb_ngram_CV.fit(x_ngram_train,y_ngram_train)

print("Tuned alpha parameter: {}".format(nb_ngram_CV.best_params_)) 
print("Mean cross-validated accuracy of tuned model for the hold out data: {}".format(nb_ngram_CV.best_score_)) 
print('')
print('Accuracy Score for Training Set (bi-gram-NB-CV): {}'\
      .format(accuracy_score(y_ngram_train, nb_ngram_CV.predict(x_ngram_train))))
print('Accuracy Score for Test Set (bi-gram-NB-CV): {}'\
      .format(accuracy_score(y_ngram_test, nb_ngram_CV.predict(x_ngram_test))))

Tuned alpha parameter: {'alpha': 1}
Mean cross-validated accuracy of tuned model for the hold out data: 0.73992673992674

Accuracy Score for Training Set (bi-gram-NB-CV): 0.9743589743589743
Accuracy Score for Test Set (bi-gram-NB-CV): 0.717948717948718


In [33]:
# tri-gram feature Multinomial Naive Bayes + Cross Validation
vectorizer = CountVectorizer(ngram_range=(1,3))

X_ngram = vectorizer.fit_transform(df_ml.text_str)

x_ngram_train, x_ngram_test, y_ngram_train, y_ngram_test = train_test_split(X_ngram, y, test_size=0.2, random_state=5)

nb_ngram = MultinomialNB()
alpha_space = [1, 2, 5, 10, 50]
param_grid = {'alpha': alpha_space}
nb_ngram_CV = GridSearchCV(nb_ngram, param_grid, scoring='accuracy', cv=5)

nb_ngram_CV.fit(x_ngram_train,y_ngram_train)

print("Tuned alpha parameter: {}".format(nb_ngram_CV.best_params_)) 
print("Mean cross-validated accuracy of tuned model for the hold out data: {}".format(nb_ngram_CV.best_score_)) 
print('')
print('Accuracy Score for Training Set (tri-gram-NB-CV): {}'\
      .format(accuracy_score(y_ngram_train, nb_ngram_CV.predict(x_ngram_train))))
print('Accuracy Score for Test Set (tri-gram-NB-CV): {}'\
      .format(accuracy_score(y_ngram_test, nb_ngram_CV.predict(x_ngram_test))))

Tuned alpha parameter: {'alpha': 1}
Mean cross-validated accuracy of tuned model for the hold out data: 0.739010989010989

Accuracy Score for Training Set (tri-gram-NB-CV): 0.9862637362637363
Accuracy Score for Test Set (tri-gram-NB-CV): 0.717948717948718


### 3.3. Logistic Regression

In [34]:
# Logistic Regression + Cross Validation
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression() # regularization: ridge

# large values of C give more freedom to the model. Conversely, smaller values of C constrain the model more.
parameters = {'C':[ 0.1, 1, 5, 10, 20, 30, 40, 50]} 

X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(Xtfidf, y, test_size=0.2, random_state=5)

# GridSearchCV
LR_cv_tfidf = GridSearchCV(clf, param_grid = parameters, scoring='accuracy', cv=5)

# Fit it to the training data
LR_cv_tfidf.fit(X_train_tfidf, y_train_tfidf)

print("Tuned C - Model Regularization Parameter: {}".format(LR_cv_tfidf.best_params_))
print("Mean cross-validated accuracy of tuned model for the hold out data: {}".format(LR_cv_tfidf.best_score_)) 
print('')
print('Accuracy Score for Training Set (Logistic Regression-CV): {}'\
      .format(accuracy_score(y_train_tfidf, LR_cv_tfidf.predict(X_train_tfidf))))
print('Accuracy Score for Test Set (Logistic Regression-CV): {}'\
      .format(accuracy_score(y_test_tfidf, LR_cv_tfidf.predict(X_test_tfidf))))

Tuned C - Model Regularization Parameter: {'C': 5}
Mean cross-validated accuracy of tuned model for the hold out data: 0.771978021978022

Accuracy Score for Training Set (Logistic Regression-CV): 0.9871794871794872
Accuracy Score for Test Set (Logistic Regression-CV): 0.7545787545787546


### Comparison between the different models used to classify the emails:

The bag of words model was used for feature extraction from text. For text classification I used Multinomial Naive Bayes with two types of vectorization (counts and TF-IDF), n-gram feature Naive Bayes, and Logistic Regression with TF-IDF features. Cross-validation was used to tune the smoothing or regularization parameter of every model. Training-set accuracy for these models ranged from 92% to 99%, while test-set accuracy ranged from 72% to 75%. Logistic Regression had the highest test-set accuracy, at 75%.

The TF-IDF representation has 13,479 unigram features (13,676 with plain counts) for 1,365 data points labeled into 7 categories. The number of features is much higher than the number of data points. Naive Bayes classifiers and Logistic Regression had comparable performance.

## Summary

An email classifier for labeling emails of a particular employee has been built. This classifier or extended versions of it could be used to label his emails automatically. Building the classifier for the whole company would require predefined labels and data from at least a few employees but the process for building the classifier is the same.

Any of the following could help to improve the performance of the classifier:
- Getting access to more data points and better labeled data, that is, labels assigned by the user more thoughtfully
- Expanding the stop words list
- Stemming or lemmatization
- Working with data in one language, not two.