### Enron email classification - Supervised learning

Starting from about 10k random records of the Enron email data, I assigned a topic number to each email using topic modelling (unsupervised learning), which produced a labelled email dataset.

I will now classify each email into its topic using supervised classification models.
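
For context, the labelling step could look roughly like the sketch below. The topic-modelling method used upstream is not stated here, so LDA, the toy emails, and the choice of 3 topics are all assumptions made for illustration only.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the email bodies (hypothetical data, not the Enron set).
toy_emails = pd.Series([
    "please review the gas trading contract before the meeting",
    "the contract draft needs legal review and signature",
    "dinner on friday at my place, hope you can come",
    "are you free for dinner or lunch this weekend",
    "gas prices and trading volumes rose last week",
    "legal wants the signed contract back by friday",
])

# Bag-of-words counts are the usual input for LDA.
counts = CountVectorizer(stop_words="english").fit_transform(toy_emails)

# n_components is the number of topics; 3 is an arbitrary choice for the toy data.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (n_documents, n_topics)

# Label each email with its most probable topic.
toy_labels = pd.DataFrame({"email": toy_emails, "Topic": doc_topic.argmax(axis=1)})
print(toy_labels)
```

The resulting `Topic` column plays the same role as the `Topic` column loaded from `email_df.csv` below.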

In [55]:
# Imports
import numpy as np
import pandas as pd

In [2]:
# getting labelled csv data after topic modelling 
df = pd.read_csv('email_df.csv')

In [3]:
df.head()

Unnamed: 0,from,to,email,Topic
0,tana.jones@enron.com,alicia.goodrow@enron.com,"It would be nice if you could be at my dinner,...",4
1,Sheila Tweed@ECT on 05/15/2001 06,Kay Mann/Corp/Enron@ENRON,Absolutely.Good point! Can Peter start to dra...,7
2,jeff.dasovich@enron.com,christine.piesco@oracle.com,My apologies. My schedule melted down after w...,4
3,tanya.tamarchenko@enron.com,"Richard Lewis/LON/ECT@ECT, James New/LON/ECT@E...","Vince,UK VAR breached the limit last week.UK t...",4
4,kay.mann@enron.com,"Don Hammond/PDX/ECT@ECT, Jody Blackburn/PDX/EC...",Any problems/comments?AM ---------------------...,7


In [4]:
len(df)

10348

In [5]:
# keep only the columns needed for classification: email (text) and Topic (label)
email_df = df[['email','Topic']].copy()

In [6]:
email_df.head()

Unnamed: 0,email,Topic
0,"It would be nice if you could be at my dinner,...",4
1,Absolutely.Good point! Can Peter start to dra...,7
2,My apologies. My schedule melted down after w...,4
3,"Vince,UK VAR breached the limit last week.UK t...",4
4,Any problems/comments?AM ---------------------...,7


In [7]:
# check for null values
email_df.isnull().sum()

email    91
Topic     0
dtype: int64

In [8]:
# drop null values
email_df.dropna(inplace=True)

In [9]:
# check for blank spaces in email column
blanks = []

for i,email,top in email_df.itertuples():
    if email.isspace():
        blanks.append(i)
print(blanks)

[]


In [10]:
# no blank emails were found, so nothing needs to be dropped
# email_df.drop(blanks, inplace=True)
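
The blank check can also be written without an explicit loop. The sketch below uses a small hypothetical frame (`toy_df`) rather than the real data. Unlike `str.isspace()`, which returns `False` for an empty string, stripping and comparing to `""` catches both empty and whitespace-only emails.

```python
import pandas as pd

# Hypothetical frame with the same two columns as email_df.
toy_df = pd.DataFrame({
    "email": ["Hello team", "   ", "", "Meeting at 3pm", "\n\t"],
    "Topic": [4, 7, 4, 1, 7],
})

# True where the email is empty or contains only whitespace.
is_blank = toy_df["email"].str.strip().eq("")
blank_index = toy_df.index[is_blank].tolist()
print(blank_index)  # [1, 2, 4]

# Keep only the non-blank rows.
clean_df = toy_df[~is_blank].copy()
print(clean_df.shape)  # (2, 2)
```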

In [11]:
email_df.shape

(10257, 2)

In [12]:
# train test split
from sklearn.model_selection import train_test_split
X = email_df['email']
y = email_df['Topic']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
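
Because the topics are imbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in the train and test sets. This is a sketch on hypothetical toy labels, not the split used in this notebook; `stratify` is a standard `train_test_split` parameter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 emails of class 0 and 20 of class 1.
toy_X = pd.Series([f"email {i}" for i in range(100)])
toy_y = pd.Series([0] * 80 + [1] * 20)

# stratify=toy_y preserves the 80/20 class ratio in both splits.
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=101, stratify=toy_y
)

print(ytr.value_counts(normalize=True))
print(yte.value_counts(normalize=True))
```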

In [17]:
# necessary imports
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.metrics import confusion_matrix,classification_report

In [56]:
# using linear SVC model with Count vectorizer
from sklearn.svm import LinearSVC
text_clf = Pipeline([('cv',CountVectorizer(max_df=.95,min_df=2,stop_words='english')),
                     ('clf',LinearSVC(random_state=10,class_weight='balanced'))])

In [57]:
# training the model
text_clf.fit(X_train,y_train)

Pipeline(steps=[('cv',
                 CountVectorizer(max_df=0.95, min_df=2, stop_words='english')),
                ('clf', LinearSVC(class_weight='balanced', random_state=10))])

In [58]:
# predictions
y_pred = text_clf.predict(X_test)

In [59]:
# confusion matrix
print(confusion_matrix(y_test, y_pred))

[[ 63   6   0   2  20   0   9   2   2   1   1   0]
 [  9  19   2   4   7   0   0   5   1   1   0   0]
 [ 10   1 111   1  13   3   9   8   1   2   1   3]
 [  5   1   5  60  15   6   2   3   4   3   6   1]
 [ 35  10   7  10 846   9   8  43  13   6  14   4]
 [  7   0   5   3  22 109   5  12   0   4   2   0]
 [ 10   3   5   0  14   2 388   8   1   5   5   1]
 [ 14   5   6   4  54   1  14 415   3   5   4   0]
 [ 13   2   2   1  19   4   1   4  78   0   2   0]
 [  2   3   3   0   4   1   4   6   0 116   3   0]
 [  7   3   1   4  20   4   1  15   8   0 116   2]
 [  5   0   0   5   4   2   0   2   4   0   6  32]]


In [49]:
# classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.36      0.61      0.45       106
           1       0.37      0.40      0.38        48
           2       0.74      0.71      0.72       163
           3       0.61      0.53      0.57       111
           4       0.82      0.83      0.83      1005
           5       0.78      0.65      0.71       169
           6       0.89      0.88      0.88       442
           7       0.79      0.80      0.79       525
           8       0.68      0.61      0.64       126
           9       0.80      0.82      0.81       142
          10       0.73      0.65      0.69       181
          11       0.70      0.53      0.60        60

    accuracy                           0.77      3078
   macro avg       0.69      0.67      0.67      3078
weighted avg       0.77      0.77      0.77      3078



In [65]:
# using linear SVC model with Tf-Idf vectorizer
from sklearn.svm import LinearSVC
text_clf = Pipeline([('tfidf',TfidfVectorizer(max_df=.95,min_df=2,stop_words='english')),
                     ('clf',LinearSVC(random_state=10,class_weight='balanced'))])

In [66]:
# model training
text_clf.fit(X_train,y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')),
                ('clf', LinearSVC(class_weight='balanced', random_state=10))])

In [67]:
# predictions
y_pred = text_clf.predict(X_test)

In [68]:
# confusion matrix
print(confusion_matrix(y_test, y_pred))

[[ 67   2   0   6  18   0   6   4   1   1   1   0]
 [  9  21   2   2   9   1   0   3   0   1   0   0]
 [  5   0 129   1  10   3   7   5   1   1   0   1]
 [  2   1   1  73  12   1   0   6   8   3   3   1]
 [ 19   5   7  10 875  10  10  41  11   5   9   3]
 [  7   0   3   1  12 129   3   9   1   2   2   0]
 [  3   2   3   1  12   4 406   6   0   4   1   0]
 [  9   0   6   3  45   2  13 434   0   8   5   0]
 [  9   0   0   5  12   3   1   3  90   0   3   0]
 [  1   2   1   1   4   0   0   6   0 126   0   1]
 [  4   3   0   0  10   0   2  15   5   0 141   1]
 [  3   0   1   4   5   0   0   4   2   0   6  35]]
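
The numbers in a classification report can be derived directly from a confusion matrix. As a worked check, the sketch below copies the Tf-idf confusion matrix printed above (rows are true topics, columns are predicted topics) and recomputes accuracy, per-topic recall, and per-topic precision.

```python
import numpy as np

# Confusion matrix copied from the Tf-idf run above.
cm = np.array([
    [ 67,   2,   0,   6,  18,   0,   6,   4,   1,   1,   1,   0],
    [  9,  21,   2,   2,   9,   1,   0,   3,   0,   1,   0,   0],
    [  5,   0, 129,   1,  10,   3,   7,   5,   1,   1,   0,   1],
    [  2,   1,   1,  73,  12,   1,   0,   6,   8,   3,   3,   1],
    [ 19,   5,   7,  10, 875,  10,  10,  41,  11,   5,   9,   3],
    [  7,   0,   3,   1,  12, 129,   3,   9,   1,   2,   2,   0],
    [  3,   2,   3,   1,  12,   4, 406,   6,   0,   4,   1,   0],
    [  9,   0,   6,   3,  45,   2,  13, 434,   0,   8,   5,   0],
    [  9,   0,   0,   5,  12,   3,   1,   3,  90,   0,   3,   0],
    [  1,   2,   1,   1,   4,   0,   0,   6,   0, 126,   0,   1],
    [  4,   3,   0,   0,  10,   0,   2,  15,   5,   0, 141,   1],
    [  3,   0,   1,   4,   5,   0,   0,   4,   2,   0,   6,  35],
])

# Accuracy: correctly classified emails (the diagonal) over all test emails.
accuracy = np.trace(cm) / cm.sum()
# Recall: diagonal over row sums (all emails that truly belong to the topic).
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: diagonal over column sums (all emails predicted as the topic).
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.82
print("recall per topic:   ", recall.round(2))
print("precision per topic:", precision.round(2))
```

Topic 4 absorbs most of the misclassifications (its column has the largest off-diagonal counts), which is consistent with it being by far the largest class.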


In [69]:
# classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.49      0.63      0.55       106
           1       0.58      0.44      0.50        48
           2       0.84      0.79      0.82       163
           3       0.68      0.66      0.67       111
           4       0.85      0.87      0.86      1005
           5       0.84      0.76      0.80       169
           6       0.91      0.92      0.91       442
           7       0.81      0.83      0.82       525
           8       0.76      0.71      0.73       126
           9       0.83      0.89      0.86       142
          10       0.82      0.78      0.80       181
          11       0.83      0.58      0.69        60

    accuracy                           0.82      3078
   macro avg       0.77      0.74      0.75      3078
weighted avg       0.82      0.82      0.82      3078
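
The gap between the macro and weighted averages comes from class imbalance: the macro average treats every topic equally, while the weighted average weights each topic by its support. The sketch below recomputes both F1 averages from the per-topic values in the Tf-idf report above. Since the inputs are already rounded to two decimals, the results are approximate.

```python
import numpy as np

# Per-topic F1 and support copied from the Tf-idf classification report above.
f1 = np.array([0.55, 0.50, 0.82, 0.67, 0.86, 0.80, 0.91, 0.82, 0.73, 0.86, 0.80, 0.69])
support = np.array([106, 48, 163, 111, 1005, 169, 442, 525, 126, 142, 181, 60])

# Macro average: plain mean over topics, so small topics count as much as large ones.
macro_f1 = f1.mean()
# Weighted average: mean weighted by the number of test emails per topic.
weighted_f1 = np.average(f1, weights=support)

print(f"macro F1:    {macro_f1:.2f}")     # macro F1:    0.75
print(f"weighted F1: {weighted_f1:.2f}")  # weighted F1: 0.82
```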



Of the classification models I tried (the others are not shown here), Linear SVC performed best. Comparing the two classification reports, Linear SVC with the Tf-idf vectorizer performs better than Linear SVC with the count vectorizer: accuracy is 0.82 versus 0.77, and macro-average F1 is 0.75 versus 0.67. Linear SVC with Tf-idf is therefore the better choice for this classification task.
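
One way to make this kind of comparison less dependent on a single train/test split is cross-validation. The sketch below is an illustration on a hypothetical toy corpus, not a rerun of the experiment above: the texts, labels, and `cv=3` are assumptions, and in practice `X` and `y` from the notebook would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical two-topic corpus: business emails (0) and social emails (1).
toy_texts = [
    "gas trading contract review", "power trading desk prices",
    "gas prices rose today", "trading volumes and gas prices",
    "contract for power trading", "review the gas contract",
    "dinner on friday night", "lunch this weekend",
    "family dinner plans", "weekend trip with friends",
    "friday lunch with family", "dinner and a movie",
]
toy_labels = [0] * 6 + [1] * 6

candidates = {"count": CountVectorizer(), "tfidf": TfidfVectorizer()}
scores = {}
for name, vectorizer in candidates.items():
    pipe = Pipeline([
        ("vec", vectorizer),
        ("clf", LinearSVC(random_state=10, class_weight="balanced")),
    ])
    # Mean macro F1 over 3 stratified folds.
    scores[name] = cross_val_score(
        pipe, toy_texts, toy_labels, cv=3, scoring="f1_macro"
    ).mean()

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.2f}")
```

Macro F1 is used as the scoring metric here because it is not dominated by the largest topic.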