### Classification of emails - supervised model

The topic-modelling step produced a dataset that records the dominant topic of each email. Treating that dataset as labelled data, the goal here is to build a supervised classification model that predicts the topic of new emails.

I use the `email` text column, converted to a matrix with a TF-IDF vectorizer, as the feature and `Dominant_Topic` as the label.
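
As a rough illustration of this setup, the sketch below chains a TF-IDF vectorizer and a classifier in one scikit-learn pipeline, so that a new email can be passed in as raw text. The toy emails, their topic labels and the variable names are made up for the example and are not taken from the dataset; the notebook itself fits the vectorizer and the classifier as separate steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-ins for the `email` column and the `Dominant_Topic` labels.
toy_emails = [
    "power market price california energy",
    "energy price power state market",
    "please review attached agreement draft",
    "attached draft agreement please sign",
]
toy_topics = [1.0, 1.0, 3.0, 3.0]

# The pipeline turns raw text into TF-IDF features, then classifies them.
model = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="rbf", class_weight="balanced"),
)
model.fit(toy_emails, toy_topics)

# A new, unseen email is given as plain text.
predicted = model.predict(["california power price update"])
print(predicted)
```

Keeping both steps in one object means the vocabulary learned during fitting is reused automatically at prediction time.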

In [19]:
import pandas as pd
import numpy as np
import re
import nltk

In [20]:
data = pd.read_csv('email_df_final.csv')
data.head()

| Unnamed: 0 | from | to | email | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|---|---|---|
| 0 | tana.jones@enron.com | alicia.goodrow@enron.com | nice dinner probably knowanyone else anytime w... | 0 | 2.0 | 0.9333 | original_message, know, thanks, get, please, m... | ['nice', 'dinner', 'probably', 'knowanyone', '... |
| 1 | Sheila Tweed@ECT on 05/15/2001 06 | Kay Mann/Corp/Enron@ENRON | absolutely good point peter start draft overri... | 1 | 3.0 | 0.7429 | please, thanks, enron, need, know, deal, attac... | ['absolutely', 'good', 'point', 'peter', 'star... |
| 2 | jeff.dasovich@enron.com | christine.piesco@oracle.com | apology schedule melted talked monday swhere f... | 2 | 2.0 | 0.6207 | original_message, know, thanks, get, please, m... | ['apology', 'schedule', 'melted', 'talked', 'm... |
| 3 | tanya.tamarchenko@enron.com | Richard Lewis/LON/ECT@ECT, James New/LON/ECT@E... | vince uk var breached limit last week uk trade... | 3 | 1.0 | 0.6694 | power, market, energy, state, price, californi... | ['vince', 'uk', 'var', 'breached', 'limit', 'l... |
| 4 | kay.mann@enron.com | Don Hammond/PDX/ECT@ECT, Jody Blackburn/PDX/EC... | problem comment dale_rasmussen ectmann corp en... | 4 | 3.0 | 0.6876 | please, thanks, enron, need, know, deal, attac... | ['problem', 'comment', 'dale_rasmussen', 'ectm... |


In [21]:
# drop rows that contain missing values
data = data.dropna()

In [22]:
data.Dominant_Topic.unique()

array([ 2.,  3.,  1.,  8., 11., 14.,  5., 13.,  0., 10.,  4.,  9.,  7.,
        6., 12.])

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9770 entries, 0 to 10347
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   from                9770 non-null   object 
 1   to                  9770 non-null   object 
 2   email               9770 non-null   object 
 3   Document_No         9770 non-null   int64  
 4   Dominant_Topic      9770 non-null   float64
 5   Topic_Perc_Contrib  9770 non-null   float64
 6   Keywords            9770 non-null   object 
 7   Text                9770 non-null   object 
dtypes: float64(2), int64(1), object(5)
memory usage: 687.0+ KB


In [24]:
# feature: cleaned email text; label: dominant topic from topic modelling
X = data.email
y = data.Dominant_Topic

In [25]:
X

0        nice dinner probably knowanyone else anytime w...
1        absolutely good point peter start draft overri...
2        apology schedule melted talked monday swhere f...
3        vince uk var breached limit last week uk trade...
4        problem comment dale_rasmussen ectmann corp en...
                               ...                        
10343    attached redline change discussed please revie...
10344    review forwarded corp enron anila hoxhaattache...
10345    hi jerry final execution version letter agreem...
10346    richard_shapiro enron com david_parquet enron ...
10347    better original_message original_message fyi d...
Name: email, Length: 9770, dtype: object

In [26]:
y.unique()

array([ 2.,  3.,  1.,  8., 11., 14.,  5., 13.,  0., 10.,  4.,  9.,  7.,
        6., 12.])

In [27]:
# train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)
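
Several topics are rare in this dataset, so a purely random split can leave very few examples of a topic in the test set. One possible variation, not used in this notebook, is to pass `stratify=y` so that each split keeps roughly the same topic proportions. The sketch below uses made-up labels and variable names; applying the same option to the real data would change the scores reported later in this section.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 40 emails of topic 0 and 10 of topic 1.
toy_X = [f"email {i}" for i in range(50)]
toy_y = [0] * 40 + [1] * 10

# stratify keeps the 80/20 topic ratio in both the train and the test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

test_counts = Counter(y_te)
print(test_counts)
```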

In [28]:
# Tf-idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(stop_words='english',
                        sublinear_tf=True,
                        smooth_idf=True,
                        analyzer='word',
                        min_df=10,
                        max_df=0.95,
                        max_features=30000)
X_train = tfv.fit_transform(X_train)
X_test = tfv.transform(X_test)
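
The vectorizer is fitted on the training text only and then reused to transform the test text, so both matrices share the same columns and no vocabulary leaks in from the test set. A small sketch with made-up documents (and default `TfidfVectorizer` settings rather than the ones above) shows the effect: words never seen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, not taken from the email dataset.
train_docs = ["power market price", "please review draft"]
test_docs = ["power price forecast", "lunch tomorrow"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and idf weights
test_matrix = vec.transform(test_docs)        # reuses them without refitting

# Both matrices have one column per word in the training vocabulary.
print(train_matrix.shape, test_matrix.shape)

# The second test document contains only unseen words, so its row is all zeros.
unseen_row = test_matrix.toarray()[1]
print(unseen_row)
```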

### Random forest classifier

In [29]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=250,class_weight='balanced')
rfc.fit(X_train, y_train)

RandomForestClassifier(class_weight='balanced', n_estimators=250)

In [30]:
# prediction
y_pred_rfc = rfc.predict(X_test)

In [31]:
# confusion matrix
from sklearn.metrics import confusion_matrix, f1_score
cm = confusion_matrix(y_test, y_pred_rfc)
print(cm)


[[  3   0   0   1   0   0   0   0   0   0   0   0   0   0   0]
 [  8 114  17  94   0   0   1   0   1   0   1   3   0   0   1]
 [ 13   2 264  77   0   0   0   0   1   0   0   5   0   0   0]
 [ 11   8  36 461   0   1   0   0   1   0   0   8   1  12   0]
 [  1   0   0   0   3   0   0   0   0   0   0   0   0   0   0]
 [  6   9   8  42   0  25   0   0   0   0   0   0   0   0   0]
 [  1   1   1   0   0   0   2   0   0   0   0   5   0   0   0]
 [  1   1   0   8   0   0   0   8   0   1   0   0   0   0   0]
 [  5   7  19  35   2   0   0   0  83   0   1   6   0   0   3]
 [  1   0   1   4   0   0   0   0   1   7   0   0   0   0   0]
 [  2   6   1  12   0   0   0   0   0   0  30   2   0   0   0]
 [  0   7   7   9   0   0   0   0   0   0   1 255   0   3   2]
 [  0   0   1   1   0   0   0   0   0   0   0   1   5   0   0]
 [  1   2   0  10   0   0   0   0   0   0   0   3   1  55   0]
 [  7   3  45   9   0   0   0   0   2   0   0   1   0   1  26]]


In [32]:
# f1_score
f1_score(y_test, y_pred_rfc,average='weighted')

0.6809468136541
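
For reference, the weighted F1 score averages the per-class F1 scores using each class's number of true samples (its row total in the confusion matrix) as the weight, so frequent topics dominate the result. The sketch below recomputes it from a confusion matrix with NumPy; the helper name `f1_from_confusion` and the 2x2 matrix are made up for illustration. The macro average, which weights every class equally, is returned alongside for comparison.

```python
import numpy as np

def f1_from_confusion(cm):
    """Return (per-class F1, macro F1, weighted F1) for a confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly classified samples per class
    support = cm.sum(axis=1)         # true samples per class (row totals)
    predicted = cm.sum(axis=0)       # predicted samples per class (column totals)
    denom = support + predicted      # equals 2*TP + FP + FN
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    macro = f1.mean()
    weighted = np.average(f1, weights=support)
    return f1, macro, weighted

# Hypothetical confusion matrix: 4 true samples of class 0, 1 of class 1.
toy_cm = [[3, 1],
          [0, 1]]
per_class, macro_f1, weighted_f1 = f1_from_confusion(toy_cm)
print(per_class, macro_f1, weighted_f1)
```

When every class appears in `y_test`, this should agree with `f1_score(..., average='weighted')`. Comparing it with the macro average gives a hint of how much the rare topics are dragging performance down.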

### Support Vector classifier

In [33]:
from sklearn.svm import SVC
svc = SVC(kernel='rbf',random_state=0,class_weight='balanced')
svc.fit(X_train,y_train)

SVC(class_weight='balanced', random_state=0)

In [34]:
# prediction
y_pred_svc = svc.predict(X_test)

In [35]:
# confusion matrix (confusion_matrix and f1_score are already imported above)
cm = confusion_matrix(y_test, y_pred_svc)
print(cm)

[[  3   0   0   1   0   0   0   0   0   0   0   0   0   0   0]
 [  0 168  15  44   0   0   0   0   3   0   2   4   0   1   3]
 [  0   2 303  37   0   1   0   0   1   0   0   4   0   0  14]
 [  0  17  35 455   0   1   0   1   3   0   0   9   2  16   0]
 [  0   0   0   1   3   0   0   0   0   0   0   0   0   0   0]
 [  1   9  10  26   0  39   0   0   1   0   2   0   0   1   1]
 [  0   0   0   0   0   0   4   0   1   0   0   2   0   0   3]
 [  0   0   0   5   0   0   0  12   1   0   0   0   0   0   1]
 [  0   2  12  12   0   0   0   0 125   0   2   4   0   0   4]
 [  0   0   2   2   0   0   0   0   4   6   0   0   0   0   0]
 [  0   1   5   5   0   0   0   0   4   0  38   0   0   0   0]
 [  0   7   3   8   0   0   0   0   0   0   1 260   0   3   2]
 [  0   0   2   0   0   0   0   0   0   0   0   0   6   0   0]
 [  0   0   2   6   0   0   0   0   0   0   0   3   1  60   0]
 [  0   1  16   3   0   0   0   0   2   0   0   1   0   1  70]]


In [36]:
# f1_score
f1_score(y_test, y_pred_svc,average='weighted')

0.7907425544422999

On this train/test split, the support vector classifier gives clearly better results than the random forest classifier: a weighted F1 score of about 0.79 versus about 0.68.