## Feature extraction with spaCy for binary text classification
<b>Objective:</b> Predict whether an SMS message is 'ham' or 'spam' from its text.
* This is a demo of a standard classification model.
* Features are extracted with the spaCy package.

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [2]:
os.chdir('D://datasets')
sms=pd.read_csv('sms_spam.csv')
print(sms.columns)
print(sms.shape)

Index(['type', 'text'], dtype='object')
(5559, 2)


In [3]:
round((sms.type.value_counts()/len(sms))*100,3)

ham     86.562
spam    13.438
Name: type, dtype: float64

In [4]:
sms.type.value_counts()

ham     4812
spam     747
Name: type, dtype: int64
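
Because the classes are imbalanced, accuracy needs a reference point. As a rough sketch, the snippet below computes the accuracy of a hypothetical baseline that labels every message as ham, using only the counts printed above. The variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above
counts = {"ham": 4812, "spam": 747}
total = sum(counts.values())

# A trivial classifier that always predicts the majority class is right
# exactly as often as that class occurs in the data
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy * 100, 3))
```

Any model trained on this data should be judged against that baseline, and per-class scores such as recall on spam are more informative than accuracy alone.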

## Feature extraction using the spaCy package
* Load the spaCy English language model
* Create a function that extracts a GloVe-based vector for each text document
* GloVe: Global Vectors for Word Representation. spaCy's document vector (Doc.vector) defaults to the average of the token vectors
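
To make the idea of a document vector concrete, here is a minimal sketch of averaging word vectors. The three-dimensional toy vectors and the doc_vector helper are made up for illustration; real models use 300 dimensions, and spaCy's tokenizer is more sophisticated than str.split.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (real models use 300 dimensions)
toy_vectors = {
    "free":  np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
    "now":   np.array([2.0, 4.0, 1.0]),
}

def doc_vector(text, vectors, dim=3):
    """Average the vectors of the whitespace-separated tokens in text.

    Tokens without a known vector contribute zeros, which is a
    simplifying assumption of this sketch.
    """
    tokens = text.lower().split()
    if not tokens:
        return np.zeros(dim)
    rows = [vectors.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(rows, axis=0)

print(doc_vector("Free prize now", toy_vectors))
```

Averaging gives every message a fixed-length representation regardless of how many words it contains, which is what a classifier such as a random forest needs as input.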

In [5]:
%%time
import spacy
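# 'en' is a shortcut link from older spaCy releases; spaCy 3+ requires a full
# package name, and the model must include word vectors for t.vector to be useful
nlp = spacy.load('en')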

Wall time: 4.42 s


In [6]:
def text_train_dl(msg):
    # Parse the message with spaCy and return its document vector
    t=nlp(msg)
    return t.vector

In [7]:
%%time
sms['glovec_sms']=sms['text'].apply(text_train_dl)

Wall time: 7.49 s


In [8]:
X=sms['glovec_sms']
y=sms['type']

#### Transform the GloVe vectors into a NumPy array

In [9]:
print(X.shape)
X=np.array([x for x in X])
print(X.shape)

(5559,)
(5559, 300)
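
The shape change above happens because a pandas column of arrays is stored as a one-dimensional container of objects. The sketch below reproduces the effect with NumPy alone, using small made-up vectors, and shows np.stack as one possible alternative to the list comprehension.

```python
import numpy as np

# A 1-D object array whose elements are themselves arrays, similar to
# what sms['glovec_sms'].values holds (toy size: 3 messages, 4 dimensions)
column = np.empty(3, dtype=object)
for i in range(3):
    column[i] = np.full(4, float(i))

print(column.shape)   # one entry per message, no feature axis yet

# Stacking the per-message vectors yields a 2-D feature matrix
features = np.stack(column)
print(features.shape)
print(features.dtype)
```

scikit-learn estimators expect a two-dimensional numeric array of shape (n_samples, n_features), which is why this conversion is needed before training.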


In [10]:
le=LabelEncoder()
y=le.fit_transform(y)
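
LabelEncoder numbers the classes in sorted order, so here 'ham' becomes 0 and 'spam' becomes 1. The following sketch mimics that mapping with np.unique on a few made-up labels; it illustrates the encoding and is not the notebook's code.

```python
import numpy as np

# Made-up labels standing in for sms['type']
labels = np.array(["ham", "spam", "ham", "ham", "spam"])

# np.unique returns the sorted distinct labels plus, for each element,
# the index of its label in that sorted list
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)
print(encoded)

# Indexing the class list with the integer codes recovers the labels
decoded = classes[encoded]
```

With this ordering, scikit-learn's binary metrics, which default to pos_label=1, treat spam as the positive class.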

In [11]:
x_train,x_test,y_train,y_test=train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

In [12]:
model = RandomForestClassifier(n_estimators=100,n_jobs=-1)
model.fit(x_train, y_train)
predictions=model.predict(x_test)
print(confusion_matrix(y_test, predictions))



[[1583    5]
 [  57  190]]


In [13]:
labels=['ham', 'spam']
f1score=list(f1_score(y_test,predictions,average=None))
fscore=pd.Series(f1score,index=labels)
print("Fscore  of Individual class")
print(fscore)
print("Accuracy : ",accuracy_score(y_test,predictions))
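# Recall and precision are reported for the positive class (label 1 = spam)
print("Recall : ",recall_score(y_test,predictions))
print("Precision : ",precision_score(y_test,predictions))
# Confusion matrix with row and column totals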
pred_labels = ['Predicted '+ l for l in labels]
cm=confusion_matrix(y_test, predictions)
cm = pd.DataFrame(cm, index=labels, columns=pred_labels)
cm['Actual Total'] = cm.sum(axis=1)
cm.loc['Predicted Total']= cm.sum()
# cm.to_csv('Confusion_Matrix_RF_100.csv')
cm

Fscore  of Individual class
ham     0.980793
spam    0.859729
dtype: float64
Accuracy :  0.96621253406
Recall :  0.769230769231
Precision :  0.974358974359


| | Predicted ham | Predicted spam | Actual Total |
|---|---|---|---|
| ham | 1583 | 5 | 1588 |
| spam | 57 | 190 | 247 |
| Predicted Total | 1640 | 195 | 1835 |
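
The reported scores can be cross-checked by hand from the confusion matrix. The sketch below redoes the arithmetic in plain Python, treating spam as the positive class; the variable names are chosen for this example only.

```python
# Confusion matrix entries from the table above (rows = actual class)
tn, fp = 1583, 5     # actual ham:  predicted ham, predicted spam
fn, tp = 57, 190     # actual spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # share of predicted spam that is spam
recall = tp / (tp + fn)               # share of actual spam that was caught
f1_spam = 2 * precision * recall / (precision + recall)
f1_ham = 2 * tn / (2 * tn + fn + fp)  # F1 with ham as the positive class

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1_spam", f1_spam),
                    ("f1_ham", f1_ham)]:
    print(f"{name}: {value:.6f}")
```

The gap between precision and recall shows that the model rarely flags ham as spam (5 cases) but misses 57 of the 247 spam messages in the test set.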
