## Email Spam Detection

This is a binary classification problem: each email is labeled spam (1) or not spam (0).

Approach with scikit-learn: tokenize the text, vectorize it into token counts, then train a statistical classifier on those counts.
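To make the tokenization and vectorization steps concrete, here is a minimal sketch on a three-document toy corpus. The corpus and the variable names are made up for illustration and are not part of the email dataset. It assumes `CountVectorizer` with its default settings, which lowercases the text and keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from cleaned_emails.csv
docs = ["free money now", "meeting at noon", "free free offer"]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each column corresponds to one token in the learned vocabulary, and each cell holds how often that token occurs in the document. The word "free" appears twice in the third document, so its cell there is 2.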

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

In [8]:
# read the cleaned data from csv file
source_csv = '../data/cleaned_emails.csv'

In [9]:
df = pd.read_csv(source_csv)

In [10]:
df.head()

Unnamed: 0,clean_text,spam
0,subject naturally irresistible your corporate ...,1
1,subject the stock trading gunslinger fanny is ...,1
2,subject unbelievable new homes made easy im wa...,1
3,subject color printing special request additio...,1
4,subject do not have money get software cds fro...,1


In [13]:
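# convert the text data into a sparse matrix of token counts using CountVectorizer
# keep the fitted vectorizer in its own variable so it can transform new text later
vectorizer = CountVectorizer()
text_vect = vectorizer.fit_transform(df['clean_text'])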

In [14]:
# split the text vector into a training and testing set
x_train, x_test, y_train, y_test = train_test_split(text_vect, df['spam'], test_size=0.2, random_state=42, shuffle=True)
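The classes here are imbalanced (roughly three non-spam emails for every spam email), so one option worth considering is a stratified split, which keeps the class ratio the same in both sets. The sketch below uses made-up labels with a 75/25 ratio rather than the real data, and the `stratify` argument is an addition that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 750 non-spam (0) and 250 spam (1)
y = np.array([0] * 750 + [1] * 250)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of spam in each split
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

With `stratify=y`, both splits keep the 25% spam rate of the full label array. Without it, the rates can drift apart by chance, which matters more on small datasets.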

In [15]:
# using GradientBoostingClassifier() from scikit-learn's ensemble collection
classifier = ensemble.GradientBoostingClassifier(n_estimators=100, learning_rate=0.5, max_depth=6)

In [17]:
# fit the classifier on the training set
classifier.fit(x_train, y_train)

GradientBoostingClassifier(learning_rate=0.5, max_depth=6)

In [19]:
# make predictions
y_pred = classifier.predict(x_test)
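To classify raw text that was not in the CSV, the vectorizer and classifier have to be applied together. One way to do that, sketched here under assumptions, is to wrap both in a scikit-learn pipeline. The toy texts and labels below are invented for illustration, and `spam_pipeline` is a hypothetical name. This is not how the cells above are organized.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['clean_text'] and df['spam']
texts = [
    "free money offer click now",
    "meeting agenda for monday",
    "win a free prize today",
    "quarterly report attached",
    "free software cds cheap",
    "lunch with the project team",
]
labels = [1, 0, 1, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier in one call
spam_pipeline = make_pipeline(
    CountVectorizer(),
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.5, max_depth=6, random_state=42
    ),
)
spam_pipeline.fit(texts, labels)

# New raw emails go through the same vocabulary learned during fit
new_emails = ["free prize offer inside", "notes from the team meeting"]
new_pred = spam_pipeline.predict(new_emails)
print(new_pred)
```

A pipeline also makes it straightforward to fit the vocabulary on the training split only, so that words seen only in test emails do not influence the features.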

In [21]:
# generate the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       843
           1       0.98      0.92      0.95       296

    accuracy                           0.97      1139
   macro avg       0.98      0.96      0.97      1139
weighted avg       0.97      0.97      0.97      1139
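As a sanity check on the report, the f1-score column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The sketch below uses the rounded values printed above, so it only reproduces the report to two decimal places. The helper name `f1` is made up for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values copied from the report above
f1_ham = f1(0.97, 0.99)   # class 0 (not spam)
f1_spam = f1(0.98, 0.92)  # class 1 (spam)

# Weighted average uses each class's support as its weight
support_ham, support_spam = 843, 296
weighted_f1 = (f1_ham * support_ham + f1_spam * support_spam) / (
    support_ham + support_spam
)

print(round(f1_ham, 2), round(f1_spam, 2), round(weighted_f1, 2))
```

The spam recall of 0.92 is the weakest number in the report: with 296 spam emails in the test set, roughly 8% of them are classified as not spam.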

