# Studying Mr. Géron's Spam Classifier Notebook

Much of the code is borrowed from [Aurélien Géron's famous Jupyter Notebook on Classification.](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb)

Data can be pulled from [Apache SpamAssassin's old corpus.](http://spamassassin.apache.org/old/publiccorpus/)

In [1]:
import os
import sys 
import nltk
import time
import pickle
import numpy as np

from datetime import datetime
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import custom_functions as F # see custom module for code

start_time = time.time()
dt_object = datetime.fromtimestamp(time.time())
dt_object = str(dt_object).split('.')[0]
Date, StartTime = dt_object.split(' ')
print('Revised on: ' + Date)

Revised on: 2020-07-21


## Purpose 

Train and compare models, re-running the preprocessing each time, because there is not yet an efficient way to save and load the preprocessed data.

### Data Ingestion

In [2]:
F.get_data_if_needed('spam', 'easy_ham', '20030228')

Data successfully downloaded.


In [3]:
data_dir = 'data'
spam_dir = os.path.join(data_dir, 'spam')
ham_dir = os.path.join(data_dir, 'easy_ham')

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

print(f'There are {len(ham_filenames)} ham emails and {len(spam_filenames)} spam emails.')

There are 2500 ham emails and 500 spam emails.


In [4]:
# extracting emails
spam = F.extract_emails(_path=spam_dir, _names=spam_filenames)
ham = F.extract_emails(_path=ham_dir, _names=ham_filenames)

### Split into Training and Test datasets

We need to split the data into training and test sets before learning too much about the test set; otherwise, what we see there could bias the features we create for the training set.

In [5]:
X = np.array(ham + spam)
y = np.array([0] * len(ham) + [1] * len(spam))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
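
The split above is purely random. With five ham emails for every spam email, the spam share of the test set can drift away from the corpus-wide share. The following is a minimal sketch, on stand-in data with the same 2500/500 balance (the names `X_demo` and `y_demo` are hypothetical), of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both sets. It is an alternative to, not a description of, the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class balance as the corpus
# (2500 ham labelled 0, 500 spam labelled 1). Hypothetical, for illustration.
X_demo = np.arange(3000).reshape(-1, 1)
y_demo = np.array([0] * 2500 + [1] * 500)

# stratify=y_demo asks for the same ham/spam ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

spam_share_train = y_tr.mean()
spam_share_test = y_te.mean()
print("spam share (train):", spam_share_train)
print("spam share (test): ", spam_share_test)
```

All results reported in this notebook come from the unstratified split, so switching to a stratified one would change the numbers.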

### Preprocess, Train, Validate

In [9]:
# Mr. Géron's pipeline, keeping stopwords (remove_stopwords=False)
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", F.EmailToWordCounterTransformer_revised(remove_stopwords=False)),
    ("wordcount_to_vector", F.WordCounterToVectorTransformer()),
])

In [11]:
# preprocess data if need be
X_train_transformed = preprocess_pipeline.fit_transform(X_train)
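
The Purpose section notes that there is no good way yet to save and load the preprocessed data. One possible approach is sketched below. It assumes that the pipeline returns a SciPy sparse matrix, as the vectorizer in Géron's original notebook does; the custom module used here may differ. The small matrix `X_demo` and the temporary cache path are hypothetical stand-ins:

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Stand-in for the transformed training matrix (assumed to be sparse).
X_demo = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 3]]))

# Hypothetical cache location; a real run would use a stable path.
cache_path = os.path.join(tempfile.mkdtemp(), "X_train_transformed.npz")

# Save once after preprocessing, then reload instead of recomputing.
sparse.save_npz(cache_path, X_demo)
X_loaded = sparse.load_npz(cache_path)

# The reloaded matrix should match the original entry for entry.
same = (X_demo != X_loaded).nnz == 0
print("round trip preserved the matrix:", same)
```

This only caches the matrix. The fitted pipeline, including its vocabulary, would still have to be saved separately (for example with `pickle`) so that the test set can be transformed in the same way.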

In [12]:
# train a logistic regression classifier

# TODO: read the docs on how liblinear compares with the other solvers
# TODO: check how much the score varies across different random states

log_clf = LogisticRegression(solver="liblinear", random_state=42)
cv_score = cross_val_score(log_clf, X_train_transformed, y_train, cv=5, verbose=3)
cv_score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.981, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] .................................... , score=0.990, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.4s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished


0.9870833333333333
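
A comment in the cell above asks how much the score varies with the random state. With `cv=5` and a classifier, `cross_val_score` uses unshuffled stratified folds, so the folds themselves do not change between runs. One way to probe the variance is to reshuffle the folds with several seeds and look at the spread of the mean accuracy. The sketch below does this on synthetic data from `make_classification`, not on the email features, so its numbers say nothing about the spam corpus. The names `X_demo` and `y_demo` and the choice of five seeds are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced two-class data (roughly 5:1), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[5 / 6, 1 / 6], random_state=0
)

mean_scores = []
for seed in range(5):
    # Reshuffle the folds with a different seed on each pass.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = LogisticRegression(solver="liblinear", random_state=seed)
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv)
    mean_scores.append(scores.mean())

mean_scores = np.array(mean_scores)
print("mean accuracy per seed:", np.round(mean_scores, 4))
print("std across seeds:", mean_scores.std())
```

A spread that is small relative to the gap between two pipelines would suggest that the gap is not just an artifact of the fold assignment.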

In [13]:
# preprocess test set
try:
    X_test_transformed = preprocess_pipeline.transform(X_test)
except AttributeError as e:
    print(e)

In [14]:
# predict
log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 96.88%
Recall: 97.89%


### Rinse & Repeat: without stopwords

In [15]:
# New pipeline without stopwords
preprocess_pipeline_NEW = Pipeline([
    ("email_to_wordcount", F.EmailToWordCounterTransformer_revised(remove_stopwords=True)),
    ("wordcount_to_vector", F.WordCounterToVectorTransformer()),
])

X_train_transformed_NEW = preprocess_pipeline_NEW.fit_transform(X_train)

In [16]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
cv_score = cross_val_score(log_clf, X_train_transformed_NEW, y_train, cv=5, verbose=3)
cv_score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.988, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.983, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.977, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.988, total=   0.1s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.4s finished


0.9841666666666666

In [17]:
try:
    X_test_transformed_NEW = preprocess_pipeline_NEW.transform(X_test)
except AttributeError as e:
    print(e)  

In [18]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed_NEW, y_train)

y_pred = log_clf.predict(X_test_transformed_NEW)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 98.85%
Recall: 90.53%


Removing stopwords raised precision (96.88% to 98.85%) but lowered recall (97.89% to 90.53%) on this one test split. The trade-off made by the second classifier is perhaps justified: a user might prefer seeing a few spam emails in her inbox (lower recall) to having her ham incorrectly sent to the spam folder (lower precision).
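
To make the trade-off concrete, the sketch below computes precision and recall from confusion-matrix counts. The notebook never prints a confusion matrix, so the counts are assumptions: they were back-calculated as the smallest counts consistent with the reported percentages and a test set containing 95 spam emails. The helper `precision_recall` is hypothetical and is not part of the custom module:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of flagged emails that are really spam
    recall = tp / (tp + fn)     # share of real spam that gets flagged
    return precision, recall

# Assumed counts, inferred from the reported percentages (not printed above).
with_stopwords = precision_recall(tp=93, fp=3, fn=2)
without_stopwords = precision_recall(tp=86, fp=1, fn=9)

for label, (p, r) in [
    ("with stopwords", with_stopwords),
    ("without stopwords", without_stopwords),
]:
    print(f"{label}: precision {100 * p:.2f}%, recall {100 * r:.2f}%")
```

Under these assumed counts, removing stopwords would spare two ham emails from the spam folder (false positives drop from 3 to 1) at the cost of letting seven more spam emails through (false negatives rise from 2 to 9).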

[TODO: is there a logic behind lower recall and higher precision when removing stopwords? Does it generalize (more tests)?]

[TODO: compare with lemmatized words]

[TODO: compare with shorter list of most significant words]

---