# Studying Mr. Géron's Spam Classifier Notebook - Part II

Much of the code is borrowed from [Aurélien Géron's well-known Jupyter notebook on classification](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb).

The data can be downloaded from [Apache SpamAssassin's old public corpus](http://spamassassin.apache.org/old/publiccorpus/).

In [1]:
import os
import sys 
import nltk
import time
import pickle
import numpy as np
import scipy.sparse as ssp

from datetime import datetime
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import custom_functions as F # see custom module for code

dt_object = datetime.fromtimestamp(time.time())
dt_object = str(dt_object).split('.')[0]
Date, StartTime = dt_object.split(' ')
print('Revised on: ' + Date)

Revised on: 2020-07-19


## Purpose

Test the code from the previous notebook to make sure it works.

### Data Ingestion

Randomly sample 10% of the data for quick troubleshooting.

In [2]:
F.get_data_if_needed('spam', 'easy_ham', '20030228')

Data successfully downloaded.


In [3]:
# extracting emails
data_dir = 'data'
spam_dir = os.path.join(data_dir, 'spam')
ham_dir = os.path.join(data_dir, 'easy_ham')

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

len(ham_filenames)/10, len(spam_filenames)/10

(250.0, 50.0)

In [4]:
import random

random.seed(42)
            
ham_sample = random.sample(ham_filenames, 250)
spam_sample = random.sample(spam_filenames, 50)

spam = F.extract_emails(_path=spam_dir, _names=spam_sample)
ham = F.extract_emails(_path=ham_dir, _names=ham_sample)

len(ham), len(spam)

(250, 50)

### Split into Training and Test datasets

We split the data into training and test sets before exploring it any further. Learning too much about the test set first would bias the features we design for the training set.

In [5]:
X = np.array(ham + spam)
y = np.array([0] * len(ham) + [1] * len(spam))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
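
With only 50 spam emails out of 300, a purely random split can leave the test set with a different spam ratio than the sample as a whole. One option, not used in this notebook, is to pass `stratify=y` so that both splits keep the class ratio. A minimal sketch, with placeholder strings standing in for the emails (`X_demo` and `y_demo` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): strings stand in for the 250 ham and 50 spam emails.
X_demo = np.array(["ham"] * 250 + ["spam"] * 50)
y_demo = np.array([0] * 250 + [1] * 50)

# stratify=y_demo keeps the ham/spam ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Test set size and the number of spam emails it contains.
print(len(y_te), int(y_te.sum()))
```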

### Preprocess, Train, Validate


In [6]:
# with stopwords included
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", F.EmailToWordCounterTransformer_revised(remove_stopwords=False)),
    ("wordcount_to_vector", F.WordCounterToVectorTransformer_plusvocab()),
])

### Fix

How do we fix the vocabulary problem, i.e. make sure the test set is vectorized with the same vocabulary as the training set?

In [7]:
vocabulary_, X_train_transformed = F.load_processed_X_train('vocabulary_sample1', 
                                                          'X_train_processed_sample1',
                                                           preprocess_pipeline,
                                                           X_train)

Loading vocabulary.
Loading sparse matrix.
Processed data loaded.


In [11]:
# train a logistic regression classifier
log_clf = LogisticRegression(solver="liblinear", random_state=42)
cv_score = cross_val_score(log_clf, X_train_transformed, y_train, cv=5, verbose=3)
cv_score.mean()

[CV]  ................................................................
[CV] .................................... , score=0.938, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=1.000, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=1.000, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.958, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.979, total=   0.0s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


0.975

In [8]:
# preprocess test set
try:
    test_vocab, X_test_transformed = preprocess_pipeline.transform(X_test)
except AttributeError as e:
    print(e)

'WordCounterToVectorTransformer_plusvocab' object has no attribute 'vocabulary_'
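
The error message suggests that the vectorizer step was never fitted in this session: the training matrix and vocabulary were loaded from disk, so nothing set `vocabulary_` on the transformer inside `preprocess_pipeline`. The sketch below reproduces the failure with a toy stand-in and shows one possible fix, which is to restore the saved vocabulary onto the transformer before calling `transform`. `ToyVectorizer` is hypothetical and much simpler than `F.WordCounterToVectorTransformer_plusvocab`, whose code lives in the custom module and is not shown here.

```python
from collections import Counter


class ToyVectorizer:
    """Hypothetical stand-in for a word-count vectorizer that learns a vocabulary."""

    def fit(self, word_counts):
        # Map each word to a column index; index 0 is reserved for unknown words.
        total = Counter()
        for counts in word_counts:
            total.update(counts)
        self.vocabulary_ = {
            word: i + 1 for i, (word, _) in enumerate(total.most_common())
        }
        return self

    def transform(self, word_counts):
        # Raises AttributeError if vocabulary_ was never set by fit() or restored.
        n_cols = len(self.vocabulary_) + 1
        rows = []
        for counts in word_counts:
            row = [0] * n_cols
            for word, count in counts.items():
                row[self.vocabulary_.get(word, 0)] += count
            rows.append(row)
        return rows


train = [Counter({"free": 2, "money": 1}), Counter({"meeting": 1, "free": 1})]
test = [Counter({"free": 1, "prize": 3})]

# Fit once, as an earlier session would have done, and keep the vocabulary.
saved_vocabulary = ToyVectorizer().fit(train).vocabulary_

# A fresh, unfitted transformer fails in the same way as the pipeline above.
fresh = ToyVectorizer()
try:
    fresh.transform(test)
    failed = False
except AttributeError as e:
    failed = True
    print(e)

# Possible fix: restore the saved vocabulary before transforming the test set.
fresh.vocabulary_ = saved_vocabulary
test_rows = fresh.transform(test)
print(test_rows)
```

Whether the real transformer can be repaired this way depends on which attributes its `transform` method reads, so treat this as a direction to investigate, not a confirmed fix.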


In [18]:
# check that the vocabulary is the same; print any pair of entries that differ
for (w1, ct1), (w2, ct2) in zip(test_vocab.items(), vocabulary_.items()):
    if (w1, ct1) != (w2, ct2):
        print((w1, ct1), (w2, ct2))
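
`zip` stops at the shorter of the two dictionaries and pairs items by insertion order, so the loop above prints nothing when one vocabulary is simply a prefix of the other. A stricter check, sketched here with small hypothetical dictionaries in place of `test_vocab` and `vocabulary_`, compares the dictionaries directly and lists the entries that differ:

```python
# Hypothetical stand-ins for vocabulary_ and test_vocab.
train_vocab_demo = {"free": 1, "money": 2, "meeting": 3}
test_vocab_demo = {"free": 1, "money": 2}

# The zip-based loop finds no mismatch because it stops after two items.
zip_mismatches = [
    (a, b)
    for a, b in zip(test_vocab_demo.items(), train_vocab_demo.items())
    if a != b
]

# Dict equality ignores insertion order and catches missing or extra words.
same = test_vocab_demo == train_vocab_demo

# Symmetric difference: every (word, index) pair found in only one vocabulary.
differences = set(test_vocab_demo.items()) ^ set(train_vocab_demo.items())

print(zip_mismatches, same, differences)
```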

In [19]:
# predict and calculate precision and recall
log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 85.71%
Recall: 75.00%
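
Precision is TP / (TP + FP) and recall is TP / (TP + FN), with spam as the positive class. The notebook does not print a confusion matrix, so the counts below are an assumption: 6 true positives, 1 false positive, and 2 false negatives are the smallest counts that reproduce the reported percentages.

```python
# Assumed confusion counts (not shown in the notebook output).
tp, fp, fn = 6, 1, 2

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print("Precision: {:.2f}%".format(100 * precision))
print("Recall: {:.2f}%".format(100 * recall))
```

With a 60-email test set, a single misclassified email moves either metric by several points, so these figures are only a rough indication.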
