# Studying Mr. Géron's Spam Classifier Notebook - Part II

Much of the code is borrowed from [Aurélien Géron's famous Jupyter Notebook on Classification.](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb)

Data can be pulled from [Apache SpamAssassin's old corpus.](http://spamassassin.apache.org/old/publiccorpus/)

In [1]:
import os
import sys 
import nltk
import time
import pickle
import numpy as np
import scipy.sparse as ssp

from datetime import datetime
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import custom_functions as F # see custom module for code

dt_object = datetime.fromtimestamp(time.time())
dt_object = str(dt_object).split('.')[0]
Date, StartTime = dt_object.split(' ')
print('Revised on: ' + Date)

Revised on: 2020-07-18


## Purpose

Test saving sparse matrices with scipy.sparse, instead of pickling the processed data.

UPDATE: saving only the processed matrix, whether with scipy.sparse or pickle, does not save the fitted state of the transformer that produced it. When we preprocess using a class that learns attributes during fitting (vocabulary_), those attributes are lost, and they are needed for preprocessing the test data.
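One possible way around this is to persist the fitted transformer next to the matrix. The sketch below is only an illustration under assumptions: scikit-learn's CountVectorizer stands in for the custom transformers in custom_functions, and the documents and file names are made up.

```python
import os
import pickle
import tempfile

import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in data (assumption); the notebook itself works on parsed emails.
train_docs = ["free money now", "meeting at noon", "free offer now"]
test_docs = ["free meeting", "unknown words only"]

# CountVectorizer is a stand-in for the custom transformers (assumption).
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_docs)

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "X_train.npz")      # hypothetical file name
    model_path = os.path.join(tmp, "vectorizer.pkl")    # hypothetical file name

    # Save the matrix AND the fitted object that produced it.
    scipy.sparse.save_npz(matrix_path, X_train_vec)
    with open(model_path, "wb") as fh:
        pickle.dump(vectorizer, fh)

    # A later session would reload both.
    X_train_loaded = scipy.sparse.load_npz(matrix_path)
    with open(model_path, "rb") as fh:
        vectorizer_loaded = pickle.load(fh)

# The reloaded vectorizer still has vocabulary_, so the test data can be transformed.
X_test_vec = vectorizer_loaded.transform(test_docs)
print(X_train_loaded.shape, X_test_vec.shape)
```

The matrix file alone only speeds up training; the pickled object is what carries vocabulary_ forward to the test step.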

### Data Ingestion

In [2]:
F.get_data_if_needed('spam', 'easy_ham', '20030228')

Data successfully downloaded.


In [3]:
# extracting emails
data_dir = 'data'
spam_dir = os.path.join(data_dir, 'spam')
ham_dir = os.path.join(data_dir, 'easy_ham')

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

spam = F.extract_emails(_path=spam_dir, _names=spam_filenames)
ham = F.extract_emails(_path=ham_dir, _names=ham_filenames)

### Split into Training and Test datasets

We need to split the training and test sets before learning too much about the test set; otherwise that knowledge could bias how we create the features for the training set.

In [4]:
X = np.array(ham + spam)
y = np.array([0] * len(ham) + [1] * len(spam))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Preprocess, Train, Validate

Saving the file with scipy.sparse through a custom function resulted in much lower precision and recall than training without saving.

Trying again without passing the pipeline into a function, since this might have caused the errors.
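Before blaming the file format, it may help to confirm that a save/load round trip returns the same matrix. The following is a minimal check on a small random matrix, used here as an assumed stand-in for X_train_transformed; the file name is hypothetical.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Small random count matrix as a stand-in for X_train_transformed (assumption).
rng = np.random.default_rng(42)
dense = rng.integers(0, 3, size=(6, 5))
original = scipy.sparse.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.npz")  # hypothetical file name
    scipy.sparse.save_npz(path, original)
    restored = scipy.sparse.load_npz(path)

# Compare shape and values; a sparse "!=" with zero stored entries means equal.
same_shape = restored.shape == original.shape
same_values = (restored != original).nnz == 0
print(same_shape, same_values)
```

If both checks pass on the real matrix as well, the saved values are intact and the search for the low scores can move to how the matrix was produced or paired with its labels.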

In [8]:
import scipy.sparse

_path = 'processed_data'
os.makedirs(_path, exist_ok=True)  # create the folder only if it is missing

In [9]:
filename = 'X_train_transformed_stopwordsFalse'
_fullpath = os.path.join(_path, filename + '.npz')
_fullpath

'processed_data\\X_train_transformed_stopwordsFalse.npz'

In [10]:
# Mr. Géron's pipeline (stopwords kept, since remove_stopwords=False)
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", F.EmailToWordCounterTransformer_revised(remove_stopwords=False)),
    ("wordcount_to_vector", F.WordCounterToVectorTransformer()),
])

In [11]:
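try:
    # Cache hit: reuse the saved matrix. The pipeline is NOT fitted on this path.
    X_train_transformed = scipy.sparse.load_npz(_fullpath)
except FileNotFoundError:
    # Cache miss: fit the pipeline on the training set, then save the result.
    X_train_transformed = preprocess_pipeline.fit_transform(X_train)
    scipy.sparse.save_npz(_fullpath, X_train_transformed)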

In [12]:
# train a logistic regression classifier
log_clf = LogisticRegression(solver="liblinear", random_state=42)
cv_score = cross_val_score(log_clf, X_train_transformed, y_train, cv=5, verbose=3)
cv_score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.981, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] .................................... , score=0.990, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.3s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.7s finished


0.9870833333333333

In [13]:
X_train_transformed 

<2400x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 188522 stored elements in Compressed Sparse Row format>

In [6]:
X_test_transformed = preprocess_pipeline.transform(X_test)

AttributeError: 'WordCounterToVectorTransformer' object has no attribute 'vocabulary_'
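The error most likely appears because the training matrix was loaded from disk, so fit_transform never ran and the pipeline never learned its vocabulary. The sketch below reproduces the same failure with a hypothetical, much simplified stand-in for WordCounterToVectorTransformer; the class name and toy data are assumptions, not the code in custom_functions.

```python
from collections import Counter


class TinyVocabTransformer:
    """Hypothetical, simplified stand-in for WordCounterToVectorTransformer."""

    def fit(self, X, y=None):
        totals = Counter()
        for counts in X:
            totals.update(counts)
        # The fitted state lives only on this object, not in its output.
        self.vocabulary_ = {word: i for i, word in enumerate(sorted(totals))}
        return self

    def transform(self, X):
        width = len(self.vocabulary_)  # raises AttributeError if fit() never ran
        rows = []
        for counts in X:
            row = [0] * width
            for word, n in counts.items():
                if word in self.vocabulary_:  # unseen words are ignored
                    row[self.vocabulary_[word]] = n
            rows.append(row)
        return rows


train = [Counter({"free": 2, "money": 1}), Counter({"meeting": 1})]
test = [Counter({"free": 1, "lunch": 1})]

# Transforming with an unfitted object fails, as in the cell above.
unfitted = TinyVocabTransformer()
try:
    unfitted.transform(test)
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# After fitting on the training data, the same call works.
fitted = TinyVocabTransformer().fit(train)
test_rows = fitted.transform(test)
print(error_name, test_rows)
```

The fitted object has to be either refitted on the training set or saved and reloaded before the test set can be transformed.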

In [1]:

X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))