# Studying Mr. Géron's Spam Classifier Notebook - Part II

Much of the code is borrowed from [Aurélien Géron's famous Jupyter Notebook on classification](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb).

The data can be pulled from [Apache SpamAssassin's old corpus](http://spamassassin.apache.org/old/publiccorpus/).

In [1]:
import os
import sys 
import nltk
import time
import pickle
import numpy as np
import scipy.sparse as ssp

from datetime import datetime
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import custom_functions as F # see custom module for code

dt_object = datetime.fromtimestamp(time.time())
dt_object = str(dt_object).split('.')[0]
Date, StartTime = dt_object.split(' ')
print('Revised on: ' + Date)

Revised on: 2020-07-19


## Purpose

Study the structure of the sparse matrix.

Why did saving it with scipy.sparse or pickle not work as expected?

The metadata (the vocabulary_ attribute of WordCounterToVectorTransformer) is not saved; this might be an issue with the class itself. As written, the transformer returns only the sparse matrix, whose column indices refer to positions in vocabulary_ but which does not contain vocabulary_ itself.

It appears that vocabulary_ is stored on the WordCounterToVectorTransformer instance inside the Jupyter Notebook environment after the pipeline is fitted on X_train, but it is not part of the pipeline's output. That is why running the pipeline on X_test in the same session raises no error, whereas importing only the saved sparse matrix in a fresh session leads to the missing-vocabulary error.
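To make the distinction concrete, here is a minimal sketch using a toy stand-in for the transformer. TinyVectorizer, the sample documents, and the plain-dict output are all assumptions made for illustration; they are not part of the notebook or of custom_functions. The point is only that the transformed output carries column indices, so the word-to-index mapping has to be persisted separately.

```python
import json
from collections import Counter


class TinyVectorizer:
    """Hypothetical stand-in for the notebook's vectorizer (illustration only)."""

    def fit(self, word_counts):
        total = Counter()
        for wc in word_counts:
            total.update(wc)
        # index 0 is reserved for out-of-vocabulary words
        self.vocabulary_ = {w: i + 1 for i, (w, _) in enumerate(total.most_common())}
        return self

    def transform(self, word_counts):
        # the output holds only column indices and counts, never the words
        out = []
        for wc in word_counts:
            row = Counter()
            for w, c in wc.items():
                row[self.vocabulary_.get(w, 0)] += c
            out.append(dict(row))
        return out


docs = [Counter("spam spam ham".split()), Counter("ham eggs".split())]
vec = TinyVectorizer().fit(docs)
rows = vec.transform(docs)

# Persisting `rows` alone loses the words; the mapping must be saved as well.
saved_vocab = json.dumps(vec.vocabulary_)
restored_vocab = json.loads(saved_vocab)

print(rows)            # column indices and counts only
print(restored_vocab)  # word -> column index, needed to interpret `rows`
```

Because the vocabulary maps strings to integers, it survives a JSON round trip unchanged, which is presumably why JSON is a convenient format for it.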

### Data Ingestion

Randomly sample 10% of the data for quick troubleshooting.

In [2]:
F.get_data_if_needed('spam', 'easy_ham', '20030228')

Data successfully downloaded.


In [3]:
# extracting emails
data_dir = 'data'
spam_dir = os.path.join(data_dir, 'spam')
ham_dir = os.path.join(data_dir, 'easy_ham')

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

len(ham_filenames)/10, len(spam_filenames)/10

(250.0, 50.0)

In [4]:
import random
random.seed(42)
ham_sample = random.sample(ham_filenames, 250)
spam_sample = random.sample(spam_filenames, 50)

spam = F.extract_emails(_path=spam_dir, _names=spam_sample)
ham = F.extract_emails(_path=ham_dir, _names=ham_sample)

len(ham), len(spam)

(250, 50)

### Split into Training and Test datasets

We need to split the data into training and test sets before learning too much about the test set, which would bias us when creating features for the training set.

In [5]:
X = np.array(ham + spam)
y = np.array([0] * len(ham) + [1] * len(spam))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Preprocess, Train, Validate


In [6]:
from sklearn.base import BaseEstimator, TransformerMixin
from collections import Counter
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer_plusvocab(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
        
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        
        return self
    
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
                
        return (self.vocabulary_, # CHANGE: add vocabulary to results for study
                csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1)))
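The transform step relies on two details that are easy to miss: words missing from the vocabulary fall back to column 0, and scipy sums duplicate (row, column) entries when it builds the CSR matrix. The sketch below shows both with a toy vocabulary and a single made-up email; the words and counts are assumptions, not corpus data.

```python
from scipy.sparse import csr_matrix

vocabulary = {"number": 1, "the": 2}           # toy vocabulary, indices start at 1
word_count = {"the": 3, "zebra": 2, "yak": 4}  # "zebra" and "yak" are unknown

rows, cols, data = [], [], []
for word, count in word_count.items():
    rows.append(0)
    cols.append(vocabulary.get(word, 0))  # unknown words fall back to column 0
    data.append(count)

# Duplicate (row, col) pairs are summed when the CSR matrix is built,
# so column 0 ends up holding the total count of out-of-vocabulary words.
m = csr_matrix((data, (rows, cols)), shape=(1, len(vocabulary) + 1))
dense = m.toarray()
print(dense)
```

This is why the matrix has vocabulary_size + 1 columns: one per vocabulary word plus the shared column 0.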

In [7]:
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", F.EmailToWordCounterTransformer_revised(remove_stopwords=False)),
    ("wordcount_to_vector", WordCounterToVectorTransformer_plusvocab()),
])

In [8]:
# process
vocabulary_, X_train_transformed = preprocess_pipeline.fit_transform(X_train)

In [9]:
# list (count, word) pairs for the most common vocabulary words in one email; relies on the global vocabulary_
def get_words_counts(X_transformed, X_counts):
    list_of_counts = X_transformed.toarray()[0][1:X_counts].tolist()
    word_list = [(word, index) for (word, index) in vocabulary_.items() if index < (X_counts+1)] # needs a vocabulary_
    out = [(count, word) for (word, index), count in zip(word_list, list_of_counts)]
    return(out)

In [10]:
get_words_counts(X_train_transformed[1], 20)

[(10, 'number'),
 (7, 'the'),
 (6, 'to'),
 (3, 'a'),
 (3, 'and'),
 (1, 'of'),
 (2, 'in'),
 (7, 'i'),
 (9, 'it'),
 (3, 'is'),
 (1, 'url'),
 (1, 'for'),
 (4, 'that'),
 (4, 'you'),
 (0, 's'),
 (2, 'thi'),
 (0, 'on'),
 (2, 'with'),
 (3, 'have')]
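Note that the call above asks for 20 words but returns 19 pairs, because the slice [1:X_counts] stops one column short of the word list. A possible variant, sketched here under the assumption that the vocabulary maps each word to its column index, takes the vocabulary as an argument instead of reading a global and pairs counts with words by index. The name top_word_counts and the toy matrix are hypothetical.

```python
from scipy.sparse import csr_matrix


def top_word_counts(row, vocabulary, n_words):
    """Return (count, word) pairs for vocabulary indices 1..n_words of one row."""
    counts = row.toarray()[0]
    index_to_word = {index: word for word, index in vocabulary.items()}
    return [(int(counts[i]), index_to_word[i])
            for i in range(1, n_words + 1) if i in index_to_word]


# Toy data (not from the corpus): column 0 collects out-of-vocabulary words.
vocabulary = {"number": 1, "the": 2, "to": 3}
X = csr_matrix([[5, 10, 7, 0],
                [1, 0, 2, 4]])

print(top_word_counts(X[0], vocabulary, 3))
```

Looking words up by index avoids depending on the iteration order of the dictionary, which matters once the vocabulary has been reloaded from disk.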

In [11]:
# check counts
print(X_train[1].get_content())

    Date:        Mon, 26 Aug 2002 09:27:56 -0500
    From:        Chris Garrigues <cwg-dated-1030804078.e8b0d5@DeepEddy.Com>
    Message-ID:  <1030372078.11075.TMDA@deepeddy.vircio.com>

  | Tell me what keystroke made it happen so I can reproduce it and I'll
  | see what I can do about it (or if I can't, I'll hand it off to Brent).

Don't worry too much about it, you seem to have plenty of other things
to do in the immediate future, and this one isn't so critical that people
can't use the code in normal ways.

But, to make it happen, type (with normal key bindings) any digit, so the
code thinks you're trying a message number, then backspace, so the digit
goes away, then '-' (other junk characters don't seem to have the
problem, I have just been playing).   That will do it (every time).

That is: 0 ^h -

Once you get into that state, the same traceback occurs for every
character you type, until a message is selected with the mouse.

This is looking like it might be easy to find and fix

### FIX

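How can the vocabulary problem be fixed? One approach, tried below, is to return the vocabulary from the pipeline and save it alongside the sparse matrix.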

In [25]:
# forget processed data
vocabulary_ = []
X_train_transformed = []

In [32]:
def load_processed_X_train(vocab_name, X_train_name, X_train):
    
    import os
    import json
    import scipy.sparse
    
    # setup directory and file paths
    path = 'processed_data'
    os.makedirs(path, exist_ok=True)
    vocab_path = os.path.join(path, vocab_name + '.json')
    matrix_path = os.path.join(path, X_train_name + '.npz')
    
    # load vocabulary and matrix if they exist
    vocabulary_, X_train_transformed = None, None
    try:
        with open(vocab_path, 'r') as fp:
            vocabulary_ = json.load(fp)
        print('Loading vocabulary.\n')
    except FileNotFoundError:
        print('Json file not found.\n')
    try:
        X_train_transformed = scipy.sparse.load_npz(matrix_path)
        print('Loading sparse matrix.\n')
    except FileNotFoundError:
        print('Sparse matrix not found.\n')
    
    # both pieces are needed; otherwise fall through and reprocess
    if vocabulary_ is not None and X_train_transformed is not None:
        print('Processed data loaded.')
        return vocabulary_, X_train_transformed
    
    # if not, process data (uses the notebook-level preprocess_pipeline);
    # errors are left to propagate instead of being silently swallowed
    print('Processing data...\n')
    vocabulary_, X_train_transformed = preprocess_pipeline.fit_transform(X_train)
    
    # save processed data
    with open(vocab_path, 'w') as fp:
        json.dump(vocabulary_, fp, indent=4)
    print('Saving vocabulary...\n')
    scipy.sparse.save_npz(matrix_path, X_train_transformed)
    print('Saving sparse matrix...\n')
    
    print('Processed data loaded and saved.')
    return vocabulary_, X_train_transformed

In [33]:
vocabulary_, X_train_transformed = load_processed_X_train('vocabulary_sample1', 
                                                          'X_train_processed_sample1',
                                                           X_train)

Json file not found.

Sparse matrix not found.

Processing data...

Saving vocabulary...

Saving sparse matrix...

Processed data loaded and saved.


In [34]:
vocabulary_

{'number': 1,
 'the': 2,
 'to': 3,
 'a': 4,
 'and': 5,
 'of': 6,
 'in': 7,
 'i': 8,
 'it': 9,
 'is': 10,
 'url': 11,
 'for': 12,
 'that': 13,
 'you': 14,
 's': 15,
 'thi': 16,
 'on': 17,
 'with': 18,
 'have': 19,
 'from': 20,
 'be': 21,
 'your': 22,
 'are': 23,
 'not': 24,
 't': 25,
 'as': 26,
 'or': 27,
 'but': 28,
 'at': 29,
 'if': 30,
 'we': 31,
 'list': 32,
 'can': 33,
 'my': 34,
 'by': 35,
 'use': 36,
 'an': 37,
 'wa': 38,
 'time': 39,
 'all': 40,
 'ha': 41,
 'get': 42,
 'they': 43,
 'do': 44,
 'one': 45,
 'mail': 46,
 'so': 47,
 'more': 48,
 'just': 49,
 'will': 50,
 'com': 51,
 'about': 52,
 'onli': 53,
 'there': 54,
 'out': 55,
 'new': 56,
 'no': 57,
 'up': 58,
 'what': 59,
 'which': 60,
 'would': 61,
 'our': 62,
 'their': 63,
 'free': 64,
 'now': 65,
 'messag': 66,
 'when': 67,
 'some': 68,
 'been': 69,
 'other': 70,
 'year': 71,
 'email': 72,
 'work': 73,
 'ani': 74,
 'don': 75,
 'peopl': 76,
 'who': 77,
 'make': 78,
 'like': 79,
 'group': 80,
 'into': 81,
 'said': 82,
 'firs

In [35]:
X_train_transformed

<240x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 21185 stored elements in Compressed Sparse Row format>
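Since the purpose of this notebook is to study the sparse matrix, it may help to look at what "Compressed Sparse Row format" stores. The sketch below uses a small made-up matrix rather than X_train_transformed; data, indices and indptr are the three arrays scipy keeps for a CSR matrix.

```python
from scipy.sparse import csr_matrix

X = csr_matrix([[0, 2, 0, 1],
                [0, 0, 0, 0],
                [3, 0, 0, 4]])

print(X.data)     # non-zero values, row by row
print(X.indices)  # column index of each stored value
print(X.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Recover the (column, value) pairs of the last row from the three arrays.
start, end = X.indptr[2], X.indptr[3]
last_row = list(zip(X.indices[start:end].tolist(), X.data[start:end].tolist()))
print(last_row)
```

None of the three arrays contains a word, which is consistent with the vocabulary having to be stored separately.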

In [36]:
vocabulary_ = []
X_train_transformed = []

# test loading when data is found
vocabulary_, X_train_transformed = load_processed_X_train('vocabulary_sample1', 
                                                          'X_train_processed_sample1',
                                                           X_train)

Loading vocabulary.

Loading sparse matrix.

Processed data loaded.


In [37]:
vocabulary_

{'number': 1,
 'the': 2,
 'to': 3,
 'a': 4,
 'and': 5,
 'of': 6,
 'in': 7,
 'i': 8,
 'it': 9,
 'is': 10,
 'url': 11,
 'for': 12,
 'that': 13,
 'you': 14,
 's': 15,
 'thi': 16,
 'on': 17,
 'with': 18,
 'have': 19,
 'from': 20,
 'be': 21,
 'your': 22,
 'are': 23,
 'not': 24,
 't': 25,
 'as': 26,
 'or': 27,
 'but': 28,
 'at': 29,
 'if': 30,
 'we': 31,
 'list': 32,
 'can': 33,
 'my': 34,
 'by': 35,
 'use': 36,
 'an': 37,
 'wa': 38,
 'time': 39,
 'all': 40,
 'ha': 41,
 'get': 42,
 'they': 43,
 'do': 44,
 'one': 45,
 'mail': 46,
 'so': 47,
 'more': 48,
 'just': 49,
 'will': 50,
 'com': 51,
 'about': 52,
 'onli': 53,
 'there': 54,
 'out': 55,
 'new': 56,
 'no': 57,
 'up': 58,
 'what': 59,
 'which': 60,
 'would': 61,
 'our': 62,
 'their': 63,
 'free': 64,
 'now': 65,
 'messag': 66,
 'when': 67,
 'some': 68,
 'been': 69,
 'other': 70,
 'year': 71,
 'email': 72,
 'work': 73,
 'ani': 74,
 'don': 75,
 'peopl': 76,
 'who': 77,
 'make': 78,
 'like': 79,
 'group': 80,
 'into': 81,
 'said': 82,
 'firs

In [38]:
X_train_transformed

<240x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 21185 stored elements in Compressed Sparse Row format>
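Comparing the printed summaries by eye only shows that the shape and the number of stored elements match. A stricter check, sketched here with toy data and a temporary directory (both assumptions, chosen so the example does not touch processed_data), is to round-trip the vocabulary and the matrix and compare them element by element.

```python
import json
import os
import tempfile

import scipy.sparse
from scipy.sparse import csr_matrix

vocabulary = {"number": 1, "the": 2}   # toy values, not the notebook's data
X = csr_matrix([[0, 3, 1], [2, 0, 0]])

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.json")
    matrix_path = os.path.join(tmp, "X.npz")

    # save both pieces side by side
    with open(vocab_path, "w") as fp:
        json.dump(vocabulary, fp)
    scipy.sparse.save_npz(matrix_path, X)

    # load them back
    with open(vocab_path) as fp:
        vocabulary_loaded = json.load(fp)
    X_loaded = scipy.sparse.load_npz(matrix_path)

same_vocab = vocabulary_loaded == vocabulary
# (A != B) is itself sparse; no stored elements means the matrices agree everywhere
same_matrix = X_loaded.shape == X.shape and (X_loaded != X).nnz == 0
print(same_vocab, same_matrix)
```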

In [8]:
import scipy.sparse

_path = 'processed_data'
if not os.path.exists(_path):
    os.mkdir(_path)       

filename = 'X_train_transformed_stopwordsFalse'
_fullpath = os.path.join(_path, ''.join([filename, '.npz'])) 
_fullpath

try:
    X_train_transformed = scipy.sparse.load_npz(_fullpath)

except FileNotFoundError:
    
    X_train_transformed = preprocess_pipeline.fit_transform(X_train)
    
    scipy.sparse.save_npz(_fullpath, X_train_transformed)

In [12]:
# train a logistic regression classifier
log_clf = LogisticRegression(solver="liblinear", random_state=42)
cv_score = cross_val_score(log_clf, X_train_transformed, y_train, cv=5, verbose=3)
cv_score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.981, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] .................................... , score=0.990, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.3s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.7s finished


0.9870833333333333

In [13]:
X_train_transformed 

<2400x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 188522 stored elements in Compressed Sparse Row format>

In [6]:
X_test_transformed = preprocess_pipeline.transform(X_test)

AttributeError: 'WordCounterToVectorTransformer' object has no attribute 'vocabulary_'
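The error above is what a transformer raises when transform is called on an instance that was never fitted in the current session, which is the situation after loading only the saved matrix. The sketch below reproduces it with a hypothetical minimal class (MiniVectorizer is not the notebook's class) and then restores a previously saved vocabulary onto the instance. Whether assigning vocabulary_ directly is appropriate for the real transformer is an assumption.

```python
from scipy.sparse import csr_matrix


class MiniVectorizer:
    """Hypothetical minimal vectorizer, used only to reproduce the error."""

    def __init__(self, vocabulary_size=3):
        self.vocabulary_size = vocabulary_size

    def transform(self, X):
        rows, cols, data = [], [], []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))  # needs vocabulary_
                data.append(count)
        return csr_matrix((data, (rows, cols)),
                          shape=(len(X), self.vocabulary_size + 1))


X_new = [{"the": 2, "unseen": 1}]
vec = MiniVectorizer()

try:
    vec.transform(X_new)
    error = None
except AttributeError as exc:   # never fitted in this session
    error = str(exc)
print(error)

# Assumption: the vocabulary was saved earlier (e.g. as JSON) and is reloaded here.
vec.vocabulary_ = {"number": 1, "the": 2, "to": 3}
X_new_transformed = vec.transform(X_new).toarray()
print(X_new_transformed)
```

An alternative with the same effect would be to refit the pipeline on X_train in the new session before transforming X_test.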

In [1]:

X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))