# Studying Mr. Geron's Spam Classifier Notebook 

Code often borrowed from [Aurélien Geron's famous Jupyter Notebook on Classification.](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb)

Data can be pulled from [Apache SpamAssassin's old corpus.](http://spamassassin.apache.org/old/publiccorpus/)

In [1]:
import os
import sys 
import nltk
import time
import pickle
import numpy as np

from datetime import datetime
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import custom_functions as F # see custom module for code

start_time = time.time()
dt_object = datetime.fromtimestamp(time.time())
dt_object = str(dt_object).split('.')[0]
Date, StartTime = dt_object.split(' ')
print('Revised on: ' + Date)

Revised on: 2020-07-18


### Data Ingestion

In [2]:
F.get_data_if_needed('spam', 'easy_ham', '20030228')

Data successfully downloaded.


In [3]:
data_dir = 'data'
spam_dir = os.path.join(data_dir, 'spam')
ham_dir = os.path.join(data_dir, 'easy_ham')

ham_filenames = [name for name in sorted(os.listdir(ham_dir)) if name != 'cmds']
spam_filenames = [name for name in sorted(os.listdir(spam_dir)) if name != 'cmds']

print('There are ' +str(len(ham_filenames)) + ' ham emails and ' + str(len(spam_filenames)) + ' spam emails.')

There are 2500 ham emails and 500 spam emails.


In [4]:
# extracting emails
spam = F.extract_emails(_path=spam_dir, _names=spam_filenames)
ham = F.extract_emails(_path=ham_dir, _names=ham_filenames)

### Quick EDA

Email headers can be lengthy and contain more than 50% of the information, however, a lot of this information is not standard across headers and also might not be informative for ML on a first pass, except for the Subject and the Content-Type perhaps.

In [5]:
# example ham header
for header, value in ham[6].items():
    padding=25-len(header)
    print(header + ' '*padding, ':', value[:50])

Return-Path               : <martin@srv0.ems.ed.ac.uk>
Delivered-To              : zzzz@localhost.netnoteinc.com
Received                  : from localhost (localhost [127.0.0.1])	by phobos.l
Received                  : from phobos [127.0.0.1]	by localhost with IMAP (fe
Received                  : from n11.grp.scd.yahoo.com (n11.grp.scd.yahoo.com 
X-Egroups-Return          : sentto-2242572-52738-1030024499-zzzz=spamassassin.
Received                  : from [66.218.66.94] by n11.grp.scd.yahoo.com with 
X-Sender                  : martin@srv0.ems.ed.ac.uk
X-Apparently-To           : zzzzteana@yahoogroups.com
Received                  : (EGP: mail-8_1_0_1); 22 Aug 2002 13:54:59 -0000
Received                  : (qmail 43039 invoked from network); 22 Aug 2002 13
Received                  : from unknown (66.218.66.216) by m1.grp.scd.yahoo.c
Received                  : from unknown (HELO haymarket.ed.ac.uk) (129.215.12
Received                  : from srv0.ems.ed.ac.uk (srv0.ems.ed.ac.uk [1

In [6]:
# example spam header
for header, value in spam[83].items():
    padding=25-len(header)
    print(header + ' '*padding, ':', value[:50])

Return-Path               : <jamesalabi@mail.com>
Delivered-To              : zzzz@localhost.spamassassin.taint.org
Received                  : from localhost (localhost [127.0.0.1])	by phobos.l
Received                  : from phobos [127.0.0.1]	by localhost with IMAP (fe
Received                  : from webnote.net (mail.webnote.net [193.120.211.21
Received                  : from ok61094.com ([217.78.76.138]) by webnote.net 
Message-Id                : <200208241717.SAA16606@webnote.net>
From                      : "Dr.James Ologun" <jamesalabi@mail.com>
Reply-To                  : jamesalabi@mail.com
To                        : zzzz-sa-listinfo@spamassassin.taint.org
Date                      : Sat, 24 Aug 2002 20:18:02 -0700
Subject                   : Immediate Reply Needed
X-Mailer                  : Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version              : 1.0
Content-Type              : text/plain; charset="us-ascii"
X-MIME-Autoconverted      : from quoted-printa

In [7]:
# example ham body
print(ham[6].get_content().strip())

The Scotsman - 22 August 2002

 Playboy wants to go out with a bang 
 
 
 AN AGEING Berlin playboy has come up with an unusual offer to lure women into
 his bed - by promising the last woman he sleeps with an inheritance of 250,000
 (£160,000). 
 
 Rolf Eden, 72, a Berlin disco owner famous for his countless sex partners,
 said he could imagine no better way to die than in the arms of an attractive
 young woman - preferably under 30. 
 
 "I put it all in my last will and testament - the last woman who sleeps with
 me gets all the money," Mr Eden told Bild newspaper. 
 
 "I want to pass away in the most beautiful moment of my life. First a lot of
 fun with a beautiful woman, then wild sex, a final orgasm - and it will all
 end with a heart attack and then Im gone." 
 
 Mr Eden, who is selling his nightclub this year, said applications should be
 sent in quickly because of his age. "It could end very soon," he said.


------------------------ Yahoo! Groups Sponsor ---------------------~

In [8]:
# example spam body
print(spam[83].get_content().strip())

Dear Sir,

I am Dr James Alabi, the chairman of contract
award and review committee set up by the federal
government of Nigeria under the new civilian
dispensation to award new contracts and review
existing ones.
I came to know of you in my search for a reliable and
reputable person to handle a very confidential
transaction, which involves the transfer of a huge
sum of money to a foreign account. 

There were series of contracts executed by a 
consortium of multi-nationals in the oil industry in
favor of N.N.P.C. The original values of these 
contracts were deliberately over invoiced to the sum
of US$12,320,000.00 (Twelve Million Three Hundred and Twenty Thousand 
United
State Dollars). This amount has now been approved and
is now ready to be transferred being that the
companies
that actually executed these contracts have been
fully Paid and the projects officially commissioned. 


Consequently, my colleagues and I are willing to 
transfer the total amount to your account for
subsequen

---

**Email Structures can be complex.**

In [9]:
# payload can return single email or a list of objects
ham[13].get_payload()

[<email.message.EmailMessage at 0x246060bcc88>,
 <email.message.EmailMessage at 0x246060a4f60>]

In [10]:
# an email.message can be text/plain, text/html, and various other categories
for email in ham[13].get_payload():
    print(email.get_content_type())

text/plain
application/pgp-signature


In [11]:
# using Mr.Geron's nifty structure counters (see custom code)

# most common ham structures
for i in F.structures_counter(ham).most_common():
    padding = 4-len(str(i[1]))
    print(' '*padding + str(i[1]) + ': ' +i[0])

2408: text/plain
  66: multipart(text/plain | application/pgp-signature)
   8: multipart(text/plain | text/html)
   4: multipart(text/plain | text/plain)
   3: multipart(text/plain)
   2: multipart(text/plain | application/octet-stream)
   1: multipart(text/plain | text/enriched)
   1: multipart(text/plain | application/ms-tnef | text/plain)
   1: multipart(multipart(text/plain | text/plain | text/plain) | application/pgp-signature)
   1: multipart(text/plain | video/mng)
   1: multipart(text/plain | multipart(text/plain))
   1: multipart(text/plain | application/x-pkcs7-signature)
   1: multipart(text/plain | multipart(text/plain | text/plain) | text/rfc822-headers)
   1: multipart(text/plain | multipart(text/plain | text/plain) | multipart(multipart(text/plain | application/x-pkcs7-signature)))
   1: multipart(text/plain | application/x-java-applet)


In [12]:
# most common spam structures
for i in F.structures_counter(spam).most_common():
    padding = 4-len(str(i[1]))
    print(' '*padding + str(i[1]) + ': ' +i[0])

 218: text/plain
 183: text/html
  45: multipart(text/plain | text/html)
  20: multipart(text/html)
  19: multipart(text/plain)
   5: multipart(multipart(text/html))
   3: multipart(text/plain | image/jpeg)
   2: multipart(text/html | application/octet-stream)
   1: multipart(text/plain | application/octet-stream)
   1: multipart(text/html | text/plain)
   1: multipart(multipart(text/html) | application/octet-stream | image/jpeg)
   1: multipart(multipart(text/plain | text/html) | image/gif)
   1: multipart/alternative


### Split into Training and Test datasets

We need to split the traing and test sets before gaining too much information on the test set and biasing ourselves in creating the features for the training set.

In [13]:
X = np.array(ham + spam)
y = np.array([0] * len(ham) + [1] * len(spam))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Handling HTML

In [14]:
HTML_spam = [email for email in X_train[y_train==1] if email.get_content_type() == "text/html"]
print(HTML_spam[9].get_content().strip()[:1400], "...")

<html>

<head>
<title>Home Page</title>
</head>

<body>

<p align="center"><font color="#000000" face="Arial" size="+0"><b><IMG SRC="http://mail4.mortgages101.net/logo.php?id=88&id2=1143953"></p>

<p align="center">If this promotion has reached you in error and you would prefer not to
receive marketing messages from us, please send an email to&nbsp; <a
href="mailto:cease-and-desist@mortgages101.net">cease-and-desist@mortgages101.net</a>
&nbsp; (all one word, no spaces) giving us the email address in question or call
1-888-748-7751 for further assistance.</p>

<p align="center"><u>Gain access to a</b></font><font size="+1" color="#000000"
face="Arial"> <i><b>Vast Network Of Qualified Lenders at Nationwide Network!</b></i></font></u></p>

<p align="center"><font color="#000000" face="Arial">This is a zero-cost service which
enables you to shop for a mortgage conveniently from your home computer. &nbsp; Our
nationwide database will give you access to lenders with a variety of loan program

In [15]:
print(F.html_to_plaintext(HTML_spam[9].get_content())[:800], "...")


If this promotion has reached you in error and you would prefer not to
receive marketing messages from us, please send an email to   HYPERLINK cease-and-desist@mortgages101.net
  (all one word, no spaces) giving us the email address in question or call
1-888-748-7751 for further assistance.
Gain access to a Vast Network Of Qualified Lenders at Nationwide Network!
This is a zero-cost service which
enables you to shop for a mortgage conveniently from your home computer.   Our
nationwide database will give you access to lenders with a variety of loan programs that
will work for Excellent, Good, Fair or even Poor Credit!
  We will choose up to 3 mortgage companies
from our database of  registered brokers/lenders. Each will contact you to offer you their best rate and terms - at
no charge.
 
  ...


### Stemming

Here I start a comparison of Stemmers and Lemmatization. 

Thought: 

* test each change, one at a time, always comparing with the baseline which is Mr. Geron's original models:
1. Rerun with the Lancaster Stemmer
2. Rerun with Lemmatization 
3. Rerun removing Stop Words

Have we taken into consideration the number of words vs number of tokens (pct unique)? 

How about the size of the vocabulary?



In [16]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer() # more aggressive stemmer --> better?
wordlen = 0
for word in ("Barbaric", "Barbarian", "Confusion", "Confusing", "Confer", "Conferred", "Confabulate"):       
    padding = 11-len(word)
    print("Porter:    " +word +padding*' ' + " => " +porter.stem(word))
    print("Lancaster: " +word +padding*' ' + " => " +lancaster.stem(word))

Porter:    Barbaric    => barbar
Lancaster: Barbaric    => barb
Porter:    Barbarian   => barbarian
Lancaster: Barbarian   => barb
Porter:    Confusion   => confus
Lancaster: Confusion   => confus
Porter:    Confusing   => confus
Lancaster: Confusing   => confus
Porter:    Confer      => confer
Lancaster: Confer      => conf
Porter:    Conferred   => confer
Lancaster: Conferred   => confer
Porter:    Confabulate => confabul
Lancaster: Confabulate => confab


### Lemmatization

In [17]:
lemma = WordNetLemmatizer()

sentence = "Barbaric barbarians! confused, confusion conferred - confabulations confabulate Biblical meanderings."
punctuations="?:!.,;-"
sentence_words = nltk.word_tokenize(sentence)
for word in sentence_words:
    if word in punctuations:
        sentence_words.remove(word)

sentence_words
print("{0:20}{1:20}".format("Word","Lemma"))
print("{0:20}{1:20}".format("-----","-----"))

for word in sentence_words:
    print("{0:20}{1:20}".format(word,lemma.lemmatize(word)))

Word                Lemma               
-----               -----               
Barbaric            Barbaric            
barbarians          barbarian           
confused            confused            
confusion           confusion           
conferred           conferred           
confabulations      confabulation       
confabulate         confabulate         
Biblical            Biblical            
meanderings         meanderings         


### Email to Word Counter

Mr. Geron:

"*We are ready to put all this together into a transformer that we will use to convert emails to word counters. Note that we split sentences into words using Python's `split()` method, which uses whitespaces for word boundaries. This works for many written languages, but not all. For example, Chinese and Japanese scripts generally don't use spaces between words, and Vietnamese often uses spaces even between syllables. It's okay in this exercise, because the dataset is (mostly) in English.*"

In [18]:
exampleX = X_train[11]

print(exampleX.get_content())

On Fri, 6 Sep 2002, Russell Turpin wrote:

> Don't swallow too quickly what you have read about
> more traditional cultures, today or in the past. Do

I don't swallow ;>

I was just offering anecdotal first-hand experiences from a number of
cultures indicating 1) we apparently have a problem 2) which requires more
than ad hoc hand-waving approach (it's trivial! it's obvious! all we have
to do is XY!).

> we have any statistics on the poor man's divorce from
> centuries past? Are you so sure that the kids in 18th

That's easy. Divorce didn't happen. The church and the society looked
after that. Only relatively recently that privilege was granted to kings, 
and only very recently to commoners.

> century England were any more "functional" than those
> today? What about 20th century Saudi Arabia?

Is Saudi Arabia a meaningful emigration source?
 
> >At least from the viewpoint of demographics sustainability and 
> >counterpressure to gerontocracy and resulting innovatiophobia we're doing


---

Mr. Geron's counter does not remove stop words - I'm curious to test whether this is a good move or whether stop words prevent classifiers from further gleaning information from more useful words. 

In [19]:
GeronsCounter = F.EmailToWordCounterTransformer_revised(remove_stopwords=False)
GeronsCounter.fit_transform([exampleX])

array([Counter({'the': 16, 'that': 8, 'and': 8, 'number': 6, 'i': 6, 'to': 6, 'a': 5, 'of': 5, 'we': 5, 't': 4, 'have': 4, 'about': 4, 'more': 4, 'in': 4, 'do': 4, 'from': 4, 's': 4, 'are': 4, 'innov': 4, 'on': 3, 'you': 3, 'wa': 3, 'first': 3, 'is': 3, 'ani': 3, 'centuri': 3, 'demograph': 3, 'gerontocraci': 3, 'import': 3, 'don': 2, 'swallow': 2, 'what': 2, 'read': 2, 'cultur': 2, 'today': 2, 'or': 2, 'past': 2, 'hand': 2, 'problem': 2, 'than': 2, 'it': 2, 'all': 2, 'divorc': 2, 'numberth': 2, 'onli': 2, 'recent': 2, 'grant': 2, 'saudi': 2, 'arabia': 2, 'sustain': 2, 'someth': 2, 'point': 2, 'm': 2, 'last': 2, 'see': 2, 'america': 2, 'specif': 2, 'us': 2, 'as': 2, 'not': 2, 'west': 2, 'gener': 2, 'lack': 2, 'vi': 2, 'trend': 2, 'by': 2, 'work': 2, 'thi': 2, 'fri': 1, 'sep': 1, 'russel': 1, 'turpin': 1, 'wrote': 1, 'too': 1, 'quickli': 1, 'tradit': 1, 'just': 1, 'offer': 1, 'anecdot': 1, 'experi': 1, 'indic': 1, 'appar': 1, 'which': 1, 'requir': 1, 'ad': 1, 'hoc': 1, 'wave': 1, 'approa

In [20]:
NewCounter = F.EmailToWordCounterTransformer_revised(remove_stopwords=True)
NewCounter.fit_transform([exampleX])

array([Counter({'number': 6, 'innov': 4, 'first': 3, 'centuri': 3, 'demograph': 3, 'gerontocraci': 3, 'import': 3, 'swallow': 2, 'read': 2, 'cultur': 2, 'today': 2, 'past': 2, 'hand': 2, 'problem': 2, 'divorc': 2, 'numberth': 2, 'recent': 2, 'grant': 2, 'saudi': 2, 'arabia': 2, 'sustain': 2, 'someth': 2, 'point': 2, 'last': 2, 'see': 2, 'america': 2, 'specif': 2, 'us': 2, 'west': 2, 'gener': 2, 'lack': 2, 'vi': 2, 'trend': 2, 'work': 2, 'fri': 1, 'sep': 1, 'russel': 1, 'turpin': 1, 'wrote': 1, 'quickli': 1, 'tradit': 1, 'offer': 1, 'anecdot': 1, 'experi': 1, 'indic': 1, 'appar': 1, 'requir': 1, 'ad': 1, 'hoc': 1, 'wave': 1, 'approach': 1, 'trivial': 1, 'obviou': 1, 'xy': 1, 'statist': 1, 'poor': 1, 'man': 1, 'sure': 1, 'kid': 1, 'easi': 1, 'happen': 1, 'church': 1, 'societi': 1, 'look': 1, 'rel': 1, 'privileg': 1, 'king': 1, 'common': 1, 'england': 1, 'function': 1, 'meaning': 1, 'emigr': 1, 'sourc': 1, 'least': 1, 'viewpoint': 1, 'counterpressur': 1, 'result': 1, 'innovatiophobia': 1,

It will also be useful to test whether including unique words, even if they appear only once, is useful. Words like `innovatiophobia`, `autocatalyt`, and `counterpressur` might be unique because of their length, so we wouldn't need to consult a dictionary of most common words which might incur expensive processing.

### Word Counter to Vector Transformer

Mr. Geron: 

"*Now we have the word counts, and we need to convert them to vectors. For this, we will build another transformer whose `fit()` method will build the vocabulary (an ordered list of the most common words) and whose `transform()` method will use the vocabulary to convert word counts to vectors. The output is a sparse matrix.*"


In [21]:
GeronsWordCounts = GeronsCounter.fit_transform([exampleX])

countertovec = F.WordCounterToVectorTransformer()
sparsematrix1 = countertovec.fit_transform(GeronsWordCounts)
sparsematrix1

<1x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 175 stored elements in Compressed Sparse Row format>

Here is a potential way to limit a list to interesting words from Mr.Geron's counter:

In [22]:
np.set_printoptions(threshold=sys.maxsize)

for i in range(len(countertovec.vocabulary_)):
    
    ct = sparsematrix1.toarray()[0][i+1]
    
    word, count = list(countertovec.vocabulary_.items())[i]
    if ct > 1:
        print(ct, word)
    elif ct == 1 and len(word) > 7:
        print(ct, word)
    else:
        pass

16 the
8 that
8 and
6 number
6 i
6 to
5 a
5 of
5 we
4 t
4 have
4 about
4 more
4 in
4 do
4 from
4 s
4 are
4 innov
3 on
3 you
3 wa
3 first
3 is
3 ani
3 centuri
3 demograph
3 gerontocraci
3 import
2 don
2 swallow
2 what
2 read
2 cultur
2 today
2 or
2 past
2 hand
2 problem
2 than
2 it
2 all
2 divorc
2 numberth
2 onli
2 recent
2 grant
2 saudi
2 arabia
2 sustain
2 someth
2 point
2 m
2 last
2 see
2 america
2 specif
2 us
2 as
2 not
2 west
2 gener
2 lack
2 vi
2 trend
2 by
2 work
2 thi
1 approach
1 privileg
1 function
1 viewpoint
1 counterpressur
1 innovatiophobia
1 eurotrash
1 autocatalyt
1 american
1 foremost


### Preprocess, Train, Validate using stopwords (original)

In [23]:
# Mr. Geron's pipeline
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", F.EmailToWordCounterTransformer_revised(remove_stopwords=False)),
    ("wordcount_to_vector", F.WordCounterToVectorTransformer()),
])

In [24]:
def load_pickled_file(filename, _pipeline):
    _path = 'pickled_files'
    if not os.path.exists(_path):
        os.mkdir(_path)
        
    filepath = os.path.join(_path, ''.join([filename, '.pickle']))
               
    try:
        X_train_transformed = pickle.load(open(os.path.join(filepath, 'rb')))
        print('Loading X_train_transformed..')
                                          
                                          
    except FileNotFoundError as e:
        
        print(e)
        X_train_transformed = _pipeline.fit_transform(X_train)
        
        # pickle the model for future ease
        pickle.dump(X_train_transformed, open(filepath, 'wb'))
        
        return(X_train_transformed)

In [25]:
# preprocess data if need be
X_train_transformed = load_pickled_file('X_train_transformed_stopwordsFalse', preprocess_pipeline)

[Errno 2] No such file or directory: 'pickled_files\\X_train_transformed_stopwordsFalse.pickle\\rb'


In [26]:
# train a logistic regression classifier
log_clf = LogisticRegression(solver="liblinear", random_state=42)
cv_score = cross_val_score(log_clf, X_train_transformed, y_train, cv=5, verbose=3)
cv_score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.981, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.1s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished


0.9870833333333333

In [27]:
X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 96.88%
Recall: 97.89%


### Rinse & Repeat: without stopwords

In [28]:
# New pipeline without stopwords
preprocess_pipeline_NEW = Pipeline([
    ("email_to_wordcount", F.EmailToWordCounterTransformer_revised(remove_stopwords=True)),
    ("wordcount_to_vector", F.WordCounterToVectorTransformer()),
])

X_train_transformed = load_pickled_file('X_train_transformed_stopwordsTrue', preprocess_pipeline_NEW)

[Errno 2] No such file or directory: 'pickled_files\\X_train_transformed_stopwordsTrue.pickle\\rb'


In [29]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
cv_score = cross_val_score(log_clf, X_train_transformed, y_train, cv=5, verbose=3)
cv_score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.988, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.983, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.977, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.988, total=   0.0s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


0.9841666666666666

In [30]:
X_test_transformed = preprocess_pipeline_NEW.transform(X_test)

log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 98.85%
Recall: 90.53%


Removing stopwords increases precision while lowering recall in this one particular instance. The trade-off rate between precision and recall in the second classifier is perhaps justified - a user might prefer seeing a few spam emails in her inbox (lower recall) to having her ham be incorrectly sent to the spam folder (lower precision).

[TODO: is there a logic behind lower recall and higher precision when removing stopwords? Does it generalize (more tests)?]

[TODO: compare with lemmatized words]

[TODO: compare with shorter list of most significant words]

---

In [31]:
# with no pickling, it takes 412 seconds
# with pickling, it takes 170 seconds

secs = round(time.time() - start_time, 1)
print(''.join(['Time elapsed: ', str(secs), ' seconds.']))

Time elapsed: 170.0 seconds.
