<h1>Spam filters</h1>

In [135]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [134]:
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import StratifiedKFold
from sklearn.decomposition import TruncatedSVD
from sklearn import linear_model

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.model_selection import KFold

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import confusion_matrix


from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

<h1>Starting off</h1>
I started by computing how the majority classifier would fare on this dataset, so I would have some marker to go off of. The majority classifier always predicts the most prominent class.

In [79]:
len_spam_folder = len(os.listdir("spam"))
len_ham_folder  = len(os.listdir("ham"))
print(len_spam_folder)
print(len_ham_folder)
majority_c = len_ham_folder / (len_spam_folder + len_ham_folder)
print(majority_c) #Majority classifier will be correct ~56.9% of the time (predicts 'ham').

1248
1650
0.5693581780538303


<h1>Retrieving the email files</h1>
I then went about creating a dataframe containing all of the emails using a simple loop. Then, I shuffled the dataframe to intermingle the spam and ham examples. I do this so that when using error estimation later on, that kfolds will be operating off of a randomized list and not the spam and then ham examples in sequence.

In [136]:
email_vectors = []
dirs = ["spam", "ham"]
for directory in dirs:
    all_files = os.listdir(directory)
    for file in all_files:
        email = open(directory + "/" + file, encoding="utf8")
        content = email.read()
        vector = {"y_value": [directory], "content": [content]}
        email_vector = pd.DataFrame(data=vector)
        email_vectors.append(email_vector)
        email.close()
        
df = pd.concat(email_vectors)


In [137]:
df = df.take(np.random.permutation(len(df)))

<h1>Applying vectorizers</h1>
To compare Count Vectorization and Tfidf Vectorization, I created a pipeline for each and transformed the data, creating two new dataframes.
Count Vectorization turns each tokenized word into a feature, with the feature values being the frequencies of these words for each example. This results in quite a lot of features. In an effort to reduce the number of features, I specified to not tokenize any stop words.
Tfidf works similarly, where instead of word frequencies, a Tfidf score is used instead. This Tfidf score increases proportionally to the frequency of a word in a particular document, but is also offset by the frequency of the frequency of the word in the corpus. I took the same action on the stop words with this as I did for Count Vectorization.

Tfidf definition taken from wikipedia :
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In [129]:
content = df['content'] #Take out the email contents.

cv_pipeline = Pipeline ([
    ("vectorizer", CountVectorizer(stop_words="english")) #Create count vectorizer pipeline.
])
tfidf_pipeline = Pipeline ([
    ("tfidf", TfidfVectorizer(stop_words="english")) #Create the Tfidf vectorizer pipeline.
])

cv_pipeline.fit(content)
cv_df = cv_pipeline.transform(content) #Create the Count Vectorizer dataframe.

tfidf_pipeline.fit(content)
tfidf_df = tfidf_pipeline.transform(content) #Create the Tfidf dataframe.

Here, I encoded the y values for use later on in error estimation.

In [130]:
y = df['y_value'].values
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)


<h1>Create the pipeline and compare accuracy</h1>
Here, I create a Logistic Regression pipeline, and then specify to use 10-fold cross validation.

I then compare the kfolds error values of the Count Vectorizer model vs. the Tfidf Vectorizer model.

I solely used kfolds for error estimation and neglected to use holdout as I believe the dataset is too small for holdout. In the lectures, it was mentioned that the dataset would have to be very large to mitigate holdouts flaws, such as sub-optimal use of the dataset for training. Therefore, I decided to stick with kfolds.

In [131]:
pipeline = Pipeline([
    ("estimator", LogisticRegression())
])

cv_accuracy = np.mean(cross_val_score(pipeline, cv_df, y_encoded, scoring="accuracy", cv=10))

tfidf_accuracy = np.mean(cross_val_score(pipeline, tfidf_df, y_encoded, scoring="accuracy", cv=10))

print("Count Vectorization accuracy: ", cv_accuracy)
print("\n")
print("Tfidf accuracy              : ", tfidf_accuracy)

Count Vectorization accuracy:  0.980674143897


Tfidf accuracy              :  0.954106908483


<h1>Analysis</h1>
From the results above, it would appear that Count Vectorization is the better method to go with.

I decided to take a look at the confusion matrices of both, to see the rate how many False Positives and False Negatives occurred within each. In the case of a Spam filter, False Positives are much worse, so I looked at these with particular scrutiny.

While in terms of accuracy Tfidf is usually ~2% worse than Count Vectorization, it usually has about twice as many false positives, which arguably makes it twice as bad.

In [132]:
cv_y_predicted = cross_val_predict(pipeline, cv_df, y_encoded, cv=kf)

cv_conf_matrix = confusion_matrix(y_encoded, cv_y_predicted)
print("Count Vectorization Confusion Matrix:")
print(cv_conf_matrix)
print("\n")


tfidf_y_predicted = cross_val_predict(pipeline, tfidf_df, y_encoded, cv=kf)

tfidf_conf_matrix = confusion_matrix(y_encoded, tfidf_y_predicted)
print("Tfidf Confusion Matrix:")
print(tfidf_conf_matrix)



Count Vectorization Confusion Matrix:
[[1622   28]
 [  28 1220]]


Tfidf Confusion Matrix:
[[1584   66]
 [  69 1179]]


<h1>Extra work</h1>
I searched online for a bit of extra reading that might give me some good ideas for what else to try. I stumbled across an interesting article by a man named Paul Graham, which gave me a few ideas. The article was based on a talk he gave at the 2003 Spam conference about Bayesian Filtering.
http://www.paulgraham.com/better.html

<h2>Preserving case</h2>
For example, he discusses how important preserving case is for detecting spam. I tried to implement this in the CountVectorizer pipeline, but for some reason the number of features didn't change, even when I ran it by itself before running the count vectorizer I created above.
If implemented properly, although it would increase the number of features, it would be interesting to see the effect it would have in reducing the efficacy of the spam filter. I would have compared this to the plain Count Vectorizer filter and paid particular attention to the False Positives number again.


In [123]:
content = df['content'] #Take out the email contents.

cv_pipeline_case_preserved = Pipeline ([
    ("vectorizer", CountVectorizer(lowercase="False",stop_words="english")) 
])

cv_pipeline_case_preserved.fit(content)
cv_casep_df = cv_pipeline_case_preserved.transform(content) #Create the Count Vectorizer dataframe.

In [124]:
cv_casep_df.shape

(2898, 109449)

In [125]:
cv_casep_accuracy = np.mean(cross_val_score(pipeline, cv_casep_df, y_encoded, scoring="accuracy", cv=kf))
print(cv_casep_accuracy)

0.979986875075


<h2>Reducing features</h2>

Paul Graham discussed how having more features in your dataset is akin to having a smaller corpus, so I attempted to try and reduce the number of features.

In an attempt to reduce the number of features in, I tried to use maxdf, and also SVD. 
SVD was a bit naive, as I wouldn't have been able to map any y values to the transformed values anyway. In the end, it didn't work, as I kept getting memory errors.

For max_df, I went as far as to reduce ~100 features, but this ended up nuking the accuracy of model. Hence, I concluded that any reduction in features in this way to be pointless, as the accuracy tradeoff was too great.


In [195]:
content = df['content']

cv_svd_pipeline = Pipeline ([
    ("vectorizer", CountVectorizer(stop_words="english",max_df=0.9)) #Create count vectorizer pipeline.
])

#cv_svd_pipeline.fit(content)
#cv_svd_df = cv_svd_pipeline.transform(content)


#svd = TruncatedSVD(n_components=10000)
#svd.fit(cv_df)

cv_max_df_pipeline = Pipeline ([
    ("vectorizer", CountVectorizer(stop_words="english",max_df=0.5)) #Create count vectorizer pipeline.
])

cv_max_df_pipeline.fit(content)
cv_max_df_df = cv_max_df_pipeline.transform(content)



In [197]:
cv_max_df_df.shape

(2898, 109387)

In [199]:
max_df_pipeline = Pipeline([
    ("estimator", LogisticRegression())
])

cv_max_df_accuracy = np.mean(cross_val_score(max_df_pipeline, cv_max_df_df, y_encoded, scoring="accuracy", cv=10))

In [200]:
print(cv_max_df_accuracy)

0.514858608758
