Charles Swedensky<br>
CSC570<br>
Module 2 / Week 9<br>
## Goal: Train a Naive Bayes model to classify future SMS messages as either spam or ham.

Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the text to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score



In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
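# sklearn.cross_validation was removed in later scikit-learn releases; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split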
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [2]:
df = pd.read_csv("SMSSpamCollection", sep='\t', names=['spam', 'txt'])

In [17]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | 0.0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0.0 | Ok lar... Joking wif u oni... |
| 2 | 1.0 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0.0 | U dun say so early hor... U c already then say... |
| 4 | 0.0 | Nah I don't think he goes to usf, he lives aro... |


In [4]:
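# Encode the label as a binary indicator: 1.0 for spam, 0.0 for ham.
# astype(float) keeps the dtype consistent across pandas versions (newer ones return booleans).
df['spam'] = pd.get_dummies(df.spam)['spam'].astype(float)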

In [5]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | 0.0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0.0 | Ok lar... Joking wif u oni... |
| 2 | 1.0 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0.0 | U dun say so early hor... U c already then say... |
| 4 | 0.0 | Nah I don't think he goes to usf, he lives aro... |
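
The encoding step amounts to mapping each label string to 0 or 1. Here is a minimal sketch in plain Python, assuming the raw labels are exactly the strings `ham` and `spam`; the helper name `encode_labels` is hypothetical and not part of this notebook.

```python
def encode_labels(labels, positive="spam"):
    """Return 1 for the positive class and 0 for every other label."""
    return [1 if label == positive else 0 for label in labels]

# The first five labels of the dataset, as they appear before encoding
raw = ["ham", "ham", "spam", "ham", "ham"]
encoded = encode_labels(raw)
print(encoded)  # [0, 0, 1, 0, 0]
```

On the raw pandas column, an equivalent one-liner would be `df.spam.map({'ham': 0, 'spam': 1})`.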


In [6]:
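# Vectorize the text with TF-IDF weights and remove English stop words
# (the NLTK stop-word corpus must be available, e.g. via nltk.download('stopwords'))
stopset = set(stopwords.words('english'))
# Newer scikit-learn releases expect stop_words to be a list rather than a set
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=list(stopset))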

In [37]:
# Define the target vector (1 = spam, 0 = ham)
y = df.spam

In [29]:
#convert df.txt from text to features
X= vectorizer.fit_transform(df.txt)

In [32]:
# 5572 instances x 8587 unique words
print(y.shape)
print(X.shape)

(5572L,)
(5572, 8587)
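
Each row of that matrix holds one message's TF-IDF weights. To make the weighting concrete, here is a rough sketch in plain Python. It assumes whitespace tokenization, no stop-word removal, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which is my understanding of the `TfidfVectorizer` defaults. The function `tfidf` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed to mirror smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return sorted(doc_freq), rows

docs = ["free entry win cup", "ok lar joking", "win free prize"]
vocab, rows = tfidf(docs)
print(len(docs), "documents x", len(vocab), "unique words")  # 3 documents x 8 unique words
```

In the first toy document, `entry` gets a higher weight than `free` because it occurs in fewer documents. That is the point of the IDF term.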


In [33]:
# Train/test split
X_train, X_test,y_train, y_test = train_test_split(X, y, random_state=42)
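
With no size arguments, `train_test_split` holds out a quarter of the rows for testing. The sketch below shows the idea of a seeded, shuffled split using only the standard library. The helper `shuffled_split` is hypothetical, and it will not reproduce the exact partition scikit-learn draws for `random_state=42`.

```python
import math
import random

def shuffled_split(n_rows, test_fraction=0.25, seed=42):
    """Shuffle row indices with a fixed seed and return (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(5572)
print(len(train_idx), len(test_idx))  # 4179 1393
```

Fixing the seed makes the split repeatable, so the score further down can be compared across runs.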

In [34]:
#train a naive_bayes classifier
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
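
To show what the fit computes, here is a hand-rolled sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, as in the output above). It works on raw token counts rather than TF-IDF weights, and the functions `fit_multinomial_nb` and `predict` plus the toy data are assumptions for illustration, not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        # Laplace smoothing: add alpha to every count so no token has zero probability
        log_lik = {
            tok: math.log((token_counts[label][tok] + alpha) / (total + alpha * len(vocab)))
            for tok in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict(model, doc):
    """Return the class with the highest log score; tokens outside the vocabulary are skipped."""
    scores = {
        label: log_prior + sum(log_lik[tok] for tok in doc if tok in log_lik)
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

train_docs = [["win", "free", "prize"], ["free", "entry", "win"],
              ["ok", "see", "you"], ["see", "you", "later"], ["ok", "later"]]
train_labels = [1, 1, 0, 0, 0]  # 1 = spam, 0 = ham
model = fit_multinomial_nb(train_docs, train_labels)
print(predict(model, ["free", "prize"]))     # 1
print(predict(model, ["see", "you"]))        # 0
print(predict(model, ["cactus", "feline"]))  # 0
```

The last call contains only unseen words, so the score falls back to the class priors and the majority class (ham) wins.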

In [35]:
#test using roc_auc_score

roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.98589322144123448
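
The score above is computed from the predicted spam probabilities (`predict_proba(X_test)[:,1]`), not from hard 0/1 predictions, because ROC AUC measures how well the scores rank positives above negatives. Here is a small sketch of that interpretation; the function `roc_auc` is a hypothetical pairwise version written for clarity, not for speed.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

A value of 1.0 means every spam message scored higher than every ham message, and 0.5 is what random scores would give on average.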

In [52]:
#try with some fresh spam
spam=np.array(["Smexy Asian HOTGURLS in yoru area text SEX to 338382 18+ only"])
spam1=np.array(["A cactus in feline"])
spam2=np.array(["Do you want your dick to be in a million womens screensavers?"])
spam3=np.array(["Gigantic Meat Poles in Action for $1"])

spam_vector = vectorizer.transform(spam)
spam_vector1 = vectorizer.transform(spam1)
spam_vector2 = vectorizer.transform(spam2)
spam_vector3 = vectorizer.transform(spam3)

print(clf.predict(spam_vector))
print(clf.predict(spam_vector1))
print(clf.predict(spam_vector2))
print(clf.predict(spam_vector3))

[ 1.]
[ 0.]
[ 0.]
[ 0.]


I tried several spammy subject lines above, taken from (http://www.cracked.com/article_17270_100-unintentionally-hilarious-spam-subject-lines.html), and most of them were not flagged as spam: three of the four were classified as ham. I suspect this is due to the relatively small training set and to misspelled or unseen words. The test lines are also email subject lines, whereas the model was trained on SMS messages. (Capitalization is unlikely to be a factor, since the vectorizer lowercases its input.) I think that a larger training set and some broader filtering heuristics (e.g. checking who sent the message, or combining the classifier with a dictionary of well-known spam "trigger words" such as https://blog.hubspot.com/blog/tabid/6307/bid/30684/The-Ultimate-List-of-Email-SPAM-Trigger-Words.aspx#sm.00010dzg7m98jf4vw0l1ih5f0qlnl) might address this failure to generalize.
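
Here is a rough sketch of the two ideas in that paragraph: checking how much of a new message the model has seen before, and backing the classifier up with a trigger-word list. The trigger words, the toy vocabulary, and the helpers `vocabulary_coverage` and `flag_with_triggers` are all hypothetical. In this notebook the real vocabulary would come from `vectorizer.vocabulary_`.

```python
import re

# Hypothetical trigger words for illustration; not taken from the linked list
TRIGGER_WORDS = {"free", "win", "prize", "cash", "urgent"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def vocabulary_coverage(text, vocabulary):
    """Fraction of a message's tokens that the model saw during training."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok in vocabulary for tok in tokens) / len(tokens)

def flag_with_triggers(text, model_prediction, min_hits=1):
    """Flag as spam if the model says so or enough trigger words appear."""
    hits = sum(tok in TRIGGER_WORDS for tok in tokenize(text))
    return 1 if model_prediction == 1 or hits >= min_hits else 0

# Toy stand-in for the fitted vectorizer's vocabulary
vocabulary = {"text", "only", "area", "free", "win", "you", "want"}
print(vocabulary_coverage("A cactus in feline", vocabulary))  # 0.0
print(flag_with_triggers("WIN a FREE prize now", 0))          # 1
print(flag_with_triggers("A cactus in feline", 0))            # 0
```

A coverage near zero suggests the classifier is effectively guessing from the class prior. The last call shows that a trigger list alone would not rescue every message either.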