# Naive Bayes model to classify SMS as either spam or ham

Andrew Peabody <apeab2@uis.edu>

1. Converts the labels ham and spam to a binary indicator variable
2. Converts the message text (txt) to a sparse matrix of TF-IDF vectors
3. Fits a Naive Bayes classifier
4. Measures performance with roc_auc_score (area under the ROC curve)
5. Tests the classifier on a made-up spam SMS

In [36]:
import pandas as pd
import nltk
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [37]:
df = pd.read_csv("SMSSpamCollection", sep='\t', names=['spam', 'txt'])

In [38]:
df.head()

| | spam | txt |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [39]:
#Convert spam to binary indicator
df['spam'] = pd.get_dummies(df.spam)['spam']

#I prefer working with booleans
df.spam = df.spam.astype(bool)

In [40]:
df.head()

| | spam | txt |
|---|---|---|
| 0 | False | Go until jurong point, crazy.. Available only ... |
| 1 | False | Ok lar... Joking wif u oni... |
| 2 | True | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | False | U dun say so early hor... U c already then say... |
| 4 | False | Nah I don't think he goes to usf, he lives aro... |


In [41]:
#TFIDF Vectorizer, just like before
stopset = set(stopwords.words('english'))
#Newer scikit-learn versions expect stop_words as a list rather than a set
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=list(stopset))

In [42]:
#Dependent variable will be spam as true
y = df.spam

In [43]:
#Convert df.txt from text to features
X = vectorizer.fit_transform(df.txt)

In [44]:
#5572 observations x 8587 unique words.
print(y.shape)
print(X.shape)

(5572L,)
(5572, 8587)
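To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting on a toy corpus. It assumes plain whitespace tokenisation and no stop-word removal, so it is a simplification of what the vectorizer above does. The smoothed IDF formula, ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, follows my reading of scikit-learn's documented defaults. The function name `tfidf` and the example messages are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy messages, not taken from the SMS data set.
docs = ["free entry win cup", "ok see you later", "win free prize"]
vocab, rows = tfidf(docs)
print(vocab)
print(rows[0])
```

Terms that occur in fewer documents (such as "entry") end up with a larger weight than terms shared across documents (such as "free"), which is the whole point of the IDF factor.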


In [45]:
#Train/test split as usual (by default 25% of the rows are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
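For intuition, a seeded shuffle-and-cut split can be sketched with the standard library alone. This is only an illustration of the idea: the helper name `split_indices` is hypothetical, and because it uses Python's own random number generator it will not reproduce the exact rows that train_test_split selects. The 25% test share is the scikit-learn default.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=42):
    """Shuffle row indices reproducibly and cut off a test share."""
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5572)
print(len(train_idx))
print(len(test_idx))
```

With 5572 messages this leaves 4179 rows for training and 1393 for testing.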

In [46]:
#Train a naive_bayes classifier
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
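What the classifier does internally can be sketched in a few lines of plain Python. The version below is a simplified stand-in, not scikit-learn's implementation: it works on raw word counts instead of the TF-IDF weights used above, tokenises on whitespace, and ignores words it never saw in training. It does use the same additive smoothing value, alpha=1.0, that appears in the output above. The names `train_nb` and `predict_nb` and the toy messages are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with additive (Laplace) smoothing."""
    vocab = {tok for doc in docs for tok in doc.lower().split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            # Log prior: share of training documents with this label.
            "log_prior": math.log(class_docs[label] / float(len(docs))),
            # Smoothed log likelihood of each vocabulary word given the label.
            "log_lik": {w: math.log((word_counts[label][w] + alpha) / denom)
                        for w in vocab},
        }
    return model

def predict_nb(model, text):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in text.lower().split():
            # Assumption: words outside the training vocabulary are skipped.
            if tok in params["log_lik"]:
                score += params["log_lik"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set: True means spam, False means ham.
docs = ["free entry win prize", "win free cash now",
        "ok see you later", "are you home now"]
labels = [True, True, False, False]
model = train_nb(docs, labels)
print(predict_nb(model, "free prize now"))
print(predict_nb(model, "see you later"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a single unseen word from driving a class probability to zero.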

In [47]:
#AUC
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.98589322144123448
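One way to read that number: ROC AUC equals the probability that a randomly chosen spam message gets a higher score than a randomly chosen ham message, with ties counted as half. The sketch below computes it that way on a handful of made-up scores. The function name `auc_by_pairs` is hypothetical, and the pairwise loop is meant for intuition only; it is far too slow for large data sets.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted spam probabilities.
y_true = [True, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = auc_by_pairs(y_true, scores)
print(auc)  # 5 of the 6 pairs are ordered correctly, so about 0.83
```

A value near 0.99, as measured above, means almost every spam message is scored higher than almost every ham message.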

In [50]:
# Check with a test SMS

test_sms_array=np.array(["Free entry to win fre stuf"])

test_sms_vector = vectorizer.transform(test_sms_array)

print(clf.predict(test_sms_vector))

[ True]


Yup, that looks like spam to me!