# SMS Spam Classification with Naive Bayes
### By Jacob Metzger
### Due 04/04/2016

### Goal:  Train a Naive Bayes model to classify future SMS messages as either spam or ham.

Steps:

1.  Convert the words ham and spam to a binary indicator variable (0/1)

2.  Convert the text to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score

###### Note: This workbook is based on the Week8HW course notebook and the "Using Naive Bayes for Sentiment Analysis" lecture at https://youtu.be/oXZThwEF4r0

In [51]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later releases
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [52]:
df= pd.read_csv("SMSSpamCollection",sep='\t', names=['spam', 'txt'])

In [53]:
df.shape

(5572, 2)

In [54]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [55]:
df.spam.value_counts() #Note no missing values for spam

ham     4825
spam     747
Name: spam, dtype: int64

In [56]:
#Convert spam column to binary variable
df.spam = pd.get_dummies(df.spam)['spam'] #Just replace, since there are no missing values.
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
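
An explicit mapping is one alternative to `pd.get_dummies` for this step: it makes the 0/1 assignment visible and yields integers regardless of the pandas version. A minimal sketch on a small made-up frame (`toy` and `label_map` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the notebook.
toy = pd.DataFrame({
    "spam": ["ham", "spam", "ham"],
    "txt": ["see you at 5", "win a free prize now", "ok thanks"],
})

# Explicit label mapping: ham -> 0, spam -> 1.
label_map = {"ham": 0, "spam": 1}
toy["spam"] = toy["spam"].map(label_map)

# Any label missing from label_map would become NaN, so check for that.
assert toy["spam"].notna().all()
print(toy["spam"].tolist())  # [0, 1, 0]
```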


In [57]:
#Convert text to sparse matrix
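# stop_words is passed as a list; recent scikit-learn releases reject a set here.
# The NLTK stop word list must be available locally (nltk.download('stopwords')).
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'),
                             strip_accents='ascii',
                             ngram_range=(1, 1)  # Unigrams only; single words appeared to work best on this set.
                            )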
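
To make the TF-IDF step concrete, here is a small sketch on three made-up messages with the vectorizer's default settings (no stop word list, unlike the cell above). It shows the shape of the sparse matrix and that, within a message, a word that is rarer across the corpus gets the larger weight. The names `docs` and `toy_vectorizer` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages, for illustration only.
docs = [
    "free prize call now",
    "call me now",
    "lunch at noon",
]

toy_vectorizer = TfidfVectorizer()
tfidf = toy_vectorizer.fit_transform(docs)  # SciPy sparse matrix

# One row per message, one column per distinct token.
print(tfidf.shape)  # (3, 8)

# "call" appears in two messages and "free" in one, so within the first
# message the rarer word "free" receives the larger weight.
vocab = toy_vectorizer.vocabulary_
row0 = tfidf[0].toarray().ravel()
print(row0[vocab["free"]] > row0[vocab["call"]])  # True
```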

In [58]:
y = df.spam
X = vectorizer.fit_transform(df.txt)
print(X.shape, y.shape)

(5572, 8605) (5572,)


In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [60]:
nbClassifier = naive_bayes.MultinomialNB()
nbClassifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
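
The cells above fit the vectorizer on every message before splitting, so the vocabulary and IDF weights are computed with the test messages included. One possible variant, sketched here on made-up data, splits the raw text first and wraps the vectorizer and classifier in a pipeline so both are fitted on the training part only. The corpus, the labels, and the `stratify` and `test_size` choices are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus: 1 = spam, 0 = ham.
texts = [
    "win a free prize now", "free entry text to win", "claim your free cash",
    "urgent reply to win cash", "are you coming home", "see you at lunch",
    "call me when you land", "thanks for the ride",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts.
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training messages only, then the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(txt_train, lab_train)

# Words seen only in the test messages are ignored at transform time.
spam_proba = model.predict_proba(txt_test)[:, 1]
print(len(txt_test), spam_proba.shape)
```

Because the pipeline is a single estimator, it can also be passed to `cross_val_score` with the raw text, which refits the vectorizer inside each fold.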

In [61]:
roc_auc_score(y_test, nbClassifier.predict_proba(X_test)[:, 1])

0.98558587451336743
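
`roc_auc_score` expects a score for the positive class, which is why the call above passes column 1 of `predict_proba` rather than hard 0/1 predictions. The result can be read as the fraction of (spam, ham) pairs in which the spam message receives the higher score. A small sketch of that reading in plain Python; `pairwise_auc` is a hypothetical helper, not a scikit-learn function, and the labels and scores are made up:

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_demo = [0, 0, 1, 1]
score_demo = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_demo, score_demo))  # 0.75
```

The quadratic loop is only meant to show the definition; on real data `roc_auc_score` is the practical choice.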

In [62]:
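# Get a 10-fold cross-validated ROC AUC with an uncertainty estimate
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
from scipy.stats import sem
roc_scores = cross_val_score(nbClassifier, X, y, cv=10, scoring='roc_auc')
# 2.262 is the two-sided 95% critical value of Student's t with 9 degrees of freedom (10 folds).
# sem(..., ddof=0) uses the population standard deviation; ddof=1 would give a slightly wider interval.
(roc_scores.mean(), 2.262*sem(roc_scores, ddof=0))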

(0.98897251518160778, 0.0054713200112414548)
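
The constant 2.262 in the cell above can be derived rather than hard-coded. A minimal sketch, assuming ten fold scores and a 95% two-sided interval; the `fold_scores` values are made up for illustration and are not the notebook's results:

```python
import numpy as np
from scipy.stats import sem, t

# Made-up fold scores, for illustration only (not the notebook's results).
fold_scores = np.array([0.98, 0.99, 0.97, 0.99, 0.98, 0.99, 0.98, 0.99, 0.97, 0.99])
n_folds = len(fold_scores)

# Two-sided 95% critical value of Student's t with n_folds - 1 degrees of freedom.
t_crit = t.ppf(0.975, df=n_folds - 1)
print(round(t_crit, 3))  # 2.262

# ddof=0 reproduces the notebook's call; ddof=1 is the sample-based estimate.
half_width_ddof0 = t_crit * sem(fold_scores, ddof=0)
half_width_ddof1 = t_crit * sem(fold_scores, ddof=1)
print(fold_scores.mean(), half_width_ddof0, half_width_ddof1)
```

Cross-validation folds share most of their training data, so the fold scores are not independent and an interval built this way is best read as a rough indication of spread.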

In [63]:
# Sanity check on a few clear-cut examples (expected label in the trailing comment).
testText1 = """Reply now to get your daily quiz question. To opt out text STOP""" #Spam
testText2 = """Dave, could you go to the store and pick up some milk on the way home?""" #Ham
testText3 = """Text now for free tix! Reply UNSUBSCRIBE to stop""" #Spam
testText4 = """Meet me by the library at 3pm""" #Ham
testTextVec = vectorizer.transform([testText1, testText2, testText3, testText4])

print(nbClassifier.predict(testTextVec))

[ 1.  0.  1.  0.]
