# Final Project - SMS Spam Filtering
### 600.475 - Introduction to Machine Learning
### Archan Patel, et al.

We aim to create a simple binary spam classifier based on the UCI SMS Spam Collection Dataset. 
```
https://www.kaggle.com/uciml/sms-spam-collection-dataset
```
Messages in the dataset are classified as either "ham" (regular, legitimate messages) or "spam".


In [115]:
%matplotlib inline
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
#from textblob import TextBlob
#import cPickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

### Data Input
First we read the data from `data/spam.csv`, as retrieved from Kaggle.
We then partition the data into training, validation, and testing sets; for now, the code uses a single train/test split as a placeholder. Much of the input handling and the conversion to a bag of words was taken from http://radimrehurek.com/data_science_python/
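
A three-way partition like the one described above could be sketched as follows. This is only an illustration under stated assumptions: the toy `texts` and `labels` lists are hypothetical stand-ins for the real messages, and the 60/20/20 proportions and stratification are choices made for the example, not something the notebook prescribes.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the message texts and labels.
texts = [f"message {i}" for i in range(100)]
labels = ["spam" if i % 5 == 0 else "ham" for i in range(100)]

# First hold out 20% as the test set, stratifying to keep the class ratio.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=10, stratify=labels
)

# Then split the remainder 75/25: 60% training and 20% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=10, stratify=y_rest
)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)
```

The validation set would be used for choosing between models or settings, leaving the test set untouched until the final evaluation.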

In [64]:
# Read the CSV, skipping its header row and naming the columns ourselves.
# r1, r2 and r3 are extra columns that are dropped immediately.
messages = pd.read_csv('./data/spam.csv', skiprows=1, names=["class", "text", "r1", "r2", "r3"], encoding='latin-1')
# Keep only the label and the message text.
messages = messages.drop(columns=["r1", "r2", "r3"])

#Add a length column
#messages['length'] = messages['text'].map(lambda text: len(text))
messages.head()

|   | class | text |
| ------------- |:-------------:| ------------- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [152]:
#TO-DO: add a feature for the size of the string

#Below code adds length to the chart above.
#print("Mean of ham length: "+str(np.mean(messages[messages['class']=='ham']['length'])))
#print("Mean of spam length: "+str(np.mean(messages[messages['class']=='spam']['length'])))
#messages.length.plot(bins=20, kind='hist')
#messages[messages['class']=='ham']['length']
#messages.hist(column='length', by='class', bins=50)


# Split into training data and held-out test data (80/20)
# TO-DO: switch to k-fold cross-validation; this single split is only a placeholder
X_train,X_test,y_train,y_test = train_test_split(messages["text"],messages["class"], test_size = 0.2, random_state = 10)
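
The k-fold cross-validation mentioned in the TO-DO could look roughly like the sketch below. It is not the notebook's method: the tiny corpus is hypothetical, and wrapping the vectorizer and classifier in a pipeline is an assumption made so that the vocabulary is re-fit on each training fold rather than on all of the data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for messages["text"] and messages["class"].
texts = [
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent free ringtone offer", "are we meeting today", "see you at lunch",
    "ok call me later", "i am on my way home", "thanks for dinner",
    "running late sorry",
]
labels = ["spam"] * 4 + ["ham"] * 6

# The pipeline fits the vectorizer inside each fold, avoiding vocabulary leakage.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the ham/spam ratio similar across folds.
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=10)
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")
print(scores.mean())
```

On the real dataset a larger `n_splits` (for example 5 or 10) would be the more usual choice; 2 is used here only because the toy corpus is so small.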

In [160]:
# Now we create a bag-of-words model:
# convert a collection of text documents to a matrix of token counts.
from sklearn.feature_extraction.text import TfidfVectorizer  # alternative vectorizer (see docs); not as good as CountVectorizer
vect = CountVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')  # word n-grams of length 1 to 3, English stop words removed
vect.fit(X_train)
X_train_df = vect.transform(X_train)
X_test_df = vect.transform(X_test)

### Data Format
Here we can determine a representation for raw text of the text messages. The methods used for the representation of text will determine the features of our data.

The following representations were used:

| Representation       | How it works   |
| ------------- |:-------------:|
| Bag of Words      | [stuff] |
|       | [stuff]      |
|  | [stuff]      |
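
To make the bag-of-words representation concrete, here is a small sketch of how `CountVectorizer` turns raw text into count features. The two example sentences are made up, and the n-gram range of 1 to 2 is chosen only to keep the vocabulary small; the notebook itself uses 1 to 3 with English stop words removed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used only for illustration.
docs = ["free entry free prize", "call me later"]

# Each distinct unigram and bigram becomes one feature (one matrix column).
vect = CountVectorizer(analyzer="word", ngram_range=(1, 2))
counts = vect.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index in the count matrix.
free_col = vect.vocabulary_["free"]
bigram_col = vect.vocabulary_["free prize"]

print(counts.shape)            # (number of documents, number of n-grams)
print(counts[0, free_col])     # how often "free" occurs in the first message
print(counts[0, bigram_col])   # how often "free prize" occurs in the first message
```

Word order is discarded except for what the n-grams capture, which is why longer n-grams can help distinguish phrases that share the same individual words.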

# Multinomial Naive Bayes

In [161]:
prediction = dict()  # Maps each model name to its predicted labels for the test set.

#really simple to create a model and train it.
model = MultinomialNB()
model.fit(X_train_df, y_train)
prediction["MultinomialNB"] = model.predict(X_test_df)

accuracy_score(y_test, prediction["MultinomialNB"])

0.98923766816143499

In [143]:
# False positives: actually ham, predicted spam.
# Labels are the strings "ham"/"spam", so "<" compares them lexicographically ("ham" < "spam").
X_test[y_test < prediction["MultinomialNB"]]

693     Will purchase d stuff today and mail to you. D...
5475    Dhoni have luck to win some big title.so we wi...
4860                               Nokia phone is lovly..
Name: text, dtype: object

In [144]:
# False negatives: predicted ham, actually spam
X_test[y_test > prediction["MultinomialNB"] ]

5035    You won't believe it but it's true. It's Incre...
3130    LookAtMe!: Thanks for your purchase of a video...
2002    TheMob>Yo yo yo-Here comes a new selection of ...
68      Did you hear about the new \Divorce Barbie\"? ...
2662    Hello darling how are you today? I would love ...
4211    Missed call alert. These numbers called but le...
3572    You won't believe it but it's true. It's Incre...
4912    Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry ...
3979                                   ringtoneking 84484
Name: text, dtype: object

In [142]:
#Confusion matrix
conf_mat = confusion_matrix(y_test, prediction['MultinomialNB'])
conf_mat

array([[962,   3],
       [  9, 141]])
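
The accuracy reported earlier can be recovered from this confusion matrix, along with precision and recall for the spam class. A minimal sketch, assuming rows are true labels and columns are predicted labels in sorted order ("ham", then "spam"), which is how scikit-learn lays out the matrix by default:

```python
import numpy as np

# Confusion matrix copied from the output above.
conf_mat = np.array([[962, 3],
                     [9, 141]])

# With "spam" as the positive class: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = conf_mat.ravel()

accuracy = (tp + tn) / conf_mat.sum()   # (141 + 962) / 1115
precision = tp / (tp + fp)              # 141 / 144
recall = tp / (tp + fn)                 # 141 / 150

print(accuracy, precision, recall)
```

The 3 false positives and 9 false negatives correspond to the two lists of misclassified messages shown above.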

In [94]:
model = BernoulliNB()
model.fit(X_train_df, y_train)
prediction["BernoulliNB"] = model.predict(X_test_df)

accuracy_score(y_test, prediction["BernoulliNB"])

0.97309417040358748

In [95]:
conf_mat = confusion_matrix(y_test, prediction['BernoulliNB'])
conf_mat

array([[965,   0],
       [ 30, 120]])
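
One difference between the two Naive Bayes variants is that `BernoulliNB` only looks at whether a feature is present, not how often it occurs: with its default `binarize=0.0`, any count above zero is treated as 1. The sketch below illustrates this on a hypothetical count matrix; it does not explain the specific results above, only the modelling difference.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical token counts for four messages and three features.
X_counts = np.array([[3, 0, 1],
                     [0, 2, 0],
                     [1, 1, 0],
                     [0, 0, 4]])
y = np.array(["spam", "ham", "ham", "spam"])

# Presence/absence version of the same matrix.
X_binary = (X_counts > 0).astype(int)

# BernoulliNB binarizes its input internally, so both fits give the same model.
on_counts = BernoulliNB().fit(X_counts, y)
on_binary = BernoulliNB().fit(X_binary, y)
same_model = np.allclose(on_counts.feature_log_prob_, on_binary.feature_log_prob_)
print(same_model)
```

`MultinomialNB`, by contrast, uses the counts themselves, so repeated words carry more weight in its estimates.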

# Classification via multi-layer Perceptron

In [145]:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier()
model.fit(X_train_df, y_train)
prediction["NeuralNet"] = model.predict(X_test_df)

accuracy_score(y_test, prediction["NeuralNet"])


# Conjecture (unverified): this may be close to the Bayes risk unless additional features are added

0.96771300448430497

In [86]:
# False negatives: predicted ham, actually spam
X_test[y_test > prediction["NeuralNet"] ]

5035    You won't believe it but it's true. It's Incre...
1153    1000's of girls many local 2 u who r virgins 2...
3130    LookAtMe!: Thanks for your purchase of a video...
1506    Thanks for the Vote. Now sing along with the s...
2002    TheMob>Yo yo yo-Here comes a new selection of ...
68      Did you hear about the new \Divorce Barbie\"? ...
2662    Hello darling how are you today? I would love ...
4211    Missed call alert. These numbers called but le...
3572    You won't believe it but it's true. It's Incre...
4912    Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry ...
2913    Sorry! U can not unsubscribe yet. THE MOB offe...
3979                                   ringtoneking 84484
4014    You will be receiving this week's Triple Echo ...
Name: text, dtype: object