# Spam Detection via Machine Learning
Learning exercise as part of the Udacity Machine Learning Nanodegree. The goal is to detect spam SMS messages using the sample data set available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection.

This follows along with this walkthrough: https://github.com/udacity/machine-learning/blob/master/projects/practice_projects/naive_bayes_tutorial/Naive_Bayes_tutorial.ipynb

First we load the data set into a pandas data frame.

In [4]:
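import pandas as pd

# The file is tab-separated and has no header row, so the column names are supplied here.
df = pd.read_table('spam collection\\SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])
print(df['sms_message'][2])  # print one full message (row 2 is a spam example)
df.head()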

Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Convert the strings in the 'label' column to integers (ham = 0, spam = 1). This is recommended because machine learning algorithms generally work on numerical values.

In [5]:
print (df.shape)
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

(5572, 2)


Unnamed: 0,label,sms_message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Convert all text to lower case.

In [6]:
import time

startTime = time.time()
df['sms_message'] = df['sms_message'].str.lower()
elapsedTime = time.time() - startTime
print ("time:", elapsedTime, "s")
df.head()

time: 0.00700068473815918 s


Unnamed: 0,label,sms_message
0,0,"go until jurong point, crazy.. available only ..."
1,0,ok lar... joking wif u oni...
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor... u c already then say...
4,0,"nah i don't think he goes to usf, he lives aro..."


Replace punctuation with space characters and reduce multiple space characters to one. In the code below this is performed in one line of code. To make the code more readable I'd like to split the long command into two. The question is: does this impact performance? Let's do a test. First we do both actions in one line.

In [7]:
import string

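# translation table that maps every punctuation character to a space
rep = str.maketrans(string.punctuation, ' '*len(string.punctuation))
a = df  # note: this is an alias, not a copy, so df is modified as well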
print (a['sms_message'][0])

startTime = time.time()
a['sms_message'] = df['sms_message'].str.translate(rep).replace(' +', ' ', regex=True).str.strip()
elapsedTime1 = time.time() - startTime

print (a['sms_message'][0])
print ("time:", elapsedTime1, "s")

go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
time: 0.07900452613830566 s


Next we split the work into two lines of code: first punctuation replacement together with the collapsing of repeated spaces, then the stripping of leading and trailing white space.

In [8]:
a = df
print (a['sms_message'][0])

startTime = time.time()
# replace punctuation with spaces and collapse repeated spaces
a['sms_message'] = df['sms_message'].str.translate(rep).replace(' +', ' ', regex=True)
# strip leading and trailing white space
a['sms_message'] = a['sms_message'].str.strip()
elapsedTime2 = time.time() - startTime

print (a['sms_message'][0])
print ("time:", elapsedTime2, "s")
d = (elapsedTime1-elapsedTime2)*100/elapsedTime2
print ("performance difference: %.2f %%" %d)

df = a

go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
time: 0.06700372695922852 s
performance difference: 17.91 %


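Funny. In this run the two-line version looks about 18% faster. That should not be read as a real speed-up, though: each variant was timed only once with time.time(), and because `a = df` creates an alias rather than a copy, the second run worked on text that had already been cleaned by the first.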
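
A more robust comparison could look like the sketch below. It is only an illustration under a few assumptions: it uses plain Python strings and `re.sub` instead of the pandas string methods, the `messages` list is a small made-up sample rather than the real data set, and the helper names (`clean_one_step`, `clean_two_steps`, `best_time`) are hypothetical. The idea is to repeat each measurement with `timeit.repeat` and to start every run from the same untouched input.

```python
import re
import string
import timeit

# Same kind of translation table as above: every punctuation character becomes a space.
rep = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Made-up sample input (assumption); the real data set has 5572 messages.
messages = ["go until jurong point, crazy.. available only in bugis...",
            "ok lar... joking wif u oni..."] * 1000

def clean_one_step(text):
    # punctuation -> spaces, collapse runs of spaces, strip both ends, all chained
    return re.sub(' +', ' ', text.translate(rep)).strip()

def clean_two_steps(text):
    # identical work, split into two statements for readability
    text = re.sub(' +', ' ', text.translate(rep))
    return text.strip()

def best_time(func, repeats=5):
    # Every repeat reads the unchanged `messages` list, so no run sees pre-cleaned text.
    # The minimum over several repeats is less sensitive to background noise.
    return min(timeit.repeat(lambda: [func(m) for m in messages],
                             number=1, repeat=repeats))

t1 = best_time(clean_one_step)
t2 = best_time(clean_two_steps)
print("one step: %.4f s, two steps: %.4f s" % (t1, t2))
```

Both functions do the same work, so any remaining difference between the two timings is likely to be measurement noise.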

Add a column that contains the word frequencies.

In [9]:
from collections import Counter

# split each message on spaces and count how often every word occurs
df['word_frequency'] = df['sms_message'].str.split(' ').apply(Counter)
print (df['sms_message'][2])
print (df['word_frequency'][2])
df.head()

free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry question std txt rate t c s apply 08452810075over18 s
Counter({'to': 3, 'entry': 2, 'fa': 2, 's': 2, 'free': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1, 'win': 1, 'cup': 1, 'final': 1, 'tkts': 1, '21st': 1, 'may': 1, '2005': 1, 'text': 1, '87121': 1, 'receive': 1, 'question': 1, 'std': 1, 'txt': 1, 'rate': 1, 't': 1, 'c': 1, 'apply': 1, '08452810075over18': 1})


Unnamed: 0,label,sms_message,word_frequency
0,0,go until jurong point crazy available only in ...,"{'go': 1, 'until': 1, 'jurong': 1, 'point': 1,..."
1,0,ok lar joking wif u oni,"{'ok': 1, 'lar': 1, 'joking': 1, 'wif': 1, 'u'..."
2,1,free entry in 2 a wkly comp to win fa cup fina...,"{'free': 1, 'entry': 2, 'in': 1, '2': 1, 'a': ..."
3,0,u dun say so early hor u c already then say,"{'u': 2, 'dun': 1, 'say': 2, 'so': 1, 'early':..."
4,0,nah i don t think he goes to usf he lives arou...,"{'nah': 1, 'i': 1, 'don': 1, 't': 1, 'think': ..."


Count the word frequencies with scikit-learn's CountVectorizer.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

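# Note: stop_words={'english'} filters only the literal word "english".
# Passing the string stop_words='english' instead would enable scikit-learn's
# built-in English stop word list (and shrink the vocabulary reported below).
count_vector = CountVectorizer(stop_words = {'english'})
count_vector.fit(df['sms_message'])
doc_array = count_vector.transform(df['sms_message']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; releases before 1.0 still need it
frequency_matrix = pd.DataFrame(doc_array, columns = count_vector.get_feature_names_out())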
frequency_matrix.shape

(5572, 8710)
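
To make the shape of this frequency matrix more concrete, here is a minimal sketch that builds the same kind of matrix by hand with `Counter`. The four `documents` are made up for illustration, and the sketch is not a reimplementation of CountVectorizer: among other differences, CountVectorizer's default tokenizer drops single-character tokens, while this version keeps every space-separated word.

```python
from collections import Counter

# Made-up, already lower-cased and punctuation-free documents (assumption).
documents = ["hello how are you",
             "win money win from home",
             "call me now",
             "hello call hello you tomorrow"]

# One Counter per document, like the word_frequency column above.
counts = [Counter(doc.split()) for doc in documents]

# The sorted vocabulary becomes the column labels of the matrix.
vocabulary = sorted({word for c in counts for word in c})

# One row per document, one column per vocabulary word.
# A Counter returns 0 for words that do not occur in a document.
frequency_rows = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in frequency_rows:
    print(row)
```

The result has one row per document and one column per distinct word, which is how 5572 messages with 8710 distinct tokens give the shape (5572, 8710).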

Split data into training and test sets.

In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], random_state = 1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)
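
The sizes printed above follow from scikit-learn's default behavior: when neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the samples for testing. The arithmetic can be checked with a small sketch (the variable names are only illustrative):

```python
import math

n_samples = 5572   # number of rows, from df.shape above
test_size = 0.25   # scikit-learn's default when no size is passed

n_test = math.ceil(test_size * n_samples)  # the test share is rounded up
n_train = n_samples - n_test               # the remainder goes to training

print(n_train, n_test)  # 4179 1393
```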


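Apply Bag of Words processing. The vectorizer is fitted on the training set only, and the vocabulary learned there is then used to transform both the training and the test set.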

In [41]:
count_vector = CountVectorizer()
X_train_frequency_matrix = count_vector.fit_transform(X_train)
X_test_frequency_matrix = count_vector.transform(X_test)
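
The difference between `fit_transform` on the training data and `transform` on the test data can be illustrated with a hand-written sketch. The functions `fit_vocabulary` and `transform` below are hypothetical stand-ins, not scikit-learn's implementation, and the documents are made up. The point is that a word which only appears in the test data has no column and is therefore ignored.

```python
from collections import Counter

def fit_vocabulary(docs):
    # "fit": learn the set of known words from the training documents only
    return sorted({word for doc in docs for word in doc.split()})

def transform(docs, vocabulary):
    # "transform": count only words that are part of the learned vocabulary
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ["free entry win", "ok see you"]   # made-up training messages
test_docs = ["win free prize"]                  # "prize" never occurs in training

vocabulary = fit_vocabulary(train_docs)
X_train_rows = transform(train_docs, vocabulary)
X_test_rows = transform(test_docs, vocabulary)

print(vocabulary)
print(X_test_rows)
```

Fitting on the training set alone keeps information from the test messages out of the features, so the test set stays a fair stand-in for unseen data.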

Bayes' theorem implementation from scratch, starting with a simple example.

In [4]:
P_D = 0.01     # probability of a person having diabetes
P_negD = 1 - P_D # probability of a person NOT having diabetes
P_Pos_D = 0.9  # probability of a positive test result when having diabetes
               # = sensitivity = true positive rate
P_Neg_negD = 0.9 # probability of a negative test result when not having diabetes
               # = specificity = true negative rate

P_Pos = (P_D * P_Pos_D) + (P_negD * (1 - P_Neg_negD)) # probability of getting a positive test result
print (P_Pos)

P_D_Pos = (P_D * P_Pos_D)/P_Pos # probability of having diabetes given a positive test result
print (P_D_Pos)

P_negD_Pos = (P_negD *(1 - P_Neg_negD))/P_Pos # probability of not having diabetes given a positive test result
print (P_negD_Pos)

0.10799999999999998
0.08333333333333336
0.9166666666666666
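
The same calculation can be wrapped in a small function to see how strongly the result depends on the prior. This is a minimal sketch; the function name `posterior_given_positive` and the additional prior values are chosen here only for illustration.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(condition | positive test) using Bayes' theorem."""
    false_positive_rate = 1 - specificity
    # total probability of a positive result: true positives plus false positives
    p_pos = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_pos

# Same sensitivity and specificity as above, with different priors.
for prior in (0.01, 0.1, 0.5):
    print(prior, round(posterior_given_positive(prior, 0.9, 0.9), 4))
```

With a prior of 0.01 this reproduces the value of about 0.083 from above. Even a fairly accurate test gives a low posterior when the condition is rare, because false positives from the large healthy group outnumber the true positives.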


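Naive Bayes implementation from scratch. Here P_J_freedom, for example, is the probability that J uses the word 'freedom', and P_J is the prior probability of J. The goal is the probability that J or G is the source of a text containing 'freedom' and 'immigration'.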

In [8]:
P_J_freedom = 0.1
P_J_immigration = 0.1
P_J_environment = 0.8
P_J = 0.5

P_G_freedom = 0.7
P_G_immigration = 0.2
P_G_environment = 0.1
P_G = 0.5

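# joint probability of each source together with the words 'freedom' and 'immigration',
# treating the words as independent given the source (the "naive" assumption)
P_J_text = P_J * P_J_freedom * P_J_immigration
P_G_text = P_G * P_G_freedom * P_G_immigration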

print (P_J_text)
print (P_G_text)

P_freedom_immigration = P_J_text + P_G_text

print (P_freedom_immigration)

P_J_freedom_immigration = (P_J * P_J_freedom * P_J_immigration) / P_freedom_immigration
P_G_freedom_immigration = (P_G * P_G_freedom * P_G_immigration) / P_freedom_immigration

print (P_J_freedom_immigration)
print (P_G_freedom_immigration)

0.005000000000000001
0.06999999999999999
0.075
0.06666666666666668
0.9333333333333332
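
The calculation above generalizes to any number of classes and words. The following is a minimal sketch, assuming the per-word probabilities are stored in nested dictionaries; the function name `naive_bayes_posteriors` and the data layout are illustrative choices, not part of the walkthrough.

```python
def naive_bayes_posteriors(priors, likelihoods, words):
    """Return P(class | words) for every class, assuming word independence."""
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for word in words:
            # naive assumption: multiply the per-word probabilities
            p *= likelihoods[cls][word]
        joint[cls] = p
    evidence = sum(joint.values())  # P(words), summed over all classes
    return {cls: p / evidence for cls, p in joint.items()}

# Same numbers as in the cell above.
priors = {'J': 0.5, 'G': 0.5}
likelihoods = {
    'J': {'freedom': 0.1, 'immigration': 0.1, 'environment': 0.8},
    'G': {'freedom': 0.7, 'immigration': 0.2, 'environment': 0.1},
}

result = naive_bayes_posteriors(priors, likelihoods, ['freedom', 'immigration'])
print(result)
```

Dividing by the summed joint probabilities normalizes the results, so the posteriors for J and G add up to 1, matching the values of about 0.067 and 0.933 printed above.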
