# Probability Project: Spam Filter for SMS Messages

## Goal
**Build a Naive Bayes spam filter from scratch (without importing additional packages) to detect spam among SMS text messages.**

## Data
**The data was downloaded from <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection">The UCI Machine Learning Repository</a>. It has two columns: one holds the original message and the other assigns it a label. There are two labels: "spam" and "ham".**

In [10]:
import pandas as pd
import numpy as np
import re

## Process
- Import data
- Data Cleaning
- Compute some constant values for Naive Bayes
- Accuracy Test

**Import data**

In [11]:
sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(sms_spam.shape)
sms_spam.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [12]:
sms_spam.info()
print('\n')
print("Shape of Data: ",sms_spam.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


Shape of Data:  (5572, 2)


In [13]:
#Find out percentage of each label
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Split the data into training and test sets

In [14]:
# Randomize the dataset
data_randomized = sms_spam.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


Check the label proportions in both sets

In [15]:
#Training data
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [16]:
#Test data
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

**Data Cleaning**

In [17]:
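# Replace every non-word character (punctuation) with a space.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
# Lowercase all characters
training_set['SMS'] = training_set['SMS'].str.lower()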
training_set.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |

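To see what these two cleaning steps do to a single message, here is a minimal sketch that applies them to one string with Python's `re` module instead of the pandas `.str` accessor. The helper name `clean_message` is an assumption made for illustration and is not part of the notebook.

```python
import re

def clean_message(text):
    """Hypothetical helper: replace non-word characters with spaces, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

sample = "Ok lar... Joking wif u oni..."
cleaned = clean_message(sample)

# Splitting on whitespace drops the runs of spaces left behind by the punctuation
print(cleaned.split())  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```

Note that `\W` only matches characters other than letters, digits, and the underscore, so digits such as the `2` in "2 ask" survive the cleaning.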

Create a vocabulary list to store all unique words in the training data

In [19]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [20]:
print("Unique Words: ",len(vocabulary))

Unique Words:  7783

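As a small illustration of the same idea, the sketch below builds a vocabulary from three made-up tokenized messages. The data and the names `toy_messages` and `toy_vocabulary` are assumptions for the example. It uses a set comprehension, which collects each distinct word once without building the intermediate list first.

```python
# Toy tokenized messages (made-up data, not taken from the SMS corpus)
toy_messages = [
    ['free', 'entry', 'win', 'free'],
    ['ok', 'see', 'you'],
    ['win', 'cash', 'now'],
]

# Every distinct word appears exactly once; sorting only makes the output stable
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

print(toy_vocabulary)       # ['cash', 'entry', 'free', 'now', 'ok', 'see', 'win', 'you']
print(len(toy_vocabulary))  # 8
```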

Transform the dataset into per-message word counts using the vocabulary

In [21]:
#Create a dictionary to calculate frequency for each unique word
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [22]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

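| | overdose | janarige | 2lands | vegetables | virgin | patent | wipro | persian | mobs | 08712400200 | ... | apples | sliding | occurs | crammed | u2moro | fetching | vco | december | unjalur | suganya |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |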

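The table above is mostly zeros because each message uses only a handful of the vocabulary words. A minimal sketch on two made-up messages shows the shape of the dictionary more clearly: one list per word, one slot per message. The names `toy_messages` and `counts` are assumptions for the example.

```python
# Two toy tokenized messages (made-up data)
toy_messages = [
    ['free', 'win', 'free'],
    ['ok', 'win'],
]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list per word, with one counter per message
counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        counts[word][index] += 1

for word, per_message in counts.items():
    print(word, per_message)
# free [2, 0]
# ok [0, 1]
# win [1, 1]
```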

In [23]:
# Concatenate the messages with their word counts
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

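| | Label | SMS | overdose | janarige | 2lands | vegetables | virgin | patent | wipro | persian | ... | apples | sliding | occurs | crammed | u2moro | fetching | vco | december | unjalur | suganya |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |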


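**Compute some constant values for Naive Bayes**

- P(Spam): proportion of training messages labeled spam
- P(Ham): proportion of training messages labeled ham
- N_spam: number of words in all spam messages
- N_ham: number of words in all ham messages
- N_vocabulary: number of unique words in the training vocabulary
- Alpha: Laplace smoothing constant, so that a word that never appears in one class still gets a nonzero probability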

In [24]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

- Probability of each word $w_i$ given that the message is spam: $P(w_i \mid Spam) = \dfrac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$
- Probability of each word $w_i$ given that the message is ham: $P(w_i \mid Ham) = \dfrac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$

In [25]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

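To make the effect of Laplace smoothing concrete, here is a small worked sketch with made-up counts. The numbers and the name `toy_spam_counts` are assumptions for the example. Words that never occur in spam still receive a small nonzero probability, and the smoothed probabilities over the vocabulary still sum to 1.

```python
alpha = 1
n_spam = 5        # total words across the toy spam messages (made up)
n_vocabulary = 4  # distinct words in the toy vocabulary (made up)

# How often each vocabulary word occurs in spam; 'ok' and 'see' never do
toy_spam_counts = {'free': 3, 'win': 2, 'ok': 0, 'see': 0}

p_word_given_spam = {
    word: (count + alpha) / (n_spam + alpha * n_vocabulary)
    for word, count in toy_spam_counts.items()
}

for word, p in p_word_given_spam.items():
    print(word, round(p, 4))
# free 0.4444
# win 0.3333
# ok 0.1111
# see 0.1111
```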
**Use all the variables to classify new messages**

The classifier:

- Takes a new message (w1, w2, ..., wn) as input.
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn):
  - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), the message is classified as ham.
  - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), the message is classified as spam.
  - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), the algorithm alone cannot decide, and the function returns 'n/a'.

- $P(Spam \mid w_1, w_2, \dots, w_n) \propto P(Spam) \cdot P(w_1 \mid Spam) \cdot P(w_2 \mid Spam) \cdots P(w_n \mid Spam)$
- $P(Ham \mid w_1, w_2, \dots, w_n) \propto P(Ham) \cdot P(w_1 \mid Ham) \cdot P(w_2 \mid Ham) \cdots P(w_n \mid Ham)$

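The reasoning behind these two products is the standard Naive Bayes argument, written out here for reference. Bayes' theorem gives the posterior, and the "naive" assumption treats the words as conditionally independent given the label, so the joint likelihood factors into per-word terms. The denominator is the same for spam and ham, so dropping it does not change which of the two values is larger. The same steps apply with Ham in place of Spam.

$$
P(Spam \mid w_1, \dots, w_n) = \frac{P(Spam) \cdot P(w_1, \dots, w_n \mid Spam)}{P(w_1, \dots, w_n)},
\qquad
P(w_1, \dots, w_n \mid Spam) = \prod_{i=1}^{n} P(w_i \mid Spam)
$$
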
In [47]:
def classify(message):
   
    # Apply the same cleaning as the training data: strip punctuation, lowercase, tokenize
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'n/a'

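Multiplying many probabilities below 1 can shrink toward zero for long messages. A common variant, shown here as a sketch and not as the notebook's method, sums logarithms instead and compares the totals. The decision is the same because the logarithm is monotonic. The toy parameters below are made-up stand-ins for `p_spam`, `p_ham`, `parameters_spam`, and `parameters_ham`, and `classify_log` is a hypothetical name.

```python
import math
import re

# Made-up stand-ins for the notebook's priors and per-word parameters
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'free': 0.05, 'win': 0.04, 'ok': 0.001}
toy_parameters_ham = {'free': 0.002, 'win': 0.001, 'ok': 0.03}

def classify_log(message):
    """Same decision rule as classify, but adds log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'n/a'

print(classify_log('Free!! Win now'))  # spam
print(classify_log('ok'))              # ham
```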
In [48]:
# Spot-check the function on one message from the original data
print(sms_spam.loc[1,'SMS'])
classify(sms_spam.loc[1,'SMS'])

Ok lar... Joking wif u oni...


'ham'

**Test Accuracies**

Add a column for predicted results

In [57]:
test_set['Predicted']=test_set['SMS'].apply(classify)
test_set.head()

| | Label | SMS | Predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [59]:
correct_count=0
total_count=len(test_set)

# iterrows() yields (index, row) pairs; the index is not needed here
for _, row in test_set.iterrows():
    if row['Label'] == row['Predicted']:
        correct_count += 1

incorrect_count=total_count-correct_count
print('Correct Number: ', correct_count)
print('Incorrect Number: ', incorrect_count)
print('Accuracy: ', "{:.2%}".format(correct_count/total_count))

Correct Number:  1100
Incorrect Number:  14
Accuracy:  98.74%

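Because roughly 87% of the messages are ham, a filter that labels everything as ham would already be right about 87% of the time, so it can help to look at spam separately. The sketch below computes accuracy together with spam precision and recall on made-up labels. The lists and variable names are assumptions for illustration and are not the notebook's test results.

```python
# Toy labels (made up, not the notebook's test set)
actual    = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
predicted = ['ham', 'ham', 'spam', 'ham', 'ham',  'ham', 'spam', 'spam']

pairs = list(zip(actual, predicted))
accuracy = sum(a == p for a, p in pairs) / len(pairs)

true_spam = sum(a == 'spam' and p == 'spam' for a, p in pairs)    # spam caught
false_spam = sum(a == 'ham' and p == 'spam' for a, p in pairs)    # ham flagged as spam
missed_spam = sum(a == 'spam' and p == 'ham' for a, p in pairs)   # spam let through

precision = true_spam / (true_spam + false_spam)
recall = true_spam / (true_spam + missed_spam)

print('Accuracy: ', '{:.2%}'.format(accuracy))    # 75.00%
print('Spam precision: ', '{:.2%}'.format(precision))  # 66.67%
print('Spam recall: ', '{:.2%}'.format(recall))        # 66.67%
```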

## Conclusion
Naive Bayes proves effective for detecting spam messages here, reaching 98.74% accuracy (1,100 of the 1,114 test messages classified correctly).