<a href="https://colab.research.google.com/github/Saiteja421/spam_detection/blob/main/spam_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spam Detection with Multinomial Naive Bayes Algorithm**
---
# Description

> Naive Bayes is a simple and effective method for predicting the class of a data point based on the values of features that describe that data point. It is based on the idea of using Bayes' theorem, a rule in probability theory, to estimate the probability that an event will occur given the prior knowledge of certain conditions.

> In a spam filtering application, for example, the features might include the presence of certain words in an email, and the class would be "spam" or "not spam". The Naive Bayes classifier uses the values of these features to estimate the probability that an email is spam and predicts accordingly. It is called "naive" because it assumes the features are independent of one another given the class.

> In this project, we'll build a model that classifies messages as spam or non-spam.
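
To make Bayes' theorem concrete, here is a minimal sketch with made-up numbers (the priors and word probabilities below are assumptions for illustration, not values from the dataset). It estimates the probability that a message is spam given that it contains one particular word.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_spam = 0.2                 # prior P(spam)
p_ham = 1 - p_spam           # prior P(not spam)
p_word_given_spam = 0.30     # P(word appears | spam)
p_word_given_ham = 0.02      # P(word appears | not spam)

# Total probability of seeing the word at all.
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

Even though only 20% of these hypothetical messages are spam, seeing a word that is much more common in spam pushes the posterior probability well above the prior.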




## Table of contents

* **Importing Libraries**
* **Reading Data**
* **Creating Training and Testing Set**
* **Data Cleaning**
* **Calculating Probabilities**
* **Building the Classifier**
* **Testing**
* **Conclusions**

## Importing Libraries 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re


## Reading Data

This dataset is taken from [Kaggle.com](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset).

Google Drive is mounted in Google Colab (see the first cell above) so that the dataset can be read from it.

I had already saved the dataset to my Google Drive, so I load it by giving its path there.


In [None]:
messages = pd.read_csv('/content/drive/MyDrive/DWDM Dataset/SMSSpamCollection', sep = "\t", header = None)

In [None]:
messages

In [None]:
messages.columns = ["Label", "Message"] #Labeling the columns as Label and Message

In [None]:
messages

In [None]:
messages["Label"].value_counts().plot.bar(rot = 30)
plt.xlabel("Message Label")
plt.ylabel("Frequency")

In [None]:
messages["Label"].value_counts(normalize = True)



## Creating Training and Testing Set

We will shuffle the original dataset randomly and then split it:

* 80% of the data will be used for training,
* the remaining 20% will be used for testing.

In [None]:
# random data
random_data = messages.sample(frac = 1, random_state = 1)

# lengths of the training and testing data, that will be used as future indexes
len_train = round(len(random_data) * .8)
len_test = len(random_data) - len_train

print(len_train, len_test, len_train + len_test)

In [None]:
random_data

In [None]:
# creating the sets using slicing
training = random_data[:len_train].reset_index(drop = True)
testing = random_data[-len_test:].reset_index(drop = True)

In [None]:
print(training.shape, testing.shape)

In [None]:
training.head()

In [None]:
testing.head()

We now check that the proportions of spam and non-spam messages in both sets are similar to those of the original dataset.

In [None]:
training["Label"].value_counts(normalize = True)

In [None]:
testing["Label"].value_counts(normalize = True)
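
A plain random split only keeps the class proportions approximately equal. If exact proportions matter, one option is a stratified split, which splits each class separately. The sketch below uses plain Python and toy data; `stratified_split` and the `toy_` names are hypothetical helpers for illustration, not part of this notebook's pipeline.

```python
import random
from collections import defaultdict

def stratified_split(rows, train_frac=0.8, seed=1):
    """Split (label, message) pairs so each label keeps about train_frac in training."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, message in rows:
        by_label[label].append((label, message))

    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)                      # shuffle within each class
        cut = round(len(label_rows) * train_frac)    # per-class cut point
        train.extend(label_rows[:cut])
        test.extend(label_rows[cut:])
    return train, test

# Toy data (made up): 10 spam and 40 ham messages.
toy_rows = [("spam", f"s{i}") for i in range(10)] + [("ham", f"h{i}") for i in range(40)]
toy_train, toy_test = stratified_split(toy_rows)
print(len(toy_train), len(toy_test))
```

With this approach the spam share is the same in both sets by construction, up to rounding within each class.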

## Data Cleaning


We need to clean the data to extract the information the model uses.

Recall that the model treats each word independently, so we don't care about the order of the words in a message. We only care about how often each word occurs in a message.

We will make two assumptions: all words are lowercased, and punctuation is ignored.
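
As a quick illustration of these two assumptions, the sketch below cleans a single message with the standard `re` module. The helper name `tokenize` is hypothetical and used only for this example.

```python
import re

def tokenize(message):
    # Replace every non-word character with a space, lowercase, then split on whitespace.
    return re.sub(r"\W", " ", message).lower().split()

print(tokenize("Ok lar... Joking wif u oni..."))
```

Punctuation disappears and case is normalised, so `Ok` and `ok` count as the same word.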

In [None]:
# r"\W" is a regex that matches any character that is not a word character (letter, digit or underscore)
training["Message"] = training["Message"].str.replace(r"\W", " ", regex = True).str.lower()

In [None]:
training.head()

In [None]:
# creating the vocabulary: unique words, kept in order of first appearance
training["Message"] = training["Message"].str.split()
vocabulary = list(dict.fromkeys(word for message in training["Message"] for word in message))


In [None]:
# checking that there are no duplicates in the vocabulary
print(len(vocabulary) == len(set(vocabulary)))

In [None]:
vocabulary[:5]

In [None]:
d = len(vocabulary)
print(d)

There are 7783 unique words in the training messages. Now we will expand the Message column into one column per vocabulary word, where each value is the number of times that word occurs in the message. We will do so by first creating a dictionary and then converting it to a dataframe, which we will concatenate to the original training dataset.
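
To show the shape of the table this produces, here is a small sketch on two made-up messages using `collections.Counter`. The `toy_` names are assumptions for illustration and do not touch the notebook's own variables.

```python
from collections import Counter

# Two toy messages, already cleaned and split into words (made up).
toy_messages = [["win", "a", "prize", "win"], ["call", "a", "friend"]]

# Vocabulary in first-seen order, without duplicates.
toy_vocabulary = list(dict.fromkeys(w for m in toy_messages for w in m))

# One row per message, one column per vocabulary word.
toy_counts = []
for m in toy_messages:
    word_counts = Counter(m)
    toy_counts.append([word_counts[w] for w in toy_vocabulary])

print(toy_vocabulary)
for row in toy_counts:
    print(row)
```

Each row has as many entries as the vocabulary has words, and most entries are zero, which is also what the full training table looks like.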

In [None]:
# creating the dictionary that will be converted to a dataframe
word_freq_per_message = {word:[0]*len(training["Message"]) for word in vocabulary}

# adding the frequencies to word_freq_per_message
for i, message in enumerate(training["Message"]):
    for word in message: # recall that 'message' is a list of words, saved as strings
        word_freq_per_message[word][i] += 1
        
words_freq_per_message = pd.DataFrame(word_freq_per_message)
words_freq_per_message

In [None]:
training_final = pd.concat([training, words_freq_per_message], axis = 1)

In [None]:
training_final

## Calculating Probabilities

Now that we have a dataset in a form that suits our scenario, we can proceed with the calculations. Since the formulas are a bit long, I will restate them here:
$$P(spam|word_1, word_2,\dots , word_n) \propto P(spam)\cdot \prod_{i = 1}^{n}P(word_i|spam)$$
$$P(non\ spam|word_1,word_2,\dots , word_n) \propto P(non\ spam)\cdot \prod_{i = 1}^{n}P(word_i|non\ spam)$$
$$P(word_i|spam) = \frac{X_i + \alpha}{N + \alpha \cdot d}$$ where $X_i$ is the number of times $word_i$ occurs across all spam messages, $\alpha$ is the smoothing parameter (we'll set it to 1), $N$ is the total number of words in the spam messages, and $d$ is the number of unique words in our vocabulary. The same formula applies when we condition on "non-spam" messages.
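
As a worked example of the smoothed formula, the sketch below applies it to a tiny set of made-up spam word counts. All `toy_` names and numbers are assumptions for illustration.

```python
# Toy counts (made up): how often each word occurs across all spam messages.
toy_spam_counts = {"win": 3, "prize": 2, "call": 1, "friend": 0}

toy_alpha = 1                            # smoothing parameter
toy_n = sum(toy_spam_counts.values())    # N: total number of words in the spam messages
toy_d = len(toy_spam_counts)             # d: number of unique words in the vocabulary

# P(word | spam) = (X_i + alpha) / (N + alpha * d)
toy_p_word_given_spam = {
    word: (x + toy_alpha) / (toy_n + toy_alpha * toy_d)
    for word, x in toy_spam_counts.items()
}
print(toy_p_word_given_spam)
```

Note that `friend` never occurs in the toy spam messages but still gets a non-zero probability. Without smoothing, a single unseen word would force the whole product to zero.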

Let's start with $P(spam)$ and $P(non\ spam)$.

In [None]:
prob_spam = len(training_final[training_final["Label"] == "spam"]) / len(training_final["Label"])
prob_nonspam = 1 - prob_spam
print(prob_spam, prob_nonspam)



Now let's calculate $P(word_i|C_k)$, where $C_k$ is either $spam$ or $non\ spam$. Since there are 7783 words in the vocabulary and we need the probability under both classes, we need to do 15566 computations.


In [None]:
# calculating P(word_i|C_k)
spam_messages = training_final[training_final["Label"] == "spam"]
nonspam_messages = training_final[training_final["Label"] == "ham"]

alpha = 1
n_spam = spam_messages["Message"].apply(len).sum()
n_nonspam = nonspam_messages["Message"].apply(len).sum()
# we defined "d" previously in our code
print(alpha, n_spam, n_nonspam, d)

prob_word_given_spam = {}
prob_word_given_nonspam = {}

for word in vocabulary:
    prob_word_given_spam[word] = (spam_messages[word].sum() + alpha) / (n_spam + alpha * d)
    prob_word_given_nonspam[word] = (nonspam_messages[word].sum() + alpha) / (n_nonspam + alpha * d)

In [None]:
print(dict(list(prob_word_given_spam.items())[:3]))

In [None]:
print(dict(list(prob_word_given_nonspam.items())[:3]))

**NOTE**: doing so many calculations before the classification is what makes Naive Bayes very fast at prediction time! If we did not do so, we would need to repeat all these calculations for every new message. Now, instead, most of them are already done, so classifying a new message only requires looking up and multiplying the stored probabilities of its words.

## Building the Classifier

In [None]:
def classify(message):
    if not isinstance(message, str):
        raise TypeError("Argument must be a string")
    
    message = re.sub(r"\W", " ", message)
    message = message.lower().split()
    
    prob_spam_given_message = prob_spam
    prob_nonspam_given_message = prob_nonspam
    for word in message:
        if word in prob_word_given_spam:
            prob_spam_given_message *= prob_word_given_spam[word]
        if word in prob_word_given_nonspam:
            prob_nonspam_given_message *= prob_word_given_nonspam[word]
    # we added these if clauses to avoid issues when a word of a message is not present in our list (see README for more)
    
    if prob_spam_given_message > prob_nonspam_given_message:
        res = "spam"
    elif prob_spam_given_message < prob_nonspam_given_message:
        res = "ham"
    else: # if there is equality. It is unlikely to occur, since we're comparing float numbers
        res = "Classification failed"
        
    return prob_spam_given_message, prob_nonspam_given_message, res
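
One caveat with multiplying many probabilities is floating-point underflow: for a long enough message the product can become exactly `0.0` for both classes. A common remedy is to add logarithms instead of multiplying. The sketch below shows the idea on made-up parameters; `toy_prior`, `toy_likelihood`, `log_score` and `classify_log` are hypothetical names, not part of the classifier above.

```python
import math

# Why logs help: a product of many small probabilities underflows to zero,
# while the equivalent sum of logs stays representable.
print(0.01 ** 200)            # underflows to 0.0
print(200 * math.log(0.01))   # a finite negative number

# Toy parameters (made up), standing in for the dictionaries computed earlier.
toy_prior = {"spam": 0.2, "ham": 0.8}
toy_likelihood = {
    "spam": {"win": 0.4, "prize": 0.3, "call": 0.2, "friend": 0.1},
    "ham": {"win": 0.1, "prize": 0.1, "call": 0.3, "friend": 0.5},
}

def log_score(words, label):
    # log P(label) + sum of log P(word | label); unknown words are skipped.
    score = math.log(toy_prior[label])
    for word in words:
        if word in toy_likelihood[label]:
            score += math.log(toy_likelihood[label][word])
    return score

def classify_log(words):
    # Pick the label with the highest log score.
    scores = {label: log_score(words, label) for label in toy_prior}
    return max(scores, key=scores.get)

print(classify_log(["win", "prize"]))
print(classify_log(["call", "friend"]))
```

Because the logarithm is monotonic, comparing log scores gives the same decision as comparing the products whenever the products do not underflow. Unlike `classify`, this sketch resolves an exact tie by returning the first label instead of reporting a failure.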

In [None]:
# checking boundary case
classify('3')

Looking at the documentation, we have some examples of spam and non-spam messages. Let's test them:

* ham: What you doing?how are you?
* ham: Ok lar... Joking wif u oni...
* ham: dun say so early hor... U c already then say...
* ham: MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
* ham: Siva is in hostel aha:-.
* ham: Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
* spam: FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
* spam: Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
* spam: URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU

In [None]:
classify("What you doing?how are you?")

In [None]:
classify("Ok lar... Joking wif u oni...")

In [None]:
classify("dun say so early hor... U c already then say...")

In [None]:
classify("MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*")

In [None]:
classify("Siva is in hostel aha:-.")

In [None]:
classify("Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.")

In [None]:
classify("FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop")

In [None]:
classify("Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B")

In [None]:
classify("URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU")

All classifications are correct!


## Testing

We will now use the classifier on our testing set to test and check the accuracy.

In [None]:
testing.head()

Let's add another column, `classification`, that holds the label predicted by the `classify` function.

In [None]:
testing["classification"] = testing["Message"].apply(lambda message: classify(message)[2])

In [None]:
testing.head()

To calculate the model accuracy, we compute the proportion of rows in which `classification == Label`.

In [None]:
total_messages = testing.shape[0] #1114

# comparing the two columns element-wise gives a boolean Series; summing it counts the matches
correct_class_len = (testing["Label"] == testing["classification"]).sum()

accuracy = correct_class_len / total_messages
accuracy

We achieved an accuracy of 98.7% on the testing set, which is a strong result for such a simple model!
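
Because ham messages are the large majority in this dataset, accuracy alone can hide how well spam specifically is detected. Precision and recall for the spam class give a fuller picture. The sketch below computes them on made-up labels; `spam_metrics` and the `toy_` lists are hypothetical names for illustration.

```python
def spam_metrics(labels, predictions):
    """Return (precision, recall), treating "spam" as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == "spam" and p == "spam")   # spam caught
    fp = sum(1 for y, p in pairs if y == "ham" and p == "spam")    # ham flagged as spam
    fn = sum(1 for y, p in pairs if y == "spam" and p != "spam")   # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predictions.
toy_labels = ["spam", "spam", "ham", "ham", "ham", "spam"]
toy_predictions = ["spam", "ham", "ham", "spam", "ham", "spam"]
print(spam_metrics(toy_labels, toy_predictions))
```

The same function should accept the notebook's columns directly, for example `spam_metrics(testing["Label"], testing["classification"])`, since iterating over a pandas Series yields its values.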




## Conclusions


>We have built a model that predicts whether a message is spam or not with 98.7% accuracy on the testing set. Understanding the wrong classifications would require further, time-consuming investigation.





## Future Optimisations

* Make the model more complex by making it case-sensitive.
* Investigate what caused the misclassifications in the current model to see whether accuracy can be improved.
* It would be interesting to apply Logistic Regression (LR) and compare its performance. Naive Bayes (NB) and LR are closely related for binary classification: both produce a decision rule that is linear in the features on the log-odds scale. The intuitive difference is that LR directly estimates $P(C_k|\textbf{X})$, while NB estimates $P(C_k)$ and $P(\textbf{X}|C_k)$ and combines them with Bayes' theorem.
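
A short derivation, sketched under the model's own independence assumption, shows why NB and LR are related. Dividing the two proportionality relations from the *Calculating Probabilities* section cancels the shared normalising constant, and taking logs gives:

$$\log\frac{P(spam|word_1,\dots,word_n)}{P(non\ spam|word_1,\dots,word_n)} = \log\frac{P(spam)}{P(non\ spam)} + \sum_{i=1}^{n}\log\frac{P(word_i|spam)}{P(word_i|non\ spam)}$$

The right-hand side is a constant plus a sum of per-word weights, which is the same linear form that LR fits to the log-odds. NB fixes these weights from the smoothed word counts, while LR would learn them by optimisation.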

In [None]:
print("All cells executed...")
print("SUCCESS")