# Spam Classifier from scratch using Naive Bayes

## Goal

Implement a simple spam classifier from scratch using the Naive Bayes algorithm.
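
For reference, the decision rule that the code in this notebook appears to implement can be written as below. The notation is introduced here for illustration and is not taken from the notebook: $c$ is a class (spam or ham), $w_1, \dots, w_n$ are the known words of a message, and $\mathrm{count}(w, c)$ is how often word $w$ occurs in messages of class $c$.

$$
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \bigl(\mathrm{count}(w', c) + 1\bigr)}
$$

The sum in the denominator runs over the whole vocabulary, which matches the add-one smoothing applied to the word counts later on.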

## Dataset used for training

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.


Source: [Kaggle](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)


## References

- [Naive Bayes, Clearly Explained - YouTube](https://www.youtube.com/watch?v=O2L2Uv9pdDA)
- [Naive Bayes Classifier From Scratch](https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)


In [1]:
import pandas as pd
from collections import Counter

In [2]:
class ClassifySpam:
    '''
    Class that classifies spam and ham data
    '''

    def __init__(self, df, spam_probability, ham_probability):
        """
        Initialize the ClassifySpam object.

        Parameters:
        - df: DataFrame with a 'words' column and per-word 'is_spam_prob' and 'is_ham_prob' columns.
        - spam_probability: Prior probability of spam.
        - ham_probability: Prior probability of ham.
        """
        self.word_count_df = df
        self.sms = ''
        self.spam_probability = spam_probability
        self.ham_probability = ham_probability

    def _get_sanitized_sms(self):
        """Return the SMS lowercased, keeping only letters and whitespace."""
        return ''.join(e.lower() for e in self.sms if e.isalpha() or e.isspace())

    def _word_exists(self, word):
        """
        Check if a given word exists in the word_count_df DataFrame.

        Parameters:
        - word: The word to check for existence.

        Returns:
        - bool: True if the word exists, False otherwise.
        """
        return (self.word_count_df['words'] == word).any()

    def _filter_known_words(self, words):
        """
        Filter out unknown words from the list of words.

        Parameters:
        - words: List of words to filter.

        Returns:
        - list: List of known words.
        """
        return [word for word in words if self._word_exists(word)]

    def _get_known_words(self):
        """
        Get the list of known words from the SMS.

        Returns:
        - list: List of known words in the SMS.
        """
        words = self._get_sanitized_sms().split()

        return self._filter_known_words(words)

    def calc_prob_spam(self):
        """
        Calculate the probability of the SMS being spam using known words.

        Returns:
        - float: Probability of the SMS being spam.
        """
        prob = self.spam_probability
        for word in self._get_known_words():
            prob *= self.word_count_df[self.word_count_df["words"] == word]["is_spam_prob"].iloc[0]
        return prob

    def calc_prob_ham(self):
        """
        Calculate the probability of the SMS being ham using known words.

        Returns:
        - float: Probability of the SMS being ham.
        """
        prob = self.ham_probability
        for word in self._get_known_words():
            prob *= self.word_count_df[self.word_count_df["words"] == word]["is_ham_prob"].iloc[0]
        return prob

    def is_spam_or_ham(self, sms):
        """
        Classify the SMS as spam or ham based on probabilities.

        Parameters:
        - sms: The input SMS text to be classified.

        Returns:
        - str: Classification result - 'Cannot classify', 'Classified as normal message', or 'Classified as spam message'.
        """
        self.sms = sms
        if len(self._get_known_words()) == 0:
            return 'Cannot classify'
        else:
            if self.calc_prob_ham() > self.calc_prob_spam():
                return 'Classified as normal message'
            else:
                return 'Classified as spam message'
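
Multiplying many small probabilities, as `calc_prob_spam` and `calc_prob_ham` do, can underflow to `0.0` for long messages. A common workaround is to compare sums of log-probabilities instead. The following is a minimal sketch of that variant and is not part of the class above; the function `log_score` and the toy probability tables are hypothetical and chosen only for illustration.

```python
import math


def log_score(prior, word_probs, words):
    """Return log(prior) plus the sum of log P(word | class) over the words."""
    return math.log(prior) + sum(math.log(word_probs[w]) for w in words)


# Made-up probabilities, not taken from the SMS dataset
spam_probs = {"free": 0.3, "call": 0.2}
ham_probs = {"free": 0.1, "call": 0.2}
words = ["free", "call"]

spam_score = log_score(0.5, spam_probs, words)
ham_score = log_score(0.5, ham_probs, words)
print("spam" if spam_score > ham_score else "ham")  # spam

# A raw product of 400 factors of 1e-5 underflows to 0.0 ...
prod = 1.0
for _ in range(400):
    prod *= 1e-5
print(prod)  # 0.0

# ... while the equivalent sum of logs stays a finite negative number
log_total = 400 * math.log(1e-5)
print(math.isfinite(log_total))  # True
```

Because the logarithm is monotonic, comparing log scores gives the same decision as comparing the raw products whenever the products do not underflow.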


In [3]:
# Training DataSet


df = pd.read_csv('spam.csv', encoding = 'ISO-8859-1')

## Preprocessing the data

In [4]:
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)

In [5]:
df.columns = ['category', 'sms']

In [6]:
# Lowercase every SMS and drop all characters except letters and whitespace

df['sms'] = df['sms'].apply(lambda x: ''.join(e.lower() for e in x if e.isalpha() or e.isspace()))
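
To make the effect of this cleaning step concrete, here is a minimal standalone sketch that applies the same character filter to individual strings. The helper name `sanitize` is hypothetical and introduced only for illustration.

```python
def sanitize(text):
    """Lowercase the text and keep only letters and whitespace."""
    return ''.join(ch.lower() for ch in text if ch.isalpha() or ch.isspace())


print(sanitize("Free subscription!!!!"))  # free subscription
print(sanitize("Call 0800 NOW").split())  # ['call', 'now']
print(sanitize("don't"))                  # dont
```

Note that digits are dropped entirely, and punctuation inside a word is removed without inserting a space, so "don't" becomes the single token "dont".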

## Preparing the dataset


In [7]:
# Transforming the dataset

spam_counter = Counter()
ham_counter = Counter()

for index, row in df.iterrows():
    words = row['sms'].split()
    if row['category'] == 'spam':
        spam_counter.update(words)
    else:
        ham_counter.update(words)

word_count_df = pd.DataFrame({ 'words': list(set(spam_counter.keys()).union(set(ham_counter.keys())))})

# Laplace (add-one) smoothing: add alpha = 1 to every count so that no word gets zero probability
word_count_df['count_spam'] = word_count_df['words'].apply(lambda word: spam_counter[word] + 1)
word_count_df['count_ham'] = word_count_df['words'].apply(lambda word: ham_counter[word] + 1)

In [8]:
total_words_in_spam = word_count_df['count_spam'].sum()
total_words_in_ham = word_count_df['count_ham'].sum()

In [9]:
word_count_df['is_spam_prob'] = word_count_df['count_spam']/total_words_in_spam
word_count_df['is_ham_prob'] = word_count_df['count_ham']/total_words_in_ham
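
The add-one smoothing and the resulting per-word probabilities can be checked by hand on a toy corpus. The sketch below uses plain Python and made-up messages rather than the SMS dataset; the names `spam_msgs`, `ham_msgs`, and `smoothed_probs` are hypothetical.

```python
from collections import Counter

# Toy, made-up corpus (not taken from the SMS dataset)
spam_msgs = ["free prize", "free call"]
ham_msgs = ["call me", "see you"]

spam_counter = Counter(w for m in spam_msgs for w in m.split())
ham_counter = Counter(w for m in ham_msgs for w in m.split())
vocab = sorted(set(spam_counter) | set(ham_counter))  # 6 distinct words


def smoothed_probs(counter, vocab):
    """P(word | class) with add-one smoothing over the shared vocabulary."""
    # Denominator = words seen in the class + vocabulary size
    total = sum(counter[w] + 1 for w in vocab)
    return {w: (counter[w] + 1) / total for w in vocab}


spam_probs = smoothed_probs(spam_counter, vocab)
ham_probs = smoothed_probs(ham_counter, vocab)

print(spam_probs["free"])  # (2 + 1) / (4 + 6) = 0.3
print(ham_probs["free"])   # (0 + 1) / (4 + 6) = 0.1
```

"free" never occurs in the toy ham messages, yet it still receives a non-zero ham probability, which is what keeps a single unseen word from forcing the whole product to zero.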

In [10]:
word_count_df.head()

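|   | words | count_spam | count_ham | is_spam_prob | is_ham_prob |
|---|-------|------------|-----------|--------------|-------------|
| 0 | aproach | 1 | 3 | 4e-05 | 4e-05 |
| 1 | mrng | 1 | 15 | 4e-05 | 0.000199 |
| 2 | nightnight | 1 | 2 | 4e-05 | 2.7e-05 |
| 3 | internetservice | 2 | 1 | 8.1e-05 | 1.3e-05 |
| 4 | u | 156 | 989 | 0.006317 | 0.013127 |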


In [11]:
# Prior probabilities of spam and ham (fraction of messages in each class)

p_spam = df[df['category'] == 'spam']['category'].count()/df.shape[0]
p_ham = df[df['category'] == 'ham']['category'].count()/df.shape[0]

## Testing the classifier

In [12]:
spam_classifier = ClassifySpam(word_count_df, p_spam, p_ham)

In [13]:
spam_classifier.is_spam_or_ham("Free subscription!!!!")

'Classified as spam message'

In [14]:
spam_classifier.is_spam_or_ham("Hi Friends")

'Classified as normal message'

In [15]:
spam_classifier.is_spam_or_ham("Please call me when you are free.")

'Classified as normal message'

In [16]:
spam_classifier.is_spam_or_ham("Foobar.")

'Cannot classify'

## Drawbacks

The classifier cannot classify a message when none of its words appear in the vocabulary built from the training dataset; unknown words in an otherwise known message are simply ignored. Because it uses Naive Bayes, it also treats every word as independent of the others, so it does not take word order or the relationships between words in a sentence into account.
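
Both limitations can be seen directly in a small sketch. Because the score is a product of per-word probabilities, reordering the words does not change it, and a message made up only of unknown words leaves nothing but the prior. The function `score` and the probability values below are hypothetical, made up for illustration.

```python
import math


def score(prior, word_probs, words):
    """Naive Bayes score: the prior times the product of per-word probabilities.

    Words missing from word_probs are skipped, mirroring how the classifier
    above ignores unknown words.
    """
    result = prior
    for w in words:
        if w in word_probs:
            result *= word_probs[w]
    return result


spam_probs = {"free": 0.3, "call": 0.2, "me": 0.1}  # made-up values

a = score(0.5, spam_probs, "call me free".split())
b = score(0.5, spam_probs, "free me call".split())
print(math.isclose(a, b))  # True: word order does not affect the score

# Only unknown words: the score is just the prior, so the words carry no evidence
print(score(0.5, spam_probs, ["foobar"]))  # 0.5
```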