# Implementation of a naive Bayes classifier for classifying emails as spam or non-spam

Our task is to find the most probable class for an email $d$, so we need the value of the random variable $C$ at which the posterior probability reaches its maximum (the maximum a posteriori estimate):

$$C = \underset{c \in C}{\arg\max}\; P(c \mid d)$$

By Bayes' theorem, we expand $P(c \mid d)$:

$$C = \underset{c \in C}{\arg\max}\; \frac{P(d \mid c)\,P(c)}{P(d)}$$

Since we are looking for the argument that maximizes the posterior, and the denominator $P(d)$ does not depend on that argument (it is the same constant for every class), we can drop $P(d)$:

$$C = \underset{c \in C}{\arg\max}\; P(d \mid c)\,P(c)$$
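
To see why dropping $P(d)$ is safe, here is a minimal sketch that compares the winning class with and without the normalization. The priors and likelihoods are made-up numbers chosen only for illustration; they do not come from the emails used below.

```python
# Hypothetical values for illustration only: P(c) and P(d|c) for two classes.
priors = {"spam": 0.4, "notspam": 0.6}
likelihoods = {"spam": 0.002, "notspam": 0.0005}

# Unnormalized scores P(d|c) * P(c), one per class
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence P(d) is the same constant for every class
evidence = sum(scores.values())
posteriors = {c: scores[c] / evidence for c in scores}

# Dividing every score by the same positive constant does not change the winner
best_unnormalized = max(scores, key=scores.get)
best_normalized = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_normalized)
```

Both selections name the same class, which is all the classifier needs; the normalized values are only required if actual posterior probabilities are wanted.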

Because the logarithm is strictly increasing, $f(x)$ and $\ln f(x)$ attain their maximum at the same argument (for $f(x) > 0$), so

$$C = \underset{c \in C}{\arg\max}\; \ln\bigl(P(d \mid c)\,P(c)\bigr)$$
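
Besides leaving the argmax unchanged, working with logarithms has a practical benefit: a product of many small probabilities can underflow to zero in floating point, while the sum of their logarithms stays finite. The sketch below illustrates this with a hypothetical list of per-word probabilities (the values are assumptions, not estimates from any data).

```python
from math import log

# Hypothetical per-word probabilities for a long email (illustration only)
word_probs = [1e-5] * 100

# The direct product underflows to 0.0 in double precision ...
product = 1.0
for p in word_probs:
    product *= p

# ... while the sum of logarithms remains a finite, comparable number
log_sum = sum(log(p) for p in word_probs)

print(product)
print(log_sum)
```

With the product collapsed to zero, two classes could no longer be told apart; their log-sums still can.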

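Under the naive assumption that the features $f_1, \ldots, f_n$ (the words of the email) are conditionally independent given the class, and since the logarithm of a product is the sum of the logarithms: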

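$$\begin{aligned}
C &= \underset{c \in C}{\arg\max}\; \ln\bigl(P(f_1, f_2, \ldots, f_n \mid c)\,P(c)\bigr) \\
&= \underset{c \in C}{\arg\max}\; \ln\Bigl(P(c) \prod_{i=1}^{n} P(f_i \mid c)\Bigr) \\
&= \underset{c \in C}{\arg\max}\; \Bigl(\ln P(c) + \sum_{i=1}^{n} \ln P(f_i \mid c)\Bigr)
\end{aligned}$$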
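
The last line of the derivation translates almost directly into code. The following is a minimal sketch, assuming the per-word likelihoods for a class are already available in a dict; the function name `log_score` and the numbers are hypothetical and serve only to illustrate the formula.

```python
from math import log
from typing import Dict, List


def log_score(features: List[str], prior: float, likelihood: Dict[str, float]) -> float:
    """Return ln P(c) + sum_i ln P(f_i | c) for a single class c."""
    return log(prior) + sum(log(likelihood[f]) for f in features)


# Hypothetical prior and likelihoods for one class (illustration only)
likelihood_spam = {"buy": 0.2, "phone": 0.1}
score = log_score(["buy", "phone"], prior=0.5, likelihood=likelihood_spam)

# The score equals the logarithm of the product P(c) * P(f_1|c) * P(f_2|c)
print(score, log(0.5 * 0.2 * 0.1))
```

The classifier evaluates this score once per class and picks the class with the largest value.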

So, using maximum likelihood estimation with Laplace smoothing, we get:
$$P(f_i \mid c_j) = \frac{\mathrm{count}(f_i, c_j) + z}{\sum_{k=1}^{q} \mathrm{count}(f_k, c_j) + zq}$$
where $z \ge 0$ is the smoothing factor and $q$ is the total number of distinct words (the vocabulary size).
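
The smoothed estimate can be sketched as a small helper. This is an illustration under stated assumptions: the name `smoothed_likelihood`, the toy counts, and the four-word vocabulary are hypothetical, and $q$ is taken to be the number of distinct words so that the estimates sum to one over the vocabulary.

```python
from typing import Dict


def smoothed_likelihood(word: str, counts: Dict[str, int],
                        vocabulary_size: int, z: float = 1.0) -> float:
    """Estimate P(word | class) with additive (Laplace) smoothing."""
    total = sum(counts.values())  # number of word occurrences in the class
    return (counts.get(word, 0) + z) / (total + z * vocabulary_size)


# Hypothetical counts for one class over a vocabulary of 4 distinct words
counts = {"buy": 2, "phone": 1}
vocabulary = ["buy", "phone", "apples", "meeting"]

probs = {w: smoothed_likelihood(w, counts, len(vocabulary)) for w in vocabulary}
print(probs)
print(sum(probs.values()))
```

Words never seen in the class ("apples", "meeting") receive a small non-zero probability instead of zero, so a single unseen word no longer forces the logarithm to minus infinity.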

#### Let's start implementing our classifier

In [90]:
# Imports
from math import log
import pandas as pd 
from typing import Dict, List

In [80]:
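# Define a dict of spam/non-spam emails and initialize the smoothing factor z
# The emails are in Russian. Spam: "package tours at a low price", "promotion: buy a chocolate bar, get a phone as a gift".
# Non-spam: "a meeting takes place tomorrow", "buy a kilogram of apples and a chocolate bar".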
s = {'spam':['путевки низкой цене','акция купи шоколадку получи телефон подарок'],'notspam':['завтра состоится собрание','купи килограмм яблок шоколадку']}
z: int = 1
data =  pd.read_csv("../input/spam-or-not-spam-dataset/spam_or_not_spam.csv")

In [81]:
# Count the emails in each class; these counts give the class priors P(c)
count_spam: int = len(s['spam'])
count_not_spam: int = len(s['notspam'])
count_total: int = count_spam + count_not_spam

In [82]:
# Build a 'bag of words' (word -> number of occurrences) for the spam and non-spam emails
list_of_words_spam: List[str] = [word for email in s['spam'] for word in email.split()]
list_of_words_not_spam: List[str] = [word for email in s['notspam'] for word in email.split()]

bag_of_words_spam: Dict[str, int] = {}
for word in list_of_words_spam:
    bag_of_words_spam[word] = bag_of_words_spam.get(word, 0) + 1

bag_of_words_not_spam: Dict[str, int] = {}
for word in list_of_words_not_spam:
    bag_of_words_not_spam[word] = bag_of_words_not_spam.get(word, 0) + 1



In [83]:
# Define the email that will be classified
# (Russian; roughly: "a mountain of apples in the store, buy seven kilograms and a chocolate bar")
sentence = 'магазине гора яблок купи семь килограмм шоколадку'
split_sentence = sentence.split()

In [84]:
def count_word_in_bag_of_words(word: str, bag: Dict[str, int]) -> int:
    """
    Return the number of occurrences of `word` in a 'bag of words' dict (0 if the word is absent).
    """
    return bag.get(word, 0)

In [85]:
def sum_words_in_bag_of_words(bag: Dict[str, int]) -> int:
    """
    Return the total number of words (the sum of all counts) in a 'bag of words' dict.
    """
    return sum(bag.values())

In [86]:
# Calculate q, the number of distinct words across both dicts (a word present in both is counted once)
total_sum: int = len(set(bag_of_words_spam) | set(bag_of_words_not_spam))

In [87]:
# Compute the log-score of the email for the spam class: ln P(spam) + sum of ln P(word | spam)
try:
    probability_spam = log(count_spam / count_total)
    for word in split_sentence:
        probability_spam += log((count_word_in_bag_of_words(word, bag_of_words_spam) + z)
                                / (sum_words_in_bag_of_words(bag_of_words_spam) + (z * total_sum)))
except (ValueError, ZeroDivisionError):
    # log(0) or an empty training set: the class cannot be chosen, so its log-score is -infinity
    probability_spam = float('-inf')


In [88]:
# Compute the log-score of the email for the non-spam class: ln P(notspam) + sum of ln P(word | notspam)
try:
    probability_not_spam = log(count_not_spam / count_total)
    for word in split_sentence:
        probability_not_spam += log((count_word_in_bag_of_words(word, bag_of_words_not_spam) + z)
                                    / (sum_words_in_bag_of_words(bag_of_words_not_spam) + (z * total_sum)))
except (ValueError, ZeroDivisionError):
    # log(0) or an empty training set: the class cannot be chosen, so its log-score is -infinity
    probability_not_spam = float('-inf')

In [89]:
# Determine whether the email is spam by comparing the two log-scores
if probability_not_spam > probability_spam:
    print(f'The email "{sentence}" is not spam')
else:
    print(f'The email "{sentence}" is spam')

The email "магазине гора яблок купи семь килограмм шоколадку" is not spam
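
For reuse, the steps above can be collected into a single function. The following is a compact sketch rather than a definitive implementation: the names `classify` and `corpus` are introduced here for illustration, the smoothing term uses the number of distinct words for $q$ (an assumption of this sketch), and every class is assumed to contain at least one email.

```python
from collections import Counter
from math import log
from typing import Dict, List


def classify(email: str, corpus: Dict[str, List[str]], z: float = 1.0) -> str:
    """Return the label with the highest smoothed log-score for `email`."""
    # One bag of words (word -> count) per class
    bags = {label: Counter(w for text in texts for w in text.split())
            for label, texts in corpus.items()}
    vocabulary = set().union(*bags.values())  # distinct words over all classes
    n_docs = sum(len(texts) for texts in corpus.values())

    scores = {}
    for label, texts in corpus.items():
        n_words = sum(bags[label].values())
        score = log(len(texts) / n_docs)  # ln P(c); assumes a non-empty class
        for w in email.split():
            score += log((bags[label][w] + z) / (n_words + z * len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# The same toy emails as in the cells above
corpus = {
    'spam': ['путевки низкой цене', 'акция купи шоколадку получи телефон подарок'],
    'notspam': ['завтра состоится собрание', 'купи килограмм яблок шоколадку'],
}
print(classify('магазине гора яблок купи семь килограмм шоколадку', corpus))
```

On the toy data this returns `notspam`, in agreement with the result printed above. On ties, `max` returns the first label in `corpus`.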
