#### Implementing a Naive Bayes Classifier from Scratch
#### Generating the Dataset Used for the Classifier

## Section 1 - Dog Breed Classifier

##### Building a Dog Breed Classifier using Naive Bayes
- We will use generator functions to create the features of a synthetic dataset describing three different dog breeds.
- We will then implement the Naive Bayes algorithm to classify the dogs based on their features.

##### Generating the Dataset

We will generate a dataset with four features for each dog:
- height (in cm): follows a Gaussian distribution.
- weight (in kg): follows a Gaussian distribution.
- bark_days: the number of days (out of 30) on which the dog barks; follows a binomial distribution.
- ear_head_ratio: the ratio between the length of the ears and the length of the head; follows a uniform distribution.
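The three distributions listed above can be sampled with the standard library alone. The sketch below is only an illustration: the helper names `sample_gaussian`, `sample_binomial` and `sample_uniform` are hypothetical and are not the generators used by this notebook, and the parameter values are borrowed from breed 0 purely as an example.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def sample_gaussian(mu, sigma, n):
    """Draw n values from a normal distribution with mean mu and std sigma."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

def sample_binomial(trials, p, n):
    """Draw n counts; each count is the number of successes in `trials` Bernoulli(p) draws."""
    return [sum(rng.random() < p for _ in range(trials)) for _ in range(n)]

def sample_uniform(a, b, n):
    """Draw n values uniformly between a and b."""
    return [rng.uniform(a, b) for _ in range(n)]

heights = sample_gaussian(35, 1.5, 1000)    # height in cm
bark_days = sample_binomial(30, 0.8, 1000)  # days out of 30
ratios = sample_uniform(0.1, 0.6, 1000)     # ear/head ratio

print(sum(heights) / len(heights))      # close to 35
print(min(bark_days), max(bark_days))   # always within 0..30
print(min(ratios), max(ratios))         # always within 0.1..0.6
```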

In [1]:
FEATURES = ["height","weight","bark_days","ear_head_ratio"]

Introducing the concept of *dataclass*
- In Python, `@dataclass` is a decorator that automatically generates boilerplate code (such as `__init__`, `__repr__` and `__eq__`) for classes that are mainly used to store data.
- It saves you from writing repetitive code for classes that just hold data.
- You can think of dataclasses as containers for data in which each variable is accessed with dot notation.
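As a small illustration of the points above, the following sketch defines a hypothetical `Point` dataclass (not part of this notebook) and shows the generated `__repr__`, dot-notation access and the generated `__eq__`.

```python
from dataclasses import dataclass

@dataclass
class Point:  # hypothetical example class, used only for illustration
    x: float
    y: float

p = Point(x=1.0, y=2.5)
print(p)                      # generated __repr__: Point(x=1.0, y=2.5)
print(p.x)                    # fields are read with dot notation: 1.0
print(p == Point(1.0, 2.5))   # generated __eq__ compares field values: True
```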

In [2]:
from dataclasses import dataclass
# Each dataclass stores the parameters of one probability distribution.
@dataclass
class params_gaussian:
    mu: float
    sigma: float

    def __repr__(self) -> str:
        return f"params_gaussian(mu={self.mu:.3f}, sigma={self.sigma:.3f})"

@dataclass
class params_binomial:
    n: int
    p: float

    def __repr__(self) -> str:
        # n is the number of trials, so it is labelled "n" rather than "mu"
        return f"params_binomial(n={self.n:.3f}, p={self.p:.3f})"

@dataclass
class params_uniform:
    a: float  # lower bound (the ratios are fractional, so float rather than int)
    b: float  # upper bound

    def __repr__(self) -> str:
        return f"params_uniform(a={self.a}, b={self.b})"

In [3]:
# We now have a place to store the parameters of each probability distribution.
# Next we define a dictionary that holds this information for every breed of dog.

breed_info = {
    0 : {
        "height" : params_gaussian(mu=35,sigma=1.5),
        "weight" : params_gaussian(mu=20,sigma=1),
        "bark_days" : params_binomial(n=30,p=0.8),
        "ear_head_ratio" : params_uniform(a=0.1,b=0.6)  # lower bound a must not exceed upper bound b
    },
    1 : {
        "height" : params_gaussian(mu=35,sigma=2),
        "weight" : params_gaussian(mu=20,sigma=5),
        "bark_days" : params_binomial(n=30,p=0.5),
        "ear_head_ratio" : params_uniform(a=0.2,b=0.5)
    },
    2 : {
        "height" : params_gaussian(mu=40,sigma=3.5),
        "weight" : params_gaussian(mu=32,sigma=3),
        "bark_days" : params_binomial(n=30,p=0.3),
        "ear_head_ratio" : params_uniform(a=0.1,b=0.3)
    },
}

Introducing the concept of *match* in Python
- The `match` statement (available from Python 3.10) compares a value against patterns and executes the code under the first pattern that matches.
- It is similar to the switch-case statement in other languages.




In [4]:
def greet(language):
    match language:
        case "English":
            print("Hello!")
        case "Spanish":
            print("¡Hola!")
        case "French":
            print("Bonjour!")
        case _:
            print("Language not supported.")
greet("Spanish")


¡Hola!


`case _:` acts like `default` in a switch statement: it matches anything not previously matched.

In [5]:
import numpy as np
import pandas as pd
from distribution import Distribution  # helper module; its source is not included in this notebook

# Generate data for one breed
def generate_data_for_breed(breed,features,n_sample,params):
    """Generate synthetic data for a specific breed of dog based on the given features and parameters.

    Args:
        breed (int): The breed label, used as a key into params (0, 1 or 2 in this notebook).
        features (list[str]): List of features to generate data for (e.g., "height", "weight", "bark_days", "ear_head_ratio").
        n_sample (int): The number of samples to generate for each feature.
        params (dict): Dictionary containing the parameters for each breed and its features.

    Returns:
        pd.DataFrame: A dataframe containing the generated synthetic data.
    """
    df = pd.DataFrame()
    dis = Distribution()
    for feature in features:
        match feature:
            case "height" | "weight":
                df[feature] = dis.gaussian_generator(params[breed][feature].mu, params[breed][feature].sigma,n_sample)
            case "bark_days":
                df[feature] = dis.binomial_generator(params[breed][feature].n,params[breed][feature].p,n_sample)
            case "ear_head_ratio":
                df[feature] = dis.uniform_generator(params[breed][feature].a, params[breed][feature].b,n_sample)
                
    df["breed"] = breed
    
    return df

    

In [6]:
df_0 = generate_data_for_breed(breed=0,features=FEATURES,n_sample=1200,params=breed_info)
df_1 = generate_data_for_breed(breed=1,features=FEATURES,n_sample=1350,params=breed_info)
df_2 = generate_data_for_breed(breed=2,features=FEATURES,n_sample=900,params=breed_info)

dog_data = pd.concat([df_0,df_1,df_2]).reset_index(drop=True)
dog_data


Unnamed: 0,height,weight,bark_days,ear_head_ratio,breed
0,34.520221,19.680148,23.0,0.412730,0
1,37.477729,21.651819,27.0,0.124643,0
2,35.928282,20.618855,25.0,0.234003,0
3,35.374814,20.249876,25.0,0.300671,0
4,33.483565,18.989044,22.0,0.521991,0
...,...,...,...,...,...
3445,37.116179,29.528153,7.0,0.140997,2
3446,38.095257,30.367363,8.0,0.158630,2
3447,44.413310,35.782838,12.0,0.279267,2
3448,32.208460,25.321537,4.0,0.102600,2


In [7]:
dogs_breed_data = dog_data.sample(frac=1) # shuffle the data
dogs_breed_data.head(10)

Unnamed: 0,height,weight,bark_days,ear_head_ratio,breed
2836,39.69781,31.74098,9.0,0.19312,2
1002,36.710641,21.140427,26.0,0.163527,0
1075,34.72693,19.817954,24.0,0.386113,0
1583,37.324884,25.81221,18.0,0.463242,1
248,37.691499,21.794333,28.0,0.11819,0
814,36.688852,21.125901,26.0,0.165052,0
1407,35.844078,22.110196,16.0,0.399051,1
3376,38.616784,30.814387,8.0,0.169269,2
2700,44.655532,35.990456,12.0,0.281653,2
533,35.209095,20.139397,24.0,0.322284,0


In [8]:
dogs_breed_data.describe()

Unnamed: 0,height,weight,bark_days,ear_head_ratio,breed
count,3450.0,3450.0,3450.0,3450.0,3450.0
mean,36.275919,23.095936,16.530145,0.310283,0.913043
std,3.211112,6.348911,6.457825,0.126155,0.775441
min,29.795694,6.989234,3.0,0.100141,0.0
25%,34.134189,19.182379,11.0,0.213161,0.0
50%,35.61201,20.888449,16.0,0.288337,1.0
75%,37.527885,28.233666,23.0,0.411872,2.0
max,52.068168,42.344144,30.0,0.597684,2.0


In [9]:
# split the data for training and testing

split = int(len(dogs_breed_data)*0.7)
df_train = dogs_breed_data[:split].reset_index(drop=True)
df_test = dogs_breed_data[split:].reset_index(drop=True)

##### Implementation of the Naive Bayes Classifier

- Estimating the distribution parameters and the class proportions from the training data


In [10]:
def compute_training_params(df, features):
    """Estimate the per-breed feature parameters and the class proportions from the training data."""
    # dict that will contain the estimated parameters for each breed
    params_dict = {}
    # dict that will contain the proportion of the data belonging to each class
    probs_dict = {}
    # iterate over the distinct breed labels instead of over every row
    for breed in df['breed'].unique().tolist():
        df_breed = df[df['breed']==breed][features]
        probs_dict[breed] = round(len(df_breed)/len(df),3)

        inner_dict = {}

        for feature in df_breed.columns:
            match feature:
                case "height" | "weight":
                    mu = round(df_breed[feature].mean(),3)
                    sigma = round(df_breed[feature].std(),3)
                    params = params_gaussian(mu=mu,sigma=sigma)
                case "bark_days":
                    # n is estimated as the largest observed count; this falls below the true
                    # number of trials (30 days) when no dog of the breed barked on every day
                    n = df_breed[feature].max()
                    p = round(df_breed[feature].mean() / n,3)
                    params = params_binomial(n=n,p=p)
                case "ear_head_ratio":
                    a = df_breed[feature].min()
                    b = df_breed[feature].max()
                    params = params_uniform(a=a,b=b)
                case _:
                    raise ValueError(f"Unknown feature: {feature}")

            inner_dict[feature] = params
        params_dict[breed] = inner_dict

    return params_dict, probs_dict
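The binomial branch above takes the largest observed count as `n`. The sketch below simulates why that choice matters, under the assumption that the true number of trials is the known 30 days. It uses only the standard library, and the names `TRUE_N`, `TRUE_P`, `p_from_max` and `p_from_known_n` are illustrative. The estimates printed later for breed 2 (18 trials, p = 0.493, against a generating p of 0.3) show the same pattern.

```python
import random

rng = random.Random(0)   # fixed seed for reproducibility
TRUE_N, TRUE_P = 30, 0.3  # illustrative values, matching breed 2's bark_days setup

# 900 simulated dogs: each value is the number of barking days out of TRUE_N
samples = [sum(rng.random() < TRUE_P for _ in range(TRUE_N)) for _ in range(900)]
mean = sum(samples) / len(samples)

p_from_max = mean / max(samples)  # n estimated as the observed maximum
p_from_known_n = mean / TRUE_N    # n fixed at the known number of trials

print(f"observed max: {max(samples)}")
print(f"p with n = max(samples): {p_from_max:.3f}")
print(f"p with n = {TRUE_N}: {p_from_known_n:.3f}")
```

When the number of trials is known in advance, passing it in directly keeps the estimate of `p` close to the generating value; the max-based estimate can only be equal or larger.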

In [11]:
# Test the function
train_params, train_class_prob = compute_training_params(df=df_train,features=FEATURES)

In [12]:
train_params

{2: {'height': params_gaussian(mu=39.814, sigma=3.575),
  'weight': params_gaussian(mu=31.841, sigma=3.064),
  'bark_days': params_binomial(n=18.000, p=0.493),
  'ear_head_ratio': params_uniform(a=0.10101231676924374, b=0.2999435346572261)},
 0: {'height': params_gaussian(mu=35.030, sigma=1.519),
  'weight': params_gaussian(mu=20.020, sigma=1.013),
  'bark_days': params_binomial(n=30.000, p=0.801),
  'ear_head_ratio': params_uniform(a=0.10014116335693468, b=0.5974692080768906)},
 1: {'height': params_gaussian(mu=34.971, sigma=2.011),
  'weight': params_gaussian(mu=19.927, sigma=5.028),
  'bark_days': params_binomial(n=24.000, p=0.622),
  'ear_head_ratio': params_uniform(a=0.20138960690138086, b=0.49991530198583917)}}

In [13]:
train_class_prob

{2: 0.26, 0: 0.346, 1: 0.393}

##### Compute the probability of X given the breed: P(x|C_i)
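For reference, the quantity computed in this step can be written out as follows. This is the standard Naive Bayes factorisation, stated here under the assumption that the features are conditionally independent given the breed; the symbols $x_k$ (the k-th feature value), $C_i$ (breed i) and $n$ (the number of features) are notation chosen for this note.

```latex
P(x \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i),
\qquad
\hat{C} = \arg\max_{i} \; P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)
```

The first expression is the class-conditional likelihood of a dog's feature vector; the second is the decision rule that multiplies it by the class proportion and picks the largest result.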

In [16]:
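def prob_of_x_given_C(X,features, breed, params_dict):
    """Return P(x|C_i): the product of the per-feature likelihoods of X for the given breed."""
    if len(X) != len(features):
        print("X and list of features should have the same length")
        return 0

    probability = 1.0
    dis = Distribution()

    for x, feature in zip(X,features):
        params = params_dict[breed][feature]

        # evaluate each feature under the distribution estimated for this breed
        match feature:
            case "height" | "weight":
                probability_f = dis.pdf_gaussian(x,params.mu,params.sigma)
            case "bark_days":
                probability_f = dis.pdf_binomial(x, params.n,params.p)
            case "ear_head_ratio":
                probability_f = dis.pdf_uniform(x,params.a,params.b)
            case _:
                raise ValueError(f"Unknown feature: {feature}")

        # naive assumption: features are independent given the breed, so the likelihoods multiply
        probability *= probability_f
    return probability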
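The `Distribution` class is imported from a module whose source is not shown here, so the density functions it provides can only be guessed at. The sketch below gives standard-library versions of the three textbook formulas; the function names are hypothetical and the behaviour outside the support (returning 0) is an assumption, not a description of the actual helper.

```python
import math

def pdf_gaussian(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pmf_binomial(x, n, p):
    """Probability of exactly x successes in n trials; 0 outside 0..n."""
    x, n = int(x), int(n)
    if x < 0 or x > n:
        return 0.0
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def pdf_uniform(x, a, b):
    """Density of a uniform distribution on [a, b]; 0 outside the interval."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

print(pdf_gaussian(35.0, 35.0, 1.5))
print(pmf_binomial(24, 30, 0.8))
print(pdf_uniform(0.7, 0.1, 0.6))   # outside [0.1, 0.6]
```

If the real helper behaves like this sketch, a single feature value outside a breed's estimated uniform range, or above its estimated `n`, sets the whole product to zero for that breed. That is one plausible reading of exact zeros in the posterior arrays printed later in this section.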


In [52]:
# test the function

example_dog = df_test[FEATURES].loc[0]
example_breed = df_test["breed"].loc[0]

example_dog1 = df_test[FEATURES].loc[3]
example_breed1 = df_test["breed"].loc[3]



In [33]:
print(example_dog)
print(f"Its breed: {example_breed}")

height            33.625469
weight            16.563673
bark_days         13.000000
ear_head_ratio     0.273787
Name: 0, dtype: float64
Its breed: 1


In [53]:
print(example_dog1)
print(f"Its breed: {example_breed1}")

height            38.842889
weight            31.008190
bark_days          8.000000
ear_head_ratio     0.174094
Name: 3, dtype: float64
Its breed: 2


In [36]:
print(f"Probability of these features if dog is classified as breed 0: {prob_of_x_given_C([*example_dog], FEATURES, 0, train_params)}")
print(f"Probability of these features if dog is classified as breed 1: {prob_of_x_given_C([*example_dog], FEATURES, 1, train_params)}")
print(f"Probability of these features if dog is classified as breed 2: {prob_of_x_given_C([*example_dog], FEATURES, 2, train_params)}")

Probability of these features if dog is classified as breed 0: 3.2393868484420737e-09
Probability of these features if dog is classified as breed 1: 0.003950124435768676
Probability of these features if dog is classified as breed 2: 1.902418041317492e-09


##### Predict the Breed 

In [41]:
def predict_breed(X, features, params_dict, probs_dict):
    """Return the breed with the largest unnormalized posterior P(x|C_i) * P(C_i)."""
    breeds = sorted(probs_dict)

    # likelihood times class proportion for every breed, in ascending label order
    posterior_array = np.array([
        prob_of_x_given_C(X, features, breed, params_dict) * probs_dict[breed]
        for breed in breeds
    ])
    print(posterior_array)  # debugging aid: prints one line per call
    # map the arg-max position back to the breed label
    prediction = breeds[int(np.argmax(posterior_array))]

    return prediction
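Multiplying many small likelihoods can underflow to zero in floating point, at which point the classes can no longer be ranked. A common remedy, sketched below with hypothetical helper names and made-up likelihood values, is to compare sums of logarithms instead of products. This is an alternative formulation, not what `predict_breed` above does.

```python
import math

def log_posterior(likelihoods, prior):
    """Log of prior * product(likelihoods); -inf when any factor is zero."""
    if prior <= 0 or any(value <= 0 for value in likelihoods):
        return float("-inf")
    return math.log(prior) + sum(math.log(value) for value in likelihoods)

def predict_from_logs(per_class_likelihoods, priors):
    """Return the class label with the largest log-posterior score."""
    scores = {label: log_posterior(per_class_likelihoods[label], priors[label])
              for label in priors}
    return max(scores, key=scores.get)

# made-up per-feature likelihoods for two classes, small enough to underflow when multiplied
likelihoods = {0: [1e-200, 1e-150], 1: [1e-180, 1e-160]}
priors = {0: 0.5, 1: 0.5}

print(1e-200 * 1e-150)                         # 0.0: the plain product underflows
print(predict_from_logs(likelihoods, priors))  # 1: the log scores still separate the classes
```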

In [42]:
example_pred = predict_breed([*example_dog], FEATURES, train_params, train_class_prob)
print(f"Example dog has breed {example_breed} and Naive Bayes classified it as {example_pred}")

[1.12082785e-09 1.55239890e-03 4.94628691e-10]
Example dog has breed 1 and Naive Bayes classified it as 1


In [54]:
example_pred1 = predict_breed([*example_dog1], FEATURES, train_params, train_class_prob)
print(f"Example dog has breed {example_breed1} and Naive Bayes classified it as {example_pred1}")

[3.23801842e-38 0.00000000e+00 3.02254595e-03]
Example dog has breed 2 and Naive Bayes classified it as 2


In [40]:
from sklearn.metrics import accuracy_score
preds = df_test.apply(lambda x: predict_breed([*x[FEATURES]],FEATURES,train_params, train_class_prob),axis=1)
test_acc = accuracy_score(df_test["breed"],preds)
print(f"Accuracy of the test split: {test_acc:.2f}")

Accuracy of the test split: 1.00


The Naive Bayes classifier achieved an accuracy of 100% on the test split.
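A single accuracy figure hides which breeds, if any, get confused with each other. The sketch below counts (true label, predicted label) pairs with the standard library; the helper `confusion_counts` and the label lists are hypothetical and are not taken from the test split above.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

y_true = [0, 0, 1, 1, 2, 2]   # made-up true labels
y_pred = [0, 1, 1, 1, 2, 2]   # made-up predictions

counts = confusion_counts(y_true, y_pred)
accuracy = sum(n for (true, pred), n in counts.items() if true == pred) / len(y_true)

print(counts[(0, 1)])      # 1: one breed-0 dog was predicted as breed 1
print(f"{accuracy:.3f}")   # 0.833
```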

In [95]:
dog_data.to_csv('Dogs_breed_data.csv',index=False)

## Section 2 - Spam Detector

- The idea is to build a classifier that can distinguish spam from ham (legitimate) emails.
- Dataset: over 5500 emails with their corresponding labels.
- The features are categorical.

In [63]:
emails = pd.read_csv('emails.csv')
emails.head(10)

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


In [68]:
def preprocess_email(text:str):
    """Pre-processing of the given email text by converting it to lower case, 
    splitting it into words, and returning a list of unique words.

    Args:
        text (str): The email text to be processed
    Returns:
        list: A list of unique words extracted from the email text.
    """
    text = text.lower()
    return list(set(text.split()))

emails['words'] = emails['text'].apply(preprocess_email)
emails.head()

Unnamed: 0,text,spam,words
0,Subject: naturally irresistible your corporate...,1,"[its, even, our, is, -, break, collaboration, ..."
1,Subject: the stock trading gunslinger fanny i...,1,"[continuant, like, herald, hall, chesapeake, a..."
2,Subject: unbelievable new homes made easy im ...,1,"[$, all, we, at, pittman, and, foward, our, be..."
3,Subject: 4 color printing special request add...,1,"[information, request, &, message, and, our, t..."
4,"Subject: do not have money , get software cds ...",1,"[d, all, money, with, me, best, get, be, ,, gr..."
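Splitting on whitespace leaves punctuation attached to words whenever the text is not already tokenised, so `"friend,"` and `"friend"` count as different words. The sketch below shows one possible alternative using a regular expression; the helper `tokenize` is hypothetical, and switching to it would change the word counts reported in this notebook.

```python
import re

def tokenize(text: str) -> set[str]:
    """Return the unique lower-case tokens made of letters, digits or apostrophes."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

sentence = "Dear friend, you won!"
print(sorted(sentence.lower().split()))  # ['dear', 'friend,', 'won!', 'you']
print(sorted(tokenize(sentence)))        # ['dear', 'friend', 'won', 'you']
```

The emails shown above already have spaces around most punctuation marks, so the difference mainly affects free-form messages passed to the classifier later on.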


- Naive Bayes for categorical variables
- If `x_k` is categorical, then P(`x_k`|`C_i`) is the number of samples in class `C_i` that have attribute `x_k`, divided by the total number of samples in class `C_i`.
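The counting rule above can be checked by hand on a tiny example. The corpus below is made up for illustration and the helper `word_given_class` is hypothetical; it is not the function used on the real emails.

```python
# toy corpus of (unique words, label) pairs; all contents are illustrative
corpus = [
    ({"win", "money", "now"}, "spam"),
    ({"win", "prize"}, "spam"),
    ({"meeting", "now"}, "ham"),
    ({"project", "meeting", "money"}, "ham"),
    ({"lunch"}, "ham"),
]

def word_given_class(word, label):
    """Fraction of emails in the class `label` that contain `word`."""
    in_class = [words for words, lab in corpus if lab == label]
    return sum(word in words for words in in_class) / len(in_class)

print(word_given_class("win", "spam"))    # 1.0  (2 of 2 spam emails)
print(word_given_class("money", "spam"))  # 0.5  (1 of 2 spam emails)
print(word_given_class("win", "ham"))     # 0.0  (0 of 3 ham emails)
```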

##### Frequency of a word in each class

In [70]:
def freq_word_each_class(df:pd.DataFrame):
    """Calculates the frequency of words in each class (spam and ham) based on a given dataframe

    Args:
        df (dataframe): columns[text,words,spam]
        
    Return:
        dict: A dictionary that contains the frequency of words in each class.
        The keys of the dictionary are words, and the values are nested dictionaries with keys as 'spam' and 'ham' 
        representing the frequency of the word in spam and ham emails, respectively. 
    """
    word_freq = {}
    for _, row in df.iterrows():
        words = row['words']
        for word in words:
            if word not in word_freq:
                word_freq[word] = {'spam':0,'ham':0}
                
            match row['spam']:
                case 0:
                    word_freq[word]['ham'] += 1
                case 1:
                    word_freq[word]['spam'] += 1
    return word_freq


In [None]:
# Test the function
word_freq = freq_word_each_class(emails)
word_freq

In [82]:
try:
    print(f"Frequency in both classes for word 'security': {word_freq['security']}")
    print(f"Frequency in both classes for word 'website': {word_freq['website']}")
    print(f"Frequency in both classes for word '2342': {word_freq['2342']}")
except KeyError as e:
    print(f"Given word {e} is not in corpus")



Frequency in both classes for word 'security': {'spam': 117, 'ham': 91}
Frequency in both classes for word 'website': {'spam': 203, 'ham': 134}
Given word '2342' is not in corpus


##### Frequency of classes

In [84]:
def class_frequencies(df):
    """
    Calculate the frequencies of classes in a DataFrame

    """
    class_freq = {
        "spam" : len(df[df["spam"]==1]),
        "ham" : len(df[df['spam']==0])
    }
    return class_freq

class_freq = class_frequencies(emails)
class_freq

{'spam': 1368, 'ham': 4360}

In [86]:
print(f"The proportion of spam in the dataset: {100*(class_freq['spam']/len(emails)):.2f} %")
print(f"The proportion of ham in the dataset: {100*(class_freq['ham']/len(emails)):.2f} %")

The proportion of spam in the dataset: 23.88 %
The proportion of ham in the dataset: 76.12 %


##### Naive Bayes for Categorical features

In [88]:
def email_classifier(text:str, word_freq=word_freq, class_freq=class_freq):
    """Return the estimated probability that the given email text is spam."""
    text = text.lower()
    words = set(text.split())
    cumulative_product_spam = 1.0
    cumulative_product_ham = 1.0

    for word in words:
        # words that never appeared in the training emails are skipped
        if word in word_freq:
            word_freq_dict = word_freq[word]
            spam_count = word_freq_dict['spam']
            ham_count = word_freq_dict['ham']
            cumulative_product_spam *= spam_count/class_freq['spam']
            cumulative_product_ham *= ham_count/class_freq['ham']

    # likelihood of the words given spam, multiplied by the prior probability of spam
    likelihood_word_given_spam = cumulative_product_spam * (class_freq['spam']/(class_freq['spam']+class_freq['ham']))

    # likelihood of the words given ham, multiplied by the prior probability of ham
    likelihood_word_given_ham = cumulative_product_ham * (class_freq['ham']/(class_freq['spam']+class_freq['ham']))

    # posterior probability of spam given the words in the email (Bayes' rule, normalized over both classes)
    prob_spam = likelihood_word_given_spam / (likelihood_word_given_spam+likelihood_word_given_ham)

    return prob_spam
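In the classifier above, a word that was seen in only one class contributes a factor of zero for the other class, which forces the result to exactly 0 or 1. A common variation is add-alpha (Laplace) smoothing combined with log-space arithmetic. The sketch below is one way to write it, with a hypothetical function name and toy counts; it is not the method used above and would produce different probabilities.

```python
import math

def spam_probability(words, word_freq, class_freq, alpha=1.0):
    """Smoothed Naive Bayes posterior P(spam | words), computed in log space."""
    n_total = class_freq["spam"] + class_freq["ham"]
    log_spam = math.log(class_freq["spam"] / n_total)  # log prior of spam
    log_ham = math.log(class_freq["ham"] / n_total)    # log prior of ham
    for word in words:
        if word not in word_freq:
            continue  # unseen words are skipped, as in the original classifier
        counts = word_freq[word]
        # add alpha to the count and 2*alpha to the class size (word present / absent)
        log_spam += math.log((counts["spam"] + alpha) / (class_freq["spam"] + 2 * alpha))
        log_ham += math.log((counts["ham"] + alpha) / (class_freq["ham"] + 2 * alpha))
    diff = log_ham - log_spam
    if diff > 700:  # exp(diff) would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# toy counts, made up for illustration
toy_word_freq = {"lottery": {"spam": 5, "ham": 0}, "hello": {"spam": 1, "ham": 8}}
toy_class_freq = {"spam": 10, "ham": 10}

print(spam_probability({"lottery"}, toy_word_freq, toy_class_freq))  # about 0.857 rather than exactly 1
print(spam_probability({"hello"}, toy_word_freq, toy_class_freq))    # about 0.182
```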
    

In [93]:
msg = "Dear friend, you won a lottery of three million dollars."
print(f"Probability of spam for email '{msg}': {100*email_classifier(text=msg):.2f} %")
msg1 = "I hope you're doing well."
print(f"Probability of spam for email '{msg1}': {100*email_classifier(text=msg1):.2f} %")


Probability of spam for email 'Dear friend, you won a lottery of three million dollars.': 100.00 %
Probability of spam for email 'I hope you're doing well.': 0.96 %
