# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [24]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [2]:
# Simulate 1000 fair coin flips; each result is the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [4]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
tails    518
heads    482
Name: count, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [5]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.482
Probability of Tails: 0.518
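
The estimates above come from only 1000 flips, so they differ slightly from the true value of 0.5. As a minimal sketch (the seed and the sample sizes are arbitrary choices for illustration, not part of the lab), the following shows how the estimated probability of heads tends to move closer to 0.5 as the number of flips grows:

```python
import numpy as np

# Seeded generator so the run is repeatable; the seed value is arbitrary
rng = np.random.default_rng(0)

sample_sizes = [10, 100, 1_000, 10_000, 100_000]
estimates = []
for n in sample_sizes:
    flips = rng.choice(['heads', 'tails'], size=n)
    # Fraction of flips that came up heads = empirical P(heads)
    estimates.append(float(np.mean(flips == 'heads')))

for n, p in zip(sample_sizes, estimates):
    print(f"n = {n:>6}: estimated P(heads) = {p:.4f}")
```

Small samples can land far from 0.5 by chance; the larger ones should sit much closer to it.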


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [56]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)
data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)


In [57]:
# Load the dataset saved above (change the file name if you use a different dataset). `df_emails.head(20)` displays the first 20 rows.
df_emails = pd.read_csv('simulated_email_dataset.csv')
df_emails.head(20)

   email_length  contains_free  contains_winner time_of_day label
0           109              0                0     morning   ham
1            97              0                0     morning  spam
2           112              0                0     morning  spam
3           130              1                0   afternoon   ham
4            95              0                1   afternoon  spam
5            95              1                0     morning   ham
6           131              0                0     morning   ham
7           115              0                0   afternoon   ham
8            90              1                1     evening  spam
9           110              1                0       night   ham


### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [58]:
# Your code for Data Preprocessing goes here

# Encode the time_of_day column so that morning is 1, afternoon is 2, evening is 3, and night is 4
df_emails['time_of_day'] = df_emails['time_of_day'].map({'morning': 1, 'afternoon': 2,'evening': 3,'night': 4})

# Encoding label column such that ham (not spam) is 0 and spam is 1
df_emails['label'] = df_emails['label'].map({'ham': 0, 'spam': 1})

# Print the first 20 rows of the encoded data
print(df_emails.head(20))

    email_length  contains_free  contains_winner  time_of_day  label
0            109              0                0            1      0
1             97              0                0            1      1
2            112              0                0            1      1
3            130              1                0            2      0
4             95              0                1            2      1
5             95              1                0            1      0
6            131              0                0            1      0
7            115              0                0            2      0
8             90              1                1            3      1
9            110              1                0            4      0
10            90              1                0            4      0
11            90              1                0            1      1
12           104              0                0            3      1
13            61              1   
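
The cell above encodes `time_of_day` as the integers 1 to 4, which implies an ordering, and leaves `email_length` on its original scale. As one possible alternative, the sketch below standardizes `email_length` (z-score) and one-hot encodes `time_of_day`. It uses a small hand-made frame, `df_demo`, built from a few rows shown earlier; `df_demo`, `df_encoded`, and the column name `email_length_z` are illustrative names, not part of the lab.

```python
import pandas as pd

# Small demo frame using a few rows shown earlier in this lab
df_demo = pd.DataFrame({
    'email_length': [109, 130, 95, 90, 110],
    'time_of_day': ['morning', 'afternoon', 'afternoon', 'evening', 'night'],
})

# z-score normalization: subtract the mean, divide by the sample standard deviation
length = df_demo['email_length']
df_demo['email_length_z'] = (length - length.mean()) / length.std()

# One-hot encoding: one indicator column per category, with no implied order
df_encoded = pd.get_dummies(df_demo, columns=['time_of_day'])
print(df_encoded)
```

Whether this helps depends on the model: the categorical likelihoods used later in this lab work on either encoding, while scale matters more for distance-based or gradient-based models.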

### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [59]:
# Your code for calculating the probability of spam and ham emails in the dataset goes here
spam_count = df_emails["label"].sum()
not_spam_count = len(df_emails)-spam_count
spam_prob = spam_count / len(df_emails)
not_spam_prob = not_spam_count / len(df_emails)
print(f"Probability of spam: {spam_prob}, Probability of not spam: {not_spam_prob}")

Probability of spam: 0.409, Probability of not spam: 0.591


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.
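
For reference, the rule can be written as follows, where $x_1, \dots, x_n$ denote the feature values of one email (this notation is introduced here for illustration). The second line is the naive Bayes assumption that the features are conditionally independent given the label, and the third expands the denominator over both classes:

```latex
P(\text{spam} \mid x_1, \dots, x_n)
  = \frac{P(x_1, \dots, x_n \mid \text{spam}) \, P(\text{spam})}{P(x_1, \dots, x_n)}

P(x_1, \dots, x_n \mid \text{spam}) \approx \prod_{i=1}^{n} P(x_i \mid \text{spam})

P(x_1, \dots, x_n)
  = P(x_1, \dots, x_n \mid \text{spam}) \, P(\text{spam})
  + P(x_1, \dots, x_n \mid \text{ham}) \, P(\text{ham})
```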

In [60]:
# Likelihood for categorical features
def likelihood(feature, label_value, feature_value):
    return len(df_emails[(df_emails['label'] == label_value) & (df_emails[feature] == feature_value)]) / len(df_emails[df_emails['label'] == label_value])

# Likelihoods for contains_free, contains_winner, and time_of_day given spam or not spam
likelihoods_spam = {
    'contains_free': likelihood('contains_free', 1, 1),
    'contains_winner': likelihood('contains_winner', 1, 1),
    'time_of_day': {
        1: likelihood('time_of_day', 1, 1),
        2: likelihood('time_of_day', 1, 2),
        3: likelihood('time_of_day', 1, 3),
        4: likelihood('time_of_day', 1, 4),
    }
}

likelihoods_not_spam = {
    'contains_free': likelihood('contains_free', 0, 1),
    'contains_winner': likelihood('contains_winner', 0, 1),
    'time_of_day': {
        1: likelihood('time_of_day', 0, 1),
        2: likelihood('time_of_day', 0, 2),
        3: likelihood('time_of_day', 0, 3),
        4: likelihood('time_of_day', 0, 4),
    }
}

spam_prob, not_spam_prob, likelihoods_spam, likelihoods_not_spam

(0.409,
 0.591,
 {'contains_free': 0.27383863080684595,
  'contains_winner': 0.19315403422982885,
  'time_of_day': {1: 0.23227383863080683,
   2: 0.24938875305623473,
   3: 0.2371638141809291,
   4: 0.28117359413202936}},
 {'contains_free': 0.3197969543147208,
  'contains_winner': 0.19120135363790186,
  'time_of_day': {1: 0.22842639593908629,
   2: 0.23519458544839256,
   3: 0.26903553299492383,
   4: 0.2673434856175973}})
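
The `likelihood` function above returns exactly 0 whenever a feature value never occurs together with a label, and a single zero wipes out the whole product of likelihoods. One common remedy is additive (Laplace) smoothing. The sketch below is an illustration under that assumption; the function name `smoothed_likelihood` and its parameters `n_values` and `alpha` are hypothetical and not part of the lab code. The counts 112 and 409 are derived from the output above (409 spam emails, 112 of them containing "free").

```python
def smoothed_likelihood(match_count, class_count, n_values, alpha=1.0):
    """Additive (Laplace) smoothing estimate of P(feature = value | label).

    match_count: emails with this label AND this feature value
    class_count: emails with this label
    n_values:    number of distinct values the feature can take
    alpha:       smoothing strength (alpha = 0 gives the unsmoothed estimate)
    """
    return (match_count + alpha) / (class_count + alpha * n_values)

# A value never seen among 409 spam emails, for a feature with 4 possible values
print(smoothed_likelihood(0, 409, 4))    # small but non-zero

# contains_free = 1 among spam emails: close to the unsmoothed 112 / 409
print(smoothed_likelihood(112, 409, 2))
```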

In [61]:
from scipy.stats import norm

# Gaussian likelihood for continuous variables
def gaussian_likelihood(x, mean, std):
    return norm.pdf(x, mean, std)

def classify_email(email_features, mean_spam, std_spam, mean_not_spam, std_not_spam):
    # Extract the features from the input email
    email_length, contains_free, contains_winner, time_of_day = email_features
    
    # Calculate likelihoods for categorical features
    likelihood_spam_contains_free = likelihood('contains_free', 1, contains_free)
    likelihood_not_spam_contains_free = likelihood('contains_free', 0, contains_free)

    likelihood_spam_contains_winner = likelihood('contains_winner', 1, contains_winner)
    likelihood_not_spam_contains_winner = likelihood('contains_winner', 0, contains_winner)

    likelihood_spam_time_of_day = likelihood('time_of_day', 1, time_of_day)
    likelihood_not_spam_time_of_day = likelihood('time_of_day', 0, time_of_day)

    # Calculate Gaussian likelihood for email_length
    likelihood_spam_email_length = gaussian_likelihood(email_length, mean_spam, std_spam)
    likelihood_not_spam_email_length = gaussian_likelihood(email_length, mean_not_spam, std_not_spam)

    # Joint likelihood of all features given each class, assuming the features
    # are conditionally independent (the naive Bayes assumption)
    P_features_given_spam = (likelihood_spam_contains_free * likelihood_spam_contains_winner *
                             likelihood_spam_time_of_day * likelihood_spam_email_length)
    P_features_given_not_spam = (likelihood_not_spam_contains_free * likelihood_not_spam_contains_winner *
                                 likelihood_not_spam_time_of_day * likelihood_not_spam_email_length)

    # Unnormalized posteriors: likelihood times prior
    P_spam_given_features = P_features_given_spam * spam_prob
    P_not_spam_given_features = P_features_given_not_spam * not_spam_prob

    # Normalize to get final probabilities
    total_probability = P_spam_given_features + P_not_spam_given_features
    posterior_spam = P_spam_given_features / total_probability
    posterior_not_spam = P_not_spam_given_features / total_probability

    # Return the classification based on the higher posterior probability
    if posterior_spam > posterior_not_spam:
        return "Spam", posterior_spam
    else:
        return "Not Spam", posterior_not_spam
    
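# Per-class mean and standard deviation of email_length for the Gaussian likelihood.
# Assumed definition: the cell that originally set these values is not shown,
# so they are estimated here from df_emails (sample standard deviation).
spam_lengths = df_emails.loc[df_emails['label'] == 1, 'email_length']
not_spam_lengths = df_emails.loc[df_emails['label'] == 0, 'email_length']
mean_spam, std_spam = spam_lengths.mean(), spam_lengths.std()
mean_not_spam, std_not_spam = not_spam_lengths.mean(), not_spam_lengths.std()

# Classify three example emails: [email_length, contains_free, contains_winner, time_of_day]
email1 = [90, 1, 1, 1]
email2 = [140, 0, 0, 1]
email3 = [50, 1, 0, 3]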

classification_email1 = classify_email(email1, mean_spam, std_spam, mean_not_spam, std_not_spam)
classification_email2 = classify_email(email2, mean_spam, std_spam, mean_not_spam, std_not_spam)
classification_email3 = classify_email(email3, mean_spam, std_spam, mean_not_spam, std_not_spam)

print("Email 1 classification:", classification_email1)
print("Email 2 classification:", classification_email2)
print("Email 3 classification:", classification_email3)

Email 1 classification: ('Not Spam', 0.6273131297622822)
Email 2 classification: ('Not Spam', 0.5481313714391256)
Email 3 classification: ('Not Spam', 0.7040175299307767)
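
Multiplying likelihoods works for four features, but with many more features the product can underflow to zero. A common remedy is to add logarithms instead. The sketch below does this for an email with `contains_free = 1`, `contains_winner = 1`, and `time_of_day = 1`, using the priors and categorical likelihoods printed earlier (rounded to four decimals). The Gaussian `email_length` term is left out for brevity, and the helper name `log_posteriors` is hypothetical.

```python
import math

def log_posteriors(log_prior_by_class, log_likelihoods_by_class):
    """Return normalized posteriors computed in log space."""
    # Unnormalized log posterior: log prior + sum of log likelihoods
    scores = {
        label: log_prior_by_class[label] + sum(log_likelihoods_by_class[label])
        for label in log_prior_by_class
    }
    # Log-sum-exp trick: subtract the maximum before exponentiating
    top = max(scores.values())
    log_total = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {label: math.exp(s - log_total) for label, s in scores.items()}

log_priors = {'spam': math.log(0.409), 'ham': math.log(0.591)}
log_likes = {
    'spam': [math.log(p) for p in (0.2738, 0.1932, 0.2323)],
    'ham': [math.log(p) for p in (0.3198, 0.1912, 0.2284)],
}
posteriors = log_posteriors(log_priors, log_likes)
print(posteriors)
```

The result matches what direct multiplication and normalization would give; the log form only changes how the arithmetic is carried out.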


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [92]:
# Your code goes here
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Features and labels
X = df_emails[['email_length', 'contains_free', 'contains_winner', 'time_of_day']]
y = df_emails['label']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=42)

# Train the Gaussian Naive Bayes classifier
model = GaussianNB()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

accuracy, conf_matrix

(0.5783333333333334,
 array([[317,  46],
        [207,  30]], dtype=int64))

In [93]:
# Train the multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

accuracy, conf_matrix

(0.6033333333333334,
 array([[362,   1],
        [237,   0]], dtype=int64))

In [94]:
# Train the Bernoulli Naive Bayes classifier
model = BernoulliNB()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

accuracy, conf_matrix

(0.605,
 array([[363,   0],
        [237,   0]], dtype=int64))
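
Accuracy alone can hide how a classifier behaves on the minority class. As a rough sketch, the code below recomputes accuracy, spam precision, and spam recall from the three confusion matrices printed above, and compares them with a baseline that always predicts ham. The helper name `summarize` is hypothetical, and reporting precision as 0.0 when no email is predicted as spam is a convention chosen here.

```python
def summarize(conf_matrix):
    """Accuracy, spam precision and spam recall from a 2x2 confusion matrix.

    Layout follows scikit-learn: rows are true labels, columns are predicted
    labels, with 0 = ham and 1 = spam.
    """
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when no email is predicted as spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Confusion matrices copied from the outputs above
results = {
    'GaussianNB': [[317, 46], [207, 30]],
    'MultinomialNB': [[362, 1], [237, 0]],
    'BernoulliNB': [[363, 0], [237, 0]],
}
for name, cm in results.items():
    acc, prec, rec = summarize(cm)
    print(f"{name}: accuracy={acc:.3f}, spam precision={prec:.3f}, spam recall={rec:.3f}")

# Baseline: always predict the majority class (ham), 363 of 600 test emails
majority_baseline = 363 / 600
print(f"Always-ham baseline accuracy: {majority_baseline:.3f}")
```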

### Task 6: Discussion
1. Which probability distribution would you choose for an email classifier? Explain your answer.
2. Discuss how Bayesian updating improves the accuracy of the classifier.
3. What are the limitations of the model built in this lab?


1) Of the models above, the Gaussian naive Bayes model is preferred. It needs relatively little training data and is simple to implement for spam filtering, and it models the continuous `email_length` feature directly. Naive Bayes models in general work well for text classification because they treat every feature as conditionally independent given the class, which lets them cope with categorical, high-dimensional data that includes irrelevant features. The multinomial and Bernoulli models reach slightly higher accuracy here (0.603 and 0.605 versus 0.578), but their confusion matrices show that they label almost every email as ham, so that accuracy mostly reflects the share of ham in the test set (363 of 600).

2) Bayesian updating starts from prior knowledge, for example which emails the user has already flagged as spam, and combines it with evidence such as specific words in the message, its length, and its time of delivery to decide whether a new email is spam. As more emails are flagged, the classifier updates its beliefs: the prior and the likelihoods are re-estimated, and newly seen suspicious words are taken into account. Accuracy therefore tends to improve as the user receives and labels more email.

3) The limitations are that the conditional independence assumption between features often does not hold, that the model depends heavily on the quality of the data, and that it suffers from the zero-probability problem when an email contains words that never appeared in the training set.

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.