# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [None]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [None]:
# Simulate 1000 fair coin flips, recording each result as 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [None]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
heads    500
tails    500
Name: count, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [None]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.5
Probability of Tails: 0.5
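
A single run of 1000 flips will usually not split exactly evenly, so the estimated probabilities vary slightly from run to run. The sketch below illustrates how the estimate of P(heads) tends to settle near 0.5 as the number of flips grows. The seed value and the name `running_p_heads` are assumptions made for this illustration and are not part of the lab.

```python
import numpy as np

# Fixed seed so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Simulate fair coin flips: True means heads.
flips = rng.random(100_000) < 0.5

# Running estimate of P(heads) after each flip.
running_p_heads = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"After {n:>6} flips: P(heads) ~ {running_p_heads[n - 1]:.4f}")
```

Early estimates can be far from 0.5, while later ones fluctuate in a much narrower band.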


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [None]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)

data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.5, 0.5]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Replace the random labels with labels that depend on the features:
# emails containing "free" or "winner" are more likely to be spam
for index, row in df.iterrows():
    prob = min(1, .7 *row["contains_free"] + .7*row["contains_winner"]+.1)
    df.at[index, 'label'] = np.random.choice(['spam', 'ham'], p=[prob, 1-prob])

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)

In [None]:
# Load the dataset saved above (change the file name if you use your own dataset)
# and preview the first five rows with `df_emails.head()`.
df_emails = pd.read_csv('simulated_email_dataset.csv')
df_emails.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,0,morning,ham
1,97,0,0,morning,ham
2,112,0,0,morning,ham
3,130,1,0,afternoon,spam
4,95,0,1,afternoon,spam


### Task 2: Data Preprocessing
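You need to preprocess the data for analysis. This involves normalizing and encoding the features. In the cell below, `email_length` is discretized into three equal-frequency bins so it can be treated as a categorical feature.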

In [20]:
cutsOfEmailLength = ['small', 'medium', 'large']
df_emails['len_cuts'] = pd.qcut(df_emails['email_length'], len(cutsOfEmailLength), labels=cutsOfEmailLength)
df_emails.head()




Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label,len_cat,len_cuts
0,109,0,0,morning,ham,"medium,",large
1,97,0,0,morning,ham,short,medium
2,112,0,0,morning,ham,"medium,",large
3,130,1,0,afternoon,spam,"medium,",large
4,95,0,1,afternoon,spam,short,medium
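
To make the binning step more concrete, here is a minimal sketch of how `pd.qcut` splits values into equal-frequency bins and how the returned edges can be reused on a new value. The `lengths` values are hypothetical and chosen only for illustration; reusing the edges through `pd.cut` is one possible approach, not necessarily the one used later in this lab.

```python
import numpy as np
import pandas as pd

# Hypothetical email lengths, used only for illustration.
lengths = pd.Series([60, 75, 82, 90, 95, 100, 104, 110, 118, 125, 133, 150])
labels = ['small', 'medium', 'large']

# retbins=True also returns the bin edges that qcut chose.
binned, edges = pd.qcut(lengths, q=len(labels), labels=labels, retbins=True)

print(binned.value_counts().sort_index())
print("Bin edges:", edges)

# Reuse the same edges so a new value is binned consistently with the original data.
new_bin = pd.cut([97], bins=edges, labels=labels, include_lowest=True)[0]
print("Length 97 falls in:", new_bin)
```

Because the bins are quantile-based, each one holds roughly the same number of emails, regardless of how the lengths are spread out.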


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [15]:
def calc_conditional_probability(df, feature, value, label):
    return len(df[(df[feature] == value) & (df['label'] == label)]) / len(df[df['label'] == label])

def predict_spam(df, data):
    # Prior probabilities P(spam) and P(ham)
    P_spam = len(df[df['label'] == 'spam']) / len(df)
    P_ham = len(df[df['label'] == 'ham']) / len(df)
    # Recover the quantile bin edges from the training data passed in as `df`
    _, bins = pd.qcut(df['email_length'], len(cutsOfEmailLength), retbins=True)

    # Assign the input email's length to a bin (only the last bin includes its upper edge).
    # A length outside the training range matches no bin, so 'len_cuts' is not set.
    for i in range(len(bins) - 1):
        if bins[i] <= data['email_length'] <= bins[i + 1] if i == len(bins) - 2 else bins[i] <= data['email_length'] < bins[i + 1]:
            data.update({'len_cuts': cutsOfEmailLength[i]})
            break

    # Calculate the conditional probabilities for spam
    P_L_given_spam = calc_conditional_probability(df, 'len_cuts', data['len_cuts'], 'spam')
    P_F_given_spam = calc_conditional_probability(df, 'contains_free', data['contains_free'], 'spam')
    P_W_given_spam = calc_conditional_probability(df, 'contains_winner', data['contains_winner'], 'spam')
    P_TOD_given_spam = calc_conditional_probability(df, 'time_of_day', data['time_of_day'], 'spam')

    # Calculate the conditional probabilities for ham
    P_L_given_ham = calc_conditional_probability(df, 'len_cuts', data['len_cuts'], 'ham')
    P_F_given_ham = calc_conditional_probability(df, 'contains_free', data['contains_free'], 'ham')
    P_W_given_ham = calc_conditional_probability(df, 'contains_winner', data['contains_winner'], 'ham')
    P_TOD_given_ham = calc_conditional_probability(df, 'time_of_day', data['time_of_day'], 'ham')

    # Calculate P(spam | features) using Bayes' Theorem
    P_spam_given_data = (P_L_given_spam * P_F_given_spam * P_W_given_spam * P_TOD_given_spam * P_spam) / (
            P_L_given_spam * P_F_given_spam * P_W_given_spam * P_TOD_given_spam * P_spam +
            P_L_given_ham * P_F_given_ham * P_W_given_ham * P_TOD_given_ham * P_ham)

    return P_spam_given_data
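
For reference, the quantity returned by `predict_spam` can be written out as follows, where $L$ is the length bin, $F$ and $W$ indicate whether the email contains "free" and "winner", and $T$ is the time of day. These symbols are shorthand introduced here, and the factorization relies on the assumption that the features are independent given the label.

$$
P(\text{spam} \mid L, F, W, T) =
\frac{P(L \mid \text{spam})\, P(F \mid \text{spam})\, P(W \mid \text{spam})\, P(T \mid \text{spam})\, P(\text{spam})}
{\sum_{c \in \{\text{spam},\, \text{ham}\}} P(L \mid c)\, P(F \mid c)\, P(W \mid c)\, P(T \mid c)\, P(c)}
$$

The numerator corresponds to the spam product in the code, and the denominator is the sum of the spam and ham products.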



### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.

In [21]:
# Classify a single example email using the Bayes' Theorem implementation in predict_spam
email_features = {
    "email_length": 36,
    "contains_free": 0,
    "contains_winner": 0,
    "time_of_day": "evening"
}
spam_probability = predict_spam(df_emails, email_features)
print(f"Probability that the email is spam: {spam_probability}")

Probability that the email is spam: 0.19061268008564278


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [22]:

# Sample size
n_samples = 300

# Simulating data
np.random.seed(42)

data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.5, 0.5]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
}

df = pd.DataFrame(data)

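def isSpamBool(p_spam_probability):
    # Classify as spam when the posterior probability exceeds 0.5
    return p_spam_probability > 0.5

# Label each simulated email with the model's prediction
df['label'] = df.apply(lambda row: 'spam' if isSpamBool(predict_spam(df_emails, dict(row))) else 'ham', axis=1)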
df


Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,1,evening,spam
1,97,0,0,night,ham
2,112,1,1,afternoon,spam
3,130,1,0,morning,spam
4,95,1,0,evening,spam
...,...,...,...,...,...
295,86,0,1,night,spam
296,117,0,0,morning,ham
297,106,0,0,night,ham
298,116,1,1,morning,spam
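
The cell above labels the new emails with the model's own predictions, so there are no true labels to compare against. One way to measure performance is to hold out emails whose true labels are known and compare them with the predictions. The sketch below is a simplified, self-contained illustration under stated assumptions: it uses only the two binary features, the same labeling rule as the simulated dataset, and the hypothetical helpers `simulate_emails` and `predict_label`, which are not part of the lab code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only


def simulate_emails(n):
    """Simulate emails using the same labeling rule as the lab's dataset."""
    df = pd.DataFrame({
        'contains_free': rng.choice([0, 1], size=n, p=[0.7, 0.3]),
        'contains_winner': rng.choice([0, 1], size=n, p=[0.5, 0.5]),
    })
    p_spam = np.minimum(1.0, 0.7 * df['contains_free'] + 0.7 * df['contains_winner'] + 0.1)
    df['label'] = np.where(rng.random(n) < p_spam, 'spam', 'ham')
    return df


def predict_label(train, row, features):
    """Naive Bayes prediction from relative frequencies in `train`."""
    scores = {}
    for label in ('spam', 'ham'):
        subset = train[train['label'] == label]
        score = len(subset) / len(train)  # prior P(label)
        for feature in features:
            # likelihood P(feature = value | label)
            score *= (subset[feature] == row[feature]).mean()
        scores[label] = score
    return max(scores, key=scores.get)


features = ['contains_free', 'contains_winner']
train = simulate_emails(1000)
test = simulate_emails(300)  # held-out emails with known true labels

test['predicted'] = test.apply(lambda row: predict_label(train, row, features), axis=1)

# Fraction of held-out emails whose predicted label matches the true label.
accuracy = (test['predicted'] == test['label']).mean()
confusion = pd.crosstab(test['label'], test['predicted'],
                        rownames=['actual'], colnames=['predicted'])
print(f"Accuracy: {accuracy:.3f}")
print(confusion)
```

The confusion table separates the two kinds of error: ham flagged as spam and spam that slips through as ham.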


### Task 6: Discussion
1. Discuss how Bayesian updating improves the accuracy of the classifier.
2. What are the limitations of the model built in this lab?


In [None]:

1. **How Bayesian updating improves the accuracy of the classifier:**
    Bayesian updating treats the current probability estimates as a prior and revises them whenever new evidence arrives, so the posterior from one round becomes the prior for the next. As more labeled emails are observed, the estimated probabilities reflect the actual data more closely and predictions become more accurate. For our email classifier, this means the spam and ham probabilities can be revised based on the features observed in newly processed emails.


2. **Limitations of the model built in this lab:**
    - **Simplistic Assumptions:** The model assumes that features are independent of one another given the class label, which rarely holds for real email. This conditional independence assumption is what makes the Naive Bayes classifier "naive".
    - **Feature Engineering:** The model's accuracy depends heavily on the quality and relevance of its features. Accuracy suffers when important features are missing or irrelevant ones are included.
    - **Static Model:** The model does not adapt to new data distributions on its own; it must be manually recomputed with fresh examples.
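
As a small illustration of the updating idea discussed in the first answer, the sketch below keeps a Beta distribution over P(spam) and updates it as batches of labeled emails arrive. The uniform Beta(1, 1) starting prior and the batch counts are hypothetical, chosen only to show the mechanics; the classifier built in this lab does not do this automatically.

```python
# Beta(alpha, beta) prior over P(spam); Beta(1, 1) is uniform (an assumed starting point).
alpha, beta = 1.0, 1.0

# Hypothetical batches of newly labeled emails: (number of spam, number of ham).
batches = [(3, 7), (6, 4), (45, 55)]

for n_spam, n_ham in batches:
    # Conjugate update: the posterior after a batch becomes the prior for the next.
    alpha += n_spam
    beta += n_ham
    posterior_mean = alpha / (alpha + beta)
    print(f"Saw {n_spam} spam / {n_ham} ham -> estimated P(spam) = {posterior_mean:.3f}")
```

Small early batches move the estimate a lot, while later batches shift it less because more evidence has already accumulated.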

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.