# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [3]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [4]:
# Simulate 1000 coin flips; each flip is recorded as the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [5]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

heads    505
tails    495
Name: flip_result, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [6]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.505
Probability of Tails: 0.495
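
The estimates above come from a single run of 1000 flips, so they differ slightly from the true value of 0.5. A minimal sketch of how the empirical probability of heads settles toward 0.5 as the number of flips grows. The fair coin and the fixed seed are assumptions made here for illustration:

```python
import numpy as np

# Assumption: a fair coin and an arbitrary seed chosen for reproducibility.
rng = np.random.default_rng(0)

flips = rng.choice(['heads', 'tails'], size=100_000)
is_heads = (flips == 'heads')

# Running estimate of P(heads) after each flip
running_p_heads = np.cumsum(is_heads) / np.arange(1, len(flips) + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"After {n:>6} flips: P(heads) ~ {running_p_heads[n - 1]:.4f}")
```

Early estimates can be far from 0.5. The later ones should sit much closer, which is the law of large numbers at work.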


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [8]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)
data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)


In [26]:
# Load the dataset. If you use your own dataset, replace the file name with its path.
# Displaying `df_emails` shows the first and last rows; `df_emails.head()` shows only the first five.
df_emails = pd.read_csv('simulated_email_dataset.csv')
df_emails

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,0,morning,ham
1,97,0,0,morning,spam
2,112,0,0,morning,spam
3,130,1,0,afternoon,ham
4,95,0,1,afternoon,spam
...,...,...,...,...,...
995,94,0,1,night,ham
996,135,0,0,night,spam
997,112,0,0,evening,spam
998,88,0,1,afternoon,spam


### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [27]:
# Your code for Data Preprocessing goes here
from sklearn.preprocessing import StandardScaler, LabelEncoder

scaler = StandardScaler()
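label_encoder = LabelEncoder()  # refit below for each categorical column

# Standardize email_length to zero mean and unit variance
df_emails['email_length'] = scaler.fit_transform(df_emails[['email_length']])
# Encode time_of_day as integers (sorted alphabetically: afternoon=0, evening=1, morning=2, night=3)
df_emails['time_of_day'] = label_encoder.fit_transform(df_emails['time_of_day'])

# Encode the label: ham=0, spam=1
df_emails['label'] = label_encoder.fit_transform(df_emails['label'])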

df_emails.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,0.465685,0,0,2,0
1,-0.146723,0,0,2,1
2,0.618787,0,0,2,1
3,1.537399,1,0,0,0
4,-0.248791,0,1,0,1
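
Because `LabelEncoder` assigns integers in sorted order of the class names, it helps to check which integer stands for which category before relying on `label == 1` meaning spam. A small sketch, using one encoder per column (a variation on the cell above, which reuses a single encoder) and toy values chosen for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only; they mirror the categories in the lab's dataset.
toy = pd.DataFrame({
    'time_of_day': ['morning', 'afternoon', 'evening', 'night'],
    'label': ['ham', 'spam', 'spam', 'ham'],
})

# One encoder per column so each mapping stays available after fitting
encoders = {col: LabelEncoder().fit(toy[col]) for col in ['time_of_day', 'label']}

# Build {category name: integer code} for each column
mappings = {
    col: {str(cls): code for code, cls in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings['label'])        # {'ham': 0, 'spam': 1}
print(mappings['time_of_day'])  # {'afternoon': 0, 'evening': 1, 'morning': 2, 'night': 3}
```

Keeping a separate encoder per column also makes it possible to call `inverse_transform` later to recover the original category names.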


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [29]:
# Your code for calculating the probability of spam and ham emails in the dataset goes here
spam_emails = df_emails[df_emails['label'] == 1]
ham_emails = df_emails[df_emails['label'] == 0]

prob_spam = len(spam_emails) / len(df_emails)
prob_ham = len(ham_emails) / len(df_emails)

print(f"Probability of Spam emails: {prob_spam:.2%}")
print(f"Probability of Ham emails: {prob_ham:.2%}")

Probability of Spam emails: 40.90%
Probability of Ham emails: 59.10%
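
The same class priors can be read off in one step with `value_counts(normalize=True)`. A short sketch on a hypothetical six-row label column (0 = ham, 1 = spam, matching the encoding from Task 2):

```python
import pandas as pd

# Hypothetical labels for illustration: 0 = ham, 1 = spam
labels = pd.Series([0, 1, 0, 0, 1, 0], name='label')

# Relative frequency of each class, i.e. the empirical prior P(label)
priors = labels.value_counts(normalize=True)

print(f"P(ham)  = {priors.loc[0]:.2%}")
print(f"P(spam) = {priors.loc[1]:.2%}")
```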


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.
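
If the features are treated as conditionally independent given the class (the naive Bayes assumption), Bayes' theorem for an email with feature values $x_1, \dots, x_n$ can be written as follows. The symbols $c$ and $x_i$ are notation introduced here for illustration, not part of the lab handout:

$$
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)}
  \;\propto\; P(c)\,\prod_{i=1}^{n} P(x_i \mid c),
\qquad c \in \{\text{spam}, \text{ham}\}
$$

The denominator is the same for both classes, so comparing the two numerators is enough to pick the more probable class.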

In [69]:
# Write a function using Bayes' Theorem for classification

# Split by class (after encoding: 1 = spam, 0 = ham)
spam_emails = df_emails[df_emails['label'] == 1]
ham_emails = df_emails[df_emails['label'] == 0]

# Class priors P(spam) and P(ham)
prob_spam = len(spam_emails) / len(df_emails)
prob_ham = len(ham_emails) / len(df_emails)


def calculate_conditional_probability(feature, value, label):
    """Estimate P(feature == value | label) as a relative frequency in df_emails."""
    subset = df_emails[df_emails['label'] == label]
    total_label = len(subset)
    count_feature_value = len(subset[subset[feature] == value])
    return count_feature_value / total_label


def classify_email(email):
    """Return 1 (spam) or 0 (ham) by comparing unnormalized naive Bayes posteriors."""
    prob_spam_given_email = prob_spam
    prob_ham_given_email = prob_ham

    for feature in email.index[:-1]:  # exclude the 'label' column
        value = email[feature]

        # Multiply in P(feature value | class) for each class
        prob_spam_given_email *= calculate_conditional_probability(feature, value, 1)
        prob_ham_given_email *= calculate_conditional_probability(feature, value, 0)

    return int(prob_spam_given_email > prob_ham_given_email)

first_email = df_emails.iloc[0]
classification_result = classify_email(first_email)
print("The email is classified as", 'spam' if classification_result == 1 else 'ham')

The email is classified as spam
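
One practical issue with multiplying raw frequencies: if a feature value never occurs with a class, `calculate_conditional_probability` returns 0 and the whole product for that class collapses to 0. A common remedy is Laplace (add-one) smoothing. The sketch below is a hedged variation, not the lab's required solution. The function name `smoothed_conditional_probability`, the `alpha` parameter, and the toy data are assumptions made for illustration:

```python
import pandas as pd

def smoothed_conditional_probability(df, feature, value, label, alpha=1.0):
    """Estimate P(feature == value | label) with add-alpha smoothing."""
    subset = df[df['label'] == label]
    count = (subset[feature] == value).sum()
    n_values = df[feature].nunique()  # number of distinct values the feature takes
    return (count + alpha) / (len(subset) + alpha * n_values)

# Toy data: 'contains_winner' is never 1 among ham (label 0) emails
toy = pd.DataFrame({
    'contains_winner': [0, 0, 0, 1, 1, 0],
    'label':           [0, 0, 0, 1, 1, 1],
})

is_ham = toy['label'] == 0
unsmoothed = (is_ham & (toy['contains_winner'] == 1)).sum() / is_ham.sum()
smoothed = smoothed_conditional_probability(toy, 'contains_winner', 1, label=0)

print(unsmoothed)  # 0.0
print(smoothed)    # 0.2
```

With smoothing, an unseen combination gets a small nonzero probability instead of vetoing the class outright.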


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [68]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df_emails, test_size=0.2, random_state=42)

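# Test the model on the testing set.
# Note: classify_email estimates its probabilities from all of df_emails, so the test rows
# also contribute to those estimates; train_data is not used for fitting here.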
correct_predictions = 0

for index, email in test_data.iterrows():
    predicted_label = classify_email(email)
    actual_label = email['label']

    if predicted_label == actual_label:
        correct_predictions += 1

accuracy = correct_predictions / len(test_data)
print(f"Accuracy for test set: {accuracy:.2%}")

Accuracy for test set: 67.00%
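
An accuracy figure is easiest to interpret when the probabilities are estimated from the training rows only, so that test emails play no part in fitting. A minimal sketch of that pattern, assuming a small simulated dataset with two binary features and hypothetical helper names (`fit_naive_bayes`, `predict`). It does not reproduce the lab's numbers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
n = 1000
label = rng.choice([0, 1], size=n, p=[0.6, 0.4])  # 0 = ham, 1 = spam

# Assumption: spam is more likely than ham to contain each keyword
df = pd.DataFrame({
    'contains_free': rng.random(n) < np.where(label == 1, 0.7, 0.2),
    'contains_winner': rng.random(n) < np.where(label == 1, 0.5, 0.1),
    'label': label,
}).astype(int)

# Shuffle once, then hold out the last 20% of rows for testing
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
split = int(0.8 * n)
train, test = shuffled.iloc[:split], shuffled.iloc[split:]
features = ['contains_free', 'contains_winner']

def fit_naive_bayes(train_df):
    """Estimate priors and add-one-smoothed P(feature=1 | class) from training rows only."""
    model = {}
    for c, group in train_df.groupby('label'):
        prior = len(group) / len(train_df)
        p_one = (group[features].sum() + 1) / (len(group) + 2)
        model[c] = (prior, p_one)
    return model

def predict(model, row):
    """Return the class with the larger unnormalized posterior."""
    scores = {}
    for c, (prior, p_one) in model.items():
        likelihood = np.where(row[features] == 1, p_one, 1 - p_one).prod()
        scores[c] = prior * likelihood
    return max(scores, key=scores.get)

model = fit_naive_bayes(train)
predictions = test.apply(lambda row: predict(model, row), axis=1)
accuracy = (predictions == test['label']).mean()
print(f"Held-out accuracy: {accuracy:.2%}")
```

Passing the training frame into `fit_naive_bayes` explicitly, rather than reading a global DataFrame, makes it harder to leak test rows into the estimates by accident.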


### Task 6: Discussion
1. Which probability distribution would you choose for an email classifier? Explain your answer.
2. Discuss how Bayesian updating improves the accuracy of the classifier.
3. What are the limitations of the model built in this lab?


In [None]:
- A Bernoulli distribution is a good fit for an email classifier, because each feature can be treated as a binary indicator of whether a particular "spam" word is present in or absent from the email.

- Bayesian updating improves accuracy by revising the belief about an email's class (spam or ham) each time new word evidence is observed, using Bayes' theorem.

- The main limitation is the naive assumption that the features are independent, which does not hold in real-world scenarios. Other limitations:
  - Vocabulary size: the model may struggle with large vocabularies.
  - Static model: it does not adapt well to changes in language.
  - Imbalanced data: predictions are biased if the classes are imbalanced.
  - Lack of context: it does not capture word relationships or contextual meaning.
  - Simulated data: the dataset is artificial, which makes the model unsuitable for real use.
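
To make the idea of Bayesian updating in the second answer concrete, here is a small sketch that starts from a prior P(spam) and revises it after each observed feature. The prior and the likelihood values are made-up numbers for illustration, not estimates from the lab's dataset, and the features are assumed to be conditionally independent given the class:

```python
def bayes_update(p_spam, p_evidence_given_spam, p_evidence_given_ham):
    """Return P(spam | evidence) from the current belief P(spam)."""
    numerator = p_evidence_given_spam * p_spam
    denominator = numerator + p_evidence_given_ham * (1 - p_spam)
    return numerator / denominator

belief = 0.4  # prior P(spam), hypothetical

# (description, P(evidence | spam), P(evidence | ham)), all hypothetical values
evidence = [
    ("contains 'free'", 0.7, 0.2),
    ("contains 'winner'", 0.5, 0.1),
]

for description, p_given_spam, p_given_ham in evidence:
    # The posterior after one observation becomes the prior for the next
    belief = bayes_update(belief, p_given_spam, p_given_ham)
    print(f"After observing {description}: P(spam) = {belief:.3f}")
```

Each observation moves the belief from 0.4 to 0.7 and then to roughly 0.921, showing how evidence accumulates.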

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.