# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [2]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [4]:
# Simulate 1000 fair coin flips; each result is the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [6]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
tails    520
heads    480
Name: count, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [8]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.48
Probability of Tails: 0.52
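
The estimates above differ slightly from 0.5 because they come from a finite sample. The sketch below illustrates how the empirical probability tends to settle near 0.5 as the number of flips grows. It uses only the standard library; the function name `empirical_heads_probability` and the seed value are assumptions made for this illustration, not part of the lab code.

```python
import random


def empirical_heads_probability(n_flips, seed=0):
    """Estimate P(heads) from n_flips simulated fair coin flips."""
    rng = random.Random(seed)  # seeded so that reruns give the same flips
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# Compare estimates for increasingly large samples
estimates = {n: empirical_heads_probability(n) for n in (10, 100, 1_000, 100_000)}
for n, p in estimates.items():
    print(f"{n:>7} flips: estimated P(heads) = {p:.4f}")
```

With a fair coin, the estimate from 100,000 flips should land much closer to 0.5 than the estimate from 10 flips, although any single run can vary.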


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [41]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)
data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df_emails = pd.DataFrame(data)

# Saving the dataset
df_emails.to_csv('simulated_email_dataset.csv', index=False)
df_emails.head()


| | email_length | contains_free | contains_winner | time_of_day | label |
|---|---|---|---|---|---|
| 0 | 109 | 0 | 0 | morning | ham |
| 1 | 97 | 0 | 0 | morning | spam |
| 2 | 112 | 0 | 0 | morning | spam |
| 3 | 130 | 1 | 0 | afternoon | ham |
| 4 | 95 | 0 | 1 | afternoon | spam |


In [14]:
# Load the dataset (replace 'path_to_dataset.csv' with the actual file path). Uncomment the lines below to use your own file, then check what `df_emails.head()` shows.
# df_emails = pd.read_csv('path_to_dataset.csv')
# df_emails.head()

### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [26]:
from sklearn.preprocessing import LabelEncoder

# 1. Scale 'email_length' by its maximum so values fall in (0, 1] (assumes all lengths are positive)
df_emails['email_length_normalized'] = df_emails['email_length'] / df_emails['email_length'].max()

# 2. Convert 'time_of_day' to numbers (encoding categories)
time_of_day_mapping = {'morning': 0, 'afternoon': 1, 'evening': 2, 'night': 3}
df_emails['time_of_day_encoded'] = df_emails['time_of_day'].map(time_of_day_mapping)

# 3. Convert 'label' (spam or ham) to numbers: spam = 1, ham = 0
label_encoder = LabelEncoder()
df_emails['label_encoded'] = label_encoder.fit_transform(df_emails['label'])

# Display the first few rows of the preprocessed data
df_emails[['email_length_normalized', 'time_of_day_encoded', 'label_encoded']].head()


| | email_length_normalized | time_of_day_encoded | label_encoded |
|---|---|---|---|
| 0 | 0.615819 | 0 | 0 |
| 1 | 0.548023 | 0 | 1 |
| 2 | 0.632768 | 0 | 1 |
| 3 | 0.734463 | 1 | 0 |
| 4 | 0.536723 | 1 | 1 |
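
Dividing by the maximum keeps values at or below 1, but the smallest value does not map to 0. If a full 0-to-1 range is wanted, min-max scaling is one common alternative. The sketch below is a plain-Python illustration on the five `email_length` values shown in Task 1; the helper name `min_max_scale` is an assumption for this example and is not part of the lab code.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


lengths = [109, 97, 112, 130, 95]  # first five email_length values from Task 1
scaled = min_max_scale(lengths)
print([round(s, 3) for s in scaled])
```

Here the shortest email (95) maps to 0 and the longest (130) maps to 1, with the others in between.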


### Task 3: Probability Calculation
Calculate the prior probabilities of spam and ham, that is, the fraction of emails in the dataset that carry each label.

In [28]:
# Total number of emails
total_emails = len(df_emails)

# Number of spam and ham emails
num_spam = len(df_emails[df_emails['label'] == 'spam'])
num_ham = len(df_emails[df_emails['label'] == 'ham'])

# Probability of spam and ham emails
prob_spam = num_spam / total_emails
prob_ham = num_ham / total_emails

# Print the probabilities
print(f"Probability of Spam emails: {prob_spam:.2f}")
print(f"Probability of Ham emails: {prob_ham:.2f}")


Probability of Spam emails: 0.41
Probability of Ham emails: 0.59
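
The same counting idea can be written as a small reusable helper. The sketch below is a standard-library illustration applied to the labels of the five rows shown in Task 1; the function name `class_priors` is hypothetical.

```python
from collections import Counter


def class_priors(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


labels = ['ham', 'spam', 'spam', 'ham', 'spam']  # labels of the five rows shown in Task 1
priors = class_priors(labels)
print(priors)
```

Because every email has exactly one label, the priors always sum to 1.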


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.
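
For reference, the rule applied in this task can be written as follows. This is the usual naive Bayes form, which assumes the words $w_1, \dots, w_n$ of an email are conditionally independent given the class $c$; the symbols $c$, $w_i$, and $\hat{c}$ are notation introduced here for illustration.

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\,\prod_{i=1}^{n} P(w_i \mid c)}{P(w_1, \dots, w_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
  \qquad c \in \{\text{spam}, \text{ham}\}

\hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator is the same for both classes, so comparing the unnormalized products is enough to pick the predicted label.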

In [33]:
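import pandas as pd

# Small toy training set of labelled emails
data = {'email': ['Buy now', 'Limited offer', 'Hello friend', 'Meeting at 10', 'Free gift'],
        'label': ['spam', 'spam', 'ham', 'ham', 'spam']}
df = pd.DataFrame(data)

# Class priors: fraction of training emails with each label
p_spam = len(df[df['label'] == 'spam']) / len(df)
p_ham = len(df[df['label'] == 'ham']) / len(df)

# All words seen in each class (case-sensitive, split on whitespace)
spam_words = ' '.join(df[df['label'] == 'spam']['email']).split()
ham_words = ' '.join(df[df['label'] == 'ham']['email']).split()


def likelihood(word, label):
    """Smoothed estimate of P(word | label)."""
    # Add-one smoothing so that unseen words never get probability zero.
    # Note: the +2 in the denominator is a simplification; standard Laplace
    # smoothing for word counts adds the vocabulary size instead.
    if label == 'spam':
        return (spam_words.count(word) + 1) / (len(spam_words) + 2)
    else:
        return (ham_words.count(word) + 1) / (len(ham_words) + 2)


def classify(email):
    """Return 'spam' or 'ham', whichever has the larger unnormalized posterior."""
    words = email.split()
    # Start from the priors, then multiply in each word's likelihood
    p_spam_given_email = p_spam
    p_ham_given_email = p_ham
    for word in words:
        p_spam_given_email *= likelihood(word, 'spam')
        p_ham_given_email *= likelihood(word, 'ham')

    return 'spam' if p_spam_given_email > p_ham_given_email else 'ham'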

# Example usage
new_email = 'Buy now and get free gift'
classification = classify(new_email)
print(f'The email "{new_email}" is classified as: {classification}')


The email "Buy now and get free gift" is classified as: spam
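
Multiplying many small probabilities can underflow for long emails, so naive Bayes scores are often computed as sums of logarithms. The sketch below is one possible variant on the same five training emails, not the lab's implementation. It makes two assumptions that differ from the code above: words are lowercased so that 'Free' and 'free' match, and add-one smoothing uses the vocabulary size in the denominator. The names `log_score` and `classify_log` are hypothetical.

```python
import math
from collections import Counter

# Same toy training data as in Task 4
train = [
    ("Buy now", "spam"),
    ("Limited offer", "spam"),
    ("Hello friend", "ham"),
    ("Meeting at 10", "ham"),
    ("Free gift", "spam"),
]

# Per-class word counts (lowercased: an assumption added for this sketch)
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.lower().split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])


def log_score(text, label):
    """Unnormalized log-posterior: log P(label) + sum of log P(word | label)."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.lower().split():
        # Add-one smoothing with the vocabulary size in the denominator
        smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(smoothed)
    return score


def classify_log(text):
    """Pick the label with the larger log score."""
    return max(("spam", "ham"), key=lambda label: log_score(text, label))


print(classify_log("Buy now and get free gift"))  # prints: spam
```

Because the logarithm is monotonic, comparing log scores selects the same label as comparing the corresponding products, while avoiding very small floating-point numbers.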


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [39]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# New test dataset of emails the classifier has not seen
test_data = {
    'email': ['Win a free iPhone',
              'Let’s catch up next week',
              'Exclusive deal just for you',
              'Your invoice is ready',
              'Congratulations! You have won a prize'],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam']}
test_df = pd.DataFrame(test_data)

# Classify each test email with the classify() function from Task 4
test_df['predicted'] = test_df['email'].apply(classify)
print("Predictions:\n", test_df[['email', 'predicted']])

# Score the predictions, treating 'spam' as the positive class;
# zero_division=0 reports 0 when no email is predicted as spam
accuracy = accuracy_score(test_df['label'], test_df['predicted'])
precision = precision_score(test_df['label'], test_df['predicted'], pos_label='spam', average='binary', zero_division=0)
recall = recall_score(test_df['label'], test_df['predicted'], pos_label='spam', average='binary', zero_division=0)
f1 = f1_score(test_df['label'], test_df['predicted'], pos_label='spam', average='binary', zero_division=0)
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')


Predictions:
                                    email predicted
0                      Win a free iPhone       ham
1               Let’s catch up next week       ham
2            Exclusive deal just for you       ham
3                  Your invoice is ready       ham
4  Congratulations! You have won a prize       ham
Accuracy: 0.40
Precision: 0.00
Recall: 0.00
F1 Score: 0.00
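
To see where these numbers come from, the metrics can be recomputed by hand from the true and predicted labels. The sketch below is a plain-Python illustration, not a replacement for the scikit-learn functions; the helper name `binary_metrics` is hypothetical, and it returns 0 when a denominator is zero to mirror `zero_division=0`.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


y_true = ["spam", "ham", "spam", "ham", "spam"]  # labels of the test emails
y_pred = ["ham"] * 5                             # every email was predicted as ham
print(binary_metrics(y_true, y_pred))  # (0.4, 0.0, 0.0, 0.0)
```

The two ham emails are classified correctly, which gives the accuracy of 0.40, but no spam email is caught, so precision, recall, and F1 for the spam class are all 0.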


### Task 6: Discussion
1. Which probability distribution would you choose for an email classifier? Explain your answer.

   Answer: The multinomial distribution. It models word counts directly, which makes it a natural fit for text classification: it captures how often each word appears in spam versus ham emails.

2. Discuss how Bayesian updating improves the accuracy of the classifier.

   Answer: Bayesian updating lets the model revise its probability estimates as it receives new data. This can help in three ways:
   - Learning over time: the model adjusts its estimates as new labelled emails arrive, which can improve accuracy.
   - Using existing knowledge: prior information is built into the estimate, which helps when data is limited.
   - Managing uncertainty: each update refines the estimates, so classifications become more reliable as more data is observed.

3. What are the limitations of the model built in this lab?

   Answer:
   - Oversimplified assumptions: it assumes that features (words) are independent given the class, which is often not the case.
   - Limited data: its performance depends heavily on the size and variety of the training set. The classifier in Task 4 was trained on only five emails, none of the Task 5 test words appear in its training vocabulary, and every test email was classified as ham.
   - Imbalance issues: if there are far more ham emails than spam, the model may be biased towards predicting ham.
   - Context ignorance: it does not consider relationships between words, which can result in misclassifications.
   - Static approach: the model must be retrained to adapt to new data instead of updating automatically.
   - Lack of advanced techniques: it does not use more sophisticated natural language processing methods that could improve results.
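
The idea of Bayesian updating from question 2 can be illustrated with a small Beta-Binomial sketch for the spam rate. This is one simple way to model it, not something the lab's classifier does. The uniform Beta(1, 1) prior, the helper names, and the batch counts after the first are all assumptions made up for illustration; only the first batch (3 spam, 2 ham) matches the toy training set from Task 4.

```python
def update_beta(alpha, beta, n_spam, n_ham):
    """Posterior Beta parameters after observing n_spam spam and n_ham ham emails."""
    return alpha + n_spam, beta + n_ham


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution: the current estimate of P(spam)."""
    return alpha / (alpha + beta)


alpha, beta = 1.0, 1.0  # uniform prior over the spam rate (an assumption)
batches = [(3, 2), (40, 60), (400, 600)]  # hypothetical (spam, ham) counts per batch

for n_spam, n_ham in batches:
    alpha, beta = update_beta(alpha, beta, n_spam, n_ham)
    print(f"after {n_spam + n_ham:>4} more emails: P(spam) estimate = {beta_mean(alpha, beta):.3f}")
```

Each batch shifts the estimate, and as more emails are observed the prior has less influence on the result.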


## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.