# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [5]:
import pandas as pd

# Load the dataset (expects 'text' and 'spam' columns)
emails = pd.read_csv('emails.csv')

In [6]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1


In [7]:
# Analyse the data and remove or modify rows with missing or invalid values
# Check for missing values in the entire DataFrame
missing_values = emails.isnull().sum()
print(missing_values)

# Check for missing values in specific columns (e.g., 'text', 'spam')
missing_text = emails['text'].isnull().sum()
missing_spam = emails['spam'].isnull().sum()
print(f"Missing values in 'text': {missing_text}")
print(f"Missing values in 'spam': {missing_spam}")

text    0
spam    0
dtype: int64
Missing values in 'text': 0
Missing values in 'spam': 0
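
Missing values are only one kind of bad row. Below is a minimal sketch of a few further checks, run on a small made-up DataFrame (the `toy` rows are illustrative assumptions, not taken from `emails.csv`): labels outside {0, 1}, blank text, and exact duplicate rows.

```python
import pandas as pd

# Hypothetical toy data standing in for emails.csv (illustrative only)
toy = pd.DataFrame({
    "text": ["Subject: hello", "   ", "Subject: hello", "Subject: offer"],
    "spam": [0, 1, 0, 2],
})

# Rows whose label is not 0 or 1
bad_label = ~toy["spam"].isin([0, 1])

# Rows whose text is empty or whitespace only
blank_text = toy["text"].str.strip() == ""

# Exact duplicate rows (every occurrence after the first)
duplicate = toy.duplicated()

print(f"Invalid labels: {bad_label.sum()}")
print(f"Blank texts: {blank_text.sum()}")
print(f"Duplicate rows: {duplicate.sum()}")

# Keep only rows that pass every check
clean = toy[~(bad_label | blank_text | duplicate)].reset_index(drop=True)
print(clean)
```

Whether duplicates should be dropped depends on the dataset; the same spam message sent many times may be worth keeping.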


## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [8]:
def process_email(text):
    """
    Convert email text to a list of unique, lowercase words

    Parameters:
        text (str): The email text

    Returns:
        list: List of unique words in the email
    """
    # HINT: Use text.lower() for lowercase conversion
    # HINT: Use split() to convert text to words
    # HINT: Use set() to remove duplicates

    # 1. Convert text to lowercase
    text = text.lower()
    # 2. Split on whitespace into words
    words = text.split()
    # 3. Remove duplicates (the resulting order is arbitrary)
    unique_words = list(set(words))
    return unique_words

In [9]:
# Apply preprocessing to all emails
emails['words'] = emails['text'].apply(process_email)


In [11]:
# Test your preprocessing by testing on the first email
# Get the text of the first email
first_email_text = emails['text'][0]

# Apply the preprocessing function
processed_words = process_email(first_email_text)

# Print the processed words
print(processed_words) # To see the output, run the code.


['clear', 'we', 'full', 'hard', 'without', 'hotat', 'distinctive', 'will', 'unlimited', 'ordered', 'practicable', 'hand', 'world', 'identity', 'portfolio', 'within', ',', '100', 'affordability', 'satisfaction', 'logos', 'a', ';', 'ieader', 'subject:', 'no', 'in', 'even', 'lt', 'fees', 'promptness', 'break', 'original', '.', 'provide', 'drafts', ':', 'this', 'irresistible', 'to', 'system', 'outstanding', 'really', 'it', 'done', 'love', 'change', 'efforts', 't', 'of', 'isguite', 'isoverwhelminq', 'easy', 'content', 'use', 'market', 'naturally', 'for', 'catchy', 'marketing', 'havinq', 'all', 'website', 'gaps', 'effective', 'logo', 'days', 'our', 'changes', 'corporate', 'statlonery', 'company', 'list', '_', 'at', 'business', 'promise', 'reflect', 'shouldn', 'your', 'stylish', 'you', 'not', 'recollect', 'task', 'letsyou', '-', 'see', 'here', 'that', 'made', 'do', 'collaboration', 'three', "'", 'formats', 'amount', 'provided', 'make', 'ciear', 'budget', 'nowadays', 'are', 'iogo', 'look', 'or
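
The output above includes punctuation tokens such as `','`, `';'` and `'.'`, plus the fused token `'subject:'`, because `split()` only splits on whitespace. One possible refinement is sketched below, assuming that only runs of letters and digits should count as words. `process_email_regex` is a hypothetical name, not part of the assignment template.

```python
import re

def process_email_regex(text):
    """Return the sorted unique lowercase alphanumeric tokens in text."""
    # [a-z0-9]+ matches runs of letters and digits, so punctuation is dropped
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # sorted() gives a deterministic order, unlike list(set(...))
    return sorted(set(tokens))

print(process_email_regex("Subject: naturally irresistible , your LOGO ; your identity"))
```

The trade-off is that contractions and hyphenated words are split into pieces, which may or may not be desirable for this dataset.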

## Step 3: Calculate Prior Probabilities
Let's calculate the prior probability of an email being spam, that is, the fraction of emails in the dataset that are labelled spam.

In [12]:
# 1. Total number of emails
num_emails = len(emails)

# 2. Number of spam emails
num_spam = sum(emails['spam'])

# 3. Probability of spam
spam_probability = num_spam / num_emails

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388
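
The ham prior is the complement of the spam prior. A quick check using the counts printed above (the numbers are copied from that output; the variable names are reused for illustration):

```python
num_emails = 5728   # from the output above
num_spam = 1368     # from the output above

num_ham = num_emails - num_spam
p_spam = num_spam / num_emails
p_ham = num_ham / num_emails

print(f"Number of ham emails: {num_ham}")
print(f"P(spam) = {p_spam:.4f}")
print(f"P(ham)  = {p_ham:.4f}")
```

These two priors are the starting scores that the prediction function in Step 5 updates word by word.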


## Step 4: Training the Model
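Now we'll build our Naive Bayes model by counting, for each word, how many spam emails and how many ham emails contain it. Because each email has already been reduced to its unique words, a word is counted at most once per email.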

In [13]:
def train_naive_bayes(emails_data):
    """
    Train a Naive Bayes model on email data

    Parameters:
        emails_data (DataFrame): DataFrame with 'words' and 'spam' columns

    Returns:
        dict: Dictionary with word frequencies in spam and ham emails
    """
    # TODO: Create a dictionary to store word frequencies
    # For each word, store counts of its occurrence in spam and ham emails
    model = {}

    for index, row in emails_data.iterrows():
        words = row['words']
        is_spam = row['spam']

        # Update word frequencies in the model
        for word in words:
            if word not in model:
                model[word] = {'spam': 1, 'ham': 1}  # Laplace smoothing

            if is_spam:
                model[word]['spam'] += 1
            else:
                model[word]['ham'] += 1
    # HINT: Initialize counts with 1 (Laplace smoothing)
    # HINT: Structure: model[word] = {'spam': count, 'ham': count}

    return model

In [14]:
model = train_naive_bayes(emails)
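
Looping with `iterrows()` is easy to read but can be slow on large DataFrames. As an alternative, here is a sketch that builds the same kind of dictionary with pandas operations on a made-up toy DataFrame (the data and the `toy_model` name are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data: each row already holds its unique words
toy = pd.DataFrame({
    "words": [["win", "money"], ["meeting", "money"], ["win"]],
    "spam": [1, 0, 1],
})

# One row per (email, word) pair
pairs = toy.explode("words").reset_index(drop=True)

# Count emails per word and class, making sure both class columns exist
counts = (
    pairs.groupby(["words", "spam"]).size()
    .unstack(fill_value=0)
    .reindex(columns=[0, 1], fill_value=0)
)

# Start every count at 1, as in the loop-based version
counts = counts + 1

toy_model = {
    word: {"spam": int(row.loc[1]), "ham": int(row.loc[0])}
    for word, row in counts.iterrows()
}
print(toy_model)
```

A tiny dataset like this is also a convenient way to check the training logic by hand before running it on all emails.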

In [15]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'
if "lottery" in model:
    print("Word: lottery")
    print(f"Spam count: {model['lottery']['spam']}")
    print(f"Ham count: {model['lottery']['ham']}")
else:
    print("Word 'lottery' not found in the model.")



Word: lottery
Spam count: 9
Ham count: 1
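
To see how these counts turn into a prediction, here is a worked example for a hypothetical email containing only the word "lottery". It uses the counts printed above, the class sizes from Step 3, and the same likelihood formula as the prediction function in Step 5 (smoothed count divided by class size).

```python
num_spam, num_ham = 1368, 4360   # class sizes from Step 3
spam_count, ham_count = 9, 1     # smoothed counts for 'lottery' printed above

# Per-class likelihoods, computed the way the notebook computes them
p_word_given_spam = spam_count / num_spam
p_word_given_ham = ham_count / num_ham

# Priors
p_spam = num_spam / (num_spam + num_ham)
p_ham = num_ham / (num_spam + num_ham)

# Bayes' rule for an email containing only this word
posterior = (p_spam * p_word_given_spam) / (
    p_spam * p_word_given_spam + p_ham * p_word_given_ham
)
print(f"P(spam | 'lottery') = {posterior:.4f}")
```

Because the class sizes cancel for a single word, the result reduces to 9 / (9 + 1) = 0.9.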


## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.

In [16]:
import math

def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes

    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data

    Returns:
        float: Probability that the email is spam
    """
    words = process_email(email_text)

    # Calculate probability using the Naive Bayes formula
    log_prob_spam = math.log(num_spam / (num_spam + num_ham))
    log_prob_ham = math.log(num_ham / (num_spam + num_ham))

    for word in words:
        if word in model:
            log_prob_spam += math.log(model[word]['spam'] / num_spam)
            log_prob_ham += math.log(model[word]['ham'] / num_ham)

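    # Normalize in log space: subtracting the larger log-score before
    # exponentiating avoids both scores underflowing to 0.0 on long emails
    max_log = max(log_prob_spam, log_prob_ham)
    spam_score = math.exp(log_prob_spam - max_log)
    ham_score = math.exp(log_prob_ham - max_log)
    prob_spam = spam_score / (spam_score + ham_score)

    return prob_spam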
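
Written out, the function above computes the following. The notation is introduced here for illustration and is not from the assignment: $E$ is the set of unique words in the email, $V$ is the vocabulary stored in `model`, $c_{\text{spam}}(w)$ and $c_{\text{ham}}(w)$ are the stored counts (which start at 1), and $N_{\text{spam}}$, $N_{\text{ham}}$ are `num_spam` and `num_ham`.

```latex
\begin{aligned}
\ell_{\text{spam}} &= \log \frac{N_{\text{spam}}}{N_{\text{spam}} + N_{\text{ham}}}
  + \sum_{w \in E \cap V} \log \frac{c_{\text{spam}}(w)}{N_{\text{spam}}} \\
\ell_{\text{ham}} &= \log \frac{N_{\text{ham}}}{N_{\text{spam}} + N_{\text{ham}}}
  + \sum_{w \in E \cap V} \log \frac{c_{\text{ham}}(w)}{N_{\text{ham}}} \\
P(\text{spam} \mid E) &= \frac{e^{\ell_{\text{spam}}}}{e^{\ell_{\text{spam}}} + e^{\ell_{\text{ham}}}}
\end{aligned}
```

Words that are not in the vocabulary are skipped. Note that a common textbook form of add-one smoothing for a present/absent feature also adds 2 to the denominator, giving $c(w) / (N + 2)$; the function above divides by $N$ directly.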

In [18]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]
num_ham = num_emails - num_spam
for email_text in test_emails:
    probability = predict_naive_bayes(email_text, model, num_spam, num_ham)
    print(f"Email: {email_text}")
    print(f"Probability of spam: {probability:.4f}")

Email: lottery winner claim prize money
Probability of spam: 0.9999
Email: meeting tomorrow at 3pm
Probability of spam: 0.0013
Email: buy cheap watches online
Probability of spam: 0.9980


## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

1. The model gave sensible results on the three hand-written test emails: it assigned spam probabilities of 0.9999 and 0.9980 to the two spam-like messages and 0.0013 to the meeting message. These are spot checks, not a measured accuracy on held-out data.
2. I found cleaning and preparing the text data for analysis challenging, as well as implementing the prediction function with `math.log`.
3. Training on a larger and more diverse dataset could improve the model's ability to generalize to different types of emails.
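
One way to answer the first question more rigorously is to hold out part of the data and measure accuracy on it. The sketch below does this using only `pandas`, `numpy`, and the standard library. The toy data, the helper names (`tokenize`, `train`, `predict`), the 75/25 split, and the 0.5 decision threshold are all assumptions for illustration; in the notebook the `emails` DataFrame would take the place of `data`.

```python
import math

import numpy as np
import pandas as pd

def tokenize(text):
    # Same idea as process_email: unique lowercase whitespace-separated tokens
    return set(text.lower().split())

def train(df):
    # Counts start at 1, mirroring the notebook's smoothing
    model = {}
    for text, label in zip(df["text"], df["spam"]):
        for word in tokenize(text):
            entry = model.setdefault(word, {"spam": 1, "ham": 1})
            entry["spam" if label else "ham"] += 1
    return model

def predict(text, model, num_spam, num_ham):
    log_spam = math.log(num_spam / (num_spam + num_ham))
    log_ham = math.log(num_ham / (num_spam + num_ham))
    for word in tokenize(text):
        if word in model:
            log_spam += math.log(model[word]["spam"] / num_spam)
            log_ham += math.log(model[word]["ham"] / num_ham)
    # Normalize in log space to avoid underflow
    top = max(log_spam, log_ham)
    spam, ham = math.exp(log_spam - top), math.exp(log_ham - top)
    return spam / (spam + ham)

# Hypothetical toy data (illustrative only)
data = pd.DataFrame({
    "text": [
        "win money now", "cheap money offer", "win a prize now", "free prize offer",
        "meeting at noon", "project meeting notes", "lunch at noon", "notes for project",
    ],
    "spam": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Shuffle with a fixed seed, then hold out 25% for testing
rng = np.random.default_rng(0)
order = rng.permutation(len(data))
split = int(0.75 * len(data))
train_df = data.iloc[order[:split]]
test_df = data.iloc[order[split:]]

# Train only on the training rows, including the class sizes
model = train(train_df)
n_spam = int(train_df["spam"].sum())
n_ham = len(train_df) - n_spam

# Classify as spam when the predicted probability exceeds 0.5 (assumed threshold)
predicted = [int(predict(t, model, n_spam, n_ham) > 0.5) for t in test_df["text"]]
accuracy = float(np.mean(np.array(predicted) == test_df["spam"].to_numpy()))
print(f"Held-out accuracy: {accuracy:.2f}")
```

With real data, reporting precision and recall for the spam class alongside accuracy would give a fuller picture, since only about 24% of the emails are spam.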

### Notes (if any):