# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [2]:
# Load the data
# TODO: Load the 'emails.csv' file into a DataFrame called 'emails'
import pandas as pd

emails = pd.read_csv('emails.csv')

In [3]:
# Display the first few rows
emails.head()

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

|   | text | spam |
|---|------|------|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


In [4]:
# Analyse the data before removing or modifying any rows.
# emails.isna() (alias: emails.isnull()) gives a per-cell mask of missing values;
# info() summarises the non-null count and dtype of every column.
emails.info(), len(emails['text']), len(emails['spam'])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


(None, 5728, 5728)
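
The `info()` output reports no missing values for this dataset, so no rows need to be removed here. For data that is less clean, a check along the following lines may help. This is only a minimal sketch on a hypothetical toy DataFrame: `toy` and its rows are made up for illustration and are not taken from `emails.csv`.

```python
import pandas as pd

# Hypothetical toy data (not from emails.csv) with one missing text value
toy = pd.DataFrame({
    "text": ["Subject: hello", None, "Subject: win money", "Subject: lunch"],
    "spam": [0, 1, 1, 0],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Drop rows without text and rebuild a clean 0..n-1 index
clean = toy.dropna(subset=["text"]).reset_index(drop=True)

# Flag labels that are neither 0 nor 1
invalid_labels = ~clean["spam"].isin([0, 1])

print(missing_per_column)
print(f"rows kept: {len(clean)}, invalid labels: {int(invalid_labels.sum())}")
```

Resetting the index after `dropna` matters if later code looks rows up by position, since dropped rows would otherwise leave gaps in the index.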

## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [5]:
def process_email(text):
    """
    Convert email text to a set of unique, lowercase words
    
    Parameters:
        text (str): The email text
    
    Returns:
        set: Unique lowercase words in the email
    """
    # TODO: Implement the preprocessing function
    # 1. Convert text to lowercase
    # 2. Split into words
    # 3. Remove duplicates
    # HINT: Use text.lower() for lowercase conversion
    # HINT: Use split() to convert text to words
    # HINT: Use set() to remove duplicates
    return set(text.lower().split())

In [6]:
# Apply preprocessing to the first email and inspect the result
process_email(emails['text'][0])


{'%',
 "'",
 ',',
 '-',
 '.',
 '100',
 ':',
 ';',
 '_',
 'a',
 'affordability',
 'aim',
 'all',
 'amount',
 'and',
 'are',
 'at',
 'automaticaily',
 'be',
 'become',
 'benefits',
 'break',
 'budget',
 'business',
 'but',
 'catchy',
 'change',
 'changes',
 'ciear',
 'clear',
 'collaboration',
 'company',
 'content',
 'convenience',
 'corporate',
 'creativeness',
 'days',
 'distinctive',
 'do',
 'done',
 'drafts',
 'easier',
 'easy',
 'effective',
 'efforts',
 'even',
 'extra',
 'fees',
 'for',
 'formats',
 'full',
 'gaps',
 'good',
 'guaranteed',
 'hand',
 'hard',
 'have',
 'havinq',
 'here',
 'hotat',
 'identity',
 'ieader',
 'image',
 'in',
 'information',
 'interested',
 'iogo',
 'irresistible',
 'is',
 'isguite',
 'isoverwhelminq',
 'it',
 'its',
 'letsyou',
 'list',
 'logo',
 'logos',
 'look',
 'love',
 'lt',
 'made',
 'make',
 'management',
 'market',
 'marketing',
 'more',
 'much',
 'naturally',
 'no',
 'not',
 'nowadays',
 'of',
 'ordered',
 'organization',
 'original',
 'our',


In [7]:
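# Test the preprocessing on the first two raw lines of the CSV file.
# The first line is the header row; the second is the first email,
# still carrying its CSV quoting (for example the token '_",1').
with open("emails.csv", 'r') as f:
    print(process_email(f.readline()))
    print(process_email(f.readline()))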



{'"text","spam"'}
{'practicable', 'have', 'in', 'break', 'within', 'task', 'of', 'hotat', 'see', '.', 'suqgestions', 'naturally', 'to', '_', 'result', 'effective', 'nowadays', 'a', 'that', 'it', 'its', 'amount', 'havinq', 'market', 'hand', 'world', 'all', 'iogo', 'portfolio', 'be', 'good', 'much', 'isguite', 'use', '100', 'made', 'full', 'ciear', 'management', 'image', 'stylish', 'through', 'but', 'become', 'look', 'your', 'logos', 'provided', 'fees', 'gaps', 'catchy', 'structure', 'provide', 'website', 'make', 'love', 'this', 'organization', 't', 'shouldn', 'ordered', 'done', 'budget', 'days', 'list', 'satisfaction', 'drafts', 'statlonery', 'reflect', 'do', '%', 'changes', 'automaticaily', 'at', 'even', 'system', 'you', 'ieader', 'more', 'without', 'outstanding', 'affordability', ';', 'business', 'three', '_",1', 'letsyou', 'collaboration', 'unlimited', 'corporate', "'", 'easier', 'hard', 'change', 'specially', 'identity', 'logo', 'benefits', 'distinctive', 'stationery', ':', 'irresis
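
The sets above contain punctuation tokens such as `'.'` and `','`, and reading the raw CSV line leaves quoting debris in the tokens. One possible refinement, assuming that alphanumeric tokens are enough for this task, is a regex-based tokenizer. `tokenize_words` is a hypothetical helper name, not part of the assignment template, and the sample sentence is made up.

```python
import re

def tokenize_words(text):
    """Hypothetical variant of process_email that keeps only runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

sample = "Subject: 100 % guaranteed , don ' t miss it ."
print(sorted(tokenize_words(sample)))
# ['100', 'don', 'guaranteed', 'it', 'miss', 'subject', 't']
```

Whether dropping punctuation helps is an empirical question: some punctuation patterns may themselves be informative for spam detection.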

## Step 3: Calculate Prior Probabilities
Let's calculate the basic probability of an email being spam.

In [8]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam
# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

num_emails = len(emails)
num_spam = sum(emails['spam'])
spam_probability = num_spam / num_emails
print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388
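
The prediction step also needs the complementary prior, P(ham), and it is usually convenient to keep both priors as logarithms. The sketch below reuses the two counts printed above; the variable names `p_ham`, `log_p_spam`, and `log_p_ham` are illustrative choices, not part of the template.

```python
import math

# Counts taken from the output above
num_emails = 5728
num_spam = 1368
num_ham = num_emails - num_spam

# Prior probabilities of each class
p_spam = num_spam / num_emails
p_ham = num_ham / num_emails

# Log priors, convenient for summing with per-word log-likelihoods later
log_p_spam = math.log(p_spam)
log_p_ham = math.log(p_ham)

print(f"P(spam) = {p_spam:.4f}, P(ham) = {p_ham:.4f}")
# P(spam) = 0.2388, P(ham) = 0.7612
```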


## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting word occurrences in spam and ham emails.

In [9]:

def train_naive_bayes(emails_data):
    """
    Train a Naive Bayes model on email data
    
    Parameters:
        emails_data (DataFrame): DataFrame with 'text' and 'spam' columns
    
    Returns:
        dict: Maps each word to the number of spam and ham emails that contain it
    """
    # TODO: Create a dictionary to store word frequencies
    # For each word, store counts of its occurrence in spam and ham emails
    # HINT: Initialize counts with 1 (Laplace smoothing)
    # HINT: Structure: model[word] = {'spam': count, 'ham': count}
    #
    # This version stores raw counts (starting from 0), so add-one smoothing
    # has to be applied wherever the counts are turned into probabilities.
    model = {}
    for text, is_spam in zip(emails_data['text'], emails_data['spam']):
        label = 'spam' if is_spam == 1 else 'ham'
        # process_email returns unique words, so each email counts a word at most once
        for word in process_email(text):
            if word not in model:
                model[word] = {'spam': 0, 'ham': 0}
            model[word][label] += 1

    return model

In [10]:
model = train_naive_bayes(emails)

In [11]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'

model['sale']


{'spam': 38, 'ham': 41}
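
Raw counts are not directly comparable across classes, because the dataset contains far more ham than spam. A sketch of turning the counts above into smoothed per-class probabilities follows, assuming add-one (Laplace) smoothing over the two outcomes "word present" and "word absent". `smoothed_likelihood` is a hypothetical helper name; the class totals come from Step 3 (1368 spam, 5728 - 1368 = 4360 ham).

```python
def smoothed_likelihood(count, class_total):
    """Estimate P(word present | class) with add-one smoothing (assumed scheme)."""
    return (count + 1) / (class_total + 2)

sale = {'spam': 38, 'ham': 41}      # counts from the output above
num_spam, num_ham = 1368, 4360      # class sizes from Step 3

p_sale_spam = smoothed_likelihood(sale['spam'], num_spam)  # 39 / 1370
p_sale_ham = smoothed_likelihood(sale['ham'], num_ham)     # 42 / 4362

print(f"P(sale | spam) = {p_sale_spam:.4f}")  # 0.0285
print(f"P(sale | ham)  = {p_sale_ham:.4f}")   # 0.0096
```

Although 'sale' appears in slightly more ham emails than spam emails, it is roughly three times as likely to appear in any given spam email, which is the quantity Naive Bayes actually uses.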

## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.
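
For reference, the quantity to compute can be written out as follows. This is the standard Naive Bayes posterior under the assumption that the unique words $w_1, \dots, w_n$ of an email are conditionally independent given the class; the symbols $w_i$ and $n$ are notation introduced here, not part of the template.

```latex
P(\text{spam} \mid w_1, \dots, w_n)
  = \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}
         {P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
          + P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})}
```

Here $P(\text{spam})$ is the prior from Step 3, $P(\text{ham}) = 1 - P(\text{spam})$, and each $P(w_i \mid \cdot)$ is estimated from the word counts gathered in Step 4.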

In [12]:
import math

def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes
    
    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data
    
    Returns:
        float: Probability that the email is spam
    """
    # HINT: Use the log of probabilities to avoid numerical underflow
    # HINT: Remember to handle words not in the training data

    # 1. Preprocess the email text
    words = process_email(email_text)

    # 2. Start from the log priors, log P(spam) and log P(ham)
    num_total = num_spam + num_ham
    log_spam = math.log(num_spam / num_total)
    log_ham = math.log(num_ham / num_total)

    # 3. Add the log-likelihood of every word seen in training
    for word in words:
        if word not in model:
            continue  # unseen words carry no evidence for either class
        # Add-one (Laplace) smoothing keeps zero counts from producing log(0)
        log_spam += math.log((model[word]['spam'] + 1) / (num_spam + 2))
        log_ham += math.log((model[word]['ham'] + 1) / (num_ham + 2))

    # 4. Normalize: P(spam | words) = 1 / (1 + exp(log_ham - log_spam))
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the spam probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

In [13]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]
num_ham = num_emails - num_spam
for text in test_emails:
    p_spam = predict_naive_bayes(text, model, num_spam, num_ham)
    label = "spam" if p_spam > 0.5 else "ham"
    print(f"{text!r}: P(spam) = {p_spam:.4f} ({label})")
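
Multiplying many small per-word probabilities quickly runs below the smallest representable float, which is why working with logarithms is recommended for this step. The sketch below illustrates the effect with made-up numbers: 200 hypothetical words, each with probability 0.01.

```python
import math

# Hypothetical per-word probabilities, chosen only to trigger underflow
probs = [0.01] * 200

# Direct product: the true value is 1e-400, too small for a 64-bit float
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale stays well within range
log_sum = sum(math.log(p) for p in probs)

print(product)            # 0.0
print(round(log_sum, 2))  # -921.03
```

Because both class scores would collapse to 0.0 in the direct form, comparing them becomes meaningless, whereas the log scores can still be compared or normalized.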


## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?
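
To answer the first question with numbers rather than impressions, predictions on held-out emails can be compared with their true labels. The sketch below computes accuracy, precision, and recall with `numpy` only. `classification_metrics` is a hypothetical helper, and `y_true` and `y_pred` are made-up label arrays for illustration, not results from this notebook.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = spam, 0 = ham)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # spam correctly flagged
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # ham wrongly flagged
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # spam missed
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # ham correctly passed
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels: 6 emails, 2 of them misclassified
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With roughly 24% spam in this dataset, accuracy alone can look good even for a weak classifier, so precision and recall are worth reporting alongside it. The metrics are only meaningful on emails that were not used for training.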

### Notes (if any):