# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [7]:
import pandas as pd
emails = pd.read_csv('emails.csv')

In [8]:
emails.head()

   text                                               spam
0  Subject: naturally irresistible your corporate...  1
1  Subject: the stock trading gunslinger fanny i...   1
2  Subject: unbelievable new homes made easy im ...   1
3  Subject: 4 color printing special request add...   1
4  Subject: do not have money , get software cds ...  1


In [9]:

missing_values = emails.isnull().sum()
print(missing_values)

missing_text = emails['text'].isnull().sum()
missing_spam = emails['spam'].isnull().sum()
print(f"Missing values in 'text': {missing_text}")
print(f"Missing values in 'spam': {missing_spam}")

text    0
spam    0
dtype: int64
Missing values in 'text': 0
Missing values in 'spam': 0


## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [10]:
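def process_email(text):
    """Return the unique lowercase words in an email."""
    # Lowercase so that "Free" and "free" count as the same word.
    text = text.lower()

    # Split on whitespace; punctuation separated by spaces becomes its own token.
    words = text.split()

    # Deduplicate: the model tracks whether a word occurs, not how often.
    unique_words = list(set(words))
    return unique_words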

In [11]:

emails['words'] = emails['text'].apply(process_email)


In [12]:

first_email_text = emails['text'][0]


processed_words = process_email(first_email_text)

print(processed_words)


['shouldn', 'structure', 'subject:', '.', 'marketing', ':', 'fees', 'world', 'efforts', 'gaps', 'identity', 'information', 't', 'see', 'organization', 'more', 'much', 'break', 'are', '100', 'without', 'three', 'recollect', 'isoverwhelminq', 'catchy', 'nowadays', 'hard', 'company', 'original', '%', 'stationery', 'you', 'that', 'your', 'market', 'satisfaction', ',', 'be', 'and', 'clear', 'affordability', 'our', 'practicable', 'of', '-', 'drafts', 'have', ';', 'lt', 'not', 'budget', 'the', 'hand', 'provided', 'is', 'make', 'list', 'its', 'formats', 'within', 'easier', 'isguite', 'result', "'", 'interested', 'look', 'love', 'good', 'provide', 'specially', 'ieader', 'it', 'naturally', 'logos', '_', 'automaticaily', 'all', 'full', 'done', 'products', 'system', 'through', 'promise', 'convenience', 'portfolio', 'benefits', 'to', 'distinctive', 'for', 'statlonery', 'with', 'change', 'do', 'made', 'in', 'extra', 'guaranteed', 'this', 'ordered', 'reflect', 'ciear', 'surethat', 'here', 'but', 'bus
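
To see what `process_email` keeps and discards, a minimal sketch on a made-up message may help. The sample string is an assumption for illustration, not taken from the dataset. Because the function builds a set, repeated words collapse into one entry, and because it splits on whitespace only, punctuation attached to a word stays attached.

```python
def process_email(text):
    """Lowercase the text and return its unique whitespace-separated tokens."""
    return list(set(text.lower().split()))

# Hypothetical sample message, for illustration only.
sample = "Win a PRIZE now! Win big, win now!"

# Sort the result because the iteration order of a set is not stable.
tokens = sorted(process_email(sample))
print(tokens)  # ['a', 'big,', 'now!', 'prize', 'win']
```

The dataset's text appears to have spaces around punctuation already, which would explain why tokens such as `'.'` and `','` show up on their own in the output above.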

## Step 3: Calculate Prior Probabilities
Let's calculate the basic probability of an email being spam.

In [13]:

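# Total number of emails in the dataset.
num_emails = len(emails)

# The 'spam' column is 1 for spam and 0 for ham, so its sum is the spam count.
num_spam = emails['spam'].sum()

# Prior probability that an arbitrary email is spam.
spam_probability = num_spam / num_emails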

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")



Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388
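
Written out with the counts printed above, the prior is a simple ratio. The symbols $N$ and $N_{\text{spam}}$ are notation introduced here for the total number of emails and the number of spam emails:

$$
P(\text{spam}) = \frac{N_{\text{spam}}}{N} = \frac{1368}{5728} \approx 0.2388,
\qquad
P(\text{ham}) = 1 - P(\text{spam}) = \frac{4360}{5728} \approx 0.7612
$$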


## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting, for each word, how many spam emails and how many ham emails contain it.

In [14]:
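def train_naive_bayes(emails_data):
    """Count, for each word, how many spam and ham emails contain it."""
    model = {}

    for _, row in emails_data.iterrows():
        words = row['words']
        is_spam = row['spam']

        for word in words:
            # Start both counts at 1 so that no word ends up with a zero probability.
            if word not in model:
                model[word] = {'spam': 1, 'ham': 1}

            # 'words' holds unique words, so this counts emails, not occurrences.
            if is_spam:
                model[word]['spam'] += 1
            else:
                model[word]['ham'] += 1

    return model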

In [15]:
model = train_naive_bayes(emails)

In [16]:

if "lottery" in model:
    print("Word: lottery")
    print(f"Spam count: {model['lottery']['spam']}")
    print(f"Ham count: {model['lottery']['ham']}")
else:
    print("Word 'lottery' not found in the model.")



Word: lottery
Spam count: 9
Ham count: 1
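
The counting scheme can be checked by hand on a tiny example. The sketch below mirrors the logic of `train_naive_bayes` but takes a plain list of `(words, is_spam)` pairs instead of a DataFrame. That list-based signature, the name `train_counts`, and the three training examples are assumptions made to keep the sketch free of pandas.

```python
def train_counts(examples):
    """Count, per word, how many spam and ham examples contain it.

    Hypothetical helper: `examples` is a list of (words, is_spam) pairs.
    """
    model = {}
    for words, is_spam in examples:
        label = 'spam' if is_spam else 'ham'
        for word in set(words):
            # Start both counts at 1, as train_naive_bayes does.
            counts = model.setdefault(word, {'spam': 1, 'ham': 1})
            counts[label] += 1
    return model

# Made-up training examples, for illustration only.
toy_examples = [
    (['win', 'lottery', 'now'], 1),
    (['lottery', 'prize'], 1),
    (['meeting', 'now'], 0),
]
toy_model = train_counts(toy_examples)
print(toy_model['lottery'])  # {'spam': 3, 'ham': 1}
print(toy_model['now'])      # {'spam': 2, 'ham': 2}
```

Read the same way, the real output above (spam count 9, ham count 1) means that "lottery" appears in 8 spam emails and in no ham emails, since every count starts at 1.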


## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.
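
The quantity the prediction function computes can be written out as follows. The notation is introduced here for illustration: $w_1, \dots, w_n$ are the unique words of the email that appear in the model, $c_{\text{spam}}(w)$ and $c_{\text{ham}}(w)$ are the counts stored in `model`, and $N_{\text{spam}}$ and $N_{\text{ham}}$ are the numbers of spam and ham emails. Under the naive assumption that words occur independently given the class:

$$
\begin{aligned}
s &= \log\frac{N_{\text{spam}}}{N_{\text{spam}}+N_{\text{ham}}} + \sum_{i=1}^{n}\log\frac{c_{\text{spam}}(w_i)}{N_{\text{spam}}},\\
h &= \log\frac{N_{\text{ham}}}{N_{\text{spam}}+N_{\text{ham}}} + \sum_{i=1}^{n}\log\frac{c_{\text{ham}}(w_i)}{N_{\text{ham}}},\\
P(\text{spam}\mid w_1,\dots,w_n) &\approx \frac{e^{s}}{e^{s}+e^{h}}.
\end{aligned}
$$

Summing logarithms instead of multiplying many small probabilities keeps the intermediate values in a range that floating-point numbers can represent.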

In [17]:
import math

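def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """Return the estimated probability that an email is spam."""
    words = process_email(email_text)

    # Start from the log prior of each class.
    log_prob_spam = math.log(num_spam / (num_spam + num_ham))
    log_prob_ham = math.log(num_ham / (num_spam + num_ham))

    # Add the log likelihood of each known word; unseen words are skipped.
    for word in words:
        if word in model:
            log_prob_spam += math.log(model[word]['spam'] / num_spam)
            log_prob_ham += math.log(model[word]['ham'] / num_ham)

    # Subtract the larger log-probability before exponentiating so that long
    # emails cannot underflow both terms to zero.
    max_log = max(log_prob_spam, log_prob_ham)
    spam_term = math.exp(log_prob_spam - max_log)
    ham_term = math.exp(log_prob_ham - max_log)
    return spam_term / (spam_term + ham_term)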

In [18]:

test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]
num_ham = num_emails - num_spam
for email_text in test_emails:
    probability = predict_naive_bayes(email_text, model, num_spam, num_ham)
    print(f"Email: {email_text}")
    print(f"Probability of spam: {probability:.4f}")

Email: lottery winner claim prize money
Probability of spam: 0.9999
Email: meeting tomorrow at 3pm
Probability of spam: 0.0013
Email: buy cheap watches online
Probability of spam: 0.9980
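
The test messages above are short, so their log-probabilities stay small in magnitude. For a long email the sums can become very negative, and exponentiating them directly can underflow to zero. The sketch below, using made-up log-probabilities, shows that failure and one common remedy: subtracting the larger value before exponentiating. The helper name `normalize_log_probs` is an assumption for illustration.

```python
import math

def normalize_log_probs(log_spam, log_ham):
    """Return P(spam) from two unnormalized log-probabilities."""
    # Subtract the larger value so the largest exponent is exp(0) = 1.
    max_log = max(log_spam, log_ham)
    spam_term = math.exp(log_spam - max_log)
    ham_term = math.exp(log_ham - max_log)
    return spam_term / (spam_term + ham_term)

# Made-up values of the size a long email might produce.
log_spam, log_ham = -1000.0, -1002.0

# Direct exponentiation underflows: both terms become 0.0,
# so dividing one by their sum would raise ZeroDivisionError.
print(math.exp(log_spam), math.exp(log_ham))  # 0.0 0.0

stable = normalize_log_probs(log_spam, log_ham)
print(f"{stable:.4f}")  # 0.8808
```

The shift does not change the result, because the common factor cancels between the numerator and the denominator.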


## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

1) On the three test messages above, the model behaved as expected: it assigned spam probabilities of 0.9999 and 0.9980 to the two spam-like messages and 0.0013 to the meeting reminder. These are hand-written examples, so they show that the model works in principle rather than measuring its accuracy.
2) The main challenges were cleaning and preparing the text data for analysis, and implementing the prediction function with `math.log` so that probabilities are combined in log space.
3) Training on a larger and more diverse dataset should improve the model's ability to generalize to different types of emails.
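
For question 1, a fuller answer would score the model on emails held out from training. The sketch below shows how such an evaluation could be computed. The `evaluate` helper, the 0.5 threshold, and the sample scores and labels are all assumptions for illustration, not results from this notebook.

```python
def evaluate(probabilities, labels, threshold=0.5):
    """Compute accuracy, precision, and recall for spam predictions.

    Hypothetical helper: `probabilities` are predicted spam probabilities,
    `labels` are the true 0/1 spam flags, in the same order.
    """
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # spam caught
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # ham flagged as spam
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # spam missed
    correct = sum(1 for p, y in pairs if p == y)
    return {
        'accuracy': correct / len(labels),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up scores and labels, not output from the model above.
sample_probs = [0.99, 0.80, 0.30, 0.10, 0.60]
sample_labels = [1, 1, 1, 0, 0]
metrics = evaluate(sample_probs, sample_labels)
print(metrics)
```

In the notebook, the probabilities would come from calling `predict_naive_bayes` on each held-out email, with the model trained only on the remaining emails.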

### Notes (if any):