# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [2]:
# Load the data
import pandas as pd
# TODO: Load the 'emails.csv' file into a DataFrame called 'emails'
emails = pd.read_csv('/kaggle/input/naive-emails/emails.csv')

In [3]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1


In [4]:
# Analyse the data and remove rows with missing values in 'text' or 'spam'
emails = emails.dropna(subset=['text', 'spam'])
emails.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB
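
The `dropna` call above handles missing values only. A minimal sketch of a few further validity checks follows, assuming that "invalid" means labels other than 0 or 1, whitespace-only text, and exact duplicate emails. The `clean_emails` helper and the toy DataFrame are hypothetical and are not part of the exercise data.

```python
import pandas as pd

def clean_emails(df):
    """Drop rows with missing values, invalid labels, blank text, or duplicate text."""
    df = df.dropna(subset=['text', 'spam'])
    # Keep only rows whose label is exactly 0 or 1
    df = df[df['spam'].isin([0, 1])]
    # Drop rows whose text is empty or whitespace only
    df = df[df['text'].str.strip() != '']
    # Drop exact duplicate emails, keeping the first occurrence
    df = df.drop_duplicates(subset='text')
    return df.reset_index(drop=True)

# Toy data for illustration only (not taken from emails.csv)
toy = pd.DataFrame({
    'text': ['win money now', 'meeting at noon', None, '   ', 'win money now', 'hello'],
    'spam': [1, 0, 1, 0, 1, 2],
})
cleaned = clean_emails(toy)
print(cleaned)
```

Whether duplicates should be dropped depends on the dataset; it is shown here only as one possible choice.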


## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [5]:
def process_email(text):
    """
    Convert email text to a list of unique, lowercase words
    
    Parameters:
        text (str): The email text
    
    Returns:
        list: List of unique words in the email
    """
    # TODO: Implement the preprocessing function
    # 1. Convert text to lowercase
    # 2. Split into words
    # 3. Remove duplicates

    # HINT: Use text.lower() for lowercase conversion
    # HINT: Use split() to convert text to words
    # HINT: Use set() to remove duplicates
    words = set(text.lower().split())  # the order of the resulting list is arbitrary
    return list(words)

In [6]:
# Apply preprocessing to all emails
emails['words'] = emails['text'].apply(process_email)


In [7]:
# Test your preprocessing on the first email
print(emails['words'].iloc[0])

['.', 'interested', 'image', 'changes', 'done', ':', 'in', 'effective', 'your', 'are', '_', 't', 'products', 'marketing', 'you', ';', ',', 'become', 'isguite', 'promise', 'logo', 'catchy', 'outstanding', 'to', 'made', 'even', 'ieader', 'naturally', 'business', 'formats', 'specially', "'", 'change', 'world', 'provided', 'at', 'benefits', 'website', 'distinctive', 'see', 'collaboration', 'with', 'original', 'management', 'identity', 'make', 'fees', 'days', 'gaps', 'through', 'much', 'subject:', 'isoverwhelminq', 'have', 'no', 'but', 'stylish', 'this', 'corporate', 'all', 'for', 'creativeness', 'stationery', 'of', 'unlimited', 'be', 'budget', '100', 'use', 'result', 'surethat', 'really', 'do', 'more', 'amount', 'easy', 'drafts', 'love', 'satisfaction', 'recollect', 'its', 'that', 'will', 'lt', 'aim', 'we', 'task', 'practicable', 'provide', 'portfolio', 'good', 'list', 'efforts', 'nowadays', 'guaranteed', 'content', 'promptness', 'without', 'hotat', '-', 'information', 'shouldn', 'ordered'
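
The output above contains punctuation-only tokens such as `'.'`, `':'` and `','`, because `split()` separates on whitespace only. One possible variant is sketched below, assuming that only runs of letters and digits should count as words. The name `process_email_regex` is hypothetical and is not part of the exercise.

```python
import re

def process_email_regex(text):
    """Return the sorted unique lowercase alphanumeric tokens in text."""
    # [a-z0-9]+ matches runs of letters and digits, so punctuation is dropped
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # sorted() makes the output order deterministic, unlike list(set(...))
    return sorted(set(tokens))

result = process_email_regex("Subject: WIN a prize , win NOW !")
print(result)
```

This variant changes the vocabulary, so a model trained with it would produce different counts from the ones shown in this notebook.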

## Step 3: Calculate Prior Probabilities
Let's calculate the basic probability of an email being spam.

In [8]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam

num_emails = len(emails)
num_spam = emails['spam'].sum()
spam_probability = num_spam / num_emails

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388
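
Step 5 uses the logarithm of both class priors, not only the spam prior. A short sketch of that calculation, using the counts printed above (the variable names `prior_ham`, `log_prior_spam` and `log_prior_ham` are introduced here for illustration):

```python
import math

# Counts reported in the output above
num_emails = 5728
num_spam = 1368
num_ham = num_emails - num_spam

prior_spam = num_spam / num_emails
prior_ham = num_ham / num_emails

# Logarithms of the priors are the starting values of the log scores in Step 5
log_prior_spam = math.log(prior_spam)
log_prior_ham = math.log(prior_ham)

print(f"P(spam) = {prior_spam:.4f}, P(ham) = {prior_ham:.4f}")
print(f"log P(spam) = {log_prior_spam:.4f}, log P(ham) = {log_prior_ham:.4f}")
```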


## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting word occurrences in spam and ham emails.

In [10]:
def train_naive_bayes(emails_data):
    """
    Train a Naive Bayes model on email data
    
    Parameters:
        emails_data (DataFrame): DataFrame with 'words' and 'spam' columns
    
    Returns:
        dict: Dictionary with word frequencies in spam and ham emails
    """
    # TODO: Create a dictionary to store word frequencies
    # For each word, store counts of its occurrence in spam and ham emails      

    model = {}
    for _, row in emails_data.iterrows():
        for word in row['words']:
            if word not in model:
                model[word] = {'spam': 1, 'ham': 1}  # Laplace smoothing

            if row['spam']:
                model[word]['spam'] += 1  # If email is spam, increase word's spam count
            else:
                model[word]['ham'] += 1  # Otherwise, increase word's ham count

    return model  # Make sure this return is inside the function

In [11]:
model = train_naive_bayes(emails)
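
`iterrows()` is easy to read but can be slow on larger DataFrames. The sketch below builds a dictionary with the same layout using `collections.Counter`. The function `train_with_counters` is hypothetical, and it takes plain sequences of word lists and labels instead of a DataFrame.

```python
from collections import Counter

def train_with_counters(word_lists, labels):
    """Build {word: {'spam': n, 'ham': m}} with counts starting at 1 (add-one smoothing)."""
    spam_counts, ham_counts = Counter(), Counter()
    for words, is_spam in zip(word_lists, labels):
        # set() ensures each email contributes at most 1 to a word's count
        (spam_counts if is_spam else ham_counts).update(set(words))
    vocabulary = set(spam_counts) | set(ham_counts)
    return {
        word: {'spam': spam_counts[word] + 1, 'ham': ham_counts[word] + 1}
        for word in vocabulary
    }

# Toy data for illustration only
toy_words = [['win', 'money'], ['meeting', 'money'], ['win', 'prize']]
toy_labels = [1, 0, 1]
toy_model = train_with_counters(toy_words, toy_labels)
print(toy_model['win'], toy_model['money'], toy_model['meeting'])
```

With the notebook's data, a call such as `train_with_counters(emails['words'], emails['spam'])` should yield the same counts as `train_naive_bayes(emails)`, since both start every count at 1 and add 1 per email containing the word.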

In [12]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'
print(model.get('lottery', 'Word not in model'))
print(model.get('sale', 'Word not in model'))
print(model.get('meeting', 'Word not in model'))


{'spam': 9, 'ham': 1}
{'spam': 39, 'ham': 42}
{'spam': 11, 'ham': 808}
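
The raw counts become per-class likelihoods once they are divided by the class size. The sketch below assumes add-one smoothing for a present/absent feature, which puts `num_class + 2` in the denominator; this is an assumption of the sketch, and Step 5 divides by the raw class sizes instead, which gives nearly the same values for classes this large. The helper `word_likelihoods` is hypothetical.

```python
def word_likelihoods(counts, num_spam, num_ham):
    """Return (P(word | spam), P(word | ham)) from smoothed counts.

    Assumes the counts already include the +1 from add-one smoothing, so the
    denominator gets +2 (one pseudo-email with the word, one without).
    """
    p_spam = counts['spam'] / (num_spam + 2)
    p_ham = counts['ham'] / (num_ham + 2)
    return p_spam, p_ham

# Counts and class sizes taken from the outputs above
num_spam, num_ham = 1368, 5728 - 1368
examples = {
    'lottery': {'spam': 9, 'ham': 1},
    'sale': {'spam': 39, 'ham': 42},
    'meeting': {'spam': 11, 'ham': 808},
}
ratios = {}
for word, counts in examples.items():
    p_spam, p_ham = word_likelihoods(counts, num_spam, num_ham)
    # A ratio above 1 means the word is relatively more common in spam
    ratios[word] = p_spam / p_ham
    print(f"{word}: P(w|spam)={p_spam:.4f}, P(w|ham)={p_ham:.4f}, ratio={ratios[word]:.2f}")
```

Note that 'sale' appears in slightly more ham emails than spam emails in absolute terms, yet its ratio is above 1 because there are far fewer spam emails overall.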


## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.

In [14]:
import numpy as np

def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes
    
    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data
    
    Returns:
        float: Probability that the email is spam
    """
    # TODO: Implement the Naive Bayes prediction
    # 1. Preprocess the email text
    # 2. Calculate probability using the Naive Bayes formula

    # Your code here

    # HINT: Use the log of probabilities to avoid numerical underflow
    # HINT: Remember to handle words not in the training data
    total_emails = num_spam + num_ham
    
    # Preprocess the email text
    words = process_email(email_text)  # Use the same `process_email` function from Step 2
    
    # Initialize log probabilities for spam and ham with prior probabilities
    log_prob_spam = np.log(num_spam / total_emails)
    log_prob_ham = np.log(num_ham / total_emails)
    
    # Calculate probability for each word in the email
    for word in words:
        if word in model:
            # Use the counts from the model and Laplace smoothing
            log_prob_spam += np.log(model[word]['spam'] / num_spam)
            log_prob_ham += np.log(model[word]['ham'] / num_ham)
        else:
            # Handle unseen words with Laplace smoothing
            log_prob_spam += np.log(1 / (num_spam + len(model)))
            log_prob_ham += np.log(1 / (num_ham + len(model)))
    
    # Convert the log scores to P(spam) via the log-odds form. This equals
    # prob_spam / (prob_spam + prob_ham) but cannot underflow to 0 / 0 on long emails
    log_odds_ham = np.clip(log_prob_ham - log_prob_spam, -700, 700)

    # Return the probability of spam
    return 1 / (1 + np.exp(log_odds_ham))


In [15]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]
for email in test_emails:
    prob_spam = predict_naive_bayes(email, model, num_spam, num_emails - num_spam)
    print(f"Email: '{email}' - Spam Probability: {prob_spam:.4f}")

Email: 'lottery winner claim prize money' - Spam Probability: 0.9999
Email: 'meeting tomorrow at 3pm' - Spam Probability: 0.0014
Email: 'buy cheap watches online' - Spam Probability: 0.9980
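
The test emails above are only a few words long. For long emails the summed log scores become very negative, and applying `np.exp` to each score separately underflows to 0.0. The sketch below illustrates that effect and the log-odds form that avoids it. The log scores are hypothetical numbers, and `posterior_from_logs` is a name introduced for this illustration.

```python
import numpy as np

def posterior_from_logs(log_prob_spam, log_prob_ham):
    """Return P(spam) from unnormalized log scores without exponentiating each one."""
    # p_s / (p_s + p_h) == 1 / (1 + exp(log p_h - log p_s))
    diff = log_prob_ham - log_prob_spam
    # Clip the difference so np.exp cannot overflow; exp(700) is still finite
    diff = np.clip(diff, -700.0, 700.0)
    return float(1.0 / (1.0 + np.exp(diff)))

# Hypothetical log scores, of the size a long email might produce
log_s, log_h = -1000.0, -1005.0

# Exponentiating each score on its own loses all information
naive_s, naive_h = np.exp(log_s), np.exp(log_h)
print(naive_s, naive_h)

# Only the difference of the log scores matters for the posterior
stable = posterior_from_logs(log_s, log_h)
print(stable)
```

Because only the difference between the two log scores enters the result, the log-odds form gives the same answer as the direct ratio whenever the direct ratio is computable.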


## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

1. On the three test emails, the model gave the expected results: the lottery and cheap-watches messages scored 0.9999 and 0.9980, while the meeting message scored 0.0014. Three hand-written examples are too few to measure accuracy, though.

2. Most of the hints were already provided, so we mainly had to follow them. Without them the task would have been trickier, since we could not use sklearn or a pretrained model. For example, the hints already pointed to Laplace smoothing and log probabilities, so we ran into no issues there.

3. We could apply stemming and lemmatization to improve the model. We could also evaluate it with the F1 score or other metrics. Finally, we could look at the words that occur most frequently in spam emails to classify them more efficiently.
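
Since `sklearn` is not allowed, the F1 score mentioned in the third answer would have to be computed by hand. A minimal sketch with `numpy` follows; the function `classification_metrics` is hypothetical, and the labels and predictions are made-up values for illustration, not results from this notebook.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = spam)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical labels and thresholded predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In the notebook, `y_pred` could come from thresholding `predict_naive_bayes` at 0.5 on emails held out from training. Such a train/test split is an assumption here, since the exercise trains on all emails.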

### Notes (if any):