# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/ml-trek-1a/emails.csv


## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [2]:
# Load the data
# TODO: Load the 'emails.csv' file into a DataFrame called 'emails'
emails = pd.read_csv("/kaggle/input/ml-trek-1a/emails.csv")

In [3]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1


In [4]:
emails.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


In [5]:
#Analyse the data and remove or modify rows with missing or invalid values
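
One way to approach this cell is sketched below. It assumes the two-column layout reported by `emails.info()` above; the helper name `clean_emails` and the toy DataFrame are illustrative and not part of the original notebook.

```python
import pandas as pd

def clean_emails(df):
    """Return a copy with missing, empty, invalid, and duplicate rows removed (hypothetical helper)."""
    cleaned = df.dropna(subset=['text', 'spam']).copy()
    # Keep only rows whose text is a non-empty string
    is_text = cleaned['text'].apply(lambda t: isinstance(t, str) and t.strip() != '')
    cleaned = cleaned[is_text]
    # Keep only rows with a valid label (0 = ham, 1 = spam)
    cleaned = cleaned[cleaned['spam'].isin([0, 1])]
    cleaned['spam'] = cleaned['spam'].astype(int)
    # Drop exact duplicate emails and renumber the rows
    cleaned = cleaned.drop_duplicates(subset='text').reset_index(drop=True)
    return cleaned

# Toy data standing in for emails.csv (illustrative only)
toy = pd.DataFrame({
    'text': ['Subject: win money', None, 'Subject: meeting at 3', 'Subject: win money', '  '],
    'spam': [1, 0, 0, 1, 2],
})
cleaned_toy = clean_emails(toy)
print(cleaned_toy)
```

For the dataset loaded above, `emails.info()` already reports 5728 non-null values in both columns, so the duplicate and label checks are the ones most likely to matter.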

## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [6]:
def process_email(text):
    """
    Convert email text to a list of unique, lowercase words.
    
    Parameters:
        text (str): The email text.
    
    Returns:
        list: List of unique words in the email.
    """
   
    text = text.lower()
    words = text.split()    
    unique_words = list(set(words))
    
    return unique_words




In [7]:
emails['text'] = emails['text'].apply(process_email)
emails.head()


                                                text  spam
0  [be, identity, much, ordered, amount, within, ...     1
1  [bedtime, clothesman, hall, yes, nameable, pal...     1
2  [approved, unconditionally, ask, opportunity, ...     1
3  [/, fax, golden, printing, &, azusa, version, ...     1
4  [be, ended, ?, along, me, tradgedies, yet, ', ...     1


In [8]:
# Test your preprocessing by testing on the first email
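
A quick check of `process_email` might look like the sketch below. Because `emails['text']` was already replaced by word lists in the previous cell, the example uses a made-up raw string (an assumption, not a row from the dataset). `process_email` is repeated here only so the snippet runs on its own.

```python
def process_email(text):
    """Same logic as Step 2: lowercase, split on whitespace, de-duplicate."""
    return list(set(text.lower().split()))

sample = "Subject: Win MONEY now , win big"  # made-up text for illustration
words = process_email(sample)
print(sorted(words))  # sorted only for a stable display; set order is arbitrary

# Every word is lowercase and appears exactly once
assert all(w == w.lower() for w in words)
assert len(words) == len(set(words))
```

Note that punctuation survives as its own token (`,` here, and `/`, `&`, `?` in the `emails.head()` output above), since the function splits on whitespace only.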


## Step 3: Calculate Prior Probabilities
Let's calculate the basic probability of an email being spam.

In [9]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam

num_emails = len(emails)
num_spam = sum(emails['spam'])
spam_probability = num_spam / num_emails

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388


## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting word occurrences in spam and ham emails.

In [10]:
def train_naive_bayes(emails_data):
    """
    Train a Naive Bayes model on email data
    
    Parameters:
        emails_data (DataFrame): DataFrame with 'text' (list of unique words per email) and 'spam' columns
    
    Returns:
        dict: Dictionary with word frequencies in spam and ham emails
    """
    # TODO: Create a dictionary to store word frequencies
    # For each word, store counts of its occurrence in spam and ham emails
    model = {}
    for index, row in emails_data.iterrows():
        label = 'spam' if row['spam'] == 1 else 'ham'
        words = row['text']  # Reference the preprocessed words here
        
        for word in words:
            if word not in model:
                # Initialize counts for a new word with Laplace smoothing
                model[word] = {'spam': 1, 'ham': 1}
            model[word][label] += 1
    
    return model

    # Your code here
    # HINT: Initialize counts with 1 (Laplace smoothing)
    # HINT: Structure: model[word] = {'spam': count, 'ham': count}

In [11]:
model = train_naive_bayes(emails)

In [12]:
dict(list(model.items())[:10])

{'be': {'spam': 639, 'ham': 2648},
 'identity': {'spam': 81, 'ham': 4},
 'much': {'spam': 118, 'ham': 548},
 'ordered': {'spam': 37, 'ham': 35},
 'amount': {'spam': 93, 'ham': 105},
 'within': {'spam': 174, 'ham': 314},
 'easier': {'spam': 42, 'ham': 44},
 'become': {'spam': 83, 'ham': 112},
 'you': {'spam': 983, 'ham': 3477},
 'really': {'spam': 72, 'ham': 268}}

In [13]:
test_words = ['lottery', 'sale', 'meeting']
test_results = {word: model.get(word) for word in test_words}

In [14]:
test_results

{'lottery': {'spam': 9, 'ham': 1},
 'sale': {'spam': 39, 'ham': 42},
 'meeting': {'spam': 11, 'ham': 808}}

## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.

In [15]:
def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes
    
    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data
    
    Returns:
        float: Probability that the email is spam
    """
    # TODO: Implement the Naive Bayes prediction
    # 1. Preprocess the email text
    words = process_email(email_text)
    # 2. Calculate probability using the Naive Bayes formula
    p_spam = num_spam / (num_spam + num_ham)
    p_ham = num_ham / (num_spam + num_ham)
    

    # HINT: Use the log of probabilities to avoid numerical underflow
    log_prob_spam = np.log(p_spam)
    log_prob_ham = np.log(p_ham)

    # HINT: Remember to handle words not in the training data
    for word in words:
        if word in model:
            word_spam_count = model[word]['spam']
            word_ham_count = model[word]['ham']
        else:
            # Unseen word: fall back to the smoothed count of 1 for both classes
            word_spam_count = 1
            word_ham_count = 1

        # Counts already include the +1 from training, so only the denominator needs the +2
        p_word_given_spam = word_spam_count / (num_spam + 2)
        p_word_given_ham = word_ham_count / (num_ham + 2)

        log_prob_spam += np.log(p_word_given_spam)
        log_prob_ham += np.log(p_word_given_ham)

    # Normalize in log space: exponentiating each score separately can underflow to 0/0 on long emails
    log_norm = np.logaddexp(log_prob_spam, log_prob_ham)
    prob_spam = np.exp(log_prob_spam - log_norm)
    return prob_spam

In [16]:
num_ham = num_emails - num_spam

In [17]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]

In [18]:
test_predictions = {email: predict_naive_bayes(email, model, num_spam, num_ham) for email in test_emails}
test_predictions

{'lottery winner claim prize money': 0.9999332457087802,
 'meeting tomorrow at 3pm': 0.004248841735740117,
 'buy cheap watches online': 0.99796101009175}
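
The three hand-written emails above are only a spot check. A held-out split gives a rough accuracy figure. The sketch below is one possible way to do this with only `numpy` and `pandas`. The 80/20 split, the 0.5 threshold, the seed, and the helper name `evaluate_holdout` are all assumptions, and the `toy_train` / `toy_predict` stand-ins are hypothetical so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def evaluate_holdout(df, train_fn, predict_fn, test_fraction=0.2, threshold=0.5, seed=0):
    """Shuffle df, train on one part, and return accuracy on the held-out part (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_test = int(len(df) * test_fraction)
    test = df.iloc[order[:n_test]]
    train = df.iloc[order[n_test:]]

    state = train_fn(train)
    predicted = np.array([predict_fn(text, state) >= threshold for text in test['text']])
    actual = test['spam'].to_numpy() == 1
    return float((predicted == actual).mean())

def toy_train(train_df):
    # Stand-in "model": words seen only in spam emails
    spam_words, ham_words = set(), set()
    for text, label in zip(train_df['text'], train_df['spam']):
        (spam_words if label == 1 else ham_words).update(text.lower().split())
    return spam_words - ham_words

def toy_predict(text, spam_only_words):
    # Returns 1.0 if the email contains any spam-only word, else 0.0
    return float(any(w in spam_only_words for w in text.lower().split()))

# Toy data standing in for the real emails (illustrative only)
toy = pd.DataFrame({
    'text': ['win money now', 'cheap watches', 'meeting tomorrow', 'project update'] * 5,
    'spam': [1, 1, 0, 0] * 5,
})
accuracy = evaluate_holdout(toy, toy_train, toy_predict)
print(f"Held-out accuracy: {accuracy:.2f}")
```

To apply the same idea to this notebook, the model would need to be retrained on the training portion only, with `train_naive_bayes` and `predict_naive_bayes` wrapped to fit the two callables. Note that `predict_naive_bayes` expects raw text, whereas `emails['text']` now holds word lists.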

## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?


1. The model performs well considering its simplicity.
2. Relying too heavily on sklearn can sometimes get in the way of understanding the underlying concept.
3. Removing stop words and punctuation might improve the model.
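
The third suggestion could be sketched as follows. The stop-word list here is a tiny made-up sample rather than a standard list, and `process_email_clean` is a hypothetical variant of `process_email`. Whether it actually improves accuracy would need to be measured.

```python
import string

# Tiny illustrative stop-word list (an assumption; a real list would be much longer)
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'to', 'of', 'in', 'is', 'at'}

def process_email_clean(text):
    """Lowercase, strip ASCII punctuation, and drop stop words before de-duplicating."""
    # Replace each punctuation character with a space so "money,now" splits into two words
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    words = text.lower().translate(table).split()
    # Sorted for a stable order; the original process_email returns an unordered list
    return sorted(set(words) - STOP_WORDS)

cleaned_words = process_email_clean("Subject: Win the MONEY , now!")
print(cleaned_words)
```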


### Notes (if any):

---

### Bayes' Theorem

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions related to the event.

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

where:
- $P(A|B)$: Posterior probability – the probability of event $A$ occurring given that $B$ is true.
- $P(B|A)$: Likelihood – the probability of event $B$ occurring given that $A$ is true.
- $P(A)$: Prior probability – the initial probability of event $A$.
- $P(B)$: Marginal probability – the probability of event $B$ occurring.
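
As a worked example, the numbers printed earlier in this notebook can be plugged into the theorem for a single word. The sketch below takes $A$ = "the email is spam" and $B$ = "the email contains the word *lottery*". It uses the smoothed counts from `test_results` and the same `+ 2` denominators as `predict_naive_bayes`. The variable names are illustrative.

```python
# Values copied from the outputs earlier in the notebook
num_emails = 5728
num_spam = 1368
num_ham = num_emails - num_spam
lottery_counts = {'spam': 9, 'ham': 1}  # smoothed counts from test_results

p_spam = num_spam / num_emails  # P(A): prior
p_ham = num_ham / num_emails
p_word_given_spam = lottery_counts['spam'] / (num_spam + 2)  # P(B|A): likelihood
p_word_given_ham = lottery_counts['ham'] / (num_ham + 2)

# P(B): marginal probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# P(A|B): posterior
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'lottery') = {p_spam_given_word:.4f}")
```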

---

### Naive Assumption

The "naive" part of Naive Bayes assumes that all features are independent of each other given the class label. In the context of text classification:
- Each word in a document is treated as an independent feature.
- We calculate the probability of a document belonging to a class (e.g., spam or ham) by combining the probabilities of each word.

---

### Naive Bayes Classifier

Given a document $D$ with words $w_1, w_2, \dots, w_n$, we calculate the probability of $D$ being in class $C$ (e.g., spam or ham) as:

$$
P(C|D) = \frac{P(C) \cdot \prod_{i=1}^n P(w_i|C)}{P(D)}
$$

Since $P(D)$ is the same for all classes, we can simplify this to:

$$
P(C|D) \propto P(C) \cdot \prod_{i=1}^n P(w_i|C)
$$

#### Log Transformation
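To avoid numerical underflow when many small probabilities are multiplied together, we work with logarithms, which turn the product into a sum:

$$
\log P(C|D) = \log P(C) + \sum_{i=1}^n \log P(w_i|C) - \log P(D)
$$

The term $\log P(D)$ is the same for every class, so it can be dropped when comparing classes.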
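
The underflow problem is easy to reproduce. The sketch below uses made-up per-word probabilities (1,000 words, each with probability $10^{-4}$) to show the raw product collapsing to zero while the sum of logs stays finite. The last lines show one way to turn two log scores back into a normalized probability with `np.logaddexp`; the scores there are invented for illustration.

```python
import numpy as np

# Made-up likelihoods: 1000 words, each with P(w|C) = 1e-4
word_probs = np.full(1000, 1e-4)

raw_product = np.prod(word_probs)     # underflows to 0.0 in float64
log_sum = np.sum(np.log(word_probs))  # finite: 1000 * log(1e-4)
print(raw_product, log_sum)

# Normalizing two log scores without leaving log space
log_score_spam, log_score_ham = -9200.0, -9210.0  # invented values
log_norm = np.logaddexp(log_score_spam, log_score_ham)
p_spam = np.exp(log_score_spam - log_norm)
print(p_spam)
```

Computing `np.exp(log_score_spam)` directly here would give `0.0` for both classes, and the ratio would be undefined. Subtracting the normalizer first keeps the exponent close to zero.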

---

### Laplace Smoothing

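To handle words that never appear in one of the classes in the training data, we apply **Laplace smoothing**. This technique adds a small constant (typically 1) to all word counts, ensuring that we never multiply by zero.

The smoothed probability of a word $w$ given class $C$ is:

$$
P(w|C) = \frac{\text{count}(w, C) + 1}{\text{total word count in } C + \text{number of unique words}}
$$

The code in Steps 4 and 5 uses a per-email variant of this idea: because each email is reduced to its set of unique words, $\text{count}(w, C)$ is the number of emails in class $C$ that contain $w$, and the denominator is the number of emails in $C$ plus 2.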
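
A minimal sketch of the word-count form of the formula, on a made-up three-word vocabulary. The counts and the helper name `smoothed_prob` are assumptions for illustration; they are not taken from the dataset.

```python
def smoothed_prob(word, class_counts, vocabulary):
    """Laplace-smoothed P(word | class) from raw word counts for one class."""
    total = sum(class_counts.get(w, 0) for w in vocabulary)
    return (class_counts.get(word, 0) + 1) / (total + len(vocabulary))

vocabulary = ['lottery', 'meeting', 'sale']  # made-up vocabulary
spam_counts = {'lottery': 8, 'sale': 2}      # 'meeting' never seen in spam

probs = {w: smoothed_prob(w, spam_counts, vocabulary) for w in vocabulary}
for w, p in probs.items():
    print(w, p)
```

The unseen word `meeting` receives a small non-zero probability instead of zero, and the smoothed probabilities over the vocabulary still sum to 1.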


---
