# Notebook 02: Text Preprocessing for Spam Classification

### 🧾 Overview
In this notebook, I’ll clean and prepare the email text so it's ready for machine learning.  
The goal is to go from raw email bodies to simplified, tokenized, and filtered text that captures the most meaningful words.

---

## Loading Libraries and Dataset

I start by:
- Importing the same `load_spam_ham_data()` function used in the previous notebook
- Reloading the raw text dataset with it
- Importing NLTK (Natural Language Toolkit) to handle tokenization and stopword removal

---

### Why `nltk.download()` is used:
NLTK ships without its corpora and tokenizer models, so I need to download:
- `'punkt'`: the tokenizer models used to split text into sentences and words
- `'stopwords'`: a list of common English words to filter out (like "the", "and", "is")

These downloads only need to happen once; on later runs `nltk.download()` finds the package already up to date and skips the download.
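
If the log lines from `nltk.download()` are unwanted on every run, one option is to look for the resource locally first and download only when it is missing. This is a minimal sketch, not what the notebook does: `ensure_nltk_resource` is a hypothetical helper name, and the resource paths `tokenizers/punkt` and `corpora/stopwords` are assumed to be where NLTK stores these two packages. Some recent NLTK releases may also need the `punkt_tab` package for `word_tokenize`; whether that applies depends on the installed version.

```python
import nltk

def ensure_nltk_resource(resource_path, package_name):
    """Download package_name only if resource_path is not found locally.

    Returns True when a download was attempted, False when the data was already present.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError if the data is missing
        return False
    except LookupError:
        nltk.download(package_name, quiet=True)
        return True

# Assumed resource locations for the two packages used in this notebook.
ensure_nltk_resource('tokenizers/punkt', 'punkt')
ensure_nltk_resource('corpora/stopwords', 'stopwords')
```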

In [1]:
import sys
sys.path.append('../src')

from data import load_spam_ham_data
import pandas as pd

In [2]:
df = load_spam_ham_data(base_path='../data')

In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/shhaseeb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/shhaseeb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 🧹 Text Preprocessing

To prepare the email text for modeling, I define a simple preprocessing function that:

- Replaces every non-letter character (punctuation, digits, symbols) with a space using regex
- Converts text to lowercase
- Tokenizes into words
- Removes stopwords (like "the", "and", etc.)
- Filters out single-character tokens
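
To make the steps concrete, here is a small standalone sketch of the same pipeline on a single sentence. It is an illustration under simplifying assumptions: `DEMO_STOP_WORDS` is a tiny hand-written list rather than NLTK's English stopword list, and tokens are split on whitespace instead of with `word_tokenize`, so results on real emails can differ.

```python
import re

# Hypothetical mini stopword list for illustration only;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "and", "is", "a", "to", "you", "have"}

def preprocess_demo(text):
    # 1. Replace every non-letter character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 2-3. Lowercase, then split on whitespace (stand-in for word_tokenize).
    tokens = letters_only.lower().split()
    # 4-5. Drop stopwords and single-character tokens.
    kept = [t for t in tokens if t not in DEMO_STOP_WORDS and len(t) > 1]
    return " ".join(kept)

print(preprocess_demo("You have WON a $1,000 prize!!! Click the link to claim."))
# → won prize click link claim

print(preprocess_demo("Don't miss it"))
# → don miss it
```

The second call shows a side effect of removing punctuation first: the apostrophe in a contraction becomes a space, so "Don't" splits into "don" and a stray "t", and the single-character filter then drops the "t".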

---

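Before that function runs, I use `strip_email_headers()` to remove the header metadata at the top of each email. It keeps everything after the first blank line, which is where the headers end.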
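
The sketch below shows how splitting on the first blank line behaves on a few made-up inputs. The sample strings and the `strip_email_headers_normalized` variant are assumptions for illustration; whether the dataset contains Windows-style `\r\n` line endings is not something this notebook checks.

```python
def strip_email_headers(text):
    # Keep everything after the first blank line (the header/body separator).
    return text.split('\n\n', 1)[-1]

def strip_email_headers_normalized(text):
    # Hypothetical variant: convert CRLF line endings to LF before splitting.
    return text.replace('\r\n', '\n').split('\n\n', 1)[-1]

sample = "From: a@example.com\nSubject: Hi\n\nHello there.\n\nBye."
print(repr(strip_email_headers(sample)))
# → 'Hello there.\n\nBye.'  (only the first blank line is used as the split point)

no_separator = "Subject: Hi"
print(repr(strip_email_headers(no_separator)))
# → 'Subject: Hi'  (no blank line, so the text comes back unchanged)

crlf_sample = "From: a@example.com\r\nSubject: Hi\r\n\r\nHello"
print(strip_email_headers(crlf_sample) == crlf_sample)
# → True  ("\r\n\r\n" contains no "\n\n", so nothing is stripped)
print(repr(strip_email_headers_normalized(crlf_sample)))
# → 'Hello'
```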

The final cleaned text is saved in a new column: `processed_text`.  
This version is now ready for TF-IDF vectorization in the next notebook.

Finally, I export the cleaned data as `processed_emails.csv` to reuse during modeling.

In [8]:
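import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stopword set once so membership checks are fast.
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Replace every non-letter character (punctuation, digits, symbols) with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)

    # Lowercase, then split into word tokens.
    words = word_tokenize(text.lower())

    # Drop stopwords and single-character tokens.
    filtered_words = [w for w in words if w not in stop_words and len(w) > 1]

    return " ".join(filtered_words)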


In [9]:
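def strip_email_headers(text):
    # Headers end at the first blank line; keep everything after it.
    # If there is no blank line, the text is returned unchanged.
    return text.split('\n\n', 1)[-1]

# Strip headers, then lowercase the remaining body text.
df['clean_text'] = df['text'].apply(strip_email_headers).str.lower()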

In [10]:
df['processed_text'] = df['clean_text'].apply(preprocess_text)
df[['label', 'processed_text']].head()

  label                                     processed_text
0  spam  nextpart content type text html charset iso co...
1  spam  mailings sent complying proposed unsolicited c...
2  spam  need health insurance addition featuring large...
3  spam  html align center font ptsize family sansserif...
4  spam  worldwide great restaurants shopping activitie...


In [11]:
df.to_csv('../data/processed_emails.csv', index=False)
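
One thing to watch when reloading this file: if any email ends up with an empty `processed_text` (for example, a body made up only of stopwords or symbols), pandas writes it as an empty field and `read_csv` reads it back as `NaN` by default. The sketch below reproduces that round trip on a two-row demo frame held in memory; the demo data is made up, and filling with `""` is just one possible way to handle it.

```python
import io
import pandas as pd

# Made-up demo frame: the second row has an empty processed_text.
demo = pd.DataFrame({
    "label": ["spam", "ham"],
    "processed_text": ["free prize click", ""],
})

# Round-trip through CSV in memory instead of writing a file.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as a missing value.
n_missing = int(reloaded["processed_text"].isna().sum())
print(n_missing)  # → 1

# Restore empty strings so downstream text processing sees str values only.
reloaded["processed_text"] = reloaded["processed_text"].fillna("")
```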