## Project Milestone 3

This project builds a binary classifier for email text that predicts whether a message is spam (1) or ham (0). Milestone 3 transforms the 83,448 raw emails in the combined_data.csv file into numerical feature matrices using three feature engineering techniques: text normalization, TF-IDF vectorization, and bag-of-bigrams counts. The goal is to give the classification model strong input features, which is the key advice for this milestone.

In [1]:
# Setup and Data Ingestion

import pandas as pd
import numpy as np
import re
import nltk

# Import required libraries for cleaning and vectorization
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

# --- DATA INGESTION ---
# CRITICAL: Loading the full dataset identified in Milestone 2.
file_name = 'combined_data.csv'
try:
    # We use 'latin-1' or 'ISO-8859-1' encoding to handle text data that may contain special characters.
    df = pd.read_csv(file_name, encoding='latin-1') 
    
    # Generic column cleanup for common dataset structure
    df.columns = ['label', 'text']
    df = df[['text', 'label']].copy() # Keep only the two required columns

    # Convert the label column to 0/1 (if not already)
    df['label'] = df['label'].astype(int)
    
    # Handle any potential missing values by filling with an empty string
    # (plain assignment; calling fillna(..., inplace=True) on a column selection is deprecated in pandas)
    df['text'] = df['text'].fillna('')
    
    print(f"Full Dataset Loaded: {file_name}")
    print(f"Shape: {df.shape}")
    print("\nFirst 5 Rows (for confirmation):")
    print(df.head())
    
except FileNotFoundError:
    print(f"ERROR: The file '{file_name}' was not found. Please ensure it is in the correct directory.")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\chani\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Full Dataset Loaded: combined_data.csv
Shape: (83448, 2)

First 5 Rows (for confirmation):
                                                text  label
0  ounce feather bowl hummingbird opec moment ala...      1
1  wulvob get your medircations online qnb ikud v...      1
2   computer connection from cnn com wednesday es...      0
3  university degree obtain a prosperous future m...      1
4  thanks for all your answers guys i know i shou...      0


### Feature Engineering 1

Text normalization is the first step. It reduces noise and standardizes the corpus by lowercasing, removing every character that is not a letter or whitespace (punctuation, symbols, and digits), and removing English stop words. Lemmatization then shrinks the vocabulary by mapping words to a base form. As called here, WordNetLemmatizer treats every token as a noun, so plurals collapse (e.g., "messages" to "message"), while verb forms such as "running" stay unchanged unless a part-of-speech tag is supplied. The resulting clean_text column is the input to both vectorizers that follow.

In [2]:
# Text Normalization Function and Application

# Build the stop-word set and lemmatizer once; recreating them for every
# email is needlessly slow on a dataset of this size.
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def normalize_text(text):
    # Treat missing values (None/NaN) as empty text
    if pd.isna(text):
        return ""
    
    # 1. Lowercasing
    text = str(text).lower()
    
    # 2. Keep only lowercase letters and whitespace (drops punctuation, symbols, and digits)
    text = re.sub(r'[^a-z\s]', '', text) 
    
    # 3. Tokenization (whitespace split)
    tokens = text.split()
    
    # 4. Stop Word Removal
    tokens = [word for word in tokens if word not in STOP_WORDS]
    
    # 5. Lemmatization (default part of speech is noun, e.g. "messages" -> "message")
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    
    # Rejoin tokens into a single clean string
    return ' '.join(tokens)

# Apply the normalization function to the text column
df['clean_text'] = df['text'].apply(normalize_text)

print("--- Feature Engineering 1: Text Normalization Results ---")
print(f"Original Text Sample: {df['text'].iloc[0]}")
print(f"Cleaned Text Sample: {df['clean_text'].iloc[0]}")

--- Feature Engineering 1: Text Normalization Results ---
Original Text Sample: ounce feather bowl hummingbird opec moment alabaster valkyrie dyad bread flack desperate iambic hadron heft quell yoghurt bunkmate divert afterimage
Cleaned Text Sample: ounce feather bowl hummingbird opec moment alabaster valkyrie dyad bread flack desperate iambic hadron heft quell yoghurt bunkmate divert afterimage
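
The first row is already lowercase and punctuation-free, so the sample above does not show what the regex step does to messier text. The sketch below is a minimal illustration and not the notebook's function: `toy_normalize`, `TOY_STOP_WORDS`, and the sample string are hypothetical, and the tiny stop-word set only stands in for NLTK's English list. It shows that deleting characters (as step 2 does) merges punctuation-joined words, whereas replacing them with a space keeps the words separate.

```python
import re

# Hypothetical mini stop-word list for illustration; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "to", "a", "your"}

def toy_normalize(text, replacement=""):
    """Lowercase, strip non-letters with the given replacement, drop toy stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", replacement, text)
    return [tok for tok in text.split() if tok not in TOY_STOP_WORDS]

sample = "Click-here to CLAIM your $1000 prize!!!"

deleted = toy_normalize(sample)                  # same regex behavior as step 2
spaced = toy_normalize(sample, replacement=" ")  # alternative: substitute a space

print(deleted)  # "click-here" collapses into one token
print(spaced)   # "click" and "here" stay separate
```

Whether merged tokens matter depends on how much punctuation the corpus still contains; on text that was already cleaned upstream, the two variants should behave almost identically.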


### Feature Engineering 2

The second step is TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, which often works better than raw word counts for spam classification. TF-IDF weights each word by how often it appears in an email (TF) and by how rare it is across the corpus (IDF), so words that occur in nearly every email are down-weighted while distinctive terms (e.g., "claim," "urgent") stand out. We cap the resulting sparse matrix at 5000 features for computational efficiency. Note that max_features keeps the 5000 terms with the highest frequency across the corpus, which is not necessarily the same as the most discriminative ones.

In [3]:
# TF-IDF Vectorization

# Initialize the TF-IDF Vectorizer
# Max features is set to 5000 to manage complexity and training time
tfidf_vectorizer = TfidfVectorizer(max_features=5000) 

# Fit the vectorizer to the clean text and transform the data
# The resulting matrix (X) is the input feature set for the classification model
tfidf_matrix = tfidf_vectorizer.fit_transform(df['clean_text'])

# Get the feature names (the words/tokens)
feature_names = tfidf_vectorizer.get_feature_names_out()

print("--- Feature Engineering 2: TF-IDF Vectorization Results ---")
print(f"Shape of TF-IDF Matrix (Rows x Features): {tfidf_matrix.shape}")
print(f"The number of unique features (words) retained is: {tfidf_matrix.shape[1]}")
print("\nFirst 10 Features Retained by TF-IDF:")
print(feature_names[:10])

--- Feature Engineering 2: TF-IDF Vectorization Results ---
Shape of TF-IDF Matrix (Rows x Features): (83448, 5000)
The number of unique features (words) retained is: 5000

First 10 Features Retained by TF-IDF:
['aa' 'ab' 'abbott' 'abc' 'ability' 'able' 'ableton' 'abroad' 'absence'
 'absolute']
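
To make the weighting concrete, the sketch below computes IDF by hand on a hypothetical three-email toy corpus. It assumes the smoothed formula that scikit-learn documents as its default (`smooth_idf=True`), idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it leaves out the L2 row normalization that `TfidfVectorizer` also applies by default, so the numbers are illustrative rather than identical to the matrix above.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already normalized
docs = [
    "claim prize now",
    "meeting agenda now",
    "claim urgent prize now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of emails that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF (assumed to match scikit-learn's documented default)
idf = {
    term: math.log((1 + n_docs) / (1 + df)) + 1
    for term, df in doc_freq.items()
}

# Unnormalized TF-IDF weights for the third email
tf = Counter(tokenized[2])
tfidf_doc3 = {term: count * idf[term] for term, count in tf.items()}

# Terms sorted from most common (lowest IDF) to rarest (highest IDF)
for term in sorted(idf, key=idf.get):
    print(f"{term:8s} df={doc_freq[term]} idf={idf[term]:.3f}")
```

"now" appears in every email and gets the minimum weight, while "urgent" appears once and gets the largest, which is the behavior the paragraph above relies on.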


### Feature Engineering 3

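The final step uses a Bag of N-Grams (Bigrams) to add context by counting sequences of two consecutive words (e.g., "ableton live"). Phrases can be more indicative of spam than individual words alone. Because the bigrams are built from the cleaned text, stop words are already gone: a phrase such as "click here" cannot appear ("here" is in NLTK's English stop-word list), and each pair joins words that are adjacent after cleaning. As with TF-IDF, max_features keeps the 5000 most frequent bigrams, giving the model phrase-level features that may improve predictive power once it is trained and evaluated.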

In [4]:
# Bag of N-Grams (Bigrams) Vectorization

# Initialize the Count Vectorizer for N-Grams
# ngram_range=(2, 2) specifies that we ONLY want Bigrams (pairs of words).
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), max_features=5000)

# Fit and transform the clean text
ngram_matrix = ngram_vectorizer.fit_transform(df['clean_text'])

# Get the feature names (the Bigrams)
ngram_feature_names = ngram_vectorizer.get_feature_names_out()

print("--- Feature Engineering 3: Bag of N-Grams (Bigrams) Results ---")
print(f"Shape of N-Gram Matrix (Rows x Bigram Features): {ngram_matrix.shape}")
print(f"The number of unique Bigram features is: {ngram_matrix.shape[1]}")
print("\nFirst 10 N-Gram Feature Names (Bigrams):")
print(ngram_feature_names[:10])

--- Feature Engineering 3: Bag of N-Grams (Bigrams) Results ---
Shape of N-Gram Matrix (Rows x Bigram Features): (83448, 5000)
The number of unique Bigram features is: 5000

First 10 N-Gram Feature Names (Bigrams):
['aac escapenumber' 'able find' 'able get' 'able work' 'ableton live'
 'ac check' 'ac cv' 'ac msg' 'ac subst' 'ac uk']


### Conclusion

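This milestone produced three feature sets for the model:
1.  Clean Text (the normalized clean_text column, which feeds both vectorizers and supports manual inspection)
2.  TF-IDF Matrix (83,448 x 5,000 sparse matrix of single-word importance)
3.  N-Gram Matrix (83,448 x 5,000 sparse matrix of bigram counts, capturing contextual phrases)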
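
If a later milestone feeds both matrices to a single classifier, one common approach is to stack them column-wise. The sketch below is an assumption about that next step, not part of this milestone: it uses a hypothetical toy corpus and `scipy.sparse.hstack`, and the variable names are illustrative.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-in for df['clean_text']
toy_clean_text = [
    "claim prize now",
    "meeting agenda monday",
    "claim urgent prize now",
]

# Same two vectorizer types as above, without the 5000-feature cap
tfidf = TfidfVectorizer().fit_transform(toy_clean_text)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(toy_clean_text)

# Column-wise stack: one row per email, unigram columns followed by bigram columns
combined = hstack([tfidf, bigrams]).tocsr()

print(tfidf.shape, bigrams.shape, combined.shape)
```

The combined matrix stays sparse, so applying the same step to the two 83,448-row matrices should remain memory-friendly.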