# Data Preprocessing

This notebook prepares the dataset produced by the EDA stage for modeling: it applies further text cleaning, handles class imbalance through resampling, and extracts TF-IDF features.

## Table of Contents
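1. [Introduction](#data-preprocessing)
2. [Loading the Data](#loading-the-data)
3. [Text Cleaning and Normalization](#text-cleaning-and-normalization)
4. [Handling Class Imbalance with Resampling](#handling-class-imbalance-with-resampling)
5. [Feature Extraction](#feature-extraction)
6. [Exporting the Data](#exporting-the-data)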

## Loading the Data

### 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

import nltk
from nltk.corpus import stopwords
# nltk.download("stopwords") # Uncomment this to download stopwords

import re
import string

### 2. Load Cleaned Data from EDA Stage

In [None]:
df = pd.read_csv('data/processed/eda_cleaned_spam.csv')

## Text Cleaning and Normalization

Apply additional text cleaning: convert to lowercase, remove punctuation and digits, and drop English stopwords.

In [None]:
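# Build the stopword set once; calling stopwords.words() for every word is slow
STOP_WORDS = set(stopwords.words('english'))
# Escape the punctuation so that "\" and "]" are matched literally inside the character class
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

def clean_text(text: str) -> str:
    text = text.lower()  # Lowercase
    text = PUNCT_RE.sub("", text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    words = text.split()  # Split into words
    words = [word for word in words if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(words)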

# Apply the cleaning function
df['message'] = df['message'].apply(clean_text)
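The cleaning steps can be tried on a single message without loading the dataset. The following is a minimal sketch, not the notebook's exact cell: `DEMO_STOP_WORDS` is a hypothetical stand-in for NLTK's English stopword list, `clean_text_demo` is an illustrative name, and the sample strings are made up.

```python
import re
import string

# Hypothetical stand-in for stopwords.words('english'), so the sketch runs without NLTK data
DEMO_STOP_WORDS = {"a", "the", "to", "you", "have", "is", "now"}

# re.escape keeps characters such as "\" and "]" literal inside the character class
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def clean_text_demo(text: str) -> str:
    text = text.lower()                    # Lowercase
    text = PUNCT_RE.sub("", text)          # Remove ASCII punctuation
    text = re.sub(r"\d+", "", text)        # Remove digits
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)


sample = "WINNER!! You have won a 1000 prize. Call 09061701461 to claim NOW"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Because punctuation is stripped before the stopword filter runs, a contraction such as "don't" becomes "dont" and may no longer match an entry in the stopword list.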

## Handling Class Imbalance with Resampling

In [None]:
# Seed for sampling techniques
rs = 42

# Convert the Series to a 2-D NumPy array; the samplers expect shape (n_samples, n_features)
X = df['message'].to_numpy().reshape(-1, 1)
y = df['label'].values

### 1. Oversampling the Minority Class

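Increase the number of spam messages by randomly duplicating existing ones (sampling with replacement). With its default settings, `RandomOverSampler` does this until both classes are the same size.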

In [None]:
ros = RandomOverSampler(random_state=rs) # Remove/change random_state for different results
resampled_ros: tuple[np.ndarray, np.ndarray] = ros.fit_resample(X, y)  # type: ignore (bug in type hint)
X_ros = resampled_ros[0]
y_ros = resampled_ros[1]

df_ros = pd.DataFrame({'label': y_ros, 'message': X_ros.flatten()})
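To make the idea behind random oversampling concrete, here is a small standard-library sketch. It illustrates the technique only and is not imbalanced-learn's implementation; `random_oversample` and the toy messages are hypothetical.

```python
import random
from collections import Counter


def random_oversample(messages, labels, seed=42):
    """Duplicate randomly chosen rows of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_messages, out_labels = list(messages), list(labels)
    for label, count in counts.items():
        pool = [m for m, l in zip(messages, labels) if l == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_messages.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_messages, out_labels


toy_messages = ["hi", "lunch?", "ok", "see you", "free prize", "win cash"]
toy_labels = ["ham", "ham", "ham", "ham", "spam", "spam"]
X_over, y_over = random_oversample(toy_messages, toy_labels)
print(Counter(y_over))
```

All original rows are kept; only copies of minority-class rows are appended, so the oversampled set contains exact duplicates.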

### 2. Undersampling the Majority Class

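Decrease the number of ham messages by randomly removing some of them. With its default settings, `RandomUnderSampler` samples without replacement until both classes are the same size.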

In [None]:
rus = RandomUnderSampler(random_state=rs) # Remove/change random_state for different results
resampled_rus: tuple[np.ndarray, np.ndarray] = rus.fit_resample(X, y) # type: ignore (bug in type hint)
X_rus = resampled_rus[0]
y_rus = resampled_rus[1]

df_rus = pd.DataFrame({'label': y_rus, 'message': X_rus.flatten()})
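Random undersampling can be sketched the same way with the standard library. Again, this is an illustration of the idea under stated assumptions rather than imbalanced-learn's code; `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_undersample(messages, labels, seed=42):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    target = min(Counter(labels).values())
    kept = []
    for label in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == label]
        kept.extend(rng.sample(idx, k=target))  # sampling without replacement
    kept.sort()  # preserve the original row order
    return [messages[i] for i in kept], [labels[i] for i in kept]


toy_messages = ["hi", "lunch?", "ok", "see you", "free prize", "win cash"]
toy_labels = ["ham", "ham", "ham", "ham", "spam", "spam"]
X_under, y_under = random_undersample(toy_messages, toy_labels)
print(Counter(y_under))
```

The trade-off relative to oversampling is that majority-class rows are discarded, so the balanced set is smaller than the original.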

## Feature Extraction

Convert text data into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).

In [None]:
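# TF-IDF vectorization, keeping at most the 5000 most frequent terms.
# Each dataset gets its own vectorizer: refitting one shared instance would
# overwrite its vocabulary, leaving only the last fit available.
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer_ros = TfidfVectorizer(max_features=5000)
vectorizer_rus = TfidfVectorizer(max_features=5000)

# Apply TF-IDF to the original, oversampled, and undersampled data
X_tfidf = vectorizer.fit_transform(df['message'])
X_tfidf_ros = vectorizer_ros.fit_transform(df_ros['message'])
X_tfidf_rus = vectorizer_rus.fit_transform(df_rus['message'])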
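To show what the TF-IDF weights represent, the sketch below computes them by hand for three made-up messages. It assumes scikit-learn's documented defaults for the weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), but it tokenizes on whitespace only, so it is not a drop-in replacement for `TfidfVectorizer`. The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalization, for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows


vocab, rows = tfidf(["free prize call", "call later", "free free entry"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first message, "prize" appears in only one document and therefore receives a higher weight than "call" or "free", which each appear in two.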

## Exporting the Data

After preprocessing, export the data to the `processed` folder for use in the subsequent modeling step.

In [None]:
import os

# Define the directory paths
processed_data_dir = 'data/processed'
original_dir = os.path.join(processed_data_dir, 'original')
ros_dir = os.path.join(processed_data_dir, 'ros')
rus_dir = os.path.join(processed_data_dir, 'rus')

# Create directories if they don't exist
os.makedirs(original_dir, exist_ok=True)
os.makedirs(ros_dir, exist_ok=True)
os.makedirs(rus_dir, exist_ok=True)

# Save the cleaned (non-resampled) data and its TF-IDF features
df.to_csv(os.path.join(original_dir, 'original_data.csv'), index=False)
np.save(os.path.join(original_dir, 'original_tfidf_features.npy'), X_tfidf.toarray()) # type: ignore (bug in method detection)

# Save preprocessed data and TF-IDF features after oversampling
df_ros.to_csv(os.path.join(ros_dir, 'ros_data.csv'), index=False)
np.save(os.path.join(ros_dir, 'ros_tfidf_features.npy'), X_tfidf_ros.toarray()) # type: ignore (bug in method detection)

# Save preprocessed data and TF-IDF features after undersampling
df_rus.to_csv(os.path.join(rus_dir, 'rus_data.csv'), index=False)
np.save(os.path.join(rus_dir, 'rus_tfidf_features.npy'), X_tfidf_rus.toarray()) # type: ignore (bug in method detection)
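Calling `.toarray()` converts each sparse TF-IDF matrix to a dense array before saving, which can take a lot of disk space and memory. The back-of-the-envelope estimate below assumes 8-byte `float64` values and the full 5000 features; `dense_size_mb` is a hypothetical helper and the row count is illustrative, not taken from the dataset.

```python
def dense_size_mb(n_rows: int, n_features: int = 5000, itemsize: int = 8) -> float:
    """Approximate size in megabytes (10**6 bytes) of a dense array with the given shape."""
    return n_rows * n_features * itemsize / 1e6


example_rows = 5_000  # hypothetical row count, not taken from the dataset
print(f"{dense_size_mb(example_rows):.0f} MB")
```

If the files turn out to be too large, one option is to keep the matrices sparse and save them with `scipy.sparse.save_npz`; the modeling step would then need to load them with `scipy.sparse.load_npz` instead of `np.load`.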
