# Data Preprocessing

This notebook prepares the spam dataset for modeling: it cleans the message text further, handles class imbalance, and extracts features.

## Table of Contents
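1. [Introduction](#data-preprocessing)
2. [Loading the Data](#loading-the-data)
3. [Text Cleaning and Normalization](#text-cleaning-and-normalization)
4. [Handling Class Imbalance](#handling-class-imbalance)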

## Loading the Data

### 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

import nltk
from nltk.corpus import stopwords
# nltk.download("stopwords") # Uncomment this to download stopwords

import re
import string

### 2. Load Cleaned Data from EDA Stage

In [2]:
df = pd.read_csv('data/processed/eda_cleaned_spam.csv')

## Text Cleaning and Normalization

Clean the text further: convert it to lowercase, strip punctuation and digits, and remove English stopwords.

In [3]:
# Build the stopword set once; calling stopwords.words() for every word is slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text: str) -> str:
    text = text.lower()  # Lowercase
    # Remove punctuation; re.escape ensures the backslash is stripped too
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r'\d+', '', text)  # Remove digits
    words = text.split()  # Split on whitespace
    words = [word for word in words if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(words)

# Apply the cleaning function
df['cleaned_message'] = df['message'].apply(clean_text)

# Convert the Series to a 2-D NumPy array; the resamplers expect shape (n_samples, n_features)
X = df['cleaned_message'].to_numpy().reshape(-1, 1)
y = df['label'].to_numpy()
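
As a quick sanity check of the cleaning steps described above, the sketch below runs the same pipeline on a single made-up message. To stay runnable without the NLTK download, it uses a tiny hypothetical stopword set (`DEMO_STOPWORDS`) and a separate function name (`clean_text_demo`); neither is part of the notebook.

```python
import re
import string

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOPWORDS = frozenset({"you", "have", "a", "to", "now", "the"})

def clean_text_demo(text: str, stop_words: frozenset = DEMO_STOPWORDS) -> str:
    text = text.lower()  # Lowercase
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Remove punctuation
    text = re.sub(r"\d+", "", text)  # Remove digits
    # Split on whitespace and drop stopwords
    return " ".join(word for word in text.split() if word not in stop_words)

# Made-up sample message, not taken from the dataset
sample = "WINNER!! You have won a $1000 prize. Call 09061701461 to claim now!"
print(clean_text_demo(sample))  # winner won prize call claim
```

Note that the digits and punctuation disappear entirely, so tokens such as the prize amount and the phone number leave no trace in the cleaned text.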

## Handling Class Imbalance

### Resampling Techniques

#### 1. Oversampling the Minority Class
Increase the number of spam messages by randomly duplicating existing ones until, with the default settings, both classes are the same size.

In [26]:
ros = RandomOverSampler(random_state=42) # Remove/change random_state for different results
resampled_ros: tuple[np.ndarray, np.ndarray] = ros.fit_resample(X, y)  # type: ignore (bug in library)
X_ros = resampled_ros[0]
y_ros = resampled_ros[1]

df_ros = pd.DataFrame({'message': X_ros.flatten(), 'label': y_ros})

#### 2. Undersampling the Majority Class
Decrease the number of ham messages by randomly removing some of them until, with the default settings, both classes are the same size.

In [27]:
rus = RandomUnderSampler(random_state=42) # Remove/change random_state for different results
resampled_rus: tuple[np.ndarray, np.ndarray] = rus.fit_resample(X, y) # type: ignore (bug in library)
X_rus = resampled_rus[0]
y_rus = resampled_rus[1]

df_rus = pd.DataFrame({'message': X_rus.flatten(), 'label': y_rus})

### Synthetic Data Generation

#### SMOTE (Synthetic Minority Over-sampling Technique)
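Generate synthetic examples for the minority class by interpolating between existing minority samples and their nearest neighbors. SMOTE works on numeric feature vectors, so the text must be vectorized before it is applied.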

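### Algorithmic Approaches

- **Use algorithms that handle imbalance well**: Tree ensembles such as Random Forests and Gradient Boosting often cope with imbalanced classes better than simpler models, although they can still benefit from resampling or class weighting.
- **Cost-sensitive learning**: Assign a higher penalty to misclassifying the minority class, so that missing a spam message costs the model more than mislabeling a ham message.

### Evaluation Metrics

- **Use appropriate metrics**: Accuracy can be misleading on imbalanced data, because a model that always predicts ham still scores highly. Use precision, recall, F1-score, and ROC-AUC to evaluate how well the model handles the minority class.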
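
The cost-sensitive learning and evaluation points above can be sketched together. The example below is one possible illustration: it uses scikit-learn's `class_weight="balanced"` option on a `LogisticRegression` and reports precision, recall, F1-score, and ROC-AUC. The synthetic dataset from `make_classification` and the choice of classifier are assumptions; the notebook does not specify a model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic numeric data standing in for vectorized messages: about 90% ham (0), 10% spam (1)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency,
# so errors on the minority class are penalized more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # Predicted probability of the spam class

# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))

# ROC-AUC is computed from scores, not hard predictions
roc_auc = roc_auc_score(y_test, y_score)
print(f"ROC-AUC: {roc_auc:.3f}")
```

Class weighting leaves the data untouched, so it can be an alternative to the resampling steps above when duplicating or discarding messages is undesirable.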