# Model Training and Evaluation

This notebook focuses on the training and evaluation of machine learning models for spam detection. We will use the preprocessed data from the previous step and apply various machine learning algorithms to classify messages as spam or ham. The notebook will cover model selection, training, hyperparameter tuning, and performance evaluation.

## Table of Contents

1. [Introduction](#model-training-and-evaluation)
2. [Loading the Data](#loading-the-data)
3. [Train-Test Split](#train-test-split)
4. [Model Selection](#model-selection)

## Loading the Data

### 1. Import Libraries

In [24]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import os

### 2. Load Preprocessed Data from Preprocessing Stage

In [25]:
# Define the directory paths
processed_data_dir = 'data/processed'
original_dir = os.path.join(processed_data_dir, 'original')
ros_dir = os.path.join(processed_data_dir, 'ros')
rus_dir = os.path.join(processed_data_dir, 'rus')

# Load original data
df_original = pd.read_csv(os.path.join(original_dir, 'original_data.csv'))
X_tfidf_original = np.load(os.path.join(original_dir, 'original_tfidf_features.npy'))
y_original = df_original['label']

# Load ROS data
df_ros = pd.read_csv(os.path.join(ros_dir, 'ros_data.csv'))
X_tfidf_ros = np.load(os.path.join(ros_dir, 'ros_tfidf_features.npy'))
y_ros = df_ros['label']

# Load RUS data
df_rus = pd.read_csv(os.path.join(rus_dir, 'rus_data.csv'))
X_tfidf_rus = np.load(os.path.join(rus_dir, 'rus_tfidf_features.npy'))
y_rus = df_rus['label']

datasets = {
    'original': (X_tfidf_original, y_original),
    'ros': (X_tfidf_ros, y_ros),
    'rus': (X_tfidf_rus, y_rus)
}

# Print the shapes of the loaded data
print("Original data shape:", df_original.shape)
print("Original TF-IDF features shape:", X_tfidf_original.shape)
print("ROS data shape:", df_ros.shape)
print("ROS TF-IDF features shape:", X_tfidf_ros.shape)
print("RUS data shape:", df_rus.shape)
print("RUS TF-IDF features shape:", X_tfidf_rus.shape)

Original data shape: (5572, 2)
Original TF-IDF features shape: (5572, 5000)
ROS data shape: (9650, 2)
ROS TF-IDF features shape: (9650, 5000)
RUS data shape: (1494, 2)
RUS TF-IDF features shape: (1494, 3847)


## Train-Test Split

Split the preprocessed data into training and testing sets to evaluate our model's performance on unseen data.

In [26]:
rs = 42

def split_data(X, y, test_size=0.2, random_state=rs):
    """
    Split data into training and testing sets.
    
    ## Parameters
    `X`: ndarray 
        Features.
    `y`: ndarray
        Labels.
    `test_size`: float, default=0.2
        Proportion of the dataset to include in the test split.
    `random_state`: int, default=42
        Random seed for reproducibility.
    
    ## Returns
    `dict`: Dictionary containing training and testing sets (X_train, X_test, y_train, y_test).
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
    return {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}

# Split the data for all datasets
data_splits = {
    'original': split_data(X_tfidf_original, y_original),
    'ros': split_data(X_tfidf_ros, y_ros),
    'rus': split_data(X_tfidf_rus, y_rus)
}
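
To make the effect of `stratify=y` concrete, the following is a minimal pure-Python sketch of the idea behind a stratified split: each class is shuffled and divided separately, so the training and test sets keep the same class ratio. The helper `stratified_split` and the toy labels are hypothetical and are not scikit-learn's implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumption for illustration): 80 ham (0) and 20 spam (1)
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))  # 64 ham, 16 spam
print("test: ", dict(test_counts))   # 16 ham, 4 spam
```

Both splits keep the 4:1 ham-to-spam ratio of the toy data, which is the property the `stratify` argument is meant to preserve.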

## Model Selection

In this section, we will train and evaluate several machine learning models for classifying messages as spam or ham, and compare them to see which performs best on our dataset. The models we will cover include:

1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Gradient Boosting
5. Support Vector Machine (SVM)
6. k-Nearest Neighbors (k-NN)
7. Naive Bayes

*Note*: Some algorithms, such as Random Forest and Gradient Boosting, can cope with class imbalance better than others, although how much this helps depends on the data.

We will analyze the following metrics in our evaluation of each model:
* **Accuracy** - The proportion of correctly classified instances among the total number of instances. It provides an overall assessment of the model's correctness.

* **Precision** - Also known as positive predictive value, measures the proportion of correctly predicted positive instances (true positives) among all predicted positive instances (true positives + false positives). It reflects the model's ability to avoid false positives.
    - "Of all the instances predicted as positive, how many are actually positive?"

* **Recall** - Also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances (true positives) among all actual positive instances (true positives + false negatives). It reflects the model's ability to identify all relevant instances.
    - "Of all the actual positive instances, how many were correctly predicted as positive?"

* **F1** - The harmonic mean of precision and recall. It provides a balance between precision and recall and is especially useful when dealing with imbalanced datasets.

*Note*: Accuracy might not be the best metric. Precision, recall, and F1-score can better evaluate the model's performance on imbalanced data.
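
The definitions above can be worked through with a small example. This is a sketch using hypothetical confusion-matrix counts (not output from this notebook) and assuming spam is the positive class. The helper `classification_metrics` is illustrative; the notebook itself uses the scikit-learn metric functions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted (or labelled) positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical imbalanced test set: 149 spam and 966 ham messages
careful = classification_metrics(tp=104, fp=1, fn=45, tn=965)

# A baseline that labels every message as ham
always_ham = classification_metrics(tp=0, fp=0, fn=149, tn=966)

for name, metrics in [('careful', careful), ('always_ham', always_ham)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

In this example the always-ham baseline still reaches about 0.866 accuracy while its recall is 0, which is why precision, recall, and F1 are reported alongside accuracy.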

In [27]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """
    Evaluate the performance of a machine learning model.
    
    ## Parameters
    `model`: sklearn estimator object
        The model to evaluate.
    `X_train`: array-like
        Training features.
    `X_test`: array-like
        Testing features.
    `y_train`: array-like
        Training labels.
    `y_test`: array-like
        Testing labels.
    
    ## Returns
    `dict`: Dictionary containing evaluation metrics (accuracy, precision, recall, f1).
    """
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Store evaluation metrics in a dictionary
    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
    
    return metrics

In [34]:
def cross_validate_model(model, X, y, n_splits=10, random_state=rs):
    """
    Perform cross-validation to evaluate the performance of a machine learning model.
    
    ## Parameters
    `model`: sklearn estimator object
        The model to evaluate.
    `X`: array-like
        Features.
    `y`: array-like
        Labels.
    `n_splits`: int, default=10
        Number of folds for cross-validation.
    `random_state`: int, default=42
        Random seed for reproducibility.

    ## Returns
    `dict`: Dictionary containing evaluation metrics (accuracy, precision, recall, f1).
    """

    # Initialize StratifiedKFold cross-validator
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    # Initialize dictionary to store evaluation metrics
    scores = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

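    # Convert labels to a NumPy array so the fold indices are applied positionally
    # (y[train_index] on a pandas Series is label-based and relies on a default index)
    y = np.asarray(y)

    # Iterate over each fold
    for train_index, test_index in skf.split(X, y):
        # Split the data into training and testing sets for the current fold
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]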

        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate evaluation metrics for the current fold
        scores['accuracy'].append(accuracy_score(y_test, y_pred))
        scores['precision'].append(precision_score(y_test, y_pred))
        scores['recall'].append(recall_score(y_test, y_pred))
        scores['f1'].append(f1_score(y_test, y_pred))

    # Calculate average metrics across all folds
    metrics = {}
    for metric, values in scores.items():
        metrics[metric] = np.mean(values)

    return metrics
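
The function above reports only the mean of each metric across folds. When comparing datasets or models, the spread across folds can also be informative. The following is a minimal sketch of one way to summarize it; `summarize_scores` is a hypothetical helper and the per-fold values are made up for illustration.

```python
from statistics import mean, stdev

def summarize_scores(scores):
    """Return {metric: (mean, sample standard deviation)} for per-fold score lists."""
    return {metric: (mean(values), stdev(values)) for metric, values in scores.items()}

# Hypothetical per-fold values, for illustration only
fold_scores = {
    'recall': [0.70, 0.72, 0.68, 0.71, 0.69],
    'precision': [0.98, 0.97, 0.99, 0.98, 0.98],
}

summary = summarize_scores(fold_scores)
for metric, (avg, spread) in summary.items():
    print(f"{metric}: {avg:.3f} +/- {spread:.3f}")
```

A dictionary shaped like the `scores` variable inside `cross_validate_model` could be passed to such a helper before the averaging step.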

### 1. Logistic Regression

Logistic Regression is a linear model commonly used for binary classification problems. It predicts the probability of a binary outcome using a logistic function.
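
As a rough illustration of the logistic function mentioned above, the sketch below computes the predicted probability for one sample from a linear score. The weights, bias, and feature values are hypothetical and are not fitted on the spam data; `predict_proba` here is a plain helper, not the scikit-learn method of the same name.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample under a linear model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical parameters for a 3-feature model (assumption for illustration)
weights = [2.0, -1.0, 0.5]
bias = -0.5
sample = [0.8, 0.1, 0.4]

p_positive = predict_proba(sample, weights, bias)
label = int(p_positive >= 0.5)  # 0.5 is the usual default decision threshold
print(f"p = {p_positive:.3f}, predicted label = {label}")
```

Here the linear score is 1.2, which the logistic function maps to a probability of about 0.769, so the sample is assigned to the positive class.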

In [56]:
# Define the Logistic Regression model
logreg_model = LogisticRegression(random_state=rs)

# Dictionary to store evaluation metrics for each dataset
evaluation_results = {}

# Iterate over each dataset
for dataset_name, dataset_splits in data_splits.items():
    # Look up the full (unsplit) features and labels by name for cross-validation
    X, y = datasets[dataset_name]
    # Extract the training and testing splits
    X_train, y_train = dataset_splits['X_train'], dataset_splits['y_train']
    X_test, y_test = dataset_splits['X_test'], dataset_splits['y_test']

    # Evaluate the model using train-test split method
    evaluation_results[dataset_name] = {}
    evaluation_results[dataset_name]['train_test_split'] = evaluate_model(logreg_model, X_train, X_test, y_train, y_test)
    # Evaluate the model using cross-validation method
    evaluation_results[dataset_name]['cross_validation'] = cross_validate_model(logreg_model, X, y)

# Display evaluation metrics for each dataset and method
print("Logistic Regression Results\n", '-'*27, sep='')
print("{:^10} | {:^10} | {:^18} | {:^18} |".format("Dataset", "Metric", "Train Test Split", "Cross Validation"))
for dataset_name, methods in evaluation_results.items():
    print(f"{dataset_name:^10} | ", end="")
    metrics_split = methods['train_test_split']
    metrics_cross = methods['cross_validation']
    for metric_name in metrics_split.keys():
        v1 = metrics_split[metric_name]
        v2 = metrics_cross[metric_name]
        if metric_name != 'accuracy':
            print("{:^10} | ".format(''), end="")
        print("{:<10} | ".format(metric_name.capitalize()), end="")
        print("{:^18.3f} | {:^18.3f} |".format(v1, v2))
    print()


Logistic Regression Results
---------------------------
 Dataset   |   Metric   |  Train Test Split  |  Cross Validation  |
 original  | Accuracy   |       0.959        |       0.958        |
           | Precision  |       0.990        |       0.978        |
           | Recall     |       0.698        |       0.703        |
           | F1         |       0.819        |       0.817        |

   ros     | Accuracy   |       0.990        |       0.988        |
           | Precision  |       0.990        |       0.990        |
           | Recall     |       0.991        |       0.987        |
           | F1         |       0.990        |       0.988        |

   rus     | Accuracy   |       0.930        |       0.934        |
           | Precision  |       0.985        |       0.982        |
           | Recall     |       0.872        |       0.885        |
           | F1         |       0.925        |       0.930        |



Based on the results presented:

* Both the train-test split and cross-validation methods yielded similar results across all metrics and datasets, suggesting consistency in model performance. 

1. **Original Dataset**:
    * Without any handling of class imbalance, the model reached high accuracy (about 0.96) and precision (0.98 to 0.99) on the original dataset, but recall was only about 0.70 and F1 about 0.82.
    * In other words, the model rarely raises false positives, but it misses roughly 30% of the actual positive instances.

2. **ROS (Random Over-Sampling) Dataset**:
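    * The ROS dataset scored highest on accuracy, recall, and F1 (all about 0.99) and matched or exceeded the other datasets on precision (0.990).
    * Because oversampling was applied before the split, copies of the same message can appear in both the training and test sets, which may inflate these scores.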

3. **RUS (Random Under-Sampling) Dataset**:
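    * The RUS dataset improved recall (0.87 to 0.89) and F1 (about 0.93) over the original, while precision stayed similar (about 0.98) and accuracy was lower (about 0.93 versus 0.96). It trailed the ROS dataset on every metric.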

Overall Summary:
* Both oversampling (ROS) and undersampling (RUS) substantially improved recall and F1 compared with the original dataset, while precision stayed roughly the same. Accuracy rose with ROS but fell slightly with RUS.
* The ROS dataset showed the highest scores under both the train-test split and cross-validation.
* These findings suggest that resampling helps the model detect the minority class. However, each dataset is evaluated on its own resampled test split, so the scores are not directly comparable. Confirming a gain in generalization would require evaluating all models on a common test set with the original class distribution.
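
Since the datasets in this notebook are resampled before they are split, one way to check the comparison is to hold out a test set first and resample only the training portion. The following is a minimal pure-Python sketch of that ordering under stated assumptions: `oversample_minority` is a hypothetical helper, the data is a toy list of message ids, and the notebook's own preprocessing may differ.

```python
import random

def oversample_minority(samples, labels, seed=42):
    """Duplicate randomly chosen minority-class samples until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels

# Toy data (assumption): message ids 0..99, the last 20 are spam (label 1)
ids = list(range(100))
labels = [0] * 80 + [1] * 20

# 1. Hold out a test set first (every fifth message, which keeps the 80/20 ratio here)
test_ids = [i for i in ids if i % 5 == 0]
train_ids = [i for i in ids if i % 5 != 0]

# 2. Oversample the training portion only; the test set is left untouched
train_labels = [labels[i] for i in train_ids]
train_ids_ros, train_labels_ros = oversample_minority(train_ids, train_labels)

print("train size after ROS:", len(train_ids_ros))
print("overlap with test set:", len(set(train_ids_ros) & set(test_ids)))
```

With this ordering no test message can appear in the training data, and the test set keeps the original class distribution, so scores for the original, ROS, and RUS variants are measured on the same footing.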