# Model Training and Evaluation

This notebook focuses on the training and evaluation of machine learning models for spam detection. We will use the preprocessed data from the previous step and apply various machine learning algorithms to classify messages as spam or ham. The notebook will cover model selection, training, hyperparameter tuning, and performance evaluation.

## Table of Contents

1. [Introduction](#model-training-and-evaluation)
2. [Loading the Data](#loading-the-data)
3. [Train-Test Split](#train-test-split)
4. [Model Selection](#model-selection)

## Loading the Data

### 1. Import Libraries

In [15]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import os

### 2. Load Preprocessed Data from Preprocessing Stage

In [3]:
# Define the directory paths
processed_data_dir = 'data/processed'
original_dir = os.path.join(processed_data_dir, 'original')
ros_dir = os.path.join(processed_data_dir, 'ros')
rus_dir = os.path.join(processed_data_dir, 'rus')

# Load original data
df_original = pd.read_csv(os.path.join(original_dir, 'original_data.csv'))
X_tfidf_original = np.load(os.path.join(original_dir, 'original_tfidf_features.npy'))
y_original = df_original['label']

# Load ROS data
df_ros = pd.read_csv(os.path.join(ros_dir, 'ros_data.csv'))
X_tfidf_ros = np.load(os.path.join(ros_dir, 'ros_tfidf_features.npy'))
y_ros = df_ros['label']

# Load RUS data
df_rus = pd.read_csv(os.path.join(rus_dir, 'rus_data.csv'))
X_tfidf_rus = np.load(os.path.join(rus_dir, 'rus_tfidf_features.npy'))
y_rus = df_rus['label']

datasets = {
    'original': (X_tfidf_original, y_original),
    'ros': (X_tfidf_ros, y_ros),
    'rus': (X_tfidf_rus, y_rus)
}

# Print the shapes of the loaded data
print("Original data shape:", df_original.shape)
print("Original TF-IDF features shape:", X_tfidf_original.shape)
print("ROS data shape:", df_ros.shape)
print("ROS TF-IDF features shape:", X_tfidf_ros.shape)
print("RUS data shape:", df_rus.shape)
print("RUS TF-IDF features shape:", X_tfidf_rus.shape)

Original data shape: (5572, 2)
Original TF-IDF features shape: (5572, 5000)
ROS data shape: (9650, 2)
ROS TF-IDF features shape: (9650, 5000)
RUS data shape: (1494, 2)
RUS TF-IDF features shape: (1494, 3847)


## Train-Test Split

Split the preprocessed data into training and testing sets to evaluate our model's performance on unseen data.

In [4]:
rs = 42

def split_data(X, y, test_size=0.2, random_state=rs):
    """
    Split data into training and testing sets.
    
    ## Parameters
    `X`: ndarray 
        Features.
    `y`: ndarray
        Labels.
    `test_size`: float, default=0.2
        Proportion of the dataset to include in the test split.
    `random_state`: int, default=42
        Random seed for reproducibility.
    
    ## Returns
    `dict`: Dictionary containing training and testing sets (X_train, X_test, y_train, y_test).
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
    return {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}

# Split the data for all datasets
data_splits = {
    'original': split_data(X_tfidf_original, y_original),
    'ros': split_data(X_tfidf_ros, y_ros),
    'rus': split_data(X_tfidf_rus, y_rus)
}
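
As a quick check of what `stratify=y` does in `split_data`, the sketch below runs the same `train_test_split` call on synthetic data and compares the spam rate in each split. The arrays `X_demo` and `y_demo` and the 13% spam rate are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 13% labelled 1 (spam).
rng = np.random.default_rng(42)
X_demo = rng.random((1000, 5))
y_demo = np.array([1] * 130 + [0] * 870)

# Same arguments as split_data: 20% test size, fixed seed, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep (approximately) the overall class ratio.
print("overall spam rate:", y_demo.mean())
print("train spam rate:  ", y_tr.mean())
print("test spam rate:   ", y_te.mean())
```

Without `stratify`, a small test set can end up with noticeably more or less spam than the full dataset, which makes recall and precision estimates noisier.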

## Model Selection

*Note*: The `Train and Evaluate` cell blocks may take a while to finish computing.

In this section, we will explore and evaluate different machine learning models to classify spam emails. We will use various models to understand which one performs best for our dataset. The models we will cover include:

1. [Logistic Regression](#1-logistic-regression)
2. [Decision Tree](#2-decision-tree)
3. [Random Forest](#3-random-forest)
4. [Gradient Boosting](#4-gradient-boosting)
5. Support Vector Machine (SVM)
6. k-Nearest Neighbors (k-NN)
7. Naive Bayes

*Note*: Tree ensembles such as Random Forest and Gradient Boosting are often more robust to class imbalance than a single linear model, although they can still favor the majority class.

We will analyze the following metrics in our evaluation of each model:
* **Accuracy** - The proportion of correctly classified instances among the total number of instances. It provides an overall assessment of the model's correctness.

* **Precision** - Also known as positive predictive value, precision measures the proportion of correctly predicted positive instances (true positives) among all predicted positive instances (true positives + false positives). It reflects the model's ability to avoid false positives.
    - "Of all the instances predicted as positive, how many are actually positive?"

* **Recall** - Also known as sensitivity or true positive rate, recall measures the proportion of correctly predicted positive instances (true positives) among all actual positive instances (true positives + false negatives). It reflects the model's ability to identify all relevant instances.
    - "Of all the actual positive instances, how many were correctly predicted as positive?"

* **F1** - The harmonic mean of precision and recall. It provides a balance between precision and recall and is especially useful when dealing with imbalanced datasets.

*Note*: Accuracy might not be the best metric. Precision, recall, and F1-score can better evaluate the model's performance on imbalanced data.
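
To make the definitions above concrete, here is a small worked example that computes the four metrics by hand. The confusion counts are hypothetical (1,000 messages, 100 of them spam) and are not taken from the notebook's results.

```python
# Hypothetical confusion counts (assumption): 1,000 messages, 100 of them spam.
tp, fp, fn, tn = 60, 5, 40, 895
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # of predicted spam, how much is really spam
recall = tp / (tp + fn)               # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

# Baseline: a "classifier" that labels every message as ham.
always_ham_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(f"always-ham accuracy={always_ham_accuracy:.3f}")
```

In this example the model misses 40% of the spam yet still scores 95.5% accuracy, and a model that never predicts spam would already reach 90%. This is why recall and F1 are reported alongside accuracy.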

In [5]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """
    Evaluate the performance of a machine learning model.
    
    ## Parameters
    `model`: sklearn estimator object
        The model to evaluate.
    `X_train`: array-like
        Training features.
    `X_test`: array-like
        Testing features.
    `y_train`: array-like
        Training labels.
    `y_test`: array-like
        Testing labels.
    
    ## Returns
    `tuple[dict, model]`: Tuple containing dictionary of evaluation metrics (accuracy, precision, recall, f1) and the fitted model.
    """
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Store evaluation metrics in a dictionary
    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
    
    return metrics, model

In [6]:
def cross_validate_model(model, X, y, n_splits=10, random_state=rs):
    """
    Perform cross-validation to evaluate the performance of a machine learning model.
    
    ## Parameters
    `model`: sklearn estimator object
        The model to evaluate.
    `X`: array-like
        Features.
    `y`: array-like
        Labels.
    `n_splits`: int, default=10
        Number of folds for cross-validation.
    `random_state`: int, default=42
        Random seed for reproducibility.

    ## Returns
    `dict`: Dictionary containing evaluation metrics (accuracy, precision, recall, f1).
    """

    # Initialize StratifiedKFold cross-validator
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    # Initialize dictionary to store evaluation metrics
    scores = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

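    # Convert to arrays so the fold indices are applied positionally
    # (label-based indexing on a pandas Series breaks if its index is not 0..n-1)
    X, y = np.asarray(X), np.asarray(y)

    # Iterate over each fold
    for train_index, test_index in skf.split(X, y):
        # Split the data into training and testing sets for the current fold
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]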

        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate evaluation metrics for the current fold
        scores['accuracy'].append(accuracy_score(y_test, y_pred))
        scores['precision'].append(precision_score(y_test, y_pred))
        scores['recall'].append(recall_score(y_test, y_pred))
        scores['f1'].append(f1_score(y_test, y_pred))

    # Calculate average metrics across all folds
    metrics = {}
    for metric, values in scores.items():
        metrics[metric] = np.mean(values)

    return metrics
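
For comparison, scikit-learn's `cross_validate` can compute the same fold-averaged metrics in a single call. The sketch below is an alternative to the manual loop, not what the notebook runs; it uses synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `cv_metrics` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (assumption): roughly 15% positives.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.85, 0.15], random_state=42
)

metric_names = ["accuracy", "precision", "recall", "f1"]
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# cross_validate fits a fresh clone of the estimator on each fold
# and returns one score per fold under the key "test_<metric>".
cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=skf, scoring=metric_names
)

# Average each metric across the folds, mirroring cross_validate_model.
cv_metrics = {name: float(np.mean(cv_results[f"test_{name}"])) for name in metric_names}
print(cv_metrics)
```

The manual loop in `cross_validate_model` keeps every step visible, while `cross_validate` is shorter and leaves the estimator passed in unfitted.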

In [18]:
from sklearn.base import clone

def train_and_evaluate_model(model, data_splits, datasets=None):
    """
    Train and evaluate a model using train-test split and cross-validation methods.

    ## Parameters
    `model`: sklearn estimator
        The model to be trained and evaluated.
    `data_splits`: dict
        Dictionary containing the train-test splits for each dataset.
    `datasets`: dict (optional)
        Dictionary containing the full datasets for cross-validation.

    ## Returns
    `tuple[dict, model]`: Tuple containing a dictionary with evaluation metrics for each dataset and method, and the model fitted on the ROS training split (or on the first dataset if there is no ROS split).

    """
    # Dictionary to store evaluation metrics for each dataset and method
    evaluation_results = {}
    output_model = None

    for dataset_name, dataset_splits in data_splits.items():
        # Extract the training and testing splits
        X_train, y_train = dataset_splits['X_train'], dataset_splits['y_train']
        X_test, y_test = dataset_splits['X_test'], dataset_splits['y_test']

        # Evaluate a fresh copy of the model using train-test split method,
        # so that a later fit cannot overwrite an earlier fitted model
        evaluation_results[dataset_name] = {}
        evaluation_results[dataset_name]['train_test_split'], fitted_model = evaluate_model(
            clone(model), X_train, X_test, y_train, y_test
        )

        # Return trained ROS model (best model based on below evaluations)
        if dataset_name == 'ros' or output_model is None:
            output_model = fitted_model

    # Evaluate the model using cross-validation method
    if datasets:
        for dataset_name, (X, y) in datasets.items():
            results = evaluation_results.setdefault(dataset_name, {})
            results['cross_validation'] = cross_validate_model(clone(model), X, y)

    return evaluation_results, output_model

In [21]:
def print_evaluation_summary(model_name, evaluation_results):
    """
    Print the evaluation summary for each dataset and method.

    ## Parameters
    `model_name`: str
        Name of the model, used in the summary header.
    `evaluation_results`: dict
        Dictionary with evaluation metrics for each dataset and method.
    """
    print(f"{model_name} Results\n", '-'*(len(model_name)+8), sep='')
    # Show the cross-validation column only if at least one dataset has those results
    has_cross_validation = any('cross_validation' in methods for methods in evaluation_results.values())
    if has_cross_validation:
        print("{:^10} | {:^10} | {:^18} | {:^18} |".format("Dataset", "Metric", "Train Test Split", "Cross Validation"))
    else:
        print("{:^10} | {:^10} | {:^18} |".format("Dataset", "Metric", "Train Test Split"))

    for dataset_name, methods in evaluation_results.items():
        print(f"{dataset_name:^10} | ", end="")
        metrics_split = methods['train_test_split']
        metrics_cross = methods.get('cross_validation', None)
        first_metric = True
        for metric_name in metrics_split.keys():
            v1 = "{:.3f} %".format(metrics_split[metric_name] * 100)
            v2 = "{:.3f} %".format(metrics_cross.get(metric_name) * 100) if metrics_cross else None
            if not first_metric:
                print("{:^10} | ".format(''), end="")
            first_metric = False
            print("{:<10} | ".format(metric_name.capitalize()), end="")
            if v2:
                print("{:^18} | {:^18} |".format(v1, v2))
            else:
                print("{:^18} |".format(v1))
        print()

### 1. Logistic Regression

Logistic Regression is a linear model commonly used for binary classification problems. It passes a weighted sum of the input features through the logistic (sigmoid) function, which outputs a probability between 0 and 1. That probability is then mapped to one of the two classes by applying a threshold, typically 0.5.

It often serves as a good baseline model despite its simplicity, making it a valuable tool for initial analysis before moving on to more complex models. 
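
The mapping from a linear score to a class label can be sketched in a few lines of plain Python. This is an illustration of the idea only, not scikit-learn's implementation; `sigmoid`, `predict_label`, and the 0.5 threshold are names and values chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    # Branching on the sign avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_label(z: float, threshold: float = 0.5) -> int:
    """Return 1 (spam) if the probability reaches the threshold, else 0 (ham)."""
    return int(sigmoid(z) >= threshold)

# z stands for the weighted sum of features plus the intercept.
for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_label(score))
```

Raising the threshold above 0.5 would trade recall for precision, which can matter when flagging a legitimate message as spam is the costlier mistake.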

#### Train and Evaluate (~1m 15s)

In [9]:
# Define the Logistic Regression model
logreg_model = LogisticRegression(random_state=rs)

# Train and evaluate the model
evaluation_results, logreg_model = train_and_evaluate_model(logreg_model, data_splits, datasets)

#### Results

In [10]:
# Print the evaluation summary
print_evaluation_summary('Logistic Regression', evaluation_results)

Logistic Regression Results
---------------------------
 Dataset   |   Metric   |  Train Test Split  |  Cross Validation  |
 original  | Accuracy   |      95.874 %      |      95.800 %      |
           | Precision  |      99.048 %      |      97.768 %      |
           | Recall     |      69.799 %      |      70.294 %      |
           | F1         |      81.890 %      |      81.739 %      |

   ros     | Accuracy   |      99.016 %      |      98.850 %      |
           | Precision  |      98.965 %      |      99.003 %      |
           | Recall     |      99.067 %      |      98.694 %      |
           | F1         |      99.016 %      |      98.848 %      |

   rus     | Accuracy   |      92.977 %      |      93.443 %      |
           | Precision  |      98.485 %      |      98.232 %      |
           | Recall     |      87.248 %      |      88.494 %      |
           | F1         |      92.527 %      |      93.019 %      |



Based on the results presented:

* Both the train-test split and cross-validation methods yielded similar results across all metrics and datasets, suggesting consistency in model performance.

**Note**: We will only perform the train-test split method for subsequent models to decrease computation time. 

1. **Original Dataset**:
    * Without any correction for class imbalance, the model reached high accuracy (about 95.9%) and precision (about 99.0%) on the original dataset, but recall was only about 70%, which pulled F1 down to about 82%.
    * In other words, messages flagged as spam were almost always spam, but roughly 30% of the actual spam went undetected.

2. **ROS (Random Over-Sampling) Dataset**:
    * The ROS dataset produced the highest accuracy, recall, and F1 of the three datasets (all around 99%), with precision on par with the original dataset (about 99%).

3. **RUS (Random Under-Sampling) Dataset**:
    * The RUS dataset improved recall (about 87-88%) and F1 (about 93%) over the original dataset, at the cost of lower accuracy (about 93%). It trailed the ROS dataset on every metric.

Overall Summary:
* Both oversampling (ROS) and undersampling (RUS) substantially improved recall and F1, the metrics most affected by class imbalance, while precision stayed at roughly 98-99%.
* The ROS dataset showed the most promising results, with all metrics near 99% under both the train-test split and cross-validation methods.
* These findings suggest that addressing class imbalance through resampling improves the model's ability to detect spam. Two caveats apply: each dataset is scored on its own test split, so the numbers are not strictly comparable, and because the ROS data was oversampled before splitting, duplicated spam messages may appear in both the training and test sets, which can make the ROS scores optimistic.
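
A common way to keep oversampled duplicates out of the test set is to split first and then oversample only the training portion. The sketch below shows that ordering on synthetic data. It is not how the notebook's ROS dataset was built, and `oversample_minority`, `X_demo`, and `y_demo` are hypothetical names introduced for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def oversample_minority(X, y, random_state=42):
    """Randomly duplicate minority-class rows until both classes have equal size.

    Hypothetical helper for binary labels; not part of the notebook.
    """
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Synthetic stand-in data (assumption): 1,000 rows, 13% labelled 1 (spam).
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 5))
y_demo = np.array([1] * 130 + [0] * 870)

# 1) Split first, so the test set contains only original, unduplicated rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2) Oversample the training portion only.
X_tr_bal, y_tr_bal = oversample_minority(X_tr, y_tr)

print("balanced train class counts:", np.bincount(y_tr_bal))
print("untouched test class counts:", np.bincount(y_te))
```

Scores measured this way reflect the original class balance, so they are usually lower, but more realistic, than scores measured on an oversampled test set.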

### 2. Decision Tree

Decision Tree is a non-linear supervised learning algorithm used for both classification and regression tasks. It works by splitting the data into subsets based on the value of input features, forming a tree-like model of decisions. 
* Each internal node represents a decision based on an attribute.
* Each branch represents an outcome of that decision.
* Each leaf node represents a class label or a continuous value.

Unlike linear models, Decision Trees can capture non-linear relationships between features and the target variable. This flexibility allows the model to fit more complex patterns in the data, which can be beneficial for identifying spam emails.

Decision Trees perform implicit feature selection: each split uses the feature that best separates the classes, so uninformative features are simply never used. With thousands of TF-IDF features, this lets the model concentrate on a small subset of informative terms.
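
The implicit feature selection described above can be observed through the fitted tree's `feature_importances_` attribute, which is zero for any feature that never appears in a split. A minimal sketch on synthetic data, where (by construction, as an assumption of the example) only the first two of six features determine the label:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): six features, but the label depends only on
# features 0 and 1; features 2-5 are pure noise.
rng = np.random.default_rng(42)
X_demo = rng.random((400, 6))
y_demo = ((X_demo[:, 0] > 0.5) & (X_demo[:, 1] > 0.3)).astype(int)

tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Features with non-zero importance are the ones the tree actually split on.
used = np.flatnonzero(tree.feature_importances_ > 0)
print("features used in splits:", used)
print("importances:", tree.feature_importances_.round(3))
```

On real TF-IDF data the picture is less clean, because many terms carry a little signal, but the same attribute shows which terms the tree relied on.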

#### Train and Evaluate (~2m 10s)

In [11]:
# Define the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=rs)

# Train and evaluate the Decision Tree model
dt_evaluation_results, dt_model = train_and_evaluate_model(dt_model, data_splits)

#### Results

In [12]:
# Print the evaluation summary
print_evaluation_summary('Decision Tree', dt_evaluation_results)

Decision Tree Results
---------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |      95.874 %      |
           | Precision  |      85.034 %      |
           | Recall     |      83.893 %      |
           | F1         |      84.459 %      |

   ros     | Accuracy   |      97.824 %      |
           | Precision  |      95.829 %      |
           | Recall     |     100.000 %      |
           | F1         |      97.870 %      |

   rus     | Accuracy   |      90.970 %      |
           | Precision  |      89.610 %      |
           | Recall     |      92.617 %      |
           | F1         |      91.089 %      |



Based on the results presented:

The ROS dataset once again produced the strongest results, with the Decision Tree reaching 97.8% accuracy and 100% recall. Precision (95.8%) was about three points below the Logistic Regression model (99.0%), so the tree flags somewhat more legitimate messages as spam. The overall gap is small, however, and the ROS dataset remains a viable option for building a spam classifier.

### 3. Random Forest

Random Forest is a versatile ensemble learning method for classification and regression tasks. It operates by constructing a multitude of decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

By aggregating predictions from multiple decision trees, Random Forest reduces the risk of overfitting. It also provides a feature importance score, which indicates the contribution of each feature in making accurate predictions. This can be useful for feature selection and gaining insights into the most influential factors driving the classification task.
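
The aggregation step can be illustrated in plain Python: a majority vote for classification and a mean for regression. This is a conceptual sketch with made-up per-tree predictions; scikit-learn's `RandomForestClassifier` is documented as averaging the trees' predicted class probabilities rather than counting hard votes, so it is not the library's exact rule.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Classification: return the most common label among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def mean_prediction(tree_predictions):
    """Regression: return the average of the trees' numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)

# Hypothetical predictions from five trees for a single message.
votes = ["spam", "ham", "spam", "spam", "ham"]
print(majority_vote(votes))               # spam (3 of 5 votes)
print(mean_prediction([2.0, 3.0, 4.0]))   # 3.0
```

Because each tree is trained on a different bootstrap sample and feature subset, individual errors tend to differ between trees, and aggregating them smooths those errors out.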

#### Train and Evaluate (~2m 5s)

In [13]:
# Initialize Random Forest model
rf_model = RandomForestClassifier(random_state=rs)

# Train and evaluate the model
evaluation_results_rf, rf_model = train_and_evaluate_model(rf_model, data_splits)

#### Results

In [14]:
# Output summary of evaluation metrics
print_evaluation_summary('Random Forest', evaluation_results_rf)

Random Forest Results
---------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |      97.578 %      |
           | Precision  |     100.000 %      |
           | Recall     |      81.879 %      |
           | F1         |      90.037 %      |

   ros     | Accuracy   |      99.948 %      |
           | Precision  |      99.896 %      |
           | Recall     |     100.000 %      |
           | F1         |      99.948 %      |

   rus     | Accuracy   |      93.311 %      |
           | Precision  |      96.403 %      |
           | Recall     |      89.933 %      |
           | F1         |      93.056 %      |



Based on the results presented:

Random Forest excelled on the ROS dataset, achieving near-perfect accuracy, precision, recall, and F1 score, indicating excellent performance on oversampled data. This is no surprise given the earlier note that ensemble models such as Random Forest tend to handle imbalance well.

*Note*: Because ROS has performed consistently well, we will disregard the original and RUS datasets from here on to reduce computation time, unless future model results indicate a need to test them as well.

### 4. Gradient Boosting

Gradient Boosting is an ensemble learning technique used for regression and classification tasks. It builds a model in a stage-wise fashion, and it generalizes by allowing optimization of an arbitrary differentiable loss function. Each new model attempts to correct the errors made by the previously trained model. This iterative process results in a strong predictive model.

For our spam email classification task, Gradient Boosting can potentially capture subtle patterns in the data that simpler models might miss, leading to improved accuracy and robustness in detecting spam emails.
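
The stage-wise idea of each new model correcting the errors of the previous ones can be sketched with small regression trees fitted to residuals. For simplicity this example uses squared-error loss on a synthetic regression problem, where the negative gradient is just the residual; `GradientBoostingClassifier` applies the same principle with a classification loss. The data, `learning_rate`, and `n_stages` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption): a noisy sine curve.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: start from a constant prediction (the mean of the targets).
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_stages):
    # For squared-error loss, the negative gradient equals the residual.
    residuals = y_demo - prediction
    # Fit a small tree to the current errors...
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    small_tree.fit(X_demo, residuals)
    # ...and add a damped version of its output to the running prediction.
    prediction += learning_rate * small_tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The learning rate controls how much each stage contributes: smaller values need more stages but usually generalize better, which is one reason Gradient Boosting takes longer to train than the other models here.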

#### Train and Evaluate (~8m)

In [19]:
# Remove original and rus from data_splits
data_splits = {
    'ros': data_splits['ros']
}

# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=rs)

# Train and evaluate the model on the datasets
evaluation_results, gb_model = train_and_evaluate_model(gb_model, data_splits)

#### Results

In [23]:
# Print the evaluation summary
print_evaluation_summary("Gradient Boosting", evaluation_results)

Gradient Boosting Results
-------------------------
 Dataset   |   Metric   |  Train Test Split  |
   ros     | Accuracy   |      93.005 %      |
           | Precision  |      96.009 %      |
           | Recall     |      89.741 %      |
           | F1         |      92.769 %      |



Based on the results presented:

The Gradient Boosting model performed reasonably well on the ROS dataset but fell clearly short of the Random Forest model, with 93.0% accuracy versus 99.9%. Precision remained high (96.0%), but recall dropped to 89.7%, meaning roughly one in ten spam messages in the test set was missed.