# Model Training

This notebook focuses on the training and evaluation of machine learning models for spam detection. We will use the preprocessed data from the previous step and apply various machine learning algorithms to classify messages as spam or ham. The notebook will cover model selection, training, and basic performance evaluation.

## Table of Contents

1. [Introduction](#model-training)
2. [Loading the Data](#loading-the-data)
3. [Train-Test Split](#train-test-split)
4. [Model Selection](#model-selection)
5. [Exporting the Data](#exporting-the-data)
6. [Summary and Insights](#summary-and-insights)

## Loading the Data

### 1. Import Libraries

In [1]:
import numpy as np
import pandas as pd

import os

### 2. Load Preprocessed Data from Preprocessing Stage

In [2]:
# Define the directory paths
processed_data_dir = 'data/processed'
original_dir = os.path.join(processed_data_dir, 'original')
ros_dir = os.path.join(processed_data_dir, 'ros')
rus_dir = os.path.join(processed_data_dir, 'rus')

# Load original data
df_original = pd.read_csv(os.path.join(original_dir, 'original_data.csv'))
X_tfidf_original = np.load(os.path.join(original_dir, 'original_tfidf_features.npy'))
y_original = df_original['label']

# Load ROS data
df_ros = pd.read_csv(os.path.join(ros_dir, 'ros_data.csv'))
X_tfidf_ros = np.load(os.path.join(ros_dir, 'ros_tfidf_features.npy'))
y_ros = df_ros['label']

# Load RUS data
df_rus = pd.read_csv(os.path.join(rus_dir, 'rus_data.csv'))
X_tfidf_rus = np.load(os.path.join(rus_dir, 'rus_tfidf_features.npy'))
y_rus = df_rus['label']

datasets = {
    'original': (X_tfidf_original, y_original),
    'ros': (X_tfidf_ros, y_ros),
    'rus': (X_tfidf_rus, y_rus)
}

# Print the shapes of the loaded data
print("Original data shape:", df_original.shape)
print("Original TF-IDF features shape:", X_tfidf_original.shape)
print("ROS data shape:", df_ros.shape)
print("ROS TF-IDF features shape:", X_tfidf_ros.shape)
print("RUS data shape:", df_rus.shape)
print("RUS TF-IDF features shape:", X_tfidf_rus.shape)

Original data shape: (5572, 2)
Original TF-IDF features shape: (5572, 5000)
ROS data shape: (9650, 2)
ROS TF-IDF features shape: (9650, 5000)
RUS data shape: (1494, 2)
RUS TF-IDF features shape: (1494, 3847)


### 3. Initialize Hyperparameters

In [3]:
# Random state is used to control the randomness involved in algorithms that rely on random processes,
# such as data shuffling, train-test splits, or random initialization of model parameters.
rs = 42

## Train-Test Split

Split the preprocessed data into training and testing sets to evaluate our model's performance on unseen data.

In [4]:
from sklearn.model_selection import train_test_split, StratifiedKFold

def split_data(X, y, test_size=0.2, random_state=rs):
    """
    Split data into training and testing sets.
    
    ## Parameters
    `X`: ndarray 
        Features.
    `y`: ndarray
        Labels.
    `test_size`: float, default=0.2
        Proportion of the dataset to include in the test split.
    `random_state`: int, default=42
        Random seed for reproducibility.
    
    ## Returns
    `dict`: Dictionary containing training and testing sets (X_train, X_test, y_train, y_test).
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
    return {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}

# Split the data for all datasets
data_splits = {
    'original': split_data(X_tfidf_original, y_original),
    'ros': split_data(X_tfidf_ros, y_ros),
    'rus': split_data(X_tfidf_rus, y_rus)
}
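
The effect of `stratify=y` in `split_data` can be checked on a small example. The sketch below uses made-up labels with a 13% positive rate (an assumption for illustration, not the notebook's data) and compares the positive rate in each partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 130 positives out of 1000 (not the notebook's data)
rng = np.random.default_rng(42)
y_demo = np.array([1] * 130 + [0] * 870)
X_demo = rng.random((y_demo.size, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y, both partitions keep roughly the original positive rate
print(f"full: {y_demo.mean():.3f}  train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```

Without stratification, a small test set could by chance contain noticeably fewer or more spam messages than the full dataset, which would make precision and recall less stable.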

## Model Selection

*Note*: The `Train and Evaluate` cell blocks may take a while to finish computing. The notebook is designed so that not every model has to be run (e.g., Gradient Boosting takes quite a while to complete).

In this section, we will explore and evaluate different machine learning models to classify spam emails. We will use various models to understand which one performs best for our dataset. The models we will cover include:

1. [Logistic Regression](#1-logistic-regression)
2. [Decision Tree](#2-decision-tree)
3. [Random Forest](#3-random-forest)
4. [Gradient Boosting](#4-gradient-boosting)
5. [Support Vector Machine (SVM)](#5-support-vector-machine-svm)
6. [k-Nearest Neighbors (k-NN)](#6-k-nearest-neighbors-k-nn)
7. [Naive Bayes](#7-naive-bayes)

*Note*: Some algorithms, such as Random Forest, tend to be less sensitive to class imbalance than others.

We will analyze the following metrics in our overall evaluation of each model:
* **Accuracy** - The proportion of correctly classified instances among the total number of instances. It provides an overall assessment of the model's correctness.

* **Precision** - Also known as positive predictive value, measures the proportion of correctly predicted positive instances (true positives) among all predicted positive instances (true positives + false positives). It reflects the model's ability to avoid false positives.
    - "Of all the instances predicted as positive, how many are actually positive?"

* **Recall** - Also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances (true positives) among all actual positive instances (true positives + false negatives). It reflects the model's ability to identify all relevant instances.
    - "Of all the actual positive instances, how many were correctly predicted as positive?"

* **F1** - The harmonic mean of precision and recall. It provides a balance between precision and recall and is especially useful when dealing with imbalanced datasets.

*Note*: Accuracy might not be the best metric. Precision, recall, and F1-score can better evaluate the model's performance on imbalanced data.
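
To make the definitions above concrete, here is a small worked example that computes all four metrics from confusion-matrix counts. The counts are hypothetical, chosen to resemble a classifier with very high precision but modest recall on an imbalanced test set.

```python
# Hypothetical confusion-matrix counts (positive class = spam)
tp, fp, fn, tn = 104, 1, 45, 965

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted spam that is really spam
recall = tp / (tp + fn)                      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With these counts, accuracy is close to 96% even though about 30% of the spam messages are missed, which is why accuracy alone can be misleading on imbalanced data.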

In [5]:
# Empty dictionary to store each model and their corresponding metrics
evaluation_results = {}

In [6]:
from sklearn.base import clone
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(model, X_train, X_test, y_train, y_test):
    """
    Evaluate the performance of a machine learning model.
    
    ## Parameters
    `model`: sklearn estimator object
        The model to evaluate. It is cloned before fitting, so the object passed in is left unchanged.
    `X_train`: array-like
        Training features.
    `X_test`: array-like
        Testing features.
    `y_train`: array-like
        Training labels.
    `y_test`: array-like
        Testing labels.
    
    ## Returns
    `dict`: A dictionary of evaluation metrics (accuracy, precision, recall, f1) along with the fitted model and predictions.
    """
    # Train a fresh copy so the fitted model stored for one dataset is not
    # overwritten when the same estimator is reused for another dataset
    model = clone(model)
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Store evaluation metrics, trained model, and predictions in a dictionary
    results = {
        'model': model,
        'y_pred': y_pred,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
    
    return results

In [7]:
def cross_validate_model(model, X, y, n_splits=10, random_state=rs):
    """
    Perform cross-validation to evaluate the performance of a machine learning model.
    
    ## Parameters
    `model`: sklearn estimator object
        The model to evaluate.
    `X`: array-like
        Features.
    `y`: array-like
        Labels.
    `n_splits`: int, default=10
        Number of folds for cross-validation.
    `random_state`: int, default=42
        Random seed for reproducibility.

    ## Returns
    `dict`: Dictionary containing evaluation metrics (accuracy, precision, recall, f1).
    """

    # Initialize StratifiedKFold cross-validator
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    # Initialize dictionary to store evaluation metrics
    scores = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

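    # Convert to arrays so the positional fold indices also work when y is a pandas Series
    X, y = np.asarray(X), np.asarray(y)

    # Iterate over each fold
    for train_index, test_index in skf.split(X, y):
        # Split the data into training and testing sets for the current fold
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]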

        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate evaluation metrics for the current fold
        scores['accuracy'].append(accuracy_score(y_test, y_pred))
        scores['precision'].append(precision_score(y_test, y_pred))
        scores['recall'].append(recall_score(y_test, y_pred))
        scores['f1'].append(f1_score(y_test, y_pred))

    # Calculate average metrics across all folds
    metrics = {}
    for metric, values in scores.items():
        metrics[metric] = np.mean(values)

    return metrics
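
As a point of comparison, scikit-learn's built-in `cross_validate` helper can produce the same kind of per-fold averages as the manual loop above. The sketch below is an alternative, not the notebook's method, and it runs on synthetic data from `make_classification` rather than the TF-IDF features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (not the notebook's TF-IDF features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.85, 0.15], random_state=42
)

metric_names = ['accuracy', 'precision', 'recall', 'f1']
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# cross_validate clones the estimator for each fold and returns per-fold scores
cv_out = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=skf, scoring=metric_names
)

# Average each metric across the folds
cv_metrics = {name: np.mean(cv_out[f'test_{name}']) for name in metric_names}
print(cv_metrics)
```

Because `cross_validate` works on clones, the estimator passed in is never fitted in place, which avoids accidentally reusing a model trained on a different fold.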

In [8]:
def train_and_evaluate_model(model, data_splits, datasets = None):
    """
    Train and evaluate a model using train-test split and cross-validation methods.

    ## Parameters
    `model`: sklearn estimator
        The model to be trained and evaluated.
    `data_splits`: dict
        Dictionary containing the train-test splits for each dataset.
    `datasets`: dict (optional)
        Dictionary containing the full datasets for cross-validation.
    
    ## Returns
    `dict`: A dictionary with evaluation metrics for each dataset and method, and the fitted model (from train-test split) along with its predictions.

    """
    # Dictionary to store evaluation metrics for each dataset and method
    evaluation_results = {}

    for dataset_name, dataset_splits in data_splits.items():
        # Extract the training and testing splits
        X_train, y_train = dataset_splits['X_train'], dataset_splits['y_train']
        X_test, y_test = dataset_splits['X_test'], dataset_splits['y_test']

        # Evaluate the model using train-test split method
        evaluation_results[dataset_name] = {}
        evaluation_results[dataset_name]['train_test_split'] = evaluate_model(model, X_train, X_test, y_train, y_test)

    # Evaluate the model using cross-validation method
    if datasets:
        for dataset_name, (X, y) in datasets.items():
            evaluation_results[dataset_name]['cross_validation'] = cross_validate_model(model, X, y)

    return evaluation_results

In [9]:
def print_evaluation_summary(model_name, evaluation_results):
    """
    Print the evaluation summary for each dataset and method.

    ## Parameters
    `model_name`: str
        Name of the model, used as the title of the summary.
    `evaluation_results`: dict
        Dictionary with evaluation metrics for each dataset and method.
    """
    # Title and Column Headers
    print(f"{model_name} Results\n", '-'*(len(model_name)+8), sep='')
    if any(methods.get('cross_validation') for methods in evaluation_results.values()):
        print("{:^10} | {:^10} | {:^18} | {:^18} |".format("Dataset", "Metric", "Train Test Split", "Cross Validation"))
    else:
        print("{:^10} | {:^10} | {:^18} |".format("Dataset", "Metric", "Train Test Split"))

    for dataset_name, methods in evaluation_results.items():
        # Current dataset (Original, ROS, RUS)
        print(f"{dataset_name:^10} | ", end="")
        metrics_split = methods['train_test_split']
        metrics_cross = methods.get('cross_validation', None)
        first_metric = True
        for metric_name in metrics_split.keys():
            # Skip the model and predictions
            if metric_name == 'model' or metric_name == 'y_pred':
                continue

            v1 = "{:.4f} %".format(metrics_split[metric_name] * 100) # Train-Test Split
            v2 = "{:.4f} %".format(metrics_cross.get(metric_name) * 100) if metrics_cross else None # Cross Validation
            if not first_metric:
                print("{:^10} | ".format(''), end="")
            first_metric = False
            print("{:<10} | ".format(metric_name.capitalize()), end="")
            if v2:
                print("{:^18} | {:^18} |".format(v1, v2))
            else:
                print("{:^18} |".format(v1))
        print()

### 1. Logistic Regression

Logistic Regression is a linear model commonly used for binary classification problems. It passes a linear combination of the features through the logistic (sigmoid) function, which outputs a probability between 0 and 1. That probability is then mapped to one of the two classes, typically using a threshold of 0.5.

It often serves as a good baseline model despite its simplicity, making it a valuable tool for initial analysis before moving on to more complex models. 
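
The mapping from a linear score to a class label can be sketched in a few lines. The scores below are hypothetical values of `w·x + b` for three messages, and the 0.5 threshold is the usual default rather than something tuned in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores w·x + b for three messages
scores = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(scores)

# Decision rule: predict the positive class (spam) when p >= 0.5
labels = (probs >= 0.5).astype(int)
print(probs.round(3), labels)
```

Lowering the threshold would trade precision for recall, which is one way to catch more spam without resampling the data.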

#### Train and Evaluate

In [10]:
from sklearn.linear_model import LogisticRegression

# Define the Logistic Regression model
logreg_model = LogisticRegression(random_state=rs)

# Train and evaluate the model
logreg_evaluation_results = train_and_evaluate_model(logreg_model, data_splits, datasets)

#### Results

In [11]:
# Print the evaluation summary
print_evaluation_summary('Logistic Regression', logreg_evaluation_results)
evaluation_results['Logistic Regression'] = logreg_evaluation_results

Logistic Regression Results
---------------------------
 Dataset   |   Metric   |  Train Test Split  |  Cross Validation  |
 original  | Accuracy   |     95.8744 %      |     95.8003 %      |
           | Precision  |     99.0476 %      |     97.7681 %      |
           | Recall     |     69.7987 %      |     70.2937 %      |
           | F1         |     81.8898 %      |     81.7395 %      |

   ros     | Accuracy   |     99.0155 %      |     98.8497 %      |
           | Precision  |     98.9648 %      |     99.0034 %      |
           | Recall     |     99.0674 %      |     98.6944 %      |
           | F1         |     99.0161 %      |     98.8475 %      |

   rus     | Accuracy   |     92.9766 %      |     93.4425 %      |
           | Precision  |     98.4848 %      |     98.2318 %      |
           | Recall     |     87.2483 %      |     88.4937 %      |
           | F1         |     92.5267 %      |     93.0188 %      |



Based on the results presented:

* Both the train-test split and cross-validation methods yielded similar results across all metrics and datasets, suggesting consistency in model performance.

1. **Original Dataset**:
    * Without any handling of class imbalance, the model reached high accuracy (95.87%) and precision (99.05%) on the train-test split, but low recall (69.80%) and F1 (81.89%).
    * In other words, almost every message flagged as spam really was spam, but roughly 30% of the spam messages were missed.

2. **ROS (Random Over-Sampling) Dataset**:
    * The ROS dataset scored highest on accuracy, recall, and F1 under both evaluation methods. Precision was roughly level with the original dataset (98.96% vs. 99.05% on the train-test split).

3. **RUS (Random Under-Sampling) Dataset**:
    * The RUS dataset improved recall (87.25%) and F1 (92.53%) over the original, but its accuracy (92.98%) was lower than the original's, and all metrics except precision trailed the ROS dataset.

Overall Summary:
* Both oversampling (ROS) and undersampling (RUS) improved recall and F1, the metrics most affected by class imbalance. ROS also raised accuracy, while RUS lowered it.
* The ROS dataset showed the most promising results, achieving high metrics across both train-test split and cross-validation methods.
* These findings suggest that addressing class imbalance through resampling improves the model's ability to detect spam. Note that each dataset is evaluated on its own test split, and the ROS data was oversampled before it was split, so copies of the same spam message can appear in both the training and test sets. The ROS scores may therefore be optimistic and are not directly comparable with the scores on the original data.
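
In this notebook the ROS data is loaded already oversampled and split afterwards. One common alternative, sketched below under stated assumptions, is to split first and oversample only the training portion so that every test row is unseen. The data is synthetic, and `sklearn.utils.resample` stands in for whatever oversampler the preprocessing stage used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 100 positives, 900 negatives
rng = np.random.default_rng(42)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = rng.random((y_demo.size, 5))

# 1) Split first, so the test set contains only unseen rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2) Oversample the minority class in the training portion only
minority = X_tr[y_tr == 1]
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=42)

X_tr_bal = np.vstack([X_tr, extra])
y_tr_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=y_tr.dtype)])

# Training set is balanced; the test set keeps the original class ratio
print(np.bincount(y_tr_bal), np.bincount(y_te))
```

Scores measured this way reflect the class ratio a deployed model would actually see, at the cost of no longer matching the figures reported in this notebook.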

**Note: We will only perform the train-test split method for subsequent models to decrease computation time.**

### 2. Decision Tree

Decision Tree is a non-linear supervised learning algorithm used for both classification and regression tasks. It works by splitting the data into subsets based on the value of input features, forming a tree-like model of decisions.
* Each internal node of the tree represents a decision based on an attribute,
* Each branch represents the outcome of the decision, and
* Each leaf node represents a class label or a continuous value.

Unlike linear models, Decision Trees can capture non-linear relationships between features and the target variable. This flexibility allows the model to fit more complex patterns in the data, which can be beneficial for identifying spam emails.

Decision Trees perform implicit feature selection by choosing the most informative features for splitting the data. This property helps in reducing the dimensionality and removing irrelevant features from our spam email dataset.
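
The implicit feature selection described above can be observed through the fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data with 50 features, of which only 5 are informative, as a stand-in for TF-IDF columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 50 features, only 5 informative (stand-in for TF-IDF columns)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=50, n_informative=5, n_redundant=0, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Features never chosen for a split get an importance of exactly 0
used = np.flatnonzero(tree.feature_importances_)
print(f"{used.size} of {X_demo.shape[1]} features were used in splits")
```

An unpruned tree may still split on some noisy features deep in the tree, so limiting `max_depth` is one way to make the selection stricter.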

#### Train and Evaluate

In [12]:
from sklearn.tree import DecisionTreeClassifier

# Define the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=rs)

# Train and evaluate the Decision Tree model
dt_evaluation_results = train_and_evaluate_model(dt_model, data_splits)

#### Results

In [13]:
# Print the evaluation summary
print_evaluation_summary('Decision Tree', dt_evaluation_results)
evaluation_results['Decision Tree'] = dt_evaluation_results

Decision Tree Results
---------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |     95.8744 %      |
           | Precision  |     85.0340 %      |
           | Recall     |     83.8926 %      |
           | F1         |     84.4595 %      |

   ros     | Accuracy   |     97.8238 %      |
           | Precision  |     95.8292 %      |
           | Recall     |     100.0000 %     |
           | F1         |     97.8702 %      |

   rus     | Accuracy   |     90.9699 %      |
           | Precision  |     89.6104 %      |
           | Recall     |     92.6174 %      |
           | F1         |     91.0891 %      |



Based on the results presented:

The ROS dataset once again produced the strongest results: the Decision Tree reached 97.82% accuracy, 95.83% precision, 100% recall, and 97.87% F1. Compared with Logistic Regression on the same data, accuracy and precision are lower (by about 1.2 and 3.1 percentage points) while recall is slightly higher, so the ROS dataset remains a viable option for building a spam email classifier.

### 3. Random Forest

Random Forest is a versatile ensemble learning method for classification and regression tasks. It operates by constructing a multitude of decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

By aggregating predictions from many decision trees, Random Forest reduces the risk of overfitting compared with a single tree, and it often copes reasonably well with imbalanced datasets.
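
The aggregation step can be reproduced by hand. scikit-learn's `RandomForestClassifier` combines trees by averaging their class-probability estimates rather than counting hard votes; the sketch below, on synthetic data, checks that averaging the individual trees gives the same probabilities as the forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the notebook's TF-IDF features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Average the class-probability estimates of the individual trees
tree_probs = np.mean([t.predict_proba(X_demo) for t in forest.estimators_], axis=0)

# The averaged probabilities match the forest's own output
print(np.allclose(tree_probs, forest.predict_proba(X_demo)))
```

When every tree is fully grown, each tree's probabilities are 0 or 1, so the average is simply the fraction of trees voting for each class.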

#### Train and Evaluate

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest model
rf_model = RandomForestClassifier(random_state=rs)

# Train and evaluate the model
rf_evaluation_results = train_and_evaluate_model(rf_model, data_splits)

#### Results

In [15]:
# Output summary of evaluation metrics
print_evaluation_summary('Random Forest', rf_evaluation_results)
evaluation_results['Random Forest'] = rf_evaluation_results

Random Forest Results
---------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |     97.5785 %      |
           | Precision  |     100.0000 %     |
           | Recall     |     81.8792 %      |
           | F1         |     90.0369 %      |

   ros     | Accuracy   |     99.9482 %      |
           | Precision  |     99.8965 %      |
           | Recall     |     100.0000 %     |
           | F1         |     99.9482 %      |

   rus     | Accuracy   |     93.3110 %      |
           | Precision  |     96.4029 %      |
           | Recall     |     89.9329 %      |
           | F1         |     93.0556 %      |



Based on the results presented:

Random Forest excelled on the ROS dataset, achieving near-perfect accuracy, precision, recall, and F1 score (all above 99.89%). On the original, imbalanced dataset its F1 score of 90.04% is higher than that of Logistic Regression (81.89%) and the Decision Tree (84.46%), which is in line with the earlier note that this model tends to handle imbalance better.

**Note: Due to the consistency of ROS performance, we will be disregarding the RUS datasets to reduce computation time.**

### 4. Gradient Boosting

Gradient Boosting is an ensemble learning technique used for regression and classification tasks. It builds a model in a stage-wise fashion, and it generalizes by allowing optimization of an arbitrary differentiable loss function. Each new model attempts to correct the errors made by the previously trained model. This iterative process results in a strong predictive model.

For our spam email classification task, Gradient Boosting can potentially capture subtle patterns in the data that simpler models might miss, leading to improved accuracy and robustness in detecting spam emails.
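
The stage-wise nature of boosting can be inspected with `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch below uses synthetic data and 50 stages (both assumptions for illustration) and reports test accuracy at a few points along the way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the notebook's TF-IDF features)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

gb = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each boosting stage
stage_acc = [accuracy_score(y_te, y_hat) for y_hat in gb.staged_predict(X_te)]
for n in (1, 10, 50):
    print(f"{n:>2} trees: accuracy = {stage_acc[n - 1]:.3f}")
```

Plotting `stage_acc` against the number of stages is a simple way to judge whether more estimators are still helping or the model has plateaued.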

#### Train and Evaluate

In [16]:
from sklearn.ensemble import GradientBoostingClassifier

# Remove rus from data_splits
data_splits = {
    'original': data_splits['original'],
    'ros': data_splits['ros']
}

# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=rs)

# Train and evaluate the model on the datasets
gb_evaluation_results = train_and_evaluate_model(gb_model, data_splits)

#### Results

In [17]:
# Print the evaluation summary
print_evaluation_summary('Gradient Boosting', gb_evaluation_results)
evaluation_results['Gradient Boosting'] = gb_evaluation_results

Gradient Boosting Results
-------------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |     96.1435 %      |
           | Precision  |     94.1667 %      |
           | Recall     |     75.8389 %      |
           | F1         |     84.0149 %      |

   ros     | Accuracy   |     93.0052 %      |
           | Precision  |     96.0089 %      |
           | Recall     |     89.7409 %      |
           | F1         |     92.7691 %      |



Based on the results presented:

The Gradient Boosting model reached 93.01% accuracy and 92.77% F1 on the ROS dataset, well below the Random Forest results (99.95% for both). Precision remains high (96.01%), but the recall of 89.74% means that about one in ten spam messages in the test set was missed. Unlike the other models in this notebook, its accuracy on the ROS dataset is lower than on the original dataset (96.14%).

### 5. Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful supervised learning algorithm that can be used for both classification and regression tasks. It works by finding the hyperplane that best separates the data into different classes. The SVM is particularly effective in high-dimensional spaces and when the number of dimensions is greater than the number of samples.

SVM is suitable for datasets with a large number of features, which is the case with our TF-IDF vectorized text data. We evaluate it on both the original and the ROS-balanced data. Margin maximization can make SVM fairly robust to moderate class imbalance, although this is not guaranteed.
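
For a linear kernel, the separating hyperplane is available directly from the fitted model as `coef_` and `intercept_`. The sketch below, on synthetic data, recomputes the decision scores by hand and compares them with `decision_function`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (not the notebook's TF-IDF features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
svm = SVC(kernel='linear').fit(X_demo, y_demo)

# For a linear kernel the separating hyperplane is w·x + b = 0
w, b = svm.coef_.ravel(), svm.intercept_[0]
manual_scores = X_demo @ w + b

# The sign of the score picks the class: positive -> classes_[1]
manual_pred = svm.classes_[(manual_scores > 0).astype(int)]
print(np.allclose(manual_scores, svm.decision_function(X_demo)))
```

The magnitude of each score grows with the point's distance from the hyperplane, so scores near zero mark the messages the model is least sure about.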

#### Train and Evaluate

In [18]:
from sklearn.svm import SVC

# Initialize the SVM model
svm_model = SVC(kernel='linear', probability=True, random_state=rs)

# Train and evaluate the model on the datasets
svm_evaluation_results = train_and_evaluate_model(svm_model, data_splits)

#### Results

In [19]:
# Print the evaluation summary
print_evaluation_summary('Support Vector Machine', svm_evaluation_results)
evaluation_results['Support Vector Machine'] = svm_evaluation_results

Support Vector Machine Results
------------------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |     98.2960 %      |
           | Precision  |     100.0000 %     |
           | Recall     |     87.2483 %      |
           | F1         |     93.1900 %      |

   ros     | Accuracy   |     99.7927 %      |
           | Precision  |     99.7927 %      |
           | Recall     |     99.7927 %      |
           | F1         |     99.7927 %      |



Based on the results presented:

On the ROS dataset, accuracy, precision, recall, and F1 are all 99.79%. Identical precision and recall mean the model made the same number of false positives as false negatives on that test set, so its errors are not skewed toward either class. On the original dataset it reached 100% precision and 87.25% recall, the highest recall on the original data of any model in this notebook. Overall, its performance is on par with the best-performing models in our evaluation.

### 6. k-Nearest Neighbors (k-NN)

k-Nearest Neighbors (k-NN) is a simple, instance-based learning algorithm that classifies a data point based on the majority class among its k nearest neighbors in the feature space. It is a non-parametric method, meaning it makes no explicit assumptions about the form of the function that relates the features to the target variable.

Since k-NN makes predictions based on local neighborhoods, it can effectively handle balanced datasets, such as those created by the ROS method.
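
The majority vote can be reproduced manually from the neighbor indices returned by `kneighbors`. The sketch below uses synthetic data with 0/1 labels and k=5 (an odd k, so a two-class vote cannot tie).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with labels 0 and 1
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, y_tr, X_new = X_demo[:250], y_demo[:250], X_demo[250:]

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Look up the 5 nearest training points for each new point
_, neighbor_idx = knn.kneighbors(X_new)

# Majority vote over the neighbors' labels
manual_pred = (y_tr[neighbor_idx].mean(axis=1) > 0.5).astype(int)
print((manual_pred == knn.predict(X_new)).all())
```

This also shows why k-NN is sensitive to imbalance: when one class dominates the training set, it tends to dominate most neighborhoods as well.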

#### Train and Evaluate

In [20]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the k-NN model with k=5 (common default value)
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train and evaluate the model on the datasets
knn_evaluation_results = train_and_evaluate_model(knn_model, data_splits)

#### Results

In [21]:
# Print the evaluation summary
print_evaluation_summary("k-Nearest Neighbors", knn_evaluation_results)
evaluation_results['k-Nearest Neighbors'] = knn_evaluation_results

k-Nearest Neighbors Results
---------------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |     91.5695 %      |
           | Precision  |     100.0000 %     |
           | Recall     |     36.9128 %      |
           | F1         |     53.9216 %      |

   ros     | Accuracy   |     98.6010 %      |
           | Precision  |     99.8936 %      |
           | Recall     |     97.3057 %      |
           | F1         |     98.5827 %      |



Based on the results presented:

The k-NN model performs very well on the ROS dataset, with high precision and recall scores. The high accuracy and F1 score further demonstrate the model's effectiveness in classifying the dataset, although it performs slightly worse than the best-performing models in our evaluation. On the original dataset, however, recall drops to 36.91%, the lowest of any model, so most spam messages are missed.

### 7. Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption of independence between features. It is commonly used for text classification tasks, making it suitable for our spam detection dataset.
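
The calculation behind a multinomial Naive Bayes prediction can be written out from the fitted model's `class_log_prior_` and `feature_log_prob_` attributes. The sketch below uses a tiny, made-up term-count matrix with three vocabulary terms.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term counts: rows are messages, columns are 3 vocabulary terms
X_demo = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y_demo = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

nb = MultinomialNB().fit(X_demo, y_demo)

# Bayes' rule in log space:
# log P(class) + sum over terms of count * log P(term | class)
log_joint = nb.class_log_prior_ + X_demo @ nb.feature_log_prob_.T

# Predict the class with the larger joint log-probability
manual_pred = nb.classes_[np.argmax(log_joint, axis=1)]
print(manual_pred)
```

The independence assumption is what lets the per-term log-probabilities simply be added, which keeps both training and prediction fast.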

#### Train and Evaluate

In [22]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Naive Bayes model
nb_model = MultinomialNB()

# Train and evaluate the model on the original and ROS datasets
nb_evaluation_results = train_and_evaluate_model(nb_model, data_splits)


#### Results

In [23]:
# Print the evaluation summary
print_evaluation_summary("Naive Bayes", nb_evaluation_results)
evaluation_results['Naive Bayes'] = nb_evaluation_results

Naive Bayes Results
-------------------
 Dataset   |   Metric   |  Train Test Split  |
 original  | Accuracy   |     96.6816 %      |
           | Precision  |     99.1228 %      |
           | Recall     |     75.8389 %      |
           | F1         |     85.9316 %      |

   ros     | Accuracy   |     98.1865 %      |
           | Precision  |     98.2365 %      |
           | Recall     |     98.1347 %      |
           | F1         |     98.1856 %      |



Based on the results presented:

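Naive Bayes scored 98.19% accuracy and F1 on the ROS dataset, within two percentage points of Random Forest and SVM. On the original dataset its precision was high (99.12%), but recall was only 75.84%. It remains a computationally efficient option for text classification tasks like spam email detection, with competitive performance.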

## Exporting the Data

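After training and evaluating each model, we export the evaluation results (including the fitted models and their predictions) to the `results` folder, and the train-test splits to `data/processed/split`, for use in the next evaluation step.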

In [24]:
import pickle

# Directory to store the results and split data
results_dir = 'results/'
data_dir = 'data/processed/split'

# Create directories if they don't exist
os.makedirs(name=results_dir, exist_ok=True)
os.makedirs(name=data_dir, exist_ok=True)

# Save evaluation_results to a pickle file
with open(os.path.join(results_dir, 'evaluation_results.pkl'), 'wb') as f:
    pickle.dump(evaluation_results, f)

# Save the data_splits dictionary to a pickle file
with open(os.path.join(data_dir, 'data_splits.pkl'), 'wb') as f:
    pickle.dump(data_splits, f)
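
The exported files can be read back with `pickle.load`. The sketch below shows the round trip using a temporary directory and a small made-up dictionary in place of the real `evaluation_results`, so it does not touch the notebook's output folders.

```python
import os
import pickle
import tempfile

# Stand-in for the notebook's evaluation_results dictionary (values are made up)
demo_results = {'Demo Model': {'ros': {'train_test_split': {'accuracy': 0.9, 'f1': 0.8}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'evaluation_results.pkl')

    # Write the dictionary, as the export cell does
    with open(path, 'wb') as f:
        pickle.dump(demo_results, f)

    # Read it back, as a later notebook would
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded == demo_results)
```

Pickled scikit-learn models are best loaded with the same scikit-learn version that created them, and pickle files should only be loaded from trusted sources.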

## Summary and Insights

### Impact of Data Preprocessing Techniques
* Resampling the imbalanced original dataset with oversampling (ROS) or undersampling (RUS) improved recall and F1 for every model on which it was tested.

### Performance of Different Models
* **Logistic Regression**: Achieved high accuracy and precision but had lower recall and F1-score on the original dataset. ROS improved accuracy, recall, and F1 while precision stayed about the same; RUS improved recall and F1 at the cost of lower accuracy.

* **Decision Tree**: Showed promising results, especially on the ROS dataset, with high accuracy and precision.

* **Random Forest**: Outperformed other models on the ROS dataset, demonstrating high accuracy, precision, and recall.

* **Gradient Boosting**: Reached high precision on the ROS dataset (96.01%), but its recall (89.74%) and accuracy (93.01%) were the lowest of any model on that dataset.

* **Support Vector Machine (SVM)**: Demonstrated excellent performance on the ROS dataset, with high accuracy, precision, recall, and F1-score.

* **k-Nearest Neighbors (k-NN)**: Showed good performance on the ROS dataset, with high accuracy, precision, and recall, but very low recall (36.91%) on the original dataset.

* **Naive Bayes**: Achieved high accuracy and precision on the ROS dataset, with competitive performance compared to other models.

### Key Insights
* The ROS technique improved recall and F1 for every algorithm, and accuracy for all except Gradient Boosting, indicating its effectiveness in handling class imbalance.
* Certain models such as Random Forest and SVM performed exceptionally well on the ROS dataset, suggesting their suitability for this classification task.

### Next Steps
* **Model Evaluation**: Analyze the performance of each model using various evaluation metrics, including confusion matrices, classification reports, and ROC curves.

* **Model Comparison**: Further compare the performance of different models side by side to identify the most effective approach for our email classification task.

* **Visualization**: Utilize visualizations to present the evaluation results in an intuitive and understandable manner.

* **Further Analysis**: Explore additional avenues for analysis, including feature importance plots. 