# Models Considered

For this assignment, three machine learning models were trained and their performance compared:

1. **Naive Bayes**  
   A probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

2. **Linear SVM**  
   A Support Vector Machine (SVM) with a linear decision boundary: it finds the hyperplane that separates the two classes with the largest margin.

3. **Random Forest**  
   An ensemble learning method that combines the predictions of multiple decision trees to improve classification accuracy and reduce overfitting.
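
As a brief sketch of the independence assumption behind Naive Bayes, the posterior for a class factorizes as shown below. The symbols $y$ (the class, ham or spam) and $x_1, \dots, x_n$ (the feature values of one message) are notation introduced here for illustration and do not appear elsewhere in the assignment.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The product over features is what the "naive" assumption buys: each feature contributes independently once the class is known.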

# Loading the required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV

In [2]:
def load_and_preprocess_data(train_path, valid_path, test_path):
    """
    Load and preprocess the data from CSV files
    """
    train_df = pd.read_csv(train_path)
    valid_df = pd.read_csv(valid_path)
    test_df = pd.read_csv(test_path)
    
    # Convert labels to binary
    label_map = {'ham': 0, 'spam': 1}
    train_df['label'] = train_df['label'].map(label_map)
    valid_df['label'] = valid_df['label'].map(label_map)
    test_df['label'] = test_df['label'].map(label_map)
    
    return train_df, valid_df, test_df

def prepare_features(train_df, valid_df, test_df):
    """
    Prepare features using TF-IDF vectorization
    """
    vectorizer = TfidfVectorizer(max_features=5000)
    X_train = vectorizer.fit_transform(train_df['processed_message'])
    X_valid = vectorizer.transform(valid_df['processed_message'])
    X_test = vectorizer.transform(test_df['processed_message'])
    
    y_train = train_df['label']
    y_valid = valid_df['label']
    y_test = test_df['label']
    
    return X_train, X_valid, X_test, y_train, y_valid, y_test, vectorizer
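
The important detail in `prepare_features` is that the vectorizer is fitted on the training split only and then reused for the other splits. A minimal sketch of that behavior follows; the toy messages and the `*_toy` variable names are invented for illustration and are not part of the assignment data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages, invented for illustration only
train_messages = ["free prize call now", "see you at lunch", "win free cash now"]
valid_messages = ["free lunch tomorrow", "unseen words only"]

toy_vectorizer = TfidfVectorizer(max_features=5000)

# The vocabulary and IDF weights are learned from the training split only
X_train_toy = toy_vectorizer.fit_transform(train_messages)

# The validation split is mapped onto that fixed vocabulary;
# words never seen during fitting are simply ignored
X_valid_toy = toy_vectorizer.transform(valid_messages)

print("Train matrix shape:", X_train_toy.shape)
print("Validation matrix shape:", X_valid_toy.shape)
print("Non-zero entries per validation row:", X_valid_toy.getnnz(axis=1))
```

Both matrices share the same number of columns, and a validation message made up entirely of unseen words becomes an all-zero row.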

In [4]:
def fit_model(model, X_train, y_train):
    """
    Fit a model on training data
    """
    model.fit(X_train, y_train)
    return model

def score_model(model, X, y):
    """
    Score a model on given data
    """
    return model.score(X, y)

def evaluate_model(model, X, y, dataset_name=""):
    """
    Evaluate model predictions with detailed metrics
    """
    y_pred = model.predict(X)
    print(f"\nEvaluation on {dataset_name} dataset:")
    print("\nClassification Report:")
    print(classification_report(y, y_pred))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y, y_pred))
    print(f"\nAccuracy: {accuracy_score(y, y_pred):.4f}")
    return y_pred
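
To make the link between the confusion matrix and the classification report concrete, here is a small sketch that derives precision and recall for the spam class by hand. The label arrays are invented toy values, not results from this assignment.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, invented for illustration (0 = ham, 1 = spam)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of the messages flagged as spam, how many really are spam
precision = tp / (tp + fp)
# Recall: of the real spam messages, how many were flagged
recall = tp / (tp + fn)

print(f"Precision (spam): {precision:.4f}")
print(f"Recall (spam): {recall:.4f}")

# The hand-computed values agree with scikit-learn's helpers
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```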

In [5]:
def validate_model(model, X_train, X_valid, y_train, y_valid):
    """
    Validate the model on train and validation sets
    """
    train_score = score_model(model, X_train, y_train)
    valid_score = score_model(model, X_valid, y_valid)
    
    print(f"\nModel Validation Scores:")
    print(f"Training Score: {train_score:.4f}")
    print(f"Validation Score: {valid_score:.4f}")
    
    evaluate_model(model, X_train, y_train, "Training")
    evaluate_model(model, X_valid, y_valid, "Validation")
    
    return train_score, valid_score

In [6]:
def fine_tune_model(model, param_grid, X_train, y_train):
    """
    Fine-tune model hyperparameters using GridSearchCV
    """
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    print("\nBest parameters:", grid_search.best_params_)
    print("Best cross-validation score:", grid_search.best_score_)
    
    return grid_search.best_estimator_
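
One thing to be aware of: `fine_tune_model` receives features that were already vectorized on the full training split, so each cross-validation fold sees IDF weights computed partly from its own held-out messages. A possible alternative, sketched below under the assumption that raw text is available at tuning time, is to wrap the vectorizer and classifier in a `Pipeline` so the vectorizer is refitted inside every fold. The toy corpus and the step names `tfidf` and `clf` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration (0 = ham, 1 = spam)
ham = [f"see you at lunch on day {i}" for i in range(10)]
spam = [f"win free cash prize number {i}" for i in range(10)]
messages = ham + spam
labels = [0] * 10 + [1] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# The "clf__" prefix routes the parameter to the classifier step
param_grid = {"clf__alpha": [0.1, 0.5, 1.0, 2.0]}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(messages, labels)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
```

On a dataset of this size the difference is probably small, but the pipeline form keeps the cross-validation estimate free of information from the held-out fold.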

In [8]:
# Load and preprocess data
train_df, valid_df, test_df = load_and_preprocess_data('processed_data/train.csv', 'processed_data/validation.csv', 'processed_data/test.csv')
X_train, X_valid, X_test, y_train, y_valid, y_test, vectorizer = prepare_features(train_df, valid_df, test_df)

## Naive Bayes 

### Training Report

In [12]:
# Initialize Naive Bayes model
naive_bayes_model = MultinomialNB()

# Train and evaluate Naive Bayes model
print(f"\n{'='*50}")
print("Training Naive Bayes")
print('='*50)

# Fit model
naive_bayes_model = fit_model(naive_bayes_model, X_train, y_train)

# Training score
train_score = score_model(naive_bayes_model, X_train, y_train)
print(f"\nTraining Score for Naive Bayes: {train_score:.4f}")

# Get classification report on training set
y_train_pred = naive_bayes_model.predict(X_train)
print(f"\nClassification Report for Naive Bayes on Training Set:\n{classification_report(y_train, y_train_pred)}")


Training Naive Bayes

Training Score for Naive Bayes: 0.9772

Classification Report for Naive Bayes on Training Set:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      3377
           1       1.00      0.83      0.91       523

    accuracy                           0.98      3900
   macro avg       0.99      0.91      0.95      3900
weighted avg       0.98      0.98      0.98      3900



### Validation Report

In [13]:
# Validate model
valid_score = score_model(naive_bayes_model, X_valid, y_valid)
print(f"\nValidation Score for Naive Bayes: {valid_score:.4f}")

# Get classification report on validation set
y_valid_pred = naive_bayes_model.predict(X_valid)
print(f"\nClassification Report for Naive Bayes on Validation Set:\n{classification_report(y_valid, y_valid_pred)}")

# Fine-tune if necessary
if valid_score < train_score - 0.05:
    print(f"\nFine-tuning Naive Bayes...")
    naive_bayes_model = fine_tune_model(naive_bayes_model, {'alpha': [0.1, 0.5, 1.0, 2.0]}, X_train, y_train)
    valid_score = score_model(naive_bayes_model, X_valid, y_valid)
    print(f"Validation Score after Fine-tuning: {valid_score:.4f}")

    # Get updated classification report after fine-tuning
    y_valid_pred = naive_bayes_model.predict(X_valid)
    print(f"\nUpdated Classification Report for Naive Bayes on Validation Set:\n{classification_report(y_valid, y_valid_pred)}")

# Score on test data
naive_bayes_test_score = score_model(naive_bayes_model, X_test, y_test)
print(f"\nTest Score for Naive Bayes: {naive_bayes_test_score:.4f}")


Validation Score for Naive Bayes: 0.9593

Classification Report for Naive Bayes on Validation Set:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       723
           1       0.99      0.71      0.82       112

    accuracy                           0.96       835
   macro avg       0.97      0.85      0.90       835
weighted avg       0.96      0.96      0.96       835


Test Score for Naive Bayes: 0.9677
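
Because the classes are imbalanced, it helps to put the 0.9593 validation accuracy next to a trivial baseline. The short calculation below uses only the supports and the rounded spam recall from the validation report above; since the recall is rounded to two decimals, the count of missed spam messages is an approximation.

```python
# Class supports taken from the validation report above
n_ham = 723
n_spam = 112
n_total = n_ham + n_spam

# Accuracy of a trivial classifier that labels every message as ham
baseline_accuracy = n_ham / n_total

# Approximate number of spam messages missed, from the rounded recall of 0.71
spam_recall = 0.71
approx_missed_spam = round(n_spam * (1 - spam_recall))

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Approximate spam messages missed: {approx_missed_spam}")
```

A classifier that never predicts spam would already reach about 0.87 accuracy here, which is why the per-class recall is the more informative number for Naive Bayes.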


## Linear SVM

### Training Report 

In [14]:
# Initialize Linear SVM model
linear_svm_model = LinearSVC(random_state=42)

# Train and evaluate Linear SVM model
print(f"\n{'='*50}")
print("Training Linear SVM")
print('='*50)

# Fit model
linear_svm_model = fit_model(linear_svm_model, X_train, y_train)

# Training score
train_score = score_model(linear_svm_model, X_train, y_train)
print(f"\nTraining Score for Linear SVM: {train_score:.4f}")

# Get classification report on training set
y_train_pred = linear_svm_model.predict(X_train)
print(f"\nClassification Report for Linear SVM on Training Set:\n{classification_report(y_train, y_train_pred)}")


Training Linear SVM

Training Score for Linear SVM: 0.9985

Classification Report for Linear SVM on Training Set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3377
           1       1.00      0.99      0.99       523

    accuracy                           1.00      3900
   macro avg       1.00      1.00      1.00      3900
weighted avg       1.00      1.00      1.00      3900



### Validation Report

In [15]:
# Validate model
valid_score = score_model(linear_svm_model, X_valid, y_valid)
print(f"\nValidation Score for Linear SVM: {valid_score:.4f}")

# Get classification report on validation set
y_valid_pred = linear_svm_model.predict(X_valid)
print(f"\nClassification Report for Linear SVM on Validation Set:\n{classification_report(y_valid, y_valid_pred)}")

# Fine-tune if necessary
if valid_score < train_score - 0.05:
    print(f"\nFine-tuning Linear SVM...")
    linear_svm_model = fine_tune_model(linear_svm_model, {'C': [0.1, 1.0, 10.0]}, X_train, y_train)
    valid_score = score_model(linear_svm_model, X_valid, y_valid)
    print(f"Validation Score after Fine-tuning: {valid_score:.4f}")

    # Get updated classification report after fine-tuning
    y_valid_pred = linear_svm_model.predict(X_valid)
    print(f"\nUpdated Classification Report for Linear SVM on Validation Set:\n{classification_report(y_valid, y_valid_pred)}")

# Score on test data
linear_svm_test_score = score_model(linear_svm_model, X_test, y_test)
print(f"\nTest Score for Linear SVM: {linear_svm_test_score:.4f}")


Validation Score for Linear SVM: 0.9784

Classification Report for Linear SVM on Validation Set:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       723
           1       0.97      0.87      0.92       112

    accuracy                           0.98       835
   macro avg       0.97      0.93      0.95       835
weighted avg       0.98      0.98      0.98       835


Test Score for Linear SVM: 0.9845


## Random Forest 

### Training Report

In [16]:
# Initialize Random Forest model
random_forest_model = RandomForestClassifier(random_state=42)

# Train and evaluate Random Forest model
print(f"\n{'='*50}")
print("Training Random Forest")
print('='*50)

# Fit model
random_forest_model = fit_model(random_forest_model, X_train, y_train)

# Training score
train_score = score_model(random_forest_model, X_train, y_train)
print(f"\nTraining Score for Random Forest: {train_score:.4f}")

# Get classification report on training set
y_train_pred = random_forest_model.predict(X_train)
print(f"\nClassification Report for Random Forest on Training Set:\n{classification_report(y_train, y_train_pred)}")


Training Random Forest

Training Score for Random Forest: 0.9997

Classification Report for Random Forest on Training Set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3377
           1       1.00      1.00      1.00       523

    accuracy                           1.00      3900
   macro avg       1.00      1.00      1.00      3900
weighted avg       1.00      1.00      1.00      3900



### Validation Report

In [17]:
# Validate model
valid_score = score_model(random_forest_model, X_valid, y_valid)
print(f"\nValidation Score for Random Forest: {valid_score:.4f}")

# Get classification report on validation set
y_valid_pred = random_forest_model.predict(X_valid)
print(f"\nClassification Report for Random Forest on Validation Set:\n{classification_report(y_valid, y_valid_pred)}")

# Fine-tune if necessary
if valid_score < train_score - 0.05:
    print(f"\nFine-tuning Random Forest...")
    random_forest_model = fine_tune_model(random_forest_model, {'n_estimators': [100, 200], 'max_depth': [10, 20, None]}, X_train, y_train)
    valid_score = score_model(random_forest_model, X_valid, y_valid)
    print(f"Validation Score after Fine-tuning: {valid_score:.4f}")

    # Get updated classification report after fine-tuning
    y_valid_pred = random_forest_model.predict(X_valid)
    print(f"\nUpdated Classification Report for Random Forest on Validation Set:\n{classification_report(y_valid, y_valid_pred)}")

# Score on test data
random_forest_test_score = score_model(random_forest_model, X_test, y_test)
print(f"\nTest Score for Random Forest: {random_forest_test_score:.4f}")


Validation Score for Random Forest: 0.9760

Classification Report for Random Forest on Validation Set:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       723
           1       1.00      0.82      0.90       112

    accuracy                           0.98       835
   macro avg       0.99      0.91      0.94       835
weighted avg       0.98      0.98      0.98       835


Test Score for Random Forest: 0.9821


## Best Model

In [18]:
# Collect the test scores from the previous cells
models_results = {
    'Naive Bayes': naive_bayes_test_score,
    'Linear SVM': linear_svm_test_score,
    'Random Forest': random_forest_test_score
}

# Pick the model with the highest test score
# (max() always binds a name, unlike a loop that may never assign one)
best_model_name = max(models_results, key=models_results.get)
best_score = models_results[best_model_name]

# Final evaluation of the best model
print(f"\n{'='*50}")
print(f"Best Model: {best_model_name}")
print(f"Final Test Score: {best_score:.4f}")
print('='*50)

# Detailed evaluation of best model on test set
best_model_object = {
    'Naive Bayes': naive_bayes_model,
    'Linear SVM': linear_svm_model,
    'Random Forest': random_forest_model
}[best_model_name]

y_test_pred = best_model_object.predict(X_test)
print(f"\nClassification Report for {best_model_name} on Test Set:\n{classification_report(y_test, y_test_pred)}")


Best Model: Linear SVM
Final Test Score: 0.9845

Classification Report for Linear SVM on Test Set:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       725
           1       0.99      0.89      0.94       112

    accuracy                           0.98       837
   macro avg       0.99      0.95      0.97       837
weighted avg       0.98      0.98      0.98       837



The Linear SVM achieved the highest test accuracy of the three models, 0.9845, ahead of Random Forest (0.9821) and Naive Bayes (0.9677). Its validation accuracy (0.9784) is close to its training accuracy (0.9985), so there is little sign of overfitting. Note that about 87% of the test messages are ham (725 of 837), so accuracy alone overstates how well spam is detected.

The classification report gives the class-specific picture: on the test set, spam (class 1) has precision 0.99, recall 0.89, and F1-score 0.94, so the model rarely flags ham as spam but misses roughly one in nine spam messages. Further gains are most likely to come from improving spam recall, for example by tuning the regularization parameter `C` or the `class_weight` setting of `LinearSVC`. Kernel choice does not apply here, because `LinearSVC` supports only a linear decision boundary.

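Overall, the Linear SVM is the strongest candidate for deployment among the models tested. Because it was selected using the same test set on which it is reported, the 0.9845 figure is likely slightly optimistic, and performance should be monitored in production to confirm it holds.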
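
The best-model cell ranks the models by their test scores. A common alternative is to choose on validation data and consult the test set only once for the chosen model. The sketch below replays that procedure using the validation and test scores printed earlier in this notebook; the dictionary and variable names are introduced here for illustration.

```python
# Scores copied from the validation and test outputs reported above
validation_scores = {
    "Naive Bayes": 0.9593,
    "Linear SVM": 0.9784,
    "Random Forest": 0.9760,
}
test_scores = {
    "Naive Bayes": 0.9677,
    "Linear SVM": 0.9845,
    "Random Forest": 0.9821,
}

# Choose the model using validation data only
selected_name = max(validation_scores, key=validation_scores.get)

# Look at the test set once, for the selected model
print(f"Selected by validation score: {selected_name}")
print(f"Held-out test score: {test_scores[selected_name]:.4f}")
```

With these numbers the choice does not change: the Linear SVM also has the highest validation score.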