# Code Comment Classification - Model Training

This notebook performs the following model training operations:
1. Load Preprocessed Data
2. Build Balanced Pipeline
3. Train and Evaluate (with Cross-Validation)
4. Save the Baseline Model

## 1. Load Preprocessed Data
We load the **already encoded** sparse matrices created in the `encoding.ipynb` notebook.

Loading them from disk, rather than re-encoding here, ensures we reuse the exact train/test split made in that notebook, which is what prevents data leakage between the two sets.

In [9]:
import pandas as pd
import numpy as np
from scipy import sparse

# Load Features (Sparse Matrices)
X_train = sparse.load_npz("train_features.npz")
X_test = sparse.load_npz("test_features.npz")

# Load Targets (CSVs)
# .values.ravel() flattens the single-column DataFrame into a plain 1D array
y_train = pd.read_csv("code-comment-classification-train-target.csv").values.ravel()
y_test = pd.read_csv("code-comment-classification-test-target.csv").values.ravel()

print(f"Training Data: {X_train.shape}")
print(f"Test Data:     {X_test.shape}")

Training Data: (2291, 5306)
Test Data:     (573, 5306)
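
Before training, it can be worth confirming that the loaded features and targets line up row for row. The check below is a minimal sketch on stand-in data, since the real `.npz` and `.csv` files are not reproduced here; `check_alignment`, `X_demo`, and `y_demo` are hypothetical names, not part of the notebook.

```python
import numpy as np
from scipy import sparse


def check_alignment(X, y):
    """Verify there is one label per feature row and return per-class counts."""
    if X.shape[0] != len(y):
        raise ValueError(f"{X.shape[0]} feature rows but {len(y)} labels")
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes.tolist(), counts.tolist()))


# Stand-in data (hypothetical): 6 samples, 4 features, 3 classes
X_demo = sparse.csr_matrix(np.arange(24).reshape(6, 4) % 3)
y_demo = np.array([0, 0, 0, 1, 1, 2])

class_counts = check_alignment(X_demo, y_demo)
print(class_counts)  # {0: 3, 1: 2, 2: 1}
```

On the real data the call would be `check_alignment(X_train, y_train)`. The returned counts also show how imbalanced the classes are before any oversampling.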


## 2. Build Balanced Pipeline
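We use the `Pipeline` class from `imbalanced-learn`, which accepts a resampling step and applies it only while fitting. It has two steps:
1. **RandomOverSampler:** Runs *inside* the cross-validation loop. It balances each training fold but leaves the validation fold untouched, so validation scores reflect the real class distribution (honest validation).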
2. **Classifier:** Logistic Regression.
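
To make step 1 concrete, the sketch below reproduces the idea of random oversampling in plain NumPy on made-up labels: minority-class rows are duplicated at random until every class matches the majority count. It illustrates the concept only and is not `imbalanced-learn`'s implementation; `y_toy` is hypothetical data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up imbalanced labels: six of class 0, three of class 1, one of class 2
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)

classes, counts = np.unique(y_toy, return_counts=True)
target = counts.max()  # every class is grown to the majority size

resampled = []
for cls, count in zip(classes, counts):
    idx = np.flatnonzero(y_toy == cls)
    # Draw extra row indices with replacement (zero draws for the majority class)
    extra = idx[rng.integers(0, len(idx), size=target - count)]
    resampled.append(np.concatenate([idx, extra]))
resampled = np.concatenate(resampled)

balanced_counts = np.unique(y_toy[resampled], return_counts=True)[1]
print(counts.tolist(), "->", balanced_counts.tolist())  # [6, 3, 1] -> [6, 6, 6]
```

Because the extra rows are exact duplicates, no new information is created; the classifier simply sees minority examples more often during fitting.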

In [10]:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the pipeline
# Note: No TfidfVectorizer here because X_train is already vectorized!
pipeline = ImbPipeline([
    ('oversample', RandomOverSampler(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

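# Define the hyperparameter grid for GridSearchCV
# The 'clf__' prefix routes each parameter to the pipeline step named 'clf'
param_grid = {
    'clf__C': [0.1, 1, 10],  # inverse regularization strength (smaller = stronger)
    'clf__solver': ['liblinear', 'lbfgs']
}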

print("Pipeline created with Internal Balancing.")

Pipeline created with Internal Balancing.
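
The size of this grid determines how many models the search will fit. As a rough sketch, scikit-learn's `ParameterGrid` can enumerate the combinations; the fold count of 5 is taken from the `cv=5` setting used in the next section, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__solver': ['liblinear', 'lbfgs'],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 values of C x 2 solvers
n_folds = 5  # assumption: matches cv=5 in the grid search
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 6 30
```

With the default `refit=True`, `GridSearchCV` then fits the best combination once more on the full training set, which is the model returned as `best_estimator_`.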


## 3. Train and Evaluate (with Cross-Validation)
We use `GridSearchCV` to find the best parameters. The `ImbPipeline` ensures that for every fold of cross-validation, the model is trained on balanced data but validated on real, unbalanced data.
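
What happens in each fold can be spelled out by hand. The loop below is a simplified sketch on synthetic data, with NumPy-based oversampling standing in for `RandomOverSampler`; it shows that only training rows are ever duplicated, while each validation fold is scored as-is. All names and the dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real features: 3 imbalanced classes
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)
rng = np.random.default_rng(42)
fold_scores = []

for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    y_tr = y[train_idx]
    # Oversample the training fold only: grow each class to the majority count
    classes, counts = np.unique(y_tr, return_counts=True)
    parts = []
    for cls, count in zip(classes, counts):
        rows = train_idx[y_tr == cls]
        extra = rows[rng.integers(0, len(rows), size=counts.max() - count)]
        parts.append(np.concatenate([rows, extra]))
    balanced_idx = np.concatenate(parts)

    clf = LogisticRegression(max_iter=1000).fit(X[balanced_idx], y[balanced_idx])
    # The validation fold keeps its original, imbalanced class distribution
    fold_scores.append(
        f1_score(y[val_idx], clf.predict(X[val_idx]), average="macro")
    )

print("Mean macro F1 over folds:", np.mean(fold_scores))
```

Oversampling before splitting would instead let copies of the same row land in both the training and validation folds, inflating the validation score.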

In [11]:
from sklearn.metrics import classification_report, accuracy_score

# 1. Run Grid Search
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV Score (F1 Macro):", grid.best_score_)

# 2. Final Evaluation on the Locked Test Set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("\n=== FINAL TEST SET RESULTS ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

Best Parameters: {'clf__C': 1, 'clf__solver': 'lbfgs'}
Best CV Score (F1 Macro): 0.5755785949661258

=== FINAL TEST SET RESULTS ===
Accuracy: 0.5968586387434555

Classification Report:

              precision    recall  f1-score   support

           0       0.35      0.45      0.39        62
           1       0.46      0.46      0.46       101
           2       0.79      0.77      0.78       159
           3       0.57      0.54      0.55        91
           4       0.63      0.61      0.62       160

    accuracy                           0.60       573
   macro avg       0.56      0.56      0.56       573
weighted avg       0.61      0.60      0.60       573
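
The two averages in the report can be recomputed from the per-class rows. This is a small worked example using the rounded values printed above, so it reproduces the report only to two decimals.

```python
# Per-class F1 and support, copied from the report above
f1 = [0.39, 0.46, 0.78, 0.55, 0.62]
support = [62, 101, 159, 91, 160]

# Macro average: every class counts equally
macro_f1 = sum(f1) / len(f1)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f * n for f, n in zip(f1, support)) / sum(support)

print(f"macro avg:    {macro_f1:.2f}")     # 0.56
print(f"weighted avg: {weighted_f1:.2f}")  # 0.60
```

Class 0, the smallest class with 62 test samples, has the lowest F1 (0.39). It counts for a full fifth of the macro average but only about a tenth of the weighted one, which is why the macro figure is the lower of the two.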



## 4. Save the Baseline Model
We save the trained Logistic Regression pipeline as our **baseline**. Any future, more complex model (such as Random Forest or SVM) must beat its **cross-validated F1-macro of 0.576** (0.56 macro F1 on the held-out test set) to be considered an improvement.

In [12]:
import joblib

# Save the entire pipeline (including the oversampler and classifier)
model_filename = "best_logistic_regression.pkl"
joblib.dump(best_model, model_filename)

print(f"Model saved successfully to {model_filename}")

Model saved successfully to best_logistic_regression.pkl
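
A saved model is only useful if it can be restored. The round trip below is a minimal sketch using a small stand-in model and a temporary directory rather than the notebook's actual pipeline file; the data and file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model (hypothetical data, not the notebook's pipeline)
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

For the file saved above, the equivalent call would be `joblib.load("best_logistic_regression.pkl")`. Loading it requires `imbalanced-learn` and `scikit-learn` to be installed, ideally in the same versions used for training, and pickle files should only be loaded from trusted sources.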
