# Baseline Model Training - Code Comment Classification

This notebook trains a baseline Logistic Regression model with hyperparameter tuning.

## Steps:
1. Load encoded features
2. Build pipeline with RandomOverSampler + Logistic Regression
3. GridSearchCV for hyperparameter tuning
4. Evaluate on test set

## Input Files:
- `train_features_4cat_bert_meta.npz`
- `test_features_4cat_bert_meta.npz`
- `train_target_4cat_meta.csv`
- `test_target_4cat_meta.csv`

## Output:
- Best hyperparameters
- Test set performance
- Classification report

## Import Required Libraries

In [1]:
import pandas as pd
import numpy as np

# Scikit-learn
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Imbalanced-learn
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Utilities
from scipy import sparse

print("All libraries imported successfully!")

All libraries imported successfully!


---
# Part 3: Model Training (Baseline)
---

## 3.1 Load Preprocessed Data
We load the **already encoded** sparse matrices created in the encoding step.

Loading these files guarantees that we reuse the exact train/test split from the encoding step, which keeps test-set information out of training.

In [2]:
# Load Features (Sparse Matrices)
X_train_model = sparse.load_npz("train_features_4cat_bert_meta.npz")
X_test_model = sparse.load_npz("test_features_4cat_bert_meta.npz")

# Load Targets (CSVs)
# .values.ravel() flattens the single-column DataFrame into a 1D array
y_train_model = pd.read_csv("train_target_4cat_meta.csv").values.ravel()
y_test_model = pd.read_csv("test_target_4cat_meta.csv").values.ravel()

print(f"Training Data: {X_train_model.shape}")
print(f"Test Data:     {X_test_model.shape}")

Training Data: (2249, 695)
Test Data:     (563, 695)
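
A quick consistency check after loading can catch a mismatched file pair early. The sketch below is illustrative only: `check_split` is a hypothetical helper, not part of the notebook, and the label counts passed to it are assumed to equal the row counts printed above (the notebook does not print them).

```python
def check_split(train_shape, test_shape, n_train_labels, n_test_labels):
    """Validate that feature matrices and label vectors are mutually consistent.

    Returns the fraction of all rows that fall in the test set.
    """
    n_train, n_train_features = train_shape
    n_test, n_test_features = test_shape

    # Both matrices must share one feature space, otherwise predict() will fail.
    if n_train_features != n_test_features:
        raise ValueError(
            f"Feature count mismatch: {n_train_features} vs {n_test_features}"
        )
    # Each row needs exactly one label.
    if n_train != n_train_labels or n_test != n_test_labels:
        raise ValueError("Row count and label count differ")

    return n_test / (n_train + n_test)


# Shapes as printed above; label counts are ASSUMED to equal the row counts.
test_fraction = check_split((2249, 695), (563, 695), 2249, 563)
print(f"Test fraction: {test_fraction:.3f}")
```

In the notebook itself the call would presumably be `check_split(X_train_model.shape, X_test_model.shape, len(y_train_model), len(y_test_model))`. The shapes above correspond to roughly an 80/20 split.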


## 3.2 Build Balanced Pipeline
We use `imbalanced-learn`'s pipeline.
1. **RandomOverSampler:** This will run *inside* the Cross-Validation loop. It balances the training folds but leaves the validation folds unbalanced (honest validation).
2. **Classifier:** Logistic Regression.
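
To make the balancing step concrete, here is a minimal pure-Python sketch of random oversampling. It is not `imbalanced-learn`'s implementation; `random_oversample` and the toy data are hypothetical, and the sketch only illustrates the idea of duplicating minority-class rows (drawn with replacement) until every class matches the largest one.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Keep every original row, then top up with draws made with replacement.
        extra = rng.choices(members, k=target - len(members))
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): class 0 has six rows, class 1 has two.
toy_rows = [[i] for i in range(8)]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(Counter(toy_labels), "->", Counter(balanced_labels))
```

After resampling, both classes have six rows. No new information is created: the extra minority rows are exact copies, which is why the resampling must never touch validation or test data.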

In [3]:
# Define the pipeline
# Note: no TfidfVectorizer step here because X_train_model is already vectorized
pipeline = ImbPipeline([
    ('oversample', RandomOverSampler(random_state=42)),
    ('clf', LogisticRegression(max_iter=2000, random_state=42))
])

# Hyperparameter grid: 3 x 2 x 2 x 2 = 24 combinations
# (regularization strength, solver, penalty, class weight)
param_grid = {
    'clf__C': [0.1, 1, 10],  # 3 regularization strengths
    'clf__solver': ['liblinear', 'lbfgs'],  # 2 solvers
    'clf__penalty': ['l1', 'l2'],  # 2 penalties
    'clf__class_weight': ['balanced', None]  # 2 options
}

n_combinations = int(np.prod([len(values) for values in param_grid.values()]))

print("Pipeline created with internal balancing.")
print(f"Total hyperparameter combinations to test: {n_combinations}")
print("Note: lbfgs does not support the l1 penalty; those fits fail and are scored as nan.")

Pipeline created with internal balancing.
Total hyperparameter combinations to test: 24
Note: lbfgs does not support the l1 penalty; those fits fail and are scored as nan.
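
As a cross-check on the grid size, the sketch below enumerates `param_grid` with `itertools.product` and flags the solver/penalty pairs that `LogisticRegression` rejects (`lbfgs` with `l1`). The name `valid_only_grid` is hypothetical and not part of the notebook; it shows the list-of-dicts form that `GridSearchCV` also accepts, which would presumably avoid the failing fits.

```python
from itertools import product

param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__solver": ["liblinear", "lbfgs"],
    "clf__penalty": ["l1", "l2"],
    "clf__class_weight": ["balanced", None],
}

# Expand the grid into one dict per combination.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

# lbfgs does not support the l1 penalty, so lbfgs + l1 cannot be fitted.
invalid = [c for c in combos
           if c["clf__solver"] == "lbfgs" and c["clf__penalty"] == "l1"]

n_folds = 5  # cv=5 in the grid search
print(f"combinations: {len(combos)}, invalid: {len(invalid)}")
print(f"total fits: {len(combos) * n_folds}, failing fits: {len(invalid) * n_folds}")

# Hypothetical alternative: a list of sub-grids that never pairs lbfgs with l1.
valid_only_grid = [
    {"clf__C": [0.1, 1, 10], "clf__solver": ["liblinear"],
     "clf__penalty": ["l1", "l2"], "clf__class_weight": ["balanced", None]},
    {"clf__C": [0.1, 1, 10], "clf__solver": ["lbfgs"],
     "clf__penalty": ["l2"], "clf__class_weight": ["balanced", None]},
]
n_valid = sum(len(list(product(*g.values()))) for g in valid_only_grid)
print(f"valid combinations: {n_valid}")
```

With `cv=5` this gives 24 × 5 = 120 fits, of which 6 × 5 = 30 cannot succeed. That is consistent with the warning printed by the grid search in section 3.3.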


## 3.3 Train and Evaluate (with Cross-Validation)
We use `GridSearchCV` to find the best parameters. The `ImbPipeline` ensures that for every fold of cross-validation, the model is trained on balanced data but validated on real, unbalanced data.
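
The fold-level behaviour described above can be sketched without any library. This is an illustration only: `fold_class_counts` and the toy labels are hypothetical, and the folds come from a plain shuffled split chosen for brevity, which may differ from the splitting scikit-learn performs internally.

```python
import random
from collections import Counter


def fold_class_counts(labels, n_folds=5, seed=42):
    """For each fold, return (class counts of the oversampled training part,
    class counts of the untouched validation part)."""
    rng = random.Random(seed)
    order = list(range(len(labels)))
    rng.shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]

    results = []
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]

        # Oversample the training indices only: top up each class to the largest one.
        by_class = {}
        for i in train_idx:
            by_class.setdefault(labels[i], []).append(i)
        target = max(len(members) for members in by_class.values())
        for members in by_class.values():
            train_idx = train_idx + rng.choices(members, k=target - len(members))

        results.append((Counter(labels[i] for i in train_idx),
                        Counter(labels[i] for i in val_idx)))
    return results


# Toy labels (hypothetical): 40 samples of class 0 and 10 of class 1.
toy_labels = [0] * 40 + [1] * 10
for train_counts, val_counts in fold_class_counts(toy_labels):
    print("train:", dict(train_counts), "validation:", dict(val_counts))
```

In every fold the training counts come out equal across classes, while the validation counts keep the original imbalance. Oversampling the full training set before splitting would instead let copies of the same row land on both sides of a fold and inflate the CV score.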

In [4]:
# 1. Run Grid Search
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
grid.fit(X_train_model, y_train_model)

print("Best Parameters:", grid.best_params_)
print("Best CV Score (F1 Macro):", grid.best_score_)

# 2. Final Evaluation on the Locked Test Set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test_model)

print("\n=== FINAL TEST SET RESULTS ===")
print("Accuracy:", accuracy_score(y_test_model, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test_model, y_pred))

30 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\lmacera\Desktop\CodeCommentClassification\venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lmacera\Desktop\CodeCommentClassification\venv\Lib\site-packages\sklearn\base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\lmacera\Desktop\CodeCommentClassification\venv\Lib\site-packages\imblearn\pipeline.py", line 522, in fit
    self._final_estimator.fit(Xt, yt, **

Best Parameters: {'clf__C': 1, 'clf__class_weight': 'balanced', 'clf__penalty': 'l2', 'clf__solver': 'liblinear'}
Best CV Score (F1 Macro): 0.6721492337913795

=== FINAL TEST SET RESULTS ===
Accuracy: 0.69449378330373

Classification Report:

              precision    recall  f1-score   support

           0       0.77      0.72      0.74       212
           1       0.51      0.52      0.52       101
           2       0.78      0.77      0.77       159
           3       0.61      0.69      0.65        91

    accuracy                           0.69       563
   macro avg       0.67      0.68      0.67       563
weighted avg       0.70      0.69      0.70       563
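
The two summary rows of the report can be checked by hand. The sketch below recomputes them from the per-class rows above; because it starts from values already rounded to two decimals, the results may differ from the report in the last digit.

```python
# Per-class rows copied from the classification report:
# (precision, recall, f1-score, support)
report = {
    0: (0.77, 0.72, 0.74, 212),
    1: (0.51, 0.52, 0.52, 101),
    2: (0.78, 0.77, 0.77, 159),
    3: (0.61, 0.69, 0.65, 91),
}

total_support = sum(row[3] for row in report.values())


def macro_avg(column):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[column] for row in report.values()) / len(report)


def weighted_avg(column):
    """Mean weighted by support: frequent classes count more."""
    return sum(row[column] * row[3] for row in report.values()) / total_support


for name, column in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:>9}  macro={macro_avg(column):.3f}  weighted={weighted_avg(column):.3f}")

# Accuracy times the test-set size gives the number of correct predictions.
n_correct = round(0.69449378330373 * total_support)
print(f"correct predictions: {n_correct} of {total_support}")
```

The macro average treats the weakest class (class 1, F1 of 0.52) the same as the largest one, which is why `f1_macro` is a stricter tuning target than accuracy on imbalanced data.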

