# Code Comment Classification - Model Training

This notebook performs the following model training operations:
1. Load cleaned dataset (no outliers)
2. Select features and target
3. Split data into train and test sets
4. Build ML pipeline (preprocessing + model)
5. Train model
6. Evaluate performance
7. Save final model (pipeline + classifier)
8. Hyperparameter Tuning
9. Evaluate best model on test set
10. Save best model

## 1. Load cleaned dataset (no outliers)

In [24]:
import pandas as pd

df_clean = pd.read_csv("code-comment-classification-cleaned-no-outliers.csv")

print("Cleaned dataset shape:", df_clean.shape)
df_clean.head()

Cleaned dataset shape: (2864, 5)


| Unnamed: 0 | comment_sentence_id | class | comment_sentence | category | outlier |
|---|---|---|---|---|---|
| 0 | 512 | MigrationGraph | migrations files can be marked as replacing an... | 4 | 0 |
| 1 | 513 | MigrationGraph | this is to support the squash feature. | 4 | 0 |
| 2 | 514 | MigrationGraph | the graph handler isn t responsible | 4 | 0 |
| 3 | 515 | MigrationGraph | for these instead, the code to load them in he... | 4 | 0 |
| 4 | 516 | MigrationGraph | migration files and if the replaced migrations... | 4 | 0 |


## 2. Select features and target

In [25]:
# Same features as before
FEATURES = ["class", "comment_sentence"]
TARGET = "category"

X = df_clean[FEATURES]
y = df_clean[TARGET]


## 3. Split data into train and test sets

In [26]:
from sklearn.model_selection import train_test_split

# Standard 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)


Train size: (2291, 2)
Test size: (573, 2)
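
The effect of `stratify=y` is easier to see on a small example. The sketch below uses hypothetical toy labels (`y_toy`, `X_toy`, and the 60/30/10 imbalance are assumptions for illustration, not the notebook's data) and checks that both partitions keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 3 classes with a 60/30/10 imbalance
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# Count how many samples of each class landed in each partition
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("Train counts:", train_counts)
print("Test counts:", test_counts)
```

Without `stratify`, a rare class could by chance end up under-represented in the test set, which would make its per-class metrics noisier.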


## 4. Build ML pipeline (preprocessing + model)
Bundling the preprocessing and the classifier into a single `Pipeline` means the transformers are fitted on the training data only, and exactly the same transformations are reapplied at prediction time.

In [27]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

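# Preprocessing:
# - One-hot encoding for the categorical feature ("class")
# - TF-IDF (unigrams + bigrams) for the text feature ("comment_sentence")
# Columns not listed here are dropped (ColumnTransformer default: remainder="drop")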

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), "comment_sentence")
    ]
)

# Final ML pipeline
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=400, n_jobs=-1))
])
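
To see what the `ColumnTransformer` produces, here is a minimal sketch on a hypothetical three-row DataFrame (the rows and the `Loader` class name are made up for illustration). It shows that the one-hot columns and the TF-IDF columns are concatenated side by side into one feature matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows with the same two feature columns as the notebook
toy = pd.DataFrame({
    "class": ["MigrationGraph", "MigrationGraph", "Loader"],
    "comment_sentence": [
        "this is to support the squash feature",
        "the graph handler loads migration files",
        "loads migration files from disk",
    ],
})

toy_preprocess = ColumnTransformer(
    transformers=[
        # A list selector (["class"]) passes a 2D frame, as OneHotEncoder expects
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),
        # A string selector passes a 1D series of documents, as TfidfVectorizer expects
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), "comment_sentence"),
    ]
)

features = toy_preprocess.fit_transform(toy)

# Width of each block in the combined matrix
n_cat = len(toy_preprocess.named_transformers_["cat"].categories_[0])
n_text = len(toy_preprocess.named_transformers_["text"].vocabulary_)
print("Combined shape:", features.shape)
print("One-hot columns:", n_cat, "| TF-IDF columns:", n_text)
```

The difference between `["class"]` (a list) and `"comment_sentence"` (a plain string) matters: passing the text column as a list would hand `TfidfVectorizer` a 2D input, which it does not accept.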


## 5. Train model

In [28]:
model.fit(X_train, y_train)

print("Model training complete!")


Model training complete!


## 6. Evaluate performance

In [29]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predictions
y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Detailed metrics
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.5863874345549738

Classification Report:

              precision    recall  f1-score   support

           0       0.33      0.16      0.22        62
           1       0.52      0.40      0.45       101
           2       0.70      0.81      0.75       159
           3       0.56      0.49      0.52        91
           4       0.56      0.71      0.63       160

    accuracy                           0.59       573
   macro avg       0.53      0.51      0.51       573
weighted avg       0.57      0.59      0.57       573


Confusion Matrix:

[[ 10   2  12  13  25]
 [  3  40  17  14  27]
 [  1   5 128   4  21]
 [  5  12  14  45  15]
 [ 11  18  13   5 113]]
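
The numbers in the classification report can be recovered from the confusion matrix by hand. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and recomputes accuracy, recall, and precision with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([
    [10, 2, 12, 13, 25],
    [3, 40, 17, 14, 27],
    [1, 5, 128, 4, 21],
    [5, 12, 14, 45, 15],
    [11, 18, 13, 5, 113],
])

true_positives = np.diag(cm)

# Recall: correct predictions divided by the number of true samples (row sums)
recall = true_positives / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions (column sums)
precision = true_positives / cm.sum(axis=0)
# Accuracy: all correct predictions divided by all samples
overall_accuracy = true_positives.sum() / cm.sum()

print("Accuracy:", overall_accuracy)
print("Recall per class:", np.round(recall, 2))
print("Precision per class:", np.round(precision, 2))
```

Reading the first row this way shows where class `0` goes wrong: only `10` of its `62` samples are predicted correctly, and `25` are assigned to class `4`.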


## 7. Save final model (pipeline + classifier)
The saved file includes both:
- the preprocessing pipeline
- the classifier
So you can directly call `.predict()` on raw new samples, as long as they come as a DataFrame with the `class` and `comment_sentence` columns.

In [30]:
import joblib

joblib.dump(model, "final_model_pipeline.pkl")

print("Saved: final_model_pipeline.pkl")


Saved: final_model_pipeline.pkl
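
A minimal sketch of the save/load round trip follows. To stay self-contained it trains a tiny stand-in pipeline on hypothetical rows and writes to a temporary directory; the rows, labels, the `Loader` and `UnseenClass` names, and the file name `toy_pipeline.pkl` are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data with the notebook's two feature columns
train = pd.DataFrame({
    "class": ["MigrationGraph", "MigrationGraph", "Loader", "Loader"],
    "comment_sentence": [
        "this is to support the squash feature",
        "the graph handler is not responsible for these",
        "todo remove this workaround",
        "todo clean up the loader",
    ],
})
labels = [4, 4, 1, 1]

toy_pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),
        ("text", TfidfVectorizer(stop_words="english"), "comment_sentence"),
    ])),
    ("clf", LogisticRegression(max_iter=400)),
])
toy_pipeline.fit(train, labels)

# Raw, unprocessed samples; the unseen class name is encoded as all zeros
new_samples = pd.DataFrame({
    "class": ["Loader", "UnseenClass"],
    "comment_sentence": ["todo fix this later", "supports the squash feature"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)   # save preprocessing + classifier together
    restored = joblib.load(path)      # load them back as one object

predictions = restored.predict(new_samples)
print(predictions)
```

Because `joblib` files are pickle-based, only load model files from sources you trust, and ideally with the same `scikit-learn` version that saved them.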


## Why `Logistic Regression` Was Chosen First
`Logistic Regression` is a standard baseline model for text classification.
It is preferred as the first model because:

### 1. It performs well on high-dimensional sparse data
Our encoding uses:
- TF-IDF (possibly thousands of features)
- One-Hot encoding
`Logistic Regression` with linear decision boundaries and `L2` regularization is well suited to this kind of dataset.
This is why it is a common first model in:
- NLP text classification papers
- Spam detection
- Sentiment classification
- Document categorization
It also appears as a baseline in the `scikit-learn` text classification examples.

### 2. It trains very fast
Good for iterative development, checking preprocessing pipelines, etc.

### 3. It provides interpretable coefficients
Each coefficient ties one token or class name to one category, which makes predictions easier to explain in academic and professional settings.

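### 4. It handles sparse data better than tree models
Tree-based models (`RandomForest`, `XGBoost`) often struggle with high-dimensional sparse `TF-IDF` features, because each split looks at a single feature and most features are zero for most samples.
`Logistic Regression` combines all features in one linear score, so it copes well with this setup.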

### 5. It works well with imbalanced data after tuning
Our dataset is moderately imbalanced. Test-set support per class:
- Class `0`: `62`
- Class `1`: `101`
- Class `2`: `159`
- Class `3`: `91`
- Class `4`: `160`
Class `0` is the smallest class (about `11%` of the test set), and the baseline `LR` did not correctly learn its boundaries yet.

## Why is the `F1` for class `0` so low?
Our results:

```text
Class 0 Precision: 0.33
Class 0 Recall:    0.16
F1:                0.22
```

This means the model rarely predicts class `0`: only `10` of the `62` class `0` test samples are recovered, and `25` of them are labelled as class `4`. This is typical when:
1. The dataset is imbalanced: classes `2` and `4` together make up about `56%` of the test set.
2. `LogisticRegression` default parameters are not tuned. By default:
    - It uses `C=1.0` (`C` is the inverse of the regularization strength)
    - It uses `class_weight=None`, so it does NOT adjust for class imbalance
3. `TF-IDF` vocabulary is likely dominated by tokens from the larger classes
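
To make the class-imbalance adjustment concrete, the sketch below shows how `class_weight="balanced"` derives its weights: `n_samples / (n_classes * count_per_class)`. The counts are the test-set supports from the classification report in section 6, used here only as an illustration; in a real fit the weights come from the training labels, which the stratified split keeps in roughly the same proportions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Test-set support per class, copied from the classification report in section 6
support = {0: 62, 1: 101, 2: 159, 3: 91, 4: 160}

# Rebuild an illustrative label vector with those counts
y_illustrative = np.repeat(list(support.keys()), list(support.values()))
classes = np.array(sorted(support))

# Weights as scikit-learn computes them for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_illustrative)

# The same formula written out by hand
manual = len(y_illustrative) / (len(classes) * np.bincount(y_illustrative))

for label, weight in zip(classes, weights):
    print(f"class {label}: weight {weight:.2f}")
```

The smallest class receives the largest weight, so its misclassifications cost more in the loss, which pushes the decision boundaries toward recovering more of its samples.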

## Conclusion
`Logistic Regression` was chosen correctly as a baseline,
but now we need to tune it to learn minority class patterns.

This is exactly what we do next.

## 8. Hyperparameter Tuning
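We will tune:

Preprocessing (TF-IDF)
- `ngram_range`
- `min_df`
- `max_df`

Model (Logistic Regression)
- `C` (inverse regularization strength)
- `class_weight`

`penalty` and `solver` also appear in the grid, but each is fixed to a single value (`l2` and `lbfgs`).

We wrap the whole pipeline in `GridSearchCV`, so the preprocessing is refitted inside every cross-validation fold and everything stays clean.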

In [31]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

# Rebuild pipeline so GridSearch can tune TF-IDF + LR
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),
        ("text", TfidfVectorizer(stop_words="english"), "comment_sentence"),
    ]
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=400, n_jobs=-1))
])

# Parameter grid to search
param_grid = {
    # TF-IDF parameters
    "preprocess__text__ngram_range": [(1,1), (1,2)],
    "preprocess__text__min_df": [1, 2, 3],
    "preprocess__text__max_df": [0.85, 1.0],
    
    # Logistic Regression parameters
    "clf__C": [0.1, 1, 3, 5],
    "clf__class_weight": [None, "balanced"],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"],
}

# Grid search scored with macro-averaged F1: every class counts equally,
# which is more informative than accuracy on imbalanced data
grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring="f1_macro",
    n_jobs=-1,
    verbose=2
)

grid.fit(X_train, y_train)

print("Best parameters found:")
print(grid.best_params_)

print("\nBest F1-macro score:", grid.best_score_)


Fitting 3 folds for each of 96 candidates, totalling 288 fits


Best parameters found:
{'clf__C': 1, 'clf__class_weight': 'balanced', 'clf__penalty': 'l2', 'clf__solver': 'lbfgs', 'preprocess__text__max_df': 0.85, 'preprocess__text__min_df': 2, 'preprocess__text__ngram_range': (1, 2)}

Best F1-macro score: 0.5772711903045793
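
The "96 candidates, 288 fits" line in the log follows directly from the grid sizes. As a quick check, the sketch below copies the grid from the cell above and counts the combinations with `ParameterGrid`, without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
param_grid = {
    "preprocess__text__ngram_range": [(1, 1), (1, 2)],
    "preprocess__text__min_df": [1, 2, 3],
    "preprocess__text__max_df": [0.85, 1.0],
    "clf__C": [0.1, 1, 3, 5],
    "clf__class_weight": [None, "balanced"],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"],
}

cv_folds = 3

# Number of parameter combinations: 2 * 3 * 2 * 4 * 2 * 1 * 1
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * cv_folds

print("Candidates:", n_candidates)
print("Total fits:", n_fits)
```

Counting candidates this way before launching a search is a cheap way to estimate run time when the grid grows.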


## 9. Evaluate best model on test set

In [32]:
from sklearn.metrics import classification_report, accuracy_score

best_model = grid.best_estimator_

y_pred_best = best_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_best))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred_best))


Accuracy: 0.6195462478184991

Classification Report:

              precision    recall  f1-score   support

           0       0.37      0.52      0.43        62
           1       0.49      0.50      0.49       101
           2       0.80      0.77      0.79       159
           3       0.60      0.60      0.60        91
           4       0.69      0.59      0.64       160

    accuracy                           0.62       573
   macro avg       0.59      0.60      0.59       573
weighted avg       0.63      0.62      0.62       573
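
A side-by-side comparison makes the effect of tuning easier to read. The sketch below copies the per-class `f1-score` values from the two classification reports (baseline in section 6, tuned model above) and computes the change per class and the macro average.

```python
# Per-class F1 copied from the two classification reports
baseline_f1 = {0: 0.22, 1: 0.45, 2: 0.75, 3: 0.52, 4: 0.63}
tuned_f1 = {0: 0.43, 1: 0.49, 2: 0.79, 3: 0.60, 4: 0.64}

# Change in F1 for each class after tuning
delta = {label: round(tuned_f1[label] - baseline_f1[label], 2) for label in baseline_f1}

# Macro F1 is the unweighted mean of the per-class scores
macro_baseline = sum(baseline_f1.values()) / len(baseline_f1)
macro_tuned = sum(tuned_f1.values()) / len(tuned_f1)

for label in baseline_f1:
    print(f"class {label}: {baseline_f1[label]:.2f} -> {tuned_f1[label]:.2f} ({delta[label]:+.2f})")
print(f"macro F1: {macro_baseline:.2f} -> {macro_tuned:.2f}")
```

The largest gain is on class `0`, the smallest class, whose recall rises from `0.16` to `0.52`. The trade-off is visible on class `4`, whose recall drops from `0.71` to `0.59` while its precision improves.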



## 10. Save best model

In [33]:
import joblib

joblib.dump(best_model, "best_logistic_regression_model.pkl")

print("Saved: best_logistic_regression_model.pkl")


Saved: best_logistic_regression_model.pkl
