# Code Comment Classification - Model Training

This notebook performs the following model training operations:
1. Load cleaned dataset (no outliers)
2. Select features and target
3. Split data into train and test sets
4. Build ML pipeline (preprocessing + model)
5. Train model
6. Evaluate performance
7. Save final model (pipeline + classifier)

## 1. Load cleaned dataset (no outliers)

In [1]:
import pandas as pd

df_clean = pd.read_csv("code-comment-classification-cleaned-no-outliers.csv")

print("Cleaned dataset shape:", df_clean.shape)
df_clean.head()

Cleaned dataset shape: (12775, 7)


| Unnamed: 0 | comment_sentence_id | class | category | comment_sentence | partition | instance_type | outlier |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | AccessMixin | DevelopmentNotes | abstract cbv mixin that gives access mixins th... | 1 | 0 | 0 |
| 1 | 1 | AccessMixin | Expand | abstract cbv mixin that gives access mixins th... | 1 | 0 | 0 |
| 2 | 1 | AccessMixin | Parameters | abstract cbv mixin that gives access mixins th... | 1 | 0 | 0 |
| 3 | 1 | AccessMixin | Summary | abstract cbv mixin that gives access mixins th... | 1 | 1 | 0 |
| 4 | 1 | AccessMixin | Usage | abstract cbv mixin that gives access mixins th... | 0 | 0 | 0 |
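
The `Unnamed: 0` column in the preview is the name pandas gives to a header-less first column, which usually means an earlier step saved the DataFrame together with its row index. It is not in the feature list, so it does no harm here, but it can be avoided at load time. A minimal sketch, using a made-up two-row CSV rather than the real file:

```python
import io

import pandas as pd

# Hypothetical CSV text that mimics a file written with its row index.
csv_text = (
    ",comment_sentence_id,class,instance_type\n"
    "0,1,AccessMixin,0\n"
    "1,1,AccessMixin,1\n"
)

# Default read: the unnamed first column becomes "Unnamed: 0".
with_index_col = pd.read_csv(io.StringIO(csv_text))
print(list(with_index_col.columns))

# index_col=0 treats the first column as the row index instead.
no_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(no_index_col.columns))
```

Whether `index_col=0` is appropriate for the real file depends on how it was written, so treat this as something to verify.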


## 2. Select features and target

In [2]:
# Same features as before
FEATURES = ["class", "category", "comment_sentence", "partition"]
TARGET = "instance_type"

X = df_clean[FEATURES]
y = df_clean[TARGET]
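
Before splitting, it helps to look at how the target classes are distributed, since a skewed target changes how accuracy should be read later. A small sketch of that check follows. `y_demo` is a hypothetical stand-in for `y`, and its values are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for df_clean["instance_type"]; not the real labels.
y_demo = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="instance_type")

# Absolute counts per class.
counts = y_demo.value_counts()
# Relative share per class (sums to 1).
shares = y_demo.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

In the notebook the same two calls can be applied to `y` directly.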


## 3. Split data into train and test sets

In [3]:
from sklearn.model_selection import train_test_split

# 80/20 split, stratified on the target so train and test keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)


Train size: (10220, 4)
Test size: (2555, 4)
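
To see what `stratify=y` does, the split can be repeated on toy labels and the class share compared on both sides. This is a sketch with made-up data (80 negatives, 20 positives), not the notebook's dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 negatives and 20 positives (illustrative only).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Sizes of the two parts and the share of positives in each.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

Both parts keep the 20% positive share of the toy labels. The same comparison on `y_train` and `y_test` confirms the real split.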


## 4. Build ML pipeline (preprocessing + model)
Bundling the preprocessing and the classifier into a single scikit-learn `Pipeline` means the encoder and vectorizer are fit on the training data only and are applied in exactly the same way at prediction time.

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Preprocessing:
# - One-hot encoding for categorical features
# - TF-IDF for text
# - passthrough numeric feature

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "category"]),
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), "comment_sentence"),
        ("num", "passthrough", ["partition"])
    ]
)

# Final ML pipeline
# n_jobs is left at its default: it only parallelizes over classes, so it does not speed up a binary target
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=400))
])
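
The sketch below shows how a pipeline of this shape behaves on raw rows, including a `class` value it never saw during fitting. The toy rows, the `toy_*` names and the `Lexer` class are hypothetical. Only the column names follow the notebook, and the vectorizer uses default settings to keep the example small:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; only the column names match the notebook.
toy = pd.DataFrame({
    "class": ["AccessMixin", "AccessMixin", "Parser", "Parser"],
    "category": ["Summary", "Usage", "Summary", "Usage"],
    "comment_sentence": [
        "abstract mixin that checks access",
        "call dispatch to handle the request",
        "parses tokens from the input stream",
        "pass a file path to the parse method",
    ],
    "partition": [0, 0, 1, 1],
    "instance_type": [1, 0, 1, 0],
})

toy_preprocess = ColumnTransformer(
    transformers=[
        # A list of columns yields a 2D frame, as OneHotEncoder expects.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "category"]),
        # A single column name yields a 1D series, as TfidfVectorizer expects.
        ("text", TfidfVectorizer(), "comment_sentence"),
        ("num", "passthrough", ["partition"]),
    ]
)
toy_model = Pipeline([
    ("preprocess", toy_preprocess),
    ("clf", LogisticRegression(max_iter=400)),
])
toy_model.fit(toy.drop(columns="instance_type"), toy["instance_type"])

# "Lexer" never appeared during fit; handle_unknown="ignore" encodes it
# as all zeros instead of raising an error.
new_row = pd.DataFrame({
    "class": ["Lexer"],
    "category": ["Summary"],
    "comment_sentence": ["splits the input into tokens"],
    "partition": [0],
})
pred = toy_model.predict(new_row)
print(pred)
```

Note the difference between `["class", "category"]` (a list) and `"comment_sentence"` (a plain string) in the transformer definitions: the text vectorizer needs a one-dimensional input, so its column is given as a string.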


## 5. Train model

In [5]:
model.fit(X_train, y_train)

print("Model training complete!")


Model training complete!


## 6. Evaluate performance

In [6]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predictions
y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Detailed metrics
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7639921722113503

Classification Report:

              precision    recall  f1-score   support

           0       0.77      0.98      0.87      1982
           1       0.17      0.01      0.03       573

    accuracy                           0.76      2555
   macro avg       0.47      0.50      0.45      2555
weighted avg       0.64      0.76      0.68      2555


Confusion Matrix:

[[1944   38]
 [ 565    8]]
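
The headline accuracy hides how the model treats class 1. As a sanity check, the figures in the report can be recomputed by hand from the confusion matrix printed above, where rows are true labels and columns are predictions. The always-predict-0 baseline is added for comparison and is not part of the original evaluation:

```python
# Confusion matrix from the output above: rows = true label, columns = prediction.
tn, fp = 1944, 38
fn, tp = 565, 8

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # of all rows predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)     # of all rows that are truly 1, how many were found

# Accuracy of a trivial baseline that always predicts class 0.
majority_baseline = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")           # 0.7640
print(f"precision (1)     = {precision_1:.4f}")        # 0.1739
print(f"recall (1)        = {recall_1:.4f}")           # 0.0140
print(f"majority baseline = {majority_baseline:.4f}")  # 0.7757
```

The accuracy of 0.7640 sits just below the 0.7757 that a constant class-0 prediction would reach, and only 8 of the 573 positives are recovered. One option worth trying, not evaluated here, is `LogisticRegression(class_weight="balanced")`.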


## 7. Save final model (pipeline + classifier)
The saved file includes both:
- the preprocessing steps
- the classifier

This means you can call `.predict()` directly on raw new samples.

In [7]:
# Persist the fitted pipeline (preprocessing + classifier) to disk

import joblib

joblib.dump(model, "final_model_pipeline.pkl")

print("Saved: final_model_pipeline.pkl")


Saved: final_model_pipeline.pkl
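
A round trip through `joblib` can be checked on a small stand-in pipeline before relying on the saved file. The texts, labels and file name below are hypothetical, and the stand-in uses only a TF-IDF step so that the example stays short:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in data and pipeline (not the notebook's model).
texts = [
    "returns the parsed value",
    "see also the base class",
    "returns a new list",
    "see the parent docs",
]
labels = [1, 0, 1, 0]

stand_in = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
]).fit(texts, labels)

# Save to a temporary directory, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_pipeline.pkl")
    joblib.dump(stand_in, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = bool((restored.predict(texts) == stand_in.predict(texts)).all())
print(same)
```

For the real file, `joblib.load("final_model_pipeline.pkl")` returns the fitted pipeline, and `.predict()` then expects a DataFrame with the four `FEATURES` columns. Loading is most reliable with the same scikit-learn version that saved the file, and pickle files should only be loaded from trusted sources.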
