## Motivation for Using Multiple Models

Initially, we tried `Logistic Regression` as a baseline model because it is:

- Fast to train
- Well-suited for high-dimensional sparse data (like `TF-IDF`)
- Interpretable and widely used in text classification

However, after evaluating `Logistic Regression`, we observed:

- High overall accuracy (`~0.76`), but extremely low `F1` for the minority class (`0.03`)
- The model predicted almost all examples as the majority class due to **class imbalance**
- This makes it **unsuitable for our task**, where detecting the minority class (`instance_type=1`) is important

### Therefore, we try multiple alternative models to improve performance:

1. `LinearSVC`
   - `Linear Support Vector Machine` is well-known for text classification  
   - Handles high-dimensional sparse data efficiently  
   - Works well with imbalanced classes using `class_weight='balanced'`

2. `SGDClassifier` (`Hinge` / `Log`)
   - Implements stochastic gradient descent for linear models  
   - Can be faster than traditional linear models on large datasets  
   - Supports `class_weight='balanced'` for imbalanced data

3. `Decision Forest` (`RandomForest`)
   - Ensemble of decision trees capturing non-linear relationships  
   - Robust to noise and outliers  
   - Handles categorical and numeric features without much preprocessing

4. `Gradient Boosting` (`XGBoost` / `LightGBM`) 
   - Powerful ensemble models that can learn complex patterns  
   - Include mechanisms to handle class imbalance (`scale_pos_weight` or `class_weight`)  
   - Often outperform linear models on structured and text-derived features

5. `Neural Network` (`MLPClassifier`)
   - Can capture non-linear interactions between features  
   - Flexible architecture for combining text embeddings (`TF-IDF`) with categorical/numeric features  

### Key Goals

- Improve detection of minority class (class `1`) without sacrificing too much overall accuracy  
- Maximize `F1`-macro, which balances performance across both classes  
- Compare models to select the best performing pipeline for deployment

## 0. Import libraries

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
import lightgbm as lgb
import joblib

## 1. Load cleaned dataset

In [22]:
df_clean = pd.read_csv("code-comment-classification-cleaned-no-outliers.csv")

FEATURES = ["class", "comment_sentence"]
TARGET = "category"

X = df_clean[FEATURES]
y = df_clean[TARGET]

# Train/test split (80/20 stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

Train size: (2291, 2)
Test size: (573, 2)


## 2. Define preprocessing pipeline
One pipeline for all models:
- `OneHotEncoder` for categorical features
- `TF-IDF` for text
- passthrough numeric feature


In [23]:
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1,2)), "comment_sentence"),
    ]
)

## 3. Define all models

In [24]:
models = {
    "LinearSVC": LinearSVC(class_weight="balanced", max_iter=4000, random_state=42),
    "SGDClassifier_hinge": SGDClassifier(loss="hinge", class_weight="balanced", max_iter=4000, random_state=42),
    "SGDClassifier_log": SGDClassifier(loss="log_loss", class_weight="balanced", max_iter=4000, random_state=42),
    "DecisionForest": RandomForestClassifier(
        n_estimators=300, 
        max_depth=None, 
        min_samples_split=5, 
        class_weight="balanced", 
        random_state=42,
        n_jobs=-1
    ),
    "XGBoost": xgb.XGBClassifier(
        use_label_encoder=False, 
        eval_metric="logloss", 
        n_estimators=200, 
        scale_pos_weight=len(y[y==1])/len(y[y!=1]), 
        random_state=42,
        n_jobs=-1
    ),
    "LightGBM": lgb.LGBMClassifier(
        n_estimators=200, 
        class_weight="balanced", 
        random_state=42, 
        n_jobs=-1
    ),
    "NeuralNetwork": MLPClassifier(
        hidden_layer_sizes=(100,), 
        activation='relu', 
        solver='adam', 
        max_iter=500, 
        random_state=42
    )
}

## 4. Train and evaluate all models

In [25]:
results = []

for name, clf in models.items():
    print(f"\n=== TRAINING: {name} ===")
    
    # Build pipeline
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("clf", clf)
    ])
    
    # Fit model
    pipeline.fit(X_train, y_train)
    
    # Predict on test set
    y_pred = pipeline.predict(X_test)
    
    # Compute metrics
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    
    results.append({
        "model": name,
        "accuracy": acc,
        "f1_class1": report["1"]["f1-score"],
        "f1_class2": report["2"]["f1-score"],
        "f1_class3": report["3"]["f1-score"],
        "f1_class4": report["4"]["f1-score"],
        "f1_macro": report["macro avg"]["f1-score"]
    })
    
    # Print metrics
    print(f"Accuracy: {acc:.3f}")
    print(classification_report(y_test, y_pred))


=== TRAINING: LinearSVC ===
Accuracy: 0.578
              precision    recall  f1-score   support

           0       0.30      0.29      0.29        62
           1       0.46      0.41      0.43       101
           2       0.72      0.75      0.74       159
           3       0.55      0.49      0.52        91
           4       0.61      0.67      0.64       160

    accuracy                           0.58       573
   macro avg       0.53      0.52      0.52       573
weighted avg       0.57      0.58      0.57       573


=== TRAINING: SGDClassifier_hinge ===
Accuracy: 0.567
              precision    recall  f1-score   support

           0       0.33      0.37      0.35        62
           1       0.45      0.42      0.43       101
           2       0.71      0.74      0.72       159
           3       0.52      0.44      0.48        91
           4       0.61      0.64      0.63       160

    accuracy                           0.57       573
   macro avg       0.52      0.

Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Accuracy: 0.557
              precision    recall  f1-score   support

           0       0.27      0.21      0.24        62
           1       0.47      0.33      0.39       101
           2       0.70      0.75      0.72       159
           3       0.53      0.51      0.52        91
           4       0.55      0.68      0.60       160

    accuracy                           0.56       573
   macro avg       0.50      0.49      0.49       573
weighted avg       0.54      0.56      0.54       573


=== TRAINING: LightGBM ===
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001018 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 784
[LightGBM] [Info] Number of data points in the train set: 2291, number of used features: 91
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGB



Accuracy: 0.483
              precision    recall  f1-score   support

           0       0.26      0.34      0.29        62
           1       0.46      0.34      0.39       101
           2       0.76      0.64      0.70       159
           3       0.32      0.64      0.43        91
           4       0.60      0.39      0.47       160

    accuracy                           0.48       573
   macro avg       0.48      0.47      0.46       573
weighted avg       0.54      0.48      0.49       573


=== TRAINING: NeuralNetwork ===
Accuracy: 0.536
              precision    recall  f1-score   support

           0       0.31      0.34      0.32        62
           1       0.39      0.38      0.38       101
           2       0.72      0.71      0.71       159
           3       0.51      0.35      0.42        91
           4       0.55      0.64      0.60       160

    accuracy                           0.54       573
   macro avg       0.49      0.48      0.49       573
weighted avg

## 5. Compare results

In [26]:
df_results = pd.DataFrame(results).sort_values(by="f1_macro", ascending=False)
df_results.reset_index(drop=True, inplace=True)

print("\n=== MODEL COMPARISON ===")
print(df_results)


=== MODEL COMPARISON ===
                 model  accuracy  f1_class1  f1_class2  f1_class3  f1_class4  \
0    SGDClassifier_log  0.591623   0.469388   0.751515   0.523256   0.643963   
1            LinearSVC  0.577661   0.429319   0.738462   0.520231   0.640719   
2  SGDClassifier_hinge  0.567190   0.430769   0.722222   0.476190   0.628049   
3       DecisionForest  0.551483   0.386473   0.717460   0.546392   0.598726   
4              XGBoost  0.556719   0.385965   0.721212   0.519774   0.603352   
5        NeuralNetwork  0.535777   0.381910   0.712934   0.415584   0.595376   
6             LightGBM  0.483421   0.388571   0.696246   0.428044   0.469697   

   f1_macro  
0  0.541624  
1  0.524283  
2  0.521675  
3  0.504983  
4  0.493333  
5  0.485776  
6  0.455253  


## 6. SAVE BEST MODEL

In [27]:
best_model_name = df_results.loc[0, "model"]
best_model_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", models[best_model_name])
])

# Fit best model on full training set
best_model_pipeline.fit(X_train, y_train)

joblib.dump(best_model_pipeline, "best_model_all_models.pkl")
print(f"Saved best model: {best_model_name} -> best_model_all_models.pkl")

Saved best model: SGDClassifier_log -> best_model_all_models.pkl
