## Motivation for Using Multiple Models

Initially, we tried `Logistic Regression` as a baseline model because it is:

- Fast to train
- Well-suited for high-dimensional sparse data (like `TF-IDF`)
- Interpretable and widely used in text classification

However, after evaluating `Logistic Regression`, we observed:

- High overall accuracy (`~0.76`), but extremely low `F1` for the minority class (`0.03`)
- The model predicted almost all examples as the majority class due to **class imbalance**
- This makes it **unsuitable for our task**, where detecting the minority class (`instance_type=1`) is important
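
The gap between accuracy and minority-class `F1` can be reproduced with a small calculation. The sketch below assumes a hypothetical classifier that always predicts class `0`, evaluated on the test-set class counts reported later in this notebook (1982 and 573). It is an illustration of the effect, not the actual `Logistic Regression` output.

```python
# Hypothetical majority-class predictor, evaluated on the test-set class counts
# reported later in this notebook (1982 examples of class 0, 573 of class 1).
n_neg, n_pos = 1982, 573

# Always predicting 0: every negative is correct, every positive is missed.
tp, fp, fn, tn = 0, 0, n_pos, n_neg

accuracy = (tp + tn) / (n_neg + n_pos)


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), taken as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


f1_minority = f1(tp, fp, fn)

print(f"accuracy:     {accuracy:.3f}")
print(f"F1 (class 1): {f1_minority:.3f}")
```

Accuracy stays high purely because class `0` dominates, while the minority-class `F1` collapses, which is the same pattern observed for the baseline.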

### Therefore, we try multiple alternative models to improve performance:

1. `LinearSVC`
   - A linear support vector machine, a well-established choice for text classification  
   - Handles high-dimensional sparse data efficiently  
   - Can compensate for class imbalance via `class_weight='balanced'`

2. `SGDClassifier` (`Hinge` / `Log`)
   - Implements stochastic gradient descent for linear models  
   - Can train faster than batch solvers for linear models on large datasets  
   - Supports `class_weight='balanced'` for imbalanced data

3. `Decision Forest` (`RandomForest`)
   - Ensemble of decision trees capturing non-linear relationships  
   - Robust to noise and outliers  
   - Needs no feature scaling; categorical features only have to be encoded numerically (here via `OneHotEncoder`)

4. `Gradient Boosting` (`XGBoost` / `LightGBM`) 
   - Powerful ensemble models that can learn complex patterns  
   - Include mechanisms to handle class imbalance (`scale_pos_weight` or `class_weight`)  
   - Often outperform linear models on structured features, although linear models can remain competitive on sparse `TF-IDF` input

5. `Neural Network` (`MLPClassifier`)
   - Can capture non-linear interactions between features  
   - Flexible architecture for combining text features (`TF-IDF`) with categorical/numeric features  
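
Several of the models above rely on `class_weight='balanced'` or `scale_pos_weight`. As a rough sketch of what these settings do: scikit-learn documents the balanced heuristic as `n_samples / (n_classes * np.bincount(y))`, and the ratio of negatives to positives is a commonly suggested starting value for `scale_pos_weight`. The class counts below are the test-set supports reported later in this notebook and are used only for illustration; the stratified training split should have nearly the same ratio.

```python
# Class counts for illustration only (test-set supports reported later in this notebook).
counts = {0: 1982, 1: 573}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Common starting value for scale_pos_weight: (# negatives) / (# positives)
scale_pos_weight = counts[0] / counts[1]

# With balanced weights, each class contributes the same total weight.
total_weight = {c: balanced_weights[c] * counts[c] for c in counts}

print("balanced weights:", {c: round(w, 3) for c, w in balanced_weights.items()})
print("scale_pos_weight:", round(scale_pos_weight, 3))
print("total weight per class:", total_weight)
```

Under these assumptions the minority class is weighted about 3.5 times as heavily as the majority class, so misclassifying a class `1` example costs correspondingly more during training.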

### Key Goals

- Improve detection of the minority class (class `1`) without sacrificing too much overall accuracy  
- Maximize macro-averaged `F1`, which gives both classes equal weight regardless of their size  
- Compare models to select the best performing pipeline for deployment
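
To make the macro-averaged `F1` goal concrete, the short sketch below recomputes it from per-class scores. The two `F1` values are copied from the `DecisionForest` row of the comparison table in this notebook; the support-weighted average is included only for contrast and is not a metric tracked in the results.

```python
# Per-class F1 scores copied from the DecisionForest row of the comparison table.
f1_class0 = 0.910394
f1_class1 = 0.594595

# Macro F1: unweighted mean of the per-class scores, so the minority class
# counts exactly as much as the majority class.
f1_macro = (f1_class0 + f1_class1) / 2

# Weighted F1, for contrast: mean weighted by class support (1982 vs. 573 test examples).
support = {0: 1982, 1: 573}
f1_weighted = (f1_class0 * support[0] + f1_class1 * support[1]) / sum(support.values())

print(f"macro F1:    {f1_macro:.4f}")
print(f"weighted F1: {f1_weighted:.4f}")
```

Because the weighted average is dominated by class `0`, it comes out higher than the macro average and would hide weak minority-class performance.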

## 0. Import libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
import lightgbm as lgb
import joblib

## 1. Load cleaned dataset

In [4]:
df_clean = pd.read_csv("code-comment-classification-cleaned-no-outliers.csv")

FEATURES = ["class", "category", "comment_sentence", "partition"]
TARGET = "instance_type"

X = df_clean[FEATURES]
y = df_clean[TARGET]

# Train/test split (80/20 stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

Train size: (10220, 4)
Test size: (2555, 4)
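
A quick way to confirm that `stratify=y` keeps the class ratio the same in both splits is to compare positive rates. The sketch below uses synthetic labels (1000 samples, 22% positives) invented for illustration; the names `X_demo` and `y_demo` are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative labels: 1000 samples, 22% positives (not the real data).
y_demo = np.array([1] * 220 + [0] * 780)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, all three rates should be (almost) identical.
print("positive rate (all):  ", y_demo.mean())
print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```

The same check can be run on the real split by comparing `y_train.mean()` and `y_test.mean()`.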


## 2. Define preprocessing pipeline
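A single preprocessing pipeline is shared by all models:
- `OneHotEncoder` for the categorical features (`class`, `category`)
- `TF-IDF` for the text (`comment_sentence`)
- Passthrough for the numeric feature (`partition`)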


In [5]:
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "category"]),
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1,2)), "comment_sentence"),
        ("num", "passthrough", ["partition"])
    ]
)
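
To see what this transformer produces, the sketch below applies an equivalent `ColumnTransformer` to three toy rows. The rows are invented for illustration and only the column names match the notebook; `toy` and `demo_preprocess` are hypothetical names. It also highlights why the text column is given as a bare string while the other columns are given as lists.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy rows, invented for illustration; only the column names match the notebook.
toy = pd.DataFrame({
    "class": ["Foo.java", "Bar.java", "Foo.java"],
    "category": ["summary", "usage", "summary"],
    "comment_sentence": [
        "returns the parsed value",
        "call this before close",
        "returns a new list",
    ],
    "partition": [0, 1, 0],
})

demo_preprocess = ColumnTransformer(
    transformers=[
        # A list of column names hands the transformer a 2-D frame.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "category"]),
        # A bare string hands it a 1-D column, which TfidfVectorizer expects.
        ("text", TfidfVectorizer(), "comment_sentence"),
        ("num", "passthrough", ["partition"]),
    ]
)

features = demo_preprocess.fit_transform(toy)

# One row per input row; columns = one-hot columns + TF-IDF vocabulary + 1 numeric.
print(features.shape)
```

The three blocks are concatenated column-wise, so every model in the notebook sees the same feature matrix layout.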

## 3. Define all models

In [None]:
models = {
    "LinearSVC": LinearSVC(class_weight="balanced", max_iter=4000, random_state=42),
    "SGDClassifier_hinge": SGDClassifier(loss="hinge", class_weight="balanced", max_iter=4000, random_state=42),
    "SGDClassifier_log": SGDClassifier(loss="log_loss", class_weight="balanced", max_iter=4000, random_state=42),
    "DecisionForest": RandomForestClassifier(
        n_estimators=300, 
        max_depth=None, 
        min_samples_split=5, 
        class_weight="balanced", 
        random_state=42,
        n_jobs=-1
    ),
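    "XGBoost": xgb.XGBClassifier(
        # use_label_encoder is omitted: it is deprecated in recent XGBoost releases
        eval_metric="logloss", 
        n_estimators=200, 
        # Negatives-to-positives ratio, computed on the training labels only
        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(), 
        random_state=42,
        n_jobs=-1
    ),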
    "LightGBM": lgb.LGBMClassifier(
        n_estimators=200, 
        class_weight="balanced", 
        random_state=42, 
        n_jobs=-1
    ),
    "NeuralNetwork": MLPClassifier(
        hidden_layer_sizes=(100,), 
        activation='relu', 
        solver='adam', 
        max_iter=500, 
        random_state=42
    )
}

## 4. Train and evaluate all models

In [9]:
results = []

for name, clf in models.items():
    print(f"\n=== TRAINING: {name} ===")
    
    # Build pipeline
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("clf", clf)
    ])
    
    # Fit model
    pipeline.fit(X_train, y_train)
    
    # Predict on test set
    y_pred = pipeline.predict(X_test)
    
    # Compute metrics
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    
    results.append({
        "model": name,
        "accuracy": acc,
        "f1_class0": report["0"]["f1-score"],
        "f1_class1": report["1"]["f1-score"],
        "f1_macro": report["macro avg"]["f1-score"]
    })
    
    # Print metrics
    print(f"Accuracy: {acc:.3f}")
    print(classification_report(y_test, y_pred))


=== TRAINING: LinearSVC ===
Accuracy: 0.470
              precision    recall  f1-score   support

           0       0.70      0.55      0.62      1982
           1       0.11      0.18      0.13       573

    accuracy                           0.47      2555
   macro avg       0.40      0.37      0.38      2555
weighted avg       0.57      0.47      0.51      2555


=== TRAINING: SGDClassifier_hinge ===
Accuracy: 0.508
              precision    recall  f1-score   support

           0       0.78      0.51      0.62      1982
           1       0.22      0.49      0.31       573

    accuracy                           0.51      2555
   macro avg       0.50      0.50      0.46      2555
weighted avg       0.65      0.51      0.55      2555


=== TRAINING: SGDClassifier_log ===
Accuracy: 0.522
              precision    recall  f1-score   support

           0       0.72      0.62      0.67      1982
           1       0.12      0.17      0.14       573

    accuracy                 
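
The loop above reads per-class scores with `report["0"]` and `report["1"]`. The sketch below, on a handful of made-up labels, shows why: with `output_dict=True`, `classification_report` returns a dictionary whose class keys are the labels converted to strings. The names `y_true_demo` and `y_pred_demo` are illustrative only.

```python
from sklearn.metrics import classification_report

# Tiny made-up labels, for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0]

report_demo = classification_report(y_true_demo, y_pred_demo, output_dict=True)

# Integer labels become string keys, alongside the summary entries.
print(sorted(report_demo.keys()))

# Per-class and macro-averaged F1 are read the same way as in the training loop.
f1_minority_demo = report_demo["1"]["f1-score"]
f1_macro_demo = report_demo["macro avg"]["f1-score"]
print(f1_minority_demo, f1_macro_demo)
```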

## 5. Compare results

In [10]:
df_results = pd.DataFrame(results).sort_values(by="f1_macro", ascending=False)
df_results.reset_index(drop=True, inplace=True)

print("\n=== MODEL COMPARISON ===")
print(df_results)


=== MODEL COMPARISON ===
                 model  accuracy  f1_class0  f1_class1  f1_macro
0       DecisionForest  0.853229   0.910394   0.594595  0.752494
1        NeuralNetwork  0.834051   0.896937   0.574297  0.735617
2  SGDClassifier_hinge  0.507632   0.618094   0.307269  0.462681
3    SGDClassifier_log  0.521722   0.668655   0.140647  0.404651
4            LinearSVC  0.470059   0.618161   0.134271  0.376216


## 6. Save best model

In [11]:
best_model_name = df_results.loc[0, "model"]
best_model_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", models[best_model_name])
])

# Fit best model on full training set
best_model_pipeline.fit(X_train, y_train)

joblib.dump(best_model_pipeline, "best_model_all_models.pkl")
print(f"Saved best model: {best_model_name} -> best_model_all_models.pkl")

Saved best model: DecisionForest -> best_model_all_models.pkl
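
Because the saved object is the whole pipeline (preprocessing plus classifier), it can later be reloaded and applied directly to raw rows. The sketch below demonstrates the round trip on a small stand-in pipeline trained on invented data; `demo_pipeline`, `demo_model.pkl`, and the example sentences are hypothetical and are not the notebook's actual model or file.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Stand-in training data, invented for illustration.
demo_df = pd.DataFrame({
    "comment_sentence": [
        "returns the value",
        "see also the parser",
        "returns a list",
        "deprecated, see docs",
    ],
    "instance_type": [1, 0, 1, 0],
})

# Stand-in pipeline with the same two-step structure as the notebook's pipeline.
demo_pipeline = Pipeline([
    ("preprocess", ColumnTransformer([("text", TfidfVectorizer(), "comment_sentence")])),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
demo_pipeline.fit(demo_df[["comment_sentence"]], demo_df["instance_type"])

# Save and reload through a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(demo_pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline accepts raw rows with the same column names as at fit time.
new_rows = pd.DataFrame({"comment_sentence": ["returns the parsed value"]})
same = bool((reloaded.predict(new_rows) == demo_pipeline.predict(new_rows)).all())
print("predictions match after reload:", same)
```

For the real model, the equivalent call would be `joblib.load("best_model_all_models.pkl")` followed by `predict` on a DataFrame containing the `FEATURES` columns. Pickled pipelines should only be loaded from trusted sources and with a scikit-learn version compatible with the one used for saving.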
