# Results Summary - Code Comment Classification

This notebook summarizes the complete pipeline results and provides conclusions.

## Pipeline Overview:

1. **Data Cleaning** (`data_cleaning.ipynb`)
   - Loaded 12,775 rows from original dataset
   - BERT similarity analysis identified most confusable categories
   - Automatically merged Usage → DevelopmentNotes (similarity: 0.9249)
   - Filtered to 2,812 labeled samples (4 categories)
   - Split into train (2,249) and test (563)

2. **Feature Encoding** (`encoding.ipynb`)
   - BERT embeddings: 384 features
   - Class one-hot encoding: 306 features
   - Metadata features: 5 features
   - **Total: 695 features**

3. **Baseline Model** (`model_training.ipynb`)
   - GridSearchCV with Logistic Regression
   - Best CV F1-Macro: 0.672
   - Test Accuracy: 69.4%

4. **Multi-Model Comparison** (`multi_model_training.ipynb`)
   - Tested 4 algorithms (scores are mean F1-Macro)
   - Logistic Regression: 0.6704 (best)
   - Linear SVC: 0.6608
   - SGD Classifier: 0.6406
   - Random Forest: 0.4984
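
The feature layout from step 2 can be sketched as a column-wise concatenation of the three blocks. This is a minimal illustration, assuming the blocks are dense NumPy arrays. The helper name `assemble_features` and the placeholder data are hypothetical and are not taken from `encoding.ipynb`.

```python
import numpy as np

# Block widths reported in the pipeline overview.
N_BERT, N_CLASS, N_META = 384, 306, 5


def assemble_features(bert_emb, class_onehot, metadata):
    """Concatenate the three feature blocks column-wise into one matrix."""
    blocks = [np.asarray(b, dtype=np.float32) for b in (bert_emb, class_onehot, metadata)]
    if len({b.shape[0] for b in blocks}) != 1:
        raise ValueError("all feature blocks must have the same number of rows")
    return np.hstack(blocks)


# Placeholder data standing in for the real encoder outputs (hypothetical values).
rng = np.random.default_rng(0)
n = 4
X = assemble_features(
    rng.normal(size=(n, N_BERT)),  # sentence embeddings
    np.zeros((n, N_CLASS)),        # one-hot class-name columns
    rng.normal(size=(n, N_META)),  # metadata columns
)
print(X.shape)  # (4, 695)
```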

## Conclusions and Key Findings

### Final Model Performance

The pipeline combines **automatic category merging**, driven by BERT similarity analysis, with BERT embeddings and metadata features for classification.
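
One way such a similarity-driven merge could work is to compare the mean embedding of each category and merge the closest pair. The sketch below assumes cosine similarity between category centroids. The exact similarity measure used in `data_cleaning.ipynb` is not stated here, and the function names and toy vectors are hypothetical.

```python
import numpy as np


def most_similar_categories(embeddings, labels):
    """Return (similarity, label_a, label_b) for the closest pair of category centroids."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    # Mean embedding per category, L2-normalised so a dot product is cosine similarity.
    centroids = np.stack([embeddings[labels == name].mean(axis=0) for name in names])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = centroids @ centroids.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return float(sim[i, j]), names[i], names[j]


def merge_category(labels, source, target):
    """Relabel every `source` entry as `target`."""
    return [target if label == source else label for label in labels]


# Toy 2-D vectors standing in for real sentence embeddings.
toy_emb = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
toy_lab = ["Usage", "Usage", "DevelopmentNotes", "Parameters"]

score, a, b = most_similar_categories(toy_emb, toy_lab)
merged = merge_category(toy_lab, "Usage", "DevelopmentNotes")
```

The sketch only finds the closest pair. Which of the two labels survives the merge is a separate decision that this example hard-codes to match the Usage → DevelopmentNotes merge reported above.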

**Best Model:** Logistic Regression with RandomOverSampler
- **Test Accuracy: 69.4%**
- **F1-Macro Score: 0.67**
- **Cross-Validation F1-Macro: 0.672**

### Per-Category Performance

**Strong Categories:**
- **Parameters (Class 2):** F1=0.77 ✓ - Clear vocabulary (parameter-related keywords)
- **DevelopmentNotes (Class 0):** F1=0.74 ✓ - Largest class after merge, well-defined

**Moderate Categories:**
- **Summary (Class 3):** F1=0.65 - Good recall (0.69) but lower precision (0.61)

**Struggling Categories:**
- **Expand (Class 1):** F1=0.52 ❌ - Lowest performance, confusion with other classes
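
As a quick consistency check, the reported figures can be reproduced from the F1 definition (the harmonic mean of precision and recall) and from the unweighted mean of the per-class scores. The inputs below are the rounded values listed above, so the results are approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Summary (Class 3): precision 0.61, recall 0.69 as reported above.
summary_f1 = f1_from(0.61, 0.69)

# Per-class F1 values reported above.
per_class_f1 = {
    "DevelopmentNotes": 0.74,
    "Expand": 0.52,
    "Parameters": 0.77,
    "Summary": 0.65,
}
# Macro averaging weights every class equally, regardless of its size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(f"{summary_f1:.2f}")  # 0.65
print(f"{macro_f1:.2f}")    # 0.67
```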

### Model Comparison

Tested 4 different algorithms (all with RandomOverSampler):

| Model | Mean F1-Macro | Std Dev | Winner |
|-------|---------------|---------|--------|
| Logistic Regression | 0.6704 | 0.033 | ✓ Best |
| Linear SVC | 0.6608 | 0.034 | |
| SGD Classifier | 0.6406 | 0.034 | |
| Random Forest | 0.4984 | 0.018 | ❌ Worst |

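The three linear models (Logistic Regression, Linear SVC, SGD Classifier) outperform Random Forest by 0.14 to 0.17 mean F1-Macro on the dense BERT-based features, a gap several times larger than the reported standard deviations.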
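
All four models were trained with RandomOverSampler, which balances classes by duplicating minority-class samples. The sketch below is a plain NumPy illustration of that idea, not the imbalanced-learn implementation. The function name `random_oversample` and the toy data are hypothetical.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample with replacement only as many extra rows as this class is missing.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 0, 0, 0, 1, 2])
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # [4 4 4]
```

Oversampling should be applied to the training portion only (inside each cross-validation fold), never to the test set. Otherwise duplicated rows can appear on both sides of the split and inflate the scores.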

### Conclusion

This pipeline successfully demonstrates:
- **Automatic category merging** based on semantic similarity analysis
- **BERT embeddings + metadata** achieve 69.4% accuracy (F1-Macro: 0.67)
- **Logistic Regression** outperforms the other three models tested, including the non-linear Random Forest
- **Proper data handling** prevents leakage and ensures fair evaluation

The 69.4% accuracy represents solid performance given:
- Only 2,812 samples (small dataset)
- Average 6.9 words per comment (limited context)
- 4 categories with fuzzy semantic boundaries

## Generated Files

### Data Files:
- `code-comment-classification-cleaned.csv` - Cleaned dataset (2,812 rows)
- `code-comment-classification-train-unbalanced.csv` - Training set (2,249 rows)
- `code-comment-classification-test.csv` - Test set (563 rows)
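
The row counts above are consistent with an 80/20 split in which the test size is rounded up. This is an observation about the numbers, not a statement of how `data_cleaning.ipynb` performs the split. The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test size up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


print(split_sizes(2812, 0.2))  # (2249, 563)
```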

### Encoded Features:
- `train_features_4cat_bert_meta.npz` - Training features (2,249 × 695)
- `test_features_4cat_bert_meta.npz` - Test features (563 × 695)
- `train_target_4cat_meta.csv` - Training labels
- `test_target_4cat_meta.csv` - Test labels

### Encoders:
- `class_encoder_4cat_meta.pkl` - OneHotEncoder for class names
- `bert_model_4cat_meta.pkl` - SentenceTransformer model
- `label_encoder_4cat_meta.pkl` - LabelEncoder for categories

### Trained Models:
- `best_model_final.pkl` - Best performing model (Logistic Regression)

## How to Use

### Run the complete pipeline:
```bash
# 1. Clean data and merge categories
jupyter notebook data_cleaning.ipynb

# 2. Encode features
jupyter notebook encoding.ipynb

# 3. Train baseline model
jupyter notebook model_training.ipynb

# 4. Compare multiple models
jupyter notebook multi_model_training.ipynb

# 5. View results summary
jupyter notebook results_summary.ipynb
```

### Load and use the trained model:
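```python
import joblib

# Load the trained model
model = joblib.load("best_model_final.pkl")

# Load the encoders fitted during feature encoding
class_encoder = joblib.load("class_encoder_4cat_meta.pkl")
bert_model = joblib.load("bert_model_4cat_meta.pkl")
label_encoder = joblib.load("label_encoder_4cat_meta.pkl")

# X_new is not defined here: build it the same way as the training features,
# i.e. BERT embeddings (384) + class one-hot (306) + metadata (5) = 695 columns
predictions = model.predict(X_new)

# Map the predicted class ids back to category names
predicted_labels = label_encoder.inverse_transform(predictions)
```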