## 📊 Self-Training Classifier Evaluation

This section implements and evaluates a **semi-supervised learning pipeline** built on scikit-learn's `SelfTrainingClassifier`.

### Objective:
Leverage both **labeled (`train_df`) and unlabeled (`test_df`) data** to improve model performance, and quantify the improvement using **Macro-F1 Score uplift**.

---

### Process Overview:
1. **Supervised Split**  
   - Split the labeled data (`train_df`) into:
     - **Supervised training set** (`X_supervised_train`, `y_supervised_train`).
     - **Validation set** (`X_val`, `y_val`).
   - Validation set remains **untouched throughout training** and is used only for final evaluation.

2. **Self-Training Setup**  
   - Assign a placeholder label `-1` to the unlabeled data (`test_df`).
   - **Combine supervised data and unlabeled data** into a new dataset for self-training.

3. **Model Training**  
   - Define a base classifier (e.g., RandomForest, LogisticRegression, CatBoost).
   - Automatically wrap models needing scaling in a pipeline with `StandardScaler`.
   - Apply `SelfTrainingClassifier` using the base model.
   - Train the model, which will:
     - Iteratively label the most confident unlabeled samples.
     - Re-train on the expanded dataset.

4. **Evaluation & Uplift Calculation**  
   - Predict on the **unseen validation set** (`X_val`).
   - Calculate **Macro-F1 Score**:
     - **Macro-F1** gives equal weight to each class, which is crucial for imbalanced datasets.
   - Calculate the **uplift (%) over the supervised baseline**:
     \[
     \text{Uplift} = \frac{F1_{self\_training} - F1_{supervised}}{F1_{supervised}} \times 100
     \]
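
The four steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the `make_classification` dataset, the 50% label masking, and the `LogisticRegression` base model are assumptions chosen so the example runs on its own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic, mildly imbalanced stand-in for the notebook's data (assumption)
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Step 1: hold out a validation set that never enters training
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 2: hide roughly half of the remaining labels behind the placeholder -1
rng = np.random.default_rng(0)
unlabeled_mask = rng.random(len(y_rest)) < 0.5
y_semi = y_rest.copy()
y_semi[unlabeled_mask] = -1

# Step 3: scaled base model wrapped in SelfTrainingClassifier
base = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
self_training = SelfTrainingClassifier(base, criterion="threshold", threshold=0.84)
self_training.fit(X_rest, y_semi)

# Supervised baseline: same model, trained on the labeled rows only
baseline = clone(base).fit(X_rest[~unlabeled_mask], y_rest[~unlabeled_mask])

# Step 4: Macro-F1 on the untouched validation set, then relative uplift
f1_self = f1_score(y_val, self_training.predict(X_val), average="macro")
f1_sup = f1_score(y_val, baseline.predict(X_val), average="macro")
uplift = (f1_self - f1_sup) / f1_sup * 100

print(f"Self-training Macro-F1: {f1_self:.4f}")
print(f"Supervised Macro-F1:    {f1_sup:.4f}")
print(f"Uplift: {uplift:.2f}%")
```

Both models are scored on the same `X_val`, which is what makes the uplift comparable. On synthetic data the uplift can be positive, zero, or negative.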

---

### Why Macro-F1 Score?
- **Macro-F1** averages F1 scores per class, treating all classes equally.
- Essential when **class imbalance exists**, ensuring performance isn't dominated by majority classes.
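
To make the "equal weight per class" point concrete, the sketch below computes Macro-F1 by hand on a toy imbalanced example and compares it with the support-weighted average. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy imbalanced labels (illustrative): 8 samples of class 0, 2 of class 1
y_true = np.array([0] * 8 + [1] * 2)
# The classifier gets every majority sample right but misses one minority sample
y_pred = np.array([0] * 8 + [0, 1])

# Per-class F1, then the unweighted mean: this is Macro-F1
per_class_f1 = f1_score(y_true, y_pred, average=None)
macro_by_hand = per_class_f1.mean()

macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print("Per-class F1:", per_class_f1)
print("Macro-F1:    ", macro_f1)
print("Weighted F1: ", weighted_f1)
```

Here the minority class has an F1 of about 0.667 and the majority class about 0.941. Macro-F1 lands near 0.804, while the support-weighted average is about 0.886 because the majority class dominates it.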

### Key Metrics:
| Metric   | Description                                                       |
|----------|-------------------------------------------------------------------|
| Macro-F1 | Balanced F1 across all classes (unbiased by class size).          |
| Uplift % | Percentage improvement of self-training vs supervised-only model. |
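
The uplift metric can be written as a small helper. The function name `macro_f1_uplift` and the example scores are hypothetical and used only for illustration.

```python
def macro_f1_uplift(f1_self_training: float, f1_supervised: float) -> float:
    """Relative Macro-F1 improvement over the supervised baseline, in percent."""
    if f1_supervised <= 0:
        # The ratio is undefined when the baseline score is zero
        raise ValueError("f1_supervised must be positive")
    return (f1_self_training - f1_supervised) / f1_supervised * 100


# Illustrative scores: 0.80 with self-training vs 0.75 supervised-only
print(f"{macro_f1_uplift(0.80, 0.75):.2f}%")
```

Because the uplift is relative to the baseline, the same absolute gain in Macro-F1 gives a larger percentage when the supervised score is lower.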

---


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import os
import sys
from pathlib import Path
# Project root: the parent of this file's directory when run as a script;
# in a notebook (no __file__), fall back to the current working directory
project_root = Path(__file__).resolve().parents[1] if '__file__' in globals() else Path().resolve()
os.chdir(project_root)
sys.path.append(str(project_root))  # Make project modules importable

# List all files in the current directory
files_in_dir = os.listdir(project_root)
print("Files in current directory:", files_in_dir)

%run datainfo.ipynb

score_cols = [f"A{i}_Score" for i in range(1, 11)]

# Apply identical feature engineering to the labeled and unlabeled frames
for df in (train_df, test_df):
    df['total_score'] = df[score_cols].sum(axis=1)

    # Normalize the total score by the 10 score columns
    df['score_ratio'] = df['total_score'] / 10

    # Add interaction features
    df['gender_result'] = df['gender'] * df['result']
    df['age_score_ratio'] = df['age'] * df['score_ratio']
    df['score_autism'] = df['total_score'] * df['autism']
    df['age_jaundice'] = df['age'] * df['jaundice']
    df['autism_result'] = df['autism'] * df['result']
    df['gender_total_score'] = df['gender'] * df['total_score']

def train_and_evaluate_model(base_model):
    # Create a hold-out validation set from train_df only
    X_supervised_train, X_val, y_supervised_train, y_val = train_test_split(
        train_df.drop('Class/ASD', axis=1),
        train_df['Class/ASD'],
        test_size=0.4,
        random_state=42
    )

    # Prepare self-training data
    # Mark every test_df row as unlabeled with the placeholder label -1
    test_df['Class/ASD'] = -1

    # Combine the reduced train and test for self-training
    train_for_self_training = pd.concat([
        X_supervised_train.assign(**{'Class/ASD': y_supervised_train}),
        test_df
    ], ignore_index=True)

    X_combined = train_for_self_training.drop('Class/ASD', axis=1)
    y_combined = train_for_self_training['Class/ASD']

    # Automatically add a scaler if the model type needs it
    models_needing_scaling = (LogisticRegression,)
    if isinstance(base_model, models_needing_scaling):
        base_model = Pipeline([
            ('scaler', StandardScaler()),
            ('clf', base_model)
        ])

    # Define and train the self-training model
    # base_model = RandomForestClassifier(n_estimators=600, max_depth=10, min_samples_split=10)
    # With criterion='threshold', samples are pseudo-labeled when the predicted probability
    # exceeds `threshold`; k_best only applies to criterion='k_best', so it is omitted
    self_training_model = SelfTrainingClassifier(base_model, criterion='threshold', threshold=.84)
    self_training_model.fit(X_combined, y_combined)

    # Evaluate on the held-out validation set (never seen during self-training)
    y_pred_val_self_training = self_training_model.predict(X_val)

    f1_self_training = f1_score(y_val, y_pred_val_self_training, average='macro')
    # final_f1 is expected to be set globally by the supervised notebook
    # executed via %run before this function is called
    f1_supervised = final_f1

    # Calculate the uplift
    uplift = ((f1_self_training - f1_supervised) / f1_supervised) * 100

    print("Self-training model Macro-F1 on validation set:", f1_self_training)
    print("Supervised model Macro-F1 on validation set:", f1_supervised)
    print(f"Macro-F1 Uplift/improvement: {uplift:.2f}%")


## 📈 Self-Training Model vs Supervised Model Evaluation - RANDOM FOREST

### Objective:
Evaluate and compare the performance of a **self-training model** versus a **supervised-only model** based on **Macro-F1 score**.

---

### Results:
- **Self-training model Macro-F1 on validation set**: `0.80995`
- **Supervised model Macro-F1 on validation set**: `0.77372`
- **Macro-F1 Uplift/Improvement**: `4.68%`

### Interpretation:
- The **self-training model** achieves a **higher Macro-F1 score** than the supervised-only model (`0.80995` vs `0.77372`) on the held-out validation set.
- The **4.68% uplift** is a **modest improvement**, suggesting that the self-training approach, which incorporates both labeled and unlabeled data, helps the model generalize across classes.

---

### Why it matters:
- **Macro-F1** ensures equal importance across all classes, preventing dominance by larger classes in imbalanced datasets.
- **Uplift percentage** quantifies how much the self-training method improves over traditional supervised learning.


In [None]:
%run SupervisedModels/OptimizedRandomForest.ipynb
# Random forest model (no random_state is set, so scores can vary between runs)
rf_model = RandomForestClassifier(n_estimators=300, max_depth=10, min_samples_split=10)
# Run self-training with this base model and report the Macro-F1 uplift
train_and_evaluate_model(rf_model)

## 📈 Self-Training Model vs Supervised Model Evaluation - XGBOOST

### Objective:
Evaluate and compare the performance of a **self-training model** versus a **supervised-only model** based on **Macro-F1 score**.

---

### Results:
- **Self-training model Macro-F1 on validation set**: `0.77271`
- **Supervised model Macro-F1 on validation set**: `0.771241`
- **Macro-F1 Uplift/Improvement**: `0.19%`

### Interpretation:
- The **self-training model** scores only **marginally higher** than the supervised-only model on Macro-F1.
- The **0.19% uplift** is **very small**, suggesting that the self-training approach, which incorporates both labeled and unlabeled data, does little to enhance this model's ability to generalize across classes.

---

### Why it matters:
- **Macro-F1** ensures equal importance across all classes, preventing dominance by larger classes in imbalanced datasets.
- **Uplift percentage** quantifies how much the self-training method improves over traditional supervised learning.


In [None]:
%run SupervisedModels/OptimizedXGBoost.ipynb
# XGBoost model
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    scale_pos_weight=1,  # 1 is the default (no reweighting); raise it to upweight the positive class
    random_state=42
)
# Run self-training with this base model and report the Macro-F1 uplift
train_and_evaluate_model(xgb_model)

## 📈 Self-Training Model vs Supervised Model Evaluation - CATBOOST

### Objective:
Evaluate and compare the performance of a **self-training model** versus a **supervised-only model** based on **Macro-F1 score**.

---

### Results:
- **Self-training model Macro-F1 on validation set**: `0.80375`
- **Supervised model Macro-F1 on validation set**: `0.71473`
- **Macro-F1 Uplift/Improvement**: `12.45%`

### Interpretation:
- The **self-training model** achieves a **higher Macro-F1 score** than the supervised-only model, indicating improved performance across all classes.
- The **12.45% uplift** shows a **substantial improvement**, suggesting that the self-training approach, which utilizes both labeled and unlabeled data, greatly enhances the model’s ability to generalize and perform well on all classes.

---

### Why it matters:
- **Macro-F1** ensures that the model performs well on both majority and minority classes, which is crucial for imbalanced datasets.
- **Uplift percentage** provides a clear indication of how much the self-training method improves model performance compared to traditional supervised learning.


In [None]:
%run SupervisedModels/OptimizedCatBoost.ipynb
# CatBoost model
cb_model = CatBoostClassifier(
    random_state=42, verbose=0
)
# Run self-training with this base model and report the Macro-F1 uplift
train_and_evaluate_model(cb_model)

## 📈 Self-Training Model vs Supervised Model Evaluation - LOGISTIC REGRESSION

### Objective:
Evaluate and compare the performance of a **self-training model** versus a **supervised-only model** based on **Macro-F1 score**.

---

### Results:
- **Self-training model Macro-F1 on validation set**: `0.78430`
- **Supervised model Macro-F1 on validation set**: `0.71384`
- **Macro-F1 Uplift/Improvement**: `9.87%`

### Interpretation:
- The **self-training model** achieves a **higher Macro-F1 score** than the supervised-only model, indicating clearly better performance across classes.
- The **9.87% uplift** shows a **substantial improvement**, suggesting that the self-training approach, which uses both labeled and unlabeled data, enhances the model's ability to generalize and perform effectively, even in imbalanced data scenarios.

---

### Why it matters:
- **Macro-F1** ensures balanced model performance across all classes, reducing the impact of imbalanced data.
- **Uplift percentage** demonstrates the extent of improvement achieved by the self-training method compared to traditional supervised learning.


In [None]:
%run SupervisedModels/OptimizedLogisticRegression.ipynb
# Logistic regression model (wrapped with StandardScaler inside train_and_evaluate_model)
lr_model = LogisticRegression(
    class_weight='balanced', random_state=42
)
# Run self-training with this base model and report the Macro-F1 uplift
train_and_evaluate_model(lr_model)