## Week 3–4 Misclassification Review
This section summarises post-tuning error patterns to guide threshold calibration and feature review. Percentages are shares of each model's misclassified rows, not error rates:
- LogisticRegression_Tuned, RandomForest_Tuned, and XGBoost_Tuned make most of their mistakes as **false positives (≈87–88% of errors)**, reflecting the recall-first configuration. For RandomForest and XGBoost this reverses the baseline split (≈87–91% false negatives); LogisticRegression is unchanged from its baseline (87.4% false positives).
- NeuralNetwork_Tuned flips the baseline tendency: false negatives fall from ≈92% of its errors to ≈6%, while false positives rise to ≈94%, confirming the need for Week 5–6 threshold calibration before Gradio deployment.
- High false-positive rows for NeuralNetwork_Tuned are dominated by low self-reported health (`numeric__health`), high perceived effort (`numeric__flteeff`), and reduced sleep/rest (`numeric__slprl`). False negatives have their largest mean on `numeric__inprdsc`, followed by high enjoyment of life (`numeric__enjlf`) and frequent sports (`numeric__dosprt`), a cue to inspect feature scaling and interaction terms.
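
The percentages in the bullets above are proportions of each model's misclassified rows, so they say which kind of mistake dominates but not how often the model is wrong. A minimal sketch of the distinction follows; the confusion-matrix counts and the helper name `error_breakdown` are illustrative assumptions, not project results.

```python
def error_breakdown(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return error shares (as in the summary table) next to the usual rates."""
    errors = fp + fn  # assumes at least one misclassified row
    return {
        # Share of all mistakes that are false positives / false negatives.
        "fp_share": fp / errors,
        "fn_share": fn / errors,
        # Rates relative to the size of each true class.
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical counts, chosen only for illustration.
result = error_breakdown(tp=90, fp=300, fn=10, tn=600)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, false positives account for about 97% of the errors even though the false-positive rate is about 33%, which is why the shares should be read alongside precision and recall rather than on their own.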


In [2]:
import sys
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path('..').resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

misclassified_path = PROJECT_ROOT / 'results' / 'metrics' / 'misclassified_samples.csv'
if not misclassified_path.exists():
    raise FileNotFoundError('Expected misclassified_samples.csv after evaluation step.')

mis = pd.read_csv(misclassified_path)
summary = mis.groupby('model')['error_type'].value_counts(normalize=True).unstack(fill_value=0)
print('Error proportions by model (False Negative vs False Positive):')
print(summary.round(3))

nn_tuned = mis[mis['model'] == 'neural_network_tuned'].copy()
if not nn_tuned.empty:
    numeric_cols = [c for c in nn_tuned.columns if c.startswith('numeric__')]
    fn_means = nn_tuned[nn_tuned['error_type'] == 'False Negative'][numeric_cols].mean().sort_values(ascending=False).head(5)
    fp_means = nn_tuned[nn_tuned['error_type'] == 'False Positive'][numeric_cols].mean().sort_values(ascending=False).head(5)
    print('\nNeuralNetwork_Tuned — top numeric averages for false negatives:')
    print(fn_means.round(3))
    print('\nNeuralNetwork_Tuned — top numeric averages for false positives:')
    print(fp_means.round(3))
else:
    print('No neural_network_tuned records found; rerun evaluation if this is unexpected.')

Error proportions by model (False Negative vs False Positive):
error_type                 False Negative  False Positive
model                                                    
logistic_regression                 0.126           0.874
logistic_regression_tuned           0.126           0.874
neural_network                      0.922           0.078
neural_network_tuned                0.061           0.939
random_forest                       0.865           0.135
random_forest_tuned                 0.130           0.870
xgboost                             0.910           0.090
xgboost_tuned                       0.120           0.880
NeuralNetwork_Tuned — top numeric averages for false negatives:
numeric__inprdsc    0.368
numeric__enjlf      0.323
numeric__dosprt     0.227
numeric__happy      0.163
numeric__wrhpp      0.159
dtype: float64
NeuralNetwork_Tuned — top numeric averages for false positives:
numeric__health     0.754
numeric__flteeff    0.320
numeric__slprl      0.314
numeri

### Threshold Calibration Sweep (Tuned Models)
This section summarises how the recall-focused tuned models behave on the test split across probability thresholds from 0.2 to 0.8 in steps of 0.05. Precision, recall, and F1 for each model and threshold are saved to `results/metrics/threshold_sweep.csv` for reproducibility.

In [4]:
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

try:
    from src.evaluate_models import load_splits, load_models
except ModuleNotFoundError:
    import sys
    PROJECT_ROOT = Path('..').resolve()
    if str(PROJECT_ROOT) not in sys.path:
        sys.path.insert(0, str(PROJECT_ROOT))
    from src.evaluate_models import load_splits, load_models

PROJECT_ROOT = Path('..').resolve()
threshold_out = PROJECT_ROOT / 'results' / 'metrics' / 'threshold_sweep.csv'

splits = load_splits()
models, scaler = load_models(input_dim=splits['X_train'].shape[1], include_tuned=True)

X_test = splits['X_test']
y_test = splits['y_test'].astype(int)
X_test_scaled = scaler.transform(X_test)

model_keys = ['logistic_regression_tuned', 'random_forest_tuned', 'xgboost_tuned', 'neural_network_tuned']
thresholds = np.linspace(0.2, 0.8, 13)
rows = []

for key in model_keys:
    if key not in models:
        print(f"Skipping {key} (model artefact not found)")
        continue
    model = models[key]
    features = X_test_scaled if key in {'logistic_regression_tuned', 'neural_network_tuned'} else X_test.values

    if key == 'neural_network_tuned':
        import torch
        model.eval()  # inference mode: disables dropout and uses stored batch-norm statistics
        with torch.no_grad():
            logits = model(torch.tensor(features, dtype=torch.float32))
            scores = torch.sigmoid(logits).numpy().ravel()
    else:
        if hasattr(model, 'predict_proba'):
            scores = model.predict_proba(features)[:, 1]
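        else:
            # No predict_proba: squash decision scores into (0, 1) with a sigmoid.
            # These are uncalibrated scores, not true probabilities.
            scores = model.decision_function(features)
            scores = 1 / (1 + np.exp(-scores))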

    for thresh in thresholds:
        preds = (scores >= thresh).astype(int)
        rows.append({
            'model': key,
            'threshold': round(float(thresh), 3),
            'precision': precision_score(y_test, preds, zero_division=0),
            'recall': recall_score(y_test, preds, zero_division=0),
            'f1_score': f1_score(y_test, preds, zero_division=0),
        })

threshold_df = pd.DataFrame(rows)
threshold_out.parent.mkdir(parents=True, exist_ok=True)
threshold_df.to_csv(threshold_out, index=False)
print('Saved threshold sweep metrics to', threshold_out)
display(threshold_df.head())


[INFO] Loading data splits from /Users/peter/Desktop/AI_MLProjects_Research_Project/health_xai_project/results/models/data_splits.joblib




Saved threshold sweep metrics to /Users/peter/Desktop/AI_MLProjects_Research_Project/health_xai_project/results/metrics/threshold_sweep.csv


                       model  threshold  precision    recall  f1_score
0  logistic_regression_tuned       0.20   0.146393  0.970793  0.254420
1  logistic_regression_tuned       0.25   0.159309  0.948540  0.272800
2  logistic_regression_tuned       0.30   0.175751  0.919332  0.295089
3  logistic_regression_tuned       0.35   0.193004  0.874826  0.316239
4  logistic_regression_tuned       0.40   0.215121  0.819193  0.340758
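
Because F1 is the harmonic mean of precision and recall, any row of the sweep can be checked by hand. The sketch below recomputes F1 for the first displayed row; the helper `f1_from_pr` is hypothetical and not part of `src.evaluate_models`.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the logistic_regression_tuned row at threshold 0.2.
f1 = f1_from_pr(0.146393, 0.970793)
print(round(f1, 4))
```

The recomputed value agrees with the reported `f1_score` for that row to four decimal places; small differences beyond that come from the rounded inputs.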


### Recommended Thresholds (Max F1)
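Selecting the threshold with the highest F1 score per tuned model provides a balanced starting point for Week 5–6 calibration. In the output below, the max-F1 thresholds fall at 0.60–0.65 for all four models, where recall drops to roughly 0.50–0.59, so this rule gives up much of the recall-first behaviour in exchange for precision.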

In [5]:
import pandas as pd
from pathlib import Path

threshold_df = pd.read_csv(Path('..') / 'results' / 'metrics' / 'threshold_sweep.csv')
best_thresholds = (
    threshold_df.loc[threshold_df.groupby('model')['f1_score'].idxmax()]
    .sort_values('model')
    .reset_index(drop=True)
)
print(best_thresholds)
best_thresholds.to_csv(Path('..') / 'results' / 'metrics' / 'threshold_recommendations.csv', index=False)


                       model  threshold  precision    recall  f1_score
0  logistic_regression_tuned       0.65   0.322667  0.504868  0.393709
1       neural_network_tuned       0.65   0.302987  0.592490  0.400941
2        random_forest_tuned       0.60   0.327812  0.522949  0.403001
3              xgboost_tuned       0.65   0.330508  0.542420  0.410742
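
Max F1 is only one possible selection rule. Since the tuned models were configured recall-first, a recall floor may be a useful companion check during calibration. The sketch below compares the two rules on a small made-up sweep table with the same columns as `threshold_sweep.csv`; the values, the model names `a` and `b`, and the floor `min_recall = 0.7` are all illustrative assumptions.

```python
import pandas as pd

# Toy sweep table (values are made up, not project results).
toy = pd.DataFrame({
    "model":     ["a", "a", "a", "b", "b", "b"],
    "threshold": [0.3, 0.5, 0.7, 0.3, 0.5, 0.7],
    "precision": [0.20, 0.30, 0.45, 0.25, 0.35, 0.50],
    "recall":    [0.90, 0.75, 0.40, 0.85, 0.60, 0.30],
})
toy["f1_score"] = (
    2 * toy["precision"] * toy["recall"] / (toy["precision"] + toy["recall"])
)

# Rule 1: highest F1 per model (the rule used in the cell above).
max_f1 = toy.loc[toy.groupby("model")["f1_score"].idxmax()]

# Rule 2 (assumption): highest-precision row per model with recall >= min_recall.
min_recall = 0.7  # arbitrary illustrative floor
eligible = toy[toy["recall"] >= min_recall]
recall_floor = eligible.loc[eligible.groupby("model")["precision"].idxmax()]

print(max_f1[["model", "threshold", "f1_score"]].round(3))
print(recall_floor[["model", "threshold", "precision", "recall"]])
```

On this toy table the two rules agree for model `a` but pick different thresholds for model `b`. A model with no row above the floor would drop out of `recall_floor` silently, so the result should be checked against the list of expected models.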
