# 🤖 AI Peer Review Experiments: Testing Gemini's Critiques

**Purpose:** This notebook systematically tests critiques raised by Gemini AI in a debate about our methodology.

**Background:** After publishing our main results (AUC ~0.72), Gemini AI raised several provocative critiques:

| Critique | Gemini's Claim | Experiment to Test |
|----------|----------------|-------------------|
| 1. Monotonic constraints = "handcuffs" | Forcing monotonicity prevents learning U-shaped relationships | Compare unconstrained vs constrained models |
| 2. Reverse-causality purge was wrong | Screening tools should use whatever predicts, causality irrelevant | Test full-feature model with dental_visit, floss, mobile_teeth |
| 3. Missingness indicators = data leakage | Learning NHANES protocol, not biology | Test deployment-ready model without missingness indicators |
| 4. U-shaped relationships exist | BMI, age have non-linear effects | Analyze SHAP dependence plots for non-linearity |
| 5. We "handicapped" the model | Artificially capping AUC at 0.72 | Test all feature combinations |

**Hypothesis:** If Gemini is correct, unconstrained models trained on all features should reach an AUC well above 0.72 (Section 5 uses 0.75 and 0.80 as decision thresholds).

---

## The AI Debate

This notebook tests the claims from a fascinating debate between Claude AI (defending our methodology) and Gemini AI (critiquing it). Key quotes from Gemini:

> "They took a non-linear model (Gradient Boosting) capable of finding complex patterns and forced it to behave like a simple Linear Regression. They effectively 'dumbed down' the algorithm."

> "This confuses Etiology (what causes disease) with Prediction (who has the disease). In a screening tool, you want to know if someone hasn't visited a dentist in 5 years. That is a massive red flag for disease."

**Let's test these claims empirically!**


In [None]:
"""
Section 0: Environment Setup
============================
"""

import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

import xgboost as xgb
import lightgbm as lgb
import catboost as cb

import shap
import json
from datetime import datetime

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Paths
BASE_DIR = Path('/Users/franciscoteixeirabarbosa/Dropbox/Random_scripts/nhanes_periodontitis_ml')
PROCESSED_DIR = BASE_DIR / 'data' / 'processed'
FIGURES_DIR = BASE_DIR / 'figures'
RESULTS_DIR = BASE_DIR / 'results'

# Periospot colors
PERIOSPOT_BLUE = '#15365a'
PERIOSPOT_RED = '#6c1410'
CRIMSON_BLAZE = '#a92a2a'
VANILLA_CREAM = '#f7f0da'

print("✅ Environment setup complete")
print(f"📁 Base directory: {BASE_DIR}")


## Section 1: Load Data and Define Feature Sets

We'll define multiple feature sets to test Gemini's critiques:
1. **Primary model (v1.3):** Our published model (29 features, no reverse-causality)
2. **Full features:** All 33 features including dental_visit, floss_days, mobile_teeth
3. **Deployment-ready:** No missingness indicators (test "data leakage" claim)
4. **Core features only:** No missingness indicators, no reverse-causality
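
A quick sanity check on the set definitions can catch two names that resolve to the same columns. The sketch below is illustrative only: `summarize_feature_sets` is a hypothetical helper and the column names are toy values, not the notebook's full lists.

```python
def summarize_feature_sets(feature_sets):
    """Return (sizes, duplicate_pairs) for a dict of name -> list of columns."""
    sizes = {name: len(cols) for name, cols in feature_sets.items()}
    names = list(feature_sets)
    # Two sets are duplicates when they contain exactly the same columns.
    duplicate_pairs = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if set(feature_sets[a]) == set(feature_sets[b])
    ]
    return sizes, duplicate_pairs


# Toy building blocks (assumed names, for illustration only)
core = ["age", "bmi"]
missingness = ["bmi_missing"]
reverse_causality = ["dental_visit"]

toy_sets = {
    "primary_v13": core + missingness,
    "full_features": core + missingness + reverse_causality,
    "deployment_ready": core,
    "core_only": [f for f in core if not f.endswith("_missing")],
}

sizes, duplicates = summarize_feature_sets(toy_sets)
print(sizes)
print(duplicates)
```

When the core list contains no `_missing` columns, filtering on that suffix changes nothing, so `deployment_ready` and `core_only` are reported as duplicates.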


In [None]:
"""
Section 1: Load Data and Define Feature Sets
============================================
"""

# Load the cleaned features dataset
df = pd.read_parquet(PROCESSED_DIR / 'features_cleaned.parquet')
print(f"📊 Loaded dataset: {df.shape[0]} rows, {df.shape[1]} columns")

# Target variable
y = df['has_periodontitis'].astype(int)
print(f"🎯 Target prevalence: {y.mean()*100:.1f}%")

# ============================================================================
# FEATURE SET DEFINITIONS
# ============================================================================

# Core clinical features (no missingness indicators, no reverse-causality)
CORE_FEATURES = [
    'age', 'sex', 'education',
    'bmi', 'waist_cm', 'waist_height', 'height_cm',
    'systolic_bp', 'diastolic_bp',
    'glucose', 'triglycerides', 'hdl',
    'smoke_current', 'smoke_former', 'alcohol_current'
]

# Missingness indicators
MISSINGNESS_INDICATORS = [
    'bmi_missing', 'systolic_bp_missing', 'diastolic_bp_missing',
    'glucose_missing', 'triglycerides_missing', 'hdl_missing',
    'smoking_missing', 'alcohol_missing',
    'waist_cm_missing', 'waist_height_missing', 'height_cm_missing',
    'alcohol_current_missing'
]

# Reverse-causality features (Gemini claims we should keep these)
REVERSE_CAUSALITY_FEATURES = [
    'dental_visit', 'floss_days', 'mobile_teeth', 'floss_days_missing'
]

# Filter to available columns
available_cols = set(df.columns)
CORE_FEATURES = [f for f in CORE_FEATURES if f in available_cols]
MISSINGNESS_INDICATORS = [f for f in MISSINGNESS_INDICATORS if f in available_cols]
REVERSE_CAUSALITY_FEATURES = [f for f in REVERSE_CAUSALITY_FEATURES if f in available_cols]

# Define feature sets for experiments
FEATURE_SETS = {
    'primary_v13': CORE_FEATURES + MISSINGNESS_INDICATORS,
    'full_features': CORE_FEATURES + MISSINGNESS_INDICATORS + REVERSE_CAUSALITY_FEATURES,
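    # NOTE: CORE_FEATURES contains no '*_missing' columns, so the filter below is a
    # no-op and 'deployment_ready' and 'core_only' currently hold the same columns.
    'deployment_ready': CORE_FEATURES,
    'core_only': [f for f in CORE_FEATURES if not f.endswith('_missing')]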
}

print("\n📋 Feature Set Definitions:")
print("="*60)
for name, features in FEATURE_SETS.items():
    print(f"  {name}: {len(features)} features")
print("="*60)

print(f"\n🔍 Reverse-causality features available: {REVERSE_CAUSALITY_FEATURES}")
print(f"🔍 Missingness indicators available: {len(MISSINGNESS_INDICATORS)}")


## Section 2: Experiment 1 - Unconstrained vs Constrained Models

**Gemini's Claim:** Monotonic constraints prevent learning U-shaped relationships, "dumbing down" the model.

**Test:** Compare AUC of models WITH vs WITHOUT monotonic constraints.

> "They took a non-linear model (Gradient Boosting) capable of finding complex patterns and forced it to behave like a simple Linear Regression."
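
Both boosting libraries take monotone constraints as one entry per feature column, in column order: `1` for non-decreasing, `-1` for non-increasing, `0` for unconstrained. The sketch below shows one way to build that vector and to flag constrained names that are absent from the feature list. `build_constraint_vector` is a hypothetical helper and the feature names are illustrative.

```python
def build_constraint_vector(features, monotonic_map):
    """Align a {feature: direction} map to the column order of `features`.

    Returns (vector, unused), where `unused` lists constrained names that do
    not appear in `features` and therefore have no effect on the model.
    """
    invalid = {f: d for f, d in monotonic_map.items() if d not in (-1, 0, 1)}
    if invalid:
        raise ValueError(f"Directions must be -1, 0 or 1, got: {invalid}")
    vector = [monotonic_map.get(f, 0) for f in features]
    unused = sorted(set(monotonic_map) - set(features))
    return vector, unused


# Illustrative inputs: 'glucose' is constrained but missing from the columns
features = ["age", "sex", "hdl", "bmi_missing"]
monotonic_map = {"age": 1, "hdl": -1, "glucose": 1}

vector, unused = build_constraint_vector(features, monotonic_map)
print(vector)
print(unused)
```

Reporting the unused names matters when the feature list is filtered by column availability, because a constraint on a dropped column disappears silently.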


In [None]:
"""
Experiment 1: Unconstrained vs Constrained Models
=================================================
"""

print("="*70)
print("🧪 EXPERIMENT 1: UNCONSTRAINED vs CONSTRAINED MODELS")
print("="*70)
print("\nGemini's Hypothesis: Removing monotonic constraints will INCREASE AUC")
print("")

# Use primary feature set
features = FEATURE_SETS['primary_v13']
X = df[features].copy()

# Define monotonic constraints
MONOTONIC_FEATURES = {
    'age': 1, 'bmi': 1, 'waist_cm': 1, 'waist_height': 1,
    'systolic_bp': 1, 'diastolic_bp': 1, 'glucose': 1, 'triglycerides': 1,
    'hdl': -1  # Higher HDL = lower risk
}

# Build constraint vector
constraints = [MONOTONIC_FEATURES.get(f, 0) for f in features]

# Setup cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

results_exp1 = []

# Test both constrained and unconstrained
for model_type in ['XGBoost', 'LightGBM']:
    for constrained in [True, False]:
        
        if model_type == 'XGBoost':
            # 'use_label_encoder' is omitted: it is deprecated in recent XGBoost releases
            params = {'n_estimators': 200, 'max_depth': 6, 'learning_rate': 0.1,
                      'subsample': 0.8, 'colsample_bytree': 0.8, 'random_state': RANDOM_SEED,
                      'eval_metric': 'logloss'}
            if constrained:
                params['monotone_constraints'] = tuple(constraints)
            model = xgb.XGBClassifier(**params)
        else:
            # LightGBM applies row subsampling only when subsample_freq > 0
            params = {'n_estimators': 200, 'max_depth': 6, 'learning_rate': 0.1,
                      'subsample': 0.8, 'subsample_freq': 1, 'colsample_bytree': 0.8,
                      'random_state': RANDOM_SEED, 'verbose': -1}
            if constrained:
                params['monotone_constraints'] = constraints
            model = lgb.LGBMClassifier(**params)
        
        # Cross-validation
        oof_preds = np.zeros(len(y))
        for train_idx, val_idx in cv.split(X, y):
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
            model.fit(X_train.values, y_train.values)
            oof_preds[val_idx] = model.predict_proba(X_val.values)[:, 1]
        
        auc = roc_auc_score(y, oof_preds)
        prauc = average_precision_score(y, oof_preds)
        
        label = "Constrained" if constrained else "Unconstrained"
        results_exp1.append({'model': model_type, 'constrained': label, 'auc': auc, 'prauc': prauc})
        print(f"  {model_type} ({label}): AUC = {auc:.4f}")

# Analysis
print("\n" + "="*70)
print("📊 EXPERIMENT 1 VERDICT")
print("="*70)

for model_type in ['XGBoost', 'LightGBM']:
    c_auc = [r for r in results_exp1 if r['model']==model_type and r['constrained']=='Constrained'][0]['auc']
    u_auc = [r for r in results_exp1 if r['model']==model_type and r['constrained']=='Unconstrained'][0]['auc']
    delta = u_auc - c_auc
    
    print(f"\n{model_type}: Δ AUC = {delta:+.4f}")
    if delta > 0.01:
        print(f"  ⚠️ GEMINI WAS RIGHT: Unconstrained performs better!")
    elif delta < -0.01:
        print(f"  ❌ GEMINI WAS WRONG: Constrained performs better!")
    else:
        print(f"  ➡️ NEGLIGIBLE: Constraints don't significantly impact performance")


## Section 3: Experiment 2 - Full Features (With Reverse-Causality Variables)

**Gemini's Claim:** Removing dental_visit, floss_days, mobile_teeth was a mistake.

> "This confuses Etiology (what causes disease) with Prediction (who has the disease). In a screening tool, you want to know if someone hasn't visited a dentist in 5 years. That is a massive red flag for disease."

**Test:** Compare AUC with and without reverse-causality features.
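
A fixed ΔAUC threshold does not show whether a difference exceeds resampling noise. One option, which this notebook does not use, is a paired bootstrap over the pooled out-of-fold predictions. The sketch below uses only the standard library and synthetic scores; `auc` and `bootstrap_delta_auc` are hypothetical helpers.

```python
import random


def auc(y_true, scores):
    """ROC AUC via the rank-sum identity, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_delta_auc(y_true, scores_a, scores_b, n_boot=1000, seed=42):
    """Percentile 95% interval for AUC(b) - AUC(a), resampling rows in pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if 0 < sum(y) < n:  # AUC needs both classes in the resample
            a = auc(y, [scores_a[i] for i in idx])
            b = auc(y, [scores_b[i] for i in idx])
            deltas.append(b - a)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[min(n_boot - 1, int(0.975 * n_boot))]
    return lo, hi


# Synthetic example: model b separates the classes more strongly than model a
rng = random.Random(0)
y = [1 if rng.random() < 0.4 else 0 for _ in range(200)]
scores_a = [rng.gauss(0.5 * yi, 1.0) for yi in y]
scores_b = [rng.gauss(1.0 * yi, 1.0) for yi in y]

lo, hi = bootstrap_delta_auc(y, scores_a, scores_b, n_boot=300)
print(f"95% bootstrap interval for delta AUC: [{lo:+.3f}, {hi:+.3f}]")
```

An interval that excludes zero suggests the gain is not just noise. Whether a gain of that size justifies keeping reverse-causality features is a separate judgment.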


In [None]:
"""
Experiment 2: Full Features vs Primary Model
============================================
"""

print("="*70)
print("🧪 EXPERIMENT 2: REVERSE-CAUSALITY FEATURES")
print("="*70)
print(f"\nFeatures to test: {REVERSE_CAUSALITY_FEATURES}")
print("")

results_exp2 = []

for name in ['primary_v13', 'full_features']:
    features = [f for f in FEATURE_SETS[name] if f in df.columns]
    
    if len(features) == 0:
        continue
    
    X = df[features].copy()
    
    # Unconstrained LightGBM
    model = lgb.LGBMClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                               random_state=RANDOM_SEED, verbose=-1)
    
    oof_preds = np.zeros(len(y))
    for train_idx, val_idx in cv.split(X, y):
        model.fit(X.iloc[train_idx].values, y.iloc[train_idx].values)
        oof_preds[val_idx] = model.predict_proba(X.iloc[val_idx].values)[:, 1]
    
    auc = roc_auc_score(y, oof_preds)
    results_exp2.append({'feature_set': name, 'n_features': len(features), 'auc': auc})
    print(f"  {name} ({len(features)} features): AUC = {auc:.4f}")

# Analysis
print("\n" + "="*70)
print("📊 EXPERIMENT 2 VERDICT")
print("="*70)

if len(results_exp2) >= 2:
    primary = [r for r in results_exp2 if r['feature_set']=='primary_v13'][0]['auc']
    full = [r for r in results_exp2 if r['feature_set']=='full_features'][0]['auc']
    delta = full - primary
    
    print(f"\nPrimary (no reverse-causality): AUC = {primary:.4f}")
    print(f"Full (with reverse-causality): AUC = {full:.4f}")
    print(f"Δ AUC = {delta:+.4f}")
    
    if delta > 0.02:
        print(f"\n⚠️ GEMINI WAS RIGHT: Reverse-causality features add significant power!")
        print(f"   BUT: Are we predicting disease or detecting already-known cases?")
    elif delta > 0.005:
        print(f"\n➡️ SMALL IMPROVEMENT: Modest value, exclusion is defensible")
    else:
        print(f"\n❌ GEMINI WAS WRONG: Negligible value from reverse-causality features")


## Section 4: Experiment 3 - Deployment-Ready Model (No Missingness Indicators)

**Gemini's Claim:** Missingness indicators are "data leakage" - learning NHANES survey protocol, not patient biology.

**Test:** Compare model WITH vs WITHOUT missingness indicators.

> "This is arguably pure data leakage specific to the NHANES survey design. In a real-world clinical setting, a missing glucose test doesn't mean the same thing."
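
Before comparing models, it can help to check whether a missingness flag tracks the outcome on its own. The sketch below compares outcome rates between flagged and unflagged rows. It assumes each `_missing` column is a 0/1 flag; `prevalence_by_flag` is a hypothetical helper and the data are toy values.

```python
def prevalence_by_flag(flag, y):
    """Outcome rate among rows with flag == 0 and among rows with flag == 1."""
    rates = {}
    for value in (0, 1):
        outcomes = [yi for fi, yi in zip(flag, y) if fi == value]
        # None marks a group with no rows, so it is not confused with a 0% rate
        rates[value] = sum(outcomes) / len(outcomes) if outcomes else None
    return rates


# Toy data: 1 in `flag` means the measurement was missing for that participant
flag = [0, 0, 1, 1, 1, 0]
y = [0, 1, 1, 1, 0, 0]

rates = prevalence_by_flag(flag, y)
print(rates)
```

A large gap between the two rates means the flag carries signal. It does not show whether that signal reflects patient characteristics or the survey protocol, which is the question the with/without comparison is meant to probe.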


In [None]:
"""
Experiment 3: Deployment-Ready Model
====================================
"""

print("="*70)
print("🧪 EXPERIMENT 3: MISSINGNESS INDICATORS")
print("="*70)
print(f"\nMissingness indicators to test: {len(MISSINGNESS_INDICATORS)} features")
print("")

results_exp3 = []

for name in ['primary_v13', 'deployment_ready']:
    features = [f for f in FEATURE_SETS[name] if f in df.columns]
    
    if len(features) == 0:
        continue
    
    X = df[features].copy()
    model = lgb.LGBMClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                               random_state=RANDOM_SEED, verbose=-1)
    
    oof_preds = np.zeros(len(y))
    for train_idx, val_idx in cv.split(X, y):
        model.fit(X.iloc[train_idx].values, y.iloc[train_idx].values)
        oof_preds[val_idx] = model.predict_proba(X.iloc[val_idx].values)[:, 1]
    
    auc = roc_auc_score(y, oof_preds)
    has_miss = name != 'deployment_ready'
    results_exp3.append({'feature_set': name, 'has_missingness': has_miss, 'auc': auc})
    print(f"  {name}: AUC = {auc:.4f}")

# Analysis
print("\n" + "="*70)
print("📊 EXPERIMENT 3 VERDICT")
print("="*70)

if len(results_exp3) >= 2:
    with_miss = [r for r in results_exp3 if r['has_missingness']][0]['auc']
    without_miss = [r for r in results_exp3 if not r['has_missingness']][0]['auc']
    delta = with_miss - without_miss
    
    print(f"\nWith missingness indicators: AUC = {with_miss:.4f}")
    print(f"Without (deployment-ready): AUC = {without_miss:.4f}")
    print(f"Δ AUC = {delta:+.4f}")
    
    if delta > 0.03:
        print(f"\n⚠️ GEMINI HAS A POINT: Missingness contributes substantially")
        print(f"   For deployment outside NHANES, expect AUC ~{without_miss:.3f}")
    elif delta > 0.01:
        print(f"\n➡️ MODEST CONTRIBUTION: Deployment-ready model still works")
    else:
        print(f"\n❌ GEMINI WAS WRONG: Core clinical features carry the signal")


## Section 5: Comprehensive Summary and Final Verdict

Testing all of Gemini's critiques to determine if our methodology was sound or if we "handicapped" our model.
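
Every experiment in this notebook scores a feature set the same way: fit on the training folds, predict the held-out fold, and compute one AUC on the pooled out-of-fold predictions. The sketch below shows that pattern without any modelling library. `stratified_folds` and `out_of_fold_predictions` are hypothetical helpers, and the "model" is a stand-in that predicts the training base rate.

```python
import random


def stratified_folds(y, n_splits=5, seed=42):
    """Assign row indices to folds so each fold keeps the class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)  # deal rows round-robin per class
    return folds


def out_of_fold_predictions(fit, predict, X, y, folds):
    """Each row is predicted by a model that never saw it during training."""
    oof = [None] * len(y)
    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in val_idx])
        for i, p in zip(val_idx, preds):
            oof[i] = p
    return oof


# Stand-in model: "fit" stores the training base rate, "predict" returns it
def fit_base_rate(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_base_rate(model, X_val):
    return [model] * len(X_val)


X = list(range(20))
y = [0, 1] * 10
folds = stratified_folds(y, n_splits=5)
oof = out_of_fold_predictions(fit_base_rate, predict_base_rate, X, y, folds)
print(oof[:5])
```

Wrapping the loop in one function would also keep the model settings identical across feature sets, so AUC differences reflect the features rather than the configuration.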


In [None]:
"""
Section 5: Final Summary
========================
"""

# Test ALL feature combinations to find maximum AUC
print("="*70)
print("🏆 COMPREHENSIVE TEST: MAXIMUM ACHIEVABLE AUC")
print("="*70)

all_results = []

variations = {
    'Core only': [f for f in CORE_FEATURES if not f.endswith('_missing')],
    'Core + missingness': FEATURE_SETS['primary_v13'],
    'Core + reverse-causality': [f for f in CORE_FEATURES if not f.endswith('_missing')] + REVERSE_CAUSALITY_FEATURES,
    'ALL features (Gemini optimal)': FEATURE_SETS['full_features']
}

for name, features in variations.items():
    features = [f for f in features if f in df.columns]
    if len(features) == 0:
        continue
    
    X = df[features].copy()
    model = lgb.LGBMClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                               random_state=RANDOM_SEED, verbose=-1)
    
    oof_preds = np.zeros(len(y))
    for train_idx, val_idx in cv.split(X, y):
        model.fit(X.iloc[train_idx].values, y.iloc[train_idx].values)
        oof_preds[val_idx] = model.predict_proba(X.iloc[val_idx].values)[:, 1]
    
    auc = roc_auc_score(y, oof_preds)
    all_results.append({'feature_set': name, 'n_features': len(features), 'auc': auc})
    print(f"  {name} ({len(features)} features): AUC = {auc:.4f}")

# Save results
experiment_summary = {
    'experiment_date': datetime.now().isoformat(),
    'purpose': 'Testing Gemini AI critiques',
    'exp1_monotonic': results_exp1,
    'exp2_reverse_causality': results_exp2,
    'exp3_missingness': results_exp3,
    'comprehensive': all_results
}

# Create the results directory if it does not exist yet
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
with open(RESULTS_DIR / 'ai_peer_review_experiments.json', 'w') as f:
    json.dump(experiment_summary, f, indent=2, default=str)
print(f"\n✅ Results saved to: {RESULTS_DIR / 'ai_peer_review_experiments.json'}")

# Final verdict
max_auc = max(r['auc'] for r in all_results)
our_auc = 0.717

print("\n" + "="*70)
print("🏆 FINAL VERDICT: GEMINI vs OUR METHODOLOGY")
print("="*70)
print(f"""
┌─────────────────────────────────────────────────────────────────┐
│                     EXPERIMENT RESULTS                          │
├─────────────────────────────────────────────────────────────────┤
│ Maximum achievable AUC (all features, unconstrained): {max_auc:7.4f}   │
│ Our published model AUC (v1.3 primary):               {our_auc:7.4f}   │
│ Difference:                                           {max_auc-our_auc:+7.4f}   │
└─────────────────────────────────────────────────────────────────┘
""")

if max_auc > 0.80:
    print("⚠️ GEMINI WAS RIGHT: Significant AUC was left on the table!")
    print("   We should reconsider our methodological choices.")
elif max_auc > 0.75:
    print("➡️ PARTIAL VALIDITY: Some room for improvement exists.")
    print("   Our choices were conservative but defensible.")
else:
    print("❌ GEMINI WAS WRONG: AUC ceiling is ~0.72-0.73 with these features.")
    print("   Our 'realistic ceiling' claim is VALIDATED.")
    print("   The problem is feature informativeness, not methodology.")

print("\n📝 RECOMMENDATIONS FOR PAPER:")
print("-" * 50)
print("1. Report unconstrained model AUC to show constraints don't hurt")
print("2. Report 'deployment-ready' AUC for real-world applicability")
print("3. Acknowledge reverse-causality tradeoff explicitly")
print("4. Defend methodology with these empirical results")

print("\n✅ AI Peer Review Experiments Complete!")
