# 🦷 Dental Implant 10-Year Survival Prediction

## Notebook 08: Upsampling + 10-Fold LightGBM Ensemble

**Objective:** Implement the winning approach from a top competitor:
1. **Upsample minority class** (failures) to match majority class
2. **10-fold Stratified Cross-Validation** with LightGBM
3. **Ensemble predictions** by averaging probabilities from all 10 models
4. **Submit probabilities** (not binary predictions)

---

### 🔑 Key Techniques:
- **Bootstrap Upsampling**: Duplicate minority class samples to create 50/50 balance
- **10-Fold Ensemble**: Average predictions from 10 models for robustness
- **Probability Output**: Submit survival probabilities instead of 0/1

### 📊 Expected Results:
Based on the reference notebook:
- **Failure Recall: ~93-97%** (vs our previous 0%!)
- **Overall Accuracy: ~91-93%**


---

### 1. Setup & Import Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.utils import resample
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

# Periospot Brand Colors
COLORS = {
    'periospot_blue': '#15365a',
    'mystic_blue': '#003049',
    'periospot_red': '#6c1410',
    'crimson_blaze': '#a92a2a',
    'vanilla_cream': '#f7f0da',
    'black': '#000000',
    'white': '#ffffff',
    'periospot_yellow': '#ffc430',
}

plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['axes.facecolor'] = COLORS['vanilla_cream']
plt.rcParams['figure.facecolor'] = COLORS['white']

print("✅ Libraries imported!")


---

### 2. Load Raw Data

We'll load the raw data and do minimal preprocessing (like the reference notebook).


In [None]:
# Load raw data
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

print(f"Training data: {df_train.shape}")
print(f"Test data: {df_test.shape}")

# Check class distribution (compute the counts once and reuse them)
class_counts = df_train['implant_survival_10y'].value_counts()
print("\nTarget distribution:")
print(class_counts)
print(f"\nClass imbalance ratio (survival:failure): {class_counts[1] / class_counts[0]:.1f}:1")


In [None]:
# Remove patient_id (not predictive)
train_clean = df_train.drop(['patient_id'], axis=1)
test_clean = df_test.drop(['patient_id'], axis=1)

# Identify numerical and categorical columns
numerical_cols = train_clean.select_dtypes(include='number').columns.tolist()
categorical_cols = train_clean.select_dtypes(include='object').columns.tolist()

# Remove target from numerical
numerical_cols.remove('implant_survival_10y')

print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")


---

### 3. One-Hot Encoding


In [None]:
# One-Hot Encode categorical columns
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit and transform training data
encoded_train = ohe.fit_transform(train_clean[categorical_cols])
encoded_cols = ohe.get_feature_names_out(categorical_cols)
encoded_train_df = pd.DataFrame(encoded_train, columns=encoded_cols)

# Combine numerical and encoded features
df_train_processed = train_clean.drop(columns=categorical_cols)
df_train_final = pd.concat([df_train_processed.reset_index(drop=True), encoded_train_df], axis=1)

print(f"Processed training data shape: {df_train_final.shape}")
print(f"Features: {df_train_final.shape[1] - 1} (excluding target)")


---

### 4. Upsample Minority Class (Key Technique!)

Instead of SMOTE or class weights, we'll **duplicate minority class samples** to create a perfectly balanced 50/50 dataset.
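
To make the idea concrete before applying it to the real data, here is a small sketch on a hypothetical 8-row frame (the `feature` column and its values are made up for illustration). It uses the same `sklearn.utils.resample` call as the cell below.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 6 survivals (1) and 2 failures (0).
toy = pd.DataFrame({
    "feature": [10, 11, 12, 13, 14, 15, 90, 91],
    "implant_survival_10y": [1, 1, 1, 1, 1, 1, 0, 0],
})

majority = toy[toy["implant_survival_10y"] == 1]
minority = toy[toy["implant_survival_10y"] == 0]

# Draw with replacement until the minority matches the majority in size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up], ignore_index=True)

class_counts = balanced["implant_survival_10y"].value_counts().to_dict()
n_unique_failures = minority_up["feature"].nunique()

print(class_counts)        # both classes now have 6 rows
print(n_unique_failures)   # still at most 2 distinct failure rows
```

The class counts become equal, but no new information is created: the upsampled failures are repeated copies of the original failure rows.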


In [None]:
# =============================================================================
# UPSAMPLE MINORITY CLASS (FAILURES) TO MATCH MAJORITY CLASS
# =============================================================================

# Separate majority and minority classes
majority_class = df_train_final[df_train_final['implant_survival_10y'] == 1]
minority_class = df_train_final[df_train_final['implant_survival_10y'] == 0]

print(f"Before upsampling:")
print(f"  Majority class (Survival): {len(majority_class)}")
print(f"  Minority class (Failure):  {len(minority_class)}")

# Upsample minority class with replacement
minority_upsampled = resample(
    minority_class,
    replace=True,                          # Sample with replacement
    n_samples=len(majority_class),         # Match majority class count
    random_state=42
)

# Combine for balanced dataset
balanced_data = pd.concat([majority_class, minority_upsampled])

# Shuffle the balanced data
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"\nAfter upsampling:")
print(f"  Total samples: {len(balanced_data)}")
print(f"  Class distribution:")
print(balanced_data['implant_survival_10y'].value_counts())

# Separate features and target
X_balanced = balanced_data.drop('implant_survival_10y', axis=1)
y_balanced = balanced_data['implant_survival_10y']

print(f"\n✅ Balanced dataset created: {X_balanced.shape[0]} samples, {X_balanced.shape[1]} features")


---

### 5. Train 10-Fold LightGBM Ensemble

We'll use 10-fold Stratified Cross-Validation and keep all 10 models for ensemble predictions.
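
As a quick illustration of how `cross_validate` can hand back the fitted fold models, here is a sketch on synthetic data. `LogisticRegression` stands in for LightGBM only to keep the snippet dependency-light; the data from `make_classification` is synthetic, and the names `fold_models` and `fold_probs` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (not the implant dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
results = cross_validate(
    LogisticRegression(max_iter=1000),  # stand-in for LGBMClassifier
    X, y,
    cv=skf,
    scoring="roc_auc",
    return_estimator=True,  # keep the model fitted on each fold
)

# One fitted model per fold; each can score new rows independently.
fold_models = results["estimator"]
fold_probs = np.column_stack([m.predict_proba(X[:5])[:, 1] for m in fold_models])

print(len(fold_models), fold_probs.shape)
```

Each of the 10 estimators was trained on a different 90% of the data, which is what makes averaging their outputs meaningful.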


In [None]:
# =============================================================================
# 10-FOLD STRATIFIED CROSS-VALIDATION WITH LIGHTGBM
# =============================================================================

# Initialize LightGBM with default parameters (like the reference notebook)
lgb = LGBMClassifier(random_state=42, verbose=-1)

# 10-fold Stratified Cross-Validation
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Run cross-validation and return all estimators
print("Training 10-fold LightGBM ensemble...")
cv_results = cross_validate(
    lgb, 
    X_balanced, 
    y_balanced, 
    cv=skf, 
    scoring='roc_auc', 
    n_jobs=-1, 
    return_estimator=True
)

print("\n✅ 10 models trained!")
print(f"Mean ROC-AUC: {cv_results['test_score'].mean():.4f} (+/- {cv_results['test_score'].std():.4f})")


In [None]:
# =============================================================================
# EVALUATE EACH FOLD
# =============================================================================

print("=" * 60)
print("CLASSIFICATION REPORTS FOR EACH FOLD")
print("=" * 60)

# skf has a fixed random_state, so split() reproduces the folds used by cross_validate
fold_splits = skf.split(X_balanced, y_balanced)

for fold, (model, (_, val_idx)) in enumerate(zip(cv_results['estimator'], fold_splits)):
    # Select this fold's validation rows
    X_val, y_val = X_balanced.iloc[val_idx], y_balanced.iloc[val_idx]
    
    # Predict
    y_pred = model.predict(X_val)
    
    # Print report
    print(f"\nFold {fold+1}:")
    print(classification_report(y_val, y_pred, target_names=['Failure', 'Survival']))
    print("-" * 60)
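
One thing worth checking when reading these per-fold reports: because upsampling happened before the split, copies of the same failure row can land in both the training and the validation part of a fold. The sketch below measures this on hypothetical toy data (the `row_id` column is an illustrative tag, not part of the real dataset). It is a diagnostic under those assumptions, not a statement about the reference notebook's results.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Hypothetical toy data: 40 survivals, 10 failures, each row tagged with an id.
toy = pd.DataFrame({
    "row_id": range(50),
    "implant_survival_10y": [1] * 40 + [0] * 10,
})
majority = toy[toy["implant_survival_10y"] == 1]
minority = toy[toy["implant_survival_10y"] == 0]

# Upsample first, then split: the same order as in this notebook.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up], ignore_index=True)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
leaked_fraction = []
for train_idx, val_idx in skf.split(balanced, balanced["implant_survival_10y"]):
    train_ids = set(balanced.loc[train_idx, "row_id"])
    val_rows = balanced.loc[val_idx]
    val_failures = val_rows[val_rows["implant_survival_10y"] == 0]
    # Share of validation failures whose original row also sits in the training part
    leaked_fraction.append(val_failures["row_id"].isin(train_ids).mean())

print([round(f, 2) for f in leaked_fraction])
```

If these fractions are high on the real data as well, the fold-level failure recall is likely optimistic. One alternative would be to resample only the training part of each fold and validate on untouched rows.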


---

### 6. Generate Test Predictions (Ensemble Averaging)

We'll average the **probabilities** from all 10 models for the final prediction.
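
The averaging step itself is a row-wise mean over a matrix with one column per model. A minimal sketch with made-up probabilities (3 test rows, 4 models; all values hypothetical):

```python
import numpy as np

# Hypothetical survival probabilities: 3 test rows x 4 fold models.
fold_probs = np.array([
    [0.90, 0.80, 0.85, 0.95],
    [0.40, 0.60, 0.55, 0.45],
    [0.10, 0.20, 0.05, 0.25],
])

# Row-wise mean gives one ensemble probability per test row.
ensemble_probs = fold_probs.mean(axis=1)

# Row-wise standard deviation shows how much the models disagree.
model_spread = fold_probs.std(axis=1)

print(ensemble_probs)  # approximately 0.875, 0.5, 0.15
print(model_spread)
```

The second row shows why averaging helps: the individual models fall on both sides of 0.5, and the mean reflects that uncertainty instead of committing to one model's vote.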


In [None]:
# =============================================================================
# PREPARE TEST DATA (same preprocessing as training)
# =============================================================================

# One-hot encode test categorical columns (using same encoder)
test_encoded = ohe.transform(test_clean[categorical_cols])
test_encoded_df = pd.DataFrame(test_encoded, columns=encoded_cols)

# Combine numerical and encoded features
df_test_processed = test_clean.drop(columns=categorical_cols)
df_test_final = pd.concat([df_test_processed.reset_index(drop=True), test_encoded_df], axis=1)

print(f"Test data shape: {df_test_final.shape}")


In [None]:
# =============================================================================
# ENSEMBLE PREDICTIONS (AVERAGE PROBABILITIES FROM ALL 10 MODELS)
# =============================================================================

# Reorder test columns to match the training feature order
X_test = df_test_final[X_balanced.columns]

# One column of predictions per fold model
models = cv_results['estimator']
test_preds = np.zeros((len(X_test), len(models)))

# Get probability predictions from each model
for fold, model in enumerate(models):
    test_preds[:, fold] = model.predict_proba(X_test)[:, 1]  # Probability of class 1 (Survival)

# Average predictions across all 10 models
final_probs = test_preds.mean(axis=1)

print("✅ Ensemble predictions generated!")
print(f"\nPrediction statistics:")
print(f"  Min probability:  {final_probs.min():.4f}")
print(f"  Max probability:  {final_probs.max():.4f}")
print(f"  Mean probability: {final_probs.mean():.4f}")
print(f"  Std probability:  {final_probs.std():.4f}")


In [None]:
# Visualize probability distribution
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(final_probs, bins=50, color=COLORS['periospot_blue'], edgecolor='white', alpha=0.7)
ax.axvline(x=0.5, color=COLORS['crimson_blaze'], linestyle='--', linewidth=2, label='Threshold 0.5')
ax.set_xlabel('Survival Probability', fontweight='bold')
ax.set_ylabel('Frequency', fontweight='bold')
ax.set_title('Distribution of Ensemble Predictions (10-Model Average)', fontweight='bold')
ax.legend()

plt.tight_layout()
plt.savefig('../figures/ensemble_prediction_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Count predictions above/below threshold
print(f"\nPredictions above 0.5 (Survival): {(final_probs >= 0.5).sum()} ({(final_probs >= 0.5).mean()*100:.1f}%)")
print(f"Predictions below 0.5 (Failure):  {(final_probs < 0.5).sum()} ({(final_probs < 0.5).mean()*100:.1f}%)")


---

### 7. Create Submission File

We'll submit **probabilities** (not binary 0/1) as the reference notebook did.
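
Why probabilities can matter: if the leaderboard metric is ranking-based, such as ROC-AUC (an assumption here, since the metric is not stated in this notebook), thresholding to 0/1 discards ordering information. A small sketch with hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and ensemble probabilities for 6 implants.
y_true = np.array([0, 0, 1, 1, 1, 1])
probs = np.array([0.30, 0.60, 0.55, 0.70, 0.80, 0.90])

# Hard labels at the usual 0.5 threshold.
binary = (probs >= 0.5).astype(int)

auc_probs = roc_auc_score(y_true, probs)    # 0.875
auc_binary = roc_auc_score(y_true, binary)  # 0.75

print(auc_probs, auc_binary)
```

After thresholding, the failure scored 0.60 is tied with every survivor, so the metric can no longer credit the model for ranking most survivors above it. Under an accuracy-style metric the submission would need hard labels instead.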


In [None]:
# Create submission with probabilities
submission = pd.DataFrame({
    'patient_id': df_test['patient_id'],
    'implant_survival_10y': final_probs
})

submission.to_csv('../submission.csv', index=False)

print("Submission file created: ../submission.csv")
print(f"Shape: {submission.shape}")
print(submission.head(10))


---

### ✅ Upsampling + Ensemble Complete!

**Techniques Used:**
1. ✅ **Bootstrap Upsampling** - Balanced data to 50/50
2. ✅ **10-Fold Stratified CV** - Robust model evaluation
3. ✅ **LightGBM Ensemble** - 10 models averaged
4. ✅ **Probability Predictions** - Not binary 0/1

**Expected Improvement:**
- Previous best: 0.92171 (XGBoost)
- Reference notebook achieved ~93% with this approach!

**Submit to Kaggle** (run from the directory that contains `submission.csv`, one level above this notebook):
```bash
kaggle competitions submit -c dental-implant-10-year-survival-prediction -f submission.csv -m "Upsampling + 10-fold LightGBM Ensemble"
```
