# 🦷 Dental Implant 10-Year Survival Prediction

## Notebook 07: Submission Generation

**Objective:** Use the best-performing model to generate predictions on the test set and create the final submission file for Kaggle.

---


### 🎨 Setup: Import Libraries & Configure Plotting


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
import glob
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings('ignore')

# Periospot Brand Colors
COLORS = {
    'periospot_blue': '#15365a',
    'mystic_blue': '#003049',
    'periospot_red': '#6c1410',
    'crimson_blaze': '#a92a2a',
    'vanilla_cream': '#f7f0da',
    'black': '#000000',
    'white': '#ffffff',
    'classic_periospot_blue': '#0031af',
    'periospot_light_blue': '#0297ed',
    'periospot_dark_blue': '#02011e',
    'periospot_yellow': '#ffc430',
    'periospot_bright_blue': '#1040dd'
}

periospot_palette = [COLORS['periospot_blue'], COLORS['crimson_blaze'], 
                     COLORS['periospot_light_blue'], COLORS['periospot_yellow'],
                     COLORS['mystic_blue'], COLORS['periospot_red']]

# Configure matplotlib
plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.facecolor'] = COLORS['white']
plt.rcParams['axes.facecolor'] = COLORS['vanilla_cream']
plt.rcParams['axes.edgecolor'] = COLORS['periospot_blue']

sns.set_palette(periospot_palette)

print("✅ Libraries imported and plotting style configured!")


---

### 1. Load and Compare All Model Results

First, let's load the results from all the models we trained and compare their performance.


In [None]:
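# Load every model's results from the ../results/ folder
results_files = sorted(glob.glob('../results/*.json'))
if not results_files:
    # Without this guard, sort_values below fails with an unhelpful KeyError
    raise FileNotFoundError("No result files found in ../results/")

all_results = []
for file in results_files:
    with open(file, 'r') as f:
        all_results.append(json.load(f))

# Create comparison dataframe, best ROC-AUC first
comparison_df = pd.DataFrame(all_results)
comparison_df = comparison_df.sort_values('roc_auc', ascending=False).reset_index(drop=True)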

print("=" * 60)
print("MODEL COMPARISON - SORTED BY ROC-AUC")
print("=" * 60)
print(comparison_df[['model', 'roc_auc', 'accuracy']].to_string(index=False))
print("=" * 60)
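The loading loop above assumes that every JSON file holds a flat object with `model`, `roc_auc`, and `accuracy` keys. A minimal sketch of a stricter loader under that assumed schema follows. The helper name `load_results`, the sample file names, and the scores are hypothetical. It skips unusable files instead of failing later in `sort_values`:

```python
import glob
import json
import os
import tempfile

REQUIRED_KEYS = {"model", "roc_auc", "accuracy"}  # assumed schema of each results file


def load_results(results_dir):
    """Return result dicts from results_dir, best ROC-AUC first; skip unusable files."""
    results = []
    for path in sorted(glob.glob(os.path.join(results_dir, "*.json"))):
        with open(path, "r") as f:
            try:
                result = json.load(f)
            except json.JSONDecodeError:
                print(f"Skipping {path}: not valid JSON")
                continue
        if not isinstance(result, dict) or not REQUIRED_KEYS.issubset(result):
            print(f"Skipping {path}: missing one of {sorted(REQUIRED_KEYS)}")
            continue
        results.append(result)
    return sorted(results, key=lambda r: r["roc_auc"], reverse=True)


# Demo with made-up numbers in a temporary folder (not real model results)
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "a.json": {"model": "model_a", "roc_auc": 0.70, "accuracy": 0.80},
        "b.json": {"model": "model_b", "roc_auc": 0.75, "accuracy": 0.78},
        "c.json": {"model": "incomplete"},  # lacks the metric keys, so it is skipped
    }
    for name, payload in samples.items():
        with open(os.path.join(tmp, name), "w") as f:
            json.dump(payload, f)
    demo_results = load_results(tmp)

print([r["model"] for r in demo_results])
```

The returned list can be passed straight to `pd.DataFrame(...)` in place of `all_results`.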


In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC-AUC comparison
ax1 = axes[0]
bars1 = ax1.barh(comparison_df['model'], comparison_df['roc_auc'], color=periospot_palette[:len(comparison_df)])
ax1.set_xlabel('ROC-AUC Score')
ax1.set_title('Model Comparison - ROC-AUC', fontweight='bold')
ax1.bar_label(bars1, fmt='%.4f', padding=3)
ax1.set_xlim(0, 1.0)

# Accuracy comparison
ax2 = axes[1]
bars2 = ax2.barh(comparison_df['model'], comparison_df['accuracy'], color=periospot_palette[:len(comparison_df)])
ax2.set_xlabel('Accuracy')
ax2.set_title('Model Comparison - Accuracy', fontweight='bold')
ax2.bar_label(bars2, fmt='%.4f', padding=3)
ax2.set_xlim(0, 1.0)

plt.tight_layout()
os.makedirs('../figures', exist_ok=True)  # savefig fails if the folder is missing
plt.savefig('../figures/all_models_comparison.png', dpi=150, bbox_inches='tight')
plt.show()


In [None]:
# Identify the best model
best_model_name = comparison_df.iloc[0]['model']
best_roc_auc = comparison_df.iloc[0]['roc_auc']

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"   ROC-AUC Score: {best_roc_auc:.4f}")


---

### 2. Load Data & Train Best Model on Full Dataset

Next, we retrain the best-performing model on the ENTIRE training dataset rather than only the training split used for model comparison, so the final model learns from every labeled example before predicting on the test set.


In [None]:
# Load the processed training data
X = pd.read_csv('../data/processed/X_train.csv')
y = pd.read_csv('../data/processed/y_train.csv').values.ravel()

# Load the processed test data
X_test = pd.read_csv('../data/processed/X_test.csv')
test_ids = pd.read_csv('../data/processed/test_ids.csv')

print(f"Training data shape: {X.shape}")
print(f"Test data shape: {X_test.shape}")
print(f"Test IDs shape: {test_ids.shape}")


In [None]:
# TODO: Initialize your best-performing model, then train it on the ENTIRE training dataset.
# Based on the comparison above, choose the appropriate model.

# Example: If XGBoost was the best model:
# (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted)
# best_model = xgb.XGBClassifier(
#     n_estimators=100,
#     max_depth=6,
#     learning_rate=0.1,
#     random_state=42,
#     eval_metric='auc'
# )

# Example: If LightGBM was the best:
# best_model = lgb.LGBMClassifier(n_estimators=100, max_depth=6, learning_rate=0.1, random_state=42, verbose=-1)

# Example: If CatBoost was the best:
# best_model = CatBoostClassifier(iterations=100, depth=6, learning_rate=0.1, random_state=42, verbose=False)

# TODO: Choose and initialize your best model based on results
best_model = ...

# Fail loudly if the placeholder above was not replaced
if best_model is ...:
    raise NotImplementedError("Replace the `best_model = ...` placeholder with your chosen model.")

# Fit the model on the entire training dataset
best_model.fit(X, y)

print(f"✅ {best_model_name} trained on full dataset!")



---

### 3. Generate Predictions & Create Submission File


In [None]:
# TODO: Make predictions on the test set.
# You can predict class labels (0 or 1) or probabilities depending on competition requirements.

# For class labels:
test_predictions = best_model.predict(X_test)

# For probabilities (use this if the competition requires probability scores):
# test_predictions_proba = best_model.predict_proba(X_test)[:, 1]

print(f"Generated {len(test_predictions)} predictions")
print("Prediction distribution:")
print(pd.Series(test_predictions).value_counts())
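If the competition scores class labels but you start from `predict_proba` output, the probabilities must be converted with an explicit cutoff. A small sketch follows, assuming a 0.5 threshold, which is a common default and not a stated competition requirement. `proba_to_labels` is a hypothetical helper and the probabilities are illustrative:

```python
def proba_to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a fixed cutoff."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    labels = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # Values exactly at the threshold are assigned to the positive class
        labels.append(1 if p >= threshold else 0)
    return labels


example_proba = [0.10, 0.50, 0.93, 0.49]  # illustrative values only
print(proba_to_labels(example_proba))                 # [0, 1, 1, 0]
print(proba_to_labels(example_proba, threshold=0.9))  # [0, 0, 1, 0]
```

Raising the threshold trades predicted survivals for predicted failures, so choose it on validation data and never on the test set.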


In [None]:
# TODO: Load the sample_submission.csv file to see the expected format.
sample_submission = pd.read_csv('../data/raw/sample_submission.csv')

print("Sample Submission Format:")
print(sample_submission.head())
print(f"\nExpected shape: {sample_submission.shape}")
print(f"Columns: {list(sample_submission.columns)}")


In [None]:
# TODO: Create the submission dataframe with predictions.

# Option 1: If the sample_submission has an ID column and a target column
submission_df = sample_submission.copy()
submission_df['implant_survival_10y'] = test_predictions

# Option 2: Create from scratch using test_ids
# submission_df = pd.DataFrame({
#     'id': test_ids.values.ravel(),
#     'implant_survival_10y': test_predictions
# })

print("Submission DataFrame:")
print(submission_df.head())
print(f"\nShape: {submission_df.shape}")
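Option 1 assigns predictions by position, so it is only correct when the rows of `X_test` are in the same order as the rows of `sample_submission.csv`. Below is a hedged sketch of a pre-check, written with plain lists so that it runs on its own. `check_id_alignment` is a hypothetical helper. In the notebook you would pass `test_ids.values.ravel()` and the ID column of `sample_submission`, whose name is not given here:

```python
def check_id_alignment(test_ids, submission_ids):
    """Raise ValueError unless both ID sequences are identical, in order, and unique."""
    test_ids = list(test_ids)
    submission_ids = list(submission_ids)
    if len(test_ids) != len(submission_ids):
        raise ValueError(
            f"length mismatch: {len(test_ids)} test rows vs "
            f"{len(submission_ids)} submission rows"
        )
    if len(set(submission_ids)) != len(submission_ids):
        raise ValueError("duplicate IDs in the submission template")
    for position, (a, b) in enumerate(zip(test_ids, submission_ids)):
        if a != b:
            raise ValueError(f"ID mismatch at row {position}: {a!r} != {b!r}")
    return True


# Made-up IDs for illustration
print(check_id_alignment([101, 102, 103], [101, 102, 103]))
try:
    check_id_alignment([101, 103, 102], [101, 102, 103])
except ValueError as err:
    print(f"Alignment problem: {err}")
```

If the check fails, building the frame from `test_ids` (Option 2) or merging on the ID column is safer than positional assignment.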


In [None]:
# TODO: Save the final submission file.

submission_df.to_csv('../submission.csv', index=False)

print("✅ Submission file created successfully!")
print("   Location: ../submission.csv")
print(f"   Shape: {submission_df.shape}")


---

### 4. Validate Submission


In [None]:
# Validate the submission file

submission_check = pd.read_csv('../submission.csv')

print("=" * 50)
print("SUBMISSION VALIDATION")
print("=" * 50)

# Check shape matches sample submission
shape_match = submission_check.shape == sample_submission.shape
print(f"✓ Shape matches sample: {shape_match} ({submission_check.shape})")

# Check columns match
columns_match = list(submission_check.columns) == list(sample_submission.columns)
print(f"✓ Columns match: {columns_match} ({list(submission_check.columns)})")

# Check for missing values
no_missing = submission_check.isnull().sum().sum() == 0
print(f"✓ No missing values: {no_missing}")

# Check prediction values are valid (0 or 1 for classification)
valid_values = submission_check['implant_survival_10y'].isin([0, 1]).all()
print(f"✓ Valid prediction values (0 or 1): {valid_values}")

print("=" * 50)

if all([shape_match, columns_match, no_missing, valid_values]):
    print("\n🎉 SUBMISSION FILE IS VALID!")
else:
    print("\n⚠️ SUBMISSION FILE HAS ISSUES - Please check!")


---

### 5. Upload Instructions


### How to Submit to Kaggle

**Option 1: Via Kaggle UI**
1. Go to the [Kaggle Competition Page](https://www.kaggle.com/competitions/dental-implant-10-year-survival-prediction)
2. Click on "Submit Predictions"
3. Upload your `submission.csv` file
4. Add a description of your submission (e.g., "XGBoost with default parameters")

**Option 2: Via Kaggle CLI**
```bash
kaggle competitions submit -c dental-implant-10-year-survival-prediction -f submission.csv -m "Best model submission - XGBoost"
```

**Note:** Make sure you have the Kaggle API key configured if using the CLI.
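The Kaggle CLI typically reads credentials from one of two places. The first is a `kaggle.json` file, by default under `~/.kaggle/` or under the directory named by `KAGGLE_CONFIG_DIR`. The second is the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. A rough pre-flight sketch under those assumptions follows. `kaggle_credentials_configured` is a hypothetical helper, not part of the Kaggle package, and it only checks that credentials exist, not that they are valid:

```python
import os


def kaggle_credentials_configured(env=None, home=None):
    """Return True if Kaggle credentials appear to be available."""
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home
    # Environment variables are checked first
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Otherwise look for kaggle.json in the configured or default directory
    config_dir = env.get("KAGGLE_CONFIG_DIR") or os.path.join(home, ".kaggle")
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))


if kaggle_credentials_configured():
    print("Kaggle credentials found.")
else:
    print("No Kaggle credentials found; create an API token in your Kaggle account settings.")
```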


---

### ✅ Submission Generation Complete!

**Summary:**
- Compared all trained models
- Selected the best performing model based on ROC-AUC
- Trained on full dataset for maximum performance
- Generated and validated submission file

**Files created:**
- `submission.csv` - Ready for Kaggle upload

**Next Steps (Optional):**
- [ ] Hyperparameter tuning with GridSearchCV/Optuna
- [ ] Feature engineering iterations
- [ ] Ensemble multiple models
- [ ] Cross-validation for more robust evaluation

🦷 Good luck with your submission!
