# 🦷 Dental Implant 10-Year Survival Prediction

## Notebook 03: Baseline Models

**Objective:** Train and evaluate baseline machine learning models (Logistic Regression and Random Forest) to establish a performance benchmark.

---


### 🎨 Setup: Import Libraries & Configure Plotting


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, classification_report, roc_curve
import warnings
warnings.filterwarnings('ignore')

# Periospot Brand Colors
COLORS = {
    'periospot_blue': '#15365a',
    'mystic_blue': '#003049',
    'periospot_red': '#6c1410',
    'crimson_blaze': '#a92a2a',
    'vanilla_cream': '#f7f0da',
    'black': '#000000',
    'white': '#ffffff',
    'classic_periospot_blue': '#0031af',
    'periospot_light_blue': '#0297ed',
    'periospot_dark_blue': '#02011e',
    'periospot_yellow': '#ffc430',
    'periospot_bright_blue': '#1040dd'
}

periospot_palette = [COLORS['periospot_blue'], COLORS['crimson_blaze'], 
                     COLORS['periospot_light_blue'], COLORS['periospot_yellow']]

# Configure matplotlib
plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.facecolor'] = COLORS['white']
plt.rcParams['axes.facecolor'] = COLORS['vanilla_cream']
plt.rcParams['axes.edgecolor'] = COLORS['periospot_blue']

sns.set_palette(periospot_palette)

print("✅ Libraries imported and plotting style configured!")


---

### 1. Load Processed Data & Setup


In [None]:
# TODO: Load the processed data (X_train.csv and y_train.csv) from the ../data/processed/ folder.
X = pd.read_csv('../data/processed/X_train.csv')
y = pd.read_csv('../data/processed/y_train.csv').values.ravel()  # Convert to 1D array

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"\nTarget distribution:")
print(pd.Series(y).value_counts())


In [None]:
# TODO: Split the data into training and validation sets (80/20 split).
# Hint: Use train_test_split with test_size=0.2 and random_state=42.

X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Maintain class distribution
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
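
To see what `stratify=y` does in the split above, here is a minimal sketch on synthetic labels. The names `X_demo` and `y_demo` and the 90/10 class ratio are assumptions made up for illustration; they are not taken from the implant data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 900 samples of class 1, 100 of class 0 (illustrative only).
y_demo = np.array([1] * 900 + [0] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both subsets keep the same minority-class share as the full data.
print(f"train minority share: {np.mean(y_tr == 0):.3f}")
print(f"val   minority share: {np.mean(y_va == 0):.3f}")
```

Without `stratify`, the minority share of a small validation set can drift away from the full data's share by chance, which makes metrics harder to compare across splits.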


---

### 2. Train & Evaluate Logistic Regression

Logistic Regression is a simple, interpretable baseline model for binary classification.
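
The difference between `.predict()` and `.predict_proba()` matters for the metrics used in this section, so a small sketch may help. It runs on synthetic data from `make_classification`; `X_demo`, `y_demo`, and `demo_model` are hypothetical names, and the data has nothing to do with the implant features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for real features (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_model = LogisticRegression(random_state=42, max_iter=1000)
demo_model.fit(X_demo, y_demo)

# predict_proba returns one column per class, ordered as in demo_model.classes_.
proba = demo_model.predict_proba(X_demo)
labels = demo_model.predict(X_demo)

print(demo_model.classes_)   # column order of predict_proba
print(proba[:3].round(3))    # each row sums to 1

# For a binary model, predict() corresponds to thresholding the positive-class column at 0.5.
print(np.array_equal(labels, (proba[:, 1] > 0.5).astype(int)))
```

ROC-AUC needs the continuous scores in `proba[:, 1]`, while accuracy and the confusion matrix use the hard labels from `.predict()`.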


In [None]:
# TODO: Initialize the Logistic Regression model.
# Hint: Use LogisticRegression(random_state=42, max_iter=1000)
lr_model = LogisticRegression(random_state=42, max_iter=1000)

# TODO: Fit the model on the training data.
# Hint: Use the .fit() method.
lr_model.fit(X_train, y_train)

print("✅ Logistic Regression model trained!")


In [None]:
# TODO: Make predictions on the validation set.
# Hint: Use .predict() for class labels and .predict_proba() for probabilities.

y_pred_lr = lr_model.predict(X_val)  # Class predictions
y_pred_lr_proba = lr_model.predict_proba(X_val)[:, 1]  # Probability of the positive class (column 1)

# TODO: Calculate the ROC-AUC score.
# Hint: Use roc_auc_score(y_val, y_pred_lr_proba)
roc_auc_lr = roc_auc_score(y_val, y_pred_lr_proba)

# Calculate accuracy
accuracy_lr = accuracy_score(y_val, y_pred_lr)

print(f"Logistic Regression Results:")
print(f"  - ROC-AUC: {roc_auc_lr:.4f}")
print(f"  - Accuracy: {accuracy_lr:.4f}")


In [None]:
# TODO: Display the confusion matrix and classification report.

print("Classification Report:")
print(classification_report(y_val, y_pred_lr))

# Plot confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cm = confusion_matrix(y_val, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
ax.set_title('Logistic Regression - Confusion Matrix', fontweight='bold')
plt.tight_layout()
plt.savefig('../figures/lr_confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()


In [None]:
# TODO: Save the results to a JSON file in the /results/ folder.
# The JSON should contain the model name and its metrics.

results_lr = {
    "model": "LogisticRegression",
    "roc_auc": float(roc_auc_lr),
    "accuracy": float(accuracy_lr)
}

with open('../results/logistic_regression_results.json', 'w') as f:
    json.dump(results_lr, f, indent=2)

print("✅ Results saved to results/logistic_regression_results.json")


---

### 3. Train & Evaluate Random Forest

Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the data; averaging their predictions typically generalizes better than a single decision tree.
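
As a rough illustration of the ensemble idea, the sketch below fits a single decision tree and a random forest on the same synthetic data and prints the validation ROC-AUC of each. The data set, the names `demo_models` and `demo_auc`, and the `n_informative=5` setting are assumptions for the demo; the scores say nothing about the implant data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and score it on the held-out split using positive-class probabilities.
demo_auc = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    demo_auc[name] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    print(f"{name}: ROC-AUC = {demo_auc[name]:.3f}")
```

On data like this the forest usually scores higher, but the gap depends on the data set, so it is worth checking rather than assuming.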


In [None]:
# TODO: Initialize the Random Forest Classifier.
# Hint: Use RandomForestClassifier(n_estimators=100, random_state=42).
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# TODO: Fit the model on the training data.
rf_model.fit(X_train, y_train)

print("✅ Random Forest model trained!")


In [None]:
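# TODO: Make predictions and evaluate the Random Forest model.

y_pred_rf = rf_model.predict(X_val)  # Class predictions
y_pred_rf_proba = rf_model.predict_proba(X_val)[:, 1]  # Probability of the positive class (column 1)

# Calculate metrics
roc_auc_rf = roc_auc_score(y_val, y_pred_rf_proba)
accuracy_rf = accuracy_score(y_val, y_pred_rf)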

print(f"Random Forest Results:")
print(f"  - ROC-AUC: {roc_auc_rf:.4f}")
print(f"  - Accuracy: {accuracy_rf:.4f}")


In [None]:
# TODO: Display classification report and confusion matrix for Random Forest.

print("Classification Report:")
print(classification_report(y_val, y_pred_rf))

# Plot confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
cm_rf = confusion_matrix(y_val, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues', ax=ax,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
ax.set_title('Random Forest - Confusion Matrix', fontweight='bold')
plt.tight_layout()
plt.savefig('../figures/rf_confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()


In [None]:
# TODO: Visualize feature importance from Random Forest.

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 15 features
fig, ax = plt.subplots(figsize=(10, 8))
top_features = feature_importance.head(15)
sns.barplot(data=top_features, x='importance', y='feature', 
            palette=periospot_palette, ax=ax)
ax.set_title('Random Forest - Top 15 Feature Importances', fontweight='bold')
ax.set_xlabel('Importance')
ax.set_ylabel('Feature')
plt.tight_layout()
plt.savefig('../figures/rf_feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()


In [None]:
# TODO: Save the Random Forest results to a new JSON file.

results_rf = {
    "model": "RandomForest",
    "roc_auc": float(roc_auc_rf),
    "accuracy": float(accuracy_rf),
    "n_estimators": 100
}

with open('../results/random_forest_results.json', 'w') as f:
    json.dump(results_rf, f, indent=2)

print("✅ Results saved to results/random_forest_results.json")


---

### 4. Compare Models
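
Both models are compared on ROC-AUC, which can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly from pairs; `pairwise_auc` is a hypothetical helper written for illustration (it is not part of scikit-learn), and the four labels and scores are made up.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: positives score 0.35 and 0.8, negatives score 0.1 and 0.4.
y_demo = [0, 0, 1, 1]
score_demo = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_demo, score_demo))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

This pairwise loop is quadratic in the number of samples, so it is only meant to build intuition; `roc_auc_score` should give the same value on the same inputs and is the function to use in practice.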


In [None]:
# TODO: Plot ROC curves for both models to compare them.

fig, ax = plt.subplots(figsize=(10, 8))

# Logistic Regression ROC curve
fpr_lr, tpr_lr, _ = roc_curve(y_val, y_pred_lr_proba)
ax.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {roc_auc_lr:.4f})', 
        color=COLORS['periospot_blue'], linewidth=2)

# Random Forest ROC curve
fpr_rf, tpr_rf, _ = roc_curve(y_val, y_pred_rf_proba)
ax.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {roc_auc_rf:.4f})', 
        color=COLORS['crimson_blaze'], linewidth=2)

# Diagonal line (random classifier)
ax.plot([0, 1], [0, 1], 'k--', label='Random Classifier')

ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve Comparison - Baseline Models', fontweight='bold')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../figures/baseline_roc_comparison.png', dpi=150, bbox_inches='tight')
plt.show()


In [None]:
# Summary comparison table
comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest'],
    'ROC-AUC': [roc_auc_lr, roc_auc_rf],
    'Accuracy': [accuracy_lr, accuracy_rf]
})

print("=" * 50)
print("BASELINE MODELS COMPARISON")
print("=" * 50)
print(comparison_df.to_string(index=False))
print("=" * 50)

# Identify best model
best_model = comparison_df.loc[comparison_df['ROC-AUC'].idxmax(), 'Model']
print(f"\n🏆 Best baseline model: {best_model}")


---

### ✅ Baseline Models Complete!

**Results Summary:**
- Logistic Regression: Simple, interpretable baseline
- Random Forest: Ensemble method with feature importance

**Next Steps:** 
- Try advanced gradient boosting models in notebooks 04-06
- Compare all models to select the best one for submission
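
One possible way to compare models across notebooks is to read back the JSON files written to the results folder and rank them by ROC-AUC. The sketch below assumes every notebook saves a file matching `*_results.json` with a `roc_auc` key, as this notebook does; `load_results` is a hypothetical helper, and the demo uses made-up metric values in a temporary folder rather than real results.

```python
import json
import tempfile
from pathlib import Path


def load_results(results_dir):
    """Read every *_results.json file in results_dir and sort by ROC-AUC, best first."""
    paths = sorted(Path(results_dir).glob("*_results.json"))
    records = [json.loads(p.read_text()) for p in paths]
    return sorted(records, key=lambda r: r["roc_auc"], reverse=True)


# Demo with made-up metric values written to a temporary folder (not real results).
with tempfile.TemporaryDirectory() as tmp:
    demo_records = [
        {"model": "ModelA", "roc_auc": 0.70, "accuracy": 0.80},
        {"model": "ModelB", "roc_auc": 0.75, "accuracy": 0.78},
    ]
    for rec in demo_records:
        out_path = Path(tmp) / f"{rec['model'].lower()}_results.json"
        out_path.write_text(json.dumps(rec))
    ranked = load_results(tmp)

print([r["model"] for r in ranked])
```

In the project itself, the call would point at the results folder instead of a temporary directory, for example `load_results('../results')`.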


