# Chapter 01 - The Power of Prediction in Dentistry

> **Book:** Machine Learning For Dentists: From Torque To Tensors
>
> **Author:** Francisco Teixeira Barbosa

---

## 🎯 What You'll Learn

In this notebook, you will:

1. Load a simple dataset and see what it looks like
2. Understand the difference between **features** (inputs) and **target** (what we predict)
3. Split data into **training** and **testing** sets
4. Create the simplest possible model: the **baseline**
5. Calculate **accuracy** and understand what it means

**Time to complete:** ~15 minutes

---


## 📦 Setup: Import Libraries

First, we import the tools we'll use. Don't worry if you don't know what each one does yet; we'll explain as we go.


In [None]:
# Standard libraries for data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Periospot brand colors for consistent styling
COLORS = {
    'periospot_blue': '#15365a',
    'mystic_blue': '#003049',
    'periospot_red': '#6c1410',
    'crimson_blaze': '#a92a2a',
    'vanilla_cream': '#f7f0da',
    'black': '#000000',
    'white': '#ffffff'
}

# Set up matplotlib defaults
plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.facecolor'] = 'white'

print("✅ Libraries loaded successfully!")


---

## 📊 Step 1: Load the Data

For this introductory chapter, we'll use a **toy dataset** (D2). This is intentionally simple so we can focus on the concepts, not the clinical complexity.

### Creating a Synthetic Toy Dataset

Since we don't have our D2 dataset file yet, we'll create a simple synthetic one. Imagine this represents:

- 100 patients
- Each patient has a few measurements
- We want to predict if they are **high risk** (1) or **low risk** (0)

**In a real scenario, you would load this from a CSV file:**
```python
df = pd.read_csv('../../data/D2_toy_tabular/toy_dental_data.csv')
```


In [None]:
# Create a simple synthetic dataset for demonstration
# In real use, you would load from: ../../data/D2_toy_tabular/

np.random.seed(42)  # For reproducibility

n_patients = 100

# Create features
df = pd.DataFrame({
    'patient_id': [f'P{i:03d}' for i in range(1, n_patients + 1)],
    'age': np.random.normal(55, 12, n_patients).astype(int),
    'smoker': np.random.choice([0, 1], n_patients, p=[0.7, 0.3]),
    'plaque_score': np.random.uniform(0, 3, n_patients).round(1),
    'pocket_depth_avg': np.random.normal(3.5, 1.2, n_patients).round(1)
})

# Create target: high_risk depends on other features (with some noise)
risk_score = (
    0.02 * df['age'] + 
    0.8 * df['smoker'] + 
    0.3 * df['plaque_score'] + 
    0.2 * df['pocket_depth_avg'] +
    np.random.normal(0, 0.5, n_patients)
)
df['high_risk'] = (risk_score > np.percentile(risk_score, 70)).astype(int)

print(f"Dataset created with {len(df)} patients")
print("\nFirst 5 rows:")
df.head()


### Understanding the Data

Let's look at what we have:

| Column | Description | Type |
|--------|-------------|------|
| `patient_id` | Unique identifier | ID (not used for prediction) |
| `age` | Patient age in years | Feature |
| `smoker` | 1 = smoker, 0 = non-smoker | Feature |
| `plaque_score` | Plaque index (0-3) | Feature |
| `pocket_depth_avg` | Average probing depth (mm) | Feature |
| `high_risk` | 1 = high risk, 0 = low risk | **TARGET** |


In [None]:
# Basic information about the dataset
print("Dataset shape:", df.shape)
print(f"\nWe have {df.shape[0]} patients and {df.shape[1]} columns")
print("\n--- Data Types ---")
print(df.dtypes)


In [None]:
# Summary statistics
print("--- Summary Statistics ---")
df.describe()


---

## 🎯 Step 2: Features vs. Target

In machine learning, we distinguish between:

- **Features (X):** The information we USE to make predictions
- **Target (y):** What we WANT to predict

### Clinical Analogy

Think of it like this:
- **Features:** Patient history, exam findings, measurements
- **Target:** The diagnosis or prognosis

We use the features to predict the target.


In [None]:
# Define our features and target

# Features: everything except patient_id and the target
feature_columns = ['age', 'smoker', 'plaque_score', 'pocket_depth_avg']
X = df[feature_columns]

# Target: what we want to predict
y = df['high_risk']

print("Features (X):")
print(X.head())
print(f"\nShape: {X.shape} ({X.shape[0]} patients × {X.shape[1]} features)")


In [None]:
print("Target (y):")
print(y.head())
print(f"\nShape: {y.shape} (100 values)")


### Class Balance: How Many High Risk vs. Low Risk?

Before doing any modeling, we should always check the **class balance**: how many examples of each class do we have?


In [None]:
# Check class distribution
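# sort_index() keeps the order fixed (0 = Low Risk, 1 = High Risk) even if class 1 becomes the majority
class_counts = y.value_counts().sort_index()
class_percentages = y.value_counts(normalize=True).sort_index() * 100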

print("Class Distribution:")
print(f"  Low Risk (0):  {class_counts[0]} patients ({class_percentages[0]:.1f}%)")
print(f"  High Risk (1): {class_counts[1]} patients ({class_percentages[1]:.1f}%)")


In [None]:
# Visualize the class distribution
fig, ax = plt.subplots(figsize=(8, 5))

bars = ax.bar(
    ['Low Risk (0)', 'High Risk (1)'], 
    class_counts.values,
    color=[COLORS['periospot_blue'], COLORS['crimson_blaze']],
    edgecolor='black',
    linewidth=1.5
)

# Add count labels on bars
for bar, count, pct in zip(bars, class_counts.values, class_percentages.values):
    ax.text(
        bar.get_x() + bar.get_width()/2, 
        bar.get_height() + 1,
        f'{count} ({pct:.1f}%)',
        ha='center', 
        fontsize=12,
        fontweight='bold'
    )

ax.set_ylabel('Number of Patients', fontsize=12)
ax.set_title('Class Distribution: How Many High Risk vs. Low Risk?', fontsize=14, fontweight='bold')
ax.set_ylim(0, max(class_counts.values) * 1.15)

plt.tight_layout()
plt.show()


### 💡 Key Insight: The Baseline is Already Here!

If ~70% of patients are **Low Risk**, then the simplest prediction strategy is:

> **"Predict Low Risk for everyone"**

This gives us ~70% accuracy without looking at ANY features!

This is called the **majority class baseline**, and it tells us: **any useful model must beat this.**
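
The arithmetic behind that ~70% figure can be checked by hand. The sketch below uses a made-up label array (`labels`, with 70 zeros and 30 ones, an assumption that mirrors the approximate balance of the toy dataset) and shows that always predicting the majority class scores exactly the majority class's share of the data.

```python
import numpy as np

# Hypothetical labels: 70 low-risk (0) and 30 high-risk (1) patients
labels = np.array([0] * 70 + [1] * 30)

# Find the most common class
values, counts = np.unique(labels, return_counts=True)
majority_class = values[np.argmax(counts)]

# Predict the majority class for every patient, ignoring all features
predictions = np.full_like(labels, majority_class)

# Accuracy = share of predictions that match the true labels
majority_accuracy = np.mean(predictions == labels)

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {majority_accuracy:.0%}")
```

The more imbalanced the classes, the higher this number climbs, even though the "model" has learned nothing about the patients.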

---

## ✂️ Step 3: Train/Test Split

Before training a model, we must split our data into:

- **Training set:** Data the model learns from
- **Test set:** Data we use to evaluate how well the model generalizes

### Why Split?

Imagine studying for an exam using the exact questions that will be on the test. You'd score perfectly, but would you actually understand the material?

The same applies to ML:
- Training on ALL data and testing on the SAME data → overoptimistic results
- We need **unseen data** to know if the model truly learned

### Clinical Analogy

- **Training set:** Patients you've seen before
- **Test set:** New patients walking into your clinic

Your skills should work on new patients, not just the ones you memorized.
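
The exam analogy can be made concrete with a small sketch. It is not part of this chapter's workflow: it assumes purely random labels (so there is nothing real to learn) and uses scikit-learn's `DecisionTreeClassifier` as a stand-in for a model that is able to memorize its training data. All variable names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 200 "patients", 4 random measurements,
# and labels assigned by coin flip (no real pattern exists)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unrestricted decision tree can memorize every training example
memorizer = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = memorizer.score(X_tr, y_tr)  # scored on data it has already seen
test_acc = memorizer.score(X_te, y_te)   # scored on unseen data

print(f"Accuracy on training data: {train_acc:.0%}")
print(f"Accuracy on unseen data:   {test_acc:.0%}")
```

The training score is perfect because the tree has memorized its examples. The score on unseen data should land near chance, which is the honest estimate for labels that carry no pattern.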


In [None]:
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% goes to test set
    random_state=42,    # For reproducibility
    stratify=y          # Keep class proportions similar in both sets
)

print("Data split complete!")
print(f"\nTraining set: {len(X_train)} patients ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test set:     {len(X_test)} patients ({len(X_test)/len(X)*100:.0f}%)")


In [None]:
# Verify that class proportions are preserved (stratification)
print("Class distribution in training set:")
print(y_train.value_counts(normalize=True).round(3))

print("\nClass distribution in test set:")
print(y_test.value_counts(normalize=True).round(3))
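
To see what `stratify` does, the sketch below splits a hypothetical 70/30 label array twice, once with stratification and once without, and counts the high-risk patients that land in each test set. The label array and the `_demo` / `_strat` / `_plain` names are illustrative assumptions, not the chapter's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 low-risk (0), 30 high-risk (1)
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

# Stratified split: the test set keeps the 70/30 proportion
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Plain random split: the proportion can drift by chance
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(f"High-risk in stratified test set: {y_te_strat.sum()} of {len(y_te_strat)}")
print(f"High-risk in plain test set:      {y_te_plain.sum()} of {len(y_te_plain)}")
```

With stratification, 6 of the 20 test patients are high risk (30%), matching the full dataset. Without it, the count depends on the random draw, which matters most when the test set is small.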


---

## 🎲 Step 4: The Baseline Model

Before using any fancy algorithm, we create a **baseline**: the simplest possible prediction.

### Why Baseline Matters

The baseline tells us:
1. **How hard is this problem?** (High baseline = easy problem or imbalanced data)
2. **Is our model doing anything useful?** (Must beat baseline to be worthwhile)

### Our Baseline Strategy: Majority Class

Strategy: **Always predict the most common class in the training data.**

If 70% of training patients are Low Risk, we predict Low Risk for everyone.


In [None]:
# Create a baseline model using scikit-learn's DummyClassifier
# "most_frequent" strategy = always predict the majority class

baseline_model = DummyClassifier(strategy='most_frequent', random_state=42)

# "Train" the baseline (it just learns the most common class)
baseline_model.fit(X_train, y_train)

print("Baseline model 'trained'!")
print(f"\nThis model will always predict: {baseline_model.classes_[np.argmax(baseline_model.class_prior_)]}")
print(f"(Because that's the most common class in training data)")


In [None]:
# Make predictions on the test set
y_pred_baseline = baseline_model.predict(X_test)

print("Baseline predictions on test set:")
print(y_pred_baseline)
print(f"\n(Notice: every prediction is the same!)")


---

## 📏 Step 5: Evaluating the Baseline

Now let's see how well our baseline performs.

### Accuracy

$$
\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
$$

**In clinical terms:** Out of 20 test patients, how many did we classify correctly?
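
As a worked example of the formula, the sketch below scores a hypothetical set of 10 predictions by hand. The `y_true` and `y_pred` arrays are invented for illustration and are unrelated to the chapter's dataset.

```python
import numpy as np

# Hypothetical labels for 10 patients (1 = high risk, 0 = low risk)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])

correct = int(np.sum(y_true == y_pred))  # numerator: correct predictions
total = len(y_true)                      # denominator: total predictions
accuracy = correct / total

print(f"Accuracy = {correct} / {total} = {accuracy:.0%}")
# Prints: Accuracy = 8 / 10 = 80%
```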


In [None]:
# Calculate accuracy
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)

print("=" * 50)
print("BASELINE MODEL RESULTS")
print("=" * 50)
print(f"\nAccuracy: {baseline_accuracy:.1%}")
print(f"\nThis means: out of {len(y_test)} test patients,")
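# Count matches directly; int(accuracy * n) can truncate downward because of floating-point rounding
print(f"we correctly classified {int((y_test == y_pred_baseline).sum())} of them.")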


### The Confusion Matrix

Accuracy alone can be misleading. Let's look at what the model got right and wrong using a **confusion matrix**.

```
                    Predicted
                 Low Risk | High Risk
              +-----------+-----------+
    Low Risk  |    TN     |    FP     |   Actual Low Risk
Actual        +-----------+-----------+
    High Risk |    FN     |    TP     |   Actual High Risk
              +-----------+-----------+
```

- **TN (True Negative):** Correctly predicted Low Risk
- **TP (True Positive):** Correctly predicted High Risk  
- **FP (False Positive):** Predicted High Risk, but was actually Low Risk
- **FN (False Negative):** Predicted Low Risk, but was actually High Risk
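
The four cells can be counted by hand with boolean masks. The sketch below uses invented `y_true` and `y_pred` arrays for 10 hypothetical patients and arranges the counts in the same layout as the diagram above.

```python
import numpy as np

# Hypothetical labels for 10 patients (1 = high risk, 0 = low risk)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])

tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # actual low,  predicted low
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # actual low,  predicted high
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # actual high, predicted low
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # actual high, predicted high

# Rows = actual class, columns = predicted class
manual_cm = np.array([[tn, fp],
                      [fn, tp]])

print(manual_cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# Prints: TN=6, FP=1, FN=1, TP=2
```

The four counts always add up to the number of patients, and the diagonal (TN + TP) is the number of correct predictions.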


In [None]:
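# Create confusion matrix (rows = actual class, columns = predicted class)
# labels=[0, 1] guarantees a 2x2 matrix even if a class is missing from the test set
cm = confusion_matrix(y_test, y_pred_baseline, labels=[0, 1])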

# Visualize it
fig, ax = plt.subplots(figsize=(8, 6))

sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=['Predicted\nLow Risk', 'Predicted\nHigh Risk'],
    yticklabels=['Actual\nLow Risk', 'Actual\nHigh Risk'],
    annot_kws={'size': 16, 'fontweight': 'bold'},
    cbar_kws={'label': 'Count'},
    ax=ax
)

ax.set_title('Confusion Matrix: Baseline Model', fontsize=14, fontweight='bold', pad=20)
ax.set_xlabel('')
ax.set_ylabel('')

plt.tight_layout()
plt.show()


### 💡 The Problem with the Baseline

Look at the confusion matrix:

- The model predicts **Low Risk for everyone**
- It gets all the actual Low Risk patients correct (top-left)
- But it **misses ALL the High Risk patients** (bottom-left)

**Clinical translation:** 

A model that misses ALL high-risk patients is useless, even if accuracy looks decent!


In [None]:
# Let's be explicit about what went wrong
tn, fp, fn, tp = cm.ravel()

print("Breakdown of predictions:")
print(f"\n✅ True Negatives (Low Risk, correctly identified):  {tn}")
print(f"✅ True Positives (High Risk, correctly identified): {tp}")
print(f"❌ False Positives (Said High Risk, was Low Risk):   {fp}")
print(f"❌ False Negatives (Said Low Risk, was High Risk):   {fn}")

print(f"\n⚠️  The baseline MISSED {fn} out of {fn+tp} High Risk patients!")
print(f"    That's a {fn/(fn+tp)*100:.0f}% miss rate on the patients we care most about.")


---

## 🤔 Step 6: Reflection (What Does This Mean?)

### What We Learned

1. **Accuracy can be misleading:** 70% sounds good, but we missed all high-risk patients

2. **The baseline is a benchmark:** Any useful model MUST beat this

3. **Class imbalance matters:** When one class dominates, simple strategies look good

4. **Context determines what "good" means:** In clinical settings, missing high-risk patients (false negatives) is often worse than false alarms (false positives)
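
Points 1 and 4 can be illustrated with two hypothetical models scored on the same 20 patients. Both reach the same accuracy, yet they fail in different ways. The label arrays and the `summarize` helper are invented for this sketch.

```python
import numpy as np

# Hypothetical test set: 14 low-risk (0), 6 high-risk (1)
y_true = np.array([0] * 14 + [1] * 6)

# Model A: predicts low risk for everyone (the majority baseline)
pred_a = np.zeros_like(y_true)

# Model B: catches every high-risk patient but raises 6 false alarms
pred_b = np.array([1] * 6 + [0] * 8 + [1] * 6)

def summarize(name, y_pred):
    """Print and return accuracy, missed high-risk count, and false alarm count."""
    accuracy = np.mean(y_true == y_pred)
    missed = int(np.sum((y_true == 1) & (y_pred == 0)))        # false negatives
    false_alarms = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    print(f"{name}: accuracy={accuracy:.0%}, "
          f"missed high-risk={missed}, false alarms={false_alarms}")
    return accuracy, missed, false_alarms

result_a = summarize("Model A", pred_a)
result_b = summarize("Model B", pred_b)
```

Both models score 70%, but Model A misses all 6 high-risk patients while Model B misses none and instead flags 6 low-risk patients. Which of the two is preferable depends on the clinical cost of each kind of error, and accuracy alone cannot tell them apart.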

### Clinical Perspective

Imagine if this were a real screening tool:

- Saying "Everyone is Low Risk" gets you 70% accuracy
- But you'd miss every single patient who actually needs intervention
- This is why we need to look beyond accuracy

### Coming Up Next

In the next chapters, we'll:
- Learn algorithms that actually USE the features
- Look at metrics beyond accuracy (precision, recall, AUC)
- See how to tune models for clinical goals

---

## 🧪 Experiments for You

Try these modifications to build intuition:

### Experiment 1: Change the Class Balance

Go back to the data creation cell and change the percentile threshold from 70 to 50:
```python
df['high_risk'] = (risk_score > np.percentile(risk_score, 50)).astype(int)
```
What happens to the baseline accuracy?

### Experiment 2: Change the Test Size

Change `test_size=0.2` to `test_size=0.5`. 
- Does the accuracy change much?
- Is the accuracy more or less stable?

### Experiment 3: Random Baseline

Change the baseline strategy from `'most_frequent'` to `'uniform'` (random guessing):
```python
baseline_model = DummyClassifier(strategy='uniform', random_state=42)
```
What accuracy do you get now?

---

## 📝 Summary

| Concept | What It Means |
|---------|---------------|
| **Features (X)** | The patient data we use to make predictions |
| **Target (y)** | What we want to predict |
| **Train/Test Split** | Separate data for learning vs. evaluation |
| **Baseline** | Simplest possible prediction (benchmark) |
| **Accuracy** | Percentage of correct predictions |
| **Confusion Matrix** | Breakdown of right/wrong predictions |

### Key Takeaways

1. Always establish a baseline before using complex models
2. Accuracy alone doesn't tell the whole story
3. The baseline we need to beat is: **predict the most common class**
4. Clinical context determines which errors matter more

---

**Next Chapter:** [Data for Clinical Questions](../02_data_for_clinical_questions/)

---

*Machine Learning For Dentists: From Torque To Tensors*  
*© 2024 Francisco Teixeira Barbosa*
