### **Chunk 4: Model Evaluation Fundamentals**

#### **1. Concept Introduction**

The choice of metric depends entirely on the problem you are trying to solve and the cost of different types of errors.

**For Classification:**

Imagine you are building a model to detect a rare disease.
-   **False Positive (FP):** The model predicts a healthy person has the disease. This causes stress but is resolved with further testing.
-   **False Negative (FN):** The model predicts a sick person is healthy. This is a catastrophic failure.

In this case, minimizing False Negatives is far more important than minimizing False Positives. Accuracy treats all errors equally, which is why it is often misleading when error costs differ or one class is rare.
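To make the point concrete, here is a minimal sketch on made-up screening data (1000 patients, 10 of them sick). The arrays and the "always healthy" predictor are assumptions for illustration only, not part of the Digits workflow below.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 1000 patients, 10 of whom are sick (label 1).
y_true_rare = np.zeros(1000, dtype=int)
y_true_rare[:10] = 1

# A trivial "model" that predicts healthy (label 0) for everyone.
y_pred_rare = np.zeros(1000, dtype=int)

acc_rare = accuracy_score(y_true_rare, y_pred_rare)
rec_rare = recall_score(y_true_rare, y_pred_rare)

print(f"Accuracy: {acc_rare:.3f}")  # 990 of 1000 labels match
print(f"Recall:   {rec_rare:.3f}")  # none of the 10 sick patients are found
```

The accuracy looks excellent, yet every sick patient is a False Negative, which is exactly the error this scenario cannot afford.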

Here are the key metrics built from the **Confusion Matrix**:
-   **Precision**: "Of all the times the model predicted 'Positive', how often was it correct?" Use this when the cost of a **False Positive** is high (e.g., sending a promotional offer to a customer who won't respond).
    -   `Precision = TP / (TP + FP)`
-   **Recall (Sensitivity)**: "Of all the actual 'Positive' cases, how many did the model find?" Use this when the cost of a **False Negative** is high (e.g., fraud detection, medical diagnosis).
    -   `Recall = TP / (TP + FN)`
-   **F1-Score**: The harmonic mean of Precision and Recall. It provides a single score that balances both concerns, and it is low whenever either one is low. It's a reasonable default metric if you care about both FPs and FNs.
    -   `F1 = 2 * (Precision * Recall) / (Precision + Recall)`
-   **Confusion Matrix**: A table that visualizes the performance, showing you exactly where your model is getting confused (e.g., consistently misclassifying the digit '8' as a '3').
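The formulas above can be checked by hand. The following sketch uses a small set of invented binary labels (an assumption for illustration) and compares the manual results with scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical binary labels (1 = positive), chosen only for illustration.
y_true_bin = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_bin = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()

precision_manual = tp / (tp + fp)  # 3 / (3 + 2)
recall_manual = tp / (tp + fn)     # 3 / (3 + 1)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision: {precision_manual:.3f} (sklearn: {precision_score(y_true_bin, y_pred_bin):.3f})")
print(f"Recall:    {recall_manual:.3f} (sklearn: {recall_score(y_true_bin, y_pred_bin):.3f})")
print(f"F1:        {f1_manual:.3f} (sklearn: {f1_score(y_true_bin, y_pred_bin):.3f})")
```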

**For Regression:**

-   **Mean Absolute Error (MAE)**: The average absolute difference between the predicted and actual values. It's easy to interpret because it's in the same units as the target.
-   **Mean Squared Error (MSE)**: The average of the squared differences. The squaring penalizes larger errors much more heavily than smaller ones. Its units are squared, making it less intuitive.
-   **Root Mean Squared Error (RMSE)**: The square root of MSE. This brings the metric back to the same units as the target, making it interpretable while still punishing large errors. It is one of the most commonly reported regression metrics.
-   **R-squared (R²)**: The coefficient of determination. It represents the proportion of the variance in the target variable that the model explains. A score of 1.0 means the model explains all the variability; 0 means it does no better than always predicting the mean of the target. On held-out data the score can be negative when the model does worse than that baseline.
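As a quick check of these definitions, the sketch below computes each metric with plain NumPy on four invented values (an assumption for illustration) and compares against scikit-learn. One prediction is deliberately off by 4 to show how squaring amplifies a large error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions; the last prediction is off by 4.
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 13.0])

errors = y_true_reg - y_pred_reg             # [1, 0, -1, -4]
mae_manual = np.mean(np.abs(errors))         # (1 + 0 + 1 + 4) / 4
mse_manual = np.mean(errors ** 2)            # (1 + 0 + 1 + 16) / 4
rmse_manual = np.sqrt(mse_manual)

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_reg - y_true_reg.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE : {mae_manual:.3f} (sklearn: {mean_absolute_error(y_true_reg, y_pred_reg):.3f})")
print(f"MSE : {mse_manual:.3f} (sklearn: {mean_squared_error(y_true_reg, y_pred_reg):.3f})")
print(f"RMSE: {rmse_manual:.3f}")
print(f"R²  : {r2_manual:.3f} (sklearn: {r2_score(y_true_reg, y_pred_reg):.3f})")
```

The single error of 4 contributes 16 of the 18 units of squared error, which is why RMSE ends up noticeably larger than MAE here.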

#### **2. Dataset EDA: The Digits Dataset**

This dataset contains images of handwritten digits (0-9). Each image is 8x8 pixels, flattened into a 64-feature vector. It's a classic multi-class classification problem where a confusion matrix is essential to see which specific digits the model struggles with.


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits

# Set plot style
sns.set_style('whitegrid')

**Load Data**

In [None]:
digits = load_digits()
X, y = digits.data, digits.target

print(f"Features shape : {X.shape}")
print(f"Target shape : {y.shape}")
print(f"Number of unique classes : {len(np.unique(y))}")

**Visualize the Data**

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(12, 5),
                       subplot_kw={'xticks':[],
                                    'yticks':[]})
for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(8, 8),
              cmap='binary',
              interpolation='nearest')
    ax.set_title(f"True Label: {y[i]}")

plt.suptitle("Sample Images from the Digits Dataset", fontsize=16);

**Target Variable Distribution**

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x=y)
plt.title('Distribution of Digits')
plt.xlabel('Digit')
plt.ylabel('Count')
plt.show()
# The dataset is very well-balanced.


#### **3. Minimal Working Example (Classification Metrics)**

Let's train a model and then dive into the evaluation.

In [None]:
# Imports, Splitting, Scaling, and Training
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # keep class proportions the same in both splits
)
print(f"Shape of X_Train : {X_train.shape}")
print(f"Shape of X_Test : {X_test.shape}")
print(f"Shape of y_Train : {y_train.shape}")
print(f"Shape of y_Test : {y_test.shape}")
print(f"MINIMUM VALUE IN TRAINING BEFORE SCALING : {X_train.min()} ")
print(f"MAXIMUM VALUE IN TRAINING BEFORE SCALING : {X_train.max()} ")
print()
print(f"MEAN IN TRAINING BEFORE SCALING : {np.mean(X_train)}")
print(f"STD IN TRAINING BEFORE SCALING : {np.std(X_train)}")
print("The features already share a common scale, but standardizing\nthem typically helps LogisticRegression's solver converge")
# Scale the data (fit the scaler on the training set only, then reuse it on the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
print()
print(f"MINIMUM VALUE AFTER SCALING : {X_train_scaled.min()} ")
print(f"MAXIMUM VALUE AFTER SCALING : {X_train_scaled.max()} ")
print()
print(f"MEAN AFTER SCALING : {np.mean(X_train_scaled)}")
print(f"STD AFTER SCALING : {np.std(X_train_scaled)}")
print()

# Training the model
model = LogisticRegression(max_iter=1000,
                           random_state=42)
model.fit(X_train_scaled, y_train)

# Make Predictions
y_pred = model.predict(X_test_scaled)
print(f"First 5 Predictions made by the model : {y_pred[:5]}")
print(f"First 5 true labels :                   {y_test[:5]}")
print("Pretty close. Let's evaluate our model")

**The Classification Report**
* `classification_report` prints per-class precision, recall, F1-score, and support in a single call, which makes it a convenient first look at a classifier.

In [None]:
report = classification_report(y_test, y_pred, target_names=digits.target_names.astype(str))
print('Classification Report')
print(report)
# 'Support' is the number of true instances for each class in the test set.

**The Confusion Matrix**

In [None]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix")
print(cm)
print("\nSo beautiful")

**Visualizing the Confusion Matrix**
* A heatmap is much easier to interpret than the raw numbers.

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(cm,
            annot=True,
            fmt='d',
            cmap="Blues",
            xticklabels=digits.target_names,
            yticklabels=digits.target_names
            )
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Digit Recognition')
plt.show()
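Beyond reading the heatmap by eye, the largest off-diagonal cells can be ranked programmatically. The sketch below defines a hypothetical helper, `top_confusions` (not a scikit-learn function), and runs it on a made-up 3x3 matrix whose counts are invented for illustration.

```python
import numpy as np

def top_confusions(cm, labels, k=3):
    """Return up to k largest off-diagonal cells as (true, predicted, count)."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Sort all cells by count, largest first, and keep the top k.
    flat_order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [
        (labels[r], labels[c], int(off_diag[r, c]))
        for r, c in zip(rows, cols)
        if off_diag[r, c] > 0
    ]

# Hypothetical confusion matrix for three digit classes (rows = true, cols = predicted).
toy_cm = np.array([[50, 0, 1],
                   [2, 45, 4],
                   [0, 3, 47]])
toy_labels = [3, 8, 9]

for true_label, pred_label, count in top_confusions(toy_cm, toy_labels):
    print(f"True {true_label} predicted as {pred_label}: {count} times")
```

With the notebook's own variables, the equivalent call would be `top_confusions(cm, digits.target_names)`.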

#### **4. Variations (Regression Metrics)**

Let's quickly revisit the Boston Housing dataset from the previous chunk to demonstrate the regression metrics.

In [None]:
# Load the data and train a regression model
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load data
boston = fetch_openml(name="boston",
                      version=1,
                      as_frame=True,
                      parser='auto')
X_boston, y_boston = boston.data, boston.target

# Split, Scale, Train
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_boston,
                                                            y_boston,
                                                            test_size=0.2,
                                                            random_state=42)
print(f"Shape of X_Train : {X_train_b.shape}")
print(f"Shape of X_Test : {X_test_b.shape}")
print(f"Shape of y_Train : {y_train_b.shape}")
print(f"Shape of y_Test : {y_test_b.shape}")

# scale
scaler_b = StandardScaler()
X_train_b_scaled = scaler_b.fit_transform(X_train_b)
X_test_b_scaled  = scaler_b.transform(X_test_b)

# model
lin_reg = LinearRegression()
lin_reg.fit(X_train_b_scaled, y_train_b)
y_pred_b = lin_reg.predict(X_test_b_scaled)

**Calculate and Print Regression Metrics**

In [None]:
mae = mean_absolute_error(y_test_b, y_pred_b)
mse = mean_squared_error(y_test_b, y_pred_b)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_b, y_pred_b)

print("Regression Metrics")
# The target (MEDV) is recorded in units of $1000, so multiply by 1000 to report dollars
print(f"Mean Absolute Error (MAE): ${mae*1000:,.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse*1000:,.2f}")
print(f"R-squared (R²): {r2:.3f}")

#### **5. Common Pitfalls**

1.  **Using Accuracy on Imbalanced Datasets:** As discussed, this is the biggest trap. If a dataset has 99% class A and 1% class B, a model that always predicts A has 99% accuracy but is completely useless. Always check your target distribution and use precision/recall/F1 for imbalanced problems.
2.  **Looking at Only One Metric:** No single metric tells the whole story. R² can be high, but your RMSE might still be too large for your business case. Precision might be high, but recall could be terrible. Always look at a suite of metrics.
3.  **Confusing the Axes of the Confusion Matrix:** Be very careful to label which axis is "True" and which is "Predicted". Scikit-learn's `confusion_matrix(y_true, y_pred)` returns a matrix whose rows are the true labels and whose columns are the predicted labels, so a heatmap of it has "True" on the y-axis and "Predicted" on the x-axis.
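The axis convention in pitfall 3 is easy to verify on a tiny example. The labels below are invented for illustration; the sketch shows that swapping the two arguments transposes the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: three true 0s (two of them predicted as 1) and one true 1.
y_true_demo = np.array([0, 0, 0, 1])
y_pred_demo = np.array([0, 1, 1, 1])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
# Row index = true label, column index = predicted label,
# so cm_demo[0, 1] counts "true 0 predicted as 1".
print(f"True 0 predicted as 1: {cm_demo[0, 1]}")

# Passing the arguments in the wrong order silently transposes the matrix.
cm_swapped = confusion_matrix(y_pred_demo, y_true_demo)
print(cm_swapped)
```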

# Congratulations, Soldier. You may now proceed to Chunk 05: Essential Tree-Based Models