Evaluating a model is a critical step in any machine learning workflow. We need to measure how well the model generalizes to new, unseen data, not just how well it performs on the data it was trained on. This requires splitting the dataset into a training set and a test set.

---

## Scikit-learn: Data Splitting & Basic Evaluation

This document covers:

* **Why Split:** The fundamental reason for splitting data into training and testing sets – to evaluate how well a model generalizes to new, unseen data.
* **Conceptual Issues:** A brief explanation of overfitting and underfitting.
* **`train_test_split`:** How to use this function from `sklearn.model_selection` to split data for both classification (using `stratify`) and regression tasks, explaining key parameters like `test_size`, `random_state`, `shuffle`, and `stratify`.
* **Basic Metrics:** Introduction to core evaluation metrics from `sklearn.metrics`:
    * `accuracy_score` for classification.
    * `mean_squared_error` (MSE) and `r2_score` (R-squared) for regression.
* **Evaluation Workflow:** Demonstrating the basic steps: train on the training set, predict on the test set, and compare predictions to the true test set labels using appropriate metrics. It also shows the use of the estimator's `.score()` method.

---

Understanding how to split data and evaluate models correctly is crucial before diving into specific algorithms.

In [5]:
%pip install scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.6.1-cp313-cp313-win_amd64.whl.metadata (15 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.0-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Using cached scikit_learn-1.6.1-cp313-cp313-win_amd64.whl (11.1 MB)
Downloading joblib-1.5.0-py3-none-any.whl (307 kB)
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.5.0 scikit-learn-1.6.1 threadpoolctl-3.6.0
Note: you may need to restart the kernel to use updated packages.





In [6]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.neighbors import KNeighborsClassifier # Simple classifier
from sklearn.linear_model import LinearRegression # Simple regressor
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

#### 1. The Need for Splitting Data
- Why split? We train a model on one portion of the data (the training set) and evaluate it on a separate, unseen portion (the test set).
- The test-set score estimates how well the model will perform on new, real-world data.
- Evaluating on the training data itself gives an overly optimistic assessment and does not measure generalization ability.
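
The gap between training and test performance can be seen with a minimal sketch. It assumes a 1-nearest-neighbor classifier on Iris, chosen only because such a model effectively memorizes its training data; the split parameters are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# With n_neighbors=1, every training point is its own nearest neighbor,
# so the training score says little about generalization.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out data

print(f"Training accuracy: {train_acc:.4f}")
print(f"Test accuracy:     {test_acc:.4f}")
```

Only the second number is a fair estimate of how the model would behave on new samples.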

#### 2. Basic Evaluation Concepts (Conceptual)
- Overfitting: The model learns the training data *too well*, including noise and quirks that don't generalize. It performs well on training data but poorly on test data.
- Underfitting: The model is too simple and fails to capture the underlying patterns in the training data. It performs poorly on both training and test data.
- Bias-Variance Tradeoff: Complex models tend to have low bias (they can fit the training data closely) but high variance (they are sensitive to the specifics of the training sample and prone to overfitting). Simple models tend to have high bias but low variance.

The goal is usually to balance the two: a model flexible enough to capture the real pattern, but not so flexible that it fits noise.
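
As a rough illustration of these ideas, the sketch below fits decision trees of increasing depth to noisy synthetic data. The `DecisionTreeRegressor`, the sine-shaped data, and the chosen depths are all assumptions made for this example and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits every training point
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_r2 = tree.score(X_train, y_train)
    test_r2 = tree.score(X_test, y_test)
    results[depth] = (train_r2, test_r2)
    label = "unlimited" if depth is None else str(depth)
    print(f"max_depth={label:>9}  train R2={train_r2:.3f}  test R2={test_r2:.3f}")
```

A very shallow tree tends to score poorly on both sets (underfitting), while an unrestricted tree fits the training set almost perfectly yet scores lower on the test set (overfitting).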

#### 3. Splitting Data with train_test_split
- `train_test_split` is found in `sklearn.model_selection`.

In [8]:
print("--- Splitting Data ---")

# a) Classification Example (Iris dataset)
print("\n--- Classification Splitting ---")
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print(f"Original Iris data shape: X={X_iris.shape}, y={y_iris.shape}")

--- Splitting Data ---

--- Classification Splitting ---
Original Iris data shape: X=(150, 4), y=(150,)


In [9]:
# Split into training (70%) and testing (30%) sets
# test_size: Proportion (or absolute number) of the dataset for the test split.
# train_size: Can be specified instead of test_size.
# random_state: Controls the shuffling applied to the data before splitting.
#               Ensures reproducibility - use the same state for the same split.
# shuffle: Whether or not to shuffle the data before splitting (default=True).
# stratify: Keeps the proportion of target classes approximately the same in
#           the train and test sets. Important for classification, especially
#           with imbalanced datasets. Pass the target array 'y'.
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris,
    test_size=0.3,       # 30% for testing
    random_state=42,     # For reproducibility
    shuffle=True,        # Shuffle before splitting
    stratify=y_iris      # Preserve class proportions
)

print(f"Iris Training set shape: X={X_train_iris.shape}, y={y_train_iris.shape}")
print(f"Iris Test set shape: X={X_test_iris.shape}, y={y_test_iris.shape}")

Iris Training set shape: X=(105, 4), y=(105,)
Iris Test set shape: X=(45, 4), y=(45,)
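
Two of the parameters described above can be checked on a toy array. This is a small sketch with made-up data: an integer `test_size` is read as an absolute number of test samples, and reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# An integer test_size is an absolute number of test samples
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=3, random_state=42)
print("Train size:", len(y_tr), "Test size:", len(y_te))

# The same random_state gives the same split again
_, _, _, y_te_again = train_test_split(X, y, test_size=3, random_state=42)
print("Same split with same seed:", np.array_equal(y_te, y_te_again))

# A different seed usually (not always) selects different test samples
_, _, _, y_te_other = train_test_split(X, y, test_size=3, random_state=7)
print("Test labels, seed 42:", y_te, "| seed 7:", y_te_other)
```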


In [10]:
# Check class distribution in original vs splits (using Pandas for convenience)
original_dist = pd.Series(y_iris).value_counts(normalize=True).sort_index()
train_dist = pd.Series(y_train_iris).value_counts(normalize=True).sort_index()
test_dist = pd.Series(y_test_iris).value_counts(normalize=True).sort_index()

print("\nClass distribution (Original):\n", original_dist)
print("\nClass distribution (Train - stratified):\n", train_dist)
print("\nClass distribution (Test - stratified):\n", test_dist)
print("-" * 20)


Class distribution (Original):
 0    0.333333
1    0.333333
2    0.333333
Name: proportion, dtype: float64

Class distribution (Train - stratified):
 0    0.333333
1    0.333333
2    0.333333
Name: proportion, dtype: float64

Class distribution (Test - stratified):
 0    0.333333
1    0.333333
2    0.333333
Name: proportion, dtype: float64
--------------------
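
Iris is perfectly balanced, so the effect of `stratify` is easier to see on an imbalanced target. The sketch below uses a made-up label array with a 90/10 class ratio and compares the minority-class share of the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Plain random split: the minority share in the test set depends on the shuffle
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the minority share tracks the original 10%
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

plain_share = y_test_plain.mean()  # fraction of class 1 in the test set
strat_share = y_test_strat.mean()

print(f"Minority share in test set (no stratify): {plain_share:.2f}")
print(f"Minority share in test set (stratify=y):  {strat_share:.2f}")
```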


In [11]:
# b) Regression Example (California Housing dataset)
print("\n--- Regression Splitting ---")
# With as_frame=True, fetch_california_housing returns the features (X) as a
# pandas DataFrame and the target (y) as a pandas Series
housing = fetch_california_housing(as_frame=True)
X_housing = housing.data
y_housing = housing.target

print(f"Original Housing data shape: X={X_housing.shape}, y={y_housing.shape}")

# Split into training (80%) and testing (20%) sets
# stratify is typically NOT used for regression tasks.
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_housing, y_housing,
    test_size=0.2,       # 20% for testing
    random_state=123,    # Different random state
    shuffle=True
)

print(f"Housing Training set shape: X={X_train_reg.shape}, y={y_train_reg.shape}")
print(f"Housing Test set shape: X={X_test_reg.shape}, y={y_test_reg.shape}")
print("-" * 30)


--- Regression Splitting ---
Original Housing data shape: X=(20640, 8), y=(20640,)
Housing Training set shape: X=(16512, 8), y=(16512,)
Housing Test set shape: X=(4128, 8), y=(4128,)
------------------------------
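
Both splits above keep the default `shuffle=True`. For comparison, here is a small sketch of `shuffle=False` on toy data, which simply cuts the arrays in their existing order. This can be useful when row order carries meaning, for example time-ordered records; that is an assumption for illustration and not a property of the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data in a meaningful order, e.g. ten consecutive time steps (illustrative)
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False keeps the original order: the first 80% of rows become the
# training set and the last 20% become the test set.
# Note: stratify must be left as None when shuffle=False.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print("Train targets:", y_train)
print("Test targets: ", y_test)
```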


#### 4. Introduction to Evaluation Metrics
- The metric functions are found in `sklearn.metrics`.
- We evaluate the model on the *test set* after training it on the *training set*.

In [12]:
print("--- Basic Model Evaluation ---")

# a) Classification Example (Iris)
print("\n--- Classification Evaluation ---")
# 1. Instantiate and Train Model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_iris, y_train_iris) # Train ONLY on the training data

--- Basic Model Evaluation ---

--- Classification Evaluation ---


In [15]:
# 2. Make Predictions on Test Data
y_pred_iris = knn.predict(X_test_iris) # Predict on the unseen test data

# 3. Evaluate using Metrics
# Accuracy: Proportion of correct predictions
accuracy = accuracy_score(y_test_iris, y_pred_iris) # Compare true test labels vs predicted labels
print(f"KNN Classifier Accuracy on Iris test set: {accuracy:.4f}")

# A classifier's .score() method returns the mean accuracy on the given data,
# so it matches accuracy_score here
score_accuracy = knn.score(X_test_iris, y_test_iris)
print(f"KNN Classifier .score() on Iris test set: {score_accuracy:.4f}")
print("-" * 20)

KNN Classifier Accuracy on Iris test set: 0.9778
KNN Classifier .score() on Iris test set: 0.9778
--------------------
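
To make the definition of accuracy concrete, the sketch below computes it by hand on a few made-up labels and compares the result with `accuracy_score`. The label arrays are illustrative and unrelated to the Iris predictions above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels (5 of 6 predictions are correct)
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])

# Accuracy is the fraction of positions where prediction and truth agree
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(f"Manual accuracy:  {manual_accuracy:.4f}")
print(f"accuracy_score(): {library_accuracy:.4f}")
```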


In [16]:
# b) Regression Example (Housing)
print("\n--- Regression Evaluation ---")
# 1. Instantiate and Train Model
lr = LinearRegression()
lr.fit(X_train_reg, y_train_reg) # Train ONLY on the training data


--- Regression Evaluation ---


In [17]:
# 2. Make Predictions on Test Data
y_pred_reg = lr.predict(X_test_reg) # Predict on the unseen test data

# 3. Evaluate using Metrics
# Mean Squared Error (MSE): Average squared difference between actual and predicted values. Lower is better.
mse = mean_squared_error(y_test_reg, y_pred_reg)
print(f"Linear Regression MSE on Housing test set: {mse:.4f}")

# R-squared (Coefficient of Determination): Proportion of variance in the
# target that is explained by the model. Ranges from -inf to 1.
#   1   = perfect prediction
#   0   = no better than always predicting the mean of the true values
#   < 0 = worse than always predicting that mean
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"Linear Regression R-squared (R2) on Housing test set: {r2:.4f}")

# A regressor's .score() method returns R2 on the given data,
# so it matches r2_score here
score_r2 = lr.score(X_test_reg, y_test_reg)
print(f"Linear Regression .score() (R2) on Housing test set: {score_r2:.4f}")
print("-" * 30)

Linear Regression MSE on Housing test set: 0.5180
Linear Regression R-squared (R2) on Housing test set: 0.6105
Linear Regression .score() (R2) on Housing test set: 0.6105
------------------------------
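
The two regression metrics can also be written out directly from their definitions. The following sketch uses a handful of made-up values, not the housing predictions, and checks the manual results against `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values (illustrative only)
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_pred = np.array([2.5, 1.0, 2.5, 4.0])

residuals = y_true - y_pred

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"Manual MSE: {mse_manual:.4f} | sklearn: {mean_squared_error(y_true, y_pred):.4f}")
print(f"Manual R2:  {r2_manual:.4f} | sklearn: {r2_score(y_true, y_pred):.4f}")
```

Because MSE is expressed in squared units of the target, its magnitude depends on the target's scale, whereas R2 is unitless.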
