<h4 style="color:#1a73e8;">2.3.7 Dataset Splitting: Ensuring Valid Evaluation</h4>

### **The Golden Rule**: **Never train and evaluate on the same data.**

- **Training Set**: Estimates model parameters.
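- **Validation Set**: Tunes hyperparameters and guides model selection. Cross-validation (e.g., `GridSearchCV`) can fill the same role by rotating validation folds within the training data.
- **Test Set**: Provides the **final, unbiased performance estimate**; used **only once**.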

### **Stratified Sampling**

In classification with imbalanced classes (e.g., 95% negative, 5% positive), purely random splits of a small dataset may yield validation or test sets with **very few or even no positive samples**, which makes metrics such as recall unstable or undefined. **Stratification** ensures **proportional representation** of each class in all splits.

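```python
from sklearn.model_selection import train_test_split

# Hold out 40% of the data, preserving class proportions
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)

# Split the held-out 40% evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Result: 60% train, 20% val, 20% test, all stratified
```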
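To sanity-check the 60/20/20 split, it can help to print the size and positive rate of each subset. The following is a minimal sketch on synthetic data; the sample count (1000) and the 5% positive rate are assumptions chosen for illustration, not values from a real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (assumption: 1000 samples, 5% positive)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# Same two-step stratified split as in the text
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Report size and class balance of every subset
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, positive rate={labels.mean():.3f}")
```

With stratification, each subset should report a positive rate of 0.050, matching the full dataset.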

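For regression, `train_test_split` offers no dedicated option for stratifying on a continuous target. Use **stratification by quantiles** instead: bin the target manually (e.g., into quintiles) and pass the bin labels as the `stratify` argument.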
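One possible way to do the manual binning is sketched below with NumPy only. The synthetic log-normal target, the choice of five bins, and the variable names (`edges`, `y_bins`) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic regression data with a skewed continuous target (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Interior quantile edges for 5 bins (quintiles), then assign each target to a bin 0..4
edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_bins = np.digitize(y, edges)

# Stratify on the bin labels; the model still trains on the continuous y
X_train, X_test, y_train, y_test, bins_train, bins_test = train_test_split(
    X, y, y_bins, test_size=0.2, stratify=y_bins, random_state=42
)

print("test samples per bin:", np.bincount(bins_test))
```

The number of bins is a trade-off: more bins track the target distribution more closely, but every bin needs enough samples to be divided across the splits.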

---

<h4 style="color:#1a73e8;">2.3.8 Feature Scaling: Normalization vs. Standardization</h4>

### **Why Scale Features?**

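Algorithms like **k-NN, SVM, neural networks, and PCA** are sensitive to feature scale: k-NN and kernel SVMs rely on **distances**, neural networks on **gradient-based** optimization, and PCA on **variances**. If features are on different scales (e.g., `Age` in [0–100] vs. `Income` in [0–1,000,000]), the larger-scale feature dominates distance and variance computations and can slow gradient-based training.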
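The effect on distances can be seen in a small sketch. The three hypothetical people below (columns: age, income) are made-up values for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age in years, income in dollars]
X = np.array([
    [25.0, 50_000.0],  # A
    [65.0, 51_000.0],  # B: very different age, similar income
    [26.0, 90_000.0],  # C: similar age, very different income
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: income differences swamp age differences
raw_ab = dist(X[0], X[1])
raw_ac = dist(X[0], X[2])

# Standardized features: both columns contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
scaled_ab = dist(X_scaled[0], X_scaled[1])
scaled_ac = dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(A,B)={raw_ab:.1f}, d(A,C)={raw_ac:.1f}")
print(f"scaled: d(A,B)={scaled_ab:.3f}, d(A,C)={scaled_ac:.3f}")
```

On the raw data, A looks far closer to B than to C only because income is measured in larger units; after standardization the 40-year age gap counts as much as the 40,000-dollar income gap.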

### **Standardization (Z-score)**

\[
x_{\text{scaled}} = \frac{x - \mu}{\sigma}
\]

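- Centers each feature at 0 and rescales it to unit variance.
- **Use for**: Linear models, SVM, PCA, neural nets (often preferred).
- **Robust to outliers?** No. The mean and standard deviation are themselves pulled by extreme values; use `RobustScaler` (which centers on the median and scales by the IQR) if outliers are present.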
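As a quick check of the formula, the sketch below applies it by hand with NumPy and compares the result with `StandardScaler`. The toy matrix is an assumption for illustration. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (illustrative values)
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])

# Manual z-score: subtract the column mean, divide by the column std
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # ddof=0, i.e. population standard deviation
X_manual = (X - mu) / sigma

# Same transformation via scikit-learn
X_sklearn = StandardScaler().fit_transform(X)

print("matches StandardScaler:", np.allclose(X_manual, X_sklearn))
print("column means:", X_manual.mean(axis=0))
print("column stds: ", X_manual.std(axis=0))
```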

### **Min-Max Scaling**

\[
x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]

- Scales to [0, 1].
- **Use for**: Neural nets with bounded activations (sigmoid, tanh), image pixel normalization.
- **Sensitive to outliers**—a single extreme value compresses the rest.
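The outlier sensitivity is easy to see in a small sketch. The values below, including the single extreme point of 1000, are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Five evenly spaced values, then the same values plus one extreme point
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
with_outlier = np.vstack([values, [[1000.0]]])

scaled_clean = MinMaxScaler().fit_transform(values).ravel()
scaled_outlier = MinMaxScaler().fit_transform(with_outlier).ravel()

print("without outlier:", np.round(scaled_clean, 3))
print("with outlier:   ", np.round(scaled_outlier, 3))
```

Without the outlier the five values spread evenly over [0, 1]; with it, the same five values are squeezed below 0.05 while the outlier alone maps to 1.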

### **Critical Rule: Fit Scaler on Training Only**

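```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # FIT + TRANSFORM: learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # TRANSFORM ONLY: reuse the training mean/std
```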

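Violating this rule, for example by fitting the scaler on the full dataset before splitting, leaks test set statistics into training. This is a form of **data leakage**. The validation set is treated like the test set: it is transformed with the scaler fitted on the training data.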
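One common way to enforce this rule automatically, including inside cross-validation, is to wrap the scaler and the model in a scikit-learn `Pipeline`. The sketch below assumes a synthetic dataset from `make_classification` and a `LogisticRegression` model; both are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and model are assumptions for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The pipeline refits the scaler on the training portion of every CV fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on all training data, then a single evaluation on the test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The fitted scaler's statistics come from the training data alone
scaler_mean = model.named_steps["standardscaler"].mean_
print("scaler fitted on train only:", np.allclose(scaler_mean, X_train.mean(axis=0)))
print("CV accuracy per fold:", np.round(cv_scores, 3))
print("test accuracy:", round(test_accuracy, 3))
```

Because the scaler lives inside the pipeline, calling `fit` never sees held-out rows, so manual bookkeeping of `fit_transform` versus `transform` is not needed.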

---