# Train-Test Split and Validation

Splitting the dataset is one of the most important steps in model evaluation. Holding out part of the data lets us test the model on **unseen data** and measure how well it generalizes.

---
## 1. Train-Test Split
- **Training set**: Used to fit the model.
- **Test set**: Used to evaluate performance.
- Typical ratio → 70-80% training, 20-30% testing.

### Why do we split?
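- A model scored on the data it was trained on can look better than it is, because it may have memorized those examples (overfitting).
- A held-out test set gives a less biased estimate of performance on new data.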
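
The gap between training and test performance can be sketched as follows. This is only an illustration under assumptions that are not part of this lesson: the synthetic dataset from `make_classification`, the 20% label noise (`flip_y=0.2`), and the unrestricted `DecisionTreeClassifier` were all chosen to make the effect easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (an assumption made for illustration)
X, y = make_classification(
    n_samples=500, n_features=20, flip_y=0.2, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A tree with no depth limit can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy:     {test_acc:.2f}")
```

The training score says little about generalization here. Only the held-out score reflects how the model behaves on data it has not seen.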

---
## 2. Validation Set
- Sometimes we further split training data into **train** and **validation**.
- Validation set is used for **hyperparameter tuning**.
- Final test set is kept aside and used only at the end.
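
A minimal sketch of tuning on a validation set, assuming the regularization strength `C` of `LogisticRegression` is the hyperparameter being tuned. The candidate values, the `stratify` argument, and the choice to refit on train + validation before the final test are illustrative assumptions, not the only reasonable setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out the test set first; it is not touched during tuning
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Split the remainder into train and validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.3, random_state=42, stratify=y_trainval
)

# Hypothetical candidate values for the hyperparameter C
candidate_C = [0.01, 0.1, 1.0, 10.0]
val_scores = {}
for C in candidate_C:
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_tr, y_tr)
    val_scores[C] = clf.score(X_val, y_val)  # validation accuracy for this C

# Pick the value with the highest validation accuracy
best_C = max(val_scores, key=val_scores.get)

# Refit on train + validation with the chosen value, then test exactly once
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_trainval, y_trainval)
test_acc = final_model.score(X_test, y_test)

print("Validation accuracy per C:", val_scores)
print("Chosen C:", best_C)
print(f"Test accuracy: {test_acc:.2f}")
```

Because the test set plays no part in choosing `C`, the final score is not inflated by the tuning process.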

---
## 3. Cross-Validation (Preview)
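- Instead of one split, we evaluate on several different train/validation splits and average the scores.
- This gives a more reliable estimate than a single split, because the result depends less on which samples happened to land in the validation set.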
- Introduced in detail later.
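
As a quick preview, cross-validation might look like the sketch below. It assumes scikit-learn's `cross_val_score` with 5 folds; the fold count and the model are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5: the data is divided into 5 folds, and each fold serves once as the
# validation set while the model is trained on the other 4
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread across folds gives a rough sense of how much a single split could mislead.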

---
## Workflow
1. Split dataset → train + test.
2. Further split training → train + validation.
3. Train on train, tune on validation.
4. Final evaluation on test.

In [None]:
# Example: Train-Test Split with Validation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Step 1: Split into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Split the remaining training data into train (70%) and validation (30%),
# which is 56% and 24% of the full dataset
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

print("Train size:", len(X_train_final))
print("Validation size:", len(X_val))
print("Test size:", len(X_test))

# Train model on training set
model = LogisticRegression(max_iter=200)
model.fit(X_train_final, y_train_final)

# Evaluate on validation set (used for tuning)
val_pred = model.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)
print(f"Validation Accuracy: {val_acc:.2f}")

# Final evaluation on test set
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
print(f"Test Accuracy: {test_acc:.2f}")