# Train/Test Split & Model Evaluation

In this notebook, we'll learn why splitting data into training and testing sets is essential for evaluating machine learning models fairly.

## 🎯 Why Do We Need a Train/Test Split?

Imagine grading your own homework. You already know the answers you wrote, so you are likely to be biased and give yourself an overly optimistic score. Similarly, testing a model on the data it was trained on gives an inflated measure of performance, because the model has already seen the answers.

Splitting the data gives us two things:
- **Detect overfitting:** A model that memorizes its training data can score well there yet perform poorly on new data. A held-out test set exposes that gap.
- **Realistic evaluation:** Testing on unseen data helps us understand how well our model will perform in real-world scenarios.
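
To make the gap between the two scores concrete, here is a minimal sketch under assumptions that are not part of this notebook: the data is synthetic, the labels are pure random noise, and `DecisionTreeClassifier` is swapped in only because an unpruned tree memorizes its training set easily. Since there is nothing real to learn, any high score has to come from memorization.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption for illustration: synthetic features with labels that are pure noise,
# so there is no real pattern for a model to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unpruned decision tree can memorize every training row.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on data the model has seen
test_acc = tree.score(X_test, y_test)     # scored on held-out data

print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy:     {test_acc:.2f}")
```

The training score should come out perfect, while the test score should sit near 0.5, which is what coin-flip guessing gives on two balanced classes. Only the test score reflects what the model actually learned.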


## 📊 Common Train/Test Split Ratios

- **80/20 Split:** 80% training, 20% testing (common for many tasks)
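- **70/30 Split:** 70% training, 30% testing (often used for smaller datasets, so the test set still has enough examples for a stable estimate)
- **60/20/20 Split:** 60% training, 20% validation, 20% testing (more advanced: the validation set is used for tuning, and the test set stays untouched until the end)
- **Larger datasets:** Can usually afford a smaller test percentage, since even a small fraction still contains many examples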
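
A three-way split can be sketched by calling `train_test_split` twice. This is one possible approach rather than the only one; the variable names `X_temp`, `X_val`, and `y_val` are chosen here for illustration, and the Iris dataset is used only as sample data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out 20% of all rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training and validation sets.
# 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

With the 150 rows of Iris, this yields 90 training, 30 validation, and 30 test rows.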


## 💻 Example: How to Split Data and Evaluate a Model


```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load sample data
iris = load_iris()
X, y = iris.data, iris.target

# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# Test model
accuracy = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy:.2f}")
```
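
A single split gives only one estimate, and that estimate depends on which rows happen to land in the test set. As a rough sketch (the loop and the choice of five seeds are assumptions for illustration, not part of the original example), repeating the split with different `random_state` values shows how much the score can move:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):  # five seeds, an arbitrary choice for this sketch
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = KNeighborsClassifier()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# Report each test accuracy plus the range across splits
print("Accuracy per split:", [f"{s:.2f}" for s in scores])
print(f"Lowest: {min(scores):.2f}, highest: {max(scores):.2f}")
```

If the scores vary noticeably between seeds, the test set is probably too small for one split to be trusted on its own.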

## 🚀 Open this example in Google Colab

[Open in Colab](https://colab.research.google.com/github/Roopesht/codeexamples/blob/main/genai/python_easy/2/concept_1.ipynb)

## 🎯 Key Takeaway

Remember: "Never test on training data - it's like grading your own homework!"

## ❓ Question:

If you're building a spam email detector, why would testing on the same training emails give misleading results?