# Model Selection Strategies

Selecting the right model is crucial for building effective machine learning systems. Different strategies help us decide which model performs best on unseen data.

## 1. Hold-out Method
- Split data into **train, validation, and test sets**.
- Train on the training set, tune hyperparameters on the validation set, and evaluate once on the test set.
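
The three-way split above can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the 60/20/20 proportions, the scaled logistic regression pipeline, and the candidate values for `C` are all assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% as the test set, then split the remainder 75/25,
# which gives roughly 60% train / 20% validation / 20% test (assumed ratios).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Tune one hyperparameter (C) using the validation set only.
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # illustrative candidate values
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Refit on train + validation with the chosen C, then use the test set once.
final_model = make_pipeline(StandardScaler(),
                            LogisticRegression(C=best_C, max_iter=1000))
final_model.fit(X_rest, y_rest)
test_acc = final_model.score(X_test, y_test)

print(f"best C = {best_C}, validation accuracy = {best_val_acc:.3f}, "
      f"test accuracy = {test_acc:.3f}")
```

Refitting on train plus validation before the final test evaluation is one common choice; keeping the model trained on the training set alone is also defensible.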

## 2. Cross-Validation (CV)
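- Split the data into *k* folds.
- Train on *k* - 1 folds and validate on the remaining fold; repeat *k* times so that each fold serves as the validation set once.
- Average the results across folds.
- This gives a more stable estimate than a single split, at the cost of *k* training runs.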
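
To make the fold mechanics concrete, here is a hand-written version of the loop that helpers such as `cross_val_score` run for you. It is a sketch under assumptions: *k* = 5, stratified and shuffled folds, and a scaled logistic regression as the model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

k = 5  # assumed number of folds
cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in cv.split(X, y):
    # Fresh, unfitted copy so no state leaks between folds.
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])            # train on k-1 folds
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # validate on 1

print(f"Per-fold accuracy: {np.round(fold_scores, 3)}")
print(f"Mean = {np.mean(fold_scores):.3f}, Std = {np.std(fold_scores):.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training folds only, so the validation fold does not influence preprocessing.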

## 3. Grid Search / Random Search
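- **Grid Search:** Evaluate every combination in a predefined grid of hyperparameter values.
- **Random Search:** Sample a fixed number of configurations at random from specified ranges or distributions; usually cheaper than an exhaustive grid.
- Both are usually combined with cross-validation.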
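
The two search styles can be compared side by side with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The SVM pipeline, the grid values, and the log-uniform ranges below are assumptions picked for illustration; both searches are given the same budget of nine configurations.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Grid search: all 3 x 3 = 9 combinations, each scored with 5-fold CV.
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01, 0.1]},
    cv=5,
)
grid.fit(X, y)

# Random search: 9 configurations drawn from log-uniform distributions.
rand = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svc__C": loguniform(1e-2, 1e2),
        "svc__gamma": loguniform(1e-4, 1e0),
    },
    n_iter=9,
    cv=5,
    random_state=0,
)
rand.fit(X, y)

print(f"Grid search:   best CV accuracy = {grid.best_score_:.3f}, params = {grid.best_params_}")
print(f"Random search: best CV accuracy = {rand.best_score_:.3f}, params = {rand.best_params_}")
```

With random search the budget (`n_iter`) is set independently of the number of hyperparameters, whereas a grid grows multiplicatively with every added dimension.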

## 4. Bayesian Optimization / Advanced Search
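- Fits a probabilistic surrogate model (for example, a Gaussian process) to past results and uses it to choose the next hyperparameters to try.
- Often needs fewer evaluations than grid or random search, which matters most when each evaluation is expensive.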
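
The idea can be sketched with scikit-learn and SciPy alone. This is a toy, hand-rolled loop rather than the API of any dedicated tuning library: it tunes a single hyperparameter (`log10(C)` of a logistic regression), uses a Gaussian process as the surrogate, and picks the next point by expected improvement over a fixed candidate grid. The helper `objective`, the search range, the kernel, and the budget of 10 evaluations are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_breast_cancer
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)


def objective(log10_C):
    """Hypothetical helper: mean 5-fold CV accuracy for a given log10(C)."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=10.0 ** log10_C, max_iter=2000),
    )
    return cross_val_score(model, X, y, cv=5).mean()


# Candidate values of log10(C) the surrogate may propose (assumed range).
candidates = np.linspace(-3, 3, 61).reshape(-1, 1)
rng = np.random.default_rng(0)

# Start from 3 random evaluations, then let the surrogate choose 7 more.
tried = list(rng.choice(candidates.ravel(), size=3, replace=False))
scores = [objective(v) for v in tried]

for _ in range(7):
    # Standardize observed scores so the GP prior (mean 0, unit scale) fits.
    y_arr = np.array(scores)
    y_norm = (y_arr - y_arr.mean()) / (y_arr.std() or 1.0)

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-2,
                                  random_state=0)
    gp.fit(np.array(tried).reshape(-1, 1), y_norm)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)

    # Expected improvement over the best score seen so far (maximization).
    best = y_norm.max()
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[np.isin(candidates.ravel(), tried)] = -np.inf  # skip evaluated points

    next_point = float(candidates[np.argmax(ei), 0])
    tried.append(next_point)
    scores.append(objective(next_point))

best_idx = int(np.argmax(scores))
print(f"Best log10(C) = {tried[best_idx]:.2f}, CV accuracy = {scores[best_idx]:.3f}")
```

For real projects a dedicated library is usually the better choice; the loop above is only meant to show how the surrogate and the acquisition function interact.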

## 5. Ensemble Methods for Model Selection
- Sometimes combining models (bagging, boosting, stacking) gives better results than selecting a single best model.
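
As a rough illustration, the snippet below scores two individual models and a stacked combination of them under the same 5-fold cross-validation. The choice of base models and of a logistic regression as the final estimator is an assumption for the example; whether the ensemble actually wins depends on the data, so no result is claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Stacking: a final estimator learns how to combine the base models'
# out-of-fold predictions (internal cv=5).
stack = StackingClassifier(
    estimators=[("logreg", logreg), ("forest", forest)],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

candidates = {
    "Logistic Regression": logreg,
    "Random Forest": forest,
    "Stacking": stack,
}

results = {}
for name, model in candidates.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: Mean Accuracy = {results[name]:.3f}")
```

Treating the ensemble as just another candidate keeps the comparison fair: all three are evaluated on identical folds.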

## 6. Practical Considerations
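- Avoid overfitting to validation data: the more configurations you compare on the same validation set, the more optimistic the best validation score becomes.
- Always evaluate final performance on a separate test set that played no part in training or tuning.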
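
One way to guard against an optimistic validation score, when data is too scarce for a large dedicated test set, is nested cross-validation: an inner loop tunes hyperparameters and an outer loop scores the whole tuning procedure on folds it never saw. The sketch below assumes an SVM pipeline and a small illustrative grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01, 0.1]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Non-nested: the score that was used to pick the winner (tends to be optimistic).
search.fit(X, y)
inner_best = search.best_score_

# Nested: each outer fold reruns the full search on its training part
# and scores the selected model on data the search never touched.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"Best inner CV accuracy (used for selection): {inner_best:.3f}")
print(f"Nested CV accuracy (selection held out):     {nested_scores.mean():.3f}")
```

The nested estimate describes the tuning procedure as a whole rather than one fixed hyperparameter setting.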

```python
# Example: Comparing Logistic Regression and Random Forest with Cross-Validation
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Load the built-in breast cancer dataset (binary classification)
data = load_breast_cancer()
X, y = data.data, data.target

# Candidate models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=500, solver='liblinear'),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

# 5-fold cross-validation; for classifiers, an integer cv uses stratified
# folds and the default score is accuracy
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: Mean Accuracy = {np.mean(scores):.3f}, Std = {np.std(scores):.3f}")
```