# Advanced: Hands-on with sklearn Tools

In this notebook, we explore scikit-learn's tools for model evaluation and selection. We cover cross-validation, detailed classification reports, and a complete evaluation workflow from the first data split to the final test-set report. Together, these techniques give a more trustworthy estimate of how a model will perform on data it has not seen.


## Cross-validation with `cross_val_score()`

Cross-validation estimates how well a model generalizes to unseen data. The data is split into k parts (folds); the model is trained on k-1 folds and scored on the remaining one, and this is repeated until every fold has served as the validation set once. Averaging the k scores gives a more reliable performance estimate than a single train/validation split.
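
To make the mechanics concrete, here is a minimal sketch of the loop that `cross_val_score()` runs for a classifier when `cv=5`. The names `X_demo`, `y_demo`, `base_model`, and `manual_scores` are introduced here for illustration only, and the sketch assumes the Wine dataset and Random Forest settings used later in this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative names; they do not overwrite anything used in the notebook.
X_demo, y_demo = load_wine(return_X_y=True)
base_model = RandomForestClassifier(random_state=42)

# For a classifier, an integer cv=5 means 5 stratified folds without shuffling.
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(base_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() on a classifier returns accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(base_model, X_demo, y_demo, cv=5)

print("Manual loop:    ", manual_scores.round(3))
print("cross_val_score:", helper_scores.round(3))
```

Because the estimator is cloned for each fold, the object you pass in is never fitted by the helper itself.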

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
import pandas as pd

## Load Dataset

We will use the Wine dataset, loaded with `load_wine()`. It is a classic small classification dataset: 178 samples, 13 numeric features, and 3 classes.

In [None]:
# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
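
Before splitting, it can help to check the shape of the data and how the classes are distributed. The following is a small sketch of such a check; `class_counts` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

print("Feature matrix shape:", X.shape)
print("Number of features:", len(wine.feature_names))

# Count how many samples belong to each class label (0, 1, 2).
class_counts = np.bincount(y)
for name, count in zip(wine.target_names, class_counts):
    print(f"{name}: {count} samples")
```

The classes are not perfectly balanced, which is one reason the split in the next step is stratified.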

## Step 1: Initial train/test split

We hold out 20% of the data as a test set that is used only for the final evaluation. `stratify=y` keeps the class proportions the same in both parts, and `random_state=42` makes the split reproducible.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
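
A quick way to see what `stratify=y` does is to compare class proportions across the full data, the training set, and the test set. This is a sketch only; the helper `class_proportions` is hypothetical and defined here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def class_proportions(labels, n_classes=3):
    """Return the fraction of samples that fall in each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

full_props = class_proportions(y)
train_props = class_proportions(y_train)
test_props = class_proportions(y_test)

print("Full data:", full_props.round(3))
print("Train:    ", train_props.round(3))
print("Test:     ", test_props.round(3))
```

With stratification, the three rows should be nearly identical; without it, a small test set can drift noticeably from the overall class mix.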

## Step 2: Train the model

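We fit a Random Forest classifier on the training set, using the default hyperparameters. `random_state=42` fixes the randomness of the forest so that results are reproducible.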

In [None]:
# Initialize and train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

## Step 3: Cross-validation on training data

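We perform 5-fold cross-validation on the training set to estimate the model's performance more reliably than a single split would. `cross_val_score()` clones the estimator and refits it on each fold, so the model fitted in Step 2 is left unchanged. For a classifier, `cv=5` uses stratified folds, and the default score is accuracy.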

In [None]:
# 5-fold cross-validation on the training set (default score: accuracy)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
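
If you want more than one metric per fold, scikit-learn's `cross_validate()` accepts a list of scorers. The sketch below assumes the same split and Random Forest settings as above; `cv_model` and `cv_results` are names introduced here so that the fitted `model` from Step 2 is not overwritten.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Separate, unfitted estimator used only for this cross-validation run.
cv_model = RandomForestClassifier(random_state=42)

# Score each fold with accuracy and macro-averaged F1.
cv_results = cross_validate(
    cv_model, X_train, y_train, cv=5, scoring=["accuracy", "f1_macro"]
)

for metric in ("accuracy", "f1_macro"):
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Macro-averaged F1 weights every class equally, so it can reveal weak performance on a small class that overall accuracy hides.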

## Step 4: Final evaluation on test set

We evaluate the trained model on the held-out test data and generate a classification report, which lists precision, recall, F1-score, and support (the number of test samples) for each class.

In [None]:
# Predict on test set
y_pred = model.predict(X_test)

# Print classification report
print("\nFinal Test Results:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
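
The printed report is plain text. When you want to sort, filter, or save the metrics, `classification_report()` can return a dictionary instead via `output_dict=True`, which converts cleanly to a pandas DataFrame. This is a sketch under the same split and model settings as above; `report_model`, `report_dict`, and `report_df` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

# Separate estimator so the notebook's fitted `model` is left untouched.
report_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

report_dict = classification_report(
    y_test,
    report_model.predict(X_test),
    target_names=list(wine.target_names),
    output_dict=True,  # return nested dict instead of a formatted string
)

# One row per class, plus the accuracy and averaged rows.
report_df = pd.DataFrame(report_dict).transpose()
print(report_df.round(3))
```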

## Step 5: Confusion matrix

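The confusion matrix breaks predictions down by class: each row corresponds to a true label and each column to a predicted label. Entries on the diagonal are correct predictions; off-diagonal entries show which classes are confused with each other.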

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
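
A bare array is easy to misread, so one option is to wrap the matrix in a labelled DataFrame and derive per-class recall from it. The sketch below assumes the same split and model settings as above; `cm_model`, `cm_demo`, `cm_df`, and `per_class_recall` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

cm_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
cm_demo = confusion_matrix(y_test, cm_model.predict(X_test))

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(
    cm_demo,
    index=[f"true {name}" for name in wine.target_names],
    columns=[f"pred {name}" for name in wine.target_names],
)
print(cm_df)

# Recall per class = correct predictions / all samples of that true class.
per_class_recall = cm_demo.diagonal() / cm_demo.sum(axis=1)
for name, recall in zip(wine.target_names, per_class_recall):
    print(f"Recall for {name}: {recall:.3f}")
```

Each row sums to the number of test samples in that class, which is a handy sanity check.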

## Pro Tips for Model Evaluation

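- 🎯 **Stratify:** Pass `stratify=y` when splitting so every subset keeps the class proportions; this matters most for imbalanced data
- 🔢 **Random State:** Set `random_state` for reproducible results
- 📊 **Multiple Metrics:** Never rely on accuracy alone; check per-class precision, recall, and F1 as well
- 🔄 **Cross-Validation First:** Use CV for model selection and keep the test set for the final report
- ⚠️ **Data Leakage:** Never let test data influence training, including preprocessing steps such as scaling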
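
The data-leakage tip also applies inside cross-validation: any preprocessing that learns from the data should be fitted on the training folds only. One common way to get this is to wrap preprocessing and model in a pipeline. The sketch below is an illustration, not part of the workflow above: it swaps in `StandardScaler` and `LogisticRegression` (an assumption made here because scaling matters for that model, unlike for a Random Forest), and `leak_free_pipeline` is a hypothetical name.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is refitted on the training part of every fold,
# so validation folds never influence the scaling statistics.
leak_free_pipeline = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
)

pipe_scores = cross_val_score(leak_free_pipeline, X_train, y_train, cv=5)
print(f"CV accuracy: {pipe_scores.mean():.3f} ± {pipe_scores.std():.3f}")

# Final step: fit on all training data, then score once on the test set.
leak_free_pipeline.fit(X_train, y_train)
test_accuracy = leak_free_pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Fitting the scaler on the full dataset before splitting would let test-set statistics leak into training, which tends to make scores look better than they are.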