# Scikit-learn: A Deep Dive Tutorial

This notebook provides a comprehensive walkthrough of `scikit-learn`, the essential library for machine learning in Python. We will cover the entire ML workflow:

1. **Core Concepts:** The Estimator API (`fit`, `predict`, `transform`).
2. **Data Exploration & Preprocessing:** Loading, visualizing, splitting, and scaling data.
3. **Supervised Learning: Classification:** Building and evaluating a classification model visually.
4. **Supervised Learning: Regression:** Building and evaluating a regression model visually.
5. **Pipelines & Hyperparameter Tuning:** Best practices for building robust models.
6. **Unsupervised Learning:** A brief look at clustering and dimensionality reduction.

In [ ]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set a nice style for the plots
# (this style name requires Matplotlib >= 3.6; older releases call it 'seaborn-whitegrid')
plt.style.use('seaborn-v0_8-whitegrid')

## 1. Core Concepts: The Scikit-learn API

Scikit-learn's power comes from its simple, consistent API. Every algorithm is exposed via an **'Estimator'** object. Key methods include:

- **`fit(X, y)`**: Trains the model. `X` contains the features (one row per sample), and `y` contains the target labels or values. Unsupervised estimators and transformers need only `X`.
- **`predict(X_new)`**: Makes predictions on new, unseen data `X_new` after the model has been trained.
- **`transform(X)`**: For preprocessing steps, this method applies the fitted transformation to the data (e.g., scales it or encodes it).
- **`fit_transform(X)`**: A convenience method equivalent to calling `fit(X)` followed by `transform(X)`. Use it on training data only.
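
The four methods above can be sketched on a tiny, made-up dataset. The arrays `X_toy`, `y_toy`, and `X_new_toy` are illustrative assumptions, not part of the tutorial's data; the point is only to show the order in which the methods are typically called.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, two well-separated classes
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                  [10.0, 100.0], [11.0, 110.0], [12.0, 120.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Transformer: fit learns the mean/std, transform applies them
toy_scaler = StandardScaler()
toy_scaler.fit(X_toy)
X_toy_scaled = toy_scaler.transform(X_toy)

# fit_transform does both steps in one call and gives the same result
X_toy_scaled_again = StandardScaler().fit_transform(X_toy)

# Predictor: fit learns from (X, y), predict labels new samples
toy_clf = KNeighborsClassifier(n_neighbors=3)
toy_clf.fit(X_toy_scaled, y_toy)

# New samples must go through the SAME fitted scaler before predict
X_new_toy = np.array([[2.5, 25.0], [10.5, 105.0]])
toy_pred = toy_clf.predict(toy_scaler.transform(X_new_toy))
print(toy_pred)
```

Note that the new samples are only transformed, never used to refit the scaler.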

## 2. Data Exploration & Preprocessing

Before building any model, we must understand and prepare our data. We'll use the famous Iris dataset for our classification example.

In [ ]:
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

### Understanding the Dataset

The Iris dataset contains 150 samples of iris flowers. There are 4 features (sepal length, sepal width, petal length, petal width) and a target variable indicating the species (setosa, versicolor, or virginica). Let's put it into a pandas DataFrame for easier exploration.

In [ ]:
# Create a pandas DataFrame
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Display the first 5 rows
iris_df.head()

### Visualizing the Dataset

A `pairplot` is a fantastic tool to quickly visualize the relationships between all pairs of features, as well as the distribution of each feature, colored by the target variable.

In [ ]:
# Create a pairplot to visualize the data
sns.pairplot(iris_df, hue='species', height=2.5)
plt.show()

### Train-Test Split

We must split our data into a training set and a testing set. The model learns from the training set and is evaluated on the unseen testing set to gauge its real-world performance. `stratify=y` ensures that the proportion of each class is the same in both the train and test sets.

In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
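
One way to check what `stratify=y` does is to count the classes on each side of the split. The sketch below reloads Iris under separate, illustrative variable names (`X_chk`, `y_chk`, and so on) so it does not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_chk, y_chk = load_iris(return_X_y=True)

# Same split settings as in the cell above
_, _, y_train_chk, y_test_chk = train_test_split(
    X_chk, y_chk, test_size=0.3, random_state=42, stratify=y_chk
)

# np.bincount counts how many samples fall into each class id (0, 1, 2)
train_counts = np.bincount(y_train_chk)
test_counts = np.bincount(y_test_chk)

print('Train class counts:', train_counts)
print('Test class counts: ', test_counts)
```

Because Iris has 50 samples per class, a stratified split keeps the three classes balanced in both subsets.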

### Feature Scaling

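Many algorithms, particularly distance-based ones such as KNN and SVMs, perform better when features are on a similar scale. `StandardScaler` standardizes each feature by subtracting its mean and dividing by its standard deviation, so that every feature has zero mean and unit variance on the data it was fitted on.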

In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on the training data ONLY to avoid data leakage
scaler.fit(X_train)

# Transform both train and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
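
To make the scaling step concrete, here is a minimal sketch that reproduces `StandardScaler` by hand on synthetic data. The arrays `X_tr_demo` and `X_te_demo` are randomly generated stand-ins, not the Iris split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a train/test split (assumption: any numeric data works)
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_te_demo = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

demo_scaler = StandardScaler().fit(X_tr_demo)

# Manual standardization using TRAINING statistics only
mu = X_tr_demo.mean(axis=0)
sigma = X_tr_demo.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_test_scaled = (X_te_demo - mu) / sigma

print(np.allclose(manual_test_scaled, demo_scaler.transform(X_te_demo)))

# Training data ends up with mean ~0; test data generally does not, and that is expected
print(demo_scaler.transform(X_tr_demo).mean(axis=0).round(6))
print(demo_scaler.transform(X_te_demo).mean(axis=0).round(6))
```

The test set is scaled with the training mean and standard deviation, so its own mean is close to zero but not exactly zero.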



## 3. Supervised Learning: Classification

Let's train a K-Nearest Neighbors (KNN) classifier to predict the species of an iris flower based on its measurements.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# 1. Instantiate the model
knn = KNeighborsClassifier(n_neighbors=5)

# 2. Train the model using the scaled training data
knn.fit(X_train_scaled, y_train)

print('Model trained successfully!')

### Evaluation with Visuals

While metrics are important, visuals provide deeper insight into model performance.

In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = knn.predict(X_test_scaled)
print(f'Model Accuracy: {accuracy_score(y_test, y_pred):.4f}\n')
print('Classification Report:')
print(classification_report(y_test, y_pred, target_names=iris.target_names))

In [ ]:
# Display the confusion matrix visually
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
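
The numbers in the classification report can be read directly off a confusion matrix. The sketch below uses small hand-written label arrays (`y_true_demo` and `y_pred_demo` are made up for illustration) to show how accuracy, recall, and precision relate to the matrix's diagonal, rows, and columns.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for a 3-class problem
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2, 1, 2])

# Rows are actual classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Accuracy: correct predictions (diagonal) over all predictions
acc_from_cm = np.trace(cm_demo) / cm_demo.sum()

# Recall per class: diagonal over row sums (how many actual items were found)
recall_from_cm = np.diag(cm_demo) / cm_demo.sum(axis=1)

# Precision per class: diagonal over column sums (how many predictions were right)
precision_from_cm = np.diag(cm_demo) / cm_demo.sum(axis=0)

print(f'Accuracy:  {acc_from_cm:.4f}')
print(f'Recall:    {recall_from_cm.round(4)}')
print(f'Precision: {precision_from_cm.round(4)}')
```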

### Visualizing the Decision Boundary

A decision boundary plot shows how the model would classify any point in the feature space. It gives a great intuition for the model's behavior. We will use only the first two features (sepal length and width) for this 2D visualization.

In [ ]:
from matplotlib.colors import ListedColormap

# --- Create a new model trained on only the first two features ---
X_train_2d = X_train_scaled[:, :2]
X_test_2d = X_test_scaled[:, :2]
knn_2d = KNeighborsClassifier(n_neighbors=5)
knn_2d.fit(X_train_2d, y_train)

# --- Create a mesh grid for plotting ---
h = .02  # step size in the mesh
x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1
y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# --- Make predictions on the mesh grid ---
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# --- Plot the decision boundary and the test points ---
plt.figure(figsize=(10, 8))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ['darkred', 'darkgreen', 'darkblue']

plt.contourf(xx, yy, Z, cmap=cmap_light)

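# hue_order pins each species to its matching color in cmap_bold;
# without it, seaborn assigns colors in order of first appearance in y_test
sns.scatterplot(x=X_test_2d[:, 0], y=X_test_2d[:, 1], hue=iris.target_names[y_test], hue_order=list(iris.target_names), palette=cmap_bold, alpha=1.0, edgecolor="black")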

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title('3-Class classification (k = 5)')
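# The model was trained on scaled features, so the axes are in standardized units
plt.xlabel(f'{iris.feature_names[0]} (standardized)')
plt.ylabel(f'{iris.feature_names[1]} (standardized)')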
plt.show()

## 4. Supervised Learning: Regression

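Now let's switch to a regression task: predicting a continuous value. We'll use the California Housing dataset, which has 20,640 samples with 8 numeric features; the target is the median house value of a district, in units of 100,000 USD. The dataset is downloaded the first time it is fetched.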

In [ ]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred = lr.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error (MSE): {mse:.4f}')
print(f'R-squared (R2) Score: {r2:.4f}')
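
Both metrics are simple functions of the residuals: MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal sketch on made-up values (`y_true_reg` and `y_pred_reg` are illustrative, not housing data) shows the arithmetic and adds RMSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true_reg = np.array([3.0, 1.5, 2.0, 4.5, 5.0])
y_pred_reg = np.array([2.5, 1.0, 2.5, 4.0, 5.5])

residuals = y_true_reg - y_pred_reg

# MSE: average squared residual; RMSE: its square root (same units as y)
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_reg - y_true_reg.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'MSE  manual: {mse_manual:.4f}  sklearn: {mean_squared_error(y_true_reg, y_pred_reg):.4f}')
print(f'RMSE manual: {rmse_manual:.4f}')
print(f'R2   manual: {r2_manual:.4f}  sklearn: {r2_score(y_true_reg, y_pred_reg):.4f}')
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of `y`.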

### Visualizing Regression Results

A scatter plot of actual vs. predicted values is a great way to visualize regression performance. For a perfect model, all points would lie on the 45-degree diagonal line.

In [ ]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred, alpha=0.3)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.show()

## 5. Pipelines & Hyperparameter Tuning

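Manually scaling and training is error-prone. A **Pipeline** chains these steps together into a single estimator object. Because every step is fitted only on the data passed to `fit`, the scaler never sees held-out data. This prevents data leakage from the test set (and from validation folds during cross-validation) and simplifies your code.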

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(random_state=42))
])

pipeline.fit(X_train, y_train)
print(f'Pipeline Accuracy: {pipeline.score(X_test, y_test):.4f}')
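
Because a pipeline behaves like any other estimator, it can be passed straight to cross-validation, where the scaler is refitted on the training portion of every fold. The sketch below illustrates this with `cross_val_score` and then inspects a fitted step through `named_steps`; the variable names (`cv_pipeline`, `X_cv`, `y_cv`) are illustrative and kept separate from the notebook's own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_cv, y_cv = load_iris(return_X_y=True)

cv_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])

# Each fold clones the pipeline and fits scaler + SVC on that fold's training part only
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5)
print(f'Fold accuracies: {cv_scores.round(4)}')
print(f'Mean accuracy:   {cv_scores.mean():.4f}')

# After an explicit fit, individual steps can be inspected by name
cv_pipeline.fit(X_cv, y_cv)
fitted_scaler = cv_pipeline.named_steps['scaler']
print('Per-feature means learned by the scaler:', fitted_scaler.mean_.round(3))
```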

### Hyperparameter Tuning with GridSearchCV

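Most models have hyperparameters: settings that are chosen before training rather than learned from the data. `GridSearchCV` automates tuning by exhaustively evaluating every combination in a specified parameter grid with cross-validation and keeping the best one. For a pipeline, parameters are addressed as `<step name>__<parameter>`, e.g. `svc__C`. Note that `gamma` has no effect with the linear kernel, so some combinations in the grid are redundant.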

In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [1, 0.1, 0.01, 0.001],
    'svc__kernel': ['rbf', 'linear']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f'\nBest parameters found: {grid_search.best_params_}')
print(f'Best cross-validation score: {grid_search.best_score_:.4f}')
print(f'Test set score with best params: {grid_search.score(X_test, y_test):.4f}')
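
Since `gamma` is ignored by the linear kernel, one possible refinement is to pass a list of grids so that `gamma` is varied only for the RBF kernel. The sketch below uses `ParameterGrid` to count the candidates in each layout without training anything; `single_grid` and `split_grid` are illustrative names, and the values mirror the grid above.

```python
from sklearn.model_selection import ParameterGrid

# One dict: every kernel is combined with every gamma
single_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [1, 0.1, 0.01, 0.001],
    'svc__kernel': ['rbf', 'linear'],
}

# List of dicts: gamma is only searched where it matters (rbf)
split_grid = [
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [1, 0.1, 0.01, 0.001]},
    {'svc__kernel': ['linear'], 'svc__C': [0.1, 1, 10, 100]},
]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))

print(f'Candidates in single grid: {n_single}')
print(f'Candidates in split grid:  {n_split}')
```

`GridSearchCV` accepts the list form directly as its `param_grid` argument, and each candidate is fitted once per cross-validation fold, so fewer candidates means proportionally fewer fits.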

## 6. Unsupervised Learning

Unsupervised learning finds patterns in data without pre-existing labels (`y`).

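### Clustering with K-Means

K-Means partitions the data into *k* clusters by assigning each sample to its nearest cluster centroid and repeatedly updating the centroids to reduce the within-cluster variance.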

In [ ]:
from sklearn.cluster import KMeans

X, y = iris.data, iris.target

# n_init='auto' requires scikit-learn >= 1.2; on older versions pass an integer such as n_init=10
kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(X)

# Visualize the clusters vs the actual labels
# (cluster ids are arbitrary, so colors need not match between the two panels)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

ax1.scatter(X[:, 2], X[:, 3], c=cluster_labels, cmap='viridis', s=50)
ax1.set_title('K-Means Clusters (fit on all 4 features, shown on petal features)')
ax1.set_xlabel(iris.feature_names[2])
ax1.set_ylabel(iris.feature_names[3])

ax2.scatter(X[:, 2], X[:, 3], c=y, cmap='viridis', s=50)
ax2.set_title('Actual Iris Species')
ax2.set_xlabel(iris.feature_names[2])
plt.show()
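
Cluster ids carry no meaning of their own, so comparing them to the species labels with plain accuracy would be misleading. One option, sketched here as an assumption rather than as part of the original workflow, is the adjusted Rand index (`adjusted_rand_score`), which is unaffected by how the clusters happen to be numbered. The sketch uses `n_init=10` so it also runs on older scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X_km, y_km = load_iris(return_X_y=True)

km_demo = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_km = km_demo.fit_predict(X_km)

# Agreement between clusters and species: 1.0 is perfect, ~0.0 is chance level
ari = adjusted_rand_score(y_km, labels_km)

# Renumbering the clusters (0->1, 1->2, 2->0) leaves the score unchanged
labels_permuted = (labels_km + 1) % 3
ari_permuted = adjusted_rand_score(y_km, labels_permuted)

print(f'Adjusted Rand index:              {ari:.4f}')
print(f'After renumbering the cluster ids: {ari_permuted:.4f}')
```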

### Dimensionality Reduction with PCA

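Principal Component Analysis (PCA) reduces the number of features by projecting the data onto the orthogonal directions (principal components) along which it varies most, preserving as much of the data's variance as possible. This makes it useful for visualizing high-dimensional data in 2D.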

In [ ]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2) # Reduce from 4 features to 2
X_pca = pca.fit_transform(X)

print(f'Original shape: {X.shape}')
print(f'Reduced shape: {X_pca.shape}')

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=70)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Iris Dataset Visualized with PCA')
plt.show()
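
How much variance the 2D projection keeps can be read from the fitted model's `explained_variance_ratio_` attribute. A minimal sketch, using separate illustrative variable names (`X_iris_pca`, `pca_demo`):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris_pca, _ = load_iris(return_X_y=True)

pca_demo = PCA(n_components=2).fit(X_iris_pca)

# Fraction of the total variance captured by each principal component
ratios = pca_demo.explained_variance_ratio_
total_ratio = ratios.sum()

print(f'Variance explained per component: {ratios.round(4)}')
print(f'Total variance kept in 2D:        {total_ratio:.4f}')
```

A total close to 1 indicates that the 2D plot loses little of the spread present in the original four features.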

## Conclusion

This notebook has covered the essential workflow of a scikit-learn project, emphasizing visual exploration and evaluation. The library's consistent API for estimators, transformers, and pipelines makes it an incredibly efficient tool for both beginners and experts.

From here, you can explore the vast number of other algorithms available for classification, regression, clustering, and more, all while using the same fundamental API principles demonstrated here.