# Scikit-learn: Machine Learning in Python

## Introduction

Scikit-learn is one of the most popular and user-friendly machine learning libraries in Python. It provides simple and efficient tools for data analysis and modeling, built on NumPy, SciPy, and matplotlib. This tutorial will guide you through the fundamentals of scikit-learn and demonstrate how to implement various machine learning algorithms for classification, regression, clustering, and more.

## Why Scikit-learn?

Scikit-learn has become the go-to library for machine learning in Python for several reasons:

- **Consistent API**: All algorithms follow a consistent interface, making it easy to switch between different models
- **Comprehensive Documentation**: Extensive documentation with examples and tutorials
- **Active Community**: Large and active community providing support and contributions
- **Integration**: Seamless integration with the scientific Python stack (NumPy, SciPy, Pandas, Matplotlib)
- **Production-Ready**: Robust implementation suitable for both research and production environments
- **Extensive Algorithm Coverage**: Implements a wide range of machine learning algorithms

Let's start by importing scikit-learn and checking its version:

In [None]:
# Import scikit-learn and check version
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

# Import other necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

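# Set plot style
# (the 'seaborn-whitegrid' Matplotlib style was renamed to 'seaborn-v0_8-whitegrid'
# in Matplotlib 3.6, so the style is set through seaborn alone)
sns.set_style("whitegrid")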

## 1. Scikit-learn Core Concepts

Before diving into specific algorithms, let's understand the core concepts and components of scikit-learn:

### Concept: Estimators

In scikit-learn, machine learning algorithms are implemented as Python classes called **estimators**. Supervised estimators (classifiers and regressors) implement the following methods:

- `fit(X, y)`: Fit the model to the training data
- `predict(X)`: Make predictions on new data
- `score(X, y)`: Evaluate the model's performance

Some estimators also implement additional methods like:

- `transform(X)`: Transform the data (e.g., dimensionality reduction)
- `fit_transform(X)`: Fit to data, then transform it
- `predict_proba(X)`: Predict class probabilities

This consistent API makes it easy to use different algorithms with minimal code changes.
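
As a small illustration of that shared interface, the sketch below drives two unrelated classifiers through the same `fit`, `predict`, and `score` calls, and runs a transformer through `fit_transform`. The Iris data, the two models, and all variable names are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# A transformer: fit_transform learns the scaling and applies it in one call
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Two different predictors, driven by exactly the same method calls
api_scores = {}
for estimator in (LogisticRegression(max_iter=200), KNeighborsClassifier()):
    estimator.fit(X_demo_scaled, y_demo)
    first_predictions = estimator.predict(X_demo_scaled[:3])
    # score() returns the mean accuracy for classifiers (here on the training data)
    api_scores[type(estimator).__name__] = estimator.score(X_demo_scaled, y_demo)
    print(type(estimator).__name__, first_predictions)

print(api_scores)
```

Swapping in another classifier only requires changing the tuple of estimators; the loop body stays the same.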

### Concept: Datasets

Scikit-learn provides several built-in datasets for practicing machine learning. These datasets are useful for learning and testing algorithms without having to download external data.

Let's explore some of the built-in datasets:

In [None]:
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
print("Iris dataset:")
print(f"- Number of samples: {iris.data.shape[0]}")
print(f"- Number of features: {iris.data.shape[1]}")
print(f"- Number of classes: {len(np.unique(iris.target))}")
print(f"- Feature names: {iris.feature_names}")
print(f"- Target names: {iris.target_names}")

# Load the digits dataset
digits = datasets.load_digits()
print("\nDigits dataset:")
print(f"- Number of samples: {digits.data.shape[0]}")
print(f"- Number of features: {digits.data.shape[1]}")
print(f"- Number of classes: {len(np.unique(digits.target))}")

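# Load the California housing dataset
# (load_boston was removed in scikit-learn 1.2; fetch_california_housing
# downloads the data on first use and caches it locally)
housing = datasets.fetch_california_housing()
print("\nCalifornia housing dataset:")
print(f"- Number of samples: {housing.data.shape[0]}")
print(f"- Number of features: {housing.data.shape[1]}")
print(f"- Feature names: {housing.feature_names[:5]}... (and more)")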

### Concept: Data Preprocessing

Before applying machine learning algorithms, it's often necessary to preprocess the data. Scikit-learn provides various tools for data preprocessing, including:

- **Standardization**: Rescale each feature to zero mean and unit variance
- **Normalization (min-max scaling)**: Rescale each feature to a fixed range (e.g., [0, 1])
- **Encoding**: Convert categorical variables to numerical
- **Imputation**: Handle missing values
- **Feature Selection**: Select the most relevant features

Let's see some examples of data preprocessing:

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Example data
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Original data:")
print(X)

# Standardization (Z-score normalization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("\nStandardized data (mean=0, std=1):")
print(X_scaled)
print(f"Mean: {X_scaled.mean(axis=0)}")
print(f"Standard deviation: {X_scaled.std(axis=0)}")

# Min-Max scaling (normalization to [0,1] range)
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
print("\nNormalized data (range [0,1]):")
print(X_normalized)
print(f"Min: {X_normalized.min(axis=0)}")
print(f"Max: {X_normalized.max(axis=0)}")

# Handling missing values
X_missing = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
print("\nData with missing values:")
print(X_missing)

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_missing)
print("\nData after imputation (using mean):")
print(X_imputed)
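
The encoding step from the list above can be sketched in the same way. The example below is a minimal illustration, assuming a hypothetical toy table with one categorical column (`city`) and one numeric column (`age`); it uses `ColumnTransformer` to one-hot encode the former and standardize the latter.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column and one numeric column
df_demo = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Paris", "Lima"],
    "age": [25, 32, 47, 51],
})

# Route each column to the appropriate transformer;
# sparse_threshold=0 forces a dense array so the result is easy to print
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("num", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,
)

X_encoded = preprocessor.fit_transform(df_demo)
print(X_encoded.shape)  # one column per distinct city, plus the scaled age
print(X_encoded)
```

Setting `handle_unknown="ignore"` means a category unseen during fitting is encoded as all zeros at transform time rather than raising an error.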

### Concept: Train-Test Split

To evaluate a machine learning model properly, we need to split our data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data.

Scikit-learn provides the `train_test_split` function for this purpose:

In [None]:
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Total dataset size: {X.shape[0]} samples")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

# Check class distribution in original dataset
print("\nClass distribution in original dataset:")
for i, target_name in enumerate(iris.target_names):
    print(f"- {target_name}: {np.sum(y == i)} samples")

# Check class distribution in training set
print("\nClass distribution in training set:")
for i, target_name in enumerate(iris.target_names):
    print(f"- {target_name}: {np.sum(y_train == i)} samples")

# Check class distribution in testing set
print("\nClass distribution in testing set:")
for i, target_name in enumerate(iris.target_names):
    print(f"- {target_name}: {np.sum(y_test == i)} samples")
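
A plain random split does not guarantee that every class is represented in the same proportion in both subsets. One common remedy, sketched below under the assumption that the same Iris data is used, is the `stratify` argument of `train_test_split`. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# stratify=y_iris keeps the class proportions identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts:", np.bincount(y_te))
```

Because Iris has 50 samples per class, each class contributes 40 training and 10 test samples with this 80/20 stratified split.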

### Concept: Cross-Validation

Cross-validation is a technique for evaluating machine learning models by training several models on different subsets of the available data and evaluating them on the complementary subset. This helps to assess how the model will generalize to an independent dataset.

Scikit-learn provides several cross-validation strategies:

In [None]:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create a model
model = LogisticRegression(max_iter=200, random_state=42)

# Perform 5-fold cross-validation
# (for classifiers, an integer cv uses StratifiedKFold without shuffling)
cv_scores = cross_val_score(model, X, y, cv=5)
print("5-fold cross-validation scores:")
print(cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")

# K-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X, y, cv=kf)
print("\nK-fold cross-validation scores:")
print(kf_scores)
print(f"Mean accuracy: {kf_scores.mean():.4f}")

# Stratified K-fold cross-validation (preserves class distribution)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model, X, y, cv=skf)
print("\nStratified K-fold cross-validation scores:")
print(skf_scores)
print(f"Mean accuracy: {skf_scores.mean():.4f}")
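
To make the mechanics concrete, the loop below is a rough sketch of what a helper like `cross_val_score` does for each fold: clone the model, fit it on the training indices, and score it on the held-out indices. It is a simplified illustration, not the library's actual implementation, and the variable names are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_cv, y_cv = load_iris(return_X_y=True)
base_model = LogisticRegression(max_iter=200)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in splitter.split(X_cv, y_cv):
    fold_model = clone(base_model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

manual_scores = np.array(manual_scores)
print("Per-fold accuracy:", manual_scores)
print(f"Mean accuracy: {manual_scores.mean():.4f}")
```

Every sample is used for testing exactly once, which is why the mean over folds is a more stable estimate than a single train-test split.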

## 2. Classification Algorithms

Classification is a supervised learning task where the goal is to predict the category (class) of new observations based on training data. Scikit-learn provides many classification algorithms, including:

- Logistic Regression
- Support Vector Machines (SVM)
- Decision Trees
- Random Forests
- K-Nearest Neighbors (KNN)
- Naive Bayes

Let's implement some of these algorithms on the Iris dataset:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train different classifiers
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=200, random_state=42),
    "Support Vector Machine": SVC(kernel='rbf', random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB()
}

# Train and evaluate each classifier
for name, clf in classifiers.items():
    # Train the classifier
    clf.fit(X_train, y_train)
    
    # Make predictions
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Print results
    print(f"\n{name}:")
    print(f"Accuracy: {accuracy:.4f}")
    
    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
    
    # Print confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(cm)
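
Many classifiers can also report how confident they are in each prediction. The sketch below assumes a logistic regression model on the same Iris split and shows how `predict_proba` relates to `predict`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_p, y_p = load_iris(return_X_y=True)
X_p_train, X_p_test, y_p_train, y_p_test = train_test_split(
    X_p, y_p, test_size=0.3, random_state=42
)

proba_clf = LogisticRegression(max_iter=200).fit(X_p_train, y_p_train)

# One row per sample, one column per class (ordered as in proba_clf.classes_)
proba = proba_clf.predict_proba(X_p_test)

# Picking the most probable class reproduces the hard predictions
labels_from_proba = proba_clf.classes_[np.argmax(proba, axis=1)]

print(proba[:3].round(3))
print("Rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
print("Matches predict():", np.array_equal(labels_from_proba, proba_clf.predict(X_p_test)))
```

Not every estimator exposes `predict_proba`; for example, `SVC` only provides it when constructed with `probability=True`.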

### Concept: Decision Boundaries

Let's visualize the decision boundaries of different classifiers on a 2D projection of the Iris dataset:

In [None]:
from sklearn.decomposition import PCA

# Use PCA to reduce the iris dataset to 2 dimensions for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Define a function to plot decision boundaries
def plot_decision_boundary(clf, X, y, title):
    h = 0.02  # step size in the mesh
    
    # Create a mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Predict class for each point in the mesh
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot the decision boundary
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)
    
    # Plot the training points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdYlBu)
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.title(title)
    # Pass the labels as a plain list; some Matplotlib versions reject NumPy arrays here
    plt.legend(handles=scatter.legend_elements()[0], labels=list(iris.target_names))
    plt.tight_layout()
    plt.show()

# Train classifiers on the 2D data and plot decision boundaries
for name, clf in classifiers.items():
    clf.fit(X_2d, y)
    plot_decision_boundary(clf, X_2d, y, f"Decision Boundary - {name}")

## 3. Regression Algorithms

Regression is a supervised learning task where the goal is to predict continuous values. Scikit-learn provides several regression algorithms, including:

- Linear Regression
- Ridge Regression
- Lasso Regression
- Support Vector Regression (SVR)
- Decision Tree Regressor
- Random Forest Regressor

Let's implement some of these algorithms on the California Housing dataset:

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Load the California Housing dataset (downloaded and cached on first use).
# load_boston, used in older tutorials, was removed in scikit-learn 1.2.
housing = datasets.fetch_california_housing()
X = housing.data
y = housing.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train different regressors
# (SVR may take noticeably longer than the others on a dataset of this size)
regressors = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0, random_state=42),
    "Lasso Regression": Lasso(alpha=0.1, random_state=42),
    "Support Vector Regression": SVR(kernel='rbf'),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42)
}

# Train and evaluate each regressor
for name, reg in regressors.items():
    # Train the regressor
    reg.fit(X_train_scaled, y_train)

    # Make predictions
    y_pred = reg.predict(X_test_scaled)

    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Print results
    print(f"\n{name}:")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")

    # Plot actual vs predicted values
    plt.figure(figsize=(10, 6))
    plt.scatter(y_test, y_pred, alpha=0.7)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title(f"{name} - Actual vs Predicted Values")
    plt.tight_layout()
    plt.show()

### Concept: Feature Importance

Some regression models, like Random Forest, provide feature importance scores that indicate how much each feature contributed to the splits made while building the model. Let's visualize the feature importance for the California Housing dataset:

In [None]:
# Train a Random Forest regressor (tree-based models do not need scaled features)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)

# Get feature importances
importances = rf_reg.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [housing.feature_names[i] for i in indices]

# Create plot
plt.figure(figsize=(12, 8))
plt.title("Feature Importance for California Housing Dataset")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), names, rotation=90)
plt.tight_layout()
plt.show()

# Print feature ranking
print("Feature ranking:")
for i in range(X.shape[1]):
    print(f"{i+1}. {names[i]} ({importances[indices[i]]:.4f})")
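
To see how these scores behave in a controlled setting, the sketch below fits a random forest on synthetic data in which only the first column carries signal. The data generator settings and variable names are assumptions made for the illustration; the point is that the importances are normalized to sum to 1 and concentrate on the informative feature.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: with shuffle=False the single informative feature is column 0
X_syn, y_syn = make_regression(
    n_samples=300, n_features=5, n_informative=1,
    shuffle=False, random_state=0,
)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_syn, y_syn)
syn_importances = forest.feature_importances_

print("Importances:", syn_importances.round(3))
print("Sum of importances:", syn_importances.sum())
print("Most important column:", int(np.argmax(syn_importances)))
```

These impurity-based scores are computed from the training data, so they are best read as a description of the fitted model, not as proof that a feature matters on unseen data.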

## 4. Clustering Algorithms

Clustering is an unsupervised learning task where the goal is to group similar data points together. Scikit-learn provides several clustering algorithms, including:

- K-Means
- DBSCAN
- Hierarchical Clustering
- Gaussian Mixture Models

Let's implement some of these algorithms on a synthetic dataset:

In [None]:
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Generate a synthetic dataset with 3 clusters
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Plot the original data
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', edgecolors='k', s=50)
plt.title('Original Data with 3 Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='True Cluster')
plt.tight_layout()
plt.show()

# Create and train different clustering algorithms
# (n_init is set explicitly because its default changed across scikit-learn versions)
clustering_algorithms = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "Agglomerative Clustering": AgglomerativeClustering(n_clusters=3),
    "Gaussian Mixture Model": GaussianMixture(n_components=3, random_state=42)
}

# Train and evaluate each clustering algorithm
for name, algorithm in clustering_algorithms.items():
    # Fit the algorithm
    algorithm.fit(X)
    
    # Get cluster labels
    if name == "Gaussian Mixture Model":
        y_pred = algorithm.predict(X)
    else:
        y_pred = algorithm.labels_
    
    # Calculate silhouette score (if there are at least 2 clusters and not all points are in the same cluster)
    n_clusters = len(np.unique(y_pred))
    if n_clusters > 1 and n_clusters < len(X):
        silhouette_avg = silhouette_score(X, y_pred)
        print(f"{name} - Silhouette Score: {silhouette_avg:.4f}")
    else:
        print(f"{name} - Silhouette Score: N/A (need at least 2 clusters)")
    
    # Plot the clusters
    plt.figure(figsize=(10, 8))
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', edgecolors='k', s=50)
    plt.title(f'Clusters Identified by {name}')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.colorbar(label='Cluster')
    plt.tight_layout()
    plt.show()
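
One detail worth handling explicitly is that DBSCAN assigns the label `-1` to points it considers noise, so counting unique labels can overstate the number of clusters. The sketch below, which assumes the same synthetic blobs and DBSCAN settings as above, separates noise from real clusters before scoring.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit(X_blobs).labels_

# The label -1 marks noise points and is not a real cluster
n_noise = int(np.sum(db_labels == -1))
n_db_clusters = len(set(db_labels)) - (1 if n_noise > 0 else 0)
print(f"Clusters found: {n_db_clusters}, noise points: {n_noise}")

# Score only the points that were actually assigned to a cluster
clustered = db_labels != -1
if n_db_clusters >= 2:
    score = silhouette_score(X_blobs[clustered], db_labels[clustered])
    print(f"Silhouette score (noise excluded): {score:.4f}")
```

Whether to exclude noise from the score is a design choice; excluding it measures the quality of the clusters found, while including it penalizes the algorithm for leaving points unassigned.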

### Concept: Determining the Optimal Number of Clusters

For algorithms like K-Means, we need to specify the number of clusters in advance. The Elbow Method and Silhouette Analysis are common techniques for determining the optimal number of clusters:

In [None]:
# Elbow Method for K-Means
inertia = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot the Elbow Method
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'o-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)

# Plot the Silhouette Scores
plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'o-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal k')
plt.grid(True)

plt.tight_layout()
plt.show()

# Print the results
print("Elbow Method and Silhouette Analysis Results:")
for k, inert, silhouette in zip(k_range, inertia, silhouette_scores):
    print(f"k={k}: Inertia={inert:.2f}, Silhouette Score={silhouette:.4f}")
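
The silhouette curve can also be read programmatically. As a minimal sketch, assuming the same synthetic blobs and candidate range as above, the code below simply picks the `k` with the highest average silhouette score. This is a heuristic, and the result should still be checked against the elbow plot and domain knowledge.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_k, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

candidate_ks = list(range(2, 11))
sil_by_k = []
for k in candidate_ks:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_k)
    sil_by_k.append(silhouette_score(X_k, labels_k))

# Choose the candidate with the highest average silhouette score
best_k = candidate_ks[int(np.argmax(sil_by_k))]
print(f"Best k by silhouette score: {best_k}")
```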

## 5. Dimensionality Reduction

Dimensionality reduction techniques are used to reduce the number of features in a dataset while preserving as much information as possible. Scikit-learn provides several dimensionality reduction algorithms, including:

- Principal Component Analysis (PCA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Truncated Singular Value Decomposition (TruncatedSVD)

Let's implement some of these algorithms on the digits dataset:

In [None]:
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

# Load the digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Standardize the data
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Apply TruncatedSVD
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X_scaled)

# Plot the results
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# PCA
scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolors='k', alpha=0.7)
axes[0].set_title('PCA')
axes[0].set_xlabel('First Principal Component')
axes[0].set_ylabel('Second Principal Component')
axes[0].grid(True)

# t-SNE
scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', edgecolors='k', alpha=0.7)
axes[1].set_title('t-SNE')
axes[1].set_xlabel('First t-SNE Component')
axes[1].set_ylabel('Second t-SNE Component')
axes[1].grid(True)

# TruncatedSVD
scatter3 = axes[2].scatter(X_svd[:, 0], X_svd[:, 1], c=y, cmap='viridis', edgecolors='k', alpha=0.7)
axes[2].set_title('TruncatedSVD')
axes[2].set_xlabel('First SVD Component')
axes[2].set_ylabel('Second SVD Component')
axes[2].grid(True)

# Add a colorbar
plt.colorbar(scatter1, ax=axes[0], label='Digit')
plt.colorbar(scatter2, ax=axes[1], label='Digit')
plt.colorbar(scatter3, ax=axes[2], label='Digit')

plt.tight_layout()
plt.show()

# For PCA, let's also look at the explained variance ratio
pca_full = PCA().fit(X_scaled)
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca_full.explained_variance_ratio_), 'o-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.grid(True)
plt.show()

# Print the explained variance ratio for the first few components
print("Explained variance ratio for the first 10 components:")
for i, ratio in enumerate(pca_full.explained_variance_ratio_[:10]):
    print(f"Component {i+1}: {ratio:.4f} ({ratio*100:.2f}%)")

print(f"\nCumulative explained variance with 10 components: {np.sum(pca_full.explained_variance_ratio_[:10]):.4f} ({np.sum(pca_full.explained_variance_ratio_[:10])*100:.2f}%)")
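
Instead of reading the number of components off the cumulative curve, PCA can be asked to choose it. The sketch below assumes a target of 95% explained variance on the standardized digits data; passing a float between 0 and 1 as `n_components` makes PCA keep the smallest number of components that reaches that fraction.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_digits_scaled = StandardScaler().fit_transform(load_digits().data)

# A float in (0, 1) selects the number of components by explained variance;
# the full SVD solver is requested explicitly because it supports this mode
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_digits_scaled)

print(f"Components kept: {pca_95.n_components_} of {X_digits_scaled.shape[1]}")
print(f"Variance retained: {pca_95.explained_variance_ratio_.sum():.4f}")
```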

## 6. Model Selection and Hyperparameter Tuning

Hyperparameter tuning is the process of finding the optimal hyperparameters for a machine learning algorithm. Scikit-learn provides several tools for hyperparameter tuning, including:

- Grid Search
- Randomized Search

Let's use these techniques to optimize a Support Vector Machine (SVM) classifier on the digits dataset:

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from scipy.stats import uniform, randint

# Load the digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create an SVM classifier
svm = SVC(random_state=42)

# Define the parameter grid for Grid Search
# (gamma has no effect with the linear kernel, so those combinations are redundant but harmless)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Perform Grid Search
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and score
print("Grid Search Results:")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print("\nTest set performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

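# Define the parameter distributions for Randomized Search
# Note: scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {
    'C': uniform(0.1, 100),      # C in [0.1, 100.1]
    'gamma': uniform(0.001, 1),  # gamma in [0.001, 1.001]
    'kernel': ['rbf', 'linear']
}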

# Perform Randomized Search
random_search = RandomizedSearchCV(svm, param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42, verbose=1)
random_search.fit(X_train_scaled, y_train)

# Print the best parameters and score
print("\nRandomized Search Results:")
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_model_random = random_search.best_estimator_
y_pred_random = best_model_random.predict(X_test_scaled)
print("\nTest set performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_random):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_random))
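
Parameters such as `C` and `gamma` typically matter on a multiplicative scale, so sampling them uniformly spends most of the budget on large values. A common alternative, sketched below, is to sample from a log-uniform distribution with `scipy.stats.loguniform`. The ranges, the reduced `n_iter` and `cv` values, and the variable names are assumptions chosen to keep the example quick.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_d, y_d = load_digits(return_X_y=True)
X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
    X_d, y_d, test_size=0.3, random_state=42
)
scaler_d = StandardScaler().fit(X_d_train)

# loguniform(a, b) spreads samples evenly across orders of magnitude in [a, b]
log_param_dist = {
    "C": loguniform(1e-1, 1e2),
    "gamma": loguniform(1e-3, 1e0),
}

log_search = RandomizedSearchCV(
    SVC(kernel="rbf"), log_param_dist,
    n_iter=10, cv=3, scoring="accuracy", random_state=42,
)
log_search.fit(scaler_d.transform(X_d_train), y_d_train)

print(f"Best parameters: {log_search.best_params_}")
print(f"Best cross-validation score: {log_search.best_score_:.4f}")
```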

## 7. Pipelines

Scikit-learn pipelines allow you to chain multiple steps together, such as preprocessing, feature selection, and model training. This helps to prevent data leakage and makes your code more organized.

Let's create a pipeline for a classification task on the breast cancer dataset:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the pipeline
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

# Get the selected features
selected_features = pipeline.named_steps['feature_selection'].get_support()
selected_feature_names = [cancer.feature_names[i] for i in range(len(cancer.feature_names)) if selected_features[i]]
print("\nSelected features:")
for feature in selected_feature_names:
    print(f"- {feature}")

# Perform hyperparameter tuning on the pipeline
param_grid = {
    'feature_selection__k': [5, 10, 15, 20],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Print the best parameters and score
print("\nGrid Search Results:")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate the best pipeline on the test set
best_pipeline = grid_search.best_estimator_
y_pred_best = best_pipeline.predict(X_test)
print("\nTest set performance of best pipeline:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best, target_names=cancer.target_names))
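
The leakage point deserves a concrete illustration. When a pipeline is passed to a cross-validation helper, every step is re-fitted on the training portion of each fold, so statistics such as the scaler's mean and variance never see the held-out portion. The sketch below assumes a simple scaler-plus-logistic-regression pipeline on the breast cancer data; the names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# The scaler is re-fitted inside every fold, on that fold's training part only
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
leak_free_scores = cross_val_score(leak_free, X_bc, y_bc, cv=5)

print("Per-fold accuracy:", leak_free_scores.round(4))
print(f"Mean accuracy: {leak_free_scores.mean():.4f}")
```

Scaling the full dataset once before cross-validating would instead let information from the held-out folds influence the preprocessing, which tends to make the scores slightly optimistic.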

## 8. Model Persistence

Once you've trained a model, you might want to save it for later use without having to retrain it. The `joblib` library, which is installed as a dependency of scikit-learn, is commonly used for model persistence:

In [None]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import os

# Create a directory for saving models if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save the best pipeline to a file
joblib.dump(best_pipeline, 'models/best_pipeline.pkl')

# Load the model from the file
loaded_model = joblib.load('models/best_pipeline.pkl')

# Verify that the loaded model works correctly
y_pred_loaded = loaded_model.predict(X_test)
accuracy_loaded = accuracy_score(y_test, y_pred_loaded)
print(f"Loaded model accuracy: {accuracy_loaded:.4f}")

# Verify that the predictions are the same
print(f"Are predictions identical? {np.array_equal(y_pred_best, y_pred_loaded)}")

## Practice Problems

Now that you've learned the fundamentals of scikit-learn, try solving these practice problems to test your understanding.

### Problem 1: Wine Classification

Use the wine dataset from scikit-learn to train a classifier that can predict the wine type based on its chemical properties. Compare the performance of at least three different classification algorithms.

In [None]:
# Example solution
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create classifiers
classifiers = {
    "Support Vector Machine": SVC(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

# Train and evaluate each classifier
for name, clf in classifiers.items():
    # Train the classifier
    clf.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = clf.predict(X_test_scaled)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Perform cross-validation
    cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5)
    
    # Print results
    print(f"\n{name}:")
    print(f"Test accuracy: {accuracy:.4f}")
    print(f"Cross-validation accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
    
    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=wine.target_names))
    
    # Print confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(cm)

### Problem 2: Diabetes Regression

Use the diabetes dataset from scikit-learn to train a regression model that can predict the disease progression based on patient data. Implement a pipeline that includes feature scaling and feature selection.

In [None]:
# Example solution
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline with scaling, feature selection, and regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_regression, k=5)),
    ('regressor', Ridge(alpha=1.0, random_state=42))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print results
print("Pipeline Results:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

# Perform cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)
print(f"\nCross-validation RMSE: {cv_rmse.mean():.4f} ± {cv_rmse.std():.4f}")

# Get the selected features
selected_features = pipeline.named_steps['feature_selection'].get_support()
selected_feature_names = [diabetes.feature_names[i] for i in range(len(diabetes.feature_names)) if selected_features[i]]
print("\nSelected features:")
for feature in selected_feature_names:
    print(f"- {feature}")

# Plot actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values for Diabetes Progression')
plt.grid(True)
plt.show()

# Plot residuals
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True)
plt.show()
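
The pipeline above fixes `k=5` and `alpha=1.0` by hand. Because every step lives inside one `Pipeline`, both values can also be tuned together with `GridSearchCV`, which refits the scaler and the feature selector inside each fold. The following is a minimal sketch under stated assumptions: the grid values are illustrative choices for this example, not tuned recommendations, and the data is reloaded so the block runs on its own.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reload the data so this sketch is self-contained
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Same three steps as the solution above
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_regression)),
    ('regressor', Ridge()),
])

# Step parameters are addressed as <step name>__<parameter name>.
# These grid values are assumptions made for illustration only.
param_grid = {
    'feature_selection__k': [3, 5, 7, 10],
    'regressor__alpha': [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring='neg_root_mean_squared_error'
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
# best_score_ is the negated mean RMSE across the folds
print(f"Best cross-validation RMSE: {-search.best_score_:.4f}")
# best_estimator_ is the pipeline refit on all training data;
# its score() method reports R² for a regressor
print(f"Test R² of the best pipeline: {search.best_estimator_.score(X_test, y_test):.4f}")
```

Note that `search.score(...)` would report the negated RMSE, because it uses the `scoring` argument, which is why the sketch calls `score` on `best_estimator_` to get R².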

### Problem 3: Clustering Countries

Use scikit-learn's clustering algorithms to group countries based on socio-economic indicators. You can use World Bank or UN datasets, or create a synthetic dataset.

In [None]:
# Your solution here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

# Create a synthetic dataset of countries with socio-economic indicators
np.random.seed(42)
n_countries = 50

# Generate synthetic data
gdp_per_capita = np.random.exponential(scale=15000, size=n_countries)
life_expectancy = 50 + 30 * np.random.beta(5, 2, size=n_countries)
literacy_rate = 40 + 60 * np.random.beta(5, 2, size=n_countries)
infant_mortality = np.random.exponential(scale=30, size=n_countries)
internet_users = np.random.beta(2, 5, size=n_countries) * 100

# Create a DataFrame
countries = [f"Country_{i+1}" for i in range(n_countries)]
data = pd.DataFrame({
    'Country': countries,
    'GDP_per_capita': gdp_per_capita,
    'Life_expectancy': life_expectancy,
    'Literacy_rate': literacy_rate,
    'Infant_mortality': infant_mortality,
    'Internet_users': internet_users
})

# Display the first few rows
print("Synthetic Country Dataset:")
print(data.head())

# Prepare the data for clustering
X = data.drop('Country', axis=1).values
country_names = data['Country'].values

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the optimal number of clusters using the Elbow Method and silhouette scores
inertia = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Plot the Elbow Method and Silhouette Scores
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'o-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'o-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal k')
plt.grid(True)

plt.tight_layout()
plt.show()

# Choose the number of clusters (k=4 is used here as an example; pick the value the elbow and silhouette plots suggest)
optimal_k = 4

# Apply K-Means clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Apply Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=optimal_k)
hierarchical_labels = hierarchical.fit_predict(X_scaled)

# Add cluster labels to the DataFrame
data['KMeans_Cluster'] = kmeans_labels
data['Hierarchical_Cluster'] = hierarchical_labels

# Apply PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the clusters in 2D
plt.figure(figsize=(15, 6))

# K-Means
plt.subplot(1, 2, 1)
for i in range(optimal_k):
    plt.scatter(X_pca[kmeans_labels == i, 0], X_pca[kmeans_labels == i, 1], label=f'Cluster {i+1}')
plt.title('K-Means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)

# Hierarchical
plt.subplot(1, 2, 2)
for i in range(optimal_k):
    plt.scatter(X_pca[hierarchical_labels == i, 0], X_pca[hierarchical_labels == i, 1], label=f'Cluster {i+1}')
plt.title('Hierarchical Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# Analyze the clusters
print("\nK-Means Cluster Analysis:")
for i in range(optimal_k):
    cluster_data = data[data['KMeans_Cluster'] == i].drop(['KMeans_Cluster', 'Hierarchical_Cluster'], axis=1)
    print(f"\nCluster {i+1} ({len(cluster_data)} countries):")
    print(cluster_data.describe().loc[['mean', 'std']].round(2))
    print(f"Countries in this cluster: {', '.join(cluster_data['Country'].values[:5])}" + 
          ("..." if len(cluster_data) > 5 else ""))
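
The solution above produces two labelings, one from K-Means and one from hierarchical clustering, but never measures how much they agree. One way to do that is the Adjusted Rand Index (`adjusted_rand_score`), which compares two partitions regardless of how the clusters are numbered. The sketch below is an illustration under stated assumptions: it uses hypothetical stand-in data from `make_blobs` with hand-picked centers rather than the country dataset, so that it runs on its own.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: four well-separated blobs (centers chosen for illustration)
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X_demo, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Cluster the same data with both algorithms
kmeans_demo = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo_scaled)
hier_demo = AgglomerativeClustering(n_clusters=4).fit_predict(X_demo_scaled)

# Values near 1 mean the two partitions largely agree;
# values near 0 mean agreement is about what chance would give
ari = adjusted_rand_score(kmeans_demo, hier_demo)
print(f"Adjusted Rand Index (K-Means vs hierarchical): {ari:.3f}")

# Renumbering the clusters does not change the score
relabelled = (kmeans_demo + 1) % 4
ari_relabelled = adjusted_rand_score(kmeans_demo, relabelled)
print(f"ARI after renumbering the same clusters: {ari_relabelled:.3f}")
```

On the country data, the same call would be `adjusted_rand_score(kmeans_labels, hierarchical_labels)`. A low score there would suggest that the cluster structure is weak and that the groupings depend heavily on the algorithm.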

## Additional Resources

To further enhance your scikit-learn skills, check out these resources:

- [Scikit-learn Official Documentation](https://scikit-learn.org/stable/documentation.html)
- [Scikit-learn Tutorials](https://scikit-learn.org/stable/tutorial/index.html)
- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Scikit-learn Examples](https://scikit-learn.org/stable/auto_examples/index.html)
- [Python Machine Learning (Book) by Sebastian Raschka](https://sebastianraschka.com/books.html)
- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Book) by Aurélien Géron](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)