# Chapter 2: Pre-Model Workflow and Data Preprocessing

This notebook provides "recipes" for using the scikit-learn Python library to preprocess data before modeling. Each recipe includes explanations, code examples, visualizations, best practices, and common pitfalls.

## Recipe 1: Handling Missing Data

In this section, we will explore different strategies for handling missing data using scikit-learn's imputation tools.

### SimpleImputer

The `SimpleImputer` class provides basic strategies for imputing missing values. It can replace missing values using a constant, the mean, median, or most frequent value of each column.

In [1]:
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Create a sample dataset with missing values
data = {"Feature1": [1, 2, np.nan, 4], "Feature2": [np.nan, 2, 3, 4]}
df = pd.DataFrame(data)

# Initialize the SimpleImputer
imputer = SimpleImputer(strategy="mean")

# Fit and transform the data
imputed_data = imputer.fit_transform(df)
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
imputed_df

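
The cell above only exercises the mean strategy. As a minimal sketch of the other strategies named earlier, the snippet below runs all four on a small frame. The `toy` frame, its values, and the `strategy_results` dictionary are assumptions made up for illustration; they are not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame, kept separate from the notebook's df
toy = pd.DataFrame(
    {"Feature1": [1.0, 2.0, np.nan, 10.0], "Feature2": [3.0, np.nan, 3.0, 5.0]}
)

strategy_results = {}

# Statistics-based strategies learn one fill value per column
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy)
    strategy_results[strategy] = pd.DataFrame(
        imp.fit_transform(toy), columns=toy.columns
    )

# The constant strategy ignores the data and uses fill_value instead
constant_imp = SimpleImputer(strategy="constant", fill_value=0.0)
strategy_results["constant"] = pd.DataFrame(
    constant_imp.fit_transform(toy), columns=toy.columns
)

# Show the value each strategy placed in the two originally missing cells
for name, frame in strategy_results.items():
    print(name, frame.loc[2, "Feature1"], frame.loc[1, "Feature2"])
```

Because `Feature1` contains an outlier (10.0), the mean and median fills differ noticeably, which is often the deciding factor between the two.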

### KNNImputer

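The `KNNImputer` class uses a k-Nearest Neighbors approach to impute missing values. With the default settings, each missing value is filled with the mean of that feature across the `n_neighbors` most similar rows, where similarity is measured with a Euclidean distance that skips missing coordinates.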

In [None]:
from sklearn.impute import KNNImputer

# Initialize the KNNImputer
knn_imputer = KNNImputer(n_neighbors=2)

# Fit and transform the data
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)
knn_imputed_df
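
To make the neighbor averaging concrete, here is a small sketch on a separate array (`X_demo` is an illustrative assumption, not the notebook's `df`). The comments trace which rows end up as donors for each missing cell.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical 4x3 array with two missing cells
X_demo = np.array(
    [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]], dtype=float
)

knn_demo = KNNImputer(n_neighbors=2)
X_filled = knn_demo.fit_transform(X_demo)

# Row 0, column 2: the two closest rows are rows 1 and 2, so (3 + 5) / 2 = 4.0
# Row 2, column 0: the two closest rows are rows 1 and 3, so (3 + 8) / 2 = 5.5
print(X_filled)
```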

### IterativeImputer

The `IterativeImputer` class models each feature with missing values as a function of other features, and iteratively estimates missing values.

In [None]:
# IterativeImputer is experimental: this import must run first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Initialize the IterativeImputer
iterative_imputer = IterativeImputer()

# Fit and transform the data
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = pd.DataFrame(iterative_imputed_data, columns=df.columns)
iterative_imputed_df

### Visualizations Comparing Strategies

Visualize the differences in imputation strategies using bar plots.

In [None]:
import matplotlib.pyplot as plt

# Plot the imputed data
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
imputed_df.plot(kind="bar", ax=axes[0], title="SimpleImputer")
knn_imputed_df.plot(kind="bar", ax=axes[1], title="KNNImputer")
iterative_imputed_df.plot(kind="bar", ax=axes[2], title="IterativeImputer")
plt.tight_layout()
plt.show()

## Recipe 2: Scaling and Normalization

Scaling and normalization are crucial preprocessing steps for many machine learning models. Distance-based and margin-based algorithms such as k-NN and SVM are sensitive to feature magnitudes, so without rescaling, a feature measured in large units can dominate the distance calculation.

### StandardScaler

The `StandardScaler` standardizes features by removing the mean and scaling to unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the mean-imputed data from Recipe 1 (no NaN values)
scaled_data = scaler.fit_transform(imputed_df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
scaled_df
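
"Removing the mean and scaling to unit variance" corresponds to computing z = (x - mean) / std for each column. The sketch below checks that formula against `StandardScaler` on a made-up array (`X_demo` is an assumption for illustration). Note that the scaler uses the population standard deviation, which is NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns on very different scales
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaled_by_sklearn = StandardScaler().fit_transform(X_demo)

# Same computation by hand, column by column (population std, ddof=0)
scaled_by_hand = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled_by_sklearn, scaled_by_hand))
```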

### MinMaxScaler

The `MinMaxScaler` transforms features by scaling each feature to a given range, often between zero and one.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
minmax_scaler = MinMaxScaler()

# Fit and transform the mean-imputed data from Recipe 1 (no NaN values)
minmax_scaled_data = minmax_scaler.fit_transform(imputed_df)
minmax_scaled_df = pd.DataFrame(minmax_scaled_data, columns=df.columns)
minmax_scaled_df
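
The target range is controlled by the `feature_range` parameter. As a minimal sketch, the snippet below rescales a made-up array (`X_demo`, an assumption for illustration) to [-1, 1] and reproduces the result by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with different per-column ranges
X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [4.0, 500.0]])

low, high = -1.0, 1.0
range_scaler = MinMaxScaler(feature_range=(low, high))
scaled_by_sklearn = range_scaler.fit_transform(X_demo)

# Map each column to [0, 1] first, then stretch and shift to [low, high]
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
scaled_by_hand = (X_demo - col_min) / (col_max - col_min) * (high - low) + low

print(np.allclose(scaled_by_sklearn, scaled_by_hand))
```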

### Normalizer

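The `Normalizer` rescales each sample (row) independently so that its norm (L2 by default) equals one. Unlike the scalers above, it works row by row rather than column by column.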

In [None]:
from sklearn.preprocessing import Normalizer

# Initialize the Normalizer
normalizer = Normalizer()

# Fit and transform the mean-imputed data from Recipe 1 (Normalizer rejects NaN values)
normalized_data = normalizer.fit_transform(imputed_df)
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)
normalized_df
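
A quick way to see the row-wise behavior is to check the norms after the transform. This sketch uses a made-up array (`X_demo` is an assumption for illustration) whose rows have easy-to-verify lengths.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical rows with L2 norms 5, 1, and 13
X_demo = np.array([[3.0, 4.0], [1.0, 0.0], [5.0, 12.0]])

normalized = Normalizer(norm="l2").fit_transform(X_demo)

# Every row should now have length 1; the first row becomes [0.6, 0.8]
row_norms = np.linalg.norm(normalized, axis=1)
print(normalized)
print(row_norms)
```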

### Visual Comparisons

Visualize the effects of different scaling and normalization techniques.

In [None]:
# Plot the scaled data
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
scaled_df.plot(kind="bar", ax=axes[0], title="StandardScaler")
minmax_scaled_df.plot(kind="bar", ax=axes[1], title="MinMaxScaler")
normalized_df.plot(kind="bar", ax=axes[2], title="Normalizer")
plt.tight_layout()
plt.show()

## Recipe 3: Encoding Categorical Variables

Encoding categorical variables is essential for converting non-numeric data into a format that can be used by machine learning algorithms.

### OneHotEncoder

The `OneHotEncoder` converts categorical values into a one-hot numeric array.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Sample categorical data
categorical_data = pd.DataFrame({"Category": ["A", "B", "A", "C"]})

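# Initialize the OneHotEncoder with dense output
# (the parameter is `sparse_output` in scikit-learn 1.2+; older releases used `sparse`)
onehot_encoder = OneHotEncoder(sparse_output=False)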

# Fit and transform the data
onehot_encoded_data = onehot_encoder.fit_transform(categorical_data)
onehot_encoded_df = pd.DataFrame(
    onehot_encoded_data, columns=onehot_encoder.get_feature_names_out()
)
onehot_encoded_df

### LabelEncoder

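The `LabelEncoder` encodes target labels with values between 0 and n_classes-1. It is intended for the target `y`, not for input features; to integer-encode feature columns, use `OrdinalEncoder` instead.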

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
label_encoded_data = label_encoder.fit_transform(categorical_data["Category"])
label_encoded_data
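
The difference between encoding a target and encoding a feature column can be sketched as follows. The `features` frame mirrors the sample above, while the `target` labels are hypothetical values invented for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

features = pd.DataFrame({"Category": ["A", "B", "A", "C"]})
target = ["spam", "ham", "spam", "ham"]  # hypothetical class labels

# OrdinalEncoder works on 2D feature data and returns a 2D float array
ordinal_encoder = OrdinalEncoder()
encoded_features = ordinal_encoder.fit_transform(features)

# LabelEncoder works on a 1D target; classes are sorted, so ham=0 and spam=1
target_encoder = LabelEncoder()
encoded_target = target_encoder.fit_transform(target)

# inverse_transform recovers the original labels from the integer codes
decoded_target = target_encoder.inverse_transform(encoded_target)

print(encoded_features.ravel(), encoded_target, decoded_target)
```

Keep in mind that integer codes imply an ordering, which linear models may treat as meaningful. For nominal categories, one-hot encoding is usually the safer default.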

### ColumnTransformer

The `ColumnTransformer` applies different transformers to different columns (or column subsets) of the input and concatenates their outputs into a single feature matrix.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample mixed data
mixed_data = pd.DataFrame({"Numeric": [1, 2, 3, 4], "Category": ["A", "B", "A", "C"]})

# Initialize the ColumnTransformer; sparse_threshold=0 always returns a dense array
column_transformer = ColumnTransformer(
    [("num", StandardScaler(), ["Numeric"]), ("cat", OneHotEncoder(), ["Category"])],
    sparse_threshold=0,
)

# Fit and transform the data
transformed_data = column_transformer.fit_transform(mixed_data)

# Take the column names from the fitted transformer itself
transformed_df = pd.DataFrame(
    transformed_data, columns=column_transformer.get_feature_names_out()
)
transformed_df

### Target Encoding Considerations

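Target encoding replaces each category with a statistic of the target, typically the mean target value observed for that category. It can be useful for high-cardinality features, but it is risky: if a row's own target value contributes to its encoding, information leaks from the target into the features. Computing the encodings out of fold, and only ever on training data, reduces this risk.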
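
The following is a minimal sketch of out-of-fold target encoding written with pandas and `KFold`. The helper `out_of_fold_target_encode` and the demo data are hypothetical, written for this example rather than taken from scikit-learn; recent scikit-learn releases (1.3 and later) also provide a `TargetEncoder` class that applies a similar cross-fitting idea.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(categories, target, n_splits=2, random_state=0):
    """Encode each row using target means computed on the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    folds = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in folds.split(categories):
        fit_target = target.iloc[fit_idx]
        # Per-category means learned without looking at the rows being encoded
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        # Categories unseen in the fitting folds fall back to the fold's overall mean
        values = categories.iloc[enc_idx].map(fold_means).fillna(fit_target.mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Hypothetical demo data
demo_categories = ["A", "B", "A", "C", "B", "A"]
demo_target = [1, 0, 1, 0, 1, 0]
demo_encoded = out_of_fold_target_encode(demo_categories, demo_target)
print(demo_encoded.tolist())
```

Because each row is encoded from folds that exclude it, changing a row's own target value does not change that row's encoding.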

## Recipe 4: Introduction to Pipelines

Pipelines are a simple way to streamline a machine learning workflow by chaining together transformers and estimators.

### Basic Pipeline Construction

A basic pipeline chains together a sequence of transformations and a final estimator.

In [None]:
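from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Create a small synthetic classification dataset so the example is runnable
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create a pipeline
pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict using the pipeline
predictions = pipeline.predict(X_test)
predictions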

### Pipeline with Multiple Steps

A more complex pipeline can include multiple preprocessing steps before the final estimator.

In [None]:
# Create a more complex pipeline
complex_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Fit the pipeline
complex_pipeline.fit(X_train, y_train)

# Predict using the pipeline
complex_predictions = complex_pipeline.predict(X_test)
complex_predictions
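
One practical benefit of bundling preprocessing into a pipeline is that cross-validation refits every step on each training fold, so imputation and scaling statistics never see the validation rows. The sketch below illustrates this on synthetic data; the dataset, the 5% missing-value rate, and the names `X_demo`, `y_demo`, and `cv_pipeline` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 5% of the entries knocked out
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

cv_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Each of the 5 folds fits the imputer, scaler, and classifier on its training split only
cv_scores = cross_val_score(cv_pipeline, X_demo, y_demo, cv=5)
print(cv_scores.mean())
```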

### Pipeline Visualization

Visualizing pipelines can help understand the workflow and ensure all steps are correctly configured.

In [None]:
from sklearn import set_config

# Set display configuration
set_config(display="diagram")

# Display the pipeline
complex_pipeline

## Recipe 5: Feature Engineering

Feature engineering involves creating new features or modifying existing ones to improve model performance.

### Polynomial Features

The `PolynomialFeatures` transformer generates polynomial and interaction features.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Initialize the PolynomialFeatures
poly = PolynomialFeatures(degree=2)

# Fit and transform the mean-imputed data from Recipe 1 (PolynomialFeatures rejects NaN values)
poly_features = poly.fit_transform(imputed_df)
poly_features_df = pd.DataFrame(
    poly_features, columns=poly.get_feature_names_out(df.columns)
)
poly_features_df

### Binning with KBinsDiscretizer

The `KBinsDiscretizer` discretizes continuous features into k bins.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

# Initialize the KBinsDiscretizer
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")

# Fit and transform the mean-imputed data from Recipe 1 (KBinsDiscretizer rejects NaN values)
binned_data = kbins.fit_transform(imputed_df)
binned_df = pd.DataFrame(binned_data, columns=df.columns)
binned_df

### Feature Selection

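Feature selection keeps only the features that contribute most to the model. Recursive feature elimination (`RFE`) repeatedly fits an estimator and drops the weakest feature until the requested number of features remains.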

In [None]:
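from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Illustrative target for the four-row sample data (values made up for this demo)
y_demo = [1.5, 2.0, 3.5, 4.0]

# Initialize the RFE
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)

# Fit the RFE on the imputed data, since LinearRegression cannot handle NaN values
rfe.fit(imputed_df, y_demo)

# Get the ranking of features (1 marks a selected feature)
rfe.ranking_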
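
Four rows are too few to say anything meaningful about feature importance. As a rough sketch on a larger problem, the snippet below runs `RFE` on synthetic regression data where only two of five features carry signal. The dataset and the names `X_demo`, `y_demo`, `true_coef`, and `rfe_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 5 features, of which only 2 influence the target
X_demo, y_demo, true_coef = make_regression(
    n_samples=100,
    n_features=5,
    n_informative=2,
    noise=0.1,
    coef=True,
    random_state=0,
)

rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
# and increases in the reverse order in which the others were eliminated
print("informative features:", np.flatnonzero(true_coef))
print("selected features:   ", np.flatnonzero(rfe_demo.support_))
print("ranking:             ", rfe_demo.ranking_)
```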

## Recipe 6: Practical Exercises

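In this section, we will combine the earlier recipes into a single pipeline and apply it to the California Housing dataset.

### Comprehensive Pipeline

We will create a pipeline that includes imputation, scaling, and modeling steps. All features in the California Housing dataset are numeric, so no encoding step is needed, and the imputer acts only as a safeguard because the dataset has no missing values.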

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load the California Housing dataset
california_data = fetch_california_housing()
X, y = california_data.data, california_data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create a comprehensive pipeline
comprehensive_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", RandomForestRegressor(random_state=42)),  # fixed seed for reproducible scores
    ]
)

# Fit the pipeline
comprehensive_pipeline.fit(X_train, y_train)

# Evaluate the pipeline
score = comprehensive_pipeline.score(X_test, y_test)
score

### Performance Evaluation

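Evaluate the performance of the comprehensive pipeline on the California Housing dataset. For a regressor, the `score` method used above returns the coefficient of determination (R²) on the data passed to it, here the held-out test set. Error metrics such as MAE and RMSE complement R² because they are expressed in the units of the target.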
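
A minimal sketch of such an evaluation follows. To keep it runnable without downloading data, it uses a synthetic regression problem as a stand-in for California Housing; the helper `regression_report`, the synthetic data, and the reduced `n_estimators` are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def regression_report(fitted_pipeline, X_eval, y_eval):
    """Return R^2, MAE, and RMSE for a fitted regression pipeline."""
    y_pred = fitted_pipeline.predict(X_eval)
    mse = mean_squared_error(y_eval, y_pred)
    return {
        "r2": r2_score(y_eval, y_pred),
        "mae": mean_absolute_error(y_eval, y_pred),
        "rmse": float(np.sqrt(mse)),
    }


# Synthetic stand-in data (not the California Housing dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]
)
demo_pipeline.fit(X_tr, y_tr)

report = regression_report(demo_pipeline, X_te, y_te)
print(report)
```

The same helper could be called as `regression_report(comprehensive_pipeline, X_test, y_test)` once the pipeline above has been fitted.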