Data preprocessing and feature engineering are vital parts of the machine learning pipeline. Raw data is often not in the right format or on the right scale for algorithms to perform well. Preprocessing transforms existing data into a suitable form (scaling, encoding, imputing), while feature engineering creates new features from existing ones.

## Scikit-learn: Data Preprocessing & Feature Engineering

This document covers a range of essential preprocessing and feature engineering techniques:

* **Scaling:** `StandardScaler`, `MinMaxScaler`, `RobustScaler`, `Normalizer`.
* **Encoding Categorical Features:** `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder` (for target `y`).
* **Imputation:** `SimpleImputer` for handling missing values.
* **Feature Engineering:** `PolynomialFeatures` for creating interaction/polynomial terms, `KBinsDiscretizer` for binning continuous data.
* **Text Feature Extraction:** `CountVectorizer` and `TfidfVectorizer` for converting text into numerical matrices.
* **Feature Selection:** Basic methods like `VarianceThreshold` and `SelectKBest`.
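
Most of the tools listed above share the same `fit` / `transform` pattern. The following is a minimal sketch of that pattern, assuming a small made-up array (`X_train`, `X_new` are hypothetical and not part of the notebook's data): statistics are learned from the training rows only and then reused on rows the transformers have not seen.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training rows and 2 held-out rows, each with a missing value.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_new = np.array([[np.nan, 20.0], [5.0, 50.0]])

imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

# fit_transform learns the column means (imputer) and mean/std (scaler) from training rows only.
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))

# transform reuses the learned statistics; nothing is re-estimated from X_new.
X_new_prepared = scaler.transform(imputer.transform(X_new))

print("Learned imputation values:", imputer.statistics_)
print("Training column means after scaling:", X_train_prepared.mean(axis=0).round(6))
print("Prepared new rows:\n", X_new_prepared.round(2))
```

Calling `fit_transform` on the held-out rows instead would recompute the statistics from those rows, which leaks information from the test data into preprocessing.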

---

These tools are fundamental for preparing your data effectively before applying machine learning algorithms.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, RobustScaler, Normalizer,
                                   OneHotEncoder, LabelEncoder, OrdinalEncoder,
                                   PolynomialFeatures, KBinsDiscretizer)
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression
from sklearn.datasets import load_iris # For feature selection example

# --- 1. Importance of Preprocessing ---
# - Many algorithms are sensitive to feature scale: distance-based models (KNN, SVM) and
#   models trained with gradient descent or regularization (e.g., linear models) work best
#   when features are on a similar scale.
# - Categorical features need to be converted to numerical format.
# - Missing values often need to be handled (imputed or removed).
# - Feature engineering can create more informative inputs for the model.

# --- 2. Scaling Numerical Features (sklearn.preprocessing) ---
# Applies transformations feature by feature (column-wise).

print("--- Scaling Numerical Features ---")
data_numeric = pd.DataFrame({
    'Age': [25, 45, 35, 55, 22],
    'Salary': [50000, 80000, 60000, 95000, 48000],
    'Height_cm': [175, 163, 170, 180, 168]
})
print("Original Numerical Data:\n", data_numeric)

# a) StandardScaler: Z-score normalization (mean=0, std=1)
scaler_standard = StandardScaler()
# Fit learns mean/std, transform applies the scaling
# fit_transform() does both in one step (preferred on training data)
data_standardized = scaler_standard.fit_transform(data_numeric)
print("\nStandardized Data (StandardScaler):\n", data_standardized.round(2))
# To apply to new data (test set), use scaler_standard.transform(X_test)

# b) MinMaxScaler: Scales data to a specific range [min, max] (default [0, 1])
scaler_minmax = MinMaxScaler(feature_range=(0, 1))
data_minmax = scaler_minmax.fit_transform(data_numeric)
print("\nMin-Max Scaled Data (MinMaxScaler to [0, 1]):\n", data_minmax.round(2))

# c) RobustScaler: Uses median and IQR, less sensitive to outliers
scaler_robust = RobustScaler()
data_robust = scaler_robust.fit_transform(data_numeric)
print("\nRobust Scaled Data (RobustScaler):\n", data_robust.round(2))

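# d) Normalizer: Scales individual *samples* (rows) to have unit norm (L1 or L2).
# Used less often for tabular features, more for text data or specific algorithms.
# Note: Normalizer is stateless (fit() learns nothing); fit_transform() is used here only
# to follow the standard estimator API and keep feature-name checks quiet.
normalizer_l2 = Normalizer(norm='l2') # L2 norm (Euclidean length of each row)
data_normalized = normalizer_l2.fit_transform(data_numeric)
# Salary is far larger than the other columns, so it dominates every row's norm
# and each row rounds to [0, 1, 0].
print("\nL2 Normalized Data (Normalizer - row-wise):\n", data_normalized.round(2))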
print("-" * 30)


# --- 3. Encoding Categorical Features (sklearn.preprocessing) ---

print("--- Encoding Categorical Features ---")
data_categorical = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['M', 'L', 'S', 'M', 'L'],
    'Quality': ['Good', 'Great', 'Good', 'Fair', 'Great'] # Ordinal feature
})
print("Original Categorical Data:\n", data_categorical)

# a) OneHotEncoder: Creates binary columns for each category. Preferred for nominal features.
# handle_unknown='ignore' prevents errors if unseen categories appear in test data.
# sparse_output=False returns a dense numpy array (easier to view).
# Requires scikit-learn >= 1.2; older versions name this parameter sparse.
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
data_onehot = ohe.fit_transform(data_categorical[['Color', 'Size']]) # Apply to nominal features
# Get feature names for the new columns
ohe_feature_names = ohe.get_feature_names_out(['Color', 'Size'])
print("\nOne-Hot Encoded Data (OneHotEncoder):\n", pd.DataFrame(data_onehot, columns=ohe_feature_names))

# b) OrdinalEncoder: Encodes categories into integers (0, 1, 2...). Assumes an order.
# Define the desired order for the 'Quality' feature
quality_order = ['Fair', 'Good', 'Great']
ordinal_enc = OrdinalEncoder(categories=[quality_order], handle_unknown='use_encoded_value', unknown_value=np.nan) # Specify order
data_ordinal = ordinal_enc.fit_transform(data_categorical[['Quality']])
print("\nOrdinal Encoded Data (OrdinalEncoder - Quality):\n", pd.DataFrame(data_ordinal, columns=['QualityEncoded']))

# c) LabelEncoder: Encodes target variable (y) into integers [0, n_classes-1].
# Generally NOT recommended for features (X) as it implies an arbitrary order.
target_labels = ['Cat', 'Dog', 'Cat', 'Fish', 'Dog']
le = LabelEncoder()
target_encoded = le.fit_transform(target_labels)
print(f"\nOriginal Target Labels: {target_labels}")
print(f"Label Encoded Target: {target_encoded}")
print(f"Encoded Classes: {le.classes_}") # Shows mapping: Cat=0, Dog=1, Fish=2
print("-" * 30)


# --- 4. Handling Missing Values (sklearn.impute) ---

print("--- Handling Missing Values ---")
data_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 7, 8, 9, 10],
    'C': [11, 12, 13, np.nan, 15]
})
print("Original Data with Missing Values:\n", data_missing)

# SimpleImputer: Replaces missing values (NaN) using a strategy.
# Strategies: 'mean', 'median', 'most_frequent', 'constant'
imputer_mean = SimpleImputer(strategy='mean')
data_imputed_mean = imputer_mean.fit_transform(data_missing)
print("\nImputed Data (Mean Strategy):\n", pd.DataFrame(data_imputed_mean, columns=data_missing.columns))

imputer_median = SimpleImputer(strategy='median')
data_imputed_median = imputer_median.fit_transform(data_missing)
print("\nImputed Data (Median Strategy):\n", pd.DataFrame(data_imputed_median, columns=data_missing.columns))

# Impute categorical (if needed, use strategy='most_frequent' or 'constant')
# data_cat_missing = pd.DataFrame({'Color': ['R', 'G', np.nan, 'B']})
# imputer_cat = SimpleImputer(strategy='most_frequent')
# print("\nImputed Categorical:\n", imputer_cat.fit_transform(data_cat_missing))
print("-" * 30)


# --- 5. Feature Engineering (sklearn.preprocessing) ---

print("--- Feature Engineering ---")
data_poly = pd.DataFrame({'X1': [1, 2, 3], 'X2': [4, 5, 6]})
print("Original Data for Polynomial Features:\n", data_poly)

# PolynomialFeatures: Generates polynomial features (e.g., x1^2, x2^2, x1*x2).
# degree: The maximum degree of the generated terms.
# include_bias=False: Avoids adding a column of ones (bias term).
# interaction_only=False (default): Keeps pure powers (X1^2, X2^2) as well as interaction
#   terms (X1*X2); interaction_only=True would keep only the interaction terms.
poly = PolynomialFeatures(degree=2, include_bias=False)
data_poly_features = poly.fit_transform(data_poly)
print("\nPolynomial Features (degree=2):\n", data_poly_features)
print(f"Feature Names: {poly.get_feature_names_out(['X1', 'X2'])}") # X1, X2, X1^2, X1 X2, X2^2

# KBinsDiscretizer: Bin continuous data into intervals (discretization).
# n_bins: Number of bins.
# encode: 'ordinal' (integer bins), 'onehot' (sparse), 'onehot-dense'.
# strategy: 'uniform' (equal width), 'quantile' (equal frequency), 'kmeans'.
data_discretize = pd.DataFrame({'Value': [1, 5, 12, 18, 25, 30, 35, 40, 48, 55]})
print("\nOriginal Data for Discretization:\n", data_discretize)
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform', subsample=None)
data_binned = discretizer.fit_transform(data_discretize[['Value']])
print("\nDiscretized Data (uniform bins):\n", data_binned)
print("-" * 30)


# --- 6. Feature Extraction (Text - sklearn.feature_extraction.text) ---

print("--- Text Feature Extraction ---")
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
print("Original Text Corpus:\n", corpus)

# CountVectorizer: Converts text to a matrix of token (word) counts.
vectorizer_count = CountVectorizer()
X_counts = vectorizer_count.fit_transform(corpus)
print("\nWord Counts (CountVectorizer - sparse matrix):\n", X_counts.toarray())
print("Feature Names (Vocabulary):\n", vectorizer_count.get_feature_names_out())

# TfidfVectorizer: Converts text to a matrix of TF-IDF features.
# TF-IDF = Term Frequency * Inverse Document Frequency.
# Words that appear in most documents (e.g., 'the') are down-weighted; rarer, more distinctive words get higher weights.
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(corpus)
print("\nTF-IDF Features (TfidfVectorizer - sparse matrix):\n", X_tfidf.toarray().round(2))
print("Feature Names (Vocabulary):\n", vectorizer_tfidf.get_feature_names_out())
print("-" * 30)


# --- 7. Feature Selection (Introduction - sklearn.feature_selection) ---
# Selecting relevant features, potentially improving model performance and reducing complexity.

print("--- Feature Selection (Introduction) ---")
# Load iris data for example
X_iris, y_iris = load_iris(return_X_y=True)
print(f"Original Iris shape: {X_iris.shape}")

# a) VarianceThreshold: Remove features with variance below a threshold (removes constant features by default).
selector_var = VarianceThreshold(threshold=0) # threshold=0 removes zero-variance features
X_iris_high_variance = selector_var.fit_transform(X_iris)
print(f"\nShape after VarianceThreshold(0): {X_iris_high_variance.shape}") # Usually no change for iris

# b) SelectKBest: Select features based on univariate statistical tests.
# For regression: f_regression, mutual_info_regression
# For classification: chi2, f_classif, mutual_info_classif
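# Iris is a classification problem, so select the top 2 features with the ANOVA F-test (f_classif).
from sklearn.feature_selection import f_classif # Imported here; the import cell above only brings in f_regression
selector_kbest = SelectKBest(score_func=f_classif, k=2)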
X_iris_kbest = selector_kbest.fit_transform(X_iris, y_iris) # Needs y for supervised selection
print(f"\nShape after SelectKBest(k=2): {X_iris_kbest.shape}")
print(f"Selected feature indices: {selector_kbest.get_support(indices=True)}") # Shows which columns were kept

# c) Recursive Feature Elimination (RFE) - More advanced (covered later if needed)
# Recursively removes features and builds a model on remaining features.
print("\nOther methods like RFE exist for more advanced selection.")
print("-" * 30)

--- Scaling Numerical Features ---
Original Numerical Data:
    Age  Salary  Height_cm
0   25   50000        175
1   45   80000        163
2   35   60000        170
3   55   95000        180
4   22   48000        168

Standardized Data (StandardScaler):
 [[-0.93 -0.91  0.65]
 [ 0.7   0.74 -1.4 ]
 [-0.11 -0.36 -0.21]
 [ 1.51  1.56  1.51]
 [-1.17 -1.02 -0.55]]

Min-Max Scaled Data (MinMaxScaler to [0, 1]):
 [[0.09 0.04 0.71]
 [0.7  0.68 0.  ]
 [0.39 0.26 0.41]
 [1.   1.   1.  ]
 [0.   0.   0.29]]

Robust Scaled Data (RobustScaler):
 [[-0.5  -0.33  0.71]
 [ 0.5   0.67 -1.  ]
 [ 0.    0.    0.  ]
 [ 1.    1.17  1.43]
 [-0.65 -0.4  -0.29]]

L2 Normalized Data (Normalizer - row-wise):
 [[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
------------------------------
--- Encoding Categorical Features ---
Original Categorical Data:
    Color Size Quality
0    Red    M    Good
1  Green    L   Great
2   Blue    S    Good
3  Green    M    Fair
4    Red    L   Great

One-Hot Encoded Data




Original Target Labels: ['Cat', 'Dog', 'Cat', 'Fish', 'Dog']
Label Encoded Target: [0 1 0 2 1]
Encoded Classes: ['Cat' 'Dog' 'Fish']
------------------------------
--- Handling Missing Values ---
Original Data with Missing Values:
      A     B     C
0  1.0   NaN  11.0
1  2.0   7.0  12.0
2  NaN   8.0  13.0
3  4.0   9.0   NaN
4  5.0  10.0  15.0

Imputed Data (Mean Strategy):
      A     B      C
0  1.0   8.5  11.00
1  2.0   7.0  12.00
2  3.0   8.0  13.00
3  4.0   9.0  12.75
4  5.0  10.0  15.00

Imputed Data (Median Strategy):
      A     B     C
0  1.0   8.5  11.0
1  2.0   7.0  12.0
2  3.0   8.0  13.0
3  4.0   9.0  12.5
4  5.0  10.0  15.0
------------------------------
--- Feature Engineering ---
Original Data for Polynomial Features:
    X1  X2
0   1   4
1   2   5
2   3   6

Polynomial Features (degree=2):
 [[ 1.  4.  1.  4. 16.]
 [ 2.  5.  4. 10. 25.]
 [ 3.  6.  9. 18. 36.]]
Feature Names: ['X1' 'X2' 'X1^2' 'X1 X2' 'X2^2']

Original Data for Discretization:
    Value
0      1
1      