Scikit-learn provides preprocessing utilities, mainly in sklearn.preprocessing and sklearn.impute, for transforming and preparing data before it is passed to a machine learning model.

1. Scaling
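Scaling puts the features on a comparable scale so that no feature dominates simply because of its units. Standardization centers each feature and rescales it to unit variance (the result is not bounded), while min-max scaling maps each feature into a fixed range.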

Standardization (StandardScaler)
Standardization rescales each feature (column) to have zero mean and unit variance: z = (x - mean) / std.

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

# Example data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

print("Original Data:\n", X)
print("Scaled Data:\n", X_scaled)
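
As a quick sanity check on the definition above, the following sketch recomputes the standardized values by hand and compares them with StandardScaler. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses; the names X_manual and X_new are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)

# Learn the per-column mean and standard deviation, then apply them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual standardization: z = (x - mean) / std, with population std (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Matches manual computation:", np.allclose(X_scaled, X_manual))
print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column stds after scaling:", X_scaled.std(axis=0))

# Hypothetical new sample: reuse the fitted statistics instead of refitting
X_new = np.array([[9, 10]], dtype=float)
X_new_scaled = scaler.transform(X_new)
print("New sample scaled with training statistics:\n", X_new_scaled)
```

Fitting once and calling transform on later data keeps the training statistics fixed, which is usually what you want when the same scaling must be applied to a test set.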

Min-Max Scaling (MinMaxScaler)
Min-max scaling linearly maps each feature to a fixed range, [0, 1] by default: x' = (x - min) / (max - min).

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
min_max_scaler = MinMaxScaler()

# Fit and transform the data
X_min_max_scaled = min_max_scaler.fit_transform(X)

print("Min-Max Scaled Data:\n", X_min_max_scaled)
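
The sketch below checks the min-max formula by hand and shows how the target range can be changed with the feature_range parameter. The variable names X_manual, X_unit and X_sym are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)

# Manual min-max scaling per column: (x - min) / (max - min)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Default target range is [0, 1]
X_unit = MinMaxScaler().fit_transform(X)
print("Matches manual computation:", np.allclose(X_unit, X_manual))

# A different target range, here [-1, 1]
X_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
print("Scaled to [-1, 1]:\n", X_sym)
```

Because the minimum and maximum are taken from the fitted data, a single extreme value can compress the rest of the feature into a narrow part of the range.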

2. Normalization
Normalization rescales each sample (row), rather than each feature (column), to have unit norm. Normalizer uses the L2 norm by default.

In [None]:
from sklearn.preprocessing import Normalizer

# Initialize the normalizer
normalizer = Normalizer()

# Normalizer is stateless (fit learns nothing); fit_transform is used for API consistency
X_normalized = normalizer.fit_transform(X)

print("Normalized Data:\n", X_normalized)
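
To make the row-wise behaviour concrete, this sketch normalizes the same data with the L2 and L1 norms and verifies that every row ends up with unit norm. The names X_l2 and X_l1 are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)

# L2 normalization (the default): each row is divided by its Euclidean length
X_l2 = Normalizer(norm="l2").fit_transform(X)
print("Row L2 norms:", np.linalg.norm(X_l2, axis=1))

# L1 normalization: each row is divided by the sum of its absolute values
X_l1 = Normalizer(norm="l1").fit_transform(X)
print("Row L1 norms:", np.abs(X_l1).sum(axis=1))
```

Each row is handled independently, so the result for a sample does not depend on any other sample in the dataset.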

3. Encoding Categorical Variables

One-Hot Encoding (OneHotEncoder)

One-hot encoding converts each categorical value into a binary vector with one column per category; exactly one entry is 1 for each sample.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
categories = np.array([['red'], ['green'], ['blue'], ['green']])

# Initialize the encoder
encoder = OneHotEncoder()

# Fit and transform the data
categories_encoded = encoder.fit_transform(categories).toarray()

print("Original Categories:\n", categories)
print("One-Hot Encoded Data:\n", categories_encoded)
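
The following sketch inspects which column corresponds to which category and shows one way to handle a category that was not seen during fitting. The value 'purple' is a made-up example, and handle_unknown="ignore" is only one possible policy (the default is to raise an error).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(categories).toarray()

# Columns follow the sorted order of the categories found during fit
print("Learned categories:", encoder.categories_)
print("Encoded:\n", encoded)

# Hypothetical unseen category
unknown = encoder.transform(np.array([["purple"]])).toarray()
print("Unseen category:", unknown)

# Map the one-hot rows back to the original values
decoded = encoder.inverse_transform(encoded)
print("Decoded:", decoded.ravel())
```

The call to toarray() is needed because the encoder returns a SciPy sparse matrix by default.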

Label Encoding (LabelEncoder)

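Label encoding assigns a unique integer to each category. LabelEncoder is intended for target labels (y); for input features, OrdinalEncoder is the usual choice.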

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the encoder
label_encoder = LabelEncoder()

# Fit and transform the data (LabelEncoder expects a 1D array, hence ravel())
categories_label_encoded = label_encoder.fit_transform(categories.ravel())

print("Label Encoded Data:\n", categories_label_encoded)
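
This sketch shows how the integer codes map back to the original labels, and contrasts LabelEncoder with OrdinalEncoder, which works on 2D feature arrays and accepts an explicit category order. The sizes array and its small/medium/large ordering are hypothetical data introduced only for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

labels = np.array(["red", "green", "blue", "green"])

# LabelEncoder assigns codes in sorted order of the classes
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(labels)
print("Classes:", label_encoder.classes_)
print("Codes:", codes)
print("Decoded:", label_encoder.inverse_transform(codes))

# Hypothetical ordered feature: the category order is given explicitly
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
ordinal_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal_encoder.fit_transform(sizes)
print("Ordinal codes:\n", size_codes)
```

Integer codes imply an order, so for unordered features fed to linear or distance-based models, one-hot encoding is often the safer option.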


4. Handling Missing Values

Imputation (SimpleImputer)

Imputation replaces missing values with a value computed per column, such as the mean, median, or most frequent value, or with a fixed constant.

In [25]:
from sklearn.impute import SimpleImputer

# Example data with missing values
X_with_nan = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 8]])

# Initialize the imputer (when several values tie for most frequent, the smallest is used)
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the data
X_imputed = imputer.fit_transform(X_with_nan)

print("Data with Missing Values:\n", X_with_nan)
print("Imputed Data:\n", X_imputed)


Data with Missing Values:
 [[ 1.  2.]
 [nan  3.]
 [ 7.  6.]
 [nan  8.]]
Imputed Data:
 [[1. 2.]
 [1. 3.]
 [7. 6.]
 [1. 8.]]
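
For comparison, the sketch below applies the mean, median and constant strategies to the same data and prints the fill value learned for each column through the statistics_ attribute. The variable names and the choice of 0 as the constant are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_with_nan = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 8]])

# Mean of the observed values in each column
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X_with_nan)
print("Mean fill values:", mean_imputer.statistics_)

# Median of the observed values in each column
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X_with_nan)
print("Median fill values:", median_imputer.statistics_)

# A fixed constant, here 0 (an arbitrary choice for this sketch)
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
X_zero = constant_imputer.fit_transform(X_with_nan)
print("Constant-filled data:\n", X_zero)
```

The statistics are computed from the non-missing entries only, so the second column, which has no missing values, is left unchanged by every strategy.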
