# Column Transformer | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2012%20Column%20Transformer)

When preprocessing data for machine learning, it is common to apply different transformations to different types of features (e.g., numerical vs. categorical). The **ColumnTransformer** in scikit-learn helps manage this by allowing you to specify different preprocessing pipelines for subsets of features.

---

## Overview

A **Pipeline** is used to chain multiple steps together (e.g., preprocessing and model training), ensuring that all steps are applied in sequence during both training and prediction. The **ColumnTransformer** extends this idea by allowing different transformations on specified columns of your dataset.

### Why Use ColumnTransformer?

- **Selective Preprocessing:** Apply different transformers (e.g., scaling, encoding) to different columns.
- **Simplified Workflow:** Combine multiple transformations in a single object that fits into a broader pipeline.
- **Cleaner Code:** Manage complex preprocessing logic in an organized, reproducible manner.

---

## Mathematical Formulas in Preprocessing

Many transformations use well-known mathematical formulas. For example:

### Standard Scaling

To standardize a numerical feature, the **StandardScaler** transforms data by removing the mean and scaling to unit variance:

$$
x_{\text{scaled}} = \frac{x - \mu}{\sigma}
$$

where:
- μ is the mean of the feature.
- σ is the standard deviation of the feature (scikit-learn uses the population standard deviation, i.e. `ddof=0`).
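As a quick check of the formula above, the following sketch compares a manual computation with `StandardScaler` on a small, made-up feature. The array values and variable names are illustrative assumptions, not data from this project.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature (e.g. ages); values are for illustration only
x = np.array([[20.0], [30.0], [40.0], [50.0]])

# Manual standardization: subtract the mean, divide by the standard deviation
mu = x.mean(axis=0)
sigma = x.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
manual = (x - mu) / sigma

# The same transformation with scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))  # True if both computations agree
```

After scaling, the feature has mean 0 and standard deviation 1, which is what "unit variance" refers to.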

### Min-Max Scaling

Alternatively, the **MinMaxScaler** scales each feature to a given range, $[0, 1]$ by default:

$$
x_{\text{norm}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$

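These formulas are applied independently to each feature column selected by the ColumnTransformer. The statistics they rely on (μ and σ, or the minimum and maximum) are learned from the data passed to `fit` and reused unchanged in `transform`.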
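The min-max formula can be verified the same way. This is a minimal sketch on a made-up two-column array (the values are assumptions for illustration); it also shows that each column is scaled with its own minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
X = np.array([
    [20.0, 1000.0],
    [30.0, 3000.0],
    [40.0, 2000.0],
    [50.0, 5000.0],
])

# Manual min-max scaling, computed separately for each column (axis=0)
x_min = X.min(axis=0)
x_max = X.max(axis=0)
manual = (X - x_min) / (x_max - x_min)

# The same transformation with scikit-learn (default feature_range=(0, 1))
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True if both computations agree
```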

---

## Using ColumnTransformer in Python

Below is a Python code example that demonstrates how to create a preprocessing pipeline using ColumnTransformer. In this example, numerical features are imputed and standardized, while categorical features are imputed and one-hot encoded.

```python
# Import necessary modules
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample feature lists
numeric_features = ['age', 'salary']
categorical_features = ['gender', 'country']

# Pipeline for numerical features: impute missing values and scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical features: impute missing values and encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Combine the preprocessor with an estimator in a full pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Example usage with train/test split (replace X, y with your data)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# clf.fit(X_train, y_train)
# score = clf.score(X_test, y_test)
# print("Model accuracy:", score)
```

### Key Points in the Code

- **Numeric Pipeline:**  
  - **SimpleImputer:** Fills in missing values using the median.
  - **StandardScaler:** Standardizes features using the formula:
    $$
    x_{\text{scaled}} = \frac{x - \mu}{\sigma}
    $$

- **Categorical Pipeline:**  
  - **SimpleImputer:** Fills in missing values with a constant value (`'missing'`).
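  - **OneHotEncoder:** Converts each categorical column into binary indicator columns, one per category. With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as all zeros instead of raising an error.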

- **Combining Pipelines:**  
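  The `ColumnTransformer` applies these pipelines to the respective columns and concatenates their outputs, and the overall `Pipeline` chains preprocessing with model training. Columns that are not listed in any transformer are dropped by default (`remainder='drop'`); pass `remainder='passthrough'` to keep them unchanged.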

---

## Benefits and Best Practices

- **Modularity:**  
  Each transformer (or pipeline) is modular. This means you can easily swap or update individual preprocessing steps.

- **Reproducibility:**  
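  A pipeline learns its preprocessing parameters (medians, means, category lists) from the training data only and reuses them at prediction time. When the whole pipeline is passed to cross-validation, statistics from the validation folds never reach the fitted transformers, which reduces the risk of data leakage.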

- **Extensibility:**  
  You can add more transformers or estimators to the pipeline as needed. For example, you might include feature selection or additional scaling.

- **Debugging:**  
  Breaking down your preprocessing steps into modular pipelines makes it easier to track down errors or unexpected behavior in your data processing.
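The reproducibility and modularity points above can be sketched as follows. The synthetic data, the target definition, and the simplified preprocessor (no imputation) are assumptions made to keep the example short; only the pattern of cross-validating the full pipeline and swapping a step with `set_params` is the point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Synthetic data, generated only for this illustration
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'age': rng.integers(18, 65, size=n).astype(float),
    'salary': rng.normal(50000, 15000, size=n),
    'country': pd.Series(rng.choice(['US', 'DE', 'IN'], size=n), dtype=object),
})
y = (X['salary'] > 50000).astype(int)  # hypothetical binary target

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country']),
])
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

# Reproducibility: the whole pipeline is refit inside each fold, so the
# scaler and encoder only ever see that fold's training portion
scores = cross_val_score(clf, X, y, cv=5)
print("StandardScaler accuracy per fold:", scores)

# Modularity: replace the 'num' transformer without rebuilding the pipeline
clf.set_params(preprocessor__num=MinMaxScaler())
scores_minmax = cross_val_score(clf, X, y, cv=5)
print("MinMaxScaler accuracy per fold:", scores_minmax)
```

Scaling the full dataset before splitting it would instead let the validation rows influence the learned mean and standard deviation, which is the leakage the pipeline avoids.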

---

## Conclusion

The ColumnTransformer is a powerful tool in the scikit-learn ecosystem that streamlines the preprocessing of heterogeneous datasets. By applying different transformations to specific subsets of features, it helps create robust and maintainable machine learning pipelines.