# Demo 2: Using the ColumnTransformer

This demo shows how to use `ColumnTransformer` to apply different transformation pipelines to different subsets of columns within a single preprocessor. This pattern is common in real-world data preprocessing, where numeric and categorical columns usually need different treatment.

In [1]:
import sys
import os
# In a Jupyter notebook, __file__ is not defined. We can use a relative path to add the project root.
# This assumes the notebook is in the 'demo' folder, and 'transfory' is in the parent directory.
project_root = os.path.abspath('..')
if project_root not in sys.path:
    sys.path.insert(0, project_root)

import pandas as pd
import seaborn as sns

from transfory.pipeline import Pipeline
from transfory.column_transformer import ColumnTransformer
from transfory.missing import MissingValueHandler
from transfory.encoder import Encoder
from transfory.scaler import Scaler
from transfory.insight import InsightReporter

### 1. Load Data

We'll use the 'titanic' dataset again, but this time we'll build a more sophisticated preprocessor.

In [2]:
df = sns.load_dataset('titanic')
reporter = InsightReporter()

# Select a subset of columns for the demo
df_subset = df[['age', 'fare', 'embarked', 'sex', 'pclass', 'who']].copy()

print("Original Data (first 5 rows):")
display(df_subset.head())

Original Data (first 5 rows):


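|   | age | fare | embarked | sex | pclass | who |
|---|---|---|---|---|---|---|
| 0 | 22.0 | 7.25 | S | male | 3 | man |
| 1 | 38.0 | 71.2833 | C | female | 1 | woman |
| 2 | 26.0 | 7.925 | S | female | 3 | woman |
| 3 | 35.0 | 53.1 | S | female | 1 | woman |
| 4 | 35.0 | 8.05 | S | male | 3 | man |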


### 2. Define Sub-Pipelines and the ColumnTransformer

We create two small pipelines: one for numeric features (impute then scale) and one for categorical features (impute then encode). The `ColumnTransformer` then applies each pipeline to the correct columns.

- `numeric_processing` is applied to `['age', 'fare']`.
- `categorical_processing` is applied to `['embarked', 'sex']`.
- `pclass` is explicitly passed through without changes.
- `who` is not mentioned, so it will be dropped (`remainder='drop'`).
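
The routing described in the list above can be sketched in plain Python. This is only an illustration of the idea, not transfory's implementation: `route_columns` is a hypothetical helper, and the pipeline objects are replaced by placeholder strings.

```python
def route_columns(all_columns, transformers, remainder="drop"):
    """Decide which columns each named transformer receives.

    `transformers` is a list of (name, transformer, columns) tuples, the same
    shape used in this demo. Returns (assignments, dropped_columns).
    Hypothetical helper for illustration only.
    """
    assignments = {}
    claimed = set()
    for name, _transformer, columns in transformers:
        missing = [c for c in columns if c not in all_columns]
        if missing:
            raise KeyError(f"{name}: columns not found: {missing}")
        assignments[name] = list(columns)
        claimed.update(columns)

    # Columns that no transformer asked for are the "remainder".
    leftover = [c for c in all_columns if c not in claimed]
    if remainder == "passthrough":
        assignments["remainder"] = leftover
        dropped = []
    elif remainder == "drop":
        dropped = leftover
    else:
        raise ValueError(f"unsupported remainder: {remainder!r}")
    return assignments, dropped


columns = ["age", "fare", "embarked", "sex", "pclass", "who"]
spec = [
    ("numeric_processing", "numeric_pipeline", ["age", "fare"]),
    ("categorical_processing", "categorical_pipeline", ["embarked", "sex"]),
    ("pass_through_pclass", "passthrough", ["pclass"]),
]

assignments, dropped = route_columns(columns, spec, remainder="drop")
print(assignments)
print("dropped:", dropped)
```

With `remainder="drop"`, only `who` ends up in the dropped list, because every other column is claimed by one of the three entries.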

In [3]:
# Define separate pipelines for numeric and categorical features
numeric_pipeline = Pipeline([
    ("imputer", MissingValueHandler(strategy="mean")),
    ("scaler", Scaler(method="zscore"))
])

categorical_pipeline = Pipeline([
    ("imputer", MissingValueHandler(strategy="mode")),
    ("encoder", Encoder(method="onehot"))
])

# Use ColumnTransformer to apply pipelines to the correct columns
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_processing", numeric_pipeline, ['age', 'fare']),
        ("categorical_processing", categorical_pipeline, ['embarked', 'sex']),
        ("pass_through_pclass", "passthrough", ['pclass'])
    ],
    remainder='drop',
    logging_callback=reporter.get_callback()
)

# Fit and transform the data
transformed_df = preprocessor.fit_transform(df_subset)

print("Transformed Data (first 5 rows):")
display(transformed_df.head())

Transformed Data (first 5 rows):


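|   | age | fare | embarked_S | embarked_C | embarked_Q | sex_male | sex_female | pclass |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.592481 | -0.502445 | 1 | 0 | 0 | 1 | 0 | 3 |
| 1 | 0.638789 | 0.786845 | 0 | 1 | 0 | 0 | 1 | 1 |
| 2 | -0.284663 | -0.488854 | 1 | 0 | 0 | 0 | 1 | 3 |
| 3 | 0.407926 | 0.42073 | 1 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0.407926 | -0.486337 | 1 | 0 | 0 | 1 | 0 | 3 |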
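
To make the numeric columns in this output easier to interpret, here is a minimal standard-library sketch of what "impute with the mean, then z-score" does to a single column. The data is a toy list, not the Titanic values, and the helper names are hypothetical. Using the population standard deviation (`statistics.pstdev`) is an assumption; `Scaler(method="zscore")` may define it differently.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values], fill


def zscore(values):
    """Center on the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (ddof=0).
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = [22.0, 38.0, None, 35.0, 26.0]  # toy column with one missing value
filled, fill_value = impute_mean(ages)
scaled = zscore(filled)

print("fill value:", fill_value)
print("scaled:", [round(v, 4) for v in scaled])
```

A mean-imputed entry sits exactly at the column mean (imputation leaves the mean unchanged), so it maps to a z-score of 0 after scaling.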


### 3. Review the Insight Report

The report is now nested: each entry is labeled with the transformer it belongs to (for example `numeric_processing::imputer`), so you can see which steps ran *inside* each part of the `ColumnTransformer`. This makes complex preprocessing much easier to debug and understand.
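
One way such nested labels could be produced is by wrapping a logging callback so that each level adds its own prefix. The sketch below is an assumption about the general technique, not a description of how `InsightReporter` works internally; `scoped` and `record` are hypothetical names.

```python
def scoped(callback, prefix):
    """Wrap `callback` so every step name it receives is prefixed with `prefix`."""
    def wrapper(step_name, message):
        callback(f"{prefix}::{step_name}", message)
    return wrapper


log = []


def record(step_name, message):
    """Top-level sink: store one formatted line per event."""
    log.append(f"[{step_name}] {message}")


# The outer transformer hands each sub-pipeline a callback scoped to its name.
numeric_cb = scoped(record, "numeric_processing")
numeric_cb("imputer", "learned imputation values using 'mean'")
numeric_cb("scaler", "fitted 'zscore' scaling")

for line in log:
    print(line)
```

Because each wrapper only prepends its own name, the same approach extends to deeper nesting without the inner steps knowing where they sit.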

In [4]:
print(reporter.summary())

=== Transfory Insight Report ===
Session started: 2025-12-09 06:41:50
Total steps logged: 42

[2025-12-09 06:41:51] started fitting sub-transformer 'Pipeline' on 2 column(s): ['age', 'fare'].
[2025-12-09 06:41:51] [numeric_processing] Step 'Pipeline' completed a 'fit_step_start' event.
[2025-12-09 06:41:51] [numeric_processing::imputer] Step 'MissingValueHandler' (MissingValueHandler) learned imputation values using 'mean' for 1 column(s). Values: age: 29.70.
[2025-12-09 06:41:51] [numeric_processing::imputer] Step 'MissingValueHandler' (MissingValueHandler) applied imputation to the data.
[2025-12-09 06:41:51] [numeric_processing] Step 'Pipeline' completed a 'fit_end' event.
[2025-12-09 06:41:51] [numeric_processing] Step 'Pipeline' completed a 'fit_step_start' event.
[2025-12-09 06:41:51] [numeric_processing::scaler] Step 'Scaler' (Scaler) fitted. It will apply 'zscore' scaling to 2 numeric column(s).
[2025-12-09 06:41:51] [numeric_processing] Step 'Pipeline' completed a 'fit_end' ev