# Demo 4: Outlier Handling

This demo uses seaborn's 'diamonds' dataset, several of whose numeric features contain extreme values, to show how `OutlierHandler` can cap outliers before later transformations such as scaling.

In [1]:
import sys
import os
# In a Jupyter notebook, __file__ is not defined. We can use a relative path to add the project root.
# This assumes the notebook is in the 'demo' folder, and 'transfory' is in the parent directory.
project_root = os.path.abspath('..')
if project_root not in sys.path:
    sys.path.insert(0, project_root)

import pandas as pd
import seaborn as sns

from transfory.pipeline import Pipeline
from transfory.outlier import OutlierHandler
from transfory.scaler import Scaler
from transfory.insight import InsightReporter

### 1. Load Data

We'll take a random sample from the 'diamonds' dataset and look at its descriptive statistics. Notice the large gap between the 75th percentile and the maximum for columns like `price` and `carat`, which suggests the presence of outliers.

In [2]:
# We take a sample to keep the output clean
df = sns.load_dataset('diamonds').sample(n=1000, random_state=42)
reporter = InsightReporter()

# Select only numeric columns for this demo
df_numeric = df.select_dtypes(include='number')

print("Original Data Description:")
display(df_numeric.describe())

Original Data Description:


| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 0.8275 | 61.7597 | 57.3688 | 4165.147 | 5.80121 | 5.80087 | 3.58036 |
| std | 0.490193 | 1.401489 | 2.303578 | 4190.104476 | 1.1474 | 1.142003 | 0.700021 |
| min | 0.23 | 55.2 | 53.0 | 360.0 | 3.89 | 3.93 | 2.43 |
| 25% | 0.41 | 61.1 | 56.0 | 1007.0 | 4.74 | 4.75 | 2.95 |
| 50% | 0.71 | 61.9 | 57.0 | 2542.5 | 5.745 | 5.75 | 3.55 |
| 75% | 1.08 | 62.5 | 59.0 | 5569.75 | 6.6 | 6.6 | 4.06 |
| max | 2.75 | 68.4 | 73.0 | 18803.0 | 9.04 | 8.98 | 5.49 |


### 2. Define and Run the Outlier Handling Pipeline

This pipeline will:
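1.  Cap extreme values using the Interquartile Range (IQR) method with a factor of 1.5, so values beyond `Q1 - 1.5 * IQR` or `Q3 + 1.5 * IQR` are pulled back to those bounds.
2.  Scale the capped data to the 0-1 range with min-max scaling.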
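
To make the two steps concrete, here is a minimal pandas-only sketch of the same idea. It is not Transfory's implementation: the helper names `cap_iqr` and `minmax_scale` are hypothetical, and it assumes the usual definitions (bounds at `Q1 - factor * IQR` and `Q3 + factor * IQR`, then `(x - min) / (max - min)`).

```python
import pandas as pd


def cap_iqr(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Clip every column to [Q1 - factor*IQR, Q3 + factor*IQR] (hypothetical helper)."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # axis=1 aligns the per-column bounds with the DataFrame's columns.
    return df.clip(lower=lower, upper=upper, axis=1)


def minmax_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to [0, 1]; assumes no column is constant."""
    col_min = df.min()
    col_range = df.max() - col_min
    return (df - col_min) / col_range


# Toy data with one obvious outlier.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0]})
capped = cap_iqr(toy)          # Q1 = 2, Q3 = 4, so the upper bound is 7
scaled = minmax_scale(capped)  # values now lie between 0 and 1
print(capped["a"].tolist())
print(scaled["a"].round(3).tolist())
```

Capping first matters for min-max scaling in particular: without it, a single extreme value sets the maximum and squeezes the rest of the column toward 0.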

In [3]:
pipeline = Pipeline(
    steps=[
        ("outlier_capper", OutlierHandler(method="iqr", factor=1.5)),
        ("scaler", Scaler(method="minmax"))
    ],
    logging_callback=reporter.get_callback()
)

# Fit and transform the data
transformed_df = pipeline.fit_transform(df_numeric)

print("Transformed Data Description (after capping and scaling):")
display(transformed_df.describe())

Transformed Data Description (after capping and scaling):


| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 0.32093 | 0.494429 | 0.412838 | 0.298843 | 0.371109 | 0.370469 | 0.375935 |
| std | 0.260813 | 0.217854 | 0.207005 | 0.303275 | 0.222796 | 0.226139 | 0.228765 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.097035 | 0.375 | 0.285714 | 0.053676 | 0.165049 | 0.162376 | 0.169935 |
| 50% | 0.25876 | 0.517857 | 0.380952 | 0.181062 | 0.360194 | 0.360396 | 0.366013 |
| 75% | 0.458221 | 0.625 | 0.571429 | 0.432205 | 0.526214 | 0.528713 | 0.53268 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
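
As a sanity check, the `carat` quartiles in this table can be reproduced by hand from the original description. The sketch below is only arithmetic on numbers copied from the first table. It assumes the handler caps at `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR` and that min-max scaling is `(x - min) / (max - min)`; neither is taken from Transfory's source.

```python
# Values copied from the original describe() output for 'carat'.
q1, median, q3 = 0.41, 0.71, 1.08
col_min = 0.23
factor = 1.5

iqr = q3 - q1
lower = q1 - factor * iqr  # about -0.595, below the observed minimum of 0.23
upper = q3 + factor * iqr  # about 2.085, below the observed maximum of 2.75

# After capping, the column spans [col_min, upper]: nothing falls below the
# lower bound, and every value above `upper` is clipped down to it.
span = upper - col_min
scaled = {
    name: (value - col_min) / span
    for name, value in [("25%", q1), ("50%", median), ("75%", q3)]
}

for name, value in scaled.items():
    print(f"{name}: {value:.6f}")
```

Under these assumptions the three results agree with the `carat` column above (0.097035, 0.25876, 0.458221), which suggests the scaler saw the capped maximum rather than the raw 2.75.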


### 3. Review the Insight Report

The report records that capping bounds were learned for all seven columns (quoting `carat` as an example) and confirms that both capping and scaling were applied.

In [4]:
print(reporter.summary())

=== Transfory Insight Report ===
Session started: 2025-12-09 06:36:02
Total steps logged: 9

[2025-12-09 06:36:02] Step 'Pipeline' completed a 'fit_transform_step' event.
[2025-12-09 06:36:02] [outlier_capper] Step 'OutlierHandler' (OutlierHandler) learned capping bounds using 'iqr' for 7 column(s). (e.g., 'carat' will be capped between -0.60 and 2.09).
[2025-12-09 06:36:02] [outlier_capper] Step 'OutlierHandler' (OutlierHandler) applied capping to 7 column(s).
[2025-12-09 06:36:02] [outlier_capper] Step 'OutlierHandler' completed a 'transform' event.
[2025-12-09 06:36:02] Step 'Pipeline' completed a 'fit_transform_done' event.
[2025-12-09 06:36:02] Step 'Pipeline' completed a 'fit_transform_step' event.
[2025-12-09 06:36:02] [scaler] Step 'Scaler' (Scaler) fitted. It will apply 'minmax' scaling to 7 numeric column(s).
[2025-12-09 06:36:02] [scaler] Step 'Scaler' completed a 'transform' event.
[2025-12-09 06:36:02] Step 'Pipeline' completed a 'fit_transform_done' event.
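
The report quotes bounds only for `carat`, so it can be useful to work out which columns actually had values outside their bounds. The sketch below does this from the summary statistics alone, using numbers copied from the original description. The helper `clipped_sides` is hypothetical, and the bounds formula (`Q1 - 1.5 * IQR`, `Q3 + 1.5 * IQR`) is an assumption that happens to match the `carat` bounds in the log.

```python
# (min, Q1, Q3, max) per column, copied from the original describe() output.
stats = {
    "carat": (0.23, 0.41, 1.08, 2.75),
    "depth": (55.2, 61.1, 62.5, 68.4),
    "table": (53.0, 56.0, 59.0, 73.0),
    "price": (360.0, 1007.0, 5569.75, 18803.0),
    "x": (3.89, 4.74, 6.6, 9.04),
    "y": (3.93, 4.75, 6.6, 8.98),
    "z": (2.43, 2.95, 4.06, 5.49),
}


def clipped_sides(col_min, q1, q3, col_max, factor=1.5):
    """Return which ends of a column lie outside the IQR bounds."""
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    sides = []
    if col_min < lower:
        sides.append("low")
    if col_max > upper:
        sides.append("high")
    return sides


report = {col: clipped_sides(*values) for col, values in stats.items()}
for col, sides in report.items():
    print(f"{col}: {', '.join(sides) if sides else 'unchanged'}")
```

Under these assumptions only `depth` is clipped at both ends; `carat`, `table`, and `price` are clipped at the top; and `x`, `y`, and `z` have no values outside their bounds in this sample. If that holds, "applied capping to 7 column(s)" likely counts the columns processed rather than the columns whose values changed.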
