# Transfory Interactive Test Notebook

Welcome! This notebook lets you test the `Transfory` library interactively.

### Instructions:
1.  **Activate Environment**: Make sure your virtual environment is activated.
2.  **Install Dependencies**: From your project root, run `pip install -r requirements.txt`. For development, run `pip install -e .` to install `Transfory` in editable mode.
3.  **Run All Cells**: Run the entire notebook (`Kernel -> Restart & Run All`) to see the complete workflow, from data creation to transformation, reporting, and persistence.

In [1]:
# Add the current directory to the import path so Python can find the 'transfory' package
# (this assumes the notebook is run from the project root)
import sys
sys.path.insert(0, '.')

import pandas as pd
import numpy as np

# Import the Transfory components used in this notebook
from transfory.pipeline import Pipeline
from transfory.imputation import MissingValueHandler
from transfory.encoder import Encoder
from transfory.featuregen import FeatureGenerator
from transfory.scaler import Scaler
from transfory.insight import InsightReporter

print("✅ Transfory components imported successfully!")

✅ Transfory components imported successfully!


## Step 1: Create Sample Data

To make this notebook self-contained, we'll create a sample DataFrame. It includes missing values and mixed data types to test all our transformers.

In [2]:
def create_sample_data():
    """Creates a sample DataFrame with mixed data types and missing values."""
    data = {
        'age': [25, 30, np.nan, 45, 35, 28, 50],
        'city': ['New York', 'London', 'Paris', 'Tokyo', 'London', 'New York', np.nan],
        'experience': [2, 7, 5, 20, 10, 4, 22],
        'salary': [50000, 90000, 75000, 150000, 110000, np.nan, 180000]
    }
    return pd.DataFrame(data)

raw_df = create_sample_data()

print("Original DataFrame:")
raw_df.info()
raw_df

Original DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         6 non-null      float64
 1   city        6 non-null      object 
 2   experience  7 non-null      int64  
 3   salary      6 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 352.0+ bytes


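|   | age | city | experience | salary |
|---|---|---|---|---|
| 0 | 25.0 | New York | 2 | 50000.0 |
| 1 | 30.0 | London | 7 | 90000.0 |
| 2 | NaN | Paris | 5 | 75000.0 |
| 3 | 45.0 | Tokyo | 20 | 150000.0 |
| 4 | 35.0 | London | 10 | 110000.0 |
| 5 | 28.0 | New York | 4 | NaN |
| 6 | 50.0 | NaN | 22 | 180000.0 |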


## Step 2: Define the Transformation Pipeline

Here, we create an `InsightReporter` to track the changes and define a `Pipeline` with all the transformation steps. You can comment out, reorder, or customize the steps as you wish.

In [3]:
# Create an InsightReporter to capture all events
reporter = InsightReporter()

# Define the full pipeline
full_pipeline = Pipeline([
    # Step 1: Handle missing values
    ("imputer_numeric", MissingValueHandler(strategy="mean")), # Fills NaN in 'age' and 'salary' with their mean
    ("imputer_categorical", MissingValueHandler(strategy="mode")), # Fills NaN in 'city' with its mode
    
    # Step 2: Convert categorical columns to numbers
    ("encoder", Encoder(method="onehot", handle_unseen="error")), # Use handle_unseen='error' to raise an error for new categories
    
    # Step 3: Generate new features from numeric columns
    ("feature_generator", FeatureGenerator(degree=2, include_interactions=True)),
    
    # Step 4: Scale all numeric features
    ("scaler", Scaler(method="zscore")) # Applies Z-score scaling to all numeric columns
    
], logging_callback=reporter.get_callback()) # Attach the reporter to the pipeline

print("Pipeline defined:")
full_pipeline

Pipeline defined:


<Pipeline (5 steps): imputer_numeric → imputer_categorical → encoder → feature_generator → scaler>
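
To make the `feature_generator` step more concrete, here is a minimal sketch of degree-2 feature expansion in plain pandas. It is not Transfory's implementation: `add_degree2_features` is a hypothetical helper, and the `^2` and `_x_` column names are only modeled on the names that appear in this notebook's transformed output.

```python
from itertools import combinations

import pandas as pd


def add_degree2_features(df, include_interactions=True):
    """Hypothetical helper: append squares and pairwise products of numeric columns."""
    numeric = df.select_dtypes(include="number")
    out = df.copy()
    # Squared terms, one per numeric column
    for col in numeric.columns:
        out[f"{col}^2"] = numeric[col] ** 2
    # Pairwise interaction terms, one per unordered pair of numeric columns
    if include_interactions:
        for left, right in combinations(numeric.columns, 2):
            out[f"{left}_x_{right}"] = numeric[left] * numeric[right]
    return out


demo = pd.DataFrame({"age": [25.0, 30.0], "experience": [2, 7]})
expanded = add_degree2_features(demo)
print(list(expanded.columns))
# ['age', 'experience', 'age^2', 'experience^2', 'age_x_experience']
```

With `n` numeric columns this adds `n` squares and `n * (n - 1) / 2` interaction columns, which is why the column count grows quickly once one-hot columns are included.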

## Step 3: Run the Pipeline

This cell executes the `fit_transform` method on your data, applying all the defined steps sequentially.

In [4]:
# Fit the pipeline to the data and transform it
transformed_df = full_pipeline.fit_transform(raw_df)

print("Transformed DataFrame (first 5 rows):")
transformed_df.head()

Transformed DataFrame (first 5 rows):


|   | age | experience | salary | city_New York | city_London | city_Paris | city_Tokyo | age^2 | experience^2 | salary^2 | ... | salary_x_city_New York | salary_x_city_London | salary_x_city_Paris | salary_x_city_Tokyo | city_New York_x_city_London | city_New York_x_city_Paris | city_New York_x_city_Tokyo | city_London_x_city_Paris | city_London_x_city_Tokyo | city_Paris_x_city_Tokyo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.245494 | -1.088662 | -1.445929 | 1.581139 | -0.866025 | -0.408248 | -0.408248 | -1.107371 | -0.807503 | -1.146828 | ... | 0.694112 | -0.803255 | -0.408248 | -0.408248 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.652402 | -0.408248 | -0.468399 | -0.632456 | 1.154701 | -0.408248 | -0.408248 | -0.676226 | -0.565252 | -0.567819 | ... | -0.578931 | 0.528457 | -0.408248 | -0.408248 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | -0.680414 | -0.834973 | -0.632456 | -0.866025 | 2.44949 | -0.408248 | -0.111426 | -0.694452 | -0.82372 | ... | -0.578931 | -0.803255 | 2.44949 | -0.408248 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.126876 | 1.360828 | 0.997894 | -0.632456 | -0.866025 | -0.408248 | 2.44949 | 1.08755 | 1.324304 | 0.921063 | ... | -0.578931 | -0.803255 | -0.408248 | 2.44949 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.059309 | 0.0 | 0.020365 | -0.632456 | 1.154701 | -0.408248 | -0.408248 | -0.166691 | -0.290701 | -0.15424 | ... | -0.578931 | 0.824394 | -0.408248 | -0.408248 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
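
The scaled values can be checked by hand. The sketch below recomputes the z-scores of the `experience` column (which has no missing values) using only the standard library, assuming the population standard deviation (`ddof=0`). Under that assumption the results agree with the `experience` column shown above. It is an independent check, not Transfory's code.

```python
from statistics import fmean, pstdev

# Raw 'experience' values from the sample data
experience = [2, 7, 5, 20, 10, 4, 22]

mu = fmean(experience)      # 10.0
sigma = pstdev(experience)  # population standard deviation, sqrt(54)

# z-score: subtract the mean, divide by the standard deviation
z_scores = [round((x - mu) / sigma, 6) for x in experience]
print(z_scores[:5])
# [-1.088662, -0.408248, -0.680414, 1.360828, 0.0]
```

Using the sample standard deviation (`statistics.stdev`, `ddof=1`) would give slightly smaller magnitudes, so this comparison is a quick way to tell which convention a scaler follows.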


## Step 4: View the Insight Report

The `InsightReporter` provides a human-readable summary of every action the pipeline took. This is the core of Transfory's **explainability**.

In [5]:
# Print the summary from the reporter
print(reporter.summary())

=== Transfory Insight Report ===
Session started: 2025-12-08 18:33:26
Total steps logged: 22

[2025-12-08 18:33:26] Step 'Pipeline' completed a 'fit_transform_step' event.
[2025-12-08 18:33:26] Step 'imputer_numeric' (MissingValueHandler) fitted. Will use 'mean' on 2 column(s): ['age', 'salary'].
[2025-12-08 18:33:26] Step 'imputer_numeric' (MissingValueHandler(strategy='mean')) completed a 'transform' event.
[2025-12-08 18:33:26] Step 'Pipeline' completed a 'fit_transform_done' event.
[2025-12-08 18:33:26] Step 'Pipeline' completed a 'fit_transform_step' event.
[2025-12-08 18:33:26] Step 'imputer_categorical' (MissingValueHandler) fitted. Will use 'mode' on 1 column(s): ['city'].
[2025-12-08 18:33:26] Step 'imputer_categorical' (MissingValueHandler(strategy='mode')) completed a 'transform' event.
[2025-12-08 18:33:26] Step 'Pipeline' completed a 'fit_transform_done' event.
[2025-12-08 18:33:26] Step 'Pipeline' completed a 'fit_transform_step' event.
[2025-12-08 18:33:26] Step 'encoder

You can also view the raw log data as a DataFrame for more detailed analysis.

In [6]:
reporter.summary(as_dataframe=True)

|   | timestamp | step | transformer_name | event | details | config |
|---|---|---|---|---|---|---|
| 0 | 2025-12-08 18:33:26 | Pipeline | Pipeline | fit_transform_step | {'step': 'imputer_numeric', 'input_shape': (7,... | {'name': 'Pipeline', 'steps': [('imputer_numer... |
| 1 | 2025-12-08 18:33:26 | imputer_numeric | MissingValueHandler(strategy='mean') | fit | {'input_shape': (7, 4), 'fitted_params': {'fil... | {'name': 'MissingValueHandler(strategy='mean')... |
| 2 | 2025-12-08 18:33:26 | imputer_numeric | MissingValueHandler(strategy='mean') | transform | {'input_shape': (7, 4), 'output_shape': (7, 4)} | {'name': 'MissingValueHandler(strategy='mean')... |
| 3 | 2025-12-08 18:33:26 | Pipeline | Pipeline | fit_transform_done | {'step': 'imputer_numeric', 'output_shape': (7... | {'name': 'Pipeline', 'steps': [('imputer_numer... |
| 4 | 2025-12-08 18:33:26 | Pipeline | Pipeline | fit_transform_step | {'step': 'imputer_categorical', 'input_shape':... | {'name': 'Pipeline', 'steps': [('imputer_numer... |
| 5 | 2025-12-08 18:33:26 | imputer_categorical | MissingValueHandler(strategy='mode') | fit | {'input_shape': (7, 4), 'fitted_params': {'fil... | {'name': 'MissingValueHandler(strategy='mode')... |
| 6 | 2025-12-08 18:33:26 | imputer_categorical | MissingValueHandler(strategy='mode') | transform | {'input_shape': (7, 4), 'output_shape': (7, 4)} | {'name': 'MissingValueHandler(strategy='mode')... |
| 7 | 2025-12-08 18:33:26 | Pipeline | Pipeline | fit_transform_done | {'step': 'imputer_categorical', 'output_shape'... | {'name': 'Pipeline', 'steps': [('imputer_numer... |
| 8 | 2025-12-08 18:33:26 | Pipeline | Pipeline | fit_transform_step | {'step': 'encoder', 'input_shape': (7, 4)} | {'name': 'Pipeline', 'steps': [('imputer_numer... |
| 9 | 2025-12-08 18:33:26 | encoder | Encoder(method='onehot') | fit | {'input_shape': (7, 4), 'fitted_params': {'map... | {'name': 'Encoder(method='onehot')', 'method':... |
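
The log follows a repeating pattern per step: a pipeline-level `fit_transform_step` event, the step's own `fit` and `transform` events, then a pipeline-level `fit_transform_done` event. The sketch below shows one way such a callback-based log could be produced. Everything in it (`TinyReporter`, `AddOne`, `run_fit_transform`, and the callback signature) is hypothetical and only mirrors the event names seen above; it says nothing about how Transfory is written internally.

```python
from datetime import datetime


class TinyReporter:
    """Hypothetical stand-in for a reporter: stores every event it receives."""

    def __init__(self):
        self.events = []

    def get_callback(self):
        def callback(step, event, details):
            self.events.append(
                {"timestamp": datetime.now(), "step": step, "event": event, "details": details}
            )
        return callback


class AddOne:
    """Toy transformer with the usual fit/transform pair."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [value + 1 for value in rows]


def run_fit_transform(steps, rows, callback):
    """Fit and apply each step in order, reporting progress through the callback."""
    for name, transformer in steps:
        callback("Pipeline", "fit_transform_step", {"step": name, "input_len": len(rows)})
        transformer.fit(rows)
        callback(name, "fit", {"input_len": len(rows)})
        rows = transformer.transform(rows)
        callback(name, "transform", {"output_len": len(rows)})
        callback("Pipeline", "fit_transform_done", {"step": name, "output_len": len(rows)})
    return rows


tiny_reporter = TinyReporter()
result = run_fit_transform(
    [("first", AddOne()), ("second", AddOne())], [1, 2, 3], tiny_reporter.get_callback()
)
for entry in tiny_reporter.events:
    print(entry["step"], entry["event"])
```

Passing a plain callable keeps the pipeline unaware of how events are stored, so the same loop could feed a list, a logger, or a DataFrame.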


## Step 5: Save and Load the Pipeline (Persistence)

A key feature is the ability to save your trained pipeline. This allows you to apply the *exact same transformations* to new, unseen data later (e.g., in a production environment).

In [7]:
# Save the fitted pipeline to a file
pipeline_filepath = "trained_transfory_pipeline.joblib"
full_pipeline.save(pipeline_filepath)

print(f"✅ Pipeline saved to '{pipeline_filepath}'")

# Load the pipeline back from the file
loaded_pipeline = Pipeline.load(pipeline_filepath)

print("✅ Pipeline loaded successfully!")
print(loaded_pipeline)

✅ Pipeline saved to 'trained_transfory_pipeline.joblib'
✅ Pipeline loaded successfully!
<Pipeline (5 steps): imputer_numeric → imputer_categorical → encoder → feature_generator → scaler>
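
Persistence works because everything learned during fitting is serialized and restored unchanged. The sketch below illustrates that round trip with a single fitted parameter. The `.joblib` extension suggests that `save` relies on joblib, but that is an assumption; this sketch uses the standard library's `pickle` instead, and `fit_mean` and `apply_fill` are hypothetical helpers.

```python
import os
import pickle
import tempfile


def fit_mean(values):
    """Learn a fill value (the mean of the non-missing entries)."""
    present = [v for v in values if v is not None]
    return {"fill_value": sum(present) / len(present)}


def apply_fill(state, values):
    """Replace missing entries with the previously learned fill value."""
    return [state["fill_value"] if v is None else v for v in values]


# "Training": learn the mean salary from the sample data
state = fit_mean([50000, 90000, 75000, 150000, 110000, None, 180000])

# Save the fitted state to disk, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fitted_state.pkl")
    with open(path, "wb") as handle:
        pickle.dump(state, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored state fills new data with the mean learned at training time
print(apply_fill(restored, [250000, None]))
```

Pickle-based formats can execute code on load, so only load files from sources you trust.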


## Step 6: Transform New Data with the Loaded Pipeline

Now, let's simulate receiving new data and use our `loaded_pipeline` to transform it. The loaded pipeline already knows the means, modes, and scaling parameters from the original data, ensuring consistency.

In [8]:
# Create some new, unseen data
new_data = pd.DataFrame({
    'age': [60, np.nan],
    'city': ['Paris', 'Dubai'], # 'Dubai' is an unseen category
    'experience': [35, 1],
    'salary': [250000, 45000]
})

print("New, unseen data:")
print(new_data)

# Use the loaded pipeline (with handle_unseen='error') to transform the new data.
# We expect this to fail because 'Dubai' is an unseen category.
try:
    print("\nAttempting to transform new data with handle_unseen='error'...")
    new_data_transformed = loaded_pipeline.transform(new_data)
    print("\nTransformed new data:")
    print(new_data_transformed)
except ValueError as e:
    print("\n✅ SUCCESS: The pipeline correctly raised a ValueError as expected.")
    print(f"Error message: {e}")

New, unseen data:
    age   city  experience  salary
0  60.0  Paris          35  250000
1   NaN  Dubai           1   45000

Attempting to transform new data with handle_unseen='error'...

✅ SUCCESS: The pipeline correctly raised a ValueError as expected.
Error message: Unseen categories in column 'city': ['Dubai']
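
The consistency described in this step comes from reusing statistics learned during fitting rather than recomputing them on the new data. Here is a minimal pandas sketch of that idea, assuming mean imputation for numeric columns and mode imputation for all others. `learn_fill_values` is a hypothetical helper, not part of Transfory.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 35, 28, 50],
    "city": ["New York", "London", "Paris", "Tokyo", "London", "New York", np.nan],
})
new = pd.DataFrame({"age": [60, np.nan], "city": ["Paris", "Dubai"]})


def learn_fill_values(df):
    """Hypothetical helper: mean for numeric columns, most frequent value otherwise."""
    fills = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fills[col] = df[col].mean()
        else:
            # Series.mode returns tied values in sorted order; take the first
            fills[col] = df[col].mode().iloc[0]
    return fills


fills = learn_fill_values(train)  # learned once, from the training data only
new_filled = new.fillna(fills)    # applied as-is to the new data
print(new_filled)
```

The missing `age` in the new data is filled with 35.5, the mean of the six known training ages, not with a statistic of the two new rows. London and New York both appear twice in the training column, so this sketch picks London through sorted tie-breaking; other tie-breaking rules are possible.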


## Step 7: Demonstrate `handle_unseen='ignore'`

Next, we build a new pipeline with the default `handle_unseen='ignore'` policy. This time, the pipeline should not raise an error. Instead, it creates columns only for the categories seen during fitting ('New York', 'London', 'Paris', 'Tokyo') and assigns `0` to all of them in the row containing 'Dubai'.

In [9]:
# Define a new pipeline with the default 'ignore' policy
ignore_pipeline = Pipeline([
    ("imputer_numeric", MissingValueHandler(strategy="mean")),
    ("imputer_categorical", MissingValueHandler(strategy="mode")),
    ("encoder", Encoder(method="onehot", handle_unseen="ignore")) # Default behavior
])

# Fit the pipeline on the original raw data
ignore_pipeline.fit(raw_df)

print("--- Transforming new data with handle_unseen='ignore' ---")
# Transform the new data. This should not raise an error.
new_data_ignored = ignore_pipeline.transform(new_data)

print("Transformed new data (unseen 'Dubai' is ignored):")
new_data_ignored

--- Transforming new data with handle_unseen='ignore' ---
Transformed new data (unseen 'Dubai' is ignored):


|   | age | experience | salary | city_New York | city_London | city_Paris | city_Tokyo |
|---|---|---|---|---|---|---|---|
| 0 | 60.0 | 35 | 250000 | 0 | 0 | 1 | 0 |
| 1 | 35.5 | 1 | 45000 | 0 | 0 | 0 | 0 |
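
The difference between the two policies can be sketched in a few lines of pandas. The helper below is hypothetical (`one_hot` is not Transfory's `Encoder`), and its error text is simply modeled on the message shown in Step 6. It fixes the category list at fit time and then either rejects or zero-fills values outside that list.

```python
import pandas as pd


def one_hot(series, categories, handle_unseen="ignore"):
    """Hypothetical helper: one-hot encode against a category list fixed at fit time."""
    unseen = sorted(set(series.dropna()) - set(categories))
    if unseen and handle_unseen == "error":
        raise ValueError(f"Unseen categories in column '{series.name}': {unseen}")
    # One indicator column per known category; unseen values match none of them
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in categories},
        index=series.index,
    )


# "Fit": remember the categories in order of first appearance
train_city = pd.Series(
    ["New York", "London", "Paris", "Tokyo", "London", "New York", "London"], name="city"
)
known = train_city.dropna().unique().tolist()

new_city = pd.Series(["Paris", "Dubai"], name="city")

encoded = one_hot(new_city, known, handle_unseen="ignore")
print(encoded)  # the 'Dubai' row is all zeros

try:
    one_hot(new_city, known, handle_unseen="error")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

An all-zero row keeps the output shape stable for downstream steps, at the cost of silently hiding categories the encoder has never seen. Raising an error is the stricter choice when new categories should trigger a review or a refit.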
