# Full Workflow Demo: Cleaning a Challenging, Messy Dataset

This notebook walks through a complete `Transfory` workflow on a messy, real-world dataset. We combine standard transformers, a custom transformer, and nested pipelines to clean and prepare the data for machine learning.

**The dataset (`challenging_messy_dataset.csv`) contains numerous issues:**
- Missing values in multiple columns.
- Inconsistent categorical data (e.g., 'm', 'Male', 'M').
- Invalid data entries (e.g., ages > 150, impossible dates).
- Mixed representations of boolean values in a single column (e.g., 'Y', 'true', '1').
- Extreme outliers.
- Unstructured text data in a 'notes' column.

### 1. Setup and Initial Data Exploration

In [None]:
import sys
import os
import pandas as pd
import numpy as np

# Add the project root to the path to allow importing 'transfory'
project_root = os.path.abspath('..')
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Import all the necessary tools from Transfory
from transfory import (
    BaseTransformer,
    Pipeline,
    ColumnTransformer,
    MissingValueHandler,
    Encoder,
    Scaler,
    OutlierHandler,
    DatetimeFeatureExtractor,
    FeatureGenerator,
    InsightReporter
)

print("Transfory modules loaded successfully!")

In [None]:
# Load the messy dataset
df = pd.read_csv('challenging_messy_dataset.csv')

print("--- Original Data Info ---")
df.info()

print("\n--- First 5 Rows of Messy Data ---")
display(df.head())
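
Before writing any cleaning logic, it helps to quantify the issues listed above. The following is a minimal sketch in plain pandas on a small, made-up frame; the `profile` helper and the toy values are assumptions for illustration, not rows from `challenging_messy_dataset.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the kinds of issues described above.
toy = pd.DataFrame({
    "age": [34, 212, np.nan, 45],
    "gender": ["m", "Male", "F", None],
    "is_active": ["Y", "true", "0", "?"],
})

def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and distinct non-null values per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })

summary = profile(toy)
print(summary)

# Distinct raw spellings in a categorical column reveal inconsistent encodings.
print(toy["gender"].value_counts(dropna=False))
```

In the notebook itself, the same helper could be called as `profile(df)` right after loading the CSV.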

### 2. Creating Custom Transformers for Complex Cleaning

Some cleaning tasks are too specific for a generic transformer. Here, we'll create a custom transformer to handle the unique messiness in our `gender`, `is_active`, `membership_level`, and `notes` columns. This demonstrates the extensibility of `Transfory`.

In [None]:
class CustomCleaner(BaseTransformer):
    """A custom transformer to handle specific data inconsistencies."""
    def __init__(self, name: str = "CustomCleaner"):
        super().__init__(name=name)

    def _fit(self, X: pd.DataFrame, y=None):
        # This transformer is stateless, so fit does nothing.
        pass

    def _transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X_out = X.copy()

        # 1. Standardize 'gender' column
        if 'gender' in X_out.columns:
            gender_map = {'m': 'Male', 'M': 'Male', 'f': 'Female', 'F': 'Female'}
            X_out['gender'] = X_out['gender'].replace(gender_map)
            # Consolidate all remaining values, including missing ones, into 'Other'
            valid_genders = ['Male', 'Female']
            X_out['gender'] = X_out['gender'].where(X_out['gender'].isin(valid_genders), 'Other')
            self._log("transform", {"message": "Standardized 'gender' column to Male/Female/Other."})

        # 2. Standardize 'is_active' column to numeric 1/0 (unrecognized values become NaN)
        if 'is_active' in X_out.columns:
            active_map = {'Y': True, 'Yes': True, '1': True, 'true': True,
                            'N': False, 'No': False, '0': False, 'false': False}
            X_out['is_active'] = X_out['is_active'].replace(active_map)
            # Any value not in the map (like '?') becomes NaN, to be handled later
            X_out['is_active'] = pd.to_numeric(X_out['is_active'], errors='coerce')
            self._log("transform", {"message": "Converted 'is_active' to numeric (1/0/NaN)."})

        # 3. Standardize 'membership_level' to lowercase
        if 'membership_level' in X_out.columns:
            X_out['membership_level'] = X_out['membership_level'].str.lower()
            # Replace 'none' string with actual NaN
            X_out['membership_level'] = X_out['membership_level'].replace('none', np.nan)
            self._log("transform", {"message": "Normalized 'membership_level' to lowercase."})

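        # 4. Feature Engineering from 'notes' column
        if 'notes' in X_out.columns:
            X_out['has_complaint'] = X_out['notes'].str.contains('complaint', case=False, na=False).astype(int)
            # The raw text is not needed downstream once the flag has been derived
            X_out = X_out.drop(columns=['notes'])
            self._log("transform", {"message": "Created 'has_complaint' feature from 'notes' and dropped 'notes'."})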

        return X_out
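
The individual rules inside `_transform` can be checked in isolation with plain pandas, without depending on `BaseTransformer`. The sketch below re-implements the `gender` and `is_active` rules as standalone functions. The function names and toy values are assumptions for illustration, and the `is_active` version uses `Series.map` instead of `replace` followed by `pd.to_numeric`, which should give the same result for the string values shown here.

```python
import pandas as pd

def standardize_gender(s: pd.Series) -> pd.Series:
    """Map short codes to full labels; everything else (including NaN) becomes 'Other'."""
    mapped = s.replace({"m": "Male", "M": "Male", "f": "Female", "F": "Female"})
    return mapped.where(mapped.isin(["Male", "Female"]), "Other")

def standardize_is_active(s: pd.Series) -> pd.Series:
    """Map truthy/falsy spellings to 1.0/0.0; unmapped values become NaN."""
    active_map = {"Y": 1, "Yes": 1, "1": 1, "true": 1,
                  "N": 0, "No": 0, "0": 0, "false": 0}
    return s.map(active_map).astype(float)

gender = standardize_gender(pd.Series(["m", "Male", "F", "unknown", None]))
active = standardize_is_active(pd.Series(["Y", "true", "0", "?", None]))

print(gender.tolist())
print(active.tolist())
```

Note that missing `gender` values end up as 'Other', so the categorical imputer later in the workflow never sees them as missing.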

### 3. Defining Sub-Pipelines for Different Data Types

We'll create specialized pipelines for handling numeric, datetime, and categorical data separately. This modular approach is a core strength of `Transfory`.

In [None]:
# Pipeline for cleaning and transforming numeric features
numeric_pipeline = Pipeline([
    # Cap extreme values at the 1st and 99th percentiles (e.g., impossible ages, extreme incomes/purchases)
    ("outliers", OutlierHandler(method='percentile', lower_quantile=0.01, upper_quantile=0.99)),
    # Impute any remaining missing values with the median
    ("imputer", MissingValueHandler(strategy='median')),
    # Generate interaction features between numeric columns
    ("feature_gen", FeatureGenerator(degree=2, include_interactions=True)),
    # Standardize all numeric features to zero mean and unit variance (z-score)
    ("scaler", Scaler(method='zscore'))
])

# Pipeline for datetime features
datetime_pipeline = Pipeline([
    # Extract year and month from the join date
    ("date_extractor", DatetimeFeatureExtractor(features=['year', 'month']))
])

# Pipeline for categorical features
categorical_pipeline = Pipeline([
    # Impute missing values with the most frequent category
    ("imputer", MissingValueHandler(strategy='mode')),
    # Convert categories to one-hot encoded format
    ("encoder", Encoder(method='onehot'))
])
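
To make the numeric steps concrete, here is a rough plain-pandas equivalent of percentile capping, median imputation, and z-score scaling. It illustrates the general techniques only and is not a description of how `OutlierHandler`, `MissingValueHandler`, or `Scaler` work internally. The function name, the toy values, and the choice of population standard deviation (`ddof=0`) are assumptions.

```python
import numpy as np
import pandas as pd

def clip_impute_zscore(s: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Cap at the given quantiles, fill NaN with the median, then standardize."""
    lo, hi = s.quantile(lower_q), s.quantile(upper_q)  # quantiles ignore NaN
    capped = s.clip(lower=lo, upper=hi)                # NaN stays NaN here
    filled = capped.fillna(capped.median())
    return (filled - filled.mean()) / filled.std(ddof=0)

# Hypothetical ages with one missing value and one impossible entry.
ages = pd.Series([23, 35, 41, np.nan, 29, 212, 38, 55], name="age")
scaled = clip_impute_zscore(ages)

print(scaled.round(3))
```

On a sample this small the 99th percentile sits close to the maximum, so the cap barely moves the impossible age. Percentile capping limits extremes relative to the observed distribution; it does not by itself guarantee that every invalid value is removed.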

### 4. Building the Master Preprocessing Pipeline

Now we assemble everything into a single pipeline. The master pipeline first runs our `CustomCleaner` on the entire dataset, then uses a `ColumnTransformer` to route each group of columns to its specialized sub-pipeline and combine the results.

In [None]:
# Initialize the InsightReporter to capture all events
reporter = InsightReporter()

# Define the ColumnTransformer to apply different pipelines to different columns
preprocessor = ColumnTransformer(
    transformers=[
        # Apply the numeric pipeline to specific numeric columns
        ("numeric_features", numeric_pipeline, ['age', 'income', 'last_purchase_amount']),
        # Apply the datetime pipeline to the join_date column
        ("datetime_features", datetime_pipeline, ['join_date']),
        # Apply the categorical pipeline to the cleaned categorical columns
        ("categorical_features", categorical_pipeline, ['city', 'gender', 'membership_level'])
    ],
    remainder='passthrough' # Keep other columns (like 'is_active' and 'has_complaint')
)

# Define the final, master pipeline
master_pipeline = Pipeline([
    # Step 1: Run the custom cleaner first to standardize data
    ("custom_cleaner", CustomCleaner()),
    # Step 2: Run the column-wise preprocessor
    ("preprocessor", preprocessor),
    # Step 3: Impute any missing values that might have been created (e.g., in 'is_active')
    ("final_imputer", MissingValueHandler(strategy='mode'))
], logging_callback=reporter.get_callback())

print("Master pipeline created successfully!")
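
The idea behind routing columns to different sub-pipelines can be sketched in plain pandas: apply one function per column subset, pass the remaining columns through unchanged, and concatenate the results. This is a conceptual illustration, not Transfory's implementation; `apply_by_columns`, the toy frame, and the two lambda transforms are hypothetical.

```python
import pandas as pd

def apply_by_columns(frame: pd.DataFrame, transformers) -> pd.DataFrame:
    """Apply each (name, func, columns) triple to its column subset and
    concatenate the outputs; columns not claimed by any triple pass through."""
    parts, used = [], set()
    for name, func, cols in transformers:
        parts.append(func(frame[cols]))
        used.update(cols)
    remainder = frame[[c for c in frame.columns if c not in used]]
    return pd.concat(parts + [remainder], axis=1)

toy = pd.DataFrame({
    "income": [40_000.0, 60_000.0, 80_000.0],
    "city": ["Oslo", "Lima", "Oslo"],
    "is_active": [1.0, 0.0, 1.0],
})

result = apply_by_columns(toy, [
    ("numeric", lambda d: (d - d.mean()) / d.std(ddof=0), ["income"]),
    ("categorical", lambda d: pd.get_dummies(d, dtype=int), ["city"]),
])

print(result)
```

Here `is_active` plays the role of the `remainder='passthrough'` columns: it is not assigned to any transform and is appended unchanged.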

### 5. Executing the Pipeline and Viewing the Results

With our entire workflow defined in a single object, we can now apply it to our messy data with one command: `fit_transform`.

In [None]:
# Drop identifier columns we don't need for the model.
# 'notes' must stay in the input: CustomCleaner derives 'has_complaint' from it.
df_to_process = df.drop(columns=['customer_id'])

# Run the entire pipeline!
try:
    # The dates are intentionally messy: DatetimeFeatureExtractor is expected to
    # coerce invalid dates to NaT rather than raise. The try/except reports any
    # other failure without stopping the notebook.
    transformed_df = master_pipeline.fit_transform(df_to_process)

    print("\n--- Transformed Data (first 5 rows) ---")
    display(transformed_df.head())

    print("\n--- Final Data Info ---")
    transformed_df.info()

except Exception as e:
    print(f"An error occurred during pipeline execution: {e}")
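
Once the pipeline has run, a quick sanity check can confirm that the output is ready for modeling. The sketch below flags remaining missing values and non-numeric columns. The `find_problems` helper and both toy frames are assumptions for illustration and are not part of `Transfory`.

```python
import numpy as np
import pandas as pd

def find_problems(frame: pd.DataFrame) -> list[str]:
    """Return human-readable issues that would block most ML estimators."""
    problems = []
    missing = frame.isna().sum()
    for col in missing[missing > 0].index:
        problems.append(f"{col}: {int(missing[col])} missing value(s)")
    for col in frame.select_dtypes(exclude=["number", "bool"]).columns:
        problems.append(f"{col}: non-numeric dtype {frame[col].dtype}")
    return problems

# Hypothetical outputs: one fully numeric, one with leftover issues.
clean = pd.DataFrame({"age": [0.1, -0.3], "gender_Male": [1, 0]})
dirty = pd.DataFrame({"age": [0.1, np.nan], "notes": ["late delivery", "ok"]})

print(find_problems(clean))
print(find_problems(dirty))
```

In the notebook, the same check could be applied as `find_problems(transformed_df)`; an empty list means no missing values and no non-numeric columns remain.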

### 6. The Explainability Report

Finally, we turn to the `InsightReporter`, which recorded every step logged during the run. Let's review its detailed, human-readable summary of the actions performed on our data. The nested structure clearly shows what happened inside the `ColumnTransformer` and its sub-pipelines.

In [None]:
print("="*30 + " INSIGHT REPORT " + "="*30)
print(reporter.summary())