# Dataruns Transform Library - Comprehensive Examples

This notebook provides comprehensive examples demonstrating advanced usage of the dataruns transform library. We'll explore:

1. **Standalone Transforms** - Using individual transforms
2. **Transform Composer** - Chaining multiple transforms together
3. **Dataruns Pipeline Integration** - Integrating transforms with the main pipeline
4. **Custom Preprocessing Pipelines** - Using convenience functions
5. **Advanced Pipelines** - Building complex preprocessing workflows

This notebook demonstrates real-world scenarios where you need to preprocess data before analysis or machine learning.

## Setup and Imports

Let's start by importing the necessary libraries and setting up our environment.

In [None]:
import numpy as np
import pandas as pd
from dataruns.core.transforms import (
    StandardScaler, MinMaxScaler, DropNA, FillNA, 
    SelectColumns, TransformComposer,
    create_preprocessing_pipeline
)
from dataruns.core.pipeline import Pipeline, Make_Pipeline

print("✓ All imports successful!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

✓ All imports successful!
NumPy version: 2.2.5
Pandas version: 2.2.3


## Sample Data Creation

First, let's create a sample dataset that we'll use throughout all our examples. This dataset includes various types of features and intentionally has missing values to demonstrate data preprocessing techniques.

In [5]:
def create_sample_data():
    """Create sample data for demonstration."""
    np.random.seed(42)
    data = pd.DataFrame({
        'feature1': np.random.normal(10, 2, 100),      # Normal distribution
        'feature2': np.random.normal(5, 1, 100),       # Different scale
        'feature3': np.random.normal(0, 0.5, 100),     # Centered around 0
        'feature4': np.random.exponential(2, 100),     # Exponential distribution
        'category': np.random.choice(['A', 'B', 'C'], 100)  # Categorical
    })
    
    # Add some missing values to simulate real-world data
    data.iloc[5:10, 0] = np.nan    # Missing in feature1
    data.iloc[15:20, 1] = np.nan   # Missing in feature2
    data.iloc[25:30, 2] = np.nan   # Missing in feature3
    
    return data

# Create our sample dataset
sample_data = create_sample_data()

print(f"Dataset shape: {sample_data.shape}")
print("Missing values per column:")
print(sample_data.isnull().sum())
print("\nDataset info:")
sample_data.info()  # info() prints its report directly and returns None
print("\nFirst 5 rows:")
sample_data.head()

Dataset shape: (100, 5)
Missing values per column:
feature1    5
feature2    5
feature3    5
feature4    0
category    0
dtype: int64

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   feature1  95 non-null     float64
 1   feature2  95 non-null     float64
 2   feature3  95 non-null     float64
 3   feature4  100 non-null    float64
 4   category  100 non-null    object 
dtypes: float64(4), object(1)
memory usage: 4.0+ KB

First 5 rows:


    feature1  feature2  feature3  feature4  category
0  10.993428  3.584629  0.178894  0.662668         A
1   9.723471  4.579355  0.280392  0.390667         C
2  11.295377  4.657285  0.541526  2.777513         B
3  13.046060  4.197723  0.526901  3.288418         B
4   9.531693  4.838714 -0.688835  9.314010         A


## Example 1: Standalone Transforms

In this example, we'll use individual transforms separately. This approach gives you full control over each step of the preprocessing pipeline and is useful when you need to inspect intermediate results.

In [6]:
print("="*60)
print("EXAMPLE 1: Standalone Transforms")
print("="*60)

data = create_sample_data()
print(f"Original data shape: {data.shape}")
print(f"Missing values: {data.isnull().sum().sum()}")

# Step 1: Handle missing values
fill_na = FillNA(method='mean')
numerical_data = data.select_dtypes(include=[np.number])
data_filled = fill_na.fit_transform(numerical_data)
print(f"After filling NA: {data_filled.isnull().sum().sum()} missing values")

# Step 2: Scale features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_filled)
print(f"Scaled data mean: {data_scaled.mean().round(3).tolist()}")
print(f"Scaled data std: {data_scaled.std().round(3).tolist()}")

print("\nFirst 5 rows of processed data:")
print(data_scaled.head())

EXAMPLE 1: Standalone Transforms
Original data shape: (100, 5)
Missing values: 15
After filling NA: 0 missing values
Scaled data mean: [0.0, -0.0, 0.0, 0.0]
Scaled data std: [1.0, 1.0, 1.0, 1.0]

First 5 rows of processed data:
   feature1  feature2  feature3  feature4
0  0.711342 -1.525980  0.271283 -0.730803
1 -0.006724 -0.463299  0.461248 -0.849971
2  0.882072 -0.380044  0.949985  0.195742
3  1.871952 -0.871003  0.922614  0.419577
4 -0.115160 -0.186221 -1.352758  3.059478
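
To make the two steps above concrete, here is a minimal pandas-only sketch of what mean imputation and standardization compute. It assumes that `FillNA(method='mean')` fills each column with its own mean and that `StandardScaler` subtracts the column mean and divides by the column standard deviation. Which standard-deviation convention (`ddof`) dataruns uses is not confirmed here, so the sketch simply uses the pandas default. The toy frame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one missing value per column
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 50.0],
})

# Mean imputation: each NaN is replaced by the mean of its own column
filled = df.fillna(df.mean())

# Standardization: subtract the column mean, divide by the column std.
# pandas' .std() uses ddof=1; this is an assumption, not a documented
# statement about what the dataruns StandardScaler does internally.
standardized = (filled - filled.mean()) / filled.std()

print(filled)
print(standardized.round(3))
```

Inspecting `filled` before scaling mirrors the benefit of the standalone approach: each intermediate result is available for checking.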


## Example 2: Transform Composer

The `TransformComposer` allows you to chain multiple transforms together in a single, reusable pipeline. This is more convenient than applying transforms individually and ensures consistency across different datasets.

In [7]:
print("="*60)
print("EXAMPLE 2: Transform Composer")
print("="*60)

data = create_sample_data()

# Create a preprocessing pipeline using TransformComposer
preprocessor = TransformComposer(
    SelectColumns(['feature1', 'feature2', 'feature3', 'feature4']),  # Select numerical columns
    FillNA(method='median'),                                          # Fill missing values with median
    StandardScaler()                                                  # Standardize features
)

# Apply the entire pipeline at once
result = preprocessor.fit_transform(data)
print(f"Processed data shape: {result.shape}")
print(f"Missing values after processing: {result.isnull().sum().sum()}")

print("\nFirst 5 rows of processed data:")
print(result.head())

print("\nData statistics after preprocessing:")
print(f"Mean: {result.mean().round(3).tolist()}")
print(f"Std: {result.std().round(3).tolist()}")

EXAMPLE 2: Transform Composer
Processed data shape: (100, 4)
Missing values after processing: 0

First 5 rows of processed data:
   feature1  feature2  feature3  feature4
0  0.711677 -1.528818  0.269137 -0.730803
1 -0.006388 -0.466226  0.459093 -0.849971
2  0.882407 -0.382978  0.947809  0.195742
3  1.872286 -0.873895  0.920439  0.419577
4 -0.114824 -0.189171 -1.354833  3.059478

Data statistics after preprocessing:
Mean: [0.0, 0.0, 0.0, 0.0]
Std: [1.0, 1.0, 1.0, 1.0]
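
The consistency benefit comes from learning statistics once and reusing them on new data. The sketch below illustrates that idea with hypothetical stand-in classes (`MedianFill`, `Standardize`, `MiniComposer`) written in plain pandas. It is not the dataruns implementation, and the example above only confirms that `TransformComposer` offers `fit_transform`; the separate `transform` method shown here is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


class MedianFill:
    """Hypothetical transform: learns column medians, then fills NaNs with them."""

    def fit(self, df):
        self.medians_ = df.median()
        return self

    def transform(self, df):
        return df.fillna(self.medians_)


class Standardize:
    """Hypothetical transform: learns column mean and std, then rescales."""

    def fit(self, df):
        self.mean_ = df.mean()
        self.std_ = df.std()
        return self

    def transform(self, df):
        return (df - self.mean_) / self.std_


class MiniComposer:
    """Hypothetical stand-in for a composer: applies its steps in order."""

    def __init__(self, *steps):
        self.steps = steps

    def fit_transform(self, df):
        # Fit each step on the output of the previous step
        for step in self.steps:
            df = step.fit(df).transform(df)
        return df

    def transform(self, df):
        # Reuse the statistics learned during fit_transform
        for step in self.steps:
            df = step.transform(df)
        return df


train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0]})
new = pd.DataFrame({"x": [np.nan, 3.0]})

composer = MiniComposer(MedianFill(), Standardize())
train_out = composer.fit_transform(train)
new_out = composer.transform(new)  # uses the training median, mean, and std
```

Because `new` is filled and scaled with the training statistics, both of its rows end up at the training median rather than being standardized against their own two values.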


## Example 3: Integration with Dataruns Pipeline

This example shows how to integrate transforms with the main dataruns `Pipeline` class. This approach allows you to combine data preprocessing with other data processing steps in a unified workflow.

In [8]:
print("="*60)
print("EXAMPLE 3: Integration with Dataruns Pipeline")
print("="*60)

data = create_sample_data()

# Define preprocessing function using transforms
def preprocess_numerical(df):
    """Preprocess numerical columns using transforms."""
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    preprocessor = TransformComposer(
        SelectColumns(numerical_cols.tolist()),
        FillNA(method='mean'),
        MinMaxScaler(feature_range=(0, 1))
    )
    return preprocessor.fit_transform(df)

def add_feature_engineering(data):
    """Add some feature engineering."""
    result = data.copy()
    if hasattr(data, 'columns') and len(data.columns) >= 2:
        # Add interaction features for dataframes
        result[f'{data.columns[0]}_x_{data.columns[1]}'] = data.iloc[:, 0] * data.iloc[:, 1]
        print(f"Added interaction feature: {data.columns[0]}_x_{data.columns[1]}")
    return result

# Create dataruns pipeline that combines preprocessing with feature engineering
pipeline = Pipeline(
    preprocess_numerical,      # Step 1: Preprocess numerical data
    add_feature_engineering    # Step 2: Add engineered features
)

# Apply the entire pipeline
result = pipeline(data)
print(f"Pipeline result shape: {result.shape}")
print("Pipeline result (first 5 rows):")
print(result.head())

EXAMPLE 3: Integration with Dataruns Pipeline
Added interaction feature: feature1_x_feature2
Pipeline result shape: (100, 5)
Pipeline result (first 5 rows):
   feature1  feature2  feature3  feature4  feature1_x_feature2
0  0.696879  0.108516  0.507338  0.064788             0.075623
1  0.554890  0.322946  0.535953  0.037291             0.179199
2  0.730639  0.339745  0.609574  0.278584             0.248231
3  0.926376  0.240679  0.605451  0.330234             0.222959
4  0.533448  0.378855  0.262701  0.939380             0.202099
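
Conceptually, a pipeline of plain functions passes the output of each step to the next. The sketch below shows that pattern with a hypothetical `compose` helper built on `functools.reduce`; it is only an illustration of the idea and makes no claim about how the dataruns `Pipeline` class is implemented. The step functions and the toy frame are likewise hypothetical.

```python
from functools import reduce

import numpy as np
import pandas as pd


def compose(*funcs):
    """Hypothetical helper: return a callable that applies funcs left to right."""
    def run(data):
        return reduce(lambda acc, f: f(acc), funcs, data)
    return run


def fill_with_mean(df):
    # Replace each NaN with the mean of its column
    return df.fillna(df.mean())


def add_interaction(df):
    # Product of the first two columns, named like the example above
    out = df.copy()
    out[f"{df.columns[0]}_x_{df.columns[1]}"] = df.iloc[:, 0] * df.iloc[:, 1]
    return out


toy = pd.DataFrame({"u": [1.0, np.nan, 3.0], "v": [2.0, 4.0, 6.0]})
pipeline_sketch = compose(fill_with_mean, add_interaction)
out = pipeline_sketch(toy)
print(out)
```

Step order matters: the interaction column is computed from the filled values only because `fill_with_mean` runs first.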


## Example 4: Custom Preprocessing Pipeline

The dataruns library provides convenience functions like `create_preprocessing_pipeline()` that create common preprocessing workflows with just a few parameters. This is perfect for rapid prototyping and standard preprocessing tasks.

In [9]:
print("="*60)
print("EXAMPLE 4: Custom Preprocessing Pipeline")
print("="*60)

data = create_sample_data()

# Create custom preprocessing using the convenience function
preprocessing_pipeline = create_preprocessing_pipeline(
    scale_method='minmax',    # Use MinMax scaling
    handle_missing='fill',    # Fill missing values
    fill_value=None          # Use mean for filling (None defaults to mean)
)

# Apply to numerical data only
numerical_data = data.select_dtypes(include=[np.number])
processed = preprocessing_pipeline.fit_transform(numerical_data)

print("Original data range:")
print(f"Min: {numerical_data.min().round(3).tolist()}")
print(f"Max: {numerical_data.max().round(3).tolist()}")

print("\nProcessed data range (should be 0-1 after MinMax scaling):")
print(f"Min: {processed.min().round(3).tolist()}")
print(f"Max: {processed.max().round(3).tolist()}")

print(f"\nMissing values before: {numerical_data.isnull().sum().sum()}")
print(f"Missing values after: {processed.isnull().sum().sum()}")

print("\nProcessed data (first 5 rows):")
print(processed.head())

EXAMPLE 4: Custom Preprocessing Pipeline
Original data range:
Min: [4.761, 3.081, -1.621, 0.022]
Max: [13.705, 7.72, 1.926, 9.914]

Processed data range (should be 0-1 after MinMax scaling):
Min: [0.0, 0.0, 0.0, 0.0]
Max: [1.0, 1.0, 1.0, 1.0]

Missing values before: 15
Missing values after: 0

Processed data (first 5 rows):
   feature1  feature2  feature3  feature4
0  0.696879  0.108516  0.507338  0.064788
1  0.554890  0.322946  0.535953  0.037291
2  0.730639  0.339745  0.609574  0.278584
3  0.926376  0.240679  0.605451  0.330234
4  0.533448  0.378855  0.262701  0.939380
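
For reference, min-max scaling maps each column linearly so that its minimum lands on the lower bound of the target range and its maximum on the upper bound. The function below is a hand-written sketch of that formula in pandas, not the dataruns `MinMaxScaler`. The name `min_max_scale` and the constant-column guard are assumptions added for illustration.

```python
import pandas as pd


def min_max_scale(df, feature_range=(0.0, 1.0)):
    """Sketch of min-max scaling; not the dataruns implementation."""
    lo, hi = feature_range
    col_min = df.min()
    col_range = df.max() - col_min
    # Guard against constant columns, which would otherwise divide by zero
    col_range = col_range.replace(0, 1.0)
    return (df - col_min) / col_range * (hi - lo) + lo


toy = pd.DataFrame({"a": [4.0, 6.0, 8.0], "b": [1.0, 1.0, 1.0]})
unit = min_max_scale(toy)                 # columns mapped into [0, 1]
signed = min_max_scale(toy, (-1.0, 1.0))  # columns mapped into [-1, 1]
print(unit)
print(signed)
```

In this sketch a constant column such as `b` is mapped to the lower bound of the range; how dataruns treats that case is not shown in this notebook.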


## Example 5: Advanced Pipeline with Custom Transforms

In this final example, we'll create a sophisticated preprocessing pipeline that includes custom functions for outlier removal and log transformation, combined with standard preprocessing steps. This showcases the flexibility of the dataruns system.

In [10]:
print("="*60)
print("EXAMPLE 5: Advanced Pipeline with Custom Transforms")
print("="*60)

data = create_sample_data()

# Define custom transformation functions
def remove_outliers(df, n_std=2):
    """Drop rows where any numeric column lies beyond n_std standard deviations (columns are filtered in sequence)."""
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    original_shape = df.shape[0]
    
    for col in numerical_cols:
        mean = df[col].mean()
        std = df[col].std()
        df = df[(df[col] >= mean - n_std * std) & (df[col] <= mean + n_std * std)]
    
    print(f"Outlier removal: {original_shape} -> {df.shape[0]} rows ({original_shape - df.shape[0]} outliers removed)")
    return df

def log_transform_positive(df):
    """Apply log1p to every numeric column whose values are all positive."""
    result = df.copy()
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    
    for col in numerical_cols:
        if (df[col] > 0).all():
            result[col] = np.log1p(df[col])  # log1p is log(1+x), safer for values near 0
            print(f"Applied log transformation to {col}")
    
    return result

# Create advanced pipeline using Make_Pipeline
advanced_pipeline = Make_Pipeline()
advanced_pipeline.add(
    lambda x: x.select_dtypes(include=[np.number]),           # Select numerical columns
    lambda x: FillNA(method='median').fit_transform(x),      # Fill missing values
    remove_outliers,                                         # Remove outliers
    log_transform_positive,                                  # Log transform
    lambda x: StandardScaler().fit_transform(x)             # Standardize
)

# Build and apply the pipeline
pipeline = advanced_pipeline.build()
result = pipeline(data)

print(f"Advanced pipeline result shape: {result.shape}")
print(f"Original data shape: {data.shape}")
print("Advanced pipeline result (first 5 rows):")
print(result.head())

EXAMPLE 5: Advanced Pipeline with Custom Transforms
Outlier removal: 100 -> 82 rows (18 outliers removed)
Applied log transformation to feature1
Applied log transformation to feature2
Applied log transformation to feature4
Advanced pipeline result shape: (82, 4)
Original data shape: (100, 5)
Advanced pipeline result (first 5 rows):
   feature1  feature2  feature3  feature4
0  0.757813 -1.843210  0.272577 -0.693713
1  0.001856 -0.351475  0.492754 -0.999505
2  0.925752 -0.246100  1.059219  0.711041
3  1.824862 -0.889727  1.027495  0.928183
5  0.001856  0.695800 -1.132682 -0.323458
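
One detail of `remove_outliers` as written above: the data frame is re-filtered inside the loop, so each column's mean and standard deviation are computed on the rows that survived the earlier columns, and the result can depend on column order. The variant below is a sketch of an alternative that computes all bounds once on the unfiltered frame and applies a single combined mask. The function name `remove_outliers_once` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def remove_outliers_once(df, n_std=2):
    """Variant sketch: bounds come from the unfiltered frame, one combined mask."""
    numeric = df.select_dtypes(include=[np.number])
    lower = numeric.mean() - n_std * numeric.std()
    upper = numeric.mean() + n_std * numeric.std()
    # A row is kept only if every numeric column lies within its bounds
    keep = ((numeric >= lower) & (numeric <= upper)).all(axis=1)
    return df[keep]


rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": np.append(rng.normal(0, 1, 50), 25.0),  # one obvious outlier in x
    "y": np.append(rng.normal(5, 1, 50), 5.0),
})
trimmed = remove_outliers_once(toy)
print(f"{len(toy)} -> {len(trimmed)} rows")
```

Comparisons against `NaN` evaluate to `False`, so this variant also drops rows with missing numeric values; like the pipeline above, it is best applied after missing values have been filled.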


## Summary and Conclusion

🎉 **Congratulations!** You've successfully explored all the comprehensive transform examples in the dataruns library.

### What We've Covered:

1. **Standalone Transforms**: How to use individual transforms step-by-step for fine-grained control
2. **Transform Composer**: Chaining multiple transforms into reusable preprocessing pipelines
3. **Pipeline Integration**: Combining transforms with the main dataruns Pipeline for complete workflows
4. **Convenience Functions**: Using `create_preprocessing_pipeline()` for rapid prototyping
5. **Advanced Pipelines**: Building sophisticated preprocessing workflows with custom functions

### Key Takeaways:

- ✅ **Flexibility**: Multiple approaches for different use cases
- ✅ **Reusability**: Create preprocessing pipelines that can be applied to new data
- ✅ **Integration**: Seamless integration with the broader dataruns ecosystem
- ✅ **Customization**: Easy to add custom preprocessing steps
- ✅ **Scalability**: The same pipeline definitions carry over from the 100-row sample used here to larger datasets

### Next Steps:

- Try applying these patterns to your own datasets
- Experiment with different scaling methods and missing value strategies
- Create custom transforms for domain-specific preprocessing needs
- Combine with dataruns' other features for complete data processing workflows

Happy data processing! 🚀