# Dataruns Transform Examples

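This notebook walks through data transformation with the dataruns library. The cells below exercise scaling, transform composition, and Pipeline integration; the library's transforms fall into the following groups: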

- **Scaling Transforms**: StandardScaler, MinMaxScaler
- **Missing Value Handling**: DropNA, FillNA
- **Column Operations**: SelectColumns, RenameColumns
- **Custom Transforms**: ApplyFunction, FilterRows
- **Transform Composition**: TransformComposer and preprocessing pipelines
- **Integration**: Using transforms with dataruns Pipeline

## Overview

The dataruns transform system provides a scikit-learn-style interface (for example, `fit_transform`) for data preprocessing, and it works with both pandas DataFrames and NumPy arrays.

In [None]:
import numpy as np
import pandas as pd


# Import dataruns transforms
from dataruns.core.transforms import (
    StandardScaler, MinMaxScaler, DropNA, FillNA, 
    SelectColumns, TransformComposer,
    create_preprocessing_pipeline
)
from dataruns.core.pipeline import Pipeline, Make_Pipeline

print("✓ All transforms imported successfully!")

# Set random seed for reproducibility
np.random.seed(42)
print("✓ Random seed set for reproducible results")

✓ All transforms imported successfully!
✓ Random seed set for reproducible results


## Sample Data Creation

Let's create a realistic dataset with different types of features and some missing values to demonstrate the various transforms.

In [2]:
# Create sample data with different characteristics
data = pd.DataFrame({
    'feature1': np.random.normal(10, 2, 100),      # Normal distribution, mean=10, std=2
    'feature2': np.random.normal(5, 1, 100),       # Normal distribution, mean=5, std=1  
    'feature3': np.random.normal(0, 0.5, 100),     # Normal distribution, mean=0, std=0.5
    'feature4': np.random.exponential(2, 100),     # Exponential distribution
    'category': np.random.choice(['A', 'B', 'C'], 100)  # Categorical feature
})

# Add some missing values to make it realistic
data.iloc[5:10, 0] = np.nan    # Missing values in feature1
data.iloc[15:20, 1] = np.nan   # Missing values in feature2
data.iloc[25:30, 2] = np.nan   # Missing values in feature3

print("✓ Sample data created")
print(f"Data shape: {data.shape}")
print("Data types:")
print(data.dtypes)
print("\nMissing values per column:")
print(data.isnull().sum())
print("\nFirst 10 rows:")
print(data.head(10))

✓ Sample data created
Data shape: (100, 5)
Data types:
feature1    float64
feature2    float64
feature3    float64
feature4    float64
category     object
dtype: object

Missing values per column:
feature1    5
feature2    5
feature3    5
feature4    0
category    0
dtype: int64

First 10 rows:
    feature1  feature2  feature3  feature4 category
0  10.993428  3.584629  0.178894  0.662668        A
1   9.723471  4.579355  0.280392  0.390667        C
2  11.295377  4.657285  0.541526  2.777513        B
3  13.046060  4.197723  0.526901  3.288418        B
4   9.531693  4.838714 -0.688835  9.314010        A
5        NaN  5.404051 -0.468913  1.064159        B
6        NaN  6.886186  0.257518  0.930488        B
7        NaN  5.174578  0.256893  2.995909        B
8        NaN  5.257550  0.257524  0.833467        B
9        NaN  4.925554  1.926366  5.340276        A


## 1. Individual Transforms

Let's start by demonstrating individual transforms and their effects on the data.

### Standard Scaling

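StandardScaler rescales each feature to zero mean and unit standard deviation: the column mean is subtracted from every value and the result is divided by the column standard deviation. As the output below shows, missing values pass through unchanged (the counts for feature1 to feature3 stay at 95).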
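To make the arithmetic concrete, here is a minimal sketch of the same z-score computation in plain pandas. It is not the dataruns implementation: the toy frame and the choice of pandas' default sample standard deviation (`ddof=1`) are assumptions made for illustration, and the library may differ in those details.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value (not the notebook's data)
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# z = (x - mean) / std, computed per column.
# pandas skips NaN when computing mean/std and uses ddof=1 by default.
z = (df - df.mean()) / df.std()

print(z.mean().round(6).tolist())   # per-column means, approximately 0
print(z.std().round(6).tolist())    # per-column stds, approximately 1
print(int(z["a"].isna().sum()))     # the missing value is still missing
```

Because the statistics are computed per column, each feature ends up on the same scale regardless of its original units.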

In [5]:
# Select only numerical columns for scaling
numerical_data = data.select_dtypes(include=[np.number])
print("Numerical features to scale:")
print(list(numerical_data.columns))

# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numerical_data)

print("\n✓ StandardScaler applied")
print(f"Original data shape: {numerical_data.shape}")
print(f"Scaled data shape: {scaled_data.shape}")

print("\nOriginal data statistics:")
print(numerical_data.describe().round(3))

print("\nScaled data statistics:")
print(scaled_data.describe().round(3))

print(f"\nScaled data means (should be ~0): {scaled_data.mean().round(6).tolist()}")
print(f"Scaled data stds (should be ~1): {scaled_data.std().round(6).tolist()}")

Numerical features to scale:
['feature1', 'feature2', 'feature3', 'feature4']

✓ StandardScaler applied
Original data shape: (100, 4)
Scaled data shape: (100, 4)

Original data statistics:
       feature1  feature2  feature3  feature4
count    95.000    95.000    95.000   100.000
mean      9.735     5.013     0.034     2.331
std       1.815     0.961     0.548     2.283
min       4.761     3.081    -1.621     0.022
25%       8.753     4.191    -0.326     0.689
50%       9.723     5.069     0.057     1.605
75%      10.723     5.502     0.361     3.174
max      13.705     7.720     1.926     9.914

Scaled data statistics:
       feature1  feature2  feature3  feature4
count    95.000    95.000    95.000   100.000
mean      0.000     0.000     0.000     0.000
std       1.000     1.000     1.000     1.000
min      -2.741    -2.011    -3.018    -1.012
25%      -0.541    -0.856    -0.657    -0.719
50%      -0.007     0.058     0.042    -0.318
75%       0.544     0.509     0.597     0.370
max 

### Transform Composition

The `TransformComposer` allows you to chain multiple transforms together into a single preprocessing pipeline.
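Conceptually, chaining means that the output of each step becomes the input of the next. The sketch below illustrates that idea with hypothetical stand-in classes (`DropMissing`, `KeepColumns`, `Compose`); these names are invented for this example and are not part of dataruns, whose `TransformComposer` may be implemented differently.

```python
import pandas as pd

class DropMissing:
    """Hypothetical stand-in transform: drop rows that contain NaN."""
    def fit_transform(self, df):
        return df.dropna()

class KeepColumns:
    """Hypothetical stand-in transform: keep only the named columns."""
    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, df):
        return df[self.columns]

class Compose:
    """Hypothetical composer: apply each step to the previous step's output."""
    def __init__(self, *steps):
        self.steps = steps

    def fit_transform(self, df):
        for step in self.steps:
            df = step.fit_transform(df)
        return df

df = pd.DataFrame({
    "x": [1.0, None, 3.0],
    "y": [4.0, 5.0, 6.0],
    "label": ["a", "b", "c"],
})

out = Compose(DropMissing(), KeepColumns(["x", "y"])).fit_transform(df)
print(out.shape)  # (2, 2): one row dropped, one column removed
```

Step order matters: dropping rows before selecting columns means a missing value in any column, including ones that are later discarded, removes the row.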

In [6]:
# Create a preprocessing pipeline using TransformComposer
composer = TransformComposer(
    DropNA(),                                    # Remove rows with missing values
    SelectColumns(['feature1', 'feature2', 'feature3', 'feature4']),  # Select specific columns
    StandardScaler()                             # Standardize the features
)

print("✓ TransformComposer pipeline created with 3 steps:")
print("  1. DropNA() - Remove missing values")
print("  2. SelectColumns() - Select numerical features")  
print("  3. StandardScaler() - Standardize features")

# Apply the composed transformation
transformed_data = composer.fit_transform(data)

print("\n✓ Transform composition applied")
print(f"Original data shape: {data.shape}")
print(f"Transformed data shape: {transformed_data.shape}")
print(f"Missing values after processing: {transformed_data.isnull().sum().sum()}")

print("\nTransformed data (first 5 rows):")
print(transformed_data.head())

print("\nTransformed data statistics:")
print(transformed_data.describe().round(3))

✓ TransformComposer pipeline created with 3 steps:
  1. DropNA() - Remove missing values
  2. SelectColumns() - Select numerical features
  3. StandardScaler() - Standardize features

✓ Transform composition applied
Original data shape: (100, 5)
Transformed data shape: (85, 4)
Missing values after processing: 0

Transformed data (first 5 rows):
   feature1  feature2  feature3  feature4
0  0.629890 -1.462518  0.306503 -0.675677
1 -0.054804 -0.420451  0.496970 -0.792411
2  0.792685 -0.338811  0.986999  0.231942
3  1.736561 -0.820246  0.959555  0.451206
4 -0.158201 -0.148747 -1.321830  3.037184

Transformed data statistics:
       feature1  feature2  feature3  feature4
count    85.000    85.000    85.000    85.000
mean      0.000    -0.000     0.000     0.000
std       1.000     1.000     1.000     1.000
min      -2.731    -1.990    -3.070    -0.951
25%      -0.493    -0.840    -0.642    -0.792
50%       0.017     0.083     0.077    -0.316
75%       0.630     0.567     0.698     0.328
max

## 2. Integration with Dataruns Pipeline

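Transforms can also be used inside the main dataruns `Pipeline`. Here, a preprocessing function built with `create_preprocessing_pipeline()` is wrapped in a `Pipeline` and applied to the sample data.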

In [7]:
# Create preprocessing pipeline using convenience function
preprocessing = create_preprocessing_pipeline(
    scale_method='minmax',      # Use MinMax scaling instead of standard
    handle_missing='fill',      # Fill missing values instead of dropping
    fill_value=None            # Use mean for filling (None means auto-select method)
)

print("✓ Preprocessing pipeline created using create_preprocessing_pipeline()")
print("  - Scale method: MinMax scaling")
print("  - Missing values: Fill with mean")

# Create a function to apply preprocessing to numerical data
def preprocess_data(data):
    """Apply preprocessing to numerical columns only."""
    numerical_data = data.select_dtypes(include=[np.number])
    print(f"  → Processing {len(numerical_data.columns)} numerical columns")
    return preprocessing.fit_transform(numerical_data)

# Integrate with dataruns Pipeline
pipeline = Pipeline(preprocess_data)
print("✓ Dataruns Pipeline created with preprocessing function")

# Apply the pipeline
result = pipeline(data)
print("\n✓ Pipeline execution completed")
print(f"Pipeline result shape: {result.shape}")
print(f"Missing values in result: {result.isnull().sum().sum()}")

print("\nPipeline result data range:")
print(f"Min values: {result.min().round(3).tolist()}")
print(f"Max values: {result.max().round(3).tolist()}")

print("\nPipeline result (first 5 rows):")
print(result.head())

✓ Preprocessing pipeline created using create_preprocessing_pipeline()
  - Scale method: MinMax scaling
  - Missing values: Fill with mean
✓ Dataruns Pipeline created with preprocessing function
  → Processing 4 numerical columns

✓ Pipeline execution completed
Pipeline result shape: (100, 4)
Missing values in result: 0

Pipeline result data range:
Min values: [0.0, 0.0, 0.0, 0.0]
Max values: [1.0, 1.0, 1.0, 1.0]

Pipeline result (first 5 rows):
   feature1  feature2  feature3  feature4
0  0.696879  0.108516  0.507338  0.064788
1  0.554890  0.322946  0.535953  0.037291
2  0.730639  0.339745  0.609574  0.278584
3  0.926376  0.240679  0.605451  0.330234
4  0.533448  0.378855  0.262701  0.939380
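The result above (no missing values, every column spanning 0 to 1) is what filling with the column mean followed by min-max scaling would produce. As a rough sketch of that arithmetic in plain pandas, on a toy frame invented for this example and without claiming this is how `create_preprocessing_pipeline()` works internally:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (not the notebook's data)
df = pd.DataFrame({
    "a": [2.0, np.nan, 6.0, 10.0],
    "b": [1.0, 3.0, np.nan, 5.0],
})

# Step 1: replace each NaN with its column mean (computed ignoring NaN)
filled = df.fillna(df.mean())

# Step 2: min-max scale each column to [0, 1]: (x - min) / (max - min)
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(scaled)
```

Unlike dropping rows, filling keeps all rows, which is why the pipeline result has shape (100, 4) rather than (85, 4). Note that this simple formula divides by zero for a constant column, so a real implementation needs to handle that case.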


## Summary

This notebook demonstrated the core pieces of the dataruns transform system:

### Key Features Covered:
1. **Individual Transforms**: StandardScaler for feature normalization
2. **Transform Composition**: TransformComposer for chaining multiple transforms
3. **Pipeline Integration**: Seamless integration with dataruns Pipeline system
4. **Convenience Functions**: create_preprocessing_pipeline() for common workflows

### Benefits:
- ✅ **Familiar Interface**: Similar to sklearn's transform API
- ✅ **Flexible Composition**: Easy to chain and combine transforms
- ✅ **Type Support**: Works with both pandas DataFrames and numpy arrays
- ✅ **Missing Value Handling**: Built-in strategies for dealing with missing data
- ✅ **Pipeline Integration**: Transforms work seamlessly with dataruns pipelines

### Available Transforms:
- **Scaling**: `StandardScaler`, `MinMaxScaler`
- **Missing Values**: `DropNA`, `FillNA` 
- **Column Operations**: `SelectColumns`, `RenameColumns`
- **Custom**: `ApplyFunction`, `FilterRows`
- **Encoding**: `OneHotEncoder`
- **Composition**: `TransformComposer`

### Next Steps:
Explore the comprehensive transform examples notebook for more advanced use cases and custom transformation patterns.