# Feature Engine Demo

This notebook demonstrates the new **dynamic feature store** approach using the Feature Engine.

## Key Features:
- ✅ Self-documenting features with metadata
- ✅ Automatic dependency resolution
- ✅ Runtime introspection
- ✅ Input validation
- ✅ Easy to extend

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import feature engine
from feature_engine import feature, registry, executor
from feature_engine.introspection import print_statistics, get_feature_catalog

print("Feature Engine loaded successfully!")

Feature Engine loaded successfully!


## 1. Load Sample Data

In [2]:
# Load auto parts data
df = pd.read_csv('data/auto_partes/auto_partes_transactions.csv')

# Parse dates: keep only the date part, because the time part is malformed.
# The raw data contains values like "14:06:00 PM" (14 is already 24-hour).
# The inith column supplies the hour instead.
df['fecha'] = df['fecha'].str.split(' ').str[0]
df['fecha'] = pd.to_datetime(df['fecha'], format='%m/%d/%Y')

print(f"Loaded {len(df)} transactions")
print(f"Date range: {df['fecha'].min()} to {df['fecha'].max()}")
print(f"\nColumns: {list(df.columns)}")
df.head()

Loaded 5527 transactions
Date range: 2024-05-01 00:00:00 to 2024-06-30 00:00:00

Columns: ['trans_id', 'fecha', 'producto', 'glosa', 'costo', 'total', 'cantidad', 'inith', 'initm', 'customer']


| | trans_id | fecha | producto | glosa | costo | total | cantidad | inith | initm | customer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AP000001_1 | 2024-05-01 | ACC001 | LLAVE TUERCAS CRUZ 4 PUNTAS | 8500 | 15000 | 1 | 10 | 1 | MECANICA_RAPIDA_LAS_CONDES |
| 1 | AP000002_1 | 2024-05-01 | TIRE001 | NEUMATICO 195/65 R15 VERANO | 45000 | 75000 | 1 | 14 | 6 | LUBRICENTRO_PROVIDENCIA |
| 2 | AP000002_2 | 2024-05-01 | OIL002 | ACEITE MOTOR 15W40 MINERAL 4L | 8000 | 12000 | 1 | 14 | 6 | LUBRICENTRO_PROVIDENCIA |
| 3 | AP000002_3 | 2024-05-01 | OIL002 | ACEITE MOTOR 15W40 MINERAL 4L | 8000 | 12000 | 1 | 14 | 6 | LUBRICENTRO_PROVIDENCIA |
| 4 | AP000003_1 | 2024-05-01 | ENG004 | TERMOSTATO MOTOR | 15000 | 28000 | 1 | 8 | 38 | TALLER_MECANICO_SAN_MIGUEL |
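
To see why the loading cell keeps only the date part, here is a minimal standard-library sketch. The sample string is made up to match the format described in the comments (the exact layout of the raw `fecha` values is an assumption): a 12-hour parse rejects `14:06:00 PM`, while parsing the date part alone succeeds.

```python
from datetime import datetime

# Hypothetical raw value, built from the format described in the loading cell.
raw = "05/01/2024 14:06:00 PM"

# A 12-hour parse fails because %I only accepts hours 01-12.
try:
    datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")
    full_parse_ok = True
except ValueError:
    full_parse_ok = False

# Keeping only the date part sidesteps the malformed time component.
date_only = datetime.strptime(raw.split(" ")[0], "%m/%d/%Y")

print(f"Full timestamp parsed: {full_parse_ok}")
print(f"Date part parsed as: {date_only.date()}")
```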


## 2. Define Configuration

In [3]:
config = {
    'project_name': 'auto_partes_demo',
    'analysis_date': datetime.now(),
    'date_col': 'fecha',
    'product_col': 'producto',
    'description_col': 'glosa',
    'revenue_col': 'total',
    'quantity_col': 'cantidad',
    'transaction_col': 'trans_id',
    'cost_col': 'costo'
}

print("Configuration:")
for key, value in config.items():
    if key != 'analysis_date':
        print(f"  {key}: {value}")
    else:
        print(f"  {key}: {value.strftime('%Y-%m-%d')}")

Configuration:
  project_name: auto_partes_demo
  analysis_date: 2025-10-07
  date_col: fecha
  product_col: producto
  description_col: glosa
  revenue_col: total
  quantity_col: cantidad
  transaction_col: trans_id
  cost_col: costo


## 3. Define Features with @feature Decorator

Let's migrate some key features from the waterfall pipeline.

In [4]:
# Feature 1: Time Extraction
@feature(
    name='time_extraction',
    description='Extract time-based features (hour, weekday, month, etc.) from date column',
    category='filter',
    dtype='integer',
    requires=['date_col'],
    tags=['time', 'preprocessing']
)
def extract_time_features(df, config):
    """Extract time-based features from date column"""
    date_col = config['date_col']

    # Hour
    if 'hour' not in df.columns:
        if 'inith' in df.columns:
            df['hour'] = df['inith']
        else:
            df['hour'] = df[date_col].dt.hour

    # Weekday name
    if 'weekday' not in df.columns:
        df['weekday'] = df[date_col].dt.day_name()

    # Weekday number (0=Monday, 6=Sunday)
    if 'weekday_num' not in df.columns:
        df['weekday_num'] = df[date_col].dt.dayofweek

    # Month
    if 'month' not in df.columns:
        df['month'] = df[date_col].dt.month

    # Month name
    if 'month_name' not in df.columns:
        df['month_name'] = df[date_col].dt.month_name()

    # Year
    if 'year' not in df.columns:
        df['year'] = df[date_col].dt.year

    # Week of year
    if 'week_of_year' not in df.columns:
        df['week_of_year'] = df[date_col].dt.isocalendar().week

    # Day of month
    if 'day_of_month' not in df.columns:
        df['day_of_month'] = df[date_col].dt.day

    return df

[OK] Registered feature: time_extraction (filter)
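
The engine's internals are not shown in this notebook, but the registration message suggests that the decorator stores each function together with its metadata. Below is a minimal sketch of how such a decorator could work. `FeatureSpec`, `SKETCH_REGISTRY`, `sketch_feature`, and `double_total` are hypothetical names for illustration, not the Feature Engine API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureSpec:
    """Metadata recorded for one feature (hypothetical structure)."""
    name: str
    func: Callable
    description: str = ""
    category: str = "filter"
    requires: List[str] = field(default_factory=list)
    depends_on: List[str] = field(default_factory=list)


# Hypothetical stand-in for the engine's registry: a plain dict keyed by name.
SKETCH_REGISTRY: Dict[str, FeatureSpec] = {}


def sketch_feature(name, description="", category="filter",
                   requires=None, depends_on=None):
    """Decorator factory that records the function and its metadata."""
    def decorator(func):
        SKETCH_REGISTRY[name] = FeatureSpec(
            name=name,
            func=func,
            description=description,
            category=category,
            requires=list(requires or []),
            depends_on=list(depends_on or []),
        )
        print(f"[OK] Registered feature: {name} ({category})")
        return func  # the decorated function is returned unchanged
    return decorator


# Toy feature operating on a plain dict instead of a DataFrame.
@sketch_feature(name="double_total", description="Double the revenue value",
                requires=["revenue_col"])
def double_total(row, config):
    return row[config["revenue_col"]] * 2
```

Because the decorator returns the original function, decorated features can still be called directly, which keeps them easy to unit-test.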


In [5]:
# Feature 2: Profit Margin
@feature(
    name='profit_margin',
    description='Calculate profit margin percentage: (revenue - cost) / revenue * 100',
    category='filter',
    dtype='float',
    requires=['revenue_col', 'cost_col'],
    optional_requires=['cost_col'],
    tags=['financial', 'profitability']
)
def calculate_profit_margin(df, config):
    """Calculate profit margin percentage"""
    revenue_col = config['revenue_col']
    cost_col = config.get('cost_col')

    if cost_col and cost_col in df.columns:
        df['profit_margin'] = np.where(
            df[revenue_col] > 0,
            ((df[revenue_col] - df[cost_col]) / df[revenue_col]) * 100,
            0
        )
        df['profit'] = df[revenue_col] - df[cost_col]
    else:
        print("[INFO] cost_col not available, skipping profit_margin")

    return df

[OK] Registered feature: profit_margin (filter)


In [6]:
# Feature 3: Price Per Unit
@feature(
    name='price_per_unit',
    description='Calculate price per unit: revenue / quantity',
    category='filter',
    dtype='float',
    requires=['revenue_col', 'quantity_col'],
    tags=['pricing', 'unit_economics']
)
def calculate_price_per_unit(df, config):
    """Calculate price per unit"""
    revenue_col = config['revenue_col']
    quantity_col = config['quantity_col']

    df['price_per_unit'] = np.where(
        df[quantity_col] > 0,
        df[revenue_col] / df[quantity_col],
        0
    )

    return df

[OK] Registered feature: price_per_unit (filter)


In [7]:
# Feature 4: Time of Day (depends on time_extraction)
@feature(
    name='time_of_day',
    description='Classify transactions by time of day (Morning/Afternoon/Evening/Night)',
    category='filter',
    dtype='string',
    depends_on=['time_extraction'],  # Needs hour column
    tags=['time', 'segmentation']
)
def classify_time_of_day(df, config):
    """Classify time of day"""
    def classify(hour):
        if pd.isna(hour):
            return 'Unknown'
        elif 5 <= hour < 12:
            return 'Morning'
        elif 12 <= hour < 17:
            return 'Afternoon'
        elif 17 <= hour < 21:
            return 'Evening'
        else:
            return 'Night'

    df['time_of_day'] = df['hour'].apply(classify)
    return df

[OK] Registered feature: time_of_day (filter)
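
The bucket boundaries in `classify` can also be expressed as a sorted lookup. The sketch below is an alternative formulation using only the standard library, not what the notebook runs; `classify_hour`, `BOUNDARIES`, and `LABELS` are illustrative names.

```python
import math
from bisect import bisect_right

# Lower bounds of the buckets, matching the thresholds used in classify().
BOUNDARIES = [5, 12, 17, 21]
# One label per gap: <5, 5-11, 12-16, 17-20, >=21.
LABELS = ["Night", "Morning", "Afternoon", "Evening", "Night"]


def classify_hour(hour):
    """Return the time-of-day label for an hour, or 'Unknown' if missing."""
    if hour is None or (isinstance(hour, float) and math.isnan(hour)):
        return "Unknown"
    return LABELS[bisect_right(BOUNDARIES, hour)]


for h in (4, 5, 11, 12, 16, 17, 20, 21):
    print(h, classify_hour(h))
```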


In [8]:
# Feature 5: Weekend Flag (depends on time_extraction)
@feature(
    name='is_weekend',
    description='Flag weekend transactions (Saturday/Sunday)',
    category='filter',
    dtype='boolean',
    depends_on=['time_extraction'],  # Needs weekday_num column
    tags=['time', 'segmentation']
)
def flag_weekend(df, config):
    """Flag weekend transactions"""
    df['is_weekend'] = df['weekday_num'].isin([5, 6])
    return df

[OK] Registered feature: is_weekend (filter)


In [9]:
# Feature 6: Transaction Size Category
@feature(
    name='transaction_size_category',
    description='Categorize transaction size as Small/Medium/Large based on revenue terciles (33rd/67th percentiles)',
    category='filter',
    dtype='string',
    requires=['revenue_col'],
    tags=['segmentation', 'revenue']
)
def categorize_transaction_size(df, config):
    """Categorize transaction size"""
    revenue_col = config['revenue_col']

    q33 = df[revenue_col].quantile(0.33)
    q67 = df[revenue_col].quantile(0.67)

    def categorize(value):
        if value <= q33:
            return 'Small'
        elif value <= q67:
            return 'Medium'
        else:
            return 'Large'

    df['transaction_size'] = df[revenue_col].apply(categorize)
    return df

[OK] Registered feature: transaction_size_category (filter)


## 4. Introspect the Feature Registry

Let's see what features are now available.

In [10]:
# Print registry statistics
print_statistics(registry)


FEATURE REGISTRY STATISTICS

Total Features: 6

By Category:
  filter         :   6 features

By Data Type:
  boolean        :   1 features
  float          :   2 features
  integer        :   1 features
  string         :   2 features

Aggregations: 0 features
Features with no dependencies: 4
Average dependencies per feature: 0.33
Max dependencies: 1

Top Tags:
  time           :   3 features
  segmentation   :   3 features
  preprocessing  :   1 features
  financial      :   1 features
  profitability  :   1 features
  pricing        :   1 features
  unit_economics :   1 features
  revenue        :   1 features



In [11]:
# List all features
print("Available Features:")
print("="*70)

for name in sorted(registry.get_feature_names()):
    info = registry.introspect(name)
    print(f"\n{name}")
    print(f"  Description: {info['description']}")
    print(f"  Category: {info['category']}")
    print(f"  Requires: {info['requires']}")
    if info['depends_on']:
        print(f"  Depends on: {info['depends_on']}")
    print(f"  Tags: {info['tags']}")

Available Features:

is_weekend
  Description: Flag weekend transactions (Saturday/Sunday)
  Category: filter
  Requires: []
  Depends on: ['time_extraction']
  Tags: ['time', 'segmentation']

price_per_unit
  Description: Calculate price per unit: revenue / quantity
  Category: filter
  Requires: ['quantity_col', 'revenue_col']
  Tags: ['pricing', 'unit_economics']

profit_margin
  Description: Calculate profit margin percentage: (revenue - cost) / revenue * 100
  Category: filter
  Requires: ['cost_col', 'revenue_col']
  Tags: ['financial', 'profitability']

time_extraction
  Description: Extract time-based features (hour, weekday, month, etc.) from date column
  Category: filter
  Requires: ['date_col']
  Tags: ['time', 'preprocessing']

time_of_day
  Description: Classify transactions by time of day (Morning/Afternoon/Evening/Night)
  Category: filter
  Requires: []
  Depends on: ['time_extraction']
  Tags: ['time', 'segmentation']

transaction_size_category
  Description: Categori

## 5. Visualize Dependencies

In [12]:
# Show execution order
all_features = registry.get_feature_names()
execution_order = registry.get_execution_order(all_features)

print("Execution Order (with dependencies):")
print("="*70)
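# Use feature_name as the loop variable so the imported @feature decorator is not overwritten
for i, feature_name in enumerate(execution_order, 1):
    deps = registry.get_dependencies(feature_name, recursive=False)
    if deps:
        print(f"{i}. {feature_name} (after: {', '.join(deps)})")
    else:
        print(f"{i}. {feature_name} (no dependencies)")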

Execution Order (with dependencies):
1. price_per_unit (no dependencies)
2. profit_margin (no dependencies)
3. time_extraction (no dependencies)
4. is_weekend (after: time_extraction)
5. time_of_day (after: time_extraction)
6. transaction_size_category (no dependencies)
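
An execution order like the one above is a topological sort of the dependency graph. How `registry.get_execution_order` is implemented is not shown here, so the following is only a sketch of the idea, using the standard library's `graphlib` (Python 3.9+) and the dependencies declared in the feature definitions. The exact order it produces may differ from the engine's, since several orders are valid.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Feature -> set of features it depends on, copied from the definitions above.
dependency_graph = {
    "time_extraction": set(),
    "profit_margin": set(),
    "price_per_unit": set(),
    "time_of_day": {"time_extraction"},
    "is_weekend": {"time_extraction"},
    "transaction_size_category": set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependency_graph).static_order())

for position, name in enumerate(order, 1):
    print(f"{position}. {name}")
```

`TopologicalSorter` raises `graphlib.CycleError` if the graph contains a circular dependency, which is a useful safety check when features start depending on each other.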


## 6. Execute Features

Now let's execute all features with automatic dependency resolution.

In [13]:
# Validate first
is_valid, errors = executor.validate_inputs(all_features, config, df)

if is_valid:
    print("[OK] All features validated successfully!")
else:
    print("[ERROR] Validation failed:")
    for error in errors:
        print(f"  - {error}")

[OK] All features validated successfully!
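
Validation of this kind usually checks two things: that every config key a feature requires is present, and that the column it points to exists in the data. The sketch below illustrates that idea under those assumptions. `validate_inputs_sketch` is a hypothetical helper, and the real `executor.validate_inputs` may check more or behave differently.

```python
def validate_inputs_sketch(required_keys, config, columns):
    """Return (is_valid, errors) for one feature's required config keys.

    required_keys: config keys the feature declares in `requires`
    config:        mapping of config key -> column name
    columns:       column names available in the data
    """
    errors = []
    for key in required_keys:
        if key not in config:
            errors.append(f"missing config key '{key}'")
        elif config[key] not in columns:
            errors.append(f"column '{config[key]}' (from '{key}') not in data")
    return (not errors, errors)


# Deliberately incomplete inputs to show both kinds of error.
sample_config = {"revenue_col": "total", "quantity_col": "cantidad"}
sample_columns = ["trans_id", "fecha", "total"]

ok, problems = validate_inputs_sketch(
    ["revenue_col", "quantity_col", "cost_col"], sample_config, sample_columns
)
print(ok)
for problem in problems:
    print(f"  - {problem}")
```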


In [14]:
# Execute all features
result_df = executor.execute(
    df,
    all_features,
    config,
    verbose=True
)

print(f"\n[SUCCESS] Executed {len(all_features)} features!")
print(f"\nOriginal columns: {len(df.columns)}")
print(f"Final columns: {len(result_df.columns)}")
print(f"New columns added: {len(result_df.columns) - len(df.columns)}")


Executing 6 features...
Order: price_per_unit -> profit_margin -> time_extraction -> is_weekend -> time_of_day -> transaction_size_category

  * Executing: price_per_unit (filter)... [OK]
  * Executing: profit_margin (filter)... [OK]
  * Executing: time_extraction (filter)... [OK]
  * Executing: is_weekend (filter)... [OK]
  * Executing: time_of_day (filter)... [OK]
  * Executing: transaction_size_category (filter)... [OK]

[SUCCESS] Executed 6 features!

Original columns: 10
Final columns: 24
New columns added: 14


## 7. Inspect Results

In [15]:
# Show new columns
new_cols = [col for col in result_df.columns if col not in df.columns]

print(f"New Columns Created ({len(new_cols)}):")
print("="*70)
for col in new_cols:
    print(f"  - {col}")

New Columns Created (14):
  - price_per_unit
  - profit_margin
  - profit
  - hour
  - weekday
  - weekday_num
  - month
  - month_name
  - year
  - week_of_year
  - day_of_month
  - is_weekend
  - time_of_day
  - transaction_size


In [16]:
# Sample the results
display_cols = ['producto', 'glosa', 'total', 'cantidad', 'profit_margin', 
                'price_per_unit', 'time_of_day', 'is_weekend', 'transaction_size']

result_df[display_cols].head(10)

| | producto | glosa | total | cantidad | profit_margin | price_per_unit | time_of_day | is_weekend | transaction_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACC001 | LLAVE TUERCAS CRUZ 4 PUNTAS | 15000 | 1 | 43.333333 | 15000.0 | Morning | False | Small |
| 1 | TIRE001 | NEUMATICO 195/65 R15 VERANO | 75000 | 1 | 40.0 | 75000.0 | Afternoon | False | Large |
| 2 | OIL002 | ACEITE MOTOR 15W40 MINERAL 4L | 12000 | 1 | 33.333333 | 12000.0 | Afternoon | False | Small |
| 3 | OIL002 | ACEITE MOTOR 15W40 MINERAL 4L | 12000 | 1 | 33.333333 | 12000.0 | Afternoon | False | Small |
| 4 | ENG004 | TERMOSTATO MOTOR | 28000 | 1 | 46.428571 | 28000.0 | Morning | False | Medium |
| 5 | TIRE001 | NEUMATICO 195/65 R15 VERANO | 150000 | 2 | 70.0 | 75000.0 | Morning | False | Large |
| 6 | TIRE003 | NEUMATICO 185/60 R14 INVIERNO | 170000 | 2 | 70.588235 | 85000.0 | Morning | False | Large |
| 7 | ACC003 | LIQUIDO FRENOS DOT 4 500ML | 8500 | 1 | 47.058824 | 8500.0 | Morning | False | Small |
| 8 | ENG002 | CORREA DISTRIBUCION MOTOR | 120000 | 2 | 70.833333 | 60000.0 | Morning | False | Large |
| 9 | OIL002 | ACEITE MOTOR 15W40 MINERAL 4L | 24000 | 2 | 66.666667 | 12000.0 | Morning | False | Medium |
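
As a sanity check, the two formulas can be reproduced by hand for rows whose `costo` appears in the sample data from Section 1. The helper functions below are plain-Python restatements of the feature logic, written only for this check.

```python
# costo/total/cantidad copied from rows 0 and 4 of the sample data in Section 1.
rows = [
    {"producto": "ACC001", "costo": 8500, "total": 15000, "cantidad": 1},
    {"producto": "ENG004", "costo": 15000, "total": 28000, "cantidad": 1},
]


def profit_margin(total, costo):
    """(revenue - cost) / revenue * 100, or 0 when revenue is not positive."""
    return (total - costo) / total * 100 if total > 0 else 0.0


def price_per_unit(total, cantidad):
    """revenue / quantity, or 0 when quantity is not positive."""
    return total / cantidad if cantidad > 0 else 0.0


for row in rows:
    margin = profit_margin(row["total"], row["costo"])
    unit_price = price_per_unit(row["total"], row["cantidad"])
    print(f"{row['producto']}: margin={margin:.6f}, price_per_unit={unit_price}")
```

Rounded to six decimals, the margins are 43.333333 and 46.428571, matching rows 0 and 4 of the table.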


## 8. Analyze Results

In [17]:
# Profit margin distribution
print("Profit Margin Statistics:")
print("="*70)
print(result_df['profit_margin'].describe())

print("\nTransactions by Time of Day:")
print(result_df['time_of_day'].value_counts())

print("\nWeekend vs Weekday:")
print(result_df['is_weekend'].value_counts())

print("\nTransaction Size Distribution:")
print(result_df['transaction_size'].value_counts())

Profit Margin Statistics:
count    5527.000000
mean       58.329481
std        17.753639
min        33.333333
25%        41.379310
50%        66.666667
75%        71.052632
max        89.411765
Name: profit_margin, dtype: float64

Transactions by Time of Day:
time_of_day
Morning      3136
Afternoon    2070
Evening       321
Name: count, dtype: int64

Weekend vs Weekday:
is_weekend
False    5190
True      337
Name: count, dtype: int64

Transaction Size Distribution:
transaction_size
Medium    1866
Small     1839
Large     1822
Name: count, dtype: int64


## 9. Test Dependency Resolution

Let's execute just one feature that has dependencies and see the engine automatically resolve them.

In [18]:
# Execute time_of_day which depends on time_extraction
test_df = df.copy()

print("Executing 'time_of_day' with automatic dependency resolution:")
print("="*70)

test_result = executor.execute_with_dependencies(
    test_df,
    'time_of_day',
    config,
    verbose=True
)

print("\nColumns created:")
new_test_cols = [col for col in test_result.columns if col not in test_df.columns]
for col in new_test_cols:
    print(f"  - {col}")

Executing 'time_of_day' with automatic dependency resolution:
Feature 'time_of_day' requires 1 dependencies:
  - time_extraction

Executing 2 features...
Order: time_extraction -> time_of_day

  * Executing: time_extraction (filter)... [OK]
  * Executing: time_of_day (filter)... [OK]

Columns created:
  - hour
  - weekday
  - weekday_num
  - month
  - month_name
  - year
  - week_of_year
  - day_of_month
  - time_of_day
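
Resolving a single feature means collecting its transitive dependencies and running them first. A minimal sketch of that step, assuming a depth-first traversal, is shown below. `resolve_with_dependencies` is a hypothetical helper, not the engine's `execute_with_dependencies`.

```python
def resolve_with_dependencies(target, depends_on):
    """Return the target and its transitive dependencies, dependencies first."""
    ordered = []
    visiting = set()

    def visit(name):
        if name in ordered:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"circular dependency involving '{name}'")
        visiting.add(name)
        for dep in depends_on.get(name, []):
            visit(dep)
        visiting.discard(name)
        ordered.append(name)

    visit(target)
    return ordered


# Dependencies as declared via depends_on in the feature definitions.
declared = {
    "time_of_day": ["time_extraction"],
    "is_weekend": ["time_extraction"],
}

plan = resolve_with_dependencies("time_of_day", declared)
print(" -> ".join(plan))
```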


## 10. Performance Comparison

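Let's measure the execution time of the new approach. The old waterfall pipeline is not timed in this notebook, so the figures below are a baseline for a later comparison rather than a side-by-side result.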

In [19]:
import time

# New approach (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
result_new = executor.execute(df, all_features, config, verbose=False)
time_new = time.perf_counter() - start

print("Performance Results:")
print("="*70)
print(f"Feature Engine: {time_new:.4f} seconds")
print(f"Features executed: {len(all_features)}")
print(f"Rows processed: {len(df):,}")
print(f"Throughput: {len(df)/time_new:,.0f} rows/second")

Performance Results:
Feature Engine: 0.0146 seconds
Features executed: 6
Rows processed: 5,527
Throughput: 377,328 rows/second
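
A single run of about 15 ms is sensitive to noise, so repeating the measurement gives a steadier number. The sketch below shows the pattern with the standard library's `timeit`. `workload` is a hypothetical placeholder; in the notebook it would wrap the `executor.execute(...)` call.

```python
import timeit


def workload():
    """Placeholder for the call being measured (hypothetical stand-in)."""
    return sum(i * i for i in range(10_000))


CALLS_PER_RUN = 20

# Five independent runs of 20 calls each; the minimum is the least noisy estimate.
runs = timeit.repeat(workload, number=CALLS_PER_RUN, repeat=5)
best_per_call = min(runs) / CALLS_PER_RUN

print(f"Best of {len(runs)} runs: {best_per_call:.6f} seconds per call")
```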


## 11. Generate Feature Catalog

Create documentation for all features.

In [20]:
# Generate catalog as Markdown
catalog_md = get_feature_catalog(registry, format='markdown')

# Save to file
with open('docs/FEATURE_CATALOG_AUTO.md', 'w', encoding='utf-8') as f:
    f.write(catalog_md)

print("[OK] Feature catalog saved to: docs/FEATURE_CATALOG_AUTO.md")
print("\nPreview:")
print("="*70)
print(catalog_md[:500] + "...")

[OK] Feature catalog saved to: docs/FEATURE_CATALOG_AUTO.md

Preview:
# Feature Catalog

**Generated:** 2025-10-07T15:32:06.018071
**Total Features:** 6

## FILTER

*6 features*

### `is_weekend`

**Description:** Flag weekend transactions (Saturday/Sunday)

**Type:** `boolean`

**Depends On:** `time_extraction`

**Tags:** time, segmentation

**Version:** 1.0.0

---

### `price_per_unit`

**Description:** Calculate price per unit: revenue / quantity

**Type:** `float`

**Requires:** `quantity_col, revenue_col`

**Tags:** pricing, unit_economics

**Version:** 1.0.0...
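
The preview shows each feature rendered from its metadata in a fixed layout. As a rough illustration of how such an entry could be produced, here is a sketch that formats one metadata dictionary. `render_feature_markdown` and the dictionary keys are assumptions modeled on the preview, not the implementation of `get_feature_catalog`.

```python
def render_feature_markdown(spec):
    """Render one feature's metadata in the layout seen in the catalog preview."""
    lines = [
        f"### `{spec['name']}`",
        "",
        f"**Description:** {spec['description']}",
        "",
        f"**Type:** `{spec['dtype']}`",
        "",
    ]
    # Requires / Depends On are only emitted when the feature declares them.
    if spec.get("requires"):
        lines += [f"**Requires:** `{', '.join(sorted(spec['requires']))}`", ""]
    if spec.get("depends_on"):
        lines += [f"**Depends On:** `{', '.join(spec['depends_on'])}`", ""]
    lines.append(f"**Tags:** {', '.join(spec['tags'])}")
    return "\n".join(lines)


# Metadata copied from the is_weekend definition earlier in the notebook.
is_weekend_spec = {
    "name": "is_weekend",
    "description": "Flag weekend transactions (Saturday/Sunday)",
    "dtype": "boolean",
    "requires": [],
    "depends_on": ["time_extraction"],
    "tags": ["time", "segmentation"],
}

entry = render_feature_markdown(is_weekend_spec)
print(entry)
```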


## Summary

### Key Takeaways:

1. **Self-Documenting**: Each feature declares its requirements
2. **Automatic Dependencies**: Features execute in correct order automatically
3. **Introspection**: Can discover and inspect features at runtime
4. **Validation**: Catches errors before execution
5. **Extensible**: Add new features by decorating functions

### Next Steps:

1. Migrate remaining features from `waterfall/scripts/filters.py`
2. Add aggregation features (attributes)
3. Add scoring features
4. Integrate with existing dashboard generation
5. Update `main_driver.py` to use feature engine