# GabeDA Features (Daily Aggregation)

This notebook aggregates transaction-level data by day to produce daily summaries of key business metrics.

**Input:** Preprocessed transactions from the `01_transactions` notebook  
**Output:** Daily metrics (1 row per date)  
**Group By:** `dt_date`
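
As a rough illustration of the target shape, the sketch below collapses a few made-up transactions to one row per `dt_date` with plain pandas. The sample values and the direct use of `groupby(...).agg(...)` are assumptions for illustration only; the notebook itself delegates aggregation to `ModelExecutor`.

```python
import pandas as pd

# Hypothetical sample: three transactions across two days (values are made up)
tx = pd.DataFrame({
    "dt_date": [20251001, 20251001, 20251002],
    "trans_id": ["t1", "t2", "t3"],
    "price_total": [100.0, 50.0, 70.0],
})

# One output row per date, using named aggregations
daily = tx.groupby("dt_date").agg(
    trans_id_count=("trans_id", "nunique"),
    price_total_sum=("price_total", "sum"),
).reset_index()

print(daily)
```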

## 0. Project Root Setup (Auto-generated)

In [1]:
# Auto-detect project root and add to Python path
import os
import sys
from pathlib import Path

# Get the project root (2 levels up from notebooks/development or notebooks/from_store)
notebook_dir = Path.cwd() if '__file__' not in globals() else Path(__file__).parent
project_root = notebook_dir.parent.parent

# Change to project root
os.chdir(project_root)

# Add project root to Python path if not already there
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"Working directory: {os.getcwd()}")
print(f"Project root: {project_root}")

Working directory: c:\Projects\play\khujta_ai_business
Project root: c:\Projects\play\khujta_ai_business


## 1. Setup: Imports, Context Loading, Logging

In [2]:
import pandas as pd
import numpy as np

# v2.0 Refactored imports
from src.utils.logger import setup_logging, get_logger
from src.core.context import GabedaContext
from src.core.persistence import load_context_state, get_latest_state, save_context_state
# Wildcard import: provides FIRST_VALUE and DEFAULT_FLOAT, used by the feature functions below
from src.core.constants import *
from src.features.store import FeatureStore
from src.features.resolver import DependencyResolver
from src.features.detector import FeatureTypeDetector
from src.features.analyzer import FeatureAnalyzer
from src.execution.calculator import FeatureCalculator
from src.execution.groupby import GroupByProcessor
from src.execution.executor import ModelExecutor
from src.export.excel import ExcelExporter

# Load latest context state
client_name = 'test_client'
latest_state = get_latest_state(client_name, base_dir='data/context_states')

if latest_state:
    ctx, base_cfg = load_context_state(latest_state)
    print(f"✓ Loaded latest state: {latest_state}")
else:
    raise FileNotFoundError(f"No context state found for client '{client_name}'")

# Setup logging
setup_logging(log_level=base_cfg.get('log_level', 'INFO'), 
              config={'client': base_cfg.get('client', 'unknown_client')})
logger = get_logger(__name__)

print(f"\n✓ Context loaded successfully!")
print(f"  - Original run_id: {ctx.original_run_id}")
print(f"  - New run_id: {ctx.run_id}")
print(f"  - Available datasets: {len(ctx.list_datasets())} datasets")

✓ Loaded latest state: data\context_states\test_client_20251022_150907
📝 Run instance ID: test_client_20251022_150937 - Logging [INFO] to: logs\test_client_20251022_150937.log

✓ Context loaded successfully!
  - Original run_id: test_client_20251022_150907
  - New run_id: test_client_20251022_150937
  - Available datasets: 3 datasets
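
The state directory names above end in a fixed-width `YYYYMMDD_HHMMSS` timestamp, so the newest one can be found by comparing names as strings. The sketch below is a hypothetical stand-in for `get_latest_state`, not its actual implementation; the function name `find_latest_state` and the directory-naming assumption are illustrative.

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_latest_state(client_name: str, base_dir: str) -> Optional[Path]:
    """Hypothetical helper: newest '<client>_<YYYYMMDD>_<HHMMSS>' directory, or None."""
    base = Path(base_dir)
    if not base.is_dir():
        return None
    candidates = [p for p in base.glob(f"{client_name}_*") if p.is_dir()]
    # Fixed-width timestamps sort chronologically when compared as strings
    return max(candidates, key=lambda p: p.name, default=None)

# Usage with throwaway directories
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20251022_150907", "20251021_090000"):
        (Path(tmp) / f"test_client_{stamp}").mkdir()
    latest_name = find_latest_state("test_client", tmp).name

print(latest_name)  # test_client_20251022_150907
```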


## 2. Load Input Data

In [3]:
# Get input dataset
input_df = ctx.get_dataset('transactions_filters')

print(f"✓ Input dataset loaded")
print(f"  - Shape: {input_df.shape}")
print(f"  - Date range: {input_df['dt_date'].min()} to {input_df['dt_date'].max()}")
print(f"  - Unique days: {input_df['dt_date'].nunique()}")
print(f"\nFirst few rows:")
input_df.head()

✓ Input dataset loaded
  - Shape: (609, 59)
  - Date range: 20251001 to 20251030
  - Unique days: 30

First few rows:


| | in_dt | in_product_id | in_quantity | in_price_total | in_trans_type | in_customer_id | in_description | in_category | in_unit_type | in_stock | ... | cost_unit | cost_total | price_unit | price_total | margin_unit | margin_unit_pct | margin_unit_valid | margin_total | margin_total_pct | margin_total_valid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-10-01 01:02:00 | prod8 | 2.0 | 52964.0 | return | client13 | product 8 | category B | pack | 61.0 | ... | 18792.0 | 37585.0 | 26482.0 | 52964.0 | 7690.0 | 29.04 | True | 15379.0 | 29.04 | True |
| 1 | 2025-10-01 06:24:00 | prod4 | 6.0 | 177195.0 | sale | client6 | product 4 | category B | unit | 30.0 | ... | 21526.0 | 129155.0 | 29533.0 | 177195.0 | 8007.0 | 27.11 | True | 48040.0 | 27.11 | True |
| 2 | 2025-10-01 08:38:00 | prod7 | 2.0 | 70492.0 | return | client12 | product 7 | category A | unit | 78.0 | ... | 25754.0 | 51509.0 | 35246.0 | 70492.0 | 9492.0 | 26.93 | True | 18983.0 | 26.93 | True |
| 3 | 2025-10-01 09:59:00 | prod2 | 4.0 | 86751.0 | sale | client3 | product 2 | category A | unit | 80.0 | ... | 12947.0 | 51786.0 | 21688.0 | 86751.0 | 8741.0 | 40.3 | True | 34965.0 | 40.31 | True |
| 4 | 2025-10-01 10:07:00 | prod3 | 3.0 | 76465.0 | sale | client12 | product 3 | category B | unit | 47.0 | ... | 16943.0 | 5083.0 | 25488.0 | 76465.0 | 8545.0 | 33.53 | True | 71382.0 | 93.35 | True |
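
The input summary above reports 30 unique days between 20251001 and 20251030, which suggests every calendar day is present. A minimal sketch of how that could be checked explicitly, assuming `dt_date` holds `YYYYMMDD` values (as integers or strings); the helper name `missing_days` is hypothetical:

```python
import pandas as pd

def missing_days(dt_dates) -> list:
    """Return calendar days (as 'YYYYMMDD' strings) absent between the first and last observed date."""
    observed = pd.to_datetime(pd.Series(dt_dates).astype(str), format="%Y%m%d")
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    gaps = expected.difference(pd.DatetimeIndex(observed))
    return [d.strftime("%Y%m%d") for d in gaps]

print(missing_days([20251001, 20251002, 20251004]))  # ['20251003']
```

In the notebook this could be called as `missing_days(input_df['dt_date'])`; an empty list would mean no gaps in the date range.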


## 3. Define Features

Daily business metrics aggregated by date:  
- Time identifiers (year, quarter, month, week of year)
- Customer, product, and transaction counts
- Quantity metrics (sum, mean, median)
- Revenue metrics (sum, mean, median)
- Cost metrics (sum, mean, median)
- Average transaction value (total revenue / transaction count)
- Margin metrics (sum, mean, median; percentage min, mean, median)

In [4]:
# ===== Daily Business Metrics =====
# Aggregates transaction data at daily level to track daily performance

# --- Time Identifiers ---
def year(dt_year):
    """Extract year from first transaction of the day."""
    return dt_year[FIRST_VALUE]

def quarter(dt_quarter):
    """Extract quarter from first transaction of the day."""
    return dt_quarter[FIRST_VALUE]

def month(dt_month):
    """Extract month from first transaction of the day."""
    return dt_month[FIRST_VALUE]

def weekofyear(dt_weekofyear):
    """Extract week of year from first transaction of the day."""
    return dt_weekofyear[FIRST_VALUE]

# --- Count Metrics ---
def customer_id_count(customer_id):
    """Count unique customers who made purchases on this day."""
    return len(np.unique(customer_id))

def product_id_count(product_id):
    """Count unique products sold on this day."""
    return len(np.unique(product_id))

def trans_id_count(trans_id):
    """Count unique transactions recorded on this day."""
    return len(np.unique(trans_id))

# --- Quantity Metrics ---
def quantity_sum(quantity):
    """Total units sold on this day."""
    return np.sum(quantity)

def quantity_mean(quantity):
    """Average units per transaction on this day."""
    return round(np.mean(quantity), 2)

def quantity_median(quantity):
    """Median units per transaction on this day."""
    return round(np.median(quantity), 2)

# --- Revenue Metrics ---
def price_total_sum(price_total):
    """Total revenue generated on this day."""
    return np.sum(price_total)

def price_total_mean(price_total):
    """Average revenue per transaction on this day."""
    return round(np.mean(price_total), 2)

def price_total_median(price_total):
    """Median revenue per transaction on this day."""
    return round(np.median(price_total), 2)

# --- Cost Metrics ---
def cost_total_sum(cost_total):
    """Total costs incurred on this day."""
    return np.sum(cost_total)

def cost_total_mean(cost_total):
    """Average cost per transaction on this day."""
    return round(np.mean(cost_total), 2)

def cost_total_median(cost_total):
    """Median cost per transaction on this day."""
    return round(np.median(cost_total), 2)

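# --- Derived Metrics ---
def transaction_mean(price_total_sum, trans_id_count):
    """Average revenue per transaction on this day (total revenue / unique transactions)."""
    if trans_id_count == 0:
        return DEFAULT_FLOAT
    # Both inputs are per-day scalars, so a plain division suffices (no np.mean needed)
    return round(price_total_sum / trans_id_count, 2)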

# --- Margin Metrics (Absolute) ---
def margin_total_sum(margin_total, margin_total_pct, margin_total_valid):
    """Total profit margin on this day (excluding invalid entries)."""
    flag = (margin_total_pct != DEFAULT_FLOAT) & (margin_total_valid == True)
    select = margin_total[flag]
    return np.sum(select) if len(select) > 0 else DEFAULT_FLOAT

def margin_total_mean(margin_total, margin_total_pct, margin_total_valid):
    """Average profit margin per transaction on this day."""
    flag = (margin_total_pct != DEFAULT_FLOAT) & (margin_total_valid == True)
    select = margin_total[flag]
    return round(np.mean(select), 2) if len(select) > 0 else DEFAULT_FLOAT

def margin_total_median(margin_total, margin_total_pct, margin_total_valid):
    """Median profit margin per transaction on this day."""
    flag = (margin_total_pct != DEFAULT_FLOAT) & (margin_total_valid == True)
    select = margin_total[flag]
    return round(np.median(select), 2) if len(select) > 0 else DEFAULT_FLOAT

# --- Margin Metrics (Percentage) ---
def margin_total_pct_min(margin_total_pct, margin_total_valid):
    """Minimum margin percentage on this day."""
    flag = (margin_total_pct != DEFAULT_FLOAT) & (margin_total_valid == True)
    select = margin_total_pct[flag]
    return round(np.min(select), 2) if len(select) > 0 else DEFAULT_FLOAT

def margin_total_pct_mean(margin_total_pct, margin_total_valid):
    """Average margin percentage on this day."""
    flag = (margin_total_pct != DEFAULT_FLOAT) & (margin_total_valid == True)
    select = margin_total_pct[flag]
    return round(np.mean(select), 2) if len(select) > 0 else DEFAULT_FLOAT

def margin_total_pct_median(margin_total_pct, margin_total_valid):
    """Median margin percentage on this day."""
    flag = (margin_total_pct != DEFAULT_FLOAT) & (margin_total_valid == True)
    select = margin_total_pct[flag]
    return round(np.median(select), 2) if len(select) > 0 else DEFAULT_FLOAT

print("✓ Feature functions defined: 23 attributes")

✓ Feature functions defined: 23 attributes
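
The margin functions above share one pattern: drop rows whose percentage equals the `DEFAULT_FLOAT` sentinel or whose validity flag is false, then aggregate what remains, falling back to the sentinel when nothing is left. The sketch below isolates that pattern. The sentinel value `-1.0` is a placeholder assumption (the real value lives in `src.core.constants`), and `masked_mean` is a hypothetical helper name.

```python
import numpy as np

DEFAULT_FLOAT = -1.0  # hypothetical sentinel; the real value is defined in src.core.constants

def masked_mean(values, pct, valid):
    """Mean of `values` over rows with a real percentage and a true validity flag."""
    values = np.asarray(values, dtype=float)
    pct = np.asarray(pct, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    keep = (pct != DEFAULT_FLOAT) & valid
    selected = values[keep]
    # Fall back to the sentinel when every row was filtered out
    return round(float(np.mean(selected)), 2) if selected.size else DEFAULT_FLOAT

margin = [100.0, 200.0, 999.0, 50.0]
pct = [10.0, 20.0, DEFAULT_FLOAT, 5.0]
valid = [True, True, True, False]

print(masked_mean(margin, pct, valid))        # 150.0 (rows 3 and 4 are excluded)
print(masked_mean(margin, pct, [False] * 4))  # -1.0 (nothing left to average)
```

One consequence of this filtering is that the margin aggregates may be computed over fewer rows than the count and revenue metrics for the same day.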


## 4. Configure Model

In [5]:
# Collect features into dictionary
features = {
    # Time identifiers
    'year': year,
    'quarter': quarter,
    'month': month,
    'weekofyear': weekofyear,
    # Count metrics
    'customer_id_count': customer_id_count,
    'product_id_count': product_id_count,
    'trans_id_count': trans_id_count,
    # Quantity metrics
    'quantity_sum': quantity_sum,
    'quantity_mean': quantity_mean,
    'quantity_median': quantity_median,
    # Revenue metrics
    'price_total_sum': price_total_sum,
    'price_total_mean': price_total_mean,
    'price_total_median': price_total_median,
    # Cost metrics
    'cost_total_sum': cost_total_sum,
    'cost_total_mean': cost_total_mean,
    'cost_total_median': cost_total_median,
    # Derived metrics
    'transaction_mean': transaction_mean,
    # Margin metrics (absolute)
    'margin_total_sum': margin_total_sum,
    'margin_total_mean': margin_total_mean,
    'margin_total_median': margin_total_median,
    # Margin metrics (percentage)
    'margin_total_pct_min': margin_total_pct_min,
    'margin_total_pct_mean': margin_total_pct_mean,
    'margin_total_pct_median': margin_total_pct_median,
}

# Model configuration
cfg_model = {
    'model_name': 'daily',
    'input_dataset_name': 'transactions_filters',
    'group_by': ['dt_date'],  # Aggregate by date
    'row_id': 'in_trans_id',
    'output_cols': list(features.keys()),
    'features': features,
}

print(f"✓ Model configured: '{cfg_model['model_name']}'")
print(f"  - Group by: {cfg_model['group_by']}")
print(f"  - Output features: {len(cfg_model['output_cols'])}")

✓ Model configured: 'daily'
  - Group by: ['dt_date']
  - Output features: 23


## 5. Prepare Features (Store, Resolve Dependencies, Save Config)

In [6]:
# Initialize feature store and store features
feature_store = FeatureStore()
feature_store.store_features(features, model_name=cfg_model['model_name'], auto_save=True)

# Resolve dependencies
resolver = DependencyResolver(feature_store)
in_cols, exec_seq, ext_cols = resolver.resolve_dependencies(
    output_cols=cfg_model['output_cols'],
    available_cols=input_df.columns.tolist(),
    group_by=cfg_model.get('group_by'),
    model=cfg_model['model_name']
)

# Update model config with resolved dependencies
cfg_model['in_cols'] = in_cols
cfg_model['exec_seq'] = exec_seq
cfg_model['ext_cols'] = ext_cols

# Save master configuration
feature_store.save_master_config(
    model_name=cfg_model['model_name'],
    model_config=cfg_model
)

print("✓ Features prepared and dependencies resolved")
print(f"  - Input columns needed: {len(in_cols)}")
print(f"  - Execution sequence: {exec_seq}")
print(f"  - Master config saved: feature_store/{cfg_model['model_name']}/master_cfg.json")

✓ Features prepared and dependencies resolved
  - Input columns needed: 13
  - Execution sequence: ['year', 'quarter', 'month', 'weekofyear', 'customer_id_count', 'product_id_count', 'trans_id_count', 'quantity_sum', 'quantity_mean', 'quantity_median', 'price_total_sum', 'price_total_mean', 'price_total_median', 'cost_total_sum', 'cost_total_mean', 'cost_total_median', 'transaction_mean', 'margin_total_sum', 'margin_total_mean', 'margin_total_median', 'margin_total_pct_min', 'margin_total_pct_mean', 'margin_total_pct_median']
  - Master config saved: feature_store/daily/master_cfg.json
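
The 13 input columns reported above match the number of distinct function arguments that are not themselves feature names (for example `price_total`, but not `price_total_sum`). That is consistent with, though it does not confirm, resolution based on argument names. The sketch below shows one way such a resolver could work; it is a hypothetical illustration, not the actual `DependencyResolver` implementation.

```python
import inspect

def resolve(features, output_cols, available_cols):
    """Hypothetical sketch: classify each argument as an input column or a feature
    dependency, and order features so that dependencies run first (depth-first)."""
    in_cols, exec_seq, visiting = [], [], set()

    def visit(name):
        if name in exec_seq:
            return
        if name in visiting:
            raise ValueError(f"Circular dependency at '{name}'")
        visiting.add(name)
        for arg in inspect.signature(features[name]).parameters:
            if arg in features:
                visit(arg)  # argument is another feature: compute it first
            elif arg in available_cols:
                if arg not in in_cols:
                    in_cols.append(arg)  # argument is a plain input column
            else:
                raise KeyError(f"'{arg}' is neither a feature nor an input column")
        visiting.discard(name)
        exec_seq.append(name)

    for col in output_cols:
        visit(col)
    return in_cols, exec_seq

# Simplified stand-ins for three of the notebook's features
def price_total_sum(price_total): return sum(price_total)
def trans_id_count(trans_id): return len(set(trans_id))
def transaction_mean(price_total_sum, trans_id_count): return price_total_sum / trans_id_count

feats = {
    "transaction_mean": transaction_mean,
    "price_total_sum": price_total_sum,
    "trans_id_count": trans_id_count,
}
in_cols, exec_seq = resolve(feats, list(feats), ["trans_id", "price_total"])
print(in_cols)   # ['price_total', 'trans_id']
print(exec_seq)  # ['price_total_sum', 'trans_id_count', 'transaction_mean']
```

Even though `transaction_mean` is requested first here, it is scheduled last because both of its arguments are features.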


## 6. Execute Model (Initialize Components + Execute + Store Results)

In [7]:
# Initialize execution components
detector = FeatureTypeDetector()
analyzer = FeatureAnalyzer(feature_store, detector)
calculator = FeatureCalculator()
groupby_processor = GroupByProcessor(calculator, detector)
executor = ModelExecutor(analyzer, groupby_processor, context=ctx)

# Execute model
output = executor.execute_model(
    cfg_model=cfg_model,
    input_dataset_name=cfg_model['input_dataset_name']
)

# Store results in context
ctx.set_model_output(cfg_model['model_name'], output, cfg_model)

print("✓ Model executed successfully!")
print(f"  - Filters: {output['filters'].shape if output['filters'] is not None else 'None'}")
print(f"  - Attributes: {output['attrs'].shape if output['attrs'] is not None else 'None'}")
print(f"  - Days processed: {output['attrs'].shape[0] if output['attrs'] is not None else 0}")

✓ Model executed successfully!
  - Filters: (609, 59)
  - Attributes: (30, 24)
  - Days processed: 30
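
The attributes shape of (30, 24) corresponds to 30 days by 23 features plus the `dt_date` key. To make the grouped execution step concrete, here is a minimal sketch of evaluating feature functions per group in a resolved order, passing input columns as arrays and previously computed features as scalars. The function `run_group_features` is hypothetical and far simpler than the real `ModelExecutor` and `GroupByProcessor`.

```python
import inspect
import numpy as np
import pandas as pd

def run_group_features(df, group_by, features, exec_seq):
    """Hypothetical sketch: evaluate features group by group in the given order."""
    rows = []
    for key, group in df.groupby(group_by):
        result = {group_by: key}
        for name in exec_seq:
            params = inspect.signature(features[name]).parameters
            # Earlier feature results are passed as scalars, input columns as arrays
            kwargs = {p: result[p] if p in result else group[p].to_numpy() for p in params}
            result[name] = features[name](**kwargs)
        rows.append(result)
    return pd.DataFrame(rows)

# Simplified stand-ins for three of the notebook's features
def price_total_sum(price_total): return float(np.sum(price_total))
def trans_id_count(trans_id): return len(np.unique(trans_id))
def transaction_mean(price_total_sum, trans_id_count): return round(price_total_sum / trans_id_count, 2)

tx = pd.DataFrame({
    "dt_date": [20251001, 20251001, 20251002],
    "trans_id": ["t1", "t2", "t3"],
    "price_total": [100.0, 50.0, 70.0],
})
features = {
    "price_total_sum": price_total_sum,
    "trans_id_count": trans_id_count,
    "transaction_mean": transaction_mean,
}
daily = run_group_features(tx, "dt_date", features, list(features))
print(daily)
```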


## 7. View Results

In [8]:
# View daily metrics (aggregated attributes)
attrs = ctx.get_model_attrs(cfg_model['model_name'])
print(f"Daily Metrics (n={len(attrs)})")
attrs.head(10)

Daily Metrics (n=30)


| | dt_date | year | quarter | month | weekofyear | customer_id_count | product_id_count | trans_id_count | quantity_sum | quantity_mean | ... | cost_total_sum | cost_total_mean | cost_total_median | transaction_mean | margin_total_sum | margin_total_mean | margin_total_median | margin_total_pct_min | margin_total_pct_mean | margin_total_pct_median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20251001 | 2025 | 4 | 10 | 40 | 14 | 9 | 22 | 71 | 3.23 | ... | 1085435.0 | 49337.95 | 47665.5 | 79122.77 | 678405.0 | 32305.0 | 31241.0 | 25.33 | 39.02 | 34.21 |
| 1 | 20251002 | 2025 | 4 | 10 | 40 | 12 | 10 | 28 | 94 | 3.36 | ... | 1429106.0 | 51039.5 | 33156.5 | 87682.32 | 1042216.0 | 38600.59 | 25959.0 | 19.25 | 41.8 | 33.39 |
| 2 | 20251003 | 2025 | 4 | 10 | 40 | 13 | 9 | 29 | 120 | 4.14 | ... | 1876081.0 | 64692.45 | 40738.0 | 94232.76 | 941812.0 | 34881.93 | 21269.0 | 20.79 | 34.91 | 31.24 |
| 3 | 20251004 | 2025 | 4 | 10 | 40 | 10 | 10 | 23 | 98 | 4.26 | ... | 1453574.0 | 63198.87 | 37963.0 | 99762.7 | 876471.0 | 41736.71 | 28923.0 | 20.82 | 39.23 | 34.79 |
| 4 | 20251005 | 2025 | 4 | 10 | 40 | 8 | 8 | 16 | 72 | 4.5 | ... | 941393.0 | 58837.06 | 37837.5 | 85382.94 | 481244.0 | 34374.57 | 19076.5 | 16.01 | 35.25 | 33.18 |
| 5 | 20251006 | 2025 | 4 | 10 | 41 | 9 | 8 | 16 | 46 | 2.88 | ... | 895889.0 | 55993.06 | 38932.5 | 37444.38 | 162905.0 | 14809.55 | 12795.0 | 17.75 | 32.37 | 33.0 |
| 6 | 20251007 | 2025 | 4 | 10 | 41 | 10 | 7 | 17 | 53 | 3.12 | ... | 751192.0 | 44187.76 | 37043.0 | 70132.0 | 458038.0 | 28627.38 | 22994.5 | 21.93 | 38.98 | 30.33 |
| 7 | 20251008 | 2025 | 4 | 10 | 41 | 10 | 8 | 18 | 51 | 2.83 | ... | 615279.0 | 34182.17 | 28343.5 | 51091.56 | 319478.0 | 18792.82 | 17325.0 | 24.4 | 36.81 | 33.73 |
| 8 | 20251009 | 2025 | 4 | 10 | 41 | 13 | 10 | 26 | 85 | 3.27 | ... | 1430065.0 | 55002.5 | 32367.0 | 81660.31 | 735411.0 | 30642.12 | 27034.0 | 15.63 | 36.26 | 31.07 |
| 9 | 20251010 | 2025 | 4 | 10 | 41 | 12 | 10 | 26 | 89 | 3.42 | ... | 1280582.0 | 49253.15 | 34648.5 | 83047.04 | 1028579.0 | 46753.59 | 22008.0 | 18.29 | 40.33 | 33.08 |


In [9]:
# View revenue and transaction summary
print("Daily Revenue & Transaction Summary:")
attrs[['price_total_sum', 'trans_id_count', 'transaction_mean', 'customer_id_count']].describe()

Daily Revenue & Transaction Summary:


| | price_total_sum | trans_id_count | transaction_mean | customer_id_count |
|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 1783194.0 | 20.3 | 86714.288667 | 10.9 |
| std | 723017.1 | 6.018076 | 22828.035069 | 2.171127 |
| min | 352659.0 | 8.0 | 37444.38 | 6.0 |
| 25% | 1235715.0 | 17.0 | 79207.2975 | 9.25 |
| 50% | 1838787.0 | 19.0 | 89446.205 | 11.0 |
| 75% | 2156682.0 | 25.25 | 102267.0625 | 12.75 |
| max | 3413690.0 | 35.0 | 119392.11 | 14.0 |


In [10]:
# View margin performance summary
print("Daily Margin Performance:")
attrs[['margin_total_sum', 'margin_total_pct_mean', 'margin_total_pct_min']].describe()

Daily Margin Performance:


| | margin_total_sum | margin_total_pct_mean | margin_total_pct_min |
|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 |
| mean | 671858.8 | 38.039 | 20.753667 |
| std | 287941.4 | 3.735383 | 2.859184 |
| min | 117362.0 | 31.76 | 15.63 |
| 25% | 463839.5 | 35.895 | 19.0475 |
| 50% | 724553.0 | 38.4 | 20.205 |
| 75% | 839575.0 | 40.345 | 22.635 |
| max | 1278089.0 | 49.02 | 28.42 |


## 8. Export to Excel

In [11]:
# Export model results to Excel
exporter = ExcelExporter(ctx)
output_file = f'outputs/{cfg_model["model_name"]}_export.xlsx'
exporter.export_model(cfg_model['model_name'], output_file, include_input=True)

print(f"✓ Export complete: {output_file}")
print("\nExcel tabs:")
print(f"  1. {cfg_model['input_dataset_name']} (input)")
print(f"  2. {cfg_model['model_name']}_filters")
print(f"  3. {cfg_model['model_name']}_attrs")

✓ Export complete: outputs/daily_export.xlsx

Excel tabs:
  1. transactions_filters (input)
  2. daily_filters
  3. daily_attrs


## 9. Save Context State

Save the complete context state for use in downstream notebooks:

In [12]:
# Save context state (datasets, config, metadata)
state_dir = save_context_state(ctx=ctx, base_cfg=base_cfg)

print(f"✓ Context state saved: {state_dir}")
print(f"  - Total datasets: {len(ctx.datasets)}")
print(f"\nTo load this state in another notebook:")
print(f"  from src.core.persistence import load_context_state")
# repr() escapes backslashes so the printed Windows path can be pasted as a valid string literal
print(f"  ctx, base_cfg = load_context_state({str(state_dir)!r})")

✓ Context state saved: data\context_states\test_client_20251022_150907
  - Total datasets: 5

To load this state in another notebook:
  from src.core.persistence import load_context_state
  ctx, base_cfg = load_context_state('data\\context_states\\test_client_20251022_150907')
