# GabeDA Features (Monthly Business Metrics)

This notebook aggregates preprocessed transaction data by year and month to produce monthly business metrics.
It computes monthly revenue, transaction counts, and unique-customer counts, and defines placeholder growth and customer-lifecycle metrics.

**Input:** Preprocessed transactions from 01_transactions notebook  
**Output:** Monthly business metrics (1 row per month)  
**Group By:** `dt_year`, `dt_month`
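
As a rough illustration of the aggregation this notebook performs, the three populated metrics can be approximated with a plain pandas group-by. This is a sketch on made-up data, not the GabeDA execution path: the toy frame and the `dt_week` column are assumptions, and only `dt_year`, `dt_month`, `trans_id`, `customer_id`, and `price_total` mirror names used in the notebook. It also shows why distinct customer counts cannot be summed up from a finer grain.

```python
import pandas as pd

# Hypothetical toy transactions; values are invented for illustration.
tx = pd.DataFrame({
    "dt_year":     [2025, 2025, 2025, 2025],
    "dt_month":    [9, 10, 10, 10],
    "dt_week":     [39, 40, 40, 41],  # assumed column, used only for the double-count check
    "trans_id":    ["t1", "t2", "t3", "t4"],
    "customer_id": ["c1", "c1", "c2", "c1"],
    "price_total": [100.0, 250.0, 50.0, 75.0],
})

# One row per (year, month), matching the notebook's group-by keys.
monthly = (
    tx.groupby(["dt_year", "dt_month"], as_index=False)
      .agg(
          monthly_revenue=("price_total", "sum"),
          monthly_transaction_count=("trans_id", "nunique"),
          monthly_unique_customers=("customer_id", "nunique"),
      )
)

# Summing weekly distinct counts over-counts customers who buy in several weeks.
weekly_unique = tx.groupby(["dt_year", "dt_month", "dt_week"])["customer_id"].nunique()
summed_from_weekly = weekly_unique.groupby(level=["dt_year", "dt_month"]).sum()

print(monthly)
print(summed_from_weekly)
```

In the toy data, customer `c1` buys in two different weeks of October, so the weekly sum reports three customers where the monthly distinct count reports two.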

## 0. Project Root Setup (Auto-generated)

In [1]:
# Auto-detect project root and add to Python path
import os
import sys
from pathlib import Path

# Get the project root (2 levels up from notebooks/development or notebooks/from_store)
notebook_dir = Path.cwd() if '__file__' not in globals() else Path(__file__).parent
project_root = notebook_dir.parent.parent

# Change to project root
os.chdir(project_root)

# Add project root to Python path if not already there
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"Working directory: {os.getcwd()}")
print(f"Project root: {project_root}")

Working directory: c:\Projects\play\khujta_ai_business
Project root: c:\Projects\play\khujta_ai_business


## 1. Setup: Imports, Context Loading, Logging

In [2]:
import pandas as pd
import numpy as np

# v2.0 Refactored imports
from src.utils.logger import setup_logging, get_logger
from src.core.context import GabedaContext
from src.core.persistence import load_context_state, get_latest_state, save_context_state
from src.core.constants import *  # provides DEFAULT_FLOAT, used by the placeholder features below
from src.features.store import FeatureStore
from src.features.resolver import DependencyResolver
from src.features.detector import FeatureTypeDetector
from src.features.analyzer import FeatureAnalyzer
from src.execution.calculator import FeatureCalculator
from src.execution.groupby import GroupByProcessor
from src.execution.executor import ModelExecutor
from src.export.excel import ExcelExporter

# Load latest context state
client_name = 'test_client'
latest_state = get_latest_state(client_name, base_dir='data/context_states')

if latest_state:
    ctx, base_cfg = load_context_state(latest_state)
    print(f"✓ Loaded latest state: {latest_state}")
else:
    raise FileNotFoundError(f"No context state found for client '{client_name}'")

# Setup logging
setup_logging(log_level=base_cfg.get('log_level', 'INFO'), 
              config={'client': base_cfg.get('client', 'unknown_client')})
logger = get_logger(__name__)

print(f"\n✓ Context loaded successfully!")
print(f"  - Original run_id: {ctx.original_run_id}")
print(f"  - New run_id: {ctx.run_id}")
print(f"  - Available datasets: {len(ctx.list_datasets())} datasets")

✓ Loaded latest state: data\context_states\test_client_20251022_150907
📝 Run instance ID: test_client_20251022_151056 - Logging [INFO] to: logs\test_client_20251022_151056.log

✓ Context loaded successfully!
  - Original run_id: test_client_20251022_151027
  - New run_id: test_client_20251022_151056
  - Available datasets: 13 datasets


## 2. Load Input Data

In [3]:
# Get input dataset
input_df = ctx.get_dataset('transactions_filters')

print(f"✓ Input dataset loaded")
print(f"  - Shape: {input_df.shape}")
print(f"  - Date range: {input_df['dt_date'].min()} to {input_df['dt_date'].max()}")
print(f"  - Unique months: {input_df[['dt_year', 'dt_month']].drop_duplicates().shape[0]}")
print(f"\nFirst few rows:")
input_df.head()

✓ Input dataset loaded
  - Shape: (609, 59)
  - Date range: 20251001 to 20251030
  - Unique months: 1

First few rows:


| | in_dt | in_product_id | in_quantity | in_price_total | in_trans_type | in_customer_id | in_description | in_category | in_unit_type | in_stock | ... | cost_unit | cost_total | price_unit | price_total | margin_unit | margin_unit_pct | margin_unit_valid | margin_total | margin_total_pct | margin_total_valid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-10-01 01:02:00 | prod8 | 2.0 | 52964.0 | return | client13 | product 8 | category B | pack | 61.0 | ... | 18792.0 | 37585.0 | 26482.0 | 52964.0 | 7690.0 | 29.04 | True | 15379.0 | 29.04 | True |
| 1 | 2025-10-01 06:24:00 | prod4 | 6.0 | 177195.0 | sale | client6 | product 4 | category B | unit | 30.0 | ... | 21526.0 | 129155.0 | 29533.0 | 177195.0 | 8007.0 | 27.11 | True | 48040.0 | 27.11 | True |
| 2 | 2025-10-01 08:38:00 | prod7 | 2.0 | 70492.0 | return | client12 | product 7 | category A | unit | 78.0 | ... | 25754.0 | 51509.0 | 35246.0 | 70492.0 | 9492.0 | 26.93 | True | 18983.0 | 26.93 | True |
| 3 | 2025-10-01 09:59:00 | prod2 | 4.0 | 86751.0 | sale | client3 | product 2 | category A | unit | 80.0 | ... | 12947.0 | 51786.0 | 21688.0 | 86751.0 | 8741.0 | 40.3 | True | 34965.0 | 40.31 | True |
| 4 | 2025-10-01 10:07:00 | prod3 | 3.0 | 76465.0 | sale | client12 | product 3 | category B | unit | 47.0 | ... | 16943.0 | 5083.0 | 25488.0 | 76465.0 | 8545.0 | 33.53 | True | 71382.0 | 93.35 | True |


## 3. Define Features

Monthly business metrics:  
- Revenue metrics (total revenue, transaction count, unique customers)
- Growth metrics (month-over-month growth, revenue trend)
- Customer metrics (acquisition count, churn count)

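**Note:** Four features (`month_over_month_growth`, `revenue_trend`, `customer_acquisition_count`, `customer_churn_count`) need historical data that v1 does not track, so they return placeholder values (`DEFAULT_FLOAT` or `'unknown'`).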

In [4]:
# ===== Monthly Business Metrics =====
# Based on specs: docs/specs/model/tech_specs.md - Dataset 2.2

def monthly_revenue(price_total):
    """
    Total revenue for the month.
    Formula: SUM(price_total) from transaction-level data
    
    Note: Could also aggregate from weekly_attrs.weekly_revenue, 
          but using transaction-level data for consistency
    """
    return np.sum(price_total)

def monthly_transaction_count(trans_id):
    """
    Total number of unique transactions in the month.
    Formula: COUNT(DISTINCT trans_id)
    """
    return len(np.unique(trans_id))

def monthly_unique_customers(customer_id):
    """
    Distinct customers who purchased during the month.
    Formula: COUNT(DISTINCT customer_id)
    
    Note: Cannot sum from weekly or daily (would double-count)
    """
    return len(np.unique(customer_id))

def month_over_month_growth(price_total):
    """
    Revenue growth percentage from previous month.
    Formula: ((current_month_revenue - previous_month_revenue) / previous_month_revenue) * 100
    
    Returns: DEFAULT_FLOAT (requires historical/window data - not implemented in v1)
    
    Note: This feature requires access to previous month's revenue via window functions
    or external historical context. Currently returns DEFAULT_FLOAT.
    Future enhancement: Implement window function support or pass historical data.
    """
    return DEFAULT_FLOAT

def revenue_trend(price_total):
    """
    Directional classification of revenue movement.
    Formula: 
      IF month_over_month_growth > 5% THEN 'increasing'
      ELSE IF month_over_month_growth < -5% THEN 'decreasing'
      ELSE 'stable'
    
    Returns: 'unknown' (requires month_over_month_growth which needs historical data)
    
    Note: Depends on month_over_month_growth. Since MoM growth is not available (DEFAULT_FLOAT),
    this feature cannot be calculated. Returns 'unknown' for all months.
    Future enhancement: Implement after window function support is added.
    """
    return 'unknown'

def customer_acquisition_count(customer_id):
    """
    New customers acquired this month (first-time buyers).
    Formula: COUNT(DISTINCT customer_id) WHERE first_purchase_date IN current_month
    
    Returns: DEFAULT_FLOAT (requires historical customer tracking - not implemented in v1)
    
    Note: This requires tracking customer purchase history across all time to identify
    first-time buyers. Current implementation doesn't maintain historical customer data.
    Future enhancement: Implement customer history tracking or use external customer master data.
    """
    return DEFAULT_FLOAT

def customer_churn_count(customer_id):
    """
    Customers lost (no purchase in 60+ days).
    Formula: COUNT(DISTINCT customer_id) WHERE last_purchase_date 
             BETWEEN (month_start - 90 days) AND (month_start - 60 days)
             AND customer_id NOT IN (purchases in last 60 days)
    
    Returns: DEFAULT_FLOAT (requires historical customer tracking - not implemented in v1)
    
    Note: This requires tracking customer purchase history and identifying customers who
    stopped purchasing. Needs historical data beyond current aggregation window.
    Future enhancement: Implement churn detection with proper historical tracking.
    """
    return DEFAULT_FLOAT

print("✓ Feature functions defined: 7 attributes")

✓ Feature functions defined: 7 attributes
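
The `month_over_month_growth` and `revenue_trend` formulas in the docstrings above could be evaluated once a multi-month history is available. The following is a minimal sketch on an already-aggregated monthly frame, not the planned GabeDA implementation: the data is invented, `classify_trend` is a hypothetical helper, and `pct_change` assumes the rows are consecutive months with no gaps.

```python
import pandas as pd

# Hypothetical monthly revenue history (values are made up for illustration).
monthly = pd.DataFrame({
    "dt_year":  [2025, 2025, 2025, 2025],
    "dt_month": [7, 8, 9, 10],
    "monthly_revenue": [100.0, 110.0, 112.0, 95.2],
}).sort_values(["dt_year", "dt_month"], ignore_index=True)

# ((current - previous) / previous) * 100; the first month has no predecessor, so it is NaN.
monthly["month_over_month_growth"] = monthly["monthly_revenue"].pct_change() * 100


def classify_trend(growth_pct: float) -> str:
    """Hypothetical helper applying the +/-5% thresholds from the revenue_trend docstring."""
    if pd.isna(growth_pct):
        return "unknown"
    if growth_pct > 5:
        return "increasing"
    if growth_pct < -5:
        return "decreasing"
    return "stable"


monthly["revenue_trend"] = monthly["month_over_month_growth"].map(classify_trend)
print(monthly)
```

Returning `'unknown'` for the first month keeps the behaviour consistent with the v1 placeholder whenever no previous month exists.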

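The acquisition and churn formulas likewise depend on per-customer purchase history. The sketch below shows one way they might be computed with pandas, under stated assumptions: the `history` frame is invented, "last 60 days" is interpreted relative to `month_start`, and none of this reflects an existing GabeDA API.

```python
import pandas as pd

# Hypothetical purchase history spanning several months (made-up data).
history = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1", "c3", "c3"],
    "dt_date": pd.to_datetime(
        ["2025-07-15", "2025-07-20", "2025-09-10", "2025-10-10", "2025-10-21"]
    ),
})

month_start = pd.Timestamp("2025-10-01")
month_end = month_start + pd.offsets.MonthBegin(1)  # exclusive upper bound (2025-11-01)

# Acquisition: customers whose first-ever purchase falls inside the month.
first_purchase = history.groupby("customer_id")["dt_date"].min()
acquired = first_purchase[(first_purchase >= month_start) & (first_purchase < month_end)]
customer_acquisition_count = len(acquired)

# Churn: the last purchase before month_start lies 60 to 90 days back.
# Using the *last* purchase already implies no purchase in the final 60 days.
before = history[history["dt_date"] < month_start]
last_purchase = before.groupby("customer_id")["dt_date"].max()
window_lo = month_start - pd.Timedelta(days=90)
window_hi = month_start - pd.Timedelta(days=60)
churned = last_purchase[(last_purchase >= window_lo) & (last_purchase <= window_hi)]
customer_churn_count = len(churned)

print(customer_acquisition_count, customer_churn_count)
```

Here `c3` is the only first-time buyer in October, and `c2` is the only customer whose last purchase falls in the 60 to 90 day window; `c1` bought again in September and is therefore not counted as churned.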

## 4. Configure Model

In [5]:
# Collect features into dictionary
features = {
    'monthly_revenue': monthly_revenue,
    'monthly_transaction_count': monthly_transaction_count,
    'monthly_unique_customers': monthly_unique_customers,
    'month_over_month_growth': month_over_month_growth,
    'revenue_trend': revenue_trend,
    'customer_acquisition_count': customer_acquisition_count,
    'customer_churn_count': customer_churn_count,
}

# Model configuration
cfg_model = {
    'model_name': 'monthly',
    'input_dataset_name': 'transactions_filters',
    'group_by': ['dt_year', 'dt_month'],
    'row_id': 'in_trans_id',
    'output_cols': list(features.keys()),
    'features': features,
}

print(f"✓ Model configured: '{cfg_model['model_name']}'")
print(f"  - Group by: {cfg_model['group_by']}")
print(f"  - Output features: {len(cfg_model['output_cols'])}")

✓ Model configured: 'monthly'
  - Group by: ['dt_year', 'dt_month']
  - Output features: 7


## 5. Prepare Features (Store, Resolve Dependencies, Save Config)

In [6]:
# Initialize feature store and store features
feature_store = FeatureStore()
feature_store.store_features(features, model_name=cfg_model['model_name'], auto_save=True)

# Resolve dependencies
resolver = DependencyResolver(feature_store)
in_cols, exec_seq, ext_cols = resolver.resolve_dependencies(
    output_cols=cfg_model['output_cols'],
    available_cols=input_df.columns.tolist(),
    group_by=cfg_model.get('group_by'),
    model=cfg_model['model_name']
)

# Update model config with resolved dependencies
cfg_model['in_cols'] = in_cols
cfg_model['exec_seq'] = exec_seq
cfg_model['ext_cols'] = ext_cols

# Save master configuration
feature_store.save_master_config(
    model_name=cfg_model['model_name'],
    model_config=cfg_model
)

print("✓ Features prepared and dependencies resolved")
print(f"  - Input columns needed: {len(in_cols)}")
print(f"  - Execution sequence: {exec_seq}")
print(f"  - Master config saved: feature_store/{cfg_model['model_name']}/master_cfg.json")

✓ Features prepared and dependencies resolved
  - Input columns needed: 3
  - Execution sequence: ['monthly_revenue', 'monthly_transaction_count', 'monthly_unique_customers', 'month_over_month_growth', 'revenue_trend', 'customer_acquisition_count', 'customer_churn_count']
  - Master config saved: feature_store/monthly/master_cfg.json
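
The three input columns reported above match the distinct argument names of the feature functions (`price_total`, `trans_id`, `customer_id`). One plausible way to derive such a list is to read each function's signature. The sketch below illustrates that idea only; `required_inputs` is a hypothetical helper, and the actual `DependencyResolver` may work differently.

```python
import inspect


# Stand-in feature functions with the same signatures as three of the notebook's features.
def monthly_revenue(price_total):
    return sum(price_total)


def monthly_transaction_count(trans_id):
    return len(set(trans_id))


def monthly_unique_customers(customer_id):
    return len(set(customer_id))


features = {
    "monthly_revenue": monthly_revenue,
    "monthly_transaction_count": monthly_transaction_count,
    "monthly_unique_customers": monthly_unique_customers,
}


def required_inputs(features, available_cols):
    """Hypothetical helper: split argument names into input columns and feature dependencies."""
    in_cols, feature_deps = [], {}
    for name, func in features.items():
        params = list(inspect.signature(func).parameters)
        # An argument named after another feature is a dependency, not an input column.
        feature_deps[name] = [p for p in params if p in features]
        for p in params:
            if p in features:
                continue
            if p not in available_cols:
                raise KeyError(f"{name!r} needs unknown column {p!r}")
            if p not in in_cols:
                in_cols.append(p)
    return in_cols, feature_deps


in_cols, feature_deps = required_inputs(
    features, available_cols=["price_total", "trans_id", "customer_id", "dt_year"]
)
print(in_cols)
```

Resolving by argument name keeps feature definitions declarative: a function states which columns it needs simply by naming them as parameters.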


## 6. Execute Model (Initialize Components + Execute + Store Results)

In [7]:
# Initialize execution components
detector = FeatureTypeDetector()
analyzer = FeatureAnalyzer(feature_store, detector)
calculator = FeatureCalculator()
groupby_processor = GroupByProcessor(calculator, detector)
executor = ModelExecutor(analyzer, groupby_processor, context=ctx)

# Execute model
output = executor.execute_model(
    cfg_model=cfg_model,
    input_dataset_name=cfg_model['input_dataset_name']
)

# Store results in context
ctx.set_model_output(cfg_model['model_name'], output, cfg_model)

print("✓ Model executed successfully!")
print(f"  - Filters: {output['filters'].shape if output['filters'] is not None else 'None'}")
print(f"  - Attributes: {output['attrs'].shape if output['attrs'] is not None else 'None'}")
print(f"  - Months analyzed: {output['attrs'].shape[0] if output['attrs'] is not None else 0}")

✓ Model executed successfully!
  - Filters: (609, 63)
  - Attributes: (1, 5)
  - Months analyzed: 1


## 7. View Results

In [8]:
# View monthly metrics (aggregated attributes)
attrs = ctx.get_model_attrs(cfg_model['model_name'])
print(f"Monthly Business Metrics (n={len(attrs)}):")
attrs.head(10)

Monthly Business Metrics (n=1):


| | dt_year | dt_month | monthly_revenue | monthly_transaction_count | monthly_unique_customers |
|---|---|---|---|---|---|
| 0 | 2025 | 10 | 53495820.0 | 609 | 15 |


In [9]:
# View summary statistics
print("Monthly Revenue Summary:")
attrs[['monthly_revenue', 'monthly_transaction_count', 'monthly_unique_customers']].describe()

Monthly Revenue Summary:


| | monthly_revenue | monthly_transaction_count | monthly_unique_customers |
|---|---|---|---|
| count | 1.0 | 1.0 | 1.0 |
| mean | 53495820.0 | 609.0 | 15.0 |
| std | NaN | NaN | NaN |
| min | 53495820.0 | 609.0 | 15.0 |
| 25% | 53495820.0 | 609.0 | 15.0 |
| 50% | 53495820.0 | 609.0 | 15.0 |
| 75% | 53495820.0 | 609.0 | 15.0 |
| max | 53495820.0 | 609.0 | 15.0 |
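
The missing `std` values in this summary are expected with a single month of data: pandas computes the sample standard deviation (`ddof=1`), which divides by `n - 1 = 0`. A small check, reusing the revenue figure from the output above:

```python
import pandas as pd

one_month = pd.Series([53495820.0], name="monthly_revenue")

sample_std = one_month.std()            # default ddof=1 divides by n - 1 = 0, giving NaN
population_std = one_month.std(ddof=0)  # divides by n = 1, giving 0.0

print(sample_std, population_std)
```

The summary statistics only become informative once the input spans more than one month.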


In [10]:
# Trend view is disabled: 'revenue_trend' is not among the aggregated attribute columns in v1.
# # View trend information
# print("Revenue Trends:")
# print(attrs[['dt_year', 'dt_month', 'monthly_revenue', 'revenue_trend']])

## 8. Export to Excel

In [11]:
# Export model results to Excel
exporter = ExcelExporter(ctx)
output_file = f'outputs/{cfg_model["model_name"]}_export.xlsx'
exporter.export_model(cfg_model['model_name'], output_file, include_input=True)

print(f"✓ Export complete: {output_file}")
print("\nExcel tabs:")
print(f"  1. {cfg_model['input_dataset_name']} (input)")
print(f"  2. {cfg_model['model_name']}_filters")
print(f"  3. {cfg_model['model_name']}_attrs")

✓ Export complete: outputs/monthly_export.xlsx

Excel tabs:
  1. transactions_filters (input)
  2. monthly_filters
  3. monthly_attrs


## 9. Save Context State

Save the complete context state for use in downstream notebooks:

In [12]:
# Save context state (datasets, config, metadata)
state_dir = save_context_state(ctx=ctx, base_cfg=base_cfg)

print(f"✓ Context state saved: {state_dir}")
print(f"  - Total datasets: {len(ctx.datasets)}")
print(f"\nTo load this state in another notebook:")
print(f"  from src.core.persistence import load_context_state")
# repr() escapes backslashes so the printed Windows path can be pasted as a valid string literal
print(f"  ctx, base_cfg = load_context_state({str(state_dir)!r})")

✓ Context state saved: data\context_states\test_client_20251022_150907
  - Total datasets: 15

To load this state in another notebook:
  from src.core.persistence import load_context_state
  ctx, base_cfg = load_context_state('data\\context_states\\test_client_20251022_150907')
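
When a Windows path is pasted into a Python string literal, single backslashes are read as escape sequences (for example, `\t` in `\test_client` becomes a tab). A minimal sketch of two paste-safe ways to render such a path, using `PureWindowsPath` so it behaves the same on any operating system:

```python
from pathlib import PureWindowsPath

state_dir = PureWindowsPath(r"data\context_states\test_client_20251022_150907")

portable = state_dir.as_posix()  # forward slashes, accepted by Python on Windows too
escaped = repr(str(state_dir))   # quoted literal with doubled backslashes

print(portable)
print(escaped)
```

Either form can be copied into another notebook without the path being silently altered.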
