# Big Data & BI — Feature Engineering
## Notebook 4: Metric Design (on demand)

Goal: define KPIs **without** bloating the table.
We compute revenue, discounted revenue, and grouped metrics here for practice, but in a BI stack these definitions can live in a semantic/model layer instead of being stored as extra columns.

In [None]:
import pandas as pd
import numpy as np

# Recreate the cleaned data from Notebook 3 inline (after validation and fixes)
# This matches the final state: fixed quantities, imputed prices/discounts, standardized categories

data = {
    "order_id":   [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    "order_date": ["2025-01-03", "2025-01-03", "2025-01-03", "2025-01-04", None,
                    "2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06", "2025-01-06"],
    "customer_id_clean": [501, 502, 503, 503, 504, 505, 506, 506, 507, 10000],  # Placeholder for missing
    "country_clean":    ["germany", "germany", "germany", "france", "france", "germany", "germany", "unknown", "unknown", "germany"],
    "product":    ["Widget A", "Widget B", "Widget A", "Widget C", "Widget A",
                    "Widget B", "Widget B", "Widget C", "Widget A", "Widget A"],
    "quantity_fixed":   [2, 1, 3, 1, 1, 2, 2, 1, 5, 2],  # Already fixed (row 5: -1 → 1)
    "unit_price_filled": [20.0, 35.5, 20.0, 50.0, 20.0, 20.0, 35.5, 50.0, 20.0, 20.0],  # Filled with median
    "discount_filled":   [0.0, 0.1, 0.0, 0.0, 0.0, 0.05, 0.0, 0.0, 0.0, 0.0],  # Filled with 0.0
    "channel_clean":    ["online", "online", "offline", "partner", "online",
                    "offline", "online", "online", "unknown", "partner"]
}
df = pd.DataFrame(data)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

print("Data loaded from Notebook 3 (validated and fixed):")
print(f"  - All quantities positive: {(df['quantity_fixed'] > 0).all()}")
print("  - Prices imputed: using median (20.0)")
print("  - Customer ID placeholder: 10000")
df

## 1. Define KPI logic
These formulas can live in the BI layer, but we practice them here.

In [None]:
df["revenue"] = df["quantity_fixed"] * df["unit_price_filled"]
df["revenue_after_discount"] = df["revenue"] * (1 - df["discount_filled"])
df[["order_id", "revenue", "revenue_after_discount"]].head()
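To illustrate the "on demand" idea, the same formulas can be kept as small functions and evaluated only when a number is needed, so the table itself gains no new columns. This is a minimal sketch, not the notebook's own method: the helper names `revenue` and `revenue_after_discount` and the three-row `orders` sample are assumptions made for illustration, while the column names follow the cleaned table above.

```python
import pandas as pd

# Hypothetical helpers: each KPI is defined once and computed on request.
def revenue(frame):
    """Gross revenue per order line: quantity times unit price."""
    return frame["quantity_fixed"] * frame["unit_price_filled"]

def revenue_after_discount(frame):
    """Net revenue per order line after applying the fractional discount."""
    return revenue(frame) * (1 - frame["discount_filled"])

# Small illustrative sample using the notebook's column names.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1006],
    "quantity_fixed": [2, 1, 2],
    "unit_price_filled": [20.0, 35.5, 20.0],
    "discount_filled": [0.0, 0.1, 0.05],
})

# The KPI is computed as a Series; `orders` keeps its original four columns.
net = revenue_after_discount(orders)
total_net = net.sum()
print(net)
print("Total net revenue:", round(total_net, 2))
```

A semantic layer in a BI tool plays a similar role: the definition is stored once and applied at query time, rather than being materialized as columns on every table.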

## 2. Aggregate for dashboards
This is closer to a view your BI tool would consume.

In [None]:
country_metrics = (
    df.groupby("country_clean")
      .agg(
          orders=("order_id", "count"),
          total_revenue=("revenue_after_discount", "sum"),
          avg_ticket=("revenue_after_discount", "mean")
      )
      .sort_values("total_revenue", ascending=False)
)
country_metrics
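Before handing an aggregate like this to a dashboard, it can help to flatten the group index into a regular column and confirm that the group totals reconcile with the row-level total. The sketch below assumes a three-row `sample` frame and the variable name `view`, both chosen only for illustration; the aggregation itself mirrors the cell above.

```python
import pandas as pd

# Illustrative sample with the KPI column already computed.
sample = pd.DataFrame({
    "order_id": [1001, 1002, 1004],
    "country_clean": ["germany", "germany", "france"],
    "revenue_after_discount": [40.0, 31.95, 50.0],
})

view = (
    sample.groupby("country_clean")
    .agg(
        orders=("order_id", "count"),
        total_revenue=("revenue_after_discount", "sum"),
        avg_ticket=("revenue_after_discount", "mean"),
    )
    .sort_values("total_revenue", ascending=False)
    .reset_index()  # turn the group key into an ordinary column (flat table)
)

# Reconciliation: the per-country totals should add up to the overall total.
row_level_total = sample["revenue_after_discount"].sum()
assert abs(view["total_revenue"].sum() - row_level_total) < 1e-9
print(view)
```

Note that `avg_ticket` is not additive: averaging the per-country averages generally does not give the overall average ticket, which is one reason such ratios are often defined in the model layer as total revenue divided by order count.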

**Task**: Add `channel_clean` to the grouping so that dashboards can show revenue by country and channel.
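One possible way to approach the task is to wrap the aggregation in a function that takes the grouping keys as a parameter, so the KPI definitions stay in one place while the grain changes. This is only a sketch: `summarize` and the four-row `sample` frame are hypothetical names introduced here, and the channel column is assumed to be `channel_clean` as in the cleaned table.

```python
import pandas as pd

def summarize(frame, by):
    """Aggregate order-level net revenue by the given list of grouping columns."""
    return (
        frame.groupby(by)
        .agg(
            orders=("order_id", "count"),
            total_revenue=("revenue_after_discount", "sum"),
            avg_ticket=("revenue_after_discount", "mean"),
        )
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

# Illustrative sample with the KPI column already computed.
sample = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004],
    "country_clean": ["germany", "germany", "germany", "france"],
    "channel_clean": ["online", "online", "offline", "partner"],
    "revenue_after_discount": [40.0, 31.95, 60.0, 50.0],
})

by_country = summarize(sample, ["country_clean"])
by_country_channel = summarize(sample, ["country_clean", "channel_clean"])
print(by_country_channel)
```

Passing a different list of keys produces a different grain from the same metric definitions, which is roughly what a BI tool does when a user drags another dimension onto a chart.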