# 02 — Feature Engineering & Commercial Enrichment

**Purpose.**  
This notebook transforms validated logistics shipment and charge data into a
**commercially enriched, analytics-ready feature layer**.

It sits between:
- **Data certification** (Notebook 01), and
- **Advanced analytics, modelling, and AI** (Notebooks 03–05).

The focus is on creating **interpretable, business-aligned features**
that support statistical analysis, machine learning, and decision-ready reporting.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.float_format", "{:,.2f}".format)

PROJECT_ROOT = Path(".").resolve()
DATA_DIR = PROJECT_ROOT / ".." / "data"

df_ship = pd.read_parquet(DATA_DIR / "fct_shipments_randomised.parquet")
df_chg  = pd.read_parquet(DATA_DIR / "fct_charges_randomised.parquet")

len(df_ship), len(df_chg)

(9675687, 38709088)

This notebook assumes the datasets have already passed structural and
quality validation in Notebook 01.

In [2]:
df_ship = df_ship.copy()

# Absolute margin per shipment.
df_ship["margin"] = df_ship["total_sales"] - df_ship["total_costs"]

# Margin as a fraction of sales; left as NaN when there are no sales.
df_ship["margin_pct"] = np.where(
    df_ship["total_sales"] != 0,
    df_ship["margin"] / df_ship["total_sales"],
    np.nan
)

df_ship["is_loss_making"] = df_ship["margin"] < 0

# Aliases of the shipment-level totals.
df_ship["revenue_per_shipment"] = df_ship["total_sales"]
df_ship["cost_per_shipment"] = df_ship["total_costs"]

df_ship[[
    "unique_tracking",
    "total_sales",
    "total_costs",
    "margin",
    "margin_pct",
    "is_loss_making"
]].head()

| | unique_tracking | total_sales | total_costs | margin | margin_pct | is_loss_making |
|---|---|---|---|---|---|---|
| 0 | TRK34889743 | 2.12 | 1.92 | 0.2 | 0.09 | False |
| 1 | TRK10599887 | 2.36 | 2.03 | 0.33 | 0.14 | False |
| 2 | TRK79442324 | 2.67 | 1.87 | 0.8 | 0.3 | False |
| 3 | TRK85244549 | 2.91 | 1.87 | 1.04 | 0.36 | False |
| 4 | TRK24334731 | 3.39 | 1.88 | 1.51 | 0.45 | False |


These metrics form the **financial backbone** of all downstream analysis.
They are materialised here deliberately, rather than derived on the fly in later notebooks,
so that every downstream notebook works from the same definitions.
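
As a minimal sketch of how these definitions behave at the edges, the toy frame below (hypothetical values, not taken from the dataset) includes a zero-revenue shipment and a loss-making one. It uses `Series.where` to mask zero sales before dividing, which is one alternative way of getting the same NaN-for-zero-sales result as the `np.where` call in the cell above.

```python
import numpy as np
import pandas as pd

# Toy shipments (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "unique_tracking": ["A", "B", "C"],
    "total_sales": [10.0, 0.0, 4.0],
    "total_costs": [8.0, 1.0, 5.0],
})

toy["margin"] = toy["total_sales"] - toy["total_costs"]

# Mask zero revenue to NaN before dividing, so margin_pct is undefined (NaN)
# rather than infinite when a shipment has no revenue.
safe_sales = toy["total_sales"].where(toy["total_sales"] != 0)
toy["margin_pct"] = toy["margin"] / safe_sales

toy["is_loss_making"] = toy["margin"] < 0
print(toy)
```

Shipment B shows why the zero-sales case matters: it is loss-making in absolute terms, yet it has no meaningful percentage margin.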

In [3]:
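# One row per shipment, one column per charge type:
# 1 if the charge appears on the shipment, 0 otherwise.
charge_flags = (
    df_chg
    .assign(flag=1)
    .pivot_table(
        index="unique_tracking",
        columns="charge_type",
        values="flag",
        fill_value=0,
        aggfunc="max"
    )
)

# Prefix and normalise the column names: lower case, spaces to underscores.
charge_flags.columns = [
    f"has_charge_{c.lower().replace(' ', '_')}"
    for c in charge_flags.columns
]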

charge_flags.head()

| unique_tracking | has_charge_collection_fee | has_charge_customs_clearance | has_charge_delivery_attempt | has_charge_duty_handling | has_charge_fuel_surcharge | has_charge_insurance | has_charge_oversize | has_charge_peak_surcharge | has_charge_remote_area | has_charge_saturday_delivery |
|---|---|---|---|---|---|---|---|---|---|---|
| TRK00000002 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| TRK00000015 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| TRK00000020 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| TRK00000036 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| TRK00000061 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |


Binary charge indicators allow:
- interpretable modelling,
- loss uplift analysis,
- segmentation based on behaviour rather than totals.
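
One way to read "loss uplift" is as the difference in loss rate between shipments that carry a given charge and those that do not. That definition is an assumption made for this sketch, not something the notebook specifies, and `loss_uplift` is a hypothetical helper. The toy data is invented.

```python
import pandas as pd

# Toy data: one row per shipment, with a loss flag and one charge indicator.
toy = pd.DataFrame({
    "is_loss_making": [True, False, True, False, False, False],
    "has_charge_oversize": [1, 1, 1, 0, 0, 0],
})

def loss_uplift(df: pd.DataFrame, flag_col: str) -> float:
    """Loss rate with the charge minus loss rate without it (assumed definition)."""
    rates = df.groupby(flag_col)["is_loss_making"].mean()
    # .get returns NaN if one of the two groups is absent from the data.
    return rates.get(1, float("nan")) - rates.get(0, float("nan"))

uplift = loss_uplift(toy, "has_charge_oversize")
print(f"oversize loss uplift: {uplift:.2f}")
```

Because the indicators are binary, the same helper can be looped over every `has_charge_*` column to rank charges by their association with losses. The result is descriptive only and says nothing about causation.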

In [4]:
charge_agg = (
    df_chg.groupby("unique_tracking")
    .agg(
        charge_count=("charge_type", "count"),
        charge_sales=("sales_amount", "sum"),
        charge_cost=("cost_amount", "sum"),
    )
    .reset_index()
)

charge_agg["charge_margin"] = (
    charge_agg["charge_sales"] - charge_agg["charge_cost"]
)

charge_agg.head()

| | unique_tracking | charge_count | charge_sales | charge_cost | charge_margin |
|---|---|---|---|---|---|
| 0 | TRK00000002 | 3 | 0.17 | 0.16 | 0.01 |
| 1 | TRK00000015 | 6 | 0.58 | 0.47 | 0.11 |
| 2 | TRK00000020 | 5 | 0.46 | 0.34 | 0.12 |
| 3 | TRK00000036 | 9 | 0.97 | 0.63 | 0.34 |
| 4 | TRK00000061 | 5 | 1.45 | 1.18 | 0.27 |


These features quantify **how charge-heavy a shipment is**, not just which charges appear.
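
Note that `charge_count` counts charge rows, not distinct charge types: in the sample outputs above, TRK00000036 has six charge flags set but a `charge_count` of 9. The sketch below, on invented toy data, shows the two counts side by side. The `distinct_charge_types` column is a hypothetical addition for illustration and is not part of the notebook's feature set.

```python
import pandas as pd

# Toy charge lines: shipment T1 carries the same charge type twice.
toy_chg = pd.DataFrame({
    "unique_tracking": ["T1", "T1", "T1", "T2"],
    "charge_type": ["Fuel Surcharge", "Fuel Surcharge", "Insurance", "Oversize"],
    "sales_amount": [0.10, 0.12, 0.30, 0.50],
    "cost_amount": [0.08, 0.09, 0.20, 0.40],
})

toy_agg = (
    toy_chg.groupby("unique_tracking")
    .agg(
        charge_count=("charge_type", "count"),              # rows, repeats included
        distinct_charge_types=("charge_type", "nunique"),   # unique types only
        charge_sales=("sales_amount", "sum"),
    )
    .reset_index()
)
print(toy_agg)
```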

In [5]:
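df_enriched = (
    df_ship
    .merge(charge_flags, on="unique_tracking", how="left")
    .merge(charge_agg, on="unique_tracking", how="left")
)

# Shipments with no charge rows come out of the left joins as NaN; set only
# those charge-derived columns to 0. margin_pct stays NaN where total_sales
# is 0, so an undefined margin is not stored as a 0% margin.
charge_cols = [*charge_flags.columns, *charge_agg.columns.drop("unique_tracking")]
df_enriched[charge_cols] = df_enriched[charge_cols].fillna(0)
df_enriched.shape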

(9675687, 29)

This table becomes the **canonical enriched shipment fact** used by:
- statistical anomaly detection (03),
- charge behaviour analysis (04),
- clustering and AI insights (05),
- Power BI semantic models.
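
Since so much depends on this table, it may be worth guarding the joins. The sketch below, on invented toy frames, uses two documented `DataFrame.merge` parameters: `validate="one_to_one"` raises if either side has duplicate keys, and `indicator=True` records which shipments found no match. The frame and variable names are hypothetical.

```python
import pandas as pd

ships = pd.DataFrame({
    "unique_tracking": ["T1", "T2", "T3"],
    "total_sales": [5.0, 6.0, 7.0],
})
agg = pd.DataFrame({
    "unique_tracking": ["T1", "T2"],
    "charge_count": [3, 1],
})

# validate= fails fast on duplicate keys; indicator= adds a "_merge" column.
merged = ships.merge(
    agg, on="unique_tracking", how="left",
    validate="one_to_one", indicator=True,
)
unmatched = int((merged["_merge"] == "left_only").sum())

# A left join on unique keys must preserve the shipment row count.
assert len(merged) == len(ships)

# A duplicated key on the right-hand side is rejected instead of silently
# multiplying shipment rows.
dup = pd.concat([agg, agg.iloc[[0]]], ignore_index=True)
try:
    ships.merge(dup, on="unique_tracking", how="left", validate="one_to_one")
    duplicate_rejected = False
except pd.errors.MergeError:
    duplicate_rejected = True

print(unmatched, duplicate_rejected)
```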

In [6]:
client_features = (
    df_enriched
    .groupby("client_code", as_index=False)
    .agg(
        shipments=("unique_tracking", "nunique"),
        total_revenue=("total_sales", "sum"),
        total_cost=("total_costs", "sum"),
        total_margin=("margin", "sum"),
        loss_shipments=("is_loss_making", "sum"),
        avg_margin_pct=("margin_pct", "mean"),
        avg_charge_count=("charge_count", "mean"),
    )
)

client_features["loss_rate"] = (
    client_features["loss_shipments"] / client_features["shipments"]
)

client_features.head()

| | client_code | shipments | total_revenue | total_cost | total_margin | loss_shipments | avg_margin_pct | avg_charge_count | loss_rate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CLIENT001 | 1441104 | 9911326.24 | 6706964.53 | 3204361.71 | 30012 | 0.27 | 4.39 | 0.02 |
| 1 | CLIENT002 | 1154664 | 7879908.2 | 5341195.06 | 2538713.14 | 24310 | 0.23 | 4.39 | 0.02 |
| 2 | CLIENT003 | 964276 | 6578728.12 | 4463901.58 | 2114826.54 | 20090 | 0.28 | 4.38 | 0.02 |
| 3 | CLIENT004 | 675571 | 4629537.25 | 3136407.97 | 1493129.28 | 14149 | 0.21 | 4.39 | 0.02 |
| 4 | CLIENT005 | 578874 | 3940545.94 | 2681112.68 | 1259433.26 | 11933 | 0.28 | 4.39 | 0.02 |


Client-level features support:
- segmentation,
- concentration risk analysis,
- pricing and commercial strategy.
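
As an illustration of the concentration-risk use case, the sketch below derives each client's revenue share, the cumulative share, and a Herfindahl-Hirschman index (the sum of squared shares). The choice of these measures is an assumption for this example and the toy figures are invented. Only `client_code` and `total_revenue` come from the table above.

```python
import pandas as pd

# Toy client totals (hypothetical values).
toy_clients = pd.DataFrame({
    "client_code": ["C1", "C2", "C3", "C4"],
    "total_revenue": [50.0, 30.0, 15.0, 5.0],
})

toy_clients = toy_clients.sort_values("total_revenue", ascending=False)

# Share of total revenue per client, and the running total from the largest down.
toy_clients["revenue_share"] = (
    toy_clients["total_revenue"] / toy_clients["total_revenue"].sum()
)
toy_clients["cumulative_share"] = toy_clients["revenue_share"].cumsum()

# Herfindahl-Hirschman index: 1/n for evenly spread revenue, 1.0 for one client.
hhi = (toy_clients["revenue_share"] ** 2).sum()
print(toy_clients)
print(f"HHI: {hhi:.3f}")
```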

In [7]:
supplier_features = (
    df_enriched
    .groupby("supplier_code", as_index=False)
    .agg(
        shipments=("unique_tracking", "nunique"),
        total_revenue=("total_sales", "sum"),
        total_cost=("total_costs", "sum"),
        total_margin=("margin", "sum"),
        loss_shipments=("is_loss_making", "sum"),
        avg_margin_pct=("margin_pct", "mean"),
        avg_charge_count=("charge_count", "mean"),
    )
)

supplier_features["loss_rate"] = (
    supplier_features["loss_shipments"] / supplier_features["shipments"]
)

supplier_features.head()

| | supplier_code | shipments | total_revenue | total_cost | total_margin | loss_shipments | avg_margin_pct | avg_charge_count | loss_rate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SUP001 | 2390191 | 16483349.01 | 11166125.14 | 5317223.87 | 50261 | 0.25 | 4.39 | 0.02 |
| 1 | SUP002 | 1918578 | 13172083.93 | 8927375.58 | 4244708.35 | 40235 | 0.25 | 4.39 | 0.02 |
| 2 | SUP003 | 1440490 | 9846260.64 | 6678107.62 | 3168153.02 | 29935 | 0.28 | 4.39 | 0.02 |
| 3 | SUP004 | 962889 | 6579231.54 | 4456629.32 | 2122602.22 | 20057 | 0.27 | 4.38 | 0.02 |
| 4 | SUP005 | 961809 | 6530636.12 | 4423346.15 | 2107289.97 | 19817 | 0.27 | 4.39 | 0.02 |


Supplier features support:
- benchmarking,
- cost competitiveness analysis,
- anomaly detection and charge behaviour modelling.
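
The client and supplier aggregations are identical apart from the grouping key, so they could share one helper. The sketch below assumes a hypothetical `entity_features` function and invented toy data. It also adds a revenue-weighted margin (`total_margin / total_revenue`), which is an illustrative extra rather than part of the notebook. The two margin measures can differ noticeably: in the supplier output above, SUP001's total margin divided by total revenue is about 0.32, while its `avg_margin_pct` is 0.25.

```python
import pandas as pd

def entity_features(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Aggregate enriched shipments by `key` (hypothetical shared helper)."""
    out = (
        df.groupby(key, as_index=False)
        .agg(
            shipments=("unique_tracking", "nunique"),
            total_revenue=("total_sales", "sum"),
            total_cost=("total_costs", "sum"),
            total_margin=("margin", "sum"),
            loss_shipments=("is_loss_making", "sum"),
            avg_margin_pct=("margin_pct", "mean"),
            avg_charge_count=("charge_count", "mean"),
        )
    )
    out["loss_rate"] = out["loss_shipments"] / out["shipments"]
    return out

# Toy enriched shipments (hypothetical values).
toy = pd.DataFrame({
    "unique_tracking": ["T1", "T2", "T3", "T4"],
    "supplier_code": ["S1", "S1", "S2", "S2"],
    "total_sales": [10.0, 2.0, 5.0, 5.0],
    "total_costs": [6.0, 3.0, 4.0, 4.0],
    "charge_count": [3, 1, 2, 4],
})
toy["margin"] = toy["total_sales"] - toy["total_costs"]
toy["margin_pct"] = toy["margin"] / toy["total_sales"]
toy["is_loss_making"] = toy["margin"] < 0

toy_suppliers = entity_features(toy, "supplier_code")

# The unweighted mean treats every shipment equally; this ratio weights by revenue.
toy_suppliers["weighted_margin_pct"] = (
    toy_suppliers["total_margin"] / toy_suppliers["total_revenue"]
)
print(toy_suppliers)
```

For benchmarking, the revenue-weighted figure reflects what a supplier actually contributes, whereas the unweighted mean can be pulled around by many small shipments.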

In [8]:
model_features = df_enriched.select_dtypes(include=[np.number])

model_features.isna().mean().sort_values(ascending=False).head()

total_sales            0.00
total_costs            0.00
margin                 0.00
margin_pct             0.00
revenue_per_shipment   0.00
dtype: float64

At this stage the model features are:
- numeric,
- interpretable,
- aligned with business logic.

This is intentional: complexity is added in modelling, not hidden in preprocessing.
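
One detail to keep in mind: `select_dtypes(include=[np.number])` does not select boolean columns, so a flag such as `is_loss_making` is not part of `model_features` as built above. If a downstream model should see such flags, one option is to cast them to integers first. This is a sketch on invented toy data, not necessarily the intended design.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "margin": [1.0, -0.5],
    "is_loss_making": [False, True],   # bool dtype
    "client_code": ["C1", "C2"],       # object dtype
})

# np.number covers integer and float dtypes but not bool.
numeric_only = toy.select_dtypes(include=[np.number])

# Cast boolean columns to small integers so they survive the numeric filter.
bool_cols = toy.select_dtypes(include="bool").columns
with_flags = (
    toy.astype({c: "int8" for c in bool_cols})
    .select_dtypes(include=[np.number])
)

print(list(numeric_only.columns))
print(list(with_flags.columns))
```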

In [None]:
OUT_DIR = PROJECT_ROOT / ".." / "enriched"

# parents=True also creates any missing parent directories.
OUT_DIR.mkdir(parents=True, exist_ok=True)

df_enriched.to_parquet(OUT_DIR / "fct_shipments_enriched.parquet", index=False)
client_features.to_parquet(OUT_DIR / "dim_clients_enriched.parquet", index=False)
supplier_features.to_parquet(OUT_DIR / "dim_suppliers_enriched.parquet", index=False)

These outputs form the **contract** for:
- Notebook 03 (Outlier Detection),
- Notebook 04 (Charge Behaviour),
- Notebook 05 (Segmentation & AI),
- Power BI semantic models.
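
A contract is only useful if consumers can check it. As a minimal sketch, a downstream notebook could verify that the columns it depends on are present before doing any work. `check_contract` and `REQUIRED_SHIPMENT_COLUMNS` are hypothetical names, and the column set shown is just a subset of the features created above.

```python
import pandas as pd

# Hypothetical contract: the columns a downstream notebook relies on.
REQUIRED_SHIPMENT_COLUMNS = {
    "unique_tracking",
    "total_sales",
    "total_costs",
    "margin",
    "margin_pct",
    "is_loss_making",
    "charge_count",
}

def check_contract(df: pd.DataFrame, required: set) -> pd.DataFrame:
    """Raise if any required column is missing; otherwise return df unchanged."""
    missing = sorted(required - set(df.columns))
    if missing:
        raise ValueError(f"missing contract columns: {missing}")
    return df

# Toy stand-in for a frame loaded from the enriched parquet file.
toy = pd.DataFrame(columns=sorted(REQUIRED_SHIPMENT_COLUMNS))
checked = check_contract(toy, REQUIRED_SHIPMENT_COLUMNS)
```

In a real consumer the check would wrap the `pd.read_parquet` call, so a renamed or dropped feature fails loudly at load time instead of surfacing later as a `KeyError`.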

## Conclusion

This notebook converts validated logistics data into a **commercially enriched,
analytics-ready feature layer**.

Key outcomes:
- Explicit financial and behavioural features
- Shipment, client, and supplier perspectives
- Clear separation between data preparation and modelling
- Reusable outputs across analytics and BI tools

This layer ensures downstream analysis focuses on **insight generation**, not data wrangling.