# Session 7 — Data Analysis & Feature Engineering (NumPy + Pandas)

**Goal:** bridge Data Engineering to AI/ML by learning how to **inspect, clean, and transform** data into **model-ready features**.

**You will learn**
- Core **NumPy** concepts for efficient numeric computation
- Practical **Pandas** patterns for cleaning, joining, aggregating
- **EDA** (exploratory data analysis) workflow and pitfalls (leakage, imbalance)
- Common **feature engineering** techniques across domains (retail, fintech, healthcare, marketing, text, time series)
- Packaging features for ML pipelines (hand-off to Session 8)


## 🧭 1️⃣ Why This Matters
Data engineers enable ML by delivering **high-quality, well-shaped features**. A simple model with strong **features** often outperforms a complex model with weak ones.

**Key principles:**
- Make transformations **reproducible** (deterministic code, versioned)
- Avoid **data leakage** (no future info in training)
- Capture **business logic** faithfully (domain-aware features)
- Keep **data lineage** and **quality checks** from earlier sessions


## 🧱 2️⃣ NumPy Foundations (Arrays & Vectorization)
NumPy provides fast **n-dimensional arrays** and vectorized operations.

**Concepts:**
- Array creation, dtypes, shapes
- Broadcasting: operations across different shapes
- Aggregations: `mean`, `sum`, `std`, `percentile`
- Boolean masks and indexing


In [None]:
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.arange(4)  # [0,1,2,3]

print('a:', a, 'shape:', a.shape)
print('b:', b)
print('a + b:', a + b)
print('broadcast add (a + 10):', a + 10)
print('mean(a):', a.mean(), 'std(a):', a.std())

# boolean masking
mask = a % 2 == 0
print('even values in a:', a[mask])
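
The cell above adds a scalar to a 1-D array. As a minimal sketch of broadcasting across *different shapes* and of `percentile`, the example below standardizes the columns of a small matrix. The matrix `X` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical 3x2 matrix: rows are samples, columns are features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_mean = X.mean(axis=0)  # shape (2,)
col_std = X.std(axis=0)    # shape (2,), population std (ddof=0)

# Broadcasting: the (2,) vectors are stretched across all 3 rows of the (3, 2) array.
X_scaled = (X - col_mean) / col_std

# A (3, 1) column plus a (2,) row broadcasts to a (3, 2) grid.
grid = np.arange(3).reshape(3, 1) + np.arange(2)

# Median (50th percentile) of each column.
p50 = np.percentile(X, 50, axis=0)

print('scaled:\n', X_scaled)
print('grid shape:', grid.shape)
print('column medians:', p50)
```

After scaling, each column has mean 0 and standard deviation 1, which is the same idea as the `price_z` feature built later with Pandas.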


## 🐼 3️⃣ Pandas for Tabular Data
**Pandas** builds on NumPy, adding labeled rows and columns and table-style operations.

**Patterns you’ll use daily:**
- Import/export: CSV, JSON, (Parquet optional)
- Selecting, filtering, sorting, renaming
- Handling missing values (drop/impute)
- GroupBy aggregations
- Joins/merges and reshaping (pivot/melt)


### ✍️ Mini Retail Dataset (inline)

In [None]:
import pandas as pd
from pathlib import Path

data = [
    {'order_id':1,'customer_id':'C-101','region':'North','sku':'LAPTOP-15','qty':2,'price':1200.0,'order_dt':'2025-10-01'},
    {'order_id':2,'customer_id':'C-102','region':'South','sku':'MOUSE-01','qty':1,'price':25.0,'order_dt':'2025-10-01'},
    {'order_id':3,'customer_id':'C-103','region':'North','sku':'LAPTOP-15','qty':1,'price':1250.0,'order_dt':'2025-10-02'},
    {'order_id':4,'customer_id':'C-104','region':'West','sku':'KB-01','qty':1,'price':45.0,'order_dt':'2025-10-02'},
    {'order_id':5,'customer_id':'C-101','region':'North','sku':'LAPTOP-15','qty':1,'price':1180.0,'order_dt':'2025-10-03'},
]
df = pd.DataFrame(data)
df['order_dt'] = pd.to_datetime(df['order_dt'])
df['amount'] = df['qty'] * df['price']
df

### 🔎 Typical Operations

In [None]:
# filter + select
north_laptop = df[(df['region']=='North') & (df['sku'].str.contains('LAPTOP'))][['order_id','customer_id','amount']]
print(north_laptop)

# groupby
sales_by_region = df.groupby('region', as_index=False)['amount'].sum().sort_values('amount', ascending=False)
print('\nSales by region:\n', sales_by_region)

# pivot (sku x day)
pivot = df.pivot_table(index='order_dt', columns='sku', values='amount', aggfunc='sum', fill_value=0)
print('\nPivot (daily amount by SKU):\n', pivot)
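
The operations above cover filtering, grouping, and pivoting. Joins and `melt` from the pattern list can be sketched as follows. The `customers` table and the `spend_oct`/`spend_nov` columns are hypothetical and not part of the session dataset.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['C-101', 'C-102', 'C-101'],
    'amount': [2400.0, 25.0, 1250.0],
})

# Hypothetical customer dimension table, one row per customer.
customers = pd.DataFrame({
    'customer_id': ['C-101', 'C-102'],
    'segment': ['business', 'consumer'],
})

# validate='many_to_one' raises MergeError if customer_id repeats on the right,
# which guards against accidental row duplication.
joined = orders.merge(customers, on='customer_id', how='left', validate='many_to_one')
print(joined)

# Reshape wide -> long with melt (one row per customer and month).
wide = pd.DataFrame({
    'customer_id': ['C-101', 'C-102'],
    'spend_oct': [3650.0, 25.0],
    'spend_nov': [0.0, 40.0],
})
long = wide.melt(id_vars='customer_id', var_name='month', value_name='spend')
print(long)
```

Passing `validate` is optional, but it turns a silent fan-out into an explicit error, which is usually easier to debug.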


## 🧼 4️⃣ Data Cleaning Playbook
- **Missing values**: drop vs. impute (mean/median/mode/domain rules)
- **Outliers**: winsorize or cap extreme values; remove them only if they are data errors
- **Duplicates**: use `drop_duplicates` on business keys
- **Types**: cast to numeric/datetime; normalize text case/trim


In [None]:
# simulate some dirtiness
dirt = df.copy()
dirt.loc[1,'price'] = None
dirt.loc[2,'region'] = None
dirt = pd.concat([dirt, dirt.iloc[[0]]], ignore_index=True)  # duplicate row

print('Before cleaning:\n', dirt)

# dedupe on the business key (order_id), keeping the first occurrence
dirt = dirt.drop_duplicates(subset=['order_id'], keep='first')
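# impute price with median per sku; fall back to the overall median when a SKU
# has no observed price (MOUSE-01 has a single row, so its group median is NaN)
dirt['price'] = dirt.groupby('sku')['price'].transform(lambda s: s.fillna(s.median()))
dirt['price'] = dirt['price'].fillna(dirt['price'].median())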
# fill region with 'Unknown'
dirt['region'] = dirt['region'].fillna('Unknown')
# recompute amount
dirt['amount'] = dirt['qty'] * dirt['price']

print('\nAfter cleaning:\n', dirt)
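
The cell above handles missing values and duplicates. The remaining playbook items (type casting, text normalization, outlier capping) might look like the sketch below. The `raw` frame and the 5th/95th percentile thresholds are assumptions chosen for illustration; real thresholds depend on the domain.

```python
import pandas as pd

# Hypothetical messy input: inconsistent text and prices stored as strings.
raw = pd.DataFrame({
    'region': [' north', 'North ', 'SOUTH', 'west'],
    'price': ['1200', '25.0', 'n/a', '99999'],
})

# Types: cast to numeric; unparseable values become NaN instead of raising.
raw['price'] = pd.to_numeric(raw['price'], errors='coerce')

# Text: trim whitespace and normalize case.
raw['region'] = raw['region'].str.strip().str.title()

# Outliers: cap at the 5th and 95th percentiles (a winsorize-style clip).
low, high = raw['price'].quantile([0.05, 0.95])
raw['price_capped'] = raw['price'].clip(lower=low, upper=high)

print(raw)
```

`clip` leaves NaN values untouched, so imputation can still be applied as a separate step.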


## 🧪 5️⃣ Feature Engineering — Concepts & Patterns
**Numerical features**: scaling, binning, log transforms

**Categorical features**: one-hot, target encoding (careful: leakage risk)

**Datetime features**: year, month, day-of-week, hour; rolling windows; time since last event

**Text features**: length, word counts, n-grams, TF–IDF (details in Session 8)

**Aggregation features**: per-customer totals, moving averages, ratios

**Domain features**: promos, seasonality flags, churn windows


### 🧱 Numerical + Categorical + Datetime Examples

In [None]:
# numerical: z-score scaling by hand (ddof=0 matches sklearn's StandardScaler)
import numpy as np

fe = df.copy()
fe['price_z'] = (fe['price'] - fe['price'].mean()) / (fe['price'].std(ddof=0))

# categorical: one-hot
ohe = pd.get_dummies(fe['region'], prefix='region', dtype=int)  # 0/1 ints; recent pandas defaults to bool
fe = pd.concat([fe, ohe], axis=1)

# datetime: calendar parts
fe['dow'] = fe['order_dt'].dt.dayofweek
fe['month'] = fe['order_dt'].dt.month

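# lag/rolling example (by SKU): rows must be sorted by date within each SKU
# note: the window includes the current row, so this leaks if amount is the target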
fe = fe.sort_values(['sku','order_dt'])
fe['amt_rolling_3'] = fe.groupby('sku')['amount'].transform(lambda s: s.rolling(3, min_periods=1).mean())

fe[['order_id','sku','amount','price_z','region_North','region_South','region_West','dow','month','amt_rolling_3']]
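
The cell above covers scaling, one-hot encoding, calendar parts, and a rolling mean. The other patterns from the concept list (log transform, binning, time since last event, per-customer aggregates and ratios) can be sketched as below. The `orders` frame and the bin edges are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical orders, sorted by customer and date so diff() follows time order.
orders = pd.DataFrame({
    'customer_id': ['C-101', 'C-102', 'C-101', 'C-101'],
    'order_dt': pd.to_datetime(['2025-10-01', '2025-10-01', '2025-10-03', '2025-10-07']),
    'amount': [2400.0, 25.0, 1180.0, 45.0],
}).sort_values(['customer_id', 'order_dt'])

# Numerical: log1p compresses the long right tail of amounts.
orders['amount_log'] = np.log1p(orders['amount'])

# Numerical: binning into labeled ranges (bin edges are an assumption).
orders['amount_bin'] = pd.cut(orders['amount'], bins=[0, 100, 1500, np.inf],
                              labels=['low', 'mid', 'high'])

# Datetime: days since the same customer's previous order (NaN for the first one).
orders['days_since_prev'] = orders.groupby('customer_id')['order_dt'].diff().dt.days

# Aggregation + ratio: per-customer total and each order's share of it.
# This total uses the customer's whole history, so restrict it to past rows
# before using it as a predictor in a time-aware setting.
orders['cust_total'] = orders.groupby('customer_id')['amount'].transform('sum')
orders['share_of_cust_total'] = orders['amount'] / orders['cust_total']

print(orders)
```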


## ⚠️ 6️⃣ Common Pitfalls (and How to Avoid)
- **Data leakage:** never use future information when creating features (respect time order)
- **Target leakage:** avoid features derived directly from labels
- **Imbalanced data:** stratified splits; balanced class weights
- **Spurious correlations:** validate with out-of-time tests
- **Overfitting via joins:** one-to-many joins duplicate rows and inflate aggregates; ensure keys are one-to-one, or aggregate before joining
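
To make the first pitfall concrete, here is a small sketch contrasting a leaky rolling mean with a time-aware one. It assumes `amount` is the value being predicted; the `ts` frame is made-up data.

```python
import pandas as pd

# Hypothetical daily amounts per SKU, sorted by SKU and date.
ts = pd.DataFrame({
    'sku': ['A', 'A', 'A', 'A', 'B', 'B'],
    'order_dt': pd.to_datetime(['2025-10-01', '2025-10-02', '2025-10-03',
                                '2025-10-04', '2025-10-01', '2025-10-02']),
    'amount': [100.0, 200.0, 300.0, 400.0, 10.0, 30.0],
}).sort_values(['sku', 'order_dt'])

# Leaky: the window includes the current row, i.e. the value we want to predict.
ts['roll3_leaky'] = ts.groupby('sku')['amount'].transform(
    lambda s: s.rolling(3, min_periods=1).mean())

# Time-aware: shift by one row first, so each window only sees earlier rows.
ts['roll3_safe'] = ts.groupby('sku')['amount'].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean())

print(ts)
```

The first row of each SKU has no history, so `roll3_safe` is NaN there; that gap is expected and should be imputed or dropped downstream rather than filled with current-row information.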


## 🌐 7️⃣ Cross-Domain Feature Examples (Diverse)
**Retail** — basket size, monthly spend, days since last purchase

**FinTech** — transaction velocity, merchant diversity, failed-payment rate

**Healthcare IoT** — rolling mean of vitals, anomaly flags, adherence ratio

**Marketing** — email open rate, channel diversity index, RFM (recency-frequency-monetary)

**Text/NLP** — review length, sentiment score, TF–IDF vectors (Session 8)

**Time Series** — lag features, rolling windows, seasonality dummies
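
As one worked instance of the examples above, RFM features can be sketched with a single `groupby`. The transactions mirror the mini retail dataset; the snapshot date `as_of` is an assumption.

```python
import pandas as pd

tx = pd.DataFrame({
    'customer_id': ['C-101', 'C-102', 'C-103', 'C-104', 'C-101'],
    'order_dt': pd.to_datetime(['2025-10-01', '2025-10-01', '2025-10-02',
                                '2025-10-02', '2025-10-03']),
    'amount': [2400.0, 25.0, 1250.0, 45.0, 1180.0],
})

# Hypothetical snapshot date: features describe the state as of this day.
as_of = pd.Timestamp('2025-10-04')

rfm = tx.groupby('customer_id').agg(
    last_order=('order_dt', 'max'),   # used to derive recency
    frequency=('order_dt', 'count'),  # number of orders
    monetary=('amount', 'sum'),       # total spend
)
rfm['recency_days'] = (as_of - rfm['last_order']).dt.days
rfm = rfm.drop(columns='last_order').reset_index()

print(rfm)
```

Fixing `as_of` explicitly keeps the features reproducible; using the current date instead would make the same code return different values on every run.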


## 🖼️ 8️⃣ Visual — From Raw Data to Features

In [None]:
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

fig, ax = plt.subplots(figsize=(12, 3.6))
ax.axis('off')

labels = [
    ('Raw Data', 'tables, logs, files'),
    ('Clean', 'types, missing, dedupe'),
    ('Enrich', 'joins, business rules'),
    ('Engineer', 'ohe, scale, lags'),
    ('Features', 'model-ready set')
]
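W, H, GAP, PAD, Y0 = 0.18, 0.22, 0.06, 0.01, 0.39
total_w = len(labels)*W + (len(labels)-1)*GAP
x0 = (1-total_w)/2
xs = [x0 + i*(W+GAP) for i in range(len(labels))]
# total_w is 1.14, wider than the default 0..1 view, so set limits explicitly to avoid clipping
ax.set_xlim(x0 - 0.05, x0 + total_w + 0.05)
ax.set_ylim(0, 1)

def box(x, t, s):
    # rounding_size is in data units here, so keep it small relative to W and H
    r = FancyBboxPatch((x, Y0), W, H, boxstyle='round,pad=0.02,rounding_size=0.03')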
    ax.add_patch(r)
    ax.text(x+W/2, Y0+H*0.62, t, ha='center', va='center', fontsize=10, fontweight='bold')
    ax.text(x+W/2, Y0+H*0.36, s, ha='center', va='center', fontsize=9)

for (t,s), x in zip(labels, xs):
    box(x, t, s)

y = Y0 + H/2
for i in range(len(xs)-1):
    ax.annotate('', xy=(xs[i+1]-PAD, y), xytext=(xs[i]+W+PAD, y),
                arrowprops=dict(arrowstyle='->', lw=2))

plt.tight_layout(); plt.show()


## 🔗 9️⃣ Handoff to Session 8 (ML Foundations)
- Split **train/test** preserving time order or using stratification
- Evaluate with proper metrics (classification vs regression)
- Save features and **model artifacts** reproducibly
- Integrate with orchestration (Airflow/ADF) and governance (Purview/Glue)
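
A minimal sketch of the first and third points, assuming a date-based cutoff and CSV output. The cutoff date, file names, and temporary directory are illustrative choices, not a prescribed layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical feature table, sorted by time.
features = pd.DataFrame({
    'order_dt': pd.to_datetime(['2025-10-01', '2025-10-01', '2025-10-02',
                                '2025-10-02', '2025-10-03']),
    'qty': [2, 1, 1, 1, 1],
    'price': [1200.0, 25.0, 1250.0, 45.0, 1180.0],
}).sort_values('order_dt').reset_index(drop=True)
features['amount'] = features['qty'] * features['price']

# Time-ordered split: everything before the cutoff trains, the rest tests.
cutoff = pd.Timestamp('2025-10-03')  # assumed cutoff date
train = features[features['order_dt'] < cutoff]
test = features[features['order_dt'] >= cutoff]

# Save both sets; a temporary directory keeps the sketch side-effect free.
out_dir = Path(tempfile.mkdtemp())
train.to_csv(out_dir / 'features_train.csv', index=False)
test.to_csv(out_dir / 'features_test.csv', index=False)

# Reload to confirm the files round-trip with the same columns.
reloaded = pd.read_csv(out_dir / 'features_train.csv', parse_dates=['order_dt'])
print(len(train), 'train rows,', len(test), 'test rows')
print(reloaded.dtypes)
```

In a real pipeline the output path would be a versioned location rather than a temporary directory, so the same features can be retrieved later.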


## 💡 🔟 Practice / Assignment
1) Create at least **8 features** from the retail dataset (mix of numerical/categorical/datetime/aggregation).
2) Simulate **leakage** (incorrect) and then fix it (correct time-aware features).
3) Write a short **EDA checklist** (nulls, skews, outliers) for any dataset you choose.
4) Save a **features CSV** ready for ML training in Session 8.
