# Big Data & BI — Feature Engineering
## Notebook 1: Data Quality & Cleaning Strategy (post-EDA)

You already had an EDA session, so we **do not repeat** `.describe()` and basic plots here.
This notebook is about turning EDA findings into a **concrete cleaning plan**:
- decide what to fix
- decide what to drop
- decide what to standardize
- mark columns that must be complete for dashboards

We'll work on an embedded messy retail dataset (same one across all 5 notebooks).

## 1. Imports & dataset
We make a small, messy dataset to simulate an export from a sales system.

In [None]:
import pandas as pd
import numpy as np

pd.set_option("display.width", 120)
pd.set_option("display.max_columns", 50)

data = {
    "order_id":   [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    "order_date": ["2025-01-03", "2025/01/03", "03-01-2025", "2025-01-04", None,
                    "2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06", "2025-01-06"],
    "customer_id": [501, 502, 503, 503, 504, 505, 506, 506, 507, None],
    "country":    ["DE", "Germany", "germany", "FR", "France", "DE", "DE ", "?", None, "GER"],
    "product":    ["Widget A", "Widget B", "Widget A", "Widget C", "Widget A",
                    "Widget B", "Widget B", "Widget C", "Widget A", "Widget A"],
    "quantity":   [2, 1, 3, 1, -1, 2, 2, 1, 5, 2],
    "unit_price": [20.0, 35.5, 20.0, 50.0, 20.0, None, 35.5, 50.0, 20.0, 20.0],
    "discount":   [0.0, 0.1, None, 0.0, 0.0, 0.05, 0.0, None, 0.0, 0.0],
    "channel":    ["online", "Online", "offline", "partner", "online",
                    "offline", "online ", "ONLINE", None, "partner"],
    "comments":   [None, "urgent delivery", "", "customer asked for invoice", None,
                    "", "", "late payment last time", None, ""]
}
df = pd.DataFrame(data)
df

## 2. Quick audit
We only look at **what blocks dashboarding**: missing keys, missing dates, inconsistent categoricals.

In [None]:
missing = df.isna().sum().sort_values(ascending=False)
missing

**Task 1**: Based on this, list columns that **must not** be null for your dashboards (for example `order_date`, `country`, `product`).

> Must not be null:
> - …
> - …

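`isna()` only counts real nulls. In this export some gaps are encoded as strings instead: `comments` holds empty strings and `country` holds `?`. Below is a minimal sketch of a wider audit, assuming `""` and `"?"` both mean "no value". The `PLACEHOLDERS` list and the `effective_missing` helper are illustrative names, not part of the course material. The three columns are copied from section 1 so the cell runs on its own.

```python
import pandas as pd

# Columns copied from the dataset in section 1 so this cell is self-contained.
sample = pd.DataFrame({
    "country":  ["DE", "Germany", "germany", "FR", "France", "DE", "DE ", "?", None, "GER"],
    "channel":  ["online", "Online", "offline", "partner", "online",
                 "offline", "online ", "ONLINE", None, "partner"],
    "comments": [None, "urgent delivery", "", "customer asked for invoice", None,
                 "", "", "late payment last time", None, ""],
})

# Assumption: these strings stand for "no value" in this export.
PLACEHOLDERS = ["", "?"]

def effective_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Count real nulls and placeholder strings per text column."""
    text = frame.apply(lambda col: col.str.strip())  # real nulls stay null
    nulls = frame.isna().sum()
    placeholders = text.isin(PLACEHOLDERS).sum()
    return pd.DataFrame({
        "null": nulls,
        "placeholder": placeholders,
        "total": nulls + placeholders,
    })

report = effective_missing(sample)
print(report)
```

On this data `comments` has 3 nulls plus 4 empty strings, and `country` has 1 null plus 1 `?`, so the plain null count understates both.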
## 3. Consistency audit (categoricals)
We already know from EDA that inconsistent text is common. Here we only list the values to decide on a standard form.

In [None]:
for col in ["country", "channel"]:
    print(f"\nDistinct values in {col}:")
    print(df[col].unique())

**Task 2**: Propose a target list, e.g.
- country: `['germany', 'france', 'unknown']`
- channel: `['online', 'offline', 'partner', 'unknown']`

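Before settling on a target list, it helps to see which variants disappear with plain normalization and which need an explicit rule. This is a small sketch, assuming that trimming whitespace and lowercasing is acceptable for both columns. `normalized_counts` is a hypothetical helper, and the values are copied from section 1.

```python
import pandas as pd

# Values copied from the dataset in section 1.
country = pd.Series(["DE", "Germany", "germany", "FR", "France", "DE", "DE ", "?", None, "GER"])
channel = pd.Series(["online", "Online", "offline", "partner", "online",
                     "offline", "online ", "ONLINE", None, "partner"])

def normalized_counts(values: pd.Series) -> pd.Series:
    """Trim whitespace, lowercase, then count each value (nulls included)."""
    return values.str.strip().str.lower().value_counts(dropna=False)

country_counts = normalized_counts(country)
channel_counts = normalized_counts(channel)

print(country_counts)
print(channel_counts)
```

`channel` collapses to `online`, `offline`, `partner` plus one null, so normalization alone is enough there. `country` still contains `de`, `ger`, `germany`, `fr`, `france` and `?`, so it needs an explicit mapping.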
## 4. Cleaning plan (write it down!)
Fill this out — you will implement it in Notebook 2.

**Transformations to apply:**
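- **Parse dates**: `order_date` → parse every format present (`2025-01-03`, `2025/01/03`, `03-01-2025`) to datetime, coerce anything unparseable to `NaT`; decide explicitly whether `03-01-2025` is day-first or month-first
- **Standardize country**: trim whitespace and lowercase first, then DE/Germany/GER → "germany", FR/France → "france", ?/None → "unknown"
- **Standardize channel**: trim whitespace and lowercase (covers `Online`, `ONLINE`, `online `), None → "unknown"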
- **Impute numeric**:
  - `unit_price` → median (20.0)
  - `discount` → 0.0 (assumes a missing discount means no discount was given)
- **Fix data issues**:
  - `quantity` → one negative value (`-1` on order 1005); the fix is deferred to Notebook 3
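- **Fill missing IDs**:
  - `customer_id` → give each missing ID its own placeholder in a high range (10000+), so it cannot collide with a real customer ID and unknown customers are not merged into one
  - Strategy: take the max existing ID (507) and start two orders of magnitude higher (10000)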

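One possible way to express the date and categorical rules of the plan in pandas is sketched here under stated assumptions. `03-01-2025` is read as day-first (3 January), which matches the other January dates but should be confirmed with the source system. `DATE_FORMATS`, `COUNTRY_MAP`, `CHANNEL_MAP`, `parse_dates` and `standardize` are illustrative names. The columns are copied from section 1 so the cell runs on its own. Each format is tried explicitly, so a value that matches none of them stays `NaT` instead of being guessed.

```python
import pandas as pd

# Columns copied from the dataset in section 1.
order_date = pd.Series(["2025-01-03", "2025/01/03", "03-01-2025", "2025-01-04", None,
                        "2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06", "2025-01-06"])
country = pd.Series(["DE", "Germany", "germany", "FR", "France", "DE", "DE ", "?", None, "GER"])
channel = pd.Series(["online", "Online", "offline", "partner", "online",
                     "offline", "online ", "ONLINE", None, "partner"])

# Assumption: the last format is day-first (DD-MM-YYYY).
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]

# Keys are already trimmed and lowercased; anything else becomes "unknown".
COUNTRY_MAP = {"de": "germany", "ger": "germany", "germany": "germany",
               "fr": "france", "france": "france"}
CHANNEL_MAP = {"online": "online", "offline": "offline", "partner": "partner"}

def parse_dates(values: pd.Series) -> pd.Series:
    """Try each known format in turn; values matching none stay NaT."""
    parsed = pd.to_datetime(values, format=DATE_FORMATS[0], errors="coerce")
    for fmt in DATE_FORMATS[1:]:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        parsed = parsed.fillna(attempt)  # only fills positions still NaT
    return parsed

def standardize(values: pd.Series, mapping: dict) -> pd.Series:
    """Trim, lowercase, map to the target list; unmapped and null become 'unknown'."""
    key = values.str.strip().str.lower()
    return key.map(mapping).fillna("unknown")

order_date_clean = parse_dates(order_date)
country_clean = standardize(country, COUNTRY_MAP)
channel_clean = standardize(channel, CHANNEL_MAP)

print(order_date_clean)
print(country_clean.value_counts())
print(channel_clean.value_counts())
```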
**Columns that MUST NOT be null for dashboards:**
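- `order_date` (needed for time series; order 1005 has no date, so decide whether to drop or flag it)
- `customer_id_clean` (cleaned `customer_id`, for customer analysis)
- `country_clean` (cleaned `country`, for regional analysis)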
- `product` (for product analysis)
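
The imputation and ID rules, plus a check against the must-not-be-null list, can be sketched as follows. This is one reading of the plan, not a reference solution. "Two orders of magnitude higher" is interpreted as the next power of ten two digits above the max ID (507 → 10000). `fill_missing_ids` and `null_violations` are hypothetical helpers. `order_date` is shown already parsed, with the one missing date as `NaT`.

```python
import numpy as np
import pandas as pd

# Columns copied from section 1; order_date is shown already parsed.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2025-01-03", "2025-01-03", "2025-01-03", "2025-01-04", None,
                                  "2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06", "2025-01-06"]),
    "customer_id": [501, 502, 503, 503, 504, 505, 506, 506, 507, None],
    "unit_price":  [20.0, 35.5, 20.0, 50.0, 20.0, None, 35.5, 50.0, 20.0, 20.0],
    "discount":    [0.0, 0.1, None, 0.0, 0.0, 0.05, 0.0, None, 0.0, 0.0],
})

# Numeric imputation as stated in the plan: median price, zero discount.
price_median = orders["unit_price"].median()
unit_price_filled = orders["unit_price"].fillna(price_median)
discount_filled = orders["discount"].fillna(0.0)

def fill_missing_ids(ids: pd.Series) -> pd.Series:
    """Give each missing ID its own placeholder, starting well above the max real ID."""
    max_id = int(ids.max())
    # Assumption: 507 has 3 digits, so placeholders start at 10**4 = 10000.
    start = 10 ** (len(str(max_id)) + 1)
    filled = ids.copy()
    gaps = filled.isna()
    filled.loc[gaps] = np.arange(start, start + int(gaps.sum()))
    return filled.astype("int64")

def null_violations(frame: pd.DataFrame, required: list) -> pd.Series:
    """Return null counts for required columns that still contain nulls."""
    counts = frame[required].isna().sum()
    return counts[counts > 0]

orders["customer_id_clean"] = fill_missing_ids(orders["customer_id"])
violations = null_violations(orders, ["order_date", "customer_id_clean"])
print(violations)
```

The check still reports one null in `order_date`: coercing to `NaT` does not make the column complete, so the plan needs an explicit decision for that order.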