# Big Data & BI — Feature Engineering
## Notebook 5: Dashboard-Ready Dataset

This notebook puts everything together:
- rebuild the raw dataset and re-apply the cleaning steps
- apply the validation and business rules
- select the final columns
- export to CSV (or to a table) for dashboarding

In [None]:
import pandas as pd
import numpy as np
import math

# Start with raw messy data (same as Notebooks 1 & 2)
data = {
    "order_id":   [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    "order_date": ["2025-01-03", "2025/01/03", "03-01-2025", "2025-01-04", None,
                    "2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06", "2025-01-06"],
    "customer_id": [501, 502, 503, 503, 504, 505, 506, 506, 507, None],
    "country":    ["DE", "Germany", "germany", "FR", "France", "DE", "DE ", "?", None, "GER"],
    "product":    ["Widget A", "Widget B", "Widget A", "Widget C", "Widget A",
                    "Widget B", "Widget B", "Widget C", "Widget A", "Widget A"],
    "quantity":   [2, 1, 3, 1, -1, 2, 2, 1, 5, 2],
    "unit_price": [20.0, 35.5, 20.0, 50.0, 20.0, None, 35.5, 50.0, 20.0, 20.0],
    "discount":   [0.0, 0.1, None, 0.0, 0.0, 0.05, 0.0, None, 0.0, 0.0],
    "channel":    ["online", "Online", "offline", "partner", "online",
                    "offline", "online ", "ONLINE", None, "partner"]
}
df = pd.DataFrame(data)

print("=== APPLYING CLEANING PIPELINE FROM NOTEBOOKS 1-4 ===\n")

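# 1) Parse dates (Notebook 2)
# format="mixed" (requires pandas >= 2.0) infers the format per value; without it,
# pandas 2.x infers one format from the first value and coerces the other styles to NaT.
# Note: "03-01-2025" is ambiguous and is read month-first unless dayfirst=True is passed.
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed", errors="coerce")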
print(f"[1] Dates parsed: {df['order_date'].notna().sum()}/{len(df)} valid")

# 2) Standardize categoricals (Notebook 2)
# Normalize case/whitespace once, map known aliases, keep unmapped labels as-is,
# and fill missing values with "unknown" explicitly rather than relying on a None key.
country_norm = df["country"].str.strip().str.lower()
df["country_clean"] = country_norm.map({
    "de": "germany", "germany": "germany", "ger": "germany",
    "fr": "france", "france": "france", "?": "unknown"
}).fillna(country_norm).fillna("unknown")

channel_norm = df["channel"].str.strip().str.lower()
df["channel_clean"] = channel_norm.map({
    "online": "online", "offline": "offline", "partner": "partner", "": "unknown"
}).fillna(channel_norm).fillna("unknown")
print(f"[2] Countries standardized: {df['country_clean'].unique()}")
print(f"[3] Channels standardized: {df['channel_clean'].unique()}")

# 3) Impute numeric columns (Notebook 2)
median_price = df["unit_price"].median()
df["unit_price_filled"] = df["unit_price"].fillna(median_price)
df["discount_filled"] = df["discount"].fillna(0.0)
print(f"[4] Prices imputed with median: {median_price}")

# 4) Fix negative quantity (Notebook 3)
df["quantity_fixed"] = df["quantity"].abs()
print(f"[5] Negative quantity fixed: {(df['quantity_fixed'] > 0).all()}")

# 5) Fill customer_id with unique high-value IDs (Notebook 2)
missing_mask = df["customer_id"].isna()
num_missing = missing_mask.sum()
max_existing_id = df["customer_id"].max()
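if pd.notna(max_existing_id):
    # Start two orders of magnitude above the largest real ID so placeholders
    # cannot collide with it, e.g. max ID 507 -> 10 ** (2 + 2) = 10000.
    placeholder_start = 10 ** (math.floor(math.log10(max_existing_id)) + 2)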
else:
    placeholder_start = 999000001
df["customer_id_clean"] = df["customer_id"].copy()
df.loc[missing_mask, "customer_id_clean"] = range(placeholder_start, placeholder_start + num_missing)
df["customer_id_clean"] = df["customer_id_clean"].astype(int)
print(f"[6] Customer IDs filled: placeholder {placeholder_start} used")

# 6) Calculate revenue (Notebook 3 & 4)
df["revenue"] = df["quantity_fixed"] * df["unit_price_filled"] * (1 - df["discount_filled"])
print(f"[7] Revenue calculated: mean = €{df['revenue'].mean():.2f}")

print("\n=== CLEANED DATA PREVIEW ===")
df[["order_id", "order_date", "customer_id_clean", "country_clean", "product", 
    "quantity_fixed", "unit_price_filled", "discount_filled", "channel_clean", "revenue"]]

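Step 1 depends on how pandas infers formats for mixed date strings, and `03-01-2025` is ambiguous between day-first and month-first. A minimal sketch that tries an explicit list of formats in order, assuming (the source does not confirm this) that the third style is day-first; the sample series and the `formats` list are illustrative only:

```python
import pandas as pd

# Hypothetical sample mirroring the three date styles in the raw data.
raw = pd.Series(["2025-01-03", "2025/01/03", "03-01-2025", None])

# Assumption: the third style is day-first (DD-MM-YYYY). Adjust if it is not.
formats = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]

# Parse with the first format, then fill the remaining gaps format by format.
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    # Values that do not match this format become NaT and are left for the next one.
    attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

print(parsed)
```

Listing the formats explicitly makes the day-first decision visible in the code instead of leaving it to inference.
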
## 1. Choose final columns

### When to keep vs. drop columns in BI:

**Keep original + cleaned columns when:**
- Auditing/debugging is needed (compare before/after transformations)
- Regulatory compliance requires tracking changes
- Data lineage documentation is important
- Multiple teams use different versions of the same field

**Drop original messy columns when:**
- Delivering final dataset to end users (avoid confusion)
- Dashboard performance matters (fewer columns = faster queries)
- Storage/memory is limited
- Original values have no business value after cleaning

**For this dashboard dataset, we drop originals because:**
1. End users don't need to see "DE" vs "germany" - just the cleaned value
2. Dashboards will be faster with fewer columns
3. We already documented the transformations in Notebooks 1-3 for auditing
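
The keep-versus-drop choice can be sketched as two views of the same frame. This is an illustration on a hypothetical three-row frame, not the notebook's actual data; `audit_df` and `dashboard_df` are names introduced here:

```python
import pandas as pd

# Hypothetical mini-frame with one raw column and its cleaned counterpart.
df = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "country": ["DE", "Germany", "germany"],
    "country_clean": ["germany", "germany", "germany"],
})

# Audit view: raw and cleaned values side by side, for before/after comparison.
audit_df = df[["order_id", "country", "country_clean"]]

# Dashboard view: drop the raw column, then give the cleaned one the plain name.
dashboard_df = (
    df.drop(columns=["country"])
      .rename(columns={"country_clean": "country"})
)

print(audit_df)
print(dashboard_df)
```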

In [None]:
# Drop rows with missing dates (critical for time series dashboards)
final_df = df.dropna(subset=["order_date"]).copy()

# Select final columns for dashboard
# We keep only cleaned columns and rename them for simplicity
final_df = final_df[[
    "order_id", 
    "order_date", 
    "customer_id_clean",
    "country_clean", 
    "product", 
    "channel_clean",
    "quantity_fixed", 
    "unit_price_filled", 
    "discount_filled", 
    "revenue"
]].rename(columns={
    "customer_id_clean": "customer_id",
    "country_clean": "country", 
    "channel_clean": "channel",
    "quantity_fixed": "quantity",
    "unit_price_filled": "unit_price",
    "discount_filled": "discount"
})

print(f"Final dataset: {len(final_df)} rows × {len(final_df.columns)} columns")
print(f"Dropped {len(df) - len(final_df)} rows with missing dates")
final_df
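
Rows dropped for a missing date vanish from the dashboard without a trace. One possible way to keep an audit trail is to record their `order_id` values before filtering. The frame below is a hypothetical stand-in, and `dropped_ids` is a name introduced for this sketch:

```python
import pandas as pd

# Hypothetical frame: one order has no parseable date.
df = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "order_date": pd.to_datetime(["2025-01-03", None, "2025-01-04"]),
})

missing_date = df["order_date"].isna()
dropped_ids = df.loc[missing_date, "order_id"].tolist()  # keep for the audit trail
final_df = df.loc[~missing_date].copy()

print(f"Dropped order_ids: {dropped_ids}")
```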

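## 2. Final checks
Confirm that the important columns contain no nulls and that each column has the expected dtype (datetime for `order_date`, integer for `customer_id`, numeric for the measures).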

In [None]:
print("=== NULL CHECK ===")
print(final_df.isna().sum())
print("\n=== DATA TYPES ===")
print(final_df.dtypes)
print("\n=== SUMMARY STATISTICS ===")
print(final_df.describe())
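
Printed summaries rely on someone reading them. A sketch of the same checks written as assertions, so that a re-run fails loudly when something is off. The list of "important" columns is an assumption, and the frame is a hypothetical stand-in for `final_df`:

```python
import pandas as pd

# Hypothetical stand-in for final_df with the same column names.
final_df = pd.DataFrame({
    "order_id": [1001, 1002],
    "order_date": pd.to_datetime(["2025-01-03", "2025-01-04"]),
    "customer_id": [501, 502],
    "quantity": [2, 1],
    "revenue": [40.0, 31.95],
})

# Assumption: these are the columns the dashboard cannot do without.
required = ["order_id", "order_date", "customer_id", "revenue"]

null_counts = final_df[required].isna().sum()
assert (null_counts == 0).all(), f"Nulls found:\n{null_counts[null_counts > 0]}"
assert final_df["order_id"].is_unique, "Duplicate order_id values"
assert pd.api.types.is_datetime64_any_dtype(final_df["order_date"]), "order_date is not datetime"
assert pd.api.types.is_integer_dtype(final_df["customer_id"]), "customer_id is not integer"
assert (final_df["quantity"] > 0).all(), "Non-positive quantity"
print("All checks passed")
```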

## 3. Export to CSV for dashboard tools

This cleaned dataset can be imported into:
- Power BI / Tableau
- Excel pivot tables
- SQL database
- Cloud data warehouse

In [None]:
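# Export to CSV in the data folder
from pathlib import Path

output_path = Path("../data/dashboard_ready.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
final_df.to_csv(output_path, index=False)
print(f"Dashboard-ready dataset exported to: {output_path}")
print(f"  - {len(final_df)} rows")
print(f"  - {len(final_df.columns)} columns")
# Report the size of the written file, not the DataFrame's in-memory footprint
print(f"  - File size: {output_path.stat().st_size / 1024:.2f} KB")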
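
For the "export to a table" alternative, a minimal sketch using an in-memory SQLite database and `DataFrame.to_sql`. The table name `orders_dashboard` and the sample frame are hypothetical; a real setup would more likely point at a database file or a warehouse connection:

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for final_df.
final_df = pd.DataFrame({
    "order_id": [1001, 1002],
    "country": ["germany", "france"],
    "revenue": [40.0, 50.0],
})

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")

# "orders_dashboard" is a hypothetical table name; replace it if it already exists.
final_df.to_sql("orders_dashboard", conn, if_exists="replace", index=False)

# Read back an aggregate to confirm the rows landed.
check = pd.read_sql_query(
    "SELECT country, SUM(revenue) AS revenue "
    "FROM orders_dashboard GROUP BY country ORDER BY country",
    conn,
)
print(check)
conn.close()
```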

## Summary: What we accomplished

This notebook combined all cleaning steps from Notebooks 1-4:

1. **Notebook 1**: Defined the cleaning strategy
2. **Notebook 2**: Implemented standardization and imputation
   - Parsed mixed date formats
   - Standardized country/channel labels
   - Imputed missing prices with median (20.0)
   - Filled missing customer IDs with unique high-value placeholders (10000+)
3. **Notebook 3**: Applied validation and fixed data issues
   - Fixed negative quantity (took absolute value)
   - Validated technical requirements
   - Applied business rules
4. **Notebook 4**: Calculated KPIs
   - Revenue after discount
   - Aggregations by country/channel
5. **Notebook 5 (this one)**: Created final dashboard dataset
   - Applied entire pipeline in sequence
   - Removed rows with missing dates (1 row)
   - Exported clean CSV for BI tools

**Result**: 9 rows × 10 columns, ready for dashboarding!