# Big Data & BI — Feature Engineering
## Notebook 3: Data Integrity & Business Rules

In BI, we validate data against **business rules**: requirements that come from the business, not just from technical logic.

### Technical Logic vs. Business Rules:

**Technical/Logical rules** (basic sanity checks):
- quantity must be ≥ 1
- price must be > 0
- date must not be null

**Business rules** (come from domain experts):
- Orders over €10,000 require manager approval → flag for review
- Discounts > 20% are unusual → flag for fraud detection
- Orders on weekends might be from different channels → segment separately
- Customers who haven't ordered in 90 days → mark as "dormant"
- Products below cost price → flag as pricing error

### Why Business Rules Matter:
- They encode **domain knowledge** from stakeholders
- They help detect **anomalies that matter to the business**
- They enable **automated data quality monitoring**
- They support **decision-making** in dashboards

In this notebook, we'll implement both types: basic validation AND meaningful business rules.
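
As a rough illustration of how two of the example business rules listed above (weekend orders and dormant customers) might be expressed in pandas, here is a minimal sketch. The tiny `orders` frame, the `as_of` reference date, and the names `is_weekend` and `is_dormant` are assumptions made for illustration; they are not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical mini dataset, for illustration only
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "order_date": pd.to_datetime(
        ["2025-01-04", "2025-03-10", "2024-11-01", "2025-02-15"]
    ),
})

# Weekend rule: dayofweek is 0 for Monday through 6 for Sunday
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5

# Dormancy rule: no order in the 90 days before an assumed reference date
as_of = pd.Timestamp("2025-04-01")
last_order = orders.groupby("customer_id")["order_date"].max()
is_dormant = (as_of - last_order) > pd.Timedelta(days=90)

print(orders)
print(is_dormant)
```

The weekend rule is row-level, while dormancy is a per-customer property, so it is computed on a grouped view rather than on each order.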

In [None]:
import pandas as pd
import numpy as np
import math

# recreate the FINAL cleaned dataframe from Notebook 2 (section 5)
# This is df_clean, which has only the cleaned columns we want to use

data = {
    "order_id":   [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    "order_date": ["2025-01-03", "2025-01-03", "2025-01-03", "2025-01-04", None,
                    "2025-01-05", "2025-01-05", "2025-01-06", "2025-01-06", "2025-01-06"],  # Strings here; parsed to datetime below
    "customer_id_clean": [501, 502, 503, 503, 504, 505, 506, 506, 507, 10000],  # Placeholder for missing
    "country_clean": ["germany", "germany", "germany", "france", "france", "germany", "germany", "unknown", "unknown", "germany"],
    "product":    ["Widget A", "Widget B", "Widget A", "Widget C", "Widget A",
                    "Widget B", "Widget B", "Widget C", "Widget A", "Widget A"],
    "quantity":   [2, 1, 3, 1, -1, 2, 2, 1, 5, 2],  # Still has the negative value
    "unit_price_filled": [20.0, 35.5, 20.0, 50.0, 20.0, 20.0, 35.5, 50.0, 20.0, 20.0],  # Filled with median
    "discount_filled":   [0.0, 0.1, 0.0, 0.0, 0.0, 0.05, 0.0, 0.0, 0.0, 0.0],  # Filled with 0.0
    "channel_clean": ["online", "online", "offline", "partner", "online",
                      "offline", "online", "online", "unknown", "partner"]
}
df = pd.DataFrame(data)

# Parse order_date to datetime (it is still a string in the data dict);
# errors="coerce" turns unparseable or missing values into NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

print("Data loaded from Notebook 2 (cleaned version):")
print(f"  - Dates: {df['order_date'].notna().sum()}/{len(df)} valid")
print("  - Customer IDs: using placeholder 10000 for missing")
print(f"  - Countries: {df['country_clean'].unique()}")
print(f"  - Channels: {df['channel_clean'].unique()}")
print("  - Prices: imputed with median")
print("  - Note: quantity still has a negative value (order_id 1005)")

df

## 1. Define validation rules (technical + business)

### Technical Validation (data must be usable)
Basic sanity checks that prevent crashes/errors in dashboards.

### Business Rules (domain-specific)
Rules that come from business stakeholders and domain knowledge.

In [None]:
# === TECHNICAL VALIDATION (basic sanity checks) ===
# These prevent errors in dashboards/calculations

# rule 1: date must exist (can't plot time series without dates)
df["valid_date"] = df["order_date"].notna()

# rule 2: quantity must be >= 1 (zero or negative quantities don't make sense)
df["valid_quantity"] = df["quantity"] >= 1

# rule 3: price must be > 0 (zero/negative prices are data errors)
df["valid_price"] = df["unit_price_filled"] > 0

print("=== TECHNICAL VALIDATION ===")
print(df[["order_id", "valid_date", "valid_quantity", "valid_price"]].head(10))

# === BUSINESS RULES (from stakeholder requirements) ===
# These flag unusual patterns that need review

# Business Rule 1: High-value orders (> €200) need approval
# Stakeholder: Finance team wants to review large orders
df["flag_high_value"] = (df["quantity"].abs() * df["unit_price_filled"]) > 200

# Business Rule 2: Excessive discounts (> 15%) might indicate fraud
# Stakeholder: Sales manager wants to investigate deep discounts
df["flag_high_discount"] = df["discount_filled"] > 0.15

# Business Rule 3: Bulk orders (quantity > 3) get special handling
# Stakeholder: Logistics team needs to prepare for large shipments
df["flag_bulk_order"] = df["quantity"].abs() > 3

print("\n=== BUSINESS RULES ===")
print(df[["order_id", "flag_high_value", "flag_high_discount", "flag_bulk_order"]].head(10))
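
As the number of business rules grows, one option is to keep them in a single registry so each rule has a name and can be applied in a loop. The sketch below is only one possible arrangement: `BUSINESS_RULES`, `apply_rules`, and the small `sample` frame are hypothetical names introduced for illustration, while the thresholds mirror the ones used in the cell above.

```python
import pandas as pd

# Small stand-in frame; column names follow the notebook
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "quantity": [2, 5, 1],
    "unit_price_filled": [20.0, 50.0, 35.5],
    "discount_filled": [0.0, 0.2, 0.1],
})

# Hypothetical registry: flag name -> function returning a boolean Series
BUSINESS_RULES = {
    "flag_high_value": lambda d: d["quantity"].abs() * d["unit_price_filled"] > 200,
    "flag_high_discount": lambda d: d["discount_filled"] > 0.15,
    "flag_bulk_order": lambda d: d["quantity"].abs() > 3,
}

def apply_rules(frame, rules):
    """Return a copy of `frame` with one boolean column per rule."""
    out = frame.copy()
    for name, rule in rules.items():
        out[name] = rule(out)
    return out

flagged = apply_rules(sample, BUSINESS_RULES)
print(flagged[["order_id", *BUSINESS_RULES]])
```

Keeping rules in one place makes it easier to review thresholds with stakeholders and to add or retire a rule without touching the rest of the pipeline.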

## 2. Combine validation rules

- **Technical validation** → rows must pass ALL checks to be usable in dashboards
- **Business flags** → rows can have multiple flags; they're for filtering/segmentation, not exclusion

In [None]:
# Combined technical validation: ALL must be True
df["row_is_valid"] = df["valid_date"] & df["valid_quantity"] & df["valid_price"]

# Combined business flags: count how many flags are triggered
df["total_flags"] = (
    df["flag_high_value"].astype(int) + 
    df["flag_high_discount"].astype(int) + 
    df["flag_bulk_order"].astype(int)
)

print("=== VALIDATION SUMMARY ===")
print(f"Valid rows: {df['row_is_valid'].sum()} / {len(df)}")
print(f"Invalid rows: {(~df['row_is_valid']).sum()}")
print(f"\nRows with flags: {(df['total_flags'] > 0).sum()}")
print(f"High-value orders: {df['flag_high_value'].sum()}")
print(f"High discounts: {df['flag_high_discount'].sum()}")
print(f"Bulk orders: {df['flag_bulk_order'].sum()}")

df[["order_id", "row_is_valid", "total_flags", "flag_high_value", "flag_high_discount", "flag_bulk_order"]]
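
A single `row_is_valid` column says that a row failed, but not why. A possible extension, sketched here under the assumption that the three `valid_*` columns exist as defined earlier, is to list the failed checks per row. The `checks` frame and the `failed_checks` column are hypothetical; the values mirror orders 1004 to 1006 from the dataset.

```python
import pandas as pd

# Stand-in for the validation columns of three orders
checks = pd.DataFrame({
    "order_id": [1004, 1005, 1006],
    "valid_date": [True, False, True],
    "valid_quantity": [True, False, True],
    "valid_price": [True, True, True],
})

check_cols = ["valid_date", "valid_quantity", "valid_price"]

def failed_checks(row):
    """Join the names of all checks that are False for this row."""
    return ", ".join(col for col in check_cols if not row[col])

checks["failed_checks"] = checks[check_cols].apply(failed_checks, axis=1)
print(checks[["order_id", "failed_checks"]])
```

An empty string means the row passed every technical check.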

## 3. Fix data issues flagged by validation

Before we calculate revenue, let's fix the data quality issues we found.

In [None]:
# Fix issue: negative quantity in row 5 (order_id 1005)
print("Before fix:")
print(df[df["quantity"] < 0][["order_id", "quantity", "valid_quantity"]])

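# Fix: take the absolute value
# (this assumes the minus sign is a data-entry error, not a return)
df["quantity_fixed"] = df["quantity"].abs()

# Re-validate quantity and refresh the combined validity flag
df["valid_quantity"] = df["quantity_fixed"] >= 1
df["row_is_valid"] = df["valid_date"] & df["valid_quantity"] & df["valid_price"]

print("\nAfter fix:")
print(df[df["order_id"] == 1005][["order_id", "quantity", "quantity_fixed", "valid_quantity"]])
# Order 1005 still has a missing date, so its row_is_valid stays False
print(f"\nAll quantities now valid? {df['valid_quantity'].all()}")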
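
Taking the absolute value silently changes the data, so it can help to record which rows were repaired and hand them to someone for review. The following is a minimal sketch of that idea; the `quantity_was_negative` column and the `needs_review` frame are hypothetical names, and whether a negative quantity is a typo or a return is a question for the business.

```python
import pandas as pd

# Stand-in for three orders from the dataset
orders = pd.DataFrame({
    "order_id": [1004, 1005, 1006],
    "quantity": [1, -1, 2],
})

# Keep the raw value and record that a repair was applied
orders["quantity_was_negative"] = orders["quantity"] < 0
orders["quantity_fixed"] = orders["quantity"].abs()

# Rows to pass on for manual review (hypothetical review step)
needs_review = orders.loc[orders["quantity_was_negative"], ["order_id", "quantity"]]
print(needs_review)
```

Because the original `quantity` column is kept, the repair can be audited or reversed later.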

## 4. Calculate revenue with fixed data

Now that the quantity issue is fixed, we can calculate revenue. Note that order 1005 still has a missing date, so it remains unusable for time-based analysis.

In [None]:
# Calculate revenue using the fixed quantity
df["revenue"] = df["quantity_fixed"] * df["unit_price_filled"] * (1 - df["discount_filled"])

# Display revenue statistics
print("Revenue Statistics:")
print(df["revenue"].describe())
print("\nSample of revenue calculation:")
print(df[["order_id", "quantity_fixed", "unit_price_filled", "discount_filled", "revenue"]].head(10))

Let's look at the top 5 orders by revenue:

In [None]:
df[["order_id", "revenue"]].sort_values("revenue", ascending=False).head(5)
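
Since rows that fail technical validation are not meant to feed dashboards, revenue totals can be computed with and without them to see how much they matter. This is a small sketch under that assumption; the `orders` frame is a stand-in whose revenue values match orders 1004 to 1006 as calculated above, and `total_all` and `total_valid` are illustrative names.

```python
import pandas as pd

# Stand-in frame: revenue = quantity_fixed * unit_price_filled * (1 - discount_filled)
orders = pd.DataFrame({
    "order_id": [1004, 1005, 1006],
    "row_is_valid": [True, False, True],
    "revenue": [50.0, 20.0, 38.0],
})

# Boolean indexing keeps only rows that passed every technical check
valid_orders = orders[orders["row_is_valid"]]

total_all = orders["revenue"].sum()
total_valid = valid_orders["revenue"].sum()

print(f"Revenue, all rows:   {total_all:.2f}")
print(f"Revenue, valid rows: {total_valid:.2f}")
```

The gap between the two totals shows how much revenue sits in rows that still need fixing.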