# Day 3, Block A: Data Pipelines & Real-World Validation

**Duration:** 100 minutes (13:30–15:10)
**Course:** ECBS5294 - Introduction to Data Science: Working with Data

## Learning Objectives

By the end of this session, you will be able to:

1. Explain the **bronze → silver → gold** pipeline pattern and why it matters
2. Design idempotent data transformations
3. Write **assertions** to validate data quality programmatically
4. Identify and handle common real-world data problems (dates, types, nulls)
5. Apply the **pipeline pattern** to a real dataset

---

## 1. Why Data Pipelines?

### The Problem: One-Off Analysis Doesn't Scale

**You've hit the wall when:**
- Data updates regularly
- Multiple people need consistent results
- Stakeholders ask "how did you get this number?"
- Requirements change

**The solution:** A systematic, repeatable pipeline.

---

## 2. The Bronze-Silver-Gold Pattern

> **"Preserve the original, clean incrementally, aggregate deliberately."**

#### **Bronze Layer: Raw Ingestion**
- Preserve original data exactly as received
- No transformations
- Keep everything, even if it looks wrong

#### **Silver Layer: Clean & Validated**
- Analysis-ready data
- Fix types, handle nulls, validate
- Document what was fixed

#### **Gold Layer: Business Metrics**
- Aggregated, joined, ready for reporting
- Pre-computed KPIs

---

In [None]:
# Setup
import pandas as pd
import duckdb
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

con = duckdb.connect(':memory:')
print("✅ Setup complete")

## 3. Scenario 1: E-Commerce Pipeline

**Business Context:** You're an analyst at a Brazilian e-commerce company. Orders and customer data arrive daily. You need reliable, repeatable metrics.

**Data:** 1,000 orders from Olist (familiar from Day 2)

**Pipeline Goal:** Bronze → Silver → Gold with validations

---

### Bronze Layer: Raw Ingestion

**Goal:** Load data exactly as received.

In [None]:
# BRONZE: Load raw data
print("=== BRONZE LAYER ===\n")

con.execute("""
    CREATE OR REPLACE TABLE bronze_orders AS
    SELECT * FROM '../../data/day3/teaching/olist_orders_subset.csv'
""")

con.execute("""
    CREATE OR REPLACE TABLE bronze_customers AS
    SELECT * FROM '../../data/day3/teaching/olist_customers_subset.csv'
""")

con.execute("""
    CREATE OR REPLACE TABLE bronze_order_items AS
    SELECT * FROM '../../data/day3/teaching/olist_order_items_subset.csv'
""")

print(f"Loaded {con.execute('SELECT COUNT(*) FROM bronze_orders').fetchone()[0]} orders")
print("✅ Bronze layer complete")

### Silver Layer: Clean & Validate

**Goal:** Transform into analysis-ready format.

In [None]:
# SILVER: Clean and validate
print("=== SILVER LAYER ===\n")

con.execute("""
    CREATE OR REPLACE TABLE silver_orders AS
    SELECT
        order_id,
        customer_id,
        order_status,
        TRY_CAST(order_purchase_timestamp AS TIMESTAMP) as order_date
    FROM bronze_orders
    WHERE order_id IS NOT NULL
""")

con.execute("""
    CREATE OR REPLACE TABLE silver_order_items AS
    SELECT
        order_id,
        product_id,
        CAST(price AS DOUBLE) as price,
        CAST(freight_value AS DOUBLE) as freight
    FROM bronze_order_items
    WHERE order_id IS NOT NULL
""")

print(f"Created {con.execute('SELECT COUNT(*) FROM silver_orders').fetchone()[0]} clean orders")
print("✅ Silver layer complete")

### Validation: Prove Data Quality

In [None]:
# VALIDATION
print("=== VALIDATION ===\n")

# Check 1: Primary key uniqueness
order_count = con.execute("SELECT COUNT(*) FROM silver_orders").fetchone()[0]
order_unique = con.execute("SELECT COUNT(DISTINCT order_id) FROM silver_orders").fetchone()[0]

print(f"✓ Order IDs unique? {order_count == order_unique}")
assert order_count == order_unique, "Duplicate order IDs!"

# Check 2: No null critical fields
null_ids = con.execute("SELECT COUNT(*) FROM silver_orders WHERE order_id IS NULL").fetchone()[0]
print(f"✓ No NULL order IDs? {null_ids == 0}")
assert null_ids == 0, "NULL order IDs found!"

# Check 3: Foreign key integrity
orphans = con.execute("""
    SELECT COUNT(*)
    FROM silver_order_items i
    LEFT JOIN silver_orders o ON i.order_id = o.order_id
    WHERE o.order_id IS NULL
""").fetchone()[0]

print(f"✓ All items have valid orders? {orphans == 0}")
assert orphans == 0, "Orphaned items found!"

print("\n✅ ALL VALIDATIONS PASSED")

### Gold Layer: Business Metrics

In [None]:
# GOLD: Business metrics
print("=== GOLD LAYER ===\n")

con.execute("""
    CREATE OR REPLACE TABLE gold_daily_sales AS
    SELECT
        CAST(o.order_date AS DATE) as date,
        COUNT(DISTINCT o.order_id) as num_orders,
        SUM(i.price + i.freight) as total_revenue
    FROM silver_orders o
    INNER JOIN silver_order_items i ON o.order_id = i.order_id
    WHERE o.order_date IS NOT NULL
    GROUP BY CAST(o.order_date AS DATE)
    ORDER BY date
""")

result = con.execute("SELECT * FROM gold_daily_sales LIMIT 5").df()
print("Daily sales summary:")
display(result)
print("\n✅ Gold layer complete")

---

## 4. Scenario 2: Messy Retail Data Pipeline

**Business Context:** You've inherited a cafe's sales data export. It's full of data quality issues, which is the reality of working with real-world data.

**Data:** 10,000 cafe transactions (from Day 1's messy dataset)

**Pipeline Goal:** Handle multiple NULL representations, validate calculations, detect outliers

**Key Difference from Scenario 1:** This focuses on **data quality** issues, not relational integrity.

---

### Bronze Layer: Load Messy Data

First, let's see what we're dealing with:

In [None]:
# BRONZE: Load raw cafe data
print("=== BRONZE LAYER (Cafe Sales) ===\n")

con.execute("""
    CREATE OR REPLACE TABLE bronze_cafe_sales AS
    SELECT * FROM '../../data/day1/dirty_cafe_sales.csv'
""")

# Check what we loaded
print("Sample of raw data:")
display(con.execute("SELECT * FROM bronze_cafe_sales LIMIT 5").df())

# Check for data quality issues
print("\nData quality snapshot:")
print(f"Total records: {con.execute('SELECT COUNT(*) FROM bronze_cafe_sales').fetchone()[0]}")

# Show NULL counts
null_summary = con.execute("""
    SELECT
        COUNT(*) - COUNT("Transaction ID") as null_txn_id,
        COUNT(*) - COUNT("Item") as null_item,
        COUNT(*) - COUNT("Total Spent") as null_total,
        COUNT(*) - COUNT("Payment Method") as null_payment,
        COUNT(*) - COUNT("Transaction Date") as null_date
    FROM bronze_cafe_sales
""").df()

print("\nNULL counts:")
display(null_summary)

# Show sentinel values
print("\nSentinel value counts:")
display(con.execute("""
    SELECT
        SUM(CASE WHEN "Item" = 'ERROR' OR "Item" = 'UNKNOWN' THEN 1 ELSE 0 END) as error_unknown_items,
        SUM(CASE WHEN "Total Spent" = 'ERROR' THEN 1 ELSE 0 END) as error_totals,
        SUM(CASE WHEN "Transaction Date" = 'ERROR' THEN 1 ELSE 0 END) as error_dates
    FROM bronze_cafe_sales
""").df())

print("\n✅ Bronze layer complete (messy data preserved)")

### Silver Layer: Clean & Validate

**Cleaning strategy:**
1. Replace sentinel values ("ERROR", "UNKNOWN") with NULL
2. Fix types (convert strings to numbers/dates)
3. Validate business rules (quantity > 0, price ≥ 0)
4. Check calculation: `Total Spent = Quantity × Price Per Unit`

In [None]:
# SILVER: Clean the data
print("=== SILVER LAYER (Cafe Sales) ===\n")

con.execute("""
    CREATE OR REPLACE TABLE silver_cafe_sales AS
    SELECT
        "Transaction ID" as transaction_id,
        -- Replace sentinel values with NULL
        CASE 
            WHEN "Item" IN ('ERROR', 'UNKNOWN', '') THEN NULL 
            ELSE "Item" 
        END as item,
        -- Convert to integers, handle NULLs
        TRY_CAST("Quantity" AS INTEGER) as quantity,
        TRY_CAST("Price Per Unit" AS DOUBLE) as price_per_unit,
        -- Handle "ERROR" in Total Spent
        TRY_CAST(
            CASE WHEN "Total Spent" = 'ERROR' THEN NULL ELSE "Total Spent" END
            AS DOUBLE
        ) as total_spent,
        -- Clean payment method
        CASE 
            WHEN "Payment Method" IN ('ERROR', 'UNKNOWN', '') THEN NULL 
            ELSE "Payment Method" 
        END as payment_method,
        -- Clean location
        CASE 
            WHEN "Location" IN ('ERROR', 'UNKNOWN', '') THEN NULL 
            ELSE "Location" 
        END as location,
        -- Parse dates
        TRY_CAST(
            CASE WHEN "Transaction Date" = 'ERROR' THEN NULL ELSE "Transaction Date" END
            AS DATE
        ) as transaction_date
    FROM bronze_cafe_sales
    WHERE "Transaction ID" IS NOT NULL  -- Keep only records with valid IDs
""")

# Check results
cleaned_count = con.execute("SELECT COUNT(*) FROM silver_cafe_sales").fetchone()[0]
print(f"Cleaned records: {cleaned_count}")

print("\nSample of cleaned data:")
display(con.execute("SELECT * FROM silver_cafe_sales LIMIT 5").df())

print("\n✅ Silver layer complete")

### Validation: Business Rules & Data Quality

**Different validations than Scenario 1:**
- Value range checks (no negative prices/quantities)
- Categorical domain validation (only known items/payment methods)
- Calculation validation (does total = quantity × price?)
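
The calculation check compares with a small tolerance rather than exact equality, because binary floating point cannot represent many decimal amounts exactly. A minimal sketch of that idea in plain Python, assuming a one-cent tolerance; `totals_match` is a hypothetical helper, not part of the course code:

```python
import math

def totals_match(quantity, price_per_unit, total_spent, tolerance=0.01):
    """True when total_spent equals quantity * price_per_unit within `tolerance`."""
    expected = quantity * price_per_unit
    # rel_tol=0.0 makes this a purely absolute comparison (in currency units).
    return math.isclose(expected, total_spent, rel_tol=0.0, abs_tol=tolerance)

print(0.1 * 3 == 0.3)             # False: exact comparison trips on rounding error
print(totals_match(3, 0.1, 0.3))  # True: the difference is far below one cent
print(totals_match(2, 3.0, 7.0))  # False: a real mismatch of 1.00
```

The tolerance should stay well below the smallest difference that matters to the business; one cent is an assumption that fits prices quoted to two decimals.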

In [None]:
# VALIDATION
print("=== VALIDATION (Cafe Sales) ===\n")

# Check 1: Primary key uniqueness
txn_count = con.execute("SELECT COUNT(*) FROM silver_cafe_sales").fetchone()[0]
txn_unique = con.execute("SELECT COUNT(DISTINCT transaction_id) FROM silver_cafe_sales").fetchone()[0]
print(f"✓ Transaction IDs unique? {txn_count == txn_unique}")
assert txn_count == txn_unique, "Duplicate transaction IDs!"

# Check 2: Value ranges (for non-NULL records)
negative_qty = con.execute("""
    SELECT COUNT(*) 
    FROM silver_cafe_sales 
    WHERE quantity IS NOT NULL AND quantity < 0
""").fetchone()[0]
print(f"✓ No negative quantities? {negative_qty == 0}")
assert negative_qty == 0, f"Found {negative_qty} negative quantities!"

negative_price = con.execute("""
    SELECT COUNT(*) 
    FROM silver_cafe_sales 
    WHERE price_per_unit IS NOT NULL AND price_per_unit < 0
""").fetchone()[0]
print(f"✓ No negative prices? {negative_price == 0}")
assert negative_price == 0, f"Found {negative_price} negative prices!"

# Check 3: Categorical domain validation
valid_items = ['Coffee', 'Sandwich', 'Cake', 'Cookie', 'Salad', 'Smoothie', 'Juice']
invalid_items = con.execute(f"""
    SELECT COUNT(*)
    FROM silver_cafe_sales
    WHERE item IS NOT NULL AND item NOT IN {tuple(valid_items)}
""").fetchone()[0]
print(f"✓ All items in valid domain? {invalid_items == 0}")
# Note: We don't assert here - unknown items might be valid in real business!
if invalid_items > 0:
    print(f"  ⚠️  Warning: {invalid_items} records with unexpected items (might be new products)")

# Check 4: Calculation validation (within 1 cent tolerance for floating point)
calc_errors = con.execute("""
    SELECT COUNT(*)
    FROM silver_cafe_sales
    WHERE 
        quantity IS NOT NULL 
        AND price_per_unit IS NOT NULL 
        AND total_spent IS NOT NULL
        AND ABS(total_spent - (quantity * price_per_unit)) > 0.01
""").fetchone()[0]
print(f"✓ Total = Quantity × Price? {calc_errors == 0} ({calc_errors} calculation errors)")
if calc_errors > 0:
    print(f"  ⚠️  Warning: {calc_errors} records with calculation mismatches")
    # Show example
    print("\n  Example mismatch:")
    display(con.execute("""
        SELECT transaction_id, quantity, price_per_unit, total_spent,
               (quantity * price_per_unit) as calculated_total,
               ABS(total_spent - (quantity * price_per_unit)) as difference
        FROM silver_cafe_sales
        WHERE 
            quantity IS NOT NULL 
            AND price_per_unit IS NOT NULL 
            AND total_spent IS NOT NULL
            AND ABS(total_spent - (quantity * price_per_unit)) > 0.01
        LIMIT 3
    """).df())

# Check 5: Date range validation
date_range = con.execute("""
    SELECT MIN(transaction_date) as earliest, MAX(transaction_date) as latest
    FROM silver_cafe_sales
    WHERE transaction_date IS NOT NULL
""").df()
print(f"\n✓ Date range: {date_range['earliest'].iloc[0]} to {date_range['latest'].iloc[0]}")

print("\n✅ CRITICAL VALIDATIONS PASSED (warnings noted)")

### Gold Layer: Business Metrics

Create analysis-ready tables for stakeholders:

In [None]:
# GOLD: Business metrics
print("=== GOLD LAYER (Cafe Sales) ===\n")

# Metric 1: Product performance
con.execute("""
    CREATE OR REPLACE TABLE gold_product_performance AS
    SELECT
        item,
        COUNT(*) as num_transactions,
        SUM(quantity) as total_quantity_sold,
        ROUND(AVG(price_per_unit), 2) as avg_price,
        ROUND(SUM(total_spent), 2) as total_revenue
    FROM silver_cafe_sales
    WHERE item IS NOT NULL
    GROUP BY item
    ORDER BY total_revenue DESC
""")

print("Top products by revenue:")
display(con.execute("SELECT * FROM gold_product_performance").df())

# Metric 2: Payment method distribution
con.execute("""
    CREATE OR REPLACE TABLE gold_payment_mix AS
    SELECT
        payment_method,
        COUNT(*) as num_transactions,
        ROUND(SUM(total_spent), 2) as total_revenue,
        ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) as pct_transactions
    FROM silver_cafe_sales
    WHERE payment_method IS NOT NULL
    GROUP BY payment_method
    ORDER BY num_transactions DESC
""")

print("\n\nPayment method distribution:")
display(con.execute("SELECT * FROM gold_payment_mix").df())

print("\n✅ Gold layer complete - ready for stakeholder reporting!")

---

## 6. Key Pipeline Principles

### 1. Idempotency
> **"Running the pipeline twice gives the same result."**

**Good:** Recreate tables from scratch
```sql
DROP TABLE IF EXISTS gold_daily_sales;
CREATE TABLE gold_daily_sales AS 
SELECT ...
```

**Why:** Makes pipelines debuggable and reproducible. No "what state am I in?" confusion.
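
A minimal sketch of the drop-and-recreate pattern, using Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without the course data. The table names follow the notebook; `build_gold`, `con_demo`, and the sample rows are hypothetical:

```python
import sqlite3

def build_gold(con):
    """Rebuild gold_daily_sales from scratch; safe to call repeatedly."""
    con.execute("DROP TABLE IF EXISTS gold_daily_sales")
    con.execute("""
        CREATE TABLE gold_daily_sales AS
        SELECT order_date, COUNT(*) AS num_orders, SUM(amount) AS total_revenue
        FROM silver_orders
        GROUP BY order_date
    """)
    return con.execute(
        "SELECT order_date, num_orders, total_revenue "
        "FROM gold_daily_sales ORDER BY order_date"
    ).fetchall()

# Tiny in-memory silver table (made-up rows, for illustration only).
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE silver_orders (order_id TEXT, order_date TEXT, amount REAL)")
con_demo.executemany(
    "INSERT INTO silver_orders VALUES (?, ?, ?)",
    [("o1", "2018-01-01", 10.0), ("o2", "2018-01-01", 5.0), ("o3", "2018-01-02", 7.5)],
)

first_run = build_gold(con_demo)
second_run = build_gold(con_demo)  # Would fail or double-count without the DROP
print(first_run == second_run)
```

An append-style step (`INSERT INTO gold_daily_sales SELECT ...`) would add the same rows again on every run, which is exactly the state confusion this pattern avoids.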

---

### 2. Fail Fast
> **"If data is bad, stop immediately with a clear error."**

**Good:**  
```python
assert df['price'].min() >= 0, "Negative prices found!"
```

**Why:** Bad data = bad decisions. Find problems early, loudly.
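
As a rough sketch, the same idea can be wrapped in a function that runs before any metric is computed. The names `validate_prices` and `total_revenue` are hypothetical, and plain lists stand in for a DataFrame column:

```python
def validate_prices(prices):
    """Raise immediately, with a clear message, if any price is negative."""
    bad = [p for p in prices if p is not None and p < 0]
    if bad:
        raise ValueError(f"Negative prices found: {len(bad)} rows, e.g. {bad[0]}")
    return prices

def total_revenue(prices):
    # Validation runs first, so bad input never reaches the metric.
    validate_prices(prices)
    return sum(p for p in prices if p is not None)

print(total_revenue([4.0, 2.5, None]))

try:
    total_revenue([4.0, -1.0])
except ValueError as err:
    print(f"Pipeline stopped: {err}")
```

Raising an exception instead of using `assert` is a design choice here: assertions are skipped when Python runs with the `-O` flag, while an explicit `raise` always fires.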

---

### 3. Document Assumptions

Every validation is documentation:
```python
assert df['order_id'].is_unique, "Duplicate order IDs"
assert df['date'].max() <= pd.Timestamp.now(), "Future dates found"
```

**Why:** Future you (and your teammates) need to know what you expected.
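
One way to keep such checks readable is to give each one a name and a message that says what was expected. The sketch below is an assumption-laden illustration: `assert_unique` and `assert_no_orphans` are hypothetical helpers, and `sqlite3` stands in for DuckDB so the example runs on its own. Both connection types accept `con.execute(sql).fetchone()`.

```python
import sqlite3

def assert_unique(con, table, column):
    """Fail if `column` contains duplicate values in `table`."""
    # Identifiers are interpolated directly: only pass trusted table/column names.
    total, distinct = con.execute(
        f"SELECT COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    assert total == distinct, f"{table}.{column}: {total - distinct} duplicate rows"

def assert_no_orphans(con, child, parent, key):
    """Fail if `child` has rows whose `key` is missing from `parent`."""
    orphans = con.execute(
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{key} = p.{key} "
        f"WHERE p.{key} IS NULL"
    ).fetchone()[0]
    assert orphans == 0, f"{child}.{key}: {orphans} rows without a match in {parent}"

# Made-up rows, for illustration only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE silver_orders (order_id TEXT)")
demo.execute("CREATE TABLE silver_order_items (order_id TEXT, price REAL)")
demo.executemany("INSERT INTO silver_orders VALUES (?)", [("o1",), ("o2",)])
demo.executemany("INSERT INTO silver_order_items VALUES (?, ?)", [("o1", 9.9), ("o2", 5.0)])

assert_unique(demo, "silver_orders", "order_id")
assert_no_orphans(demo, "silver_order_items", "silver_orders", "order_id")
print("checks passed")
```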

---

## 7. Work Habits & Best Practices

How to work effectively with data and code.

---

### When to Use SQL vs Python

**Decision table:**

| Task | Use SQL | Use Python |
|------|---------|------------|
| Filter/subset rows | ✅ `WHERE` clause | Only for complex logic |
| Join tables | ✅ Fast, declarative | Complex merge logic |
| Aggregate metrics | ✅ `GROUP BY` | Multi-step calculations |
| String parsing | Simple (`SPLIT`, `SUBSTRING`) | ✅ Regex, complex rules |
| API calls | ❌ | ✅ `requests` library |
| Date arithmetic | ✅ Built-in functions | Complex timezone logic |
| Window functions | ✅ `ROW_NUMBER()`, `LAG()` | When SQL gets unreadable |

**Examples:**

**1. Revenue by customer segment**
```sql
-- ✅ SQL: Perfect for this
SELECT customer_segment, SUM(revenue) as total_revenue
FROM orders
GROUP BY customer_segment
```

**2. Parse nested JSON from API**
```python
# ✅ Python: deeply nested JSON is easier to walk in code than in SQL
import requests
response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # Fail fast on HTTP errors
data = response.json()
# Recursive parsing, normalization

**3. Clean messy address strings**
```python
# ✅ Python: Regex and complex string logic
# regex=True is required: pandas 2.0+ treats the pattern as a literal string by default
df['clean_address'] = df['address'].str.replace(r'\s+', ' ', regex=True).str.title()
```

**Rule of thumb:** Use SQL until it hurts, then switch to Python.

---

### Debugging Strategies

**1. Print-driven debugging**
```python
# Check intermediate results
print(f"Rows before cleaning: {len(df)}")
df = clean_data(df)
print(f"Rows after cleaning: {len(df)}")  # Did we lose too many?
```

**2. Isolate transformations**
```python
# ❌ Hard to debug
df = df.pipe(clean).pipe(validate).pipe(transform)

# ✅ Easier
df = clean(df)
print(f"After clean: {len(df)} rows")
df = validate(df)  # Might assert and fail here, which is good!
df = transform(df)
```

**3. Check types early**
```python
# After loading data
print(df.dtypes)  # Are numbers stored as strings?

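# In SQL (run through the DuckDB connection)
types_sql = "SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'my_table'"
print(con.execute(types_sql).df())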
```

---

### Documentation Patterns

**Data dictionary example:**
```markdown
## silver_orders

| Column | Type | Description | Nulls OK? |
|--------|------|-------------|-----------|
| order_id | VARCHAR | Unique order identifier | No |
| order_date | TIMESTAMP | When the order was placed | Yes (1% missing) |
| customer_id | VARCHAR | FK to customers table | No |
| order_status | VARCHAR | delivered, shipped, canceled | No |
```

**Assumption log:**
```python
# At top of notebook
"""
ASSUMPTIONS:
- Order dates before 2015 are data errors (filtered out)
- NULL payment_method means Cash (historical system default)
- Customer addresses not validated (trust source system)
"""
```
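
Logged assumptions are easier to keep honest when the code that applies them sits next to named constants. A minimal sketch, assuming orders are plain dictionaries; `apply_assumptions`, the constants, and the sample rows are hypothetical:

```python
from datetime import date

# Each constant mirrors one line of the assumption log.
MIN_ORDER_DATE = date(2015, 1, 1)   # Earlier order dates are treated as data errors
DEFAULT_PAYMENT_METHOD = "Cash"     # NULL payment_method means Cash

def apply_assumptions(orders):
    """Return a cleaned copy of `orders` with the logged assumptions applied."""
    cleaned = []
    for order in orders:
        if order["order_date"] < MIN_ORDER_DATE:
            continue  # Assumption 1: drop pre-2015 orders
        payment = order["payment_method"] or DEFAULT_PAYMENT_METHOD  # Assumption 2
        cleaned.append({**order, "payment_method": payment})
    return cleaned

# Made-up rows, for illustration only.
orders = [
    {"order_id": "o1", "order_date": date(2014, 12, 31), "payment_method": "Card"},
    {"order_id": "o2", "order_date": date(2018, 3, 5), "payment_method": None},
]
print(apply_assumptions(orders))
```

If an assumption changes, the constant and the log entry change together, and the diff shows exactly which business rule moved.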

---

### Git Habits

**Small, focused commits:**
```bash
# ✅ Good
git add notebooks/day3_analysis.ipynb
git commit -m "Add bronze layer for cafe sales data"

git add notebooks/day3_analysis.ipynb
git commit -m "Add silver layer with NULL handling"

# ❌ Bad
git add .
git commit -m "Stuff"  # What stuff? Why?
```

**Clear commit messages:**
- What changed
- Why it changed (if not obvious)

**`.gitignore` patterns:**
```
# Ignore Jupyter checkpoint directories
.ipynb_checkpoints/

# Ignore data files (if large), except small samples
data/*.csv
!data/sample.csv

# Ignore credentials
.env
credentials.json
```

---

### Restart & Run All Discipline

**Before every commit:**
1. Restart kernel
2. Run All cells
3. Verify no errors

**Why:**
- Catches hidden state bugs
- Ensures reproducibility
- Your teammate (or future you) can run it

**Common gotcha:**
```python
# Cell 1
df = load_data()

# Cell 2
df['price'] = df['price'] * 1.1  # Overwrites the column in place

# Cell 3
print(df['price'].sum())  # Grows every time you re-run cell 2!
```

**Fix:** Make each transformation explicit and assign its result to a new name instead of overwriting `df`.
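
One way to follow this, sketched with plain Python lists instead of a DataFrame: each step takes its input and returns a new object under a new name, so re-running any step gives the same answer. All function and variable names here are illustrative.

```python
def load_prices():
    # Stand-in for load_data(); returns a fresh list on every call.
    return [10.0, -2.0, 5.5]

def keep_positive(prices):
    # Builds a new list; the input is left untouched.
    return [p for p in prices if p > 0]

def add_tax(prices, rate=0.1):
    # Also returns a new list, rounded to cents.
    return [round(p * (1 + rate), 2) for p in prices]

raw_prices = load_prices()
positive_prices = keep_positive(raw_prices)
taxed_prices = add_tax(positive_prices)

# Re-running a step never changes its input, so the result is stable.
assert add_tax(positive_prices) == taxed_prices
print(raw_prices, positive_prices, taxed_prices)
```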

---

## 5. Data in the Wild: Common Issues & Solutions

Real-world data is messy. Here's what you'll encounter and how to handle it.

---

### CSV Traps

**Problem:** CSVs have no standard for encoding, separators, or quoting.

**Common issues:**
```python
# Encoding issues
# 🚨 Risky: relies on the default encoding (UTF-8)
df = pd.read_csv('data.csv')  # Raises UnicodeDecodeError if the file is not UTF-8

# ✅ Better: state the encoding the source system actually used
df = pd.read_csv('data.csv', encoding='latin-1')  # Or 'cp1252', 'utf-8'

# Locale separators (European CSVs often use ; as the separator and , for decimals)
df = pd.read_csv('data.csv', sep=';', decimal=',')

# Header drift: Sometimes data exports include summary rows
# (skipfooter is only supported by the Python parser engine)
df = pd.read_csv('data.csv', skiprows=2, skipfooter=1, engine='python')
```
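
When the source encoding is unknown, one option is to try a short list of candidate encodings in order. The sketch below uses only the standard library; `read_csv_bytes`, the sample bytes, and the candidate list are illustrative assumptions, not part of the course code.

```python
import csv
import io

def read_csv_bytes(raw, encodings=("utf-8", "cp1252")):
    """Decode CSV bytes with the first encoding that works and parse the rows."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate
        return encoding, list(csv.DictReader(io.StringIO(text), delimiter=";"))
    raise ValueError(f"None of {encodings} could decode the file")

# A small European-style export: ';' separator, ',' decimal, cp1252 bytes.
raw = "city;price\nSão Paulo;12,50\n".encode("cp1252")
encoding, rows = read_csv_bytes(raw)
price = float(rows[0]["price"].replace(",", "."))  # Decimal comma to decimal point
print(encoding, rows[0]["city"], price)
```

Order matters: `latin-1` can decode any byte sequence, so placing it early would hide the problem rather than detect it. A decode that succeeds can still be wrong, so it is worth eyeballing a few accented values afterwards.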

**Excel gotchas:**
- Automatically converts `2-5` to `Feb 5` (date)
- Truncates leading zeros (`00123` → `123`)
- Scientific notation for long numbers

---

### Date Handling

**Always be explicit:** dates are strings until you parse them.

```sql
-- ‚úÖ Good: Try and validate
SELECT 
    transaction_date,
    TRY_CAST(transaction_date AS DATE) as parsed_date,
    CASE 
        WHEN TRY_CAST(transaction_date AS DATE) IS NULL 
        THEN 'INVALID' 
        ELSE 'OK' 
    END as status
FROM bronze_table
WHERE TRY_CAST(transaction_date AS DATE) IS NULL  -- Find bad dates
```

**Common patterns:**
- Mix of formats: `2023-01-15`, `01/15/2023`, `15-Jan-23`
- Partial dates: `2023-01`, `2023-Q1`
- Timezone issues: Always store in UTC, display in local
- Future dates (typos): Validate `date <= CURRENT_DATE`
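
The patterns above can be handled in Python with a small parser that tries each known format in turn. This is a sketch under stated assumptions: `parse_date` and `DATE_FORMATS` are hypothetical names, slash dates are read month-first (US style), and `%b` expects English month abbreviations under the default locale.

```python
from datetime import date, datetime

# Formats from the list above; extend as new variants show up in the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y")

def parse_date(text, today=None):
    """Return a date, or None if the text is unparseable or in the future."""
    today = today or date.today()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt).date()
        except (ValueError, AttributeError):  # Bad format, or text is None
            continue
        return parsed if parsed <= today else None
    return None

for raw_value in ["2023-01-15", "01/15/2023", "15-Jan-23", "2023-01", "ERROR", "2999-01-01"]:
    print(raw_value, "->", parse_date(raw_value))
```

Partial dates such as `2023-01` come back as `None` here on purpose; whether to drop them or map them to the first of the month is a business decision worth recording as an assumption.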

---

### Multiple NULL Representations

**The reality:** NULL can be represented many ways.

**What you'll see:**
- True `NULL` (database null)
- Empty string: `""`
- Sentinel strings: `"N/A"`, `"UNKNOWN"`, `"ERROR"`, `"--"`
- Sentinel numbers: `-999`, `-1`, `0`
- Whitespace: `"   "` (looks empty but isn't)

**Your cleaning strategy:**
```sql
-- Normalize to NULL
CASE 
    WHEN field IN ('', 'N/A', 'UNKNOWN', 'ERROR', '--') THEN NULL
    WHEN TRIM(field) = '' THEN NULL  -- Handle whitespace
    ELSE field
END as cleaned_field
```

**Remember:**
- `NULL` means "unknown" or "not applicable"
- Most aggregations **skip NULLs**: `AVG(price)` ignores NULL prices, while `COUNT(*)` still counts every row
- Comparisons with NULL: `field = NULL` never evaluates to `TRUE` (the result is `NULL`), so use `field IS NULL` instead
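
The same normalization can be sketched in Python for data cleaned outside SQL. `normalize_null` and `NULL_STRINGS` are hypothetical names; matching case-insensitively and trimming whitespace are extra assumptions beyond the SQL version.

```python
NULL_STRINGS = {"", "N/A", "UNKNOWN", "ERROR", "--"}

def normalize_null(value):
    """Map the NULL spellings listed above to None; pass real values through."""
    if value is None:
        return None
    if isinstance(value, str):
        stripped = value.strip()  # Whitespace-only strings become ""
        return None if stripped.upper() in NULL_STRINGS else stripped
    return value  # Numbers are left alone (see note below)

raw_items = ["Coffee", "ERROR", "   ", None, "unknown", "Cake "]
print([normalize_null(v) for v in raw_items])
```

Sentinel numbers such as `-999` or `0` are deliberately not handled: whether `0` means "missing" or a genuine zero depends on the column, so that rule belongs in per-column logic.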

---

### Categorical Validation

**Pattern:** Check values against expected domain.

```sql
-- Find unexpected categories
SELECT DISTINCT payment_method
FROM silver_cafe_sales
WHERE payment_method NOT IN ('Cash', 'Credit Card', 'Digital Wallet')
  AND payment_method IS NOT NULL;
```

**When to use reference tables:**
- Product codes → Product names
- Country codes (ISO) → Country names
- Customer IDs → Customer details

**Benefit:** Ensures consistency, enables joins
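
A minimal Python sketch of the same domain check, counting how often each unexpected value appears so that typos can be told apart from genuinely new categories. `unexpected_categories` and the sample values are hypothetical; the valid set mirrors the SQL query above.

```python
from collections import Counter

VALID_PAYMENT_METHODS = {"Cash", "Credit Card", "Digital Wallet"}

def unexpected_categories(values, valid):
    """Count non-null values that fall outside the expected domain."""
    return Counter(v for v in values if v is not None and v not in valid)

# Made-up values, for illustration only.
payments = ["Cash", "Credit Card", None, "credit card", "Cheque", "Cash", "Cheque"]
print(unexpected_categories(payments, VALID_PAYMENT_METHODS))
```

Here `credit card` is most likely a casing variant to normalize, while `Cheque` may be a real payment method that the reference list is missing.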

---

### File Formats: CSV vs Parquet

**CSV:**
- ✅ Human-readable, universal
- ❌ No built-in types (everything is string until parsed)
- ❌ Large file sizes
- ❌ Slow to read/write

**Parquet** (columnar format):
- ✅ Built-in types (date, int, float, boolean)
- ✅ Compressed (often several times smaller than the equivalent CSV)
- ✅ Fast reads (especially for analytics)
- ✅ DuckDB can query directly: `SELECT * FROM 'data.parquet'`
- ❌ Not human-readable (binary format)

**When to use Parquet:**
- Large datasets (>100MB)
- Repeated analysis
- Storing intermediate pipeline outputs

---

### Geodata (Brief Mention)

**Common patterns:**
- Latitude/Longitude pairs: `(40.7128, -74.0060)` → NYC
- Addresses → Geocoding (convert to lat/lon)
- **Caveats:** Geocoding quality varies, address normalization is hard

**Beyond this course**, but common in business data (store locations, customer addresses, delivery zones).
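
Even without geocoding, coordinates can get the same range validation as any other numeric field. A small sketch, with `is_valid_coordinate` as a hypothetical helper:

```python
def is_valid_coordinate(lat, lon):
    """True when lat/lon are present and inside the valid ranges."""
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

print(is_valid_coordinate(40.7128, -74.0060))   # True: New York City
print(is_valid_coordinate(-74.0060, 40.7128))   # True: swapped values still pass
print(is_valid_coordinate(140.7128, -74.0060))  # False: latitude out of range
```

The second call shows the limit of a range check: swapped latitude and longitude can both fall inside the valid ranges, so a bounding box for the expected region is a useful follow-up check.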

---

## 8. Summary

### Key Takeaways

1. **Pipeline pattern:** Bronze (raw) → Silver (clean) → Gold (metrics)
2. **Two scenarios:** 
   - E-commerce: Relational integrity (PKs, FKs, joins)
   - Retail: Data quality (NULLs, types, calculations)
3. **Validations:** Assertions catch problems early and loudly
4. **Idempotency:** Re-running gives same result
5. **Data in the wild:** Encoding, dates, NULLs, categoricals, file formats
6. **Work habits:** SQL vs Python, debugging, docs, git, Restart & Run All

**You're not just analyzing data; you're building infrastructure.**

---

## Next: In-Class Exercise

Build a mini-pipeline:
1. Bronze: Load raw data
2. Silver: Clean, validate (2 assertions)
3. Gold: Create 2-3 metrics
4. Document: Risk note

**Time:** 15 minutes
**Notebook:** `day3_exercise_mini_pipeline.ipynb`

**Let's build!** 🚀

---

## 📋 Course Evaluation

**Please complete the course evaluation!**

Use the QR code below to start the evaluation. This evaluation is only accessible for registered participants.

Your feedback helps improve the course for future students.

![Course Evaluation QR Code](../../references/images/course_evaluation_qr.png)

**Thank you!** 🙏

---