# Chapter 12: Automation, Reporting, and Reproducible Analysis

---

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the importance of **automation** in data analytics workflows
- Create **reusable functions** to automate repetitive analysis tasks
- **Parameterize reports** so they can be run with different inputs
- **Export results** to multiple formats (CSV, Excel, HTML, PDF)
- Use **version control with Git** to track changes in your projects
- Write **well-documented, readable code** that others can understand
- Apply **reproducible research principles** to ensure your analysis can be replicated
- **Schedule analytics tasks** to run automatically

---

## Introduction

### Why Automation Matters

As a data analyst, you'll often find yourself repeating similar tasks:
- Running the same analysis every week with new data
- Generating reports for different regions, products, or time periods
- Cleaning data the same way each time you receive an update

**Manual repetition is:**
- ⏰ **Time-consuming**: you spend hours on tasks that could take seconds
- ❌ **Error-prone**: copy-paste mistakes and forgotten steps creep in
- 😴 **Boring**: tedious work leads to disengagement

**Automation solves these problems by:**
- ⚡ **Saving time**: run complex analyses with a single command
- ✅ **Reducing errors**: code does the same thing every time
- 📊 **Enabling scale**: generate 50 reports as easily as one
- 🔄 **Ensuring consistency**: every report follows the same process

### What is Reproducible Analysis?

**Reproducibility** means that someone else (or you, in 6 months) can:
1. Take your code and data
2. Run it exactly as you did
3. Get the **same results**

This is crucial for:
- **Scientific credibility**: others can verify your findings
- **Collaboration**: teammates can build on your work
- **Debugging**: you can trace exactly what happened

> 💡 **Tip:** Think of reproducibility as "documentation for your future self." You *will* forget why you did something, so make it easy to remember!

### Chapter Roadmap

In this chapter, we'll build a complete **automated reporting pipeline** that:

1. **Ingests** data (loads a built-in demo dataset)
2. **Cleans** the data using reusable functions
3. **Analyzes** by computing KPIs and breakdowns
4. **Visualizes** with saved charts
5. **Exports** to multiple formats (CSV, Excel, HTML)
6. **Documents** with metadata for reproducibility

Along the way, you'll learn best practices for code organization, version control, and documentation.

---

## 12.1 Setup and Environment

### Required Libraries

We'll use standard data analytics libraries:
- **`pandas`**: data manipulation and tables
- **`numpy`**: numeric operations
- **`matplotlib`**: creating and saving charts
- **`seaborn`**: loading the demo dataset
- **`pathlib`**: cross-platform file path handling (built into Python)

> ⚠️ **Warning:** If Excel export fails later (for example, because the `openpyxl` engine is missing), we'll automatically fall back to CSV. You can install the engine with: `pip install openpyxl`

### Setting Up Output Folders

A key automation practice is **separating outputs from source code**. We'll save all generated files to an `outputs/chapter_12/` folder.

In [None]:
# === Setup: Import libraries and create output folder ===

from __future__ import annotations

from dataclasses import dataclass          # For creating structured result objects
from datetime import datetime, timedelta   # For timestamps and date math
from pathlib import Path                   # Cross-platform file paths
import json                                # For serializing parameters
import platform                            # For recording system info
import sys                                 # For Python version info

import numpy as np                         # Numerical operations
import pandas as pd                        # Data manipulation
import matplotlib.pyplot as plt            # Visualization
import seaborn as sns                      # Dataset loading and enhanced plots

# Configure pandas display options for better readability
pd.set_option("display.max_columns", 50)   # Show more columns
pd.set_option("display.width", 120)        # Wider display

# Create a dedicated output folder for this chapter
# Using Path ensures this works on Windows, Mac, and Linux
OUTPUT_ROOT = Path("outputs") / "chapter_12"
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)  # Create if doesn't exist

# Print environment info (useful for reproducibility)
print("=" * 50)
print("ENVIRONMENT SETUP")
print("=" * 50)
print(f"Output folder: {OUTPUT_ROOT.resolve()}")
print(f"Python version: {sys.version.split()[0]}")
print(f"Platform: {platform.platform()}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print("=" * 50)

---

## 12.2 Loading a Demo Dataset

### Using Real Data from Seaborn

Instead of generating synthetic data, we'll use the **taxis** dataset from seaborn, a real-world dataset containing NYC taxi ride information. We'll relabel its columns as a toy "sales" table so the rest of the chapter reads like a sales report. The dataset suits an automated reporting demo because it has:
- Dates and times (for time-based aggregations)
- Categorical variables (pickup/dropoff locations, payment type)
- Numeric variables (distance, fare, tip)

> 💡 **Tip:** In real projects, you would load data from files, databases, or APIs. Using a built-in dataset means everyone runs the notebook on the same data. Note that seaborn downloads the dataset on first use, so you need an internet connection the first time.
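
For the file-based case mentioned in the tip, a loader might look like the sketch below. The function name `load_sales_csv` and the `REQUIRED_COLUMNS` list are assumptions made for illustration; they mirror the column names used later in this chapter, not any particular real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical schema for illustration; adjust it to your own source file
REQUIRED_COLUMNS = ["date", "region", "channel", "product", "quantity", "unit_price"]


def load_sales_csv(source) -> pd.DataFrame:
    """Read a sales CSV and fail fast if expected columns are missing."""
    df = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Parse dates only after we know the column exists
    df["date"] = pd.to_datetime(df["date"])
    return df[REQUIRED_COLUMNS]


# Demo with an in-memory file so the sketch runs without any data on disk
sample_csv = StringIO(
    "date,region,channel,product,quantity,unit_price\n"
    "2019-03-01,Manhattan,Online,Short Trip,1,7.50\n"
    "2019-03-02,Queens,Retail,Long Trip,2,31.00\n"
)
demo = load_sales_csv(sample_csv)
print(demo.dtypes)
```

Checking the schema at load time means a renamed or dropped column stops the pipeline immediately instead of causing a confusing error several steps later.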

### Dataset Overview

The taxis dataset includes:
- **pickup/dropoff**: datetime and location information
- **passengers**: number of passengers
- **distance**: trip distance in miles
- **fare**: fare amount in dollars
- **tip**: tip amount in dollars
- **payment**: payment method (cash, credit card)

In [None]:
def load_sales_data() -> pd.DataFrame:
    """
    Load and transform the seaborn taxis dataset for our sales reporting demo.
    
    Returns:
    --------
    pd.DataFrame
        A DataFrame with columns: date, region, channel, product, quantity, unit_price
    """
    # Load the taxis dataset from seaborn
    taxis = sns.load_dataset("taxis")
    
    # Transform to our "sales" format
    df = pd.DataFrame({
        "date": pd.to_datetime(taxis["pickup"]).dt.date,
        "region": taxis["pickup_borough"].fillna("Unknown"),
        "channel": taxis["payment"].map({"credit card": "Online", "cash": "Retail"}).fillna("Partner"),
        "product": np.where(taxis["distance"] > 5, "Long Trip", 
                   np.where(taxis["distance"] > 2, "Medium Trip", "Short Trip")),
        "quantity": taxis["passengers"].fillna(1).astype(int).clip(1, 5),
        "unit_price": taxis["fare"].fillna(taxis["fare"].median()).round(2)
    })
    
    # Convert date to datetime
    df["date"] = pd.to_datetime(df["date"])
    
    return df

# Load the raw data
raw = load_sales_data()

print(f"Dataset shape: {raw.shape}")
print(f"Date range: {raw['date'].min()} to {raw['date'].max()}")
raw.head()

---

## 12.3 Automation in Analytics Workflows

### The Automation Pattern

A well-automated analytics workflow follows this pattern:

```
1. INGEST → Load or generate data
2. CLEAN → Fix types, handle missing values, create derived columns
3. ANALYZE → Compute KPIs, aggregations, breakdowns
4. REPORT → Export tables, charts, and formatted reports
```

Each step should be a **reusable function** that:
- Takes inputs as parameters
- Returns outputs explicitly
- Has no hidden side effects
- Can be tested independently

> 💡 **Tip:** Writing functions instead of loose code in cells makes your analysis:
> - **Reusable**: call the same function on different data
> - **Testable**: verify each piece works correctly
> - **Readable**: function names describe what they do
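
To make "can be tested independently" concrete, here is a minimal sketch of one pipeline step with a check that runs on a tiny hand-made table. The names `add_revenue` and `test_add_revenue` are illustrative and not part of the chapter's pipeline.

```python
import pandas as pd


def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with revenue = quantity * unit_price."""
    out = df.copy()
    out["revenue"] = out["quantity"] * out["unit_price"]
    return out


def test_add_revenue() -> None:
    """Check the output and confirm the input was not modified."""
    tiny = pd.DataFrame({"quantity": [1, 3], "unit_price": [10.0, 2.5]})
    result = add_revenue(tiny)
    assert result["revenue"].tolist() == [10.0, 7.5]
    assert "revenue" not in tiny.columns  # no hidden side effect on the input


test_add_revenue()
print("add_revenue passed its checks")
```

Because the function takes a DataFrame and returns a new one, the check needs no files, no global variables, and no particular notebook state.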

### The Clean Step

Let's implement data cleaning as a reusable function. This function:
1. Converts columns to proper types
2. Handles missing values
3. Creates a new calculated column (`revenue`)

In [None]:
def clean_sales_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean and prepare sales data for analysis.
    
    Steps performed:
    1. Convert date column to datetime type
    2. Convert text columns to string type
    3. Fill missing prices with the median (simple imputation strategy)
    4. Calculate revenue = quantity × unit_price
    
    Parameters:
    -----------
    df : pd.DataFrame
        Raw sales data
        
    Returns:
    --------
    pd.DataFrame
        Cleaned sales data with 'revenue' column added
    """
    # Always work on a copy to avoid modifying the original
    cleaned = df.copy()
    
    # Step 1: Ensure date is proper datetime type
    cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
    
    # Step 2: Convert text columns to string type for consistency
    cleaned["region"] = cleaned["region"].astype("string")
    cleaned["channel"] = cleaned["channel"].astype("string")
    cleaned["product"] = cleaned["product"].astype("string")
    
    # Step 3: Handle missing prices
    # Using median is robust to outliers (better than mean for prices)
    price_median = cleaned["unit_price"].median(skipna=True)
    missing_count = cleaned["unit_price"].isna().sum()
    cleaned["unit_price"] = cleaned["unit_price"].fillna(price_median)
    print(f"Filled {missing_count} missing prices with median: ${price_median:.2f}")
    
    # Step 4: Calculate revenue
    cleaned["revenue"] = cleaned["quantity"] * cleaned["unit_price"]
    
    # Normalize dates (remove time component)
    cleaned["date"] = cleaned["date"].dt.normalize()
    
    return cleaned


# Apply the cleaning function
sales = clean_sales_data(raw)

print(f"\nCleaned data shape: {sales.shape}")
print(f"New columns: {set(sales.columns) - set(raw.columns)}")
print(f"Revenue range: ${sales['revenue'].min():.2f} to ${sales['revenue'].max():.2f}")
print("\nSample of cleaned data:")
sales.head()

### 🎯 Mini-Exercise 1: Improve the Cleaning Function

The current cleaning function fills missing prices with the **overall median**. But different products might have very different prices!

**Your task:** Modify the cleaning logic to fill missing prices with the **median price for that product**.

<details>
<summary>💡 Hint (click to expand)</summary>

Use `groupby('product')['unit_price'].transform('median')` to get the per-product median for each row.

</details>

In [None]:
# Your solution here:
# Try modifying the clean_sales_data function to use per-product median

# Example solution (uncomment to test):
# def clean_sales_data_improved(df: pd.DataFrame) -> pd.DataFrame:
#     cleaned = df.copy()
#     cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
#     # ... other type conversions ...
#     
#     # Per-product median imputation
#     product_medians = cleaned.groupby("product")["unit_price"].transform("median")
#     cleaned["unit_price"] = cleaned["unit_price"].fillna(product_medians)
#     
#     cleaned["revenue"] = cleaned["quantity"] * cleaned["unit_price"]
#     return cleaned

---

## 12.4 Parameterizing Reports

### Why Parameters Matter

Instead of hard-coding values like `region = "North"` scattered throughout your code, define **parameters** at the top of your script or notebook.

**Benefits:**
- ✅ Change one value, update the entire analysis
- ✅ Easy to run the same report for different inputs
- ✅ Clear documentation of what can be configured
- ✅ Enables command-line or scheduled execution

> ⚠️ **Common Mistake:** Beginners often copy-paste code and change values in multiple places. This leads to bugs when you forget to update one location. **Always use parameters!**
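
Command-line execution could be sketched with the standard library's `argparse`. The flag names `--region` and `--days-lookback` and the function `parse_report_args` are assumptions chosen to match this chapter's parameters.

```python
import argparse
from typing import Optional, Sequence


def parse_report_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse report parameters; argv=None means 'read from the real command line'."""
    parser = argparse.ArgumentParser(description="Generate the sales report")
    parser.add_argument("--region", default="ALL",
                        help='Region to report on, or "ALL" for every region')
    parser.add_argument("--days-lookback", type=int, default=30,
                        help="Number of recent days to include")
    return parser.parse_args(argv)


# Passing a list simulates: python report.py --region Queens --days-lookback 7
args = parse_report_args(["--region", "Queens", "--days-lookback", "7"])
print(args.region, args.days_lookback)
```

Accepting an explicit `argv` list keeps the parser usable from a notebook or a test, where the real command line belongs to Jupyter rather than to your script.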

### Our Report Parameters

For this demo report, we'll use two parameters:
1. **`REPORT_REGION`**: which region to analyze (or "ALL" for everything)
2. **`DAYS_LOOKBACK`**: how many recent days to include

Try changing these values and re-running the cells below!

In [None]:
# ==========================================================
# REPORT PARAMETERS - Change these to customize the report
# ==========================================================

REPORT_REGION = "Manhattan" # Options: "Manhattan", "Queens", "Brooklyn", "Bronx", "Unknown", or "ALL"
DAYS_LOOKBACK = 30          # Number of recent days to include (e.g., 7, 30, 90)

# Display parameter summary
print("=" * 50)
print("REPORT PARAMETERS")
print("=" * 50)
print(f"Region filter: {REPORT_REGION}")
print(f"Days lookback: {DAYS_LOOKBACK}")
print("=" * 50)
print("\n💡 Change these values and re-run to generate different reports!")
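
As the number of parameters grows, it can help to bundle them into a single validated object. The sketch below uses a frozen dataclass; the class name `ReportParams` and its validation rule are assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReportParams:
    """Hypothetical container for the two report parameters."""
    region: str = "ALL"
    days_lookback: int = 30

    def __post_init__(self) -> None:
        # Reject values that would silently produce an empty report
        if self.days_lookback < 1:
            raise ValueError("days_lookback must be at least 1")


params = ReportParams(region="Manhattan", days_lookback=30)
print(asdict(params))
```

`frozen=True` prevents the parameters from being changed halfway through a run, and `asdict` gives a plain dictionary that is easy to log or save alongside the outputs.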

### Filtering Data Based on Parameters

Now we create a function that applies our parameter filters. This function:
1. Calculates the date range based on `days_lookback`
2. Filters to the specified region (unless "ALL")

In [None]:
def filter_for_report(df: pd.DataFrame, region: str, days_lookback: int) -> pd.DataFrame:
    """
    Filter data based on report parameters.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Cleaned sales data
    region : str
        Region to filter by, or "ALL" for no region filter
    days_lookback : int
        Number of recent days to include
        
    Returns:
    --------
    pd.DataFrame
        Filtered data for the report
    """
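    # Calculate the date range. Series.between() includes both endpoints,
    # so subtract (days_lookback - 1) days to keep exactly days_lookback calendar days.
    end_date = df["date"].max()
    start_date = end_date - pd.Timedelta(days=days_lookback - 1)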
    
    # Filter by date
    filtered = df[df["date"].between(start_date, end_date)].copy()
    print(f"Date filter: {start_date.date()} to {end_date.date()}")
    print(f"Rows after date filter: {len(filtered)}")
    
    # Filter by region (unless "ALL")
    if region.upper() != "ALL":
        filtered = filtered[filtered["region"] == region]
        print(f"Region filter: {region}")
        print(f"Rows after region filter: {len(filtered)}")
    else:
        print("Region filter: ALL (no filtering)")
    
    return filtered


# Apply the filter
report_df = filter_for_report(sales, REPORT_REGION, DAYS_LOOKBACK)

print(f"\nFiltered data summary:")
print(f"  Shape: {report_df.shape}")
print(f"  Total revenue: ${report_df['revenue'].sum():,.2f}")
print("\nPreview:")
report_df.head()
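
Once filtering is a function of its parameters, producing one result per region is a loop rather than a series of hand-edited runs. The sketch below uses a tiny made-up table and a hypothetical helper `revenue_for_region` so that it runs on its own.

```python
import pandas as pd

# Tiny stand-in for the cleaned sales table (values are made up)
sales_demo = pd.DataFrame({
    "region": ["Manhattan", "Manhattan", "Queens", "Brooklyn"],
    "revenue": [12.0, 8.0, 30.0, 15.5],
})


def revenue_for_region(df: pd.DataFrame, region: str) -> float:
    """Total revenue for one region, or for all rows when region is 'ALL'."""
    if region.upper() != "ALL":
        df = df[df["region"] == region]
    return float(df["revenue"].sum())


# One loop replaces one manual notebook run per region
regions = ["ALL"] + sorted(sales_demo["region"].unique())
totals = {region: revenue_for_region(sales_demo, region) for region in regions}
for region, total in totals.items():
    print(f"{region:10s} ${total:,.2f}")
```

Deriving the region list from the data itself, rather than typing it by hand, also guards against filtering on a label that does not exist in the dataset.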

---

## 12.5 KPI Summary and Breakdowns

### Understanding KPIs

**KPIs (Key Performance Indicators)** answer the question: "How are we doing overall?"

For a sales report, common KPIs include:
- **Total Revenue**: How much money did we make?
- **Total Orders**: How many transactions occurred?
- **Total Units Sold**: How many items were purchased?
- **Average Order Value**: How much is a typical order worth?

### Understanding Breakdowns

**Breakdowns** answer: "Where is the performance coming from?"

They split KPIs by different dimensions:
- By **channel**: Are online sales growing faster than retail?
- By **product**: Which product generates the most revenue?
- By **time**: What's the daily trend?

> ⚠️ **Common Mistake:** Mixing formatting (like `$1,234.56`) into your raw numbers too early. Keep numeric columns as numbers for calculations; format only at the final display step.
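
One way to keep the two concerns apart is a small formatting helper that is applied only when displaying values. The function name `format_kpi` and the sample numbers are illustrative.

```python
def format_kpi(metric: str, value: float) -> str:
    """Return a display string; the numeric value itself is left untouched."""
    if "revenue" in metric or "order_value" in metric:
        return f"${value:,.2f}"
    return f"{value:,.0f}"


raw_kpis = {"total_revenue": 1234.5, "total_orders": 42}
display_kpis = {name: format_kpi(name, value) for name, value in raw_kpis.items()}

# Calculations still work because raw_kpis holds numbers, not strings
avg_order_value = raw_kpis["total_revenue"] / raw_kpis["total_orders"]
print(display_kpis)
print(format_kpi("avg_order_value", avg_order_value))
```

Had `total_revenue` been stored as the string `"$1,234.50"`, the division would raise a `TypeError` instead of returning a number.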

In [None]:
def kpi_summary(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate key performance indicators from sales data.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Filtered sales data
        
    Returns:
    --------
    pd.DataFrame
        Table with KPI names and values
    """
    # Calculate each KPI
    total_revenue = df["revenue"].sum()
    total_orders = len(df)
    total_units = df["quantity"].sum()
    avg_order_value = total_revenue / total_orders if total_orders > 0 else 0
    
    # Return as a tidy table
    return pd.DataFrame({
        "metric": ["total_revenue", "total_orders", "total_units", "avg_order_value"],
        "value": [total_revenue, total_orders, total_units, avg_order_value],
    })


# Calculate KPIs for our filtered data
kpis = kpi_summary(report_df)

# Display with formatting
print("KPI Summary")
print("-" * 40)
for _, row in kpis.iterrows():
    metric = row["metric"]
    value = row["value"]
    if "revenue" in metric or "order_value" in metric:
        print(f"{metric:20s}: ${value:,.2f}")
    else:
        print(f"{metric:20s}: {value:,.0f}")
print("-" * 40)

kpis

### Creating Breakdown Tables

Now let's create functions to generate breakdown tables by channel, product, and date:

In [None]:
def top_breakdowns(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """
    Create breakdown tables by different dimensions.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Filtered sales data
        
    Returns:
    --------
    dict[str, pd.DataFrame]
        Dictionary with breakdown tables: by_channel, by_product, by_day
    """
    # Breakdown by sales channel
    by_channel = (
        df.groupby("channel", dropna=False)
        .agg(
            orders=("revenue", "size"),      # Count of orders
            revenue=("revenue", "sum"),      # Total revenue
            units=("quantity", "sum")        # Total units
        )
        .sort_values("revenue", ascending=False)
        .reset_index()
    )
    
    # Breakdown by product
    by_product = (
        df.groupby("product", dropna=False)
        .agg(
            orders=("revenue", "size"),
            revenue=("revenue", "sum"),
            units=("quantity", "sum")
        )
        .sort_values("revenue", ascending=False)
        .reset_index()
    )
    
    # Breakdown by day (for trend analysis)
    by_day = (
        df.groupby("date", dropna=False)
        .agg(
            orders=("revenue", "size"),
            revenue=("revenue", "sum"),
            units=("quantity", "sum")
        )
        .sort_values("date")  # Sort chronologically
        .reset_index()
    )
    
    return {
        "by_channel": by_channel,
        "by_product": by_product,
        "by_day": by_day
    }


# Generate breakdowns
breakdowns = top_breakdowns(report_df)

# Display results
print("Revenue by Channel:")
print(breakdowns["by_channel"].to_string(index=False))
print("\nRevenue by Product:")
print(breakdowns["by_product"].to_string(index=False))
print(f"\nDaily breakdown: {len(breakdowns['by_day'])} days of data")
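
The three breakdowns in `top_breakdowns` differ only in the grouping column and the sort order, so they could share one helper. This is a sketch of that refactoring, not a replacement the rest of the chapter depends on; `breakdown_by` is a hypothetical name and the demo values are made up.

```python
import pandas as pd


def breakdown_by(df: pd.DataFrame, dimension: str, sort_by: str = "revenue",
                 ascending: bool = False) -> pd.DataFrame:
    """Aggregate orders, revenue, and units by one dimension."""
    table = (
        df.groupby(dimension, dropna=False)
        .agg(
            orders=("revenue", "size"),
            revenue=("revenue", "sum"),
            units=("quantity", "sum"),
        )
        .reset_index()
    )
    # Sort after reset_index so the dimension itself is a valid sort column
    return table.sort_values(sort_by, ascending=ascending).reset_index(drop=True)


demo = pd.DataFrame({
    "channel": ["Online", "Retail", "Online"],
    "product": ["Short Trip", "Long Trip", "Long Trip"],
    "quantity": [1, 2, 1],
    "revenue": [10.0, 60.0, 35.0],
})
by_channel = breakdown_by(demo, "channel")
by_product = breakdown_by(demo, "product")
print(by_channel.to_string(index=False))
```

For a chronological table you would call it as `breakdown_by(df, "date", sort_by="date", ascending=True)`.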

---

## 12.6 Generating Automated Reports with Charts

### Why Save Charts as Files?

Automation is most useful when it produces **artifacts** you can share:
- 📊 **PNG/JPG images**: for presentations and emails
- 📄 **CSV/Excel files**: for further analysis
- 🌐 **HTML reports**: for web viewing and sharing

Charts saved as image files can be:
- Embedded in reports and presentations
- Attached to automated emails
- Archived for historical comparison

### Creating a Daily Revenue Chart

Let's create a function that generates a chart and saves it as a PNG file:

In [None]:
def save_daily_revenue_plot(by_day: pd.DataFrame, out_path: Path) -> Path:
    """
    Create and save a daily revenue chart.
    
    Parameters:
    -----------
    by_day : pd.DataFrame
        Daily breakdown table with 'date' and 'revenue' columns
    out_path : Path
        Where to save the PNG file
        
    Returns:
    --------
    Path
        The path where the chart was saved
    """
    # Create the figure
    fig, ax = plt.subplots(figsize=(10, 5))
    
    # Plot the data
    ax.plot(
        by_day["date"], 
        by_day["revenue"], 
        marker="o",           # Circle markers at each point
        linewidth=2,          # Line thickness
        color="#2E86AB",      # Professional blue color
        markersize=6
    )
    
    # Add styling
    ax.set_title("Daily Revenue Trend", fontsize=14, fontweight="bold")
    ax.set_xlabel("Date", fontsize=11)
    ax.set_ylabel("Revenue ($)", fontsize=11)
    ax.grid(True, alpha=0.3, linestyle="--")
    
    # Format y-axis as currency
    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x:,.0f}'))
    
    # Rotate x-axis labels for readability
    fig.autofmt_xdate(rotation=30)
    
    # Ensure output directory exists
    out_path.parent.mkdir(parents=True, exist_ok=True)
    
    # Save the figure
    fig.savefig(out_path, dpi=150, bbox_inches="tight", facecolor="white")
    plt.close(fig)  # Close to free memory
    
    print(f"✅ Chart saved to: {out_path}")
    return out_path


# Save the chart
chart_path = OUTPUT_ROOT / "daily_revenue.png"
save_daily_revenue_plot(breakdowns["by_day"], chart_path)

# Display the chart inline as well
from IPython.display import Image
Image(filename=str(chart_path))

---

## 12.7 Exporting Results (CSV, Excel, HTML)

### Choosing Export Formats

Different formats serve different purposes:

| Format | Best For | Pros | Cons |
|--------|----------|------|------|
| **CSV** | Data exchange, archiving | Universal, simple, small | No formatting, one sheet |
| **Excel** | Business users, multiple tables | Multiple sheets, formatting | Requires openpyxl library |
| **HTML** | Web sharing, email | Opens in browser, embeds images | Larger files |
| **PDF** | Formal reports, printing | Professional look, fixed layout | Harder to generate |

### Export Strategy

Our export function will:
1. Try to save as Excel (multiple sheets in one file)
2. Fall back to CSV if Excel library is missing
3. Return a list of all files created

> 💡 **Tip:** Wrap file exports in try/except. An optional library might not be installed, or the disk might be full.
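
An alternative to catching the failure is to check up front whether the optional library is importable. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `has_module` is an assumption.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return find_spec(name) is not None


# Decide the export format before writing anything to disk
export_format = "xlsx" if has_module("openpyxl") else "csv"
print(f"Will export as: {export_format}")
```

This check covers only the "library missing" case. Problems such as a full disk still surface while writing, so it complements try/except rather than replacing it.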

In [None]:
def export_excel_or_csv(
    out_dir: Path,
    cleaned: pd.DataFrame,
    kpis: pd.DataFrame,
    breakdowns: dict[str, pd.DataFrame],
    excel_name: str = "report.xlsx",
) -> list[Path]:
    """
    Export data to Excel (preferred) or CSV (fallback).
    
    Parameters:
    -----------
    out_dir : Path
        Output directory
    cleaned : pd.DataFrame
        The cleaned data to export
    kpis : pd.DataFrame
        KPI summary table
    breakdowns : dict[str, pd.DataFrame]
        Breakdown tables by dimension
    excel_name : str
        Name for the Excel file
        
    Returns:
    --------
    list[Path]
        List of files that were created
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    excel_path = out_dir / excel_name
    
    # Try Excel first (requires openpyxl)
    try:
        with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
            cleaned.to_excel(writer, sheet_name="data", index=False)
            kpis.to_excel(writer, sheet_name="kpis", index=False)
            for name, table in breakdowns.items():
                # Excel sheet names limited to 31 characters
                table.to_excel(writer, sheet_name=name[:31], index=False)
        
        written.append(excel_path)
        print(f"✅ Excel export successful: {excel_path.name}")
        return written
        
    except Exception as e:
        print(f"⚠️ Excel export failed: {e}")
        print("   Falling back to CSV files...")
    
    # Fallback: Export as separate CSV files
    csv_files = {
        "data.csv": cleaned,
        "kpis.csv": kpis,
        **{f"{name}.csv": table for name, table in breakdowns.items()}
    }
    
    for filename, df in csv_files.items():
        path = out_dir / filename
        df.to_csv(path, index=False)
        written.append(path)
        print(f"✅ Saved: {path.name}")
    
    return written


# Export the data
export_paths = export_excel_or_csv(OUTPUT_ROOT, report_df, kpis, breakdowns)

print(f"\nTotal files exported: {len(export_paths)}")
for p in export_paths:
    print(f"  📄 {p.name}")
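
Each run of the export above overwrites the previous files in the same folder. If you want to keep earlier runs for comparison, one option is a timestamped subfolder per run. The helper `make_run_dir` and the folder naming scheme are assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def make_run_dir(root: Path, run_at: datetime) -> Path:
    """Create and return root/<YYYYMMDD_HHMMSS> for one pipeline run."""
    run_dir = root / run_at.strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# Demo in a temporary folder so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(Path(tmp), datetime(2024, 1, 15, 9, 30, 0))
    run_dir_name = run_dir.name
    run_dir_existed = run_dir.is_dir()

print(run_dir_name)
```

In the pipeline you would pass `datetime.now()` and use the returned folder as the `out_dir` argument of the export functions.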

---

## 12.8 Generating HTML Reports

### Why HTML Reports?

HTML reports are powerful because:
- 🌐 **Universal**: opens in any web browser
- 📊 **Rich content**: tables, images, and styling together on one page
- 📧 **Shareable**: can be emailed or hosted on a server
- 🎨 **Customizable**: full control over appearance with CSS

### Building an HTML Report

We'll create a function that:
1. Converts DataFrames to HTML tables
2. References our saved chart image by relative path (keep the PNG in the same folder as the report)
3. Adds a timestamp and styling
4. Saves the page as an HTML file

> 💡 **Tip:** For more complex reports, consider libraries like `Jinja2` (templating) or `weasyprint` (PDF generation).

In [None]:
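def df_to_html_table(df: pd.DataFrame, max_rows: int = 20) -> str:
    """Convert the first `max_rows` rows of a DataFrame to an HTML table string."""
    preview = df.head(max_rows)
    # escape=True (the pandas default) keeps characters like < and & in the
    # data from being interpreted as HTML markup in the report
    return preview.to_html(index=False, escape=True, classes="data-table")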


def export_html_report(
    out_dir: Path,
    title: str,
    kpis: pd.DataFrame,
    by_channel: pd.DataFrame,
    by_product: pd.DataFrame,
    chart_file: str,
    report_name: str = "report.html",
) -> Path:
    """
    Generate a styled HTML report.
    
    Parameters:
    -----------
    out_dir : Path
        Output directory
    title : str
        Report title
    kpis : pd.DataFrame
        KPI summary table
    by_channel : pd.DataFrame
        Channel breakdown
    by_product : pd.DataFrame
        Product breakdown
    chart_file : str
        Filename of the chart image (relative to report)
    report_name : str
        Output filename
        
    Returns:
    --------
    Path
        Path to the generated HTML file
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    report_path = out_dir / report_name
    
    # HTML template with embedded CSS
    html = f"""<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{title}</title>
    <style>
        /* Reset and base styles */
        * {{ box-sizing: border-box; }}
        body {{
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, sans-serif;
            margin: 0;
            padding: 24px;
            background: #f8f9fa;
            color: #333;
            line-height: 1.6;
        }}
        
        /* Container */
        .container {{
            max-width: 1000px;
            margin: 0 auto;
            background: white;
            padding: 32px;
            border-radius: 8px;
            box-shadow: 0 2px 8px rgba(0,0,0,0.1);
        }}
        
        /* Typography */
        h1 {{
            color: #2c3e50;
            border-bottom: 3px solid #3498db;
            padding-bottom: 12px;
            margin-top: 0;
        }}
        h2 {{
            color: #34495e;
            margin-top: 32px;
        }}
        
        /* Metadata */
        .meta {{
            color: #7f8c8d;
            font-size: 0.9em;
            margin-bottom: 24px;
        }}
        
        /* Tables */
        .data-table {{
            border-collapse: collapse;
            width: 100%;
            margin: 16px 0 32px 0;
        }}
        .data-table th, .data-table td {{
            border: 1px solid #ddd;
            padding: 10px 12px;
            text-align: left;
        }}
        .data-table th {{
            background: #3498db;
            color: white;
            font-weight: 600;
        }}
        .data-table tr:nth-child(even) {{
            background: #f8f9fa;
        }}
        .data-table tr:hover {{
            background: #e8f4f8;
        }}
        
        /* Chart */
        .chart-container {{
            text-align: center;
            margin: 24px 0;
        }}
        .chart-container img {{
            max-width: 100%;
            height: auto;
            border: 1px solid #ddd;
            border-radius: 4px;
        }}
        
        /* Footer */
        .footer {{
            margin-top: 40px;
            padding-top: 20px;
            border-top: 1px solid #eee;
            color: #95a5a6;
            font-size: 0.85em;
            text-align: center;
        }}
    </style>
</head>
<body>
    <div class="container">
        <h1>{title}</h1>
        <p class="meta">
            📅 Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}<br>
            🐍 Python {sys.version.split()[0]} | Pandas {pd.__version__}
        </p>
        
        <h2>📊 Key Performance Indicators</h2>
        {df_to_html_table(kpis, max_rows=20)}
        
        <h2>📈 Revenue by Channel</h2>
        {df_to_html_table(by_channel, max_rows=20)}
        
        <h2>📦 Revenue by Product</h2>
        {df_to_html_table(by_product, max_rows=20)}
        
        <h2>📉 Daily Revenue Trend</h2>
        <div class="chart-container">
            <img src="{chart_file}" alt="Daily Revenue Chart">
        </div>
        
        <div class="footer">
            Report generated by Automated Analytics Pipeline | Chapter 12 Demo
        </div>
    </div>
</body>
</html>"""
    
    report_path.write_text(html, encoding="utf-8")
    print(f"✅ HTML report saved to: {report_path}")
    return report_path


# Generate the HTML report
html_path = export_html_report(
    OUTPUT_ROOT,
    title=f"Sales Report ({REPORT_REGION}, last {DAYS_LOOKBACK} days)",
    kpis=kpis,
    by_channel=breakdowns["by_channel"],
    by_product=breakdowns["by_product"],
    chart_file=chart_path.name,
)

print("\n📁 Open this file in your browser to view the report:")
print(f"   {html_path.resolve()}")
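
The report above points at the chart through a relative `src`, so the PNG has to travel with the HTML file. If you need a report that works as a single attachment, one option is to inline the image as a base64 data URI. The helper `png_to_data_uri` is hypothetical, and the demo writes only the 8-byte PNG signature rather than a real chart.

```python
import base64
from pathlib import Path
import tempfile


def png_to_data_uri(png_path: Path) -> str:
    """Encode a PNG file as a data URI usable in an <img src="..."> attribute."""
    encoded = base64.b64encode(png_path.read_bytes()).decode("ascii")
    return f"data:image/png;base64,{encoded}"


with tempfile.TemporaryDirectory() as tmp:
    fake_png = Path(tmp) / "chart.png"
    fake_png.write_bytes(b"\x89PNG\r\n\x1a\n")  # PNG signature only, for the demo
    data_uri = png_to_data_uri(fake_png)

img_tag = f'<img src="{data_uri}" alt="Daily Revenue Chart">'
print(img_tag)
```

The trade-off is file size: base64 encoding makes the image roughly a third larger, and the HTML file grows by that amount.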

### 🎯 Mini-Exercise 2: Customize the HTML Report

Modify the `export_html_report` function to add one of the following:

1. Add a **summary paragraph** that states the number of orders and the total revenue
2. Change the **color scheme** (try different hex colors for headers)
3. Add the **daily breakdown table** to the report

<details>
<summary>💡 Hint for option 1</summary>

Add this before the KPI table:
```python
summary = f"<p><strong>Summary:</strong> {len(report_df)} orders totaling ${report_df['revenue'].sum():,.2f}</p>"
```
</details>

---

## 12.9 PDF Export Options

### Why PDF?

PDF (Portable Document Format) is ideal for:
- 📄 **Formal reports** that need to look professional
- 🖨️ **Printing** with consistent formatting
- 📧 **Email attachments** that won't change appearance

### PDF Generation Options in Python

Generating PDFs from Python requires additional libraries. Here are the main options:

| Library | Approach | Difficulty | Best For |
|---------|----------|------------|----------|
| **weasyprint** | HTML → PDF | Medium | Converting HTML reports to PDF |
| **reportlab** | Build PDF directly | Hard | Complex custom layouts |
| **fpdf2** | Simple PDF creation | Easy | Basic reports |
| **pdfkit** | HTML → PDF (uses wkhtmltopdf) | Easy | Quick conversions |

> ⚠️ **Note:** PDF libraries often have system dependencies (fonts, the wkhtmltopdf binary). For beginners, we recommend starting with HTML reports and converting to PDF manually (Print → Save as PDF) or using an online converter.

### Example: Simple PDF with fpdf2

Here's a conceptual example (uncomment and install `fpdf2` to run):

In [None]:
# PDF Export Example (conceptual - requires fpdf2: pip install fpdf2)
# Uncomment the code below to try it

# from fpdf import FPDF
#
# def export_simple_pdf(out_path: Path, title: str, kpis: pd.DataFrame) -> Path:
#     """Generate a simple PDF report."""
#     pdf = FPDF()
#     pdf.add_page()
#     
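#     # Title (Helvetica is one of the built-in core PDF fonts)
#     pdf.set_font("Helvetica", "B", 16)
#     # new_x/new_y move the cursor to the start of the next line (replaces the deprecated ln=True)
#     pdf.cell(0, 10, title, new_x="LMARGIN", new_y="NEXT", align="C")
#     pdf.ln(10)
#     
#     # KPIs
#     pdf.set_font("Helvetica", "B", 12)
#     pdf.cell(0, 10, "Key Performance Indicators", new_x="LMARGIN", new_y="NEXT")
#     
#     pdf.set_font("Helvetica", "", 10)
#     for _, row in kpis.iterrows():
#         pdf.cell(0, 8, f"{row['metric']}: {row['value']:.2f}", new_x="LMARGIN", new_y="NEXT")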
#     
#     pdf.output(str(out_path))
#     return out_path
#
# # Generate PDF
# pdf_path = OUTPUT_ROOT / "report.pdf"
# export_simple_pdf(pdf_path, "Sales Report", kpis)

print("💡 PDF export requires additional libraries.")
print("   For now, open the HTML report in a browser and use Print → Save as PDF")

---

## 12.10 Reproducibility: Capturing Run Metadata

### What is Reproducibility?

**Reproducibility** means being able to recreate your exact results at any time. This requires recording:

1. **What parameters were used**: Region, date range, filters
2. **When the analysis ran**: Timestamp
3. **What environment was used**: Python version, library versions
4. **What data was used**: Data source, row counts, checksums
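
The fourth item, checksums, is the one the metadata function later in this section does not cover. As a minimal sketch using only the standard library, a file's SHA-256 digest can act as a fingerprint of the input data. The helper names `file_sha256` and `describe_data_file` are illustrative assumptions, not functions defined elsewhere in this chapter.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large data files do not need to fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_data_file(path: Path) -> dict:
    """Summarize a data file: where it is, how big it is, and its checksum."""
    return {
        "data_source": str(path),
        "size_bytes": path.stat().st_size,
        "sha256": file_sha256(path),
    }
```

If the checksum recorded for two runs differs, the input file changed between them, even when the file name and row count look the same.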

### Why Metadata Matters

Imagine this scenario:
> "The sales report from last month shows different numbers than today's report for the same period. Why?"

Without metadata, you'd have to guess. With metadata, you can check:
- Were the parameters the same?
- Was the data source updated?
- Did library versions change?
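
To make that check concrete, here is a small sketch that compares the metadata of two runs field by field. The function `diff_metadata` and both sample records are hypothetical and exist only for illustration; real records would come from saved metadata files.

```python
import json


def diff_metadata(old: dict, new: dict, ignore: tuple = ("run_at",)) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue  # timestamps always differ, so skip them
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


# Hypothetical metadata records from two runs of the same report
last_month = {
    "run_at": "2024-01-15T09:30:45",
    "pandas": "2.1.4",
    "params": json.dumps({"region": "North", "days_lookback": 30}),
    "row_count": 1180,
}
today = {
    "run_at": "2024-02-15T09:31:02",
    "pandas": "2.2.0",
    "params": json.dumps({"region": "North", "days_lookback": 30}),
    "row_count": 1204,
}

for field, (before, after) in diff_metadata(last_month, today).items():
    print(f"{field}: {before} -> {after}")
```

In this made-up case the parameters match, so the difference would have to come from the data (row count) or the environment (pandas version).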

### Creating a Metadata Capture Function

In [None]:
def capture_run_metadata(params: dict) -> pd.DataFrame:
    """
    Capture metadata about the current run for reproducibility.
    
    Parameters:
    -----------
    params : dict
        Dictionary of parameters used in this run
        
    Returns:
    --------
    pd.DataFrame
        Single-row DataFrame with metadata
    """
    def safe_version(mod_name: str) -> str:
        """Safely get a module's version, or 'not-installed' if unavailable."""
        try:
            mod = __import__(mod_name)
            return getattr(mod, "__version__", "unknown")
        except Exception:
            return "not-installed"
    
    # Collect all metadata
    meta = {
        # When (local time with UTC offset, so the timestamp is unambiguous across machines)
        "run_at": datetime.now().astimezone().isoformat(timespec="seconds"),
        
        # Environment
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "pandas": safe_version("pandas"),
        "numpy": safe_version("numpy"),
        "matplotlib": safe_version("matplotlib"),
        
        # Parameters (stored as JSON string)
        "params": json.dumps(params, ensure_ascii=False),
    }
    
    return pd.DataFrame([meta])


# Capture metadata for this run
metadata = capture_run_metadata({
    "region": REPORT_REGION,
    "days_lookback": DAYS_LOOKBACK
})

print("Run Metadata:")
print("-" * 60)
for col in metadata.columns:
    print(f"{col:15s}: {metadata[col].values[0]}")
print("-" * 60)

In [None]:
# Save metadata to a file
metadata_path = OUTPUT_ROOT / "run_metadata.csv"
metadata.to_csv(metadata_path, index=False)

print(f"✅ Metadata saved to: {metadata_path}")
print("\n💡 Tip: Append each run's metadata to build a history of all runs")
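
The tip about appending each run's metadata can be sketched with the standard library's `csv` module. This is one possible approach, not the notebook's own method: `append_run_history` is a hypothetical helper, and it assumes every record has the same keys in the same order.

```python
import csv
from pathlib import Path


def append_run_history(history_path: Path, record: dict) -> None:
    """Append one metadata record to a CSV, writing the header only once."""
    # A missing or empty file still needs its header row
    write_header = not history_path.exists() or history_path.stat().st_size == 0
    with history_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(record))
        if write_header:
            writer.writeheader()
        writer.writerow(record)
```

With the single-row DataFrame from this section, a similar effect should be achievable with `metadata.to_csv(path, mode="a", header=not path.exists(), index=False)`.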

---

## 12.11 Version Control with Git

### What is Version Control?

**Version control** is a system that tracks changes to files over time. Think of it like "Track Changes" in Word, but for your entire project.

**Git** is the most popular version control system. It allows you to:
- 📜 **Track history**: See every change ever made
- ↩️ **Undo mistakes**: Revert to previous versions
- 🌿 **Branch**: Work on new features without breaking existing code
- 👥 **Collaborate**: Multiple people can work on the same project

### Why Version Control Matters for Data Analysis

As a data analyst, version control helps you:
1. **Explain changes**: "Why did the numbers change?" → Check the commit history
2. **Reproduce results**: Go back to the exact code that generated a report
3. **Experiment safely**: Try new approaches without losing working code
4. **Collaborate**: Share analysis with teammates

### Essential Git Commands

Here are the most important Git commands for beginners:

| Command | Purpose | Example |
|---------|---------|---------|
| `git init` | Create a new repository | `git init` |
| `git status` | See what's changed | `git status` |
| `git add` | Stage changes for commit | `git add analysis.py` |
| `git commit` | Save a snapshot | `git commit -m "Add sales report"` |
| `git log` | View history | `git log --oneline` |
| `git diff` | See what changed | `git diff analysis.py` |

### Git Workflow for Analysis Projects

```
1. Make changes to your code
2. git add <files>       # Stage your changes
3. git commit -m "..."   # Save with a message
4. Repeat!
```

> 💡 **Tip:** Commit often with descriptive messages. Instead of "updated code", write "Add regional filter to sales report".

In [None]:
# You can run Git commands from Python using subprocess
# Here's how to capture the current Git commit hash for reproducibility

import subprocess

def get_git_info() -> dict:
    """
    Get current Git repository information for reproducibility.
    
    Returns:
    --------
    dict
        Dictionary with commit_hash, branch, and status
    """
    info = {
        "commit_hash": "not-in-git-repo",
        "branch": "unknown",
        "has_uncommitted_changes": None
    }
    
    try:
        # Get current commit hash
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, timeout=5
        )
        if result.returncode == 0:
            info["commit_hash"] = result.stdout.strip()[:8]  # Short hash
        
        # Get current branch
        result = subprocess.run(
            ["git", "branch", "--show-current"],
            capture_output=True, text=True, timeout=5
        )
        if result.returncode == 0:
            info["branch"] = result.stdout.strip()
        
        # Check for uncommitted changes
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            capture_output=True, text=True, timeout=5
        )
        if result.returncode == 0:
            info["has_uncommitted_changes"] = len(result.stdout.strip()) > 0
            
    except Exception as e:
        info["error"] = str(e)
    
    return info


# Try to get Git info (will show placeholder if not in a Git repo)
git_info = get_git_info()
print("Git Repository Info:")
for key, value in git_info.items():
    print(f"  {key}: {value}")

---

## 12.12 Documentation and Code Readability

### Why Documentation Matters

> "Code is read far more often than it is written." - Guido van Rossum (creator of Python)

Good documentation helps:
- **Your future self**: You will forget why you wrote something
- **Your teammates**: Others need to understand and use your code
- **Your stakeholders**: Non-technical people may read your analysis

### Types of Documentation

1. **Code comments**: Explain *why*, not *what*
2. **Docstrings**: Describe what functions do
3. **README files**: Project overview and setup instructions
4. **Inline explanations**: Markdown cells in notebooks

### Best Practices for Readable Code

#### 1. Use Descriptive Names

```python
# ❌ Bad
x = df[df['d'] > '2024-01-01']

# ✅ Good
recent_sales = sales[sales['date'] > '2024-01-01']
```

#### 2. Write Helpful Comments

```python
# ❌ Bad - describes WHAT (obvious from code)
# Add 1 to x
x = x + 1

# ✅ Good - explains WHY
# Shift to 1-based indexing for user display
display_index = index + 1
```

#### 3. Use Docstrings for Functions

```python
def calculate_roi(revenue: float, cost: float) -> float:
    """
    Calculate Return on Investment.
    
    Parameters:
    -----------
    revenue : float
        Total revenue generated
    cost : float
        Total cost of investment
        
    Returns:
    --------
    float
        ROI as a decimal (0.5 = 50% return)
        
    Example:
    --------
    >>> calculate_roi(150, 100)
    0.5
    """
    return (revenue - cost) / cost
```
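
A side benefit of writing docstring examples in `>>>` form is that they can be checked automatically with the standard library's `doctest` module. The sketch below repeats `calculate_roi` with a shortened docstring so that it is self-contained, then runs the example and reports whether the documented output still matches.

```python
import doctest


def calculate_roi(revenue: float, cost: float) -> float:
    """
    Calculate Return on Investment.

    Example:
    --------
    >>> calculate_roi(150, 100)
    0.5
    """
    return (revenue - cost) / cost


# Collect the ">>>" examples from the docstring and execute them
finder = doctest.DocTestFinder(recurse=False)
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(
    calculate_roi, "calculate_roi", globs={"calculate_roi": calculate_roi}
):
    runner.run(test)

results = runner.summarize(verbose=False)
print(f"{results.attempted} example(s) run, {results.failed} failed")
```

In a plain script, calling `doctest.testmod()` checks every docstring in the module at once. Either way, a documented example that no longer matches the code shows up as a failure instead of silently going stale.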

#### 4. Keep Functions Small and Focused

Each function should do **one thing well**. If a function is longer than 20-30 lines, consider splitting it.
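
As an illustration of this guideline, the sketch below splits a "clean and summarize" task into three single-purpose helpers plus one short function that composes them. All names and the sample values are made up for the example, and plain lists are used instead of DataFrames to keep it self-contained.

```python
from statistics import mean


def drop_missing(values: list) -> list:
    """Remove None entries so later steps only see real numbers."""
    return [v for v in values if v is not None]


def clip_to_range(values: list, low: float, high: float) -> list:
    """Limit each value to the interval [low, high]."""
    return [min(max(v, low), high) for v in values]


def summarize(values: list) -> dict:
    """Return the count and mean of a non-empty list of numbers."""
    return {"count": len(values), "mean": mean(values)}


def clean_and_summarize(values: list, low: float, high: float) -> dict:
    """Compose the three single-purpose steps into one pipeline."""
    cleaned = clip_to_range(drop_missing(values), low, high)
    return summarize(cleaned)


summary = clean_and_summarize([10, None, 250, 40], low=0, high=100)
print(summary)
```

Each helper can be tested and reused on its own, and the composing function reads almost like a description of the process.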

#### 5. Use Type Hints

Type hints make your code self-documenting:

```python
# Without hints - unclear what types are expected
def process(data, threshold):
    ...

# With hints - clear expectations
def process(data: pd.DataFrame, threshold: float) -> pd.DataFrame:
    ...
```

In [None]:
# Example: Good vs Bad Documentation

# ❌ BAD: Undocumented, cryptic names
def proc(d, t):
    return d[d['v'] > t]


# ✅ GOOD: Clear names, docstring, type hints
def filter_by_threshold(
    data: pd.DataFrame, 
    threshold: float,
    value_column: str = "value"
) -> pd.DataFrame:
    """
    Filter DataFrame to rows where a column exceeds a threshold.
    
    Parameters:
    -----------
    data : pd.DataFrame
        Input data to filter
    threshold : float
        Cutoff value; only rows strictly greater than this are kept
    value_column : str
        Name of the column to check (default: "value")
        
    Returns:
    --------
    pd.DataFrame
        Filtered data with only rows exceeding threshold
        
    Example:
    --------
    >>> df = pd.DataFrame({'value': [1, 5, 10]})
    >>> filter_by_threshold(df, 3)
       value
    1      5
    2     10
    """
    return data[data[value_column] > threshold].copy()


# Demonstrate the well-documented function
sample_df = pd.DataFrame({"value": [10, 25, 5, 30, 15]})
result = filter_by_threshold(sample_df, threshold=12)
print("Filtered result (values > 12):")
print(result)

---

## 12.13 Complete Pipeline: One Function to Run Everything

### The Goal

Create a **single function** that generates all outputs into a timestamped folder. This makes it easy to:
- Run the same analysis repeatedly without overwriting results
- Schedule as an automated job
- Compare outputs from different runs

### Design Decisions

1. **Timestamped folders**: Each run gets its own folder (`run_20240115_093045/`)
2. **Return a result object**: The function returns paths to all generated files
3. **Parameters as arguments**: Easy to customize without editing code
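
One consequence of the `run_YYYYMMDD_HHMMSS` naming is that folder names sort alphabetically in chronological order, which makes it straightforward to locate or compare runs later. The helpers below are a hypothetical sketch built on that assumption; they are not part of the pipeline defined in this section.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def latest_run_dir(output_root: Path) -> Optional[Path]:
    """Return the most recent run_YYYYMMDD_HHMMSS folder, or None if there is none."""
    run_dirs = sorted(
        (p for p in output_root.glob("run_*") if p.is_dir()),
        key=lambda p: p.name,
    )
    # Zero-padded timestamps sort alphabetically in chronological order
    return run_dirs[-1] if run_dirs else None


def run_started_at(run_dir: Path) -> datetime:
    """Recover the run's start time from its folder name."""
    return datetime.strptime(run_dir.name, "run_%Y%m%d_%H%M%S")
```

Note that the timestamp has one-second resolution, so two runs started within the same second would share a folder.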

In [None]:
@dataclass(frozen=True)
class ReportResult:
    """
    Container for all outputs from a report run.
    
    Using a dataclass makes it easy to access results by name
    and ensures immutability (frozen=True prevents modification).
    """
    run_dir: Path           # Directory containing all outputs
    chart_path: Path        # Path to the saved chart PNG
    html_path: Path         # Path to the HTML report
    exports: list[Path]     # List of exported data files
    metadata_path: Path     # Path to run metadata CSV


def run_report(
    region: str = "ALL",
    days_lookback: int = 30,
    seed: int = 42
) -> ReportResult:
    """
    Execute the complete analysis pipeline and generate all outputs.
    
    This is the main entry point for the automated report. It:
    1. Loads data from seaborn's taxis dataset
    2. Cleans and filters based on parameters
    3. Calculates KPIs and breakdowns
    4. Saves charts, exports, and HTML report
    5. Records metadata for reproducibility
    
    Parameters:
    -----------
    region : str
        Region to filter by, or "ALL" for all regions
    days_lookback : int
        Number of recent days to include
    seed : int
        Random seed for reproducibility (used in data transformation)
        
    Returns:
    --------
    ReportResult
        Dataclass containing paths to all generated outputs
    """
    print("=" * 60)
    print("RUNNING AUTOMATED SALES REPORT")
    print("=" * 60)
    print(f"Region: {region}")
    print(f"Days Lookback: {days_lookback}")
    print(f"Seed: {seed}")
    print("=" * 60)
    
    # Create a unique folder for this run using timestamp
    run_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = OUTPUT_ROOT / f"run_{run_stamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    print(f"\n📁 Output directory: {run_dir}")
    
    # Step 1: Load data from seaborn taxis dataset
    print("\n[1/6] Loading data from seaborn taxis dataset...")
    raw = load_sales_data()
    
    # Step 2: Clean data
    print("\n[2/6] Cleaning data...")
    sales = clean_sales_data(raw)
    
    # Step 3: Filter based on parameters
    print("\n[3/6] Filtering data...")
    report_df = filter_for_report(sales, region, days_lookback)
    
    # Step 4: Calculate KPIs and breakdowns
    print("\n[4/6] Calculating KPIs...")
    kpis = kpi_summary(report_df)
    breakdowns = top_breakdowns(report_df)
    
    # Step 5: Generate outputs
    print("\n[5/6] Generating outputs...")
    
    # Save chart
    chart_path = run_dir / "daily_revenue.png"
    save_daily_revenue_plot(breakdowns["by_day"], chart_path)
    
    # Export data files
    exports = export_excel_or_csv(run_dir, report_df, kpis, breakdowns)
    
    # Generate HTML report
    html_path = export_html_report(
        run_dir,
        title=f"Sales Report ({region}, last {days_lookback} days)",
        kpis=kpis,
        by_channel=breakdowns["by_channel"],
        by_product=breakdowns["by_product"],
        chart_file=chart_path.name,
    )
    
    # Step 6: Save metadata
    print("\n[6/6] Saving metadata...")
    metadata = capture_run_metadata({
        "region": region,
        "days_lookback": days_lookback,
        "seed": seed,
        "row_count": len(report_df),
        "total_revenue": float(report_df["revenue"].sum()),
    })
    metadata_path = run_dir / "run_metadata.csv"
    metadata.to_csv(metadata_path, index=False)
    print(f"✅ Metadata saved: {metadata_path.name}")
    
    print("\n" + "=" * 60)
    print("✅ REPORT COMPLETE!")
    print("=" * 60)
    
    return ReportResult(
        run_dir=run_dir,
        chart_path=chart_path,
        html_path=html_path,
        exports=exports,
        metadata_path=metadata_path,
    )


# Run the complete pipeline
result = run_report(region=REPORT_REGION, days_lookback=DAYS_LOOKBACK, seed=42)

# Show all generated files
print("\n📋 Generated Files:")
for file in result.run_dir.iterdir():
    size_kb = file.stat().st_size / 1024
    print(f"   📄 {file.name} ({size_kb:.1f} KB)")

---

## 12.14 Scheduling Analytics Tasks

### Why Scheduling?

Scheduling allows you to run reports automatically:
- 📅 **Daily reports**: Fresh data every morning
- 📆 **Weekly summaries**: Monday briefings
- 🌙 **Overnight processing**: Heavy computations while you sleep

### Scheduling Options

#### Windows: Task Scheduler

Windows Task Scheduler can run Python scripts on a schedule.

**Steps:**
1. Export your analysis to a `.py` script (see next section)
2. Open Task Scheduler (search "Task Scheduler" in Start menu)
3. Create a new task with:
   - Trigger: Daily at 7:00 AM
   - Action: Run your Python script

#### macOS/Linux: Cron

Cron is the Unix scheduler. Edit with `crontab -e`:

```bash
# Run every day at 7 AM
0 7 * * * /path/to/python /path/to/report_script.py
```

#### Cloud Options

For production systems, consider:
- **GitHub Actions**: Free for public repos
- **AWS Lambda** + Amazon EventBridge (formerly CloudWatch Events): Serverless scheduling
- **Apache Airflow**: Complex pipeline orchestration

### Best Practices for Scheduled Scripts

1. **Use absolute paths**: Scheduled tasks often start in a different working directory than the one you tested in
2. **Log errors**: Write errors to a log file for debugging
3. **Use timestamped folders**: Don't overwrite previous runs
4. **Send notifications**: Email on success/failure
5. **Test manually first**: Run the script by hand before scheduling
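
For the notification item, the message itself can be built with the standard library's `email.message.EmailMessage`. The sketch below only constructs the message and does not send it. The function name, addresses, and paths are placeholders chosen for illustration.

```python
from email.message import EmailMessage
from typing import Optional


def build_notification(
    success: bool,
    run_dir: str,
    error: Optional[str] = None,
    sender: str = "reports@example.com",      # hypothetical address
    recipient: str = "analyst@example.com",   # hypothetical address
) -> EmailMessage:
    """Build (but do not send) a status email for a scheduled run."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Sales report {'succeeded' if success else 'FAILED'}"
    if success:
        msg.set_content(f"Outputs were written to: {run_dir}")
    else:
        msg.set_content(f"The run failed with: {error}\nSee the log file for details.")
    return msg


msg = build_notification(
    success=False, run_dir="/path/to/outputs", error="FileNotFoundError"
)
print(msg["Subject"])

# Sending depends on your environment. With an SMTP server you control,
# it would look roughly like this (host and credentials are assumptions):
# import smtplib
# with smtplib.SMTP("smtp.example.com", 587) as server:
#     server.starttls()
#     server.login(user, password)
#     server.send_message(msg)
```

Keeping message construction separate from sending makes the notification easy to test without a mail server.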

In [None]:
# Example: Converting this notebook to a schedulable script
# The code below shows what a production script might look like

SCRIPT_TEMPLATE = '''#!/usr/bin/env python3
"""
Automated Sales Report Generator

This script generates a sales report and can be scheduled to run automatically.
Usage: python report_script.py --region North --days 30

Author: Your Name
Date: 2024-01-01
"""

import argparse
import logging
from pathlib import Path
from datetime import datetime

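# Set up logging next to this script (an absolute path, so it does not
# depend on the working directory the scheduler happens to use)
LOG_DIR = Path(__file__).resolve().parent / "logs"
LOG_DIR.mkdir(exist_ok=True)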
logging.basicConfig(
    filename=LOG_DIR / f"report_{datetime.now():%Y%m%d}.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

def main():
    # Parse command-line arguments
    parser = argparse.ArgumentParser(description="Generate sales report")
    parser.add_argument("--region", default="ALL", help="Region to filter")
    parser.add_argument("--days", type=int, default=30, help="Days lookback")
    args = parser.parse_args()
    
    logging.info(f"Starting report: region={args.region}, days={args.days}")
    
    try:
        # Your report code here...
        # result = run_report(region=args.region, days_lookback=args.days)
        logging.info("Report completed successfully")
        
    except Exception:
        # logging.exception records the full traceback, not just the message
        logging.exception("Report failed")
        raise

if __name__ == "__main__":
    main()
'''

print("Example Script Template:")
print("-" * 60)
print(SCRIPT_TEMPLATE[:800] + "...")
print("-" * 60)
print("\n💡 Save this template as 'report_script.py' and customize for your needs")

---

## 12.15 Exercises

### Exercise 1: Change Parameters
Set `REPORT_REGION = "ALL"` at the top of this notebook and rerun the complete pipeline. 
- How do the KPIs change?
- How many more rows are in the filtered data?

### Exercise 2: Add a New Breakdown
Modify the `top_breakdowns()` function to add a breakdown by `region` (useful when REPORT_REGION is "ALL").
- Add `by_region` to the returned dictionary
- Update the HTML report to include this new table

### Exercise 3: Add a New KPI
Add `median_order_value` to the `kpi_summary()` function.
- Hint: Use `df["revenue"].median()`
- Why might median be better than mean for order values?

### Exercise 4: Improve Missing Value Handling
The current cleaning function fills missing prices with the overall median. Improve it to use the **product-specific median** instead.
- Hint: Use `groupby('product')['unit_price'].transform('median')`

### Exercise 5: Add Git Commit to Metadata
Modify the `capture_run_metadata()` function to include the Git commit hash (use the `get_git_info()` function we created).

### Mini-Project: Create a Standalone Script
Convert this notebook into a Python script (`chapter12_report.py`) that:
1. Accepts command-line arguments for region and days_lookback
2. Logs progress to a file
3. Sends an email notification when complete (advanced)

<details>
<summary>💡 Hints for the Mini-Project</summary>

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--region", default="ALL")
parser.add_argument("--days", type=int, default=30)
args = parser.parse_args()

result = run_report(region=args.region, days_lookback=args.days)
```
</details>

---

## Summary and Key Takeaways

### What We Learned

In this chapter, you learned how to transform manual, one-off analyses into **automated, reproducible pipelines**.

### Key Concepts

| Concept | What It Means | Why It Matters |
|---------|---------------|----------------|
| **Automation** | Using functions and code to perform repetitive tasks | Saves time, reduces errors, enables scale |
| **Parameterization** | Defining inputs at the top, not scattered in code | Easy to change, clear documentation |
| **Exporting** | Saving results as files (CSV, Excel, HTML, PDF) | Shareable deliverables, historical records |
| **Reproducibility** | Recording metadata so results can be recreated | Trust, collaboration, debugging |
| **Version Control** | Tracking code changes with Git | History, undo, collaboration |
| **Documentation** | Explaining code with comments and docstrings | Future you, teammates, stakeholders |
| **Scheduling** | Running scripts automatically at set times | Daily reports, overnight processing |

### Best Practices Checklist

✅ **Automation**
- [ ] Wrap analysis steps in reusable functions
- [ ] Use a single "main" function for the complete pipeline
- [ ] Return structured results (dataclasses, dictionaries)

✅ **Parameterization**
- [ ] Define all configurable values at the top
- [ ] Use descriptive parameter names
- [ ] Support command-line arguments for scripts

✅ **Exporting**
- [ ] Save outputs to a dedicated folder (not mixed with source code)
- [ ] Use timestamped folders for multiple runs
- [ ] Export in formats your stakeholders need (Excel, HTML, PDF)

✅ **Reproducibility**
- [ ] Record run timestamp and parameters
- [ ] Include Python and library versions
- [ ] Track Git commit hash when available

✅ **Documentation**
- [ ] Write docstrings for all functions
- [ ] Use meaningful variable names
- [ ] Add type hints for clarity

### The Automation Mindset

> "If you do something more than twice, automate it."

Start thinking about your analysis as a **pipeline**, not a one-time script:

```
Data → Clean → Analyze → Visualize → Report → Share
  │        │        │          │         │
  └────────┴────────┴──────────┴─────────┘
           All reusable, all automated
```

---

## Additional Resources

### Official Documentation

- **Pandas I/O (CSV/Excel):** https://pandas.pydata.org/docs/user_guide/io.html
- **Pandas GroupBy:** https://pandas.pydata.org/docs/user_guide/groupby.html
- **Matplotlib Saving Figures:** https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html
- **Python pathlib:** https://docs.python.org/3/library/pathlib.html

### Version Control

- **Git Handbook (GitHub):** https://guides.github.com/introduction/git-handbook/
- **Git Tutorial (Atlassian):** https://www.atlassian.com/git/tutorials
- **Interactive Git Learning:** https://learngitbranching.js.org/

### Scheduling

- **Windows Task Scheduler:** https://learn.microsoft.com/windows/win32/taskschd/task-scheduler-start-page
- **Cron Tutorial:** https://crontab.guru/
- **GitHub Actions for Scheduling:** https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule

### Reproducibility

- **The Turing Way (Reproducible Research):** https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html
- **Cookiecutter Data Science:** https://drivendata.github.io/cookiecutter-data-science/

### PDF Generation (Advanced)

- **fpdf2 Documentation:** https://pyfpdf.github.io/fpdf2/
- **WeasyPrint (HTML to PDF):** https://weasyprint.org/

---

## End of Chapter 12

🎉 **Congratulations!** You've learned how to build automated, reproducible analysis pipelines.

### What's Next?

In **Part B** of this book, we'll apply these skills to real-world data analytics projects, starting with:
- **Chapter 13:** Problem Definition and Analytical Frameworks
- **Chapter 14:** Data Collection, Integration, and Understanding

### Try This

1. Run the complete pipeline with different parameters
2. Open the generated HTML report in your browser
3. Explore the output files in the `outputs/chapter_12/` folder

---

*"The best code is code you don't have to run manually."*