# Chapter 23: Applied Data Analytics Case Studies

In the previous chapters, you learned individual skills — cleaning data, exploring patterns, building visualizations, and applying statistical methods. Now it's time to put everything together.

This chapter presents **four complete case studies** that walk you through real-world analytics scenarios from start to finish. Each case study follows a consistent workflow, demonstrating how professional analysts approach problems in business, marketing, finance, and operations.

**Why case studies matter:**
- They show how different skills connect in practice
- They expose you to realistic data issues and trade-offs
- They help you build intuition for what questions to ask
- They give you templates you can adapt for your own projects

**What you'll work with:**
| Case Study | Domain | Key Question |
|------------|--------|--------------|
| 1 | Retail/Business | Which product categories drive revenue and profit? |
| 2 | Marketing | Which channels and messages drive conversions efficiently? |
| 3 | Finance | Which loan applicants are higher risk? |
| 4 | Operations | Which products are likely to run out of stock? |

Each case study uses synthetic data so you can experiment freely without privacy concerns. The patterns and challenges, however, mirror what you'll encounter with real data.

---

## Learning Goals

By the end of this chapter, you will be able to:

- **Follow a repeatable analytics workflow** from problem definition to recommendation
- **Use Pandas** to clean, join, and summarize data across different domains
- **Create clear visualizations** that support conclusions and tell a story
- **Build simple baselines** and explain results in plain, non-technical language
- **Recognize common mistakes** (data leakage, small sample traps, metric confusion) and avoid them
- **Translate analytical findings into business recommendations**

> **Note:** This chapter assumes you're comfortable with the basics of Pandas, Matplotlib, and NumPy covered in earlier chapters. If you need a refresher, refer back to Chapters 3–5.

In [None]:
# Core libraries for this chapter
import numpy as np
import pandas as pd
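import matplotlib.pyplot as plt
from IPython.display import display  # used by quick_check(); explicit so it also works outside a notebook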

plt.style.use('seaborn-v0_8')

# Reproducibility: set a random seed so your results are repeatable
RNG = np.random.default_rng(42)

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

---
## 23.1 End-to-end analytics workflow (a reusable template)
In real projects, the hardest part is often **not** writing code — it's keeping the work structured and explainable.

Here is a simple template you can reuse:

1) **Problem statement**: What decision will this analysis inform?
2) **Success metric**: How will we measure improvement?
3) **Data**: What tables/files do we have? What is missing?
4) **Data cleaning**: Fix missing values, types, duplicates, outliers
5) **EDA**: Summaries + visuals to understand patterns
6) **Analysis/model**: Start with a baseline; keep it interpretable
7) **Communication**: Insights + recommended actions + limitations

### Tip
A good beginner habit is writing down your assumptions before you code. It prevents you from accidentally answering a different question than the one you meant to answer.

We’ll use the same helper functions across case studies:
- `quick_check(df)`: quick data quality scan
- `plot_hist(df, col)`: quick distribution check
- `train_test_split_time(df, time_col, split)`: time-based split (avoids leakage)
- `rmse(y_true, y_pred)`: root mean squared error, used to score forecasts

In [None]:
def quick_check(df: pd.DataFrame, name: str = 'df') -> pd.DataFrame:
    """A quick, beginner-friendly quality report."""
    report = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'missing_%': (df.isna().mean() * 100).round(2),
        'n_unique': df.nunique(dropna=True)
    })
    print(f'[{name}] shape = {df.shape}')
    display(df.head(5))
    return report.sort_values(by=['missing', 'n_unique'], ascending=False)

def plot_hist(df: pd.DataFrame, col: str, bins: int = 30, title: str | None = None):
    data = df[col].dropna()
    plt.figure(figsize=(7, 4))
    plt.hist(data, bins=bins)
    plt.xlabel(col)
    plt.ylabel('count')
    plt.title(title or f'Distribution of {col}')
    plt.show()

def train_test_split_time(df: pd.DataFrame, time_col: str, split: float = 0.8):
    """Time-based split: train is earlier, test is later."""
    df_sorted = df.sort_values(time_col).reset_index(drop=True)
    cut = int(len(df_sorted) * split)
    return df_sorted.iloc[:cut].copy(), df_sorted.iloc[cut:].copy()

def rmse(y_true, y_pred) -> float:
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

---
## 23.2 Case study 1 — Business analytics (retail orders)
### Scenario
A small online retailer asks: **Which product categories drive revenue and profit?**

### What we will do
- Generate an orders table (synthetic)
- Clean data types and handle missing values
- Compute revenue and profit
- Summarize by category and visualize top drivers

### Common beginner mistake
Mixing up **revenue** (money coming in) and **profit** (revenue minus cost). They can tell very different stories.
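
A tiny made-up example makes the difference concrete. The two categories and all numbers below are hypothetical, chosen only to show that the ranking by revenue and the ranking by profit can disagree.

```python
import pandas as pd

# Hypothetical totals for two categories (illustrative numbers only)
toy = pd.DataFrame(
    {'revenue': [10_000.0, 4_000.0], 'cost': [9_000.0, 2_000.0]},
    index=['Gadgets', 'Candles'],
)
toy['profit'] = toy['revenue'] - toy['cost']
toy['profit_margin_%'] = (toy['profit'] / toy['revenue'] * 100).round(1)

top_by_revenue = toy['revenue'].idxmax()
top_by_profit = toy['profit'].idxmax()

print(toy)
print('Top by revenue:', top_by_revenue)
print('Top by profit: ', top_by_profit)
```

Here the category with the larger revenue earns the smaller profit, which is why the case study reports both.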

In [None]:
# Create a synthetic retail dataset
n = 2000
categories = ['Electronics', 'Home', 'Clothing', 'Beauty', 'Sports']

orders = pd.DataFrame({
    'order_id': np.arange(1, n + 1),
    'order_date': pd.to_datetime('2025-01-01') + pd.to_timedelta(RNG.integers(0, 365, size=n), unit='D'),
    'category': RNG.choice(categories, size=n, p=[0.18, 0.27, 0.25, 0.15, 0.15]),
    'units': RNG.integers(1, 6, size=n),
    # unit_price and unit_cost are placeholders here; they are filled per category below
    'unit_price': 0.0,
    'unit_cost': 0.0,
    'returned': RNG.choice([0, 1], size=n, p=[0.92, 0.08])
})

price_map = {
    'Electronics': (220, 0.72),
    'Home': (60, 0.65),
    'Clothing': (35, 0.55),
    'Beauty': (25, 0.50),
    'Sports': (55, 0.62),
}

# Generate price/cost with noise
for cat, (base_price, cost_ratio) in price_map.items():
    mask = orders['category'] == cat
    unit_price = np.maximum(5, RNG.normal(base_price, base_price * 0.15, size=mask.sum()))
    unit_cost = unit_price * cost_ratio * RNG.normal(1.0, 0.05, size=mask.sum())
    orders.loc[mask, 'unit_price'] = unit_price
    orders.loc[mask, 'unit_cost'] = unit_cost

# Inject some realistic data issues
bad_idx = RNG.choice(orders.index, size=25, replace=False)
orders.loc[bad_idx, 'unit_price'] = np.nan
orders.loc[RNG.choice(orders.index, size=10, replace=False), 'category'] = None

orders.head()

### Step 1: Quick data check
Before analysis, look for problems: missing values, wrong types, unrealistic values.

In [None]:
quality_report = quick_check(orders, 'orders')
quality_report

### Step 2: Cleaning decisions (and why)
We need consistent data to compute revenue/profit. We'll do simple, explainable fixes:
- Missing `category`: drop those rows (we can’t group them)
- Missing `unit_price`: fill with the **median price of that category** (robust to outliers)

**Warning:** Filling missing values can bias results. Always report what you filled and how many rows were affected.
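
One way to follow that advice is to record which rows were missing before filling, then report the counts. The sketch below uses a hypothetical six-row table (the values and the `was_missing` name are assumptions for illustration), not the `orders` data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with two missing prices (values are illustrative)
toy = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'unit_price': [10.0, 12.0, np.nan, 50.0, np.nan, 54.0],
})

# Record which rows are missing *before* filling, so the count can be reported
was_missing = toy['unit_price'].isna()

# Fill each gap with the median price of its own category
toy['unit_price'] = toy['unit_price'].fillna(
    toy.groupby('category')['unit_price'].transform('median')
)

# Number of filled rows per category
filled_report = was_missing.groupby(toy['category']).sum()

print(f'Filled {int(was_missing.sum())} of {len(toy)} rows '
      f'({was_missing.mean():.1%}) with the category median')
print(filled_report)
```

The same pattern can be applied to `orders_clean` by computing the mask just before the `fillna` call.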

In [None]:
orders_clean = orders.dropna(subset=['category']).copy()

# Fill missing unit_price using median per category
orders_clean['unit_price'] = orders_clean['unit_price'].fillna(
    orders_clean.groupby('category')['unit_price'].transform('median')
)

# Basic sanity checks
orders_clean = orders_clean[
    (orders_clean['units'] > 0)
    & (orders_clean['unit_price'] > 0)
    & (orders_clean['unit_cost'] > 0)
]

quick_check(orders_clean, 'orders_clean')

### Step 3: Feature engineering (revenue & profit)
We create new columns that represent business concepts:
- `revenue = units × unit_price`
- `cost = units × unit_cost`
- `profit = revenue − cost`

We also handle returns: in a real system, returns often reduce revenue and profit. For simplicity, we'll treat returned orders as contributing **zero revenue, zero cost, and zero profit** (the code sets all three columns to 0 for returned rows).

In [None]:
orders_clean['revenue'] = orders_clean['units'] * orders_clean['unit_price']
orders_clean['cost'] = orders_clean['units'] * orders_clean['unit_cost']
orders_clean['profit'] = orders_clean['revenue'] - orders_clean['cost']

orders_clean.loc[orders_clean['returned'] == 1, ['revenue', 'cost', 'profit']] = 0

orders_clean[['order_id', 'category', 'units', 'unit_price', 'unit_cost', 'returned', 'revenue', 'profit']].head()

### Step 4: Summarize and visualize
We aggregate by category to find what drives the business.

In [None]:
summary_cat = (
    orders_clean.groupby('category')
    .agg(orders=('order_id', 'count'),
         units=('units', 'sum'),
         revenue=('revenue', 'sum'),
         profit=('profit', 'sum'),
         return_rate=('returned', 'mean'))
    .sort_values('revenue', ascending=False)
)

summary_cat['profit_margin_%'] = (summary_cat['profit'] / summary_cat['revenue'].replace(0, np.nan) * 100).round(2)
summary_cat['return_rate_%'] = (summary_cat['return_rate'] * 100).round(2)
summary_cat.drop(columns=['return_rate'])

In [None]:
# Visual: revenue vs profit by category
plot_df = summary_cat.sort_values('revenue', ascending=True)

fig, ax = plt.subplots(figsize=(8, 4))
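# Profit is drawn on top of revenue (overlaid, not stacked), so each bar shows profit as a share of revenue
ax.barh(plot_df.index, plot_df['revenue'], label='Revenue')
ax.barh(plot_df.index, plot_df['profit'], label='Profit')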
ax.set_title('Revenue and Profit by Category')
ax.set_xlabel('Amount')
ax.legend()
plt.show()

### Interpretation (example)
When you interpret results, try to write in **business language**:
- Which categories bring the most revenue?
- Which categories bring the most profit (not always the same)?
- Do returns look unusually high in any category?

**Tip:** Always check profit margin. A category can have high revenue but low profit if costs are high.

### Exercise 1 (practice)
1) Find the top 2 categories by **profit margin** (not by total profit).
2) Add a column `avg_order_value` = revenue / orders.
3) Create a bar chart of `return_rate_%` by category.

Try to answer: *If you could improve only one thing, would you focus on returns or margin?*

In [None]:
# Your code here
# Hint: profit_margin_% already exists in summary_cat
exercise1 = summary_cat.copy()
exercise1['avg_order_value'] = (exercise1['revenue'] / exercise1['orders']).round(2)

top2_margin = exercise1.sort_values('profit_margin_%', ascending=False).head(2)
display(top2_margin[['revenue', 'profit', 'profit_margin_%', 'avg_order_value', 'return_rate_%']])

plt.figure(figsize=(7, 4))
plt.bar(exercise1.index, exercise1['return_rate_%'])
plt.title('Return Rate by Category')
plt.ylabel('Return rate (%)')
plt.xticks(rotation=30, ha='right')
plt.show()

---
## 23.3 Case study 2 — Marketing analytics (campaign performance)
### Scenario
A marketing team ran several campaigns and asks: **Which channels and messages drive conversions efficiently?**

### Key metrics
- **CTR** (click-through rate): clicks / impressions
- **Conversion rate**: conversions / clicks
- **CPA** (cost per acquisition): spend / conversions

### Common beginner mistake
Comparing conversion rates without checking sample size. A channel with 2 clicks and 1 conversion has a 50% conversion rate, but that estimate is not reliable.
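
To see how much uncertainty hides behind a rate, one common choice is a confidence interval for a proportion. The sketch below hand-codes an approximate 95% Wilson score interval; the helper name `wilson_interval` and the 100-out-of-200 comparison case are assumptions added for illustration.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Same observed rate (50%), very different amounts of evidence
small = wilson_interval(1, 2)       # 1 conversion out of 2 clicks
large = wilson_interval(100, 200)   # 100 conversions out of 200 clicks

print('2 clicks:  ', tuple(round(x, 3) for x in small))
print('200 clicks:', tuple(round(x, 3) for x in large))
```

With 2 clicks the interval runs from roughly 0.09 to 0.91, so the data are compatible with almost any conversion rate. With 200 clicks it narrows to roughly 0.43 to 0.57.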

In [None]:
# Synthetic marketing campaign dataset
days = pd.date_range('2025-07-01', periods=120, freq='D')
channels = ['Search', 'Social', 'Email', 'Display']
messages = ['Discount', 'Free Shipping', 'New Arrival']

rows = []
for d in days:
    for ch in channels:
        for msg in messages:
            impressions = int(np.maximum(50, RNG.normal(800 if ch in ['Search', 'Social'] else 500, 180)))
            base_ctr = {'Search': 0.035, 'Social': 0.025, 'Email': 0.06, 'Display': 0.012}[ch]
            msg_boost = {'Discount': 1.15, 'Free Shipping': 1.05, 'New Arrival': 0.95}[msg]
            ctr = np.clip(base_ctr * msg_boost * RNG.normal(1.0, 0.12), 0.001, 0.2)
            clicks = RNG.binomial(impressions, ctr)

            # conversion probability given a click
            base_cvr = {'Search': 0.06, 'Social': 0.035, 'Email': 0.07, 'Display': 0.02}[ch]
            cvr = np.clip(base_cvr * msg_boost * RNG.normal(1.0, 0.12), 0.001, 0.4)
            conversions = RNG.binomial(clicks, cvr)

            # spend: depends on channel and volume
            cpc = {'Search': 1.8, 'Social': 1.1, 'Email': 0.2, 'Display': 0.6}[ch]
            spend = clicks * cpc * float(np.clip(RNG.normal(1.0, 0.08), 0.7, 1.3))

            rows.append((d, ch, msg, impressions, clicks, conversions, spend))

marketing = pd.DataFrame(rows, columns=['date', 'channel', 'message', 'impressions', 'clicks', 'conversions', 'spend'])

# Inject a common data issue: missing spend
marketing.loc[RNG.choice(marketing.index, size=20, replace=False), 'spend'] = np.nan

marketing.head()

### Step 1: Clean and create metrics
We compute metrics carefully to avoid division-by-zero. We'll fill missing spend using the median spend for the same channel (simple and explainable).

In [None]:
marketing_clean = marketing.copy()
marketing_clean['spend'] = marketing_clean['spend'].fillna(
    marketing_clean.groupby('channel')['spend'].transform('median')
)

# Safe divisions: replace 0 denominators with NaN so the ratio is NaN instead of inf
marketing_clean['ctr'] = marketing_clean['clicks'] / marketing_clean['impressions'].replace(0, np.nan)
marketing_clean['conversion_rate'] = marketing_clean['conversions'] / marketing_clean['clicks'].replace(0, np.nan)
marketing_clean['cpa'] = marketing_clean['spend'] / marketing_clean['conversions'].replace(0, np.nan)

quick_check(marketing_clean, 'marketing_clean')

### Step 2: Summarize performance
We group by `channel` and `message`, then look for:
- High conversions
- Good conversion rate
- Low CPA

**Warning:** CPA can be misleading when conversions are very low. Always check conversion counts.

In [None]:
perf = (
    marketing_clean
    .groupby(['channel', 'message'])
    .agg(impressions=('impressions', 'sum'),
         clicks=('clicks', 'sum'),
         conversions=('conversions', 'sum'),
         spend=('spend', 'sum'))
    .reset_index()
)

perf['ctr'] = perf['clicks'] / perf['impressions'].replace(0, np.nan)
perf['conversion_rate'] = perf['conversions'] / perf['clicks'].replace(0, np.nan)
perf['cpa'] = perf['spend'] / perf['conversions'].replace(0, np.nan)

perf.sort_values(['conversions', 'cpa'], ascending=[False, True]).head(10)

In [None]:
# Visual: CPA by channel (overall)
overall = (
    marketing_clean
    .groupby('channel')
    .agg(conversions=('conversions', 'sum'), spend=('spend', 'sum'))
    .assign(cpa=lambda d: d['spend'] / d['conversions'].replace(0, np.nan))
    .sort_values('cpa')
)

plt.figure(figsize=(7, 4))
plt.bar(overall.index, overall['cpa'])
plt.title('Cost per Acquisition (CPA) by Channel')
plt.ylabel('CPA (lower is better)')
plt.show()

overall

### Exercise 2 (practice)
1) Add a filter to keep only channel-message pairs with at least **200 conversions**.
2) Among those, find the lowest CPA.
3) Plot conversion rate by message for the best channel.

Goal: practice comparing options while avoiding small-sample traps.

In [None]:
# Your code here
filtered = perf[perf['conversions'] >= 200].copy()
best = filtered.sort_values('cpa').head(1)
display(best)

best_channel = best['channel'].iloc[0]
subset = filtered[filtered['channel'] == best_channel].sort_values('message')

plt.figure(figsize=(7, 4))
plt.bar(subset['message'], subset['conversion_rate'])
plt.title(f'Conversion Rate by Message (Channel: {best_channel})')
plt.ylabel('Conversion rate')
plt.show()

---
## 23.4 Case study 3 — Finance analytics (credit risk screening)
### Scenario
A lender wants to reduce defaults. The question is: **Which applicants are higher risk?**

We will:
- Build a simple risk score (baseline model)
- Evaluate with accuracy, precision, recall, and confusion-matrix counts
- Discuss why accuracy alone can be misleading

### Important note (ethics)
Real credit models are heavily regulated and must be fair and explainable. This example is educational and uses synthetic data.

In [None]:
# Synthetic applicant data
m = 2500
apps = pd.DataFrame({
    'applicant_id': np.arange(1, m + 1),
    'age': RNG.integers(21, 70, size=m),
    'annual_income': np.maximum(12000, RNG.normal(52000, 18000, size=m)),
    'debt_to_income': np.clip(RNG.normal(0.28, 0.12, size=m), 0.02, 0.95),
    'credit_history_years': np.clip(RNG.normal(7.0, 4.0, size=m), 0, 30),
    'num_late_payments': np.clip(RNG.poisson(1.0, size=m), 0, 12),
})

# A synthetic probability of default: higher with high DTI and many late payments
logit = (
    -3.0
    + 3.8 * apps['debt_to_income']
    + 0.18 * apps['num_late_payments']
    - 0.04 * (apps['credit_history_years'])
    - 0.00001 * (apps['annual_income'])
)
p_default = 1 / (1 + np.exp(-logit))
apps['defaulted'] = RNG.binomial(1, np.clip(p_default, 0.001, 0.9))

# Inject missing values
apps.loc[RNG.choice(apps.index, size=20, replace=False), 'annual_income'] = np.nan
apps.head()

### Step 1: Clean and inspect
We fill missing income with the median income. This is a simple baseline. In real finance work, missingness might be informative and needs careful handling.
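
If missingness might carry information, a simple precaution is to keep a flag column before filling. This is a minimal sketch on a hypothetical five-row table; the column name `income_missing` is an assumption, not something the case study uses later.

```python
import numpy as np
import pandas as pd

# Hypothetical applicants; two incomes are missing (values are illustrative)
toy = pd.DataFrame({'annual_income': [40_000.0, np.nan, 55_000.0, np.nan, 70_000.0]})

# Keep a flag *before* filling so the fact "income was missing" is not lost
toy['income_missing'] = toy['annual_income'].isna().astype(int)

# Then fill with the median of the observed incomes
toy['annual_income'] = toy['annual_income'].fillna(toy['annual_income'].median())

print(toy)
```

The flag can later be compared against the outcome (for example, the default rate among flagged rows) to check whether missing income is related to risk.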

In [None]:
apps_clean = apps.copy()
apps_clean['annual_income'] = apps_clean['annual_income'].fillna(apps_clean['annual_income'].median())

quick_check(apps_clean, 'apps_clean')

### Step 2: Build a simple risk score (interpretable baseline)
Instead of jumping to a complex model, we will build a **rule-based score**.

Why?
- Easy to explain
- Helps you understand which variables matter
- Gives a baseline to beat later

We’ll score applicants with higher risk if they have:
- high debt-to-income
- many late payments
- short credit history

In [None]:
apps_scored = apps_clean.copy()
apps_scored['risk_score'] = (
    60 * apps_scored['debt_to_income']
    + 6 * apps_scored['num_late_payments']
    - 1.2 * apps_scored['credit_history_years']
)

# Higher score => higher predicted risk
plot_hist(apps_scored, 'risk_score', bins=40, title='Risk Score Distribution')
apps_scored[['debt_to_income', 'num_late_payments', 'credit_history_years', 'risk_score', 'defaulted']].head()

### Step 3: Choose a decision threshold and evaluate
We need a threshold like: “If risk_score ≥ X, predict default”.

In real life, the threshold depends on business costs:
- false negative (miss a risky applicant)
- false positive (reject a good applicant)
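
As a rough illustration of a cost-based choice, the sketch below picks the threshold that minimizes total cost on hypothetical scores. The cost values `COST_FN` and `COST_FP`, the synthetic scores, and the candidate quantiles are all assumptions made up for this example. The case study itself selects the threshold by F1 instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: about 15% defaults, and defaulters tend to score higher
y_true = rng.binomial(1, 0.15, size=2000)
score = rng.normal(10 + 8 * y_true, 5)

# Assumed business costs (made up): a missed default costs 5x a wrongly rejected applicant
COST_FN = 5.0
COST_FP = 1.0

def total_cost(threshold: float) -> float:
    """Total cost of predicting 'default' whenever score >= threshold."""
    pred = score >= threshold
    fn = np.sum((y_true == 1) & ~pred)   # risky applicants we missed
    fp = np.sum((y_true == 0) & pred)    # good applicants we rejected
    return float(COST_FN * fn + COST_FP * fp)

# Try a few candidate thresholds and keep the cheapest one
candidates = np.quantile(score, [0.5, 0.6, 0.7, 0.8, 0.9])
costs = [total_cost(t) for t in candidates]
best_threshold = float(candidates[int(np.argmin(costs))])

for t, c in zip(candidates, costs):
    print(f'threshold = {t:6.2f}   total cost = {c:8.1f}')
print('Cheapest candidate:', round(best_threshold, 2))
```

Changing the ratio between the two costs moves the preferred threshold, which is the trade-off the two bullets describe.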

In [None]:
# Train-test split (random is ok here because there's no time component)
apps_scored = apps_scored.sample(frac=1.0, random_state=42).reset_index(drop=True)
cut = int(len(apps_scored) * 0.8)
train, test = apps_scored.iloc[:cut], apps_scored.iloc[cut:]

# Choose threshold using training data: try a few candidates and pick the one with the best F1
candidates = np.quantile(train['risk_score'], [0.6, 0.7, 0.8, 0.9])

def confusion(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    return tp, tn, fp, fn

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

results = []
for thr in candidates:
    pred = (train['risk_score'] >= thr).astype(int)
    tp, tn, fp, fn = confusion(train['defaulted'], pred)
    precision, recall = precision_recall(tp, fp, fn)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    results.append({'threshold': float(thr), 'accuracy': acc, 'precision': precision, 'recall': recall, 'f1': f1})

pd.DataFrame(results).sort_values('f1', ascending=False)

In [None]:
best_thr = float(pd.DataFrame(results).sort_values('f1', ascending=False).iloc[0]['threshold'])
test_pred = (test['risk_score'] >= best_thr).astype(int)
tp, tn, fp, fn = confusion(test['defaulted'], test_pred)
precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

metrics = pd.DataFrame([{
    'threshold': best_thr,
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn
}])
metrics

### Interpretation
- **Precision** answers: “When we predict default, how often are we correct?”
- **Recall** answers: “Of all true defaults, how many did we catch?”

**Tip:** In many risk problems, catching risky cases (recall) matters more than overall accuracy.
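
A small hypothetical example shows why accuracy alone can mislead. The 10-in-100 default rate and the "never predict default" rule below are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical outcomes: 10 defaults among 100 applicants (illustrative)
y_true = np.array([1] * 10 + [0] * 90)

# A "model" that never predicts default
y_pred = np.zeros_like(y_true)

tp = int(((y_true == 1) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

accuracy = float((y_true == y_pred).mean())
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f'accuracy = {accuracy:.2f}, recall = {recall:.2f}')
```

The rule is right 90% of the time yet catches none of the defaults, so it would be useless for screening.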

### Exercise 3 (practice)
1) Try a different scoring formula (change weights).
2) Compare your test precision/recall to the original.
3) In one sentence, explain the trade-off you chose.

In [None]:
# Your code here: example alternative weights
apps_alt = apps_clean.copy()
apps_alt['risk_score'] = (70 * apps_alt['debt_to_income'] + 4 * apps_alt['num_late_payments'] - 1.0 * apps_alt['credit_history_years'])

apps_alt = apps_alt.sample(frac=1.0, random_state=42).reset_index(drop=True)
train_alt, test_alt = apps_alt.iloc[:cut], apps_alt.iloc[cut:]

thr = float(np.quantile(train_alt['risk_score'], 0.8))
pred_alt = (test_alt['risk_score'] >= thr).astype(int)
tp, tn, fp, fn = confusion(test_alt['defaulted'], pred_alt)
precision_alt, recall_alt = precision_recall(tp, fp, fn)
accuracy_alt = (tp + tn) / (tp + tn + fp + fn)

pd.DataFrame([{
    'threshold': thr,
    'accuracy': accuracy_alt,
    'precision': precision_alt,
    'recall': recall_alt,
    'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn
}])

---
## 23.5 Case study 4 — Operations / supply chain (inventory & demand)
### Scenario
A warehouse team wants to avoid stockouts. The question is: **Which products are likely to run out soon?**

We will:
- Create daily demand data for multiple products
- Use a time-based train/test split (important!)
- Build a simple forecast baseline
- Convert forecast into a restock recommendation

### Common beginner mistake (data leakage)
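Randomly splitting time-series data mixes past and future, so the training set contains days that come after some test days, and the results look better than they would in practice. Always train on the earlier period and test on the later one.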
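
The sketch below compares the two kinds of split on a hypothetical 100-day series and counts how many training rows are dated after the earliest test day. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 100 consecutive days (illustrative)
toy = pd.DataFrame({'date': pd.date_range('2025-01-01', periods=100, freq='D')})
toy['demand'] = np.arange(100)

# Random split: test rows are scattered across the whole period
random_test = toy.sample(frac=0.2, random_state=0)
random_train = toy.drop(random_test.index)

# Time-based split: the last 20% of days form the test set
cut = int(len(toy) * 0.8)
time_train, time_test = toy.iloc[:cut], toy.iloc[cut:]

# Training rows dated *after* the earliest test day are "future" information
leaky_rows = int((random_train['date'] > random_test['date'].min()).sum())
clean_rows = int((time_train['date'] > time_test['date'].min()).sum())

print('Future rows in train (random split):', leaky_rows)
print('Future rows in train (time split):  ', clean_rows)
```

With the time-based split the count is zero by construction, which is the property `train_test_split_time` provides.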

In [None]:
# Synthetic daily demand for 8 products
products = [f'P{i:02d}' for i in range(1, 9)]
dates = pd.date_range('2025-01-01', periods=220, freq='D')

demand_rows = []
for p in products:
    base = RNG.integers(10, 60)
    weekly = RNG.uniform(0.1, 0.35)
    trend = RNG.uniform(-0.01, 0.02)
    for t, d in enumerate(dates):
        season = 1 + weekly * np.sin(2 * np.pi * (t % 7) / 7)
        mean = base * season * (1 + trend * t)
        qty = max(0, int(RNG.normal(mean, max(2, mean * 0.20))))
        demand_rows.append((d, p, qty))

demand = pd.DataFrame(demand_rows, columns=['date', 'product', 'demand'])
demand.head()

### Step 1: Visualize demand for one product
A quick plot helps you see patterns like seasonality or trends.

In [None]:
one = demand[demand['product'] == 'P01'].copy()
plt.figure(figsize=(10, 3))
plt.plot(one['date'], one['demand'])
plt.title('Daily Demand for Product P01')
plt.xlabel('Date')
plt.ylabel('Units demanded')
plt.show()

### Step 2: Forecast baseline (moving average)
A strong beginner baseline for forecasting is a **moving average**: predict tomorrow as the average of the last $k$ days.

Why it works (sometimes):
- Smooths random noise
- Easy to explain
- Often competitive as a baseline

We’ll evaluate using RMSE (lower is better).
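
RMSE is the square root of the average squared error, $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, so it is expressed in the same units as demand. The four-day example below uses made-up numbers to walk through the arithmetic.

```python
import numpy as np

# Hypothetical actual demand and forecasts for four days (illustrative)
actual = np.array([10.0, 12.0, 9.0, 11.0])
forecast = np.array([11.0, 10.0, 9.0, 13.0])

errors = actual - forecast          # [-1, 2, 0, -2]
mse = np.mean(errors ** 2)          # (1 + 4 + 0 + 4) / 4 = 2.25
rmse_value = float(np.sqrt(mse))    # sqrt(2.25) = 1.5

print(rmse_value)
```

Because errors are squared before averaging, the two 2-unit misses count four times as much as the 1-unit miss.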

In [None]:
def moving_average_forecast(series: pd.Series, window: int = 7) -> pd.Series:
    # Forecast for each day uses the *previous* window (shifted to avoid leakage)
    return series.shift(1).rolling(window=window).mean()

# Evaluate for each product
rows = []
for p in products:
    dfp = demand[demand['product'] == p].copy()
    train_p, test_p = train_test_split_time(dfp, 'date', split=0.8)

    train_p['pred_7'] = moving_average_forecast(train_p['demand'], window=7)
    test_p = pd.concat([train_p.tail(10), test_p], ignore_index=True)  # include a bit of history for rolling
    test_p['pred_7'] = moving_average_forecast(test_p['demand'], window=7)
    test_eval = test_p.iloc[10:].copy()

    score = rmse(test_eval['demand'], test_eval['pred_7'])
    rows.append({'product': p, 'rmse_7day_ma': score, 'avg_demand': test_eval['demand'].mean()})

scores = pd.DataFrame(rows).sort_values('rmse_7day_ma')
scores

In [None]:
# Visual: forecast vs actual for one product
p = 'P01'
dfp = demand[demand['product'] == p].copy()
train_p, test_p = train_test_split_time(dfp, 'date', split=0.8)
combined = pd.concat([train_p.tail(20), test_p], ignore_index=True)
combined['pred_7'] = moving_average_forecast(combined['demand'], window=7)

plt.figure(figsize=(10, 3))
plt.plot(combined['date'], combined['demand'], label='Actual')
plt.plot(combined['date'], combined['pred_7'], label='7-day MA forecast')
plt.title(f'Actual vs Forecast (Product {p})')
plt.xlabel('Date')
plt.ylabel('Demand')
plt.legend()
plt.show()

### Step 3: From forecast to a restock recommendation
Analytics becomes valuable when it supports decisions.

A simple rule:
- Forecast next 14 days demand
- Compare to current inventory
- If inventory is less than forecast + safety stock → recommend restock

We’ll create a toy inventory table and compute recommendations.
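
Before running the rule on the full table, here is the arithmetic for a single hypothetical product. All four input numbers are assumptions made up for illustration.

```python
import math

# Hypothetical inputs for one product (illustrative numbers only)
avg_daily_demand = 30.0   # e.g. the average of the last 7 days
horizon_days = 14
safety_stock = 60
on_hand = 400

forecast = avg_daily_demand * horizon_days          # 30 * 14 = 420
needed = forecast + safety_stock                    # 420 + 60 = 480
restock_qty = max(0, math.ceil(needed - on_hand))   # 480 - 400 = 80

print(restock_qty)
```

If `on_hand` were 500 instead, `needed - on_hand` would be negative and the `max(0, ...)` guard would return 0, meaning no restock is recommended.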

In [None]:
inventory = pd.DataFrame({
    'product': products,
    'on_hand': RNG.integers(150, 650, size=len(products)),
    'safety_stock': RNG.integers(40, 140, size=len(products))
})
inventory

In [None]:
# Forecast next 14 days using the last available moving average
horizon = 14
recs = []

for p in products:
    dfp = demand[demand['product'] == p].sort_values('date')
    # Last known 7-day average demand (as a simple constant forecast)
    last_ma = dfp['demand'].tail(7).mean()
    forecast_14 = last_ma * horizon
    recs.append({'product': p, 'forecast_14d': forecast_14})

recs = pd.DataFrame(recs)
plan = inventory.merge(recs, on='product', how='left')
plan['needed'] = plan['forecast_14d'] + plan['safety_stock']
plan['restock_qty'] = np.maximum(0, np.ceil(plan['needed'] - plan['on_hand'])).astype(int)
plan.sort_values('restock_qty', ascending=False)

### Exercise 4 (mini-project)
Pick one product and improve the forecast baseline:
1) Compare 7-day vs 14-day moving average using RMSE
2) Choose the better window
3) Recompute the restock recommendation using that window

Optional: Try a weekday/weekend average (simple seasonality).
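
For the optional idea, one possible shape is a day-of-week average: predict each test day with the training mean for the same weekday. The sketch below is self-contained and runs on hypothetical demand with a built-in weekend bump, so the data, the Poisson means, and the column names `pred_dow` and `pred_flat` are all assumptions. It restates a small `rmse` helper so that it runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical demand with a weekend bump (synthetic, illustrative)
toy = pd.DataFrame({'date': pd.date_range('2025-01-01', periods=140, freq='D')})
is_weekend = toy['date'].dt.dayofweek >= 5
toy['demand'] = rng.poisson(np.where(is_weekend, 40, 20))

# Time-based split: first 80% of days for training, the rest for testing
cut = int(len(toy) * 0.8)
train, test = toy.iloc[:cut].copy(), toy.iloc[cut:].copy()

# Day-of-week baseline: training mean for each weekday (0 = Monday)
dow_mean = train.groupby(train['date'].dt.dayofweek)['demand'].mean()
test['pred_dow'] = test['date'].dt.dayofweek.map(dow_mean)

# Flat baseline for comparison: the overall training mean
test['pred_flat'] = train['demand'].mean()

def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

rmse_dow = rmse(test['demand'], test['pred_dow'])
rmse_flat = rmse(test['demand'], test['pred_flat'])
print('Day-of-week RMSE:', round(rmse_dow, 2))
print('Flat-mean RMSE:  ', round(rmse_flat, 2))
```

On data with a weekly pattern this strong, the weekday baseline should beat the flat mean. On the chapter's `demand` table the gap may be smaller, so it is worth comparing against the moving average with the same test period.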

In [None]:
# Your code here (starter example for one product)
p = 'P02'
dfp = demand[demand['product'] == p].copy()
train_p, test_p = train_test_split_time(dfp, 'date', split=0.8)

def eval_window(window):
    combined = pd.concat([train_p.tail(window + 3), test_p], ignore_index=True)
    combined['pred'] = moving_average_forecast(combined['demand'], window=window)
    eval_part = combined.iloc[window + 3:].copy()
    return rmse(eval_part['demand'], eval_part['pred'])

rmse7 = eval_window(7)
rmse14 = eval_window(14)
rmse7, rmse14

---
## 23.6 Lessons learned (how to think like an analyst)
Across the case studies, several themes repeat:

1) **Start with the decision** (What will someone do differently?)
2) **Define metrics clearly** (Revenue vs profit, CTR vs conversion rate, etc.)
3) **Clean data intentionally** (document what you dropped/filled)
4) **Use baselines** (simple first, then improve)
5) **Avoid leakage** (especially with time-series)
6) **Communicate limitations** (synthetic data, missing variables, assumptions)

### Common mistakes checklist
- Confusing correlation with causation
- Choosing metrics that don’t match the decision
- Ignoring small sample sizes
- Overfitting (a model so complex that it fits noise in the training data and performs worse on new data)
- Forgetting to check data types and missing values

## Additional resources (optional)
- Pandas user guide: https://pandas.pydata.org/docs/user_guide/
- Matplotlib tutorials: https://matplotlib.org/stable/tutorials/
- Google Data Analytics case study examples (non-Python, but useful structure): https://www.coursera.org/professional-certificates/google-data-analytics
- Forecasting basics (time series): https://otexts.com/fpp3/

---
## Summary / Key Takeaways

Congratulations! You've completed four end-to-end analytics case studies. Here's what you should take away:

### The Analytics Workflow
A good analytics project is a **structured story**: Question → Data → Cleaning → EDA → Analysis → Decision

### Key Lessons by Case Study

| Case Study | Key Insight |
|------------|-------------|
| **Business (Retail)** | Revenue ≠ Profit — always check margins, not just totals |
| **Marketing** | Beware small samples: a high conversion rate means little when it rests on only a few clicks |
| **Finance (Credit)** | Accuracy can be misleading — precision and recall matter more for imbalanced outcomes |
| **Operations (Inventory)** | Time-series need time-based splits — random splits cause data leakage |

### Universal Principles
1. **Start simple and interpretable** — baselines are powerful and often sufficient
2. **Document your cleaning decisions** — what you dropped, filled, or transformed
3. **Visualize early and often** — patterns emerge faster through plots
4. **Always report limitations** — assumptions, missing data, and uncertainty
5. **Connect to decisions** — analytics is valuable only when it informs action

### Next Steps
- Turn one case study into a short report (1–2 pages) with charts and bullet recommendations
- Try applying the same workflow to a dataset you find interesting
- Practice explaining your analysis to someone non-technical — this builds real-world skills

> **Final Tip:** The best analysts aren't the ones with the most complex models. They're the ones who ask the right questions, clean data carefully, and communicate clearly.