# Chapter 17: Analytical Techniques and Applied Modeling

**Part B – Data Analytics Process and Methodology**

---

In this chapter, you will learn the core analytical techniques used in real-world data analytics projects. We move beyond exploration and start applying methods to answer business questions, identify patterns, and make predictions.

This chapter covers:
- **Descriptive analytics** – Summarizing what happened
- **Trend analysis** – Understanding direction over time
- **Time-series fundamentals** – Working with time-indexed data
- **Segmentation and clustering** – Grouping similar items
- **Forecasting basics** – Predicting what comes next
- **Model selection** – Choosing the right approach

These techniques form the foundation of applied analytics. Whether you're analyzing sales, customer behavior, or operational data, you'll use these methods repeatedly.

---

## Learning goals
By the end of this chapter, you will be able to:

1. Use descriptive analytics to summarize data and answer common business questions.
2. Detect and visualize trends over time.
3. Work with time-indexed data and compute rolling metrics.
4. Understand segmentation vs clustering and build a simple clustering model (if scikit-learn is available).
5. Build and compare basic forecasts (naïve, moving average, exponential smoothing).
6. Choose a modeling approach using clear criteria (goal, data, interpretability, cost of errors).

In [None]:
# Imports used throughout the chapter
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('default')
np.random.seed(42)

print('Ready! numpy:', np.__version__, '| pandas:', pd.__version__)


## Loading our dataset

We'll use the **flights** dataset from seaborn, a classic time-series dataset containing monthly airline passenger counts from 1949 to 1960 (144 observations, one per month). It is well suited to:
- Trend analysis
- Time-series fundamentals
- Forecasting basics

For the segmentation and clustering examples, we'll generate a small synthetic customer dataset in Section 17.4.

Why use real datasets?
- You can run the notebook without downloading files
- The patterns are real and well-documented
- You learn to work with actual data quirks

In [None]:
# Load the flights dataset from seaborn
flights = sns.load_dataset("flights")

# Create a proper datetime index
flights["date"] = pd.to_datetime(flights["year"].astype(str) + "-" + flights["month"].astype(str) + "-01")

# Create our time-series dataframe
ts = pd.DataFrame({
    "date": flights["date"],
    "orders": flights["passengers"],  # Rename to match our examples
    "month": flights["date"].dt.to_period("M").astype(str),
    "dow": flights["date"].dt.day_name()
})

# Sort by date
ts = ts.sort_values("date").reset_index(drop=True)

print(f"Dataset shape: {ts.shape}")
print(f"Date range: {ts['date'].min()} to {ts['date'].max()}")
ts.head()

# 17.1 Descriptive analytics techniques
Descriptive analytics answers: **“What happened?”**
It focuses on:
- Summaries (mean, median, min/max, counts)
- Comparisons (this month vs last month)
- Breakdown by categories (day-of-week, region, product)
- Basic KPIs (totals, growth rates, conversion rates)
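
As a rough illustration of the comparison and growth-rate bullets above, the sketch below computes a month-over-month change with pandas. The `kpi` table and its numbers are made up for illustration; they are not taken from the chapter's dataset.

```python
import pandas as pd

# Hypothetical monthly totals (illustrative values only)
kpi = pd.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03", "2024-04"],
    "orders": [100, 110, 99, 120],
})

# Comparison: this month vs last month
kpi["prev_orders"] = kpi["orders"].shift(1)
kpi["change"] = kpi["orders"] - kpi["prev_orders"]

# KPI: month-over-month growth rate in percent
kpi["mom_growth_pct"] = kpi["orders"].pct_change() * 100

print(kpi.round(1))
```

The first row has no previous month, so its comparison columns are missing by design.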

### Why it matters
Before you build any model, you should know what your data looks like. Many real projects stop here because the descriptive insights are already actionable.

In [None]:
# Basic descriptive stats for monthly orders (one row per month)
ts['orders'].describe()

### Grouping and aggregation
A common descriptive task is: **summarize by time bucket** (month) or category (day of week).
We use `groupby` because it makes the “split → apply → combine” workflow easy:
- Split data into groups (e.g., each month)
- Apply an aggregation (sum, mean, count)
- Combine results into a clean table

In [None]:
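# Each row of `ts` is already one month, so grouping by month would return
# the data unchanged. Group by calendar year to get a real aggregation.
yearly = (
    ts.assign(year=ts['date'].dt.year)
      .groupby('year', as_index=False)
      .agg(total_orders=('orders', 'sum'), avg_monthly_orders=('orders', 'mean'))
)
yearly.head()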

In [None]:
# Orders by month name (descriptive insight: which months are strongest?)
# Note: The flights dataset is monthly, so we analyze by month name (seasonality)
# rather than day of week

# Extract month name for seasonality analysis
ts['month_name'] = ts['date'].dt.month_name()

month_summary = (
    ts.groupby('month_name', as_index=False)
      .agg(avg_orders=('orders', 'mean'), total_orders=('orders', 'sum'), observations=('orders', 'size'))
)

# Put months in calendar order for readability
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']
month_summary['month_name'] = pd.Categorical(month_summary['month_name'], categories=month_order, ordered=True)
month_summary.sort_values('month_name')

### Visual descriptive analytics (quick plots)
Plotting is often the fastest way to spot patterns.
Below, we plot:
- Monthly orders (noisy, with a strong seasonal pattern)
- A smoother rolling average (easier to see trend)

Tip: Rolling averages are a *descriptive* technique. They do not “predict” the future by themselves.

In [None]:
# Plot monthly orders with a rolling average to smooth out variations
ts_sorted = ts.sort_values('date').copy()

# For monthly data, use a 12-month rolling average to show the trend
ts_sorted['orders_rolling_12'] = ts_sorted['orders'].rolling(window=12).mean()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(ts_sorted['date'], ts_sorted['orders'], alpha=0.5, marker='o', markersize=3, label='Monthly orders')
ax.plot(ts_sorted['date'], ts_sorted['orders_rolling_12'], linewidth=2, label='12-month rolling mean')
ax.set_title('Monthly Orders with Rolling Average')
ax.set_xlabel('Date')
ax.set_ylabel('Orders (passengers)')
ax.legend()
plt.show()

> **Common mistakes (descriptive analytics)**
> - Mixing totals and averages: a month with more days can have a higher total but a similar daily average.
> - Ignoring missing dates: time series often has gaps; you must check date continuity.
> - Using mean only: the median is often more “typical” when data has outliers.
> - Forgetting units: are we counting orders, dollars, or customers?
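
The mean-versus-median point in the list above can be seen with a handful of numbers. This is a minimal sketch using Python's standard `statistics` module; the order counts are invented for illustration.

```python
from statistics import mean, median

# Hypothetical order counts with one unusually large value
orders = [20, 22, 21, 23, 19, 200]

avg = mean(orders)        # pulled upward by the single outlier
typical = median(orders)  # middle of the sorted values, robust to the outlier

print(f"mean = {avg:.1f}, median = {typical}")
```

Here the mean lands far above every "normal" value, while the median stays close to what a typical observation looks like.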

### Exercise 17.1 (quick practice)
1. Compute the **median** orders per month.
2. Find the month with the highest **average orders**.
3. Create a bar chart of average orders by **month name** (seasonality analysis).

In [None]:
# Exercise 17.1 - Starter code (fill in the TODOs)
exercise_monthly = (
    ts.groupby('month', as_index=False)
      .agg(
          median_orders=('orders', 'median'),
          avg_orders=('orders', 'mean')
      )
)
# TODO: find the month with highest avg_orders
best_month_row = exercise_monthly.sort_values('avg_orders', ascending=False).head(1)
display(best_month_row)

# TODO: bar chart of avg orders by month name (seasonality)
month_plot = month_summary.sort_values('month_name')
fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(month_plot['month_name'].astype(str), month_plot['avg_orders'])
ax.set_title('Average Orders by Month (Seasonality)')
ax.set_xlabel('Month')
ax.set_ylabel('Average Orders')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# 17.2 Trend analysis
Trend analysis asks: **“Is the overall level increasing, decreasing, or staying stable?”**
Trends are important because they affect planning (inventory, staffing, budgeting).
We will look at trends using:
- A line chart over time
- Rolling averages (smoothing)
- Simple linear trend estimation (a very basic model)

Important idea: a trend is usually easier to see at a **higher time level** (weekly/monthly) than daily.

In [None]:
# The flights data is monthly, so weekly resampling would only create empty weeks.
# Aggregate up to quarters instead to reduce seasonal noise.
# (The `weekly_ts` name is reused by the cells that follow.)
weekly_ts = (
    ts.set_index('date')['orders']
      .resample('QS')
      .sum()
      .reset_index()
)
weekly_ts.head()

In [None]:
# Plot quarterly totals and a 4-quarter (one-year) rolling mean
weekly_ts['orders_roll_4'] = weekly_ts['orders'].rolling(4).mean()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(weekly_ts['date'], weekly_ts['orders'], alpha=0.4, label='Quarterly total orders')
ax.plot(weekly_ts['date'], weekly_ts['orders_roll_4'], linewidth=2, label='4-quarter rolling mean')
ax.set_title('Quarterly Orders (Trend View)')
ax.set_xlabel('Quarter')
ax.set_ylabel('Orders')
ax.legend()
plt.show()

### A simple linear trend line (conceptual)
A linear trend is the simplest “model” for trend:
$$\text{orders} \approx a + b \cdot t$$

- $t$ is time as 0, 1, 2, ...
- $b$ is the trend slope (positive = increasing)

We use `np.polyfit` for a beginner-friendly approach.

Warning: A linear trend is often **too simple** for real seasonality. It is mainly a baseline and a way to quantify direction.
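
To make the roles of $a$ and $b$ concrete, the sketch below computes the least-squares intercept and slope by hand and checks them against `np.polyfit`. The toy series is an assumption chosen so the answer is easy to verify; it is not the chapter's data.

```python
import numpy as np

# Toy series that grows by exactly 2 units per step (illustrative values)
t = np.arange(5)                          # 0, 1, 2, 3, 4
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Closed-form least-squares estimates for y ≈ a + b * t
b = np.sum((t - t.mean()) * (y - y.mean())) / np.sum((t - t.mean()) ** 2)
a = y.mean() - b * t.mean()

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
slope, intercept = np.polyfit(t, y, deg=1)

print(f"by hand: a={a:.2f}, b={b:.2f}")
print(f"polyfit: a={intercept:.2f}, b={slope:.2f}")
```

Both approaches should agree up to floating-point error, which is a quick way to confirm which returned value is the slope.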

In [None]:
# Fit a simple linear trend to the aggregated series
t = np.arange(len(weekly_ts))
y = weekly_ts['orders'].to_numpy()

# np.polyfit returns coefficients from highest degree to lowest: slope first, then intercept
slope, intercept = np.polyfit(t, y, deg=1)
weekly_ts['trend_line'] = intercept + slope * t
print(f'Trend slope (orders per period): {slope:.2f}')

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(weekly_ts['date'], weekly_ts['orders'], alpha=0.4, label='Aggregated orders')
ax.plot(weekly_ts['date'], weekly_ts['trend_line'], linewidth=2, label='Linear trend')
ax.set_title('Orders with Linear Trend')
ax.set_xlabel('Date')
ax.set_ylabel('Orders')
ax.legend()
plt.show()

### Exercise 17.2
1. Recompute the trend line using **monthly totals** instead of the aggregation used above.
2. Compare the slope magnitudes and explain (in a sentence) why the unit of the slope changes.
3. Optional: try a different smoothing window (e.g., a 6-month rolling mean on the monthly series).

In [None]:
# Exercise 17.2 - Starter code
monthly_ts = (
    ts.set_index('date')
      .resample('MS')
      .agg(orders=('orders', 'sum'))
      .reset_index()
)
t_m = np.arange(len(monthly_ts))
y_m = monthly_ts['orders'].to_numpy()
slope_m, intercept_m = np.polyfit(t_m, y_m, deg=1)
monthly_ts['trend_line'] = intercept_m + slope_m * t_m
print(f'Monthly trend slope (orders per month): {slope_m:.2f}')
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(monthly_ts['date'], monthly_ts['orders'], marker='o', label='Monthly orders')
ax.plot(monthly_ts['date'], monthly_ts['trend_line'], linewidth=2, label='Linear trend')
ax.set_title('Monthly Orders with Linear Trend')
ax.set_xlabel('Month')
ax.set_ylabel('Orders')
ax.legend()
plt.show()

# 17.3 Time-series analysis fundamentals
A **time series** is data where each observation is associated with a timestamp.
Time series analysis often includes:
- Sorting and indexing by time
- Resampling (daily → weekly/monthly)
- Rolling statistics (rolling mean, rolling std)
- Lag features (yesterday’s value) and differencing (change over time)
- Thinking about seasonality (weekly/monthly/yearly cycles)
Tip: Always check the **frequency** and whether dates are missing.
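
One way to act on the tip above is sketched below: let pandas infer the frequency of a complete index, then compare an index with a gap against the full expected range. The dates are hypothetical and chosen only to create a visible gap.

```python
import pandas as pd

# A complete month-start index: pandas can infer its frequency
complete = pd.date_range("2020-01-01", periods=6, freq="MS")
inferred = pd.infer_freq(complete)

# The same range with March removed (hypothetical gap)
with_gap = complete.delete(2)

# Compare against the full expected range to list the missing timestamps
expected = pd.date_range(with_gap.min(), with_gap.max(), freq="MS")
missing = expected.difference(with_gap)

print("Inferred frequency:", inferred)
print("Missing timestamps:", list(missing))
```

The important detail is that the `freq` passed to `pd.date_range` must match the data's real frequency; checking monthly data against a daily range would flag almost every day as missing.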

In [None]:
# Put the time series on a DatetimeIndex for time-series operations
ts_idx = ts.set_index('date').sort_index()

# Check for missing months (the data has one row per month, dated on the 1st,
# so the expected range must use month-start frequency, not daily)
expected = pd.date_range(ts_idx.index.min(), ts_idx.index.max(), freq='MS')
missing_dates = expected.difference(ts_idx.index)
print('Missing dates:', len(missing_dates))
ts_idx.head()

### Rolling statistics: mean and volatility
Rolling mean shows the “local average” around each date.
Rolling standard deviation (std) is a basic way to see **volatility** (how noisy/variable values are).
We use rolling windows because:
- Real systems change over time (what was true last year might not be true now)
- Rolling metrics help detect shifts and instability
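
Before applying rolling windows to real data, it can help to see them on five numbers. The sketch below is illustrative only; it also shows a detail the plots hide, namely that the first `window - 1` results are missing unless `min_periods` is lowered.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5], dtype="float64")

# A window of 3 needs 3 observations, so the first two results are NaN
roll_mean = values.rolling(window=3).mean()

# min_periods=1 returns a value as soon as one observation is available
roll_mean_partial = values.rolling(window=3, min_periods=1).mean()

# Rolling std uses the same window; it is 1.0 for any three consecutive integers
roll_std = values.rolling(window=3).std()

print(pd.DataFrame({
    "value": values,
    "mean_3": roll_mean,
    "mean_3_partial": roll_mean_partial,
    "std_3": roll_std,
}))
```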

In [None]:
# One-year (12-month) rolling window on the monthly series
ts_idx['roll_mean_12'] = ts_idx['orders'].rolling(12).mean()
ts_idx['roll_std_12'] = ts_idx['orders'].rolling(12).std()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(ts_idx.index, ts_idx['orders'], alpha=0.25, label='Monthly orders')
ax.plot(ts_idx.index, ts_idx['roll_mean_12'], linewidth=2, label='12-month rolling mean')
ax.set_title('Rolling Mean (Smoothing)')
ax.set_xlabel('Date')
ax.set_ylabel('Orders')
ax.legend()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(ts_idx.index, ts_idx['roll_std_12'], linewidth=2, color='tab:orange', label='12-month rolling std')
ax.set_title('Rolling Standard Deviation (Volatility)')
ax.set_xlabel('Date')
ax.set_ylabel('Std of orders')
ax.legend()
plt.show()

### Lag features and differencing
Two very common time-series transformations are:
- **Lag**: use the previous value(s) as features (e.g., yesterday’s orders)
- **Difference**: look at change (today - yesterday)
Why do we do this?
- Lags often carry useful information (recent history matters)
- Differencing can reduce trend and make patterns easier to model
Warning: lag features create missing values at the top (because there is no “previous day” for the first row).

In [None]:
ts_feats = ts_idx[['orders']].copy()
ts_feats['lag_1'] = ts_feats['orders'].shift(1)    # previous month
ts_feats['lag_12'] = ts_feats['orders'].shift(12)  # same month one year earlier
ts_feats['diff_1'] = ts_feats['orders'].diff(1)    # month-over-month change
ts_feats.head(15)

### Exercise 17.3
1. Create a `lag_6` feature.
2. Create a `diff_12` feature (this month minus the same month one year earlier).
3. Plot `diff_1` over time and describe (in words) what it shows.

In [None]:
# Exercise 17.3 - Starter code
ts_feats_ex = ts_idx[['orders']].copy()
ts_feats_ex['lag_6'] = ts_feats_ex['orders'].shift(6)
ts_feats_ex['diff_12'] = ts_feats_ex['orders'].diff(12)
ts_feats_ex['diff_1'] = ts_feats_ex['orders'].diff(1)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(ts_feats_ex.index, ts_feats_ex['diff_1'], alpha=0.6)
ax.axhline(0, color='black', linewidth=1)
ax.set_title('Month-over-Month Change in Orders (diff_1)')
ax.set_xlabel('Date')
ax.set_ylabel('Change vs previous month')
plt.show()

# 17.4 Segmentation and clustering concepts
Segmentation is the idea of grouping similar things (customers, products, stores) so you can treat each group differently.
There are two common approaches:

## A) Rule-based segmentation (easy and interpretable)
You define rules like:
- “High value” customers = spend > $500/year
- “New customers” = first purchase within 30 days

This is great for beginners and business communication.

## B) Clustering (data-driven grouping)
The algorithm forms groups based on similarity.
Common example: **K-Means** clustering.

Warning: Clustering does **not** know your business meaning. After clustering, you must interpret the clusters.
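
To show what "forms groups based on similarity" can mean in practice, here is a minimal NumPy sketch of the two alternating steps behind K-Means (assign each point to its nearest center, then move each center to the mean of its points). The four points and the choice of starting centers are assumptions made for illustration; library implementations use more careful initialization and stopping rules.

```python
import numpy as np

# Four hypothetical points forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Start from two of the points as initial centers (a simplification)
centers = points[[0, 2]].copy()

for _ in range(5):
    # Assignment step: each point joins its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    centers = np.array([points[labels == k].mean(axis=0) for k in range(len(centers))])

print("labels:", labels)
print("centers:", centers)
```

The labels are just integers. Nothing in the procedure says which group is "better", which is why interpretation remains a separate step.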

In [None]:
# Create a simple customer dataset for segmentation/clustering
n_customers = 600
customer_id = [f'C{str(i).zfill(4)}' for i in range(1, n_customers + 1)]
# Create three 'behavior types' to make clusters visible
segment = np.random.choice(['Budget', 'Regular', 'Premium'], size=n_customers, p=[0.45, 0.4, 0.15])
orders_per_year = np.where(
    segment == 'Budget', np.random.poisson(4, n_customers),
    np.where(segment == 'Regular', np.random.poisson(10, n_customers), np.random.poisson(18, n_customers))
)
avg_order_value = np.where(
    segment == 'Budget', np.random.normal(25, 8, n_customers),
    np.where(segment == 'Regular', np.random.normal(45, 10, n_customers), np.random.normal(85, 15, n_customers))
)
avg_order_value = np.clip(avg_order_value, 5, None)
days_since_last_purchase = np.where(
    segment == 'Budget', np.random.gamma(3, 18, n_customers),
    np.where(segment == 'Regular', np.random.gamma(3, 10, n_customers), np.random.gamma(3, 6, n_customers))
)
days_since_last_purchase = np.clip(days_since_last_purchase, 0, None)
annual_spend = orders_per_year * avg_order_value

customers = pd.DataFrame({
    'customer_id': customer_id,
    'orders_per_year': orders_per_year,
    'avg_order_value': avg_order_value,
    'days_since_last_purchase': days_since_last_purchase,
    'annual_spend': annual_spend,
})
customers.head()

## 17.4A Rule-based segmentation (RFM-style idea)
A popular segmentation idea is **RFM**:
- **R**ecency: how recently did the customer buy?
- **F**requency: how often do they buy?
- **M**onetary: how much do they spend?
We will do a simplified version using:
- Recency = `days_since_last_purchase`
- Frequency = `orders_per_year`
- Monetary = `annual_spend`
Why this is great for beginners:
- Very interpretable
- Easy to explain to stakeholders
- No special libraries required

In [None]:
# Create simple segments using quantiles (beginner-friendly)
seg = customers.copy()
# Lower recency is better (more recent)
seg['recency_bucket'] = pd.qcut(seg['days_since_last_purchase'], q=3, labels=['Recent', 'Warm', 'Cold'])
seg['frequency_bucket'] = pd.qcut(seg['orders_per_year'], q=3, labels=['Low Freq', 'Mid Freq', 'High Freq'])
seg['monetary_bucket'] = pd.qcut(seg['annual_spend'], q=3, labels=['Low Spend', 'Mid Spend', 'High Spend'])
seg['segment_label'] = (
    seg['recency_bucket'].astype(str)
    + ' | ' + seg['frequency_bucket'].astype(str)
    + ' | ' + seg['monetary_bucket'].astype(str)
)
seg['segment_label'].value_counts().head(10)

### Tip: standardize names and keep it simple
Segments are only useful if people can understand and use them.
A helpful pattern is to **rename** common segment combinations into short names like:
- “Champions”
- “At Risk”
- “New / Promising”
This naming step is a business decision, not a math decision.
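
The renaming pattern above could be sketched as a small lookup function. The short names come from the list above, but the rules that map bucket combinations to names are hypothetical; in a real project they would be agreed with stakeholders.

```python
import pandas as pd

# Hypothetical bucket combinations, using the bucket labels from this section
examples = pd.DataFrame({
    "recency_bucket": ["Recent", "Cold", "Recent", "Warm"],
    "frequency_bucket": ["High Freq", "High Freq", "Low Freq", "Mid Freq"],
})

def name_segment(recency: str, frequency: str) -> str:
    """Map a bucket combination to a short name (illustrative rules only)."""
    if recency == "Recent" and frequency == "High Freq":
        return "Champions"
    if recency == "Cold" and frequency in ("Mid Freq", "High Freq"):
        return "At Risk"
    if recency == "Recent" and frequency == "Low Freq":
        return "New / Promising"
    return "Other"

examples["segment_name"] = [
    name_segment(r, f)
    for r, f in zip(examples["recency_bucket"], examples["frequency_bucket"])
]
print(examples)
```

Keeping a catch-all such as "Other" makes it obvious which combinations have not been given a business meaning yet.
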
### Exercise 17.4A
1. Compute average `annual_spend` by `recency_bucket`.
2. Which bucket has the highest average spend?
3. Create a pivot table: rows = `recency_bucket`, columns = `frequency_bucket`, values = average `annual_spend`.

In [None]:
# Exercise 17.4A - Starter code
avg_by_recency = seg.groupby('recency_bucket', as_index=False).agg(avg_spend=('annual_spend', 'mean'))
display(avg_by_recency)
pivot = pd.pivot_table(
    seg,
    index='recency_bucket',
    columns='frequency_bucket',
    values='annual_spend',
    aggfunc='mean'
)
pivot

## 17.4B Clustering with K-Means (optional: requires scikit-learn)
K-Means tries to put points into $k$ groups by minimizing distances inside each group.
Two important beginner points:
1. **Scaling matters**: features with larger numbers can dominate distance. We usually standardize features.
2. **Choosing k** is not automatic: you test a few values and use business judgement.
If `scikit-learn` is not installed, the code will show you how to install it.
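
The scaling point above can be checked with plain NumPy, without scikit-learn. In this sketch the customer values are invented; the standardization itself (subtract the column mean, divide by the population standard deviation) follows the same formula that `StandardScaler` applies by default.

```python
import numpy as np

# Two hypothetical features on very different scales:
# column 0 = annual spend (hundreds), column 1 = orders per year (single digits)
X = np.array([[200.0, 2.0], [220.0, 9.0], [800.0, 3.0], [820.0, 10.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
# (population std, ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Distance between the first two customers before and after scaling
raw_dist = np.linalg.norm(X[0] - X[1])
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"raw distance: {raw_dist:.2f}, scaled distance: {scaled_dist:.2f}")
```

In raw units the distance is driven mostly by a spend gap of 20, which is small relative to the spread of spend. After scaling, the large relative gap in order frequency dominates instead.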

In [None]:
# Try importing scikit-learn. If it's missing, we'll provide a helpful message.
try:
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    SKLEARN_AVAILABLE = True
except ImportError:
    SKLEARN_AVAILABLE = False

SKLEARN_AVAILABLE

In [None]:
if not SKLEARN_AVAILABLE:
    print("""
    ⚠️ scikit-learn is not installed.
    
    To install it, run one of the following commands:
    
    Using pip:
        pip install scikit-learn
    
    Using conda:
        conda install scikit-learn
    
    After installation, restart your kernel and re-run this notebook.
    
    Note: The clustering examples in this section require scikit-learn.
    You can still follow along with the rule-based segmentation approach.
    """)

In [None]:
if SKLEARN_AVAILABLE:
    features = customers[['orders_per_year', 'avg_order_value', 'days_since_last_purchase']].copy()
    scaler = StandardScaler()
    X = scaler.fit_transform(features)
    # Choose k=3 for demonstration (we generated 3 behavior types)
    kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
    clusters = kmeans.fit_predict(X)
    clustered = customers.copy()
    clustered['cluster'] = clusters
    display(clustered.groupby('cluster').mean(numeric_only=True))
    # Visualize in 2D (orders_per_year vs avg_order_value)
    fig, ax = plt.subplots(figsize=(8, 5))
    scatter = ax.scatter(
        clustered['orders_per_year'],
        clustered['avg_order_value'],
        c=clustered['cluster'],
        cmap='tab10',
        alpha=0.7
    )
    ax.set_title('Customer Clusters (K-Means)')
    ax.set_xlabel('Orders per year')
    ax.set_ylabel('Average order value')
    plt.show()

> **Common mistakes (clustering)**
> - Not scaling features (distance gets dominated by one column)
> - Treating cluster IDs as “ranked” (cluster 2 is not automatically “better” than cluster 1)
> - Expecting perfect clusters in messy real-world data
> - Using clustering when you actually need a supervised model (when you have labels)

### Exercise 17.4B
If scikit-learn is available:
1. Try `k=2`, `k=4`, `k=5`.
2. Compare the cluster summaries. Which k produces the most interpretable groups?
3. (Optional) Create a simple label for each cluster based on its averages.

In [None]:
if SKLEARN_AVAILABLE:
    def fit_kmeans(customers_df: pd.DataFrame, k: int) -> pd.DataFrame:
        feats = customers_df[['orders_per_year', 'avg_order_value', 'days_since_last_purchase']].copy()
        X = StandardScaler().fit_transform(feats)
        model = KMeans(n_clusters=k, random_state=42, n_init='auto')
        labels = model.fit_predict(X)
        out = customers_df.copy()
        out['cluster'] = labels
        return out

    for k in [2, 3, 4, 5]:
        tmp = fit_kmeans(customers, k)
        summary = tmp.groupby('cluster').mean(numeric_only=True)
        print(f'\nK={k} cluster means')
        display(summary)

# 17.5 Forecasting basics
Forecasting answers: **“What might happen next?”**
As a beginner, you should start with **simple forecasting baselines**. They are:
- Easy to implement
- Easy to explain
- Often surprisingly strong

We will build and compare:
- Naïve forecast (next value = most recent observed value)
- Moving average forecast
- Simple exponential smoothing (implemented manually)

Important: forecasting should be evaluated on a **holdout period** (test set) taken from the end of the series.

In [None]:
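# Prepare a 1D series for forecasting.
# The data has one observation per month (dated on the 1st), so use month-start
# frequency; a daily frequency would fill the series with NaN values.
series = ts_idx['orders'].asfreq('MS')

# Train-test split: hold out the last 24 months for testing
test_periods = 24
train = series.iloc[:-test_periods]
test = series.iloc[-test_periods:]
train.index.min(), train.index.max(), test.index.min(), test.index.max()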

In [None]:
def mae(y_true: pd.Series, y_pred: pd.Series) -> float:
    # Mean Absolute Error: easy to interpret (average absolute mistake)
    return float(np.mean(np.abs(y_true.to_numpy() - y_pred.to_numpy())))


def rmse(y_true: pd.Series, y_pred: pd.Series) -> float:
    # Root Mean Squared Error: like MAE, but penalizes large errors more heavily
    return float(np.sqrt(np.mean((y_true.to_numpy() - y_pred.to_numpy()) ** 2)))


# 1) Naïve forecast: predict each value as the previous observed value (one step ahead)
naive_pred = series.shift(1).loc[test.index]

print('Naïve MAE:', mae(test, naive_pred))
print('Naïve RMSE:', rmse(test, naive_pred))
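
The difference between the two metrics is easiest to see on a tiny example. The sketch below uses plain Python lists and separately named helper functions (`mean_abs_error`, `root_mean_sq_error`, both hypothetical names) so that it does not overwrite the notebook's `mae` and `rmse`. The numbers are invented.

```python
import math

actual = [10, 12, 11, 13]
# Two hypothetical forecasts with the same total absolute error (8)
steady = [12, 14, 13, 15]   # off by 2 every time
spiky = [10, 12, 11, 21]    # perfect three times, off by 8 once

def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_sq_error(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

for name, pred in [("steady", steady), ("spiky", spiky)]:
    print(f"{name}: MAE={mean_abs_error(actual, pred):.2f}, "
          f"RMSE={root_mean_sq_error(actual, pred):.2f}")
```

Both forecasts have the same MAE, but the one with a single large miss has the higher RMSE. That is the sense in which RMSE is more sensitive to large errors.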

In [None]:
# 2) Moving average forecast: use the average of the last N observations before each forecast date
def moving_average_forecast(full_series: pd.Series, forecast_index: pd.DatetimeIndex, window: int) -> pd.Series:
    # For each timestamp in forecast_index, average the previous `window` observations
    preds = []
    for dt in forecast_index:
        history = full_series.loc[:dt - pd.Timedelta(days=1)]
        preds.append(float(history.tail(window).mean()))
    return pd.Series(preds, index=forecast_index)

ma7_pred = moving_average_forecast(series, test.index, window=7)
ma28_pred = moving_average_forecast(series, test.index, window=28)
print('MA(7) MAE:', mae(test, ma7_pred))
print('MA(28) MAE:', mae(test, ma28_pred))

In [None]:
# 3) Simple Exponential Smoothing (SES)
# Level-only model:
#   level_t = alpha * y_t + (1-alpha) * level_{t-1}
# Forecast = last level
def simple_exponential_smoothing_forecast(train_series: pd.Series, forecast_index: pd.DatetimeIndex, alpha: float) -> pd.Series:
    if not (0 < alpha <= 1):
        raise ValueError('alpha must be in (0, 1]')
    # Initialize level using the first value
    level = float(train_series.iloc[0])
    for y in train_series.iloc[1:]:
        level = alpha * float(y) + (1 - alpha) * level
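    # Forecast a constant level for the whole horizon. This is a multi-step forecast:
    # unlike the naïve and moving-average forecasts above, it never sees test-period values.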
    return pd.Series([level] * len(forecast_index), index=forecast_index)

ses_02 = simple_exponential_smoothing_forecast(train, test.index, alpha=0.2)
ses_06 = simple_exponential_smoothing_forecast(train, test.index, alpha=0.6)
print('SES(alpha=0.2) MAE:', mae(test, ses_02))
print('SES(alpha=0.6) MAE:', mae(test, ses_06))
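
The level update in the comment above can be traced by hand on three numbers. This sketch uses made-up observations and `alpha = 0.5`, and also checks the equivalent view of the final level as a weighted sum in which older observations receive geometrically smaller weights.

```python
# Hypothetical observations and smoothing factor
observations = [10.0, 20.0, 30.0]
alpha = 0.5

level = observations[0]          # initialize with the first value
levels = [level]
for y_t in observations[1:]:
    # level_t = alpha * y_t + (1 - alpha) * level_{t-1}
    level = alpha * y_t + (1 - alpha) * level
    levels.append(level)

# Equivalent view: weights for y_1, y_2, y_3 after unrolling the recursion
weights = [(1 - alpha) ** 2, alpha * (1 - alpha), alpha]
weighted_sum = sum(w * y for w, y in zip(weights, observations))

print("levels:", levels)
print("weighted sum:", weighted_sum)
```

A larger `alpha` shifts weight toward the most recent observation, so the forecast reacts faster but is also noisier.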

In [None]:
# Compare forecasts visually
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(train.index[-48:], train.iloc[-48:], label='Train (last 48 observations)', alpha=0.6)
ax.plot(test.index, test, label='Test (actual)', linewidth=2)
ax.plot(test.index, naive_pred, label='Naïve', alpha=0.9)
ax.plot(test.index, ma7_pred, label='MA(7)', alpha=0.9)
ax.plot(test.index, ses_06, label='SES alpha=0.6', alpha=0.9)
ax.set_title('Forecast Comparison on Test Period')
ax.set_xlabel('Date')
ax.set_ylabel('Orders')
ax.legend()
plt.show()

> **Beginner tip (forecasting)**: Always beat a baseline.
> If your “advanced” model is not better than naïve or moving average on test data, it is probably not ready.

### Exercise 17.5
1. Compute MAE and RMSE for all methods (Naïve, MA(7), MA(28), SES alpha=0.2, SES alpha=0.6).
2. Which method performs best on this dataset?
3. Change the test period to 36 months and re-evaluate. Do the results change?

In [None]:
# Exercise 17.5 - Starter code
methods = {
    'Naïve': naive_pred,
    'MA(7)': ma7_pred,
    'MA(28)': ma28_pred,
    'SES(0.2)': ses_02,
    'SES(0.6)': ses_06,
}
rows = []
for name, pred in methods.items():
    rows.append({
        'method': name,
        'MAE': mae(test, pred),
        'RMSE': rmse(test, pred),
    })
results = pd.DataFrame(rows).sort_values('MAE')
results

# 17.6 Model selection considerations
Model selection is the skill of choosing a method that fits your goal, data, and constraints.

## A Beginner's Checklist for Model Selection

| Question | Why It Matters |
|----------|----------------|
| 1. **What is your goal?** | Describing, segmenting, forecasting, or predicting a label require different approaches |
| 2. **What data do you have?** | Do you have enough history? Do you have labels? What's the quality? |
| 3. **Who needs to understand it?** | Stakeholders often need simple, explainable methods |
| 4. **What's the cost of errors?** | Is it worse to over-predict or under-predict? |
| 5. **Who will maintain it?** | Will this model be updated regularly? Who will own it? |
| 6. **What's the simplest approach?** | Always start with a baseline you can beat |
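
Question 4 in the checklist can be made concrete with an asymmetric cost function. The sketch below is one hypothetical way to score forecasts when under-predicting is more expensive than over-predicting; the function name `asymmetric_cost` and the 3:1 cost ratio are assumptions for illustration, not a standard metric from this chapter.

```python
def asymmetric_cost(actual, forecast, under_cost=3.0, over_cost=1.0):
    """Average cost where under-forecasting is penalized more than over-forecasting.

    The 3:1 cost ratio is a made-up example; real costs come from the business.
    """
    total = 0.0
    for a, f in zip(actual, forecast):
        error = a - f
        # error > 0 means the forecast was too low (for example, a stock-out)
        total += under_cost * error if error > 0 else over_cost * (-error)
    return total / len(actual)

actual_demand = [100, 100, 100]
too_low = [90, 90, 90]      # under-predicts by 10
too_high = [110, 110, 110]  # over-predicts by 10

print(asymmetric_cost(actual_demand, too_low), asymmetric_cost(actual_demand, too_high))
```

Both forecasts have the same MAE of 10, yet they receive different costs. A symmetric metric alone would not separate them.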

## Decision Guide: Which Technique to Use?

```
START HERE
    │
    ▼
What do you want to know?
    │
    ├── "What happened?" ──────────────────► DESCRIPTIVE ANALYTICS
    │                                         (groupby, summary stats, KPIs)
    │
    ├── "Is there an upward/downward trend?" ► TREND ANALYSIS
    │                                         (rolling averages, linear fit)
    │
    ├── "What will happen next?" ──────────► FORECASTING
    │   │                                     (naïve, MA, exponential smoothing)
    │   └── Do you have time-indexed data? 
    │       └── Yes ► TIME SERIES ANALYSIS
    │
    └── "How can I group similar items?" ───► SEGMENTATION / CLUSTERING
        │
        ├── Do you know the rules? ─► Yes ──► Rule-based segmentation (RFM)
        │
        └── No ────────────────────────────► K-Means clustering
```

### Picking techniques (practical guidance)
- **Descriptive analytics**: start here always.
- **Trend analysis**: use rolling averages; quantify with a trend line if helpful.
- **Time series forecasting**: start with naïve and moving average, then add complexity.
- **Segmentation**: start with rule-based (RFM-like), then consider clustering.
- **Clustering**: use when you *don't* have labels and want natural groupings.

> **Tip:** The best model is often NOT the most complex one. It's the one that:
> - Answers the business question clearly
> - Can be explained to stakeholders
> - Is maintainable over time
> - Beats a simple baseline on your test data

### Optional Resources
- [scikit-learn Model Selection Overview](https://scikit-learn.org/stable/model_selection.html)
- [Forecasting: Principles and Practice (Free Online Book)](https://otexts.com/fpp3/)
- [Towards Data Science: Model Selection Guide](https://towardsdatascience.com/)

## Mini-project (Chapter 17)
### Scenario
You are an analyst for a small online store. You need to:
1. Provide a descriptive summary of order volume.
2. Identify whether orders are trending up.
3. Create a simple forecast for the next 12 months.
4. Propose a customer segmentation approach.
### Deliverables
- A summary table of order volume (for example, yearly totals and average monthly orders)
- A trend plot (monthly or quarterly)
- Forecast plot + error metrics on a test set
- A segmentation table (rule-based) and (optional) a clustering result

In [None]:
# Mini-project starter: 1) summary of order volume
# The source data is already monthly, so summarize by calendar year.
yearly_summary = (
    ts.set_index('date')['orders']
      .resample('YS')
      .agg(['sum', 'mean'])
      .rename(columns={'sum': 'total_orders', 'mean': 'avg_monthly_orders'})
      .reset_index()
)
yearly_summary.head()

In [None]:
# Mini-project starter: 2) trend plot using quarterly totals
# (the source data is monthly, so weekly resampling would only create empty weeks)
quarterly = ts.set_index('date')[['orders']].resample('QS').sum()
quarterly['roll_4'] = quarterly['orders'].rolling(4).mean()

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(quarterly.index, quarterly['orders'], alpha=0.35, label='Quarterly orders')
ax.plot(quarterly.index, quarterly['roll_4'], linewidth=2, label='4-quarter rolling mean')
ax.set_title('Quarterly Orders Trend')
ax.set_xlabel('Quarter')
ax.set_ylabel('Orders')
ax.legend()
plt.show()

In [None]:
# Mini-project starter: 3) 12-month forecast using your chosen method
forecast_horizon = 12
future_index = pd.date_range(series.index.max() + pd.DateOffset(months=1), periods=forecast_horizon, freq='MS')

# Choose a method (example: MA(7), the mean of the last 7 observations).
# No future actuals exist, so every forecast date sees the same history and the forecast is flat.
future_forecast = moving_average_forecast(series, future_index, window=7)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(series.index[-48:], series.iloc[-48:], label='History (last 48 observations)')
ax.plot(future_forecast.index, future_forecast, label='Forecast (next 12 months)', linewidth=2)
ax.set_title('12-Month Forecast (Moving Average)')
ax.set_xlabel('Date')
ax.set_ylabel('Orders')
ax.legend()
plt.show()

In [None]:
# Mini-project starter: 4) rule-based segmentation summary
seg_summary = (
    seg.groupby(['recency_bucket', 'frequency_bucket'], as_index=False, observed=True)
      .agg(customers=('customer_id', 'count'), avg_spend=('annual_spend', 'mean'))
)
seg_summary.sort_values(['recency_bucket', 'frequency_bucket']).head(10)

# Summary / Key Takeaways

## What You Learned

| Technique | Purpose | Key Methods |
|-----------|---------|-------------|
| **Descriptive Analytics** | Answer "What happened?" | `groupby`, `agg`, summary stats, KPIs |
| **Trend Analysis** | Identify direction over time | Rolling averages, linear trend lines |
| **Time Series Fundamentals** | Work with time-indexed data | Resampling, lags, differencing, rolling stats |
| **Segmentation** | Group similar items (rule-based) | Quantile cuts, RFM approach |
| **Clustering** | Group similar items (data-driven) | K-Means, StandardScaler |
| **Forecasting** | Predict future values | Naïve, Moving Average, Exponential Smoothing |
| **Model Selection** | Choose the right approach | Match goal, data, interpretability, constraints |

## Key Principles to Remember

1. **Descriptive analytics** (summaries, groupby, plots) is the foundation of everything else.
2. **Trend analysis** is easier when you aggregate and smooth (weekly/monthly + rolling mean).
3. **Time series** work best when you use time indexes, rolling stats, lags, and differences.
4. **Segmentation** can be rule-based (highly interpretable) or data-driven (clustering).
5. **Forecasting** should start with strong baselines (naïve, moving average, simple smoothing).
6. **Model selection** is about matching method to goal + constraints, not maximizing complexity.

## Additional Resources

### Documentation
- [Pandas Time Series Documentation](https://pandas.pydata.org/docs/user_guide/timeseries.html) – Official guide for time-indexed data
- [scikit-learn Clustering Guide](https://scikit-learn.org/stable/modules/clustering.html) – Comprehensive clustering documentation
- [scikit-learn Model Selection](https://scikit-learn.org/stable/model_selection.html) – Guide for choosing and evaluating models

### Learning Resources
- [Forecasting: Principles and Practice (Free Online Book)](https://otexts.com/fpp3/) – Excellent introduction to forecasting
- [Kaggle Time Series Course](https://www.kaggle.com/learn/time-series) – Hands-on tutorials with real datasets
- [RFM Segmentation Tutorial](https://clevertap.com/blog/rfm-analysis/) – Business-oriented explanation of RFM

### Practice Datasets
- [Kaggle Retail Data](https://www.kaggle.com/datasets) – Search for "retail sales" or "e-commerce"
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) – Classic datasets for practice

---

**Next Chapter Preview:** In Chapter 18, you will learn how to properly validate and evaluate your models using train-test splits, evaluation metrics, and techniques to avoid overfitting.