# Chapter 20: Data Visualization, Communication, and Storytelling

This chapter focuses on the critical skills of communicating analytical findings effectively. You will learn how to transform raw data insights into compelling narratives that drive decisions, design appropriate visualizations for different purposes, create informative dashboards, and tailor your communication style for various audiences.

**Topics covered:**
- Role of storytelling in analytics
- Structuring an analytics story
- Selecting appropriate visualizations
- Designing dashboards
- Presenting to technical vs non-technical audiences
- Executive summaries

**Prerequisites:** Basic knowledge of Python, Pandas, and Matplotlib/Seaborn (Chapters 2–5).

## Introduction

In earlier chapters, you learned how to compute statistics and build visualizations. But in real work, your value often depends on one more skill:

**Can you communicate your results so that someone else can make a decision?**

Good communication is not about making charts look pretty. It is about helping your audience understand:

1. **What is happening?** (the insight)
2. **Why does it matter?** (the impact)
3. **What should we do next?** (the action)

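We will use a small dataset adapted from seaborn's built-in `diamonds` data and relabeled as online-store orders. Note that `sns.load_dataset` downloads the file the first time it is called, so the first run needs an internet connection.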

In [None]:
# Setup: imports and plotting style
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

# Reproducibility: set a random seed so results are consistent
rng = np.random.default_rng(42)

In [None]:
# Load the diamonds dataset from seaborn for our storytelling examples
diamonds = sns.load_dataset("diamonds")

# Transform to our online store context
# Sample 2,496 rows (104 days x 24 orders) so that every day is complete
df = diamonds.sample(n=2496, random_state=42).copy()

# Map to business context
df = pd.DataFrame({
    # One order per hour, floored to midnight so that grouping by "date" gives daily totals
    "date": pd.date_range("2025-01-01", periods=len(df), freq="h").normalize(),
    "region": df["color"].map({'D': 'North', 'E': 'North', 'F': 'South', 'G': 'South',
                               'H': 'East', 'I': 'East', 'J': 'West'}),
    "product": df["cut"].map({'Fair': 'Basic', 'Good': 'Basic', 'Very Good': 'Plus',
                              'Premium': 'Plus', 'Ideal': 'Pro'}),
    "quantity": df["table"].clip(50, 70).apply(lambda x: max(1, int((x - 50) / 5))),
    "revenue": df["price"]
})

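# Add campaign flag (orders on or after April 1st)
campaign_start = pd.Timestamp("2025-04-01")
df["campaign"] = df["date"] >= campaign_start

# Simulate an 8% campaign uplift (cast to float first, because price is stored as integers)
df["revenue"] = df["revenue"].astype(float)
df.loc[df["campaign"], "revenue"] *= 1.08
df["revenue"] = df["revenue"].round(2)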

df = df.sort_values("date").reset_index(drop=True)
print(f"Dataset shape: {df.shape}")
df.head()

### Quick data check (why we do this)
Before we make charts or write a narrative, we need basic confidence that the data looks reasonable.

We typically check:
- Number of rows
- Column types (dates should be dates, numbers should be numbers)
- Missing values
- Basic ranges (e.g., revenue should not be negative)

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df[["quantity", "revenue"]].describe()
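The checks listed above can also be written as explicit tests, so that a notebook fails loudly when the data is off. The sketch below is one possible version. The helper name `basic_data_checks` and the tiny `orders` frame are made up for illustration and are not part of the chapter's dataset.

```python
import pandas as pd

def basic_data_checks(data: pd.DataFrame) -> dict:
    """Return a small report covering the four checks (hypothetical helper)."""
    return {
        "n_rows": len(data),
        "date_is_datetime": pd.api.types.is_datetime64_any_dtype(data["date"]),
        "revenue_is_numeric": pd.api.types.is_numeric_dtype(data["revenue"]),
        "missing_values": int(data.isna().sum().sum()),
        "negative_revenue_rows": int((data["revenue"] < 0).sum()),
    }

# Tiny made-up frame with one missing value and one negative revenue
orders = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"]),
    "revenue": [120.0, None, -5.0],
})

report = basic_data_checks(orders)
print(report)
```

Calling the same helper on `df` gives a one-line summary you can glance at before plotting anything.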

## 20.1 Role of storytelling in analytics

**Storytelling** means guiding someone from a question to a conclusion in a way they can follow.

In analytics, storytelling is useful because:
- People are busy and don’t want to read raw tables
- A chart without context can be misunderstood
- Decisions require *trade-offs* (cost, risk, time), not just numbers

A good analytics story usually has three parts:
1. **Context:** What problem are we solving?
2. **Evidence:** What does the data show?
3. **Action:** What should we do (or decide) next?

### Tip (beginner-friendly)
A story is not fiction. You should never hide inconvenient data. The goal is clarity, not manipulation.

### Common mistake
Jumping straight into a complex chart without first stating the question.

## 20.2 Structuring an analytics story

A simple structure that works in many business settings is:

**1) Question → 2) Data → 3) Method → 4) Findings → 5) Recommendation → 6) Next steps**

You can think of it like a short movie:
- **Beginning:** set the scene (why the question matters)
- **Middle:** show evidence (2–3 key visuals, not 20)
- **End:** deliver the message (decision + what to do next)

We’ll practice with this question:

**“Did the marketing campaign improve revenue, and where?”**

First, let’s summarize revenue before/after the campaign.

In [None]:
# Step 1: Aggregate to the level that matches the question
# Here: daily revenue, split by campaign period
daily = (df
         .groupby(["date", "campaign"], as_index=False)
         .agg(total_revenue=("revenue", "sum"),
              orders=("revenue", "size")))

daily.head()

In [None]:
# Step 2: Compare typical values before vs after
summary = (daily
           .groupby("campaign", as_index=False)
           .agg(avg_daily_revenue=("total_revenue", "mean"),
                median_daily_revenue=("total_revenue", "median"),
                days=("date", "nunique")))
summary

### Why these steps?
- We **group by day** because campaigns often influence trends over time.
- We compute **mean and median** because the mean can be pulled up/down by a few unusual days.
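To see why both numbers are worth reporting, here is a minimal sketch with made-up values (not taken from the chapter's dataset): nine ordinary days and one unusually large day.

```python
import numpy as np

# Made-up daily revenue: nine ordinary days and one unusually large day
daily_revenue = np.array([100, 102, 98, 101, 99, 103, 97, 100, 100, 1000])

mean_value = daily_revenue.mean()
median_value = np.median(daily_revenue)

print(f"mean = {mean_value:.1f}, median = {median_value:.1f}")
# mean = 190.0, median = 100.0
```

A single outlier nearly doubles the mean, while the median still describes a typical day. When the two disagree this much, that gap is itself worth mentioning in the story.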

Next we will visualize the trend over time. This is part of the **Evidence** section of the story.

In [None]:
# Line charts are great for trends over time
fig, ax = plt.subplots()
sns.lineplot(data=daily, x="date", y="total_revenue", ax=ax)
ax.axvline(pd.Timestamp("2025-04-01"), color="black", linestyle="--", linewidth=1)
ax.set_title("Daily revenue over time (campaign start marked)")
ax.set_xlabel("Date")
ax.set_ylabel("Total revenue ($)")
plt.tight_layout()
plt.show()

### Tip: annotate the key event
The dashed line makes the story easier to follow: the audience immediately sees *when* something changed.
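A dashed line alone still leaves the reader to guess what it marks. One way to label it directly is `ax.annotate`, sketched below with made-up daily totals and a hypothetical event date.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up daily totals, for illustration only
dates = pd.date_range("2025-03-25", periods=14, freq="D")
revenue = [90, 92, 88, 91, 89, 93, 90, 97, 99, 98, 101, 100, 102, 99]
event_date = pd.Timestamp("2025-04-01")  # hypothetical event

fig, ax = plt.subplots()
ax.plot(dates, revenue)
ax.axvline(event_date, color="black", linestyle="--", linewidth=1)

# Place a short label just to the right of the line, at the top of the data
ax.annotate(
    "Campaign start",
    xy=(event_date, max(revenue)),
    xytext=(5, 0),
    textcoords="offset points",
    ha="left",
    va="top",
)
ax.set_xlabel("Date")
ax.set_ylabel("Total revenue ($)")
fig.autofmt_xdate()
# In a notebook the figure renders at the end of the cell; in a script, call plt.show()
```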

### Warning: correlation vs causation
If revenue increased after April 1st, that *suggests* the campaign helped, but it does not prove it. Other factors (seasonality, competitor changes, holidays) could also play a role.

### Exercise 1: Write a story outline (no code)
Fill in this outline in your own words:

- **Question:** …
- **Why it matters:** …
- **Data used:** …
- **Key finding (1 sentence):** …
- **Recommendation:** …
- **Next step:** …

*Hint:* Keep it short. One paragraph is enough.

## 20.3 Selecting appropriate visualizations

The best chart depends on the question. A helpful shortcut is to identify your goal:

- **Trend over time** → line chart
- **Compare categories** → bar chart
- **Distribution (spread)** → histogram / box plot
- **Relationship between two numbers** → scatter plot
- **Part-to-whole** → usually a sorted bar chart of shares (pie charts are hard to read once there are more than a few slices)
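As an illustration of the part-to-whole case, the sketch below converts totals into percentage shares and draws them as sorted horizontal bars. The numbers are made up and the variable names are only for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up revenue by product, for illustration only
revenue_by_product = pd.Series({"Basic": 200.0, "Plus": 500.0, "Pro": 300.0})

# Convert totals to percentage shares; ascending sort puts the largest bar on top
share_pct = (revenue_by_product / revenue_by_product.sum() * 100).sort_values()

fig, ax = plt.subplots()
ax.barh(share_pct.index, share_pct.values)
ax.set_xlabel("Share of total revenue (%)")
ax.set_title("Plus accounts for half of revenue")

# Print the share next to each bar so readers do not have to estimate it
for y_pos, value in enumerate(share_pct.values):
    ax.text(value + 1, y_pos, f"{value:.0f}%", va="center")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show()
```

Bars share a common baseline, so comparing 30% with 20% is a matter of comparing lengths rather than angles.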

### Common beginner mistake
Using the same chart type for every question. Choose the chart that *matches the message*.

Below we’ll create a few visuals using the same dataset to see when each is useful.

In [None]:
# Bar chart: compare total revenue by region
by_region = (df.groupby("region", as_index=False)
             .agg(total_revenue=("revenue", "sum"),
                  orders=("revenue", "size")))

fig, ax = plt.subplots()
sns.barplot(data=by_region, x="region", y="total_revenue", ax=ax)
ax.set_title("Total revenue by region")
ax.set_xlabel("Region")
ax.set_ylabel("Total revenue ($)")
plt.tight_layout()
plt.show()

by_region.sort_values("total_revenue", ascending=False)

In [None]:
# Histogram: understand the distribution of revenue per order
fig, ax = plt.subplots()
sns.histplot(df["revenue"], bins=30, kde=True, ax=ax)
ax.set_title("Distribution of revenue per order")
ax.set_xlabel("Revenue per order ($)")
ax.set_ylabel("Number of orders")
plt.tight_layout()
plt.show()

In [None]:
# Box plot: compare distributions across categories (and spot outliers)
fig, ax = plt.subplots()
sns.boxplot(data=df, x="product", y="revenue", ax=ax)
ax.set_title("Revenue per order by product (distribution comparison)")
ax.set_xlabel("Product")
ax.set_ylabel("Revenue per order ($)")
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot: relationship between quantity and revenue
fig, ax = plt.subplots()
sns.scatterplot(data=df.sample(600, random_state=0), x="quantity", y="revenue", hue="product", alpha=0.6, ax=ax)
ax.set_title("Revenue vs quantity (colored by product)")
ax.set_xlabel("Quantity")
ax.set_ylabel("Revenue ($)")
plt.tight_layout()
plt.show()

### Tips for choosing and improving charts

- **Start with the message**, then pick the chart.
- Keep titles informative: “Revenue increased after campaign launch” is better than “Chart 1”.
- Label axes clearly.
- Avoid clutter (too many colors, too many grid lines, 3D charts).
- Use color for meaning, not decoration.

### Common mistakes to avoid
- **Truncated y-axis** in bar charts can exaggerate differences (it’s sometimes okay, but you must be careful and transparent).
- Using too many categories at once (consider sorting and showing top 10).
- Relying on red/green only (color-blind accessibility).
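For the "too many categories" problem, one common approach is to keep the largest few and fold the rest into a single "Other" group. The helper below is a hypothetical sketch of that idea, with made-up totals.

```python
import pandas as pd

def top_n_with_other(totals: pd.Series, n: int = 10) -> pd.Series:
    """Keep the n largest categories and sum the rest into 'Other' (hypothetical helper)."""
    ordered = totals.sort_values(ascending=False)
    top = ordered.iloc[:n]
    rest = ordered.iloc[n:]
    if rest.empty:
        return top
    return pd.concat([top, pd.Series({"Other": rest.sum()})])

# Made-up totals for six categories
totals = pd.Series({"A": 50, "B": 5, "C": 30, "D": 3, "E": 10, "F": 2})
compact = top_n_with_other(totals, n=3)
print(compact)
```

Here the result keeps A, C, and E and adds an "Other" row of 10 (5 + 3 + 2). Whether "Other" should be shown or dropped depends on the message of the chart.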

### Exercise 2: Build a simple charting function
Write a function that accepts a DataFrame and a region name, then plots daily revenue for that region.

**Why this exercise matters:** In real work, you often repeat plots for different segments. Functions reduce mistakes and keep style consistent.

In [None]:
def plot_daily_revenue_for_region(data: pd.DataFrame, region: str) -> None:
    # Validate inputs (beginner-friendly defensive programming)
    if region not in data["region"].unique():
        raise ValueError(f"Unknown region: {region}. Choose from {sorted(data['region'].unique())}")

    regional = data.loc[data["region"] == region]
    daily_regional = (regional
                     .groupby("date", as_index=False)
                     .agg(total_revenue=("revenue", "sum")))

    fig, ax = plt.subplots()
    sns.lineplot(data=daily_regional, x="date", y="total_revenue", ax=ax)
    ax.axvline(pd.Timestamp("2025-04-01"), color="black", linestyle="--", linewidth=1)
    ax.set_title(f"Daily revenue over time — {region} region")
    ax.set_xlabel("Date")
    ax.set_ylabel("Total revenue ($)")
    plt.tight_layout()
    plt.show()


# Try it
plot_daily_revenue_for_region(df, "North")

## 20.4 Designing dashboards

A **dashboard** is a compact view of key metrics and charts designed for quick monitoring and decision-making.

### What makes a dashboard effective?
- **Clear purpose:** Who is it for, and what decision will they make?
- **A few key metrics:** not everything, only what matters
- **Consistent formatting:** same units, clear labels
- **Good layout:** related items grouped together
- **Readable at a glance:** the audience should understand it in 10–30 seconds

In a notebook, we can create a dashboard-like layout using subplots. In real deployments, teams often use tools like **Power BI**, **Tableau**, **Streamlit**, or **Dash**.

Let’s build a simple 2×2 dashboard: trend, category comparison, distribution, and key KPIs.

In [None]:
# Compute a few KPIs
total_orders = len(df)
total_revenue = df["revenue"].sum()
avg_order_value = df["revenue"].mean()
top_region = (df.groupby("region")["revenue"].sum().sort_values(ascending=False).index[0])

daily_total = (df.groupby("date", as_index=False)
              .agg(total_revenue=("revenue", "sum")))

by_product = (df.groupby("product", as_index=False)
             .agg(total_revenue=("revenue", "sum")))

fig, axes = plt.subplots(2, 2, figsize=(14, 9))

# (1) Trend
sns.lineplot(data=daily_total, x="date", y="total_revenue", ax=axes[0, 0])
axes[0, 0].axvline(pd.Timestamp("2025-04-01"), color="black", linestyle="--", linewidth=1)
axes[0, 0].set_title("Daily revenue")
axes[0, 0].set_xlabel("")
axes[0, 0].set_ylabel("$")

# (2) Category comparison
sns.barplot(data=by_product, x="product", y="total_revenue", ax=axes[0, 1])
axes[0, 1].set_title("Revenue by product")
axes[0, 1].set_xlabel("")
axes[0, 1].set_ylabel("$")

# (3) Distribution
sns.histplot(df["revenue"], bins=25, kde=True, ax=axes[1, 0])
axes[1, 0].set_title("Revenue per order distribution")
axes[1, 0].set_xlabel("Revenue ($)")
axes[1, 0].set_ylabel("Orders")

# (4) KPI panel (simple text)
axes[1, 1].axis("off")
kpi_text = (
    f"Total orders: {total_orders:,}\n"
    f"Total revenue: ${total_revenue:,.0f}\n"
    f"Avg order value: ${avg_order_value:,.2f}\n"
    f"Top region: {top_region}"
)
axes[1, 1].text(0, 0.8, "Key KPIs", fontsize=14, weight="bold")
axes[1, 1].text(0, 0.55, kpi_text, fontsize=12, family="monospace")

fig.suptitle("Dashboard-style overview", fontsize=16)
plt.tight_layout()
plt.show()

### Tips and warnings for dashboards

- **Tip:** Add a *date range* and *data source* note. It builds trust.
- **Tip:** Use consistent units (don’t mix $ and % in confusing ways).
- **Warning:** Too many KPIs make a dashboard unusable. If everything is important, nothing is important.
- **Common mistake:** Showing only totals; decision-makers often also need *rates* (conversion rate, churn rate) and *comparisons* (vs last month).
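Two of the points above, comparisons and a date-range note, take only a few lines of pandas. The sketch below uses a made-up `orders` frame. The column name `pct_vs_prev_month` and the wording of the note are illustrative choices, not a fixed convention.

```python
import pandas as pd

# Made-up orders spanning three months, for illustration only
orders = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-10", "2025-01-20", "2025-02-05",
                            "2025-02-25", "2025-03-15"]),
    "revenue": [100.0, 100.0, 150.0, 90.0, 300.0],
})

# Total revenue per calendar month
monthly = (orders
           .groupby(orders["date"].dt.to_period("M"))["revenue"]
           .sum()
           .rename("total_revenue")
           .to_frame())

# Comparison KPI: percentage change versus the previous month
monthly["pct_vs_prev_month"] = monthly["total_revenue"].pct_change() * 100

# Note that tells the reader which period the dashboard covers
date_note = f"Data from {orders['date'].min():%Y-%m-%d} to {orders['date'].max():%Y-%m-%d}"

print(monthly.round(1))
print(date_note)
```

The first month has no previous month to compare against, so its change is `NaN`. On a dashboard it is usually clearer to show that cell as blank or "n/a" than as zero.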

### Exercise 3: Create a ‘mini dashboard’ for one region
Modify the dashboard code to show only one region (pick any).

Steps:
1. Filter `df` to a single region
2. Recompute KPIs on that filtered data
3. Recreate the 2×2 layout

*Hint:* You can reuse the function idea from Exercise 2.

In [None]:
# Worked solution for Exercise 3 (try it yourself before reading on)
region_choice = "West"

# Step 1: filter to a single region
df_r = df.loc[df["region"] == region_choice].copy()

# Step 2: recompute KPIs on the filtered data
total_orders_r = len(df_r)
total_revenue_r = df_r["revenue"].sum()
avg_order_value_r = df_r["revenue"].mean()

daily_total_r = (df_r.groupby("date", as_index=False)
                .agg(total_revenue=("revenue", "sum")))
by_product_r = (df_r.groupby("product", as_index=False)
               .agg(total_revenue=("revenue", "sum")))

fig, axes = plt.subplots(2, 2, figsize=(14, 9))
sns.lineplot(data=daily_total_r, x="date", y="total_revenue", ax=axes[0, 0])
axes[0, 0].axvline(pd.Timestamp("2025-04-01"), color="black", linestyle="--", linewidth=1)
axes[0, 0].set_title(f"Daily revenue — {region_choice}")
axes[0, 0].set_xlabel("")
axes[0, 0].set_ylabel("$")

sns.barplot(data=by_product_r, x="product", y="total_revenue", ax=axes[0, 1])
axes[0, 1].set_title("Revenue by product")
axes[0, 1].set_xlabel("")
axes[0, 1].set_ylabel("$")

sns.histplot(df_r["revenue"], bins=25, kde=True, ax=axes[1, 0])
axes[1, 0].set_title("Revenue per order distribution")
axes[1, 0].set_xlabel("Revenue ($)")
axes[1, 0].set_ylabel("Orders")

axes[1, 1].axis("off")
kpi_text_r = (
    f"Total orders: {total_orders_r:,}\n"
    f"Total revenue: ${total_revenue_r:,.0f}\n"
    f"Avg order value: ${avg_order_value_r:,.2f}"
)
axes[1, 1].text(0, 0.8, "Key KPIs", fontsize=14, weight="bold")
axes[1, 1].text(0, 0.55, kpi_text_r, fontsize=12, family="monospace")

fig.suptitle(f"Mini dashboard — {region_choice}", fontsize=16)
plt.tight_layout()
plt.show()

## 20.5 Presenting to technical vs non-technical audiences

Different audiences care about different details. Your job is to **meet the audience where they are**.

### Non-technical audiences (executives, operations, sales)
- Want the **impact** and **decision**
- Prefer simple visuals and plain language
- Need clear constraints (“this is based on 6 months of data”)

### Technical audiences (analysts, engineers, data scientists)
- Want the **method**, assumptions, and limitations
- Care about reproducibility
- Ask questions like “How did you handle missing data?”

A practical technique: prepare two versions of the same slide/story:
- **Version A (exec):** one chart + one sentence + recommendation
- **Version B (technical):** includes additional charts, validation, and methodology

Next we’ll demonstrate communicating uncertainty with a simple bootstrap confidence interval (a technical detail that can be summarized simply).

In [None]:
# Example: estimate uncertainty in average daily revenue before vs after the campaign
# We will use bootstrap resampling to get a confidence interval (CI).

daily_pre = daily.loc[~daily["campaign"], "total_revenue"].to_numpy()
daily_post = daily.loc[daily["campaign"], "total_revenue"].to_numpy()

def bootstrap_mean_ci(values: np.ndarray, n_boot: int = 2000, ci: float = 0.95) -> tuple[float, float, float]:
    """Return (mean, lower, upper) bootstrap CI for the mean."""
    boot_means = []
    n_vals = len(values)
    for _ in range(n_boot):
        sample = rng.choice(values, size=n_vals, replace=True)
        boot_means.append(sample.mean())
    boot_means = np.array(boot_means)
    alpha = (1 - ci) / 2
    lower = np.quantile(boot_means, alpha)
    upper = np.quantile(boot_means, 1 - alpha)
    return values.mean(), lower, upper

pre_mean, pre_lo, pre_hi = bootstrap_mean_ci(daily_pre)
post_mean, post_lo, post_hi = bootstrap_mean_ci(daily_post)

pd.DataFrame({
    "period": ["Before campaign", "After campaign"],
    "mean_daily_revenue": [pre_mean, post_mean],
    "ci_lower": [pre_lo, post_lo],
    "ci_upper": [pre_hi, post_hi],
}).round(2)

In [None]:
# Visualize the two means with error bars (easy to explain)
periods = ["Before", "After"]
means = [pre_mean, post_mean]
yerr = [
    [pre_mean - pre_lo, post_mean - post_lo],
    [pre_hi - pre_mean, post_hi - post_mean],
]

fig, ax = plt.subplots()
ax.errorbar(periods, means, yerr=yerr, fmt="o", capsize=6)
ax.set_title("Average daily revenue (with ~95% bootstrap CI)")
ax.set_ylabel("Revenue ($)")
plt.tight_layout()
plt.show()

### How to explain this to different audiences

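- **Executive explanation (simple):** “After the campaign, average daily revenue increased, and the uncertainty range still suggests an improvement.” Use this wording only if the intervals actually support it; if they overlap heavily, say the result is inconclusive.
- **Technical explanation:** “We used bootstrap resampling of daily totals to estimate a 95% CI for the mean in each period. This assumes days are exchangeable within each period, so trends and weekday effects are ignored.”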
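Comparing two separate intervals by eye is only a rough check. A technical audience may ask for an interval on the difference itself. The sketch below is one way to do that, assuming the two periods can be resampled independently. The function name `bootstrap_diff_ci`, its default seed, and the sample values are made up for illustration.

```python
import numpy as np

def bootstrap_diff_ci(before, after, n_boot=2000, ci=0.95, seed=0):
    """Return (diff, lower, upper) for mean(after) - mean(before). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)

    # Resample each period independently, with replacement, n_boot times
    before_means = rng.choice(before, size=(n_boot, before.size), replace=True).mean(axis=1)
    after_means = rng.choice(after, size=(n_boot, after.size), replace=True).mean(axis=1)
    diffs = after_means - before_means

    alpha = (1 - ci) / 2
    lower, upper = np.quantile(diffs, [alpha, 1 - alpha])
    return after.mean() - before.mean(), lower, upper

# Made-up daily totals, for illustration only
before = [100, 104, 98, 101, 97, 103, 99, 102]
after = [110, 108, 113, 109, 112, 111, 107, 114]

diff, lo, hi = bootstrap_diff_ci(before, after)
print(f"difference = {diff:.1f} (95% CI {lo:.1f} to {hi:.1f})")
```

If the interval for the difference excludes zero, the executive version can say the increase is unlikely to be noise. If it includes zero, the honest summary is that the data does not yet settle the question.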

### Common mistake
Over-explaining technical details to non-technical listeners. If they want more detail, they will ask.

### Exercise 4: Practice audience-specific communication
Write two versions of the same message:

1. **Non-technical version (1–2 sentences)**
2. **Technical version (3–5 sentences)** including assumptions/limitations

Use the campaign result as your topic.

## 20.6 Executive summaries

An **executive summary** is a short, high-impact explanation of the analysis for decision-makers.

A beginner-friendly template is:

1. **Goal:** What question did we answer?
2. **Result:** What did we find (numbers + direction)?
3. **So what:** Why does it matter?
4. **Recommendation:** What should we do?
5. **Next step / risk:** What should we check next (or what could be wrong)?

Let’s generate an executive summary from our data using Python string formatting.

In [None]:
# Compute a few numbers for the executive summary
pre_avg = summary.loc[~summary["campaign"], "avg_daily_revenue"].iloc[0]
post_avg = summary.loc[summary["campaign"], "avg_daily_revenue"].iloc[0]
pct_change = (post_avg - pre_avg) / pre_avg * 100

# Where was the change strongest?
regional_daily = (df
                 .groupby(["region", "date", "campaign"], as_index=False)
                 .agg(total_revenue=("revenue", "sum")))
regional_summary = (regional_daily
                   .groupby(["region", "campaign"], as_index=False)
                   .agg(avg_daily_revenue=("total_revenue", "mean")))

pivot = regional_summary.pivot(index="region", columns="campaign", values="avg_daily_revenue")
pivot.columns = ["before", "after"]
pivot["pct_change"] = (pivot["after"] - pivot["before"]) / pivot["before"] * 100
top_change_region = pivot["pct_change"].sort_values(ascending=False).index[0]
top_change_value = pivot.loc[top_change_region, "pct_change"]

executive_summary = (
    f"Goal: Evaluate whether the marketing campaign improved revenue.\n"
    f"Result: Average daily revenue increased by {pct_change:.1f}% after the campaign (from ${pre_avg:,.0f} to ${post_avg:,.0f}).\n"
    f"So what: Higher daily revenue suggests better sales performance; return on marketing spend still depends on campaign cost, which this dataset does not include.\n"
    f"Recommendation: Continue the campaign strategy and investigate scaling it, especially in {top_change_region} (largest uplift: {top_change_value:.1f}%).\n"
    f"Next step / risk: Validate with a controlled test (A/B) or adjust for seasonality to confirm the campaign caused the increase."
)

print(executive_summary)

### Tip: keep it skimmable
Use short sentences and numbers that provide scale. Avoid technical terms unless the audience expects them.
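Numbers that provide scale are easier to skim when they are rounded to a sensible unit. The helper below is a hypothetical sketch that shortens dollar amounts with K/M/B suffixes. The name `format_currency_short` and the one-decimal rounding are illustrative choices.

```python
def format_currency_short(value: float) -> str:
    """Format a dollar amount with a K/M/B suffix (hypothetical helper)."""
    sign = "-" if value < 0 else ""
    amount = abs(value)
    # Check the largest unit first so each amount gets the biggest suffix that fits
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if amount >= threshold:
            return f"{sign}${amount / threshold:.1f}{suffix}"
    return f"{sign}${amount:,.0f}"

for example in (950, 12_345, 9_412_000, -2_500_000_000):
    print(format_currency_short(example))
# $950
# $12.3K
# $9.4M
# -$2.5B
```

Keep the exact figures in an appendix or table for readers who need them. The summary itself only needs enough precision to convey size and direction.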

### Common mistake
Writing a summary that repeats charts (“This chart shows…”) instead of stating the *decision-relevant* conclusion.

### Mini-project: Tell a complete data story (beginner-friendly)
Create a short story (like a 1-page report) using this dataset.

**Your deliverables:**
1. A clear problem statement (1–2 sentences)
2. Two visuals that support your message
3. One executive summary paragraph

**Suggested topics (pick one):**
- Which product should we promote more, and why?
- Did the campaign help equally across regions?
- Are there signs of unusual orders (outliers) we should investigate?

Use the next cell as a starter template.

In [None]:
# Mini-project starter template

# 1) Pick a topic
topic = "Did the campaign help equally across regions?"

# 2) Create Visual 1: campaign uplift by region (bar chart)
uplift = pivot.reset_index().sort_values("pct_change", ascending=False)

fig, ax = plt.subplots()
sns.barplot(data=uplift, x="region", y="pct_change", ax=ax)
ax.axhline(0, color="black", linewidth=1)
ax.set_title("Estimated campaign uplift by region (avg daily revenue)")
ax.set_xlabel("Region")
ax.set_ylabel("% change after vs before")
plt.tight_layout()
plt.show()

# 3) Create Visual 2: daily trend for the top region
plot_daily_revenue_for_region(df, top_change_region)

# 4) Executive summary (edit in your own words)
your_summary = f"""
Topic: {topic}

Key finding: Region-level uplift varies; {top_change_region} shows the largest estimated increase (~{top_change_value:.1f}%).
Recommendation: Focus follow-up analysis on why uplift differs (channel mix, timing, customer segments) and test improvements in lower-uplift regions.
Next step: Control for seasonality and run an A/B test if possible to confirm causality.
"""

print(your_summary)

## Summary / Key Takeaways

- Storytelling in analytics helps people understand the *meaning* of the data and make decisions.
- A clear structure (question → evidence → recommendation) prevents confusion and keeps work focused.
- Choose chart types based on the message: trends, comparisons, distributions, relationships.
- Dashboards should be simple, purposeful, and readable at a glance.
- Communicate differently to technical vs non-technical audiences; tailor depth and language.
- Executive summaries should be short, specific, and action-oriented.

## Optional resources
- *Storytelling with Data* (Cole Nussbaumer Knaflic)
- The Data Visualisation Catalogue: https://datavizcatalogue.com/
- Matplotlib documentation: https://matplotlib.org/stable/
- Seaborn documentation: https://seaborn.pydata.org/

A natural next step is to apply these principles to a real dataset (for example, a CSV from your own work) and build a short narrative report.