# Chapter 5: Principles of Data Visualization

Data visualization transforms raw numbers into visual stories. This chapter teaches you how to create clear, honest, and effective charts with Matplotlib and Seaborn, two widely used Python plotting libraries.

---

## Learning Objectives

By the end of this chapter, you will be able to:

1. **Understand** the purpose of data visualization (explore, explain, monitor)
2. **Apply** best practices and avoid common visualization mistakes
3. **Choose** the right chart type for your data and question
4. **Create** line charts, bar charts, scatter plots, histograms, and box plots
5. **Customize** plots with titles, labels, annotations, and formatting
6. **Use** both Matplotlib and Seaborn effectively

---

## Introduction

Data visualization is the skill of turning data into pictures that help people **understand** and **decide**. A good chart answers a question quickly. A poor chart can confuse, hide patterns, or even mislead.

In this chapter, we will:
- Understand the *purpose* of visualization
- Learn best practices (what to do and what to avoid)
- Choose the right chart for the right question
- Build charts using **Matplotlib** and **Seaborn**

> If you are new to plotting: don’t worry. We’ll go step-by-step and explain *why* we do each step.

## Setup: import libraries and load sample data

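We'll use the **tips** dataset distributed with seaborn: 244 real restaurant bills. To practice business-style questions, the setup code relabels three columns (`total_bill` → `sales`, `tip` → `ad_spend`, `day` → `category`) and adds a synthetic `date` column with one consecutive day per row. Keep in mind that `ad_spend` is really the tip amount, so treat the business interpretation as practice only. The result gives us:
- A date-ordered series (good for **line charts**)
- Categories like day and time (good for **bar charts**)
- A relationship between two numeric columns, bill and tip amount (good for **scatter plots**)
- Distributions of numeric values (good for **histograms** and **box plots**)

Seaborn downloads the dataset the first time you call `sns.load_dataset` (an internet connection is needed) and caches it locally afterwards.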

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots look consistent and readable
sns.set_theme(style="whitegrid", context="notebook")

# Load the tips dataset (seaborn fetches it online on first use, then caches it locally)
tips = sns.load_dataset("tips")

# Transform for our visualization examples
df = tips.copy()
df = df.rename(columns={
    "total_bill": "sales",
    "tip": "ad_spend",
    "day": "category"
})

# Add a synthetic date column for time-series examples (one consecutive day per row)
df["date"] = pd.date_range(start="2025-01-01", periods=len(df), freq="D")

print(f"Dataset: {len(df)} restaurant transactions")
print(f"Columns: {list(df.columns)}")
df.head()

### Quick data check
Before plotting, confirm the data types and basic ranges. This helps avoid mistakes like treating dates as text.

In [None]:
df.info()
df.describe(include="all").T
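
To see why the type check matters, here is a minimal sketch on a hypothetical three-row table (`toy` is made up, not part of the chapter's dataset) in which the dates arrived as text. The dtype check reveals the problem, and `pd.to_datetime` fixes it.

```python
import pandas as pd

# Hypothetical toy table: dates stored as plain strings, out of order
toy = pd.DataFrame({
    "date": ["2025-01-10", "2025-01-02", "2025-01-01"],
    "sales": [30.0, 20.0, 10.0],
})

# Strings are not a datetime dtype, so a plot would treat them as labels
is_datetime_before = pd.api.types.is_datetime64_any_dtype(toy["date"])

# Convert to real datetimes, then sort chronologically
toy["date"] = pd.to_datetime(toy["date"])
toy = toy.sort_values("date").reset_index(drop=True)
is_datetime_after = pd.api.types.is_datetime64_any_dtype(toy["date"])

print(is_datetime_before, is_datetime_after)
print(toy)
```

If the first flag is `False` for a column you expect to be a date, convert it before plotting.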

## Purpose of data visualization

A visualization is usually created to do one (or more) of these:

1. **Explore**: you are still learning what’s inside the data (patterns, outliers, missing values).
2. **Explain**: you already know the main message and want to communicate it clearly.
3. **Monitor**: you want to track metrics over time (dashboards, KPIs).

A good habit: write the question *before* making the chart.

Examples of questions:
- Are sales trending up or down?
- Which category sells the most?
- Does higher ad spend relate to higher sales?
- Are there unusual values (outliers)?

## Data visualization best practices

### Principles to follow
- **Clarity first**: label axes, title the chart, and choose readable sizes.
- **Tell the truth**: don’t distort scales or hide important context.
- **Reduce clutter**: avoid heavy gridlines, unnecessary 3D effects, and too many colors.
- **Use the right chart**: match chart type to the question.
- **Make comparisons easy**: sort bars, use consistent scales, align plots.

### Common beginner mistakes (watch out!)
- Using a **pie chart** with many categories (hard to compare).
- Forgetting **units** (is it dollars, minutes, percentage?).
- Starting a bar chart at a non-zero baseline (can exaggerate differences).
- Overplotting (too many points) without transparency or aggregation.
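
The baseline mistake is easy to quantify. The sketch below uses two hypothetical values and a made-up helper, `drawn_height_ratio`, to compare how much taller the larger bar looks when the axis starts at 0 versus at 90.

```python
# Hypothetical values for two bars
small, large = 95.0, 100.0

def drawn_height_ratio(small, large, baseline):
    """Ratio of the drawn bar heights when the axis starts at `baseline`."""
    return (large - baseline) / (small - baseline)

honest_ratio = drawn_height_ratio(small, large, baseline=0)
truncated_ratio = drawn_height_ratio(small, large, baseline=90)

print(f"Axis from 0:  larger bar looks {honest_ratio:.2f}x as tall")
print(f"Axis from 90: larger bar looks {truncated_ratio:.2f}x as tall")
```

A real difference of about 5% is drawn as a bar twice as tall once the axis is truncated at 90.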

> Tip: If someone can’t understand the chart in ~5 seconds, simplify.

### Exercise 1 (quick practice): describe a good chart
Write short answers:
1. What question does the chart answer?
2. What labels must it include?
3. What could mislead someone if done wrong?

(No code needed—this is about thinking like an analyst.)

## Types of charts and when to use them

Here is a simple rule-of-thumb mapping:
- **Line chart** → change over time (trends)
- **Bar chart** → compare categories (totals/averages)
- **Scatter plot** → relationship between two numeric variables
- **Histogram** → distribution (shape) of one numeric variable
- **Box plot** → compare distributions across groups + detect outliers

We will build these charts mostly with Seaborn (higher-level, nicer defaults) and use Matplotlib (the foundation) for the first line chart and for fine-tuning details.

## Line charts (trend over time)

**Question:** How do total sales change over time?

### Step-by-step
1. **Aggregate** to the level you want to visualize (daily totals).
2. Plot date on the x-axis and sales on the y-axis.
3. Add labels and a title so the chart can stand alone.

> Tip: Line charts work best when x is ordered (time).

In [None]:
daily = (
    df.groupby("date", as_index=False)["sales"]
      .sum()
      .rename(columns={"sales": "total_sales"})
)

plt.figure(figsize=(10, 4))
plt.plot(daily["date"], daily["total_sales"], linewidth=2)
plt.title("Total Sales Over Time")
plt.xlabel("Date")
plt.ylabel("Total Sales ($)")
plt.tight_layout()
plt.show()

daily.head()
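
In this dataset every row has its own date, so each daily total is just that day's single bill. The aggregation step matters when several transactions share a date. A minimal sketch on a hypothetical table (`toy_tx` and its values are made up):

```python
import pandas as pd

# Hypothetical transactions: several rows can share the same date
toy_tx = pd.DataFrame({
    "date": pd.to_datetime([
        "2025-01-01", "2025-01-01",
        "2025-01-02", "2025-01-02", "2025-01-02",
        "2025-01-03",
    ]),
    "sales": [10.0, 15.0, 20.0, 5.0, 5.0, 40.0],
})

# Same pattern as above: one row per date, with the sum of sales
toy_daily = (
    toy_tx.groupby("date", as_index=False)["sales"]
          .sum()
          .rename(columns={"sales": "total_sales"})
)
print(toy_daily)
```

Six transactions collapse into three daily totals (25, 30, and 40), which is the level a daily line chart needs.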

### Seaborn version (same idea, nicer defaults)
Seaborn sits on top of Matplotlib and makes many common plots quicker. We still label axes and use a clear title.

In [None]:
plt.figure(figsize=(10, 4))
sns.lineplot(data=daily, x="date", y="total_sales", linewidth=2)
plt.title("Total Sales Over Time (Seaborn)")
plt.xlabel("Date")
plt.ylabel("Total Sales ($)")
plt.tight_layout()
plt.show()

### Exercise 2: plot by category
Create a line chart showing **daily sales for each category**.

Hints:
- Group by `date` and `category`
- Use `sns.lineplot(..., hue="category")`
- Keep the figure size wide

In [None]:
# TODO: Your solution here
# 1) Create a grouped dataframe with columns: date, category, total_sales
# 2) Plot with seaborn using hue='category'

# Uncomment when you're ready:
# daily_by_cat = ...
# plt.figure(figsize=(10, 4))
# sns.lineplot(data=daily_by_cat, x='date', y='total_sales', hue='category')
# plt.title('Daily Sales by Category')
# plt.xlabel('Date')
# plt.ylabel('Sales ($)')
# plt.tight_layout()
# plt.show()

## Bar charts (compare categories)

**Question:** Which category sells the most on average?

### Step-by-step
1. Compute a summary per category (mean, sum, etc.).
2. Sort values to make comparisons easier.
3. Plot bars with clear labels.

> Warning: For bar charts, a non-zero baseline can visually exaggerate differences. Use baseline 0 unless you have a strong reason not to.

In [None]:
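cat_summary = (
    df.groupby("category", as_index=False, observed=True)["sales"]
      .mean()
      .rename(columns={"sales": "avg_sales"})
      .sort_values("avg_sales", ascending=False)
)

plt.figure(figsize=(6, 4))
# `category` is a pandas Categorical, so seaborn would keep its built-in order
# (Thur, Fri, Sat, Sun) and ignore the sort above unless we pass `order`
sns.barplot(data=cat_summary, x="category", y="avg_sales", order=cat_summary["category"].tolist())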
plt.title("Average Sales by Category")
plt.xlabel("Category")
plt.ylabel("Average Sales ($)")
plt.ylim(0, None)  # baseline at 0
plt.tight_layout()
plt.show()

cat_summary
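
The same kind of chart can be drawn with Matplotlib directly. This is a minimal sketch on hypothetical averages (`toy_summary` is made up and does not reproduce the output above); `ax.bar` draws the bars in the order the rows appear, so sorting the table is enough.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical averages, already sorted from largest to smallest
toy_summary = pd.DataFrame({
    "category": ["Sun", "Sat", "Thur", "Fri"],
    "avg_sales": [21.0, 20.0, 18.0, 17.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(toy_summary["category"], toy_summary["avg_sales"])
ax.set_title("Average Sales by Category (Matplotlib)")
ax.set_xlabel("Category")
ax.set_ylabel("Average Sales ($)")
ax.set_ylim(0, None)  # baseline at 0
fig.tight_layout()
# In a notebook the figure renders when the cell finishes;
# in a script, call plt.show() here.
```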

### Exercise 3: compare totals instead of averages
Make a bar chart of **total sales** per category.

Hint: replace `.mean()` with `.sum()` and rename the column.

In [None]:
# TODO: Your solution here
# totals = ...
# plt.figure(figsize=(6, 4))
# sns.barplot(data=totals, x='category', y='total_sales')
# plt.title('Total Sales by Category')
# plt.xlabel('Category')
# plt.ylabel('Total Sales ($)')
# plt.ylim(0, None)
# plt.tight_layout()
# plt.show()

## Scatter plots (relationship between two numeric variables)

**Question:** Do we see a relationship between ad spend and sales?

### Step-by-step
1. Put one numeric variable on x (`ad_spend`).
2. Put the other numeric variable on y (`sales`).
3. Use color (`hue`) to show groups (category).
4. Use transparency (`alpha`) to reduce overplotting.

In [None]:
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df, x="ad_spend", y="sales", hue="category", alpha=0.7)
plt.title("Ad Spend vs Sales")
plt.xlabel("Ad Spend ($)")
plt.ylabel("Sales ($)")
plt.tight_layout()
plt.show()

### Optional: add a trend line
A trend line can help show the overall relationship.

**Important:** A trend line is not proof of causation. It only shows association in this data.

In [None]:
plt.figure(figsize=(7, 5))
sns.regplot(data=df, x="ad_spend", y="sales", scatter_kws={"alpha": 0.3}, line_kws={"color": "black"})
plt.title("Ad Spend vs Sales (with Trend Line)")
plt.xlabel("Ad Spend ($)")
plt.ylabel("Sales ($)")
plt.tight_layout()
plt.show()
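
By default, `sns.regplot` draws a straight-line fit. As a rough sketch of what that line summarizes, the snippet below fits a line with `np.polyfit` and measures the strength of the linear association with `np.corrcoef`. The numbers are hypothetical, not the chapter's data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical ad spend
y = np.array([12.0, 15.0, 19.0, 24.0, 25.0])   # hypothetical sales

# Least-squares straight line: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation: strength of the linear association, between -1 and 1
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A correlation near 1 describes how closely the points follow a line. It says nothing about whether one variable causes the other.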

### Exercise 4: reduce overplotting
Try one improvement for overplotting:
- Use `alpha=0.2`
- Or use smaller points: `s=20`
- Or plot a random sample of rows

Goal: make the pattern easier to see.

In [None]:
# TODO: Your solution here
# Example idea: sample 150 points (the dataset has only 244 rows, so n must not exceed that)
# df_sample = df.sample(n=150, random_state=42)
# plt.figure(figsize=(7, 5))
# sns.scatterplot(data=df_sample, x='ad_spend', y='sales', hue='category', alpha=0.5, s=30)
# plt.title('Ad Spend vs Sales (Improved Readability)')
# plt.xlabel('Ad Spend ($)')
# plt.ylabel('Sales ($)')
# plt.tight_layout()
# plt.show()

## Histograms (distribution of a numeric variable)

**Question:** What is the typical sales amount? Are there many low/high values?

A histogram groups values into bins and shows how many observations fall in each bin.

> Tip: If your histogram looks too spiky or too flat, try changing the number of bins.

In [None]:
plt.figure(figsize=(7, 4))
sns.histplot(data=df, x="sales", bins=25, kde=True)
plt.title("Distribution of Sales")
plt.xlabel("Sales ($)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
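
To make the binning idea concrete without drawing anything, `np.histogram` returns the counts and bin edges that a histogram is built from. A small sketch with hypothetical values shows how the bin count changes the picture:

```python
import numpy as np

values = np.array([3, 7, 8, 12, 13, 14, 18, 22, 27, 29])  # hypothetical bills

# Same data, two different bin counts over the same 0-30 range
counts_3, edges_3 = np.histogram(values, bins=3, range=(0, 30))
counts_6, edges_6 = np.histogram(values, bins=6, range=(0, 30))

print(edges_3, counts_3)
print(edges_6, counts_6)
```

With 3 bins the counts are `[3, 4, 3]`, which looks almost flat. With 6 bins they are `[1, 2, 3, 1, 1, 2]`, which shows a cluster between 10 and 15.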

### Exercise 5: compare distributions by category
Create separate histograms (or one plot with color) to compare sales distributions across categories.

Hint: `sns.histplot(..., hue='category', element='step', stat='density', common_norm=False)`

In [None]:
# TODO: Your solution here
# plt.figure(figsize=(8, 4))
# sns.histplot(data=df, x='sales', hue='category', element='step', stat='density', common_norm=False, bins=25)
# plt.title('Sales Distribution by Category')
# plt.xlabel('Sales ($)')
# plt.ylabel('Density')
# plt.tight_layout()
# plt.show()

## Box plots (compare distributions + outliers)

**Question:** How do sales vary by category? Are there outliers?

A box plot shows:
- The middle 50% of values (the box)
- The median (line in the box)
- The whiskers (lines extending from the box toward the smallest and largest non-outlier values)
- Potential outliers (points beyond the whiskers)

> Warning: Box plots are great summaries, but they hide exact distribution shapes. Use with histograms when needed.

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(data=df, x="category", y="sales")
plt.title("Sales by Category (Box Plot)")
plt.xlabel("Category")
plt.ylabel("Sales ($)")
plt.tight_layout()
plt.show()
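
With default settings, Matplotlib and Seaborn draw the whiskers out to the most extreme points within 1.5 × IQR of the box and plot anything farther away as an outlier. The sketch below reproduces that rule by hand on hypothetical values:

```python
import numpy as np

values = np.array([10, 12, 13, 15, 16, 18, 20, 22, 50])  # hypothetical bills

# Box edges and median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points outside these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```

Here the box spans 13 to 20, the fences sit at 2.5 and 30.5, and only the value 50 is flagged.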

### Exercise 6: add points for more detail
Overlay a strip plot (or swarm plot) on top of the box plot so you can see individual observations.

Hints:
- Use `sns.boxplot(...)`
- Then `sns.stripplot(..., color='black', alpha=0.3)`

Goal: keep it readable (use transparency).

In [None]:
# TODO: Your solution here
# plt.figure(figsize=(6, 4))
# sns.boxplot(data=df, x='category', y='sales')
# sns.stripplot(data=df, x='category', y='sales', color='black', alpha=0.25, jitter=0.2)
# plt.title('Sales by Category (Box + Points)')
# plt.xlabel('Category')
# plt.ylabel('Sales ($)')
# plt.tight_layout()
# plt.show()

## Customizing plots (titles, labels, scales, styles)

Customization is not decoration—it’s for **communication**.

### What to customize (most common)
- Figure size (so text is readable)
- Titles and axis labels
- Tick formatting (dates, currency, percentages)
- Gridlines and spines (reduce clutter)
- Limits (`xlim`, `ylim`) only when appropriate

> Tip: If you’re making many charts, set global style once (we used `sns.set_theme`).

In [None]:
import matplotlib.ticker as mtick

plt.figure(figsize=(10, 4))
ax = sns.lineplot(data=daily, x="date", y="total_sales", linewidth=2)
ax.set_title("Total Sales Over Time (Formatted)")
ax.set_xlabel("Date")
ax.set_ylabel("Total Sales ($)")

# Example: format y-axis as currency-like with commas
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter("${x:,.0f}"))

plt.tight_layout()
plt.show()
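
`StrMethodFormatter` applies Python's `str.format` to each tick value under the name `x`, so you can try a format string on plain numbers before attaching it to an axis. A small sketch (the percent format is a hypothetical alternative, not used elsewhere in this chapter):

```python
currency_fmt = "${x:,.0f}"  # same format string as in the cell above
percent_fmt = "{x:.0%}"     # hypothetical alternative for fractions such as 0.25

currency_label = currency_fmt.format(x=1234567.0)
percent_label = percent_fmt.format(x=0.25)

print(currency_label)  # $1,234,567
print(percent_label)   # 25%
```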

## Multiple plots and layouts

When you want comparisons, it’s often better to show multiple plots side-by-side.

### Two common approaches
- **Matplotlib subplots**: full control
- **Seaborn faceting** (`FacetGrid` / `relplot`): quick small multiples

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

# Left: bar chart
sns.barplot(data=cat_summary, x="category", y="avg_sales", order=cat_summary["category"].tolist(), ax=axes[0])
axes[0].set_title("Average Sales by Category")
axes[0].set_xlabel("Category")
axes[0].set_ylabel("Avg Sales ($)")
axes[0].set_ylim(0, None)

# Right: histogram
sns.histplot(data=df, x="sales", bins=25, ax=axes[1])
axes[1].set_title("Sales Distribution")
axes[1].set_xlabel("Sales ($)")
axes[1].set_ylabel("Count")

plt.tight_layout()
plt.show()
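
The Seaborn faceting approach can be sketched with `sns.relplot`, which creates one panel per value of the column passed to `col`. The table below is hypothetical (`toy` and its values are made up for illustration).

```python
import pandas as pd
import seaborn as sns

# Hypothetical data with a grouping column to facet on
toy = pd.DataFrame({
    "ad_spend": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    "sales": [10.0, 14.0, 19.0, 12.0, 15.0, 22.0],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
})

# One panel per value of `time`; the panels share their axes by default,
# which keeps the scales consistent for comparison
g = sns.relplot(data=toy, x="ad_spend", y="sales", col="time", kind="scatter", height=3)
g.set_axis_labels("Ad Spend ($)", "Sales ($)")
g.set_titles(col_template="{col_name}")
# In a notebook the figure renders when the cell finishes;
# in a script, import matplotlib.pyplot as plt and call plt.show().
```

Faceting is quickest when every panel shows the same kind of plot. For mixed panels, such as a bar chart next to a histogram, `plt.subplots` remains the better fit.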

## Annotations and labels

Annotations help highlight a key point (like a peak, a dip, or a target).

### When to annotate
- You want to draw attention to a specific data point
- You want to explain a sudden change
- You want to show a goal/threshold

In [None]:
# Find the day with the highest total sales so we can annotate it
max_row = daily.loc[daily["total_sales"].idxmax()]
max_date = max_row["date"]
max_sales = max_row["total_sales"]

plt.figure(figsize=(10, 4))
ax = sns.lineplot(data=daily, x="date", y="total_sales", linewidth=2)
ax.set_title("Total Sales Over Time (Annotated Peak)")
ax.set_xlabel("Date")
ax.set_ylabel("Total Sales ($)")

# Add a point and a text label
ax.scatter([max_date], [max_sales], color="black", zorder=5)
ax.annotate(
    text=f"Peak: ${max_sales:,.0f}",
    xy=(max_date, max_sales),
    xytext=(10, 10),
    textcoords="offset points",
    arrowprops={"arrowstyle": "->", "color": "black"},
)

plt.tight_layout()
plt.show()

max_row
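
The goal/threshold case from the list above can be sketched with `ax.axhline`, which draws a horizontal reference line across the plot. All values here, including the target, are hypothetical.

```python
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]                   # hypothetical day index
sales = [18.0, 22.0, 19.0, 27.0, 24.0]   # hypothetical daily totals
target = 25.0                            # hypothetical goal

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(days, sales, linewidth=2)

# Dashed reference line at the target, with a short label just above it
ax.axhline(target, color="black", linestyle="--", linewidth=1)
ax.annotate(
    f"Target: ${target:,.0f}",
    xy=(days[0], target),
    xytext=(0, 5),
    textcoords="offset points",
)

ax.set_title("Daily Sales vs Target")
ax.set_xlabel("Day")
ax.set_ylabel("Total Sales ($)")
fig.tight_layout()
# In a notebook the figure renders when the cell finishes;
# in a script, call plt.show() here.
```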

## Color, scale, and readability considerations

### Color
- Use color to encode meaning (groups, categories), not decoration.
- Avoid using too many colors at once.
- Prefer colorblind-friendly palettes when possible.

### Scale
- Keep scales consistent when comparing plots.
- For bar charts, start at 0 unless there is a strong reason not to.
- For very wide ranges, consider log scale (but explain it clearly).

### Readability
- Increase figure size for dense charts
- Rotate tick labels if they overlap
- Use `tight_layout()` to prevent cut-off text

> Common mistake: tiny default figures. If you can’t read it, your audience can’t either.

In [None]:
# Example: choosing a palette and improving readability
plt.figure(figsize=(7, 5))
sns.scatterplot(
    data=df,
    x="ad_spend",
    y="sales",
    hue="category",
    palette="colorblind",
    alpha=0.6,
)
plt.title("Ad Spend vs Sales (Colorblind-friendly palette)")
plt.xlabel("Ad Spend ($)")
plt.ylabel("Sales ($)")
plt.tight_layout()
plt.show()
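
Two of the readability points above, a log scale for wide ranges and rotated tick labels, can be sketched as follows. The labels and values are hypothetical. Points are used instead of bars because a log axis has no zero baseline.

```python
import matplotlib.pyplot as plt

labels = ["Category A", "Category B", "Category C", "Category D"]  # hypothetical
values = [12, 150, 2300, 41000]                                     # hypothetical, wide range

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(labels, values, marker="o", linestyle="none")

ax.set_yscale("log")                         # equal steps now mean equal ratios
ax.tick_params(axis="x", labelrotation=45)   # rotate labels so they don't overlap

ax.set_title("Values Spanning Several Orders of Magnitude")
ax.set_xlabel("Category")
ax.set_ylabel("Value (log scale)")           # say so on the axis itself
fig.tight_layout()
# In a notebook the figure renders when the cell finishes;
# in a script, call plt.show() here.
```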

## Libraries: Matplotlib vs Seaborn

### Matplotlib
- The *core* plotting library in Python
- Very flexible and widely supported
- Sometimes requires more code for nicer defaults

### Seaborn
- Built on top of Matplotlib
- Works great with Pandas DataFrames
- Beautiful defaults and easy statistical plots

A practical approach:
- Use **Seaborn** for quick, clean plots
- Use **Matplotlib** to fine-tune details (titles, ticks, layouts)

## Mini-project: a small visualization report
Create a 2x2 dashboard-style figure answering four questions:
1. Total sales trend over time (line)
2. Total sales by category (bar)
3. Ad spend vs sales (scatter)
4. Sales distribution (histogram or box plot)

**Requirements:**
- Each subplot must have a title and axis labels
- Use a consistent style
- Keep it readable (figure size + `tight_layout`)

Start with the working template below, then adapt it: for example, swap the histogram for a box plot, sort the bars, or format the currency axes.

In [None]:
# TODO: Mini-project template

# Prepare any summaries you need
daily = df.groupby('date', as_index=False)['sales'].sum().rename(columns={'sales': 'total_sales'})
totals_by_cat = df.groupby('category', as_index=False)['sales'].sum().rename(columns={'sales': 'total_sales'})

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# (1) Line: total sales over time
sns.lineplot(data=daily, x='date', y='total_sales', ax=axes[0, 0])
axes[0, 0].set_title('Total Sales Over Time')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Total Sales ($)')

# (2) Bar: total sales by category
sns.barplot(data=totals_by_cat, x='category', y='total_sales', ax=axes[0, 1])
axes[0, 1].set_title('Total Sales by Category')
axes[0, 1].set_xlabel('Category')
axes[0, 1].set_ylabel('Total Sales ($)')
axes[0, 1].set_ylim(0, None)

# (3) Scatter: ad spend vs sales
sns.scatterplot(data=df, x='ad_spend', y='sales', hue='category', alpha=0.4, ax=axes[1, 0], legend=False)
axes[1, 0].set_title('Ad Spend vs Sales')
axes[1, 0].set_xlabel('Ad Spend ($)')
axes[1, 0].set_ylabel('Sales ($)')

# (4) Distribution: choose histogram OR box plot
sns.histplot(data=df, x='sales', bins=25, ax=axes[1, 1])
axes[1, 1].set_title('Sales Distribution')
axes[1, 1].set_xlabel('Sales ($)')
axes[1, 1].set_ylabel('Count')

plt.tight_layout()
plt.show()

## Additional resources (optional)
- Matplotlib gallery: https://matplotlib.org/stable/gallery/
- Seaborn tutorials: https://seaborn.pydata.org/tutorial.html
- Storytelling with Data (concepts): https://www.storytellingwithdata.com/

> Tip: Use galleries as inspiration, but always start from a question you need to answer.

## Summary / Key takeaways
- Start with a question, then choose the chart type.
- Label everything: title, axes, units. Make charts self-explanatory.
- Use line charts for trends, bars for category comparisons, scatter for relationships, histograms/box plots for distributions.
- Reduce clutter and avoid misleading scales (especially for bar charts).
- Seaborn gives quick, clean plots; Matplotlib helps you fine-tune details.

Next chapter (Chapter 6) will build on this by exploring **interactive** and more advanced visualization techniques.