# Lecture 6: Visualization Fundamentals
**BANA 4373 / ECON 4370: Applied Data Tools for Economics & Business**  
**Dr. Fidel González, Spring 2026**

---

## Learning Objectives
By the end of this notebook, you will be able to:

1. Create **histograms** to understand distributions
2. Build **line charts** for time series data
3. Use **scatterplots** to explore relationships
4. Create **bar charts** for group comparisons
5. Apply best practices: labels, titles, scales, and avoiding misrepresentation

**Case Study:** Labor Force Participation using CPS-style data

---
## 0) Setup and Project Structure

We begin by setting up our project folder structure and importing the libraries we'll use.

**Folder structure:**
```
lecture6_visualization/
├── data_raw/       # Original data files
├── data_clean/     # Cleaned/processed data
└── exports/        # Figures and output files
```

In [None]:
# 0) Setup
import os
from pathlib import Path
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

# Project folder structure
ROOT = Path.cwd() / "lecture6_visualization"
RAW_DIR = ROOT / "data_raw"
CLEAN_DIR = ROOT / "data_clean"
EXPORT_DIR = ROOT / "exports"

for d in [RAW_DIR, CLEAN_DIR, EXPORT_DIR]:
    d.mkdir(parents=True, exist_ok=True)

# Seaborn style (makes plots look nicer)
sns.set_style("whitegrid")
sns.set_palette("colorblind")  # Accessible color palette

# For inline plots in Jupyter
%matplotlib inline

print("Setup complete!")
print(f"Project root: {ROOT}")
print(f"Raw data:     {RAW_DIR}")
print(f"Clean data:   {CLEAN_DIR}")
print(f"Exports:      {EXPORT_DIR}")

---
## 1) Create Labor Force Data (Simulated CPS-style)

The Current Population Survey (CPS) is the primary source of U.S. labor force statistics.  
We'll create realistic simulated data to practice visualization techniques.

**Variables:**
- Labor force participation rate (LFPR) by demographic group
- Time series for 2000–2024
- Breakdown by sex, age group, and education

We'll save these datasets to our `data_raw` folder.

In [None]:
# 1a) Create time series data: Labor Force Participation Rate by sex (2000-2024)
np.random.seed(4370)

years = list(range(2000, 2025))

# Men: gradual decline from ~75% to ~68%
men_base = 75 - np.linspace(0, 7, len(years)) + np.random.normal(0, 0.5, len(years))
# Add recession dip (2008-2010)
men_base[8:11] -= np.array([1.5, 3.0, 2.5])
# Add COVID dip (2020)
men_base[20] -= 4.0
men_base[21] -= 2.0

# Women: rise, then gradual decline, from ~60% to ~58%
women_base = 60 + np.concatenate([
    np.linspace(0, 2, 8),      # 2000-2007: gradual rise
    np.linspace(2, 0, 5),      # 2008-2012: recession decline
    np.linspace(0, -1, 8),     # 2013-2020: gradual decline
    np.linspace(-1, -2, 4)     # 2021-2024: continued decline
]) + np.random.normal(0, 0.4, len(years))
# Add COVID dip
women_base[20] -= 5.0
women_base[21] -= 2.5

df_lfpr_time = pd.DataFrame({
    "year": years,
    "lfpr_men": np.round(men_base, 1),
    "lfpr_women": np.round(women_base, 1)
})

# Calculate overall (weighted average, roughly 52% women)
df_lfpr_time["lfpr_total"] = np.round(0.48 * df_lfpr_time["lfpr_men"] + 0.52 * df_lfpr_time["lfpr_women"], 1)

# Save to raw data folder
df_lfpr_time.to_csv(RAW_DIR / "lfpr_time_series.csv", index=False)

print("Dataset 1: Time Series Data")
print(f"Shape: {df_lfpr_time.shape}")
print(f"Saved to: {RAW_DIR / 'lfpr_time_series.csv'}")
df_lfpr_time.head(10)

In [None]:
# 1b) Create cross-sectional data: LFPR by demographic group (2024)

df_demographics = pd.DataFrame({
    "group": ["Men 25-54", "Women 25-54", "Men 55-64", "Women 55-64",
              "Men 16-24", "Women 16-24", "Men 65+", "Women 65+"],
    "lfpr": [88.2, 77.1, 70.5, 60.2, 55.3, 52.8, 24.1, 16.3],
    "sex": ["Men", "Women", "Men", "Women", "Men", "Women", "Men", "Women"],
    "age_group": ["25-54", "25-54", "55-64", "55-64", "16-24", "16-24", "65+", "65+"]
})

# Save to raw data folder
df_demographics.to_csv(RAW_DIR / "lfpr_demographics.csv", index=False)

print("Dataset 2: Demographics Data")
print(f"Saved to: {RAW_DIR / 'lfpr_demographics.csv'}")
df_demographics

In [None]:
# 1c) Create education data: LFPR by education level (2024)

df_education = pd.DataFrame({
    "education": ["Less than HS", "High School", "Some College", "Bachelor's", "Advanced Degree"],
    "lfpr": [45.2, 55.8, 64.3, 73.5, 75.1],
    "median_earnings": [28500, 38400, 45200, 67800, 82100]
})

# Save to raw data folder
df_education.to_csv(RAW_DIR / "lfpr_education.csv", index=False)

print("Dataset 3: Education Data")
print(f"Saved to: {RAW_DIR / 'lfpr_education.csv'}")
df_education

In [None]:
# 1d) Create state-level data: LFPR across states (for histogram)
np.random.seed(42)

# Simulate 50 states with realistic LFPR distribution
state_lfpr = np.random.normal(62, 4, 50)  # Mean ~62%, SD ~4%
state_lfpr = np.clip(state_lfpr, 52, 72)  # Reasonable bounds

df_states = pd.DataFrame({
    "state": [f"State_{i:02d}" for i in range(1, 51)],
    "lfpr": np.round(state_lfpr, 1)
})

# Save to raw data folder
df_states.to_csv(RAW_DIR / "lfpr_states.csv", index=False)

print("Dataset 4: State-Level Data")
print(f"Shape: {len(df_states)} states")
print(f"LFPR range: {df_states['lfpr'].min():.1f}% to {df_states['lfpr'].max():.1f}%")
print(f"Saved to: {RAW_DIR / 'lfpr_states.csv'}")
df_states.head()

In [None]:
# 1e) Add unemployment rate to time series data (for Your Turn exercises)
np.random.seed(123)

# Unemployment rate time series (2000-2024)
unemp_base = 4.5 + np.concatenate([
    np.linspace(0, 1, 8),       # 2000-2007: gradual rise
    np.array([2, 5, 4.5, 3.5, 2.5]),  # 2008-2012: Great Recession spike
    np.linspace(2, -1, 8),      # 2013-2020: recovery
    np.linspace(-1, 0, 4)       # 2021-2024: post-COVID
]) + np.random.normal(0, 0.3, len(years))

# Add COVID spike
unemp_base[20] += 9.0  # 2020 spike
unemp_base[21] += 2.0  # 2021 elevated

df_lfpr_time["unemployment_rate"] = np.round(np.clip(unemp_base, 2, 15), 1)

# Save updated time series to clean data folder
df_lfpr_time.to_csv(CLEAN_DIR / "lfpr_unemp_time_series.csv", index=False)

print("Added unemployment rate to time series data")
print(f"Saved to: {CLEAN_DIR / 'lfpr_unemp_time_series.csv'}")
df_lfpr_time[["year", "lfpr_total", "unemployment_rate"]].tail(10)

In [None]:
# Verify all files were created
print("=" * 50)
print("Files in data_raw:")
for f in RAW_DIR.glob("*.csv"):
    print(f"  - {f.name}")

print("\nFiles in data_clean:")
for f in CLEAN_DIR.glob("*.csv"):
    print(f"  - {f.name}")
print("=" * 50)

---
## 2) Histograms: Understanding Distributions

**Purpose:** See the shape of a single variable's distribution.

**Questions to ask:**
- Is it symmetric or skewed?
- Are there outliers?
- What's a "typical" value?

In [None]:
# 2a) Basic histogram with pandas (quick exploration)
df_states["lfpr"].hist(bins=10, edgecolor="black")
plt.title("Distribution of State Labor Force Participation Rates")
plt.xlabel("LFPR (%)")
plt.ylabel("Number of States")
plt.show()
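
To see what a histogram is actually counting, it can help to print the bins behind it. The sketch below recreates the simulated state values from Section 1d (same seed and bounds, which is an assumption about how the cell above was run) and uses `np.histogram` to list the bin edges and counts for 10 bins. It should correspond to the bars drawn above, but treat it as an illustration rather than part of the lecture code.

```python
import numpy as np

# Recreate the simulated state LFPR values from Section 1d (same seed and bounds)
np.random.seed(42)
state_lfpr = np.round(np.clip(np.random.normal(62, 4, 50), 52, 72), 1)

# np.histogram returns the count in each bin and the edges of the bins
counts, edges = np.histogram(state_lfpr, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n} states")
```

Changing `bins` changes the edges and can change the apparent shape, so it is worth trying a few values before settling on one.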

In [None]:
# 2b) Better histogram with seaborn (for presentation)
fig, ax = plt.subplots(figsize=(10, 6))

sns.histplot(data=df_states, x="lfpr", bins=12, kde=True, color="steelblue", ax=ax)

# Add vertical line for mean
mean_lfpr = df_states["lfpr"].mean()
ax.axvline(mean_lfpr, color="red", linestyle="--", linewidth=2, label=f"Mean: {mean_lfpr:.1f}%")

ax.set_xlabel("Labor Force Participation Rate (%)", fontsize=12)
ax.set_ylabel("Number of States", fontsize=12)
ax.set_title("Distribution of Labor Force Participation Rates Across U.S. States (2024)", fontsize=14)
ax.legend()

plt.tight_layout()
plt.show()

### Discussion: What does this histogram tell us?

- The distribution is roughly **symmetric** (bell-shaped)
- Most states cluster around 60-65%
- The **kde curve** (smooth line) helps see the overall shape
- The **mean line** provides a reference point
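
A quick numeric check can back up the visual impression of symmetry. The sketch below (an illustration, assuming the same seed and bounds as Section 1d) compares the mean and median and reports the sample skewness from pandas. A mean close to the median and a skewness near zero are consistent with a roughly symmetric distribution, though with only 50 observations neither is conclusive.

```python
import numpy as np
import pandas as pd

# Recreate the simulated state LFPR values (same seed and bounds as Section 1d)
np.random.seed(42)
lfpr = pd.Series(np.round(np.clip(np.random.normal(62, 4, 50), 52, 72), 1), name="lfpr")

summary = {
    "mean": lfpr.mean(),
    "median": lfpr.median(),
    "skewness": lfpr.skew(),  # sample skewness; 0 means perfectly symmetric
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```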

---
## 3) Line Charts: Time Series Trends

**Purpose:** Show how variables change over time.

**Best practices:**
- Time on x-axis
- Clear labels and legend
- Highlight key events (recessions, policy changes)

In [None]:
# 3a) Basic line chart with pandas
df_lfpr_time.plot(x="year", y=["lfpr_men", "lfpr_women"], figsize=(10, 6))
plt.title("Labor Force Participation by Sex")
plt.ylabel("LFPR (%)")
plt.show()

In [None]:
# 3b) Professional line chart with matplotlib
fig, ax = plt.subplots(figsize=(12, 7))

# Plot lines
ax.plot(df_lfpr_time["year"], df_lfpr_time["lfpr_men"], 
        marker="o", markersize=4, linewidth=2, label="Men", color="#1f77b4")
ax.plot(df_lfpr_time["year"], df_lfpr_time["lfpr_women"], 
        marker="s", markersize=4, linewidth=2, label="Women", color="#ff7f0e")
ax.plot(df_lfpr_time["year"], df_lfpr_time["lfpr_total"], 
        marker="^", markersize=4, linewidth=2, linestyle="--", label="Total", color="#2ca02c")

# Add recession shading (2008-2009 and 2020)
ax.axvspan(2007.5, 2009.5, alpha=0.2, color="gray", label="Recession")
ax.axvspan(2020, 2020.5, alpha=0.2, color="gray")

# Add annotations
ax.annotate("Great Recession", xy=(2008.5, 58), fontsize=10, color="gray")
ax.annotate("COVID-19", xy=(2020, 54), fontsize=10, color="gray")

# Labels and formatting
ax.set_xlabel("Year", fontsize=12)
ax.set_ylabel("Labor Force Participation Rate (%)", fontsize=12)
ax.set_title("U.S. Labor Force Participation Rate by Sex, 2000–2024", fontsize=14, fontweight="bold")
ax.legend(loc="lower left", fontsize=10)

# Set y-axis to start near the data (but not truncated misleadingly)
ax.set_ylim(50, 80)
ax.set_xlim(1999, 2025)

# Add grid
ax.grid(True, alpha=0.3)

# Add source
ax.text(0.99, 0.01, "Source: Simulated CPS-style data", 
        transform=ax.transAxes, fontsize=9, ha="right", va="bottom", color="gray")

plt.tight_layout()
plt.show()

### Discussion: What story does this chart tell?

1. **Long-term trends:** Men's LFPR declined steadily; women's rose through 2007, then drifted down
2. **Gender gap:** The gap has narrowed over time
3. **Cyclical effects:** Both series show dips during recessions
4. **COVID impact:** Sharp drop in 2020, partial recovery afterward

**Note:** The recession shading and annotations add important context!
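
If you shade the same periods on several charts, a small helper keeps the code consistent. The function below is a hypothetical convenience (`shade_periods` is not part of matplotlib); it uses `ax.axvspan` for the shading and places each label near the top of the plot so it does not depend on the y-axis range. The plotted values are made up for illustration.

```python
import matplotlib.pyplot as plt

def shade_periods(ax, periods, color="gray", alpha=0.2):
    """Shade each (start, end, label) period on ax and label it near the top."""
    for start, end, label in periods:
        ax.axvspan(start, end, color=color, alpha=alpha)
        # x is in data coordinates, y is in axes coordinates (0 = bottom, 1 = top)
        ax.text((start + end) / 2, 0.97, label,
                transform=ax.get_xaxis_transform(),
                ha="center", va="top", fontsize=9, color="dimgray")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot([2006, 2008, 2010, 2020, 2022], [66, 65, 63, 60, 62])  # made-up values
shade_periods(ax, [(2007.5, 2009.5, "Great Recession"), (2020, 2020.5, "COVID-19")])
```

The period boundaries mirror the ones used in the chart above; adjust them if you prefer a different recession dating.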

---
## 4) ⚠️ Avoiding Misrepresentation: The Truncated Y-Axis

A common way to mislead is to truncate the y-axis, making small changes look dramatic.

In [None]:
# 4) Side-by-side: Misleading vs. Honest chart

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# LEFT: Misleading (truncated y-axis)
ax1 = axes[0]
ax1.plot(df_lfpr_time["year"], df_lfpr_time["lfpr_total"], linewidth=2, color="red")
ax1.set_ylim(58, 68)  # Truncated!
ax1.set_title("⚠️ MISLEADING: Labor Force Participation\n(Truncated Y-Axis)", fontsize=12, color="red")
ax1.set_xlabel("Year")
ax1.set_ylabel("LFPR (%)")
ax1.grid(True, alpha=0.3)

# RIGHT: Honest (y-axis from 0)
ax2 = axes[1]
ax2.plot(df_lfpr_time["year"], df_lfpr_time["lfpr_total"], linewidth=2, color="green")
ax2.set_ylim(0, 100)  # Full scale
ax2.set_title("✓ HONEST: Labor Force Participation\n(Full Y-Axis Scale)", fontsize=12, color="green")
ax2.set_xlabel("Year")
ax2.set_ylabel("LFPR (%)")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### ⚠️ Key Lesson: The same data tells very different stories!

- **Left chart:** Looks like LFPR "collapsed", a dramatic decline!
- **Right chart:** Shows the decline is real but modest (roughly 4 to 5 percentage points from 2000 to 2024, with a temporary dip in 2020)

**Rule of thumb:**
- For **line charts**, truncating can be acceptable if you're showing small changes in context
- For **bar charts**, ALWAYS start at zero
- When in doubt, show both views or clearly note the scale
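
One way to quantify the effect is to ask how much of the axis height a change occupies under each set of limits. The sketch below uses a hypothetical helper, `axis_fraction`, and the approximate noise-free endpoints of the simulated total LFPR; the exact values in your notebook will differ slightly because of the random noise.

```python
def axis_fraction(start_value, end_value, y_min, y_max):
    """Share of the y-axis height spanned by the change from start to end."""
    return abs(end_value - start_value) / (y_max - y_min)

# Illustrative endpoints: noise-free start and end of the simulated total LFPR
# (0.48 * 75 + 0.52 * 60 = 67.2 in 2000, 0.48 * 68 + 0.52 * 58 = 62.8 in 2024)
start, end = 67.2, 62.8

truncated = axis_fraction(start, end, 58, 68)  # limits used in the left panel
full = axis_fraction(start, end, 0, 100)       # limits used in the right panel

print(f"Truncated axis: change spans {truncated:.0%} of the axis height")
print(f"Full axis:      change spans {full:.0%} of the axis height")
```

The same 4.4-point change fills about ten times more of the plot on the truncated axis, which is why the left panel feels so much more dramatic.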

---
# 🎯 YOUR TURN: Practice Exercises (15-20 minutes)

Now it's your turn to create visualizations! Work through these exercises independently or with a partner.

Use the datasets we've already created:
- `df_lfpr_time`: Time series with LFPR and unemployment rate
- `df_states`: State-level LFPR
- `df_demographics`: LFPR by age and sex
- `df_education`: LFPR and earnings by education

Or load them from the data folders:
```python
df_lfpr_time = pd.read_csv(CLEAN_DIR / "lfpr_unemp_time_series.csv")
df_states = pd.read_csv(RAW_DIR / "lfpr_states.csv")
```

---

## Exercise 1: Histogram of Unemployment Rate (5 points)

Create a histogram showing the distribution of **unemployment rates** over time (2000–2024).

**Requirements:**
- Use `df_lfpr_time["unemployment_rate"]`
- Add a vertical line showing the mean unemployment rate
- Include a clear title and axis labels

*Hint: Look at the seaborn histogram example in Section 2b*

In [None]:
# YOUR CODE HERE: Create a histogram of unemployment rates



**Question:** Is this distribution symmetric or skewed? Why might that be? (Think about what happened in 2008-2010 and 2020)

**Your Answer:** *TODO*

---
## Exercise 2: Dual Time Series of LFPR vs. Unemployment (10 points)

Create a **line chart** that shows both labor force participation (total) and unemployment rate over time.

**Requirements:**
- Plot both `lfpr_total` and `unemployment_rate` on the same chart
- Use different colors and markers for each line
- Add a legend
- Add a descriptive title
- Add shading for the 2008-2009 recession and 2020 COVID period

*Hint: Look at the professional line chart in Section 3b*

In [None]:
# YOUR CODE HERE: Create a dual time series line chart



**Question:** What relationship do you observe between LFPR and unemployment during recessions? Do they move together or in opposite directions?

**Your Answer:** *TODO*

---
## Exercise 3: Bar Chart of LFPR by Education (10 points)

Create a **horizontal bar chart** showing labor force participation by education level.

**Requirements:**
- Use `df_education`
- Sort bars from lowest to highest LFPR
- Start the x-axis at zero
- Add value labels on or next to each bar
- Include a clear title and axis labels

*Hint: Use `ax.barh()` for horizontal bars, and `ax.text()` for labels*

In [None]:
# YOUR CODE HERE: Create a horizontal bar chart of LFPR by education



**Question:** What does this chart tell us about the relationship between education and labor force participation? What might explain this pattern?

**Your Answer:** *TODO*

---
## Exercise 4: Identify the Problems (5 points)

The chart below has **at least 4 problems**. Run the cell, then list what's wrong.

In [None]:
# Run this cell to see a PROBLEMATIC chart
fig, ax = plt.subplots(figsize=(8, 5))

# Intentionally bad chart
ax.bar(df_education["education"], df_education["lfpr"], color=["red", "blue", "green", "purple", "orange"])
ax.set_ylim(40, 80)  # Problem: doesn't start at 0
# Missing: title, axis labels, consistent ordering

plt.show()

**List at least 4 problems with the chart above:**

1. *TODO*
2. *TODO*
3. *TODO*
4. *TODO*

---
## Exercise 5 (Bonus): Fix the Bad Chart (5 bonus points)

Take the problematic chart from Exercise 4 and create a corrected version that follows all best practices.

In [None]:
# YOUR CODE HERE: Create a corrected version of the chart



---
### 🎯 End of Your Turn Section

**Instructor will review solutions in:** ~5 minutes

---
## 5) Bar Charts: Group Comparisons (Instructor Demo Continues)

**Purpose:** Compare values across discrete categories.

**Critical rule:** Bar charts MUST start at zero!
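
The reason is that readers compare bar *lengths*, so a non-zero baseline distorts the ratio between groups. The sketch below uses a hypothetical helper, `apparent_ratio`, with two LFPR values from `df_education` to compare the true ratio with the ratio of drawn bar lengths when the axis starts at 40.

```python
def apparent_ratio(value_a, value_b, baseline=0.0):
    """Ratio of drawn bar lengths when bars start at `baseline` instead of 0."""
    return (value_a - baseline) / (value_b - baseline)

# LFPR values from df_education: Advanced Degree vs. Less than HS
advanced, less_than_hs = 75.1, 45.2

true_ratio = apparent_ratio(advanced, less_than_hs)                    # bars start at 0
truncated_ratio = apparent_ratio(advanced, less_than_hs, baseline=40)  # axis starts at 40

print(f"Bars from 0:  {true_ratio:.2f}x as long")
print(f"Bars from 40: {truncated_ratio:.2f}x as long")
```

With the axis starting at 40, the taller bar is drawn almost seven times as long as the shorter one, even though the underlying value is well under twice as large.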

In [None]:
# 5) Grouped bar chart: LFPR by age and sex
fig, ax = plt.subplots(figsize=(10, 6))

# Reshape for grouped bar chart
df_pivot = df_demographics.pivot(index="age_group", columns="sex", values="lfpr")

# Reorder age groups logically
age_order = ["16-24", "25-54", "55-64", "65+"]
df_pivot = df_pivot.reindex(age_order)

df_pivot.plot(kind="bar", ax=ax, color=["#1f77b4", "#ff7f0e"], edgecolor="black", width=0.7)

ax.set_xlabel("Age Group", fontsize=12)
ax.set_ylabel("Labor Force Participation Rate (%)", fontsize=12)
ax.set_title("Labor Force Participation by Age and Sex (2024)", fontsize=14)
ax.set_ylim(0, 100)  # Start at zero!
ax.legend(title="Sex")
ax.set_xticklabels(age_order, rotation=0)

plt.tight_layout()
plt.show()

### Discussion: What patterns do we see?

1. **Prime-age workers (25-54)** have the highest participation
2. **Men have higher LFPR** than women in every age group
3. **The gap varies by age:** Smallest for young workers (2.5 percentage points), largest for prime-age workers (11.1 percentage points)
4. **Older workers (65+)** have much lower participation (retirement)
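
Claims like these are easy to verify directly from the table instead of reading them off the bars. The sketch below rebuilds the relevant columns of `df_demographics` from Section 1b and computes the men-minus-women gap in percentage points for each age group; the column name `gap_pp` is introduced here for illustration.

```python
import pandas as pd

# Same values as df_demographics in Section 1b
df_demographics = pd.DataFrame({
    "lfpr": [88.2, 77.1, 70.5, 60.2, 55.3, 52.8, 24.1, 16.3],
    "sex": ["Men", "Women"] * 4,
    "age_group": ["25-54", "25-54", "55-64", "55-64", "16-24", "16-24", "65+", "65+"],
})

# One row per age group, one column per sex
by_age = df_demographics.pivot(index="age_group", columns="sex", values="lfpr")

# Gap in percentage points (men minus women)
by_age["gap_pp"] = (by_age["Men"] - by_age["Women"]).round(1)

print(by_age.sort_values("gap_pp"))
```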

---
## 6) Scatterplots: Exploring Relationships

**Purpose:** Visualize the relationship between two continuous variables.

In [None]:
# 6a) Scatterplot: Education vs. Earnings with LFPR as context
fig, ax = plt.subplots(figsize=(10, 6))

scatter = ax.scatter(df_education["lfpr"], df_education["median_earnings"], 
                     s=200, c=range(len(df_education)), cmap="viridis", 
                     edgecolors="black", linewidth=1.5)

# Add labels for each point
for i, row in df_education.iterrows():
    ax.annotate(row["education"], (row["lfpr"] + 1, row["median_earnings"] + 1500),
                fontsize=10)

ax.set_xlabel("Labor Force Participation Rate (%)", fontsize=12)
ax.set_ylabel("Median Annual Earnings ($)", fontsize=12)
ax.set_title("Education, Labor Force Participation, and Earnings (2024)", fontsize=14)

# Format y-axis as currency
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f"${x:,.0f}"))

plt.tight_layout()
plt.show()

In [None]:
# 6b) Scatterplot with regression line (seaborn)
fig, ax = plt.subplots(figsize=(10, 6))

sns.regplot(data=df_education, x="lfpr", y="median_earnings", 
            scatter_kws={"s": 150, "edgecolors": "black"}, 
            line_kws={"color": "red", "linestyle": "--"},
            ax=ax)

ax.set_xlabel("Labor Force Participation Rate (%)", fontsize=12)
ax.set_ylabel("Median Annual Earnings ($)", fontsize=12)
ax.set_title("Positive Relationship: Higher LFPR Associated with Higher Earnings", fontsize=14)

ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f"${x:,.0f}"))

plt.tight_layout()
plt.show()

### Discussion: Correlation vs. Causation

The scatterplot shows a **positive relationship** between LFPR and median earnings across the five education levels.

**But be careful!** This doesn't mean:
- Higher LFPR *causes* higher earnings
- We should force people into the labor force to raise earnings

**More likely:** Education is a common driver of both outcomes.

**Visualization shows patterns; analysis explains them.**
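
To put a number on the pattern, you can compute the Pearson correlation for the five education groups. This is a minimal sketch using the values from `df_education`; with only five aggregated points, the coefficient describes this small table and says nothing about individuals or about causation.

```python
import numpy as np

# Values from df_education (Section 1c)
lfpr = np.array([45.2, 55.8, 64.3, 73.5, 75.1])
median_earnings = np.array([28500, 38400, 45200, 67800, 82100])

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson r
r = np.corrcoef(lfpr, median_earnings)[0, 1]

print(f"Pearson r = {r:.2f} (n = {len(lfpr)})")
```

A correlation above 0.9 here is strong, but it would be equally strong if education drove both variables, which is the more plausible reading.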

---
## 7) Small Multiples (Faceted Plots)

**Purpose:** Compare the same visualization across subgroups.

In [None]:
# 7) Reshape time series to long format for faceting
df_long = df_lfpr_time.melt(
    id_vars=["year"],
    value_vars=["lfpr_men", "lfpr_women"],
    var_name="sex",
    value_name="lfpr"
)
df_long["sex"] = df_long["sex"].str.replace("lfpr_", "").str.title()

# Faceted line plot
g = sns.FacetGrid(df_long, col="sex", height=5, aspect=1.2)
g.map_dataframe(sns.lineplot, x="year", y="lfpr", marker="o")
g.set_axis_labels("Year", "LFPR (%)")
g.set_titles("{col_name}")
g.figure.suptitle("Labor Force Participation Trends by Sex", y=1.02, fontsize=14)

# Set same y-axis for comparison
for ax in g.axes.flat:
    ax.set_ylim(50, 80)

plt.tight_layout()
plt.show()
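
Faceting relies on the data being in long format, so it is worth seeing what `melt` does on a tiny table. The sketch below uses three made-up rows (not the simulated series) and the same reshaping steps as the cell above.

```python
import pandas as pd

# Wide format: one column per series (made-up values for illustration)
wide = pd.DataFrame({
    "year": [2022, 2023, 2024],
    "lfpr_men": [68.5, 68.2, 68.0],
    "lfpr_women": [58.6, 58.3, 58.0],
})

# Long format: one row per (year, sex) pair
long = wide.melt(id_vars=["year"], value_vars=["lfpr_men", "lfpr_women"],
                 var_name="sex", value_name="lfpr")

# Turn "lfpr_men" / "lfpr_women" into "Men" / "Women"
long["sex"] = long["sex"].str.replace("lfpr_", "", regex=False).str.title()

print(long)
```

Three rows with two value columns become six rows with a single `lfpr` column, and the `sex` column is what `FacetGrid` uses to split the panels.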

---
## 8) Saving Figures

For reports and presentations, save figures as high-resolution files to our `exports` folder.

In [None]:
# 8) Save a publication-quality figure
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(df_lfpr_time["year"], df_lfpr_time["lfpr_men"], 
        marker="o", linewidth=2, label="Men")
ax.plot(df_lfpr_time["year"], df_lfpr_time["lfpr_women"], 
        marker="s", linewidth=2, label="Women")

ax.set_xlabel("Year", fontsize=12)
ax.set_ylabel("Labor Force Participation Rate (%)", fontsize=12)
ax.set_title("U.S. Labor Force Participation by Sex, 2000–2024", fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

# Save as PNG (for presentations) and PDF (for papers)
fig.savefig(EXPORT_DIR / "lfpr_by_sex.png", dpi=300, bbox_inches="tight")
fig.savefig(EXPORT_DIR / "lfpr_by_sex.pdf", bbox_inches="tight")

print(f"Figures saved to: {EXPORT_DIR}")
print("  - lfpr_by_sex.png")
print("  - lfpr_by_sex.pdf")
plt.show()
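
After saving, it is a good habit to confirm the file was actually written. The sketch below is self-contained, so it saves a small figure with made-up values to a temporary folder (a stand-in for `EXPORT_DIR`) and checks the result; the file name `check_export.png` is hypothetical.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([2022, 2023, 2024], [62.1, 62.4, 62.8])  # made-up values
ax.set_xlabel("Year")
ax.set_ylabel("LFPR (%)")

# Temporary folder used here as a stand-in for EXPORT_DIR
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "check_export.png"

fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)  # release the figure once it has been saved

size_kb = out_path.stat().st_size / 1024
print(f"Saved {out_path.name}: {size_kb:.0f} KB")
```

In the lecture project you would replace `out_dir` with `EXPORT_DIR` so the check runs against the real exports folder.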

---
## 9) Summary: Visualization Checklist

Before sharing any visualization, ask yourself:

### Content
- [ ] Does the chart answer a clear question?
- [ ] Is the chart type appropriate for the data?
- [ ] Have I avoided misrepresentation (truncated axes, cherry-picking)?

### Labels
- [ ] Clear, descriptive title?
- [ ] Axis labels with units?
- [ ] Legend (if multiple series)?
- [ ] Data source noted?

### Design
- [ ] Readable font sizes?
- [ ] Appropriate color choices (colorblind-friendly)?
- [ ] Not overcrowded?
- [ ] Consistent scales if comparing panels?
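
Parts of the Labels checklist can be checked in code. The function below is a hypothetical helper (`missing_labels` is not a matplotlib function) that inspects an Axes for a title, axis labels, and a legend when more than one line is plotted. It is only a rough sketch: it cannot tell whether a label is *good*, only whether one exists.

```python
import matplotlib.pyplot as plt

def missing_labels(ax):
    """Return the basic label items that an Axes is missing."""
    missing = []
    if not ax.get_title():
        missing.append("title")
    if not ax.get_xlabel():
        missing.append("x-axis label")
    if not ax.get_ylabel():
        missing.append("y-axis label")
    # A legend only matters when there is more than one line to tell apart
    if len(ax.get_lines()) > 1 and ax.get_legend() is None:
        missing.append("legend")
    return missing

fig, ax = plt.subplots()
ax.plot([2023, 2024], [68.2, 68.0], label="Men")    # made-up values
ax.plot([2023, 2024], [58.3, 58.0], label="Women")
before = missing_labels(ax)

ax.set_title("LFPR by Sex")
ax.set_xlabel("Year")
ax.set_ylabel("LFPR (%)")
ax.legend()
after = missing_labels(ax)

print("Before:", before)
print("After: ", after)
```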

---
## 10) Reflection Questions

1. Why is it important to visualize data *before* running regressions?

2. What's the difference between a histogram and a bar chart? When would you use each?

3. When is it acceptable to not start a y-axis at zero?

4. How can visualizations mislead even if all the data is accurate?

---

**Next week:** Advanced Visualization & Interactive Dashboards (Plotly, Streamlit)