# Session 4: Data Visualization

## Session Introduction & Objectives

Welcome to Session 4 of our Data Science with Python course! In this session, we'll explore the fascinating world of data visualization.

### Learning Objectives
By the end of this session, you will be able to:
- Understand the importance of data visualization in data science
- Identify different chart types and when to use them
- Create basic plots using Matplotlib
- Enhance plots with titles, labels, and styling
- Use Seaborn for statistical visualizations
- Apply appropriate themes and color palettes
- Save and export visualizations

### What We'll Cover
1. **Importance of Data Visualization**
2. **Common Chart Types & Use Cases**
3. **Introduction to Matplotlib**
4. **Introduction to Seaborn**
5. **Hands-on Practice**

## Importance of Data Visualization in Data Science

Data visualization is a crucial component of data science for several reasons:

### 1. **Pattern Recognition**
- Human eyes are excellent at spotting patterns, trends, and outliers
- Visual representations make complex data relationships immediately apparent

### 2. **Communication**
- Transforms complex data into understandable insights
- Enables effective communication with stakeholders
- Makes data stories compelling and memorable

### 3. **Exploratory Data Analysis (EDA)**
- Helps identify data quality issues
- Reveals distributions and relationships
- Guides hypothesis formation

### 4. **Decision Making**
- Supports data-driven decision making
- Provides clear evidence for recommendations
- Reduces cognitive load when processing information

### 5. **Validation**
- Helps validate statistical findings
- Makes assumptions visible
- Identifies potential model issues
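
A classic illustration of the points above is Anscombe's quartet, a set of small datasets with nearly identical summary statistics but very different shapes. The sketch below uses two of its four series (values typed in from Anscombe's 1973 dataset; the variable names are chosen here for illustration) and shows that the numbers alone hide a difference that a plot makes obvious.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two series from Anscombe's quartet; both share the same x values
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_linear = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68])
y_curved = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                     6.13, 3.10, 9.13, 7.26, 4.74])

# The summary statistics are almost the same for both series...
for name, y in [("linear", y_linear), ("curved", y_curved)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name}: mean={y.mean():.2f}, var={y.var(ddof=1):.2f}, r={r:.3f}")

# ...but side-by-side scatter plots show one roughly linear cloud and one curve
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(x, y_linear)
axes[0].set_title('Series I: roughly linear')
axes[1].scatter(x, y_curved)
axes[1].set_title('Series II: curved')
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.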

## Common Chart Types & When to Use Them

Different types of data require different visualization approaches. Let's explore the main categories:

### 📊 Categorical Data Visualizations

#### **Bar Charts**
- **When to use:** Comparing quantities across different categories
- **Best for:** Discrete categories, rankings, frequencies
- **Example:** Sales by product category, survey responses

#### **Count Plots**
- **When to use:** Showing frequency distribution of categorical variables
- **Best for:** Understanding category frequencies
- **Example:** Number of customers by region

#### **Pie Charts**
- **When to use:** Showing parts of a whole (use sparingly!)
- **Best for:** When you have ≤5 categories and want to show proportions
- **Example:** Market share distribution

### 📈 Continuous Data Visualizations

#### **Histograms**
- **When to use:** Understanding distribution of a single continuous variable
- **Best for:** Identifying skewness, outliers, data distribution shape
- **Example:** Age distribution, income distribution

#### **Box Plots**
- **When to use:** Comparing distributions across groups, identifying outliers
- **Best for:** Showing quartiles, median, and outliers
- **Example:** Salary distribution by department

#### **Violin Plots**
- **When to use:** Combining box plot information with distribution shape
- **Best for:** Detailed distribution comparison
- **Example:** Test scores across different schools

### 🔗 Relationship Visualizations

#### **Scatter Plots**
- **When to use:** Exploring relationships between two continuous variables
- **Best for:** Correlation analysis, trend identification
- **Example:** Height vs. weight, advertising spend vs. sales

#### **Line Charts**
- **When to use:** Showing trends over time or ordered categories
- **Best for:** Time series data, sequential data
- **Example:** Stock prices over time, website traffic trends

#### **Bubble Charts**
- **When to use:** Relationships among three variables (x, y, and marker size)
- **Best for:** Adding a third variable to a scatter plot
- **Example:** GDP vs. life expectancy vs. population
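
As a minimal sketch of a bubble chart, Matplotlib's `scatter` can map a third variable to marker size through its `s` argument. The data below is synthetic and the variable names (`ad_spend`, `sales`, `store_size`) are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # fixed seed so the figure is reproducible

# Synthetic data: three variables per point (illustrative values only)
ad_spend = rng.uniform(10, 100, size=30)
sales = 2.5 * ad_spend + rng.normal(0, 20, size=30)
store_size = rng.uniform(50, 500, size=30)

fig, ax = plt.subplots(figsize=(8, 5))
# `s` is the marker area in points^2, so the third variable controls bubble area
bubbles = ax.scatter(ad_spend, sales, s=store_size, alpha=0.5, edgecolors='black')
ax.set_xlabel('Advertising spend')
ax.set_ylabel('Sales')
ax.set_title('Bubble chart: marker area encodes store size')
fig.tight_layout()
```

Because `s` is an area, doubling the value doubles the bubble's area rather than its diameter, which usually matches how readers judge size.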

### 🔥 Advanced Visualizations

#### **Heatmaps**
- **When to use:** Showing relationships in matrices, correlation analysis
- **Best for:** Correlation matrices, pivot tables, time-based patterns
- **Example:** Feature correlations, sales by month and region

#### **Pair Plots**
- **When to use:** Exploring relationships between multiple variables simultaneously
- **Best for:** Initial EDA, feature selection
- **Example:** Analyzing relationships in the Iris dataset

#### **Treemaps**
- **When to use:** Hierarchical data with size and category information
- **Best for:** Nested categories, portfolio analysis
- **Example:** Company revenue by division and department

## Introduction to Matplotlib

Matplotlib is the foundational plotting library for Python. It provides:
- Low-level control over plot elements
- Publication-quality figures
- Extensive customization options
- Foundation for other plotting libraries (like Seaborn)

Let's start by importing the necessary libraries:

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Set up matplotlib for inline plotting
%matplotlib inline

# Set default figure size
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")

### Basic Plot Creation

Let's start with simple plots to understand the basics:

In [None]:
# Create sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create a basic line plot
plt.plot(x, y)
plt.show()

In [None]:
# Create a scatter plot
x_scatter = np.random.randn(50)
y_scatter = np.random.randn(50)

plt.scatter(x_scatter, y_scatter)
plt.show()

In [None]:
# Create a bar chart
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 78, 32]

plt.bar(categories, values)
plt.show()

### Adding Titles, Labels, and Legends

Good visualizations need clear labels and context:

In [None]:
# Create a more detailed plot
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

plt.figure(figsize=(10, 6))

# Plot multiple lines
plt.plot(x, y1, label='sin(x)', linewidth=2, color='blue')
plt.plot(x, y2, label='cos(x)', linewidth=2, color='red', linestyle='--')

# Add labels and title
plt.xlabel('X values', fontsize=12)
plt.ylabel('Y values', fontsize=12)
plt.title('Trigonometric Functions', fontsize=14, fontweight='bold')

# Add legend
plt.legend(fontsize=10)

# Add grid for better readability
plt.grid(True, alpha=0.3)

# Show the plot
plt.tight_layout()
plt.show()
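
The same labelled plot can also be written with Matplotlib's object-oriented interface, where labels and titles are set on an explicit `Axes` object. This is a sketch of an equivalent style, not a required change; it tends to scale better once a figure has several subplots.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# plt.subplots returns the figure and a single Axes to draw on
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, np.sin(x), label='sin(x)', linewidth=2, color='blue')
ax.plot(x, np.cos(x), label='cos(x)', linewidth=2, color='red', linestyle='--')

# Axes methods mirror the pyplot functions: xlabel -> set_xlabel, title -> set_title
ax.set_xlabel('X values', fontsize=12)
ax.set_ylabel('Y values', fontsize=12)
ax.set_title('Trigonometric Functions', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
fig.tight_layout()
```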

### Figure Size & Styles

Matplotlib provides various ways to customize appearance:

In [None]:
# Available styles
print("Available styles:")
print(plt.style.available)

In [None]:
# Using different styles
# Sample data
x = np.linspace(0, 10, 50)
y = np.sin(x)

styles = ['default', 'seaborn-v0_8', 'ggplot', 'dark_background']

# A style sheet is read when a figure and its axes are created, so each
# figure is built inside its own style context
for style in styles:
    with plt.style.context(style):
        fig, ax = plt.subplots(figsize=(7, 4))
        ax.plot(x, y, linewidth=2)
        ax.set_title(f'Style: {style}', fontsize=12)
        ax.grid(True, alpha=0.3)
        fig.tight_layout()
        plt.show()

### Saving Plots

You can save plots in various formats:

In [None]:
# Create a sample plot
plt.figure(figsize=(10, 6))
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y, linewidth=2, color='purple')
plt.xlabel('X values')
plt.ylabel('sin(x)')
plt.title('Sample Sine Wave')
plt.grid(True, alpha=0.3)

# Save in different formats
# plt.savefig('sine_wave.png', dpi=300, bbox_inches='tight')
# plt.savefig('sine_wave.pdf', bbox_inches='tight')
# plt.savefig('sine_wave.svg', bbox_inches='tight')

plt.show()

print("Plot displayed! (Uncomment savefig lines to save to files)")
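
A minimal sketch of the saving step, assuming a temporary directory as the (hypothetical) output location. It writes the same figure as PNG, PDF, and SVG and lists the resulting files. Saving before calling `plt.show()` is the safer order, since some backends discard the figure once it has been shown.

```python
import tempfile
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, np.sin(x), linewidth=2, color='purple')
ax.set_title('Sample Sine Wave')

# Hypothetical output location: a temporary directory (delete it when done)
out_dir = Path(tempfile.mkdtemp())
saved = []
for ext in ('png', 'pdf', 'svg'):
    path = out_dir / f'sine_wave.{ext}'
    # The format is inferred from the file extension;
    # dpi mainly matters for raster formats such as PNG
    fig.savefig(path, dpi=300, bbox_inches='tight')
    saved.append(path)

print([p.name for p in saved])
```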

## Introduction to Seaborn

Seaborn is a statistical data visualization library built on top of Matplotlib. It provides:
- Higher-level interface for statistical plots
- Beautiful default styles
- Built-in statistical functions
- Easy integration with pandas DataFrames

### Loading Datasets

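Seaborn provides example datasets that are well suited to learning. Note that `sns.load_dataset` downloads each dataset from seaborn's online data repository on first use and caches it locally, so the first call needs an internet connection:

In [None]:
# Load example datasets (downloaded on first use, then cached)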
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
flights = sns.load_dataset('flights')

print("Tips dataset shape:", tips.shape)
print("\nTips dataset first 5 rows:")
print(tips.head())

print("\n" + "="*50)
print("Iris dataset shape:", iris.shape)
print("\nIris dataset first 5 rows:")
print(iris.head())

### Basic Seaborn Plots

Let's explore the main plot types in Seaborn:

In [None]:
# Bar Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.barplot(data=tips, x='day', y='total_bill')
plt.title('Average Total Bill by Day')

plt.subplot(1, 2, 2)
sns.countplot(data=tips, x='day')
plt.title('Count of Visits by Day')

plt.tight_layout()
plt.show()

In [None]:
# Histogram and Distribution Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(data=tips, x='total_bill', bins=20)
plt.title('Distribution of Total Bill')

plt.subplot(1, 2, 2)
sns.histplot(data=tips, x='total_bill', hue='time', bins=15)
plt.title('Total Bill Distribution by Time')

plt.tight_layout()
plt.show()
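
The `bins` argument has a large effect on how a histogram reads. The sketch below uses NumPy only, with synthetic right-skewed values standing in for bill amounts (an assumption made so the example runs without downloading data), to show how bin width and peak height change with the number of bins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for bill amounts (illustrative only)
bills = rng.gamma(shape=4.0, scale=5.0, size=244)

for bins in (5, 20, 60):
    # np.histogram returns the count per bin and the bin edges (bins + 1 values)
    counts, edges = np.histogram(bills, bins=bins)
    print(f"bins={bins:>2}: bin width={edges[1] - edges[0]:.2f}, "
          f"largest bin holds {counts.max()} values")
```

Too few bins hide the shape of the distribution, while too many turn it into noise, so it is worth trying a few values before settling on one.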

In [None]:
# Scatter Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.title('Tips vs Total Bill')

plt.subplot(1, 2, 2)
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', size='size')
plt.title('Tips vs Total Bill (by Time and Party Size)')

plt.tight_layout()
plt.show()

In [None]:
# Box Plot and Violin Plot
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(data=tips, x='day', y='total_bill')
plt.title('Total Bill Distribution by Day')
plt.xticks(rotation=45)

plt.subplot(1, 3, 2)
sns.violinplot(data=tips, x='day', y='total_bill')
plt.title('Total Bill Distribution by Day (Violin)')
plt.xticks(rotation=45)

plt.subplot(1, 3, 3)
sns.boxplot(data=iris, x='species', y='sepal_length')
plt.title('Sepal Length by Species')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
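
To make the box plot's elements concrete, the sketch below computes the quartiles and the conventional 1.5 × IQR fences by hand for a small made-up sample. With default settings, points beyond these fences are the ones drawn as individual outlier markers.

```python
import numpy as np

# Small made-up sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                      # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr       # whiskers stop at the last point inside the fences
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```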

### Advanced Seaborn Plots

In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 8))

# Calculate correlation matrix
correlation_matrix = tips.select_dtypes(include=[np.number]).corr()

# Create heatmap
sns.heatmap(correlation_matrix, 
            annot=True, 
            cmap='coolwarm', 
            center=0,
            square=True,
            fmt='.2f')
plt.title('Correlation Matrix - Tips Dataset')
plt.tight_layout()
plt.show()
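
A correlation matrix is symmetric, so half of the heatmap repeats the other half. One common refinement, sketched here on synthetic columns `a`, `b`, and `c` (made up for illustration), is to hide the upper triangle with the `mask` argument of `sns.heatmap`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
# Synthetic numeric columns (illustrative only): b depends on a, c is independent
df = pd.DataFrame({'a': rng.normal(size=200)})
df['b'] = 0.8 * df['a'] + rng.normal(scale=0.5, size=200)
df['c'] = rng.normal(size=200)

corr = df.corr()
# True marks cells to hide: the strict upper triangle (k=1 keeps the diagonal visible)
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, center=0, square=True, ax=ax)
ax.set_title('Correlation Matrix (lower triangle only)')
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale comparable between different heatmaps.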

In [None]:
# Pair Plot
sns.pairplot(iris, hue='species', height=2.5)
plt.suptitle('Pair Plot - Iris Dataset', y=1.02)
plt.show()

### Built-in Themes & Palettes

Seaborn provides beautiful themes and color palettes:

In [None]:
# Different Seaborn styles
styles = ['whitegrid', 'darkgrid', 'white', 'dark', 'ticks']

fig = plt.figure(figsize=(18, 10))

for i, style in enumerate(styles, start=1):
    # A style only affects axes created while it is active, so each
    # subplot is added and drawn inside a temporary axes_style context
    with sns.axes_style(style):
        ax = fig.add_subplot(2, 3, i)
        sns.scatterplot(data=tips, x='total_bill', y='tip', ax=ax)
        ax.set_title(f'Style: {style}')

plt.tight_layout()
plt.show()

# Use the 'whitegrid' style for the plots that follow
sns.set_style('whitegrid')

In [None]:
# Color Palettes
palettes = ['deep', 'muted', 'pastel', 'bright', 'dark', 'colorblind']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for i, palette in enumerate(palettes):
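    # seaborn >= 0.13: a palette needs a hue variable; legend=False hides the redundant legend
    sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=palette, legend=False, ax=axes[i])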
    axes[i].set_title(f'Palette: {palette}')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
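
A palette is just a list of colors, which can be inspected directly. The short sketch below requests four colors from the `colorblind` palette and converts them to hex codes, which is handy when the same colors are needed in Matplotlib calls or in other tools.

```python
import seaborn as sns

# Ask for exactly as many colors as there are categories to plot
palette = sns.color_palette('colorblind', n_colors=4)

# Each entry is an (r, g, b) tuple with components between 0 and 1
for rgb in palette:
    print(tuple(round(c, 3) for c in rgb))

# as_hex() gives the same colors as '#rrggbb' strings
hex_codes = palette.as_hex()
print(hex_codes)
```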

## Customizing Visualizations

Let's explore advanced customization techniques to make your visualizations more professional and impactful.

### Advanced Color Palettes & Themes

Understanding how to use colors effectively is crucial for creating compelling visualizations:

In [None]:
# Creating custom color palettes
plt.figure(figsize=(16, 10))

# 1. Sequential palette for continuous data
plt.subplot(2, 3, 1)
sequential_colors = sns.color_palette("Blues_r", n_colors=4)  # one color per day
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=sequential_colors, legend=False)
plt.title('Sequential Palette (Blues)')
plt.xticks(rotation=45)

# 2. Diverging palette for data with a meaningful center
plt.subplot(2, 3, 2)
diverging_colors = sns.color_palette("RdBu_r", n_colors=4)
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=diverging_colors, legend=False)
plt.title('Diverging Palette (RdBu)')
plt.xticks(rotation=45)

# 3. Qualitative palette for categorical data
plt.subplot(2, 3, 3)
qualitative_colors = sns.color_palette("Set2", n_colors=4)  # one color per day
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=qualitative_colors, legend=False)
plt.title('Qualitative Palette (Set2)')
plt.xticks(rotation=45)

# 4. Custom brand colors
plt.subplot(2, 3, 4)
brand_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']  # Custom brand palette
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=brand_colors, legend=False)
plt.title('Custom Brand Colors')
plt.xticks(rotation=45)

# 5. Gradient palette
plt.subplot(2, 3, 5)
gradient_colors = sns.color_palette("viridis", n_colors=4)
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=gradient_colors, legend=False)
plt.title('Gradient Palette (Viridis)')
plt.xticks(rotation=45)

# 6. Colorblind-friendly palette
plt.subplot(2, 3, 6)
colorblind_colors = sns.color_palette("colorblind", n_colors=4)  # one color per day
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=colorblind_colors, legend=False)
plt.title('Colorblind-Friendly Palette')
plt.xticks(rotation=45)

plt.suptitle('Color Palette Comparison', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Advanced theme customization
# Create a custom theme
custom_style = {
    'axes.facecolor': '#f8f9fa',
    'axes.edgecolor': '#dee2e6',
    'axes.linewidth': 1.2,
    'xtick.color': '#495057',
    'ytick.color': '#495057',
    'axes.labelcolor': '#212529',
    'axes.titlesize': 14,
    'axes.titleweight': 'bold',
    'figure.facecolor': 'white',
    'grid.color': '#e9ecef',
    'grid.linewidth': 0.8
}

# Apply custom styling
with plt.rc_context(custom_style):
    plt.figure(figsize=(12, 8))
    
    plt.subplot(2, 2, 1)
    sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', s=60, alpha=0.8)
    plt.title('Custom Styled Scatter Plot')
    plt.grid(True)
    
    plt.subplot(2, 2, 2)
    sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette='Set3', legend=False)
    plt.title('Custom Styled Bar Plot')
    plt.xticks(rotation=45)
    plt.grid(True, axis='y')
    
    plt.subplot(2, 2, 3)
    sns.histplot(data=tips, x='total_bill', bins=20, color='#6c757d', alpha=0.7)
    plt.title('Custom Styled Histogram')
    plt.grid(True, axis='y')
    
    plt.subplot(2, 2, 4)
    sns.boxplot(data=tips, x='day', y='tip', hue='day', palette='husl', legend=False)
    plt.title('Custom Styled Box Plot')
    plt.xticks(rotation=45)
    plt.grid(True, axis='y')
    
    plt.suptitle('Custom Theme Application', fontsize=16, fontweight='bold', y=0.98)
    plt.tight_layout()
    plt.show()
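
One useful property of `plt.rc_context` is that its settings are temporary: they apply to anything created inside the `with` block and are restored afterwards. The sketch below checks this on a single setting.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

before = mpl.rcParams['axes.facecolor']

with plt.rc_context({'axes.facecolor': '#f8f9fa', 'axes.titleweight': 'bold'}):
    inside = mpl.rcParams['axes.facecolor']
    # Axes created here pick up the temporary settings
    fig, ax = plt.subplots()
    ax.set_title('Styled inside the context')

# Outside the block the previous value is back
after = mpl.rcParams['axes.facecolor']
print(before, inside, after)
```

This makes `rc_context` a good fit for one-off styled figures, while `plt.rcParams[...] = ...` or `sns.set_style` is better suited to notebook-wide defaults.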

### Annotations and Text

Annotations help highlight important insights and guide the viewer's attention:

In [None]:
# Adding annotations to highlight insights
plt.figure(figsize=(15, 10))

# 1. Annotated scatter plot
plt.subplot(2, 2, 1)
sns.scatterplot(data=tips, x='total_bill', y='tip', alpha=0.6)

# Find the highest tip
max_tip_idx = tips['tip'].idxmax()
max_tip_row = tips.loc[max_tip_idx]

# Annotate the highest tip
plt.annotate(f'Highest tip: ${max_tip_row["tip"]:.2f}', 
             xy=(max_tip_row['total_bill'], max_tip_row['tip']),
             xytext=(max_tip_row['total_bill'] + 5, max_tip_row['tip'] + 1),
             arrowprops=dict(arrowstyle='->', color='red', lw=2),
             fontsize=10, fontweight='bold', color='red')

plt.title('Tips vs Total Bill (with Annotation)')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')

# 2. Bar plot with value annotations
plt.subplot(2, 2, 2)
avg_by_day = tips.groupby('day', observed=False)['total_bill'].mean()
bars = plt.bar(avg_by_day.index, avg_by_day.values, color='lightblue', edgecolor='navy')

# Add value labels on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'${height:.1f}', ha='center', va='bottom', fontweight='bold')

plt.title('Average Total Bill by Day')
plt.ylabel('Average Total Bill ($)')
plt.xticks(rotation=45)

# 3. Line plot with trend annotation
plt.subplot(2, 2, 3)
# Create daily trends
daily_tips = tips.groupby(['day'], observed=False).agg({'tip': 'mean', 'total_bill': 'count'}).reset_index()
day_order = ['Thur', 'Fri', 'Sat', 'Sun']
daily_tips['day'] = pd.Categorical(daily_tips['day'], categories=day_order, ordered=True)
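# Reset the index after sorting so that row labels match the x positions used below
daily_tips = daily_tips.sort_values('day').reset_index(drop=True)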

plt.plot(range(len(daily_tips)), daily_tips['tip'], marker='o', linewidth=2, markersize=8)
plt.xticks(range(len(daily_tips)), daily_tips['day'])

# Add trend annotation
max_tip_day = daily_tips.loc[daily_tips['tip'].idxmax()]
max_idx = daily_tips['tip'].idxmax()
plt.annotate('Weekend peak!', 
             xy=(max_idx, max_tip_day['tip']),
             xytext=(max_idx + 0.3, max_tip_day['tip'] + 0.1),
             arrowprops=dict(arrowstyle='->', color='green'),
             fontsize=10, color='green', fontweight='bold')

plt.title('Average Tip Trend by Day')
plt.ylabel('Average Tip ($)')
plt.grid(True, alpha=0.3)

# 4. Heatmap with custom annotations
plt.subplot(2, 2, 4)
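# observed=False keeps every day/time combination; empty combinations appear as blank cells
pivot_data = tips.pivot_table(values='tip', index='day', columns='time', aggfunc='mean', observed=False)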
sns.heatmap(pivot_data, annot=True, fmt='.2f', cmap='YlOrRd', 
            cbar_kws={'label': 'Average Tip ($)'})
plt.title('Average Tips: Day vs Time Heatmap')

plt.suptitle('Advanced Annotations and Text', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
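
For bar value labels specifically, Matplotlib 3.4 and later also offer `Axes.bar_label`, which replaces the manual `plt.text` loop. The sketch below uses hard-coded illustrative values rather than numbers computed from the dataset.

```python
import matplotlib.pyplot as plt

days = ['Thur', 'Fri', 'Sat', 'Sun']
avg_bill = [17.7, 17.2, 20.4, 21.4]  # illustrative values, not computed from the data

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(days, avg_bill, color='lightblue', edgecolor='navy')

# bar_label places one text label per bar; padding is the gap in points
labels = ax.bar_label(bars, fmt='$%.1f', padding=3, fontweight='bold')

ax.set_ylabel('Average Total Bill ($)')
ax.set_ylim(0, max(avg_bill) * 1.15)  # leave headroom for the labels
fig.tight_layout()
```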

## Choosing the Right Chart for Your Audience

Different audiences require different visualization approaches. Let's explore how to tailor your charts:

### 👥 Audience Types and Preferences

#### **Executive/Management Audience**
- **Prefer**: High-level summaries, clear trends, actionable insights
- **Avoid**: Complex statistical plots, too much detail
- **Best charts**: Bar charts, line charts, simple dashboards

#### **Technical/Data Science Team**
- **Prefer**: Detailed analysis, statistical significance, methodology transparency
- **Embrace**: Box plots, correlation matrices, pair plots, confidence intervals
- **Best charts**: Statistical plots, distribution plots, diagnostic plots

#### **General Public/Customers**
- **Prefer**: Simple, intuitive, visually appealing
- **Avoid**: Jargon, complex statistical concepts
- **Best charts**: Simple bar/pie charts, infographic-style visualizations

#### **Academic/Research Audience**
- **Prefer**: Rigorous methodology, comprehensive analysis, reproducibility
- **Embrace**: Error bars, confidence intervals, detailed legends
- **Best charts**: Publication-quality plots with proper statistical annotations

In [None]:
# Example: Same data, different audiences
plt.figure(figsize=(20, 12))

# Executive Dashboard Style
plt.subplot(2, 3, 1)
avg_by_day = tips.groupby('day', observed=False)['total_bill'].mean().sort_values(ascending=False)
colors = ['#1f77b4' if x == avg_by_day.max() else '#ADD8E6' for x in avg_by_day.values]
bars = plt.bar(avg_by_day.index, avg_by_day.values, color=colors)
plt.title('Average Bill by Day\n(Executive View)', fontsize=14, fontweight='bold')
plt.ylabel('Average Bill ($)', fontsize=12)
# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'${height:.0f}', ha='center', va='bottom', fontsize=11, fontweight='bold')
plt.ylim(0, max(avg_by_day.values) * 1.15)

# Technical Analysis Style
plt.subplot(2, 3, 2)
sns.boxplot(data=tips, x='day', y='total_bill', showmeans=True)
plt.title('Total Bill Distribution by Day\n(Technical Analysis)', fontsize=14, fontweight='bold')
plt.ylabel('Total Bill ($)', fontsize=12)
plt.xticks(rotation=45)
# Add statistical annotation
plt.text(0.02, 0.98, f'n = {len(tips)} observations', transform=plt.gca().transAxes,
         fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# Public/Simple Style
plt.subplot(2, 3, 3)
day_counts = tips['day'].value_counts()
colors_simple = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
plt.pie(day_counts.values, labels=day_counts.index, autopct='%1.0f%%', 
        colors=colors_simple, startangle=90)
plt.title('Restaurant Visits by Day\n(Public View)', fontsize=14, fontweight='bold')

# Academic Style
plt.subplot(2, 3, 4)
# Calculate means and standard errors
stats_by_day = tips.groupby('day', observed=False)['total_bill'].agg(['mean', 'sem']).reset_index()
plt.errorbar(range(len(stats_by_day)), stats_by_day['mean'], 
             yerr=stats_by_day['sem'], marker='o', capsize=5, capthick=2)
plt.xticks(range(len(stats_by_day)), stats_by_day['day'])
plt.title('Mean Total Bill ± SEM by Day\n(Academic Style)', fontsize=14, fontweight='bold')
plt.ylabel('Total Bill ($)', fontsize=12)
plt.grid(True, alpha=0.3)

# Comparison: Complex vs Simple
plt.subplot(2, 3, 5)
# Complex version (avoid for general audience)
correlation_matrix = tips.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0, 
            square=True, fmt='.3f', cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Feature Correlation Matrix\n(Too Complex for General Audience)', fontsize=12, fontweight='bold')

plt.subplot(2, 3, 6)
# Simple version (better for general audience)
plt.scatter(tips['total_bill'], tips['tip'], alpha=0.6, color='#ff7f0e', s=50)
plt.xlabel('Bill Amount ($)', fontsize=12)
plt.ylabel('Tip Amount ($)', fontsize=12)
plt.title('Higher Bills = Higher Tips\n(Simple Message)', fontsize=14, fontweight='bold')
# Add simple trend line
z = np.polyfit(tips['total_bill'], tips['tip'], 1)
p = np.poly1d(z)
plt.plot(tips['total_bill'], p(tips['total_bill']), "r--", alpha=0.8, linewidth=2)

plt.suptitle('Same Data, Different Audiences', fontsize=18, fontweight='bold')
plt.tight_layout()
plt.show()
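
The trend line in the last panel comes from `np.polyfit` with degree 1, which fits a straight line by least squares and returns its slope and intercept. The sketch below applies it to made-up, noise-free data that follows a hypothetical rule (15% of the bill plus $1), so the recovered coefficients are easy to check.

```python
import numpy as np

bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = 0.15 * bill + 1.0   # hypothetical rule: 15% of the bill plus $1

# Degree-1 fit: coefficients come back highest power first (slope, then intercept)
slope, intercept = np.polyfit(bill, tip, 1)

# poly1d wraps the coefficients in a callable for evaluating the line
trend = np.poly1d([slope, intercept])

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted tip on a $100 bill: {trend(100.0):.2f}")
```

On real data the points scatter around the line, so the slope is best read as an average tendency rather than a rule.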

## Hands-on Exercise with Sample Dataset

Let's practice everything we've learned with a comprehensive exercise using a real dataset:

In [None]:
# Load and explore a new dataset
# Let's use seaborn's 'car_crashes' example dataset (US state-level data on drivers in fatal collisions)
crashes = sns.load_dataset('car_crashes')

print("Car Crashes Dataset")
print("="*50)
print(f"Shape: {crashes.shape}")
print(f"\nColumns: {list(crashes.columns)}")
print(f"\nFirst 5 rows:")
print(crashes.head())
print(f"\nBasic statistics:")
print(crashes.describe())

In [None]:
# Exercise: Create a comprehensive analysis dashboard
# Your task: Analyze the relationship between various factors and car crashes

fig = plt.figure(figsize=(20, 16))

# 1. Overall crash distribution
plt.subplot(3, 4, 1)
sns.histplot(data=crashes, x='total', bins=15, color='darkred', alpha=0.7)
plt.title('Distribution of Drivers in Fatal\nCollisions per Billion Miles', fontweight='bold')
plt.xlabel('Drivers in fatal collisions per billion miles')

# 2. Top 10 states with highest crash rates
plt.subplot(3, 4, 2)
top_states = crashes.nlargest(10, 'total')
sns.barplot(data=top_states, y='abbrev', x='total', hue='abbrev', palette='Reds_r', legend=False)
plt.title('Top 10 States by\nFatal-Collision Rate', fontweight='bold')
plt.xlabel('Drivers in fatal collisions per billion miles')

# 3. Alcohol-related crashes
plt.subplot(3, 4, 3)
sns.scatterplot(data=crashes, x='alcohol', y='total', alpha=0.7, color='orange')
plt.title('Alcohol vs Total Crashes', fontweight='bold')
plt.xlabel('Alcohol-related Crashes')
plt.ylabel('Total Crashes')
# Add correlation coefficient
corr_coef = crashes['alcohol'].corr(crashes['total'])
plt.text(0.05, 0.95, f'Correlation: {corr_coef:.3f}', 
         transform=plt.gca().transAxes, fontsize=10, 
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# 4. Speed-related crashes
plt.subplot(3, 4, 4)
sns.scatterplot(data=crashes, x='speeding', y='total', alpha=0.7, color='red')
plt.title('Speeding vs Total Crashes', fontweight='bold')
plt.xlabel('Speeding-related Crashes')
plt.ylabel('Total Crashes')
# Add correlation coefficient
corr_coef = crashes['speeding'].corr(crashes['total'])
plt.text(0.05, 0.95, f'Correlation: {corr_coef:.3f}', 
         transform=plt.gca().transAxes, fontsize=10,
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# 5. Insurance premiums vs crashes
plt.subplot(3, 4, 5)
sns.scatterplot(data=crashes, x='ins_premium', y='total', alpha=0.7, color='blue')
plt.title('Insurance Premium vs\nTotal Crashes', fontweight='bold')
plt.xlabel('Average Insurance Premium')
plt.ylabel('Total Crashes')

# 6. Average of each breakdown column (bar chart)
# Note: 'not_distracted' and 'no_previous' count drivers who were not distracted
# or had no previous accidents, so they are driver categories rather than causes
plt.subplot(3, 4, 6)
crash_causes = crashes[['alcohol', 'speeding', 'not_distracted', 'no_previous']].mean()
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
bars = plt.bar(range(len(crash_causes)), crash_causes.values, color=colors)
plt.xticks(range(len(crash_causes)), 
           ['Alcohol', 'Speeding', 'Not Distracted', 'No Previous'], rotation=45)
plt.title('Average Drivers in Fatal Collisions\nby Category (per billion miles)', fontweight='bold')
plt.ylabel('Average per billion miles')
# Add value labels
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{height:.1f}', ha='center', va='bottom', fontsize=9)

# 7. Correlation heatmap
plt.subplot(3, 4, 7)
numeric_cols = ['total', 'speeding', 'alcohol', 'not_distracted', 'no_previous', 'ins_premium', 'ins_losses']
corr_matrix = crashes[numeric_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, 
            square=True, fmt='.2f', cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Matrix', fontweight='bold')
plt.xticks(rotation=45)
plt.yticks(rotation=0)

# 8. Insurance losses vs premiums
plt.subplot(3, 4, 8)
sns.scatterplot(data=crashes, x='ins_premium', y='ins_losses', 
                size='total', alpha=0.7, color='purple')
plt.title('Insurance Losses vs Premiums\n(Size = Total Crashes)', fontweight='bold')
plt.xlabel('Insurance Premium')
plt.ylabel('Insurance Losses')

# 9. Box plot of insurance premiums by crash level
plt.subplot(3, 4, 9)
# Bin states into three equal-width crash levels based on 'total'
crashes['crash_level'] = pd.cut(crashes['total'], bins=3, labels=['Low', 'Medium', 'High'])
sns.boxplot(data=crashes, x='crash_level', y='ins_premium', hue='crash_level', palette='viridis', legend=False)
plt.title('Insurance Premiums by\nCrash Level', fontweight='bold')
plt.xlabel('Crash Level')
plt.ylabel('Insurance Premium')

# 10. Multiple factors analysis
plt.subplot(3, 4, 10)
# Create a composite risk score
crashes['risk_score'] = (crashes['alcohol'] + crashes['speeding']) / 2
sns.scatterplot(data=crashes, x='risk_score', y='total', 
                hue='crash_level', alpha=0.8)
plt.title('Risk Score vs Total Crashes', fontweight='bold')
plt.xlabel('Risk Score (Alcohol + Speeding)/2')
plt.ylabel('Total Crashes')

# 11. State performance ranking
plt.subplot(3, 4, 11)
# Bottom 10 states (safest)
bottom_states = crashes.nsmallest(10, 'total')
sns.barplot(data=bottom_states, y='abbrev', x='total', hue='abbrev', palette='Greens', legend=False)
plt.title('Top 10 Safest States\n(Lowest Fatal-Collision Rates)', fontweight='bold')
plt.xlabel('Drivers in fatal collisions per billion miles')

# 12. Summary statistics
plt.subplot(3, 4, 12)
plt.axis('off')
summary_text = f"""
DATASET SUMMARY
{'='*25}
Total States: {len(crashes)}
Avg total (per bn miles): {crashes['total'].mean():.1f}
Highest: {crashes['total'].max():.1f} ({crashes.loc[crashes['total'].idxmax(), 'abbrev']})
Lowest: {crashes['total'].min():.1f} ({crashes.loc[crashes['total'].idxmin(), 'abbrev']})

CORRELATIONS WITH TOTAL CRASHES:
{'='*35}
Alcohol: {crashes['alcohol'].corr(crashes['total']):.3f}
Speeding: {crashes['speeding'].corr(crashes['total']):.3f}
Insurance Premium: {crashes['ins_premium'].corr(crashes['total']):.3f}
Insurance Losses: {crashes['ins_losses'].corr(crashes['total']):.3f}

KEY INSIGHTS:
{'='*15}
• Strong correlation between
  alcohol/speeding and crashes
• Insurance costs reflect risk
• Wide variation between states
"""
plt.text(0.05, 0.95, summary_text, transform=plt.gca().transAxes, 
         fontsize=10, verticalalignment='top', fontfamily='monospace',
         bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))

plt.suptitle('Comprehensive Car Crash Analysis Dashboard', 
             fontsize=20, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()
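
Panel 9 builds its crash levels with `pd.cut(..., bins=3)`, which splits the value range into equal-width intervals, so the groups can end up very uneven in size. The sketch below contrasts this with `pd.qcut`, which splits by quantiles instead, on a small made-up series with one extreme value.

```python
import pandas as pd

# Made-up values with one extreme observation
rates = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
labels = ['Low', 'Medium', 'High']

# Equal-width bins: the range 1..100 is cut into three intervals of equal length
equal_width = pd.cut(rates, bins=3, labels=labels).value_counts()

# Quantile bins: edges are chosen so each group holds roughly the same number of rows
equal_count = pd.qcut(rates, q=3, labels=labels).value_counts()

print("pd.cut:\n", equal_width.sort_index())
print("pd.qcut:\n", equal_count.sort_index())
```

Which one is appropriate depends on the question: equal-width bins preserve the scale of the variable, while quantile bins give balanced groups for comparison.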

## Visualization Best Practices

Based on our exploration, here are the key best practices for effective data visualization:

### 🎯 **Design Principles**

#### **1. Clarity First**
- **Clear titles**: Be specific about what the chart shows
- **Descriptive labels**: Axis labels should include units
- **Logical ordering**: Sort categories meaningfully (alphabetical, by value, chronological)

#### **2. Reduce Cognitive Load**
- **Limit colors**: Use 3-5 colors maximum
- **Remove chart junk**: Eliminate unnecessary decorations
- **Use white space**: Don't cram too much information

#### **3. Highlight What Matters**
- **Color for emphasis**: Use bright colors for important data
- **Size for importance**: Larger elements draw attention
- **Annotations for insights**: Point out key findings

### 📊 **Chart Selection Guide**

| Data Type | Question | Best Chart | Alternative |
|-----------|----------|------------|-----------|
| Categorical | Compare categories | Bar chart | Column chart |
| Categorical | Show composition | Stacked bar | Pie chart (≤5 categories) |
| Continuous | Show distribution | Histogram | Box plot, Violin plot |
| Time series | Show trends | Line chart | Area chart |
| Two variables | Show relationship | Scatter plot | Bubble chart |
| Multiple variables | Show correlations | Heatmap | Pair plot |
| Geographic | Show location patterns | Map | Choropleth |
| Hierarchical | Show nested data | Treemap | Sunburst |

### 🎨 **Color Guidelines**

#### **Sequential Data** (Low to High)
- Use gradients: light to dark
- Examples: Blues, Greens, Oranges

#### **Diverging Data** (Negative/Positive)
- Use contrasting colors with neutral center
- Examples: Red-White-Blue, Orange-White-Purple

#### **Categorical Data** (Distinct Groups)
- Use distinct, contrasting colors
- Examples: Set1, Set2, Dark2

#### **Accessibility**
- Use colorblind-friendly palettes
- Don't rely solely on color to convey information
- Test with colorblind simulators
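
One way to avoid relying solely on color is to encode each series redundantly with a distinct line style and marker. The following is a minimal sketch with made-up series; the particular styles and markers are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 20)
series = {'sin(x)': np.sin(x), 'cos(x)': np.cos(x), 'sin(2x)': np.sin(2 * x)}
linestyles = ['-', '--', ':']
markers = ['o', 's', '^']

fig, ax = plt.subplots(figsize=(8, 4))
for (name, y), ls, marker in zip(series.items(), linestyles, markers):
    # Each series differs in color, line style, and marker shape,
    # so it stays distinguishable in grayscale or for colorblind readers
    ax.plot(x, y, linestyle=ls, marker=marker, label=name)

ax.legend(title='Series')
ax.set_title('Redundant encoding: color + line style + marker')
fig.tight_layout()
```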

### 📐 **Technical Best Practices**

In [None]:
# Demonstration of best practices
plt.figure(figsize=(16, 12))

# BEFORE: Poor visualization practices
plt.subplot(2, 2, 1)
# Example of what NOT to do
poor_colors = ['#ff0000', '#00ff00', '#0000ff', '#ffff00', '#ff00ff', '#00ffff']
day_data = tips.groupby('day', observed=False)['total_bill'].mean()
plt.bar(day_data.index, day_data.values, color=poor_colors[:len(day_data)])
plt.title('bad chart')  # Poor title
# No axis labels, poor colors, no context
plt.xticks(rotation=90)  # Too much rotation

# AFTER: Good visualization practices
plt.subplot(2, 2, 2)
# Sort data meaningfully
day_data_sorted = day_data.sort_values(ascending=False)
# Use appropriate colors
colors = ['#2E86AB' if x == day_data_sorted.max() else '#A23B72' for x in day_data_sorted.values]
bars = plt.bar(day_data_sorted.index, day_data_sorted.values, color=colors, edgecolor='white')
# Clear, descriptive title
plt.title('Average Restaurant Bill by Day of Week\n(Sorted by Amount)', 
          fontsize=14, fontweight='bold', pad=20)
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Average Bill Amount ($)', fontsize=12)
# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'${height:.0f}', ha='center', va='bottom', fontweight='bold')
plt.xticks(rotation=0)  # Readable rotation

# BEFORE: Cluttered scatter plot
plt.subplot(2, 2, 3)
plt.scatter(tips['total_bill'], tips['tip'], c=range(len(tips)), 
           cmap='rainbow', s=100, alpha=1.0)
plt.title('scatter')
plt.colorbar(label='Random Colors')  # Meaningless colorbar
# Too many colors, no clear message

# AFTER: Clean, focused scatter plot
plt.subplot(2, 2, 4)
# Use meaningful grouping
colors_by_time = {'Lunch': '#FF6B6B', 'Dinner': '#4ECDC4'}
for time_period in tips['time'].unique():
    subset = tips[tips['time'] == time_period]
    plt.scatter(subset['total_bill'], subset['tip'], 
               c=colors_by_time[time_period], label=time_period, 
               alpha=0.6, s=50, edgecolors='white', linewidth=0.5)

# Add trend line
z = np.polyfit(tips['total_bill'], tips['tip'], 1)
p = np.poly1d(z)
plt.plot(tips['total_bill'], p(tips['total_bill']), "k--", alpha=0.7, linewidth=2)

plt.title('Restaurant Tips Increase with Bill Amount\n(Lunch vs Dinner Comparison)', 
          fontsize=14, fontweight='bold', pad=20)
plt.xlabel('Total Bill Amount ($)', fontsize=12)
plt.ylabel('Tip Amount ($)', fontsize=12)
plt.legend(title='Meal Time', loc='upper left')
plt.grid(True, alpha=0.3)

# Add correlation annotation
corr = tips['total_bill'].corr(tips['tip'])
plt.text(0.05, 0.95, f'Correlation: {corr:.3f}', transform=plt.gca().transAxes,
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8),
         fontsize=11, verticalalignment='top')

plt.suptitle('Visualization Best Practices: Before vs After', 
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Custom color palettes
plt.figure(figsize=(15, 5))

# Custom colors
custom_colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

plt.subplot(1, 3, 1)
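# hue='day' with legend=False applies the palette per category
# (seaborn 0.13+ deprecates passing palette without hue)
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette=custom_colors, legend=False)
plt.title('Custom Color Palette')
plt.xticks(rotation=45)

# Gradient (sequential) palette: best suited to ordered or numeric data
plt.subplot(1, 3, 2)
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette='viridis', legend=False)
plt.title('Viridis Palette')
plt.xticks(rotation=45)

# Diverging palette: best suited to data with a meaningful midpoint
plt.subplot(1, 3, 3)
sns.barplot(data=tips, x='day', y='total_bill', hue='day', palette='RdYlBu', legend=False)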
plt.title('Diverging Palette (RdYlBu)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Practical Exercise: Comprehensive Data Visualization

Let's put everything together with a comprehensive analysis of the tips dataset:

In [None]:
# Load and explore the dataset
tips = sns.load_dataset('tips')

print("Dataset Info:")
print(f"Shape: {tips.shape}")
print(f"\nColumns: {list(tips.columns)}")
print(f"\nData types:")
print(tips.dtypes)
print(f"\nBasic statistics:")
print(tips.describe())

In [None]:
# Create a comprehensive dashboard
fig = plt.figure(figsize=(20, 16))

# 1. Distribution of total bills
plt.subplot(3, 3, 1)
sns.histplot(data=tips, x='total_bill', bins=20, color='skyblue')
plt.title('Distribution of Total Bills', fontsize=14, fontweight='bold')

# 2. Tips vs Total Bill
plt.subplot(3, 3, 2)
sns.scatterplot(data=tips, x='total_bill', y='tip', alpha=0.7)
plt.title('Tips vs Total Bill', fontsize=14, fontweight='bold')

# 3. Tips by day
plt.subplot(3, 3, 3)
# hue + legend=False: the seaborn 0.13+ way to color each category with a palette
sns.boxplot(data=tips, x='day', y='tip', hue='day', palette='Set2', legend=False)
plt.title('Tips by Day of Week', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)

# 4. Count by day
plt.subplot(3, 3, 4)
sns.countplot(data=tips, x='day', hue='day', palette='viridis', legend=False)
plt.title('Number of Visits by Day', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)

# 5. Tips by time
plt.subplot(3, 3, 5)
sns.barplot(data=tips, x='time', y='tip', hue='time', palette='coolwarm', legend=False)
plt.title('Average Tips by Time', fontsize=14, fontweight='bold')

# 6. Party size distribution
plt.subplot(3, 3, 6)
sns.countplot(data=tips, x='size', hue='size', palette='pastel', legend=False)
plt.title('Party Size Distribution', fontsize=14, fontweight='bold')

# 7. Tips by smoker status
plt.subplot(3, 3, 7)
sns.violinplot(data=tips, x='smoker', y='tip', hue='smoker', palette='muted', legend=False)
plt.title('Tips by Smoker Status', fontsize=14, fontweight='bold')

# 8. Total bill by day and time
plt.subplot(3, 3, 8)
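# observed=False keeps the current pandas behavior explicit for categorical columns;
# day/time combinations with no data show as blank cells in the heatmap
pivot_data = tips.pivot_table(values='total_bill', index='day', columns='time',
                              aggfunc='mean', observed=False)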
sns.heatmap(pivot_data, annot=True, fmt='.1f', cmap='YlOrRd')
plt.title('Avg Total Bill: Day vs Time', fontsize=14, fontweight='bold')

# 9. Correlation heatmap
plt.subplot(3, 3, 9)
corr_data = tips.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr_data, annot=True, cmap='coolwarm', center=0, square=True, fmt='.2f')
plt.title('Correlation Matrix', fontsize=14, fontweight='bold')

plt.suptitle('Comprehensive Tips Dataset Analysis', fontsize=20, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()

## Key Takeaways & Best Practices

### 📋 Visualization Best Practices

1. **Choose the Right Chart Type**
   - Match visualization to data type and question
   - Consider your audience and their familiarity with chart types

2. **Keep It Simple**
   - Avoid chart junk and unnecessary decorations
   - Focus on the message you want to convey

3. **Use Color Purposefully**
   - Use color to highlight important information
   - Consider colorblind-friendly palettes
   - Be consistent with color meanings

4. **Label Everything**
   - Always include axis labels, titles, and legends
   - Make labels descriptive and meaningful

5. **Consider Your Audience**
   - Technical vs. non-technical audiences need different approaches
   - Adjust complexity accordingly
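
To make "be consistent with color meanings" concrete, one option is to define a single category-to-color mapping and reuse it in every panel. The sketch below uses made-up summary numbers and a hypothetical `TIME_COLORS` dictionary; it only illustrates the pattern.

```python
import matplotlib.pyplot as plt

# One shared mapping: each category keeps the same color in every chart
TIME_COLORS = {'Lunch': '#E69F00', 'Dinner': '#0072B2'}

# Hypothetical summary values, for illustration only
avg_bill = {'Lunch': 17.2, 'Dinner': 20.8}
avg_tip = {'Lunch': 2.7, 'Dinner': 3.1}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, data, ylabel in zip(axes, [avg_bill, avg_tip],
                            ['Average Bill ($)', 'Average Tip ($)']):
    labels = list(data)
    # Look colors up by category name rather than by position
    ax.bar(labels, [data[k] for k in labels],
           color=[TIME_COLORS[k] for k in labels])
    ax.set_xlabel('Meal Time')
    ax.set_ylabel(ylabel)
    ax.set_title(f'{ylabel} by Meal Time')
fig.tight_layout()
```

Because colors are looked up by name, "Dinner" stays blue even if a later chart sorts or filters the categories differently.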

### 🛠️ Technical Tips

- **Matplotlib**: Use for fine-grained control and custom visualizations
- **Seaborn**: Use for statistical plots and beautiful defaults
- **Save in appropriate formats**: PNG for web, PDF for print, SVG for scalability
- **Use appropriate DPI**: 300+ for publication, 150 for web
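
The format and DPI tips can be applied with `Figure.savefig`. Here is a minimal sketch, assuming a throwaway plot and a temporary output folder; the `example_plot` file names are placeholders.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3, 4], [10, 20, 15, 30], marker='o')
ax.set_title('Example Plot')
ax.set_xlabel('X value')
ax.set_ylabel('Y value')

# Temporary folder for the demo; replace with your own output path
out_dir = Path(tempfile.mkdtemp())

# Raster output: dpi controls the pixel density of the saved image
fig.savefig(out_dir / 'example_plot_web.png', dpi=150, bbox_inches='tight')
fig.savefig(out_dir / 'example_plot_print.png', dpi=300, bbox_inches='tight')

# Vector output: lines and text scale without pixelation
fig.savefig(out_dir / 'example_plot.pdf', bbox_inches='tight')
fig.savefig(out_dir / 'example_plot.svg', bbox_inches='tight')

saved_files = sorted(p.name for p in out_dir.iterdir())
print(saved_files)
```

`bbox_inches='tight'` trims extra whitespace around the figure, which usually helps when embedding plots in reports or slides.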

### 📚 Next Steps

1. Practice with different datasets
2. Explore advanced Matplotlib features (subplots, animations)
3. Learn about interactive visualizations (Plotly, Bokeh)
4. Study data storytelling techniques
5. Explore domain-specific visualization libraries

## Practice Exercises

Try these exercises to reinforce your learning:

### Exercise 1: Basic Visualizations
1. Create a histogram of the 'tip' column from the tips dataset
2. Add appropriate title, labels, and styling
3. Save the plot as a PNG file

### Exercise 2: Relationship Analysis
1. Create a scatter plot showing the relationship between total_bill and tip
2. Color the points by 'time' (Lunch/Dinner)
3. Add a trend line or regression line

### Exercise 3: Categorical Analysis
1. Create a grouped bar plot showing average total_bill by day and time
2. Use a custom color palette
3. Add error bars to show variability

### Exercise 4: Advanced Visualization
1. Load the 'flights' dataset from Seaborn
2. Create a heatmap showing passenger counts by month and year
3. Customize the color map and add annotations

### Exercise 5: Dashboard Creation
1. Create a 2x2 subplot figure
2. Include: histogram, scatter plot, box plot, and bar plot
3. Use a consistent color scheme across all subplots
4. Add a main title for the entire figure

In [None]:
# Space for your practice exercises
# Exercise 1: Create a histogram of tips

# Your code here

In [None]:
# Exercise 2: Scatter plot with color coding

# Your code here

In [None]:
# Exercise 3: Grouped bar plot

# Your code here

In [None]:
# Exercise 4: Flights heatmap

# Your code here

In [None]:
# Exercise 5: Dashboard creation

# Your code here

## Summary

Congratulations! You've completed Session 4 on Data Visualization. You now have the foundational skills to:

✅ Understand the importance of data visualization in data science  
✅ Choose appropriate chart types for different data types and questions  
✅ Create basic and advanced plots using Matplotlib  
✅ Enhance visualizations with proper labels, titles, and styling  
✅ Use Seaborn for statistical visualizations  
✅ Apply themes and color palettes effectively  
✅ Save and export your visualizations  

### 🎯 What's Next?
In our next session, we'll dive into **Statistical Analysis and Hypothesis Testing**, where you'll learn to:
- Understand descriptive and inferential statistics
- Perform hypothesis tests
- Calculate confidence intervals
- Interpret statistical results

Keep practicing with different datasets and visualization types. The more you practice, the more intuitive choosing the right visualization will become!

## Homework / Follow-up Tasks

To reinforce your learning and build practical skills, complete these assignments:

### 📝 **Assignment 1: Personal Data Analysis** (Due: Next Week)
**Objective**: Apply visualization techniques to a dataset of your choice

**Tasks**:
1. **Choose a dataset** from:
   - Kaggle (www.kaggle.com/datasets)
   - UCI Machine Learning Repository
   - Government open data portals
   - Your workplace/personal data (anonymized)

2. **Create a comprehensive analysis** including:
   - At least 5 different chart types
   - Audience-appropriate styling
   - Meaningful annotations and insights
   - Proper color scheme and labels

3. **Deliverables**:
   - Jupyter notebook with your analysis
   - 1-page summary of key insights
   - At least 2 different versions (technical vs. general audience)

### 📊 **Assignment 2: Visualization Critique** (Due: This Week)
**Objective**: Develop critical evaluation skills

**Tasks**:
1. **Find 3 visualizations** from news articles, reports, or social media
2. **Evaluate each using our best practices**:
   - What works well?
   - What could be improved?
   - Is the chart type appropriate?
   - Is it suited for the intended audience?
3. **Recreate one "bad" visualization** following best practices
4. **Write a brief critique** (200-300 words per visualization)

### 🛠️ **Assignment 3: Advanced Techniques** (Optional - Extra Credit)
**Objective**: Explore advanced visualization concepts

**Choose ONE of the following**:

#### **Option A: Interactive Visualizations**
- Learn Plotly basics
- Create 2-3 interactive plots
- Compare with static matplotlib/seaborn versions

#### **Option B: Dashboard Creation**
- Use matplotlib subplots or learn Streamlit/Dash
- Create a multi-panel dashboard
- Include filters or user controls

#### **Option C: Specialized Visualizations**
- Explore domain-specific plots (network graphs, geographic maps, etc.)
- Create visualizations for your field of interest
- Document the use cases and advantages

### 📚 **Recommended Reading/Resources**

#### **Books**:
- "The Visual Display of Quantitative Information" by Edward Tufte
- "Storytelling with Data" by Cole Nussbaumer Knaflic
- "The Grammar of Graphics" by Leland Wilkinson

#### **Online Resources**:
- **Color Palettes**: colorbrewer2.org, coolors.co
- **Inspiration**: r/dataisbeautiful, observablehq.com
- **Documentation**: matplotlib.org, seaborn.pydata.org

#### **Practice Datasets**:
- Seaborn built-in datasets (`sns.load_dataset()`)
- Kaggle Learn courses
- FiveThirtyEight data repository

### 🎯 **Self-Assessment Checklist**

Before submitting, ensure your visualizations include:

- [ ] Clear, descriptive titles
- [ ] Labeled axes with units
- [ ] Appropriate chart type for data and question
- [ ] Readable color scheme (colorblind-friendly)
- [ ] Proper aspect ratio and sizing
- [ ] Meaningful annotations where appropriate
- [ ] Consistent styling throughout
- [ ] Source citation for data
- [ ] Brief interpretation of insights

### 💡 **Next Session Preview**

In Session 5, we'll cover:
- **Statistical Analysis & Hypothesis Testing**
- **Correlation vs. Causation**
- **A/B Testing Fundamentals**
- **Confidence Intervals**
- **P-values and Statistical Significance**

**Preparation**: Review basic statistics concepts and make sure SciPy is installed (`pip install scipy`); the `scipy.stats` module we'll use is part of it.
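
If you're not sure whether SciPy is already available, a quick check like the following can help. This is a minimal sketch using only the standard library to look the package up; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed('scipy'):
    from scipy import stats  # scipy.stats ships as part of SciPy
    print('SciPy is ready for Session 5')
else:
    print('SciPy not found; install it with: pip install scipy')
```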

### 🤝 **Getting Help**

- **Office Hours**: [Your schedule here]
- **Discussion Forum**: [Link to course forum]
- **Study Groups**: Form groups of 3-4 students
- **Documentation**: Always check matplotlib/seaborn docs first
- **Stack Overflow**: Search before posting, use appropriate tags

### 📊 **Bonus Challenge**

Create a "visualization story" - a sequence of 4-6 charts that tell a coherent data story, building from simple exploration to complex insights. Think of it as a visual narrative that guides the reader through your analysis.