# Introduction to Data Visualisation (Solutions)

_This notebook provides worked solutions to exercises on basic data visualisation using pandas, Matplotlib, and seaborn. The exercises are designed to be completed in approximately 90 minutes by students who have little familiarity with the topics._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Sonnet 4.5)*, including updated documentation and git commit messages.

## Introduction

### Overview

1. **Basic line plot:** Begin with fundamental plotting concepts, similar to the demonstration but with slightly different requirements to ensure understanding
2. **Scatter plot analysis:** Introduce more complex visualisation by incorporating a third variable through colour mapping
3. **Statistical visualisation:** Practice creating subplots and using different types of statistical plots from [seaborn](https://seaborn.pydata.org/)
4. **Customised visualisation:** Extend Example 1 from the demonstration with additional customisation options
5. **Comparative analysis:** Apply the subplot concept shown in the demonstration to compare business metrics side by side

### Tips for success

- Review the demonstration notebook for examples and syntax
- Pay attention to plot customisation options
- Consider the best way to present the data clearly
- Don't forget to add proper labels and titles
- Use appropriate colour schemes

**Remember:** The goal is to create clear, informative visualisations that effectively communicate the data's story.

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Set the style
plt.style.use("classic")

In [None]:
# Load the data
df = pd.read_csv("assets/data/data.csv")
df['date'] = pd.to_datetime(df['date'])

In [None]:
df.head()

## Exercise 1: Basic line plot

Using the business metrics dataset, create a line plot showing the daily conversion rate over time.

Requirements:

1. Set the figure size to 12x6
2. Add appropriate title and axis labels
3. Include gridlines with 30% transparency
4. Make the line dark blue with 70% opacity
5. Rotate x-axis labels by 45 degrees

In [None]:
# Set figure size
plt.figure(figsize=(12, 6))

# Create the line plot
plt.plot(
    df['date'], 
    df['conversion_rate'],
    color="darkblue",
    alpha=0.7
)

# Add title and labels
plt.title("Daily Conversion Rate Over Time")
plt.xlabel("Date")
plt.ylabel("Conversion Rate")

# Add gridlines with transparency
plt.grid(True, alpha=0.3)  # Change the alpha value and see what happens!

# Rotate x-axis labels
plt.xticks(rotation=45)

# Adjust layout to prevent label cutoff
plt.tight_layout()

plt.show()

### Explanation

1. First, we create a figure with the specified size using `plt.figure(figsize=(12, 6))`
2. The `plot()` function creates the line plot with:
   - x-axis: date values
   - y-axis: conversion rate values
   - dark blue colour and 70% opacity (alpha=0.7)
3. We add title and axis labels using `title()`, `xlabel()`, and `ylabel()`
4. Gridlines are added with 30% opacity using `grid(True, alpha=0.3)`
5. X-axis labels are rotated 45 degrees using `xticks(rotation=45)`
6. `tight_layout()` ensures all labels fit within the figure boundaries
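
The same steps can also be written with Matplotlib's object-oriented interface, which the later subplot exercises rely on. The sketch below is a minimal illustration on synthetic data: the `demo` DataFrame and its values are invented for the example and stand in for `df`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for df (values are made up for illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "conversion_rate": rng.uniform(1.0, 5.0, size=30),
})

# plt.subplots() returns the figure and the axes explicitly
fig, ax = plt.subplots(figsize=(12, 6))

# ax.plot() returns a list of lines; unpack the single line
line, = ax.plot(
    demo["date"],
    demo["conversion_rate"],
    color="darkblue",
    alpha=0.7
)

# Axes methods use the set_ prefix instead of plt.title(), plt.xlabel(), ...
ax.set_title("Daily Conversion Rate Over Time")
ax.set_xlabel("Date")
ax.set_ylabel("Conversion Rate")
ax.grid(True, alpha=0.3)

# Rotates the x tick labels, like plt.xticks(rotation=45)
ax.tick_params(axis="x", labelrotation=45)

fig.tight_layout()
```

Working with `fig` and `ax` directly makes it explicit which plot each call modifies, which matters once a figure contains more than one subplot.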

## Exercise 2: Scatter plot analysis

Create a scatter plot to examine the relationship between number of visitors and revenue.

Requirements:

1. Set figure size to 10x6
2. Use different colours for points based on satisfaction scores (hint: use the `c` parameter of `plt.scatter`)
3. Add a colour bar to show the satisfaction scale
4. Include appropriate labels and title
5. Add gridlines

In [None]:
# Set figure size
plt.figure(figsize=(10, 6))

# Create scatter plot with colored points
scatter = plt.scatter(
    df['visitors'],
    df['revenue'],
    c=df['satisfaction'],  # Color based on satisfaction
    cmap="viridis",       # Color map
    alpha=0.6
)

# Add colorbar
colorbar = plt.colorbar(scatter)
colorbar.set_label("Satisfaction Score")

# Add title and labels
plt.title("Revenue vs Visitors (coloured by Satisfaction)")
plt.xlabel("Number of Visitors")
plt.ylabel("Revenue (£)")

# Add gridlines
plt.grid(True, alpha=0.3)

# Adjust layout
plt.tight_layout()

plt.show()

### Explanation

1. Create figure with specified size
2. Use `scatter()` to create the plot with:
   - x-axis: visitors
   - y-axis: revenue
   - Point colours based on satisfaction scores using the 'c' parameter
   - 'viridis' colourmap (a perceptually uniform colourmap)
3. Add a colourbar to show the satisfaction scale
4. Add descriptive title and axis labels
5. Include gridlines with 30% transparency
6. Use `tight_layout()` to ensure proper spacing
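
By default, the colour scale of a scatter plot spans only the smallest and largest values passed to `c`. The sketch below shows how `vmin` and `vmax` can pin the scale to a fixed range instead. It uses synthetic data, and the 1 to 5 satisfaction scale is an assumption made for the example, not something taken from the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (invented for illustration only)
rng = np.random.default_rng(1)
visitors = rng.uniform(500, 2000, size=50)
revenue = visitors * 2.5 + rng.normal(0, 300, size=50)
satisfaction = rng.uniform(2, 4.5, size=50)  # assumed to sit on a 1-5 scale

fig, ax = plt.subplots(figsize=(10, 6))

# Without vmin/vmax the colours would span min(c) to max(c);
# here they are pinned to the full (assumed) 1-5 scale
scatter = ax.scatter(
    visitors,
    revenue,
    c=satisfaction,
    cmap="viridis",
    alpha=0.6,
    vmin=1,
    vmax=5
)

fig.colorbar(scatter, ax=ax, label="Satisfaction Score")
ax.set_xlabel("Number of Visitors")
ax.set_ylabel("Revenue (£)")
ax.grid(True, alpha=0.3)

# The colour limits the colourbar will display
print(scatter.get_clim())
```

Fixing the range keeps colours comparable across several plots of the same variable, at the cost of using less of the colourmap when the data covers only part of the scale.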

## Exercise 3: Statistical visualisation

Create two subplots side by side:

1. A histogram showing the distribution of marketing spend
2. A box plot showing satisfaction scores across revenue categories (revenue quartiles)

Requirements:

1. Use a figure size of 15x6
2. Add appropriate titles for each subplot and an overall figure title
3. Use different colours for each plot
4. Add grid lines to both plots
5. Include proper axis labels

In [None]:
# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# First subplot - Histogram
sns.histplot(
    data=df,
    x="marketing_spend",
    bins=30,
    color="darkblue",
    alpha=0.7,
    ax=ax1
)
ax1.set_title("Distribution of Marketing Spend")
ax1.set_xlabel("Marketing Spend (£)")
ax1.set_ylabel("Count")
ax1.grid(True, alpha=0.3)

# Second subplot - Box plot
# Create revenue categories using quartiles
df['revenue_category'] = pd.qcut(
    df['revenue'],
    q=4,
    labels=["Low", "Medium", "High", "Very High"]
)

sns.boxplot(
    data=df,
    x="revenue_category",
    y="satisfaction",
    color="lightblue",
    ax=ax2
)
ax2.set_title("Satisfaction Scores by Revenue Category")
ax2.set_xlabel("Revenue Category")
ax2.set_ylabel("Satisfaction Score")
ax2.grid(True, alpha=0.3)

# Add overall title
fig.suptitle("Marketing Spend Distribution and Satisfaction Analysis", y=1.05)

# Adjust layout
plt.tight_layout()

plt.show()

### Explanation

1. Create a figure with two subplots using `plt.subplots()`
2. For the first subplot (histogram):
   - Use seaborn's `histplot()` for the marketing spend distribution
   - Set appropriate colours and transparency
   - Add title and labels
3. For the second subplot (box plot):
   - Create revenue categories using pandas' `qcut()`
   - Use seaborn's `boxplot()` to show satisfaction distribution
   - Add title and labels
4. Add gridlines to both plots
5. Include an overall title using `suptitle()`
6. Adjust layout for proper spacing
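
The box plot can also be grouped by day of the week rather than by revenue quartile. The sketch below derives a weekday column from a datetime column with `dt.day_name()` and stores it as an ordered categorical so that the boxes appear from Monday to Sunday. The `demo` DataFrame, its values, and the `day_of_week` column name are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for df: 8 full weeks starting on a Monday (values invented)
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=56, freq="D"),
    "satisfaction": rng.uniform(1, 5, size=56),
})

# An ordered categorical keeps the weekdays in calendar order, not alphabetical
day_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday"
]
demo["day_of_week"] = pd.Categorical(
    demo["date"].dt.day_name(),
    categories=day_order,
    ordered=True
)

fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(
    data=demo,
    x="day_of_week",
    y="satisfaction",
    color="lightblue",
    ax=ax
)
ax.set_title("Satisfaction Scores by Day of the Week")
ax.set_xlabel("Day of the Week")
ax.set_ylabel("Satisfaction Score")
ax.grid(True, alpha=0.3)

fig.tight_layout()
```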

## Exercise 4: Customised visualisation

Create a plot showing the relationship between marketing spend and conversion rate, similar to Example 1 in the demonstration, but with additional customisation.

Requirements:

1. Create a scatter plot of marketing spend vs conversion rate
2. Colour the points based on revenue
3. Add a title and properly labelled axes
4. Include a colourbar with an appropriate label
5. Add a grid with 30% transparency
6. Format axis labels to show currency (£) for marketing spend and percentage for conversion rate


In [None]:
# Create figure
plt.figure(figsize=(10, 6))

# Create scatter plot
scatter = plt.scatter(
    df['marketing_spend'],
    df['conversion_rate'],
    c=df['revenue'],
    cmap="YlOrRd",
    alpha=0.6
)

# Add colorbar
colorbar = plt.colorbar(scatter)
colorbar.set_label("Revenue (£)")

# Add title and labels
plt.title("Marketing Spend vs Conversion Rate")
plt.xlabel("Marketing Spend (£)")
plt.ylabel("Conversion Rate (%)")

# Add grid
plt.grid(True, alpha=0.3)

# Format axis labels using FuncFormatter
def currency_formatter(x, p):
    return f"£{x:,.0f}"

def percentage_formatter(x, p):
    return f"{x:.1f}%"

plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(currency_formatter))
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(percentage_formatter))

# Adjust layout
plt.tight_layout()

plt.show()

### Explanation

1. Create figure with specified size
2. Create scatter plot with:
   - x-axis: marketing spend
   - y-axis: conversion rate
   - Points coloured by revenue using 'YlOrRd' colourmap
3. Add colourbar with revenue label
4. Add title and axis labels
5. Include grid with 30% transparency
6. Format axis labels:
   - Marketing spend with £ symbol
   - Conversion rate as percentage
7. Adjust layout for proper spacing
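
Matplotlib's `matplotlib.ticker` module also provides ready-made formatters that can replace the hand-written functions. The sketch below uses `StrMethodFormatter` and `PercentFormatter` on a few made-up points. Whether `conversion_rate` is stored in percent units (e.g. `2.6`) or as a fraction (e.g. `0.026`) is an assumption to check against the data: `xmax=100` suits the former, `xmax=1` the latter.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter, StrMethodFormatter

fig, ax = plt.subplots(figsize=(10, 6))

# Made-up points: marketing spend (£) vs conversion rate (in percent units)
ax.scatter([1000, 2500, 4000], [1.8, 2.6, 3.4], alpha=0.6)

# Format string with thousands separator and no decimals
currency = StrMethodFormatter("£{x:,.0f}")

# xmax=100 treats values as already being percentages;
# use xmax=1 if the column holds fractions such as 0.026
percent = PercentFormatter(xmax=100, decimals=1)

ax.xaxis.set_major_formatter(currency)
ax.yaxis.set_major_formatter(percent)

ax.set_xlabel("Marketing Spend (£)")
ax.set_ylabel("Conversion Rate (%)")
ax.grid(True, alpha=0.3)

# The formatters can be checked on single values
print(currency(2500))
print(percent.format_pct(2.6, 1))
```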

## Exercise 5: Comparative analysis

Create a figure with two subplots comparing different business metrics:

Requirements:

1. Left subplot: Create a scatter plot of visitors vs revenue
2. Right subplot: Create a scatter plot of marketing spend vs revenue
3. Use the same scale for revenue on both plots
4. Add appropriate titles for each subplot and an overall figure title
5. Include gridlines on both plots
6. Add proper axis labels with units _(`£` for monetary values)_
7. Use different colours for each plot
8. Ensure proper spacing between subplots using `tight_layout()`

In [None]:
# Create figure with subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# First subplot - Visitors vs Revenue
ax1.scatter(
    df['visitors'],
    df['revenue'],
    color="darkblue",
    alpha=0.6
)
ax1.set_title("Revenue vs Visitors")
ax1.set_xlabel("Number of Visitors")
ax1.set_ylabel("Revenue (£)")
ax1.grid(True, alpha=0.3)

# Second subplot - Marketing Spend vs Revenue
ax2.scatter(
    df['marketing_spend'],
    df['revenue'],
    color="darkred",
    alpha=0.6
)
ax2.set_title("Revenue vs Marketing Spend")
ax2.set_xlabel("Marketing Spend (£)")
ax2.set_ylabel("Revenue (£)")
ax2.grid(True, alpha=0.3)

# Ensure same scale for revenue on both plots
y_min = df['revenue'].min()
y_max = df['revenue'].max()
ax1.set_ylim(y_min, y_max)
ax2.set_ylim(y_min, y_max)

# Format y-axis labels for both plots to show currency
def currency_formatter(x, p):
    return f"£{x:,.0f}"

ax1.yaxis.set_major_formatter(plt.FuncFormatter(currency_formatter))
ax2.yaxis.set_major_formatter(plt.FuncFormatter(currency_formatter))
ax2.xaxis.set_major_formatter(plt.FuncFormatter(currency_formatter))

# Add overall title
fig.suptitle("Revenue Relationships Analysis", y=1.05)

# Adjust layout
plt.tight_layout()

plt.show()

### Explanation

1. Create figure with two subplots
2. For the first subplot:
   - Create scatter plot of visitors vs revenue
   - Use dark blue colour with 60% opacity
   - Add title and labels
3. For the second subplot:
   - Create scatter plot of marketing spend vs revenue
   - Use dark red colour with 60% opacity
   - Add title and labels
4. Add gridlines to both plots
5. Ensure consistent revenue scale:
   - Get min and max revenue values
   - Set same y-axis limits for both plots
6. Format currency labels:
   - Create custom formatter function
   - Apply to relevant axes
7. Add overall title and adjust layout
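
An alternative to setting the y-axis limits by hand is to let the subplots share their y-axis with `sharey=True`. The sketch below is a minimal illustration on synthetic data; the arrays and the relationship between them are invented for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (made up for illustration only)
rng = np.random.default_rng(3)
visitors = rng.uniform(500, 2000, size=100)
marketing_spend = rng.uniform(100, 1000, size=100)
revenue = visitors * 2 + marketing_spend * 1.5 + rng.normal(0, 200, size=100)

# sharey=True links the y-axes, so both plots always use the same revenue scale
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6), sharey=True)

ax1.scatter(visitors, revenue, color="darkblue", alpha=0.6)
ax1.set_title("Revenue vs Visitors")
ax1.set_xlabel("Number of Visitors")
ax1.set_ylabel("Revenue (£)")
ax1.grid(True, alpha=0.3)

ax2.scatter(marketing_spend, revenue, color="darkred", alpha=0.6)
ax2.set_title("Revenue vs Marketing Spend")
ax2.set_xlabel("Marketing Spend (£)")
ax2.grid(True, alpha=0.3)

fig.tight_layout()

# Both axes report identical limits
print(ax1.get_ylim() == ax2.get_ylim())
```

With a shared axis, Matplotlib hides the y tick labels on the right-hand subplot, so the revenue label is only needed on the left. Setting the limits explicitly, as in the solution, keeps the tick labels on both plots.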