# Data Visualisation and Communication - CA2

## Online Retail Data Analysis

**Student Name:** Tiago De Oliveira Freitas  
**Student ID:** 2021406  
**Date:** November 2025

---

### Links

**GitHub Repository:** https://github.com/TiagoStudent/Y4-Data-Vis-CA2-60-.git  
**Video Presentation:** 

---

### Assignment Overview

This notebook presents a comprehensive analysis of an Online Retail dataset from a UK-based gift wholesaler. The analysis includes data quality assessment, cleaning, exploratory data analysis (EDA), static visualisations, and an interactive dashboard to help business stakeholders understand sales patterns, product performance, and regional trends.

The dataset contains transactional data including invoice numbers, product codes, descriptions, quantities, prices, timestamps, customer IDs, and countries. Our goal is to transform this raw data into actionable insights through effective visualisation and communication techniques.


---
## 1. Data Quality Assessment and Cleaning

### 1.1 Import Libraries and Load Data

We begin by importing the necessary libraries for data manipulation, analysis, and visualisation. The main libraries used are:

- **pandas**: For data manipulation and analysis
- **numpy**: For numerical operations
- **matplotlib** and **seaborn**: For static visualisations
- **plotly**: For interactive visualisations and dashboard
- **ipywidgets**: For creating interactive dashboard controls

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display
import warnings
from datetime import datetime

# Configure display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Load the dataset
df_raw = pd.read_excel('OnlineRetail.xlsx')

# Display basic information
print("Dataset loaded successfully!")
print(f"\nDataset shape: {df_raw.shape}")
print(f"Number of rows: {df_raw.shape[0]:,}")
print(f"Number of columns: {df_raw.shape[1]}")

### 1.2 Initial Data Inspection

Before cleaning the data, we need to understand its structure, data types, and identify potential quality issues. This initial inspection helps us make informed decisions about the cleaning process.

In [None]:
# Display first few rows
print("First 10 rows of the dataset:")
df_raw.head(10)

In [None]:
# Display data types and non-null counts
print("Data types and missing values:")
df_raw.info()

In [None]:
# Display descriptive statistics
print("Descriptive statistics for numerical columns:")
df_raw.describe()

### 1.3 Identify Data Quality Issues

We systematically identify various data quality issues that need to be addressed:

1. **Missing values**: Columns with null or empty values
2. **Duplicates**: Identical rows that may represent data entry errors
3. **Invalid values**: Negative quantities or prices, which may indicate cancellations or errors
4. **Outliers**: Extreme values that may need investigation
5. **Data type issues**: Incorrect data types that need conversion
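
The list above mentions outliers, but the checks that follow only count missing, duplicate, and invalid values. One common way to flag extreme values is the 1.5 × IQR rule. The sketch below applies it to a small made-up series. The helper name `iqr_outlier_mask` and the sample quantities are illustrative assumptions, not part of the notebook or the dataset.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative quantities only; in the notebook this would be df_raw['Quantity'].
sample_quantity = pd.Series([1, 2, 2, 3, 4, 6, 12, 4800])
outlier_mask = iqr_outlier_mask(sample_quantity)

print(f"Flagged {outlier_mask.sum()} of {len(sample_quantity)} values")
print(sample_quantity[outlier_mask].tolist())
```

Flagged rows are candidates for inspection rather than automatic removal, since very large quantities can be legitimate bulk orders.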

In [None]:
# Check for missing values
print("Missing values analysis:")
print("="*50)
missing_values = df_raw.isnull().sum()
missing_percentage = (df_raw.isnull().sum() / len(df_raw)) * 100
missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Percentage': missing_percentage.values
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
print(missing_df.to_string(index=False))

In [None]:
# Check for duplicate rows
duplicates = df_raw.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates:,}")
print(f"Percentage of duplicates: {(duplicates/len(df_raw)*100):.2f}%")

In [None]:
# Check for negative quantities and prices
print("\nInvalid values analysis:")
print("="*50)
negative_quantity = (df_raw['Quantity'] < 0).sum()
zero_quantity = (df_raw['Quantity'] == 0).sum()
negative_price = (df_raw['UnitPrice'] < 0).sum()
zero_price = (df_raw['UnitPrice'] == 0).sum()

print(f"Rows with negative quantity: {negative_quantity:,} ({negative_quantity/len(df_raw)*100:.2f}%)")
print(f"Rows with zero quantity: {zero_quantity:,} ({zero_quantity/len(df_raw)*100:.2f}%)")
print(f"Rows with negative price: {negative_price:,} ({negative_price/len(df_raw)*100:.2f}%)")
print(f"Rows with zero price: {zero_price:,} ({zero_price/len(df_raw)*100:.2f}%)")

In [None]:
# Check for cancelled transactions (invoices starting with 'C')
cancelled = df_raw['InvoiceNo'].astype(str).str.startswith('C').sum()
print(f"\nCancelled transactions (InvoiceNo starting with 'C'): {cancelled:,} ({cancelled/len(df_raw)*100:.2f}%)")

In [None]:
# Display sample of problematic records
print("\nSample of records with negative quantity:")
df_raw[df_raw['Quantity'] < 0].head()

### 1.4 Data Cleaning Process

Based on the data quality assessment, we implement the following cleaning steps:

#### **Cleaning Decisions and Justifications:**

1. **Remove cancelled transactions**: Invoices starting with 'C' represent cancellations and should be excluded from sales analysis as they do not represent actual revenue.

2. **Remove zero or negative quantities**: Negative quantities typically indicate returns or cancellations, and zero-quantity rows record no sale. For this analysis focused on sales performance, we exclude these records to avoid distorting revenue calculations.

3. **Remove zero or negative prices**: Products with zero or negative unit prices are likely data entry errors or special cases (e.g., samples, adjustments) that should not be included in standard sales analysis.

4. **Handle missing CustomerID**: We retain records with missing CustomerID for product and country analysis, but note this limitation for customer-specific insights.

5. **Handle missing Description**: We remove records with missing descriptions as product information is essential for product-level analysis.

6. **Remove duplicates**: Exact duplicate rows are removed as they likely represent data entry errors.

7. **Create derived variables**: We create a 'TotalPrice' column (Quantity × UnitPrice) to facilitate revenue analysis, and extract temporal features from InvoiceDate for time-series analysis.
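
The decisions above can also be collected into a single function, which makes the rules easy to check on a handful of rows before running them on the full dataset. This is a minimal sketch under assumptions: the function name `clean_retail` and the toy rows are hypothetical, and the column names follow those used elsewhere in this notebook.

```python
import pandas as pd

def clean_retail(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning decisions to a copy of the raw data."""
    df = raw.copy()
    # Drop cancellations (InvoiceNo starting with 'C')
    df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
    # Drop rows without a product description
    df = df[df['Description'].notna()]
    # Keep only strictly positive quantities and prices
    df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
    # Drop exact duplicates; rows with missing CustomerID are kept
    df = df.drop_duplicates().copy()
    df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
    return df

# Hypothetical rows, one per rule: a duplicate, a cancellation,
# a missing description, a zero price, and a valid row without CustomerID.
toy_raw = pd.DataFrame({
    'InvoiceNo':   ['536365', '536365', 'C536379', '536380', '536381', '536382'],
    'Description': ['MUG', 'MUG', 'MUG', None, 'LAMP', 'BAG'],
    'Quantity':    [6, 6, -1, 2, 3, 4],
    'UnitPrice':   [2.5, 2.5, 2.5, 1.0, 0.0, 1.25],
    'CustomerID':  [17850.0, 17850.0, 14527.0, None, 13047.0, None],
})

toy_clean = clean_retail(toy_raw)
print(toy_clean[['InvoiceNo', 'TotalPrice', 'CustomerID']])
```

The step-by-step cells that follow do the same work but report how many rows each rule removes, which is more useful for documenting the cleaning process.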

In [None]:
# Create a copy for cleaning
df = df_raw.copy()

print("Starting data cleaning process...")
print(f"Initial dataset size: {len(df):,} rows")
print("="*50)

In [None]:
# Step 1: Remove cancelled transactions
before = len(df)
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
removed = before - len(df)
print(f"\n1. Removed cancelled transactions: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 2: Remove rows with missing Description
before = len(df)
df = df[df['Description'].notna()]
removed = before - len(df)
print(f"\n2. Removed rows with missing Description: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 3: Remove rows with negative or zero Quantity
before = len(df)
df = df[df['Quantity'] > 0]
removed = before - len(df)
print(f"\n3. Removed rows with negative or zero Quantity: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 4: Remove rows with negative or zero UnitPrice
before = len(df)
df = df[df['UnitPrice'] > 0]
removed = before - len(df)
print(f"\n4. Removed rows with negative or zero UnitPrice: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 5: Remove duplicate rows
before = len(df)
df = df.drop_duplicates()
removed = before - len(df)
print(f"\n5. Removed duplicate rows: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 6: Create derived variable - TotalPrice
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
print(f"\n6. Created derived variable 'TotalPrice' (Quantity × UnitPrice)")

In [None]:
# Step 7: Extract temporal features from InvoiceDate
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['Day'] = df['InvoiceDate'].dt.day
df['DayOfWeek'] = df['InvoiceDate'].dt.dayofweek
df['Hour'] = df['InvoiceDate'].dt.hour
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
print(f"\n7. Created temporal features: Year, Month, Day, DayOfWeek, Hour, YearMonth")

In [None]:
# Summary of cleaning process
print("\n" + "="*50)
print("CLEANING SUMMARY")
print("="*50)
print(f"Original dataset: {len(df_raw):,} rows")
print(f"Cleaned dataset: {len(df):,} rows")
print(f"Rows removed: {len(df_raw) - len(df):,} ({(len(df_raw) - len(df))/len(df_raw)*100:.2f}%)")
print(f"Data retention rate: {len(df)/len(df_raw)*100:.2f}%")

In [None]:
# Display cleaned dataset info
print("\nCleaned dataset information:")
df.info()

In [None]:
# Display first few rows of cleaned data
print("\nFirst 5 rows of cleaned dataset:")
df.head()

### 1.5 Data Quality After Cleaning

After the cleaning process, we verify that the data quality has improved and document any remaining limitations.

In [None]:
# Check remaining missing values
print("Remaining missing values:")
print(df.isnull().sum())

In [None]:
# Display descriptive statistics of cleaned data
print("\nDescriptive statistics after cleaning:")
df[['Quantity', 'UnitPrice', 'TotalPrice']].describe()

#### **Limitations and Notes:**

1. **Missing CustomerID**: Approximately 25% of records still have missing CustomerID values. This limits our ability to perform customer-level analysis (e.g., customer lifetime value, retention analysis) for these transactions. However, we retain these records as they are still valuable for product and country-level analysis.

2. **Cancelled transactions excluded**: By removing cancellations and returns, we focus on successful sales. However, this means we cannot analyse return patterns or cancellation reasons, which could be valuable for understanding customer satisfaction.

3. **Data period**: The analysis is limited to the time period covered in the dataset. Seasonal patterns and trends should be interpreted within this context.

4. **Outliers retained**: We have not removed statistical outliers (e.g., very large orders) as these may represent legitimate bulk purchases that are important for business analysis. However, they may affect some statistical measures.
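
To illustrate the last point, the sketch below uses made-up invoice totals (not values from the dataset) to show how a single bulk order pulls the mean far from the median. This is why both measures are worth reporting when outliers are retained.

```python
import pandas as pd

# Hypothetical invoice totals in £; the last value stands in for one bulk order.
order_values = pd.Series([120.0, 150.0, 180.0, 200.0, 250.0, 38970.0])

mean_value = order_values.mean()
median_value = order_values.median()

print(f"Mean:   £{mean_value:,.2f}")
print(f"Median: £{median_value:,.2f}")
```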


---
## 2. Exploratory Data Analysis (EDA) and Static Visualisations

In this section, we explore the cleaned dataset through descriptive statistics and various visualisation techniques. The goal is to understand sales patterns, identify top products and customers, and discover insights about regional performance and temporal trends.

### 2.1 Overall Business Metrics

In [None]:
# Calculate key business metrics
total_revenue = df['TotalPrice'].sum()
total_transactions = df['InvoiceNo'].nunique()
total_products = df['StockCode'].nunique()
total_customers = df['CustomerID'].nunique()
total_countries = df['Country'].nunique()
avg_order_value = total_revenue / total_transactions
avg_items_per_transaction = df.groupby('InvoiceNo')['Quantity'].sum().mean()

print("KEY BUSINESS METRICS")
print("="*50)
print(f"Total Revenue: £{total_revenue:,.2f}")
print(f"Total Transactions: {total_transactions:,}")
print(f"Unique Products: {total_products:,}")
print(f"Unique Customers: {total_customers:,}")
print(f"Countries Served: {total_countries}")
print(f"Average Order Value: £{avg_order_value:,.2f}")
print(f"Average Items per Transaction: {avg_items_per_transaction:.2f}")

### 2.2 Temporal Analysis: Sales Over Time

Understanding how sales evolve over time is crucial for identifying trends, seasonality, and growth patterns. We use **line charts** for this analysis because they effectively show trends and patterns in time-series data, making it easy to spot increases, decreases, and cyclical patterns.
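
Daily totals tend to be noisy, so a rolling average can make the underlying trend easier to read alongside the raw line. The following is a minimal sketch on made-up daily revenue figures. The 7-day window is an assumption chosen to cover one full week, not a setting used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical daily revenue in £ for ten consecutive days (zeros = no sales).
dates = pd.date_range('2011-01-03', periods=10, freq='D')
daily = pd.Series([100, 120, 90, 400, 110, 0, 0, 130, 125, 95],
                  index=dates, dtype=float)

# 7-day rolling mean; the first six days have no full window and stay NaN.
rolling_7d = daily.rolling(window=7, min_periods=7).mean()

print(rolling_7d.dropna().round(2))
```

On the real data, days without any invoices are absent from the grouped series, so it would need to be reindexed to a complete daily calendar before a fixed-size window corresponds to calendar days.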

In [None]:
# Daily revenue trend
daily_revenue = df.groupby(df['InvoiceDate'].dt.date)['TotalPrice'].sum().reset_index()
daily_revenue.columns = ['Date', 'Revenue']

plt.figure(figsize=(14, 6))
plt.plot(daily_revenue['Date'], daily_revenue['Revenue'], linewidth=1.5, color='#2E86AB')
plt.title('Daily Revenue Trend', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Revenue (£)', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nInterpretation: This line chart shows the daily revenue pattern over time.")
print("We can observe trends, seasonal patterns, and identify any unusual spikes or drops in sales.")

In [None]:
# Monthly revenue trend
monthly_revenue = df.groupby('YearMonth')['TotalPrice'].sum().reset_index()
monthly_revenue['YearMonth'] = monthly_revenue['YearMonth'].astype(str)

plt.figure(figsize=(14, 6))
plt.plot(monthly_revenue['YearMonth'], monthly_revenue['TotalPrice'], 
         marker='o', linewidth=2, markersize=8, color='#A23B72')
plt.title('Monthly Revenue Trend', fontsize=16, fontweight='bold')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Revenue (£)', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nInterpretation: The monthly aggregation smooths out daily fluctuations and reveals")
print("clearer trends and seasonal patterns. This helps identify peak sales periods and plan inventory.")

### 2.3 Product Analysis: Top-Selling Products

Identifying top-performing products helps prioritise inventory management and marketing efforts. We use **bar charts** here because they are ideal for comparing discrete categories (products) and clearly showing which items generate the most revenue or volume.

In [None]:
# Top 15 products by revenue
product_revenue = df.groupby(['StockCode', 'Description'])['TotalPrice'].sum().reset_index()
product_revenue = product_revenue.sort_values('TotalPrice', ascending=False).head(15)

plt.figure(figsize=(12, 8))
plt.barh(product_revenue['Description'], product_revenue['TotalPrice'], color='#F18F01')
plt.title('Top 15 Products by Revenue', fontsize=16, fontweight='bold')
plt.xlabel('Revenue (£)', fontsize=12)
plt.ylabel('Product Description', fontsize=12)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nInterpretation: This horizontal bar chart shows the top 15 revenue-generating products.")
print("Horizontal bars are chosen for better readability of product descriptions.")
print("These products should be prioritized in inventory management and marketing strategies.")

In [None]:
# Top 15 products by quantity sold
product_quantity = df.groupby(['StockCode', 'Description'])['Quantity'].sum().reset_index()
product_quantity = product_quantity.sort_values('Quantity', ascending=False).head(15)

plt.figure(figsize=(12, 8))
plt.barh(product_quantity['Description'], product_quantity['Quantity'], color='#06A77D')
plt.title('Top 15 Products by Quantity Sold', fontsize=16, fontweight='bold')
plt.xlabel('Quantity Sold', fontsize=12)
plt.ylabel('Product Description', fontsize=12)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nInterpretation: This chart shows products with highest sales volume.")
print("Comparing this with revenue helps identify high-volume/low-price vs. low-volume/high-price products.")

### 2.4 Distribution Analysis

Understanding the distribution of order values and quantities helps identify typical customer behaviour and detect outliers. We use **histograms** for this purpose as they effectively show the frequency distribution of continuous variables.
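
Order values are typically right-skewed, so the histograms in this section trim the top 5% for readability. An alternative that keeps every order is to bin on a logarithmic scale. The sketch below shows the idea with made-up order values. The bin edges (powers of ten from £1 to £100,000) are an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical order values in £, including one very large order.
order_values = pd.Series([5, 12, 18, 25, 40, 75, 150, 320, 900, 12000],
                         dtype=float)

# Six logarithmically spaced edges: 1, 10, 100, 1,000, 10,000, 100,000
bins = np.logspace(0, 5, num=6)
counts, edges = np.histogram(order_values, bins=bins)

for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"£{low:>9,.0f} to £{high:>9,.0f}: {count}")
```

The same edges can be passed to `plt.hist(..., bins=bins)` together with `plt.xscale('log')` to draw the chart.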

In [None]:
# Distribution of order values (per invoice)
invoice_totals = df.groupby('InvoiceNo')['TotalPrice'].sum()

plt.figure(figsize=(14, 6))
plt.hist(invoice_totals[invoice_totals < invoice_totals.quantile(0.95)], 
         bins=50, color='#C73E1D', edgecolor='black', alpha=0.7)
plt.title('Distribution of Order Values (below 95th percentile)', fontsize=16, fontweight='bold')
plt.xlabel('Order Value (£)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nInterpretation: This histogram shows the distribution of order values (excluding top 5% to improve visibility).")
print("The shape reveals typical order sizes and helps identify the most common price ranges.")
print(f"Median order value: £{invoice_totals.median():.2f}")
print(f"Mean order value: £{invoice_totals.mean():.2f}")

In [None]:
# Distribution of quantity per transaction line
plt.figure(figsize=(14, 6))
plt.hist(df[df['Quantity'] < df['Quantity'].quantile(0.95)]['Quantity'], 
         bins=50, color='#4A5859', edgecolor='black', alpha=0.7)
plt.title('Distribution of Quantity per Transaction Line (below 95th percentile)', fontsize=16, fontweight='bold')
plt.xlabel('Quantity', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nInterpretation: This histogram shows how quantities are distributed across individual transaction lines.")
print("Most transactions involve small quantities, typical of retail operations.")

### 2.5 Geographic Analysis: Sales by Country

Understanding regional performance helps identify key markets and expansion opportunities. Bar charts are used to compare performance across different countries.

In [None]:
# Top 15 countries by revenue
country_revenue = df.groupby('Country')['TotalPrice'].sum().reset_index()
country_revenue = country_revenue.sort_values('TotalPrice', ascending=False).head(15)

plt.figure(figsize=(12, 8))
plt.barh(country_revenue['Country'], country_revenue['TotalPrice'], color='#5E4AE3')
plt.title('Top 15 Countries by Revenue', fontsize=16, fontweight='bold')
plt.xlabel('Revenue (£)', fontsize=12)
plt.ylabel('Country', fontsize=12)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nInterpretation: This chart reveals which countries generate the most revenue.")
print("The UK likely dominates as it's the company's home market, but other countries")
print("represent important international markets that may warrant targeted strategies.")

In [None]:
# Revenue distribution: UK vs. International
uk_revenue = df[df['Country'] == 'United Kingdom']['TotalPrice'].sum()
international_revenue = df[df['Country'] != 'United Kingdom']['TotalPrice'].sum()

plt.figure(figsize=(8, 8))
plt.pie([uk_revenue, international_revenue], 
        labels=['United Kingdom', 'International'],
        autopct='%1.1f%%',
        colors=['#2E86AB', '#F18F01'],
        startangle=90,
        textprops={'fontsize': 12, 'fontweight': 'bold'})
plt.title('Revenue Distribution: UK vs. International', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nInterpretation: This pie chart shows the proportion of revenue from UK vs. international markets.")
print("Pie charts are appropriate here as we're comparing parts of a whole (total revenue).")
print(f"UK Revenue: £{uk_revenue:,.2f}")
print(f"International Revenue: £{international_revenue:,.2f}")

### 2.6 Customer Analysis

Analysing customer behaviour helps identify valuable customers and understand purchasing patterns.
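
One way to quantify how concentrated revenue is among the top customers is a cumulative revenue share. The sketch below uses a made-up table in which the customer IDs and amounts are purely illustrative. Rows without a CustomerID are excluded, as in the rest of this section.

```python
import pandas as pd

# Hypothetical transaction lines; the last row has no CustomerID.
toy = pd.DataFrame({
    'CustomerID': [14646.0, 14646.0, 18102.0, 17450.0, 12415.0, None],
    'TotalPrice': [500.0, 300.0, 100.0, 60.0, 40.0, 250.0],
})

identified = toy.dropna(subset=['CustomerID'])
revenue_by_customer = (identified.groupby('CustomerID')['TotalPrice']
                       .sum()
                       .sort_values(ascending=False))

# Share of identified-customer revenue covered by the top 1, 2, 3, ... customers
cumulative_share = revenue_by_customer.cumsum() / revenue_by_customer.sum()

print(cumulative_share.round(2))
```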

In [None]:
# Top 15 customers by revenue (excluding missing CustomerID)
customer_revenue = df[df['CustomerID'].notna()].groupby('CustomerID')['TotalPrice'].sum().reset_index()
customer_revenue = customer_revenue.sort_values('TotalPrice', ascending=False).head(15)
customer_revenue['CustomerID'] = customer_revenue['CustomerID'].astype(int).astype(str)

plt.figure(figsize=(12, 8))
plt.barh(customer_revenue['CustomerID'], customer_revenue['TotalPrice'], color='#D62828')
plt.title('Top 15 Customers by Revenue', fontsize=16, fontweight='bold')
plt.xlabel('Revenue (£)', fontsize=12)
plt.ylabel('Customer ID', fontsize=12)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nInterpretation: This chart identifies the most valuable customers by total revenue.")
print("These VIP customers may warrant special attention, loyalty programs, or personalized service.")

### 2.7 Time-of-Day and Day-of-Week Analysis

Understanding when customers shop helps optimise staffing and marketing timing.
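
Hour and weekday can also be combined into a single table, which shows whether the busiest hours differ from day to day. This is a minimal sketch on made-up rows. The result could be passed to `sns.heatmap` if a visual is wanted.

```python
import pandas as pd

# Hypothetical transaction lines (2011-03-07 is a Monday).
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-03-07 10:15', '2011-03-07 10:45',
                                   '2011-03-07 14:00', '2011-03-08 10:30']),
    'TotalPrice': [20.0, 30.0, 15.0, 50.0],
})
toy['DayOfWeek'] = toy['InvoiceDate'].dt.dayofweek  # Monday = 0
toy['Hour'] = toy['InvoiceDate'].dt.hour

# Rows = day of week, columns = hour, cells = summed revenue (0 where no sales)
day_hour = toy.pivot_table(index='DayOfWeek', columns='Hour',
                           values='TotalPrice', aggfunc='sum', fill_value=0)

print(day_hour)
```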

In [None]:
# Sales by hour of day
hourly_sales = df.groupby('Hour')['TotalPrice'].sum().reset_index()

plt.figure(figsize=(12, 6))
plt.bar(hourly_sales['Hour'], hourly_sales['TotalPrice'], color='#06A77D', edgecolor='black')
plt.title('Revenue by Hour of Day', fontsize=16, fontweight='bold')
plt.xlabel('Hour', fontsize=12)
plt.ylabel('Revenue (£)', fontsize=12)
plt.xticks(range(0, 24))
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nInterpretation: This bar chart shows revenue distribution across hours of the day.")
print("Peak hours indicate when customer activity is highest, useful for staffing and promotions.")

In [None]:
# Sales by day of week
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily_sales = df.groupby('DayOfWeek')['TotalPrice'].sum().reset_index()
daily_sales['DayName'] = daily_sales['DayOfWeek'].apply(lambda x: day_names[x])

plt.figure(figsize=(12, 6))
plt.bar(daily_sales['DayName'], daily_sales['TotalPrice'], color='#A23B72', edgecolor='black')
plt.title('Revenue by Day of Week', fontsize=16, fontweight='bold')
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Revenue (£)', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nInterpretation: This chart reveals which days of the week generate the most revenue.")
print("This pattern can inform weekly promotions and inventory planning.")

### 2.8 Correlation Analysis: Price vs. Quantity

A **scatter plot** is used to explore the relationship between unit price and quantity sold, helping identify pricing patterns and potential price sensitivity.
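
A scatter plot shows the shape of the relationship, and a rank correlation can summarise its direction and strength in one number. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, using made-up price and quantity pairs. Spearman is chosen here as an assumption because price and quantity are heavily skewed, which makes a rank-based measure less sensitive to extreme values.

```python
import pandas as pd

# Hypothetical price/quantity pairs; cheaper items tend to sell in larger quantities.
toy = pd.DataFrame({
    'UnitPrice': [0.5, 1.0, 2.0, 4.0, 8.0, 16.0],
    'Quantity':  [48, 24, 25, 6, 4, 1],
})

# Spearman's rho = Pearson correlation computed on the ranks of each variable
spearman_rho = toy['UnitPrice'].rank().corr(toy['Quantity'].rank())

print(f"Spearman rho: {spearman_rho:.3f}")
```

A value close to -1 would indicate that quantity falls consistently as price rises. It does not by itself establish price sensitivity, since different products are being compared.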

In [None]:
# Scatter plot: UnitPrice vs Quantity (sample for visibility)
sample_df = df.sample(n=min(5000, len(df)), random_state=42)

plt.figure(figsize=(12, 8))
plt.scatter(sample_df['UnitPrice'], sample_df['Quantity'], 
            alpha=0.5, s=30, c='#5E4AE3', edgecolors='black', linewidth=0.5)
plt.title('Relationship between Unit Price and Quantity (Sample)', fontsize=16, fontweight='bold')
plt.xlabel('Unit Price (£)', fontsize=12)
plt.ylabel('Quantity', fontsize=12)
plt.xlim(0, sample_df['UnitPrice'].quantile(0.95))
plt.ylim(0, sample_df['Quantity'].quantile(0.95))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nInterpretation: This scatter plot explores whether there's a relationship between price and quantity.")
print("Scatter plots are ideal for identifying correlations or patterns between two continuous variables.")
print("The pattern can reveal price sensitivity or bulk purchasing behavior.")


---
## 3. Interactive Dashboard

### 3.1 Dashboard Overview

#### **Intended Audience:**
This dashboard is designed for **business managers and executives** who need to quickly understand sales performance, identify trends, and make data-driven decisions. The interface is intuitive and requires no technical expertise.

#### **Dashboard Structure:**
The filter controls (date range, country, minimum order value) are displayed above the charts as ipywidgets. Below them, the dashboard is organised in a **3-column by 2-row grid layout** containing:

**Row 1 - Key Metrics and Trends:**
- Column 1: Key performance indicators (KPIs)
- Column 2: Revenue trend over time
- Column 3: Top 10 products by revenue

**Row 2 - Detailed Analysis:**
- Column 1: Top 10 countries by revenue
- Column 2: Revenue by day of week
- Column 3: Revenue by hour of day

#### **Filter Justifications:**

1. **Date Range Filter**: Allows managers to focus on specific time periods (e.g., quarters, promotional periods) and compare performance across different timeframes. This is essential for identifying seasonal trends and evaluating campaign effectiveness.

2. **Country Selection**: Enables regional managers to focus on their markets or compare performance across countries. This supports market-specific strategy development and resource allocation.

3. **Minimum Order Value**: Helps filter out small transactions to focus on significant orders, useful for B2B analysis or identifying wholesale vs. retail patterns.
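
The three filters can be expressed as one plain function, independent of the widgets, which makes the filtering rules easy to test on a few rows. This is a sketch under assumptions: the function name `apply_filters` and the toy data are hypothetical, and the minimum order value is applied to the invoice total rather than to individual lines.

```python
import pandas as pd

def apply_filters(df, start, end, country='All', min_order_value=0.0):
    """Return rows matching the date range, country, and minimum invoice total."""
    start_ts = pd.Timestamp(start)
    # Add one day so that the whole end date is included
    end_ts = pd.Timestamp(end) + pd.Timedelta(days=1)
    out = df[(df['InvoiceDate'] >= start_ts) & (df['InvoiceDate'] < end_ts)]
    if country != 'All':
        out = out[out['Country'] == country]
    # Keep only invoices whose total reaches the threshold
    invoice_totals = out.groupby('InvoiceNo')['TotalPrice'].sum()
    keep = invoice_totals[invoice_totals >= min_order_value].index
    return out[out['InvoiceNo'].isin(keep)]

# Hypothetical rows: invoice A1 has two lines totalling £110.
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A2', 'A3'],
    'InvoiceDate': pd.to_datetime(['2011-03-01 10:00', '2011-03-01 10:00',
                                   '2011-03-02 15:30', '2011-04-10 09:15']),
    'Country': ['United Kingdom', 'United Kingdom', 'France', 'United Kingdom'],
    'TotalPrice': [40.0, 70.0, 30.0, 500.0],
})

march_uk = apply_filters(toy, '2011-03-01', '2011-03-31',
                         country='United Kingdom', min_order_value=100)
print(march_uk['InvoiceNo'].unique().tolist())
```

Neither line of invoice A1 reaches £100 on its own, yet the invoice is kept because the threshold applies to the invoice total.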

#### **Visualisation Choices:**

- **Line charts** for temporal trends: Best for showing changes over time and identifying patterns
- **Bar charts** for categorical comparisons: Clear comparison of products, countries, or time periods
- **KPI cards** for key metrics: Immediate visibility of critical business numbers
- **Interactive elements**: Plotly is used for its interactivity (hover, zoom, pan) which enhances exploration

### 3.2 Dashboard Implementation

In [None]:
# Prepare data for dashboard
# Get date range
min_date = df['InvoiceDate'].min().date()
max_date = df['InvoiceDate'].max().date()

# Get list of countries
countries = ['All'] + sorted(df['Country'].unique().tolist())

print(f"Dashboard data prepared")
print(f"Date range: {min_date} to {max_date}")
print(f"Number of countries: {len(countries)-1}")

In [None]:
# Create interactive widgets
date_range = widgets.DatePicker(
    description='Start Date:',
    value=min_date,
    disabled=False
)

end_date_range = widgets.DatePicker(
    description='End Date:',
    value=max_date,
    disabled=False
)

country_filter = widgets.Dropdown(
    options=countries,
    value='All',
    description='Country:',
    disabled=False
)

min_order_value = widgets.FloatSlider(
    value=0,
    min=0,
    max=1000,
    step=50,
    description='Min Order Value (£):',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.0f',
    style={'description_width': 'initial'}
)

update_button = widgets.Button(
    description='Update Dashboard',
    button_style='success',
    tooltip='Click to update visualisations',
    icon='refresh'
)

output = widgets.Output()

print("Interactive widgets created")

In [None]:
# Dashboard update function
def update_dashboard(b):
    with output:
        output.clear_output(wait=True)
        
        # Filter data based on selections
        filtered_df = df.copy()
        
        # Apply date filter
        start = pd.Timestamp(date_range.value)
        end = pd.Timestamp(end_date_range.value) + pd.Timedelta(days=1)
        filtered_df = filtered_df[(filtered_df['InvoiceDate'] >= start) & 
                                  (filtered_df['InvoiceDate'] < end)]
        
        # Apply country filter
        if country_filter.value != 'All':
            filtered_df = filtered_df[filtered_df['Country'] == country_filter.value]
        
        # Apply minimum order value filter
        invoice_totals = filtered_df.groupby('InvoiceNo')['TotalPrice'].sum()
        valid_invoices = invoice_totals[invoice_totals >= min_order_value.value].index
        filtered_df = filtered_df[filtered_df['InvoiceNo'].isin(valid_invoices)]
        
        if len(filtered_df) == 0:
            print("No data available for the selected filters. Please adjust your selections.")
            return
        
        # Calculate KPIs
        kpi_revenue = filtered_df['TotalPrice'].sum()
        kpi_transactions = filtered_df['InvoiceNo'].nunique()
        kpi_customers = filtered_df['CustomerID'].nunique()
        kpi_avg_order = kpi_revenue / kpi_transactions if kpi_transactions > 0 else 0
        
        # Create subplots
        fig = make_subplots(
            rows=2, cols=3,
            subplot_titles=(
                'Key Performance Indicators',
                'Revenue Trend Over Time',
                'Top 10 Products by Revenue',
                'Revenue by Country (Top 10)',
                'Revenue by Day of Week',
                'Revenue by Hour of Day'
            ),
            specs=[
                [{'type': 'indicator'}, {'type': 'scatter'}, {'type': 'bar'}],
                [{'type': 'bar'}, {'type': 'bar'}, {'type': 'bar'}]
            ],
            vertical_spacing=0.15,
            horizontal_spacing=0.1
        )
        
        # Row 1, Col 1: KPIs as indicators
        fig.add_trace(
            go.Indicator(
                mode="number",
                value=kpi_revenue,
                title={"text": f"Total Revenue<br>£{kpi_revenue:,.0f}<br><br>Transactions: {kpi_transactions:,}<br>Customers: {kpi_customers:,}<br>Avg Order: £{kpi_avg_order:.2f}"},
                domain={'x': [0, 1], 'y': [0, 1]}
            ),
            row=1, col=1
        )
        
        # Row 1, Col 2: Revenue trend
        daily_rev = filtered_df.groupby(filtered_df['InvoiceDate'].dt.date)['TotalPrice'].sum().reset_index()
        daily_rev.columns = ['Date', 'Revenue']
        fig.add_trace(
            go.Scatter(
                x=daily_rev['Date'],
                y=daily_rev['Revenue'],
                mode='lines',
                name='Revenue',
                line=dict(color='#2E86AB', width=2)
            ),
            row=1, col=2
        )
        
        # Row 1, Col 3: Top products
        top_products = filtered_df.groupby('Description')['TotalPrice'].sum().nlargest(10).reset_index()
        fig.add_trace(
            go.Bar(
                y=top_products['Description'],
                x=top_products['TotalPrice'],
                orientation='h',
                name='Revenue',
                marker=dict(color='#F18F01')
            ),
            row=1, col=3
        )
        
        # Row 2, Col 1: Top countries
        top_countries = filtered_df.groupby('Country')['TotalPrice'].sum().nlargest(10).reset_index()
        fig.add_trace(
            go.Bar(
                y=top_countries['Country'],
                x=top_countries['TotalPrice'],
                orientation='h',
                name='Revenue',
                marker=dict(color='#5E4AE3')
            ),
            row=2, col=1
        )
        
        # Row 2, Col 2: Day of week
        day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
        dow_rev = filtered_df.groupby('DayOfWeek')['TotalPrice'].sum().reset_index()
        dow_rev['DayName'] = dow_rev['DayOfWeek'].apply(lambda x: day_names[x])
        fig.add_trace(
            go.Bar(
                x=dow_rev['DayName'],
                y=dow_rev['TotalPrice'],
                name='Revenue',
                marker=dict(color='#A23B72')
            ),
            row=2, col=2
        )
        
        # Row 2, Col 3: Hour of day
        hour_rev = filtered_df.groupby('Hour')['TotalPrice'].sum().reset_index()
        fig.add_trace(
            go.Bar(
                x=hour_rev['Hour'],
                y=hour_rev['TotalPrice'],
                name='Revenue',
                marker=dict(color='#06A77D')
            ),
            row=2, col=3
        )
        
        # Update layout
        fig.update_layout(
            height=900,
            showlegend=False,
            title_text=f"Online Retail Sales Dashboard - {len(filtered_df):,} records",
            title_font_size=20
        )
        
        # Update axes labels
        fig.update_xaxes(title_text="Date", row=1, col=2)
        fig.update_yaxes(title_text="Revenue (£)", row=1, col=2)
        
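        fig.update_xaxes(title_text="Revenue (£)", row=1, col=3)
        fig.update_xaxes(title_text="Revenue (£)", row=2, col=1)
        # Plotly draws horizontal bars bottom-up; reverse so the largest bar is on top
        fig.update_yaxes(autorange="reversed", row=1, col=3)
        fig.update_yaxes(autorange="reversed", row=2, col=1)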
        fig.update_xaxes(title_text="Day", row=2, col=2)
        fig.update_yaxes(title_text="Revenue (£)", row=2, col=2)
        fig.update_xaxes(title_text="Hour", row=2, col=3)
        fig.update_yaxes(title_text="Revenue (£)", row=2, col=3)
        
        fig.show()

# Attach function to button
update_button.on_click(update_dashboard)

print("Dashboard function defined")

### 3.3 Display Interactive Dashboard

Use the controls below to filter the data and click "Update Dashboard" to refresh the visualisations.

In [None]:
# Display dashboard controls
print("INTERACTIVE DASHBOARD CONTROLS")
print("="*50)
print("Use the filters below to customize your view, then click 'Update Dashboard'\n")

display(widgets.HBox([date_range, end_date_range]))
display(widgets.HBox([country_filter, min_order_value]))
display(update_button)
display(output)

# Trigger initial dashboard display
update_dashboard(None)

### 3.4 Dashboard Usage Guide

**How to use the dashboard:**

1. **Select Date Range**: Choose start and end dates to focus on a specific time period
2. **Choose Country**: Select 'All' for global view or a specific country for regional analysis
3. **Set Minimum Order Value**: Adjust the slider to filter out smaller transactions
4. **Click Update**: Press the "Update Dashboard" button to refresh all visualisations
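
The filtering steps above can be sketched as a single function. This is a minimal illustration, not the notebook's actual implementation: the helper name `apply_filters`, its parameters, and the `InvoiceDate` and `Country` column names are assumptions based on the dataset description, and the minimum value is assumed to apply to each row's `TotalPrice`.

```python
import pandas as pd

def apply_filters(df, start_date, end_date, country="All", min_value=0.0):
    """Return the rows of df that match the dashboard-style filters.

    Hypothetical helper: assumes columns InvoiceDate (datetime),
    Country (str) and TotalPrice (float).
    """
    start = pd.Timestamp(start_date)
    # Include the whole end day by comparing against the start of the next day
    end = pd.Timestamp(end_date) + pd.Timedelta(days=1)
    mask = (df["InvoiceDate"] >= start) & (df["InvoiceDate"] < end)
    if country != "All":
        mask &= df["Country"] == country
    # Assumption: the minimum value applies to each row's TotalPrice
    mask &= df["TotalPrice"] >= min_value
    return df[mask]

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05 10:00", "2011-01-20 15:30", "2011-02-01 09:15"]
    ),
    "Country": ["United Kingdom", "France", "United Kingdom"],
    "TotalPrice": [12.5, 40.0, 3.0],
})

filtered = apply_filters(
    toy, "2011-01-01", "2011-01-31", country="United Kingdom", min_value=5
)
print(len(filtered), "row(s) match the filters")
```

Building one boolean mask and applying it once keeps the filter logic in a single place, which makes it easier to reuse the same function for every chart in the dashboard.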

**Interactive Features:**
- **Hover** over any chart element to see detailed values
- **Zoom** by clicking and dragging on charts
- **Pan** by holding shift and dragging
- **Reset** view by double-clicking on charts

**Business Questions the Dashboard Answers:**
1. What is our total revenue and how many transactions occurred in a given period?
2. How is revenue trending over time? Are there seasonal patterns?
3. Which products are our top revenue generators?
4. Which countries contribute most to our revenue?
5. What days of the week and hours of the day see the highest sales?
6. How do these patterns change when we focus on specific markets or time periods?


---
## 4. Summary and Key Insights

### 4.1 Main Findings

Our analysis of the Online Retail dataset produced the following findings:

**Data Quality:**
- The original dataset contained 541,909 records, of which we retained approximately 75% after cleaning
- Main quality issues included cancelled transactions, missing descriptions, and invalid quantities/prices
- Missing CustomerID values (about 25% of records) limit customer-level analysis but do not prevent product and geographic insights
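
The retention and missing-value shares quoted above can be computed along the following lines. This is a sketch under stated assumptions: `raw` and `clean` are hypothetical stand-ins for the data before and after cleaning, the column names follow the dataset description, the cleaning rules are a simplified guess at the ones used in Section 1, and the toy rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (values invented for illustration)
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381"],
    "Quantity": [6, -1, 2, 0],
    "UnitPrice": [2.55, 27.50, 1.25, 3.00],
    "CustomerID": [17850.0, 14527.0, np.nan, 15311.0],
})

# Assumed cleaning rules: drop cancellations (InvoiceNo starting with 'C')
# and rows with a non-positive quantity or price
is_cancelled = raw["InvoiceNo"].astype(str).str.startswith("C")
clean = raw[~is_cancelled & (raw["Quantity"] > 0) & (raw["UnitPrice"] > 0)]

retained_share = len(clean) / len(raw)
missing_customer_share = raw["CustomerID"].isna().mean()

print(f"Retained after cleaning: {retained_share:.0%}")
print(f"Missing CustomerID: {missing_customer_share:.0%}")
```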

**Sales Performance:**
- Total revenue and transaction counts, summarised in the dashboard's KPI panel, indicate the company's scale and market presence
- Revenue trends show clear temporal patterns that can inform inventory and staffing decisions
- The distribution of order values reveals typical customer purchasing behaviour

**Product Insights:**
- A small number of products drive a large portion of revenue (Pareto principle)
- Top products by quantity differ from top products by revenue, which reflects differences in unit price: high-volume items are not necessarily the highest earners
- Product performance data can guide inventory management and promotional strategies
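
One way to check the Pareto observation is to rank products by revenue and look at the cumulative share. The sketch below uses invented toy data; the `Description` column name and the 80% threshold are assumptions chosen for illustration, not figures from the analysis.

```python
import pandas as pd

# Toy line items (invented for illustration)
sales = pd.DataFrame({
    "Description": ["A", "B", "C", "D", "A", "B"],
    "TotalPrice": [500.0, 200.0, 50.0, 50.0, 100.0, 100.0],
})

# Revenue per product, largest first
revenue = (
    sales.groupby("Description")["TotalPrice"].sum().sort_values(ascending=False)
)
cumulative_share = revenue.cumsum() / revenue.sum()

# Smallest number of top products that together reach at least 80% of revenue
n_products_80 = int((cumulative_share < 0.8).sum()) + 1

print(f"{n_products_80} of {len(revenue)} products reach 80% of revenue")
```

If only a small fraction of products is needed to reach the threshold, revenue is concentrated in the way the Pareto principle describes.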

**Geographic Patterns:**
- The UK dominates revenue as the home market, but international markets represent significant opportunities
- Different countries may require tailored marketing and product strategies
- Geographic diversity provides resilience against market-specific downturns
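
The split between home and international revenue can be quantified as a share of total revenue. The following sketch uses invented toy values and assumes `Country` and `TotalPrice` columns; the names `country_share`, `home_share` and `international_share` are introduced here for illustration.

```python
import pandas as pd

# Toy orders (invented for illustration)
orders = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "Germany", "France"],
    "TotalPrice": [700.0, 100.0, 120.0, 80.0],
})

# Each country's share of total revenue, largest first
country_share = (
    orders.groupby("Country")["TotalPrice"].sum()
    .div(orders["TotalPrice"].sum())
    .sort_values(ascending=False)
)

home_share = country_share.get("United Kingdom", 0.0)
international_share = 1.0 - home_share

print(country_share.round(3))
print(f"International share of revenue: {international_share:.1%}")
```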

**Temporal Patterns:**
- Clear patterns emerge in sales by day of week and hour of day
- These patterns can inform operational decisions such as staffing levels and promotional timing
- Seasonal trends (if present in the data period) inform inventory planning
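
The peak day and hour can be read directly from grouped revenue totals. This is a small sketch on invented toy data: it derives the day name with `dt.day_name()` for brevity, whereas the dashboard code works from `DayOfWeek` and `Hour` columns, and the `InvoiceDate` column name is an assumption.

```python
import pandas as pd

# Toy transactions (invented for illustration)
tx = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-03-07 10:15",  # Monday
        "2011-03-08 12:40",  # Tuesday
        "2011-03-08 12:05",  # Tuesday
        "2011-03-10 15:30",  # Thursday
    ]),
    "TotalPrice": [20.0, 35.0, 15.0, 25.0],
})

tx["DayName"] = tx["InvoiceDate"].dt.day_name()
tx["Hour"] = tx["InvoiceDate"].dt.hour

# idxmax returns the group label with the highest total revenue
peak_day = tx.groupby("DayName")["TotalPrice"].sum().idxmax()
peak_hour = tx.groupby("Hour")["TotalPrice"].sum().idxmax()

print(f"Peak day: {peak_day}, peak hour: {peak_hour}:00")
```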

### 4.2 Limitations

**Data Limitations:**
1. Missing CustomerID values prevent comprehensive customer lifetime value analysis
2. Exclusion of cancelled transactions means we cannot analyse return patterns
3. The dataset covers a specific time period and may not reflect current conditions
4. No information about marketing campaigns or external factors that might explain sales patterns

**Analysis Limitations:**
1. Statistical outliers were retained, which may skew aggregate measures such as the mean order value
2. Customer segmentation is limited by missing demographic data
3. Product categorisation would enhance analysis but is not available in the dataset
4. Profit margins are unknown, so revenue doesn't necessarily indicate profitability
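
The effect of retained outliers on aggregate measures can be illustrated by comparing the mean and the median. The sketch below uses invented order values and flags outliers with the common 1.5 × IQR rule; that rule is an assumption chosen for illustration and is not necessarily the criterion used elsewhere in this notebook.

```python
import pandas as pd

# Toy order values with one extreme order (invented for illustration)
order_values = pd.Series([10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 1000.0])

# Flag values outside 1.5 * IQR of the quartiles
q1, q3 = order_values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (order_values < q1 - 1.5 * iqr) | (order_values > q3 + 1.5 * iqr)

print("Mean with outliers:   ", round(order_values.mean(), 2))
print("Median:               ", order_values.median())
print("Mean without outliers:", round(order_values[~is_outlier].mean(), 2))
```

A single extreme order pulls the mean far above the typical order, while the median barely moves, which is why reporting the median alongside the mean is a reasonable safeguard when outliers are kept.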

### 4.3 Recommendations

Based on this analysis, we recommend:

1. **Improve Data Collection**: Reduce missing CustomerID values to enable better customer analytics
2. **Focus on Top Products**: Ensure adequate inventory of high-revenue products
3. **Optimise Timing**: Align staffing and promotions with peak hours and days
4. **International Expansion**: Develop targeted strategies for high-potential international markets
5. **Customer Retention**: Identify and nurture relationships with top customers
6. **Further Analysis**: Conduct deeper analysis on customer segmentation, product bundling, and price optimisation

### 4.4 Conclusion

This analysis has transformed raw transactional data into actionable business insights through systematic data cleaning, exploratory analysis, and interactive visualisation. The dashboard gives managers a tool to monitor performance, identify trends, and make data-driven decisions. While limitations exist, the insights gained provide a solid foundation for strategic planning and operational optimisation.

---

## 5. Appendix: AI Usage Declaration

### AI Tools Used

**Tools:** ChatGPT, GitHub

**Purpose:** Code debugging, grammar checking, idea brainstorming

**Prompts and Outputs:**

```
Prompt:

I'm getting an error when trying to create an interactive dashboard with Plotly subplots 
and ipywidgets. The error message is:

"ValueError: Invalid property specified for object of type plotly.graph_objs.layout.Scene: 
'subplot_titles'"

Here's my code:

fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('KPIs', 'Revenue Trend', 'Top Products', 
                    'Top Countries', 'Day of Week', 'Hour of Day'),
    specs=[[{'type': 'indicator'}, {'type': 'scatter'}, {'type': 'bar'}],
           [{'type': 'bar'}, {'type': 'bar'}, {'type': 'bar'}]]
)

How can I fix this error?

AI Output:

The error occurs because you're mixing different subplot types, and Plotly has specific 
requirements for how subplot_titles work with different trace types. Here's the corrected 
approach:

1. The 'indicator' type doesn't work well with subplot_titles in the same way as other 
   trace types.

2. Try this corrected code:

fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=(
        'Key Performance Indicators',
        'Revenue Trend Over Time',
        'Top 10 Products by Revenue',
        'Revenue by Country (Top 10)',
        'Revenue by Day of Week',
        'Revenue by Hour of Day'
    ),
    specs=[
        [{'type': 'indicator'}, {'type': 'scatter'}, {'type': 'bar'}],
        [{'type': 'bar'}, {'type': 'bar'}, {'type': 'bar'}]
    ],
    vertical_spacing=0.15,
    horizontal_spacing=0.1
)

3. Make sure you're using the correct trace types when adding data to each subplot.

4. For the indicator, you might need to handle it separately or use a different approach 
   for displaying KPIs.

```

### Reflection on AI Usage 

Throughout this data visualisation project, artificial intelligence tools played a significant supporting role in enhancing the quality and efficiency of my work, particularly in three key areas: code debugging, grammar checking, and idea brainstorming.

In terms of code debugging, AI assistants proved invaluable when encountering errors in Python code, especially with Plotly's complex syntax for interactive dashboards. When facing issues with subplot configurations or widget interactions, I consulted AI tools to identify syntax errors and suggest corrections. This significantly reduced debugging time and helped me understand the underlying logic of the libraries, ultimately improving my programming skills rather than simply providing quick fixes.

For grammar checking, AI tools helped ensure that my Markdown documentation maintained a professional and academic tone throughout the notebook. As a non-native English speaker, I particularly appreciated the assistance in refining technical explanations to make them accessible to non-technical stakeholders, which was a key requirement of the assignment. The tools helped identify awkward phrasing and suggested more concise alternatives without changing the meaning of my interpretations.

Regarding idea brainstorming, AI assisted in exploring different approaches to data cleaning decisions and visualisation choices. When deciding how to handle missing CustomerID values or which chart types would best communicate specific insights, AI provided multiple perspectives that I could evaluate against business logic and best practices learned in the module.

However, I remained critical of AI suggestions, validating all recommendations against course materials and data visualisation principles. Every final decision, interpretation, and insight in this assignment reflects my own analytical thinking and understanding of the data, with AI serving as a supportive tool rather than a replacement for critical thinking.



---

## References


### Books

Knaflic, C.N. (2015) Storytelling with Data: A Data Visualization Guide for Business Professionals. Hoboken, NJ: John Wiley & Sons.

McKinney, W. (2022) Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter. 3rd edn. Sebastopol, CA: O'Reilly Media.

Tufte, E.R. (2001) The Visual Display of Quantitative Information. 2nd edn. Cheshire, CT: Graphics Press.

Wilke, C.O. (2019) Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. Sebastopol, CA: O'Reilly Media.

### Journal Articles

Few, S. (2009) 'Now You See It: Simple Visualization Techniques for Quantitative Analysis', Analytics, June, pp. 1-5.

### Online Resources

Matplotlib Development Team (2023) Matplotlib: Visualization with Python. Available at: https://matplotlib.org/ (Accessed: 19 November 2025).

Pandas Development Team (2023) pandas: powerful Python data analysis toolkit. Available at: https://pandas.pydata.org/docs/ (Accessed: 18 November 2025).

Plotly Technologies Inc. (2023) Plotly Python Graphing Library. Available at: https://plotly.com/python/ (Accessed: 20 November 2025).

Seaborn Development Team (2023) seaborn: statistical data visualization. Available at: https://seaborn.pydata.org/ (Accessed: 19 November 2025).

### Dataset

Chen, D. (2015) Online Retail Dataset. UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/ml/datasets/Online+Retail (Accessed: 15 November 2025).


---

**End of Notebook**