# Data Visualisation and Communication - CA2

## Online Retail Data Analysis

**Student Name:** Tiago De Oliveira Freitas  
**Student ID:** 2021406  
**Date:** November 2025

---

### Links

**GitHub Repository:** https://github.com/TiagoStudent/Y4-Data-Vis-CA2-60-.git  
**Video Presentation:** 

---

### Assignment Overview

This notebook presents a comprehensive analysis of an Online Retail dataset from a UK-based gift wholesaler. The analysis includes data quality assessment, cleaning, exploratory data analysis (EDA), static visualisations, and an interactive dashboard to help business stakeholders understand sales patterns, product performance, and regional trends.

The dataset contains transactional data including invoice numbers, product codes, descriptions, quantities, prices, timestamps, customer IDs, and countries. Our goal is to transform this raw data into actionable insights through effective visualisation and communication techniques.

## 1. Data Quality Assessment and Cleaning

### 1.1 Import Libraries and Load Data

We begin by importing the necessary libraries for data manipulation, analysis, and visualisation. The main libraries used are:

- pandas: for data manipulation and analysis
- numpy: for numerical operations
- matplotlib and seaborn: for static visualisations
- plotly: for interactive visualisations and the dashboard
- ipywidgets: for creating interactive dashboard controls

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display
import warnings
from datetime import datetime

# Configure display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Load the dataset
df_raw = pd.read_excel('OnlineRetail.xlsx')

# Display basic information
print("Dataset loaded successfully!")
print(f"\nDataset shape: {df_raw.shape}")
print(f"Number of rows: {df_raw.shape[0]:,}")
print(f"Number of columns: {df_raw.shape[1]}")

### 1.2 Initial Data Inspection

Before cleaning the data, we need to understand its structure and data types and identify potential quality issues. This initial inspection helps us make informed decisions about the cleaning process.

In [None]:
# Display first few rows
print("First 10 rows of the dataset:")
df_raw.head(10)

In [None]:
# Display data types and non-null counts
print("Data types and missing values:")
df_raw.info()

In [None]:
# Display descriptive statistics
print("Descriptive statistics for numerical columns:")
df_raw.describe()
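
Data types deserve a closer look at this stage, because identifier columns read from Excel can arrive with mixed or numeric types. The following is a minimal sketch on a small invented frame (the values in `toy` are illustrative, not taken from the dataset) showing one way the identifier and date columns could be normalised. Whether `df_raw` actually needs these conversions depends on what `df_raw.info()` reports.

```python
import pandas as pd

# Invented stand-in for df_raw; values are illustrative only.
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379"],      # mixed int/str values
    "StockCode": ["85123A", 71053],        # mixed str/int values
    "CustomerID": [17850.0, None],         # float because of the missing value
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 09:41:00"],
})

typed = toy.assign(
    InvoiceNo=toy["InvoiceNo"].astype(str),           # identifiers as text
    StockCode=toy["StockCode"].astype(str),
    CustomerID=toy["CustomerID"].astype("Int64"),     # nullable integer keeps missing values
    InvoiceDate=pd.to_datetime(toy["InvoiceDate"]),   # needed for the .dt accessor
)

print(typed.dtypes)
```

Using the nullable `Int64` type keeps missing customer IDs as missing values rather than forcing the whole column to float.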

### 1.3 Identify Data Quality Issues

We systematically identify the data quality issues that need to be addressed:

- **Missing values:** columns with null or empty values
- **Duplicates:** identical rows that may represent data entry errors
- **Invalid values:** negative or zero quantities or prices, which may indicate cancellations or errors
- **Outliers:** extreme values that may need investigation
- **Data type issues:** incorrect data types that need conversion

In [None]:
# Check for missing values
print("Missing values analysis:")
print("="*50)
missing_values = df_raw.isnull().sum()
missing_percentage = (df_raw.isnull().sum() / len(df_raw)) * 100
missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Percentage': missing_percentage.values
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
print(missing_df.to_string(index=False))

In [None]:
# Check for duplicate rows
duplicates = df_raw.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates:,}")
print(f"Percentage of duplicates: {(duplicates/len(df_raw)*100):.2f}%")

In [None]:
# Check for negative quantities and prices
print("\nInvalid values analysis:")
print("="*50)
negative_quantity = (df_raw['Quantity'] < 0).sum()
zero_quantity = (df_raw['Quantity'] == 0).sum()
negative_price = (df_raw['UnitPrice'] < 0).sum()
zero_price = (df_raw['UnitPrice'] == 0).sum()

print(f"Rows with negative quantity: {negative_quantity:,} ({negative_quantity/len(df_raw)*100:.2f}%)")
print(f"Rows with zero quantity: {zero_quantity:,} ({zero_quantity/len(df_raw)*100:.2f}%)")
print(f"Rows with negative price: {negative_price:,} ({negative_price/len(df_raw)*100:.2f}%)")
print(f"Rows with zero price: {zero_price:,} ({zero_price/len(df_raw)*100:.2f}%)")

In [None]:
# Check for cancelled transactions (invoices starting with 'C')
cancelled = df_raw['InvoiceNo'].astype(str).str.startswith('C').sum()
print(f"\nCancelled transactions (InvoiceNo starting with 'C'): {cancelled:,} ({cancelled/len(df_raw)*100:.2f}%)")

In [None]:
# Display sample of problematic records
print("\nSample of records with negative quantity:")
df_raw[df_raw['Quantity'] < 0].head()
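
The checks above cover missing values, duplicates, and invalid values. Outliers can be screened in a similar way. The sketch below uses Tukey's IQR fences, which is one common convention and an assumption on our part rather than a rule prescribed by the dataset. The helper `iqr_outlier_mask` and the frame `toy` are hypothetical names with invented values.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented stand-in for df_raw; the last row is a deliberately extreme order.
toy = pd.DataFrame({
    "Quantity": [1, 2, 2, 3, 3, 4, 6, 12, 5000],
    "UnitPrice": [0.85, 1.25, 1.65, 2.10, 2.55, 2.95, 3.75, 4.95, 500.00],
})

# Count flagged rows per column.
outlier_counts = {
    column: int(iqr_outlier_mask(toy[column]).sum())
    for column in ["Quantity", "UnitPrice"]
}
for column, count in outlier_counts.items():
    print(f"{column}: {count} of {len(toy)} rows outside the IQR fences")
```

Flagging rather than deleting keeps the option of inspecting these rows before deciding whether they are errors or legitimate bulk orders.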

### 1.4 Data Cleaning Process

Based on the data quality assessment, we implement the following cleaning steps.

#### Cleaning Decisions and Justifications

- **Remove cancelled transactions:** Invoices starting with 'C' represent cancellations and are excluded from the sales analysis because they do not represent actual revenue.
- **Remove negative or zero quantities:** Negative quantities typically indicate returns or cancellations. Because this analysis focuses on sales performance, we exclude every record whose quantity is not positive to avoid distorting revenue calculations.
- **Remove zero or negative prices:** Products with zero or negative unit prices are likely data entry errors or special cases (e.g., samples, adjustments) that should not be included in standard sales analysis.
- **Handle missing CustomerID:** We retain records with missing CustomerID for product and country analysis, but note this limitation for customer-specific insights.
- **Handle missing Description:** We remove records with missing descriptions, as product information is essential for product-level analysis.
- **Remove duplicates:** Exact duplicate rows are removed as they likely represent data entry errors.
- **Create derived variables:** We create a 'TotalPrice' column (Quantity × UnitPrice) to facilitate revenue analysis, and extract temporal features from InvoiceDate for time-series analysis.
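
Each filtering decision above follows the same pattern: apply a condition, then record how many rows were dropped. As a minimal sketch of that pattern, the hypothetical helper `apply_step` below collects the counts in a small audit table. The frame `toy` holds invented rows and is not part of the dataset.

```python
import pandas as pd

def apply_step(df: pd.DataFrame, keep: pd.Series, label: str, log: list) -> pd.DataFrame:
    """Keep rows where `keep` is True and append the removed/remaining counts to `log`."""
    result = df[keep]
    log.append({"step": label, "removed": len(df) - len(result), "remaining": len(result)})
    return result

# Invented rows, one per kind of problem.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366", "536367", "536367"],
    "Description": ["TOY ITEM A", "TOY ITEM A", None, "TOY ITEM B", "TOY ITEM B"],
    "Quantity": [6, -1, 2, 0, 3],
    "UnitPrice": [2.55, 2.55, 1.85, 4.25, 0.0],
})

log = []
cleaned = toy.copy()
cleaned = apply_step(cleaned, ~cleaned["InvoiceNo"].astype(str).str.startswith("C"), "cancelled", log)
cleaned = apply_step(cleaned, cleaned["Description"].notna(), "missing description", log)
cleaned = apply_step(cleaned, cleaned["Quantity"] > 0, "non-positive quantity", log)
cleaned = apply_step(cleaned, cleaned["UnitPrice"] > 0, "non-positive price", log)

audit = pd.DataFrame(log)
print(audit.to_string(index=False))
```

An audit table of this kind makes the effect of each decision easy to report alongside its justification.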

In [None]:
# Create a copy for cleaning
df = df_raw.copy()

print("Starting data cleaning process...")
print(f"Initial dataset size: {len(df):,} rows")
print("="*50)

In [None]:
# Step 1: Remove cancelled transactions
before = len(df)
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
removed = before - len(df)
print(f"\n1. Removed cancelled transactions: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 2: Remove rows with missing Description
before = len(df)
df = df[df['Description'].notna()]
removed = before - len(df)
print(f"\n2. Removed rows with missing Description: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 3: Remove rows with negative or zero Quantity
before = len(df)
df = df[df['Quantity'] > 0]
removed = before - len(df)
print(f"\n3. Removed rows with negative or zero Quantity: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 4: Remove rows with negative or zero UnitPrice
before = len(df)
df = df[df['UnitPrice'] > 0]
removed = before - len(df)
print(f"\n4. Removed rows with negative or zero UnitPrice: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 5: Remove duplicate rows
before = len(df)
df = df.drop_duplicates()
removed = before - len(df)
print(f"\n5. Removed duplicate rows: {removed:,} rows")
print(f"   Remaining: {len(df):,} rows")

In [None]:
# Step 6: Create derived variable - TotalPrice
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
print("\n6. Created derived variable 'TotalPrice' (Quantity × UnitPrice)")

In [None]:
# Step 7: Extract temporal features from InvoiceDate
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['Day'] = df['InvoiceDate'].dt.day
df['DayOfWeek'] = df['InvoiceDate'].dt.dayofweek
df['Hour'] = df['InvoiceDate'].dt.hour
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
print("\n7. Created temporal features: Year, Month, Day, DayOfWeek, Hour, YearMonth")
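
Two of these features may need a small conversion before plotting. `DayOfWeek` is numeric, with Monday as 0, and `YearMonth` holds pandas `Period` values, which some plotting libraries may not accept directly. The sketch below, on an invented frame `toy`, shows readable alternatives. The column names `YearMonthLabel`, `YearMonthStart`, and `DayName` are our own suggestions and not part of the dataset.

```python
import pandas as pd

# Invented stand-in for the cleaned data.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2011-01-04 10:00", "2011-01-09 15:30"]),
    "TotalPrice": [15.30, 20.34, 22.00],
})

toy["YearMonth"] = toy["InvoiceDate"].dt.to_period("M")
toy["YearMonthLabel"] = toy["YearMonth"].astype(str)         # plain strings such as "2010-12"
toy["YearMonthStart"] = toy["YearMonth"].dt.to_timestamp()   # first day of each month
toy["DayName"] = toy["InvoiceDate"].dt.day_name()            # weekday names instead of 0-6

# Monthly revenue keyed by the string label.
monthly = toy.groupby("YearMonthLabel", as_index=False)["TotalPrice"].sum()
print(monthly)
```

The timestamp version keeps a true time axis, while the string label is convenient for categorical charts and tables.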

In [None]:
# Summary of cleaning process
print("\n" + "="*50)
print("CLEANING SUMMARY")
print("="*50)
print(f"Original dataset: {len(df_raw):,} rows")
print(f"Cleaned dataset: {len(df):,} rows")
print(f"Rows removed: {len(df_raw) - len(df):,} ({(len(df_raw) - len(df))/len(df_raw)*100:.2f}%)")
print(f"Data retention rate: {len(df)/len(df_raw)*100:.2f}%")

In [None]:
# Display cleaned dataset info
print("\nCleaned dataset information:")
df.info()

In [None]:
# Display first few rows of cleaned data
print("\nFirst 5 rows of cleaned dataset:")
df.head()

### 1.5 Data Quality After Cleaning

After the cleaning process, we verify that the data quality has improved and document any remaining limitations.

In [None]:
# Check remaining missing values
print("Remaining missing values:")
print(df.isnull().sum())

In [None]:
# Display descriptive statistics of cleaned data
print("\nDescriptive statistics after cleaning:")
df[['Quantity', 'UnitPrice', 'TotalPrice']].describe()
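
The verification can also be written as explicit checks, so that a failed cleaning rule is visible at a glance. The following is a minimal sketch. The function `check_cleaned` and both toy frames are hypothetical and use invented values, and the rules simply restate the cleaning steps from section 1.4.

```python
import pandas as pd

def check_cleaned(df: pd.DataFrame) -> dict:
    """Return named checks; True means the cleaned data satisfies the rule."""
    return {
        "no_cancelled_invoices": not df["InvoiceNo"].astype(str).str.startswith("C").any(),
        "no_missing_description": bool(df["Description"].notna().all()),
        "positive_quantity": bool((df["Quantity"] > 0).all()),
        "positive_unit_price": bool((df["UnitPrice"] > 0).all()),
        "no_duplicate_rows": not df.duplicated().any(),
    }

# Invented frame that satisfies every rule.
clean_toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366"],
    "Description": ["TOY ITEM A", "TOY ITEM B"],
    "Quantity": [6, 2],
    "UnitPrice": [2.55, 1.85],
})
# Same frame with its first row repeated, to show a failing check.
dirty_toy = pd.concat([clean_toy, clean_toy.iloc[[0]]], ignore_index=True)

results = check_cleaned(clean_toy)
dirty_results = check_cleaned(dirty_toy)
for name, passed in dirty_results.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```

Applied to the real data, the call would be `check_cleaned(df)` after the cleaning cells have run.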

#### Limitations and Notes

- **Missing CustomerID:** Approximately 25% of records still have missing CustomerID values. This limits our ability to perform customer-level analysis (e.g., customer lifetime value, retention analysis) for these transactions. However, we retain these records as they are still valuable for product and country-level analysis.
- **Cancelled transactions excluded:** By removing cancellations and returns, we focus on successful sales. However, this means we cannot analyse return patterns or cancellation reasons, which could be valuable for understanding customer satisfaction.
- **Data period:** The analysis is limited to the time period covered in the dataset. Seasonal patterns and trends should be interpreted within this context.
- **Outliers retained:** We have not removed statistical outliers (e.g., very large orders), as these may represent legitimate bulk purchases that are important for business analysis. However, they may affect some statistical measures, such as the mean.
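
Two of these limitations can be quantified directly. The sketch below, on an invented frame `toy`, shows how the share of rows and of revenue without a CustomerID could be measured, and how a single large order pulls the mean away from the median. The figures it prints describe the toy data only and say nothing about the real dataset.

```python
import pandas as pd

# Invented stand-in for the cleaned data; the last row is a large bulk order.
toy = pd.DataFrame({
    "CustomerID": [17850.0, None, 13047.0, 12583.0],
    "TotalPrice": [10.0, 20.0, 30.0, 940.0],
})

no_customer = toy["CustomerID"].isna()

# Share of rows, and of revenue, that cannot be tied to a customer.
missing_share = no_customer.mean() * 100
revenue_share = toy.loc[no_customer, "TotalPrice"].sum() / toy["TotalPrice"].sum() * 100

# The mean is sensitive to the bulk order; the median is not.
mean_value = toy["TotalPrice"].mean()
median_value = toy["TotalPrice"].median()

print(f"Rows without CustomerID: {missing_share:.1f}%")
print(f"Revenue without CustomerID: {revenue_share:.1f}%")
print(f"Mean TotalPrice: {mean_value:.2f}, median TotalPrice: {median_value:.2f}")
```

Reporting the median alongside the mean is one simple way to keep the outliers in the data while limiting their influence on summary figures.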