# COM7021 – Data Visualisation
## Task 1: Confectionary Sales Data Analysis & Dashboard
### Student: Muhammad Umar Uz Zaman
### Student ID: STU1197819
### Goal: Analyze confectionary sales data and create interactive visualizations for business decision-making

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go

# For dashboard
import streamlit as st

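# Set consistent visual style.
# Reset matplotlib to its defaults first, then apply the seaborn theme;
# in the reverse order the 'default' style overwrites the seaborn settings.
plt.style.use('default')
sns.set_theme()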

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

## 1. Data Loading and Initial Inspection

In [3]:
# Load the confectionary dataset
DATA_PATH = "data/confectionary.xlsx"
df = pd.read_excel(DATA_PATH)
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

Dataset shape: (1001, 7)

First few rows:


| | Date | Country(UK) | Confectionary | Units Sold | Cost(£) | Profit(£) | Revenue(£) |
|---|---|---|---|---|---|---|---|
| 0 | 2002-11-11 | England | Biscuit | 1118.0 | 2459.6 | 3130.4 | 749954.4 |
| 1 | 2002-07-05 | England | Biscuit | 708.0 | 1557.6 | 1982.4 | 300758.4 |
| 2 | 2001-10-31 | England | Biscuit | 1269.0 | 2791.8 | 3553.2 | 966216.6 |
| 3 | 2004-09-13 | England | Biscuit | 1631.0 | 3588.2 | 4566.8 | 1596096.6 |
| 4 | 2004-03-10 | England | Biscuit | 2240.0 | 4928.0 | 6272.0 | 3010560.0 |


In [4]:
# Check data types and non-null counts
print("Data types and missing values:")
df.info()

# Check for any completely empty rows
print(f"\nTotal rows: {len(df)}")
print(f"Rows with all NaN values: {df.isna().all(axis=1).sum()}")

Data types and missing values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           1001 non-null   datetime64[ns]
 1   Country(UK)    1001 non-null   object        
 2   Confectionary  1001 non-null   object        
 3   Units Sold     996 non-null    float64       
 4   Cost(£)        992 non-null    float64       
 5   Profit(£)      998 non-null    float64       
 6   Revenue(£)     1001 non-null   float64       
dtypes: datetime64[ns](1), float64(4), object(2)
memory usage: 54.9+ KB

Total rows: 1001
Rows with all NaN values: 0


In [5]:
# Inspect unique values in categorical columns
print("Unique values in Country(UK):")
print(df["Country(UK)"].unique())
print(f"\nNumber of unique countries: {df['Country(UK)'].nunique()}")

print("\nUnique values in Confectionary:")
print(df["Confectionary"].unique())
print(f"\nNumber of unique confectionary types: {df['Confectionary'].nunique()}")

Unique values in Country(UK):
['England' 'Scotland' 'Wales' 'N. Ireland' 'Jersey']

Number of unique countries: 5

Unique values in Confectionary:
['Biscuit' 'Biscuit Nut' 'Choclate Chunk' 'Caramel nut' 'Caramel' 'Plain'
 'Chocolate Chunk' 'Caramel Nut']

Number of unique confectionary types: 8


In [6]:
# Summary statistics for numeric columns
numeric_cols = ["Units Sold", "Cost(£)", "Profit(£)", "Revenue(£)"]
print("Summary statistics for numeric columns:")
df[numeric_cols].describe()

Summary statistics for numeric columns:


| | Units Sold | Cost(£) | Profit(£) | Revenue(£) |
|---|---|---|---|---|
| count | 996.0 | 992.0 | 998.0 | 1001.0 |
| mean | 1633.360442 | 2820.190877 | 4012.076052 | 2519449.0 |
| std | 876.356045 | 2073.969135 | 2648.166312 | 2941639.0 |
| min | 200.0 | 40.0 | 160.0 | -21962260.0 |
| 25% | 923.0 | 1204.0 | 1872.4 | 576240.0 |
| 50% | 1530.5 | 2456.8 | 3459.0 | 1627208.0 |
| 75% | 2300.0 | 3977.625 | 5445.0 | 3551112.0 |
| max | 4493.0 | 10994.5 | 13479.0 | 20187050.0 |
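
The `min` row for `Revenue(£)` above is negative, which is unusual for a revenue figure and worth a closer look. One way to isolate such rows for review is sketched below. The `sample` frame holds made-up values and the helper `flag_negative` is a hypothetical name, not part of the original notebook.

```python
import pandas as pd

# Made-up values for illustration; only the column name matches the dataset.
sample = pd.DataFrame({"Revenue(£)": [749954.4, -1200.0, 300758.4, 0.0]})


def flag_negative(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows whose value in `column` is below zero."""
    return frame[frame[column] < 0]


suspect = flag_negative(sample, "Revenue(£)")
print(f"{len(suspect)} row(s) with negative Revenue(£)")
```

Applied to the real data, the call would be `flag_negative(df, "Revenue(£)")`. Whether such rows are refunds, entry errors, or something else is a question for the data owner, so the sketch only surfaces them.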


**Dataset Overview:**
- **Size**: 1,001 rows × 7 columns
- **Time Period**: Data spans from [earliest date] to [latest date] across UK regions
- **Regions**: 5 regions listed under `Country(UK)` (England, Scotland, Wales, N. Ireland, Jersey); note that Jersey is a Crown Dependency rather than part of the UK
- **Products**: 8 raw confectionary labels, which reduce to 6 distinct products once spelling and capitalisation inconsistencies are resolved ("Choclate Chunk" vs "Chocolate Chunk", "Caramel nut" vs "Caramel Nut")
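
The bracketed placeholders in the Time Period bullet can be filled from the `Date` column. A minimal sketch follows, assuming `Date` is already parsed as datetime (as `df.info()` reports). The `sample` frame reuses only the five dates shown by `df.head()`, so its range is not the range of the full dataset, and `date_range_label` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative sample only: the five dates shown by df.head().
sample = pd.DataFrame({
    "Date": pd.to_datetime([
        "2002-11-11", "2002-07-05", "2001-10-31", "2004-09-13", "2004-03-10",
    ])
})


def date_range_label(frame: pd.DataFrame, column: str = "Date") -> str:
    """Return 'YYYY-MM-DD to YYYY-MM-DD' for the earliest and latest dates."""
    earliest = frame[column].min()
    latest = frame[column].max()
    return f"{earliest:%Y-%m-%d} to {latest:%Y-%m-%d}"


label = date_range_label(sample)
print(label)
```

Running `date_range_label(df)` on the loaded dataset would give the actual span to report.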

**Data Quality Issues Identified:**
- **Missing Values**: 
  - Units Sold: 5 missing (0.5%)
  - Cost(£): 9 missing (0.9%) 
  - Profit(£): 3 missing (0.3%)
  - Revenue(£): Complete (0%)
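- **Suspect Values**: Revenue(£) has a negative minimum (-21,962,260 in the summary statistics), which should be investigated
- **Categorical Inconsistencies**: Confectionary names need normalization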
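
The label normalization mentioned above could look roughly like this. It is a sketch under assumptions: `normalise_labels` and `SPELLING_FIXES` are hypothetical names, and the approach (trim, title-case, then a small correction table) is one reasonable option rather than the method the notebook necessarily uses later.

```python
import pandas as pd

# The eight raw labels printed by df["Confectionary"].unique().
raw = pd.Series([
    "Biscuit", "Biscuit Nut", "Choclate Chunk", "Caramel nut",
    "Caramel", "Plain", "Chocolate Chunk", "Caramel Nut",
], name="Confectionary")

# Hypothetical correction table for misspellings that casing alone cannot fix.
SPELLING_FIXES = {"Choclate Chunk": "Chocolate Chunk"}


def normalise_labels(labels: pd.Series) -> pd.Series:
    """Trim whitespace, unify capitalisation, then fix known misspellings."""
    cleaned = labels.str.strip().str.title()
    return cleaned.replace(SPELLING_FIXES)


clean = normalise_labels(raw)
print(sorted(clean.unique()))
```

Title-casing handles "Caramel nut", while the correction table handles "Choclate Chunk". Keeping the misspellings in an explicit dictionary makes each manual fix visible to reviewers.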

This appears to be sales transaction data for a confectionary company operating across UK regions, with financial metrics that will allow us to analyze profitability and regional performance.

In [10]:
# Handle missing values - Decision: Drop rows with missing numeric data
# Reason: Only ~1.7% of total data, transparent approach for business stakeholders

print("Missing values before cleaning:")
print(df[["Units Sold", "Cost(£)", "Profit(£)", "Revenue(£)"]].isna().sum())

initial_rows = len(df)
df = df.dropna(subset=["Units Sold", "Cost(£)", "Profit(£)", "Revenue(£)"])
final_rows = len(df)

Missing values before cleaning:
Units Sold    5
Cost(£)       9
Profit(£)     3
Revenue(£)    0
dtype: int64


In [11]:
print(f"\nAfter cleaning: {final_rows} rows retained ({final_rows/initial_rows*100:.1f}%)")
print(f"Dropped {initial_rows - final_rows} rows with missing values")


After cleaning: 984 rows retained (98.3%)
Dropped 17 rows with missing values
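
The 17 dropped rows equal the sum of the per-column missing counts (5 + 9 + 3), which suggests that no row was missing more than one value. That is not guaranteed in general, because `dropna` removes a row once regardless of how many of its values are missing. The sketch below uses made-up data to show how the two counts can differ.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is missing two values, row 3 is missing one.
sample = pd.DataFrame({
    "Units Sold": [100.0, np.nan, 250.0, 300.0],
    "Cost(£)": [40.0, np.nan, 90.0, np.nan],
    "Profit(£)": [60.0, 75.0, 110.0, 130.0],
})

# Missing cells counted column by column, then summed.
missing_cells = int(sample.isna().sum().sum())

# Rows containing at least one missing value; these are what dropna() removes.
rows_with_missing = int(sample.isna().any(axis=1).sum())

cleaned = sample.dropna()
rows_dropped = len(sample) - len(cleaned)
print(f"Missing cells: {missing_cells}, rows dropped: {rows_dropped}")
```

Reporting the row-level count (`isna().any(axis=1).sum()`) before cleaning gives the number of rows that will actually be lost.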


### 1.2 Data Cleaning Summary

**Missing Value Handling:**
- **Decision**: Dropped 17 rows with missing numeric values (1.7% of data)
- **Rationale**: Small percentage, ensures data integrity for financial calculations
- **Impact**: Retained 98.3% of original dataset (984 of 1,001 rows)

**Data Quality**: All financial columns are now complete, with no missing values. This step does not check value ranges, so the negative Revenue(£) minimum seen in the summary statistics may still be present; otherwise the data is ready for feature engineering and analysis.