### Exploratory Data Analysis
This notebook provides an exploratory analysis of the cleaned Online Retail Transaction dataset. The goal is to uncover key patterns in customer behaviour, product performance, and sales trends to support analysis and dashboard development.

In [2]:
# Import necessary libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

Now let us import our cleaned dataset for analysis:

In [3]:
# Import the cleaned dataset
df = pd.read_csv('../data/clean_data/online_retail_cleaned.csv')
# Display the first few rows of the dataset
df.head()


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalPrice,Year,Month,DayOfWeek,Hour,Date
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,2010,12,Wednesday,8,2010-12-01
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,Wednesday,8,2010-12-01
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,2010,12,Wednesday,8,2010-12-01
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,Wednesday,8,2010-12-01
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,Wednesday,8,2010-12-01
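The `Unnamed: 0` column in the output above suggests the row index was written into the CSV when the cleaned data was saved. A minimal sketch of loading options that avoid carrying it as data, and that parse the date columns up front, is shown below. It uses two rows copied from the output above in place of the real file, and the choice to read `InvoiceNo` and `StockCode` as text is our assumption rather than something the dataset requires.

```python
import io
import pandas as pd

# Two sample rows copied from the output above; a real run would pass the CSV path instead.
sample_csv = io.StringIO(
    "Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,"
    "CustomerID,Country,TotalPrice,Year,Month,DayOfWeek,Hour,Date\n"
    "0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,"
    "17850,United Kingdom,15.3,2010,12,Wednesday,8,2010-12-01\n"
    "1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,"
    "17850,United Kingdom,20.34,2010,12,Wednesday,8,2010-12-01\n"
)

sample = pd.read_csv(
    sample_csv,
    index_col=0,                                 # use the saved index as the index, not as a data column
    parse_dates=['InvoiceDate', 'Date'],         # read both date columns as datetimes
    dtype={'InvoiceNo': str, 'StockCode': str},  # assumption: treat codes as text, not numbers
)

# Inspect the resulting column types
print(sample.dtypes)
```

With datetime columns in place, later date-based grouping and plotting do not depend on string ordering.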


The first stage of our analysis is to look at total revenue over time.
We can do this with an initial line chart.

In [4]:
daily_sales = df.groupby('Date')['TotalPrice'].sum().reset_index()

fig = px.line(daily_sales, x='Date', y='TotalPrice', title='Daily Sales Revenue')
fig.show()

This first plot shows our Daily Sales Revenue over time. As we can see, it has a very 'spiky' appearance. To make changes over time easier to see, we can add a rolling average line that smooths out the day-to-day variation.

In [5]:
daily_sales['Rolling7Day'] = daily_sales['TotalPrice'].rolling(window=7).mean()

fig = px.line(daily_sales, x='Date', y=['TotalPrice', 'Rolling7Day'],
              labels={'value': 'Revenue', 'variable': 'Line'},
              title='Daily Sales with 7-Day Rolling Average')
fig.show()
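One caveat is that `rolling(window=7)` counts rows, not calendar days. If some dates have no sales rows (we have not checked whether that is the case here), a 7-row window covers more than a week. The sketch below contrasts a row-based window with a time-based one on made-up data; the `toy_sales` frame and its values are hypothetical, and a 3-day window is used only to keep the example small.

```python
import pandas as pd

# Hypothetical daily revenue with one missing calendar day (2011-01-04)
toy_sales = pd.DataFrame({
    'Date': ['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-05'],
    'TotalPrice': [100.0, 200.0, 300.0, 400.0],
})
toy_sales['Date'] = pd.to_datetime(toy_sales['Date'])
toy_sales = toy_sales.set_index('Date').sort_index()

# Row-based window: averages the last 3 rows, whatever dates they fall on
toy_sales['Rolling3Rows'] = toy_sales['TotalPrice'].rolling(window=3).mean()

# Time-based window: averages the rows within the 3 calendar days ending at each date
toy_sales['Rolling3Days'] = toy_sales['TotalPrice'].rolling('3D').mean()

print(toy_sales)
```

A time-based window needs a sorted datetime index, which is why the `Date` column is converted and set as the index first.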

We observe revenue spikes in December 2010 and a larger spike in December 2011, most likely reflecting Christmas and holiday sales.
We also see a spike around mid-January 2011, possibly reflecting post-holiday spending or returns.
September 2011 and November 2011 also show revenue spikes. These could reflect early holiday-period spending, back-to-school spending, 'Thanksgiving' spending in November, or 'Cyber Monday' sales.
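Since these observations are made month by month, a monthly total can make them easier to check than the daily line. Here is a minimal sketch on made-up figures; `toy_daily` and its values are hypothetical, and the `YearMonth` column name is our own choice.

```python
import pandas as pd

# Hypothetical daily revenue spanning two months
toy_daily = pd.DataFrame({
    'Date': pd.to_datetime(['2010-12-01', '2010-12-15', '2011-01-10', '2011-01-20']),
    'TotalPrice': [100.0, 50.0, 30.0, 20.0],
})

# Label each row with its year and month, e.g. '2010-12'
toy_daily['YearMonth'] = toy_daily['Date'].dt.to_period('M').astype(str)

# Sum revenue within each month
monthly_sales = toy_daily.groupby('YearMonth')['TotalPrice'].sum().reset_index()
print(monthly_sales)
```

The same grouping could be applied to `daily_sales` once its `Date` column has been converted with `pd.to_datetime`.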

Next we can check to see which countries contribute the most to revenues. 
Let us make a bar plot of revenue by country.

In [6]:
# Group by Country and sum TotalPrice
country_sales = df.groupby('Country')['TotalPrice'].sum().reset_index()

# Sort by revenue
country_sales = country_sales.sort_values(by='TotalPrice', ascending=False)

# Plot
fig = px.bar(country_sales, x='Country', y='TotalPrice',
             title='Total Revenue by Country',
             labels={'TotalPrice': 'Total Revenue'},
             height=600)
fig.update_layout(xaxis_tickangle=45)
fig.show()

We can see from our bar plot that the UK is our primary market, dominating revenues with sales of ~£9M. This suggests that operations are UK-centric.
It highlights the retailer's dependence on its domestic market, and suggests that international markets are underdeveloped or under-targeted.
It is therefore useful to evaluate international sales separately by excluding the UK.

In [7]:
# Exclude the UK for better visual balance
non_uk_sales = country_sales[country_sales['Country'] != 'United Kingdom']

fig = px.bar(non_uk_sales.sort_values(by='TotalPrice', ascending=False),
             x='Country', y='TotalPrice',
             title='Revenue by Country (Excluding UK)',
             labels={'TotalPrice': 'Total Revenue'},
             height=600)
fig.update_layout(xaxis_tickangle=45)
fig.show()
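To put a number on how dominant a single market is, each country's revenue can be expressed as a percentage of the total. The sketch below uses made-up totals, not the real figures from our dataset; `toy_country_sales` and `SharePct` are hypothetical names.

```python
import pandas as pd

# Hypothetical revenue totals, not the real figures from the dataset
toy_country_sales = pd.DataFrame({
    'Country': ['United Kingdom', 'Netherlands', 'Germany', 'France'],
    'TotalPrice': [800.0, 100.0, 60.0, 40.0],
})

# Each country's revenue as a percentage of all revenue
total_revenue = toy_country_sales['TotalPrice'].sum()
toy_country_sales['SharePct'] = (toy_country_sales['TotalPrice'] / total_revenue * 100).round(1)

print(toy_country_sales)
```

Applied to `country_sales`, the same two lines would give the UK's share of total revenue directly.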

Unsurprisingly, we can see from this bar plot that our secondary markets are situated within the European Union.
At this point, it is helpful to create a dictionary that groups our countries into regions.

In [8]:
# Define region mapping
# This dictionary maps countries to their respective regions for better analysis

region_map = {
    'United Kingdom': 'UK',
    'France': 'Europe',
    'Germany': 'Europe',
    'Netherlands': 'Europe',
    'Belgium': 'Europe',
    'Switzerland': 'Europe',
    'Spain': 'Europe',
    'Portugal': 'Europe',
    'Italy': 'Europe',
    'Norway': 'Europe',
    'Austria': 'Europe',
    'Denmark': 'Europe',
    'Sweden': 'Europe',
    'Finland': 'Europe',
    'Ireland': 'Europe',
    'Greece': 'Europe',
    'Cyprus': 'Europe',
    'Channel Islands': 'Europe',

    'Australia': 'Oceania',
    'New Zealand': 'Oceania',

    'USA': 'North America',
    'Canada': 'North America',

    'Japan': 'Asia',
    'Hong Kong': 'Asia',
    'Singapore': 'Asia',
    'Israel': 'Middle East',
    'United Arab Emirates': 'Middle East',

    'Unspecified': 'Other',
    'EIRE': 'Europe'
}

Let us now create a column for 'Region' in our dataset:

In [9]:
# Create a Region column
df['Region'] = df['Country'].map(region_map).fillna('Other')
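Because of `.fillna('Other')`, any country missing from `region_map` is silently placed in 'Other'. A quick way to see which countries that affects is to compare the values in the `Country` column against the dictionary keys. The sketch below uses a small made-up mapping and frame (`toy_region_map` and `toy_df` are hypothetical), so the country it reports is illustrative only.

```python
import pandas as pd

# Hypothetical subset of the mapping and of the Country column
toy_region_map = {'United Kingdom': 'UK', 'France': 'Europe', 'Unspecified': 'Other'}
toy_df = pd.DataFrame(
    {'Country': ['United Kingdom', 'France', 'Poland', 'Unspecified', 'Poland']}
)

# Countries present in the data but absent from the mapping keys
unmapped = sorted(set(toy_df['Country']) - set(toy_region_map))
print(unmapped)
```

Running the same comparison with `df` and `region_map` would list any countries to add to the dictionary before interpreting the 'Other' region.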

Now we can aggregate our revenues by region, and display our findings in a bar plot.

In [10]:
# Aggregate revenue by region
region_sales = df.groupby('Region')['TotalPrice'].sum().reset_index()

# Bar chart of regional revenue
fig = px.bar(region_sales.sort_values(by='TotalPrice', ascending=False),
             x='Region', y='TotalPrice',
             title='Revenue by Region',
             labels={'TotalPrice': 'Total Revenue'})
fig.show()

We can see here that our revenues by region are:

|__Region__|__Total Revenue__|
|----------|-----------------|
|UK|~£9M|
|Europe|~£1.4M|
|Oceania|~£138K|
|Asia|~£74K|
|Other|~£28K|
|Middle East|~£10K|
|North America|~£7K|

Let us also generate a pie chart to show each region's share of total revenue.

In [11]:
fig = px.pie(region_sales, names='Region', values='TotalPrice', title='Revenue Share by Region')
fig.show()

Next we should look at three questions: which products are our top sellers by revenue, which products are most frequently purchased (by quantity), and whether there are products which sell well but generate little revenue.


In [12]:
top_revenue_products = (
    df.groupby('Description')['TotalPrice']
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .reset_index()
)

In [13]:
fig = px.bar(top_revenue_products, x='TotalPrice', y='Description',
             orientation='h',
             title='Top 10 Products by Revenue',
             labels={'TotalPrice': 'Revenue', 'Description': 'Product'})
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

Here we can see that 'DOTCOM POSTAGE' is our top 'Product'. We can assume that this is not actually a product, but rather a postage and packing charge.
We can also see 'POSTAGE' and 'Manual'. These are not relevant to our product-level analysis, so we should exclude them.

In [14]:
non_products = ['DOTCOM POSTAGE', 'POSTAGE', 'Manual']

df_products = df[~df['Description'].isin(non_products)]
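Note that `isin` matches exact strings, so a variant such as a lower-case or padded 'postage' would slip through. We have not checked whether such variants exist in our data. As a sketch, the comparison can be made on a normalised copy of the descriptions; `toy_df` and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows, including a padded lower-case variant of 'POSTAGE'
toy_df = pd.DataFrame({
    'Description': ['DOTCOM POSTAGE', 'Manual', ' postage ', 'WHITE METAL LANTERN'],
    'TotalPrice': [10.0, 5.0, 3.0, 20.34],
})
non_products = ['DOTCOM POSTAGE', 'POSTAGE', 'Manual']

# Compare on stripped, upper-cased text so case and whitespace differences do not matter
normalised_labels = {label.strip().upper() for label in non_products}
is_non_product = toy_df['Description'].str.strip().str.upper().isin(normalised_labels)

toy_products = toy_df[~is_non_product]
print(toy_products)
```

The original `Description` values are left unchanged; only the comparison uses the normalised text.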

In [15]:
top_revenue_products = (
    df_products
    .groupby('Description')['TotalPrice']
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .reset_index()
)

In [16]:
fig = px.bar(top_revenue_products, x='TotalPrice', y='Description',
             orientation='h',
             title='Top 10 Products by Revenue',
             labels={'TotalPrice': 'Revenue', 'Description': 'Product'})
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

It is clear from this analysis that homewares and decorations are the type of item driving the most revenue across the business.
Let's now check which products generate the least revenue.

In [17]:
low_revenue_products = (
    df_products.groupby('Description')['TotalPrice']
    .sum()
    .sort_values()
    .head(10)
    .reset_index()
)

fig = px.bar(low_revenue_products, x='TotalPrice', y='Description',
             orientation='h',
             title='Bottom 10 Products by Revenue',
             labels={'TotalPrice': 'Revenue', 'Description': 'Product'})
fig.show()

Now let us look at the products which have sold in the lowest quantities.

In [18]:
low_quantity_products = (
    df_products.groupby('Description')['Quantity']
    .sum()
    .sort_values()
    .head(10)
    .reset_index()
)

fig = px.bar(low_quantity_products, x='Quantity', y='Description',
             orientation='h',
             title='Bottom 10 Products by Quantity Sold',
             labels={'Quantity': 'Units Sold', 'Description': 'Product'})
fig.show()

The lowest quantity sold for any product is 1: each of our bottom ten items sold only a single unit across the whole period covered by our dataset (2010-2011). These are products the company could consider dropping from inventory, or whose pricing and marketing strategy could be revisited in order to boost sales and revenue.
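The earlier question of whether some products sell well but generate little revenue can be approached by aggregating quantity and revenue together. The sketch below is one possible approach on made-up data: `toy_products` is hypothetical, and using the medians as cut-offs is our own assumption, not a standard rule.

```python
import pandas as pd

# Hypothetical line items for four products
toy_products = pd.DataFrame({
    'Description': ['A', 'A', 'B', 'C', 'C', 'D'],
    'Quantity': [100, 150, 5, 300, 200, 10],
    'TotalPrice': [50.0, 75.0, 100.0, 20.0, 15.0, 200.0],
})

# Total units and total revenue per product
summary = (
    toy_products.groupby('Description')
    .agg(Quantity=('Quantity', 'sum'), Revenue=('TotalPrice', 'sum'))
    .reset_index()
)

# Assumed cut-offs: above-median quantity and below-median revenue
high_volume = summary['Quantity'] > summary['Quantity'].median()
low_revenue = summary['Revenue'] < summary['Revenue'].median()

candidates = summary[high_volume & low_revenue]
print(candidates)
```

On the real data, `df_products` would take the place of `toy_products`, and the cut-offs could be replaced with quantiles or fixed thresholds that suit the business question.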