## Online Retail Visualization

### Description
This project provides analysis and visuals that help executives understand the insights needed to shape the expansion strategy for their online retail stores. It analyses trends and breaks revenue down by category, clarifying how revenue is generated and which factors most affect the online store.

### Setup
**Load required libraries and datasets**

In [1]:
# %%
import time, os, string, json
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = 'plotly_dark'
import datetime


In [2]:
# %%
# raw string so the backslashes in the Windows path are not read as escape sequences
df_data = pd.read_excel(r'data\Tata\Online Retail.xlsx')
df_data


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


### Exploratory data analysis before visualization
**Examining transaction data**

In [5]:
# count how many times each value occurs in the 'Quantity' column
df_count = df_data['Quantity'].value_counts().reset_index()
df_count.columns = ['Quantity', 'Count']
df_count


Unnamed: 0,Quantity,Count
0,1,148227
1,2,81829
2,12,61063
3,6,40868
4,4,38484
...,...,...
717,-355,1
718,-155,1
719,1404,1
720,388,1


We can see that there are negative values. Let's enforce a check that the quantity is not below 1 unit.

In [3]:

# Replace negative values in the 'Quantity' column with their absolute value
# (a quantity of 0 is left unchanged)
df_data['Quantity'] = df_data['Quantity'].abs()
df_data


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
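
Before overwriting the column, it can be useful to count how many rows fail the quantity check. The following is a minimal sketch on a small made-up frame: `sample` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for df_data; the values are illustrative only.
sample = pd.DataFrame({
    'InvoiceNo': ['536365', '536379', '536380', '536381'],
    'Quantity': [6, -1, 0, 12],
})

# Boolean mask of the rows that fail the "at least 1 unit" check.
below_one = sample['Quantity'] < 1
n_flagged = int(below_one.sum())

# Take the absolute value of every quantity; positive values are unchanged.
sample['Quantity'] = sample['Quantity'].abs()

# A quantity of 0 stays 0, so it still fails the check after the conversion.
still_below_one = int((sample['Quantity'] < 1).sum())
```

If `still_below_one` is not zero on the real data, those rows would need a separate decision, for example filtering them out.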


Next, let's check that the unit price is greater than 0.

In [4]:
# keep only the rows whose 'UnitPrice' is greater than 0 (drops zero and negative prices)
df_data = df_data[df_data['UnitPrice'] > 0]
df_data


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
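
The price filter can also report how many rows it removes. This is a sketch under assumptions: `sample` is a made-up frame, and the `.copy()` call is an optional addition that gives an independent frame for later column assignments.

```python
import pandas as pd

# Hypothetical rows standing in for df_data; the values are illustrative only.
sample = pd.DataFrame({
    'StockCode': ['A', 'B', 'C', 'D'],
    'UnitPrice': [2.55, 0.0, -1.5, 3.39],
})

# True for rows with a strictly positive unit price.
valid_price = sample['UnitPrice'] > 0

# Number of rows the filter will drop (zero or negative prices).
n_removed = int((~valid_price).sum())

# Keep the valid rows as an independent copy of the original frame.
filtered = sample[valid_price].copy()
```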


Let's remove the null values and duplicate rows from the dataset.

In [5]:
# drop the rows that contain NaN values
df_data = df_data.dropna()

# drop the duplicated rows
df_data = df_data.drop_duplicates()
df_data


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
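
To see what the cleaning step removes, the missing values per column and the duplicate rows can be counted first. A minimal sketch, assuming a small made-up frame named `sample`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for df_data; the values are illustrative only.
sample = pd.DataFrame({
    'InvoiceNo': ['1', '1', '2', '3'],
    'CustomerID': [17850.0, 17850.0, np.nan, 12680.0],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Same cleaning as in the notebook: drop NaN rows, then drop duplicates.
cleaned = sample.dropna().drop_duplicates()
```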


Check the column data types to make sure the format is correct.

In [6]:
# check the data type of each column in df_data
df_data.dtypes


InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

We can see that all of the columns have the correct type.
Let's create a visual to view the revenue trend.
First, we need to create the "Revenue" column in the dataset.

In [10]:
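# add the 'Revenue' column to df_data by multiplying the 'Quantity' and 'UnitPrice' columns;
# work on a copy so the assignment does not target a filtered view of the original frame
df_data = df_data.copy()
df_data['Revenue'] = df_data['Quantity'] * df_data['UnitPrice']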
df_data


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Revenue
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.30
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.00
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France,10.20
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France,12.60
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60
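
With a `Revenue` column in place, one way to prepare the trend is to total revenue per calendar month. The sketch below is an assumption about how that aggregation could look, not the notebook's own method. It uses a small made-up frame, and the name `monthly_revenue` is hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for df_data; the values are illustrative only.
sample = pd.DataFrame({
    'InvoiceDate': pd.to_datetime([
        '2010-12-01 08:26:00', '2010-12-15 10:00:00',
        '2011-01-04 09:30:00', '2011-01-20 14:45:00',
    ]),
    'Quantity': [6, 8, 4, 12],
    'UnitPrice': [2.55, 2.75, 4.15, 0.85],
})
sample['Revenue'] = sample['Quantity'] * sample['UnitPrice']

# Group the rows by calendar month of the invoice date and sum the revenue.
monthly_revenue = (
    sample
    .groupby(sample['InvoiceDate'].dt.to_period('M'))['Revenue']
    .sum()
    .reset_index()
)

# Period values are converted to strings such as '2010-12' for plotting.
monthly_revenue['InvoiceDate'] = monthly_revenue['InvoiceDate'].astype(str)
```

The resulting frame could then be passed to a line chart, for example `px.line(monthly_revenue, x='InvoiceDate', y='Revenue')`.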


Second, let's view the revenue for the top 10 countries, excluding the United Kingdom, and the breakdown by product.

In [11]:
# filter top 10 countries with the highest revenue except for 'United Kingdom'
df_data_countries = df_data[df_data['Country'] != 'United Kingdom']
df_data_countries = df_data_countries.groupby('Country').agg({'Revenue': 'sum'}).reset_index()
df_data_countries = df_data_countries.sort_values('Revenue', ascending=False).head(10)
df_data_countries


Unnamed: 0,Country,Revenue
23,Netherlands,286231.14
10,EIRE,280523.14
14,Germany,235847.33
13,France,221242.57
0,Australia,139897.85
30,Spain,68361.09
32,Switzerland,57148.5
3,Belgium,41481.72
31,Sweden,40150.25
19,Japan,39492.12
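
The same ranking can be written more compactly with `Series.nlargest`, which combines the sort and the `head` call. This is an alternative sketch on made-up data, with the top 2 taken instead of the top 10 to keep the example small.

```python
import pandas as pd

# Hypothetical rows standing in for df_data; the values are illustrative only.
sample = pd.DataFrame({
    'Country': ['United Kingdom', 'France', 'Germany', 'France', 'EIRE'],
    'Revenue': [100.0, 20.0, 40.0, 15.0, 50.0],
})

# Exclude the United Kingdom, total revenue per country, keep the largest 2.
top_countries = (
    sample[sample['Country'] != 'United Kingdom']
    .groupby('Country')['Revenue']
    .sum()
    .nlargest(2)
    .reset_index()
)
```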


Third, let's check the revenue generated by the top 10 customers.

In [12]:
# filter the 'CustomerID' column with the top 10 customers with the highest revenue
df_data_customers = df_data.groupby('CustomerID').agg({'Revenue': 'sum'}).reset_index()
df_data_customers = df_data_customers.sort_values('Revenue', ascending=False).head(10)
df_data_customers


Unnamed: 0,CustomerID,Revenue
3032,16446.0,336942.1
1702,14646.0,280923.02
4232,18102.0,262876.11
3757,17450.0,201459.41
1894,14911.0,154963.61
0,12346.0,154367.2
55,12415.0,126103.61
1344,14156.0,121205.57
2721,16029.0,108532.99
3800,17511.0,93999.38
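
To summarise the top customers with a single figure, the mean of their revenue totals and their share of overall revenue can be derived from the same aggregation. A minimal sketch on made-up data: `avg_top_revenue` and `top_share` are hypothetical names, and the top 2 customers stand in for the top 10.

```python
import pandas as pd

# Hypothetical rows standing in for df_data; the values are illustrative only.
sample = pd.DataFrame({
    'CustomerID': [1.0, 1.0, 2.0, 3.0, 4.0],
    'Revenue': [100.0, 50.0, 80.0, 40.0, 30.0],
})

# Total revenue per customer.
customer_revenue = sample.groupby('CustomerID')['Revenue'].sum()

# The customers with the highest totals.
top_customers = customer_revenue.nlargest(2)

# Average revenue across the top customers.
avg_top_revenue = top_customers.mean()

# Fraction of all revenue that comes from the top customers.
top_share = top_customers.sum() / customer_revenue.sum()
```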
