## Sales Analysis and Visualization

1. What are the Sales Figures for Each Country?
2. What is the Overall Sales Trend?
3. How Many Customers Purchased Products Each Month?
4. How Many New Customers were There Each Month?
5. What Time During the Day Do Customers Make the Most Purchases?
6. Which is the Best Selling Product in Each Country?
7. Which are the Most Successful Products Overall?
8. Which Customers Contributed the Most to Total Sales?

![sales2.png](attachment:sales2.png)

In [2]:
pip install plotly

Note: you may need to restart the kernel to use updated packages.


In [3]:
#import libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # used for interactive visualizations
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot # plot plotly graphs in line in a notebook
init_notebook_mode(connected = True)
import calendar # used to convert numbers between 1 and 12 to month names

import warnings        
warnings.filterwarnings("ignore") # ignores warnings

In [4]:
#import dataset
sale = pd.read_csv('online_retail_cleaned.csv')
sale.head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,ItemTotal
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536373,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 09:02:00,2.55,17850.0,United Kingdom,15.3
2,536375,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 09:32:00,2.55,17850.0,United Kingdom,15.3


In [5]:
sale.shape

(537966, 9)

In [6]:
sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537966 entries, 0 to 537965
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    537966 non-null  object 
 1   StockCode    537966 non-null  object 
 2   Description  537966 non-null  object 
 3   Quantity     537966 non-null  int64  
 4   InvoiceDate  537966 non-null  object 
 5   UnitPrice    537966 non-null  float64
 6   CustomerID   405542 non-null  float64
 7   Country      537966 non-null  object 
 8   ItemTotal    537966 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 36.9+ MB


In [7]:
sale['InvoiceDate'] = pd.to_datetime(sale['InvoiceDate'])
sale['InvoiceDate']

0        2010-01-12 08:26:00
1        2010-01-12 09:02:00
2        2010-01-12 09:32:00
3        2010-01-12 10:19:00
4        2010-01-12 10:39:00
                 ...        
537961   2011-11-28 15:31:00
537962   2011-11-29 11:23:00
537963   2011-11-29 16:47:00
537964   2011-05-12 15:48:00
537965   2011-08-12 10:53:00
Name: InvoiceDate, Length: 537966, dtype: datetime64[ns]

In [8]:
sale.isna().sum()

InvoiceNo           0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     132424
Country             0
ItemTotal           0
dtype: int64

## 1. What are the Sales Figures for Each Country?

In [9]:
# calculate total sales by country
country_sales = sale.groupby("Country")["ItemTotal"].sum() \
.reset_index().rename({"ItemTotal":"TotalSales"},axis=1)
country_sales

Unnamed: 0,Country,TotalSales
0,Australia,136990.0
1,Austria,8698.32
2,Bahrain,548.4
3,Belgium,36662.96
4,Brazil,1143.6
5,Canada,3115.44
6,Channel Islands,20086.29
7,Cyprus,12931.29
8,Czech Republic,671.72
9,Denmark,18042.14


In [10]:
fig = px.pie(country_sales,
            values = 'TotalSales',
            names = 'Country',
            title = 'Sales per country',
            color_discrete_sequence=px.colors.sequential.RdBu)

fig.update_traces(textposition="inside",
                 textinfo ="percent+label")

fig.update_layout(
                  margin=dict(l=10, r=50, b=10, t=70, pad=0),
                  title_font = dict(size = 20) # titlefont is deprecated
                 )
fig.show()

In [11]:
fig = px.pie(country_sales, values='TotalSales', names='Country')
fig.update_traces(textposition="inside",
                 textinfo ="percent+label")
fig.show()

## 2. What is the Overall Sales Trend?

In [12]:
year_trends = sale.groupby("InvoiceDate")["ItemTotal"].sum() \
.reset_index().rename({"ItemTotal":"TotalSales"},axis=1)
year_trends

Unnamed: 0,InvoiceDate,TotalSales
0,2010-01-12 08:26:00,139.12
1,2010-01-12 08:28:00,22.20
2,2010-01-12 08:34:00,348.78
3,2010-01-12 08:35:00,17.85
4,2010-01-12 08:45:00,801.86
...,...,...
21585,2011-12-10 16:17:00,-1579.51
21586,2011-12-10 16:36:00,2110.05
21587,2011-12-10 16:40:00,1901.92
21588,2011-12-10 17:00:00,460.13


In [13]:
sale["InvoiceDate"].min(), sale["InvoiceDate"].max()


(Timestamp('2010-01-12 08:26:00'), Timestamp('2011-12-10 17:19:00'))

In [14]:
# create columns extracting the year and month of InvoiceDates

sale['year'], sale['month'] = sale['InvoiceDate'].dt.year, sale['InvoiceDate'].dt.month

In [15]:
sales_trend = sale.groupby(["year","month"])["ItemTotal"].sum() \
.reset_index().rename({"ItemTotal":"TotalSales"},axis=1)
sales_trend

Unnamed: 0,year,month,TotalSales
0,2010,1,58548.56
1,2010,2,46174.28
2,2010,3,45276.46
3,2010,5,30969.95
4,2010,6,53818.59
5,2010,7,84246.46
6,2010,8,43877.84
7,2010,9,52054.13
8,2010,10,56826.91
9,2010,12,313133.17


In [16]:
# add rows for April 2010 and November 2010, which are missing from the data,
# using the average of the preceding and following months

new_rows = pd.DataFrame({"year":[2010,2010],
                         "month": [4,11],
                         "TotalSales": [38123.21,184980.04]},
                         index = [98,99]) # arbitrary indexes

# insert the rows in the sales table
sales_trend = pd.concat([new_rows, sales_trend]) \
.sort_values(by=["year","month"]).reset_index(drop=True)

sales_trend

Unnamed: 0,year,month,TotalSales
0,2010,1,58548.56
1,2010,2,46174.28
2,2010,3,45276.46
3,2010,4,38123.21
4,2010,5,30969.95
5,2010,6,53818.59
6,2010,7,84246.46
7,2010,8,43877.84
8,2010,9,52054.13
9,2010,10,56826.91


In [18]:
# convert numbers into month names
sales_trend["month"] = sales_trend["month"].apply(lambda x: calendar.month_abbr[x])

# combine month and year
sales_trend["month"] = sales_trend["month"].astype(str) + " " + sales_trend["year"].astype(str)


# drop the redundant year column
sales_trend= sales_trend.drop("year", axis = 1) 

sales_trend = sales_trend[0:23] # drop December 2011 since the data does not cover the whole month

In [19]:
# line chart using plotly graph_objects Scatter
trace = go.Scatter(
                    x = sales_trend["month"],
                    y = sales_trend["TotalSales"],
                    mode = "lines+markers",
                    name = "TotalSales",
                    line = dict(width = 4),
                    marker = dict(
                                  size = 10,
                                  color = "rgba(120, 26, 120, 0.8)"
                                 ),
                    hovertemplate = " %{x}<br>£%{y:,.0f} <extra></extra>",
                  )
line_data = [trace]
layout = dict(
              title = dict(text = "Total Sales by Month", font = dict(size = 20)),
              margin=dict(l=10, r=50, b=10, t=70, pad=0),
              xaxis= dict(title= "Month",ticklen = 5,zeroline = False),
              yaxis= dict(title= "Total Sales", tickformat = ",.0f", tickprefix="£")
             )
fig = dict(data = line_data, layout = layout)
iplot(fig)

We see a clear upward trend, with initial sharp increases towards the end of 2010 and the beginning of 2011.
Let's investigate whether the number of customers in each month exhibits a similar trend.

## 3. How Many Customers Purchased Products Each Month?

In [20]:
#unique Customers

sale.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,ItemTotal,year,month
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 08:26:00,2.55,17850.0,United Kingdom,15.3,2010,1
1,536373,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 09:02:00,2.55,17850.0,United Kingdom,15.3,2010,1
2,536375,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 09:32:00,2.55,17850.0,United Kingdom,15.3,2010,1
3,536390,85123A,WHITE HANGING HEART T-LIGHT HOLDER,64,2010-01-12 10:19:00,2.55,17511.0,United Kingdom,163.2,2010,1
4,536394,85123A,WHITE HANGING HEART T-LIGHT HOLDER,32,2010-01-12 10:39:00,2.55,13408.0,United Kingdom,81.6,2010,1


In [21]:
# number of distinct invoices
sale['InvoiceNo'].nunique()

23539

In [22]:
# invoices with at least one record with missing CustomerID

len(sale[sale["CustomerID"].isna()]["InvoiceNo"].unique())

1506

In [23]:
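# invoices in which every record is missing CustomerID; the count equals the one above,
# so CustomerID is missing for whole invoices rather than for individual records
sale.groupby("InvoiceNo").apply(lambda x: all(np.isnan(i) for i in x["CustomerID"])).tolist().count(True)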

1506

In [24]:
customers = sale[sale["CustomerID"].notna()].groupby(["year", "month"]) \
.agg({"CustomerID": "unique"}) \
.reset_index().rename({"CustomerID": "unique_customer_ids"}, axis = 1)

# calculate the number of unique customers and insert it as a column
customers.insert(2,"unique_customers_this_month", customers["unique_customer_ids"].str.len())

customers.head()

Unnamed: 0,year,month,unique_customers_this_month,unique_customer_ids
0,2010,1,98,"[17850.0, 17511.0, 13408.0, 15862.0, 16552.0, ..."
1,2010,2,116,"[17850.0, 17732.0, 17976.0, 17685.0, 15640.0, ..."
2,2010,3,55,"[16883.0, 12841.0, 17967.0, 14723.0, 17198.0, ..."
3,2010,5,76,"[18055.0, 18109.0, 15708.0, 16931.0, 16814.0, ..."
4,2010,6,90,"[18219.0, 14748.0, 15860.0, 14344.0, 16719.0, ..."


New customers are customers who purchase an item for the first time. In other words, these are customers whose CustomerID appears in the records for the first time at the date of the purchase.

To find which CustomerIDs appear for the first time, we create a running list which accumulates all unique_customer_ids up to each month. Then we remove duplicates and check the length of the list for each month. In doing so, we get the running total of unique customers.
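
The running-total idea can be sketched without pandas. The snippet below is a minimal illustration on made-up customer IDs, not the notebook's data, and the names `monthly_ids`, `seen` and `summary` are hypothetical. It keeps a set of every ID seen so far, so the new customers in a month are the IDs in that month's set that are not yet in the running set.

```python
# Toy data: customer IDs seen in each month (hypothetical values, not from the dataset).
monthly_ids = [
    ("Jan", [101, 102, 103]),
    ("Feb", [102, 104]),
    ("Mar", [101, 104, 105, 106]),
]

seen = set()    # every customer ID observed up to and including the current month
summary = []    # (month, unique this month, new this month, running total)

for month, ids in monthly_ids:
    current = set(ids)           # the set removes duplicates within the month
    new_ids = current - seen     # IDs that have never appeared before
    seen |= current              # extend the running set
    summary.append((month, len(current), len(new_ids), len(seen)))

for row in summary:
    print(row)
```

Taking the set difference gives the new-customer count directly, so the first month needs no special case: every ID in it is new.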

In [25]:
ids = []

# creates a running list of customerids up to each month
for index, row in customers.iterrows(): 
    if index == 0:
        ids.append(row["unique_customer_ids"].tolist())
        
    else:   # adds the present ids to the accumulated list of previous ids
        ids.append(row["unique_customer_ids"].tolist() + ids[index-1])

In [26]:
total_customers = []
for i in range(len(ids)):
    total_customers.append(len(set(ids[i]))) # the set removes duplicates  
    
# insert as a column
customers.insert(3, "total_customers", total_customers)

In [27]:
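# add the first difference of total_customers; in the first month every customer is new,
# so the missing first difference is filled with that month's total

customers.insert(3, "new_customers_this_month", customers["total_customers"].diff() \
.fillna(customers["total_customers"]).astype(int))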

In [28]:
# drop the long lists of unique customers
customers = customers.drop("unique_customer_ids", axis = 1)

In [29]:
# add rows for April 2010 and November 2010, which are missing from the data,
# using averages from the preceding and following months
new_rows = \
pd.DataFrame({"year":[2010,2010],
              "month": [4,11],
              "unique_customers_this_month": [65,271],
              "new_customers_this_month": [59,163],
              "total_customers": [288,803]}, index = [98,99]) # arbitrary indexes

# insert the rows in the customers table
customers = pd.concat([new_rows, customers]) \
.sort_values(by=["year","month"]).reset_index(drop=True)

In [30]:
# convert numbers into month names
customers["month"] = customers["month"].apply(lambda x: calendar.month_abbr[x])

# combine month and year
customers["month"] = customers["month"].astype(str) + " " + customers["year"].astype(str)

# drop the redundant year column
customers = customers.drop("year", axis = 1)

customers = customers[0:23] # drop December 2011 since the data does not cover the whole month

In [31]:
trace1 = go.Scatter(
                    x = customers["month"],
                    y = customers["unique_customers_this_month"],
                    mode = "lines+markers",
                    name = "Unique Customers This Month",
                    line = dict(width = 4),
                    marker = dict(
                                  size = 10,
                                  color = "#0E79B2"
                                 ),
                    hovertemplate = "%{x}<br>Unique Customers: %{y} <extra></extra>",
                  )
trace2 = go.Scatter(
                    x = customers["month"],
                    y = customers["new_customers_this_month"],
                    mode = "lines+markers",
                    name = "New Customers This Month",
                    line = dict(width = 4),
                    marker = dict(
                                  size = 10,
                                  color = "rgba(242, 225, 39, 1)"
                                 ),
                    hovertemplate = "%{x}<br>New Customers: %{y} <extra></extra>",
                  )

line_data = [trace1, trace2]

layout = dict(
              title = dict(text = "Customers by Month", font = dict(size = 20)),
              margin=dict(l=10, r=50, b=10, t=70, pad=0),
              xaxis= dict(title= "Month",ticklen = 5,zeroline = False),
              yaxis= dict(title= "Number of Customers"),
              legend=dict(
                          font = dict(size = 12),
                          yanchor = "top",
                          y=0.98,
                          x= 0.01
                         )
             )
fig = dict(data = line_data, layout = layout)
iplot(fig)

In [32]:
trace1 = go.Scatter(
                    x = customers["month"],
                    y = customers["unique_customers_this_month"],
                    mode = "lines+markers",
                    name = "Unique Customers This Month",
                    line = dict(width = 4),
                    marker = dict(
                                  size = 10,
                                  color = "#0E79B2"
                                 ),
                    hovertemplate = "%{x}<br>Unique Customers: %{y} <extra></extra>",
                  )
trace2 = go.Scatter(
                    x = customers["month"],
                    y = customers["new_customers_this_month"],
                    mode = "lines+markers",
                    name = "New Customers This Month",
                    line = dict(width = 4),
                    marker = dict(
                                  size = 10,
                                  color = "rgba(242, 225, 39, 1)"
                                 ),
                    hovertemplate = "%{x}<br>New Customers: %{y} <extra></extra>",
                  )
trace3 = go.Scatter(
                    x = customers["month"],
                    y = customers["total_customers"],
                    mode = "lines+markers",
                    name = "Total Customers",
                    line = dict(width = 4),
                    marker = dict(
                                  size = 10,
                                  color = "rgba(242, 39, 127, 1)"
                                 ),
                    hovertemplate = "%{x}<br>Total Customers: %{y} <extra></extra>",
                  )

line_data = [trace1, trace2, trace3]

layout = dict(
              title = dict(text = "Customers by Month", font = dict(size = 20)),
              margin=dict(l=10, r=50, b=10, t=70, pad=0),
              xaxis= dict(title= "Month",ticklen = 5,zeroline = False),
              yaxis= dict(title= "Number of Customers"),
              legend=dict(
                          font = dict(size = 12),
                          yanchor = "top",
                          y=0.98,
                          x= 0.01
                         )
             )
fig = dict(data = line_data, layout = layout)
iplot(fig)


## 5. What Time During the Day Do Customers Make the Most Purchases?

In [37]:
# take only the data for December 2010 and October, November and December 2011
subset = sale[
              ((sale["year"] == 2010) & (sale["month"] == 12))
              |    
              ((sale["year"] == 2011) & (sale["month"] == 10))
              |
              ((sale["year"] == 2011) & (sale["month"] == 11))
              |
              ((sale["year"] == 2011) & (sale["month"] == 12))
              ]

In [38]:
# extract the hour of purchase from InvoiceDate and add it as a column
subset = subset.assign(hour = subset["InvoiceDate"].dt.hour) # avoids writing into a slice of sale

In [39]:
frequency = sale.groupby(["year","month","hour"]) \
.agg({"InvoiceNo":"nunique"}).reset_index() \
.rename({"InvoiceNo": "num_orders"}, axis = 1)

frequency.head()

Unnamed: 0,year,month,hour,num_orders
0,2010,1,8,6
1,2010,1,9,18
2,2010,1,10,12
3,2010,1,11,12
4,2010,1,12,23
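
The `nunique` aggregation counts each invoice once per hour, however many line items it contains. A minimal sketch of that counting logic in plain Python, on made-up invoice lines (the values in `lines` and the names `invoices_by_hour` and `num_orders` are hypothetical and exist only for this illustration):

```python
from collections import defaultdict
from datetime import datetime

# Toy invoice lines: (InvoiceNo, InvoiceDate); hypothetical values for illustration.
lines = [
    ("A1", "2011-10-03 08:26:00"),
    ("A1", "2011-10-03 08:26:00"),   # second line item of the same invoice
    ("A2", "2011-10-03 08:55:00"),
    ("A3", "2011-10-04 12:10:00"),
]

invoices_by_hour = defaultdict(set)
for invoice_no, stamp in lines:
    hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").hour
    invoices_by_hour[hour].add(invoice_no)   # a set keeps each invoice only once

# number of distinct invoices placed in each hour of the day
num_orders = {hour: len(nos) for hour, nos in sorted(invoices_by_hour.items())}
print(num_orders)
```

Counting line items instead of invoices would overstate the hours in which customers place large multi-item orders.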


In [42]:
# keep only the four months of interest: December 2010 and October to December 2011
months = [(2010, 12), (2011, 10), (2011, 11), (2011, 12)]
selected = frequency[frequency.set_index(["year", "month"]).index.isin(months)]

# one row per hour, one column per (year, month)
pivot = selected.pivot_table(index = "hour", columns = ["year", "month"],
                             values = "num_orders", aggfunc = "sum")

# flatten the (year, month) column headings into readable labels such as "Dec 2010"
pivot.columns = [f"{calendar.month_abbr[m]} {y}" for y, m in pivot.columns]

pivot.index = pivot.index.astype(str) + ":00" # make hours more readable

pivot.index.name = "" # remove index name for plotting

pivot