This notebook contains exploratory data analysis (EDA) and some data visualisation using the plotly package. The data is the eCommerce behaviour data for a medium-sized cosmetics online store, and only the January 2020 data is used for the analysis.
    
Link to the dataset -> [https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop](https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop)

Let's begin.
    

# Table of Contents

* [Import Packages](#1)
* [Data Cleaning and EDA](#2)
    1. [Reading Data](#3)
    2. [Parsing Datetime and Creating necessary Datetime columns](#4)
    3. [Checking datatypes and range for filtering irrelevant data](#5)
    4. [Checking Null data](#6)
* [Data Analysis and Data Visualisation](#7)
    1. [Checking the Unique Values](#8)
    2. [Customer Purchase Funnel](#9)
        1. [Data Prep](#10)
        2. [Funnel Visualisation](#11)
    3. [Hourly Website Traffic](#12)
        1. [Data Prep](#13)
        2. [Hourly Website Traffic Visualisation](#14)
    4. [Daily Sales, Ticket Size and Number of Orders](#15)
        1. [Data Prep](#16)
        2. [Visualising Daily Sales, Ticket Size and # of Orders](#17)
    5. [Hourly sales During Jan - Heatmap Vs Contour](#18)
        1. [Data Prep](#19)
        2. [Data Visualisation - Daily Sales (by Hour)](#20)
        3. [Data Visualisation - Hourly Sales (by Week)](#21)
    6. [Summary Statistics for # of Orders and Ticket size](#22)
        1. [Data Prep](#23)
        2. [Visualising summary statistics](#24)
    

# Import Packages
<a id="1"></a>
**Let's import all the packages that we are going to use.**

In [None]:
import pandas as pd
import plotly.express as px 
import os
import matplotlib.pyplot as plt
import missingno as msno
import datetime
from plotly import graph_objects as go
from plotly.subplots import make_subplots

This notebook consists of some exploratory data analysis, basic web analytics and data visualisation. For simplicity, only the January data is used.

# Data Cleaning and EDA
<a id="2"></a>

## 1.Reading Data
<a id="3"></a>

In [None]:
data=pd.read_csv('../input/ecommerce-events-history-in-cosmetics-shop/2020-Jan.csv')

In [None]:
data.head()

## 2.Parsing Datetime and Creating necessary Datetime columns
<a id="4"></a>

The event_time column contains the string "UTC". So we split the actual datetime from the "UTC" suffix and then convert event_time to the datetime datatype. We also create new columns called date, time, hours, weekday and weeknum from event_time. Let's also change the weekday format from a number to a string such as Mon, Tues, etc.

In [None]:
#separating timezone
data["timezone"]= data["event_time"].str.rsplit(" ", n=1,expand = True)[1]
data["event_time"]= data["event_time"].str.rsplit(" ", n=1,expand = True)[0]
data["event_time"]=pd.to_datetime(data["event_time"])

#creating date,time,hours,weekday,weeknum columns
data["date"]=data['event_time'].dt.date
data["time"]=data['event_time'].dt.time
data["hours"]=data['event_time'].dt.hour
data["weekday"]=data['event_time'].dt.weekday
data['weeknum']=data['event_time'].dt.isocalendar().week

#changing weekday to string and adding 'week_' prefix to weeknum
data['weeknum'] = 'week_' + data['weeknum'].astype(str)
data['weekday']= data['weekday'].replace({0:'Mon',1:'Tues',2:'Wed',3:'Thurs',4:'Fri',5:'Sat',6:'Sun'})

In [None]:
data.head()
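
As a side note, the same columns can be derived with a single string clean-up and pandas' built-in weekday names. The sketch below is only an illustration on two made-up rows that follow the same "YYYY-MM-DD HH:MM:SS UTC" layout; `sample` and `stamp` are hypothetical names, and the weekday labels come out as Mon, Tue, ... rather than the Tues/Thurs spelling used above.

```python
import pandas as pd

# Two made-up rows in the same layout as the event_time column
sample = pd.DataFrame({"event_time": ["2020-01-01 00:00:00 UTC",
                                      "2020-01-27 18:30:05 UTC"]})

# Strip the trailing timezone label once, then parse with an explicit format
stamp = sample["event_time"].str.replace(" UTC", "", regex=False)
sample["event_time"] = pd.to_datetime(stamp, format="%Y-%m-%d %H:%M:%S")

# Derive the helper columns used for grouping later on
sample["hours"] = sample["event_time"].dt.hour
sample["weekday"] = sample["event_time"].dt.day_name().str[:3]
sample["weeknum"] = "week_" + sample["event_time"].dt.isocalendar().week.astype(str)
print(sample)
```

An explicit `format` avoids per-row format guessing, which may matter on a file with millions of events.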


## 3.Checking datatypes and range for filtering irrelevant data
<a id="5"></a>

In [None]:
data.info()

As the info() output above shows, all the datatypes are as expected. hours is stored as int64. Since hours is only used for grouping and similar analysis, it is fine to keep it as int64. Let's quickly check the minimum, mean, maximum and other statistics of the numerical columns.

In [None]:
data.describe()

We can see that some entries have negative prices. These might be returned orders. Let's see how many such entries there are.

In [None]:
returned_orders=data[data['price']<0]['price'].count()
returned_orders_perc=returned_orders/(data['price'].count())

print("There are %2d returned orders which is %.5f of total orders." %(returned_orders,round(returned_orders_perc,5)))

0.00001% is almost negligible. It might be the result of poor data mining. Let's remove those records and keep the cleaned dataset.
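
One thing to watch when reading the printed value: the cell above prints a fraction, not a percentage. A minimal sketch on made-up prices (the `toy` frame and its values are assumptions, not the real data) shows how the `%` format spec turns the fraction into a percentage so the two are not mixed up:

```python
import pandas as pd

# Made-up prices; one of them is negative
toy = pd.DataFrame({"price": [5.0, 12.5, -7.9, 3.2, 0.0]})

negative_rows = int((toy["price"] < 0).sum())   # number of rows with a negative price
negative_share = negative_rows / len(toy)       # fraction between 0 and 1

# The '%' format spec multiplies by 100 and appends the percent sign
print(f"{negative_rows} negative-price rows = {negative_share:.2%} of all rows")

# Keep only the non-negative prices, as done in the next cell of the notebook
cleaned = toy[toy["price"] >= 0]
```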

## 4.Checking Null data
<a id="6"></a>

In [None]:
data=data[data['price']>=0]

#Checking how many missing values are present in the dataset.

# Calculate the Percentage of missing values in All columns
perc=data.isnull().sum() * 100 / len(data)
print(round(perc,2))

We can clearly see that category_code and brand have the majority of the missing values, while user_session has very few. Let's look at the missing values in the dataframe visually:

In [None]:
#Visualising in matrix form
msno.matrix(data)

In [None]:
#Visualising as bar graph
msno.bar(data)

# Data Analysis and Data Visualisation
<a id="7"></a>

## 1.Checking the Unique Values
<a id="8"></a>

Let's check the number of unique values in each column of the dataframe.

In [None]:
#checking number of unique values in dataframe
data.nunique()

## 2.Customer Purchase Funnel
<a id="9"></a>

### A.Data Prep
<a id="10"></a>

We can see that there are only four event types in the data. So let's look at the different types and build a funnel that shows how customers move through the purchase journey.

In [None]:
data.event_type.unique()

As we can see, there are four event types, of which "remove_from_cart" is an event that not every user goes through during their journey. So let's exclude that event before grouping the data for the funnel visualisation.

In [None]:
#grouping and preparing data for funnel visualisation

data_funnel=data[data['event_type']!='remove_from_cart'].groupby(['event_type'],as_index=False)['event_time'].count()
data_funnel.columns=['event_type','# events']
data_funnel.sort_values('# events', inplace=True,ascending=False)
data_funnel.reset_index(drop=True,inplace=True)
data_funnel['percent']=data_funnel['# events']/data_funnel['# events'].iloc[0]*100
data_funnel
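
The percent column above measures every stage against the first (largest) stage. A complementary view is the step-to-step conversion, where each stage is compared with the one before it. The sketch below uses made-up counts, and `funnel` and `step_conversion` are hypothetical names used only for illustration:

```python
import pandas as pd

# Made-up event counts, already sorted from the widest to the narrowest stage
funnel = pd.DataFrame({"event_type": ["view", "cart", "purchase"],
                       "# events": [1000, 400, 100]})

# Share of the first stage, as in the notebook
funnel["percent"] = funnel["# events"] / funnel["# events"].iloc[0] * 100

# Share of the previous stage; the first row has no predecessor and stays NaN
funnel["step_conversion"] = funnel["# events"] / funnel["# events"].shift(1) * 100
print(funnel)
```

Note that both ratios compare event counts, not unique users, so they describe activity volume rather than the share of customers who convert.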

### B. Funnel Visualisation<a id="11"></a>

Now let's use plotly to visualise the customer funnel.

In [None]:
#plotly to visualise funnel
fig = go.Figure(go.Funnel(
    y = data_funnel["event_type"],
    x = data_funnel["# events"],
    customdata=data_funnel["percent"],
    texttemplate= "<b>%{label}: </B>%{value:.2s}"+"<br><b>% of Total:</b> %{customdata:.2f}%",
    textposition='inside',
    marker = {"color": ["lightyellow", "lightsalmon", "tan"]}
    ))
fig.update_yaxes(visible=False)
fig.update_layout(template='simple_white',     
                  title={'xanchor': 'center',
                         'yanchor': 'top',        
                         'y':0.9,
                         'x':0.5,
                         'text':"Customer Funnel for Purchase Journey"})
fig.show()

## 3.Hourly Website Traffic<a id="12"></a>

###  A. Data Prep<a id="13"></a>

In [None]:
datahour=data.groupby(['hours','weeknum'],as_index=False)['price'].count()
datahour.columns=['hours','weeknum','price']

### B. Hourly Website Traffic Visualisation<a id="14"></a>

In [None]:
#Visualisation
fig = px.area(datahour, x='hours', y="price",color='weeknum')
fig.update_layout(template='simple_white',     
                title={'xanchor': 'center',
                         'yanchor': 'top',        
                         'y':0.9,
                         'x':0.5,
                         'text':"Customer's Hourly Website Views"},
                xaxis = dict(
                    title_text='hours',
                    tickmode = 'linear',
                    tick0 = 0,
                    dtick = 2),
                 yaxis = dict(
                    title_text='Visitors'))
fig.show()

As the graph above shows, there are usually two peaks in a day: one from around 10 AM to 2 PM and another from 6 PM to 8 PM. Knowing the marketing spend throughout the day and the conversion rate around these hours, we can schedule campaigns (especially conversion campaigns) to target the highest-converting hours.

These peak hours generally correspond to the lunch break and the post-work period, so high traffic at these hours makes sense.
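
The conversion rate by hour mentioned above is not computed in this notebook. A minimal sketch of how it could be derived is given below, on made-up events. It assumes the event types are named `view` and `purchase` (only `purchase` and `remove_from_cart` appear in the code of this notebook), and `events` and `counts` are hypothetical names.

```python
import pandas as pd

# Made-up events for two hours of the day
events = pd.DataFrame({
    "hours": [10, 10, 10, 10, 19, 19, 19, 19, 19],
    "event_type": ["view", "view", "cart", "purchase",
                   "view", "view", "view", "view", "purchase"]})

# Count events per hour and event type (hours as rows, event types as columns)
counts = pd.crosstab(events["hours"], events["event_type"])

# Purchases per 100 views in each hour
counts["conversion_pct"] = counts["purchase"] / counts["view"] * 100
print(counts)
```

Hours with high traffic but low `conversion_pct` would be the ones where a conversion campaign has the most room to help.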

## 4.Daily Sales,Ticket Size and Number of Orders<a id="15"></a>

###  A. Data Prep<a id="16"></a>

In [None]:
datadate=data[data['event_type']=='purchase'].groupby(['date'],as_index=False)['price'].sum()
datadateh=data[data['event_type']=='purchase'].groupby(['date'],as_index=False)['price'].count()
datadateh['avg_ticket']=datadate['price']/datadateh['price']
datadate.columns=['date','price']
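
The two group-bys above rely on both results having the same row order. As an alternative sketch, the sum, the count and the average ticket can be computed in one pass with named aggregation. The data below is made up, and `purchases`, `daily`, `sales` and `n_items` are hypothetical names:

```python
import pandas as pd

# Made-up purchase rows for two days
purchases = pd.DataFrame({
    "date": ["2020-01-27", "2020-01-27", "2020-01-28", "2020-01-28", "2020-01-28"],
    "price": [10.0, 30.0, 5.0, 10.0, 15.0]})

# One group-by that returns the daily sales and the number of purchase rows
daily = (purchases.groupby("date", as_index=False)
                  .agg(sales=("price", "sum"), n_items=("price", "count")))

# Average ticket size = daily sales divided by the number of purchase rows
daily["avg_ticket"] = daily["sales"] / daily["n_items"]
print(daily)
```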

### B. Visualising Daily sales,Ticket Size and # of Orders<a id="17"></a>

In [None]:
#Visualisation
fig = make_subplots(
    rows=2, cols=1,
    column_widths=[1.0],
    row_heights=[0.5, 0.5],
    specs=[[{"type": "Bar"}],
           [{"type": "Scatter"}]])

fig.add_trace(go.Bar(x=datadate['date'], y=datadate["price"],name='Sales'),
             row=1,col=1)
fig.add_trace(go.Scatter(x=datadateh['date'], y=datadateh['price'],
                    mode='lines+markers',
                    name='No of Purchases'),
              row=1,col=1)
fig.add_trace(go.Scatter(x=datadateh['date'], y=datadateh['avg_ticket'],
                    mode='lines+markers',
                    name='Avg Ticket Size'),
              row=2,col=1)
fig.update_layout(template='simple_white',     
                title={'xanchor': 'center',
                         'yanchor': 'top',        
                         'y':0.9,
                         'x':0.5,
                         'text':"Daily sales"}
                 )
fig.update_yaxes(title_text='Sales/No of Purchases',ticks="inside", row=1)
fig.update_yaxes(title_text='Avg Ticket Size',ticks="inside", row=2)
fig.update_xaxes(title_text='Date',ticks="inside")
fig.show()

As the graph above shows, there are two peaks in sales, on Jan 27th and Jan 28th, 2020. This might be due to specific promotional campaigns. The number of purchases is also high on these days, but the average ticket size is not noticeably lower. From this we can infer that, even if promotional campaigns were running on Jan 27 and Jan 28, the discounts were not deep enough to pull down the average ticket size.

## 5.Hourly sales During Jan - Heatmap Vs Contour<a id="18"></a>

###  A. Data Prep<a id="19"></a>

In [None]:
#prepare data for further Visualisation
datadatehour=data[data['event_type']=='purchase'].groupby(['date','hours'],as_index=False)['price'].sum()
datadatehour.columns=['date','hours','price']
datadatehour['hours']=datadatehour['hours'].astype(str)
datadatehour['date']=datadatehour['date'].astype(str)

### B. Data Visualisation - Daily Sales (by Hour)<a id="20"></a>

In [None]:
fig = make_subplots(
    rows=2, cols=1,
    column_widths=[1.0],
    row_heights=[0.5, 0.5],
    specs=[[{"type": "histogram2d"}],
           [{"type": "histogram2dcontour"}]])
fig.add_trace(
    go.Histogram2d(
        x = datadatehour["date"],
        y = datadatehour["hours"],
        z = datadatehour["price"],
        colorbar=dict(len=0.5, y=0.8,title="Overall Sales"),
        histfunc = "sum",
        colorscale = "cividis",
        nbinsx = 31,
        nbinsy=24),
    row=1,col=1)
fig.add_trace(
    go.Histogram2dContour(
        x = datadatehour["date"],
        y = datadatehour["hours"],
        z = datadatehour["price"],
        colorbar=dict(len=0.5, y=0.25,title="Overall Sales",tickmode="array",
        tickvals=[20000,40000,60000,80000,100000,120000],),
        histfunc = "sum",
        showlegend=False,
        colorscale = "cividis",
        contours = dict(
            showlabels = True),
        nbinsx=31,
        nbinsy=24),
    row=2,col=1)


fig.update_layout(
    template="simple_white",
    margin=dict(r=50, t=50, b=50, l=50),
    height=600,
    showlegend=False,
    title={'xanchor': 'center',
           'yanchor': 'top',        
            'y':1,
            'x':0.5,
           'text':"Heatmap vs Contourplot</br></br></br>Hourly sales during Jan"}
)
fig.update_yaxes(title_text='Hours',ticks="inside")
fig.update_xaxes(title_text='Date',ticks="inside")
fig.show()

The same high volume of sales on Jan 27th and 28th is visible in the heatmap and the contour plot above. The visualisation also shows that sales generally happen between 9 AM and 1 PM, but in the last week there are sales in the evening as well.

### C. Data Visualisation - Hourly Sales (by Week)<a id="21"></a>

In [None]:
fig = make_subplots(
    rows=2, cols=1,
    column_widths=[1.0],
    row_heights=[0.5, 0.5],
    specs=[[{"type": "histogram2d"}],
           [{"type": "histogram2dcontour"}]])
fig.add_trace(
    go.Histogram2d(
        x = datahour["hours"],
        y = datahour["weeknum"],
        z = datahour["price"],
        colorbar=dict(len=0.5, y=0.8,title="Overall Sales"),
        histfunc = "sum",
        colorscale='cividis',
        nbinsx = 31,
        nbinsy=7
    ),
    row=1,col=1)
fig.add_trace(
    go.Histogram2dContour(
        x = datahour["hours"],
        y = datahour["weeknum"],
        z = datahour["price"],
        colorbar=dict(len=0.5, y=0.20,title="Overall Sales",tickmode="array",
        tickvals=[20000,40000,60000],
                     ),
        histfunc = "sum",
        showlegend=False,
        colorscale='cividis',
        contours = dict(
            showlabels = True),
        nbinsx=31,
        nbinsy=7
    ),
    row=2,col=1)


fig.update_layout(
    template="simple_white",
    margin=dict(r=50, t=50, b=50, l=50),
    height=600,
    showlegend=False,
    title={'xanchor': 'center',
           'yanchor': 'top',        
            'y':0.999,
            'x':0.5,
           'text':"Heatmap vs Contourplot</br></br></br>Hourly sales by week"}
)
fig.update_yaxes(title_text='Week',ticks="inside")
fig.update_xaxes(title_text='Hours',ticks="inside")
fig.show()

When we combine the sales week-wise and visualise them, sales happen, as expected, from around 9 AM to 2 PM and again in the evening from 4 PM to 8 PM (mostly from 6 PM to 8 PM).

When combining weekly sales, the majority of sales happened in week 3 and week 4. This is partly because, in January 2020, ISO week 1 (Jan 1 to 5) and week 5 (Jan 27 to 31) each contain only 5 days, while the other weeks have 7 days. If we compensate for the missing days in week 5, the last week would show higher sales as well.
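
The number of January days that fall in each ISO week can be checked directly with pandas, and the weekly totals can then be put on a per-day basis for a fairer comparison. In the sketch below only the calendar part is factual. The `weekly_sales` figures are made up for illustration, and all variable names are hypothetical.

```python
import pandas as pd

# Every calendar day of January 2020
january = pd.date_range("2020-01-01", "2020-01-31", freq="D")

# Number of January days in each ISO week (weeks start on Monday)
days_per_week = january.isocalendar().week.astype(int).value_counts().sort_index()
print(days_per_week)

# Made-up weekly sales totals, indexed by ISO week number
weekly_sales = pd.Series({1: 500.0, 2: 700.0, 3: 1400.0, 4: 1400.0, 5: 1500.0})

# Sales per day, which compensates for the shorter first and last weeks
sales_per_day = weekly_sales / days_per_week
print(sales_per_day)
```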

## 6.Summary Statistics for # of Orders and Ticket size <a id="22"></a>

###  A. Data Prep<a id="23"></a>

In [None]:
data.head()

In order to find the average ticket size and the average number of purchases by user, let's group the data by user_id.

In [None]:
#grouping based on user_id,date,hours,weekday,weeknum
data_user=data[data['event_type']=='purchase'].groupby(['user_id','date','hours','weekday','weeknum']).agg({'price':['sum','count']}).reset_index()

#converting columns from multi index to single index
data_user.columns=data_user.columns.to_flat_index()
data_user=data_user.rename(columns={('price', 'sum'):'purchased_value',('price', 'count'):'no_of_purchases',('user_id',''):'user_id',
                          ('date',''):'date',('hours',''):'hours',('weekday',''):'weekday',('weeknum',''):'weeknum'})

#checking whether columns are updated correctly
data_user.head()

### B. Visualising summary statistics<a id="24"></a>

In [None]:
fig = make_subplots(
    rows=1, cols=2,
    column_widths=[0.5, 0.5],
    row_heights=[1.0],
    specs=[[{"type": "Violin"},
           {"type": "Violin"}]])
fig = fig.add_trace(go.Violin(y=data_user['no_of_purchases'],
                            name='No of Purchases',
                            box_visible=True,
                            meanline_visible=True),
                       row=1,col=1)
fig = fig.add_trace(go.Violin(y=data_user['purchased_value'],
                            name='Ticket Size',
                            box_visible=True,
                            meanline_visible=True),
                       row=1,col=2)
fig.update_layout(
    template="simple_white",
    margin=dict(r=50, t=50, b=50, l=50),
    height=600,
    showlegend=False,
    yaxis_title="Count",
    title={'xanchor': 'center',
           'yanchor': 'top',        
            'y':0.95,
            'x':0.5,
           'text':"Summary Statistics for Number of Purchases and Ticket size of purchases"}
)
fig.show()

As the graph above shows, about 8 products are purchased on average, whereas the median is 6 products in the month of January. 25% of people (since Q3 is 10) have purchased at least 10 products. Similarly, the average ticket size is \$40.7, and 25% of people (those above Q3) have spent at least \$50.1 per ticket.
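
The mean, median and Q3 quoted above are read from the hover labels of the violin plots. They can also be obtained numerically with `describe()`. The sketch below runs on made-up values, so its numbers are not the notebook's results, and `toy_user` is a hypothetical stand-in for `data_user`:

```python
import pandas as pd

# Made-up purchase counts for five groups
toy_user = pd.DataFrame({"no_of_purchases": [1, 2, 3, 4, 10]})

# describe() returns count, mean, std, min, quartiles and max in one call
stats = toy_user["no_of_purchases"].describe()
mean, median, q3 = stats["mean"], stats["50%"], stats["75%"]

# Share of rows at or above the third quartile
share_at_least_q3 = (toy_user["no_of_purchases"] >= q3).mean()
print(mean, median, q3, share_at_least_q3)
```

A mean well above the median, as in this toy example, indicates a right-skewed distribution driven by a few large orders.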


***I created this notebook to practise exploratory data analysis and data visualisation. Feel free to share your feedback with me. If you like my work, please upvote. It will help keep me motivated.***