## Introduction

In [1]:
from pandasql import sqldf
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data

The data is invoice-level sales data from Atliq Hardwares, a computer hardware producer in India, which I obtained from Kaggle.com. The dataset contains transactions from 2017 to 2021.

Dataset can be found via https://www.kaggle.com/datasets/ad043santhoshs/sales-domain

In [2]:
salesdata = pd.read_csv('Sales_domain.csv', encoding='ISO-8859-1')

## Data Transformation

For data transformation, I did the following:
- convert the date into datetime format
- convert the customer code to a string instead of an int to avoid accidental aggregation

In [3]:
salesdata['Date'] = pd.to_datetime(salesdata['Date'], format='%d-%m-%Y')
salesdata['customer_code'] = salesdata['customer_code'].astype(str)
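
The two conversions above can be tried on a small made-up frame (the values below are illustrative and not taken from the dataset). This is only a sketch: it shows that an explicit `format` keeps day-first dates from being misread, and that a string identifier column drops out of numeric selection.

```python
import pandas as pd

# Toy frame with made-up values; the real data comes from Sales_domain.csv.
toy = pd.DataFrame({
    "Date": ["01-09-2017", "01-10-2017"],   # day-month-year strings
    "customer_code": [70002017, 70002018],  # integer identifiers
    "sold_quantity": [51, 20],
})

# An explicit format avoids 01-09-2017 being read as January 9th.
toy["Date"] = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

# Identifiers become strings so they are not summed or averaged by mistake.
toy["customer_code"] = toy["customer_code"].astype(str)

# Only sold_quantity is left among the numeric columns.
numeric_columns = list(toy.select_dtypes("number").columns)
print(toy.dtypes)
print(numeric_columns)
```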

In [4]:
salesdata.head(5)

| | Date | product_code | customer_code | sold_quantity | fiscal_year | division | segment | category | product | variant | customer | platform | channel | market | sub_zone | region | gross_price | cost_year | manufacturing_cost | pre_invoice_discount_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-09-01 | A0118150101 | 70002017 | 51 | 2018 | P & A | Peripherals | Internal HDD | AQ Dracula HDD  3.5 Inch SATA 6 Gb/s 5400 RPM... | Standard | Atliq Exclusive | Brick & Mortar | Direct | India | India | APAC | 15.3952 | 2018 | 4.619 | 0.0824 |
| 1 | 2017-09-01 | A0118150101 | 70002017 | 51 | 2018 | P & A | Peripherals | Internal HDD | AQ Dracula HDD  3.5 Inch SATA 6 Gb/s 5400 RPM... | Standard | Atliq Exclusive | Brick & Mortar | Direct | India | India | APAC | 15.3952 | 2019 | 4.2033 | 0.0824 |
| 2 | 2017-09-01 | A0118150101 | 70002017 | 51 | 2018 | P & A | Peripherals | Internal HDD | AQ Dracula HDD  3.5 Inch SATA 6 Gb/s 5400 RPM... | Standard | Atliq Exclusive | Brick & Mortar | Direct | India | India | APAC | 15.3952 | 2020 | 5.0207 | 0.0824 |
| 3 | 2017-09-01 | A0118150101 | 70002017 | 51 | 2018 | P & A | Peripherals | Internal HDD | AQ Dracula HDD  3.5 Inch SATA 6 Gb/s 5400 RPM... | Standard | Atliq Exclusive | Brick & Mortar | Direct | India | India | APAC | 15.3952 | 2021 | 5.5172 | 0.0824 |
| 4 | 2017-09-01 | A0118150101 | 70002017 | 51 | 2018 | P & A | Peripherals | Internal HDD | AQ Dracula HDD  3.5 Inch SATA 6 Gb/s 5400 RPM... | Standard | Atliq Exclusive | Brick & Mortar | Direct | India | India | APAC | 14.4392 | 2018 | 4.619 | 0.0824 |


## PART 1: Exploratory Data Analysis Using Python and SQLite

For the exploratory data analysis, instead of using pandas to wrangle the data for plotting, I decided to use SQLite because I wanted to practice the SQL I have learned in my leisure time. In short, I use SQL to query the necessary data and plot it with matplotlib and seaborn.
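
As a rough illustration of this query-then-plot workflow, the sketch below runs a SQL aggregation over a DataFrame. It uses the standard-library `sqlite3` module together with `DataFrame.to_sql` and `pd.read_sql_query` rather than `pandasql.sqldf`, purely to stay self-contained; the table name `toy_sales` and all values are made up.

```python
import sqlite3

import pandas as pd

# Made-up rows standing in for the sales table.
toy = pd.DataFrame({
    "market": ["India", "India", "USA"],
    "sold_quantity": [10, 5, 7],
})

# Load the frame into an in-memory SQLite database, then query it with SQL.
conn = sqlite3.connect(":memory:")
toy.to_sql("toy_sales", conn, index=False)

query = """
SELECT market, SUM(sold_quantity) AS total_quantity
FROM toy_sales
GROUP BY market
ORDER BY market
"""
result = pd.read_sql_query(query, conn)
conn.close()

# The result is an ordinary DataFrame, ready to pass to seaborn or matplotlib.
print(result)
```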

In [5]:
salesdata.describe()

| | sold_quantity | fiscal_year | gross_price | cost_year | manufacturing_cost | pre_invoice_discount_pct |
|---|---|---|---|---|---|---|
| count | 1048575.0 | 1048575.0 | 1048575.0 | 1048575.0 | 1048575.0 | 1048575.0 |
| mean | 55.40025 | 2020.022 | 20.50893 | 2020.028 | 6.0985 | 0.2322573 |
| std | 136.9163 | 1.125987 | 3.220415 | 1.33466 | 0.9597987 | 0.05885463 |
| min | 0.0 | 2018.0 | 14.0555 | 2018.0 | 4.2033 | 0.051 |
| 25% | 7.0 | 2019.0 | 18.4663 | 2019.0 | 5.3448 | 0.2037 |
| 50% | 20.0 | 2020.0 | 19.8577 | 2020.0 | 5.9469 | 0.2404 |
| 75% | 51.0 | 2021.0 | 23.6154 | 2021.0 | 7.0498 | 0.2762 |
| max | 4127.0 | 2022.0 | 30.306 | 2022.0 | 9.1877 | 0.3099 |


**1.1 General Metrics**

First, let's get a sense of how large and varied the dataset is in terms of its customer base, market presence, and products.

In [6]:
#unique counts
query = """
SELECT
    COUNT (DISTINCT product) AS unique_product_count,
    COUNT (DISTINCT category) AS unique_product_category,
    COUNT (DISTINCT customer_code) AS unique_customer,
    COUNT (DISTINCT market) AS unique_market
FROM salesdata;
"""
sqldf(query)

**Observations**

For general metrics, I have observed that:
- there are only a few products in the dataset: 4 products in 2 product categories
- there are 209 unique customers/clients
- Atliq Hardwares has expanded its business to 27 countries worldwide!
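
The same distinct counts can be cross-checked in pandas with `nunique()`, which, like `COUNT(DISTINCT ...)`, ignores missing values. The frame below is a made-up stand-in, so its counts are not the dataset's.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "product": ["HDD A", "HDD A", "GPU B"],
    "category": ["Internal HDD", "Internal HDD", "Graphic Card"],
    "customer_code": ["1", "2", "2"],
    "market": ["India", "USA", "USA"],
})

# One distinct count per column, returned as a Series indexed by column name.
unique_counts = toy[["product", "category", "customer_code", "market"]].nunique()
print(unique_counts)
```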

**1.2 Sales Revenue Metrics**

Let's look at the company monthly revenue.

In [11]:
#gross sales (monthly revenue)
query = """ 

SELECT 
    Date,
    ROUND(SUM((gross_price * sold_quantity) * (1 - pre_invoice_discount_pct)),2) AS gross_sales
FROM salesdata
GROUP BY Date
"""

df_gross_sale = sqldf(query)
df_gross_sale.index = pd.to_datetime(df_gross_sale['Date'])
df_gross_sale = df_gross_sale.resample('M').sum()
df_gross_sale['monthyear'] = df_gross_sale.index.strftime('%b-%Y')
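
To make the revenue formula concrete, here is a small worked sketch. The first row reuses the price, quantity, and discount shown in the `salesdata.head(5)` output; the second row is made up. Grouping on a monthly period is used here as an alternative to `resample('M')` and is an assumption of the sketch, not the cell's own method.

```python
import pandas as pd

# First row taken from salesdata.head(); the second row is made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-01", "2017-10-01"]),
    "gross_price": [15.3952, 20.0],
    "sold_quantity": [51, 10],
    "pre_invoice_discount_pct": [0.0824, 0.25],
})

# Revenue after the pre-invoice discount: price * quantity * (1 - discount).
toy["gross_sales"] = (
    toy["gross_price"] * toy["sold_quantity"] * (1 - toy["pre_invoice_discount_pct"])
)

# Sum per calendar month by grouping on the month period.
monthly = toy.groupby(toy["Date"].dt.to_period("M"))["gross_sales"].sum().round(2)
print(monthly)
```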

In [12]:
plt.figure(figsize=(20,5))
sns.lineplot(x='monthyear', y='gross_sales', data=df_gross_sale)
sns.scatterplot(x='monthyear', y='gross_sales', data=df_gross_sale)
plt.xticks(rotation=90, fontsize=12)
plt.title('Gross Sales (100 Million)', fontsize=18)
plt.xlabel('Month Year', fontsize=18)

In [13]:
# pct_change() returns a fraction; multiply by 100 so the values match the '%' axis label
df_gross_sale['growthrate'] = df_gross_sale['gross_sales'].pct_change() * 100
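
For reference, `pct_change()` computes `current / previous - 1` as a fraction, and the first period has no previous value, so it is `NaN`. A minimal sketch with made-up monthly totals:

```python
import pandas as pd

# Made-up monthly totals.
monthly_sales = pd.Series(
    [100.0, 150.0, 120.0],
    index=pd.period_range("2021-07", periods=3, freq="M"),
)

growth = monthly_sales.pct_change()  # fractions: NaN, 0.5, -0.2
growth_percent = growth * 100        # percentages: NaN, 50, -20
print(growth_percent)
```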

In [14]:
plt.figure(figsize=(20,5))
sns.lineplot(x='monthyear', y='growthrate', data=df_gross_sale)
sns.scatterplot(x='monthyear', y='growthrate', data=df_gross_sale)
plt.title('Monthly Revenue Growth Rate', fontsize=18)
plt.xticks(rotation=90, fontsize=12)
plt.ylabel('Growth Rate (%)', fontsize=18)
plt.xlabel('Month Year', fontsize=18)

**Observations**
- There is a pattern where monthly revenue increases drastically between August and December every year
- Between September and December 2021, revenue jumps drastically, with higher monthly revenue than in any previous period

This is an interesting pattern, and we can dig deeper to find the reasons for such a massive spike at this particular time. Next, we will look at metrics relating to customer engagement.

**1.3 Customer Metrics**

In [15]:
query = """
    SELECT
        Date,
        COUNT(DISTINCT customer_code) AS active_customer
    FROM salesdata
    GROUP BY Date
"""
df_active_customer = sqldf(query)
df_active_customer.index = pd.to_datetime(df_active_customer['Date'])
df_active_customer = df_active_customer.resample('M').sum()
df_active_customer['monthyear'] = df_active_customer.index.strftime('%b-%Y')
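
One caveat with counting distinct customers per date and then summing per month: a customer who appears on two dates in the same month is counted twice. In this dataset all invoices fall on the first of the month, so the two approaches should agree, but the made-up frame below shows how they would differ otherwise.

```python
import pandas as pd

# Made-up rows: customer A buys on two different days in September.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-09-01", "2019-09-15", "2019-09-15", "2019-10-01"]),
    "customer_code": ["A", "A", "B", "A"],
})

# Approach used in the cell above: distinct per date, then summed per month.
daily = toy.groupby("Date")["customer_code"].nunique()
daily_then_summed = daily.groupby(daily.index.to_period("M")).sum()

# Alternative: distinct customers counted directly per month.
monthly_distinct = toy.groupby(toy["Date"].dt.to_period("M"))["customer_code"].nunique()

print(daily_then_summed)
print(monthly_distinct)
```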

In [16]:
plt.figure(figsize=(20,5))
sns.lineplot(x='monthyear', y='active_customer', data=df_active_customer)
sns.scatterplot(x='monthyear', y='active_customer', data=df_active_customer)
plt.xticks(rotation=90, fontsize=12)
plt.title('Monthly Active Customer', fontsize=18)
plt.xlabel('Month Year', fontsize=18)
plt.ylabel('Active Customer', fontsize=18)

**Observations**
- Between 2017 and 2019 there is a sudden increase in customer activity every September, which corresponds to what we discovered in 1.2. It is possible that this sudden increase in customer activity affects monthly revenue.
- However, I see a contradiction in later years: monthly revenue in September 2021 increased drastically, but customer activity in that period has remained roughly the same since 2019.

While customer activity answers half of the question, the sudden revenue spike in 2021 remains unexplained. Let's look at more metrics to see if we can find more insight into the question raised in 1.2.

**1.4 Order/Transaction Metrics**

In [17]:
query = """
WITH temporderid AS (
    SELECT
        Date,
        customer_code,
        SUM(sold_quantity * gross_price) AS total_revenue
    FROM salesdata
    GROUP BY Date, customer_code
)


SELECT 
    Date,
    COUNT(*) AS transaction_count
FROM temporderid
GROUP BY Date
"""

df_monthly_order = sqldf(query)
df_monthly_order.index = pd.to_datetime(df_monthly_order['Date'])
df_monthly_order = df_monthly_order.resample('M').sum()
df_monthly_order['monthyear'] = df_monthly_order.index.strftime('%b-%Y')

In [18]:
plt.figure(figsize=(20,5))
sns.lineplot(x='monthyear', y='transaction_count', data=df_monthly_order)
sns.scatterplot(x='monthyear', y='transaction_count', data=df_monthly_order)
plt.xticks(rotation=90, fontsize=12)
plt.title('Monthly Order', fontsize=18)
plt.xlabel('Month Year', fontsize=18)
plt.ylabel('Amount of Order', fontsize=18)

**Observations**
- The monthly order count follows exactly the same pattern as the monthly active customers. This is expected because every order in the dataset is invoiced at the beginning of the month, so each client effectively places one order per month.
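
The query in this section treats each distinct (Date, customer_code) pair as one order. A pandas sketch of the same idea on made-up rows shows why the order count matches the active customer count when there is one invoice date per month:

```python
import pandas as pd

# Made-up rows: customer A buys two products on the same invoice date.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-09-01"] * 3 + ["2019-10-01"]),
    "customer_code": ["A", "A", "B", "A"],
    "product_code": ["P1", "P2", "P1", "P1"],
})

# One order per distinct (Date, customer_code) pair.
orders = toy[["Date", "customer_code"]].drop_duplicates()
transaction_count = orders.groupby("Date").size()

# Distinct customers per date, for comparison.
active_customers = toy.groupby("Date")["customer_code"].nunique()

print(transaction_count)
print(active_customers)
```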

**1.5 Old Customer vs New Customer (Revenue)**

Let's check whether new customers, who make their first purchase in a given month, have an impact on the company's monthly revenue. Perhaps some new customers make a big purchase right away.

In [29]:
query= """
WITH first_purch AS (
SELECT
    customer_code,
    MIN(Date) AS firstpurchase
FROM salesdata
GROUP BY customer_code ),

salesdata_first AS (
SELECT *
FROM salesdata AS df
LEFT JOIN first_purch AS df_first
    ON df.customer_code = df_first.customer_code),

newold_customer AS (
SELECT
    Date AS orderdate,
    product_code,
    customer_code,
    sold_quantity,
    gross_price,
    pre_invoice_discount_pct,
    CASE 
        WHEN Date = firstpurchase THEN 'new'
        WHEN Date > firstpurchase THEN 'existing'
        ELSE 'Error' END AS newcustomer
FROM salesdata_first)


SELECT 
    orderdate,
    newcustomer,
    ROUND(SUM((gross_price * sold_quantity) * (1 - pre_invoice_discount_pct)),2) AS gross_sales
FROM newold_customer
GROUP BY orderdate, newcustomer
"""

df_newold_customer = sqldf(query)
df_newold_customer['monthyear'] = pd.to_datetime(df_newold_customer['orderdate']).dt.strftime('%b-%Y')

In [31]:
plt.figure(figsize=(20,5))
# label each line so the two customer groups can be told apart in the legend
sns.lineplot(x='monthyear', y='gross_sales', data=df_newold_customer[df_newold_customer['newcustomer']=='existing'], label='existing')
sns.lineplot(x='monthyear', y='gross_sales', data=df_newold_customer[df_newold_customer['newcustomer']=='new'], label='new')
plt.xticks(rotation=90, fontsize=12)
plt.title('Monthly Revenue, New Customer vs Existing Customer')
plt.legend()
plt.show()

**Observations**
- New customers who purchase in the same month they join make relatively small purchases compared to existing customers
- It's possible that new customers make a small purchase first and bigger purchases in later months, as suggested by the increase in monthly revenue after September (the month with the sudden increase in customer activity)
- However, this still doesn't explain the sudden large increase in revenue in September 2021
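
The new/existing split used above can also be sketched in pandas: find each customer's first purchase date, then compare it with the order date. The rows and the precomputed `gross_sales` values below are made up.

```python
import pandas as pd

# Made-up orders; gross_sales is assumed to be already computed per row.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2018-09-01", "2018-10-01", "2018-10-01"]),
    "customer_code": ["A", "A", "B"],
    "gross_sales": [100.0, 300.0, 50.0],
})

# First purchase date of each customer, aligned to every row.
first_purchase = toy.groupby("customer_code")["Date"].transform("min")

# 'new' when the order falls on the first purchase date, otherwise 'existing'.
toy["newcustomer"] = (toy["Date"] == first_purchase).map({True: "new", False: "existing"})

summary = toy.groupby(["Date", "newcustomer"])["gross_sales"].sum()
print(summary)
```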

In [66]:
query= """
WITH first_purch AS (
SELECT
    customer_code,
    MIN(Date) AS firstpurchase
FROM salesdata
GROUP BY customer_code),

salesdata_first AS (
SELECT *
FROM salesdata AS df
LEFT JOIN first_purch AS df_first
    ON df.customer_code = df_first.customer_code),

newold_customer AS (
SELECT
    Date AS orderdate,
    product_code,
    customer_code,
    pre_invoice_discount_pct,
    CASE 
        WHEN Date = firstpurchase THEN 'new'
        WHEN Date > firstpurchase THEN 'existing'
        ELSE 'Error' END AS newcustomer
FROM salesdata_first)

SELECT
    orderdate,
    customer_code,
    newcustomer
FROM newold_customer
WHERE newcustomer = 'new'
GROUP BY orderdate, customer_code, newcustomer
"""

df_newold_customer_order = sqldf(query)
df_newold_customer_order['monthyear'] = pd.to_datetime(df_newold_customer_order['orderdate']).dt.strftime('%b-%Y')

In [67]:
pd.DataFrame(df_newold_customer_order.groupby('orderdate')['newcustomer'].value_counts())

**Observations**
- Since an increase in new customers might raise revenue in the same period, I checked when new customers arrive: they mostly join in September and October.
- Looking at the number of new customers acquired, the company did not acquire any new customers in 2020 and 2021, so the sudden spike in September 2021 must have come from existing clients making more purchases.
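
A compact way to see months with no new customers is to count first purchases per month and fill the gaps with zero. The sketch below uses made-up customers and dates, so its counts say nothing about the real dataset.

```python
import pandas as pd

# Made-up orders for three customers.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-01", "2017-10-01", "2017-10-01", "2018-09-01"]),
    "customer_code": ["A", "A", "B", "C"],
})

# Month of each customer's first purchase.
first_purchase = toy.groupby("customer_code")["Date"].min()
counts = first_purchase.dt.to_period("M").value_counts().sort_index()

# Reindex over every month in the data so months without new customers show 0.
all_months = pd.period_range(toy["Date"].min(), toy["Date"].max(), freq="M")
new_per_month = counts.reindex(all_months, fill_value=0)
print(new_per_month)
```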

**1.6 Product Performance**

Let's look at the performance of each product to see whether product demand is the reason for the sudden large revenue spike.

In [None]:
#total sales by product
query = """
SELECT 
    category,
    product,
    ROUND(SUM((gross_price * sold_quantity) * (1 - pre_invoice_discount_pct)),2) AS gross_sales
FROM salesdata
GROUP BY category, product
"""

df_gross_sale_product= sqldf(query)

In [107]:
df_gross_sale_product

In [110]:
plt.figure(figsize=(20,10))
sns.barplot(x='gross_sales', y='product', data=df_gross_sale_product.sort_values('gross_sales', ascending=False))
plt.xticks(rotation=90)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlabel('Gross Sales (100 Million)', fontsize=18)

**Observations**
- The company's most popular product is a hard disk, which generated over 500 million in sales!

In [95]:
#monthly sales by product
query = """
SELECT
    Date,
    product,
    ROUND(SUM((gross_price * sold_quantity) * (1 - pre_invoice_discount_pct)),2) AS gross_sales
FROM salesdata
GROUP BY Date, product 
"""

df_monthly_product = sqldf(query)

In [96]:
df_monthly_product = df_monthly_product.pivot(index='Date', columns='product', values='gross_sales').reset_index()
df_monthly_product = df_monthly_product.fillna(0)
df_monthly_product['monthyear'] = pd.to_datetime(df_monthly_product['Date']).dt.strftime('%b-%Y')
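
The pivot step turns the long query result (one row per date and product) into a wide table with one column per product, which is what the line plots need. A minimal sketch with two product names from the dataset and made-up values:

```python
import pandas as pd

# Long form: one row per (Date, product); the sales values are made up.
long_form = pd.DataFrame({
    "Date": ["2021-08-01", "2021-08-01", "2021-09-01"],
    "product": ["AQ Zion Saga", "AQ Mforce Gen X", "AQ Zion Saga"],
    "gross_sales": [10.0, 20.0, 30.0],
})

# Wide form: one row per date, one column per product; missing pairs become 0.
wide = long_form.pivot(index="Date", columns="product", values="gross_sales").fillna(0)
print(wide)
```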

In [105]:
plt.figure(figsize=(20,5))
sns.lineplot(x='monthyear', y='AQ Dracula HDD  3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache', data= df_monthly_product, label='AQ Dracula HDD')
sns.lineplot(x='monthyear', y='AQ Mforce Gen X', data= df_monthly_product, label='AQ Mforce Gen X')
sns.lineplot(x='monthyear', y='AQ WereWolf NAS Internal Hard Drive HDD  8.89 cm', data= df_monthly_product, label='AQ WereWolf NAS Internal Hard Drive')
sns.lineplot(x='monthyear', y='AQ Zion Saga', data= df_monthly_product, label='AQ Zion Saga')
plt.xticks(rotation=90, fontsize=12)
plt.ylabel('Gross Revenue')
plt.xlabel('Month Year')
plt.legend()
plt.show()

**Observations**
- There is a sudden increase in demand for all of our products from September 2021 onward, which could explain our revenue spike.

Therefore, from the data analysis, we can conclude that the sudden increase in monthly revenue is caused by an unexpected increase in demand for computer hardware after August 2021. Maybe more people want to build a DIY computer!

**1.7 Revenue by market locations**

Now that we have found our culprit, let's check whether the sudden increase in demand happened in all of our markets.

In [123]:
#Monthly Revenue by Location
query = """
    SELECT
        Date,
        market,
        ROUND(SUM((gross_price * sold_quantity) * (1 - pre_invoice_discount_pct)),2) AS gross_sales
    FROM salesdata
    GROUP BY Date, market
"""
df_monthly_sale_market = sqldf(query)

In [124]:
df_monthly_sale_market = df_monthly_sale_market.pivot(index='Date', columns='market', values='gross_sales')
df_monthly_sale_market = df_monthly_sale_market.reset_index()
df_monthly_sale_market['monthyear'] = pd.to_datetime(df_monthly_sale_market['Date']).dt.strftime('%b-%Y')

In [130]:
# sorted list of markets so the legend order is the same on every run
country_list = sorted(salesdata['market'].unique())

plt.figure(figsize=(20,10))
for i in country_list:
    sns.lineplot(x = 'monthyear', y=i, data=df_monthly_sale_market, label=i)
    
plt.ylabel('Gross Sales (10 Million)', fontsize=18) 
plt.xlabel('Month Year', fontsize=18) 
plt.tick_params(axis='both', which='major', labelsize=16)
plt.xticks(rotation=90, fontsize=12)
plt.legend()
plt.show()

**Observations**
- The line plot shows that the sudden increase in demand occurred in most of our markets, with India having the highest spike in demand, followed by the USA, South Korea, and Canada.

In [140]:
#Product Preference
query = """
    SELECT
        market,
        category,
        ROUND(SUM((gross_price * sold_quantity) * (1 - pre_invoice_discount_pct)),2) AS gross_sales
    FROM salesdata
    GROUP BY market, category
"""
df_gross_sale_market = sqldf(query)
df_gross_sale_market = df_gross_sale_market.sort_values('gross_sales', ascending=False)
df_gross_sale_market = df_gross_sale_market.pivot(index='market', columns='category', values='gross_sales')

In [141]:
df_gross_sale_market.plot(kind='barh', stacked=True, figsize=(20,20))
plt.tick_params(axis='both', which='major', labelsize=16)
plt.legend(fontsize=20)
plt.xlabel('Gross Sales (100 Million)', fontsize=18)

**Observations**
- Looking at the proportion of hard disk and graphic card revenue by market, I have observed that the hard disk category generates high revenue.
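
To put numbers on the proportions rather than reading them off the stacked bars, each row of the pivoted table can be divided by its row total. The sketch below uses a hand-built pivot with made-up values; the category labels are illustrative.

```python
import pandas as pd

# Hand-built stand-in for the market x category pivot; values are made up.
by_market = pd.DataFrame(
    {"Graphic Card": [20.0, 10.0], "Internal HDD": [80.0, 30.0]},
    index=pd.Index(["India", "USA"], name="market"),
)

# Share of each category within a market: divide every row by its row total.
share = by_market.div(by_market.sum(axis=1), axis=0)
print(share.round(2))
```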

In [19]:
#gross sales by platforms
query = """
SELECT
    platform,
    category,
    ROUND(SUM((gross_price * sold_quantity) * (1 - pre_invoice_discount_pct)),2) AS gross_sales
FROM salesdata
GROUP BY platform, category
"""
df_gross_sale_customerplatform = sqldf(query)
df_gross_sale_customerplatform = df_gross_sale_customerplatform.pivot(index='platform', columns='category', values='gross_sales')
df_gross_sale_customerplatform.plot(kind='bar', stacked=True, figsize=(10,10))
plt.ylabel('Gross Sales (100 Million)', fontsize=14)

**Observations**
- Most of our revenue comes from sales of computer hardware to Brick & Mortar stores, while revenue from E-Commerce platforms is less than half of the Brick & Mortar revenue.

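**Conclusion: Insights From EDA**
1. Demand: demand for all products rose suddenly from September 2021 onward, and the increase appears in most markets (1.6, 1.7).
2. Existing Customer: no new customers were acquired in 2020 and 2021, so the 2021 spike came from existing clients making more purchases (1.5).
3. Spike in revenue: monthly revenue between September and December 2021 was higher than in any previous period (1.2).
4. Pattern: monthly revenue increases between August and December every year (1.2).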