## Objective:
This is the third notebook in the assignment. It covers:
- reading data from the curated database,
- answering the assignment questions, and
- creating dashboards where applicable.

The public Databricks link to this notebook is https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1721482899250574/1371319005645532/2654077590604412/latest.html

In [None]:
%sql
-- Run the following commands to view all the tables present in the curated schema

USE curated;
SHOW TABLES;

database,tableName,isTemporary
curated,completed_customer_orders,False
curated,completed_orders_metrics,False
curated,other_orders_metrics,False
curated,refunded_orders_metrics,False
curated,rfm_metrics,False


#### Visualisations, where applicable, were created with the Databricks built-in visualisation option. Toggle between the table and visualisation views to see the output.

#### Q1) Find the revenue generated by each category for the month of 11/2020.
- To answer this question, we'll consider the *_completed_orders_metrics_* table. The query divides revenue by 1,000,000, so the figures are in millions.

In [None]:
%sql

SELECT Category,
       (sum(Revenue)/1000000)::decimal(10,2) as revenue
FROM curated.completed_orders_metrics
where Month_Year = 'Nov-2020'
group by 1
order by 2 desc;

Category,revenue
Mobiles & Tablets,1.44
Entertainment,0.65
Women's Fashion,0.17
Computing,0.15
Others,0.15
Appliances,0.14
Beauty & Grooming,0.08
Men's Fashion,0.08
Superstore,0.05
Home & Living,0.03


Databricks visualization. Run in Databricks to view.

#### Q2) Which 5 categories had the highest number of refunds in the year 2020?
- To compute this, we will consider the *_refunded_orders_metrics_* table.

In [None]:
%sql

select Category,
       sum(TotalOrders) as OrdersRefunded
from refunded_orders_metrics
where Year = 2020
group by 1
order by 2 desc
limit 5;

Category,OrdersRefunded
Men's Fashion,2947
Mobiles & Tablets,1923
Women's Fashion,1318
Appliances,938
Beauty & Grooming,551


Databricks visualization. Run in Databricks to view.

#### Q3) Find the total number of orders for each category, for each month and year in the dataset.
- We will consider **all three order tables**, i.e., completed, refunded and other orders.
- The function MONTH(TO_DATE(Month_Year, 'MMM-yyyy')) converts the Month_Year values from 'Jan-2021', 'Feb-2021',..., 'Dec-2021' to 1,2,...12. This helps in arranging the output sequentially.
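As a rough illustration of why the conversion matters, the sketch below reproduces the same idea outside Spark with Python's standard library. The helper name `month_number` and the sample labels are ours, not part of the notebook, and `%b` assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

def month_number(month_year: str) -> int:
    """Return the calendar month (1-12) for a label such as 'Nov-2020'."""
    # '%b' parses the abbreviated month name, '%Y' the four-digit year.
    # Assumption: the process runs with English month names (default C locale).
    return datetime.strptime(month_year, "%b-%Y").month

# Hypothetical sample labels in the same format as the Month_Year column.
labels = ["Jan-2021", "Oct-2020", "Dec-2020", "Feb-2021"]

# Sorting the raw strings is alphabetical, which scrambles the calendar order.
alphabetical = sorted(labels)

# Sorting on (year, month number) restores chronological order.
chronological = sorted(labels, key=lambda s: (int(s[-4:]), month_number(s)))

print(alphabetical)
print(chronological)
```

Sorting on the raw `Month_Year` string would place 'Dec-2020' before 'Oct-2020', which is why the queries order by `Year` and the numeric month instead.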

In [None]:
%sql

with all_orders as (
select Category,
       Year,
       MONTH(TO_DATE(Month_Year, 'MMM-yyyy')) AS Month,
       sum(TotalOrders) as TotalOrders
from completed_orders_metrics
group by 1,2,3

union all

select Category,
       Year,
       MONTH(TO_DATE(Month_Year, 'MMM-yyyy')) AS Month,
       sum(TotalOrders) as TotalOrders
from refunded_orders_metrics
group by 1,2,3

union all

select Category,
       Year,
       MONTH(TO_DATE(Month_Year, 'MMM-yyyy')) AS Month,
       sum(TotalOrders) as TotalOrders
from other_orders_metrics
group by 1,2,3
-- No ORDER BY inside the CTE: the final SELECT below orders the combined result
)

select Category,
       Year,
       Month,
       sum(TotalOrders) as TotalOrders
from all_orders
group by 1,2,3
order by 2,3 asc;

Category,Year,Month,TotalOrders
Home & Living,2020,10,323
Superstore,2020,10,226
Books,2020,10,62
Entertainment,2020,10,417
Soghaat,2020,10,316
School & Education,2020,10,53
Mobiles & Tablets,2020,10,1908
Women's Fashion,2020,10,951
Men's Fashion,2020,10,1207
Kids & Baby,2020,10,183


Databricks visualization. Run in Databricks to view.

#### Q4) Find the total spend by customer age segment for each category, as a percentage of that category's total spend.
- To calculate this, we will consider the completed_customer_orders table.


In [None]:
%sql

with category_spends as (
select Category,
       sum(case when Customer_Segment = 'Young'
                then Total
           end )::decimal(10,2) as Young,
        sum(case when Customer_Segment = 'Adults'
                 then Total
            end)::decimal(10,2) as Adults,
        sum(case when Customer_Segment = 'Middle Ages'
                 then Total
            end)::decimal(10,2) as Middle_Ages,
        sum(case when Customer_Segment = 'Old'
                 then Total
            end)::decimal(10,2) as Old,
        sum(Total)::decimal(10,2) as TotalSpend
from curated.completed_customer_orders
group by 1
order by 1 
)

select Category,
       (TotalSpend/1000000)::decimal(10,2) as TotalSpend_in_million_USD,
       ((Young/TotalSpend)*100)::decimal(10,2) as Young,
       ((Adults/TotalSpend)*100)::decimal(10,2) as Adults,
       ((Middle_Ages/TotalSpend)*100)::decimal(10,2) as Middle_Ages,
       ((Old/TotalSpend)*100)::decimal(10,2) as Old
from category_spends
order by 1

Category,TotalSpend_in_million_USD,Young,Adults,Middle_Ages,Old
Appliances,14.3,4.35,27.26,33.96,34.43
Beauty & Grooming,1.48,4.31,26.26,33.87,35.57
Books,0.01,8.88,24.87,31.3,34.95
Computing,2.93,4.84,30.46,39.91,24.79
Entertainment,12.8,4.18,26.91,35.4,33.51
Health & Sports,0.59,3.2,17.76,41.49,37.54
Home & Living,1.03,6.18,24.12,36.18,33.52
Kids & Baby,0.5,9.44,23.85,29.64,37.07
Men's Fashion,2.35,5.37,25.67,35.1,33.86
Mobiles & Tablets,38.62,5.35,25.17,34.97,34.51


Databricks visualization. Run in Databricks to view.

#### Q5) Find the spend by gender for each category, as a percentage of that category's total spend.
- To calculate this, we will consider the completed_customer_orders table.

In [None]:
%sql

with gender_spend as (
select Category,
       sum(case when Gender = 'M'
                then Total
           end) as Male,
       sum(case when Gender = 'F'
                then Total
           end) as Female,
       Sum(Total) as TotalSpend
from curated.completed_customer_orders
group by 1
order by 1
)

select Category,
       (TotalSpend/1000000)::decimal(10,2) as TotalSpend_in_million_USD,
       ((Male/TotalSpend)*100)::decimal(10,2) as Male_Percentage,
       ((Female/TotalSpend)*100)::decimal(10,2) as Female_Percentage
from gender_spend
order by 1

Category,TotalSpend_in_million_USD,Male_Percentage,Female_Percentage
Appliances,14.3,50.73,49.27
Beauty & Grooming,1.48,51.53,48.47
Books,0.01,56.27,43.73
Computing,2.93,55.09,44.91
Entertainment,12.8,50.0,50.0
Health & Sports,0.59,60.5,39.5
Home & Living,1.03,44.77,55.23
Kids & Baby,0.5,52.69,47.31
Men's Fashion,2.35,53.16,46.84
Mobiles & Tablets,38.62,48.79,51.21


Databricks visualization. Run in Databricks to view.

#### Q6) Find the top 5 customers by total spend for each month.
- To calculate this, we will consider only the **completed orders**
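The query for this question ranks customers within each month using `DENSE_RANK()`. The sketch below mimics that logic in plain Python on a few made-up rows, to show how ties are handled; the function name `top_n_dense` and the sample data are hypothetical and only illustrate the technique.

```python
from collections import defaultdict

def top_n_dense(spend_rows, n=5):
    """Dense-rank customers by total spend within each month.

    spend_rows: iterable of (month, cust_id, total) line items.
    Returns {month: [(cust_id, total_spend, rank), ...]} with ranks <= n.
    """
    # Step 1: total spend per (month, customer), like SUM(Total) ... GROUP BY.
    totals = defaultdict(float)
    for month, cust_id, total in spend_rows:
        totals[(month, cust_id)] += total

    # Step 2: collect the customers of each month (the PARTITION BY).
    by_month = defaultdict(list)
    for (month, cust_id), spend in totals.items():
        by_month[month].append((cust_id, spend))

    # Step 3: dense rank, where equal spends share a rank and no rank is skipped.
    result = {}
    for month, customers in by_month.items():
        customers.sort(key=lambda c: c[1], reverse=True)
        ranked, rank, previous = [], 0, None
        for cust_id, spend in customers:
            if spend != previous:
                rank += 1
                previous = spend
            if rank > n:
                break
            ranked.append((cust_id, spend, rank))
        result[month] = ranked
    return result

# Made-up line items: customers A and B tie at 150.0 in October.
rows = [
    ("2020-10", "A", 100.0), ("2020-10", "A", 50.0),
    ("2020-10", "B", 150.0), ("2020-10", "C", 90.0),
    ("2020-11", "A", 10.0),
]
top_two = top_n_dense(rows, n=2)
print(top_two)
```

Because tied customers share a rank, filtering on `rank <= 5` can return more than five rows for a month when ties occur.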

In [None]:
%sql

with top_five_month_wise as (
SELECT Year,
       MONTH(TO_DATE(Month_Year, 'MMM-yyyy')) AS Month,
       CustID,
       TotalSpend::decimal(10,2) as TotalSpend,
       rank
FROM (
    SELECT Year,
           Month_Year,
           CustID,
           SUM(Total) AS TotalSpend,
           DENSE_RANK() OVER (PARTITION BY Year,Month_Year ORDER BY SUM(Total) DESC) AS rank
    FROM curated.completed_customer_orders
    GROUP BY 1,2,3
) ranked_data
WHERE rank <= 5
ORDER BY 1, 2, TotalSpend DESC
),

customer_details as (
select CustID,
       FullName as Customer_Name,
       County,
       Gender
from curated.completed_customer_orders
group by 1,2,3,4
)

select Year,
       Month,
       top_five_month_wise.CustID,
       Customer_Name,
       County,
       Gender,
       TotalSpend,
       rank
from top_five_month_wise
inner join customer_details on customer_details.CustID = top_five_month_wise.CustID
order by 1,2, rank asc

Year,Month,CustID,Customer_Name,County,Gender,TotalSpend,rank
2020,10,60503,"Herrin, Fidela",Nemaha,F,22377.6,1
2020,10,61165,"Wales, Humberto",Wyoming,M,19362.0,2
2020,10,56216,"Budde, Lyndon",Claiborne,M,15929.9,3
2020,10,60727,"Seltzer, Shaunda",Hampden,F,15065.9,4
2020,10,51717,"Vanzandt, Yi",Chittenden,F,14947.5,5
2020,11,49127,"Fitzsimmons, Grace",Boyle,F,25639.92,1
2020,11,114,"Grado, Hattie",Surry,F,21260.3,2
2020,11,2478,"Leclair, Norberto",Maricopa,M,18387.04,3
2020,11,44766,"Ugarte, Shirl",Stark,F,15481.24,4
2020,11,62383,"Bencomo, Zachary",Brazoria,M,13968.5,5


#### Q7) Calculate the RFM (Recency, Frequency, Monetary) values for each customer (by customer id).
- To calculate this, we will consider the **rfm_metrics** table that's built off the *_completed_orders_* table.
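The notebook that builds `rfm_metrics` is not shown here, so the sketch below uses the conventional RFM definitions as an assumption: recency is the number of days between a reference date and the customer's latest order, frequency is the number of distinct orders, and monetary is the total spend. The function name `rfm`, the reference date and the sample rows are all hypothetical.

```python
from datetime import date

def rfm(orders, reference_date):
    """Compute (recency_days, frequency, monetary) per customer.

    orders: iterable of (cust_id, order_id, order_date, total) line items.
    Assumed definitions (not confirmed by the notebook):
      recency   = days from the latest order to reference_date
      frequency = number of distinct order ids
      monetary  = sum of line-item totals
    """
    last_order, order_ids, spend = {}, {}, {}
    for cust_id, order_id, order_date, total in orders:
        if cust_id not in last_order or order_date > last_order[cust_id]:
            last_order[cust_id] = order_date
        order_ids.setdefault(cust_id, set()).add(order_id)
        spend[cust_id] = spend.get(cust_id, 0.0) + total
    return {
        c: ((reference_date - last_order[c]).days,
            len(order_ids[c]),
            round(spend[c], 2))
        for c in last_order
    }

# Made-up line items: order 101 has two items, so it counts once in frequency.
sample_orders = [
    (1, 101, date(2021, 9, 1), 50.0),
    (1, 101, date(2021, 9, 1), 25.0),
    (1, 102, date(2021, 9, 20), 10.0),
    (2, 201, date(2021, 8, 1), 70.0),
]
metrics = rfm(sample_orders, reference_date=date(2021, 9, 30))
print(metrics)
```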

In [None]:
%sql

select CustID,
       FullName as Customer_Name,
       Recency,
       Frequency,
       Monetary::decimal(10,2) as Monetary
from curated.rfm_metrics
order by 1;

CustID,Customer_Name,Recency,Frequency,Monetary
4,"Doughty, Reggie",5,18,21635.95
15,"Diebold, Debbie",38,3,216.8
20,"Pulver, Eddy",5,7,23702.4
21,"Kan, Adam",39,1,105.0
23,"Bostwick, Roscoe",13,2,393.24
28,"Drain, Reinaldo",55,1,70.0
32,"Horne, Reginald",7,97,47835.19
33,"Rapoza, Darnell",11,49,32907.73
41,"Batty, Angelo",50,1,219.9
44,"Ro, Kendall",21,3,4143.38


#### Q8) Find the top 10 customers based on frequency and monetary value. Sort them first by frequency, then by monetary value.

In [None]:
%sql

select CustID,
       FullName as Customer_Name,
       Recency,
       Frequency,
       Monetary::decimal(10,2) as Monetary
from curated.rfm_metrics
order by Frequency desc, Monetary desc
limit 10

CustID,Customer_Name,Recency,Frequency,Monetary
85775,"Gonzalez, Joel",5,770,75702.0
44619,"Divito, Sherlyn",7,149,248177.9
5769,"Summerlin, Joel",21,132,125615.23
30465,"Slattery, Tressie",21,112,137561.15
800,"Jesse, Alfonso",7,105,50630.12
32,"Horne, Reginald",7,97,47835.19
56,"Oswald, Rene",13,92,71619.25
48199,"Klar, Irving",6,90,79178.02
14625,"Pitre, Shayla",6,87,85478.63
44445,"Flanigan, Yong",21,87,72097.83


#### Q9) Additional Insights

#### a) Find the number of completed orders placed in each State.

In [None]:
%sql

select State,
       Count(distinct OrderID) as unique_orders
from curated.completed_customer_orders
group by 1
order by 2 desc

State,unique_orders
CA,6215
TX,5740
NY,4998
PA,4738
IL,3987
FL,3507
OH,3260
MO,2884
VA,2852
IA,2566


Databricks visualization. Run in Databricks to view.

#### b) Find out if there is a correlation between customer tenure and order frequency.
- We will consider 31st Oct 2021 as the reference date to calculate the number of years a user has been a customer.
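The query expresses tenure as `int(months_between('2021-10-31', CustomerSince)/12)`, i.e. the number of whole years completed by the reference date. A minimal Python sketch of the same idea follows; `tenure_years` is our own helper name, and it is meant as an approximation of the Spark expression rather than an exact reimplementation of `months_between`.

```python
from datetime import date

REFERENCE_DATE = date(2021, 10, 31)  # reference date stated above

def tenure_years(customer_since, reference=REFERENCE_DATE):
    """Whole years completed between customer_since and the reference date."""
    years = reference.year - customer_since.year
    # Subtract one year if this year's anniversary falls after the reference date.
    if (reference.month, reference.day) < (customer_since.month, customer_since.day):
        years -= 1
    return years

# 1995-10-19 is the CustomerSince value of the sample row shown in section c).
print(tenure_years(date(1995, 10, 19)))
# A hypothetical customer who joined one day after the 2017 anniversary date.
print(tenure_years(date(2017, 11, 1)))
```

Note that the query then sums orders per tenure bucket, so the output shows how order volume is distributed across tenure rather than a correlation coefficient.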

In [None]:
%sql

select years_since_being_customer,
       sum(TotalOrders) as TotalOrders
from
(select CustID,
        int(months_between('2021-10-31',CustomerSince)/12) as years_since_being_customer,
        count(distinct OrderID) as TotalOrders
from curated.completed_customer_orders
group by 1,2
order by 1)
group by 1
order by 1

years_since_being_customer,TotalOrders
4,8234
5,8091
6,7436
7,5572
8,5310
9,4719
10,4004
11,4073
12,3635
13,3758


Databricks visualization. Run in Databricks to view.

#### c) Payment preferences of customer segments

In [None]:
%sql

-- Preview one row to inspect the available columns, including Payment_Method
select *
from curated.completed_customer_orders
limit 1

CustID,OrderID,Order_Date,Status,ItemID,Quantity_Ordered,Price,Value,Discount_Amount,Total,Category,Payment_Method,Month_Year,Year,City,County,CustomerSince,Email,Gender,PlaceName,Region,State,Zip,Age,FullName,Customer_Segment
82529,100433524,2021-01-08,complete,709387,2,109.9,109.9,0.0,109.9,Beauty & Grooming,cod,Jan-2021,2021,Malta,DeKalb,1995-10-19,annie.showman@hotmail.com,F,Malta,Midwest,IL,60150,30,"Showman, Annie",Adults


We group the payment methods into four categories using the following logic:

Digital Wallets:
- jazzwallet
- Easypay_MA
- Easypay
- apg
- Payaxis
- mcblite

Voucher/Credit:
- easypay_voucher
- customercredit
- jazzvoucher

Cash Handling/Delivery:
- cashatdoorstep
- cod

Bank Transactions (the query matches any payment method containing "bank"):
- bankalfalah
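The grouping above can be sketched as a simple lookup. The snippet below is an illustration in Python rather than the notebook's SQL; the names `PAYMENT_GROUPS` and `payment_group` and the "Uncategorised" fallback label are our own assumptions. Matching is exact and case-sensitive, which mirrors the `IN (...)` lists but not the `ilike '%bank%'` pattern used for the bank category in the query.

```python
# Payment methods per group, copied from the lists above.
PAYMENT_GROUPS = {
    "Digital Wallets": {"jazzwallet", "Easypay_MA", "Easypay", "apg", "Payaxis", "mcblite"},
    "Voucher/Credit": {"easypay_voucher", "customercredit", "jazzvoucher"},
    "Cash Handling/Delivery": {"cashatdoorstep", "cod"},
    "Bank Transactions": {"bankalfalah"},
}

def payment_group(method: str) -> str:
    """Return the group a payment method belongs to (exact, case-sensitive match)."""
    for group, methods in PAYMENT_GROUPS.items():
        if method in methods:
            return group
    # Hypothetical fallback label for methods that appear in none of the lists.
    return "Uncategorised"

print(payment_group("cod"))
print(payment_group("Easypay"))
print(payment_group("easypay"))  # differs in case, so it is not matched
```

Any method that falls outside the four lists is counted in `TotalOrders` but in none of the four columns, so the percentages for a segment would then sum to less than 100.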


In [None]:
%sql


with payment_methods as (
select Customer_Segment,
       count(distinct case when Payment_Method in ('jazzwallet','Easypay_MA','Easypay','apg','Payaxis','mcblite')
                           then OrderID
                      end) as Digital_Wallet,
      count(distinct case when Payment_Method in ('easypay_voucher','customercredit','jazzvoucher')
                          then OrderID
                      end) as Voucher,
      count(distinct case when Payment_Method in ('cod','cashatdoorstep')
                          then OrderID
                      end) as CoD,
      count(distinct case when Payment_Method ilike '%bank%'
                          then OrderID
                      end) as Bank,
      count(distinct OrderID) as TotalOrders
from curated.completed_customer_orders
group by 1
order by 1
)

select  Customer_Segment,
       ((Digital_Wallet/TotalOrders)*100)::decimal(10,2) as DigitalWallet,
       ((Voucher/TotalOrders)*100)::decimal(10,2) as Voucher,
       ((CoD/TotalOrders)*100)::decimal(10,2) as CoD,
       ((Bank/TotalOrders)*100)::decimal(10,2) as Bank
from payment_methods
order by 1

Customer_Segment,DigitalWallet,Voucher,CoD,Bank
Adults,32.76,21.76,41.39,4.1
Middle Ages,32.4,21.49,41.82,4.29
Old,32.99,21.65,41.29,4.06
Young,30.46,24.23,41.64,3.67


Databricks visualization. Run in Databricks to view.