## DATA MART SQL CHALLENGE
This is a SQL case study created by [Danny Ma](https://www.linkedin.com/in/datawithdanny/) to help build foundational SQL knowledge while testing and developing logical problem-solving skills.
For this case study, I created the database with pgAdmin and then connected to it from a Jupyter Notebook.

This and Danny's other case studies are available at [8 Week SQL Challenge](https://8weeksqlchallenge.com/).

Before getting started, I would also recommend installing `ipython-sql`. This lets you use the Jupyter magic commands `%sql` and `%%sql` to interact with your relational database.

#### Importing Libraries

In [1]:
import sqlalchemy  # builds the PostgreSQL engine used by the %sql magic

#### Create a PostgreSQL engine to connect to the database

In [2]:
engine = sqlalchemy.create_engine('postgresql://postgres:password@localhost:5432/data_mart')

#### Load the sql extension

In [3]:
%load_ext sql

#### Set up the connection

In [4]:
%sql $engine.url
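
The connection string above hard-codes the password. A minimal sketch of one alternative, assuming the credentials live in environment variables (the names `PGUSER` and `PGPASSWORD` and the helper `build_postgres_url` are my own choices for illustration, not part of the original notebook):

```python
import os
from urllib.parse import quote


def build_postgres_url(user, password, host="localhost", port=5432, database="data_mart"):
    """Return a postgresql:// URL with user and password percent-encoded."""
    # safe="" also encodes "/" and "@", which would otherwise break the URL
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Fall back to the values used in this notebook when the variables are not set.
url = build_postgres_url(
    user=os.environ.get("PGUSER", "postgres"),
    password=os.environ.get("PGPASSWORD", "password"),
)
print(url)
```

The resulting string can be passed to `sqlalchemy.create_engine` in place of the literal URL.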

#### 1. Data Cleansing Steps

In [5]:
%%sql
DROP VIEW IF EXISTS cleand_sales;
CREATE VIEW cleand_sales AS 
WITH clean_sales AS (
SELECT week_date, region, platform, segment
, customer_type, transactions, sales
, TO_DATE(week_date, 'DD/MM/YY') date
, CAST(DATE_PART('week',TO_DATE(week_date, 'DD/MM/YY'))as INT) week_number
, CAST(DATE_PART('month',TO_DATE(week_date, 'DD/MM/YY'))as INT) month_number
, CAST(DATE_PART('year',TO_DATE(week_date, 'DD/MM/YY'))as INT) calendar_year
, CASE WHEN RIGHT(segment, 1) = '1' THEN 'Young Adults'
	   WHEN RIGHT(segment, 1) = '2' THEN 'Middle Aged'
	   WHEN RIGHT(segment, 1) in ('3','4') THEN 'Retirees'
	   ELSE 'Unknown'
END age_band
, CASE WHEN LEFT(segment, 1) = 'C' THEN 'Couples'
	 WHEN LEFT(segment, 1) = 'F' THEN 'Families'
	 ELSE 'Unknown'
END demographic
, ROUND(sales::NUMERIC / transactions, 2) avg_transaction
FROM weekly_sales
)
SELECT date, week_number , month_number, calendar_year
, region, platform, segment, age_band, customer_type
, transactions, sales, demographic, avg_transaction
FROM clean_sales;

 * postgresql://postgres:***@localhost:5432/data_mart
Done.
Done.


[]

In [6]:
%%sql
SELECT *
FROM cleand_sales
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/data_mart
10 rows affected.


date,week_number,month_number,calendar_year,region,platform,segment,age_band,customer_type,transactions,sales,demographic,avg_transaction
2020-08-31,36,8,2020,ASIA,Retail,C3,Retirees,New,120631,3656163,Couples,30.31
2020-08-31,36,8,2020,ASIA,Retail,F1,Young Adults,New,31574,996575,Families,31.56
2020-08-31,36,8,2020,USA,Retail,,Unknown,Guest,529151,16509610,Unknown,31.20
2020-08-31,36,8,2020,EUROPE,Retail,C1,Young Adults,New,4517,141942,Couples,31.42
2020-08-31,36,8,2020,AFRICA,Retail,C2,Middle Aged,New,58046,1758388,Couples,30.29
2020-08-31,36,8,2020,CANADA,Shopify,F2,Middle Aged,Existing,1336,243878,Families,182.54
2020-08-31,36,8,2020,AFRICA,Shopify,F3,Retirees,Existing,2514,519502,Families,206.64
2020-08-31,36,8,2020,ASIA,Shopify,F1,Young Adults,Existing,2158,371417,Families,172.11
2020-08-31,36,8,2020,AFRICA,Shopify,F2,Middle Aged,New,318,49557,Families,155.84
2020-08-31,36,8,2020,AFRICA,Retail,C3,Retirees,New,111032,3888162,Couples,35.02
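
To make the cleansing rules easier to follow, here is a rough Python sketch that applies the same logic to a single row. The function name `clean_row` and the raw input `'31/08/20'` are assumptions for illustration; the sketch relies on `DATE_PART('week', ...)` returning the ISO 8601 week, which `isocalendar()` mirrors. Note that dividing two integer columns in PostgreSQL truncates the result, so a numeric cast (or `Decimal` here) is needed to keep the two decimal places.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

AGE_BANDS = {"1": "Young Adults", "2": "Middle Aged", "3": "Retirees", "4": "Retirees"}
DEMOGRAPHICS = {"C": "Couples", "F": "Families"}


def clean_row(week_date, segment, transactions, sales):
    """Derive the extra columns of the cleaned view for one raw row."""
    d = datetime.strptime(week_date, "%d/%m/%y").date()
    segment = segment or ""
    avg = (Decimal(sales) / Decimal(transactions)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return {
        "date": d,
        "week_number": d.isocalendar()[1],  # ISO week, like DATE_PART('week', ...)
        "month_number": d.month,
        "calendar_year": d.year,
        "age_band": AGE_BANDS.get(segment[-1:], "Unknown"),   # RIGHT(segment, 1)
        "demographic": DEMOGRAPHICS.get(segment[:1], "Unknown"),  # LEFT(segment, 1)
        "avg_transaction": avg,
    }


row = clean_row("31/08/20", "C3", 120631, 3656163)
print(row)
print(3656163 // 120631)  # integer division drops the decimals
```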


#### 2. Data Exploration

##### 1. What day of the week is used for each week_date value?

In [7]:
%%sql
SELECT DISTINCT(TO_CHAR(date, 'Day')) as Days
FROM cleand_sales;

 * postgresql://postgres:***@localhost:5432/data_mart
1 rows affected.


days
Monday
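
As a quick cross-check outside the database, the same answer can be reproduced in Python for two dates that appear in this notebook. This is only a sketch; `day_name` is a helper I made up. One thing to keep in mind is that `TO_CHAR(date, 'Day')` blank-pads the name to 9 characters, so comparisons against `'Monday'` in SQL may need `FMDay` or `TRIM`.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def day_name(d):
    """Return the English day name; date.weekday() counts from Monday = 0."""
    return DAY_NAMES[d.weekday()]


print(day_name(date(2020, 8, 31)))
print(day_name(date(2020, 6, 15)))
print(repr("Monday".ljust(9)))  # mimics the padding that the 'Day' template applies
```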


##### 2. What range of week numbers are missing from the dataset?

In [8]:
%%sql
WITH RECURSIVE numbers AS (
    SELECT 1 AS weeks
    UNION
    SELECT weeks + 1
    FROM numbers
    WHERE weeks < 52
)
SELECT STRING_AGG(weeks::text, ', ' ORDER BY weeks) missing_weeks
FROM numbers
WHERE weeks NOT IN (SELECT DISTINCT week_number FROM cleand_sales);

 * postgresql://postgres:***@localhost:5432/data_mart
1 rows affected.


missing_weeks
"1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52"

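The same idea (generate 1 to 52, then remove the weeks that are present) can be sketched in Python. Here I assume the weeks present are 13 through 36, which is what the output above implies; `missing_weeks` and `to_ranges` are illustrative helper names. Collapsing the list into ranges answers the question a little more directly.

```python
def missing_weeks(present, first=1, last=52):
    """Return the week numbers in [first, last] that are not in `present`."""
    present = set(present)
    return [w for w in range(first, last + 1) if w not in present]


def to_ranges(numbers):
    """Collapse a sorted list of integers into (start, end) ranges."""
    ranges = []
    for n in numbers:
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n  # extend the current run
        else:
            ranges.append([n, n])  # start a new run
    return [(start, end) for start, end in ranges]


present_weeks = range(13, 37)  # assumption derived from the query output
missing = missing_weeks(present_weeks)
print(to_ranges(missing))
```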

##### 3. How many total transactions were there for each year in the dataset?

In [9]:
%%sql
SELECT CAST(calendar_year AS INT), SUM(transactions) total_trans
FROM cleand_sales
GROUP BY calendar_year
ORDER BY calendar_year DESC;

 * postgresql://postgres:***@localhost:5432/data_mart
3 rows affected.


calendar_year,total_trans
2020,375813651
2019,365639285
2018,346406460


##### 4. What is the total sales for each region for each month?

In [10]:
%%sql
SELECT region, TO_CHAR(date, 'Month') Months, SUM(sales) total_sales
FROM cleand_sales
GROUP BY region, TO_CHAR(date, 'Month')
ORDER BY region, total_sales DESC;

 * postgresql://postgres:***@localhost:5432/data_mart
49 rows affected.


region,months,total_sales
AFRICA,July,1960219710
AFRICA,April,1911783504
AFRICA,August,1809596890
AFRICA,June,1767559760
AFRICA,May,1647244738
AFRICA,March,567767480
AFRICA,September,276320987
ASIA,April,1804628707
ASIA,July,1768844756
ASIA,August,1663320609


##### 5. What is the total count of transactions for each platform?

In [11]:
%%sql
SELECT platform, SUM(transactions) total_trans
FROM cleand_sales
GROUP BY platform
ORDER BY total_trans DESC;

 * postgresql://postgres:***@localhost:5432/data_mart
2 rows affected.


platform,total_trans
Retail,1081934227
Shopify,5925169


##### 6. What is the percentage of sales for Retail vs Shopify for each month?

In [12]:
%%sql
SELECT months
, ROUND((retail*1.0/(retail+shopify)*100),1) retail_perc
, ROUND((shopify*1.0/(retail+shopify)*100),1) shopify_perc
FROM
(
SELECT TO_CHAR(date, 'Month') months, month_number
, SUM(CASE WHEN platform = 'Retail' THEN sales ELSE 0 END) retail
, SUM(CASE WHEN platform = 'Shopify' THEN sales ELSE 0 END) shopify
FROM cleand_sales
GROUP BY TO_CHAR(date, 'Month'), month_number
) calc
ORDER BY month_number ASC;

 * postgresql://postgres:***@localhost:5432/data_mart
7 rows affected.


months,retail_perc,shopify_perc
March,97.5,2.5
April,97.6,2.4
May,97.3,2.7
June,97.3,2.7
July,97.3,2.7
August,97.1,2.9
September,97.4,2.6
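
The percentage logic is simply each platform's total divided by the combined total. A small sketch with made-up totals (the numbers below are hypothetical, not taken from the dataset, and `percentage_shares` is an illustrative name):

```python
def percentage_shares(totals, digits=1):
    """Return each category's share of the grand total as a rounded percentage."""
    grand_total = sum(totals.values())
    return {name: round(value * 100 / grand_total, digits) for name, value in totals.items()}


# Hypothetical monthly totals, chosen only to keep the arithmetic easy to follow.
example = {"Retail": 975, "Shopify": 25}
print(percentage_shares(example))
```

The same helper works for any number of categories, for example the three demographic groups in the next question.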


##### 7. What is the percentage of sales by demographic for each year in the dataset?

In [13]:
%%sql
SELECT calendar_year
, ROUND((couples*1.0/(couples+families+unknown)*100),1) couples_perc
, ROUND((families*1.0/(couples+families+unknown)*100),1) families_perc
, ROUND((unknown*1.0/(couples+families+unknown)*100),1) unknown_perc
FROM
(
SELECT CAST(calendar_year AS INT)
, SUM(CASE WHEN demographic = 'Couples' THEN sales ELSE 0 END) couples
, SUM(CASE WHEN demographic = 'Families' THEN sales ELSE 0 END) families
, SUM(CASE WHEN demographic = 'Unknown' THEN sales ELSE 0 END) unknown
FROM cleand_sales
GROUP BY calendar_year
) calc
ORDER BY calendar_year DESC;

 * postgresql://postgres:***@localhost:5432/data_mart
3 rows affected.


calendar_year,couples_perc,families_perc,unknown_perc
2020,28.7,32.7,38.6
2019,27.3,32.5,40.3
2018,26.4,32.0,41.6


##### 8. Which age_band and demographic values contribute the most to Retail sales?

In [14]:
%%sql
SELECT age_band, demographic
, SUM(CASE WHEN platform ='Retail' THEN sales ELSE 0 END)  retail
FROM cleand_sales
WHERE age_band !='Unknown'
GROUP BY age_band, demographic
ORDER BY retail DESC;

 * postgresql://postgres:***@localhost:5432/data_mart
6 rows affected.


age_band,demographic,retail
Retirees,Families,6634686916
Retirees,Couples,6370580014
Middle Aged,Families,4354091554
Young Adults,Couples,2602922797
Middle Aged,Couples,1854160330
Young Adults,Families,1770889293


##### 9. Can we use the avg_transaction column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead?

In [15]:
%%sql
SELECT calendar_year, ROUND(AVG(retail),1) Avg_retail, ROUND(AVG(shopify),1) Avg_shopify
FROM
(
SELECT CAST(calendar_year AS INT)
, SUM(CASE WHEN platform ='Retail' THEN transactions ELSE 0 END)   retail
, SUM(CASE WHEN platform ='Shopify' THEN transactions ELSE 0 END)  shopify
FROM cleand_sales
GROUP BY calendar_year
ORDER BY CAST(calendar_year AS INT) DESC
) calc
GROUP BY calendar_year
ORDER BY Avg_retail DESC, Avg_shopify DESC;

 * postgresql://postgres:***@localhost:5432/data_mart
3 rows affected.


calendar_year,avg_retail,avg_shopify
2020,373274555.0,2539096.0
2019,363740159.0,1899126.0
2018,344919513.0,1486947.0
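
As far as I can tell, the query above sums `transactions`, so the two columns read as yearly transaction counts rather than an average transaction size. The `avg_transaction` column is not a good basis either, because averaging per-row ratios weights every row equally regardless of its volume. One way to get the average transaction size would be total sales divided by total transactions per year and platform. The sketch below illustrates the difference using six rows copied from the `LIMIT 10` sample earlier in this notebook; the function names are my own.

```python
from collections import defaultdict

# (calendar_year, platform, transactions, sales) from the LIMIT 10 sample above
SAMPLE = [
    (2020, "Retail", 120631, 3656163),
    (2020, "Retail", 31574, 996575),
    (2020, "Shopify", 1336, 243878),
    (2020, "Shopify", 2514, 519502),
    (2020, "Shopify", 2158, 371417),
    (2020, "Shopify", 318, 49557),
]


def avg_transaction_size(rows):
    """SUM(sales) / SUM(transactions) per (year, platform)."""
    totals = defaultdict(lambda: [0, 0])
    for year, platform, transactions, sales in rows:
        totals[(year, platform)][0] += sales
        totals[(year, platform)][1] += transactions
    return {key: round(s / t, 2) for key, (s, t) in totals.items()}


def mean_of_row_averages(rows):
    """AVG(sales / transactions) per (year, platform), shown for comparison."""
    ratios = defaultdict(list)
    for year, platform, transactions, sales in rows:
        ratios[(year, platform)].append(sales / transactions)
    return {key: round(sum(v) / len(v), 2) for key, v in ratios.items()}


print(avg_transaction_size(SAMPLE))
print(mean_of_row_averages(SAMPLE))
```

In SQL this would correspond to something like `SUM(sales)::NUMERIC / SUM(transactions)` grouped by `calendar_year` and `platform`; I have not re-run that against the database here.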


#### 3. Before & After Analysis

##### 1. What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?

In [16]:
%%sql
WITH sales AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2020-06-15' AND date >=date '2020-06-15' - integer '28'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2020-06-15' AND date <= date '2020-06-15' + integer '28'
)
, calc AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales 
)

SELECT sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc
WHERE 1=1
AND rnk=1;

 * postgresql://postgres:***@localhost:5432/data_mart
1 rows affected.


sales_b4,sales_aft,perc_growth
2345878357,2904930571,23.83
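
One boundary detail may be worth checking in the query above. Every `week_date` falls on a Monday (see Data Exploration, question 1) and 2020-06-15 is itself a Monday, so the condition `date <= date '2020-06-15' + integer '28'` also admits 2020-07-13. That would give the "after" window five weekly dates against four in the "before" window. The sketch below only counts week-start dates and does not touch the database; `week_starts` is a hypothetical helper.

```python
from datetime import date, timedelta


def week_starts(start, end, inclusive_end):
    """List weekly dates d with start <= d < end (or d <= end if inclusive_end)."""
    days = []
    d = start
    while d < end or (inclusive_end and d == end):
        days.append(d)
        d += timedelta(weeks=1)
    return days


event = date(2020, 6, 15)
window = timedelta(days=28)

before = week_starts(event - window, event, inclusive_end=False)
after_inclusive = week_starts(event, event + window, inclusive_end=True)   # date <= event + 28
after_half_open = week_starts(event, event + window, inclusive_end=False)  # date <  event + 28

print(len(before), len(after_inclusive), len(after_half_open))
```

If a like-for-like comparison is the goal, using `<` instead of `<=` on the upper bound would be one way to keep both windows at four weeks.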


##### 2. What about the entire 12 weeks before and after?

In [17]:
%%sql
WITH sales AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2020-06-15' AND date >=date '2020-06-15' - integer '84'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2020-06-15' AND date <= date '2020-06-15' + integer '84'
)
, calc AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales 
)

SELECT sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc
WHERE 1=1
AND rnk=1;

 * postgresql://postgres:***@localhost:5432/data_mart
1 rows affected.


sales_b4,sales_aft,perc_growth
6973947753,7126273147,2.18
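
The query pairs the two totals with `LEAD() OVER ()` on the result of a `UNION`. PostgreSQL does not guarantee the row order of a `UNION` without an `ORDER BY`, so which total lands in `sales_b4` can depend on the execution plan. The sketch below shows how sensitive the percentage is to that pairing, using the two totals printed above, and how an explicit label removes the dependence on row order. The function names are illustrative.

```python
def pct_growth(before, after):
    """Percentage change from `before` to `after`, rounded to 2 decimals."""
    return round((after / before - 1) * 100, 2)


total_a = 6973947753
total_b = 7126273147

# The same two totals give different answers depending on which one is "before".
print(pct_growth(total_a, total_b))
print(pct_growth(total_b, total_a))


def growth_from_labelled(rows):
    """rows: iterable of (label, total) pairs with labels 'before' and 'after'."""
    totals = dict(rows)
    return pct_growth(totals["before"], totals["after"])


# With labels (taken here as printed in the output), arrival order no longer matters.
rows = [("after", total_b), ("before", total_a)]
print(growth_from_labelled(rows) == growth_from_labelled(reversed(rows)))
```

In SQL, selecting a literal such as `'before' AS period` in each branch and pivoting on it would play the same role.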


##### 3. How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?

In [18]:
%%sql
WITH sales AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2020-06-15' AND date >=date '2020-06-15' - integer '28'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2020-06-15' AND date <= date '2020-06-15' + integer '28'
)
, calc AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales 
)

, sales1 AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2020-06-15' AND date >=date '2020-06-15' - integer '84'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2020-06-15' AND date <= date '2020-06-15' + integer '84'
)
, calc1 AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales1 
)

, sales2 AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2019-06-15' AND date >=date '2019-06-15' - integer '28'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2019-06-15' AND date <= date '2019-06-15' + integer '28'
)
, calc2 AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales2 
)

, sales3 AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2019-06-15' AND date >=date '2019-06-15' - integer '84'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2019-06-15' AND date <= date '2019-06-15' + integer '84'
)
, calc3 AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales3
)

, sales4 AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2018-06-15' AND date >=date '2018-06-15' - integer '28'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2018-06-15' AND date <= date '2018-06-15' + integer '28'
)
, calc4 AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales4 
)

, sales5 AS (
SELECT SUM(sales) sales_b4
FROM cleand_sales
WHERE 1=1
AND date < '2018-06-15' AND date >=date '2018-06-15' - integer '84'

UNION

SELECT SUM(sales) sales
FROM cleand_sales
WHERE 1=1
AND date >= '2018-06-15' AND date <= date '2018-06-15' + integer '84'
)
, calc5 AS (
SELECT sales_b4, LEAD(sales_b4,1) OVER () sales_aft
, ROW_NUMBER() OVER () rnk
FROM sales5
)

SELECT 2020 AS year, 4 AS weeks, sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc
WHERE 1=1
AND rnk=1

UNION ALL

SELECT 2020 AS year, 12 AS weeks, sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc1
WHERE 1=1
AND rnk=1

UNION ALL

SELECT 2019 AS year, 4 AS weeks, sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc2
WHERE 1=1
AND rnk=1

UNION ALL

SELECT 2019 AS year, 12 AS weeks, sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc3
WHERE 1=1
AND rnk=1

UNION ALL

SELECT 2018 AS year, 4 AS weeks, sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc4
WHERE 1=1
AND rnk=1

UNION ALL

SELECT 2018 AS year, 12 AS weeks, sales_b4, sales_aft, ROUND(((sales_aft*1.0/sales_b4)-1)*100,2)  perc_growth
FROM calc5
WHERE 1=1
AND rnk=1;

 * postgresql://postgres:***@localhost:5432/data_mart
6 rows affected.


year,weeks,sales_b4,sales_aft,perc_growth
2020,4,2345878357,2904930571,23.83
2020,12,6973947753,7126273147,2.18
2019,4,2249989796,2252326390,0.1
2019,12,6862646103,6883386397,0.3
2018,4,2125140809,2129242914,0.19
2018,12,6396562317,6500818510,1.63


- Across the three years, the output above shows `2020` with the largest growth in sales: `23.83%` over the 4-week period and `2.18%` over the 12-week period.
- `2019` shows the weakest performance, with `0.10%` and `0.30%` growth over the 4- and 12-week periods respectively.
- `2018` performed marginally better than `2019`, with `0.19%` and `1.63%` growth over the 4- and 12-week periods respectively.
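
The questions in this section also ask for the change in actual values, which the queries do not return. As a rough sketch, the absolute difference and the percentage can both be recomputed from the totals printed in the table above (the tuple layout and the name `summarise` are my own):

```python
# (year, weeks, sales_b4, sales_aft) copied from the query output above
RESULTS = [
    (2020, 4, 2345878357, 2904930571),
    (2020, 12, 6973947753, 7126273147),
    (2019, 4, 2249989796, 2252326390),
    (2019, 12, 6862646103, 6883386397),
    (2018, 4, 2125140809, 2129242914),
    (2018, 12, 6396562317, 6500818510),
]


def summarise(rows):
    """Add the absolute change and percentage growth to each row."""
    summary = []
    for year, weeks, before, after in rows:
        change = after - before
        pct = round((after / before - 1) * 100, 2)
        summary.append((year, weeks, change, pct))
    return summary


for year, weeks, change, pct in summarise(RESULTS):
    print(f"{year} | {weeks:>2} weeks | change {change:>12,} | {pct:6.2f}%")
```

These figures inherit whatever window boundaries and row pairing the queries used, so they are only as reliable as the totals they start from.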