# EDA and Advanced Analysis

This notebook contains all SQL queries used for exploratory data analysis of the data warehouse.

## Database Exploration

Purpose:
- To explore the structure of the database, including the list of tables and their schemas.
- To inspect the columns and metadata for specific tables.

Tables Used:
- INFORMATION_SCHEMA.TABLES
- INFORMATION_SCHEMA.COLUMNS

In [2]:
USE DataWarehouse;

In [3]:
-- Retrieve a list of all tables in the database
SELECT 
    TABLE_CATALOG, 
    TABLE_SCHEMA, 
    TABLE_NAME, 
    TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES;

TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
DataWarehouse,bronze,crm_cust_info,BASE TABLE
DataWarehouse,bronze,crm_prd_info,BASE TABLE
DataWarehouse,bronze,crm_sales_details,BASE TABLE
DataWarehouse,bronze,erp_loc_a101,BASE TABLE
DataWarehouse,bronze,erp_cust_az12,BASE TABLE
DataWarehouse,bronze,erp_px_cat_g1v2,BASE TABLE
DataWarehouse,silver,crm_cust_info,BASE TABLE
DataWarehouse,silver,crm_prd_info,BASE TABLE
DataWarehouse,silver,crm_sales_details,BASE TABLE
DataWarehouse,silver,erp_loc_a101,BASE TABLE


In [4]:
-- Retrieve all columns for a specific table (dim_customers)
SELECT 
    COLUMN_NAME, 
    DATA_TYPE, 
    IS_NULLABLE, 
    CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'dim_customers';

COLUMN_NAME,DATA_TYPE,IS_NULLABLE,CHARACTER_MAXIMUM_LENGTH
customer_key,bigint,YES,
customer_id,int,YES,
customer_number,nvarchar,YES,50.0
first_name,nvarchar,YES,50.0
last_name,nvarchar,YES,50.0
country,nvarchar,YES,50.0
marital_status,nvarchar,YES,50.0
gender,nvarchar,YES,50.0
birthdate,date,YES,
create_date,date,YES,
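
The table list above shows the same table names in more than one schema (for example `crm_cust_info` in both `bronze` and `silver`), so a lookup by `TABLE_NAME` alone can mix columns from several schemas. A minimal sketch of a schema-qualified lookup, assuming `dim_customers` lives in the `gold` schema as the later queries suggest:

```sql
-- Restrict the column lookup to one schema to avoid name collisions
SELECT
    COLUMN_NAME,
    DATA_TYPE,
    IS_NULLABLE,
    CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'gold'          -- assumption: dim_customers is in the gold layer
  AND TABLE_NAME = 'dim_customers'
ORDER BY ORDINAL_POSITION;           -- keep the columns in their defined order
```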


## Dimensions Exploration

Purpose:
- To explore the structure of dimension tables.

SQL Functions Used:
- DISTINCT
- ORDER BY

In [5]:
-- Retrieve a list of unique countries from which customers originate
SELECT DISTINCT 
    country 
FROM gold.dim_customers
ORDER BY country;

-- Not sure whether 'Unknown' means other countries or simply no information on the country
-- Need to discuss with the team or a source-system expert

country
Australia
Canada
France
Germany
United Kingdom
United States
Unknown


In [33]:
-- Retrieve a list of unique categories, subcategories, and products
SELECT DISTINCT TOP 10
    category,
    subcategory, 
    product_name 
FROM gold.dim_products
ORDER BY category, subcategory, product_name;

-- Checking the product range: 295 unique products in 4 main categories and 36 subcategories
-- The business offers a diverse range of bikes along with a variety of complementary products.

category,subcategory,product_name
Accessories,Bike Racks,Hitch Rack - 4-Bike
Accessories,Bike Stands,All-Purpose Bike Stand
Accessories,Bottles and Cages,Mountain Bottle Cage
Accessories,Bottles and Cages,Road Bottle Cage
Accessories,Bottles and Cages,Water Bottle - 30 oz.
Accessories,Cleaners,Bike Wash - Dissolver
Accessories,Fenders,Fender Set - Mountain
Accessories,Helmets,Sport-100 Helmet- Black
Accessories,Helmets,Sport-100 Helmet- Blue
Accessories,Helmets,Sport-100 Helmet- Red


## Date Range Exploration

Purpose:
- To determine the temporal boundaries of key data points.
- To understand the range of historical data.

SQL Functions Used:
- MIN(), MAX(), DATEDIFF()

In [8]:
-- Determine the first and last order date and the total duration in months
SELECT 
    MIN(order_date) AS first_order_date,
    MAX(order_date) AS last_order_date,
    DATEDIFF(MONTH, MIN(order_date), MAX(order_date)) AS order_range_months
FROM gold.fact_sales;

first_order_date,last_order_date,order_range_months
2010-12-29,2014-01-28,37


In [9]:
-- Find the youngest and oldest customer based on birthdate
SELECT
    MIN(birthdate) AS oldest_birthdate,
    DATEDIFF(YEAR, MIN(birthdate), GETDATE()) AS oldest_age,
    MAX(birthdate) AS youngest_birthdate,
    DATEDIFF(YEAR, MAX(birthdate), GETDATE()) AS youngest_age
FROM gold.dim_customers;

oldest_birthdate,oldest_age,youngest_birthdate,youngest_age
1916-02-10,109,1986-06-25,39
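
`DATEDIFF(YEAR, ...)` counts the year boundaries crossed, not completed years, so it can overstate an age by one when the birthday has not yet occurred in the reference year. The sketch below is self-contained and uses a hypothetical fixed reference date (`@as_of`) instead of `GETDATE()` so the effect is reproducible; the two birthdates are the ones reported above.

```sql
-- Hypothetical fixed reference date, used only to make the demo reproducible
DECLARE @as_of date = '2025-01-15';

SELECT
    b.birthdate,
    -- Counts calendar-year boundaries between the two dates
    DATEDIFF(YEAR, b.birthdate, @as_of) AS boundary_age,
    -- Subtract 1 when this year's birthday falls after the reference date
    DATEDIFF(YEAR, b.birthdate, @as_of)
        - CASE
              WHEN DATEADD(YEAR, DATEDIFF(YEAR, b.birthdate, @as_of), b.birthdate) > @as_of
              THEN 1
              ELSE 0
          END AS completed_years_age
FROM (VALUES
    (CAST('1916-02-10' AS date)),
    (CAST('1986-06-25' AS date))
) AS b(birthdate);
```

With this reference date both birthdays are still ahead in the year, so `completed_years_age` comes out one lower than `boundary_age` for each row.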


## Measures Exploration (Key Metrics)

Purpose:

- To calculate aggregated metrics (e.g., totals, averages) for quick insights.
- To identify overall trends or spot anomalies.

SQL Functions Used:

- COUNT(), SUM(), AVG(), UNION ALL

In [10]:
-- Find the Total Sales
SELECT SUM(sales_amount) AS total_sales FROM gold.fact_sales

-- Over 29 M

total_sales
29356250


In [12]:
-- Find the average selling price
SELECT AVG(price) AS avg_price FROM gold.fact_sales

-- The average line-item price is fairly high, which points to expensive items
-- Bikes probably drive this (the magnitude analysis section shows bikes generate most of the revenue)

avg_price
486


In [11]:
-- Find how many items are sold
SELECT SUM(quantity) AS total_quantity FROM gold.fact_sales

total_quantity
60423


In [13]:
-- Find the Total number of Orders (with and without duplicates)
SELECT COUNT(order_number) AS total_orders FROM gold.fact_sales;
SELECT COUNT(DISTINCT order_number) AS total_orders FROM gold.fact_sales;

-- Means the business sells a bit more than 2 line items per order on average (60,398 rows across 27,659 orders)

total_orders
60398


total_orders
27659
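
The two counts above can be combined into a single ratio. A minimal sketch, assuming each row of `gold.fact_sales` is one order line; the column aliases are illustrative only:

```sql
-- Average number of lines and items per distinct order
SELECT
    COUNT(order_number)          AS total_order_lines,
    COUNT(DISTINCT order_number) AS total_orders,
    -- 1.0 * forces decimal division; NULLIF avoids dividing by zero on an empty table
    CAST(1.0 * COUNT(order_number) / NULLIF(COUNT(DISTINCT order_number), 0) AS decimal(10, 2)) AS avg_lines_per_order,
    CAST(1.0 * SUM(quantity)       / NULLIF(COUNT(DISTINCT order_number), 0) AS decimal(10, 2)) AS avg_items_per_order
FROM gold.fact_sales;
```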


In [17]:
-- Find the total number of products
SELECT COUNT(product_name) AS total_products FROM gold.dim_products

total_products
295


In [18]:
-- Find the total number of customers
SELECT COUNT(customer_key) AS total_customers FROM gold.dim_customers;

total_customers
18484


In [19]:
-- Find the total number of customers that have placed an order
SELECT COUNT(DISTINCT customer_key) AS total_customers FROM gold.fact_sales;

-- Wow, all the registered customers actually bought something

total_customers
18484
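
Matching counts do not strictly prove that both tables contain the same set of customers. One way to check, sketched here with `NOT EXISTS` in both directions, is to look for keys that appear on one side only; both queries should return no rows if the sets match.

```sql
-- Registered customers with no sales rows
SELECT c.customer_key, c.first_name, c.last_name
FROM gold.dim_customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM gold.fact_sales f
    WHERE f.customer_key = c.customer_key
);

-- Sales rows whose customer_key is missing from the dimension
SELECT DISTINCT f.customer_key
FROM gold.fact_sales f
WHERE NOT EXISTS (
    SELECT 1
    FROM gold.dim_customers c
    WHERE c.customer_key = f.customer_key
);
```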


In [22]:
-- Generate a Report that shows all key metrics of the business
SELECT 'Total Sales' AS measure_name, SUM(sales_amount) AS measure_value FROM gold.fact_sales
UNION ALL
SELECT 'Total Quantity', SUM(quantity) FROM gold.fact_sales
UNION ALL
SELECT 'Average Price', AVG(price) FROM gold.fact_sales
UNION ALL
SELECT 'Total Orders', COUNT(DISTINCT order_number) FROM gold.fact_sales
UNION ALL
SELECT 'Total Products', COUNT(DISTINCT product_name) FROM gold.dim_products
UNION ALL
SELECT 'Total Customers', COUNT(customer_key) FROM gold.dim_customers;

measure_name,measure_value
Total Sales,29356250
Total Quantity,60423
Average Price,486
Total Orders,27659
Total Products,295
Total Customers,18484


## Magnitude Analysis (Aggregating Measures by Dimensions)

Purpose:

- To quantify data and group results by specific dimensions.
- For understanding data distribution across categories.

SQL Functions Used:

- Aggregate Functions: SUM(), COUNT(), AVG()
- GROUP BY, ORDER BY, LEFT JOIN, FULL JOIN

In [23]:
-- Find total customers by countries
SELECT
    country,
    COUNT(customer_key) AS total_customers
FROM gold.dim_customers
GROUP BY country
ORDER BY total_customers DESC;

country,total_customers
United States,7482
Australia,3591
United Kingdom,1913
France,1810
Germany,1780
Canada,1571
Unknown,337


In [24]:
-- Find total customers by gender
SELECT
    gender,
    COUNT(customer_key) AS total_customers
FROM gold.dim_customers
GROUP BY gender
ORDER BY total_customers DESC;

gender,total_customers
Male,9341
Female,9128
Unknown,15


In [25]:
-- Find total customers by marital_status
SELECT
    marital_status,
    COUNT(customer_key) AS total_customers
FROM gold.dim_customers
GROUP BY marital_status
ORDER BY total_customers DESC;

marital_status,total_customers
Married,10011
Single,8473


In [26]:
-- Find total products by category
SELECT
    category,
    COUNT(product_key) AS total_products
FROM gold.dim_products
GROUP BY category
ORDER BY total_products DESC;

category,total_products
Components,134
Bikes,97
Clothing,35
Accessories,29


In [29]:
-- What is the average cost in each category?
SELECT
    category,
    AVG(cost) AS avg_cost
FROM gold.dim_products
GROUP BY category
ORDER BY avg_cost DESC;

category,avg_cost
Bikes,949
Components,252
Clothing,24
Accessories,13


In [66]:
-- What is the total revenue generated and total quantity sold for each category?
SELECT 
    p.category,
    COALESCE(SUM(f.sales_amount), 0) AS total_revenue,
    COALESCE(SUM(f.quantity), 0) AS total_quantity,
    COALESCE(AVG(f.price), 0) AS avg_price,
    COUNT(DISTINCT p.product_key) AS total_products
FROM gold.fact_sales f
FULL JOIN gold.dim_products p
    ON p.product_key = f.product_key
GROUP BY p.category
ORDER BY total_revenue DESC;

-- Wow, not a single item from the Components category has been sold
-- This is despite the business having its widest range of items in this category

category,total_revenue,total_quantity,avg_price,total_products
Bikes,28316272,15205,1862,97
Accessories,700262,36112,19,29
Clothing,339716,9106,37,35
Components,0,0,0,134
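
To see where the unsold products sit without relying on the `FULL JOIN`, an anti-join against the fact table is one option. A minimal sketch; the alias `unsold_products` is illustrative:

```sql
-- Count products per category that never appear in fact_sales
SELECT
    p.category,
    COUNT(*) AS unsold_products
FROM gold.dim_products p
WHERE NOT EXISTS (
    SELECT 1
    FROM gold.fact_sales f
    WHERE f.product_key = p.product_key
)
GROUP BY p.category
ORDER BY unsold_products DESC;
```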


In [56]:
-- What is the total revenue generated by each customer?
SELECT TOP 5
    c.customer_key,
    c.first_name,
    c.last_name,
    SUM(f.sales_amount) AS total_revenue
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c
    ON c.customer_key = f.customer_key
GROUP BY 
    c.customer_key,
    c.first_name,
    c.last_name
ORDER BY total_revenue DESC;

-- Found high value customers aka big spenders

customer_key,first_name,last_name,total_revenue
1133,Kaitlyn,Henderson,13294
1302,Nichole,Nara,13294
1309,Margaret,He,13268
1132,Randall,Dominguez,13265
1301,Adriana,Gonzalez,13242


In [74]:
-- What is the distribution of key metrics across countries?
SELECT
    c.country,
    SUM(f.sales_amount) AS total_revenue,
    SUM(f.quantity) AS total_sold_items,
    COUNT(DISTINCT c.customer_key) AS total_customers_registered,
    COUNT(DISTINCT f.customer_key) AS total_customers_purchased
FROM gold.fact_sales f
FULL JOIN gold.dim_customers c
    ON c.customer_key = f.customer_key
GROUP BY c.country
ORDER BY total_revenue DESC;

-- USA is the biggest market
-- Interestingly, Australia has good potential (far fewer customers purchasing costly items, generating almost the same revenue as the USA)

country,total_revenue,total_sold_items,total_customers_registered,total_customers_purchased
United States,9162327,20481,7482,7482
Australia,9060172,13346,3591,3591
United Kingdom,3391376,6910,1913,1913
Germany,2894066,5626,1780,1780
France,2643751,5559,1810,1810
Canada,1977738,7630,1571,1571
Unknown,226820,871,337,337
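
Dividing revenue by the number of purchasing customers makes the country comparison explicit: from the figures above, Australia comes to roughly 2,500 per customer against roughly 1,200 for the United States. A sketch of that calculation, with illustrative alias names:

```sql
-- Revenue per purchasing customer, by country
SELECT
    c.country,
    SUM(f.sales_amount)            AS total_revenue,
    COUNT(DISTINCT f.customer_key) AS total_customers_purchased,
    -- 1.0 * forces decimal division; NULLIF guards against an empty group
    CAST(1.0 * SUM(f.sales_amount) / NULLIF(COUNT(DISTINCT f.customer_key), 0) AS decimal(18, 2)) AS revenue_per_customer
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c
    ON c.customer_key = f.customer_key
GROUP BY c.country
ORDER BY revenue_per_customer DESC;
```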


## Ranking Analysis

Purpose:
- To rank items (e.g., products, customers) based on performance or other metrics.
- To identify top performers or laggards.

SQL Functions Used:
- Window Ranking Functions: RANK(), DENSE_RANK(), ROW_NUMBER()
- Clauses: TOP, GROUP BY, ORDER BY

In [76]:
-- Which 5 products and subcategories generate the highest revenue? (Simple Ranking)
SELECT TOP 5
    p.product_name,
    SUM(f.sales_amount) AS total_revenue
FROM gold.fact_sales f
LEFT JOIN gold.dim_products p
    ON p.product_key = f.product_key
GROUP BY p.product_name
ORDER BY total_revenue DESC;

SELECT TOP 5
    p.subcategory,
    SUM(f.sales_amount) AS total_revenue
FROM gold.fact_sales f
LEFT JOIN gold.dim_products p
    ON p.product_key = f.product_key
GROUP BY p.subcategory
ORDER BY total_revenue DESC;

product_name,total_revenue
Mountain-200 Black- 46,1373454
Mountain-200 Black- 42,1363128
Mountain-200 Silver- 38,1339394
Mountain-200 Silver- 46,1301029
Mountain-200 Black- 38,1294854


subcategory,total_revenue
Road Bikes,14519438
Mountain Bikes,9952254
Touring Bikes,3844580
Tires and Tubes,244634
Helmets,225435


In [77]:
-- Which 5 products generate the highest revenue?
SELECT *
FROM (
    SELECT
        p.product_name,
        SUM(f.sales_amount) AS total_revenue,
        RANK() OVER (ORDER BY SUM(f.sales_amount) DESC) AS rank_products --ROW_NUMBER()
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON p.product_key = f.product_key
    GROUP BY p.product_name
) AS ranked_products
WHERE rank_products <= 5;

-- sounds like Mountain-200 is really trendy and is the signature of the company.

product_name,total_revenue,rank_products
Mountain-200 Black- 46,1373454,1
Mountain-200 Black- 42,1363128,2
Mountain-200 Silver- 38,1339394,3
Mountain-200 Silver- 46,1301029,4
Mountain-200 Black- 38,1294854,5
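
The three ranking functions listed for this section only differ when values tie. The self-contained sketch below uses three customer totals reported earlier in this notebook, two of which are equal, to show the difference; the inline `VALUES` table stands in for the real query.

```sql
-- Compare ROW_NUMBER, RANK and DENSE_RANK on tied revenue values
SELECT
    t.customer_name,
    t.total_revenue,
    ROW_NUMBER() OVER (ORDER BY t.total_revenue DESC) AS row_num,    -- 1, 2, 3 (tie order is arbitrary)
    RANK()       OVER (ORDER BY t.total_revenue DESC) AS rnk,        -- 1, 1, 3 (gap after the tie)
    DENSE_RANK() OVER (ORDER BY t.total_revenue DESC) AS dense_rnk   -- 1, 1, 2 (no gap)
FROM (VALUES
    ('Kaitlyn Henderson', 13294),
    ('Nichole Nara',      13294),
    ('Margaret He',       13268)
) AS t(customer_name, total_revenue)
ORDER BY t.total_revenue DESC;
```

A filter such as `rank <= 5` can therefore return more than five rows with `RANK()` when ties occur, while `ROW_NUMBER()` always returns exactly five.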


In [78]:
-- What are the 5 worst-performing products in terms of sales? 
-- (of course, except for all the products in the Components category, as we already know none of the items in that category have sold yet)
SELECT TOP 5
    p.product_name,
    SUM(f.sales_amount) AS total_revenue
FROM gold.fact_sales f
LEFT JOIN gold.dim_products p
    ON p.product_key = f.product_key
GROUP BY p.product_name
ORDER BY total_revenue;

product_name,total_revenue
Racing Socks- L,2430
Racing Socks- M,2682
Patch Kit/8 Patches,6382
Bike Wash - Dissolver,7272
Touring Tire Tube,7440


In [79]:
-- Find the top 10 customers who have generated the highest revenue
SELECT TOP 10
    c.customer_key,
    c.first_name,
    c.last_name,
    SUM(f.sales_amount) AS total_revenue
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c
    ON c.customer_key = f.customer_key
GROUP BY 
    c.customer_key,
    c.first_name,
    c.last_name
ORDER BY total_revenue DESC;

-- special thanks to Nichole and Kaitlyn

customer_key,first_name,last_name,total_revenue
1133,Kaitlyn,Henderson,13294
1302,Nichole,Nara,13294
1309,Margaret,He,13268
1132,Randall,Dominguez,13265
1301,Adriana,Gonzalez,13242
1322,Rosa,Hu,13215
1125,Brandi,Gill,13195
1308,Brad,She,13172
1297,Francisco,Sara,13164
434,Maurice,Shan,12914


In [80]:
-- The 3 customers with the fewest orders placed
SELECT TOP 3
    c.customer_key,
    c.first_name,
    c.last_name,
    COUNT(DISTINCT order_number) AS total_orders
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c
    ON c.customer_key = f.customer_key
GROUP BY 
    c.customer_key,
    c.first_name,
    c.last_name
ORDER BY total_orders;

-- maybe we can do churn analysis later on

customer_key,first_name,last_name,total_orders
21,Jordan,King,1
17,Wyatt,Hill,1
22,Destiny,Wilson,1


## Change Over Time Analysis

Purpose:
- To track trends, growth, and changes in key metrics over time.
- For time-series analysis and identifying seasonality.
- To measure growth or decline over specific periods.

SQL Functions Used:
- Date Functions: YEAR(), MONTH(), DATEPART(), DATETRUNC(), FORMAT()
- Aggregate Functions: SUM(), COUNT(), AVG()

In [38]:
-- Analyse sales performance over time yearly
SELECT
    YEAR(order_date) AS order_year,
    SUM(sales_amount) AS total_sales,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(quantity) AS total_quantity,
    RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS rank_sales_yearly
FROM gold.fact_sales
WHERE order_date IS NOT NULL
GROUP BY YEAR(order_date)
ORDER BY YEAR(order_date);

-- Overall, it is a growing business
-- After a drop in sales in 2012 (about 17%), sales nearly tripled the following year.
-- The 2013 jump coincides with a much larger customer base (17,427 vs 3,255) and customers buying about 3 items each on average. -- Good job, marketing team
-- Note that data for 2010 and 2014 are incomplete (only Dec 2010 and Jan 2014)


order_year,total_sales,total_customers,total_quantity,rank_sales_yearly
2010,43419,14,14,5
2011,7075088,2216,2216,2
2012,5842231,3255,3397,3
2013,16344878,17427,52807,1
2014,45642,834,1970,4
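
The year-over-year change and the items-per-customer ratio can be computed directly rather than read off the table. The sketch below is self-contained: it replays the yearly output above through an inline `VALUES` table and uses `LAG()` to reach the previous year. The alias names are illustrative.

```sql
-- Year-over-year sales ratio and items per customer, from the yearly figures above
WITH yearly AS (
    SELECT v.order_year, v.total_sales, v.total_customers, v.total_quantity
    FROM (VALUES
        (2010,    43419,    14,    14),
        (2011,  7075088,  2216,  2216),
        (2012,  5842231,  3255,  3397),
        (2013, 16344878, 17427, 52807),
        (2014,    45642,   834,  1970)
    ) AS v(order_year, total_sales, total_customers, total_quantity)
)
SELECT
    order_year,
    total_sales,
    LAG(total_sales) OVER (ORDER BY order_year) AS prev_year_sales,
    -- Ratio of this year's sales to the previous year's (NULL for the first year)
    CAST(1.0 * total_sales
         / NULLIF(LAG(total_sales) OVER (ORDER BY order_year), 0) AS decimal(10, 2)) AS sales_ratio_to_prev_year,
    CAST(1.0 * total_quantity / NULLIF(total_customers, 0) AS decimal(10, 2)) AS items_per_customer
FROM yearly
ORDER BY order_year;
```

Because 2010 and 2014 each cover only a partial month, their ratios are not comparable with those of the full years.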


In [34]:
-- Analyse sales performance over time monthly
SELECT
    MONTH(order_date) AS order_month,
    SUM(sales_amount) AS total_sales,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(quantity) AS total_quantity,
    RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS rank_sales_monthly
FROM gold.fact_sales
WHERE order_date IS NOT NULL
GROUP BY MONTH(order_date)
ORDER BY MONTH(order_date);

-- We can see the seasonality here
-- Towards the end of the year, sales pick up, probably due to Christmas
-- January to March are the slowest months

order_month,total_sales,total_customers,total_quantity,rank_sales_monthly
1,1868558,1818,4043,11
2,1744517,1765,3858,12
3,1908375,1982,4449,10
4,1948226,1916,4355,9
5,2204969,2074,4781,8
6,2935883,2430,5573,3
7,2412838,2154,5107,7
8,2684313,2312,5335,5
9,2536520,2210,5070,6
10,2916550,2533,5838,4


In [35]:
-- All in one for a more granular view
SELECT 
    YEAR(order_date) AS order_year,
    MONTH(order_date) AS order_month,
    SUM(SUM(sales_amount)) OVER (PARTITION BY YEAR(order_date)) AS yearly_sales,
    SUM(SUM(sales_amount)) OVER (PARTITION BY MONTH(order_date)) AS monthly_sales,
    SUM(sales_amount) AS total_sales,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(quantity) AS total_quantity
FROM gold.fact_sales
WHERE order_date IS NOT NULL
GROUP BY YEAR(order_date), MONTH(order_date)
ORDER BY YEAR(order_date), MONTH(order_date);

order_year,order_month,yearly_sales,monthly_sales,total_sales,total_customers,total_quantity
2010,12,43419,3211396,43419,14,14
2011,1,7075088,1868558,469795,144,144
2011,2,7075088,1744517,466307,144,144
2011,3,7075088,1908375,485165,150,150
2011,4,7075088,1948226,502042,157,157
2011,5,7075088,2204969,561647,174,174
2011,6,7075088,2935883,737793,230,230
2011,7,7075088,2412838,596710,188,188
2011,8,7075088,2684313,614516,193,193
2011,9,7075088,2536520,603047,185,185


In [None]:
/*
-- Using DATETRUNC() for monthly analysis
SELECT
    DATETRUNC(month, order_date) AS order_date,
    SUM(sales_amount) AS total_sales,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(quantity) AS total_quantity
FROM gold.fact_sales
WHERE order_date IS NOT NULL
GROUP BY DATETRUNC(month, order_date)
ORDER BY DATETRUNC(month, order_date);
*/

In [None]:
/*
-- Using FORMAT() for formatted date display
SELECT
    FORMAT(order_date, 'yyyy-MMM') AS order_date,
    SUM(sales_amount) AS total_sales,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(quantity) AS total_quantity
FROM gold.fact_sales
WHERE order_date IS NOT NULL
GROUP BY FORMAT(order_date, 'yyyy-MMM')
ORDER BY FORMAT(order_date, 'yyyy-MMM');
*/
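
Two caveats apply to the commented-out variants above. `DATETRUNC()` is only available from SQL Server 2022 onward, and ordering by the `FORMAT()` string sorts the labels alphabetically (`2011-Apr` before `2011-Jan`) rather than chronologically. A possible workaround, sketched here, groups by the first day of the month built with `DATEFROMPARTS()` and uses `FORMAT()` only for display; the aliases are illustrative.

```sql
-- Monthly aggregation with a sortable key and a display label
SELECT
    DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS order_month_start,  -- sortable date key
    FORMAT(MIN(order_date), 'yyyy-MMM')                   AS order_month_label,  -- display only
    SUM(sales_amount)            AS total_sales,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(quantity)                AS total_quantity
FROM gold.fact_sales
WHERE order_date IS NOT NULL
GROUP BY DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1)
ORDER BY order_month_start;  -- chronological, not alphabetical
```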

## Cumulative Analysis (Aggregate the data progressively over time)

Purpose:

- To calculate running totals or moving averages for key metrics.
- To track performance over time cumulatively.
- Useful for growth analysis or identifying long-term trends.

SQL Functions Used:

- Window Functions: SUM() OVER(), AVG() OVER()

In [17]:
-- Calculate the total sales per year and the running total of sales over time
-- Calculate the running (cumulative) average of the yearly average price
SELECT
    order_date,
    total_sales,
    SUM(total_sales) OVER (ORDER BY order_date) AS running_total_sales,
    AVG(avg_price) OVER (ORDER BY order_date) AS moving_average_price
FROM
(
    SELECT 
        YEAR(order_date) AS order_date,
        SUM(sales_amount) AS total_sales,
        AVG(price) AS avg_price
    FROM gold.fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY YEAR(order_date)
) t
ORDER BY order_date;

order_date,total_sales,running_total_sales,moving_average_price
2010,43419,43419,3101
2011,7075088,7118507,3146
2012,5842231,12960738,2670
2013,16344878,29305616,2080
2014,45642,29351258,1668
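
`AVG(...) OVER (ORDER BY ...)` without a frame clause averages every row from the start up to the current one, so it behaves as a cumulative average. A moving average over a fixed window needs an explicit `ROWS` frame. The sketch below shows a 3-month window on monthly sales; it assumes every month in the range has at least one sale, since `ROWS` counts rows rather than calendar months. The aliases are illustrative.

```sql
-- Running total and 3-month moving average of monthly sales
SELECT
    order_month_start,
    total_sales,
    SUM(total_sales) OVER (
        ORDER BY order_month_start
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total_sales,
    AVG(total_sales) OVER (
        ORDER BY order_month_start
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW   -- current month plus the two before it
    ) AS moving_avg_sales_3m
FROM (
    SELECT
        DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS order_month_start,
        SUM(sales_amount) AS total_sales
    FROM gold.fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1)
) AS t
ORDER BY order_month_start;
```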


In [21]:
-- Calculate the total sales per month and the running total of sales within each year
-- Calculate the running (cumulative) average of the monthly average price within each year
SELECT
    order_date_year,
    order_date_month,
    total_sales,
    SUM(total_sales) OVER (PARTITION BY (order_date_year) ORDER BY order_date_month) AS running_total_sales_per_year,
    avg_price,
    AVG(avg_price) OVER (PARTITION BY (order_date_year) ORDER BY order_date_month) AS moving_average_price_per_year
FROM
(
    SELECT 
        MONTH(order_date) AS order_date_month,
        YEAR(order_date) AS order_date_year,
        SUM(sales_amount) AS total_sales,
        AVG(price) AS avg_price
    FROM gold.fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY YEAR(order_date), MONTH(order_date)
) t
ORDER BY order_date_year, order_date_month;

-- One interesting observation is that the running average price dropped steadily over the years
-- This is most obvious in 2013, which saw a big drop and, as we remember, was the best year in terms of sales.
-- This means the sales growth in 2013 was driven not by higher prices but by a huge growth in the quantity of items sold

order_date_year,order_date_month,total_sales,running_total_sales_per_year,avg_price,moving_average_price_per_year
2010,12,43419,43419,3101,3101
2011,1,469795,469795,3262,3262
2011,2,466307,936102,3238,3250
2011,3,485165,1421267,3234,3244
2011,4,502042,1923309,3197,3232
2011,5,561647,2484956,3227,3231
2011,6,737793,3222749,3207,3227
2011,7,596710,3819459,3173,3219
2011,8,614516,4433975,3184,3215
2011,9,603047,5037022,3259,3220


## Performance Analysis (Year-over-Year, Month-over-Month)

Purpose:
- To measure the performance of products, customers, or regions over time.
- For benchmarking and identifying high-performing entities.
- To track yearly trends and growth.

SQL Functions Used:
- LAG(): Accesses data from previous rows.
- AVG() OVER(): Computes average values within partitions.
- CASE: Defines conditional logic for trend analysis.

In [25]:
-- Analyze the yearly performance of products by comparing their sales to both the average sales performance and previous year
WITH yearly_product_sales AS (
    SELECT
        YEAR(f.order_date) AS order_year,
        p.product_name,
        SUM(f.sales_amount) AS current_sales
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON f.product_key = p.product_key
    WHERE f.order_date IS NOT NULL
    GROUP BY 
        YEAR(f.order_date),
        p.product_name
)
SELECT TOP 15
    order_year,
    product_name,
    current_sales,
    AVG(current_sales) OVER (PARTITION BY product_name) AS avg_sales,
    current_sales - AVG(current_sales) OVER (PARTITION BY product_name) AS diff_avg,
    CASE 
        WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) > 0 THEN 'Above Avg'
        WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) < 0 THEN 'Below Avg'
        ELSE 'Avg'
    END AS avg_change,
    -- Year-over-Year Analysis
    LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS py_sales,
    current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS diff_py,
    CASE 
        WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) > 0 THEN 'Increase'
        WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) < 0 THEN 'Decrease'
        ELSE 'No Change'
    END AS py_change
FROM yearly_product_sales
ORDER BY product_name, order_year;

order_year,product_name,current_sales,avg_sales,diff_avg,avg_change,py_sales,diff_py,py_change
2012,All-Purpose Bike Stand,159,13197,-13038,Below Avg,,,No Change
2013,All-Purpose Bike Stand,37683,13197,24486,Above Avg,159.0,37524.0,Increase
2014,All-Purpose Bike Stand,1749,13197,-11448,Below Avg,37683.0,-35934.0,Decrease
2012,AWC Logo Cap,72,6570,-6498,Below Avg,,,No Change
2013,AWC Logo Cap,18891,6570,12321,Above Avg,72.0,18819.0,Increase
2014,AWC Logo Cap,747,6570,-5823,Below Avg,18891.0,-18144.0,Decrease
2013,Bike Wash - Dissolver,6960,3636,3324,Above Avg,,,No Change
2014,Bike Wash - Dissolver,312,3636,-3324,Below Avg,6960.0,-6648.0,Decrease
2013,Classic Vest- L,11968,6240,5728,Above Avg,,,No Change
2014,Classic Vest- L,512,6240,-5728,Below Avg,11968.0,-11456.0,Decrease
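
In the output above, the first year of each product is labelled 'No Change' because `LAG()` returns NULL there and both comparisons fall through to `ELSE`. One way to separate "no previous year" from "same as previous year" is to compute `LAG()` once in a CTE and test for NULL first. The sketch below is self-contained and replays three rows from the output; the label 'No Prior Year' and the CTE name `with_previous_year` are illustrative choices.

```sql
-- Year-over-year label that distinguishes a missing previous year
WITH yearly_product_sales AS (
    SELECT v.order_year, v.product_name, v.current_sales
    FROM (VALUES
        (2012, 'All-Purpose Bike Stand',   159),
        (2013, 'All-Purpose Bike Stand', 37683),
        (2014, 'All-Purpose Bike Stand',  1749)
    ) AS v(order_year, product_name, current_sales)
),
with_previous_year AS (
    SELECT
        order_year,
        product_name,
        current_sales,
        LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS py_sales
    FROM yearly_product_sales
)
SELECT
    order_year,
    product_name,
    current_sales,
    py_sales,
    current_sales - py_sales AS diff_py,
    CASE
        WHEN py_sales IS NULL         THEN 'No Prior Year'  -- first year on record
        WHEN current_sales > py_sales THEN 'Increase'
        WHEN current_sales < py_sales THEN 'Decrease'
        ELSE 'No Change'
    END AS py_change
FROM with_previous_year
ORDER BY product_name, order_year;
```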


## Data Segmentation Analysis (Measure by Measure Bucket)

### Customer and Product Segmentation

Purpose:

- To group data into meaningful categories for targeted insights. (Scatter Plot)
- For customer segmentation, product categorization, or regional analysis.

SQL Functions Used:

- CASE: Defines custom segmentation logic.
- GROUP BY: Groups data into segments.

In [2]:
-- Segment products into cost ranges and count how many products fall into each segment
WITH product_segments AS (
    SELECT
        product_key,
        product_name,
        cost,
        CASE 
            WHEN cost < 100 THEN 'Below 100'
            WHEN cost BETWEEN 100 AND 500 THEN '100-500'
            WHEN cost BETWEEN 500 AND 1000 THEN '500-1000'
            ELSE 'Above 1000'
        END AS cost_range
    FROM gold.dim_products
)
SELECT 
    cost_range,
    COUNT(product_key) AS total_products
FROM product_segments
GROUP BY cost_range
ORDER BY total_products DESC;

-- We have lots of cheap products (likely accessories and clothing, given their low average cost), and as we know they do not generate much revenue.

cost_range,total_products
Below 100,110
100-500,101
500-1000,45
Above 1000,39
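
The `BETWEEN` ranges in the segmentation overlap at 500, and the result is still well defined only because `CASE` returns the first matching branch. An equivalent formulation with one-sided bounds makes the boundaries explicit. The sketch below is self-contained and runs the bucketing over a few hypothetical boundary costs:

```sql
-- Same buckets as above, written with non-overlapping one-sided bounds
SELECT
    v.cost,
    CASE
        WHEN v.cost < 100   THEN 'Below 100'
        WHEN v.cost <= 500  THEN '100-500'     -- 100 to 500 inclusive
        WHEN v.cost <= 1000 THEN '500-1000'    -- above 500, up to 1000 inclusive
        ELSE 'Above 1000'                      -- note: a NULL cost would also land here
    END AS cost_range
FROM (VALUES (99), (100), (500), (501), (1000), (1001)) AS v(cost)  -- hypothetical test costs
ORDER BY v.cost;
```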


In [3]:
-- Group customers into segments based on spending behavior and history
/*
VIP: Customers with at least 12 months of history and spending more than 5,000.
Regular: Customers with at least 12 months of history but spending 5,000 or less.
New: Customers with a lifespan less than 12 months.
*/
WITH customer_spending AS (
    SELECT
        c.customer_key,
        SUM(f.sales_amount) AS total_spending,
        MIN(order_date) AS first_order,
        MAX(order_date) AS last_order,
        DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_customers c
        ON f.customer_key = c.customer_key
    GROUP BY c.customer_key
)
SELECT 
    customer_segment,
    COUNT(customer_key) AS total_customers
FROM (
    SELECT 
        customer_key,
        CASE 
            WHEN lifespan >= 12 AND total_spending > 5000 THEN 'VIP'
            WHEN lifespan >= 12 AND total_spending <= 5000 THEN 'Regular'
            ELSE 'New'
        END AS customer_segment
    FROM customer_spending
) AS segmented_customers
GROUP BY customer_segment
ORDER BY total_customers DESC;

-- The business is doing great at acquiring customers; about 79% fall into 'New' (lifespan under 12 months), so retention deserves a closer look.

customer_segment,total_customers
New,14631
Regular,2198
VIP,1655


## Part-to-Whole Analysis (Proportional Analysis)

Purpose:

- To compare performance or metrics across dimensions or time periods.
- To evaluate differences between categories.
- Useful for A/B testing or regional comparisons.

SQL Functions Used:

- SUM(), AVG(): Aggregates values for comparison.
- Window Functions: SUM() OVER() for total calculations.

In [26]:
-- Which categories contribute the most to overall sales?
WITH category_sales AS (
    SELECT
        p.category,
        SUM(f.sales_amount) AS total_sales
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON p.product_key = f.product_key
    GROUP BY p.category
)
SELECT
    category,
    total_sales,
    SUM(total_sales) OVER () AS overall_sales,
    CONCAT(ROUND((CAST(total_sales AS FLOAT) / SUM(total_sales) OVER ()) * 100, 2), '%') AS percentage_of_total
FROM category_sales
ORDER BY total_sales DESC;

-- As we said, the business revolves almost entirely around bikes
-- However, this can be very dangerous, as the business is too reliant on one category of products


category,total_sales,overall_sales,percentage_of_total
Bikes,28316272,29356250,96.46%
Accessories,700262,29356250,2.39%
Clothing,339716,29356250,1.16%


## Customer Report

Purpose:
- This report consolidates key customer metrics and behaviors.

Highlights:
1. Gathers essential fields such as names, ages, and transaction details.
2. Segments customers into categories (VIP, Regular, New) and age groups.
3. Aggregates customer-level metrics:
   - total orders
   - total sales
   - total quantity purchased
   - total products
   - lifespan (in months)
4. Calculates valuable KPIs:
   - recency (months since last order)
   - average order value
   - average monthly spend

In [5]:
-- Create the Customer Report View
-- So it is easier for the analysis and BI teams to visualize this integrated view
IF OBJECT_ID('gold.report_customers', 'V') IS NOT NULL
    DROP VIEW gold.report_customers;
GO

CREATE VIEW gold.report_customers AS
WITH base_query AS(
    SELECT
        f.order_number,
        f.product_key,
        f.order_date,
        f.sales_amount,
        f.quantity,
        c.customer_key,
        c.customer_number,
        CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
        DATEDIFF(year, c.birthdate, GETDATE()) AS age
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_customers c
        ON c.customer_key = f.customer_key
    WHERE order_date IS NOT NULL
)
, customer_aggregation AS (
    SELECT 
        customer_key,
        customer_number,
        customer_name,
        age,
        COUNT(DISTINCT order_number) AS total_orders,
        SUM(sales_amount) AS total_sales,
        SUM(quantity) AS total_quantity,
        COUNT(DISTINCT product_key) AS total_products,
        MAX(order_date) AS last_order_date,
        DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
    FROM base_query
    GROUP BY 
        customer_key,
        customer_number,
        customer_name,
        age
)
SELECT
    customer_key,
    customer_number,
    customer_name,
    age,
    CASE 
        WHEN age < 20 THEN 'Under 20'
        WHEN age between 20 and 29 THEN '20-29'
        WHEN age between 30 and 39 THEN '30-39'
        WHEN age between 40 and 49 THEN '40-49'
        ELSE '50 and above'
    END AS age_group,
    CASE 
        WHEN lifespan >= 12 AND total_sales > 5000 THEN 'VIP'
        WHEN lifespan >= 12 AND total_sales <= 5000 THEN 'Regular'
        ELSE 'New'
    END AS customer_segment,
    last_order_date,
    DATEDIFF(month, last_order_date, GETDATE()) AS recency,
    total_orders,
    total_sales,
    total_quantity,
    total_products,
    lifespan,
    CASE WHEN total_orders = 0 THEN 0 -- guard against dividing by zero (NULLIF would also work)
         ELSE total_sales / total_orders
    END AS avg_order_value,
    CASE WHEN lifespan = 0 THEN total_sales -- means less than 1 month
         ELSE total_sales / lifespan
    END AS avg_monthly_spend
FROM customer_aggregation;
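
The view's comment mentions `NULLIF` as an alternative to the `CASE` guards. A minimal self-contained sketch of that variant follows; the first row uses the figures of the Kaitlyn Henderson row in the report output, and the other two rows are hypothetical edge cases.

```sql
-- Division guards written with NULLIF / COALESCE instead of CASE
SELECT
    t.total_sales,
    t.total_orders,
    t.lifespan,
    -- NULLIF turns a zero divisor into NULL; COALESCE maps the NULL result back to 0
    COALESCE(t.total_sales / NULLIF(t.total_orders, 0), 0) AS avg_order_value,
    -- A lifespan of 0 months is treated as 1, which returns total_sales unchanged
    t.total_sales / COALESCE(NULLIF(t.lifespan, 0), 1)     AS avg_monthly_spend
FROM (VALUES
    (13294, 5, 33),  -- figures from the report output
    (500,   1, 0),   -- hypothetical: single order, lifespan 0
    (0,     0, 0)    -- hypothetical: no orders at all
) AS t(total_sales, total_orders, lifespan);
```

Both columns use integer division here, as in the view, so results are truncated to whole numbers.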

In [6]:
-- Query the Customer Report to see results
SELECT TOP 10 *
FROM gold.report_customers
ORDER BY total_sales DESC;

customer_key,customer_number,customer_name,age,age_group,customer_segment,last_order_date,recency,total_orders,total_sales,total_quantity,total_products,lifespan,avg_order_value,avg_monthly_spend
1133,AW00012132,Kaitlyn Henderson,64,50 and above,VIP,2013-10-17,137,5,13294,14,13,33,2658,402
1302,AW00012301,Nichole Nara,73,50 and above,VIP,2013-11-20,136,5,13294,13,11,30,2658,443
1309,AW00012308,Margaret He,55,50 and above,VIP,2013-11-19,136,5,13268,14,14,29,2653,457
1132,AW00012131,Randall Dominguez,64,50 and above,VIP,2013-10-10,137,5,13265,11,11,32,2653,414
1301,AW00012300,Adriana Gonzalez,73,50 and above,VIP,2013-10-17,137,5,13242,10,10,29,2648,456
1322,AW00012321,Rosa Hu,69,50 and above,VIP,2013-11-21,136,5,13215,15,12,29,2643,455
1125,AW00012124,Brandi Gill,63,50 and above,VIP,2013-10-07,137,5,13195,12,11,33,2639,399
1308,AW00012307,Brad She,65,50 and above,VIP,2013-11-17,136,5,13172,11,10,30,2634,439
1297,AW00012296,Francisco Sara,64,50 and above,VIP,2013-10-25,137,5,13164,12,9,29,2632,453
434,AW00011433,Maurice Shan,68,50 and above,New,2013-09-14,138,6,12914,13,12,9,2152,1434
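
Once the view exists, segment-level summaries become short queries. A possible example, using only columns defined in `gold.report_customers` (the output aliases are illustrative):

```sql
-- Customers and sales by segment and age group
SELECT
    customer_segment,
    age_group,
    COUNT(*)             AS total_customers,
    SUM(total_sales)     AS segment_sales,
    AVG(avg_order_value) AS mean_avg_order_value
FROM gold.report_customers
GROUP BY customer_segment, age_group
ORDER BY customer_segment, segment_sales DESC;
```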


## Product Report

Purpose:
- This report consolidates key product metrics and behaviors.

Highlights:
1. Gathers essential fields such as product name, category, subcategory, and cost.
2. Segments products by revenue to identify High-Performers, Mid-Range, or Low-Performers.
3. Aggregates product-level metrics:
   - total orders
   - total sales
   - total quantity sold
   - total customers (unique)
   - lifespan (in months)
4. Calculates valuable KPIs:
   - recency (months since last sale)
   - average order revenue (AOR)
   - average monthly revenue

In [29]:
-- Create the Product Report View
-- So it is easier for the analysis and BI teams to visualize this integrated view
-- Remember: products that have not been sold yet are excluded
IF OBJECT_ID('gold.report_products', 'V') IS NOT NULL
    DROP VIEW gold.report_products;
GO

CREATE VIEW gold.report_products AS
WITH base_query AS (
    SELECT
        f.order_number,
        f.order_date,
        f.customer_key,
        f.sales_amount,
        f.quantity,
        p.product_key,
        p.product_name,
        p.category,
        p.subcategory,
        p.cost
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON f.product_key = p.product_key
    WHERE order_date IS NOT NULL
),
product_aggregations AS (
    SELECT
        product_key,
        product_name,
        category,
        subcategory,
        cost,
        DATEDIFF(MONTH, MIN(order_date), MAX(order_date)) AS lifespan,
        MAX(order_date) AS last_sale_date,
        COUNT(DISTINCT order_number) AS total_orders,
        COUNT(DISTINCT customer_key) AS total_customers,
        SUM(sales_amount) AS total_sales,
        SUM(quantity) AS total_quantity,
        ROUND(AVG(CAST(sales_amount AS FLOAT) / NULLIF(quantity, 0)), 1) AS avg_selling_price
    FROM base_query
    GROUP BY
        product_key,
        product_name,
        category,
        subcategory,
        cost
)
SELECT 
    product_key,
    product_name,
    category,
    subcategory,
    cost,
    last_sale_date,
    DATEDIFF(MONTH, last_sale_date, GETDATE()) AS recency_in_months,
    CASE
        WHEN total_sales > 50000 THEN 'High-Performer'
        WHEN total_sales >= 10000 THEN 'Mid-Range'
        ELSE 'Low-Performer'
    END AS product_segment,
    lifespan,
    total_orders,
    total_sales,
    total_quantity,
    total_customers,
    avg_selling_price,
    CASE 
        WHEN total_orders = 0 THEN 0
        ELSE total_sales / total_orders
    END AS avg_order_revenue,
    CASE
        WHEN lifespan = 0 THEN total_sales
        ELSE total_sales / lifespan
    END AS avg_monthly_revenue
FROM product_aggregations;

In [30]:
-- Query the Product Report to see results
SELECT TOP 10 *
FROM gold.report_products
ORDER BY total_sales DESC;

product_key,product_name,category,subcategory,cost,last_sale_date,recency_in_months,product_segment,lifespan,total_orders,total_sales,total_quantity,total_customers,avg_selling_price,avg_order_revenue,avg_monthly_revenue
122,Mountain-200 Black- 46,Bikes,Mountain Bikes,1252,2013-12-27,135,High-Performer,24,620,1373454,620,600,2215.2,2215,57227
121,Mountain-200 Black- 42,Bikes,Mountain Bikes,1252,2013-12-28,135,High-Performer,23,614,1363128,614,604,2220.1,2220,59266
123,Mountain-200 Silver- 38,Bikes,Mountain Bikes,1266,2013-12-28,135,High-Performer,23,596,1339394,596,583,2247.3,2247,58234
125,Mountain-200 Silver- 46,Bikes,Mountain Bikes,1266,2013-12-28,135,High-Performer,23,579,1298709,579,566,2243.0,2243,56465
120,Mountain-200 Black- 38,Bikes,Mountain Bikes,1252,2013-12-28,135,High-Performer,24,581,1292559,581,564,2224.7,2224,53856
124,Mountain-200 Silver- 42,Bikes,Mountain Bikes,1266,2013-12-28,135,High-Performer,24,560,1257368,560,547,2245.3,2245,52390
17,Road-150 Red- 48,Bikes,Road Bikes,2171,2011-12-28,159,High-Performer,12,337,1205786,337,337,3578.0,3578,100482
20,Road-150 Red- 62,Bikes,Road Bikes,2171,2011-12-28,159,High-Performer,12,336,1202208,336,336,3578.0,3578,100184
18,Road-150 Red- 52,Bikes,Road Bikes,2171,2011-12-27,159,High-Performer,12,302,1080556,302,302,3578.0,3578,90046
19,Road-150 Red- 56,Bikes,Road Bikes,2171,2011-12-27,159,High-Performer,12,295,1055510,295,295,3578.0,3578,87959
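
The product view can be summarized in the same way, for example to see how many products and how much revenue fall into each performance segment. A minimal sketch with illustrative output aliases:

```sql
-- Products and revenue by performance segment
SELECT
    product_segment,
    COUNT(*)            AS total_products,
    SUM(total_sales)    AS segment_sales,
    SUM(total_quantity) AS segment_quantity
FROM gold.report_products
GROUP BY product_segment
ORDER BY segment_sales DESC;
```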
