## Case Study #5 - Data Mart

#### Problem Statement

Data Mart is Danny’s latest venture. After running international operations for his online supermarket, which specialises in fresh produce, Danny is asking for your support to analyse his sales performance.

In June 2020, large-scale supply changes were made at Data Mart. All Data Mart products now use sustainable packaging methods in every single step from the farm all the way to the customer.

Danny needs your help to quantify the impact of this change on the sales performance for Data Mart and its separate business areas.

The key business questions he wants you to help him answer are the following:
- What was the quantifiable impact of the changes introduced in June 2020?
- Which platform, region, segment and customer types were the most impacted by this change?
- How can similar sustainability updates be introduced to the business in future while minimising the impact on sales?

#### Entity Relationship Diagram

![week5.png](week5.png)

Import modules

In [1]:
# SQL Engine imports
from dotenv import load_dotenv
import os
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy.sql import text
import warnings
warnings.filterwarnings("ignore")

# Python data analysis imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

Initialize SQL

In [2]:
load_dotenv()
user = os.environ.get("USER")
pw = os.environ.get("PASS")
db = os.environ.get("DB")
host = os.environ.get("HOST")
api = os.environ.get("API")
port = 5432
schema = 'data_mart'

In [3]:
uri = f"postgresql+psycopg2://{user}:{pw}@{host}:{port}/{db}"
alchemyEngine = create_engine(uri)
conn = alchemyEngine.connect()

Verify tables

In [4]:
rs = conn.execute(text(f"SELECT table_name FROM information_schema.tables WHERE table_schema='{schema}'"))
tables = [table[0] for table in rs.fetchall()]
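# Build the bullet list first: reusing the same quote type and a backslash
# inside an f-string expression is a syntax error before Python 3.12
table_list = '\n- '.join(tables)
print(f'The tables in the database are: \n- {table_list}')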

The tables in the database are: 
- weekly_sales


Fetch table information

In [5]:
for table in tables:
    print("=================================")
    print(f'Table [{table}]')
    df = pd.read_sql_query(f'SELECT * FROM {schema}.{table} LIMIT 5', conn.connection)
    print(f'Dimensions: {df.shape[0]} rows x {df.shape[1]} columns\n')
    print(df.head())
    info_df = pd.DataFrame.from_dict({'Datatypes':df.dtypes, 'NULL count':df.isna().sum()})
    print()
    print(info_df)
    print()

Table [weekly_sales]
Dimensions: 5 rows x 7 columns

  week_date  region platform segment customer_type  transactions     sales
0   31/8/20    ASIA   Retail      C3           New        120631   3656163
1   31/8/20    ASIA   Retail      F1           New         31574    996575
2   31/8/20     USA   Retail    null         Guest        529151  16509610
3   31/8/20  EUROPE   Retail      C1           New          4517    141942
4   31/8/20  AFRICA   Retail      C2           New         58046   1758388

              Datatypes  NULL count
week_date        object           0
region           object           0
platform         object           0
segment          object           0
customer_type    object           0
transactions      int64           0
sales             int64           0



In [6]:
def query(stmt: str) -> pd.DataFrame:
    """Execute a SQL statement and return the results as a pandas DataFrame.

    Parameters
    ----------
    stmt : str
        The SQL statement to be executed.

    Returns
    -------
    pd.DataFrame
        The rows returned by the statement.
    """
    # Reuse the notebook-level connection so temp tables created on it stay visible
    return pd.read_sql_query(stmt, conn.connection)

## Case Study Questions

The following case study questions start with data cleaning and some general data exploration before diving into the core business questions, and finish with a challenging final request!

**A. Data Cleaning**

Q1: In a single query, perform the following operations and generate a new table in the `data_mart` schema named `clean_weekly_sales`:
- Convert the `week_date` to a DATE format
- Add a `week_number` as the second column for each `week_date` value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2 etc
- Add a `month_number` with the calendar month for each `week_date` value as the 3rd column
- Add a `calendar_year` column as the 4th column containing either 2018, 2019 or 2020 values
- Add a new column called `age_band` after the original `segment` column using the following mapping on the number inside the segment value:

<p align="center">
  <img src="week5a.png"/>
</p>

- Add a new `demographic` column using the following mapping for the first letter in the `segment` values:

<p align="center">
  <img src="week5b.png" />
</p>

- Ensure all `null` string values are replaced with an `"unknown"` string value in the original `segment` column as well as the new `age_band` and `demographic` columns
- Generate a new `avg_transaction` column as the `sales` value divided by `transactions` rounded to 2 decimal places for each record

In [7]:
conn.execute(text('''
  DROP TABLE IF EXISTS clean_weekly_sales;
  -- Temporary table: visible only on this connection, not stored in the data_mart schema
  CREATE TEMP TABLE clean_weekly_sales AS (
  SELECT
    TO_DATE(week_date, 'DD/MM/YY') AS week_date,
    -- DATE_PART('week', ...) returns the ISO 8601 week number
    DATE_PART('week', TO_DATE(week_date, 'DD/MM/YY')) AS week_number,
    DATE_PART('month', TO_DATE(week_date, 'DD/MM/YY')) AS month_number,
    DATE_PART('year', TO_DATE(week_date, 'DD/MM/YY')) AS calendar_year,
    region, 
    platform, 
    -- segment is passed through unchanged
    segment,
    -- Second character of segment maps to the age band; anything else is 'unknown'
    CASE 
      WHEN SUBSTRING(segment,2,1) = '1' THEN 'Young Adults'
      WHEN SUBSTRING(segment,2,1) = '2' THEN 'Middle Aged'
      WHEN SUBSTRING(segment,2,1) in ('3','4') THEN 'Retirees'
      ELSE 'unknown' END AS age_band,
    -- First character of segment maps to the demographic; anything else is 'unknown'
    CASE 
      WHEN SUBSTRING(segment,1,1) = 'C' THEN 'Couples'
      WHEN SUBSTRING(segment,1,1) = 'F' THEN 'Families'
      ELSE 'unknown' END AS demographic,
    transactions,
    -- Cast to NUMERIC to avoid integer division before rounding
    ROUND((sales::NUMERIC/transactions),2) AS avg_transaction,
    sales
  FROM data_mart.weekly_sales
  );
'''))

<sqlalchemy.engine.cursor.CursorResult at 0x17067bbc6e0>

In [9]:
q1_df = query('SELECT * FROM clean_weekly_sales')
q1_df

| | week_date | week_number | month_number | calendar_year | region | platform | segment | age_band | demographic | transactions | avg_transaction | sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-31 | 36.0 | 8.0 | 2020.0 | ASIA | Retail | C3 | Retirees | Couples | 120631 | 30.31 | 3656163 |
| 1 | 2020-08-31 | 36.0 | 8.0 | 2020.0 | ASIA | Retail | F1 | Young Adults | Families | 31574 | 31.56 | 996575 |
| 2 | 2020-08-31 | 36.0 | 8.0 | 2020.0 | USA | Retail | | unknown | unknown | 529151 | 31.20 | 16509610 |
| 3 | 2020-08-31 | 36.0 | 8.0 | 2020.0 | EUROPE | Retail | C1 | Young Adults | Couples | 4517 | 31.42 | 141942 |
| 4 | 2020-08-31 | 36.0 | 8.0 | 2020.0 | AFRICA | Retail | C2 | Middle Aged | Couples | 58046 | 30.29 | 1758388 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17112 | 2018-03-26 | 13.0 | 3.0 | 2018.0 | AFRICA | Retail | C3 | Retirees | Couples | 98342 | 37.69 | 3706066 |
| 17113 | 2018-03-26 | 13.0 | 3.0 | 2018.0 | USA | Shopify | C4 | Retirees | Couples | 16 | 174.00 | 2784 |
| 17114 | 2018-03-26 | 13.0 | 3.0 | 2018.0 | USA | Retail | F2 | Middle Aged | Families | 25665 | 41.46 | 1064172 |
| 17115 | 2018-03-26 | 13.0 | 3.0 | 2018.0 | EUROPE | Retail | C4 | Retirees | Couples | 883 | 37.96 | 33523 |


**B. Data Exploration**

Q2: What day of the week is used for each `week_date` value?

Q3: What range of week numbers are missing from the dataset?
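One possible way to approach Q2 and Q3 once the cleaned data is in a DataFrame is sketched here. The `sample` frame is hypothetical and holds only three rows, so the list of missing weeks it produces says nothing about the real dataset; `weekdays` and `missing_weeks` are illustrative names, and the 1 to 52 range is an assumption about which week numbers are expected.

```python
import pandas as pd

# Hypothetical sample of cleaned rows (week_number as stored in clean_weekly_sales)
sample = pd.DataFrame({
    "week_date": pd.to_datetime(["2020-08-31", "2018-03-26", "2019-06-17"]),
    "week_number": [36, 13, 25],
})

# Q2: distinct weekday names used by the week_date values
weekdays = sorted(sample["week_date"].dt.day_name().unique())
print(weekdays)

# Q3: week numbers in the assumed 1..52 range that never appear in the data
present = set(sample["week_number"])
missing_weeks = [week for week in range(1, 53) if week not in present]
print(missing_weeks)
```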

Q4: How many total transactions were there for each year in the dataset?

Q5: What is the total sales for each region for each month?

Q6: What is the total count of transactions for each platform?
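Q4 to Q6 are plain grouped totals, so a `groupby` followed by `sum` is one reasonable pattern. The sketch below uses a small hypothetical `sample` frame with made-up numbers, and it reads "total count of transactions" in Q6 as the sum of the `transactions` column rather than a count of rows, which is an interpretation on my part.

```python
import pandas as pd

# Hypothetical rows with the same column names as clean_weekly_sales
sample = pd.DataFrame({
    "calendar_year": [2018, 2018, 2019, 2020],
    "month_number": [3, 3, 6, 8],
    "region": ["AFRICA", "USA", "ASIA", "ASIA"],
    "platform": ["Retail", "Shopify", "Retail", "Retail"],
    "transactions": [100, 10, 200, 300],
    "sales": [4000, 1500, 7000, 9000],
})

# Q4: total transactions per year
transactions_per_year = sample.groupby("calendar_year")["transactions"].sum()

# Q5: total sales per region and month
sales_per_region_month = sample.groupby(["region", "month_number"])["sales"].sum()

# Q6: total transactions per platform
transactions_per_platform = sample.groupby("platform")["transactions"].sum()

print(transactions_per_year, sales_per_region_month, transactions_per_platform, sep="\n\n")
```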

Q7: What is the percentage of sales for Retail vs Shopify for each month?

Q8: What is the percentage of sales by demographic for each year in the dataset?
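Percentage-of-total questions such as Q7 and Q8 can be handled by pivoting the category into columns and dividing each row by its row total. The helper `sales_share` and the `sample` figures below are hypothetical; this is a sketch of the pattern, not the results for Data Mart.

```python
import pandas as pd

# Hypothetical monthly sales split by platform
sample = pd.DataFrame({
    "calendar_year": [2020, 2020, 2020, 2020],
    "month_number": [6, 6, 7, 7],
    "platform": ["Retail", "Shopify", "Retail", "Shopify"],
    "sales": [970, 30, 960, 40],
})


def sales_share(df: pd.DataFrame, period_cols: list, category_col: str) -> pd.DataFrame:
    """Return each category's percentage of sales within every period."""
    totals = (
        df.groupby(period_cols + [category_col])["sales"]
        .sum()
        .unstack(category_col, fill_value=0)
    )
    # Divide every row by its own total so the shares in a period sum to 100
    return (totals.div(totals.sum(axis=1), axis=0) * 100).round(2)


platform_share = sales_share(sample, ["calendar_year", "month_number"], "platform")
print(platform_share)
```

The same helper would cover Q8 by passing `["calendar_year"]` and `"demographic"`, assuming those columns are present in the frame.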

Q9: Which age_band and demographic values contribute the most to Retail sales?

Q10: Can we use the avg_transaction column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead?
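The difficulty behind Q10 is that averaging the row-level `avg_transaction` values gives every row the same weight, no matter how many transactions it represents. A small sketch makes the gap visible. The first row reuses the 16-transaction Shopify record shown in the output above; the second row is hypothetical, as are the variable names.

```python
import pandas as pd

sample = pd.DataFrame({
    "calendar_year": [2018, 2018],
    "platform": ["Shopify", "Shopify"],
    "transactions": [16, 1000],   # second row is made up for illustration
    "sales": [2784, 40000],
})
sample["avg_transaction"] = (sample["sales"] / sample["transactions"]).round(2)

grouped = sample.groupby(["calendar_year", "platform"])

# Naive approach: mean of per-row averages, which ignores transaction volume
naive = grouped["avg_transaction"].mean().round(2)

# Volume-weighted approach: total sales divided by total transactions
totals = grouped[["sales", "transactions"]].sum()
weighted = (totals["sales"] / totals["transactions"]).round(2)

print(pd.DataFrame({"naive": naive, "weighted": weighted}))
```

Here the naive mean is 107.00 while total sales over total transactions is 42.11, because the tiny 16-transaction row dominates the unweighted average.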

**C. Before and After Analysis**

This technique is typically used to measure the impact of an important event by comparing metrics before and after a certain point in time.

Take the `week_date` value of 2020-06-15 as the baseline week in which the Data Mart sustainable packaging changes came into effect.

All `week_date` values from 2020-06-15 onwards belong to the period after the change, and all earlier `week_date` values belong to the period before it.

Using this analysis approach - answer the following questions:

Q11: What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?

Q12: What about the entire 12 weeks before and after?
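The before-and-after comparison in Q11 and Q12 can be sketched as a small helper that sums sales in two equal windows around the baseline. `before_after` is a hypothetical function name, the weekly totals are invented, and the windows are assumed to be half-open: the baseline week counts as the first week of the "after" period.

```python
import pandas as pd


def before_after(df: pd.DataFrame, baseline: str, weeks: int) -> dict:
    """Compare total sales in the `weeks` weeks before and after `baseline`."""
    start = pd.Timestamp(baseline)
    span = pd.Timedelta(weeks=weeks)
    # Before: [baseline - span, baseline); After: [baseline, baseline + span)
    before = df.loc[(df["week_date"] >= start - span) & (df["week_date"] < start), "sales"].sum()
    after = df.loc[(df["week_date"] >= start) & (df["week_date"] < start + span), "sales"].sum()
    change = after - before
    return {
        "before": int(before),
        "after": int(after),
        "change": int(change),
        "pct_change": round(100 * float(change) / float(before), 2),
    }


# Hypothetical weekly totals: eight Mondays centred on 2020-06-15
sample = pd.DataFrame({
    "week_date": pd.date_range("2020-05-18", periods=8, freq="7D"),
    "sales": [100] * 4 + [95] * 4,
})

result = before_after(sample, "2020-06-15", weeks=4)
print(result)
```

Calling the same helper with `weeks=12` would address Q12, provided the data covers twelve weeks on each side of the baseline.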


Q13: How do the sales metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?
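For the year-over-year comparison in Q13, one option is to line the years up by week number instead of by date. The sketch below assumes ISO week numbers, as produced by `DATE_PART('week', ...)`; `before_after_by_year` and the sample figures are hypothetical. Note that the same week number covers slightly different calendar dates in each year, which is a known limitation of this approach.

```python
import pandas as pd

# ISO week number of the baseline date 2020-06-15
baseline_week = pd.Timestamp("2020-06-15").isocalendar()[1]


def before_after_by_year(df: pd.DataFrame, baseline_week: int, weeks: int) -> pd.DataFrame:
    """Total sales per year for `weeks` weeks before and after `baseline_week`."""
    in_window = df["week_number"].between(baseline_week - weeks, baseline_week + weeks - 1)
    window = df[in_window].copy()
    window["period"] = window["week_number"].ge(baseline_week).map({False: "before", True: "after"})
    totals = window.groupby(["calendar_year", "period"])["sales"].sum().unstack("period")
    totals["pct_change"] = ((totals["after"] - totals["before"]) / totals["before"] * 100).round(2)
    return totals[["before", "after", "pct_change"]]


# Hypothetical weekly totals for two years
sample = pd.DataFrame({
    "calendar_year": [2019, 2019, 2020, 2020],
    "week_number": [24, 25, 24, 25],
    "sales": [100, 110, 100, 90],
})

comparison = before_after_by_year(sample, baseline_week, weeks=1)
print(comparison)
```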


**D. Bonus Question**

Q14: Which areas of the business saw the largest negative impact on sales metrics in 2020 for the 12 weeks before and after period?
- region
- platform
- age_band
- demographic
- customer_type
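A possible shape for the bonus analysis is to repeat the before-and-after comparison once per dimension and sort by percentage change, so the most negatively affected values come first. `impact_by_dimension` is a hypothetical helper and the rows are invented. Also note that the `SELECT` used to build `clean_weekly_sales` above does not carry `customer_type` through, so that column would need to be added before it could be analysed this way.

```python
import pandas as pd


def impact_by_dimension(df: pd.DataFrame, dimension: str, baseline: str, weeks: int) -> pd.DataFrame:
    """Before/after sales per value of `dimension`, worst percentage change first."""
    start = pd.Timestamp(baseline)
    span = pd.Timedelta(weeks=weeks)
    window = df[(df["week_date"] >= start - span) & (df["week_date"] < start + span)]
    # Label each row by which side of the baseline it falls on
    window = window.assign(
        period=(window["week_date"] >= start).map({False: "before", True: "after"})
    )
    totals = window.groupby([dimension, "period"])["sales"].sum().unstack("period", fill_value=0)
    totals["pct_change"] = ((totals["after"] - totals["before"]) / totals["before"] * 100).round(2)
    return totals.sort_values("pct_change")


# Hypothetical rows: one week either side of the baseline for two regions
sample = pd.DataFrame({
    "week_date": pd.to_datetime(["2020-06-08", "2020-06-15", "2020-06-08", "2020-06-15"]),
    "region": ["ASIA", "ASIA", "USA", "USA"],
    "sales": [100, 90, 200, 210],
})

impact = impact_by_dimension(sample, "region", "2020-06-15", weeks=1)
print(impact)
```

Looping the helper over `["region", "platform", "age_band", "demographic", "customer_type"]` with `weeks=12` would give one ranked table per business area.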

Do you have any further recommendations for Danny’s team at Data Mart or any interesting insights based off this analysis?

**Conclusion**

This case study is based on a real-life change at Australian retailers, where plastic bags were no longer provided for free. As you can expect, some customers would have changed their shopping behaviour because of this change!

Analysing key events that can have a significant impact on sales or engagement metrics is always part of the data analytics menu. Learning how to approach these types of problems is a super valuable lesson, and hopefully these ideas can help you next time you’re faced with a tough problem like this in the workplace!