## Case Study #4 - Data Bank

#### Problem Statement
There is a new innovation in the financial industry called Neo-Banks: new-age, digital-only banks without physical branches.

Danny thought that there should be some sort of intersection between these new-age banks, cryptocurrency and the data world…so he decided to launch a new initiative - Data Bank!

Data Bank runs just like any other digital bank, but it isn’t only for banking activities: it also has the world’s most secure distributed data storage platform!

Customers are allocated cloud data storage limits which are directly linked to how much money they have in their accounts. There are a few interesting caveats that go with this business model, and this is where the Data Bank team need your help!

The management team at Data Bank want to increase their total customer base - but also need some help tracking just how much data storage their customers will need.

This case study is all about calculating metrics, growth and helping the business analyse their data in a smart way to better forecast and plan for their future developments!

#### Entity Relationship Diagram

![week4.png](week4.png)

Import modules

In [19]:
# SQL Engine imports
from dotenv import load_dotenv
import os
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy.sql import text

# Python data analysis imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

Initialize SQL

In [20]:
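# override=True lets values from .env win over variables the shell already sets (e.g. USER)
load_dotenv(override=True)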
user = os.environ.get("USER")
pw = os.environ.get("PASS")
db = os.environ.get("DB")
host = os.environ.get("HOST")
api = os.environ.get("API")
port = 5432
schema = 'data_bank'

In [21]:
uri = f"postgresql+psycopg2://{user}:{pw}@{host}:{port}/{db}"
alchemyEngine = create_engine(uri)
conn = alchemyEngine.connect()

Verify tables

In [22]:
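# Bind the schema name as a parameter rather than interpolating it into the SQL
rs = conn.execute(text("SELECT table_name FROM information_schema.tables WHERE table_schema = :schema"), {"schema": schema})
tables = [table[0] for table in rs.fetchall()]
# Nested same-type quotes inside an f-string need Python 3.12+, so join outside it
print('The tables in the database are: \n- ' + '\n- '.join(tables))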

The tables in the database are: 
- regions
- customer_nodes
- customer_transactions


Fetch table information

In [23]:
for table in tables:
    print("=================================")
    print(f'Table [{table}]')
    df = pd.read_sql_query(f'SELECT * FROM {schema}.{table} LIMIT 5', conn)
    print(f'Dimensions: {df.shape[0]} rows x {df.shape[1]} columns\n')
    print(df.head())
    info_df = pd.DataFrame.from_dict({'Datatypes':df.dtypes, 'NULL count':df.isna().sum()})
    print()
    print(info_df)
    print()

Table [regions]
Dimensions: 5 rows x 2 columns

   region_id region_name
0          1   Australia
1          2     America
2          3      Africa
3          4        Asia
4          5      Europe

            Datatypes  NULL count
region_id       int64           0
region_name    object           0

Table [customer_nodes]
Dimensions: 5 rows x 5 columns

   customer_id  region_id  node_id  start_date    end_date
0            1          3        4  2020-01-02  2020-01-03
1            2          3        5  2020-01-03  2020-01-17
2            3          5        4  2020-01-27  2020-02-18
3            4          5        4  2020-01-07  2020-01-19
4            5          3        3  2020-01-15  2020-01-23

            Datatypes  NULL count
customer_id     int64           0
region_id       int64           0
node_id         int64           0
start_date     object           0
end_date       object           0

Table [customer_transactions]
Dimensions: 5 rows x 4 columns

   customer_id    txn

In [24]:
def query(stmt: str) -> pd.DataFrame:
    """Execute a SQL statement and return the results as a pandas DataFrame.

    Parameters
    ----------
    stmt : str
        The SQL statement to be executed.

    Returns
    -------
    pd.DataFrame
        One row per record returned by the statement.
    """
    # Reading the module-level `conn` needs no `global` declaration
    return pd.read_sql_query(stmt, conn)

## Case Study Questions

The following case study questions start with some general data exploration of the nodes and transactions, then move on to the core business questions, and finish with a challenging final request!

**A. Customer Nodes Exploration**

Q1: How many unique nodes are there on the Data Bank system?

In [25]:
query('''
    SELECT COUNT(DISTINCT node_id)
    FROM data_bank.customer_nodes 
''')

| | count |
|---|---|
| 0 | 5 |


Q2: What is the number of nodes per region?

In [26]:
query('''
    SELECT r.region_id, r.region_name, COUNT(DISTINCT cn.node_id)
    FROM data_bank.regions r
        JOIN data_bank.customer_nodes cn USING (region_id)
    GROUP BY region_id, region_name
    ORDER BY region_id
''')

| | region_id | region_name | count |
|---|---|---|---|
| 0 | 1 | Australia | 5 |
| 1 | 2 | America | 5 |
| 2 | 3 | Africa | 5 |
| 3 | 4 | Asia | 5 |
| 4 | 5 | Europe | 5 |


Q3: How many customers are allocated to each region?

In [27]:
query('''
    SELECT 
        region_id, 
        COUNT(customer_id) AS customer_count
    FROM data_bank.customer_nodes
    GROUP BY region_id
    ORDER BY region_id
''')

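| | region_id | customer_count |
|---|---|---|
| 0 | 1 | 770 |
| 1 | 2 | 735 |
| 2 | 3 | 714 |
| 3 | 4 | 665 |
| 4 | 5 | 616 |

- Note: `COUNT(customer_id)` counts node-allocation rows, so a customer who appears in several rows of `customer_nodes` is counted once per row. `COUNT(DISTINCT customer_id)` would return the number of unique customers per region instead.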


Q4: How many days on average are customers reallocated to a different node?

- This can be solved by first computing the number of days a customer stayed in each node allocation, then summing those days per customer, region and node, and finally averaging the totals.

In [33]:
query('''
    WITH node_days AS (
        SELECT 
            customer_id, 
            region_id,
            node_id,
            end_date - start_date AS days_in_node
        FROM data_bank.customer_nodes
        WHERE end_date != '9999-12-31' -- Exclude the active records
    ), 
    total_node_days AS (
        SELECT 
            customer_id,
            region_id, 
            node_id,
            SUM(days_in_node) AS total_days_in_node
        FROM node_days
        GROUP BY customer_id, region_id, node_id
    )
    SELECT ROUND(AVG(total_days_in_node), 2) AS avg_node_reallocation_days
    FROM total_node_days;
''')

| | avg_node_reallocation_days |
|---|---|
| 0 | 23.57 |
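
The aggregation steps in the query can be traced by hand on a few rows. The sketch below is plain Python on invented `customer_nodes` rows (the values are hypothetical, not taken from the database); it only illustrates the logic and is not a replacement for the SQL.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (customer_id, region_id, node_id, start_date, end_date)
rows = [
    (1, 3, 4, date(2020, 1, 2), date(2020, 1, 3)),     # 1 day
    (1, 3, 4, date(2020, 1, 4), date(2020, 1, 14)),    # 10 days, same node again
    (1, 3, 2, date(2020, 1, 15), date(2020, 1, 20)),   # 5 days
    (2, 3, 5, date(2020, 1, 3), date(2020, 1, 17)),    # 14 days
    (2, 3, 1, date(2020, 1, 18), date(9999, 12, 31)),  # active record, excluded
]

ACTIVE = date(9999, 12, 31)  # sentinel end date used for open allocations

# Steps 1 and 2: days per stay, summed per (customer, region, node)
total_days = defaultdict(int)
for customer_id, region_id, node_id, start, end in rows:
    if end == ACTIVE:
        continue
    total_days[(customer_id, region_id, node_id)] += (end - start).days

# Step 3: average of the per-group totals
avg_days = round(sum(total_days.values()) / len(total_days), 2)
print(avg_days)
```

Because repeat stays on the same node are summed before averaging, this figure is larger than a plain average over individual stays would be.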


Q5: What is the median, 80th and 95th percentile for this same reallocation days metric for each region?

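- Same approach as Q4, but use PERCENTILE_CONT() to get the percentile values. Note that the query below pools all regions together; for the per-region breakdown the question asks for, add region_id to the SELECT list and GROUP BY region_id.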

In [32]:
query('''
    WITH node_days AS (
        SELECT 
            customer_id, 
            region_id,
            node_id,
            end_date - start_date AS days_in_node
        FROM data_bank.customer_nodes
        WHERE end_date != '9999-12-31' -- Exclude the active records
    ), 
    total_node_days AS (
        SELECT 
            customer_id,
            region_id, 
            node_id,
            SUM(days_in_node) AS total_days_in_node
        FROM node_days
        GROUP BY customer_id, region_id, node_id
    )
    SELECT
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY total_days_in_node) AS median_days,
        PERCENTILE_CONT(0.80) WITHIN GROUP (ORDER BY total_days_in_node) AS perc_80_days,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_days_in_node) AS perc_95_days
    FROM
        total_node_days
''')

| | median_days | perc_80_days | perc_95_days |
|---|---|---|---|
| 0 | 22.0 | 34.0 | 52.0 |


- Median: 22 days
- 80th percentile: 34 days
- 95th percentile: 52 days
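
`PERCENTILE_CONT` returns a linearly interpolated value between the two nearest rows of the ordered set. The sketch below reproduces that rule in plain Python and applies it per region. The helper name `percentile_cont` and the sample pairs are hypothetical and only serve to show how a per-region breakdown would behave.

```python
import math
from collections import defaultdict

def percentile_cont(values, fraction):
    """Linearly interpolated percentile over the sorted values."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)   # zero-based fractional index
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Hypothetical (region_id, total_days_in_node) pairs
samples = [(1, 10), (1, 20), (1, 30), (1, 40), (1, 50), (2, 5), (2, 15), (2, 45)]

by_region = defaultdict(list)
for region_id, days in samples:
    by_region[region_id].append(days)

for region_id, days in sorted(by_region.items()):
    median, p80, p95 = (percentile_cont(days, f) for f in (0.50, 0.80, 0.95))
    print(region_id, round(median, 2), round(p80, 2), round(p95, 2))
```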

**B. Customer Transactions**

Q6: What is the unique count and total amount for each transaction type?

In [34]:
query('''
    SELECT
        txn_type, 
        COUNT(customer_id) AS num_transactions, 
        SUM(txn_amount) AS total_amount
    FROM data_bank.customer_transactions
    GROUP BY txn_type;
''')

| | txn_type | num_transactions | total_amount |
|---|---|---|---|
| 0 | purchase | 1617 | 806537 |
| 1 | withdrawal | 1580 | 793003 |
| 2 | deposit | 2671 | 1359168 |


Q7: What is the average total historical deposit counts and amounts for all customers?
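
One possible reading of Q7 is: per customer, count the deposits and average their amounts, then average both figures across customers. The sketch below tries that reading against an in-memory SQLite table with invented rows; `demo` and the sample data are hypothetical, and the same CTE should carry over to PostgreSQL with the `data_bank.` schema prefix.

```python
import sqlite3

# Throwaway in-memory database with a few hypothetical transactions
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE customer_transactions "
    "(customer_id INTEGER, txn_type TEXT, txn_amount INTEGER)"
)
demo.executemany(
    "INSERT INTO customer_transactions VALUES (?, ?, ?)",
    [
        (1, "deposit", 100),
        (1, "deposit", 300),
        (1, "purchase", 50),
        (2, "deposit", 400),
        (2, "withdrawal", 20),
    ],
)

# Per-customer deposit count and mean amount, then the average of both
avg_count, avg_amount = demo.execute("""
    WITH deposits AS (
        SELECT customer_id,
               COUNT(*) AS deposit_count,
               AVG(txn_amount) AS avg_deposit_amount
        FROM customer_transactions
        WHERE txn_type = 'deposit'
        GROUP BY customer_id
    )
    SELECT AVG(deposit_count), AVG(avg_deposit_amount)
    FROM deposits
""").fetchone()
print(avg_count, avg_amount)
```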

Q8: For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?
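
A sketch of one way to approach Q8, assuming the condition means "more than one deposit and at least one purchase or withdrawal in the same month". The rows are invented, and the date column is assumed to be called `txn_date`, which the truncated table preview above does not confirm.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical rows: (customer_id, txn_date, txn_type)
txns = [
    (1, date(2020, 1, 2), "deposit"),
    (1, date(2020, 1, 9), "deposit"),
    (1, date(2020, 1, 20), "purchase"),
    (2, date(2020, 1, 5), "deposit"),
    (2, date(2020, 1, 6), "withdrawal"),
    (3, date(2020, 2, 1), "deposit"),
    (3, date(2020, 2, 3), "deposit"),
    (3, date(2020, 2, 9), "deposit"),
]

# Transaction-type counts per (year, month, customer)
counts = defaultdict(Counter)
for customer_id, txn_date, txn_type in txns:
    counts[(txn_date.year, txn_date.month, customer_id)][txn_type] += 1

# Customers per month meeting the assumed condition
qualifying = Counter()
for (year, month, customer_id), c in counts.items():
    if c["deposit"] > 1 and (c["purchase"] >= 1 or c["withdrawal"] >= 1):
        qualifying[(year, month)] += 1

print(dict(qualifying))
```

In SQL the same idea is a per-customer, per-month `GROUP BY` with conditional counts, followed by a filter and a count per month.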

Q9: What is the closing balance for each customer at the end of the month?
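
A minimal sketch of the closing-balance logic, assuming deposits add to the balance while purchases and withdrawals subtract from it. The rows and the `txn_date` column name are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (customer_id, txn_date, txn_type, txn_amount)
txns = [
    (1, date(2020, 1, 2), "deposit", 300),
    (1, date(2020, 1, 20), "purchase", 100),
    (1, date(2020, 2, 5), "withdrawal", 50),
    (1, date(2020, 3, 1), "deposit", 25),
    (2, date(2020, 1, 3), "deposit", 500),
    (2, date(2020, 3, 10), "purchase", 200),
]

# Net change per (customer, year, month): deposits add, everything else subtracts
net_change = defaultdict(int)
for customer_id, txn_date, txn_type, amount in txns:
    signed = amount if txn_type == "deposit" else -amount
    net_change[(customer_id, txn_date.year, txn_date.month)] += signed

# Closing balance = running total of the monthly net changes per customer
closing = {}
balance = defaultdict(int)
for key in sorted(net_change):
    customer_id = key[0]
    balance[customer_id] += net_change[key]
    closing[key] = balance[customer_id]

for key, value in closing.items():
    print(key, value)
```

Months without any transaction (customer 2 in February here) produce no row in this sketch; a full solution would carry the previous balance forward for those months.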

Q10: What is the percentage of customers who increase their closing balance by more than 5%?
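
The question does not say which two balances to compare. The sketch below assumes the comparison is between each customer's first and last monthly closing balance, and treats customers with a non-positive starting balance as not qualifying, since a percentage change is not meaningful there. The balances and the helper `grew_more_than` are hypothetical.

```python
# Hypothetical monthly closing balances per customer, ordered by month
closing_balances = {
    1: [200, 150, 175],   # -12.5%
    2: [500, 520, 560],   # +12%
    3: [100, 104],        # +4%
    4: [400, 430],        # +7.5%
}

def grew_more_than(balances, threshold=0.05):
    """True when the last balance exceeds the first by more than `threshold`."""
    first, last = balances[0], balances[-1]
    # Guard against division by zero and sign flips on negative balances
    return first > 0 and (last - first) / first > threshold

growers = [cid for cid, b in closing_balances.items() if grew_more_than(b)]
pct = 100 * len(growers) / len(closing_balances)
print(growers, pct)
```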

**C: Data Allocation Challenge**

Q11: To test out a few different hypotheses - the Data Bank team wants to run an experiment where different groups of customers would be allocated data using 3 different options:
- Option 1: data is allocated based off the amount of money at the end of the previous month
- Option 2: data is allocated on the average amount of money kept in the account in the previous 30 days
- Option 3: data is updated real-time

For this multi-part challenge question - you have been requested to generate the following data elements to help the Data Bank team estimate how much data will need to be provisioned for each option:
- running customer balance column that includes the impact of each transaction
- customer balance at the end of each month
- minimum, average and maximum values of the running balance for each customer

Using all of the data available - how much data would have been required for each option on a monthly basis?
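
The three requested data elements can be sketched for a single customer as follows. The signed transactions are invented, with deposits positive and purchases or withdrawals negative. Under this reading, the running balance corresponds to the real-time option and the month-end balance to the option based on the previous month.

```python
from itertools import accumulate
from statistics import mean

# Hypothetical signed transactions for one customer, ordered by date:
# (ISO date string, signed amount)
signed_txns = [
    ("2020-01-02", 300),
    ("2020-01-20", -100),
    ("2020-02-05", -50),
    ("2020-03-01", 25),
]

# Running balance after each transaction
amounts = [amount for _, amount in signed_txns]
running = list(accumulate(amounts))

# Month-end balance: the last running balance seen in each "YYYY-MM"
month_end = {}
for (txn_date, _), balance in zip(signed_txns, running):
    month_end[txn_date[:7]] = balance

# Minimum, average and maximum of the running balance
stats = (min(running), mean(running), max(running))
print(running, month_end, stats)
```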

**D. Extra Challenge**

Data Bank wants to try another option which is a bit more difficult to implement - they want to calculate data growth using an interest calculation, just like in a traditional savings account you might have with a bank.

If the annual interest rate is set at 6% and the Data Bank team wants to reward its customers by increasing their data allocation based off the interest calculated on a daily basis at the end of each day, how much data would be required for this option on a monthly basis?

Special notes:
- Data Bank wants an initial calculation which does not allow for compounding interest, however they may also be interested in a daily compounding interest calculation so you can try to perform this calculation if you have the stamina!
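
The two interest variants differ only in whether each day's interest is added back to the balance. The sketch below assumes a 365-day year and a balance that stays constant over the period; both function names and the example balance are hypothetical.

```python
ANNUAL_RATE = 0.06
DAILY_RATE = ANNUAL_RATE / 365   # assumption: 365-day year

def simple_interest(balance, days):
    """Non-compounding: the same daily interest is earned every day."""
    return balance * DAILY_RATE * days

def compound_interest(balance, days):
    """Daily compounding: each day's interest is added to the balance."""
    return balance * ((1 + DAILY_RATE) ** days - 1)

# Interest earned on a constant balance of 1000 over a 31-day month
print(round(simple_interest(1000, 31), 2))
print(round(compound_interest(1000, 31), 2))
```

With real data the balance changes from day to day, so the daily interest would be computed on each day's closing balance and summed per month.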

**E. Extension Request**

The Data Bank team wants you to use the outputs generated from the above sections to create a quick PowerPoint presentation which will be used as marketing material for both external investors who might want to buy Data Bank shares and prospective customers who might want to bank with Data Bank.
1. Using the outputs generated from the customer node questions, generate a few headline insights which Data Bank might use to market its world-leading security features to potential investors and customers.
2. With the transaction analysis, prepare a one-page presentation slide which contains all the relevant information about the various options for data provisioning so the Data Bank management team can make an informed decision.

**Conclusion**

This case study aims to mimic traditional banking style transactions data but with a twist - hopefully it can give you some insight into the types of datasets you might encounter in a customer banking scenario.