# Midterm Project: ETL and Data Warehousing
Madelyn Khoury (mgk5ybb) and Tiara Allard (tia4qp)

DS 2002 Spring 2023

## Design and Strategy

We chose bank transactions as the core business process of our data warehouse and designed the database schema around them. The schema we designed for this project is shown in the README of our GitHub project.

Our schema stores information about bank transactions, bank accounts and users involved in transactions, transaction dates, and transaction locations. We were unable to find a database/dataset with all this information, so instead we combined data from multiple different data sources. This had the added benefit of allowing us to meet the requirements for importing data from a number of sources.

We combined several dummy/randomly generated datasets to build a complete data warehouse. First, we got bank transaction and account information from an .xlsx file stored on our local filesystem. Then, we generated user data from an API and linked it to the accounts and transactions. Finally, we imported location information from the Northwind MySQL database to represent regions in which banking transactions might have occurred.

After processing the data and computing useful fields, we formatted it into our fact and dimension tables in the final data warehouse.

## Imports and Helper Functions

In [None]:
import sys
!{sys.executable} -m pip install openpyxl
!{sys.executable} -m pip install mysql-connector-python
!{sys.executable} -m pip install pymysql
!{sys.executable} -m pip install sqlalchemy
!{sys.executable} -m pip install uszipcode

In [None]:
import datetime
import json
import mysql.connector
import os
import pandas as pd
import pymysql
import random
import requests
from sqlalchemy import create_engine
from uszipcode import SearchEngine, SimpleZipcode

In [3]:
def get_api_response(url, headers, params, response_type):
    try:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
    
    except requests.exceptions.HTTPError as errh:
        return "An Http Error occurred: " + repr(errh)
    except requests.exceptions.ConnectionError as errc:
        return "An Error Connecting to the API occurred: " + repr(errc)
    except requests.exceptions.Timeout as errt:
        return "A Timeout Error occurred: " + repr(errt)
    except requests.exceptions.RequestException as err:
        return "An Unknown Error occurred: " + repr(err)

    if response_type == 'json':
        # result = json.dumps(response.json(), sort_keys=True, indent=4)
        result = response.json()
    elif response_type == 'dataframe':
        result = pd.json_normalize(response.json())
    else:
        result = "An unhandled error has occurred!"
        
    return result
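
When a request fails, `get_api_response` returns an error message as a plain string instead of raising, so a caller that expects a DataFrame could end up holding a string without noticing. A minimal sketch of a guard is shown below. The helper name `ensure_not_error` is hypothetical, and the sketch assumes a successful response is never itself a bare string.

```python
def ensure_not_error(result):
    """Raise if `result` looks like an error string from get_api_response.

    Hypothetical helper: assumes successful results are dicts, lists, or
    DataFrames, never plain strings.
    """
    if isinstance(result, str):
        raise RuntimeError(result)
    return result
```

In the notebook this could wrap the call, for example `users = ensure_not_error(get_api_response(url, headers, querystring, "dataframe"))`, so a failed request stops the pipeline early rather than surfacing later as a confusing attribute error.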

In [61]:
# this helper function is inspired by part of one of the provided files, 02-Python-MySQL.ipynb
def get_mysql_dataframe(user_id, pwd, host_name, db_name, sql_query):
    conn_str = f"mysql+pymysql://{user_id}:{pwd}@{host_name}/{db_name}"
    try:
        sqlEngine = create_engine(conn_str, pool_recycle=3600)
        connection = sqlEngine.connect()
    except Exception:
        print("Unable to connect to the MySQL database.")
        return None

    dframe = None
    try:
        dframe = pd.read_sql(sql_query, connection)
    except Exception:
        print("SQL query was unsuccessful.")
    finally:
        connection.close()  # always release the connection
    return dframe

In [87]:
def execute_mysql_command(user_id, pwd, host_name, db_name, sql_query, use_db):
    try:
        if use_db:
            conn = pymysql.connect(host=host_name, user=user_id, password=pwd, database=db_name)
        else:
            conn = pymysql.connect(host=host_name, user=user_id, password=pwd)
    except pymysql.MySQLError:
        print("Unable to connect to database.")
        return  # no connection was opened, so there is nothing to close

    try:
        with conn.cursor() as cursor:
            cursor.execute(sql_query)
            for row in cursor.fetchall():
                print(row)
        conn.commit()  # persist any changes made by the command
    except pymysql.MySQLError:
        print("Cannot execute command.")
    finally:
        conn.close()

In [171]:
def insert_data_to_mysql(user_id, pwd, host_name, db_name, my_dataframe, table_name):
    conn_str = f"mysql+pymysql://{user_id}:{pwd}@{host_name}/{db_name}"
    sqlEngine = create_engine(conn_str, pool_recycle=3600)
    connection = sqlEngine.connect()
    try:
        # write to the database named by db_name rather than a hard-coded schema
        my_dataframe.to_sql(table_name, con=connection, schema=db_name, if_exists='replace')
    finally:
        connection.close()  # close the connection even if the insert fails

In [7]:
# this code snippet is modified from: https://www.geeksforgeeks.org/python-program-to-calculate-age-in-year/ 
def calculate_age(birth_date):
    birth_date = datetime.datetime.strptime(birth_date, '%Y-%m-%d').date()
    today = datetime.date.today()
    try:
        birthday = birth_date.replace(year = today.year)
 
    # raised when birth date is February 29 but it's not a leap year
    except ValueError:
        birthday = birth_date.replace(year = today.year,
                  month = birth_date.month + 1, day = 1) # birth date becomes march 1st
 
    if birthday > today:
        return today.year - birth_date.year - 1
    else:
        return today.year - birth_date.year

In [8]:
def get_region(state_abbreviation):
    # Hawaii is grouped with the West and Indiana with the Midwest;
    # any abbreviation not listed below falls through to "Southeast"
    if state_abbreviation in {"WA", "OR", "CA", "ID", "MT", "NV", "UT", "CO", "WY", "AK", "HI"}:
        return "West"
    elif state_abbreviation in {"AZ", "NM", "TX", "OK"}:
        return "Southwest"
    elif state_abbreviation in {"ND", "SD", "NE", "KS", "MN", "IA", "MO", "WI", "IL", "IN", "MI", "OH"}:
        return "Midwest"
    elif state_abbreviation in {"ME", "NH", "MA", "CT", "RI", "VT", "NY", "PA", "DE", "MD", "NJ"}:
        return "Northeast"
    else:
        return "Southeast"

In [9]:
def get_zipcode(city, state):
    search = SearchEngine()
    results = search.by_city_and_state(city, state)
    if len(results) > 0:
        return results[0].zipcode
    else:
        return None

In [10]:
def get_day(date_timestamp):
    return date_timestamp.day

In [11]:
def get_month(date_timestamp):
    return date_timestamp.month_name()

In [12]:
def get_year(date_timestamp):
    return date_timestamp.year

In [13]:
def get_week_day(date_timestamp):
    return date_timestamp.day_name()

## Loading in Data
In this section, we will be loading in the data from three sources: an API, a local filesystem, and a relational database.

### Importing Data From Local File System

The core bank transaction information that we will use came from a dataset on Kaggle (https://www.kaggle.com/datasets/apoorvwatsky/bank-transaction-data). We downloaded the data as an .xlsx file and will import it from the local filesystem for use in our data warehouse.

In [14]:
bank_info_path = os.path.join(os.getcwd(), 'bank.xlsx')
bank_info = pd.read_excel(bank_info_path)

In [15]:
bank_info

Unnamed: 0,Account No,DATE,TRANSACTION DETAILS,CHQ.NO.,VALUE DATE,WITHDRAWAL AMT,DEPOSIT AMT,BALANCE AMT,.
0,409000611074',2017-06-29,TRF FROM Indiaforensic SERVICES,,2017-06-29,,1000000.0,1.000000e+06,.
1,409000611074',2017-07-05,TRF FROM Indiaforensic SERVICES,,2017-07-05,,1000000.0,2.000000e+06,.
2,409000611074',2017-07-18,FDRL/INTERNAL FUND TRANSFE,,2017-07-18,,500000.0,2.500000e+06,.
3,409000611074',2017-08-01,TRF FRM Indiaforensic SERVICES,,2017-08-01,,3000000.0,5.500000e+06,.
4,409000611074',2017-08-16,FDRL/INTERNAL FUND TRANSFE,,2017-08-16,,500000.0,6.000000e+06,.
...,...,...,...,...,...,...,...,...,...
116196,409000362497',2019-03-05,TRF TO 1196428 Indiaforensic SE,,2019-03-05,117934.30,,-1.901902e+09,.
116197,409000362497',2019-03-05,FDRL/INTERNAL FUND TRANSFE,,2019-03-05,,300000.0,-1.901602e+09,.
116198,409000362497',2019-03-05,FDRL/INTERNAL FUND TRANSFE,,2019-03-05,,300000.0,-1.901302e+09,.
116199,409000362497',2019-03-05,IMPS 05-03-20194C,,2019-03-05,109868.65,,-1.901412e+09,.


### Importing Data From API

We've chosen to use the `users` endpoint from random-data-api.com, which randomly generates data for a set of users. This will populate the Users table in our data warehouse.

In [16]:
size = 10 # only get info on 10 users for now
url = "https://random-data-api.com/api/v2/users"
querystring = {"size":size}
headers = None

# Get information from users API endpoint
users = get_api_response(url, headers, querystring, "dataframe")
users

Unnamed: 0,id,uid,password,first_name,last_name,username,email,avatar,gender,phone_number,...,address.zip_code,address.state,address.country,address.coordinates.lat,address.coordinates.lng,credit_card.cc_number,subscription.plan,subscription.status,subscription.payment_method,subscription.term
0,6845,af2a3fb9-230a-4666-bdaf-45cf86f62f03,ufGcrUyZsJ,Lindsey,Conn,lindsey.conn,lindsey.conn@email.com,https://robohash.org/eligendidoloremvelit.png?...,Bigender,+1-868 (212) 661-0150 x039,...,85988-9294,South Carolina,United States,-29.349517,-14.472536,5455-4430-4964-2615,Gold,Pending,Bitcoins,Full subscription
1,7366,0b9d68ce-868c-47a5-8302-923dce65b0eb,SzZfDHlKrI,Rod,Murray,rod.murray,rod.murray@email.com,https://robohash.org/nesciuntvoluptatequaerat....,Female,+681 638.361.2045 x9240,...,46698-4879,Virginia,United States,-12.624925,142.887339,4950957756465,Essential,Blocked,Credit card,Annual
2,3009,d53ec7f8-724a-4ca4-b689-d8ce4ab36d26,0nBWb3Siom,Nolan,Lowe,nolan.lowe,nolan.lowe@email.com,https://robohash.org/quoprovidentiure.png?size...,Bigender,+39 841.543.3175 x805,...,81961,Iowa,United States,-8.440843,-8.758281,6771-8968-1581-6215,Student,Blocked,Debit card,Monthly
3,6777,997c1582-fc1b-4de4-b5db-94ba3c67709d,KFiQZMGtyR,Max,Ullrich,max.ullrich,max.ullrich@email.com,https://robohash.org/eumvitaeomnis.png?size=30...,Bigender,+592 549.385.9859 x17938,...,69405-4526,Texas,United States,76.091274,-57.626874,4502973503650,Basic,Idle,Alipay,Payment in advance
4,5670,1cda5f16-0be6-4039-afd5-dde3736c3fea,LhAvoIDdCM,Flavia,Hayes,flavia.hayes,flavia.hayes@email.com,https://robohash.org/remquisut.png?size=300x30...,Male,+1-784 (965) 412-6409 x0195,...,95552,Kentucky,United States,16.263784,176.621506,6771-8933-3311-9321,Gold,Active,Debit card,Annual
5,9299,26e4ce71-38ce-46b8-a546-00cd01d95010,JSdDK8finP,Katia,Abbott,katia.abbott,katia.abbott@email.com,https://robohash.org/etnisinatus.png?size=300x...,Agender,+60 (504) 350-9695 x9261,...,66255,New Hampshire,United States,-4.047111,117.735033,4921-5698-5798-0041,Diamond,Idle,Credit card,Monthly
6,158,cccb614e-4fc4-4aa8-9fae-c847d3f45d86,dRKqnuzHpm,Lindsay,Mueller,lindsay.mueller,lindsay.mueller@email.com,https://robohash.org/harumasperioresconsectetu...,Non-binary,+1-784 1-135-419-8053 x46078,...,27931-7774,New Hampshire,United States,41.168059,-156.580402,4442-0440-6515-6024,Platinum,Active,WeChat Pay,Annual
7,5592,1e442bff-8878-4439-bed0-4ac704be435c,5PIY3XbQTH,Rebbeca,Bins,rebbeca.bins,rebbeca.bins@email.com,https://robohash.org/ducimusconsequunturid.png...,Male,+233 (831) 721-3592 x3886,...,46940,Arizona,United States,79.014294,-125.583101,4394-6868-9299-9063,Platinum,Idle,Cheque,Monthly
8,1596,f945d0e0-f9d8-4653-a2f2-f4e20e4c7c92,5zvH1khJUW,Rossana,Kunze,rossana.kunze,rossana.kunze@email.com,https://robohash.org/voluptatemquaset.png?size...,Non-binary,+886 1-496-574-2931 x021,...,07627-5639,Ohio,United States,61.066468,28.641357,6771-8933-1074-9025,Platinum,Pending,Apple Pay,Monthly
9,3648,1d859e2a-455a-4eb7-8a29-832297f2dd5f,mVtAP20icW,Carlo,Johnston,carlo.johnston,carlo.johnston@email.com,https://robohash.org/quodnesciuntomnis.png?siz...,Male,+249 450-418-2590,...,45994-9170,Missouri,United States,69.651114,89.832842,4935449282097,Gold,Idle,Credit card,Monthly


### Importing Data From Relational Database

To get location data that could represent where bank transactions were completed, we've decided to import the shipping locations of orders from the `orders` table in the Northwind database. In our data warehouse, this information will represent the location in which a customer initiated a banking transaction; for example, it could represent the location of a physical branch of the bank.

In [47]:
# define variables to set up connection to mySQL database
host_name = "localhost"
host_ip = "127.0.0.1"
port = "3306"

user_id = "ds2002"
pwd = "UVA!1819"
db_name = "northwind"

First we must get the location-related data from the `orders` table.

In [18]:
sql_query = """
    SELECT ship_address, ship_city, ship_state_province, ship_zip_postal_code, ship_country_region from orders;
"""

In [19]:
locations_info = get_mysql_dataframe(user_id, pwd, host_name, db_name, sql_query)

## Transforming/Cleaning Up the Data
In this section, we will clean and transform the data, then separate it into several tables that we can easily import into our data warehouse.

### Transforming the Location Data

We got location info from the Northwind database, but we must remove duplicate values so that we have a table of unique locations.

In [20]:
locations_info = locations_info.drop_duplicates()

It appears that the Northwind database didn't store actual zip codes, but instead put 99999 in for every row. So, we will fill in the table with the correct zip code for each city listed. Additionally, we will add another column to the table which will identify the region of the United States that the location is in.

In [21]:
locations_info["zipcode"] = locations_info.apply(lambda row: get_zipcode(row["ship_city"], row["ship_state_province"]), axis=1)
locations_info.drop(["ship_zip_postal_code"], axis = 1, inplace = True)

In [22]:
locations_info["region"] = locations_info.apply(lambda row: get_region(row["ship_state_province"]), axis=1)
locations_info

Unnamed: 0,ship_address,ship_city,ship_state_province,ship_country_region,zipcode,region
0,789 27th Street,Las Vegas,NV,USA,89101,West
1,123 4th Street,New York,NY,USA,10001,Northeast
2,123 12th Street,Las Vegas,NV,USA,89101,West
3,123 8th Street,Portland,OR,USA,97201,West
5,789 29th Street,Denver,CO,USA,80202,West
6,123 3rd Street,Los Angelas,CA,USA,90001,West
7,123 6th Street,Milwaukee,WI,USA,53202,Midwest
8,789 28th Street,Memphis,TN,USA,38103,Southeast
10,123 10th Street,Chicago,IL,USA,60601,Midwest
11,123 7th Street,Boise,ID,USA,83702,West


### Transforming the User Data

To transform the user data, all we had to do was drop some columns of the dataframe. As we can see by looking at the columns of users, there is a lot of superfluous information.

In [23]:
users.columns

Index(['id', 'uid', 'password', 'first_name', 'last_name', 'username', 'email',
       'avatar', 'gender', 'phone_number', 'social_insurance_number',
       'date_of_birth', 'employment.title', 'employment.key_skill',
       'address.city', 'address.street_name', 'address.street_address',
       'address.zip_code', 'address.state', 'address.country',
       'address.coordinates.lat', 'address.coordinates.lng',
       'credit_card.cc_number', 'subscription.plan', 'subscription.status',
       'subscription.payment_method', 'subscription.term'],
      dtype='object')

In [24]:
users.drop(['employment.title', 'employment.key_skill', 'uid','avatar', 'social_insurance_number', 'subscription.plan', 'subscription.payment_method', 'subscription.status', 'subscription.term', 'address.city', 'address.street_name', 'address.street_address', 'address.zip_code', 'address.state', 'address.country', 'address.coordinates.lat', 'address.coordinates.lng'], axis = 1, inplace = True)

We will also calculate the age of each user so that the value is stored in our OLAP database and does not have to be computed on the fly.

In [25]:
users["age"] = users.apply(lambda row: calculate_age(row["date_of_birth"]), axis=1)
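
Because `calculate_age` reads today's date, the stored ages depend on when the notebook is run. A minimal sketch of a variant that takes the reference date as an argument is shown below; the function name `age_on` and the parameter `as_of` are hypothetical and not part of the notebook. It treats a February 29 birthday as falling on March 1 in non-leap years, matching the helper defined earlier.

```python
import datetime


def age_on(birth_date: str, as_of: datetime.date) -> int:
    """Age in whole years on `as_of` for a 'YYYY-MM-DD' birth date string."""
    born = datetime.datetime.strptime(birth_date, "%Y-%m-%d").date()
    # Comparing (month, day) tuples avoids building an invalid Feb 29 date:
    # in a non-leap year a Feb 29 birthday counts as reached on March 1.
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    return as_of.year - born.year - (0 if had_birthday else 1)


# Example with an assumed reference date of 2023-03-15
reference_date = datetime.date(2023, 3, 15)
example_age = age_on("2002-12-15", reference_date)
```

Passing a fixed `as_of` date makes the computed column reproducible across runs, at the cost of having to choose and record that date.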

In [26]:
users

Unnamed: 0,id,password,first_name,last_name,username,email,gender,phone_number,date_of_birth,credit_card.cc_number,age
0,6845,ufGcrUyZsJ,Lindsey,Conn,lindsey.conn,lindsey.conn@email.com,Bigender,+1-868 (212) 661-0150 x039,2002-12-15,5455-4430-4964-2615,20
1,7366,SzZfDHlKrI,Rod,Murray,rod.murray,rod.murray@email.com,Female,+681 638.361.2045 x9240,1985-11-24,4950957756465,37
2,3009,0nBWb3Siom,Nolan,Lowe,nolan.lowe,nolan.lowe@email.com,Bigender,+39 841.543.3175 x805,1982-09-23,6771-8968-1581-6215,40
3,6777,KFiQZMGtyR,Max,Ullrich,max.ullrich,max.ullrich@email.com,Bigender,+592 549.385.9859 x17938,1985-07-21,4502973503650,37
4,5670,LhAvoIDdCM,Flavia,Hayes,flavia.hayes,flavia.hayes@email.com,Male,+1-784 (965) 412-6409 x0195,1983-05-10,6771-8933-3311-9321,39
5,9299,JSdDK8finP,Katia,Abbott,katia.abbott,katia.abbott@email.com,Agender,+60 (504) 350-9695 x9261,1993-10-06,4921-5698-5798-0041,29
6,158,dRKqnuzHpm,Lindsay,Mueller,lindsay.mueller,lindsay.mueller@email.com,Non-binary,+1-784 1-135-419-8053 x46078,2002-04-21,4442-0440-6515-6024,20
7,5592,5PIY3XbQTH,Rebbeca,Bins,rebbeca.bins,rebbeca.bins@email.com,Male,+233 (831) 721-3592 x3886,1972-02-02,4394-6868-9299-9063,51
8,1596,5zvH1khJUW,Rossana,Kunze,rossana.kunze,rossana.kunze@email.com,Non-binary,+886 1-496-574-2931 x021,1980-03-11,6771-8933-1074-9025,43
9,3648,mVtAP20icW,Carlo,Johnston,carlo.johnston,carlo.johnston@email.com,Male,+249 450-418-2590,1995-06-14,4935449282097,27


### Transforming the Account Information

Reviewing the bank_info table we made, we can see that it has columns relating not just to a single transaction, but also to the date and account associated with the transaction.

In [27]:
bank_info.columns

Index(['Account No', 'DATE', 'TRANSACTION DETAILS', 'CHQ.NO.', 'VALUE DATE',
       'WITHDRAWAL AMT', 'DEPOSIT AMT', 'BALANCE AMT', '.'],
      dtype='object')

We will separate this data into three tables: a Transactions fact table, an Accounts dimension, and a Date dimension. 
The Transactions fact table can store the new balance of an account after each transaction, but we want the Accounts table to store current information about each account, independent of any single transaction. So, we will take the most recent balance for each account and create a "Current Balance" field in the Accounts table. Because the rows of the spreadsheet are sorted by transaction date, the most recent balance for each account is the balance in its last-occurring transaction.

In [28]:
account_info = bank_info[["Account No", "BALANCE AMT"]]
account_info

Unnamed: 0,Account No,BALANCE AMT
0,409000611074',1.000000e+06
1,409000611074',2.000000e+06
2,409000611074',2.500000e+06
3,409000611074',5.500000e+06
4,409000611074',6.000000e+06
...,...,...
116196,409000362497',-1.901902e+09
116197,409000362497',-1.901602e+09
116198,409000362497',-1.901302e+09
116199,409000362497',-1.901412e+09


In [29]:
account_info = account_info.drop_duplicates(subset=["Account No"], keep="last") # keep only the last record for each account
account_info = account_info.reset_index(drop=True) # I'm also going to reset the indices to start from 0
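
Keeping the last row per account relies on the rows already being in date order. If that ordering is ever in doubt, one option is to sort explicitly before dropping duplicates. The sketch below uses a small made-up frame (`demo`, with invented account names and balances) purely to illustrate the idea; it is not data from the spreadsheet.

```python
import pandas as pd

# Made-up rows, deliberately out of date order
demo = pd.DataFrame({
    "Account No": ["A", "B", "A", "B"],
    "DATE": pd.to_datetime(["2019-03-05", "2018-01-10", "2017-06-29", "2018-05-12"]),
    "BALANCE AMT": [300.0, 50.0, 100.0, 75.0],
})

latest = (
    demo.sort_values("DATE", kind="stable")  # stable sort keeps same-day rows in file order
        .drop_duplicates(subset=["Account No"], keep="last")
        .sort_values("Account No")
        .reset_index(drop=True)
)
```

The stable sort matters because several transactions can share one date; rows with equal dates keep their original relative order, so the last row of a given day is still treated as the most recent.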

Our schema design includes another field in the accounts table: the ID of the customer who holds this account. We will randomly select users from our users table to act as the "holders" of these accounts. We will also randomly select the type of each account from a list of options.

In [30]:
account_info["User ID"] = account_info.apply(lambda row: random.randint(0, users.shape[0]-1), axis=1)

In [31]:
bank_account_types = ["Checking", "Savings", "Money market (MMA)", "Certificate of deposit (CD)"]
account_info["Account Type"] = account_info.apply(lambda row: bank_account_types[random.randint(0, len(bank_account_types)-1)], axis=1)
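
Note that the `User ID` cell above draws row positions (0 through 9), whereas the `accounts.user_id` column in our schema references `users.id`, whose values look like 6845 or 7366. A minimal sketch of drawing from the actual `id` values is shown below, assuming the intent is for each account to point at a real user `id`. The list `user_ids` copies the ten ids shown in the users output, and the seed is arbitrary and only there to make the sketch repeatable.

```python
import random

# The ten `id` values shown in the users output above
user_ids = [6845, 7366, 3009, 6777, 5670, 9299, 158, 5592, 1596, 3648]
n_accounts = 10  # the accounts table has ten rows

rng = random.Random(2002)  # arbitrary seed, used only for repeatability
assigned_user_ids = [rng.choice(user_ids) for _ in range(n_accounts)]
```

In the notebook, the equivalent would be something like `account_info["User ID"] = [random.choice(users["id"].tolist()) for _ in range(len(account_info))]`.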

In [32]:
account_info

Unnamed: 0,Account No,BALANCE AMT,User ID,Account Type
0,409000611074',462200.0,2,Money market (MMA)
1,409000493201',743583.3,1,Money market (MMA)
2,409000425051',-356734800.0,3,Money market (MMA)
3,409000405747',-548267500.0,6,Certificate of deposit (CD)
4,409000438611',-547919300.0,3,Money market (MMA)
5,409000493210',-546314600.0,5,Checking
6,409000438620',-539963100.0,6,Checking
7,1196711',-1586916000.0,0,Certificate of deposit (CD)
8,1196428',-1687234000.0,6,Money market (MMA)
9,409000362497',-1901417000.0,6,Certificate of deposit (CD)


### Transforming the Date Information

The other aspect of transactions that we want to put in its own dimension table is the date of each transaction. We will generate a date entry for each date that occurs in the table of transaction information, then link the two tables together.

In [33]:
# get a list of unique dates from the table of transaction info
date_info = bank_info.copy().drop_duplicates(subset=["DATE"])
date_info = date_info.reset_index(drop=True) # reset the indices to start from 0
date_info.drop([ 'Account No','TRANSACTION DETAILS', 'CHQ.NO.','VALUE DATE', 'WITHDRAWAL AMT', 'DEPOSIT AMT', 'BALANCE AMT', '.'], axis = 1, inplace = True)
date_info

Unnamed: 0,DATE
0,2017-06-29
1,2017-07-05
2,2017-07-18
3,2017-08-01
4,2017-08-16
...,...
1289,2017-12-25
1290,2018-04-02
1291,2018-04-29
1292,2018-05-12


In [34]:
date_info["Day"] = date_info.apply(lambda row: get_day(row["DATE"]), axis=1)

In [35]:
date_info["Month"] = date_info.apply(lambda row: get_month(row["DATE"]), axis=1)

In [36]:
date_info["Year"] = date_info.apply(lambda row: get_year(row["DATE"]), axis=1)

In [37]:
date_info["Day of the Week"] = date_info.apply(lambda row: get_week_day(row["DATE"]), axis=1)
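
The four cells above apply a helper row by row. Since `DATE` is a datetime column, the same fields can also be derived with the pandas `.dt` accessor, which works on the whole column at once. A small sketch on a two-row frame follows; `demo_dates` is a hypothetical name, and the two dates are taken from the first rows of the output above.

```python
import pandas as pd

demo_dates = pd.DataFrame({"DATE": pd.to_datetime(["2017-06-29", "2017-07-05"])})

# Column-wise equivalents of get_day, get_month, get_year and get_week_day
demo_dates["Day"] = demo_dates["DATE"].dt.day
demo_dates["Month"] = demo_dates["DATE"].dt.month_name()
demo_dates["Year"] = demo_dates["DATE"].dt.year
demo_dates["Day of the Week"] = demo_dates["DATE"].dt.day_name()
```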

In [38]:
date_info

Unnamed: 0,DATE,Day,Month,Year,Day of the Week
0,2017-06-29,29,June,2017,Thursday
1,2017-07-05,5,July,2017,Wednesday
2,2017-07-18,18,July,2017,Tuesday
3,2017-08-01,1,August,2017,Tuesday
4,2017-08-16,16,August,2017,Wednesday
...,...,...,...,...,...
1289,2017-12-25,25,December,2017,Monday
1290,2018-04-02,2,April,2018,Monday
1291,2018-04-29,29,April,2018,Sunday
1292,2018-05-12,12,May,2018,Saturday


### Transforming the Transaction Data

Finally, we can clean up the transaction data!

In [39]:
bank_info

Unnamed: 0,Account No,DATE,TRANSACTION DETAILS,CHQ.NO.,VALUE DATE,WITHDRAWAL AMT,DEPOSIT AMT,BALANCE AMT,.
0,409000611074',2017-06-29,TRF FROM Indiaforensic SERVICES,,2017-06-29,,1000000.0,1.000000e+06,.
1,409000611074',2017-07-05,TRF FROM Indiaforensic SERVICES,,2017-07-05,,1000000.0,2.000000e+06,.
2,409000611074',2017-07-18,FDRL/INTERNAL FUND TRANSFE,,2017-07-18,,500000.0,2.500000e+06,.
3,409000611074',2017-08-01,TRF FRM Indiaforensic SERVICES,,2017-08-01,,3000000.0,5.500000e+06,.
4,409000611074',2017-08-16,FDRL/INTERNAL FUND TRANSFE,,2017-08-16,,500000.0,6.000000e+06,.
...,...,...,...,...,...,...,...,...,...
116196,409000362497',2019-03-05,TRF TO 1196428 Indiaforensic SE,,2019-03-05,117934.30,,-1.901902e+09,.
116197,409000362497',2019-03-05,FDRL/INTERNAL FUND TRANSFE,,2019-03-05,,300000.0,-1.901602e+09,.
116198,409000362497',2019-03-05,FDRL/INTERNAL FUND TRANSFE,,2019-03-05,,300000.0,-1.901302e+09,.
116199,409000362497',2019-03-05,IMPS 05-03-20194C,,2019-03-05,109868.65,,-1.901412e+09,.


In [40]:
transaction_info = bank_info.copy()
transaction_info.drop(['CHQ.NO.','VALUE DATE', '.'], axis = 1, inplace = True)
transaction_info

Unnamed: 0,Account No,DATE,TRANSACTION DETAILS,WITHDRAWAL AMT,DEPOSIT AMT,BALANCE AMT
0,409000611074',2017-06-29,TRF FROM Indiaforensic SERVICES,,1000000.0,1.000000e+06
1,409000611074',2017-07-05,TRF FROM Indiaforensic SERVICES,,1000000.0,2.000000e+06
2,409000611074',2017-07-18,FDRL/INTERNAL FUND TRANSFE,,500000.0,2.500000e+06
3,409000611074',2017-08-01,TRF FRM Indiaforensic SERVICES,,3000000.0,5.500000e+06
4,409000611074',2017-08-16,FDRL/INTERNAL FUND TRANSFE,,500000.0,6.000000e+06
...,...,...,...,...,...,...
116196,409000362497',2019-03-05,TRF TO 1196428 Indiaforensic SE,117934.30,,-1.901902e+09
116197,409000362497',2019-03-05,FDRL/INTERNAL FUND TRANSFE,,300000.0,-1.901602e+09
116198,409000362497',2019-03-05,FDRL/INTERNAL FUND TRANSFE,,300000.0,-1.901302e+09
116199,409000362497',2019-03-05,IMPS 05-03-20194C,109868.65,,-1.901412e+09
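
Two further clean-up steps may be worth considering before loading: the account numbers carry a trailing apostrophe, and the DataFrame column names differ from the column names declared for the `transactions` table. The sketch below shows one way to handle both on a single-row frame copied from the first row above. The names `demo_tx` and `column_map` are hypothetical, and the mapping is our reading of how the spreadsheet columns correspond to the schema.

```python
import pandas as pd

demo_tx = pd.DataFrame({
    "Account No": ["409000611074'"],
    "DATE": pd.to_datetime(["2017-06-29"]),
    "TRANSACTION DETAILS": ["TRF FROM Indiaforensic SERVICES"],
    "WITHDRAWAL AMT": [float("nan")],
    "DEPOSIT AMT": [1000000.0],
    "BALANCE AMT": [1000000.0],
})

# Assumed mapping from spreadsheet headers to the transactions table columns
column_map = {
    "Account No": "account_no",
    "DATE": "date",
    "TRANSACTION DETAILS": "details",
    "WITHDRAWAL AMT": "withdrawal",
    "DEPOSIT AMT": "deposit",
    "BALANCE AMT": "balance",
}

demo_tx = demo_tx.rename(columns=column_map)
demo_tx["account_no"] = demo_tx["account_no"].str.rstrip("'")  # drop the trailing apostrophe
```

If the apostrophe is stripped here, the same step would be needed on the Accounts data so the two `account_no` columns still match. Also note that some `details` values, such as the 31-character `TRF FROM Indiaforensic SERVICES`, are longer than the `VARCHAR(30)` declared in the schema, so that column may need to be widened.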


## Creating and Populating the Data Warehouse
In this section, we will create a new SQL database with all the tables specified in our schema. We'll then populate it with the data tables that we created earlier.


### Creating the Database
First, we create a new database -- we'll call it "banks" -- to serve as our data warehouse.

In [174]:
# create banks database
queries = ['SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;', 
'SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;',
'SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE=\'TRADITIONAL,ALLOW_INVALID_DATES\';',
'DROP SCHEMA IF EXISTS `banks` ;',
'CREATE SCHEMA IF NOT EXISTS `banks` DEFAULT CHARACTER SET latin1 ;',
'USE `banks` ;']

for query in queries:
    execute_mysql_command(user_id, pwd, host_name, "banks", query, False)

Next, we'll make all the tables in the database.

In [175]:
db_name = "banks"
queries = ["""CREATE TABLE IF NOT EXISTS `banks`.`users` (
  `id` INT(15) NOT NULL AUTO_INCREMENT,
  `password` VARCHAR(50) NULL DEFAULT NULL,
  `first_name` VARCHAR(50) NULL DEFAULT NULL,
  `last_name` VARCHAR(50) NULL DEFAULT NULL,
  `username` VARCHAR(50) NULL DEFAULT NULL,
  `email` VARCHAR(50) NULL DEFAULT NULL,
  `gender` VARCHAR(50) NULL DEFAULT NULL,
  `phone_number` VARCHAR(50) NULL DEFAULT NULL,
  `date_of_birth` VARCHAR(25) NULL DEFAULT NULL,
  `credit_card.cc_number` VARCHAR(25) NULL DEFAULT NULL,
  `age` INT(3) NULL DEFAULT NULL,
  PRIMARY KEY (`id`))
  ENGINE = InnoDB
  DEFAULT CHARACTER SET = utf8;""",
  """CREATE TABLE IF NOT EXISTS `banks`.`locations` (
  `loc_id` INT(15) NOT NULL AUTO_INCREMENT,
  `ship_address` VARCHAR(50) NOT NULL,
  `ship_city` VARCHAR(50) NULL DEFAULT NULL,
  `ship_state_province` VARCHAR(50) NULL DEFAULT NULL,
  `zipcode` VARCHAR(15) NULL DEFAULT NULL,
  `ship_country_region` VARCHAR(15) NULL DEFAULT NULL,
  PRIMARY KEY (`loc_id`))
  ENGINE = InnoDB
  DEFAULT CHARACTER SET = utf8;""",
  """CREATE TABLE IF NOT EXISTS `banks`.`dates` (
  `date` DATETIME ,
  `day` INT(2) NULL DEFAULT NULL,
  `month` VARCHAR(15) NULL DEFAULT NULL,
  `year` VARCHAR(5) NULL DEFAULT NULL,
  `week_day` VARCHAR(15) NULL DEFAULT NULL,
  PRIMARY KEY (`date`))
  ENGINE = InnoDB
  DEFAULT CHARACTER SET = utf8;""",
  """CREATE TABLE IF NOT EXISTS `banks`.`accounts` (
  `account_no` VARCHAR(15) ,
  `balance` VARCHAR(15) NULL DEFAULT NULL,
  `user_id` INT(5) NULL DEFAULT NULL,
  `account_type` VARCHAR(30) NULL DEFAULT NULL,
  PRIMARY KEY (`account_no`),
  CONSTRAINT `userid`
  FOREIGN KEY (`user_id`)
  REFERENCES `banks`.`users` (`id`)
  ON DELETE NO ACTION
  ON UPDATE NO ACTION)
  ENGINE = InnoDB
  DEFAULT CHARACTER SET = utf8;""",
  """CREATE TABLE IF NOT EXISTS `banks`.`transactions` (
  `transaction_id` INT(15) NOT NULL AUTO_INCREMENT,
  `account_no` VARCHAR(15) NULL DEFAULT NULL,
  `balance` VARCHAR(15) NULL DEFAULT NULL,
  `withdrawal` VARCHAR(15) NULL DEFAULT NULL,
  `deposit` VARCHAR(15) NULL DEFAULT NULL,
  `details` VARCHAR(30) NULL DEFAULT NULL,
  `date` DATETIME ,
  PRIMARY KEY (`transaction_id`),
  CONSTRAINT `accounts`
    FOREIGN KEY (`account_no`)
    REFERENCES `banks`.`accounts` (`account_no`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;"""]

for query in queries:
    execute_mysql_command(user_id, pwd, host_name, db_name, query, True)

Cannot execute command.
Cannot execute command.
Cannot execute command.


### Populating the Database

In [164]:
users.columns

Index(['id', 'password', 'first_name', 'last_name', 'username', 'email',
       'gender', 'phone_number', 'date_of_birth', 'credit_card.cc_number',
       'age'],
      dtype='object')

In [172]:
insert_data_to_mysql(user_id, pwd, host_name, db_name, date_info, "dates")

In [166]:
insert_data_to_mysql(user_id, pwd, host_name, db_name, users, "users")

OperationalError: (pymysql.err.OperationalError) (3730, "Cannot drop table 'users' referenced by a foreign key constraint 'userid' on table 'accounts'.")
[SQL: 
DROP TABLE users]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
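
The `OperationalError` above comes from `if_exists='replace'`, which makes pandas drop and recreate the target table; MySQL refuses to drop `users` while `accounts` holds a foreign key to it. One possible workaround is to keep the tables created earlier and append rows to them instead. The sketch below shows the idea with an in-memory SQLite database standing in for MySQL, an assumption made only so the example runs without a server. `users_demo` is a hypothetical two-row frame with a reduced set of columns.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, age INTEGER)"
)

users_demo = pd.DataFrame(
    {"id": [6845, 7366], "first_name": ["Lindsey", "Rod"], "age": [20, 37]}
)

# 'append' inserts into the existing table instead of dropping it;
# index=False keeps the DataFrame index from being written as a column
users_demo.to_sql("users", con=conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

With `if_exists='append'`, the DataFrame's column names have to match the existing table, so frames such as `date_info` (with columns like `DATE` and `Day of the Week`) would first need renaming to the schema's names (`date`, `week_day`). Parent tables such as `users` would also need to be loaded before the tables that reference them.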

In [115]:
test_insert_query = "SELECT * from users"
test_insert = get_mysql_dataframe(user_id, pwd, host_name, db_name, test_insert_query)
test_insert

Unnamed: 0,id,last_name,first_name,username,password,gender,email,phone_number,date_of_birth,credit_card.cc_number,age


## Executing SQL Queries
Finally, in this section, we will execute several SQL queries that join and aggregate data from at least three of the tables in our data warehouse.

In [138]:
sql_query = """
    SELECT details, week_day, first_name
    FROM transactions
    JOIN dates ON transactions.date = dates.date
    JOIN accounts ON transactions.account_no = accounts.account_no
    JOIN users ON accounts.user_id = users.id
    WHERE week_day = 'Thursday';
"""
thursday_transactions = get_mysql_dataframe(user_id, pwd, host_name, db_name, sql_query)
thursday_transactions

Unnamed: 0,details,week_day,first_name


In [139]:
sql_query = """
    SELECT SUM(withdrawal), transactions.account_no, first_name
    FROM transactions
    JOIN accounts ON transactions.account_no = accounts.account_no
    JOIN users ON accounts.user_id = users.id
    GROUP BY account_no;
"""
withdrawals = get_mysql_dataframe(user_id, pwd, host_name, db_name, sql_query)
withdrawals

Unnamed: 0,SUM(withdrawal),account_no,first_name
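
To illustrate what the aggregate query returns once the tables hold data, here is a small sketch that runs the same join against an in-memory SQLite database. SQLite stands in for MySQL here, the tables are cut down to the columns the query needs, and every row is made-up sample data rather than anything from our warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE accounts (
        account_no TEXT PRIMARY KEY,
        user_id INTEGER REFERENCES users(id)
    );
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        account_no TEXT REFERENCES accounts(account_no),
        withdrawal REAL
    );
    -- made-up sample rows
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO accounts VALUES ('A-1', 1), ('B-2', 2);
    INSERT INTO transactions (account_no, withdrawal) VALUES
        ('A-1', 100.0), ('A-1', 50.0), ('B-2', 20.0);
""")

rows = conn.execute("""
    SELECT SUM(withdrawal), transactions.account_no, first_name
    FROM transactions
    JOIN accounts ON transactions.account_no = accounts.account_no
    JOIN users ON accounts.user_id = users.id
    GROUP BY transactions.account_no
    ORDER BY transactions.account_no;
""").fetchall()
# rows holds one (total withdrawal, account number, holder first name) tuple per account
```

Each account contributes one row, with its withdrawals summed and the holder's first name pulled in through the `accounts` table.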
