# Midterm Project: ETL and Data Warehousing
Madelyn Khoury (mgk5ybb) and Tiara Allard

DS 2002 Spring 2023

### Design and Strategy

We chose bank transactions as the core business process of our data warehouse and designed the database schema around them. The schema is shown in the README of our GitHub project.

Our schema stores information about bank transactions, the bank accounts and users involved in them, transaction dates, and transaction locations. We were unable to find a single dataset with all of this information, so we combined data from multiple sources. This also let us meet the requirement of importing data from several kinds of sources.

We combined several dummy or randomly generated datasets to build a complete data warehouse. First, we read bank transaction and account information from an Excel (.xlsx) file stored on our local filesystem. Then, we generated user data from an API and linked it to the accounts and transactions. Finally, we imported location information from the Northwind MySQL database to represent regions in which banking transactions might have occurred.

After processing the data and computing useful fields, we formatted it into our fact and dimension tables in the final data warehouse.
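
As a rough illustration of the fact/dimension layout described above, the sketch below builds a tiny star schema in pandas. All table names, column names, and values here are hypothetical placeholders, not the project's actual schema (which is in the README).

```python
import pandas as pd

# Hypothetical dimension tables, each with a surrogate key.
dim_accounts = pd.DataFrame({
    "account_key": [1, 2],
    "account_no": ["A-100", "A-200"],
})
dim_dates = pd.DataFrame({
    "date_key": [20230101, 20230102],
    "full_date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
})

# Hypothetical raw transactions, still identified by natural values.
raw_transactions = pd.DataFrame({
    "account_no": ["A-100", "A-200", "A-100"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02"]),
    "amount": [250.0, -40.0, 15.5],
})

# Look up the surrogate keys and keep only keys plus measures in the fact table.
fact_transactions = (
    raw_transactions
    .merge(dim_accounts, on="account_no", how="left")
    .merge(dim_dates, left_on="date", right_on="full_date", how="left")
    [["account_key", "date_key", "amount"]]
)
```

In this layout the fact table holds only keys and numeric measures, while descriptive attributes stay in the dimension tables.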

### Imports and Helper Functions

In [None]:
import sys
!{sys.executable} -m pip install openpyxl
!{sys.executable} -m pip install mysql-connector-python
!{sys.executable} -m pip install pymysql
!{sys.executable} -m pip install sqlalchemy

In [2]:
import datetime
import json
import mysql.connector
import os
import pandas as pd
import pymysql
import requests
from sqlalchemy import create_engine

In [3]:
def get_api_response(url, headers, params, response_type):
    """Call a JSON API and return the parsed body, or an error message string on failure."""
    try:
        # A timeout keeps the request from hanging indefinitely
        response = requests.get(url, headers=headers, params=params, timeout=30)
        response.raise_for_status()

    except requests.exceptions.HTTPError as errh:
        return "An HTTP Error occurred: " + repr(errh)
    except requests.exceptions.ConnectionError as errc:
        return "An Error Connecting to the API occurred: " + repr(errc)
    except requests.exceptions.Timeout as errt:
        return "A Timeout Error occurred: " + repr(errt)
    except requests.exceptions.RequestException as err:
        return "An Unknown Error occurred: " + repr(err)

    if response_type == 'json':
        # Return the parsed JSON (dict or list) as-is
        result = response.json()
    elif response_type == 'dataframe':
        # Flatten nested JSON objects into dotted column names
        result = pd.json_normalize(response.json())
    else:
        result = f"Unsupported response_type: {response_type!r}"

    return result
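
Because `get_api_response` returns an error message as a plain string instead of raising, callers need to check the return type before treating the result as data. One possible guard is sketched below; `require_api_data` is a hypothetical helper name and is not part of the original notebook.

```python
def require_api_data(result):
    """Return the result unchanged, or raise if it is an error-message string.

    Assumes the convention used by get_api_response: failures come back
    as str, successes as parsed JSON (dict/list) or a DataFrame.
    """
    if isinstance(result, str):
        raise RuntimeError(result)
    return result
```

It could wrap a call such as `require_api_data(get_api_response(url, headers, params, "dataframe"))`, so that a failed request stops the notebook at the point of failure rather than in a later cell.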

In [4]:
# This helper is adapted from one of the provided course files, 02-Python-MySQL.ipynb
def get_mysql_dataframe(user_id, pwd, host_name, db_name, sql_query):
    """Run sql_query against a MySQL database and return a DataFrame, or None on failure."""
    conn_str = f"mysql+pymysql://{user_id}:{pwd}@{host_name}/{db_name}"
    try:
        sql_engine = create_engine(conn_str, pool_recycle=3600)
        connection = sql_engine.connect()
    except Exception as err:
        print("Unable to connect to the MySQL database: " + repr(err))
        return None

    dframe = None
    try:
        dframe = pd.read_sql(sql_query, connection)
    except Exception as err:
        print("SQL query was unsuccessful: " + repr(err))
    finally:
        # Always release the connection, even if the query failed
        connection.close()
    return dframe

In [None]:
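# Reference snippet for dropping columns from a DataFrame.
# It is commented out because `df` is only a placeholder name here.
# df.drop(['description', 'attachments'], axis=1, inplace=True)
# df.columns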

### Importing Data From Local File System

The core bank transaction information comes from a dataset on Kaggle (https://www.kaggle.com/datasets/apoorvwatsky/bank-transaction-data). We downloaded the data as an Excel file (`bank.xlsx`) and import it from the local filesystem for use in our data warehouse.

In [5]:
bank_info_path = os.path.join(os.getcwd(), 'bank.xlsx')
transaction_info = pd.read_excel(bank_info_path)

In [6]:
transaction_info

| | Account No | DATE | TRANSACTION DETAILS | CHQ.NO. | VALUE DATE | WITHDRAWAL AMT | DEPOSIT AMT | BALANCE AMT | . |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 409000611074' | 2017-06-29 | TRF FROM Indiaforensic SERVICES | | 2017-06-29 | | 1000000.0 | 1.000000e+06 | . |
| 1 | 409000611074' | 2017-07-05 | TRF FROM Indiaforensic SERVICES | | 2017-07-05 | | 1000000.0 | 2.000000e+06 | . |
| 2 | 409000611074' | 2017-07-18 | FDRL/INTERNAL FUND TRANSFE | | 2017-07-18 | | 500000.0 | 2.500000e+06 | . |
| 3 | 409000611074' | 2017-08-01 | TRF FRM Indiaforensic SERVICES | | 2017-08-01 | | 3000000.0 | 5.500000e+06 | . |
| 4 | 409000611074' | 2017-08-16 | FDRL/INTERNAL FUND TRANSFE | | 2017-08-16 | | 500000.0 | 6.000000e+06 | . |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116196 | 409000362497' | 2019-03-05 | TRF TO 1196428 Indiaforensic SE | | 2019-03-05 | 117934.30 | | -1.901902e+09 | . |
| 116197 | 409000362497' | 2019-03-05 | FDRL/INTERNAL FUND TRANSFE | | 2019-03-05 | | 300000.0 | -1.901602e+09 | . |
| 116198 | 409000362497' | 2019-03-05 | FDRL/INTERNAL FUND TRANSFE | | 2019-03-05 | | 300000.0 | -1.901302e+09 | . |
| 116199 | 409000362497' | 2019-03-05 | IMPS 05-03-20194C | | 2019-03-05 | 109868.65 | | -1.901412e+09 | . |
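
The preview shows a few things that would likely need cleaning before loading: every `Account No` value ends with a stray apostrophe, the last column is a placeholder named `.`, and the rows shown fill only one of the two amount columns. The sketch below shows one way this could be handled on a small hand-made sample. The `amount` column and its sign convention are assumptions for illustration, not necessarily what the project computes.

```python
import pandas as pd

# Small hand-made sample mimicking the layout of the preview above.
sample = pd.DataFrame({
    "Account No": ["409000611074'", "409000362497'"],
    "DATE": pd.to_datetime(["2017-06-29", "2019-03-05"]),
    "WITHDRAWAL AMT": [None, 117934.30],
    "DEPOSIT AMT": [1000000.0, None],
    ".": [".", "."],
})

# Drop the placeholder column.
cleaned = sample.drop(columns=["."])
# Remove the trailing apostrophe from the account numbers.
cleaned["Account No"] = cleaned["Account No"].str.rstrip("'")
# Assumed convention: deposits are positive, withdrawals are negative.
cleaned["amount"] = (
    cleaned["DEPOSIT AMT"].fillna(0) - cleaned["WITHDRAWAL AMT"].fillna(0)
)
```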


### Importing Data From API

We've chosen to use the `users` endpoint from random-data-api.com, which randomly generates data for a set of users. This will populate the Users table in our data warehouse.

In [7]:
size = 5 # only get info on 5 users for now
url = "https://random-data-api.com/api/v2/users"
querystring = {"size":size}
headers = None

# Get information from users API endpoint
users_json = get_api_response(url, headers, querystring, "dataframe")
users_json


| | id | uid | password | first_name | last_name | username | email | avatar | gender | phone_number | ... | address.zip_code | address.state | address.country | address.coordinates.lat | address.coordinates.lng | credit_card.cc_number | subscription.plan | subscription.status | subscription.payment_method | subscription.term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2262 | 0781feb7-b99c-4221-a2bb-6eb11db7f8ff | YnyMqm41Ao | Alberto | Franecki | alberto.franecki | alberto.franecki@email.com | https://robohash.org/possimusetassumenda.png?s... | Bigender | +1-649 772.272.1425 x904 | ... | 79248-1523 | Oklahoma | United States | -25.794988 | -96.369002 | 5431-2184-1603-2692 | Gold | Pending | Cheque | Full subscription |
| 1 | 2003 | fedd0dea-1ea7-4362-9005-481905ee0a3b | wkhIsYTD6F | Allen | White | allen.white | allen.white@email.com | https://robohash.org/inmolestiaein.png?size=30... | Genderqueer | +250 (959) 167-8451 | ... | 74187-5734 | Montana | United States | 9.459585 | 99.060337 | 6771-8961-0761-0151 | Essential | Active | Cheque | Monthly |
| 2 | 2994 | 377fbcef-f8c6-4855-afdf-1899a28d8218 | 02QYfMzPZu | Tonja | Crist | tonja.crist | tonja.crist@email.com | https://robohash.org/etteneturid.png?size=300x... | Genderfluid | +503 (368) 734-6842 x867 | ... | 91680 | New Hampshire | United States | -8.215459 | 32.083272 | 4883-7420-1589-2368 | Basic | Blocked | Alipay | Annual |
| 3 | 3557 | 82384ae7-413e-4396-a287-504d6f7bb22e | kt4gVITqOQ | Walter | Stokes | walter.stokes | walter.stokes@email.com | https://robohash.org/aperiamsitconsequatur.png... | Female | +592 214.397.1336 | ... | 32725-1283 | Missouri | United States | -83.503204 | 4.718412 | 4922787869214 | Basic | Active | Cash | Monthly |
| 4 | 3261 | 1db8d5d2-416d-40b3-bf4a-49602512351a | qlN79yej5z | Jamey | Roob | jamey.roob | jamey.roob@email.com | https://robohash.org/autautquae.png?size=300x3... | Polygender | +593 1-935-811-0869 x89055 | ... | 20709 | Illinois | United States | 55.395681 | -97.718289 | 6771-8980-7106-6006 | Bronze | Blocked | Cheque | Full subscription |
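
The dotted column names in the output (`address.zip_code`, `subscription.plan`, and so on) come from `pd.json_normalize`, which the `'dataframe'` branch of `get_api_response` uses to flatten nested JSON objects. The sketch below reproduces this on a hand-written record; the values are made up, and only a few of the API's fields are included.

```python
import pandas as pd

# One made-up record with the same kind of nesting as the API response.
records = [
    {
        "id": 1,
        "first_name": "Ada",
        "address": {
            "state": "Virginia",
            "coordinates": {"lat": 38.03, "lng": -78.48},
        },
    }
]

# Each nested key becomes a column named with the full dotted path.
flat = pd.json_normalize(records)
print(list(flat.columns))
```

Each level of nesting adds one segment to the column name, so `address["coordinates"]["lat"]` ends up in a column called `address.coordinates.lat`.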


### Importing Data From Relational Database

To get location data that could represent the territories and regions in which bank transactions were completed, we import the shipping locations of orders from the `orders` table in the Northwind database. In our data warehouse, this information represents the location in which a customer initiated a banking transaction.

In [8]:
# Define the variables used to connect to the MySQL database
host_name = "localhost"
host_ip = "127.0.0.1"
port = "3306"

user_id = "ds2002"
pwd = "UVA!1819"
db_name = "northwind"

First we must get the location-related data from the `orders` table.

In [23]:
sql_query = """
    SELECT ship_address, ship_city, ship_state_province, ship_zip_postal_code, ship_country_region
    FROM orders;
"""

In [33]:
locations_info = get_mysql_dataframe(user_id, pwd, host_name, db_name, sql_query)

Then, we must remove duplicate values so we have a table of unique locations.

In [32]:
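# drop_duplicates returns a new DataFrame, so assign the result to keep it
locations_info = locations_info.drop_duplicates()
locations_info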

Unnamed: 0,ship_address,ship_city,ship_state_province,ship_zip_postal_code,ship_country_region
0,789 27th Street,Las Vegas,NV,99999,USA
1,123 4th Street,New York,NY,99999,USA
2,123 12th Street,Las Vegas,NV,99999,USA
3,123 8th Street,Portland,OR,99999,USA
5,789 29th Street,Denver,CO,99999,USA
6,123 3rd Street,Los Angelas,CA,99999,USA
7,123 6th Street,Milwaukee,WI,99999,USA
8,789 28th Street,Memphis,TN,99999,USA
10,123 10th Street,Chicago,IL,99999,USA
11,123 7th Street,Boise,ID,99999,USA
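
Note that `drop_duplicates` returns a new DataFrame rather than modifying the original in place, so the result has to be assigned to be kept. It also preserves the original row labels, which is why the index above skips 4 and 9. The sketch below shows how unique locations might be turned into a dimension table with a fresh surrogate key; the sample rows are hand-made and the `location_key` column name is hypothetical.

```python
import pandas as pd

# Hand-made sample with one repeated location.
locations = pd.DataFrame({
    "ship_city": ["Las Vegas", "New York", "Las Vegas"],
    "ship_state_province": ["NV", "NY", "NV"],
})

# Keep unique rows and renumber them from 0.
dim_locations = locations.drop_duplicates().reset_index(drop=True)
# Add a 1-based surrogate key as the first column.
dim_locations.insert(0, "location_key", range(1, len(dim_locations) + 1))
```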
