# ETL Workflow


---

## Setup


Importing my libraries of choice

In [6]:
import os
import pyspark.sql
import requests

Getting our SQL Server credentials ahead of time

In [3]:
db_user = os.environ.get('DB_USER')
db_password = os.environ.get('DB_PASSWORD')
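`os.environ.get` returns `None` when a variable is unset, so a missing credential would only surface later, at load time. A minimal sketch of a stricter lookup, assuming you would rather fail early; `require_env` is a hypothetical helper and not part of this notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Environment variable {name!r} is not set")
    return value


# Hypothetical usage, mirroring the cell above:
# db_user = require_env('DB_USER')
# db_password = require_env('DB_PASSWORD')
```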

Creating a Spark session

In [4]:
spark_sesh = pyspark.sql.SparkSession.builder.appName('Credit Card ETL').getOrCreate()

In [5]:
spark_sesh

---

## Extract 

I'm going to extract data from two datasets pertaining to the same bank, which are:


- The Credit Card Dataset
    - cdw_sapp_customer.JSON
    - cdw_sapp_branch.JSON 
    - cdw_sapp_credit.JSON

<br>

- The Bank Loan Application Dataset
    - Loan Application API Endpoint

📝 Notes: <br>
- The customer.JSON and branch.JSON files contain information about the bank's customers and branches, respectively.
- The credit.JSON file contains information about credit card transactions.
- The Loan Application dataset covers loans for purchasing homes and includes information such as whether the applicants were approved, gender, marital status, and income.

Extracting the JSON Files with sparksession.read.load()


In [102]:
# JSON Files
branch_df = spark_sesh.read.load('../../data/credit_card_dataset/cdw_sapp_branch.json', format='json')  # 👀 don't forget to specify the format
credit_df = spark_sesh.read.load('../../data/credit_card_dataset/cdw_sapp_credit.json', format='json')
customer_df = spark_sesh.read.load('../../data/credit_card_dataset/cdw_sapp_custmer.json', format='json')

Defining an API endpoint


In [15]:
LOAN_API_ENDPOINT = "https://raw.githubusercontent.com/platformps/LoanDataset/main/loan_data.json"


Creating a function to create a dataframe from an API Endpoint: <br>
 
The function takes an API Endpoint URL and the current live spark session, <br>
checks to see if the HTTP request is OK and returns a pyspark.sql DataFrame.

In [23]:
# API 

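def api_check(api_endpoint: str, spark_session: pyspark.sql.SparkSession) -> pyspark.sql.DataFrame:  # param: type -> return type
    # Request the endpoint that was passed in, not the module-level constant
    api = requests.get(api_endpoint)
    print(f"HTTP Status Code: {api.status_code}")
    if api.status_code == 200:
        # api.json() is expected to be a list of records; Spark infers the schema
        api_df = spark_session.createDataFrame(api.json())
        return api_df
    # Fail loudly instead of silently returning None
    raise RuntimeError(f"Request to {api_endpoint} failed with status {api.status_code}")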



In [24]:
loan_df = api_check(LOAN_API_ENDPOINT, spark_sesh)

HTTP Status Code: 200
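
The status check can also be separated from the Spark call, which makes it easy to exercise without a network connection. A rough sketch, assuming the endpoint returns a JSON array of records; `records_from_response` and `FakeResponse` are hypothetical names, and the payload below is made up purely for illustration:

```python
from typing import Any


def records_from_response(response: Any) -> list:
    """Validate a requests-style response and return its JSON records."""
    if response.status_code != 200:
        raise RuntimeError(f"Request failed with HTTP status {response.status_code}")
    records = response.json()
    if not isinstance(records, list):
        raise ValueError("Expected a JSON array of records")
    return records


class FakeResponse:
    """Stand-in for requests.Response, used only to demonstrate the helper."""

    def __init__(self, status_code: int, payload: Any):
        self.status_code = status_code
        self._payload = payload

    def json(self) -> Any:
        return self._payload


# Made-up payload, not real loan data
ok = FakeResponse(200, [{"id": 1}, {"id": 2}])
print(len(records_from_response(ok)))
```

With a real response, the returned list could be handed to `createDataFrame` in the same way `api_check` does.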


Let's check our DataFrames and make sure they are intact and ready for transformation

In [25]:
dataframe_dict = {}

dataframe_dict['branch'] = branch_df  # assign val to dict
dataframe_dict['credit'] = credit_df
dataframe_dict['customer'] = customer_df
dataframe_dict['loan'] = loan_df

In [34]:
for name, dataframe in dataframe_dict.items():  # k:v in (k, v)
    print(name)
    dataframe.printSchema()

branch
root
 |-- BRANCH_CITY: string (nullable = true)
 |-- BRANCH_CODE: long (nullable = true)
 |-- BRANCH_NAME: string (nullable = true)
 |-- BRANCH_PHONE: string (nullable = true)
 |-- BRANCH_STATE: string (nullable = true)
 |-- BRANCH_STREET: string (nullable = true)
 |-- BRANCH_ZIP: long (nullable = true)
 |-- LAST_UPDATED: string (nullable = true)

credit
root
 |-- BRANCH_CODE: long (nullable = true)
 |-- CREDIT_CARD_NO: string (nullable = true)
 |-- CUST_SSN: long (nullable = true)
 |-- DAY: long (nullable = true)
 |-- MONTH: long (nullable = true)
 |-- TRANSACTION_ID: long (nullable = true)
 |-- TRANSACTION_TYPE: string (nullable = true)
 |-- TRANSACTION_VALUE: double (nullable = true)
 |-- YEAR: long (nullable = true)

customer
root
 |-- APT_NO: string (nullable = true)
 |-- CREDIT_CARD_NO: string (nullable = true)
 |-- CUST_CITY: string (nullable = true)
 |-- CUST_COUNTRY: string (nullable = true)
 |-- CUST_EMAIL: string (nullable = true)
 |-- CUST_PHONE: long (nullable = tru

---

## Transform

We are given a mapping document, which tells us what kind of transformations we should make on the data before loading it onto the server. <br>
Let's take a look:

<div align='center'>
<h3>Customer Table</h3>
<img src='../../images/customer_mapping_doc.png' width=1200px>
<h3>Branch Table</h3>
<img src='../../images/branch_mapping_doc.png' width=1200px>
<h3>Credit-Card Table</h3>
<img src='../../images/creditcard_mapping_doc.png' width=1200px>
</div>

📝 Note: <br>
We are not given any specifics for the loan information dataframe, so I'm not going to touch that dataframe or modify any data types. <br>

Furthermore, if you're following along in VS Code or a similar IDE, now would be a good time to split your view <br> into several windows to make the workflow easier, like this: <br>
<img src='../../images/workflow_example.png' width="1200px">

Let's start with the Customer table: <br>

📝 Note: <br>
Just to make sure I'm following the mapping document and not making mistakes, <br>
I'm using a small pandas dataframe to store it, so I can see which changes need to be made and iterate through them, using pandas as a checklist of sorts.

This is completely optional, but I prefer it.

In [99]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # ensuring pandas won't truncate column contents


In [100]:
customer_map = pd.read_clipboard()
customer_map


Unnamed: 0,Source Column Names,Mapping Logic,Target Field names,Target DataType
0,SSN,Direct Move,SSN,INT
1,FIRST_NAME,Convert the Name to Title Case,FIRST_NAME,VARCHAR
2,MIDDLE_NAME,Convert the middle name in lower case,MIDDLE_NAME,VARCHAR
3,LAST_NAME,Convert the Last Name in Title Case,LAST_NAME,VARCHAR
4,CREDIT_CARD_NO,Direct_move,Credit_card_no,VARCHAR
5,"STREET_NAME,APT_NO","Concatenate Apartment no and Street name of customer's Residence with comma as a seperator (Street, Apartment)",FULL_STREET_ADDRESS,VARCHAR
6,CUST_CITY,Direct Move,CUST_CITY,VARCHAR
7,CUST_STATE,Direct Move,CUST_STATE,VARCHAR
8,CUST_COUNTRY,Direct move,CUST_COUNTRY,VARCHAR
9,CUST_ZIP,Direct move,CUST_ZIP,INT


In [123]:
customer_map['Mapping Logic'] = customer_map['Mapping Logic'].apply(lambda x: str(x).lower())
customer_map

Unnamed: 0,Source Column Names,Mapping Logic,Target Field names,Target DataType
0,SSN,direct move,SSN,INT
1,FIRST_NAME,convert the name to title case,FIRST_NAME,VARCHAR
2,MIDDLE_NAME,convert the middle name in lower case,MIDDLE_NAME,VARCHAR
3,LAST_NAME,convert the last name in title case,LAST_NAME,VARCHAR
4,CREDIT_CARD_NO,direct_move,Credit_card_no,VARCHAR
5,"STREET_NAME,APT_NO","concatenate apartment no and street name of customer's residence with comma as a seperator (street, apartment)",FULL_STREET_ADDRESS,VARCHAR
6,CUST_CITY,direct move,CUST_CITY,VARCHAR
7,CUST_STATE,direct move,CUST_STATE,VARCHAR
8,CUST_COUNTRY,direct move,CUST_COUNTRY,VARCHAR
9,CUST_ZIP,direct move,CUST_ZIP,INT


In [126]:
customer_map['Mapping Logic'] = customer_map['Mapping Logic'].replace('direct_move', 'direct move')

Unnamed: 0,Source Column Names,Mapping Logic,Target Field names,Target DataType
0,SSN,direct move,SSN,INT
1,FIRST_NAME,convert the name to title case,FIRST_NAME,VARCHAR
2,MIDDLE_NAME,convert the middle name in lower case,MIDDLE_NAME,VARCHAR
3,LAST_NAME,convert the last name in title case,LAST_NAME,VARCHAR
4,CREDIT_CARD_NO,direct move,Credit_card_no,VARCHAR
5,"STREET_NAME,APT_NO","concatenate apartment no and street name of customer's residence with comma as a seperator (street, apartment)",FULL_STREET_ADDRESS,VARCHAR
6,CUST_CITY,direct move,CUST_CITY,VARCHAR
7,CUST_STATE,direct move,CUST_STATE,VARCHAR
8,CUST_COUNTRY,direct move,CUST_COUNTRY,VARCHAR
9,CUST_ZIP,direct move,CUST_ZIP,INT


Now I can see exactly what needs to be transformed. 

In [131]:
customer_transform_guide = customer_map[(customer_map['Mapping Logic'] != 'direct move')]
customer_transform_guide

Unnamed: 0,Source Column Names,Mapping Logic,Target Field names,Target DataType
1,FIRST_NAME,convert the name to title case,FIRST_NAME,VARCHAR
2,MIDDLE_NAME,convert the middle name in lower case,MIDDLE_NAME,VARCHAR
3,LAST_NAME,convert the last name in title case,LAST_NAME,VARCHAR
5,"STREET_NAME,APT_NO","concatenate apartment no and street name of customer's residence with comma as a seperator (street, apartment)",FULL_STREET_ADDRESS,VARCHAR
10,CUST_PHONE,change the format of phone number to (xxx)xxx-xxxx,CUST_PHONE,VARCHAR
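
The checklist logic above (lower-case the mapping text, normalise `direct_move`, keep everything that is not a direct move) can also be sketched in plain Python. The rows below are a subset copied from the mapping table shown earlier; `rows_to_transform` is a hypothetical helper name:

```python
# A few rows copied from the mapping document: (source column, mapping logic)
mapping_rows = [
    ("SSN", "Direct Move"),
    ("FIRST_NAME", "Convert the Name to Title Case"),
    ("MIDDLE_NAME", "Convert the middle name in lower case"),
    ("LAST_NAME", "Convert the Last Name in Title Case"),
    ("CREDIT_CARD_NO", "Direct_move"),
    ("CUST_CITY", "Direct Move"),
]


def rows_to_transform(rows):
    """Normalise the mapping logic text and drop the 'direct move' rows."""
    remaining = []
    for column, logic in rows:
        # Lower-case and treat 'direct_move' the same as 'direct move'
        normalised = logic.lower().replace("_", " ")
        if normalised != "direct move":
            remaining.append((column, normalised))
    return remaining


for column, logic in rows_to_transform(mapping_rows):
    print(f"{column}: {logic}")
```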


In [181]:
customer_transform_guide[(customer_transform_guide['Mapping Logic'].str.contains('title'))]

Unnamed: 0,Source Column Names,Mapping Logic,Target Field names,Target DataType
1,FIRST_NAME,convert the name to title case,FIRST_NAME,VARCHAR
3,LAST_NAME,convert the last name in title case,LAST_NAME,VARCHAR


In [129]:
customer_df.show(2)

+------+----------------+------------+-------------+-------------------+----------+----------+--------+----------+---------+--------------------+-----------+---------+-----------------+
|APT_NO|  CREDIT_CARD_NO|   CUST_CITY| CUST_COUNTRY|         CUST_EMAIL|CUST_PHONE|CUST_STATE|CUST_ZIP|FIRST_NAME|LAST_NAME|        LAST_UPDATED|MIDDLE_NAME|      SSN|      STREET_NAME|
+------+----------------+------------+-------------+-------------------+----------+----------+--------+----------+---------+--------------------+-----------+---------+-----------------+
|   656|4210653310061055|     Natchez|United States|AHooper@example.com|   1237818|        MS|   39120|      Alec|   Hooper|2018-04-21T12:49:...|         Wm|123456100|Main Street North|
|   829|4210653310102868|Wethersfield|United States|EHolman@example.com|   1238933|        CT|   06109|      Etta|   Holman|2018-04-21T12:49:...|    Brendan|123453023|    Redwood Drive|
+------+----------------+------------+-------------+------------------

### Transforming the name

In [292]:
customer_df_backup = customer_df


In [184]:
import pyspark.sql.functions as F

In [232]:
customer_df = customer_df.withColumns({'FIRST_NAME':F.initcap(customer_df['FIRST_NAME']), 'LAST_NAME':F.initcap(customer_df['LAST_NAME'])})

In [233]:
customer_df = customer_df.withColumn('MIDDLE_NAME', F.lower(customer_df['MIDDLE_NAME']))

Confirming the transformation is correct

In [234]:
customer_df['FIRST_NAME', 'LAST_NAME', 'MIDDLE_NAME'].show(2)

+----------+---------+-----------+
|FIRST_NAME|LAST_NAME|MIDDLE_NAME|
+----------+---------+-----------+
|      Alec|   Hooper|         wm|
|      Etta|   Holman|    brendan|
+----------+---------+-----------+
only showing top 2 rows
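
For intuition, here is a rough plain-Python model of what `F.initcap` and `F.lower` do to each value. It assumes words are separated by single spaces and is only an approximation of Spark's behaviour, not its implementation; `initcap_like` is a hypothetical name and the second input is made up:

```python
from typing import Optional


def initcap_like(value: Optional[str]) -> Optional[str]:
    """Lower-case the string, then upper-case the first letter of each space-separated word."""
    if value is None:
        return None  # missing values stay missing
    return " ".join(word[:1].upper() + word[1:].lower() for word in value.split(" "))


print(initcap_like("alec"))        # Alec
print(initcap_like("VAN HOOPER"))  # Van Hooper
print("Wm".lower())                # wm
```

Unlike Python's `str.title()`, this model does not capitalise letters that follow an apostrophe or hyphen.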



In [235]:
customer_transform_guide.iloc[3:]


Unnamed: 0,Source Column Names,Mapping Logic,Target Field names,Target DataType
5,"STREET_NAME,APT_NO","concatenate apartment no and street name of customer's residence with comma as a seperator (street, apartment)",FULL_STREET_ADDRESS,VARCHAR
10,CUST_PHONE,change the format of phone number to (xxx)xxx-xxxx,CUST_PHONE,VARCHAR


### Transforming the Address

The easiest way to do this is to register the dataframe as a global temporary view, which is tied to the current Spark application. <br>
I'm operating on the view with SQL instead of Python here because it's much easier and cleaner.

In [245]:
customer_df.createOrReplaceGlobalTempView('customer')  # returns None, so there is nothing to assign

In [253]:
customer_df = spark_sesh.sql("""
SELECT
    *,
    CONCAT(STREET_NAME, ", ", APT_NO) AS FULL_STREET_ADDRESS
FROM
    global_temp.customer
""")

📝 Note: <br>
It might be *expensive* to select every column with 'select *', which includes apt_no and street_name *again* alongside the concatenated column, <br>
and then drop the extraneous columns afterwards, but with this dataset the operations are pretty fast, and I'd rather save time using * and .drop() than list each and every column manually. <br>

Either way, we still have one more transformation left, and PySpark dataframes are immutable, so every transformation returns a new dataframe rather than modifying the existing one.

In [254]:
customer_df.show(2)

+------+----------------+------------+-------------+-------------------+----------+----------+--------+----------+---------+--------------------+-----------+---------+-----------------+--------------------+
|APT_NO|  CREDIT_CARD_NO|   CUST_CITY| CUST_COUNTRY|         CUST_EMAIL|CUST_PHONE|CUST_STATE|CUST_ZIP|FIRST_NAME|LAST_NAME|        LAST_UPDATED|MIDDLE_NAME|      SSN|      STREET_NAME| FULL_STREET_ADDRESS|
+------+----------------+------------+-------------+-------------------+----------+----------+--------+----------+---------+--------------------+-----------+---------+-----------------+--------------------+
|   656|4210653310061055|     Natchez|United States|AHooper@example.com|   1237818|        MS|   39120|      Alec|   Hooper|2018-04-21T12:49:...|         wm|123456100|Main Street North|Main Street North...|
|   829|4210653310102868|Wethersfield|United States|EHolman@example.com|   1238933|        CT|   06109|      Etta|   Holman|2018-04-21T12:49:...|    brendan|123453023|    R

In [255]:
customer_df = customer_df.drop('APT_NO', 'STREET_NAME')

The address is now concatenated and the other two columns are dropped.
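
One thing worth keeping in mind about the `CONCAT` used above: in Spark SQL it returns NULL if any argument is NULL, whereas `CONCAT_WS` skips NULL arguments. A plain-Python model of the two behaviours, using `None` for NULL (the function names are hypothetical, and this is a sketch of the semantics rather than Spark's implementation):

```python
from typing import Optional


def concat_like(*parts: Optional[str]) -> Optional[str]:
    """Model of CONCAT: the result is NULL (None) if any part is NULL."""
    if any(part is None for part in parts):
        return None
    return "".join(parts)


def concat_ws_like(separator: str, *parts: Optional[str]) -> str:
    """Model of CONCAT_WS: NULL parts are skipped before joining."""
    return separator.join(part for part in parts if part is not None)


print(concat_like("Main Street North", ", ", "656"))    # Main Street North, 656
print(concat_like("Main Street North", ", ", None))     # None
print(concat_ws_like(", ", "Main Street North", None))  # Main Street North
```

If APT_NO can be missing in the source data, the second case is the one to watch for.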

In [257]:
customer_df.show(2, truncate=False)

+----------------+------------+-------------+-------------------+----------+----------+--------+----------+---------+-----------------------------+-----------+---------+----------------------+
|CREDIT_CARD_NO  |CUST_CITY   |CUST_COUNTRY |CUST_EMAIL         |CUST_PHONE|CUST_STATE|CUST_ZIP|FIRST_NAME|LAST_NAME|LAST_UPDATED                 |MIDDLE_NAME|SSN      |FULL_STREET_ADDRESS   |
+----------------+------------+-------------+-------------------+----------+----------+--------+----------+---------+-----------------------------+-----------+---------+----------------------+
|4210653310061055|Natchez     |United States|AHooper@example.com|1237818   |MS        |39120   |Alec      |Hooper   |2018-04-21T12:49:02.000-04:00|wm         |123456100|Main Street North, 656|
|4210653310102868|Wethersfield|United States|EHolman@example.com|1238933   |CT        |06109   |Etta      |Holman   |2018-04-21T12:49:02.000-04:00|brendan    |123453023|Redwood Drive, 829    |
+----------------+------------+----

### Transforming the Phone number

In [259]:
customer_transform_guide.iloc[4]

Source Column Names                                            CUST_PHONE
Mapping Logic          change the format of phone number to (xxx)xxx-xxxx
Target Field names                                             CUST_PHONE
Target DataType                                                   VARCHAR
Name: 10, dtype: object

In [263]:
customer_df[['CUST_PHONE']].show(2)

+----------+
|CUST_PHONE|
+----------+
|   1237818|
|   1238933|
+----------+
only showing top 2 rows



Here we are only required to add an artificial area code plus the parentheses and dash, so I'll do just that.


Using a pyspark.sql user-defined function (UDF) this time:

In [288]:
phone_refactor = F.udf(lambda phone: "(555)" + str(phone)[:3] + "-" + str(phone)[3:])
# str(phone)[:3] is the first three digits; str(phone)[3:] is everything after them
# F.udf defaults to a string return type, so CUST_PHONE becomes a string column

In [295]:
customer_df = customer_df.withColumn('CUST_PHONE', phone_refactor(customer_df['CUST_PHONE']))
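
The lambda can also be written as a named function, which makes it easier to test and to guard against unexpected input. A minimal sketch, assuming every source number has exactly seven digits and keeping the placeholder area code 555 from the lambda; `format_phone` is a hypothetical name:

```python
from typing import Optional


def format_phone(phone, area_code: str = "555") -> Optional[str]:
    """Format a 7-digit number as (xxx)xxx-xxxx using a placeholder area code."""
    if phone is None:
        return None  # leave missing values missing
    digits = str(phone)
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError(f"Expected a 7-digit phone number, got {digits!r}")
    return f"({area_code}){digits[:3]}-{digits[3:]}"


print(format_phone(1237818))  # (555)123-7818
```

It could then be wrapped the same way as the lambda, for example with `F.udf(format_phone)`.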

In [296]:
customer_df.show(2)

+----------------+------------+-------------+-------------------+-------------+----------+--------+----------+---------+--------------------+-----------+---------+--------------------+
|  CREDIT_CARD_NO|   CUST_CITY| CUST_COUNTRY|         CUST_EMAIL|   CUST_PHONE|CUST_STATE|CUST_ZIP|FIRST_NAME|LAST_NAME|        LAST_UPDATED|MIDDLE_NAME|      SSN| FULL_STREET_ADDRESS|
+----------------+------------+-------------+-------------------+-------------+----------+--------+----------+---------+--------------------+-----------+---------+--------------------+
|4210653310061055|     Natchez|United States|AHooper@example.com|(555)123-7818|        MS|   39120|      Alec|   Hooper|2018-04-21T12:49:...|         wm|123456100|Main Street North...|
|4210653310102868|Wethersfield|United States|EHolman@example.com|(555)123-8933|        CT|   06109|      Etta|   Holman|2018-04-21T12:49:...|    brendan|123453023|  Redwood Drive, 829|
+----------------+------------+-------------+-------------------+----------

Checking the types, you can see the phone number now has a more appropriate string datatype, a side effect of the UDF. <br>
(It was previously a long integer.)

In [297]:
customer_df.dtypes

[('CREDIT_CARD_NO', 'string'),
 ('CUST_CITY', 'string'),
 ('CUST_COUNTRY', 'string'),
 ('CUST_EMAIL', 'string'),
 ('CUST_PHONE', 'string'),
 ('CUST_STATE', 'string'),
 ('CUST_ZIP', 'string'),
 ('FIRST_NAME', 'string'),
 ('LAST_NAME', 'string'),
 ('LAST_UPDATED', 'string'),
 ('MIDDLE_NAME', 'string'),
 ('SSN', 'bigint'),
 ('FULL_STREET_ADDRESS', 'string')]

In [299]:
customer_map

Unnamed: 0,Source Column Names,Mapping Logic,Target Field names,Target DataType
0,SSN,direct move,SSN,INT
1,FIRST_NAME,convert the name to title case,FIRST_NAME,VARCHAR
2,MIDDLE_NAME,convert the middle name in lower case,MIDDLE_NAME,VARCHAR
3,LAST_NAME,convert the last name in title case,LAST_NAME,VARCHAR
4,CREDIT_CARD_NO,direct move,Credit_card_no,VARCHAR
5,"STREET_NAME,APT_NO","concatenate apartment no and street name of customer's residence with comma as a seperator (street, apartment)",FULL_STREET_ADDRESS,VARCHAR
6,CUST_CITY,direct move,CUST_CITY,VARCHAR
7,CUST_STATE,direct move,CUST_STATE,VARCHAR
8,CUST_COUNTRY,direct move,CUST_COUNTRY,VARCHAR
9,CUST_ZIP,direct move,CUST_ZIP,INT
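
Putting the dtypes next to the mapping document makes it easier to spot columns whose type may still need casting before the load. A rough sketch of that comparison in plain Python, assuming INT corresponds to Spark's `int` and VARCHAR to `string` (that correspondence is my assumption, not something the mapping document states); `find_type_mismatches` is a hypothetical helper:

```python
# Spark dtypes copied from the customer_df.dtypes output above
spark_dtypes = {
    "CREDIT_CARD_NO": "string",
    "CUST_CITY": "string",
    "CUST_COUNTRY": "string",
    "CUST_STATE": "string",
    "CUST_ZIP": "string",
    "FIRST_NAME": "string",
    "LAST_NAME": "string",
    "MIDDLE_NAME": "string",
    "SSN": "bigint",
    "FULL_STREET_ADDRESS": "string",
}

# Target field names and types copied from the mapping document
target_types = {
    "SSN": "INT",
    "FIRST_NAME": "VARCHAR",
    "MIDDLE_NAME": "VARCHAR",
    "LAST_NAME": "VARCHAR",
    "Credit_card_no": "VARCHAR",
    "FULL_STREET_ADDRESS": "VARCHAR",
    "CUST_CITY": "VARCHAR",
    "CUST_STATE": "VARCHAR",
    "CUST_COUNTRY": "VARCHAR",
    "CUST_ZIP": "INT",
}

# Assumed correspondence between target types and Spark dtype names
EXPECTED_SPARK_TYPE = {"INT": "int", "VARCHAR": "string"}


def find_type_mismatches(dtypes, targets):
    """Return {target field: (actual dtype, expected dtype)} for columns that differ."""
    actual_by_name = {name.upper(): dtype for name, dtype in dtypes.items()}
    mismatches = {}
    for field, target in targets.items():
        actual = actual_by_name.get(field.upper())  # case-insensitive column match
        expected = EXPECTED_SPARK_TYPE[target]
        if actual != expected:
            mismatches[field] = (actual, expected)
    return mismatches


print(find_type_mismatches(spark_dtypes, target_types))
```

Under that assumption, SSN (`bigint`) and CUST_ZIP (`string`) are the two columns that differ from their INT targets.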


---