This program uses PySpark SQL to extract customer, branch, and credit card data from the provided JSON files.

In [13]:
# Import libraries 
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql.functions import * 

# 1. Functional Requirements - Load Credit Card Database (SQL)

<b>Data Extraction and Transformation with Python and PySpark. </b><br>
For “Credit Card System,” create a Python and PySpark SQL program to read/extract the following JSON files according to the specifications found in the mapping document.
1. CDW_SAPP_BRANCH.JSON <br>
2. CDW_SAPP_CREDITCARD.JSON <br>
3. CDW_SAPP_CUSTOMER.JSON <br>
Note: Data Engineers will be required to transform the data based on the requirements found in the Mapping Document.
Hint: [You can use PYSQL “select statement query” or simple Pyspark RDD].

In [14]:
# Application to create Dataframes from source
spark = SparkSession.builder.master('local[1]').appName('CreditCardSystems').getOrCreate() 

# Extract the JSON files branch, credit and customer into a dataframe
df_branch = spark.read.json('cdw_sapp_branch.json')  
df_credit = spark.read.json('cdw_sapp_credit.json') 
df_customer = spark.read.json('cdw_sapp_customer.json')


In [33]:
# Adjust according to the mapping document

# Convert first and last name to Title Case and middle name to lower case
df_customer = df_customer.withColumn("FIRST_NAME", initcap(df_customer["FIRST_NAME"]))
df_customer = df_customer.withColumn("MIDDLE_NAME", lower(df_customer.MIDDLE_NAME))
df_customer = df_customer.withColumn("LAST_NAME", initcap(df_customer["LAST_NAME"]))
df_customer.select("FIRST_NAME", "MIDDLE_NAME","LAST_NAME").show(10)

+----------+-----------+---------+
|FIRST_NAME|MIDDLE_NAME|LAST_NAME|
+----------+-----------+---------+
|      Alec|         wm|   Hooper|
|      Etta|    brendan|   Holman|
|    Wilber|   ezequiel|   Dunham|
|   Eugenio|      trina|    Hardy|
|   Wilfred|        may|    Ayers|
|      Beau|    ambrose|  Woodard|
|    Sheila|      larry|     Kemp|
|     Wendy|        ora|   Hurley|
|      Alec|     tracie|  Gilmore|
|    Barbra|    mitchel|      Lau|
+----------+-----------+---------+
only showing top 10 rows
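
The casing rules can be illustrated without a Spark session. The sketch below mimics them in plain Python. `to_title_case` and `clean_names` are hypothetical helpers, and `to_title_case` only approximates `initcap`: it upper-cases the first letter of each space-separated word and lower-cases the rest.

```python
def to_title_case(name):
    """Approximate Spark's initcap for a single string (None stays None)."""
    if name is None:
        return None
    return " ".join(word.capitalize() for word in name.split(" "))


def clean_names(record):
    """Return a copy of a customer record with the name columns re-cased."""
    cleaned = dict(record)
    cleaned["FIRST_NAME"] = to_title_case(record.get("FIRST_NAME"))
    middle = record.get("MIDDLE_NAME")
    cleaned["MIDDLE_NAME"] = middle.lower() if middle is not None else None
    cleaned["LAST_NAME"] = to_title_case(record.get("LAST_NAME"))
    return cleaned


# Illustrative record, not taken from the source files
print(clean_names({"FIRST_NAME": "ALEC", "MIDDLE_NAME": "WM", "LAST_NAME": "hooper"}))
```

A plain-Python version like this can serve as a quick cross-check on a few rows before running the transformation on the full DataFrame.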



In [32]:
# Create a new column called FULL_STREET_ADDRESS and put apartment # and street name with comma separating them
df_customer = df_customer.withColumn("FULL_STREET_ADDRESS", concat(df_customer["APT_NO"], lit(",") , df_customer["STREET_NAME"]))
df_customer.select("FULL_STREET_ADDRESS").show(10)

+--------------------+
| FULL_STREET_ADDRESS|
+--------------------+
|656,Main Street N...|
|   829,Redwood Drive|
|683,12th Street East|
|253,Country Club ...|
|  301,Madison Street|
|    3,Colonial Drive|
|   84,Belmont Avenue|
|    728,Oxford Court|
|    81,Forest Street|
|    561,Court Street|
+--------------------+
only showing top 10 rows



<b>Data loading into Database </b><br>
Once PySpark has read the data from the JSON files, use Python, PySpark, and Python modules to load the data into an RDBMS (SQL) as follows: <br>
a) Create a Database in SQL(MariaDB), named “creditcard_capstone.” <br>
b) Create a Python and Pyspark Program to load/write the “Credit Card System Data” into RDBMS(creditcard_capstone). <br>
Tables should be created by the following names in RDBMS: <br>
CDW_SAPP_BRANCH <br>
CDW_SAPP_CREDIT_CARD <br>
CDW_SAPP_CUSTOMER <br>

In [23]:
# Create the table CDW_SAPP_BRANCH 
df_branch.write.format("jdbc") \
.mode("append") \
.option("url", "jdbc:mysql://localhost:3306/creditcard_capstone") \
.option("dbtable", "creditcard_capstone.CDW_SAPP_BRANCH") \
.option("user", "root") \
.option("password", "a") \
.save()

In [24]:
# Create the table CDW_SAPP_CREDIT_CARD 
df_credit.write.format("jdbc") \
.mode("append") \
.option("url", "jdbc:mysql://localhost:3306/creditcard_capstone") \
.option("dbtable", "creditcard_capstone.CDW_SAPP_CREDIT_CARD") \
.option("user", "root") \
.option("password", "a") \
.save()

In [30]:
# Create the table CDW_SAPP_CUSTOMER
# "overwrite" recreates the table from the current DataFrame schema. With "append" this
# write raised an AnalysisException, because the table already existed without the
# FULL_STREET_ADDRESS column.
df_customer.write.format("jdbc") \
.mode("overwrite") \
.option("url", "jdbc:mysql://localhost:3306/creditcard_capstone") \
.option("dbtable", "creditcard_capstone.CDW_SAPP_CUSTOMER") \
.option("user", "root") \
.option("password", "a") \
.save()
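
The three write cells repeat the same connection options and hardcode the password. A minimal sketch of one way to collect them in a helper follows. `jdbc_options`, `write_table`, and the environment variable names `DB_USER` and `DB_PASSWORD` are assumptions introduced for illustration, not part of the original program.

```python
import os

JDBC_URL = "jdbc:mysql://localhost:3306/creditcard_capstone"


def jdbc_options(table, env=os.environ):
    """Build the option dict for a Spark JDBC write to one table.

    The password is read from the (hypothetical) DB_PASSWORD variable
    instead of being written into the notebook.
    """
    return {
        "url": JDBC_URL,
        "dbtable": f"creditcard_capstone.{table}",
        "user": env.get("DB_USER", "root"),
        "password": env["DB_PASSWORD"],
    }


def write_table(df, table, mode="overwrite"):
    """Write a Spark DataFrame to the database with the shared options."""
    df.write.format("jdbc").mode(mode).options(**jdbc_options(table)).save()


# Usage, assuming the DataFrames created earlier in the notebook:
# write_table(df_branch, "CDW_SAPP_BRANCH")
# write_table(df_credit, "CDW_SAPP_CREDIT_CARD")
# write_table(df_customer, "CDW_SAPP_CUSTOMER")
```

Defaulting to `overwrite` keeps the table schema in step with the DataFrame when columns are added; pass `mode="append"` when the existing rows must be kept.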

# 2. Functional Requirements - Application Front-End
Once data is loaded into the database, we need a front-end (console) to see/display data. For that, create a console-based Python program to satisfy System Requirements 2 (2.1 and 2.2).

<b> Req-2.1 Transaction Details Module </b><br>
1) Used to display the transactions made by customers living in a given zip code for a given month and year. Order by day in
descending order. <br>
2) Used to display the number and total values of transactions for a given type.<br>
3) Used to display the number and total values of transactions for branches in a given state.<br>

# 2.1.1 Display the transactions for a given zip code, month, and year, ordered by day in descending order

In [3]:
# Input for month, year and zip code
Month = 8        # Input value for month
Year = 2018      # Input value for year
Zipcode = 39120  # Input value for zip code

# Use the cdw_sapp_credit_card data to get TRANSACTION_VALUE, DAY, MONTH and YEAR
# Use the cdw_sapp_customer data to get CUST_ZIP

# Register the DataFrames as SQL temporary views
df_credit.createOrReplaceTempView("credit")
df_customer.createOrReplaceTempView("customer")
sel = "SELECT customer.CUST_ZIP, credit.DAY, credit.MONTH, credit.YEAR, credit.TRANSACTION_TYPE, credit.TRANSACTION_VALUE"
# Note: no join condition links credit and customer, so every transaction in the month
# is paired with every customer in the zip code, regardless of who made it
frm = " FROM credit, customer"
where = " WHERE credit.YEAR = " + str(Year) + " AND credit.MONTH = " + str(Month) + " AND customer.CUST_ZIP = " + str(Zipcode)
ordr = " ORDER BY credit.DAY DESC" # Order by day, descending

sqlCredit = spark.sql(sel + frm + where + ordr)
sqlCredit.show(20)


+--------+---+-----+----+----------------+-----------------+
|CUST_ZIP|DAY|MONTH|YEAR|TRANSACTION_TYPE|TRANSACTION_VALUE|
+--------+---+-----+----+----------------+-----------------+
|   39120| 28|    8|2018|           Bills|            23.57|
|   39120| 28|    8|2018|            Test|             24.6|
|   39120| 28|    8|2018|           Bills|            23.57|
|   39120| 28|    8|2018|   Entertainment|            46.53|
|   39120| 28|    8|2018|           Bills|            23.57|
|   39120| 28|    8|2018|   Entertainment|            46.53|
|   39120| 28|    8|2018|           Bills|            23.57|
|   39120| 28|    8|2018|   Entertainment|            46.53|
|   39120| 28|    8|2018|           Bills|            23.57|
|   39120| 28|    8|2018|   Entertainment|            46.53|
|   39120| 28|    8|2018|           Bills|            23.57|
|   39120| 28|    8|2018|             Gas|            11.02|
|   39120| 28|    8|2018|            Test|             24.6|
|   39120| 28|    8|2018
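
The repeated rows in the output above are consistent with a query that has no join condition between `credit` and `customer`. Below is a minimal sketch of the same lookup with an explicit join on `CREDIT_CARD_NO`. It runs against Python's built-in `sqlite3` as a stand-in, so it works without Spark or MariaDB. The sample rows and card numbers are made up, `transactions_by_zip` is a hypothetical helper, and the `?` placeholders are DB-API syntax. The `JOIN ... ON` clause itself is standard SQL that should carry over to the `spark.sql` query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (CREDIT_CARD_NO TEXT, CUST_ZIP TEXT);
    CREATE TABLE credit (CREDIT_CARD_NO TEXT, DAY INT, MONTH INT, YEAR INT,
                         TRANSACTION_TYPE TEXT, TRANSACTION_VALUE REAL);
    -- Made-up sample data: two customers in 39120, one elsewhere
    INSERT INTO customer VALUES ('1111', '39120'), ('2222', '39120'), ('3333', '10001');
    INSERT INTO credit VALUES
        ('1111', 28, 8, 2018, 'Bills', 23.57),
        ('2222', 5, 8, 2018, 'Gas', 11.02),
        ('3333', 28, 8, 2018, 'Test', 24.60),
        ('1111', 9, 7, 2018, 'Bills', 5.00);
""")

QUERY = """
    SELECT customer.CUST_ZIP, credit.DAY, credit.MONTH, credit.YEAR,
           credit.TRANSACTION_TYPE, credit.TRANSACTION_VALUE
    FROM credit
    JOIN customer ON credit.CREDIT_CARD_NO = customer.CREDIT_CARD_NO
    WHERE credit.YEAR = ? AND credit.MONTH = ? AND customer.CUST_ZIP = ?
    ORDER BY credit.DAY DESC
"""


def transactions_by_zip(conn, year, month, zipcode):
    """Return each matching transaction once, newest day first."""
    return conn.execute(QUERY, (year, month, str(zipcode))).fetchall()


rows = transactions_by_zip(conn, 2018, 8, 39120)
for row in rows:
    print(row)
```

With the join in place, a transaction appears only when its card belongs to a customer in the requested zip code, and it appears exactly once.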

# 2.1.2 Display the number and total values of transactions for a given type

In [11]:
# Input for a given transaction type 
transact_type = "Bills"

sel = "SELECT TRANSACTION_ID, TRANSACTION_TYPE, TRANSACTION_VALUE"
frm = " FROM credit"
where = " WHERE TRANSACTION_TYPE = " + "\""+ transact_type + "\""
sqlCredit = spark.sql(sel + frm + where)
sqlCredit.show(20)

+--------------+----------------+-----------------+
|TRANSACTION_ID|TRANSACTION_TYPE|TRANSACTION_VALUE|
+--------------+----------------+-----------------+
|            10|           Bills|           100.38|
|            14|           Bills|            17.81|
|            15|           Bills|             29.0|
|            20|           Bills|            34.34|
|            23|           Bills|             7.97|
|            42|           Bills|            30.95|
|            46|           Bills|             4.95|
|            57|           Bills|             1.32|
|            58|           Bills|            13.72|
|            71|           Bills|            87.58|
|            74|           Bills|            45.93|
|            77|           Bills|            53.03|
|            83|           Bills|             55.4|
|            84|           Bills|            70.25|
|            86|           Bills|             7.62|
|            94|           Bills|            72.72|
|           
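
The requirement asks for the number and total value of the transactions, while the query above lists the individual rows. A minimal sketch of the aggregate form follows, again using the built-in `sqlite3` as a stand-in. `summarize_type` is a hypothetical helper, and the sample rows are illustrative (the two `Bills` rows reuse values from the output above, the `Gas` row is made up). The `COUNT`/`SUM` expression is standard SQL and should be usable in `spark.sql` as well.

```python
import sqlite3


def summarize_type(conn, transaction_type):
    """Return (number of transactions, total value) for one transaction type."""
    count, total = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(TRANSACTION_VALUE), 0) "
        "FROM credit WHERE TRANSACTION_TYPE = ?",
        (transaction_type,),
    ).fetchone()
    return count, round(total, 2)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit (TRANSACTION_ID INT, TRANSACTION_TYPE TEXT, TRANSACTION_VALUE REAL)"
)
conn.executemany(
    "INSERT INTO credit VALUES (?, ?, ?)",
    [(10, "Bills", 100.38), (14, "Bills", 17.81), (11, "Gas", 11.02)],
)

count, total = summarize_type(conn, "Bills")
print(f"Bills: {count} transactions, total value {total}")
```

Passing the type as a bound parameter, rather than concatenating it into the SQL text, also avoids quoting problems when the input comes from a console prompt.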

# 2.1.3 Display the number and total values of transactions for branches in a given state

In [12]:
# Register the DataFrame as a SQL temporary view
df_branch.createOrReplaceTempView("branch")

# Input for a given state
state = "TX"

sel = "SELECT branch.BRANCH_STATE, credit.TRANSACTION_ID, credit.TRANSACTION_VALUE"
frm = " FROM branch, credit"  # Note: no join condition, so every transaction is paired with every branch in the state
where = " WHERE branch.BRANCH_STATE = " + "\""+ state + "\""
sqlCredit = spark.sql(sel + frm + where)
sqlCredit.show(20)

+------------+--------------+-----------------+
|BRANCH_STATE|TRANSACTION_ID|TRANSACTION_VALUE|
+------------+--------------+-----------------+
|          TX|             1|             78.9|
|          TX|             1|             78.9|
|          TX|             1|             78.9|
|          TX|             1|             78.9|
|          TX|             1|             78.9|
|          TX|             2|            14.24|
|          TX|             2|            14.24|
|          TX|             2|            14.24|
|          TX|             2|            14.24|
|          TX|             2|            14.24|
|          TX|             3|             56.7|
|          TX|             3|             56.7|
|          TX|             3|             56.7|
|          TX|             3|             56.7|
|          TX|             3|             56.7|
|          TX|             4|            59.73|
|          TX|             4|            59.73|
|          TX|             4|           
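
Each transaction repeats in the output above because nothing ties a transaction to the branch that processed it. A minimal sketch of a per-state summary with an explicit join on `BRANCH_CODE` follows, using `sqlite3` as a stand-in. `BRANCH_CODE` appears in the credit data, but its presence in the branch data is an assumption here. `summarize_state` is a hypothetical helper, and the sample rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE branch (BRANCH_CODE INT, BRANCH_STATE TEXT);
    CREATE TABLE credit (TRANSACTION_ID INT, BRANCH_CODE INT, TRANSACTION_VALUE REAL);
    -- Illustrative data: two branches in TX, one in MS
    INSERT INTO branch VALUES (1, 'TX'), (2, 'TX'), (3, 'MS');
    INSERT INTO credit VALUES (1, 1, 78.9), (2, 2, 14.24), (3, 3, 56.7);
""")


def summarize_state(conn, state):
    """Return (number of transactions, total value) for branches in one state."""
    count, total = conn.execute(
        """
        SELECT COUNT(credit.TRANSACTION_ID), COALESCE(SUM(credit.TRANSACTION_VALUE), 0)
        FROM credit
        JOIN branch ON credit.BRANCH_CODE = branch.BRANCH_CODE
        WHERE branch.BRANCH_STATE = ?
        """,
        (state,),
    ).fetchone()
    return count, round(total, 2)


print(summarize_state(conn, "TX"))
```

Because each transaction matches only its own branch, the count and total are no longer multiplied by the number of branches in the state.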

<b>Req-2.2 Customer Details </b><br>
1) Used to check the existing account details of a customer.<br>
2) Used to modify the existing account details of a customer.<br>
3) Used to generate a monthly bill for a credit card number for a given month and year. <br>
4) Used to display the transactions made by a customer between two dates. Order by year, month, and day in descending order. <br>

# 2.2.1 Check the existing account details of a customer


In [13]:
# Input customer name
first_name = "Alec"
last_name = "Hooper"

sel = "SELECT APT_NO, CREDIT_CARD_NO, CUST_CITY, CUST_COUNTRY, CUST_EMAIL, CUST_PHONE, CUST_STATE, CUST_ZIP"
frm = " FROM customer"
where = " WHERE FIRST_NAME = " + "\""+ first_name + "\"" + " AND LAST_NAME = " + "\"" + last_name + "\""
sqlCredit = spark.sql(sel + frm + where)
sqlCredit.show(20)


+------+----------------+---------+-------------+-------------------+----------+----------+--------+
|APT_NO|  CREDIT_CARD_NO|CUST_CITY| CUST_COUNTRY|         CUST_EMAIL|CUST_PHONE|CUST_STATE|CUST_ZIP|
+------+----------------+---------+-------------+-------------------+----------+----------+--------+
|   656|4210653310061055|  Natchez|United States|AHooper@example.com|   1237818|        MS|   39120|
+------+----------------+---------+-------------+-------------------+----------+----------+--------+



# 2.2.2 Modify the existing account details of a customer


In [None]:
# ALTER DATABASE inventory SET DBPROPERTIES ('Edited-by' = 'John', 'Edit-date' = '01/01/2001');
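
The cell above is only a placeholder. One possible shape for this requirement is a parameterized `UPDATE` issued through a DB-API connection, sketched below with `sqlite3` standing in for the MariaDB database. `update_customer` and `EDITABLE_COLUMNS` are hypothetical names, the SSN is made up, and a MariaDB connector may use a different placeholder style than `?`.

```python
import sqlite3

# Columns a user may change; names taken from the customer schema
EDITABLE_COLUMNS = {
    "APT_NO", "STREET_NAME", "CUST_CITY", "CUST_STATE",
    "CUST_COUNTRY", "CUST_ZIP", "CUST_PHONE", "CUST_EMAIL",
}


def update_customer(conn, ssn, column, value):
    """Update one column of one customer and return the number of rows changed."""
    # Column names cannot be bound as parameters, so check them against a whitelist
    if column not in EDITABLE_COLUMNS:
        raise ValueError(f"Column {column!r} cannot be modified")
    cursor = conn.execute(f"UPDATE customer SET {column} = ? WHERE SSN = ?", (value, ssn))
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (SSN INT, CUST_CITY TEXT, CUST_PHONE TEXT)")
conn.execute("INSERT INTO customer VALUES (123456789, 'Natchez', '1237818')")

changed = update_customer(conn, 123456789, "CUST_CITY", "Jackson")
print(changed, conn.execute("SELECT CUST_CITY FROM customer").fetchone())
```

Returning the row count lets the console front end report when no customer matched the given SSN.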

# 2.2.3 Generate a monthly bill for a credit card number for a given month and year


In [20]:
# Input for the monthly bill: month, year and card number
month_bill = 8
year_bill = 2018
card_number = 4210653310061055

sel = "SELECT CREDIT_CARD_NO, YEAR, MONTH, SUM(TRANSACTION_VALUE) AS TOTAL_BILL"
frm = " FROM credit"
where = " WHERE CREDIT_CARD_NO = " + str(card_number) + " AND MONTH = " + str(month_bill) + " AND YEAR = " + str(year_bill)
# Every non-aggregated column in the SELECT list must appear in GROUP BY, and the clause
# must be appended to the query. Leaving it out raised
# "AnalysisException: grouping expressions sequence is empty".
gr_by = " GROUP BY CREDIT_CARD_NO, YEAR, MONTH"

sql_bill = spark.sql(sel + frm + where + gr_by)
sql_bill.show(20)


# 2.2.4 Display the transactions made by a customer between two dates. Order by year, month, and day in descending order.
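
No code exists for this requirement yet. Since the credit data stores `DAY`, `MONTH`, and `YEAR` as separate columns, one option is to combine them into a sortable number (`YEAR * 10000 + MONTH * 100 + DAY`) and filter on that. The sketch below tries the idea with `sqlite3` as a stand-in. `date_key` and `transactions_between` are hypothetical helpers, and the sample rows and SSNs are made up.

```python
import sqlite3
from datetime import date


def date_key(d):
    """Encode a date as the integer YYYYMMDD so dates compare like numbers."""
    return d.year * 10000 + d.month * 100 + d.day


def transactions_between(conn, cust_ssn, start, end):
    """Return a customer's transactions from start to end (inclusive), newest first."""
    return conn.execute(
        """
        SELECT YEAR, MONTH, DAY, TRANSACTION_TYPE, TRANSACTION_VALUE
        FROM credit
        WHERE CUST_SSN = ?
          AND YEAR * 10000 + MONTH * 100 + DAY BETWEEN ? AND ?
        ORDER BY YEAR DESC, MONTH DESC, DAY DESC
        """,
        (cust_ssn, date_key(start), date_key(end)),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit (CUST_SSN INT, DAY INT, MONTH INT, YEAR INT, "
    "TRANSACTION_TYPE TEXT, TRANSACTION_VALUE REAL)"
)
conn.executemany(
    "INSERT INTO credit VALUES (?, ?, ?, ?, ?, ?)",
    [
        (111, 15, 3, 2018, "Gas", 11.02),
        (111, 28, 8, 2018, "Bills", 23.57),
        (111, 1, 12, 2018, "Test", 24.6),
        (111, 31, 12, 2017, "Bills", 5.0),
        (222, 10, 5, 2018, "Gas", 9.5),
    ],
)

rows = transactions_between(conn, 111, date(2018, 1, 1), date(2018, 8, 31))
for row in rows:
    print(row)
```

The same arithmetic expression should also work in a Spark SQL `WHERE` clause, since it only uses integer columns and `BETWEEN`.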

# 3 - Functional Requirements - Data analysis and Visualization

After data is loaded into the database, users can make changes from the front end, and they can also view data from the front end. Now, the business analyst team wants to analyze and visualize the data according to the below requirements.

<b>Req - 3 Data Analysis and Visualization </b> <br>
1) Find and plot which transaction type has a high rate of transactions. <br>
2) Find and plot which state has a high number of customers. <br>
3) Find and plot the sum of all transactions for each customer, and which customer has the highest transaction amount. Hint: use CUST_SSN.
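
The aggregation behind requirements 1 and 3 can be sketched in plain Python before any plotting is added. The records below are illustrative rather than taken from the dataset, and `count_by_type` and `total_by_customer` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Illustrative transactions using the column names of the credit data
transactions = [
    {"CUST_SSN": 111, "TRANSACTION_TYPE": "Bills", "TRANSACTION_VALUE": 23.57},
    {"CUST_SSN": 111, "TRANSACTION_TYPE": "Gas", "TRANSACTION_VALUE": 11.02},
    {"CUST_SSN": 222, "TRANSACTION_TYPE": "Bills", "TRANSACTION_VALUE": 100.38},
]


def count_by_type(rows):
    """Number of transactions per transaction type."""
    return Counter(row["TRANSACTION_TYPE"] for row in rows)


def total_by_customer(rows):
    """Sum of transaction values per customer, keyed by CUST_SSN."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["CUST_SSN"]] += row["TRANSACTION_VALUE"]
    return {ssn: round(total, 2) for ssn, total in totals.items()}


type_counts = count_by_type(transactions)
customer_totals = total_by_customer(transactions)

top_type, top_count = type_counts.most_common(1)[0]
top_customer = max(customer_totals, key=customer_totals.get)
print(top_type, top_count, top_customer, customer_totals[top_customer])
```

The keys and values of `type_counts` and `customer_totals` are the labels and bar heights a bar chart needs, for example with matplotlib's `plt.bar`.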



# 4. Functional Requirements - LOAN Application Dataset

1. Create a Python program to GET (consume) data from the above API endpoint for the loan application dataset. <br>
2. Find the status code of the above API endpoint. <br>
3. Once Python reads data from the API, utilize PySpark to load data into RDBMS(SQL). The table name should be CDW-SAPP_loan_application in the database.
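
A minimal sketch of steps 1 and 2 follows, using the standard library's `urllib` rather than a third-party HTTP client. `fetch_loan_data` is a hypothetical helper, the endpoint URL is not given in this document and must be supplied by the caller, and the `opener` parameter exists only so the function can be tried without network access.

```python
import json
import urllib.error
import urllib.request


def fetch_loan_data(url, opener=urllib.request.urlopen):
    """GET a JSON endpoint and return (status_code, parsed_records).

    urlopen raises HTTPError for 4xx/5xx responses, so the status code
    is taken from the exception in that case and no records are returned.
    """
    try:
        with opener(url) as response:
            status = response.status
            records = json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        return err.code, []
    return status, records


# Usage, with the real endpoint substituted for the placeholder:
# status, records = fetch_loan_data("<loan application API URL>")
# print("Status code:", status, "- records:", len(records))
```

For step 3, the returned list of records could then be turned into a DataFrame, for instance with `spark.createDataFrame`, and written with the same JDBC pattern used for the other tables.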

# 5 - Functional Requirements - Data Analysis and Visualization for Loan Application

1. Find and plot the percentage of applications approved for self-employed applicants. <br>
2. Find the percentage of rejection for married male applicants. <br>
3. Find and plot the top three months with the largest transaction data.<br>
4. Find and plot which branch processed the highest total dollar value of healthcare transactions.
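
The two percentage questions share one calculation: among the rows in a group, what share has a given outcome. A minimal sketch follows. The field names (`Self_Employed`, `Married`, `Gender`, `Application_Status`) and their values are assumptions about the loan dataset, the records are made up, and `percentage` is a hypothetical helper.

```python
def percentage(rows, group, outcome):
    """Percent of the rows matching `group` that also match `outcome`."""
    selected = [row for row in rows if all(row.get(k) == v for k, v in group.items())]
    if not selected:
        return 0.0
    hits = sum(1 for row in selected if all(row.get(k) == v for k, v in outcome.items()))
    return round(100.0 * hits / len(selected), 2)


# Made-up applications with assumed field names
applications = [
    {"Self_Employed": "Yes", "Married": "No", "Gender": "Male", "Application_Status": "Y"},
    {"Self_Employed": "Yes", "Married": "Yes", "Gender": "Male", "Application_Status": "N"},
    {"Self_Employed": "No", "Married": "Yes", "Gender": "Male", "Application_Status": "Y"},
    {"Self_Employed": "No", "Married": "Yes", "Gender": "Female", "Application_Status": "N"},
]

approved_self_employed = percentage(
    applications, {"Self_Employed": "Yes"}, {"Application_Status": "Y"}
)
rejected_married_male = percentage(
    applications, {"Married": "Yes", "Gender": "Male"}, {"Application_Status": "N"}
)
print(approved_self_employed, rejected_married_male)
```

Keeping the group and the outcome as separate filters makes the denominator explicit, which is the usual source of mistakes in this kind of percentage.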