Initiating the process for cleaning the branch dataset. The raw data is contained in a JSON file. The data upload occurred on 6/14/24, as prescribed in the Capstone 350 Project Requirements document. The process requires PySpark modules.

In [1]:
import pyspark
import pyspark.sql.functions as funct
from pyspark.sql import SparkSession
import pandas as pd
import credentials as cred 


In [2]:
# Create the SparkSession for this notebook (getOrCreate reuses an existing session if one is running)
spark = SparkSession.builder.appName("branch_clean").getOrCreate()

In [3]:
#Creating a shortcut to the filepath for the source file for branches
branch_filepath = r"C:\Users\chito\Developer\Capstone_350\Raw_Data\cdw_sapp_branch.json"


In [4]:
# The raw JSON file stores its records across multiple lines, which makes a plain spark.read.json call
# fail, so the reader is configured with option("multiLine", True).

branch_df = spark.read.option("multiLine", True).json(branch_filepath)
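
By default Spark's JSON reader expects one complete JSON object per line (the JSON Lines layout). The sketch below uses only the standard library to show why a pretty-printed file breaks a line-by-line parser but loads fine when read as one document. The two sample records and the helper `parse_line_by_line` are hypothetical, and the actual layout of the branch file is an assumption.

```python
import json

# Hypothetical sample records; the real file's layout is assumed, not shown in this notebook.
records = [
    {"BRANCH_CODE": 1, "BRANCH_CITY": "Lakeville"},
    {"BRANCH_CODE": 2, "BRANCH_CITY": "Huntley"},
]

# Layout A: JSON Lines, one complete object per line.
json_lines_text = "\n".join(json.dumps(r) for r in records)

# Layout B: a single pretty-printed JSON document that spans many lines.
multi_line_text = json.dumps(records, indent=2)


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON document; return (rows, bad_line_count)."""
    rows, bad = [], 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1
    return rows, bad


rows_a, bad_a = parse_line_by_line(json_lines_text)   # every line parses
rows_b, bad_b = parse_line_by_line(multi_line_text)   # no single line is valid JSON
whole_document = json.loads(multi_line_text)          # parsing the whole text works
```

Reading the text as a whole is roughly what the `multiLine` option asks Spark to do for each file.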

# **Data Exploration**

In [5]:
#Showing the first 10 rows 
branch_df.show(10)

+-----------------+-----------+------------+------------+------------+-----------------+----------+--------------------+
|      BRANCH_CITY|BRANCH_CODE| BRANCH_NAME|BRANCH_PHONE|BRANCH_STATE|    BRANCH_STREET|BRANCH_ZIP|        LAST_UPDATED|
+-----------------+-----------+------------+------------+------------+-----------------+----------+--------------------+
|        Lakeville|          1|Example Bank|  1234565276|          MN|     Bridle Court|     55044|2018-04-18T16:51:...|
|          Huntley|          2|Example Bank|  1234618993|          IL|Washington Street|     60142|2018-04-18T16:51:...|
|SouthRichmondHill|          3|Example Bank|  1234985926|          NY|    Warren Street|     11419|2018-04-18T16:51:...|
|       Middleburg|          4|Example Bank|  1234663064|          FL| Cleveland Street|     32068|2018-04-18T16:51:...|
|    KingOfPrussia|          5|Example Bank|  1234849701|          PA|      14th Street|     19406|2018-04-18T16:51:...|
|         Paterson|          7|E

In [6]:
branch_df.printSchema()

root
 |-- BRANCH_CITY: string (nullable = true)
 |-- BRANCH_CODE: long (nullable = true)
 |-- BRANCH_NAME: string (nullable = true)
 |-- BRANCH_PHONE: string (nullable = true)
 |-- BRANCH_STATE: string (nullable = true)
 |-- BRANCH_STREET: string (nullable = true)
 |-- BRANCH_ZIP: long (nullable = true)
 |-- LAST_UPDATED: string (nullable = true)



In [7]:
branch_df.count()

115

In [8]:
branch_df.describe().show()

+-------+-----------+-----------------+------------+--------------------+------------+-------------+------------------+--------------------+
|summary|BRANCH_CITY|      BRANCH_CODE| BRANCH_NAME|        BRANCH_PHONE|BRANCH_STATE|BRANCH_STREET|        BRANCH_ZIP|        LAST_UPDATED|
+-------+-----------+-----------------+------------+--------------------+------------+-------------+------------------+--------------------+
|  count|        115|              115|         115|                 115|         115|          115|               115|                 115|
|   mean|       NULL|76.67826086956522|        NULL|1.2345499259478261E9|        NULL|         NULL|  38975.2347826087|                NULL|
| stddev|       NULL|52.94113709535237|        NULL|  258751.74757815443|        NULL|         NULL|23938.156819564818|                NULL|
|    min|    Acworth|                1|Example Bank|          1234105725|          AL|  11th Street|              2155|2018-04-18T16:51:...|
|    max|   Y

In [9]:
print(branch_df.distinct().count()) # 115 unique branches
branch_df.select(pyspark.sql.functions.countDistinct("BRANCH_CITY")).show() #115 unique branch cities
branch_df.select(pyspark.sql.functions.countDistinct("BRANCH_CODE")).show() #115 unique branch codes

115
+---------------------------+
|count(DISTINCT BRANCH_CITY)|
+---------------------------+
|                        115|
+---------------------------+

+---------------------------+
|count(DISTINCT BRANCH_CODE)|
+---------------------------+
|                        115|
+---------------------------+



In [10]:
branch_df.explain()

== Physical Plan ==
FileScan json [BRANCH_CITY#0,BRANCH_CODE#1L,BRANCH_NAME#2,BRANCH_PHONE#3,BRANCH_STATE#4,BRANCH_STREET#5,BRANCH_ZIP#6L,LAST_UPDATED#7] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex(1 paths)[file:/C:/Users/chito/Developer/Capstone_350/Raw_Data/cdw_sapp_branch.j..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<BRANCH_CITY:string,BRANCH_CODE:bigint,BRANCH_NAME:string,BRANCH_PHONE:string,BRANCH_STATE:...




In [11]:
branch_df.groupBy('BRANCH_NAME').count().orderBy('count').show() #All 115 branches have the Example Bank name
branch_df.groupBy()

+------------+-----+
| BRANCH_NAME|count|
+------------+-----+
|Example Bank|  115|
+------------+-----+



GroupedData[grouping expressions: [], value: [BRANCH_CITY: string, BRANCH_CODE: bigint ... 6 more fields], type: GroupBy]

In [12]:
from pyspark.sql.functions import col
# Find count of NA or missing values for each column
na_counts = {col_name: branch_df.filter(col(col_name).isNull() | (col(col_name) == "")).count() for col_name in branch_df.columns}

# Print the counts of missing values per column
for column, count in na_counts.items():
    print(f"Column {column} has {count} missing values")

Column BRANCH_CITY has 0 missing values
Column BRANCH_CODE has 0 missing values
Column BRANCH_NAME has 0 missing values
Column BRANCH_PHONE has 0 missing values
Column BRANCH_STATE has 0 missing values
Column BRANCH_STREET has 0 missing values
Column BRANCH_ZIP has 0 missing values
Column LAST_UPDATED has 0 missing values
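
The rule applied above treats a value as missing when it is null or an empty string. A minimal plain-Python sketch of the same rule, using a hypothetical helper `count_missing` and made-up sample rows, could look like this:

```python
def count_missing(rows):
    """Count values that are None or an empty string, per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or value == "":
                counts[column] += 1
    return counts


# Made-up rows for illustration only; they are not taken from the branch data.
sample_rows = [
    {"BRANCH_CITY": "Lakeville", "BRANCH_ZIP": 55044},
    {"BRANCH_CITY": "", "BRANCH_ZIP": None},
]
missing = count_missing(sample_rows)
```

Note that whitespace-only strings are not counted as missing under this rule.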


#### Preliminary analysis indicates that the data is consistent, aligns with the schema, and contains no NA or null values. Because nothing is missing, there is no benefit to applying the default value of "99999" described in the mapping requirements for the branch dataset. However, some zip codes have only four digits, while all US zip codes have five. Let's investigate how many and which zip codes are of improper length.

In [13]:
branch_df.select('BRANCH_CITY', 'BRANCH_STATE','BRANCH_ZIP')\
    .where(pyspark.sql.functions.length(branch_df["BRANCH_ZIP"]) < 5).show()

+------------+------------+----------+
| BRANCH_CITY|BRANCH_STATE|BRANCH_ZIP|
+------------+------------+----------+
|    Paterson|          NJ|      7501|
|Wethersfield|          CT|      6109|
|Hillsborough|          NJ|      8844|
|     Medford|          MA|      2155|
|    Rockaway|          NJ|      7866|
|  LongBranch|          NJ|      7740|
|   Irvington|          NJ|      7111|
|    NewHaven|          CT|      6511|
|      Quincy|          MA|      2169|
+------------+------------+----------+




#### **The zip code map shows that zip codes in these northeastern US states begin with 02 through 08, which covers the zip codes of the states listed above. Every zip code in the region starts with "0", and because BRANCH_ZIP was read as a long, that leading zero was dropped. We need to adjust the BRANCH_ZIP column to add the initial "0" back to these zip codes.**

<img src="US_zip_code_map.png" alt="US Zip Code Map" style="width:750px;height:600px;">


# Transforming the Data

In [14]:
branch_df = branch_df.withColumn('BRANCH_ZIP',\
                    pyspark.sql.functions.when((pyspark.sql.functions.length(branch_df['BRANCH_ZIP']) == 4) &
                        branch_df['BRANCH_STATE'].isin(["NJ", "CT", "NH", "MA", "VT", "RI", "ME"]),
                    pyspark.sql.functions.format_string("0%s",branch_df['BRANCH_ZIP']))\
                    .otherwise(branch_df["BRANCH_ZIP"]))
# format_string("0%s", ...) prepends a "0" to every zip code that meets both conditions joined by the & operator
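
The same repair can be sketched in plain Python as left-padding with zeros. The helper `restore_zip` is hypothetical, and unlike the cell above it pads every short value regardless of state, so it is a simplification rather than the exact transformation used here.

```python
def restore_zip(zip_value):
    """Render an integer ZIP code as a 5-character string, left-padded with zeros."""
    text = str(zip_value)
    # Reject anything that is not made of digits or is longer than five characters.
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a 5-digit ZIP code: {zip_value!r}")
    return text.zfill(5)


# Sample values taken from the table of short zip codes above, plus one full-length code.
fixed = [restore_zip(z) for z in (7501, 6109, 2155, 55044)]
```

In Spark, a similar effect could likely be achieved with `pyspark.sql.functions.lpad` after casting the column to a string, though that variant was not tested in this notebook.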

In [15]:
#Verifying that all zip code values are of length = 5 as per zip code requirements
branch_df.withColumn("zip_len", pyspark.sql.functions.length(branch_df["BRANCH_ZIP"]))\
    .groupBy("zip_len").count().show()

+-------+-----+
|zip_len|count|
+-------+-----+
|      5|  115|
+-------+-----+



In [16]:
# Verify that all phone numbers share the same three-digit prefix (substring positions are 1-based)
branch_ac_df = branch_df.withColumn("first_phone", pyspark.sql.functions.substring("BRANCH_PHONE", 1, 3))
branch_ac_df.groupBy('first_phone').count().orderBy('count').show()

+-----------+-----+
|first_phone|count|
+-----------+-----+
|        123|  115|
+-----------+-----+



### Creating a User-Defined Function (UDF) to transform phone numbers into the format (xxx)xxx-xxxx

In [17]:
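def format_phone_number(phone_number):
    """Format a 10-character phone string as (xxx)xxx-xxxx."""
    # Spark passes None to a UDF for null values, so guard before calling len()
    if phone_number is not None and len(phone_number) == 10:
        return f"({phone_number[:3]}){phone_number[3:6]}-{phone_number[6:10]}"
    else:
        return "Invalid phone number length"


# Quick check with a sample value
formatted_number = format_phone_number("1234567890")
print(formatted_number)
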
(123)456-7890
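
As an alternative to slicing, the same formatting can be expressed as a regular-expression substitution. This is only a sketch: `format_phone_regex` is a hypothetical helper, and it differs from the function above in that values that are not exactly ten digits are returned unchanged instead of being replaced with an error message.

```python
import re

# Three capture groups: area code, exchange, line number.
PHONE_PATTERN = re.compile(r"^(\d{3})(\d{3})(\d{4})$")


def format_phone_regex(phone_number):
    """Return (xxx)xxx-xxxx for a 10-digit string; leave anything else untouched."""
    if phone_number is None:
        return None
    return PHONE_PATTERN.sub(r"(\1)\2-\3", phone_number)


formatted = format_phone_regex("1234565276")
unchanged = format_phone_regex("12345")
```

Spark's built-in `regexp_replace` accepts a comparable pattern (with `$1`-style group references), which might avoid the overhead of a Python UDF on larger datasets.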


In [18]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType



# Create a UDF from the Python function
format_phone_number_udf = udf(format_phone_number, StringType())

# Apply the UDF to the DataFrame column

branch_df = branch_df.withColumn("BRANCH_PHONE", format_phone_number_udf(branch_df["BRANCH_PHONE"]))

branch_df.show(50)
# The UDF successfully transforms the phone numbers

+-----------------+-----------+------------+-------------+------------+-------------------+----------+--------------------+
|      BRANCH_CITY|BRANCH_CODE| BRANCH_NAME| BRANCH_PHONE|BRANCH_STATE|      BRANCH_STREET|BRANCH_ZIP|        LAST_UPDATED|
+-----------------+-----------+------------+-------------+------------+-------------------+----------+--------------------+
|        Lakeville|          1|Example Bank|(123)456-5276|          MN|       Bridle Court|     55044|2018-04-18T16:51:...|
|          Huntley|          2|Example Bank|(123)461-8993|          IL|  Washington Street|     60142|2018-04-18T16:51:...|
|SouthRichmondHill|          3|Example Bank|(123)498-5926|          NY|      Warren Street|     11419|2018-04-18T16:51:...|
|       Middleburg|          4|Example Bank|(123)466-3064|          FL|   Cleveland Street|     32068|2018-04-18T16:51:...|
|    KingOfPrussia|          5|Example Bank|(123)484-9701|          PA|        14th Street|     19406|2018-04-18T16:51:...|
|       

### After reviewing the table above, one issue becomes obvious: cities with multiple words, like El Paso or Redondo Beach, are not separated by white space in the BRANCH_CITY column. Correcting this will improve the presentation of the column data.

In [19]:
from pyspark.sql.functions import regexp_replace

# regexp_replace finds every lowercase letter immediately followed by an uppercase letter
# and inserts a white space between the two characters.
branch_df = branch_df.withColumn("BRANCH_CITY", regexp_replace(branch_df["BRANCH_CITY"], r"([a-z])([A-Z])", r"$1 $2"))

branch_df.show(50)


+-------------------+-----------+------------+-------------+------------+-------------------+----------+--------------------+
|        BRANCH_CITY|BRANCH_CODE| BRANCH_NAME| BRANCH_PHONE|BRANCH_STATE|      BRANCH_STREET|BRANCH_ZIP|        LAST_UPDATED|
+-------------------+-----------+------------+-------------+------------+-------------------+----------+--------------------+
|          Lakeville|          1|Example Bank|(123)456-5276|          MN|       Bridle Court|     55044|2018-04-18T16:51:...|
|            Huntley|          2|Example Bank|(123)461-8993|          IL|  Washington Street|     60142|2018-04-18T16:51:...|
|South Richmond Hill|          3|Example Bank|(123)498-5926|          NY|      Warren Street|     11419|2018-04-18T16:51:...|
|         Middleburg|          4|Example Bank|(123)466-3064|          FL|   Cleveland Street|     32068|2018-04-18T16:51:...|
|    King Of Prussia|          5|Example Bank|(123)484-9701|          PA|        14th Street|     19406|2018-04-18T16:
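
The pattern can be tried outside Spark with Python's `re` module, which uses `\1` instead of `$1` for group references. The sketch below uses a hypothetical helper `split_camel_case`. It also shows one caveat of the rule: a name such as "McAllen" would be split as well. Whether such a name occurs in the branch data was not checked here.

```python
import re


def split_camel_case(name):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", name)


examples = {
    "SouthRichmondHill": split_camel_case("SouthRichmondHill"),
    "KingOfPrussia": split_camel_case("KingOfPrussia"),
    "Lakeville": split_camel_case("Lakeville"),
    "McAllen": split_camel_case("McAllen"),  # caveat: this name gets split too
}
```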

#### **As outlined in the mapping document requirements, the dataframe columns should be re-ordered to match the document's column order:**

BRANCH_CODE | BRANCH_NAME | BRANCH_STREET | BRANCH_CITY | BRANCH_STATE | BRANCH_ZIP | BRANCH_PHONE | LAST_UPDATED


In [20]:
branch_df = branch_df.select('BRANCH_CODE','BRANCH_NAME','BRANCH_STREET','BRANCH_CITY',
                 'BRANCH_STATE','BRANCH_ZIP','BRANCH_PHONE','LAST_UPDATED')

branch_df.show(5)

+-----------+------------+-----------------+-------------------+------------+----------+-------------+--------------------+
|BRANCH_CODE| BRANCH_NAME|    BRANCH_STREET|        BRANCH_CITY|BRANCH_STATE|BRANCH_ZIP| BRANCH_PHONE|        LAST_UPDATED|
+-----------+------------+-----------------+-------------------+------------+----------+-------------+--------------------+
|          1|Example Bank|     Bridle Court|          Lakeville|          MN|     55044|(123)456-5276|2018-04-18T16:51:...|
|          2|Example Bank|Washington Street|            Huntley|          IL|     60142|(123)461-8993|2018-04-18T16:51:...|
|          3|Example Bank|    Warren Street|South Richmond Hill|          NY|     11419|(123)498-5926|2018-04-18T16:51:...|
|          4|Example Bank| Cleveland Street|         Middleburg|          FL|     32068|(123)466-3064|2018-04-18T16:51:...|
|          5|Example Bank|      14th Street|    King Of Prussia|          PA|     19406|(123)484-9701|2018-04-18T16:51:...|
+-------

#### The final transformation requires that the LAST_UPDATED column be in a timestamp format as per the mapping document

In [21]:
from pyspark.sql import functions as funct


# Double quotes around the pattern avoid escaping the literal 'T'
branch_df = branch_df.withColumn('LAST_UPDATED', funct.to_timestamp('LAST_UPDATED', "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
branch_df.show()


+-----------+------------+-------------------+-------------------+------------+----------+-------------+-------------------+
|BRANCH_CODE| BRANCH_NAME|      BRANCH_STREET|        BRANCH_CITY|BRANCH_STATE|BRANCH_ZIP| BRANCH_PHONE|       LAST_UPDATED|
+-----------+------------+-------------------+-------------------+------------+----------+-------------+-------------------+
|          1|Example Bank|       Bridle Court|          Lakeville|          MN|     55044|(123)456-5276|2018-04-18 15:51:47|
|          2|Example Bank|  Washington Street|            Huntley|          IL|     60142|(123)461-8993|2018-04-18 15:51:47|
|          3|Example Bank|      Warren Street|South Richmond Hill|          NY|     11419|(123)498-5926|2018-04-18 15:51:47|
|          4|Example Bank|   Cleveland Street|         Middleburg|          FL|     32068|(123)466-3064|2018-04-18 15:51:47|
|          5|Example Bank|        14th Street|    King Of Prussia|          PA|     19406|(123)484-9701|2018-04-18 15:51:47|
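
Note that the hour shown after the conversion (15:51:47) differs from the hour in the raw string (16:51). Spark parses the offset at the end of the string and then displays timestamps in the session time zone, so a shift like this is expected when the two differ. The sketch below reproduces the effect with the standard library. The raw value's fractional seconds and `-04:00` offset, and the `-05:00` session offset, are assumptions chosen for illustration, since the full raw string is truncated in the output above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw value: the fraction and the -04:00 offset are assumed, not read from the file.
raw = "2018-04-18T16:51:47.000-04:00"

# fromisoformat understands the 'T' separator, milliseconds, and a +HH:MM / -HH:MM offset.
parsed = datetime.fromisoformat(raw)

# Assumed display time zone (UTC-05:00), standing in for Spark's session time zone.
session_tz = timezone(timedelta(hours=-5))

# Same instant, rendered in the display time zone.
displayed = parsed.astimezone(session_tz).strftime("%Y-%m-%d %H:%M:%S")
```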


In [22]:
print(spark.version)



3.5.1


### **Writing branch_df directly to the creditcard_capstone database is key. Below, you will notice several code blocks in which I tried roundabout methods of loading the data. Transforming it into a pandas DataFrame and then going through JSON files created numerous issues. Do not attempt this approach!**

In [23]:
branch_df.write.format("jdbc") \
  .mode("append") \
  .option("url", "jdbc:mysql://localhost:3306/creditcard_capstone") \
  .option("dbtable", "creditcard_capstone.CDW_SAPP_BRANCH") \
  .option("user", cred.user) \
  .option("password", cred.password) \
  .save()



In [24]:
branch_df.printSchema()

root
 |-- BRANCH_CODE: long (nullable = true)
 |-- BRANCH_NAME: string (nullable = true)
 |-- BRANCH_STREET: string (nullable = true)
 |-- BRANCH_CITY: string (nullable = true)
 |-- BRANCH_STATE: string (nullable = true)
 |-- BRANCH_ZIP: string (nullable = true)
 |-- BRANCH_PHONE: string (nullable = true)
 |-- LAST_UPDATED: timestamp (nullable = true)



In [28]:
pandas_df = branch_df.toPandas()
pandas_df.head()

   BRANCH_CODE   BRANCH_NAME      BRANCH_STREET          BRANCH_CITY  BRANCH_STATE  BRANCH_ZIP   BRANCH_PHONE         LAST_UPDATED
0            1  Example Bank       Bridle Court            Lakeville            MN       55044  (123)456-5276  2018-04-18 15:51:47
1            2  Example Bank  Washington Street              Huntley            IL       60142  (123)461-8993  2018-04-18 15:51:47
2            3  Example Bank      Warren Street  South Richmond Hill            NY       11419  (123)498-5926  2018-04-18 15:51:47
3            4  Example Bank   Cleveland Street           Middleburg            FL       32068  (123)466-3064  2018-04-18 15:51:47
4            5  Example Bank        14th Street      King Of Prussia            PA       19406  (123)484-9701  2018-04-18 15:51:47


In [26]:
!pip list

Package                   Version
------------------------- -----------
asttokens                 2.4.1
attrs                     23.2.0
blinker                   1.8.2
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
colorama                  0.4.6
comm                      0.2.2
contourpy                 1.2.1
cryptography              42.0.8
cycler                    0.12.1
dash                      2.17.0
dash-core-components      2.0.0
dash-html-components      2.0.0
dash-table                5.0.0
debugpy                   1.8.1
decorator                 5.1.1
executing                 2.0.1
fastjsonschema            2.19.1
findspark                 2.0.1
Flask                     3.0.3
fonttools                 4.51.0
grpcio                    1.64.1
grpcio-tools              1.64.1
idna                      3.7
importlib_metadata        7.1.0
ipykernel                 6.29.4
ipython              