## Portfolio Summary

Using two notebooks (01_Ingestion and 02_Transformation), my goal is to simulate a development environment with a multi-hop/medallion architecture (see image below). To achieve this, I'm implementing Auto Loader and Structured Streaming.<br>
<br>
>*placeholder image, will be replaced ASAP*

<img src ='/files/images/ADE_arch_bronze.png'>

The data and base idea are derived from the Databricks Academy's Advanced Data Engineering with Databricks course.<br>
The entire pipeline was developed on Databricks Community Edition, so I primarily worked within a workspace with basic permissions.<br>

## 01_Ingestion

The goal of this notebook is to ingest JSON files into a multiplex table, which will serve as our bronze table, containing the entire history of this incremental feed.<br>
I will start by setting up the environment.

## Environment Settings

The goal here is to start from a clean slate every time the notebook runs. To achieve this, I'll:<br>
<br>
- Create our three schemas
- Wrap their creation in a function that can also reset the environment
- Import libraries
- Create base objects to use throughout the pipeline

In [0]:
# Import libraries
import pyspark.sql.functions as f

def env_setup():
    # Creating Bronze schema
    dbutils.fs.rm("dbfs:/user/hive/warehouse/bronze.db", True) # Clearing previous data
    spark.sql("DROP SCHEMA IF EXISTS bronze CASCADE") # Dropping schema and any tables still registered in it
    spark.sql("CREATE SCHEMA bronze") # Creating schema
    spark.catalog.setCurrentDatabase('bronze') # Setting the bronze schema as default for this notebook

    # Creating Silver schema
    dbutils.fs.rm("dbfs:/user/hive/warehouse/silver.db", True)
    spark.sql("DROP SCHEMA IF EXISTS silver CASCADE")
    spark.sql("CREATE SCHEMA silver")

    # Creating Gold schema
    dbutils.fs.rm("dbfs:/user/hive/warehouse/gold.db", True)
    spark.sql("DROP SCHEMA IF EXISTS gold CASCADE")
    spark.sql("CREATE SCHEMA gold")

    # Cleaning any Structured Streaming or Auto Loader checkpoints
    dbutils.fs.rm("dbfs:/mnt/borges_portifolio/_checkpoints", True) # Clearing checkpoints

# Setting path objects
main_path = "dbfs:/mnt/borges_portifolio"
files_path = f"{main_path}/auto_loader_files/files"
new_batch = f"{main_path}/auto_loader_files/second_wave/"

Removing the folder may seem unnecessary when the schema is dropped right after. However, I'm always working on a new cluster, where the schema may already be gone while its files persist, so dropping the schema alone wouldn't clear them. Doing both prepares me for either case.

In [0]:
env_setup()
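
Since the same steps repeat for each schema, the reset logic can also be expressed as data. The snippet below is a minimal sketch in plain Python: `reset_plan` is a hypothetical helper (not part of the notebook) that only builds the folder paths and SQL statements, assuming the default `dbfs:/user/hive/warehouse/<schema>.db` location used in `env_setup()`. It uses `CASCADE` so the drop would also succeed when a schema still holds tables.

```python
# Hypothetical helper: builds the reset steps for each schema without running them.
WAREHOUSE = "dbfs:/user/hive/warehouse"  # same location env_setup() clears

def reset_plan(schemas):
    """Return one dict per schema with the folder to delete and the SQL to run."""
    plan = []
    for name in schemas:
        plan.append({
            "path": f"{WAREHOUSE}/{name}.db",                 # files that outlive the schema
            "drop": f"DROP SCHEMA IF EXISTS {name} CASCADE",  # removes schema and its tables
            "create": f"CREATE SCHEMA {name}",
        })
    return plan

for step in reset_plan(["bronze", "silver", "gold"]):
    print(step["path"], "|", step["drop"], "|", step["create"])
```

Inside a notebook, each step could then be executed with `dbutils.fs.rm(step["path"], True)` followed by `spark.sql(step["drop"])` and `spark.sql(step["create"])`.
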

## Exploring samples

The main goal here is to gain a better understanding of the sample files I'll be working with.

In [0]:
# Listing the files in the source folder
files = dbutils.fs.ls(files_path)

# Displaying files
display(files)

path,name,size,modificationTime
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json,part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json,0,1708028988000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json,part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json,84059292,1708028994000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json,part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json,27495,1708028995000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json,part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json,8759,1708028996000


In [0]:
# List with file names
file_name = (
  "part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json","part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json","part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json","part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json"
  )

# Loop through each file
for file in file_name:
    # Read the file once and reuse the DataFrame below
    df = spark.read.json(f"{files_path}/{file}")

    # Display file name
    print("")
    print(f" -> {file}")
    print("")

    # Display first 10 rows of the file
    display(df.limit(10))

    # Display the number of rows in this file
    print("")
    print("Number of rows in this file", df.count())
    print("")

    # Display the schema of this file
    df.printSchema()


 -> part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json




Number of rows in this file 0

root


 -> part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json



key,offset,partition,timestamp,topic,value
MTE5NzE1,65087128,0,1575158410818,bpm,eyJkZXZpY2VfaWQiOiAxMTk3MTUsICJ0aW1lIjogMTU3NTE1ODQwMCwgImhlYXJ0cmF0ZSI6IDY2LjYyMjQ0OTczMTU3OTc2fQ==
MTA3OTUy,65102257,2,1575158411394,bpm,eyJkZXZpY2VfaWQiOiAxMDc5NTIsICJ0aW1lIjogMTU3NTE1ODQwMSwgImhlYXJ0cmF0ZSI6IDExNi4wMzEyNTg5ODAxMjA4N30=
MTYxOTEw,65087131,0,1575158417678,bpm,eyJkZXZpY2VfaWQiOiAxNjE5MTAsICJ0aW1lIjogMTU3NTE1ODQwMywgImhlYXJ0cmF0ZSI6IDgyLjMzMzg1OTEyMDg3MzE3fQ==
MTE2NzA4,64899138,3,1575158421868,bpm,eyJkZXZpY2VfaWQiOiAxMTY3MDgsICJ0aW1lIjogMTU3NTE1ODQxMiwgImhlYXJ0cmF0ZSI6IDgzLjEzNDUyNzA2MDg3NzA5fQ==
MTYxMzQz,64899148,3,1575158430841,bpm,eyJkZXZpY2VfaWQiOiAxNjEzNDMsICJ0aW1lIjogMTU3NTE1ODQyMCwgImhlYXJ0cmF0ZSI6IDU5LjcxODMyNzY1MTYyODA0fQ==
MTEzMDI0,65002565,4,1575158432217,bpm,eyJkZXZpY2VfaWQiOiAxMTMwMjQsICJ0aW1lIjogMTU3NTE1ODQyMywgImhlYXJ0cmF0ZSI6IDk0LjcyNTkyNjkxMDg1MjAxfQ==
MTQxNjg3,65102289,2,1575158437927,bpm,eyJkZXZpY2VfaWQiOiAxNDE2ODcsICJ0aW1lIjogMTU3NTE1ODQzMCwgImhlYXJ0cmF0ZSI6IDg0LjM4NzM2MzY4NjEzMDU0fQ==
MTQyODM4,64899162,3,1575158445876,bpm,eyJkZXZpY2VfaWQiOiAxNDI4MzgsICJ0aW1lIjogMTU3NTE1ODQzNCwgImhlYXJ0cmF0ZSI6IDg1Ljc2MzkzNjgwMDQ1MTAzfQ==
MTk3OTMw,65087190,0,1575158460427,bpm,eyJkZXZpY2VfaWQiOiAxOTc5MzAsICJ0aW1lIjogMTU3NTE1ODQ1MCwgImhlYXJ0cmF0ZSI6IDg0LjYyMDk2MzIyMTczNTA3fQ==
MTEwMzY3,65102354,2,1575158482354,bpm,eyJkZXZpY2VfaWQiOiAxMTAzNjcsICJ0aW1lIjogMTU3NTE1ODQ3MSwgImhlYXJ0cmF0ZSI6IDc2LjUwOTk5NTg3NjA2Njc0fQ==



Number of rows in this file 417164

root
 |-- key: string (nullable = true)
 |-- offset: long (nullable = true)
 |-- partition: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- topic: string (nullable = true)
 |-- value: string (nullable = true)


 -> part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json



key,offset,partition,timestamp,topic,value
NDA4NzI=,189240,0,1575229677709,workout,eyJ1c2VyX2lkIjogNDA4NzIsICJ3b3Jrb3V0X2lkIjogOCwgInRpbWVzdGFtcCI6IDE1NzUyMjk2NzUuMCwgImFjdGlvbiI6ICJzdGFydCIsICJzZXNzaW9uX2lkIjogNzZ9
MjkyMTM=,189198,0,1575196569155,workout,eyJ1c2VyX2lkIjogMjkyMTMsICJ3b3Jrb3V0X2lkIjogMTMsICJ0aW1lc3RhbXAiOiAxNTc1MTk2NTYyLjAsICJhY3Rpb24iOiAic3RhcnQiLCAic2Vzc2lvbl9pZCI6IDI5NX0=
MjkyMTM=,189199,0,1575200018703,workout,eyJ1c2VyX2lkIjogMjkyMTMsICJ3b3Jrb3V0X2lkIjogMTMsICJ0aW1lc3RhbXAiOiAxNTc1MjAwMDE2LjAsICJhY3Rpb24iOiAic3RvcCIsICJzZXNzaW9uX2lkIjogMjk1fQ==
Mjc3MDM=,189202,0,1575203437440,workout,eyJ1c2VyX2lkIjogMjc3MDMsICJ3b3Jrb3V0X2lkIjogNDcsICJ0aW1lc3RhbXAiOiAxNTc1MjAzNDMzLjAsICJhY3Rpb24iOiAic3RhcnQiLCAic2Vzc2lvbl9pZCI6IDQ1Nn0=
MTQ1MDg=,189227,0,1575224969124,workout,eyJ1c2VyX2lkIjogMTQ1MDgsICJ3b3Jrb3V0X2lkIjogMzEsICJ0aW1lc3RhbXAiOiAxNTc1MjI0OTY1LjAsICJhY3Rpb24iOiAic3RhcnQiLCAic2Vzc2lvbl9pZCI6IDM5Mn0=
MTI0NzQ=,189232,0,1575225843611,workout,eyJ1c2VyX2lkIjogMTI0NzQsICJ3b3Jrb3V0X2lkIjogMzUsICJ0aW1lc3RhbXAiOiAxNTc1MjI1ODQwLjAsICJhY3Rpb24iOiAic3RvcCIsICJzZXNzaW9uX2lkIjogMn0=
Mjg1MjE=,189234,0,1575227043638,workout,eyJ1c2VyX2lkIjogMjg1MjEsICJ3b3Jrb3V0X2lkIjogMjAsICJ0aW1lc3RhbXAiOiAxNTc1MjI3MDM3LjAsICJhY3Rpb24iOiAic3RhcnQiLCAic2Vzc2lvbl9pZCI6IDMyOH0=
MzUyMjY=,189192,0,1575193983608,workout,eyJ1c2VyX2lkIjogMzUyMjYsICJ3b3Jrb3V0X2lkIjogNDMsICJ0aW1lc3RhbXAiOiAxNTc1MTkzOTc5LjAsICJhY3Rpb24iOiAic3RhcnQiLCAic2Vzc2lvbl9pZCI6IDQwMn0=
MjY4NDc=,189207,0,1575205633152,workout,eyJ1c2VyX2lkIjogMjY4NDcsICJ3b3Jrb3V0X2lkIjogMTQsICJ0aW1lc3RhbXAiOiAxNTc1MjA1NjI5LjAsICJhY3Rpb24iOiAic3RhcnQiLCAic2Vzc2lvbl9pZCI6IDJ9
Mjg1ODg=,189209,0,1575207670803,workout,eyJ1c2VyX2lkIjogMjg1ODgsICJ3b3Jrb3V0X2lkIjogMzQsICJ0aW1lc3RhbXAiOiAxNTc1MjA3NjY2LjAsICJhY3Rpb24iOiAic3RhcnQiLCAic2Vzc2lvbl9pZCI6IDF9



Number of rows in this file 93

root
 |-- key: string (nullable = true)
 |-- offset: long (nullable = true)
 |-- partition: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- topic: string (nullable = true)
 |-- value: string (nullable = true)


 -> part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json



key,offset,partition,timestamp,topic,value
Mjg3NzY=,12778,0,1562995123117,user_info,eyJ1c2VyX2lkIjogMjg3NzYsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2Mjk5NTExNCwgImRvYiI6ICIwMS8yMS8xOTM4IiwgInNleCI6ICJGIiwgImdlbmRlciI6ICJPIiwgImZpcnN0X25hbWUiOiAiQnJpYW4iLCAibGFzdF9uYW1lIjogIkNhc3RpbGxvIiwgImFkZHJlc3MiOiB7InN0cmVldF9hZGRyZXNzIjogIjM0OTQxIERhdmlkIFR1cm5waWtlIiwgImNpdHkiOiAiVmFuIE51eXMiLCAic3RhdGUiOiAiQ0EiLCAiemlwIjogOTE0MDF9fQ==
NDEzNjc=,12774,0,1562691359088,user_info,eyJ1c2VyX2lkIjogNDEzNjcsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2MjY5MTM1NCwgImRvYiI6ICIwMS8yNS8xOTMwIiwgInNleCI6ICJNIiwgImdlbmRlciI6ICJNIiwgImZpcnN0X25hbWUiOiAiQ2hyaXN0b3BoZXIiLCAibGFzdF9uYW1lIjogIktlbGx5IiwgImFkZHJlc3MiOiB7InN0cmVldF9hZGRyZXNzIjogIjA5MSBFcmljIEN1cnZlIiwgImNpdHkiOiAiUGFjb2ltYSIsICJzdGF0ZSI6ICJDQSIsICJ6aXAiOiA5MTMzNH19
MjQwMTg=,12776,0,1562889518343,user_info,eyJ1c2VyX2lkIjogMjQwMTgsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2Mjg4OTUwOSwgImRvYiI6ICIwOC8yMy8xOTQ0IiwgInNleCI6ICJGIiwgImdlbmRlciI6ICJGIiwgImZpcnN0X25hbWUiOiAiUGFtZWxhIiwgImxhc3RfbmFtZSI6ICJKb2huc3RvbiIsICJhZGRyZXNzIjogeyJzdHJlZXRfYWRkcmVzcyI6ICIwMDkgV2lsbGlhbSBUcmFjZSIsICJjaXR5IjogIkJldmVybHkgSGlsbHMiLCAic3RhdGUiOiAiQ0EiLCAiemlwIjogOTAyMTB9fQ==
MTk4NTk=,12777,0,1562912816709,user_info,eyJ1c2VyX2lkIjogMTk4NTksICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2MjkxMjgxNCwgImRvYiI6ICIxMS8wNy8xOTkxIiwgInNleCI6ICJGIiwgImdlbmRlciI6ICJGIiwgImZpcnN0X25hbWUiOiAiVGFtbXkiLCAibGFzdF9uYW1lIjogIkJhcnJ5IiwgImFkZHJlc3MiOiB7InN0cmVldF9hZGRyZXNzIjogIjcwNjc4IE9saXZlciBXYXlzIiwgImNpdHkiOiAiU2FudGEgQ2xhcml0YSIsICJzdGF0ZSI6ICJDQSIsICJ6aXAiOiA5MTM1MH19
MjMyNTA=,12775,0,1562844689376,user_info,eyJ1c2VyX2lkIjogMjMyNTAsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2Mjg0NDY4NCwgImRvYiI6ICIxMS8xMi8xOTUxIiwgInNleCI6ICJNIiwgImdlbmRlciI6ICJNIiwgImZpcnN0X25hbWUiOiAiUGF1bCIsICJsYXN0X25hbWUiOiAiSGludG9uIiwgImFkZHJlc3MiOiB7InN0cmVldF9hZGRyZXNzIjogIjMwODAgQmVjayBCcmFuY2giLCAiY2l0eSI6ICJWYW4gTnV5cyIsICJzdGF0ZSI6ICJDQSIsICJ6aXAiOiA5MTQwNn19
MzU3Mjg=,12772,0,1562155015640,user_info,eyJ1c2VyX2lkIjogMzU3MjgsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2MjE1NTAxMiwgImRvYiI6ICIwNi8yOC8xOTY0IiwgInNleCI6ICJGIiwgImdlbmRlciI6ICJGIiwgImZpcnN0X25hbWUiOiAiQW5nZWxhIiwgImxhc3RfbmFtZSI6ICJLZXkiLCAiYWRkcmVzcyI6IHsic3RyZWV0X2FkZHJlc3MiOiAiOTMxIEppbGwgVGVycmFjZSBTdWl0ZSA0MDQiLCAiY2l0eSI6ICJNb250ZWJlbGxvIiwgInN0YXRlIjogIkNBIiwgInppcCI6IDkwNjQwfX0=
MTUxNDk=,12773,0,1562201886676,user_info,eyJ1c2VyX2lkIjogMTUxNDksICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2MjIwMTg4MiwgImRvYiI6ICIwMy8zMC8xOTcyIiwgInNleCI6ICJNIiwgImdlbmRlciI6ICJNIiwgImZpcnN0X25hbWUiOiAiQ2FtZXJvbiIsICJsYXN0X25hbWUiOiAiVmFzcXVleiIsICJhZGRyZXNzIjogeyJzdHJlZXRfYWRkcmVzcyI6ICI5NTkzMiBHYXJ5IFJpZGdlcyIsICJjaXR5IjogIkxvcyBBbmdlbGVzIiwgInN0YXRlIjogIkNBIiwgInppcCI6IDkwMDE4fX0=
MjkyMTM=,12770,0,1562010799911,user_info,eyJ1c2VyX2lkIjogMjkyMTMsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2MjAxMDc5OSwgImRvYiI6ICIwMy8xMC8xOTI3IiwgInNleCI6ICJGIiwgImdlbmRlciI6ICJGIiwgImZpcnN0X25hbWUiOiAiQXVkcmV5IiwgImxhc3RfbmFtZSI6ICJIYWxsIiwgImFkZHJlc3MiOiB7InN0cmVldF9hZGRyZXNzIjogIjgxNzAgTWFyY3VzIENvdXJzZSBTdWl0ZSA3NjYiLCAiY2l0eSI6ICJOb3J0aCBIb2xseXdvb2QiLCAic3RhdGUiOiAiQ0EiLCAiemlwIjogOTE2MTh9fQ==
MTQyMzI=,12771,0,1562113873633,user_info,eyJ1c2VyX2lkIjogMTQyMzIsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2MjExMzg2NywgImRvYiI6ICIwMS8wNC8xOTc5IiwgInNleCI6ICJNIiwgImdlbmRlciI6ICJNIiwgImZpcnN0X25hbWUiOiAiRWR3YXJkIiwgImxhc3RfbmFtZSI6ICJTaW1wc29uIiwgImFkZHJlc3MiOiB7InN0cmVldF9hZGRyZXNzIjogIjkyMDEyIEJyYWRsZXkgU2hvYWxzIiwgImNpdHkiOiAiTG9uZyBCZWFjaCIsICJzdGF0ZSI6ICJDQSIsICJ6aXAiOiA5MDgxNX19
MTQ1MDg=,12782,0,1564262812062,user_info,eyJ1c2VyX2lkIjogMTQ1MDgsICJ1cGRhdGVfdHlwZSI6ICJuZXciLCAidGltZXN0YW1wIjogMTU2NDI2MjgwNSwgImRvYiI6ICIwMS8yOC8xOTM2IiwgInNleCI6ICJNIiwgImdlbmRlciI6ICJNIiwgImZpcnN0X25hbWUiOiAiSnVzdGluIiwgImxhc3RfbmFtZSI6ICJFYXRvbiIsICJhZGRyZXNzIjogeyJzdHJlZXRfYWRkcmVzcyI6ICIwNDk1MiBMb3JpIFBsYWluIiwgImNpdHkiOiAiU2llcnJhIE1hZHJlIiwgInN0YXRlIjogIkNBIiwgInppcCI6IDkxMDI0fX0=



Number of rows in this file 19

root
 |-- key: string (nullable = true)
 |-- offset: long (nullable = true)
 |-- partition: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- topic: string (nullable = true)
 |-- value: string (nullable = true)



> In development, this stage would be used to verify that the information the client provided matches the files received. Unless directly requested or in dire need, guessing column types requires a deep understanding of the system and is too big a responsibility to take lightly.
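
The `key` and `value` columns in the samples above look base64-encoded. That is an inference from the displayed rows, not something the files document. The sketch below uses only the Python standard library to decode the first `bpm` record shown earlier; `decode_record` is a hypothetical helper name.

```python
import base64
import json

# First `bpm` record from the sample output above
key_b64 = "MTE5NzE1"
value_b64 = "eyJkZXZpY2VfaWQiOiAxMTk3MTUsICJ0aW1lIjogMTU3NTE1ODQwMCwgImhlYXJ0cmF0ZSI6IDY2LjYyMjQ0OTczMTU3OTc2fQ=="

def decode_record(key, value):
    """Decode a base64 key and a base64-encoded JSON value."""
    decoded_key = base64.b64decode(key).decode("utf-8")
    payload = json.loads(base64.b64decode(value))
    return decoded_key, payload

key, payload = decode_record(key_b64, value_b64)
print(key)
print(payload)
```

For this record, the decoded key matches the `device_id` inside the payload, which hints at how the feed is keyed. The other topics would need the same check before assuming anything.
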

## Date_lookup

To partition the Bronze table, I'll use a pre-made lookup table (in a real project, it would live in production).<br>
To access it here, I have two choices: `clone` it into my dev schema or read a specific version of it with `time travel`. <br>
Since I'll use this table only once, `time travel` it is!

In [0]:
# Time traveling to obtain the date_lookup in a DataFrame
date_lookup_df = spark.read.option("versionAsOf", 0).load(f"{main_path}/date_lookup")

# Taking a sneak peek at the DF
display(date_lookup_df.limit(10))

date,week,year,month,dayofweek,dayofmonth,dayofyear,week_part
2019-01-01,1,2019,1,3,1,1,2019-1
2019-01-02,1,2019,1,4,2,2,2019-1
2019-01-03,1,2019,1,5,3,3,2019-1
2019-01-04,1,2019,1,6,4,4,2019-1
2019-01-05,1,2019,1,7,5,5,2019-1
2019-01-06,1,2019,1,1,6,6,2019-1
2019-01-07,2,2019,1,2,7,7,2019-2
2019-01-08,2,2019,1,3,8,8,2019-2
2019-01-09,2,2019,1,4,9,9,2019-2
2019-01-10,2,2019,1,5,10,10,2019-2


The column from the date_lookup_df that will be used for partitioning is `week_part`. The `date` column will serve as the key between the DF and the files. <br>
Therefore, I'll select `date` and `week_part` from the date_lookup_df, then convert the `timestamp` column from the files into a date so the two can be joined.<br>
I'll work through these steps below.


In [0]:
# Selecting only the useful columns from the DF
date_lookup_df = date_lookup_df.select("date", "week_part")

# Generating a DF from the json files
json_df = spark.read.json(files_path)

# Testing the join
joined_df = json_df.join(f.broadcast(date_lookup_df),
                         f.to_date((f.col("timestamp")/1000).cast("timestamp")) == f.col("date"),
                         "left")

# Displaying the results
display(joined_df.limit(10))

key,offset,partition,timestamp,topic,value,date,week_part
MTE5NzE1,65087128,0,1575158410818,bpm,eyJkZXZpY2VfaWQiOiAxMTk3MTUsICJ0aW1lIjogMTU3NTE1ODQwMCwgImhlYXJ0cmF0ZSI6IDY2LjYyMjQ0OTczMTU3OTc2fQ==,2019-12-01,2019-48
MTA3OTUy,65102257,2,1575158411394,bpm,eyJkZXZpY2VfaWQiOiAxMDc5NTIsICJ0aW1lIjogMTU3NTE1ODQwMSwgImhlYXJ0cmF0ZSI6IDExNi4wMzEyNTg5ODAxMjA4N30=,2019-12-01,2019-48
MTYxOTEw,65087131,0,1575158417678,bpm,eyJkZXZpY2VfaWQiOiAxNjE5MTAsICJ0aW1lIjogMTU3NTE1ODQwMywgImhlYXJ0cmF0ZSI6IDgyLjMzMzg1OTEyMDg3MzE3fQ==,2019-12-01,2019-48
MTE2NzA4,64899138,3,1575158421868,bpm,eyJkZXZpY2VfaWQiOiAxMTY3MDgsICJ0aW1lIjogMTU3NTE1ODQxMiwgImhlYXJ0cmF0ZSI6IDgzLjEzNDUyNzA2MDg3NzA5fQ==,2019-12-01,2019-48
MTYxMzQz,64899148,3,1575158430841,bpm,eyJkZXZpY2VfaWQiOiAxNjEzNDMsICJ0aW1lIjogMTU3NTE1ODQyMCwgImhlYXJ0cmF0ZSI6IDU5LjcxODMyNzY1MTYyODA0fQ==,2019-12-01,2019-48
MTEzMDI0,65002565,4,1575158432217,bpm,eyJkZXZpY2VfaWQiOiAxMTMwMjQsICJ0aW1lIjogMTU3NTE1ODQyMywgImhlYXJ0cmF0ZSI6IDk0LjcyNTkyNjkxMDg1MjAxfQ==,2019-12-01,2019-48
MTQxNjg3,65102289,2,1575158437927,bpm,eyJkZXZpY2VfaWQiOiAxNDE2ODcsICJ0aW1lIjogMTU3NTE1ODQzMCwgImhlYXJ0cmF0ZSI6IDg0LjM4NzM2MzY4NjEzMDU0fQ==,2019-12-01,2019-48
MTQyODM4,64899162,3,1575158445876,bpm,eyJkZXZpY2VfaWQiOiAxNDI4MzgsICJ0aW1lIjogMTU3NTE1ODQzNCwgImhlYXJ0cmF0ZSI6IDg1Ljc2MzkzNjgwMDQ1MTAzfQ==,2019-12-01,2019-48
MTk3OTMw,65087190,0,1575158460427,bpm,eyJkZXZpY2VfaWQiOiAxOTc5MzAsICJ0aW1lIjogMTU3NTE1ODQ1MCwgImhlYXJ0cmF0ZSI6IDg0LjYyMDk2MzIyMTczNTA3fQ==,2019-12-01,2019-48
MTEwMzY3,65102354,2,1575158482354,bpm,eyJkZXZpY2VfaWQiOiAxMTAzNjcsICJ0aW1lIjogMTU3NTE1ODQ3MSwgImhlYXJ0cmF0ZSI6IDc2LjUwOTk5NTg3NjA2Njc0fQ==,2019-12-01,2019-48
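
The join condition converts an epoch timestamp in milliseconds into a calendar date and looks that date up. The sketch below reproduces the same idea in plain Python for the first row of the output above. It assumes UTC (Spark's `to_date` follows the session time zone instead), and `date_lookup` is a one-row stand-in for `date_lookup_df`, not the real table.

```python
from datetime import date, datetime, timezone

# One-row stand-in for date_lookup_df, taken from the join output above
date_lookup = {date(2019, 12, 1): "2019-48"}

def to_utc_date(epoch_ms):
    """Convert an epoch timestamp in milliseconds to a calendar date (UTC)."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).date()

ts = 1575158410818                       # `timestamp` of the first bpm record
event_date = to_utc_date(ts)
week_part = date_lookup.get(event_date)  # None mimics the NULL a left join yields on no match
print(event_date, week_part)
```
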


The `join` went smoothly, and I've kept the `timestamp` column unchanged to keep the data raw. <br>
With the data now in the right state for ingestion, I'll take a `count` as a benchmark before integrating all this into the `Auto Loader` logic.

In [0]:
joined_df.count()

Out[7]: 417276
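
This benchmark can be cross-checked against the per-file counts printed in the "Exploring samples" section. A left join neither drops nor duplicates rows when the lookup has one row per `date`; that uniqueness is an assumption here, not something verified in this notebook. The arithmetic, as a small sketch:

```python
# Row counts printed per file in the "Exploring samples" section
rows_per_file = {
    "part-00000": 0,       # empty file
    "part-00003": 417164,  # bpm
    "part-00004": 93,      # workout
    "part-00005": 19,      # user_info
}

joined_count = 417276  # result of joined_df.count()

expected = sum(rows_per_file.values())
print(expected, expected == joined_count)
```
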

## Data Processing

This is a key step in the pipeline during development and even more so in production. If it isn't configured correctly, Auto Loader can make a large number of API calls, resulting in unexpected charges.<br>
Throughout the pipeline, I'll use `trigger(availableNow=True)`, so the stream only runs when scheduled: it processes every file available at that moment, split into micro-batches according to the rate-limit options, and then stops. In production, this approach can be viable if the flow of files is known and constant. An alternative is file notification mode, which has its own advantages. Typically, this decision would involve the client, the billing department, the architect, and the developers as consultants.<br>
<br>
Another point to consider is the life expectancy of the pipeline. Auto Loader offers options such as schema evolution (`cloudFiles.schemaEvolutionMode`) to extend it if the data is likely to change. I won't go that far here since there is no one-size-fits-all solution.<br>
Below is the `Auto Loader` processing query:<br>
<br>

In [0]:
# Establishing the folder where Auto Loader will keep the stream's checkpoint and schema information
checkpoint_path = f'{main_path}/_checkpoints'

# Defining data type of each column in the JSON files
bronze_schema = "key BINARY, value BINARY, topic STRING, partition INT, offset INT, timestamp LONG"

# Creating the function that will ingest the files into the Bronze table
def process_bronze():
    (spark.readStream
                  .format("cloudFiles")
                  .schema(bronze_schema)
                  .option("cloudFiles.format", "json")
                  .option("cloudFiles.schemaLocation", f"{checkpoint_path}/bronze_schema") # Location to store schema information
                  .load(files_path) # Source Folder
                  .join(f.broadcast(date_lookup_df), f.to_date((f.col("timestamp")/1000).cast("timestamp")) == f.col("date"), "left")
                  .writeStream
                        .option("checkpointLocation", f"{checkpoint_path}/bronze") # Location to store checkpoint information
                        .partitionBy("topic", "week_part") # Partitioning by topic and week_part to increase performance
                        .trigger(availableNow=True)
                        .table("bronze")
                        .awaitTermination())

# Creating a function for reprocessing all the files into the Bronze table
def reprocess_bronze():
    spark.sql("DROP TABLE IF EXISTS bronze.bronze")

    dbutils.fs.rm(f"{checkpoint_path}/bronze", True)
    
    process_bronze()

In [0]:
process_bronze()
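
The choices discussed in the Data Processing section (directory listing versus file notification, rate limits, schema evolution) map to Auto Loader options. The sketch below collects a few of them in plain dictionaries. The values are illustrative rather than tuned recommendations, `describe` is a hypothetical helper, and file notification needs cloud permissions that a basic workspace may not grant.

```python
# Illustrative option sets; values are examples, not recommendations.
directory_listing_opts = {
    "cloudFiles.format": "json",
    "cloudFiles.maxFilesPerTrigger": "1000",  # cap on files per micro-batch
}

file_notification_opts = {
    **directory_listing_opts,
    "cloudFiles.useNotifications": "true",    # file notification instead of directory listing
}

schema_evolution_opts = {
    **directory_listing_opts,
    # Keep the schema fixed and route unexpected fields to the rescued data column
    "cloudFiles.schemaEvolutionMode": "rescue",
}

def describe(opts):
    """Render the options as the chained .option(...) calls they correspond to."""
    return "".join(f'.option("{k}", "{v}")' for k, v in opts.items())

print(describe(file_notification_opts))
```

In the notebook, a dictionary like this could be passed in one go with `spark.readStream.format("cloudFiles").options(**file_notification_opts)`.
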

### Verifying Ingestion
I'll check the Bronze table to ensure the data is correctly partitioned, the row count is right, and the records match expectations. <br>
This step confirms the pipeline is working as intended before moving on.

In [0]:
# Count of ingested rows, should match the count done earlier in the DF
bronze_first_batch = spark.table("bronze.bronze").count()
print("")
print(f"{bronze_first_batch} ingested rows")
print("")

# Check a sample of rows to ensure no columns went null
display(spark.table("bronze.bronze").limit(10))

# Verify the partitioning, location, and other details of the Bronze table
display(spark.sql("DESCRIBE EXTENDED bronze.bronze"))


417276 ingested rows



key,value,topic,partition,offset,timestamp,date,week_part
MTE5NzE1,eyJkZXZpY2VfaWQiOiAxMTk3MTUsICJ0aW1lIjogMTU3NTE1ODQwMCwgImhlYXJ0cmF0ZSI6IDY2LjYyMjQ0OTczMTU3OTc2fQ==,bpm,0,65087128,1575158410818,2019-12-01,2019-48
MTA3OTUy,eyJkZXZpY2VfaWQiOiAxMDc5NTIsICJ0aW1lIjogMTU3NTE1ODQwMSwgImhlYXJ0cmF0ZSI6IDExNi4wMzEyNTg5ODAxMjA4N30=,bpm,2,65102257,1575158411394,2019-12-01,2019-48
MTYxOTEw,eyJkZXZpY2VfaWQiOiAxNjE5MTAsICJ0aW1lIjogMTU3NTE1ODQwMywgImhlYXJ0cmF0ZSI6IDgyLjMzMzg1OTEyMDg3MzE3fQ==,bpm,0,65087131,1575158417678,2019-12-01,2019-48
MTE2NzA4,eyJkZXZpY2VfaWQiOiAxMTY3MDgsICJ0aW1lIjogMTU3NTE1ODQxMiwgImhlYXJ0cmF0ZSI6IDgzLjEzNDUyNzA2MDg3NzA5fQ==,bpm,3,64899138,1575158421868,2019-12-01,2019-48
MTYxMzQz,eyJkZXZpY2VfaWQiOiAxNjEzNDMsICJ0aW1lIjogMTU3NTE1ODQyMCwgImhlYXJ0cmF0ZSI6IDU5LjcxODMyNzY1MTYyODA0fQ==,bpm,3,64899148,1575158430841,2019-12-01,2019-48
MTEzMDI0,eyJkZXZpY2VfaWQiOiAxMTMwMjQsICJ0aW1lIjogMTU3NTE1ODQyMywgImhlYXJ0cmF0ZSI6IDk0LjcyNTkyNjkxMDg1MjAxfQ==,bpm,4,65002565,1575158432217,2019-12-01,2019-48
MTQxNjg3,eyJkZXZpY2VfaWQiOiAxNDE2ODcsICJ0aW1lIjogMTU3NTE1ODQzMCwgImhlYXJ0cmF0ZSI6IDg0LjM4NzM2MzY4NjEzMDU0fQ==,bpm,2,65102289,1575158437927,2019-12-01,2019-48
MTQyODM4,eyJkZXZpY2VfaWQiOiAxNDI4MzgsICJ0aW1lIjogMTU3NTE1ODQzNCwgImhlYXJ0cmF0ZSI6IDg1Ljc2MzkzNjgwMDQ1MTAzfQ==,bpm,3,64899162,1575158445876,2019-12-01,2019-48
MTk3OTMw,eyJkZXZpY2VfaWQiOiAxOTc5MzAsICJ0aW1lIjogMTU3NTE1ODQ1MCwgImhlYXJ0cmF0ZSI6IDg0LjYyMDk2MzIyMTczNTA3fQ==,bpm,0,65087190,1575158460427,2019-12-01,2019-48
MTEwMzY3,eyJkZXZpY2VfaWQiOiAxMTAzNjcsICJ0aW1lIjogMTU3NTE1ODQ3MSwgImhlYXJ0cmF0ZSI6IDc2LjUwOTk5NTg3NjA2Njc0fQ==,bpm,2,65102354,1575158482354,2019-12-01,2019-48


col_name,data_type,comment
key,binary,
value,binary,
topic,string,
partition,int,
offset,int,
timestamp,bigint,
date,date,
week_part,string,
# Partition Information,,
# col_name,data_type,comment


### New Batch
Next, I'll simulate the arrival of new files to the Auto Loader source folder and then check if new rows arrive in the Bronze table. <br>
This ensures that Auto Loader is correctly set up to handle new data batches as they come in.

In [0]:
# List of new sample file names
nb_files = ("part-00000-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-492-1-c000.json","part-00003-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-495-1-c000.json","part-00004-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-496-1-c000.json")

# Loop to copy the new samples to the source folder
for i in nb_files:
    dbutils.fs.cp(f"{new_batch}{i}", f"{files_path}/{i}") # new_batch already ends with "/"

# Run the stream to process the new files
process_bronze()

# Verify if new rows were ingested into the Bronze table
bronze_second_batch = spark.table("bronze.bronze").count()
print("")
print(f"{bronze_second_batch - bronze_first_batch} rows were ingested in the second batch")
print("")


462415 rows were ingested in the second batch
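
Only the new files were ingested because the stream remembers what it has already processed. That is also why `reprocess_bronze()` deletes the checkpoint folder before reloading everything. The snippet below is a toy model of that idea in plain Python, not Auto Loader's actual implementation; the file names and the `discover_new_files` helper are hypothetical.

```python
def discover_new_files(listing, processed):
    """Return files not seen before and record them as processed.

    `processed` plays the role of the checkpoint: it survives between runs.
    """
    new_files = [name for name in listing if name not in processed]
    processed.update(new_files)
    return new_files

checkpoint = set()  # toy stand-in for the state kept under the checkpoint location

first_listing = ["file_a.json", "file_b.json"]
print(discover_new_files(first_listing, checkpoint))   # first run: both files are new

second_listing = first_listing + ["file_c.json"]
print(discover_new_files(second_listing, checkpoint))  # second run: only the added file

checkpoint.clear()                                     # like removing the checkpoint folder
print(discover_new_files(second_listing, checkpoint))  # everything is picked up again
```
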



In [0]:
files = dbutils.fs.ls(files_path)
display(files)

path,name,size,modificationTime
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json,part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json,0,1708028988000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00000-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-492-1-c000.json,part-00000-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-492-1-c000.json,85786236,1716980229000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json,part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json,84059292,1708028994000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00003-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-495-1-c000.json,part-00003-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-495-1-c000.json,7404178,1716980236000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json,part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json,27495,1708028995000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00004-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-496-1-c000.json,part-00004-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-496-1-c000.json,21448,1716980237000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json,part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json,8759,1708028996000


## Closing Comments

As I've said before, I opted to run Auto Loader only on a schedule. If I'd used `trigger(processingTime='5 seconds')` instead of `trigger(availableNow=True)`, the Source Folder would be checked for new files every 5 seconds and the Bronze table would be updated automatically. In that case, the query keeps running until it is stopped. <br>
<br>
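
The two trigger styles differ only in the keyword passed to `.trigger(...)`. As a minimal sketch, the hypothetical helper below (the function and its mode names are not part of the notebook) returns the keyword arguments for each style so the choice could be made in one place.

```python
def trigger_options(mode, interval="5 seconds"):
    """Return keyword arguments for DataStreamWriter.trigger(...)."""
    if mode == "scheduled":
        # Process everything currently available, then stop
        return {"availableNow": True}
    if mode == "interval":
        # Start a micro-batch at a fixed interval until the query is stopped
        return {"processingTime": interval}
    raise ValueError(f"Unknown mode: {mode}")

print(trigger_options("scheduled"))
print(trigger_options("interval"))
```

In `process_bronze()`, this could be applied as `.trigger(**trigger_options("scheduled"))`.
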
That's all for this notebook. In the next one, the fun part will begin!

In [0]:
# List of new batch file names
nb_files = ("part-00000-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-492-1-c000.json","part-00003-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-495-1-c000.json","part-00004-tid-8917337824388924965-fd775af0-714b-4bcf-9d17-5a414d03e156-496-1-c000.json")

# Loop to delete files from the Source Folder
for i in nb_files:
    dbutils.fs.rm(f"{files_path}/{i}")

# Ensuring the Source Folder has been reset
files = dbutils.fs.ls(files_path)
display(files)

path,name,size,modificationTime
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json,part-00000-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-376-1-c000.json,0,1708028988000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json,part-00003-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-379-1-c000.json,84059292,1708028994000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json,part-00004-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-380-1-c000.json,27495,1708028995000
dbfs:/mnt/borges_portifolio/auto_loader_files/files/part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json,part-00005-tid-358676900307454419-9a6ae6ec-a431-4418-90b5-95bfbeed9fa7-381-1-c000.json,8759,1708028996000
