# BRONZE LAYER

## General Info
  | Info | Details |
  |---------|------|
  |Table Name | customers + sales|
  |From | transient (Azure) |

## Update timeline
|Date | Developed/altered by: | Comment |
|:------:|--------|-------|
|2025/01/23|Luca Ainstein|Project Data Ingestion|


Library Import

In [0]:
### Import of libs

from pyspark.sql.functions import current_date, current_timestamp, expr

Cluster Connection to Azure Transient

In [0]:
### The connection could instead be configured when the cluster is created. However, because Databricks Community Edition allows only 1-2 hours of inactivity before deleting the cluster, it is more practical to create the connection in code.

# Set up the configurations for Azure Data Lake Storage Gen2 using OAuth authentication
configs = {
    "fs.azure.account.auth.type.dlsprojetofixo.dfs.core.windows.net": "OAuth",
    "fs.azure.account.oauth.provider.type.dlsprojetofixo.dfs.core.windows.net": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id.dlsprojetofixo.dfs.core.windows.net": "238b3ff5-2a4b-49a0-838e-61aea5e55f0a",
    "fs.azure.account.oauth2.client.secret.dlsprojetofixo.dfs.core.windows.net": "~h.8Q~XwXAMBiXDY69nUxONcfoO49RweWIPELdkM",
    "fs.azure.account.oauth2.client.endpoint.dlsprojetofixo.dfs.core.windows.net": "https://login.microsoftonline.com/d16f0536-3c2f-4035-887f-8949bacfacfd/oauth2/token"
}

# Apply the configurations to Spark
for key, value in configs.items():
    spark.conf.set(key, value)

# Confirm that the settings were applied (this does not test the connection itself)
print("Configuration applied successfully.")

Configuration applied successfully.
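
The five settings above share one key pattern, and the client secret is written directly in the notebook. As a minimal sketch, assuming the credentials are supplied through environment variables, the same dictionary could be built by a small helper. The helper name `build_adls_oauth_configs` and the variable names `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, and `AZURE_TENANT_ID` are hypothetical and not part of the original notebook.

```python
import os


def build_adls_oauth_configs(storage_account, client_id, client_secret, tenant_id):
    """Return the Spark settings for OAuth access to one ADLS Gen2 account."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Hypothetical environment variable names; the placeholders are used only
# when a variable is unset, so the sketch runs outside Databricks as well.
configs = build_adls_oauth_configs(
    storage_account="dlsprojetofixo",
    client_id=os.environ.get("AZURE_CLIENT_ID", "<client-id>"),
    client_secret=os.environ.get("AZURE_CLIENT_SECRET", "<client-secret>"),
    tenant_id=os.environ.get("AZURE_TENANT_ID", "<tenant-id>"),
)
print(len(configs), "settings prepared")
```

Inside the notebook, the resulting dictionary would be applied with the same `spark.conf.set(key, value)` loop used in the cell above.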


### Import Dataset 1 (**customers**) `.csv`

In [0]:
path_source = 'abfss://transient@dlsprojetofixo.dfs.core.windows.net/ecommerce_customers_500.csv'

df = spark.read.format("csv").option("header", True).load(path_source).withColumn("load_time_GMT", current_timestamp())

## Comment: the timestamp is recorded in GMT (note that Brazil is GMT-3:00)

# In a real-life import, follow the company's convention: either local time or a standard time zone (such as IST or EST)

# Note for the Silver developer: AGE is a STRING, because the CSV is read without schema inference. It must be cast to INT in the Silver layer (or even FLOAT, depending on the data)
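
To make the GMT note concrete, the sketch below converts a GMT load time to Brazilian local time using only the standard library. It assumes the fixed GMT-3:00 offset stated in the comment above; the helper name `gmt_to_brt` and the sample timestamp are illustrative, not values from the pipeline.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed GMT-3:00 offset, as stated in the notebook comment
BRT = timezone(timedelta(hours=-3), name="BRT")


def gmt_to_brt(ts_gmt):
    """Convert a GMT/UTC timestamp to GMT-3:00 local time."""
    if ts_gmt.tzinfo is None:
        # Treat naive timestamps as GMT, matching the load_time_GMT column
        ts_gmt = ts_gmt.replace(tzinfo=timezone.utc)
    return ts_gmt.astimezone(BRT)


# Illustrative load time shortly after midnight GMT
load_time_gmt = datetime(2025, 1, 23, 1, 30, tzinfo=timezone.utc)
print(gmt_to_brt(load_time_gmt).isoformat())  # 2025-01-22T22:30:00-03:00
```

The example shows why the convention matters: a load shortly after midnight GMT falls on the previous calendar day in local time, which affects any date-based filtering downstream.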


Schema Creation

In [0]:
schema_layer = "bronze"
table_name = "customers"

path_target = 'abfss://bronze@dlsprojetofixo.dfs.core.windows.net/clientes_lucaainstein'

spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('path', path_target) \
    .option('overwriteSchema', 'true') \
    .saveAsTable(f'{schema_layer}.{table_name}')
print("Schema is created!")

Schema is created!
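
The source and target locations in this notebook all follow the `abfss://<container>@<account>.dfs.core.windows.net/<path>` layout. As a small sketch, a helper could assemble these URIs so the storage account name is typed once; `abfss_path` is a hypothetical name introduced here for illustration.

```python
def abfss_path(container, storage_account, relative_path=""):
    """Build an abfss:// URI for a container in an ADLS Gen2 account."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    if not relative_path:
        return base
    # Strip a leading slash so the result never contains a double slash
    return f"{base}/{relative_path.lstrip('/')}"


path_source = abfss_path("transient", "dlsprojetofixo", "ecommerce_customers_500.csv")
path_target = abfss_path("bronze", "dlsprojetofixo", "clientes_lucaainstein")
print(path_source)
print(path_target)
```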


In [0]:
spark.sql(f"OPTIMIZE {schema_layer}.{table_name}")
print(f"Optimization {schema_layer}.{table_name} completed!")

Optimization bronze.customers completed!


### Import Dataset 2 (**sales**) `.json`

In [0]:
path_source = 'abfss://transient@dlsprojetofixo.dfs.core.windows.net/ecommerce_large.json'

df = spark.read.format("json").option("multiline", True).load(path_source).withColumn("load_time_GMT", current_timestamp())

## Comment: the timestamp is recorded in GMT (note that Brazil is GMT-3:00)

# In a real-life import, follow the company's convention: either local time or a standard time zone (such as IST or EST)

# Note for the Silver developer: AGE is a STRING. It must be cast to INT in the Silver layer (or even FLOAT, depending on the data)

In [0]:
schema_layer = "bronze"
table_name = "sales"

path_target = 'abfss://bronze@dlsprojetofixo.dfs.core.windows.net/sales_lucaainstein'

spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('path', path_target) \
    .option('overwriteSchema', 'true') \
    .saveAsTable(f'{schema_layer}.{table_name}')
print("Schema is created!")

Schema is created!


In [0]:
spark.sql(f"OPTIMIZE {schema_layer}.{table_name}")
print(f"Optimization {schema_layer}.{table_name} completed!")

Optimization bronze.sales completed!
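
The customers and sales cells repeat the same create, write, and optimize steps with only the table name and target path changing. One possible way to factor them, shown here as a sketch rather than the notebook's actual structure, is a single function that receives the Spark session and DataFrame as arguments. The name `write_bronze_table` is hypothetical.

```python
def write_bronze_table(spark, df, table_name, path_target, schema_layer="bronze"):
    """Write df as an external Delta table and compact it with OPTIMIZE."""
    full_name = f"{schema_layer}.{table_name}"

    # Make sure the target database exists before registering the table
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {schema_layer}")

    # Same writer options as the cells above: overwrite data and schema
    (
        df.write
        .format("delta")
        .mode("overwrite")
        .option("path", path_target)
        .option("overwriteSchema", "true")
        .saveAsTable(full_name)
    )

    # Compact the small files produced by the write
    spark.sql(f"OPTIMIZE {full_name}")
    return full_name
```

In the notebook, each dataset would then need a single call such as `write_bronze_table(spark, df, "sales", path_target)`, which keeps the writer options identical for both tables.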
