## Directory Listing Mode

Configuring Service Principal

In [0]:
client_id = "XXXX-XXXX-XXXX-XXXX-XXXX-XXXX" 
tenant_id = "XXXX-XXXX-XXXX-XXXX-XXXX-XXXX"
client_secret = "XXXX-XXXX-XXXX-XXXX-XXXX-XXXX"
storage_account_name = "dldatalakestorage"
container_name = "sourcedata"

In [0]:
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", 
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", 
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net"
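
The five settings above all share the same `{storage_account_name}.dfs.core.windows.net` suffix, so they can be gathered in one place. The following is a minimal sketch of that idea, not part of the original notebook; the helper names `abfss_uri` and `build_oauth_conf` are hypothetical.

```python
from typing import Dict


def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI for a container, with an optional sub-path."""
    root = f"abfss://{container}@{account}.dfs.core.windows.net"
    return f"{root}/{path.lstrip('/')}" if path else root


def build_oauth_conf(account: str, client_id: str, client_secret: str, tenant_id: str) -> Dict[str, str]:
    """Return the five ABFS OAuth settings used above for one storage account."""
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# In the notebook, where `spark` is already defined, this would be applied as:
# for key, value in build_oauth_conf(storage_account_name, client_id,
#                                    client_secret, tenant_id).items():
#     spark.conf.set(key, value)
```

On Databricks the client secret would more commonly be read from a secret scope with `dbutils.secrets.get` than written as a literal in the notebook.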

## Reading files in stream

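- `cloudFiles` is the Structured Streaming source behind Databricks Auto Loader. Reading with `format("cloudFiles")` ingests files from cloud storage incrementally as they arrive and enables Auto Loader features such as schema inference and schema evolution.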

In [0]:
# Auto Loader stream over the source container: track the inferred schema under
# /schema, infer column types, and send unexpected columns to _rescued_data.
df = spark.readStream.format("cloudFiles")\
    .option("cloudFiles.format", "csv")\
    .option("cloudFiles.schemaLocation", f"{file_path}/schema")\
    .option("cloudFiles.inferColumnTypes", "true")\
    .option("header", "true")\
    .option("cloudFiles.schemaEvolutionMode", "rescue")\
    .load(file_path)

#### Schema Evolution Modes in Auto Loader

1. **`addNewColumns`** (default when no schema is provided)  
   - The stream fails when new columns are found, but the new columns are first added to the tracked schema, so a restart picks them up. Existing columns keep their data types.

2. **`rescue`**  
   - The stream doesn't fail on new columns and the schema never evolves. Data in new columns is recorded in the rescued data column (`_rescued_data`).

3. **`failOnNewColumns`**  
   - The stream fails when new columns are encountered. It does not restart until you update the provided schema or remove the offending file.

4. **`none`** (default when a schema is provided)  
   - New columns are ignored and the stream doesn't fail. The schema doesn't evolve, and no data is rescued unless the `rescuedDataColumn` option is set.
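
To make the differences between the four modes concrete, here is a toy model in plain Python. It is only an illustration of the behavior described in the list, not how Auto Loader is implemented: the names `apply_evolution_mode` and `SchemaChangeError` are hypothetical, and the real rescued data column carries more than this sketch puts in it.

```python
import json
from typing import Any, Dict, List, Optional

MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


class SchemaChangeError(Exception):
    """Toy stand-in for the failure that stops the stream."""

    def __init__(self, message: str, evolved_schema: Optional[List[str]] = None):
        super().__init__(message)
        # Set only when the mode evolves the schema before failing.
        self.evolved_schema = evolved_schema


def apply_evolution_mode(schema: List[str], record: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Return the row emitted for `record`, or raise if the mode stops the stream."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")

    new_cols = [c for c in record if c not in schema]
    row = {c: record.get(c) for c in schema}  # missing columns become None

    if mode == "rescue":
        # Simplification: keep only the unexpected columns, as a JSON string.
        row["_rescued_data"] = json.dumps({c: record[c] for c in new_cols}) if new_cols else None
        return row
    if mode == "none" or not new_cols:
        return row  # new columns are dropped silently
    if mode == "addNewColumns":
        raise SchemaChangeError("new columns found; restart to use the evolved schema",
                                schema + new_cols)
    # failOnNewColumns: fail without evolving the schema
    raise SchemaChangeError("new columns found; update the schema or remove the file")


# Example input: a record with one column the schema does not know about.
schema = ["name", "city"]
record = {"name": "vivek", "city": "rajkot", "surname": "example"}
rescued_row = apply_evolution_mode(schema, record, "rescue")
ignored_row = apply_evolution_mode(schema, record, "none")
```

In this model only `addNewColumns` hands back an evolved schema, and only `rescue` keeps the unexpected value.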


In [0]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- firstname: string (nullable = true)
 |-- surname: string (nullable = true)
 |-- _rescued_data: string (nullable = true)



## Write Streaming

#### Structured Streaming Output Modes

1. **`append`** (default)  
   Writes only the rows added to the result since the last trigger, without removing or overwriting anything.

2. **`complete`**  
   Rewrites the entire result table on every trigger. Supported only for queries with aggregations.

3. **`update`**  
   Writes only the rows that changed in the result since the last trigger.

The values `overwrite`, `errorifexists`, and `ignore` are save modes of the batch writer's `mode()` and are not accepted by `outputMode`.


In [0]:
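# From here on, container_name refers to the sink container, not the source.
container_name = 'sinkdata'

# Append each micro-batch to a Delta table at the container root every 10 seconds;
# mergeSchema lets the table accept columns it has not seen before.
df.writeStream.format("delta") \
    .option("checkpointLocation", f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/checkpoint") \
    .outputMode("append") \
    .trigger(processingTime="10 seconds") \
    .option("mergeSchema", "true") \
    .start(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net")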


<pyspark.sql.streaming.query.StreamingQuery at 0x7f7603a9e990>
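
The write cell discards the `StreamingQuery` handle it returns and reuses `container_name` for the sink. A possible alternative, sketched here under the assumption that the same options are wanted, wraps the write in a function that takes explicit paths and returns the query so it can be inspected or stopped later. The function name `start_delta_sink` and the variable `sink_path` are hypothetical.

```python
def start_delta_sink(df, sink_path: str, checkpoint_path: str, interval: str = "10 seconds"):
    """Start an append-mode Delta stream from `df` and return the StreamingQuery."""
    return (
        df.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        .outputMode("append")
        .trigger(processingTime=interval)
        .start(sink_path)
    )


# In the notebook this might be used as follows (paths are illustrative):
# sink_path = f"abfss://sinkdata@{storage_account_name}.dfs.core.windows.net"
# query = start_delta_sink(df, sink_path, f"{sink_path}/checkpoint")
# query.status   # current state of the stream
# query.stop()   # stop the stream when finished
```

Keeping the handle in a variable makes it possible to stop the stream from a later cell instead of cancelling the cell that started it.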

In [0]:
df_sink = spark.read.format("delta").load(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net")
df_sink.printSchema()

root
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- _rescued_data: string (nullable = true)
 |-- firstname: string (nullable = true)
 |-- surname: string (nullable = true)



In [0]:
df_sink.show()

+---------+---------+-------------+---------+-------+
|     name|     city|_rescued_data|firstname|surname|
+---------+---------+-------------+---------+-------+
|     NULL|  ayodhya|         NULL|   keshav|maharaj|
|dhruvpuri|    surat|         NULL|     NULL|   NULL|
|    vivek|   rajkot|         NULL|     NULL|   NULL|
|      raj| junagadh|         NULL|     NULL|   NULL|
|    nayan|ahmedabad|         NULL|     NULL|   NULL|
|    jatin| varanasi|         NULL|     NULL|   NULL|
|    viraj|     agra|         NULL|     NULL|   NULL|
+---------+---------+-------------+---------+-------+

