Create a streaming pipeline that reads data from the source incrementally.
Set up a streaming query that uses Auto Loader to monitor a cloud storage location for new files.

**The cell below creates a streaming DataFrame df by reading CSV files from the specified path in the Azure Data Lake Storage (ADLS) Raw container. It uses the cloudFiles format to enable Auto Loader and sets cloudFiles.schemaLocation, the path where Auto Loader stores the inferred schema and tracks schema evolution. Here the checkpoint_location path doubles as the schema location.**

In [0]:
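# Path used for both the Auto Loader schema files and the stream checkpoint
checkpoint_location = "abfss://checkpoints@storageformetastore.dfs.core.windows.net/"

# Incrementally read new CSV files from the Raw container with Auto Loader
df = spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv")\
    .option("cloudFiles.schemaLocation", checkpoint_location)\
    .load("abfss://raw@storageforproject.dfs.core.windows.net/")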
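
The Auto Loader reader can also be configured from a single options dictionary, which keeps the settings in one place. The sketch below is one possible arrangement, not the notebook's own code: the helper names `autoloader_csv_options` and `read_raw_stream` are hypothetical, and the `header`, `cloudFiles.inferColumnTypes`, and `cloudFiles.schemaEvolutionMode` settings are assumptions about how the raw files should be read.

```python
def autoloader_csv_options(schema_location, has_header=True):
    """Return reader options for ingesting CSV files with Auto Loader."""
    return {
        "cloudFiles.format": "csv",
        # Where Auto Loader persists the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Infer column types instead of reading every column as a string.
        "cloudFiles.inferColumnTypes": "true",
        # On a new column: stop the stream, record the column in the schema,
        # and pick it up when the stream is restarted.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
        # Assumption: the raw CSV files start with a header row.
        "header": "true" if has_header else "false",
    }


def read_raw_stream(spark, source_path, schema_location):
    """Build the streaming DataFrame; `spark` is the notebook's SparkSession."""
    options = autoloader_csv_options(schema_location)
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


options = autoloader_csv_options(
    "abfss://checkpoints@storageformetastore.dfs.core.windows.net/"
)
```

In a Databricks notebook, `read_raw_stream(spark, "abfss://raw@storageforproject.dfs.core.windows.net/", checkpoint_location)` would then return a streaming DataFrame comparable to df.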
        

The writeStream method lets you continuously append data to the target storage as new files arrive.

The checkpoint makes the write operation fault-tolerant: if the streaming pipeline fails, it can resume from where it left off without duplicating data.
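
To make the resume behaviour concrete, here is a toy model in plain Python. It is not how Spark implements checkpoints internally. It only illustrates the idea that progress is persisted, so a restarted run skips input it has already handled. The function name `process_new_files` and the file names are hypothetical.

```python
import json
import os
import tempfile


def process_new_files(available_files, checkpoint_path):
    """Toy model: return only the files not yet recorded in the checkpoint."""
    # Load the set of files handled by earlier runs, if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    else:
        done = set()

    new_files = [name for name in available_files if name not in done]

    # Record progress so that a restart skips everything handled so far.
    with open(checkpoint_path, "w") as f:
        json.dump(sorted(done | set(new_files)), f)
    return new_files


checkpoint_file = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first_run = process_new_files(["a.csv", "b.csv"], checkpoint_file)
# Simulated restart: one new file has arrived, the old ones are skipped.
second_run = process_new_files(["a.csv", "b.csv", "c.csv"], checkpoint_file)
print(first_run, second_run)  # ['a.csv', 'b.csv'] ['c.csv']
```

Deleting the checkpoint file in this model would cause every file to be processed again, which is why a streaming query should keep the same checkpoint location across restarts.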

In [0]:
df.writeStream\
    .format("csv")\
    .outputMode("append")\
    .option("header", "true")\
    .option("checkpointLocation", checkpoint_location)\
    .trigger(processingTime="10 seconds")\
    .start("abfss://bronze@storageforproject.dfs.core.windows.net/netflix_titles/")

**The cell above writes the streaming DataFrame df to the specified path in the ADLS Bronze container. It uses the csv format in append mode, so each micro-batch adds new files to the target directory rather than modifying existing ones. The checkpoint_location stores the checkpoint information, which provides fault tolerance and exactly-once processing. The trigger starts a new micro-batch every 10 seconds.**

In [0]:
df_read = spark.read.format("csv")\
    .option("header", "true")\
    .load("abfss://bronze@storageforproject.dfs.core.windows.net/netflix_titles/")

# Read the CSV files from the specified path in the ADLS Bronze container into the DataFrame df_read

display(df_read)

`cloudFiles` is used because Auto Loader simplifies ingestion, especially for continuously arriving data. It automatically detects new files, tracks which ones have already been processed, and handles schema inference and evolution. If we used `readStream` with a plain file source instead of `cloudFiles`, we would typically have to supply the schema ourselves and handle schema changes manually, which is less efficient and more error-prone.
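
For comparison, a minimal sketch of the same read without Auto Loader is shown below. A plain CSV streaming source normally needs an explicit schema, which `DataStreamReader.schema` accepts as a DDL string. The column names and types here are hypothetical placeholders, not taken from the actual files. The helper names `ddl_schema` and `read_csv_stream_without_autoloader` are also hypothetical, and the `header` setting is an assumption.

```python
def ddl_schema(columns):
    """Build a Spark DDL schema string from (name, type) pairs."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


# Hypothetical columns; replace them with the real layout of the CSV files.
columns = [("show_id", "STRING"), ("title", "STRING"), ("release_year", "INT")]
schema = ddl_schema(columns)


def read_csv_stream_without_autoloader(spark, source_path, schema):
    """Plain file-source stream: the schema must be supplied by the caller."""
    return (
        spark.readStream.format("csv")
        .schema(schema)
        # Assumption: the source files start with a header row.
        .option("header", "true")
        .load(source_path)
    )
```

Any new column in the incoming files would require editing `columns` by hand, which is the maintenance burden that Auto Loader's schema tracking is meant to remove.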

The notebook's overall purpose is to demonstrate a streaming data pipeline that reads cloud files, processes the data, and stores the results in a target location within Azure Data Lake Storage. The use of checkpoints ensures fault tolerance and data integrity in the streaming process. The notebook provides a structured approach to handling streaming data in an Azure Databricks environment using Spark.