## Databricks Auto Loader

**Step 1: Create the Input Folder in a Volume**

In the Databricks UI, go to Data ingestion --> Upload files to a volume, give a name to the volume you want to create, and copy the path shown below it. To process files in other cloud storage such as S3 or GCS, provide that path instead.

We simulate daily data arrival by creating date-based subfolders under the volume (e.g., /2025/08/01/). This structure mimics real-world pipelines where files arrive daily in partitioned folders.

In [0]:
dbutils.fs.mkdirs("/Volumes/workspace/default/autoloader_demo/2025/08/01")
dbutils.fs.mkdirs("/Volumes/workspace/default/autoloader_demo/2025/08/02")
dbutils.fs.mkdirs("/Volumes/workspace/default/autoloader_demo/2025/08/03")
dbutils.fs.mkdirs("/Volumes/workspace/default/autoloader_demo/2025/08/04")
dbutils.fs.mkdirs("/Volumes/workspace/default/autoloader_demo/2025/08/05")

True
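The five `mkdirs` calls above follow one pattern, so the folder list can also be generated from a start date. This is a minimal sketch in plain Python; `daily_folders` is a hypothetical helper name, not part of `dbutils`, and on Databricks each returned path would be passed to `dbutils.fs.mkdirs`.

```python
from datetime import date, timedelta


def daily_folders(base, start, days):
    """Return one <base>/YYYY/MM/DD folder path per day, starting at `start`."""
    return [
        base + "/" + (start + timedelta(days=i)).strftime("%Y/%m/%d")
        for i in range(days)
    ]


folders = daily_folders(
    "/Volumes/workspace/default/autoloader_demo", date(2025, 8, 1), 5
)
for folder in folders:
    print(folder)  # on Databricks: dbutils.fs.mkdirs(folder)
```

Using `timedelta` rather than string arithmetic keeps the paths correct across month boundaries.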

**Step 2: Set Paths for Auto Loader**

input_path: where Auto Loader watches for new files.

schema_path: where the inferred schema is stored.

checkpoint_path: where processing state is saved to ensure exactly-once ingestion.
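To make the role of the checkpoint concrete, here is a toy model in plain Python. It is not Auto Loader's actual implementation: a small JSON file stands in for the checkpoint and records which files were already ingested, so a later run only picks up files that arrived since. The names `ingest_new_files` and `processed.json` are made up for this illustration.

```python
import json
import tempfile
from pathlib import Path


def ingest_new_files(input_dir, checkpoint_dir):
    """Return CSV files not seen in earlier runs and record them as processed."""
    state_file = checkpoint_dir / "processed.json"
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()

    # Discover every CSV under the input folder and keep only unseen ones.
    discovered = {p.relative_to(input_dir).as_posix() for p in input_dir.rglob("*.csv")}
    new_files = sorted(discovered - seen)

    # Persist the updated state so the next run skips these files.
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(sorted(seen | discovered)))
    return new_files


root = Path(tempfile.mkdtemp())  # stand-in for the volume
input_dir, checkpoint_dir = root / "input", root / "checkpoint"

(input_dir / "2025/08/01").mkdir(parents=True)
(input_dir / "2025/08/01/sales_20250801.csv").write_text("order_id\n101\n")
first_run = ingest_new_files(input_dir, checkpoint_dir)

(input_dir / "2025/08/02").mkdir(parents=True)
(input_dir / "2025/08/02/sales_20250802.csv").write_text("order_id\n103\n")
second_run = ingest_new_files(input_dir, checkpoint_dir)

print(first_run)
print(second_run)
```

The second run returns only the file for the second day. Deleting the checkpoint folder would make every file look new again, which is why the checkpoint path should stay stable between runs.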

In [0]:
spark.sql("SHOW VOLUMES")

DataFrame[database: string, volume_name: string]

In [0]:
dbutils.fs.mkdirs("/Volumes/workspace/default/autoloader_demo/autoloader_demo_schema")
dbutils.fs.mkdirs("/Volumes/workspace/default/autoloader_demo/autoloader_demo_checkpoint")

True

**Step 3: Set Schema and Checkpoint Paths**

In [0]:
schema_path = "/Volumes/workspace/default/autoloader_demo/autoloader_demo_schema"
checkpoint_path = "/Volumes/workspace/default/autoloader_demo/autoloader_demo_checkpoint"


In [0]:
import pandas as pd

df_day = pd.DataFrame({
    "order_id": [101, 102],
    "customer": ["Grace", "Liam"],
    "amount": [210.0, 320.5]
})

df_day.to_csv("/Volumes/workspace/default/autoloader_demo/2025/08/01/sales_20250801.csv", index=False)


In [0]:
df_day = pd.DataFrame({
    "order_id": [103, 104],
    "customer": ["Rita", "Sita"],
    "amount": [215.0, 320.5]
})

df_day.to_csv("/Volumes/workspace/default/autoloader_demo/2025/08/02/sales_20250802.csv", index=False)

In [0]:
df_day = pd.DataFrame({
    "order_id": [105, 106],
    "customer": ["X", "Y"],
    "amount": [215.0, 325.5]
})

df_day.to_csv("/Volumes/workspace/default/autoloader_demo/2025/08/03/sales_20250803.csv", index=False)
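The three cells above differ only in the date and the rows, so the folder and file name can be derived from the date. The sketch below uses the standard-library `csv` module in place of pandas and writes to a temporary directory so it runs anywhere; `write_daily_sales` is a hypothetical helper, and on Databricks `base` would be the volume path.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def write_daily_sales(base, day, rows):
    """Write rows to <base>/YYYY/MM/DD/sales_YYYYMMDD.csv and return the path."""
    folder = base / day.strftime("%Y/%m/%d")
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / ("sales_" + day.strftime("%Y%m%d") + ".csv")
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "customer", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return target


base = Path(tempfile.mkdtemp())  # stand-in for the volume path
path = write_daily_sales(base, date(2025, 8, 1), [
    {"order_id": 101, "customer": "Grace", "amount": 210.0},
    {"order_id": 102, "customer": "Liam", "amount": 320.5},
])
print(path.read_text())
```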

**Step 4: Reading with Auto Loader**

In [0]:
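# Glob over the daily folders; this is the input_path described in Step 2.
input_path = "/Volumes/workspace/default/autoloader_demo/2025/08/*/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("pathGlobFilter", "*.csv")  # only pick up CSV files
    .option("header", "true")
    .option("cloudFiles.schemaHints", "amount double")  # read amount as double
    .option("cloudFiles.schemaLocation", schema_path)  # where the inferred schema is stored
    .load(input_path)
)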
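The load path and `pathGlobFilter` together decide which files are picked up. The sketch below rebuilds part of the layout in a temporary directory and uses Python's `glob` as a stand-in for the matcher Spark uses, so treat it as an approximation that holds for this simple pattern. The `notes.txt` file and the file under the schema folder are hypothetical, added only to show what gets excluded.

```python
import glob
import os
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())  # stand-in for the autoloader_demo volume

# Recreate the relevant part of the volume layout.
for rel in [
    "2025/08/01/sales_20250801.csv",
    "2025/08/02/sales_20250802.csv",
    "2025/08/02/notes.txt",              # hypothetical non-CSV file
    "autoloader_demo_schema/_schemas/0", # hypothetical schema-tracking file
]:
    target = root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("")

# "2025/08/*/" selects the day folders; "*.csv" plays the role of pathGlobFilter.
pattern = os.path.join(glob.escape(str(root)), "2025", "08", "*", "*.csv")
matched = sorted(Path(p).relative_to(root).as_posix() for p in glob.glob(pattern))
print(matched)
```

Only the two CSV files in the day folders match. The schema and checkpoint folders sit at the volume root, outside `2025/08/`, so the stream does not try to read its own bookkeeping files.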

**Step 5: Writing to a Delta Table**

In [0]:
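from pyspark.sql.functions import col

# Add the source file name to each row, then write everything currently available.
query = (
    df.withColumn("file_name", col("_metadata.file_name"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("autoloader_demo")
)

# Wait for the availableNow run to finish before querying the table.
query.awaitTermination()

display(spark.sql("select * from autoloader_demo"))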

order_id,customer,amount,_rescued_data,file_name
101,Grace,210.0,,sales_20250801.csv
102,Liam,320.5,,sales_20250801.csv
103,Rita,215.0,,sales_20250802.csv
104,Sita,320.5,,sales_20250802.csv
105,X,215.0,,sales_20250803.csv
106,Y,325.5,,sales_20250803.csv


In [0]:
%sql
select file_name,count(*) count from autoloader_demo group by file_name

file_name,count
sales_20250803.csv,2
sales_20250802.csv,2
sales_20250801.csv,2
sales_20250805.csv,2
sales_20250804.csv,2

Note: these counts include sales_20250804.csv and sales_20250805.csv, which are only written in Step 6, so this output appears to come from a later run, after the Step 5 cell was executed again.


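**Step 6: Simulate Streaming**

Write files for two more days. Re-running the Step 5 cell afterwards ingests only these new files, because the checkpoint already records the earlier ones.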

In [0]:
df_day = pd.DataFrame({
    "order_id": [107, 108],
    "customer": ["Z", "A"],
    "amount": [215, 326]
})

df_day.to_csv("/Volumes/workspace/default/autoloader_demo/2025/08/04/sales_20250804.csv", index=False)

In [0]:
df_day = pd.DataFrame({
    "order_id": [109, 110],
    "customer": ["B", "K"],
    "amount": [215, 326]
})

df_day.to_csv("/Volumes/workspace/default/autoloader_demo/2025/08/05/sales_20250805.csv", index=False)