
<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://raw.githubusercontent.com/derar-alhussein/Databricks-Certified-Data-Engineer-Associate/main/Includes/images/bookstore_schema.png" alt="Databricks Learning" style="width: 600">
</div>

In [0]:
%run ../Includes/Copy-Datasets


## 🗂️ Exploring The Source Directory

In [0]:
files = dbutils.fs.ls(f"{dataset_bookstore}/orders-raw")
display(files)


## 🔄 Auto Loader

Auto Loader enables efficient and scalable ingestion of new files from cloud storage using Spark Structured Streaming.

- Designed for file-based ingestion scenarios.
- Ideal for environments where data arrives continuously, such as logs or event data.
- Leverages schema inference and checkpointing for reliable and repeatable ingestion.

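Both a schema location and a checkpoint location must be set. The schema location stores the inferred schema so that schema evolution can be tracked, and the checkpoint location records ingestion progress. This notebook points both options at the same directory.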
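
As a minimal sketch of that setup, the helper below derives both settings from one base directory, which mirrors what the next cell does by hand. The function name `autoloader_paths` and the `reader`/`writer` grouping are hypothetical and not part of any Databricks API; only the option keys are taken from this notebook.

```python
def autoloader_paths(base_dir: str, source_format: str) -> dict:
    """Build reader and writer options that share one tracking directory.

    Hypothetical helper for illustration; it only assembles option dictionaries.
    """
    base = base_dir.rstrip("/")
    if not base:
        raise ValueError("base_dir must not be empty")
    return {
        # Options for spark.readStream.format("cloudFiles")
        "reader": {
            "cloudFiles.format": source_format,
            "cloudFiles.schemaLocation": base,
        },
        # Option for the writeStream side
        "writer": {"checkpointLocation": base},
    }


opts = autoloader_paths("dbfs:/mnt/demo/orders_checkpoint/", "parquet")
print(opts["reader"]["cloudFiles.schemaLocation"])
print(opts["writer"]["checkpointLocation"])
```

On a cluster, these dictionaries could be passed with `.options(**opts["reader"])` on the stream reader and `.options(**opts["writer"])` on the stream writer instead of repeating the path in separate `.option(...)` calls.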


In [0]:
(spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "dbfs:/mnt/demo/orders_checkpoint")
        .load(f"{dataset_bookstore}/orders-raw")
      .writeStream
        .option("checkpointLocation", "dbfs:/mnt/demo/orders_checkpoint")
        .table("orders_updates")
)

In [0]:
%sql
SELECT * FROM orders_updates

In [0]:
%sql
SELECT count(*) FROM orders_updates


## ➕ Landing New Files

New files can be introduced to the source directory at any time.  
Auto Loader automatically detects and ingests these new files without restarting the stream.

- Ingestion is triggered as soon as new data is available.
- The record count in the destination table increases dynamically as new files are processed.
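
The behavior described above can be illustrated with a toy model in plain Python. This is not how Auto Loader is implemented; it only mimics the observable effect, assuming a hypothetical JSON file acts as the checkpoint that remembers which files were already processed. The function name `ingest_new_files` and the file names are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def ingest_new_files(source_dir: Path, checkpoint_file: Path) -> list:
    """Return names of files not seen before and record them in the checkpoint."""
    if checkpoint_file.exists():
        seen = set(json.loads(checkpoint_file.read_text()))
    else:
        seen = set()
    new_files = sorted(
        p.name for p in source_dir.iterdir() if p.is_file() and p.name not in seen
    )
    # Persist progress so the next call skips everything processed so far.
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "orders-raw"
    source.mkdir()
    checkpoint = Path(tmp) / "checkpoint.json"

    (source / "01.parquet").write_text("batch 1")
    first = ingest_new_files(source, checkpoint)   # picks up 01.parquet

    (source / "02.parquet").write_text("batch 2")  # a new file lands
    second = ingest_new_files(source, checkpoint)  # picks up only 02.parquet
    third = ingest_new_files(source, checkpoint)   # nothing new to process

print(first, second, third)
```

Each call plays the role of one micro-batch: files already recorded in the checkpoint are skipped, so landing a file adds its records to the target once rather than on every run.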

In [0]:
# Run this cell twice so that the next step lists 3 files; the Auto Loader stream started above picks them up
load_new_data()

In [0]:
files = dbutils.fs.ls(f"{dataset_bookstore}/orders-raw")
display(files)

In [0]:
%sql
-- The count should now include the records from the 2 files landed in the step above
SELECT count(*) FROM orders_updates


## 📊 Exploring Table History

The ingested data is written to a target Delta table that reflects the most current state.  
Users can explore this target using SQL queries to monitor live updates.

The table history allows users to view when data was ingested and how it has evolved over time.
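
To make the history easier to read once it grows, its rows can be summarized in Python. The sketch below assumes each row is a dictionary with the `version` and `operation` fields that `DESCRIBE HISTORY` reports; the helper name `summarize_history` is hypothetical, and the sample rows are made up for illustration rather than taken from a real run.

```python
from collections import Counter


def summarize_history(history_rows):
    """Count table versions per operation and find the latest version number."""
    operations = Counter(row["operation"] for row in history_rows)
    latest = max((row["version"] for row in history_rows), default=None)
    return {"latest_version": latest, "versions_per_operation": dict(operations)}


# Made-up rows for illustration only; real values come from DESCRIBE HISTORY.
sample_rows = [
    {"version": 0, "operation": "STREAMING UPDATE"},
    {"version": 1, "operation": "STREAMING UPDATE"},
    {"version": 2, "operation": "STREAMING UPDATE"},
]
summary = summarize_history(sample_rows)
print(summary)
```

On a cluster, the input could be built with `[r.asDict() for r in spark.sql("DESCRIBE HISTORY orders_updates").collect()]`.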

In [0]:
%sql
-- A new table version is created for each streaming update
DESCRIBE HISTORY orders_updates


## 🧹 Cleaning Up

After completing the ingestion process:

- Temporary checkpoint locations and schema storage directories can be removed.
- The target Delta table used for experimentation can be dropped to maintain a clean workspace.

This ensures that resources are reclaimed and the environment remains organized for future work.
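
The notebook never stops the stream explicitly, so it is presumably still running when the cleanup cells execute. A minimal sketch of stopping active queries first is shown below. The helper name `stop_active_streams` is hypothetical, and the `FakeQuery` and `fake_spark` stand-ins exist only so the example runs outside Databricks; on a cluster, the notebook's own `spark` session would be passed instead.

```python
from types import SimpleNamespace


def stop_active_streams(spark_session) -> int:
    """Stop every active streaming query on the session; return how many were stopped."""
    stopped = 0
    # Copy the list first, since stopping a query may change the active set.
    for query in list(spark_session.streams.active):
        query.stop()
        stopped += 1
    return stopped


class FakeQuery:
    """Stand-in for a streaming query, used only to run this sketch locally."""

    def __init__(self):
        self.isActive = True

    def stop(self):
        self.isActive = False


fake_queries = [FakeQuery(), FakeQuery()]
fake_spark = SimpleNamespace(streams=SimpleNamespace(active=fake_queries))
stopped_count = stop_active_streams(fake_spark)
print(stopped_count)
```

Calling `stop_active_streams(spark)` before dropping the table and removing the checkpoint directory avoids deleting state that a running query still depends on.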

In [0]:
%sql
DROP TABLE orders_updates

In [0]:
# Remove the checkpoint location, which also holds the inferred schema
dbutils.fs.rm("dbfs:/mnt/demo/orders_checkpoint", True)