# Creating Streaming Tables with SQL using Auto Loader

In this demonstration, we will create a streaming table that incrementally ingests files from a volume using Auto Loader with SQL.

When you create a streaming table using the CREATE OR REFRESH STREAMING TABLE statement, the initial data refresh and population begin immediately. These operations do not consume DBSQL warehouse compute. Instead, streaming tables rely on serverless DLT for both creation and refresh. A dedicated serverless DLT pipeline is automatically created and managed by the system for each streaming table.

### Learning Objectives

By the end of this lesson, you should be able to:
- Create streaming tables in Databricks SQL for incremental data ingestion.
- Refresh streaming tables using the REFRESH statement.
- Understand how streaming tables automatically track and ingest only new files.

### RECOMMENDATION

The CREATE STREAMING TABLE SQL command is the recommended alternative to the legacy COPY INTO SQL command for incremental ingestion from cloud object storage. Databricks recommends using streaming tables to ingest data using Databricks SQL. 

A streaming table is a table registered to Unity Catalog with extra support for streaming or incremental data processing. A DLT pipeline is automatically created for each streaming table. You can use streaming tables for incremental data loading from Kafka and cloud object storage.

---

**Environment Setup:**
- **Catalog:** `lakeflow_demo`
- **Schema:** `lakeflow_schema`
- **Volumes:** 
  - `csv_files_autoloader_source` (source volume for streaming ingestion)
  - `autoloader_staging_files` (staging volume with additional files for incremental ingestion demo)

**Note:** Make sure to run the `00_Setup_Environment.ipynb` notebook first to create the catalog, schema, volumes, and sample data files.


## A. Setup

Run the following cell to configure your working environment for this notebook.


In [0]:
%sql
-- Set default catalog and schema
USE CATALOG lakeflow_demo;
USE SCHEMA lakeflow_schema;

-- View current catalog and schema
SELECT 
  current_catalog(), 
  current_schema();


## B. SQL Warehouse Requirement

**REQUIRED - SELECT YOUR SERVERLESS SQL WAREHOUSE**

**NOTE: Creating streaming tables with Databricks SQL requires a SQL warehouse.**

Before executing cells in this notebook, please select a **SQL WAREHOUSE** in the lab. Follow these steps:

1. Navigate to the top-right of this notebook and click the drop-down to select compute (it might say **Connect**).
2. Select **More**.
3. Then select the **SQL Warehouse** button.
4. Select or create a SQL warehouse.
5. Then, at the bottom of the pop-up, select **Start and attach**.

**Important:** Without a SQL warehouse, the streaming table creation commands will fail.


## C. Create Streaming Tables for Incremental Processing

### C1. Explore the Data Source

1. Explore the volume `/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source` and confirm it contains CSV file(s).

   Use the `LIST` statement to view the files in this volume.


In [0]:
%sql
LIST '/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source';


2. Run the query below to preview the data in the CSV file(s) in your cloud storage location. Notice that `read_files` returns the file contents as rows and columns.


In [0]:
%sql
SELECT *
FROM read_files(
  '/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source',
  format => 'CSV',
  sep => '|',
  header => true
)
LIMIT 10;
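
To make the `sep` and `header` options concrete, here is a minimal local sketch in plain Python. It is not Databricks code, and the sample text and the `parse_delimited` helper are hypothetical. It only illustrates how a pipe-delimited file with a header row maps to named columns.

```python
import csv
import io

# Hypothetical sample: a pipe-delimited file whose first line is the header.
sample_text = "order_id|product|quantity\nA100|widget|2\nA101|gadget|5\n"

def parse_delimited(text, sep="|"):
    """Return one dict per data row, keyed by the names in the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    return list(reader)

rows = parse_delimited(sample_text)
for row in rows:
    print(row)
```

Every value in this sketch stays a string; it does not attempt any type inference.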


### C2. Create a STREAMING TABLE using Databricks SQL

3. Your goal is to create an incremental pipeline that only ingests new files (instead of using traditional batch ingestion). You can achieve this by using [streaming tables in Databricks SQL](https://docs.databricks.com/aws/en/dlt/dbsql/streaming), which use Auto Loader when the query reads from `STREAM read_files(...)`.

   - The SQL code below creates a streaming table that will incrementally ingest only new data.
   
   - A pipeline is automatically created for each streaming table. You can use streaming tables for incremental data loading from Kafka and cloud object storage.

   **NOTE:** Incremental batch ingestion automatically detects new files in the data source and skips files that have already been ingested. This reduces the amount of data processed, making ingestion jobs faster and more efficient in their use of compute resources.

   **REQUIRED: This process will take about a minute to run and set up the incremental ingestion pipeline.**

   **IMPORTANT:** If you encounter schema errors, make sure to drop any existing table first, or use the explicit schema option shown in the alternative approach below.


In [0]:
%sql
-- Drop existing table if it exists (to avoid schema conflicts)
DROP TABLE IF EXISTS sql_csv_autoloader;

-- Create streaming table with automatic schema inference
CREATE STREAMING TABLE sql_csv_autoloader
AS
SELECT *
FROM STREAM read_files(
  '/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source',
  format => 'CSV',
  sep => '|',
  header => true
);
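
The file tracking described in the note above can be pictured with a toy model. The sketch below is plain Python and makes no claim about how Auto Loader stores its state; the `ToyFileTracker` class and the file names are hypothetical. It only shows the idea that each refresh ingests files not seen before and skips the rest.

```python
class ToyFileTracker:
    """Toy stand-in for the state that records which files were ingested."""

    def __init__(self):
        self.seen = set()   # paths already ingested
        self.table = []     # rows "loaded" so far

    def refresh(self, files):
        """Ingest only files not seen before; return the newly ingested paths."""
        new_paths = sorted(path for path in files if path not in self.seen)
        for path in new_paths:
            self.table.extend(files[path])
            self.seen.add(path)
        return new_paths

# Hypothetical source directory: path -> rows in that file.
source = {"sales_1.csv": ["row1", "row2"]}
tracker = ToyFileTracker()
first = tracker.refresh(source)    # initial load picks up sales_1.csv

source["sales_2.csv"] = ["row3"]   # a new file arrives
second = tracker.refresh(source)   # only sales_2.csv is ingested
third = tracker.refresh(source)    # nothing new, so nothing is ingested
print(first, second, third, len(tracker.table))
```

Repeating a refresh with no new files leaves the table unchanged, which is the idempotent behavior this notebook demonstrates later with `REFRESH STREAMING TABLE`.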


**Alternative: If you encounter schema errors, you can explicitly define the schema:**

```sql
-- Drop existing table if it exists
DROP TABLE IF EXISTS sql_csv_autoloader;

-- Create streaming table with explicit schema
CREATE STREAMING TABLE sql_csv_autoloader
AS
SELECT *
FROM STREAM read_files(
  '/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source',
  format => 'CSV',
  sep => '|',
  header => true,
  schema => 'order_id STRING, product STRING, quantity INT, price DOUBLE, sale_date TIMESTAMP'
);
```

**Note:** The explicit schema approach is useful when:
- Schema inference fails
- You want to ensure consistent column types
- You're working with files that might have inconsistent schemas
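
As a rough local illustration of why declared types help, the sketch below casts string fields to the types named in the explicit schema above. It is plain Python, not Spark; `SCHEMA`, `apply_schema`, and the sample row are hypothetical. This sketch raises an error on a value that does not fit its type, whereas the behavior in Databricks depends on the reader options in use.

```python
from datetime import datetime

# Column names and types mirror the explicit schema string above.
# The Python casting functions are illustrative assumptions.
SCHEMA = {
    "order_id": str,
    "product": str,
    "quantity": int,
    "price": float,
    "sale_date": datetime.fromisoformat,
}

def apply_schema(raw_row, schema=SCHEMA):
    """Cast each string field to its declared type; raise ValueError on a bad value."""
    return {name: cast(raw_row[name]) for name, cast in schema.items()}

# Hypothetical raw row, as it would look when every CSV field is read as text.
raw = {
    "order_id": "A100",
    "product": "widget",
    "quantity": "2",
    "price": "9.99",
    "sale_date": "2024-01-15 10:30:00",
}
typed = apply_schema(raw)
print(typed)
```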


### C3. View and Inspect the Streaming Table

4. Run the cell below to view the streaming table. Confirm that the results contain the expected number of rows.


In [0]:
%sql
SELECT *
FROM sql_csv_autoloader;


5. Describe the streaming table and review the results:

- Under **Detailed Table Information**, find the following rows:
  - **View Text**: The query that created the table.
  - **Type**: Specifies that it is a STREAMING TABLE.
  - **Provider**: Indicates that it is a Delta table.

- Under **Refresh Information**, you can see specific refresh details including Last Refreshed, Last Refresh Type, Latest Refresh Status, etc.


In [0]:
%sql
DESCRIBE TABLE EXTENDED sql_csv_autoloader;


6. The `DESCRIBE HISTORY` statement displays a detailed list of all changes, versions, and metadata associated with a Delta streaming table, including information on updates, deletions, and schema changes.

   Run the cell below and view the results. Notice the following:

   - In the **operation** column, you can see the operations recorded for the streaming table: **CREATE TABLE**, **DLT SETUP**, and **STREAMING UPDATE**.
   
   - Scroll to the right and find the **operationMetrics** column to see the number of rows processed.


In [0]:
%sql
DESCRIBE HISTORY sql_csv_autoloader;


## D. Demonstrating Incremental Ingestion

### D1. Add New Files for Incremental Ingestion

7. To demonstrate incremental ingestion, manually add another file to your cloud storage location: `/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source`.

   **Option 1 - Using Python:**
   - Copy a file from the staging volume to the source volume

   **Option 2 - Using UI:**
   - Click the catalog icon on the left
   - Expand the **lakeflow_demo** catalog
   - Expand your **lakeflow_schema** schema
   - Expand **Volumes**
   - Open the **autoloader_staging_files** volume
   - Copy a file from there to the **csv_files_autoloader_source** volume


In [0]:
%python
import time

# Option 1: Copy a clean CSV file from the staging volume to the source volume using Python.
# Only CSV files are copied, never Spark metadata files (_SUCCESS, _committed_*, etc.).

def copy_clean_csv_files(copy_from, copy_to, n=1):
    """
    Copy up to n CSV files from copy_from to copy_to, skipping Spark metadata files.
    Each copy gets a timestamped name so Auto Loader sees it as a new file.
    """
    all_files = dbutils.fs.ls(copy_from)

    # Keep only CSV files (skip Spark metadata files and directories)
    csv_files = [f for f in all_files if f.name.endswith('.csv') and not f.isDir()]

    if not csv_files:
        print(f"⚠ WARNING: No CSV files found in {copy_from}")
        return

    # Take the first n CSV files
    files_to_copy = csv_files[:n]

    # One timestamp per call; the original file name keeps each destination unique
    timestamp = int(time.time())

    print(f"Copying {len(files_to_copy)} clean CSV file(s) from staging to source...")
    for f in files_to_copy:
        dest_name = f"sales_incremental_{timestamp}_{f.name}"
        dest_path = f"{copy_to}/{dest_name}"
        dbutils.fs.cp(f.path, dest_path)
        print(f"  ✓ Copied: {f.name} → {dest_name} ({f.size:,} bytes)")

    print(f"\n✓ Successfully copied {len(files_to_copy)} file(s) for incremental ingestion")
    print("  You can now refresh the streaming table to see incremental ingestion.")

# Copy one additional CSV file for incremental ingestion demo
copy_clean_csv_files(
    copy_from="/Volumes/lakeflow_demo/lakeflow_schema/autoloader_staging_files",
    copy_to="/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source",
    n=1
)
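
The selection and renaming logic in the cell above can be checked without a workspace. The sketch below is a hypothetical pure-Python helper, `plan_copies`, that works on plain file names instead of `dbutils` results; the listing and the timestamp value are made up for illustration.

```python
def plan_copies(file_names, n=1, timestamp=0):
    """Return (source_name, destination_name) pairs for the first n CSV files."""
    csv_names = [name for name in file_names if name.endswith(".csv")]
    return [(name, f"sales_incremental_{timestamp}_{name}") for name in csv_names[:n]]

# Hypothetical directory listing that mixes data files with Spark metadata.
listing = ["_SUCCESS", "_committed_123", "part-0001.csv", "part-0002.csv", "archive/"]
plan = plan_copies(listing, n=1, timestamp=1700000000)
print(plan)
```

Because each destination name embeds a new timestamp, copying the same staging file twice produces two distinct files in the source volume, and both would be ingested.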


### D2. Refresh the Streaming Table

8. Next, manually refresh the STREAMING TABLE using `REFRESH STREAMING TABLE table-name`. 

   - [Refresh a streaming table](https://docs.databricks.com/aws/en/dlt/dbsql/streaming#refresh-a-streaming-table) documentation

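   **NOTE:** A statement written as `CREATE OR REFRESH STREAMING TABLE` can also be rerun to ingest only new files. Do not rerun the cell in C2 for this purpose: it drops and recreates the table, so every file is ingested again.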


In [0]:
%sql
REFRESH STREAMING TABLE lakeflow_demo.lakeflow_schema.sql_csv_autoloader;


### D3. Verify Incremental Ingestion

9. Run the cell below to view the data in the **sql_csv_autoloader** table. Notice that the table now contains additional rows from the newly added file.


In [0]:
%sql
SELECT *
FROM sql_csv_autoloader;


10. Describe the history of the **sql_csv_autoloader** table. Observe the following:

  - Additional versions of the streaming table include **STREAMING UPDATE** operations.

  - Expand the **operationMetrics** column and note the number of rows that were incrementally ingested.


In [0]:
%sql
DESCRIBE HISTORY sql_csv_autoloader;
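
To compare how many rows each refresh added, the history rows can be summarized per version. The sketch below is plain Python over hypothetical sample data; the row counts are invented, and the dictionary shape only loosely follows the `DESCRIBE HISTORY` output. It assumes the `numOutputRows` metric is reported as a string, which is how `operationMetrics` values are typed.

```python
# Hypothetical history rows; the counts are made up for illustration.
history = [
    {"version": 0, "operation": "CREATE TABLE", "operationMetrics": {}},
    {"version": 1, "operation": "DLT SETUP", "operationMetrics": {}},
    {"version": 2, "operation": "STREAMING UPDATE",
     "operationMetrics": {"numOutputRows": "1000"}},
    {"version": 3, "operation": "STREAMING UPDATE",
     "operationMetrics": {"numOutputRows": "250"}},
]

def rows_per_update(history_rows):
    """Map each STREAMING UPDATE version to the number of rows it wrote."""
    return {
        row["version"]: int(row["operationMetrics"].get("numOutputRows", 0))
        for row in history_rows
        if row["operation"] == "STREAMING UPDATE"
    }

per_update = rows_per_update(history)
total_rows = sum(per_update.values())
print(per_update, total_rows)
```

In a notebook, similar rows could be collected with `spark.sql("DESCRIBE HISTORY sql_csv_autoloader").collect()`, though the exact metrics present may vary by operation.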


## E. Key Concepts and Benefits

### Key Benefits of Streaming Tables

1. **Automatic Incremental Processing**: Streaming tables automatically track which files have been processed and only ingest new files on each refresh.

2. **Serverless DLT Pipeline**: Each streaming table has its own automatically managed DLT (Delta Live Tables) pipeline that runs on serverless compute.

3. **Idempotent Operations**: Running the same refresh multiple times won't duplicate data - already processed files are skipped.

4. **No SQL Warehouse Compute for Refreshes**: The initial data refresh and population do not consume DBSQL warehouse compute. They run on the serverless DLT pipeline instead.

5. **Unity Catalog Integration**: Streaming tables are registered in Unity Catalog, making them accessible like any other table.

### When to Use Streaming Tables

- **Incremental Batch Ingestion**: When you need to periodically ingest new files from cloud storage
- **Real-time Data Pipelines**: When combined with streaming sources like Kafka
- **Automated Data Refresh**: When you want automatic tracking of processed files
- **Production Workloads**: When you need reliable, managed data ingestion pipelines

### Comparison: Streaming Tables vs COPY INTO

| Feature | Streaming Tables | COPY INTO |
|---------|------------------|-----------|
| Incremental Processing | Automatic (ingested files are tracked) | Automatic (already-loaded files are skipped when the command is rerun) |
| DLT Pipeline | Created and managed automatically | Not used |
| Compute | Serverless DLT pipeline | The SQL warehouse or cluster that runs the command |
| Recommended for New Projects | Yes | Legacy approach |
| Complexity | Lower | Higher |


## F. Additional Resources

- [Streaming Tables Documentation](https://docs.databricks.com/aws/en/dlt/dbsql/streaming)
- [CREATE STREAMING TABLE Syntax](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table)
- [REFRESH (MATERIALIZED VIEW or STREAMING TABLE)](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-refresh-full)
- [Auto Loader Documentation](https://docs.databricks.com/ingestion/auto-loader/index.html)
- [Delta Live Tables (DLT) Overview](https://docs.databricks.com/dlt/index.html)
