# Data Ingestion with Lakeflow Connect - Python/PySpark Edition

This notebook provides Python/PySpark implementations of the data ingestion exercises covered in the SQL version.

You will learn how to:
- Use PySpark to ingest Parquet, CSV, and JSON files into Delta tables
- Add metadata columns during ingestion using PySpark
- Handle rescued data columns for malformed records
- Work with JSON data and decode base64-encoded fields
- Apply incremental data ingestion patterns

---

**Environment Setup:**
- **Catalog:** `lakeflow_demo`
- **Schema:** `lakeflow_schema`
- **Volume:** `raw` (located at `/Volumes/lakeflow_demo/lakeflow_schema/raw/`)

**Note:** Make sure to run the `00_Setup_Environment.ipynb` notebook first to create the catalog, schema, volume, and sample data files.


## A. Setup

Run the following cell to configure your working environment for this notebook.


In [0]:
%python
# Set default catalog and schema
spark.sql("USE CATALOG lakeflow_demo")
spark.sql("USE SCHEMA lakeflow_schema")

# View current catalog and schema
current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0]
current_schema = spark.sql("SELECT current_schema()").collect()[0][0]

print(f"Current Catalog: {current_catalog}")
print(f"Current Schema: {current_schema}")


# 1. Data Ingestion with PySpark - Parquet Files

In this demonstration, we'll explore ingesting data from cloud storage into Delta tables using PySpark.

### Learning Objectives

By the end of this lesson, you should be able to:

- Use PySpark to read Parquet files and create Delta tables.
- Use PySpark DataFrame operations to perform incremental data loads.
- Understand the differences between SQL and PySpark approaches.


## B. Explore the Data Source Files

1. Let's first explore the Parquet files stored in the volume `/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/` using PySpark.


In [0]:
%python
# List files in the volume
files = dbutils.fs.ls('/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/')
print("Files in the volume:")
for file in files:
    if file.name.startswith('part-'):
        print(f"  {file.name} - {file.size} bytes")


2. Read and preview the Parquet files using PySpark to view the raw data structure.


In [0]:
%python
# Read Parquet files using PySpark
users_df = spark.read.format("parquet").load("/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/")

# Display schema and preview data
print("Schema:")
users_df.printSchema()

print("\nPreview (first 10 rows):")
users_df.limit(10).display()


## C. Batch Data Ingestion with PySpark

### C1. Creating Delta Tables with PySpark

1. Create a Delta table from the Parquet files using PySpark. This is equivalent to the SQL `CREATE TABLE AS SELECT` statement.


In [0]:
%python
# Drop the table if it exists
spark.sql("DROP TABLE IF EXISTS historical_users_bronze_pyspark")

# Read Parquet files
users_df = spark.read.format("parquet").load("/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/")

# Write to Delta table (equivalent to CREATE TABLE AS SELECT)
users_df.write.format("delta").mode("overwrite").saveAsTable("historical_users_bronze_pyspark")

# Preview the Delta table
print("Table created successfully!")
spark.table("historical_users_bronze_pyspark").display()


2. Describe the table to view metadata information.


In [0]:
%python
# Describe table extended to view metadata
table_info = spark.sql("DESCRIBE TABLE EXTENDED historical_users_bronze_pyspark")
table_info.display()


## D. Incremental Data Ingestion with PySpark

### D1. Incremental Load Pattern

In PySpark, an incremental load typically combines the following steps:
- Reading only the new files from the source
- Appending them to an existing Delta table
- Using merge operations when existing rows must be updated (upserts)

1. Let's demonstrate an incremental append operation.


In [0]:
%python
# Create a table for incremental loading
spark.sql("DROP TABLE IF EXISTS historical_users_bronze_incremental")

# First load - create the table
users_df = spark.read.format("parquet").load("/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/")
users_df.write.format("delta").mode("overwrite").saveAsTable("historical_users_bronze_incremental")

print("Initial load complete. Row count:", spark.table("historical_users_bronze_incremental").count())

# Simulate incremental load - append mode
# In a real scenario, you would read only new files
users_df.write.format("delta").mode("append").saveAsTable("historical_users_bronze_incremental")

print("After incremental append. Row count:", spark.table("historical_users_bronze_incremental").count())
print("Note: In production, you would filter for only new files to avoid duplicates")
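
The note about reading only new files can be sketched in plain Python. This is a minimal illustration rather than the notebook's method: `select_new_files` and the file names are hypothetical, and in practice a streaming table with Auto Loader (covered in section 5) tracks ingested files for you.

```python
def select_new_files(available, already_ingested):
    """Return the files that have not been ingested yet, in listing order."""
    seen = set(already_ingested)
    return [name for name in available if name not in seen]


# Hypothetical file names, for illustration only
ingested = ["part-0000.parquet", "part-0001.parquet"]
listing = ["part-0000.parquet", "part-0001.parquet", "part-0002.parquet"]

new_files = select_new_files(listing, ingested)
print(new_files)  # ['part-0002.parquet']

# Only the paths in new_files would be read and appended,
# and they would then be recorded as ingested.
ingested.extend(new_files)
print(select_new_files(listing, ingested))  # []
```

Running the selection a second time returns nothing, which is the property that keeps an append-only load free of duplicates.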


# 2. Adding Metadata Columns During Ingestion - Python

In this demonstration, we'll explore how to add metadata columns during data ingestion using PySpark.

### Learning Objectives

By the end of this lesson, you should be able to:

- Add metadata columns using PySpark DataFrame operations.
- Convert Unix timestamps to readable dates.
- Use PySpark functions to capture file-level metadata.


## B. Read Data and Add Metadata Columns

1. Read the Parquet files and add metadata columns using PySpark functions.


In [0]:
%python
from pyspark.sql.functions import col, from_unixtime, current_timestamp, input_file_name
from pyspark.sql.types import DateType

# Read Parquet files
users_df = spark.read.format("parquet").load("/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/")

# Add metadata columns
users_with_metadata = (
    users_df
    .withColumn("first_touch_date", from_unixtime(col("user_first_touch_timestamp") / 1_000_000).cast(DateType()))
    .withColumn("source_file", input_file_name())
    .withColumn("ingestion_time", current_timestamp())
)

# Display the result
print("Data with metadata columns:")
users_with_metadata.display()
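
The division by `1_000_000` assumes that `user_first_touch_timestamp` stores microseconds since the Unix epoch, because `from_unixtime()` expects seconds. A plain-Python sketch of the same arithmetic follows; the sample value is made up, and the sketch uses UTC, whereas Spark applies the session time zone, so dates close to midnight may differ.

```python
from datetime import datetime, timezone


def micros_to_date(micros):
    """Convert microseconds since the Unix epoch to a UTC calendar date."""
    seconds = micros // 1_000_000  # from_unixtime() works in seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


sample_micros = 1_593_878_946_592_107  # made-up value for illustration
print(micros_to_date(sample_micros))  # 2020-07-04
```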


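2. Note: The hidden `_metadata` column is not limited to the `read_files()` SQL function. On Databricks, DataFrames read from file-based sources also expose it, for example as `col("_metadata.file_name")` and `col("_metadata.file_modification_time")`. This notebook uses `input_file_name()`, which returns the full file path; if that function is not supported on your compute (for example, in some Unity Catalog access modes), select the `_metadata` fields instead.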

Let's create the final bronze table with all metadata columns.


In [0]:
%python
from pyspark.sql.functions import col, from_unixtime, current_timestamp, input_file_name
from pyspark.sql.types import DateType

# Drop the table if it exists
spark.sql("DROP TABLE IF EXISTS historical_users_bronze_pyspark_metadata")

# Read Parquet files
users_df = spark.read.format("parquet").load("/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/")

# Add metadata columns
users_with_metadata = (
    users_df
    .withColumn("first_touch_date", from_unixtime(col("user_first_touch_timestamp") / 1_000_000).cast(DateType()))
    .withColumn("source_file", input_file_name())
    .withColumn("ingestion_time", current_timestamp())
)

# Write to Delta table
users_with_metadata.write.format("delta").mode("overwrite").saveAsTable("historical_users_bronze_pyspark_metadata")

# View the final bronze table
print("Final bronze table with metadata:")
spark.table("historical_users_bronze_pyspark_metadata").display()


3. Alternative: Using `read_files()` SQL function from PySpark to access `_metadata` column directly.


In [0]:
%python
# Using read_files() SQL function to access _metadata
users_with_metadata_sql = spark.sql("""
    SELECT
      *,
      cast(from_unixtime(user_first_touch_timestamp / 1000000) AS DATE) AS first_touch_date,
      _metadata.file_modification_time AS file_modification_time,
      _metadata.file_name AS source_file,
      current_timestamp() as ingestion_time
    FROM read_files(
      '/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/',
      format => 'parquet'
    )
""")

# Display the result
print("Using read_files() with _metadata:")
users_with_metadata_sql.display()


# 3. Handling CSV Ingestion with the Rescued Data Column - Python

In this demonstration, we'll focus on ingesting CSV files into Delta Lake using PySpark and exploring the rescued data column.

### Learning Objectives

By the end of this lesson, you will be able to:

- Ingest CSV files as Delta tables using PySpark.
- Define and apply explicit schemas with PySpark.
- Handle and inspect rescued data that does not conform to the defined schema.


## B. Inspect the Dataset

1. Let's first inspect the CSV file with malformed data using PySpark.


In [0]:
%python
# Read CSV file as text to inspect raw content
text_df = spark.read.text("/Volumes/lakeflow_demo/lakeflow_schema/raw/products-csv/lab_malformed_data.csv")
print("Raw CSV content:")
text_df.display()


### B2. Ingesting and Rescuing Malformed Data with PySpark

1. Using `read_files()` SQL function from PySpark to read CSV with rescued data column.


In [0]:
%python
# Using read_files() SQL function to read CSV with rescued data
products_df = spark.sql("""
    SELECT * 
    FROM read_files(
        '/Volumes/lakeflow_demo/lakeflow_schema/raw/products-csv/lab_malformed_data.csv',
        format => 'csv',
        sep => ',',
        header => true,
        schema => 'item_id STRING, name STRING, price DOUBLE',
        rescuedDataColumn => '_rescued_data'
    )
""")

print("CSV data with rescued data column:")
products_df.display()
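
To make the idea of a rescued data column concrete, here is a plain-Python sketch of the concept. It is an illustration only and not how `read_files()` is implemented: the `rescue_row` helper, the sample rows, and the exact contents of `_rescued_data` are assumptions. Values that cannot be cast to the declared type become `None`, and their original text is kept in a JSON string.

```python
import csv
import io
import json

# Declared schema: column name -> casting function (mirrors the schema above)
SCHEMA = {"item_id": str, "name": str, "price": float}


def rescue_row(row):
    """Cast each declared column; keep values that fail the cast as JSON."""
    parsed, rescued = {}, {}
    for column, cast in SCHEMA.items():
        raw = row.get(column)
        try:
            parsed[column] = cast(raw) if raw not in (None, "") else None
        except ValueError:
            parsed[column] = None
            rescued[column] = raw
    parsed["_rescued_data"] = json.dumps(rescued) if rescued else None
    return parsed


# Made-up sample data; the second row has a non-numeric price
sample_csv = "item_id,name,price\nM_1,Mattress,1045.0\nM_2,Pillow,twelve\n"
rows = [rescue_row(r) for r in csv.DictReader(io.StringIO(sample_csv))]

for r in rows:
    print(r)
```

The second row keeps its `item_id` and `name`, gets a `None` price, and carries the unparseable text in `_rescued_data`, so nothing is silently lost.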


2. Alternative: Using native PySpark to read CSV with schema and handle errors.


In [0]:
%python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Define schema
schema = StructType([
    StructField("item_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True)
])

# Read CSV with an explicit schema. PERMISSIVE mode keeps malformed rows
# and sets fields that cannot be parsed to null.
# Note: open-source Spark's CSV reader has no rescued data column.
# This notebook uses the read_files() SQL function for that feature.
try:
    products_df_native = spark.read.format("csv") \
        .option("sep", ",") \
        .option("header", "true") \
        .option("mode", "PERMISSIVE") \
        .schema(schema) \
        .load("/Volumes/lakeflow_demo/lakeflow_schema/raw/products-csv/lab_malformed_data.csv")
    
    print("Native PySpark read (malformed rows will have null values):")
    products_df_native.display()
except Exception as e:
    print(f"Error: {e}")
    print("Note: For rescued_data column, use read_files() SQL function instead")


### B3. Add Additional Metadata Columns During Ingestion

1. Create the final bronze table with metadata columns using PySpark.


In [0]:
%python
# Drop the table if it exists
spark.sql("DROP TABLE IF EXISTS products_bronze_pyspark")

# Read CSV with rescued data using read_files() SQL function
products_df = spark.sql("""
    SELECT 
      *,
      _metadata.file_modification_time AS file_modification_time,
      _metadata.file_name AS source_file,
      current_timestamp() as ingestion_time
    FROM read_files(
        '/Volumes/lakeflow_demo/lakeflow_schema/raw/products-csv/lab_malformed_data.csv',
        format => 'csv',
        sep => ',',
        header => true,
        schema => 'item_id STRING, name STRING, price DOUBLE',
        rescuedDataColumn => '_rescued_data'
    )
""")

# Write to Delta table
products_df.write.format("delta").mode("overwrite").saveAsTable("products_bronze_pyspark")

# View the final table
print("Final products bronze table:")
spark.table("products_bronze_pyspark").display()


# 4. Ingesting JSON Files with PySpark

In this demonstration, we'll explore how to ingest JSON files and perform transformations using PySpark, including decoding encoded fields and flattening nested JSON strings.

### Learning Objectives
By the end of this lesson, you should be able to:
- Ingest raw JSON data into Unity Catalog using PySpark.
- Apply techniques to flatten JSON string columns.
- Decode base64-encoded fields using PySpark functions.


## B. Overview of JSON Ingestion with PySpark

### B1. Inspect JSON files

1. List and preview the JSON files.


In [0]:
%python
# List files in the volume
files = dbutils.fs.ls('/Volumes/lakeflow_demo/lakeflow_schema/raw/events-kafka/')
print("JSON files in the volume:")
for file in files:
    if file.name.startswith('part-'):
        print(f"  {file.name} - {file.size} bytes")


2. Read JSON files using PySpark.


In [0]:
%python
# Read JSON files using PySpark
kafka_events_df = spark.read.format("json").load("/Volumes/lakeflow_demo/lakeflow_schema/raw/events-kafka/")

print("Schema:")
kafka_events_df.printSchema()

print("\nPreview (first 5 rows):")
kafka_events_df.limit(5).display()


### B2. Create Bronze Table with Raw JSON Data

1. Store the raw JSON data in a Delta table.


In [0]:
%python
# Drop the table if it exists
spark.sql("DROP TABLE IF EXISTS kafka_events_bronze_raw_pyspark")

# Read JSON files
kafka_events_df = spark.read.format("json").load("/Volumes/lakeflow_demo/lakeflow_schema/raw/events-kafka/")

# Write to Delta table
kafka_events_df.write.format("delta").mode("overwrite").saveAsTable("kafka_events_bronze_raw_pyspark")

# Display the table
print("Raw Kafka events table:")
spark.table("kafka_events_bronze_raw_pyspark").display()


### B3. Decoding base64 Strings with PySpark

1. Decode the base64-encoded key and value columns using PySpark functions.


In [0]:
%python
from pyspark.sql.functions import col, unbase64

# Read from the bronze table
kafka_events_df = spark.table("kafka_events_bronze_raw_pyspark")

# Decode base64 columns
decoded_df = kafka_events_df.select(
    col("key").alias("encoded_key"),
    unbase64(col("key")).alias("decoded_key_binary"),
    col("value").alias("encoded_value"),
    unbase64(col("value")).alias("decoded_value_binary"),
    col("offset"),
    col("partition"),
    col("timestamp"),
    col("topic")
)

print("Decoded columns (as BINARY):")
decoded_df.display()


2. Convert BINARY columns to STRING columns.


In [0]:
%python
from pyspark.sql.functions import col, unbase64

# Read from the bronze table
kafka_events_df = spark.table("kafka_events_bronze_raw_pyspark")

# Decode and cast to STRING
decoded_df = kafka_events_df.select(
    col("key").alias("encoded_key"),
    unbase64(col("key")).cast("string").alias("decoded_key"),
    col("value").alias("encoded_value"),
    unbase64(col("value")).cast("string").alias("decoded_value"),
    col("offset"),
    col("partition"),
    col("timestamp"),
    col("topic")
)

print("Decoded columns (as STRING):")
decoded_df.display()
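
The same two steps can be seen on a single value with Python's standard `base64` module. This sketch uses a made-up JSON payload: decoding the base64 text yields bytes (the counterpart of a BINARY column), and decoding those bytes as UTF-8 yields the readable text (the counterpart of the STRING cast).

```python
import base64

payload = '{"device": "macOS"}'  # made-up JSON payload

encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
decoded_binary = base64.b64decode(encoded)       # bytes, like a BINARY column
decoded_string = decoded_binary.decode("utf-8")  # text, like a STRING column

print(encoded)
print(decoded_binary)
print(decoded_string)
```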


3. Create a decoded bronze table.


In [0]:
%python
from pyspark.sql.functions import col, unbase64

# Drop the table if it exists
spark.sql("DROP TABLE IF EXISTS kafka_events_bronze_decoded_pyspark")

# Read from the bronze table
kafka_events_df = spark.table("kafka_events_bronze_raw_pyspark")

# Decode and cast to STRING
decoded_df = kafka_events_df.select(
    unbase64(col("key")).cast("string").alias("decoded_key"),
    col("offset"),
    col("partition"),
    col("timestamp"),
    col("topic"),
    unbase64(col("value")).cast("string").alias("decoded_value")
)

# Write to Delta table
decoded_df.write.format("delta").mode("overwrite").saveAsTable("kafka_events_bronze_decoded_pyspark")

# View the new table
print("Decoded Kafka events table:")
spark.table("kafka_events_bronze_decoded_pyspark").display()


## C. Working with JSON Formatted Strings in PySpark

### C1. Flattening JSON String Columns

1. Extract fields from JSON-formatted strings using PySpark's JSON functions.


In [0]:
%python
from pyspark.sql.functions import col, get_json_object

# Read from decoded table
decoded_df = spark.table("kafka_events_bronze_decoded_pyspark")

# Extract fields from JSON string using get_json_object
flattened_df = decoded_df.select(
    col("decoded_value"),
    get_json_object(col("decoded_value"), "$.device").alias("device"),
    get_json_object(col("decoded_value"), "$.traffic_source").alias("traffic_source"),
    get_json_object(col("decoded_value"), "$.geo").alias("geo"),
    get_json_object(col("decoded_value"), "$.items").alias("items"),
    col("decoded_key"),
    col("offset"),
    col("partition"),
    col("timestamp"),
    col("topic")
)

print("Flattened JSON fields:")
flattened_df.display()
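
A plain-Python sketch of what this kind of extraction returns, using a made-up event and a hypothetical `json_field` helper. The point it illustrates is that every extracted value is text: scalar fields come back as plain strings, nested objects come back as JSON-formatted strings, and missing fields come back as nulls. The exact spacing of the nested text produced by `get_json_object()` may differ from this sketch.

```python
import json


def json_field(json_string, field):
    """Return a top-level field as text, or None if it is missing."""
    value = json.loads(json_string).get(field)
    if value is None:
        return None
    # Scalars become plain text; objects and arrays stay JSON-formatted text
    return value if isinstance(value, str) else json.dumps(value)


# Made-up event for illustration
event = '{"device": "macOS", "geo": {"city": "Austin", "state": "TX"}}'

print(json_field(event, "device"))   # macOS
print(json_field(event, "geo"))      # {"city": "Austin", "state": "TX"}
print(json_field(event, "missing"))  # None
```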


2. Alternative: Using SQL-style JSON path extraction (column:field syntax) via Spark SQL.


In [0]:
%python
# Using SQL-style JSON path extraction
flattened_sql_df = spark.sql("""
    SELECT 
      decoded_value,
      decoded_value:device,
      decoded_value:traffic_source,
      decoded_value:geo,
      decoded_value:items,
      decoded_key,
      offset,
      partition,
      timestamp,
      topic
    FROM kafka_events_bronze_decoded_pyspark
    LIMIT 5
""")

print("Using SQL-style JSON path extraction:")
flattened_sql_df.display()


3. Create a flattened bronze table.


In [0]:
%python
# Drop the table if it exists
spark.sql("DROP TABLE IF EXISTS kafka_events_bronze_string_flattened_pyspark")

# Create flattened table using SQL
flattened_table = spark.sql("""
    SELECT
      decoded_key,
      offset,
      partition,
      timestamp,
      topic,
      decoded_value:device,
      decoded_value:traffic_source,
      decoded_value:geo,
      decoded_value:items
    FROM kafka_events_bronze_decoded_pyspark
""")

# Write to Delta table
flattened_table.write.format("delta").mode("overwrite").saveAsTable("kafka_events_bronze_string_flattened_pyspark")

# Display the table
print("Flattened Kafka events table:")
spark.table("kafka_events_bronze_string_flattened_pyspark").display()


### C2. Flattening JSON-Formatted Strings via STRUCT Conversion - Python

As in the previous section, we flatten the JSON STRING column **decoded_value**, this time by converting it to a STRUCT column.

#### Benefits and Considerations of STRUCT Columns

**Benefits**
- **Schema Enforcement** – STRUCT columns define and enforce a schema, helping maintain data integrity.
- **Improved Performance** – STRUCTs are generally more efficient for querying and processing than plain strings.

**Considerations**
- **Schema Enforcement** – Because the schema is enforced, issues can arise if the JSON structure changes over time.
- **Reduced Flexibility** – The data must consistently match the defined schema, leaving less room for structural variation.
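
A plain-Python sketch of this trade-off, under the assumption that a fixed list of field names stands in for the STRUCT schema (the `to_struct` helper and the sample events are hypothetical). Declared fields that are absent come back as `None`, and fields that were not declared are dropped, which is roughly what happens when a JSON string is parsed against a fixed schema.

```python
import json

DECLARED_FIELDS = ["device", "page", "customer_id"]  # stand-in for a schema


def to_struct(json_string):
    """Keep only declared fields; missing ones become None."""
    data = json.loads(json_string)
    return {field: data.get(field) for field in DECLARED_FIELDS}


# Made-up events: the second one has changed shape over time
old_event = '{"device": "iOS", "page": "/home", "customer_id": "C1"}'
new_event = '{"device": "iOS", "screen": "/home", "customer_id": "C1"}'

print(to_struct(old_event))
print(to_struct(new_event))  # 'page' is None and 'screen' is dropped
```

The renamed field does not raise an error; it simply disappears, which is why schema changes in the source need to be monitored when STRUCT columns are used.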

#### C2.1 Converting a JSON STRING to a STRUCT Column

To convert a JSON-formatted STRING column to a STRUCT column, you can use PySpark's `from_json()` function with a defined schema.

1. First, get a sample JSON string to determine the schema.


In [0]:
%python
# Get a sample JSON string from the decoded table
sample_json = spark.table("kafka_events_bronze_decoded_pyspark").select("decoded_value").first()[0]
print("Sample JSON string:")
print(sample_json[:200] + "..." if len(sample_json) > 200 else sample_json)


2. Use `from_json()` to convert the JSON string column to a STRUCT. You can define the schema manually or use `schema_of_json()` to infer it.

   **Note:** For this exercise, we'll use a simplified schema. In practice, you can use `schema_of_json()` to get the schema from a sample JSON string.


In [0]:
%python
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType, MapType

# Define schema for the JSON structure
json_schema = StructType([
    StructField("device", StringType(), True),
    StructField("page", StringType(), True),
    StructField("action", StringType(), True),
    StructField("event_timestamp", LongType(), True),
    StructField("location", MapType(StringType(), StringType()), True),
    StructField("session_id", StringType(), True),
    StructField("customer_id", StringType(), True)
])

# Drop table if exists
spark.sql("DROP TABLE IF EXISTS kafka_events_bronze_struct_pyspark")

# Read the decoded table
decoded_df = spark.table("kafka_events_bronze_decoded_pyspark")

# Convert JSON string to STRUCT
struct_df = decoded_df.select(
    col("decoded_key"),
    col("offset"),
    col("partition"),
    col("timestamp"),
    col("topic"),
    from_json(col("decoded_value"), json_schema).alias("value")
)

# Write to Delta table
struct_df.write.format("delta").mode("overwrite").saveAsTable("kafka_events_bronze_struct_pyspark")

# Display the table
print("Kafka events with STRUCT column:")
spark.table("kafka_events_bronze_struct_pyspark").display()


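#### C2.2 Extract Fields and Nested Fields from STRUCT Columns

We can query the STRUCT column using dot notation, for example `value.device` or `value.location`, in our SELECT statement.

1. Using this syntax, we can obtain values from the **value** STRUCT column. Notice the following:

   - We obtained **device**, **page**, and **customer_id** as top-level fields, and **location** as a map of key-value pairs. A single entry can be read with `col("value.location").getItem("city")`, assuming the data contains a `city` key.

   - The STRUCT provides better performance and type safety than JSON string extraction, because the fields are typed once when the JSON is parsed.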


In [0]:
%python
from pyspark.sql.functions import col

# Extract fields from STRUCT column
struct_df = spark.table("kafka_events_bronze_struct_pyspark")

extracted_df = struct_df.select(
    col("decoded_key"),
    col("value.device").alias("device"),           # Field
    col("value.page").alias("page"),               # Field
    col("value.location").alias("location"),       # Map column (MapType)
    col("value.customer_id").alias("customer_id")  # Field
)

print("Extracted fields from STRUCT:")
extracted_df.display()


## D. Working with a VARIANT Column (Public Preview) - Python

#### VARIANT Column Benefits and Considerations:

**BENEFITS**
- **Open** - Fully open-sourced, no proprietary data lock-in.
- **Flexible** - No strict schema. You can put any type of semi-structured data into VARIANT.
- **Performant** - Improved performance over existing methods.

**CONSIDERATIONS**
- Currently in public preview as of 2025 Q2.
- [Variant support in Delta Lake](https://docs.databricks.com/aws/en/delta/variant)

**RESOURCES**:
- [Introducing the Open Variant Data Type in Delta Lake and Apache Spark](https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark)
- [Say goodbye to messy JSON headaches with VARIANT](https://www.youtube.com/watch?v=fWdxF7nL3YI)
- [Variant Data Type - Making Semi-Structured Data Fast and Simple](https://www.youtube.com/watch?v=jtjOfggD4YY)

**NOTE:** Variant data type will not work on Serverless Version 1.

1. View the **kafka_events_bronze_decoded_pyspark** table. Confirm the **decoded_value** column contains a JSON formatted string.


In [0]:
%python
# View the decoded table
print("Kafka events bronze decoded table:")
spark.table("kafka_events_bronze_decoded_pyspark").display()


2. Use the `parse_json()` function to return a VARIANT value from the JSON formatted string.

   Run the cell and view the results. Notice that the **json_variant_value** column is of type VARIANT.


In [0]:
%python
# Use parse_json SQL function to convert JSON string to VARIANT
spark.sql("DROP TABLE IF EXISTS kafka_events_bronze_variant_pyspark")

variant_df = spark.sql("""
    SELECT
      decoded_key,
      offset,
      partition,
      timestamp,
      topic,
      parse_json(decoded_value) AS json_variant_value
    FROM kafka_events_bronze_decoded_pyspark
""")

# Write to Delta table
variant_df.write.format("delta").mode("overwrite").saveAsTable("kafka_events_bronze_variant_pyspark")

# Display the table
print("Kafka events with VARIANT column:")
spark.table("kafka_events_bronze_variant_pyspark").display()


3. You can extract fields from the VARIANT column using the `:` path syntax and cast them with `::` to create your desired table.

   [VARIANT type](https://docs.databricks.com/aws/en/sql/language-manual/data-types/variant-type)


In [0]:
%python
# Parse VARIANT column using SQL-style path extraction
variant_parsed = spark.sql("""
    SELECT
      json_variant_value,
      json_variant_value:browser :: STRING AS browser,  -- Obtain the value of browser and cast to a string
      json_variant_value:page :: STRING AS page,
      json_variant_value:location AS location
    FROM kafka_events_bronze_variant_pyspark
    LIMIT 10
""")

print("Parsed VARIANT column:")
variant_parsed.display()


# 5. Creating Streaming Tables with SQL using Auto Loader - Python

In this demonstration we will create a streaming table to incrementally ingest files from a volume using Auto Loader with SQL. 

When you create a streaming table using the CREATE OR REFRESH STREAMING TABLE statement, the initial data refresh and population begin immediately. These operations do not consume DBSQL warehouse compute. Instead, streaming tables rely on serverless DLT for both creation and refresh. A dedicated serverless DLT pipeline is automatically created and managed by the system for each streaming table.

### Learning Objectives

By the end of this lesson, you should be able to:
- Create streaming tables in Databricks SQL for incremental data ingestion.
- Refresh streaming tables using the REFRESH statement.

### RECOMMENDATION

The CREATE STREAMING TABLE SQL command is the recommended alternative to the legacy COPY INTO SQL command for incremental ingestion from cloud object storage. Databricks recommends using streaming tables to ingest data using Databricks SQL. 

A streaming table is a table registered to Unity Catalog with extra support for streaming or incremental data processing. A DLT pipeline is automatically created for each streaming table. You can use streaming tables for incremental data loading from Kafka and cloud object storage.

**NOTE:** Streaming tables are created using SQL syntax, but we can execute them from Python using `spark.sql()`.


## A. Setup for Streaming Tables

**REQUIRED - SELECT YOUR SERVERLESS SQL WAREHOUSE**

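**NOTE: Creating streaming tables with Databricks SQL requires a SQL warehouse.** A notebook attached to a SQL warehouse runs only SQL and Markdown cells, so if the `%python` cells in this section cannot run on your compute, run the SQL statement inside each `spark.sql()` call in a SQL cell instead.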

Before executing cells in this notebook, please select a **SQL WAREHOUSE** in the lab. Follow these steps:

1. Navigate to the top-right of this notebook and click the drop-down to select compute (it might say **Connect**).
2. Select **More**.
3. Then select the **SQL Warehouse** button.
4. Select or create a SQL warehouse.
5. Then, at the bottom of the pop-up, select **Start and attach**.


## B. Create Streaming Tables for Incremental Processing

1. Explore the volume `/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source` and confirm it contains CSV file(s).

   Use Python to list the files in this volume.


In [0]:
%python
# List files in the autoloader source volume
files = dbutils.fs.ls("/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source")
print("Files in csv_files_autoloader_source volume:")
for file in files:
    print(f"  {file.name} - {file.size} bytes")


2. Run the query below to view the data in the CSV file(s) in your cloud storage location. Notice that the data is returned in tabular format.


In [0]:
%python
# View data in CSV files
csv_df = spark.sql("""
    SELECT *
    FROM read_files(
      '/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source',
      format => 'CSV',
      sep => '|',
      header => true
    )
    LIMIT 10
""")

print("Sample data from CSV files:")
csv_df.display()


#### Create a STREAMING TABLE using Databricks SQL

3. Your goal is to create an incremental pipeline that only ingests new files (instead of using traditional batch ingestion). You can achieve this by using [streaming tables in Databricks SQL](https://docs.databricks.com/aws/en/dlt/dbsql/streaming) (Auto Loader).

   - The SQL code below creates a streaming table that will incrementally ingest only new data.
   
   - A pipeline is automatically created for each streaming table. You can use streaming tables for incremental data loading from Kafka and cloud object storage.

   **NOTE:** Incremental batch ingestion automatically detects new records in the data source and ignores records that have already been ingested. This reduces the amount of data processed, making ingestion jobs faster and more efficient in their use of compute resources.

   **REQUIRED: This process will take about a minute to run and set up the incremental ingestion pipeline.**


In [0]:
%python
# Create the streaming table using SQL (or refresh it if it already exists)
spark.sql("""
    CREATE OR REFRESH STREAMING TABLE sql_csv_autoloader_pyspark
    AS
    SELECT *
    FROM STREAM read_files(
      '/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source',
      format => 'CSV',
      sep => '|',
      header => true
    )
""")

print("Streaming table created successfully!")


4. Run the cell below to view the streaming table. Confirm that the results contain the expected number of rows.


In [0]:
%python
# View the streaming table
print("Streaming table data:")
spark.table("sql_csv_autoloader_pyspark").display()


5. Describe the STREAMING TABLE and view the results. Notice the following:

   - Under **Detailed Table Information**:
     - **View Text**: The query that created the table.
     - **Type**: Specifies that it is a STREAMING TABLE.
     - **Provider**: Indicates that it is a Delta table.

   - Under **Refresh Information**, you can see specific refresh details, including Last Refreshed, Last Refresh Type, and Latest Refresh Status.


In [0]:
%python
# Describe the streaming table
describe_df = spark.sql("DESCRIBE TABLE EXTENDED sql_csv_autoloader_pyspark")
print("Table description:")
describe_df.display()


6. The `DESCRIBE HISTORY` statement displays a detailed list of all changes, versions, and metadata associated with a Delta streaming table, including information on updates, deletions, and schema changes.

   Run the cell below and view the results. Notice the following:

   - In the **operation** column, you can see that a streaming table performs the operations **CREATE TABLE**, **DLT SETUP**, and **STREAMING UPDATE**.
   
   - Scroll to the right and find the **operationMetrics** column to see the number of rows processed.


In [0]:
%python
# Describe history of the streaming table
history_df = spark.sql("DESCRIBE HISTORY sql_csv_autoloader_pyspark")
print("Table history:")
history_df.display()


7. To demonstrate incremental ingestion, manually add another file to your cloud storage location: `/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source`.

   **Option 1 - Using Python:**
   - Copy a file from the staging volume to the source volume

   **Option 2 - Using UI:**
   - Click the catalog icon on the left
   - Expand the **lakeflow_demo** catalog
   - Expand your **lakeflow_schema** schema
   - Expand **Volumes**
   - Open the **autoloader_staging_files** volume
   - Copy a file from there to the **csv_files_autoloader_source** volume


In [0]:
%python
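# Option 1: Copy a file from staging to source volume using Python
def copy_files(copy_from, copy_to, n=1):
    """Copy up to n files that are not already present in copy_to."""
    # Skip files that already exist in the target so the copy adds new data
    existing = {f.name for f in dbutils.fs.ls(copy_to)}
    new_files = [f for f in dbutils.fs.ls(copy_from) if f.name not in existing][:n]
    for f in new_files:
        dbutils.fs.cp(f.path, f"{copy_to}/{f.name}")
    print(f"Copied {len(new_files)} file(s) from {copy_from} to {copy_to}")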

# Copy one additional file for incremental ingestion demo
copy_files(
    copy_from="/Volumes/lakeflow_demo/lakeflow_schema/autoloader_staging_files",
    copy_to="/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source",
    n=1
)

print("File copied. You can now refresh the streaming table to see incremental ingestion.")


8. Next, manually refresh the STREAMING TABLE using `REFRESH STREAMING TABLE table-name`. 

   - [Refresh a streaming table](https://docs.databricks.com/aws/en/dlt/dbsql/streaming#refresh-a-streaming-table) documentation

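   **NOTE:** If the table is defined with `CREATE OR REFRESH STREAMING TABLE`, rerunning that statement also ingests only new files. A plain `CREATE STREAMING TABLE` fails on rerun because the table already exists.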


In [0]:
%python
# Refresh the streaming table
spark.sql("REFRESH STREAMING TABLE lakeflow_demo.lakeflow_schema.sql_csv_autoloader_pyspark")
print("Streaming table refreshed successfully!")


9. Run the cell below to view the data in the **sql_csv_autoloader_pyspark** table. Notice that the table now contains additional rows from the newly added file.


In [0]:
%python
# View the streaming table after refresh
print("Streaming table data after incremental ingestion:")
spark.table("sql_csv_autoloader_pyspark").display()


10. Describe the history of the **sql_csv_autoloader_pyspark** table. Observe the following:

  - Additional versions of the streaming table include **STREAMING UPDATE** operations.

  - Expand the **operationMetrics** column and note the number of rows that were incrementally ingested.


In [0]:
%python
# Describe history after incremental ingestion
history_df = spark.sql("DESCRIBE HISTORY sql_csv_autoloader_pyspark")
print("Table history after incremental ingestion:")
history_df.display()


## Additional Resources

- [Streaming Tables Documentation](https://docs.databricks.com/aws/en/dlt/dbsql/streaming)
- [CREATE STREAMING TABLE Syntax](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table)
- [REFRESH (MATERIALIZED VIEW or STREAMING TABLE)](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-refresh-full)


## Summary

This notebook demonstrated Python/PySpark equivalents of the SQL-based data ingestion exercises:

1. **Parquet Ingestion**: Using `spark.read.format("parquet")` and `df.write.saveAsTable()`
2. **Metadata Columns**: Using PySpark functions like `input_file_name()`, `current_timestamp()`, and `from_unixtime()`
3. **CSV with Rescued Data**: Using `read_files()` SQL function from PySpark to access `_rescued_data` column
4. **JSON Ingestion**: Using `spark.read.format("json")` and JSON parsing functions
5. **Base64 Decoding**: Using `unbase64()` function in PySpark
6. **JSON Flattening**: Using `get_json_object()` or SQL-style path extraction
7. **STRUCT Conversion**: Using `from_json()` with defined schemas to convert JSON strings to STRUCT columns
8. **VARIANT Columns**: Using `parse_json()` to convert JSON strings to VARIANT data type
9. **Streaming Tables**: Creating streaming tables using SQL executed from Python for incremental data ingestion

### Key Differences: SQL vs PySpark

- **Reading files and file metadata**: SQL uses `read_files()` with the built-in `_metadata` column directly. PySpark uses `spark.read` with `input_file_name()` for file names, or calls `read_files()` via `spark.sql()` to access `_metadata`.
- **Creating tables**: SQL uses the `CREATE TABLE AS SELECT` syntax. PySpark uses the `df.write.saveAsTable()` method.
- **Streaming tables**: SQL uses the `CREATE STREAMING TABLE` syntax directly. PySpark executes the same statement via `spark.sql()`.

Both approaches are valid and can be used based on your preference and use case!
