# Auto Loader - Incremental File Ingestion

## Overview
This notebook covers Auto Loader (the `cloudFiles` Structured Streaming source), the recommended way to incrementally and efficiently ingest files from cloud storage into Delta Lake on Databricks.

## Learning Objectives
- Understand Auto Loader architecture
- Configure Auto Loader for different file formats
- Handle schema inference and evolution
- Process new files automatically
- Handle errors and rescue data
- Monitor Auto Loader streams

---

## 1. Auto Loader Basics

### What is Auto Loader?

**Auto Loader** incrementally and efficiently processes new data files as they arrive in cloud storage.

**Benefits**:
- ✅ Scalable (handles millions of files)
- ✅ Efficient (only processes new files)
- ✅ Automatic schema inference and evolution
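- ✅ Built-in error handling (rescued data column, bad records path)
- ✅ Exactly-once processing guarantees when writing to Delta Lake (ingested files are tracked in the checkpoint)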

**Use Cases**:
- Ingesting logs from S3/ADLS/GCS
- Processing streaming data files
- Building data lakes
- ETL pipelines
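
The "only processes new files" idea can be pictured with a toy model in plain Python. This is a minimal sketch, not Auto Loader's implementation: the class `ToyIncrementalLoader` and the `/landing/...` paths are hypothetical, and an in-memory set stands in for the file state that a real stream persists in its checkpoint.

```python
from typing import Iterable, List, Set


class ToyIncrementalLoader:
    """Toy model of incremental ingestion: remember which files were seen."""

    def __init__(self) -> None:
        # Stand-in for the state a real stream keeps in its checkpoint.
        self._seen: Set[str] = set()

    def discover_new(self, listing: Iterable[str]) -> List[str]:
        """Return files not ingested before and record them as ingested."""
        new_files = sorted(f for f in set(listing) if f not in self._seen)
        self._seen.update(new_files)
        return new_files


loader = ToyIncrementalLoader()
first = loader.discover_new(["/landing/a.json", "/landing/b.json"])
second = loader.discover_new(["/landing/a.json", "/landing/b.json", "/landing/c.json"])
third = loader.discover_new(["/landing/a.json", "/landing/b.json", "/landing/c.json"])

print(first)   # ['/landing/a.json', '/landing/b.json']
print(second)  # ['/landing/c.json']
print(third)   # []
```

Each file is returned exactly once, no matter how often the same listing is presented. That is the property the checkpoint provides for a real stream.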

## 2. Basic Auto Loader Configuration

In [None]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Define paths
source_path = "/path/to/source/files"
checkpoint_path = "/path/to/checkpoint"
target_path = "/path/to/delta/table"

# Basic Auto Loader configuration
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", checkpoint_path) \
    .load(source_path)

# Write to Delta
query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .start(target_path)

print("Auto Loader stream configured")
# query.awaitTermination()
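
When several streams share the same basic settings, it can help to collect the reader options in one place. The sketch below assumes two hypothetical helpers, `build_cloudfiles_options` and `read_with_auto_loader`. Only the first runs without Spark. The second shows how the dictionary could be passed to `DataStreamReader.options` on a Databricks cluster.

```python
from typing import Dict


def build_cloudfiles_options(file_format: str, schema_location: str,
                             **reader_options: str) -> Dict[str, str]:
    """Collect Auto Loader reader options into one dictionary (hypothetical helper)."""
    if not schema_location:
        raise ValueError("schema_location must be a non-empty path")
    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }
    # Format-specific options such as multiLine or header are passed through as-is.
    options.update(reader_options)
    return options


def read_with_auto_loader(spark, source_path: str, options: Dict[str, str]):
    """Apply the options to a streaming read; requires a Databricks Spark session."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


json_options = build_cloudfiles_options("json", "/path/to/checkpoint", multiLine="true")
print(json_options)
```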

## 3. File Format Support

### JSON Files

In [None]:
# JSON Auto Loader
json_stream = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/json") \
    .option("multiLine", "true") \
    .load("/source/json/*")

print("JSON Auto Loader configured")

### CSV Files

In [None]:
# CSV Auto Loader
# (type inference uses cloudFiles.inferColumnTypes; without it, columns are read as strings)
csv_stream = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", "/checkpoint/csv") \
    .option("header", "true") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .load("/source/csv/*")

print("CSV Auto Loader configured")

### Parquet Files

In [None]:
# Parquet Auto Loader
parquet_stream = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.schemaLocation", "/checkpoint/parquet") \
    .load("/source/parquet/*")

print("Parquet Auto Loader configured")

## 4. Schema Inference and Evolution

### Automatic Schema Inference

In [None]:
# Auto Loader automatically infers schema from sample files
auto_schema = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/infer") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .load("/source/data/*")

print("Schema will be inferred automatically")
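
Inference can be combined with known types through the `cloudFiles.schemaHints` option, which takes a comma-separated list of `column TYPE` pairs. The following is a small sketch of building that string. The helper name `build_schema_hints` and the chosen columns are assumptions for illustration. The Spark usage is left as a comment because it needs a Databricks session.

```python
from typing import Dict


def build_schema_hints(hints: Dict[str, str]) -> str:
    """Render {column: SQL type} as a comma-separated hint string."""
    return ", ".join(f"{column} {sql_type}" for column, sql_type in hints.items())


schema_hints = build_schema_hints({"id": "INT", "value": "DOUBLE"})
print(schema_hints)  # id INT, value DOUBLE

# Usage on Databricks (not executed here):
# hinted = spark.readStream \
#     .format("cloudFiles") \
#     .option("cloudFiles.format", "json") \
#     .option("cloudFiles.schemaLocation", "/checkpoint/infer") \
#     .option("cloudFiles.schemaHints", schema_hints) \
#     .load("/source/data/*")
```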

### Explicit Schema

In [None]:
# Define explicit schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("value", DoubleType(), True)
])

explicit_schema = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .schema(schema) \
    .load("/source/data/*")

print("Explicit schema applied")

### Schema Evolution

In [None]:
# Enable schema evolution
evolving_schema = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/evolve") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("/source/data/*")

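# Schema evolution modes:
# - addNewColumns: stream stops on new columns and adds them to the schema;
#   restart the stream to continue (default when no schema is provided)
# - rescue: schema never evolves; unknown columns go to _rescued_data
# - failOnNewColumns: stream fails until the schema is updated or the file is removed
# - none: schema never evolves and new columns are ignored (default when a schema is provided)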

print("Schema evolution enabled")
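
To make the difference between the modes concrete, here is a toy model in plain Python of what happens to one record that carries an unexpected column. It is a simplified sketch, not Auto Loader's code: `apply_mode`, `SchemaChangeError`, and the sample record are hypothetical, and data type handling is left out.

```python
import json
from typing import Dict, List, Optional

MODES = ("addNewColumns", "rescue", "failOnNewColumns", "none")


class SchemaChangeError(Exception):
    """Raised by the toy model where a real stream would stop."""

    def __init__(self, columns_for_restart: List[str]) -> None:
        super().__init__("new columns detected; stream stops")
        self.columns_for_restart = columns_for_restart


def apply_mode(mode: str, known_columns: List[str], record: Dict[str, object]) -> Dict[str, object]:
    """Return the row produced for one record, or raise where the stream would stop."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    new_columns = [name for name in record if name not in known_columns]
    row = {name: record.get(name) for name in known_columns}
    if mode == "rescue":
        # Schema stays fixed; unexpected fields are kept as a JSON string.
        rescued = {name: record[name] for name in new_columns}
        row["_rescued_data"] = json.dumps(rescued) if rescued else None
        return row
    if mode == "none" or not new_columns:
        # Unexpected fields are silently dropped.
        return row
    if mode == "addNewColumns":
        # Stream stops, but the stored schema is widened for the restart.
        raise SchemaChangeError(list(known_columns) + new_columns)
    # failOnNewColumns: stream stops and the schema is left unchanged.
    raise SchemaChangeError(list(known_columns))


known_columns = ["id", "name"]
record = {"id": 1, "name": "a", "country": "DE"}

rescued_row = apply_mode("rescue", known_columns, record)
ignored_row = apply_mode("none", known_columns, record)
restart_columns: Optional[List[str]] = None
try:
    apply_mode("addNewColumns", known_columns, record)
except SchemaChangeError as err:
    restart_columns = err.columns_for_restart

print(rescued_row)      # {'id': 1, 'name': 'a', '_rescued_data': '{"country": "DE"}'}
print(ignored_row)      # {'id': 1, 'name': 'a'}
print(restart_columns)  # ['id', 'name', 'country']
```

The practical consequence is that `addNewColumns` and `failOnNewColumns` need a restart strategy, while `rescue` and `none` keep running at the cost of a fixed schema.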

## 5. File Metadata

In [None]:
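# Select file metadata from the hidden _metadata column
# (cloudFiles.includeExistingFiles defaults to true and is set here only to be explicit)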
with_metadata = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/metadata") \
    .option("cloudFiles.includeExistingFiles", "true") \
    .load("/source/data/*") \
    .select(
        "*",
        "_metadata.file_path",
        "_metadata.file_name",
        "_metadata.file_size",
        "_metadata.file_modification_time"
    )

print("File metadata columns included")

## 6. Error Handling and Rescue Data

### Rescued Data Column

In [None]:
# Rescue malformed or unexpected data
# (reuses `schema` from the Explicit Schema cell above)
with_rescue = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/rescue") \
    .option("cloudFiles.schemaEvolutionMode", "rescue") \
    .option("rescuedDataColumn", "_rescued_data") \
    .schema(schema) \
    .load("/source/data/*")

# Unknown columns will be stored in _rescued_data
print("Rescue data column enabled")
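
Downstream code usually needs to unpack the rescued column. The rescued value is a JSON string, and the sketch below assumes it also carries the source file path under a `_file_path` key. Treat that key name, the helper `split_rescued`, and the sample value as assumptions to verify against your runtime.

```python
import json
from typing import Dict, Optional, Tuple


def split_rescued(rescued_json: Optional[str],
                  path_key: str = "_file_path") -> Tuple[Dict[str, object], Optional[str]]:
    """Split a rescued-data JSON string into (rescued fields, source file path)."""
    if not rescued_json:
        # Rows with nothing rescued carry a null in the column.
        return {}, None
    payload = json.loads(rescued_json)
    source_file = payload.pop(path_key, None)
    return payload, source_file


# Hypothetical value of _rescued_data for one row.
sample = '{"country": "DE", "_file_path": "/source/data/part-0001.json"}'
fields, source_file = split_rescued(sample)

print(fields)       # {'country': 'DE'}
print(source_file)  # /source/data/part-0001.json
```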

### Bad Records Path

In [None]:
# Write bad records to separate location
with_bad_records = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/badrecords") \
    .option("badRecordsPath", "/bad_records") \
    .load("/source/data/*")

print("Bad records path configured")

## 7. Performance Optimization

### File Notification Mode

In [None]:
# Use file notification for better scalability
# (Requires setup of cloud notifications)

file_notification = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/notify") \
    .option("cloudFiles.useNotifications", "true") \
    .load("/source/data/*")

# File detection modes:
# - Directory listing (default): lists the input directory to find new files
# - File notification: subscribes to cloud storage events (more scalable for large directories)

print("File notification mode configured")

### Max Files Per Trigger

In [None]:
# Control processing rate
rate_limited = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/rate") \
    .option("cloudFiles.maxFilesPerTrigger", "100") \
    .load("/source/data/*")

print("Rate limiting configured: 100 files per trigger")
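
The effect of a per-trigger file limit on a backlog is simple arithmetic. Here is a minimal sketch with a hypothetical `plan_micro_batches` helper and made-up file names. It treats the limit as an upper bound per micro-batch and ignores byte-based limits, so real batch sizes may differ.

```python
from typing import List, Sequence


def plan_micro_batches(files: Sequence[str], max_files_per_trigger: int) -> List[List[str]]:
    """Split a backlog into consecutive batches of at most max_files_per_trigger files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    return [
        list(files[start:start + max_files_per_trigger])
        for start in range(0, len(files), max_files_per_trigger)
    ]


backlog = [f"/source/data/file_{i:04d}.json" for i in range(250)]
batches = plan_micro_batches(backlog, 100)

print(len(batches))               # 3
print([len(b) for b in batches])  # [100, 100, 50]
```

With a 30-second processing-time trigger, a backlog like this would take roughly three trigger intervals to drain, assuming each batch finishes within its interval.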

## 8. Complete Auto Loader Pipeline Example

In [None]:
# Complete example: JSON to Delta with transformations

# Read with Auto Loader
raw_stream = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/checkpoint/bronze") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .option("cloudFiles.schemaEvolutionMode", "rescue") \
    .load("/source/events/*")

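# Add metadata and transformations
# (_metadata.file_path replaces input_file_name(), which is deprecated on recent Databricks runtimes)
processed_stream = raw_stream \
    .withColumn("source_file", col("_metadata.file_path")) \
    .withColumn("ingestion_timestamp", current_timestamp()) \
    .withColumn("ingestion_date", current_date())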

# Write to Bronze layer
bronze_query = processed_stream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoint/bronze") \
    .partitionBy("ingestion_date") \
    .trigger(processingTime="30 seconds") \
    .start("/delta/bronze/events")

print("Complete Auto Loader pipeline configured")
print(f"Query ID: {bronze_query.id}")

## 9. Monitoring Auto Loader

In [None]:
# Monitor stream status
# print(f"Is active: {bronze_query.isActive}")
# print(f"Status: {bronze_query.status}")

# Get recent progress
# recent = bronze_query.recentProgress
# print(f"Recent batches: {len(recent)}")

# Get last progress
# last = bronze_query.lastProgress
# if last:
#     print(f"Input rows: {last['numInputRows']}")
#     print(f"Processed rows per second: {last['processedRowsPerSecond']}")

print("Monitoring methods available (commented out)")
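
A progress record can be condensed into the few numbers worth alerting on. The sketch below works on a plain dictionary shaped like `lastProgress`. The helper `summarize_progress` and the sample values are hypothetical, and the backlog metric names `numFilesOutstanding` and `numBytesOutstanding` are assumed to appear as strings under the first source's `metrics`, so confirm them against a real progress record.

```python
from typing import Any, Dict


def summarize_progress(progress: Dict[str, Any]) -> Dict[str, Any]:
    """Extract throughput and backlog figures from one progress record."""
    sources = progress.get("sources") or [{}]
    metrics = sources[0].get("metrics") or {}
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "rows_per_second": progress.get("processedRowsPerSecond", 0.0),
        # Backlog metrics are converted from strings; missing metrics count as 0.
        "files_outstanding": int(metrics.get("numFilesOutstanding", 0)),
        "bytes_outstanding": int(metrics.get("numBytesOutstanding", 0)),
    }


# Made-up record for illustration only.
sample_progress = {
    "batchId": 12,
    "numInputRows": 5000,
    "processedRowsPerSecond": 250.0,
    "sources": [
        {
            "description": "CloudFilesSource[/source/events]",
            "metrics": {"numFilesOutstanding": "238", "numBytesOutstanding": "1048576"},
        }
    ],
}
summary = summarize_progress(sample_progress)
print(summary)

# On a running stream (not executed here):
# if bronze_query.lastProgress:
#     print(summarize_progress(bronze_query.lastProgress))
```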

## 10. Best Practices

### Best Practice Guidelines

1. **Schema Location**: Always specify `cloudFiles.schemaLocation`
   ```python
   .option("cloudFiles.schemaLocation", "/checkpoint/path")
   ```

2. **Schema Evolution**: Choose a mode explicitly; `rescue` keeps the schema fixed and captures unexpected fields in `_rescued_data`
   ```python
   .option("cloudFiles.schemaEvolutionMode", "rescue")
   ```

3. **File Notification**: Use for large-scale ingestion
   ```python
   .option("cloudFiles.useNotifications", "true")
   ```

4. **Rate Limiting**: Control processing rate
   ```python
   .option("cloudFiles.maxFilesPerTrigger", "1000")
   ```

5. **Checkpoint Management**: Use separate checkpoints per stream

6. **Partitioning**: Partition by date for better organization
   ```python
   .partitionBy("date")
   ```

7. **Metadata**: Include file metadata for traceability
   ```python
   .withColumn("source_file", col("_metadata.file_path"))
   ```

8. **Error Handling**: Use rescue columns and bad records path
   ```python
   .option("badRecordsPath", "/bad_records")
   ```
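
The checkpoint guideline is easy to violate once a project has many streams. One way to guard against it is a small configuration check, sketched here in plain Python. The helper `find_shared_checkpoints` and the stream names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_checkpoints(stream_checkpoints: Dict[str, str]) -> Dict[str, List[str]]:
    """Return checkpoint paths that are used by more than one stream."""
    by_path: Dict[str, List[str]] = defaultdict(list)
    for stream_name, path in stream_checkpoints.items():
        # Normalize trailing slashes so equivalent paths compare equal.
        by_path[path.rstrip("/")].append(stream_name)
    return {path: sorted(names) for path, names in by_path.items() if len(names) > 1}


streams = {
    "bronze_events": "/checkpoint/bronze",
    "bronze_orders": "/checkpoint/bronze/",  # same location, trailing slash
    "silver_events": "/checkpoint/silver",
}
conflicts = find_shared_checkpoints(streams)
print(conflicts)  # {'/checkpoint/bronze': ['bronze_events', 'bronze_orders']}
```

Running a check like this before starting streams catches accidental reuse early, before two queries corrupt each other's state.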

## Practice Exercises

### Exercise 1: CSV Ingestion
Create an Auto Loader pipeline for CSV files with schema evolution.

In [None]:
# Your solution here
# TODO: Configure Auto Loader for CSV with schema evolution

### Exercise 2: Bronze-to-Silver Pipeline
Read from a bronze Delta table and write to silver with data quality checks.

In [None]:
# Your solution here
# TODO: Create bronze-to-silver pipeline with quality checks

## Summary

In this notebook, you learned:

- ✅ Auto Loader architecture and benefits
- ✅ Configuration for different file formats
- ✅ Schema inference and evolution
- ✅ File metadata tracking
- ✅ Error handling with rescue columns
- ✅ Performance optimization techniques
- ✅ Complete pipeline examples
- ✅ Monitoring and best practices

## Next Steps

1. Implement Auto Loader in your projects
2. Set up file notifications for scale
3. Build medallion architecture pipelines
4. Learn about Delta Live Tables

## Additional Resources

- [Auto Loader Documentation](https://docs.databricks.com/ingestion/auto-loader/index.html)
- [Schema Evolution](https://docs.databricks.com/ingestion/auto-loader/schema.html)
- [Spark By Examples - Auto Loader](https://sparkbyexamples.com/pyspark/pyspark-autoloader/)