# 🔄 Incremental Data Ingestion in Databricks

Incremental data ingestion enables efficient loading of only **new or updated files**, avoiding reprocessing of previously handled data. This is essential for scalable and cost-effective data pipelines.

---

## 📥 COPY INTO Command

**Definition:**  
The `COPY INTO` command is a SQL-based method used to load data from a specified file location into a Delta table.

### 🔹 Key Characteristics:
- **Idempotent** – Re-running the same command skips files that have already been loaded.
- **Incremental** – Loads only new files, which suits datasets that grow over time.
- **Flexible** – Supports multiple file formats (CSV, JSON, Parquet, and others) and schema evolution options.
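
To make the idempotent and incremental behaviour concrete, here is a toy model in plain Python. It is only an illustration of the idea, not how Databricks implements it: the `history` set is a hypothetical stand-in for the load history kept for the target table, and the file names are made up.

```python
def load_new_files(available_files, already_loaded):
    """Return the files that still need loading and record them as loaded.

    `already_loaded` is a plain set standing in for the load history
    (an assumption made for this sketch).
    """
    new_files = sorted(f for f in available_files if f not in already_loaded)
    already_loaded.update(new_files)
    return new_files


history = set()
first_run = load_new_files(["a.csv", "b.csv"], history)
second_run = load_new_files(["a.csv", "b.csv"], history)           # same files again
third_run = load_new_files(["a.csv", "b.csv", "c.csv"], history)   # one new file arrived

print(first_run)   # ['a.csv', 'b.csv']
print(second_run)  # []
print(third_run)   # ['c.csv']
```

The second run returns nothing because every file is already recorded, which is the property that makes re-running the load safe.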

### 🔹 Requirements:
- Target Delta table
- Source file path
- File format
- Optional settings, passed through `FORMAT_OPTIONS` and `COPY_OPTIONS` (e.g., header handling, schema evolution)

This method is simple and efficient for small to medium-scale file ingestion workloads.
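
The requirements above map directly onto the clauses of a `COPY INTO` statement. The sketch below assembles such a statement as a string in Python. The helper `build_copy_into`, the table name `sales_bronze`, and the path `/mnt/raw/sales/` are hypothetical names chosen for illustration, and the helper does no escaping, so it assumes trusted inputs.

```python
def build_copy_into(table, source_path, file_format,
                    format_options=None, copy_options=None):
    """Assemble a COPY INTO statement as a SQL string (no escaping is done)."""

    def render(options):
        # Render {'header': 'true'} as ('header' = 'true')
        pairs = ", ".join(f"'{key}' = '{value}'" for key, value in options.items())
        return f"({pairs})"

    lines = [
        f"COPY INTO {table}",          # target Delta table
        f"FROM '{source_path}'",       # source file path
        f"FILEFORMAT = {file_format}", # file format
    ]
    if format_options:
        lines.append(f"FORMAT_OPTIONS {render(format_options)}")
    if copy_options:
        lines.append(f"COPY_OPTIONS {render(copy_options)}")
    return "\n".join(lines)


statement = build_copy_into(
    table="sales_bronze",                  # hypothetical table name
    source_path="/mnt/raw/sales/",         # hypothetical source path
    file_format="CSV",
    format_options={"header": "true"},     # first row holds column names
    copy_options={"mergeSchema": "true"},  # allow the table schema to evolve
)
print(statement)
```

In a Databricks notebook, where a `spark` session is already defined, the resulting string could be run with `spark.sql(statement)`. The same statement can also be typed directly into a SQL cell.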

---

## ⚙️ Auto Loader

**Definition:**  
Auto Loader uses **Structured Streaming** to continuously ingest new files from cloud storage. It is designed for **high-volume, scalable** data pipelines.

### 🔹 Benefits:
- Can handle **millions of files per hour**
- Tracks progress via **checkpointing** to ensure **exactly-once** file processing
- Supports **schema inference** and **schema evolution**

### 🔹 How It Works:
- Uses `readStream` with format `"cloudFiles"`
- Monitors directories for new files and automatically queues them for ingestion
- Works with multiple formats like CSV, JSON, and Parquet
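
The steps above can be sketched in PySpark as follows. This is a minimal outline, assuming it runs on Databricks, where the `cloudFiles` source is available and a `spark` session already exists. The function names, table name, and paths are hypothetical.

```python
def autoloader_options(file_format, schema_location):
    """Reader options for an Auto Loader stream."""
    return {
        "cloudFiles.format": file_format,              # e.g. "csv", "json", "parquet"
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is kept
    }


def start_ingestion(spark, source_path, target_table, checkpoint_path, options):
    """Start a stream that loads new files from `source_path` into `target_table`.

    `spark` is the active SparkSession (predefined in Databricks notebooks).
    """
    stream = (
        spark.readStream
        .format("cloudFiles")   # Auto Loader source
        .options(**options)
        .load(source_path)      # directory monitored for new files
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)  # tracks which files were ingested
        .toTable(target_table)  # starts the query and returns a StreamingQuery
    )


# Hypothetical schema location used only for illustration.
options = autoloader_options("json", "/mnt/checkpoints/orders/schema")
print(options)
```

Only `autoloader_options` runs outside Databricks. `start_ingestion` needs a live session, for example `start_ingestion(spark, "/mnt/raw/orders/", "orders_bronze", "/mnt/checkpoints/orders", options)`.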

### 🔹 Schema Management:
- Automatically infers the schema of incoming data
- Stores the inferred schema at a configured location (`cloudFiles.schemaLocation`), which speeds up later startups and keeps the schema consistent across runs
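
Schema behaviour is controlled through reader options. The sketch below collects the ones most relevant here into a dictionary that could be passed to `.options(**opts)` on a `cloudFiles` reader. The helper `schema_options` and the example path are hypothetical. The option names and evolution modes are given as I understand the Auto Loader documentation, so verify them against your runtime version.

```python
VALID_EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def schema_options(schema_location, evolution_mode="addNewColumns",
                   infer_column_types=False):
    """Build Auto Loader options that govern schema inference and evolution."""
    if evolution_mode not in VALID_EVOLUTION_MODES:
        raise ValueError(f"unknown schema evolution mode: {evolution_mode!r}")
    return {
        # Directory where the inferred schema is stored between runs
        "cloudFiles.schemaLocation": schema_location,
        # What to do when incoming files contain new columns
        "cloudFiles.schemaEvolutionMode": evolution_mode,
        # Whether to infer typed columns instead of reading everything as strings
        "cloudFiles.inferColumnTypes": str(infer_column_types).lower(),
    }


opts = schema_options("/mnt/checkpoints/orders/schema", evolution_mode="rescue")
print(opts)
```

Validating the mode up front is a design choice for this sketch: a typo fails immediately in Python rather than later, when the stream starts.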

---

## 🤔 Choosing Between COPY INTO vs. Auto Loader

| Method          | Best For                                       | Key Advantages                      |
|-----------------|------------------------------------------------|-------------------------------------|
| **COPY INTO**   | Small to medium workloads (thousands of files) | Simpler setup, SQL-based            |
| **Auto Loader** | Large-scale ingestion (millions of files)      | Scalable, streaming, fault-tolerant |

> Auto Loader is generally recommended for cloud-based ingestion at scale due to its performance and reliability.

---

## ✅ Practical Application

The session closes with hands-on practice, where you'll apply these techniques, especially **Auto Loader**, to set up reliable and scalable ingestion pipelines in Databricks.
