
## Delta Lake & Data Ingestion

- Getting data into Databricks is commonly called data ingestion.
- Data engineers or data warehouse managers are primarily responsible for data ingestion.

### Delta Lake

![delta lake](./images/delta-lake.png)

- The goal of data ingestion is to bring files and data from external sources, such as cloud storage and SQL tables, into Delta Lake as Delta tables.
- Delta Lake is an open-source protocol that Databricks uses for the data layer.


### Delta Table


- A Delta table stores its data in a directory. Within that directory, the data is stored as **Parquet files**.
- Delta adds a transaction log (the Delta log), stored as JSON files alongside the Parquet files.
- The Delta log keeps track of every transaction on the data and every table version.
- Table state is maintained through the transaction log: whenever data is inserted, deleted, or updated, Delta records a new transaction (a log file), so the table always reflects its latest committed state.

- The transaction log provides:

  - **ACID transactions** (atomicity, consistency, isolation, durability) for concurrent reads/writes.
  - **Table versioning** enabling **time travel** (querying historical data).
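
To make the role of the transaction log concrete, here is a minimal pure-Python sketch of how replaying a log can reconstruct table state and support time travel. It is a toy model only: the `commits` list, the file names, and the `files_at_version` helper are hypothetical, and real Delta commits carry more action types and metadata than the simplified `add` and `remove` entries used here.

```python
import json

# Toy model of a Delta-style transaction log: each commit is a list of JSON
# action lines. File names are made up for illustration.
commits = [
    # version 0: the initial insert writes one Parquet file
    ['{"add": {"path": "part-000.parquet"}}'],
    # version 1: a second insert adds another file
    ['{"add": {"path": "part-001.parquet"}}'],
    # version 2: an update rewrites part-000 as part-002
    ['{"remove": {"path": "part-000.parquet"}}',
     '{"add": {"path": "part-002.parquet"}}'],
]


def files_at_version(commits, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for commit in commits[: version + 1]:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


latest = files_at_version(commits, len(commits) - 1)
as_of_v1 = files_at_version(commits, 1)  # "time travel" to version 1
print(sorted(latest))
print(sorted(as_of_v1))
```

Because old data files are only marked as removed in the log, an earlier version can be rebuilt by replaying fewer commits, which is the idea behind querying historical data.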
  

### Key Features of Delta Lake

* ACID transaction support for safe concurrent operations.
* DML operations (INSERT, UPDATE, DELETE, MERGE).
* Time travel to query or restore previous versions.
* Schema enforcement and evolution.
* Unified batch and streaming support.
* Optimizations and scalability.



In [0]:
USE CATALOG workspace;
USE SCHEMA `2235-wk3`;

In [0]:
SELECT current_catalog(), current_schema();


**Common Data Importing Methods for Data Analysts:**

- File Upload UI

- CTAS (Create Table As Select)

- COPY INTO 

- FROM read_files()



### Data Ingestion with CTAS and read_files() - BATCH Ingestion

- `CREATE TABLE AS SELECT` (CTAS) creates and populates a table from the results of a query.
- The `read_files()` table-valued function reads data in various file formats and provides additional options for data ingestion.

**Documentation:**

- [read_files](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files)


**Note:**

- A `_rescued_data` column is automatically added to capture any data that does not match the inferred schema.
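
The idea behind a rescued-data column can be sketched in plain Python. This is a simplified illustration, not how `read_files()` is implemented: the `SCHEMA` dictionary, the `parse_row` helper, and the sample rows are all hypothetical, and the actual contents of `_rescued_data` in Databricks may differ from the JSON string produced here.

```python
import json

# Hypothetical schema standing in for what schema inference might produce.
SCHEMA = {"order_id": int, "quantity": int, "price": float}


def parse_row(raw, schema=SCHEMA):
    """Cast each field to its schema type; park anything that does not fit."""
    parsed, rescued = {}, {}
    for name, caster in schema.items():
        value = raw.get(name)
        try:
            parsed[name] = caster(value) if value is not None else None
        except (TypeError, ValueError):
            parsed[name] = None    # the typed column becomes NULL
            rescued[name] = value  # the raw value is preserved
    # Columns that are not in the schema at all are rescued as well.
    for name, value in raw.items():
        if name not in schema:
            rescued[name] = value
    parsed["_rescued_data"] = json.dumps(rescued, sort_keys=True) if rescued else None
    return parsed


good = parse_row({"order_id": "1", "quantity": "2", "price": "9.99"})
bad = parse_row({"order_id": "2", "quantity": "two", "price": "5.00", "coupon": "X1"})
print(good)
print(bad)
```

In this sketch a clean row leaves `_rescued_data` empty, while a row with an unparseable value or an unexpected column keeps that data instead of silently dropping it.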



In [0]:
%sh 

ls /Volumes/workspace/2235-wk3/orders


In [0]:
SELECT * FROM csv.`/Volumes/workspace/2235-wk3/orders`

In [0]:
SELECT * FROM 

read_files(

  '/Volumes/workspace/2235-wk3/orders',
  format => 'csv',
  inferSchema => 'true',
  header => 'true',
  escape => '"'
) LIMIT  10

In [0]:
CREATE TABLE orders_bronze
USING DELTA -- optional
AS
SELECT * FROM 
read_files(

  '/Volumes/workspace/2235-wk3/orders',
  format => 'csv',
  inferSchema => 'true',
  header => 'true',
  escape => '"'
);

-- preview the table

SELECT * FROM orders_bronze;

In [0]:
DESCRIBE TABLE orders_bronze;

In [0]:
DESCRIBE TABLE EXTENDED orders_bronze;


#### Tables

- Managed table: Unity Catalog (UC) manages everything, including the underlying cloud storage.
  - Dropping the table discards the metadata and deletes the associated data.
  - The format is Delta.
  - Comes with new features, better performance, simplicity, and stricter access control.
- External table: the data lives in an external cloud location.
  - Dropping the table discards the metadata but does not delete the data.
  - The path is specified by the `LOCATION` keyword.
  - Storage is managed manually by the user.
  - The format can be Delta, CSV, JSON, Avro, Parquet, etc.
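
The difference in drop behavior can be illustrated with a small local simulation. This is only an analogy under stated assumptions: the `catalog` dictionary, the `drop_table` helper, and the table names are hypothetical, and temporary directories stand in for cloud storage.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(catalog, name):
    """Remove the catalog entry; delete the files only for managed tables."""
    entry = catalog.pop(name)
    if entry["type"] == "managed":
        shutil.rmtree(entry["path"])
    return entry


# Scratch directory standing in for cloud storage.
root = Path(tempfile.mkdtemp())
catalog = {}
for name, kind in [("orders_managed", "managed"), ("orders_external", "external")]:
    path = root / name
    path.mkdir()
    (path / "part-000.parquet").write_text("placeholder")
    catalog[name] = {"type": kind, "path": path}

drop_table(catalog, "orders_managed")
drop_table(catalog, "orders_external")

managed_exists = (root / "orders_managed").exists()    # data removed with the table
external_exists = (root / "orders_external").exists()  # data left in place
print(managed_exists, external_exists)

shutil.rmtree(root)  # clean up the scratch directory
```

After both drops the catalog is empty in either case; only the external table's files are still there, which is why dropping an external table is not enough to remove its data.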

In [0]:
CREATE TABLE IF NOT EXISTS customers (
  customer_id VARCHAR(10) PRIMARY KEY,
  first_name VARCHAR(50),
  last_name VARCHAR(50),
  email VARCHAR(100),
  signup_date DATE,
  country VARCHAR(50),
  is_active BOOLEAN
)
LOCATION 's3://db-external-storage-orders-2235/output/';

-- Insert customer records
INSERT INTO customers (customer_id, first_name, last_name, email, signup_date, country, is_active) VALUES
('CU-7D31', 'Alice', 'Johnson', 'alice.johnson@email.com', '2022-01-15', 'USA', TRUE),
('CU-9A52', 'Bob', 'Smith', 'bob.smith@email.com', '2023-03-22', 'USA', TRUE),
('CU-2L68', 'Carol', 'Davis', 'carol.davis@email.com', '2021-07-30', 'Canada', TRUE),
('CU-4E93', 'David', 'Lee', 'david.lee@email.com', '2020-11-05', 'UK', FALSE),
('CU-1B75', 'Emma', 'Wilson', 'emma.wilson@email.com', '2019-09-14', 'USA', TRUE),
('CU-8F42', 'Frank', 'Taylor', 'frank.taylor@email.com', '2023-01-20', 'Canada', TRUE),
('CU-6K17', 'Grace', 'Martinez', 'grace.martinez@email.com', '2022-05-09', 'UK', TRUE),
('CU-3R84', 'Henry', 'Anderson', 'henry.anderson@email.com', '2021-12-11', 'USA', TRUE),
('CU-5N26', 'Irene', 'Thomas', 'irene.thomas@email.com', '2020-08-23', 'Canada', FALSE),
('CU-9L73', 'Jack', 'Moore', 'jack.moore@email.com', '2023-06-04', 'USA', TRUE),
('CU-2M35', 'Karen', 'Jackson', 'karen.jackson@email.com', '2021-04-16', 'UK', TRUE),
('CU-4T89', 'Leo', 'White', 'leo.white@email.com', '2022-10-30', 'USA', TRUE);




In [0]:
DROP TABLE IF EXISTS customers;

In [0]:
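%python

# Read the CSV files from the volume, using the same options as the SQL examples
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .option("escape", '"')
    .load("/Volumes/workspace/2235-wk3/orders")
)

# Save the DataFrame as a table, replacing it if it already exists
df.write.mode("overwrite").saveAsTable("orders_bronze_py")

orders_bronze = spark.table("orders_bronze_py")
orders_bronze.display()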



### Data Ingestion using COPY INTO - Incremental data ingestion

- `COPY INTO` loads data from a file location into a Delta table.
- The operation is retriable and idempotent.
- New files in the source location are loaded, and files that have already been loaded are skipped.
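
The idempotent behavior described above can be sketched in plain Python. This is a toy model, not the actual `COPY INTO` mechanism: the `ToyTable` class, its `copy_into` method, and the file names are hypothetical and simply assume that the target remembers which source files it has already loaded.

```python
class ToyTable:
    """Toy target table that remembers which source files it has loaded."""

    def __init__(self):
        self.rows = []
        self.loaded_files = set()

    def copy_into(self, source):
        """Load rows from files not seen before; return how many files were loaded."""
        loaded = 0
        for name in sorted(source):
            if name in self.loaded_files:
                continue  # already ingested, skip it
            self.rows.extend(source[name])
            self.loaded_files.add(name)
            loaded += 1
        return loaded


# Hypothetical source location: file name -> rows.
source = {"orders_01.csv": [{"id": 1}, {"id": 2}]}
table = ToyTable()

first = table.copy_into(source)   # loads orders_01.csv
second = table.copy_into(source)  # nothing new, so no duplicates are added
source["orders_02.csv"] = [{"id": 3}]
third = table.copy_into(source)   # loads only orders_02.csv

print(first, second, third, len(table.rows))
```

Running the same load twice leaves the row count unchanged, which is what makes it safe to retry a failed or repeated ingestion job.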

**Documentation**

[COPY INTO](https://docs.databricks.com/aws/en/sql/language-manual/delta-copy-into)


- The `mergeSchema` copy option enables schema evolution.
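
As a rough illustration of what schema evolution means, the sketch below appends rows that carry a column the table has not seen yet. It is a conceptual model only: the `append_with_merge` helper and the sample columns are hypothetical, and the real `mergeSchema` option also deals with data types and other details not shown here.

```python
def append_with_merge(columns, rows, new_rows):
    """Append new_rows, widening the column list with any unseen columns."""
    merged = list(columns)
    for row in new_rows:
        for name in row:
            if name not in merged:
                merged.append(name)
    # Pad every row so that columns it lacks read as None (NULL).
    all_rows = [{c: r.get(c) for c in merged} for r in rows + new_rows]
    return merged, all_rows


columns = ["id", "amount"]
rows = [{"id": 1, "amount": 10.0}]
new_rows = [{"id": 2, "amount": 5.0, "channel": "web"}]

merged_columns, merged_rows = append_with_merge(columns, rows, new_rows)
print(merged_columns)
print(merged_rows)
```

Existing rows get a null value for the newly added column, so older data stays readable after the schema widens.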



In [0]:
DROP TABLE IF EXISTS orders_bronze_ci;
CREATE TABLE orders_bronze_ci;

COPY INTO orders_bronze_ci
FROM '/Volumes/workspace/2235-wk3/orders'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'escape' = '"')
COPY_OPTIONS ('mergeSchema' = 'true');


SELECT * FROM orders_bronze_ci;



In [0]:

COPY INTO orders_bronze_ci
FROM '/Volumes/workspace/2235-wk3/orders'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'escape' = '"')
COPY_OPTIONS ('mergeSchema' = 'true');


-- SELECT * FROM orders_bronze_ci;

| Feature/Method                | File Upload UI                                | CTAS (Create Table As Select)                            | COPY INTO                                      | Auto Loader                                                  |
| ----------------------------- | --------------------------------------------- | -------------------------------------------------------- | ---------------------------------------------- | ------------------------------------------------------------ |
| **Purpose**                   | Upload files manually via Databricks UI       | Create a new table by selecting data from a query        | Load data from external files into a table     | Automatically ingest new files from a cloud storage location |
| **Data Source**               | Local files from user’s machine               | Query result or existing tables                          | External files (e.g., CSV, JSON, Parquet)      | Files landing in cloud storage (e.g., Azure Blob, S3)        |
| **Automation**                | Manual via UI                                 | Manual or scripted                                       | Manual or scripted                             | Fully automated continuous ingestion                         |
| **Use Case**                  | Ad hoc, small-scale uploads                   | Quick table creation from existing data or query results | Bulk loading or incremental loading from files | Large scale, incremental ingestion of new files              |
| **Supports Schema Evolution** | No                                            | No (schema fixed at creation)                            | Yes (with schema update options)               | Yes (infers schema, can evolve schema)                       |
| **Performance**               | Limited by manual upload size and user action | Efficient for creating tables from existing data         | Optimized for bulk file loading                | Optimized for streaming/append scenarios                     |
| **Data Format Support**       | Any file supported by Databricks UI upload    | Any data readable by SQL (tables, views, etc.)           | Common file formats: CSV, JSON, Parquet, etc.  | Common cloud storage file formats (CSV, JSON, Parquet, Avro) |
| **Incremental Loading**       | No                                            | No                                                       | Yes, supports loading new files incrementally  | Yes, continuous detection of new files                       |
