
<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://raw.githubusercontent.com/derar-alhussein/Databricks-Certified-Data-Engineer-Associate/main/Includes/images/bookstore_schema.png" alt="Bookstore schema" style="width: 600px">
</div>

In [0]:
%run ../Includes/Copy-Datasets

In [0]:
--Use a CTAS (CREATE TABLE AS SELECT) statement to create the orders Delta table
CREATE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`

In [0]:
SELECT * FROM orders

## 🔁 Overwriting Tables

Two statements completely overwrite the data in a Delta table:

- **CREATE OR REPLACE TABLE**  
  Replaces the table's definition and contents in full, so the new data may have a different schema.  
  The table is created if it does not already exist.

- **INSERT OVERWRITE**  
  Replaces the contents of an existing table; it cannot create a new one.  
  The incoming data must match the current table schema, otherwise the statement fails. This makes it the safer choice when the schema must not change.

Both statements write a new version to the transaction log instead of discarding the table's history, so the previous data stays available through Delta Lake's Time Travel feature and an overwrite can be rolled back.

In [0]:
CREATE OR REPLACE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`

In [0]:
--Version 0 is a CREATE TABLE AS SELECT operation
--Version 1 is a CREATE OR REPLACE operation
DESCRIBE HISTORY orders

In [0]:
INSERT OVERWRITE orders
SELECT * FROM parquet.`${dataset.bookstore}/orders`

In [0]:
--Version 0 is a CREATE TABLE AS SELECT operation
--Version 1 is a CREATE OR REPLACE operation
--Version 2 is a WRITE operation
DESCRIBE HISTORY orders
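
With several versions now recorded in the history, the point about rollbacks can be made concrete. The following is a minimal sketch, assuming the version numbers match the comments above. `VERSION AS OF` and `RESTORE TABLE` are Delta Lake SQL; the choice of version 1 is only illustrative.

```sql
--Query the table as it was at version 1 (before the INSERT OVERWRITE)
SELECT count(*) FROM orders VERSION AS OF 1;

--Roll the table back to version 1; the rollback is itself recorded as a
--new version with operation RESTORE, so no history is lost
RESTORE TABLE orders TO VERSION AS OF 1;
```

In this notebook every version was loaded from the same source files, so the restore should not change query results; it only adds one more entry to `DESCRIBE HISTORY orders`.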

In [0]:
--This statement fails with a schema mismatch error: the query returns one more column than the table has
INSERT OVERWRITE orders
SELECT *, current_timestamp() FROM parquet.`${dataset.bookstore}/orders`
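
If the extra column is actually wanted, the table schema has to change, and `CREATE OR REPLACE TABLE` is the statement that allows that. A minimal sketch follows. The table name `orders_with_ingest_time` and the column alias `ingested_at` are hypothetical, chosen so that the `orders` table used in the rest of the notebook keeps its schema.

```sql
--Write the same query to a separate table; CREATE OR REPLACE TABLE takes
--its schema from the SELECT, so the additional column is accepted
CREATE OR REPLACE TABLE orders_with_ingest_time AS
SELECT *, current_timestamp() AS ingested_at
FROM parquet.`${dataset.bookstore}/orders`;

--Confirm that the new column is part of the table schema
DESCRIBE TABLE orders_with_ingest_time;
```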

## ➕ Appending Data

To add new records to an existing Delta table, data can be appended using standard SQL `INSERT INTO` statements.

- This method adds rows to the table without modifying the data already in it.
- It is simple to use but does **not perform deduplication**: running the same `INSERT INTO` twice writes the same records twice.

Appending is commonly used for incremental ingestion, where each batch contains only new records.

In [0]:
INSERT INTO orders
SELECT * FROM parquet.`${dataset.bookstore}/orders-new`

In [0]:
SELECT count(*) FROM orders
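
Running the `INSERT INTO` above a second time would add the same rows again. One way to make the load safe to rerun is an insert-only `MERGE INTO`, the statement covered in the next section. The sketch below assumes that `order_id` uniquely identifies an order, which the notebook does not state explicitly. The view name `orders_updates` is hypothetical.

```sql
--Expose the new files as a view so they can be used as the merge source
CREATE OR REPLACE TEMP VIEW orders_updates AS
SELECT * FROM parquet.`${dataset.bookstore}/orders-new`;

--Insert only the orders whose order_id is not already in the table
MERGE INTO orders o
USING orders_updates u
ON o.order_id = u.order_id
WHEN NOT MATCHED THEN INSERT *;

--The rows were already appended above, so the count should not change
SELECT count(*) FROM orders;
```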

## 🔀 Merging Data (Upserts)

Delta Lake supports the **MERGE INTO** statement, which combines inserts, updates, and deletes in a single atomic transaction.

- It is the standard way to **upsert** data: rows that satisfy the matching condition are updated, and rows with no match are inserted.
- Because inserts happen only for unmatched rows, it avoids the duplicates that repeated `INSERT INTO` runs would produce, provided the matching condition identifies each record uniquely.
- Typical uses include syncing data between systems and applying incremental updates.
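
To show the three kinds of action side by side, here is a small self-contained sketch. The tables `demo_target` and `demo_source`, their columns, and the `is_deleted` flag are all hypothetical and not part of the bookstore dataset. It assumes an environment such as Databricks, where `CREATE TABLE` produces a Delta table by default.

```sql
--Hypothetical demo tables, not part of the bookstore dataset
CREATE OR REPLACE TABLE demo_target (id INT, name STRING);
CREATE OR REPLACE TABLE demo_source (id INT, name STRING, is_deleted BOOLEAN);

INSERT INTO demo_target VALUES (1, 'old name'), (2, 'to delete');

--id 1 exists in the target (update), id 2 exists and is flagged (delete),
--id 3 does not exist in the target (insert)
INSERT INTO demo_source VALUES
  (1, 'new name', false),
  (2, 'to delete', true),
  (3, 'brand new', false);

--Clauses are evaluated in order; the first one whose condition holds is applied
MERGE INTO demo_target t
USING demo_source s
ON t.id = s.id
WHEN MATCHED AND s.is_deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);

SELECT * FROM demo_target ORDER BY id;
```

After the merge, the final query should return two rows: id 1 with the updated name and the newly inserted id 3. All three changes are committed together as one table version.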

In [0]:
--Create a temporary view of the new customer data to MERGE INTO the customers table
CREATE OR REPLACE TEMP VIEW customers_updates AS 
SELECT * FROM json.`${dataset.bookstore}/customers-json-new`;

MERGE INTO customers c
USING customers_updates u
ON c.customer_id = u.customer_id
WHEN MATCHED AND c.email IS NULL AND u.email IS NOT NULL THEN
  UPDATE SET email = u.email, updated = u.updated
WHEN NOT MATCHED THEN INSERT *

In [0]:
--Second example: create a temporary view over the new books CSV files (semicolon-delimited, with a header row)
CREATE OR REPLACE TEMP VIEW books_updates
   (book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
  path = "${dataset.bookstore}/books-csv-new",
  header = "true",
  delimiter = ";"
);

SELECT * FROM books_updates

In [0]:
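--Insert only the new Computer Science books; rows that already exist (same book_id and title) are left unchanged
MERGE INTO books b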
USING books_updates u
ON b.book_id = u.book_id AND b.title = u.title
WHEN NOT MATCHED AND u.category = 'Computer Science' THEN 
  INSERT *