# Lecture 21. Writing to Tables (Hands On)

In this notebook, we are going to explore SQL syntax to insert and update records in Delta tables.

Remember that Delta Lake provides ACID-compliant updates to Delta tables.

For this demonstration, we will continue working with our bookstore dataset.

<div style="text-align: center;">
<img src="../../assets/images/Presentation-Images/bookstore_schema.png" style="width:640px" >
</div> 

Let us first run our helper notebook to copy the dataset.

In [0]:
%run ../Includes/Copy-Datasets


Then we use a CTAS (`CREATE TABLE ... AS SELECT`) statement to create the `orders` Delta table from Parquet files.

In [0]:
%sql
CREATE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`

num_affected_rows,num_inserted_rows


Let's query this table.

In [0]:
%sql
SELECT * FROM orders

order_id,order_timestamp,customer_id,quantity,total,books
3559,1657722056,C00001,2,48,"List(List(B09, 2, 48))"
4243,1658786901,C00002,2,55,"List(List(B07, 1, 33), List(B06, 1, 22))"
4321,1658934252,C00003,2,40,"List(List(B04, 2, 40))"
4392,1659034513,C00004,2,82,"List(List(B08, 2, 82))"
3673,1657934721,C00005,2,87,"List(List(B01, 1, 49), List(B11, 1, 38))"
4464,1659171834,C00006,2,77,"List(List(B01, 1, 49), List(B02, 1, 28))"
3495,1657541649,C00007,2,52,"List(List(B09, 1, 24), List(B02, 1, 28))"
4105,1658590490,C00008,2,69,"List(List(B08, 1, 41), List(B02, 1, 28))"
3825,1658166059,C00009,2,48,"List(List(B09, 2, 48))"
4062,1658519180,C00010,2,48,"List(List(B09, 2, 48))"


As you can see, Parquet files have a well-defined schema, so the data was extracted correctly.

## Overwriting Tables

When writing to tables, we may want to completely overwrite the data in the table.

In fact, there are multiple **benefits to overwriting tables** instead of deleting and recreating tables.

  * For example, the old version of the table still exists, so you can easily retrieve the old data using Time Travel.

  * In addition, overwriting a table is much faster because it does not need to list the directory recursively or delete any files.

  * In addition, it's an atomic operation.

Concurrent queries can still read the table while you are overwriting it.

And of course due to the ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state.

### CRAS statement

The first method to accomplish a complete overwrite is to use `CREATE OR REPLACE TABLE ... AS SELECT`, also known as a ***CRAS statement***.

`CREATE OR REPLACE TABLE` statements fully replace the content of a table each time they execute.

In [0]:
%sql
CREATE OR REPLACE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`

num_affected_rows,num_inserted_rows


Let us now check our table history.

In [0]:
%sql
DESCRIBE HISTORY orders

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
1,2024-10-14T04:45:41Z,2895352578531874,suryapulika38@gmail.com,CREATE OR REPLACE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(3802355948382795),1011-150700-u18wk0fi,0.0,WriteSerializable,False,"Map(numFiles -> 3, numOutputRows -> 2150, numOutputBytes -> 50780)",,Databricks-Runtime/13.3.x-scala2.12
0,2024-10-14T04:42:39Z,2895352578531874,suryapulika38@gmail.com,CREATE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(3802355948382795),1011-150700-u18wk0fi,,WriteSerializable,True,"Map(numFiles -> 3, numOutputRows -> 2150, numOutputBytes -> 50780)",,Databricks-Runtime/13.3.x-scala2.12


As you can see, version 0 is a `CREATE TABLE AS SELECT` statement.

The `CREATE OR REPLACE TABLE` statement, in turn, has generated a new table version (version 1).
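Because version 0 is still recorded in the history, the data it held remains reachable through Time Travel. The following is a minimal sketch, assuming it is run against the `orders` table at this point of the notebook. The `RESTORE` statement is shown for illustration only and is not part of this walkthrough: running it would add another entry to the table history, so the version numbers shown later would shift.

```sql
-- Read the table as it was right after the initial CTAS (version 0)
SELECT * FROM orders VERSION AS OF 0;

-- Illustration only: roll the table back to version 0.
-- The restore is itself recorded as a new version in the history.
RESTORE TABLE orders TO VERSION AS OF 0;
```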

### `INSERT OVERWRITE` statement

The second method to overwrite table data is to use the `INSERT OVERWRITE` statement.

It produces nearly the same outcome as the CRAS statement above.

It means data in the target table will be replaced by data from the query.

However, the `INSERT OVERWRITE` statement differs in a few ways.

- For example, it can only overwrite an existing table; it cannot create a new one like our `CREATE OR REPLACE TABLE` statement.

- And it can overwrite the table only with new records that match the current table schema, which makes it a safer technique for overwriting an existing table without the risk of modifying the table schema.

Let us run this command.

In [0]:
%sql
INSERT OVERWRITE orders
SELECT * FROM parquet.`${dataset.bookstore}/orders`

num_affected_rows,num_inserted_rows
2150,2150


As you can see here, we have successfully overwritten the table data, rewriting 2150 records.

And we can see our table history again.

In [0]:
%sql
DESCRIBE HISTORY orders

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
2,2024-10-14T04:48:31Z,2895352578531874,suryapulika38@gmail.com,WRITE,"Map(mode -> Overwrite, statsOnLoad -> false, partitionBy -> [])",,List(3802355948382795),1011-150700-u18wk0fi,1.0,WriteSerializable,False,"Map(numFiles -> 3, numOutputRows -> 2150, numOutputBytes -> 50780)",,Databricks-Runtime/13.3.x-scala2.12
1,2024-10-14T04:45:41Z,2895352578531874,suryapulika38@gmail.com,CREATE OR REPLACE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(3802355948382795),1011-150700-u18wk0fi,0.0,WriteSerializable,False,"Map(numFiles -> 3, numOutputRows -> 2150, numOutputBytes -> 50780)",,Databricks-Runtime/13.3.x-scala2.12
0,2024-10-14T04:42:39Z,2895352578531874,suryapulika38@gmail.com,CREATE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(3802355948382795),1011-150700-u18wk0fi,,WriteSerializable,True,"Map(numFiles -> 3, numOutputRows -> 2150, numOutputBytes -> 50780)",,Databricks-Runtime/13.3.x-scala2.12



As you can see here, the `INSERT OVERWRITE` operation has been recorded as a new version of the table, as a `WRITE` operation.

And it has the mode "Overwrite".

Now let us try to `INSERT OVERWRITE` the data with a different schema. For example, here we add a new column holding the current timestamp.

Running this command generates an exception.

In [0]:
%sql
INSERT OVERWRITE orders
SELECT *, current_timestamp() FROM parquet.`${dataset.bookstore}/orders`

The exception says "A schema mismatch detected when writing to the Delta table...".

So the way they enforce the schema on write is the primary difference between the `INSERT OVERWRITE` and `CREATE OR REPLACE TABLE` statements.
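To see the other side of this difference, here is a hedged sketch in which the same extra column is accepted by `CREATE OR REPLACE TABLE`, because the table definition is replaced along with the data. The table name `orders_schema_demo` and the column alias `ingested_at` are hypothetical, chosen only so that the `orders` table used in this notebook stays untouched.

```sql
-- Hypothetical scratch table with the original schema
CREATE OR REPLACE TABLE orders_schema_demo AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`;

-- Replacing it with a query that returns one more column is allowed:
-- the schema is redefined by the new query
CREATE OR REPLACE TABLE orders_schema_demo AS
SELECT *, current_timestamp() AS ingested_at
FROM parquet.`${dataset.bookstore}/orders`;

-- Inspect the resulting columns
DESCRIBE TABLE orders_schema_demo;
```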

## Appending Data

Let us now talk about appending records to tables.

The easiest method is to use the `INSERT INTO` statement.

Here we insert new data using an input query that reads the Parquet files in the `orders-new` directory.

In [0]:
%sql
INSERT INTO orders
SELECT * FROM parquet.`${dataset.bookstore}/orders-new`

num_affected_rows,num_inserted_rows
700,700


We have successfully added 700 new records to our table.

And we can check the new number of orders.

In [0]:
%sql
SELECT count(*) FROM orders

count(1)
2850


Now we have 2850 records in the orders table.

## Merging Data

The `INSERT INTO` statement is a simple and efficient operation for inserting new data. 
However, it does not have any built-in guarantees to prevent inserting the same records multiple times.
It means that re-executing the query will write the same records to the target table again, resulting in duplicate records.
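A simple way to check for such duplicates is to group by the key and look for values that occur more than once. This sketch assumes that `order_id` uniquely identifies an order, which the dataset suggests but the notebook does not state explicitly.

```sql
-- List order_id values that appear more than once in the table.
-- Assumption: order_id is meant to be unique per order.
SELECT order_id, count(*) AS copies
FROM orders
GROUP BY order_id
HAVING count(*) > 1;
```

If the `INSERT INTO` cell above was executed only once, this query would be expected to return no rows; after a second execution, the re-inserted orders would show up here.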

To resolve this issue, we can use our second method, which is the `MERGE INTO` statement.

With the merge statement, you can upsert data from a source table, view, or dataframe into the target Delta table.
It means you can insert, update, and delete records using a single `MERGE INTO` statement.
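The examples in this notebook only update and insert, so here is a short sketch of the delete case. The source view `customers_to_delete` is hypothetical (assumed to hold the `customer_id` values to remove), and the statement is for illustration only, not part of this walkthrough.

```sql
-- Illustration only: remove every customer listed in a hypothetical view
MERGE INTO customers c
USING customers_to_delete d   -- hypothetical source of customer_id values
ON c.customer_id = d.customer_id
WHEN MATCHED THEN DELETE
```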

### An Example

Here, we use the merge operation to update the customer data with new email addresses and to add new customers.

- First, we create a temporary view, `customers_updates`, over the new customer data.

- Then we apply the merge operation, `MERGE INTO customers`:

  * The changes come from the `customers_updates` temp view and are matched on the `customer_id` key.

  * We have two actions here.

    When a record matches, we do an update, and when it does not match, we do an insert.

  * In addition, we add extra conditions to the matched clause.

    In this case, we check that the current row has a null email while the new record does not.

    In such a case, we update the email and also the `updated` timestamp.

  * If the new record does not match any existing customer based on the customer ID, we insert this new record.

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW customers_updates AS 
SELECT * FROM json.`${dataset.bookstore}/customers-json-new`;

MERGE INTO customers c
USING customers_updates u
ON c.customer_id = u.customer_id
WHEN MATCHED AND c.email IS NULL AND u.email IS NOT NULL THEN
  UPDATE SET email = u.email, updated = u.updated
WHEN NOT MATCHED THEN INSERT *

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
301,100,0,201


As we can see here, we have updated 100 records and we have inserted 201 records.
And no records have been deleted.

So in a merge operation, updates, inserts and deletes are completed in a single atomic transaction.
In addition, the merge operation is a great solution for avoiding duplicates when inserting records.
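One way to check the single-transaction behavior is to look at the table history, as we did for `orders`. This sketch assumes `customers` is a Delta table like `orders`; the merge should then appear as one new version rather than as separate update and insert entries.

```sql
-- The merge is expected to show up as a single entry,
-- with its update and insert counts in operationMetrics
DESCRIBE HISTORY customers
```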

### Another Example

Here we have new books to be inserted and they are coming in CSV format.

We will create a temporary view over this new data.

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW books_updates
   (book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
  path = "${dataset.bookstore}/books-csv-new",
  header = "true",
  delimiter = ";"
);

SELECT * FROM books_updates

book_id,title,author,category,price
B14,Data Communications and Networking,Behrouz A. Forouzan,Computer Science,34.0
B15,Inside the Java Virtual Machine,Bill Venners,Computer Science,41.0
B13,Linux pocket guide,Daniel J. Barrett,Computer Science,26.0
B16,Green for Life,Victoria Boutenko,Food,18.0
B17,Cooking with Love,Carla Hall,Food,23.0


Here we have five new books, and we are only interested in inserting the computer science books into our database.

Let us now use the `MERGE INTO` statement to update the `books` table with the data coming from the temporary view `books_updates`. This time we provide only the not-matched clause.

It means we insert new records only if they do not already exist, based on our key, which is the combination of `book_id` and `title`.

In addition, we specify that only records in the `Computer Science` category are inserted.

In [0]:
%sql
MERGE INTO books b
USING books_updates u
ON b.book_id = u.book_id AND b.title = u.title
WHEN NOT MATCHED AND u.category = 'Computer Science' THEN 
  INSERT *

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
3,0,0,3


As expected, we are only inserting three new records, which are the three computer science books.

And as we said, one of the main benefits of the merge operation is to avoid duplicates.

So if we rerun this statement, it will not reinsert those records, as they are already in the table.

In [0]:
%sql
MERGE INTO books b
USING books_updates u
ON b.book_id = u.book_id AND b.title = u.title
WHEN NOT MATCHED AND u.category = 'Computer Science' THEN 
  INSERT *

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
0,0,0,0


Yes, indeed.
Zero records have been inserted.