### Data Lake (Delta Lake) + Lakehouse (Delta tables) - using the Delta format (Parquet + Snappy + Delta log)

-------------
Delta Lake is an open-source storage framework that brings reliability, ACID transactions, and performance to data lakes. It sits on top of Parquet files and is most commonly used with Apache Spark and Databricks.<br>
Delta Lake and the Delta lakehouse form the core analytical storage layer behind Bronze–Silver–Gold (medallion) architectures.
<img src="https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-logo-whitebackground.png" style="width:300px; float: right"/>
## ![](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) Creating our first Delta Lake table
Delta is the default file and table format on Databricks.

--------

![](https://docs.databricks.com/aws/en/assets/images/well-architected-lakehouse-7d7b521addc268ac8b3d597bafa8cae9.png)

-----------

In [0]:
%sql
 --clean up tables left over from earlier runs; IF EXISTS keeps the cell from failing when a table is missing
 drop table if exists lakehousecat1.deltadb.customer_txn;
 drop table if exists lakehousecat1.deltadb.customer_txn_part;
 drop table if exists lakehousecat1.deltadb.drugstbl;
 drop table if exists lakehousecat1.deltadb.drugstbl_merge;
 drop table if exists lakehousecat1.deltadb.drugstbl_partitioned;
 drop table if exists lakehousecat1.deltadb.employee_dv_demo1;
 drop table if exists lakehousecat1.deltadb.product_inventory;
 drop table if exists lakehousecat1.deltadb.tblsales;

In [0]:
#spark.sql("DROP CATALOG IF EXISTS lakehousecat1 CASCADE")
#Create the catalog, the schema, and the volume that holds the raw and Delta files
spark.sql("CREATE CATALOG IF NOT EXISTS lakehousecat1")
spark.sql("CREATE SCHEMA IF NOT EXISTS lakehousecat1.deltadb")
spark.sql("CREATE VOLUME IF NOT EXISTS lakehousecat1.deltadb.datalake")
#spark.sql("CREATE VOLUME IF NOT EXISTS lakehousecat1.deltadb.deltavolume")
#spark.sql("CREATE VOLUME IF NOT EXISTS lakehousecat1.deltadb.deltavolume2")

#### 1. Write data into a Delta file (data lake) and a Delta table (lakehouse)
1. How to migrate CSV to Delta format
2. Difference between Delta and Parquet
3. How to create a data lake and a lakehouse

In [0]:
#1. How to migrate CSV to Delta format (Delta Lake creation)
#Read the raw CSV data from the data lake volume
df = spark.read.csv('/Volumes/lakehousecat1/deltadb/datalake/druginfo.csv', header=True, inferSchema=True)
#Write it back in Delta format: Parquet data files plus a _delta_log transaction log
df.write.format("delta").mode("overwrite").save("/Volumes/lakehousecat1/deltadb/datalake/targetdir")
#2. Difference between Delta and Parquet
#Plain Parquet: the same kind of data files, but no transaction log
df.write.format("parquet").mode("overwrite").save("/Volumes/lakehousecat1/deltadb/datalake/targetdirparquet")
#No format given: Databricks uses its default format, which is Delta (Parquet files + log)
df.write.mode("overwrite").save("/Volumes/lakehousecat1/deltadb/datalake/targetdirdefaultdeltaparquet")
#3. How to create a Delta lakehouse table
spark.sql("drop table if exists lakehousecat1.deltadb.drugstbl")
#Write the DataFrame into a Delta table (lakehouse)
df.write.saveAsTable("lakehousecat1.deltadb.drugstbl", mode='overwrite')
#Behind the scenes the table is stored as Delta files in an S3 bucket (the location is hidden from us in Databricks Free Edition)
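
The practical difference between the Delta and Parquet directories written above is the `_delta_log` folder: both hold Parquet data files, but only the Delta directory carries a transaction log. The sketch below checks for that folder with plain Python. The helper name `storage_format` is hypothetical, and the temporary directories only mimic the layout so the code runs anywhere; on Databricks it assumes the volume path is reachable through the local file API.

```python
import tempfile
from pathlib import Path


def storage_format(table_dir: str) -> str:
    """Classify a directory as 'delta', 'parquet', or 'unknown' by its layout."""
    root = Path(table_dir)
    # A Delta table keeps its transaction log in a _delta_log subdirectory.
    if (root / "_delta_log").is_dir():
        return "delta"
    # Plain Parquet output is just data files, with no transaction log.
    if any(root.rglob("*.parquet")):
        return "parquet"
    return "unknown"


# Stand-in directories (hypothetical contents) that mimic the two layouts.
with tempfile.TemporaryDirectory() as tmp:
    delta_dir = Path(tmp) / "targetdir"
    parquet_dir = Path(tmp) / "targetdirparquet"
    (delta_dir / "_delta_log").mkdir(parents=True)
    (delta_dir / "_delta_log" / "00000000000000000000.json").write_text("{}")
    (delta_dir / "part-00000.snappy.parquet").write_bytes(b"")
    parquet_dir.mkdir()
    (parquet_dir / "part-00000.snappy.parquet").write_bytes(b"")

    delta_result = storage_format(str(delta_dir))
    parquet_result = storage_format(str(parquet_dir))
    print(delta_result, parquet_result)
```

In the notebook, calling `storage_format("/Volumes/lakehousecat1/deltadb/datalake/targetdir")` should report `delta`, while the `targetdirparquet` path should report `parquet`.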

In [0]:
%sql
--under the hood data is stored in S3
explain select * from lakehousecat1.deltadb.drugstbl

##### Schema evolution is supported

In [0]:
#Schema evolution can be enabled on Delta tables just by adding the option below.
#df.write.option("mergeSchema","True").saveAsTable("lakehousecat1.deltadb.drugstbl",mode='overwrite')

#### 2. DML Operations in Delta Tables & Files
 - Delta overcomes the WORM (Write Once Read Many) limitation of cloud object stores (S3/GCS/ADLS) and of distributed storage layers like HDFS
 - Delta files/tables support WMRM (Write Many Read Many) operations through DML statements such as INSERT/DELETE/UPDATE/MERGE
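
To see how DML can work on storage that never modifies a file in place, here is a toy in-memory model, not Delta's actual implementation. An update writes a new file with the changed rows and records a `remove` of the old file and an `add` of the new one in an append-only log. The class `ToyDeltaTable`, its file names, and the rating values are made up for illustration; real Delta can also mark rows as deleted with deletion vectors instead of rewriting whole files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """In-memory stand-in: write-once 'files' plus an append-only commit log."""
    files: dict = field(default_factory=dict)  # file name -> tuple of row dicts
    log: list = field(default_factory=list)    # one list of actions per version
    _next_id: int = 0

    def _write_file(self, rows):
        name = f"part-{self._next_id:05d}.parquet"
        self._next_id += 1
        self.files[name] = tuple(rows)  # written once, never modified afterwards
        return name

    def insert(self, rows):
        self.log.append([{"add": self._write_file(rows)}])

    def update(self, predicate, change):
        actions = []
        for name in self.live_files():
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                # Rewrite the whole file with the change applied, then swap it in.
                new_rows = [change(dict(r)) if predicate(r) else r for r in rows]
                actions.append({"remove": name})
                actions.append({"add": self._write_file(new_rows)})
        self.log.append(actions)

    def live_files(self, version=None):
        """Replay the log up to `version` (inclusive) to find the active files."""
        live = []
        commits = self.log if version is None else self.log[: version + 1]
        for actions in commits:
            for action in actions:
                if "add" in action:
                    live.append(action["add"])
                else:
                    live.remove(action["remove"])
        return live

    def rows(self, version=None):
        return [r for name in self.live_files(version) for r in self.files[name]]


table = ToyDeltaTable()
# Made-up ratings; the ids are the ones queried later in this notebook.
table.insert([{"uniqueid": 163740, "rating": 9}, {"uniqueid": 206473, "rating": 7}])
table.update(lambda r: r["uniqueid"] == 163740,
             lambda r: {**r, "rating": r["rating"] - 1})

print(table.log[1])    # the update commit: remove the old file, add the new one
print(table.rows())    # latest version
print(table.rows(0))   # version 0 is still readable because the old file was kept
```

Because old files are kept and only the log decides which ones are live, earlier versions stay readable, which is the idea behind the table history shown by `DESCRIBE HISTORY`.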


In [0]:
%sql
--DDL is supported (we will see more of this later)
create or replace table lakehousecat1.deltadb.sampletable(id int, name string)
using delta;

In [0]:
%sql
--Although the data is stored internally as Delta files, we can't view those underlying files on Databricks serverless
insert into lakehousecat1.deltadb.sampletable values(1,'irfan');
describe history lakehousecat1.deltadb.sampletable;

In [0]:
%sql
use lakehousecat1.deltadb

In [0]:
%sql
DESCRIBE HISTORY lakehousecat1.deltadb.drugstbl

In [0]:
%sql
--DQL is supported
SELECT * FROM drugstbl where uniqueid=163740;

##### a. Table Update

In [0]:
%sql
--DML - update is possible in the delta tables/files
UPDATE drugstbl
SET rating = rating-1
where uniqueid=163740;

In [0]:
%sql
--by default, the latest version is shown
SELECT * FROM drugstbl
WHERE uniqueid=163740;

##### b. Table Delete

In [0]:
%sql
--DML - Delete is possible on delta tables/files
DELETE FROM drugstbl
where uniqueid=163740;

In [0]:
%sql
SELECT * FROM drugstbl
 where uniqueid in (163740,206473);

In [0]:
%sql
desc history drugstbl;

##### c. File DML (Update/Delete)
We don't usually run DML directly against files; we do it here only to learn that
 - files can also undergo a limited set of DML operations
 - it helps to understand what Delta does in the background when we run DML
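
One way to look at that background operation is to read the transaction log directly. Each commit in `_delta_log` is a JSON-lines file named by its zero-padded version, and its `add` and `remove` actions say which data files belong to the table. The sketch below replays those actions with plain Python. The helper names are hypothetical, the log it reads is a hand-written stand-in, and it ignores checkpoint files and deletion vectors, so treat it as a simplified illustration rather than a full reader.

```python
import json
import tempfile
from pathlib import Path


def read_commits(table_dir):
    """Yield (version, actions) for each JSON commit in _delta_log, oldest first."""
    log_dir = Path(table_dir) / "_delta_log"
    # Commit files are named by zero-padded version, so name order is version order.
    for commit in sorted(log_dir.glob("*.json")):
        if not commit.stem.isdigit():
            continue  # skip anything that is not a plain <version>.json commit
        lines = commit.read_text().splitlines()
        yield int(commit.stem), [json.loads(line) for line in lines if line.strip()]


def live_data_files(table_dir):
    """Replay add/remove actions to find the data files in the latest version."""
    live = set()
    for _, actions in read_commits(table_dir):
        # Removes first, so a path removed and re-added in one commit stays live.
        for action in actions:
            if "remove" in action:
                live.discard(action["remove"]["path"])
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
    return sorted(live)


def operations(table_dir):
    """Return (version, operation) pairs taken from each commit's commitInfo."""
    return [
        (version, action["commitInfo"].get("operation"))
        for version, actions in read_commits(table_dir)
        for action in actions
        if "commitInfo" in action
    ]


# Hand-written stand-in log (hypothetical file names) so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "_delta_log"
    log_dir.mkdir()
    v0 = [
        {"commitInfo": {"operation": "WRITE"}},
        {"add": {"path": "part-00000.snappy.parquet", "dataChange": True}},
    ]
    v1 = [
        {"commitInfo": {"operation": "UPDATE"}},
        {"remove": {"path": "part-00000.snappy.parquet", "dataChange": True}},
        {"add": {"path": "part-00001.snappy.parquet", "dataChange": True}},
    ]
    for version, actions in enumerate([v0, v1]):
        name = f"{version:020d}.json"
        (log_dir / name).write_text("\n".join(json.dumps(a) for a in actions))

    files_now = live_data_files(tmp)
    history = operations(tmp)
    print(files_now)
    print(history)
```

Pointing `live_data_files` and `operations` at the `targetdir` volume path before and after a file-level update or delete should show a new commit appearing, assuming the volume is reachable through the local file API.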


In [0]:
spark.read.format('delta').load('/Volumes/lakehousecat1/deltadb/datalake/targetdir').where('uniqueid=163740').show()