# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Module 3 - Delta_Lake_Demo_notebook

## Learning Objectives

At the end of the experiment, you will be able to:

* describe common storage solutions (databases and data lakes) and their limitations.
* explain the lakehouse, a newer storage architecture that builds on both.
* work through a Delta Lake tutorial.

## Information

**Databases:**

Databases are designed to store structured data as tables, which can be read using SQL queries. They provide very fast computations on the stored data along with strong transactional ACID guarantees on all read/write operations.

**However, databases have a few limitations:**
*  They cannot easily handle growth in data size.
*  They cannot easily handle growth in the diversity of analytics.
*  They are extremely expensive to scale out.
*  They do not support non-SQL based analytics very well.
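
To make the ACID guarantee mentioned above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is chosen only because it needs no setup; it is not part of this tutorial's stack, and the `accounts` table and `transfer` helper are hypothetical names introduced for illustration. The second transfer fails halfway through, and the database rolls back the part that had already been applied.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, then debit, so a failed debit leaves work to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; nothing was kept.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

After the failed transfer, the credit to `bob` is undone along with the rejected debit, so the balances reflect only the first, successful transfer.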

Databases and data warehouses can only store data that has been structured. A data lake, on the other hand, stores all types of data: structured, semi-structured, or unstructured.

The image below shows how data is handled in a data warehouse, a data lake, and a lakehouse:

![img](https://databricks.com/wp-content/uploads/2020/01/data-lakehouse.png)




**Data Lakes:**

The data lake architecture, unlike that of databases, decouples the distributed storage system from the distributed compute system. This allows each system to scale out as needed by the workload.
The data is saved as files with open formats, such that any processing engine can read and write them using standard APIs.


**Limitations of Data Lakes:**

* **No atomicity:** failed production jobs leave data in a corrupt state, requiring tedious recovery.

* **No quality enforcement:** creates inconsistent and unusable data.

* **No consistency / isolation:** makes it almost impossible to mix appends and reads, batch and streaming.
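
The missing atomicity can be sketched with plain files and no Spark at all. The example below is a simplified simulation, not how any particular engine writes data: the `write_partitions` helper, its `fail_at` parameter, and the file names are hypothetical. A job that writes one file per partition crashes partway through, and the files it already wrote stay behind.

```python
import os
import tempfile


def write_partitions(out_dir, partitions, fail_at=None):
    """Write each partition to its own file, as a job would on a plain data lake.

    `fail_at` is a hypothetical knob that simulates a crash before that partition.
    """
    for i, rows in enumerate(partitions):
        if i == fail_at:
            raise RuntimeError(f"simulated job failure before partition {i}")
        with open(os.path.join(out_dir, f"part-{i:05d}.csv"), "w") as f:
            f.write("\n".join(rows) + "\n")


partitions = [["1,a", "2,b"], ["3,c", "4,d"], ["5,e", "6,f"]]

with tempfile.TemporaryDirectory() as out_dir:
    try:
        write_partitions(out_dir, partitions, fail_at=2)
    except RuntimeError as err:
        print("job failed:", err)
    # Nothing cleaned up the partial output: readers can already see it.
    leftover = sorted(os.listdir(out_dir))
    print("files visible to readers:", leftover)
```

A reader that lists the directory now finds two of the three partitions and has no way to tell that the job never finished.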





**Lakehouses:** The Next Step in the Evolution of Storage Solutions.

The lakehouse is a new paradigm that combines the best elements of data lakes and data warehouses for OLAP workloads.
Lakehouses are enabled by a new system design that provides data management features similar to databases directly on the low-cost, scalable storage used for data lakes.


**Lakehouse features:**

* Transaction support
* Schema enforcement and governance
* Support for diverse data types in open formats
* Support for diverse workloads
* Support for upserts and deletes
* Data governance

Currently, there are a few open source systems, such as Apache Hudi, Apache Iceberg, and Delta Lake, that can be used to build lakehouses with these properties. 



**Delta Lake** - Data reliability for Data Lakes.

Delta Lake is an open source project hosted by the Linux Foundation, built by the original creators of Apache Spark. It is an open data storage format that provides transactional guarantees and enables schema enforcement and evolution.

Compared with Apache Hudi and Apache Iceberg, Delta Lake has the tightest integration with Apache Spark data sources (for both batch and streaming workloads) and SQL operations (e.g., MERGE).

**Delta Lake** also allows you to incrementally improve the quality of your data until it is ready for consumption.

Delta Lake Features:

1.  **ACID Transactions:** ACID is an acronym for atomicity, consistency, isolation, and durability.

    **Atomicity:** means that a transaction must exhibit an “all or nothing” behavior. Either all of the instructions within the transaction happen successfully, or none of them happen. Atomicity preserves the “completeness” of the business process.

    **Consistency:** refers to the state of the data both before and after the transaction is executed. A transaction maintains the consistency of the state of the data. In other words, after running a transaction, all data in the database is “correct.”

    **Isolation:** means that transactions can run at the same time without interfering with one another. Each transaction running in parallel has the illusion that there is no concurrency.

    **Durability:** means that once a transaction has committed, its changes persist even if an outage or failure follows. A transaction that ends abnormally before committing does not alter the state of the data. In other words, committed data survives failures.

2. **Enables Time travel:** Query previous versions of the table by time or version number.
3. **Deletes and upserts:** Supports deleting and upserting into tables with programmatic APIs.
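
The time travel and delete/upsert features above map onto a handful of Delta Lake Python API calls. The sketch below wraps them in small helper functions. The helper names, the `path` argument, and the `key` column are illustrative assumptions, and the functions expect an existing `SparkSession` with the Delta Lake package available (in a Databricks notebook, `spark` is predefined).

```python
def read_version(spark, path, version):
    """Time travel by version number: load the table as of that version."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


def read_as_of(spark, path, timestamp):
    """Time travel by time: load the table as of a timestamp string."""
    return spark.read.format("delta").option("timestampAsOf", timestamp).load(path)


def delete_where(spark, path, condition):
    """Delete the rows matching a SQL condition string, e.g. "id < 5"."""
    # Imported lazily so this module can be loaded without Delta Lake installed.
    from delta.tables import DeltaTable

    DeltaTable.forPath(spark, path).delete(condition)


def upsert(spark, path, updates_df, key):
    """Merge `updates_df` into the table: update matching keys, insert the rest."""
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, path)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), f"t.{key} = s.{key}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

For example, assuming a Delta table already exists at a hypothetical location such as `/tmp/delta/events`, `read_version(spark, "/tmp/delta/events", 0)` would return a DataFrame of the table as it was first written, regardless of later deletes or upserts.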


#### Databricks

We will be using Databricks to work on Delta Lake.

Databricks Connect is a client library for Databricks Runtime. It allows you to write jobs using Spark APIs and run them remotely on a Databricks cluster instead of in a local Spark session.

### Databricks Community Edition

It is the free version of the Databricks cloud-based big data platform, where users can access a micro-cluster as well as a cluster manager and notebook environment. All users can share their notebooks and host them free of charge with Databricks.

It is hosted on Amazon Web Services. However, you are not charged when you use the Databricks Community Edition. 

The Databricks Community Edition notebooks are compatible with IPython notebooks. You can easily import your existing IPython notebooks into the Databricks Community Edition notebook environment.

Click on the link below to sign up:

https://community.cloud.databricks.com/login.html

* Click on "Sign Up" (which you will see next to "New to Databricks", below the "Sign In" button).
* Once you click on "Sign Up", you will be directed to a page where you are required to enter your details.
* Once the details are entered, click on "Get Started for Free".
* You will be taken to a page that asks you to "Select a Platform".
* Click on "Get Started" under "Community Edition" (for students and educational institutions).
* Set up your password.

**Once the Databricks Community Edition is set up, please follow the instructions below to import the notebook using a URL.**

**Note: This notebook is adapted from the reference [here](https://docs.databricks.com/_static/notebooks/delta/quickstart-python.html)**

* Click the "Workspace" button or the "Home" button in the sidebar.
* Next to any folder, click the menu dropdown on the right side of the text and select "Import".
* Alternatively, in the Workspace or a user folder, click the down caret and select "Import".
* Specify the URL given below and click "Import" (either download the file to your Desktop from this link or copy the link directly):

https://cdn.iisc.talentsprint.com/CDS/DeltaLakeTutorial.ipynb 

This will load the Delta Lake Tutorial (Python) into Databricks Community Edition, where you can go through the provided Delta Lake example and execute the commands.