### Delta Lake Tables:

Delta Lake is the optimized storage layer that provides the foundation for tables in a lakehouse on Databricks. It is open source and extends Parquet data files with a file-based transaction log that provides ACID transactions and scalable metadata handling.

It lets you use a single copy of the data for both batch and streaming operations and supports incremental processing at scale.

**Characteristics:**
- ACID transactions --> Writes either commit fully or not at all, and a table can be rolled back to an earlier version if necessary, which wasn't the case with traditional data lakes.
- Scalable metadata
- Time travel --> Every action on the table is recorded in the transaction log, so the table's history is tracked and earlier versions can be queried.
- Simple solution architecture --> A single API for both batch and streaming keeps the design simple.
- Support for DML operations --> Enables incremental data loads.
- Better performance

**Architecture:** 
Delta Lake stores data in Parquet files (columnar storage) and maintains a transaction log (the delta log, written as JSON files) that provides history, time travel, and so on. Delta tables are built on top of these files and can be registered in Unity Catalog for data security and governance. The Delta engine, which is Spark compatible, runs the transformations on compute.

Data Storage ---> Delta Tables (Unity Catalog) ---> Delta Engine (Spark compatible) ---> Compute

#### 1.1 Delta Transaction Log

Suppose you've created an empty table in a Databricks workspace. The table's metadata is stored in Unity Catalog, and Databricks also creates a folder for the table in cloud storage. That folder holds the Parquet data files for any data you ingest, plus a `_delta_log` subfolder for the transaction log. Because the first step here is an empty table, there are 0 data files and the log has 1 file, `00000000000000000000.json`. When you insert data, the table gets populated: depending on the volume of data, one or more new Parquet files are written, and a new log file recording the change is added to `_delta_log`.
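
The folder layout described above can be imitated locally. The following is a minimal sketch in plain Python (not SQL, so that it runs outside Databricks); the helper names `commit` and `active_files`, the file name `part-0001.parquet`, and the reduced set of log actions are assumptions made for illustration, and the real log records more detail per commit.

```python
import json
import tempfile
from pathlib import Path


def commit(table_dir: Path, version: int, actions: list[dict]) -> Path:
    """Write one commit file named with a 20-digit, zero-padded version number."""
    log_dir = table_dir / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{version:020d}.json"
    # One JSON action per line (simplified compared with the real log).
    path.write_text("\n".join(json.dumps(a) for a in actions))
    return path


def active_files(table_dir: Path) -> list[str]:
    """Replay the commits in order and return the data files that make up the table."""
    files: set[str] = set()
    for commit_file in sorted((table_dir / "_delta_log").glob("*.json")):
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


table_dir = Path(tempfile.mkdtemp()) / "companies"
# Version 0: creating the empty table records metadata only, so no data files yet.
first = commit(table_dir, 0, [{"metaData": {"name": "companies"}}])
files_after_create = active_files(table_dir)
# Version 1: an insert adds a Parquet file (hypothetical file name).
commit(table_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
files_after_insert = active_files(table_dir)
print(first.name, files_after_create, files_after_insert)
```

The point of the sketch is that the table's contents are defined by replaying the log, not by listing the folder.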

In [0]:
----- Create the demo catalog and a new schema under it ------
CREATE CATALOG IF NOT EXISTS demo
MANAGED LOCATION 'abfss://demo@deacourseextdld.dfs.core.windows.net/';


create schema if not exists demo.delta_lake
managed location 'abfss://demo@deacourseextdld.dfs.core.windows.net/delta_lake';

In [0]:
------- 1. Create a Delta table ----------

create table if not exists demo.delta_lake.companies
(
  company_name string,
  founded_date date,
  country string
);

In [0]:
desc extended demo.delta_lake.companies;

In [0]:
--------- 2. Insert Some Data ----------

INSERT INTO demo.delta_lake.companies
VALUES('Apple', '1976-04-01', 'USA'),
('Microsoft', '1975-04-04', 'USA'),
('Google', '1998-09-04', 'USA'),
('SpaceX', '2002-06-03', 'USA');

In [0]:
select * from demo.delta_lake.companies;

#### 1.2 History and Time Travel

In [0]:
---- Query the table's History

DESCRIBE HISTORY demo.delta_lake.companies;


In [0]:
---- Query data from a specific version

select * from demo.delta_lake.companies version as of 1;

In [0]:
---- Query the table as of a specific timestamp

select * from demo.delta_lake.companies timestamp as of '2026-01-20T20:37:16.000+00:00';

In [0]:
----- Restore Data in the Table to a specific version ------------

restore table demo.delta_lake.companies to version as of 1;

In [0]:
select * from demo.delta_lake.companies;

In [0]:
desc history demo.delta_lake.companies;
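
To make the mechanics behind `VERSION AS OF` and `RESTORE` more concrete, here is a small Python sketch of a log being replayed. It is a simplified model under stated assumptions, not how Databricks implements these commands: the functions `files_as_of` and `restore`, the in-memory `log` list, and the file names are all hypothetical.

```python
def files_as_of(log: list[dict], version: int) -> set[str]:
    """Replay commits 0..version and return the files visible at that version."""
    files: set[str] = set()
    for entry in log[: version + 1]:
        files |= set(entry.get("add", []))
        files -= set(entry.get("remove", []))
    return files


def restore(log: list[dict], version: int) -> None:
    """Append a new commit that brings the table back to an earlier version."""
    current = files_as_of(log, len(log) - 1)
    target = files_as_of(log, version)
    log.append({
        "operation": "RESTORE",
        "add": sorted(target - current),
        "remove": sorted(current - target),
    })


log = [
    {"operation": "CREATE TABLE"},                  # version 0
    {"operation": "WRITE", "add": ["a.parquet"]},   # version 1
    {"operation": "WRITE", "add": ["b.parquet"]},   # version 2
]
v1 = files_as_of(log, 1)                 # time travel: read the table as of version 1
restore(log, 1)                          # restore: recorded as a new version 3
latest = files_as_of(log, len(log) - 1)
```

In this model a restore does not erase anything: it adds one more entry to the history, so the versions before the restore can still be queried.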

##### Support for ACID Transactions
- A transaction log entry is written only at the end of a transaction, once all of its data files have been written.
- Readers always read the transaction log first to identify the list of data files to read.

Scenario 1: A process fails midway through writing a data file while a data analyst is trying to read the latest data. The analyst sees only the data already recorded in the transaction log and not the data of the failed file, because a failed process never updates the transaction log.

Next, suppose the process is retried and the new data is successfully written to storage as a third file, leaving the failed file behind. This time the reader reads only the 1st and 3rd files and skips the partially written 2nd file, because the transaction log lists only the files from processes that succeeded.
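
The scenario can be played through with a toy model in Python. This is only a sketch of the idea (write the data file first, commit last, read through the log); the `ToyTable` class and the file names are assumptions for illustration and do not correspond to any Delta Lake API.

```python
class ToyTable:
    """In-memory stand-in for a table folder: data files plus a commit log."""

    def __init__(self) -> None:
        self.storage: set[str] = set()    # every file physically present
        self.log: list[list[str]] = []    # one list of file names per commit

    def write(self, file_name: str, fail_midway: bool = False) -> None:
        self.storage.add(file_name)       # step 1: write the data file
        if fail_midway:
            raise IOError(f"writer crashed while writing {file_name}")
        self.log.append([file_name])      # step 2: commit, only on success

    def read(self) -> list[str]:
        # Readers consult the log first and never list the storage folder.
        return [name for commit in self.log for name in commit]


table = ToyTable()
table.write("file1.parquet")
try:
    table.write("file2.parquet", fail_midway=True)
except IOError:
    pass  # the partial file stays in storage but is never committed
visible_after_failure = table.read()
table.write("file3.parquet")
visible_after_retry = table.read()
```

After the failure, `file2.parquet` is still present in `storage`, yet no reader ever sees it because it never appears in the log.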

##### Creating the Delta Lake Tables in various ways

- Databricks recommends the `CREATE OR REPLACE TABLE` syntax for recreating tables because it retains the table's metadata and history, whereas dropping and recreating a table doesn't retain any history.
- For an external table, Databricks manages only the metadata and not the files, whereas for a managed table Databricks manages both the metadata and the files.
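
The difference between replacing a table and dropping and recreating it can be pictured with a very small Python model. The `catalog` dictionary and the three functions are hypothetical stand-ins, and the model tracks nothing but the list of operations in a table's history.

```python
catalog: dict[str, list[str]] = {}  # table name -> history of operations


def create_table(name: str) -> None:
    catalog[name] = ["CREATE TABLE"]


def drop_table(name: str) -> None:
    catalog.pop(name, None)  # the history goes away with the table


def create_or_replace_table(name: str) -> None:
    # Replacing an existing table appends a new version to the existing history.
    if name in catalog:
        catalog[name].append("CREATE OR REPLACE TABLE")
    else:
        catalog[name] = ["CREATE OR REPLACE TABLE"]


create_table("companies")
create_or_replace_table("companies")
history_after_replace = list(catalog["companies"])

drop_table("companies")
create_table("companies")
history_after_drop_and_create = list(catalog["companies"])
```

In the first case the original `CREATE TABLE` entry is still in the history; in the second the history starts over.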

In [0]:
--------- Column and Table Properties while creating the table ------------
DROP TABLE IF EXISTS demo.delta_lake.companies;


create table if not exists demo.delta_lake.companies 
(company_name string, founded_date date, country string)
comment 'This table contains info about some of the successful companies'
TBLPROPERTIES ('sensitive' = 'true', 'delta.enableDeletionVectors' = 'false'); -- Table properties can be configured here

In [0]:
desc extended demo.delta_lake.companies;

In [0]:
---------- Column Properties of the Table ---------------

DROP TABLE IF EXISTS demo.delta_lake.companies;


create table if not exists demo.delta_lake.companies 
(
  company_name string NOT NULL,
  founded_date date COMMENT 'The date the company was founded',
  country string)
comment 'This table contains info about some of the successful companies'
TBLPROPERTIES ('sensitive' = 'true', 'delta.enableDeletionVectors' = 'false') ;

In [0]:
desc extended demo.delta_lake.companies;

In [0]:
--- Generated identity columns: used to generate a unique identifier, for example a primary key value -----
--- Generated computed columns: automatically calculate and store values derived from other columns in the same table ----
DROP TABLE IF EXISTS demo.delta_lake.companies;


create table if not exists demo.delta_lake.companies 
(
  company_id BIGINT NOT NULL generated always as identity (start with 1 increment by 1),
  company_name string NOT NULL,
  founded_date date COMMENT 'The date the company was founded',
  country string)
comment 'This table contains info about some of the successful companies'
TBLPROPERTIES ('sensitive' = 'true', 'delta.enableDeletionVectors' = 'false') ;

In [0]:
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Microsoft', '1975-04-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Google', '1998-09-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Amazon', '1994-05-15', 'USA');

In [0]:
select * from demo.delta_lake.companies;

In [0]:
--- Generated computed columns syntax: GENERATED ALWAYS AS (expr) ----

--- expr may be composed of literals, column identifiers, and deterministic built-in SQL functions or operators, except aggregate functions, window functions, ranking window functions, and table-valued generator functions; it also must not contain subqueries -----

DROP TABLE IF EXISTS demo.delta_lake.companies;


create table if not exists demo.delta_lake.companies 
(
  company_id BIGINT NOT NULL generated always as identity (start with 1 increment by 1),
  company_name string NOT NULL,
  founded_date date COMMENT 'The date the company was founded',
  founded_year int generated always as (YEAR(founded_date)), 
  country string)
comment 'This table contains info about some of the successful companies'
TBLPROPERTIES ('sensitive' = 'true', 'delta.enableDeletionVectors' = 'false') ;

In [0]:
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Microsoft', '1975-04-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Google', '1998-09-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Amazon', '1994-05-15', 'USA');

In [0]:
select * from demo.delta_lake.companies;
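
What the two kinds of generated columns do on insert can be sketched in Python. This is an illustrative model only: `insert_company` and the in-memory `companies` list are hypothetical, and the counter hands out consecutive numbers for simplicity, whereas real identity values are unique but not guaranteed to be consecutive.

```python
from datetime import date
from itertools import count

_next_id = count(start=1, step=1)  # mirrors START WITH 1 INCREMENT BY 1


def insert_company(table: list[dict], company_name: str,
                   founded_date: date, country: str) -> None:
    """Append a row, filling in the generated columns on the caller's behalf."""
    table.append({
        "company_id": next(_next_id),        # identity column
        "company_name": company_name,
        "founded_date": founded_date,
        "founded_year": founded_date.year,   # computed from founded_date
        "country": country,
    })


companies: list[dict] = []
insert_company(companies, "Microsoft", date(1975, 4, 4), "USA")
insert_company(companies, "Google", date(1998, 9, 4), "USA")
```

As in the `INSERT` statements above, the caller supplies only `company_name`, `founded_date`, and `country`; the other two values are derived.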

#### Create or Replace & CTAS

In [0]:
DROP TABLE IF EXISTS demo.delta_lake.companies;


create table if not exists demo.delta_lake.companies 
(
  company_id BIGINT NOT NULL generated always as identity (start with 1 increment by 1),
  company_name string NOT NULL,
  founded_date date COMMENT 'The date the company was founded',
  founded_year int generated always as (YEAR(founded_date)), 
  country string)
comment 'This table contains info about some of the successful companies'
TBLPROPERTIES ('sensitive' = 'true', 'delta.enableDeletionVectors' = 'false') ;

INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Microsoft', '1975-04-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Google', '1998-09-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Amazon', '1994-05-15', 'USA');

In [0]:
desc history demo.delta_lake.companies;

In [0]:
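-- Skip this drop when running the demo: dropping the table would discard the history that CREATE OR REPLACE is meant to retain.
-- DROP TABLE IF EXISTS demo.delta_lake.companies;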

In [0]:



create or replace table demo.delta_lake.companies 
(
  company_id BIGINT NOT NULL generated always as identity (start with 1 increment by 1),
  company_name string NOT NULL,
  founded_date date COMMENT 'The date the company was founded',
  founded_year int generated always as (YEAR(founded_date)), 
  country string)
comment 'This table contains info about some of the successful companies'
TBLPROPERTIES ('sensitive' = 'true', 'delta.enableDeletionVectors' = 'false') ;

INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Microsoft', '1975-04-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Google', '1998-09-04', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Amazon', '1994-05-15', 'USA');
INSERT INTO demo.delta_lake.companies(company_name, founded_date, country) VALUES ('Tencent', '1998-11-17', 'China');
INSERT INTO demo.delta_lake.companies (company_name, founded_date, country) VALUES ('Facebook', '2004-02-04', 'USA');

In [0]:
describe history demo.delta_lake.companies; -- With CREATE OR REPLACE on an existing table, the earlier versions remain in the history

#### CTAS statement

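A CTAS (`CREATE TABLE ... AS SELECT`) statement creates the table and populates it with the selected data at the same time. You can't specify column properties in a CTAS statement because the schema is inferred directly from the `SELECT` statement; to control a data type, cast the column in the query. You also can't add a column comment directly in the CTAS statement, but you can add one afterwards with the `ALTER TABLE` command.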

In [0]:
drop table if exists demo.delta_lake.companies_china;

In [0]:
create table demo.delta_lake.companies_china
as 
select cast(company_id as int) as company_id, company_name, founded_date, founded_year, country from demo.delta_lake.companies where country = 'China';

In [0]:
alter table demo.delta_lake.companies_china
alter column founded_date comment 'Date the company was founded';

In [0]:
alter table demo.delta_lake.companies_china
alter column company_id set not null;

In [0]:
select * from demo.delta_lake.companies_china;

In [0]:
desc demo.delta_lake.companies_china;

In [0]:
desc history demo.delta_lake.companies_china;
