##Building Data Pipelines with Delta Live Tables

Challenges with complex ETL pipelines:
1. Ensure Data Quality
2. Infrastructure Management and Scaling
3. Handle batch and streaming data
4. Handle failures and retries
5. Monitor, Optimize and Maintain
6. Track lineage
7. Deployment in multiple environments
8. Manage dependencies

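#####Delta Live Tables:
Delta Live Tables (DLT) is an ETL framework for building reliable, automated, testable and declarative ETL pipelines.

Declarative ETL pipelines:
You declare what each dataset should contain (its query, schema and quality rules) instead of scripting how and in what order to build it. The framework works out the execution order, the infrastructure and the error handling.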
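
To make "declarative" concrete, here is a minimal sketch in the same DLT SQL dialect used in the cells below. The table names `Rides_Raw` and `Rides_Clean` and the path are hypothetical. Each statement only describes the desired result; the `live.` reference in the second statement is what tells the pipeline that `Rides_Clean` depends on `Rides_Raw`, so the run order is inferred from the queries and never scripted by hand.

```sql
-- Hypothetical names and path, for illustration only.

-- Dataset 1: describes WHAT the raw table contains.
CREATE LIVE TABLE Rides_Raw
COMMENT "Rides as they arrive in the landing folder"
AS
SELECT *
FROM parquet.`/path/to/landing/rides`;

-- Dataset 2: the live. prefix declares a dependency on Rides_Raw.
-- There is no explicit "run this after that" step anywhere.
CREATE LIVE TABLE Rides_Clean
COMMENT "Rides with a positive trip distance"
AS
SELECT *
FROM live.Rides_Raw
WHERE TripDistance > 0;
```

As with the cells below, these statements are meant to run as part of a pipeline; executing them directly in a notebook only validates the syntax.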

#####Components of Delta Live tables
1. Datasets (Delta Tables)
2. Data Quality checks
3. Queries
4. Pipelines
5. Environment

#####Steps to Create ETL pipelines
1. Create live datasets
   - Delta tables or Views
   - Types - Complete, Incremental, Streaming
2. Define data quality checks (expectations) on datasets
   - Constraints to apply
   - Actions in case of errors
3. Define data transformation queries
   - Business logic to clean, filter, transform and aggregate
   - Define inter-dependencies between datasets
4. Create, run and test pipelines
   - Auto-manage infra, process data, maintain lineage, apply quality checks, handle logs, retry on failures, etc
   - Mode of execution - Triggered or continuous
5. Promote to production
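
The cells below build complete (fully recomputed) live tables. For the incremental and streaming dataset types listed in step 1, a sketch could look like the following. It assumes the legacy `STREAMING LIVE TABLE` syntax that matches the `CREATE LIVE TABLE` style of this notebook (newer releases also accept `STREAMING TABLE`); the table names and the landing path are hypothetical.

```sql
-- Hypothetical names and path, for illustration only.

-- Incremental ingestion: cloud_files() reads through Auto Loader,
-- which picks up only files that have not been processed yet.
CREATE OR REFRESH STREAMING LIVE TABLE Rides_Bronze_Streaming
COMMENT "Incrementally ingested ride files"
AS
SELECT *
FROM cloud_files("/path/to/landing/rides", "parquet");

-- Downstream streaming table: STREAM() reads the upstream table
-- incrementally instead of recomputing it in full on every update.
CREATE OR REFRESH STREAMING LIVE TABLE Rides_Silver_Streaming
COMMENT "Incrementally cleaned rides"
AS
SELECT RideId, PickupTime, TripDistance
FROM STREAM(live.Rides_Bronze_Streaming)
WHERE TripDistance > 0;
```

Whether such a pipeline processes new data once per run or keeps running is decided by the execution mode from step 4 (triggered or continuous), not by the table definitions themselves.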


In [0]:
%sql
-- Create 1 Bronze Live Table
-- Create 1 Silver Live Table
-- Create 2 Gold Live Tables

-- A) Create Live Bronze table
      -- A live table is defined with the LIVE TABLE keyword and an AS SELECT statement
      -- Running this cell in a notebook only checks the syntax; the table is built when a pipeline runs

CREATE LIVE TABLE YellowTaxis_Bronzelive
(
  RideId                 INT             COMMENT 'This is a primary key column',
  VendorId               INT,
  PickupTime             TIMESTAMP,
  DropTime               TIMESTAMP,
  PickupLocationId       INT,
  DropLocationId         INT,
  CabNumber              STRING,
  DriverLicenseNumber    STRING,
  PassengerCount         INT,
  TripDistance           DOUBLE,
  RatecodeId             INT,
  PaymentType            INT,
  TotalAmount            DOUBLE,
  FareAmount             DOUBLE,
  Extra                  DOUBLE,
  MtaTax                 DOUBLE,
  TipAmount              DOUBLE,
  TollsAmount            DOUBLE,
  ImprovementSurcharge   DOUBLE,

  FileName               STRING,
  CreatedOn              TIMESTAMP
)
USING DELTA 
LOCATION "abfss://datalake@mue10dadls01.dfs.core.windows.net/ShauryaRawat/Output/YellowTaxis_Bronzelive.delta"
PARTITIONED BY (VendorId)
COMMENT "Live Bronze table for YellowTaxis"
AS
SELECT *,
       _metadata.file_path AS FileName,
       CURRENT_TIMESTAMP() AS CreatedOn
FROM parquet.`abfss://datalake@mue10dadls01.dfs.core.windows.net/ShauryaRawat/raw/YellowTaxisParquet/YellowTaxis1.parquet`

Name,Type
RideId,int
VendorId,int
PickupTime,timestamp
DropTime,timestamp
PickupLocationId,int
DropLocationId,int
CabNumber,string
DriverLicenseNumber,string
PassengerCount,int
TripDistance,double


In [0]:
%sql
-- Create Live Silver Table (Define expectation and constraints)
CREATE LIVE TABLE YellowTaxis_SilverLive
(
  RideId                         INT,
  VendorId                       INT,
  PickupTime                     TIMESTAMP,
  DropTime                       TIMESTAMP,
  PickupLocationId               INT,
  DropLocationId                 INT,
  CabNumber                      STRING,
  DriverLicenseNumber            STRING,
  PassengerCount                 INT,
  TripDistance                   DOUBLE,
  TotalAmount                    DOUBLE,

  PickupYear                     INT            GENERATED ALWAYS AS (YEAR(PickupTime)),
  PickupMonth                    INT            GENERATED ALWAYS AS (MONTH(PickupTime)),
  PickupDay                      INT            GENERATED ALWAYS AS (DAY(PickupTime)),

  CreatedOn                      TIMESTAMP,

  -- Define constraints
  CONSTRAINT Valid_TotalAmount  EXPECT (TotalAmount IS NOT NULL AND TotalAmount > 0) ON VIOLATION DROP ROW,     -- Delete row
  CONSTRAINT Valid_TripDistance EXPECT (TripDistance > 0)                            ON VIOLATION DROP ROW,     -- Delete row
  CONSTRAINT Valid_RideId       EXPECT (RideId IS NOT NULL AND RideId > 0)           ON VIOLATION FAIL UPDATE   -- Fail pipeline
)
USING DELTA LOCATION "abfss://datalake@mue10dadls01.dfs.core.windows.net/ShauryaRawat/Output/YellowTaxis_Silverlive.delta"
PARTITIONED BY (PickupLocationId)
AS
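SELECT RideId,
       VendorId,
       PickupTime,
       DropTime,
       PickupLocationId,
       DropLocationId,
       CabNumber,              -- declared in the schema above, so it must be selected
       DriverLicenseNumber,    -- declared in the schema above, so it must be selected
       PassengerCount,         -- declared in the schema above, so it must be selected
       TripDistance,
       TotalAmount,
       current_timestamp() AS CreatedOn
FROM live.YellowTaxis_Bronzelive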
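
The silver table above uses two of the possible actions, `DROP ROW` and `FAIL UPDATE`. An expectation can also be declared without an `ON VIOLATION` clause, in which case the violating rows are kept and the violation is only recorded in the pipeline's data quality metrics. The sketch below shows that variant next to `DROP ROW`; the table name `YellowTaxis_PassengerAudit` and both constraint names are hypothetical and not part of the original pipeline.

```sql
-- Hypothetical dataset, for illustration only.
CREATE LIVE TABLE YellowTaxis_PassengerAudit
(
  -- No ON VIOLATION clause: the row is kept, the violation is only counted
  CONSTRAINT Reasonable_PassengerCount EXPECT (PassengerCount BETWEEN 1 AND 6),

  -- DROP ROW: the row is removed from the target table and counted as dropped
  CONSTRAINT Valid_PickupTime EXPECT (PickupTime IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Example of a warn-only expectation next to a DROP ROW expectation"
AS
SELECT RideId,
       PickupTime,
       PassengerCount
FROM live.YellowTaxis_Bronzelive
```

A warn-only expectation is a reasonable first step when the right threshold is still unknown: the counts show how much data would be affected before a stricter action is chosen.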

In [0]:
%sql
-- Create Gold table 1 (Without defining schema)

CREATE LIVE TABLE YellowTaxis_SummaryByLocation_Goldlive
LOCATION "abfss://datalake@mue10dadls01.dfs.core.windows.net/ShauryaRawat/Output/YellowTaxis_SummaryByLocation_Goldlive.delta"
AS
SELECT PickupLocationId, DropLocationId,
       Count(RideId)      AS TotalRides,
       Sum(TripDistance)  AS TotalDistance,
       Sum(TotalAmount)   AS TotalAmount
FROM live.YellowTaxis_SilverLive
GROUP BY PickupLocationId, DropLocationId

In [0]:
%sql
-- Create Gold table 2 (Without defining schema)

CREATE LIVE TABLE YellowTaxis_SummaryByDate_Goldlive
LOCATION "abfss://datalake@mue10dadls01.dfs.core.windows.net/ShauryaRawat/Output/YellowTaxis_SummaryByDate_Goldlive.delta"
AS
SELECT PickupYear, PickupMonth, PickupDay,
       Count(RideId)      AS TotalRides,
       Sum(TripDistance)  AS TotalDistance,
       Sum(TotalAmount)   AS TotalAmount
FROM live.YellowTaxis_SilverLive
GROUP BY PickupYear, PickupMonth, PickupDay
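
Once the four tables have run through a pipeline, the outcome of the silver expectations can be inspected in the pipeline's event log. The query below is a sketch to run from a regular notebook, outside the pipeline. `<storage-location>` is a placeholder for the pipeline's storage path, and both the `/system/events` suffix and the JSON layout of the `details` column are assumptions about how the event log is stored, so they should be checked against the actual pipeline.

```sql
-- Run from a regular notebook, not as part of the pipeline.
-- <storage-location> is a placeholder; the /system/events suffix and the
-- structure of the details column are assumptions about the event log.
SELECT row_expectations.dataset             AS Dataset,
       row_expectations.name                AS Expectation,
       SUM(row_expectations.passed_records) AS PassedRecords,
       SUM(row_expectations.failed_records) AS FailedRecords
FROM (
  -- One output row per expectation per flow_progress event
  SELECT explode(
           from_json(
             details:flow_progress:data_quality:expectations,
             "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
           )
         ) AS row_expectations
  FROM delta.`<storage-location>/system/events`
  WHERE event_type = 'flow_progress'
)
GROUP BY row_expectations.dataset, row_expectations.name
```

For this pipeline, the result would be expected to list `Valid_TotalAmount`, `Valid_TripDistance` and `Valid_RideId` with their pass and fail counts.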

#####DLT Execution Mode
1. Development
   - Reuses the same cluster across runs to avoid startup overhead
   - Does not retry the pipeline on errors, so failures surface immediately
2. Production
   - Creates a new cluster for each run and shuts it down after execution
   - Retries the pipeline run automatically for specific errors