# Delta Lake Lab 
## Unit 3: Delta Table Utilities

This lab is powered by Dataproc Serverless Spark.

In the previous units, we covered the following:
1. Create a base delta table off of the parquet base table loan_db.loans_by_state_parquet
2. Take a peek under the hood of the Delta table
3. Review the delta transaction log
4. Look at delta table details
5. Look at delta table history
6. Create a manifest file
7. Review entries in the Hive Metastore (Dataproc Metastore Service)

In this unit, we will:
1. Review Delta table details
2. Review Delta table history
3. Learn how to create a manifest file
4. Review metastore entries

### 1. Imports

In [None]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import DeltaTable  # the only name this notebook uses from delta.tables

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [None]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

### 3. Declare variables

In [None]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

In [None]:
project_name_output = !gcloud projects describe $PROJECT_ID --format="value(name)" 2>/dev/null
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

In [None]:
project_number_output = !gcloud projects describe $PROJECT_ID --format="value(projectNumber)" 2>/dev/null
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

In [None]:
ACCOUNT_NAME = "YOUR_ACCOUNT_NAME"

In [None]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}-{ACCOUNT_NAME}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

### 4. Peek under the hood of our Delta Lake table (loan_db.loans_by_state_delta)

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

In [None]:
!gsutil cat $DELTA_LAKE_DIR_ROOT/_delta_log/00000000000000000000.json
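
Each commit file in `_delta_log` is newline-delimited JSON with one action per line (for example `commitInfo`, `protocol`, `metaData`, `add`). The file name is the table version zero-padded to 20 digits. The following is a minimal sketch of reading such a file in plain Python. `SAMPLE_COMMIT`, `commit_file_name`, and `summarize_commit` are hypothetical names, and the sample lines are abbreviated, made-up stand-ins rather than output from this lab's table.

```python
import json
from collections import Counter

# Abbreviated, made-up commit content for illustration only; a real commit
# file carries many more fields per action.
SAMPLE_COMMIT = """\
{"commitInfo": {"operation": "WRITE", "operationParameters": {"mode": "Overwrite"}}}
{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}
{"metaData": {"id": "example-id", "format": {"provider": "parquet"}}}
{"add": {"path": "part-00000-example.snappy.parquet", "size": 1024, "dataChange": true}}
{"add": {"path": "part-00001-example.snappy.parquet", "size": 2048, "dataChange": true}}
"""


def commit_file_name(version):
    """Return the _delta_log file name for a table version (20-digit zero padding)."""
    return f"{version:020d}.json"


def summarize_commit(text):
    """Count the actions by type and collect the data files added by the commit."""
    action_counts = Counter()
    added_files = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)
        # Each line is a JSON object with a single top-level key naming the action.
        action_type = next(iter(action))
        action_counts[action_type] += 1
        if action_type == "add":
            added_files.append(action["add"]["path"])
    return dict(action_counts), added_files


counts, files = summarize_commit(SAMPLE_COMMIT)
print(commit_file_name(0))
print(counts)
print(files)
```

To try this against the lab's table, one option is to capture the `gsutil cat` output into a variable in the notebook and pass the lines, joined with newlines, to `summarize_commit`.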

### 5. Table Details
https://docs.delta.io/latest/delta-utility.html#id6

In [None]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)
detailDF = deltaTable.detail()
detailPDF = detailDF.toPandas()
detailPDF

### 6. Table History

https://docs.delta.io/latest/delta-utility.html#id4

In [None]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)
fullHistoryPDF = deltaTable.history().toPandas()    # get the full history of the table
lastOperationPDF = deltaTable.history(1).toPandas() # get the last operation

#### Last operation

In [None]:
lastOperationPDF

#### Full History

In [None]:
fullHistoryPDF

### 7. Table manifest file
https://docs.delta.io/latest/delta-utility.html#id8

You can generate a manifest file for a Delta table so that processing engines other than Apache Spark can read the table. For example, to generate a manifest file that Presto and Athena can use to read a Delta table, run the following:

In [None]:
deltaTable = DeltaTable.forPath(spark, DELTA_LAKE_DIR_ROOT)
deltaTable.generate("symlink_format_manifest")

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT | grep "_symlink_format_manifest/manifest"

In [None]:
MANIFEST_LIST = !gsutil ls -r $DELTA_LAKE_DIR_ROOT | grep "_symlink_format_manifest/manifest"
MANIFEST_FILE = MANIFEST_LIST[0]
print(MANIFEST_FILE)

In [None]:
!gsutil cat $MANIFEST_FILE

Using this manifest file, you can create an external table in BigQuery over the Delta table. The external table reflects the Delta table only as of the moment the manifest was generated, so regenerate the manifest after later writes to pick up new data.
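
The manifest is a plain-text file that lists the data files making up the table snapshot, one path per line. As a minimal sketch, the helper below turns that text into a list of URIs, the form in which an external table definition usually takes its source files. `manifest_to_uris` is a hypothetical name, and `SAMPLE_MANIFEST` is made up rather than taken from this lab's actual manifest.

```python
# Made-up manifest content for illustration; real entries are the full
# gs:// paths of the table's current Parquet data files.
SAMPLE_MANIFEST = """\
gs://example-bucket/delta-consumable/part-00000-example.snappy.parquet
gs://example-bucket/delta-consumable/part-00001-example.snappy.parquet

"""


def manifest_to_uris(text):
    """Return the non-empty lines of a manifest as a list of file URIs."""
    uris = [line.strip() for line in text.splitlines() if line.strip()]
    # Fail early on anything that does not look like a Parquet file in GCS.
    for uri in uris:
        if not (uri.startswith("gs://") and uri.endswith(".parquet")):
            raise ValueError(f"Unexpected manifest entry: {uri}")
    return uris


source_uris = manifest_to_uris(SAMPLE_MANIFEST)
print(len(source_uris), "data files in the snapshot")
```

In the notebook, the output of `!gsutil cat $MANIFEST_FILE` can be captured into a variable and joined with newlines before being passed to the helper.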

### 8. Hive Metastore Entry

In [None]:
spark.sql(f"show tables in {ACCOUNT_NAME}_loan_db").show(truncate=False)

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK