# Delta Lake Lab 
## Unit 2: Create a Delta Lake table
In the previous unit, we:
1. Read Parquet data from the data lake
2. Cleansed it, subset it, and persisted it as Parquet to the data lake's parquet-consumable directory
3. Created a database called loan_db and defined an external table on the data in parquet-consumable

In this unit, you will learn to:
1. Create a base table in Delta off of the Parquet table from the prior notebook.
2. Create a partitioned Delta table off of the Parquet table from the prior notebook.

### 1. Imports

In [None]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [None]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark
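
The session above relies on the environment already having Delta Lake enabled. Whether the lab's Dataproc setup does this is an assumption; if it does not, the two settings that the Delta Lake documentation lists for enabling its SQL extension and catalog can be applied to the builder. A minimal sketch, where `with_delta_conf` is a hypothetical helper and not part of the lab:

```python
# Settings documented by Delta Lake for enabling its SQL extension and catalog.
DELTA_SPARK_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def with_delta_conf(builder, conf=DELTA_SPARK_CONF):
    """Apply each key/value pair to a session builder and return the builder.

    Hypothetical helper: `builder` is any object exposing .config(key, value),
    such as SparkSession.builder.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

In the notebook this would be used as `spark = with_delta_conf(SparkSession.builder.appName('Loan Analysis')).getOrCreate()`.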

### 3. Declare variables

In [None]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

In [None]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

In [None]:
ACCOUNT_NAME = "YOUR_ACCOUNT_NAME"

In [None]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)
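
The `grep | cut | xargs` pipelines above pull one field at a time out of the `gcloud projects describe` output. As an alternative sketch, the captured lines can be parsed once into a dictionary and looked up by exact key, which avoids accidental substring matches. The function name `parse_describe_output` and the sample lines are hypothetical; real values come from the `!gcloud` cell.

```python
def parse_describe_output(lines):
    """Parse top-level `key: value` lines into a dict, stripping surrounding quotes."""
    fields = {}
    for line in lines:
        # Skip indented (nested) lines and lines without a key/value separator.
        if line.startswith(" ") or ":" not in line:
            continue
        # Split on the first colon only, so values containing colons stay intact.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip().strip("'\"")
    return fields


# Illustrative sample only; not real project data.
sample = [
    "createTime: '2022-01-01T00:00:00.000Z'",
    "lifecycleState: ACTIVE",
    "name: my-lab-project",
    "projectId: my-lab-project",
    "projectNumber: '123456789012'",
]
project = parse_describe_output(sample)
print(project["name"], project["projectNumber"])
```

In the notebook, `sample` would be replaced by the list returned from `!gcloud projects describe $PROJECT_ID`.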

In [None]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}-{ACCOUNT_NAME}"

In [None]:
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"

### 4. Create a Delta Lake table

In [None]:
# Create a Delta dataset from the Parquet table
spark.sql(f"SELECT addr_state, count(*) AS count FROM {ACCOUNT_NAME}_loan_db.loans_by_state_parquet GROUP BY addr_state").write.mode("overwrite").format("delta").save(DELTA_LAKE_DIR_ROOT)

In [None]:
# Define the external Delta table on top of the dataset written above
spark.sql("DROP TABLE IF EXISTS "+ ACCOUNT_NAME +"_loan_db.loans_by_state_delta;").show(truncate=False)
# Use the ACCOUNT_NAME variable so the table lands in the same database as the DROP above
spark.sql(f"CREATE TABLE {ACCOUNT_NAME}_loan_db.loans_by_state_delta USING delta LOCATION \"{DELTA_LAKE_DIR_ROOT}\"")
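
Because the database name is derived from `ACCOUNT_NAME`, it is easy for the DROP and CREATE statements to drift apart. A minimal sketch of building both the qualified table name and the statement in one place is shown here. The helpers `qualified_table` and `create_delta_table_sql`, the identifier check, and the sample values are all hypothetical; the restriction to letters, digits, and underscores is an assumption about what is safe to interpolate unquoted.

```python
import re


def qualified_table(account_name, table):
    """Return `<account>_loan_db.<table>`, rejecting names unsafe to interpolate into SQL."""
    for part in (account_name, table):
        if not re.fullmatch(r"[A-Za-z0-9_]+", part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"{account_name}_loan_db.{table}"


def create_delta_table_sql(account_name, table, location):
    """Build the CREATE TABLE statement for an external Delta table at `location`."""
    name = qualified_table(account_name, table)
    return f'CREATE TABLE {name} USING delta LOCATION "{location}"'


# Hypothetical values for illustration only.
statement = create_delta_table_sql(
    "jane", "loans_by_state_delta", "gs://example-bucket/delta-consumable"
)
print(statement)
```

The resulting string could then be passed to `spark.sql(...)` in place of the hand-assembled statement.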

In [None]:
spark.sql("show tables from "+ ACCOUNT_NAME +"_loan_db;").show()

In [None]:
spark.sql("select * from "+ ACCOUNT_NAME +"_loan_db.loans_by_state_delta limit 2").show()

In [None]:
spark.sql("DESCRIBE FORMATTED "+ ACCOUNT_NAME +"_loan_db.loans_by_state_delta").show(truncate=False)

In [None]:
spark.sql("DESCRIBE EXTENDED "+ ACCOUNT_NAME +"_loan_db.loans_by_state_delta").show(truncate=False)

### 5. Create a partitioned Delta Lake table

In [None]:
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-sample-partitioned"

In [None]:
# Create a Delta dataset from the Parquet table, partitioned by addr_state
spark.sql(f"SELECT addr_state, count(*) AS count FROM {ACCOUNT_NAME}_loan_db.loans_by_state_parquet GROUP BY addr_state").write.mode("overwrite").partitionBy("addr_state").format("delta").save(DELTA_LAKE_DIR_ROOT)
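
Writing with `partitionBy("addr_state")` lays the data out in Hive-style directories named `addr_state=<value>/` under the target path. As a rough sketch of how to read that layout back from an object listing, the function below collects the distinct partition values. The helper `partition_values`, the bucket name, and the file names are made up for illustration.

```python
def partition_values(paths, column="addr_state"):
    """Collect the distinct values of a Hive-style `column=value` path segment."""
    prefix = f"{column}="
    values = set()
    for path in paths:
        for segment in path.split("/"):
            if segment.startswith(prefix):
                values.add(segment[len(prefix):])
    return sorted(values)


# Hypothetical object paths; not taken from the lab bucket.
listing = [
    "gs://example-bucket/delta-sample-partitioned/_delta_log/00000000000000000000.json",
    "gs://example-bucket/delta-sample-partitioned/addr_state=CA/part-00000-aaaa.snappy.parquet",
    "gs://example-bucket/delta-sample-partitioned/addr_state=NY/part-00000-bbbb.snappy.parquet",
    "gs://example-bucket/delta-sample-partitioned/addr_state=CA/part-00001-cccc.snappy.parquet",
]
print(partition_values(listing))
```

In the notebook, the `listing` could instead come from `!gsutil ls -r` on the partitioned directory.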

### 6. A quick peek at the data lake layout
Compare this to the last cell of the prior notebook.

In [None]:
!gsutil ls -r $DATA_LAKE_ROOT_PATH

In [None]:
!gsutil ls -r $DATA_LAKE_ROOT_PATH/delta-consumable/part* | wc -l
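
The count above covers only the `part*` data files. A Delta directory also holds a `_delta_log/` folder of JSON commit files, so a quick way to see both at once is to classify the listing in Python. This is a sketch under assumptions: `summarize_listing` is a hypothetical helper, and the paths are invented examples rather than output from the lab bucket.

```python
def summarize_listing(paths):
    """Count Parquet data files and Delta transaction-log JSON commits in a listing."""
    data_files = 0
    log_commits = 0
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if "/_delta_log/" in path and name.endswith(".json"):
            log_commits += 1
        elif name.startswith("part-") and name.endswith(".parquet"):
            data_files += 1
    return {"data_files": data_files, "log_commits": log_commits}


# Hypothetical listing for illustration only.
example_listing = [
    "gs://example-bucket/delta-consumable/_delta_log/00000000000000000000.json",
    "gs://example-bucket/delta-consumable/part-00000-aaaa.snappy.parquet",
    "gs://example-bucket/delta-consumable/part-00001-bbbb.snappy.parquet",
]
summary = summarize_listing(example_listing)
print(summary)
```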

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK