# Delta Lake Lab 
## Unit 8: Table Clone 

In the previous unit, we:
1. Learned about Z-ordering and data skipping, both native to Delta Lake

In this unit, we will learn about:
1. Table cloning - shallow clone: how to create one, and what happens when a shallow clone is created and when it is updated
2. Table cloning - deep clone: how to create one, and what happens when a deep clone is created and when it is updated

### 1. Imports

In [None]:
import pandas as pd

from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

from delta.tables import *

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [None]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

### 3. Declare variables

In [None]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

In [None]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

In [None]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

In [None]:
ACCOUNT_NAME = "YOUR_ACCOUNT_NAME"

In [None]:
DATA_LAKE_ROOT_PATH= f"gs://dll-data-bucket-{PROJECT_NUMBER}-{ACCOUNT_NAME}"
DELTA_LAKE_DIR_ROOT = f"{DATA_LAKE_ROOT_PATH}/delta-consumable"
print(DELTA_LAKE_DIR_ROOT)

### 4. File listing

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/_delta_log/*.json | wc -l

### 5. Create a shallow clone

In [None]:
SHALLOW_CLONE_DIR = f"{DELTA_LAKE_DIR_ROOT}/shallow_clone/"
print(SHALLOW_CLONE_DIR)

In [None]:
spark.sql("SELECT * FROM "+ ACCOUNT_NAME +"_loan_db.loans_by_state_delta WHERE addr_state='IA' LIMIT 2").show(truncate=False)

In [None]:
spark.sql(f"CREATE TABLE IF NOT EXISTS {ACCOUNT_NAME}_loan_db.loans_by_state_delta_clone_shallow SHALLOW CLONE {ACCOUNT_NAME}_loan_db.loans_by_state_delta LOCATION \"{SHALLOW_CLONE_DIR}\"")

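Creating a shallow clone is a metadata-only operation: the clone gets its own transaction log, but it still reads the source table's data files. Only when data in the clone is modified (for example, by an UPDATE) are new data files written under the clone's location.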
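One way to check this is to look at the `add` actions in the clone's transaction log and see where each data file lives. The sketch below is a minimal, illustrative helper, not part of the lab: the function name `summarize_add_actions` and the sample log lines are hypothetical, and it assumes that files referenced from the source table appear in the clone's log as absolute paths while files written by the clone itself appear as relative paths or as paths under the clone directory.

```python
import json

def summarize_add_actions(log_lines, clone_root):
    """Count 'add' actions whose data file is inside vs. outside clone_root."""
    clone_root = clone_root.rstrip("/") + "/"
    inside, outside = 0, 0
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # skip commitInfo, metaData, protocol, remove, ...
        path = add["path"]
        # Assumption: a relative path resolves against the table's own root,
        # so it counts as a file stored in the clone directory.
        is_absolute = "://" in path
        if not is_absolute or path.startswith(clone_root):
            inside += 1
        else:
            outside += 1
    return {"in_clone_dir": inside, "referenced_elsewhere": outside}

# Hypothetical log lines, for illustration only (not taken from the lab data).
sample_log = [
    '{"commitInfo": {"operation": "CLONE"}}',
    '{"add": {"path": "gs://bucket/delta-consumable/part-0000.snappy.parquet", "size": 10, "dataChange": false}}',
    '{"add": {"path": "part-0001.snappy.parquet", "size": 12, "dataChange": true}}',
]
print(summarize_add_actions(sample_log, "gs://bucket/delta-consumable/shallow_clone/"))
```

In the notebook, `log_lines` could be the captured output of `gsutil cat` on a commit file under the clone's `_delta_log/` directory, with `SHALLOW_CLONE_DIR` passed as `clone_root`.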

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

In [None]:
spark.sql("UPDATE "+ ACCOUNT_NAME +"_loan_db.loans_by_state_delta_clone_shallow SET count = 11111 WHERE addr_state='IL'")

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

Because of the update, new data files were written under the shallow clone's location.
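To see where the new files landed, it can help to group the recursive listing by top-level directory instead of reading it line by line. The following is a rough sketch under stated assumptions: the helper name `count_parquet_by_dir` is hypothetical, the sample listing is made up for illustration, and it assumes the source table's part files sit directly under the root while each clone lives in its own subdirectory, as in this lab's layout.

```python
from collections import Counter

def count_parquet_by_dir(listing_lines, root):
    """Count Parquet data files per top-level directory under root."""
    root = root.rstrip("/")
    counts = Counter()
    for line in listing_lines:
        # gsutil prints directory headers with a trailing ':'; strip it.
        path = line.strip().rstrip(":")
        if not path.startswith(root + "/") or not path.endswith(".parquet"):
            continue
        if "/_delta_log/" in path:
            continue  # checkpoint files are Parquet too, but not table data
        parts = path[len(root) + 1:].split("/")
        top = parts[0] if len(parts) > 1 else "(source table root)"
        counts[top] += 1
    return dict(counts)

# Hypothetical listing, for illustration only.
sample_listing = [
    "gs://b/delta-consumable/:",
    "gs://b/delta-consumable/part-0000.snappy.parquet",
    "gs://b/delta-consumable/part-0001.snappy.parquet",
    "",
    "gs://b/delta-consumable/_delta_log/:",
    "gs://b/delta-consumable/_delta_log/00000000000000000010.checkpoint.parquet",
    "gs://b/delta-consumable/shallow_clone/:",
    "gs://b/delta-consumable/shallow_clone/part-0000-abc.snappy.parquet",
]
print(count_parquet_by_dir(sample_listing, "gs://b/delta-consumable"))
```

In the notebook, the listing can be captured with `listing = !gsutil ls -r $DELTA_LAKE_DIR_ROOT` and passed in together with `DELTA_LAKE_DIR_ROOT`; comparing the result before and after the UPDATE shows which directory gained files.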

### 6. Create a deep clone
A deep clone copies the source table's data files, in addition to its metadata, to the clone's location.

In [None]:
DEEP_CLONE_DIR = f"{DELTA_LAKE_DIR_ROOT}/deep_clone/"
print(DEEP_CLONE_DIR)

In [None]:
spark.sql(f"CREATE TABLE IF NOT EXISTS {ACCOUNT_NAME}_loan_db.loans_by_state_delta_clone_deep DEEP CLONE {ACCOUNT_NAME}_loan_db.loans_by_state_delta LOCATION \"{DEEP_CLONE_DIR}\"")



In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT

In [None]:
!gsutil ls -r $DELTA_LAKE_DIR_ROOT/part* | wc -l

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK.