# Delta Lake Lab 
## Unit 1: Create a base Parquet table
Create a base Parquet table from the Kaggle Lending Club Loan dataset, which is preloaded into the parquet-source directory of your GCS data bucket.

### 1. Imports

In [None]:
import pandas as pd
from pyspark.sql.functions import month, date_format
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession
import warnings

warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [None]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark

### 3. Declare variables

In [None]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

In [None]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

In [None]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)
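
The cells above extract fields by piping the `gcloud projects describe` output through grep and cut. As a hedged alternative, the same fields could be requested directly with the `--format "value(...)"` flag that the first cell already uses. The sketch below is a minimal illustration, not the lab's method: the helper names `describe_field_command` and `describe_field` are hypothetical, and the field names `name` and `projectNumber` are taken from the grep patterns above.

```python
import subprocess


def describe_field_command(project_id, field):
    """Build the gcloud argv that prints one field of a project description."""
    return ["gcloud", "projects", "describe", project_id, f"--format=value({field})"]


def describe_field(project_id, field):
    """Run gcloud and return the requested field with whitespace stripped.

    Requires the gcloud CLI on PATH and an authenticated session.
    """
    result = subprocess.run(
        describe_field_command(project_id, field),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gcloud exits non-zero
    )
    return result.stdout.strip()


# "my-project" is a placeholder project ID used only to show the command shape.
print(describe_field_command("my-project", "projectNumber"))
```

In the notebook, `PROJECT_NUMBER = describe_field(PROJECT_ID, "projectNumber")` would then replace the grep and cut pipeline, assuming gcloud is available to the kernel.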

In [None]:
ACCOUNT_NAME = "YOUR_ACCOUNT_NAME"

In [None]:
DATA_LAKE_ROOT_PATH = f"gs://dll-data-bucket-{PROJECT_NUMBER}-{ACCOUNT_NAME}"

In [None]:
RAW_SOURCE_FQ_GCS_PATH = f"{DATA_LAKE_ROOT_PATH}/*"
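
This unit refers to three directories under the data lake root: parquet-source, parquet-cleansed, and parquet-consumable. A minimal sketch of keeping those paths in one place is shown below. The helper name `lake_paths` and the sample values `123456789` and `alice` are hypothetical; the bucket naming pattern and directory names are the ones used in this notebook.

```python
def lake_paths(project_number, account_name):
    """Return the GCS paths used in this unit, keyed by role."""
    root = f"gs://dll-data-bucket-{project_number}-{account_name}"
    return {
        "root": root,
        "source": f"{root}/parquet-source",          # preloaded raw data
        "cleansed": f"{root}/parquet-cleansed",      # written in section 6
        "consumable": f"{root}/parquet-consumable",  # written in section 7
    }


# Placeholder project number and account name, for illustration only.
paths = lake_paths("123456789", "alice")
print(paths["cleansed"])
```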

### 4. Explore the raw loans data

In [None]:
!gsutil ls -r $DATA_LAKE_ROOT_PATH

In [None]:
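# Read only the preloaded source directory, so that output directories
# written by earlier runs of this notebook are not read back in
rawDF = spark.read.parquet(f"{DATA_LAKE_ROOT_PATH}/parquet-source")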

In [None]:
rawDF.printSchema()

In [None]:
rawDF=rawDF.na.drop(subset=["addr_state"])
rawDF.createOrReplaceTempView("loans_raw")

In [None]:
# Count loans by state and loan status
spark.sql("select addr_state as state,loan_status, count(*) as loan_count from loans_raw group by addr_state,loan_status").show()

In [None]:
# How many distinct states?
spark.sql("select count(distinct addr_state) from loans_raw").show(truncate=False)

### 5. Cleanse the raw data

In [None]:
# Distinct states
spark.sql("select distinct addr_state from loans_raw").collect()

In [None]:
# Remove data with invalid states
cleasedSubsettedDF=spark.sql("select * from loans_raw where addr_state not in ('531xx','debt_consolidation')")

In [None]:
# Quick counts
count1=cleasedSubsettedDF.count()
print(f"Cleansed and subsetted row count={count1}")

count2=cleasedSubsettedDF.select("addr_state").distinct().count()
print(f"Cleansed and subsetted distinct state count={count2}")
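
The filter above removes the two invalid values found by inspection. A more general check, sketched below in plain Python, would accept only values that look like state codes. This assumes that every valid addr_state is exactly two uppercase letters, which the dataset is not guaranteed to satisfy; the helper name `is_valid_state` and the sample list are illustrative, and only '531xx' and 'debt_consolidation' come from this notebook.

```python
import re

# Assumption: a valid state code is exactly two uppercase ASCII letters.
STATE_CODE = re.compile(r"[A-Z]{2}")


def is_valid_state(value):
    """Return True when value is a two-letter uppercase code."""
    return value is not None and STATE_CODE.fullmatch(value) is not None


# Illustrative values; the two invalid strings are the ones filtered out above.
sample = ["CA", "NY", "531xx", "debt_consolidation", None, "TX"]
valid = [s for s in sample if is_valid_state(s)]
print(valid)  # ['CA', 'NY', 'TX']
```

In Spark, a comparable filter could be written with `Column.rlike` and the anchored pattern `^[A-Z]{2}$`, since `rlike` matches anywhere in the string unless anchored.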

### 6. Persist the cleansed data to the data lake as Parquet and create an external table definition on it

In [None]:
# Persist the cleaned data
cleasedSubsettedDF.coalesce(3).write.format("parquet").mode("overwrite").save(f"{DATA_LAKE_ROOT_PATH}/parquet-cleansed")

In [None]:
# Check if we are using the Dataproc Metastore
spark.sparkContext.getConf().get("spark.hive.metastore.uris")

In [None]:
# List the existing databases
spark.sql("SHOW DATABASES;").show(truncate=False)

In [None]:
# Create a database if it does not exist already
spark.sql("CREATE DATABASE IF NOT EXISTS "+ ACCOUNT_NAME +"_loan_db;").show(truncate=False)

In [None]:
# Create an external table definition on the Parquet files
spark.sql("DROP TABLE IF EXISTS "+ ACCOUNT_NAME +"_loan_db.loans_cleansed_parquet;").show(truncate=False)
spark.sql(f"CREATE TABLE {ACCOUNT_NAME}_loan_db.loans_cleansed_parquet USING parquet LOCATION '{DATA_LAKE_ROOT_PATH}/parquet-cleansed';").show(truncate=False)
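
The database name is derived from ACCOUNT_NAME in several statements, so building the qualified table name in one place helps keep the DROP and CREATE statements pointed at the same table. The sketch below is one possible way to do that, not part of the lab itself. The helper names `loan_table` and `create_external_table_sql` are hypothetical, and the identifier check is a deliberately strict assumption (letters, digits, and underscores only).

```python
import re

# Assumption: names are restricted to plain, unquoted SQL identifiers.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def loan_table(account_name, table):
    """Return '<account>_loan_db.<table>' after validating both parts."""
    database = f"{account_name}_loan_db"
    for part in (database, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a plain SQL identifier: {part!r}")
    return f"{database}.{table}"


def create_external_table_sql(account_name, table, location):
    """Build the CREATE TABLE statement for Parquet files at a location."""
    return (
        f"CREATE TABLE {loan_table(account_name, table)} "
        f"USING parquet LOCATION '{location}'"
    )


# "alice" and the bucket path are placeholders for illustration.
statement = create_external_table_sql(
    "alice", "loans_cleansed_parquet", "gs://example-bucket/parquet-cleansed"
)
print(statement)
```

The resulting string could then be passed to `spark.sql(...)`. An account name containing a hyphen would be rejected here rather than producing a statement that fails later.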

In [None]:
# Review what's in the data lake
!gsutil ls -r $DATA_LAKE_ROOT_PATH

### 7. Create a Parquet table on the cleansed Parquet dataset

In [None]:
# Remove any residual files from potential prior run
!gsutil rm -rf $DATA_LAKE_ROOT_PATH/parquet-consumable

In [None]:
# Create a Parquet table of loan counts by state from the cleansed data
spark.sql("DROP TABLE IF EXISTS "+ ACCOUNT_NAME +"_loan_db.loans_by_state_parquet;").show(truncate=False)
spark.sql(f"CREATE TABLE {ACCOUNT_NAME}_loan_db.loans_by_state_parquet USING parquet LOCATION '{DATA_LAKE_ROOT_PATH}/parquet-consumable' AS SELECT addr_state, count(loan_status) as count FROM {ACCOUNT_NAME}_loan_db.loans_cleansed_parquet GROUP BY addr_state;")

In [None]:
# Check the Dataproc Metastore for the new table
spark.sql("show tables from "+ ACCOUNT_NAME +"_loan_db;").show(truncate=False)

In [None]:
# List some data
spark.sql("select * from "+ ACCOUNT_NAME +"_loan_db.loans_by_state_parquet").show(truncate=False)
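
The table stores count(loan_status) per addr_state. In SQL, count over a column skips NULL values, so a loan with a missing status does not contribute to its state's total. The plain-Python sketch below mirrors that aggregation on a handful of made-up rows; the state and status values are illustrative and not taken from the dataset.

```python
from collections import Counter

# Made-up (addr_state, loan_status) rows; None stands in for SQL NULL.
rows = [
    ("CA", "Current"),
    ("CA", None),
    ("NY", "Fully Paid"),
    ("CA", "Charged Off"),
]

# Equivalent of: SELECT addr_state, count(loan_status) ... GROUP BY addr_state
counts = Counter(state for state, status in rows if status is not None)
loans_by_state = dict(counts)
print(loans_by_state)  # {'CA': 2, 'NY': 1}
```

Using count(*) instead would count the row with the missing status as well, giving 3 for CA in this example.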

### 8. Review what is in the data lake

Review cell #8 (the first listing, in section 4). There was just one directory - parquet-source.

Next, review cell #19 (the listing at the end of section 6). A directory called parquet-cleansed was added.

At the end of this notebook, we also have a parquet-consumable directory.

In [None]:
!gsutil ls -r $DATA_LAKE_ROOT_PATH
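
To compare listings without scanning every object path, the top-level directory names can be pulled out of the `gsutil ls -r` output. The sketch below is a minimal illustration that works on a list of output lines, which is what IPython returns when a `!` command is assigned to a variable. The helper name `top_level_dirs`, the bucket name, and the sample lines are all hypothetical.

```python
def top_level_dirs(listing, root):
    """Return the sorted directory names found directly under root."""
    prefix = root.rstrip("/") + "/"
    names = set()
    for line in listing:
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip blank lines and anything outside the root
        first, sep, _ = line[len(prefix):].partition("/")
        if sep and first:  # a slash follows, so 'first' is a directory
            names.add(first)
    return sorted(names)


# Hypothetical listing lines, shaped like recursive gsutil output.
root = "gs://example-bucket"
listing = [
    "gs://example-bucket/parquet-source/:",
    "gs://example-bucket/parquet-source/part-0000.parquet",
    "",
    "gs://example-bucket/parquet-cleansed/:",
    "gs://example-bucket/parquet-cleansed/_SUCCESS",
]
print(top_level_dirs(listing, root))  # ['parquet-cleansed', 'parquet-source']
```

In the notebook, `listing = !gsutil ls -r $DATA_LAKE_ROOT_PATH` followed by `top_level_dirs(listing, DATA_LAKE_ROOT_PATH)` would give the directory names for the current state of the data lake.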

We will use the data under the parquet-consumable directory in the next unit, and create a Delta table off of it.

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK