# Lab: Write a MultiKey Partitioned Table to an S3 Data Lake Using Apache Hudi

## Table of Contents:

1. [Overview](#Overview)
2. [Working with Partitioned Tables](#Working-with-Partitioned-Tables)

## Overview

This notebook demonstrates how to use PySpark with [Apache Hudi](https://aws.amazon.com/emr/features/hudi/) on Amazon EMR to write a MultiKey partitioned table to an S3 data lake.

It covers the following concept when writing Copy-On-Write tables to an S3 data lake:

- Writing a MultiKey partitioned table



### This demo is based on Hudi version 0.8.0 and runs on a Jupyter notebook connected to a single-node (r5.4xlarge) EMR cluster with the configuration listed below

 - EMR version 6.5.0
 
 - Software configuration

       - Hadoop 3.2.1
       - Hive 3.1.2
       - Livy 0.7.1
       - JupyterHub 1.4.1
       - Spark 3.1.2
       
       
 - AWS Glue Data Catalog settings: select the following check boxes
       - Use for Hive table metadata  
       - Use for Spark table metadata



### Connect to the master node of the EMR cluster using SSH:
    - ssh -i ~/xxxx.pem hadoop@<ec2-xx-xxx-xx-xx.us-west-2.compute.amazonaws.com>

    - Ensure that the files listed below are copied into HDFS:

    - hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar hdfs:///user/hadoop/

    - hdfs dfs -copyFromLocal /usr/lib/spark/external/lib/spark-avro.jar hdfs:///user/hadoop/

    - hdfs dfs -copyFromLocal /usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar hdfs:///user/hadoop/
      (see https://github.com/apache/hudi/issues/5053)

Let's start by initializing the Spark Session to connect this notebook to our Spark EMR cluster:

In [1]:
%%configure -f
{
    "conf":  { 
             "spark.jars":"hdfs:///user/hadoop/aws-java-sdk-bundle-1.12.31.jar,hdfs:///user/hadoop/hudi-spark-bundle.jar,hdfs:///user/hadoop/spark-avro.jar",
             "spark.sql.hive.convertMetastoreParquet":"false",     
             "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
             "spark.dynamicAllocation.executorIdleTimeout": 3600,
             "spark.executor.memory": "5G",
             "spark.executor.cores": 4,
             "spark.dynamicAllocation.initialExecutors":5
           } 
}



Define the constants used by the Python code:

In [2]:
# General Constants
HUDI_FORMAT = "org.apache.hudi"
TABLE_NAME = "hoodie.table.name"
RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
UPSERT_OPERATION_OPT_VAL = "upsert"
BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
S3_CONSISTENCY_CHECK = "hoodie.consistency.check.enabled"
HUDI_CLEANER_POLICY = "hoodie.cleaner.policy"
KEEP_LATEST_COMMITS = "KEEP_LATEST_COMMITS"
HUDI_COMMITS_RETAINED = "hoodie.cleaner.commits.retained"
PAYLOAD_CLASS_OPT_KEY = "hoodie.datasource.write.payload.class"
EMPTY_PAYLOAD_CLASS_OPT_VAL = "org.apache.hudi.EmptyHoodieRecordPayload"

# Hive Constants
HIVE_SYNC_ENABLED_OPT_KEY="hoodie.datasource.hive_sync.enable"
HIVE_PARTITION_FIELDS_OPT_KEY="hoodie.datasource.hive_sync.partition_fields"
HIVE_ASSUME_DATE_PARTITION_OPT_KEY="hoodie.datasource.hive_sync.assume_date_partitioning"
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY="hoodie.datasource.hive_sync.partition_extractor_class"
HIVE_TABLE_OPT_KEY="hoodie.datasource.hive_sync.table"

# Partition Constants
NONPARTITION_EXTRACTOR_CLASS_OPT_VAL="org.apache.hudi.hive.NonPartitionedExtractor"
MULTIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL="org.apache.hudi.hive.MultiPartKeysValueExtractor"
KEYGENERATOR_CLASS_OPT_KEY="hoodie.datasource.write.keygenerator.class"
NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL="org.apache.hudi.NonpartitionedKeyGenerator"
COMPLEX_KEYGENERATOR_CLASS_OPT_VAL="org.apache.hudi.ComplexKeyGenerator"
PARTITIONPATH_FIELD_OPT_KEY="hoodie.datasource.write.partitionpath.field"
CUSTOM_KEY_GENERATOR_CLASS_OPT_VAL="org.apache.hudi.keygen.CustomKeyGenerator"

# Incremental Constants
VIEW_TYPE_OPT_KEY="hoodie.datasource.view.type"
BEGIN_INSTANTTIME_OPT_KEY="hoodie.datasource.read.begin.instanttime"
VIEW_TYPE_INCREMENTAL_OPT_VAL="incremental"
END_INSTANTTIME_OPT_KEY="hoodie.datasource.read.end.instanttime"


Starting Spark application



SparkSession available as 'spark'.



Functions to generate JSON data and create a Spark DataFrame from it:

In [3]:
## Generates Data

from datetime import datetime

def get_json_data(start, count, dest):
    time_stamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    data = [{"trip_id": i, "tstamp": time_stamp, "route_id": chr(65 + (i % 10)), "destination": dest[i%10]} for i in range(start, start + count)]
    return data

# Creates the Dataframe
def create_json_df(spark, data):
    sc = spark.sparkContext
    return spark.read.json(sc.parallelize(data))
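
To see what the generator produces without starting a Spark session, the sketch below repeats the `get_json_data` logic and prints a few records. The `sample_dest` list and the start offset of 8 are illustrative choices, not values from the lab; they are picked so that the `i % 10` wrap-around from route `J` back to route `A` is visible.

```python
from datetime import datetime

def get_json_data(start, count, dest):
    # Same logic as the notebook cell: one timestamp is shared by the whole batch
    time_stamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    return [
        {"trip_id": i, "tstamp": time_stamp,
         "route_id": chr(65 + (i % 10)), "destination": dest[i % 10]}
        for i in range(start, start + count)
    ]

# Hypothetical destination list, used only for this illustration
sample_dest = ["City%d" % n for n in range(10)]

sample = get_json_data(8, 4, sample_dest)
for row in sample:
    print(row["trip_id"], row["route_id"], row["destination"])
```

Because `route_id` depends only on `trip_id % 10`, every batch spreads its records evenly over the ten routes `A` to `J`.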



## Working with Partitioned Tables

Let's do the same thing with partitioned tables. For this demo, we use `route_id` and `tstamp` (formatted as yyyy-MM-dd) as the partition fields. You can also use a nested partition structure such as yyyy/MM/dd.

In [4]:
## CHANGE ME ##
config = {
    "table_name": "hudi_partitioned_trips_table",
    "target": "s3://<Your S3 Bucket Here>/hudi/hudi_partitioned_trips_table",
    "primary_key": "trip_id",
    "sort_key": "tstamp",
    "commits_to_retain": "2",
    "partition_keys" : "routeid,tstamp"
}
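
Since the `target` above ships with a placeholder bucket, it can help to check the dictionary before any write is attempted. The sketch below is one possible check; `check_config` is a hypothetical helper and `my-example-bucket` is a made-up bucket name, neither is part of the lab.

```python
def check_config(config):
    """Return a list of human-readable problems found in the lab's config dict."""
    problems = []
    required = ["table_name", "target", "primary_key", "sort_key",
                "commits_to_retain", "partition_keys"]
    for key in required:
        if not config.get(key):
            problems.append("missing value for '%s'" % key)
    target = config.get("target", "")
    if "<" in target or ">" in target:
        problems.append("target still contains a placeholder: %s" % target)
    if target and not target.startswith("s3://"):
        problems.append("target is not an s3:// path: %s" % target)
    return problems

# The config as shipped in the notebook, placeholder included
unedited = {
    "table_name": "hudi_partitioned_trips_table",
    "target": "s3://<Your S3 Bucket Here>/hudi/hudi_partitioned_trips_table",
    "primary_key": "trip_id",
    "sort_key": "tstamp",
    "commits_to_retain": "2",
    "partition_keys": "routeid,tstamp",
}
# The same config with a made-up bucket name filled in
edited = dict(unedited, target="s3://my-example-bucket/hudi/hudi_partitioned_trips_table")

print(check_config(unedited))
print(check_config(edited))
```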


Let's generate the data:

In [5]:
part_dest = ["Seattle", "New York", "New Jersey", "Los Angeles", "Las Vegas", "Tucson","Washington DC","Philadelphia","Miami","San Francisco"]
df1 = create_json_df(spark, get_json_data(0, 2000000, part_dest))
df1.show(5,False)


+-----------+--------+-------+-------------------+
|destination|route_id|trip_id|tstamp             |
+-----------+--------+-------+-------------------+
|Seattle    |A       |0      |2022-03-26 03:18:48|
|New York   |B       |1      |2022-03-26 03:18:48|
|New Jersey |C       |2      |2022-03-26 03:18:48|
|Los Angeles|D       |3      |2022-03-26 03:18:48|
|Las Vegas  |E       |4      |2022-03-26 03:18:48|
+-----------+--------+-------+-------------------+
only showing top 5 rows

We can now write the data to S3. Hudi provides various [key generators](https://hudi.apache.org/docs/writing_data#key-generation). We will use CustomKeyGenerator, which lets each partition path field be tagged with one of two types: SIMPLE or TIMESTAMP. When using TIMESTAMP, you also need to provide the supporting [key generator](https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#key-generators) configs that match your input type. The partition-related options are:

```
      .option(HIVE_PARTITION_FIELDS_OPT_KEY, config["partition_keys"])
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,MULTIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,CUSTOM_KEY_GENERATOR_CLASS_OPT_VAL) 
      .option(PARTITIONPATH_FIELD_OPT_KEY,"route_id:simple, tstamp:timestamp")
```
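
To make the effect of the `route_id:simple, tstamp:timestamp` spec concrete, the sketch below emulates in plain Python how one record could map to a partition path. It is not Hudi's implementation: `partition_path` is a hypothetical function, and it uses Python `strptime`/`strftime` patterns in place of the Java-style date patterns that Hudi expects.

```python
from datetime import datetime

def partition_path(record, spec, input_fmt, output_fmt, hive_style=True):
    """Emulate mapping one record to a partition path from a 'field:type' spec."""
    parts = []
    for item in spec.split(","):
        field, kind = item.strip().split(":")
        value = record[field]
        if kind.lower() == "timestamp":
            # Re-format the timestamp string, e.g. down to day granularity
            value = datetime.strptime(value, input_fmt).strftime(output_fmt)
        # Hive-style partitioning prefixes each level with "<field>="
        parts.append("%s=%s" % (field, value) if hive_style else str(value))
    return "/".join(parts)

record = {"trip_id": 0, "route_id": "A",
          "tstamp": "2022-03-26 03:18:48", "destination": "Seattle"}
spec = "route_id:simple, tstamp:timestamp"

flat = partition_path(record, spec, "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")
nested = partition_path(record, spec, "%Y-%m-%d %H:%M:%S", "%Y/%m/%d", hive_style=False)
print(flat)
print(nested)
```

The first call mirrors the layout used in this lab (one directory level per partition field); the second shows the nested year/month/day variant.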


In [6]:
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
      .option(BULK_INSERT_PARALLELISM, 5)
      .option(HIVE_PARTITION_FIELDS_OPT_KEY, config["partition_keys"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,MULTIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,CUSTOM_KEY_GENERATOR_CLASS_OPT_VAL)    
      .option(PARTITIONPATH_FIELD_OPT_KEY, "route_id:simple, tstamp:timestamp")
      .option("hoodie.deltastreamer.keygen.timebased.timestamp.type","DATE_STRING")
      .option("hoodie.deltastreamer.keygen.timebased.output.dateformat","yyyy-MM-dd")
      .option("hoodie.deltastreamer.keygen.timebased.input.dateformat","yyyy-MM-dd HH:mm:ss")  # HH = hour 00-23, matching %H in get_json_data
      .option("hoodie.datasource.write.hive_style_partitioning", "true")
      .mode("Overwrite")
      .save(config['target']))


In [7]:
spark.sql("describe table "+config['table_name']).show(20,False)


+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|_hoodie_commit_time    |string   |null   |
|_hoodie_commit_seqno   |string   |null   |
|_hoodie_record_key     |string   |null   |
|_hoodie_partition_path |string   |null   |
|_hoodie_file_name      |string   |null   |
|destination            |string   |null   |
|route_id               |string   |null   |
|trip_id                |bigint   |null   |
|routeid                |string   |null   |
|tstamp                 |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|routeid                |string   |null   |
|tstamp                 |string   |null   |
+-----------------------+---------+-------+

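We can see that the partition fields are present in our Hive table. The first Hive partition column is named `routeid` because that is the name given in `config["partition_keys"]`; the data column `route_id` is kept alongside it.

```
Partition Information
col_name : routeid
col_name : tstamp
```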

Let's now query the data and group by the partition columns:

In [8]:
result_df=spark.sql("select route_id,tstamp, count(*) as num_trips from "+config['table_name']+" group by route_id, tstamp order by route_id")
result_df.show(20,False)


+--------+----------+---------+
|route_id|tstamp    |num_trips|
+--------+----------+---------+
|A       |2022-03-26|200000   |
|B       |2022-03-26|200000   |
|C       |2022-03-26|200000   |
|D       |2022-03-26|200000   |
|E       |2022-03-26|200000   |
|F       |2022-03-26|200000   |
|G       |2022-03-26|200000   |
|H       |2022-03-26|200000   |
|I       |2022-03-26|200000   |
|J       |2022-03-26|200000   |
+--------+----------+---------+
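
The even split of 200000 trips per route follows directly from the generator's `chr(65 + (i % 10))` rule. A minimal check in plain Python, independent of Spark and Hudi:

```python
from collections import Counter

total_trips = 2000000

# Apply the generator's route_id rule to every trip_id and count per route
counts = Counter(chr(65 + (i % 10)) for i in range(total_trips))

for route_id in sorted(counts):
    print(route_id, counts[route_id])
```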

Let's check the S3 path. The listing below comes from a run on 2021-12-09, so its dates differ from the query output above:

```
$ aws s3 ls s3://<S3 Bucket>/hudi/hudi_partitioned_trips_table/
                           PRE .hoodie/
                           PRE route_id=A/
                           PRE route_id=B/
                           PRE route_id=C/
                           PRE route_id=D/
                           PRE route_id=E/
                           PRE route_id=F/
                           PRE route_id=G/
                           PRE route_id=H/
                           PRE route_id=I/
                           PRE route_id=J/
2021-12-09 11:32:42          0 .hoodie_$folder$
2021-12-09 11:34:12          0 route_id=A_$folder$
2021-12-09 11:34:17          0 route_id=B_$folder$
2021-12-09 11:34:12          0 route_id=C_$folder$
2021-12-09 11:34:15          0 route_id=D_$folder$
2021-12-09 11:34:18          0 route_id=E_$folder$
2021-12-09 11:34:23          0 route_id=F_$folder$
2021-12-09 11:34:23          0 route_id=G_$folder$
2021-12-09 11:34:26          0 route_id=H_$folder$
2021-12-09 11:34:30          0 route_id=I_$folder$
2021-12-09 11:34:33          0 route_id=J_$folder$


$ aws s3 ls s3://<S3 Bucket>/hudi/hudi_partitioned_trips_table/route_id=A/
                           PRE tstamp=2021-12-09/
2021-12-09 11:34:13          0 tstamp=2021-12-09_$folder$


$ aws s3 ls s3://<S3 Bucket>/hudi/hudi_partitioned_trips_table/route_id=A/tstamp=2021-12-09/
2021-12-09 11:34:13         93 .hoodie_partition_metadata
2021-12-09 11:34:17    1729933 39d96271-824b-449c-98ae-e3f931bf8cee-0_0-39-112_20211209193240.parquet
```
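
The object keys in the listing carry both the partition values and the commit that wrote each file. The sketch below splits one key into those parts. `parse_hudi_key` is a hypothetical helper, and the `<fileId>_<writeToken>_<instantTime>` file-name layout is assumed here from the listing rather than taken from the lab text.

```python
def parse_hudi_key(key):
    """Split a key (relative to the table base path) into partition values and file-name parts."""
    *dirs, file_name = key.split("/")
    # Hive-style directories look like "<field>=<value>"
    partitions = dict(d.split("=", 1) for d in dirs if "=" in d)
    stem = file_name.rsplit(".", 1)[0]
    file_id, write_token, instant_time = stem.split("_")
    return partitions, {
        "file_id": file_id,
        "write_token": write_token,
        "instant_time": instant_time,
    }

key = ("route_id=A/tstamp=2021-12-09/"
       "39d96271-824b-449c-98ae-e3f931bf8cee-0_0-39-112_20211209193240.parquet")
partitions, file_parts = parse_hudi_key(key)
print(partitions)
print(file_parts)
```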



## Other operations (Insert, Upsert, etc.) behave the same way on partitioned tables

Let's insert new records into the partitioned table. Before that, let's take a look at the record count.

In [10]:
spark.sql("select count(*) from " +config['table_name'] ).show(20,False)


+--------+
|count(1)|
+--------+
|2000000 |
+--------+

In [11]:
insert_dest = ["Syracuse"] * 10
df2 = create_json_df(spark, get_json_data(2000000, 10, insert_dest))
df2.show()


+-----------+--------+-------+-------------------+
|destination|route_id|trip_id|             tstamp|
+-----------+--------+-------+-------------------+
|   Syracuse|       A|2000000|2022-03-26 03:22:03|
|   Syracuse|       B|2000001|2022-03-26 03:22:03|
|   Syracuse|       C|2000002|2022-03-26 03:22:03|
|   Syracuse|       D|2000003|2022-03-26 03:22:03|
|   Syracuse|       E|2000004|2022-03-26 03:22:03|
|   Syracuse|       F|2000005|2022-03-26 03:22:03|
|   Syracuse|       G|2000006|2022-03-26 03:22:03|
|   Syracuse|       H|2000007|2022-03-26 03:22:03|
|   Syracuse|       I|2000008|2022-03-26 03:22:03|
|   Syracuse|       J|2000009|2022-03-26 03:22:03|
+-----------+--------+-------+-------------------+

In [12]:
(df2.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 10)
      .option(HIVE_PARTITION_FIELDS_OPT_KEY, config["partition_keys"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,MULTIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,CUSTOM_KEY_GENERATOR_CLASS_OPT_VAL)    
      .option(PARTITIONPATH_FIELD_OPT_KEY, "route_id:simple, tstamp:timestamp")
      .option("hoodie.deltastreamer.keygen.timebased.timestamp.type","DATE_STRING")
      .option("hoodie.deltastreamer.keygen.timebased.output.dateformat","yyyy-MM-dd")
      .option("hoodie.deltastreamer.keygen.timebased.input.dateformat","yyyy-MM-dd HH:mm:ss")  # HH = hour 00-23, matching %H in get_json_data
      .option("hoodie.datasource.write.hive_style_partitioning", "true")
      .mode("Append")
      .save(config['target']))
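
The bulk insert and the upsert in this lab pass the same options apart from the operation, the parallelism setting and the save mode. One way to avoid repeating them is to collect them in a dictionary. The sketch below uses literal option keys so that it runs on its own; `hudi_write_options` is a hypothetical helper, not part of Hudi or PySpark.

```python
def hudi_write_options(config, operation, parallelism):
    """Build the option dict shared by the lab's partitioned writes."""
    parallelism_key = (
        "hoodie.bulkinsert.shuffle.parallelism"
        if operation == "bulk_insert"
        else "hoodie.upsert.shuffle.parallelism"
    )
    return {
        "hoodie.table.name": config["table_name"],
        "hoodie.datasource.write.recordkey.field": config["primary_key"],
        "hoodie.datasource.write.precombine.field": config["sort_key"],
        "hoodie.datasource.write.operation": operation,
        parallelism_key: str(parallelism),
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.table": config["table_name"],
        "hoodie.datasource.hive_sync.partition_fields": config["partition_keys"],
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.CustomKeyGenerator",
        "hoodie.datasource.write.partitionpath.field": "route_id:simple, tstamp:timestamp",
        "hoodie.deltastreamer.keygen.timebased.timestamp.type": "DATE_STRING",
        # Assumption: HH (hour 00-23) mirrors the %H used by get_json_data
        "hoodie.deltastreamer.keygen.timebased.input.dateformat": "yyyy-MM-dd HH:mm:ss",
        "hoodie.deltastreamer.keygen.timebased.output.dateformat": "yyyy-MM-dd",
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }

config = {
    "table_name": "hudi_partitioned_trips_table",
    "target": "s3://<Your S3 Bucket Here>/hudi/hudi_partitioned_trips_table",
    "primary_key": "trip_id",
    "sort_key": "tstamp",
    "commits_to_retain": "2",
    "partition_keys": "routeid,tstamp",
}
opts = hudi_write_options(config, "upsert", 10)

# In the notebook, the dict could then be passed to the writer in one call:
# df2.write.format("org.apache.hudi").options(**opts).mode("Append").save(config["target"])
```

Passing `"bulk_insert"` with mode `"Overwrite"` would cover the initial load in the same way.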


In [13]:
spark.sql("select count(*) from " +config['table_name'] ).show(20,False)


+--------+
|count(1)|
+--------+
|2000010 |
+--------+
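
The count grows by exactly 10 because all ten `trip_id` values (2000000 to 2000009) were new, so the upsert acted as an insert for each of them: 2000000 + 10 = 2000010. The sketch below emulates that key-based behavior on a scaled-down table. It is a simplification that ignores the precombine field, the index and the partition path, and `upsert` here is a hypothetical function, not a Hudi API.

```python
def upsert(table, batch, key_field):
    """Insert records with new keys and replace records whose key already exists."""
    for record in batch:
        table[record[key_field]] = record
    return table

# Scaled-down stand-in for the table: 20 existing trips instead of 2000000
table = {i: {"trip_id": i, "destination": "Seattle"} for i in range(20)}

# Ten records with keys that are not in the table yet
new_batch = [{"trip_id": i, "destination": "Syracuse"} for i in range(20, 30)]
upsert(table, new_batch, "trip_id")
count_after_new_keys = len(table)

# Re-applying the same keys updates in place and leaves the count unchanged
upsert(table, new_batch, "trip_id")
count_after_repeat = len(table)

print(count_after_new_keys, count_after_repeat)
```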

In [14]:
result_df=spark.sql("select route_id,tstamp, count(*) as num_trips from "+config['table_name']+" group by route_id, tstamp order by route_id")
result_df.show(20,False)


+--------+----------+---------+
|route_id|tstamp    |num_trips|
+--------+----------+---------+
|A       |2022-03-26|200001   |
|B       |2022-03-26|200001   |
|C       |2022-03-26|200001   |
|D       |2022-03-26|200001   |
|E       |2022-03-26|200001   |
|F       |2022-03-26|200001   |
|G       |2022-03-26|200001   |
|H       |2022-03-26|200001   |
|I       |2022-03-26|200001   |
|J       |2022-03-26|200001   |
+--------+----------+---------+