<a href="https://colab.research.google.com/github/DenysNunes/data-examples/blob/main/spark/intermediate/BasicHudiUpsert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hudi table upsert basic example

A basic example that uses Apache Hudi to create a table that supports upserts. <br>
It follows the approach described in the [Hudi quick-start guide](https://hudi.apache.org/docs/quick-start-guide).

## Init spark

Create a local Spark session with the Hudi and Avro packages, the Kryo serializer, and Hive support enabled.

In [1]:
!pip install -q pyspark==3.1.1
!sudo apt install tree
!rm -rf /tmp/hudi/persons/

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("New Session Example") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-spark-bundle_2.12:0.7.0") \
    .config("spark.sql.hive.convertMetastoreParquet", "false") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("DROP TABLE IF EXISTS tb_person")



Reading package lists... Done
Building dependency tree       
Reading state information... Done
tree is already the newest version (1.7.0-5).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


DataFrame[]

## Creating two dataframes with different timestamps

In [3]:
from pyspark.sql import Row
from datetime import datetime
import time

df1 = spark.createDataFrame([
        Row(id=1, name='John', ts=time.mktime(datetime.now().timetuple())),
        Row(id=2, name='Maria', ts=time.mktime(datetime.now().timetuple())),
        Row(id=3, name='Ben', ts=time.mktime(datetime.now().timetuple()))
])

# Wait so that df2 gets a later ts than df1 (ts has one-second resolution)
time.sleep(5)

df2 = spark.createDataFrame([
        Row(id=1, name='Ana', ts=time.mktime(datetime.now().timetuple())),
])

In [30]:
hudi_options = {
    'hoodie.table.name': "tb_person",
    # Record key: rows with the same id are treated as the same record
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': "tb_person",
    'hoodie.datasource.write.operation': 'upsert',
    # Precombine field: among rows with the same key in one batch, the largest ts wins
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.combine.before.upsert': True,
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

df1.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("/tmp/hudi/persons/")

## After the first write: ID 1 = John

In [31]:
spark.read.load(format='hudi', path='/tmp/hudi/persons/default/').show()

+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id| name|           ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+
|     20220106215535|  20220106215535_0_5|                 2|               default|73fdfb94-015a-4b1...|  2|Maria|1.641503146E9|
|     20220106215535|  20220106215535_0_6|                 1|               default|73fdfb94-015a-4b1...|  1| John|1.641503146E9|
|     20220106215535|  20220106215535_0_7|                 3|               default|73fdfb94-015a-4b1...|  3|  Ben|1.641503146E9|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+



In [6]:
df2.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("/tmp/hudi/persons/")

## After the upsert: ID 1 = Ana

In [7]:
spark.read.load(format='hudi', path='/tmp/hudi/persons/default/').show()

+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id| name|           ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+
|     20220106210556|  20220106210556_0_1|                 2|               default|73fdfb94-015a-4b1...|  2|Maria|1.641503146E9|
|     20220106210607|  20220106210607_0_4|                 1|               default|73fdfb94-015a-4b1...|  1|  Ana|1.641503151E9|
|     20220106210556|  20220106210556_0_3|                 3|               default|73fdfb94-015a-4b1...|  3|  Ben|1.641503146E9|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+
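
The change from John to Ana for ID 1, with IDs 2 and 3 untouched, can be reproduced with a rough mental model in plain Python. This is only a sketch of the semantics configured above, not Hudi's implementation: the incoming batch is first deduplicated per record key, keeping the row with the largest precombine value, and the surviving rows then replace any stored rows with the same key. The `precombine` and `upsert` helpers, the dictionary-based "table", and the `ts` values are hypothetical and exist only for illustration.

```python
def precombine(batch, key="id", order="ts"):
    """Deduplicate a batch: per key, keep the record with the largest `order` value."""
    latest = {}
    for rec in batch:
        current = latest.get(rec[key])
        if current is None or rec[order] >= current[order]:
            latest[rec[key]] = rec
    return latest


def upsert(table, batch, key="id", order="ts"):
    """Return a new table where deduplicated batch rows insert or replace by key."""
    merged = dict(table)  # copy, so the previous table state is left untouched
    merged.update(precombine(batch, key, order))
    return merged


# First write: three new keys are inserted (ts values are made up).
table = upsert({}, [
    {"id": 1, "name": "John", "ts": 100.0},
    {"id": 2, "name": "Maria", "ts": 100.0},
    {"id": 3, "name": "Ben", "ts": 100.0},
])

# Second write: key 1 already exists, so its row is replaced.
table = upsert(table, [{"id": 1, "name": "Ana", "ts": 105.0}])

names = {k: v["name"] for k, v in sorted(table.items())}
print(names)  # {1: 'Ana', 2: 'Maria', 3: 'Ben'}
```

Keys that are absent from the incoming batch keep their stored rows, which is why Maria and Ben still carry the earlier commit time in the table above.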



In [7]:
!tree /tmp/hudi/persons/

/tmp/hudi/persons/
└── default
    ├── dde821ac-fdb2-4c60-ac1c-a1321d16a407-0_0-21-22_20220106205459.parquet
    └── dde821ac-fdb2-4c60-ac1c-a1321d16a407-0_0-54-54_20220106205510.parquet

1 directory, 2 files
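
The two Parquet files share the same prefix and differ only in their last two segments. Assuming the usual Hudi base-file naming of `<fileId>_<writeToken>_<instantTime>.parquet`, the small parser below (a hypothetical helper, not part of Hudi's API) splits the names listed above and suggests that the upsert wrote a new version of the same file group rather than a separate file group, which is what copy-on-write, the default table type, would produce.

```python
from datetime import datetime


def parse_base_file(name):
    """Split a base-file name into (file_id, write_token, commit instant)."""
    stem = name.rsplit(".", 1)[0]  # drop the .parquet extension
    file_id, write_token, instant = stem.split("_")
    return file_id, write_token, datetime.strptime(instant, "%Y%m%d%H%M%S")


files = [
    "dde821ac-fdb2-4c60-ac1c-a1321d16a407-0_0-21-22_20220106205459.parquet",
    "dde821ac-fdb2-4c60-ac1c-a1321d16a407-0_0-54-54_20220106205510.parquet",
]

parsed = [parse_base_file(f) for f in files]
same_group = parsed[0][0] == parsed[1][0]
gap_seconds = (parsed[1][2] - parsed[0][2]).total_seconds()

print(same_group)   # True: both files belong to one file group
print(gap_seconds)  # 11.0 seconds between the two commits
```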


## Creating table in metastore

In [8]:
spark.sql("""

CREATE EXTERNAL TABLE `tb_person`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` long,
  `name` string,
  `ts` double)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/tmp/hudi/persons/default'
""")

DataFrame[]

In [27]:
spark.sql("""

select id,
       name, 
       from_unixtime(ts,'MM-dd-yyyy HH:mm:ss') as ts 
from  tb_person
order by id

""").show(20, False)

+---+-----+-------------------+
|id |name |ts                 |
+---+-----+-------------------+
|1  |Ana  |01-06-2022 21:05:51|
|2  |Maria|01-06-2022 21:05:46|
|3  |Ben  |01-06-2022 21:05:46|
+---+-----+-------------------+
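
The `from_unixtime` call in the query turns the epoch seconds stored in `ts` into a formatted string. As a cross-check, the same conversion can be sketched in plain Python, assuming the Spark session time zone was UTC (the output above is consistent with that, but the notebook does not state it). The `format_epoch` helper is a hypothetical name used only here.

```python
from datetime import datetime, timezone


def format_epoch(ts, tz=timezone.utc):
    """Format epoch seconds as MM-dd-yyyy HH:mm:ss in the given time zone."""
    return datetime.fromtimestamp(ts, tz=tz).strftime("%m-%d-%Y %H:%M:%S")


formatted = format_epoch(1.641503151e9)  # ts of the upserted row (ID 1)
print(formatted)  # 01-06-2022 21:05:51
```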

