## Hudi on Snurran
This notebook tests the read and write times of Apache Hudi on HopsFS, accessed through Hopsworks, deployed on Snurran for better performance (thanks to NVMe disks).
The procedure is the same one used to test PyIceberg with both Polars and Pandas.

In [None]:
# ONLY ONCE!
# Get the NYC Taxi dataset from the network
!curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o "/home/yarnapp/hopsfs/Resources/nyc_taxiparquet"

In [None]:
!pip install pandas --upgrade
!pip install polars

In [1]:
import hopsworks
import numpy as np
import pandas as pd
import polars as pl
import os
import importlib
import time
import math
import string
import random
import sys
import requests
import re

In [2]:
def column_renamer(df):
    '''
    Given a dataframe, renames all columns to lowercase so that the dataframe can be saved on Hopsworks
    through a Feature Group.
    '''
    
    # Rename every column in a single call instead of one rename per column
    df.rename(columns=str.lower, inplace=True)
    
    return df

In [3]:
def time_printer(total_time):
    # Round up to whole seconds first, so the seconds part can never show as 60
    minutes, seconds = divmod(math.ceil(total_time), 60)
    
    to_print = ""
    if (minutes > 0):
        to_print = str(minutes) + "m "
        
    to_print = to_print + str(seconds) +"s "
    return to_print

In [4]:
def extract_time(text):
    pattern = r'(\d+m\s*)?\d+(\.\d+)?s'
    match = re.search(pattern, text)
    if match:
        return match.group()
    else:
        return None
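
As a quick check of `extract_time`, the sketch below applies the same pattern to a sample string and converts the match to seconds so that timings can be compared numerically. The sample log line and the `to_seconds` helper are illustrative assumptions; they are not part of the original notebook or of Hopsworks.

```python
import re

# Same pattern as in extract_time: optional "<N>m" part, then "<N>[.<N>]s"
TIME_PATTERN = r'(\d+m\s*)?\d+(\.\d+)?s'

def extract_time(text):
    match = re.search(TIME_PATTERN, text)
    return match.group() if match else None

def to_seconds(time_text):
    """Hypothetical helper: convert '2m 49s' or '21.97s' to seconds (float)."""
    parts = re.fullmatch(r'(?:(\d+)m\s*)?(\d+(?:\.\d+)?)s', time_text.strip())
    if parts is None:
        raise ValueError(f"Unrecognized time format: {time_text!r}")
    minutes = int(parts.group(1) or 0)
    return minutes * 60 + float(parts.group(2))

# Made-up sample line; the real captured output may look different
sample = "Finished: Reading data (21.97s)"
print(extract_time(sample))
print(to_seconds("2m 49s"))
```

Converting to seconds makes it easier to compare these measurements with the ones printed by `time_printer`.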

In [5]:
# Load the previously downloaded Parquet file into a Pandas DataFrame (df)
nyc_data_path = "/home/yarnapp/hopsfs/Resources/nyc_taxiparquet"
df = pd.read_parquet(nyc_data_path)

In [6]:
# Convert the column names to lowercase
df = column_renamer(df)

In [7]:
# 🧪🧪 TESTING 🧪🧪
# Reduce the size of the dataframe, just for testing purposes.
#df = df[0:20]

In [8]:
print(str(int(sys.getsizeof(df))/(1024*1024*1024)) + " GBs occupied by the Pandas Dataframe!")

0.574670372530818 GBs occupied by the Pandas Dataframe!


In [9]:
#%%capture creation_time
# Log in to the project and insert the dataset into a new feature group, keeping track of the time required by each operation.
before_login = time.time()
project = hopsworks.login()

before_creation = time.time()
fs = project.get_feature_store()
fg = fs.get_or_create_feature_group(
    name="hudi_test",
    version=1,
    primary_key=df.columns,
    description='Uploaded NYC Dataset for testing reasons')

before_insertion = time.time()
fg.insert(df)
before_materialization = time.time()

print("\n**Time needed to login:       " + time_printer(before_creation        - before_login)     + "**")
print("**Time needed to create fg:   "   + time_printer(before_insertion       - before_creation)  + "**")
print("**Time needed to insert data: "   + time_printer(before_materialization - before_insertion) + "**")

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/15483
Connected. Call `.close()` to terminate connection gracefully.


Uploading Dataframe: 0.00% |          | Rows 0/3066766 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: hudi_test_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/15483/jobs/named/hudi_test_1_offline_fg_materialization/executions

**Time needed to login:       2s **
**Time needed to create fg:   2s **
**Time needed to insert data: 2m 49s **
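
The manual `time.time()` bookkeeping above can also be written with a small context manager. This is a minimal sketch, assuming a hypothetical `timed` helper and `timings` dictionary that are not part of the original notebook; `time.sleep` stands in for the real workload.

```python
import time
from contextlib import contextmanager

timings = {}  # step label -> elapsed seconds

@contextmanager
def timed(label, store=timings):
    """Hypothetical helper: record how long the enclosed block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so partial runs are still measured
        store[label] = time.perf_counter() - start

# Stand-in workload; in the notebook this would wrap e.g. the insert call
with timed("sleep"):
    time.sleep(0.05)

print(f"sleep took {timings['sleep']:.2f}s")
```

Collecting the durations in a dictionary keeps all measurements in one place for later comparison across runs.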


In [10]:
# Repeatedly check the status of the materialization job. When it is FINISHED,
# get the current time and compute the time required by the materialization.
while(fg.materialization_job.get_state() != 'FINISHED'):
    time.sleep(2)
after_materialization = time.time()

final_state = fg.materialization_job.get_final_state()
if (final_state != 'SUCCEEDED'):
    print("\nWARNING: The final state is: " + str(final_state))
    
print("\n**Time needed to materialize: " + time_printer(after_materialization - before_materialization) + "(+/- 2s) **\n")


**Time needed to materialize: 4m 8s (+/- 2s) **
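
The polling loop above waits indefinitely if the job never reaches `FINISHED`. A generic variant with a timeout could look like the sketch below. The `wait_for` helper is hypothetical and uses a fake state source instead of the Hopsworks job object, so it makes no claim about the Hopsworks API.

```python
import time

def wait_for(get_state, done_states, interval=2.0, timeout=600.0):
    """Poll get_state() until it returns a value in done_states or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in done_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout}s")
        time.sleep(interval)

# Fake state source standing in for a job status call
states = iter(["RUNNING", "RUNNING", "FINISHED"])
final = wait_for(lambda: next(states), {"FINISHED"}, interval=0.01, timeout=1.0)
print(final)
```

As in the original loop, the measured duration is only accurate to within one polling interval.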



In [13]:
%%capture captured_read_print
# Test the reading time and save these metrics for further evaluations.
before_read = time.time()
fg.read()
after_read  = time.time()

#print("\n**Time needed to read: " + time_printer(after_read - before_read) + "**")

In [14]:
read_time = extract_time(str(captured_read_print))
print(read_time)

21.97s


---
#### @FINAL Delete all the data and files created

In [None]:
fg.delete()