First of all, my apologies if this is not the appropriate channel for questions.
After researching Metaflow, I believe its parallel S3 reading implementation is significantly faster than the alternatives (I'm looking for a replacement for Dask's internal s3fs reading logic).
However, I can't seem to read valid data from metaflow.S3. Here is a snippet that reproduces my issue:
from metaflow import S3

S3_PATH = "s3://s3-bucket/path/"
s3 = S3(s3root=S3_PATH)
s3 = s3.__enter__()  # issue is the same when using the context manager
data = s3.get_all()  # fetch all objects under the root
first_file = data[0]
first_file.text
The text representation of the files appears to be encoded in a way I can't figure out how to deserialize. How can I properly deserialize these string representations?
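For reference, here is a minimal sketch of how I'm inspecting the raw bytes instead of the decoded text (same S3_PATH placeholder as above; I'm assuming the blob and content_type properties on the returned objects expose the raw payload and the reported Content-Type):

from metaflow import S3

with S3(s3root=S3_PATH) as s3:
    obj = s3.get_all()[0]
    raw = obj.blob             # raw bytes of the downloaded object
    print(raw[:16])            # peek at the first bytes / magic number
    print(obj.content_type)    # Content-Type reported by S3, if any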
I also tried deserializing the locally downloaded data the way the Metaflow datastore does, by gunzipping it:
import gzip

with gzip.GzipFile(data[0].path, mode="rb") as f:
    r = f.read()
This yields: Not a gzipped file (b'15')
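For what it's worth, a gzip stream should start with the magic bytes \x1f\x8b, so checking the first two bytes of the downloaded file (re-reading the same data[0].path as above) confirms the data really isn't gzip:

with open(data[0].path, "rb") as f:
    head = f.read(2)
print(head)  # prints b'15', matching the error above; gzip would start with b'\x1f\x8b'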