Jay/daft 0.1.17 #1

Closed

wants to merge 21 commits

Conversation

@jaychia commented on Sep 12, 2023

No description provided.

pfaraone and others added 21 commits on August 16, 2023 at 14:55
* fixed commit_partition and list_deltas local deltacat storage bug

* providing destination table with deltas

* ds_mock_kwargs fixtures just consist of dict of db_file_path

* working test_compact_partition_success unit test

* flake8 issues

* removed print statements

* fixed assert (manifest_records == len(compacted_table))

* 3 working unit test cases

* implemented test_compact_partition_incremental

* commented out broken unit test

* removed unused test files

* added explanatory comment to block_until_instance_metadata_service_returns_success

* reverting ray_utils/runtime change

* fixed test_retrying_on_statuses_in_status_force_list

* fixed test_retrying_on_statuses_in_status_force_list

* added additional block_until_instance_metadata_service_returns_success unit test

* additional fixtures

* validation_callback_func_ -> validation_callback_func_kwargs

* added use_prev_compacted key

* added additional use_prev_compacted

* fixed test_compaction_session unit tests

* copied over working unit tests from dev/test_compact_partition_first_cut

* fixed repartition unit tests

* moved test_utils to its own module

* removed unused kwargs arg

* parametrizing records_per_compacted_file and hash_bucket_count (sketched below)

* removed unused TODO

* augmented CompactPartitionParams to include additional compact_partition kwargs

* augmented CompactPartitionParams to include additional compact_partition kwargs

* refactored test cases and setup into their own modules

* added additional type hints

* defaulting to an empty dict safely wherever deltacat_storage_kwargs is a param

* revert change that no longer passed **list_deltas_kwargs to io.discover_deltas

* additional type hints

* no longer passing kwargs to dedupe + decimal-pk case

* no longer passing kwargs to materialize

* unpacking deltacat_storage_kwargs in deltacat_storage.stage_delta timed_invocation

* no longer unpacking **list_deltas_kwargs when calling _execute_compaction_round

* re-added kwargs

* added incremental timestamp-pk unit test

* added docstring to offer_iso8601_timestamp_list

* added test case for duplicates with a sort key and multiple primary keys

* removed dead code

* added additional comments to first incremental test case

* fixture refactoring - compaction_artifacts_s3_bucket -> setup_compaction_artifacts_s3_bucket

* parameterizing table version

* # ds_mock_kwargs -> # teardown_local_deltacat_storage_db

* tearing down db within test execution is now parameterized

* added INCREMENTAL_DEPENDENT_TEST_CASES dependent test case

* reversing order of sk in 10-incremental-decimal-pk-multi-dup
Co-authored-by: Kevin Yan <kevnya@amazon.com>
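
Several commits above parametrize the compaction tests (records_per_compacted_file, hash_bucket_count, table version, DB teardown). A minimal sketch of that pattern with pytest; the parameter names come from the commit messages, while the values and test body are illustrative:

```python
import pytest

# Hypothetical parameter matrix mirroring the commit messages above;
# the actual deltacat fixtures and values may differ.
@pytest.mark.parametrize("records_per_compacted_file", [1, 10])
@pytest.mark.parametrize("hash_bucket_count", [1, 2])
def test_compact_partition_incremental(records_per_compacted_file, hash_bucket_count):
    # A real test would run a compaction round and assert on the output;
    # here we only show the 4-combination parameter matrix.
    assert records_per_compacted_file > 0
    assert hash_bucket_count > 0
```
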
* Adding support for reading table into ParquetFile (ray-project#163)

* Adding support for reading just the parquet meta by leveraging ParquetFile

[Description] Currently we read the entire table using pyarrow's read_parquet method; this eagerly loads all of the data and loses the parquet metadata in the process. This commit uses ParquetFile instead (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html); a sketch follows below.

[Motivation] ray-project#162

* Using pyarrow.fs.S3FileSystem instead of s3fs

* Adding s3 adaptive retries and validate encoding
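
A minimal sketch of the read path these commits describe, assuming an illustrative bucket, key, and column: pyarrow.parquet.ParquetFile exposes the parquet metadata without eagerly loading data, and pyarrow.fs.S3FileSystem replaces s3fs:

```python
import pyarrow.parquet as pq
from pyarrow import fs

# Eager path: pq.read_table() materializes every row group and discards the
# file-level metadata. Lazy path: ParquetFile keeps the metadata accessible.
s3 = fs.S3FileSystem(region="us-east-1")  # illustrative region
with s3.open_input_file("my-bucket/path/to/file.parquet") as f:  # illustrative key
    pf = pq.ParquetFile(f)
    meta = pf.metadata  # row counts, row-group sizes, column statistics
    print(meta.num_rows, meta.num_row_groups)
    # Data can then be read selectively rather than all at once:
    first_group = pf.read_row_group(0, columns=["id"])  # illustrative column
```
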

* Interface for merge v2 (ray-project#182)

* Adding interface and definitions for merge step

* fix tests and merge logs

* Add hashed memcached client support (ray-project#173)

* Add hashed memcached client support

* use uid instead of ip address as key to hash algorithm

---------

Co-authored-by: Raghavendra M Dani <draghave@amazon.com>
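
A hedged sketch of the hashed-client idea using pymemcache's HashClient; the server addresses, uid, and timeout values are illustrative rather than deltacat's actual configuration:

```python
from pymemcache.client.hash import HashClient

# HashClient consistently hashes each cache key across the server pool,
# so the same key always routes to the same memcached node.
servers = [("10.0.0.1", 11211), ("10.0.0.2", 11211)]
client = HashClient(servers, connect_timeout=1.0, timeout=5.0)

# Per the last commit, the cache key derives from a stable node uid rather
# than an IP address, so keys survive IP reassignment (uid is illustrative).
uid = "node-uid-1234"
client.set(uid, b"serialized-bytes")
assert client.get(uid) == b"serialized-bytes"
```
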

* Implementing hash bucketing v2 (ray-project#178)

* Implementing hash bucket v2

* Fix the assertion regarding hash buckets

* Python 3.7 does not have doClassCleanups in super

* Fix the memory issue with the hb index creation
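
The hash-bucketing step routes records by primary key so duplicates land in the same bucket. A minimal sketch of the idea, not deltacat's implementation:

```python
import hashlib

def hash_bucket(pk, hash_bucket_count):
    """Map a primary key to a stable bucket index in [0, hash_bucket_count)."""
    digest = hashlib.sha1(pk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_bucket_count

# Records sharing a primary key always hash to the same bucket, so dedupe
# can run per bucket instead of over the whole dataset.
buckets = {}
for pk in ["user-1", "user-2", "user-1"]:
    buckets.setdefault(hash_bucket(pk, 4), []).append(pk)
print(buckets)
```
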

* Compaction session implementation for algo v2 (ray-project#187)

* Compaction session implementation for algo v2

* Address comments

* Added capability to measure instance minutes in an autoscaling cluster setting

* Avoid configuring logger if ray is uninitialized

* Add readme to run tests

* Refactoring and fixing the num_cpus key in options

* Resolve merge conflict and rebase from main

* Adding additional optimization from POC (ray-project#194)

* Adding additional optimization from POC

* fix typo

* Moved the compact_partition tests to top level module

* Adding unit tests for parquet downloaders

* fix typo

* fix repartition session

* Adding stack trace and passing config kwargs separately due to s3fs bug

* fix the parquet reader

* pass deltacat_storage_kwargs in repartition_session

* addressed comments and extend tests to handle v2

---------

Co-authored-by: Zyiqin-Miranda <79943187+Zyiqin-Miranda@users.noreply.github.com>
* Add pyarrow_to_daft_schema helper

* added initial daft read for parquet

* integrate io code with daft cast

* suggestions

* lint-fix EOF for requirements

* lint-fixes EOF with pre-commit

* bump version to 0.1.18b15

* Fix tuple issue with coerce_int96_timestamp_unit

* Fix kwargs and Schema

* Lint

* using include column names instead of column names

* Update to use new Daft Schema.from_pyarrow_schema method

* Remove daft requirement from benchmark-requirements.txt

* Factor out daft parquet reading code into separate util file

* add timing to daft.read_parquet, address feedback, add unit tests

* run pre-commit

* switch on columns and add local file

* add schema override unit test

* add schema override unit test

* fix fstrings

* thread column names through

* downgrade to daft==0.1.12

* add support for partial row group downloads

* add `reader_type` to pyarrow utils

---------

Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com>
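
A hedged sketch of the Daft read path these commits add; the path and columns are illustrative, and deltacat's actual integration threads schema overrides and column names through its own APIs (Schema.from_pyarrow_schema is the entry point named above):

```python
import daft

# Illustrative path and columns; deltacat's integration derives these
# from its own table-read parameters.
df = daft.read_parquet("s3://my-bucket/data/*.parquet")
df = df.select("pk", "value")  # column pruning, akin to include_columns above
table = df.to_arrow()          # materialize back into a pyarrow Table
```
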
…ct#184)

* Use list_partition_deltas to retrieve deltas within a given shard

* Update signature for list_partition_deltas

* Update list_partition_deltas impl. in unit tests

* Handle Partition and PartitionLocator types appropriately

* Add merge support and unit tests (ray-project#193)

* Add merge support and unit tests

* fix drop_duplicates

* fix merge and ensure all v1 tests are passing

* fix the naming

* Refactor drop_duplicates into its own module

* fix the hash group indices range

* Copy empty hash bucket support; Fix for empty hash bucket in old compacted table

* refactor and naming changes

* Add case when no primary keys

* Add capability to avoid dropping duplicates for rebase

* Support DELETE deltas including unit tests

* only create a delta type column for DELETE bundles

* fix all issues during actual run

* fix incremental compaction num_rows None

* remove db_test.sqlite

* optimize appending the parquet files

* address comments

* address comments

* address comments

---------

Co-authored-by: Raghavendra Dani <draghave@amazon.com>
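
The dedupe step described above keeps one record per primary key, ordered by sort key. A hedged sketch with pandas (column names illustrative; the actual merge operates on pyarrow tables):

```python
import pandas as pd

df = pd.DataFrame({
    "pk": [1, 1, 2],
    "sort_key": [10, 20, 5],
    "value": ["old", "new", "only"],
})

# Keep the latest record per primary key: sort by the sort key, then drop
# earlier duplicates. Rebase runs can skip this step entirely (see above).
deduped = df.sort_values("sort_key").drop_duplicates(subset=["pk"], keep="last")
print(deduped)  # pk=1 keeps "new"; pk=2 is untouched
```
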

* Merge phash_main branch into main

* Bumping up deltacat version

---------

Co-authored-by: Zyiqin-Miranda <79943187+Zyiqin-Miranda@users.noreply.github.com>
* Gracefully handle empty CSV

* addressed comments

* Add a new test case to verify cloudpickle fix

* Allow raising error on empty CSV

* ignore_reinit_error=True to avoid init twice

* Bump up the version
* Fix the boto3 config issue

[Description] Duplicate boto3 config in kwargs was failing the reads.

* Optimize uTSV reads by avoiding an intermediate bytearray
* Configure STS to use the regional endpoint based on the AWS_REGION environment variable, defaulting to us-east-1
---------

Co-authored-by: Kevin Yan <kevnya@amazon.com>
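
A sketch of the "gracefully handle empty CSV" behavior, assuming pyarrow's CSV reader and an illustrative schema; raise_on_empty mirrors the "allow raising error on empty CSV" commit:

```python
import io
import pyarrow as pa
import pyarrow.csv as pacsv

def read_csv_or_empty(data, schema, raise_on_empty=False):
    # pyarrow raises ArrowInvalid ("Empty CSV file") on zero-byte input;
    # fall back to an empty table with the expected schema unless told not to.
    try:
        return pacsv.read_csv(io.BytesIO(data))
    except pa.ArrowInvalid:
        if raise_on_empty:
            raise
        return schema.empty_table()

tbl = read_csv_or_empty(b"", pa.schema([("id", pa.int64())]))
print(tbl.num_rows)  # 0
```
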
* Capture memory gb seconds and bug fixes

* fix delta split
* added inflation to the parquet size
* added more logging
* change storage_type to LOCAL as ParquetFile cannot be pickled
* increase the delay for blocking instance metadata service
* capture total and used memory gb seconds to understand memory efficiency
* add ray_custom_resources compaction param to allow scheduling to particular node types

* address comments
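
A sketch of the ray_custom_resources idea from the last bullet: tasks request a custom resource so Ray schedules them only onto nodes that advertise it (resource name and amounts illustrative):

```python
import ray

# ignore_reinit_error=True makes a second init a no-op, per an earlier commit.
# For this demo the local node advertises the custom resource itself; in a
# cluster, only nodes started with that resource can run the task.
ray.init(ignore_reinit_error=True, resources={"compactor": 1})

@ray.remote(resources={"compactor": 0.5})
def compact_shard(shard_id):
    return shard_id

print(ray.get([compact_shard.remote(i) for i in range(4)]))
```
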
…ing with int96 options (ray-project#210)

* update daft to 0.1.16, which uses openssl and correct arrow schema parsing

* update version
* Setting connect timeout and timeout parameters

* Allow configuring timeouts at client side

* Setting task parallelism to memcached connect limit

* fix the comment
* Adding compactor version in audit info and rci

* subclassing str for the compaction version

* Use builder pattern to set the audit values
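
A hedged sketch of the two patterns named above, a str-subclassed version type and builder-style setters on the audit object; class and field names are illustrative:

```python
from enum import Enum

class CompactorVersion(str, Enum):  # str subclass: serializes cleanly in audits
    V1 = "v1"
    V2 = "v2"

class CompactionAudit:
    def set_compactor_version(self, version):
        self.compactor_version = str(version.value)
        return self  # builder pattern: each setter returns self for chaining

audit = CompactionAudit().set_compactor_version(CompactorVersion.V2)
print(audit.compactor_version)  # "v2"
```
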
…roject#215)

* Add index array size to estimation, fix dedupe and concurrency

* Explicitly setting fetch_local to False to optimize head node memory

* address comments
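
A sketch of the fetch_local optimization: ray.wait can confirm completion without pulling large results to the head node (task body illustrative):

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def produce_large_result():
    return bytes(10 * 1024 * 1024)  # 10 MiB payload, illustrative

refs = [produce_large_result.remote() for _ in range(8)]
# fetch_local=False waits for completion but leaves each result on the node
# that produced it, keeping head-node memory flat.
ready, pending = ray.wait(refs, num_returns=len(refs), fetch_local=False)
```
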
* Write pa write result even when referencing

* adding log statement
@jaychia closed this on Sep 12, 2023