From 212e855adfcd9e39d864446fe84687545541e37f Mon Sep 17 00:00:00 2001
From: David Wood
Date: Thu, 27 Jun 2024 10:39:44 -0400
Subject: [PATCH 1/5] fix links in advanced tutorials, and normalize and add
 readmes to f/ededup and doc_id transforms

Signed-off-by: David Wood
---
 .../doc/advanced-transform-tutorial.md              |  4 ++--
 transforms/language/lang_id/README.md               |  4 ++--
 transforms/universal/doc_id/README.md               | 12 ++++++++++++
 .../universal/doc_id/ray/{Readme.md => README.md}   |  0
 transforms/universal/ededup/README.md               | 10 ++++++++++
 .../universal/ededup/ray/{Readme.md => README.md}   |  0
 transforms/universal/fdedup/README.md               | 10 ++++++++++
 .../universal/fdedup/ray/{Readme.md => README.md}   |  0
 transforms/universal/filter/README.md               |  4 ++--
 transforms/universal/noop/README.md                 |  4 ++--
 transforms/universal/tokenization/README.md         |  4 ++--
 11 files changed, 42 insertions(+), 10 deletions(-)
 create mode 100644 transforms/universal/doc_id/README.md
 rename transforms/universal/doc_id/ray/{Readme.md => README.md} (100%)
 create mode 100644 transforms/universal/ededup/README.md
 rename transforms/universal/ededup/ray/{Readme.md => README.md} (100%)
 create mode 100644 transforms/universal/fdedup/README.md
 rename transforms/universal/fdedup/ray/{Readme.md => README.md} (100%)

diff --git a/data-processing-lib/doc/advanced-transform-tutorial.md b/data-processing-lib/doc/advanced-transform-tutorial.md
index 6231fe206..e325bbcc0 100644
--- a/data-processing-lib/doc/advanced-transform-tutorial.md
+++ b/data-processing-lib/doc/advanced-transform-tutorial.md
@@ -32,13 +32,13 @@ Finally, we show to use the command line to run the transform in a local ray clu
 
 One of the basic components of exact dedup implementation is a cache of hashes. That is why we will
 start from implementing this support actor. The implementation is fairly straight forward and can be
-found [here](../../transforms/universal/ededup/ray/src/ededup_transform.py)
+found [here](../../transforms/universal/ededup/ray/src/ededup_transform_ray.py)
 
 ## EdedupTransform
 
 First, let's define the transform class. To do this we extend
 the base abstract/interface class
-[AbstractTableTransform](../ray/src/data_processing_ray/transform/table_transform.py),
+[AbstractTableTransform](../python/src/data_processing/transform/table_transform.py),
 which requires definition of the following:
 
 * an initializer (i.e. `init()`) that accepts a dictionary of configuration
diff --git a/transforms/language/lang_id/README.md b/transforms/language/lang_id/README.md
index e1705be8e..7f03a4cdb 100644
--- a/transforms/language/lang_id/README.md
+++ b/transforms/language/lang_id/README.md
@@ -8,5 +8,5 @@ the following runtimes are available:
 implementation.
 * [ray](ray/README.md) - enables the running of the base python transformation
 in a Ray runtime
-* [kfp](kfp_ray/README.md) - enables running the ray docker image for
-noop in a kubernetes cluster using a generated `yaml` file.
+* [kfp](kfp_ray/README.md) - enables running the ray docker image
+in a kubernetes cluster using a generated `yaml` file.
diff --git a/transforms/universal/doc_id/README.md b/transforms/universal/doc_id/README.md
new file mode 100644
index 000000000..5de66d253
--- /dev/null
+++ b/transforms/universal/doc_id/README.md
@@ -0,0 +1,12 @@
+# Doc ID Transform
+The doc_id transform allows the addition of document identifiers, both unique integers and content hashes.
+Per the set of
+[transform project conventions](../../README.md#transform-project-conventions)
+the following runtimes are available:
+
+* [ray](ray/README.md) - enables the running of the base python transformation
+in a Ray runtime
+* [spark](spark/README.md) - enables the running of a spark-based transformation
+in a Spark runtime.
+* [kfp](kfp_ray/README.md) - enables running the ray docker image
+in a kubernetes cluster using a generated `yaml` file.
diff --git a/transforms/universal/doc_id/ray/Readme.md b/transforms/universal/doc_id/ray/README.md
similarity index 100%
rename from transforms/universal/doc_id/ray/Readme.md
rename to transforms/universal/doc_id/ray/README.md
diff --git a/transforms/universal/ededup/README.md b/transforms/universal/ededup/README.md
new file mode 100644
index 000000000..09de2f2ab
--- /dev/null
+++ b/transforms/universal/ededup/README.md
@@ -0,0 +1,10 @@
+# Exact Deduplication Transform
+The ededup transform removes duplicate documents within a set of parquet files.
+Per the set of
+[transform project conventions](../../README.md#transform-project-conventions)
+the following runtimes are available:
+
+* [ray](ray/README.md) - enables the running of the base python transformation
+in a Ray runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image
+in a kubernetes cluster using a generated `yaml` file.
diff --git a/transforms/universal/ededup/ray/Readme.md b/transforms/universal/ededup/ray/README.md
similarity index 100%
rename from transforms/universal/ededup/ray/Readme.md
rename to transforms/universal/ededup/ray/README.md
diff --git a/transforms/universal/fdedup/README.md b/transforms/universal/fdedup/README.md
new file mode 100644
index 000000000..e128566d2
--- /dev/null
+++ b/transforms/universal/fdedup/README.md
@@ -0,0 +1,10 @@
+# Fuzzy Deduplication Transform
+The fdedup transform removes documents that are very similar to each other within a set of parquet files.
+Per the set of
+[transform project conventions](../../README.md#transform-project-conventions)
+the following runtimes are available:
+
+* [ray](ray/README.md) - enables the running of the base python transformation
+in a Ray runtime
+* [kfp](kfp_ray/README.md) - enables running the ray docker image
+in a kubernetes cluster using a generated `yaml` file.
diff --git a/transforms/universal/fdedup/ray/Readme.md b/transforms/universal/fdedup/ray/README.md
similarity index 100%
rename from transforms/universal/fdedup/ray/Readme.md
rename to transforms/universal/fdedup/ray/README.md
diff --git a/transforms/universal/filter/README.md b/transforms/universal/filter/README.md
index fb3ffc876..73688eeff 100644
--- a/transforms/universal/filter/README.md
+++ b/transforms/universal/filter/README.md
@@ -10,5 +10,5 @@ implementation.
 in a Ray runtime
 * [spark](spark/README.md) - enables the running of a spark-based transformation
 in a Spark runtime.
-* [kfp](kfp_ray/README.md) - enables running the ray docker image for
-filter in a kubernetes cluster using a generated `yaml` file.
\ No newline at end of file
+* [kfp](kfp_ray/README.md) - enables running the ray docker image
+in a kubernetes cluster using a generated `yaml` file.
diff --git a/transforms/universal/noop/README.md b/transforms/universal/noop/README.md
index 56ce95889..8d310041d 100644
--- a/transforms/universal/noop/README.md
+++ b/transforms/universal/noop/README.md
@@ -10,5 +10,5 @@ implementation.
 in a Ray runtime
 * [spark](spark/README.md) - enables the running of a spark-based transformation
 in a Spark runtime.
-* [kfp](kfp_ray/README.md) - enables running the ray docker image for
-noop in a kubernetes cluster using a generated `yaml` file.
+* [kfp](kfp_ray/README.md) - enables running the ray docker image
+in a kubernetes cluster using a generated `yaml` file.
diff --git a/transforms/universal/tokenization/README.md b/transforms/universal/tokenization/README.md
index 4067850a0..3fd4571ff 100644
--- a/transforms/universal/tokenization/README.md
+++ b/transforms/universal/tokenization/README.md
@@ -9,5 +9,5 @@ the following runtimes are available:
 implementation.
 * [ray](ray/README.md) - enables the running of the python-based transformation
 in a Ray runtime
-* [kfp](kfp_ray/README.md) - enables running the ray docker image for
-the transform in a kubernetes cluster using a generated `yaml` file.
+* [kfp](kfp_ray/README.md) - enables running the ray docker image
+in a kubernetes cluster using a generated `yaml` file.

From 83f11ffbb4ceebd099245d6fe9db09b3da767bf2 Mon Sep 17 00:00:00 2001
From: David Wood
Date: Thu, 27 Jun 2024 11:23:21 -0400
Subject: [PATCH 2/5] Updates to ingest2parquet and code2parquet readmes

Signed-off-by: David Wood
---
 tools/ingest2parquet/README.md                | 5 +++++
 transforms/code/code2parquet/python/README.md | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/ingest2parquet/README.md b/tools/ingest2parquet/README.md
index 842624b53..c7485994e 100644
--- a/tools/ingest2parquet/README.md
+++ b/tools/ingest2parquet/README.md
@@ -1,5 +1,10 @@
 # INGEST2PARQUET
 
+**Please note: This tool is deprecated and will be removed soon.
+It is superseded by the transform-based implementation,
+[code2parquet](../../transforms/code/code2parquet), providing identical capability,
+but with support for ray-based scalability.**
+
 ## Summary
 This Python script is designed to convert raw data files, particularly ZIP files, into Parquet format. It is built to handle concurrent processing of multiple files using multiprocessing for efficient execution. Each file contained within the ZIP is transformed into a distinct row within the Parquet dataset, adhering to the below schema.
 
diff --git a/transforms/code/code2parquet/python/README.md b/transforms/code/code2parquet/python/README.md
index da70d6381..14ee9f032 100644
--- a/transforms/code/code2parquet/python/README.md
+++ b/transforms/code/code2parquet/python/README.md
@@ -76,7 +76,8 @@ from the configuration dictionary.
   and specifies the path to a JSON file containing the mapping of languages
   to extensions. The json file is expected to contain a dictionary of
   languages names as keys, with values being a list of strings specifying the
-  associated extensions.
+  associated extensions. As an example, see
+  [lang_extensions](test-data/languages/lang_extensions.json).
 * `data_access_factory` - used to create the DataAccess instance used
   to read the file specified in `supported_langs_file`.
 * `detect_programming_lang` - a flag that indicates if the language:extension mappings

From 948b0bd1a7c88fcd076563a0fcfe47fd4602d2ac Mon Sep 17 00:00:00 2001
From: David Wood
Date: Thu, 27 Jun 2024 12:02:18 -0400
Subject: [PATCH 3/5] Fix doc_id, e/fdedup and profiler Dockerfiles to track
 rename of Readme.md

Signed-off-by: David Wood
---
 transforms/universal/doc_id/ray/Dockerfile   | 2 +-
 transforms/universal/ededup/ray/Dockerfile   | 2 +-
 transforms/universal/fdedup/ray/Dockerfile   | 2 +-
 transforms/universal/profiler/ray/Dockerfile | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/transforms/universal/doc_id/ray/Dockerfile b/transforms/universal/doc_id/ray/Dockerfile
index acff66d79..56cf13609 100644
--- a/transforms/universal/doc_id/ray/Dockerfile
+++ b/transforms/universal/doc_id/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
 # Install ray project source
 COPY --chown=ray:users src/ src/
 COPY --chown=ray:users pyproject.toml pyproject.toml
-COPY --chown=ray:users Readme.md Readme.md
+COPY --chown=ray:users README.md README.md
 RUN pip install --no-cache-dir -e .
 
 # copy source data
diff --git a/transforms/universal/ededup/ray/Dockerfile b/transforms/universal/ededup/ray/Dockerfile
index 8b6bf8c1d..a0cf9d61d 100644
--- a/transforms/universal/ededup/ray/Dockerfile
+++ b/transforms/universal/ededup/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
 # Install ray project source
 COPY --chown=ray:users src/ src/
 COPY --chown=ray:users pyproject.toml pyproject.toml
-COPY --chown=ray:users Readme.md Readme.md
+COPY --chown=ray:users README.md README.md
 COPY --chown=ray:users images/ images/
 RUN pip install --no-cache-dir -e .
 
diff --git a/transforms/universal/fdedup/ray/Dockerfile b/transforms/universal/fdedup/ray/Dockerfile
index 72af4a747..eeffd88ed 100644
--- a/transforms/universal/fdedup/ray/Dockerfile
+++ b/transforms/universal/fdedup/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
 # Install ray project source
 COPY --chown=ray:users src/ src/
 COPY --chown=ray:users pyproject.toml pyproject.toml
-COPY --chown=ray:users Readme.md Readme.md
+COPY --chown=ray:users README.md README.md
 COPY --chown=ray:users images/ images/
 RUN pip install --no-cache-dir -e .
 
diff --git a/transforms/universal/profiler/ray/Dockerfile b/transforms/universal/profiler/ray/Dockerfile
index d45c57fdb..0f46afe45 100644
--- a/transforms/universal/profiler/ray/Dockerfile
+++ b/transforms/universal/profiler/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
 # Install ray project source
 COPY --chown=ray:users src/ src/
 COPY --chown=ray:users pyproject.toml pyproject.toml
-COPY --chown=ray:users Readme.md Readme.md
+COPY --chown=ray:users README.md README.md
 RUN pip install --no-cache-dir -e .
 
 # copy source data

From 3069814c2d006edba1c8903abce7592309a0a67d Mon Sep 17 00:00:00 2001
From: David Wood
Date: Thu, 27 Jun 2024 12:34:49 -0400
Subject: [PATCH 4/5] temp delete profiler's readme

Signed-off-by: David Wood
---
 transforms/universal/profiler/ray/Readme.md | 82 ---------------------
 1 file changed, 82 deletions(-)
 delete mode 100644 transforms/universal/profiler/ray/Readme.md

diff --git a/transforms/universal/profiler/ray/Readme.md b/transforms/universal/profiler/ray/Readme.md
deleted file mode 100644
index fbaec5957..000000000
--- a/transforms/universal/profiler/ray/Readme.md
+++ /dev/null
@@ -1,82 +0,0 @@
-# Profiler
-
-Please see the set of
-[transform project conventions](../../../README.md)
-for details on general project conventions, transform configuration,
-testing and IDE set up.
-
-## Summary
-
-Profiler implement a word count. Typical implementation of the word count is done using map reduce.
-* It’s O(N2) complexity
-* shuffling with lots of data movement
-
-Implementation here is using “streaming” aggregation, based on central cache:
-
-* At the heart of the implementation is a cache of partial word counts, implemented as a set of Ray actors and containing
-word counts processed so far.
-* Individual data processors are responsible for:
-  * Reading data from data plane
-  * tokenizing documents (we use pluggable tokenizer)
-  * Coordinating with distributed cache to collect overall word counts
-
-The complication of mapping this model to transform model is the fact that implementation requires an aggregators cache,
-that transform mode knows nothing about. The solution here is to use transform runtime to create cache
-and pass it as a parameter to transforms.
-
-## Transform runtime
-
-[Transform runtime](src/profiler_transform_ray.py) is responsible for creation cache actors and sending their
-handles to the transforms themselves
-Additionally it writes created word counts to the data storage (as .csv files) and enhances statistics information with the information about cache size and utilization
-
-## Configuration and command line Options
-
-The set of dictionary keys holding [EdedupTransform](src/profiler_transform_ray.py)
-configuration for values are as follows:
-
-* _aggregator_cpu_ - specifies an amount of CPUs per aggregator actor
-* _num_aggregators_ - specifies number of aggregator actors
-* _doc_column_ - specifies name of the column containing documents
-
-## Running
-
-### Launched Command Line Options
-When running the transform with the Ray launcher (i.e. TransformLauncher),
-the following command line arguments are available in addition to
-[the options provided by the launcher](../../../../data-processing-lib/doc/ray-launcher-options.md).
-
-```shell
-  --profiler_aggregator_cpu PROFILER_AGGREGATOR_CPU
-                        number of CPUs per aggrigator
-  --profiler_num_aggregators PROFILER_NUM_AGGREGATORS
-                        number of agregator actors to use
-  --profiler_doc_column PROFILER_DOC_COLUMN
-                        key for accessing data
- ```
-
-These correspond to the configuration keys described above.
-
-### Running the samples
-To run the samples, use the following `make` targets
-
-* `run-cli-sample` - runs src/ededup_transform_ray.py using command line args
-* `run-local-sample` - runs src/ededup_local_ray.py
-* `run-s3-sample` - runs src/ededup_s3_ray.py
-    * Requires prior installation of minio, depending on your platform (e.g., from [here](https://min.io/docs/minio/macos/index.html)
-    and [here](https://min.io/docs/minio/linux/index.html)
-    and invocation of `make minio-start` to load data into local minio for S3 access.
-
-These targets will activate the virtual environment and set up any configuration needed.
-Use the `-n` option of `make` to see the detail of what is done to run the sample.
-
-For example,
-```shell
-make run-cli-sample
-...
-```
-Then
-```shell
-ls output
-```
-To see results of the transform.

From d4c35405a4783c299da643d76902443afad94fa3 Mon Sep 17 00:00:00 2001
From: David Wood
Date: Thu, 27 Jun 2024 12:35:17 -0400
Subject: [PATCH 5/5] Add back profiler's readme as README.md

Signed-off-by: David Wood
---
 transforms/universal/profiler/ray/README.md | 82 +++++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100644 transforms/universal/profiler/ray/README.md

diff --git a/transforms/universal/profiler/ray/README.md b/transforms/universal/profiler/ray/README.md
new file mode 100644
index 000000000..fbaec5957
--- /dev/null
+++ b/transforms/universal/profiler/ray/README.md
@@ -0,0 +1,82 @@
+# Profiler
+
+Please see the set of
+[transform project conventions](../../../README.md)
+for details on general project conventions, transform configuration,
+testing and IDE set up.
+
+## Summary
+
+Profiler implements a word count. A typical implementation of word count is done using map reduce:
+* It’s O(N^2) complexity
+* shuffling with lots of data movement
+
+The implementation here uses “streaming” aggregation, based on a central cache:
+
+* At the heart of the implementation is a cache of partial word counts, implemented as a set of Ray actors and containing
+word counts processed so far.
+* Individual data processors are responsible for:
+  * Reading data from the data plane
+  * tokenizing documents (we use a pluggable tokenizer)
+  * Coordinating with the distributed cache to collect overall word counts
+
+The complication of mapping this model to the transform model is that the implementation requires an aggregator cache,
+which the transform model knows nothing about. The solution here is to use the transform runtime to create the cache
+and pass it as a parameter to the transforms.
+
+## Transform runtime
+
+The [transform runtime](src/profiler_transform_ray.py) is responsible for creating the cache actors and sending their
+handles to the transforms themselves.
+Additionally, it writes the computed word counts to data storage (as .csv files) and enhances the statistics with information about cache size and utilization.
+
+## Configuration and Command Line Options
+
+The set of dictionary keys holding [ProfilerTransform](src/profiler_transform_ray.py)
+configuration for values are as follows:
+
+* _aggregator_cpu_ - specifies an amount of CPUs per aggregator actor
+* _num_aggregators_ - specifies number of aggregator actors
+* _doc_column_ - specifies the name of the column containing documents
+
+## Running
+
+### Launched Command Line Options
+When running the transform with the Ray launcher (i.e. TransformLauncher),
+the following command line arguments are available in addition to
+[the options provided by the launcher](../../../../data-processing-lib/doc/ray-launcher-options.md).
+
+```shell
+  --profiler_aggregator_cpu PROFILER_AGGREGATOR_CPU
+                        number of CPUs per aggregator
+  --profiler_num_aggregators PROFILER_NUM_AGGREGATORS
+                        number of aggregator actors to use
+  --profiler_doc_column PROFILER_DOC_COLUMN
+                        key for accessing data
+ ```
+
+These correspond to the configuration keys described above.
+
+### Running the samples
+To run the samples, use the following `make` targets
+
+* `run-cli-sample` - runs src/profiler_transform_ray.py using command line args
+* `run-local-sample` - runs src/profiler_local_ray.py
+* `run-s3-sample` - runs src/profiler_s3_ray.py
+    * Requires prior installation of minio, depending on your platform (e.g., from [here](https://min.io/docs/minio/macos/index.html)
+    and [here](https://min.io/docs/minio/linux/index.html)
+    and invocation of `make minio-start` to load data into local minio for S3 access.
+
+These targets will activate the virtual environment and set up any configuration needed.
+Use the `-n` option of `make` to see the detail of what is done to run the sample.
+
+For example,
+```shell
+make run-cli-sample
+...
+```
+Then
+```shell
+ls output
+```
+To see results of the transform.
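
The profiler README restored above describes its "streaming" word-count aggregation only in prose. Below is a minimal, self-contained sketch of that pattern using stock Ray; the names `WordCountCache`, `shard_of`, and `process_document` are invented for this illustration and are not the repository's API, which lives in `profiler_transform_ray.py` and runs inside the data-processing library's transform runtime.

```python
# Minimal sketch (not the project's actual code) of the streaming word-count
# aggregation pattern: a set of Ray actors holds partial word counts, and
# data processors push per-document counts to them instead of shuffling.
import hashlib
from collections import Counter

import ray

ray.init()


@ray.remote
class WordCountCache:
    """One shard of the distributed cache of partial word counts."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def add(self, partial: dict) -> None:
        # Merge a per-document partial count into this shard.
        self.counts.update(partial)

    def get_counts(self) -> dict:
        return dict(self.counts)


def shard_of(word: str, n_shards: int) -> int:
    # Stable hash so every processor routes a given word to the same actor;
    # built-in hash() is salted per process, so it is not used here.
    return int(hashlib.sha256(word.encode()).hexdigest(), 16) % n_shards


def process_document(text: str, caches: list) -> None:
    # "Data processor" role: tokenize (naively here; the real transform uses
    # a pluggable tokenizer), build partial counts, and send each word's
    # count to the cache shard that owns it.
    partials = [Counter() for _ in caches]
    for word, count in Counter(text.lower().split()).items():
        partials[shard_of(word, len(caches))][word] += count
    ray.get([c.add.remote(dict(p)) for c, p in zip(caches, partials)])


# The runtime creates the cache actors and passes their handles to the
# transforms as a parameter, as the README describes.
caches = [WordCountCache.remote() for _ in range(2)]
process_document("the quick brown fox jumps over the lazy dog", caches)
process_document("the fox ate the dog food", caches)

totals = Counter()
for cache in caches:
    totals.update(ray.get(cache.get_counts.remote()))
print(totals.most_common(3))  # e.g. [('the', 4), ('fox', 2), ('dog', 2)]
```

Routing each word to a fixed shard via a stable hash is what lets the partial counts be merged additively, avoiding the shuffle and data movement that the README attributes to the map-reduce formulation.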