
Merge pull request #352 from IBM/Readme-Changes
Changes in code2parquet, ingest2parquet, and advance tutorial readmes.
daw3rd committed Jun 27, 2024
2 parents e9c6d19 + d4c3540 commit 4dc8928
Showing 18 changed files with 36 additions and 30 deletions.
4 changes: 2 additions & 2 deletions data-processing-lib/doc/advanced-transform-tutorial.md
@@ -32,13 +32,13 @@ Finally, we show how to use the command line to run the transform in a local ray cluster

One of the basic components of the exact dedup implementation is a cache of hashes, so we will start
by implementing this support actor. The implementation is fairly straightforward and can be
found [here](../../transforms/universal/ededup/ray/src/ededup_transform.py)
found [here](../../transforms/universal/ededup/ray/src/ededup_transform_ray.py)
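The core idea of such a hash cache can be sketched in plain Python (a hypothetical illustration, not the actual ededup actor — in the real implementation a class like this is wrapped as a Ray actor, e.g. with `@ray.remote`, so that all workers share one cache):

```python
class HashCache:
    """Minimal sketch of a hash cache: remembers every hash it has seen
    and reports which of a batch of candidate hashes are new."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def get_unique(self, candidates: list[str]) -> list[str]:
        # Keep only hashes not seen before, then remember them.
        unique = [h for h in candidates if h not in self.seen]
        self.seen.update(unique)
        return unique
```

A first call with `["a", "b"]` returns both hashes; a second call with `["b", "c"]` returns only `["c"]`, since `"b"` is already cached.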

## EdedupTransform

First, let's define the transform class. To do this, we extend
the base abstract/interface class
[AbstractTableTransform](../ray/src/data_processing_ray/transform/table_transform.py),
[AbstractTableTransform](../python/src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
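As a hypothetical sketch of that shape (the class name, config key, and use of plain row dicts are illustrative stand-ins for the real pyarrow-table-based implementation):

```python
import hashlib


class EdedupTransform:
    """Illustrative transform skeleton: an initializer that accepts a
    configuration dictionary, plus a method that drops duplicate rows."""

    def __init__(self, config: dict) -> None:
        # 'doc_column' is an assumed configuration key for illustration.
        self.doc_column = config.get("doc_column", "contents")
        self.seen: set[str] = set()

    def transform(self, rows: list[dict]) -> list[dict]:
        # Keep a row only if its content hash has not been seen before.
        out = []
        for row in rows:
            h = hashlib.sha256(row[self.doc_column].encode("utf-8")).hexdigest()
            if h not in self.seen:
                self.seen.add(h)
                out.append(row)
        return out
```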
5 changes: 5 additions & 0 deletions tools/ingest2parquet/README.md
@@ -1,5 +1,10 @@
# INGEST2PARQUET

**Please note: This tool is deprecated and will be removed soon.
It is superseded by the transform-based implementation,
[code2parquet](../../transforms/code/code2parquet), providing identical capability,
but with support for ray-based scalability.**

## Summary
This Python script converts raw data files, particularly ZIP files, into Parquet format, processing multiple files concurrently with multiprocessing for efficient execution.
Each file contained within a ZIP is transformed into a distinct row of the Parquet dataset, adhering to the schema below.
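The per-file expansion can be sketched as follows (a simplified illustration: the real script parallelizes over ZIP files with multiprocessing and writes rows out as Parquet, and the field names shown here are assumptions, not the script's actual schema):

```python
import io
import zipfile


def zip_to_rows(zip_bytes: bytes) -> list[dict]:
    """Turn each member of a ZIP archive into one row (dict)."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            rows.append({
                "document": name,
                "contents": data.decode("utf-8", errors="replace"),
                "size": len(data),
            })
    return rows
```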
3 changes: 2 additions & 1 deletion transforms/code/code2parquet/python/README.md
@@ -76,7 +76,8 @@ from the configuration dictionary.
and specifies the path to a JSON file containing the mapping of languages
to extensions. The JSON file is expected to contain a dictionary with
language names as keys, and values that are lists of strings specifying the
associated extensions.
associated extensions. As an example, see
[lang_extensions](test-data/languages/lang_extensions.json).
* `data_access_factory` - used to create the DataAccess instance used to read
the file specified in `supported_langs_file`.
* `detect_programming_lang` - a flag that indicates if the language:extension mappings
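To illustrate the expected file shape and how such a mapping might be consulted (the mapping contents and helper function below are hypothetical examples, not the transform's actual code):

```python
import json

# Shape of the JSON file: language names -> list of associated extensions.
LANG_JSON = '{"Python": [".py"], "C": [".c", ".h"]}'


def build_ext_map(lang_json: str) -> dict:
    """Invert the language->extensions mapping into extension->language."""
    lang_map = json.loads(lang_json)
    return {ext: lang for lang, exts in lang_map.items() for ext in exts}


def detect_lang(filename: str, ext_map: dict) -> str:
    """Look up a file's language by its extension; 'unknown' if unmapped."""
    dot = filename.rfind(".")
    return ext_map.get(filename[dot:], "unknown") if dot != -1 else "unknown"
```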
4 changes: 2 additions & 2 deletions transforms/language/lang_id/README.md
@@ -8,5 +8,5 @@ the following runtimes are available:
implementation.
* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
noop in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
8 changes: 4 additions & 4 deletions transforms/universal/doc_id/README.md
@@ -1,12 +1,12 @@
# Doc ID Transform
The Document ID transforms adds a document identification, which later can be used in de-duplication operations.
Per the set of
The Document ID transform adds document identification (unique integers and content hashes), which can later be used in de-duplication operations,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
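A minimal sketch of the two kinds of identifiers (hypothetical illustration only — the real transform assigns integer ids consistently across files and makes the target columns configurable):

```python
import hashlib
from itertools import count

# Process-wide counter standing in for a globally coordinated id assigner.
_next_id = count()


def make_doc_ids(contents: str) -> tuple:
    """Return (unique integer id, content hash) for one document."""
    digest = hashlib.sha256(contents.encode("utf-8")).hexdigest()
    return next(_next_id), digest
```

Identical documents get identical hashes but distinct integer ids, which is what later de-duplication steps rely on.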
2 changes: 1 addition & 1 deletion transforms/universal/doc_id/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
RUN pip install --no-cache-dir -e .

# copy source data
File renamed without changes.
10 changes: 5 additions & 5 deletions transforms/universal/ededup/README.md
@@ -1,10 +1,10 @@
# Exact Deduplication Transform
The exact deduplication removes text duplications
Per the set of
# Exact Deduplication Transform
The ededup transform removes duplicate documents within a set of parquet files,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
2 changes: 1 addition & 1 deletion transforms/universal/ededup/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
COPY --chown=ray:users images/ images/
RUN pip install --no-cache-dir -e .

File renamed without changes.
10 changes: 5 additions & 5 deletions transforms/universal/fdedup/README.md
@@ -1,10 +1,10 @@
# Fuzzy Deduplication Transform
The fuzzy deduplication removes text duplications
Per the set of
# Fuzzy Deduplication Transform
The fdedup transform removes documents that are very similar to each other within a set of parquet files,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
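The notion of "very similar" can be illustrated with Jaccard similarity over word shingles (an illustrative stand-in only — the actual fdedup implementation uses scalable approximate techniques rather than exact pairwise comparison):

```python
def shingles(text: str, k: int = 3) -> set:
    """The set of k-word shingles (overlapping word windows) of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two documents' shingle sets, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Documents whose similarity exceeds some threshold would be treated as duplicates and collapsed to one representative.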
2 changes: 1 addition & 1 deletion transforms/universal/fdedup/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
COPY --chown=ray:users images/ images/
RUN pip install --no-cache-dir -e .

File renamed without changes.
8 changes: 4 additions & 4 deletions transforms/universal/filter/README.md
@@ -1,6 +1,6 @@
# Filter Transform
The NOOP transforms serves as a simple exemplar to demonstrate the development
of a simple 1:1 transform. Per the set of
The filter transform provides SQL-based expressions for filtering rows and optionally removing columns from parquet files,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

@@ -10,5 +10,5 @@ implementation.
in a Ray runtime
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
filter in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
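The idea of SQL-based row filtering can be sketched with the standard-library `sqlite3` module (an illustration only — the filter transform operates on parquet tables, and its actual query engine and configuration keys are not shown here):

```python
import sqlite3


def filter_rows(rows: list, criteria: str) -> list:
    """Keep only rows satisfying a SQL WHERE-style filter expression."""
    con = sqlite3.connect(":memory:")
    # Hypothetical two-column schema for illustration.
    con.execute("CREATE TABLE docs (document TEXT, score REAL)")
    con.executemany("INSERT INTO docs VALUES (?, ?)", rows)
    kept = con.execute(
        f"SELECT document, score FROM docs WHERE {criteria}"
    ).fetchall()
    con.close()
    return kept
```

For example, `filter_rows(data, "score > 0.5")` keeps only the rows whose `score` column exceeds the threshold.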
4 changes: 2 additions & 2 deletions transforms/universal/noop/README.md
@@ -10,5 +10,5 @@ implementation.
in a Ray runtime
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
2 changes: 1 addition & 1 deletion transforms/universal/profiler/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
RUN pip install --no-cache-dir -e .

# copy source data
File renamed without changes.
2 changes: 1 addition & 1 deletion transforms/universal/tokenization/README.md
@@ -9,5 +9,5 @@ the following runtimes are available:
implementation.
* [ray](ray/README.md) - enables the running of the python-based transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
* [kfp](kfp_ray/README.md) - enables running the ray docker image for
the transform in a kubernetes cluster using a generated `yaml` file.
