
Merge pull request #352 from IBM/Readme-Changes
Changes in code2parquet, ingest2parquet, and advance tutorial readmes.
daw3rd committed Jun 27, 2024
2 parents e9c6d19 + d4c3540 commit 4dc8928
Showing 18 changed files with 36 additions and 30 deletions.
4 changes: 2 additions & 2 deletions data-processing-lib/doc/advanced-transform-tutorial.md
@@ -32,13 +32,13 @@ Finally, we show how to use the command line to run the transform in a local ray cluster

One of the basic components of the exact dedup implementation is a cache of hashes, so we will start
by implementing this support actor. The implementation is fairly straightforward and can be
found [here](../../transforms/universal/ededup/ray/src/ededup_transform.py)
found [here](../../transforms/universal/ededup/ray/src/ededup_transform_ray.py)
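The core idea of such a hash cache can be sketched in plain Python (a hypothetical illustration, not the actual ededup actor — in the real implementation a class like this is wrapped as a Ray actor, e.g. with `@ray.remote`, so that all workers share one cache):

```python
class HashCache:
    """Minimal sketch of a hash cache: remembers every hash it has seen
    and reports which of a batch of candidate hashes are new."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def get_unique(self, candidates: list[str]) -> list[str]:
        # Keep only hashes not seen before, then remember them.
        unique = [h for h in candidates if h not in self.seen]
        self.seen.update(unique)
        return unique
```

A first call with `["a", "b"]` returns both hashes; a second call with `["b", "c"]` returns only `["c"]`, since `"b"` is already cached.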

## EdedupTransform

First, let's define the transform class. To do this, we extend
the base abstract/interface class
[AbstractTableTransform](../ray/src/data_processing_ray/transform/table_transform.py),
[AbstractTableTransform](../python/src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
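As a hypothetical sketch of that shape (the class name, config key, and use of plain row dicts are illustrative stand-ins for the real pyarrow-table-based implementation):

```python
import hashlib


class EdedupTransform:
    """Illustrative transform skeleton: an initializer that accepts a
    configuration dictionary, plus a method that drops duplicate rows."""

    def __init__(self, config: dict) -> None:
        # 'doc_column' is an assumed configuration key for illustration.
        self.doc_column = config.get("doc_column", "contents")
        self.seen: set[str] = set()

    def transform(self, rows: list[dict]) -> list[dict]:
        # Keep a row only if its content hash has not been seen before.
        out = []
        for row in rows:
            h = hashlib.sha256(row[self.doc_column].encode("utf-8")).hexdigest()
            if h not in self.seen:
                self.seen.add(h)
                out.append(row)
        return out
```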
5 changes: 5 additions & 0 deletions tools/ingest2parquet/README.md
@@ -1,5 +1,10 @@
# INGEST2PARQUET

**Please note: This tool is deprecated and will be removed soon.
It is superseded by the transform-based implementation,
[code2parquet](../../transforms/code/code2parquet), providing identical capability,
but with support for ray-based scalability.**

## Summary
This Python script converts raw data files, particularly ZIP files, into Parquet format, processing multiple files concurrently with multiprocessing for efficient execution.
Each file contained within a ZIP is transformed into a distinct row of the Parquet dataset, adhering to the schema below.
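The per-file expansion can be sketched as follows (a simplified illustration: the real script parallelizes over ZIP files with multiprocessing and writes rows out as Parquet, and the field names shown here are assumptions, not the script's actual schema):

```python
import io
import zipfile


def zip_to_rows(zip_bytes: bytes) -> list[dict]:
    """Turn each member of a ZIP archive into one row (dict)."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            rows.append({
                "document": name,
                "contents": data.decode("utf-8", errors="replace"),
                "size": len(data),
            })
    return rows
```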
3 changes: 2 additions & 1 deletion transforms/code/code2parquet/python/README.md
@@ -76,7 +76,8 @@ from the configuration dictionary.
and specifies the path to a JSON file containing the mapping of languages
to extensions. The JSON file is expected to contain a dictionary with
language names as keys, and values that are lists of strings specifying the
associated extensions.
associated extensions. As an example, see
[lang_extensions](test-data/languages/lang_extensions.json).
* `data_access_factory` - used to create the DataAccess instance used to read
the file specified in `supported_langs_file`.
* `detect_programming_lang` - a flag that indicates if the language:extension mappings
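To illustrate the expected file shape and how such a mapping might be consulted (the mapping contents and helper function below are hypothetical examples, not the transform's actual code):

```python
import json

# Shape of the JSON file: language names -> list of associated extensions.
LANG_JSON = '{"Python": [".py"], "C": [".c", ".h"]}'


def build_ext_map(lang_json: str) -> dict:
    """Invert the language->extensions mapping into extension->language."""
    lang_map = json.loads(lang_json)
    return {ext: lang for lang, exts in lang_map.items() for ext in exts}


def detect_lang(filename: str, ext_map: dict) -> str:
    """Look up a file's language by its extension; 'unknown' if unmapped."""
    dot = filename.rfind(".")
    return ext_map.get(filename[dot:], "unknown") if dot != -1 else "unknown"
```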
4 changes: 2 additions & 2 deletions transforms/language/lang_id/README.md
@@ -8,5 +8,5 @@ the following runtimes are available:
implementation.
* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
noop in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
8 changes: 4 additions & 4 deletions transforms/universal/doc_id/README.md
@@ -1,12 +1,12 @@
# Doc ID Transform
The Document ID transforms adds a document identification, which later can be used in de-duplication operations.
Per the set of
The Document ID transform adds document identification (unique integers and content hashes), which can later be used in de-duplication operations,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
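A minimal sketch of the two kinds of identifiers (hypothetical illustration only — the real transform assigns integer ids consistently across files and makes the target columns configurable):

```python
import hashlib
from itertools import count

# Process-wide counter standing in for a globally coordinated id assigner.
_next_id = count()


def make_doc_ids(contents: str) -> tuple:
    """Return (unique integer id, content hash) for one document."""
    digest = hashlib.sha256(contents.encode("utf-8")).hexdigest()
    return next(_next_id), digest
```

Identical documents get identical hashes but distinct integer ids, which is what later de-duplication steps rely on.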
2 changes: 1 addition & 1 deletion transforms/universal/doc_id/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
RUN pip install --no-cache-dir -e .

# copy source data
File renamed without changes.
10 changes: 5 additions & 5 deletions transforms/universal/ededup/README.md
@@ -1,10 +1,10 @@
# Exact Deduplication Transform
The exact deduplication removes text duplications
Per the set of
# Exact Deduplication Transform
The ededup transform removes duplicate documents within a set of parquet files,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
2 changes: 1 addition & 1 deletion transforms/universal/ededup/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
COPY --chown=ray:users images/ images/
RUN pip install --no-cache-dir -e .

File renamed without changes.
10 changes: 5 additions & 5 deletions transforms/universal/fdedup/README.md
@@ -1,10 +1,10 @@
# Fuzzy Deduplication Transform
The fuzzy deduplication removes text duplications
Per the set of
# Fuzzy Deduplication Transform
The fdedup transform removes documents that are very similar to each other within a set of parquet files,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
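The notion of "very similar" can be illustrated with Jaccard similarity over word shingles (an illustrative stand-in only — the actual fdedup implementation uses scalable approximate techniques rather than exact pairwise comparison):

```python
def shingles(text: str, k: int = 3) -> set:
    """The set of k-word shingles (overlapping word windows) of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two documents' shingle sets, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Documents whose similarity exceeds some threshold would be treated as duplicates and collapsed to one representative.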
2 changes: 1 addition & 1 deletion transforms/universal/fdedup/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
COPY --chown=ray:users images/ images/
RUN pip install --no-cache-dir -e .

File renamed without changes.
8 changes: 4 additions & 4 deletions transforms/universal/filter/README.md
@@ -1,6 +1,6 @@
# Filter Transform
The NOOP transforms serves as a simple exemplar to demonstrate the development
of a simple 1:1 transform. Per the set of
The filter transform provides SQL-based expressions for filtering rows and optionally removing columns from parquet files,
per the set of
[transform project conventions](../../README.md#transform-project-conventions)
the following runtimes are available:

@@ -10,5 +10,5 @@ implementation.
in a Ray runtime
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
filter in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
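The idea of SQL-based row filtering can be sketched with the standard-library `sqlite3` module (an illustration only — the filter transform operates on parquet tables, and its actual query engine and configuration keys are not shown here):

```python
import sqlite3


def filter_rows(rows: list, criteria: str) -> list:
    """Keep only rows satisfying a SQL WHERE-style filter expression."""
    con = sqlite3.connect(":memory:")
    # Hypothetical two-column schema for illustration.
    con.execute("CREATE TABLE docs (document TEXT, score REAL)")
    con.executemany("INSERT INTO docs VALUES (?, ?)", rows)
    kept = con.execute(
        f"SELECT document, score FROM docs WHERE {criteria}"
    ).fetchall()
    con.close()
    return kept
```

For example, `filter_rows(data, "score > 0.5")` keeps only the rows whose `score` column exceeds the threshold.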
4 changes: 2 additions & 2 deletions transforms/universal/noop/README.md
@@ -10,5 +10,5 @@ implementation.
in a Ray runtime
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
the transformer in a kubernetes cluster using a generated `yaml` file.
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
2 changes: 1 addition & 1 deletion transforms/universal/profiler/ray/Dockerfile
@@ -13,7 +13,7 @@ RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
# Install ray project source
COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users Readme.md Readme.md
COPY --chown=ray:users README.md README.md
RUN pip install --no-cache-dir -e .

# copy source data
File renamed without changes.
2 changes: 1 addition & 1 deletion transforms/universal/tokenization/README.md
@@ -9,5 +9,5 @@ the following runtimes are available:
implementation.
* [ray](ray/README.md) - enables the running of the python-based transformation
in a Ray runtime
* [kfp_ray](kfp_ray/README.md) - enables running the ray docker image for
* [kfp](kfp_ray/README.md) - enables running the ray docker image for
the transform in a kubernetes cluster using a generated `yaml` file.
