
Added Curator and Spark example #220

Closed · wants to merge 1 commit

Conversation

ronjer30 (Contributor)

Description

Adds a simple example demonstrating how Apache Spark can interoperate with NeMo Curator modules using intermediate Parquet files.

Usage

Jupyter notebook tutorial

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: RanjitR <ranjitr@nvidia.com>
arhamm1 assigned arhamm1 and unassigned arhamm1 on Sep 3, 2024
sarahyurick (Collaborator) left a comment


This is a cool tutorial! I've added a bunch of grammar/punctuation nits for now.

"NeMo Curator is a Python library that consists of a collection of scalable data processing modules for curating natural language processing (NLP) data for training large language models (LLMs). The modules within the NeMo Data Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora. \n",
"\n",
"NeMo Curator includes the following modules to perform data curation:\n",
"- Data download and Extraction\n",

Suggested change
"- Data download and Extraction\n",
"- Data download and text extraction\n",

"- Quality filtering\n",
"- Document-level deduplication\n",
"- Multilingual downstream-task decontamination\n",
"- Distributed Data Classification\n",

Suggested change
"- Distributed Data Classification\n",
"- Distributed data classification\n",

"## About this notebook\n",
"\n",
"\n",
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark and build and end to end curation pipeline. \n",

Suggested change
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark and build and end to end curation pipeline. \n",
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark, and build an end-to-end curation pipeline. \n",

"3. Perform additional processing using PySpark\n",
"4. Deduplication using Nemo Curator\n",
"\n",
"For a full working example of Nemo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)\n",

Suggested change
"For a full working example of Nemo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)\n",
"For a full working example of NeMo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb).\n",

"## Prerequisites\n",
"\n",
"### System Requirements\n",
"Here is the hardware setting for this notebook\n",

Suggested change
"Here is the hardware setting for this notebook\n",
"Here are the hardware settings for this notebook:\n",

"\n",
"#ignores checksum and marker files created by Spark job\n",
"processed_files = [filename for filename in get_all_files_paths_under(PROCESSED_DIR) \n",
" if not filename.endswith('.crc') or filename.endswith('_SUCCESS')]\n"

Suggested change
" if not filename.endswith('.crc') or filename.endswith('_SUCCESS')]\n"
" if not filename.endswith(".crc") or filename.endswith("_SUCCESS")]\n"

"source": [
"t0 = time.time()\n",
"# Read input dataset from Spark output\n",
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n",

Suggested change
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n",
"input_dataset = DocumentDataset.read_parquet(processed_files, backend="cudf")\n",

"!mkdir -p {LOG_DIR}\n",
"!mkdir -p {EXACT_DEDUP_OUT_DIR}\n",
"\n",
"#ignores checksum and marker files created by Spark job\n",

Suggested change
"#ignores checksum and marker files created by Spark job\n",
"# Ignores checksum and marker files created by Spark job\n",

"# Read input dataset from Spark output\n",
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n",
"\n",
"#Run exact deduplication to the input\n",

Suggested change
"#Run exact deduplication to the input\n",
"# Run exact deduplication on input_dataset\n",

" id_field=\"id\",\n",
" text_field=\"text\",\n",
" hash_method=\"md5\",\n",
" cache_dir=EXACT_DEDUP_OUT_DIR #Duplicated document ID list is output to the cache_dir\n",

Suggested change
" cache_dir=EXACT_DEDUP_OUT_DIR #Duplicated document ID list is output to the cache_dir\n",
" cache_dir=EXACT_DEDUP_OUT_DIR # Duplicated document ID list is outputted to the cache_dir\n",

@ronjer30 (Contributor, Author)

As discussed, moving this example to docs with a new PR https://github.com/NVIDIA/NeMo-Curator/pull/261
