
Added Curator and Spark example #220

Closed · wants to merge 1 commit

Conversation

ronjer30 (Contributor)

Description

Adds a simple example demonstrating how Apache Spark can interoperate with NeMo Curator modules using intermediate Parquet files.

Usage

Jupyter notebook tutorial

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: RanjitR <ranjitr@nvidia.com>
arhamm1 assigned arhamm1 and unassigned arhamm1 on Sep 3, 2024
sarahyurick (Collaborator) left a comment


This is a cool tutorial! I've added a bunch of grammar/punctuation nits for now.

"NeMo Curator is a Python library that consists of a collection of scalable data processing modules for curating natural language processing (NLP) data for training large language models (LLMs). The modules within the NeMo Data Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora. \n",
"\n",
"NeMo Curator includes the following modules to perform data curation:\n",
"- Data download and Extraction\n",

Suggested change
"- Data download and Extraction\n",
"- Data download and text extraction\n",

"- Quality filtering\n",
"- Document-level deduplication\n",
"- Multilingual downstream-task decontamination\n",
"- Distributed Data Classification\n",

Suggested change
"- Distributed Data Classification\n",
"- Distributed data classification\n",

"## About this notebook\n",
"\n",
"\n",
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark and build and end to end curation pipeline. \n",

Suggested change
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark and build and end to end curation pipeline. \n",
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark, and build an end-to-end curation pipeline. \n",

"3. Perform additional processing using PySpark\n",
"4. Deduplication using Nemo Curator\n",
"\n",
"For a full working example of Nemo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)\n",

Suggested change
"For a full working example of Nemo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)\n",
"For a full working example of NeMo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb).\n",

"## Prerequisites\n",
"\n",
"### System Requirements\n",
"Here is the hardware setting for this notebook\n",

Suggested change
"Here is the hardware setting for this notebook\n",
"Here are the hardware settings for this notebook:\n",

"\n",
"#ignores checksum and marker files created by Spark job\n",
"processed_files = [filename for filename in get_all_files_paths_under(PROCESSED_DIR) \n",
" if not filename.endswith('.crc') or filename.endswith('_SUCCESS')]\n"

Suggested change
" if not filename.endswith('.crc') or filename.endswith('_SUCCESS')]\n"
" if not filename.endswith(".crc") or filename.endswith("_SUCCESS")]\n"

"source": [
"t0 = time.time()\n",
"# Read input dataset from Spark output\n",
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n",

Suggested change
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n",
"input_dataset = DocumentDataset.read_parquet(processed_files, backend="cudf")\n",

"!mkdir -p {LOG_DIR}\n",
"!mkdir -p {EXACT_DEDUP_OUT_DIR}\n",
"\n",
"#ignores checksum and marker files created by Spark job\n",

Suggested change
"#ignores checksum and marker files created by Spark job\n",
"# Ignores checksum and marker files created by Spark job\n",

"# Read input dataset from Spark output\n",
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n",
"\n",
"#Run exact deduplication to the input\n",

Suggested change
"#Run exact deduplication to the input\n",
"# Run exact deduplication on input_dataset\n",

" id_field=\"id\",\n",
" text_field=\"text\",\n",
" hash_method=\"md5\",\n",
" cache_dir=EXACT_DEDUP_OUT_DIR #Duplicated document ID list is output to the cache_dir\n",

Suggested change
" cache_dir=EXACT_DEDUP_OUT_DIR #Duplicated document ID list is output to the cache_dir\n",
" cache_dir=EXACT_DEDUP_OUT_DIR # Duplicated document ID list is outputted to the cache_dir\n",

@ronjer30 (Contributor, Author)

As discussed, moving this example to docs with a new PR https://github.com/NVIDIA/NeMo-Curator/pull/261
