Added Curator and Spark example #220
Conversation
Signed-off-by: RanjitR <ranjitr@nvidia.com>
This is a cool tutorial! I've added a bunch of grammar/punctuation nits for now.
"NeMo Curator is a Python library that consists of a collection of scalable data processing modules for curating natural language processing (NLP) data for training large language models (LLMs). The modules within the NeMo Data Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora. \n", | ||
"\n", | ||
"NeMo Curator includes the following modules to perform data curation:\n", | ||
"- Data download and Extraction\n", |
"- Data download and Extraction\n", | |
"- Data download and text extraction\n", |
"- Quality filtering\n", | ||
"- Document-level deduplication\n", | ||
"- Multilingual downstream-task decontamination\n", | ||
"- Distributed Data Classification\n", |
"- Distributed Data Classification\n", | |
"- Distributed data classification\n", |
"## About this notebook\n", | ||
"\n", | ||
"\n", | ||
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark and build and end to end curation pipeline. \n", |
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark and build and end to end curation pipeline. \n", | |
"This notebook will use the **Tiny Stories [Dataset](https://huggingface.co/datasets/roneneldan/TinyStories)** as an example to demonstrate how you can run data curation using NeMo Curator, integrate additional data processing you may already have on PySpark, and build an end-to-end curation pipeline. \n", |
"3. Perform additional processing using PySpark\n", | ||
"4. Deduplication using Nemo Curator\n", | ||
"\n", | ||
"For a full working example of Nemo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)\n", |
"For a full working example of Nemo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)\n", | |
"For a full working example of NeMo Curator, please refer this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb).\n", |
"## Prerequisites\n", | ||
"\n", | ||
"### System Requirements\n", | ||
"Here is the hardware setting for this notebook\n", |
"Here is the hardware setting for this notebook\n", | |
"Here are the hardware settings for this notebook:\n", |
"\n", | ||
"#ignores checksum and marker files created by Spark job\n", | ||
"processed_files = [filename for filename in get_all_files_paths_under(PROCESSED_DIR) \n", | ||
" if not filename.endswith('.crc') or filename.endswith('_SUCCESS')]\n" |
" if not filename.endswith('.crc') or filename.endswith('_SUCCESS')]\n" | |
" if not filename.endswith(".crc") or filename.endswith("_SUCCESS")]\n" |
"source": [ | ||
"t0 = time.time()\n", | ||
"# Read input dataset from Spark output\n", | ||
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n", |
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n", | |
"input_dataset = DocumentDataset.read_parquet(processed_files, backend="cudf")\n", |
"!mkdir -p {LOG_DIR}\n", | ||
"!mkdir -p {EXACT_DEDUP_OUT_DIR}\n", | ||
"\n", | ||
"#ignores checksum and marker files created by Spark job\n", |
"#ignores checksum and marker files created by Spark job\n", | |
"# Ignores checksum and marker files created by Spark job\n", |
"# Read input dataset from Spark output\n", | ||
"input_dataset = DocumentDataset.read_parquet(processed_files, backend='cudf')\n", | ||
"\n", | ||
"#Run exact deduplication to the input\n", |
"#Run exact deduplication to the input\n", | |
"# Run exact deduplication on input_dataset\n", |
" id_field=\"id\",\n", | ||
" text_field=\"text\",\n", | ||
" hash_method=\"md5\",\n", | ||
" cache_dir=EXACT_DEDUP_OUT_DIR #Duplicated document ID list is output to the cache_dir\n", |
" cache_dir=EXACT_DEDUP_OUT_DIR #Duplicated document ID list is output to the cache_dir\n", | |
" cache_dir=EXACT_DEDUP_OUT_DIR # Duplicated document ID list is outputted to the cache_dir\n", |
As discussed, moving this example to docs with a new PR https://github.com/NVIDIA/NeMo-Curator/pull/261 |
Description
Adds a simple example to demonstrate how Apache Spark can interoperate with NeMo Curator modules using intermediate parquet files.
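Because the handoff is just Parquet on disk, it works in both directions. A minimal sketch of the reverse flow, with illustrative paths and assuming `DocumentDataset`'s `read_json`/`to_parquet` helpers behave as in the tutorials:

```python
# Reverse handoff: NeMo Curator writes Parquet, Spark reads it back.
# Paths are illustrative assumptions.
from nemo_curator.datasets import DocumentDataset
from pyspark.sql import SparkSession

dataset = DocumentDataset.read_json("data/raw")  # assumed JSONL input directory
dataset.to_parquet("data/curated")               # intermediate Parquet handoff

spark = SparkSession.builder.appName("read-curated").getOrCreate()
df = spark.read.parquet("data/curated")
print(df.count())  # Spark now operates on the curated documents
spark.stop()
```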
Usage
Jupyter notebook tutorial
Checklist