
cp: Improve math tutorial README documentation (1766) into r1.2.0 #1859

Merged
sarahyurick merged 1 commit into r1.2.0 from cherry-pick-1766-r1.2.0 on Apr 22, 2026


@svcnvidia-nemo-ci
Contributor

beep boop [🤖]: Hi @lbliii 👋,

we've cherry picked #1766 into  for you! 🚀

Please review and approve this cherry pick at your convenience!

* Improve math tutorial README documentation

Address documentation gaps including: platform requirements (vLLM
x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS
credential setup, HuggingFace token guidance with error notes, all 12
available prompts documented with selection guidance, quality score
interpretation table, LSH parameter documentation for deduplication,
cache clearing warning, compact horizontal pipeline diagram, and a
troubleshooting section for common errors.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address PR review comments on math tutorial README

- Fix wrong input directory in Content Cleaning example
  (deduplicated → preprocessed) to match pipeline order
- Add --bands_per_iteration to LSH parameters table
- Correct AWS description: Common Crawl is part of AWS Open Data
  program, not requester-pays

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address reviewer feedback on math tutorial README

- Remove Platform Requirements section (Curator is Linux-only)
- Revert GPU requirements to original wording
- Remove CUDA verification from troubleshooting
- Remove redundant "Repository Not Found" troubleshooting section
- Fix LSH threshold value from 0.72 to 0.79

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* Remove redundant "No GPUs Detected" troubleshooting subsection

Addresses reviewer feedback: the pynvml reinstall note already appears
in the Install section, so duplicating it under Troubleshooting is
unnecessary.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test 2889617

@copy-pr-bot

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This is a documentation-only cherry-pick from #1766 into r1.2.0 that substantially improves the math tutorial README. It adds an AWS credentials setup section, a quick-overview ASCII pipeline diagram, a collapsible wrapper for the detailed Mermaid flowchart, expanded HuggingFace auth guidance, score interpretation tables, LSH parameter reference, and a new Troubleshooting section.

Confidence Score: 5/5

Documentation-only change; safe to merge.

No code is modified — all changes are README documentation additions and clarifications. The technical content (LSH threshold formula, score tables, CLI flag descriptions) is accurate. No P0/P1 findings.

No files require special attention.
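The README's LSH threshold formula is not reproduced in this thread. As a rough sketch only, the standard MinHash-LSH banding approximation places the similarity threshold near (1/b)^(1/r) for b bands of r hashes each. The function names and the values b=20, r=13 below are assumptions chosen for illustration, not settings confirmed by this PR.

```python
def lsh_threshold(num_bands: int, rows_per_band: int) -> float:
    """Approximate Jaccard similarity where the banding S-curve rises steeply."""
    if num_bands < 1 or rows_per_band < 1:
        raise ValueError("num_bands and rows_per_band must be positive")
    return (1.0 / num_bands) ** (1.0 / rows_per_band)


def candidate_probability(similarity: float, num_bands: int, rows_per_band: int) -> float:
    """Probability that two documents share at least one identical band."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


# Assumed, illustrative parameters (not taken from the README or this PR).
NUM_BANDS = 20
ROWS_PER_BAND = 13

threshold = lsh_threshold(NUM_BANDS, ROWS_PER_BAND)
print(f"approximate threshold: {threshold:.2f}")  # approximate threshold: 0.79
```

Under these assumed parameters, pairs well above the threshold are almost always flagged as duplicate candidates and pairs well below it almost never are, which is why the documented threshold needs to agree with the band settings.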

Important Files Changed

| Filename | Overview |
| --- | --- |
| tutorials/math/README.md | Documentation improvements: AWS credentials section, pipeline overview diagram, collapsible Mermaid chart, expanded auth/score/LSH guidance, and a Troubleshooting section; all accurate and well-structured. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart LR
    HF["HuggingFace Hub"] -->|"0_download.py"| RAW["Raw Data"]
    RAW -->|"1_cc_index_lookup.py\n(if no WARC metadata)"| CC_IDX["CC Index (S3)"]
    CC_IDX --> PRE["2_text_preprocess.py"]
    RAW -->|"skip step 1\n(if WARC present)"| PRE
    PRE -->|"fetches WARC via S3/HTTPS"| LLM["3_llm_cleanup.py"]
    LLM --> QC["4_quality_classifier.py"]
    QC --> DEDUP["5_deduplication.py"]
    DEDUP --> FINAL["Final Data"]
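
The branching in the flowchart above can be sketched as a small planning helper. This is an illustrative reading of the diagram, not code from the tutorial: `plan_steps` and `has_warc_metadata` are hypothetical names, the script names are copied from the flowchart, and command-line arguments are omitted because this thread does not show them.

```python
def plan_steps(has_warc_metadata: bool) -> list[str]:
    """Return the tutorial scripts in the order the flowchart implies."""
    steps = ["0_download.py"]
    # Step 1 is only needed when the raw data lacks WARC metadata.
    if not has_warc_metadata:
        steps.append("1_cc_index_lookup.py")
    steps += [
        "2_text_preprocess.py",
        "3_llm_cleanup.py",
        "4_quality_classifier.py",
        "5_deduplication.py",
    ]
    return steps


for has_warc in (True, False):
    print(f"WARC metadata present={has_warc}: {' -> '.join(plan_steps(has_warc))}")
```

The only conditional edge in the diagram is the Common Crawl index lookup; every other stage runs in a fixed order, with deduplication last.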

Reviews (1): Last reviewed commit: "Improve math tutorial README documentati..."

@sarahyurick sarahyurick merged commit 9e8349c into r1.2.0 Apr 22, 2026
49 checks passed
