cp: Improve math tutorial README documentation (1766) into r1.2.0#1859
sarahyurick merged 1 commit into r1.2.0
* Improve math tutorial README documentation

Address documentation gaps including: platform requirements (vLLM x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS credential setup, HuggingFace token guidance with error notes, all 12 available prompts documented with selection guidance, quality score interpretation table, LSH parameter documentation for deduplication, cache clearing warning, compact horizontal pipeline diagram, and a troubleshooting section for common errors.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address PR review comments on math tutorial README

- Fix wrong input directory in Content Cleaning example (deduplicated → preprocessed) to match pipeline order
- Add --bands_per_iteration to LSH parameters table
- Correct AWS description: Common Crawl is part of AWS Open Data program, not requester-pays

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address reviewer feedback on math tutorial README

- Remove Platform Requirements section (Curator is Linux-only)
- Revert GPU requirements to original wording
- Remove CUDA verification from troubleshooting
- Remove redundant "Repository Not Found" troubleshooting section
- Fix LSH threshold value from 0.72 to 0.79

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* Remove redundant "No GPUs Detected" troubleshooting subsection

Addresses reviewer feedback: the pynvml reinstall note already appears in the Install section, so duplicating it under Troubleshooting is unnecessary.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
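For readers checking the corrected LSH threshold of 0.79, here is a minimal sketch of the standard MinHash LSH banding approximation, threshold ≈ (1/b)^(1/r). The band count (20) and hashes per band (13) are assumptions chosen for illustration; they are not stated in this PR, and the function names are hypothetical, not part of any Curator API.

```python
def lsh_threshold(num_bands: int, hashes_per_band: int) -> float:
    """Approximate Jaccard similarity at the steepest point of the LSH S-curve."""
    return (1.0 / num_bands) ** (1.0 / hashes_per_band)


def candidate_probability(similarity: float, num_bands: int, hashes_per_band: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    collide in at least one band and so become a candidate pair."""
    return 1.0 - (1.0 - similarity ** hashes_per_band) ** num_bands


if __name__ == "__main__":
    # Assumed parameters for illustration only (not taken from the PR).
    bands, rows = 20, 13
    print(f"approximate threshold: {lsh_threshold(bands, rows):.2f}")
    for s in (0.6, 0.7, 0.8, 0.9):
        print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s, bands, rows):.3f}")
```

Under those assumed values the approximation rounds to 0.79, which matches the corrected figure in the commit message; whether the README derives its value this way is not confirmed by this page.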
/ok to test 2889617
Greptile Summary
This is a documentation-only cherry-pick.
Confidence Score: 5/5
Documentation-only change; safe to merge. No code is modified; all changes are README documentation additions and clarifications. The technical content (LSH threshold formula, score tables, CLI flag descriptions) is accurate. No P0/P1 findings. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart LR
HF["HuggingFace Hub"] -->|"0_download.py"| RAW["Raw Data"]
RAW -->|"1_cc_index_lookup.py\n(if no WARC metadata)"| CC_IDX["CC Index (S3)"]
CC_IDX --> PRE["2_text_preprocess.py"]
RAW -->|"skip step 1\n(if WARC present)"| PRE
PRE -->|"fetches WARC via S3/HTTPS"| LLM["3_llm_cleanup.py"]
LLM --> QC["4_quality_classifier.py"]
QC --> DEDUP["5_deduplication.py"]
DEDUP --> FINAL["Final Data"]
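The step order in the flowchart above can be sketched as a small planner. This is only an illustration of the branching (step 1 is skipped when WARC metadata is already present); the script names come from the diagram, while the function name and the boolean flag are hypothetical and not part of the tutorial.

```python
def planned_steps(has_warc_metadata: bool) -> list[str]:
    """Return the tutorial scripts in the order the flowchart runs them."""
    steps = ["0_download.py"]
    if not has_warc_metadata:
        # The CC Index lookup is only needed when WARC metadata is missing.
        steps.append("1_cc_index_lookup.py")
    steps += [
        "2_text_preprocess.py",
        "3_llm_cleanup.py",
        "4_quality_classifier.py",
        "5_deduplication.py",
    ]
    return steps


if __name__ == "__main__":
    for flag in (True, False):
        print(f"WARC metadata present={flag}: {' -> '.join(planned_steps(flag))}")
```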
Reviews (1): Last reviewed commit: "Improve math tutorial README documentati..."
beep boop [🤖]: Hi @lbliii 👋,