
Adds example on how to chunk Hamilton documentation #721

Merged (4 commits, Mar 5, 2024)

Conversation

@skrawcz (Collaborator) commented Feb 29, 2024

This uses langchain's text splitting functionality; you could easily replace this with your own logic. The DAG itself is defined and run in Hamilton, and the processing of each URL is done in parallel. You could just as easily farm this out to Ray or Dask. See the documentation for more info.
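To make the "replace this with your own logic" point concrete, here is a minimal, stdlib-only sketch of a fixed-size overlapping splitter. It is not the example's actual code and not langchain's `RecursiveCharacterTextSplitter` (which splits on separators hierarchically); the function name and parameters below are hypothetical, chosen just to illustrate the chunking step:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    A stand-in for the langchain text splitter used in the example;
    swap in any splitting logic with the same list-of-strings output.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` context
    return chunks
```

Any function with this shape (text in, list of chunks out) slots into the same place in the pipeline.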

Changes

  • adds example for pulling HTML, parsing, and then chunking it "into documents".

How I tested this

  • runs locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Ellipsis 🚀 This PR description was created by Ellipsis for commit 08e9f28.

Summary:

This PR adds a new example to the Hamilton library demonstrating how to chunk Hamilton documentation using a parallel processing pipeline, which can run locally, as well as with Ray, Dask, and PySpark.

Key points:

  • Added a new example module in examples/LLM_Workflows/scraping_and_chunking.
  • The module demonstrates how to chunk Hamilton documentation using a parallel processing pipeline.
  • The pipeline can run locally, as well as with Ray, Dask, and PySpark.
  • The processing of each URL is done in parallel.

Generated with ❤️ by ellipsis.dev

@skrawcz skrawcz marked this pull request as ready for review March 4, 2024 00:58
@ellipsis-dev (Contributor) bot left a comment


❌ Changes requested.

  • Reviewed the entire pull request up to 7c5468d
  • Looked at 208 lines of code in 3 files
  • Took 1 minute and 8 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/pipeline.py:45:
  • Assessed confidence: 70%
  • Grade: 0%
  • Comment:
    Consider adding a check to ensure that 'max_urls' is less than or equal to the length of 'urls_from_sitemap'. If 'max_urls' is greater than the length of the list, this could lead to unexpected behavior.
  • Reasoning:
    The 'url' function is defined to take a list of URLs and a maximum number of URLs to process. However, there is no check to ensure that 'max_urls' is less than or equal to the length of the list. If 'max_urls' is greater than the length of the list, this could lead to unexpected behavior.
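The reviewer's suggestion could be addressed with a clamp along these lines. This is a sketch, not the example's actual `pipeline.py` code, and the function name mirrors the names the review mentions rather than the real module (note that Python slicing already tolerates an oversized `max_urls`; the explicit clamp just makes the intent visible):

```python
def limited_urls(urls_from_sitemap: list[str], max_urls: int) -> list[str]:
    # Clamp max_urls so behavior is predictable even when it exceeds
    # the number of URLs available -- the reviewer's concern above.
    limit = min(max_urls, len(urls_from_sitemap))
    return urls_from_sitemap[:limit]
```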

Workflow ID: wflow_OEO6o5caT6ylp3uF


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. We'll respond in a few minutes. Learn more here.

examples/LLM_Workflows/scraping_and_chunking/pipeline.py (outdated; resolved)
examples/LLM_Workflows/scraping_and_chunking/pipeline.py (outdated; resolved)
Since that reflects what is happening,
rather than creating embeddings.

TODOs:
 - leave notes on how this connects with embeddings.
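The TODO above might eventually connect the chunked documents to an embedding step along these lines. The `embed` function here is a deliberately fake placeholder (it derives pseudo-vectors from a hash); a real pipeline would call an embedding model or API, so treat this purely as a sketch of the data flow:

```python
import hashlib

def embed(chunk: str, dim: int = 8) -> list[float]:
    # Placeholder embedder: deterministic pseudo-embeddings from a hash.
    # A real pipeline would call an embedding model here instead.
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # The downstream step the TODO hints at: each chunked document
    # becomes one embedding vector, ready for a vector store.
    return [embed(c) for c in chunks]
```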
@ellipsis-dev (Contributor) bot left a comment


👍 Looks good to me!

  • Performed an incremental review on fd82bc0
  • Looked at 13 lines of code in 1 file
  • Took 1 minute and 48 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/README.md:2:
  • Assessed confidence: 50%
  • Comment:
    Consider adding a brief explanation or a link to further information about 'RAG dataflow' for users who may not be familiar with this term.
  • Reasoning:
    The README file is concise and provides a clear overview of the purpose and usage of the new module. However, it would be helpful to include a brief explanation of what 'RAG dataflow' is, as not all users may be familiar with this term.

Workflow ID: wflow_jmScUJsssDYRhwty


Not what you expected? You can customize the content of the reviews using rules. Learn more here.

To make it clearer how things work and operate.
@ellipsis-dev (Contributor) bot left a comment


👍 Looks good to me!

  • Performed an incremental review on e79193c
  • Looked at 461 lines of code in 7 files
  • Took 1 minute and 36 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%.
1. examples/LLM_Workflows/scraping_and_chunking/README.md:1:
  • Assessed confidence: 0%
  • Comment:
    No violations or bugs were found in the PR; it adheres to the principles and rules set for the review.

Workflow ID: wflow_ZG9cycqL4iZeFfTk



@elijahbenizzy (Collaborator) left a comment


This looks great -- I don't have time for a full review right now, but I glanced over it a few times; let's ship.

@skrawcz skrawcz merged commit 08e9f28 into main Mar 5, 2024
23 checks passed
@skrawcz skrawcz deleted the hamilton_docs_chunking branch March 5, 2024 06:48
@ellipsis-dev (Contributor) bot left a comment


👍 Looks good to me!

  • Performed an incremental review on 08e9f28
  • Looked at 531 lines of code in 7 files
  • Took 2 minutes and 0 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%.
1. examples/LLM_Workflows/scraping_and_chunking/doc_pipeline.py:1:
  • Assessed confidence: 0%
  • Comment:
    The code is well-written and follows best practices. The functions are well-documented and the logic is clear. Good job!
  • Reasoning:
    The code is clean and well-documented, follows the DRY and Single Responsibility principles, and uses consistent function and method naming. No secrets or credentials appear in the code, no sensitive data is logged, and no obvious bugs or issues were found. The PR title and description do not contain WIP, and the PR suggests places to add to the sphinx documentation under docs/ where appropriate.

Workflow ID: wflow_yIFkZM4uRy2ZXD4W



2 participants