Adds document chunking example to contrib #755

Merged 2 commits on Mar 11, 2024

@skrawcz (Collaborator) commented Mar 11, 2024

Things this module does:

  1. Takes in a sitemap.xml file and creates a list of all the URLs in the file.
  2. Takes in a list of URLs and pulls the HTML from each URL.
  3. Strips the HTML down to the relevant body. We assume furo-themed Sphinx docs, where the content lives at html/body/div[class="page"]/div[class="main"]/div[class="content"]/div[class="article-container"]/article
  4. Chunks the HTML into smaller pieces, returning LangChain documents.
  5. It does not create embeddings, but that would be easy to add as an extension.
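
The steps above can be sketched with the standard library alone. This is a rough illustration, not the code added in this PR: the function names (`urls_from_sitemap`, `fetch_html`, `article_text`, `chunk_text`) are hypothetical, the extraction keeps everything inside `<article>` rather than walking the full furo path, and the chunker is a plain overlapping character window standing in for the LangChain splitters, so it returns strings rather than LangChain documents.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Step 1: collect every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def fetch_html(url: str) -> str:
    """Step 2: download one page (assumes the page is UTF-8 encoded)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


class _ArticleText(HTMLParser):
    """Collects the text that appears inside <article> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # greater than 0 while inside <article>
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def article_text(html: str) -> str:
    """Step 3: reduce a page to the text of its <article> body."""
    parser = _ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Step 4: split text into windows of chunk_size chars sharing `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final window already reaches the end of the text
        start += step
    return chunks
```

Chaining the pieces would look like `[chunk_text(article_text(fetch_html(u))) for u in urls_from_sitemap(xml)]`; an embedding step, which this module leaves out, would consume the resulting chunks.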

For new dataflows:

Do you have the following?

  • Added a directory mapping to my GitHub user name in the contrib/hamilton/contrib/user directory.
    • If my author name contains hyphens, I have replaced them with underscores.
    • If my author name starts with a number, I have prefixed it with an underscore.
    • If my author name is a Python reserved keyword, reach out to the maintainers for help.
    • Added an author.md file under my username directory and filled it out.
    • Added an __init__.py file under my username directory.
  • Added a new folder for my dataflow under my username directory.
    • Added a README.md file under my dataflow directory that follows the standard headings and is filled out.
    • Added an __init__.py file under my dataflow directory that contains the Hamilton code.
    • Added a requirements.txt under my dataflow directory that contains the required packages outside of Hamilton.
    • Added tags.json under my dataflow directory to curate my dataflow.
    • Added valid_configs.jsonl under my dataflow directory to specify the valid configurations.
    • Added a dag.png that shows one possible configuration of my dataflow.
  • I hereby acknowledge that, to the best of my ability, the code I have contributed contains correct attribution
    and notices as appropriate.
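
The author-name rules in the checklist can be expressed as a small helper. This is only a sketch of those three rules as stated; `contrib_dir_name` is a hypothetical name and is not part of Hamilton or its contrib tooling.

```python
import keyword


def contrib_dir_name(github_user: str) -> str:
    """Map a GitHub user name to a directory name per the checklist rules."""
    # Hyphens are replaced with underscores.
    name = github_user.replace("-", "_")
    # A leading digit gets an underscore prefix.
    if name[:1].isdigit():
        name = "_" + name
    # Reserved keywords have no automatic fix; the checklist says to ask the maintainers.
    if keyword.iskeyword(name):
        raise ValueError(f"{name!r} is a Python keyword; ask the maintainers for help")
    return name
```

For example, `contrib_dir_name("my-user")` yields `my_user`, and `contrib_dir_name("1abc")` yields `_1abc`.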

How I tested this

  • ran it locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Dataflow documentation has been updated if adding/changing functionality.

@skrawcz skrawcz merged commit 820d69b into main Mar 11, 2024
23 of 25 checks passed
@skrawcz skrawcz deleted the contrib/doc_chunking branch March 11, 2024 21:57