Adds document chunking example to contrib #755

skrawcz · 2024-03-11T21:40:59Z

Things this module does:

1. Takes in a sitemap.xml file and creates a list of all the URLs in the file.
2. Takes in a list of URLs and pulls the HTML from each URL.
3. Strips the HTML down to the relevant body. We assume `furo themed sphinx docs`, so the content lives at html/body/div[class="page"]/div[class="main"]/div[class="content"]/div[class="article-container"]/article.
4. Chunks the HTML into smaller pieces, returning langchain documents.
5. What this doesn't do is create embeddings, but that would be easy to extend.
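
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code added in this PR: the function names (`urls_from_sitemap`, `fetch_html`, `article_text`, `chunk_text`) are hypothetical, the extractor simply grabs any `<article>` element instead of walking the full furo path, and the chunker returns plain strings with a fixed character window where the module itself returns langchain documents.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.request import urlopen

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class ArticleTextExtractor(HTMLParser):
    """Collects the text found inside <article> elements.

    Assumption: matching on the <article> tag alone is enough; the real
    module targets the full furo path described above.
    """

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # how many <article> elements we are currently inside
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def article_text(html: str) -> str:
    """Strip a page down to the text of its <article> body."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Chaining these together would look like `chunk_text(article_text(fetch_html(url)))` for each `url` in `urls_from_sitemap(...)`. Keeping each step as its own function mirrors the list above and leaves an obvious place to add an embedding step later.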

For new dataflows:

Do you have the following?

How I tested this

Ran it locally.

Notes

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Dataflow documentation has been updated if adding/changing functionality.


skrawcz added 2 commits March 11, 2024 14:34

Adds missing tags and import catch statement

961d231

skrawcz temporarily deployed to github-pages with GitHub Actions, March 11, 2024 21:51 (deployment now inactive)

skrawcz merged commit 820d69b into main Mar 11, 2024
23 of 25 checks passed

skrawcz deleted the contrib/doc_chunking branch March 11, 2024 21:57
