Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/fix html parse bug #195

Merged
merged 1 commit into from
Mar 24, 2024
Merged

Conversation

emrgnt-cmplxty
Copy link
Contributor

@emrgnt-cmplxty emrgnt-cmplxty commented Mar 24, 2024

Ellipsis 🚀 This PR description was created by Ellipsis for commit 81b79ae.

Summary:

This PR modifies the handling of HTML data in ingestion.py, updates the examples list in README.md, and adds a new guide for running a RAG pipeline.

Key points:

  • Modified process_data function in ingestion.py to handle HTML data as bytes object and encode it to 'utf-8'.
  • Updated README.md with reordered and updated examples list.
  • Added new guide end-to-end.mdx for running a RAG pipeline.

Generated with ❤️ by ellipsis.dev

Copy link

vercel bot commented Mar 24, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
r2r-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 24, 2024 4:27am

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Reviewed the entire pull request up to 4d044c4
  • Looked at 119 lines of code in 3 files
  • Took 26 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. r2r/pipelines/basic/ingestion.py:88:
  • Assessed confidence : 50%
  • Comment:
    The change from expecting a string to expecting a bytes object for HTML data seems to be in order. However, it's important to ensure that this change is reflected in all parts of the codebase that interact with this function. If there are other parts of the codebase that still pass a string as HTML data to this function, this could lead to errors.
  • Reasoning:
    The PR author has made changes to the README.md file, removing the 'pdf chat' example and renumbering the remaining examples. They have also added a new 'end-to-end' example. The changes seem to be in order and there are no obvious issues with the changes made to the README.md file. The PR author has also added a new file 'end-to-end.mdx' which provides a step-by-step guide on how to run a RAG pipeline locally with R2R. The instructions seem clear and there are no obvious issues with the content of this file. The PR author has also made changes to the 'ingestion.py' file, specifically to the 'process_data' function. They have changed the expected type of 'entry_data' for HTML entries from 'str' to 'bytes' and added an 'encode' function call when parsing HTML data. This change seems to be in order and there are no obvious issues with it.

Workflow ID: wflow_doHtN7YwAM4Bv7Su


Not what you expected? You can customize the content of the reviews using rules. Learn more here.

@emrgnt-cmplxty emrgnt-cmplxty merged commit f8de21d into main Mar 24, 2024
2 checks passed
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on 81b79ae
  • Looked at 17 lines of code in 1 files
  • Took 46 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. r2r/pipelines/basic/ingestion.py:88:
  • Assessed confidence : 80%
  • Grade: 0%
  • Comment:
    Consider encoding the entry_data to 'utf-8' before checking if it's a bytes object. This would ensure that the data is in the correct format before further processing.
entry_data = entry_data.encode('utf-8')
if not isinstance(entry_data, bytes):
    raise ValueError("HTML data must be a bytes object.")
return self._parse_html(entry_data)
  • Reasoning:
    The change in the PR modifies the data type for HTML from string to bytes. However, the encoding to 'utf-8' is done after the check for bytes object. This could potentially lead to issues if the data is not in 'utf-8' format. It would be better to perform the encoding before the check.

Workflow ID: wflow_FP4kpwW1aYQZaMq9


Not what you expected? You can customize the content of the reviews using rules. Learn more here.

@emrgnt-cmplxty emrgnt-cmplxty deleted the feature/fix-html-parse-bug branch March 24, 2024 06:39
iCUE-Solutions pushed a commit to DeweyLearn/DeweyLearnR2R that referenced this pull request Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant