-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/fix html parse bug #195
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Reviewed the entire pull request up to 4d044c4
- Looked at
119
lines of code in3
files - Took 26 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. r2r/pipelines/basic/ingestion.py:88
:
- Assessed confidence :
50%
- Comment:
The change from expecting a string to expecting a bytes object for HTML data seems to be in order. However, it's important to ensure that this change is reflected in all parts of the codebase that interact with this function. If there are other parts of the codebase that still pass a string as HTML data to this function, this could lead to errors. - Reasoning:
The PR author has made changes to the README.md file, removing the 'pdf chat' example and renumbering the remaining examples. They have also added a new 'end-to-end' example. The changes seem to be in order and there are no obvious issues with the changes made to the README.md file. The PR author has also added a new file 'end-to-end.mdx' which provides a step-by-step guide on how to run a RAG pipeline locally with R2R. The instructions seem clear and there are no obvious issues with the content of this file. The PR author has also made changes to the 'ingestion.py' file, specifically to the 'process_data' function. They have changed the expected type of 'entry_data' for HTML entries from 'str' to 'bytes' and added an 'encode' function call when parsing HTML data. This change seems to be in order and there are no obvious issues with it.
Workflow ID: wflow_doHtN7YwAM4Bv7Su
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
4d044c4
to
81b79ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 81b79ae
- Looked at
17
lines of code in1
files - Took 46 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. r2r/pipelines/basic/ingestion.py:88
:
- Assessed confidence :
80%
- Grade:
0%
- Comment:
Consider encoding theentry_data
to 'utf-8' before checking if it's a bytes object. This would ensure that the data is in the correct format before further processing.
entry_data = entry_data.encode('utf-8')
if not isinstance(entry_data, bytes):
raise ValueError("HTML data must be a bytes object.")
return self._parse_html(entry_data)
- Reasoning:
The change in the PR modifies the data type for HTML from string to bytes. However, the encoding to 'utf-8' is done after the check for bytes object. This could potentially lead to issues if the data is not in 'utf-8' format. It would be better to perform the encoding before the check.
Workflow ID: wflow_FP4kpwW1aYQZaMq9
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
Summary:
This PR modifies the handling of HTML data in
ingestion.py
, updates the examples list inREADME.md
, and adds a new guide for running a RAG pipeline.Key points:
process_data
function iningestion.py
to handle HTML data as bytes object and encode it to 'utf-8'.README.md
with reordered and updated examples list.end-to-end.mdx
for running a RAG pipeline.Generated with ❤️ by ellipsis.dev