Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix html bug #196

Merged
merged 1 commit into from
Mar 24, 2024
Merged

fix html bug #196

merged 1 commit into from
Mar 24, 2024

Conversation

emrgnt-cmplxty
Copy link
Contributor

@emrgnt-cmplxty emrgnt-cmplxty commented Mar 24, 2024

Ellipsis 🚀 This PR description was created by Ellipsis for commit 81b79ae.

Summary:

This PR modifies the process_data function in the BasicIngestionPipeline class to expect HTML data as a bytes object, which is then encoded to 'utf-8' before parsing.

Key points:

  • Modified process_data function in BasicIngestionPipeline class in /r2r/pipelines/basic/ingestion.py.
  • Changed expected HTML data type from string to bytes.
  • HTML data is now encoded to 'utf-8' before being passed to _parse_html function.

Generated with ❤️ by ellipsis.dev

Copy link

vercel bot commented Mar 24, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
r2r-docs ✅ Ready (Inspect) Visit Preview Mar 24, 2024 6:16am

@emrgnt-cmplxty emrgnt-cmplxty merged commit 490719a into main Mar 24, 2024
2 checks passed
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Reviewed the entire pull request up to 81b79ae
  • Looked at 18 lines of code in 1 files
  • Took 30 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_Yz60K8JK54K7FpZ0


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. We'll respond in a few minutes. Learn more here.

return self._parse_html(entry_data)
if not isinstance(entry_data, bytes):
raise ValueError("HTML data must be a bytes object.")
return self._parse_html(entry_data.encode("utf-8"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from expecting a string to expecting bytes for HTML data seems unnecessary, as it is immediately encoded back to a string. This could potentially introduce bugs. Consider reverting this change.

@emrgnt-cmplxty emrgnt-cmplxty deleted the feature/fix-html-parse-bug branch March 24, 2024 06:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant