Add generic_jsonl parser #1370

Merged on Mar 15, 2024 (25 commits)

jimwins (Contributor) commented Mar 1, 2024

Adds a JSONL parser and also fixes the JSON parser to reject what it suspects is a single-line JSONL file.
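
To illustrate the two behaviors described above, here is a minimal sketch rather than the code from this PR. The function names `iter_jsonl_records` and `load_json_list` are hypothetical, and the "expect a list" check is only one plausible way to detect a single-line JSONL file; the actual parser may use a different heuristic.

```python
import io
import json
from typing import IO, Iterator


def iter_jsonl_records(handle: IO[str]) -> Iterator[dict]:
    """Yield one dict per non-blank line of a JSONL stream (hypothetical helper)."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield record


def load_json_list(handle: IO[str]) -> list:
    """Parse a whole-file JSON export, rejecting input that looks like JSONL.

    Assumption: a JSON export is a list of objects. A one-line JSONL file is
    also valid JSON, but it decodes to a single object, so it is refused here.
    """
    data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list; input may be JSONL")
    return data
```

With this shape, a file containing a single object on one line is rejected by `load_json_list` and can then be handed to `iter_jsonl_records`. A multi-line JSONL file already fails in `json.load` with a `JSONDecodeError`, which is a subclass of `ValueError`.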

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

benmuth and others added 7 commits February 22, 2024 23:04
Also fix `YOUTUBEDL_EXTRA_ARGS`.
The feedparser package has 20 years of history and is very good at parsing
RSS and Atom, so use that instead of ad-hoc regex and XML parsing.

The medium_rss and shaarli_rss parsers weren't touched because they are
probably unnecessary. (The special parser for pinboard is only needed because
of how tags work.)

Doesn't include tests because I haven't figured out how to run them in the
docker development setup.

Fixes ArchiveBox#1171
pirate and others added 16 commits February 29, 2024 21:29
Changes ./bin/test.sh to pass command-line options through to pytest, and to
default to running only the tests in the tests/ directory. The previous
approach, running everywhere except a few excluded directories, was more
error-prone.

Also keeps the mock_server used in testing quiet so access log entries don't
appear on stdout.
Fixes ArchiveBox#1171
Fixes ArchiveBox#870 (probably, would need to test against a Wallabag Atom file to
Fixes ArchiveBox#135
Fixes ArchiveBox#123
Fixes ArchiveBox#106
pirate (Member) commented Mar 14, 2024

Small conflict to fix with your earlier RSS parser changes, then down to merge this!

jimwins (Contributor, Author) commented Mar 14, 2024

Okay, I probably made the git history a little wonkier than it needed to be, but the updated PR should merge cleanly now.

@pirate pirate merged commit 0872c84 into ArchiveBox:dev Mar 15, 2024
1 of 2 checks passed