parser=auto will almost always just fall back to parser=generic_txt, needs to let the first parser to find URLS win #1363
Here's how it should work:
Okay, that makes more sense. I think it would help to explicitly throw an error when --parser is specified and one or more URLs are passed on the command line, so I'll look into creating a patch that does that and write some test cases. (I started down this rabbit hole when I was trying to write test cases for my patch for #1171. That should be easy now that I understand how it is supposed to work!)
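A minimal sketch of the kind of check I mean (the function name and error message are illustrative, not ArchiveBox's actual CLI code):

```python
def reject_parser_with_url_args(parser: str, args: list[str]) -> None:
    """Proposed guard: error out if --parser is combined with URLs passed as args."""
    has_url_args = any(arg.startswith(("http://", "https://")) for arg in args)
    if parser != "auto" and has_url_args:
        raise SystemExit("--parser cannot be combined with URLs passed on the command line")
```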
Here's another behavior I question, which I ran across while writing tests:

ArchiveBox/archivebox/parsers/__init__.py, lines 138 to 141 in f02b279

This is problematic with RSS feeds or other XML formats, for example, since they may include URLs for namespaces that the user probably doesn't really intend to archive. It looks like this behavior was introduced with the commit for the new generic_html parser; I would have thought it would be problematic there, too.
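As a quick illustration of the namespace problem (the regex here is a simplified stand-in, not the actual URL_REGEX from parsers/__init__.py):

```python
import re

# simplified stand-in for a permissive URL regex
URL_REGEX = re.compile(r"""https?://[^\s"'<>]+""")

rss_snippet = """
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <item><link>https://example.com/a-real-article</link></item>
  </channel>
</rss>
"""

# scanning the raw feed text picks up the namespace URI as well as the real link
print(URL_REGEX.findall(rss_snippet))
# ['http://www.w3.org/2005/Atom', 'https://example.com/a-real-article']
```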
Ah yeah, I know about this shortcoming. Great job spelunking through the git history to find the context of that change! And I greatly appreciate your willingness to stick with debugging this even though it's clearly one of the poorest-designed, highest-technical-debt areas of the codebase. I sent you a GitHub Sponsorship as a token of my appreciation, since you're going above and beyond the casual contributor level here.

I think now that the dedicated parsers are improving, we can walk back that clobbery design decision and return it to the "first one that succeeds" approach. Assumptions to verify:
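Roughly, a sketch of what returning to "first one that succeeds" could look like (the parser table and function names are illustrative, not the actual code in parsers/__init__.py):

```python
from typing import Callable, Iterable

# dedicated parsers should come first, generic fallbacks (html/txt) last
PARSERS: list[tuple[str, Callable[[str], Iterable[str]]]] = [
    # ("rss", parse_rss), ("json", parse_json), ..., ("generic_txt", parse_generic_txt)
]

def parse_with_first_match(source_text: str) -> tuple[str, list[str]]:
    """Return (parser_name, urls) from the first parser that finds any URLs."""
    for name, parse in PARSERS:
        try:
            urls = list(parse(source_text))
        except Exception:
            continue  # this parser can't handle the input at all, try the next one
        if urls:
            return name, urls  # the first parser to find URLs wins
    return "", []
```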
That makes more sense. I'll redo my RSS parser pull request to not muck around with this area, and work on adding test cases for how this is supposed to behave. Still thinking about the complicated interactions of `--parser` and `--depth`, though.
Current status of tests:

```
FAILED tests/parser/test_auto.py::test_add_stdin_json
FAILED tests/parser/test_auto.py::test_add_file_json
FAILED tests/parser/test_auto.py::test_add_stdin_file_json
FAILED tests/parser/test_specific.py::test_json_url
FAILED tests/parser/test_specific.py::test_json_urls
FAILED tests/parser/test_specific.py::test_json_filenames
FAILED tests/parser/test_specific.py::test_json_stdin_urls
FAILED tests/parser/test_specific.py::test_json_stdin_filenames
FAILED tests/parser/test_specific.py::test_json_depth1_url
FAILED tests/parser/test_specific.py::test_json_depth1_urls
FAILED tests/parser/test_specific.py::test_json_depth1_filename
FAILED tests/parser/test_specific.py::test_json_depth1_filenames
FAILED tests/parser/test_specific.py::test_json_depth1_stdin_urls
FAILED tests/parser/test_specific.py::test_json_depth1_stdin_filenames
FAILED tests/parser/test_specific.py::test_json_depth1_stdin_contents
```

(15 failed, 4 passed.) Working on issue ArchiveBox#1363
This is how it was actually documented to work in the code, but it had been intentionally broken so it would just try to work with as much input as possible. With more rigorous testing of the parsers and of add's behavior, we shouldn't need that. History explained here: ArchiveBox#1363 (comment)

With this, we pass 9 of the 19 tests in tests/parser.
(Thanks for the sponsorship, it was very unexpected!)

This is a long comment, and it gets longer every time I try to make it shorter. The tl;dr is that the way archivebox tries to do things now leaves ambiguity, and I fear building a tower of workarounds that is going to implode. You can see that in these:

```console
$ echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
$ cat '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
```

Either stdin is parsed as an RSS file, or the filenames read from stdin are parsed as RSS; I'm not sure you can really make a reliable determination without doing some hacky and dangerous things. I pushed test cases that implement all of the scenarios you outlined in this comment, and with the current dev branch, only 4 of the 19 pass.

Here's how I am thinking about it:

- We start with an empty list of URLs to archive.
- We get a list of zero or more filenames or URLs on the command line; if there are none, we read from stdin instead.
- For each input, if the parser is `auto` and the input looks like a URL, we queue it for archiving directly; otherwise we parse it (whether filename or URL) with the specified parser to get the URLs to queue.

Now consider:

```console
$ archivebox add https://example.com/feed.rss https://example.com/content.html
```

The intent is "obvious" for an RSS feed, but ambiguous for an HTML file, because maybe it's an export file of bookmarks. I don't think we want to always download and parse the resource. However, I think when the user has specified a parser explicitly, it should be the one used to parse whatever was passed.

Rough code outline, without error checking or lots of other option handling:

```python
urls_to_archive = []
if not len(args):
    urls_to_archive += parse_file(stdin, parser)
for arg in args:
    if parser == "auto" and re.findall(URL_REGEX, arg):
        urls_to_archive += [arg]
    else:
        urls_to_archive += parse_url_or_filename(arg, parser)
for url in urls_to_archive:
    queue_for_archive(url)
```

Note that nothing above handles `--depth` or duplicates yet. Now we can turn our last loop into:

```python
for url in urls_to_archive:
    if already_queued(url):
        continue  # skip URLs we've already queued
    queue_for_archive(url)
    if depth > 0:
        add(args=parse_url_or_filename(url), depth=depth - 1, parser="auto")
```

This can support any arbitrary depth, but we'll want to add some options to control two parts of that: filtering URLs (so we can restrict the crawl to a specific site, for example) and deciding which leaves of the tree we're walking we actually archive. I haven't really captured the problematic interplay of `--parser` and `--depth` here, either.
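For completeness, here's roughly what I picture `parse_url_or_filename()` doing. This is a hypothetical sketch: the generic regex scan at the end stands in for dispatching to the selected parser, which the real helper would do instead.

```python
import os
import re
import urllib.request

URL_REGEX = re.compile(r"""https?://[^\s"'<>]+""")  # simplified stand-in regex

def parse_url_or_filename(arg: str, parser: str = "auto") -> list[str]:
    """Fetch or read `arg`, then extract the URLs its contents point to."""
    if arg.startswith(("http://", "https://")):
        # remote resource: download it so the parser can see its contents
        with urllib.request.urlopen(arg) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    elif os.path.exists(os.path.expanduser(arg)):
        # local file: read its contents
        with open(os.path.expanduser(arg), encoding="utf-8") as f:
            text = f.read()
    else:
        return []  # neither a URL nor a readable file

    # a real implementation would dispatch on `parser` (rss, json, html, ...);
    # a generic regex scan stands in for that here
    return URL_REGEX.findall(text)
```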
After thinking about this a bunch, I agree with all of it, especially these:

Firstly, I think I'm going to banish all filesystem paths being used as input unless they start with `file://`. I also think we should mandate that args can only ever be lists of URIs or snapshot_ids; you can never pass text to be parsed as an arg. These two assertions can be enforced at runtime, and should greatly simplify all our logic.

I've also written a new extractor that saves each snapshot's outlinks, and I propose creating three new commands, each with a smaller scope, designed to be chained with unix pipes:

- `archivebox crawl`: parses its input (URLs, `file://` sources, or snapshot ids) and outputs the outlink URLs it finds
- `archivebox snapshot`: creates a Snapshot for each URL it is given
- `archivebox extract`: runs the extractors against each Snapshot to save its content, and outputs the snapshot ids

This lets us keep each command small and composable. For example:
```bash
archivebox snapshot 'https://example.com' > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox snapshot 'https://example.com' | archivebox extract
```

```bash
archivebox crawl --parser=auto 'https://example.com' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox crawl 'https://example.com' | archivebox snapshot | archivebox extract
```
```bash
archivebox crawl --parser=rss 'https://example.com/some/feed.xml' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt
# aka: archivebox crawl --parser=rss 'https://example.com/some/feed.xml' | archivebox snapshot | archivebox extract
```
```bash
# save stdin to file://data/sources/add_stdin_rss.txt
archivebox crawl --parser=rss file://data/sources/add_stdin_rss.txt > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt

for snapshot_id in $(cat hop_0_snapshot_ids.txt); do
    archivebox crawl file://data/archive/$snapshot_id/outlinks.json > hop_1_outlinks.txt
    archivebox snapshot < hop_1_outlinks.txt > hop_1_snapshot_ids.txt
    archivebox extract --extractors=all < hop_1_snapshot_ids.txt
done
```
```bash
# aka:
archivebox crawl --parser=rss 'file://data/sources/add_stdin_rss.txt'  # outputs urls found in file
  | archivebox snapshot  # creates a snapshot for each first hop url
  | archivebox extract   # saves all the content for each first hop, outputs snapshot ids
  | archivebox crawl     # gets outlinks urls from first hop snapshots
  | archivebox snapshot  # creates snapshots for each outlink
  | archivebox extract   # saves the rest of the content for each outlink
```

Does that sound reasonable?

†: I'm imagining all these commands can yield/ingest bare snapshot IDs or URLs, but it would be nice if they also supported a normalized JSONL format for archivebox objects like this, so we can pass metadata (like tags, bookmarked_ts, etc.) with each line:

```jsonl
{"url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc"}
{"type": "Snapshot", "url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc", "id": "124235353425"}
{"type": "Tag", "name": "sometagabc"}
```

A format like this is extensible as we pass more metadata/object-types in the future, it's streamable (each line can be processed independently), and it's easy to debug with standard command-line tools.
When you run:

```bash
archivebox add --parser=rss --depth=1 https://example.org/example.rss
```

it fails, because what happens is that the URL is written to `sources/{{id}}-import.txt` on a line by itself, and then this code tries to read that file with the specified parser:

ArchiveBox/archivebox/main.py, lines 622 to 642 in f02b279

If I am reading and understanding the code correctly, that first call to `parse_links_from_source` should actually include `parser="url_list"`, because that's always what the file it is reading will be. Then the call to `parse_links_from_source` on line 640 should include `parser=parser`, so the parser specified on the command line is the one used to parse the contents of the URL that was originally provided.

If I am understanding things correctly, using `--parser=rss` or other non-text parsers with `--depth=0` (the default) will always just fall over, because it will try to read the URL list using the parser.
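To make the failure mode concrete, here's a tiny self-contained illustration; the two parse functions are simplified stand-ins for the real url_list and rss parsers:

```python
import re

def parse_url_list(text: str) -> list[str]:
    """Stand-in for the url_list parser: one URL per line."""
    return [line.strip() for line in text.splitlines() if line.strip().startswith("http")]

def parse_rss(text: str) -> list[str]:
    """Stand-in for the rss parser: pull URLs out of <link> elements."""
    return re.findall(r"<link>\s*(https?://[^<\s]+)\s*</link>", text)

# this is what sources/{id}-import.txt contains after the command above
import_file_contents = "https://example.org/example.rss\n"

print(parse_url_list(import_file_contents))  # ['https://example.org/example.rss'] (works)
print(parse_rss(import_file_contents))       # [] (the rss parser finds nothing in a bare URL list,
                                             #     so the add falls over before it ever fetches the feed)
```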