
Bug: JSONDecodeError when trying to load JSON with an array at the top-level #1347

Closed
philippemilink opened this issue Feb 15, 2024 · 7 comments
Labels: expected: next release, python, size: easy, status: done, touches: API/CLI/user interface, type: bug report, why: functionality
Milestone: v0.8.0

@philippemilink

With ArchiveBox version 0.6.2, I used to import URLs stored in JSON files with content looking like the following:

[
    {"url": "https://archivebox.io/", "tags": "test,archivebox"},
    {"url": "https://en.wikipedia.org/wiki/Linux", "tags": "test,wikipedia"}
]

archivebox add --parser json < ./links.json

Everything worked well.

With version 0.7.2, however, I have a JSONDecodeError exception during the import:

  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/main.py", line 631, in add
    new_links += parse_links_from_source(write_ahead_log, root_url=None, parser=parser)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/index/__init__.py", line 278, in parse_links_from_source
    raw_links, parser_name = parse_links(source_path, root_url=root_url, parser=parser)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/__init__.py", line 103, in parse_links
    links, parser = run_parser_functions(file, timer, root_url=root_url, parser=parser)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/__init__.py", line 117, in run_parser_functions
    parsed_links = list(parser_func(to_parse, root_url=root_url))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/generic_json.py", line 23, in parse_generic_json_export
    links = json.loads(json_file_json_str)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 61 (char 60)

The error is caused by the following line, introduced by commit aaca74f:

json_file_json_str = '{' + json_file.read().split('{', 1)[-1]
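
A minimal reproduction of what that line does to a top-level array (standalone Python, not the actual parser code, using the first entry from the example above):

import json

# The workaround drops everything up to the first '{', which also throws away
# the opening '[' of a top-level array, leaving invalid JSON behind.
raw = '[{"url": "https://archivebox.io/", "tags": "test,archivebox"}]'
mangled = '{' + raw.split('{', 1)[-1]
# mangled == '{"url": "https://archivebox.io/", "tags": "test,archivebox"}]'
json.loads(mangled)  # raises json.decoder.JSONDecodeError: Extra data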

@pirate (Member) commented Feb 21, 2024

ah interesting, all the files I tested with had an object at the top level instead of a list

I can add handling for lists pretty easily, there are so many different JSON formats to support haha

@pirate added the labels type: bug report, why: functionality, size: easy, good first ticket, help wanted, status: backlog, touches: API/CLI/user interface, expected: release after next, and python on Feb 21, 2024
@pirate added this to the v0.8.0 milestone on Feb 21, 2024
@pirate changed the title from "Bug: JSONDecodeError with JSON parser" to "Bug: JSONDecodeError when trying to load JSON with an array at the top-level" on Feb 21, 2024
@jimwins (Contributor) commented Feb 26, 2024

Instead of trying to figure out whether the first line of the JSON file is garbage, it would be easier to try parsing the file as-is first, and then try again with the first line skipped if that fails.

    try:
        links = json.load(json_file)
    except json.decoder.JSONDecodeError:
        # sometimes the first line is a comment or other junk, so skip it and try again
        json_file.seek(0)
        first_line = json_file.readline()
        #print('      > Trying JSON parser without first line: "', first_line.strip(), '"', sep='')
        links = json.load(json_file)
        # if this fails again, we really don't know what to do, so let the exception propagate

But maybe even this isn't necessary? It looks like the original "skip the first line" logic came about because ArchiveBox would add the filename to the file as the first line when putting in the sources directory, but that doesn't seem to happen any more (which seems like a much, much better way to go).
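
For illustration, one of those old sources/ files would have looked roughly like this (the exact header line is just an example of the "filename as first line" behavior described above):

links.json
[
    {"url": "https://archivebox.io/", "tags": "test,archivebox"}
]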

jimwins added a commit to jimwins/ArchiveBox that referenced this issue Feb 27, 2024
Rather than assuming the JSON file we are parsing has junk at the beginning
(which maybe only used to happen?), try parsing it as-is first, and then fall
back to trying again after skipping the first line

Fixes ArchiveBox#1347
@jimwins mentioned this issue Feb 27, 2024
@pirate (Member) commented Feb 29, 2024

But maybe even this isn't necessary?

Yes but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their sources/ dir and they might want to re-import their sources again later on.

I do agree we should move it to a try: except: fallback though as you showed above.

@jimwins (Contributor) commented Feb 29, 2024

Yes but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their sources/ dir and they might want to re-import their sources again later on.

I do agree we should move it to a try: except: fallback though as you showed above.

I don't see how the existing workaround ever worked for anything because it chops off everything before the first { which includes the [, and the rest of the parsing assumes that links is a list.

@pirate (Member) commented Feb 29, 2024

Ah sorry, it was long enough ago that I don't remember what it was for exactly... maybe it was to handle an extra newline at the start, or maybe I thought I was handling a JSON object at the top level instead of JSONL?

Either way I'm down to change it, this parser is broken enough that it's not useful in its current state anyway.

@jimwins (Contributor) commented Feb 29, 2024

Handling JSONL wouldn’t be hard to add as another fallback. We could try JSON, then JSONL, and then try them both again without the first line to handle old source lists that had that extra line added.
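
A rough sketch of what that fallback chain could look like (illustrative only; the names and structure here are not the actual ArchiveBox parser):

import json

def load_json_or_jsonl(json_file):
    def try_parse(text):
        # First try the whole text as a single JSON document, then as JSONL
        # (one JSON object per line); return None if both fail.
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass
        try:
            return [json.loads(line) for line in text.splitlines() if line.strip()]
        except json.JSONDecodeError:
            return None

    text = json_file.read()
    links = try_parse(text)
    if links is None:
        # Old-style sources/ files have an extra header line: drop it and retry.
        links = try_parse(text.split('\n', 1)[-1])
    if links is None:
        raise ValueError('Could not parse input as JSON or JSONL')
    return links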

@pirate (Member) commented Mar 22, 2024

This is done, thanks again @jimwins for all your great work here!

Will be out in the next release, or pull :dev to get it early.

@pirate closed this as completed on Mar 22, 2024
@pirate added the status: done label and removed the status: backlog and expected: release after next labels on Mar 22, 2024