
Bug: JSONDecodeError when trying to load JSON with an array at the top-level #1347

Closed
philippemilink opened this issue Feb 15, 2024 · 7 comments
Labels: expected: next release, python, size: easy, status: done, touches: API/CLI/user interface, type: bug report, why: functionality
Milestone: v0.8.0

@philippemilink

With ArchiveBox version 0.6.2, I used to import URLs stored in JSON files with content looking like the following:

[
    {"url": "https://archivebox.io/", "tags": "test,archivebox"},
    {"url": "https://en.wikipedia.org/wiki/Linux", "tags": "test,wikipedia"}
]

archivebox add --parser json < ./links.json

Everything worked well.

With version 0.7.2, however, I have a JSONDecodeError exception during the import:

  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/main.py", line 631, in add
    new_links += parse_links_from_source(write_ahead_log, root_url=None, parser=parser)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/index/__init__.py", line 278, in parse_links_from_source
    raw_links, parser_name = parse_links(source_path, root_url=root_url, parser=parser)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/__init__.py", line 103, in parse_links
    links, parser = run_parser_functions(file, timer, root_url=root_url, parser=parser)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/__init__.py", line 117, in run_parser_functions
    parsed_links = list(parser_func(to_parse, root_url=root_url))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/generic_json.py", line 23, in parse_generic_json_export
    links = json.loads(json_file_json_str)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 61 (char 60)

The error is caused by the following line, introduced by commit aaca74f:

json_file_json_str = '{' + json_file.read().split('{', 1)[-1]
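
A minimal reproduction of what that line does to a top-level array (standalone Python, not the actual parser code, using the first entry from the example above):

import json

# The workaround drops everything up to the first '{', which also throws away
# the opening '[' of a top-level array, leaving invalid JSON behind.
raw = '[{"url": "https://archivebox.io/", "tags": "test,archivebox"}]'
mangled = '{' + raw.split('{', 1)[-1]
# mangled == '{"url": "https://archivebox.io/", "tags": "test,archivebox"}]'
json.loads(mangled)  # raises json.decoder.JSONDecodeError: Extra data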

@pirate (Member) commented Feb 21, 2024

ah interesting, all the files I tested with had an object at the top level instead of a list

I can add handling for lists pretty easily, there are so many different JSON formats to support haha

@pirate added the labels type: bug report, why: functionality, size: easy, good first ticket, help wanted, status: backlog, touches: API/CLI/user interface, expected: release after next, and python on Feb 21, 2024
@pirate added this to the v0.8.0 milestone on Feb 21, 2024
@pirate changed the title from "Bug: JSONDecodeError with JSON parser" to "Bug: JSONDecodeError when trying to load JSON with an array at the top-level" on Feb 21, 2024
@jimwins (Contributor) commented Feb 26, 2024

Instead of trying to figure out whether the first line of the JSON file is garbage, it would be easier to try parsing the file as-is first, and then try again with the first line skipped if that fails.

    try:
        links = json.load(json_file)
    except json.decoder.JSONDecodeError:
        # sometimes the first line is a comment or other junk, so skip it and try again
        json_file.seek(0)
        first_line = json_file.readline()
        #print('      > Trying JSON parser without first line: "', first_line.strip(), '"', sep='')
        links = json.load(json_file)
        # if this fails again, we really don't know what to do, so let the exception propagate

But maybe even this isn't necessary? It looks like the original "skip the first line" logic came about because ArchiveBox would add the filename to the file as the first line when putting in the sources directory, but that doesn't seem to happen any more (which seems like a much, much better way to go).
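
For illustration, one of those old sources/ files would have looked roughly like this (the exact header line is just an example of the "filename as first line" behavior described above):

links.json
[
    {"url": "https://archivebox.io/", "tags": "test,archivebox"}
]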

jimwins added a commit to jimwins/ArchiveBox that referenced this issue Feb 27, 2024
Rather than assuming the JSON file we are parsing has junk at the beginning
(which maybe only used to happen?), try parsing it as-is first, and then fall
back to trying again after skipping the first line

Fixes ArchiveBox#1347
@jimwins mentioned this issue Feb 27, 2024
@pirate (Member) commented Feb 29, 2024

But maybe even this isn't necessary?

Yes but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their sources/ dir and they might want to re-import their sources again later on.

I do agree we should move it to a try: except: fallback though as you showed above.

@jimwins (Contributor) commented Feb 29, 2024

Yes but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their sources/ dir and they might want to re-import their sources again later on.

I do agree we should move it to a try: except: fallback though as you showed above.

I don't see how the existing workaround ever worked for anything because it chops off everything before the first { which includes the [, and the rest of the parsing assumes that links is a list.

@pirate (Member) commented Feb 29, 2024

Ah sorry, it was long enough ago that I don't remember what it was for exactly... maybe it was to handle an extra newline at the start, or maybe I thought I was handling a JSON object at the top level instead of JSONL?

Either way I'm down to change it, this parser is broken enough that it's not useful in its current state anyway.

@jimwins (Contributor) commented Feb 29, 2024

Handling JSONL wouldn’t be hard to add as another fallback. We could try JSON, then JSONL, and then try them both again without the first line to handle old source lists that had that extra line added.
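
A rough sketch of what that fallback chain could look like (illustrative only; the names and structure here are not the actual ArchiveBox parser):

import json

def load_json_or_jsonl(json_file):
    def try_parse(text):
        # First try the whole text as a single JSON document, then as JSONL
        # (one JSON object per line); return None if both fail.
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass
        try:
            return [json.loads(line) for line in text.splitlines() if line.strip()]
        except json.JSONDecodeError:
            return None

    text = json_file.read()
    links = try_parse(text)
    if links is None:
        # Old-style sources/ files have an extra header line: drop it and retry.
        links = try_parse(text.split('\n', 1)[-1])
    if links is None:
        raise ValueError('Could not parse input as JSON or JSONL')
    return links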

@pirate (Member) commented Mar 22, 2024

This is done, thanks again @jimwins for all your great work here!

Will be out in the next release, or pull :dev to get it early.

@pirate closed this as completed on Mar 22, 2024
@pirate added the status: done label and removed the status: backlog and expected: release after next labels on Mar 22, 2024