Bug: NOT NULL constraint failed: core_archiveresult.output when upgrading v0.4.24 archive to v0.6 #705

Closed
pigmonkey opened this issue Apr 14, 2021 · 10 comments
Labels: size: easy, status: wip, touches: data/schema/architecture, type: bug report

@pigmonkey
Contributor

I upgraded from v0.4.24 to v0.6.0 and ran archivebox init. After the list of migrations, it output:

[*] Checking links from indexes and archive folders (safe to Ctrl+C)...
    √ Added 1054 orphaned links from existing archive directories.
    ! Skipped adding 2236 invalid link data directories.
    (long list of archive dirs here)

    Hint: For more information about the link data directories that were skipped, run:
        archivebox status
        archivebox list --status=invalid

[*] [2021-04-13 05:09:37] Writing 1054 links to main index...
Traceback (most recent call last):
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 589, in update_or_create
    obj = self.select_for_update().get(**kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: ArchiveResult matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
sqlite3.IntegrityError: NOT NULL constraint failed: core_archiveresult.output

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pigmonkey/.local/bin//archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/cli/archivebox_init.py", line 43, in main
    init(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/main.py", line 433, in init
    write_main_index(list(pending_links.values()), out_dir=out_dir)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/index/__init__.py", line 232, in write_main_index
    write_sql_main_index(links, out_dir=out_dir)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/index/sql.py", line 88, in write_sql_main_index
    write_link_to_sql_index(link)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/index/sql.py", line 66, in write_link_to_sql_index
    result, _ = ArchiveResult.objects.update_or_create(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 594, in update_or_create
    obj, created = self._create_object_from_params(kwargs, params, lock=True)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 610, in _create_object_from_params
    obj = self.create(**params)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 447, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 753, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 790, in save_base
    updated = self._save_table(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 895, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 933, in _do_insert
    return manager._insert(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 1254, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1397, in execute_sql
    cursor.execute(sql, params)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
django.db.utils.IntegrityError: NOT NULL constraint failed: core_archiveresult.output
$  archivebox version
ArchiveBox v0.6.0
Cpython Linux Linux-5.11.11-hardened1-1-hardened-x86_64-with-glibc2.33 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.0          valid     /home/pigmonkey/.local/pipx/venvs/archivebox/bin/archivebox
 √  PYTHON_BINARY         v3.9.2          valid     /usr/bin/python3.9
 √  DJANGO_BINARY         v3.1.8          valid     /home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.76.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     ./node_modules/readability-extractor/readability-extractor
 -  MERCURY_BINARY        -               disabled  ./node_modules/@postlight/mercury-parser/cli.js
 -  GIT_BINARY            -               disabled  /usr/bin/git
 -  YOUTUBEDL_BINARY      -               disabled  /usr/bin/youtube-dl
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            9 files         valid     /home/pigmonkey/tmp/bookmarks
 √  SOURCES_DIR           0 files         valid     ./sources
 √  LOGS_DIR              0 files         valid     ./logs
 √  ARCHIVE_DIR           2236 files      valid     ./archive
 √  CONFIG_FILE           411.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             920.0 KB        valid     ./index.sqlite3
@pirate
Member

pirate commented Apr 14, 2021

Ah sorry for the trouble, that shouldn't happen.

While I investigate, if you have a backup from v0.4.24, can you try migrating to v0.5.6 first, and then from there to v0.6?

The issue is caused by some extractor outputs being null in your old archive. That shouldn't happen: results with no output shouldn't get saved in the first place, but the old v0.4.x series had problems with this. I can add a case to handle this in v0.6 and create them as empty strings instead, but it will take a bit of time to test.
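
The fix described above (storing missing extractor outputs as empty strings rather than null) can be sketched roughly as follows. This is only an illustration, not the actual v0.6 patch: `normalize_history` is a hypothetical helper, and it assumes the `history` layout used in a snapshot's index.json, i.e. a mapping of extractor name to a list of result dicts that each carry an `output` key.

```python
from typing import Any, Dict, List

History = Dict[str, List[Dict[str, Any]]]


def normalize_history(history: History) -> History:
    """Return a copy of `history` in which every null/missing output becomes "".

    Hypothetical helper: the input shape mirrors the "history" section of
    an ArchiveBox index.json, but this is not ArchiveBox's own code.
    """
    cleaned: History = {}
    for extractor, results in history.items():
        cleaned[extractor] = [
            # Only None (JSON null) or a missing key is replaced; real values are kept.
            {**result, "output": "" if result.get("output") is None else result["output"]}
            for result in results
        ]
    return cleaned


if __name__ == "__main__":
    sample = {
        "wget": [{"output": None, "status": "failed"}],
        "pdf": [{"output": "output.pdf", "status": "succeeded"}],
        "git": [],
    }
    print(normalize_history(sample))
```

Running each result through a step like this before it is written to a NOT NULL column would avoid the IntegrityError, at the cost of no longer distinguishing "no output" from an empty output.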

Also helpful would be a sample ./archive/<timestamp>/index.json from one of your archive folders that's marked as invalid. It looks like you have lots more of them than normal and I'm wondering why so many are being considered invalid.

@pirate added the size: easy, type: bug report, status: wip, and touches: data/schema/architecture labels on Apr 14, 2021
@pirate changed the title from "Bug: NOT NULL constraint failed: core_archiveresult.output" to "Bug: NOT NULL constraint failed: core_archiveresult.output when upgrading v0.4.24 archive to v0.6" on Apr 14, 2021
@pigmonkey
Contributor Author

I downgraded to v0.5.6.

Rather than restoring a backup from v0.4.24, I deleted everything in the directory except the archive/ snapshot folder and ran archivebox init. I figured it might be easier for it to start fresh rather than trying to migrate an older database structure. It ended up failing with a new error:

[i] [2021-04-14 16:47:50] ArchiveBox v0.5.6: archivebox init
    > /home/pigmonkey/tmp/bookmarks

[+] Initializing a new ArchiveBox collection in this folder...
    /home/pigmonkey/tmp/bookmarks
------------------------------------------------------------------

[+] Building archive folder structure...
    √ /home/pigmonkey/tmp/bookmarks/sources
    √ /home/pigmonkey/tmp/bookmarks/archive
    √ /home/pigmonkey/tmp/bookmarks/logs
    √ /home/pigmonkey/tmp/bookmarks/ArchiveBox.conf

[+] Building main SQL index and running migrations...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3

    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    Applying contenttypes.0001_initial... OK
    Applying auth.0001_initial... OK
    Applying admin.0001_initial... OK
    Applying admin.0002_logentry_remove_auto_add... OK
    Applying admin.0003_logentry_add_action_flag_choices... OK
    Applying contenttypes.0002_remove_content_type_name... OK
    Applying auth.0002_alter_permission_name_max_length... OK
    Applying auth.0003_alter_user_email_max_length... OK
    Applying auth.0004_alter_user_username_opts... OK
    Applying auth.0005_alter_user_last_login_null... OK
    Applying auth.0006_require_contenttypes_0002... OK
    Applying auth.0007_alter_validators_add_error_messages... OK
    Applying auth.0008_alter_user_username_max_length... OK
    Applying auth.0009_alter_user_last_name_max_length... OK
    Applying auth.0010_alter_group_name_max_length... OK
    Applying auth.0011_update_proxy_permissions... OK
    Applying auth.0012_alter_user_first_name_max_length... OK
    Applying core.0001_initial... OK
    Applying core.0002_auto_20200625_1521... OK
    Applying core.0003_auto_20200630_1034... OK
    Applying core.0004_auto_20200713_1552... OK
    Applying core.0005_auto_20200728_0326... OK
    Applying core.0006_auto_20201012_1520... OK
    Applying core.0007_archiveresult... OK
    Applying core.0008_auto_20210105_1421... OK
    Applying sessions.0001_initial... OK

[*] Collecting links from any existing indexes and archive folders...
    √ Added 1054 orphaned links from existing archive directories.
    ! Skipped adding 2236 invalid link data directories.
    (long list of invalid archive dirs here)

    Hint: For more information about the link data directories that were skipped, run:
        archivebox status
        archivebox list --status=invalid

[*] [2021-04-14 19:45:31] Writing 1054 links to main index...
Traceback (most recent call last):
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 589, in update_or_create
    obj = self.select_for_update().get(**kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pigmonkey/.local/bin//archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 129, in main
    run_subcommand(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 69, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/cli/archivebox_init.py", line 33, in main
    init(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/main.py", line 387, in init
    write_main_index(list(pending_links.values()), out_dir=out_dir)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/index/__init__.py", line 232, in write_main_index
    write_sql_main_index(links, out_dir=out_dir)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/index/sql.py", line 53, in write_sql_main_index
    write_link_to_sql_index(link)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/archivebox/index/sql.py", line 44, in write_link_to_sql_index
    snapshot, _ = Snapshot.objects.update_or_create(url=link.url, defaults=info)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 594, in update_or_create
    obj, created = self._create_object_from_params(kwargs, params, lock=True)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 610, in _create_object_from_params
    obj = self.create(**params)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 447, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 753, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 790, in save_base
    updated = self._save_table(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 895, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/base.py", line 933, in _do_insert
    return manager._insert(
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/query.py", line 1254, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1396, in execute_sql
    for sql, params in self.as_sql():
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1339, in as_sql
    value_rows = [
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1340, in <listcomp>
    [self.prepare_value(field, self.pre_save_val(field, obj)) for field in fields]
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1340, in <listcomp>
    [self.prepare_value(field, self.pre_save_val(field, obj)) for field in fields]
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1281, in prepare_value
    value = field.get_db_prep_save(value, connection=self.connection)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/fields/__init__.py", line 823, in get_db_prep_save
    return self.get_db_prep_value(value, connection=connection, prepared=False)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/models/fields/__init__.py", line 1379, in get_db_prep_value
    return connection.ops.adapt_datetimefield_value(value)
  File "/home/pigmonkey/.local/pipx/venvs/archivebox/lib/python3.9/site-packages/django/db/backends/sqlite3/operations.py", line 245, in adapt_datetimefield_value
    raise ValueError("SQLite backend does not support timezone-aware datetimes when USE_TZ is False.")
ValueError: SQLite backend does not support timezone-aware datetimes when USE_TZ is False.
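
The ValueError above says that a timezone-aware datetime reached the SQLite backend while USE_TZ was False. One possible way to sidestep it is to convert such values to naive UTC before saving them. The sketch below is an illustration under assumptions, not ArchiveBox's code: `to_naive_utc` is a hypothetical helper, and it assumes the offending values are ISO 8601 strings with an explicit offset, like the `start_ts` and `end_ts` fields that ArchiveBox writes to index.json. Which field actually triggered the error here is not established by the traceback.

```python
from datetime import datetime, timezone


def to_naive_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp and return a naive datetime expressed in UTC.

    Hypothetical helper for illustration only.
    """
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        # Already naive: return unchanged (assumed to be UTC already).
        return parsed
    # Shift to UTC first, then drop the tzinfo so SQLite accepts the value.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


if __name__ == "__main__":
    print(to_naive_utc("2019-04-30T13:51:52.542295+00:00"))
```

The alternative is to enable USE_TZ in the Django settings so that aware datetimes are accepted. That is a project-wide decision rather than a local workaround.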

Here is a random index.json from one of the snapshot directories that appeared in the invalid list for both v0.5.6 and v0.6.0:

{
    "archive_path": "archive/1409265356",
    "base_url": "www.jabberwocky.com/software/paperkey",
    "basename": "",
    "bookmarked_date": "2014-08-28 22:35",
    "canonical": {
        "archive_org_path": "https://web.archive.org/web/www.jabberwocky.com/software/paperkey",
        "dom_path": "output.html",
        "favicon_path": "favicon.ico",
        "git_path": "git/",
        "google_favicon_path": "https://www.google.com/s2/favicons?domain=www.jabberwocky.com",
        "headers_path": "headers.json",
        "index_path": "index.html",
        "media_path": "media/",
        "mercury_path": "mercury/content.html",
        "pdf_path": "output.pdf",
        "readability_path": "readability/content.html",
        "screenshot_path": "screenshot.png",
        "singlefile_path": "singlefile.html",
        "warc_path": "warc/",
        "wget_path": "www.jabberwocky.com/software/paperkey/index.html"
    },
    "domain": "www.jabberwocky.com",
    "extension": "",
    "hash": "AM68BRMYAT8Z6HEQT8GM",
    "history": {
        "archive_org": [
            {
                "cmd": [
                    "curl",
                    "--location",
                    "--head",
                    "--user-agent",
                    "ArchiveBox/332a32f4f9b6f548d9a61495ec9008667ca1f5f6 (+https://github.com/pirate/ArchiveBox/)",
                    "--max-time",
                    "60",
                    "https://web.archive.org/save/http://www.jabberwocky.com/software/paperkey/"
                ],
                "cmd_version": null,
                "end_ts": "2019-04-30T13:51:52.542295+00:00",
                "index_texts": null,
                "output": "https://web.archive.org/web/20190430205152/http://www.jabberwocky.com/software/paperkey/",
                "pwd": "/home/pigmonkey/library/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2019-04-30T13:51:52.087359+00:00",
                "status": "succeeded"
            }
        ],
        "dom": [
            {
                "cmd": [
                    "/usr/bin/chromium",
                    "--headless",
                    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",
                    "--window-size=1440,2000",
                    "--timeout=60000",
                    "--user-data-dir=/run/user/1000/chromium/2019-04-30T10:50:42-07:00",
                    "--dump-dom",
                    "http://www.jabberwocky.com/software/paperkey/"
                ],
                "cmd_version": null,
                "end_ts": "2019-04-30T13:51:52.087080+00:00",
                "index_texts": null,
                "output": "output.html",
                "pwd": "/home/pigmonkey/library/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2019-04-30T13:51:51.452372+00:00",
                "status": "succeeded"
            }
        ],
        "favicon": [
            {
                "cmd": [
                    "curl",
                    "--max-time",
                    "60",
                    "--location",
                    "--output",
                    "favicon.ico",
                    "https://www.google.com/s2/favicons?domain=www.jabberwocky.com"
                ],
                "cmd_version": null,
                "end_ts": "2019-04-30T13:51:49.427358+00:00",
                "index_texts": null,
                "output": "favicon.ico",
                "pwd": "/home/pigmonkey/library/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2019-04-30T13:51:49.290417+00:00",
                "status": "succeeded"
            }
        ],
        "git": [],
        "headers": [
            {
                "cmd": [
                    "curl",
                    "--silent",
                    "--max-time",
                    "180",
                    "--location",
                    "--compressed",
                    "--head",
                    "--user-agent",
                    "ArchiveBox/0.4.21 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.73.0 (x86_64-pc-linux-gnu)",
                    "http://www.jabberwocky.com/software/paperkey/"
                ],
                "cmd_version": "curl 7.73.0 (x86_64-pc-linux-gnu)",
                "end_ts": "2020-11-17T21:19:24.366356+00:00",
                "index_texts": null,
                "output": "headers.json",
                "pwd": "/home/pigmonkey/tmp/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2020-11-17T21:19:23.993579+00:00",
                "status": "succeeded"
            }
        ],
        "media": [],
        "mercury": [],
        "pdf": [
            {
                "cmd": [
                    "/usr/bin/chromium",
                    "--headless",
                    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",
                    "--window-size=1440,2000",
                    "--timeout=60000",
                    "--user-data-dir=/run/user/1000/chromium/2019-04-30T10:50:42-07:00",
                    "--print-to-pdf",
                    "http://www.jabberwocky.com/software/paperkey/"
                ],
                "cmd_version": null,
                "end_ts": "2019-04-30T13:51:50.497825+00:00",
                "index_texts": null,
                "output": "output.pdf",
                "pwd": "/home/pigmonkey/library/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2019-04-30T13:51:49.698822+00:00",
                "status": "succeeded"
            }
        ],
        "readability": [
            {
                "cmd": [
                    "/home/pigmonkey/tmp/bookmarks/node_modules/readability-extractor/readability-extractor",
                    "/tmp/pigmonkey/tmp078hdqif"
                ],
                "cmd_version": "0.0.2",
                "end_ts": "2021-04-11T02:58:44.262770+00:00",
                "index_texts": [
                    "\nby David Shaw\n\nA reasonable way to achieve a long term backup of OpenPGP (GnuPG, PGP,\netc) keys is to print them out on paper.  Paper and ink have amazingly\nlong retention qualities - far longer than the magnetic or optical\nmeans that are generally used to back up computer data.\n\nDownload\n\n\nFor POSIX (Linux, Unix, *BSD, etc):\npaperkey-1.6.tar.gz\npaperkey-1.6.tar.gz.sig\n(OpenPGP signature from my key 0x99242560)\nWin32 precompiled binary:\npaperkey-1.6-win32.zip\npaperkey-1.6-win32.zip.sig\n(OpenPGP signature from my key 0x99242560)\n\n\nEarlier releases as well as the usual GitHub stuff are available on GitHub.\n\nPaper?  Seriously?\n\nThe goal with paper is not secure storage.  There are countless ways\nto store something securely.  A paper backup also isn't a replacement\nfor the usual machine readable (tape, CD-R, DVD-R, etc) backups, but\nrather as an if-all-else-fails method of restoring a key.  Most of the\nstorage media in use today do not have particularly good long-term\n(measured in years to decades) retention of data.  If and when the\nCD-R and/or tape cassette and/or USB key and/or hard drive the secret\nkey is stored on becomes unusable, the paper copy can be used to\nrestore the secret key.\n\nWhat paperkey does\n\nDue to metadata and redundancy, OpenPGP secret keys are significantly\nlarger than just the \"secret bits\".  In fact, the secret key contains\na complete copy of the public key.  Since the public key generally\ndoesn't need to be escrowed (most people have many copies of it on\nvarious keyservers, web pages, or similar), only archiving the secret\nparts can be a real advantage.\n\nPaperkey extracts just those secret bytes and prints them.  To\nreconstruct, you re-enter those bytes (whether by hand, OCR, QR code,\nor the like) and paperkey can use them to transform your existing\npublic key into a secret key.\n\nFor example, the regular DSA+Elgamal secret key I just tested comes\nout to 1281 bytes.  
The secret parts of that key (plus some minor\npacket structure) come to only 149 bytes.  It's a lot easier to\nre-enter 149 bytes correctly.\n\nDifferent key algorithms will benefit to a different degree from this\nsize reduction.  In general, DSA or Elgamal keys benefit the most,\nshrinking to around 10% of the original key size, and RSA keys benefit\nthe least, only shrinking to about 50% of the original key size.  ECC\nkeys are in between, shrinking to around 20-25% of the original, but\nof course, ECC keys are quite small to begin with, and 25% of a small\nnumber can compare well to 10% of a larger number.\n\nAs with any backup or archiving system, it is prudent to verify you\ncan restore the key from your paper copy before filing the paper away.\n\nAren't CD-Rs supposed to last a long time?\n\nThey're certainly advertised to (and I've seen some pretty incredible\nclaims of 100 years or more), but in practice it doesn't really work\nout that way.  The manufacturing of the media, the burn quality, the\nburner quality, the storage, etc, all have a significant impact on how\nlong an optical disc will last.  Some tests show that you're lucky to\nget 10 years.\n\nIn comparison, to claim that paper will last for 100 years is not even\nvaguely impressive.  High-quality paper with good ink regularly lasts\nmany hundreds of years even under less than optimal conditions.\n\nAnother bonus is that ink on paper is readable by humans.  Not all\nbackup methods will be readable 50 years later, so even if you have\nthe backup, you can't easily buy a drive to read it.  I doubt this\nwill happen anytime soon with CD-R as there are just so many of them\nout there, but the storage industry is littered with old, now-dead\nmethods of storing data.\n\nSecurity\n\nNote that paperkey does not change the security requirements of\nstoring a secret key.  In fact, paperkey doesn't do any crypto at all,\nbut just saves and restores the original secret key, whether it is\nencrypted or not.  
If your key has a passphrase on it (i.e. is\nencrypted), the paper copy is similarly encrypted.  If your key has no\npassphrase, neither does the paper copy.  Whatever the passphrase (or\nlack thereof) was on the original secret key will be the same on the\nreconstructed key.\n\nExamples\n\nTake the secret key in key.gpg and generate a text file\nto-be-printed.txt that contains the secret data:\n\n  paperkey --secret-key my-secret-key.gpg --output to-be-printed.txt\n\nTake the secret key data in my-key-text-file.txt and combine it with\nmy-public-key.gpg to reconstruct my-secret-key.gpg:\n\n  paperkey --pubring my-public-key.gpg --secrets my-key-text-file.txt --output my-secret-key.gpg\n\nIf --output is not specified, the output goes to stdout.  If\n--secret-key is not specified, the data is read from stdin so you can\ndo things like:\n\n  gpg --export-secret-key my-key | paperkey | lpr\n\nSome other useful options are:\n\n\n\n--output-type\n\ncan be \"base16\" or \"raw\".  \"base16\" is human readable, and \"raw\"\nis useful if you want to pass the output to another program like a bar\ncode or QR code generator (although note that scannable codes have\nsome of the disadvantages discussed above).\n\n--input-type\n\nsame as --output-type, but for the restore side of things.  By\ndefault the input type is inferred automatically from the input data.\n\n--output-width\n\nsets the width of base16 output (i.e. given your font, how many\ncolumns fit on the paper you're printing on).  Defaults to 78.\n\n--ignore-crc-error\n\nallows paperkey to continue when reconstructing even if it detects\ndata corruption in the input.\n\n--verbose (or -v)\n\nbe chatty about what is happening.  Repeat this multiple times for\nmore verbosity.\n\n\n\nFull documentation for all options is in the man page.\n\nRPM\n\nPaperkey ships with a RPM spec file.  
You can build the RPM with the\nusual \"rpmbuild -ta /path/to/the/paperkey/tarball.tar.gz\".\n\n\n\nPaperkey is Copyright \u00a9 2007-2018 by David Shaw\n\n\n"
                ],
                "output": "readability",
                "pwd": "/home/pigmonkey/tmp/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2021-04-11T02:58:43.498979+00:00",
                "status": "succeeded"
            }
        ],
        "screenshot": [
            {
                "cmd": [
                    "/usr/bin/chromium",
                    "--headless",
                    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",
                    "--window-size=1440,2000",
                    "--timeout=60000",
                    "--user-data-dir=/run/user/1000/chromium/2019-04-30T10:50:42-07:00",
                    "--screenshot",
                    "http://www.jabberwocky.com/software/paperkey/"
                ],
                "cmd_version": null,
                "end_ts": "2019-04-30T13:51:51.452135+00:00",
                "index_texts": null,
                "output": "screenshot.png",
                "pwd": "/home/pigmonkey/library/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2019-04-30T13:51:50.498062+00:00",
                "status": "succeeded"
            }
        ],
        "singlefile": [
            {
                "cmd": [
                    "/home/pigmonkey/tmp/bookmarks/node_modules/single-file/cli/single-file",
                    "--browser-executable-path=/usr/bin/chromium",
                    "--browser-args=[\"--headless\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"]",
                    "http://www.jabberwocky.com/software/paperkey/",
                    "singlefile.html"
                ],
                "cmd_version": "0.3.16",
                "end_ts": "2021-04-11T02:58:43.480810+00:00",
                "index_texts": null,
                "output": "singlefile.html",
                "pwd": "/home/pigmonkey/tmp/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2021-04-11T02:58:37.678078+00:00",
                "status": "succeeded"
            }
        ],
        "title": [],
        "wget": [
            {
                "cmd": [
                    "wget",
                    "--no-verbose",
                    "--adjust-extension",
                    "--convert-links",
                    "--force-directories",
                    "--backup-converted",
                    "--span-hosts",
                    "--no-parent",
                    "-e",
                    "robots=off",
                    "--restrict-file-names=windows",
                    "--timeout=60",
                    "--compression=auto",
                    "--warc-file=warc/1556657509",
                    "--page-requisites",
                    "--user-agent=Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0",
                    "http://www.jabberwocky.com/software/paperkey/"
                ],
                "cmd_version": null,
                "end_ts": "2019-04-30T13:51:49.698028+00:00",
                "index_texts": null,
                "output": "www.jabberwocky.com/software/paperkey/index.html",
                "pwd": "/home/pigmonkey/library/bookmarks/archive/1409265356",
                "schema": "ArchiveResult",
                "start_ts": "2019-04-30T13:51:49.427885+00:00",
                "status": "succeeded"
            }
        ]
    },
    "icons": null,
    "is_archived": true,
    "is_static": false,
    "latest": {
        "archive_org": "https://web.archive.org/web/20190430205152/http://www.jabberwocky.com/software/paperkey/",
        "dom": "output.html",
        "favicon": "favicon.ico",
        "git": null,
        "media": null,
        "pdf": "output.pdf",
        "screenshot": "screenshot.png",
        "singlefile": "singlefile.html",
        "title": null,
        "warc": null,
        "wget": "www.jabberwocky.com/software/paperkey/index.html"
    },
    "link_dir": "/home/pigmonkey/tmp/bookmarks/archive/1409265356",
    "newest_archive_date": "2021-04-11T02:58:43.498979+00:00",
    "num_failures": 0,
    "num_outputs": 9,
    "oldest_archive_date": "2019-04-30T13:51:49.290417+00:00",
    "path": "/software/paperkey/",
    "schema": "Link",
    "scheme": "http",
    "snapshot_id": "a30094bc-79ae-4c6b-8f51-a4cd84b0475b",
    "sources": [
        "/home/pigmonkey/library/conf/pinboard.json"
    ],
    "tags": "backup crypto pgp",
    "tags_str": "backup crypto pgp",
    "timestamp": "1409265356",
    "title": "Paperkey - an OpenPGP key archiver",
    "updated": "2021-04-11T04:39:17.977393+00:00",
    "updated_date": "2021-04-11 04:39",
    "url": "http://www.jabberwocky.com/software/paperkey/"
}

@pigmonkey
Contributor Author

One thing I notice looking at that JSON file is that it has a mix of absolute paths.

I originally had ArchiveBox in ~/library/bookmarks. I still have the old pre-Django version running there. When I began to experiment with the Django releases, I copied my archive from ~/library/bookmarks to ~/tmp/bookmarks. The above snapshot is one of the older URLs that was originally captured with the pre-Django versions, so the pwd keys for some of the older archive methods, like screenshot, point to my old directory, while the pwd keys for some of the newer archive methods, like singlefile, point to the new directory.

Maybe that is screwing it up somehow? I'm not sure why it cares about absolute paths, since I think the expectation is that archivebox is always run from the root output directory.
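A quick way to check how mixed the paths are in a given snapshot is to count the distinct pwd values in its index.json. This is only a rough sketch: the function name count_pwd_values is made up for illustration, and it assumes the per-extractor results sit under a top-level "history" key, as they appear to in the file above.

```python
import json
from collections import Counter
from pathlib import Path


def count_pwd_values(index_path):
    """Count how often each "pwd" value appears in one snapshot's index.json."""
    link = json.loads(Path(index_path).read_text(encoding="utf-8"))
    counts = Counter()
    # Assumption: results live under "history" as {extractor: [result, ...]}.
    for results in (link.get("history") or {}).values():
        for result in results or []:
            counts[result.get("pwd")] += 1
    return counts
```

For example, count_pwd_values("archive/1409265356/index.json") would return one counter entry per directory the snapshot was ever archived from.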

@pirate
Member

pirate commented Apr 14, 2021

It doesn't actually use those paths for anything, so that won't affect it. They're just added so human readers can find files more easily.

Instead of starting fresh on v0.5.6, can you try starting fresh on v0.6? Back up and delete the main index files, leaving only the archive/ folder, then run init. If that still fails then I'll push some fixes to v0.6 to account for null outputs.

@pigmonkey
Contributor Author

Starting fresh with v0.6.0 results in the same NOT NULL constraint failed: core_archiveresult.output error as my original post.
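To see which snapshots might be tripping that constraint, one option is to scan every archive/*/index.json for results whose output is null. This is a sketch under assumptions, not how ArchiveBox itself does the import: find_null_outputs is a hypothetical helper, and it assumes results are stored under a top-level "history" key as {extractor: [result, ...]}.

```python
import json
from pathlib import Path


def find_null_outputs(archive_dir):
    """Return (index_path, extractor, position) for each result with a null output."""
    hits = []
    for index_path in sorted(Path(archive_dir).glob("*/index.json")):
        try:
            link = json.loads(index_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupt index; skip it here
        if not isinstance(link, dict):
            continue
        # Assumption: "history" maps extractor name -> list of result dicts.
        for extractor, results in (link.get("history") or {}).items():
            for position, result in enumerate(results or []):
                if isinstance(result, dict) and result.get("output") is None:
                    hits.append((str(index_path), extractor, position))
    return hits
```

Running find_null_outputs("archive") from the root of the data directory lists the entries worth inspecting by hand. It does not modify anything.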

@pirate
Member

pirate commented Apr 14, 2021

Ok, I'll push a fix for that one then. Hang tight, thanks for your patience.

@pirate pirate added this to the v0.6.3 milestone Apr 16, 2021
@milosz

milosz commented Jan 7, 2022

Is there anything that can be done to import old archives? Any guidance would be helpful.

I tried to add snapshot_id to index.json inside the archived website's directory (like sed -i -e "1 a\ \ \ \ \"snapshot_id\": \"$(uuidgen --time)\"," archive/1561294668/index.json). After that I ran archivebox update --status orphaned --index-only, but it did not help.
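For anyone repeating this experiment, the same edit can be made through a JSON parser instead of sed, which avoids breaking the file if the first line is not what sed expects. This is only a sketch of the edit itself and is not expected to fix the import on its own; add_snapshot_id is a made-up name, and uuid.uuid1() is used as the rough equivalent of uuidgen --time.

```python
import json
import uuid
from pathlib import Path


def add_snapshot_id(index_path):
    """Add a time-based "snapshot_id" to an index.json if it has none; return the id."""
    path = Path(index_path)
    link = json.loads(path.read_text(encoding="utf-8"))
    if not link.get("snapshot_id"):
        link["snapshot_id"] = str(uuid.uuid1())  # time-based UUID
        # Sorted keys and 4-space indent are assumed to match the existing file style.
        path.write_text(json.dumps(link, indent=4, sort_keys=True), encoding="utf-8")
    return link["snapshot_id"]
```

Back up the archive before trying this, since it rewrites index.json in place.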

@pirate
Member

pirate commented Jan 8, 2022

What version are you trying to import @milosz? I recommend upgrading through 0.5 then to 0.6 after.

@milosz

milosz commented Jan 11, 2022

I have a ~20 GB archive backup from 2019. Thanks, I will try this intermediate step. I am thrilled that it is possible.

@pirate
Member

pirate commented Mar 23, 2022

New instructions here: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives

Also note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting

Contributions/suggestions welcome there.

@pirate pirate closed this as completed Apr 12, 2022