Question: Help setting up full text search #1087

Closed
diego898 opened this issue Jan 20, 2023 · 11 comments

@diego898

I was trying the instructions outlined here:

#956 (comment)

to set up full-text search on my archive of a single link. I split this out into its own issue so as not to derail the other:

@pirate - I had to download sonic.cfg to the root directory, not the data folder.

Also, after docker-compose down; docker-compose down; docker-compose up, I ran docker-compose run archivebox update --index-only

and got:

~/archivebox
❯ docker-compose run archivebox update --index-only
[i] [2023-01-19 02:45:38] ArchiveBox v0.6.2: archivebox update --index-only
    > /data

[*] Indexing url: https://www.ecliptik.com/bookmarking-with-raindrop/ in the search index

[!] Sonic search backend threw an error while indexing: SonicServerError ERR invalid_meta_key(?["])

I've only indexed a single website so far

@diego898 changed the title from "Question: ..." to "Question: Help setting up full text search" on Jan 20, 2023
@pirate
Member

pirate commented Jan 21, 2023

Interesting, I've never seen this error before. Can you post your sonic config from docker-compose.yml, along with the full output of archivebox --version?

@diego898
Author

from docker-compose.yml:

    archivebox:
        # build: .                              # for developers working on archivebox
        image: ${DOCKER_IMAGE:-archivebox/archivebox:master}
        command: server --quick-init 0.0.0.0:8000
        ports:
            - 8000:8000
        environment:
            - ALLOWED_HOSTS=*                   # add any config options you want as env vars
            - MEDIA_MAX_SIZE=750m
            - SEARCH_BACKEND_ENGINE=sonic     # uncomment these if you enable sonic below
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=SecretPassword
        volumes:
            - ./data:/data
            # - ./archivebox:/app/archivebox    # for developers working on archivebox

...

    # To run the Sonic full-text search backend, first download the config file to sonic.cfg
    # curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/etc/sonic.cfg
    # after starting, backfill any existing Snapshots into the index: docker-compose run archivebox update --index-only
    sonic:
       image: valeriansaliou/sonic:v1.3.0
       expose:
           - 1491
       environment:
           - SEARCH_BACKEND_PASSWORD=SecretPassword
       volumes:
           - ./sonic.cfg:/etc/sonic.cfg:ro
           - ./data/sonic:/var/lib/sonic/store

and:

~/archivebox took 13s
❯ docker-compose run archivebox --version
ArchiveBox v0.6.2
Cpython Linux Linux-5.15.49-linuxkit-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            11 files        valid     /data
 √  SOURCES_DIR           1 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           1 files         valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             212.0 KB        valid     ./index.sqlite3

@diego898
Author

The error is different if I place it in the data folder. Also, occasionally I'll get a folder at root called sonic.cfg, which is very strange.

@pirate
Member

pirate commented Jan 24, 2023

The folder at root called sonic.cfg appears when the file is not where you told Docker to look for it: Docker creates an empty directory at that path and mounts it instead. Keep the file outside the data folder and mount it the way you're doing in the docker-compose.yml you posted, and it should work. Can you post the contents of your sonic.cfg file? Maybe it got messed up somehow.
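
The bind-mount problem described here can be caught before starting the stack. This is a minimal sketch, assuming a POSIX shell on the host; `check_config` is a hypothetical helper name and is not part of ArchiveBox or Docker:

```shell
# Hypothetical helper: report whether the given path is a usable config file.
# Returns 0 only for a regular file, 1 for a directory or a missing path.
check_config() {
    cfg="$1"
    if [ -f "$cfg" ]; then
        echo "ok: $cfg is a regular file"
    elif [ -d "$cfg" ]; then
        # Docker typically creates this directory when the bind-mount source is missing
        echo "error: $cfg is a directory; remove it and re-download the config"
        return 1
    else
        echo "error: $cfg does not exist"
        return 1
    fi
}
```

For example, `check_config ./sonic.cfg && docker-compose up` would only start the containers when the config is a real file. If the path turns out to be a directory, remove it, re-download the file with the curl command from the comments in docker-compose.yml, and start again.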

@diego898
Author

diego898 commented Feb 6, 2023

This is the file:

# Sonic
# Fast, lightweight and schema-less search backend
# Configuration file
# Example: https://github.com/valeriansaliou/sonic/blob/master/config.cfg


[server]

log_level = "warn"


[channel]

inet = "0.0.0.0:1491"
tcp_timeout = 300

auth_password = "${env.SEARCH_BACKEND_PASSWORD}"

[channel.search]

query_limit_default = 65535
query_limit_maximum = 65535
query_alternates_try = 10

suggest_limit_default = 5
suggest_limit_maximum = 20


[store]

[store.kv]

path = "/var/lib/sonic/store/kv/"

retain_word_objects = 100000

[store.kv.pool]

inactive_after = 1800

[store.kv.database]

flush_after = 900

compress = true
parallelism = 2
max_files = 100
max_compactions = 1
max_flushes = 1
write_buffer = 16384
write_ahead_log = true

[store.fst]

path = "/var/lib/sonic/store/fst/"

[store.fst.pool]

inactive_after = 300

[store.fst.graph]

consolidate_after = 180

max_size = 2048
max_words = 250000

@diego898
Author

diego898 commented Feb 6, 2023

and re-running after re-downloading the file gives this error:

❯ docker-compose run archivebox update --index-only
[i] [2023-02-06 18:03:04] ArchiveBox v0.6.2: archivebox update --index-only
    > /data

[*] Indexing url: https://www.ecliptik.com/bookmarking-with-raindrop/ in the search index

[!] Sonic search backend threw an error while indexing: gaierror [Errno -2] Name or service not known

@pirate
Member

pirate commented Feb 6, 2023

Very strange, your setup is completely standard but it's failing as if sonic is not running.

Let's try checking the sonic container logs. Can you post the output of docker-compose logs sonic?

You can also try pinging the sonic container, or connecting to it with telnet, from the ArchiveBox one to see if it's a network issue:

docker-compose exec archivebox bash
$ telnet sonic 1491
# or
$ ping sonic
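
If the container image does not ship telnet or ping, bash's built-in /dev/tcp redirection can stand in for them. This is a minimal sketch, assuming bash and the coreutils `timeout` command are available inside the ArchiveBox container; `port_open` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
port_open() {
    host="$1"
    port="$2"
    # bash (not plain sh) treats /dev/tcp/HOST/PORT as a TCP connection attempt;
    # inside the inner script, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Inside the container, `port_open sonic 1491 && echo reachable || echo unreachable` reports whether the Sonic channel port accepts connections. An "unreachable" result would be consistent with the Sonic container not running, which docker-compose logs sonic can confirm.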

@diego898
Author

diego898 commented Feb 6, 2023

These are the outputs:

~/archivebox
❯ docker-compose logs sonic
archivebox-sonic-1  | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
archivebox-sonic-1  | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
archivebox-sonic-1  | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
archivebox-sonic-1  | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
archivebox-sonic-1  | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
archivebox-sonic-1  | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and

~/archivebox took 4s
❯ docker-compose exec archivebox bash
root@19e3f981c5ec:/data# telnet sonic 1491
bash: telnet: command not found
root@19e3f981c5ec:/data# ping sonic
bash: ping: command not found
root@19e3f981c5ec:/data#

@diego898
Author

diego898 commented Feb 6, 2023

Note:

  • I deleted the sonic.cfg file from the data/ directory.
  • I deleted the sonic.cfg directory from root and replaced it with the correct file.
  • I ran docker-compose down; docker-compose down; docker-compose up.
  • I ran docker-compose logs sonic and it came back empty (this is good, right?).
  • Then I tried to update the index and got:
~/archivebox took 6s
❯ docker-compose run archivebox update --index-only
[i] [2023-02-06 22:57:15] ArchiveBox v0.6.2: archivebox update --index-only
    > /data

[*] Indexing url: https://www.ecliptik.com/bookmarking-with-raindrop/ in the search index

[!] Sonic search backend threw an error while indexing: SonicServerError ERR invalid_meta_key(?["])

[*] Indexing url: https://news.ycombinator.com/item?id=34667067 in the search index

[*] Indexing url: https://news.ycombinator.com/item?id=34665738 in the search index

Back on the up screen I get:

~/archivebox took 1m8s
❯ docker-compose down; docker-compose down; docker-compose up
[+] Running 3/0
 ⠿ Container archivebox-archivebox-1  Removed                             0.0s
 ⠿ Container archivebox-sonic-1       R...                                0.0s
 ⠿ Network archivebox_default         Rem...                              0.0s
[+] Running 3/3
 ⠿ Network archivebox_default         Cre...                              0.1s
 ⠿ Container archivebox-sonic-1       C...                                0.1s
 ⠿ Container archivebox-archivebox-1  Created                             0.1s
Attaching to archivebox-archivebox-1, archivebox-sonic-1
archivebox-archivebox-1  | [i] [2023-02-06 22:56:32] ArchiveBox v0.6.2: archivebox server --quick-init 0.0.0.0:8000
archivebox-archivebox-1  |     > /data
archivebox-archivebox-1  |
archivebox-archivebox-1  | [^] Verifying and updating existing ArchiveBox collection to v0.6.2...
archivebox-archivebox-1  | ----------------------------------------------------------------------
archivebox-archivebox-1  |
archivebox-archivebox-1  | [*] Verifying archive folder structure...
archivebox-archivebox-1  |     + ./archive, ./sources, ./logs...
archivebox-archivebox-1  |     + ./ArchiveBox.conf...
archivebox-archivebox-1  |
archivebox-archivebox-1  | [*] Verifying main SQL index and running any migrations needed...
archivebox-archivebox-1  |     Operations to perform:
archivebox-archivebox-1  |     Apply all migrations: admin, auth, contenttypes, core, sessions
archivebox-archivebox-1  |     Running migrations:
archivebox-archivebox-1  |     No migrations to apply.
archivebox-archivebox-1  |
archivebox-archivebox-1  |     √ ./index.sqlite3
archivebox-archivebox-1  |
archivebox-archivebox-1  | [*] Checking links from indexes and archive folders (safe to Ctrl+C)...
archivebox-archivebox-1  |     √ Loaded 3 links from existing main index.
archivebox-archivebox-1  |     > Skipping full snapshot directory check (quick mode)
archivebox-archivebox-1  |
archivebox-archivebox-1  | ----------------------------------------------------------------------
archivebox-archivebox-1  | [√] Done. Verified and updated the existing ArchiveBox collection.
archivebox-archivebox-1  |
archivebox-archivebox-1  |     Hint: To view your archive index, run:
archivebox-archivebox-1  |         archivebox server  # then visit http://127.0.0.1:8000
archivebox-archivebox-1  |
archivebox-archivebox-1  |     To add new links, you can run:
archivebox-archivebox-1  |         archivebox add ~/some/path/or/url/to/list_of_links.txt
archivebox-archivebox-1  |
archivebox-archivebox-1  |     For more usage and examples, run:
archivebox-archivebox-1  |         archivebox help
archivebox-archivebox-1  |
archivebox-archivebox-1  | [+] Starting ArchiveBox webserver...
archivebox-archivebox-1  |     > Logging errors to ./logs/errors.log
archivebox-archivebox-1  | Performing system checks...
archivebox-archivebox-1  |
archivebox-archivebox-1  | System check identified no issues (0 silenced).
archivebox-archivebox-1  | February 06, 2023 - 22:56:34
archivebox-archivebox-1  | Django version 3.1.10, using settings 'core.settings'
archivebox-archivebox-1  | Starting development server at http://0.0.0.0:8000/
archivebox-archivebox-1  | Quit the server with CONTROL-C.
archivebox-archivebox-1  | "GET /admin/login/ HTTP/1.1" 200 11144
archivebox-sonic-1       | (WARN) - took a lot of time: 226ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 90ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 107ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 77ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 65ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 75ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 79ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 78ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 81ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 100ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 62ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 72ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 89ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 64ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 87ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 93ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 91ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 60ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 70ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 59ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 82ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 70ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 54ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 52ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 57ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 70ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 55ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 55ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 68ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 58ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 52ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 53ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 57ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 118ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 59ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 68ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 125ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 86ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 81ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 89ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 101ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 69ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 59ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 81ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 71ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 76ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 94ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 63ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 71ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 63ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 71ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 70ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 65ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 64ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 66ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 54ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 64ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 53ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 62ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 50ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 60ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 53ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 57ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 51ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 56ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 54ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 60ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 56ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 52ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 55ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 59ms to process channel message
archivebox-sonic-1       | (WARN) - took a lot of time: 52ms to process channel message
archivebox-archivebox-1  | "GET /admin/login/ HTTP/1.1" 200 11144
archivebox-archivebox-1  | "GET /admin/login/ HTTP/1.1" 200 11144
archivebox-archivebox-1  | "GET /admin/login/ HTTP/1.1" 200 11144

Very strange!

@pirate
Member

pirate commented Feb 6, 2023

Ah ok that seems fine now, looks like it's working. It's possible that the Ecliptik article text extraction had an issue and so Sonic is getting empty text for that URL, but it's working on other URLs as evidenced by the logs after that point.

@diego898
Author

diego898 commented Feb 7, 2023

Wow, what are the odds that the very first test URL I tried had a URL-specific error, and it never occurred to me to try others! Thank you! Closing this for now!

@diego898 diego898 closed this as completed Feb 7, 2023