Bugfix: Error: Search Backend only searching default admin search fields #654

Closed
alsokpisz opened this issue Feb 15, 2021 · 16 comments
Labels
status: done Work is completed and released (or scheduled to be released in the next version) type: bug report

Comments

@alsokpisz

Describe the bug

Bug:
The bug occurs when I attempt to search any query. An error message appears saying: "Error from the search backend, only showing results from default admin search fields -Error:[Errno -3] Temporary failure in name resolution."

If the search query is a word in the title of a website, it will return results with that word in it.
If it is only in the wget snapshot of the item, it will not return that item.

Context:
I am running ArchiveBox on Windows 10 with docker-compose and have launched the web UI, which I can access at http://127.0.0.1:8000. As far as I can tell, all the snapshots are functional and there are no pending links. The output directory is on an external hard drive, but there have been no issues reading from or writing to this drive (except for speed, though I can't tell whether that is just how the Django web UI behaves).

Relevant Info:
The bug seems similar to what @jdcaballerov described in his testing, when search was enabled but the backend failed (see screenshot 4).

Steps to reproduce

mkdir archivebox && cd archivebox
curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml

Edit the docker-compose yml's volumes section to read:

volumes:
  - ./data:/data
  - D:\\archivebox:/mnt/d/archivebox

(unsure if external drive-specific setup needed to reproduce, so wanted to include)

Open a Windows Terminal in administrator mode, navigate to D:/archivebox/, open a Git Bash tab and run the following:

> docker-compose up -d
> docker-compose run archivebox init
> docker-compose run archivebox manage createsuperuser
> docker-compose run archivebox add 'https://www.dailydot.com/parsec/fandom/dieselpunk-steampunk-beginners-guide/'

Navigate to http://127.0.0.1:8000 and search "beginners" (see screenshot 1). Because the word is in the title, the snapshot will show up. The error message will also show up.

Search "biopunk" (see screenshot 2). Even though it is in the wget file, it will not show up (see screenshot 3). The error message will show up. I have not done extensive testing on whether different filetype snapshots will get searched or not, but I don't think it picks any of them up if they are not in title.

Screenshots or log output

Screenshot 1:
image

Screenshot 2:
image

Screenshot 3:
image

Screenshot 4:
image

ArchiveBox version

ArchiveBox v0.5.6
Cpython Linux Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.28 x86_64 (in Docker)

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.5.6          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.1          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.3          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.8.0         valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.1.14         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.1.0          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.02.04.1   valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v88.0.4324.146  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

>docker --version
Docker version 20.10.2, build 2291f61

>docker-compose --version
docker-compose version 1.27.4, build 40524192

@jdcaballerov
Contributor

@alsokpisz The error message points to misconfigured DNS in the docker-compose setup. If the search backend can't be queried, the search only covers the URL and title, which are the default admin search fields.
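
The fallback described here can be sketched roughly as follows. This is a minimal illustration, not ArchiveBox's actual code: `search_with_fallback`, `broken_backend`, and the snapshot dictionary shape are hypothetical names chosen for the example.

```python
import socket
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical snapshot shape for illustration: {"id": ..., "url": ..., "title": ...}
Snapshot = Dict[str, str]


def search_with_fallback(
    query: str,
    snapshots: List[Snapshot],
    backend_query: Callable[[str], List[str]],
) -> Tuple[List[Snapshot], Optional[str]]:
    """Always search url/title; add full-text hits only if the backend answers."""
    needle = query.lower()
    results = [
        s for s in snapshots
        if needle in s["url"].lower() or needle in s["title"].lower()
    ]
    warning = None
    try:
        backend_ids = set(backend_query(query))
    except OSError as err:
        # socket.gaierror (failed name resolution) is a subclass of OSError.
        backend_ids = set()
        warning = f"Error from the search backend: {err}"
    seen = {s["id"] for s in results}
    results += [s for s in snapshots if s["id"] in backend_ids and s["id"] not in seen]
    return results, warning


def broken_backend(query: str) -> List[str]:
    """Stand-in for a backend whose hostname cannot be resolved."""
    raise socket.gaierror(-3, "Temporary failure in name resolution")
```

With `broken_backend`, a word that appears only in the page body returns nothing and a warning is set, which matches the behaviour reported above; a word in the title still matches.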

@pirate
Member

pirate commented Feb 15, 2021

@alsokpisz as @jdcaballerov mentioned this is likely a DNS resolving issue inside your docker-compose network. Docker on macOS is infamous for having container DNS issues, so I wouldn't be surprised if Docker on Windows is plagued by similar bugs.

First please make sure you have Sonic's config.cfg file present in ./etc/sonic next to your docker-compose.yml file (if not, create that dir and download the config file within):

# these linux commands may be different on Windows, sorry I don't know the equivalents for batch/powershell
mkdir -p ./etc/sonic
cd ./etc/sonic
curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/etc/sonic/config.cfg

Then confirm that Sonic is up and running and accessible from the archivebox container. Can you run these Python commands manually and report back what output you get?

docker-compose run archivebox /usr/local/bin/python3
>>> import socket
>>> HOST = 'sonic'
>>> PORT = 1491
>>> with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
...     s.connect((HOST, PORT))
...     s.sendall(b'Hello, world')
...     data = s.recv(1024)
...
>>> print('Received', repr(data))
Received b'CONNECTED <sonic-server v1.3.0>\r\n'   # this line means everything is working, if your output is different then something is wrong
>>>
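
For anyone who prefers a script to pasting into the REPL, the same check can be sketched as a small function that separates a DNS failure from a refused or timed-out connection. `diagnose_backend` is a hypothetical helper, not part of ArchiveBox, and it assumes the server sends its `CONNECTED` banner as soon as the connection opens; if it does not, send some bytes first as in the session above.

```python
import socket


def diagnose_backend(host: str, port: int, timeout: float = 5.0) -> str:
    """Try to reach host:port and report which stage failed, if any."""
    try:
        # create_connection resolves the hostname and opens the TCP socket.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(1024)
    except socket.gaierror as err:
        # Name resolution failed, e.g. "[Errno -3] Temporary failure in name resolution".
        return f"dns-failure: {err}"
    except OSError as err:
        # Covers refused connections and timeouts.
        return f"connect-failure: {err}"
    if banner.startswith(b"CONNECTED"):
        return "ok: " + banner.decode("ascii", errors="replace").strip()
    return f"unexpected-banner: {banner!r}"


if __name__ == "__main__":
    # 'sonic' and 1491 are the host and port used in the session above.
    print(diagnose_backend("sonic", 1491))
```

It has to run inside the archivebox container, since the `sonic` hostname only resolves on the docker-compose network.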

@alsokpisz
Author

alsokpisz commented Feb 15, 2021

--archivebox
   | -- docker-compose.yml
   | --data
        | -- archive, logs, sources, A...B.conf, A...B.conf.bak, index.sqlite3
   | --etc
        | -- sonic
              | -- config.cfg (file)

Several errors in the terminal (see screenshot 1).
Is the tree I've set up above not correct?

EDIT:
I hard-coded the volume specifier again in the Sonic section (screenshot 2), and everything starts up fine (screenshot 3). Of note, the error messages do not appear anymore on search query, but the search is not working correctly still.

> docker-compose run archivebox C:/Python37/python (which I think is the equivalent command to launch it with Python) just brings up screenshot 4. Sorry if I'm missing something obvious, but I don't see why the command as you wrote it wouldn't cause a subargument issue.

EDIT 2 (one hot cup of coffee later):
> docker ps shows both services running. I made a Python file with the code you posted above.

#!/usr/bin/env python3
import socket
import time
HOST = 'sonic'
PORT = 1491

def main():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((HOST, PORT))
        s.sendall(b'Hello, world')
        data = s.recv(1024)

    print('Received', repr(data))
    time.sleep(40)

if __name__ == "__main__":
    # This is executed when run from the command line
    main()

> py SONICTEST.py

Result:
image

Screenshots
Screenshot 1.
image

Screenshot 2.
image

Screenshot 3.
image

Screenshot 4.
image

@pirate
Member

pirate commented Feb 16, 2021

Inside of docker is always linux, so having a Windows path in this docker command doesn't make sense: docker-compose run archivebox C:/Python37/python

Run it verbatim as I posted above, and paste in the script line by line (don't make a file):

docker-compose run archivebox /usr/local/bin/python3

>>> ... paste in lines above here

@alsokpisz
Author

image
Causes a subargument issue.

@pirate
Member

pirate commented Feb 16, 2021

try docker-compose run archivebox shell

@alsokpisz
Author

Result:
Received b'CONNECTED <sonic-server v1.3.0>\r\nENDED '
I tried this when docker ps still shows both services running.

@pirate
Member

pirate commented Feb 17, 2021

Great! That means both the inter-container DNS and the TCP socket to the sonic container are working. Try this next in docker-compose run archivebox shell:

>>> from sonic import SearchClient
>>> from archivebox.config import SEARCH_BACKEND_HOST_NAME, SEARCH_BACKEND_PORT, SEARCH_BACKEND_PASSWORD, SONIC_BUCKET, SONIC_COLLECTION
>>> with SearchClient(SEARCH_BACKEND_HOST_NAME, SEARCH_BACKEND_PORT, SEARCH_BACKEND_PASSWORD) as querycl:
...     print(querycl.query(SONIC_COLLECTION, SONIC_BUCKET, 'test'))
...

@alsokpisz
Author

alsokpisz commented Feb 17, 2021

Results:
['3ad870d4-82b5-4974-a6ce-ee8cc6a235fa', '5d6734a5-1b9d-418a-a215-8e1e1dbdb8e5', '74696c7d-4421-46b8-8f35-9f1c9537ee1b', 'fca5096f-13da-4d94-8afd-5d742d7b3fb4', '6f009e8d-947a-4fa7-94d7-f21a94c2b525']

EDIT: While troubleshooting why mass-imported links never seem to get the Chrome headless captures (PDF, screenshot, DOM), I essentially re-imported all of my links. Two new folders appeared in archivebox/data: fst and kv. Search now finds text from the wget snapshots, but only in admin mode, not when signed out.

@pirate
Member

pirate commented Feb 17, 2021

Ok, getting closer. It sounds like Sonic is working and connected but it's not getting text to index. Can you try running this to force a re-index:

docker-compose run archivebox update --index-only

Then you can test full-text search from the CLI like so:

archivebox list --filter-type=search example

If it works from the Admin and the CLI, then we can try to track down why the public index isn't working. If it's broken on the CLI, then there's still an issue with the Sonic backend that we have to figure out. Thanks for bearing with me here!

@alsokpisz
Author

alsokpisz commented Feb 18, 2021

It seems that any link to a .pdf or .jpg causes this error during the index command:

[*] <link.pdf>
[X] An Exception ocurred reading the indexable content='utf-8' codec can't decode byte 0xb5 in position 10: invalid start byte:
[*] <link>
[*] <link>
[X] The search backend threw an exception=ERR invalid_format(PUSH <collection> <bucket> <object> "<text>" [LANG(<locale>)]?)
:
[*] <link>
[*] <link>

And then it hangs.

Sometimes I'd get just the one error. I think this happened after I got rid of the .pdf links. I jotted it down but didn't write any context with it.

[X] The search backend threw an exception=ERR invalid_format(PUSH <collection> <bucket> <object> "<text>" [LANG(<locale>)]?)
:
[*] <link>
[*] <link>
*terminal hangs*

After removing all the .pdf/.jpg links, there are no errors in the terminal when I run the re-index command, but it will still spend ages on random pages. Notably stuff with 'weirder' components like live webcam feeds or something. I removed those one by one until it managed to get through the 80ish bookmarks in less than 10 minutes.

The terminal results were the same ones as the admin search, but the public search still didn't work.
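
The `'utf-8' codec can't decode byte 0xb5` message above is what Python raises when binary content such as a PDF is read as UTF-8 text. One way an indexer could guard against that is sketched below; `read_indexable_text` and `collect_texts` are hypothetical helpers for illustration, not ArchiveBox's implementation, and skipping undecodable files is only one possible policy.

```python
from pathlib import Path
from typing import Iterable, Iterator, Optional, Tuple


def read_indexable_text(path: Path) -> Optional[str]:
    """Return the file's text, or None if it is not valid UTF-8."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # Binary files (PDF, JPEG, ...) land here instead of aborting the run.
        return None


def collect_texts(paths: Iterable[Path]) -> Iterator[Tuple[Path, str]]:
    """Yield (path, text) only for files that decoded cleanly."""
    for path in paths:
        text = read_indexable_text(path)
        if text is not None:
            yield path, text
```

Decoding with `errors="replace"` instead of returning `None` would keep such files in the index, at the cost of pushing mostly meaningless text to the search backend.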

@jdcaballerov
Contributor

@alsokpisz the public search view is not connected to the search backend for security and performance reasons.

@alsokpisz
Author

Well that's that then I suppose.
Is there a way to set a "timeout" per link on the docker-compose run archivebox update --index-only command? So it will skip links it spends more than say, a minute trying to index?
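
Nothing in this thread shows a per-link timeout option for that command. As a generic workaround, a wrapper script could run one command per link and give up on any that exceed a limit. The sketch below uses only the Python standard library; `run_with_timeout` and `run_all` are hypothetical helpers, and the command run for each link is a placeholder supplied by the caller.

```python
import subprocess
import sys
from typing import Dict, Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> str:
    """Run cmd; return 'ok', 'failed' (non-zero exit), or 'timeout'."""
    try:
        # subprocess.run kills the child process once the timeout expires.
        subprocess.run(
            cmd,
            check=True,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except subprocess.CalledProcessError:
        return "failed"
    return "ok"


def run_all(jobs: Dict[str, Sequence[str]], timeout_s: float = 60.0) -> Dict[str, str]:
    """Run each labelled command in turn and collect its status."""
    return {label: run_with_timeout(cmd, timeout_s) for label, cmd in jobs.items()}


if __name__ == "__main__":
    # Placeholder jobs: a quick command and one that sleeps past the limit.
    demo = {
        "fast": [sys.executable, "-c", "pass"],
        "slow": [sys.executable, "-c", "import time; time.sleep(5)"],
    }
    print(run_all(demo, timeout_s=0.5))
```

Links reported as `timeout` can then be retried later or removed, which is roughly what was done by hand above.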

@pirate
Member

pirate commented Apr 6, 2021

Ok this should be somewhat improved in f67a5a2. It will be out with the next v0.6 release soon.
You can also try it early by adding this line to your docker-compose config: build: https://github.com/ArchiveBox/ArchiveBox.git#dev.

Comment back here if you're still having issues with indexing failures/hanging and I'll reopen the issue.

@pirate pirate closed this as completed Apr 6, 2021
@pirate pirate added the status: done Work is completed and released (or scheduled to be released in the next version) label Apr 6, 2021
@rmrf-sl4sh

print('Received', repr(data))

This is what I get when I follow the troubleshooting:
>>> print('Received', repr(data))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'data' is not defined

Anyone know what I'm doing incorrectly?

@pirate
Member

pirate commented Nov 12, 2021

Looks like you messed up the indentation. Make sure to copy and paste that whole block above together, or remove the extra newline before that print to be doubly sure. @jdqw210


4 participants