
Full-text search #543

Closed · wants to merge 31 commits

Conversation

@jdcaballerov (Contributor) commented Nov 19, 2020

Summary

This PR adds the ability to do full-text search 🎉

Related issues

#22 #24

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@jdcaballerov (Contributor, Author)

Will change to use generators for the ids.

@cdvv7788 (Contributor)

@jdcaballerov LGTM! Can you please update the Dockerfile so it is installed automatically? This only provides FTS in the admin, right? Can we add something in the list command too?

@jdcaballerov (Contributor, Author)

@cdvv7788 Will sonic be a required dependency? Right now it's only in the admin; I'll check tomorrow how we could integrate it with list.

@cdvv7788 (Contributor)

Not necessarily. You can leave it as the first option and use the current mechanism as a fallback. I just want to have it in the CLI, where we can test that it actually works without digging into the code.

@pirate changed the title from "Sonic search" to "Full-text search using Sonic" on Nov 20, 2020
@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@cdvv7788 Could you please be more specific about what's needed for list? The filtering currently in place in the Django admin is not comparable to list filtering.

  1. The list command's --filter-type option uses the following filters (you can choose only one):
LINK_FILTERS = {
    'exact': lambda pattern: Q(url=pattern),
    'substring': lambda pattern: Q(url__icontains=pattern),
    'regex': lambda pattern: Q(url__iregex=pattern),
    'domain': lambda pattern: Q(url__istartswith=f"http://{pattern}") | Q(url__istartswith=f"https://{pattern}") | Q(url__istartswith=f"ftp://{pattern}"),
}

  2. The Django admin uses these search_fields (many icontains filters at once, chained with OR):

    search_fields = ['url', 'timestamp', 'title', 'tags__name']

What's needed: a new subcommand for list, say search? Or a new filter-type called search?

Currently, if search is enabled, the Django admin search runs two queries: the one from search_fields plus the one from the search backend.
If search is enabled but the backend fails, a warning message is shown and the results just use the first query.

A test would be the most helpful.
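For reference, a minimal sketch of how those two queries could be combined in the admin; the query_search_index helper and the fallback behavior below are assumptions for illustration, not necessarily this PR's code:

    from django.contrib import admin, messages

    def query_search_index(search_term):
        """Hypothetical helper: ask the full-text backend for matching Snapshot ids."""
        raise NotImplementedError

    class SnapshotAdmin(admin.ModelAdmin):
        search_fields = ['url', 'timestamp', 'title', 'tags__name']

        def get_search_results(self, request, queryset, search_term):
            # Query 1: the usual icontains filters generated from search_fields.
            qs, use_distinct = super().get_search_results(request, queryset, search_term)
            if not search_term:
                return qs, use_distinct
            try:
                # Query 2: Snapshot ids returned by the full-text backend.
                qs = qs | queryset.filter(pk__in=query_search_index(search_term))
            except Exception:
                # Backend disabled or unreachable: warn and fall back to Query 1 only.
                messages.warning(request, 'Full-text search backend unavailable, showing basic results.')
            return qs, use_distinct

Combining the two querysets with | keeps the admin usable even when the backend returns nothing or is down.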

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

Behavior proposal (this is how it works right now):

ADMIN

Search enabled and all OK

[Screenshot from 2020-11-19 23-42-12]

Search enabled but backend fails

[Screenshot from 2020-11-19 23-43-28]

Search disabled: just uses search_fields

CLI

New search filter backend enabled, all OK

[Screenshot from 2020-11-19 23-45-14]

New search filter backend enabled but failing

[Screenshot from 2020-11-19 23-46-45]

New search filter backend disabled (fail and OK)

[Screenshot from 2020-11-19 23-52-11]

@cdvv7788 (Contributor)

Yes, I had in mind a new filter. Where is it searching? Using readability?

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@cdvv7788 Currently I've only added the readability content, but any text from the extractors could be added, e.g. each of the headers, etc.

What is needed is to add a list of texts to index_texts on the ArchiveResult dataclass returned by the extractor. If index_texts is not present, indexing simply passes.

Example from readability:

    return ArchiveResult(
        cmd=cmd,
        pwd=str(out_dir),
        cmd_version=READABILITY_VERSION,
        output=output,
        status=status,
        index_texts= [readability_content] if readability_content else [],
        **timer.stats,  
    )
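For context, this works if the dataclass declares the field with a default, so extractors that don't produce text can simply omit it. The declaration below is a guess for illustration; only the index_texts line is the point:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class ArchiveResult:
        cmd: List[str]
        pwd: Optional[str]
        cmd_version: Optional[str]
        output: Optional[str]
        status: str
        # Texts to feed to the search index; None/[] means "nothing to index, just pass".
        index_texts: Optional[List[str]] = None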

@jdcaballerov (Contributor, Author)

@cdvv7788
Sonic is a service accessed via a telnet-like protocol; it can't be added to the Dockerfile unless one includes supervisord to manage multiple processes in the container.

I've added sonic to the docker-compose file as a service. Please don't forget to uncomment the build . line and comment out the next one, so that it doesn't use the image from Docker Hub.

@pirate (Member) commented Nov 20, 2020

How do you handle retroactively adding previous archive data to the index? When a user upgrades to this release will it start indexing all their previously archived text?

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@pirate Retroactive indexing isn't happening yet. The indexing takes place in archivebox/extractors/__init__.py archive_link, if indexing is enabled (config.USE_INDEXING_BACKEND).

I assumed that someone wanting indexing would run archivebox update.
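A rough sketch of what that hook could look like; write_search_index below is a stand-in for the backend call, not necessarily the PR's actual function:

    USE_INDEXING_BACKEND = True   # stands in for config.USE_INDEXING_BACKEND

    def write_search_index(snapshot_id: str, texts: list) -> None:
        """Stand-in for the search backend call that indexes texts for one snapshot."""
        raise NotImplementedError

    def index_archive_result(snapshot_id: str, result) -> None:
        """Called from archive_link for each extractor's ArchiveResult."""
        if not USE_INDEXING_BACKEND:
            return
        texts = getattr(result, 'index_texts', None) or []
        if not texts:
            return   # the extractor produced nothing indexable: just pass
        try:
            write_search_index(snapshot_id, texts)
        except Exception:
            # An unreachable backend should never break archiving itself.
            pass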

WARNING

With Sonic one must be careful about what to expect, e.g. about ArchiveBox marking something as indexed, since Sonic evicts data from the index using a sliding window controlled by its config parameters. I don't know how these numbers translate to huge archives.

Let's say I start indexing a huge archive and keep state in ArchiveBox marking a snapshot's text as indexed. Sonic could then evict that data from the index and I'd end up with an inconsistent state, i.e. ArchiveBox marking it as indexed while it has been evicted from Sonic.

Sonic defines some limits:

query_limit_maximum (type: integer, allowed: numbers, default: 100) — Maximum search results limit for a query command (if the LIMIT command modifier is being used when issuing a QUERY command)

retain_word_objects (type: integer, allowed: numbers, default: 1000) — Maximum number of objects a given word in the index can be linked to (older objects are cleared using a sliding window)

Example:
I index 2000 articles containing the word skateboarding; when I query for that word, Sonic will only take into account the latest 1000 indexed and will return at most 100.

Sonic is not designed to be exhaustive; it's better used as a fast suggester, or to search chat conversations, favoring the most recently indexed conversations while the rest remain accessible with pagination.

Sonic's design favors having many buckets, but ArchiveBox's use case requires just one (unless we find some meaningful way to partition further: tags, etc.).

@pirate (Member) commented Nov 20, 2020

Hmmm, we definitely need a solution that can index >50k articles with ~2000 words each, and it needs to start indexing retroactively somehow when it's enabled. archivebox update will not re-run extractors that have already run, so I don't think that's enough to index everything.

Are there any sonic alternatives you know of that can handle indexing 100m - 2bn words?

@ekiel commented Nov 20, 2020 via email

@pirate (Member) commented Nov 20, 2020

Recoll seems pretty heavyweight; ideally I'm looking for something similar to Sonic, just with bigger index capacity.
It does have a python API though, which seems ok: https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.PROGRAM.PYTHONAPI.INTRO.html

Side note, it would be great to offer ripgrep as a search backend fallback @jdcaballerov (e.g. in Docker) if sonic or another index-searching backend is not available. Ripgrep is faster and simpler for smaller archives and can be installed as a static binary and called when needed, instead of needing a constantly running backend.
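A fallback like that could just shell out to rg and map matching files back to snapshot folders; a minimal sketch, assuming the archive/<timestamp>/ directory layout (not the PR's actual backend code):

    import subprocess
    from pathlib import Path
    from typing import List

    ARCHIVE_DIR = Path('archive')   # assumption: one folder per snapshot, named by timestamp

    def rg_search(pattern: str) -> List[str]:
        """Return timestamps of snapshots whose archived files contain the pattern."""
        proc = subprocess.run(
            ['rg', '--files-with-matches', '--fixed-strings', '--ignore-case',
             pattern, str(ARCHIVE_DIR)],
            capture_output=True, text=True,
        )
        timestamps = set()
        for line in proc.stdout.splitlines():
            # e.g. archive/1605900000.0/readability/content.txt -> 1605900000.0
            timestamps.add(Path(line).relative_to(ARCHIVE_DIR).parts[0])
        return sorted(timestamps)

Since rg only runs when a search is issued, nothing needs to stay resident between queries.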

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@pirate Sonic might be able to handle the requirements.

It's not that it can't handle 50k articles, but a query will return at most 65,535 results, and a word can be associated with at most retain_word_objects objects (in our case Snapshot ids), which can be huge.
It works as follows:

    word     ids of Snapshots containing it
    cat      2, 4, 54, 34    (the max length here is retain_word_objects)
    house    2, 8, 34, 34

A query for cat with query_limit=2 returns docs 2, 4.

query_limit_maximum is a u16, so it can go up to 65,535.
retain_word_objects is a usize, so it can go up to 18,446,744,073,709,551,615 on 64-bit systems.

I think the question is whether we need to be exhaustive: if a maximum of 65,535 results per query, and a word linked to at most X articles, is acceptable, then we are OK.

The retroactive indexing can be figured out.

@jdcaballerov (Contributor, Author)

I've added a ripgrep (rg) backend and set it as the default.

@jdcaballerov changed the title from "Full-text search using Sonic" to "Full-text search" on Nov 23, 2020
@jdcaballerov (Contributor, Author) commented Nov 23, 2020

Added the ability to do retroactive indexing using archivebox update --index-only (plus filters). It selects the content to index from the first available extractor output registered as a succeeded ArchiveResult, in the order (readability, singlefile, dom, wget), and runs indexing on the current search backend (the ripgrep backend doesn't do anything, but the sonic backend does).

PS: --index-only still needs to be corrected.
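A sketch of that selection order; the relation and field names (archiveresult_set, extractor, status, output, snapshot_dir) are guesses for illustration, not necessarily the PR's exact schema:

    from pathlib import Path

    EXTRACTOR_PRIORITY = ['readability', 'singlefile', 'dom', 'wget']

    def get_indexable_text(snapshot) -> str:
        """Return the output text of the highest-priority extractor that succeeded for a snapshot."""
        for extractor in EXTRACTOR_PRIORITY:
            result = snapshot.archiveresult_set.filter(extractor=extractor, status='succeeded').first()
            if result and result.output:
                output_path = Path(snapshot.snapshot_dir) / result.output
                if output_path.exists():
                    return output_path.read_text(errors='replace')
        return ''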

@pirate changed the base branch from master to v0.5.0 on December 2, 2020 at 22:28
@pirate (Member) commented Dec 5, 2020

Can you rebase when you get a chance? Then I'll merge this PR next.

@pirate mentioned this pull request on Dec 5, 2020
@pirate (Member) commented Dec 5, 2020

Fixed the conflicts and merged this here 😁: #570
