
Full-text search #543

Closed · wants to merge 31 commits

Conversation

@jdcaballerov (Contributor) commented Nov 19, 2020

Summary

This PR adds the ability to do full-text search 🎉

Related issues

#22 #24

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@jdcaballerov (Contributor, Author)

Will change to use generators for the ids.

@cdvv7788 (Contributor)

@jdcaballerov LGTM! Can you please update the Dockerfile so it is installed automatically? This only provides FTS in the admin, right? Can we add something in the list command too?

@jdcaballerov (Contributor, Author)

@cdvv7788 Will sonic be a required dependency? Right now it's only in the admin; I'll check tomorrow how we could integrate it with list.

@cdvv7788 (Contributor)

Not necessarily. You can leave it as the first option and use the current mechanism as a fallback. I just want to have it in the CLI, where we can test that it actually works without digging into the code.

@pirate changed the title from "Sonic search" to "Full-text search using Sonic" on Nov 20, 2020
@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@cdvv7788 Could you please be more specific about what's needed for list? The filtering currently in place in the Django admin is not comparable to list filtering.

  1. The list command's --filter-type option uses the following filters (you can choose only one):
LINK_FILTERS = {
    'exact': lambda pattern: Q(url=pattern),
    'substring': lambda pattern: Q(url__icontains=pattern),
    'regex': lambda pattern: Q(url__iregex=pattern),
    'domain': lambda pattern: Q(url__istartswith=f"http://{pattern}") | Q(url__istartswith=f"https://{pattern}") | Q(url__istartswith=f"ftp://{pattern}"),
}

  2. The Django admin uses these search_fields (many icontains filters at once, chained with OR):

    search_fields = ['url', 'timestamp', 'title', 'tags__name']

What's needed: a new subcommand for list, say search? Or a new filter-type called search?

Currently, if search is enabled, the Django admin search runs two queries: the one from search_fields plus the one from the search backend.
If search is enabled but the backend fails, a warning message is shown and the results just use the first query.

A test would be the most helpful.
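For reference, a minimal sketch of how those two queries could be combined in the admin; the query_search_index helper and the fallback behavior below are assumptions for illustration, not necessarily this PR's code:

    from django.contrib import admin, messages

    def query_search_index(search_term):
        """Hypothetical helper: ask the full-text backend for matching Snapshot ids."""
        raise NotImplementedError

    class SnapshotAdmin(admin.ModelAdmin):
        search_fields = ['url', 'timestamp', 'title', 'tags__name']

        def get_search_results(self, request, queryset, search_term):
            # Query 1: the usual icontains filters generated from search_fields.
            qs, use_distinct = super().get_search_results(request, queryset, search_term)
            if not search_term:
                return qs, use_distinct
            try:
                # Query 2: Snapshot ids returned by the full-text backend.
                qs = qs | queryset.filter(pk__in=query_search_index(search_term))
            except Exception:
                # Backend disabled or unreachable: warn and fall back to Query 1 only.
                messages.warning(request, 'Full-text search backend unavailable, showing basic results.')
            return qs, use_distinct

Combining the two querysets with | keeps the admin usable even when the backend returns nothing or is down.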

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

Behavior proposal (this is how it works right now):

ADMIN

Search enabled and all OK

[Screenshot from 2020-11-19 23-42-12]

Search enabled but backend fails

[Screenshot from 2020-11-19 23-43-28]

Search disabled: just uses search_fields

CLI

New search filter backend enabled, all OK

[Screenshot from 2020-11-19 23-45-14]

New search filter backend enabled but failing

[Screenshot from 2020-11-19 23-46-45]

New search filter backend disabled (fail and OK)

[Screenshot from 2020-11-19 23-52-11]

@cdvv7788 (Contributor)

Yes, I had in mind a new filter. Where is it searching? Using readability?

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@cdvv7788 Currently I've only added the readability content, but any text from the extractors could be added, e.g. each of the headers, etc.

What is needed is to add a list of texts to index_texts on the ArchiveResult dataclass returned by the extractor. If index_texts is not present, indexing simply passes.

Example from readability:

    return ArchiveResult(
        cmd=cmd,
        pwd=str(out_dir),
        cmd_version=READABILITY_VERSION,
        output=output,
        status=status,
        index_texts= [readability_content] if readability_content else [],
        **timer.stats,  
    )
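For context, this works if the dataclass declares the field with a default, so extractors that don't produce text can simply omit it. The declaration below is a guess for illustration; only the index_texts line is the point:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class ArchiveResult:
        cmd: List[str]
        pwd: Optional[str]
        cmd_version: Optional[str]
        output: Optional[str]
        status: str
        # Texts to feed to the search index; None/[] means "nothing to index, just pass".
        index_texts: Optional[List[str]] = None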

@jdcaballerov (Contributor, Author)

@cdvv7788
Sonic is a service accessed via a telnet-like protocol; it can't be added to the Dockerfile unless one includes supervisord to manage multiple processes in the container.

I've added sonic to the docker-compose file as a service. Please don't forget to uncomment the build . line and comment out the next one, so that it doesn't use the image from Docker Hub.

@pirate (Member) commented Nov 20, 2020

How do you handle retroactively adding previous archive data to the index? When a user upgrades to this release will it start indexing all their previously archived text?

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@pirate Retroactive indexing isn't happening yet. The indexing takes place in archivebox/extractors/__init__.py archive_link, if indexing is enabled (config.USE_INDEXING_BACKEND).

I assumed that someone wanting indexing would run archivebox update.
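A rough sketch of what that hook could look like; write_search_index below is a stand-in for the backend call, not necessarily the PR's actual function:

    USE_INDEXING_BACKEND = True   # stands in for config.USE_INDEXING_BACKEND

    def write_search_index(snapshot_id: str, texts: list) -> None:
        """Stand-in for the search backend call that indexes texts for one snapshot."""
        raise NotImplementedError

    def index_archive_result(snapshot_id: str, result) -> None:
        """Called from archive_link for each extractor's ArchiveResult."""
        if not USE_INDEXING_BACKEND:
            return
        texts = getattr(result, 'index_texts', None) or []
        if not texts:
            return   # the extractor produced nothing indexable: just pass
        try:
            write_search_index(snapshot_id, texts)
        except Exception:
            # An unreachable backend should never break archiving itself.
            pass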

WARNING

With Sonic one must be careful about what to expect, e.g. about ArchiveBox marking something as indexed, since Sonic evicts data from the index using a sliding window controlled by its config parameters. I don't know how these numbers translate to huge archives.

Let's say I start indexing a huge archive and keep state in ArchiveBox marking a snapshot's text as indexed. Sonic could then evict that data from the index and I'd end up with an inconsistent state, i.e. ArchiveBox marking it as indexed while it has been evicted from Sonic.

Sonic defines some limits:

query_limit_maximum (type: integer, allowed: numbers, default: 100) — Maximum search results limit for a query command (if the LIMIT command modifier is being used when issuing a QUERY command)

retain_word_objects (type: integer, allowed: numbers, default: 1000) — Maximum number of objects a given word in the index can be linked to (older objects are cleared using a sliding window)

Example:
I index 2000 articles containing the word skateboarding; when I query for that word, Sonic will only take into account the latest 1000 indexed and will return at most 100.

Sonic is not designed to be exhaustive; it's better used as a fast suggester, or to search chat conversations, favoring the most recently indexed conversations while the rest remain accessible with pagination.

Sonic's design favors having many buckets, but ArchiveBox's use case requires just one (unless we find some meaningful way to partition further: tags, etc.).

@pirate (Member) commented Nov 20, 2020

Hmmm, we definitely need a solution that can index >50k articles with ~2000 words each, and it needs to start indexing retroactively somehow when it's enabled. archivebox update will not re-run extractors that have already run, so I don't think that's enough to index everything.

Are there any sonic alternatives you know of that can handle indexing 100m - 2bn words?

@ekiel commented Nov 20, 2020 via email

@pirate (Member) commented Nov 20, 2020

Recoll seems pretty heavyweight; ideally I'm looking for something similar to Sonic, just with bigger index capacity.
It does have a python API though, which seems ok: https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.PROGRAM.PYTHONAPI.INTRO.html

Side note, it would be great to offer ripgrep as a search backend fallback @jdcaballerov (e.g. in Docker) if sonic or another index-searching backend is not available. Ripgrep is faster and simpler for smaller archives and can be installed as a static binary and called when needed, instead of needing a constantly running backend.
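A fallback like that could just shell out to rg and map matching files back to snapshot folders; a minimal sketch, assuming the archive/<timestamp>/ directory layout (not the PR's actual backend code):

    import subprocess
    from pathlib import Path
    from typing import List

    ARCHIVE_DIR = Path('archive')   # assumption: one folder per snapshot, named by timestamp

    def rg_search(pattern: str) -> List[str]:
        """Return timestamps of snapshots whose archived files contain the pattern."""
        proc = subprocess.run(
            ['rg', '--files-with-matches', '--fixed-strings', '--ignore-case',
             pattern, str(ARCHIVE_DIR)],
            capture_output=True, text=True,
        )
        timestamps = set()
        for line in proc.stdout.splitlines():
            # e.g. archive/1605900000.0/readability/content.txt -> 1605900000.0
            timestamps.add(Path(line).relative_to(ARCHIVE_DIR).parts[0])
        return sorted(timestamps)

Since rg only runs when a search is issued, nothing needs to stay resident between queries.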

@jdcaballerov (Contributor, Author) commented Nov 20, 2020

@pirate Sonic might be able to handle the requirements.

It's not that it can't handle 50k articles, but a query will return at most 65,535 results, and a word can be associated with at most retain_word_objects objects (in our case Snapshot ids), which can be huge.
It works as follows:

    word     ids of Snapshots containing it
    cat      2, 4, 54, 34    (the max length here is retain_word_objects)
    house    2, 8, 34, 34

A query for cat with query_limit=2 returns docs 2, 4.

query_limit_maximum is a u16, so it can go up to 65,535.
retain_word_objects is a usize, so it can go up to 18,446,744,073,709,551,615 on 64-bit systems.

I think the question is whether we need to be exhaustive: if a maximum of 65,535 results per query, and a word linked to at most X articles, is acceptable, then we are OK.

The retroactive indexing can be figured out.

@jdcaballerov (Contributor, Author)

I've added a ripgrep (rg) backend and set it as the default.

@jdcaballerov changed the title from "Full-text search using Sonic" to "Full-text search" on Nov 23, 2020
@jdcaballerov (Contributor, Author) commented Nov 23, 2020

Added the ability to do retroactive indexing using archivebox update --index-only (plus filters). It selects the content to index from the first available extractor output registered as a succeeded ArchiveResult, in the order (readability, singlefile, dom, wget), and runs indexing on the current search backend (the ripgrep backend doesn't do anything, but the sonic backend does).

PS: --index-only still needs to be corrected.
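A sketch of that selection order; the relation and field names (archiveresult_set, extractor, status, output, snapshot_dir) are guesses for illustration, not necessarily the PR's exact schema:

    from pathlib import Path

    EXTRACTOR_PRIORITY = ['readability', 'singlefile', 'dom', 'wget']

    def get_indexable_text(snapshot) -> str:
        """Return the output text of the highest-priority extractor that succeeded for a snapshot."""
        for extractor in EXTRACTOR_PRIORITY:
            result = snapshot.archiveresult_set.filter(extractor=extractor, status='succeeded').first()
            if result and result.output:
                output_path = Path(snapshot.snapshot_dir) / result.output
                if output_path.exists():
                    return output_path.read_text(errors='replace')
        return ''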

@pirate changed the base branch from master to v0.5.0 on December 2, 2020 at 22:28
@pirate (Member) commented Dec 5, 2020

Can you rebase when you get a chance? Then I'll merge this PR next.

@pirate mentioned this pull request on Dec 5, 2020
@pirate (Member) commented Dec 5, 2020

Fixed the conflicts and merged this here 😁: #570
