Full-text search #543
Conversation
Will change to use generators for the ids.
@jdcaballerov LGTM! Can you please update the
@cdvv7788 will sonic be a required dependency?

Right now only admin; I'll check tomorrow how we could integrate it with
Not necessarily. You can leave it as the first option, and use the current mechanism as a fallback. I just want to have it in the cli where we can test that it actually works without digging into the code.
@cdvv7788 Could you please be more specific about the needs for

The CLI currently filters links with:

```python
LINK_FILTERS = {
    'exact': lambda pattern: Q(url=pattern),
    'substring': lambda pattern: Q(url__icontains=pattern),
    'regex': lambda pattern: Q(url__iregex=pattern),
    'domain': lambda pattern: (
        Q(url__istartswith=f"http://{pattern}")
        | Q(url__istartswith=f"https://{pattern}")
        | Q(url__istartswith=f"ftp://{pattern}")
    ),
}
```

while the django admin uses these search fields (many icontains filters at once, chained with OR):

```python
search_fields = ['url', 'timestamp', 'title', 'tags__name']
```

What's needed is a new subcommand for

Currently, if search is enabled, the django admin search adds two queries: the one from the search_fields plus the one from the search backend. A test will be the most helpful.
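A hypothetical sketch of how a new 'search' filter type could sit alongside the URL filters above, delegating id lookup to a full-text backend. The names here (`backend_ids`, the kwargs-dict filters standing in for Django `Q` objects) are illustrative, not ArchiveBox's actual API:

```python
from typing import Callable, Dict, List

def backend_ids(pattern: str) -> List[str]:
    # Placeholder: ask the configured full-text backend
    # (sonic, ripgrep, ...) for matching snapshot ids.
    return []

# Filters return plain queryset-kwargs dicts so the sketch is
# self-contained; real code would build Django Q objects instead.
LINK_FILTERS: Dict[str, Callable[[str], dict]] = {
    'exact':     lambda pattern: {'url': pattern},
    'substring': lambda pattern: {'url__icontains': pattern},
    # New: full-text search filters snapshots by ids from the backend:
    'search':    lambda pattern: {'id__in': backend_ids(pattern)},
}
```

The point is that the search backend only needs to return ids; everything downstream (the admin, the CLI list output) keeps working on an ordinary queryset filter.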
Yes, I had in mind a new filter. Where is it searching? Using readability?
@cdvv7788 currently I've only added readability content, but any texts from the extractors could be added, i.e. each one of the headers, etc. What is needed is to add a

Example from readability:

```python
return ArchiveResult(
    cmd=cmd,
    pwd=str(out_dir),
    cmd_version=READABILITY_VERSION,
    output=output,
    status=status,
    index_texts=[readability_content] if readability_content else [],
    **timer.stats,
)
```
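To illustrate how the `index_texts` field above could feed an indexer, here is a minimal sketch that flattens every extractor's texts into `(snapshot_id, text)` pairs. The `ArchiveResult` stand-in and `texts_to_index` helper are illustrative names, not ArchiveBox's real classes:

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

@dataclass
class ArchiveResult:
    """Cut-down stand-in showing only the fields relevant here."""
    extractor: str
    index_texts: List[str] = field(default_factory=list)

def texts_to_index(snapshot_id: str,
                   results: Iterable[ArchiveResult]) -> List[Tuple[str, str]]:
    """Flatten every extractor's index_texts into (snapshot_id, text)
    pairs, ready to be pushed to whichever search backend is configured."""
    return [(snapshot_id, text)
            for result in results
            for text in result.index_texts]
```

Extractors that have nothing to contribute (an empty `index_texts`) simply add no pairs, so the indexer never sees empty documents.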
@cdvv7788 I've added sonic to the docker-compose file as a service. Please don't forget to uncomment the line
How do you handle retroactively adding previous archive data to the index? When a user upgrades to this release will it start indexing all their previously archived text? |
@pirate Retroactively it is not doing anything; the indexing is taking place in

I thought someone wanting indexing would run

WARNING: With

Let's say I start indexing a huge archive and keep a state in archivebox to mark a text related to a snapshot as indexed. Then

Sonic defines some limits:

Example:

Sonic's design favors having many buckets, but the archivebox use case requires one (unless we find some meaningful way to further partition: tags, etc.)
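One way to cope with per-object size limits in a backend like Sonic is to split long extractor texts into several smaller records before pushing them. A minimal sketch; the `max_bytes` value is illustrative, since the real limit depends on the backend's configuration:

```python
from typing import List

def chunk_text(text: str, max_bytes: int = 1000) -> List[str]:
    """Split text on word boundaries into chunks whose UTF-8 size
    stays under max_bytes, so each chunk fits in one index record."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for word in text.split():
        wlen = len(word.encode('utf-8')) + 1  # +1 for the joining space
        if current and size + wlen > max_bytes:
            chunks.append(' '.join(current))
            current, size = [], 0
        current.append(word)
        size += wlen
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Each chunk would then be pushed as its own record for the same snapshot id, so a query matching any chunk still resolves back to the snapshot.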
Hmmm we definitely need some solution that can index >50k articles with ~2000 words each, and it needs to start indexing retroactively somehow when it's enabled. Archivebox update will not re-run extractors that have already run, so I don't think that's enough to index everything. Are there any sonic alternatives you know of that can handle indexing 100m - 2bn words?
Is there a way to integrate recoll? https://www.lesbonscomptes.com/recoll/
Eric Kiel
Recoll seems pretty heavyweight; ideally I'm looking for something similar to

Side note: it would be great to offer
@pirate it's not that it can't handle 50k articles, but a query will return at most 65,535 results, and a word can be associated with at most

A query for cat with query_limit=2 returns docs=2,4; query_limit_maximum can go up to the u16 max, 65,535.

I think the question is whether it's OK not to be exhaustive: if we're fine with at most 65,535 results per query and a word related to a maximum of X articles, we are OK. The retroactive indexing can be figured out.
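Since Sonic's QUERY command accepts LIMIT and OFFSET arguments, one way to work within the per-query result cap is to page through results until a short page comes back. A minimal sketch with a pluggable query function; the names are illustrative, not ArchiveBox's actual API:

```python
from typing import Callable, Iterator, List

def paginated_query(query_fn: Callable[[str, int, int], List[str]],
                    term: str, page_size: int = 65535) -> Iterator[str]:
    """Yield every id matching `term` by repeatedly calling
    query_fn(term, limit, offset), advancing the offset until a page
    comes back shorter than page_size (i.e. results are exhausted)."""
    offset = 0
    while True:
        page = query_fn(term, page_size, offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```

This doesn't lift the word-to-document association limit, but it does make a single logical search exhaustive over whatever the backend stores.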
I've added a ripgrep (rg) backend and set it as the default.
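A filesystem backend like this can shell out to ripgrep to find snapshot directories whose archived content matches a pattern. A minimal sketch of building the invocation (the actual ArchiveBox backend may use different flags or post-processing):

```python
from typing import List

def rg_search_cmd(pattern: str, archive_dir: str) -> List[str]:
    """Build a ripgrep command that lists files matching `pattern`
    under the archive directory:
      -l  print only the paths of matching files
      -i  case-insensitive matching
      -e  mark the pattern explicitly so it can't be read as a flag"""
    return ['rg', '-l', '-i', '-e', pattern, archive_dir]
```

The matching file paths would then be mapped back to their snapshot ids (e.g. from the timestamp directory component) to produce search results. Unlike sonic, this needs no separate index or service, at the cost of scanning files on every query.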
Added the ability to do retroactive indexing using ps; |
Fixed the conflicts and merged this here 😁: #570 |
Summary
This PR adds the ability to do full-text search 🎉
Related issues
#22 #24
Changes these areas