Add new SQLite FTS5 full-text search backend #1241

Merged: 7 commits, Oct 31, 2023


overhacked
Contributor

  • WIP: add sqlite search backend boilerplate
  • Introduce SQLite FTS5-powered search backend

@pirate
Member

pirate commented Oct 9, 2023

Great idea! I was thinking about doing this myself too but you beat me to it.

Have you tested to make sure the monkey patching at the top doesn't affect other areas of the codebase that depend on the native database connection provided by django?

@pirate
Member

pirate commented Oct 11, 2023

Let me know when you're ready for final review and I'll test/merge it!

@overhacked
Contributor Author

I tested this on my "prod" dataset, and the index it created was huge at 1GB, 30% of the size of the data/archive folder. This seemed odd, because it's using FTS5 "contentless" indexes. I'm fairly sure that what's happening is that SQLite is indexing very long "terms", in the form of base64 strings in data: URLs, when it is given the contents of singlefile.html as the indexable content.
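To illustrate the suspected cause, here is a minimal sketch of a contentless FTS5 table that receives a `data:` URL. The table names (`demo_fts`, `demo_vocab`), the helper functions, and the sample blob are made up for illustration and are not taken from this PR. It assumes Python's `sqlite3` module is linked against an SQLite build with FTS5 enabled, and it uses the `fts5vocab` virtual table to show which terms end up in the index.

```python
import sqlite3

# Hypothetical sample: a base64 payload without "+" or "/" characters,
# so the default tokenizer keeps it as one single term.
BLOB = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk"


def build_index(docs):
    """Index (rowid, text) pairs in an in-memory contentless FTS5 table."""
    db = sqlite3.connect(":memory:")
    # content='' makes the table contentless: only the inverted index is
    # stored, not the original text.
    db.execute("CREATE VIRTUAL TABLE demo_fts USING fts5(texts, content='')")
    # fts5vocab exposes the terms stored in the index of demo_fts.
    db.execute("CREATE VIRTUAL TABLE demo_vocab USING fts5vocab(demo_fts, row)")
    db.executemany("INSERT INTO demo_fts(rowid, texts) VALUES (?, ?)", docs)
    return db


def search(db, query):
    """Return the rowids of documents matching an FTS5 query."""
    rows = db.execute(
        "SELECT rowid FROM demo_fts WHERE demo_fts MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]


def longest_term(db):
    """Return the longest term currently stored in the index."""
    row = db.execute(
        "SELECT term FROM demo_vocab ORDER BY length(term) DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


db = build_index([
    (1, "An archive page about SQLite"),
    (2, f'<img src="data:image/png;base64,{BLOB}">'),
])
print(search(db, "archive"))
print(longest_term(db))
```

In this sketch the longest indexed term is the lowercased base64 payload. Each distinct payload would add another unique term like it, which is consistent with the index growth described above.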

I thought about and experimented with several approaches to this, and I think the best approach is to (very loosely) parse singlefile.html and pass only text and some attributes to the search backend, rather than the entire HTML contents. I've opened a pull request with this approach: #1244. This resulted in a 10x reduction in index size. I think that merging this PR without somehow addressing the ballooning index size would result in issues down the road.
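The loose-parsing idea can be sketched with the standard library's `html.parser`. This is not the code from #1244; the class name, the set of skipped tags, and the choice of `alt` and `title` as the attributes worth keeping are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and a few attributes, skipping code and styles."""

    SKIP_TAGS = {"script", "style"}   # assumed: content that is never useful
    KEEP_ATTRS = {"alt", "title"}     # assumed: attributes worth indexing

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
            return
        for name, value in attrs:
            # Keep short descriptive attributes, never data: URLs.
            if name in self.KEEP_ATTRS and value and not value.startswith("data:"):
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return only the indexable text of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello <b>world</b></p>"
    '<img alt="A cat" src="data:image/png;base64,AAAA"></body></html>'
)
print(extract_text(sample))
```

Only the output of `extract_text` would then be handed to the search backend, so markup, JavaScript, and `data:` URLs never reach the index.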

I think that Sonic doesn't have this problem because it limits maximum term length, but FTS5 doesn't have a configuration option for maximum (or minimum) term length. Or, maybe it's just the overall MAX_SONIC_TEXT_TOTAL_LENGTH? Even if that's the case, the approach of parsing and including only meaningful text content, not JavaScript code, data URLs, and markup, would improve the signal-to-noise ratio of indexed content, so searching for "html" wouldn't hit on every document that had its singlefile.html indexed.

Use SQLite's FTS5 extension to power full-text search without any
additional dependencies. FTS5 was introduced in SQLite 3.9.0,
[released][1] in 2015 so should be available on most SQLite
installations at this point in time.

[1]: https://www.sqlite.org/changes.html#version_3_9_0
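Whether FTS5 is usable depends on both the SQLite version and how the library was compiled. A minimal probe, assuming Python's standard `sqlite3` module, could look like the following; the function name and the `fts5_probe` table are hypothetical and not part of this PR.

```python
import sqlite3

# FTS5 first shipped in SQLite 3.9.0.
MIN_FTS5_VERSION = (3, 9, 0)


def fts5_available():
    """Return True if the linked SQLite library can create FTS5 tables."""
    if sqlite3.sqlite_version_info < MIN_FTS5_VERSION:
        return False
    db = sqlite3.connect(":memory:")
    try:
        # Fails with OperationalError if SQLite was built without FTS5.
        db.execute("CREATE VIRTUAL TABLE fts5_probe USING fts5(x)")
    except sqlite3.OperationalError:
        return False
    finally:
        db.close()
    return True


print(sqlite3.sqlite_version, fts5_available())
```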
Retry with table creation should fail if it is attempted for a second
time.

`connection` could cause confusion with `django.db.connection` and
`get_connection` is a better callable name.

Clean up error handling, and report a better error message
on search and flush if FTS5 tables haven't yet been created.

Add some mypy comments to clean up type-checking errors.

If creating the FTS5 tables fails due to a known version
incompatibility, report the required version to the user.
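The error handling described in these commit messages might look roughly like the sketch below. It is not the PR's code: the `search_idx` table, the `SearchIndexError` class, and the message wording are invented for illustration, and for simplicity it treats any `OperationalError` during table creation as a possible version problem, whereas the commit only mentions known incompatibilities.

```python
import sqlite3

REQUIRED_SQLITE = "3.9.0"  # first SQLite release that includes FTS5


class SearchIndexError(RuntimeError):
    """Raised with a user-facing message when the index is unusable."""


def create_index(db):
    """Create the FTS5 table, reporting the required version on failure."""
    try:
        db.execute(
            "CREATE VIRTUAL TABLE IF NOT EXISTS search_idx "
            "USING fts5(texts, content='')"
        )
    except sqlite3.OperationalError as err:
        raise SearchIndexError(
            f"Could not create FTS5 tables ({err}). FTS5 requires SQLite "
            f">= {REQUIRED_SQLITE}; this Python uses SQLite "
            f"{sqlite3.sqlite_version}."
        ) from err


def search_ids(db, query):
    """Search the index, with a clear error if it was never created."""
    try:
        rows = db.execute(
            "SELECT rowid FROM search_idx WHERE search_idx MATCH ?",
            (query,),
        ).fetchall()
    except sqlite3.OperationalError as err:
        if "no such table" in str(err):
            raise SearchIndexError(
                "The FTS5 search tables have not been created yet; "
                "index some content first."
            ) from err
        raise
    return [row[0] for row in rows]


db = sqlite3.connect(":memory:")
try:
    search_ids(db, "anything")
except SearchIndexError as err:
    print(err)
```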
@overhacked
Contributor Author

> Let me know when you're ready for final review and I'll test/merge it!

Alright, @pirate, I think it's ready for a hard look. Thanks!

@pirate changed the title from "fts5 search" to "Add new SQLite FTS5 full-text search backend" on Oct 31, 2023
@pirate merged commit 62e077a into ArchiveBox:dev on Oct 31, 2023
3 checks passed

2 participants