Search Improvements
- The query parsing and evaluation model has been rewritten from scratch, as the original model was flaky and hard to maintain. The new model performs query segmentation and introduces a graph-based model for bag-of-words query evaluation. Writeup, #89
- Full phrase matching has been implemented, which not only makes "quoted search terms" work better but also lets the result ranking consider the word order in the query. Writeup, #99
'search.marginalia.nu' Application
- A new screenshot capture service has been added. Screenshots are fetched or refreshed by page views on the site:-viewer, if the fetch/refresh criteria are met. This has helped the screenshot library grow considerably. #120
- CSS overhaul and dark mode from @samstorment #94, #98
- Added pagination support for the search results. This is in a sense fake pagination: each index node fetches the total number of requested results, but only the best results across all nodes are selected in the query service, and the pagination is done within this set. Paging beyond page 8 or 9 is unlikely to be helpful anyway. #119
- A new actor has been added that periodically polls certain pages (e.g. Hacker News) for outbound links, and adds the linked domains to the list of domains to be crawled.
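
The pagination approach described above can be illustrated with a minimal sketch. This is not the project's code: the `FakePagination` class, the `Result` record, and the assumption that a higher score means a better result are all hypothetical and serve only to show the idea of merging per-node results, keeping the best ones, and slicing pages out of that fixed set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; names and scoring direction are assumptions.
final class FakePagination {
    // Stand-in for a search result; higher score is assumed to be better.
    record Result(String url, double score) {}

    /**
     * Merges the results returned by each index node, keeps the best
     * `limit` results overall, and returns the requested 1-based page.
     */
    static List<Result> page(List<List<Result>> perNode, int limit, int pageSize, int pageNumber) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        List<Result> merged = new ArrayList<>();
        perNode.forEach(merged::addAll);
        merged.sort(Comparator.comparingDouble(Result::score).reversed());

        // Only the best results across all nodes survive; pages are cut from this set.
        List<Result> best = merged.subList(0, Math.min(limit, merged.size()));
        int from = Math.min((pageNumber - 1) * pageSize, best.size());
        int to = Math.min(from + pageSize, best.size());
        return List.copyOf(best.subList(from, to));
    }
}
```

Because pages are cut from a bounded set, requesting a page past the end simply yields an empty list, which matches the observation that deep paging is of limited use.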
Base application
- Added a new domain management view that permits inspection and index node assignment on a domain level, as well as the easy addition of new domains to be crawled.
- Removed the crawl spec abstraction. This was historically used to specify which domains were to be crawled, but using the domains table is better, and since that table is now much easier to interact with thanks to the new domain view, the crawl specs became redundant.
Architecture
- The project has been migrated to JDK 22.
- A new dependency on ZooKeeper has been introduced to let the project self-manage routing and port mapping, which frees the project from depending on Docker for this, #81. With additional work, the project can now be run fully on bare metal outside of Docker, #90 #92
- Ephemeral data is now stored in the built-for-marginalia Slop format, with a new marginalia-adjacent library for reading and writing this data. As a result of this, and the optimizations surrounding it, index construction time is now reduced by 80% in production.
Misc
- Address findings from UX and security assessments, #93 #101
- Improved error recovery and logging for "download sample crawl data" actor
Work In Progress
- Crawler now attempts to fetch favicons. The aim is to be able to present them along with the search results in the future, but for now we're just gathering them.
- Content-type probing via HEAD requests is disabled for now, while the use of the Accept header is evaluated instead.
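
As a rough illustration of the Accept header approach mentioned above, the sketch below builds a GET request that states a preference for HTML instead of probing with a HEAD request first. It uses the JDK's `java.net.http` API purely for illustration; the `AcceptHeaderSketch` class, the helper name, and the specific header value are assumptions, and the crawler may well use a different HTTP client. Servers are also free to ignore the header, so the response's Content-Type would still need checking.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical illustration; not the crawler's actual request code.
final class AcceptHeaderSketch {
    /** Builds a GET request that declares a preference for HTML content. */
    static HttpRequest htmlPreferringGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                // Assumed header value: prefer HTML, tolerate anything else at low priority.
                .header("Accept", "text/html, application/xhtml+xml;q=0.9, */*;q=0.1")
                .GET()
                .build();
    }
}
```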
Notable Bugfixes
- Fix bug that caused some domains not to be fully crawled. The exact circumstances are hard to pin down, but in some cases the crawler would halt at the first document and fail to load links from it.
- The crawler was not properly stripping the W/ prefix from weak ETags when making conditional requests, causing unnecessary traffic. This has been corrected.
- Fix bug where the summarizer would pick up the contents of <noscript> tags.
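
The weak ETag fix listed above comes down to a small string operation. The following is a minimal sketch of that step only, not the crawler's actual code; the `WeakEtag` class and method name are hypothetical.

```java
// Hypothetical illustration of stripping the weak-validator marker from an ETag.
final class WeakEtag {
    /**
     * Returns the ETag without a leading "W/" marker, e.g. W/"abc" becomes "abc"
     * (quotes retained). Strong ETags are returned unchanged, apart from trimming.
     */
    static String stripWeakPrefix(String etag) {
        if (etag == null) {
            return null;
        }
        String trimmed = etag.trim();
        // The weak marker is the case-sensitive two-character sequence W/ (RFC 9110).
        return trimmed.startsWith("W/") ? trimmed.substring(2) : trimmed;
    }
}
```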
Bugs found in dependencies
- Identified a JVM compiler error as the cause of transient instabilities during index construction. This should be corrected in the latest version of GraalVM JDK 22. Writeup
New Contributors
- @patrickbreen made their first contribution in #85
- @jmholla made their first contribution in #88
- @samstorment made their first contribution in #94
- @jaseemabid made their first contribution in #102
Full Changelog: v24.01.2...v24.10.0-rc