Releases: MarginaliaSearch/MarginaliaSearch
v24.01.2
A second corrections release. Primarily addresses technical debt and stability issues.
New features:
- Cleaned up the code and technical documentation in several places
- Sideloading improvements, notably the ability to sideload reddit posts d970836
- Additional search options QueryStrategy and TemporalBias are exposed through the query-service API 66b3e71
- Domain ranking is overhauled to use JGraphT instead of the primordial home-cooked pagerank impl previously in effect: #80
- The start-up time of several services is improved in production by delaying the loading of the domain id blacklist. e61e7f4
🧯 Notable Fixes:
- A disastrous series of actions in the operations GUI is mitigated: it was possible to lose crawl data if a re-crawl was performed before loading a spec-generated crawl. This is mitigated in several ways, both to address the specific scenario and to reduce the harm of similar unforeseen situations: c73e43f
Marginalia App Changes:
- Correcting the sort order of the results in the API gateway 6950dff
- The 'popular' filter was temporarily removed, as it doesn't really do anything except be a bit worse than 'no filter'
- The 'vintage' and 'show recent' filters were updated to use the new TemporalBias feature 66b3e71 16526d2 a175b36
- A new 'search in title' option was added, using the new QueryStrategy parameter 66b3e71 2515993 752e677
v24.01.1
This is a small corrections release that primarily addresses bugs and issues with the previous release.
Core:
- Ranking parameters were tweaked to improve search result accuracy. This seems to have had a pretty significant impact on the search results: eb59ac8
- Fixed an issue where WARC files would not decompress the body of the response according to the WARC specs: 8340aa2 929caed
- The crawler and processor now understand HTML redirects: 785d8de
- Sideloaded wiki content is cut short to improve accuracy (because wikis tend to go on about unrelated things toward the middle/end), so these items are now given more lenient minimum-length checks: fa145f6
- Problems cropped up loading full Wikipedia that were due to 32 bit size constraints being breached. This is the result of some relatively flimsy heuristics for breaking up the loader's keyword data. This has been mitigated, and files will now stay safely below 2 GB decompressed size. In fixing this, an opportunity for memory optimization also cropped up, and as a result, full Wikipedia will now load correctly on 16 GB of RAM (assuming the system is configured with system.conserveMemory): 6dcc200 53c575d d986f90 6e7649b
- Fixed an issue where stackoverflow was not picked up properly by the sideloader: c6313a5
- Loading speed was improved by splitting the repartition action into two separate steps: 467ba5b 8acbc6a
- The internal API between the control-service and executor-services has been migrated off REST to gRPC, as the former was a bit of a maintainability headache. Other internal APIs will probably follow soon. #75
Marginalia Search App:
- Fixed a !bang handling bug, thanks @conor-f for finding the issue and helping with the ddg bang.
- Added clustering of search results from the same domain, where additional good matches will be listed under the best search result: bcd0dab a778463 7cc8b0f 10bad63
Experimental:
v24.01.0
This is a major new release of the search engine software, corresponding to nearly four months of changes. In these months, the state of the code hasn't been stable enough for a new release, but it's now been brought to a stable point.
Release Highlights:
- The installation procedure has been cleaned up.
- It's now possible to run the search engine in a white label/bare-bones mode, without any of the Marginalia Search branding or logic.
- The Marginalia Search web interface has been overhauled. The site-info page has especially been given a large upgrade.
- The search engine can use anchor texts to supplement keywords.
- The search engine can use multiple index shards.
- The operations GUI has been overhauled.
- An operations manual has been written.
- The crawler can now resume crawls in progress, thanks to intermediate WARCs.
- The search engine can import several formats without external pre-processing.
- The Academia filter has been improved
- The Recipe filter has been improved
- The system now penalizes documents that have obvious hallmarks of being written by ChatGPT in its quality assessment.
Other technical changes:
- Several bugfixes in the ranking algorithm have improved search result precision
- The domain link graph has moved out of the database, improving processing time
- The system can be configured to automatically perform db migrations
- Ranking algorithm improvements
Known Limitations:
- Service discovery is currently a bit limited, making it only possible to run the system within docker (or similar) at this point, as host names and ports are not configurable. This is not intended to be a permanent state of affairs.
- The Marginalia Search website has lost its dark mode.
- There might be an off-heap resource leak in the crawler. It's primarily a problem with very long crawl runs.
v23.10.0
This is a mostly technical release. It takes the index from 106M to 164M documents.
Zero Downtime Upgrades and halved memory consumption
The initial focus of the release was to address the sometimes lengthy downtimes that have plagued the project when loading a new index.
There is a somewhat lengthy write-up about this here; but the short version is that this was very successful. A drastic optimization not only removed the need for downtime, but also added neat new features and slashed the RAM requirements in half!
An annoyance-fueled optimization methodology also slashed the index construction time in half at a later point. Pull Request #52.
Java 21 PREVIEW
There were unintended consequences of the changes above, and the system needed an upgrade to Java 21 with enabled preview features. This has to do with off-heap memory lifecycle management. Up until Java 21 (preview), Java offered no way of explicitly closing off-heap memory, including memory mapped files. This caused the filesystem to hold onto references to the mapped data even after the associated files had been deleted, which vastly increased the amount of disk required to construct the index using the new method of recursive merging.
A positive side-effect of this is that using the new foreign memory API is a lot faster than Java's old byte buffers, since the size can exceed 2 GB without userspace paging.
There are some stray vestigial remains of the old way of memory mapping files still lingering, to be rooted out in the next release.
Writeup: https://www.marginalia.nu/log/89-disk-usage-mystery/
Commits: d0aa75
Parquet files in converter and crawl specs
For a long time, compressed JSON files have been used to store much of the unprocessed and half-processed crawl data. This is very easy to use, but tends to be a bit awkward when you have millions of the files. It's also not the most performant format in the world, since e.g. it doesn't announce how long a string is upfront; you need to just keep reading to find out.
Parquet is a clever format popular in big data applications that largely solves these problems. Parquet in Java is not so great, however, since the only(?) implementation is deeply tied to the Hadoop ecosystem, and separating the two isn't entirely trivial.
Thankfully there's a helpful library called parquet-floor that tries to do this. It is a bit on the basic side, but its technological and biological distinctiveness was added to our own, and now it does what's necessary.
The biggest benefit of this is that it's much easier to interact with. Previously to inspect some processed data, you'd need to use some combination of unix command line tools and jq to get at it. With parquet, much more convenient tools are available. The entire dataset can be queried with SQL using for example DuckDB!
The parquetification of the project is still ongoing. The crawl data needs to be addressed too, but this is in a future release.
Improved sideloading support
There's been kinda-sorta support for sideloading encyclopedia data from Wikipedia already, but it's been pretty shaky. This release introduces the ability to sideload not only Wikipedia data, but also Stackexchange dumps and just directories with HTML for e.g. javadocs.
These will not go live in the production index until it can be figured out how to make such large popular websites not show up as the first result for every query.
I wrote some rough documentation for how to do this.
Commits: 70aa04 5b0a6d 6bbf40 98bcdf 9b385e 5e5aaf
Notable bugfixes:
- A concurrency bug was causing some of the position data to be corrupted. This had a fairly adverse effect on the quality of the search results, causing bad matches to be promoted and good matches to be dismissed as irrelevant. a433bb
v23.08.1
Hotfix release that addresses some problems that cropped up with v23.08.0.
- Full JDK20 compatibility. The entire project should build with JDK20 now, and install instructions have been tried on a clean linux system to verify it all works.
- Increase the theoretical maximum number of keywords in the lexicon from 0.75 x 2^30 (800mn) to 2^31 (2 bn).
- Partially roll back the change in language identification, as this had unexpected side effects in increasing the number of keywords.
- Fix a resource leak in the loader due to improper use of Cleaners
- Fix a problem where the converter-process would use 100% CPU on a single core when running any process
- Fix a minor bug with the loader's logging that caused it to not be able to report how much data is in each domain
23.08.0
This release mainly aims to improve the operational side of the search engine, with an emphasis on automating tedious manual processes and optimizing crawling and data processing to use fewer resources.
Conventionally I try to link to relevant commits in these notes, but some of the changes were so sweeping and protracted it was hard to narrow it down to individual commits; in those cases I'll link to the relevant code instead.
New Features
Better Feature Detection and Blog Filter
The FeatureExtractor, which analyzes websites' HTML for things like advertisements and tracking code, has been improved a fair bit. Website generator detection was also improved in this process.
Curated via a publicly available set of domains, the new filter selects for blogs and similar websites. These domains are also given slightly different processing rules on the assumption they are blogs.
Commit: cbbf60
Crawler - Smart Recrawling
The crawler has been enhanced to be able to make use of older crawl data to do conditional fetching via the ETag and Last-Modified headers. This saves bandwidth and processing power for the server.
Code: CrawlDataReference CrawlerRetriever$recrawl
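Re-fetching with the validators from an earlier crawl can be sketched as below. This is a minimal illustration using the JDK's java.net.http types, not the code in CrawlDataReference or CrawlerRetriever; the class and method names are made up for the example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of the Marginalia code base
final class ConditionalFetchSketch {

    /**
     * Builds a GET request carrying the validators stored by an earlier crawl.
     * Either validator may be null if the earlier response lacked that header.
     */
    static HttpRequest conditionalGet(URI uri, String storedEtag, String storedLastModified) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(uri).GET();
        if (storedEtag != null) {
            builder.header("If-None-Match", storedEtag);
        }
        if (storedLastModified != null) {
            builder.header("If-Modified-Since", storedLastModified);
        }
        return builder.build();
    }

    /** 304 Not Modified means the server sent no body and the stored copy is still current. */
    static boolean canReuseStoredCopy(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        HttpRequest request = conditionalGet(
                URI.create("https://www.example.com/"),
                "\"33a64df5\"",
                "Wed, 21 Oct 2015 07:28:00 GMT");
        // Prints the conditional headers that would be sent; no network traffic happens here
        System.out.println(request.headers().map());
    }
}
```

When the server answers 304, the document from the previous crawl can be carried over unchanged, which is where the bandwidth and processing savings come from.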
Operator's GUI
A new user interface has been built for operating Marginalia Search. It was previously operated via command line instructions, direct SQL commands, and the like. This manual operation was both tedious and error prone.
The UI allows basic administrative operations such as dealing with domain complaints, creating API keys, blocking websites; but also has abstractions for triggering crawls and managing the heavier processes in the system.
Code: control-service
Message Queue / Actor Abstraction
To enable automation of the system, several new abstractions have been introduced, including a message queue and an Actor abstraction on top of that. See /log/85-mq_sm_actor_ui for a detailed breakdown of this functionality.
Code: message-queue
Better language identification
Instead of using a naive home-made language identification algorithm, the fasttext library (via jfasttext) is now used. It is much better at language identification and, as the name implies, pretty fast, albeit not quite as fast when you run it via JNI. FastText is a very pleasant classifier library that will likely find additional uses in the project in the future.
Commit: 46d761
Optimizations
There have been a lot of optimizations of the processes; these are just some of the bigger ones.
Converter - Reduced Memory Footprint and Increased Speed
The converter was keeping more items in memory than was necessary due to loading its input data up front by domain, and then iterating over each item. Streaming processing was introduced instead, which reduced the memory footprint so much that several previous memory optimizations such as transparent string compression became unnecessary, which in turn sped up the process a fair bit.
Commits: 507f26
Converter/Loader - Side Loading (experimental)
Some websites, such as Wikipedia or Stack Overflow, are too big to exhaustively crawl in a traditional sense, but they have data dumps available. Experimental support for side-loading Wikipedia was built.
This functionality is very immature.
To permit side loading large domains, the loader was also modified to reduce the amount of data it keeps in memory while loading. This was mainly accomplished by re-arranging the order the loading instructions are written by the converter.
Commits: f11103
Other Changes
Better feature detection and a new approach to advertisement filtering
A bit of effort was spent trying to figure out the modern advertisement ecosystem, and lessons learned were incorporated into the feature detection logic of the search engine.
A major shift in operation is that instead of looking for ads, the search engine now looks for ad-tech tracking. This is much easier to do with the sort of static analysis Marginalia does, and probably what you want anyway. It turns out you can't really run ads with no tracking without exposing yourself to click fraud, and you need to be pretty aggressive with how you do the tracking in a way that's not easy to hide.
Commits: 0f9b90 ...
Bugfix: Loader Stop Bug
There was a fairly trivial error in the loader process where it would stop loading documents from a website if any of their URLs were for some reason not loaded, typically because they were too long. This primarily affected large wordpress-style websites.
if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    return;
}
should have been
if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    continue;
}
Fixing the bug had the unanticipated side-effect of severely decreasing the average quality of the websites in the index, since large wordpress-style websites are often not very good.
To mitigate the quality problem, the ranking algorithm was modified to penalize large websites with kebab-case urls. This was a relatively invasive change that meant routing additional feature bits into the forward index. An upside of this is that the index has more information available for ranking websites, and it's possible to e.g. apply a penalty to sites with adtech or likely affiliate links on them.
Bugfix: Crash on excluding keywords that are not known by the search engine
A rare bug was found that caused an error when excluding documents that contain a keyword where the keyword was not known to the search engine. This was due to a piece of debug logging that wouldn't even have printed, yet still managed to trigger an index out of bounds error.
Commits: cb55c7
Upgraded dependencies -- expected JDK version increased to 18+
Dependencies with security vulnerabilities were upgraded, which introduced a strange interaction with JDK 17, the previous default version, where non-ASCII letters would become garbled when reading crawl data. The exact cause of this is unknown, but a solution that works is to use JDK 18+ instead.
Flyway Migrations
Database migrations are now managed via Flyway. This eliminates manual database upgrades.
Commits: 58556a
23.06.0
New Features
Generator keywords
To provide additional ways of selecting search results, a synthetic keyword has been added for the <meta name="generator" content="..."> tag. This is basically a vanity tag that is used by some HTML generators to advertise themselves, and it's also common for hand-edited HTML to include this tag with a string like "vim" or "myself", as a wink to human readers of the code.
The generator keywords have the form generator:value. For example, to search for websites made with Hugo, you can use generator:hugo. Generator categories have also been added as searchable keywords, for example generator:wiki, generator:forum, generator:docs.
These last keywords have been added as options in the search engine's filters.
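As a rough illustration of how a generator:value keyword can be derived from the meta tag, here is a small sketch. It uses a simplified regular expression rather than a real HTML parser, and the class name, the attribute-order assumption and the "first word only" rule are all assumptions made for the example, not the converter's actual behavior.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Marginalia code base
final class GeneratorKeywordSketch {

    // Simplification: assumes name comes before content and both are double-quoted
    private static final Pattern GENERATOR_META = Pattern.compile(
            "<meta\\s+name=\"generator\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns e.g. "generator:hugo" for a document generated by "Hugo 0.111.3". */
    static Optional<String> generatorKeyword(String html) {
        Matcher matcher = GENERATOR_META.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Assumption: only the first word is kept, so version numbers are dropped
        String firstWord = matcher.group(1).trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
        if (firstWord.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("generator:" + firstWord);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"generator\" content=\"Hugo 0.111.3\"></head></html>";
        System.out.println(generatorKeyword(html).orElse("(no generator)"));
    }
}
```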
Crawler support for sitemaps
To ensure the crawler is able to find all the pages of a website, while wasting minimal time and bandwidth on dead links, the crawler now supports the sitemap protocol. Implementing this support was relatively straightforward as a site map parser was already available within Crawler Commons, a library which is already used for parsing robots.txt files.
The crawler will look for a sitemap directive in robots.txt, and will also look for /sitemap.xml in the root of the server, as well as parse RSS and Atom feeds for links if they are found in the root document of the website.
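The first two discovery steps (the Sitemap directive in robots.txt, then the conventional /sitemap.xml location) can be sketched in plain Java as follows. The real crawler relies on Crawler Commons for parsing; this hand-rolled version only illustrates the lookup order, leaves out the RSS and Atom feed step, and uses a made-up class name.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Marginalia code base
final class SitemapDiscoverySketch {

    private static final String DIRECTIVE = "sitemap:";

    /** Collects sitemap URLs named in robots.txt, then adds /sitemap.xml as a fallback. */
    static List<URI> sitemapCandidates(URI root, String robotsTxt) {
        List<URI> candidates = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively
            if (trimmed.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String value = trimmed.substring(DIRECTIVE.length()).trim();
                if (!value.isEmpty()) {
                    candidates.add(root.resolve(value));
                }
            }
        }
        URI fallback = root.resolve("/sitemap.xml");
        if (!candidates.contains(fallback)) {
            candidates.add(fallback);
        }
        return candidates;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /tmp\nSitemap: https://www.example.com/sitemap-posts.xml\n";
        System.out.println(sitemapCandidates(URI.create("https://www.example.com/"), robotsTxt));
    }
}
```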
Crawler specialization for Lemmy, Discourse and Mediawiki
Some server software for larger websites has a lot of valid links, but also many links that are highly ephemeral (such as a Mastodon feed, or the index of a forum). To help the crawler only index the pages that don't change that often, specialized logic has been introduced for Lemmy, Discourse and Mediawiki.
This also saves processing power for the server, as these applications often have relatively expensive rendering logic.
This is a bit of an experiment. Implementing these specializations is relatively easy, and if it pans out it will be extended to other software.
Improved Site Info
The site information view has been improved to show better placeholder information for unknown domains, including a link to the git repository for submitting websites to be crawled.
Bug Fixes
Pub-date validation
The published date of a page is now validated against the plausible range of the HTML standard it's written in. It's impossible that an HTML5 document was written in 1997, and unlikely that an HTML2 document was written in 2021. 7326ba74
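The plausibility check can be pictured as a year range attached to each HTML standard. The sketch below is only an illustration: the enum, its constants and especially the year ranges are assumptions chosen for the example, not the values used in the commit, and it treats the range as a hard cut-off where the real logic may be softer.

```java
// Hypothetical helper, not part of the Marginalia code base
final class PubDatePlausibilitySketch {

    // The year ranges below are rough assumptions for illustration only
    enum HtmlStandard {
        HTML2(1995, 2000),
        HTML4(1997, 2015),
        HTML5(2008, Integer.MAX_VALUE);

        private final int earliestYear;
        private final int latestYear;

        HtmlStandard(int earliestYear, int latestYear) {
            this.earliestYear = earliestYear;
            this.latestYear = latestYear;
        }

        /** True if a document written in this standard could plausibly be from the given year. */
        boolean isPlausible(int year) {
            return year >= earliestYear && year <= latestYear;
        }
    }

    public static void main(String[] args) {
        System.out.println("HTML5 in 1997 plausible? " + HtmlStandard.HTML5.isPlausible(1997));
        System.out.println("HTML2 in 2021 plausible? " + HtmlStandard.HTML2.isPlausible(2021));
        System.out.println("HTML5 in 2021 plausible? " + HtmlStandard.HTML5.isPlausible(2021));
    }
}
```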
A bug was also discovered in the JSON+LD parser, that caused rare null pointer exceptions. This code is a bit of a hack and could definitely be cleaned up further. 21125206
Optimizations
The converter process, which extracts keywords and meta data from HTML documents, has been optimized to run about 20-25% faster. The crawler has also been modified to spend less effort on domains that have historically turned out not to have a lot of viable pages. As a result, crawling is twice as fast, and processing takes about 24 hours instead of 60+ hours.
The converter optimization was achieved by replacing expensive string operations (like toLower()) with custom logic that doesn't require allocation.
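To give a feel for what "custom logic that doesn't require allocation" can look like, here is a small sketch that folds ASCII case character by character instead of building a lower-cased copy of the string. It is not the converter's code; the class and method names are invented, and it only handles ASCII letters.

```java
// Hypothetical helper, not part of the Marginalia code base
final class AsciiLowerCaseSketch {

    /** Lower-cases a single ASCII letter; every other character is returned unchanged. */
    static char toLowerAscii(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    /** Same value as s.toLowerCase().hashCode() for ASCII input, without creating a new String. */
    static int lowerCaseHash(CharSequence s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + toLowerAscii(s.charAt(i));
        }
        return hash;
    }

    /** Checks for an already lower-cased ASCII prefix, ignoring the case of the text itself. */
    static boolean startsWithIgnoreAsciiCase(CharSequence text, String lowerCasePrefix) {
        if (text.length() < lowerCasePrefix.length()) {
            return false;
        }
        for (int i = 0; i < lowerCasePrefix.length(); i++) {
            if (toLowerAscii(text.charAt(i)) != lowerCasePrefix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseHash("Hello World") == "hello world".hashCode());
        System.out.println(startsWithIgnoreAsciiCase("HTTPS://www.example.com/", "https://"));
    }
}
```

The trade-off is that such helpers are only correct for the character ranges they explicitly handle, so they suit hot paths where the input is known to be mostly ASCII.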
BigString
The BigString is an object for transparent storage of compressed strings in memory that enables the processor to load the full contents of a website into memory at once, and then unpack each document as it's being processed.
BigString was optimized to use fixed buffers. Allocating large arrays in Java is expensive, and the garbage collector has to work hard to clean up the mess. This introduces some lock contention, but it is still significantly faster than the previous version.
Another small speed-up is from using java.lang.String's char[] constructors instead of byte[] constructors, reducing unnecessary back-and-forth charset conversion.
Commit: e4372289
RDRPosTagger
The RDRPosTagger library, which does Part Of Speech tagging, already impressively fast, already aggressively modified to be faster, has been further optimized to be faster still, and its Java object tree design was replaced with flat integer arrays.
This was always an expensive operation, but now it's much faster. The speed-up comes from replacing string comparisons with integer comparisons, as well as re-ordering the data in memory to reduce the cache thrashing that is typically associated with walking a branching tree structure. Part of this is from eliminating Java object headers.
Commit: 186a02
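The general idea of replacing an object tree with flat integer arrays can be sketched as below. This is a generic rule tree in the spirit of ripple-down rules, not the RDRPosTagger data layout: the array names, the single-feature conditions and the integer ids are assumptions for the example.

```java
// Hypothetical structure, not the layout used by RDRPosTagger or Marginalia
final class FlatRuleTreeSketch {
    // Node i is described by entry i of each array; -1 means "no child"
    private final int[] featureIndex;   // which input feature the node inspects
    private final int[] expectedValue;  // integer id the feature must equal for the rule to fire
    private final int[] conclusion;     // tag id concluded when the rule fires
    private final int[] exceptChild;    // next node when the rule fires (a more specific rule)
    private final int[] ifNotChild;     // next node when the rule does not fire

    FlatRuleTreeSketch(int[] featureIndex, int[] expectedValue, int[] conclusion,
                       int[] exceptChild, int[] ifNotChild) {
        this.featureIndex = featureIndex;
        this.expectedValue = expectedValue;
        this.conclusion = conclusion;
        this.exceptChild = exceptChild;
        this.ifNotChild = ifNotChild;
    }

    /** Walks the tree from node 0 using only integer comparisons and array lookups. */
    int classify(int[] features, int defaultTag) {
        int tag = defaultTag;
        int node = 0;
        while (node >= 0) {
            if (features[featureIndex[node]] == expectedValue[node]) {
                tag = conclusion[node];
                node = exceptChild[node];
            } else {
                node = ifNotChild[node];
            }
        }
        return tag;
    }

    public static void main(String[] args) {
        FlatRuleTreeSketch tree = new FlatRuleTreeSketch(
                new int[] {0, 1, 1},
                new int[] {7, 3, 9},
                new int[] {100, 200, 300},
                new int[] {1, -1, -1},
                new int[] {2, -1, -1});
        System.out.println(tree.classify(new int[] {7, 3}, 0));
    }
}
```

Because every node is just a row across a handful of int arrays, there are no per-node object headers and no pointer chasing between heap objects, which is the effect the paragraph above describes.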
23.06.0-rc1
Full notes when the release becomes official, but in short:
- Crawler supports sitemaps
- Specialization for crawling Lemmy, Discourse and Mediawiki
- Processing is heavily optimized
- Specialization for converting Lemmy
- Keywords for searching by meta generator tag (e.g. "generator:xyz")
- New filters for generator category (e.g. wiki, forum)
- Summarization logic is better
- Publication date logic is much improved, including a bugfix for JSON+LD
23.03.2
This is primarily a bugfix release that addresses a metadata corruption issue introduced in the previous release.
New Features
File keywords
To provide more tools for navigating the web, the converter now generates synthetic keywords for documents that link to files on the same server based on their file ending.
If the document contains a link such as
<a href="file.zip">Download</a>
then the document will be tagged with the keyword file:zip as well as file:archive.
The category keywords are file:audio, file:video, file:image, file:document, file:archive.
The converter already generated keywords based on filenames, even if the filename itself doesn't appear in the visible portion of the document.
So in the example above, file.zip would also be a relevant keyword for the document.
Commit: a9f7b4c4
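A minimal sketch of the keyword generation for a single link target follows. The extension-to-category table is a small illustrative sample, and the choice to emit nothing for unknown file endings is an assumption; neither is taken from the converter's actual code.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the Marginalia code base
final class FileKeywordSketch {

    // Illustrative sample only; the project's real extension lists are not reproduced here
    private static final Map<String, String> CATEGORY_BY_EXTENSION = Map.of(
            "zip", "archive",
            "tar", "archive",
            "mp3", "audio",
            "mp4", "video",
            "png", "image",
            "jpg", "image",
            "pdf", "document");

    /** Returns the synthetic keywords for one href, e.g. file:zip, file:archive and file.zip. */
    static Set<String> keywordsForHref(String href) {
        Set<String> keywords = new LinkedHashSet<>();
        // Drop any query string or fragment before looking at the file name
        String path = href.split("[?#]", 2)[0];
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return keywords;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String category = CATEGORY_BY_EXTENSION.get(extension);
        if (category == null) {
            return keywords; // assumption: unknown endings produce no keywords in this sketch
        }
        keywords.add("file:" + extension);
        keywords.add("file:" + category);
        keywords.add(fileName.toLowerCase(Locale.ROOT));
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(keywordsForHref("file.zip"));
    }
}
```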
Bug fixes
Metadata corruption
As a workaround for the limitations of the Java language, document metadata is encoded through explicit bit twiddling. It's basically a manual implementation of a C struct on top of a 64 bit long. This is a great performance improvement and allows for very compact storage of the metadata, but the approach is also notoriously error prone and difficult to do in a safe way. It's basically the programming equivalent of running with scissors.
A bug crept in where parts of the document metadata were garbled. This made it impossible to search by year, and also broke the 'blog' and 'vintage' filters, and may also have deteriorated the search result quality a bit.
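For readers unfamiliar with the technique, packing a few small fields into one long looks roughly like this. The field names, widths and bit positions are entirely hypothetical and do not match the project's actual metadata layout.

```java
// Hypothetical layout, not the one used by Marginalia
final class PackedMetadataSketch {
    // Lowest bits first: 8 bits of flags, 12 bits of year, 4 bits of quality
    private static final int FLAGS_SHIFT = 0;
    private static final long FLAGS_MASK = 0xFFL;
    private static final int YEAR_SHIFT = 8;
    private static final long YEAR_MASK = 0xFFFL;
    private static final int QUALITY_SHIFT = 20;
    private static final long QUALITY_MASK = 0xFL;

    /** Packs the three fields into one long; values wider than their field are truncated. */
    static long encode(int flags, int year, int quality) {
        return ((flags & FLAGS_MASK) << FLAGS_SHIFT)
             | ((year & YEAR_MASK) << YEAR_SHIFT)
             | ((quality & QUALITY_MASK) << QUALITY_SHIFT);
    }

    static int flags(long encoded) {
        return (int) ((encoded >>> FLAGS_SHIFT) & FLAGS_MASK);
    }

    static int year(long encoded) {
        return (int) ((encoded >>> YEAR_SHIFT) & YEAR_MASK);
    }

    static int quality(long encoded) {
        return (int) ((encoded >>> QUALITY_SHIFT) & QUALITY_MASK);
    }

    public static void main(String[] args) {
        long encoded = encode(0b101, 2023, 9);
        System.out.println(flags(encoded) + " " + year(encoded) + " " + quality(encoded));
    }
}
```

Since every argument is a plain int, a call with the arguments in the wrong order compiles without complaint and silently produces a valid-looking long, which is part of what makes this style easy to get wrong.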
The bug wasn't directly caused by the bit twiddling, but by mispopulating the fields in a constructor. It's a fairly trivial error, but it was hard to detect since it was not immediately obvious that the data was corrupted given the limited visibility into the "struct", and reproducing the error in a test proved difficult since the test used the constructor correctly.
Despite testing on a pre-production environment, the bug was not discovered until it was deployed to production. If anything I think it highlights a need for finding better testing strategies. This functionality is fairly smeared out over the code path, it is difficult to isolate, and it's often not immediately apparent when it's broken; all this makes it a continuous struggle to test in a systematic way. In general it's very hard to test this sort of logic, as it requires a large and relatively realistic corpus of data to test against, which makes isolating behavior harder, and the outcome is also never clearly right or wrong, but a matter of this-feels-right or this-seems-wrong.
Commit: 2ab26f37
Publish Date Detection
The order of the heuristics in the publish date detection has been improved to reduce the number of false positives. The support for JSON+LD has also been improved to cover additional cases.
Marginalia uses a long list of different heuristics to try to detect the publish date of a document. It was previously assumed that HTML5's <time[pubdate="pubdate"]> element would generally contain a valid publish date for the current document, but this is not always the case, as some blogging platforms also include <article> tags, including <time> for snippets of other articles. The heuristics have been reordered to try to detect the date from other sources first, and then fall back to the <time> element as one of the less reliable heuristics.
Commit: 619fb8ba
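An ordered fallback chain of this kind can be sketched as a list of heuristics tried in priority order, where the first one to produce a date wins. The class name and the toy heuristics in main are placeholders; the real detection logic is considerably more involved.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper, not part of the Marginalia code base
final class PubDateHeuristicChainSketch {

    /** Tries each heuristic in list order and returns the first date found, if any. */
    static Optional<String> detect(String html, List<Function<String, Optional<String>>> heuristics) {
        for (Function<String, Optional<String>> heuristic : heuristics) {
            Optional<String> date = heuristic.apply(html);
            if (date.isPresent()) {
                return date;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Toy stand-ins: a more trusted source first, the <time> element last
        Function<String, Optional<String>> fromStructuredData =
                html -> html.contains("datePublished") ? Optional.of("date from structured data") : Optional.empty();
        Function<String, Optional<String>> fromTimeElement =
                html -> html.contains("<time") ? Optional.of("date from <time> element") : Optional.empty();

        String html = "<article><time datetime=\"2023-03-01\">March 1</time></article>";
        System.out.println(detect(html, List.of(fromStructuredData, fromTimeElement)).orElse("unknown"));
    }
}
```

With this shape, reordering the heuristics is just a matter of changing the order of the list.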
Response cache for the API service to help misconfigured clients
It's been a long standing problem that some misconfigured API consumers spam the API endpoint with the same query multiple times in a row, very rapidly consuming the rate limit. A cache has been added before the rate limit that will return the same result for the same query within a short time window "for free".
This was also a good opportunity to clean up the API service a bit and improve the test coverage.
Commit: 112f43b3
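Placing a short-lived cache in front of the rate limit can be sketched as follows. This is not the API service's implementation: the class name, the query-string cache key and the simple countdown used as a stand-in for the rate limiter are all assumptions. Time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper, not part of the Marginalia code base
final class CachedApiSketch {

    private static final class Entry {
        final String response;
        final long expiresAtMillis;

        Entry(String response, long expiresAtMillis) {
            this.response = response;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;
    private int remainingQueries; // stand-in for a real rate limiter

    CachedApiSketch(long ttlMillis, int queryBudget) {
        this.ttlMillis = ttlMillis;
        this.remainingQueries = queryBudget;
    }

    /** Returns the response, or null if the caller is rate limited. */
    synchronized String handle(String query, long nowMillis, Supplier<String> search) {
        Entry cached = cache.get(query);
        if (cached != null && nowMillis < cached.expiresAtMillis) {
            return cached.response; // served before the rate limit is consulted
        }
        if (remainingQueries <= 0) {
            return null;
        }
        remainingQueries--;
        String response = search.get();
        // Stale entries are only replaced on the next successful search in this sketch
        cache.put(query, new Entry(response, nowMillis + ttlMillis));
        return response;
    }

    public static void main(String[] args) {
        CachedApiSketch api = new CachedApiSketch(1_000, 1);
        System.out.println(api.handle("plato", 0, () -> "results for plato"));
        System.out.println(api.handle("plato", 500, () -> "results for plato"));
    }
}
```

The second call in main is answered from the cache even though the query budget is already spent, which is the "for free" behavior described above.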
Minor Fixes
- Stopgap fix for a bug in dealing with quote terms containing stop words. 6fae51a8
- Fix data loading bug where domains with some IPv6 addresses would blow up. d42ab191
- Fix bug where some synthetic keywords would fail to return results. df1850bd
Experiments That Never Made It
A wise man once said "it's not R&D if you aren't throwing away half your work". Here are some of the experiments that didn't make it into production.
A synthetic keyword for image filenames that look like they come out of a smartphone
Alongside the file keywords, an experiment was run with generating a synthetic keyword for image filenames that look like they come out of a smartphone, e.g. filenames with the format "IMG_nnnnnn.jpg". While very easy to build, this turned out to be not very useful. The idea was scrapped.
LDA topic modeling
Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm that's often used to extract topics from a corpus of documents. The idea was to use this to offer additional ways of navigating the web. The idea was scrapped because the results were not quite useful. The main work involved porting the LDA implementation in Mallet from a very old style of Java to a modern one. Since this was a fairly large task, it was decided to keep the code around in a branch in case it could be useful for other purposes.
Performance wise it might be plausible to do something with LDA in the future. The branch with the patched Mallet code is available here.