BlackLab blog

Jan Niestadt edited this page Sep 22, 2016 · 11 revisions

PLEASE NOTE: this wiki has been superseded by the official BlackLab site.

---

A technical look into the BlackLab corpus retrieval engine. Or: the exciting search for very specific needles in large, annotated haystacks of text.

3 - What BlackLab brings to the Lucene table

Lucene is a great text search library: it's fast, has a lot of useful features and is well-documented. Of course, it doesn't do everything we want out of the box, or we wouldn't have developed BlackLab. Fortunately, Lucene is free software and is relatively easy to extend with your own functionality. So, this time around, let's take a look at some of the ways in which we've added functionality to Lucene for BlackLab. That means we're going to get a little more technical.

Word Properties

An important feature we wanted to provide with BlackLab was the ability to store and search for multiple properties (also called tags or annotations) associated with words. A common example of this would be to store the headword and part-of-speech of each word, but it could be any extra information about a word, like how high it scores on your "sumptuous words to casually use in a blog post" list (now I can cross 'sumptuous' off mine).

Lucene can index multiple values for a word position in a single field, but you would have to differentiate the types of values with a prefix or some such. Storing type information in string prefixes makes me feel dirty and unworthy as a programmer, so I decided instead to use a separate field per property and make sure the position information on each field is synchronized so we can combine them in queries. It takes a little bit of convincing to stop Lucene complaining that you're mixing fields in this way, but that wasn't too hard. We call the aggregate of fields containing properties about your words a "complex field", and you can query such a complex field directly in BlackLab, for example using Corpus Query Language:

[hw = 'feed' & pos = 'NOU']

Regular Expression Queries

The second key feature on our wish list was the ability to search for queries with a regular expression structure at the word level. So for example, we want to search for an article ("a", "an" or "the"), followed by two or more adjectives, followed by a form of the headword "cat". Examples of hits for that query would be "the sleepy long-haired white cats" and "an inquisitive black cat". We could denote this query in Corpus Query Language as follows:

  [pos = "ART"] [pos = "ADJ"]{2,} [hw = "cat"]

The first thing to realize is that we need SpanQueries for this purpose. Regular Queries can't really deal with positional information aside from PhraseQuery, which is too limited for us. But while many SpanQuery classes are included in Lucene, we were missing a few. For this functionality, we needed a "repetition query" (for parts of the query that may occur multiple times in direct succession) and a true "sequence query" (to find a number of subquery hits directly adjacent to one another; SpanNearQuery seems to come close, but turns out not to quite fit our purposes).

So, I implemented SpanQueryRepetition and SpanQuerySequence for this purpose. This involved some sorting and uniquifying (another one off the list!) trickery to make sure we generated all possible sequences, didn't include doubles, and didn't violate Lucene's convention that SpanQuery results are ordered by position.

Match All Operator

A third interesting challenge was how to support "match all" operators. For example:

[]{2} "apple"

This searches for apple preceded by two arbitrary words. The match all operator can be used as a prefix, infix or postfix operator, and we wanted generic support for all these situations. This was solved by creating a class SpanQueryExpansion, which expands hits from another query by a number of tokens to the left or right. Again, a bit of craftiness was required to maintain proper sort order and prevent doubles.

There's plenty of stuff we haven't covered yet, such as how BlackLab enables fast sorting and grouping on matched text or the words around it, but that's a subject for another post. I'm also working on several other improvements, like a generic way to index and search for XML tags in the text, so you can search for words at the beginning of a sentence, or search for tagged named entities (like the names of people or places). The basic functionality for this is in BlackLab now, and should mature over the next few months.

Hopefully, now you've got a better idea of how BlackLab builds on the many features Lucene already provides to enable new ways of searching for patterns in annotated texts. As always, questions, suggestions and code contributions are more than welcome. See the README for contact information.

2 - BlackLab's niche

Projects are like animals. Some are predatory, and consume all available resources. Others are more like gentle herbivores: slowly but surely getting the job done.

As the name suggests, BlackLab is a friendly, inquisitive companion who is great at search and retrieval. Unlike it's namesake, however, it does not stick its wet nose in your face at 5:30 AM. Although I guess we could add that feature if enough people request it.

Evolution

Like animals, projects evolve from what was around before, they find a niche for themselves, and they either thrive or go extinct.

BlackLab evolved out of our need for a better way to search through large bodies of text. For the IMPACT project, we needed to show the advances made in digitizing and searching in historical texts. Once we had the basics, it made sense to develop it further and use it to build interfaces on our other corpora.

Wasn't BlackLab's niche already occupied by other projects? No, we don't think so. Of course there are other corpus retrieval engines. Some of those we can't afford, some don't really fit with our chosen technologies, some aren't as full-featured as we needed. And importantly, most aren't based on Apache Lucene.

Symbiosis

BlackLab lives in symbiosis with Lucene: it relies on it and adds a lot of useful functionality to it. Building on top of an active open source project like Lucene is great, because it saves us a lot of work, and ensures that progress for Lucene equals progress for BlackLab.

For example, recently Lucene 4.0 came out, and it appears to speed up several important operations: regular expression matching and fuzzy search. So when we update BlackLab to work with Lucene 4.0, many searches will likely be faster.

Ecosystem

That, for me, defines BlackLab's niche in the ecosystem of text search. It's fully written in Java and has a mature, actively developed open source project as its basis. It's a great position to start from and allows us to add advanced features to BlackLab without ever having to worry about the low-level side of things.

We believe this niche needed filling, so we went ahead and gave it a shot. We invite you to try it out. See for yourself if this is just a short-lived evolutionary oddity, or a viable new species. If you have any questions or comments, let us know. Together we can make BlackLab thrive!

Older posts