BM25 search engine implemented in Java, with a simple CLI to build an index and search over a collection of documents.
From Wikipedia:
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
This variant of BM25 is the original one, invented by Stephen E. Robertson, Karen Spärck Jones, and others.
- Description
- Requirements
- Installation
- Usage
- Commands
- Limitations
- Formula
- How it works
- Repo structure
- Roadmap
- How to contribute
- Dependencies
- Authors
- Acknowledgements
- Java Development Kit (JDK) v21.0+
- Maven (optional, Maven Wrapper included)
- Git (for cloning the repository)
For now, the project can only be built from source.
```shell
git clone https://github.com/maxmakovskiy/know-your-files.git
cd know-your-files

# Using Maven Wrapper
./mvnw clean package

# or with Maven installed
mvn clean package
```

The build produces a runnable JAR file in the `target` directory.
Installation complete! You are ready to use the search engine.
Create an index file from a collection of documents:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar build \
-I=[Index file name] path/to/documents
Example:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar build \
-I=index.txt path/to/documents
What happens:
- Read all text files from the directory
- Tokenize the documents (split into words, remove stop words, apply stemming)
- Build an inverted index with BM25 scores
- Save the index to the file specified on the command line (via `-I`)
Search through your documents using the index file:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar search \
-K=3 [index file created by build command] [search word or phrase]
Example:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar search \
-K=3 index.txt Which animal is the human best friend?
Sample Output:
Query : "Which animal is the human best friend?"
file : file2.txt => score = 1.26
file : file3.txt => score = 0.46
file : file1.txt => score = 0.00
Currently, only two commands are supported, with the syntax shown below:
Syntax:

```shell
java -jar target/know-your-files-1.0-SNAPSHOT.jar build [OPTIONS] <documents_directory>
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `-I, --index` | Option | No | `index.txt` | Output file name for the index |
| `<documents_directory>` | Positional | Yes | - | Path to folder containing documents to index |
Syntax:

```shell
java -jar target/know-your-files-1.0-SNAPSHOT.jar search [OPTIONS] <index_file> <query>
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `-K, --topK` | Option | No | `10` | Number of top results to return |
| `<index_file>` | Positional | Yes | - | Path to the index file created by `build` |
| `<query>` | Positional | Yes | - | Search query (all words after the index file path) |
- Language Support: Currently, only English is properly supported
- File Format: Only plain text files (`.txt`) are indexed
- Encoding: UTF-8 encoding is assumed for all documents
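For reference, the standard BM25 scoring function looks as follows (shown here with the usual free parameters $k_1$ and $b$; the IDF used in this project differs slightly, as noted below):

$$
\text{score}(q, d) = \sum_{t \in q} \log\frac{N - df_t + 0.5}{df_t + 0.5} \cdot \frac{tf_{td}\,(k_1 + 1)}{tf_{td} + k_1\left(1 - b + b \cdot \frac{L_d}{L_{avg}}\right)}
$$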
- The term on the left is the IDF (inverse document frequency) component.
- The term on the right is essentially the term frequency component.
- In the formula, $t$ stands for a term in document $d$; in this project we call it a token.
- $N$ is the number of documents among which we are searching for $t$.
- $df_t$ is the number of documents containing token $t$, and $tf_{td}$ is the term frequency of token $t$ in document $d$.
- $L_d$ and $L_{avg}$ are the document length and the average document length, respectively.

Note that the IDF component used in this project is slightly modified and belongs to another BM25 variant, known as BM25 (Lucene). A more detailed overview can be found in the paper by Kamphuis et al.
The task of ranking documents/books/notes/etc. with respect to a certain query is very natural for humans; we do it all the time.
Suppose a user has a collection of documents and wants to search it for some information.
For example (not my own example), let's say each document contains just one line:
"a cat is a feline and likes to eat bird", // file1.txt
"a dog is the human's best friend and likes to play", // file2.txt
"a bird is a beautiful animal that can fly" // file3.txt
Documents are processed through:
- Splitting: Text is split into individual words
- Stop word removal: Common words like "a", "is", "the" are removed
- Stemming: Words are reduced to their root form (e.g., "likes" → "like", "beautiful" → "beauti")
Result (corpus):
[
["cat", "feline", "like", "eat", "bird"], // file1.txt
["dog", "human", "best", "friend", "like", "plai"], // file2.txt
["bird", "beauti", "anim", "can", "fly"] // file3.txt
]
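The three processing steps above can be sketched in Java. This sketch is illustrative only: it uses a tiny hard-coded stop list and a crude suffix-stripping stemmer instead of the project's OpenNLP stemmer, so its output will not exactly match the corpus above (e.g. it keeps "feline" instead of producing "felin"):

```java
import java.util.*;
import java.util.stream.*;

// Illustrative tokenizer: split, drop stop words, crude suffix stemming.
public class Tokenizer {
    private static final Set<String> STOP_WORDS =
        Set.of("a", "an", "and", "is", "the", "to", "that", "can", "which");

    // Naive suffix stripping; NOT a real Porter stemmer.
    static String stem(String w) {
        if (w.endsWith("s") && w.length() > 3) w = w.substring(0, w.length() - 1);
        return w;
    }

    public static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .map(Tokenizer::stem)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [cat, feline, like, eat, bird]
        System.out.println(tokenize("a cat is a feline and likes to eat bird"));
    }
}
```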
A vocabulary is built from all unique tokens:
like best plai can fly beauti cat bird friend eat anim dog human felin
A document-term matrix is created where:
- Each row represents a document
- Each column represents a token from the vocabulary
- Each cell contains the BM25 score for that term in that document (0 if term is absent)
In other words, it is a variant of a document-term matrix: instead of raw term frequencies, the cells store BM25 scores.
Score Matrix:
| docIdx | like | best | plai | can | fly | beauti | cat | bird | friend | eat | anim | dog | human | felin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.46 | 0.22 | 0.00 | 0.46 | 0.00 | 0.00 | 0.00 | 0.46 |
| 1 | 0.20 | 0.42 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.42 | 0.00 | 0.00 | 0.42 | 0.42 | 0.00 |
| 2 | 0.00 | 0.00 | 0.00 | 0.46 | 0.46 | 0.46 | 0.00 | 0.22 | 0.00 | 0.00 | 0.46 | 0.00 | 0.00 | 0.00 |
The matrix and metadata are saved to a file for later use.
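Each nonzero cell can be computed along the following lines, using the Lucene-style IDF mentioned earlier. This is only a sketch: the parameter values $k_1 = 1.5$ and $b = 0.75$ are assumptions, and the project's actual defaults (and any extra normalization) may differ, so the numbers produced here will not necessarily reproduce the table above.

```java
// Sketch of computing a single BM25 cell score.
public class Bm25Cell {
    static double score(int tf, int df, int nDocs,
                        double docLen, double avgLen,
                        double k1, double b) {
        // Lucene-style IDF: log(1 + (N - df + 0.5) / (df + 0.5))
        double idf = Math.log(1.0 + (nDocs - df + 0.5) / (df + 0.5));
        // Length-normalized term frequency component
        double tfNorm = (tf * (k1 + 1.0))
                / (tf + k1 * (1.0 - b + b * docLen / avgLen));
        return idf * tfNorm;
    }

    public static void main(String[] args) {
        // "cat": once in file1.txt (5 tokens), in 1 of 3 docs, avg length 16/3
        System.out.printf("%.3f%n", score(1, 1, 3, 5.0, 16.0 / 3, 1.5, 0.75)); // prints 1.009
    }
}
```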
Query : "Which animal is the human best friend?"
After tokenization and stop word removal: ["anim", "human", "best", "friend"]
For each document:
- Look up BM25 scores for query tokens
- Sum the scores for all query tokens present in the document
Example calculation:
file : file2.txt => score = 1.26
file : file3.txt => score = 0.46
file : file1.txt => score = 0.00
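Using the score matrix above, the lookup-and-sum step looks roughly like this (the column values are copied from the table, so this is only a toy reproduction of the real index lookup):

```java
import java.util.*;

// Lookup-and-sum scoring against the (tiny) score matrix above.
public class QueryScore {
    // Columns of the score matrix hit by the query; rows are file1..file3.
    static final Map<String, double[]> COLUMNS = Map.of(
        "anim",   new double[]{0.00, 0.00, 0.46},
        "human",  new double[]{0.00, 0.42, 0.00},
        "best",   new double[]{0.00, 0.42, 0.00},
        "friend", new double[]{0.00, 0.42, 0.00});

    static double[] score(List<String> queryTokens) {
        double[] scores = new double[3];
        for (String t : queryTokens) {
            double[] column = COLUMNS.get(t);
            if (column == null) continue; // token absent from vocabulary
            for (int d = 0; d < scores.length; d++) scores[d] += column[d];
        }
        return scores;
    }

    public static void main(String[] args) {
        double[] s = score(List.of("anim", "human", "best", "friend"));
        for (int d = 0; d < s.length; d++)
            System.out.printf("file%d.txt => score = %.2f%n", d + 1, s[d]);
        // prints:
        // file1.txt => score = 0.00
        // file2.txt => score = 1.26
        // file3.txt => score = 0.46
    }
}
```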
Documents are sorted by score in descending order:
Rank 1: file2.txt => score = 1.26 -> Most relevant
Rank 2: file3.txt => score = 0.46
Rank 3: file1.txt => score = 0.00
ch.heigvd/
├── resources/ // documents for running demo
│ ├── simple/ // little collection of docs
│ ├── complex/ // big collection of docs
├── presentation/ // Presentation related stuff
├── bm25/ // BM25 search engine
│ ├── exceptions/ // custom exceptions
│ │ ├── IndexException.java // exception for Index parsing
│ ├── utils/ // different utils used along the way
│ │ ├── Index.java // index abstraction
│ │ ├── DSparseMatrixLIL.java // sparse matrix with LIL storage
│ │ ├── RankingResult.java // (document index, score) pair in ranking results
│ │ ├── Stopword.java // inessential words
│ ├── BM25.java // BM25 algorithm
├── commands/ // picocli commands
│ ├── Build.java // building index
│ ├── Search.java // searching with index
├── Main.java // entry point
- Unit tests — set up unit tests for `DSparseMatrixLIL`, `Index`, and `BM25`, and run them in CI.
- Recursive indexing — add `--recursive` to include subfolders.
- Text options — support `--stopwords <file>` and `--stem-off`.
- Results UX — show matched terms + a short snippet per hit.
- Index format — migrate from custom text to JSON (store params/metadata).
- Performance — treat tokens as IDs (ints) and switch storage LIL → CSC.
Please read the corresponding wiki page.
- Picocli v4.7.7 for parsing command line arguments
- Apache OpenNLP Tools v2.5.5 for removing morphological affixes from words, leaving only the word stem
Kindly note that the project graphics/charts were generated with the help of ChatGPT.
This project is inspired by and heavily relies on the ideas presented in bm25s.
Given that, it would be fair to say this is a kind of Java adaptation of the project mentioned above, although many details were not preserved for the sake of simplicity.
Another external project used here is the Kaggle dataset Plain text Wikipedia (SimpleEnglish), which contains Wikipedia articles in plain text. We use it as an example collection to index and search, since it contains a lot of information from different domains.
