BM25 search engine implemented in Java, with a simple CLI to build an index and search over a collection of documents.
From Wikipedia:
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
This variant of BM25 is the original one, invented by Stephen E. Robertson, Karen Spärck Jones, and others.
- Description
- Requirements
- Installation
- Usage
- Commands
- Limitations
- Formula
- How it works
- Repo structure
- Roadmap
- How to contribute
- Dependencies
- Authors
- Acknowledgements
- Java Development Kit (JDK) v21.0+
- Maven (optional, Maven Wrapper included)
- Git (for cloning the repository)
For now, the project can only be built from source.
```shell
git clone https://github.com/maxmakovskiy/know-your-files.git
cd know-your-files

# Using Maven Wrapper
./mvnw clean package

# or with Maven installed
mvn clean package
```

The build produces a runnable JAR file in the `target` directory.
Installation complete! You are ready to use the search engine.
Create an index file from a collection of documents:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar build \
-I=[Index file name] path/to/documents
Example:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar build \
-I=index.txt path/to/documents
What happens:
- Read all text files from the directory
- Tokenize the documents (split into words, remove stop words, apply stemming)
- Build an inverted index with BM25 scores
- Save the index to the file specified on the command line (via `-I`)
Search through your documents using the index file:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar search \
-K=3 [index file created by build command] [search word or phrase]
Example:
$ java -jar target/know-your-files-1.0-SNAPSHOT.jar search \
-K=3 index.txt Which animal is the human best friend?
Sample Output:
Query : "Which animal is the human best friend?"
file : file2.txt => score = 1.26
file : file3.txt => score = 0.46
file : file1.txt => score = 0.00
Currently, only two commands are supported, with the syntax shown below:
Syntax:

```shell
java -jar target/know-your-files-1.0-SNAPSHOT.jar build [OPTIONS] <documents_directory>
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `-I, --index` | Option | No | `index.txt` | Output file name for the index |
| `<documents_directory>` | Positional | Yes | - | Path to folder containing documents to index |
Syntax:

```shell
java -jar target/know-your-files-1.0-SNAPSHOT.jar search [OPTIONS] <index_file> <query>
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `-K, --topK` | Option | No | `10` | Number of top results to return |
| `<index_file>` | Positional | Yes | - | Path to the index file created by `build` |
| `<query>` | Positional | Yes | - | Search query (all words after the index file path) |
- Language Support: Currently, only English is properly supported
- File Format: Only plain text files (`.txt`) are indexed
- Encoding: UTF-8 encoding is assumed for all documents
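For reference, the standard BM25 scoring function looks as follows (shown here with the usual free parameters $k_1$ and $b$; the IDF used in this project differs slightly, as noted below):

$$
\text{score}(q, d) = \sum_{t \in q} \log\frac{N - df_t + 0.5}{df_t + 0.5} \cdot \frac{tf_{td}\,(k_1 + 1)}{tf_{td} + k_1\left(1 - b + b \cdot \frac{L_d}{L_{avg}}\right)}
$$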
- The term on the left is the IDF (inverse document frequency) component.
- The term on the right is essentially the term frequency component.
- In the formula, $t$ stands for a term in document $d$; in this project we call it a token.
- $N$ is the number of documents among which we are searching for $t$.
- $df_t$ is the number of documents containing token $t$, and $tf_{td}$ is the term frequency of token $t$ in document $d$.
- $L_d$ and $L_{avg}$ are the document length and the average document length, respectively.

Note that the IDF component used in this project is slightly modified and belongs to another BM25 variant, known as BM25 (Lucene). A more detailed overview can be found in the paper by Kamphuis et al.
The task of ranking documents/books/notes/etc. with respect to a certain query is very natural for humans; we do it all the time.
Suppose a user has a collection of documents and wants to search it for some information.
For example (not my own example), let's say each document contains just one line:
"a cat is a feline and likes to eat bird", // file1.txt
"a dog is the human's best friend and likes to play", // file2.txt
"a bird is a beautiful animal that can fly" // file3.txt
Documents are processed through:
- Splitting: Text is split into individual words
- Stop word removal: Common words like "a", "is", "the" are removed
- Stemming: Words are reduced to their root form (e.g., "likes" → "like", "beautiful" → "beauti")
Result (corpus):
[
["cat", "feline", "like", "eat", "bird"], // file1.txt
["dog", "human", "best", "friend", "like", "plai"], // file2.txt
["bird", "beauti", "anim", "can", "fly"] // file3.txt
]
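The three processing steps above can be sketched in Java. This sketch is illustrative only: it uses a tiny hard-coded stop list and a crude suffix-stripping stemmer instead of the project's OpenNLP stemmer, so its output will not exactly match the corpus above (e.g. it keeps "feline" instead of producing "felin"):

```java
import java.util.*;
import java.util.stream.*;

// Illustrative tokenizer: split, drop stop words, crude suffix stemming.
public class Tokenizer {
    private static final Set<String> STOP_WORDS =
        Set.of("a", "an", "and", "is", "the", "to", "that", "can", "which");

    // Naive suffix stripping; NOT a real Porter stemmer.
    static String stem(String w) {
        if (w.endsWith("s") && w.length() > 3) w = w.substring(0, w.length() - 1);
        return w;
    }

    public static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .map(Tokenizer::stem)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [cat, feline, like, eat, bird]
        System.out.println(tokenize("a cat is a feline and likes to eat bird"));
    }
}
```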
A vocabulary is built from all unique tokens:
like best plai can fly beauti cat bird friend eat anim dog human felin
A document-term matrix is created where:
- Each row represents a document
- Each column represents a token from the vocabulary
- Each cell contains the BM25 score for that term in that document (0 if term is absent)
In other words, it is a variant of a document-term matrix: instead of raw term frequencies, the cells store BM25 scores.
Score Matrix:
| docIdx | like | best | plai | can | fly | beauti | cat | bird | friend | eat | anim | dog | human | felin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.46 | 0.22 | 0.00 | 0.46 | 0.00 | 0.00 | 0.00 | 0.46 |
| 1 | 0.20 | 0.42 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.42 | 0.00 | 0.00 | 0.42 | 0.42 | 0.00 |
| 2 | 0.00 | 0.00 | 0.00 | 0.46 | 0.46 | 0.46 | 0.00 | 0.22 | 0.00 | 0.00 | 0.46 | 0.00 | 0.00 | 0.00 |
The matrix and metadata are saved to a file for later use.
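Each nonzero cell can be computed along the following lines, using the Lucene-style IDF mentioned earlier. This is only a sketch: the parameter values $k_1 = 1.5$ and $b = 0.75$ are assumptions, and the project's actual defaults (and any extra normalization) may differ, so the numbers produced here will not necessarily reproduce the table above.

```java
// Sketch of computing a single BM25 cell score.
public class Bm25Cell {
    static double score(int tf, int df, int nDocs,
                        double docLen, double avgLen,
                        double k1, double b) {
        // Lucene-style IDF: log(1 + (N - df + 0.5) / (df + 0.5))
        double idf = Math.log(1.0 + (nDocs - df + 0.5) / (df + 0.5));
        // Length-normalized term frequency component
        double tfNorm = (tf * (k1 + 1.0))
                / (tf + k1 * (1.0 - b + b * docLen / avgLen));
        return idf * tfNorm;
    }

    public static void main(String[] args) {
        // "cat": once in file1.txt (5 tokens), in 1 of 3 docs, avg length 16/3
        System.out.printf("%.3f%n", score(1, 1, 3, 5.0, 16.0 / 3, 1.5, 0.75)); // prints 1.009
    }
}
```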
Query : "Which animal is the human best friend?"
After tokenization and stop word removal: ["anim", "human", "best", "friend"]
For each document:
- Look up BM25 scores for query tokens
- Sum the scores for all query tokens present in the document
Example calculation:
file : file2.txt => score = 1.26
file : file3.txt => score = 0.46
file : file1.txt => score = 0.00
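Using the score matrix above, the lookup-and-sum step looks roughly like this (the column values are copied from the table, so this is only a toy reproduction of the real index lookup):

```java
import java.util.*;

// Lookup-and-sum scoring against the (tiny) score matrix above.
public class QueryScore {
    // Columns of the score matrix hit by the query; rows are file1..file3.
    static final Map<String, double[]> COLUMNS = Map.of(
        "anim",   new double[]{0.00, 0.00, 0.46},
        "human",  new double[]{0.00, 0.42, 0.00},
        "best",   new double[]{0.00, 0.42, 0.00},
        "friend", new double[]{0.00, 0.42, 0.00});

    static double[] score(List<String> queryTokens) {
        double[] scores = new double[3];
        for (String t : queryTokens) {
            double[] column = COLUMNS.get(t);
            if (column == null) continue; // token absent from vocabulary
            for (int d = 0; d < scores.length; d++) scores[d] += column[d];
        }
        return scores;
    }

    public static void main(String[] args) {
        double[] s = score(List.of("anim", "human", "best", "friend"));
        for (int d = 0; d < s.length; d++)
            System.out.printf("file%d.txt => score = %.2f%n", d + 1, s[d]);
        // prints:
        // file1.txt => score = 0.00
        // file2.txt => score = 1.26
        // file3.txt => score = 0.46
    }
}
```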
Documents are sorted by score in descending order:
Rank 1: file2.txt => score = 1.26 -> Most relevant
Rank 2: file3.txt => score = 0.46
Rank 3: file1.txt => score = 0.00
ch.heigvd/
├── resources/ // documents for running demo
│ ├── simple/ // little collection of docs
│ ├── complex/ // big collection of docs
├── presentation/ // Presentation related stuff
├── bm25/ // BM25 search engine
│ ├── exceptions/ // custom exceptions
│ │ ├── IndexException.java // exception for Index parsing
│ ├── utils/ // different utils used along the way
│ │ ├── Index.java // index abstraction
│ │ ├── DSparseMatrixLIL.java // sparse matrix with LIL storage
│ │ ├── RankingResult.java // (document index, score) pair in ranking results
│ │ ├── Stopword.java // inessential words
│ ├── BM25.java // BM25 algorithm
├── commands/ // picocli commands
│ ├── Build.java // building index
│ ├── Search.java // searching with index
├── Main.java // entry point
- Unit tests — set up unit tests for `DSparseMatrixLIL`, `Index`, and `BM25`, and run them in CI.
- Recursive indexing — add `--recursive` to include subfolders.
- Text options — support `--stopwords <file>` and `--stem-off`.
- Results UX — show matched terms + a short snippet per hit.
- Index format — migrate from custom text to JSON (store params/metadata).
- Performance — treat tokens as IDs (ints) and switch storage LIL → CSC.
Please read the corresponding wiki page.
- Picocli v4.7.7 for parsing command line arguments
- Apache OpenNLP Tools v2.5.5 for removing morphological affixes from words, leaving only the word stem
Kindly note that the project graphics/charts were generated with the help of ChatGPT.
This project is inspired by and heavily relies on the ideas presented in bm25s.
Given that, it would be fair to say this is a kind of Java adaptation of the project mentioned above, although many details were not preserved for the sake of simplicity.
Another external project used here is the Kaggle dataset Plain text Wikipedia (SimpleEnglish), which contains Wikipedia articles in plain text. We use it as an example collection to index and search, since it contains a lot of information from different domains.
