Skip to content

FeliciaCoding/bm25-search-engine-java

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

173 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Best Matching 25-th iteration

BM25 Search Engine

A fast and lightweight BM25-based search engine implemented in JAVA


Description

BM25 search engine implemented in Java, with a simple CLI to build an index and search over a collection of documents.

From Wikipedia:

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.

This variant of BM25 is the original one, invented by Stephen E. Robertson, Karen Spärck Jones, and others.


Table of Contents

  1. Description
  2. Requirements
  3. Installation
  4. Usage
  5. Commands
  6. Limitations
  7. Formula
  8. How it works
  9. Repo structure
  10. Roadmap
  11. How to contribute
  12. Dependencies
  13. Authors
  14. Acknowledgements

Requirements

  • Java Development Kit (JDK) v21.0+
  • Maven (optional, Maven Wrapper included)
  • Git (for cloning the repository)

Installation

For now, it is available to build from sourse.

Step 1: Clone the repository

git clone https://github.com/maxmakovskiy/know-your-files.git
cd know-your-files

Step 2: Build the project

# Using Maven Wrapper
./mvnw clean package

# or with Maven installed
mvn clean package

The build produces a runnable JAR file in the target directory.

Installation complete You are ready to use the search engine.


Usage

1. Build an index

Create index file from a collection of documents:

$ java -jar target/know-your-files-1.0-SNAPSHOT.jar build \
    -I=[Index file name] path/to/documents

Example :

$ java -jar target/know-your-files-1.0-SNAPSHOT.jar build \
    -I=index.txt path/to/documents

What happens:

  • Read all text files from the directory
  • Tokenize documents (splits into words, removes stop words, applies stemming)
  • Build an inverted index with BM25 scores
  • Save the index to the specified file, depending the giving index file name in the comment line.

2. Search the index

Search through your documents using index file:

$ java -jar target/know-your-files-1.0-SNAPSHOT.jar search \
    -K=3 [index file created by build command] [search word or phrase] 

Example :

$ java -jar target/know-your-files-1.0-SNAPSHOT.jar search \
    -K=3 index.txt Which animal is the human best friend? 

Sample Output:

Query : "Which animal is the human best friend?"

file : file2.txt => score = 1.26
file : file3.txt => score = 0.46
file : file1.txt => score = 0.00

Commands

Currently, there are only 2 supported commands with syntax presented in usage:

build - Create Search Index

Syntax:

java -jar target/know-your-files-1.0-SNAPSHOT.jar build [OPTIONS] <documents_directory>

Parameters:

Parameter Type Required Default Description
-I, --index Option No index.txt Output file name for the index
<documents_directory> Positional Yes - Path to folder containing documents to index

search - Query the Index

Syntax:

java -jar target/know-your-files-1.0-SNAPSHOT.jar search [OPTIONS]  
Parameter Type Required Default Description
-K, --topK Option No 10 Number of top results to return
<index_file> Positional Yes - Path to the index file created by build
<query> Positional Yes - Search query (all words after index file path)

Limitations

  • Language Support: Currently, only English is properly supported
  • File Format: Only plain text files (.txt) are indexed
  • Encoding: UTF-8 encoding is assumed for all documents

Formula


$$ log(\frac{N - df_t + 0.5}{df_t + 0.5} + 1) \cdot \frac{tf_{td}}{ k_1 \cdot (1 - b + b \cdot ( \frac{ L_d }{ L_{avg} } )) + tf_{td} } $$


  • Term on the left is called IDF (inverse document frequency).
  • Term on the right is basically term frequency.
  • In the formula above t stands for term in document d, in our case we call it token.
  • N is number of documents among which we are searching for t.
  • $df_t$ is the number of documents with token t and $tf_{td}$ term frequency of token t in document d.
  • $L_d$ and $L_{avg}$ correspond to document length and average document length respectively.

Although IDF component used in this project is slightly modified and makes a part of the other BM25 variant called Lucene.

More detailed overview could be found in this paper Kamphuis et al.


How it works?

The task of ranking documents/books/notes/etc with the respect to certain query is very natural for humans, we do it all the time.

Phase 1: Building the Index

1. Input Documents

Supposing that user has a collection of documents, and he wants to search for some information.
For example (not my example, see), let's say each document has just one line:

    "a cat is a feline and likes to eat bird",            // file1.txt
    "a dog is the human's best friend and likes to play", // file2.txt
    "a bird is a beautiful animal that can fly"           // file3.txt

2. Tokenization

Documents are processed through:

  • Splitting: Text is split into individual words
  • Stop word removal: Common words like "a", "is", "the" are removed
  • Stemming: Words are reduced to their root form (e.g., "likes" → "like", "beautiful" → "beauti")

Result (corpus):

[
    ["cat", "feline", "like", "eat", "bird"],           // file1.txt
    ["dog", "human", "best", "friend", "like", "plai"], // file2.txt
    ["bird", "beauti", anim", "can", "fly"]             // file3.txt
]

3. Vocabulary Construction

A vocabulary is built from all unique tokens:

like best plai can fly beauti cat bird friend eat anim dog human felin

4. BM25 Score Matrix Construction

A document-term matrix is created where:

  • Each row represents a document
  • Each column represents a token from the vocabulary
  • Each cell contains the BM25 score for that term in that document (0 if term is absent)
    It is some sort of document-term matrix, but instead of the frequency of terms we store BM25 score.

Score Matrix:

docIdx like best plai can fly beauti cat bird friend eat anim dog human felin
0 0.22 0.00 0.00 0.00 0.00 0.00 0.46 0.22 0.00 0.46 0.00 0.00 0.00 0.46
1 0.20 0.42 0.42 0.00 0.00 0.00 0.00 0.00 0.42 0.00 0.00 0.42 0.42 0.00
2 0.00 0.00 0.00 0.46 0.46 0.46 0.00 0.22 0.00 0.00 0.46 0.00 0.00 0.00

5. Index Persistence

The matrix and metadata are saved to a file for later use.


Phase 2: Searching

1. Query Processing

Query : "Which animal is the human best friend?"

After tokenization and stop word removal: ["anim", "human", "best", "friend"]

2. Score Calculation

For each document:

  • Look up BM25 scores for query tokens
  • Sum the scores for all query tokens present in the document

Example calculation:

file : file2.txt => score = 1.26
file : file3.txt => score = 0.46
file : file1.txt => score = 0.00

3. Ranking

Documents are sorted by score in descending order:

Rank 1: file2.txt => score = 1.26  -> Most relevant
Rank 2: file3.txt => score = 0.46
Rank 3: file1.txt => score = 0.00

Repository Structure

ch.heigvd/
├── resources/                     // documents for running demo
│   ├── simple/                    // little collection of docs
│   ├── complex/                   // big collection of docs
├── presentation/                  // Presentation related stuff
├── bm25/                          // BM25 search engine
│   ├── exceptions/                // custom exceptions
│   │   ├── IndexException.java    // exception for Index parsing
│   ├── utils/                     // different utils used along the way
│   │   ├── Index.java             // index abstraction
│   │   ├── DSparseMatrixLIL.java  // sparse matrix with LIL storage
│   │   ├── RankingResult.java     // (document index, score) pair in ranking results
│   │   ├── Stopword.java          // inessential words
│   ├── BM25.java                  // BM25 algorithm
├── commands/                      // picocli commands
│   ├── Build.java                 // building index
│   ├── Search.java                // searching with index
├── Main.java                      // entry point

Roadmap

  1. Unit tests — set up unit tests for DSparseMatrixLIL, Index and BM25. and run them in CI.
  2. Recursive indexing — add --recursive to include subfolders.
  3. Text options — support --stopwords <file> and --stem-off.
  4. Results UX — show matched terms + a short snippet per hit.
  5. Index format — migrate from custom text to JSON (store params/metadata).
  6. Performance — treat tokens as IDs (ints) and switch storage LIL → CSC.

How to contribute

Please read corresponding wiki page


Dependencies


Authors

kindly note the project graphic / charts are generated with the help of ChatGPT.


Acknowledgements

This project is inspired by and heavily relies on the ideas presented in bm25s.
Given that it would be fair to said that it is some kind of adaption of the project mentioned above in Java. Although a lot of things have not been respected for the sake of simplicity.

Another external project that is being used in this project is Kaggle dataset : Plain text Wikipedia (SimpleEnglish). This is a dataset that contains some Wikipedia articles in plain text. We use it as an example to build index and search through, since it contains a lot of information from different domains.

About

BM25-based search engine built in Java with CLI support for indexing and querying text document collections.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Java 100.0%