A combined TF-IDF and semantic search engine, specialized in Bulgarian, operating over the bg.wikipedia.org domain.
- can queue processed pages for the indexer
- can store pages in persistent `LMDB` storage
- for index construction, an XML dump was used instead
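A minimal sketch of how crawled pages could be persisted with the `lmdb` Python binding; the key/value layout and file name here are assumptions, not necessarily the project's actual storage scheme.

```python
import lmdb

# Hypothetical layout: URL -> page text. The real storage scheme may differ.
env = lmdb.open("crawl.lmdb", map_size=2**30)  # 1 GiB memory map

def store_page(url: str, text: str) -> None:
    """Persist one processed page so the indexer can pick it up later."""
    with env.begin(write=True) as txn:
        txn.put(url.encode("utf-8"), text.encode("utf-8"))

def load_page(url: str) -> str | None:
    """Read a page back from the LMDB environment."""
    with env.begin() as txn:
        raw = txn.get(url.encode("utf-8"))
        return raw.decode("utf-8") if raw is not None else None

store_page("https://bg.wikipedia.org/wiki/София", "<html>…</html>")
print(load_page("https://bg.wikipedia.org/wiki/София"))
```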
- using `sakelariev/bg_news_lg` for:
  - stopword removal
  - tokenization
  - lemmatization
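A sketch of this preprocessing step with spaCy, assuming the `sakelariev/bg_news_lg` model is installed as the `bg_news_lg` package (the exact install method and pipeline settings may differ):

```python
import spacy

# Assumes the Bulgarian model is importable as "bg_news_lg".
nlp = spacy.load("bg_news_lg")

def preprocess(text: str) -> list[str]:
    """Tokenize, drop stopwords/punctuation/whitespace, and return lowercased lemmas."""
    doc = nlp(text)
    return [
        tok.lemma_.lower()
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]

print(preprocess("Българската Уикипедия е свободна онлайн енциклопедия."))
```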
- using the multilingual `Alibaba-NLP/gte-multilingual-base` sentence-transformer
- document embeddings stored in `FAISS` or `USearch`
- vector index can be loaded from persistent storage or held in RAM
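A sketch of building, querying, and persisting the vector index with sentence-transformers and FAISS; the index type, normalization, and file name are assumptions rather than the project's exact configuration.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# gte-multilingual-base ships custom code, hence trust_remote_code=True.
model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

docs = ["Първи документ за индексиране.", "Втори документ за индексиране."]
emb = model.encode(docs, normalize_embeddings=True).astype(np.float32)

# Inner product over normalized vectors == cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

query = model.encode(["примерна заявка"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 2)
print(ids, scores)

faiss.write_index(index, "vectors.faiss")  # persist to disk ...
index = faiss.read_index("vectors.faiss")  # ... or load it back into RAM
```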
- stored in a `MySQL` database in Boyce-Codd normal form
- overcomes HDD latency (yes, I still use an HDD!)
- allows distributed capabilities
- helps with autocompletion and spellchecking
- convenient base for KWIC Snippet generation
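For illustration, a lookup against a hypothetical normalized schema (the table and column names below are invented; the real structure lives in sql/create_structure.sql), showing how term positions kept in MySQL can feed KWIC snippet generation:

```python
import mysql.connector

# Connection parameters are placeholders.
conn = mysql.connector.connect(
    host="localhost", user="wikisearch", password="***", database="wikisearch"
)
cur = conn.cursor()

# Hypothetical tables: terms(term_id, term) and postings(term_id, doc_id, position).
cur.execute(
    """
    SELECT p.position
    FROM postings AS p
    JOIN terms AS t ON t.term_id = p.term_id
    WHERE t.term = %s AND p.doc_id = %s
    ORDER BY p.position
    """,
    ("софия", 42),
)
positions = [row[0] for row in cur.fetchall()]
print(positions)  # centre a keyword-in-context window around these offsets
```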
- fast and space-efficient storage using Directed Acyclic Word Graphs (DAWGs)
- two DAWGs - one for single-word completion and one for next-word completion
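A sketch of the two-DAWG completion scheme using the pytries `DAWG` package; whether this exact library and key layout are used is an assumption.

```python
import dawg

# Single-word completion: the DAWG holds the corpus vocabulary.
vocabulary = ["софия", "софийски", "софтуер", "музей"]
word_dawg = dawg.CompletionDAWG(vocabulary)

# Next-word completion: keys are "previous next" bigrams joined with a space.
bigrams = ["народен театър", "народен музей", "народен парк"]
next_word_dawg = dawg.CompletionDAWG(bigrams)

print(word_dawg.keys("соф"))            # completions of a word prefix
print(next_word_dawg.keys("народен "))  # candidate next words

word_dawg.save("word.dawg")  # compact on-disk representation
```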
- using `hunspell`:
  - supports the default Bulgarian dictionary
  - supports a custom dictionary from the corpora
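A sketch of spellchecking with the pyhunspell binding; the choice of binding and the dictionary paths are assumptions (a corpus-derived .dic/.aff pair would be loaded the same way).

```python
import hunspell

# System paths for the default Bulgarian dictionary are an assumption.
speller = hunspell.HunSpell(
    "/usr/share/hunspell/bg_BG.dic", "/usr/share/hunspell/bg_BG.aff"
)

word = "енциклопдия"  # deliberately misspelled
if not speller.spell(word):
    print(speller.suggest(word))  # candidate corrections
```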
- semantic search ranks by cosine similarity
- keyword search uses TF-IDF and BM25:

$$
\text{BM25}(d, Q) = \sum_{t \in Q} \left[ \ln\!\left( \frac{N - \text{df}_t + 0.5}{\text{df}_t + 0.5} \right) \cdot \frac{\text{tf}_{t,d} \cdot (k_1 + 1)}{\text{tf}_{t,d} + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} \right]
$$
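A direct transcription of the BM25 formula above into a small self-contained scorer; k1 and b use common textbook defaults, not necessarily the values configured in the project.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.5, b=0.75):
    """Score one document against a query, following the formula above."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue  # terms absent from the document contribute nothing
        idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

# Tiny corpus just to show the call shape.
docs = [["софия", "е", "столица"], ["пловдив", "е", "град"]]
df = Counter(t for d in docs for t in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
print(bm25(["софия", "столица"], docs[0], df, len(docs), avgdl))
```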
WikiSearch/
├── api.py
├── config.toml
├── uv.lock
├── LICENSE
├── pyproject.toml
├── README.md
├── docs/
│   ├── presentation/
│   └── text/
├── requirements.txt
├── scripts/
│   ├── construct_next_word_dawg.py
│   ├── construct_word_dawg.py
│   ├── initial_crawling.py
│   ├── initial_index_construction.py
│   └── evaluate.py
├── sql/
│   ├── create_db.sql
│   ├── create_structure.sql
│   └── delete.sql
├── ui/
│   ├── gui.py
│   ├── static/
│   └── templates/
└── wikisearch/
    ├── __init__.py
    ├── autocomplete/
    ├── crawler/
    ├── db/
    ├── document/
    ├── eval/
    ├── index/
    ├── nlp/
    ├── spell/
    └── summary/
TF-IDF index:

- Precision 0.870 ± 0.034
- Recall 0.893 ± 0.029
- F1 0.877 ± 0.030

Vector index:

- Precision 0.130 ± 0.021
- Recall 0.419 ± 0.055
- F1 0.142 ± 0.019
Remark: Comparing vector index results to TF-IDF index results is unfair: a TF-IDF index returns only the documents that contain the query keywords, while a vector index returns as many documents as the user requests.
- crawl all of bg.wikipedia.org
- evaluate the system using human evaluators
- combine the semantic and TF-IDF search results into one algorithm (a machine learning task)
- dynamic summaries (KWIC Snippets)
- more fine-grained autocompletion strategy
- support multiple languages by running language detection on each document and applying a language-specific tokenizer
- distributed database
- performance improvements