WikiSearch

A combined TF-IDF and semantic search engine, specialized in Bulgarian, operating over the bg.wikipedia.org domain.

Capabilities

Web crawler

queues processed pages for the indexer
stores crawled pages in persistent LMDB storage
the index itself was built from the XML dump instead¹

NLP processing

using sakelariev/bg_news_lg²
stopword removal
tokenization
lemmatization

Embedding Generation

using the multilingual Alibaba-NLP/gte-multilingual-base³ sentence-transformer
document embeddings stored in FAISS or USearch
vector index can be loaded from persistent storage or from RAM
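
What a vector index computes can be illustrated with a brute-force search in plain Python. This is only a sketch: the three-dimensional vectors are made up, the names `cosine_similarity` and `top_k` are hypothetical, and FAISS or USearch answer the same question with far faster (often approximate) data structures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings"; real ones would come from the sentence-transformer.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], doc_vecs, k=2)
```

Note that `top_k` always returns `k` documents, however weak the match; the remark in the Benchmarks section is about exactly this behaviour.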

TF-IDF Index

stored in a MySQL database in Boyce-Codd normal form
overcomes HDD latency (yes, I still use an HDD!)
makes a distributed deployment possible
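
A normalized layout for such an index could look roughly like the sketch below. It is an illustration only: the table and column names are hypothetical (the real schema lives in sql/create_structure.sql), and SQLite from the Python standard library stands in for MySQL so the example runs without a server.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    doc_id INTEGER PRIMARY KEY,
    title  TEXT NOT NULL UNIQUE,
    length INTEGER NOT NULL          -- token count, needed for BM25
);
CREATE TABLE terms (
    term_id INTEGER PRIMARY KEY,
    term    TEXT NOT NULL UNIQUE
);
CREATE TABLE postings (
    term_id INTEGER NOT NULL REFERENCES terms(term_id),
    doc_id  INTEGER NOT NULL REFERENCES documents(doc_id),
    tf      INTEGER NOT NULL,        -- term frequency in this document
    PRIMARY KEY (term_id, doc_id)
);
""")

def add_document(conn, title, tokens):
    """Insert one pre-tokenized document and its postings; return its doc_id."""
    cur = conn.execute(
        "INSERT INTO documents (title, length) VALUES (?, ?)",
        (title, len(tokens)),
    )
    doc_id = cur.lastrowid
    for term, tf in Counter(tokens).items():
        # SQLite syntax; the MySQL equivalent is INSERT IGNORE.
        conn.execute("INSERT OR IGNORE INTO terms (term) VALUES (?)", (term,))
        term_id = conn.execute(
            "SELECT term_id FROM terms WHERE term = ?", (term,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO postings (term_id, doc_id, tf) VALUES (?, ?, ?)",
            (term_id, doc_id, tf),
        )
    return doc_id

def document_frequency(conn, term):
    """Number of documents that contain the term (df in TF-IDF and BM25)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM postings JOIN terms USING (term_id) WHERE term = ?",
        (term,),
    ).fetchone()
    return row[0]

first_id = add_document(conn, "София", ["софия", "столица", "софия"])
add_document(conn, "Пловдив", ["пловдив", "град"])
add_document(conn, "Градове", ["софия", "град"])
```

Each fact is stored once: the term string only in `terms`, the document length only in `documents`, and the frequency only in `postings`, keyed by the (term, document) pair.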

Positional Index

helps with autocompletion and spellchecking
convenient base for KWIC Snippet generation
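
To show why term positions make KWIC (keyword in context) snippets easy, here is a small in-memory sketch. The function names and the `window` parameter are hypothetical, and the documents are assumed to be tokenized already; the project's own positional index may be organized differently.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for pre-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def kwic(docs, index, term, window=2):
    """Return (doc_id, snippet) pairs with `window` tokens of context per side."""
    snippets = []
    for doc_id, positions in index.get(term, {}).items():
        tokens = docs[doc_id]
        for pos in positions:
            start = max(0, pos - window)
            end = min(len(tokens), pos + window + 1)
            snippets.append((doc_id, " ".join(tokens[start:end])))
    return snippets

docs = {
    1: "софия е столицата на българия".split(),
    2: "пловдив е град в българия".split(),
}
index = build_positional_index(docs)
snippets = kwic(docs, index, "българия", window=2)
```

Because the index already knows where each term occurs, building a snippet is a slice around the stored position rather than a scan of the full text.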

Query Autocompletion

fast and space-efficient storage using Directed Acyclic Word Graphs (DAWGs)
two DAWGs - one for single-word completion and one for next-word completion
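
The two lookups can be sketched with a sorted list and binary search. To be clear, this is not a DAWG: a DAWG additionally shares common prefixes and suffixes to save space, while the sorted list only mimics the prefix query. The class name `PrefixIndex`, the helper `next_words`, and the idea of storing next-word candidates as "word next" keys are assumptions made for illustration.

```python
from bisect import bisect_left

class PrefixIndex:
    """Sorted-list stand-in for a DAWG: returns all keys with a given prefix."""

    def __init__(self, keys):
        self._keys = sorted(set(keys))

    def keys_with_prefix(self, prefix, limit=5):
        # All keys sharing a prefix are contiguous in sorted order,
        # so binary search finds the first one and a scan collects the rest.
        matches = []
        for i in range(bisect_left(self._keys, prefix), len(self._keys)):
            key = self._keys[i]
            if not key.startswith(prefix) or len(matches) == limit:
                break
            matches.append(key)
        return matches

def next_words(bigrams, word, limit=5):
    """Complete the word that follows `word`, using 'word next' keys."""
    prefix = word + " "
    return [key[len(prefix):] for key in bigrams.keys_with_prefix(prefix, limit)]

# One index per completion mode, mirroring the two DAWGs.
words = PrefixIndex(["софия", "софийски", "столица", "страна"])
bigrams = PrefixIndex(["софия е", "софийски университет", "столица на"])

word_suggestions = words.keys_with_prefix("соф")
following = next_words(bigrams, "софия")
```

Results come back in code-point order here; a production completer would more likely rank candidates by frequency.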

Spellchecking

using hunspell
supports default Bulgarian dictionary
supports custom dictionary from corpora

Searching and Ranking

semantic search ranks by cosine similarity
keyword search uses TF-IDF and BM25

$$ \text{BM25}(d, Q) = \sum_{t \in Q} \left[ \ln\left( \frac{N - \text{df}_t + 0.5}{\text{df}_t + 0.5} \right) \cdot \frac{\text{tf}_{t,d} \cdot (k_1 + 1)}{\text{tf}_{t,d} + k_1 \cdot \left( 1 - b + b \cdot \frac{|d|}{\text{avgdl}} \right)} \right] $$
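
The BM25 formula translates almost term by term into code. The sketch below assumes pre-tokenized documents held in memory; the function name `bm25_scores` is hypothetical, and the values k1 = 1.2 and b = 0.75 are common textbook defaults, not necessarily the ones configured in this project. Here N is the number of documents, df the number of documents containing a term, tf the term's count in the document, and avgdl the average document length in tokens.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (doc_id -> token list) against the query."""
    n_docs = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores

docs = {
    "a": ["софия", "столица"],
    "b": ["пловдив", "град"],
    "c": ["варна", "море", "град"],
}
scores = bm25_scores(["софия"], docs)
```

With this form of the IDF factor, a term that occurs in more than half of the documents gets a negative weight, which is one reason stopword removal matters before scoring.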

Project Structure

WikiSearch/
├── api.py
├── config.toml
├── uv.lock
├── LICENSE
├── pyproject.toml
├── README.md
├── docs/
│   ├── presentation/
│   └── text/
├── requirements.txt
├── scripts/
│   ├── construct_next_word_dawg.py
│   ├── construct_word_dawg.py
│   ├── initial_crawling.py
│   ├── initial_index_construction.py
│   └── evaluate.py
├── sql/
│   ├── create_db.sql
│   ├── create_structure.sql
│   └── delete.sql
├── ui/
│   ├── gui.py
│   ├── static/
│   └── templates/
└── wikisearch/
    ├── __init__.py
    ├── autocomplete/
    ├── crawler/
    ├── db/
    ├── document/
    ├── eval/
    ├── index/
    ├── nlp/
    ├── spell/
    └── summary/

Benchmarks

Inverted Index

Precision 0.870 ± 0.034
Recall 0.893 ± 0.029
F1 0.877 ± 0.030

Semantic Index

Precision 0.130 ± 0.021
Recall 0.419 ± 0.055
F1 0.142 ± 0.019

Remark: Comparing vector index results to TF-IDF index results is unfair. A TF-IDF index returns only the documents that contain the keywords, whereas a vector index returns as many documents as the user asks for, so every requested result beyond the truly relevant ones lowers its precision.
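
The effect described in the remark can be reproduced with a few lines of Python. The numbers below are invented for illustration and are not taken from the project's evaluation; `precision_recall_f1` is a hypothetical helper, not the code in scripts/evaluate.py.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

relevant = {1, 2}

# Keyword search returns only documents containing the query terms.
keyword_metrics = precision_recall_f1([1, 2, 3], relevant)

# Semantic search is asked for its top 10, so it returns 10 documents
# even though only 2 relevant ones exist.
semantic_metrics = precision_recall_f1(range(1, 11), relevant)
```

Both result lists contain every relevant document, yet the fixed-size list scores a precision of 0.2 against roughly 0.67 for the keyword list.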

Possible Future Improvements

crawl all of bg.wikipedia.org
evaluate system using human evaluators
combine the semantic and TF-IDF search results into one algorithm (a machine learning task)
dynamic summaries (KWIC Snippets)
more fine-grained autocompletion strategy
support multiple languages by running language detection on a document and a language-specific tokenizer
distributed database
performance improvements

DanielHalachev/WikiSearch
