Standalone, swappable NER → candidate generation → rerank → disambiguation pipeline. Uses file-based storage (JSONL for KB and outputs) and optional caching in .ner_cache/.
Requirements: Python 3.10-3.12 (Python 3.13 is NOT supported due to vLLM), CUDA 12.x for GPU support
python3.10 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Launch the interactive web interface:
python app.py
Open http://localhost:7860 and configure the pipeline through the UI. See docs/WEB_APP.md for details.
If you encounter issues, see docs/TROUBLESHOOTING.md for solutions to common problems including:
- PyTorch CUDA mismatch
- vLLM installation failures
- GPU memory issues
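Before consulting the troubleshooting guide, it can help to confirm that the interpreter falls in the supported range stated in the requirements (Python 3.10 to 3.12). The following is a minimal sketch of such a check; the helper name `is_supported_python` is hypothetical and not part of this project.

```python
import sys

# Supported range taken from the requirements above: Python 3.10 through 3.12.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is within the supported range.

    Hypothetical helper for illustration only; not part of the lela package.
    """
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    if is_supported_python():
        print(f"Python {current} is in the supported range.")
    else:
        print(f"Python {current} is outside the supported range (3.10-3.12).")
```

Running this inside the activated virtual environment shows which interpreter the environment actually uses, which may differ from the system default.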
- Prepare a JSONL knowledge base with fields: id, title, description (plus optional metadata).
- Create a config file, e.g. config.json:
{
"loader": {"name": "pdf", "params": {}},
"ner": {"name": "spacy", "params": {"model": "en_core_web_sm"}},
"candidate_generator": {"name": "bm25", "params": {}},
"reranker": {"name": "none", "params": {}},
"disambiguator": {"name": "popularity", "params": {}},
"knowledge_base": {"name": "jsonl", "params": {"path": "kb.jsonl"}},
"cache_dir": ".ner_cache",
"batch_size": 1
}
- Run:
python -m lela.cli --config config.json --input docs/file1.pdf docs/file2.pdf --output outputs.jsonl
Example using the sample config:
python -m lela.cli \
  --config data/configs/simplewiki_fuzzy_simple.json \
  --input data/docs/simple-english-wiki/corpus.txt \
  --output outputs.jsonl
This uses the simple regex NER, fuzzy candidates, first-candidate disambiguation, and the YAGO-derived KB JSONL.
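The knowledge base format mentioned in the steps above (one JSON object per line with id, title, and description) can be produced with a few lines of Python. This is a sketch under assumptions: the helper `write_kb`, the sample ids, and the optional `aliases` field are made up for illustration, and the project may accept or ignore other metadata keys.

```python
import json

REQUIRED_FIELDS = ("id", "title", "description")


def write_kb(entries, path):
    """Write KB entries to a JSONL file, one JSON object per line.

    Hypothetical helper; raises ValueError if a required field is missing.
    """
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            missing = [k for k in REQUIRED_FIELDS if k not in entry]
            if missing:
                raise ValueError(f"KB entry missing fields: {missing}")
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Illustrative entries only; ids and the "aliases" key are assumptions.
sample_entries = [
    {"id": "Q1", "title": "Paris", "description": "Capital city of France."},
    {
        "id": "Q2",
        "title": "Paris Hilton",
        "description": "American media personality.",
        "aliases": ["Paris"],
    },
]

if __name__ == "__main__":
    write_kb(sample_entries, "kb.jsonl")
```

The resulting kb.jsonl matches the path used in the example config, so it can be dropped in next to config.json for a first test run.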
from lela import Lela
# Load from a JSON config file path
lela = Lela("config.json")
results = lela.run("docs/file1.txt")
# Or pass a dict directly
import json
# Use a context manager so the config file handle is closed promptly
with open("config.json") as f:
    config = json.load(f)
lela = Lela(config)
results = lela.run("docs/file1.txt", "docs/file2.txt")
- Loaders: text, json, jsonl, pdf, docx, html
- NER: spacy, gliner, simple (regex)
- Candidate generators: bm25, dense, fuzzy
- Rerankers: cross_encoder, none
- Disambiguators: popularity, first, llm
- Knowledge bases: jsonl, wikipedia, wikidata
- The data/ directory is gitignored by default. Keep shareable configs in data/configs/ (tracked).
- Sample configs provided: data/configs/simplewiki_fuzzy_simple.json
- YAGO labels TSV → JSONL KB:
python -m lela.scripts.convert_yago_labels data/kb/yagoLabels.tsv data/kb/yago_labels_en.jsonl
- Outputs are JSONL (one line per document with resolved entities).
- Each line: id, text, entities (with text, start, end, label, entity_id, entity_title, entity_description, candidates).
- Cache lives in .ner_cache/ keyed by file path, mtime, and size.
- No dependency on LELA; integration would be optional if added later.
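The output format described above can be consumed with the standard library alone. The sketch below assumes that `start` and `end` are character offsets into `text` and that `entity_id` may be absent for unresolved mentions; both are assumptions, as are the helper name `iter_linked_entities` and the sample record.

```python
import json


def iter_linked_entities(lines):
    """Yield (doc_id, mention_text, start, end, entity_id) for each entity.

    Hypothetical helper; `lines` is any iterable of JSONL strings,
    for example an open outputs.jsonl file.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        for ent in doc.get("entities", []):
            yield (
                doc["id"],
                ent["text"],
                ent["start"],
                ent["end"],
                ent.get("entity_id"),  # None if the key is missing
            )


# Made-up record that follows the field names listed above.
sample_line = json.dumps({
    "id": "doc-1",
    "text": "Paris is in France.",
    "entities": [
        {
            "text": "Paris", "start": 0, "end": 5, "label": "LOC",
            "entity_id": "Q1", "entity_title": "Paris",
            "entity_description": "Capital city of France.",
            "candidates": [],
        },
    ],
})

if __name__ == "__main__":
    for row in iter_linked_entities([sample_line]):
        print(row)
```

To process a real run, pass the open file instead of the sample list, e.g. `with open("outputs.jsonl", encoding="utf-8") as f: rows = list(iter_linked_entities(f))`.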