A lightweight, self-contained full-text search engine with a premium Web UI and REST API. Built entirely in Python — no Java, no Elasticsearch, no external search infrastructure.
Searches any text file using the standard Lucene query language (title:twain, author:dick*, content:whale, word_count:[0 TO 5000]) with instant metadata search and on-demand full-text loading.
- Full Lucene query syntax: field queries, wildcards, phrases, ranges, boolean operators
- Two-tier lazy index: metadata for all corpora always in memory (~3 MB); full text loaded only on demand
- Zero content pre-loading: startup takes ~1 second; no 966 MB RAM spike
- Pickle cache: second startup loads instantly from index_cache.pkl
- Located passage snippets: each result shows up to 5 highlighted excerpts, each with a visual mini progress-bar showing where in the document the match was found
- REST API: easily integrate with other services
- Premium Web UI: dark glassmorphism, library browser, drag-and-drop upload, document viewer
- Search-results sidebar: after any search the Library panel narrows to show only matching documents; click "✕ All books" to restore the full list
- File upload: index your own .txt, .json, or .csv documents
- Python 3.12+
- pip
git clone <repo-url>
cd Lucet
pip install -r requirements.txt
# Using the management script (recommended)
./lucet_ui.sh start # start in background
./lucet_ui.sh status # show PID + memory usage
./lucet_ui.sh restart # restart after code changes
./lucet_ui.sh logs # tail the log file
./lucet_ui.sh stop # graceful shutdown
# Or run directly (foreground)
python3 main.py
Open http://localhost:8000 in your browser.
Environment overrides:
LUCENE_PORT=9000 ./lucet_ui.sh start # different port
LUCENE_HOST=127.0.0.1 ./lucet_ui.sh start # localhost only
LUCENE_PYTHON=/usr/bin/python3.12 ./lucet_ui.sh start
The first run scans the training_data/ directory (~1–2 sec) and saves a cache. Every subsequent start loads from cache in under 0.1 sec.
Lucet implements the standard Lucene query language via luqum.
title:dickens
author:twain
filename:moby*
word_count:[0 TO 5000]
size_bytes:[100000 TO *]
content:whale
content:"call me ishmael"
Load a book first — click Load next to any book in the library, then run content queries.
Each result shows up to 5 highlighted passages from inside the text, each tagged with its approximate word position and percentage through the document.
| Pattern | Meaning |
|---|---|
| `title:mo*` | title starts with "mo" |
| `title:mo?y` | moby, moly, etc. |
Matching is substring and case-insensitive — content:whale also matches "whales", "whaleship", etc. This is intentional; recall is always 1.0.
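The matching rules above can be illustrated with a minimal sketch. It is not the code in lucene_engine.py: the function names are hypothetical, and it assumes wildcard terms are anchored at the start of the field value while plain terms match anywhere as a case-insensitive substring.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a Lucene-style wildcard term into a case-insensitive regex.

    `*` matches any run of characters and `?` matches exactly one character.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def term_matches(term: str, value: str) -> bool:
    """Return True if `term` matches the field `value`."""
    if "*" in term or "?" in term:
        # Assumption: wildcard terms are anchored at the start of the value.
        return wildcard_to_regex(term).match(value) is not None
    # Plain terms: case-insensitive substring match, as described above.
    return term.lower() in value.lower()


print(term_matches("mo*", "Moby Dick"))
print(term_matches("whale", "The Whaleship Essex"))
```

Substring matching is what makes content:whale hit "whaleship"; a tokenizing engine would treat those as different terms.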
author:twain OR author:dickens
title:moby AND content:whale
NOT author:shakespeare
title:war AND NOT title:peace
"call me ishmael"
content:"it was the best of times"
word_count:[10000 TO 50000]
word_count:[100000 TO *]
size_bytes:[* TO 100000]
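As a rough illustration of how a numeric range such as the ones above can be evaluated, here is a small sketch. The helper names are hypothetical and the real engine parses ranges through luqum rather than a regular expression; the only behavior assumed here is standard Lucene semantics, where square brackets are inclusive and `*` leaves a bound open.

```python
import re

# Matches "[low TO high]" where either bound may be "*".
_RANGE_RE = re.compile(r"^\[\s*(\S+)\s+TO\s+(\S+)\s*\]$")


def parse_range(expr: str) -> tuple[str, str]:
    """Split a range expression like "[0 TO 5000]" into its two bounds."""
    match = _RANGE_RE.match(expr.strip())
    if match is None:
        raise ValueError(f"not a range expression: {expr!r}")
    return match.group(1), match.group(2)


def in_range(value: int, expr: str) -> bool:
    """Inclusive range test; "*" means the bound is open."""
    low, high = parse_range(expr)
    lower_ok = low == "*" or value >= int(low)
    upper_ok = high == "*" or value <= int(high)
    return lower_ok and upper_ok


print(in_range(209117, "[100000 TO *]"))
print(in_range(209117, "[10000 TO 50000]"))
```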
| Panel | Description |
|---|---|
| Library | Paginated list of all 2,318 books with filter and Load buttons. After a search, narrows to show only matching documents with a purple banner; click ✕ All books to restore. |
| Loaded | Shows content-loaded documents; unload to free memory |
| Upload | Drag-and-drop .txt, .json, .csv — indexed with full content |
| Search Results | Card-based results. Each card shows title, author, word count, and up to 5 located passage snippets — each with a mini progress-bar dot showing where in the document the match falls, plus an estimated word position and percentage. |
| Document Viewer | Click any result or library entry to read the text |
| Footer | Live chips showing which books are content-loaded |
Base URL: http://localhost:8000
Interactive docs: http://localhost:8000/docs
GET /api/search?q=content:whale&limit=50
{
  "query": "content:whale",
  "total_hits": 2,
  "content_docs_searched": 5,
  "disk_docs_searched": 0,
  "total_docs": 2318,
  "hits": [
    {
      "_id": "abc123",
      "title": "Moby Dick",
      "author": "Herman Melville",
      "word_count": 209117,
      "size_bytes": 1215834,
      "content_loaded": true,
      "uploaded": false,
      "_snippets": [
        {
          "text": "…the great white whale breached the surface once more…",
          "pct": 12,
          "word": 25080
        },
        {
          "text": "…whale oil filled the barrels to the brim…",
          "pct": 47,
          "word": 98240
        }
      ]
    }
  ]
}
_snippets is present on every hit where content_loaded: true. Each entry:
| Field | Type | Description |
|---|---|---|
| `text` | string | ~280-char passage centred on the match, with … ellipsis |
| `pct` | int 0–100 | How far through the document this match is |
| `word` | int | Estimated word number of the match position |
Metadata-only queries (e.g. author:twain) return hits without _snippets.
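A client for the search endpoint might look like the following sketch, which uses only the standard library. The function names are hypothetical; the URL, query parameters, and response fields are taken from the example above, and the server is assumed to be running on localhost:8000.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"


def build_search_url(query: str, limit: int = 50, base_url: str = BASE_URL) -> str:
    """Build the /api/search URL with the query string percent-encoded."""
    return f"{base_url}/api/search?{urlencode({'q': query, 'limit': limit})}"


def search(query: str, limit: int = 50) -> dict:
    """Call the running server and return the decoded JSON response."""
    with urlopen(build_search_url(query, limit)) as response:
        return json.load(response)


def summarize_hits(response: dict) -> list[str]:
    """Flatten a search response into printable lines.

    Hits from metadata-only queries carry no _snippets, so the key is optional.
    """
    lines = []
    for hit in response.get("hits", []):
        lines.append(f"{hit['title']} by {hit['author']}")
        for snippet in hit.get("_snippets", []):
            lines.append(f"  {snippet['pct']}% (~word {snippet['word']}): {snippet['text']}")
    return lines
```

With the server started, `print("\n".join(summarize_hits(search("content:whale"))))` would list each hit followed by its located passages.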
POST /api/documents/{id}/load
Reads the file from disk into memory. After this, content queries will match this document and results will include _snippets.
DELETE /api/documents/{id}/unload
Evicts the full text from memory; metadata remains.
GET /api/documents?page=1&per_page=50&q=twain&loaded_only=false
POST /api/upload
Content-Type: multipart/form-data
file=@myfile.txt
Supports .txt, .json (object or array), .csv (each row = one document).
POST /api/index
Content-Type: application/json
{"title": "My Report", "author": "Alice", "content": "Hello world..."}
GET /api/stats
{
  "total_docs": 2318,
  "content_loaded": 3,
  "uploaded": 1,
  "cache_exists": true
}
DELETE /api/index/content
Frees all loaded book content from memory (preserves uploaded documents).
Lucet/
├── lucene_engine.py # Pure-Python Lucene query engine (embeddable standalone)
├── main.py # FastAPI application + REST API
├── requirements.txt # Python dependencies (server)
├── pyproject.toml # Package metadata for library use / PyPI
├── LICENSE # MIT
├── .gitignore
├── lucene.sh # Process management script (start/stop/logs)
├── static/
│ ├── index.html # Single-page Web UI
│ ├── style.css # Dark glassmorphism styling
│ └── app.js # UI logic (vanilla JS, no framework)
├── training_data/ # 15 bootstrap texts included; add more from Project Gutenberg
├── context.txt # Original product requirements
├── CLAUDE.md # AI agent instructions
└── README.md # This file
The repo ships with 15 bootstrap texts in training_data/ so the demo works out of the box:
| Text | Author | Genre |
|---|---|---|
| 2 B R 0 2 B | Kurt Vonnegut | Sci-fi |
| A Dog's Tale | Mark Twain | Humor |
| Extracts from Adam's Diary | Mark Twain | Humor |
| A Modest Proposal | Jonathan Swift | Satire |
| The Gift of the Magi | O. Henry | Short story |
| The Monkey's Paw | W.W. Jacobs | Horror |
| His Last Bow | Arthur Conan Doyle | Mystery |
| The Parenticide Club | Ambrose Bierce | Dark humor |
| Beyond Lies the Wub | Philip K. Dick | Sci-fi |
| Crystal Crypt | Philip K. Dick | Sci-fi |
| The Rime of the Ancient Mariner | Samuel Taylor Coleridge | Poetry |
| Songs of Innocence and Experience | William Blake | Poetry |
| Give Me Liberty | Patrick Henry | Historical speech |
| Declaration of Independence | United States of America | Historical document |
| The Constitution of the United States | United States of America | Historical document |
To add more, place additional .txt files from Project Gutenberg in training_data/ using the naming convention:
Title-Words-Hyphenated-by-Author-Name.txt
# e.g. Moby-Dick-by-Herman-Melville.txt
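One way to read a title and author out of this naming convention is sketched below. The function name and the "Unknown" fallback are hypothetical, and splitting on the last "-by-" is an assumption made so that titles containing the word "by" keep it.

```python
from pathlib import Path


def parse_filename(filename: str) -> tuple[str, str]:
    """Split "Title-Words-by-Author-Name.txt" into (title, author)."""
    stem = Path(filename).stem
    # Assumption: the author follows the LAST "-by-" in the name.
    title_part, separator, author_part = stem.rpartition("-by-")
    if not separator:
        # No "-by-" marker: treat the whole stem as the title.
        return stem.replace("-", " "), "Unknown"
    return title_part.replace("-", " "), author_part.replace("-", " ")


print(parse_filename("Moby-Dick-by-Herman-Melville.txt"))
# ('Moby Dick', 'Herman Melville')
```

Because hyphens stand in for spaces, a hyphen that belongs to the title itself (as in "Moby-Dick") cannot be recovered from the filename.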
The full corpus used in development is 2,318 texts (~966 MB) — too large for git. The server and all API endpoints work with any number of documents including zero (upload your own via the UI or /api/index).
lucene_engine.py is a self-contained module with no web-framework dependency. Drop it into any Python project and use it as an embedded search index:
from lucene_engine import LuceneEngine
engine = LuceneEngine()
# Index documents — any dict with any fields
engine.add_document({"title": "My Doc", "author": "Alice", "content": "Hello world, this is a test."})
engine.add_document({"title": "Another Doc", "author": "Bob", "content": "Foo bar baz content here."})
# Search with full Lucene query syntax
result = engine.search("content:hello")
print(result.hits) # list of matching document dicts (content excluded)
print(result.total_docs) # 2
# Field queries, booleans, wildcards, ranges — all work
result = engine.search('author:Alice AND content:test')
result = engine.search('title:doc* AND NOT author:bob')
result = engine.search('word_count:[1000 TO *]')
# Each hit on a content-loaded doc includes located passage snippets
for hit in result.hits:
    for snippet in hit.get("_snippets", []):
        print(f"  ~word {snippet['word']} ({snippet['pct']}%): {snippet['text']}")
LuceneEngine requires only luqum (pip install luqum). No server, no Java, no network.
The index never pre-loads file content:
Startup (~1 sec)
└── Scan training_data/ filenames
└── Parse title + author from filename
└── os.stat() for size
└── Store metadata dict (no file reads)
└── Save to index_cache.pkl
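The startup flow above could be approximated as follows. This is a sketch rather than the code in main.py: the function names and the metadata layout are hypothetical, title/author parsing is left out, and the word count is assumed to be derived from file size using the 5.5 characters-per-word ratio mentioned for snippet positions.

```python
import os
import pickle
from pathlib import Path

BYTES_PER_WORD = 5.5  # assumed ratio for word-count estimation


def scan_metadata(data_dir) -> dict:
    """Build metadata from filenames and os.stat() only; no file is opened."""
    index = {}
    for path in sorted(Path(data_dir).glob("*.txt")):
        size = os.stat(path).st_size
        index[path.name] = {
            "filename": path.name,
            "size_bytes": size,
            "word_count": int(size / BYTES_PER_WORD),
            "content_loaded": False,
        }
    return index


def load_or_build(data_dir, cache_path) -> dict:
    """Return the cached index if present, else scan and write the cache."""
    cache = Path(cache_path)
    if cache.exists():
        with cache.open("rb") as fh:
            return pickle.load(fh)
    index = scan_metadata(data_dir)
    with cache.open("wb") as fh:
        pickle.dump(index, fh)
    return index
```

A cache written this way does not notice files added later, which is why deleting index_cache.pkl forces a rebuild. Pickle files should only be loaded from a trusted location, since unpickling can execute arbitrary code.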
When a user clicks Load:
POST /api/documents/{id}/load
└── Read file from disk (once)
└── Store text in memory
└── content_loaded = True
└── Now searchable with content: queries
└── Results include located passage snippets (_snippets)
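The load and unload steps can be pictured with a small in-memory store. The class and method names below are hypothetical and the sketch ignores metadata entirely; it only shows that content becomes searchable after an explicit load and stops being searchable after an unload.

```python
from pathlib import Path


class LazyContentStore:
    """Keeps file paths for every document and text only for loaded ones."""

    def __init__(self):
        self.paths = {}  # doc_id -> Path on disk
        self.texts = {}  # doc_id -> full text, present only while loaded

    def register(self, doc_id: str, path) -> None:
        self.paths[doc_id] = Path(path)

    def load(self, doc_id: str) -> int:
        """Read the file once and keep it in memory; returns its length."""
        if doc_id not in self.texts:
            self.texts[doc_id] = self.paths[doc_id].read_text(
                encoding="utf-8", errors="replace"
            )
        return len(self.texts[doc_id])

    def unload(self, doc_id: str) -> None:
        """Drop the text; the registered path stays."""
        self.texts.pop(doc_id, None)

    def search_content(self, term: str) -> list[str]:
        """Case-insensitive substring search over loaded documents only."""
        needle = term.lower()
        return [doc_id for doc_id, text in self.texts.items() if needle in text.lower()]
```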
| State | RAM |
|---|---|
| Server start (2,318 books, metadata only) | ~3 MB |
| After loading one typical novel (~400 KB) | ~3.4 MB |
| After loading 10 novels | ~7 MB |
| After loading ALL books (if you wanted to) | ~969 MB |
| Query type | Docs searched | Snippets included |
|---|---|---|
| `title:`, `author:`, `word_count:`, etc. | All 2,318 (metadata tier) | No |
| `content:` or bare terms | Only content-loaded documents | Yes, up to 5 located passages |
| Mixed (`title:moby AND content:whale`) | Metadata filter applied to all; content filter applied to loaded subset | Yes |
For every content-loaded document that matches a query, the engine:
- Runs re.finditer() for all query terms across the full document text
- Sorts and deduplicates match positions (merging hits within half a snippet-window of each other)
- Extracts up to 5 passages of ~280 characters each, walking back to natural word/line boundaries
- Records pct (position ÷ document length × 100) and word (position ÷ 5.5, same ratio as word-count estimation) for each passage
Project Gutenberg boilerplate headers are detected and skipped when no match is found in the body text.
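The passage-location steps above can be sketched as follows. This is an approximation under stated assumptions, not the engine's implementation: the function name is hypothetical, the walk back to a word boundary is capped at 20 characters as an arbitrary choice, and Project Gutenberg header skipping is not implemented.

```python
import re

SNIPPET_CHARS = 280   # approximate passage length
MAX_SNIPPETS = 5
CHARS_PER_WORD = 5.5  # ratio used to estimate the word position
MAX_WALK_BACK = 20    # assumption: cap on the boundary search


def locate_snippets(text: str, terms: list[str]) -> list[dict]:
    """Return up to MAX_SNIPPETS passages with their relative positions."""
    terms = [t for t in terms if t]
    if not text or not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    # finditer yields matches in document order; drop any match that falls
    # within half a window of the previously kept one.
    kept = []
    for match in pattern.finditer(text):
        pos = match.start()
        if not kept or pos - kept[-1] >= SNIPPET_CHARS // 2:
            kept.append(pos)

    snippets = []
    for pos in kept[:MAX_SNIPPETS]:
        start = max(0, pos - SNIPPET_CHARS // 2)
        limit = max(0, start - MAX_WALK_BACK)
        # Walk back so the passage does not begin in the middle of a word.
        while start > limit and not text[start - 1].isspace():
            start -= 1
        end = min(len(text), start + SNIPPET_CHARS)
        passage = text[start:end].strip()
        if start > 0:
            passage = "…" + passage
        if end < len(text):
            passage += "…"
        snippets.append({
            "text": passage,
            "pct": pos * 100 // len(text),
            "word": int(pos / CHARS_PER_WORD),
        })
    return snippets
```

Merging nearby matches keeps a densely matching paragraph from using up all five slots, so the passages tend to spread across the document.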
| Package | Purpose |
|---|---|
| `fastapi` | REST API framework |
| `uvicorn` | ASGI server |
| `luqum` | Lucene query parser (AST) |
| `python-multipart` | File upload support |
No Java. No Elasticsearch. No ML models. No vector databases.
Via the UI: drag and drop any .txt, .json, or .csv into the Upload panel.
Via the API:
# Upload a text file
curl -X POST http://localhost:8000/api/upload \
-F "file=@my_document.txt"
# Index a JSON object directly
curl -X POST http://localhost:8000/api/index \
-H "Content-Type: application/json" \
-d '{"title": "My Doc", "author": "Me", "content": "Full text here..."}'
# Bulk index a JSON array
curl -X POST http://localhost:8000/api/index/bulk \
-H "Content-Type: application/json" \
-d '[{"title": "Doc 1", "content": "..."}, {"title": "Doc 2", "content": "..."}]'
Uploaded documents have their full content always in memory and are searchable immediately with any query type.
rm index_cache.pkl
python3 main.py # rebuilds in ~1-2 sec
MIT License
Copyright (c) 2026 Lucet Contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Training data is from Project Gutenberg and is in the public domain.