A self-refining, offline-first knowledge base that scrapes, indexes, and ranks information using relevance feedback. No LLM required for core operation — uses BM25/FTS5 scoring, pattern-based entity extraction, and a feedback loop to improve results over time.
Optionally use Gemma 3 (via Ollama) to refine and summarize documents for higher quality results.
Search results with BM25 ranking, source tags, keyword highlights, and inline images
- Full-text search with SQLite FTS5 and BM25 ranking
- Entity extraction — pattern-based (no LLM needed), 7 entity types: person, org, law, plant, place, mineral, concept
- Knowledge graph — documents linked by shared entities, with graph traversal for discovery
- Query expansion via synonym tables
- Relevance feedback loop — mark results as relevant/irrelevant to train future rankings
- Continuous scraping from 6 free public APIs: Project Gutenberg, arXiv, openFDA, Internet Archive, PubMed Central, OpenStax
- Property scraper — BC real estate from public listings and tax sales
- Collections for grouping related documents
- Web search UI — dark-themed, responsive, works great on mobile
- Read-only consumer API for downstream projects
- Gemma 3 integration (optional) — use a local LLM to refine/summarize scraped documents
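The search core is plain SQLite. A minimal sketch of FTS5 BM25 ranking plus synonym-based query expansion, using a toy schema and synonym table (the engine's actual tables and synonym data live in db.py and smart_search.py):

```python
import sqlite3

SYNONYMS = {"gold": ["placer", "aurum"]}  # toy synonym table

def expand(query):
    # Each query term becomes an OR-group of itself plus its synonyms;
    # groups are ANDed together, matching FTS5 query syntax.
    groups = []
    for term in query.split():
        alternatives = [term] + SYNONYMS.get(term, [])
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("Placer mining", "Gold recovery from creek gravels in British Columbia"),
        ("Solar sizing", "Sizing a solar panel array for an off-grid cabin"),
    ],
)
# bm25() returns lower scores for better matches, so order ascending
rows = conn.execute(
    "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    (expand("gold creek"),),
).fetchall()
for title, score in rows:
    print(title, round(score, 2))
```

Expansion means a search for "gold creek" also matches documents that only say "placer", which is what the synonym-table bullet above refers to.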
# Clone and install
git clone https://github.com/CryptoDustinJ/knowledge-engine.git
cd knowledge-engine
pip install -r requirements.txt
# Initialize the database
python3 ke.py init
# Add your first sources
python3 ke.py add-source --type rss --name "Hacker News" --url "https://news.ycombinator.com/rss"
python3 ke.py add-source --type rss --name "Python News" --url "https://realpython.com/atom.xml"
# Scrape all sources
python3 ke.py scrape
# Or run the continuous API scraper (Gutenberg, arXiv, PubMed, etc.)
python3 source_scraper.py
# Query
python3 ke.py query "machine learning transformers"
# Show stats
python3 ke.py stats

+------------------+
| Search UI | <-- Flask web app (port 8585)
| (browser) |
+--------+---------+
|
+--------v---------+
| search_server.py | <-- REST API
+--------+---------+
|
+-------------------+-------------------+
| | |
+---------v------+ +---------v------+ +---------v------+
| smart_search.py| | entities.py | | db.py |
| BM25 + entity | | Pattern-based | | SQLite + FTS5 |
| overlap + | | extraction | | WAL mode |
| query expansion| | (7 types) | | migrations |
+----------------+ +----------------+ +--------+-------+
|
+--------------------------------------+
| | |
+---------v------+ +---------v------+ +--------v-------+
| source_scraper | | scraper.py | | refine.py |
| (6 public APIs)| | (RSS, web, API)| | + Gemma 3 LLM  |
+----------------+ +----------------+ +----------------+
The data/ directory has pre-built datasets you can ingest immediately:
# Load BC gold mining encyclopedia (1,600+ documents)
python3 -c "from data.bc_gold_mining import DOCUMENTS; import db; db.init_db(); [db.insert_document(4, d['title'], d['content'], keywords=d.get('keywords',[])) for d in DOCUMENTS]"
# Load medicinal plants database
python3 -c "from data.medicinal_plants import PLANTS; import db; db.init_db(); [db.insert_document(3, p['name'], p['content'], keywords=p.get('keywords',[])) for p in PLANTS]"
# Load fishing, survival, off-grid building, and more
python3 expand_fishing.py
python3 expand_survival_equipment.py
python3 expand_offgrid_corp.py

| Dataset | Documents | Topics |
|---|---|---|
| BC Gold Mining | ~1,600 | Placer mining, lode deposits, history, regulations, geology |
| BC Gold Analysis | ~1,600 | Detailed geological analysis, creek-by-creek data |
| Medicinal Plants | ~200 | Traditional and evidence-based plant medicine |
| Fishing (BC) | 15 | Salmon, trout, halibut, techniques, regulations |
| Survival Skills | 15 | Water, fire, shelter, navigation, first aid |
| Equipment Knowledge | 12 | Chainsaw, generator, welding, small engine repair |
| Off-Grid Building | 15 | Solar, rainwater, septic, log building |
| BC Corporation | 12 | Incorporation, taxes, compliance, directors |
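The one-liner loaders above just iterate over a module-level list of dicts and insert each entry. Schematically, with a toy in-memory stand-in for `db.insert_document` (the real function and schema live in db.py):

```python
import sqlite3

# Toy stand-in: each dataset module exposes a list of dicts shaped like this
DOCUMENTS = [
    {"title": "Williams Creek", "content": "Placer gold ...", "keywords": ["cariboo"]},
    {"title": "Fraser River bars", "content": "Bar gold ..."},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (source_id INT, title TEXT, content TEXT, keywords TEXT)"
)

def insert_document(conn, source_id, title, content, keywords=()):
    # Stand-in for db.insert_document used by the one-liners above
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?, ?)",
        (source_id, title, content, ",".join(keywords)),
    )

for d in DOCUMENTS:
    insert_document(conn, 4, d["title"], d["content"], d.get("keywords", []))

count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(f"Ingested {count} documents")  # prints "Ingested 2 documents"
```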
source_scraper.py runs continuously, pulling from 6 free APIs:
# Run continuous scraper (30s cooldown between fetches)
python3 source_scraper.py
# Or run it as a service
# Create a systemd service or use screen/tmux:
screen -S ke-scraper
python3 source_scraper.py
# Ctrl+A, D to detach

| API | Content | Rate |
|---|---|---|
| Project Gutenberg (Gutendex) | Public domain books — science, history, philosophy, medicine | ~500 docs/cycle |
| arXiv | Research papers — AI, biology, physics, math | ~1,800 docs/cycle |
| openFDA | Drug labels and safety data | ~1,400 docs/cycle |
| Internet Archive | Military field manuals (DTIC collection) | ~10 docs/cycle |
| PubMed Central | Open access medical/scientific papers | ~1,400 docs/cycle |
| OpenStax | Free textbook metadata (biology, chemistry, physics) | ~11 docs/cycle |
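The loop's overall shape — round-robin over the sources with a cooldown between fetches — can be sketched as follows. The fetchers here are stand-ins; the real ones live in source_scraper.py:

```python
import time
from itertools import cycle

COOLDOWN_S = 30  # matches the 30s cooldown mentioned above

# Stand-in fetchers returning fake document IDs
def fetch_gutenberg():
    return ["doc-a"]

def fetch_arxiv():
    return ["doc-b"]

SOURCES = [("gutenberg", fetch_gutenberg), ("arxiv", fetch_arxiv)]

def run(cycles, cooldown=COOLDOWN_S):
    fetched = []
    for _, (name, fetch) in zip(range(cycles), cycle(SOURCES)):
        docs = fetch()
        print(f"{name}: {len(docs)} docs")
        fetched.extend(docs)
        time.sleep(cooldown)  # be polite to the upstream APIs
    return fetched

result = run(4, cooldown=0)  # cooldown=0 only for this demo
```

The real scraper also inserts each batch into the database and tracks per-source cursors so it resumes where it left off.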
You can optionally use a local LLM to refine, summarize, and improve scraped documents. This requires Ollama.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Gemma 3 (choose size for your GPU)
ollama pull gemma3:4b # 4GB VRAM minimum
ollama pull gemma3:12b # 8GB VRAM
ollama pull gemma3:27b # 16GB+ VRAM
# Verify it's running
curl http://localhost:11434/api/tags

The refinement pipeline uses Gemma to:
- Summarize long documents into concise reference entries
- Extract better keywords than the automated TF-IDF approach
- Identify and fill knowledge gaps
- Generate research plans for new topics
# Run the automated refinement pipeline
python3 refine.py
# Or use the Claude/Gemma-powered refinement script
./refine-with-claude.sh

You can also write your own refinement using the Ollama API:
import requests
import db
db.init_db()
conn = db.get_conn()
# Get documents that need refinement (low relevance, no keywords)
docs = conn.execute("""
SELECT id, title, content FROM documents
WHERE relevance_score < 0.3
ORDER BY created_at DESC LIMIT 50
""").fetchall()
for doc in docs:
    # Ask Gemma to summarize and extract keywords
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "gemma3:12b",
        "prompt": f"Summarize this document in 2-3 sentences and list 5 keywords:\n\n{doc['content'][:3000]}",
        "stream": False,
    })
    summary = resp.json()["response"]
    print(f"Refined: {doc['title']}")
    print(f"  {summary[:200]}...")

# Start the web UI (default port 8585)
python3 search_server.py
# Open in browser
# http://localhost:8585

The search UI features:
- Dark theme with gradient accents
- Real-time search with BM25 ranking
- Source filtering (by domain/type)
- Relevance feedback buttons (thumbs up/down on each result)
- Document count and database stats
- Mobile-responsive design
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Search UI (HTML) |
| `/api/search?q=query` | GET | Search documents |
| `/api/stats` | GET | Database statistics |
| `/api/feedback` | POST | Submit relevance feedback |
| `/api/entities?q=query` | GET | Search entities |
| `/api/graph/:id` | GET | Get document's knowledge graph |
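A tiny stdlib-only client for the search endpoint. The response shape (a JSON list of objects with a `title` field) and the `limit` parameter are assumptions — check search_server.py for the actual payload:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8585"  # or your Tailscale IP, e.g. http://100.x.y.z:8585

def build_search_url(query, limit=10):
    # GET /api/search?q=...  (limit parameter is an assumption)
    return f"{BASE}/api/search?" + urllib.parse.urlencode({"q": query, "limit": limit})

def search(query, limit=10):
    with urllib.request.urlopen(build_search_url(query, limit), timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    try:
        for hit in search("placer gold"):
            print(hit.get("title"))
    except OSError:
        print("Search server not reachable; start search_server.py first")
```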
Tailscale creates a secure mesh VPN so you can access your Knowledge Engine from anywhere — phone, tablet, laptop — without exposing it to the internet.
# Linux (Debian/Ubuntu)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
# It will print a URL — open it to authenticate
# Note your Tailscale IP (e.g., 100.x.y.z)
tailscale ip -4

Install the Tailscale app on your phone:

- Android: Google Play Store
- iOS: App Store
Sign in with the same account you used on the server.
# The server already binds to 0.0.0.0, so it's accessible on all interfaces
python3 search_server.py

Open your phone's browser and go to:
http://100.x.y.z:8585
Replace 100.x.y.z with your server's Tailscale IP. Bookmark it for quick access.
# Create a systemd service so it starts on boot
sudo tee /etc/systemd/system/knowledge-engine.service << 'EOF'
[Unit]
Description=Knowledge Engine Search Server
After=network.target
[Service]
Type=simple
WorkingDirectory=/path/to/knowledge-engine
ExecStart=/usr/bin/python3 search_server.py
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now knowledge-engine

On both Android and iOS, you can add the search page as a PWA-like shortcut:
- Open http://100.x.y.z:8585 in your phone browser
- Tap the share/menu button
- Select "Add to Home Screen"
- Now you have a dedicated icon that opens directly to your knowledge engine
# Extract entities from all documents (pattern-based, no LLM)
python3 ke.py extract-entities
# Search by entity
python3 ke.py entity-search "British Columbia"
# List all entities of a type
python3 ke.py entities --type person
python3 ke.py entities --type org
python3 ke.py entities --type law
# Explore document connections
python3 ke.py graph 42 # Show entity connections for doc 42
python3 ke.py related 42    # Find related documents via shared entities

| Type | Examples | Count |
|---|---|---|
| `person` | Historical figures, authors, researchers | ~1,200 |
| `org` | Companies, institutions, government bodies | ~2,000 |
| `law` | Statutes, regulations, legal references | ~500 |
| `plant` | Medicinal and edible plants | ~90 |
| `place` | Cities, regions, geographic features | ~70 |
| `mineral` | Gold, silver, copper, geological terms | ~25 |
| `concept` | Technical concepts, methodologies | ~40 |
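Pattern-based extraction is just regular expressions run over document text. A toy sketch — these patterns and the sample text are illustrative, not the ones in entities.py:

```python
import re

# Toy patterns for three of the seven entity types
PATTERNS = {
    "person": re.compile(r"\b(?:Dr|Mr|Ms|Prof)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"),
    "law": re.compile(r"\b[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*\s+Act\b"),
    "mineral": re.compile(r"\b(?:gold|silver|copper|quartz)\b", re.IGNORECASE),
}

def extract_entities(text):
    # Return {entity_type: sorted unique matches} for types that hit
    found = {}
    for etype, pattern in PATTERNS.items():
        hits = sorted(set(pattern.findall(text)))
        if hits:
            found[etype] = hits
    return found

sample = ("Placer claims fall under the Mineral Tenure Act; Dr. George Dawson "
          "mapped gold and quartz veins nearby.")
print(extract_entities(sample))
```

Because there is no model inference, extraction is fast enough to run over every document on ingest; `ke.py extract-entities` then links documents that share entities into the knowledge graph.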
# Create a collection
python3 ke.py collection-create "Gold Prospecting" "Everything about finding gold in BC"
# Add documents to it
python3 ke.py collection-add 1 42
python3 ke.py collection-add 1 43
# List collections
python3 ke.py collections
# Give feedback to improve rankings
python3 ke.py feedback --id 42 --relevant true # Boost this result
python3 ke.py feedback --id 55 --relevant false  # Demote this result

Other projects can consume your knowledge base without risking data integrity. See CONSUMERS.md for the full contract.
from ke_client import KEClient
with KEClient() as ke:
    # Full-text search
    docs = ke.search_documents("solar panel sizing", limit=10)
    for doc in docs:
        print(f"{doc['title']} (relevance: {doc['relevance_score']})")

    # Property search
    props = ke.search_properties("cabin", max_price=20000)

    # Check schema version
    versions = ke.schema_version()

# Nightly backup (add to cron: 0 3 * * *)
./run_nightly_backup.sh
# Decay old, low-relevance documents
python3 ke.py decay
# Run refinement pipeline
python3 refine.py

knowledge-engine/
├── ke.py # CLI interface (26+ commands)
├── db.py # SQLite + FTS5 database layer
├── entities.py # Pattern-based entity extraction
├── smart_search.py # BM25 + entity overlap + query expansion
├── scraper.py # Base scraper (RSS, web, API)
├── source_scraper.py # Continuous scraper for 6 public APIs
├── search_server.py # Flask web UI + REST API
├── refine.py # Document refinement pipeline
├── ke_client.py # Read-only consumer library
├── properties_scraper.py # BC real estate scraper
├── search_ui/
│ └── index.html # Dark-themed search interface
├── migrations/ # SQL schema evolution (7 migrations)
├── data/ # Pre-built datasets
│ ├── bc_gold_mining.py
│ ├── bc_gold_analysis.py
│ ├── medicinal_plants.py
│ └── medicinal_plants_expanded.py
├── topics/ # Topic seed data
│ ├── food_seeds.py
│ └── medicinal_plants.py
├── expand_fishing.py # BC fishing knowledge (15 docs)
├── expand_survival_equipment.py # Survival + equipment (27 docs)
└── expand_offgrid_corp.py # Off-grid + BC corp (27 docs)
- Python 3.10+
- SQLite 3.35+ (with FTS5 support — included in most distributions)
- Flask, flask-cors, requests, beautifulsoup4, feedparser
- Optional: Ollama + Gemma 3 for document refinement
MIT — see LICENSE