Knowledge Engine

A self-refining, offline-first knowledge base that scrapes, indexes, and ranks information using relevance feedback. No LLM required for core operation — uses BM25/FTS5 scoring, pattern-based entity extraction, and a feedback loop to improve results over time.

Optionally use Gemma 3 (via Ollama) to refine and summarize documents for higher-quality results.

[Screenshot: Knowledge Engine search UI, showing search results with BM25 ranking, source tags, keyword highlights, and inline images]

Features

  • Full-text search with SQLite FTS5 and BM25 ranking
  • Entity extraction — pattern-based (no LLM needed), 7 entity types: person, org, law, plant, place, mineral, concept
  • Knowledge graph — documents linked by shared entities, with graph traversal for discovery
  • Query expansion via synonym tables
  • Relevance feedback loop — mark results as relevant/irrelevant to train future rankings
  • Continuous scraping from 6 free public APIs: Project Gutenberg, arXiv, openFDA, Internet Archive, PubMed Central, OpenStax
  • Property scraper — BC real estate from public listings and tax sales
  • Collections for grouping related documents
  • Web search UI — dark-themed, responsive, works great on mobile
  • Read-only consumer API for downstream projects
  • Gemma 3 integration (optional) — use a local LLM to refine/summarize scraped documents
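
The ranking in the first bullet is plain SQLite: an FTS5 virtual table queried with the built-in bm25() auxiliary function. A minimal standalone sketch (toy table and data, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("Placer mining", "Gold recovery from creek gravels in BC"),
        ("Transformers", "Attention-based models for machine learning"),
    ],
)
# bm25() returns a negative number; more negative means a better match,
# so ORDER BY bm25(docs) puts the best hit first.
rows = conn.execute(
    "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("gold mining",),
).fetchall()
print(rows[0][0])
```

By default an FTS5 MATCH requires every term, and a term may occur in any indexed column, which is why the first document matches on "gold" (content) plus "mining" (title) while the second is excluded.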

Quick Start

# Clone and install
git clone https://github.com/CryptoDustinJ/knowledge-engine.git
cd knowledge-engine
pip install -r requirements.txt

# Initialize the database
python3 ke.py init

# Add your first sources
python3 ke.py add-source --type rss --name "Hacker News" --url "https://news.ycombinator.com/rss"
python3 ke.py add-source --type rss --name "Python News" --url "https://realpython.com/atom.xml"

# Scrape all sources
python3 ke.py scrape

# Or run the continuous API scraper (Gutenberg, arXiv, PubMed, etc.)
python3 source_scraper.py

# Query
python3 ke.py query "machine learning transformers"

# Show stats
python3 ke.py stats

Architecture

                         +------------------+
                         |   Search UI      |  <-- Flask web app (port 8585)
                         |   (browser)      |
                         +--------+---------+
                                  |
                         +--------v---------+
                         | search_server.py |  <-- REST API
                         +--------+---------+
                                  |
              +-------------------+-------------------+
              |                   |                   |
    +---------v------+  +---------v------+  +---------v------+
    | smart_search.py|  |  entities.py   |  |    db.py       |
    | BM25 + entity  |  | Pattern-based  |  | SQLite + FTS5  |
    | overlap +      |  | extraction     |  | WAL mode       |
    | query expansion|  | (7 types)      |  | migrations     |
    +----------------+  +----------------+  +--------+-------+
                                                     |
              +--------------------------------------+
              |                    |                  |
    +---------v------+  +---------v------+  +--------v-------+
    | source_scraper |  |   scraper.py   |  |  refine.py     |
    | (6 public APIs)|  | (RSS, web, API)|  | + Gemma 3 LLM  |
    +----------------+  +----------------+  +----------------+

Seeding with Data

The data/ directory has pre-built datasets you can ingest immediately:

# Load BC gold mining encyclopedia (1,600+ documents)
python3 -c "from data.bc_gold_mining import DOCUMENTS; import db; db.init_db(); [db.insert_document(4, d['title'], d['content'], keywords=d.get('keywords',[])) for d in DOCUMENTS]"

# Load medicinal plants database
python3 -c "from data.medicinal_plants import PLANTS; import db; db.init_db(); [db.insert_document(3, p['name'], p['content'], keywords=p.get('keywords',[])) for p in PLANTS]"

# Load fishing, survival, off-grid building, and more
python3 expand_fishing.py
python3 expand_survival_equipment.py
python3 expand_offgrid_corp.py

Available Datasets

| Dataset | Documents | Topics |
|---|---|---|
| BC Gold Mining | ~1,600 | Placer mining, lode deposits, history, regulations, geology |
| BC Gold Analysis | ~1,600 | Detailed geological analysis, creek-by-creek data |
| Medicinal Plants | ~200 | Traditional and evidence-based plant medicine |
| Fishing (BC) | 15 | Salmon, trout, halibut, techniques, regulations |
| Survival Skills | 15 | Water, fire, shelter, navigation, first aid |
| Equipment Knowledge | 12 | Chainsaw, generator, welding, small engine repair |
| Off-Grid Building | 15 | Solar, rainwater, septic, log building |
| BC Corporation | 12 | Incorporation, taxes, compliance, directors |

Scraping from Public APIs

The source_scraper.py script runs continuously, pulling from 6 free APIs:

# Run continuous scraper (30s cooldown between fetches)
python3 source_scraper.py

# Or run it as a service
# Create a systemd service or use screen/tmux:
screen -S ke-scraper
python3 source_scraper.py
# Ctrl+A, D to detach

Sources

| API | Content | Rate |
|---|---|---|
| Project Gutenberg (Gutendex) | Public domain books: science, history, philosophy, medicine | ~500 docs/cycle |
| arXiv | Research papers: AI, biology, physics, math | ~1,800 docs/cycle |
| openFDA | Drug labels and safety data | ~1,400 docs/cycle |
| Internet Archive | Military field manuals (DTIC collection) | ~10 docs/cycle |
| PubMed Central | Open access medical/scientific papers | ~1,400 docs/cycle |
| OpenStax | Free textbook metadata (biology, chemistry, physics) | ~11 docs/cycle |
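
Structurally, a continuous scraper like this is a fetch loop with per-source cooldown and de-duplication. A toy sketch with a placeholder fetcher (not the real API clients in source_scraper.py):

```python
import time

def fetch_demo():
    # Placeholder for an HTTP fetch + parse step; returns (doc_id, title) pairs.
    return [("gutenberg:1342", "Pride and Prejudice"), ("gutenberg:84", "Frankenstein")]

def scrape_cycle(sources, seen, cooldown=0.0):
    """Run one pass over all sources, skipping already-ingested doc ids."""
    new_titles = []
    for fetch in sources:
        for doc_id, title in fetch():
            if doc_id not in seen:
                seen.add(doc_id)
                new_titles.append(title)
        time.sleep(cooldown)  # the real scraper waits 30 s between fetches
    return new_titles

seen = set()
first = scrape_cycle([fetch_demo], seen)
second = scrape_cycle([fetch_demo], seen)
print(first, second)
```

The second cycle returns nothing new because both doc ids were recorded on the first pass; that is what makes it safe to leave the loop running indefinitely.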

Using Gemma 3 for Document Refinement

You can optionally use a local LLM to refine, summarize, and improve scraped documents. This requires Ollama.

Set Up Ollama + Gemma 3

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 3 (choose a size for your GPU)
ollama pull gemma3:4b     # 4GB VRAM minimum
ollama pull gemma3:12b    # 8GB VRAM
ollama pull gemma3:27b    # 16GB+ VRAM

# Verify it's running
curl http://localhost:11434/api/tags

Refine Documents with Gemma

The refinement pipeline uses Gemma to:

  1. Summarize long documents into concise reference entries
  2. Extract better keywords than the automated TF-IDF approach
  3. Identify and fill knowledge gaps
  4. Generate research plans for new topics

# Run the automated refinement pipeline
python3 refine.py

# Or use the Claude/Gemma-powered refinement script
./refine-with-claude.sh

Custom Refinement Script

You can also write your own refinement using the Ollama API:

import requests
import db

db.init_db()
conn = db.get_conn()

# Get documents that need refinement (low relevance score)
docs = conn.execute("""
    SELECT id, title, content FROM documents
    WHERE relevance_score < 0.3
    ORDER BY created_at DESC LIMIT 50
""").fetchall()

for doc in docs:
    # Ask Gemma to summarize and extract keywords
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "gemma3:12b",
        "prompt": f"Summarize this document in 2-3 sentences and list 5 keywords:\n\n{doc['content'][:3000]}",
        "stream": False
    })
    summary = resp.json()["response"]
    print(f"Refined: {doc['title']}")
    print(f"  {summary[:200]}...")

Launch the Search Server

# Start the web UI (default port 8585)
python3 search_server.py

# Open in browser
# http://localhost:8585

The search UI features:

  • Dark theme with gradient accents
  • Real-time search with BM25 ranking
  • Source filtering (by domain/type)
  • Relevance feedback buttons (thumbs up/down on each result)
  • Document count and database stats
  • Mobile-responsive design

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| / | GET | Search UI (HTML) |
| /api/search?q=query | GET | Search documents |
| /api/stats | GET | Database statistics |
| /api/feedback | POST | Submit relevance feedback |
| /api/entities?q=query | GET | Search entities |
| /api/graph/:id | GET | Get a document's knowledge graph |

Access from Your Phone with Tailscale

Tailscale creates a secure mesh VPN so you can access your Knowledge Engine from anywhere — phone, tablet, laptop — without exposing it to the internet.

1. Install Tailscale on your server

# Linux (Debian/Ubuntu)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# It will print a URL — open it to authenticate
# Note your Tailscale IP (e.g., 100.x.y.z)
tailscale ip -4

2. Install Tailscale on your phone

Sign in with the same account you used on the server.

3. Start the search server (bind to all interfaces)

# The server already binds to 0.0.0.0, so it's accessible on all interfaces
python3 search_server.py

4. Access from your phone

Open your phone's browser and go to:

http://100.x.y.z:8585

Replace 100.x.y.z with your server's Tailscale IP. Bookmark it for quick access.

Optional: Run as a system service

# Create a systemd service so it starts on boot
sudo tee /etc/systemd/system/knowledge-engine.service << 'EOF'
[Unit]
Description=Knowledge Engine Search Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/path/to/knowledge-engine
ExecStart=/usr/bin/python3 search_server.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now knowledge-engine

Optional: Add to phone home screen

On both Android and iOS, you can add the search page as a PWA-like shortcut:

  1. Open http://100.x.y.z:8585 in your phone browser
  2. Tap the share/menu button
  3. Select "Add to Home Screen"
  4. Now you have a dedicated icon that opens directly to your knowledge engine

Entity Extraction & Knowledge Graph

# Extract entities from all documents (pattern-based, no LLM)
python3 ke.py extract-entities

# Search by entity
python3 ke.py entity-search "British Columbia"

# List all entities of a type
python3 ke.py entities --type person
python3 ke.py entities --type org
python3 ke.py entities --type law

# Explore document connections
python3 ke.py graph 42        # Show entity connections for doc 42
python3 ke.py related 42      # Find related documents via shared entities
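
Pattern-based extraction of this kind boils down to a dictionary of compiled regexes applied per type. An illustrative sketch with two stand-in patterns (entities.py defines the real ones for all seven types):

```python
import re

# Stand-in patterns for two of the seven entity types; illustrative only.
PATTERNS = {
    "law": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)* Act\b"),
    "org": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)* (?:Inc|Ltd|Corp)\b"),
}

def extract_entities(text):
    """Return (type, surface form) pairs for every pattern match."""
    hits = []
    for etype, pattern in PATTERNS.items():
        hits.extend((etype, match) for match in pattern.findall(text))
    return hits

hits = extract_entities("BC's Mineral Tenure Act governs claims staked by Placer Gold Ltd.")
print(hits)
```

Matching capitalized phrases ending in a marker word (Act, Ltd, ...) is what lets the engine skip the LLM entirely for this step.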

Entity Types

| Type | Examples | Count |
|---|---|---|
| person | Historical figures, authors, researchers | ~1,200 |
| org | Companies, institutions, government bodies | ~2,000 |
| law | Statutes, regulations, legal references | ~500 |
| plant | Medicinal and edible plants | ~90 |
| place | Cities, regions, geographic features | ~70 |
| mineral | Gold, silver, copper, geological terms | ~25 |
| concept | Technical concepts, methodologies | ~40 |
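
The graph traversal behind the related command can be pictured as ranking documents by shared-entity count. A toy model (made-up doc-to-entity data, not the real SQLite link tables):

```python
from collections import Counter

# Toy doc_id -> entity set mapping; the real links live in SQLite.
doc_entities = {
    1: {"British Columbia", "gold", "Fraser River"},
    2: {"British Columbia", "gold"},
    3: {"chamomile", "anxiety"},
}

def related_docs(doc_id):
    """Rank other documents by how many entities they share with doc_id."""
    shared = Counter()
    for other_id, entities in doc_entities.items():
        if other_id != doc_id:
            shared[other_id] = len(doc_entities[doc_id] & entities)
    return [d for d, count in shared.most_common() if count > 0]

print(related_docs(1))
```

Document 3 never appears in the results for document 1 because they share no entities; this set-intersection view is the whole discovery mechanism in miniature.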

Collections & Feedback

# Create a collection
python3 ke.py collection-create "Gold Prospecting" "Everything about finding gold in BC"

# Add documents to it
python3 ke.py collection-add 1 42
python3 ke.py collection-add 1 43

# List collections
python3 ke.py collections

# Give feedback to improve rankings
python3 ke.py feedback --id 42 --relevant true    # Boost this result
python3 ke.py feedback --id 55 --relevant false   # Demote this result
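
Conceptually, each feedback vote folds into a per-document boost that shifts the BM25 score at query time. A toy illustration (the actual weighting in smart_search.py will differ):

```python
def apply_feedback(boosts, doc_id, relevant, step=0.25):
    """Nudge a per-document boost up or down for each vote."""
    boosts[doc_id] = boosts.get(doc_id, 0.0) + (step if relevant else -step)

def rank(results, boosts):
    # results: (doc_id, bm25_score) pairs. FTS5 bm25() is negative, so
    # subtracting the boost and sorting ascending puts boosted docs first.
    return sorted(results, key=lambda r: r[1] - boosts.get(r[0], 0.0))

boosts = {}
apply_feedback(boosts, 42, relevant=True)    # feedback --id 42 --relevant true
apply_feedback(boosts, 55, relevant=False)   # feedback --id 55 --relevant false
ranked = rank([(42, -1.0), (55, -1.2)], boosts)
print(ranked)
```

Here document 55 had the better raw BM25 score, but the downvote demotes it below the upvoted document 42, which is exactly the behavior the two CLI commands above are training.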

Read-Only Consumer API

Other projects can consume your knowledge base without risking data integrity. See CONSUMERS.md for the full contract.

from ke_client import KEClient

with KEClient() as ke:
    # Full-text search
    docs = ke.search_documents("solar panel sizing", limit=10)
    for doc in docs:
        print(f"{doc['title']} (relevance: {doc['relevance_score']})")

    # Property search
    props = ke.search_properties("cabin", max_price=20000)

    # Check schema version
    versions = ke.schema_version()

Backup & Maintenance

# Nightly backup (add to cron: 0 3 * * *)
./run_nightly_backup.sh

# Decay old, low-relevance documents
python3 ke.py decay

# Run refinement pipeline
python3 refine.py
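
The decay step can be thought of as exponential down-weighting of stale scores. A toy sketch with an assumed 90-day half-life (ke.py's actual decay policy is not specified here):

```python
import time

def decayed_score(score, last_accessed, half_life_days=90.0, now=None):
    """Halve a relevance score for every half_life_days of inactivity."""
    now = time.time() if now is None else now
    age_days = (now - last_accessed) / 86400
    return score * 0.5 ** (age_days / half_life_days)

now = time.time()
# A document last touched one half-life ago keeps half its score.
print(round(decayed_score(1.0, now - 90 * 86400, now=now), 3))
```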

Project Structure

knowledge-engine/
├── ke.py                 # CLI interface (26+ commands)
├── db.py                 # SQLite + FTS5 database layer
├── entities.py           # Pattern-based entity extraction
├── smart_search.py       # BM25 + entity overlap + query expansion
├── scraper.py            # Base scraper (RSS, web, API)
├── source_scraper.py     # Continuous scraper for 6 public APIs
├── search_server.py      # Flask web UI + REST API
├── refine.py             # Document refinement pipeline
├── ke_client.py          # Read-only consumer library
├── properties_scraper.py # BC real estate scraper
├── search_ui/
│   └── index.html        # Dark-themed search interface
├── migrations/           # SQL schema evolution (7 migrations)
├── data/                 # Pre-built datasets
│   ├── bc_gold_mining.py
│   ├── bc_gold_analysis.py
│   ├── medicinal_plants.py
│   └── medicinal_plants_expanded.py
├── topics/               # Topic seed data
│   ├── food_seeds.py
│   └── medicinal_plants.py
├── expand_fishing.py     # BC fishing knowledge (15 docs)
├── expand_survival_equipment.py  # Survival + equipment (27 docs)
└── expand_offgrid_corp.py        # Off-grid + BC corp (27 docs)

Requirements

  • Python 3.10+
  • SQLite 3.35+ (with FTS5 support — included in most distributions)
  • Flask, flask-cors, requests, beautifulsoup4, feedparser
  • Optional: Ollama + Gemma 3 for document refinement

License

MIT — see LICENSE
