Knowledge Engine

A self-refining, offline-first knowledge base that scrapes, indexes, and ranks information using relevance feedback. No LLM required for core operation — uses BM25/FTS5 scoring, pattern-based entity extraction, and a feedback loop to improve results over time.

Optionally use Gemma 3 (via Ollama) to refine and summarize documents for higher-quality results.

[Screenshot: Knowledge Engine search UI, showing search results with BM25 ranking, source tags, keyword highlights, and inline images]

Features

  • Full-text search with SQLite FTS5 and BM25 ranking
  • Entity extraction — pattern-based (no LLM needed), 7 entity types: person, org, law, plant, place, mineral, concept
  • Knowledge graph — documents linked by shared entities, with graph traversal for discovery
  • Query expansion via synonym tables
  • Relevance feedback loop — mark results as relevant/irrelevant to train future rankings
  • Continuous scraping from 6 free public APIs: Project Gutenberg, arXiv, openFDA, Internet Archive, PubMed Central, OpenStax
  • Property scraper — BC real estate from public listings and tax sales
  • Collections for grouping related documents
  • Web search UI — dark-themed, responsive, works great on mobile
  • Read-only consumer API for downstream projects
  • Gemma 3 integration (optional) — use a local LLM to refine/summarize scraped documents
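
The ranking in the first bullet is plain SQLite: an FTS5 virtual table queried with the built-in bm25() auxiliary function. A minimal standalone sketch (toy table and data, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, content)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("Placer mining", "Gold recovery from creek gravels in BC"),
        ("Transformers", "Attention-based models for machine learning"),
    ],
)
# bm25() returns a negative number; more negative means a better match,
# so ORDER BY bm25(docs) puts the best hit first.
rows = conn.execute(
    "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("gold mining",),
).fetchall()
print(rows[0][0])
```

By default an FTS5 MATCH requires every term, and a term may occur in any indexed column, which is why the first document matches on "gold" (content) plus "mining" (title) while the second is excluded.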

Quick Start

# Clone and install
git clone https://github.com/CryptoDustinJ/knowledge-engine.git
cd knowledge-engine
pip install -r requirements.txt

# Initialize the database
python3 ke.py init

# Add your first sources
python3 ke.py add-source --type rss --name "Hacker News" --url "https://news.ycombinator.com/rss"
python3 ke.py add-source --type rss --name "Python News" --url "https://realpython.com/atom.xml"

# Scrape all sources
python3 ke.py scrape

# Or run the continuous API scraper (Gutenberg, arXiv, PubMed, etc.)
python3 source_scraper.py

# Query
python3 ke.py query "machine learning transformers"

# Show stats
python3 ke.py stats

Architecture

                         +------------------+
                         |   Search UI      |  <-- Flask web app (port 8585)
                         |   (browser)      |
                         +--------+---------+
                                  |
                         +--------v---------+
                         | search_server.py |  <-- REST API
                         +--------+---------+
                                  |
              +-------------------+-------------------+
              |                   |                   |
    +---------v------+  +---------v------+  +---------v------+
    | smart_search.py|  |  entities.py   |  |    db.py       |
    | BM25 + entity  |  | Pattern-based  |  | SQLite + FTS5  |
    | overlap +      |  | extraction     |  | WAL mode       |
    | query expansion|  | (7 types)      |  | migrations     |
    +----------------+  +----------------+  +--------+-------+
                                                     |
              +--------------------------------------+
              |                    |                  |
    +---------v------+  +---------v------+  +--------v-------+
    | source_scraper |  |   scraper.py   |  |  refine.py     |
    | (6 public APIs)|  | (RSS, web, API)|  | + Gemma 3 LLM  |
    +----------------+  +----------------+  +----------------+

Seeding with Data

The data/ directory has pre-built datasets you can ingest immediately:

# Load BC gold mining encyclopedia (1,600+ documents)
python3 -c "from data.bc_gold_mining import DOCUMENTS; import db; db.init_db(); [db.insert_document(4, d['title'], d['content'], keywords=d.get('keywords',[])) for d in DOCUMENTS]"

# Load medicinal plants database
python3 -c "from data.medicinal_plants import PLANTS; import db; db.init_db(); [db.insert_document(3, p['name'], p['content'], keywords=p.get('keywords',[])) for p in PLANTS]"

# Load fishing, survival, off-grid building, and more
python3 expand_fishing.py
python3 expand_survival_equipment.py
python3 expand_offgrid_corp.py

Available Datasets

| Dataset | Documents | Topics |
|---|---|---|
| BC Gold Mining | ~1,600 | Placer mining, lode deposits, history, regulations, geology |
| BC Gold Analysis | ~1,600 | Detailed geological analysis, creek-by-creek data |
| Medicinal Plants | ~200 | Traditional and evidence-based plant medicine |
| Fishing (BC) | 15 | Salmon, trout, halibut, techniques, regulations |
| Survival Skills | 15 | Water, fire, shelter, navigation, first aid |
| Equipment Knowledge | 12 | Chainsaw, generator, welding, small engine repair |
| Off-Grid Building | 15 | Solar, rainwater, septic, log building |
| BC Corporation | 12 | Incorporation, taxes, compliance, directors |

Scraping from Public APIs

The source_scraper.py script runs continuously, pulling from 6 free APIs:

# Run continuous scraper (30s cooldown between fetches)
python3 source_scraper.py

# Or run it as a service
# Create a systemd service or use screen/tmux:
screen -S ke-scraper
python3 source_scraper.py
# Ctrl+A, D to detach

Sources

| API | Content | Rate |
|---|---|---|
| Project Gutenberg (Gutendex) | Public domain books: science, history, philosophy, medicine | ~500 docs/cycle |
| arXiv | Research papers: AI, biology, physics, math | ~1,800 docs/cycle |
| openFDA | Drug labels and safety data | ~1,400 docs/cycle |
| Internet Archive | Military field manuals (DTIC collection) | ~10 docs/cycle |
| PubMed Central | Open access medical/scientific papers | ~1,400 docs/cycle |
| OpenStax | Free textbook metadata (biology, chemistry, physics) | ~11 docs/cycle |
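
Structurally, a continuous scraper like this is a fetch loop with per-source cooldown and de-duplication. A toy sketch with a placeholder fetcher (not the real API clients in source_scraper.py):

```python
import time

def fetch_demo():
    # Placeholder for an HTTP fetch + parse step; returns (doc_id, title) pairs.
    return [("gutenberg:1342", "Pride and Prejudice"), ("gutenberg:84", "Frankenstein")]

def scrape_cycle(sources, seen, cooldown=0.0):
    """Run one pass over all sources, skipping already-ingested doc ids."""
    new_titles = []
    for fetch in sources:
        for doc_id, title in fetch():
            if doc_id not in seen:
                seen.add(doc_id)
                new_titles.append(title)
        time.sleep(cooldown)  # the real scraper waits 30 s between fetches
    return new_titles

seen = set()
first = scrape_cycle([fetch_demo], seen)
second = scrape_cycle([fetch_demo], seen)
print(first, second)
```

The second cycle returns nothing new because both doc ids were recorded on the first pass; that is what makes it safe to leave the loop running indefinitely.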

Using Gemma 3 for Document Refinement

You can optionally use a local LLM to refine, summarize, and improve scraped documents. This requires Ollama.

Set Up Ollama + Gemma 3

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 3 (choose a size for your GPU)
ollama pull gemma3:4b     # 4GB VRAM minimum
ollama pull gemma3:12b    # 8GB VRAM
ollama pull gemma3:27b    # 16GB+ VRAM

# Verify it's running
curl http://localhost:11434/api/tags

Refine Documents with Gemma

The refinement pipeline uses Gemma to:

  1. Summarize long documents into concise reference entries
  2. Extract better keywords than the automated TF-IDF approach
  3. Identify and fill knowledge gaps
  4. Generate research plans for new topics

# Run the automated refinement pipeline
python3 refine.py

# Or use the Claude/Gemma-powered refinement script
./refine-with-claude.sh

Custom Refinement Script

You can also write your own refinement using the Ollama API:

import requests
import db

db.init_db()
conn = db.get_conn()

# Get documents that need refinement (low relevance score)
docs = conn.execute("""
    SELECT id, title, content FROM documents
    WHERE relevance_score < 0.3
    ORDER BY created_at DESC LIMIT 50
""").fetchall()

for doc in docs:
    # Ask Gemma to summarize and extract keywords
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "gemma3:12b",
        "prompt": f"Summarize this document in 2-3 sentences and list 5 keywords:\n\n{doc['content'][:3000]}",
        "stream": False
    })
    summary = resp.json()["response"]
    print(f"Refined: {doc['title']}")
    print(f"  {summary[:200]}...")

Launch the Search Server

# Start the web UI (default port 8585)
python3 search_server.py

# Open in browser
# http://localhost:8585

The search UI features:

  • Dark theme with gradient accents
  • Real-time search with BM25 ranking
  • Source filtering (by domain/type)
  • Relevance feedback buttons (thumbs up/down on each result)
  • Document count and database stats
  • Mobile-responsive design

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| / | GET | Search UI (HTML) |
| /api/search?q=query | GET | Search documents |
| /api/stats | GET | Database statistics |
| /api/feedback | POST | Submit relevance feedback |
| /api/entities?q=query | GET | Search entities |
| /api/graph/:id | GET | Get a document's knowledge graph |

Access from Your Phone with Tailscale

Tailscale creates a secure mesh VPN so you can access your Knowledge Engine from anywhere — phone, tablet, laptop — without exposing it to the internet.

1. Install Tailscale on your server

# Linux (Debian/Ubuntu)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# It will print a URL — open it to authenticate
# Note your Tailscale IP (e.g., 100.x.y.z)
tailscale ip -4

2. Install Tailscale on your phone

Sign in with the same account you used on the server.

3. Start the search server (bind to all interfaces)

# The server already binds to 0.0.0.0, so it's accessible on all interfaces
python3 search_server.py

4. Access from your phone

Open your phone's browser and go to:

http://100.x.y.z:8585

Replace 100.x.y.z with your server's Tailscale IP. Bookmark it for quick access.

Optional: Run as a system service

# Create a systemd service so it starts on boot
sudo tee /etc/systemd/system/knowledge-engine.service << 'EOF'
[Unit]
Description=Knowledge Engine Search Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/path/to/knowledge-engine
ExecStart=/usr/bin/python3 search_server.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now knowledge-engine

Optional: Add to phone home screen

On both Android and iOS, you can add the search page as a PWA-like shortcut:

  1. Open http://100.x.y.z:8585 in your phone browser
  2. Tap the share/menu button
  3. Select "Add to Home Screen"
  4. Now you have a dedicated icon that opens directly to your knowledge engine

Entity Extraction & Knowledge Graph

# Extract entities from all documents (pattern-based, no LLM)
python3 ke.py extract-entities

# Search by entity
python3 ke.py entity-search "British Columbia"

# List all entities of a type
python3 ke.py entities --type person
python3 ke.py entities --type org
python3 ke.py entities --type law

# Explore document connections
python3 ke.py graph 42        # Show entity connections for doc 42
python3 ke.py related 42      # Find related documents via shared entities
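
Pattern-based extraction of this kind boils down to a dictionary of compiled regexes applied per type. An illustrative sketch with two stand-in patterns (entities.py defines the real ones for all seven types):

```python
import re

# Stand-in patterns for two of the seven entity types; illustrative only.
PATTERNS = {
    "law": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)* Act\b"),
    "org": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)* (?:Inc|Ltd|Corp)\b"),
}

def extract_entities(text):
    """Return (type, surface form) pairs for every pattern match."""
    hits = []
    for etype, pattern in PATTERNS.items():
        hits.extend((etype, match) for match in pattern.findall(text))
    return hits

hits = extract_entities("BC's Mineral Tenure Act governs claims staked by Placer Gold Ltd.")
print(hits)
```

Matching capitalized phrases ending in a marker word (Act, Ltd, ...) is what lets the engine skip the LLM entirely for this step.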

Entity Types

| Type | Examples | Count |
|---|---|---|
| person | Historical figures, authors, researchers | ~1,200 |
| org | Companies, institutions, government bodies | ~2,000 |
| law | Statutes, regulations, legal references | ~500 |
| plant | Medicinal and edible plants | ~90 |
| place | Cities, regions, geographic features | ~70 |
| mineral | Gold, silver, copper, geological terms | ~25 |
| concept | Technical concepts, methodologies | ~40 |
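
The graph traversal behind the related command can be pictured as ranking documents by shared-entity count. A toy model (made-up doc-to-entity data, not the real SQLite link tables):

```python
from collections import Counter

# Toy doc_id -> entity set mapping; the real links live in SQLite.
doc_entities = {
    1: {"British Columbia", "gold", "Fraser River"},
    2: {"British Columbia", "gold"},
    3: {"chamomile", "anxiety"},
}

def related_docs(doc_id):
    """Rank other documents by how many entities they share with doc_id."""
    shared = Counter()
    for other_id, entities in doc_entities.items():
        if other_id != doc_id:
            shared[other_id] = len(doc_entities[doc_id] & entities)
    return [d for d, count in shared.most_common() if count > 0]

print(related_docs(1))
```

Document 3 never appears in the results for document 1 because they share no entities; this set-intersection view is the whole discovery mechanism in miniature.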

Collections & Feedback

# Create a collection
python3 ke.py collection-create "Gold Prospecting" "Everything about finding gold in BC"

# Add documents to it
python3 ke.py collection-add 1 42
python3 ke.py collection-add 1 43

# List collections
python3 ke.py collections

# Give feedback to improve rankings
python3 ke.py feedback --id 42 --relevant true    # Boost this result
python3 ke.py feedback --id 55 --relevant false   # Demote this result
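
Conceptually, each feedback vote folds into a per-document boost that shifts the BM25 score at query time. A toy illustration (the actual weighting in smart_search.py will differ):

```python
def apply_feedback(boosts, doc_id, relevant, step=0.25):
    """Nudge a per-document boost up or down for each vote."""
    boosts[doc_id] = boosts.get(doc_id, 0.0) + (step if relevant else -step)

def rank(results, boosts):
    # results: (doc_id, bm25_score) pairs. FTS5 bm25() is negative, so
    # subtracting the boost and sorting ascending puts boosted docs first.
    return sorted(results, key=lambda r: r[1] - boosts.get(r[0], 0.0))

boosts = {}
apply_feedback(boosts, 42, relevant=True)    # feedback --id 42 --relevant true
apply_feedback(boosts, 55, relevant=False)   # feedback --id 55 --relevant false
ranked = rank([(42, -1.0), (55, -1.2)], boosts)
print(ranked)
```

Here document 55 had the better raw BM25 score, but the downvote demotes it below the upvoted document 42, which is exactly the behavior the two CLI commands above are training.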

Read-Only Consumer API

Other projects can consume your knowledge base without risking data integrity. See CONSUMERS.md for the full contract.

from ke_client import KEClient

with KEClient() as ke:
    # Full-text search
    docs = ke.search_documents("solar panel sizing", limit=10)
    for doc in docs:
        print(f"{doc['title']} (relevance: {doc['relevance_score']})")

    # Property search
    props = ke.search_properties("cabin", max_price=20000)

    # Check schema version
    versions = ke.schema_version()

Backup & Maintenance

# Nightly backup (add to cron: 0 3 * * *)
./run_nightly_backup.sh

# Decay old, low-relevance documents
python3 ke.py decay

# Run refinement pipeline
python3 refine.py
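
The decay step can be thought of as exponential down-weighting of stale scores. A toy sketch with an assumed 90-day half-life (ke.py's actual decay policy is not specified here):

```python
import time

def decayed_score(score, last_accessed, half_life_days=90.0, now=None):
    """Halve a relevance score for every half_life_days of inactivity."""
    now = time.time() if now is None else now
    age_days = (now - last_accessed) / 86400
    return score * 0.5 ** (age_days / half_life_days)

now = time.time()
# A document last touched one half-life ago keeps half its score.
print(round(decayed_score(1.0, now - 90 * 86400, now=now), 3))
```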

Project Structure

knowledge-engine/
├── ke.py                 # CLI interface (26+ commands)
├── db.py                 # SQLite + FTS5 database layer
├── entities.py           # Pattern-based entity extraction
├── smart_search.py       # BM25 + entity overlap + query expansion
├── scraper.py            # Base scraper (RSS, web, API)
├── source_scraper.py     # Continuous scraper for 6 public APIs
├── search_server.py      # Flask web UI + REST API
├── refine.py             # Document refinement pipeline
├── ke_client.py          # Read-only consumer library
├── properties_scraper.py # BC real estate scraper
├── search_ui/
│   └── index.html        # Dark-themed search interface
├── migrations/           # SQL schema evolution (7 migrations)
├── data/                 # Pre-built datasets
│   ├── bc_gold_mining.py
│   ├── bc_gold_analysis.py
│   ├── medicinal_plants.py
│   └── medicinal_plants_expanded.py
├── topics/               # Topic seed data
│   ├── food_seeds.py
│   └── medicinal_plants.py
├── expand_fishing.py     # BC fishing knowledge (15 docs)
├── expand_survival_equipment.py  # Survival + equipment (27 docs)
└── expand_offgrid_corp.py        # Off-grid + BC corp (27 docs)

Requirements

  • Python 3.10+
  • SQLite 3.35+ (with FTS5 support — included in most distributions)
  • Flask, flask-cors, requests, beautifulsoup4, feedparser
  • Optional: Ollama + Gemma 3 for document refinement

License

MIT — see LICENSE
