A minimal search engine using vector embeddings for semantic search over crawled web pages. Currently indexes over 6000 crawled pages.
- Install dependencies:
  `pip install -r requirements.txt` (for the Python scripts)
- Environment setup:
  - Copy `.env.example` to `.env` and fill in your Cloudflare credentials (AI, Vectorize bindings).
- Run the crawler:
  Manually add at least one domain to `state/seed.json`, then run `python playwright_crawler.py` (crawls one domain per run).
  After each run, the crawler collects every external domain it found and adds them to the seed, so the seed list grows with each crawl.
- Embed and vectorize pages:
  `python vector_embedder.py` (generates embeddings)
- Serve locally:
  - Run the FastAPI server: `python web_search_api.py` (serves at http://localhost:8000, with the frontend at /web)
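To illustrate the idea behind the semantic search that these steps set up, here is a minimal sketch of ranking pages by vector similarity. It assumes cosine similarity, which is a common choice; the metric the project actually configures is not stated here. The function names, URLs, and three-dimensional vectors are hypothetical stand-ins, not output from the project's embedding model, and in the real pipeline this lookup is handled by the vector store rather than by hand-written code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query_vec, page_vecs, top_k=3):
    """Return the top_k (url, score) pairs, most similar to the query first."""
    scored = [(url, cosine_similarity(query_vec, vec)) for url, vec in page_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
pages = {
    "https://example.com/python": [0.9, 0.1, 0.0],
    "https://example.com/cooking": [0.0, 0.2, 0.9],
    "https://example.com/rust": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

results = rank_pages(query, pages, top_k=2)
for url, score in results:
    print(f"{score:.3f}  {url}")
```

A brute-force scan like this is fine for a handful of vectors; a dedicated vector index is what keeps lookups fast across thousands of pages.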
Project layout:
- `playwright_crawler.py`: Web crawler using httpx and Playwright to scrape pages from a single domain per run.
- `vector_embedder.py`: Generates embeddings for crawled pages using a local model.
- `cf_vectorize.py`: Uploads embeddings to the Cloudflare Vectorize index.
- `web_search_api.py`: FastAPI server for the local search API and for serving the frontend.
- `index.html`: Frontend search interface.
- `tiny-search-worker/`: Hono-based Cloudflare Worker for the search API.
- `data/crawled_pages/`: JSON files of crawled page data (ignored in .gitignore).
- `data/milvus_vector_db.db`: Local vector database (Milvus).
- `state/`: JSON state files tracking crawl progress, seed URLs, and vectorization status.
- `crawled.json`: Aggregated crawled data.
Notes:
- Domain selection: Each crawler run picks one domain from the seed and scrapes it fully.
- Seed population: `state/seed.json` is initially populated by hand with starting URLs/domains; after that, each crawler run appends the external domains it discovers.
- State management: State files in `state/` track what has been crawled and vectorized. Crawled pages are ignored in .gitignore to keep the repo size manageable, but state is preserved for incremental updates.
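The crawl loop described in these notes can be sketched roughly as follows. This is an illustration under assumptions, not the crawler's actual code: the function names are hypothetical, the seed and crawled lists are kept in memory because the schema of the JSON files in `state/` is not shown here, and domains are compared by exact hostname, so `www.example.com` and `example.com` count as different.

```python
from urllib.parse import urlparse


def external_domains(crawled_domain, links):
    """Hostnames found in `links` that differ from the domain just crawled."""
    found = set()
    for link in links:
        host = urlparse(link).hostname  # None for relative links such as "/contact"
        if host and host != crawled_domain:
            found.add(host)
    return found


def pick_next_domain(seed, crawled):
    """First seed domain that has not been crawled yet, or None if all are done."""
    for domain in seed:
        if domain not in crawled:
            return domain
    return None


def update_state(seed, crawled, domain, links):
    """Mark `domain` as crawled and append newly discovered domains to the seed."""
    new_domains = sorted(external_domains(domain, links) - set(seed))
    return seed + new_domains, crawled + [domain]


# One simulated run with hypothetical data.
seed = ["example.com"]
crawled = []

domain = pick_next_domain(seed, crawled)
links = [
    "https://example.com/about",
    "https://docs.python.org/3/",
    "/contact",
    "https://www.wikipedia.org/",
    "https://docs.python.org/3/library/",
]
seed, crawled = update_state(seed, crawled, domain, links)

print(seed)
print(pick_next_domain(seed, crawled))
```

Keeping the crawled list separate from the seed is what makes runs incremental: a later run skips anything already crawled and moves on to the next unvisited domain.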