Abdulmumin1/tiny.search

tiny.search

A minimal search engine using vector embeddings for semantic search over crawled web pages. Currently indexes over 6000 crawled pages.
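To make "semantic search over vector embeddings" concrete, here is a minimal sketch of the ranking step: score each page vector against a query vector by cosine similarity and return the best matches. The function names, the toy 3-dimensional vectors, and the example.com URLs are assumptions for illustration only; the project itself delegates this to Milvus or Cloudflare Vectorize, and the actual model and distance metric are not stated here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_pages(query_vec, page_vecs, top_k=3):
    """Return the top_k (url, score) pairs, most similar first."""
    scored = [(url, cosine_similarity(query_vec, vec)) for url, vec in page_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); real embeddings have hundreds of dimensions.
pages = {
    "https://example.com/a": [1.0, 0.0, 0.0],
    "https://example.com/b": [0.7, 0.7, 0.0],
    "https://example.com/c": [0.0, 0.0, 1.0],
}
results = rank_pages([1.0, 0.1, 0.0], pages, top_k=2)
for url, score in results:
    print(f"{score:.3f}  {url}")
```

A vector database performs the same comparison with an approximate index, so it stays fast as the number of pages grows.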

Local Setup

  1. Install dependencies:

    pip install -r requirements.txt  # For Python scripts
  2. Environment setup:

    • Copy .env.example to .env and fill in your Cloudflare credentials (AI, Vectorize bindings).
  3. Run the crawler:

    Manually add at least one domain to state/seed.json, then run:

    python playwright_crawler.py  # Crawls one domain per run

    After each run, the crawler collects every external domain it found and adds it to the seed, so the seed list grows with each crawl.

  4. Embed and vectorize pages:

    python vector_embedder.py  # Generates embeddings
  5. Serve locally:

    • Run the FastAPI server: python web_search_api.py (serves at http://localhost:8000, with frontend at /web)
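The seed growth described in step 3 can be sketched roughly as follows. This is a minimal illustration, not the code in playwright_crawler.py: the helper names `external_domains` and `grow_seed` are hypothetical, and it assumes the seed is a flat list of domain strings and that subdomains of the crawled domain count as internal.

```python
from urllib.parse import urlparse


def external_domains(crawled_domain, links):
    """Collect hostnames from links that point outside crawled_domain."""
    found = set()
    for link in links:
        parsed = urlparse(link)
        host = parsed.hostname  # already lowercased by urlparse; None if absent
        if parsed.scheme not in ("http", "https") or not host:
            continue  # skip relative links, mailto:, javascript:, etc.
        if host == crawled_domain or host.endswith("." + crawled_domain):
            continue  # same site or one of its subdomains
        found.add(host)
    return found


def grow_seed(seed, new_domains):
    """Return the seed list with unseen domains appended in sorted order."""
    known = set(seed)
    return seed + sorted(d for d in new_domains if d not in known)


links = [
    "https://example.com/about",
    "https://blog.example.com/post",
    "https://other.org/x",
    "/relative/path",
    "mailto:hi@example.com",
    "http://news.site.net/a",
]
found = external_domains("example.com", links)
seed = grow_seed(["example.com", "other.org"], found)
print(seed)
```

Appending only unseen domains keeps existing entries in place, so the seed file does not accumulate duplicates across runs.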

Files Overview

  • playwright_crawler.py: Web crawler using httpx and Playwright to scrape pages from a single domain per run.
  • vector_embedder.py: Generates embeddings for crawled pages using a local model.
  • cf_vectorize.py: Uploads embeddings to Cloudflare Vectorize index.
  • web_search_api.py: FastAPI server for local search API and serving the frontend.
  • index.html: Frontend search interface.
  • tiny-search-worker/: Hono-based Cloudflare Worker for the search API.
  • data/crawled_pages/: JSON files of crawled page data (ignored in .gitignore).
  • data/milvus_vector_db.db: Local vector database (Milvus).
  • state/: JSON state files tracking crawl progress, seed URLs, and vectorization status.
  • crawled.json: Aggregated crawled data.

Crawling Details

  • Domain selection: Each crawler run picks one domain from the seed and scrapes it fully.
  • Seed population: The state/seed.json is initially populated by hand with starting URLs/domains; each crawler run then appends the external domains it discovers.
  • State management: State files in state/ track what's been crawled and vectorized. Crawled pages are ignored in .gitignore to keep repo size manageable, but state is preserved for incremental updates.
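The one-domain-per-run behaviour and the state tracking above could look something like the sketch below. The file name crawled_domains.json, the flat-list JSON layout, and the functions `pick_next_domain` and `run_once` are all assumptions for illustration; the actual state file formats in state/ may differ, and the real crawl step is replaced by a comment.

```python
import json
import tempfile
from pathlib import Path


def load_list(path):
    """Read a JSON list from path, or return an empty list if it is missing."""
    path = Path(path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_list(path, items):
    """Write items to path as pretty-printed JSON."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(items, indent=2), encoding="utf-8")


def pick_next_domain(seed, crawled):
    """Return the first seed domain that has not been crawled, else None."""
    done = set(crawled)
    for domain in seed:
        if domain not in done:
            return domain
    return None


def run_once(state_dir):
    """Process a single domain and record it in the (hypothetical) state file."""
    state_dir = Path(state_dir)
    seed = load_list(state_dir / "seed.json")
    crawled = load_list(state_dir / "crawled_domains.json")  # assumed file name
    domain = pick_next_domain(seed, crawled)
    if domain is None:
        return None  # nothing left to crawl
    # The real crawl of `domain` would happen here.
    crawled.append(domain)
    save_list(state_dir / "crawled_domains.json", crawled)
    return domain


# Demonstration in a throwaway directory so no real state is touched.
with tempfile.TemporaryDirectory() as tmp:
    save_list(Path(tmp) / "seed.json", ["example.com", "example.org"])
    first = run_once(tmp)
    second = run_once(tmp)
    third = run_once(tmp)
print(first, second, third)
```

Because the crawled list is persisted after every run, repeated invocations move through the seed one domain at a time and stop once every entry has been processed.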
