tiny.search


A minimal search engine using vector embeddings for semantic search over crawled web pages. Currently indexes over 6000 crawled pages.
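To illustrate what "semantic search over vector embeddings" means, here is a minimal sketch that ranks pages by cosine similarity between a query vector and stored page vectors. It is not the project's code: the real search goes through Milvus locally or Cloudflare Vectorize, and the similarity metric they are configured with is not stated here. The names `cosine_similarity`, `top_k`, and the toy two-dimensional vectors are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, page_vecs, k=3):
    """Return the URLs of the k pages whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), url) for url, vec in page_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [url for _, url in scored[:k]]


# Toy data: real embeddings have hundreds of dimensions.
pages = {
    "https://a.example/": [1.0, 0.0],
    "https://b.example/": [0.0, 1.0],
    "https://c.example/": [0.7, 0.7],
}
best = top_k([1.0, 0.1], pages, k=2)
```

A vector database does the same ranking, but with an index that avoids comparing the query against every stored vector.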

Local Setup

Install dependencies:

pip install -r requirements.txt  # For Python scripts

Environment setup:
- Copy .env.example to .env and fill in your Cloudflare credentials (AI, Vectorize bindings).

Run the crawler:

Manually add at least one domain to state/seed.json, then run:
```
python playwright_crawler.py  # Crawls one domain per run
```
After each run, the crawler collects every external domain it found and adds them to the seed, so the seed list grows with each crawl.
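The seed growth described above can be sketched roughly as follows. This is an illustration, not the code in playwright_crawler.py: it assumes the seed is a flat list of domain strings, and the helper names `external_domains` and `grow_seed` are hypothetical.

```python
from urllib.parse import urlparse


def external_domains(links, crawled_domain):
    """Return the set of hostnames in `links` that differ from `crawled_domain`."""
    found = set()
    for link in links:
        host = urlparse(link).hostname
        # Relative links and mailto: links have no hostname and are skipped.
        # Subdomains (e.g. www.example.com) count as external in this sketch.
        if host and host != crawled_domain:
            found.add(host)
    return found


def grow_seed(seed, links, crawled_domain):
    """Append newly discovered domains to `seed` in place, without duplicates."""
    known = set(seed)
    for domain in sorted(external_domains(links, crawled_domain)):
        if domain not in known:
            seed.append(domain)
            known.add(domain)
    return seed
```

For example, calling `grow_seed(["example.com"], ["https://other.org/x", "/about"], "example.com")` leaves the seed as `["example.com", "other.org"]`.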

Embed and vectorize pages:

python vector_embedder.py  # Generates embeddings

Serve locally:
- Run the FastAPI server: python web_search_api.py (serves at http://localhost:8000, with frontend at /web)

Files Overview

playwright_crawler.py: Web crawler using httpx and Playwright to scrape pages from a single domain per run.
vector_embedder.py: Generates embeddings for crawled pages using a local model.
cf_vectorize.py: Uploads embeddings to Cloudflare Vectorize index.
web_search_api.py: FastAPI server for local search API and serving the frontend.
index.html: Frontend search interface.
tiny-search-worker/: Hono-based Cloudflare Worker for the search API.
data/crawled_pages/: JSON files of crawled page data (ignored in .gitignore).
data/milvus_vector_db.db: Local vector database (Milvus).
state/: JSON state files tracking crawl progress, seed URLs, and vectorization status.
crawled.json: Aggregated crawled data.

Crawling Details

Domain selection: Each crawler run picks one domain from the seed and scrapes it fully.
Seed population: state/seed.json is manually populated with starting URLs/domains; after each run the crawler appends the external domains it discovered.
State management: State files in state/ track what's been crawled and vectorized. Crawled pages are ignored in .gitignore to keep repo size manageable, but state is preserved for incremental updates.
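As a rough sketch of how domain selection and state tracking could fit together, the snippet below picks the first seed domain that has not been crawled and persists the updated list so the next run moves on. The actual layout of the files in state/ is not documented here, so the flat-list format, the file name passed as `state_path`, and the helpers `load_json`, `next_domain`, and `mark_crawled` are all assumptions.

```python
import json
from pathlib import Path


def load_json(path, default):
    """Read a JSON file, returning `default` if it does not exist yet."""
    path = Path(path)
    if not path.exists():
        return default
    return json.loads(path.read_text(encoding="utf-8"))


def next_domain(seed, crawled):
    """Return the first seed domain that has not been crawled, or None."""
    done = set(crawled)
    for domain in seed:
        if domain not in done:
            return domain
    return None


def mark_crawled(state_path, crawled, domain):
    """Record `domain` as crawled and persist the list so later runs skip it."""
    crawled.append(domain)
    Path(state_path).write_text(json.dumps(crawled, indent=2), encoding="utf-8")
```

Because the crawled list is written to disk after each run, repeated runs walk through the seed one domain at a time instead of starting over.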
