Builds a knowledge base that scrapes developer documentation and makes it searchable using AI embeddings.
Turns documentation websites into a searchable database:
- Scrapes Node.js, TypeScript, Python, and JavaScript docs
- Converts HTML to clean Markdown
- Creates vector embeddings for semantic understanding
- Stores everything in a local SQLite database
- Searches using natural language queries
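
To make the search step concrete, here is a minimal in-memory sketch of ranking stored entries against a query vector by cosine similarity. The project itself lists sqlite-vec for similarity search, so this plain TypeScript version is a stand-in, not the project's implementation. The `Entry` shape, the function names, the toy vectors, and the choice of cosine similarity as the metric are all assumptions made for illustration.

```typescript
// Hypothetical in-memory stand-in for rows of stored embeddings.
interface Entry {
  source: string
  content: string
  vector: number[]
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction, so treat it as matching nothing.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Rank entries by similarity to the query vector and keep the top k.
function search(query: number[], entries: Entry[], k: number): Entry[] {
  return entries
    .map(entry => ({ entry, score: cosineSimilarity(query, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(scored => scored.entry)
}

// Toy 3-dimensional vectors; real embeddings have far more dimensions.
const sampleEntries: Entry[] = [
  { source: 'nodejs', content: 'fs.readFile reads a file', vector: [1, 0, 0] },
  { source: 'python', content: 'open() opens a file', vector: [0.9, 0.1, 0] },
  { source: 'typescript', content: 'Generics constrain types', vector: [0, 0, 1] }
]

const topResults = search([1, 0, 0], sampleEntries, 2)
// The nodejs and python entries rank ahead of the typescript entry.
console.log(topResults.map(e => e.source))
```

In a natural language search, the query vector would come from encoding the query text with the same model used for the documents, so that both live in the same vector space.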
npm install
npm run build

npx tsx src/index.ts
# or
node dist/index.js # after building

This will:
- Initialize the embedding encoder
- Scrape configured documentation sources
- Convert HTML to Markdown
- Generate vector embeddings
- Store in SQLite database
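
The hand-off between the last two steps can be sketched as follows: an embedding is packed into raw bytes so it fits a BLOB column, then unpacked when read back. This is a sketch under assumptions, not the project's code. It assumes vectors are stored as raw float32 bytes in the platform's native byte order, and the helper names `vectorToBlob` and `blobToVector` are hypothetical.

```typescript
// Pack a float32 embedding into raw bytes suitable for a BLOB column.
function vectorToBlob(vector: Float32Array): Uint8Array {
  // View the same memory as bytes, then copy so the result does not alias the input.
  return new Uint8Array(vector.buffer, vector.byteOffset, vector.byteLength).slice()
}

// Unpack raw bytes read from a BLOB column back into a float32 embedding.
function blobToVector(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error('BLOB length must be a multiple of 4 bytes')
  }
  // Copy into a fresh buffer first so the float32 view starts at an aligned offset.
  const copy = blob.slice()
  return new Float32Array(copy.buffer)
}

const sampleEmbedding = new Float32Array([0.25, -1.5, 3])
const packed = vectorToBlob(sampleEmbedding)
const restored = blobToVector(packed)

// Three float32 values occupy 12 bytes and come back unchanged.
console.log(packed.byteLength, Array.from(restored))
```

Copying on both sides costs a little memory but avoids surprises when the source buffer is reused or when a driver hands back a view into a larger buffer.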
Edit src/Processor.ts to modify scraping sources:
export const scraperSchema: ScrapeSchema[] = [
  {
    url: ['https://nodejs.org/docs/latest-v24.x/api/'],
    parse: (content: string): string | null => {
      // Custom parsing logic goes here; returning the content unchanged is a placeholder
      return content
    }
  }
  // Add more sources...
]

src/
├── core/                 # Core functionality
│   ├── Database.ts       # SQLite + sqlite-vec vector storage
│   └── embedding/        # Vector operations
│       ├── Encoder.ts    # Text to vector conversion
│       └── Decoder.ts    # Vector similarity search
├── interfaces/           # TypeScript type definitions
├── utils/                # Utility classes
│   ├── Scraper.ts        # Web scraping with memory management
│   ├── Logger.ts         # Logging utilities
│   └── Generator.ts      # ID generation
├── index.ts              # Main entry point
└── Processor.ts          # HTML to Markdown + embedding pipeline
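
Processor.ts, the last entry in the tree, is also where the scraper sources and their `parse` hooks are configured. As an illustration of what such a hook might do, the sketch below keeps only the first `<main>` region of a page and returns `null` when there is none. Everything here is hypothetical: `ScrapeSchemaSketch` is a local stand-in for the project's `ScrapeSchema` type (whose full shape is not shown in this README), `extractMain` is an invented helper, and reading `null` as "skip this page" is an assumption. The regular expression is a dependency-free stand-in; for real pages a DOM parser such as jsdom, which the project already depends on, would be more robust.

```typescript
// Local stand-in for the project's ScrapeSchema type; the real definition
// lives in the repository and may have more fields.
interface ScrapeSchemaSketch {
  url: string[]
  parse: (content: string) => string | null
}

// Hypothetical hook: keep only the first <main>...</main> region of a page.
function extractMain(content: string): string | null {
  const match = /<main\b[^>]*>([\s\S]*?)<\/main>/i.exec(content)
  if (match === null) return null
  const inner = match[1].trim()
  // An empty region carries nothing worth indexing.
  return inner.length > 0 ? inner : null
}

const sketchSchema: ScrapeSchemaSketch[] = [
  {
    url: ['https://nodejs.org/docs/latest-v24.x/api/'],
    parse: extractMain
  }
]

const samplePage =
  '<html><body><nav>Menu</nav><main id="content"><h1>File system</h1></main></body></html>'

// Prints the inner HTML of the main region: <h1>File system</h1>
console.log(sketchSchema[0].parse(samplePage))
// Prints null because the page has no main region.
console.log(sketchSchema[0].parse('<html><body>No main region</body></html>'))
```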
- @neabyte/fetch - HTTP client with retry logic
- @xenova/transformers - Vector embeddings
- better-sqlite3 - SQLite database
- sqlite-vec - Vector similarity search
- turndown - HTML to Markdown conversion
- jsdom - DOM parsing
CREATE TABLE embedding (
  id TEXT PRIMARY KEY,
  source TEXT NOT NULL,
  content TEXT NOT NULL,
  vector BLOB NOT NULL,
  timestamp INTEGER NOT NULL
)

This project is licensed under the MIT license. See the LICENSE file for more info.