Dev Knowledge

A knowledge base builder that scrapes developer documentation and makes it searchable with AI embeddings.

🚀 What it does

Turns documentation websites into a searchable database:

Scrapes Node.js, TypeScript, Python, and JavaScript docs
Converts HTML to clean Markdown
Creates vector embeddings for semantic understanding
Stores everything in a local SQLite database
Searches using natural language queries
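
To illustrate what "semantic" search over embeddings means, here is a minimal sketch of ranking stored vectors against a query vector by cosine similarity. It is an illustration only: in this project the similarity search is handled by sqlite-vec, and the `Doc` shape, `cosineSimilarity`, and `topK` are hypothetical names that do not appear in the codebase.

```typescript
// Hypothetical record shape; the real table stores more fields.
interface Doc {
  id: string
  vector: number[]
}

// Cosine similarity between two equal-length vectors (1 = same direction).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector lengths differ')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the ids of the k documents most similar to the query vector.
function topK(query: number[], docs: Doc[], k: number): string[] {
  return docs
    .map(doc => ({ id: doc.id, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(hit => hit.id)
}
```

A natural language query would first be embedded with the same encoder as the documents, then passed to something like `topK` as the query vector.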

📦 Installation

npm install
npm run build

🎯 Usage

Scraping and Indexing

npx tsx src/index.ts
# or
node dist/index.js      # after building

This will:

Initialize the embedding encoder
Scrape configured documentation sources
Convert HTML to Markdown
Generate vector embeddings
Store in SQLite database
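
The order of the steps above can be sketched as a small loop over the source URLs. This is a rough outline under stated assumptions, not the project's actual code: the `Stages` interface and `runPipeline` are hypothetical, and the real work is done by the scraper, processor, encoder, and database classes listed under Project Structure.

```typescript
// Hypothetical stage signatures; the project's real classes may differ.
interface Stages {
  fetchHtml: (url: string) => Promise<string>
  toMarkdown: (html: string) => string
  embed: (text: string) => Promise<number[]>
  store: (row: { source: string; content: string; vector: number[] }) => void
}

// Run fetch -> convert -> embed -> store for each URL, in order.
// Returns the number of rows that were stored.
async function runPipeline(urls: string[], stages: Stages): Promise<number> {
  let stored = 0
  for (const url of urls) {
    const html = await stages.fetchHtml(url)
    const content = stages.toMarkdown(html)
    if (content.trim() === '') continue // skip pages with no usable text
    const vector = await stages.embed(content)
    stages.store({ source: url, content, vector })
    stored++
  }
  return stored
}
```

Passing the stages in as functions keeps the sketch independent of any particular HTTP client, converter, or embedding model.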

⚙️ Configuration

Edit src/Processor.ts to modify scraping sources:

export const scraperSchema: ScrapeSchema[] = [
  {
    url: ['https://nodejs.org/docs/latest-v24.x/api/'],
    parse: (content: string): string | null => {
      // Custom parsing logic goes here.
      // Placeholder: pass the content through unchanged.
      return content
    }
  }
  // Add more sources...
]
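
As an illustration of what a `parse` function might look like, the sketch below keeps only the contents of a page's `<main>` element and returns `null` when there is none. Several things here are assumptions: the `ScrapeSchema` shape is inferred from the snippet above, `extractMain` is a hypothetical helper, and treating `null` as "skip this page" is a guess at how the processor interprets it. A regular expression is used to keep the example dependency-free; a DOM parser such as jsdom would be more robust on real pages.

```typescript
// Assumed shape, inferred from the configuration snippet; the real
// interface presumably lives under src/interfaces/.
interface ScrapeSchema {
  url: string[]
  parse: (content: string) => string | null
}

// Hypothetical parser: keep only the inside of the first <main> element,
// or return null when the page has no non-empty <main>.
function extractMain(content: string): string | null {
  const match = /<main\b[^>]*>([\s\S]*?)<\/main>/i.exec(content)
  if (match === null) return null
  const inner = match[1].trim()
  return inner === '' ? null : inner
}

const scraperSchema: ScrapeSchema[] = [
  {
    url: ['https://nodejs.org/docs/latest-v24.x/api/'],
    parse: extractMain
  }
]
```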

🏗️ Project Structure

src/
├── core/                    # Core functionality
│   ├── Database.ts          # SQLite + sqlite-vec vector storage
│   └── embedding/           # Vector operations
│       ├── Encoder.ts       # Text to vector conversion
│       └── Decoder.ts       # Vector similarity search
├── interfaces/              # TypeScript type definitions
├── utils/                   # Utility classes
│   ├── Scraper.ts           # Web scraping with memory management
│   ├── Logger.ts            # Logging utilities
│   └── Generator.ts         # ID generation
├── index.ts                 # Main entry point
└── Processor.ts             # HTML to Markdown + embedding pipeline

📚 Dependencies

@neabyte/fetch - HTTP client with retry logic
@xenova/transformers - Vector embeddings
better-sqlite3 - SQLite database
sqlite-vec - Vector similarity search
turndown - HTML to Markdown conversion
jsdom - DOM parsing

🗄️ Database Schema

CREATE TABLE embedding (
  id TEXT PRIMARY KEY,
  source TEXT NOT NULL,
  content TEXT NOT NULL,
  vector BLOB NOT NULL,
  timestamp INTEGER NOT NULL
)
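
The `vector` column is a BLOB, so each embedding has to be serialized to bytes before it is written and decoded after it is read. The README does not show how the project does this; the sketch below uses one common convention, raw 32-bit floats in the platform's native byte order. `vectorToBlob` and `blobToVector` are hypothetical helper names.

```typescript
// Encode a vector as raw float32 bytes (4 bytes per dimension).
// Values are rounded to float32 precision and use native byte order.
function vectorToBlob(vector: number[]): Uint8Array {
  const floats = Float32Array.from(vector)
  return new Uint8Array(floats.buffer)
}

// Decode float32 bytes back into numbers.
function blobToVector(blob: Uint8Array): number[] {
  if (blob.byteLength % 4 !== 0) {
    throw new Error('BLOB length is not a multiple of 4')
  }
  // Copy into a fresh buffer so the Float32Array view is always aligned,
  // even if the input is a view into a larger pooled buffer.
  const copy = new Uint8Array(blob)
  return Array.from(new Float32Array(copy.buffer))
}
```

With this layout a 384-dimensional embedding occupies 1536 bytes, which is considerably more compact than storing the vector as JSON text.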

📄 License

This project is licensed under the MIT license. See the LICENSE file for more info.
