This project is a local document-based RAG (Retrieval-Augmented Generation) system built using:
- Node.js (Express server)
- Supabase (Postgres + pgvector)
- OpenAI Embeddings + Chat Models
- pdf-parse (PDF text extraction)
Instead of requiring uploads, this app automatically reads all PDF files in a local folder (MyDocs/), extracts their text, chunks it, generates embeddings, and stores them in Supabase for vector search.
You can then ask questions about your documents using semantic search + LLM reasoning.
✅ Automatic PDF ingestion from MyDocs/
✅ Text chunking with overlap (RAG-friendly)
✅ Embeddings stored in Supabase using pgvector
✅ Fast semantic search using an RPC function
✅ Question answering using retrieved document context
✅ Clean and modular Node.js code
✅ Local-only ingestion (no uploads)
Document-RAG-App/
│
├── MyDocs/
│   └── Policies.pdf       # Your local documents (auto-read)
│
├── index.js               # Node server + RAG pipeline
├── package.json
├── package-lock.json
├── .env                   # Your API keys
└── README.md              # This file
| Component | Technology |
|---|---|
| Server | Node.js + Express |
| LLM Provider | OpenAI API |
| Embeddings | text-embedding-3-small |
| Database | Supabase Postgres |
| Vector Index | pgvector + ivfflat |
| Document Loader | pdf-parse |
| Query Transport | REST API |
npm install
Required packages:
- express
- cors
- dotenv
- openai
- @supabase/supabase-js
- pdf-parse
Create .env in the root:
OPENAI_API_KEY=your_openai_key
SUPABASE_URL=https://your-project-url.supabase.co
SUPABASE_ANON_KEY=your_supabase_anon_key
PORT=3000
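A small startup check can fail fast when one of these variables is missing. This is a hypothetical helper, not part of index.js, and it assumes dotenv has already populated `process.env`:

```javascript
// Hypothetical helper: verify required environment variables at startup.
// Assumes `require("dotenv").config()` has already run.
function requireEnv(names) {
  const missing = names.filter((name) => !process.env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}

// Usage at server startup:
// requireEnv(["OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_ANON_KEY"]);
```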
Place your PDFs into:
MyDocs/
Policies.pdf
You can add more PDFs anytime.
Log in to Supabase → SQL Editor → run the following commands.
create extension if not exists vector;
create table if not exists "MyPDFDocuments" (
id bigserial primary key,
content text,
embedding vector(1536),
title text,
source text,
path text,
created_at timestamptz default now()
);
create index if not exists mypdfdocuments_embedding_idx
on "MyPDFDocuments"
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
create or replace function match_documents(
query_embedding vector(1536),
match_threshold float,
match_count int
)
returns table (
id bigint,
content text,
similarity float
)
language plpgsql
as $$
begin
return query
select
d.id,
d.content,
1 - (d.embedding <=> query_embedding) as similarity
from "MyPDFDocuments" d
where 1 - (d.embedding <=> query_embedding) > match_threshold
order by similarity desc
limit match_count;
end;
$$;
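The similarity score this function returns is 1 minus pgvector's cosine distance (`<=>`). As a sanity check, the same score can be computed in plain JavaScript; this is illustrative only, since the real search happens inside Postgres:

```javascript
// Cosine similarity between two equal-length vectors, matching the
// `1 - (embedding <=> query_embedding)` expression in match_documents.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Identical vectors score 1; orthogonal vectors score 0.
```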
POST /index-docs
This endpoint:
- Scans the MyDocs/ folder
- Extracts text from each PDF
- Splits the text into chunks
- Generates embeddings
- Writes the chunks and vectors to Supabase
curl -X POST http://localhost:3000/index-docs
{
"message": "Indexed all PDFs from MyDocs folder into Supabase."
}
POST /query
{
"query": "What is the leave policy?"
}
curl -X POST http://localhost:3000/query -H "Content-Type: application/json" -d '{"query": "What is the leave policy?"}'
{
"answer": "According to the policy..."
}
pdf-parse reads PDF contents.
Chunking with overlap:
chunkSize = 1000 chars
chunkOverlap = 200 chars
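A chunker with these parameters can be sketched as follows. This is a simplified character-based version under the stated sizes; the actual splitting logic lives in index.js:

```javascript
// Split text into overlapping chunks: each chunk is up to `chunkSize`
// characters, and consecutive chunks share `chunkOverlap` characters.
function chunkText(text, chunkSize = 1000, chunkOverlap = 200) {
  const chunks = [];
  const step = chunkSize - chunkOverlap; // advance 800 chars per chunk
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached
  }
  return chunks;
}
```

The overlap means a sentence cut off at a chunk boundary still appears whole in the next chunk, which improves retrieval quality.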
Each chunk is converted to a 1536-dimensional vector using:
text-embedding-3-small
The vectors are stored in the MyPDFDocuments table.
match_documents performs semantic vector search.
OpenAI model produces the final answer using retrieved context.
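The prompt sent to the chat model can be assembled roughly like this. `buildPrompt` is a hypothetical helper for illustration; the exact wording in index.js may differ:

```javascript
// Combine retrieved chunks into one context block, then instruct the
// model to answer only from that context.
function buildPrompt(question, retrievedChunks) {
  const context = retrievedChunks
    .map((chunk, i) => `[${i + 1}] ${chunk.content}`)
    .join("\n\n");
  return (
    "Answer the question using only the context below. " +
    "If the answer is not in the context, say you don't know.\n\n" +
    `Context:\n${context}\n\nQuestion: ${question}`
  );
}
```

Grounding the model in retrieved context like this is what keeps answers tied to your documents instead of the model's general knowledge.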
npm run dev
POST http://localhost:3000/index-docs
POST http://localhost:3000/query
| Issue | Cause | Fix |
|---|---|---|
| No text extracted | PDF is a scanned (image-only) PDF | Run OCR first (e.g. Tesseract) |
| Embedding dimension error | Table vector size mismatch | Ensure the column is vector(1536) |
| Empty results | Similarity threshold too high | Lower the threshold (e.g. 0.2) |
| Slow search | Missing index | Add the ivfflat index |
| OpenAI errors | Wrong or missing API key | Check .env |