A RAG-optimized metadata extraction system for pitch deck PDFs. It processes each deck to generate structured metadata at both the deck level (global context) and the slide level (detailed content) for use in retrieval systems.
- Multi-Level Chunking: Extracts both deck-level and slide-level metadata
- Industry Classification: 30+ industry categories with ENUM enforcement
- Slide Layout Detection: Automatic classification of slide types (Title, Problem, Solution, etc.)
- Gemini 2.0 Flash: Uses Google's Gemini 2.0 Flash model for extraction
- Batch Processing: Process entire folders of PDFs automatically
- Validation & Statistics: Built-in validation and detailed analytics
- RAG-Optimized: Structured for vector database storage and semantic search
{
  deck_industry: DeckIndustry, // ENUM: Fintech, SaaS, etc.
  company_name: string,        // Extracted company name
  executive_summary: string,   // 2-3 sentence deck summary
  total_pages: number          // Total slide count
}

{
  filename: string,          // Source PDF filename
  slide_number: number,      // Page index
  slide_content: string,     // Full text extraction
  slide_summary: string,     // 1-2 sentence summary
  keywords: string[],        // 3-5 core keywords
  slide_layout: SlideLayout  // ENUM: Title, Problem, etc.
}

- Node.js 18+ and npm
- Gemini API key
- poppler-utils for PDF processing:
  # macOS
  brew install poppler
  # Ubuntu/Debian
  sudo apt-get install poppler-utils
  # Windows
  # Download from: https://github.com/oschwartz10612/poppler-windows/releases
- Clone and install dependencies:
  npm install
- Configure environment:
  cp .env.example .env
  # Edit .env and add your GEMINI_API_KEY
- Add PDFs to process:
  # Place your pitch deck PDFs in the pdf/ directory
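As a rough sketch of how the key from .env might be checked at startup, a fail-fast helper could look like the following. The name requireGeminiApiKey is hypothetical and not taken from the project source; the actual loading code may differ.

```typescript
// Hypothetical helper, not part of the DeckBot source.
// The environment is passed in as a plain object so the check is easy to test.
type Env = Record<string, string | undefined>;

function requireGeminiApiKey(env: Env): string {
  // Treat a missing or whitespace-only value as "not configured".
  const key = env["GEMINI_API_KEY"]?.trim();
  if (!key) {
    throw new Error(
      "GEMINI_API_KEY is missing. Copy .env.example to .env and add your key."
    );
  }
  return key;
}
```

In a Node.js entry point this would typically be called as requireGeminiApiKey(process.env) once the .env file has been loaded (for example with a package such as dotenv, which is an assumption here).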
Process all PDFs:
npm run dev
Build and run:
npm run build
npm start

DeckBot/
├── src/
│   ├── types.ts              # TypeScript type definitions & ENUMs
│   ├── pdf-processor.ts      # PDF text extraction & image conversion
│   ├── gemini-client.ts      # Gemini API integration
│   ├── metadata-extractor.ts # Main extraction orchestration
│   └── index.ts              # CLI entry point
├── pdf/                      # Input PDFs (place files here)
├── output/                   # Generated JSON metadata files
├── images/                   # Converted slide images (temporary)
├── package.json
├── tsconfig.json
└── README.md
Edit src/index.ts to customize:
const config: ProcessingConfig = {
  pdfDirectory: path.join(projectRoot, 'pdf'),
  outputDirectory: path.join(projectRoot, 'output'),
  imageDirectory: path.join(projectRoot, 'images'),
  geminiApiKey: apiKey,
  maxConcurrentRequests: 3 // Adjust for rate limiting
};

The system supports 30+ industry classifications:
- Fintech, E-commerce, SaaS
- HealthTech, EdTech, PropTech
- Marketplace, Enterprise Software
- Consumer App, Biotech, CleanTech
- AI/ML, Cybersecurity, Web3/Crypto
- Social Media, Gaming, Hardware
- Marketing/AdTech, HR/Recruiting
- ... and more (see src/types.ts)
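To illustrate what ENUM enforcement can mean in practice, here is a minimal sketch that maps a free-text label (for example, one returned by the model) onto a fixed set of allowed values. It uses a const array and a union type rather than a TypeScript enum, so the real definition in src/types.ts may look different. Only the categories named in the list above are included, and parseDeckIndustry is a hypothetical helper.

```typescript
// Only the categories listed in this README; the full set lives in src/types.ts.
const DECK_INDUSTRIES = [
  "Fintech", "E-commerce", "SaaS",
  "HealthTech", "EdTech", "PropTech",
  "Marketplace", "Enterprise Software",
  "Consumer App", "Biotech", "CleanTech",
  "AI/ML", "Cybersecurity", "Web3/Crypto",
  "Social Media", "Gaming", "Hardware",
  "Marketing/AdTech", "HR/Recruiting",
] as const;

type DeckIndustry = (typeof DECK_INDUSTRIES)[number];

// Hypothetical helper: returns the canonical category for a label,
// ignoring case and surrounding whitespace, or undefined if it is not allowed.
function parseDeckIndustry(raw: string): DeckIndustry | undefined {
  const needle = raw.trim().toLowerCase();
  return DECK_INDUSTRIES.find((industry) => industry.toLowerCase() === needle);
}
```

Returning undefined instead of throwing leaves the caller free to decide whether an unrecognized label should fail validation or fall back to a default category.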
Automatic detection of common pitch deck slide types:
- Title, Problem, Solution
- Product, Traction, Market
- Business Model, Competition
- Team, Financials, Roadmap
- Ask, Contact, Appendix
Each PDF generates a JSON file with complete metadata:
{
  "deck_metadata": {
    "deck_industry": "Fintech",
    "company_name": "Example Corp",
    "executive_summary": "A neobank focused on...",
    "total_pages": 25
  },
  "slides": [
    {
      "filename": "pitch_deck.pdf",
      "slide_number": 1,
      "slide_content": "Complete extracted text...",
      "slide_summary": "Title slide introducing...",
      "keywords": ["fintech", "banking", "mobile"],
      "slide_layout": "Title"
    }
    // ... more slides
  ]
}

- Vector DB: Store each slide as a separate document
- Metadata: Attach both slide-level AND deck-level metadata
- Embeddings: Generate from slide_summary + slide_content
- Semantic Search: Query against slide embeddings
- Metadata Filtering: Filter by deck_industry, slide_layout
- Context Assembly: Include deck-level context with retrieved slides
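The storage side of these steps can be sketched as follows: one document per slide, embedding text built from slide_summary plus slide_content, and both slide-level and deck-level metadata attached. The VectorDocument shape and the toVectorDocuments helper are hypothetical and would need adapting to whichever vector database is used. The interfaces mirror the JSON output format but use plain strings where the project uses ENUM types.

```typescript
interface DeckMetadata {
  deck_industry: string;
  company_name: string;
  executive_summary: string;
  total_pages: number;
}

interface SlideMetadata {
  filename: string;
  slide_number: number;
  slide_content: string;
  slide_summary: string;
  keywords: string[];
  slide_layout: string;
}

interface DeckOutput {
  deck_metadata: DeckMetadata;
  slides: SlideMetadata[];
}

// Hypothetical document shape, not tied to a specific vector database.
interface VectorDocument {
  id: string;
  text: string; // input for the embedding model
  metadata: SlideMetadata & { deck_industry: string; deck_metadata: DeckMetadata };
}

function toVectorDocuments(deck: DeckOutput): VectorDocument[] {
  return deck.slides.map((slide) => ({
    id: `${slide.filename}#${slide.slide_number}`,
    text: `${slide.slide_summary}\n\n${slide.slide_content}`,
    metadata: {
      ...slide,
      // Copied to the top level so a filter like { deck_industry: 'Fintech' } can match.
      deck_industry: deck.deck_metadata.deck_industry,
      deck_metadata: deck.deck_metadata,
    },
  }));
}
```

Some vector databases only accept flat metadata values; in that case the nested deck_metadata object would need to be flattened or serialized before upload.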
// 1. Semantic search
const results = await vectorDB.search(query, {
  filter: { deck_industry: 'Fintech' },
  limit: 5
});

// 2. Assemble context
const context = results.map(slide => ({
  global: slide.metadata.deck_metadata,
  local: slide.metadata.slide_content
}));

- PDF Text Extraction: Extract complete text using pdf-parse
- Deck Summarization: Generate global context with Gemini
- Image Conversion: Convert each page to PNG images
- Slide Analysis: Process each image with Gemini vision
- Metadata Generation: Extract structured data for each slide
- Validation: Verify completeness and consistency
- JSON Export: Save structured output files
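The validation step is not spelled out here, so the following is only a sketch of what a per-slide check might cover, with rules inferred from the schema comments (1-2 sentence summary present, 3-5 keywords, page number within the deck). The validateSlide function is hypothetical, and it assumes slide_number is 1-based, as in the sample output.

```typescript
interface SlideRecord {
  slide_number: number;
  slide_content: string;
  slide_summary: string;
  keywords: string[];
}

// Hypothetical validator: returns a list of human-readable problems,
// or an empty array when the slide passes every check.
function validateSlide(slide: SlideRecord, totalPages: number): string[] {
  const problems: string[] = [];

  // Assumption: slides are numbered 1..totalPages.
  if (
    !Number.isInteger(slide.slide_number) ||
    slide.slide_number < 1 ||
    slide.slide_number > totalPages
  ) {
    problems.push(`slide_number ${slide.slide_number} is outside 1..${totalPages}`);
  }
  if (slide.slide_summary.trim() === "") {
    problems.push("slide_summary is empty");
  }
  if (slide.keywords.length < 3 || slide.keywords.length > 5) {
    problems.push(`expected 3-5 keywords, got ${slide.keywords.length}`);
  }
  return problems;
}
```

Collecting problems instead of throwing on the first one makes it possible to report every issue in a deck in a single pass.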
The system includes built-in delays between API calls:
- 1-second delay between slide processing
- Configurable maxConcurrentRequests
- Adjust in src/metadata-extractor.ts as needed
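One way a concurrency cap and a fixed delay can be combined is sketched below. This is not the code in src/metadata-extractor.ts: chunk and processInBatches are hypothetical helpers, and the sketch pauses between batches rather than between individual slides.

```typescript
// Resolves after `ms` milliseconds.
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Splits `items` into consecutive groups of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical: runs `worker` on up to `maxConcurrentRequests` items at once
// and waits `pauseMs` between batches. Results keep the input order.
async function processInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  maxConcurrentRequests = 3,
  pauseMs = 1000
): Promise<R[]> {
  const results: R[] = [];
  let isFirstBatch = true;
  for (const batch of chunk(items, maxConcurrentRequests)) {
    if (!isFirstBatch) await delay(pauseMs);
    isFirstBatch = false;
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

A call might look like await processInBatches(slideImages, analyzeSlide, 3, 1000), where slideImages and analyzeSlide stand in for the project's own data and Gemini call.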
Install poppler-utils (see Prerequisites)
Ensure .env file exists with valid API key
Increase delay in metadata-extractor.ts:
await this.geminiClient.delay(2000); // 2 seconds
Check PDF permissions and poppler installation
- Processing Speed: ~5-10 seconds per slide (depends on API)
- Batch Processing: Processes any number of PDFs, one at a time
- Memory Usage: ~100-500MB per PDF
- API Costs: ~$0.01-0.05 per deck (Gemini 2.0 Flash pricing)
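For planning purposes, the per-slide timing above can be turned into a rough per-deck estimate. The sketch below uses only the ~5-10 seconds per slide figure; estimateDeckSeconds is a hypothetical helper, and real timings depend on the API.

```typescript
// Range taken from the figures above: roughly 5-10 seconds per slide.
const SECONDS_PER_SLIDE = { min: 5, max: 10 };

// Hypothetical helper: scales the per-slide range by the number of slides.
function estimateDeckSeconds(slideCount: number): { min: number; max: number } {
  return {
    min: slideCount * SECONDS_PER_SLIDE.min,
    max: slideCount * SECONDS_PER_SLIDE.max,
  };
}

// For a 25-slide deck this gives { min: 125, max: 250 },
// i.e. roughly 2 to 4 minutes.
const exampleEstimate = estimateDeckSeconds(25);
```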
- API keys stored in .env (never commit!)
- All processing happens locally
- Images stored temporarily (auto-cleanup optional)
- No external data transmission except to Gemini API
MIT
Contributions welcome! Please open an issue or PR.
Built with:
- Google Generative AI - Gemini 2.0 Flash
- pdf-parse - PDF text extraction
- pdf-poppler - PDF to image conversion
- sharp - Image processing
Made with ❤️ for RAG systems and pitch deck analysis