A comprehensive demonstration of GraphRAG vs Naive RAG using realistic PDF CVs and LLM-powered knowledge graph extraction. This project showcases how knowledge graphs enable structured queries that traditional vector-based RAG systems answer poorly or not at all.
- Python 3.11+ with the `uv` package manager
- Docker Desktop (for the Neo4j database)
- OpenAI API key (set in the `.env` file)
# Complete end-to-end comparison
uv run python 5_compare_systems.py
# 1. Initial setup and validation
uv run python 0_setup.py
# 2. Start Neo4j database
./start_session.sh
# 3. Generate 30 realistic CV PDFs
uv run python 1_generate_data.py
# 4. Extract knowledge graph from CVs using LLMGraphTransformer
uv run python 2_data_to_knowledge_graph.py
# 5. Run complete comparison
uv run python 5_compare_systems.py
docker cp mydump.dump neo4j-graphrag:/var/lib/neo4j/import/
docker stop neo4j-graphrag
docker run --rm \
  -v 06_graphrag_neo4j_data:/data \
  -v 06_graphrag_neo4j_import:/var/lib/neo4j/import \
  neo4j:latest \
  neo4j-admin load --from=/var/lib/neo4j/import/mydump.dump --database=neo4j --force
docker start neo4j-graphrag
Traditional RAG systems struggle with structured queries requiring:
| Query Type | Example | Traditional RAG Issue |
|---|---|---|
| Counting | "How many Python developers?" | ❌ Estimates from text chunks |
| Filtering | "Find people with Docker AND Kubernetes" | ❌ Limited to semantic similarity |
| Aggregation | "Average years of experience?" | ❌ Cannot calculate across entities |
| Sorting | "Top 3 most experienced developers" | ❌ No structured ranking |
| Multi-hop | "People who attended same university" | ❌ Cannot traverse relationships |
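The counting row can be illustrated with a toy sketch. It is not the project's code: the `CV` records, the names, and the crude relevance score are all hypothetical, and `top_k_count` only stands in for the fact that a vector retriever passes a fixed number of chunks to the LLM.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CV:
    name: str
    skills: frozenset[str]


# Hypothetical toy data, not taken from the generated CVs.
CVS = [
    CV("Ana", frozenset({"Python", "Docker"})),
    CV("Ben", frozenset({"Java"})),
    CV("Cho", frozenset({"Python", "Kubernetes", "Docker"})),
    CV("Dev", frozenset({"Python"})),
    CV("Eli", frozenset({"Go", "Python"})),
]


def structured_count(cvs: list[CV], skill: str) -> int:
    """Exact count: every record is inspected, as a graph query would do."""
    return sum(1 for cv in cvs if skill in cv.skills)


def top_k_count(cvs: list[CV], skill: str, k: int = 2) -> int:
    """Stand-in for vector retrieval: only the k best-matching CVs are seen."""
    # Crude relevance score (assumption): True if the skill is listed, else False.
    ranked = sorted(cvs, key=lambda cv: skill in cv.skills, reverse=True)
    return sum(1 for cv in ranked[:k] if skill in cv.skills)


print(structured_count(CVS, "Python"), top_k_count(CVS, "Python", k=2))
```

However good the ranking is, the retrieval-based count can never exceed `k`, which is one way to read the "estimates from text chunks" issue in the table.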
Auto-extracted from PDF CVs using LLMGraphTransformer:
Nodes:
├── Person (id, name, location, bio)
├── Skill (id, category)
├── Company (id, industry, location)
├── University (id, location, type)
└── Certification (id, provider, field)
Relationships:
├── (Person)-[HAS_SKILL]->(Skill)
├── (Person)-[WORKED_AT]->(Company)
├── (Person)-[STUDIED_AT]->(University)
├── (Person)-[EARNED]->(Certification)
└── (Person)-[MENTIONS]->(Person)
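To make the multi-hop idea concrete, here is a minimal in-memory sketch of the schema above as `(source, relationship, target)` triples. The data and the `same_university_pairs` helper are hypothetical; in the project itself this traversal would be a Cypher query against Neo4j.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical triples following the relationship types in the schema.
TRIPLES = [
    ("Ana", "STUDIED_AT", "University A"),
    ("Ben", "STUDIED_AT", "University B"),
    ("Cho", "STUDIED_AT", "University A"),
    ("Ana", "HAS_SKILL", "Python"),
    ("Cho", "WORKED_AT", "Acme"),
]


def same_university_pairs(triples):
    """Two hops: (Person)-[STUDIED_AT]->(University)<-[STUDIED_AT]-(Person)."""
    alumni = defaultdict(set)
    for source, rel, target in triples:
        if rel == "STUDIED_AT":
            alumni[target].add(source)

    pairs = []
    for university, people in sorted(alumni.items()):
        # Sorting keeps the output deterministic and avoids (b, a) duplicates.
        for a, b in combinations(sorted(people), 2):
            pairs.append((a, b, university))
    return pairs


print(same_university_pairs(TRIPLES))
```

The join happens on the shared `University` node, which is exactly the information a chunk-level similarity search has no direct way to follow.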
- PDF Processing: Realistic CV generation with reportlab
- Knowledge Extraction: LangChain LLMGraphTransformer
- Graph Database: Neo4j with Docker
- GraphRAG: LangChain GraphCypherQAChain with custom prompts
- Naive RAG: ChromaDB vector search baseline
- Evaluation: GPT-5 ground truth generation
GraphRAG (✅ Accurate):
MATCH (p:Person)-[:HAS_SKILL]->(s:Skill)
WHERE toLower(s.id) = toLower("Python")
RETURN count(p) AS pythonProgrammers
Result: 7 people (exact count)
Naive RAG (❌ Incomplete): Result: "Based on context, only Amanda Smith is mentioned" (missed 6 people)
GraphRAG (✅ Complete): Result: 4 people - Christine Rodriguez, Joseph Fuller, Krystal Castillo, William Bonilla
Naive RAG (❌ Limited): Result: 1 person - Christine Rodriguez (missed 3 people)
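The Python-skill count query shown above can also be run directly from Python, outside the LangChain chain. This is a sketch under assumptions: it uses the official `neo4j` driver, assumes the default Bolt port `7687`, and passes the skill as a `$skill` parameter instead of an inline literal. The function name and the `total` alias are illustrative.

```python
# Parameterized form of the skill-count query; $skill is bound at run time.
COUNT_BY_SKILL = """
MATCH (p:Person)-[:HAS_SKILL]->(s:Skill)
WHERE toLower(s.id) = toLower($skill)
RETURN count(p) AS total
"""


def count_people_with_skill(
    skill: str,
    uri: str = "bolt://localhost:7687",  # assumed default Bolt endpoint
    auth: tuple[str, str] = ("neo4j", "password123"),
) -> int:
    """Run the count query against a live Neo4j instance and return the total."""
    # Imported lazily so this module loads even without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            record = session.run(COUNT_BY_SKILL, skill=skill).single()
            return record["total"]


# Usage (requires the Neo4j container to be running):
# count_people_with_skill("Python")
```

Binding the value as a parameter avoids string-building the Cypher and keeps the case-insensitive comparison in one place.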
06_GraphRAG/
├── 0_setup.py # Environment validation
├── 1_generate_data.py # Synthetic PDF CV generation
├── 2_data_to_knowledge_graph.py # LLM graph extraction
├── 3_query_knowledge_graph.py # GraphRAG implementation
├── 4_naive_rag_cv.py # Naive RAG baseline
├── 5_compare_systems.py # System comparison
├── docker-compose.yml # Neo4j setup
├── start_session.sh # Neo4j management
├── utils/ # Utility files
│ ├── generate_ground_truth.py # GPT-5 ground truth
│ ├── test_questions.json # Evaluation questions
│ └── config.toml # Configuration
├── data/programmers/ # Generated CV PDFs
└── results/ # Comparison results
├── ground_truth_answers.json
└── comparison_report.md
- Language: Python 3.11+
- Package Manager: uv
- LLM: OpenAI GPT-4o (queries), GPT-5 (ground truth)
- Graph Database: Neo4j 5.x with Docker
- Vector Store: ChromaDB (baseline comparison)
- Frameworks: LangChain, LangChain Experimental
- Document Processing: Unstructured, ReportLab
- GraphRAG excels at structured queries requiring precise relationships
- LLMGraphTransformer enables real-world PDF-to-knowledge-graph workflows
- Custom Cypher prompts solve case sensitivity and result interpretation issues
- GPT-5 ground truth provides a reference answer set that is independent of both systems under test
- Hybrid approaches can combine both strengths for optimal results
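One possible shape for such a hybrid is a small router that sends structured questions to the graph and everything else to vector search. The sketch below is hypothetical and not part of this project: the cue list is an assumption, and a real router might use an LLM classifier instead of a regular expression.

```python
import re

# Hypothetical cues for counting, aggregation, ranking, and shared-entity questions.
STRUCTURED_CUES = re.compile(
    r"\b(?:how many|count|average|top \d+|most|same)\b",
    re.IGNORECASE,
)


def route(question: str) -> str:
    """Return 'graph' for structured questions and 'vector' otherwise."""
    return "graph" if STRUCTURED_CUES.search(question) else "vector"


for q in (
    "How many Python developers?",
    "Top 3 most experienced developers",
    "Summarize Amanda Smith's background",
):
    print(f"{route(q):6} <- {q}")
```

The first two questions match a cue and go to the graph; the open-ended summary falls through to vector search.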
Neo4j Browser: http://localhost:7474 (neo4j/password123)
# Test GraphRAG only
uv run python 3_query_knowledge_graph.py
# Test Naive RAG only
uv run python 4_naive_rag_cv.py
# Generate ground truth only
uv run python utils/generate_ground_truth.py
This approach applies to any domain with:
- Structured relationships between entities
- Precise counting/filtering requirements
- Multi-hop reasoning needs
- Complex business queries
Examples: Staffing, inventory management, medical records, financial risk analysis.