Jash2606/graph-rag-engine


🧠 Hybrid Vector-Graph Retrieval System

A production-grade semantic search engine that combines vector similarity (FAISS) with graph-based knowledge traversal (Neo4j) to deliver contextually rich and highly relevant search results.

Live Demo: Frontend (Streamlit Cloud) · Backend API: Hosted on AWS EC2



🎯 Overview

Ever searched for something and got results that were technically relevant but missed the bigger picture? That's the limitation of traditional search: it matches words or meaning, but ignores how information connects.

This system addresses that by combining two approaches, plus a hybrid mode that blends them:

  1. Vector Search: Understands the meaning behind your query using AI embeddings. Search for "Who changed modern physics?" and it finds Einstein, even if those exact words never appear.

  2. Graph Search: Explores connections between documents and entities. Found a doc about Einstein? The graph links him to "relativity", "Nobel Prize", and "Princeton", and surfaces those related documents too.

  3. Hybrid Search: The magic sauce 🪄. Blends both approaches with configurable weights, so you get results that are both semantically relevant AND contextually connected.

The result? A search engine that doesn't just find documents; it understands your knowledge base as a connected web of ideas.

Built For: Knowledge bases · Research paper discovery · Document Q&A · Content recommendation · Intelligent FAQ systems


✨ Key Features

| Category | Features |
| --- | --- |
| Search | 🔍 Vector search (semantic) · 🕸️ Graph search (structural) · 🎯 Hybrid search (combined) · 📊 Configurable weighting |
| Data | 📄 Auto chunking, embedding, entity extraction · 🔗 Auto relationship mapping · 🗃️ Dual storage (FAISS + Neo4j) · 🔧 Full CRUD |
| DevEx | 🛡️ MVC architecture · 🚨 Custom error handling · 🔒 Cypher injection prevention · 📊 Database inspector · 🎨 Interactive graph visualization |
| DevOps | ⚙️ GitHub Actions CI/CD · 🚀 AWS EC2 auto-deploy · ☁️ Streamlit Cloud frontend · 🐳 Docker Compose |
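
One common way to achieve Cypher injection prevention is to pass values as query parameters and to check relationship types against an allow-list, since Cypher does not accept a relationship type as a parameter. The sketch below assumes that approach and is not taken from the repository: `build_edge_query` and the `Document` label are hypothetical names, and the allowed types are the four relationship types listed for `/v1/edges`.

```python
# Hypothetical sketch: allow-list relationship types, parameterize values.
ALLOWED_EDGE_TYPES = {"RELATED_TO", "MENTIONS", "CITES", "REQUIRES"}


def build_edge_query(edge_type: str, source_id: str, target_id: str):
    """Return (cypher, params) for creating an edge between two nodes."""
    if edge_type not in ALLOWED_EDGE_TYPES:
        # Relationship types cannot be passed as Cypher parameters, so
        # anything outside the allow-list is rejected before interpolation.
        raise ValueError(f"Unsupported edge type: {edge_type!r}")
    cypher = (
        "MATCH (a:Document {id: $source_id}), (b:Document {id: $target_id}) "
        f"CREATE (a)-[r:{edge_type}]->(b) RETURN r"
    )
    # User-supplied IDs travel as parameters, never as query text.
    return cypher, {"source_id": source_id, "target_id": target_id}
```

The returned pair would then be handed to the Neo4j driver as query text plus parameters, so user input never becomes part of the query string.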

🏗️ Architecture

┌──────────────────────────────────────────────────────────────┐
│                  Streamlit Frontend (Cloud)                  │
│   (Search Interface + Graph Visualization + DB Inspector)    │
└──────────────────────┬───────────────────────────────────────┘
                       │ HTTP/REST
┌──────────────────────▼───────────────────────────────────────┐
│                  FastAPI Backend (AWS EC2)                   │
│  ┌────────────┬───────────────┬──────────────┬────────────┐  │
│  │   Routes   │  Controllers  │ Repositories │  Services  │  │
│  │  (API)     │ (Business     │ (Data Access)│ (NLP/ML)   │  │
│  │            │   Logic)      │              │            │  │
│  └────────────┴───────────────┴──────────────┴────────────┘  │
└──────────────────────┬────────────────────┬──────────────────┘
                       │                    │
         ┌─────────────▼───────┐  ┌─────────▼──────────┐
         │  Neo4j AuraDB       │  │   FAISS Vector     │
         │  (Cloud Graph DB)   │  │   Index (In-Mem)   │
         └─────────────────────┘  └────────────────────┘

For detailed architecture with Mermaid flow diagrams, see ARCHITECTURE_OVERVIEW.md


🛠️ Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | Streamlit, streamlit-agraph | Interactive UI with graph visualization |
| Backend | FastAPI, Pydantic | REST API with validation |
| Vector DB | FAISS (IndexFlatIP) | Semantic similarity search |
| Graph DB | Neo4j AuraDB | Entity relationships & graph traversal |
| NLP/ML | Sentence Transformers (all-MiniLM-L6-v2), spaCy | Embeddings & entity extraction |
| CI/CD | GitHub Actions | Lint → Test → Deploy pipeline |
| Infra | AWS EC2, Streamlit Cloud, Docker | Hosting & containerisation |

📁 Project Structure

vector-graph-retrieval-app/
├── .github/
│   └── workflows/
│       ├── ci.yml                # CI: Lint (flake8) + Test (pytest)
│       └── cd.yml                # CD: Auto-deploy to EC2 via SSH
│
├── app/
│   ├── main.py                   # FastAPI app entry point
│   ├── config.py                 # Environment-based configuration
│   ├── database.py               # Neo4j + FAISS connection management
│   │
│   ├── api/                      # API Layer
│   │   ├── dependencies.py       # Dependency injection
│   │   └── routes/
│   │       ├── health.py         # GET  /v1/health
│   │       ├── documents.py      # CRUD /v1/nodes
│   │       ├── edges.py          # CRUD /v1/edges
│   │       ├── search.py         # POST /v1/search/*
│   │       └── debug.py          # GET  /v1/debug/*
│   │
│   ├── controllers/              # Business Logic Layer
│   │   ├── document_controller.py
│   │   ├── edge_controller.py
│   │   └── search_controller.py
│   │
│   ├── repositories/             # Data Access Layer
│   │   ├── base.py               # Base repository interface
│   │   ├── neo4j_repository.py   # Neo4j graph operations
│   │   └── vector_repository.py  # FAISS vector operations
│   │
│   ├── services/                 # Utility Services
│   │   ├── embedding.py          # Text → 384-dim vector
│   │   ├── ingestion.py          # Document processing pipeline
│   │   └── search.py             # Search algorithms
│   │
│   ├── models/
│   │   └── schemas.py            # Pydantic request/response models
│   │
│   └── core/
│       ├── constants.py          # App-wide constants
│       └── exceptions.py         # Custom exception hierarchy
│
├── frontend/
│   ├── streamlit_app.py          # Streamlit UI (deployed on Streamlit Cloud)
│   ├── requirements.txt          # Frontend-specific dependencies
│   └── index.html                # Static landing page
│
├── tests/
│   └── test_api.py               # Mocked API tests (no DB required)
│
├── .env.example                  # Environment variable template
├── .gitignore                    # Excludes venv/, data/, .env, __pycache__/
├── ARCHITECTURE_OVERVIEW.md      # Detailed architecture with Mermaid diagrams
├── docker-compose.yml            # Local Neo4j setup
├── pytest.ini                    # Pytest configuration
└── requirements.txt              # Backend Python dependencies

🚀 Setup & Installation

Prerequisites

  • Python 3.10+
  • Docker & Docker Compose (for local Neo4j)
  • Git

Quick Start

# 1. Clone
git clone https://github.com/Jash2606/vector-graph-retrieval-app.git
cd vector-graph-retrieval-app

# 2. Virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Linux/Mac

# 3. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# 4. Configure environment
cp .env.example .env
# Edit .env with your credentials

Environment Variables

# For local development (Docker Neo4j)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

# For cloud (Neo4j AuraDB): use these instead of the local values above
# NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
# NEO4J_USER=neo4j
# NEO4J_PASSWORD=<your-aura-password>

# Frontend API target
API_URL=http://localhost:8000/v1
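
As a rough illustration of how environment-based configuration like this can be consumed, the sketch below reads the same variable names from the process environment. It is an assumption about the general pattern, not the contents of `app/config.py`: `Settings` and `load_settings` are hypothetical names, and the variables are assumed to be exported already (or loaded from `.env` by whatever tool you use).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    api_url: str


def load_settings(env=os.environ) -> Settings:
    """Build Settings from environment variables, with local-dev defaults."""
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        # No default for the password: fail loudly if it is missing.
        neo4j_password=env["NEO4J_PASSWORD"],
        api_url=env.get("API_URL", "http://localhost:8000/v1"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.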

Run Locally

# Start Neo4j (local only)
docker-compose up -d

# Start backend
uvicorn app.main:app --reload
# → http://localhost:8000/docs (Swagger UI)

# Start frontend (separate terminal)
streamlit run frontend/streamlit_app.py
# → http://localhost:8501

🌐 Deployment

Backend: AWS EC2

The backend auto-deploys via a GitHub Actions CD pipeline:

  1. Push to main → CI runs (lint + tests)
  2. CI passes → CD triggers SSH deploy to EC2
  3. EC2 pulls latest code, installs deps, restarts systemd service
  4. Health check: GET /v1/health

Frontend: Streamlit Cloud

The Streamlit frontend is deployed on Streamlit Community Cloud:

  • Connected to this GitHub repo's frontend/streamlit_app.py
  • API_URL env var points to the EC2 backend
  • Auto-redeploys on push to main

Database: Neo4j AuraDB

  • Free-tier cloud instance at neo4j+s://<instance>.databases.neo4j.io
  • Credentials stored in .env (gitignored) and EC2 environment

📚 API Documentation

Base URL: http://localhost:8000/v1 (local) or your EC2 public IP

Interactive Docs: Visit /docs (Swagger UI) when running

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/health | GET | Health check (Neo4j + FAISS status) |
| /v1/nodes | POST | Create document (auto: embed + extract entities + graph connect) |
| /v1/nodes/{id} | GET | Get document by ID |
| /v1/nodes/{id} | PUT | Update document |
| /v1/nodes/{id} | DELETE | Delete document |
| /v1/edges | POST | Create relationship (RELATED_TO, MENTIONS, CITES, REQUIRES) |
| /v1/edges/{id} | GET | Get edge by ID |
| /v1/search/vector | POST | Semantic vector search |
| /v1/search/graph | GET | Graph traversal from start node |
| /v1/search/hybrid | POST | Combined vector + graph search |
| /v1/debug/documents | GET | List all documents (debug) |
| /v1/debug/entities | GET | List all entities (debug) |
| /v1/debug/faiss/info | GET | FAISS index stats (debug) |

Example Requests

# Ingest a document
curl -X POST "http://localhost:8000/v1/nodes" \
  -H "Content-Type: application/json" \
  -d '{"text": "Albert Einstein was a German-born theoretical physicist...", "title": "Einstein Bio"}'

# Hybrid search
curl -X POST "http://localhost:8000/v1/search/hybrid" \
  -H "Content-Type: application/json" \
  -d '{"query_text": "Einstein relativity", "vector_weight": 0.7, "graph_weight": 0.3, "top_k": 5}'
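
The same hybrid search call can be made from Python. This is a minimal standard-library sketch that mirrors the curl payload above; `build_hybrid_request` and `send` are hypothetical helper names, and the response shape is not assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"


def build_hybrid_request(query_text, vector_weight=0.7, graph_weight=0.3, top_k=5):
    """Build (but do not send) a POST request for /search/hybrid."""
    payload = {
        "query_text": query_text,
        "vector_weight": vector_weight,
        "graph_weight": graph_weight,
        "top_k": top_k,
    }
    return urllib.request.Request(
        f"{BASE_URL}/search/hybrid",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (needs a running backend)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the backend running locally, `send(build_hybrid_request("Einstein relativity"))` should return the decoded JSON body.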

🔍 Search Algorithms

1. Vector Search

  • Uses cosine similarity, computed as the inner product of L2-normalized embeddings
  • Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
  • Exact (exhaustive) inner-product retrieval via FAISS IndexFlatIP
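
To see why an inner-product index gives cosine similarity, note that once vectors are scaled to unit length, their inner product equals the cosine of the angle between them. The pure-Python sketch below illustrates that idea on toy 2-D vectors; it stands in for what an exhaustive inner-product search does and does not use FAISS or the real 384-dimensional embeddings. All function names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, docs, k=2):
    """Rank docs by inner product with the query after normalizing every vector."""
    q = l2_normalize(query)
    scored = [(doc_id, inner_product(q, l2_normalize(v))) for doc_id, v in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: "a" points the same way as the query, "b" is orthogonal to it.
docs = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [3.0, 3.0]}
print(top_k([2.0, 0.0], docs))
```

Vector length has no effect on the ranking here: "b" is longer than "a" but scores lowest because it is orthogonal to the query.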

2. Graph Search

BFS traversal from a start node with configurable depth (1–3 recommended). Returns the full subgraph with scored edges.
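
A depth-limited BFS can be sketched over an in-memory adjacency list as follows. This is an illustration of the traversal only: the real system queries Neo4j, and the edge scoring mentioned above is not specified here, so it is left out. `bfs_subgraph` and the toy graph are hypothetical.

```python
from collections import deque


def bfs_subgraph(adjacency, start, max_depth=2):
    """Return ({node: hops}, [(src, dst), ...]) reachable within max_depth hops."""
    hops = {start: 0}
    edges = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in adjacency.get(node, []):
            edges.append((node, neighbor))
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops, edges


graph = {
    "einstein": ["relativity", "nobel_prize"],
    "relativity": ["spacetime"],
    "spacetime": ["black_holes"],
}
hops, edges = bfs_subgraph(graph, "einstein", max_depth=2)
```

With `max_depth=2`, `spacetime` is reached at two hops but not expanded, so `black_holes` stays outside the subgraph.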

3. Hybrid Search

final_score = α × vector_score + β × graph_score

where:
  vector_score = normalized cosine similarity
  graph_score = f(connectivity, hops, entity_matches)
  α + β = 1.0

Graph Score Components:

  • Connectivity: Number of relationships
  • Hops: Distance from query entities
  • Expansion Bonus: Bonus for multi-hop discovery
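
The weighted blend can be illustrated with a few lines of Python. This sketch takes per-document vector and graph scores as given (how `graph_score` is derived from the components above is not reproduced here) and rescales the two weights so they sum to 1. `hybrid_scores` is a hypothetical helper, and treating a missing score as 0.0 is an assumption.

```python
def hybrid_scores(vector_scores, graph_scores, vector_weight=0.7, graph_weight=0.3):
    """Blend per-document scores: alpha * vector + beta * graph, alpha + beta = 1."""
    total = vector_weight + graph_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    alpha, beta = vector_weight / total, graph_weight / total
    doc_ids = set(vector_scores) | set(graph_scores)
    blended = {
        # A document missing from one source contributes 0.0 for that term.
        doc_id: alpha * vector_scores.get(doc_id, 0.0) + beta * graph_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


ranked = hybrid_scores(
    vector_scores={"d1": 0.9, "d2": 0.4},
    graph_scores={"d2": 1.0, "d3": 0.5},
)
```

In this toy input, `d2` has a modest vector score but a strong graph score, which lifts it close to `d1`; `d3` enters the ranking through the graph alone.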

See ARCHITECTURE_OVERVIEW.md for detailed diagrams and scoring formulae.


🧪 Testing & CI/CD

Run Tests Locally

pytest tests/test_api.py -v

Tests use mocked dependencies, so no Neo4j or FAISS instance is required.
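
The mocking pattern looks roughly like this. The sketch is not taken from `tests/test_api.py`: `SearchController` here is a hypothetical stand-in with an injected repository, used only to show how a test can run without a database behind it.

```python
from unittest.mock import MagicMock


class SearchController:
    """Hypothetical controller that delegates to an injected vector repository."""

    def __init__(self, vector_repo):
        self.vector_repo = vector_repo

    def vector_search(self, query_text, top_k=5):
        return self.vector_repo.search(query_text, top_k)


def test_vector_search_uses_repository():
    # The mock replaces the FAISS-backed repository entirely.
    repo = MagicMock()
    repo.search.return_value = [{"id": "doc-1", "score": 0.92}]
    controller = SearchController(repo)

    results = controller.vector_search("Einstein relativity", top_k=1)

    assert results == [{"id": "doc-1", "score": 0.92}]
    repo.search.assert_called_once_with("Einstein relativity", 1)
```

Because the repository is injected, the test checks the controller's behavior in isolation and runs in milliseconds.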

Test Coverage

| Endpoint | Method | Tested |
| --- | --- | --- |
| /v1/ | GET | ✅ |
| /v1/health | GET | ✅ |
| /v1/nodes | POST | ✅ |
| /v1/nodes/{id} | GET/PUT/DELETE | ✅ |
| /v1/edges | POST | ✅ |
| /v1/edges/{id} | GET | ✅ |
| /v1/search/vector | POST | ✅ |
| /v1/search/graph | GET | ✅ |
| /v1/search/hybrid | POST | ✅ |

CI/CD Pipeline

Push to main ──→ CI (GitHub Actions)
                  ├── Lint (flake8)
                  └── Test (pytest)
                        │
                    ✅ Pass
                        │
                  CD (GitHub Actions)
                  └── SSH deploy to EC2
                        ├── git pull
                        ├── pip install
                        ├── systemctl restart
                        └── Health check ✅

🔮 Future Enhancements

  • Cross-encoder reranking
  • Query expansion (synonyms/paraphrases)
  • Multi-modal embeddings (image + text)
  • Redis caching for hot queries
  • Prometheus + Grafana monitoring
  • Batch ingestion API

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Built with ❤️ using FastAPI, Neo4j, and FAISS
