🧠 Hybrid Vector-Graph Retrieval System

A production-grade semantic search engine that combines vector similarity (FAISS) with graph-based knowledge traversal (Neo4j) to deliver contextually rich and highly relevant search results.

Live Demo: Frontend (Streamlit Cloud) · Backend API: Hosted on AWS EC2

🎯 Overview

Ever searched for something and got results that were technically relevant but missed the bigger picture? That's the limitation of traditional search — it matches words or meaning, but ignores how information connects.

This system solves that by combining two powerful approaches:

Vector Search — Understands the meaning behind your query using AI embeddings. Search for "Who changed modern physics?" and it finds Einstein, even if those exact words never appear.
Graph Search — Explores connections between documents and entities. Found a doc about Einstein? It automatically knows he's connected to "relativity", "Nobel Prize", "Princeton", and surfaces those related documents too.
Hybrid Search — The magic sauce 🪄 — blends both approaches with configurable weights, so you get results that are both semantically relevant AND contextually connected.

The result? A search engine that doesn't just find documents — it understands your knowledge base as a connected web of ideas.

Built For: Knowledge bases · Research paper discovery · Document Q&A · Content recommendation · Intelligent FAQ systems

✨ Key Features

Category	Features
Search	🔍 Vector search (semantic) · 🕸️ Graph search (structural) · 🎯 Hybrid search (combined) · 📊 Configurable weighting
Data	📄 Auto chunking, embedding, entity extraction · 🔗 Auto relationship mapping · 🗃️ Dual storage (FAISS + Neo4j) · 🔧 Full CRUD
DevEx	🛡️ MVC architecture · 🚨 Custom error handling · 🔒 Cypher injection prevention · 📊 Database inspector · 🎨 Interactive graph visualization
DevOps	⚙️ GitHub Actions CI/CD · 🚀 AWS EC2 auto-deploy · ☁️ Streamlit Cloud frontend · 🐳 Docker Compose

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                 Streamlit Frontend (Cloud)                  │
│  (Search Interface + Graph Visualization + DB Inspector)    │
└──────────────────────┬──────────────────────────────────────┘
                       │ HTTP/REST
┌──────────────────────▼──────────────────────────────────────┐
│                  FastAPI Backend (AWS EC2)                  │
│  ┌────────────┬───────────────┬──────────────┬────────────┐ │
│  │   Routes   │  Controllers  │ Repositories │  Services  │ │
│  │  (API)     │ (Business     │ (Data Access)│ (NLP/ML)   │ │
│  │            │   Logic)      │              │            │ │
│  └────────────┴───────────────┴──────────────┴────────────┘ │
└──────────────────────┬────────────────────┬─────────────────┘
                       │                    │
         ┌─────────────▼───────┐  ┌─────────▼──────────┐
         │  Neo4j AuraDB       │  │   FAISS Vector     │
         │  (Cloud Graph DB)   │  │   Index (In-Mem)   │
         └─────────────────────┘  └────────────────────┘

For detailed architecture with Mermaid flow diagrams, see ARCHITECTURE_OVERVIEW.md

🛠️ Tech Stack

Layer	Technology	Purpose
Frontend	Streamlit, streamlit-agraph	Interactive UI with graph visualization
Backend	FastAPI, Pydantic	REST API with validation
Vector DB	FAISS (`IndexFlatIP`)	Semantic similarity search
Graph DB	Neo4j AuraDB	Entity relationships & graph traversal
NLP/ML	Sentence Transformers (`all-MiniLM-L6-v2`), spaCy	Embeddings & entity extraction
CI/CD	GitHub Actions	Lint → Test → Deploy pipeline
Infra	AWS EC2, Streamlit Cloud, Docker	Hosting & containerisation

📁 Project Structure

vector-graph-retrieval-app/
├── .github/
│   └── workflows/
│       ├── ci.yml                # CI: Lint (flake8) + Test (pytest)
│       └── cd.yml                # CD: Auto-deploy to EC2 via SSH
│
├── app/
│   ├── main.py                   # FastAPI app entry point
│   ├── config.py                 # Environment-based configuration
│   ├── database.py               # Neo4j + FAISS connection management
│   │
│   ├── api/                      # API Layer
│   │   ├── dependencies.py       # Dependency injection
│   │   └── routes/
│   │       ├── health.py         # GET  /v1/health
│   │       ├── documents.py      # CRUD /v1/nodes
│   │       ├── edges.py          # CRUD /v1/edges
│   │       ├── search.py         # POST /v1/search/*
│   │       └── debug.py          # GET  /v1/debug/*
│   │
│   ├── controllers/              # Business Logic Layer
│   │   ├── document_controller.py
│   │   ├── edge_controller.py
│   │   └── search_controller.py
│   │
│   ├── repositories/             # Data Access Layer
│   │   ├── base.py               # Base repository interface
│   │   ├── neo4j_repository.py   # Neo4j graph operations
│   │   └── vector_repository.py  # FAISS vector operations
│   │
│   ├── services/                 # Utility Services
│   │   ├── embedding.py          # Text → 384-dim vector
│   │   ├── ingestion.py          # Document processing pipeline
│   │   └── search.py             # Search algorithms
│   │
│   ├── models/
│   │   └── schemas.py            # Pydantic request/response models
│   │
│   └── core/
│       ├── constants.py          # App-wide constants
│       └── exceptions.py         # Custom exception hierarchy
│
├── frontend/
│   ├── streamlit_app.py          # Streamlit UI (deployed on Streamlit Cloud)
│   ├── requirements.txt          # Frontend-specific dependencies
│   └── index.html                # Static landing page
│
├── tests/
│   └── test_api.py               # Mocked API tests (no DB required)
│
├── .env.example                  # Environment variable template
├── .gitignore                    # Excludes venv/, data/, .env, __pycache__/
├── ARCHITECTURE_OVERVIEW.md      # Detailed architecture with Mermaid diagrams
├── docker-compose.yml            # Local Neo4j setup
├── pytest.ini                    # Pytest configuration
└── requirements.txt              # Backend Python dependencies

🚀 Setup & Installation

Prerequisites

Python 3.10+
Docker & Docker Compose (for local Neo4j)
Git

Quick Start

# 1. Clone
git clone https://github.com/Jash2606/vector-graph-retrieval-app.git
cd vector-graph-retrieval-app

# 2. Virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Linux/Mac

# 3. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# 4. Configure environment
cp .env.example .env         # Windows (cmd): copy .env.example .env
# Edit .env with your credentials

Environment Variables

# For local development (Docker Neo4j)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

# For cloud (Neo4j AuraDB): set these instead of the local values above, not alongside them
NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=<your-aura-password>

# Frontend API target
API_URL=http://localhost:8000/v1
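
As a rough illustration of how these variables might be read at startup, here is a minimal sketch. The `Settings` class and `load_settings` helper are hypothetical names chosen for this example and are not taken from `app/config.py`; the fallback values simply mirror the local-development settings listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Connection settings read from the environment (hypothetical helper)."""
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    api_url: str


def load_settings(env=None):
    """Build Settings from a mapping; falls back to os.environ when none is given.

    The defaults are the local Docker values shown in this README.
    """
    env = os.environ if env is None else env
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD", "password"),
        api_url=env.get("API_URL", "http://localhost:8000/v1"),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in tests without touching the real environment.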

Run Locally

# Start Neo4j (local only)
docker-compose up -d

# Start backend
uvicorn app.main:app --reload
# → http://localhost:8000/docs (Swagger UI)

# Start frontend (separate terminal)
streamlit run frontend/streamlit_app.py
# → http://localhost:8501

🌐 Deployment

Backend — AWS EC2

The backend auto-deploys through a GitHub Actions CD pipeline:

Push to main → CI runs (lint + tests)
CI passes → CD triggers SSH deploy to EC2
EC2 pulls latest code, installs deps, restarts systemd service
Health check: GET /v1/health
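
The final health-check step could be scripted along these lines. This is a sketch under assumptions, not the contents of `cd.yml`: the function names, retry count, and delay are illustrative, and the only detail taken from this README is the `GET /v1/health` endpoint.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 while starting).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_healthy(url, attempts=5, delay=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll `url` until it returns 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

A deploy script might call `wait_until_healthy("http://localhost:8000/v1/health")` after restarting the service and fail the job when it returns `False`.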

Frontend — Streamlit Cloud

The Streamlit frontend is deployed on Streamlit Community Cloud:

Connected to this GitHub repo's frontend/streamlit_app.py
API_URL env var points to the EC2 backend
Auto-redeploys on push to main

Database — Neo4j AuraDB

Free-tier cloud instance at neo4j+s://<instance>.databases.neo4j.io
Credentials stored in .env (gitignored) and EC2 environment

📚 API Documentation

Base URL: http://localhost:8000/v1 (local) or your EC2 public IP

Interactive Docs: Visit /docs (Swagger UI) when running

Endpoint	Method	Description
`/v1/health`	GET	Health check (Neo4j + FAISS status)
`/v1/nodes`	POST	Create document (auto: embed + extract entities + graph connect)
`/v1/nodes/{id}`	GET	Get document by ID
`/v1/nodes/{id}`	PUT	Update document
`/v1/nodes/{id}`	DELETE	Delete document
`/v1/edges`	POST	Create relationship (RELATED_TO, MENTIONS, CITES, REQUIRES)
`/v1/edges/{id}`	GET	Get edge by ID
`/v1/search/vector`	POST	Semantic vector search
`/v1/search/graph`	GET	Graph traversal from start node
`/v1/search/hybrid`	POST	Combined vector + graph search
`/v1/debug/documents`	GET	List all documents (debug)
`/v1/debug/entities`	GET	List all entities (debug)
`/v1/debug/faiss/info`	GET	FAISS index stats (debug)
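
Because the edge endpoint accepts a fixed set of relationship types, one common way to keep user input out of Cypher text is an allowlist check before the type is interpolated, with everything else passed as query parameters. The sketch below illustrates that idea only; the `Document` label, the `id` property, and the function name are assumptions made for the example, not the project's actual schema or code.

```python
# Relationship types listed for POST /v1/edges in the table above.
ALLOWED_RELATIONSHIP_TYPES = frozenset({"RELATED_TO", "MENTIONS", "CITES", "REQUIRES"})


def build_create_edge_query(rel_type, source_id, target_id):
    """Return (cypher, params) for creating an edge between two documents.

    The relationship type is validated against an allowlist before it is
    placed in the query text; node IDs travel as query parameters.
    """
    if rel_type not in ALLOWED_RELATIONSHIP_TYPES:
        raise ValueError(f"Unsupported relationship type: {rel_type!r}")
    cypher = (
        # Label and property names here are hypothetical.
        "MATCH (a:Document {id: $source_id}), (b:Document {id: $target_id}) "
        f"CREATE (a)-[r:{rel_type}]->(b) "
        "RETURN r"
    )
    return cypher, {"source_id": source_id, "target_id": target_id}
```

With this shape, a value such as `"CITES]->(b) DETACH DELETE a //"` is rejected before any query text is built.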

Example Requests

# Ingest a document
curl -X POST "http://localhost:8000/v1/nodes" \
  -H "Content-Type: application/json" \
  -d '{"text": "Albert Einstein was a German-born theoretical physicist...", "title": "Einstein Bio"}'

# Hybrid search
curl -X POST "http://localhost:8000/v1/search/hybrid" \
  -H "Content-Type: application/json" \
  -d '{"query_text": "Einstein relativity", "vector_weight": 0.7, "graph_weight": 0.3, "top_k": 5}'
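
The hybrid search call above can also be made from Python with only the standard library. This is a sketch, assuming the backend is running locally at the URL used in the curl example; the helper names are illustrative, and the response is decoded as generic JSON because its exact schema is not described here.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1"  # same base URL as the curl examples


def build_hybrid_search_request(query_text, vector_weight=0.7, graph_weight=0.3, top_k=5):
    """Build (but do not send) a POST request for /search/hybrid."""
    payload = {
        "query_text": query_text,
        "vector_weight": vector_weight,
        "graph_weight": graph_weight,
        "top_k": top_k,
    }
    return urllib.request.Request(
        url=f"{API_URL}/search/hybrid",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like `send(build_hybrid_search_request("Einstein relativity"))` once the backend is up.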

🔍 Search Algorithms

1. Vector Search

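Scores documents by cosine similarity, computed as the inner product of L2-normalized embeddings
Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
Exact, exhaustive inner-product retrieval via FAISS IndexFlatIP (no approximation or training step)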
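
To make the scoring concrete, here is a pure-Python stand-in for what a flat inner-product index computes. It is not the FAISS API and uses toy 2-dimensional vectors rather than 384-dimensional embeddings; the function names are illustrative. The point it shows is that once vectors are scaled to unit length, their inner product is their cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(query, vectors, k):
    """Score every stored vector against the query and return the k best.

    Returns a list of (index, score) pairs, highest score first.
    """
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(v)), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]
```

Because every stored vector is scored, the cost grows linearly with the number of documents, which is the trade-off an exhaustive index makes in exchange for exact results.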

2. Graph Search

Breadth-first (BFS) traversal from a start node to a configurable depth (1–3 recommended). Returns the full traversed subgraph with scored edges.
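
A depth-limited breadth-first traversal can be sketched over a plain adjacency dict as follows. This is an in-memory illustration, not the Neo4j query the backend runs: the function name and graph shape are assumptions, it records only the edge through which each node was first reached, and it leaves out edge scoring.

```python
from collections import deque


def bfs_subgraph(adjacency, start, max_depth):
    """Breadth-first traversal up to max_depth hops from start.

    Returns (hops, edges): hops maps each reached node to its distance
    from start; edges lists the (source, target) pair through which each
    node was first reached.
    """
    hops = {start: 0}
    edges = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                edges.append((node, neighbor))
                queue.append(neighbor)
    return hops, edges
```

Keeping the depth small matters because the number of reachable nodes can grow quickly with each extra hop in a densely connected graph.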

3. Hybrid Search

final_score = α × vector_score + β × graph_score

where:
  vector_score = normalized cosine similarity
  graph_score = f(connectivity, hops, entity_matches)
  α + β = 1.0

Graph Score Components:

Connectivity: Number of relationships
Hops: Distance from query entities
Expansion Bonus: Bonus for multi-hop discovery

See ARCHITECTURE_OVERVIEW.md for detailed diagrams and scoring formulae.
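
The weighted blend itself is simple to sketch. In the example below, `hybrid_score` follows the formula above directly, while `graph_score` is a purely hypothetical stand-in for `f(...)`: its 50/50 split, the cap of 10 relationships, and the decision to raise an error when the weights do not sum to 1.0 are assumptions for illustration, not the project's scoring rules.

```python
def graph_score(connectivity, hops, expansion_bonus=0.0, max_connectivity=10):
    """Hypothetical f(...): more relationships and fewer hops score higher.

    The result is clamped to at most 1.0 so it stays comparable to a
    normalized vector score.
    """
    connectivity_part = min(connectivity, max_connectivity) / max_connectivity
    hop_part = 1.0 / (1 + hops)
    return min(1.0, 0.5 * connectivity_part + 0.5 * hop_part + expansion_bonus)


def hybrid_score(vector_score, g_score, vector_weight=0.7, graph_weight=0.3):
    """final_score = alpha * vector_score + beta * graph_score, with alpha + beta = 1."""
    if abs(vector_weight + graph_weight - 1.0) > 1e-9:
        raise ValueError("vector_weight and graph_weight must sum to 1.0")
    return vector_weight * vector_score + graph_weight * g_score
```

For instance, a document with a vector score of 0.8 and a graph score of 0.5 would blend to 0.7 × 0.8 + 0.3 × 0.5 = 0.71 under the default weights.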

🧪 Testing & CI/CD

Run Tests Locally

pytest tests/test_api.py -v

Tests use mocked dependencies — no Neo4j or FAISS required.
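
The general pattern behind such tests can be sketched with `unittest.mock` alone. The `DocumentController` class and its `get_document` method below are hypothetical stand-ins written for this example, not the code in `app/controllers/`; the idea shown is that a mock repository replaces the database so the logic can be checked in isolation.

```python
from unittest.mock import Mock


class DocumentController:
    """Hypothetical controller that delegates storage to a repository."""

    def __init__(self, repository):
        self.repository = repository

    def get_document(self, doc_id):
        doc = self.repository.get(doc_id)
        if doc is None:
            raise KeyError(f"Document {doc_id} not found")
        return doc


def test_get_document_uses_repository():
    # The mock stands in for a Neo4j-backed repository.
    repo = Mock()
    repo.get.return_value = {"id": "doc-1", "title": "Einstein Bio"}
    controller = DocumentController(repo)

    assert controller.get_document("doc-1")["title"] == "Einstein Bio"
    repo.get.assert_called_once_with("doc-1")


def test_missing_document_raises():
    repo = Mock()
    repo.get.return_value = None
    controller = DocumentController(repo)
    try:
        controller.get_document("missing")
    except KeyError:
        pass
    else:
        raise AssertionError("expected KeyError")
```

Functions named `test_*` like these are collected automatically by pytest when placed in a file under `tests/`.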

Test Coverage

Endpoint	Method	Tested
`/v1/`	GET	✅
`/v1/health`	GET	✅
`/v1/nodes`	POST	✅
`/v1/nodes/{id}`	GET/PUT/DELETE	✅
`/v1/edges`	POST	✅
`/v1/edges/{id}`	GET	✅
`/v1/search/vector`	POST	✅
`/v1/search/graph`	GET	✅
`/v1/search/hybrid`	POST	✅

CI/CD Pipeline

Push to main ──→ CI (GitHub Actions)
                  ├── Lint (flake8)
                  └── Test (pytest)
                        │
                    ✅ Pass
                        │
                  CD (GitHub Actions)
                  └── SSH deploy to EC2
                        ├── git pull
                        ├── pip install
                        ├── systemctl restart
                        └── Health check ✅

🔮 Future Enhancements

Cross-encoder reranking
Query expansion (synonyms/paraphrases)
Multi-modal embeddings (image + text)
Redis caching for hot queries
Prometheus + Grafana monitoring
Batch ingestion API

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Built with ❤️ using FastAPI, Neo4j, and FAISS
