A production-grade semantic search engine that combines vector similarity (FAISS) with graph-based knowledge traversal (Neo4j) to deliver contextually rich and highly relevant search results.
Live Demo: Frontend (Streamlit Cloud) · Backend API: Hosted on AWS EC2
# Hybrid Vector-Graph Retrieval System
Ever searched for something and got results that were technically relevant but missed the bigger picture? That's the limitation of traditional search: it matches words or meaning but ignores how information connects.
This system solves that by combining two powerful approaches:
- **Vector Search** understands the meaning behind your query using AI embeddings. Search for "Who changed modern physics?" and it finds Einstein, even if those exact words never appear.
- **Graph Search** explores connections between documents and entities. Found a doc about Einstein? It automatically knows he's connected to "relativity", "Nobel Prize", and "Princeton", and surfaces those related documents too.
- **Hybrid Search**, the magic sauce, blends both approaches with configurable weights, so you get results that are both semantically relevant AND contextually connected.
The result? A search engine that doesn't just find documents; it understands your knowledge base as a connected web of ideas.
Built For: Knowledge bases · Research paper discovery · Document Q&A · Content recommendation · Intelligent FAQ systems
| Category | Features |
|---|---|
| Search | Vector search (semantic) · Graph search (structural) · Hybrid search (combined) · Configurable weighting |
| Data | Auto chunking, embedding, entity extraction · Auto relationship mapping · Dual storage (FAISS + Neo4j) · Full CRUD |
| DevEx | MVC architecture · Custom error handling · Cypher injection prevention · Database inspector · Interactive graph visualization |
| DevOps | GitHub Actions CI/CD · AWS EC2 auto-deploy · Streamlit Cloud frontend · Docker Compose |
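The "Cypher injection prevention" item above generally means using parameterized queries rather than string interpolation. A minimal sketch of the pattern, with a hypothetical helper and query (not the repo's actual code):

```python
def build_create_document_query(title: str, text: str):
    """Return a parameterized Cypher query plus its parameter map.

    User input never enters the query string itself, so a payload like
    "'}) DETACH DELETE n //" stays inert data instead of executable Cypher.
    """
    query = (
        "CREATE (d:Document {title: $title, text: $text}) "
        "RETURN elementId(d)"
    )
    params = {"title": title, "text": text}
    return query, params

query, params = build_create_document_query("Einstein Bio", "Albert Einstein...")
# With the neo4j driver this would run as: session.run(query, **params)
print("$title" in query and "Einstein Bio" not in query)  # True
```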
```
┌─────────────────────────────────────────────────────────┐
│               Streamlit Frontend (Cloud)                │
│ (Search Interface + Graph Visualization + DB Inspector) │
└───────────────────────────┬─────────────────────────────┘
                            │ HTTP/REST
┌───────────────────────────▼─────────────────────────────┐
│               FastAPI Backend (AWS EC2)                 │
│ ┌──────────┬──────────────┬───────────────┬───────────┐ │
│ │  Routes  │ Controllers  │ Repositories  │ Services  │ │
│ │  (API)   │  (Business   │ (Data Access) │ (NLP/ML)  │ │
│ │          │   Logic)     │               │           │ │
│ └──────────┴──────────────┴───────────────┴───────────┘ │
└──────────────────┬───────────────────────┬──────────────┘
                   │                       │
        ┌──────────▼──────────┐ ┌──────────▼─────────┐
        │    Neo4j AuraDB     │ │    FAISS Vector    │
        │  (Cloud Graph DB)   │ │   Index (In-Mem)   │
        └─────────────────────┘ └────────────────────┘
```
For detailed architecture with Mermaid flow diagrams, see ARCHITECTURE_OVERVIEW.md
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Streamlit, streamlit-agraph | Interactive UI with graph visualization |
| Backend | FastAPI, Pydantic | REST API with validation |
| Vector DB | FAISS (IndexFlatIP) | Semantic similarity search |
| Graph DB | Neo4j AuraDB | Entity relationships & graph traversal |
| NLP/ML | Sentence Transformers (all-MiniLM-L6-v2), spaCy | Embeddings & entity extraction |
| CI/CD | GitHub Actions | Lint → Test → Deploy pipeline |
| Infra | AWS EC2, Streamlit Cloud, Docker | Hosting & containerisation |
```
vector-graph-retrieval-app/
├── .github/
│   └── workflows/
│       ├── ci.yml                 # CI: Lint (flake8) + Test (pytest)
│       └── cd.yml                 # CD: Auto-deploy to EC2 via SSH
│
├── app/
│   ├── main.py                    # FastAPI app entry point
│   ├── config.py                  # Environment-based configuration
│   ├── database.py                # Neo4j + FAISS connection management
│   │
│   ├── api/                       # API Layer
│   │   ├── dependencies.py        # Dependency injection
│   │   └── routes/
│   │       ├── health.py          # GET /v1/health
│   │       ├── documents.py       # CRUD /v1/nodes
│   │       ├── edges.py           # CRUD /v1/edges
│   │       ├── search.py          # POST /v1/search/*
│   │       └── debug.py           # GET /v1/debug/*
│   │
│   ├── controllers/               # Business Logic Layer
│   │   ├── document_controller.py
│   │   ├── edge_controller.py
│   │   └── search_controller.py
│   │
│   ├── repositories/              # Data Access Layer
│   │   ├── base.py                # Base repository interface
│   │   ├── neo4j_repository.py    # Neo4j graph operations
│   │   └── vector_repository.py   # FAISS vector operations
│   │
│   ├── services/                  # Utility Services
│   │   ├── embedding.py           # Text → 384-dim vector
│   │   ├── ingestion.py           # Document processing pipeline
│   │   └── search.py              # Search algorithms
│   │
│   ├── models/
│   │   └── schemas.py             # Pydantic request/response models
│   │
│   └── core/
│       ├── constants.py           # App-wide constants
│       └── exceptions.py          # Custom exception hierarchy
│
├── frontend/
│   ├── streamlit_app.py           # Streamlit UI (deployed on Streamlit Cloud)
│   ├── requirements.txt           # Frontend-specific dependencies
│   └── index.html                 # Static landing page
│
├── tests/
│   └── test_api.py                # Mocked API tests (no DB required)
│
├── .env.example                   # Environment variable template
├── .gitignore                     # Excludes venv/, data/, .env, __pycache__/
├── ARCHITECTURE_OVERVIEW.md       # Detailed architecture with Mermaid diagrams
├── docker-compose.yml             # Local Neo4j setup
├── pytest.ini                     # Pytest configuration
└── requirements.txt               # Backend Python dependencies
```
- Python 3.10+
- Docker & Docker Compose (for local Neo4j)
- Git
```bash
# 1. Clone
git clone https://github.com/Jash2606/vector-graph-retrieval-app.git
cd vector-graph-retrieval-app

# 2. Virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Linux/Mac

# 3. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# 4. Configure environment
cp .env.example .env
# Edit .env with your credentials
```

```bash
# For local development (Docker Neo4j)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

# For cloud (Neo4j AuraDB)
NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=<your-aura-password>

# Frontend API target
API_URL=http://localhost:8000/v1
```

```bash
# Start Neo4j (local only)
docker-compose up -d

# Start backend
uvicorn app.main:app --reload
# → http://localhost:8000/docs (Swagger UI)

# Start frontend (separate terminal)
streamlit run frontend/streamlit_app.py
# → http://localhost:8501
```

The backend auto-deploys via the GitHub Actions CD pipeline:
- Push to `main` → CI runs (lint + tests)
- CI passes → CD triggers an SSH deploy to EC2
- EC2 pulls the latest code, installs dependencies, and restarts the `systemd` service
- Health check: `GET /v1/health`
The Streamlit frontend is deployed on Streamlit Community Cloud:
- Connected to this GitHub repo's `frontend/streamlit_app.py`
- The `API_URL` env var points to the EC2 backend
- Auto-redeploys on push to `main`
- Free-tier cloud instance at `neo4j+s://<instance>.databases.neo4j.io`
- Credentials stored in `.env` (gitignored) and in the EC2 environment
Base URL: `http://localhost:8000/v1` (local) or your EC2 public IP
Interactive Docs: visit `/docs` (Swagger UI) when the backend is running
| Endpoint | Method | Description |
|---|---|---|
| `/v1/health` | GET | Health check (Neo4j + FAISS status) |
| `/v1/nodes` | POST | Create document (auto: embed + extract entities + graph connect) |
| `/v1/nodes/{id}` | GET | Get document by ID |
| `/v1/nodes/{id}` | PUT | Update document |
| `/v1/nodes/{id}` | DELETE | Delete document |
| `/v1/edges` | POST | Create relationship (RELATED_TO, MENTIONS, CITES, REQUIRES) |
| `/v1/edges/{id}` | GET | Get edge by ID |
| `/v1/search/vector` | POST | Semantic vector search |
| `/v1/search/graph` | GET | Graph traversal from a start node |
| `/v1/search/hybrid` | POST | Combined vector + graph search |
| `/v1/debug/documents` | GET | List all documents (debug) |
| `/v1/debug/entities` | GET | List all entities (debug) |
| `/v1/debug/faiss/info` | GET | FAISS index stats (debug) |
```bash
# Ingest a document
curl -X POST "http://localhost:8000/v1/nodes" \
  -H "Content-Type: application/json" \
  -d '{"text": "Albert Einstein was a German-born theoretical physicist...", "title": "Einstein Bio"}'

# Hybrid search
curl -X POST "http://localhost:8000/v1/search/hybrid" \
  -H "Content-Type: application/json" \
  -d '{"query_text": "Einstein relativity", "vector_weight": 0.7, "graph_weight": 0.3, "top_k": 5}'
```

- Uses cosine similarity on normalized embeddings
- Model: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- Fast retrieval via FAISS `IndexFlatIP`
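Why an inner-product index works for cosine similarity: on unit-length vectors the two are identical. A minimal pure-Python sketch (no FAISS; the toy 3-dim vectors stand in for real 384-dim embeddings):

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    """Plain dot product, as computed by an IndexFlatIP-style index."""
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings" (hypothetical values).
doc = normalize([0.2, 0.9, 0.1])
query = normalize([0.1, 0.8, 0.3])

# On unit vectors, inner product == cosine similarity.
cos = inner_product(doc, query)
print(round(cos, 4))
```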
BFS traversal from a start node with configurable depth (1–3 recommended). Returns the full subgraph with scored edges.
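The depth-limited traversal can be sketched as a BFS over an adjacency map; the node names and mini-graph below are illustrative, not the repo's actual implementation:

```python
from collections import deque

def bfs_subgraph(adj, start, max_depth=2):
    """Depth-limited BFS: map each reachable node to its hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not expand beyond the configured depth
        for neighbor in adj.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = depth + 1
                queue.append(neighbor)
    return seen

# Hypothetical mini-graph around an "einstein" document node.
adj = {
    "einstein": ["relativity", "nobel_prize"],
    "relativity": ["physics"],
    "nobel_prize": [],
    "physics": [],
}
print(bfs_subgraph(adj, "einstein", max_depth=2))
# {'einstein': 0, 'relativity': 1, 'nobel_prize': 1, 'physics': 2}
```

The hop distances returned here are the same "hops" that feed the graph score in hybrid ranking.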
```
final_score = α × vector_score + β × graph_score

where:
  vector_score = normalized cosine similarity
  graph_score  = f(connectivity, hops, entity_matches)
  α + β = 1.0
```
Graph Score Components:
- Connectivity: Number of relationships
- Hops: Distance from query entities
- Expansion Bonus: Bonus for multi-hop discovery
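The blend above fits in a few lines; the sample scores are hypothetical inputs, and computing `graph_score` from connectivity, hops, and entity matches is left to the repo's actual scoring:

```python
def hybrid_score(vector_score, graph_score, alpha=0.7):
    """Weighted blend of normalized vector and graph scores (beta = 1 - alpha)."""
    assert 0.0 <= alpha <= 1.0
    beta = 1.0 - alpha
    return alpha * vector_score + beta * graph_score

# A doc that is semantically close (0.9) but weakly connected (0.1)...
print(round(hybrid_score(0.9, 0.1, alpha=0.7), 2))  # 0.66
# ...vs. one that is moderately similar (0.6) but richly connected (0.9).
print(round(hybrid_score(0.6, 0.9, alpha=0.7), 2))  # 0.69
```

Tuning `alpha` toward 1.0 favors pure semantic matches; lowering it lets well-connected documents outrank them, as in the second example.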
See ARCHITECTURE_OVERVIEW.md for detailed diagrams and scoring formulae.
```bash
pytest tests/test_api.py -v
```

Tests use mocked dependencies; no Neo4j or FAISS required.
| Endpoint | Method | Tested |
|---|---|---|
| `/v1/` | GET | ✅ |
| `/v1/health` | GET | ✅ |
| `/v1/nodes` | POST | ✅ |
| `/v1/nodes/{id}` | GET/PUT/DELETE | ✅ |
| `/v1/edges` | POST | ✅ |
| `/v1/edges/{id}` | GET | ✅ |
| `/v1/search/vector` | POST | ✅ |
| `/v1/search/graph` | GET | ✅ |
| `/v1/search/hybrid` | POST | ✅ |
```
Push to main ──▶ CI (GitHub Actions)
                 ├── Lint (flake8)
                 └── Test (pytest)
                        │
                        ▼
                      Pass
                        │
                        ▼
                 CD (GitHub Actions)
                 └── SSH deploy to EC2
                     ├── git pull
                     ├── pip install
                     ├── systemctl restart
                     └── Health check ✅
```
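The CI leg of this pipeline has a familiar shape; the sketch below is a generic minimal workflow under the assumption of a standard Python setup, not the repo's actual `ci.yml`:

```yaml
# .github/workflows/ci.yml (sketch, not the repo's actual file)
name: CI
on:
  push:
    branches: [main]
jobs:
  lint-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt flake8 pytest
      - run: flake8 app/
      - run: pytest tests/test_api.py -v
```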
- Cross-encoder reranking
- Query expansion (synonyms/paraphrases)
- Multi-modal embeddings (image + text)
- Redis caching for hot queries
- Prometheus + Grafana monitoring
- Batch ingestion API
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Built with ❤️ using FastAPI, Neo4j, and FAISS