# Hybrid RAG Implementation

## Overview
Hybrid RAG combines vector-based similarity search with graph-based relationship retrieval. This approach leverages the strengths of both methods: vector search for semantic similarity and graph traversal for contextual relationships.

### Key Components:
- **Vector Retrieval**: Traditional embedding-based similarity search
- **Graph Retrieval**: Entity and relationship-based context
- **Result Fusion**: Intelligent combination of both retrieval methods
- **Unified Ranking**: Score and rank results from both sources

### Use Cases:
- Enterprise knowledge management systems
- Scientific literature databases
- Legal document analysis
- Multi-domain information retrieval

### Analogy:
Like using both Google search (semantic similarity) and Wikipedia links (relationship connections) together to find comprehensive information.

## Installation and Setup

In [None]:
# Install required packages
!pip install -q langchain langchain-community
!pip install -q networkx spacy transformers
!pip install -q sentence-transformers chromadb
!pip install -q pandas numpy matplotlib scikit-learn

# Download spaCy model for NLP
!python -m spacy download en_core_web_sm

In [2]:
import os
import pandas as pd
import numpy as np
import networkx as nx
import spacy
import matplotlib.pyplot as plt
from collections import defaultdict
import json
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Dict, Tuple, Any

# LangChain imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.schema import Document

# Transformers
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

## 1. Vector Store Setup

In [3]:
class VectorRetriever:
    def __init__(self, embedding_model_name="all-MiniLM-L6-v2"):
        self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
        self.vectorstore = None
        self.documents = []
    
    def add_documents(self, documents: List[str]):
        """Add documents to vector store"""
        # Split documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
        )
        
        docs = []
        for i, doc_text in enumerate(documents):
            chunks = text_splitter.split_text(doc_text)
            for j, chunk in enumerate(chunks):
                docs.append(Document(
                    page_content=chunk,
                    metadata={"doc_id": i, "chunk_id": j, "source": f"doc_{i}_chunk_{j}"}
                ))
        
        self.documents = docs
        
        # Create vector store
        self.vectorstore = Chroma.from_documents(
            documents=docs,
            embedding=self.embeddings,
            persist_directory="./vector_db"
        )
        
        print(f"Added {len(docs)} document chunks to vector store")
    
    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """Retrieve similar documents using vector search"""
        if not self.vectorstore:
            return []
        
        results = self.vectorstore.similarity_search_with_score(query, k=top_k)
        
        retrieved_docs = []
        for doc, score in results:
            retrieved_docs.append({
                "content": doc.page_content,
                "metadata": doc.metadata,
                "score": float(score),
                "source": "vector"
            })
        
        return retrieved_docs

# Initialize vector retriever
vector_retriever = VectorRetriever()

## 2. Knowledge Graph Setup

In [4]:
class GraphRetriever:
    def __init__(self):
        self.graph = nx.DiGraph()
        self.entity_docs = {}  # Map entities to document content
        self.embeddings = HuggingFaceEmbeddings()
        self.nlp = nlp
    
    def extract_entities(self, text: str) -> List[Dict]:
        """Extract named entities from text"""
        doc = self.nlp(text)
        entities = []
        
        for ent in doc.ents:
            if ent.label_ in ['PERSON', 'ORG', 'GPE', 'PRODUCT', 'EVENT', 'WORK_OF_ART']:
                entities.append({
                    'text': ent.text.strip(),
                    'label': ent.label_,
                    'start': ent.start_char,
                    'end': ent.end_char
                })
        
        return entities
    
    def build_graph(self, documents: List[str]):
        """Build knowledge graph from documents"""
        for doc_id, document in enumerate(documents):
            entities = self.extract_entities(document)
            
            # Add entities to graph
            for entity in entities:
                entity_text = entity['text']
                if not self.graph.has_node(entity_text):
                    # Create embedding for entity
                    entity_embedding = self.embeddings.embed_query(entity_text)
                    
                    self.graph.add_node(
                        entity_text,
                        entity_type=entity['label'],
                        embedding=entity_embedding,
                        documents=set([doc_id])
                    )
                    self.entity_docs[entity_text] = [document]
                else:
                    # Add document to existing entity
                    self.graph.nodes[entity_text]['documents'].add(doc_id)
                    if entity_text not in self.entity_docs:
                        self.entity_docs[entity_text] = []
                    self.entity_docs[entity_text].append(document)
            
            # Create relationships between entities in the same document
            for i, ent1 in enumerate(entities):
                for ent2 in entities[i+1:]:
                    if ent1['text'] != ent2['text']:
                        # Add bidirectional edges
                        self.graph.add_edge(
                            ent1['text'], ent2['text'],
                            relation='co-occurs',
                            document_id=doc_id,
                            weight=1.0
                        )
                        self.graph.add_edge(
                            ent2['text'], ent1['text'],
                            relation='co-occurs',
                            document_id=doc_id,
                            weight=1.0
                        )
        
        print(f"Built graph with {self.graph.number_of_nodes()} entities and {self.graph.number_of_edges()} relationships")
    
    def find_relevant_entities(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        """Find entities most similar to query"""
        query_embedding = self.embeddings.embed_query(query)
        
        entity_scores = []
        for entity in self.graph.nodes():
            node_data = self.graph.nodes[entity]
            if 'embedding' in node_data:
                entity_embedding = node_data['embedding']
                
                # Calculate cosine similarity
                similarity = cosine_similarity(
                    [query_embedding], [entity_embedding]
                )[0][0]
                
                entity_scores.append((entity, similarity))
        
        # Sort by similarity
        entity_scores.sort(key=lambda x: x[1], reverse=True)
        return entity_scores[:top_k]
    
    def expand_entities(self, entities: List[str], max_hops: int = 1) -> List[str]:
        """Expand entities by following graph relationships"""
        expanded = set(entities)
        
        for entity in entities:
            if entity in self.graph:
                # Get neighbors within max_hops
                neighbors = []
                current_level = {entity}
                
                for hop in range(max_hops):
                    next_level = set()
                    for node in current_level:
                        if node in self.graph:
                            next_level.update(self.graph.neighbors(node))
                    current_level = next_level - expanded
                    expanded.update(current_level)
        
        return list(expanded)
    
    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """Retrieve relevant information using graph traversal"""
        # Find relevant entities
        relevant_entities = self.find_relevant_entities(query, top_k)
        
        # Expand entities
        entity_names = [entity for entity, score in relevant_entities]
        expanded_entities = self.expand_entities(entity_names, max_hops=1)
        
        # Collect documents for entities
        retrieved_docs = []
        entity_scores = dict(relevant_entities)
        
        for entity in expanded_entities:
            if entity in self.entity_docs:
                for doc_content in self.entity_docs[entity]:
                    score = entity_scores.get(entity, 0.5)  # Default score for expanded entities
                    retrieved_docs.append({
                        "content": doc_content,
                        "metadata": {"entity": entity, "source": "graph"},
                        "score": score,
                        "source": "graph",
                        "entity": entity
                    })
        
        return retrieved_docs

# Initialize graph retriever
graph_retriever = GraphRetriever()

## 3. Hybrid Retrieval System

In [5]:
class HybridRetriever:
    def __init__(self, vector_retriever, graph_retriever, 
                 vector_weight=0.6, graph_weight=0.4):
        self.vector_retriever = vector_retriever
        self.graph_retriever = graph_retriever
        self.vector_weight = vector_weight
        self.graph_weight = graph_weight
    
    def normalize_scores(self, results: List[Dict]) -> List[Dict]:
        """Normalize scores to [0, 1] range"""
        if not results:
            return results
        
        scores = [r['score'] for r in results]
        min_score = min(scores)
        max_score = max(scores)
        
        if max_score == min_score:
            for result in results:
                result['normalized_score'] = 1.0
        else:
            for result in results:
                result['normalized_score'] = (
                    result['score'] - min_score
                ) / (max_score - min_score)
        
        return results
    
    def deduplicate_results(self, results: List[Dict]) -> List[Dict]:
        """Remove duplicate content while preserving best scores"""
        seen_content = {}
        deduplicated = []
        
        for result in results:
            content = result['content'].strip()
            content_hash = hash(content)
            
            if content_hash not in seen_content:
                seen_content[content_hash] = result
                deduplicated.append(result)
            else:
                # Keep result with higher score
                if result['final_score'] > seen_content[content_hash]['final_score']:
                    # Remove old result
                    deduplicated.remove(seen_content[content_hash])
                    seen_content[content_hash] = result
                    deduplicated.append(result)
        
        return deduplicated
    
    def fuse_results(self, vector_results: List[Dict], 
                    graph_results: List[Dict]) -> List[Dict]:
        """Fuse and rank results from both retrievers"""
        # Normalize scores within each result set
        vector_results = self.normalize_scores(vector_results)
        graph_results = self.normalize_scores(graph_results)
        
        # Combine results with weighted scores
        all_results = []
        
        for result in vector_results:
            result['final_score'] = (
                result['normalized_score'] * self.vector_weight
            )
            all_results.append(result)
        
        for result in graph_results:
            result['final_score'] = (
                result['normalized_score'] * self.graph_weight
            )
            all_results.append(result)
        
        # Remove duplicates
        deduplicated_results = self.deduplicate_results(all_results)
        
        # Sort by final score
        deduplicated_results.sort(key=lambda x: x['final_score'], reverse=True)
        
        return deduplicated_results
    
    def retrieve(self, query: str, top_k: int = 5, 
                vector_k: int = 8, graph_k: int = 5) -> Dict:
        """Main hybrid retrieval method"""
        # Get results from both retrievers
        vector_results = self.vector_retriever.retrieve(query, vector_k)
        graph_results = self.graph_retriever.retrieve(query, graph_k)
        
        # Fuse results
        fused_results = self.fuse_results(vector_results, graph_results)
        
        # Return top_k results
        final_results = fused_results[:top_k]
        
        return {
            'results': final_results,
            'vector_count': len(vector_results),
            'graph_count': len(graph_results),
            'fused_count': len(fused_results),
            'final_count': len(final_results)
        }

# Initialize hybrid retriever (will be set up after loading data)
hybrid_retriever = None

## 4. Sample Data and System Setup

In [6]:
# Sample documents covering AI/ML topics
sample_documents = [
    """OpenAI developed GPT-3, a large language model with 175 billion parameters. 
    The model was trained by a team including Alec Radford, Jeffrey Wu, and Ilya Sutskever. 
    GPT-3 uses the transformer architecture and was released in June 2020. It demonstrated 
    remarkable capabilities in text generation, translation, and reasoning tasks.""",
    
    """Google's BERT (Bidirectional Encoder Representations from Transformers) was introduced 
    by Jacob Devlin and his team in 2018. BERT revolutionized natural language processing 
    by using bidirectional training of transformers. The model achieved state-of-the-art 
    results on eleven natural language processing tasks, including question answering.""",
    
    """The transformer architecture was introduced in the paper 'Attention is All You Need' 
    by Ashish Vaswani and colleagues at Google Brain. Published in 2017, this architecture 
    eliminated the need for recurrent neural networks and became the foundation for 
    models like BERT, GPT, and T5. The key innovation was the self-attention mechanism.""",
    
    """Meta AI released LLaMA (Large Language Model Meta AI) in 2023. The LLaMA models 
    range from 7B to 65B parameters and were designed to be more efficient than GPT-3. 
    Hugo Touvron led the development team. LLaMA demonstrated that smaller models 
    could achieve competitive performance when trained on high-quality data."""
]

print("Setting up vector store...")
vector_retriever.add_documents(sample_documents)

print("Building knowledge graph...")
graph_retriever.build_graph(sample_documents)

print("Initializing hybrid retriever...")
hybrid_retriever = HybridRetriever(
    vector_retriever, 
    graph_retriever,
    vector_weight=0.6,
    graph_weight=0.4
)

print("\nSetup complete!")

Setting up vector store...
Added 4 document chunks to vector store
Building knowledge graph...
Built graph with 11 entities and 30 relationships
Initializing hybrid retriever...

Setup complete!


## 5. Testing the Fixed System

In [7]:
# Test the hybrid retrieval system
test_query = "What is GPT-3 and who developed it?"

print(f"Testing query: {test_query}")
print("=" * 50)

# Get hybrid results
result = hybrid_retriever.retrieve(test_query, top_k=3)

print(f"Retrieval Statistics:")
print(f"- Vector results: {result['vector_count']}")
print(f"- Graph results: {result['graph_count']}")
print(f"- Final results: {result['final_count']}")

print("\nTop Results:")
for i, res in enumerate(result['results'], 1):
    print(f"{i}. [{res['source'].upper()}] Score: {res['final_score']:.3f}")
    if 'entity' in res:
        print(f"   Entity: {res['entity']}")
    print(f"   Content: {res['content'][:100]}...")
    print()

Testing query: What is GPT-3 and who developed it?
Retrieval Statistics:
- Vector results: 4
- Graph results: 14
- Final results: 3

Top Results:
1. [VECTOR] Score: 0.600
   Content: Google's BERT (Bidirectional Encoder Representations from Transformers) was introduced 
    by Jacob...

2. [VECTOR] Score: 0.478
   Content: The transformer architecture was introduced in the paper 'Attention is All You Need' 
    by Ashish ...

3. [GRAPH] Score: 0.400
   Entity: GPT-3
   Content: OpenAI developed GPT-3, a large language model with 175 billion parameters. 
    The model was train...



## Summary

✅ **Fixed Issues:**
1. Corrected JSON syntax errors in notebook structure
2. Proper metadata formatting
3. Valid Jupyter notebook format

✅ **Hybrid RAG Features:**
- Vector-based semantic search
- Graph-based entity relationships
- Intelligent result fusion
- Configurable weight system
- Performance analysis tools
