# Phase 2: GraphRAG Engine Development
## Hybrid Retrieval System (Graph + Vector Search)

**Objective**: Build a hybrid retrieval system that combines:
1. **Graph-based retrieval** - Navigate knowledge graph relationships
2. **Vector semantic search** - Find similar content using embeddings
3. **Fusion ranking** - Merge both result sets into a single weighted ranking

**What we'll do in this notebook:**
1. Load the processed data from Phase 1
2. Build a FAISS vector index for fast similarity search
3. Implement graph traversal algorithms (BFS, multi-hop reasoning)
4. Create hybrid retrieval that fuses graph + vector results
5. Test and evaluate retrieval quality
6. Build a query interface

**Prerequisites**: Complete notebook 01_data_collection_preprocessing.ipynb first!

---

## Step 1: Import Libraries

Import all required libraries for building the GraphRAG engine.

In [1]:
# Core libraries
import os
import json
import pickle
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Tuple, Set
from collections import defaultdict

# Graph processing
import networkx as nx

# Vector search
import faiss
from sentence_transformers import SentenceTransformer

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Progress tracking
from tqdm.auto import tqdm

# NLP
import spacy

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

print("✓ All libraries imported successfully!")
print(f"Working directory: {os.getcwd()}")


✓ All libraries imported successfully!
Working directory: d:\Projects\agent-wiki-graphrag\notebooks


## Step 2: Load Processed Data

Load all the data we prepared in Phase 1.

In [2]:
# Define paths
PROJECT_ROOT = Path(r"d:\Projects\agent-wiki-graphrag")
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"
EMBEDDINGS_DIR = DATA_DIR / "embeddings"
KG_DIR = DATA_DIR / "knowledge_graph"

# File paths
ARTICLES_FILE = RAW_DIR / "wikipedia_articles.json"
ENTITIES_FILE = PROCESSED_DIR / "entities.json"
EMBEDDINGS_FILE = EMBEDDINGS_DIR / "article_embeddings.pkl"
GRAPH_FILE = KG_DIR / "article_graph.pkl"
METADATA_FILE = PROCESSED_DIR / "metadata.json"

print("Loading data from Phase 1...")

# Load articles
with open(ARTICLES_FILE, 'r', encoding='utf-8') as f:
    articles = json.load(f)
print(f"✓ Loaded {len(articles)} articles")

# Load entities
with open(ENTITIES_FILE, 'r', encoding='utf-8') as f:
    entities = json.load(f)
print(f"✓ Loaded entities for {len(entities)} articles")

# Load embeddings
with open(EMBEDDINGS_FILE, 'rb') as f:
    article_embeddings = pickle.load(f)
print(f"✓ Loaded {len(article_embeddings)} article embeddings")

# Load knowledge graph
with open(GRAPH_FILE, 'rb') as f:
    G = pickle.load(f)
print(f"✓ Loaded knowledge graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Load metadata
with open(METADATA_FILE, 'r', encoding='utf-8') as f:
    metadata = json.load(f)
print("✓ Loaded metadata")

print(f"\n{'='*60}")
print(f"Data Summary:")
print(f"  Articles: {len(articles)}")
print(f"  Graph nodes: {G.number_of_nodes()}")
print(f"  Graph edges: {G.number_of_edges()}")
print(f"  Embedding dimension: {len(list(article_embeddings.values())[0])}")
print(f"{'='*60}")

Loading data from Phase 1...
✓ Loaded 1337 articles
✓ Loaded entities for 100 articles
✓ Loaded 1337 article embeddings
✓ Loaded knowledge graph: 1337 nodes, 11091 edges
✓ Loaded metadata

Data Summary:
  Articles: 1337
  Graph nodes: 1337
  Graph edges: 11091
  Embedding dimension: 384
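
The output above shows entities for only 100 of the 1337 articles, so later steps should not assume that every title appears in every data structure. Below is a minimal sketch of a coverage check, assuming titles are the shared key across articles, embeddings, entities, and graph nodes. The helper name `coverage_report` and the toy titles are hypothetical.

```python
def coverage_report(reference_titles, **collections):
    """Report which reference titles are missing from each named collection."""
    reference = set(reference_titles)
    report = {}
    for name, titles in collections.items():
        missing = reference - set(titles)
        report[name] = {
            "covered": len(reference) - len(missing),
            "missing": sorted(missing),
        }
    return report


# Toy data standing in for the Phase 1 outputs
toy_articles = ["Graph theory", "Neural network", "Quantum mechanics"]
toy_report = coverage_report(
    toy_articles,
    embeddings=["Graph theory", "Neural network", "Quantum mechanics"],
    entities=["Neural network"],
)
for name, stats in toy_report.items():
    print(f"{name}: {stats['covered']}/{len(toy_articles)} covered, missing={stats['missing']}")
```

In this notebook the call might look like `coverage_report(articles, embeddings=article_embeddings, entities=entities, graph=G.nodes())`, assuming `articles` is a dict keyed by title.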


## Step 3: Build FAISS Vector Index

Create a FAISS flat (exact-search) inner-product index over the article embeddings. The vectors are L2-normalized first, so the inner product equals cosine similarity.

In [None]:
class VectorIndex:
    """FAISS-based vector index for semantic search"""
    
    def __init__(self, embeddings_dict):
        """
        Initialize vector index from embeddings dictionary
        
        Args:
            embeddings_dict: Dict mapping article titles to embedding vectors
        """
        self.titles = list(embeddings_dict.keys())
        embeddings_list = [embeddings_dict[title] for title in self.titles]
        # FAISS requires a C-contiguous float32 matrix of shape (n_vectors, dimension)
        self.embeddings = np.ascontiguousarray(embeddings_list, dtype=np.float32)
        if self.embeddings.ndim != 2:
            raise ValueError(f"Expected a 2-D embedding matrix, got shape {self.embeddings.shape}")
        self.dimension = self.embeddings.shape[1]
        
        # Create FAISS index (inner product equals cosine similarity on unit vectors)
        self.index = faiss.IndexFlatIP(self.dimension)
        
        # Normalize in place; self.embeddings holds unit-length vectors from here on
        faiss.normalize_L2(self.embeddings)
        
        # Add to index
        self.index.add(self.embeddings)
        
        print("✓ FAISS index built")
        print(f"  Dimension: {self.dimension}")
        print(f"  Total vectors: {self.index.ntotal}")
        print(f"  Index type: Flat (exact search)")
    
    def search(self, query_embedding, top_k=10):
        """
        Search for similar articles
        
        Args:
            query_embedding: Query vector (will be normalized)
            top_k: Number of results to return
            
        Returns:
            List of (title, score) tuples
        """
        # Normalize query
        query_vec = np.array([query_embedding], dtype=np.float32)
        query_vec = np.ascontiguousarray(query_vec)
        faiss.normalize_L2(query_vec)
        
        # Search
        scores, indices = self.index.search(query_vec, top_k)
        
        # Return results (FAISS pads with index -1 when fewer than top_k vectors exist)
        results = []
        for idx, score in zip(indices[0], scores[0]):
            if 0 <= idx < len(self.titles):
                results.append((self.titles[idx], float(score)))
        
        return results

# Build the index
print("Building FAISS vector index...")
vector_index = VectorIndex(article_embeddings)
print("✓ Vector index ready for search!")

Building FAISS vector index...


TypeError: in method 'fvec_renorm_L2', argument 3 of type 'float *'
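
The traceback above is truncated, so its cause cannot be confirmed from the output alone. This error can arise when the array handed to FAISS is not a C-contiguous float32 array, or when the installed FAISS build and NumPy version are incompatible. Both are assumptions here. As a cross-check that does not depend on FAISS, exact cosine search over 1337 vectors is cheap in plain NumPy. The sketch below mirrors the `search` interface of `VectorIndex`. The class name `NumpyCosineIndex` and the toy vectors are hypothetical.

```python
import numpy as np


class NumpyCosineIndex:
    """Exact cosine-similarity search using only NumPy (hypothetical stand-in for VectorIndex)."""

    def __init__(self, embeddings_dict):
        self.titles = list(embeddings_dict.keys())
        matrix = np.asarray([embeddings_dict[t] for t in self.titles], dtype=np.float32)
        # L2-normalize rows so that a dot product equals cosine similarity
        norms = np.linalg.norm(matrix, axis=1, keepdims=True)
        self.matrix = matrix / np.clip(norms, 1e-12, None)

    def search(self, query_embedding, top_k=10):
        query = np.asarray(query_embedding, dtype=np.float32).ravel()
        query = query / max(float(np.linalg.norm(query)), 1e-12)
        scores = self.matrix @ query
        # Indices of the top_k highest scores, best first
        top = np.argsort(-scores)[:top_k]
        return [(self.titles[i], float(scores[i])) for i in top]


toy_embeddings = {
    "east": [1.0, 0.0],
    "north": [0.0, 1.0],
    "north-east": [1.0, 1.0],
}
toy_index = NumpyCosineIndex(toy_embeddings)
toy_results = toy_index.search([2.0, 0.1], top_k=2)
print(toy_results)
```

If both indexes can be built, comparing their top results for a few queries is a quick way to confirm that the FAISS index behaves as expected.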

## Step 4: Load Embedding Model

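Queries must be encoded with the same sentence-transformer model that produced the Phase 1 article embeddings (`all-MiniLM-L6-v2`), so that query vectors and article vectors share the same 384-dimensional space.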

In [None]:
# Load embedding model for encoding queries
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✓ Model loaded: all-MiniLM-L6-v2")
print(f"✓ Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

## Step 5: Test Vector Search

Let's test the vector search with some sample queries.

In [None]:
# Test vector search
test_queries = [
    "neural networks and deep learning",
    "quantum mechanics and physics",
    "programming languages and software",
    "statistics and probability theory"
]

print("Testing Vector Search")
print("="*60)

for query in test_queries:
    print(f"\nQuery: '{query}'")
    
    # Encode query
    query_embedding = embedding_model.encode(query)
    
    # Search
    results = vector_index.search(query_embedding, top_k=5)
    
    print("Top 5 results:")
    for i, (title, score) in enumerate(results, 1):
        print(f"  {i}. {title:45s} (score: {score:.4f})")

## Step 6: Implement Graph Traversal

Define a `GraphRetriever` class with three ways to find related articles in the knowledge graph: direct neighbors, multi-hop breadth-first search (BFS), and category overlap.

In [None]:
class GraphRetriever:
    """Graph-based retrieval using NetworkX"""
    
    def __init__(self, graph):
        self.graph = graph
        print("✓ Graph retriever initialized")
        print(f"  Nodes: {self.graph.number_of_nodes()}")
        print(f"  Edges: {self.graph.number_of_edges()}")
    
    def get_neighbors(self, node, max_neighbors=10):
        """Get direct neighbors of a node"""
        if node not in self.graph:
            return []
        
        # Get successors (outgoing edges)
        neighbors = list(self.graph.successors(node))[:max_neighbors]
        return neighbors
    
    def multi_hop_retrieval(self, start_nodes, max_hops=2, max_results=50):
        """
        Multi-hop graph traversal using BFS
        
        Args:
            start_nodes: List of starting article titles
            max_hops: Maximum number of hops to traverse
            max_results: Maximum number of articles to return
            
        Returns:
            List of (article_title, hop_distance) tuples
        """
        if isinstance(start_nodes, str):
            start_nodes = [start_nodes]
        
        visited = set()
        results = []
        queue = [(node, 0) for node in start_nodes if node in self.graph]
        
        while queue and len(results) < max_results:
            current_node, current_hop = queue.pop(0)
            
            if current_node in visited:
                continue
            
            visited.add(current_node)
            results.append((current_node, current_hop))
            
            # Add neighbors if within hop limit
            if current_hop < max_hops:
                neighbors = self.get_neighbors(current_node)
                for neighbor in neighbors:
                    if neighbor not in visited:
                        queue.append((neighbor, current_hop + 1))
        
        return results
    
    def get_related_by_category(self, node, max_results=10):
        """Find articles in similar categories"""
        if node not in self.graph:
            return []
        
        node_data = self.graph.nodes[node]
        node_categories = set(node_data.get('categories', []))
        
        if not node_categories:
            return []
        
        # Find nodes with overlapping categories
        related = []
        for other_node in self.graph.nodes():
            if other_node == node:
                continue
            
            other_categories = set(self.graph.nodes[other_node].get('categories', []))
            overlap = len(node_categories & other_categories)
            
            if overlap > 0:
                related.append((other_node, overlap))
        
        # Sort by overlap and return top results
        related.sort(key=lambda x: x[1], reverse=True)
        return [node for node, _ in related[:max_results]]

# Initialize graph retriever
print("Initializing graph retriever...")
graph_retriever = GraphRetriever(G)
print("✓ Graph retriever ready!")
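
`multi_hop_retrieval` removes items from the front of a Python list with `queue.pop(0)`, which costs time proportional to the queue length on every call. Below is a minimal sketch of the same breadth-first traversal using `collections.deque`. It runs on a plain adjacency dict rather than a NetworkX graph and omits the per-node neighbor cap. The function name `bfs_hops` and the toy graph are hypothetical.

```python
from collections import deque


def bfs_hops(adjacency, start_nodes, max_hops=2, max_results=50):
    """Breadth-first traversal returning (node, hop_distance) in visiting order."""
    if isinstance(start_nodes, str):
        start_nodes = [start_nodes]

    visited = set()
    results = []
    queue = deque((node, 0) for node in start_nodes if node in adjacency)

    while queue and len(results) < max_results:
        node, hop = queue.popleft()  # O(1), unlike list.pop(0)
        if node in visited:
            continue
        visited.add(node)
        results.append((node, hop))

        # Expand neighbors only while still inside the hop limit
        if hop < max_hops:
            for neighbor in adjacency.get(node, []):
                if neighbor not in visited:
                    queue.append((neighbor, hop + 1))

    return results


toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
toy_hops = bfs_hops(toy_graph, "A", max_hops=2)
print(toy_hops)
```

Note that the start node itself is returned at hop 0, and that "E" is not reached because it lies three hops from "A".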

## Step 7: Test Graph Retrieval

Test the graph-based retrieval methods.

In [None]:
# Test graph retrieval
test_article = "Artificial intelligence"

if test_article in G:
    print(f"Testing Graph Retrieval for: '{test_article}'")
    print("="*60)
    
    # Direct neighbors
    print(f"\nDirect neighbors (1-hop):")
    neighbors = graph_retriever.get_neighbors(test_article, max_neighbors=10)
    for i, neighbor in enumerate(neighbors[:5], 1):
        print(f"  {i}. {neighbor}")
    
    # Multi-hop retrieval
    print(f"\nMulti-hop retrieval (2 hops, top 10):")
    multi_hop_results = graph_retriever.multi_hop_retrieval(test_article, max_hops=2, max_results=10)
    for i, (article, hop) in enumerate(multi_hop_results, 1):
        print(f"  {i}. {article:45s} (hop: {hop})")
    
    # Category-based
    print(f"\nRelated by category:")
    category_related = graph_retriever.get_related_by_category(test_article, max_results=5)
    for i, article in enumerate(category_related, 1):
        print(f"  {i}. {article}")
else:
    print(f"Article '{test_article}' not found in graph")

## Step 8: Build Hybrid Retriever

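Combine graph-based and vector-based retrieval with fusion ranking. Each candidate gets a vector score (cosine similarity) and a graph score (`1 / 2**hop`, so articles closer to a seed article score higher). Both scores are scaled by their maximum, then merged as `alpha * vector_score + (1 - alpha) * graph_score`.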

In [None]:
class HybridRetriever:
    """Hybrid retrieval combining graph and vector search"""
    
    def __init__(self, graph_retriever, vector_index, embedding_model, articles):
        self.graph = graph_retriever
        self.vector = vector_index
        self.encoder = embedding_model
        self.articles = articles
        print("✓ Hybrid retriever initialized")
    
    def retrieve(self, query, top_k=10, alpha=0.5, use_graph=True, use_vector=True):
        """
        Hybrid retrieval with fusion ranking
        
        Args:
            query: Search query string
            top_k: Number of results to return
            alpha: Weight for fusion (0=graph only, 1=vector only, 0.5=equal)
            use_graph: Whether to use graph retrieval
            use_vector: Whether to use vector retrieval
            
        Returns:
            List of (article_title, score, source) tuples
        """
        results = {}
        
        # Vector search
        if use_vector:
            query_embedding = self.encoder.encode(query)
            vector_results = self.vector.search(query_embedding, top_k=top_k*2)
            
            for title, score in vector_results:
                if title not in results:
                    results[title] = {'vector_score': 0, 'graph_score': 0}
                results[title]['vector_score'] = score
        
        # Graph search (if query matches an article)
        if use_graph:
            # Try to find articles matching query terms
            query_terms = query.lower().split()
            matching_articles = []
            
            for title in self.articles.keys():
                title_lower = title.lower()
                if any(term in title_lower for term in query_terms):
                    matching_articles.append(title)
            
            # Multi-hop retrieval from matching articles
            if matching_articles:
                graph_results = self.graph.multi_hop_retrieval(
                    matching_articles[:3],  # First 3 title matches (dict order, not ranked)
                    max_hops=2,
                    max_results=top_k*2
                )
                
                # Score based on hop distance (closer = higher score)
                for title, hop in graph_results:
                    if title not in results:
                        results[title] = {'vector_score': 0, 'graph_score': 0}
                    # Score: 1.0 for 0 hops, 0.5 for 1 hop, 0.25 for 2 hops
                    results[title]['graph_score'] = 1.0 / (2 ** hop)
        
        # Normalize scores
        if results:
            max_vector = max((r['vector_score'] for r in results.values()), default=1.0)
            max_graph = max((r['graph_score'] for r in results.values()), default=1.0)
            
            for title in results:
                if max_vector > 0:
                    results[title]['vector_score'] /= max_vector
                if max_graph > 0:
                    results[title]['graph_score'] /= max_graph
        
        # Fusion ranking
        ranked_results = []
        for title, scores in results.items():
            # Weighted combination
            if use_vector and use_graph:
                final_score = alpha * scores['vector_score'] + (1 - alpha) * scores['graph_score']
            elif use_vector:
                final_score = scores['vector_score']
            elif use_graph:
                final_score = scores['graph_score']
            else:
                final_score = 0
            
            # Determine primary source
            if scores['vector_score'] > scores['graph_score']:
                source = 'vector'
            elif scores['graph_score'] > scores['vector_score']:
                source = 'graph'
            else:
                source = 'hybrid'
            
            ranked_results.append((title, final_score, source))
        
        # Sort by score and return top-k
        ranked_results.sort(key=lambda x: x[1], reverse=True)
        return ranked_results[:top_k]
    
    def explain_results(self, query, top_k=5):
        """Retrieve and explain the results"""
        results = self.retrieve(query, top_k=top_k, alpha=0.5)
        
        print(f"Query: '{query}'")
        print("="*80)
        print(f"{'Rank':<6} {'Title':<45} {'Score':<8} {'Source':<10}")
        print("-"*80)
        
        for i, (title, score, source) in enumerate(results, 1):
            print(f"{i:<6} {title[:44]:<45} {score:.4f}   {source:<10}")
        
        return results

# Initialize hybrid retriever
print("Building hybrid retriever...")
hybrid_retriever = HybridRetriever(graph_retriever, vector_index, embedding_model, articles)
print("✓ Hybrid retriever ready!")
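
To make the fusion step concrete, the following sketch reproduces the scoring arithmetic from `retrieve` on hand-written inputs: the hop-distance decay `1 / 2**hop`, scaling each score type by its maximum, and the `alpha`-weighted sum. The function name `fuse_scores` and the toy scores are hypothetical.

```python
def fuse_scores(vector_scores, graph_hops, alpha=0.5):
    """Combine vector similarities and graph hop distances into one ranking.

    vector_scores: dict title -> cosine similarity
    graph_hops:    dict title -> hop distance from a seed article
    """
    graph_scores = {title: 1.0 / (2 ** hop) for title, hop in graph_hops.items()}

    max_vector = max(vector_scores.values(), default=0.0)
    max_graph = max(graph_scores.values(), default=0.0)

    fused = {}
    for title in set(vector_scores) | set(graph_scores):
        v = vector_scores.get(title, 0.0)
        g = graph_scores.get(title, 0.0)
        # Scale each score type by its maximum, as retrieve() does
        v = v / max_vector if max_vector > 0 else v
        g = g / max_graph if max_graph > 0 else g
        fused[title] = alpha * v + (1 - alpha) * g

    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


toy_ranking = fuse_scores(
    vector_scores={"Deep learning": 0.8, "Perceptron": 0.4},
    graph_hops={"Neural network": 0, "Deep learning": 1},
    alpha=0.5,
)
for title, score in toy_ranking:
    print(f"{title:<16} {score:.3f}")
```

An article found by both branches ("Deep learning") outranks one found by a single branch, which is the behavior the weighted sum is meant to produce.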

## Step 9: Test Hybrid Retrieval

Test the hybrid retrieval system with various queries.

In [None]:
# Test hybrid retrieval
test_queries = [
    "neural networks and deep learning",
    "quantum computing applications",
    "machine learning algorithms",
    "artificial intelligence safety"
]

for query in test_queries:
    print()
    hybrid_retriever.explain_results(query, top_k=5)
    print()
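
Queries like these often contain short function words such as "and". Because the graph branch of `retrieve` seeds its traversal with any title that contains any query term as a substring, a term like "and" can match unrelated titles (for example, "Iceland"). One possible refinement, sketched below, is to drop stopwords and match on whole words. The stopword list, the function name `match_titles`, and the toy titles are all hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not taken from the notebook)
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}


def match_titles(query, titles, max_matches=3):
    """Rank titles by how many non-stopword query terms they contain as whole words."""
    terms = {t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOPWORDS}
    scored = []
    for title in titles:
        title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
        overlap = len(terms & title_words)
        if overlap:
            scored.append((overlap, title))
    # Most overlapping terms first; ties broken alphabetically for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:max_matches]]


toy_titles = ["Iceland", "Deep learning", "Neural network", "Learning theory", "Standard Model"]
toy_matches = match_titles("neural networks and deep learning", toy_titles)
print(toy_matches)
```

The trade-off is that whole-word matching does not handle plurals or other word forms ("networks" does not match "network"), so stemming or lemmatization may be needed in practice.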

## Step 10: Compare Retrieval Methods

Compare vector-only, graph-only, and hybrid retrieval side-by-side.

In [None]:
def compare_retrieval_methods(query, top_k=5):
    """Compare different retrieval approaches"""
    
    print(f"Query: '{query}'")
    print("="*100)
    
    # Vector only
    vector_results = hybrid_retriever.retrieve(query, top_k=top_k, use_graph=False, use_vector=True)
    
    # Graph only
    graph_results = hybrid_retriever.retrieve(query, top_k=top_k, use_graph=True, use_vector=False)
    
    # Hybrid
    hybrid_results = hybrid_retriever.retrieve(query, top_k=top_k, alpha=0.5)
    
    # Pad every result list to top_k rows so all columns have the same length
    # (any method may return fewer than top_k results)
    def pad(results):
        return results + [('', None, '')] * (top_k - len(results))
    
    def fmt(score):
        return '' if score is None else f"{score:.3f}"
    
    vector_rows, graph_rows, hybrid_rows = map(pad, (vector_results, graph_results, hybrid_results))
    
    # Create comparison table
    comparison_df = pd.DataFrame({
        'Rank': range(1, top_k + 1),
        'Vector Only': [r[0][:35] for r in vector_rows],
        'V-Score': [fmt(r[1]) for r in vector_rows],
        'Graph Only': [r[0][:35] for r in graph_rows],
        'G-Score': [fmt(r[1]) for r in graph_rows],
        'Hybrid': [r[0][:35] for r in hybrid_rows],
        'H-Score': [fmt(r[1]) for r in hybrid_rows]
    })
    
    print(comparison_df.to_string(index=False))
    print()
    
    return comparison_df

# Test with a query
test_query = "machine learning and neural networks"
compare_retrieval_methods(test_query, top_k=5)
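
A side-by-side table shows what each method returns, but not how much the methods agree. As a rough diagnostic that needs no relevance labels, the overlap between two result lists can be measured with Jaccard similarity. This is a sketch on hypothetical result lists. It measures agreement, not relevance, which would require judged queries.

```python
def jaccard_overlap(results_a, results_b):
    """Jaccard similarity between the title sets of two (title, score, source) result lists."""
    titles_a = {title for title, *_ in results_a}
    titles_b = {title for title, *_ in results_b}
    union = titles_a | titles_b
    if not union:
        return 0.0
    return len(titles_a & titles_b) / len(union)


# Hypothetical results in the same (title, score, source) shape that retrieve() returns
toy_vector_results = [("Deep learning", 0.91, "vector"), ("Perceptron", 0.74, "vector")]
toy_graph_results = [("Deep learning", 1.0, "graph"), ("Neural network", 0.5, "graph")]
toy_overlap = jaccard_overlap(toy_vector_results, toy_graph_results)
print(f"Jaccard overlap: {toy_overlap:.2f}")
```

A low overlap suggests the two branches contribute different candidates, which is when fusion has the most room to change the ranking.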

## Step 11: Build Context Retrieval

Retrieve full article context with citations for article generation.

In [None]:
class ContextRetriever:
    """Retrieve full context with article content for generation"""
    
    def __init__(self, hybrid_retriever, articles):
        self.retriever = hybrid_retriever
        self.articles = articles
    
    def get_context(self, query, top_k=5, include_summary=True, include_text=False):
        """
        Retrieve full context for a query
        
        Args:
            query: Search query
            top_k: Number of articles to retrieve
            include_summary: Include article summaries
            include_text: Include full article text (may be large)
            
        Returns:
            List of context dictionaries
        """
        # Get relevant articles
        results = self.retriever.retrieve(query, top_k=top_k)
        
        context = []
        for title, score, source in results:
            if title in self.articles:
                article = self.articles[title]
                
                ctx = {
                    'title': title,
                    'score': score,
                    'source': source,
                    'url': article.get('url', ''),
                    'categories': article.get('categories', [])[:5]
                }
                
                if include_summary:
                    ctx['summary'] = article.get('summary_clean', article.get('summary', ''))
                
                if include_text:
                    ctx['text'] = article.get('text_clean', article.get('text', ''))
                
                context.append(ctx)
        
        return context
    
    def format_context_for_llm(self, query, top_k=5):
        """Format context as a prompt for LLM"""
        context = self.get_context(query, top_k=top_k, include_summary=True)
        
        prompt = f"Query: {query}\n\n"
        prompt += "Relevant Wikipedia Articles:\n"
        prompt += "="*80 + "\n\n"
        
        for i, ctx in enumerate(context, 1):
            prompt += f"{i}. {ctx['title']}\n"
            prompt += f"   URL: {ctx['url']}\n"
            prompt += f"   Summary: {ctx['summary'][:500]}...\n\n"
        
        return prompt
    
    def get_related_entities(self, query, top_k=5):
        """Get entities from retrieved articles"""
        results = self.retriever.retrieve(query, top_k=top_k)
        
        all_entities = defaultdict(int)
        for title, score, source in results:
            if title in self.articles:
                article_entities = self.articles[title].get('entities', [])
                for ent in article_entities:
                    all_entities[ent['text']] += 1
        
        # Return most common entities
        sorted_entities = sorted(all_entities.items(), key=lambda x: x[1], reverse=True)
        return sorted_entities[:20]

# Initialize context retriever
context_retriever = ContextRetriever(hybrid_retriever, articles)
print("✓ Context retriever ready!")
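
`format_context_for_llm` appends `...` to every summary, including summaries shorter than the 500-character limit. A small helper can add the ellipsis only when text is actually cut, and cut at a word boundary. This is a sketch; the name `truncate_text` is hypothetical.

```python
def truncate_text(text, limit=500, suffix="..."):
    """Return text unchanged if it fits; otherwise cut at the last space before limit."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last space so a word is not split; keep the hard cut if there is none
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip() + suffix


print(truncate_text("Short summary.", limit=50))
print(truncate_text("Deep learning is a subset of machine learning methods.", limit=30))
```

Inside `format_context_for_llm`, the summary line could then use `truncate_text(ctx['summary'])` in place of the slice plus a fixed `...`.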

## Step 12: Test Context Retrieval

Test retrieving full context for article generation.

In [None]:
# Test context retrieval
test_query = "deep learning applications"

print(f"Retrieving context for: '{test_query}'")
print("="*80)

# Get context
context = context_retriever.get_context(test_query, top_k=3)

for i, ctx in enumerate(context, 1):
    print(f"\n{i}. {ctx['title']}")
    print(f"   Score: {ctx['score']:.4f} | Source: {ctx['source']}")
    print(f"   URL: {ctx['url']}")
    print(f"   Categories: {', '.join(ctx['categories'][:3])}")
    print(f"   Summary: {ctx['summary'][:300]}...")

# Get related entities
print(f"\n{'='*80}")
print("Related Entities:")
# Use a new name so the `entities` dict loaded in Step 2 is not overwritten
related_entities = context_retriever.get_related_entities(test_query, top_k=5)
for ent, count in related_entities[:10]:
    print(f"  {ent:30s} (mentions: {count})")

## Step 13: Save GraphRAG Engine

Save the complete GraphRAG engine for use in Phase 3 (Agents).

In [None]:
# Save the GraphRAG components
print("Saving GraphRAG engine components...")

# Save FAISS index
faiss_index_file = EMBEDDINGS_DIR / "faiss_index.bin"
faiss.write_index(vector_index.index, str(faiss_index_file))
print(f"✓ FAISS index saved: {faiss_index_file}")

# Save title mapping
titles_file = EMBEDDINGS_DIR / "index_titles.json"
with open(titles_file, 'w', encoding='utf-8') as f:
    json.dump(vector_index.titles, f, ensure_ascii=False, indent=2)
print(f"✓ Title mapping saved: {titles_file}")

# Save retriever configuration
config = {
    'vector_dimension': vector_index.dimension,
    'total_articles': len(vector_index.titles),
    'embedding_model': 'all-MiniLM-L6-v2',
    'graph_nodes': G.number_of_nodes(),
    'graph_edges': G.number_of_edges(),
    'created_at': pd.Timestamp.now().isoformat()
}

config_file = PROCESSED_DIR / "graphrag_config.json"
with open(config_file, 'w', encoding='utf-8') as f:
    json.dump(config, f, indent=2)
print(f"✓ Configuration saved: {config_file}")

print(f"\n{'='*60}")
print("Phase 2 Complete! 🎉")
print(f"{'='*60}")
print("\nGraphRAG Engine Summary:")
print(f"  Vector Index: {config['total_articles']} articles, {config['vector_dimension']}D")
print(f"  Knowledge Graph: {config['graph_nodes']} nodes, {config['graph_edges']} edges")
print(f"  Embedding Model: {config['embedding_model']}")
print(f"\nComponents saved to: {DATA_DIR}")
print(f"\nNext: Open notebook 03_agent_system.ipynb to build the multi-agent system!")
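
For Phase 3, the saved artifacts need to be reloaded and checked for consistency. The sketch below covers only the two JSON files, using the standard library. The FAISS index itself could be restored with `faiss.read_index(str(faiss_index_file))`, and row `i` of that index corresponds to position `i` in the title list. The function name `load_engine_metadata` is hypothetical, and the demo writes toy files to a temporary directory rather than touching the project data.

```python
import json
import tempfile
from pathlib import Path


def load_engine_metadata(titles_file, config_file):
    """Load the title mapping and config, and verify that they agree."""
    with open(titles_file, "r", encoding="utf-8") as f:
        titles = json.load(f)
    with open(config_file, "r", encoding="utf-8") as f:
        config = json.load(f)

    if config["total_articles"] != len(titles):
        raise ValueError(
            f"Config lists {config['total_articles']} articles "
            f"but the title mapping has {len(titles)}"
        )
    return titles, config


# Demo with toy files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    toy_titles_file = tmp_dir / "index_titles.json"
    toy_config_file = tmp_dir / "graphrag_config.json"
    toy_titles_file.write_text(json.dumps(["A", "B"]), encoding="utf-8")
    toy_config_file.write_text(
        json.dumps({"total_articles": 2, "vector_dimension": 384}), encoding="utf-8"
    )
    loaded_titles, loaded_config = load_engine_metadata(toy_titles_file, toy_config_file)

print(loaded_titles, loaded_config["vector_dimension"])
```

Failing early on a count mismatch avoids silently mapping search results to the wrong titles if the index and the title list were saved from different runs.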