# Graph RAG with Real Data
This notebook demonstrates an advanced Retrieval-Augmented Generation pipeline that combines a knowledge graph with vector search.
We use a small subset of the **DBLP** academic graph to show how graph traversal and semantic retrieval can work together.

## Dataset
The sample below represents a handful of well-known machine learning researchers and some of their publications.
In a real system you would load a much larger portion of DBLP or another source.

In [None]:
import networkx as nx
import chromadb
from sentence_transformers import SentenceTransformer

# Sample data: authors and papers
authors = [
    {"id": "a1", "name": "Andrew Ng", "description": "Co-founder of Coursera and leading ML researcher."},
    {"id": "a2", "name": "Geoffrey Hinton", "description": "Pioneer of neural networks and deep learning."},
    {"id": "a3", "name": "Yann LeCun", "description": "Known for convolutional neural networks."}
]

papers = [
    {"id": "p1", "title": "Deep Learning for AI", "abstract": "An overview of deep learning techniques in AI.", "authors": ["a1", "a2"]},
    {"id": "p2", "title": "Convolutional Networks for Image Recognition", "abstract": "Using CNNs to recognize images with high accuracy.", "authors": ["a3"]},
    {"id": "p3", "title": "Neural Networks for Speech Recognition", "abstract": "Exploration of neural networks for speech tasks.", "authors": ["a2", "a3"]},
    {"id": "p4", "title": "Machine Learning Best Practices", "abstract": "Guidelines and tips for machine learning practitioners.", "authors": ["a1"]}
]

citations = [("p2", "p1"), ("p3", "p2"), ("p4", "p1"), ("p4", "p3")]

# Build graph
G = nx.MultiDiGraph()
for a in authors:
    G.add_node(a['id'], label=a['name'], type='author', description=a['description'])
for p in papers:
    G.add_node(p['id'], label=p['title'], type='paper', abstract=p['abstract'])
    for aid in p['authors']:
        G.add_edge(aid, p['id'], relation='wrote')
for src, dst in citations:
    G.add_edge(src, dst, relation='cites')

## Visualize the Graph
The subgraph around a query author can help explain graph traversal in Graph RAG.
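
To make "the subgraph around a query author" concrete, the sketch below collects every node reachable within a fixed number of hops using a plain breadth-first search. It is dependency-free on purpose: the adjacency dictionary `edges` and the helper `k_hop_neighborhood` are illustrative names, not part of the notebook, and the edges simply mirror the sample data. On the notebook's graph, `nx.ego_graph(G, node, radius=...)` produces a comparable subgraph.

```python
from collections import deque

# Directed edges copied from the sample data: author -> paper ("wrote")
# and paper -> paper ("cites"). The dict itself is an illustrative stand-in.
edges = {
    "a1": ["p1", "p4"],
    "a2": ["p1", "p3"],
    "a3": ["p2", "p3"],
    "p2": ["p1"],
    "p3": ["p2"],
    "p4": ["p1", "p3"],
}

def k_hop_neighborhood(adjacency, start, radius):
    """Return {node: hop distance} for nodes within `radius` hops of `start`."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

neighborhood = k_hop_neighborhood(edges, "a1", radius=2)
print(neighborhood)
```

With `radius=2` the search reaches the two papers written by `a1` and then `p3`, which `p4` cites. Because edge direction is followed, co-authors such as `a2` are not reached.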

In [None]:
import matplotlib.pyplot as plt
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, labels=nx.get_node_attributes(G, 'label'), node_color='lightblue')
plt.show()

## Create the Vector Store
We embed author descriptions and paper abstracts, storing them in ChromaDB with metadata for later retrieval.
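
What a vector store does with those embeddings can be illustrated without ChromaDB. The sketch below is a toy in-memory stand-in: `TinyVectorStore`, its methods, and the three-dimensional vectors are all hypothetical, chosen only to show upsert-by-id, metadata carried alongside each vector, and ranking by cosine similarity. Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector collection."""

    def __init__(self):
        self._rows = {}  # id -> (vector, metadata)

    def upsert(self, ids, embeddings, metadatas):
        # Writing an existing id overwrites its row; new ids are inserted.
        for row_id, vector, meta in zip(ids, embeddings, metadatas):
            self._rows[row_id] = (vector, meta)

    def query(self, query_embedding, n_results=3):
        # Score every stored vector and keep the most similar ones.
        scored = [
            (cosine(query_embedding, vector), row_id, meta)
            for row_id, (vector, meta) in self._rows.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:n_results]

store = TinyVectorStore()
store.upsert(
    ids=["p1", "p2", "a3"],
    embeddings=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],  # made-up vectors
    metadatas=[{"type": "paper"}, {"type": "paper"}, {"type": "author"}],
)
hits = store.query([0.0, 1.0, 0.0], n_results=2)
print([(row_id, meta["type"]) for _, row_id, meta in hits])
```

The metadata returned with each hit is what lets a later step map a vector match back to a node in the graph.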

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
vectordb = chromadb.PersistentClient(path='rag_graph_db').get_or_create_collection('papers')

texts = []
metadatas = []
ids = []
for node, data in G.nodes(data=True):
    text = data.get('abstract') or data.get('description')
    if text:
        texts.append(text)
        metadatas.append({'id': node, 'label': data['label'], 'type': data['type']})
        ids.append(node)
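emb = model.encode(texts, convert_to_numpy=True)
# Store the raw text with each embedding so query results include readable documents.
vectordb.upsert(ids=ids, embeddings=emb.tolist(), documents=texts, metadatas=metadatas)
# No persist() call is needed: PersistentClient writes to disk automatically.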

## Query Functions
The helper functions below traverse the graph to gather relevant papers, perform a similarity search in the vector store, and finally generate an answer with an LLM.
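
Chroma's `query` returns a dictionary of parallel lists, nested one level per query embedding. Passing that raw dictionary to an LLM works but is noisy, so it may help to flatten it first. The helper below, `format_hits` (a hypothetical name), turns one query's hits into one line each. The sample `result` is hand-written to mimic that shape, and its distances are made up.

```python
def format_hits(result, query_index=0):
    """Flatten one query's hits from a Chroma-style result dict into text lines."""
    ids = result["ids"][query_index]
    metadatas = result["metadatas"][query_index]
    distances = result["distances"][query_index]
    lines = []
    for hit_id, meta, distance in zip(ids, metadatas, distances):
        lines.append(f"{hit_id} [{meta['type']}] {meta['label']} (distance {distance:.2f})")
    return lines

# Hand-written example in the nested-list shape; values are illustrative only.
result = {
    "ids": [["p4", "a1"]],
    "metadatas": [[
        {"id": "p4", "label": "Machine Learning Best Practices", "type": "paper"},
        {"id": "a1", "label": "Andrew Ng", "type": "author"},
    ]],
    "distances": [[0.42, 0.57]],
}
for line in format_hits(result):
    print(line)
```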

In [None]:
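def papers_by_author(name):
    """Return the IDs of papers written by the named author (case-insensitive match)."""
    for node, data in G.nodes(data=True):
        if data['type'] == 'author' and data['label'].lower() == name.lower():
            # Author nodes only have outgoing 'wrote' edges, so their successors are papers.
            return list(G.successors(node))
    return []

def vector_search(query, top_k=3):
    """Embed the query and return the top_k nearest entries from the collection."""
    q_emb = model.encode([query], convert_to_numpy=True)[0]
    return vectordb.query(query_embeddings=[q_emb.tolist()], n_results=top_k)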

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(question, graph_result, vector_result):
    prompt = f'You are a helpful research assistant.\nGraph result: {graph_result}\nVector search result: {vector_result}\nQuestion: {question}'
    # openai>=1.0 removed openai.ChatCompletion; use the client's chat.completions API instead.
    response = client.chat.completions.create(model='gpt-4', messages=[{'role': 'user', 'content': prompt}])
    return response.choices[0].message.content

## Example Query
Combine graph traversal with vector search and ask the LLM.
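
When the question asks how papers are connected, the graph context can carry edges as well as titles. One option, sketched here under the assumption that edges are available as `(source, target, relation)` triples, is to render the citation edges touching the author's papers as short sentences. `describe_citations`, `titles`, and `edge_list` are illustrative names, and the data mirrors part of the sample graph.

```python
titles = {
    "p1": "Deep Learning for AI",
    "p3": "Neural Networks for Speech Recognition",
    "p4": "Machine Learning Best Practices",
}
# (source, target, relation) triples mirroring part of the sample graph.
edge_list = [
    ("a1", "p1", "wrote"),
    ("a1", "p4", "wrote"),
    ("p4", "p1", "cites"),
    ("p4", "p3", "cites"),
]

def describe_citations(paper_ids, edges, labels):
    """Return one sentence per citation edge that starts or ends at a listed paper."""
    wanted = set(paper_ids)
    sentences = []
    for src, dst, relation in edges:
        if relation == "cites" and (src in wanted or dst in wanted):
            sentences.append(f'"{labels[src]}" cites "{labels[dst]}"')
    return sentences

for sentence in describe_citations(["p1", "p4"], edge_list, titles):
    print(sentence)
```

On the notebook's graph, comparable triples can be read from `G.edges(data='relation')`.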

In [None]:
author = 'Andrew Ng'
papers_ids = papers_by_author(author)
graph_info = [G.nodes[p]['label'] for p in papers_ids]

vec_result = vector_search(f'papers by {author}')
answer = ask_llm(f'What has {author} published and how are those papers connected?', graph_info, vec_result)
print(answer)

This workflow first collects an author's publications from the knowledge graph, then retrieves semantically related documents from the vector store. Both results go into the prompt, so the combined context lets the language model ground its answer in explicit graph facts as well as in similar text.
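
One way to assemble that combined context is to merge the two result lists, drop duplicates while preserving order, and number the entries so the answer can refer to them. The function `build_prompt` and its layout are assumptions made for illustration; they are not the notebook's `ask_llm`.

```python
def build_prompt(question, graph_facts, vector_hits):
    """Merge graph facts and vector hits into a numbered context block."""
    seen = set()
    merged = []
    for source, items in (("graph", graph_facts), ("vector", vector_hits)):
        for item in items:
            if item not in seen:  # skip text that an earlier source already contributed
                seen.add(item)
                merged.append(f"[{len(merged) + 1}] ({source}) {item}")
    context = "\n".join(merged)
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt(
    "What has Andrew Ng published?",
    graph_facts=["Deep Learning for AI", "Machine Learning Best Practices"],
    vector_hits=[
        "Machine Learning Best Practices",
        "Co-founder of Coursera and leading ML researcher.",
    ],
)
print(prompt)
```

Tagging each entry with its source keeps it visible whether a statement came from graph traversal or from similarity search.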