<a href="https://colab.research.google.com/github/hamzafarooq/multi-agent-course/blob/main/Module_4/knowledge_graph_neo4j_with_evals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### **Welcome to the RAG vs KG Evaluation Notebook**

In this notebook, we implement **Retrieval-Augmented Generation (RAG)** and **Knowledge Graph (KG)** pipelines **from scratch — with no LangChain, LlamaIndex, or external orchestration frameworks**.
Our goal is to show exactly what's possible when you understand and control every component end-to-end.

You'll see how to build retrieval, graph reasoning, and evaluation logic using **pure Python**, giving you maximum transparency and customizability.

This notebook serves as a foundation for benchmarking, extending, and adapting both approaches to real enterprise use cases.


---

## 🎓 **Want to Master Multi-Agent Systems?**

This notebook is part of the **Advanced LLM Multi-Agent Architecture Course** where you'll learn:

- ✅ How to build production-ready RAG systems
- ✅ Knowledge Graph integration with LLMs
- ✅ Multi-agent orchestration patterns
- ✅ Evaluation frameworks and best practices
- ✅ Real-world case studies and implementations

**[👉 Join the Course Now](https://maven.com/boring-bot/advanced-llm?promoCode=200OFF)** - Use code **`200OFF`** for $200 off!

---

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)


In [None]:
!pip install neo4j openai python-dotenv

In [None]:
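import dotenv
# Load the Neo4j connection settings (NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD) read later by Neo4jGraphRAG.
# Replace this path with the location of your own credentials file.
dotenv.load_dotenv('/content/Neo4j-324820c6-Created-2025-12-05.txt', override=True)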

In [None]:
import os
# If using Colab, you can store your API key in the secrets manager and access it like this:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Otherwise, load from environment variables:
# OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found. Please set it as an environment variable or in Colab secrets.")


# **RAG vs Knowledge Graph: A Comprehensive Comparison Framework**

## **Overview**

This implementation provides a production-ready framework for evaluating two powerful question-answering approaches: **Retrieval-Augmented Generation (RAG)** and **Knowledge Graph (KG) queries via Text-to-Cypher**.

Built on **Neo4j Aura** and **OpenAI GPT-4o-mini**, this system lets you measure which approach works best for different types of questions.

---

## **What This Does**

This codebase implements **three distinct query methods** and compares them head-to-head:

### **1. RAG (Retrieval-Augmented Generation)**

* Uses semantic search (embeddings) or keyword matching to find relevant documents
* Passes retrieved context to an LLM for answer generation
* **Best for:** natural language understanding, semantic queries, summarization
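
To make the semantic-search bullet concrete, here is a minimal sketch of embedding-based ranking in plain Python. The three-dimensional vectors and article titles are toy values invented for illustration (real embeddings from `text-embedding-3-small` are much longer), and `cosine_similarity` and `top_k` are hypothetical helper names, not functions from this notebook.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float],
          doc_vecs: Dict[str, List[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (assumption): made-up titles and 3-d "embeddings".
toy_docs = {
    "AI safety overview": [0.9, 0.1, 0.0],
    "Graph databases": [0.0, 0.2, 0.9],
    "Ethics in ML": [0.7, 0.6, 0.1],
}
ranked = top_k([1.0, 0.2, 0.0], toy_docs, k=2)
for title, score in ranked:
    print(f"{score:.3f}  {title}")
```

The retrieved titles and abstracts would then be concatenated into the context passed to the LLM.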

---

### **2. Knowledge Graph with Text-to-Cypher**

* Converts natural language questions into Cypher using GPT-4o-mini
* Executes structured queries directly on Neo4j
* Returns exact, verifiable data
* **Best for:** precise counts, relationship queries, aggregations, filtering
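
Because the generated Cypher is executed directly, it can be worth checking it before it reaches the database. The sketch below strips Markdown code fences from a model reply and rejects anything that looks like a write. Both helpers are hypothetical, and the keyword list is a conservative heuristic rather than a Cypher parser, so it may reject some harmless queries (for example, ones that call read-only procedures).

```python
import re

# Keywords that indicate a query could modify data or call procedures (heuristic, assumption).
WRITE_KEYWORDS = ("CREATE", "MERGE", "DELETE", "DETACH", "SET",
                  "REMOVE", "DROP", "LOAD CSV", "CALL")


def strip_code_fences(text: str) -> str:
    """Remove a leading ```cypher / ``` fence and a trailing ``` fence, if present."""
    text = text.strip()
    text = re.sub(r"^```(?:cypher)?\s*", "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*```$", "", text)
    return text.strip()


def is_read_only(cypher: str) -> bool:
    """Return True if none of the write keywords appear as whole words."""
    upper = cypher.upper()
    return not any(re.search(rf"\b{re.escape(keyword)}\b", upper)
                   for keyword in WRITE_KEYWORDS)


raw_reply = "```cypher\nMATCH (r:Researcher) RETURN r.name LIMIT 20\n```"
query = strip_code_fences(raw_reply)
print(query, "| read-only:", is_read_only(query))
```

A query that fails the check could be reported as a failed KG answer instead of being run.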

---

### **3. LLM Judge Evaluation**

* Uses GPT-4o-mini as an impartial evaluator
* Scores each method from 1 to 10 on accuracy, completeness, and precision, and also weighs verifiability and usefulness
* Produces detailed reasoning and recommendations
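
The judge is asked to reply in JSON, but model replies sometimes wrap the object in prose or code fences. One possible way to handle that is sketched below: take the text between the first `{` and the last `}` and fall back to the raw text when parsing fails. `parse_judgment` is a hypothetical helper; the `raw_text` fallback key mirrors the one used later in this notebook.

```python
import json
from typing import Any, Dict


def parse_judgment(text: str) -> Dict[str, Any]:
    """Extract a JSON object from a judge reply, or return the raw text on failure."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return {"raw_text": text}
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {"raw_text": text}
    return parsed if isinstance(parsed, dict) else {"raw_text": text}


# Example reply (invented for illustration).
reply = 'Here is my verdict:\n```json\n{"winner": "B", "confidence": "high"}\n```'
print(parse_judgment(reply))
```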

---

## **Key Features**

* ✅ **No Hardcoded Queries**: Cypher is generated dynamically from each question
* ✅ **Objective Evaluation**: LLM-based scoring against fixed criteria
* ✅ **Error Handling**: failures in Cypher generation and execution are caught and returned as error results
* ✅ **Flexible Data Loading**: CSV support (URL or local)
* ✅ **Vector Search Optional**: embedding-based semantic RAG
* ✅ **Batch Evaluation**: evaluate many questions together
* ✅ **Interactive Mode**: ask questions in real time

---

## **Architecture**

```
Question
    ↓
    ├── [RAG Path] ─────────────→ Answer A (Interpretive)
    │
    ├── [Knowledge Graph Path] → Answer B (Exact)
    │
    └── [LLM Judge] ───────────→ Winner + Analysis
```

### **RAG Path**

1. Convert question to embedding or keywords
2. Retrieve relevant articles from Neo4j
3. Pass context to GPT-4o-mini for answer generation
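
For the keyword variant of step 1, splitting the question on whitespace keeps punctuation and very common words such as "of" or "the", which match almost every abstract under a `CONTAINS` filter. The sketch below shows one way to filter them out. The stop-word list is a small illustrative assumption, and `extract_keywords` is a hypothetical helper rather than part of the class defined below.

```python
import re
from typing import List

# Small illustrative stop-word list (assumption); extend it for real use.
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "for", "to", "and", "or", "is", "are",
    "who", "what", "which", "how", "has", "have", "each", "with", "by",
    "do", "does", "all", "many",
}


def extract_keywords(question: str, min_length: int = 2) -> List[str]:
    """Lowercase the question, drop punctuation and stop words, keep first-seen order."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if len(token) >= min_length and token not in STOP_WORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


print(extract_keywords("Who are the collaborators of Emily Chen?"))
```

The resulting list can be passed as the `$keywords` parameter of a Cypher query.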

### **Knowledge Graph Path**

1. Convert question to Cypher using GPT-4o-mini
2. Execute structured query on Neo4j
3. Format results + natural-language explanation
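
The first two steps can fail independently: the model may not return a query, or Neo4j may reject the query it returned. A minimal sketch of that control flow follows, with the LLM call and the database call injected as plain functions so that it runs without any external service. `run_kg_pipeline` and the stub functions are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def run_kg_pipeline(question: str,
                    generate_cypher: Callable[[str], str],
                    run_query: Callable[[str], List[Row]]) -> Dict[str, Any]:
    """Generate Cypher, execute it, and report which stage failed (if any)."""
    try:
        cypher = generate_cypher(question)
    except Exception as exc:
        return {"success": False, "cypher": None, "results": [],
                "error": f"generation failed: {exc}"}
    try:
        rows = run_query(cypher)
    except Exception as exc:
        return {"success": False, "cypher": cypher, "results": [],
                "error": f"execution failed: {exc}"}
    return {"success": True, "cypher": cypher, "results": rows, "error": None}


# Stubs standing in for the LLM and the database (assumptions for the demo).
def fake_generate(question: str) -> str:
    return "MATCH (r:Researcher) RETURN r.name AS name LIMIT 20"


def fake_run(cypher: str) -> List[Row]:
    return [{"name": "Emily Chen"}]


def failing_run(cypher: str) -> List[Row]:
    raise RuntimeError("syntax error")


ok = run_kg_pipeline("List researchers", fake_generate, fake_run)
bad = run_kg_pipeline("List researchers", fake_generate, failing_run)
print(ok["success"], bad["error"])
```

Keeping the generated query in the error result makes failed runs easier to debug.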

### **Judge Path**

* Compares both answers
* Scores accuracy, precision, completeness
* Determines winner with detailed reasoning

---

## **Use Cases**

### **When Knowledge Graph Wins**

* *“Who are the collaborators of Emily Chen?”*
* *“How many articles has each researcher published?”*
* *“Which researchers work on AI Ethics?”*
* *“Find all papers published in 2024.”*

### **When RAG Wins**

* *“What are the main challenges in AI safety?”*
* *“Explain innovations in transformer architectures.”*
* *“Summarize ethical concerns in AI research.”*

### **When Both Are Useful**

* *“What topics does Emily Chen research?”*
* *“Compare research focus of two researchers.”*
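
As a rough illustration of how these categories might drive a hybrid system, the sketch below routes a question by surface cues. The cue lists are assumptions derived only from the example questions above, and `suggest_method` is a hypothetical helper; a real router would more likely be tuned on judged results.

```python
import re

# Phrases that hint at structured lookups (assumption, based on the examples above).
STRUCTURED_CUES = (
    r"\bhow many\b", r"\bcount\b", r"\bcollaborators?\b", r"\bco-?authored\b",
    r"\bwhich researchers\b", r"\bfind all\b", r"\bpublished in \d{4}\b",
)
# Phrases that hint at open-ended, interpretive answers (assumption).
OPEN_ENDED_CUES = (
    r"\bexplain\b", r"\bsummarize\b", r"\bmain challenges\b", r"\bconcerns\b", r"\bwhy\b",
)


def suggest_method(question: str) -> str:
    """Return 'KG', 'RAG', or 'BOTH' based on simple phrase matching."""
    q = question.lower()
    structured = any(re.search(pattern, q) for pattern in STRUCTURED_CUES)
    open_ended = any(re.search(pattern, q) for pattern in OPEN_ENDED_CUES)
    if structured and not open_ended:
        return "KG"
    if open_ended and not structured:
        return "RAG"
    return "BOTH"


for example in ("Who are the collaborators of Emily Chen?",
                "Summarize ethical concerns in AI research.",
                "What topics does Emily Chen research?"):
    print(suggest_method(example), "<-", example)
```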

---

## **Data Schema**

The system works with research paper data including:

* **Articles:** title, abstract, publication date
* **Researchers:** names + co-authorship
* **Topics:** research areas
* **Relationships:**

  * `PUBLISHED` (Researcher → Article)
  * `IN_TOPIC` (Article → Topic)
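
Co-authorship is not stored as its own relationship: two researchers are collaborators when both have a `PUBLISHED` edge to the same article. The sketch below derives that in memory from a few toy rows. The article titles and author assignments are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set

# Toy rows (assumption) shaped like the CSV: one article, its authors, one topic.
toy_articles = [
    {"title": "Paper A", "authors": ["Emily Chen", "David Johnson"], "topic": "AI Ethics"},
    {"title": "Paper B", "authors": ["Emily Chen", "Sarah Lee"], "topic": "Model Optimization"},
    {"title": "Paper C", "authors": ["David Johnson"], "topic": "AI Ethics"},
]


def build_collaborators(articles: List[dict]) -> Dict[str, Set[str]]:
    """Map each researcher to everyone who shares at least one article with them."""
    collaborators: Dict[str, Set[str]] = defaultdict(set)
    for article in articles:
        for first, second in combinations(sorted(set(article["authors"])), 2):
            collaborators[first].add(second)
            collaborators[second].add(first)
    return dict(collaborators)


collaborators = build_collaborators(toy_articles)
print(sorted(collaborators["Emily Chen"]))
```

In the graph, the same lookup is a two-hop pattern through the shared `Article` node.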

---

## **Quick Start**

```python
# Single question comparison
quick_ask_with_judge("Who are the collaborators of Emily Chen?")

# Batch evaluation
questions = [
    "How many articles has each researcher published?",
    "What are the ethical concerns in AI?",
    "Which researchers work on Model Optimization?"
]
batch_judge_questions(questions)
```

---

## **Expected Results**

You will get:

* Winner declaration for each question
* Confidence level (high/medium/low)
* Metrics: accuracy, completeness, precision
* LLM reasoning explaining the decision
* Recommendations for the best approach per question type
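
If you keep the returned result dictionaries, a few lines are enough to tally winners and average the judge's scores. The sketch below assumes each result has a `winner` key and an optional `judgment` dictionary with `accuracy_score_a` (RAG) and `accuracy_score_b` (KG). The sample values are made up, and `summarize` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List, Optional


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count winners and average the numeric accuracy scores that are present."""
    winners = Counter(r.get("winner") or "UNKNOWN" for r in results)

    def average(key: str) -> Optional[float]:
        values = [
            r["judgment"][key]
            for r in results
            if isinstance(r.get("judgment"), dict)
            and isinstance(r["judgment"].get(key), (int, float))
        ]
        return round(mean(values), 2) if values else None

    return {
        "winners": dict(winners),
        "avg_accuracy_rag": average("accuracy_score_a"),
        "avg_accuracy_kg": average("accuracy_score_b"),
    }


# Made-up sample results (assumption); the last one has no judgment at all.
sample_results = [
    {"winner": "B", "judgment": {"accuracy_score_a": 6, "accuracy_score_b": 9}},
    {"winner": "A", "judgment": {"accuracy_score_a": 8, "accuracy_score_b": 6}},
    {"winner": None, "judgment": None},
]
summary = summarize(sample_results)
print(summary)
```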

---

## **Why This Matters**

Most implementations choose *either* RAG or Knowledge Graphs.

This framework shows **when to use which**, backed by objective LLM evaluations—ideal for:

* Building hybrid QA systems leveraging both methods
* Understanding semantic vs. structured query trade-offs
* Making informed architectural decisions
* Demonstrating the value of Knowledge Graphs vs pure LLM approaches


In [None]:
import os
from neo4j import GraphDatabase
import openai
from typing import List, Dict, Any
import time
import json

class Neo4jGraphRAG:
    def __init__(self):
        self.uri = os.getenv("NEO4J_URI")
        self.username = os.getenv("NEO4J_USERNAME")
        self.password = os.getenv("NEO4J_PASSWORD")
        self.database = os.getenv("NEO4J_DATABASE", "neo4j")

        self.driver = GraphDatabase.driver(
            self.uri,
            auth=(self.username, self.password)
        )

        # Reuse the key loaded in the setup cell so this also works outside Colab
        openai.api_key = OPENAI_API_KEY

    def close(self):
        self.driver.close()

    def execute_query(self, query: str, parameters: Dict = None) -> List[Dict]:
        """Execute a Cypher query and return results"""
        with self.driver.session(database=self.database) as session:
            result = session.run(query, parameters or {})
            return [record.data() for record in result]

    def load_data(self, csv_url: str):
        """Load data from CSV into Neo4j"""
        q_load = f"""
        LOAD CSV WITH HEADERS
        FROM '{csv_url}'
        AS row
        FIELDTERMINATOR ';'
        MERGE (a:Article {{title:row.Title}})
        SET a.abstract = row.Abstract,
            a.publication_date = date(row.Publication_Date)
        WITH a, row
        FOREACH (researcher in split(row.Authors, ',') |
            MERGE (p:Researcher {{name:trim(researcher)}})
            MERGE (p)-[:PUBLISHED]->(a))
        WITH a, row
        FOREACH (topic in [row.Topic] |
            MERGE (t:Topic {{name:trim(topic)}})
            MERGE (a)-[:IN_TOPIC]->(t))
        """
        return self.execute_query(q_load)

    def get_embedding(self, text: str) -> List[float]:
        """Generate embeddings using OpenAI"""
        response = openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def create_embeddings_for_articles(self):
        """Create and store embeddings for all articles"""
        articles = self.execute_query("""
            MATCH (a:Article)
            RETURN id(a) as id, a.title as title, a.abstract as abstract
        """)

        print(f"Creating embeddings for {len(articles)} articles...")
        for i, article in enumerate(articles, 1):
            text = f"{article['title']} {article['abstract']}"
            embedding = self.get_embedding(text)

            self.execute_query("""
                MATCH (a:Article)
                WHERE id(a) = $id
                SET a.embedding = $embedding
            """, {
                "id": article['id'],
                "embedding": embedding
            })

            if i % 10 == 0:
                print(f"  Progress: {i}/{len(articles)}")

        print(f"✅ Created embeddings for all {len(articles)} articles")

    def retrieve_context(self, question: str, limit: int = 5) -> str:
        """Retrieve relevant context from the graph based on the question."""
        keywords = question.lower().split()

        cypher_query = """
        MATCH (a:Article)
        WHERE ANY(keyword IN $keywords WHERE
            toLower(a.title) CONTAINS keyword OR
            toLower(a.abstract) CONTAINS keyword)
        OPTIONAL MATCH (a)-[:IN_TOPIC]->(t:Topic)
        OPTIONAL MATCH (r:Researcher)-[:PUBLISHED]->(a)
        WITH a,
             collect(DISTINCT t.name) as topics,
             collect(DISTINCT r.name) as authors
        RETURN a.title as title,
               a.abstract as abstract,
               a.publication_date as date,
               topics,
               authors
        ORDER BY size(authors) DESC
        LIMIT $limit
        """

        results = self.execute_query(cypher_query, {
            "keywords": keywords,
            "limit": limit
        })

        context_parts = []
        for i, record in enumerate(results, 1):
            context = f"""
Article {i}: {record['title']}
Authors: {', '.join(record['authors']) if record['authors'] else 'N/A'}
Topics: {', '.join(record['topics']) if record['topics'] else 'N/A'}
Abstract: {record['abstract']}
Date: {record['date']}
---"""
            context_parts.append(context)

        return "\n\n".join(context_parts)

    def retrieve_with_vector_search(self, question: str, limit: int = 5) -> str:
        """Retrieve using vector similarity"""
        embedding = self.get_embedding(question)

        cypher_query = """
        MATCH (a:Article)
        WHERE a.embedding IS NOT NULL
        WITH a,
             gds.similarity.cosine(a.embedding, $query_embedding) AS similarity
        ORDER BY similarity DESC
        LIMIT $limit
        OPTIONAL MATCH (a)-[:IN_TOPIC]->(t:Topic)
        OPTIONAL MATCH (r:Researcher)-[:PUBLISHED]->(a)
        WITH a, similarity,
             collect(DISTINCT t.name) as topics,
             collect(DISTINCT r.name) as authors
        RETURN a.title as title,
               a.abstract as abstract,
               topics,
               authors,
               similarity
        """

        results = self.execute_query(cypher_query, {
            "query_embedding": embedding,
            "limit": limit
        })

        context_parts = []
        for i, record in enumerate(results, 1):
            context = f"""
Article {i} (Similarity: {record['similarity']:.3f}):
Title: {record['title']}
Authors: {', '.join(record['authors'])}
Topics: {', '.join(record['topics'])}
Abstract: {record['abstract']}
---"""
            context_parts.append(context)

        return "\n\n".join(context_parts)

    def generate_answer(self, question: str, context: str) -> str:
        """Generate answer using LLM with retrieved context"""
        prompt = f"""You are a helpful assistant that answers questions based on the provided context from a knowledge graph.

Context from Knowledge Graph:
{context}

Question: {question}

Please provide a comprehensive answer based on the context above. If the context doesn't contain enough information to answer the question, say so.

Answer:"""

        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful research assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=500
        )

        return response.choices[0].message.content

    def query(self, question: str, use_vector_search: bool = False) -> Dict[str, Any]:
        """Main RAG query method."""
        start_time = time.time()

        if use_vector_search:
            context = self.retrieve_with_vector_search(question)
        else:
            context = self.retrieve_context(question)

        if not context:
            return {
                "answer": "I couldn't find any relevant information in the knowledge graph.",
                "context": "",
                "sources": [],
                "time": time.time() - start_time
            }

        answer = self.generate_answer(question, context)

        return {
            "answer": answer,
            "context": context,
            "sources": self.extract_sources(context),
            "time": time.time() - start_time
        }

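    def extract_sources(self, context: str) -> List[str]:
        """Extract article titles from context as sources"""
        sources = []
        for line in context.split('\n'):
            if line.startswith('Title:'):
                # Vector-search context puts the title on its own "Title:" line
                sources.append(line.split(':', 1)[1].strip())
            elif line.startswith('Article') and '(Similarity' not in line and ':' in line:
                # Keyword context uses "Article N: <title>"
                sources.append(line.split(':', 1)[1].strip())
        return sources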

    # ============================================
    # TEXT-TO-CYPHER FUNCTIONALITY
    # ============================================

    def get_graph_schema(self) -> str:
        """Get the current graph schema"""
        sample_query = """
        MATCH (r:Researcher)-[:PUBLISHED]->(a:Article)-[:IN_TOPIC]->(t:Topic)
        RETURN r.name as researcher, a.title as article, t.name as topic
        LIMIT 3
        """

        samples = self.execute_query(sample_query)

        schema = f"""
Graph Database Schema:
=====================

Node Types:
-----------
1. Researcher
   Properties: name (string)
   Example: "Emily Chen", "Dr. Sarah Williams"

2. Article
   Properties: title (string), abstract (string), publication_date (date)
   Example: "AI in Healthcare", "Machine Learning Applications"

3. Topic
   Properties: name (string)
   Example: "Artificial Intelligence", "Climate Change"

Relationships:
--------------
1. (Researcher)-[:PUBLISHED]->(Article)
   - A researcher published an article

2. (Article)-[:IN_TOPIC]->(Topic)
   - An article belongs to a topic

Important Notes:
----------------
- Multiple researchers can publish the SAME article (co-authorship)
- An article can have multiple topics
- Use MATCH patterns to find relationships
- Use WHERE clauses for filtering
- Use toLower() for case-insensitive matching
- Property access: node.property (e.g., r.name, a.title)

Sample Data:
------------
"""
        for sample in samples:
            schema += f"\n• {sample['researcher']} -> {sample['article'][:50]}... -> {sample['topic']}"

        return schema

    def text_to_cypher(self, question: str) -> Dict[str, Any]:
        """Convert natural language question to Cypher query using LLM"""
        schema = self.get_graph_schema()

        prompt = f"""{schema}

Task: Convert the following natural language question into a valid Neo4j Cypher query.

Rules:
1. Return ONLY the Cypher query, no explanations
2. Use proper Neo4j syntax
3. Use MATCH for finding patterns
4. Use WHERE for filtering
5. Use RETURN to specify what to return
6. Use toLower() for case-insensitive text matching
7. Limit results to 20 unless asked otherwise
8. For "collaborators", find researchers who published the SAME article
9. For counting, use count() function
10. For finding by name, use WHERE node.name = "exact name" or CONTAINS for partial match

Common Query Patterns:
- Find collaborators: MATCH (r1:Researcher)-[:PUBLISHED]->(a:Article)<-[:PUBLISHED]-(r2:Researcher)
- Count articles: MATCH (r:Researcher)-[:PUBLISHED]->(a) RETURN r.name, count(a)
- Find by topic: MATCH (a:Article)-[:IN_TOPIC]->(t:Topic) WHERE toLower(t.name) CONTAINS 'keyword'
- Find researcher's work: MATCH (r:Researcher {{name: "Name"}})-[:PUBLISHED]->(a) RETURN a.title

Question: "{question}"

Cypher Query:"""

        try:
            response = openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a Neo4j Cypher query expert. Generate only valid, executable Cypher queries. Be precise with syntax."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.1,
                max_tokens=300
            )

            cypher = response.choices[0].message.content.strip()
            cypher = cypher.replace("```cypher", "").replace("```", "").strip()

            return {
                "success": True,
                "cypher": cypher,
                "error": None
            }

        except Exception as e:
            return {
                "success": False,
                "cypher": None,
                "error": str(e)
            }

    def execute_text_to_cypher(self, question: str) -> Dict[str, Any]:
        """Generate Cypher from text and execute it"""
        start_time = time.time()

        cypher_result = self.text_to_cypher(question)

        if not cypher_result['success']:
            return {
                "success": False,
                "error": f"Failed to generate Cypher: {cypher_result['error']}",
                "time": time.time() - start_time
            }

        cypher = cypher_result['cypher']

        try:
            results = self.execute_query(cypher)

            return {
                "success": True,
                "cypher": cypher,
                "results": results,
                "result_count": len(results),
                "time": time.time() - start_time,
                "error": None
            }

        except Exception as e:
            return {
                "success": False,
                "cypher": cypher,
                "results": [],
                "result_count": 0,
                "time": time.time() - start_time,
                "error": f"Cypher execution error: {str(e)}"
            }

    def format_kg_results(self, results: List[Dict]) -> str:
        """Format KG results into readable text"""
        if not results:
            return "No results found."

        formatted = []
        for i, row in enumerate(results[:20], 1):
            row_text = f"Result {i}:"
            for key, value in row.items():
                if isinstance(value, list):
                    value = ", ".join(str(v) for v in value[:5])
                row_text += f"\n  • {key}: {value}"
            formatted.append(row_text)

        return "\n\n".join(formatted)

    def kg_query_with_explanation(self, question: str) -> Dict[str, Any]:
        """Execute KG query and generate natural language explanation"""
        start_time = time.time()

        kg_result = self.execute_text_to_cypher(question)

        if not kg_result['success']:
            return {
                "method": "Knowledge Graph (Text-to-Cypher)",
                "success": False,
                "error": kg_result['error'],
                "answer": f"Failed to query knowledge graph: {kg_result['error']}",
                "time": time.time() - start_time
            }

        formatted_results = self.format_kg_results(kg_result['results'])

        explanation_prompt = f"""You are explaining database query results to a user.

Question: {question}

Cypher Query Used:
{kg_result['cypher']}

Query Results:
{formatted_results}

Provide a clear, natural language answer based on these EXACT results. Be specific with numbers and names from the data. If there are no results, say so clearly.

Answer:"""

        try:
            response = openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that explains database query results clearly and accurately."},
                    {"role": "user", "content": explanation_prompt}
                ],
                temperature=0.7,
                max_tokens=500
            )

            answer = response.choices[0].message.content

        except Exception as e:
            answer = f"Found {kg_result['result_count']} results, but failed to generate explanation: {str(e)}"

        return {
            "method": "Knowledge Graph (Text-to-Cypher)",
            "success": True,
            "cypher": kg_result['cypher'],
            "results": kg_result['results'],
            "result_count": kg_result['result_count'],
            "formatted_results": formatted_results,
            "answer": answer,
            "time": time.time() - start_time
        }

    # ============================================
    # LLM JUDGE COMPARISON
    # ============================================

    def compare_with_judge(self, question: str) -> Dict[str, Any]:
        """Use LLM to judge which method (RAG vs KG) gave a better answer"""
        print("\n" + "⚖️ " * 40)
        print("LLM JUDGE: Comparing RAG vs Knowledge Graph")
        print("⚖️ " * 40)
        print(f"\nQuestion: {question}\n")

        # Get both results
        print("🔄 Getting RAG answer...")
        rag_result = self.query(question, use_vector_search=False)

        print("🔄 Getting Knowledge Graph answer...")
        kg_result = self.kg_query_with_explanation(question)

        # Display both answers
        print("\n" + "=" * 80)
        print("📚 RAG ANSWER:")
        print("-" * 80)
        print(rag_result['answer'])
        print(f"⏱️  Time: {rag_result['time']:.2f}s")
        print(f"📄 Sources: {len(rag_result['sources'])} documents")

        print("\n" + "=" * 80)
        print("🔍 KNOWLEDGE GRAPH ANSWER:")
        print("-" * 80)
        if kg_result['success']:
            print(f"Cypher: {kg_result['cypher']}")
            print(f"\n{kg_result['answer']}")
            print(f"⏱️  Time: {kg_result['time']:.2f}s")
            print(f"📊 Results: {kg_result['result_count']} exact matches")
        else:
            print(f"❌ Failed: {kg_result['error']}")

        # If KG failed, RAG wins by default
        if not kg_result['success']:
            print("\n" + "🏆" * 40)
            print("WINNER: RAG (Knowledge Graph query failed)")
            print("🏆" * 40)
            return {
                "question": question,
                # "A" is the judge's label for RAG, so batch statistics count this default win
                "winner": "A",
                "reason": "Knowledge Graph query failed",
                "rag_result": rag_result,
                "kg_result": kg_result,
                "judgment": None
            }

        # Ask LLM to judge
        print("\n🤔 Asking LLM judge to evaluate...")

        judgment_prompt = f"""You are an expert judge evaluating two AI systems answering the same question.

Question: "{question}"

SYSTEM A (RAG - Retrieval-Augmented Generation):
Answer: {rag_result['answer']}
Method: Retrieved {len(rag_result['sources'])} relevant documents and generated answer using LLM
Time: {rag_result['time']:.2f}s

SYSTEM B (Knowledge Graph with Text-to-Cypher):
Cypher Query: {kg_result['cypher']}
Answer: {kg_result['answer']}
Method: Generated structured database query and retrieved {kg_result['result_count']} exact results
Time: {kg_result['time']:.2f}s
Raw Results: {kg_result['formatted_results'][:500]}...

Evaluation Criteria:
1. **Accuracy**: Which answer is more factually correct?
2. **Completeness**: Which answer provides more complete information?
3. **Precision**: Which answer is more specific and exact?
4. **Verifiability**: Which answer can be verified/proven?
5. **Usefulness**: Which answer better serves the user's intent?

Provide your evaluation in the following JSON format:
{{
    "winner": "A" or "B" or "TIE",
    "confidence": "high" or "medium" or "low",
    "accuracy_score_a": 1-10,
    "accuracy_score_b": 1-10,
    "completeness_score_a": 1-10,
    "completeness_score_b": 1-10,
    "precision_score_a": 1-10,
    "precision_score_b": 1-10,
    "reasoning": "Detailed explanation of your judgment",
    "strengths_a": ["strength 1", "strength 2"],
    "strengths_b": ["strength 1", "strength 2"],
    "weaknesses_a": ["weakness 1", "weakness 2"],
    "weaknesses_b": ["weakness 1", "weakness 2"],
    "recommendation": "When to use each method for this type of question"
}}

Be objective and thorough in your analysis."""

        try:
            response = openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are an expert AI judge evaluating different question-answering systems. Be objective, thorough, and fair in your evaluations."},
                    {"role": "user", "content": judgment_prompt}
                ],
                temperature=0.3,
                max_tokens=1000
            )

            judgment_text = response.choices[0].message.content.strip()

            # Try to parse JSON
            try:
                if "```json" in judgment_text:
                    judgment_text = judgment_text.split("```json")[1].split("```")[0].strip()
                elif "```" in judgment_text:
                    judgment_text = judgment_text.split("```")[1].split("```")[0].strip()

                judgment = json.loads(judgment_text)
            except json.JSONDecodeError:
                judgment = {"raw_text": judgment_text}

        except Exception as e:
            print(f"❌ Error getting LLM judgment: {e}")
            judgment = {"error": str(e)}

        # Display judgment
        print("\n" + "🏆" * 40)
        print("LLM JUDGE VERDICT")
        print("🏆" * 40)

        if "error" in judgment:
            print(f"❌ Error: {judgment['error']}")
        elif "raw_text" in judgment:
            print(judgment['raw_text'])
        else:
            winner_map = {"A": "RAG", "B": "Knowledge Graph", "TIE": "TIE"}
            winner = winner_map.get(judgment.get('winner', 'UNKNOWN'), 'UNKNOWN')

            print(f"\n🎯 WINNER: {winner}")
            print(f"📊 Confidence: {judgment.get('confidence', 'unknown').upper()}")

            print(f"\n📈 Scores:")
            print(f"  RAG:")
            print(f"    • Accuracy: {judgment.get('accuracy_score_a', 'N/A')}/10")
            print(f"    • Completeness: {judgment.get('completeness_score_a', 'N/A')}/10")
            print(f"    • Precision: {judgment.get('precision_score_a', 'N/A')}/10")

            print(f"  Knowledge Graph:")
            print(f"    • Accuracy: {judgment.get('accuracy_score_b', 'N/A')}/10")
            print(f"    • Completeness: {judgment.get('completeness_score_b', 'N/A')}/10")
            print(f"    • Precision: {judgment.get('precision_score_b', 'N/A')}/10")

            print(f"\n💭 Reasoning:")
            print(f"  {judgment.get('reasoning', 'No reasoning provided')}")

            if judgment.get('strengths_a'):
                print(f"\n✅ RAG Strengths:")
                for strength in judgment['strengths_a']:
                    print(f"  • {strength}")

            if judgment.get('strengths_b'):
                print(f"\n✅ Knowledge Graph Strengths:")
                for strength in judgment['strengths_b']:
                    print(f"  • {strength}")

            if judgment.get('weaknesses_a'):
                print(f"\n⚠️  RAG Weaknesses:")
                for weakness in judgment['weaknesses_a']:
                    print(f"  • {weakness}")

            if judgment.get('weaknesses_b'):
                print(f"\n⚠️  Knowledge Graph Weaknesses:")
                for weakness in judgment['weaknesses_b']:
                    print(f"  • {weakness}")

            if judgment.get('recommendation'):
                print(f"\n💡 Recommendation:")
                print(f"  {judgment['recommendation']}")

        print("\n" + "=" * 80)

        return {
            "question": question,
            "winner": judgment.get('winner'),
            "confidence": judgment.get('confidence'),
            "judgment": judgment,
            "rag_result": rag_result,
            "kg_result": kg_result
        }


# ============================================
# STANDALONE HELPER FUNCTIONS
# ============================================

def quick_ask_with_judge(question: str):
    """Quick function to ask a question and get LLM judgment"""
    rag = Neo4jGraphRAG()

    # Check if data exists
    count = rag.execute_query("MATCH (n) RETURN count(n) as count")
    if count[0]['count'] == 0:
        print("📥 Loading data first...")
        rag.load_data('https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/dataset/synthetic_articles.csv')

        # Check if embeddings exist
        emb_count = rag.execute_query("MATCH (a:Article) WHERE a.embedding IS NOT NULL RETURN count(a) as count")
        if emb_count[0]['count'] == 0:
            print("🔢 Creating embeddings...")
            rag.create_embeddings_for_articles()

    result = rag.compare_with_judge(question)
    rag.close()
    return result


def batch_judge_questions(questions: List[str]):
    """Judge multiple questions and show aggregate statistics"""
    print("\n" + "🎯" * 40)
    print("BATCH LLM JUDGMENT - Multiple Questions")
    print("🎯" * 40)

    rag = Neo4jGraphRAG()

    # Check/load data
    count = rag.execute_query("MATCH (n) RETURN count(n) as count")
    if count[0]['count'] == 0:
        print("📥 Loading data...")
        rag.load_data('https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/dataset/synthetic_articles.csv')
        print("🔢 Creating embeddings...")
        rag.create_embeddings_for_articles()

    results = []
    for i, question in enumerate(questions, 1):
        print(f"\n{'='*80}")
        print(f"Question {i}/{len(questions)}")
        print(f"{'='*80}")

        result = rag.compare_with_judge(question)
        results.append(result)

        if i < len(questions):
            time.sleep(1)

    # Aggregate statistics
    print("\n" + "📊" * 40)
    print("AGGREGATE STATISTICS")
    print("📊" * 40)

    rag_wins = sum(1 for r in results if r.get('winner') == 'A')
    kg_wins = sum(1 for r in results if r.get('winner') == 'B')
    ties = sum(1 for r in results if r.get('winner') == 'TIE')

    print(f"\n🏆 Overall Results:")
    print(f"  • RAG Wins: {rag_wins}/{len(questions)} ({rag_wins/len(questions)*100:.1f}%)")
    print(f"  • Knowledge Graph Wins: {kg_wins}/{len(questions)} ({kg_wins/len(questions)*100:.1f}%)")
    print(f"  • Ties: {ties}/{len(questions)} ({ties/len(questions)*100:.1f}%)")

    # Average scores
    rag_accuracy = [r['judgment'].get('accuracy_score_a', 0) for r in results if 'judgment' in r and r['judgment'] and isinstance(r['judgment'].get('accuracy_score_a'), (int, float))]
    kg_accuracy = [r['judgment'].get('accuracy_score_b', 0) for r in results if 'judgment' in r and r['judgment'] and isinstance(r['judgment'].get('accuracy_score_b'), (int, float))]

    if rag_accuracy and kg_accuracy:
        print(f"\n📈 Average Accuracy Scores:")
        print(f"  • RAG: {sum(rag_accuracy)/len(rag_accuracy):.1f}/10")
        print(f"  • Knowledge Graph: {sum(kg_accuracy)/len(kg_accuracy):.1f}/10")

    # Question type analysis
    print(f"\n🔍 Question Type Analysis:")
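    for i, (question, result) in enumerate(zip(questions, results), 1):
        winner_name = {"A": "RAG", "B": "KG", "TIE": "TIE"}.get(result.get('winner'), '?')
        # Add an ellipsis only when the question is actually truncated
        preview = question if len(question) <= 60 else question[:60] + "..."
        print(f"  {i}. {preview}")
        print(f"     Winner: {winner_name}")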

    print("\n" + "=" * 80)

    rag.close()
    return results


# ============================================
# USAGE EXAMPLES
# ============================================



In [None]:

# Example 1: Single question with judgment
print("Example 1: Single Question")
quick_ask_with_judge("Who are the collaborators of Emily Chen?")

# Example 2: Batch judgment of multiple questions
# print("\n\nExample 2: Batch Questions")
# test_questions = [
#     "Who are the collaborators of Emily Chen?",
#     "How many articles has each researcher published?",
#     "What are the most popular research topics?",
#     "Which researchers work on AI?",
# ]
# batch_judge_questions(test_questions)

Batch mode with multiple questions


# **Evaluation Question Sets (Overview Only)**

This evaluation framework organizes questions into eight categories according to which method, **RAG** or **Knowledge Graph (KG)**, is expected to perform better. Each category targets a different reasoning pattern, so you can benchmark each method's strengths, weaknesses, and ideal use cases.

---

## **1. Relationship Queries**

**Expected Winner: Knowledge Graph**
These questions test the system’s ability to traverse structured relationships between entities.

Example questions:

* Who are the collaborators of Emily Chen?
* Which researchers have co-authored papers with David Johnson?
* Find all researchers who have worked with Sarah Lee.
* Which authors have published together on AI Ethics?

---

## **2. Counting & Aggregation**

**Expected Winner: Knowledge Graph**
These require exact counts, grouping, and structured aggregations.

Example questions:

* How many articles has each researcher published?
* Which researcher has published the most papers?
* How many papers are there on each topic?
* What is the total number of publications in 2023?
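
To illustrate why exact counts favor structured data, here is a minimal sketch that groups explicit (researcher, article) records and counts them deterministically. The records and the `articles_per_researcher` helper are made-up placeholders for illustration, not rows or functions from the notebook's dataset or graph.

```python
from collections import Counter

# Hypothetical (researcher, article) pairs; placeholders, not real dataset rows.
authorship = [
    ("Researcher A", "Paper 1"),
    ("Researcher A", "Paper 2"),
    ("Researcher B", "Paper 2"),
    ("Researcher C", "Paper 3"),
    ("Researcher A", "Paper 2"),  # duplicate record, counted only once
]

def articles_per_researcher(pairs):
    """Count distinct articles per researcher."""
    distinct_pairs = set(pairs)
    return Counter(researcher for researcher, _ in distinct_pairs)

counts = articles_per_researcher(authorship)
top_researcher, top_count = counts.most_common(1)[0]
print(dict(counts))
print(f"Most prolific: {top_researcher} ({top_count} articles)")
```

A graph query performs the same kind of grouping over stored relationships, which is why the answer does not depend on which text chunks a retriever happens to surface.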

---

## **3. Filtering & Specific Queries**

**Expected Winner: Knowledge Graph**
KG excels when filtering based on explicit attributes or relationships.

Example questions:

* Show me all articles published by Emily Chen.
* What papers did Lisa Wang publish in 2024?
* Which papers were published after January 2024?
* Show me articles on AI Ethics published in 2023.

---

## **4. Topic-Based Queries**

**Expected Winner: Mixed (RAG + KG)**
These evaluate both semantic understanding (RAG) and structured topic associations (KG).

Example questions:

* What topics does Emily Chen research?
* Which researchers work on AI Ethics?
* What are the main research areas in the dataset?
* What subtopics are covered under Foundations of Language Models?

---

## **5. Semantic / Content Queries**

**Expected Winner: RAG**
RAG is stronger for interpretive, narrative, and content-based reasoning.

Example questions:

* What are the main challenges in AI safety?
* Explain the innovations in transformer architectures.
* What approaches are proposed for privacy in AI?
* Summarize the research on language model optimization.

---

## **6. Complex Multi-Hop Queries**

**Expected Winner: Knowledge Graph**
These require multi-step reasoning across connected entities.

Example questions:

* Which researchers work on the same topics as Emily Chen?
* Find researchers who collaborate with colleagues of David Johnson.
* What topics connect Michael Brown and Sarah Lee?
* Which researchers published in both 2023 and 2024?
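
As a rough illustration of what "multi-hop" means here, the sketch below finds collaborators of collaborators over an in-memory adjacency map. The edge list, the single-letter names, and both helper functions are hypothetical; the notebook itself runs such traversals in Neo4j rather than in plain Python.

```python
# Hypothetical undirected co-authorship edges (placeholder names).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]

def build_adjacency(edge_list):
    """Map each person to the set of people they have co-authored with."""
    adjacency = {}
    for left, right in edge_list:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency

def collaborators_of_collaborators(adjacency, person):
    """Return people exactly two hops away: colleagues of direct collaborators."""
    direct = adjacency.get(person, set())
    second_hop = set()
    for colleague in direct:
        second_hop |= adjacency.get(colleague, set())
    # Exclude the person themselves and anyone who is already a direct collaborator.
    return second_hop - direct - {person}

graph = build_adjacency(edges)
print(collaborators_of_collaborators(graph, "A"))
```

Each hop follows an explicit edge, so the result is exact. A retriever working over text would have to find and combine several passages to reach the same answer.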

---

## **7. Temporal Queries**

**Expected Winner: Knowledge Graph**
KG handles exact dates, ranges, and chronological patterns.

Example questions:

* What research was published in the last quarter of 2023?
* Which topics were most popular in 2024?
* What was the first paper published on AI Ethics?
* Compare publication activity between 2023 and 2024.
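
A question such as "the last quarter of 2023" reduces to an exact date-range check once publication dates are stored as structured values. The sketch below shows that check in plain Python; `in_quarter` and the sample dates are illustrative assumptions, not part of the notebook's pipeline.

```python
from datetime import date

def in_quarter(published, year, quarter):
    """Return True if `published` falls inside the given calendar quarter (1-4)."""
    first_month = 3 * (quarter - 1) + 1
    return published.year == year and first_month <= published.month <= first_month + 2

# Hypothetical publication dates (placeholders, not dataset rows).
publication_dates = [
    date(2023, 9, 30),
    date(2023, 10, 1),
    date(2023, 12, 31),
    date(2024, 1, 1),
]

q4_2023 = [d for d in publication_dates if in_quarter(d, 2023, 4)]
print(q4_2023)
```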

---

## **8. Comparison Queries**

**Expected Winner: Mixed**
These require both factual comparison (KG) and interpretive analysis (RAG).

Example questions:

* Compare the research focus of Emily Chen vs Michael Brown.
* Which topic has more researchers: AI Ethics or Language Models?
* Who is more prolific: David Johnson or Sarah Lee?
* Compare collaboration patterns across research areas.

---

# **Additional Evaluation Sets**

## **Quick Evaluation Set (10 Questions)**

A fast benchmark that covers all eight categories:

* One question each for relationship, filtering, topic, complex, temporal, and comparison.
* Two questions each for counting and semantic.

## **Medium Evaluation Set (20 Questions)**

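Sampled from six categories (filtering and comparison questions are not included):

* Relationship (all 5 questions)
* Counting (first 3)
* Topic (first 3)
* Semantic (first 3)
* Complex (first 3)
* Temporal (first 3)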

## **Strength/Weakness Diagnostic Set**

Highlights scenarios where each method performs best—or struggles:

* KG strengths: collaboration networks, exact counts, date-range queries
* RAG strengths: summaries, innovations, ethical concerns
* Edge cases: citation impact, subjective comparisons, predictive reasoning





# **📈 Decision Flowchart: Should You Use KG or RAG?**

```
Start with query
  │
  ▼
Does the question require EXACT data, numbers, or factual recall?
  │
  ├─ Yes → Use KNOWLEDGE GRAPH (structured, deterministic)
  │          Examples:
  │           • Who collaborated with X?
  │           • How many papers in 2024?
  │           • List all articles by Emily Chen
  │
  └─ No
      │
      ▼
Does the question require SEMANTIC reasoning, interpretation,
summarization, or explanation?
  │
  ├─ Yes → Use RAG (unstructured reasoning)
  │          Examples:
  │           • What are the challenges in AI Safety?
  │           • Summarize research on transformer models
  │           • What are ethical concerns in AI research?
  │
  └─ No
      │
      ▼
Is the question multi-hop, relational, or graph-based?
  │
  ├─ Yes → Use KNOWLEDGE GRAPH (multi-hop traversal)
  │
  └─ No
      │
      ▼
Is the question subjective, comparative, or mixed-mode?
  │
  ├─ Yes → Use BOTH: hybrid KG + RAG (best for comparisons)
  │          Example: “Compare Emily Chen vs Michael Brown”
  │
  └─ No
      │
      ▼
Default: try both KG and RAG (let the evaluator choose the winner)
```
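
The flowchart can be approximated in code as an ordered list of rules, checked top to bottom. The sketch below is only a rough illustration: the keyword patterns, the `ROUTING_RULES` table, and the `route_question` function are assumptions made for this example, and a real router would likely need an LLM classifier or a richer rule set.

```python
import re

# Keyword cues are illustrative assumptions, checked in the flowchart's order.
ROUTING_RULES = [
    ("KG", r"\b(how many|count|list all|show me all|who collaborated)\b"),  # exact data
    ("RAG", r"\b(summarize|explain|challenges|concerns|insights)\b"),       # semantic
    ("KG", r"\b(collaborat\w*|co-authored|same topics|connect)\b"),         # multi-hop
    ("HYBRID", r"\b(compare|versus|vs)\b"),                                 # comparative
]

def route_question(question):
    """Return 'KG', 'RAG', 'HYBRID', or 'BOTH' (the default) for a question."""
    text = question.lower()
    for method, pattern in ROUTING_RULES:
        if re.search(pattern, text):
            return method
    # No rule matched: run both pipelines and let the evaluator pick a winner.
    return "BOTH"

for sample in [
    "How many papers in 2024?",
    "Summarize research on transformer models",
    "What topics connect Michael Brown and Sarah Lee?",
    "Compare Emily Chen vs Michael Brown",
]:
    print(f"{route_question(sample):7s} {sample}")
```

Rule order matters: an earlier rule wins when several patterns match, which mirrors the way the flowchart asks its questions in sequence.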



In [None]:
# ============================================
# EVALUATION QUESTION SETS
# ============================================

# SET 1: RELATIONSHIP QUERIES (KG should excel)
relationship_questions = [
    "Who are the collaborators of Emily Chen?",
    "Which researchers have co-authored papers with David Johnson?",
    "Find all researchers who have worked with Sarah Lee",
    "Who has Michael Brown collaborated with?",
    "Which authors have published together on AI Ethics?",
]

# SET 2: COUNTING & AGGREGATION (KG should dominate)
counting_questions = [
    "How many articles has each researcher published?",
    "Which researcher has published the most papers?",
    "How many papers are there on each topic?",
    "Count the number of articles in AI Ethics",
    "Which topic has the most publications?",
    "How many researchers work on Foundations of Language Models?",
    "What is the total number of publications in 2023?",
]

# SET 3: FILTERING & SPECIFIC QUERIES (KG should win)
filtering_questions = [
    "Show me all articles published by Emily Chen",
    "What papers did Lisa Wang publish in 2024?",
    "Find articles about Model Optimization",
    "Which papers were published after January 2024?",
    "List all researchers working on Safety subtopic",
    "Show me articles on AI Ethics published in 2023",
]

# SET 4: TOPIC-BASED QUERIES (Mixed - both methods useful)
topic_questions = [
    "What topics does Emily Chen research?",
    "Which researchers work on AI Ethics?",
    "What are the main research areas in the dataset?",
    "Who works on Model Architectures?",
    "Find all researchers interested in Social Impact",
    "What subtopics are covered under Foundations of Language Models?",
]

# SET 5: SEMANTIC/CONTENT QUERIES (RAG should excel)
semantic_questions = [
    "What are the main challenges in AI safety according to the research?",
    "Explain the innovations in transformer architectures",
    "What are the ethical concerns about AI development?",
    "Summarize the research on language model optimization",
    "What approaches are proposed for privacy in AI?",
    "How is AI being used to address climate change?",
    "What are the key insights about scaling laws in language models?",
]

# SET 6: COMPLEX MULTI-HOP QUERIES (KG should excel)
complex_questions = [
    "Which researchers work on the same topics as Emily Chen?",
    "Find researchers who collaborate with colleagues of David Johnson",
    "What topics connect Michael Brown and Sarah Lee?",
    "Which researchers published in both 2023 and 2024?",
    "Find articles that bridge multiple subtopics",
    "Who are the most connected researchers in the collaboration network?",
]

# SET 7: TEMPORAL QUERIES (KG should win)
temporal_questions = [
    "What research was published in the last quarter of 2023?",
    "Which topics were most popular in 2024?",
    "Show the research timeline for Emily Chen",
    "What was the first paper published on AI Ethics?",
    "Compare publication activity between 2023 and 2024",
]

# SET 8: COMPARISON QUERIES (Mixed)
comparison_questions = [
    "Compare the research focus of Emily Chen vs Michael Brown",
    "Which topic has more researchers: AI Ethics or Language Models?",
    "Who is more prolific: David Johnson or Sarah Lee?",
    "Compare collaboration patterns between different research areas",
]

# ============================================
# COMPREHENSIVE EVALUATION SUITE
# ============================================

def run_comprehensive_evaluation():
    """Run evaluation across all question types"""

    evaluation_sets = {
        "Relationship Queries (KG Expected to Win)": relationship_questions,
        "Counting & Aggregation (KG Expected to Win)": counting_questions,
        "Filtering Queries (KG Expected to Win)": filtering_questions,
        "Topic Queries (Mixed Results)": topic_questions,
        "Semantic/Content Queries (RAG Expected to Win)": semantic_questions,
        "Complex Multi-hop (KG Expected to Win)": complex_questions,
        "Temporal Queries (KG Expected to Win)": temporal_questions,
        "Comparison Queries (Mixed)": comparison_questions,
    }

    print("\n" + "🎯" * 40)
    print("COMPREHENSIVE EVALUATION SUITE")
    print("🎯" * 40)
    print(f"\nTotal Question Sets: {len(evaluation_sets)}")
    print(f"Total Questions: {sum(len(qs) for qs in evaluation_sets.values())}")

    all_results = {}

    for set_name, questions in evaluation_sets.items():
        print(f"\n{'='*80}")
        print(f"📋 {set_name}")
        print(f"{'='*80}")
        print(f"Questions: {len(questions)}")

        results = batch_judge_questions(questions)
        all_results[set_name] = results

        # Quick summary for this set
        rag_wins = sum(1 for r in results if r.get('winner') == 'A')
        kg_wins = sum(1 for r in results if r.get('winner') == 'B')
        ties = sum(1 for r in results if r.get('winner') == 'TIE')

        print(f"\n📊 Set Results:")
        print(f"  RAG: {rag_wins}, KG: {kg_wins}, Ties: {ties}")

    # Overall statistics
    print("\n" + "🏆" * 40)
    print("OVERALL EVALUATION RESULTS")
    print("🏆" * 40)

    total_rag_wins = sum(
        sum(1 for r in results if r.get('winner') == 'A')
        for results in all_results.values()
    )
    total_kg_wins = sum(
        sum(1 for r in results if r.get('winner') == 'B')
        for results in all_results.values()
    )
    total_ties = sum(
        sum(1 for r in results if r.get('winner') == 'TIE')
        for results in all_results.values()
    )
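    # Count every evaluated question, including any where the judge returned no winner
    total_questions = sum(len(results) for results in all_results.values())
    # Use at least 1 as the denominator to avoid ZeroDivisionError when nothing was evaluated
    denominator = max(total_questions, 1)

    print(f"\n📈 Aggregate Statistics:")
    print(f"  Total Questions: {total_questions}")
    print(f"  RAG Wins: {total_rag_wins} ({total_rag_wins/denominator*100:.1f}%)")
    print(f"  KG Wins: {total_kg_wins} ({total_kg_wins/denominator*100:.1f}%)")
    print(f"  Ties: {total_ties} ({total_ties/denominator*100:.1f}%)")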

    print(f"\n🎯 Performance by Question Type:")
    for set_name, results in all_results.items():
        rag = sum(1 for r in results if r.get('winner') == 'A')
        kg = sum(1 for r in results if r.get('winner') == 'B')
        tie = sum(1 for r in results if r.get('winner') == 'TIE')
        total = len(results)

        winner = "RAG" if rag > kg else "KG" if kg > rag else "TIE"
        print(f"\n  {set_name}")
        print(f"    Winner: {winner}")
        print(f"    RAG: {rag}/{total} ({rag/total*100:.0f}%), KG: {kg}/{total} ({kg/total*100:.0f}%), Ties: {tie}/{total}")

    return all_results


# ============================================
# QUICK EVALUATION SETS
# ============================================

# Small set for quick testing (10 questions)
quick_eval_questions = [
    # Relationship (KG should win)
    "Who are the collaborators of Emily Chen?",

    # Counting (KG should win)
    "How many articles has each researcher published?",

    # Filtering (KG should win)
    "Show me all articles published by David Johnson",

    # Topic (Mixed)
    "Which researchers work on AI Ethics?",

    # Semantic (RAG should win)
    "What are the main challenges in AI safety?",

    # Complex (KG should win)
    "Which researchers work on the same topics as Emily Chen?",

    # Temporal (KG should win)
    "What research was published in 2024?",

    # Comparison (Mixed)
    "Compare the research focus of Emily Chen vs Michael Brown",

    # Counting (KG should win)
    "Which topic has the most publications?",

    # Semantic (RAG should win)
    "Explain the innovations in transformer architectures",
]

# Medium set for balanced testing (20 questions)
medium_eval_questions = (
    relationship_questions      # all 5 relationship questions
    + counting_questions[:3]
    + topic_questions[:3]
    + semantic_questions[:3]
    + complex_questions[:3]
    + temporal_questions[:3]
)

# Curated set highlighting strengths/weaknesses
strength_weakness_questions = [
    # KG Strengths
    "Who collaborated with whom on Model Optimization papers?",
    "How many co-authors does each researcher have?",
    "Find all papers published between June and December 2023",
    "Which researchers published exactly 2 papers?",

    # RAG Strengths
    "What are the key innovations proposed for transformer architectures?",
    "Summarize the ethical concerns discussed in AI research",
    "Explain the privacy-preserving techniques mentioned in the papers",
    "What solutions are proposed for AI bias mitigation?",

    # Edge Cases (where both might struggle)
    "What is the citation impact of each paper?",  # Data not in graph
    "Compare the technical depth of papers on Safety",  # Subjective
    "Predict future research directions based on current trends",  # Speculative
    "Which paper had the most innovative methodology?",  # Requires judgment
]


# ============================================
# USAGE EXAMPLES
# ============================================

if __name__ == "__main__":

    # Option 1: Quick evaluation (10 questions)
    print("Running quick evaluation...")
    batch_judge_questions(quick_eval_questions)

    # Option 2: Medium evaluation (20 questions)
    # print("Running medium evaluation...")
    # batch_judge_questions(medium_eval_questions)

    # Option 3: Comprehensive evaluation (all sets)
    # print("Running comprehensive evaluation...")
    # run_comprehensive_evaluation()

    # Option 4: Strength/Weakness analysis
    # print("Running strength/weakness analysis...")
    # batch_judge_questions(strength_weakness_questions)

    # Option 5: Focus on specific question type
    # print("Evaluating relationship queries...")
    # batch_judge_questions(relationship_questions)



# **📊 Summary Table: Expected Winners by Question Type**

| **Question Category**             | **Description**                                              | **Expected Winner**  |
| --------------------------------- | ------------------------------------------------------------ | -------------------- |
| **1. Relationship Queries**       | Entity–entity links, co-authorship, collaboration paths      | **KG**               |
| **2. Counting & Aggregation**     | Counts, totals, group-by operations                          | **KG**               |
| **3. Filtering Queries**          | Direct filters: by researcher, year, topic                   | **KG**               |
| **4. Topic-Based Queries**        | Topic membership, topic hierarchy, thematic grouping         | **Mixed (KG + RAG)** |
| **5. Semantic / Content Queries** | Summaries, insights, conceptual explanations                 | **RAG**              |
| **6. Complex Multi-Hop Queries**  | Multi-step graph reasoning, indirect relationships           | **KG**               |
| **7. Temporal Queries**           | Timelines, date filtering, year-based comparisons            | **KG**               |
| **8. Comparison Queries**         | Entity-to-entity comparisons (focus, productivity, networks) | **Mixed**            |
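
One way to check these expectations against actual judge output is to compute, per category, how often the judged winner matches the expected one. The sketch below assumes each result is a dict with a `'winner'` key holding `'A'` (RAG), `'B'` (KG), or `'TIE'`, as in the aggregate statistics code earlier in this notebook. The short category keys, `EXPECTED_WINNER`, and `agreement_rate` are hypothetical names introduced for illustration.

```python
# Expected winners from the summary table, in the judge's labels ('A' = RAG, 'B' = KG).
# None marks the "Mixed" categories, which have no single expected winner.
# The short category keys are hypothetical names chosen for this sketch.
EXPECTED_WINNER = {
    "relationship": "B",
    "counting": "B",
    "filtering": "B",
    "topic": None,
    "semantic": "A",
    "complex": "B",
    "temporal": "B",
    "comparison": None,
}

def agreement_rate(category, results):
    """Fraction of judged questions whose winner matches the expected one, or None."""
    expected = EXPECTED_WINNER.get(category)
    # Ignore results where the judge produced no usable verdict.
    judged = [r.get("winner") for r in results if r.get("winner") in ("A", "B", "TIE")]
    if expected is None or not judged:
        return None
    return sum(1 for winner in judged if winner == expected) / len(judged)

# Hand-written sample verdicts, not real evaluation output.
sample_results = [{"winner": "B"}, {"winner": "A"}, {"winner": "TIE"}, {}]
print(agreement_rate("counting", sample_results))
```

In the notebook, the `results` argument could be the list returned by `batch_judge_questions(counting_questions)`. A low agreement rate for a category is a signal to revisit either the expectation or the pipeline.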

---

# **🏆 Winner Overview**

| **Method**               | **Types of Questions It Excels At**                                          |
| ------------------------ | ---------------------------------------------------------------------------- |
| **Knowledge Graph (KG)** | Relationship, counting, strict filtering, multi-hop, temporal logic          |
| **RAG**                  | Semantic interpretation, explanations, summaries, contextual reasoning       |
| **Mixed**                | Topic-based questions, comparisons requiring both structure + interpretation |

