# Beyond RAG: Building Explainable AI with Graphs


## 📚 What is GraphRAG?

**GraphRAG** (Graph-based Retrieval-Augmented Generation) is an advanced framework that enhances traditional **RAG (Retrieval-Augmented Generation)** by incorporating **knowledge graphs** to improve context retrieval and reasoning. It leverages structured relationships between entities (e.g., papers, authors, concepts) to provide **more accurate, connected, and explainable answers**.

---

### What is a knowledge graph?

A **knowledge graph** is a structured representation of information that captures **entities** (things) and the **relationships** between them in the form of a **graph**.

---

📌 **Key Components**:

| Element                 | Description                              | Example                                            |
| ----------------------- | ---------------------------------------- | -------------------------------------------------- |
| **Entity (Node)**       | A concept, object, or person             | `"Gucci"`, `"Luxury Branding"`, `"Sustainability"` |
| **Relationship (Edge)** | A link between two entities              | `"Gucci" → supports → "Sustainability"`            |
| **Property**            | Additional info on an entity or relation | `"Founded: 1921"`, `"Revenue: $10B"`               |
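
The three elements in the table map directly onto a property graph. The sketch below uses NetworkX (the library used later in this notebook); the attribute names `type`, `founded`, and `relationship` are illustrative choices rather than a fixed schema.

```python
import networkx as nx

kg = nx.DiGraph()

# Entities (nodes); properties are stored as node attributes
kg.add_node("Gucci", type="brand", founded=1921)
kg.add_node("Sustainability", type="concept")

# Relationship (directed edge) carrying its own attribute
kg.add_edge("Gucci", "Sustainability", relationship="supports")

for source, target, attrs in kg.edges(data=True):
    print(f"{source} -[{attrs['relationship']}]-> {target}")
```

Keeping the relationship name as an edge attribute is what later lets a traversal filter on specific edge types.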

----

### ⚙️ How GraphRAG Works

🔹 **Knowledge Graph Construction**

* Extracts entities (e.g., *"luxury brands," "scarcity"*) and relationships (e.g., *"influences," "conflicts with"*) from documents.
* Builds a graph where:

  * **Nodes** = Entities (papers, authors, concepts)
  * **Edges** = Relationships (citations, authorship, thematic links)

🔹 **Retrieval Phase**

* For a query (e.g., *"How does sustainability impact luxury branding?"*), it:

  * Traverses the graph to find multi-hop connections
    (e.g., *sustainability → conflicts → exclusivity → resolved by → Gucci*)
  * Retrieves richer context than chunk-based RAG.

🔹 **Generation Phase**

* Augments the LLM (e.g., Gemini, GPT) with the **subgraph of relevant relationships**.
* Generates answers grounded in **structured evidence**, not just raw text snippets.
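
The retrieval and generation phases above can be sketched as a k-hop neighborhood lookup followed by serializing the edges into prompt text. This is a minimal illustration under assumptions: the toy graph, the edge labels, and the helper names `retrieve_subgraph` and `to_context` are hypothetical, and a real system would first map the query to a seed node.

```python
import networkx as nx

# Toy graph based on the example path in the text (labels are illustrative)
kg = nx.DiGraph()
kg.add_edge("sustainability", "exclusivity", relationship="conflicts with")
kg.add_edge("exclusivity", "Gucci", relationship="resolved by")
kg.add_edge("scarcity", "exclusivity", relationship="influences")
kg.add_edge("Gucci", "luxury branding", relationship="example of")

def retrieve_subgraph(graph, seed, hops=2):
    """Return the subgraph of nodes reachable from `seed` within `hops` directed steps."""
    return nx.ego_graph(graph, seed, radius=hops)

def to_context(subgraph):
    """Serialize each edge as 'A --relation--> B' so it can be placed in an LLM prompt."""
    return "\n".join(
        f"{u} --{d['relationship']}--> {v}" for u, v, d in subgraph.edges(data=True)
    )

sub = retrieve_subgraph(kg, "sustainability", hops=2)
context = to_context(sub)
print(context)
```

The `hops` parameter trades recall for prompt size: a larger radius pulls in more distant relationships but also more text for the model to read.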

---

### GraphRAG vs. Traditional RAG

* **RAG** retrieves chunks based on **textual similarity only** (no understanding of connections).
* **GraphRAG** navigates **structured relationships** between entities, citations, topics, or time points.
* Ideal for **researchers**, **analysts**, and **enterprise workflows** that demand **traceability and reasoning paths**.

---

The example questions below illustrate the difference between **Traditional RAG** and **GraphRAG**:

| **Question**                                                                     | **RAG Output**                                                                                       | **GraphRAG Output**                                                                                                                                                              | **What GraphRAG Adds**                                                                                |
| -------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| **How do Brand A and Brand B differ in sustainability messaging?**               | *“Brand A emphasizes carbon offsetting while Brand B focuses on circular fashion.”*                  | *“Brand A’s strategy centers around carbon offsetting (Report X), while Brand B aligns with the circularity model (Paper Y). Graph links show divergent sustainability logics.”* | GraphRAG traces **conceptual differences through cited frameworks**, not just surface-level mentions. |
| **How is influencer marketing impacting luxury pricing?**                        | *“Influencer marketing increases demand, influencing perceived value and enabling premium pricing.”* | *“Influencer campaigns → perceived exclusivity → scarcity logic → price premiums (Brand X case study, 2023).”*                                                                   | GraphRAG connects **causal paths**, not just retrieves related paragraphs.                            |
| **What are the opposing views between Author A and Author B on digital luxury?** | *“Author A supports NFTs in luxury; Author B is skeptical about blockchain integration.”*            | *“Author A supports NFTs as exclusivity tools (Paper Z); Author B warns of heritage dilution (Paper Q). Citation paths: A → supports → NFTs; B → critiques → tokenization.”*     | GraphRAG maps **contradictory stances via citation networks**, enabling structured comparison.        |
| **How has luxury brand strategy evolved since COVID-19?**                        | *“Luxury brands shifted to e-commerce and digital storytelling.”*                                    | *“Path: lockdown → digital push → livestreaming (China) → Hermes on WeChat → 2023 personalization surge.”*                                                                       | GraphRAG shows **chronological, interconnected shifts** in strategy using a timeline-style graph.     |
| **Which authors influenced the rise of scarcity narratives in luxury branding?** | *“Scarcity is discussed by Author X and Author Y in multiple studies.”*                              | *“Author Z → cited by Author X → defined modern scarcity logic → influenced current exclusivity tactics.”*                                                                       | GraphRAG **traces influence across authors and papers**, revealing deeper intellectual lineage.       |

---

## 💡 Why Use GraphRAG?

✅ **Multi-Hop Reasoning**
→ Answers complex queries like *“How does X influence Y through Z?”* via graph paths.

✅ **Explainability**
→ Shows evidence trails (e.g., *Paper 1 → cites → Paper 2 → contradicts → Theory A*).

✅ **Dynamic Knowledge**
→ Updates graphs with new papers/data **without retraining LLMs**.

✅ **Domain Adaptability**
→ Works for **literature reviews**, **legal analysis**, **fraud detection**, etc.
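
As a rough sketch of the explainability and dynamic-knowledge points, the snippet below renders a graph path as an evidence trail and then extends the graph with one new edge, with no model involved. The citation graph and the helper `evidence_trail` are hypothetical.

```python
import networkx as nx

# Hypothetical citation graph mirroring the evidence trail in the text
kg = nx.DiGraph()
kg.add_edge("Paper 1", "Paper 2", relationship="cites")
kg.add_edge("Paper 2", "Theory A", relationship="contradicts")

def evidence_trail(graph, source, target):
    """Render the shortest directed path as 'A → rel → B → rel → C'."""
    path = nx.shortest_path(graph, source, target)
    parts = [path[0]]
    for u, v in zip(path, path[1:]):
        parts += [graph[u][v]["relationship"], v]
    return " → ".join(parts)

print(evidence_trail(kg, "Paper 1", "Theory A"))

# Dynamic knowledge: a new paper is a new edge, immediately queryable
kg.add_edge("Paper 3", "Paper 1", relationship="extends")
print(evidence_trail(kg, "Paper 3", "Theory A"))
```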

---

## 🧩 Key Components

### 📊 Graph Database

* Graph databases: **Neo4j**, **NebulaGraph**
* In-memory Python libraries: **NetworkX** (graph structure and algorithms), **PyVis** (interactive visualization)

### 🧠 Embedding Models

* Examples: `text-embedding-3-small` / `text-embedding-3-large` (OpenAI), `text-embedding-gecko` (Google)


## 🌍 Real-World Applications

* **Academic Research**: Map conflicting theories in literature reviews
* **Enterprise Knowledge**: Link internal docs (*e.g., “How does policy X relate to contract Y?”*)
* **Healthcare**: Trace drug interactions via medical paper graphs

---

## ⚠️ Limitations

| **Challenge**   | **Details**                                             |
| --------------- | ------------------------------------------------------- |
| **Complexity**  | Requires graph-building overhead (vs. simple RAG)       |
| **Scalability** | Large graphs need optimized databases (e.g., **Neo4j**) |

---


In [6]:

!pip install google-generativeai networkx pyvis spacy python-dotenv
!python -m spacy download en_core_web_sm  # For entity extraction



In [12]:
documents = [
    {
        "title": "The Paradox of Scarcity in Luxury Markets",
        "authors": ["Dion", "Borraz"],
        "journal": "Journal of Brand Management",
        "year": 2017,
        "abstract": "Luxury brands leverage scarcity to enhance perceived value...",
        "fields": ["luxury marketing", "consumer behavior"]
    },
    {
        "title": "Social Media Influencers and Brand Authenticity",
        "authors": ["Kapferer", "Valette-Florence"],
        "journal": "International Journal of Research in Marketing",
        "year": 2023,
        "abstract": "Influencers impact authenticity perceptions in luxury...",
        "fields": ["digital marketing", "brand authenticity"]
    }
]

## 2. Build a Multi-Relational Knowledge Graph

This code constructs a **multi-relational, directed knowledge graph** using `networkx.MultiDiGraph()` to represent complex relationships within a set of academic papers. Each entity in the data—**papers**, **authors**, **journals**, and **research fields**—is added as a **node** with a corresponding `type` attribute (e.g., `"paper"`, `"author"`). For every paper in the `documents` collection, the graph encodes structured relationships by creating **typed edges**:

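* Between each **author** and their paper: an author → paper edge labeled `"authored_by"` and a paper → author edge labeled `"author_of"`. Both labels are phrased from the target node's side, so the author → paper edge reads as "this paper is authored by that author".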
* From the **paper** to its **journal** (`"published_in"`) and vice versa (`"publishes"`)
* From the **paper** to each **field** of research (`"research_field"`) and from field to paper (`"paper_in_field"`)

This **bidirectional edge construction** supports multi-hop traversals, allowing for detailed graph reasoning (e.g., tracing a paper to its field, then finding all other papers in that field). The use of `MultiDiGraph` allows for **multiple relationships** between the same nodes, which is crucial for building rich semantic graphs needed in **GraphRAG** applications. This structure enables downstream tasks like context-aware retrieval, author influence analysis, or identifying interdisciplinary connections, and can be visualized with tools like `pyvis` for interactive exploration.


In [13]:
import networkx as nx
from pyvis.network import Network

G = nx.MultiDiGraph()  # Supports multiple edge types

for paper in documents:
    # Add nodes
    G.add_node(paper["title"], type="paper", year=paper["year"])
    for author in paper["authors"]:
        G.add_node(author, type="author")
    G.add_node(paper["journal"], type="journal")
    for field in paper["fields"]:
        G.add_node(field, type="field")

    # Add edges
    for author in paper["authors"]:
        G.add_edge(author, paper["title"], relationship="authored_by")
        G.add_edge(paper["title"], author, relationship="author_of")

    G.add_edge(paper["title"], paper["journal"], relationship="published_in")
    G.add_edge(paper["journal"], paper["title"], relationship="publishes")

    for field in paper["fields"]:
        G.add_edge(paper["title"], field, relationship="research_field")
        G.add_edge(field, paper["title"], relationship="paper_in_field")
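
The multi-hop traversal described above (paper → field → other papers in that field) could be written as follows. This is a sketch: `related_papers` is a hypothetical helper, and the small `demo` graph with placeholder paper names only reuses the edge labels from the cell above.

```python
import networkx as nx

def related_papers(graph, paper):
    """Two-hop traversal: paper -> field -> other papers in that field."""
    related = set()
    for _, field, data in graph.out_edges(paper, data=True):
        if data["relationship"] != "research_field":
            continue
        for _, other, d in graph.out_edges(field, data=True):
            if d["relationship"] == "paper_in_field" and other != paper:
                related.add(other)
    return related

# Stand-in graph with placeholder names, same edge labels as the notebook graph
demo = nx.MultiDiGraph()
for title, field in [
    ("Paper A", "luxury marketing"),
    ("Paper B", "luxury marketing"),
    ("Paper C", "digital marketing"),
]:
    demo.add_edge(title, field, relationship="research_field")
    demo.add_edge(field, title, relationship="paper_in_field")

print(related_papers(demo, "Paper A"))
```

On the two sample `documents`, which share no research field, the same function would return an empty set.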

## Visualize with Color-Coded Nodes

In [17]:
def show_graph(graph):
    net = Network(notebook=True, cdn_resources="remote", height="750px")

    # Color mapping
    colors = {
        "paper": "#004E98", #blue
        "author": "#B11226", #red
        "journal": "#D1D5DB", #grey
        "field": "#93B4E5" #light blue
    }

    for node in graph.nodes:
        node_type = graph.nodes[node].get("type", "")
        net.add_node(
            node,
            label=node,
            color=colors.get(node_type, "#888888"),
            title=f"{node_type.upper()}\n{graph.nodes[node].get('year', '')}"
        )

    for u, v, data in graph.edges(data=True):
        net.add_edge(u, v, label=data["relationship"], arrows="to")

    net.show("academic_graph.html")

show_graph(G)

academic_graph.html


In [18]:
from IPython.display import HTML
HTML(filename="academic_graph.html")

## Query with Gemini for Literature Synthesis: Symbolic Knowledge Graph option

A **symbolic knowledge graph** is a structured network in which nodes represent entities (e.g., people, concepts) and edges represent explicit relationships (e.g., "authored by," "influences"). It is built using rules, NLP (e.g., dependency parsing), or manual curation.

**Key Features**:

**Explicit Relationships**: Edges have clear semantics (e.g., [Dion] --authored--> [Paper X]).

**Discrete Symbols**: Nodes are human-interpretable (e.g., "sustainability", "Gucci").

**Logic-Based**: Supports queries like "Find all papers that cite both X and Y."

<img src="https://raw.githubusercontent.com/MariaAise/llm_guide/refs/heads/main/shared_assets/visuals/images/graph.png" alt="Symbolic Knowledge Graph"/>

**Pros**:

✅ Precision: Exact matches for known relationships.

✅ Explainability: Paths are interpretable (e.g., A → B → C).

✅ Multi-Hop Reasoning: Traverses chains of relationships.

**Cons**:

❌ Rigid: Struggles with fuzzy/vague queries (e.g., "things like luxury").

❌ Manual Effort: Requires rules or NLP pipelines to extract relationships.

**Use Cases**:

* Academic literature reviews (citations, authorship).
* Enterprise knowledge graphs (org charts, product hierarchies).
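
To make the "logic-based" feature concrete, here is a minimal sketch of the query "find all papers that cite both X and Y" as a set intersection over incoming edges. The citation graph and the helper `cites_both` are hypothetical.

```python
import networkx as nx

# Hypothetical citation graph: an edge A -> B means "A cites B"
citations = nx.DiGraph()
citations.add_edges_from(
    [
        ("Paper 1", "X"),
        ("Paper 1", "Y"),
        ("Paper 2", "X"),
        ("Paper 3", "Y"),
        ("Paper 3", "X"),
    ],
    relationship="cites",
)

def cites_both(graph, first, second):
    """Nodes with an edge to both `first` and `second` (all edges here are 'cites')."""
    return set(graph.predecessors(first)) & set(graph.predecessors(second))

print(sorted(cites_both(citations, "X", "Y")))
```

The result is exact: a paper either has both edges or it does not, which is the precision (and the rigidity) listed in the pros and cons above.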


In [19]:
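import os

import google.generativeai as genai
from IPython.display import Markdown

# Assumption: `model` is never defined in this notebook, so it is set up here.
# The GOOGLE_API_KEY variable name and the model name are placeholder choices.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

query = "What are the key findings about luxury brand authenticity from Kapferer's work?"

# Retrieve subgraph: paper nodes carry no "authors" attribute, so authorship
# is read from the author -> paper edges instead.
kapferer_papers = [
    n for n in G.successors("Kapferer") if G.nodes[n].get("type") == "paper"
]
context = []
for paper in kapferer_papers:
    for _, field, data in G.out_edges(paper, data=True):
        if data["relationship"] == "research_field":
            context.append(f"{paper} (field: {field})")

# Generate answer grounded in the retrieved graph context
response = model.generate_content(f"""
    Academic Context:
    {'; '.join(context)}

    Question: {query}
    Analyze using the papers' research fields:
""")

Markdown(response.text)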



Okay, let's break down Kapferer's key findings about luxury brand authenticity, analyzed through the lens of relevant research fields.  Keep in mind that Kapferer's work spans marketing, branding, consumer behavior, and even touches upon sociology and semiotics.  This multi-disciplinary perspective is crucial to understanding his views.

**Key Findings about Luxury Brand Authenticity from Kapferer's Work:**

Kapferer, particularly in his extensive work on luxury brand management, emphasizes that authenticity is *paramount* for the enduring success and desirability of luxury brands. It's not just a nice-to-have; it's the core of their value proposition. Here's a breakdown of key aspects:

1.  **Authenticity as Heritage and Origin (History and Origin Field):**
    *   **Emphasis on Historical Roots:** Luxury brands *must* have a deep and well-documented history. This isn't just marketing fluff; it's the bedrock of their authenticity. This history provides a narrative of excellence, craftsmanship, and evolution over time. Kapferer argues that luxury brands are often associated with the story of their founders, their unique craftsmanship, and the traditions passed down through generations.
    *   **Territorial Origin and "Made In":**  The geographical origin of the brand and its products is critical. "Made in Italy," "Swiss Made," "French Couture," etc., are not just labels but signify a specific set of skills, quality standards, and cultural associations.  Kapferer stresses that luxury brands must protect and leverage their territorial origin as a core component of their authenticity. Losing this connection weakens the brand significantly.  If a brand tries to "fake" this origin, consumers are quick to recognize the lack of authenticity.
    *   **Research Field Connection:** **History, Economic Geography, Cultural Studies**.  These fields are essential for understanding how the historical context, geographical location, and cultural values shape the brand's identity and are communicated to consumers.  Analyzing archival records, historical advertising, and the brand's physical presence in its origin region are crucial.

2.  **Authenticity as Commitment to Excellence (Operations and Production Field):**
    *   **Uncompromising Quality and Craftsmanship:** Luxury brands *cannot* compromise on quality.  The materials, the production process, and the attention to detail must be impeccable. This commitment to excellence is a tangible expression of the brand's authenticity. Kapferer highlights the importance of handcraftsmanship, even in the face of technological advancements.  He points to the continued relevance of artisan skills in maintaining the perceived value and authenticity of luxury goods.
    *   **Transparency and Traceability:**  Consumers increasingly want to know where the materials come from, how the products are made, and who made them. Transparency in the supply chain and ethical sourcing are becoming increasingly important dimensions of luxury brand authenticity. Brands that demonstrate a commitment to sustainability and fair labor practices enhance their perceived authenticity.
    *   **Research Field Connection:** **Operations Management, Supply Chain Management, Materials Science, Engineering, Ethics, and Corporate Social Responsibility (CSR)**. These disciplines help understand the practicalities of maintaining high-quality standards, ensuring ethical sourcing, and implementing transparent production processes.

3.  **Authenticity as Clear Brand Identity and Values (Marketing and Communication Field):**
    *   **Consistency and Coherence:** Luxury brands must have a clear and consistent brand identity that permeates all aspects of the business, from product design to marketing communications.  This includes consistent messaging, visual identity, and brand values.  Inconsistencies damage the perception of authenticity.
    *   **Brand Storytelling:**  Luxury brands are masters of storytelling. They weave compelling narratives around their history, their founders, their products, and their values.  These stories help consumers connect with the brand on an emotional level and reinforce the perception of authenticity.
    *   **Avoiding "Massification" and Over-Accessibility:** Kapferer warns against diluting the brand's exclusivity through excessive licensing, mass-market distribution, or overly aggressive marketing. Maintaining scarcity and exclusivity is critical for preserving the brand's aura and authenticity.  "Democratizing" luxury too much can backfire and erode the brand's perceived value.
    *   **Research Field Connection:** **Marketing, Branding, Advertising, Public Relations, Semiotics, Consumer Behavior**. These fields provide the tools and frameworks for understanding how brands communicate their identity, build relationships with consumers, and manage their reputation. Semiotics, in particular, helps to decode the signs and symbols that contribute to the perception of luxury.

4.  **Authenticity as Consumer Perception (Consumer Psychology Field):**
    *   **Perceived Authenticity is Key:** Ultimately, authenticity is in the *eye of the beholder*.  It's not enough for the brand to *be* authentic; consumers must *perceive* it as authentic. This perception is influenced by a variety of factors, including brand history, product quality, marketing communications, and word-of-mouth.
    *   **Building Trust and Relationships:**  Authenticity builds trust and fosters stronger relationships with consumers. Consumers are more likely to be loyal to brands that they perceive as authentic and transparent.
    *   **Vulnerability and Imperfection (Paradoxically):** While aiming for perfection in quality, brands that are *too* perfect can sometimes appear inauthentic. Acknowledging imperfections or vulnerabilities, while rare in luxury, can paradoxically enhance authenticity, as it shows a human side. This is a very delicate balance.
    *   **Research Field Connection:** **Consumer Psychology, Sociology, Anthropology**. These disciplines offer insights into how consumers perceive brands, form attitudes, and make purchase decisions.  Understanding the cultural and social context in which luxury brands operate is crucial for managing the perception of authenticity.

**In Summary:**

Kapferer's work emphasizes that luxury brand authenticity is a complex and multi-faceted concept. It's not just about historical heritage, though that's a crucial element.  It's about a holistic commitment to excellence, a clear brand identity, consistent communication, and ultimately, the consumer's perception of genuineness. Ignoring any of these aspects can significantly damage a luxury brand's reputation and long-term success. He shows it's a balance of creating and managing the brand in a way that respects its history, origins, and values, while delivering exceptional quality and service. This requires a deep understanding of various fields of study to create and maintain a truly authentic brand experience.


## GraphRAG + Vector-Based Semantic Search

In a **GraphRAG** (Graph-based Retrieval-Augmented Generation) system, traditional symbolic retrieval (i.e., knowledge graphs with nodes and edges) is enhanced by **vector-based semantic search**. This hybrid approach leverages the strengths of both worlds:

 🔹 **Vector-Based Semantic Search Recap**

* Transforms text (e.g., document titles, concepts, queries) into **dense numerical vectors** using models like `text-embedding-3-small`, `Sentence-BERT`, or `text-embedding-gecko`.
* These vectors represent **semantic meaning** — so similar ideas have vectors that are **closer** in high-dimensional space.
* Enables **flexible**, **fuzzy**, and **language-aware** matching between user queries and documents/entities.
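
"Closer" is usually measured with cosine similarity, which is also what the `cosine_similarity` call later in this notebook computes. For a query vector $\mathbf{q}$ and a node vector $\mathbf{d}$:

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
$$

As a toy two-dimensional illustration (real embeddings have hundreds of dimensions), $\mathbf{q} = (1, 0)$ and $\mathbf{d} = (1, 1)$ give

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{1 \cdot 1 + 0 \cdot 1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.71
$$

Vectors pointing in the same direction score 1, and orthogonal vectors score 0.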

---

🧠 **Where Embeddings Fit in GraphRAG**

In GraphRAG, embeddings are not used alone — they complement the graph. Here's how:

✅ Use Case 1: **Semantic Node Matching**

When a user submits a query like `"How does scarcity influence luxury branding?"`, GraphRAG:

1. Embeds the query using a language model.
2. Searches for semantically **closest nodes** in the graph (e.g., papers, authors, or concepts that may not use the word "scarcity" directly but relate to it).
3. These matches guide which parts of the **knowledge graph** to traverse next.

---

✅ Use Case 2: **Node Expansion via Embeddings**

When ingesting new content:

* You embed its key elements (title, abstract, keywords).
* You then **link it** to existing nodes in the graph **based on semantic similarity**, even if no explicit citation or author relationship exists.
* This enables **dynamic augmentation** of the graph — no rule-writing or manual ontology needed.
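
One way to sketch this ingestion step: compare the new item's embedding with every stored node embedding and add an edge when the similarity clears a threshold. Everything here is an assumption for illustration: the hand-written 3-dimensional vectors stand in for real model embeddings, and `link_by_similarity`, the `semantically_similar` label, and the 0.8 threshold are hypothetical choices.

```python
import math
import networkx as nx

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def link_by_similarity(graph, new_node, new_vec, threshold=0.8):
    """Add `new_node` and connect it to existing nodes whose embedding
    has cosine similarity >= threshold. Returns the linked node names."""
    linked = [
        node
        for node, vec in list(graph.nodes(data="embedding"))
        if vec is not None and cosine(new_vec, vec) >= threshold
    ]
    graph.add_node(new_node, embedding=new_vec)
    for node in linked:
        graph.add_edge(new_node, node, relationship="semantically_similar")
    return linked

# Toy vectors standing in for real embeddings
g = nx.Graph()
g.add_node("scarcity in marketing", embedding=[0.9, 0.1, 0.0])
g.add_node("sustainability in luxury", embedding=[0.0, 0.2, 0.9])

linked = link_by_similarity(g, "limited-edition drops", [0.8, 0.3, 0.1])
print(linked)
```

The threshold is the main tuning knob: set too low, the graph fills with weak links that dilute the explainability that motivated using a graph in the first place.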

---

🔁 Benefits of Combining Graphs + Embeddings

| Feature                | Vector-Based | Symbolic Graph | Together (GraphRAG)         |
| ---------------------- | ------------ | -------------- | --------------------------- |
| Fuzzy matching         | ✅            | ❌              | ✅                           |
| Logical path traversal | ❌            | ✅              | ✅                           |
| Multi-hop reasoning    | ❌            | ✅              | ✅                           |
| Semantic clustering    | ✅            | ❌              | ✅                           |
| Explainability         | ❌            | ✅              | Partial (via graph context) |

---

🔧 **Implementation Flow**

1. **Graph Construction**

   * Build a symbolic graph from documents (authors, concepts, journals, etc.).
   * Define relationships (e.g., *authored by*, *cites*, *mentions*).

2. **Embedding Layer**

   * Generate embeddings for all nodes or content chunks.
   * Store them in a **vector index** (e.g., FAISS, Chroma, Pinecone).

3. **Hybrid Retrieval at Query Time**

   * Embed user query.
   * Retrieve top-matching nodes/documents via vector search.
   * Traverse symbolic graph starting from these nodes to gather related entities.
   * Feed this **augmented context** into the LLM for grounded generation.
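
The query-time steps in this flow can be sketched end to end as below. This is a minimal illustration, not a production design: the 2-dimensional vectors are hand-written stand-ins for real embeddings, a linear scan replaces a vector index such as FAISS, and `hybrid_retrieve` is a hypothetical helper.

```python
import math
import networkx as nx

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_retrieve(graph, query_vec, top_k=1, hops=1):
    """Vector search picks entry nodes; graph traversal expands them into facts."""
    scored = sorted(
        (
            (cosine(query_vec, vec), node)
            for node, vec in graph.nodes(data="embedding")
            if vec is not None
        ),
        reverse=True,
    )
    seeds = [node for _, node in scored[:top_k]]
    facts = []
    for seed in seeds:
        # undirected=True also follows incoming edges when collecting neighbors
        neighborhood = nx.ego_graph(graph, seed, radius=hops, undirected=True)
        for u, v, d in neighborhood.edges(data=True):
            fact = f"{u} --{d['label']}--> {v}"
            if fact not in facts:
                facts.append(fact)
    return seeds, facts

# Toy graph with stand-in embeddings
g = nx.DiGraph()
g.add_node("luxury branding", embedding=[0.9, 0.2])
g.add_node("scarcity in marketing", embedding=[0.6, 0.6])
g.add_node("sustainability in luxury", embedding=[0.1, 0.9])
g.add_edge("luxury branding", "scarcity in marketing", label="enhances")
g.add_edge("sustainability in luxury", "luxury branding", label="challenges")

question = "How to market high-end products?"
query_vec = [1.0, 0.1]  # pretend embedding of the question
seeds, facts = hybrid_retrieve(g, query_vec)
prompt = "Context:\n" + "\n".join(facts) + f"\n\nQuestion: {question}"
print(seeds)
print(prompt)
```

The vector step supplies the fuzzy entry point, and the graph step supplies the labeled relationships, so the prompt carries explicit edges rather than only similar-sounding text.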

<img src="https://raw.githubusercontent.com/MariaAise/llm_guide/refs/heads/main/shared_assets/visuals/images/graphrag2.png" alt="GraphRag" width="400"/>

In [20]:
!pip install sentence-transformers networkx


In [22]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

# Sample nodes (papers/concepts)
nodes = [
    "luxury branding",
    "scarcity in marketing",
    "sustainability in luxury"
]

# Generate embeddings
node_embeddings = embedding_model.encode(nodes)

print(node_embeddings.shape)  # Output: (3, 384)

(3, 384)


In [23]:
import networkx as nx

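# New undirected graph for the embedding demo (rebinds G, replacing the MultiDiGraph built earlier)
G = nx.Graph()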

# Add nodes with embeddings
for node, embedding in zip(nodes, node_embeddings):
    G.add_node(node, embedding=embedding)

# Add relationships (optional)
G.add_edge("luxury branding", "scarcity in marketing", label="enhances")
G.add_edge("sustainability in luxury", "luxury branding", label="challenges")

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_nodes(query, graph, top_k=2):
    query_embedding = embedding_model.encode([query])
    similarities = {}

    for node in graph.nodes:
        node_embedding = graph.nodes[node].get("embedding")
        if node_embedding is not None:
            sim = cosine_similarity(query_embedding, [node_embedding])[0][0]
            similarities[node] = sim

    return sorted(similarities.items(), key=lambda x: -x[1])[:top_k]

# Example query
query = "How to market high-end products?"
similar_nodes = find_similar_nodes(query, G)
print(similar_nodes)
# Prints (node, cosine similarity) pairs, highest similarity first

[('scarcity in marketing', np.float32(0.40735376)), ('luxury branding', np.float32(0.3383499))]
