In [28]:
from dotenv import load_dotenv
import os

# Common data processing
import json
import textwrap

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI

# Warning control
import warnings
warnings.filterwarnings("ignore")

# Building a Knowledge Graph from SEC Filings for Graph RAG

This tutorial demonstrates how to transform unstructured text documents (SEC 10-K filings) into a queryable knowledge graph with vector search capabilities. We'll build a complete Graph RAG pipeline that combines semantic search with graph structure.

## What is Graph RAG?

**Graph RAG** = Traditional RAG (Retrieval Augmented Generation) + Knowledge Graphs

### Traditional RAG Limitations:
- Treats documents as isolated chunks
- No understanding of relationships between entities
- Can't answer multi-hop questions like "Which firms connected to X have invested over $Y?"

### Graph RAG Solution:
1. **Chunk documents** into manageable pieces
2. **Create graph nodes** for each chunk with metadata
3. **Generate embeddings** for semantic search
4. **Extract entities and relationships** (future: add company, person, investment nodes)
5. **Query using both** vector similarity AND graph traversal

## Real-World Use Case: Financial Analysis

Imagine asking: *"Which companies connected to Palo Alto Networks have disclosed cybersecurity risks and invested over $50M in R&D?"*

This requires:
- Semantic search for "cybersecurity risks" (vector similarity)
- Entity extraction for companies and financial figures
- Graph traversal for "connected to" relationships
- Filtering by investment amounts

## Dataset Overview

We're using **NetApp's 10-K Form** (SEC filing 0000950170-23-027948):
- **Item 1**: Business Overview
- **Item 1A**: Risk Factors  
- **Item 7**: Management's Discussion and Analysis
- **Item 7A**: Market Risk Disclosures

After processing, we'll have:
- **257 text chunks** (nodes in our graph)
- **Vector embeddings** for each chunk (1536 dimensions)
- **Metadata**: company name, CIK, CUSIP, form section, source URL

## Learning Objectives

By the end of this tutorial, you'll understand:

1. **Document Chunking**: How to split long documents while preserving context
2. **Graph Construction**: Creating nodes with Cypher MERGE statements
3. **Vector Indexing**: Setting up Neo4j vector indexes for similarity search
4. **Embedding Generation**: Using OpenAI embeddings within Neo4j
5. **Retrieval Augmented Generation**: Building a Q&A system with LangChain
6. **Hallucination Control**: Testing how the system handles out-of-scope questions

Let's get started!

In [29]:
# Load from environment
load_dotenv()
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

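# Fall back to the public OpenAI base URL so a missing variable doesn't raise a TypeError
OPENAI_ENDPOINT = os.getenv('OPENAI_BASE_URL', 'https://api.openai.com/v1') + '/embeddings'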

# Global constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'

## Setup and Configuration

First, we'll load environment variables and set up our connections to Neo4j and OpenAI.
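
If a variable is missing, `os.getenv` quietly returns `None` and the failure only shows up later, when a connection is attempted. One way to fail fast is sketched below. The helper `require_env` and the stand-in values are hypothetical names used for illustration, not part of the notebook.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {"NEO4J_URI": "neo4j+s://example.invalid", "NEO4J_USERNAME": "neo4j"}
settings = require_env(["NEO4J_URI", "NEO4J_USERNAME"], environ=demo_env)
print(settings["NEO4J_USERNAME"])
```

In the notebook itself, the same check could be run against the real environment right after `load_dotenv()`.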

In [30]:
# Confirm the key is present without printing any part of the secret
print("API key loaded" if OPENAI_API_KEY else "API Key is None!")
print(f"Endpoint: {OPENAI_ENDPOINT}")

API key loaded
Endpoint: https://api.openai.com/v1/embeddings


## Step 1: Load and Explore the SEC Filing Data

SEC **Form 10-K** is an annual report required by the U.S. Securities and Exchange Commission (SEC). It provides a comprehensive summary of a company's financial performance.

### Key Sections We're Processing:
- **Item 1**: Description of business operations
- **Item 1A**: Risk factors
- **Item 7**: Management's discussion and analysis (MD&A)
- **Item 7A**: Quantitative and qualitative disclosures about market risk

You can search and download these filings from the SEC's [EDGAR database](https://www.sec.gov/edgar/search/).

Our data file is a JSON containing the NetApp 10-K filing with pre-extracted sections.

In [31]:
file_name = "0000950170-23-027948.json"
# Use a context manager so the file handle is closed after parsing
with open(file_name) as f:
    file = json.load(f)
print(f"Type of the file: {type(file)}")

for k, v in file.items():
    print(k, type(v))

Type of the file: <class 'dict'>
item1 <class 'str'>
item1a <class 'str'>
item7 <class 'str'>
item7a <class 'str'>
cik <class 'str'>
cusip6 <class 'str'>
cusip <class 'list'>
names <class 'list'>
source <class 'str'>


In [32]:
item1_text = file['item1']
item1_text[0:1000]

'>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management, which we term ‘evolved cloud’, provi

## Step 2: Document Chunking Strategy

### Why Chunking Matters

**The Challenge**: LLMs have context window limits, and embeddings work best on focused text segments.

**The Goal**: Split documents into chunks that:
1. Fit within embedding model limits (we use OpenAI's ada-002: max 8,191 tokens)
2. Preserve semantic coherence (don't split mid-sentence or mid-paragraph)
3. Maintain some context overlap (help with boundary cases)

### Our Chunking Parameters:
- **chunk_size = 2000 characters**: ~400-500 tokens (safe for embeddings)
- **chunk_overlap = 200 characters**: Overlapping context prevents information loss at boundaries
- **RecursiveCharacterTextSplitter**: LangChain's splitter that tries to split on paragraph breaks (`"\n\n"`) first, then line breaks, then spaces, and only falls back to splitting between characters

### Visual Example:
```
Document: "AAAAA BBBBB CCCCC DDDDD EEEEE"

Chunk 1: "AAAAA BBBBB CCCCC"
                        ↓ (200 char overlap)
Chunk 2:         "CCCCC DDDDD EEEEE"
```

Because "CCCCC" appears in both chunks, a query about it can match either one, so content near a chunk boundary is not lost.
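
The diagram can be reproduced with a few lines of plain Python. This is a minimal sketch of fixed-width windows only: the function `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently because it prefers to cut at separators, so real overlaps vary in length.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


document = "AAAAA BBBBB CCCCC DDDDD EEEEE"
chunks = split_with_overlap(document, chunk_size=17, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

With these toy sizes, the last five characters of the first chunk are the first five characters of the second, which is the same effect the 200-character overlap has on the real filing.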

In [33]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap = 200,
    length_function = len,
    is_separator_regex = False
)

In [34]:
file_name[:file_name.rindex('.')]

'0000950170-23-027948'

In [35]:
def split_form10k_data_into_chunks(file_name):
    """Split each 10-K section into chunks and attach the filing metadata to every chunk."""
    chunks_with_metadata = []
    # Close the file handle once the JSON has been parsed
    with open(file_name) as f:
        file = json.load(f)
    # The form ID is the file name without its extension
    form_id = file_name[:file_name.rindex('.')]
    for item in ['item1', 'item1a', 'item7', 'item7a']:
        print(f'Processing {item} from {file_name}')
        item_text = file[item]
        item_text_chunks = text_splitter.split_text(item_text)
        chunk_seq_id = 0

        for chunk in item_text_chunks:
            chunks_with_metadata.append({
                'text': chunk,
                'form10kItem': item,
                'chunkSeqId': chunk_seq_id,
                'formId': form_id,
                'chunkId': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata
                'names': file['names'],
                'cik': file['cik'],
                'cusip6': file['cusip6'],
                'source': file['source']
            })
            chunk_seq_id += 1
        print(f'Split into {chunk_seq_id} chunks')
    return chunks_with_metadata


In [36]:
file_chunks = split_form10k_data_into_chunks(file_name)

Processing item1 from 0000950170-23-027948.json
Split into 254 chunks
Processing item1a from 0000950170-23-027948.json
Split into 1 chunks
Processing item7 from 0000950170-23-027948.json
Split into 1 chunks
Processing item7a from 0000950170-23-027948.json
Split into 1 chunks


In [37]:
file_chunks[0]

{'text': '>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management, which we term ‘evolved clou

### Understanding Our Chunk Metadata

Each chunk contains:
- **text**: The actual text content
- **chunkId**: Unique identifier (e.g., `0000950170-23-027948-item1-chunk0000`)
- **formId**: SEC filing ID
- **form10kItem**: Which section (item1, item1a, item7, item7a)
- **chunkSeqId**: Sequential order within that section (0, 1, 2, ...)
- **names**: Company names (e.g., ['Netapp Inc', 'NETAPP INC'])
- **cik**: Central Index Key (SEC's unique company identifier)
- **cusip6**: First 6 characters of CUSIP (financial security identifier)
- **source**: URL to the original SEC filing

This rich metadata enables:
- **Filtering**: "Only show me risk factors" (filter by form10kItem = 'item1a'; on the graph node this property is stored as `f10kItem`)
- **Provenance**: Track answers back to source documents
- **Entity resolution**: Link chunks to the same company via CIK
- **Sequential reading**: Use chunkSeqId to read in order
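
As a small illustration of filtering and sequential reading, the sketch below works on plain dictionaries shaped like the chunk records. The chunk IDs are shortened placeholders and `chunks_for_item` is a hypothetical helper.

```python
# Toy records with the same keys as the real chunk dictionaries.
sample_chunks = [
    {"chunkId": "form-item1-chunk0001", "form10kItem": "item1", "chunkSeqId": 1},
    {"chunkId": "form-item1a-chunk0000", "form10kItem": "item1a", "chunkSeqId": 0},
    {"chunkId": "form-item1-chunk0000", "form10kItem": "item1", "chunkSeqId": 0},
]


def chunks_for_item(chunks, item):
    """Keep only one section's chunks and return them in reading order."""
    selected = [c for c in chunks if c["form10kItem"] == item]
    return sorted(selected, key=lambda c: c["chunkSeqId"])


risk_factor_chunks = chunks_for_item(sample_chunks, "item1a")
business_chunks = chunks_for_item(sample_chunks, "item1")
print([c["chunkId"] for c in business_chunks])
```

The same selection can later be expressed in Cypher once the chunks are stored as nodes.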

## Step 3: Create Graph Nodes from Text Chunks

Now we'll transform our Python dictionaries into graph nodes using Cypher, Neo4j's query language.

### Understanding the MERGE Pattern

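**MERGE** is a "get or create" operation. Used with `ON CREATE SET` alone, it behaves like an "INSERT if not exists" in SQL:
- If a node with this `chunkId` already exists → match it and leave its properties unchanged
- If it doesn't exist → create it and set all properties

### Why MERGE instead of CREATE?
- **Idempotent**: Safe to run multiple times without creating duplicates
- **Upsert logic**: Adding an `ON MATCH SET` clause would also update existing nodes (we only use `ON CREATE SET` here)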

### The Cypher Query Breakdown:
```cypher
MERGE (mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
  // Find or create a Chunk node with this chunkId

ON CREATE SET
  // Only runs if the node was newly created
  mergedChunk.names = $chunkParam.names,
  mergedChunk.formId = $chunkParam.formId,
  // ... set all other properties

RETURN mergedChunk
  // Return the node (for verification)
```

### Parameters with `$chunkParam`:
We pass the entire chunk dictionary as a parameter. Neo4j automatically maps nested properties:
- `$chunkParam.chunkId` → `file_chunks[0]['chunkId']`
- `$chunkParam.text` → `file_chunks[0]['text']`
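
To make the `MERGE ... ON CREATE SET` behavior concrete, here is a toy model in which a Python dict stands in for the database. It is only an analogy under that assumption: `merge_chunk` and the demo IDs are hypothetical, and no Neo4j call is involved.

```python
def merge_chunk(store, chunk):
    """Mimic MERGE ... ON CREATE SET: create the node if it is absent,
    otherwise return the existing node without touching its properties."""
    chunk_id = chunk["chunkId"]
    if chunk_id in store:
        return store[chunk_id], False  # matched: ON CREATE SET is skipped
    store[chunk_id] = dict(chunk)
    return store[chunk_id], True  # created: all properties are set


store = {}
first = {"chunkId": "demo-chunk0000", "text": "original text"}
second = {"chunkId": "demo-chunk0000", "text": "changed text"}

_, created_first = merge_chunk(store, first)
_, created_second = merge_chunk(store, second)
print(len(store), created_first, created_second, store["demo-chunk0000"]["text"])
```

The second call neither duplicates the node nor overwrites its text, which is why re-running the load is safe but will not pick up edited chunk content.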

In [38]:
chunk_node_query = """
merge(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
on create set
    mergedChunk.names = $chunkParam.names,
    mergedChunk.formId = $chunkParam.formId, 
    mergedChunk.cik = $chunkParam.cik, 
    mergedChunk.cusip6 = $chunkParam.cusip6, 
    mergedChunk.source = $chunkParam.source, 
    mergedChunk.f10kItem = $chunkParam.form10kItem, 
    mergedChunk.chunkSeqId = $chunkParam.chunkSeqId, 
    mergedChunk.text = $chunkParam.text
return mergedChunk
"""

Connect to the graph instance using LangChain

In [39]:
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

Remove the movies dataset, if it exists

In [40]:
kg.query("""
      MATCH (n:Movie) DETACH DELETE n
  """)
kg.query("""
    MATCH (n:Person) DETACH DELETE n
""")
print("Movie dataset removed")

Movie dataset removed


Create a single chunk node

In [41]:
kg.query(chunk_node_query, params={'chunkParam': file_chunks[0]})

[{'mergedChunk': {'formId': '0000950170-23-027948',
   'f10kItem': 'item1',
   'names': ['Netapp Inc', 'NETAPP INC'],
   'cik': '1002047',
   'cusip6': '64110D',
   'source': 'https://www.sec.gov/Archives/edgar/data/1002047/000095017023027948/0000950170-23-027948-index.htm',
   'text': '>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the 

## Step 4: Create Constraints for Data Integrity

### What are Graph Constraints?

Like SQL constraints, Neo4j constraints enforce data rules:

1. **Uniqueness**: No two nodes can have the same property value
2. **Existence**: A property must exist on all nodes of a label
3. **Property type**: Enforce data types

### Why `UNIQUE` on chunkId?

```cypher
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
  FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
```

**Benefits**:
- **Prevents duplicates**: Can't accidentally create two chunks with same ID
- **Creates an index**: Automatic index on `chunkId` for fast lookups
- **MERGE performance**: Makes `MERGE` operations much faster (uses index)

**Real-world analogy**: Like a primary key in SQL databases.

### Checking Existing Indexes

The `SHOW INDEXES` command reveals:
- **VECTOR indexes**: For similarity search (we'll create one next)
- **RANGE indexes**: For property lookups (created by constraints)
- **LOOKUP indexes**: Internal Neo4j indexes for labels/relationships

In [42]:
kg.query("""
    create constraint unique_chunk if not exists
         for (c:Chunk) require c.chunkId is unique
""")

[]

In [43]:
kg.query('show indexes')

[{'id': 4,
  'name': 'form_10k_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-3.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2025, 12, 10, 1, 48, 35, 932000000, tzinfo=<UTC>),
  'readCount': 9},
 {'id': 1,
  'name': 'index_1b9dcc97',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2025, 12, 5, 7, 13, 47, 53000000, tzinfo=<UTC>),
  'readCount': 20},
 {'id': 0,
  'name': 'index_460996c0',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.Dat

- Loop through and create nodes for all chunks
- Should create 257 chunks

In [44]:
node_count = 0
for chunk in file_chunks:
    print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")

    kg.query(chunk_node_query, params={'chunkParam': chunk})
    node_count += 1
print(f'Created {node_count} nodes')

Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0000
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0001
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0002
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0003
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0004
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0005
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0006
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0007
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0008
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0009
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0010
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0011
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0012
Creating `:Chunk` node for chunk ID 0000950170-23-0

In [45]:
kg.query("""match (n) 
         return count(n) as nodeCount""")

[{'nodeCount': 257}]
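
The loop above sends one query per chunk. For larger filings, a common alternative is to send chunks in batches and let Cypher's `UNWIND` iterate over each batch. The sketch below shows the batching logic only; `BATCH_CHUNK_QUERY`, `batched`, `load_chunks`, and `fake_query` are hypothetical names, and the fake runner stands in for `kg.query` so the example runs without a database.

```python
BATCH_CHUNK_QUERY = """
UNWIND $chunks AS chunkParam
MERGE (mergedChunk:Chunk {chunkId: chunkParam.chunkId})
ON CREATE SET
    mergedChunk.names = chunkParam.names,
    mergedChunk.formId = chunkParam.formId,
    mergedChunk.cik = chunkParam.cik,
    mergedChunk.cusip6 = chunkParam.cusip6,
    mergedChunk.source = chunkParam.source,
    mergedChunk.f10kItem = chunkParam.form10kItem,
    mergedChunk.chunkSeqId = chunkParam.chunkSeqId,
    mergedChunk.text = chunkParam.text
RETURN count(mergedChunk) AS merged
"""


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def load_chunks(run_query, chunks, batch_size=100):
    """Send chunks in batches; run_query is any callable shaped like
    run_query(cypher, params=...). Returns the number of round trips."""
    round_trips = 0
    for batch in batched(chunks, batch_size):
        run_query(BATCH_CHUNK_QUERY, params={"chunks": batch})
        round_trips += 1
    return round_trips


# Stand-in for kg.query that records the size of each batch it receives.
recorded_batch_sizes = []


def fake_query(cypher, params=None):
    recorded_batch_sizes.append(len(params["chunks"]))


demo_chunks = [{"chunkId": f"demo-chunk{i:04d}"} for i in range(257)]
round_trips = load_chunks(fake_query, demo_chunks, batch_size=100)
print(round_trips, recorded_batch_sizes)
```

With the notebook's connection, the equivalent call would be `load_chunks(kg.query, file_chunks)`. The batch size of 100 is an arbitrary starting point, not a tuned value.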

## Step 5: Create a Vector Index for Semantic Search

### What is a Vector Index?

Traditional indexes work with exact matches or ranges:
- "Find chunks where formId = 'X'" → Range index
- "Find chunks where name contains 'NetApp'" → Text index

**Vector indexes** enable **similarity search**:
- "Find chunks semantically similar to this query"
- Uses cosine similarity, Euclidean distance, or other metrics
- Essential for RAG systems

### Understanding the Configuration

```cypher
CREATE VECTOR INDEX `form_10k_chunks` IF NOT EXISTS
  FOR (c:Chunk) ON (c.textEmbedding)
  OPTIONS {
    indexConfig: {
      `vector.dimensions`: 1536,  // OpenAI ada-002 embedding size
      `vector.similarity_function`: 'cosine'  // Cosine similarity
    }
  }
```

### Key Parameters:

| Parameter | Value | Why? |
|-----------|-------|------|
| **dimensions** | 1536 | OpenAI's text-embedding-ada-002 produces 1536-dim vectors |
| **similarity_function** | cosine | Common choice for text embeddings (ignores vector magnitude) |

### Similarity Functions Compared:

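- **Cosine**: Measures the angle between vectors. Raw cosine similarity ranges from -1 to 1; Neo4j reports it as a normalized score from 0 to 1 (higher = more similar)
  - Best for: Text, where magnitude doesn't matter
- **Euclidean**: Measures straight-line distance
  - Best for: Spatial data, images
- **Dot Product**: Considers both angle and magnitude
  - Best for: When vector magnitude has meaning

**For text embeddings, cosine similarity is the usual default.** OpenAI embeddings are normalized to unit length, so cosine and dot product rank results identically.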
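
The three measures are easy to compare on toy vectors. The sketch below uses plain Python and made-up two-dimensional vectors; it illustrates the definitions, not Neo4j's internal implementation.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ignores magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short = [1.0, 2.0]
long_same_direction = [10.0, 20.0]  # same direction as `short`, ten times longer
orthogonal = [-2.0, 1.0]            # at a right angle to `short`

print("cosine, same direction:", cosine_similarity(short, long_same_direction))
print("euclidean, same direction:", euclidean_distance(short, long_same_direction))
print("dot, same direction:", dot(short, long_same_direction))
print("cosine, orthogonal:", cosine_similarity(short, orthogonal))
```

The two vectors pointing the same way get the maximum cosine similarity even though they are far apart by Euclidean distance, which is the property that makes cosine a natural fit when only the direction of an embedding carries meaning.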

In [46]:
kg.query("""
    create vector index `form_10k_chunks` if not exists
         for (c: Chunk) on (c.textEmbedding)
         options { indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
         }}
""")

[]

In [47]:
kg.query('show indexes')

[{'id': 4,
  'name': 'form_10k_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-3.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2025, 12, 10, 1, 48, 35, 932000000, tzinfo=<UTC>),
  'readCount': 9},
 {'id': 1,
  'name': 'index_1b9dcc97',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2025, 12, 5, 7, 13, 47, 53000000, tzinfo=<UTC>),
  'readCount': 20},
 {'id': 0,
  'name': 'index_460996c0',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.Dat

## Step 6: Generate and Store Embeddings

This is where the magic happens! We'll calculate embedding vectors for all chunks and store them directly in Neo4j.

### Neo4j GenAI Plugin

The Neo4j GenAI plugin (available for recent Neo4j 5.x releases and enabled by default on Aura) provides functions that call embedding APIs from within Cypher queries:

```cypher
genai.vector.encode(
  text,              // The text to embed
  "OpenAI",          // Provider
  {
    token: $openAiApiKey,
    endpoint: $openAiEndpoint
  }
)
```

### The Complete Query Breakdown:

```cypher
MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
  // Only process chunks without embeddings (idempotent)

WITH chunk, genai.vector.encode(...) AS vector
  // Generate embedding for chunk.text

CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
  // Store the vector as a property on the node
```

### Why Store Embeddings in the Graph?

**Option 1**: External vector DB (Pinecone, Weaviate, etc.)
- Pros: Specialized for vectors, fast
- Cons: Two databases to manage, sync issues

**Option 2**: Store in Neo4j (our approach)
- Pros: Single source of truth, use graph structure + vectors together
- Cons: Slightly slower for pure vector search

**For Graph RAG, option 2 is better!** We need both graph traversal and vector search in the same query.

### Cost Consideration

Embedding 257 chunks with OpenAI ada-002:
- At most ~514K characters total (257 chunks × 2,000 characters), or roughly 130K tokens at ~4 characters per token
- Cost: ~$0.0001 per 1K tokens → roughly $0.01 total
- One-time cost (embeddings are stored)
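
The estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the price per 1K tokens is the figure quoted above, while the ratio of 4 characters per token is an assumption, and `estimate_embedding_cost` is a hypothetical helper.

```python
def estimate_embedding_cost(num_chunks, max_chars_per_chunk,
                            chars_per_token=4.0, usd_per_1k_tokens=0.0001):
    """Upper-bound cost in USD, assuming every chunk is at maximum length."""
    total_chars = num_chunks * max_chars_per_chunk
    total_tokens = total_chars / chars_per_token  # assumed average ratio
    return total_tokens / 1000 * usd_per_1k_tokens


upper_bound_usd = estimate_embedding_cost(num_chunks=257, max_chars_per_chunk=2000)
print(f"Upper bound: ${upper_bound_usd:.4f}")
```

Under these assumptions the whole filing costs on the order of one cent to embed, so the embedding step is unlikely to dominate the budget even if it is re-run a few times during development.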

In [48]:
kg.query("""
    match (chunk: Chunk) where chunk.textEmbedding is null
    with chunk, genai.vector.encode(
        chunk.text,
        "OpenAI",
        {
            token: $openAiApiKey,
            endpoint: $openAiEndpoint
        }
    ) as vector
    call db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
""",
params={"openAiApiKey":OPENAI_API_KEY, "openAiEndpoint": OPENAI_ENDPOINT})

[]

In [49]:
kg.query("""
    MATCH (n)
    WHERE n.chunkId IS NOT NULL AND NOT n:Chunk
    SET n:Chunk
    RETURN count(n) as labeled
""")

[#CC7E]  _: <CONNECTION> error: Failed to read from defunct connection IPv4Address(('p-c43bf5af-d06d-0001.production-orch-1048.neo4j.io', 7687)) (ResolvedIPv4Address(('54.85.127.23', 7687))): OSError('No data')
Transaction failed and will be retried in 0.9544950658835788s (Failed to read from defunct connection IPv4Address(('p-c43bf5af-d06d-0001.production-orch-1048.neo4j.io', 7687)) (ResolvedIPv4Address(('54.85.127.23', 7687))))
[#CC8A]  _: <CONNECTION> error: Failed to read from defunct connection ResolvedIPv4Address(('44.210.31.186', 7687)) (ResolvedIPv4Address(('44.210.31.186', 7687))): OSError('No data')


[{'labeled': 0}]

In [50]:
kg.refresh_schema()
print(kg.schema)

Node properties:
Chunk {chunkId: STRING, names: LIST, formId: STRING, cik: STRING, cusip6: STRING, source: STRING, chunkSeqId: INTEGER, text: STRING, textEmbedding: LIST, f10kItem: STRING}
Relationship properties:

The relationships:



## Step 7: Implement Vector Similarity Search

Now we can query our knowledge graph using natural language questions!

### How Vector Search Works

1. **Encode the question**: Convert user's query into a 1536-dim embedding vector
2. **Calculate similarity**: Compare question embedding to all chunk embeddings
3. **Rank by score**: Return top-k most similar chunks (we use k=10)
4. **Return results**: Get the text and similarity scores
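
The four steps can be sketched as a brute-force search over a toy index. The three-dimensional vectors and the names `toy_index` and `top_k` are made up for illustration; a real vector index typically uses an approximate nearest-neighbor structure rather than comparing against every stored vector.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    numerator = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return numerator / (norm_a * norm_b)


def top_k(question_vector, chunk_vectors, k):
    """Score every chunk against the question and keep the k best (score, id) pairs."""
    scored = ((cosine(question_vector, vector), chunk_id)
              for chunk_id, vector in chunk_vectors.items())
    return heapq.nlargest(k, scored)


# Toy "embeddings": one made-up vector per chunk.
toy_index = {
    "overview": [0.9, 0.1, 0.0],
    "risk": [0.1, 0.9, 0.1],
    "market": [0.0, 0.2, 0.9],
}
question_vector = [1.0, 0.2, 0.0]  # stands in for the encoded question

results = top_k(question_vector, toy_index, k=2)
for score, chunk_id in results:
    print(f"{score:.3f} {chunk_id}")
```

The question vector points mostly along the first axis, so the "overview" chunk ranks first, followed by "risk".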

### The Vector Search Query

```cypher
WITH genai.vector.encode($question, "OpenAI", {...}) AS question_embedding
  // Convert question to embedding vector

CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding)
  // Search the vector index
  YIELD node, score

RETURN score, node.text AS text
  // Return similarity score and text
```

### Understanding Similarity Scores

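Neo4j reports cosine scores normalized to the range 0 to 1. As a rough rule of thumb (calibrate these thresholds on your own data, since score distributions vary by embedding model):
- **0.95 - 1.0**: Extremely similar (likely exact match or paraphrase)
- **0.85 - 0.95**: Very relevant
- **0.75 - 0.85**: Moderately relevant
- **< 0.75**: Weak match (may not be useful)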

### Example Query Walkthrough

**Question**: "In a single sentence, tell me about NetApp"

**What happens**:
1. Question encoded: `[0.002, -0.015, 0.031, ..., 0.008]` (1536 dimensions)
2. Index searched: Compares to all 257 chunk embeddings
3. Top result: Item 1 Business Overview (score: ~0.92)
4. Why? Contains "NetApp, Inc.... is a global cloud-led, data-centric software company"

**Key insight**: We never searched for exact text matches! The system understood the semantic meaning of "tell me about NetApp" and found the business overview section.

In [51]:
def neo4j_vector_search(question):
    """Search for similar nodes using the Neo4j vector index"""
    vector_search_query = """
        with genai.vector.encode(
            $question,
            "OpenAI",
            {
                token: $openAiApiKey,
                endpoint: $openAiEndpoint
            }) as question_embedding
        call db.index.vector.queryNodes($index_name, $top_k, question_embedding) yield node, score
        return score, node.text as text
    """
    similar = kg.query(vector_search_query, 
                       params={
                            'question': question, 
                            'openAiApiKey':OPENAI_API_KEY,
                            'openAiEndpoint': OPENAI_ENDPOINT,
                            'index_name':VECTOR_INDEX_NAME, 
                            'top_k': 10
                       })
    return similar

In [52]:
search_results = neo4j_vector_search('In a single sentence, tell me about NetApp')
search_results[0]['text']

'>Item 1.  \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management, which we term ‘evolved cloud’, provi

## Step 8: Build a Complete RAG System with LangChain

Now we'll integrate everything into an end-to-end question-answering system!

### The RAG Pipeline Architecture

```
User Question
    ↓
[1] Encode question (OpenAI Embeddings)
    ↓
[2] Vector search in Neo4j (retrieve top-k chunks)
    ↓
[3] Pass chunks as context to LLM
    ↓
[4] Generate answer (ChatGPT)
    ↓
Return answer to user
```

### LangChain Components

**Neo4jVector**: Wrapper for Neo4j vector store
- Handles embedding generation
- Performs vector searches
- Returns results as Documents

**Retriever**: Converts vector store into retriever interface
- Standard LangChain interface
- Can be chained with other components

**RetrievalQAWithSourcesChain**: Complete RAG chain
- Takes a question
- Retrieves relevant chunks (via retriever)
- Constructs prompt with context
- Calls LLM (ChatGPT)
- Returns answer + sources

### Why "WithSources"?

The `RetrievalQAWithSourcesChain` provides:
1. **Answer**: The generated response
2. **Sources**: Which chunks were used (provenance)

**Critical for compliance and trust** in financial/legal domains!
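
To show what "constructs prompt with context" means for a stuff-style chain, the sketch below concatenates every retrieved chunk, tagged with its source, in front of the question. The template wording and the helper `build_stuff_prompt` are illustrative assumptions, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved chunk, tagged with its source, ahead of the question."""
    context_blocks = [
        f"Content: {doc['text']}\nSource: {doc['source']}" for doc in documents
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the extracts below. "
        "List the sources you used.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Stand-in retrieval results; the source labels are placeholders.
retrieved = [
    {"text": "NetApp is headquartered in San Jose, California.", "source": "chunk-a"},
    {"text": "NetApp was incorporated in 1992.", "source": "chunk-b"},
]
prompt = build_stuff_prompt("Where is NetApp headquartered?", retrieved)
print(prompt)
```

Because every chunk is placed into a single prompt, this approach is simple but only works while the combined context fits in the model's context window.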

In [53]:
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)

In [54]:
retriever = neo4j_vector_store.as_retriever()

## Step 9: Test the RAG System

Let's ask questions and evaluate the quality of responses!

### Testing Strategy

We'll test three scenarios:
1. **In-scope questions**: About NetApp (should answer correctly)
2. **Out-of-scope questions**: About other companies (tests hallucination control)
3. **Prompt engineering**: How instructions affect behavior

In [55]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), 
    chain_type="stuff", 
    retriever=retriever
)

In [56]:
def prettychain(question: str) -> None:
    """Print the chain's answer to a question, wrapped at 60 characters"""
    # invoke() replaces the deprecated chain(...) call style
    response = chain.invoke({"question": question}, return_only_outputs=True)
    print(textwrap.fill(response['answer'], 60))

In [57]:
question = "What is Netapp's primary business?"
prettychain(question)

NetApp's primary business is enterprise storage and data
management, cloud storage, and cloud operations.


### Test 1: In-Scope Question

**Question**: "What is Netapp's primary business?"

**Expected**: Should retrieve business overview chunk and answer correctly.

In [58]:
prettychain("Where is Netapp headquartered?")

Netapp is headquartered in San Jose, California.


**Result**: Correct! Answered "San Jose, California" from the retrieved context.

In [59]:
prettychain("""
    Tell me about Netapp. 
    Limit your answer to a single sentence.
""")

NetApp is a global cloud-led, data-centric software company
that empowers customers with hybrid multicloud solutions
built for a better future.


In [60]:
prettychain("""
    Tell me about Apple. 
    Limit your answer to a single sentence.
""")

NetApp is a global cloud-led, data-centric software company
that provides organizations the ability to manage and share
their data across on-premises, private, and public clouds.


### Test 2: Out-of-Scope Question (Hallucination Test)

**Question**: "Tell me about Apple."

**Challenge**: Our database only contains NetApp documents. A good RAG system should either:
1. Say "I don't know"
2. Refuse to answer out-of-scope questions

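**What happened?** The system answered as if the question were about NetApp. The retriever always returns the nearest chunks, even when none of them is relevant, and the LLM summarized that context instead of declining. This is a common RAG failure mode.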

In [62]:
prettychain("""
    Tell me about Apple. 
    Limit your answer to a single sentence.
    If you are unsure about the answer, say you don't know.
""")

I don't know.


### Test 3: Prompt Engineering to Control Hallucinations

**Updated question**: "Tell me about Apple. If you are unsure about the answer, say you don't know."

**Result**: Success! The system now responds "I don't know."

### Key Lesson: Prompt Engineering Matters!

**The fix**: Adding explicit instructions to the prompt:
- "If you are unsure, say you don't know"
- "Only answer based on the provided context"
- "Do not use external knowledge"

**Best practices for production RAG systems**:

1. **System prompt engineering**:
```python
system_prompt = """
You are a financial document assistant. 
Only answer questions based on the provided SEC filing context.
If the context doesn't contain the answer, respond: "I don't have information about that in this document."
Never use your general knowledge about companies.
"""
```

2. **Similarity threshold**: Only use chunks above a certain score (e.g., > 0.80)

3. **Confidence scoring**: Return confidence levels with answers

4. **Source attribution**: Always show which chunks were used
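
A similarity threshold (practice 2) can be sketched as a gate in front of the LLM call. The result shape matches the `score` and `text` keys returned by the vector search earlier; `select_context`, `answer_or_decline`, and the stand-in `generate` callable are hypothetical, and 0.80 is the example threshold from the list above rather than a tuned value.

```python
FALLBACK = "I don't have information about that in this document."


def select_context(search_results, min_score=0.80):
    """Keep only hits whose similarity score clears the threshold."""
    return [hit for hit in search_results if hit["score"] >= min_score]


def answer_or_decline(search_results, generate, min_score=0.80):
    """Call the generator only when at least one hit is relevant enough."""
    context = select_context(search_results, min_score)
    if not context:
        return FALLBACK
    return generate(context)


def fake_generate(context):
    """Stand-in for the LLM call: reports how many chunks it was given."""
    return f"answer based on {len(context)} chunk(s)"


strong_hits = [{"score": 0.91, "text": "NetApp overview"},
               {"score": 0.62, "text": "unrelated passage"}]
weak_hits = [{"score": 0.71, "text": "NetApp overview"}]

print(answer_or_decline(strong_hits, fake_generate))
print(answer_or_decline(weak_hits, fake_generate))
```

Declining before the LLM is called complements the prompt instruction: the model never sees weakly related context that it might otherwise summarize.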

**Why this matters in production**:
- Legal liability (giving wrong financial advice)
- Regulatory compliance (SEC, FINRA requirements)
- User trust (transparency about limitations)

---

# Summary: Building Production-Ready Graph RAG Systems

## What We Built

A complete **Graph RAG pipeline** that:
1. Ingests SEC 10-K filings (unstructured text)
2. Chunks documents with overlap for context preservation
3. Creates graph nodes with rich metadata
4. Generates embeddings using OpenAI
5. Indexes vectors for similarity search
6. Builds a Q&A system with LangChain
7. Controls hallucinations through prompt engineering

## The Graph RAG Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    User Question                        │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ↓
┌─────────────────────────────────────────────────────────┐
│         1. Embed Question (OpenAI ada-002)              │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ↓
┌─────────────────────────────────────────────────────────┐
│    2. Vector Search in Neo4j (Cosine Similarity)        │
│       - Search 257 chunk embeddings                     │
│       - Return top-10 most similar                      │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ↓
┌─────────────────────────────────────────────────────────┐
│    3. Graph Traversal (Future Enhancement)              │
│       - Filter by metadata (form section, company)      │
│       - Follow relationships (NEXT_CHUNK, etc.)         │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ↓
┌─────────────────────────────────────────────────────────┐
│    4. LLM Generation (ChatGPT with Context)             │
│       - Pass chunks as context                          │
│       - Generate answer                                 │
│       - Return sources                                  │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ↓
                   Answer
```
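
The flow in the diagram can be sketched as a single function with pluggable components. This is a simplified illustration, not the LangChain chain used in the notebook: `answer_question` and the `fake_*` stand-ins are hypothetical names, and step 3 (graph traversal) is omitted because it is a future enhancement.

```python
from typing import Callable, List, Sequence, Tuple


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[Tuple[str, str]]],
    generate: Callable[[str, str], str],
    top_k: int = 10,
) -> dict:
    """Run embed -> vector search -> generate and return the answer with sources."""
    query_vector = embed(question)                   # step 1: embed the question
    hits = search(query_vector, top_k)               # step 2: [(text, source), ...]
    context = "\n\n".join(text for text, _ in hits)
    answer = generate(question, context)             # step 4: LLM generation
    sources = [source for _, source in hits]
    return {"answer": answer, "sources": sources}


# Stand-in components so the sketch runs without a database or API key.
def fake_embed(text):
    return [float(len(text)), 1.0]


def fake_search(vector, k):
    corpus = [("chunk A text", "doc#0"), ("chunk B text", "doc#1")]
    return corpus[:k]


def fake_generate(question, context):
    n_chunks = context.count("\n\n") + 1
    return f"Answer based on {n_chunks} chunk(s)."


result = answer_question(
    "What does the company do?", fake_embed, fake_search, fake_generate
)
```

Passing the components in as callables keeps each stage replaceable, for example swapping the stand-ins for a real embedding model and a Neo4j vector query.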

## Key Concepts Covered

### 1. Document Chunking
- **Challenge**: Balance context vs. embedding limits
- **Solution**: 2000 chars with 200 char overlap
- **Trade-off**: Smaller chunks = more precise, larger = more context
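
To make the overlap concrete, here is a simplified fixed-window splitter. It is a stand-in for illustration, not the `RecursiveCharacterTextSplitter` imported at the top of the notebook, which additionally prefers to break on separators such as paragraphs. The `chunk_text` name is hypothetical.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "x" * 5000
chunks = chunk_text(sample)
lengths = [len(c) for c in chunks]
```

With these defaults each chunk starts 1800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.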

### 2. Graph Node Design
- **Pattern**: Use MERGE for idempotent operations
- **Constraints**: Enforce uniqueness on chunkId
- **Metadata**: Store CIK, CUSIP, section, source for filtering
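
The MERGE pattern and its idempotency can be sketched as follows. The Cypher strings are one plausible form, assuming Neo4j 5 syntax; apart from `chunkId`, the property names, the `chunk_params` helper, and the `FakeGraph` class are illustrative assumptions. `FakeGraph` only mimics the `MERGE ... ON CREATE SET` behavior in memory so the example runs without a database.

```python
# Assumed Cypher: create the node only if no Chunk with this chunkId exists.
MERGE_CHUNK_QUERY = """
MERGE (c:Chunk {chunkId: $chunkParam.chunkId})
ON CREATE SET
    c.text = $chunkParam.text,
    c.source = $chunkParam.source,
    c.cik = $chunkParam.cik
RETURN c
"""

# Assumed Cypher: enforce uniqueness on chunkId.
UNIQUE_CHUNK_CONSTRAINT = """
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
"""


def chunk_params(chunk_id, text, source, cik):
    """Build the parameter map referenced as $chunkParam in the query."""
    return {"chunkParam": {"chunkId": chunk_id, "text": text,
                           "source": source, "cik": cik}}


class FakeGraph:
    """In-memory stand-in that mimics MERGE ... ON CREATE SET semantics."""

    def __init__(self):
        self.nodes = {}

    def merge_chunk(self, params):
        chunk = params["chunkParam"]
        # setdefault stores the properties only the first time the key is seen
        return self.nodes.setdefault(chunk["chunkId"], dict(chunk))


graph = FakeGraph()
params = chunk_params("doc-item1-chunk0000", "Some text", "doc", "1002047")
graph.merge_chunk(params)
graph.merge_chunk(params)  # second call finds the existing node and changes nothing
```

Against a real database, the same parameters would be passed along with `MERGE_CHUNK_QUERY` to the graph client's query method, and rerunning the load would not create duplicates.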

### 3. Vector Embeddings
- **Model**: OpenAI text-embedding-ada-002 (1536 dimensions)
- **Storage**: In-graph (single source of truth)
- **Index**: Cosine similarity for text
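
For reference, cosine similarity compares the direction of two vectors and ignores their length. The plain-Python sketch below illustrates the formula only; the `cosine_similarity` helper is hypothetical, and a vector index computes this internally.

```python
import math


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score 1, unrelated (orthogonal) vectors score 0, and opposite vectors score -1, regardless of magnitude.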

### 4. Hallucination Control
- **Problem**: The LLM answers even when the retrieved context does not cover the question
- **Solution**: Explicit prompt instructions ("say you don't know")
- **Production**: Add similarity thresholds, confidence scores

## Current Limitations & Future Enhancements

### Limitations:
1. **No entity extraction**: We only have text chunks, not structured entities
2. **No relationships**: Chunks are isolated (no NEXT_CHUNK, REFERENCES, etc.)
3. **Single document**: Only one company's 10-K form
4. **Basic retrieval**: Pure vector search without graph traversal

### Next Steps (Graph RAG Part 2):

#### 1. Entity Extraction
Extract entities from text and create dedicated nodes:
```cypher
MERGE (:Company {name: "NetApp", cik: "1002047"})
MERGE (:Person {name: "CEO Name"})
MERGE (:Product {name: "Cloud Storage"})
MERGE (:Risk {category: "Cybersecurity"})
```

#### 2. Relationship Modeling
Connect entities with meaningful relationships:
```cypher
(:Company)-[:DISCLOSED_RISK]->(:Risk)
(:Company)-[:OFFERS_PRODUCT]->(:Product)
(:Person)-[:LEADS]->(:Company)
(:Chunk)-[:MENTIONS]->(:Company)
(:Chunk)-[:NEXT_CHUNK]->(:Chunk)
```
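
As one example, `NEXT_CHUNK` links can be derived from chunk order before any LLM-based extraction. The sketch below assumes each chunk record carries a section name and a sequence number. The `section` and `seq` field names, the `next_chunk_pairs` helper, and the `LINK_QUERY` string are hypothetical.

```python
from itertools import groupby

# Assumed Cypher for writing the links; $pairs is a list of {src, dst} maps.
LINK_QUERY = """
UNWIND $pairs AS pair
MATCH (a:Chunk {chunkId: pair.src}), (b:Chunk {chunkId: pair.dst})
MERGE (a)-[:NEXT_CHUNK]->(b)
"""


def next_chunk_pairs(chunks):
    """Return {src, dst} maps linking consecutive chunks within each section."""
    ordered = sorted(chunks, key=lambda c: (c["section"], c["seq"]))
    pairs = []
    for _, group in groupby(ordered, key=lambda c: c["section"]):
        members = list(group)
        # zip pairs each chunk with its successor in the same section
        for current, following in zip(members, members[1:]):
            pairs.append({"src": current["chunkId"], "dst": following["chunkId"]})
    return pairs


chunks = [
    {"chunkId": "item1-chunk0001", "section": "item1", "seq": 1},
    {"chunkId": "item1-chunk0000", "section": "item1", "seq": 0},
    {"chunkId": "item1a-chunk0000", "section": "item1a", "seq": 0},
]
pairs = next_chunk_pairs(chunks)
```

Chunks from different sections are never linked, so the last chunk of one section does not point at the first chunk of the next.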

#### 3. Multi-Document Knowledge Graph
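Load multiple companies' 10-K forms so that cross-company relationships can be modeled (the pairs below are illustrative, not extracted from any filing):
```cypher
(:Company {name: "NetApp"})-[:COMPETITOR_OF]->(:Company {name: "Dell"})
(:Company {name: "NetApp"})-[:PARTNER_WITH]->(:Company {name: "AWS"})
(:Company {name: "NetApp"})-[:INVESTED_IN]->(:Company {name: "Startup"})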
```

#### 4. Advanced Queries
Combine vector search + graph traversal:
```cypher
// Find competitors of NetApp who disclosed AI risks
MATCH (netapp:Company {name: "NetApp"})
MATCH (netapp)-[:COMPETITOR_OF]->(competitor)
MATCH (competitor)-[:DISCLOSED_RISK]->(risk:Risk)
WHERE risk.category CONTAINS "AI"
MATCH (chunk:Chunk)-[:MENTIONS]->(risk)
// Then do vector search on those chunks
CALL db.index.vector.queryNodes(...)
```
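
One way to combine the two result sets is in application code: run the graph query to collect the chunk IDs of interest, run the vector search separately, then keep only the hits that appear in both. The sketch below assumes vector hits arrive as chunk ID and score pairs. The `hybrid_filter` name and the sample IDs are hypothetical.

```python
def hybrid_filter(vector_hits, allowed_chunk_ids, top_k=3):
    """Keep the best-scoring vector hits whose chunk ID the graph query allowed."""
    allowed = set(allowed_chunk_ids)
    ranked = sorted(vector_hits, key=lambda hit: hit[1], reverse=True)
    return [(chunk_id, score) for chunk_id, score in ranked
            if chunk_id in allowed][:top_k]


# Made-up inputs: scores from a vector search, IDs from a graph traversal
vector_hits = [("c1", 0.92), ("c2", 0.88), ("c3", 0.75), ("c4", 0.70)]
graph_chunk_ids = ["c2", "c4", "c9"]
results = hybrid_filter(vector_hits, graph_chunk_ids)
```

Because filtering happens after the vector search, the search may need to request more neighbors than `top_k` so that enough hits survive the graph filter.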

#### 5. Temporal Analysis
Track changes over time:
```cypher
// Schema: one FILED relationship per annual filing
// (:Company)-[:FILED {year: 2023}]->(:Form10K)
// (:Company)-[:FILED {year: 2024}]->(:Form10K)

// Compare risk disclosures year-over-year
MATCH (c:Company)-[f1:FILED {year: 2023}]->(form1)
MATCH (c)-[f2:FILED {year: 2024}]->(form2)
MATCH (form1)<-[:PART_OF]-(chunk1:Chunk {section: "risks"})
MATCH (form2)<-[:PART_OF]-(chunk2:Chunk {section: "risks"})
// Vector search to find new risks in 2024
```

## Real-World Applications

### 1. Investment Research
**Query**: "Which companies in the cloud storage sector have disclosed supply chain risks?"
- Vector search for "supply chain risks"
- Graph traversal to find companies in "cloud storage" sector
- Filter by disclosure year

### 2. Compliance Monitoring
**Query**: "Has NetApp's cybersecurity risk disclosure changed since last year?"
- Fetch 2023 and 2024 risk sections
- Vector similarity between years
- Flag significant changes
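
The comparison step could look like the sketch below: for each chunk embedding from the newer filing, find its best match in the older filing and flag it when even the best match is weak. This assumes embeddings are already available as lists of floats. The `flag_new_disclosures` helper, the 0.8 threshold, and the toy 2-dimensional vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_new_disclosures(prev_vectors, curr_vectors, threshold=0.8):
    """Return indices of current-year chunks with no close match last year."""
    flagged = []
    for index, current in enumerate(curr_vectors):
        best = max((cosine(current, prev) for prev in prev_vectors), default=0.0)
        if best < threshold:
            flagged.append(index)
    return flagged


# Toy 2-dimensional vectors standing in for real embeddings
prev_year = [[1.0, 0.0], [0.0, 1.0]]
curr_year = [[1.0, 0.1], [-1.0, -1.0]]
changed = flag_new_disclosures(prev_year, curr_year)
```

Here the first current-year chunk closely matches a prior-year chunk and passes, while the second has no close counterpart and is flagged for review.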

### 3. Competitive Intelligence
**Query**: "What new products has NetApp mentioned that Dell hasn't?"
- Extract product entities from both companies
- Compare using set difference
- Return chunks mentioning unique products
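
The set-difference step is simple once product entities have been extracted. This sketch assumes extraction has already produced a list of product names per company. The `unique_products` helper and the placeholder product names are hypothetical.

```python
def unique_products(products_a, products_b):
    """Return products mentioned by company A but not by company B."""
    def normalize(name):
        return name.strip().lower()

    seen_b = {normalize(name) for name in products_b}
    # Compare case-insensitively, but return the names as company A wrote them
    return sorted({name.strip() for name in products_a
                   if normalize(name) not in seen_b})


# Placeholder names, not taken from any filing
company_a = ["Product A", "product b ", "Product C"]
company_b = ["PRODUCT B"]
only_in_a = unique_products(company_a, company_b)
```

Normalizing case and whitespace before comparing avoids treating the same product as unique merely because the two filings format its name differently.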

### 4. Due Diligence
**Query**: "Show all companies connected to NetApp with investments over $50M and disclosed risks"
- Graph traversal: (NetApp)-[:INVESTED_IN]->(Portfolio Company)
- Filter: investment amount > $50M
- Vector search in portfolio company filings for risks

## Production Checklist

Before deploying a Graph RAG system:

- [ ] **Data quality**: Clean, validated SEC filings
- [ ] **Chunking strategy**: Tested for your document types
- [ ] **Embedding model**: Chosen based on cost/performance
- [ ] **Vector index tuning**: Right similarity function and dimensions
- [ ] **Prompt engineering**: System prompts prevent hallucinations
- [ ] **Similarity thresholds**: Filter low-relevance results
- [ ] **Source attribution**: Always return provenance
- [ ] **Error handling**: Graceful failures for API errors
- [ ] **Monitoring**: Track query latency, relevance scores
- [ ] **Compliance**: Legal review for financial advice disclaimers
- [ ] **Security**: API key management, access controls
- [ ] **Scalability**: Test with full dataset (1000+ forms)

## Cost Analysis

For 1000 SEC 10-K forms:

| Component | Volume | Cost |
|-----------|--------|------|
| **Embedding generation** | ~100M chars | ~$10 (one-time) |
| **Vector storage** | 1536-dim × 300K chunks | Neo4j license |
| **Query embeddings** | 10K queries/month | ~$1/month |
| **LLM generation** | 10K queries × 1K tokens | ~$200/month |

**Total**: ~$10 one-time setup + ~$200/month in API costs for moderate usage (Neo4j licensing not included)
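
The arithmetic behind the table can be reproduced with a few lines. The figures below are the table's own estimates rather than quoted vendor prices. The `summarize_costs` helper is hypothetical, and the per-token price is derived from the table instead of being looked up.

```python
def summarize_costs(setup_costs, monthly_costs, months=12):
    """Return (setup_total, monthly_total, total over the given months)."""
    setup_total = sum(setup_costs.values())
    monthly_total = sum(monthly_costs.values())
    return setup_total, monthly_total, setup_total + months * monthly_total


# Estimates copied from the table above (USD); Neo4j licensing is excluded
setup_costs = {"embedding_generation": 10.0}
monthly_costs = {"query_embeddings": 1.0, "llm_generation": 200.0}

queries_per_month = 10_000
tokens_per_query = 1_000
total_tokens = queries_per_month * tokens_per_query

# Unit price implied by the table: monthly LLM cost per 1K tokens
implied_price_per_1k = monthly_costs["llm_generation"] / (total_tokens / 1_000)

setup_total, monthly_total, first_year = summarize_costs(setup_costs, monthly_costs)
```

The table's LLM row implies roughly $0.02 per 1K tokens, so the monthly total is dominated by generation and scales almost linearly with query volume and context size.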

## Resources & Further Reading

### Neo4j & Graph Databases
- [Neo4j Cypher Manual](https://neo4j.com/docs/cypher-manual/)
- [Vector Search in Neo4j](https://neo4j.com/docs/cypher-manual/current/indexes-for-vector-search/)
- [GenAI Plugin Documentation](https://neo4j.com/docs/genai-plugin/)

### LangChain & RAG
- [LangChain Neo4j Integration](https://python.langchain.com/docs/integrations/vectorstores/neo4jvector/)
- [RAG Best Practices](https://www.anthropic.com/index/retrieval-augmented-generation)

### SEC Filings & Financial Data
- [SEC EDGAR Database](https://www.sec.gov/edgar/search/)
- [Understanding 10-K Forms](https://www.sec.gov/files/reada10k.pdf)
- [XBRL for Structured Data](https://www.sec.gov/structureddata/osd-inline-xbrl.html)

### Graph RAG Research
- [Microsoft GraphRAG](https://github.com/microsoft/graphrag)
- [Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering](https://arxiv.org/abs/2306.04136)

---

## Next Tutorial: Entity Extraction & Multi-Hop Queries

In the next part of this series, we'll:
1. Use LLMs to extract entities (companies, people, products, risks)
2. Build a multi-document knowledge graph
3. Implement hybrid queries (vector search + graph traversal)
4. Answer complex questions like: "Which firms connected to Palo Alto Networks have invested over $50M and disclosed cybersecurity risks?"

**Stay tuned!**

---

**Congratulations!** You've built the foundation of a Graph RAG system from scratch: text chunks stored as graph nodes and searchable by vector similarity. You now understand:
- Document chunking strategies
- Graph node modeling with Cypher
- Vector embeddings and similarity search
- Retrieval augmented generation with LangChain
- Hallucination control through prompt engineering

This foundation prepares you for building advanced Graph RAG applications in finance, legal, healthcare, and beyond!