# Lesson 8 - Knowledge Graph Construction - Part II

In this lesson, you'll continue with knowledge graph construction. The previous lesson created the domain
graph from CSV files according to the construction plan. Now, you will process the markdown files,
chunking them up into the lexical graph and the subject graph which will connect to the domain graph
for a complete knowledge graph. 

You will learn:
- how to use Neo4j's graphrag library to perform the chunking and entity extraction
- techniques for entity resolution
  

<img src="images/last.png" width="600">

**Note**: This notebook uses Cypher queries to inspect and connect the graphs. Don't worry if you're unfamiliar with Cypher; focus on understanding the big picture of how the unstructured data is transformed into a graph structure based on the extraction plan.

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>To access the helper.py, neo4j_for_adk.py and tools.py files:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>.</p>

</div>

## 8.1 Tools

Two tools, with helper functions:
  1. `make_kg_builder` - to chunk markdown and produce the lexical and subject graphs
  2. `correlate_subject_and_domain_nodes` - to connect the subject graph to the domain graph
- Input: `approved_files`, `approved_construction_plan`, `approved_entities`, `approved_fact_types`
- Output: a completed knowledge graph with domain, lexical and subject graphs
  
**Workflow**

1. The context is initialized with an `approved_construction_plan` and `approved_files`
2. For each markdown file, `make_kg_builder` is called to create a construction pipeline
3. For each resulting entity label, `correlate_subject_and_domain_nodes` will connect the subject and domain graphs


## 8.2 Setup

The usual import of needed libraries, loading of environment variables, and connection to Neo4j.

### 8.2.1 Common Setup

### 8.2.2 Load Part of the Domain Graph with a Helper Function

You're only loading the product nodes from the domain graph, because they're the only nodes that you'll use to connect the domain graph to the subject graph.

### 8.2.3 Initialize State from Previous Workflow

## 8.3 Tool Definitions for loading, chunking and entity extraction

The Neo4j GraphRAG library has a convenient `SimpleKGPipeline` which you can use to process chunks and extract entities with relationships.

For the markdown files you will be processing, you'll need to create some helper functions.

### 8.3.1 SimpleKGPipeline Interface

<img src="images/KG_pipeline.png" width="600">

### 8.3.2 Text-Splitter for Chunking up the Markdown

Define a custom text splitter that uses regex patterns to chunk markdown text. This splitter breaks documents at specified delimiters (like "---") to create meaningful text segments for processing.
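As a rough illustration of the splitting step, here is a minimal standard-library sketch. The function name `split_markdown` and the default delimiter pattern are assumptions made for this example; in the notebook, equivalent logic would sit inside a splitter component that the GraphRAG pipeline accepts.

```python
import re


def split_markdown(text: str, delimiter_pattern: str = r"^---[ \t]*$") -> list[str]:
    """Split markdown at lines that match the delimiter, dropping empty chunks."""
    parts = re.split(delimiter_pattern, text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample document with two "---" delimiters.
sample = "# Title\n\nIntro text.\n---\nFirst review.\n---\nSecond review.\n"
chunks = split_markdown(sample)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

Splitting on a delimiter the authors already placed in the file tends to keep each chunk aligned with one logical unit (for example, one review), which is usually easier for the LLM to process than fixed-size windows.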

### 8.3.3 Custom Markdown Data Loader

This custom loader adapts the Neo4j GraphRAG PDF loader to work with markdown files. It reads markdown content, extracts the document title from the first H1 header, and wraps it in the expected document format for the pipeline.
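The title-extraction part of such a loader can be sketched with the standard library alone. The names `extract_title` and `load_markdown`, and the plain dictionary returned here, are assumptions for illustration; the notebook's loader would return the document type expected by the pipeline instead.

```python
import re
from pathlib import Path


def extract_title(markdown_text: str, fallback: str = "Untitled") -> str:
    """Return the text of the first H1 header ("# ..."), or the fallback."""
    match = re.search(r"^# (.+)$", markdown_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else fallback


def load_markdown(path) -> dict:
    """Read a markdown file and wrap it in a simple stand-in document structure."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    return {
        "text": text,
        "metadata": {"title": extract_title(text, fallback=path.stem), "path": str(path)},
    }


print(extract_title("Some intro\n# Product Reviews\n## Details"))
```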

### 8.3.4 Set up LLM, Embedder and Neo4j Driver

Initialize the core components needed for the Neo4j GraphRAG pipeline: an OpenAI LLM for entity extraction, an embeddings model for vectorizing text chunks, and the Neo4j database driver for graph storage.

### 8.3.5 Entity Schema

Use the approved entity types from the previous workflow as the allowed node types for entity extraction. This constrains the LLM to only extract entities of these specific types.

Transform the approved fact types into relationship types by extracting the predicate labels and converting them to uppercase format for the schema.

Create relationship patterns by converting fact types into tuples that specify allowed relationships between specific node types (subject-predicate-object patterns).

Assemble the complete entity schema dictionary that will guide the LLM's entity extraction, combining node types, relationship types, and patterns into a single configuration.
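The four schema steps above can be sketched as follows. The shape of `approved_fact_types` (a dictionary whose values carry `subject_label`, `predicate_label`, and `object_label`), the sample values, and the schema key names are all assumptions for this example; check them against the state produced by the previous workflow and the library version you use.

```python
# Hypothetical state from the previous workflow (sample values only).
approved_entities = ["Product", "Issue", "Feature"]
approved_fact_types = {
    "has_issue": {
        "subject_label": "Product",
        "predicate_label": "has_issue",
        "object_label": "Issue",
    },
    "includes_feature": {
        "subject_label": "Product",
        "predicate_label": "includes_feature",
        "object_label": "Feature",
    },
}

# 1. Allowed node types are the approved entity types.
node_types = list(approved_entities)

# 2. Relationship types are the upper-cased predicate labels.
relationship_types = [
    fact["predicate_label"].upper() for fact in approved_fact_types.values()
]

# 3. Patterns are (subject, PREDICATE, object) tuples.
patterns = [
    (fact["subject_label"], fact["predicate_label"].upper(), fact["object_label"])
    for fact in approved_fact_types.values()
]

# 4. Combine everything into one schema dictionary (key names are assumed).
entity_schema = {
    "node_types": node_types,
    "relationship_types": relationship_types,
    "patterns": patterns,
}
print(entity_schema)
```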

### 8.3.6 Contextualized Entity Extraction Prompt

This helper function extracts the first few lines from a file to provide context for entity extraction. This context helps the LLM better understand the document structure and content when processing individual chunks.
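A minimal sketch of such a helper is shown here. The name `file_context` and the default of five lines are assumptions; the demo writes a temporary file so the example runs on its own.

```python
import tempfile
from itertools import islice
from pathlib import Path


def file_context(path, num_lines: int = 5) -> str:
    """Return the first `num_lines` lines of a file as a single string."""
    with open(path, encoding="utf-8") as f:
        return "".join(islice(f, num_lines))


# Demo with a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "review.md"
    demo_path.write_text("# Title\nline 2\nline 3\nline 4\n", encoding="utf-8")
    context = file_context(demo_path, num_lines=2)

print(context)
```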

This function creates a contextualized prompt template for entity and relationship extraction. It combines general extraction instructions with file-specific context to improve the accuracy of the LLM's entity recognition on each text chunk.
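One way to picture the contextualized prompt is as a template string that embeds the file context and leaves placeholders for the schema and the chunk text. The wording of the instructions, the function name, and the `{schema}` and `{text}` placeholder names are assumptions for this sketch, not the notebook's exact prompt.

```python
def contextualize_er_extraction_prompt(context: str) -> str:
    """Build a prompt template that embeds file context and keeps two placeholders."""
    general_instructions = (
        "You are extracting entities (nodes) and relationships from text "
        "to build a knowledge graph. Only use the node and relationship "
        "types listed in the schema."
    )
    # Escape braces in the context so a later str.format() call leaves them intact.
    safe_context = context.replace("{", "{{").replace("}", "}}")
    return (
        general_instructions
        + "\n\nContext from the start of the source file:\n"
        + safe_context
        + "\n\nSchema:\n{schema}\n\nInput text:\n{text}\n"
    )


template = contextualize_er_extraction_prompt("# Reviews of the {Uppsala} sofa")
filled = template.format(schema="Product, Issue", text="The cushions sag.")
print(filled)
```

Escaping braces in the context matters because markdown files can contain literal `{` or `}` characters, which would otherwise be read as template placeholders.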

## 8.4 Make and Use the Knowledge Graph (KG) builder

### 8.4.1 Make the Neo4j KG Builder Pipeline

This function creates a customized KG builder pipeline for a specific file by extracting file context and creating a contextualized extraction prompt. It combines all the previously defined components (loader, splitter, schema, LLM) into a complete pipeline.

Process each approved markdown file by creating a KG builder pipeline and running it asynchronously. This extracts entities and relationships from the text chunks and stores them in the Neo4j database as the subject graph.
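The per-file loop can be sketched as below. `StubBuilder` is a hypothetical stand-in so that the example runs without a database or an LLM; the assumption is that the real builder exposes an awaitable `run_async(file_path=...)` method.

```python
import asyncio


class StubBuilder:
    """Hypothetical stand-in for a KG builder pipeline created per file."""

    def __init__(self, file_name: str):
        self.file_name = file_name

    async def run_async(self, file_path: str) -> str:
        # A real pipeline would load, chunk, extract, and write to Neo4j here.
        return f"processed {file_path}"


async def build_subject_graph(file_names, make_builder) -> dict:
    """Create one builder per file and run the builders one after another."""
    results = {}
    for file_name in file_names:
        kg_builder = make_builder(file_name)
        results[file_name] = await kg_builder.run_async(file_path=file_name)
    return results


results = asyncio.run(build_subject_graph(["a.md", "b.md"], StubBuilder))
print(results)
```

Inside a notebook, where an event loop is already running, you would typically write `await build_subject_graph(...)` in a cell rather than calling `asyncio.run`.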

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> The output generated by LLMs can vary with each execution due to their stochastic nature. Your results might differ from those shown in the video.</p>

## 8.5 Tool Definition for Entity Resolution


Connect entities in the subject graph to entities in the domain graph.

For each type of entity in the subject graph, you will devise a strategy for correlating
with the right node in the domain graph. 

For example, you should expect that Products with product names exist in the subject graph,
and that these should correlate with products in the domain graph.

To do this, you will:
1. find the unique entity labels in the subject graph
2. find the unique node labels in the domain graph
3. attempt to correlate property keys
4. perform entity resolution by analyzing the similarity of property values

### 8.5.1 Unique Entity Labels in the Subject Graph

The unique triples of (subject, predicate, object) will give you an idea about what the subject graph looks like.

Query the Neo4j database to find all nodes that have the `__Entity__` label (entities created by the knowledge graph builder) and return their distinct label combinations.

Flatten the label arrays into individual label strings using UNWIND, which transforms the array of labels into separate rows for each label.

Filter out internal labels that start with double underscores ("__"), such as `__Entity__`, which the knowledge graph builder adds, to focus only on the meaningful entity type labels extracted from the text.

Combine the previous query steps into a reusable function that returns all unique entity labels from the subject graph, excluding these internal labels.

Test the function to see what entity labels were actually extracted from the processed markdown files into the subject graph.
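The query steps described in this section could be combined roughly as follows. The Cypher text is a sketch of the approach rather than the notebook's exact query, and `unique_entity_labels` is a hypothetical pure-Python mirror of the same logic that lets you check the behaviour without a database.

```python
# Sketch of the combined Cypher query (distinct label sets, UNWIND, then filter).
UNIQUE_ENTITY_LABELS_QUERY = """
MATCH (n:`__Entity__`)
WITH DISTINCT labels(n) AS entity_labels
UNWIND entity_labels AS entity_label
WITH entity_label
WHERE NOT entity_label STARTS WITH '__'
RETURN collect(DISTINCT entity_label) AS unique_entity_labels
"""


def unique_entity_labels(label_lists) -> list[str]:
    """Mirror the query in Python: dedupe label sets, flatten, drop '__' labels."""
    distinct_sets = {tuple(labels) for labels in label_lists}
    flattened = (label for labels in distinct_sets for label in labels)
    return sorted({label for label in flattened if not label.startswith("__")})


# Hypothetical label lists as they might come back from `labels(n)`.
sample_labels = [
    ["__KGBuilder__", "__Entity__", "Product"],
    ["__KGBuilder__", "__Entity__", "Issue"],
    ["__KGBuilder__", "__Entity__", "Product"],
]
print(unique_entity_labels(sample_labels))
```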

### 8.5.2 Unique Entity Keys for a Given Label

Create a function to find all unique property keys for entities of a specific label in the subject graph. This helps identify what properties are available for matching with domain graph nodes.

### 8.5.3 Unique Domain keys for a Given Label

Create a function to find unique property keys for nodes of a specific label in the domain graph (nodes without the `__Entity__` label). This enables comparison with subject graph properties for entity resolution.
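The subject-graph and domain-graph key lookups differ only in how they treat the `__Entity__` label, which the sketch below makes explicit. The function names are assumptions, and the label is interpolated into the query text (after a basic validity check) because this sketch avoids relying on dynamic-label syntax.

```python
import re


def _checked_label(label: str) -> str:
    """Allow only simple identifiers, since the label is placed into query text."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"unexpected label: {label!r}")
    return label


def subject_keys_query(label: str) -> str:
    """Cypher for unique property keys on subject-graph nodes (have `__Entity__`)."""
    label = _checked_label(label)
    return (
        f"MATCH (n:`{label}`:`__Entity__`)\n"
        "UNWIND keys(n) AS key\n"
        "RETURN collect(DISTINCT key) AS unique_keys"
    )


def domain_keys_query(label: str) -> str:
    """Cypher for unique property keys on domain-graph nodes (no `__Entity__`)."""
    label = _checked_label(label)
    return (
        f"MATCH (n:`{label}`)\n"
        "WHERE NOT n:`__Entity__`\n"
        "UNWIND keys(n) AS key\n"
        "RETURN collect(DISTINCT key) AS unique_keys"
    )


print(subject_keys_query("Product"))
print(domain_keys_query("Product"))
```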

### 8.5.4 Normalize keys

This is a simple version of "stemming" as done in NLP.

Define a function to normalize property key names by removing label prefixes, converting to lowercase, and standardizing spacing. This helps match similar property keys that may have different naming conventions between subject and domain graphs.
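A minimal sketch of such a normalizer follows. The name `normalize_key` and the exact rules (camelCase splitting, treating underscores and hyphens as spaces) are assumptions for illustration and may differ from the notebook's version.

```python
import re


def normalize_key(label: str, key: str) -> str:
    """Normalize a property key: split camelCase, lowercase, drop the label prefix."""
    # Split camelCase so "productName" becomes "product Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    # Lowercase and collapse underscores, hyphens, and whitespace into single spaces.
    cleaned = re.sub(r"[_\-\s]+", " ", spaced.lower()).strip()
    # Remove the label only when it is a whole leading word.
    prefix = label.lower() + " "
    if cleaned.startswith(prefix):
        cleaned = cleaned[len(prefix):]
    return cleaned


for key in ["product_name", "Product Name", "productName", "production_date"]:
    print(key, "->", normalize_key("Product", key))
```

Removing the prefix only when it is a whole word avoids mangling keys such as `production_date`, which merely begin with the same letters as the label.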

### 8.5.5 Correlate Keys for a Given Label

Use fuzzy string matching to find correlations between subject graph property keys and domain graph property keys. This function compares normalized key names and returns matches above a similarity threshold, helping identify which properties can be used for entity resolution.
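The key correlation step can be sketched with the standard library's `difflib.SequenceMatcher` as the similarity measure. This is a swapped-in technique chosen to keep the example dependency-free; the notebook may use a different fuzzy-matching library, and the function name, threshold, and sample keys are assumptions.

```python
from difflib import SequenceMatcher


def correlate_keys(entity_keys, domain_keys, threshold: float = 0.9):
    """Return (entity_key, domain_key, score) tuples scoring at or above the threshold.

    Keys are expected to be normalized already; scores range from 0.0 to 1.0.
    """
    matches = []
    for entity_key in entity_keys:
        for domain_key in domain_keys:
            score = SequenceMatcher(None, entity_key, domain_key).ratio()
            if score >= threshold:
                matches.append((entity_key, domain_key, score))
    # Best matches first.
    return sorted(matches, key=lambda match: match[2], reverse=True)


# Hypothetical normalized keys from each graph.
correlated = correlate_keys(["name", "description"], ["name", "price", "descriptions"])
for entity_key, domain_key, score in correlated:
    print(f"{entity_key} ~ {domain_key}: {score:.2f}")
```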

### 8.5.6 Value Similarity using Jaro–Winkler Distance

The Jaro–Winkler distance is a string comparison method, emphasizing common prefixes to favor strings that match from the start. 

- measures "edit distance" between two strings
- produces values from 0.0 (exact match) to 1.0 (no similarity)
- use `similarity = 1.0 - distance` to get a similarity score

See [Jaro–Winkler distance](https://en.wikipedia.org/wiki/Jaro–Winkler_distance) for details.
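To build intuition for the numbers, here is a small pure-Python reference sketch of the textbook Jaro–Winkler formula, with the usual prefix scale of 0.1 and a maximum prefix length of 4. It is meant for experimentation only and may differ from the database function in edge cases.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    # Count characters that match within the sliding window.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro similarity for strings sharing a prefix of up to 4 characters."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    """Distance is the complement of similarity: 0.0 means an exact match."""
    return 1.0 - jaro_winkler_similarity(s1, s2)


print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))
print(jaro_winkler_distance("chair", "chair"))
```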

Ideally, you would sample a few values that you expect to correlate well, trying different similarity metrics
to find one that works well for that particular value pair. 

Neo4j's APOC library provides many [text similarity functions](https://neo4j-contrib.github.io/neo4j-apoc-procedures/3.4/utilities/text-functions/).
Other options include:
- `apoc.text.hammingDistance`
- `apoc.text.levenshteinSimilarity`
- `apoc.text.sorensenDiceSimilarity`
- `apoc.text.fuzzyMatch`

And for vector similarity:
- `vector.similarity.cosine` to directly calculate cosine similarity
- `db.index.vector.queryNodes` to perform vector similarity search (after first creating a vector index on the domain nodes)

Wrap the entity resolution logic into a reusable function that correlates subject and domain nodes based on property value similarity using Jaro-Winkler distance. This creates the bridge between extracted entities and the existing domain graph.

Create `CORRESPONDS_TO` relationships between subject graph entities and domain graph nodes with similar property values. The query uses `MERGE` to avoid duplicate relationships and adds timestamps to track when correlations were established.
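A query of this kind might look like the sketch below. The function name, the parameter names (`$entity_key`, `$domain_key`, `$max_distance`), and the timestamp property names are assumptions; the label is interpolated after a basic check, and the query relies on APOC's `apoc.text.jaroWinklerDistance` being installed.

```python
import re


def corresponds_to_query(label: str) -> str:
    """Build Cypher that links subject nodes to similar domain nodes of one label."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"unexpected label: {label!r}")
    return f"""
MATCH (entity:`{label}`:`__Entity__`), (domain:`{label}`)
WHERE NOT domain:`__Entity__`
  AND apoc.text.jaroWinklerDistance(entity[$entity_key], domain[$domain_key]) < $max_distance
MERGE (entity)-[r:CORRESPONDS_TO]->(domain)
ON CREATE SET r.created_at = datetime()
ON MATCH SET r.updated_at = datetime()
RETURN count(r) AS relationships
"""


query = corresponds_to_query("Product")
# Hypothetical parameters to pass alongside the query.
params = {"entity_key": "name", "domain_key": "product_name", "max_distance": 0.1}
print(query)
```

Because `MERGE` matches an existing relationship before creating a new one, rerunning the query updates `updated_at` instead of adding duplicates.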

Test the Jaro-Winkler distance function by finding all pairs of subject and domain Product nodes where the distance between the name properties is below a threshold, showing potential matches for entity resolution.

### 8.5.7 Correlate and connect the subject nodes to the domain nodes

Execute the complete entity resolution workflow by iterating through all extracted entity labels, finding the best property key correlations, and automatically creating connections between the subject graph and domain graph to complete the knowledge graph integration.