# Building PrimeKG knowledge graph in Neo4j

This notebook demonstrates how to connect to the PrimeKG knowledge graph in Neo4j via LangChain and query it in natural language with Azure OpenAI.

### Step 1: Environment Setup

```
!pip install langchain-neo4j langchain-openai neo4j azure-identity
```

In [1]:
# Import required packages
import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_neo4j import Neo4jGraph, GraphCypherQAChain

In [2]:
# Set up environment variables for Azure OpenAI
AOAI_API_BASE = os.getenv("AZURE_OPENAI_API_BASE")
AOAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION")
AOAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_API_DEPLOY")
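
`os.getenv` returns `None` for an unset variable, so a missing setting only surfaces later when the model is initialised. A minimal sketch of an early check follows; the helper name `missing_env_vars` is hypothetical and not part of the original notebook, while the variable names are the ones read in the cell above.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_BASE",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_DEPLOY",
)


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper: `environ` defaults to os.environ and can be
    replaced with a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this before Step 2 makes a configuration problem visible at the point where it can be fixed.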

### Step 2: LLM Initialisation

In [3]:
# Setup Entra ID Auth
credential = DefaultAzureCredential()

token_provider = get_bearer_token_provider(
    credential, 
    "https://cognitiveservices.azure.com/.default"
)

In [4]:
# Initialise the Azure OpenAI model
llm = AzureChatOpenAI(
    azure_endpoint = AOAI_API_BASE,
    azure_deployment = AOAI_DEPLOYMENT,
    openai_api_version = AOAI_API_VERSION,
    azure_ad_token_provider = token_provider,
    temperature = 0,
    verbose = True
)

### Step 3: Connection to Neo4j Knowledge Graph

In [5]:
# Connect to Knowledge Graph
graph = Neo4jGraph(
    url = "bolt://localhost:7687",
    username = "neo4j",
    password = "PrimeKG123!"
)
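
The connection cell above hardcodes the Neo4j credentials. One possible alternative, sketched here under assumptions, is to read them from the environment in the same way as the Azure OpenAI settings. The variable names `NEO4J_URI`, `NEO4J_USERNAME` and `NEO4J_PASSWORD` and the helper `neo4j_settings` are hypothetical choices for illustration, not names the notebook defines.

```python
import os


def neo4j_settings(environ=None):
    """Build keyword arguments for Neo4jGraph from environment variables.

    The environment variable names are assumptions made for this sketch.
    The URL and username fall back to the values used in the cell above;
    the password has no fallback and is None when unset.
    """
    env = os.environ if environ is None else environ
    return {
        "url": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD"),
    }


settings = neo4j_settings()
if settings["password"] is None:
    print("NEO4J_PASSWORD is not set")

# The dictionary uses the same keyword names as the connection cell above:
# graph = Neo4jGraph(**settings)
```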

In [6]:
# Refresh graph schema
graph.refresh_schema()

print(graph.schema[:300])

Node properties:
gene/protein {original_id: STRING, name: STRING, source: STRING, node_index: STRING}
drug {original_id: STRING, name: STRING, source: STRING, description: STRING, mechanism_of_action: STRING, pharmacodynamics: STRING, state: STRING, atc_1: STRING, atc_2: STRING, atc_3: STRING, atc_4


### Step 4: AI Chain Setup

In [7]:
# Define system prompt
CYPHER_GENERATION_TEMPLATE = """
Task: Generate Cypher statement to query a graph database.
Instructions:
- Use only the provided relationship types and properties in the schema.
- Do not use any other relationship types or properties that are not provided.
- CRITICAL: Escape node labels that contain special characters using backticks (e.g., use `gene/protein` instead of gene/protein).
- ALWAYS return the 'source' property (e.g. DrugBank, NCBI) alongside names.
- ALWAYS return the 'original_id' property.
- If no data is found, return nothing.

Schema:
{schema}

Question: {question}
"""

CYPHER_PROMPT = PromptTemplate(
    input_variables = ["schema", "question"], 
    template = CYPHER_GENERATION_TEMPLATE
)

QA_TEMPLATE = """
You are a responsible AI agent for Precision Medicine.
Your task is to answer the question based ONLY on the following context provided by the Knowledge Graph.

Context:
{context}

Structure your response into two distinct sections:
1. **Analysis**: Synthesize the answer naturally.
2. **References**: List the specific data sources (e.g., DrugBank, DisGeNET) and Original IDs used in the analysis.

Rules:
- Do NOT use internal training knowledge.
- If the context is empty, state that no information was found in PrimeKG.

Question: {question}
Helpful Answer:
"""

QA_PROMPT = PromptTemplate(
    input_variables = ["context", "question"], 
    template = QA_TEMPLATE
)
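
The Cypher prompt above asks the model to wrap labels such as `gene/protein` in backticks. The same rule can be sketched as a small function, which may be useful when writing queries by hand. The helper `escape_label` is hypothetical and not part of the notebook or of LangChain. It assumes that a label made only of ASCII letters, digits and underscores, and not starting with a digit, needs no quoting, and that a backtick inside a label is escaped by doubling it.

```python
import re

# Labels matching this pattern are treated as safe to use unquoted
# (an assumption of this sketch).
_PLAIN_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def escape_label(label: str) -> str:
    """Wrap a node label in backticks when it contains special characters."""
    if _PLAIN_IDENTIFIER.fullmatch(label):
        return label
    # Double any backtick inside the label before wrapping it.
    return "`" + label.replace("`", "``") + "`"


print(escape_label("drug"))          # drug
print(escape_label("gene/protein"))  # `gene/protein`
```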

In [8]:
# Build the Cypher QA Chain
chain = GraphCypherQAChain.from_llm(
    llm = llm,
    graph = graph,
    verbose = True,
    cypher_prompt = CYPHER_PROMPT,
    qa_prompt = QA_PROMPT,
    return_intermediate_steps = True, 
    allow_dangerous_requests = True
)
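
Because the chain is built with `return_intermediate_steps = True`, the response should carry the generated Cypher and the raw query rows in addition to `result`. The sketch below assumes they arrive as a list of dictionaries under `intermediate_steps`, keyed `query` and `context`; that layout should be verified against the installed `langchain-neo4j` version. The helper `split_intermediate_steps` is hypothetical, and the sample response is hand-built from one row of this notebook's own output.

```python
def split_intermediate_steps(response):
    """Pull the generated Cypher and the context rows out of a chain response.

    Assumes response["intermediate_steps"] is a list of dicts, one holding
    "query" and one holding "context". Missing parts come back as None / [].
    """
    query, context = None, []
    for step in response.get("intermediate_steps", []):
        if "query" in step:
            query = step["query"]
        if "context" in step:
            context = step["context"]
    return query, context


# Hand-built sample in the assumed shape, not captured from a real run.
sample_response = {
    "result": "...",
    "intermediate_steps": [
        {"query": "MATCH (d:drug) RETURN d.name AS drug_name"},
        {"context": [{"drug_name": "Galantamine", "drug_source": "DrugBank"}]},
    ],
}

cypher, rows = split_intermediate_steps(sample_response)
print(cypher)
print(len(rows))
```

Keeping the query and rows next to the final answer makes it possible to check that the **References** section of the answer matches what the graph returned.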

### Step 5: Solution Test

In [9]:
# Run the Tool
question = "Which drugs are indicated for Alzheimer disease, and which specific genes do these drugs target? Include the sources."

try:
    print(f"QUESTION: {question}\n")

    print("ANSWER:")
    response = chain.invoke(question)
    print(response['result'])
    
except Exception as e:
    print(f"Error executing chain: {e}")

QUESTION: Which drugs are indicated for Alzheimer disease, and which specific genes do these drugs target? Include the sources.

ANSWER:


> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (disease:disease {name: "Alzheimer disease"})<-[:indication]-(drug:drug)-[:drug_protein]->(gene:`gene/protein`)
RETURN 
  drug.name AS drug_name, 
  drug.source AS drug_source, 
  drug.original_id AS drug_original_id,
  gene.name AS gene_name, 
  gene.source AS gene_source, 
  gene.original_id AS gene_original_id

Full Context:
[{'drug_name': 'Galantamine', 'drug_source': 'DrugBank', 'drug_original_id': 'DB00674', 'gene_name': 'ABCB1', 'gene_source': 'NCBI', 'gene_original_id': '5243'}, {'drug_name': 'Galantamine', 'drug_source': 'DrugBank', 'drug_original_id': 'DB00674', 'gene_name': 'CHRNA7', 'gene_source': 'NCBI', 'gene_original_id': '1139'}, {'drug_name': 'Galantamine', 'drug_source': 'DrugBank', 'drug_original_id': 'DB00674', 'gene_na