# GraphRAG Retrieval Evaluation

## Purpose
The code implements a hybrid document retrieval system that combines vector similarity search with graph traversal to find relevant document chunks. It works in five stages:

1. **Vector Search & Similarity Scoring**
 - Converts the input query into a vector embedding
 - Performs similarity search against entity nodes in Neo4j
 - Creates a sorted dictionary of entity IDs and their similarity scores
 - Filters results based on a similarity threshold (default 0.8)

2. **Graph Traversal Strategy**
```cypher
    MATCH path = (n:Chunk)-[*1..{max_hops}]->(m:`__Entity__`)
    WHERE m.id IN $ids
```
 - Finds paths from document chunks to relevant entities
 - Limits path length to control traversal depth
 - Only considers entities that met the similarity threshold

3. **Relevance Calculation**
```cypher
    WITH n, min(length(path)) as distance, m
    WITH n, distance, m.id as entity_id
    WITH n, distance, entity_id, 
            CASE 
            WHEN entity_id IN $ids 
            THEN $similarity_scores[entity_id]
            END as similarity
```
 - Calculates shortest path length to each entity
 - Preserves original similarity scores from vector search
 - Combines structural proximity (distance) with semantic similarity

4. **Result Ordering**
```cypher
    ORDER BY similarity DESC, distance
```
 - Prioritizes chunks with higher semantic similarity
 - Uses path distance as a secondary sorting criterion
 - Ensures most relevant chunks appear first

5. **Output Format**
```cypher
    RETURN n.text, n.fileName, n.page_number, n.position, entity_id, similarity
```
 - Returns comprehensive chunk metadata:
    - Text content
    - Source file name
    - Page number
    - Position in document
    - Associated entity ID
    - Similarity score
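
The five stages above can be mimicked in plain Python to make the ranking logic concrete. This is a minimal sketch, not the notebook's implementation: `filter_by_threshold`, `rank_chunks`, and the sample data are hypothetical, the graph traversal is replaced by a precomputed list of `(chunk, entity, path length)` rows, and whether the real threshold test is `>=` or `>` is an assumption.

```python
from typing import Dict, List, Tuple


def filter_by_threshold(scores: Dict[str, float], threshold: float = 0.8) -> Dict[str, float]:
    """Keep entities at or above the threshold, ordered by descending score."""
    kept = {eid: s for eid, s in scores.items() if s >= threshold}  # `>=` is assumed
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


def rank_chunks(
    paths: List[Tuple[str, str, int]],
    similarity_scores: Dict[str, float],
) -> List[Tuple[str, str, float, int]]:
    """Emulate `min(length(path))` followed by `ORDER BY similarity DESC, distance`."""
    shortest: Dict[Tuple[str, str], int] = {}
    for chunk_id, entity_id, hops in paths:
        if entity_id not in similarity_scores:
            continue  # mirrors `WHERE m.id IN $ids`
        key = (chunk_id, entity_id)
        shortest[key] = min(hops, shortest.get(key, hops))
    rows = [(c, e, similarity_scores[e], d) for (c, e), d in shortest.items()]
    # Higher similarity first; shorter distance breaks ties.
    return sorted(rows, key=lambda r: (-r[2], r[3]))


# Hypothetical vector-search output: entity id -> similarity score.
raw_scores = {"e1": 0.91, "e2": 0.85, "e3": 0.42}
scores = filter_by_threshold(raw_scores)

# Hypothetical traversal output: (chunk id, entity id, path length).
paths = [("c1", "e2", 1), ("c2", "e1", 2), ("c2", "e1", 1), ("c3", "e3", 1), ("c4", "e1", 2)]
ranked = rank_chunks(paths, scores)
```

Here `e3` falls below the threshold and drops out, so chunk `c3` never appears in `ranked`; the two chunks linked to `e1` come first, ordered by their path length.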

## Key Features
- Hybrid retrieval approach combining:
    - Vector-based semantic search
    - Graph-based structural relationships
- Configurable parameters:
    - Similarity threshold
    - Maximum path length
    - Result limit
- Deduplication of chunks
- Ordered results by relevance
- Rich metadata for each chunk
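
As a rough illustration of how the configurable parameters could feed the query, the sketch below stitches the Cypher fragments from the previous section into one statement. The function names `build_chunk_query` and `build_params`, the clause order, and the use of a `$limit` parameter are assumptions; the actual query inside `libs` is not shown in this notebook.

```python
from typing import Any, Dict


def build_chunk_query(max_hops: int = 2) -> str:
    """Assemble a Cypher statement from the fragments shown in this notebook."""
    # The hop bound is interpolated into the text, so validate it first.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError("max_hops must be a positive integer")
    return (
        f"MATCH path = (n:Chunk)-[*1..{max_hops}]->(m:`__Entity__`)\n"
        "WHERE m.id IN $ids\n"
        "WITH n, min(length(path)) AS distance, m\n"
        "WITH n, distance, m.id AS entity_id\n"
        "WITH n, distance, entity_id,\n"
        "     CASE WHEN entity_id IN $ids THEN $similarity_scores[entity_id] END AS similarity\n"
        "ORDER BY similarity DESC, distance\n"
        "RETURN n.text, n.fileName, n.page_number, n.position, entity_id, similarity\n"
        "LIMIT $limit"
    )


def build_params(similarity_scores: Dict[str, float], limit: int = 10) -> Dict[str, Any]:
    """Query parameters: entity ids, their scores, and the result limit."""
    return {
        "ids": list(similarity_scores),
        "similarity_scores": similarity_scores,
        "limit": limit,
    }


query = build_chunk_query(max_hops=2)
params = build_params({"e1": 0.91, "e2": 0.85}, limit=5)
```

With a live connection, the pair could then be passed to `graph.query(query, params=params)`.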

## Step 0: Environment setup

In [1]:
from dotenv import load_dotenv
import os
from langchain_neo4j import Neo4jGraph
from libs import create_vector_index
import pandas as pd
from conn import connect2Googlesheet, retrieval_rel_docs, get_avg_similarity_df, get_concatenate_df
from libs import context_builder, chunk_finder, enhanced_chunk_finder
# Load environment variables from the .env file (existing variables are not overridden)
load_dotenv()

True

In [2]:
# Connect to Neo4j database
try:
    graph = Neo4jGraph(
        url=os.getenv("NEO4J_URL"),
        username=os.getenv("NEO4J_USERNAME"),
        password=os.getenv("NEO4J_PASSWORD")
    )
    print("Connected to Neo4j database successfully.")
except ValueError as e:
    print(f"Could not connect to Neo4j database: {e}")

Connected to Neo4j database successfully.


## Step 1: Create vector index

In [3]:
create_vector_index(graph, "entities")

✅ Index 'entities' already exists with correct dimensions: 384
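
The internals of `create_vector_index` are not shown here. As a hedged sketch, a helper like it might issue a Neo4j 5 `CREATE VECTOR INDEX` statement along the following lines. The function name `vector_index_statement`, the `embedding` property, and the cosine similarity function are assumptions; only the index name `entities`, the `__Entity__` label, and the 384 dimensions come from this notebook.

```python
import re


def vector_index_statement(
    name: str,
    label: str = "__Entity__",
    prop: str = "embedding",       # assumed property name
    dimensions: int = 384,
    similarity: str = "cosine",    # assumed similarity function
) -> str:
    """Build a CREATE VECTOR INDEX statement for Neo4j 5."""
    # Names are interpolated into the statement, so restrict them to plain identifiers.
    for identifier in (name, label, prop):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    if similarity not in {"cosine", "euclidean"}:
        raise ValueError(f"unsupported similarity function: {similarity!r}")
    return (
        f"CREATE VECTOR INDEX `{name}` IF NOT EXISTS\n"
        f"FOR (n:`{label}`) ON (n.{prop})\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {int(dimensions)},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


statement = vector_index_statement("entities")
```

The dimension count has to match the embedding model's output size, which is why the helper's log line reports 384.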


## Step 2: Load questions from Google Sheet

In [4]:
spreadsheet = connect2Googlesheet()

# Select the worksheet: relevance
worksheet = spreadsheet.get_worksheet(2)  

# Get all records as a list of dictionaries
data = worksheet.get_all_records()

# Convert to Pandas DataFrame
df_MedQ = pd.DataFrame(data)
df_MedQ.head()

Unnamed: 0,condition,number,docs,Question,Mahmud's Note,status,comments,Unnamed: 8
0,ARDS,1,ACURASYS,Does early administration of neuromuscular blo...,Like,,,
1,ARDS,2,ACURASYS,Do patients with severe ARDS being treated wit...,Replace,fixed,,
2,ARDS,3,ROSE,"In patients with moderate to severe ARDS, does...",Maybe this question: In patients with moderate...,fixed,,
3,ARDS,4,ROSE,Do patients with moderate-to-severe ARDS have ...,Local question (not sure if this is the aim of...,fixed,Wrong concept since PEEP by itself is mandator...,Does the use of neuromuscular blockers in pati...
4,ARDS,5,FACTT,"Among patients with ALI/ARDS, does a conservat...",Local question (not sure if this is the aim of...,fixed,Check if studies defined conservative by CVP <...,


## Step 3: Relevance check for top K questions

The `[*1..{max_hops}]` syntax in the Cypher query defines a variable-length relationship pattern in Neo4j.

**Syntax Explanation**
- `*` indicates a variable-length path
- `1..{max_hops}` specifies the range:
    - `1` is the minimum length
    - `{max_hops}` is the maximum length (a placeholder filled in from the `max_hops` argument when the query string is built)

**Purpose**
1. Path Flexibility: It allows finding relationships between nodes that are both:
    - Directly connected (1 hop)
    - Indirectly connected (up to max_hops steps away)
2. Example with max_hops=2:
```cypher
    (Chunk)-->(Entity)           // 1 hop
    (Chunk)-->(Node)-->(Entity)  // 2 hops
```
3. Use Case in the Code:
    - The query finds chunks that are connected to relevant entities either:
        - Directly (1 relationship away)
        - Through intermediate nodes (up to `max_hops` relationships away)
    - This broadens the search context while maintaining control over the search depth

**Practical Impact**
```text
    # With max_hops = 1 (direct connections only)
    Chunk -> Entity

    # With max_hops = 2 (includes indirect connections)
    Chunk -> IntermediateNode -> Entity
```
This flexibility is particularly useful in knowledge graphs where relevant information might be connected through intermediate concepts or relationships.
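
To see the effect of `max_hops` without a database, the bounded traversal can be imitated with a breadth-first search over a small adjacency map. This is only an illustrative sketch: `entities_within_hops` and the toy graph are hypothetical, and the function returns the shortest hop count per entity, which corresponds to `min(length(path))` in the query.

```python
from collections import deque
from typing import Dict, List, Set


def entities_within_hops(
    edges: Dict[str, List[str]],
    start: str,
    targets: Set[str],
    max_hops: int,
) -> Dict[str, int]:
    """Shortest hop count from `start` to each target reachable within `max_hops`."""
    found: Dict[str, int] = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= 1 and node in targets:
            found[node] = depth  # BFS reaches each node first via a shortest path
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return found


# Toy directed graph: one chunk, one intermediate node, three entities.
edges = {
    "chunk1": ["fluid_balance", "section_a"],
    "section_a": ["ards"],
    "ards": ["mortality"],
}
targets = {"fluid_balance", "ards", "mortality"}

one_hop = entities_within_hops(edges, "chunk1", targets, max_hops=1)
two_hops = entities_within_hops(edges, "chunk1", targets, max_hops=2)
```

Raising the limit from 1 to 2 adds `ards`, which is only reachable through `section_a`, while `mortality` stays out of reach until `max_hops` is 3.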

In [5]:
# Set pandas display options to show the full text content
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
topk = 5  # number of questions to evaluate (36 in total)
results_df = retrieval_rel_docs(graph, df_MedQ, top_k=topk) # Retrieve relevant documents for each question
# results_df.to_csv('./outputs/retrieved_docs_results.csv', index=False)
results_df

Unnamed: 0,Question number,Question,Retrieved FileName,Chunk Text,Page Number,Position,Similarity
0,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,LSPA.pdf,"clinical trials have shown that early mobilization in both medical and surgical critically ill patients is safe and associated with increased ventilator-free days and improved physical function at hospital discharge.22–25 Early mobilization is limited by use of deep sedation and development of delirium, which can be minimized through the use of scale- based targeted light sedation is implemented early on.26 After reviewing this literature in 2013, the Society of Critical Care Medicine (SCCM)’s",3,20,0.878641
1,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,ESCPARDS.pdf,"-day mortality rates among patients enrolled at least 14 days after the onset of ARDS. Methylprednisolone increased the number of ventilator-free and shock- free days during the first 28 days in association with an improvement in oxygen- ation, respiratory-system compliance, and blood pressure with fewer days of vaso- pressor therapy. As compared with placebo, methylprednisolone did not increase the rate of infectious complications but was associated with a higher rate",1,9,0.865857
2,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,ETSDMV.pdf,"▪ Randomized placebo-controlled clinical trial of low dose nighttime dexmedetomidine to prevent ICU delirium. Notable for being one of the few interventions shown to prevent ICU delirium. 30. Kawazoe Y, Miyamoto K, Morimoto T, et al. Effect of dexmedetomidine on mortality and ventilator-free days in patients requiring mechanical ventilation with sepsis: a randomized clinical trial effect of dexmed",8,82,0.863648
3,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,DDS.pdf,"medetomidine in critically ill patients increased ventilator-free time [11] and decreased the incidence of postoperative complications, delirium, and mortality up to 1 year post-cardiac surgery [12]. In postoperative patients, dexmedetomidine provided sympatholytic ac- tivity [13]. It also offers anti-inflammatory and organ protective effects in animal models [14]. The use of dexmedetomidine as an anti-adrenergic strategy in se",1,6,0.863332
4,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,DDS.pdf,"mediated by a blunted immune response, resulting in an improved microcirculation [7], or enhanced adrener- gic receptor sensitivity [8]. Dexmedetomidine is a highly selective alpha-2 adre- noreceptor agonist that has sedative, anxiolytic, and opioid-sparing effects (Table 2) [9, 10]. The use of dexmedetomidine in critically ill patients increased ventilator-free time [11] and decreased",1,5,0.863332
5,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"significantly lower cumulative fluid balance by 5,074 mL over 7 days than FACTT Liberal. In subjects without baseline shock, in whom the fluid protocol was applied throughout the duration of the study, management with FACTT Lite resulted in an equivalent cumulative fluid balance to FACTT Conservative. FACTT Lite had similar clinical outcomes of ventilator-free days, ICU-free days, and mortality as FACTT Conservative and significantly greater ventilator-",6,57,0.859509
6,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,APROCCHSS.pdf,"-free days to day 28 was signifi- cantly higher in the hydrocortisone-plus-fludrocortisone group than in the placebo group (17 vs. 15 days, P<0.001), as was the number of organ-failure–free days (14 vs. 12 days, P = 0.003). The number of ventilator-free days was similar in the two groups (11 days in the",1,10,0.852851
7,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"mortality to FACTT Conservative. FACTT Conservative has improved ventilator-free days, ICU-free days, and prevalence of acute kidney injury than FACTT Liberal. FACTT Lite can be used as a simplified and safe alternative to FACTT Conservative for the management of fluid balance in patients with ARDS. Supplementary Material Refer to Web version on PubMed Central for supplementary material. Acknowledgments Supported, in part, by the National Institutes of Health, National",8,76,0.850681
8,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,T Liberal (11%) (p = 0.18 vs Lite). Conclusions—FACTT Lite had a greater cumulative fluid balance than FACTT Conservative but had equivalent clinical and safety outcomes. FACTT Lite is an alternative to FACTT Conservative for fluid management in Acute Respiratory Distress Syndrome. Keywords acute kidney injury; adult respiratory distress syndrome; clinical protocols; critical illness; fluid therapy; shock Conservative fluid management improves ventilator-free days and,2,15,0.850681
9,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"be easily understood and implemented by physician and nursing staff in the ICU. Conclusions Although the FACTT Lite protocol had a greater cumulative fluid balance than FACTT Conservative, the results of our study indicate that the FACTT Lite protocol is safe and has equivalent ventilator-free days, ICU-free days, acute kidney injury, and adjusted 60-day mortality to FACTT Conservative. FACTT Conservative has improved ventilator-free days,",8,75,0.850681


In [6]:
analysis_df = get_avg_similarity_df(results_df)
analysis_df

Unnamed: 0,Question Number,Question,Retrieved Files,Avg Similarity
0,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,"[APROCCHSS.pdf, LSPA.pdf, ARDSSRDRFMS.pdf, ETSDMV.pdf, DDS.pdf, FMWSCPARDS.pdf, EDvsLSMENDS.pdf, RARDS.pdf, ESCPARDS.pdf]",0.852777
1,2,Do patients with severe ARDS being treated with neuromuscular blocking agents have increased muscle weakness?,"[ENB.pdf, NBSARDS.pdf, TOF-ARDS.pdf, ESCPARDS.pdf, LSPA.pdf, HARDST.pdf, ACURASYS.pdf, APV.pdf, CEIIUPPSARDS.pdf, OSCILLATE.pdf, PVOEMVRARDS.pdf, ETSDMV.pdf, NEvsVP.pdf]",0.863793
2,3,"In patients with moderate to severe ARDS, does early use of continuous neuromuscular blockade improve mortality?","[ENB.pdf, NBSARDS.pdf, TOF-ARDS.pdf, ACURASYS.pdf, HARDST.pdf, LSPA.pdf, APV.pdf, ETSDMV.pdf, PVOEMVRARDS.pdf, ESCPARDS.pdf]",0.855276
3,4,Do patients with moderate-to-severe ARDS have a significance difference in mortality rate beween patients who recieved an early and continous cisatracurium infusion than those with usual care approach with lighter sedation targets?,"[ENB.pdf, NBSARDS.pdf, TOF-ARDS.pdf, ACURASYS.pdf, LSPA.pdf, HARDST.pdf, APV.pdf, ETSDMV.pdf, CEIIUPPSARDS.pdf, PVOEMVRARDS.pdf, ESCPARDS.pdf, BMIMSARDS.pdf]",0.823498
4,5,"Among patients with ALI/ARDS, does a conservative fluid management strategy improves lung function and decrease ventilator days compared to liberal strategy?",[FMWSCPARDS.pdf],0.842364
