# GraphRAG Retrieval Evaluation

## Purpose
The code implements a document retrieval system that combines vector similarity search with graph traversal to find relevant document chunks. It works in five steps:

1. **Vector Search & Similarity Scoring**
 - Converts the input query into a vector embedding
 - Performs similarity search against entity nodes in Neo4j
 - Creates a sorted dictionary of entity IDs and their similarity scores
 - Filters results based on a similarity threshold (default 0.8)
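
The filtering and sorting in this step can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `filter_and_sort_scores` is a hypothetical helper, the `(entity_id, score)` pairs stand in for whatever the vector search returns, and treating the threshold as inclusive is an assumption.

```python
from typing import Dict, Iterable, Tuple


def filter_and_sort_scores(
    hits: Iterable[Tuple[str, float]], threshold: float = 0.8
) -> Dict[str, float]:
    """Keep hits at or above the threshold, ordered by descending score."""
    kept = [(entity_id, score) for entity_id, score in hits if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # Dicts preserve insertion order, so the result stays sorted.
    return dict(kept)


# Hypothetical vector-search output: (entity ID, similarity score).
hits = [("peep", 0.79), ("ventilator-free days", 0.88), ("fluid balance", 0.86)]
scores = filter_and_sort_scores(hits)
print(scores)
```

The keys of `scores` would then serve as `$ids` and the dictionary itself as `$similarity_scores` in the queries that follow.
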

2. **Graph Traversal Strategy**
```cypher
    MATCH path = (n:Chunk)-[*1..{max_hops}]->(m:`__Entity__`)
    WHERE m.id IN $ids
```
 - Finds paths from document chunks to relevant entities
 - Limits path length to control traversal depth
 - Only considers entities that met the similarity threshold
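
The `{max_hops}` placeholder looks like Python string formatting rather than a Cypher parameter, which suggests the hop limit is interpolated into the query text while the entity IDs travel as `$ids`. Variable-length bounds generally cannot be supplied as query parameters, so this split is plausible. Assuming that is how the query is built, a hedged sketch of a builder that validates the hop limit before interpolating it could look like this (`build_traversal_query` is a hypothetical name, not a function from `libs`):

```python
def build_traversal_query(max_hops: int) -> str:
    """Return the MATCH/WHERE clauses with a validated hop limit interpolated."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError(f"max_hops must be a positive integer, got {max_hops!r}")
    return (
        f"MATCH path = (n:Chunk)-[*1..{max_hops}]->(m:`__Entity__`)\n"
        "WHERE m.id IN $ids"
    )


query = build_traversal_query(2)
print(query)
```

Validating the value first keeps arbitrary text out of the query string, since anything interpolated this way bypasses parameter binding.
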

3. **Relevance Calculation**
```cypher
    WITH n, min(length(path)) AS distance, m
    WITH n, distance, m.id AS entity_id
    WITH n, distance, entity_id,
         CASE
             WHEN entity_id IN $ids
             THEN $similarity_scores[entity_id]
         END AS similarity
```
 - Calculates shortest path length to each entity
 - Preserves original similarity scores from vector search
 - Keeps both structural proximity (distance) and semantic similarity on each row so that results can be ordered by them
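
To make the aggregation concrete, here is a small Python emulation of what the `WITH` clauses compute. It is only a sketch under stated assumptions: `score_paths` is a hypothetical helper, and the `(chunk, entity, path length)` tuples stand in for the paths Neo4j would return.

```python
from typing import Dict, List, Tuple


def score_paths(
    paths: List[Tuple[str, str, int]], similarity_scores: Dict[str, float]
) -> List[dict]:
    """Collapse paths to one row per (chunk, entity) with the shortest distance."""
    shortest: Dict[Tuple[str, str], int] = {}
    for chunk_id, entity_id, length in paths:
        key = (chunk_id, entity_id)
        # Equivalent of min(length(path)) grouped by chunk and entity.
        shortest[key] = min(length, shortest.get(key, length))
    return [
        {
            "chunk": chunk_id,
            "entity_id": entity_id,
            "distance": distance,
            # Mirrors the CASE expression: None when the entity has no score.
            "similarity": similarity_scores.get(entity_id),
        }
        for (chunk_id, entity_id), distance in shortest.items()
    ]


# Hypothetical paths: two routes from c1 to e1, one route from c2 to e1.
paths = [("c1", "e1", 2), ("c1", "e1", 1), ("c2", "e1", 3)]
scored = score_paths(paths, {"e1": 0.88})
print(scored)
```

Note that every chunk reached through the same entity inherits that entity's similarity score, so the score alone cannot separate them.
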

4. **Result Ordering**
```cypher
    ORDER BY similarity DESC, distance
```
 - Prioritizes chunks with higher semantic similarity
 - Uses path distance as a secondary sorting criterion
 - Ensures most relevant chunks appear first
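
The same two-key ordering can be reproduced in Python, which is handy for checking results offline. The rows below are made-up values, and `order_rows` is a hypothetical helper rather than part of the notebook's code.

```python
def order_rows(rows):
    """Sort by similarity (descending), then by path distance (ascending)."""
    # Negating similarity gives descending order; distance stays ascending.
    return sorted(rows, key=lambda r: (-r["similarity"], r["distance"]))


rows = [
    {"chunk": "c1", "distance": 2, "similarity": 0.86},
    {"chunk": "c2", "distance": 3, "similarity": 0.88},
    {"chunk": "c3", "distance": 1, "similarity": 0.88},
]
ordered = order_rows(rows)
print([r["chunk"] for r in ordered])
```

Because similarity is the primary key, distance only matters among rows with equal similarity, as with `c2` and `c3` here.
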

5. **Output Format**
```cypher
    RETURN n.text, n.fileName, n.page_number, n.position, entity_id, similarity
```
 - Returns comprehensive chunk metadata:
    - Text content
    - Source file name
    - Page number
    - Position in document
    - Associated entity ID
    - Similarity score

## Key Features
- Hybrid retrieval approach combining:
    - Vector-based semantic search
    - Graph-based structural relationships
- Configurable parameters:
    - Similarity threshold
    - Maximum path length
    - Result limit
- Deduplication of chunks
- Ordered results by relevance
- Rich metadata for each chunk
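
As an illustration of the deduplication and result-limit features, the sketch below keeps the best-ranked row for each chunk and then truncates the list. It assumes the rows are already ordered by relevance and that a chunk is identified by its `fileName` and `position`; both the identifying key and the helper name `dedupe_and_limit` are assumptions, not taken from the actual implementation.

```python
def dedupe_and_limit(rows, limit):
    """Keep the first row seen for each chunk, then return at most `limit` rows."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["fileName"], row["position"])  # assumed chunk identity
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique[:limit]


# Hypothetical ranked rows; the second one repeats the first chunk.
ranked = [
    {"fileName": "a.pdf", "position": 20, "similarity": 0.88},
    {"fileName": "a.pdf", "position": 20, "similarity": 0.86},
    {"fileName": "b.pdf", "position": 57, "similarity": 0.86},
    {"fileName": "b.pdf", "position": 22, "similarity": 0.86},
]
top = dedupe_and_limit(ranked, limit=2)
print(top)
```
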

## Step 0: Environment setup

In [7]:
from dotenv import load_dotenv
import os
from langchain_neo4j import Neo4jGraph
from libs import create_vector_index
import pandas as pd
from conn import connect2Googlesheet, retrieval_rel_docs, get_concatenate_df, apply_metric
from libs import context_builder, chunk_finder, enhanced_chunk_finder
# Load environment variables from the .env file
# (pass override=True to overwrite variables that are already set)
load_dotenv()

True

In [8]:
# Connect to Neo4j database
try:
    graph = Neo4jGraph(
        url=os.getenv("NEO4J_URL"),
        username=os.getenv("NEO4J_USERNAME"),
        password=os.getenv("NEO4J_PASSWORD")
    )
    print("Connected to Neo4j database successfully.")
except ValueError as e:
    print(f"Could not connect to Neo4j database: {e}")

Connected to Neo4j database successfully.
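
If one of the three variables is missing from `.env`, `os.getenv` returns `None` and the connection error can be hard to read. A small pre-flight check along these lines may help; `missing_env_vars` is a hypothetical helper and the values below are placeholders, not real credentials.

```python
import os

REQUIRED_VARS = ("NEO4J_URL", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Placeholder environment with the password left out on purpose.
example_env = {"NEO4J_URL": "bolt://localhost:7687", "NEO4J_USERNAME": "neo4j"}
missing = missing_env_vars(env=example_env)
print(missing)
```

Calling `missing_env_vars()` with no arguments checks the real process environment after `load_dotenv()` has run.
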


## Step 1: Create vector index

In [9]:
# Uncomment to create the vector index (left disabled in this run)
# create_vector_index(graph, "entities")

## Step 2: Load questions from a Google Sheet

In [10]:
spreadsheet = connect2Googlesheet()

# Select the "relevance" worksheet (the index is zero-based, so 2 is the third sheet)
worksheet = spreadsheet.get_worksheet(2)

# Get all records as a list of dictionaries
data = worksheet.get_all_records()

# Convert to Pandas DataFrame
df_MedQ = pd.DataFrame(data)
df_MedQ.head()

Unnamed: 0,condition,number,docs,Question,Mahmud's Note,status,comments,Unnamed: 8
0,ARDS,1,ACURASYS,Does early administration of neuromuscular blocking agents increases the ventilator free days?,Like,,,
1,ARDS,2,ACURASYS,Do patients with severe ARDS being treated with neuromuscular blocking agents have increased muscle weakness?,Replace,fixed,,
2,ARDS,3,ROSE,"In patients with moderate to severe ARDS, does early use of continuous neuromuscular blockade improve mortality?","Maybe this question: In patients with moderate to severe ARDS, does early use of continuous neuromuscular blockade improve mortality?",fixed,,
3,ARDS,4,ROSE,Do patients with moderate-to-severe ARDS have a significance difference in mortality rate beween patients who recieved an early and continous cisatracurium infusion than those with usual care approach with lighter sedation targets?,Local question (not sure if this is the aim of your project) It will be nice as second step after proving the general summarization is working but focusing in general summarization would be priority in my opinion so you can have meanigful tool.,fixed,Wrong concept since PEEP by itself is mandatory component in ventilator.,"Does the use of neuromuscular blockers in patients with moderate-to-severe ARDS impact cardiovascular stability, particularly in terms of vasopressor requirements and hemodynamic effects, compared to sedation strategy without routine neuromuscular blockade?"
4,ARDS,5,FACTT,"Among patients with ALI/ARDS, does a conservative fluid management strategy improves lung function and decrease ventilator days compared to liberal strategy?","Local question (not sure if this is the aim of your project) consider (WikiJournal): In patients with ALI/ARDS that are intubated and receiving positive pressure ventilation, how does the conservative compare to the liberal fluid management strategy in reducing mortality?",fixed,Check if studies defined conservative by CVP < 4 or elese just dont mention how much the CVP (i prefer the last approach),


## Step 3: Relevance check for top K questions

In [None]:
# Set pandas display options to show the full text content
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
topk = 5  # top_k passed to the retriever; the sheet holds 36 questions in total
results_df = retrieval_rel_docs(graph, df_MedQ, top_k=topk) # Retrieve relevant documents for each question
# results_df.to_csv('./outputs/retrieved_docs_results.csv', index=False)
results_df

Unnamed: 0,Question number,Question,Retrieved FileName,Chunk Text,Page Number,Position,Similarity
0,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,LSPA.pdf,"clinical trials have shown that early mobilization in both medical and surgical critically ill patients is safe and associated with increased ventilator-free days and improved physical function at hospital discharge.22–25 Early mobilization is limited by use of deep sedation and development of delirium, which can be minimized through the use of scale- based targeted light sedation is implemented early on.26 After reviewing this literature in 2013, the Society of Critical Care Medicine (SCCM)’s",3,20,0.878212
1,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,LSPA.pdf,"sedation strategies are preferred and improve patient outcomes. Although the optimal sedative agent for ARDS patients is unclear, benzodiazepines should be avoided due to associations with oversedation, delirium, prolonged intensive care unit and hospital length of stay, and increased mortality. Minimizing sedation in patients with ARDS facilitates early mobilization and early discharge from the intensive care unit, potentially aiding in recovery from critical illness. Strategies to optimize ventilation in ARDS patients,",1,4,0.878212
2,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,LSPA.pdf,"ventilation compared with usual care.14,15 One of the key benefits to limiting sedation use in patients with ARDS may be improved ability to participate in early mobilization and rehabilitation.16 Early mobilization is particularly important in patients with ARDS as over 50% of survivors suffer from deficits in physical and cognitive function that persist for years beyond the inciting event.12,17–21 Several clinical trials have shown that early mobilization in both medical and surgical critically ill patients is safe and associated",3,19,0.878212
3,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,LSPA.pdf,"uscular blocking agents in critically ill patients. Critical care medicine. Feb; 2006 34(2): 374–380. [PubMed: 16424717] 8. Treggiari MM, Romand JA, Yanez ND, et al. Randomized trial of light versus deep sedation on mental health after critical illness. Critical care medicine. Sep; 2009 37(9):2527–2534. [PubMed: 19602975] 9. She",7,65,0.878212
4,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"significantly lower cumulative fluid balance by 5,074 mL over 7 days than FACTT Liberal. In subjects without baseline shock, in whom the fluid protocol was applied throughout the duration of the study, management with FACTT Lite resulted in an equivalent cumulative fluid balance to FACTT Conservative. FACTT Lite had similar clinical outcomes of ventilator-free days, ICU-free days, and mortality as FACTT Conservative and significantly greater ventilator-",6,57,0.858627
5,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"dobutamine infusion, fluid bolus, or furosemide administration. There are no protocol-directed instructions for management of shock. Fluid management was an important cointervention in the NIH/NHLBI ARDS Network studies following FACTT (2–4). The ARDS Network investigators developed a simplified conservative fluid protocol, FACTT Lite. FACTT Lite excluded instructions for ineffective circulation because the clinical examination findings of ineffective circulation did not correlate with",3,22,0.858627
6,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"been used in subsequent ARDS Network studies (2–4), its performance has never been formally evaluated. We retrospectively compared the performance of FACTT Lite with FACTT Conservative and FACTT Liberal. We hypothesized that the FACTT Lite protocol would be equivalent to FACTT Conservative, and more favorable than FACTT Liberal, with respect to cumulative fluid balance over 7 days, number of ventilator-free days, 60-day mortality, and prevalence",3,25,0.858627
7,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,T Liberal (11%) (p = 0.18 vs Lite). Conclusions—FACTT Lite had a greater cumulative fluid balance than FACTT Conservative but had equivalent clinical and safety outcomes. FACTT Lite is an alternative to FACTT Conservative for fluid management in Acute Respiratory Distress Syndrome. Keywords acute kidney injury; adult respiratory distress syndrome; clinical protocols; critical illness; fluid therapy; shock Conservative fluid management improves ventilator-free days and,2,15,0.858627
8,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"7. Greater baseline shock in the FACTT Lite group does not explain the observed increase in fluid balance on days 1 and 2 because similar results were observed in subjects without baseline shock. One possible explanation for these findings is lower clinician compliance with FACTT Lite than with FACTT Conservative during the first 2 study days. At the time that the FACTT study was performed, the FACTT Liberal fluid strategy represented the usual prior practice. Cumulative fluid balance",6,59,0.858627
9,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"ide dose between groups may act as a surrogate for protocol compliance. FACTT Lite was designed to capture the most commonly applied instructions from FACTT Conservative. FACTT Lite and FACTT Conservative should yield a similar mean daily furosemide dose, and both should have a significantly greater daily furosemide dose than FACTT Liberal. The FACTT Lite protocol had a higher mean daily furosemide dose than FACTT Liberal but less",7,64,0.858627
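
A quick way to read a result table like this is to tally how many retrieved chunks come from each file per question. The sketch below uses only the standard library and hard-codes the file names from the ten rows displayed above for question 1; `tally_files` is a hypothetical helper and not one of the functions imported from `conn`.

```python
from collections import Counter, defaultdict


def tally_files(rows):
    """Count retrieved chunks per file for each question number."""
    counts = defaultdict(Counter)
    for question_number, file_name in rows:
        counts[question_number][file_name] += 1
    return {question: dict(files) for question, files in counts.items()}


# (Question number, Retrieved FileName) pairs from the rows shown for question 1.
rows = [(1, "LSPA.pdf")] * 4 + [(1, "FMWSCPARDS.pdf")] * 6
summary = tally_files(rows)
print(summary)
```

With the full `results_df`, the same pairs could be taken from its `Question number` and `Retrieved FileName` columns.
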
