# GraphRAG Retrieval Evaluation

## Purpose
This notebook evaluates a document retrieval pipeline that combines vector similarity search with graph traversal to find relevant document chunks. The pipeline has five stages:

1. **Vector Search & Similarity Scoring**
 - Converts the input query into a vector embedding
 - Performs similarity search against entity nodes in Neo4j
 - Creates a sorted dictionary of entity IDs and their similarity scores
 - Filters results based on a similarity threshold (default 0.8)

2. **Graph Traversal Strategy**
```cypher
    MATCH path = (n:Chunk)-[*1..{max_hops}]->(m:`__Entity__`)
    WHERE m.id IN $ids
```
 - Finds paths from document chunks to relevant entities
 - Limits path length to control traversal depth
 - Only considers entities that met the similarity threshold

3. **Relevance Calculation**
```cypher
    WITH n, min(length(path)) as distance, m
    WITH n, distance, m.id as entity_id
    WITH n, distance, entity_id, 
            CASE 
            WHEN entity_id IN $ids 
            THEN $similarity_scores[entity_id]
            END as similarity
```
 - Calculates shortest path length to each entity
 - Preserves original similarity scores from vector search
 - Combines structural proximity (distance) with semantic similarity

4. **Result Ordering**
```cypher
    ORDER BY similarity DESC, distance
```
 - Prioritizes chunks with higher semantic similarity
 - Uses path distance as a secondary sorting criterion
 - Ensures most relevant chunks appear first

5. **Output Format**
```cypher
    RETURN n.text, n.fileName, n.page_number, n.position, entity_id, similarity
```
 - Returns comprehensive chunk metadata:
    - Text content
    - Source file name
    - Page number
    - Position in document
    - Associated entity ID
    - Similarity score
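
The five stages above can be sketched in plain Python on toy data, without a database. This is only an illustration of the logic, not the notebook's implementation: the function names, the `>=` comparison against the threshold, and the sample IDs and scores are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def filter_by_threshold(scores: Dict[str, float], threshold: float = 0.8) -> Dict[str, float]:
    """Keep entities scoring at least `threshold`, ordered by score (highest first)."""
    kept = {entity_id: s for entity_id, s in scores.items() if s >= threshold}
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))


def rank_chunks(
    paths: List[Tuple[str, str, int]],
    similarity_scores: Dict[str, float],
) -> List[Tuple[str, str, float, int]]:
    """Rank (chunk, entity) pairs by similarity (descending), then by distance.

    `paths` holds (chunk_id, entity_id, path_length) triples, standing in for
    the paths that the MATCH clause would return.
    """
    shortest: Dict[Tuple[str, str], int] = {}
    for chunk_id, entity_id, length in paths:
        if entity_id not in similarity_scores:
            continue  # mirrors WHERE m.id IN $ids
        key = (chunk_id, entity_id)
        # mirrors min(length(path)) per chunk/entity pair
        shortest[key] = min(length, shortest.get(key, length))
    rows = [(c, e, similarity_scores[e], d) for (c, e), d in shortest.items()]
    # mirrors ORDER BY similarity DESC, distance
    rows.sort(key=lambda row: (-row[2], row[3]))
    return rows


# Hypothetical entity scores and chunk-to-entity paths.
raw_scores = {"e1": 0.91, "e2": 0.84, "e3": 0.62}
kept_scores = filter_by_threshold(raw_scores, threshold=0.8)
toy_paths = [("c1", "e2", 1), ("c2", "e1", 2), ("c2", "e1", 1), ("c3", "e1", 2), ("c4", "e3", 1)]
ranked = rank_chunks(toy_paths, kept_scores)
for row in ranked:
    print(row)
```

In this toy run, `e3` falls below the threshold, so chunk `c4` is dropped; `c2` and `c3` share the same similarity and are separated by distance.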

## Key Features
- Hybrid retrieval approach combining:
    - Vector-based semantic search
    - Graph-based structural relationships
- Configurable parameters:
    - Similarity threshold
    - Maximum path length
    - Result limit
- Deduplication of chunks
- Ordered results by relevance
- Rich metadata for each chunk
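
One way the configurable parameters could fit together is sketched below. The Cypher fragments are the ones shown in the Purpose section; the clause order, the `LIMIT $limit` clause, the function name `build_chunk_query`, and the sample parameter values are assumptions added for illustration. The hop bound is formatted into the query text because, as far as Cypher's variable-length syntax goes, the bound is normally written as a literal, whereas `ids`, `similarity_scores`, and `limit` can be passed as query parameters.

```python
def build_chunk_query(max_hops: int = 2) -> str:
    """Assemble the retrieval query for a given maximum path length."""
    # Validate before interpolating: the value ends up inside the query text.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError("max_hops must be an integer >= 1")
    return (
        f"MATCH path = (n:Chunk)-[*1..{max_hops}]->(m:`__Entity__`)\n"
        "WHERE m.id IN $ids\n"
        "WITH n, min(length(path)) AS distance, m\n"
        "WITH n, distance, m.id AS entity_id\n"
        "WITH n, distance, entity_id,\n"
        "     CASE WHEN entity_id IN $ids THEN $similarity_scores[entity_id] END AS similarity\n"
        "RETURN n.text, n.fileName, n.page_number, n.position, entity_id, similarity\n"
        "ORDER BY similarity DESC, distance\n"
        "LIMIT $limit"
    )


# Hypothetical parameter values; in the notebook these would come from the vector search.
query = build_chunk_query(max_hops=2)
params = {
    "ids": ["e1", "e2"],
    "similarity_scores": {"e1": 0.91, "e2": 0.84},
    "limit": 10,
}
print(query)
```

The resulting string and parameter dictionary could then be handed to whatever query method the graph client exposes.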

## Step 0: Environment setup

In [None]:
import os

import pandas as pd
from dotenv import load_dotenv
from langchain_neo4j import Neo4jGraph

from conn import connect2Googlesheet, retrieval_rel_docs, get_avg_similarity_df, get_concatenate_df
from libs import create_vector_index, context_builder, chunk_finder, enhanced_chunk_finder

# Load variables from the .env file (by default, variables already set in the environment are not overridden)
load_dotenv()

True

In [2]:
# Connect to Neo4j database
try:
    graph = Neo4jGraph(
        url=os.getenv("NEO4J_URL"),
        username=os.getenv("NEO4J_USERNAME"),
        password=os.getenv("NEO4J_PASSWORD")
    )
    print("Connected to Neo4j database successfully.")
except ValueError as e:
    print(f"Could not connect to Neo4j database: {e}")

Connected to Neo4j database successfully.
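
Because `os.getenv` returns `None` for an unset variable, a missing entry in `.env` only surfaces when the connection is attempted. A small pre-check along the following lines can make that failure easier to read. The helper name `missing_env_vars` and the example values are hypothetical; the three variable names are the ones used in the cell above.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_VARS = ("NEO4J_URL", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# Example with a hypothetical, incomplete environment.
example_env = {"NEO4J_URL": "bolt://localhost:7687", "NEO4J_USERNAME": "neo4j"}
missing = missing_env_vars(env=example_env)
print(missing)
```

Calling `missing_env_vars()` with no arguments checks the real process environment after `load_dotenv()` has run.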


## Step 1: Create vector index

In [3]:
#create_vector_index(graph, "entities")

## Step 2: Load questions from Google Sheet

In [4]:
spreadsheet = connect2Googlesheet()

# Select the "relevance" worksheet (index 2)
worksheet = spreadsheet.get_worksheet(2)

# Get all records as a list of dictionaries
data = worksheet.get_all_records()

# Convert to Pandas DataFrame
df_MedQ = pd.DataFrame(data)
df_MedQ.head()

Unnamed: 0,condition,number,docs,Question,Mahmud's Note,status,comments,Unnamed: 8
0,ARDS,1,ACURASYS,Does early administration of neuromuscular blo...,Like,,,
1,ARDS,2,ACURASYS,Do patients with severe ARDS being treated wit...,Replace,fixed,,
2,ARDS,3,ROSE,"In patients with moderate to severe ARDS, does...",Maybe this question: In patients with moderate...,fixed,,
3,ARDS,4,ROSE,Do patients with moderate-to-severe ARDS have ...,Local question (not sure if this is the aim of...,fixed,Wrong concept since PEEP by itself is mandator...,Does the use of neuromuscular blockers in pati...
4,ARDS,5,FACTT,"Among patients with ALI/ARDS, does a conservat...",Local question (not sure if this is the aim of...,fixed,Check if studies defined conservative by CVP <...,


## Step 3: Relevance check for top K questions

The `[*1..{max_hops}]` syntax in the Cypher query defines a variable-length relationship pattern in Neo4j.

**Syntax Explanation**
- `*` indicates a variable-length path
- `1..{max_hops}` specifies the range:
    - `1` is the minimum length
    - `{max_hops}` is the maximum length (supplied as a parameter and substituted into the query text)

**Purpose**
1. Path Flexibility: It allows finding relationships between nodes that are both:
    - Directly connected (1 hop)
    - Indirectly connected (up to max_hops steps away)
2. Example with max_hops=2:
```cypher
    (Chunk)-->(Entity)           // 1 hop
    (Chunk)-->(Node)-->(Entity)  // 2 hops
```
3. Use Case in the Code:
    - The query finds chunks that are connected to relevant entities either:
        - Directly (1 relationship away)
        - Through intermediate nodes (up to `max_hops` relationships away)
    - This broadens the search context while maintaining control over the search depth

**Practical Impact**
```cypher
    // With max_hops = 1 (direct connections only)
    (Chunk)-->(Entity)

    // With max_hops = 2 (also includes indirect connections)
    (Chunk)-->(IntermediateNode)-->(Entity)
```
This flexibility is particularly useful in knowledge graphs where relevant information might be connected through intermediate concepts or relationships.
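
The effect of `max_hops` can be reproduced on a toy directed graph with a bounded breadth-first search. This is a sketch of the idea only: Neo4j evaluates the pattern itself, and the adjacency dictionary, node names, and the function `hops_to_targets` below are invented for the example.

```python
from collections import deque
from typing import Dict, List, Set


def hops_to_targets(
    graph: Dict[str, List[str]],
    start: str,
    targets: Set[str],
    max_hops: int,
) -> Dict[str, int]:
    """Return {target: shortest hop count} for targets within 1..max_hops of `start`."""
    found: Dict[str, int] = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            if neighbour in targets:
                found[neighbour] = depth + 1
            queue.append((neighbour, depth + 1))
    return found


# Hypothetical graph: chunk_a reaches entity_x directly and entity_y via node_1;
# chunk_b needs three hops to reach entity_y.
toy_graph = {
    "chunk_a": ["entity_x", "node_1"],
    "node_1": ["entity_y"],
    "chunk_b": ["node_2"],
    "node_2": ["node_3"],
    "node_3": ["entity_y"],
}
entities = {"entity_x", "entity_y"}

one_hop = hops_to_targets(toy_graph, "chunk_a", entities, max_hops=1)
two_hops = hops_to_targets(toy_graph, "chunk_a", entities, max_hops=2)
too_far = hops_to_targets(toy_graph, "chunk_b", entities, max_hops=2)
print(one_hop, two_hops, too_far)
```

Raising `max_hops` from 1 to 2 adds `entity_y` for `chunk_a`, while `chunk_b` stays unmatched until the limit reaches 3.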

In [5]:
# Set pandas display options to show the full text content
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
topk = 5  # evaluate the first 5 of the 36 questions
results_df = retrieval_rel_docs(graph, df_MedQ, top_k=topk) # Retrieve relevant documents for each question
# results_df.to_csv('./outputs/retrieved_docs_results.csv', index=False)
results_df

Unnamed: 0,Question number,Question,Retrieved FileName,Chunk Text,Page Number,Position,Similarity
0,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,LSPA.pdf,"clinical trials have shown that early mobilization in both medical and surgical critically ill patients is safe and associated with increased ventilator-free days and improved physical function at hospital discharge.22–25 Early mobilization is limited by use of deep sedation and development of delirium, which can be minimized through the use of scale- based targeted light sedation is implemented early on.26 After reviewing this literature in 2013, the Society of Critical Care Medicine (SCCM)’s",3,20,0.878212
1,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"significantly lower cumulative fluid balance by 5,074 mL over 7 days than FACTT Liberal. In subjects without baseline shock, in whom the fluid protocol was applied throughout the duration of the study, management with FACTT Lite resulted in an equivalent cumulative fluid balance to FACTT Conservative. FACTT Lite had similar clinical outcomes of ventilator-free days, ICU-free days, and mortality as FACTT Conservative and significantly greater ventilator-",6,57,0.858627
2,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,FMWSCPARDS.pdf,"be easily understood and implemented by physician and nursing staff in the ICU. Conclusions Although the FACTT Lite protocol had a greater cumulative fluid balance than FACTT Conservative, the results of our study indicate that the FACTT Lite protocol is safe and has equivalent ventilator-free days, ICU-free days, acute kidney injury, and adjusted 60-day mortality to FACTT Conservative. FACTT Conservative has improved ventilator-free days,",8,75,0.849876
3,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,ARDSSRDRFMS.pdf,"subphenotypes have remarkably similar prevalence across the cohorts, and similar natural histories and clinical outcomes. Table 3. Clinical Outcomes by ARDS Subphenotype Subphenotype 1 (n = 727) Subphenotype 2 (n = 273) P Value 60-d mortality, % 21 44 ,0.0001 90-d mortality, % 22 45 ,0.0001 Ventilator-free days, median 19 3 ,0.0001 Deﬁnition of abbreviation:",6,93,0.844384
4,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,ETSDMV.pdf,"[17,18,22,23]. The main side effects of dexmedetomidine are bradycardia, hypotension and the potential for withdrawal symptoms upon discontinuation of long-term therapy [17,18]. When compared to other sedatives, dexmedetomidine has been shown to result in a more awake and interactive patient, a lower incidence of delirium, more ventilator free days, and less days in the ICU [17–",3,26,0.843922
5,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,ESCNBC.pdf,"undergoing at least 2 days and up to 3 consecutive days of NMBAs (NMBA treatment), within 48 h from commencement of IMV were compared with subjects who did not receive NMBAs or only upon commence- ment of IMV (control). The primary objective in the PS-matched cohort was comparison between groups in 90-day in-hospital mortality, assessed through Cox proportional hazard modeling. Secondary objectives were comparisons in the numbers of ventilator-free days (",1,8,0.843073
6,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,ETSDMV.pdf,"▪ Randomized placebo-controlled clinical trial of low dose nighttime dexmedetomidine to prevent ICU delirium. Notable for being one of the few interventions shown to prevent ICU delirium. 30. Kawazoe Y, Miyamoto K, Morimoto T, et al. Effect of dexmedetomidine on mortality and ventilator-free days in patients requiring mechanical ventilation with sepsis: a randomized clinical trial effect of dexmed",8,82,0.843073
7,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,SPICE III.pdf,"mentary Appendix). Discussion In this randomized, controlled, open-label trial, the use of dexmedetomidine as the primary or sole sedative in patients undergoing mechanical ventilation in the ICU did not result in lower 90- day mortality than usual care. Early in the course of the critical illness, most patients who were treated with dexmedetomidine received supple- mental sedatives. Although the target level of light sedation was observed more frequently in the dexmedetomidine group, deep sedation was frequently reported in the two groups. The num- ber of days that patients were free from coma or delirium and the number of ventilator-free days were 1 day more in the dexmedetomidine group than in the usual-care group for each of the comparisons; the confidence intervals for the between-group differences did not include zero but were unadjusted for multiple comparisons.",8,50,0.843073
8,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,ACURASYS.pdf,"atracurium on the 90- day survival rate was confined to the two thirds of patients presenting with a PaO2:FiO2 ratio of less than 120. Among these patients, the 90-day mor- tality was 30.8% in the cisatracurium group and 44.6% in the control group (P = 0.04) (Fig. 2 in the Supplementary Appendix). The absolute difference in 28-day mortality (mortality in the cisatracurium group minus mortality in the placebo group) was −9.6 percentage points (95% CI, −19.2 to −0.2; P = 0.05) (Table 3). The cisatracurium group had significantly more ventilator-free days than the placebo group during The New England Journal of Medicine is produced by NEJM Group, a division of the Massachusetts",5,36,0.843073
9,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,APROCCHSS.pdf,"-free days to day 28 was signifi- cantly higher in the hydrocortisone-plus-fludrocortisone group than in the placebo group (17 vs. 15 days, P<0.001), as was the number of organ-failure–free days (14 vs. 12 days, P = 0.003). The number of ventilator-free days was similar in the two groups (11 days in the hydrocortisone-plus-fludrocortisone group and 10 in the placebo group, P = 0.07). The rate of serious adverse events did not differ significantly between the two groups, but hyperglycemia was more common in hydrocortisone-plus-fludrocortisone group. CONCLUSIONS In this trial involving patients with septic shock, 90-day all-cause mortality was lower among those who",1,5,0.843073


In [6]:
analysis_df = get_avg_similarity_df(results_df)
analysis_df

Unnamed: 0,Question Number,Question,Retrieved Files,Avg Similarity
0,1,Does early administration of neuromuscular blocking agents increases the ventilator free days?,"[ETSDMV.pdf, APROCCHSS.pdf, ESCPARDS.pdf, ESCNBC.pdf, ARDSSRDRFMS.pdf, ACURASYS.pdf, SPICE III.pdf, LSPA.pdf, ARDSNet.pdf, DPSMVAS.pdf, FMWSCPARDS.pdf]",0.845444
1,2,Do patients with severe ARDS being treated with neuromuscular blocking agents have increased muscle weakness?,"[CHEST.pdf, ACURASYS.pdf, NBSARDS.pdf, LSPA.pdf, EDPMARDSLPMV.pdf, CEIIUPPSARDS.pdf, ROSE.pdf]",0.869704
2,3,"In patients with moderate to severe ARDS, does early use of continuous neuromuscular blockade improve mortality?","[CHEST.pdf, ESCPARDS.pdf, ACURASYS.pdf, ENB.pdf, NBSARDS.pdf, LSPA.pdf, EDPMARDSLPMV.pdf, CEIIUPPSARDS.pdf, ROSE.pdf, TOF-ARDS.pdf]",0.857568
3,4,Do patients with moderate-to-severe ARDS have a significance difference in mortality rate beween patients who recieved an early and continous cisatracurium infusion than those with usual care approach with lighter sedation targets?,"[CHEST.pdf, ESCPARDS.pdf, ACURASYS.pdf, ENB.pdf, LSPA.pdf, NBSARDS.pdf, EDPMARDSLPMV.pdf, CEIIUPPSARDS.pdf, ROSE.pdf]",0.823973
4,5,"Among patients with ALI/ARDS, does a conservative fluid management strategy improves lung function and decrease ventilator days compared to liberal strategy?","[ACURASYS.pdf, FACTT.pdf, OSCILLATE.pdf, FMWSCPARDS.pdf, ROSE.pdf, TOF-ARDS.pdf]",0.816809
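
The source of `get_avg_similarity_df` is not shown in this notebook. Judging only from the columns of the two tables, a per-question summary of this shape could be produced roughly as follows; the function name `average_similarity` and the toy rows are assumptions, and the real helper may differ.

```python
import pandas as pd


def average_similarity(results: pd.DataFrame) -> pd.DataFrame:
    """Summarise retrieval results per question: unique files and mean similarity."""
    rows = []
    for (number, question), group in results.groupby(["Question number", "Question"], sort=True):
        rows.append(
            {
                "Question Number": number,
                "Question": question,
                # de-duplicate file names so each source appears once per question
                "Retrieved Files": sorted(set(group["Retrieved FileName"])),
                "Avg Similarity": group["Similarity"].mean(),
            }
        )
    return pd.DataFrame(rows)


# Hypothetical rows with the same column names as results_df.
toy_results = pd.DataFrame(
    {
        "Question number": [1, 1, 2],
        "Question": ["q1", "q1", "q2"],
        "Retrieved FileName": ["b.pdf", "a.pdf", "a.pdf"],
        "Similarity": [0.9, 0.8, 0.7],
    }
)
summary = average_similarity(toy_results)
print(summary)
```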
