# GraphRAG pipeline

Developed by Roberto Giordano

## Introduction

This project is based on the research paper titled *"From Local to Global: A Graph RAG Approach to Query-Focused Summarization"* created by Microsoft. The primary objective of this work is to explore and implement the methodologies described in the paper, focusing on leveraging Graph Retrieval-Augmented Generation (Graph RAG) techniques which are able to capture relationships and context across multiple sources, enabling richer and more insightful responses.

### Problem Statement
Modern summarization and QA systems struggle with repository-level questions that require the synthesis of information distributed across multiple documents. The Graph RAG approach aims to overcome this limitation by modeling entities and their relationships through graph structures, facilitating complex information retrieval and generation tasks.


### Objectives
1. Implement a Graph RAG framework for repository-level questions.
2. Use datasets derived from arXiv papers, ensuring coherence with the multimodal module.
3. Develope all the tool with the possibility to switch between closed-sources (openai) and open-sources models (ollama).
4. Demonstrate the framework through some example questions.

### Budget and Tools
The framework is built using state-of-the-art technologies such as Neo4j, Milvus, and LangGraph for robust and scalable performance.

Each document extraction and integration costs ~0.06$-0.08$ using gpt-4o-mini.

This version indeed, differs from the paper since uses also a vector database to enhance retrieval performances and set the bases for futre developments and improvements.



# Setup and imports

In [3]:
# !pip install neo4j
# !pip install plotly
# !pip install langchain
# !pip install PyPDF2
# !pip install tiktoken
# !pip install openai  # Only if you want to use the OpenAI API
# !pip install transformers  # For open (HF) models
# !pip install sentence_transformers
# !pip install -U langchain-community
# !pip install -qU  langchain_milvus
# !pip install -U langchain-ollama
# !pip install graphdatascience
# For advanced community detection with Leiden, you might need external libraries (e.g., igraph, networkx, etc.).


In [4]:
#!ollama pull llama3.1

In [5]:
import os
from typing import List, Dict, Any
import tqdm
import concurrent.futures

# -----------------------
# Neo4j Database imports
# -----------------------
from neo4j import GraphDatabase

# -----------------------
# LLM / Embeddings imports
# -----------------------

# LangChain for retrieval + QA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter


from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama


from langchain_milvus import Milvus
from uuid import uuid4
from pymilvus import MilvusClient

from langchain.schema import Document
from langchain.schema import HumanMessage, SystemMessage

# to use OpenAI:
import openai
openai_model="gpt-4o-mini"

# -----------------------
# Load environment variables
import dotenv
dotenv.load_dotenv()

# -----------------------
# ArXiv API
# -----------------------
import arxiv

# -----------------------
# PDF Parsing library
# -----------------------
import PyPDF2  # or "pypdf" if needed

# -----------------------
# Plotly for visualization
# -----------------------

from neo4j import GraphDatabase
import networkx as nx
import plotly.graph_objects as go
import matplotlib.colors as mcolors

In [6]:
#############################################
# 1) CONFIGURATION: toggle open vs. OpenAI
#############################################

USE_OPENAI = True  # Set to True if you want to switch to OpenAI’s ChatGPT
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# For Neo4j:
NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USER = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Pipeline functions

## 2.1 Source Documents → Text Chunks

**Objective**  
The goal is to divide the original documents (referred to as "source documents") into shorter text chunks before proceeding with the extraction of entities and relationships from these smaller text segments.

If the chunks are too long, less splitting work is required (resulting in fewer calls to the LLM); however, the LLM struggles to maintain high levels of recall and precision. In other words, larger context windows tend to "lose" more references (the text cites studies by Kuratov et al., 2024 and Liu et al., 2023 in support of this observation).

Conversely, if the chunks are too short, more calls to the LLM will be necessary, but the likelihood of accurately extracting a greater number of entities improves.



In [7]:
##################################################
# 2.1 SOURCE DOCUMENTS → TEXT CHUNKS
##################################################

def parse_pdf(pdf_path: str) -> str:
    """
    Extract raw text from a PDF file using PyPDF2.
    """
    text = ""
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            text += page.extract_text() + "\n"
    return text

def find_metadata(doc_id: str) -> Dict[str, Any]:
    """
    (Step 2.1) Retrieve metadata for a document from a database or API.
    """
    client = arxiv.Client()
    search = arxiv.Search(
        id_list=[doc_id]
    )

    try:
        result = next(client.results(search))
        return \
            {"title": result.title, 
            "summary": result.summary, 
            "url": result.entry_id,
            "authors": ', '.join([a.name for a in result.authors]),
            "categories": ', '.join(result.categories)
            }
    except StopIteration:
        return {}

def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 100) -> List[str]:
    """
    (Step 2.1) Split text into chunks. 
    Following the guidance in 2.1, we use a smaller chunk size (e.g., ~600 tokens).
    This can improve entity recall at the cost of more LLM calls.
    """
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    chunks = text_splitter.split_text(text)
    return chunks

Define an embedder model to populate in a second step a vector database to enhance retrieving and lower the costs.

In [8]:
##################################################
# EMBEDDING UTILITIES
##################################################

def get_hf_embedding_function(model_name: str = "sentence-transformers/all-MiniLM-L6-v2", device: str = "mps", USE_OPENAI: bool = False):
    """
    Returns a function that can generate embeddings using a HuggingFace model.
    """
    if not USE_OPENAI:
        hf_embed = HuggingFaceEmbeddings(model_name=model_name, model_kwargs={'device': device})
        return hf_embed
    else:
        def _embeddings(texts: List[str], model_name: str = "text-embedding-ada-002") -> List[List[float]]:
            response = openai.Embedding.create(
                input=texts,
                model=model_name
            )
            embeddings = [item["embedding"] for item in response["data"]]
            return embeddings
        return _embeddings


In [9]:
##################################################
# VECTORSTORE UTILITIES
##################################################

def init_milvus_db(collection_name: str, uri: str, embedding_function):
    """
    Initialize the Milvus database if it does not exist.
    """
    if not os.path.exists(uri):
        print(f"Creating database at {uri}")

    vectorstore = Milvus(
        collection_name=collection_name,
        embedding_function=embedding_function,
        connection_args={"uri": uri},
    )

    return vectorstore

# Function to add multiple documents to Milvus
def add_documents_to_milvus(docs: list, embedding_function, collection_name: str = "rag_milvus", uri: str = "./vector_db_graphRAG/milvus_ingest.db"):
    """
    Add multiple documents to the Milvus vector store.

    Args:
        docs (list): List of tuples containing text and metadata.
        collection_name (str): Name of the Milvus collection.
        uri (str): URI for the Milvus database.
    
    # Example usage
    docs = [
        {"text": "Chunk 1 of the document", "metadata": {"doc_id": "doc_1", "chunk": 1}},
        {"text": "Chunk 2 of the document", "metadata": {"doc_id": "doc_1", "chunk": 2}}
    ]

    add_documents_to_milvus(docs)
    """
    # Initialize the database
    vectorstore = init_milvus_db(collection_name, uri, embedding_function)

    # Prepare documents
    document_list = []
    for doc in docs:
        text = doc.get("text", "")
        metadata = doc.get("metadata", {})
        document_list.append(Document(page_content=text, metadata=metadata))
    
    uuids = [str(uuid4()) for _ in range(len(document_list))]

    # Add documents to the vector store
    vectorstore.add_documents(document_list, ids=uuids)

    return vectorstore

# Function to search for similar documents in Milvus
def search_milvus(query: str, vectorstore, top_k: int = 5):
    """
    Search for similar documents in the Milvus vector store.

    Args:
        query (str): Text to search for.
        vectorstore: Milvus vector store.
        top_k (int): Number of similar documents to return.
    """
    return vectorstore.similarity_search_with_score_by_vector(query, k=top_k)



## 2.2 Text Chunks → Element Instances
**Objective**  
Given the chunk, the goal is to extract:  
- **Nodes**: Entities, including name, type, description, etc.  
- **Edges (Relationships)**: Connections between two or more mentioned entities, along with a description of the relationship type.

**Technique**  
A "multipart" prompt (structured in multiple parts) is provided to the LLM to extract both entities (node instances) and relationships (edge instances) in a single response. The prompt produces a list of delimited tuples, enumerating the detected entities and relationships.

Few-shot examples specific to the target domain (e.g., medical, legal, scientific) can be supplied to the LLM to guide its focus and improve extraction quality.

**Covariates**  
If additional information needs to be associated with an entity (e.g., start/end dates, claims, source text), a secondary extraction prompt is used. This approach enriches each node not only with "name and description" but also with supplemental metadata (claims, time periods, etc.).

**Gleanings (Iterative Extractions)**  
To address potential omissions during a single extraction pass, multiple rounds of extraction can be performed (up to a specified limit).  
- In each round, the LLM checks for any missing entities.  
- If entities are found to be missing, an "additional" extraction is triggered, using a logit bias that forces the model to respond with a yes/no prompt.  
- If the response indicates missing entities, a prompt is issued to re-extract them.

This iterative approach allows the use of larger chunks without sacrificing quality, as successive passes help "recover" any entities initially skipped.



In [10]:
##################################################
# 2.2. Text Chunks → Element Instances
##################################################

def extract_element_instances_from_chunk(
    chunk_text: str,
    gleaning_rounds: int = 1,
    USE_OPENAI: bool = True,
    local_llm: str = "llama3.1"
) -> List[Dict[str, Any]]:
    """
    (Step 2.2) Use an LLM prompt to identify entity references, relationships, and covariates.
    - Identifies entities (name, type, description) and relationships.
    - Supports multiple rounds of "gleanings" to find any missed entities.
    """
    extracted_elements = []
    not_parsed = []

    # Select the LLM to use
    if USE_OPENAI:
        llm = ChatOpenAI(model=openai_model, temperature=0.0)
    else:
        llm = ChatOllama(model=local_llm, temperature=0)

    # Base prompt for extracting entities and relationships
    base_prompt = (
        "Extract entities and relationships from the following text. "
        "For each entity, provide its name, type, and description. "
        "For each relationship, provide the source entity, target entity, and description. "
        "Text: \n{chunk_text}\n"
        "Output format: List of dictionaries with keys 'entity_name', 'entity_type', 'entity_description', 'relationship'. "
        "Follow this format in the example: [{{\"entity_name\": \"Alice\", \"entity_type\": \"Person\", \"entity_description\": \"A person of interest.\", \"relationship\": {{\"source_entity\": \"Alice\", \"target_entity\": \"Bob\", \"description\": \"Knows\"}}}}]"
        "Return just the list, so that we can parse it."
        "No bullet list or asterisks needed."
        "It must be a unique list, do NOT separate entities and relationships in different lists."
    )

    # Loop through gleaning rounds
    for round_num in range(gleaning_rounds):
        prompt = base_prompt.format(chunk_text=chunk_text)

        # Send prompt to the selected LLM
        response = llm.invoke([HumanMessage(content=prompt)])

        # Parse the LLM response
        new_elements = response.content
        new_elements = new_elements.replace('```json','').replace('```','')

        # Assume the response is already in JSON format
        try:
            new_elements = eval(new_elements)  # Convert string to list of dicts
        except Exception as e:
            print(f"Error parsing LLM output: {e}")
            not_parsed.extend(new_elements)
            new_elements = []

        # Add new elements to the result
        extracted_elements.extend(new_elements)

        # Check if gleaning is needed (e.g., ask LLM if entities were missed)
        if round_num < gleaning_rounds - 1:
            print("Asking for validation...")
            validation_prompt = (
                "Were any entities or relationships missed in the previous extraction? "
                "Answer 'Yes' or 'No'."
            )
            validation_response = llm.invoke(
                [ HumanMessage(content=validation_prompt)], 
                chat_history=
                [HumanMessage(content=prompt), SystemMessage(content=new_elements)]
            )

            # If LLM says 'No', break early
            if 'No' in validation_response.content:
                break

    # Return all extracted elements
    return extracted_elements, not_parsed


## 2.3 Element Instances → Element Summaries

**Objective**  
The aim is to transition from the level of "entity and relationship instances" to more compact summaries describing each element. Essentially, this involves a second "re-synthesis":  
- For each entity, all descriptions from the original chunks are merged.  
- For each relationship, the same applies: if multiple chunks or extraction passes describe the same relationship, the information is aggregated into a single summary.

**Key Aspect: The LLM as Summarizer**  
The LLM not only performs mechanical extraction but can also abstract information and capture implicit relationships. This step effectively represents a form of "abstractive summarization."

**Potential Duplication**  
If the LLM extracts the same entity with different names or formats across chunks, there is a potential risk of duplicates. However, the following section (community detection) can merge and consolidate highly similar entities if they share relationships with a "core" set of other entities.

**Advantage**  
Having rich textual descriptions in the nodes (instead of just RDF triples in subject-predicate-object format) facilitates a "global query" approach using the LLM, as the text is not constrained to rigid formats.



In [11]:
##################################################
# 2.3 ELEMENT INSTANCES → ELEMENT SUMMARIES
##################################################

summarization_instances_prompt = """
You are an expert in text summarization and knowledge graph construction. I will provide you with:

1. A list of initial nodes, each containing:
   - A unique identifier
   - A textual description extracted from the source
   - Additional properties (optional)

2. A list of relationships between these nodes (e.g., an entity "Einstein" related to another entity "Relativity").

Your task is to:
- Identify when multiple nodes actually refer to the same entity or concept.
- Generate *summarized nodes* by consolidating their textual descriptions and removing duplicates or near-duplicates.
- Maintain references to each node's original ID within your summarized node.
- Create new relationships among these summarized nodes that reflect the original relationships, but merged and simplified where appropriate.

**Important requirements and format details:**
1. Each summarized node should have:
   - A `summary` field with the merged description.
   - A list of `original_ids` that were merged into this new summary node.
   - Any relevant `type` or `label` (e.g., Person, Theory, Location) if it can be inferred from the text.
   - (Optional) A short list of `keywords` extracted from the descriptions.

2. Each relationship should:
   - Include `source` and `target` references to the new summarized nodes.
   - Provide a `relation_type` (e.g., "INVENTED", "WORKS_ON", "LOCATED_IN", etc.).
   - Have a `weight` or `relevance_score` if it can be inferred (e.g., frequency or importance).
   - (Optional) Include an `original_relationships` list indicating which original relationships were merged.

3. Return the final data in **JSON** format, containing two top-level keys: `summarized_nodes` and `summarized_relationships`.

4. Be concise but ensure the summaries and relationships accurately capture the original meaning.

---

### **Here is the initial Nodes and Relationships**:

{initial_data}

---

### **Instructions to the LLM**:
1. **Identify duplicates or near-duplicates** (e.g., "Albert Einstein" and "Einstein" might refer to the same entity).
2. **Create a new summarized node** that merges the descriptions of "N1" and "N2" if they represent the same entity (in this case, Albert Einstein).
3. **Consolidate relationships** so that if multiple original relationships lead to the same concept, you unify them into a single relationship with an updated weight (e.g., sum or average of the original).
4. Provide your final answer in the following **JSON** structure:

```json
{{
  "summarized_nodes": [
    {{
      "title": "NewTitle1",
      "summary": "Your merged summary text here...",
      "original_ids": ["ExampleID1", "ExampleID2", ...],
      "type": "Person",
      "keywords": ["Einstein", "relativity", "physics"]
    }},
    {{
      "title": "NewTitle2",
      "summary": "Your summary text here...",
      "original_ids": ["ExampleID3", "ExampleID4"],
      "type": "Theory",
      "keywords": ["relativity", "physics"]
    }}
  ],
  "summarized_relationships": [
    {{
      "source": "NewTitle1",
      "target": "NewTitle2",
      "relation_type": "DEVELOPED_OR_ASSOCIATED_WITH",
      "weight": 5,
      "original_relationships": ["N1->N3(DEVELOPED)", "N2->N3(ASSOCIATED_WITH)"]
    }}
  ]
}}
```

Please **only** output valid JSON in the format described above, without additional commentary so that we can parse it correctly. 
Make sure to capture the essence of each original node and relationship in your summarized version.
"""

In [12]:
def summarize_element_instances(
    element_instances: List[Dict[str, Any]], 
    USE_OPENAI: bool = True,
    local_llm: str = "llama3.1") -> str:
    """
    (Step 2.3) Summarize extracted nodes/relationships into a single descriptive block of text
    for each chunk. This is an additional LLM-based summarization step, forming "element summaries."
    """
    # Select the LLM to use
    if USE_OPENAI:
        llm = ChatOpenAI(model=openai_model, temperature=0.0)
    else:
        llm = ChatOllama(model=local_llm, temperature=0)

    prompt = summarization_instances_prompt.format(initial_data=element_instances)
    response = llm.invoke([HumanMessage(content=prompt)])
    response = response.content

    try:
        response = response.replace('```json','').replace('```','')
        response = eval(response)
    except Exception as e:
        print(f"Error parsing LLM output: {e}")
    
    return response

In [13]:
def store_element_summary_in_graph(tx, data: Dict[str, Any], doc_id: str, chunks_bounds: tuple):
    """
    Load the summarized graph data into Neo4j.

    Args:
        tx: Neo4j transaction object.
        data (Dict[str, Any]): Summarized nodes and relationships.
        doc_id (str): Document ID.
        chunks_bounds (tuple): Tuple containing the start and end positions
            of the text chunk in the original document.
    """
    
    # Creazione dei nodi
    for node in data["summarized_nodes"]:
        query_create_node = """
        CREATE (n:SummarizedNode {
            title: $title,
            summary: $summary,
            original_ids: $original_ids,
            type: $type,
            keywords: $keywords,
            doc_id : $doc_id,
            chunks_lower_bound: $chunks_lower_bound,
            chunks_upper_bound: $chunks_upper_bound
        })
        """
        tx.run(
            query_create_node,
            title=node.get("title"),
            summary=node.get("summary"),
            original_ids=node.get("original_ids"),
            type=node.get("type"),
            keywords=node.get("keywords"),
            doc_id=doc_id,
            chunks_lower_bound=chunks_bounds[0],
            chunks_upper_bound=chunks_bounds[1]
        )

    # Creazione delle relazioni
    for rel in data["summarized_relationships"]:
        query_create_rel = f"""
        MATCH (source:SummarizedNode {{title: $source_id}})
        MATCH (target:SummarizedNode {{title: $target_id}})
        CREATE (source)-[:RELATIONSHIP_TYPE {{
            type : $relation_type,
            weight: $weight,
            original_relationships: $original_rels
        }}]->(target)
        """
        tx.run(
            query_create_rel,
            source_id=rel["source"],
            target_id=rel["target"],
            relation_type=rel["relation_type"],
            weight=rel.get("weight", 1),  # default=1 if not provided
            original_rels=rel.get("original_relationships", [])
        )

## 2.4 Element Summaries → Graph Communities

**Graph Construction**  
At this stage, we build a homogeneous, undirected graph with:  
- **Nodes** = Entities (with associated claims, if available)  
- **Edges** = Relationships (with weights representing, for instance, how often a specific relationship is observed across different texts)  

**Community Detection**  
A community detection algorithm (specifically Leiden, Traag et al., 2019) is applied to partition the nodes into hierarchical groups (“communities”). A community groups nodes that are more strongly connected to each other than to the rest of the graph.  

**Key Feature**  
Leiden is highly efficient for large-scale graphs and supports hierarchical partitioning, allowing multiple levels of clustering.  

At each “level” of this hierarchy, different granularities of grouping are produced—from broad macro-communities to more specific micro-communities.

In [14]:
##################################################
# 2.4 ELEMENT SUMMARIES → GRAPH COMMUNITIES
##################################################

class CommunityDetection:
    def __init__(self, driver: Any = None, uri: str = '', user: str = '', password: str = ''):
        """
        Initialize the CommunityDetection class with a Neo4j driver or connection details.

        Args:
            driver (Any): An existing Neo4j driver instance. If provided, `uri`, `user`, and `password` are ignored.
            uri (str): The URI for the Neo4j database.
            user (str): The username for the Neo4j database.
            password (str): The password for the Neo4j database.
        """
        # If a driver is provided, use it; otherwise, create a new driver instance
        self.driver = driver or GraphDatabase.driver(uri, auth=(user, password))
        self.graph_name = 'summarizedGraph'

    def close(self):
        self.driver.close()

    def project_graph(self, relationship_weight_property: str = "weight") -> None:
        """
        Projects the graph into memory for analysis.
        """
        with self.driver.session() as session:
            session.run(
                f"""
                CALL gds.graph.project(
                    '{self.graph_name}',
                    'SummarizedNode',
                    {{ RELATIONSHIP_TYPE: {{ orientation: 'UNDIRECTED', properties: ['{relationship_weight_property}'] }} }}
                )
                """
            )


    def set_communities(self, relationship_weight_property: str = "weight") -> List[Dict[str, Any]]:
        """
        Sets the community IDs directly into the graph database.
        """
        with self.driver.session() as session:
            result = session.run(
                f"""
                CALL gds.leiden.stream(
                    '{self.graph_name}', 
                    {{ relationshipWeightProperty: '{relationship_weight_property}' }})
                YIELD nodeId, communityId
                SET gds.util.asNode(nodeId).communityId = communityId
                """
            )

            return True
        
        return False
        
    def retrieve_communities(self) -> List[Dict[str, Any]]:
        """
        Retrieve the community assignments from the graph.
        """
        with self.driver.session() as session:
            result = session.run(
                f"""
                MATCH (n:SummarizedNode)-[r:RELATIONSHIP_TYPE]->(m:SummarizedNode)
                RETURN
                    n.communityId AS communityId,
                    n.title AS nodeTitle,      
                    n.summary AS nodeSummary,
                    n.keywords AS nodeKeywords,
                    r.type AS relationshipType,
                    r.weight AS relationshipWeight,
                    m.title AS targetNodeTitle
                """
            )

            communities = {}
            for record in result:
                community_id = record["communityId"]
                node_title = record["nodeTitle"]
                node_summary = record["nodeSummary"]
                node_keywords = record["nodeKeywords"]
                relationship_type = record["relationshipType"]
                relationship_weight = record["relationshipWeight"]
                target_node_title = record["targetNodeTitle"]
                
                if community_id not in communities:
                    communities[community_id] = {"nodes": [], "relationships": []}
                
                communities[community_id]["nodes"].append({
                    "title": node_title,
                    "summary": node_summary,
                    "keywords": node_keywords
                })
                
                communities[community_id]["relationships"].append({
                    "source": node_title,
                    "target": target_node_title,
                    "type": relationship_type,
                    "weight": relationship_weight
                })

            return communities

    def drop_graph(self) -> None:
        """
        Drops the graph from memory.
        """
        with self.driver.session() as session:
            session.run(f"CALL gds.graph.drop('{self.graph_name}') YIELD graphName")


## 2.5 Graph Communities → Community Summaries

**Objective**  
The goal is to create textual reports or summaries for each identified community. This approach allows users (or systems) to navigate the overall structure of the graph:  
- At a high level, users can view macro-themes (top-level communities).  
- If something is of interest, they can drill down to more granular “child” communities.  

**Utility**  
Even in the absence of a specific query, community summaries help to "understand" the overall content of a textual corpus, functioning like a conceptual map.  
When a global query is posed, these summaries serve as an index for extracting the final answer.  

**Scalability**  
For very large datasets, multi-level hierarchical summaries provide a scalable method for sensemaking.

**Vectorial Search**    
This step moreover gives the possibility to embed this new summaries and perform a custom retrieval like in a normal RAG pipeline.

In [15]:
##################################################
# 2.5 GRAPH COMMUNITIES → COMMUNITY SUMMARIES
##################################################

summary_community_prompt = """
You are an expert summarizer helping to create a concise “community report” from a list of related nodes. Each node has a title, a summary, and keywords. All these nodes belong to the same community, meaning they share a common theme, topic, or set of closely related ideas. Moreover, you are provided with a list of relationships between these nodes, indicating how they are connected or interact with each other.

Here is the list of nodes for this community:

{nodes}

And here are the relationships between these nodes:

{relationships}

Please read through the nodes and relationships and then produce a coherent summary describing:
1. The main topics, themes, or domains covered by the nodes in this community.
2. Any notable or central nodes and why they are important.
3. How the nodes interrelate: highlight significant relationships and mention if there are strong connections (high weight).
4. Overall, what makes this community distinct or interesting?

- Aim for a concise yet informative text, written in a clear paragraph style.
- You may group related nodes together and mention prominent links or patterns.
- Use plain English. Avoid overly technical language unless it is necessary to describe the domain.

4. Provide your final answer in the following **JSON** structure:

```json
{{
    "title": a title capable to summarize the community,
    "community_summary": a single, cohesive summary that helps a reader quickly understand the core content of this community. You do NOT need to repeat every node’s individual summary verbatim. Instead, synthesize the most relevant information into a unified overview.
    "keywords": a list of keywords that capture the main topics or themes of this community.
}}
```

Please **only** output valid JSON in the format described above, without additional commentary so that we can parse it correctly. 

Begin now.
"""

In [16]:
def summarize_communities(
    communities: Dict[int, List[Dict[str, Any]]],
    USE_OPENAI: bool = True,
    local_llm: str = "llama3.1",
) -> Dict[int, str]:
    """
    (Step 2.5) Summarize each community (or sub-community in a hierarchical approach).
    - Gather all element summaries (nodes, edges, covariates) in that community.
    - Summarize them, potentially chunking if they don't fit in an LLM context window.

    Args:
        communities (dict): A dictionary where keys are community IDs and values are lists of nodes with 'title', 'summary', and 'keywords' and a list of relationships with 'source', 'target', 'type', and 'weight'.
        USE_OPENAI (bool): Whether to use OpenAI or a local LLM.
        local_llm (str): The name of the local LLM model.

    Returns:
        dict: Summaries for each community.
    """
    # Select the LLM to use
    if USE_OPENAI:
        llm = ChatOpenAI(model=openai_model, temperature=0.0)
    else:
        llm = ChatOllama(model=local_llm, temperature=0)

    def process_community(community_id, data):
        try:
            # Extract nodes and relationships
            nodes = data["nodes"]
            relationships = data["relationships"]

            # Prepare prompt content
            node_descriptions = "\n".join([f"Title: {node['title']}, Summary: {node['summary']}, Keywords: {', '.join(node['keywords'])}" for node in nodes])
            relationships = "\n".join([f"Source: {rel['source']}, Target: {rel['target']}, Type: {rel['type']}, Weight: {rel['weight']}" for rel in relationships])
            prompt = summary_community_prompt.format(nodes=node_descriptions, relationships=relationships)

            # Generate summary
            response = llm.invoke([HumanMessage(content=prompt)])
            response = response.content.replace('```json','').replace('```','')

            return community_id, eval(response)
        except Exception as e:
            return community_id, [f"Error generating summary: {str(e)}", response]

    # Use ThreadPoolExecutor for parallel processing
    summaries = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Prepare data for processing
        futures = {executor.submit(process_community, community_id, data): community_id for community_id, data in communities.items()}

        # Process results as they complete
        for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
            community_id, result = future.result()
            summaries[community_id] = result


    return summaries




## 2.6 Community Summaries → Community Answers → Global Answer

**Scenario**  
When a user poses a broad question, it is necessary to treat the community summaries as a knowledge base.  

**Multi-Stage Procedure**  

1. **Preparation**  
   - Community summaries (potentially at a specific hierarchical level) are segmented into predefined-sized chunks to avoid exceeding the LLM’s input context limits.  
   - Chunks are shuffled randomly to prevent relevant information from being concentrated in a single block, minimizing the risk of being “lost.”  

2. **Generation of Intermediate Answers**  
   - For each chunk, the LLM generates both a local response to the query and a utility score (from 0 to 100) indicating how useful that response is for the question.  
   - Responses with a score of 0 are discarded.  

3. **Reduction to a Global Answer**  
   - Responses are ranked by utility score, from highest to lowest.  
   - The content of the top responses is concatenated iteratively into a final context until the token limit is reached.  
   - The LLM is then tasked with generating the global answer using this consolidated context.  

**Advantage**  
The global answer does not rely on a single prompt with limited context. Instead, it integrates information from multiple chunks (across communities or sub-communities). This method also incorporates a filtering mechanism based on the perceived utility of information, enabling iterative refinement.

In [1]:
##################################################
# MAIN FUNCTION: 2.6 COMMUNITY SUMMARIES → COMMUNITY ANSWERS → GLOBAL ANSWER
###################################################

# -------------------------------------------------------------------------
# UTILITY: Prompt to get partial answer + helpfulness score
# -------------------------------------------------------------------------
PARTIAL_ANSWER_PROMPT = """\
You have a user query:
\"\"\"{user_query}\"\"\"

Below is a chunk of text from a community summary that may or may not be relevant to the query:
\"\"\"{chunk_text}\"\"\"

1) Please provide a concise partial answer (if any) relevant to the query, based on the chunk above.
   If the chunk is irrelevant, you can say "No relevant info here."
2) Provide a helpfulness score from 0 to 100 (integer), indicating how much this chunk helps answer the query. 
   0 = not relevant at all, 100 = extremely helpful.

Output your response **only** in valid JSON format like:
{{
  "partial_answer": "...",
  "helpfulness_score": ...
}}
"""

In [18]:
def answer_query_from_communities(
    user_query: str,
    community_summaries: Dict[int, str],
    USE_OPENAI: bool = True,
    local_llm: str = "llama3.1",
    max_context_tokens: int = 1000
) -> str:
    """
    (Step 2.6) Use the hierarchical community summaries to answer a user query globally.
    
    Args:
        user_query (str): The question asked by the user.
        community_summaries (Dict[int, str]): A dict of {community_id: summary_text}.
        USE_OPENAI (bool): Whether to use OpenAI or a local LLM.
        local_llm (str): The local LLM name if not using OpenAI.
        max_context_tokens (int): Approx token limit for each chunk.

    Returns:
        str: The final global answer.
    
    High-level process:
      1) For each community summary:
         - Chunk the text (so we don't exceed context window).
      2) For each chunk:
         - Ask the LLM for a partial answer + helpfulness score.
      3) Sort partial answers by score.
      4) Combine top partial answers into a final context.
      5) Ask the LLM for a final answer.
    """

    # -------------------------
    # 0) Select which LLM to use
    # -------------------------
    if USE_OPENAI:
        llm = ChatOpenAI(model=openai_model, temperature=0.0)
    else:
        llm = ChatOllama(model=local_llm, temperature=0)

    # ----------------------------------------------
    # 1) Chunk each community summary
    # ----------------------------------------------
    chunked_texts = []
    for comm_id, summary_text in community_summaries.items():
        # Break down the summary into smaller pieces
        chunks = chunk_text(summary_text, chunk_size=max_context_tokens)
        for c in chunks:
            chunked_texts.append((comm_id, c))

   
    # ----------------------------------------------
    # 2a) For each chunk, get partial answer + score
    # ----------------------------------------------
    # Function to process a single chunk
    def process_chunk(comm_id, text_chunk, user_query):
        try:
            # Build the prompt
            prompt = PARTIAL_ANSWER_PROMPT.format(
                user_query=user_query,
                chunk_text=text_chunk
            )

            # LLM call
            response = llm.invoke([HumanMessage(content=prompt)])
            raw_content = response.content.strip()

            # Attempt to parse JSON response
            first_brace = raw_content.find('{')
            last_brace = raw_content.rfind('}')
            json_str = raw_content[first_brace:last_brace+1]

            parsed = eval(json_str)  # or use json.loads if strictly valid
            partial_answer = parsed.get("partial_answer", "No relevant info here.")
            score = parsed.get("helpfulness_score", 0)

        except Exception as e:
            partial_answer = "Parsing error or no relevant info."
            score = 0

        # Return the result for this chunk
        return (comm_id, partial_answer, score)

    # Parallel processing
    partial_answers = []
    marks = []

    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit tasks for each chunk
        futures = [executor.submit(process_chunk, comm_id, text_chunk, user_query) for comm_id, text_chunk in chunked_texts]

        # Collect results as they complete
        print("Processing partial answers...")
        for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
            comm_id, partial_answer, score = future.result()
            partial_answers.append((partial_answer, score))
            marks.append((comm_id, score))

    # ----------------------------------------------
    # 3) Sort partial answers by score (descending)
    # ----------------------------------------------
    partial_answers.sort(key=lambda x: x[1], reverse=True)
    
    # ----------------------------------------------
    # 4) Combine top partial answers into a final context
    #    We'll do a simple cutoff if we have too many
    # ----------------------------------------------
    final_context = []
    used_chars = 0

    for ans, sc in partial_answers:
        # Add some label or delimiter if needed
        if sc==0:
            continue
        
        snippet = f"PartialAnswer(Score={sc}): {ans}\n"
        if used_chars + len(snippet) <= max_context_tokens * 4:
            final_context.append(snippet)
            used_chars += len(snippet)
        else:
            break  # no more space in final context

    # ----------------------------------------------
    # 5) Produce a final answer from the LLM
    # ----------------------------------------------
    final_prompt = f"""\
We have these partial answers and their helpfulness scores:

{''.join(final_context)}

User Query: {user_query}

Based on these partial answers, please produce a single, coherent, and well-structured final answer.
Feel free to synthesize or refine the information. If there's conflicting info, do your best to clarify.

Respond in plain text.
"""

    # Final LLM call
    print("Processing final answer...")
    final_response = llm.invoke([HumanMessage(content=final_prompt)])
    global_answer = final_response.content.strip()

    return global_answer, partial_answers, marks


# Pipeline definition

In [19]:
##################################################
# 5) INGEST PDF -> STORE IN GRAPH (Putting Steps 2.1 and 2.2+ in context)
##################################################

def ingest_pdf_into_graph(pdf_path: str, doc_id: str, embed_opt: bool = True, BATCH_SIZE: int = 25):
    """
    1) Parse PDF into raw text.
    2) Chunk it (Step 2.1).
    3) Generate embeddings for each chunk.
    4) Store chunk nodes in Neo4j.
    5) For each chunk, call LLM to extract element instances (Step 2.2).
    6) Summarize them into a single descriptive block (Step 2.3).
    7) Store the block in Neo4j for further community detection.
    """
    # Step 1: Parse PDF
    print(f"Parsing PDF at {pdf_path}...")
    raw_text = parse_pdf(pdf_path)
    print(f"Extracted {len(raw_text)} characters from {pdf_path} \n\n")

    # Step 1.1: Retrieve metadata
    print(f"Retrieving metadata for {doc_id}...")
    metadata = find_metadata(doc_id)
    print(f"Metadata: {metadata}\n\n")

    # Step 2: Chunk the text (default chunk_size=600 for improved recall)
    chunks = chunk_text(raw_text)
    print(f"Chunked {len(chunks)} segments from {pdf_path} \n\n")

    # Step 3: Embeddings
    if embed_opt:
        print("Generating embeddings for each chunk...")
        embed_fn = get_hf_embedding_function(USE_OPENAI=False)

        print("Storing chunks in Milvus...")
        vectorstore = add_documents_to_milvus([
            {
                "text": chunk, 
                "metadata": {
                    "doc_id": doc_id, 
                    "chunk": i, 
                    **metadata
                    }
            } for i, chunk in enumerate(chunks)], embed_fn)
        print(f"Stored {len(chunks)} chunks in Milvus under Document {doc_id} \n\n")


    # Step 4: Store chunk nodes in Neo4j
    # Process chunks in parallel with batch size of 25
    for lim in tqdm.tqdm(range(0, len(chunks), BATCH_SIZE), desc="Processing chunks in parallel"):
        extracted_elements = []
        not_parsed_elements = []

        with concurrent.futures.ThreadPoolExecutor() as executor:
            # Submit tasks
            futures = [
                executor.submit(extract_element_instances_from_chunk, chunk_text_str, USE_OPENAI=USE_OPENAI)
                for i, chunk_text_str in enumerate(chunks[lim:lim+BATCH_SIZE])
            ]
            
            # Process results as they complete
            for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Processing chunks"):
                elements_instance, not_parsed_instance = future.result()
                extracted_elements.extend(elements_instance)
                not_parsed_elements.extend(not_parsed_instance)

        print("Parallel processing completed.")
            
        # Step 6: Summarize them (element-level)
        print("Summarizing element instances...")
        element_summary = summarize_element_instances(extracted_elements, USE_OPENAI=USE_OPENAI)

        print(f"Summarized {len(element_summary['summarized_nodes'])} nodes and {len(element_summary['summarized_relationships'])} relationships\n\n")

        print("Storing backup files...")
        # Backup element_summary, extracted_elements and not_parsed_elements
        os.makedirs('backup_extraction_nodes/'+doc_id+'/element_summary', exist_ok=True)
        os.makedirs('backup_extraction_nodes/'+doc_id+'/extracted_elements', exist_ok=True)
        os.makedirs('backup_extraction_nodes/'+doc_id+'/not_parsed_elements', exist_ok=True)

        with open(f'backup_extraction_nodes/{doc_id}/element_summary/{lim}.json', 'w') as f:
            f.write(str(element_summary))
        
        with open(f'backup_extraction_nodes/{doc_id}/extracted_elements/{lim}.json', 'w') as f:
            f.write(str(extracted_elements))
        
        with open(f'backup_extraction_nodes/{doc_id}/not_parsed_elements/{lim}.json', 'w') as f:
            f.write(str(not_parsed_elements))
        
        print("Backup files stored.\n\n")

        # Step 7: Store the summary
        print("Storing element summary in Neo4j...")
        with driver.session() as session:
            # Step 7: Store the summary
            session.execute_write(store_element_summary_in_graph, element_summary, doc_id, (lim, lim+BATCH_SIZE))
    
    print(f"Ingested {len(chunks)} chunks from {pdf_path} into Neo4j under Document {doc_id}")
    return True

    

In [18]:
#############################################
# MAIN EXECUTION EXAMPLE
#############################################

In [19]:
PDF_PATH = 'data/docs/'

# Retrieve all the docs in PDF_PATH
docs=sorted([f for f in os.listdir(PDF_PATH) if f.endswith('.pdf')])

In [22]:
filtered_docs=docs[:61]

In [27]:
for doc in filtered_docs:
    print(f"Processing document {doc}...")
    doc_id = doc.replace('.pdf', '')

    if ingest_pdf_into_graph(PDF_PATH+doc, doc_id, embed_opt=True):
        print(f"Finished processing document {doc}.")
        continue
    else:
        raise Exception(f"Error processing document {doc}.")
    

Processing document 1208.1583.pdf...
Parsing PDF at data/docs/1208.1583.pdf...
Extracted 118707 characters from data/docs/1208.1583.pdf 


Retrieving metadata for 1208.1583...
Metadata: {'title': 'Exotic nuclei far from the stability line', 'summary': 'The recent availability of radioactive beams has opened up a new era in\nnuclear physics. The interactions and structure of exotic nuclei close to the\ndrip lines have been studied extensively world wide, and it has been revealed\nthat unstable nuclei, having weakly bound nucleons, exhibit characteristic\nfeatures such as a halo structure and a soft dipole excitation. We here review\nthe developments of the physics of unstable nuclei in the past few decades. The\ntopics discussed in this Chapter include the halo and skin structures, the\nCoulomb breakup, the dineutron correlation, the pair transfer reactions, the\ntwo-nucleon radioactivity, the appearance of new magic numbers, and the pygmy\ndipole resonances.', 'url': 'http://arxiv.org/

Processing chunks: 100%|██████████| 25/25 [01:45<00:00,  4.20s/it]s]


Parallel processing completed.
Summarizing element instances...
Summarized 33 nodes and 31 relationships


Storing backup files...
Backup files stored.


Storing element summary in Neo4j...


Processing chunks: 100%|██████████| 25/25 [01:28<00:00,  3.55s/it]174.84s/it]


Parallel processing completed.
Summarizing element instances...


Processing chunks in parallel:  67%|██████▋   | 2/3 [04:41<02:14, 134.45s/it]

Summarized 6 nodes and 6 relationships


Storing backup files...
Backup files stored.


Storing element summary in Neo4j...


Processing chunks: 100%|██████████| 20/20 [01:04<00:00,  3.24s/it]


Parallel processing completed.
Summarizing element instances...


Processing chunks in parallel: 100%|██████████| 3/3 [06:28<00:00, 129.53s/it]

Summarized 2 nodes and 1 relationships


Storing backup files...
Backup files stored.


Storing element summary in Neo4j...
Ingested 245 chunks from data/docs/1208.1583.pdf into Neo4j under Document 1208.1583
Finished processing document 1208.1583.pdf.





In [62]:
# Initialize the community detection class
detector = CommunityDetection(driver)

In [63]:
# 2) Community detection & summarization (Steps 2.4–2.5)

# Project the graph for community detection
detector.project_graph()

# Set community IDs in the graph
detector.set_communities()

# Drop the graph from memory
detector.drop_graph()


In [66]:
# Retrieve the communities
communities = detector.retrieve_communities()
community_summaries=summarize_communities(communities, USE_OPENAI=True)

100%|██████████| 377/377 [01:32<00:00,  4.07it/s]


In [None]:
# Execute this cell to create the graph plot in the images folder

def fetch_data_from_neo4j():
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    query_nodes = "MATCH (n) RETURN id(n) as id, labels(n) as labels, properties(n) as properties"
    query_relationships = "MATCH (n)-[r]->(m) RETURN id(n) as source, id(m) as target, properties(r) as properties"

    nodes = []
    relationships = []

    with driver.session() as session:
        # Fetch nodes
        result_nodes = session.run(query_nodes)
        for record in result_nodes:
            properties = dict(record['properties'])  # Explicitly cast to dictionary
       
            del properties['summary']
            del properties['original_ids']

            nodes.append({
                'id': record['id'],
                'labels': record['labels'],
                'properties': properties
            })

        # Fetch relationships
        result_relationships = session.run(query_relationships)
        for record in result_relationships:
            relationships.append({
                'source': record['source'],
                'target': record['target'],
                'properties': dict(record['properties'])  # Explicitly cast to dictionary
            })

    driver.close()
    return nodes, relationships

def plot_graph(nodes, relationships):
    G = nx.Graph()

    # Add nodes with attributes
    for node in nodes:
        G.add_node(node['id'], label=node['labels'], properties=node['properties'])

    # Add edges with attributes
    for rel in relationships:
        G.add_edge(rel['source'], rel['target'], properties=rel['properties'])

    # Extract node attributes
    community_ids = [node['properties']['communityId'] for node in nodes]
    unique_communities = list(set(community_ids))
    color_list = list(mcolors.TABLEAU_COLORS.values()) + list(mcolors.CSS4_COLORS.values())
    colors = color_list[:50]  # Limit to 50 colors
    color_map = [colors[unique_communities.index(node['properties']['communityId']) % 50] for node in nodes]

    # Get positions for nodes
    pos = nx.spring_layout(G)

    # Create edges
    edge_x = []
    edge_y = []
    edge_text = []
    for edge in G.edges(data=True):
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])
        edge_text.append(str(edge[2]['properties']))

    edge_trace = go.Scatter(x=edge_x, y=edge_y, mode='lines', line=dict(width=0.5, color='#888'), hoverinfo='text', text=edge_text)

    # Create nodes
    node_x = []
    node_y = []
    node_color = []
    node_text = []
    for i, node in enumerate(G.nodes(data=True)):
        x, y = pos[node[0]]
        node_x.append(x)
        node_y.append(y)
        node_color.append(color_map[i])
        node_text.append(str(node[1]['properties']))

    node_trace = go.Scatter(
        x=node_x, y=node_y, mode='markers',
        marker=dict(
            size=5,
            color=node_color,
            line_width=0
        ),
        hoverinfo='text',
        text=node_text
    )

    # Create figure
    fig = go.Figure(data=[edge_trace, node_trace],
                    layout=go.Layout(
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=0, l=0, r=0, t=0),
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
                    ))
    fig.write_html("images/graph_plot.html")

# Create a file with the graph plot
nodes, relationships = fetch_data_from_neo4j()
plot_graph(nodes, relationships)


These are the nodes and communities created, a screenshot taken from Neo4j Bloom:
![Communities](./images/bloom-visualisation.png)

The purple nodes are the communities summary to which are linked all the entities. Moreover the entities are connected between each other. We can see in the bottom a huge network linked. This is related to authors and common topics among the docs.

Execute this query in neo4j to create communities nodes and visualize the relationships with the entities.

``` cypher
MATCH (n:SummarizedNode)
WITH n.communityId AS communityId, count(n) AS nodeCount, collect(n) AS nodes
MERGE (c:Community {communityId: communityId}) 
SET c.size = nodeCount
WITH c, nodes  // Add this WITH clause to pass variables forward
UNWIND nodes AS node
MERGE (node)-[:BELONGS_TO]->(c)
RETURN c
```


In [67]:
# Store json file with community summaries
os.makedirs('backup_extraction_nodes/', exist_ok=True)
with open(f'backup_extraction_nodes/community_summaries.json', 'w') as f:
    f.write(str(community_summaries))

In [68]:
# Create embeddings for the community summaries and store them in Milvus
embed_fn = get_hf_embedding_function(USE_OPENAI=False)
vectorstore_community_summaries = add_documents_to_milvus([
    {
        "text": summary["community_summary"], 
        "metadata": {
            "doc_id": "community_summaries",
            "community_id": id,
            "title": summary["title"],
            "keywords": ", ".join(summary["keywords"]),
            }
    } for id, summary in community_summaries.items()], embed_fn, collection_name="community_summaries")

**Cost examples:**

0.05$ for a single document of 21 pages with 89 chunks and 10 nodes and 7 relationships extracted. Morevoer this price include the generation of the community summaries of 2 documents.

In average a document takes 4 min to be ingested in the graph.

# Generation

In [2]:
raise Exception("Here starts the generation part")

Exception: Here starts the generation part

In [4]:
# Toggle this variable to filter out the communities that are not relevant to the user query through a vector similarity search
RETRIEVE_BY_VECTOR = True

In [5]:
# Ask a question
user_query = "What Charm and B Meson are?"

In [36]:
# Retrieve all community summaries

if RETRIEVE_BY_VECTOR:
    # Retrieve embeddings for the user query if needed.
    # This allows to filter out irrelevant communities based on the query in advance.
    embed_fn = get_hf_embedding_function(USE_OPENAI=False)
    vectorstore_community_summaries = init_milvus_db("community_summaries", "./vector_db_graphRAG/milvus_ingest.db", embed_fn)
    embed=embed_fn.embed_query(user_query)
    search=search_milvus(embed, vectorstore_community_summaries, top_k=10)
    community_summaries_ingestion = {el[0].metadata['community_id']:el[0].page_content for el in search}
else:
    # Load the community summaries from the milvus sb
    clientMilvus = MilvusClient(
        uri="./vector_db_graphRAG/milvus_ingest.db",
    )

    community_summaries_retrieved = clientMilvus.query(
        collection_name="community_summaries",
        output_fields=["community_id", "text"],
        limit = 1000
    )

    # Refactor for generation
    community_summaries_ingestion = {}
    for el in community_summaries_retrieved:
        community_summaries_ingestion[el['community_id']]=el['text']

In [38]:
# Generate answers:
# Partial answers are the answers generated from the community summaries
# Marks are the helpfulness scores of the partial answers for each community summary
global_answer, partial_answers, marks  = answer_query_from_communities(user_query, community_summaries_ingestion, USE_OPENAI=True)

Processing partial answers...


100%|██████████| 10/10 [00:02<00:00,  4.88it/s]


Processing final answer...


In [None]:
print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

# Sample questions

In [None]:
raise Exception("Sample questions")

In [78]:
# generation through gpt-4o-mini 
# time 31.5 sec
# cost 0.03$
# partial answers generated on all the communities

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: Which are the main themes of the documents?
Answer: The main themes of the documents encompass a diverse range of scientific fields and topics. Key areas of focus include:

1. **Nuclear Physics**: Research on charge radii and the structure of isotopes, highlighting collaborative efforts between institutions like Argonne National Laboratory (ANL) and GSI.

2. **Atomic Interactions**: The dynamics of ion-atom collisions, their impact on energy transfer, and the heating effects that are crucial for understanding atomic behavior.

3. **Astrophysics and Astronomy**: Exploration of exoplanets using advanced instruments such as the Gemini Planet Imager, along with challenges in imaging technology and astrometry. Additionally, the study of solar phenomena, including solar flares and their dynamics, is emphasized.

4. **Astrochemistry**: Investigation into the chemical processes in space, particularly the roles of water, methane, and carbon dioxide in the formation of stars and

In [39]:
# generation through gpt-4o-mini 
# time 5.4 sec
# cost <0.01$
# partial answers generated on retrived top 10 communities by vector similarity

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: What Charm and B Meson are?
Answer: Charm and B mesons are types of mesons, which are subatomic particles made up of quarks. Specifically, charm mesons contain a charm quark, while B mesons contain a bottom quark. 

Charm mesons include particles such as D mesons (D0 and D+) and D+ s mesons, which are composed of a charm quark paired with either an up quark or a strange antiquark. The J/ψ meson is another important charm meson, consisting of a charm quark and a charm antiquark. These particles are significant in the study of particle physics, particularly in understanding the interactions and decay processes involving quarks.

B mesons, on the other hand, are primarily studied in high-energy physics experiments, such as those conducted at the BABAR, Belle, and CLEO-c collaborations. These experiments focus on the decay processes of B mesons, which can provide insights into the fundamental forces and symmetries in particle physics.

Overall, the study of charm and B mes

### Vidore dataset questions sample

#### document 0707.1659

In [137]:
# generation through gpt-4o-mini 
# time 30.5 sec
# cost 0.04$
# partial answers generated on all the communities

# The answer is wrong
# the marks of the partial answers are all zeros
# the real answer is embedded in the picture which is not in the graph

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: What does the graph indicate about the relationship between \( \langle r(t) \rangle \) and time (t) during steady-state conditions?['A. \\( \\langle r(t) \\rangle \\) decreases as time increases.', 'B. \\( \\langle r(t) \\rangle \\) increases as time increases.', 'C. \\( \\langle r(t) \\rangle \\) remains constant as time increases.', 'D. The relationship between \\( \\langle r(t) \\rangle \\) and time cannot be determined from the graph.']
Answer: The relationship between \( \langle r(t) \rangle \) and time (t) during steady-state conditions can be interpreted based on the information provided. The options suggest different behaviors of \( \langle r(t) \rangle \) as time progresses:

- Option A states that \( \langle r(t) \rangle \) decreases as time increases.
- Option B indicates that \( \langle r(t) \rangle \) increases as time increases.
- Option C suggests that \( \langle r(t) \rangle \) remains constant as time increases.
- Option D claims that the relationship 

In [145]:
# generation through gpt-4o-mini 
# time 6.1 sec
# cost <0.01$
# partial answers generated on 20 communities retireved by vector similarity on milvus

# The answer is wrong
# the marks of the partial answers are all zeros
# the real answer is embedded in the picture which is not in the graph

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: What does the graph indicate about the relationship between \( \langle r(t) \rangle \) and time (t) during steady-state conditions?['A. \\( \\langle r(t) \\rangle \\) decreases as time increases.', 'B. \\( \\langle r(t) \\rangle \\) increases as time increases.', 'C. \\( \\langle r(t) \\rangle \\) remains constant as time increases.', 'D. The relationship between \\( \\langle r(t) \\rangle \\) and time cannot be determined from the graph.']
Answer: The relationship between \( \langle r(t) \rangle \) and time (t) during steady-state conditions can be interpreted based on the information provided. The options suggest different behaviors of \( \langle r(t) \rangle \) as time progresses:

- Option A states that \( \langle r(t) \rangle \) decreases as time increases.
- Option B indicates that \( \langle r(t) \rangle \) increases as time increases.
- Option C suggests that \( \langle r(t) \rangle \) remains constant as time increases.
- Option D claims that the relationship 

#### document 0704.2547

In [None]:
# generation through gpt-4o-mini 
# time 23.1 sec
# cost 0.03$
# partial answers generated on all the communities

# the answer is intrinsically correct

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: What are the main author contributions in the field of quantum computing?
Answer: The main author contributions in the field of quantum computing include:

* Research on nonlinear resonators in Quantum Electrodynamics (QED)
* Collaborative research on quantum computing with a focus on specific areas such as signal integrity, low-noise systems, and spin qubits
* Exploration of measurement techniques and their impact on quantum states, particularly in the context of superconducting transmon qubits
* Development of multi-qubit architectures and quantum error correction methods
* Investigation of energy-level structures and resonant peaks that arise from the coupling of photons with qubits

Additionally, notable researchers who have made significant contributions to the field include:

* M.A. Nielsen and I.L. Chuang for their foundational text on quantum computation and information theory
* F. R. Ong, M. Boissonneault, X.-J. Chen, A. Johansson, L. M. Zhang, and M. Fukuda f

### Comparison same question on different retrieval methods and different model generation
The question is: What are the main author contributions in the field of quantum computing?

In [178]:
# generation through gpt-4o-mini 
# time 9.1 sec
# cost <0.01$
# partial answers generated on 20 embeddings retireved by vector similarity on milvus

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: What are the main author contributions in the field of quantum computing?
Answer: In the field of quantum computing, several key contributions have emerged from influential works and notable figures. One of the foundational texts is "Quantum Computation and Quantum Information," co-authored by M.A. Nielsen and I.L. Chuang, which has significantly shaped the understanding of quantum computing and information theory. This work emphasizes the importance of multi-qubit architectures, which are essential for enhancing computational power, and highlights the critical role of quantum error correction techniques to ensure the reliable operation of quantum systems.

Additionally, advancements in cryogenic technologies have been pivotal, particularly in maintaining signal integrity and developing low-noise systems that utilize spin qubits. These technologies are crucial for high-frequency measurements in quantum operations, further advancing the field.

Numerical techniques such

In [179]:
partial_answers

[("The influential work 'Quantum Computation and Quantum Information' co-authored by M.A. Nielsen and I.L. Chuang is a foundational text in the field of quantum computing and information, highlighting significant author contributions.",
  75),
 ('The text discusses the development of multi-qubit architectures in quantum computing, which are essential for enhancing computational power, and highlights the importance of quantum error correction techniques to ensure reliable operation of these systems.',
  70),
 ('The text highlights contributions related to cryogenic technologies and their importance in quantum computing, particularly in areas like signal integrity, low-noise systems, and the use of spin qubits, which are significant for high-frequency measurements in quantum operations.',
  70),
 ('The Density Matrix Renormalization Group (DMRG) is a crucial numerical technique in quantum physics that contributes to the understanding of quantum many-body systems, particularly in calculat

In [193]:
# generation through gpt-4o-mini 
# time 31.6 sec
# cost 0.03$
# partial answers generated on all the communities

# The answer is correct

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: What are the main author contributions in the field of quantum computing?
Answer: In the field of quantum computing, several key contributions from notable authors and researchers have significantly advanced the discipline. One of the foundational texts is "Quantum Computation and Quantum Information," co-authored by M.A. Nielsen and I.L. Chuang, which has been instrumental in shaping the theoretical framework of quantum computing and information.

Research by A. Johansson and colleagues, particularly their studies published in reputable journals like Physical Review Letters, has also made substantial contributions, focusing on innovative approaches and experimental techniques in quantum systems.

The development of cryogenic technologies has been highlighted as crucial for maintaining signal integrity and low-noise conditions in quantum computing, especially in the context of spin qubits, which are vital for high-frequency measurements. Additionally, advancements in s

In [194]:
partial_answers

[("The influential work 'Quantum Computation and Quantum Information' co-authored by M.A. Nielsen and I.L. Chuang is a foundational text in the field of quantum computing and information, highlighting significant author contributions.",
  75),
 ('The research conducted by A. Johansson and colleagues, particularly their study published in Physical Review Letters, represents a significant contribution to the field of quantum computing.',
  70),
 ('The text highlights contributions related to cryogenic technologies and their importance in quantum computing, particularly in areas like signal integrity, low-noise systems, and the use of spin qubits, which are significant for high-frequency measurements in quantum operations.',
  70),
 ('The text discusses the superconducting transmon qubit and measurement-induced dephasing, which are significant contributions to the field of quantum computing, particularly in understanding the performance and coherence of quantum states.',
  70),
 ('The tex

In [25]:
# generation through llama3.1
# time 11.9min on local pc with M2 pro
# No cost

print(f"User's question: {user_query}")
print(f"Answer: {global_answer}")

User's question: What are the main author contributions in the field of quantum computing?
Answer: The main author contributions in the field of quantum computing include:

* Research on nonlinear resonators in Quantum Electrodynamics (QED)
* Collaborative research on quantum computing with a focus on specific areas such as signal integrity, low-noise systems, and spin qubits
* Exploration of measurement techniques and their impact on quantum states, particularly in the context of superconducting transmon qubits
* Development of multi-qubit architectures and quantum error correction methods
* Investigation of energy-level structures and resonant peaks that arise from the coupling of photons with qubits

Additionally, notable researchers who have made significant contributions to the field include:

* M.A. Nielsen and I.L. Chuang for their foundational text on quantum computation and information theory
* F. R. Ong, M. Boissonneault, X.-J. Chen, A. Johansson, L. M. Zhang, and M. Fukuda f

In [26]:
partial_answers

[('F. R. Ong, M. Boissonneault, and others have made notable contributions to the field of quantum computing through their research on nonlinear resonators in QED.',
  60),
 ('X.-J. Chen has made significant contributions to the field of quantum computing through his collaborative research with F. Fu and L Wang.',
  60),
 ("A. Johansson and colleagues' study published in Phys. Rev. Lett. 95, 116805 (2005) is a significant contribution to the field of quantum computing.",
  60),
 ('Key contributors to quantum computing mentioned include those working on signal integrity, low-noise systems, and spin qubits.',
  60),
 ('The main author contributions in this field are related to the exploration of measurement techniques and their impact on quantum states, particularly in the context of superconducting transmon qubits.',
  60),
 ("M.A. Nielsen and I.L. Chuang's book 'Quantum Computation and Quantum Information' is a foundational text in understanding the principles and applications of quant