**Step 1: Introduction**
This notebook integrates the TAAF approach for knowledge-graph-based question answering with an optional local→global, community-based method adapted from the paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization". Local community detection has already been run in a separate script.

**Step 2: Upload the Processed Graph File**
Here, we upload the knowledge_graph_with_communities.json generated by the local script.

In [1]:
# Step 2: Upload knowledge_graph_with_communities.json
from google.colab import files
import json

print("Please upload 'knowledge_graph_with_communities.json' now.")
uploaded = files.upload()  # user selects the file

with open('knowledge_graph_with_communities.json', 'r') as f:
    data = json.load(f)

nodes = data['nodes']
links = data['links']

print(f"Loaded {len(nodes)} nodes and {len(links)} links from the new file.")


Please upload 'knowledge_graph_with_communities.json' now.


KeyboardInterrupt: 
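
Before the graph is loaded into Neo4j, it can help to check the uploaded payload. The sketch below assumes only what the cell above relies on: a `nodes` list whose items carry an `id`, and a `links` list whose items carry `source` and `target`. The helper name `find_dangling_links` and the tiny example payload are hypothetical.

```python
def find_dangling_links(nodes, links):
    """Return the links whose source or target id does not appear in nodes."""
    known_ids = {node["id"] for node in nodes}
    return [
        link for link in links
        if link["source"] not in known_ids or link["target"] not in known_ids
    ]

# Made-up payload for illustration only (not taken from the real trace graph)
example_nodes = [{"id": "CPU_3"}, {"id": "T_0"}]
example_links = [
    {"source": "CPU_3", "target": "T_0", "relationship": "switched_in"},
    {"source": "T_0", "target": "CPU_9", "relationship": "switched_out"},
]

dangling = find_dangling_links(example_nodes, example_links)
print(f"{len(dangling)} dangling link(s) found")
```

In the notebook this would be called as `find_dangling_links(nodes, links)`. The check is worth running because the relationship loader in Step 5 uses `MATCH` on both endpoints, so a link that points at a missing node is skipped without an error.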

**Step 3: Install and Import Necessary Packages**
We install neo4j to connect to the database, plus openai and tiktoken for GPT-based query generation and summarization. We also import standard libraries (time, json, etc.).

In [2]:
# Step 3: Install and Import Packages
!pip install neo4j openai tiktoken numpy

import openai
import tiktoken
import numpy as np
import time
import json
from neo4j import GraphDatabase, Session
from neo4j.exceptions import ServiceUnavailable

print("Packages installed and imported.")


Collecting neo4j
  Downloading neo4j-5.27.0-py3-none-any.whl.metadata (5.9 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading neo4j-5.27.0-py3-none-any.whl (301 kB)
Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: neo4j, tiktoken
Successfully installed neo4j-5.27.0 tiktoken-0.8.0
Packages installed and imported.


**Step 4: Connect to Neo4j**
We set up our Neo4j driver with credentials. When running in Colab, we read them from the secret store (userdata); otherwise we set them manually.

In [3]:
# Step 4: Connect to Neo4j

try:
    from google.colab import userdata
    openai.api_key = userdata.get('OPENAI_API_KEY')  # or "YOUR_OPENAI_API_KEY"
    uri = userdata.get('NEO4J_URI')                  # e.g. 'neo4j+s://XYZ.databases.neo4j.io'
    password = userdata.get('NEO4J_PASSWORD')        # or "YOUR_NEO4J_PASSWORD"
except Exception:  # not running in Colab, or a secret could not be read
    openai.api_key = "YOUR_OPENAI_API_KEY"
    uri = "neo4j+s://YOUR-NEO4J-ENDPOINT"
    password = "YOUR-NEO4J-PASSWORD"

driver = GraphDatabase.driver(uri, auth=("neo4j", password))
print("Connected to Neo4j successfully.")


Connected to Neo4j successfully.


**Step 5: Create Nodes and Relationships in Neo4j**
We define helper functions that load the community-enriched nodes and relationships into Neo4j. Each node is stored with an id property, a label property, and an optional communityId; its original entity type becomes the node label in Neo4j.

In [17]:
# Step 5: Create Nodes and Relationships in Neo4j

def create_nodes(tx, nodes_data):  # tx is the transaction passed in by the session, not a Session
    """
    MERGE each node by its 'id'. We store 'label' and 'communityId' if present.
    'entity' is used as the node label in Neo4j (e.g., CPU, Thread).
    """
    for node in nodes_data:
        entity_label = node.get('entity', 'Generic')
        query_str = []
        # Backticks keep the interpolated label valid even if it contains spaces
        query_str.append(f"MERGE (n:`{entity_label}` {{id: $id}})")
        query_str.append("SET n.label = $n_label")

        if "communityId" in node:
            query_str.append("SET n.communityId = $communityId")

        final_query = "\n".join(query_str)

        tx.run(
            final_query,
            id=node['id'],
            n_label=node.get('label', ''),
            communityId=node.get('communityId', None)
        )

def create_relationships(tx, links_data):  # tx is the transaction passed in by the session
    """
    MERGE each relationship. 'relationship' is used as the rel type (spaces replaced).
    'weight' property is stored if present, else default to 1.
    """
    for link in links_data:
        source = link['source']
        target = link['target']
        rel_type = link['relationship'].replace(' ', '_')
        weight = link.get('weight', 1)

        q = f"""
        MATCH (a {{id: $source}}), (b {{id: $target}})
        MERGE (a)-[r:`{rel_type}` {{weight: $weight}}]->(b)
        """
        tx.run(q, source=source, target=target, weight=weight)

def load_data_into_neo4j(nodes, links):
    with driver.session() as session:
        try:
            # execute_write replaces the deprecated write_transaction in the 5.x driver
            session.execute_write(create_nodes, nodes)
            session.execute_write(create_relationships, links)
            print("Data loaded into Neo4j successfully.")
        except ServiceUnavailable as e:
            print("Error loading data into Neo4j:", e)

# We'll run the load in the next step.
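
The loaders above interpolate the `entity` and `relationship` strings into the query text instead of passing them as parameters, so those strings need to be valid names. A minimal sketch of a sanitizer is shown below; `to_cypher_identifier` is a hypothetical helper, not part of the notebook, and the rule it applies (replace every run of non-word characters with an underscore) is one possible choice among several.

```python
import re

def to_cypher_identifier(raw, default="Generic"):
    """Turn an arbitrary string into a name that is safe to interpolate
    as a node label or relationship type."""
    # Collapse every run of non-word characters into a single underscore
    cleaned = re.sub(r"\W+", "_", str(raw).strip()).strip("_")
    if not cleaned:
        return default
    # Avoid names that start with a digit
    if cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned

print(to_cypher_identifier("switched in"))
print(to_cypher_identifier("wake-up / resume"))
```

It could be applied where `entity_label` and `rel_type` are computed. Compared with `replace(' ', '_')`, it also handles hyphens, slashes, and quote characters.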

**Step 6: Execute the Loading**
We now call the loading function, creating all nodes and relationships in your Neo4j DB.

In [18]:
# Step 6: Execute the loading
load_data_into_neo4j(nodes, links)



Data loaded into Neo4j successfully.


**Step 7: Retrieve Node/Relationship Schema (Optional)**
We can introspect the database to see what node labels and relationship types exist, along with their properties. This can feed into GPT-based query generation as context.

In [4]:
# Step 7: Retrieve Node/Relationship Schema

def rec_to_type(val):
    # bool must be tested before int because bool is a subclass of int in Python
    if isinstance(val, bool):
        return {"type": "Boolean", "example_value": str(val)}
    elif isinstance(val, int):
        return {"type": "Integer", "example_value": str(val)}
    elif isinstance(val, float):
        return {"type": "Float", "example_value": str(val)}
    elif isinstance(val, list):
        return {"type": "List", "example_value": str(val)}
    else:
        return {"type": "String", "example_value": str(val)}

def get_node_labels_and_properties():
    with driver.session() as session:
        labels_res = session.run("CALL db.labels()")
        labels = [r['label'] for r in labels_res]
        label_props = {}
        for lab in labels:
            q = f"""
            MATCH (n:`{lab}`)
            UNWIND keys(n) AS k
            WITH k, head(collect(n[k])) AS val
            RETURN DISTINCT k, val
            """
            props_res = session.run(q)
            props_dict = {}
            for rec in props_res:
                k = rec['k']
                v = rec['val']
                props_dict[k] = rec_to_type(v)
            label_props[lab] = props_dict
    return label_props

def get_relationship_types_and_properties():
    with driver.session() as session:
        rel_types_res = session.run("CALL db.relationshipTypes()")
        rel_types = [r['relationshipType'] for r in rel_types_res]
        info = {}
        for rt in rel_types:
            # Gather props
            pquery = f"""
            MATCH ()-[r:`{rt}`]->()
            UNWIND keys(r) AS k
            WITH k, head(collect(r[k])) AS val
            RETURN DISTINCT k, val
            """
            p_res = session.run(pquery)
            props = {}
            for rec in p_res:
                k = rec['k']
                v = rec['val']
                props[k] = rec_to_type(v)

            # Gather start/end labels
            cquery = f"""
            MATCH (s)-[r:`{rt}`]->(e)
            RETURN DISTINCT labels(s) AS startLabels, labels(e) AS endLabels
            """
            c_res = session.run(cquery)
            pairs = set()
            for row in c_res:
                for sl in row['startLabels']:
                    for el in row['endLabels']:
                        pairs.add((sl, el))
            startLabs = list(set([p[0] for p in pairs]))
            endLabs = list(set([p[1] for p in pairs]))

            info[rt] = {"properties": props, "startLabels": startLabs, "endLabels": endLabs}

    return info

node_schema = get_node_labels_and_properties()
rel_schema = get_relationship_types_and_properties()

print("Node Schema:")
print(json.dumps(node_schema, indent=2))

print("\nRelationship Schema:")
print(json.dumps(rel_schema, indent=2))


Node Schema:
{
  "CPU": {
    "id": {
      "type": "String",
      "example_value": "CPU_3"
    },
    "label": {
      "type": "String",
      "example_value": "CPU 3"
    },
    "communityId": {
      "type": "Integer",
      "example_value": "0"
    }
  },
  "Thread": {
    "id": {
      "type": "String",
      "example_value": "T_0"
    },
    "label": {
      "type": "String",
      "example_value": "swapper/3 (T_0)"
    },
    "communityId": {
      "type": "Integer",
      "example_value": "0"
    }
  },
  "File": {
    "id": {
      "type": "String",
      "example_value": "File_140637496150960"
    },
    "label": {
      "type": "String",
      "example_value": "Events FD=140637496150960"
    },
    "communityId": {
      "type": "Integer",
      "example_value": "1"
    }
  },
  "Process": {
    "id": {
      "type": "String",
      "example_value": "P_7"
    },
    "label": {
      "type": "String",
      "example_value": "Process 7.0"
    },
    "communityId": {
      "ty

**Step 8: Prepare a Schema Description for GPT**
We’ll create a textual representation of the discovered schema so that GPT knows about labels, properties, and the communityId field.

In [5]:
# Step 8: Prepare a Schema Description for GPT

def prepare_schema_description(node_schema, rel_schema):
    desc = "The knowledge graph has the following structure:\n\n"
    desc += "Node labels and properties:\n"
    for lab, props in node_schema.items():
        desc += f"- {lab}:\n"
        for p, info in props.items():
            desc += f"  - {p} (type={info['type']}, example={info['example_value']})\n"
    desc += "\nRelationship types and properties:\n"
    for rt, info in rel_schema.items():
        desc += f"- {rt}:\n"
        desc += f"  - Connects from {info['startLabels']} to {info['endLabels']}\n"
        desc += f"  - Properties:\n"
        for pk, pv in info['properties'].items():
            desc += f"    - {pk} (type={pv['type']}, example={pv['example_value']})\n"
    return desc

schema_description = prepare_schema_description(node_schema, rel_schema)
print(schema_description)


The knowledge graph has the following structure:

Node labels and properties:
- CPU:
  - id (type=String, example=CPU_3)
  - label (type=String, example=CPU 3)
  - communityId (type=Integer, example=0)
- Thread:
  - id (type=String, example=T_0)
  - label (type=String, example=swapper/3 (T_0))
  - communityId (type=Integer, example=0)
- File:
  - id (type=String, example=File_140637496150960)
  - label (type=String, example=Events FD=140637496150960)
  - communityId (type=Integer, example=1)
- Process:
  - id (type=String, example=P_7)
  - label (type=String, example=Process 7.0)
  - communityId (type=Integer, example=0)
- Network:
  - id (type=String, example=Socket_FD_46)
  - label (type=String, example=Socket FD 46)
  - communityId (type=Integer, example=0)

Relationship types and properties:
- switched_in:
  - Connects from ['CPU'] to ['Thread']
  - Properties:
    - weight (type=Integer, example=44)
- switched_out:
  - Connects from ['Thread'] to ['CPU']
  - Properties:
    - weig

**Step 9: Generate Cypher Queries with GPT (Thorough Prompt Engineering)**
Below is an extensive system prompt describing constraints. We also mention communityId so GPT can filter or group by communities if desired.

In [6]:
# Step 9: Generate Cypher Queries with GPT (Thorough Prompt Engineering)
from openai import OpenAI

client = OpenAI(
    api_key=userdata.get("OPENAI_API_KEY")
)

def generate_cypher_query(question, schema_description):
    """
    Thorough system prompt for generating Cypher.
    """
    system_prompt = f"""
You are an expert in translating natural language questions into Cypher queries for a Neo4j graph database.

Important guidelines:
- Only use the provided schema information.
- [VERY IMPORTANT] When generating the Cypher query, ensure that it returns all relevant nodes and relationships needed to answer the question.
- Pay close attention to the data types, formats of node properties, and relationship directionality.
- Node IDs and other properties may have specific formats (e.g., 'CPU_3' instead of '3').
- Be aware of the direction of relationships and which node labels they connect.
- When counting events, sum the 'weight' property of relationships instead of counting the number of relationships. The 'weight' property represents the number of occurrences or events.
- When specifying multiple relationship types using the '|' operator in a Cypher query, include the colon ':' only once, before the first relationship type. Do NOT include colons before subsequent relationship types.
- [THE MOST IMPORTANT]The semantics of using colon in the separation of alternative relationship types in conjunction with the use of variable binding, inlined property predicates, or variable length is no longer supported.
  For example: r:switched_in|:switched_out|:scheduled_to_wake_on|: ... is WRONG. :switched_in|switched_out|scheduled_to_wake_on|... is correct.
- Do not make up properties or labels that are not in the schema.
- Generate a Cypher query that retrieves all relevant data needed to answer the question.
- Include all relevant entities and relationships connected to the main entities.
- Be mindful of potential token limits; if the result set is too large, you can limit the depth or the number of nodes appropriately.
- Do not limit the number of results unless specified in the question.
- Return the query without any explanations or additional text.
- Always return a subgraph: write queries that return entities and relationships. (For example, a query that only returns the number of files is not OK; return the File nodes and their relationships instead.)
- If referencing community-based logic, use the node property 'communityId'.


Schema:
{schema_description}
"""
    user_prompt = f"Question: {question}\n\nCypher Query:"

    response = client.chat.completions.create(
        model='gpt-4',
        messages=[
            {"role": "system", "content": system_prompt.strip()},
            {"role": "user", "content": user_prompt.strip()}
        ],
        temperature=0.1
    )
    query = response.choices[0].message.content.strip()
    return query
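
Even with the guidelines above, a chat model may wrap its reply in a Markdown code fence or still emit the `|:` separator that the prompt warns about. The sketch below shows one way to post-process the reply before it is executed. `clean_generated_cypher` is a hypothetical helper, and the `|:` rewrite is a plain text substitution that assumes the sequence does not occur inside a string literal in the query.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker

def clean_generated_cypher(text):
    """Strip a surrounding Markdown fence and rewrite ':A|:B' as ':A|B'."""
    query = text.strip()
    if query.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "cypher")
        query = query.split("\n", 1)[1] if "\n" in query else ""
    if query.rstrip().endswith(FENCE):
        query = query.rstrip()[: -len(FENCE)]
    # Remove the colon that follows '|' in a relationship type list
    query = re.sub(r"\|\s*:", "|", query)
    return query.strip()

raw_reply = (
    FENCE + "cypher\n"
    "MATCH (t:Thread)-[r:switched_in|:switched_out]->(c:CPU)\n"
    "RETURN t, r, c\n" + FENCE
)
print(clean_generated_cypher(raw_reply))
```

In the notebook it could wrap the return value of `generate_cypher_query`.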


**Step 10: Execute Cypher Queries**
We define a simple function to run Cypher in Neo4j, returning the list of records.

In [7]:
# Step 10: Execute Cypher Queries

def execute_cypher_query(query):
    with driver.session() as session:
        try:
            result = session.run(query)
            return [r.data() for r in result]
        except Exception as e:
            print(f"An error occurred: {e}")
            return None
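
Because the query text in Step 9 is produced by GPT, it may be worth checking that it only reads from the graph before running it. The following is a rough heuristic sketch, not a Cypher parser: `looks_read_only` and the `WRITE_CLAUSES` list are hypothetical, and the check can give false positives (for example on a property that happens to be named `set`).

```python
import re

# Keywords that indicate a query modifies the graph (illustrative, not exhaustive)
WRITE_CLAUSES = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP")

def looks_read_only(query):
    """Return True if no write keyword appears outside quoted string literals."""
    # Drop quoted literals first so keywords inside values are ignored
    without_strings = re.sub(r"'[^']*'|\"[^\"]*\"", "", query)
    pattern = r"\b(" + "|".join(WRITE_CLAUSES) + r")\b"
    return re.search(pattern, without_strings, flags=re.IGNORECASE) is None

print(looks_read_only("MATCH (t:Thread)-[r]->() RETURN t, r"))
print(looks_read_only("MATCH (n) DETACH DELETE n"))
```

A caller could skip `execute_cypher_query` and report the rejected query whenever the check returns `False`.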

**Step 11: Summarize Results with GPT (Thorough Prompt Engineering)**
We pass the raw query results (JSON) plus the user’s question to GPT for a final textual answer. We do token-limit checks to avoid errors.

In [8]:
# Step 11: Summarize Results with GPT

def generate_final_answer(question, kg_data):
    """
    Summarizes the knowledge graph data in answer to the question.
    We preserve thorough prompt engineering from TAAF.
    """
    if not kg_data.strip():
        kg_data = "No relevant data from the knowledge graph."

    # Token counting to prevent over-limit
    enc = tiktoken.encoding_for_model('gpt-4')
    tokens = enc.encode(kg_data)
    max_tokens = 7000  # approximate budget for the KG data
    if len(tokens) > max_tokens:
        # Truncate on a token boundary so the budget is actually respected
        kg_data = enc.decode(tokens[:max_tokens]) + "\n[TRUNCATED DUE TO TOKEN LIMIT]"

    prompt = f"""
You are a highly knowledgeable expert in trace analysis and knowledge graphs.

Question: {question}

Knowledge Graph Data:
{kg_data}

Answer:
"""

    response = client.chat.completions.create(
        model='gpt-4',
        messages=[
            {"role": "system", "content": "Use the provided KG data to provide a thorough and accurate answer."},
            {"role": "user", "content": prompt.strip()}
        ],
        temperature=0.4
    )
    return response.choices[0].message.content.strip()
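
The truncation in `generate_final_answer` cuts the serialized string, which can leave the JSON incomplete at the cut point. An alternative is to drop whole records before serializing. The sketch below is one way to do that under stated assumptions: `trim_records_to_budget` is a hypothetical helper, and its default token estimate of about four characters per token is a rough stand-in, not a tiktoken count.

```python
import json

def trim_records_to_budget(records, max_tokens, count_tokens=None):
    """Keep the longest prefix of records whose JSON form fits max_tokens.

    Returns (kept_records, number_of_dropped_records).
    """
    if count_tokens is None:
        # Rough stand-in: about four characters per token (an assumption)
        count_tokens = lambda text: len(text) // 4
    kept = []
    for record in records:
        candidate = kept + [record]
        if count_tokens(json.dumps(candidate)) > max_tokens:
            break
        kept = candidate
    return kept, len(records) - len(kept)

# Made-up records for illustration
sample_records = [{"id": f"T_{i}", "weight": i} for i in range(100)]
kept, dropped = trim_records_to_budget(sample_records, max_tokens=50)
print(f"kept {len(kept)} record(s), dropped {dropped}")
```

To count real tokens, pass `count_tokens=lambda s: len(enc.encode(s))` with the tiktoken encoder used above. The sketch re-serializes the list on every step, which is simple but slow for very large result sets.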


**Step 12: Local→Global (Community) Approach vs. Standard Approach**
Inspired by the paper 'From Local to Global: A Graph RAG Approach to Query-Focused Summarization', we define a two-mode system:

- use_communities=False: single GPT-based query + summarization.
- use_communities=True: retrieve partial data per community, summarize each, then combine partial answers in a final step.

This is an illustrative adaptation of the local→global pipeline in the paper.

In [13]:
# Step 12: Local->Global Approach vs. Standard Approach

def get_all_community_ids():
    """
    Return distinct communityIds from the loaded graph.
    """
    query = """
    MATCH (n)
    WHERE n.communityId IS NOT NULL
    RETURN DISTINCT n.communityId AS cid
    ORDER BY cid
    """
    records = execute_cypher_query(query)
    return [r['cid'] for r in records] if records else []

def get_subgraph_for_community(cid):
    """
    Retrieve a subgraph (nodes and relationships) belonging to the given community.
    This includes all nodes with communityId=cid and the relationships between them.
    """
    query = f"""
    MATCH (n)
    WHERE n.communityId = {cid}
    OPTIONAL MATCH (n)-[r]->(m)
    WHERE m.communityId = {cid}
    RETURN n, r, m
    """
    return execute_cypher_query(query)

def local_global_pipeline(question):
    cids = get_all_community_ids()
    partial_answers = []

    for cid in cids:
        # 1) Get subgraph
        records = get_subgraph_for_community(cid)
        if records:
            sub_json = json.dumps(records, indent=2)

            # 2) Summarize this community's subgraph against the original
            #    question; the question is passed through unchanged
            partial_answer = generate_final_answer(question, sub_json)

            # Partial answers are stored without a "Community {cid}" prefix
            # so the final answer does not refer to community numbers
            partial_answers.append(partial_answer)

    # 3) Combine partial answers with minimal preamble
    combined_text = "\n".join(partial_answers)

    final_prompt = f"""
We retrieved partial answers to the question below from different subgraphs.

**Question:**
{question}

**Partial Answers (from each subgraph):**
{combined_text}

Please provide a single, cohesive answer that integrates all of these partial findings without repeating them individually.
"""

    response = client.chat.completions.create(
        model='gpt-4',
        messages=[
            {"role": "system", "content": "Integrate partial subgraph answers into a cohesive overall response."},
            {"role": "user", "content": final_prompt.strip()}
        ],
        temperature=0.4
    )
    return response.choices[0].message.content.strip()

def standard_pipeline(question, schema_description):
    """
    1) Generate a single Cypher query with GPT.
    2) Execute it in Neo4j.
    3) Summarize results with GPT.
    """
    cypher_query = generate_cypher_query(question, schema_description)
    print("Generated Cypher Query:")
    print(cypher_query)

    records = execute_cypher_query(cypher_query)
    if not records:
        kg_data = "No data found for your query."
    else:
        kg_data = json.dumps(records, indent=2)

    final_answer = generate_final_answer(question, kg_data)
    return final_answer
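
The control flow of `local_global_pipeline` is a map step (one summary per community) followed by a reduce step (one combined answer). The sketch below isolates that flow with the Neo4j and GPT calls passed in as functions, so it can be exercised without either service. `map_reduce_answer` and its parameter names are hypothetical. Unlike the pipeline above, it returns early when no community yields data; that is a design choice of this sketch.

```python
def map_reduce_answer(question, community_ids, fetch_subgraph, summarize, combine):
    """Summarize each community's records, then combine the partial answers."""
    partial_answers = []
    for cid in community_ids:
        records = fetch_subgraph(cid)
        if not records:
            continue  # nothing to summarize for this community
        partial_answers.append(summarize(question, records))
    if not partial_answers:
        return "No community produced any data."
    return combine(question, partial_answers)

# Stand-ins for the database and the model, for illustration only
fake_graph = {0: ["T_0", "CPU_3"], 1: [], 2: ["P_7"]}
answer = map_reduce_answer(
    "Which thread is most active?",
    sorted(fake_graph),
    fake_graph.get,
    lambda q, recs: f"{len(recs)} record(s)",
    lambda q, parts: " | ".join(parts),
)
print(answer)
```

In the notebook, `fetch_subgraph` would correspond to `get_subgraph_for_community` and `summarize` to a wrapper around `generate_final_answer`.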


**Step 13: Main Answer Function**
We define answer_question(question, use_communities) that toggles between the single-pass or local→global approach. This is our final user-facing function in TAAF.

In [10]:
# Step 13: Main Answer Function

def answer_question(question, use_communities=False):
    """
    If use_communities=False: standard single-pass TAAF approach.
    If use_communities=True: local->global approach from the paper.
    """
    start_time = time.time()
    if use_communities:
        print("Running local->global pipeline (community-based).")
        answer = local_global_pipeline(question)
    else:
        print("Running standard single-pass pipeline.")
        answer = standard_pipeline(question, schema_description)

    end_time = time.time()
    print(f"Answer generated in {end_time - start_time:.2f} seconds.")
    return answer


**Step 14: Test the System**
We can now test with a sample question. If use_communities=False, we do a single pass. If use_communities=True, we do a local→global approach referencing community-based partial answers.

In [14]:
# Step 14: Test the System

sample_question_1 = "What thread seems more important in the system and why?"
print("=== Standard Pipeline ===")
ans1 = answer_question(sample_question_1, use_communities=False)
print("\nAnswer (standard):\n", ans1)

print("\n"+"-"*80+"\n")

sample_question_2 = "What thread seems more important in the system and why?"
print("=== Local->Global Pipeline ===")
ans2 = answer_question(sample_question_2, use_communities=True)
print("\nAnswer (local->global):\n", ans2)


=== Standard Pipeline ===
Running standard single-pass pipeline.
Generated Cypher Query:
MATCH (t:Thread)-[r]->()
RETURN t.id AS Thread_ID, sum(r.weight) AS Total_Weight
ORDER BY Total_Weight DESC
LIMIT 1
Answer generated in 7.60 seconds.

Answer (standard):
 Based on the provided Knowledge Graph data, the thread "T_2208" seems to be the most important in the system as it is the only thread data available. The "Total_Weight" of 1972 could indicate its importance, but without additional context or comparison to other threads, it's difficult to definitively assess its relative importance.

--------------------------------------------------------------------------------

=== Local->Global Pipeline ===
Running local->global pipeline (community-based).
Answer generated in 88.70 seconds.

Answer (local->global):
 Based on the information gathered from the knowledge graph data, several threads appear to be important in the system due to their various roles and interactions. 

The threads "swa