## **1. Install Necessary Packages**

In this step, we ensure that all required Python packages are installed. These packages include:

- **openai**: To interact with OpenAI's API for embeddings and language models.
- **neo4j**: To connect and interact with the Neo4j graph database.
- **tiktoken**: For token estimation, helping us manage token limits with OpenAI models.
- **numpy**: For numerical computations, particularly vector operations.

In [None]:
# Install necessary packages
!pip install openai neo4j tiktoken numpy

Collecting openai
  Downloading openai-1.52.0-py3-none-any.whl.metadata (24 kB)
Collecting neo4j
  Downloading neo4j-5.25.0-py3-none-any.whl.metadata (5.7 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.6-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.52.0-py3-none-any.whl (386 kB)

## **2. Import Necessary Libraries**

We import all the libraries that will be used throughout the notebook. This includes standard libraries and those we just installed.

In [None]:
import json  # For handling JSON data
import openai  # For OpenAI API interactions
from neo4j import GraphDatabase  # For Neo4j database connection
import tiktoken  # For token estimation with OpenAI models
import numpy as np  # For numerical computations
from google.colab import userdata  # For accessing user secrets in Colab

## **3. Set Up API Keys and Database Connections**

Here, we set up the API key for OpenAI and establish a connection to the Neo4j database. We securely retrieve sensitive information using `userdata.get()`.

In [None]:
# Set up OpenAI API key
openai.api_key = userdata.get('OPENAI_API_KEY')  # Read from the Colab secret named OPENAI_API_KEY

# Set up Neo4j connection
uri = userdata.get('NEO4J_URI')  # e.g., 'neo4j+s://xxxxxxxx.databases.neo4j.io'
user = 'neo4j'
password = userdata.get('NEO4J_PASSWORD')  # Read from the Colab secret named NEO4J_PASSWORD
driver = GraphDatabase.driver(uri, auth=(user, password))
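
Outside Colab, `google.colab.userdata` cannot be imported. The following is a minimal sketch of a fallback, assuming the same secret names are exported as environment variables. `get_secret` is a hypothetical helper and is not part of the original notebook.

```python
import os

def get_secret(name):
    """Return a secret from Colab's userdata if available, else from the environment.

    Hypothetical helper: assumes that, outside Colab, secrets are exported as
    environment variables under the same names.
    """
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None
    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # e.g., the secret is not defined or notebook access was not granted
            pass
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"Secret {name!r} not found in Colab userdata or environment")
    return value
```

With this helper, `userdata.get('NEO4J_URI')` could be swapped for `get_secret('NEO4J_URI')` so the same cell runs in Colab and in a local Jupyter session.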

## **4. Upload Knowledge Graph Data**

We generate the knowledge graph from our trace dataset using `knowledge_graph_generator.py`, located in the `knowledge_graph` directory. This script outputs a graph in JSON format, which can be found in `output/knowledge_graph_output/knowledge_graph.json`. We then upload the `knowledge_graph.json` file, which contains the nodes and relationships of our knowledge graph. This data represents the entities and their connections within our system.

In [None]:
# Upload the knowledge graph data
from google.colab import files
uploaded = files.upload()

# Load the JSON file named 'knowledge_graph.json'
with open('knowledge_graph.json', 'r') as f:
    data = json.load(f)

# Extract nodes and links
nodes = data['nodes']
links = data['links']

Saving knowledge_graph.json to knowledge_graph.json
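
Before loading the file into Neo4j, it can help to confirm that it has the keys the later cells rely on (`id`, `label`, and `entity` on nodes; `source`, `target`, and `relationship` on links). The sketch below is one way to do that. `validate_graph` is a hypothetical helper, and the sample data is made up for illustration.

```python
def validate_graph(data):
    """Check that the graph dict has the keys the loading code relies on.

    Returns a list of human-readable problems; an empty list means the
    structure looks usable.
    """
    problems = []
    for section in ('nodes', 'links'):
        if not isinstance(data.get(section), list):
            problems.append(f"missing or non-list '{section}'")
    if problems:
        return problems

    node_ids = set()
    for i, node in enumerate(data['nodes']):
        missing = [k for k in ('id', 'label', 'entity') if k not in node]
        if missing:
            problems.append(f"node {i} is missing {missing}")
        else:
            node_ids.add(node['id'])

    for i, link in enumerate(data['links']):
        missing = [k for k in ('source', 'target', 'relationship') if k not in link]
        if missing:
            problems.append(f"link {i} is missing {missing}")
            continue
        for end in ('source', 'target'):
            if link[end] not in node_ids:
                problems.append(f"link {i} references unknown node {link[end]!r}")
    return problems

# Illustrative sample only; real data comes from knowledge_graph.json
sample = {
    'nodes': [
        {'id': 'CPU_0', 'label': 'CPU 0', 'entity': 'CPU'},
        {'id': 'T_7', 'label': 'rcu_sched (T_7)', 'entity': 'Thread'},
    ],
    'links': [
        {'source': 'T_7', 'target': 'CPU_0', 'relationship': 'switched_in', 'weight': 3},
    ],
}
print(validate_graph(sample))  # prints []
```

Calling `validate_graph(data)` right after `json.load` surfaces structural problems before any write to the database.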


## **5. Upload Event Translations Data**

We generate human-readable translations of system events from the raw trace data using `trace_translator.py`, located in the `trace_translation` directory. This script outputs the translations in JSON format, which can be found in `output/trace_translation_output/event_translations.json`. We then upload the `event_translations.json` file, which contains these descriptions. This data will be used for semantic search and to provide context in our answers.

In [None]:
# Upload the event translations JSON file
uploaded = files.upload()

# Load the JSON file named 'event_translations.json'
with open('event_translations.json', 'r') as f:
    event_translations = json.load(f)

# Extract the traces (event descriptions)
traces = event_translations['traces']

Saving event_translations.json to event_translations.json


## **6. Define Functions to Create Nodes and Relationships**

We define helper functions to create nodes and relationships in the Neo4j database. These functions will be used to load our data into the Neo4j graph database.

In [31]:
from neo4j.exceptions import ServiceUnavailable

def create_nodes(tx, nodes):
    """Creates nodes in the Neo4j database."""
    for node in nodes:
        # Labels cannot be query parameters, so the label is backtick-quoted;
        # property values are passed as parameters so that quotes in the data
        # cannot break the query.
        query = f"""
        MERGE (n:`{node['entity']}` {{id: $id, label: $label}})
        """
        tx.run(query, id=node['id'], label=node['label'])

def create_relationships(tx, links):
    """Creates relationships between nodes in the Neo4j database."""
    for link in links:
        # Prepare properties, excluding structural keys and null values
        # (MERGE cannot match on a null property value)
        props = {k: v for k, v in link.items()
                 if k not in ['source', 'target', 'relationship', 'key'] and v is not None}
        # Reference each property through the $props parameter so values keep
        # their original types (e.g., a numeric 'weight' stays numeric)
        prop_str = ', '.join([f"{k}: $props.{k}" for k in props])
        prop_str = f"{{{prop_str}}}" if prop_str else ''
        # Sanitize relationship name by replacing invalid characters with underscores
        relationship_name = link['relationship'].replace(' ', '_').replace('=', '_')
        query = f"""
        MATCH (a {{id: $source}})
        MATCH (b {{id: $target}})
        MERGE (a)-[r:`{relationship_name}` {prop_str}]->(b)
        """
        tx.run(query, source=link['source'], target=link['target'], props=props)
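
The loader replaces only spaces and equals signs in relationship names. If other characters can appear in the data, a more general sanitizer may be useful. The sketch below is an assumption about what "valid" should mean here (ASCII letters, digits, and underscores, not starting with a digit); `sanitize_relationship_name` is a hypothetical helper.

```python
import re

def sanitize_relationship_name(name):
    """Turn an arbitrary string into a conservative Cypher relationship type.

    Every character outside [A-Za-z0-9_] becomes an underscore, and a name
    that is empty or starts with a digit is prefixed with an underscore.
    """
    cleaned = re.sub(r'[^A-Za-z0-9_]', '_', name)
    if not cleaned or cleaned[0].isdigit():
        cleaned = '_' + cleaned
    return cleaned

print(sanitize_relationship_name('wake up'))   # prints wake_up
print(sanitize_relationship_name('state=on'))  # prints state_on
```

For names that contain only spaces and equals signs, this gives the same result as the two `replace` calls in `create_relationships`.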

## **7. Load Data into Neo4j Graph Database**

Using the functions defined above, we load the nodes and relationships into the Neo4j database. This step populates the graph database with our knowledge graph.

In [32]:
def load_data_into_neo4j(nodes, links):
    """Loads nodes and relationships into the Neo4j database."""
    with driver.session() as session:
        try:
            # Create nodes
            session.execute_write(create_nodes, nodes)
            # Create relationships
            session.execute_write(create_relationships, links)
            print("Data loaded successfully into Neo4j.")
        except ServiceUnavailable as e:
            print(f"An error occurred: {e}")

# Execute the data loading
load_data_into_neo4j(nodes, links)


Data loaded successfully into Neo4j.


## **8. Retrieve Node Labels and Properties**

We extract all node labels and their properties from the Neo4j database. This information is crucial for understanding the structure of our graph and for generating accurate Cypher queries in later steps.

In [None]:
def get_node_labels_and_properties():
    """Retrieves node labels and their properties from the Neo4j database."""
    with driver.session() as session:
        # Get all node labels
        labels_result = session.run("CALL db.labels()")
        labels = [record['label'] for record in labels_result]

        label_properties = {}
        for label in labels:
            properties_result = session.run(f"""
            MATCH (n:`{label}`)
            UNWIND keys(n) AS key
            WITH key, head(collect(n[key])) AS value
            RETURN DISTINCT key, value
            """)
            properties = {}
            for record in properties_result:
                key = record['key']
                value = record['value']
                # Determine data type (check bool first: bool is a subclass of int)
                if isinstance(value, bool):
                    data_type = 'Boolean'
                elif isinstance(value, int):
                    data_type = 'Integer'
                elif isinstance(value, float):
                    data_type = 'Float'
                elif isinstance(value, list):
                    data_type = 'List'
                else:
                    data_type = 'String'
                properties[key] = {'type': data_type, 'example_value': str(value)}
            label_properties[label] = properties
        return label_properties

## **9. Retrieve Relationship Types and Properties**

We extract all relationship types and their properties, including which node labels they connect. This helps in understanding how entities are related in the graph.

In [None]:
def get_relationship_types_and_properties():
    """Retrieves relationship types and their properties from the Neo4j database."""
    with driver.session() as session:
        # Get all relationship types
        types_result = session.run("CALL db.relationshipTypes()")
        types = [record['relationshipType'] for record in types_result]

        type_info = {}
        for rel_type in types:
            # Get the connected node labels and direction
            connections_result = session.run(f"""
            MATCH (start)-[r:`{rel_type}`]->(end)
            RETURN DISTINCT labels(start) AS start_labels, labels(end) AS end_labels
            """)
            node_pairs = set()
            for record in connections_result:
                start_labels = record['start_labels']
                end_labels = record['end_labels']
                for start_label in start_labels:
                    for end_label in end_labels:
                        node_pairs.add((start_label, end_label))

            # Get properties and example values
            properties_result = session.run(f"""
            MATCH ()-[r:`{rel_type}`]->()
            UNWIND keys(r) AS key
            WITH key, head(collect(r[key])) AS value
            RETURN DISTINCT key, value
            """)
            properties = {}
            for record in properties_result:
                key = record['key']
                value = record['value']
                # Determine data type (check bool first: bool is a subclass of int)
                if isinstance(value, bool):
                    data_type = 'Boolean'
                elif isinstance(value, int):
                    data_type = 'Integer'
                elif isinstance(value, float):
                    data_type = 'Float'
                elif isinstance(value, list):
                    data_type = 'List'
                else:
                    data_type = 'String'
                properties[key] = {'type': data_type, 'example_value': str(value)}

            type_info[rel_type] = {
                'properties': properties,
                'start_labels': list(set([pair[0] for pair in node_pairs])),
                'end_labels': list(set([pair[1] for pair in node_pairs]))
            }
        return type_info
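
Both retrieval functions repeat the same type-detection branch, so it could live in a single helper. The sketch below shows one possible shape; `infer_property_type` is a hypothetical name. It tests `bool` before `int` because `bool` is a subclass of `int` in Python, so an `int` check placed first would also capture `True` and `False`.

```python
def infer_property_type(value):
    """Map a Python value returned by the driver to a schema type name."""
    if isinstance(value, bool):  # must precede the int check
        return 'Boolean'
    if isinstance(value, int):
        return 'Integer'
    if isinstance(value, float):
        return 'Float'
    if isinstance(value, list):
        return 'List'
    return 'String'

print(infer_property_type(True))     # prints Boolean
print(infer_property_type(3))        # prints Integer
print(infer_property_type('CPU_3'))  # prints String
```

Each function could then build its entry with `{'type': infer_property_type(value), 'example_value': str(value)}`.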

## **10. Prepare Schema Description**

We use the node and relationship information to create a textual schema description. This description will be provided to the language model to help it generate accurate Cypher queries.

In [None]:
def prepare_schema_description():
    """Prepares a textual description of the database schema."""
    node_schema = get_node_labels_and_properties()
    relationship_schema = get_relationship_types_and_properties()

    schema_description = "The knowledge graph has the following structure:\n\n"
    schema_description += "Node labels, their properties, data types, and example values:\n"
    for label, properties in node_schema.items():
        schema_description += f"- {label}:\n"
        for prop, details in properties.items():
            schema_description += f"  - {prop} (type: {details['type']}, example: '{details['example_value']}')\n"

    schema_description += "\nRelationship types, their properties, data types, example values, and connected node labels:\n"
    for rel_type, info in relationship_schema.items():
        schema_description += f"- {rel_type}:\n"
        schema_description += f"  - Connects from {info['start_labels']} to {info['end_labels']}\n"
        schema_description += "  - Properties:\n"
        for prop, details in info['properties'].items():
            schema_description += f"    - {prop} (type: {details['type']}, example: '{details['example_value']}')\n"
        # Add note about weight property
        if 'weight' in info['properties']:
            schema_description += "  - Note: The 'weight' property represents the number of occurrences or events for this relationship.\n"

    return schema_description

## **11. Generate Embeddings for Event Translations (Translations RAG Step)**

We generate vector embeddings for the event translations using OpenAI's embedding model. This is part of the Retrieval-Augmented Generation (RAG) process, where we retrieve relevant information from the event translations to enhance our answers.

*Note:* This step might take several minutes.

In [None]:
from openai import OpenAI
client = OpenAI(
    # Pass the key explicitly; if omitted, the client reads the OPENAI_API_KEY environment variable
    api_key=userdata.get("OPENAI_API_KEY"),
)

def get_embeddings(texts, model='text-embedding-ada-002'):
    """Generates embeddings for a list of texts using the OpenAI embeddings API."""
    embeddings = []
    batch_size = 100  # Adjust based on rate limits
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [np.array(data.embedding) for data in response.data]
        embeddings.extend(batch_embeddings)
    return embeddings

# Extract descriptions
descriptions = [trace['translation'] for trace in traces]

# Generate embeddings
description_embeddings = get_embeddings(descriptions)
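
Embedding many batches in a row can run into rate limits or transient network errors. One common mitigation is to retry with exponential backoff. The sketch below is a generic illustration, not part of the original notebook: `with_retries` is a hypothetical helper, and it catches every exception for brevity, whereas real code would likely catch only specific errors such as `openai.RateLimitError`.

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus up to 10% jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))

# Possible use inside get_embeddings (assumes `client`, `batch`, and `model` exist):
# response = with_retries(lambda: client.embeddings.create(input=batch, model=model))
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.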

## **12. Define Function to Retrieve Relevant Translations (Translations RAG Step)**

We create a function to find the most relevant event translations based on a user's query by calculating the cosine similarity between embeddings.

In [None]:
def get_relevant_translations(query, descriptions, description_embeddings, top_k=20):
    """Retrieves relevant descriptions based on the user's query."""
    # Generate embedding for the query
    response = client.embeddings.create(input=[query], model='text-embedding-ada-002')
    query_embedding = np.array(response.data[0].embedding)

    # Compute cosine similarity
    similarities = np.dot(description_embeddings, query_embedding) / (
        np.linalg.norm(description_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    # Get top_k most similar descriptions
    top_k_indices = similarities.argsort()[-top_k:][::-1]
    relevant_descriptions = [descriptions[i] for i in top_k_indices]
    return relevant_descriptions
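
To make the ranking step concrete, here is a small worked example of cosine-similarity retrieval. It uses plain Python lists instead of NumPy so it runs without extra packages, and the two-dimensional vectors are made up for illustration; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k most similar vectors, best first."""
    scores = [cosine_similarity(v, query_vec) for v in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy vectors: document 0 points almost the same way as the query,
# document 2 is partly aligned, document 1 is nearly orthogonal.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
print(top_k(query, docs, k=2))  # prints [0, 2]
```

Cosine similarity depends only on direction, so scaling a vector does not change its rank.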

## **13. Generate Cypher Queries from Natural Language Questions (KG RAG Step)**

We define a function that translates natural language questions into Cypher queries using an OpenAI chat model, guided by the schema description.

In [80]:
def generate_cypher_query(question, schema_description):
    """Generates a Cypher query based on the question and schema using an OpenAI chat model."""
    system_prompt = f"""
You are an expert in translating natural language questions into Cypher queries for a Neo4j graph database.

Important guidelines:
- Only use the provided schema information.
- [VERY IMPORTANT]When generating the Cypher query, ensure that it returns all relevant nodes and relationships needed to answer the question.
- Pay close attention to the data types, formats of node properties, and relationship directionality.
- Node IDs and other properties may have specific formats (e.g., 'CPU_3' instead of '3').
- Be aware of the direction of relationships and which node labels they connect.
- When counting events, sum the 'weight' property of relationships instead of counting the number of relationships. The 'weight' property represents the number of occurrences or events.
- When specifying multiple relationship types using the '|' operator in a Cypher query, include the colon ':' only once, before the first relationship type. Do NOT include colons before subsequent relationship types.
- Do not make up properties or labels that are not in the schema.
- Generate a Cypher query that retrieves all relevant data needed to answer the question.
- Include all relevant entities and relationships connected to the main entities.
- Be mindful of potential token limits; if the result set is too large, you can limit the depth or the number of nodes appropriately.
- Do not limit the number of results unless specified in the question.
- Return the query without any explanations or additional text.

Schema:
{schema_description}
"""
    user_prompt = f"""
Question: {question}

Cypher Query:
"""
    response = client.chat.completions.create(
        model='gpt-4',
        messages=[
            {"role": "system", "content": system_prompt.strip()},
            {"role": "user", "content": user_prompt.strip()}
        ],
        temperature=0.1
    )
    query = response.choices[0].message.content.strip()
    return query
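
Chat models sometimes wrap their output in a Markdown code fence even when asked to return only the query. The sketch below strips such a fence before the query is executed. `strip_code_fences` is a hypothetical helper, and whether it is needed depends on how the model actually responds.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Optional language tag after the opening fence, then the body, then the closing fence
_FENCED = re.compile(rf"^{FENCE}[A-Za-z]*[ \t]*\n(.*?)\n?{FENCE}$", re.DOTALL)

def strip_code_fences(text):
    """Remove a surrounding Markdown code fence, if present, from model output."""
    text = text.strip()
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text

wrapped = FENCE + "cypher\nMATCH (n:CPU) RETURN n\n" + FENCE
print(strip_code_fences(wrapped))                   # prints MATCH (n:CPU) RETURN n
print(strip_code_fences("MATCH (n:CPU) RETURN n"))  # unchanged
```

It could be applied to the return value, for example `query = strip_code_fences(response.choices[0].message.content)`.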

## **14. Execute Cypher Queries (KG RAG Step)**

We define a function to execute the generated Cypher queries against the Neo4j database and retrieve the results.

In [None]:
def execute_cypher_query(query):
    """Executes the Cypher query and returns the results."""
    with driver.session() as session:
        try:
            result = session.run(query)
            # Collect results
            records = [record.data() for record in result]
            return records
        except Exception as e:
            print(f"An error occurred: {e}")
            return None
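
Because the query text comes from a language model, it may be worth rejecting anything that looks like a write before running it. The sketch below is a keyword heuristic under that assumption: `is_read_only` is a hypothetical helper, it can produce false positives (for example, a string literal containing the word "create"), and it does not inspect procedures invoked through `CALL`.

```python
import re

# Clauses that modify data or schema; matched as whole words, case-insensitively
WRITE_CLAUSES = ('CREATE', 'MERGE', 'DELETE', 'DETACH', 'SET', 'REMOVE', 'DROP', 'LOAD CSV')
_WRITE_PATTERN = re.compile(r'\b(' + '|'.join(WRITE_CLAUSES) + r')\b', re.IGNORECASE)

def is_read_only(query):
    """Heuristically report whether a Cypher query contains no write clauses."""
    return _WRITE_PATTERN.search(query) is None

print(is_read_only("MATCH (t:Thread) RETURN t.label"))  # prints True
print(is_read_only("MATCH (n) DETACH DELETE n"))        # prints False
```

A caller could check `is_read_only(query)` and skip execution, returning `None`, when the check fails.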

## **15. Generate the Final Answer**

We generate the final answer by passing the user's question, the knowledge graph results, and the relevant event translations to the language model, so that the response draws on both sources.

In [84]:
def generate_final_answer(question, kg_data, event_data):
    """Generates the final answer based on the question, knowledge graph data, and event data."""
    # Estimate the number of tokens in the context
    encoding = tiktoken.encoding_for_model('gpt-4')
    kg_tokens = len(encoding.encode(kg_data))
    event_tokens = len(encoding.encode(event_data))
    total_tokens = kg_tokens + event_tokens
    max_context_tokens = 7000  # Adjust this based on the model's token limit

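    # If the context is too large, truncate it (no summarization is performed)
    if total_tokens > max_context_tokens:
        # Truncate the longer of the two to roughly half of the budget; the total
        # can still exceed the budget when both sources are large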
        if kg_tokens > event_tokens:
            kg_data = kg_data[:int(len(kg_data) * (max_context_tokens / (2 * kg_tokens)))]
            kg_data += "\n\n[Data truncated due to token limit]"
        else:
            event_data = event_data[:int(len(event_data) * (max_context_tokens / (2 * event_tokens)))]
            event_data += "\n\n[Data truncated due to token limit]"

    prompt = f"""
You are a highly knowledgeable expert in trace analysis and knowledge graphs.

Question: {question}

Provide detailed explanations and insights in your answers, utilizing both the data provided and your extensive expertise.
Your answer MUST use both the knowledge graph data and the event translations data (avoid using only one).

Knowledge Graph relevant Data retrieved from database:
{kg_data}

Event Translations relevant data:
{event_data}

Answer:
"""
    response = client.chat.completions.create(
        model='gpt-4',
        messages=[
            {"role": "system", "content": "You provide concise and accurate answers based on the data provided."},
            {"role": "user", "content": prompt.strip()}
        ],
        temperature=0.4
    )
    answer = response.choices[0].message.content.strip()
    return answer
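
The truncation above trims only the longer source, so the combined context can still exceed the limit when both sources are large. One alternative is to compute a token budget for each source first. The sketch below shows only the arithmetic; `split_token_budget` is a hypothetical helper, and the policy (a small source keeps everything, two large sources share equally) is an assumption rather than the notebook's method.

```python
def split_token_budget(kg_tokens, event_tokens, max_tokens):
    """Split max_tokens between two context sources.

    If both fit, nothing is cut. A source that fits within half the budget
    keeps all of its tokens and the other receives the remainder. If both
    exceed half, each gets half.
    """
    if kg_tokens + event_tokens <= max_tokens:
        return kg_tokens, event_tokens
    half = max_tokens // 2
    if kg_tokens <= half:
        return kg_tokens, max_tokens - kg_tokens
    if event_tokens <= half:
        return max_tokens - event_tokens, event_tokens
    return half, max_tokens - half

print(split_token_budget(10000, 1000, 7000))  # prints (6000, 1000)
print(split_token_budget(10000, 6000, 7000))  # prints (3500, 3500)
```

Each text could then be cut at the token level, for example with `encoding.decode(encoding.encode(kg_data)[:kg_budget])`, which keeps the combined size within `max_tokens`.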

## **16. Define the Main Answer Function**

We tie everything together in a single function that takes a user's question and returns an answer by utilizing the functions defined above.

In [72]:
def answer_question(question):
    """Answers the user's question using the knowledge graph and event translations."""
    # Prepare schema description
    schema_description = prepare_schema_description()

    # Generate Cypher query
    query = generate_cypher_query(question, schema_description)
    print("Generated Cypher Query:")
    print(query)

    # Execute query
    records = execute_cypher_query(query)
    print("Query Results:")
    print(records)
    if records is None or len(records) == 0:
        kg_data = "No data found for your query."
    else:
        # Serialize records to JSON
        kg_data = json.dumps(records, indent=2)

    # Retrieve relevant descriptions from event translations
    relevant_translations = get_relevant_translations(question, descriptions, description_embeddings, top_k=5)
    # Combine the descriptions into a single string
    event_data = '\n'.join(relevant_translations)

    # Print the retrieved event translations
    print("Retrieved Event Translations:")
    for idx, desc in enumerate(relevant_translations, 1):
        print(f"{idx}. {desc}")

    # Generate final answer
    answer = generate_final_answer(question, kg_data, event_data)
    return answer

## **17. Test the System with a Sample Question**

We test the entire system using a sample question to see how it performs and to verify that all components are working as expected.

In [85]:
question = "Are there any threads that depend on both CPU 3 and CPU 0?"

# Get the answer
answer = answer_question(question)

print("\nAnswer:")
print(answer)



Generated Cypher Query:
MATCH (cpu0:CPU {label: 'CPU 0'}), (cpu3:CPU {label: 'CPU 3'})
MATCH path0=(cpu0)<-[:switched_in|:switched_out|:scheduled_to_wake_on|:wake_up|:runtime_stat|:process_freed]-(:Thread)
MATCH path3=(cpu3)<-[:switched_in|:switched_out|:scheduled_to_wake_on|:wake_up|:runtime_stat|:process_freed]-(:Thread)
WHERE ANY(node IN NODES(path0) WHERE node IN NODES(path3))
RETURN DISTINCT NODES(path0) AS ThreadsDependentOnBothCPU0AndCPU3
Query Results:
[{'ThreadsDependentOnBothCPU0AndCPU3': [{'id': 'CPU_0', 'label': 'CPU 0'}, {'id': 'T_0', 'label': 'swapper/3 (T_0)'}]}, {'ThreadsDependentOnBothCPU0AndCPU3': [{'id': 'CPU_0', 'label': 'CPU 0'}, {'id': 'T_7', 'label': 'rcu_sched (T_7)'}]}, {'ThreadsDependentOnBothCPU0AndCPU3': [{'id': 'CPU_0', 'label': 'CPU 0'}, {'id': 'T_2186', 'label': 'lttng-sessiond (T_2186)'}]}, {'ThreadsDependentOnBothCPU0AndCPU3': [{'id': 'CPU_0', 'label': 'CPU 0'}, {'id': 'T_15322', 'label': 'kworker/0:0 (T_15322)'}]}, {'ThreadsDependentOnBothCPU0AndCPU3':