# Question Answering

Sometimes you want to ask questions about a dataset, either to get a quick overview or to look up specific information about a particular object. Datasets can also be complex, and it is hard to remember every attribute and every relationship between objects. This is where question answering comes in handy!

We use a Large Language Model (LLM) to answer questions about the dataset. A typical approach is to let the LLM "walk" through the dataset, querying objects and their relationships one step at a time so that it gradually learns the structure. However, this approach is slow and becomes impractical for large datasets.

Therefore, we take a different approach. As a first step, we present the LLM with the JSON paths it can use to answer the question. Picture this as a blueprint of the dataset. Then we ask the LLM to answer the question based on this blueprint. JSON paths let the LLM grasp the structure of the dataset quickly, and the same paths can be used to navigate the dataset. Here is a full breakdown of the process:

1. We present the LLM with a set of JSON paths that correspond to the objects and attributes in the dataset.
2. The LLM selects the most relevant JSON paths for the question at hand and provides instructions on how to use them to answer the question.
3. The LLM then uses the provided instructions to answer the question.
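To make the "blueprint" idea concrete, here is a minimal sketch of how JSON paths could be derived from nested data. It is an illustration only, not the implementation used by `mdmodels`: the helper `extract_json_paths` and the `example` dictionary are hypothetical, and the `[*]` wildcard notation is an assumption about how list items might be represented.

```python
from typing import Any


def extract_json_paths(node: Any, prefix: str = "$") -> list[str]:
    """Collect one JSONPath-like string per leaf value.

    List indices are collapsed to ``[*]`` so that repeated objects
    contribute each path only once. (Hypothetical helper.)
    """
    if isinstance(node, dict):
        paths: list[str] = []
        for key, value in node.items():
            paths.extend(extract_json_paths(value, f"{prefix}.{key}"))
        return paths
    if isinstance(node, list):
        paths = []
        for item in node:
            for path in extract_json_paths(item, f"{prefix}[*]"):
                if path not in paths:  # de-duplicate while keeping order
                    paths.append(path)
        return paths
    # Leaf value (string, number, boolean, None): the prefix is the path.
    return [prefix]


# Hypothetical toy document, not taken from the EnzymeML dataset.
example = {
    "name": "Example document",
    "proteins": [
        {"id": "p0", "name": "Enzyme A"},
        {"id": "p1", "name": "Enzyme B"},
    ],
}

paths = extract_json_paths(example)
print(paths)
```

The resulting list is much shorter than the data itself, which is why it works well as a compact description of the structure.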

Let's see how this works in practice!

## Workflow

```mermaid
graph TD
    A[Dataset] --> B[Extract JSON Paths]
    B --> C[Present Paths to LLM]
    C --> D[LLM Analyzes Structure]
    D --> E[LLM Selects Relevant Paths]
    E --> F[LLM Provides Instructions]
    F --> G[LLM Answers Question]
```
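The path selection step in the diagram can be sketched as follows. This is a simplified illustration under stated assumptions, not the code behind `dataset_query`: the `LLM` type, `build_selection_prompt`, `select_paths`, and the stand-in `keyword_llm` are all hypothetical, and a real model call would replace the stand-in.

```python
from typing import Callable

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]


def build_selection_prompt(question: str, paths: list[str]) -> str:
    """Show the available paths to the model and ask it to pick some."""
    listing = "\n".join(f"- {p}" for p in paths)
    return (
        "These JSON paths exist in the dataset:\n"
        f"{listing}\n\n"
        f"Question: {question}\n"
        "Reply with the relevant paths, one per line."
    )


def select_paths(llm: LLM, question: str, paths: list[str]) -> list[str]:
    """Ask the model for relevant paths and keep only those that exist."""
    reply = llm(build_selection_prompt(question, paths))
    chosen = [line.removeprefix("- ").strip() for line in reply.splitlines()]
    # Discard anything the model made up that is not a known path.
    return [p for p in chosen if p in paths]


def keyword_llm(prompt: str) -> str:
    """Stand-in for a real model: returns every listed path containing 'name'."""
    return "\n".join(
        line
        for line in prompt.splitlines()
        if line.startswith("- ") and "name" in line
    )


available = ["$.name", "$.proteins[*].id", "$.proteins[*].name"]
selected = select_paths(keyword_llm, "Which enzymes are present?", available)
print(selected)
```

In a second call, the values found at the selected paths would be passed to the model together with the question. Filtering the reply against the known paths is a design choice that keeps invented paths out of that second step.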

```python
import rich

from mdmodels import DataModel
from mdmodels.llm.templates import dataset_query
```

```python
# Let's load the EnzymeML specification from the EnzymeML GitHub repository.
enzymeml = DataModel.from_github(
    repo="EnzymeML/enzymeml-specifications",
    branch="enzymeml-2",
    spec_path="specifications/enzymeml.md"
)

# Load the dataset and validate it against the EnzymeML specification.
with open("dataset.json", "r") as file:
    enzmldoc = enzymeml.EnzymeMLDocument.model_validate_json(file.read())

# Take a look at the JSON paths that are available in the dataset.
# The slice shows every tenth path, starting at index 5, to keep the output short.
rich.print(enzmldoc.json_paths()[5::10])
```

```python
# Ask the LLM to describe the dataset in detail and provide a table
# that contains all molecules and enzymes (name, type).
response = dataset_query(
    data=enzmldoc,
    query="Describe the dataset in detail and provide a table that contains all molecules and enzymes (name, type)",
    pre_prompt="You are proficient in biochemistry and have been tasked with analyzing a dataset of enzyme reactions.",
)

rich.print(response)
```
