# EX3. Biomarker Identification with BRAD

In this example, we combine previously developed analysis scripts with BRAD's `CODE` module to run a biomarker selection pipeline from a single prompt.


## Biomarker Identification Pipeline

1. **Collect Data**  
   Gather and organize time-series data for the system under study.

2. **Model Data**  
   Build a dynamics model to describe the system's behavior over time.

3. **Observability Analysis**  
   Identify biomarkers that maximize the system's observability.


In many bioinformatics labs, each of these steps is handled by pre-existing scripts tailored to specific needs. For this example, we use the code in the `scripts/` directory to execute the pipeline.


## Why Use an LLM Agent?

Using an LLM-powered `Agent` for executing this workflow, instead of manually building the pipeline, offers several advantages:

- **Parameter Customization**:  
  The user can easily specify critical parameters and values without needing to manage every detail manually.

- **Default Settings Management**:  
  The LLM can handle the numerous arguments associated with available dynamics models in `scripts/classes/models.py`, most of which are likely insignificant to the user. This allows the user to focus on the important aspects of their analysis while the LLM manages defaults and less critical settings.

By automating these tasks, the LLM `Agent` streamlines the biomarker identification process, saving time and reducing complexity.


## Planner Module

To automate pipeline execution, BRAD can use a series of tools in response to a single user input. The `Agent.state` maintains a queue of instructions, along with a `queue-pointer` that determines which instruction to execute next. In this example, each queued instruction corresponds to one of the three steps in the biomarker pipeline above.

```
Agent.state = {
    "queue-pointer": 1,
    "queue": [
        "Instructions 1",
        "Instructions 2",
        "Instructions 3",
        ...
    ],
}
```
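As a rough illustration of how a queue and pointer of this kind can drive execution, the sketch below steps through the instructions one at a time. It is not BRAD's implementation: `run_queue` and `handler` are hypothetical names, and the instruction strings are placeholders.

```python
from typing import Callable


def run_queue(state: dict, handler: Callable[[str], str]) -> list[str]:
    """Run queued instructions in order, advancing the pointer after each one."""
    results = []
    while state["queue-pointer"] < len(state["queue"]):
        instruction = state["queue"][state["queue-pointer"]]
        results.append(handler(instruction))
        state["queue-pointer"] += 1
    return results


state = {
    # Assumption: index 0 holds the planning step itself, so execution starts at 1.
    "queue-pointer": 1,
    "queue": ["Plan", "Load data", "Build model", "Select biomarkers"],
}
results = run_queue(state, handler=lambda text: f"done: {text}")
print(results)
```

Keeping the pointer in the state rather than in a local loop variable means a run can, in principle, be paused and resumed from the same position.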

The `PLANNER` is a core module that lets the `Agent` break a user input into a series of smaller tasks. As with other modules, the **/force** keyword directs the `Agent` to the planner tool. The tool then constructs a pipeline in one of two ways: (1) the `PLANNER` selects a prebuilt pipeline, or (2) it designs a new workflow, which can be saved for later use. As in the [Scanpy example](https://github.com/Jpickard1/BRAD-Examples/tree/main/Scanpy), the LLM-powered `Agent` is at its best making quick, reliable, and reproducible decisions, and performs worse under uncertainty or when automating new conditions. As a result, building and saving new pipelines for later use requires significant testing to ensure that the instructions and LLM prompts behave as expected.

## Run the Workflow

We use the configuration below, which is also found in `config-ex-3.json`, and run the pipeline with the input `/force PLANNER Execute a biomarker selection pipeline for RNAseq data`.

```
{
    "CODE": {
        "path": [
            "/home/jpic/BRAD-Examples/DMD-Biomarkers/scripts/"
        ]
    },
    "PLANNER": {
        "path": "/home/jpic/BRAD-Examples/DMD-Biomarkers/pipelines/"
    },
    "debug": 1,
    "log_path": "/home/jpic/BRAD-Examples/DMD-Biomarkers/output/"
}
```
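The paths in this configuration are absolute and specific to the machine the example was run on, so they need to be updated before the example is run elsewhere. A minimal sketch of a pre-flight check is shown below; `missing_paths` is a hypothetical helper, not part of BRAD, and it only assumes the three keys shown in the configuration above.

```python
import json
from pathlib import Path


def missing_paths(config: dict) -> list[str]:
    """Return every configured directory that does not exist on disk."""
    candidates = list(config.get("CODE", {}).get("path", []))
    candidates.append(config.get("PLANNER", {}).get("path"))
    candidates.append(config.get("log_path"))
    return [p for p in candidates if p and not Path(p).is_dir()]


# Illustrative in-memory config; "/no/such/dir/" is a made-up path.
example = json.loads(
    '{"CODE": {"path": ["/no/such/dir/"]}, "PLANNER": {"path": "."}, "log_path": "."}'
)
print(missing_paths(example))
```

To check the real file, load `config-ex-3.json` with `json.load` and pass the result to `missing_paths`.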

For details on the DMD model and the observability analysis performed in this pipeline, see the scripts in the `CODE` path of the configuration.
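For readers unfamiliar with DMD (dynamic mode decomposition), the sketch below shows the core idea in its textbook form: fit a linear map between consecutive snapshots of time-series data. It is a generic illustration on synthetic data, and the scripts in this example may differ in preprocessing, rank handling, and other details; `exact_dmd` is a hypothetical name.

```python
import numpy as np


def exact_dmd(snapshots: np.ndarray) -> np.ndarray:
    """Fit A so that x[t+1] = A x[t] in the least-squares sense.

    snapshots has shape (n_states, n_timepoints).
    """
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    return Y @ np.linalg.pinv(X)


# Synthetic check: generate data from a known 2-state linear system.
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
states = [np.array([1.0, 1.0])]
for _ in range(5):
    states.append(A_true @ states[-1])
data = np.stack(states, axis=1)  # shape (2, 6)

A_fit = exact_dmd(data)
print(np.allclose(A_fit, A_true))
```

With noise-free data and enough linearly independent snapshots, the fitted matrix recovers the generating system; with real expression data the fit is only an approximation.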

```
from BRAD import agent

bot = agent.Agent(
    config='config-ex-3.json',
    tools = ['CODE', 'PLANNER']
).chat()
```


```
Would you like to use a database with BRAD [Y/N]?

 N

2024-11-22 08:55:34 INFO semantic_router.utils.logger local

Welcome to BRAD! The output from this conversation will be saved to /home/jpic/BRAD-Examples/DMD-Biomarkers/output/November 22, 2024 at 08:55:31 AM/log.json. How can I help?

Input >>  /force PLANNER Execute a biomarker selection pipeline for RNAseq data

BRAD >> 1: 

2024-11-22 08:55:37,797 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Display pipeline
** Step 0**
{'order': 0, 'module': 'PLANNER', 'prompt': 'none', 'description': 'This step designed the plan. It is placed in the queue because we needed a place holder for 0 indexed lists.', 'next': [1], 'inputs': []}

** Step 1**
{'order': 1, 'module': 'CODE', 'prompt': '/force CODE Load the 2015 gene expression dataset with gene coordinates.', 'description': 'This step is used to load the correct dataset to BRADs output directory', 'next': [2], 'inputs': []}

** Step 2**
{'order': 2, 'module': 'CODE', 'prompt': '/force CODE build a dynamics model of the 2015 gene coordinate time series data. Please select the appropriate parameters and inputs based on previous iterations.', 'description': 'This step builds models of the time series data that can be used for sensor selection.', 'next': [3], 'inputs': []}

** Step 3**
{'order': 3, 'module': 'CODE', 'prompt': '/force CODE perform biomarker selection on the model of the 2015 dataset. Please select the appropriate para

2024-11-22 08:55:38,665 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  input = step.invoke(input, config, **kwargs)
2024-11-22 08:55:40,462 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-22 08:55:41,792 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

/force CODE build a dynamics model of the 2015 gene coordinate time series data. Please select the appropriate parameters and inputs based on previous iterations.
CODE
BRAD >> 3: 

2024-11-22 08:55:46,503 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  input = step.invoke(input, config, **kwargs)
2024-11-22 08:55:49,139 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-22 08:55:50,812 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

/force CODE perform biomarker selection on the model of the 2015 dataset. Please select the appropriate parameters and inputs based on previous iterations.
CODE
BRAD >> 4: 

2024-11-22 08:56:34,458 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  input = step.invoke(input, config, **kwargs)
2024-11-22 08:56:36,885 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-22 08:56:38,892 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

Input >>  exit

Thanks for chatting today! I hope to talk soon, and don't forget that a record of this conversation is available at: /home/jpic/BRAD-Examples/DMD-Biomarkers/output/November 22, 2024 at 08:55:31 AM/log.json
```


---
When finished, the `Agent` output directory should look something like the following (the precise names may vary):

```
└── November 22, 2024 at 08:55:31 AM
    ├── log.json       # the agent log records all actions used
    ├── S0-2015.pkl    # the loaded data from step 1
    ├── S1-2015.pkl    # the constructed model from step 2
    └── S3-2015.csv    # the biomarker evaluation from step 3
```
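Because each session writes to a new timestamped directory, it can be handy to locate the most recent run programmatically. The sketch below assumes that directory names follow the format of the single example above (`Month DD, YYYY at HH:MM:SS AM/PM`); that format is inferred from this one run, and `latest_run` is a hypothetical helper.

```python
from datetime import datetime

# Assumed naming format, inferred from the directory name shown above.
RUN_FORMAT = "%B %d, %Y at %I:%M:%S %p"


def latest_run(names: list[str]) -> str:
    """Return the most recent run directory name, skipping names that do not parse."""
    stamped = []
    for name in names:
        try:
            stamped.append((datetime.strptime(name, RUN_FORMAT), name))
        except ValueError:
            continue
    if not stamped:
        raise ValueError("no run directories found")
    return max(stamped)[1]


# The second and third entries are made up for illustration.
names = [
    "November 22, 2024 at 08:55:31 AM",
    "November 21, 2024 at 11:02:10 PM",
    "notes.txt",
]
newest = latest_run(names)
print(newest)
```

In practice the list of names would come from something like `os.listdir("output")`.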

Each of these files is the output of a different step of the workflow, and the log file provides complete transparency into every decision made by the LLM throughout the workflow. For instance, the snippet below, taken from the log file, shows how the output from step 1 was passed to the DMD model and that a full-rank DMD, i.e. the exact DMD, was used:

```
{
    "func": "pythonCaller.execute_python_code",
    "code": "response = subprocess.run([sys.executable, '/home/jpic/BRAD-Examples/DMD-Biomarkers/scripts/buildModel.py', state['output-directory'], 'S1-2015.pkl', '/home/jpic/BRAD-Examples/DMD-Biomarkers/output/November 22, 2024 at 08:55:31 AM/S0-2015.pkl', '-1'], capture_output=True, text=True)",
    "purpose": "execute python code"
},
```
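Entries like this can be pulled out of the log without knowing its exact layout. The sketch below walks a parsed JSON tree and collects every dictionary whose `func` field matches; `find_calls` is a hypothetical helper, and the miniature log used to exercise it is invented for illustration rather than copied from a real `log.json`.

```python
import json


def find_calls(node, func_name="pythonCaller.execute_python_code"):
    """Recursively collect every dict in a parsed JSON tree whose "func" matches."""
    found = []
    if isinstance(node, dict):
        if node.get("func") == func_name:
            found.append(node)
        for value in node.values():
            found.extend(find_calls(value, func_name))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_calls(item, func_name))
    return found


# Hypothetical miniature log; the nesting and the "other.step" entry are made up.
sample = json.loads("""
{"0": {"process": {"steps": [
    {"func": "pythonCaller.execute_python_code",
     "code": "response = ...",
     "purpose": "execute python code"},
    {"func": "other.step"}
]}}}
""")
calls = find_calls(sample)
print(len(calls), calls[0]["purpose"])
```

Applied to a real log, the `code` field of each match shows the exact command and arguments the `Agent` chose.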

`S3-2015.csv` (file name may vary) contains the desired output, the top-ranked marker genes for this dataset:

```
import pandas as pd
df = pd.read_csv('output/November 22, 2024 at 08:55:31 AM/S3-2015.csv')
df.head()
```

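| Unnamed: 0.1 | Unnamed: 0 | state | ev1 | weight | rank |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | MT-CO1 | 0.418736 | 0.418736 | 1.0 |
| 1 | 1 | ACTB | 0.349045 | 0.349045 | 2.0 |
| 2 | 2 | MT-ND4 | 0.305872 | 0.305872 | 3.0 |
| 3 | 3 | MT-CO3 | 0.296518 | 0.296518 | 4.0 |
| 4 | 4 | MT-CO2 | 0.210251 | 0.210251 | 5.0 |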
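The ranking can also be consumed by downstream scripts. The sketch below extracts the top-ranked genes using only the standard library, run here on the five rows shown above. It assumes that the `state` column holds the gene symbol and that a lower `rank` means a better marker, which is how the displayed rows read; `top_genes` is a hypothetical helper.

```python
import csv
import io

# First rows of S3-2015.csv as displayed above.
CSV_TEXT = """Unnamed: 0.1,Unnamed: 0,state,ev1,weight,rank
0,0,MT-CO1,0.418736,0.418736,1.0
1,1,ACTB,0.349045,0.349045,2.0
2,2,MT-ND4,0.305872,0.305872,3.0
3,3,MT-CO3,0.296518,0.296518,4.0
4,4,MT-CO2,0.210251,0.210251,5.0
"""


def top_genes(handle, k: int) -> list[tuple[str, float]]:
    """Return the k best-ranked (gene, weight) pairs from a ranking CSV."""
    rows = sorted(csv.DictReader(handle), key=lambda row: float(row["rank"]))
    return [(row["state"], float(row["weight"])) for row in rows[:k]]


top3 = top_genes(io.StringIO(CSV_TEXT), k=3)
print(top3)
```

For the real file, pass an open file handle for `S3-2015.csv` in place of the `io.StringIO` object.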


---

In essence, the LLM has run the workflow in a reproducible manner without requiring (but still allowing) human supervision over each stage and decision.

Thank you for following the last of the four **BRAD-Examples**. Please check out our GUI and share any feedback on how the system performs for your applications.

