# Example 1: Using Scanpy with BRAD

This example illustrates how to add new tool modules to BRAD. The [previous example](https://github.com/Jpickard1/BRAD-Examples/blob/main/RAG-SCRAPE/Example-1.ipynb) identified the following requirements for a new tool module:

1. **Create a New Python File**  
   Add a new Python file in the `BRAD/` directory. This file must include:
   - A **tool method** that accepts and returns a Python dictionary, representing the `Agent.state`.
   - Any necessary **helper methods** to support the tool.

2. **Update the Semantic Router**  
   - Add a new set of routing files in the `BRAD/routes/` directory, allowing the semantic router to recognize when to select the tool.
   - Update the `BRAD.routers.getRoutes()` method to include example prompts that guide the semantic router in using the new tool.

3. **Modify the `Agent` Class**  
   - Add the necessary imports in `agent.py` to include the new tool.
   - Update the `Agent.getModules()` method to register the tool's main method.
  
In this notebook, we will walk through how the `SOFTWARE` module satisfies each of these.
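As a rough illustration of the first requirement, the sketch below shows the shape of a tool method: it receives the state dictionary, reads what it needs, writes its result back, and returns the dictionary. The function name `example_tool` and the `'output'` key are hypothetical and chosen only for illustration; only the `'prompt'` key appears in this notebook.

```python
def example_tool(state: dict) -> dict:
    """Hypothetical tool method: read from the state, do work, write back."""
    prompt = state['prompt']              # the user prompt, as read by codeCaller
    state['output'] = f"echo: {prompt}"   # 'output' is an assumed key, not BRAD's
    return state                          # tools return the updated state


# A plain dictionary stands in for Agent.state in this sketch.
state = example_tool({'prompt': 'hello'})
```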

## Implementing the Software Module

### Scope
The `SOFTWARE` module is designed to execute preexisting scripts that are relevant to BRAD's functionality. Unlike modules that retrieve information from documents or databases, this tool focuses on automating data analysis to help users uncover new insights. To achieve this, we utilize LLMs in a RAG-inspired workflow as follows:
1. (R) Retrieve relevant information, such as software documentation and comments.
2. (A) Construct a prompt using all available scripts and their corresponding documentation.
3. (G) Use the LLM to identify and select the most appropriate script to address the user's query.
4. (R) Retrieve the complete documentation for the selected script.
5. (A) Generate a detailed prompt, incorporating techniques like few-shot prompting, to guide the LLM.
6. (G) Instruct the LLM to produce a single line of code that runs the script and provides the user with new information.

Whereas fully LLM-generated code remains a significant challenge, using LLMs to identify and run the correct script is a more feasible, practical application. This still requires the LLM to produce a line of code, but by minimizing the amount of code the LLM writes, few-shot prompting can guide it toward appropriate code.
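The two retrieve, augment, generate rounds can be sketched as two small functions that each build a prompt, call the LLM, and parse one labeled line from the reply. This is a simplified sketch, not the code in `coder.py`: the function names, the shortened prompt wording, and the `fake_llm` stand-in are all assumptions made so the example runs offline.

```python
import re


def select_script(llm, user_query, descriptions):
    """Stages 1-3: list the scripts, ask the LLM to pick one, parse SCRIPT."""
    listing = "\n".join(
        f"Script Name: {name}, Description: {text}"
        for name, text in descriptions.items()
    )
    prompt = (
        "You must select which code to run to help a user.\n\n"
        f"**Available Scripts**\n{listing}\n\n"
        f"**User Query**\n{user_query}\n\n"
        "**Response Template**\nSCRIPT: <selected script>\nREASON: <reasoning>\n"
    )
    match = re.search(r"^SCRIPT:\s*(\S+)", llm(prompt), flags=re.MULTILINE)
    name = match.group(1) if match else None
    # Reject names the LLM invented or the literal "None"
    return name if name in descriptions else None


def write_call(llm, user_query, script_name, documentation):
    """Stages 4-6: prompt with the full documentation, parse the Execute line."""
    prompt = (
        f"You must run this python script: {script_name}\n\n"
        f"**PYTHON SCRIPT DOCUMENTATION**\n{documentation}\n\n"
        f"**User Query**\n{user_query}\n\n"
        "Respond with a line of the form:\nExecute: <one line of python>\n"
    )
    match = re.search(r"^Execute:\s*(.+)$", llm(prompt), flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def fake_llm(prompt):
    """Stand-in for a real LLM call so the sketch runs without an API key."""
    if "select which code" in prompt:
        return "SCRIPT: scanpy-brad\nREASON: it runs scanpy commands."
    return "Execute: subprocess.run([sys.executable, 'scanpy-brad.py'])"


descriptions = {"scanpy-brad": "Runs scanpy commands on a .h5ad file."}
chosen = select_script(fake_llm, "perform a PCA", descriptions)
line = write_call(fake_llm, "perform a PCA", chosen, "full docstring here")
```

Keeping the generated code to one parsed line means that a malformed reply yields `None` instead of arbitrary code being passed on.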


## 1. Create a New Python File

To implement the `SOFTWARE` module, the `BRAD/coder.py` and `BRAD/pythonCaller.py` files were introduced (at one point, `BRAD/matlabCaller.py` existed, but it was removed because of the additional software dependencies it required). Within `coder.py`, the six-stage pipeline outlined above is executed by the tool function `codeCaller()`. Like all tool modules, this method accepts `Agent.state` as input and returns the updated `state` dictionary. Packed within the `state` parameter is all the information necessary to work with the LLM and other resources of the `Agent`. For instance, the first few lines of `codeCaller` obtain the user prompt and the LLM with the code:

```
    prompt = state['prompt']  # Get the user prompt
    llm = state['llm']        # Get access to the LLM
```



This file contains helper methods and imports many Python-specific methods from `pythonCaller.py`. In particular, methods such as `find_py_files()`, `read_python_docstrings()`, and `get_py_description()` are important for the retrieval steps of the workflow above.
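To give a sense of what this retrieval step involves, here is a minimal sketch that lists the scripts in a directory and reads each module-level docstring without running the script. It is not the code in `pythonCaller.py`: the function names below are hypothetical, and using the standard `ast` module is an assumption about one reasonable way to do this.

```python
import ast
import os
import tempfile


def list_python_files(directory):
    """Return the names of the .py files in `directory`, without extensions."""
    return sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if name.endswith(".py")
    )


def module_docstring(path):
    """Return a script's module-level docstring without importing or running it."""
    with open(path, encoding="utf-8") as handle:
        tree = ast.parse(handle.read())
    return ast.get_docstring(tree) or ""


def first_line(docstring):
    """Use the first non-empty docstring line as a one-line description."""
    for line in docstring.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Demonstration on a throwaway script written to a temporary directory
with tempfile.TemporaryDirectory() as scripts_dir:
    script = os.path.join(scripts_dir, "demo-script.py")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write('"""\nRuns a demo analysis.\n\nArguments: none\n"""\nprint("hi")\n')
    names = list_python_files(scripts_dir)
    description = first_line(module_docstring(script))
```

The short description would feed the script-selection prompt, while the full docstring would be reserved for the second prompt.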

In addition to helper code for `codeCaller`, `coder.py` also imports methods from `BRAD.promptTemplates`, `BRAD.log`, and `BRAD.utils`. The latter two provide general functionality for developing with the LLM, managing the output directory, interfacing with the GUI, and more; they do not necessarily need to be modified to build new tools. `promptTemplates.py` is a separate file that contains all the prompt templates used throughout BRAD. For the `CODE` tool, templates such as the `pythonPromptTemplate` and `scriptSelectorTemplate` were developed. These templates allow the developer to provide additional context, information, and examples to the LLM. The `pythonPromptTemplate`, which asks the LLM to write one line of code, provides several examples of what that code should look like. Here is that template:

Providing clear, explicit examples and instructions allows us to use the LLM as more than a router or control-flow predictor (its role when making selections in the `DATABASE` tool or searching the archive) and instead leverage its generation capabilities to formulate code.

New templates can be placed here as needed.
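As an illustration of the few-shot idea, the sketch below fills a template with a script path, its docstring, and a handful of worked examples. It is not BRAD's `pythonPromptTemplate`: the template text, the `build_prompt` helper, and the example pairs are assumptions, and plain `str.format` is used instead of a prompt-template library to keep the example dependency-free.

```python
FEW_SHOT_TEMPLATE = """**PYTHON SCRIPT**
You must run this python script:
{script_path}

**PYTHON SCRIPT DOCUMENTATION**
{docstring}

**EXAMPLES**
{examples}

**User Query**
{user_query}

Respond with one line of the form:
Execute: <python code>
"""


def build_prompt(script_path, docstring, examples, user_query):
    """Render the template; `examples` is a list of (query, code) pairs."""
    shots = "\n\n".join(
        f"Query: {query}\nExecute: {code}" for query, code in examples
    )
    return FEW_SHOT_TEMPLATE.format(
        script_path=script_path,
        docstring=docstring,
        examples=shots,
        user_query=user_query,
    )


# Placeholder examples; real templates would show calls to the actual script.
prompt_text = build_prompt(
    script_path="scripts/scanpy-brad.py",
    docstring="Runs scanpy commands on a .h5ad file.",
    examples=[("normalize the data", "run_normalize()"),
              ("cluster the cells", "run_cluster()")],
    user_query="perform a PCA",
)
```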

## 2. Update the Semantic Routers

To update the routing and give an `Agent` access to the `SOFTWARE` module, the file `BRAD/routers/code.txt` was created. To give the semantic router examples of queries that should use this tool, we initialized `code.txt` with the following prompts:

Additionally, the following lines of code were added to the `getRouter` method in `BRAD/router.py`:

```
    if available is None or 'CODE' in available:
        routeCode = Route(
            name = 'CODE',
            utterances = read_prompts(getRouterPath('code.txt'))
        )
        routes.append(routeCode)
```
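The snippet above relies on `read_prompts` to turn a routing file into a list of example utterances. A minimal sketch of such a reader follows, assuming the file holds one example prompt per line; the function name `read_utterances`, the file format, and the two sample prompts are assumptions for illustration and are not the contents of `code.txt`.

```python
import os
import tempfile


def read_utterances(path):
    """Read one example prompt per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a throwaway routing file and placeholder prompts
with tempfile.TemporaryDirectory() as tmp:
    routing_file = os.path.join(tmp, "code.txt")
    with open(routing_file, "w", encoding="utf-8") as handle:
        handle.write("run a PCA on my data\n\nexecute the clustering script\n")
    utterances = read_utterances(routing_file)
```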


## 3. Modify the `Agent` Class

The following modifications were made within `BRAD/agent.py`:
1. The `CODE` tool is imported with the addition of: `from BRAD.coder import codeCaller`
2. The `Agent` is given access to the `codeCaller` method with the addition of the following line within `Agent.getModules()`: `'CODE'   : codeCaller,`
3. Additional configurations were placed within the `config/config.json` file to be used by the `CODE` tool.

The following configuration parameter is included for the `CODE` tool to tell the `Agent` which python scripts are accessible.


While these do modify the `Agent` class, these modifications are relatively superficial and straightforward.
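The registration pattern amounts to a dictionary from tool names to tool methods, with the router's chosen name used as the lookup key. The sketch below mimics that pattern outside of BRAD: `get_modules`, `dispatch`, the stub tool, and the `'output'` key are hypothetical stand-ins, not the `Agent` implementation.

```python
def code_caller_stub(state):
    """Stand-in for codeCaller so the sketch runs without BRAD installed."""
    state['output'] = 'ran CODE'   # 'output' is an assumed key
    return state


def get_modules():
    """Hypothetical analogue of Agent.getModules(): tool name -> tool method."""
    return {
        'CODE': code_caller_stub,
    }


def dispatch(route, state):
    """Look up the tool selected by the router and apply it to the state."""
    modules = get_modules()
    if route not in modules:
        raise KeyError(f"no tool registered for route {route!r}")
    return modules[route](state)


result = dispatch('CODE', {'prompt': 'perform a PCA'})
```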

## Using the `CODE` tool

### Available Scripts

To use this module, we first require a Python script that is separate from the BRAD software. In this case, we developed the script below, found at `scripts/scanpy-brad.py`, which provides access to run arbitrary `scanpy` commands:

This script requires 4 arguments to be passed:
1. the location of the output directory, which through prompting we will set to be `state['output-directory']`,
2. the name of the output file, which the LLM can choose,
3. the name of the input file, which is important information the user must have, and
4. a list of `scanpy` commands, which the LLM should select to respond to the user's query.
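A script with this interface has to unpack the four positional arguments before doing any analysis. The sketch below shows only that unpacking step and is not the contents of `scanpy-brad.py`: the function name, the returned keys, and the naive split on `;` are assumptions. The actual script would go on to load the input file and run each command with `scanpy`.

```python
import os


def parse_arguments(argv):
    """Unpack: output directory, output file, input file, command string."""
    if len(argv) != 5:  # script name plus the four arguments
        raise SystemExit(
            "usage: script.py <output dir> <output file> <input file> <commands>"
        )
    output_directory, output_file, input_file, command_string = argv[1:5]
    # Naive split: assumes no ';' appears inside a command's own arguments
    commands = [part.strip() for part in command_string.split(';') if part.strip()]
    return {
        'output_path': os.path.join(output_directory, output_file),
        'input_file': input_file,
        'commands': commands,
    }


args = parse_arguments([
    'scanpy-brad.py', 'out', 'PCA_results.h5ad', 'TS_Liver.h5ad',
    "sc.pp.pca(adata); sc.pl.pca(adata, save='pca_results.png')",
])
```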

### Docstrings and Prompting
While writing a line of code to use this file does require LLM generated code, we can construct sufficiently detailed prompts with many examples to ensure the LLM will be successful. Below is the documentation for the script, which is used in the `pythonPromptTemplate` as input to prompt the LLM.

### Execution

To execute this, we follow a similar format as done previously:

In [1]:
from BRAD import agent

bot = agent.Agent(
    config='config-ex-2.json',
    tools = ['CODE']
)

Enter your Open AI API key:  ········



Would you like to use a database with BRAD [Y/N]?


 N


2024-11-21 20:49:51 INFO semantic_router.utils.logger local


Welcome to BRAD! The output from this conversation will be saved to /home/jpic/BRAD-Examples/Scanpy/output/November 21, 2024 at 08:49:11 PM/log.json. How can I help?


In [2]:
response = bot.invoke('/force CODE Using the data from /nfs/turbo/umms-indikar/shared/projects/DGC/data/tabula_sapiens/extract/TS_Liver.h5ad, perform a PCA and save a visualization of the results.')


2024-11-21 20:49:55,888 - INFO - CODE
2024-11-21 20:49:55,891 - INFO - CODER
2024-11-21 20:49:55,895 - INFO - input_variables=['user_query'] template="You must select which code to run to help a user.\n\n**Available Scripts**\nScript Name: scanpy-brad, \t Description: This script executes a series of scanpy commands on an AnnData object loaded from a .h5ad file and saves the resulting AnnData object back to disk.\n\n\n**User Query**\n{user_query}\n\n**Task**\nBased on the user's query, select the best script from the available scripts. Provide the script name and explain why it is the best match. If no script is good, replace script with None\n\n**Response Template**\nSCRIPT: <selected script>\nREASON: <reasoning why the script fits the user prompt>\n"
2024-11-21 20:49:55,896 - INFO - FIRST LLM CALL


BRAD >> 1: 


2024-11-21 20:49:57,276 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-21 20:49:57,288 - INFO - SCRIPT: scanpy-brad
REASON: The scanpy-brad script is the best match for the user's query because it specifically executes scanpy commands on an AnnData object loaded from a .h5ad file, which is exactly what the user needs to perform PCA on the data from TS_Liver.h5ad and save a visualization of the results.
2024-11-21 20:49:57,290 - INFO - scriptPath=/home/jpic/BRAD-Examples/Scanpy/scripts/
2024-11-21 20:49:57,291 - INFO - ALL SCRIPTS FOUND. BUILDING TEMPLATE
2024-11-21 20:49:57,293 - INFO - input_variables=['history', 'input'] template='Current conversation:\n{history}\n\n**PYTHON SCRIPT**\nYou must run this python script:\n/home/jpic/BRAD-Examples/Scanpy/scripts/scanpy-brad.py\n\n**PYTHON SCRIPT DOCUMENTATION**:\nThis is the doc string of this python script:\n"""\nThis script executes a series of scanpy commands on an AnnData object loaded



> Entering new ConversationChain chain...
Prompt after formatting:
[32;1m[1;3mCurrent conversation:


**PYTHON SCRIPT**
You must run this python script:
/home/jpic/BRAD-Examples/Scanpy/scripts/scanpy-brad.py

**PYTHON SCRIPT DOCUMENTATION**:
This is the doc string of this python script:
"""
This script executes a series of scanpy commands on an AnnData object loaded from a .h5ad file and saves the resulting AnnData object back to disk.

Arguments (four arguments):
    1. output directory: chatstatus['output-directory']
    2. output file: <name of output file>
    3. input file: <file created in previous step>
    4. scanpy commands: a list of scanpy commands to be executed on the AnnData object (provided as a single string with commands separated by a delimiter, e.g., ';')

Based on the arguments, the input file will be loaded, then your commands will be executed, and finally, the output or resulting ann data object will be saved to the corrrectou output file and directory

2024-11-21 20:50:02,682 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-11-21 20:50:02,684 - INFO - LLM OUTPUT PARSER
2024-11-21 20:50:02,685 - INFO - Arguments:
1. Output directory: chatstatus['output-directory']
2. Output file: "PCA_results.h5ad"
3. Input file: "/nfs/turbo/umms-indikar/shared/projects/DGC/data/tabula_sapiens/extract/TS_Liver.h5ad"
4. Scanpy commands: "sc.pp.pca(adata); sc.pl.pca(adata, save='pca_results.png')"

Python Code Explanation: The provided arguments include the output directory, output file name, input file path, and the scanpy commands to perform PCA and save the visualization. 

Execute: subprocess.call([sys.executable, '/home/jpic/BRAD-Examples/Scanpy/scripts/scanpy-brad.py', chatstatus['output-directory'], "PCA_results.h5ad", "/nfs/turbo/umms-indikar/shared/projects/DGC/data/tabula_sapiens/extract/TS_Liver.h5ad", "sc.pp.pca(adata); sc.pl.pca(adata, save='pca_results.png')"], capture_output=True, text=True)
20


> Finished chain.


2024-11-21 20:50:38,826 - INFO - Debug: PYTHON code executed successfully.
2024-11-21 20:50:38,828 - INFO - Code Execution Output:
2024-11-21 20:50:38,829 - INFO - Debug: PYTHON code output saved to output.
2024-11-21 20:50:38,830 - INFO - 


route



2024-11-21 20:50:38,830 - INFO - CODE


When functioning correctly, the above commands will generate a PCA plot in the `Agent`'s output directory.

We say "when functioning correctly" because, as is often the case with code, both computers and humans introduce bugs that can lead to unexpected behavior. This prototyped module demonstrates the straightforward execution of small commands. However, despite the utility of RAG, LLMs are not yet capable of fully automating complex data analysis. The real strength of LLMs in automatic code execution lies in their ability to quickly write simple code—often producing errors or bugs that humans can identify and fix—rather than their ability to generate original, error-free analysis.
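Because generated calls can fail, it helps to run them in a child process and inspect the return code and captured output, so a person can spot and fix the problem. The sketch below shows one way to do that with `subprocess.run`, which accepts `capture_output` and `text`. It is an illustrative wrapper, not BRAD's execution code: the `run_script` helper and the small echo script are assumptions.

```python
import os
import subprocess
import sys
import tempfile


def run_script(script_path, arguments):
    """Run a script in a child interpreter and capture what it prints."""
    completed = subprocess.run(
        [sys.executable, script_path, *arguments],
        capture_output=True,   # collect stdout and stderr instead of printing
        text=True,             # decode bytes to str
    )
    return completed.returncode, completed.stdout, completed.stderr


# Demonstration with a throwaway script that reports how many arguments it got
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "echo_args.py")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write("import sys\nprint(len(sys.argv) - 1)\n")
    code, out, err = run_script(
        script, ["out", "result.h5ad", "input.h5ad", "sc.pp.pca(adata)"]
    )
```

A non-zero return code, or text on the error stream, is the signal to show the failure to the user instead of reporting success.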

In the next section, we will demonstrate how to use the `CODE` tool sequentially to execute larger bioinformatics pipelines.

