# Agent 3

Raw agent output often includes technical details and extraneous information, such as tool-call logs and progress bars, that are redundant or meaningless to users. To improve readability, we developed Agent 3, which refines and restructures Python output strings into standardized, user-friendly formats. For example, Python lists are reformatted as clear, numbered lists, and file-download logs are condensed into short, plain messages. The aim is responses that are concise, relevant, and easy to understand, regardless of the original technical formatting.
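
To make the two transformations mentioned above concrete, here is a minimal rule-based sketch. It is only an illustration: Agent 3 itself delegates the rewriting to a language model, and the helper names `format_list` and `format_download` are hypothetical, not part of this notebook.

```python
import re

def format_list(items):
    """Render a Python list as a numbered list, one item per line."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))

def format_download(log_text):
    """Summarise a download log as one sentence, or return None if nothing matches."""
    # Assumes the log contains a line of the form "Downloaded <id> to <path>".
    match = re.search(r"Downloaded (\S+) to (\S+)", log_text)
    if match is None:
        return None
    entity_id, path = match.groups()
    return f"{entity_id} was downloaded to {path}."

print(format_list(["syn51133519", "syn51133520"]))
print(format_download("[INFO] Downloaded syn51133599 to /content/demo_data/8578_AS_1_unfiltered.h5ad"))
```

Fixed rules like these cover only the patterns they were written for, which is presumably why a prompt-driven model is used for the general case.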

In [5]:
import re, io, sys

In [4]:
from IPython.display import HTML, display, Markdown

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [3]:
import base64
import vertexai
from vertexai.generative_models import GenerativeModel, Part, SafetySetting

In [6]:
PROJECT_ID = "isb-cgc-external-004"
LOCATION = "us-central1"

vertexai.init(project=PROJECT_ID, location=LOCATION)

In [9]:
generation_config = {
    "max_output_tokens": 8192,
    "temperature": 1,
    "top_p": 0.5,
}

safety_settings = [
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
]

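def generate(question="What can you do?"):
    """Ask Gemini to reformat raw agent output and render the reply as Markdown."""

    # Base prompt: role, blank-input rule, and three few-shot examples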
    prompt = """You are an experienced writer and editor. When given an input, convert it into a well-formatted string output. If the input string is blank, just say that no outputs were generated for the current query. When someone asks you who you are, answer accordingly.

    Here are some examples:
    Example 1:
    Input: '''
    Generated code: import scanpy as sc
    import pandas as pd

    adata = sc.read_h5ad("/content/demo_data/6723_KL_1_unfiltered.h5ad")
    print(adata)
    AnnData object with n_obs × n_vars = 4992 × 19074
    obs: 'in_tissue', 'array_row', 'array_col', 'kmeans_7_clusters', 'kmeans_10_clusters', 'kmeans_4_clusters', 'kmeans_2_clusters', 'kmeans_6_clusters', 'graphclust', 'kmeans_3_clusters', 'kmeans_8_clusters', 'kmeans_9_clusters', 'kmeans_5_clusters'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'Morans_I', 'Morans_I_p_val', 'Morans_I_adj_p_val', 'Feature Counts in Spots Under Tissue', 'Median Normalized Average Counts', 'Barcodes Detected per Feature'
    uns: 'spatial'
    obsm: 'spatial'
    Agent 2: Executed Python code with accessible variables.
    The agent action is tool='agent_2' tool_input={'question': 'Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?'} log="\nInvoking: `agent_2` with `{'question': 'Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?'}`\n\n\n" message_log=[AIMessage(content='', additional_kwargs={'function_call': {'name': 'agent_2', 'arguments': '{"question": "Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?"}'}}, response_metadata={'is_blocked': False, 'safety_ratings': [], 'usage_metadata': {'prompt_token_count': 380, 'candidates_token_count': 35, 'total_token_count': 415, 'cached_content_token_count': 0}, 'finish_reason': 'STOP', 'avg_logprobs': -0.004535583513123649, 'logprobs_result': {'top_candidates': [], 'chosen_candidates': []}}, id='run-7d91d738-3af2-468a-bb2b-e9b23235e38a-0', tool_calls=[{'name': 'agent_2', 'args': {'question': 'Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?'}, 'id': '62e66320-2760-4403-9049-43fd7a676e0f', 'type': 'tool_call'}], usage_metadata={'input_tokens': 380, 'output_tokens': 35, 'total_tokens': 415})] tool_call_id='62e66320-2760-4403-9049-43fd7a676e0f'
    The tool result is: None
    {'intermediate_steps': [(ToolAgentAction(tool='agent_2', tool_input={'question': 'Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?'}, log="\nInvoking: `agent_2` with `{'question': 'Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'name': 'agent_2', 'arguments': '{"question": "Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?"}'}}, response_metadata={'is_blocked': False, 'safety_ratings': [], 'usage_metadata': {'prompt_token_count': 380, 'candidates_token_count': 35, 'total_token_count': 415, 'cached_content_token_count': 0}, 'finish_reason': 'STOP', 'avg_logprobs': -0.004535583513123649, 'logprobs_result': {'top_candidates': [], 'chosen_candidates': []}}, id='run-7d91d738-3af2-468a-bb2b-e9b23235e38a-0', tool_calls=[{'name': 'agent_2', 'args': {'question': 'Can you use scanpy to load the /content/demo_data/6723_KL_1_unfiltered.h5ad?'}, 'id': '62e66320-2760-4403-9049-43fd7a676e0f', 'type': 'tool_call'}], usage_metadata={'input_tokens': 380, 'output_tokens': 35, 'total_tokens': 415})], tool_call_id='62e66320-2760-4403-9049-43fd7a676e0f'), 'None')]}

    '''
    Well formatted output:

    '''The dataset is loaded to the anndata object named adata. There are 4992 cells or spots, 19074 genes, and 13 feature columns. The data object is printed below:

    AnnData object with n_obs × n_vars = 4992 × 19074
    obs: 'in_tissue', 'array_row', 'array_col', 'kmeans_7_clusters', 'kmeans_10_clusters', 'kmeans_4_clusters', 'kmeans_2_clusters', 'kmeans_6_clusters', 'graphclust', 'kmeans_3_clusters', 'kmeans_8_clusters', 'kmeans_9_clusters', 'kmeans_5_clusters'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'Morans_I', 'Morans_I_p_val', 'Morans_I_adj_p_val', 'Feature Counts in Spots Under Tissue', 'Median Normalized Average Counts', 'Barcodes Detected per Feature'
    uns: 'spatial'
    obsm: 'spatial'
    '''

    Example 2:
    Input: '''
    Welcome, arun.das!

    INFO:synapseclient_default:Welcome, arun.das!

    Downloading files:  77%|███████▋  | 16.8M/21.8M [00:00<00:00, 28.0MB/s, syn51133599]Downloaded syn51133599 to /content/demo_data/8578_AS_1_unfiltered.h5ad
    [INFO] Downloaded syn51133599 to /content/demo_data/8578_AS_1_unfiltered.h5ad
    Downloading files: 100%|██████████| 21.8M/21.8M [00:00<00:00, 31.8MB/s, syn51133599]INFO:synapseclient_default:Downloaded syn51133599 to /content/demo_data/8578_AS_1_unfiltered.h5ad
    Downloading files: 100%|██████████| 21.8M/21.8M [00:00<00:00, 31.6MB/s, syn51133599]import synapseclient
    syn = synapseclient.login(silent=True)
    entity = syn.get('syn51133599', downloadLocation='/content/demo_data')
    The agent action is tool='agent_1' tool_input={'question': 'Can you download the synapse dataset syn51133599 to /content/demo_data/?'} log="\nInvoking: `agent_1` with `{'question': 'Can you download the synapse dataset syn51133599 to /content/demo_data/?'}`\n\n\n" message_log=[AIMessage(content='', additional_kwargs={'function_call': {'name': 'agent_1', 'arguments': '{"question": "Can you download the synapse dataset syn51133599 to /content/demo_data/?"}'}}, response_metadata={'is_blocked': False, 'safety_ratings': [], 'usage_metadata': {'prompt_token_count': 396, 'candidates_token_count': 27, 'total_token_count': 423, 'cached_content_token_count': 0}, 'finish_reason': 'STOP', 'avg_logprobs': -0.0003545089038433852, 'logprobs_result': {'top_candidates': [], 'chosen_candidates': []}}, id='run-49c29b88-a638-4a7d-89fa-122d8edb910d-0', tool_calls=[{'name': 'agent_1', 'args': {'question': 'Can you download the synapse dataset syn51133599 to /content/demo_data/?'}, 'id': '48d647d4-94d2-4cae-bd6f-f5d5e2240945', 'type': 'tool_call'}], usage_metadata={'input_tokens': 396, 'output_tokens': 27, 'total_tokens': 423})] tool_call_id='48d647d4-94d2-4cae-bd6f-f5d5e2240945'
    The tool result is: None
    ----
    '''

    Well formatted output:
    '''
    The synapse dataset with synapse id syn51133599 is successfully downloaded to /content/demo_data/8578_AS_1_unfiltered.h5ad.
    '''


    Example 3:
    Input: '''
    {'agent_outcome': [ToolAgentAction(tool='agent_1', tool_input={'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}, log="\nInvoking: `agent_1` with `{'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'name': 'agent_1', 'arguments': '{"question": "Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?"}'}}, response_metadata={'is_blocked': False, 'safety_ratings': [], 'usage_metadata': {'prompt_token_count': 536, 'candidates_token_count': 49, 'total_token_count': 585, 'cached_content_token_count': 0}, 'finish_reason': 'STOP', 'avg_logprobs': -0.0002904396352111077, 'logprobs_result': {'top_candidates': [], 'chosen_candidates': []}}, id='run-b2cf906b-d897-4164-b743-1120419ebdf9-0', tool_calls=[{'name': 'agent_1', 'args': {'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}, 'id': '6987b536-4024-4b91-a0b1-720e41b1076e', 'type': 'tool_call'}], usage_metadata={'input_tokens': 536, 'output_tokens': 49, 'total_tokens': 585})], tool_call_id='6987b536-4024-4b91-a0b1-720e41b1076e')]}
    import pandas_gbq
    from google.cloud import bigquery
    import synapseclient

    project_id = "isb-cgc-external-004"

    query = '''
    SELECT entityId
    FROM `isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current`
    WHERE File_Format = 'hdf5'
    '''

    df = pandas_gbq.read_gbq(query, project_id=project_id)
    print(df)

    syn = synapseclient.login(silent=True)
    entity = syn.get('syn51133602', downloadLocation='/content/datasets')
    Downloading: 100%|##########|
        entityId
    0   syn51133519
    1   syn51133520
    2   syn51133521
    3   syn51133522
    4   syn51133523
    5   syn51133524
    6   syn51133525
    7   syn51133526
    8   syn51133527
    9   syn51133528
    10  syn51133529
    11  syn51133530
    12  syn51133531
    13  syn51133532
    14  syn51133533
    15  syn51133534
    16  syn51133537
    17  syn51133540
    18  syn51133578
    19  syn51133580
    20  syn51133583
    21  syn51133587
    22  syn51133591
    23  syn51133592
    24  syn51133593
    25  syn51133595
    26  syn51133596
    27  syn51133597
    28  syn51133598
    29  syn51133599
    30  syn51133600
    31  syn51133601
    32  syn51133602
    33  syn51133603
    34  syn51133604
    35  syn51133605
    36  syn51133606
    37  syn51133607
    38  syn51133608
    39  syn51133609
    40  syn51133612
    The agent action is tool='agent_1' tool_input={'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'} log="\nInvoking: `agent_1` with `{'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}`\n\n\n" message_log=[AIMessage(content='', additional_kwargs={'function_call': {'name': 'agent_1', 'arguments': '{"question": "Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?"}'}}, response_metadata={'is_blocked': False, 'safety_ratings': [], 'usage_metadata': {'prompt_token_count': 536, 'candidates_token_count': 49, 'total_token_count': 585, 'cached_content_token_count': 0}, 'finish_reason': 'STOP', 'avg_logprobs': -0.0002904396352111077, 'logprobs_result': {'top_candidates': [], 'chosen_candidates': []}}, id='run-b2cf906b-d897-4164-b743-1120419ebdf9-0', tool_calls=[{'name': 'agent_1', 'args': {'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}, 'id': '6987b536-4024-4b91-a0b1-720e41b1076e', 'type': 'tool_call'}], usage_metadata={'input_tokens': 536, 'output_tokens': 49, 'total_tokens': 585})] tool_call_id='6987b536-4024-4b91-a0b1-720e41b1076e'
    The tool result is: None
    {'intermediate_steps': [(ToolAgentAction(tool='agent_1', tool_input={'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}, log="\nInvoking: `agent_1` with `{'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'name': 'agent_1', 'arguments': '{"question": "Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?"}'}}, response_metadata={'is_blocked': False, 'safety_ratings': [], 'usage_metadata': {'prompt_token_count': 536, 'candidates_token_count': 49, 'total_token_count': 585, 'cached_content_token_count': 0}, 'finish_reason': 'STOP', 'avg_logprobs': -0.0002904396352111077, 'logprobs_result': {'top_candidates': [], 'chosen_candidates': []}}, id='run-b2cf906b-d897-4164-b743-1120419ebdf9-0', tool_calls=[{'name': 'agent_1', 'args': {'question': 'Can you load the entityId of all data under HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is hdf5?'}, 'id': '6987b536-4024-4b91-a0b1-720e41b1076e', 'type': 'tool_call'}], usage_metadata={'input_tokens': 536, 'output_tokens': 49, 'total_tokens': 585})], tool_call_id='6987b536-4024-4b91-a0b1-720e41b1076e'), 'None')]}
    ---- Initial response captured ----
    '''
    Well formatted output:
    '''
    A total of 41 entity IDs were retrieved from the BigQuery table isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level4_metadata_current where the File_Format is 'hdf5'. The entity IDs are listed below:

    1. syn51133519
    2. syn51133520
    3. syn51133521
    4. syn51133522
    5. syn51133523
    6. syn51133524
    7. syn51133525
    8. syn51133526
    9. syn51133527
    10. syn51133528
    11. syn51133529
    12. syn51133530
    13. syn51133531
    14. syn51133532
    15. syn51133533
    16. syn51133534
    17. syn51133537
    18. syn51133540
    19. syn51133578
    20. syn51133580
    21. syn51133583
    22. syn51133587
    23. syn51133591
    24. syn51133592
    25. syn51133593
    26. syn51133595
    27. syn51133596
    28. syn51133597
    29. syn51133598
    30. syn51133599
    31. syn51133600
    32. syn51133601
    33. syn51133602
    34. syn51133603
    35. syn51133604
    36. syn51133605
    37. syn51133606
    38. syn51133607
    39. syn51133608
    40. syn51133609
    41. syn51133612
    '''
    """

    prompt += f"""
    Input: {question}
    Please generate the well formatted output.
    """

    # Reuse the project settings defined earlier instead of repeating the literals
    vertexai.init(project=PROJECT_ID, location=LOCATION)
    model = GenerativeModel(
        "gemini-1.5-flash-002",
    )
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
        # tools='code_execution'
    )

    # Render the model's reply as Markdown in the notebook
    display(Markdown(responses.text))
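
The few-shot inputs above are dominated by LangChain trace lines that the formatted output never uses. One possible pre-processing step, sketched here as an assumption and not part of the original notebook, is to drop those lines before building the prompt. The helper `strip_agent_trace` and the prefix list are hypothetical, and because the few-shot examples themselves contain trace lines, whether this helps or hurts the model's output would need testing.

```python
# Hypothetical prefixes, taken from the trace lines visible in the examples above.
NOISE_PREFIXES = (
    "The agent action is",
    "The tool result is:",
    "{'intermediate_steps'",
    "{'agent_outcome'",
)

def strip_agent_trace(raw):
    """Drop lines that start with a known trace prefix; keep everything else."""
    kept = [
        line for line in raw.splitlines()
        if not line.lstrip().startswith(NOISE_PREFIXES)
    ]
    return "\n".join(kept).strip()
```

A call such as `generate(strip_agent_trace(text))` would then send only the code and its printed results to the model.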


In [None]:
generate('Who are you and what can you do?')

In [11]:
# Test 1

text = """
candidates {
  content {
    role: "model"
    parts {
      text: "The HTAN spatial transcriptomics metadata file contains information about the spatial location of cells, including attributes such as `Spatial Barcode Length`, `UMI Barcode Offset`, `UMI Barcode Length`, and potentially coordinates if available in the specific dataset.  It also includes general file information like `Filename`, `Run ID`, `File Format`, and HTAN identifiers such as `HTAN Parent Biospecimen ID` and `HTAN Data File ID`.  The exact attributes will vary depending on the specific dataset.\n"
    }
  }
  finish_reason: STOP
  avg_logprobs: -0.10189473395254098
}
usage_metadata {
  prompt_token_count: 3155
  candidates_token_count: 102
  total_token_count: 3257
}
model_version: "gemini-1.5-flash-002"
"""
generate(text)

```
The HTAN spatial transcriptomics metadata file contains information about the spatial location of cells, including attributes such as `Spatial Barcode Length`, `UMI Barcode Offset`, `UMI Barcode Length`, and potentially coordinates if available in the specific dataset. It also includes general file information like `Filename`, `Run ID`, `File Format`, and HTAN identifiers such as `HTAN Parent Biospecimen ID` and `HTAN Data File ID`. The exact attributes will vary depending on the specific dataset.
```


In [14]:
# Test 2

text = """
Downloading: 100%|██████████|
         CellID cell_type
0        271351     Other
1        271535     Other
2        271733     Other
3        272229     Other
4        272613     Other
...         ...       ...
6282743   83619     Other
6282744  201761     Other
6282745  450282     Other
6282746   71082     Other
6282747    5520     Other

[6282748 rows x 2 columns]
"""
generate(text)

The provided input is a table showing cell IDs and their corresponding cell types.  There are 6,282,748 rows in the table.  The majority of cells are classified as "Other".  A more detailed analysis would require further information about the data and the desired level of detail in the output.
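
The row count in the reply above can be cross-checked without the model, because pandas prints the frame's shape in a footer. The following is a small sketch under that assumption; `parse_shape` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_shape(printed_frame):
    """Return (rows, columns) from a pandas '[N rows x M columns]' footer, or None."""
    match = re.search(r"\[(\d+) rows x (\d+) columns\]", printed_frame)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

shape = parse_shape("...\n[6282748 rows x 2 columns]\n")
if shape is not None:
    rows, cols = shape
    print(f"{rows:,} rows and {cols} columns")  # prints "6,282,748 rows and 2 columns"
```

Note that the printed frame shows only ten of the rows, so a statement such as "the majority of cells are classified as Other" cannot be verified from this text alone.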
