This is the script for Task 2 of BAISBench. It covers downloading the dataset, obtaining the background information, and loading the multiple-choice questions and their reference answers. You can feed these to the AI scientists and ask them to answer the questions based on the data.

In [1]:
import pandas as pd
import re
import os
import numpy as np

In [None]:
question_id = 2  # row index of the entry to load

df = pd.read_excel('Task2_data/BAISBench_task2.xlsx', sheet_name='Sheet1')

question_name = df['name'][question_id]
row = df[df['name'] == question_name]  # the single row for this entry
background_info = row['background'].values.item()

# Each entry has up to five questions; skip empty (NaN) slots
question_list = []
question_answer = []
for i in range(1, 6):
    question = row[f'Questions{i}'].values.item()
    if pd.isna(question):
        continue
    question_list.append(question)
    # Extract the option letters from the reference answer, e.g. "A) ..." -> ['A']
    question_answer.append(re.findall(r'\b([A-Z])\)', row[f'Answer{i}'].values.item()))
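
The regular expression used above pulls the option letters out of the reference-answer text. A minimal illustration, assuming an answer cell formatted like `A) ...; C) ...` (the sample string is made up and is not taken from the dataset):

```python
import re

# Hypothetical answer text; the real cells in BAISBench_task2.xlsx may be formatted differently.
sample_answer = "A) T cells; C) B cells"

# \b([A-Z])\) matches a single capital letter at a word boundary that is followed by ")".
letters = re.findall(r'\b([A-Z])\)', sample_answer)
print(letters)            # ['A', 'C']
print(','.join(letters))  # A,C
```

The comma-joined form is the one the evaluation code further down compares against the AI scientist's answers.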


In [None]:
from huggingface_hub import list_repo_files, hf_hub_download

# Set the repo name
repo_id = "EperLuo/BAISBench" 
repo_type = "dataset" 

# Set the file-name prefix
prefix = f"task2 - {question_name}"

# List all files in the repo
all_files = list_repo_files(repo_id=repo_id, repo_type=repo_type)

# Keep only the files with the given prefix
target_files = [f for f in all_files if f.startswith(prefix)]

# Download these files
for file_name in target_files:
    local_path = hf_hub_download(
        repo_id=repo_id, 
        filename=file_name, 
        repo_type=repo_type,
        local_dir="Task2_data", 
        local_dir_use_symlinks=False )
    print(f"Downloaded: {file_name} -> {local_path}")

We can then feed these questions to an AI scientist, which analyzes the data and returns the answers. Below are the AI scientists we evaluated in our benchmark.
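
As a rough illustration of what "feeding" the questions could look like, the sketch below assembles the background and questions into a single text prompt. `build_prompt`, its wording, and the `data_dir` argument are assumptions made for this example; the prompts actually used in the benchmark live in the `model_zoo` folders and may differ.

```python
def build_prompt(background, questions, data_dir="Task2_data"):
    """Assemble one text prompt from the background and its questions (hypothetical helper)."""
    lines = [
        "Background:",
        str(background).strip(),
        "",
        f"The dataset files are in: {data_dir}",
        "",
        "Answer each multiple-choice question based on your analysis of the data.",
        "Reply with the option letters only, e.g. 'B' or 'A,C'.",
        "",
    ]
    for idx, question in enumerate(questions, start=1):
        lines.append(f"Question {idx}: {str(question).strip()}")
    return "\n".join(lines)


# Made-up stand-ins for background_info and question_list
demo_prompt = build_prompt(
    "A single-cell RNA-seq study (placeholder text).",
    ["Which cell type is most abundant? A) ... B) ..."],
)
print(demo_prompt)
```

Asking for letters only keeps the reply in the same `'B'` / `'A,C'` form that the scoring code expects.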

# Biomni
To run Biomni on Task 2, first install the Biomni environment. See https://github.com/snap-stanford/biomni for details.

After setting the API key and base URL, run:
```
cd model_zoo/Biomni
python run_task2_claude_sonnet.py
```
This runs all Task 2 questions one after another with the Claude Sonnet model. You can modify the model and base URL in this Python file. The results are saved to ./output_claude_sonnet.

Below is the script for evaluating the results.

In [2]:
df_quest = pd.read_excel("Task2_data/BAISBench_task2.xlsx", sheet_name='Sheet1')
all_cato = np.load('Task2_data/BAISBench_task2_categories.npy', allow_pickle=True).tolist()

In [3]:
# Collect all questions and their reference answers
question_list = []
question_answer = []
# Read the questions and background information for each of the 41 entries
for question_id in range(41):
    question_name = df_quest['name'][question_id]
    row = df_quest[df_quest['name'] == question_name]
    background_info = row['background'].values.item()

    for i in range(1, 6):
        question = row[f'Questions{i}'].values.item()
        # Skip empty question slots
        if pd.isna(question):
            continue
        question_list.append(question)
        answers = re.findall(r'\b([A-Z])\)', row[f'Answer{i}'].values.item())
        # Store the option letters as a comma-separated string, e.g. 'A,C,D'
        question_answer.append(','.join(answers))

In [4]:
np.unique(all_cato, return_counts=True)

(array(['Analysis of cellular components', 'Cell heterogeneity analysis',
        'Cell-cell communication', 'Cellular function reasoning',
        'Developmental state analysis', 'Disease analysis',
        'Key gene analysis', 'Other', 'Pathway analysis',
        'Reasoning & analysis based on data'], dtype='<U34'),
 array([27, 22,  4, 26, 15, 34, 33,  4,  5, 23]))

In [5]:
def score_answers(gt_list, pred_list):
    """
    gt_list   : standard answers list ['B', 'A,C,D']
    pred_list : AI scientist answers list ['B', 'A,C']
    return    : (total_score, score_list)
    """
    assert len(gt_list) == len(pred_list), "the length of gt_list and pred_list must be the same."

    scores = []

    for gt, pred in zip(gt_list, pred_list):
        gt_set = set(gt.split(","))
        pred_set = set(pred.split(","))

        # Single-answer question
        if len(gt_set) == 1:
            score = 1.0 if pred_set == gt_set else 0.0

        # Multiple-answer question
        else:
            if pred_set == gt_set:
                score = 1.0                      # all correct
            elif pred_set < gt_set:
                score = 0.5                      # missing some correct options
            else:
                score = 0.0                      # wrong options included

        scores.append(score)

    return sum(scores), scores

from collections import defaultdict

def accuracy_by_type(gt_list, pred_list, type_list):
    assert len(gt_list) == len(pred_list) == len(type_list)

    total_scores, scores = score_answers(gt_list, pred_list)

    stats = defaultdict(lambda: {
        "count": 0,
        "strict_correct": 0,
        "total_score": 0.0
    })

    for s, t in zip(scores, type_list):
        stats[t]["count"] += 1
        stats[t]["total_score"] += s
        if s == 1.0:
            stats[t]["strict_correct"] += 1

    # Compute per-category accuracy
    result = {}
    for t, v in stats.items():
        result[t] = {
            "num_questions": v["count"],
            "strict_accuracy": v["strict_correct"] / v["count"],
            "expected_score": v["total_score"] / v["count"]
        }

    return result
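
To make the scoring rule concrete: an exact match earns 1 point, a proper subset of a multiple-answer reference earns 0.5, and anything else earns 0. The sketch below re-implements the per-question rule in a small standalone helper (`score_one` is a name introduced here for illustration only) and applies it to a few made-up answer pairs:

```python
def score_one(gt, pred):
    """Score one question with the same rule as score_answers (re-implemented for illustration)."""
    gt_set, pred_set = set(gt.split(",")), set(pred.split(","))
    if pred_set == gt_set:
        return 1.0
    # Partial credit only applies to multiple-answer questions
    if len(gt_set) > 1 and pred_set < gt_set:
        return 0.5
    return 0.0


examples = [
    ("B", "B"),          # single answer, exact match
    ("B", "C"),          # single answer, wrong
    ("A,C,D", "A,C,D"),  # multiple answers, exact match
    ("A,C,D", "A,C"),    # proper subset of the reference
    ("A,C,D", "A,B"),    # contains a wrong option
    ("A,C", "A,C,D"),    # superset of the reference
]
demo_scores = [score_one(gt, pred) for gt, pred in examples]
print(demo_scores)  # [1.0, 0.0, 1.0, 0.5, 0.0, 0.0]
```

Note that selecting any option outside the reference set, including a superset of the correct options, scores 0.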

In [6]:
answer_biomni_claude = 'B B B D B C C C D A B A B A,C,D B,D A B B,C C B C B B C B A A B C A C C B D A D D B B B B A A D B A C B C C D D C D A,B,C,D A,D A,C A B B D D A A B B B A,B,C A C A B D A C A D B B A A A A A B C A A,B,D,E A A A,C C B A A B A B B A B A B A D D D A A C D D A A A A B C B A C C A,C,D A A B C A A,D A B B B A D C A A A A D B B D A,B,C,D C C B D A D B B A C A B B B C C D B C D C B C C A B B A A C B B B C C C D B A B C B A A B B B A'
answer_biomni_claude = answer_biomni_claude.split(' ')
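
Because the answers are split on single spaces, a stray double space produces an empty token and a malformed token would be scored as wrong without any warning. A small sanity check, sketched under the assumption that every answer is a comma-separated list of capital letters (`check_predictions` is a hypothetical helper, not part of the benchmark code):

```python
import re


def check_predictions(pred_list, n_questions):
    """Return indices of malformed tokens; raise if the count is off (hypothetical helper)."""
    if len(pred_list) != n_questions:
        raise ValueError(f"expected {n_questions} answers, got {len(pred_list)}")
    pattern = re.compile(r'[A-Z](,[A-Z])*')
    return [i for i, tok in enumerate(pred_list) if not pattern.fullmatch(tok)]


# Made-up input: note the double space and the lowercase 'b'
demo = "B A,C,D  b D".split(' ')
bad = check_predictions(demo, n_questions=5)
print(bad)  # [2, 3]
```

In the notebook this could be called as `check_predictions(answer_biomni_claude, len(question_answer))` before scoring.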

In [7]:
total_scores, score_list = score_answers(question_answer, answer_biomni_claude)
print("Total Score:", total_scores)
print("100 point scale:", total_scores / len(question_answer) * 100)

result = accuracy_by_type(question_answer, answer_biomni_claude, all_cato)

for k, v in result.items():
    print(k, v['expected_score']*100)

Total Score: 136.0
100 point scale: 70.46632124352331
Analysis of cellular components 68.51851851851852
Key gene analysis 62.121212121212125
Disease analysis 76.47058823529412
Cell heterogeneity analysis 72.72727272727273
Reasoning & analysis based on data 76.08695652173914
Pathway analysis 50.0
Developmental state analysis 73.33333333333333
Cellular function reasoning 76.92307692307693
Cell-cell communication 75.0
Other 25.0
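
As a consistency check on the numbers above, the overall score should equal the question-weighted sum of the per-category scores. The sketch below copies the category counts from the `np.unique` output and the percentages from the printed results, then recomputes the total:

```python
# Questions per category (from the np.unique output)
counts = {
    "Analysis of cellular components": 27,
    "Cell heterogeneity analysis": 22,
    "Cell-cell communication": 4,
    "Cellular function reasoning": 26,
    "Developmental state analysis": 15,
    "Disease analysis": 34,
    "Key gene analysis": 33,
    "Other": 4,
    "Pathway analysis": 5,
    "Reasoning & analysis based on data": 23,
}
# Per-category scores in percent (from the printed results)
percent = {
    "Analysis of cellular components": 68.51851851851852,
    "Cell heterogeneity analysis": 72.72727272727273,
    "Cell-cell communication": 75.0,
    "Cellular function reasoning": 76.92307692307693,
    "Developmental state analysis": 73.33333333333333,
    "Disease analysis": 76.47058823529412,
    "Key gene analysis": 62.121212121212125,
    "Other": 25.0,
    "Pathway analysis": 50.0,
    "Reasoning & analysis based on data": 76.08695652173914,
}

total_questions = sum(counts.values())
total_score = sum(counts[k] * percent[k] / 100 for k in counts)

print(total_questions)                                # 193
print(round(total_score, 6))                          # 136.0
print(round(total_score / total_questions * 100, 2))  # 70.47
```

The recomputed total matches the reported 136.0 points over 193 questions, so the overall figure is a per-question average and not an unweighted mean over categories.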


# Pantheon

First set up the environment for Pantheon. See https://pantheonos.stanford.edu/cli/docs/ or https://github.com/aristoteleo/pantheon-cli for details.

After setting the model, API key, and base URL, run:
```
cd model_zoo/Pantheon
chmod +x run_task2_claude_sonnet.sh
bash run_task2_claude_sonnet.sh
```

The results and analysis process will be saved in model_zoo/Pantheon/output_task2_claude_sonnet. The prompts for this task can be found at model_zoo/Pantheon/prompt_task2.

# STELLA

First set up the environment for STELLA. See https://github.com/zaixizhang/STELLA for details.


After setting the API key and base URL, run:
```
cd model_zoo/STELLA
python stella_core.py --use_template --use_mem0
```

The prompt for STELLA can be found in model_zoo/STELLA/prompt_task2.

The results are saved to model_zoo/STELLA/output_task2. The analysis can be viewed in a web browser.