# GEO Curator

An agent that finds single-cell RNA-seq datasets on GEO, locates their h5ad, h5Seurat, or csv files, and downloads them once the user approves.

The agent loop:
1. Runs Qwen locally
2. Lets the model decide whether to call a tool
3. Executes geo_search
4. Feeds the results back to the model
5. Optionally downloads a dataset

User question → LLM thinks → decides: "I need GEO" → LLM outputs TOOL_CALL(JSON) → Python executes geo_search() → Tool result injected back into LLM → LLM continues reasoning


In [36]:
# Defining a tool
import json
from Bio import Entrez

Entrez.email = "abhinavjj@gmail.com"

def geo_search(query: str, retmax: int = 5) -> str:
    """
    Search GEO DataSets (GDS) using a query string.
    Returns a JSON string of GEO IDs.
    """
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax
    )
    record = Entrez.read(handle)
    handle.close()
    return json.dumps(record["IdList"])
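
Free-text queries passed to `geo_search` can return series that are not single-cell data at all. Below is a minimal sketch of a query builder that adds Entrez field tags. The helper name `build_geo_query` is hypothetical, and the tags used (`[Entry Type]`, `[Organism]`, `[suppFile]`) are assumptions based on the GEO DataSets search fields; they should be checked against the GEO query documentation before being relied on.

```python
def build_geo_query(keywords, organism=None, supp_format=None, series_only=True):
    """Assemble an Entrez query string for db="gds" (hypothetical helper)."""
    # Quote each keyword so multi-word phrases are searched as phrases.
    parts = [f'"{kw}"[All Fields]' for kw in keywords]
    if series_only:
        # Assumed tag: restrict hits to GSE series records.
        parts.append("gse[Entry Type]")
    if organism:
        parts.append(f'"{organism}"[Organism]')
    if supp_format:
        # Assumed tag: filter by supplementary file type.
        parts.append(f"{supp_format}[suppFile]")
    return " AND ".join(parts)


query = build_geo_query(
    ["single cell", "immune aging"],
    organism="Mus musculus",
    supp_format="h5ad",
)
print(query)
```

The resulting string can be passed as the `query` argument of `geo_search`.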


An LLM's output format cannot be hard-constrained, so we instruct Qwen in the system prompt instead: if you want to call a tool, output EXACTLY this JSON.

In [37]:
import pandas as pd
handle = Entrez.esummary(db="gds", id=200301650)
record = Entrez.read(handle)
pd.DataFrame(record[0].keys(), record[0].values())

| Value (index) | Field (column 0) |
| --- | --- |
| [] | Item |
| 200301650 | Id |
| GSE301650 | Accession |
|  | GDS |
| Single-cell RNA-seq of isolated non-parenchymal cells from imiquimod-induced psoriasis mouse model. | title |
| We applied 10x Genomics single-cell RNA sequencing to profile non-parenchymal liver cells (NPCs) in a psoriasis-like mouse model. The study focuses on immune and liver sinusoidal endothelial cell (LSEC) alterations along the skin–liver axis to elucidate mechanisms driving comorbid liver disease in psoriasis. Given that 30–50% of psoriasis patients develop liver involvement, this approach aims to identify maladaptive cellular crosstalk and potential therapeutic targets. | summary |
| 24247 | GPL |
| 301650 | GSE |
| Mus musculus | taxon |
| GSE | entryType |


In [38]:
# New GEO helper that also extracts the FTP link to download from
import json
from Bio import Entrez
import re

Entrez.email = "abhinavjj@gmail.com"

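def geo_summary(uid: str) -> str:
    """
    Fetch the esummary record for one GEO UID.
    Returns a JSON string with the accession, title, and FTP link.
    """
    # --- Step 1: Get the esummary record (contains the GSE accession) ---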
    handle = Entrez.esummary(db="gds", id=uid)
    record = Entrez.read(handle)
    handle.close()

    doc = record[0]

    summary = {
        "uid": uid,
        "accession": doc.get("Accession", "unknown"),
        "title": doc.get("title", "unknown"),
        "type": doc.get("gdsType", "unknown"),
        "platform": doc.get("GPL", "unknown"),
        "n_samples": doc.get("n_samples", "unknown"),
        "suppFile": doc.get("suppFile", "unknown"),
        "Date": doc.get("PDAT", "unknown"),
        "FTP": doc.get("FTPLink", "unknown"),
    }

    return json.dumps(summary, indent=2)
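
The `suppFile` and `FTP` fields returned by `geo_summary` are what decide whether a dataset is worth downloading. The sketch below shows one way to screen them in plain Python before handing results to the model. The helper names `processed_formats` and `ftp_to_https` are hypothetical, the preferred formats are taken from the goal stated at the top of this notebook, and the scheme rewrite assumes the NCBI host serves the same path over HTTPS.

```python
# Target formats named in the notebook introduction; matching is case-insensitive.
PREFERRED = ("H5AD", "H5SEURAT", "CSV")


def processed_formats(supp_file: str) -> list[str]:
    """Return the preferred formats present in a suppFile string such as "CSV, TXT"."""
    found = {token.strip().upper() for token in supp_file.split(",") if token.strip()}
    return [fmt for fmt in PREFERRED if fmt in found]


def ftp_to_https(ftp_link: str) -> str:
    """Swap the ftp:// scheme for https:// (assumption: same host and path over HTTPS)."""
    prefix = "ftp://"
    if ftp_link.startswith(prefix):
        return "https://" + ftp_link[len(prefix):]
    return ftp_link


print(processed_formats("CSV, TXT"))
print(processed_formats("MTX, TSV"))
print(ftp_to_https("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE125nnn/GSE125300/"))
```

An empty list from `processed_formats` means only raw-style matrices (for example MTX/TSV) are listed, which the agent can report instead of downloading.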


In [39]:
## Test to see if we can download using https rather than ftp
''' 
import requests
from pathlib import Path
url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE301650&format=file"
out_dir = Path('/mnt/data/projects/.immune/Personal/scRNA_Agent/Data_Curator')
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / f"GSE301650_supplementary.tar"
with requests.get(url, stream=True, timeout=120) as r:
    r.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
'''

In [40]:
import requests
from pathlib import Path

def geo_download_https(gse: str, out_dir: str = "geo_downloads") -> str:
    """
    Download GEO supplementary files using HTTPS endpoint.
    """
    url = f"https://www.ncbi.nlm.nih.gov/geo/download/?acc={gse}&format=file"

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / f"{gse}_supplementary.tar"

    with requests.get(url, stream=True, timeout=120) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)

    return str(out_path)
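
Because the `gse` argument comes from model output and is interpolated straight into the download URL, it may be worth validating it first. This is a minimal sketch; `validate_gse` and `GSE_PATTERN` are hypothetical names, and the pattern assumes accessions are the letters GSE followed only by digits.

```python
import re

# Assumption: a series accession is "GSE" followed by one or more digits.
GSE_PATTERN = re.compile(r"GSE\d+")


def validate_gse(gse: str) -> str:
    """Normalize and check an accession before it is placed in a URL."""
    cleaned = gse.strip().upper()
    if not GSE_PATTERN.fullmatch(cleaned):
        raise ValueError(f"Not a GSE accession: {gse!r}")
    return cleaned


print(validate_gse(" gse301650 "))
```

Calling `validate_gse(gse)` at the top of `geo_download_https` would reject values such as a raw Entrez UID or a string carrying extra URL parameters.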


In [41]:
AGENT_SYSTEM_PROMPT = """
You are a bioinformatics research agent specializing in single-cell RNA-seq data curation.

You have access to TWO tools:
1. geo_search(query: str) → returns GEO Entrez UIDs, each summarized with its GSE accession and other details
2. geo_download_https(gse: str) → downloads the processed supplementary files for a GSE accession

Rules:
- You may call ONLY ONE tool per response.
- If you need to search GEO, output ONLY a JSON tool call.
- Do NOT hallucinate GEO accessions.
- GEO search results are Entrez UIDs, NOT GSE accessions.
- UID resolution (UID → GSE → metadata) will be provided to you as tool output.
- You may ONLY use GSE accessions explicitly provided in tool results.
- Do not mix explanations with tool calls.
- If a field is missing, say "unknown".
- Ask the user for confirmation BEFORE downloading any dataset.
- NEVER construct or guess GEO HTTPS URLs.
- Check that an FTP link exists before downloading over HTTPS.
- If no supplementary FTP link exists, report "unknown".
- If there are multiple study accessions, download only ONE dataset.

After UID resolution, analyze each dataset and determine:
- Is this single-cell RNA-seq?
- What organism and immune context?
- Are RAW or processed files available?
- Does it match the user’s biological question?

Tool call formats (EXACT):

{
  "tool_call": {
    "name": "geo_search",
    "arguments": {
      "query": "..."
    }
  }
}

{
  "tool_call": {
    "name": "geo_download_https",
    "arguments": {
      "gse": "..."
    }
  }
}
"""


In [42]:
## Loading a Qwen model
# conda activate torch_gpu_dna
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Qwen/Qwen2.5-7B-Instruct"
## Kimi K2 Thinking cannot be downloaded, so we start with Qwen. My GPU is a Tesla T4, so I will stick with Qwen 7B.

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="cuda"
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████| 4/4 [00:19<00:00,  4.87s/it]
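
The deprecation warning above asks for a `BitsAndBytesConfig` object passed as `quantization_config`. Here is a minimal sketch of that form. The wrapper name `load_qwen_4bit` is hypothetical, and the NF4 quantization type and float16 compute dtype are assumptions rather than settings taken from this notebook (float16 is chosen because the T4 has no native bfloat16 support). It needs `transformers`, `bitsandbytes`, and a CUDA GPU to actually run.

```python
def load_qwen_4bit(model_id: str = "Qwen/Qwen2.5-7B-Instruct"):
    """Load the tokenizer and a 4-bit quantized model via quantization_config."""
    # Imported lazily so that defining this function does not require a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumption: NF4 quantization
        bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute on a T4
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    return tokenizer, model
```

Usage would be `tokenizer, model = load_qwen_4bit()`, after which the rest of the notebook is unchanged.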


In [43]:
### This helper only generates a reply (for example, the plan) from the instructions given to the agent.
# It is kept for reference; the same logic is used directly inside the agent.
'''
def run_llm(messages, max_new_tokens=512):
    # This line converts structured chat messages into a single text prompt in the 
    # exact format the model was trained on.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Tokenize the prompt with the model's own tokenizer.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad(): # Since inference not training
        output_ids = model.generate( # autoregressive text generation
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False # greedy decoding; temperature is ignored when sampling is off, so it is not set
        )
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True
    ).strip()
'''

#### Tool-call detector (the agent “router”)

In [44]:
import json

def parse_tool_call(text: str):
    try:
        data = json.loads(text)
        if "tool_call" in data:
            return data["tool_call"]
    except json.JSONDecodeError:
        return None
    return None
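
Instruction-tuned models sometimes wrap JSON in a Markdown code fence or return valid JSON that is not an object, and `json.loads` alone does not cover either case. The sketch below is a slightly more defensive variant of the detector. `extract_tool_call` is a hypothetical name, and the fence handling only covers a single fenced block spanning the whole reply.

```python
import json
import re


def extract_tool_call(text: str):
    """Return the tool_call dict from a model reply, or None if there is none."""
    cleaned = text.strip()
    # Strip an optional opening fence (three backticks, optional "json" tag) and closing fence.
    cleaned = re.sub(r"^`{3}(?:json)?\s*", "", cleaned, flags=re.IGNORECASE)
    cleaned = re.sub(r"\s*`{3}$", "", cleaned)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # Valid JSON such as a list or a number carries no tool call.
    if not isinstance(data, dict):
        return None
    call = data.get("tool_call")
    if not isinstance(call, dict) or "name" not in call:
        return None
    call.setdefault("arguments", {})
    return call


fence = "`" * 3
reply = fence + 'json\n{"tool_call": {"name": "geo_search", "arguments": {"query": "immune aging"}}}\n' + fence
print(extract_tool_call(reply))
print(extract_tool_call("Here are the datasets I found."))
```

Returning `None` for anything that is not a well-formed call lets the caller treat the reply as the final natural-language answer.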


In [45]:
import json

def run_agent(prompt, max_steps=5):
    # Initial history: the system prompt tells the agent how to behave, and the user message carries the request.
    # Assistant replies and tool results are appended to this list on every step.
    messages = [
        {"role": "system", "content": AGENT_SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]
    for step in range(max_steps):
        print(f'step:{step}')
        # Convert the structured chat messages into a single prompt string in the exact format the model was trained on.
        # The full history is re-rendered on every step, so the model sees everything:
        # system rules, user query, previous tool calls, and previous tool results.
        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        # Tokenize the prompt with the model's own tokenizer.
        model_inputs = tokenizer(inputs, return_tensors="pt").to(model.device)
        with torch.no_grad(): # Since inference not training
            output_ids = model.generate( # autoregressive text generation
                **model_inputs,
                max_new_tokens=512,
                do_sample=False,
            )
        response = tokenizer.decode(
            output_ids[0][model_inputs["input_ids"].shape[-1]:],
            skip_special_tokens=True
        ).strip()
        print(f"\nLLM OUTPUT:\n{response}") # For a tool call this is JSON only, as the system prompt instructs;
        # once the summaries have been fed back, the model answers in natural language.
        # 2. Try parsing a tool call
        try:
            data = json.loads(response)
        except json.JSONDecodeError:
            # Normal text → done
            return response
        # Valid JSON that is not an object (a list, a number, ...) carries no tool call
        tool_call = data.get("tool_call") if isinstance(data, dict) else None
        if tool_call:
            tool_name = tool_call["name"] # either geo_search or geo_download_https
            args = tool_call["arguments"]
            # 3. Execute tool
            # if tool_name == "geo_search": # this is taking only one UID
            #     result = geo_search(**args)
            if tool_name == "geo_search": # resolve every returned UID through geo_summary
                uids = json.loads(geo_search(**args))
                summaries = []
                for uid in uids:
                    summaries.append(json.loads(geo_summary(uid=uid)))
                result = json.dumps(summaries, indent=2)
                # print(f'geo_search result: {result}')
            # elif tool_name == "geo_summary":
            #     result = geo_summary(**args)
            elif tool_name == "geo_download_https":
                result = geo_download_https(**args)
            else:
                raise ValueError(f"Unknown tool: {tool_name}")
            print(f"\nTOOL RESULT:\n{result}")
            # 4. Feed tool result back to model
            # After every step, you append new entries so the model can reason based on what already happened.
            messages.append({
                "role": "assistant",
                "content": response
            })
            # Models trained for tool calling (Qwen, Llama, GPT-style) expect:
            # | Role        | Meaning                           |
            # | ----------- | --------------------------------- |
            # | `system`    | Rules and behavior                |
            # | `user`      | Human request                     |
            # | `assistant` | Model’s reasoning / tool decision |
            # | `tool`      | External factual input            |
            # With role "tool", the model knows: “This came from the real world, not my imagination.”
            messages.append({
                "role": "tool",
                "name": tool_name,
                "content": result
            })
            # Both messages must be appended:
            # Assistant: “I decided to call a tool”
            # Tool: “Here is the result of that tool”
        else:
            return response
    raise RuntimeError("Agent did not finish in time")
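
The system prompt asks the model to get confirmation before downloading, but `run_agent` runs `geo_download_https` as soon as the model emits the call, so the rule depends on the model obeying it. Below is a minimal sketch of enforcing the approval step in Python instead. `dispatch_tool` and its parameters are hypothetical, and the tool in the demo is a stub that only returns a path.

```python
def dispatch_tool(tool_call: dict, tools: dict, needs_confirmation=frozenset(), confirm=input) -> str:
    """Run a tool by name, asking the user first for tools that need approval."""
    name = tool_call.get("name")
    args = tool_call.get("arguments", {})
    if name not in tools:
        # Report the problem back to the model instead of raising.
        return f"ERROR: unknown tool {name!r}"
    if name in needs_confirmation:
        answer = confirm(f"Run {name} with {args}? [y/N] ")
        if answer.strip().lower() not in {"y", "yes"}:
            return "User declined; the tool was not run."
    return tools[name](**args)


# Demo with a stub tool and a stubbed confirmation prompt (no real download, no real input).
tools = {"geo_download_https": lambda gse: f"geo_downloads/{gse}_supplementary.tar"}
call = {"name": "geo_download_https", "arguments": {"gse": "GSE301650"}}
print(dispatch_tool(call, tools, {"geo_download_https"}, confirm=lambda _: "n"))
print(dispatch_tool(call, tools, {"geo_download_https"}, confirm=lambda _: "y"))
```

Feeding the returned string back as the `tool` message keeps the loop going even when the user declines or the model names a tool that does not exist.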


In [46]:
query = """
Find single-cell RNA-seq datasets related to immune cell aging.
Prefer processed data formats if possible.
"""

final_output = run_agent(query)
print("\nFINAL OUTPUT:\n", final_output)


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


step:0

LLM OUTPUT:
{
  "tool_call": {
    "name": "geo_search",
    "arguments": {
      "query": "immune cell aging single-cell RNA-seq processed"
    }
  }
}

TOOL RESULT:
[
  {
    "uid": "200289404",
    "accession": "GSE289404",
    "title": "Inflammatory Bowel Disease Leads to Long-Term Ovarian Dysfunction via Immune-Mediated Follicular Aging",
    "type": "Expression profiling by high throughput sequencing",
    "platform": "34290",
    "n_samples": 14,
    "suppFile": "TXT",
    "Date": "2025/02/17",
    "FTP": "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE289nnn/GSE289404/"
  },
  {
    "uid": "200125300",
    "accession": "GSE125300",
    "title": "TOP2B disturbed the quality of human oocytes with advanced maternal age",
    "type": "Expression profiling by high throughput sequencing",
    "platform": "20795",
    "n_samples": 6,
    "suppFile": "CSV, TXT",
    "Date": "2019/01/19",
    "FTP": "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE125nnn/GSE125300/"
  }
]
step:1

LLM OUTPUT:

In [None]:
# Making it more specific for better dataset curation
query = (
    "single cell[All Fields] AND RNA-seq[All Fields] AND immune[All Fields]"
)

final_output = run_agent(query)
print("\nFINAL OUTPUT:\n", final_output)


step:0

LLM OUTPUT:
{
  "tool_call": {
    "name": "geo_search",
    "arguments": {
      "query": "single cell[All Fields] AND RNA-seq[All Fields] AND immune[All Fields]"
    }
  }
}

TOOL RESULT:
[
  {
    "uid": "200301650",
    "accession": "GSE301650",
    "title": "Single-cell RNA-seq of isolated non-parenchymal cells from imiquimod-induced psoriasis mouse model.",
    "type": "Expression profiling by high throughput sequencing",
    "platform": "24247",
    "n_samples": 8,
    "suppFile": "MTX, TSV",
    "Date": "2026/01/01",
    "FTP": "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE301nnn/GSE301650/"
  },
  {
    "uid": "200286125",
    "accession": "GSE286125",
    "title": "The Transcription Factor T-bet Plays a Crucial Role in Regulating the Immune Regulatory Function of Double-Negative T Cells",
    "type": "Expression profiling by high throughput sequencing",
    "platform": "17021",
    "n_samples": 4,
    "suppFile": "CSV",
    "Date": "2026/01/01",
    "FTP": "ftp://ftp.ncbi

In [3]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear4bit(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear4bit(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm