In [57]:
from openai import OpenAI
import os
import json
from tqdm import tqdm
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
import instructor # I'm not sure if this will actually be needed...
from pydantic import BaseModel, Field
from typing import List, Dict
import subprocess

In [58]:
load_dotenv('../../.env')

openai_api_key = os.getenv('OPENAI_API_KEY')

# Test OpenAI API...

client = OpenAI(
    api_key=openai_api_key,  # defaults to the OPENAI_API_KEY environment variable, so this can be omitted
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Can you state the name of a real species of bird? Only reply with the species name.",
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

Pica pica


# Plan of action

The ideal plan will be to:
- Download FASTQ files
- Download metadata

A bit of exploration showed that some datasets contain samples from more than one species (for example, both human and mouse). The species of each sample will therefore need to guide which Kallisto analysis I run.
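As a minimal sketch of how mixed-species datasets could be handled, the samples can be grouped by organism before deciding which Kallisto index each group needs. The column names `geo_accession` and `organism_ch1` and the sample IDs below are assumptions for illustration; the real metadata CSVs may label these fields differently.

```python
from collections import defaultdict


def group_samples_by_organism(records, id_key="geo_accession", organism_key="organism_ch1"):
    """Map each organism name to the list of sample IDs annotated with it.

    `records` is a list of dicts, one per sample (for example the result of
    DataFrame.to_dict(orient="records")). The default key names are assumed.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[organism_key]].append(record[id_key])
    return dict(groups)


# Hypothetical rows standing in for the real metadata table.
example_records = [
    {"geo_accession": "GSM0000001", "organism_ch1": "Homo sapiens"},
    {"geo_accession": "GSM0000002", "organism_ch1": "Mus musculus"},
    {"geo_accession": "GSM0000003", "organism_ch1": "Mus musculus"},
]
samples_by_organism = group_samples_by_organism(example_records)
```

Each key of `samples_by_organism` would then correspond to one Kallisto index and one batch of quantification runs.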

What do I specifically need to do?
- I do honestly think downloading the FASTQ files is straightforward (spoilers - it is not):
- Download metadata, then identify the SRR ID (or whatever it's called). EDIT: this turns out not to be so straightforward. Either I have a knowledge gap, or I will need an LLM here.
- prefetch (SRR)
- fasterq-dump (x)
- kallisto quant (x)

In between steps, I can use an LLM call to simply ask "did this work?"

So what will the pipeline look like?

1. We start with a GEO accession. These come from the previous step. For testing purposes, we can manually specify this (there is no challenge here)
2. From the GEO accession, we download the metadata. This can be done in R
3. During the metadata download, we can capture the script output and query an LLM to check that the download completed correctly. I think it's also good to look through the metadata and see if it makes sense given the title (i.e. that the metadata matches the GEO accession)
4. Use a bit of bash scripting (Entrez e-utilities) to go from the GEO accession to SRA IDs.
5. In the metadata I've looked at so far, the SRA IDs are also included. We can use this as a sanity check.
6. We also use the metadata to determine which Kallisto index we need to download (there is also the option to produce the index myself, but I'd prefer to download the pre-built indices if possible - that would be much easier).
7. We take the above information to download the FASTQ files and perform the Kallisto quantification.
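The download and quantification steps of the pipeline above (prefetch, fasterq-dump, kallisto quant) could be sketched as a function that only builds the command lists, so they can be inspected before anything is run. The directory layout, the run ID `SRR0000001`, the index path, and the single-end fragment length values are all assumptions for illustration, not settings taken from this project.

```python
from pathlib import Path


def build_commands(srr_id, index_path, workdir, paired=True,
                   fragment_length=200, fragment_sd=20):
    """Return the prefetch, fasterq-dump and kallisto quant commands for one run.

    Nothing is executed here. fragment_length and fragment_sd are placeholder
    values; kallisto requires them (-l and -s) only for single-end reads.
    """
    work = Path(workdir)
    sra_dir = work / "sra"
    fastq_dir = work / "fastq"
    quant_dir = work / "kallisto" / srr_id

    prefetch = ["prefetch", srr_id, "--output-directory", str(sra_dir)]
    # Assumes prefetch placed the run in <sra_dir>/<srr_id>/.
    fasterq = ["fasterq-dump", str(sra_dir / srr_id),
               "--outdir", str(fastq_dir), "--split-3"]

    quant = ["kallisto", "quant", "-i", str(index_path), "-o", str(quant_dir)]
    if paired:
        # --split-3 writes paired reads to <run>_1.fastq and <run>_2.fastq.
        quant += [str(fastq_dir / f"{srr_id}_1.fastq"),
                  str(fastq_dir / f"{srr_id}_2.fastq")]
    else:
        # Single-end runs are written to <run>.fastq.
        quant += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd),
                  str(fastq_dir / f"{srr_id}.fastq")]
    return [prefetch, fasterq, quant]


# Hypothetical run ID and paths, for illustration only.
paired_commands = build_commands("SRR0000001", "index.idx", "work")
single_commands = build_commands("SRR0000001", "index.idx", "work", paired=False)
```

Each command list could then be passed to `subprocess.run(..., check=True)` in order, with an LLM or a plain return-code check in between.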

In [1]:
import subprocess

# Define the path to the R script
r_script_path = "./RScript_GetGEOMetadata.r"  # Ensure the path is correct and script is executable

# Define the GEO accession you want to use
geo_accession = "GSE273561"  # Replace with the actual GEO accession number

# Define output directory

outdir = "../Testing/GSE273561"

# Construct the command to run the R script with the GEO accession as an argument
command = ["Rscript", r_script_path, "-g", geo_accession, "-o", outdir]

# Call the R script using subprocess
try:
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    print("R script output:")
    print(result.stdout)  # This will print the standard output from the R script
except subprocess.CalledProcessError as e:
    print("Error running R script:")
    print(e.stderr)  # This will print any error messages from the R script

R script output:
Output directory exists: ../Testing/GSE273561 
Retrieving metadata for GEO accession: GSE273561 
Saving metadata to: ../Testing/GSE273561/GSE273561-GPL24247_series_matrix_metadata.csv 
Saving metadata to: ../Testing/GSE273561/GSE273561-GPL24676_series_matrix_metadata.csv 
Metadata saved successfully!



In [44]:
# Define the path to your Bash script
bash_script = './get_sra_ids.sh'

# Define the GEO accession you want to pass to the script
geo_accession = 'GSE273561'  # Replace with the actual GEO accession

# Run the Bash script with the GEO accession as an argument
try:
    result = subprocess.run([bash_script, geo_accession], check=True, text=True, capture_output=True)
    print("Script executed successfully")
    print("Output:\n", result.stdout)
except subprocess.CalledProcessError as e:
    print("Script execution failed")
    print("Error:\n", e.stderr)

Script executed successfully
Output:
 Processing GEO accession: GSE273561
Processing sample: GSM8432354
Processing sample: GSM8432353
Processing sample: GSM8432352
Processing sample: GSM8432351
Processing sample: GSM8432350
Processing sample: GSM8443197
Processing sample: GSM8432349
Processing sample: GSM8443196
Processing sample: GSM8432348
Processing sample: GSM8443195
Processing sample: GSM8432347
Processing sample: GSM8443194
Processing sample: GSM8432346
Processing sample: GSM8443193
Processing sample: GSM8443192
Processing sample: GSM8443191
Processing sample: GSM8443190
Processing sample: GSM8443189
Processing sample: GSM8443188
Processing sample: GSM8443187
Processing sample: GSM8443186
Processing sample: GSM8443185
Processing sample: GSM8443184
Processing sample: GSM8432345
Processing sample: GSM8432344
Processing sample: GSM8432343
Processing sample: GSM8432342



Now that I have at least a basic working example of extracting this data, I will work towards integrating the two outputs. Specifically, this will entail:
- Metadata: given the title/summary of the GEO accession, does the metadata overall make sense? Y/N
- Is there anything to suggest consistency between the metadata and SRA IDs? (Experiments) Y/N

One open question is how much of the metadata I should include in the prompt. Including less would clearly be cheaper, but I then risk losing information.
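One possible way to reduce the prompt size without dropping information is to state values that are identical across all samples only once. The sketch below assumes the metadata is available as a list of per-sample dicts with the same keys (for example from `DataFrame.to_dict(orient="records")`); the column names and IDs in the example are hypothetical.

```python
import json


def compact_records(records):
    """Split records into values shared by every sample and per-sample values."""
    if not records:
        return {"shared": {}, "per_sample": []}
    first = records[0]
    # A column is "shared" when every record holds the same value for it.
    shared = {
        column: value
        for column, value in first.items()
        if all(record.get(column) == value for record in records)
    }
    per_sample = [
        {column: value for column, value in record.items() if column not in shared}
        for record in records
    ]
    return {"shared": shared, "per_sample": per_sample}


# Hypothetical rows standing in for the real metadata table.
example_records = [
    {"geo_accession": "GSM0000001", "organism_ch1": "Mus musculus", "treatment": "control"},
    {"geo_accession": "GSM0000002", "organism_ch1": "Mus musculus", "treatment": "drug"},
]
compact = compact_records(example_records)
compact_json = json.dumps(compact)
```

The compacted JSON string could be placed in the prompt instead of the full per-row export; whether this saves enough tokens to matter would need to be measured on the real metadata.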

# Experimentation

Here, I will work with the manually downloaded data for one example, GSE273561, to develop the LLM call. I can see that the metadata and the SRA IDs should match. This is also an interesting case where I have two metadata matrices.

With a bit of testing, there is some hallucination (i.e. the model will invent samples that do not exist). Therefore, there will need to be a mechanism that asks "here's what we've reported, is this grounded in truth based on the metadata?"
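Before spending another LLM call on this, a cheap deterministic check could catch invented IDs: every GSM/SRX/SRR ID the model reports should appear verbatim somewhere in the metadata text. This is a minimal sketch, assuming the IDs follow the usual prefix-plus-digits pattern; the SRX ID and the second GSM ID in the example are hypothetical.

```python
import re

# GEO sample, SRA experiment and SRA run accessions: a prefix followed by digits.
ID_PATTERN = re.compile(r"\b(?:GSM|SRX|SRR)\d+\b")


def find_ungrounded_ids(reported_ids, metadata_text):
    """Return the reported IDs that never occur in the metadata text."""
    present = set(ID_PATTERN.findall(metadata_text))
    return sorted(set(reported_ids) - present)


# Hypothetical metadata snippet, shaped like a JSON export of the table.
example_metadata = (
    '[{"geo_accession": "GSM8432354", '
    '"relation": "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX0000001"}]'
)
ungrounded = find_ungrounded_ids(["GSM8432354", "GSM9999999"], example_metadata)
```

Any ID returned by `find_ungrounded_ids` would be a candidate hallucination and a reason to reject or retry the response.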

So I think the workflow I will go with:
- I extract metadata (1), GSM sample IDs, SRA experiment IDs, and SRA IDs (2) from a GEO dataset accession
- My expectation is that everything in (2) should align with each other. However, we can do a check against the metadata to ensure this is the case:
- In the metadata, link each GSM sample ID to an SRX ID (and perform an LLM check that this is correct, i.e. "given the metadata, is this response correct?")
- Compare the metadata evaluation to the (2) links. If they match, we can be confident (this conveniently gives us a way to link the metadata to the FASTQ files as well)
- If the above checks all pass, then we can extract FASTQ files (as I did above - prefetch -> fasterq-dump).
- Independently, we also need to perform a check for which Kallisto index we need to download.
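The comparison between the metadata-derived links and the (2) links could be sketched as a plain dictionary comparison. This assumes both sources have already been reduced to GSM-to-SRX mappings; how `results.txt` and the LLM output are parsed into those mappings is not shown, and the IDs below are hypothetical.

```python
def compare_links(metadata_links, eutils_links):
    """Compare two GSM -> SRX mappings and report every disagreement."""
    shared = metadata_links.keys() & eutils_links.keys()
    return {
        "mismatched": sorted(
            gsm for gsm in shared if metadata_links[gsm] != eutils_links[gsm]
        ),
        "only_in_metadata": sorted(metadata_links.keys() - eutils_links.keys()),
        "only_in_eutils": sorted(eutils_links.keys() - metadata_links.keys()),
    }


def links_agree(report):
    """True when the comparison report lists no disagreements at all."""
    return not any(report.values())


# Hypothetical mappings, for illustration only.
from_metadata = {"GSM0000001": "SRX0000001", "GSM0000002": "SRX0000002"}
from_eutils = {"GSM0000001": "SRX0000001", "GSM0000002": "SRX0000009"}
report = compare_links(from_metadata, from_eutils)
```

Only when `links_agree(report)` is true would the pipeline go on to download FASTQ files; otherwise the report shows which samples need a closer look.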

In [45]:
# Start by loading my data
meta1 = pd.read_csv("../Testing/GSE273561/GSE273561-GPL24247_series_matrix_metadata.csv")
meta2 = pd.read_csv("../Testing/GSE273561/GSE273561-GPL24676_series_matrix_metadata.csv")
SRA_IDs = pd.read_table("./results.txt")

In [31]:
meta1_string = meta1.to_json(orient="records")

In [50]:
class MetadataExtraction(BaseModel):
    GSMSample_ID: list[str]
    GSMSample_column: str
    SRAExperiment_ID: list[str]
    SRAExperiment_column: str

prompt = f"""Consider the following information, which is a data frame which has been converted to JSON output. 

{meta1_string}

Can you identify each of the unique GSM sample IDs, the column they are found in, and do the same for SRA experiment IDs?

"""

chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=MetadataExtraction,
)

result = chat_completion.choices[0].message.parsed
print(result)

AttributeError: 'Beta' object has no attribute 'chat'

In [54]:
client.chat.completions.

SyntaxError: invalid syntax (547858539.py, line 1)