# Kallisto quantification

The purpose of this notebook is to develop a framework for automating Kallisto quantification.

The ground rules I plan to set:
- I will use pre-built Kallisto indices
- I am using human data as a test case, but I expect this to generalise easily to other species
- I will specify the directory containing the FASTQ files.

In [71]:
from openai import OpenAI
import openai  # Needed for openai.pydantic_function_tool, used below to test tools with structured outputs
import os
from tqdm import tqdm
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List, Dict, Literal
import subprocess
import glob
import asyncio
import json

In [3]:
# Quick test that the API key loads and the client responds

load_dotenv('../../.env')

openai_api_key = os.getenv('OPENAI_API_KEY')

# Test OpenAI API...

client = OpenAI(
  api_key=openai_api_key,  # this is also the default, it can be omitted
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "In two sentences or fewer, can you describe the molecular composition of water?",
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

Water (H₂O) is composed of two hydrogen atoms covalently bonded to one oxygen atom. This molecular structure gives water its unique properties, including its solvent capabilities and high surface tension.


# Plan of action

- Specify the location of the directory containing the FASTQ files
- Specify the location of the metadata (EDIT: I think I will develop this later - I would need to link SRX/SRA IDs for this to have any chance of working)
- Develop a prompt for Kallisto quantification
- The outcomes should be:
  - Kallisto code that appropriately identifies: the correct pair of FASTQ files, correct parameters, correct index selection
  - Outputs the results in an appropriate directory
  - Outputs the code and logic used
  - Outputs the... output... which can be used as an evaluation tool
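
One way to evaluate whether a generated command picked the correct FASTQ pairs is to compute the expected pairs deterministically and compare. The following is a minimal sketch, assuming the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming seen in the SRR files used in this notebook; `pair_fastq_files` and the `FASTQs/` paths are hypothetical names used only for illustration.

```python
import os
import re
from collections import defaultdict

# Assumption: mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")

def pair_fastq_files(paths):
    """Group FASTQ paths by sample and return {sample: (read1, read2)}."""
    grouped = defaultdict(dict)
    for path in paths:
        match = PAIR_PATTERN.match(os.path.basename(path))
        if match:
            grouped[match["sample"]][match["read"]] = path
    # Keep only samples for which both mates were found
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in sorted(grouped.items())
        if "1" in reads and "2" in reads
    }

# Hypothetical input: the second sample has no mate, so it is dropped
example = [
    "FASTQs/SRR29101291_2.fastq.gz",
    "FASTQs/SRR29101291_1.fastq.gz",
    "FASTQs/SRR29101292_1.fastq.gz",
]
pairs = pair_fastq_files(example)
print(pairs)
```

Samples with a missing mate are silently dropped here; for single-end data a different rule would be needed.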

In [14]:
class KallistoCode(BaseModel):
    command: str
    kallisto_param_logic: str
    other_command_logic: str

# The intention is to record the logic behind each Kallisto parameter and any other commands. I anticipate a for loop, with the reasoning for it belonging in other_command_logic

prompt = f"""

## IDENTITY AND PURPOSE

You are an expert bioinformatician who specialises in RNA-seq data. You have great familiarity with the Kallisto tool, and use this in command line prompts to perform quantification.

You will be given the paths to FASTQ files and required to construct a command line prompt that uses Kallisto to quantify these FASTQ files. Take a deep breath, and carefully consider the following steps to best achieve this goal.

## STEPS

1. Consider the list of FASTQ files. Use this to determine which files are matched with each other - for example, the R1 and R2 files that correspond to each other.
2. Take into account that the path to the index file is ~/work/data/kallisto_indices/human/index.idx
3. Take into consideration this documentation for kallisto quant:

kallisto 0.50.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)
-t, --threads=INT             Number of threads to use (default: 1)

Also note that Kallisto will not take in any of the following legacy options:

Legacy options:

These options are only supported in kallisto versions before 0.50.0. To use them install an older version of kallisto (recommended: 0.48.0).

kallisto quant:

--bias learns parameters for a model of sequence-specific bias and corrects the abundances accordingly.

--fusion does normal quantification, but additionally looks for reads that do not pseudoalign because they are potentially from fusion genes. All output is written to the file fusion.txt in the output folder.

--pseudobam outputs all pseudoalignments to a file pseudoalignments.bam in the output directory. This BAM file contains the pseudoalignments in BAM format, ordered by reads so that each pseudoalignment of a read is adjacent in the BAM file.

--genomebam constructs the pseudoalignments to the transcriptome, but projects the transcript alignments to genome coordinates, resulting in split-read alignments. When the --genomebam option is supplied, a GTF file must be given with the --gtf option. The GTF file, which can be plain text or gzipped, translates transcripts into genomic coordinates. We recommend downloading the cDNA FASTA files and GTF files from the same data source. The --chromosomes option can provide the lengths of the genomic chromosomes; this option is not necessary, but it gives a more consistent BAM header, which some programs may require for downstream analysis. kallisto does not require the genome sequence to do pseudoalignment, but downstream tools such as genome browsers will probably need it.

4. Carefully consider what you know about using Kallisto for quantification, and provide code to quantify ALL provided FASTQ files in a way that is biologically meaningful. Take into account:
- relevant output directories
- that all FASTQ files should be quantified in a biologically accurate manner
5. Provide a SINGLE command that can be used to perform quantification. This command should be something that can be copy-pasted into the command line
6. For each parameter that is specified in the kallisto command, include a justification for why the parameter was included
7. For any other part of the command (e.g. wrapping in a for loop), include a justification for why this is necessary

## OUTPUT

Your output should be structured with the following components. Do not include any information that does not belong in a component. For example, any output for "command" should ONLY include code, and no additional comments or commentary:
1. command - the command that can be copy-pasted into the command line to perform the quantification for all FASTQ files
2. kallisto parameter logic - for each kallisto parameter that was specified, include the reasoning for selecting the value that was used
3. other command logic - for anything that is used to support the kallisto command (e.g. a for loop), justify why these were needed.

## INPUT
List of FASTQ files (absolute paths):

/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101300_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101298_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101300_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101302_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101296_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101299_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101296_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101297_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101298_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101302_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101297_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101301_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101299_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101301_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_2.fastq.gz"""


In [15]:
chat_completion = client.beta.chat.completions.parse(
    messages=[
        {"role": "system", "content": "You are an expert bioinformatician who specialises in generating accurate commands to achieve tasks."},
        {"role": "user", "content": prompt}
    ],
    model="gpt-4o-mini",
    response_format=KallistoCode
)
print(chat_completion.choices[0].message.parsed)

command='mkdir -p output && for R1 in /home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/*_1.fastq.gz; do R2=${R1/_1/_2}; if [ -f "$R2" ]; then kallisto quant -i ~/work/data/kallisto_indices/human/index.idx -o output/$(basename ${R1%_1.fastq.gz}) --bias --pseudobam --threads=4 $R1 $R2; fi; done' kallisto_param_logic="-i ~/work/data/kallisto_indices/human/index.idx: This specifies the path to the Kallisto index for quantification.\\n-o output/$(basename ${R1%_1.fastq.gz}): Creates an output directory for each sample using the basename of the R1 file minus the '_1.fastq.gz' suffix. The output structure is important for organizing results.\\n--bias: This flag helps to correct for sequence-specific bias, ensuring more accurate quantification.\\n--pseudobam: This option generates a BAM file with pseudoalignments, which is useful for further analysis and validation of read alignments in downstream tools.\\n--threads=4: Utilizes 4 threads to speed up the quantification process 

In [21]:
import asyncio
import subprocess

async def execute_command(command, timeout=10):
    # Create a process to run the shell command directly
    # (create_subprocess_shell always runs through the shell, so shell=True is not needed)
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    
    try:
        # Wait for initial output or timeout
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
        stdout = stdout.decode()
        stderr = stderr.decode()
        return_code = process.returncode
    except asyncio.TimeoutError:
        # If we timeout, it means the process is still running
        stdout = "Process started and running in the background."
        stderr = ""
        return_code = "Running"
    
    return stdout, stderr, return_code

In [17]:
generated_command = chat_completion.choices[0].message.parsed.command

In [18]:
generated_command

'mkdir -p output && for R1 in /home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/*_1.fastq.gz; do R2=${R1/_1/_2}; if [ -f "$R2" ]; then kallisto quant -i ~/work/data/kallisto_indices/human/index.idx -o output/$(basename ${R1%_1.fastq.gz}) --bias --pseudobam --threads=4 $R1 $R2; fi; done'
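
The generated command above includes `--bias` and `--pseudobam`, both of which the prompt lists as legacy options. A deterministic check can catch this before the command is executed. The following is a minimal sketch; `find_legacy_flags` is a hypothetical helper, the flag set is taken from the legacy list quoted in the prompt, and the example command uses placeholder file names.

```python
import re

# Legacy options listed in the prompt above (only supported before kallisto 0.50.0)
LEGACY_FLAGS = {"--bias", "--fusion", "--pseudobam", "--genomebam"}

def find_legacy_flags(command):
    """Return the legacy kallisto flags that appear in a shell command string."""
    flags = set(re.findall(r"--[a-z][a-z-]*", command))
    return sorted(flags & LEGACY_FLAGS)

# Placeholder command for illustration only
example_command = (
    "kallisto quant -i index.idx -o output/sample "
    "--bias --pseudobam --threads=4 sample_1.fastq.gz sample_2.fastq.gz"
)
print(find_legacy_flags(example_command))
```

A non-empty result could be fed back to the model as a reason to regenerate the command.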

# "Code sub-modules"

I will be writing individual blocks for each aspect of the code. This will involve:
- Identifying the locations of the FASTQ files
- (Getting the metadata and other files?)
- (Identifying the locations of the index files?)
- (Producing the Kallisto documentation?)
- Generating Kallisto code
- Executing the code
- The self-correction mechanism
- The "Main" workflow

In [162]:
# Experiment - I will instead have a generic function for finding files, with an optional parameter to specify the file extension i.e. file type
# I hope this might enable a more agentic workflow - i.e. "use this function to find files" -> filter these file names

def list_files(extensions):
    files = []
    
    # Loop through each extension and collect the files
    for extension in extensions:
        matched_files = glob.glob(os.path.join(os.path.expanduser("~"), '**', f'*{extension}'), recursive=True)
        # Filter out directories and keep only files
        files.extend([f for f in matched_files if os.path.isfile(f)])
    
    # Get absolute paths and sort them
    absolute_paths = [os.path.abspath(f) for f in files]
    sorted_paths = sorted(absolute_paths)
    
    return sorted_paths

# Implement the function in tools

class GetFiles(BaseModel):
    extension: list[str]

tools = [openai.pydantic_function_tool(GetFiles)]

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Can you identify pre-built Kallisto index files that were downloaded from the official Kallisto GitHub? Take a deep breath, carefully consider what likely extensions may be. Only provide responses which are specific to Kallisto indices.",
        }
    ],
    model="gpt-4o-mini",
    tools=tools
)

chat_completion.choices[0].message

ChatCompletionMessage(content=None, refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_Rr4e39ENmSdYlva2A42PCO9C', function=Function(arguments='{"extension":["kallisto","idx","kallisto.idx"]}', name='GetFiles'), type='function')])

In [163]:
tool_call = chat_completion.choices[0].message.tool_calls[0]
arguments = json.loads(tool_call.function.arguments)
extension = arguments.get('extension')
print(extension)

tools

['kallisto', 'idx', 'kallisto.idx']


[{'type': 'function',
  'function': {'name': 'GetFiles',
   'strict': True,
   'parameters': {'properties': {'extension': {'items': {'type': 'string'},
      'title': 'Extension',
      'type': 'array'}},
    'required': ['extension'],
    'title': 'GetFiles',
    'type': 'object',
    'additionalProperties': False}}}]

In [164]:
prompt = f"""

Are any of these file extensions: {extension} used, either commonly or rarely, for Kallisto index files, particularly those downloaded from the official Kallisto GitHub?

"""

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini"
)

print(chat_completion.choices[0].message.content)

Kallisto is a software tool used for quantifying transcript abundance from RNA-seq data using a pseudo-alignment approach. The file extension that is typically associated with Kallisto index files is `.idx`. 

From the file extensions you provided:
- **`idx`** is the correct and commonly used extension for Kallisto index files.
- **`kallisto.idx`** could also be a valid naming convention since it includes a specific context by prefixing with 'kallisto', but it is not the standard extension used by Kallisto itself.
- The extension **`kallisto`** on its own is not a recognized file type for Kallisto index files.

In summary, the primary file extension used for Kallisto index files is `.idx`. You might come across files named `kallisto.idx`, but `idx` is the standard.


In [165]:
list_files(extension)

['/home/myuser/work/data/kallisto_indices/human/index.idx']

With this proof of concept working, I will now test a similar principle for getting documentation.

In [172]:
# Using LLM to automate documentation

def get_documentation(command):
    try:
        # Execute the kallisto command
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        
        # Capture the stdout
        stdout = result.stdout
        
        # Return the results
        return stdout
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

docs = get_documentation("kallisto quant --help")
print(docs) # What? I swear this works by accident...

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              

In [179]:
class GetDocumentation(BaseModel):
    command: str

tools = [openai.pydantic_function_tool(GetDocumentation)]

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What command can I use to extract the documentation associated with Kallisto quantification? Think carefully about what the specific Kallisto sub-commands are",
        }
    ],
    model="gpt-4o-mini",
    tools=tools
)

print(chat_completion.choices[0].message)

tool_call = chat_completion.choices[0].message.tool_calls[0]
arguments = json.loads(tool_call.function.arguments)
command = arguments.get('command')
print(command)

docs = get_documentation(command)
print(docs)


ChatCompletionMessage(content=None, refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_iQyyX0slcWAVSPCxtNklQFp2', function=Function(arguments='{"command": "kallisto quant"}', name='GetDocumentation'), type='function'), ChatCompletionMessageToolCall(id='call_UIDE8Ybjlg2ZCrSkXGNQAPfx', function=Function(arguments='{"command": "kallisto index"}', name='GetDocumentation'), type='function')])
kallisto quant
kallisto 0.51.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plainte
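
The message above contains two tool calls (`kallisto quant` and `kallisto index`), but only `tool_calls[0]` is read. A minimal sketch of collecting the arguments from every tool call follows. `collect_tool_arguments` is a hypothetical helper, and the message is a `SimpleNamespace` stand-in that mirrors only the attributes used here, so the sketch runs without an API call.

```python
import json
from types import SimpleNamespace

def collect_tool_arguments(message):
    """Return (function name, parsed arguments) for every tool call on a message."""
    calls = message.tool_calls or []
    return [(call.function.name, json.loads(call.function.arguments)) for call in calls]

# Stand-in for a ChatCompletionMessage with two tool calls (not a real API response)
fake_message = SimpleNamespace(tool_calls=[
    SimpleNamespace(function=SimpleNamespace(
        name="GetDocumentation", arguments='{"command": "kallisto quant"}')),
    SimpleNamespace(function=SimpleNamespace(
        name="GetDocumentation", arguments='{"command": "kallisto index"}')),
])

for name, arguments in collect_tool_arguments(fake_message):
    print(name, arguments["command"])
```

Each collected command could then be passed to `get_documentation` in turn.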

In [None]:
# Combine the documentation and file finding tools into a workflow to generate a prompt

# This will be:
# Use LLMs to determine inputs for finding FASTQ files and Kallisto index files... though now that I think about it, I need an additional step to ensure I get the dataset-relevant FASTQs...

In [177]:
# Identifying the locations of FASTQ files

# Since the output directory is predefined in the extraction of the FASTQ files, I can use this to specify where the output FASTQ files will be.

def list_fastq_gz_files(directory):
    # Use glob to find all files ending with .fastq.gz in the specified directory
    files = glob.glob(os.path.join(directory, '*.fastq.gz'))
    # Convert relative paths to absolute paths
    absolute_paths = [os.path.abspath(f) for f in files]
    sorted_paths = sorted(absolute_paths)
    # Print the absolute paths
    #for path in absolute_paths:
    #    print(path)
    
    return sorted_paths

# Example usage
directory = "/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/"
fastq_files = list_fastq_gz_files(directory)
print("\n".join(fastq_files))

/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_1.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_2.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_1.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_2.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_1.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_2.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_1.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_2.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_1.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_2.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101296_1.fastq.gz
/home/myuser/work/data/2_extract_data/scratch/FASTQs_G

In [5]:
# Extracting the additional files?

# Leave for later

# The intention is, for example, to link sample metadata to the Kallisto files and to identify the correct species (i.e. human or mouse index)

In [31]:
# Identifying locations of the index files

# This is hard coded, I'm not sure how I want to handle creation of Kallisto indices...

# For the moment, I've manually downloaded the pre-built indices (just human at the moment)

def list_kallisto_indices(directory):
    # Use glob to recursively find all files ending with .idx under the specified directory
    files = glob.glob(os.path.join(directory, '**', '*.idx'), recursive=True)
    # Convert relative paths to absolute paths
    absolute_paths = [os.path.abspath(f) for f in files]
    
    # Print the absolute paths
    #for path in absolute_paths:
    #    print(path)
    
    return absolute_paths

directory = os.path.expanduser("~")
# NOTE: the literal "~" is passed here rather than `directory`; glob does not expand "~"
kallisto_indices = list_kallisto_indices("~")
kallisto_indices

[]
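
The empty result is consistent with how `glob` treats `~`: it performs no tilde expansion, so the path has to be expanded before the pattern is built. The following is a minimal sketch, where `find_index_files` is a hypothetical variant of `list_kallisto_indices` and the demonstration uses a temporary directory rather than the real home directory.

```python
import glob
import os
import tempfile

def find_index_files(directory):
    """Recursively list *.idx files under a directory, expanding a leading '~'."""
    root = os.path.expanduser(directory)
    pattern = os.path.join(root, "**", "*.idx")
    return sorted(os.path.abspath(p) for p in glob.glob(pattern, recursive=True))

# Demonstrate on a throwaway directory with a single dummy index file
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "human"))
    open(os.path.join(tmp, "human", "index.idx"), "w").close()
    found = [os.path.relpath(p, tmp) for p in find_index_files(tmp)]
    print(found)
```

Calling `find_index_files("~")` would then search the expanded home directory.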

In [53]:
directory = os.path.expanduser("~")
list_kallisto_indices("./")

[]

In [7]:
# Producing Kallisto documentation

def kallisto_documentation():
    try:
        # Execute the kallisto command
        result = subprocess.run("kallisto quant", shell=True, capture_output=True, text=True)
        
        # Capture the stdout
        stdout = result.stdout
        
        # Return the results
        return stdout
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

kallisto_docs = kallisto_documentation()
print(kallisto_docs)

kallisto 0.51.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE           

In [8]:
# Generating Kallisto code: prompt/class

class KallistoCode(BaseModel):
    command: str
    command_logic: str

# Intention is to create a prompt that can be used to facilitate generation of a Kallisto command

def CreateInitialPrompt(fastq_dir, index_dir):
    # Step 1: Prepare the necessary inputs
    
    fastq_files = list_fastq_gz_files(fastq_dir)
    formatted_fastq_files = "\n".join(fastq_files)
    
    kallisto_indices = list_kallisto_indices(index_dir)
    formatted_kallisto_indices = "\n".join(kallisto_indices)
    kallisto_docs = kallisto_documentation()

    
    kallisto_prompt = f"""

## IDENTITY AND PURPOSE

You are an expert bioinformatician who specialises in RNA-seq data. You have great familiarity with the Kallisto tool, and use this in command line prompts to perform quantification.

You will be given the paths to FASTQ files and required to construct a command line prompt that uses Kallisto to quantify these FASTQ files. Take a deep breath, and carefully consider the following steps to best achieve this goal.

## STEPS

1. Identify and pair the provided FASTQ files (e.g., R1 and R2 pairs) for each sample.
2. Use the Kallisto index located here:
{formatted_kallisto_indices}
3. Take into consideration this documentation for kallisto quant:

{kallisto_docs}

4. Provide a SINGLE command that can be used to perform quantification. 
- This command will be executed via asyncio.create_subprocess_shell
- Do not include any placeholder values - ensure all values are accurate based on the information provided.
- Input FASTQ files corresponding to each sample
- Account for output directories and file management
- Account for error handling and output verification
- Ensure compatibility with POSIX-compliant shell syntax and Python's `asyncio.create_subprocess_shell`.
- Do not assume information which is not provided
- Ensure any outputs are created in a scratch subdirectory nested within the current directory
5. Justify the inclusion of each Kallisto parameter and any additional command components (e.g. loops, mkdir)

## OUTPUT

Your output should be structured with the following components. Do not include any information that does not belong in a component. For example, any output for "command" should ONLY include code, and no additional comments or commentary:
1. command - the command that can be copy-pasted into the command line to perform the quantification for all FASTQ files
2. command logic - for each kallisto parameter that was specified, include the reasoning for enabling the parameter or specifying why the parameter was used. Additionally, if any auxiliary functions are used, such as mkdir or loops, include a justification. Example format:
- The -b 100 parameter was specified to generate 100 bootstrap samples, improving reliability of quantification
- The --plaintext option was specified as Kallisto was not built with HDF5
- A for loop was constructed to iteratively perform analysis on each pair of samples

## INPUT
List of FASTQ files: 

{formatted_fastq_files}

"""
    return kallisto_prompt

temp = CreateInitialPrompt(index_dir="/home/myuser/work/data/kallisto_indices",
                           fastq_dir="/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/")
print(temp)



## IDENTITY AND PURPOSE

You are an expert bioinformatician who specialises in RNA-seq data. You have great familiarity with the Kallisto tool, and use this in command line prompts to perform quantification.

You will be given the paths to FASTQ files and required to construct a command line prompt that uses Kallisto to quantify these FASTQ files. Take a deep breath, and carefully consider the following steps to best achieve this goal.

## STEPS

1. Identify and pair the provided FASTQ files (e.g., R1 and R2 pairs) for each sample.
2. Use the Kallisto index located here:
/home/myuser/work/data/kallisto_indices/human/index.idx
3. Take into consideration this documentation for kallisto quant:

kallisto 0.51.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING   

In [9]:
# Executing code
async def execute_command(command, timeout=1000):
    # Create a process to run the shell command directly
    # (create_subprocess_shell always runs through the shell, so shell=True is not needed)
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    
    try:
        # Wait for initial output or timeout
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
        stdout = stdout.decode()
        stderr = stderr.decode()
        return_code = process.returncode
    except asyncio.TimeoutError:
        # If we timeout, it means the process is still running
        stdout = "Process started and running in the background."
        stderr = ""
        return_code = "Running"
    
    return stdout, stderr, return_code
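
`execute_command` leaves a timed-out process running and reports the string `"Running"` in place of a return code. An alternative, sketched below, stops the process on timeout and returns `None`, so the return code is always an integer or `None`. `run_shell` and `run_demo` are hypothetical names, and `process.kill()` only targets the shell process itself, so anything the shell spawned may need separate handling.

```python
import asyncio

async def run_shell(command, timeout=10):
    """Run a shell command; return (stdout, stderr, return code or None on timeout)."""
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Unlike execute_command, stop the process instead of leaving it running
        process.kill()
        await process.wait()
        return "", "", None
    return stdout.decode(), stderr.decode(), process.returncode

def run_demo():
    # asyncio.run is for scripts; inside a notebook cell, use `await run_shell(...)`
    return asyncio.run(run_shell("echo hello"))

print(run_demo())
```

Whether killing or backgrounding is preferable depends on how long quantification is expected to take relative to the timeout.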

In [15]:
# Self correction mechanism
class CodeAssessment(BaseModel):
    rewrite_needed: Literal["Yes", "No"]
    code_analysis: str
    suggested_code: str

async def analyze_and_correct_code(command, stdout, stderr, return_code, formatted_fastq_files, kallisto_docs):
    # Define analysis prompt

    analysis_prompt = f"""

## IDENTITY AND PURPOSE

You are an expert bioinformatician who specialises in RNA-seq data. You critically assess a Kallisto command provided to you and provide feedback about the validity of the code.

Take a deep breath, and carefully follow the below steps to provide the best possible feedback. 

## STEPS

Throughout these steps, do NOT make any assumptions about information that is not explicitly provided.

1. Carefully digest the Kallisto command that was given to you, and consider whether the command is likely to represent a reasonable method of quantifying samples. Take into account:
- If ALL FASTQ files were correctly analysed. Note that the FASTQ files are here:

{formatted_fastq_files}

- whether any additional functions, such as loops, would be correctly applied
- whether any improvements can be made based on this documentation:

{kallisto_docs}

- whether the parameters included in the Kallisto command are logical and biologically sound given the input and documentation
- whether there are any potential issues regarding validity and reliability

2. Carefully digest the command's output, for example taking note of:
- whether or not there were errors
- whether or not there were warnings
- check for compatibility issues with asyncio.create_subprocess_shell execution.
- verify that all shell syntax used is POSIX-compliant and avoids bash-specific features.
- proper escaping of special characters for both shell and Python string contexts.
- potential issues with command length or complexity that might cause problems in subprocess execution.
- if any part of the command might be causing the process to hang or timeout

3. After considering the above, provide feedback about:
- if the command should be rewritten. Note that:
    - if there were ANY errors, the command should be rewritten
    - if there were warnings, consider if the warning contributes to a biologically meaningful difference: if the warning does not affect the intended function of the command, it can be ignored. However, if the warning indicates the command's outcome could be biologically compromised, the command should be rewritten to address the issue.
- IF the command is to be rewritten, then take into account the stderr and stdout output, and use this to inform about
    - the likely source of error(s) and/or warning(s)
    - an analysis of ALL parameters being used, and whether they should be changed
    - an analysis of whether the exclusion of any parameters might be appropriate
    - a suggested code rewrite.

## OUTPUT

Your output should consist of:
- whether or not a rewrite is necessary. Respond with either "Yes" or "No"
- if a rewrite is needed, provide a dot point list of all changes that should be made. This should be where the analysis of parameters is included.
- a suggested code rewrite

If a rewrite is not needed, include "Not needed" for points 2 and 3.

## INPUT

Input code:

{command}

Stdout:

{stdout}

Stderr:

{stderr}

Return code:

{return_code}

"""
    
    # Use OpenAI to analyze the error and suggest a correction
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # Adjust the model as needed
        messages=[
            {"role": "system", "content": "You are an expert in shell scripting and command-line operations."},
            {"role": "user", "content": analysis_prompt}
        ],
        response_format=CodeAssessment
    )
    
    assessment = response.choices[0].message.parsed
    return assessment
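
The review prompt above asks the model to check whether a command might hang or time out under `asyncio.create_subprocess_shell`. A minimal sketch of how that could also be enforced on the execution side is below. Note that `run_with_timeout` and its `timeout_s` parameter are hypothetical names for illustration and are not the notebook's `execute_command`; the sketch only assumes the same `(stdout, stderr, return_code)` return shape.

```python
import asyncio


async def run_with_timeout(command: str, timeout_s: float = 3600.0):
    """Run a shell command and return (stdout, stderr, return_code).

    Hypothetical helper: return_code is None when the command was killed
    because it exceeded timeout_s.
    """
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(
            process.communicate(), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # kill() targets the shell itself; anything the shell spawned may
        # need process-group handling, which this sketch does not attempt.
        try:
            process.kill()
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
        await process.wait()
        return "", f"Timed out after {timeout_s} s", None
    return stdout.decode(), stderr.decode(), process.returncode
```

In a notebook cell this would be called as `await run_with_timeout("kallisto version", timeout_s=60)`; a `None` return code could then be passed to the review step as a signal that the command hung.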

In [16]:
# Main workflow

# Define Pydantic models
class KallistoCode(BaseModel):
    command: str
    command_logic: str

class CodeAssessment(BaseModel):
    rewrite_needed: Literal["Yes", "No"]
    code_analysis: str
    suggested_code: str

async def main_workflow(fastq_dir, index_dir, max_iterations=5):
    # Preliminary step - prepare the initial prompt

    prompt = CreateInitialPrompt(index_dir = index_dir,
                                 fastq_dir = fastq_dir)
    
    for iteration in range(max_iterations):
        print(f"\n(Iteration {iteration + 1})\n\n")
        # Step 1: Generate the command using OpenAI API
        chat_completion = client.beta.chat.completions.parse(
            messages=[
                {"role": "system", "content": "You are an expert bioinformatician who specialises in generating accurate commands to achieve tasks."},
                {"role": "user", "content": prompt}
            ],
            model="gpt-4o-mini",
            response_format=KallistoCode
        )
        # Extract the generated command
        generated_command = chat_completion.choices[0].message.parsed.command
        print(f"Command justification: {chat_completion.choices[0].message.parsed.command_logic}\n\n")
        # Step 2: Execute the generated command
        print(f"Executing command: {generated_command}")
        stdout, stderr, return_code = await execute_command(generated_command)
        # Step 3: Always analyze the command output, and proceed based on the analysis. Provide FASTQ files for checking
        fastq_files = list_fastq_gz_files(fastq_dir)
        formatted_fastq_files = "\n".join(fastq_files)
        kallisto_docs = kallisto_documentation()
        assessment = await analyze_and_correct_code(generated_command, stdout, stderr, 
                                                    return_code, formatted_fastq_files, kallisto_docs)
        print(f"Stdout:\n\n{stdout}")
        print(f"Stderr:\n\n{stderr}")

        # Step 4: Rewrite if needed, otherwise finish
        if assessment.rewrite_needed == "No":
            print("Command executed successfully and no rewrite is needed.")
            print(f"Stdout:\n{stdout}")
            break
        else:
            print(f"Command needs improvement:\n{assessment.code_analysis}")
            print(f"Suggested corrections:\n{assessment.suggested_code}")
            
            if iteration == max_iterations - 1:
                print("Max iterations reached. The command might still have issues.")
                break
            
            # Update the prompt for the next iteration with suggestions
            prompt = f"""
            The previous command had issues when executed via Python's asyncio.create_subprocess_shell. Please generate a new command addressing these problems:
            Previous command: {generated_command}
            Analysis: {assessment.code_analysis}
            Suggestions: {assessment.suggested_code}
            
            Original requirements:
            {prompt}
            """

    return generated_command, stdout, stderr, return_code, assessment
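
One thing to watch in the loop above: `prompt` is overwritten with a string that embeds the previous `prompt`, so from the third iteration onwards each retry carries the earlier retry prompts nested inside "Original requirements". If that growth is not wanted, a possible sketch is to keep the original prompt in its own variable and rebuild the retry prompt from it each time. `build_retry_prompt` is a hypothetical helper, not part of the notebook.

```python
def build_retry_prompt(original_prompt: str, command: str,
                       analysis: str, suggestion: str) -> str:
    """Build a retry prompt around the ORIGINAL requirements only.

    Hypothetical helper: because it never embeds an earlier retry prompt,
    the prompt does not grow with each iteration.
    """
    return "\n".join([
        "The previous command had issues when executed via Python's "
        "asyncio.create_subprocess_shell. Please generate a new command "
        "addressing these problems:",
        f"Previous command: {command}",
        f"Analysis: {analysis}",
        f"Suggestions: {suggestion}",
        "",
        "Original requirements:",
        original_prompt,
    ])


# Example with placeholder strings standing in for real model output
original = "Quantify the FASTQ files with kallisto."
retry_1 = build_retry_prompt(original, "cmd1", "analysis 1", "suggestion 1")
retry_2 = build_retry_prompt(original, "cmd2", "analysis 2", "suggestion 2")
print(retry_2)
```

The trade-off is that the model no longer sees the full history of failed attempts, only the most recent one, so this is a design choice rather than a strict fix.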

In [17]:
await main_workflow(index_dir = "/home/myuser/work/data/kallisto_indices/",
                   fastq_dir="/home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/")


(Iteration 1)


Command justification: - The command begins with 'mkdir -p' to create an output directory where results will be stored, ensuring that the command won't fail due to a missing output path.
- The 'kallisto quant' command is invoked for each pair of FASTQ files, specifying the index with '-i' pointing to '/home/myuser/work/data/kallisto_indices/human/index.idx'. This is essential for Kallisto to know which index to use for mapping.
- Each quantification invocation uses '-o' to define a unique output directory for each sample based on its identifier (e.g., SRR29101291).
- The '--threads=4' option is utilized to enable multi-threading for faster processing, assuming a machine that supports at least 4 threads. This can significantly speed up the quantification process.
- The command uses backslashes to create a line continuation for readability, accommodating more commands in a single execution. Each command is separated and will execute sequentially.
- Kallisto does not requ

('mkdir -p /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/output && \\\nkallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/output/SRR29101291 --threads=4 /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_1.fastq.gz /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_2.fastq.gz && \\\nkallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/output/SRR29101292 --threads=4 /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_1.fastq.gz /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_2.fastq.gz && \\\nkallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o /home/myuser/work/data/2_extract_data/scratch/FASTQs_GSE268034/output/SRR29101293 --threads=4 /home/myuser/work/data/2_extract_data/scr
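
The generated command above spells out one `kallisto quant` call per sample. As a cross-check on the model's pairing, the same commands could be built deterministically in Python. The sketch below assumes the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming seen in the output above; `pair_fastq_files` and `build_quant_commands` are hypothetical helpers, and the only kallisto options used are the ones already present in the generated command (`-i`, `-o`, `--threads`).

```python
import os
import re
import shlex
from collections import defaultdict

# Assumed naming convention: <sample>_1.fastq.gz and <sample>_2.fastq.gz
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def pair_fastq_files(paths):
    """Return ({sample: (read1, read2)}, [samples missing a mate])."""
    reads = defaultdict(dict)
    for path in paths:
        match = PAIR_PATTERN.match(os.path.basename(path))
        if match:
            reads[match["sample"]][match["read"]] = path
    pairs = {
        sample: (found["1"], found["2"])
        for sample, found in sorted(reads.items())
        if found.keys() >= {"1", "2"}
    }
    unpaired = sorted(sample for sample in reads if sample not in pairs)
    return pairs, unpaired


def build_quant_commands(pairs, index_path, output_dir, threads=4):
    """Build one shell-quoted `kallisto quant` command per paired sample."""
    commands = []
    for sample, (read1, read2) in pairs.items():
        args = [
            "kallisto", "quant",
            "-i", index_path,
            "-o", os.path.join(output_dir, sample),
            f"--threads={threads}",
            read1, read2,
        ]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


# Example with made-up paths (not the real data directory)
example_paths = [
    "/data/fastq/SRR0000001_1.fastq.gz",
    "/data/fastq/SRR0000001_2.fastq.gz",
    "/data/fastq/SRR0000002_1.fastq.gz",  # mate missing on purpose
]
pairs, unpaired = pair_fastq_files(example_paths)
commands = build_quant_commands(pairs, "/data/index.idx", "/data/output")
```

Comparing the sample IDs in `pairs` against those in the model's command would be one way to confirm that ALL FASTQ files were covered, and `unpaired` flags samples that would need single-end handling instead.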

In [144]:
print(temp)



## IDENTITY AND PURPOSE

You are an expert bioinformatician who specialises in RNA-seq data. You have great familiarity with the Kallisto tool, and use this in command line prompts to perform quantification.

You will be given the paths to FASTQ files and required to construct a command line prompt that uses Kallisto to quantify these FASTQ files. Take a deep breath, and carefully consider the following steps to best achieve this goal.

## STEPS

1. Identify and pair the provided FASTQ files (e.g., R1 and R2 pairs) for each sample.
2. Use the Kallisto index located at ['/home/myuser/work/data/kallisto_indices/human/index.idx'].
3. Take into consideration this documentation for kallisto quant:

kallisto 0.51.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING 