# Kallisto quantification

The purpose of this notebook is to develop a framework for automating Kallisto quantification.

The ground rules I plan to set:
- I will use pre-built Kallisto indices
- I am using human data as a test case, but this should generalise readily to other species
- I will specify the directory containing the FASTQ files.

In [1]:
from openai import OpenAI
import os
from tqdm import tqdm
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel
from typing import List, Dict
import subprocess

In [4]:
# Notebook test: check that the API key loads and the client responds

load_dotenv('../../.env')

openai_api_key = os.getenv('OPENAI_API_KEY')

# Test OpenAI API...

client = OpenAI(
  api_key=openai_api_key,  # this is also the default, it can be omitted
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "In two sentences or fewer, can you describe the molecular composition of water?",
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

Water is composed of two hydrogen atoms covalently bonded to one oxygen atom, represented by the chemical formula H₂O. This molecular structure results in polar molecules, contributing to water's unique properties.


# Plan of action

- Specify the location of the directory containing the FASTQ files
- Specify the location of the metadata (EDIT: I think I will develop this later. I need to link SRX/SRA IDs for this to have any chance of working)
- Develop a prompt for Kallisto quantification
- The outcomes should be:
  - Kallisto code that appropriately identifies: the correct pair of FASTQ files, correct parameters, correct index selection
  - Outputs the results in an appropriate directory
  - Outputs the code and logic used
  - Outputs the stdout/stderr of the executed command, which can be used as an evaluation tool
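
Since identifying the correct pair of FASTQ files is one of the outcomes to evaluate, it helps to have a deterministic pairing to compare the model's answer against. The following is a minimal sketch, assuming the files follow the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` naming used by the test data; `pair_fastq_files` and `FASTQ_PATTERN` are hypothetical names introduced here, not part of the notebook.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <run>_1.fastq.gz and <run>_2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<run>.+)_(?P<mate>[12])\.fastq\.gz$")

def pair_fastq_files(paths):
    """Group FASTQ paths by run accession.

    Returns (paired, unpaired): `paired` maps run -> (mate 1, mate 2),
    `unpaired` lists runs for which one mate is missing.
    """
    mates = defaultdict(dict)
    for path in map(Path, paths):
        match = FASTQ_PATTERN.match(path.name)
        if match:  # files that do not fit the convention are ignored
            mates[match["run"]][match["mate"]] = path
    paired = {
        run: (found["1"], found["2"])
        for run, found in sorted(mates.items())
        if len(found) == 2
    }
    unpaired = sorted(run for run, found in mates.items() if len(found) != 2)
    return paired, unpaired

# Illustrative paths only
example = [
    "/data/SRR29101291_2.fastq.gz",
    "/data/SRR29101291_1.fastq.gz",
    "/data/SRR29101292_1.fastq.gz",
]
paired, unpaired = pair_fastq_files(example)
print(paired)
print(unpaired)
```

The `paired` dictionary can then be compared with the file pairs that appear in the generated command.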

In [14]:
class KallistoCode(BaseModel):
    command: str
    kallisto_param_logic: str
    other_command_logic: str

# Intention is to print out the logic associated with the Kallisto parameters and any other commands. I anticipate a for loop, with the reasoning behind it belonging in other_command_logic

prompt = f"""

## IDENTITY AND PURPOSE

You are an expert bioinformatician who specialises in RNA-seq data. You have great familiarity with the Kallisto tool, and use this in command line prompts to perform quantification.

You will be given the paths to FASTQ files and required to construct a command line prompt that uses Kallisto to quantify these FASTQ files. Take a deep breath, and carefully consider the following steps to best achieve this goal.

## STEPS

1. Consider the list of FASTQ files. Use this to determine which files are matched with each other - for example, the R1 and R2 files that correspond to each other.
2. Take into account that the path to the index file is ~/work/data/kallisto_indices/human/index.idx
3. Take into consideration this documentation for kallisto quant:

kallisto 0.50.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)
-t, --threads=INT             Number of threads to use (default: 1)

Also note that Kallisto will not take in any of the following legacy options:

Legacy options:

These options are only supported in kallisto versions before 0.50.0. To use them install an older version of kallisto (recommended: 0.48.0).

kallisto quant:

--bias learns parameters for a model of sequences specific bias and corrects the abundances accordlingly.

--fusion does normal quantification, but additionally looks for reads that do not pseudoalign because they are potentially from fusion genes. All output is written to the file fusion.txt in the output folder.

--pseudobam outputs all pseudoalignments to a file pseudoalignments.bam in the output directory. This BAM file contains the pseudoalignments in BAM format, ordered by reads so that each pseudoalignment of a read is adjacent in the BAM file.

--genomebam constructs the pseudoalignments to the transcriptome, but projects the transcript alignments to genome coordinates, resulting in split-read alignments. When the --genomebam option is supplied at GTF file must be given with the --gtf option. The GTF file, which can be plain text or gzipped, translates transcripts into genomic coordinates. We recommend downloading a the cdna FASTA files and GTF files from the same data source. The --chromosomes option can provide a length of the genomic chromosomes, this option is not neccessary, but gives a more consistent BAM header, some programs may require this for downstream analysis. kallisto does not require the genome sequence to do pseudoalignment, but downstream tools such as genome browsers will probably need it.

4. Carefully consider what you know about using Kallisto for quantification, and provide code to quantify ALL provided FASTQ files in a way that is biologically meaningful. Take into account:
- relevant output directories
- that all FASTQ files should be quantified in a biologically accurate manner
5. Provide a SINGLE command that can be used to perform quantification. This command should be something that can be copy-pasted into the command line
6. For each parameter that is specified in the kallisto command, include a justification for why the parameter was included
7. For any other part of the command (e.g. wrapping in a for loop), include a justification for why this is necessary

## OUTPUT

Your output should be structured with the following components. Do not include any information that does not belong in a component. For example, any output for "command" should ONLY include code, and no additional comments or commentary:
1. command - the command that can be copy-pasted into the command line to perform the quantification for all FASTQ files
2. kallisto parameter logic - for each kallisto parameter that was specified, include the reasoning for selecting the value that was used
3. other command logic - for anything that is used to support the kallisto command (e.g. a for loop), justify why these were needed.

## INPUT
List of FASTQ files (absolute paths):

/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101300_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101298_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101300_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101302_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101296_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101299_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101296_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101297_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101298_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101302_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101297_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101301_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101299_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101301_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_2.fastq.gz"""


In [15]:
chat_completion = client.beta.chat.completions.parse(
    messages=[
        {"role": "system", "content": "You are an expert bioinformatician who specialises in generating accurate commands to achieve tasks."},
        {"role": "user", "content": prompt}
    ],
    model="gpt-4o-mini",
    response_format=KallistoCode
)
print(chat_completion.choices[0].message.parsed)

command='mkdir -p output && for R1 in /home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/*_1.fastq.gz; do R2=${R1/_1/_2}; if [ -f "$R2" ]; then kallisto quant -i ~/work/data/kallisto_indices/human/index.idx -o output/$(basename ${R1%_1.fastq.gz}) --bias --pseudobam --threads=4 $R1 $R2; fi; done' kallisto_param_logic="-i ~/work/data/kallisto_indices/human/index.idx: This specifies the path to the Kallisto index for quantification.\\n-o output/$(basename ${R1%_1.fastq.gz}): Creates an output directory for each sample using the basename of the R1 file minus the '_1.fastq.gz' suffix. The output structure is important for organizing results.\\n--bias: This flag helps to correct for sequence-specific bias, ensuring more accurate quantification.\\n--pseudobam: This option generates a BAM file with pseudoalignments, which is useful for further analysis and validation of read alignments in downstream tools.\\n--threads=4: Utilizes 4 threads to speed up the quantification process 

In [21]:
import asyncio

async def execute_command(command, timeout=10):
    # Run the command through the shell. create_subprocess_shell always uses a
    # shell, so no shell=True argument is needed.
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )

    try:
        # Wait for the process to finish, for at most `timeout` seconds
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
        stdout = stdout.decode()
        stderr = stderr.decode()
        return_code = process.returncode
    except asyncio.TimeoutError:
        # On timeout the process is left running, but its output is no longer collected
        stdout = "Process started and running in the background."
        stderr = ""
        return_code = "Running"

    return stdout, stderr, return_code
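
On timeout, `execute_command` leaves the process running and reports the string "Running" in place of an exit code. An alternative, sketched below, is to stop the process when the timeout expires so that every call ends in a definite state. This is only one possible design: `run_with_hard_timeout` is a hypothetical name, and `None` as the timed-out return code is a choice made for this sketch.

```python
import asyncio

async def run_with_hard_timeout(command, timeout=10):
    """Run a shell command; kill it if it exceeds `timeout` seconds."""
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
        return stdout.decode(), stderr.decode(), process.returncode
    except asyncio.TimeoutError:
        # kill() targets the shell process itself; commands that the shell
        # spawned as separate child processes may outlive it.
        process.kill()
        await process.wait()
        return "", f"Timed out after {timeout} seconds", None

# In a notebook cell:
#   stdout, stderr, code = await run_with_hard_timeout("echo hello")
# In a plain script:
#   stdout, stderr, code = asyncio.run(run_with_hard_timeout("echo hello"))
```

For a long-running quantification the timeout would need to be generous, since killing kallisto partway through leaves incomplete output directories.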

In [17]:
generated_command = chat_completion.choices[0].message.parsed.command

In [18]:
generated_command

'mkdir -p output && for R1 in /home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/*_1.fastq.gz; do R2=${R1/_1/_2}; if [ -f "$R2" ]; then kallisto quant -i ~/work/data/kallisto_indices/human/index.idx -o output/$(basename ${R1%_1.fastq.gz}) --bias --pseudobam --threads=4 $R1 $R2; fi; done'
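
Note that the generated command includes `--bias` and `--pseudobam`, both of which the prompt lists as legacy options that kallisto 0.50.0 does not accept. A cheap check before execution can catch this without spending a run on it. The sketch below is a simple token scan rather than a real shell parser; `find_legacy_flags` and `LEGACY_QUANT_FLAGS` are hypothetical names, and the example command is an abbreviated form of the one above.

```python
# Legacy `kallisto quant` flags, copied from the documentation quoted in the prompt
LEGACY_QUANT_FLAGS = {"--bias", "--fusion", "--pseudobam", "--genomebam"}

def find_legacy_flags(command):
    """Return the legacy flags that appear in `command`, in order of first use."""
    found = []
    for token in command.split():
        flag = token.split("=", 1)[0]  # treat --flag=value the same as --flag
        if flag in LEGACY_QUANT_FLAGS and flag not in found:
            found.append(flag)
    return found

# Abbreviated version of the generated command, for illustration
example_command = (
    "kallisto quant -i index.idx -o output/sample "
    "--bias --pseudobam --threads=4 R1.fastq.gz R2.fastq.gz"
)
print(find_legacy_flags(example_command))
```

A non-empty result could be fed back to the model as an error message in the same way as a failed execution.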

In [44]:
class KallistoCode(BaseModel):
    command: str
    kallisto_param_logic: str
    other_command_logic: str

# Intention is to print out the logic associated with the Kallisto parameters and any other commands. I anticipate a for loop, with the reasoning behind it belonging in other_command_logic

prompt = f"""

## IDENTITY AND PURPOSE

You are an expert bioinformatician who specialises in RNA-seq data. You have great familiarity with the Kallisto tool, and use this in command line prompts to perform quantification.

You will be given the paths to FASTQ files and required to construct a command line prompt that uses Kallisto to quantify these FASTQ files. Take a deep breath, and carefully consider the following steps to best achieve this goal.

## STEPS

1. Identify and pair the provided FASTQ files (e.g., R1 and R2 pairs). Ensure that, for any given Kallisto input, only FASTQ files corresponding to a single sample are included.
2. Use the Kallisto index located at `~/work/data/kallisto_indices/human/index.idx`.
3. Take into consideration this documentation for kallisto quant:

kallisto 0.50.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)
-t, --threads=INT             Number of threads to use (default: 1)

Also note that Kallisto will not take in any of the following legacy options:

Legacy options:

These options are only supported in kallisto versions before 0.50.0. To use them install an older version of kallisto (recommended: 0.48.0).

kallisto quant:

--bias learns parameters for a model of sequences specific bias and corrects the abundances accordlingly.

--fusion does normal quantification, but additionally looks for reads that do not pseudoalign because they are potentially from fusion genes. All output is written to the file fusion.txt in the output folder.

--pseudobam outputs all pseudoalignments to a file pseudoalignments.bam in the output directory. This BAM file contains the pseudoalignments in BAM format, ordered by reads so that each pseudoalignment of a read is adjacent in the BAM file.

--genomebam constructs the pseudoalignments to the transcriptome, but projects the transcript alignments to genome coordinates, resulting in split-read alignments. When the --genomebam option is supplied at GTF file must be given with the --gtf option. The GTF file, which can be plain text or gzipped, translates transcripts into genomic coordinates. We recommend downloading a the cdna FASTA files and GTF files from the same data source. The --chromosomes option can provide a length of the genomic chromosomes, this option is not neccessary, but gives a more consistent BAM header, some programs may require this for downstream analysis. kallisto does not require the genome sequence to do pseudoalignment, but downstream tools such as genome browsers will probably need it.

4. Carefully consider what you know about using Kallisto for quantification, and provide code to quantify ALL provided FASTQ files in a way that is biologically meaningful. Take into account:
- relevant output directories
- that all FASTQ files should be quantified in a biologically accurate manner
5. Provide a SINGLE command that can be used to perform quantification. This command should be something that can be copy-pasted into the command line. Do not include any placeholder values - ensure all values are accurate based on the information provided.
- you should ensure checks that any relevant files are included
- also include checks that the output files are produced as expected
- consider the use of for loops in generating the code
- include robust error handling
- include informative print messages
- The command will be executed within a Python function using `asyncio.create_subprocess_shell`. Ensure that the command is fully compatible with execution in this context.
- Pay attention to quoting and escaping characters to avoid issues with Python's string handling.
- Consider any differences in behavior when the command is executed via Python versus directly in the shell.
6. For each parameter that is specified in the kallisto command, include a justification for why the parameter was included
7. For any other part of the command (e.g. wrapping in a for loop), include a justification for why this is necessary

## OUTPUT

Your output should be structured with the following components. Do not include any information that does not belong in a component. For example, any output for "command" should ONLY include code, and no additional comments or commentary:
1. command - the command that can be copy-pasted into the command line to perform the quantification for all FASTQ files
2. kallisto parameter logic - for each kallisto parameter that was specified, include the reasoning for selecting the value that was used
3. other command logic - for anything that is used to support the kallisto command (e.g. a for loop), justify why these were needed.

## INPUT
List of FASTQ files (absolute paths):

/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101300_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101298_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101300_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101302_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101296_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101299_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101296_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101297_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101298_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101302_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101297_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101295_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101301_1.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101294_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101292_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101291_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101299_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101301_2.fastq.gz
/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/SRR29101293_2.fastq.gz"""

async def execute_command(command, timeout=300):
    # Run the command through the shell. create_subprocess_shell always uses a
    # shell, so no shell=True argument is needed.
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )

    try:
        # Wait for the process to finish, for at most `timeout` seconds
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
        stdout = stdout.decode()
        stderr = stderr.decode()
        return_code = process.returncode
    except asyncio.TimeoutError:
        # On timeout the process is left running, but its output is no longer collected
        stdout = "Process started and running in the background."
        stderr = ""
        return_code = "Running"

    return stdout, stderr, return_code

async def analyze_and_correct_code(command, error_output):
    analysis_prompt = f"""
    Analyze the following shell command and its error output. Suggest corrections or improvements to fix the issue.

    Command:
    {command}

    Error Output:
    {error_output}

    Begin by providing the input command, and stating why this did not work. Then, provide the corrected command or detailed suggestions on what changes are needed.
    """
    
    # Use OpenAI to analyze the error and suggest a correction
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Adjust the model as needed
        messages=[
            {"role": "system", "content": "You are an expert in shell scripting and command-line operations."},
            {"role": "user", "content": analysis_prompt}
        ]
    )
    
    suggestions = response.choices[0].message.content
    return suggestions

async def main_workflow(prompt, max_iterations=3):
    for iteration in range(max_iterations):
        # Step 1: Generate the command using OpenAI API
        chat_completion = client.beta.chat.completions.parse(
            messages=[
                {"role": "system", "content": "You are an expert bioinformatician who specialises in generating accurate commands to achieve tasks."},
                {"role": "user", "content": prompt}
            ],
            model="gpt-4o-mini",
            response_format=KallistoCode
        )

        # Extract the generated command
        generated_command = chat_completion.choices[0].message.parsed.command
        
        # Step 2: Execute the generated command
        print(f"Executing command (Iteration {iteration + 1}): {generated_command}")
        stdout, stderr, return_code = await execute_command(generated_command)
        
        if return_code == 0:
            print(f"Command executed successfully:\n{stdout}")
            break
        else:
            print(f"Error executing command:\n{stderr}\nStdout:\n{stdout}")
            # Step 3: Analyze and attempt to correct the command if there is an error
            suggestions = await analyze_and_correct_code(generated_command, stderr)
            print(f"Suggested corrections:\n{suggestions}")
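            # Keep the original task (steps, documentation, FASTQ list) and append the
            # feedback, so the next iteration does not lose the context it needs
            prompt = f"{prompt}\n\n## FEEDBACK FROM PREVIOUS ATTEMPT\n{suggestions}"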

            if iteration == max_iterations - 1:
                print("Max iterations reached. The command might still have issues.")
                break
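
The plan of action calls for saving the code, the logic and the execution output of each run, whereas `main_workflow` only prints them. A minimal sketch of one way to persist each attempt is given below, writing one JSON line per iteration. `AttemptRecord`, `append_attempt` and `load_attempts` are hypothetical names; the fields mirror `KallistoCode` and the tuple returned by `execute_command`.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass
class AttemptRecord:
    iteration: int
    command: str
    kallisto_param_logic: str
    other_command_logic: str
    stdout: str
    stderr: str
    return_code: object  # an int, or the string "Running" returned on timeout

def append_attempt(log_path, record):
    """Append one attempt as a JSON line so earlier iterations are preserved."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")

def load_attempts(log_path):
    """Read all recorded attempts back as a list of dictionaries."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Inside the loop, a record could be built from `chat_completion.choices[0].message.parsed` and the values returned by `execute_command`, then appended after each iteration.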

In [45]:
await main_workflow(prompt)

Executing command (Iteration 1): for pair in /home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/*_1.fastq.gz; do name=$(basename "$pair" _1.fastq.gz); echo "Processing sample: $name"; if [ -f "$pair" ] && [ -f "/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/${name}_2.fastq.gz" ]; then kallisto quant -i ~/work/data/kallisto_indices/human/index.idx -o /home/myuser/work/results/kallisto_output/$name --bootstrap-samples=100 --threads=4 "$pair" "/home/myuser/work/results/2_extract_data/scratch/FASTQs_GSE268034/${name}_2.fastq.gz"; else echo "FASTQ pairs for sample $name are missing!"; fi; done
Error executing command:

[quant] fragment length distribution will be estimated from the data
Error: could not create directory /home/myuser/work/results/kallisto_output/SRR29101291
will be performed. Run quant with --plaintext option or recompile with
HDF5 support to obtain bootstrap estimates.


[quant] fragment length distribution will be estimated from the data
E
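
The stderr above reports that kallisto could not create `/home/myuser/work/results/kallisto_output/SRR29101291`. One plausible cause, which the log does not confirm, is that the parent directory `kallisto_output` did not exist when the command ran. The sketch below performs that check from Python before executing anything; `preflight` is a hypothetical helper, and both paths are passed in by the caller.

```python
from pathlib import Path

def preflight(index_path, output_root):
    """Create the output root if needed and return a list of problems found."""
    problems = []
    index = Path(index_path).expanduser()
    if not index.is_file():
        problems.append(f"index not found: {index}")
    root = Path(output_root).expanduser()
    try:
        # parents=True also creates any missing intermediate directories
        root.mkdir(parents=True, exist_ok=True)
    except OSError as exc:
        problems.append(f"cannot create {root}: {exc}")
    return problems

# Example call with the paths used in this notebook:
#   preflight("~/work/data/kallisto_indices/human/index.idx",
#             "/home/myuser/work/results/kallisto_output")
```

The same stderr also suggests that this kallisto build lacks HDF5 support, in which case the log itself recommends running quant with `--plaintext` to obtain bootstrap estimates.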