# Notebook description

In the other notebook, my method for running Kallisto quantification was somewhat convoluted. In this notebook, I will try to make heavier use of structured outputs, and perhaps hard-code the output file locations a little more.


In [85]:
from openai import OpenAI
import openai # Probably don't need above... but this is for testing tools with structured outputs
import os
from tqdm import tqdm
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List, Dict, Literal, Optional
import subprocess
import glob
import asyncio
import json

In [7]:
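# Load API credentials (OpenAI and NCBI Entrez) from the .env file

from Bio import Entrez  # Entrez is used below, so it must be imported before this point

load_dotenv('../../.env')

Entrez.email = os.getenv('ENTREZ_EMAIL')
Entrez.api_key = os.getenv('ENTREZ_API_KEY')
openai_api_key = os.getenv('OPENAI_API_KEY')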

In [8]:
# Test the OpenAI API key is working

client = OpenAI(
  api_key=openai_api_key,  # this is also the default, it can be omitted
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "State a word that begins with the letter H, and define this word in less than one sentence",
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

Hapless: unfortunate or unlucky, often in a way that provokes pity.


Ok not sure how you can define something in "less than one sentence" (flub on my part), but yes the test works.

# Notebook overall objective

The overall objective of this notebook is to develop an agent that can perform quantification of FASTQ files using Kallisto. 

For the prompt, I will give the LLM the documentation (I already know from prior experience that what the model knows as of its knowledge cutoff does not match the version of Kallisto I am currently using). I will probably also want to specify the particular outputs that I want. So I think what I want the LLM to do is determine, for each parameter, whether it is appropriate or not.

## Get documentation

In [9]:
def get_documentation(command):
    try:
        # Execute the kallisto command
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        
        # Capture the stdout
        stdout = result.stdout
        
        # Return the results
        return stdout
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

docs = get_documentation("kallisto quant --help")
print(docs)

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
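
Since the help text is the source of truth for this Kallisto version, one option is to extract the long option names from it and later check that the LLM did not pick a flag that is absent. The following is a minimal sketch, not part of the original notebook: `parse_long_flags` is a hypothetical helper, and `sample_help` is a shortened copy of the output above.

```python
import re

def parse_long_flags(help_text: str) -> set:
    """Return the long option names (e.g. '--single') that start an option line."""
    # Match lines such as "-b, --bootstrap-samples=INT" or "    --seed=INT".
    # Continuation lines of descriptions do not start with "--", so they are skipped.
    pattern = re.compile(r"^[ \t]*(?:-\w,[ \t]*)?(--[a-z][a-z-]*)", re.MULTILINE)
    return set(pattern.findall(help_text))

# Shortened copy of the `kallisto quant --help` output shown above.
sample_help = """\
Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --single                  Quantify single-end reads
-l, --fragment-length=DOUBLE  Estimated average fragment length
"""

flags = parse_long_flags(sample_help)
print(sorted(flags))
```

In the notebook, the same function could be applied to `docs` instead of `sample_help`.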
                              

## Prepare prompt/code to determine input files

I believe the input files I will need are the FASTQ files themselves, as well as an index file.

Ah, I need to download the index don't I? (done).

Being able to locate the FASTQ files... this might get a bit messy (and will be a consideration when I 

In [104]:
# Function to get files

def get_files(directory, suffix):
    """
    Recursively lists all files in a given directory and its subdirectories that end with the specified suffix,
    returning their absolute paths.

    Parameters:
    directory (str): The path to the directory to search in.
    suffix (str): The file suffix to look for (e.g., 'fastq.gz').

    Returns:
    list: A list of absolute file paths that match the given suffix.
    """
    matched_files = []
    
    try:
        # Walk through directory and subdirectories
        for root, _, files in os.walk(directory):
            for f in files:
                if f.endswith(suffix):
                    matched_files.append(os.path.join(root, f))
                    
        return matched_files
    except FileNotFoundError:
        print(f"Directory '{directory}' not found.")
        return []
    except Exception as e:
        print(f"An error occurred: {e}")
        return []

# Example usage:
directory = "../2_extract_data/" # this is something I will need to work out how to optimize
suffix = "fastq.gz"
fastq_files = get_files(directory, suffix) # just for my own sanity I didn't print the output, but I can see it was able to find all the files

directory = "/home/myuser/work/data/kallisto_indices/"
suffix = ".idx"
index_files = get_files(directory, suffix)
print(index_files) # The vision I have for the moment is to have the LLM automatically select the correct index

# This "finding files" might be something I encode as an agent which I call in different steps

['/home/myuser/work/data/kallisto_indices/mouse/index.idx', '/home/myuser/work/data/kallisto_indices/human/index.idx']
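
To make the FASTQ list less messy before it goes into a prompt, the files could be grouped into read pairs per run. This is a rough sketch that assumes the naming seen in this dataset (`<run>_1.fastq.gz` and `<run>_2.fastq.gz`) and treats a file without a `_1`/`_2` suffix as single-end. `pair_fastq_files` and the example paths are hypothetical.

```python
import os
import re
from collections import defaultdict

def pair_fastq_files(paths):
    """Group FASTQ paths by run ID; each value lists Read 1 first, then Read 2 if present."""
    # Assumed naming: <run>_1.fastq.gz / <run>_2.fastq.gz, or <run>.fastq.gz for single-end.
    pattern = re.compile(r"^(?P<run>.+?)(?:_(?P<mate>[12]))?\.fastq\.gz$")
    grouped = defaultdict(dict)
    for path in paths:
        match = pattern.match(os.path.basename(path))
        if match is None:
            continue  # not a FASTQ file under the assumed naming scheme
        mate = match.group("mate") or "1"
        grouped[match.group("run")][mate] = path
    return {run: [mates[m] for m in sorted(mates)] for run, mates in sorted(grouped.items())}

# Made-up example paths for illustration only.
example_paths = [
    "data/SRR0000002_2.fastq.gz",
    "data/SRR0000002_1.fastq.gz",
    "data/SRR0000001.fastq.gz",
]
pairs = pair_fastq_files(example_paths)
print(pairs)
```

With the real data, `pair_fastq_files(fastq_files)` would give one entry per run, which is also a compact way to present the files to the LLM.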


In [34]:
Entrez.email = os.getenv('ENTREZ_EMAIL')
Entrez.api_key = os.getenv('ENTREZ_API_KEY')

def get_study_summary(accession):

    # Define the command as a string
    command = (
        f'esearch -db gds -query "{accession}[ACCN]" | '
        'efetch -format docsum | '
        'xtract -pattern DocumentSummarySet -block DocumentSummary '
        f'-if Accession -equals {accession} -element summary'
    )

    # Execute the command
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

    # Check if the command was successful
    if result.returncode == 0:
        # Return the output
        return result.stdout.strip()
    else:
        # Raise an error with the stderr output
        raise Exception(f"Error: {result.stderr}")

# Example usage:
study_summary = get_study_summary("GSE268034")
print(study_summary)

Despite selective HDAC3 inhibition showing promise in a subset of lymphomas with CREBBP mutations, wild-type tumors generally exhibit resistance. Here, using unbiased genome-wide CRISPR screening, we identify GNAS knockout (KO) as a sensitizer of resistant lymphoma cells to HDAC3 inhibition. Mechanistically, GNAS KO-induced sensitization is independent of the canonical G-protein activities but unexpectedly mediated by viral mimicry-related interferon (IFN) responses, characterized by TBK1 and IRF3 activation, double-stranded RNA formation, and transposable element (TE) expression. GNAS KO additionally synergizes with HDAC3 inhibition to enhance CD8+ T cell-induced cytotoxicity. Moreover, we observe in human lymphoma patients that low GNAS expression is associated with high baseline TE expression and upregulated IFN signaling and shares common disrupted biological activities with GNAS KO in histone modification, mRNA processing, and transcriptional regulation. Collectively, our findings
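
Because `get_study_summary` interpolates the accession straight into a shell pipeline, it may be worth checking the accession format first. The sketch below builds the same command string as above but rejects anything that does not look like a GEO series accession. `build_summary_command` is a hypothetical helper, and the `GSE` plus digits pattern is an assumption about the accessions used in this project.

```python
import re

# Assumed format: "GSE" followed by digits only.
GSE_PATTERN = re.compile(r"GSE\d+")

def build_summary_command(accession: str) -> str:
    """Return the esearch/efetch/xtract pipeline for a validated GSE accession."""
    if not GSE_PATTERN.fullmatch(accession):
        raise ValueError(f"Unexpected accession format: {accession!r}")
    return (
        f'esearch -db gds -query "{accession}[ACCN]" | '
        'efetch -format docsum | '
        'xtract -pattern DocumentSummarySet -block DocumentSummary '
        f'-if Accession -equals {accession} -element summary'
    )

command = build_summary_command("GSE268034")
print(command)
```

Only the validation step is new here; the command itself is the one already used in `get_study_summary`.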

In [62]:
import time
import subprocess
import os
import pandas as pd
from io import StringIO
from Bio import Entrez

# Path to the file containing SRA IDs (one per line)
sra_ids_file = "sra_ids.txt"

# Set NCBI API key from environment variable
Entrez.api_key = os.getenv('ENTREZ_API_KEY')
if Entrez.api_key:
    print("API key detected.")
else:
    print("No API key detected; proceeding without it.")

# Delay (in seconds) between requests to respect rate limits
delay = 0.5 if not Entrez.api_key else 0.1

# List to store each SRA ID's fetched data as a dictionary
data = []

# Read the SRA IDs from the file
with open(sra_ids_file, 'r') as ids_file:
    for line in ids_file:
        sra_id = line.strip()
        if not sra_id:
            continue  # Skip empty lines

        print(f"\nProcessing SRA ID: {sra_id}")

        # Construct the full command as a single shell command.
        # The command string is the same with or without an API key, so no
        # key-specific branch is needed here.
        command = f"esearch -db sra -query {sra_id} | efetch -format runinfo"

        # Print the command being attempted for debugging
        print(f"Executing command: {command}")

        try:
            # Execute the combined command as a single shell pipeline
            result = subprocess.run(
                command, shell=True, capture_output=True, text=True, check=True
            )

            # Convert the result to a DataFrame and append it to the list
            csv_data = StringIO(result.stdout)
            df = pd.read_csv(csv_data)
            data.append(df)

        except subprocess.CalledProcessError as e:
            print(f"Error processing {sra_id}: {e}")
            print(f"Command output: {e.output}")
            continue  # Skip to the next SRA ID if there’s an error

        # Respect API rate limits
        time.sleep(delay)

# Combine all DataFrames into one
if data:
    combined_df = pd.concat(data, ignore_index=True)
    
    # Remove columns where all entries are NaN
    combined_df.dropna(axis=1, how='all', inplace=True)

    # Display the resulting DataFrame
    print("\nData fetching complete.")
    print(combined_df)
else:
    print("No data was fetched.")

API key detected.

Processing SRA ID: SRR29101291
Executing command: esearch -db sra -query SRR29101291 | efetch -format runinfo

Processing SRA ID: SRR29101292
Executing command: esearch -db sra -query SRR29101292 | efetch -format runinfo

Processing SRA ID: SRR29101293
Executing command: esearch -db sra -query SRR29101293 | efetch -format runinfo

Processing SRA ID: SRR29101294
Executing command: esearch -db sra -query SRR29101294 | efetch -format runinfo

Processing SRA ID: SRR29101295
Executing command: esearch -db sra -query SRR29101295 | efetch -format runinfo

Processing SRA ID: SRR29101296
Executing command: esearch -db sra -query SRR29101296 | efetch -format runinfo

Processing SRA ID: SRR29101297
Executing command: esearch -db sra -query SRR29101297 | efetch -format runinfo

Processing SRA ID: SRR29101298
Executing command: esearch -db sra -query SRR29101298 | efetch -format runinfo

Processing SRA ID: SRR29101299
Executing command: esearch -db sra -query SRR29101299 | efetch
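
The runinfo output is CSV, so the library layout per run can also be read without the LLM and used as a cross-check on its `single` choice. The sketch below uses only the standard library. It assumes the runinfo table has `Run` and `LibraryLayout` columns, and the sample rows are invented for illustration. `layout_by_run` is a hypothetical helper.

```python
import csv
from io import StringIO

def layout_by_run(runinfo_csv: str) -> dict:
    """Map each run ID to its library layout as reported in a runinfo CSV."""
    layouts = {}
    for row in csv.DictReader(StringIO(runinfo_csv)):
        run = (row.get("Run") or "").strip()
        # Skip blank rows and repeated header rows (possible if outputs were concatenated).
        if not run or run == "Run":
            continue
        layouts[run] = (row.get("LibraryLayout") or "").strip().upper()
    return layouts

# Invented sample rows; real runinfo output has many more columns.
sample_runinfo = (
    "Run,LibraryLayout,ScientificName\n"
    "SRR0000001,PAIRED,Homo sapiens\n"
    "Run,LibraryLayout,ScientificName\n"
    "SRR0000002,SINGLE,Mus musculus\n"
)
layouts = layout_by_run(sample_runinfo)
print(layouts)
```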

## Prepare Kallisto quantification

I think the above should give me enough information to accurately determine the quantification parameters.

On the topic of validation - the best I have for the moment is to identify a diverse set of 5 (?) studies to hold out as a validation dataset... (ChatGPT is not helping me here)

My plan will be - give it documentation, give it dataset metadata, give it the sample metadata, and give it the location of the FASTQ files. Also the location of (all?) indices. 

In [65]:
# I can access the sample metadata extracted above via...

# combined_df

# Just not going to show the output since it's a bit chunky

In [74]:
print(docs)

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              

In [111]:
class KallistoCommand(BaseModel):
    index: str = Field(..., description="Filename for the Kallisto index to be used for quantification")
    fastq1: str = Field(..., description="Filename for the first FASTQ file (Read 1) to be quantified")
    fastq2: Optional[str] = Field(description="Filename for the second FASTQ file (Read 2) to be quantified (optional for single-end reads)")
    output: str = Field(..., description="Directory to write output to")
    bootstraps: int = Field(..., description="Number of bootstrap samples")
    single: bool = Field(..., description="If the reads are single-end")
    fr_stranded: bool = Field(..., description="If the reads are strand-specific, with first read forward")
    rf_stranded: bool = Field(..., description="If the reads are strand-specific, with first read reverse")
    frag_length: Optional[int] = Field(description="Estimated average fragment length (required for single-end reads)")
    sd: Optional[int] = Field(description="Estimated standard deviation of fragment length (required for single-end reads)")
    justification: str = Field(..., description="Justification for each chosen parameter, including if the parameter was excluded")

class KallistoCommands(BaseModel):
    commands: List[KallistoCommand] = Field(description="List of Kallisto quantification commands for each sample")

def identify_kallisto_params():
    prompt = f"""

## IDENTITY AND PURPOSE

You are an expert in bioinformatic analyses. You will be provided with various pieces of information, and use this information to determine the appropriate parameters for a Kallisto analysis.

## STEPS

1. Carefully digest the contents of the provided Kallisto documentation. Note that any existing knowledge you have of Kallisto may not be correct, so follow the documentation closely.
2. Carefully consider the contents of the sample metadata. Not all information will be relevant, however there will be content that will be needed.
3. Carefully look through the dataset metadata. This may contain details that are useful.
4. After considering all of the above, determine which Kallisto parameters should be set. Do not make any assumptions that are not explicitly stated for any optional fields. 
5. In determining parameters, make sure you only choose valid files (i.e. pick out of the options which are provided)
6. Ensure that the chosen parameters allow for a robust analysis that would satisfy the most critical peer reviewers.
7. You should prioritize scientific robustness over ease of computational burden.
8. Note the following guidelines for some specific parameters:
- the output directory should be named such that the sample being quantified can be identified from this output directory.

## OUTPUT

Your output should consist of each parameter, and either:
- the value to be included for the parameter
- if the parameter should not be included, you should state NA
- For ALL chosen parameters, describe the justification for including the particular value, or excluding it.

This should be applied to all parameters identified as per the provided Kallisto documentation.

## INPUT

Kallisto documentation: {docs}

Dataset summary: {study_summary}

FASTQ files: {fastq_files}

Possible Kallisto indices: {index_files}

Sample metadata: {combined_df.to_string()}

"""
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format = KallistoCommands
        )
    result = chat_completion.choices[0].message.parsed
    print("Generated tokens: ", chat_completion.usage.completion_tokens)
    print("Prompt tokens: ", chat_completion.usage.prompt_tokens)
    print("Total tokens: ", chat_completion.usage.total_tokens)
    return result
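
Step 5 of the prompt asks the model to pick only from the provided files, but nothing checks that it did. A possible post-hoc check is sketched below in plain Python. `CommandSpec` is a hypothetical stand-in that mirrors a subset of the `KallistoCommand` fields so the example runs without pydantic, and `find_problems` is likewise an illustrative name. The rule that single-end reads need a fragment length and standard deviation comes from the field descriptions above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CommandSpec:
    """Hypothetical stand-in mirroring a subset of KallistoCommand's fields."""
    index: str
    fastq1: str
    fastq2: Optional[str]
    single: bool
    fr_stranded: bool
    rf_stranded: bool
    frag_length: Optional[int]
    sd: Optional[int]

def find_problems(cmd: CommandSpec, known_fastqs, known_indices) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if cmd.index not in known_indices:
        problems.append(f"unknown index: {cmd.index}")
    for fastq in (cmd.fastq1, cmd.fastq2):
        if fastq is not None and fastq not in known_fastqs:
            problems.append(f"unknown FASTQ file: {fastq}")
    if cmd.fr_stranded and cmd.rf_stranded:
        problems.append("fr_stranded and rf_stranded are both set")
    if cmd.single and (cmd.frag_length is None or cmd.sd is None):
        problems.append("single-end reads need frag_length and sd")
    if not cmd.single and cmd.fastq2 is None:
        problems.append("paired-end reads need fastq2")
    return problems

# Made-up inputs for illustration.
good = CommandSpec("human.idx", "a_1.fastq.gz", "a_2.fastq.gz", False, False, False, None, None)
bad = CommandSpec("yeast.idx", "b.fastq.gz", None, True, True, True, None, None)
known_fastqs = {"a_1.fastq.gz", "a_2.fastq.gz", "b.fastq.gz"}
known_indices = {"human.idx", "mouse.idx"}
print(find_problems(good, known_fastqs, known_indices))
print(find_problems(bad, known_fastqs, known_indices))
```

In the notebook, the same checks could be run against each entry of `kallisto_commands.commands`, with `fastq_files` and `index_files` as the known sets.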

In [112]:
kallisto_commands = identify_kallisto_params()

for cmd in kallisto_commands.commands:
    # Construct the Kallisto command string
    kallisto_cmd = f"kallisto quant -i {cmd.index} -o {cmd.output} -t 4"
    
    if cmd.bootstraps > 0:
        kallisto_cmd += f" --bootstrap-samples={cmd.bootstraps}"
    
    if cmd.single:
        kallisto_cmd += " --single"
        if cmd.frag_length:
            kallisto_cmd += f" -l {cmd.frag_length}"
        if cmd.sd:
            kallisto_cmd += f" -s {cmd.sd}"
    else:
        # Paired-end
        if cmd.fr_stranded:
            kallisto_cmd += " --fr-stranded"
        elif cmd.rf_stranded:
            kallisto_cmd += " --rf-stranded"
    
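    # Append FASTQ files (skip Read 2 when it is missing, e.g. for single-end data)
    kallisto_cmd += f" {cmd.fastq1}"
    if cmd.fastq2 and cmd.fastq2.lower() != 'na':
        kallisto_cmd += f" {cmd.fastq2}"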

Generated tokens:  1588
Prompt tokens:  4804
Total tokens:  6392
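
An alternative to concatenating one shell string is to build the command as an argument list, which can be passed to `subprocess.run` without `shell=True` and avoids quoting problems with LLM-supplied paths. The sketch below mirrors the branching of the loop above and uses only the flags that already appear in this notebook. `build_quant_argv` and its parameter names are hypothetical.

```python
from typing import List, Optional

def build_quant_argv(
    index: str,
    output: str,
    fastq1: str,
    fastq2: Optional[str] = None,
    bootstraps: int = 0,
    single: bool = False,
    frag_length: Optional[int] = None,
    sd: Optional[int] = None,
    fr_stranded: bool = False,
    rf_stranded: bool = False,
    threads: int = 4,
) -> List[str]:
    """Build a `kallisto quant` argument list, following the same branches as the loop above."""
    argv = ["kallisto", "quant", "-i", index, "-o", output, "-t", str(threads)]
    if bootstraps > 0:
        argv.append(f"--bootstrap-samples={bootstraps}")
    if single:
        argv.append("--single")
        if frag_length:
            argv += ["-l", str(frag_length)]
        if sd:
            argv += ["-s", str(sd)]
    elif fr_stranded:
        argv.append("--fr-stranded")
    elif rf_stranded:
        argv.append("--rf-stranded")
    # Read 2 is appended only when it is actually present.
    argv.append(fastq1)
    if fastq2 and fastq2.lower() != "na":
        argv.append(fastq2)
    return argv

# Made-up file names for illustration.
paired = build_quant_argv("idx", "out/S1", "a_1.fastq.gz", "a_2.fastq.gz", bootstraps=100)
single_end = build_quant_argv("idx", "out/S2", "b.fastq.gz", single=True, frag_length=200, sd=20)
print(paired)
print(single_end)
```

The resulting list could then be run with `subprocess.run(argv, check=True)`.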


In [115]:
def execute_kallisto_commands(kallisto_commands: KallistoCommands):
    for cmd in kallisto_commands.commands:
        # Construct the Kallisto command string
        kallisto_cmd = f"kallisto quant -i {cmd.index} -o {cmd.output} -t 4 --plaintext"
        
        if cmd.bootstraps > 0:
            kallisto_cmd += f" --bootstrap-samples={cmd.bootstraps}"
        
        if cmd.single:
            kallisto_cmd += " --single"
            if cmd.frag_length:
                kallisto_cmd += f" -l {cmd.frag_length}"
            if cmd.sd:
                kallisto_cmd += f" -s {cmd.sd}"
        else:
            # Paired-end
            if cmd.fr_stranded:
                kallisto_cmd += " --fr-stranded"
            elif cmd.rf_stranded:
                kallisto_cmd += " --rf-stranded"
        
        # Append FASTQ files
        if cmd.fastq2 and cmd.fastq2.lower() != 'na':
            kallisto_cmd += f" {cmd.fastq1} {cmd.fastq2}"
        else:
            kallisto_cmd += f" {cmd.fastq1}"
        
        print(f"Executing Kallisto command for {cmd.fastq1}:")
        print(kallisto_cmd)

        # Execute the command
        try:
            subprocess.run(kallisto_cmd, shell=True, check=True)
            print(f"Kallisto quantification completed for {cmd.fastq1}\n")
        except subprocess.CalledProcessError as e:
            print(f"Error executing Kallisto for {cmd.fastq1}: {e}\n")
        
        # Optionally, log the justification
        justification_path = os.path.join(cmd.output, "justification.txt")
        os.makedirs(cmd.output, exist_ok=True)
        with open(justification_path, "w") as f:
            f.write(cmd.justification)
        print(f"Justification saved to {justification_path}\n")

In [116]:
if __name__ == "__main__":
    kallisto_commands = identify_kallisto_params()
    execute_kallisto_commands(kallisto_commands)

Generated tokens:  1566
Prompt tokens:  4804
Total tokens:  6370
Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101291_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101291 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101291_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101291_2.fastq.gz
Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved re


[quant] fragment length distribution will be estimated from the data
Error: could not create directory output/SRR29101291


[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101292_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101292_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 30,242 reads pseudoaligned
[quant] estimated average fragment length: 200.645
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 669 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101292_1.fastq.gz

Justification saved to output/SRR29101292/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101293 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101293_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101293_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 32,224 reads pseudoaligned
[quant] estimated average fragment length: 201.753
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 724 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz

Justification saved to output/SRR29101293/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101294 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101294_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101294_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 30,720 reads pseudoaligned
[quant] estimated average fragment length: 212.242
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 675 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz

Justification saved to output/SRR29101294/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101295 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101295_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101295_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 29,669 reads pseudoaligned
[quant] estimated average fragment length: 203.446
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 711 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz

Justification saved to output/SRR29101295/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101296 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101296_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101296_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 31,540 reads pseudoaligned
[quant] estimated average fragment length: 214.288
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 683 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz

Justification saved to output/SRR29101296/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101297 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101297_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101297_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 33,865 reads pseudoaligned
[quant] estimated average fragment length: 212.372
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 675 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz

Justification saved to output/SRR29101297/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101298 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101298_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101298_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 34,001 reads pseudoaligned
[quant] estimated average fragment length: 210.536
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 692 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz

Justification saved to output/SRR29101298/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101299 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101299_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101299_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 36,915 reads pseudoaligned
[quant] estimated average fragment length: 204.182
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 810 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz

Justification saved to output/SRR29101299/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101300 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101300_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101300_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 37,198 reads pseudoaligned
[quant] estimated average fragment length: 211.44
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 709 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz

Justification saved to output/SRR29101300/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101301 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101301_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101301_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 38,817 reads pseudoaligned
[quant] estimated average fragment length: 203.794
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 704 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz

Justification saved to output/SRR29101301/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101302 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101302_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101302_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 38,759 reads pseudoaligned
[quant] estimated average fragment length: 219.194
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 738 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz

Justification saved to output/SRR29101302/justification.txt



In [117]:
execute_kallisto_commands(kallisto_commands)

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101291_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101291 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101291_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101291_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101291_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101291_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 30,364 reads pseudoaligned
[quant] estimated average fragment length: 209.483
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 732 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101291_1.fastq.gz

Justification saved to output/SRR29101291/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101292_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101292 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101292_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101292_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101292_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101292_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 30,242 reads pseudoaligned
[quant] estimated average fragment length: 200.645
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 669 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101292_1.fastq.gz

Justification saved to output/SRR29101292/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101293 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101293_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101293_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 32,224 reads pseudoaligned
[quant] estimated average fragment length: 201.753
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 724 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101293_1.fastq.gz

Justification saved to output/SRR29101293/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101294 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101294_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101294_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 30,720 reads pseudoaligned
[quant] estimated average fragment length: 212.242
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 675 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101294_1.fastq.gz

Justification saved to output/SRR29101294/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101295 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101295_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101295_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 29,669 reads pseudoaligned
[quant] estimated average fragment length: 203.446
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 711 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101295_1.fastq.gz

Justification saved to output/SRR29101295/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101296 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101296_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101296_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 31,540 reads pseudoaligned
[quant] estimated average fragment length: 214.288
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 683 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101296_1.fastq.gz

Justification saved to output/SRR29101296/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101297 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101297_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101297_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 33,865 reads pseudoaligned
[quant] estimated average fragment length: 212.372
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 675 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101297_1.fastq.gz

Justification saved to output/SRR29101297/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101298 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101298_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101298_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 34,001 reads pseudoaligned
[quant] estimated average fragment length: 210.536
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 692 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101298_1.fastq.gz

Justification saved to output/SRR29101298/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101299 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101299_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101299_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 36,915 reads pseudoaligned
[quant] estimated average fragment length: 204.182
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 810 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101299_1.fastq.gz

Justification saved to output/SRR29101299/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101300 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101300_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101300_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 37,198 reads pseudoaligned
[quant] estimated average fragment length: 211.44
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 709 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101300_1.fastq.gz

Justification saved to output/SRR29101300/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101301 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101301_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101301_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 38,817 reads pseudoaligned
[quant] estimated average fragment length: 203.794
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 704 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101301_1.fastq.gz

Justification saved to output/SRR29101301/justification.txt

Executing Kallisto command for ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz:
kallisto quant -i /home/myuser/work/data/kallisto_indices/human/index.idx -o output/SRR29101302 -t 4 --plaintext --bootstrap-samples=100 ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz ../2_extract_data/GSE268034_data/SRR29101302_2.fastq.gz



[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 227,665
[index] number of k-mers: 139,900,295
[index] number of D-list k-mers: 5,477,475
[quant] running in paired-end mode
[quant] will process pair 1: ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz
                             ../2_extract_data/GSE268034_data/SRR29101302_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 80,000 reads, 38,759 reads pseudoaligned
[quant] estimated average fragment length: 219.194
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 738 rounds




Kallisto quantification completed for ../2_extract_data/GSE268034_data/SRR29101302_1.fastq.gz

Justification saved to output/SRR29101302/justification.txt



This looks good to me. I will leave it here for the moment, but the pressing questions are:
a) external validation (for which I think I will rely on external datasets)
b) whether this method is robust
c) an internal validation/checking mechanism

But I am happy with the bones of this. 

Wait, no, I'm not: I want to take everything here and put it in a single function. That can wait until I finalise the next step.
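
On question c), here is a minimal sketch of what an internal check might look like. It assumes the Kallisto log text for one sample has been captured as a string, and it only reads the `[quant] processed ... reads, ... reads pseudoaligned` line that appears in the output above. The function names `pseudoalignment_rate` and `flag_low_rate`, and the `min_rate=0.3` threshold, are placeholders of mine for illustration, not anything defined by Kallisto or elsewhere in this notebook.

```python
import re

# Matches the summary line seen in the logs above, e.g.
# "[quant] processed 80,000 reads, 37,198 reads pseudoaligned"
_QUANT_LINE = re.compile(
    r"processed\s+([\d,]+)\s+reads,\s+([\d,]+)\s+reads pseudoaligned"
)


def pseudoalignment_rate(log_text):
    """Return pseudoaligned / processed from a Kallisto log, or None if not found."""
    match = _QUANT_LINE.search(log_text)
    if match is None:
        return None
    # Kallisto prints thousands separators, so strip the commas before converting
    processed = int(match.group(1).replace(",", ""))
    pseudoaligned = int(match.group(2).replace(",", ""))
    if processed == 0:
        return None
    return pseudoaligned / processed


def flag_low_rate(log_text, min_rate=0.3):
    """Return True if the sample should be reviewed.

    min_rate is an arbitrary placeholder threshold (an assumption), not a
    recommended value. A missing summary line is also flagged.
    """
    rate = pseudoalignment_rate(log_text)
    return rate is None or rate < min_rate


if __name__ == "__main__":
    example = "[quant] processed 80,000 reads, 37,198 reads pseudoaligned"
    print(pseudoalignment_rate(example))
    print(flag_low_rate(example))
```

Something like this would only catch gross failures (no summary line, or a very low rate), so it would complement rather than replace the external validation in a).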

# Combined function