# Introduction

This notebook will contain experimentation (and hopefully the final implementation) of me using LLMs to analyse the Kallisto quantification data.

The ideal final outcome is a workflow where I can take in my Kallisto quantification files and perform DEG analysis. However, the exploratory steps that I'd be interested in:
- How well does the LLM produce a working R (I can more comfortably work with R) pipeline?
- How well does it handle inputs/outputs?
- How well will it handle the METADATA?
- How much guidance do I need to give? e.g. with the libraries that are available (in theory, I'd like this to be a "step" that the LLM is smart enough to know to implement). I don't want to have the LLM install new packages, that feels like a security risk.

Other notes:
- For the moment, I'll have the LLM use the "LLM Playground" directory to save its outputs
- In my head, this "workflow" will be "hi, here's what I want, do some steps to achieve this" - a bit like the worked example of solving an equation
- I also need to integrate this with 

In [1]:
# Load modules
from openai import OpenAI
import sys
import openai # I need this and above
import os
from tqdm import tqdm
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List, Dict, Literal
import subprocess
import glob
import asyncio
import json
import base64 # image interpretation
import requests # image interpretation
import shlex # suggested for command-line strings
from datetime import datetime

In [3]:
# Quick OpenAI API test - note this does not reflect what I intend my end prompt to be, just want to get a quick idea of what I get...

load_dotenv('../../.env')

openai_api_key = os.getenv('OPENAI_API_KEY')

# Test OpenAI API...

client = OpenAI(
  api_key=openai_api_key,
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Could you provide code to import abundance.tsv Kallisto files into R and identify DEGs?",
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

To analyze differential gene expression (DEG) using Kallisto's output (`abundance.tsv` files) in R, we typically follow several steps. Below is an outline of the process along with sample code.

### Prerequisites

Make sure you have the following packages installed in R:
- `tximport` for importing Kallisto's abundance data.
- `DESeq2` for differential expression analysis.

You can install these packages using:
```R
install.packages("BiocManager")
BiocManager::install("tximport")
BiocManager::install("DESeq2")
```

### Step-by-Step Code

Here’s how to do it:

1. **Import the Kallisto output**: Use the `tximport` function to read in your `abundance.tsv` files.

2. **Create DESeq2 data objects**: Create a DESeq2 object with the imported data.

3. **Run differential expression analysis**: Perform the analysis to identify DEGs.

Here’s a sample code snippet that combines these steps:

```R
# Load necessary libraries
library(tximport)
library(DESeq2)

# Define your paths and conditions
kalli

In [4]:
print(f"Generated tokens: ", chat_completion.usage.completion_tokens)
print(f"Prompt tokens: ", chat_completion.usage.prompt_tokens)
print(f"Total tokens: ", chat_completion.usage.total_tokens)

Generated tokens:  712
Prompt tokens:  26
Total tokens:  738


Obviously a one-sentence prompt will get nowhere.

# Investigating metadata

I technically have a separate notebook analysing metadata, but I will more formally do my tests here.

The initial test case is to give a metadata CSV and see if the LLM is able to identify what contrasts would be interesting. However, I would eventually probably want a separate function for finding the CSV, and I would later also need to determine what specific outputs I want.

At least in the initial conceptualisation stage, I'm not sure where I'll be integrating this (i.e. will this be something I do separately, then feed as input into the LLM), but nonetheless my goal is to develop a prompt that will get meaningful results

In [150]:
meta = pd.read_csv("/home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv")
meta

Unnamed: 0,title,geo_accession,status,submission_date,last_update_date,type,channel_count,source_name_ch1,organism_ch1,characteristics_ch1,...,library_selection,library_source,library_strategy,relation,relation.1,supplementary_file_1,cell line:ch1,cell type:ch1,genotype:ch1,treatment:ch1
0,SUDHL4_LacZ_RGFP0_1,GSM8284502,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479047,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625208,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284502/suppl/GSM8284502_SUDHL4_LacZ_RGFP0_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,DMSO
1,SUDHL4_LacZ_RGFP0_2,GSM8284503,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479046,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625209,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284503/suppl/GSM8284503_SUDHL4_LacZ_RGFP0_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,DMSO
2,SUDHL4_LacZ_RGFP5_1,GSM8284504,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479045,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625210,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284504/suppl/GSM8284504_SUDHL4_LacZ_RGFP5_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,RGFP966 (5 µM)
3,SUDHL4_LacZ_RGFP5_2,GSM8284505,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479044,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625211,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284505/suppl/GSM8284505_SUDHL4_LacZ_RGFP5_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,RGFP966 (5 µM)
4,SUDHL4_GNASKO2_RGFP0_1,GSM8284506,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479043,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625212,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284506/suppl/GSM8284506_SUDHL4_GNASKO2_RGFP0_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO
5,SUDHL4_GNASKO2_RGFP0_2,GSM8284507,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479042,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625213,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284507/suppl/GSM8284507_SUDHL4_GNASKO2_RGFP0_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO
6,SUDHL4_GNASKO2_RGFP5_1,GSM8284508,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479041,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625214,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284508/suppl/GSM8284508_SUDHL4_GNASKO2_RGFP5_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,RGFP966 (5 µM)
7,SUDHL4_GNASKO2_RGFP5_2,GSM8284509,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479040,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625215,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284509/suppl/GSM8284509_SUDHL4_GNASKO2_RGFP5_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,RGFP966 (5 µM)
8,SUDHL4_GNASKO3_RGFP0_1,GSM8284510,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479039,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625216,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284510/suppl/GSM8284510_SUDHL4_GNASKO3_RGFP0_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO
9,SUDHL4_GNASKO3_RGFP0_2,GSM8284511,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479038,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625217,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284511/suppl/GSM8284511_SUDHL4_GNASKO3_RGFP0_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO


In [151]:
meta.to_string

<bound method DataFrame.to_string of                      title geo_accession                 status  \
0      SUDHL4_LacZ_RGFP0_1    GSM8284502  Public on Aug 08 2024   
1      SUDHL4_LacZ_RGFP0_2    GSM8284503  Public on Aug 08 2024   
2      SUDHL4_LacZ_RGFP5_1    GSM8284504  Public on Aug 08 2024   
3      SUDHL4_LacZ_RGFP5_2    GSM8284505  Public on Aug 08 2024   
4   SUDHL4_GNASKO2_RGFP0_1    GSM8284506  Public on Aug 08 2024   
5   SUDHL4_GNASKO2_RGFP0_2    GSM8284507  Public on Aug 08 2024   
6   SUDHL4_GNASKO2_RGFP5_1    GSM8284508  Public on Aug 08 2024   
7   SUDHL4_GNASKO2_RGFP5_2    GSM8284509  Public on Aug 08 2024   
8   SUDHL4_GNASKO3_RGFP0_1    GSM8284510  Public on Aug 08 2024   
9   SUDHL4_GNASKO3_RGFP0_2    GSM8284511  Public on Aug 08 2024   
10  SUDHL4_GNASKO3_RGFP5_1    GSM8284512  Public on Aug 08 2024   
11  SUDHL4_GNASKO3_RGFP5_2    GSM8284513  Public on Aug 08 2024   

   submission_date last_update_date type  channel_count source_name_ch1  \
0      May 21 20

In [152]:
prompt = f"""

## IDENTITY AND PURPOSE

You are an expert in bioinformatic analyses. You will be provided with a metadata sheet, and are tasked with identifying contrasts that could be interesting in the metadata, with the intention of analysing these in a edgeR/limma based pipeline.
Take a deep breath, and carefully follow the steps outlined below to achieve the intended task.

## STEPS

1. Carefully consider each column, inferring what each column means from its name, and also the values in the column. 
2. Determine columns that appear to contain data that would be scientifically and biologically interesting to compare within the column.
- Only include comparisons that can be easily analysed in a limma/edgeR based pipeline
- Only include comparisons that would be generally valuable to scientific and medical literature
- Only include comparisons that can be made within this dataset only - i.e. does not require samples from additional datasets

## OUTPUT

1. For each comparison, include the EXACT column name, as well as the EXACT values that should be used for the comparison. Additionally, justify why the comparison would be interesting using up to 3 sentences

## INPUT

Metadata:
{meta.to_string()}

"""

chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

Here are several interesting comparisons based on the provided metadata:

1. **Comparison 1**:
   - **Column Name**: `genotype:ch1`
   - **Values**: `WT` vs. `GNAS knockout`
   - **Justification**: Comparing the wild-type (WT) genotype with the GNAS knockout model can reveal insights into the role of GNAS in diffuse large B-cell lymphoma. A differential expression analysis may uncover how gene knockout affects the expression profile, potentially identifying pathways or markers critical in the disease.

2. **Comparison 2**:
   - **Column Name**: `treatment:ch1`
   - **Values**: `DMSO` vs. `RGFP966 (5 µM)`
   - **Justification**: Analyzing the gene expression differences between the DMSO-treated and RGFP966-treated samples allows for the evaluation of RGFP966's effects as a therapeutic agent. Such comparisons can yield important information on its mechanism of action and efficacy in affecting cellular processes in DLBCL.

3. **Comparison 3**:
   - **Column Name**: `genotype:ch1`
   - **V

The above does seem pretty good - it is capturing everything that I want. However, I could imagine improvements if I
1. Repeated multiple times
2. Collate responses (a bit of experimentation reveals this will most likely be a combination of code, but also an LLM to remove "loose" duplicates)
3. Give scores to responses, to determine what the "final" list of contrasts to analyse should be.

I will therefore adapt the approach I took in identifying relevant datasets, and implement it here (since I did perform both).

I will need to give special consideration to how to evaluate/score the contrasts (perhaps Mr. Claude/ChatGPT will be helpful for me...)

In [218]:
# This code block is for getting the study summary. I will want to implement this in determining the appropriate contrasts.

def get_study_summary(accession, edirect_path="/home/myuser/edirect"):

    # Define the command as a string
    command = (
        f'esearch -db gds -query "{accession}[ACCN]" | '
        'efetch -format docsum | '
        'xtract -pattern DocumentSummarySet -block DocumentSummary '
        f'-if Accession -equals {accession} -element summary'
    )

    # Execute the command
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

    # Check if the command was successful
    if result.returncode == 0:
        # Return the output
        return result.stdout.strip()
    else:
        # Raise an error with the stderr output
        raise Exception(f"Error: {result.stderr}")

# Example usage:
study_summary = get_study_summary("GSE268034")
print(study_summary)

Despite selective HDAC3 inhibition showing promise in a subset of lymphomas with CREBBP mutations, wild-type tumors generally exhibit resistance. Here, using unbiased genome-wide CRISPR screening, we identify GNAS knockout (KO) as a sensitizer of resistant lymphoma cells to HDAC3 inhibition. Mechanistically, GNAS KO-induced sensitization is independent of the canonical G-protein activities but unexpectedly mediated by viral mimicry-related interferon (IFN) responses, characterized by TBK1 and IRF3 activation, double-stranded RNA formation, and transposable element (TE) expression. GNAS KO additionally synergizes with HDAC3 inhibition to enhance CD8+ T cell-induced cytotoxicity. Moreover, we observe in human lymphoma patients that low GNAS expression is associated with high baseline TE expression and upregulated IFN signaling and shares common disrupted biological activities with GNAS KO in histone modification, mRNA processing, and transcriptional regulation. Collectively, our findings

In [384]:
class Assessment(BaseModel):
    name: str = Field(description = "A name to be given to describe the contrast")
    column: str = Field(description = "Column, or column, in the metadata containing the values to be compared")
    values: str = Field(description = "The values in the identified column that are to be compared")
    justification: str = Field(description = "Justification for why the suggested contrast will be of use")

class Contrasts(BaseModel):
    contrasts: list[Assessment]

def identify_contrasts(meta):
    prompt = f"""

## IDENTITY AND PURPOSE

You are an expert in bioinformatic analyses. You will be provided with a metadata sheet, and are tasked with identifying contrasts that could be interesting in the metadata, with the intention of analysing these in a edgeR/limma based pipeline.
Take a deep breath, and carefully follow the steps outlined below to achieve the intended task.

## STEPS

1. Carefully consider each column, inferring what each column means from its name, and also the values in the column. 
2. Carefully digest the contents of the study summary to help identify points of interest in the study
- Use the study summary to generate up to three highly valuable and focussed research questions
3. Determine columns that appear to contain data that would be scientifically and biologically interesting to analyse
- These should be derived from the research questions determined from the study summary
- Only include analyses that can be made within this dataset only - i.e. does not require samples from additional datasets
- You are permitted to draw comparisons involving multiple different columns
4. Specify the values in the columns that should be used to for the comparison
- Only include comparisons that can be easily analysed in a limma/edgeR based pipeline. 
- Specifically take into consideration how a contrast matrix could be set up using the model.matrix and makeContrasts functions.
- You are permitted to draw comparisons involving multiple different columns
- Each comparison should be highly focussed
- A comparison should only involve a comparison between two groups (i.e. no three-way comparisons - this should instead be framed as 3 separate two-way comaprisons)

## OUTPUT

1. Include output for each proposed comparison
2. Specify the exact column name(s) that will need to be used for the comparison
- If this includes columns that are needed to identify relevant samples, include these as well
- For example, if the comparison is "Treatment X vs. Y in genotype A samples," you should indicate both the Treatment and Genotype columns, assuming there are multiple genotypes
- This is because the "Genotype" column is relevant for filtering down to Genotype A samples"
- Separate each value with ",". Do not include any other formatting, e.g. "vs" or "-".
3. Specify the exact values that will be used for the comparison
- This includes any values which will be needed for filtering down, as per Point 2 in OUTPUT
- If the samples to be compared are, for example "Treatment X vs. Y in genotpye A samples", the output should be "X, A vs. Y, A"
- If the samples to be compared are, for example "Treatment X vs. Y", the output should be "X vs. Y"
4. Justify why the comparison would be interesting using up to 3 sentences


For points 2 and 3, note that this should include enough information for someone to generate an appropriate contrast matrix using model.matrix and makeContrasts.

## INPUT

Study summary:
{study_summary}

Metadata:
{meta.to_string()}

"""
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format = Contrasts
        )
    result = chat_completion.choices[0].message.parsed
    print(f"Generated tokens: ", chat_completion.usage.completion_tokens)
    print(f"Prompt tokens: ", chat_completion.usage.prompt_tokens)
    print(f"Total tokens: ", chat_completion.usage.total_tokens)
    return(result)

async def identify_contrasts_multiple(meta, num_queries: int = 3) -> Contrasts:
    async def single_identify_contrasts():
        return identify_contrasts(meta)

    tasks = [single_identify_contrasts() for _ in range(num_queries)]
    results = await asyncio.gather(*tasks)

    # Combine the results
    all_contrasts = Contrasts(contrasts=[])
    for result in results:
        all_contrasts.contrasts.extend(result.contrasts)

    # Deduplication process to remove duplicate contrasts
    contrasts_dict = all_contrasts.dict()
    seen = set()
    unique_contrasts = []

    for item in contrasts_dict['contrasts']:
        identifier = (item['column'], item['values'])
        if identifier not in seen:
            unique_contrasts.append(item)
            seen.add(identifier)

    # Replace the original list with the filtered one
    contrasts_dict['contrasts'] = unique_contrasts

    # Convert back to the Contrasts model
    unique_contrasts_model = Contrasts(**contrasts_dict)

    return unique_contrasts_model

In [385]:
contrasts = await identify_contrasts_multiple(meta, num_queries=2)
print(contrasts)

Generated tokens:  397
Prompt tokens:  9442
Total tokens:  9839
Generated tokens:  206
Prompt tokens:  9442
Total tokens:  9648
contrasts=[Assessment(name='RGFP966 Treatment vs DMSO in WT Genotype', column='treatment:ch1, genotype:ch1', values='RGFP966 (5 µM), WT vs DMSO, WT', justification='This contrast will allow us to analyze the effect of HDAC3 inhibition (RGFP966) on gene expression specifically in the wild-type genotype of lymphoma cells, providing insights into how treatment alters cellular pathways and potential resistance mechanisms.'), Assessment(name='RGFP966 Treatment vs DMSO in GNAS Knockout', column='treatment:ch1, genotype:ch1', values='RGFP966 (5 µM), GNAS knockout vs DMSO, GNAS knockout', justification='Comparing the gene expression profiles between GNAS knockout cells treated with RGFP966 versus DMSO can elucidate the specific contributions of GNAS and its interactions with HDAC3 inhibition, especially in the context of enhanced cytotoxicity.'), Assessment(name='GNAS

In [386]:
class ComparisonEval(BaseModel):
    comparison: str
    score: int
    score_justification: str
    redundant: Literal["Yes", "No"]
    redundant_justification: str
    retain: Literal["Yes", "No"]

class AllEvals(BaseModel):
    evals: list[ComparisonEval]

prompt = f"""

### PURPOSE AND IDENTITY

You are an expert and experienced bioinformatician and scientist, who focuses on clarifying analyses which will be meaningful to perform. 

You have been tasked with evaluating the potential scientific value of proposed comparisons. These comparisons are intended to be performed in a edgeR/limma-based RNA-seq pipeline.

Take a deep breath, and carefully follow the below steps to achieve the best possible outcome.

### STEPS 

1. You will be provided a Python dictionary of proposed scientific comparisons which have been proposed for a limma/edgeR RNAseq pipeline.
- Do not propose any additional scientific comparisons beyond those specified in this Python dictionary
- Throughout your evaluation, keep in mind that the analysis will be based on the construction of a contrast matrix, using the values specified in the column and values.
2. You will also be provided metadata, which contains data that is mentioned in the Python dictionary, as well as a study summary
- Do NOT use this metadata to hallucinate additional comparisons
- Use this metadata ONLY to gather additional context for the defined scientific comparisons.
- Use the study summary to contextualise
3. For each proposed analysis, assign a score between 1 - 5, based on the scientific value that can be extracted out of the comparison. Do this independently for each comparison. Use the below as a scoring guide:

Score 5 – Outstanding Scientific Value

	•	The proposed comparison is highly relevant and addresses a significant scientific question or hypothesis.
	•	The comparison is likely to yield new and impactful insights that could lead to meaningful advancements in the field.
	•	The analysis is well-aligned with the biological context provided by the metadata and is expected to generate robust, interpretable results.
	•	The comparison is novel or provides a unique perspective that has not been previously explored.

Score 4 – High Scientific Value

	•	The proposed comparison is scientifically sound and addresses an important question.
	•	The analysis has the potential to contribute valuable insights, though it may be incremental rather than groundbreaking.
	•	The comparison is well-supported by the metadata and is expected to produce meaningful results.
	•	The comparison adds depth to existing knowledge but may not be entirely novel.

Score 3 – Moderate Scientific Value

	•	The proposed comparison is reasonable and could yield useful information.
	•	The analysis addresses a relevant question, though the scientific impact may be limited or somewhat unclear.
	•	The comparison is supported by the metadata but may not be as compelling or novel as higher-scoring comparisons.
	•	The results may be interesting but are likely to confirm existing knowledge rather than provide new insights.

Score 2 – Low Scientific Value

	•	The proposed comparison is somewhat relevant but does not address a particularly important or novel question.
	•	The analysis may yield some useful data, but the scientific impact is expected to be minimal.
	•	The comparison is only partially supported by the metadata, and the results may be difficult to interpret or have limited applicability.
	•	The comparison may be redundant with existing analyses or provide only marginal additional insights.

Score 1 – Minimal or No Scientific Value

	•	The proposed comparison is poorly conceived and unlikely to yield meaningful scientific insights.
	•	The analysis does not address a relevant or important question, or the rationale for the comparison is unclear.
	•	The comparison is not well-supported by the metadata, and the results are likely to be uninterpretable or irrelevant.
	•	The comparison may be redundant, trivial, or based on a flawed premise.

4. For each comparison, also identify if it is redundant and/or overlapping with another comparison.
- An example of this is identical "column" and "values" (e.g. column of "A" and values of "val1, val2" as compared to "val2, val1" or "val1 - val2")
- **Important** Note that if comparison 1 is redundant with comparison 2, BOTH comparisons 1 and 2 should be marked as redundant.
- Comparisons which are similar, but not overlapping, should not be classed as redundant
- Only classify comparisons as redundant if you are highly confident that they are redundant
- Keep in mind the analysis will be based on an edgeR/limma/DESeq2 pipeline - if two analyses are likely to require the identical experimental setup, these are redundant.
- After evaluating all comparisons for redundancy, double check whether the intended repsonse for any other comparison needs to be altered accordingly.
5. Based on your score evaluation and redundancy evaluation, make an evaluation as to whether each comparison should be retained. 
- When there are redundant comparisons, ONLY the comparison with the higher scientific value score should be retained
- If redundant comparisons have the same scientific value score, then retain EXACTLY one if both meet the scientific value score
- A scientific value score of 4 should be used as the threshold to retain a comparison
6. Prior to reporting results, double check that your responses are reasonable, and you have followed the steps correctly.
7. Report your results in accordance to the instructions in OUTPUT.

### OUTPUT

1. Include output for all proposed comparisons. Use the comparison name to describe each comparison.
2. Specify the scientific evaluation score
3. Include justification for the scientific evaluation score
4. Specify if the comparison is redundant
5. If redundant, justify why it is redundant. If not redundant, specify "Not redundant" for this field.
- The justification for selecting which of the redundant comparisons, if any, should be specified here.
6. Specify if the comparison should be retained or not

### PROPOSED SCIENTIFIC ANALYSES

{contrasts}

### METADATA

{meta.to_string()}
"""

chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=AllEvals
)
print(f"Generated tokens: ", chat_completion.usage.completion_tokens)
print(f"Prompt tokens: ", chat_completion.usage.prompt_tokens)
print(f"Total tokens: ", chat_completion.usage.total_tokens)
evals = chat_completion.choices[0].message.parsed

Generated tokens:  453
Prompt tokens:  10245
Total tokens:  10698


In [387]:
contrasts_df = contrasts.dict()
contrasts_df = pd.DataFrame(contrasts_df['contrasts'])
contrasts_df 

Unnamed: 0,name,column,values,justification
0,RGFP966 Treatment vs DMSO in WT Genotype,"treatment:ch1, genotype:ch1","RGFP966 (5 µM), WT vs DMSO, WT","This contrast will allow us to analyze the effect of HDAC3 inhibition (RGFP966) on gene expression specifically in the wild-type genotype of lymphoma cells, providing insights into how treatment alters cellular pathways and potential resistance mechanisms."
1,RGFP966 Treatment vs DMSO in GNAS Knockout,"treatment:ch1, genotype:ch1","RGFP966 (5 µM), GNAS knockout vs DMSO, GNAS knockout","Comparing the gene expression profiles between GNAS knockout cells treated with RGFP966 versus DMSO can elucidate the specific contributions of GNAS and its interactions with HDAC3 inhibition, especially in the context of enhanced cytotoxicity."
2,GNAS Knockout vs Wild Type in DMSO Treatment,"genotype:ch1, treatment:ch1","GNAS knockout, DMSO vs WT, DMSO","This contrast aims to reveal differences in baseline gene expression between GNAS knockout and wild-type cells in a non-treated state, likely highlighting the inherent differences in response mechanisms to environmental signals and basal gene activity."
3,GNAS Knockout vs Wild Type in RGFP966 Treatment,"genotype:ch1, treatment:ch1","GNAS knockout, RGFP966 (5 µM) vs WT, RGFP966 (5 µM)","Investigating how GNAS knockout influences the response to RGFP966 treatment compared to the wild-type cells will shed light on the impact of GNAS deletion on treatment outcomes, potentially identifying GNAS as a therapeutic target for improving treatment efficacy."
4,RGFP966 vs DMSO in WT samples,treatment,"RGFP966,DMSO","This comparison will help assess the effects of RGFP966 treatment on gene expression in wild-type lymphoma cells, potentially revealing significant changes due to treatment."
5,Comparison of treatment effects between WT and GNAS knockout samples,"genotype,treatment","WT,RGFP966 vs GNAS knockout,RGFP966","Comparing RGFP966-treated WT and GNAS knockout samples could illuminate the adaptive response differences due to GNAS presence or absence, which may inform potential therapeutic strategies."


In [388]:
evals_df = evals.dict()
evals_df = pd.DataFrame(evals_df['evals'])
evals_df.to_csv("temp.csv")
evals_df

Unnamed: 0,comparison,score,score_justification,redundant,redundant_justification,retain
0,RGFP966 Treatment vs DMSO in WT Genotype,4,"This comparison is relevant as it assesses the impact of HDAC3 inhibition on gene expression in a well-defined WT genotype, promising insights into treatment effects and resistance mechanisms.",No,Not redundant,Yes
1,RGFP966 Treatment vs DMSO in GNAS Knockout,4,"This analysis focuses on the specific role of GNAS in the response to RGFP966, likely yielding important insights on its contribution to cytotoxicity and treatment effectiveness, complementing the WT analysis.",No,Not redundant,Yes
2,GNAS Knockout vs Wild Type in DMSO Treatment,3,"While this contrast addresses basic differences in gene expression between genotypes under a control treatment, it is less impactful compared to the treatment contrasts and could mainly confirm existing knowledge.",No,Not redundant,No
3,GNAS Knockout vs Wild Type in RGFP966 Treatment,4,"This comparison is significant as it explores how GNAS knockout affects the treatment response, potentially identifying a therapeutic target for enhancing treatment effectiveness.",No,Not redundant,Yes
4,RGFP966 vs DMSO in WT samples,4,"This analysis is crucial for understanding overall treatment effects in WT samples, providing foundational insights into gene expression changes induced by RGFP966.",No,Not redundant,Yes
5,Comparison of treatment effects between WT and GNAS knockout samples,5,"This comparative analysis has outstanding scientific value as it investigates the differential treatment responses, which could lead to significant advancements in therapeutic strategies involving GNAS.",No,Not redundant,Yes


I'm not entirely satisfied with the outcome so far (mainly with the inability to identify redundant contrasts) - however, the contrasts it is identifying do seem of interest, and I must admit is better than the singular one I came up with in my initial testing.

I noted several instances of hallucinations, parituclarly with imagining contrast that I did not specify. I've tried to stamp these out... a bit concerningly, these were sometimes marked as "retain".

My plan at the moment is to leave this as is (at least for the moment), and when I begin the prompt to develop the code a bit more explicitly, I think I might just include another check to see "would the code be functionally identical? -> if yes, ignore". This might be sufficient.

As it turns out, I will also want a bit of code to just return the contrasts that are to be retained.

In [372]:
keep = evals_df.index[evals_df['retain'] == 'Yes'].tolist()
contrasts.dict()

filtered_contrasts = {
    'contrasts': [contrasts.dict()['contrasts'][i] for i in keep]
}
filtered_contrasts # I think this is sufficient for my purposes

{'contrasts': [{'name': 'RGFP966 vs DMSO in WT samples',
   'column': 'treatment:ch1',
   'values': 'RGFP966 (5 µM)',
   'justification': 'This comparison will help elucidate the differential gene expression triggered by RGFP966 treatment in wild-type lymphoma cells, providing insights into the mechanism underlying HDAC3 inhibition.'},
  {'name': 'GNAS knockout vs Wild-type in DMSO-treated samples',
   'column': 'genotype:ch1',
   'values': 'GNAS knockout',
   'justification': 'This contrast will allow us to assess the impact of GNAS knockout on baseline gene expression profiles in the absence of treatment, highlighting pathways involved in tumor resistance to therapies.'},
  {'name': 'GNAS WT vs DMSO in SU-DHL-4',
   'column': 'genotype:ch1, treatment:ch1',
   'values': 'WT, DMSO vs. WT, RGFP966 (5 µM)',
   'justification': 'Comparing the gene expression profiles of wild-type GNAS treated with DMSO versus RGFP966 can elucidate the baseline effects of the drug relative to control, help

The other way to approach the generation of contrasts is to
1. Identify the relevant columns
2. Identify what biologically relevant contrasts could be made here

I will test what I get when I take this approach, and see how things turn out.

In [394]:
class Assessment(BaseModel):
    name: str = Field(description = "A name to be given to describe the contrast")

prompt = f"""

## IDENTITY AND PURPOSE

You are an expert in bioinformatic analyses. You will be provided with a metadata sheet, and are tasked with identifying contrasts that could be interesting in the metadata, with the intention of analysing these in a edgeR/limma based pipeline.
Take a deep breath, and carefully follow the steps outlined below to achieve the intended task.

## STEPS

1. Carefully consider each column, inferring what each column means from its name, and also the values in the column. 
2. Carefully digest the contents of the study summary to help identify points of interest in the study
3. Determine columns that appear to contain data that would be scientifically and biologically interesting to analyse
- Only include analyses that can be made within this dataset only - i.e. does not require samples from additional datasets
- You are permitted to draw comparisons involving multiple different columns
- You are also permitted to include contrasts that would serve as useful controls
4. Consider all possible comparisons that could be made, using only values in the columns specified

## OUTPUT

1. Include the columns that contain biologically relevant and interesting values
2. Describe all comparisons that could be made
3. Evaluate the potential utility of these comparisons

## INPUT

Study summary:
{study_summary}

Metadata:
{meta.to_string()}

"""

chat_completion = client.chat.completions.create(
     messages=[
          {
             "role": "user",
              "content": prompt,
          }
        ],
        model="gpt-4o-mini",
        )
result = chat_completion.choices[0].message.content
print(f"Generated tokens: ", chat_completion.usage.completion_tokens)
print(f"Prompt tokens: ", chat_completion.usage.prompt_tokens)
print(f"Total tokens: ", chat_completion.usage.total_tokens)

print(result)

Generated tokens:  793
Prompt tokens:  8914
Total tokens:  9707
# OUTPUT

## 1. Columns with Biologically Relevant Values

The following columns contain biologically relevant data for the analysis of differential expression in the context of the study:

- **cell line:ch1**: This indicates the specific cell line used, which is essential to interpret results based on model systems (e.g., SU-DHL-4).
- **cell type:ch1**: This denotes the type of cells being studied, in this case, diffuse large B-cell lymphoma cells.
- **genotype:ch1**: Indicates whether the genotype is wild type (WT) or knockout (KO) for the GNAS gene, which is crucial for understanding the genetic influence on treatment response.
- **treatment:ch1**: This provides information about the treatment conditions, including whether cells were treated with DMSO (control) or RGFP966 (a specific intervention), which is important for comparing responses to therapies.

## 2. Descriptions of Possible Comparisons

Given the metadata pr

# Generating the RNAseq analysis code

My focus will now switch to generating RNAseq analysis code. In my head, this would turn out as:
- Identify the Kallisto files
- Import the Kallisto abundances
- Import the metadata
- Create a DGEList object
- Filtering/normalisation
- DEG analysis

Of course, if it ends up pivoting from this, then I can assess the performance. In any case, I think the overall workflow I am aiming for:
- Propose pipeline
- Evaluate the pipeline
- If needed, adjust the proposed pipeline
- Execute pipeline
- Assess results of the pipeline (mainly in terms of stderr/stdout) - do the results make sense?
- I'd also be interested to see how possible it is to integrate QC into this. e.g. I can provide images, but would these actually be interpreted correctly? (and I'm also unsure how I'd even feed these in...).

EDIT - this will be changed to:
- create an RNAseq pipeline with optional parameters
- Have the LLM decide all the parameters for me
- have a correction workflow
- Make assessments based on the generated results - since I know I can do this for images. 

# Small image test

I want to get an idea of the value of inputting images. Both in terms of - can I expect it will perform well, and also how much does it cost? 

Seems that cost is not going to be a big issue. The manual implementation of this works... fine (it can interpret the image), but I'm not sure how viable this will become 

In [158]:
# Examples of image-based QC I might do include interpreting PCA plots, assessing the outcome of filtering as well as normalization.

def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "./ManualRNASeqAnalysis/Filtering.png"

# Getting the base64 string
base64_image = encode_image(image_path)

prompt = f"""

The provided image is intended for use as a QC check in a bioinformatic analysis.

Make an assessment as to whether the QC check is likely to have passed.
"""



headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {openai_api_key}"
}

payload = {
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": prompt
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())

{'id': 'chatcmpl-9yyGKp5IvN3DnfstTddh1kZ1mRcpa', 'object': 'chat.completion', 'created': 1724318356, 'model': 'gpt-4o-mini-2024-07-18', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "Based on the provided image, here's an assessment of the quality control (QC) check in the bioinformatic analysis:\n\n1. **Unfiltered Data** (Left Panel):\n   - The density plot shows multiple peaks and a significant amount of density at lower log-cpm values (around 6). This suggests the presence of low count or noise, which may indicate poor quality data or a large number of low-expressed genes.\n\n2. **Filtered Data** (Right Panel):\n   - The density plot indicates a much sharper and centralized peak around higher log-cpm values (8 to 12+). This suggests that filtering has improved the overall quality of the data by removing low-expressed genes or noise, and focusing on relevant gene expression.\n\n### Conclusion:\nThe QC check likely passed since the filtered data shows a more nor

# Code generation test

In the worked example OpenAI provides of structured outputs, they have an example of solving an equation - I am planning on doing something similar.

I need to think carefully about the implementation - my first test case will be "produce step-by-step pipeline", followed by "what are the expected inputs and outputs" and "generate code" for each step in the pipeline. 

I need to think somewhat carefully about how I implement the metaeata... it is definitely necessary to get the correct code. I think with this in mind, it makes sense to have a separate LLM call for... each step...?

EDIT FROM THE FUTURE - just do a customatizable R script and have the LLM plug in appropriate values.

In [177]:
class Steps(BaseModel):
    name: str = Field(description = "A simple, descriptive name of the step to be performed")
    description: str = Field(description="Description of the step to be performed")
    purpose: str = Field(description="Justification of why step is needed")
    functions: list[str] = Field(description = "R functions required for this step")
    input: list[str] = Field(description = "List of required input files")
    output: list[str] = Field(description = "List of output files")

class Pipeline(BaseModel):
    pipeline: list[Steps]

prompt = f"""

### IDENTITY AND PURPOSE

You are an expert bioinformatician, who meticulously and carefully plans computational RNAseq experiments. 

You have been asked to provide a basic analysis pipeline which can be used to analyse quantification data generated from Kallisto. 

Take a deep breath, and carefully take note of the requirements outlined below to achieve the best possible outcome.

### PIPELINE REQUIREMENTS

1. Keep in mind that the pipeline will have the following available inputs: Kallisto abundance files, sample metadata
2. Your pipeline should be based in R
3. Avoid installing unnecessary packages. Packages such as tidyverse, limma, edgeR, tximport, and DESeq2 are permitted.
4. Your pipeline should incorporate analysis steps, validation steps (e.g. checking inputs/outputs), and also quality control steps
5. The final output should be related to differentially expressed genes
6. The pipeline should be constructed in a logical order, taking into consideration the expected inputs and outputs of each stage
7. The starting point of the pipeline will be the input Kallisto abudance files
8. Specify the functions that will be used at each ste

### OUTPUT

1. Construct a list of sequential steps to perform the RNAseq analysis
2. Each step should achieve a single goal
3. Specify the expected input and output of each stage
4. Specify the functions that will be used at each step, but you do not need to include any parameters (e.g. specify "write_csv" rather than "write_csv(x, file = "test.csv')


"""

chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=Pipeline
)
result = chat_completion.choices[0].message.parsed
result.dict()

{'pipeline': [{'name': 'Load Required Libraries',
   'description': 'Load necessary R packages for the analysis.',
   'purpose': 'To ensure that all required libraries are available for subsequent steps.',
   'functions': ['library', 'suppressPackageStartupMessages'],
   'input': [],
   'output': []},
  {'name': 'Load Metadata',
   'description': 'Read sample metadata and validate its structure.',
   'purpose': 'To gather important information regarding samples, such as conditions and replicates for analysis.',
   'functions': ['read.csv', 'glimpse'],
   'input': ['sample_metadata.csv'],
   'output': ['metadata']},
  {'name': 'Load Kallisto Abundance Files',
   'description': 'Load Kallisto abundance files and prepare for quantification analysis.',
   'purpose': 'To aggregate quantification data from all Kallisto outputs for subsequent analysis.',
   'functions': ['list.files', 'read_tsv'],
   'input': ['kallisto_abundance_files/'],
   'output': ['abundance_data']},
  {'name': 'Combine

In [178]:
# Here is where I will include the initial plan for an evaluation framework - the goal is just to evaluate whether the proposed pipeline looks reasonable

# A quick glance at the JSON makes it seem... ok...?

class StepAssessment(BaseModel):
    name: str = Field(description = "A simple, descriptive name of the step to be performed")
    code_eval: str = Field(description="Evaluation of if the proposed code is likely to work")
    pipeline_eval: str = Field(description="Evaluation of if the proposed step is useful in the context of the pipeline")
    step_eval: Literal["Yes", "No"] = Field(description="Yes if the step is ok, No if the step needs improvements/changes, or was missing from the input pipeline")


class OverallAssessment(BaseModel):
    all_assessments: list[StepAssessment]
    overall_eval: Literal["Yes", "No"] = Field(description="Yes if all steps in the pipeline are ok, No if any step needs improvements/changes, or was missing from the input pipeline")

### TEMPORARY NOTE THAT I WILL HOPEFULLY REMEMBER TO LOOK AT 
# ^^^^^^^^^^^^^^^^^^^^
# .... kek, I forgot what I wanted to make a tempo

prompt = f"""

### IDENTITY AND PURPOSE

You are an expert bioinformatician, who meticulously and carefully plans computational RNAseq experiments. 

You have been asked to evaluate a basic analysis pipeline which can be used to analyse quantification data generated from Kallisto. 

Take a deep breath, and carefully take note of the steps outlined below to achieve the best possible outcome.

### STEPS

1. Evaluate each proposed step in the pipeline, taking into consideration
a) The accuracy of the proposed functions 
- Assess the general structure of the code, and double check the existence of any proposed functions and libraries
- Pay special attention to capitalization, correct use of underscores/periods in functions, and whether or not the function exists
- There may be placeholder values in the proposed code - do not assess these as incorrect
- The original instruction was simply to list the name of the function - this is your main focus, rather than the parameters within the function
b) The necessity and value of each step of the code
c) The simplicity of each code
d) Whether the code follows commonly accepted guidelines
2. If possible improvements can be made in any of the above factors, make these suggestions to individual steps
3. If there appears to be steps missing, generate additional steps. Mark these additional steps as "not" passing the step evaluation.
4. If and only if all steps pass the evaluation, mark the overall pipeline as "Yes".

### OUTPUT

1. Return a JSON of a list of evaluations
2. One evaluation should be made for each step
3. Each step should achieve a single goal

### INPUT PIPELINE

{result.dict()}

"""

pipeline_eval = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=OverallAssessment
)
pipeline_eval.dict()

{'id': 'chatcmpl-9zELI9BtD5ey4VnhNnmTecSRM01tY',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': '{"all_assessments":[{"name":"Load Required Libraries","code_eval":"The functions \'library\' and \'suppressPackageStartupMessages\' exist in R and are commonly used for loading libraries and suppressing messages. Both functions are correctly named and structured.","pipeline_eval":"Very necessary step as loading libraries is crucial for the analysis to proceed without errors.","step_eval":"Yes"},{"name":"Load Metadata","code_eval":"Functions \'read.csv\' and \'glimpse\' exist and are correctly referenced. They perform the tasks of reading a CSV file and glancing at the data structure, respectively.","pipeline_eval":"Necessary for importing sample information which is essential for subsequent analyses.","step_eval":"Yes"},{"name":"Load Kallisto Abundance Files","code_eval":"Functions \'list.files\' and \'read_tsv\' are valid R functions. \

In [190]:
pipeline_eval.choices[0].message.parsed.all_assessments[2]

StepAssessment(name='Load Kallisto Abundance Files', code_eval="Functions 'list.files' and 'read_tsv' are valid R functions. 'list.files' helps in getting the list of files, and 'read_tsv' is suitable for reading tab-separated values from files.", pipeline_eval='Important step since it aggregates quantification data from Kallisto outputs, which is central for the analysis.', step_eval='Yes')

In [198]:
# Your existing model definitions
class Steps(BaseModel):
    name: str = Field(description="A simple, descriptive name of the step to be performed")
    description: str = Field(description="Description of the step to be performed")
    purpose: str = Field(description="Justification of why step is needed")
    functions: list[str] = Field(description="R functions required for this step")
    input: list[str] = Field(description="List of required input files")
    output: list[str] = Field(description="List of output files")

class Pipeline(BaseModel):
    pipeline: list[Steps]

class StepAssessment(BaseModel):
    name: str = Field(description="A simple, descriptive name of the step to be performed")
    code_eval: str = Field(description="Evaluation of if the proposed code is likely to work")
    pipeline_eval: str = Field(description="Evaluation of if the proposed step is useful in the context of the pipeline")
    step_eval: Literal["Yes", "No"] = Field(description="Yes if the step is ok, No if the step needs improvements/changes, or was missing from the input pipeline")

class OverallAssessment(BaseModel):
    all_assessments: list[StepAssessment]
    any_missing_res: Literal["Yes", "No"] = Field(description="A yes/no answer as to whether any critical steps are missing")
    any_missing_justification: str = Field(description="A description of any steps which are missing in this analysis")
    overall_eval: Literal["Yes", "No"] = Field(description="Yes if all steps in the pipeline are ok, No if any step needs improvements/changes, or was missing from the input pipeline")

def create_pipeline(feedback=None, pipeline = None):
    base_prompt = """
    ### IDENTITY AND PURPOSE
    You are an expert bioinformatician, who meticulously and carefully plans computational RNAseq experiments. 
    You have been asked to provide a basic analysis pipeline which can be used to analyse quantification data generated from Kallisto. 
    Take a deep breath, and carefully take note of the requirements outlined below to achieve the best possible outcome.
    ### PIPELINE REQUIREMENTS
    1. Keep in mind that the pipeline will have the following available inputs: Kallisto abundance files, sample metadata
    2. Your pipeline should be based in R
    3. Avoid installing unnecessary packages. Packages such as tidyverse, limma, edgeR, tximport, and DESeq2 are permitted.
    4. Your pipeline should incorporate analysis steps, validation steps (e.g. checking inputs/outputs), and also quality control steps
    5. The final output should be related to differentially expressed genes
    6. The pipeline should be constructed in a logical order, taking into consideration the expected inputs and outputs of each stage
    7. The starting point of the pipeline will be the input Kallisto abudance files
    8. Specify the functions that will be used at each step
    ### OUTPUT
    1. Construct a list of sequential steps to perform the RNAseq analysis
    2. Each step should achieve a single goal
    3. Specify the expected input and output of each stage
    4. Specify the functions that will be used at each step, but you do not need to include any parameters (e.g. specify "write_csv" rather than "write_csv(x, file = "test.csv')
    """
    
    if feedback:
        prompt = base_prompt + f"""
        ### FEEDBACK
        The following feedback was given on this pipeline. 
        
        {pipeline}
        
        Improve the pipeline by integrating this feedback, focussing only on the steps that needed to be improved:
        {feedback}
        """
    else:
        prompt = base_prompt

    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format=Pipeline
    )
    return chat_completion.choices[0].message.parsed

def evaluate_pipeline(pipeline):
    prompt = f"""
    ### IDENTITY AND PURPOSE
    You are an expert bioinformatician, who meticulously and carefully plans computational RNAseq experiments. 
    You have been asked to evaluate a basic analysis pipeline which can be used to analyse quantification data generated from Kallisto. 
    Take a deep breath, and carefully take note of the steps outlined below to achieve the best possible outcome.
    ### STEPS
    1. Evaluate each proposed step in the pipeline, taking into consideration
    a) The accuracy of the proposed functions 
    - Assess the general structure of the code, and double check the existence of any proposed functions and libraries
    - Pay special attention to capitalization, correct use of underscores/periods in functions, and whether or not the function exists
    - Comment ONLY on if the functions are correct and valid - do not make any comments about parameters. For example, if the proposed function specifies "write_csv()", this is ok, as is "write_csv". However, "writeCSV" is incorrect, as this function does not exist.
    b) The necessity and value of each step of the code
    c) The simplicity of each code
    d) Whether the code follows commonly accepted guidelines
    2. If possible improvements can be made in any of the above factors, make these suggestions to individual steps
    3. Specify if there appear to be any missing steps, for example if normalisation or filtering is not performed at all.
    4. If and only if all steps pass the evaluation, mark the overall pipeline as "Yes".
    ### OUTPUT
    1. Return a JSON of a list of evaluations
    2. One evaluation should be made for each step
    3. Each step should achieve a single goal
    ### INPUT PIPELINE
    {pipeline.dict()}
    """
    pipeline_eval = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format=OverallAssessment
    )
    return pipeline_eval

def run_pipeline_creation_and_evaluation(max_iterations=3):
    feedback = None
    for iteration in range(max_iterations):
        print(f"\nIteration {iteration + 1}:")
        
        # Create pipeline
        pipeline = create_pipeline(feedback)
        print("\nProposed Pipeline:")
        print(pipeline.dict())
        
        # Evaluate pipeline
        evaluation = evaluate_pipeline(pipeline)
        print("\nPipeline Evaluation:")
        print(evaluation.dict())
        
        # Check if pipeline passes evaluation
        if evaluation.choices[0].message.parsed.overall_eval == "Yes":
            print("\nPipeline passed evaluation. Returning final pipeline.")
            return pipeline
        elif iteration < max_iterations - 1:
            print("\nPipeline needs improvement. Rewriting with feedback...")
            # Prepare feedback for the next iteration
            feedback = "Evaluation results:\n"
            for assessment in evaluation.choices[0].message.parsed.all_assessments:
                feedback += f"Step: {assessment.name}\n"
                feedback += f"Code evaluation: {assessment.code_eval}\n"
                feedback += f"Pipeline evaluation: {assessment.pipeline_eval}\n"
                feedback += f"Step evaluation: {assessment.step_eval}\n\n"
            feedback += f"Missing steps: {evaluation.choices[0].message.parsed.any_missing_justification}\n\n"
        else:
            print("\nMaximum iterations reached. Returning last proposed pipeline.")
            return pipeline

    return pipeline

final_pipeline = run_pipeline_creation_and_evaluation(max_iterations=3)


Iteration 1:

Proposed Pipeline:
{'pipeline': [{'name': 'Load Required Libraries', 'description': 'Load necessary R packages for the analysis.', 'purpose': 'To ensure that all required functions from specified packages are available for use in the pipeline.', 'functions': ['library', 'require'], 'input': [], 'output': []}, {'name': 'Load Sample Metadata', 'description': 'Import sample metadata for the analysis.', 'purpose': 'Sample metadata is needed to associate sample information with Kallisto abundance files for downstream analysis.', 'functions': ['read.csv'], 'input': ['sample_metadata.csv'], 'output': ['metadata']}, {'name': 'Load Kallisto Abundance Files', 'description': 'Import Kallisto abundance files using tximport.', 'purpose': 'To read the Kallisto output files into R to prepare for differential expression analysis.', 'functions': ['tximport'], 'input': ['kallisto_abundance_files'], 'output': ['txi_kallisto']}, {'name': 'Validate Inputs', 'description': 'Check the integrit

In [199]:
final_pipeline.dict()

{'pipeline': [{'name': 'Load Required Libraries',
   'description': 'Load necessary R packages for the analysis pipeline.',
   'purpose': 'Ensure that all required functions from the specified packages are available for analysis steps.',
   'functions': ['library', 'require'],
   'input': [],
   'output': []},
  {'name': 'Load Sample Metadata',
   'description': 'Import a CSV file containing sample metadata.',
   'purpose': 'Associate sample specific information with Kallisto outputs for analysis context.',
   'functions': ['read.csv'],
   'input': ['sample_metadata.csv'],
   'output': ['metaData']},
  {'name': 'Load Kallisto Abundance Files',
   'description': 'Use the tximport function to load Kallisto abundance files into R.',
   'purpose': 'Prepare the Kallisto output for downstream analysis.',
   'functions': ['tximport'],
   'input': ['kallisto_abundance_files'],
   'output': ['kallisto_data']},
  {'name': 'Validate Inputs',
   'description': 'Check for integrity and compatibilit

The above seems to be functional. 

I think the next steps will be to:
- Develop the pipeline from the scaffold
- Feed in the metadata, specifying the metadata contrasts that I would want to investigate
- Execute the pipeline?

I do anticipate having problems 

# The future is now, thanks to science...

In other words, this is where I will be using the R scripts I have been creating.

I am doing this in two parts - one where I construct the processed DGE object (i.e. after filtering and normalization), and one where I do all the steps after (construction of contrast matrix, DEG analysis. I am anticipating the setup of the contrast matrix to be quite problematic, but all the steps before that point should be quite binary (i.e. there is either a "correct" or "reasonable" value I am expecting). 
- This point also acts as a checkpoint - "did normalization and filtering occur as expected?"
- My initial plan with the contrast matrix was to have the LLM create the code for the contrast matrix, and give it a few examples to work with (just in my head, I don't see this working out too well if I don't give it guidance)
- The other benefit of separating these out is I think the Kallisto importing step is the one that takes the longest, yet the one that's most likely to fail (I think) is the construction of the contrast matrix. By saving the intermediate output, I do think it will save me a very appreciable amount of time.

We will begin by checking that the script works with a known test case (and I will be using this as a troubleshooting case).

In [205]:
command = "Rscript"
script_path = "./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r"
args = [
    "--directory", "/home/myuser/work/data/kallisto_output/",
    "--t2g", "/home/myuser/work/data/kallisto_indices/human/t2g.txt",
    "--metadata", "/home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv",
    "--clean_columns", "genotype_ch1",
    "--group", "genotype_clean",
    "--output", "./TheLLMPlayground/ManualInputs",
    "--geo_sra_mapping", "/home/myuser/work/notebooks/2_extract_data/results.txt"
]

# Combine the command and arguments into a single command string
cmd = f"{command} {script_path} {' '.join(shlex.quote(arg) for arg in args)}"

# Execute the command
try:
    result = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    print("Script executed successfully.")
    print("Output:\n", result.stdout)
except subprocess.CalledProcessError as e:
    print("An error occurred while executing the script.")
    print("Error message:\n", e.stderr)

Script executed successfully.
Output:
 [2024-08-23 07:43:40] Step 1: Loading GEO-SRA mapping data...
[2024-08-23 07:43:40] GEO-SRA mapping data loaded.
[2024-08-23 07:43:40] Step 2: Loading metadata...
[2024-08-23 07:43:40] Original metadata column names:
 [1] "title"                   "geo_accession"          
 [3] "status"                  "submission_date"        
 [5] "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"        
 [9] "organism_ch1"            "characteristics_ch1"    
[11] "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "treatment_protocol_ch1" 
[15] "growth_protocol_ch1"     "molecule_ch1"           
[17] "extract_protocol_ch1"    "extract_protocol_ch1.1" 
[19] "taxid_ch1"               "data_processing"        
[21] "data_processing.1"       "data_processing.2"      
[23] "platform_id"             "contact_name"           
[25] "contact_email"           "contact_phone"          
[27

Great - this doesn't mean too much, but is a good sanity check that things are working. As it happens, ChatGPT is REALLY helpful with just streamlining something that might otherwise take 30 minutes. 

Now for the slightly bigger test - how well will this do when I pass it to the LLM?

There is also the question of how I'm going to handle the metadata files... I think in the end, I would want to implement it such that it considers the input code that was used to generate these files, and then uses this information to determine what the most appropriate value is. 

In my initial test cases (just with prompt development and whatnot), I'll hard code this in - I will make a note that these need to be adjusted as I work on a integrated pipeline.

In [272]:
import subprocess
import shlex
import json

# Prepare the command and script path
command = "Rscript"
script_path = "./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r"

# Define the JSON string for --clean_columns and --mutate_columns
clean_columns = json.dumps({"genotype_ch1": "genotype_clean"})
mutate_columns = json.dumps({"new_col1": "genotype_clean+treatment_ch1"})

# Define the argument list with the updated JSON strings
args = [
    "--directory", "/home/myuser/work/data/kallisto_output/",
    "--t2g", "/home/myuser/work/data/kallisto_indices/human/t2g.txt",
    "--metadata", "/home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv",
    "--clean_columns", clean_columns,
    "--merge_columns", mutate_columns,  # Optional, include only if needed
    "--group", "genotype_clean",
    "--output", "./TheLLMPlayground/ManualInputs",
    "--geo_sra_mapping", "/home/myuser/work/notebooks/2_extract_data/results.txt"
]

# Combine the command and arguments into a single command string
cmd = f"{command} {script_path} {' '.join(shlex.quote(arg) for arg in args)}"

# Execute the command
try:
    result = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    print("Script executed successfully.")
    print("Output:\n", result.stdout)
except subprocess.CalledProcessError as e:
    print("An error occurred while executing the script.")
    print("Error message:\n", e.stderr)

Script executed successfully.
Output:
 [2024-08-26 06:09:08] Step 1: Loading GEO-SRA mapping data...
[2024-08-26 06:09:08] GEO-SRA mapping data loaded.
[2024-08-26 06:09:08] Step 2: Loading metadata...
[2024-08-26 06:09:08] Original metadata column names:
 [1] "title"                   "geo_accession"          
 [3] "status"                  "submission_date"        
 [5] "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"        
 [9] "organism_ch1"            "characteristics_ch1"    
[11] "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "treatment_protocol_ch1" 
[15] "growth_protocol_ch1"     "molecule_ch1"           
[17] "extract_protocol_ch1"    "extract_protocol_ch1.1" 
[19] "taxid_ch1"               "data_processing"        
[21] "data_processing.1"       "data_processing.2"      
[23] "platform_id"             "contact_name"           
[25] "contact_email"           "contact_phone"          
[27

In [274]:
r_script_file = "./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r"

# Read the R script file into a string
with open(r_script_file, 'r') as file:
    RNAseq_Rscript = file.read()

RNAseq_Rscript



In [334]:
class FunctionInputs(BaseModel):
    kallisto_directory: str = Field(description = "Name of directory containing Kallisto abundances")
    t2g: str = Field(description = "Path to the text-to-gene (t2g.txt) file")
    metadata: str = Field(description = "Path to sample metadata")
    columns_to_clean: str = Field(description = "Columns where values should be cleaned")
    filter_group: str = Field(description = "Name of column that should be used to determine experimental sample groups")
    output_directory: str = Field(description = "Name of output directory")
    geo_sra_mapping: str = Field(description = "Path to file linking sample GEO accessions to SRA IDs")
    merge_columns: str = Field(description = "Optional: Columns that should be joined using the dplyr::mutate function")
    justifications: str = Field(description = "Justifications for selection of all command-line arguments")

RNAseq_Inputs_Prompt = f"""

### IDENTITY AND PURPOSE

You are an expert bioinformatician who routinely performs standardized RNAseq experiments to find differentially expressed genes (DEGs) in a given experiment. 

You will be asked to perform part of an RNAseq analysis on a given dataset. The RNAseq analysis will begin from quantification files produced by Kallisto, and finish with the production of a DGEList object. The RNAseq analysis pipeline will be provided to you - note that this is an R script, with command line arguments. Your task will be to identify the most appropriate parameters for the command line arguments.

Carefully follow the steps outlined below to achieve the best possible outcome.

### STEPS

1. Carefully digest the R script that you will be using. 
- Take note of the optional parameters
- Take note of the functions which are being used for each parameter to ascertain how each parameter is being used
- Take note of the description of each parameter to ascertain the necessary format required for the parameter
- Take careful note of packages such as "janitor" and other value clean-ups, which may change what the correct value should be
- Note that this R script has been carefully validated: there are no errors in the script
2. Carefully analyse the provided metadata as well as the study summary
- Do not assume any additional comparisons which are not explicitly provided
- Note that the study summary should be used to 
- Keep in mind this recommendation: "The filtering should be based on the grouping factors or treatment factors that will be involved in the differential expression test"
3. After understanding the script and metadata, state the appropriate parameters that should be included.
- Note that mutate_columns is optional. If no columns should be joined, then specify NULL
4. 

### OUTPUT 

- Return the required command line arguments and justification for ALL included arugments

### INPUT AND OTHER INFORMATION

This is the R script:

{RNAseq_Rscript}

This is the dataset metadata:

{meta.to_string}

This is the study summary:

{study_summary}

Use the following information as well:

The Kallisto abundances can be found at /home/myuser/work/data/kallisto_output/
The geo_sra_mapping file can be found at /home/myuser/work/notebooks/2_extract_data/results.txt
The output should be ./TheLLMPlayground/LLMInputs
The metadata can be found at /home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv
The text-to-gene file can be found at /home/myuser/work/data/kallisto_indices/human/t2g.txt

"""
chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": RNAseq_Inputs_Prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=FunctionInputs,
)

print(f"Generated tokens: ", chat_completion.usage.completion_tokens)
print(f"Prompt tokens: ", chat_completion.usage.prompt_tokens)
print(f"Total tokens: ", chat_completion.usage.total_tokens)

result = chat_completion.choices[0].message.parsed
result

Generated tokens:  294
Prompt tokens:  5813
Total tokens:  6107


FunctionInputs(kallisto_directory='/home/myuser/work/data/kallisto_output/', t2g='/home/myuser/work/data/kallisto_indices/human/t2g.txt', metadata='/home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv', columns_to_clean='{"treatment:ch1":"treatment_clean"}', filter_group='genotype_clean', output_directory='./TheLLMPlayground/LLMInputs', geo_sra_mapping='/home/myuser/work/notebooks/2_extract_data/results.txt', merge_columns='{"genotype_treatment":"genotype_clean+treatment_clean"}', justifications="The directory containing Kallisto abundance files is set to the path where they are produced. The t2g file is necessary for mapping transcripts to genes and is located correctly. The metadata file contains essential information about samples and will be used directly. The treatment column will be cleaned to create a new 'treatment_clean' column which will help in the differential expression analysis. The filter group 'genotype_clean' is appropriate for filtering b

In [279]:
def extract_arguments(result):
    # Define the argument mapping
    arg_map = {
        "kallisto_directory": "--directory",
        "t2g": "--t2g",
        "metadata": "--metadata",
        "columns_to_clean": "--clean_columns",
        "filter_group": "--group",
        "output_directory": "--output",
        "geo_sra_mapping": "--geo_sra_mapping",
        "merge_columns": "--merge_columns"
    }
    
    # Extract values from the result and format them into args
    args = []
    for attr, arg in arg_map.items():
        value = getattr(result, attr, None)
        if value:  # Only add if the attribute is not None
            args.extend([arg, value])
    
    return args

# Example usage:
args = extract_arguments(result)
args

['--directory',
 '/home/myuser/work/data/kallisto_output/',
 '--t2g',
 '/home/myuser/work/data/kallisto_indices/human/t2g.txt',
 '--metadata',
 '/home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv',
 '--clean_columns',
 '{"genotype:ch1":"genotype_clean","treatment:ch1":"treatment_clean"}',
 '--group',
 'genotype_clean',
 '--output',
 './TheLLMPlayground/LLMInputs',
 '--geo_sra_mapping',
 '/home/myuser/work/notebooks/2_extract_data/results.txt']

In [284]:
command = "Rscript"
script_path = "./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r"

def extract_arguments(result):
    # Define the argument mapping
    arg_map = {
        "kallisto_directory": "--directory",
        "t2g": "--t2g",
        "metadata": "--metadata",
        "columns_to_clean": "--clean_columns",
        "filter_group": "--group",
        "output_directory": "--output",
        "geo_sra_mapping": "--geo_sra_mapping",
        "merge_columns": "--merge_columns"
    }
    
    # Extract values from the result and format them into args
    args = []
    for attr, arg in arg_map.items():
        value = getattr(result, attr, None)
        if value:  # Only add if the attribute is not None
            args.extend([arg, value])
    
    return args

# Example usage:
args = extract_arguments(result)
args

# Combine the command and arguments into a single command string
cmd = f"{command} {script_path} {' '.join(shlex.quote(arg) for arg in args)}"
print("Executing command: ", cmd)
# Execute the command
try:
    result = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    print("Script executed successfully.")
    print("Output:\n", result.stdout)
except subprocess.CalledProcessError as e:
    print("An error occurred while executing the script.")
    print("Error message:\n", e.stderr)

Executing command:  Rscript ./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r --directory /home/myuser/work/data/kallisto_output/ --t2g /home/myuser/work/data/kallisto_indices/human/t2g.txt --metadata /home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv --clean_columns '{"genotype:ch1":"genotype_clean","treatment:ch1":"treatment_clean"}' --group genotype_clean --output ./TheLLMPlayground/LLMInputs --geo_sra_mapping /home/myuser/work/notebooks/2_extract_data/results.txt
An error occurred while executing the script.
Error message:
 Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): sample_ID, experiment, SRA_ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 12 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (41)

In [373]:
import subprocess
import shlex
from pydantic import BaseModel, Field
from openai import OpenAI
import time
import sys

class FunctionInputs(BaseModel):
    kallisto_directory: str = Field(description="Name of directory containing Kallisto abundances")
    t2g: str = Field(description="Path to the text-to-gene (t2g.txt) file")
    metadata: str = Field(description="Path to sample metadata")
    columns_to_clean: str = Field(description="Columns where values should be cleaned. Note that the column names supplied here should refer to the cleaned column names, not the original column names")
    filter_group: str = Field(description="Name of column that should be used to determine experimental sample groups. Can be a column created after merging.")
    output_directory: str = Field(description="Name of output directory")
    geo_sra_mapping: str = Field(description="Path to file linking sample GEO accessions to SRA IDs")
    merge_columns: str = Field(description="Optional: Columns that should be joined using the dplyr::mutate function. If included, follow dictionary format: e.g., '{\"new_col1\":\"col1+col2\",\"new_col2\":\"col3+col4\"}'")
    justifications: str = Field(description="Justifications for selection of all command-line arguments")

def run_rnaseq_analysis(RNAseq_Rscript, meta, study_summary, max_retries=3):
    def generate_prompt(error_message=None):
        prompt = f"""
        ### IDENTITY AND PURPOSE
        You are an expert bioinformatician who routinely performs standardized RNAseq experiments to find differentially expressed genes (DEGs) in a given experiment. 
        You will be asked to perform part of an RNAseq analysis on a given dataset. The RNAseq analysis will begin from quantification files produced by Kallisto, and finish with the production of a DGEList object. The RNAseq analysis pipeline will be provided to you - note that this is an R script, with command line arguments. Your task will be to identify the most appropriate parameters for the command line arguments.
        Carefully follow the steps outlined below to achieve the best possible outcome.
        ### STEPS
        1. Carefully digest the R script that you will be using. 
        - Take note of the optional parameters
        - Take note of the functions which are being used for each parameter to ascertain how each parameter is being used
        - Take note of the description of each parameter to ascertain the necessary format required for the parameter
        - Take careful note of packages such as "janitor" and other value clean-ups, which may change what the correct value should be. For example, if the original column name is "genotype:ch1", this will likely be changed to "genotype_ch1"
        - Note that this R script has been carefully validated: there are no errors in the script
        - The columns to clean should be used for any column that may be used for identification of DEGs
        - The columns to merged should be used when identification of DEGs makes use of multiple groups simulataneously
            - For example, if we are interested in "Treatment X vs Y in Genotype A", then the Treatment and Genotype columns should be merged
            - To identify where this is necessary, refer to the "Comparisons to make" in the INPUT AND OTHER INFORMATION section
        2. Carefully analyse the provided metadata as well as the study summary
        - Do not assume any additional comparisons which are not explicitly provided
        - Note that the study summary should be used to 
        - Keep in mind this recommendation: "The filtering should be based on the grouping factors or treatment factors that will be involved in the differential expression test"
        3. After understanding the script and metadata, state the appropriate parameters that should be included.
        - Note that mutate_columns is optional. If no columns should be joined, then specify NULL
        4. If you have been provided error messages, take into consideration these error messages 
        ### OUTPUT 
        - Return the required command line arguments and justification for ALL included arguments
        - If there was previous error messages, describe how you have incorporated this information in your response as part of the justificaiton 
        - If no error message was provided, explicitly indicate this as well.
        ### INPUT AND OTHER INFORMATION
        This is the R script:
        {RNAseq_Rscript}
        This is the dataset metadata:
        {meta.to_string()}
        This is the study summary:
        {study_summary}
        These are the comparisons to make:
        {filtered_contrasts}
        Use the following information as well:
        The Kallisto abundances can be found at /home/myuser/work/data/kallisto_output/
        The geo_sra_mapping file can be found at /home/myuser/work/notebooks/2_extract_data/results.txt
        The output should be ./TheLLMPlayground/LLMInputs
        The metadata can be found at /home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv
        The text-to-gene file can be found at /home/myuser/work/data/kallisto_indices/human/t2g.txt
        """
        if error_message:
            prompt += f"""\n### PREVIOUS ERROR\nThe previous attempt resulted in the following error. Please adjust your recommendations accordingly:\n{error_message}
            
            The command used to produce this error message was {cmd}"""
        return prompt

    def extract_arguments(result):
        arg_map = {
            "kallisto_directory": "--directory",
            "t2g": "--t2g",
            "metadata": "--metadata",
            "columns_to_clean": "--clean_columns",
            "filter_group": "--group",
            "output_directory": "--output",
            "geo_sra_mapping": "--geo_sra_mapping",
            "merge_columns": "--merge_columns"
        }
        args = []
        for attr, arg in arg_map.items():
            value = getattr(result, attr, None)
            if value:
                args.extend([arg, value])
        return args

    command = "Rscript"
    script_path = "./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r"

    error_message = None  # Initialize error_message to None

    for attempt in range(max_retries):
        try:
            chat_completion = client.beta.chat.completions.parse(
                messages=[
                    {
                        "role": "user",
                        "content": generate_prompt(error_message),
                    }
                ],
                model="gpt-4o-mini",
                response_format=FunctionInputs,
            )
            result = chat_completion.choices[0].message.parsed
            args = extract_arguments(result)
            cmd = f"{command} {script_path} {' '.join(shlex.quote(arg) for arg in args)}"
            
            # Get the output directory and set the log file path in the "logs" subdirectory
            output_directory = result.output_directory
            log_dir = os.path.join(output_directory, "logs")
            os.makedirs(log_dir, exist_ok=True)
            log_file_path = os.path.join(log_dir, f"DGEListCreation_log_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt")

            # Log the generated tokens
            token_info = (f"Generated tokens: {chat_completion.usage.completion_tokens}\n"
                          f"Prompt tokens: {chat_completion.usage.prompt_tokens}\n"
                          f"Total tokens: {chat_completion.usage.total_tokens}\n")

            print(token_info)
            # Write token info to the log file
            with open(log_file_path, 'a') as log_file:
                log_file.write(token_info)
                log_file.write("="*50 + "\n\n")
            
            cmd = f"{command} {script_path} {' '.join(shlex.quote(arg) for arg in args)}"
            
            print(f"Attempt {attempt + 1}: Executing command: {cmd}")
            print(f"Justification: {chat_completion.choices[0].message.parsed.justifications}")
            
            # Execute the command with real-time output capture
            process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

            # Capture and store stdout and stderr
            stdout, stderr = process.communicate()
            
            # Print stdout and stderr
            sys.stdout.write(stdout)
            sys.stderr.write(stderr)
            
            # Log the command, stdout, and stderr to the file
            with open(log_file_path, 'a') as log_file:
                log_file.write(f"Attempt {attempt + 1}\n")
                log_file.write(f"Command executed:\n{cmd}\n")
                log_file.write(f"STDOUT:\n{stdout}\n")
                log_file.write(f"STDERR:\n{stderr}\n")
                log_file.write("="*50 + "\n\n")
                
            if process.returncode == 0:
                print("Script executed successfully.")
                return process
            else:
                # Create a combined error message
                error_message = f"STDOUT:\n{stdout}\nSTDERR:\n{stderr}"
                raise subprocess.CalledProcessError(process.returncode, cmd)

        except subprocess.CalledProcessError as e:
            print(f"Attempt {attempt + 1} failed.")
            if attempt < max_retries - 1:
                print("Retrying with error information...")
                time.sleep(2)  # Add a short delay before retrying
            else:
                print("Max retries reached. Analysis failed.")
                raise

    return None

result = run_rnaseq_analysis(RNAseq_Rscript, meta, study_summary)

Generated tokens: 449
Prompt tokens: 12394
Total tokens: 12843

Attempt 1: Executing command: Rscript ./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r --directory /home/myuser/work/data/kallisto_output/ --t2g /home/myuser/work/data/kallisto_indices/human/t2g.txt --metadata /home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv --clean_columns '{"genotype:ch1":"genotype_clean","treatment:ch1":"treatment_clean"}' --group genotype_clean --output ./TheLLMPlayground/LLMInputs --geo_sra_mapping /home/myuser/work/notebooks/2_extract_data/results.txt --merge_columns '{"genotype_treatment":"genotype_clean+treatment_clean"}'
Justification: 1. **Kallisto Directory**: Specified the path to location of Kallisto output to find abundance files.
2. **T2G File**: Required for mapping transcripts to genes, crucial for translating abundance data into meaningful gene expression data. 
3. **Metadata**: Necessary to provide sample descriptions, which identify conditions and treatm

Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): sample_ID, experiment, SRA_ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 12 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (41): title, geo_accession, status, submission_date, last_update_date, t...
dbl  (3): channel_count, taxid_ch1, data_row_count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in `mutate()`:
ℹ In argument: `genotype_clean = str_replace_all(`genotype:ch1`,
  "[^a-zA-Z0-9_]", "")`.
Caused by error in `vctrs::vec_size_common()`:
! object 'genotype:ch1' not found
Backtrace:
     ▆
  1. ├─meta %>% ...
  2. ├─dplyr::mutate(...)
  3. ├─dplyr:::mutate.data.fra

Generated tokens: 313
Prompt tokens: 13909
Total tokens: 14222

Attempt 2: Executing command: Rscript ./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r --directory /home/myuser/work/data/kallisto_output/ --t2g /home/myuser/work/data/kallisto_indices/human/t2g.txt --metadata /home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv --clean_columns '{"genotype:ch1":"genotype_clean","treatment:ch1":"treatment_clean"}' --group genotype_clean --output ./TheLLMPlayground/LLMInputs --geo_sra_mapping /home/myuser/work/notebooks/2_extract_data/results.txt --merge_columns '{"genotype_treatment":"genotype_clean+treatment_clean"}'
Justification: The parameters are chosen based on the need to clean and prepare the metadata for differential gene expression analysis of the dataset related to HDAC3 inhibition in lymphoma. The columns designated for cleaning are critical identifiers for filtering DEGs (differentially expressed genes). The 'genotype_clean' will be used for groupin

Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): sample_ID, experiment, SRA_ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 12 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (41): title, geo_accession, status, submission_date, last_update_date, t...
dbl  (3): channel_count, taxid_ch1, data_row_count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in `mutate()`:
ℹ In argument: `genotype_clean = str_replace_all(`genotype:ch1`,
  "[^a-zA-Z0-9_]", "")`.
Caused by error in `vctrs::vec_size_common()`:
! object 'genotype:ch1' not found
Backtrace:
     ▆
  1. ├─meta %>% ...
  2. ├─dplyr::mutate(...)
  3. ├─dplyr:::mutate.data.fra

Generated tokens: 329
Prompt tokens: 13909
Total tokens: 14238

Attempt 3: Executing command: Rscript ./ManualRNASeqAnalysis/RNASeq_DGEObjectCreate.r --directory /home/myuser/work/data/kallisto_output/ --t2g /home/myuser/work/data/kallisto_indices/human/t2g.txt --metadata /home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv --clean_columns '{"genotype_ch1":"genotype_clean","treatment_ch1":"treatment_clean"}' --group genotype_clean --output ./TheLLMPlayground/LLMInputs --geo_sra_mapping /home/myuser/work/notebooks/2_extract_data/results.txt --merge_columns '{"genotype_treatment":"genotype_clean+treatment_clean"}'
Justification: The "kallisto_directory" points to the correct location of the Kallisto output files. The "t2g" path is where the text-to-gene mapping file is stored, necessary for gene quantification. The "metadata" file path provides the relevant sample metadata needed for the DGE analysis. The columns to clean were adjusted to reflect the cleaned

Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): sample_ID, experiment, SRA_ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 12 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (41): title, geo_accession, status, submission_date, last_update_date, t...
dbl  (3): channel_count, taxid_ch1, data_row_count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 227665 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): X1, X2, X3, X4, X5, X8
dbl (2): X6, X7

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_

# Personal notes - construction of DGEList object

This is working well - at this point my main points to worry about are:
- that I won't know if the merging of columns has "worked" until I do the next step (it might be the case that I can wait for the next part to really do it)
- to do a bit of QC (i.e. the image test)
- it seems to pretty reliably be able to get it correct after one incorrect attempt - I think this is because it can't really predict the column names accurately the first time around, but seeing the output and being able to identify the correct column names is helpful.

The next step I think will be a quick bit of QC - not me of course! But the LLM. I want to see how well it performs...

# Image interpretation as QC

...hrmmm. Now that I'm thinking about this, I surely would want to generate a PCA plot as well... I will need to incorporate this into the original R script...

# Identification of DEGs

- Ok I am beginning to realise I have sort of dug myself something of a hole.
- To identify DEGs, I will want to identify the contrasts of interest.
- To have any chance of correctly identifying these contrasts of interest, I will need to get the relevant column names
- Although in my "identify interesting contrasts" component of the code I do specify the values/columns that are of interest, these values would have been changed after cleaning the names of columns as well as the values in some columns. Hence, there is a chance that these values do not match up any longer (and now that I'm thinking about it, I might need to further change the code, since there are other special characters I need to account for...)

Hmm. I do think it is appropriate to generate new columns - this is necessary for when I perform the filtering. However, I think the value cleaning is something I should move towards the next script (since the troubleshooting can occur there...)

As well - generating this next script may be a bit complicated... I am going to need to have the LLM write some scripts mainly for defining the contrasts. 

In [None]:
# Here I will develop a framework to construct the code for contrast matrices

# I don't know how much I want to structure the output? Or even this workflow in general...

# I think it will make sense to check the "possible" values and columns I can pick from...

DGE_path = "/home/myuser/work/notebooks/3_analyse_data/TheLLMPlayground/LLMInputs/DGE.RDS" # I do need to work out how I'm going to be able to specify this in my workflow... might be the case that I need to save the location of the output of the previous step...
r_script_path = "./ManualRNASeqAnalysis/RNASeq_GetCleanedMetadata.r"
command = ["Rscript", r_script_path, "--path", DGE_path]

try:
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    
    # Print the output of the R script
    print(result.stdout)
except subprocess.CalledProcessError as e:
    print("An error occurred while running the R script:")
    print(e.stderr)

In [378]:
DGE_path = "/home/myuser/work/notebooks/3_analyse_data/TheLLMPlayground/LLMInputs/DGE.RDS"
r_script_path = "./ManualRNASeqAnalysis/RNASeq_GetCleanedMetadata.r"
command = ["Rscript", r_script_path, "--path", DGE_path]

try:
    cleaned_meta = subprocess.run(command, capture_output=True, text=True, check=True)
    
    # Print the output of the R script
    print(cleaned_meta.stdout)
except subprocess.CalledProcessError as e:
    print("An error occurred while running the R script:")
    print(e.stderr)

Metadata columns and their unique values:
Column: characteristics_ch1_2
Values: genotype: WT, genotype: GNAS knockout

Column: characteristics_ch1_3
Values: treatment: DMSO, treatment: RGFP966 (5 µM)

Column: genotype_ch1
Values: WT, GNAS knockout

Column: treatment_ch1
Values: DMSO, RGFP966 (5 µM)

Column: genotype_clean
Values: WT, GNASknockout

Column: treatment_clean
Values: DMSO, RGFP9665M

Column: genotype_treatment
Values: WT_DMSO, WT_RGFP9665M, GNASknockout_DMSO, GNASknockout_RGFP9665M




This output looks good - the purpose of this is so that the LLM will know exactly what the possible values to use are. I anticipate this to be necessary given that the column names and values to be used would have changed because of value cleaning. As a reminder, this is absolutely required, because the contrast matrix cannot be constructed correctly if the names of values inside the column are invalid.

With that said, let's now attempt to construct a contrast matrix. I will just begin with some LLM prompts to see if I can get the contrast matrix...

(And as a side note, I do need to start using better names, not just defaulting to "result")

In [379]:
cleaned_meta.stdout

'Metadata columns and their unique values:\nColumn: characteristics_ch1_2\nValues: genotype: WT, genotype: GNAS knockout\n\nColumn: characteristics_ch1_3\nValues: treatment: DMSO, treatment: RGFP966 (5 µM)\n\nColumn: genotype_ch1\nValues: WT, GNAS knockout\n\nColumn: treatment_ch1\nValues: DMSO, RGFP966 (5 µM)\n\nColumn: genotype_clean\nValues: WT, GNASknockout\n\nColumn: treatment_clean\nValues: DMSO, RGFP9665M\n\nColumn: genotype_treatment\nValues: WT_DMSO, WT_RGFP9665M, GNASknockout_DMSO, GNASknockout_RGFP9665M\n\n'

In [381]:
class ContrastStructure(BaseModel):
    name: str = Field(description = "Name for contrast")
    comparison: str = Field(description = "Specific values to make up the contrast")

class DEGFunctionInputs(BaseModel):
    column: str = Field(description = "Columnm in metadata to use for design matrix")
    comparisons: list[str] = Field(description = "Specific contrasts to make")
    

RNAseq_Inputs_Prompt = f"""

### IDENTITY AND PURPOSE

You are an experienced bioinformatician who expertly performs RNAseq experiments, and identifies differentially expressed genes by carefully constructing scientifically accurate comparisons.

Your task will be to construct R code that will allow us to perform a scientifically sound test for differential gene expression. You will have access to the following:
- A list of POSSIBLY relevant columns and values in the metadata
- A list of comparisons to be made

Take a deep breath, and carefully follow the steps below to produce the most accurate possible output.

### STEPS

1. Carefully consider the columns and unique values contained within the metadata
- Your final response will refer to some of these EXACTLY
- Take particular note of the relationship between columns - for example, some columns will contain values which are merged from two other columns
2. Carefully consider the comparisons that are to be made.
- This will be supplemented by "column names" and "values" to be used - note that these will be similar to the values in the metadata. However, the metadata has been cleaned, so the values will not match up completely
- You should make attempts to determine which values and columns in the suggested comparison align with which columns and values in the metadata
3. Identify a single column that will be used for a design matrix. You should select this based on:
- values in the column NOT containing characters such as spaces, slashes, brackets, and other similar characters. Note that underscores are permitted.
- the column containing values that are specific. For example, if column A contains values "a" and "b", column B contains values "c" and "d", whereas column C contains values "a_c" and "b_d", column C is most desirable
4. Using values in this column ONLY, specify comparisons that should be used to create the contrast. You should base these off the list of comparisons to be made. Use the below as guidelines for formatting, which is based on the makeContrasts function in edgeR/limma:


# EXAMPLE POSSIBLE VALUES: Geno1_Treat1, Geno1_Treat2, Geno2_Treat1, Geno2_Treat2 

# CORRECT EXAMPLE 1 - Comparing two genotypes, regardless of treatment option

name: Geno1-Geno2
comparison: (Geno1_Treat1 + Geno1_Treat2) - (Geno2_Treat1 + Geno2_Treat2)

Note the use of "+" to "combine" the different values corresponding to a single genotype. Note also that the name is a simple, but descriptive comparison.

# CORRECT EXAMPLE 2 - Comparing treatment options in a single genotype

name:Geno1_Treat1-Treat2
comparison: Geno1_Treat1 - Geno1_Treat2

# INCORRECT EXAMPLE - Comparising two genotypes, regardless of treatment option

name: Geno1-Geno2
comparison: Geno1 - Geno2

This is incorrect as Geno1 is not a valid value in the possible values. CORRECT EXAMPLE 1 shows how to achieve an equivalent result using a more detailed column.

5. If any comparison would be a direct repeat of another comparison, exclude this. 
- Note that comparisons that differ only by name are also classed as repeats

### OUTPUT

1. State the column to be used in the design matrix
2. For each comparison, assign a short, descriptive name, as well as stating the specific comaprison to be made.
- Follow the recommended examples EXACTLY

### INPUT

Metadata:

{cleaned_meta.stdout}

Comparisons:

{filtered_contrasts}

"""
chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": RNAseq_Inputs_Prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=DEGFunctionInputs,
)

print(f"Generated tokens: ", chat_completion.usage.completion_tokens)
print(f"Prompt tokens: ", chat_completion.usage.prompt_tokens)
print(f"Total tokens: ", chat_completion.usage.total_tokens)

DEG_inputs = chat_completion.choices[0].message.parsed
DEG_inputs

Generated tokens:  126
Prompt tokens:  1549
Total tokens:  1675


DEGFunctionInputs(column='genotype_treatment', comparisons=['{"name":"RGFP966-DMSO in WT","comparison":"(WT_RGFP9665M)-(WT_DMSO)"}', '{"name":"GNASknockout-DMSO vs WT-DMSO","comparison":"(GNASknockout_DMSO)-(WT_DMSO)"}', '{"name":"GNASknockout-RGFP966 vs WT-RGFP966","comparison":"(GNASknockout_RGFP9665M)-(WT_RGFP9665M)"}'])

In [395]:
DEG_inputs.column

'genotype_treatment'

In [403]:
DGE_path = "/home/myuser/work/notebooks/3_analyse_data/TheLLMPlayground/LLMInputs/DGE.RDS"
r_script_path = "./ManualRNASeqAnalysis/RNASeq_DEGIdentification.r"
command = ["Rscript", r_script_path, "--DGE", DGE_path, "--column", str(DEG_inputs.column), "--comparisons", str(DEG_inputs.comparisons)]

try:
    DEGs = subprocess.run(command, capture_output=True, text=True, check=True)
    
    # Print the output of the R script
    print(DEGs.stdout)
except subprocess.CalledProcessError as e:
    print("An error occurred while running the R script:")
    print(e.stderr)

DEGs

An error occurred while running the R script:
Error in makeContrasts(contrasts = setNames(lapply(comparisons, function(cmp) cmp$comparison),  : 
  The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: get(opt$column)GNASknockout_DMSO,get(opt$column)GNASknockout_RGFP9665M,get(opt$column)WT_DMSO,get(opt$column)WT_RGFP9665M
Execution halted



NameError: name 'DEGs' is not defined

In [410]:
import json
import re
import tempfile
import subprocess

# Example input
deg_function_input = DEG_inputs

# Extract column and comparisons using regular expressions
column_match = re.search(r"column='(.*?)'", deg_function_input)
comparisons_match = re.search(r"comparisons=\[(.*?)\]", deg_function_input)

if column_match and comparisons_match:
    column = column_match.group(1)
    comparisons_str = comparisons_match.group(1)

    # Clean and convert the comparisons string to a list of dictionaries
    clean_comparisons_str = comparisons_str.replace("'", "\"")
    comparisons = json.loads(f"[{clean_comparisons_str}]")

    # Generate the R script sub-section for contrast matrix creation
    contrast_snippet = "\n".join([
        f'"{cmp["name"]}" = "{cmp["comparison"]}"' 
        for cmp in comparisons
    ])

TypeError: expected string or bytes-like object, got 'DEGFunctionInputs'

In [439]:

# Parse the list to extract names and comparisons
contrast_list = []
for comparison_str in DEG_inputs.comparisons:
    comparison_dict = json.loads(comparison_str)
    name = comparison_dict['name']
    contrast = comparison_dict['comparison']
    contrast_list.append(f'"{name}" = "{contrast}"')

contrast_list

contrast_snippet = ",  ".join(contrast_list)

contrast_snippet

column = DEG_inputs.column

In [451]:
r_script = f"""
suppressMessages({{
  suppressWarnings({{
    library(tidyverse)
    library(edgeR)
    library(limma)
    library(tximport)
    library(DESeq2)
    library(janitor)
    library(viridis)
    library(optparse)
  }})
}})

# Define options for optparse
option_list <- list(
  make_option(c("--DGE"), type="character", default=NULL, help="Path to the DGE object (RDS file)", metavar="character")
)

opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)

# Read in the DGEList object
DGE <- readRDS(opt$DGE)

# Construct model matrix
design <- model.matrix(data = DGE$samples, ~0 + {column})
colnames(design) <- gsub("{column}", "", colnames(design))

# Construct contrast matrix
contrast.matrix <- makeContrasts(
  {contrast_snippet},
  levels = colnames(design)
)

# Perform the DEG analysis
v <- voom(DGE, design)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrast.matrix)
efit <- eBayes(vfit)

# Record results of DEG analysis
LFC.summary <- lapply(colnames(contrast.matrix), function(x){{
    topTable(efit, coef = x, number = Inf) %>% arrange(adj.P.Val)
}})

# Example output: print the top results for the first contrast
print(LFC.summary[[1]])
"""

In [452]:
r_script



In [453]:
import tempfile

DGE_path = "/home/myuser/work/notebooks/3_analyse_data/TheLLMPlayground/LLMInputs/DGE.RDS"

# Write the R script to a temporary file
with tempfile.NamedTemporaryFile(suffix=".R", delete=False) as temp_r_script:
    temp_r_script.write(r_script.encode())
    r_script_path = temp_r_script.name

command = ["Rscript", r_script_path, "--DGE", DGE_path]

# Execute the R script
result = subprocess.run(command, capture_output=True, text=True)

# Print the output
print(result.stdout)
print(result.stderr)

                  logFC   AveExpr             t     P.Value adj.P.Val         B
ACTB      -8.217892e-01 13.177774 -4.1922216282 0.000834971 0.5185170 -3.810379
MT-CO1    -2.635802e-01 14.551496 -1.9846800767 0.066363426 0.9675515 -4.165956
EEF2      -3.116307e-01 14.538422 -1.8486825074 0.084922243 0.9675515 -4.227588
ACTG1     -7.257701e-01 12.477332 -2.6195658584 0.019699678 0.9675515 -4.313086
RPL13A    -4.016216e-01 12.459716 -2.2447128259 0.040797024 0.9675515 -4.382749
TPT1       5.509206e-01 12.199895  2.4632386793 0.026771500 0.9675515 -4.392231
DDX17     -1.052839e+00 11.446916 -3.1015472918 0.007513120 0.9675515 -4.417341
EEF1A1     1.655618e-01 14.518635  1.4467485559 0.169161841 0.9675515 -4.455416
BCAT1     -2.249111e+00 11.551306 -3.3834087799 0.004248771 0.9675515 -4.468628
TUBB      -7.123157e-01 11.790165 -2.1189430558 0.051737877 0.9675515 -4.494431
KMT2C     -9.923781e-01 10.980094 -2.5297857382 0.023505684 0.9675515 -4.509103
NEAT1     -5.254182e-01 12.346720 -1.508

# Personal evaluation of running RNAseq workflow

Ok this works - I will leave it here for today, but I will need to save the output of all contrasts.

I do find the workflow rather messy and convoluted. It works, but I really need to consider how I might be able to simplify things. Part of the challenge is in where I define quotation marks, and where I do not need quotation marks.