# Introduction

This notebook contains my experiments with (and hopefully the final implementation of) using LLMs to analyse Kallisto quantification data.

The ideal final outcome is a workflow that takes in my Kallisto quantification files and performs DEG analysis. Along the way, the exploratory questions I'm interested in are:
- How well does the LLM produce a working R pipeline? (I am more comfortable working with R.)
- How well does it handle inputs/outputs?
- How well will it handle the METADATA?
- How much guidance do I need to give, e.g. about which libraries are available? (In theory, I'd like this to be a "step" that the LLM is smart enough to know to implement.) I don't want the LLM to install new packages; that feels like a security risk.
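
One way to enforce the "no new packages" rule is to scan any generated R code before it is run. The following is a minimal sketch, not a complete safeguard: the function name `find_install_calls` and the pattern list are my own assumptions, and the patterns only cover the most common installation calls.

```python
import re

# Common ways R code installs packages (an illustrative, non-exhaustive list).
INSTALL_PATTERNS = [
    r"\binstall\.packages\s*\(",
    r"\bBiocManager::install\s*\(",
    r"\b(?:devtools|remotes)::install_\w+\s*\(",
]

def find_install_calls(r_code: str) -> list[str]:
    """Return the lines of R code that appear to install packages."""
    flagged = []
    for line in r_code.splitlines():
        # Crude comment handling: ignore everything after the first '#'.
        code = line.split("#", 1)[0]
        if any(re.search(pattern, code) for pattern in INSTALL_PATTERNS):
            flagged.append(line.strip())
    return flagged

example = 'library(edgeR)\ninstall.packages("foo")\n# install.packages("bar")\n'
print(find_install_calls(example))
```

If the returned list is non-empty, the generated script could be rejected or sent back to the LLM with a request to use only the packages already available.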

Other notes:
- For the moment, I'll have the LLM use the "LLM Playground" directory to save its outputs
- In my head, this "workflow" will be "hi, here's what I want, do some steps to achieve this" - a bit like the worked example of solving an equation
- I also need to integrate this with 

In [148]:
# Load modules
from openai import OpenAI
import openai # I need this and above
import os
from tqdm import tqdm
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List, Dict, Literal
import subprocess
import glob
import asyncio
import json
import base64 # image interpretation
import requests # image interpretation

In [149]:
# Quick OpenAI API test - note this does not reflect what I intend my end prompt to be, just want to get a quick idea of what I get...

load_dotenv('../../.env')

openai_api_key = os.getenv('OPENAI_API_KEY')

# Test OpenAI API...

client = OpenAI(
  api_key=openai_api_key,
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Could you provide code to import abundance.tsv Kallisto files into R and identify DEGs?",
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

Certainly! To analyze RNA-Seq data obtained from Kallisto and identify differentially expressed genes (DEGs) using R, you will typically follow these steps:

1. **Load the necessary libraries**: You will need libraries like `tximport` for importing Kallisto data, `DESeq2` for differential expression analysis, and `tidyverse` for data manipulation and visualization.

2. **Import the abundance data**: Use `tximport` to import the abundance files from Kallisto.

3. **Create a DESeq2 dataset**: Prepare the data for DESeq2.

4. **Run the differential expression analysis**: Use DESeq2 to identify DEGs.

Here's a sample code illustrating these steps:

```R
# Load necessary libraries
library(tximport)
library(DESeq2)
library(readr)    # for reading tsv files
library(dplyr)    # for data manipulation

# Set the working directory to where the abundance.tsv files are located
setwd("path/to/your/kallisto/abundance/files")  # Change to your directory

# List the abundance.tsv files
files <- list.fi
```

Obviously a one-sentence prompt will get nowhere.

# Investigating metadata

I technically have a separate notebook analysing metadata, but I will more formally do my tests here.

The initial test case is to give a metadata CSV and see if the LLM is able to identify what contrasts would be interesting. Eventually, though, I would probably want a separate function for finding the CSV, and I would also need to determine what specific outputs I want.

At least in the initial conceptualisation stage, I'm not sure where I'll be integrating this (i.e. will this be something I do separately, then feed as input into the LLM?), but nonetheless my goal is to develop a prompt that will get meaningful results.
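
The "separate function for finding the CSV" could look something like the sketch below. The function name, the default glob pattern `*metadata*.csv`, and the choice to fail when more than one file matches are all assumptions for illustration, based only on the file name used in this notebook.

```python
from pathlib import Path

def find_metadata_csv(dataset_dir: str, pattern: str = "*metadata*.csv") -> Path:
    """Return the single CSV under dataset_dir matching pattern, or raise."""
    matches = sorted(Path(dataset_dir).rglob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} under {dataset_dir}")
    if len(matches) > 1:
        # Refuse to guess: an ambiguous match should be resolved by the user.
        raise ValueError(f"Expected one metadata file, found {len(matches)}: {matches}")
    return matches[0]

# Hypothetical usage:
# meta = pd.read_csv(find_metadata_csv("Testing/GSE268034"))
```

Raising on an ambiguous match keeps the wrong metadata from silently ending up in a prompt.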

In [150]:
meta = pd.read_csv("/home/myuser/work/notebooks/Testing/GSE268034/GSE268034_series_matrix_metadata.csv")
meta

Unnamed: 0,title,geo_accession,status,submission_date,last_update_date,type,channel_count,source_name_ch1,organism_ch1,characteristics_ch1,...,library_selection,library_source,library_strategy,relation,relation.1,supplementary_file_1,cell line:ch1,cell type:ch1,genotype:ch1,treatment:ch1
0,SUDHL4_LacZ_RGFP0_1,GSM8284502,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479047,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625208,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284502/suppl/GSM8284502_SUDHL4_LacZ_RGFP0_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,DMSO
1,SUDHL4_LacZ_RGFP0_2,GSM8284503,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479046,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625209,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284503/suppl/GSM8284503_SUDHL4_LacZ_RGFP0_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,DMSO
2,SUDHL4_LacZ_RGFP5_1,GSM8284504,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479045,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625210,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284504/suppl/GSM8284504_SUDHL4_LacZ_RGFP5_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,RGFP966 (5 µM)
3,SUDHL4_LacZ_RGFP5_2,GSM8284505,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479044,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625211,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284505/suppl/GSM8284505_SUDHL4_LacZ_RGFP5_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,WT,RGFP966 (5 µM)
4,SUDHL4_GNASKO2_RGFP0_1,GSM8284506,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479043,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625212,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284506/suppl/GSM8284506_SUDHL4_GNASKO2_RGFP0_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO
5,SUDHL4_GNASKO2_RGFP0_2,GSM8284507,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479042,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625213,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284507/suppl/GSM8284507_SUDHL4_GNASKO2_RGFP0_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO
6,SUDHL4_GNASKO2_RGFP5_1,GSM8284508,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479041,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625214,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284508/suppl/GSM8284508_SUDHL4_GNASKO2_RGFP5_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,RGFP966 (5 µM)
7,SUDHL4_GNASKO2_RGFP5_2,GSM8284509,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479040,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625215,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284509/suppl/GSM8284509_SUDHL4_GNASKO2_RGFP5_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,RGFP966 (5 µM)
8,SUDHL4_GNASKO3_RGFP0_1,GSM8284510,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479039,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625216,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284510/suppl/GSM8284510_SUDHL4_GNASKO3_RGFP0_1.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO
9,SUDHL4_GNASKO3_RGFP0_2,GSM8284511,Public on Aug 08 2024,May 21 2024,Aug 08 2024,SRA,1,SU-DHL-4,Homo sapiens,cell line: SU-DHL-4,...,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN41479038,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX24625217,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284nnn/GSM8284511/suppl/GSM8284511_SUDHL4_GNASKO3_RGFP0_2.txt.gz,SU-DHL-4,diffuse large B-cell lymphoma cells,GNAS knockout,DMSO


In [151]:
meta.to_string

<bound method DataFrame.to_string of                      title geo_accession                 status  \
0      SUDHL4_LacZ_RGFP0_1    GSM8284502  Public on Aug 08 2024   
1      SUDHL4_LacZ_RGFP0_2    GSM8284503  Public on Aug 08 2024   
2      SUDHL4_LacZ_RGFP5_1    GSM8284504  Public on Aug 08 2024   
3      SUDHL4_LacZ_RGFP5_2    GSM8284505  Public on Aug 08 2024   
4   SUDHL4_GNASKO2_RGFP0_1    GSM8284506  Public on Aug 08 2024   
5   SUDHL4_GNASKO2_RGFP0_2    GSM8284507  Public on Aug 08 2024   
6   SUDHL4_GNASKO2_RGFP5_1    GSM8284508  Public on Aug 08 2024   
7   SUDHL4_GNASKO2_RGFP5_2    GSM8284509  Public on Aug 08 2024   
8   SUDHL4_GNASKO3_RGFP0_1    GSM8284510  Public on Aug 08 2024   
9   SUDHL4_GNASKO3_RGFP0_2    GSM8284511  Public on Aug 08 2024   
10  SUDHL4_GNASKO3_RGFP5_1    GSM8284512  Public on Aug 08 2024   
11  SUDHL4_GNASKO3_RGFP5_2    GSM8284513  Public on Aug 08 2024   

   submission_date last_update_date type  channel_count source_name_ch1  \
0      May 21 20

In [152]:
prompt = f"""

## IDENTITY AND PURPOSE

You are an expert in bioinformatic analyses. You will be provided with a metadata sheet, and are tasked with identifying contrasts that could be interesting in the metadata, with the intention of analysing these in an edgeR/limma based pipeline.
Take a deep breath, and carefully follow the steps outlined below to achieve the intended task.

## STEPS

1. Carefully consider each column, inferring what each column means from its name, and also the values in the column. 
2. Determine columns that appear to contain data that would be scientifically and biologically interesting to compare within the column.
- Only include comparisons that can be easily analysed in a limma/edgeR based pipeline
- Only include comparisons that would be generally valuable to scientific and medical literature
- Only include comparisons that can be made within this dataset only - i.e. does not require samples from additional datasets

## OUTPUT

1. For each comparison, include the EXACT column name, as well as the EXACT values that should be used for the comparison. Additionally, justify why the comparison would be interesting using up to 3 sentences

## INPUT

Metadata:
{meta.to_string()}

"""

chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
)

result = chat_completion.choices[0].message.content
print(result)

Here are several interesting comparisons based on the provided metadata:

1. **Comparison 1**:
   - **Column Name**: `genotype:ch1`
   - **Values**: `WT` vs. `GNAS knockout`
   - **Justification**: Comparing the wild-type (WT) genotype with the GNAS knockout model can reveal insights into the role of GNAS in diffuse large B-cell lymphoma. A differential expression analysis may uncover how gene knockout affects the expression profile, potentially identifying pathways or markers critical in the disease.

2. **Comparison 2**:
   - **Column Name**: `treatment:ch1`
   - **Values**: `DMSO` vs. `RGFP966 (5 µM)`
   - **Justification**: Analyzing the gene expression differences between the DMSO-treated and RGFP966-treated samples allows for the evaluation of RGFP966's effects as a therapeutic agent. Such comparisons can yield important information on its mechanism of action and efficacy in affecting cellular processes in DLBCL.

3. **Comparison 3**:
   - **Column Name**: `genotype:ch1`
   - **V

The above does seem pretty good - it is capturing everything that I want. However, I could imagine improvements if I:
1. Repeat the query multiple times
2. Collate the responses (a bit of experimentation reveals this will most likely need a combination of code and an LLM to remove "loose" duplicates)
3. Give scores to the responses, to determine what the "final" list of contrasts to analyse should be.

I will therefore adapt the approach I took in identifying relevant datasets, and implement it here (since I did perform both).

I will need to give special consideration to how to evaluate/score the contrasts (perhaps Mr. Claude/ChatGPT will be helpful for me...)
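
For the code half of removing "loose" duplicates, one option is to reduce each contrast to an order-insensitive key and group contrasts that share it. This is only a sketch under my own assumptions: the names `contrast_key` and `group_candidates` are hypothetical, and the separators handled (commas, "vs", " - ") are guesses at how the model words its values. A shared key marks a candidate duplicate rather than a proven one, since two contrasts can use the same column and values on different subsets of samples.

```python
import re
from collections import defaultdict

def contrast_key(column: str, values: str) -> tuple:
    """Order-insensitive key built from the column name(s) and the compared values."""
    cols = frozenset(c.strip().lower() for c in column.split(","))
    # Split on commas, "vs"/"vs." and " - " so "A, B", "B vs A" and "A - B" agree.
    parts = re.split(r"\s*,\s*|\s+vs\.?\s+|\s+-\s+", values.strip())
    vals = frozenset(p.strip().lower() for p in parts if p.strip())
    return (cols, vals)

def group_candidates(contrasts: list[dict]) -> list[list[str]]:
    """Group contrast names by key; groups longer than one are candidate duplicates."""
    groups = defaultdict(list)
    for c in contrasts:
        groups[contrast_key(c["column"], c["values"])].append(c["name"])
    return list(groups.values())

proposed = [
    {"name": "A", "column": "genotype:ch1", "values": "GNAS knockout, WT"},
    {"name": "B", "column": "genotype:ch1", "values": "WT, GNAS knockout"},
    {"name": "C", "column": "treatment:ch1", "values": "DMSO, RGFP966 (5 µM)"},
]
print(group_candidates(proposed))
```

Groups with more than one name could then be handed to the LLM for the final call on whether they are truly redundant.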

In [153]:
class Assessment(BaseModel):
    name: str = Field(description = "A name to be given to describe the contrast")
    column: str = Field(description = "Column, or columns, in the metadata containing the values to be compared")
    values: str = Field(description = "The values in the identified column that are to be compared")
    justification: str = Field(description = "Justification for why the suggested contrast will be of use")

class Contrasts(BaseModel):
    contrasts: list[Assessment]

def identify_contrasts(meta):
    prompt = f"""

## IDENTITY AND PURPOSE

You are an expert in bioinformatic analyses. You will be provided with a metadata sheet, and are tasked with identifying contrasts that could be interesting in the metadata, with the intention of analysing these in an edgeR/limma based pipeline.
Take a deep breath, and carefully follow the steps outlined below to achieve the intended task.

## STEPS

1. Carefully consider each column, inferring what each column means from its name, and also the values in the column. 
2. Determine columns that appear to contain data that would be scientifically and biologically interesting to analyse
- Only consider analyses that would be generally valuable to scientific and medical literature
- Only include analyses that can be made within this dataset only - i.e. does not require samples from additional datasets
- You are permitted to draw comparisons involving multiple different columns
3. Specify the values in the columns that should be used for the comparison
- Only include comparisons that can be easily analysed in a limma/edgeR based pipeline. 
- Specifically take into consideration how a contrast matrix could be set up using the model.matrix and makeContrasts functions.
- You are permitted to draw comparisons involving multiple different columns


## OUTPUT

1. Include output for each proposed comparison
2. Specify the exact column name(s) that will need to be used for the comparison
3. Specify the exact values that will be used for the comparison
4. Justify why the comparison would be interesting using up to 3 sentences

For points 2 and 3, note that this should include enough information for someone to generate an appropriate contrast matrix using model.matrix and makeContrasts.

## INPUT

Metadata:
{meta.to_string()}

"""
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format = Contrasts
        )
    result = chat_completion.choices[0].message.parsed
    return result

async def identify_contrasts_multiple(meta, num_queries: int = 3) -> Contrasts:
    # identify_contrasts is a blocking call, so run each query in a worker thread;
    # wrapping it in a bare coroutine would execute the queries one after another.
    tasks = [asyncio.to_thread(identify_contrasts, meta) for _ in range(num_queries)]
    results = await asyncio.gather(*tasks)

    # Combine the results
    all_contrasts = Contrasts(contrasts=[])
    for result in results:
        all_contrasts.contrasts.extend(result.contrasts)

    # Deduplication process to remove duplicate contrasts
    contrasts_dict = all_contrasts.dict()
    seen = set()
    unique_contrasts = []

    for item in contrasts_dict['contrasts']:
        identifier = (item['column'], item['values'])
        if identifier not in seen:
            unique_contrasts.append(item)
            seen.add(identifier)

    # Replace the original list with the filtered one
    contrasts_dict['contrasts'] = unique_contrasts

    # Convert back to the Contrasts model
    unique_contrasts_model = Contrasts(**contrasts_dict)

    return unique_contrasts_model

In [154]:
contrasts = await identify_contrasts_multiple(meta, num_queries=3)
print(contrasts)

contrasts=[Assessment(name='Treatment Comparison: DMSO vs RGFP966 (5 µM) for WT Genotype', column='treatment:ch1', values='DMSO, RGFP966 (5 µM)', justification="This comparison will help determine the effect of the RGFP966 treatment on gene expression levels relative to the DMSO control in the wild-type (WT) genotype of diffuse large B-cell lymphoma cells, which could provide insights into the drug's mechanism of action."), Assessment(name='GNAS Knockout vs WT Genotype under DMSO Treatment', column='genotype:ch1', values='GNAS knockout, WT', justification='Examining this contrast will reveal how gene expression differs in the presence or absence of the GNAS gene under the same experimental conditions (DMSO treatment), potentially identifying pathways impacted by GNAS knockout.'), Assessment(name='Effect of RGFP966 Treatment on GNAS Knockout vs WT Genotype', column='treatment:ch1', values='RGFP966 (5 µM), DMSO', justification='This contrast focuses on evaluating the effectiveness of RGF

In [155]:
class ComparisonEval(BaseModel):
    comparison: str
    score: int
    score_justification: str
    redundant: Literal["Yes", "No"]
    redundant_justification: str
    retain: Literal["Yes", "No"]

class AllEvals(BaseModel):
    evals: list[ComparisonEval]

prompt = f"""

### PURPOSE AND IDENTITY

You are an expert and experienced bioinformatician and scientist, who focuses on clarifying analyses which will be meaningful to perform. 

You have been tasked with evaluating the potential scientific value of proposed comparisons. These comparisons are intended to be performed in an edgeR/limma-based RNA-seq pipeline.

Take a deep breath, and carefully follow the below steps to achieve the best possible outcome.

### STEPS 

1. You will be provided a Python dictionary of proposed scientific comparisons which have been proposed for a limma/edgeR RNAseq pipeline.
- Do not propose any additional scientific comparisons beyond those specified in this Python dictionary
- Throughout your evaluation, keep in mind that the analysis will be based on the construction of a contrast matrix, using the values specified in the column and values.
2. You will also be provided metadata, which contains data that is mentioned in the Python dictionary. 
- Do NOT use this metadata to hallucinate additional comparisons
- Use this metadata ONLY to gather additional context for the defined scientific comparisons.
3. For each proposed analysis, assign a score between 1 - 5, based on the scientific value that can be extracted out of the comparison. Do this independently for each comparison. Use the below as a scoring guide:

Score 5 – Outstanding Scientific Value

	•	The proposed comparison is highly relevant and addresses a significant scientific question or hypothesis.
	•	The comparison is likely to yield new and impactful insights that could lead to meaningful advancements in the field.
	•	The analysis is well-aligned with the biological context provided by the metadata and is expected to generate robust, interpretable results.
	•	The comparison is novel or provides a unique perspective that has not been previously explored.

Score 4 – High Scientific Value

	•	The proposed comparison is scientifically sound and addresses an important question.
	•	The analysis has the potential to contribute valuable insights, though it may be incremental rather than groundbreaking.
	•	The comparison is well-supported by the metadata and is expected to produce meaningful results.
	•	The comparison adds depth to existing knowledge but may not be entirely novel.

Score 3 – Moderate Scientific Value

	•	The proposed comparison is reasonable and could yield useful information.
	•	The analysis addresses a relevant question, though the scientific impact may be limited or somewhat unclear.
	•	The comparison is supported by the metadata but may not be as compelling or novel as higher-scoring comparisons.
	•	The results may be interesting but are likely to confirm existing knowledge rather than provide new insights.

Score 2 – Low Scientific Value

	•	The proposed comparison is somewhat relevant but does not address a particularly important or novel question.
	•	The analysis may yield some useful data, but the scientific impact is expected to be minimal.
	•	The comparison is only partially supported by the metadata, and the results may be difficult to interpret or have limited applicability.
	•	The comparison may be redundant with existing analyses or provide only marginal additional insights.

Score 1 – Minimal or No Scientific Value

	•	The proposed comparison is poorly conceived and unlikely to yield meaningful scientific insights.
	•	The analysis does not address a relevant or important question, or the rationale for the comparison is unclear.
	•	The comparison is not well-supported by the metadata, and the results are likely to be uninterpretable or irrelevant.
	•	The comparison may be redundant, trivial, or based on a flawed premise.

4. For each comparison, also identify if it is redundant and/or overlapping with another comparison.
- An example of this is identical "column" and "values" (e.g. column of "A" and values of "val1, val2" as compared to "val2, val1" or "val1 - val2")
- **Important** Note that if comparison 1 is redundant with comparison 2, BOTH comparisons 1 and 2 should be marked as redundant.
- Comparisons which are similar, but not overlapping, should not be classed as redundant
- Only classify comparisons as redundant if you are highly confident that they are redundant
- Keep in mind the analysis will be based on an edgeR/limma/DESeq2 pipeline - if two analyses are likely to require the identical experimental setup, these are redundant.
- After evaluating all comparisons for redundancy, double check whether the intended response for any other comparison needs to be altered accordingly.
5. Based on your score evaluation and redundancy evaluation, make an evaluation as to whether each comparison should be retained. 
- When there are redundant comparisons, ONLY the comparison with the higher scientific value score should be retained
- If redundant comparisons have the same scientific value score, then only retain one.
6. Prior to reporting results, double check that your responses are reasonable, and you have followed the steps correctly.
7. Report your results in accordance to the instructions in OUTPUT.

### OUTPUT

1. Include output for all proposed comparisons. Use the comparison name to describe each comparison.
2. Specify the scientific evaluation score
3. Include justification for the scientific evaluation score
4. Specify if the comparison is redundant
5. If redundant, justify why it is redundant. If not redundant, specify "Not redundant" for this field.
- The justification for selecting which of the redundant comparisons, if any, should be specified here.
6. Specify if the comparison should be retained or not

### PROPOSED SCIENTIFIC ANALYSES

{contrasts}

### METADATA

{meta.to_string()}
"""

chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=AllEvals
)
result = chat_completion.choices[0].message.parsed

In [156]:
df = contrasts.dict()
df = pd.DataFrame(df['contrasts'])
df

Unnamed: 0,name,column,values,justification
0,Treatment Comparison: DMSO vs RGFP966 (5 µM) for WT Genotype,treatment:ch1,"DMSO, RGFP966 (5 µM)","This comparison will help determine the effect of the RGFP966 treatment on gene expression levels relative to the DMSO control in the wild-type (WT) genotype of diffuse large B-cell lymphoma cells, which could provide insights into the drug's mechanism of action."
1,GNAS Knockout vs WT Genotype under DMSO Treatment,genotype:ch1,"GNAS knockout, WT","Examining this contrast will reveal how gene expression differs in the presence or absence of the GNAS gene under the same experimental conditions (DMSO treatment), potentially identifying pathways impacted by GNAS knockout."
2,Effect of RGFP966 Treatment on GNAS Knockout vs WT Genotype,treatment:ch1,"RGFP966 (5 µM), DMSO","This contrast focuses on evaluating the effectiveness of RGFP966 treatment between the GNAS knockout and wild-type genotypes, enhancing understanding of how genetic background may influence the drug response."
3,Comparison of Wild-Type versus GNAS Knockout with DMSO,genotype:ch1,"WT, GNAS knockout","Analyzing the differences in gene expression between wild-type and GNAS knockout SU-DHL-4 cells treated with DMSO can reveal how GNAS loss influences the cellular response, potentially identifying targets for therapeutic intervention."
4,Comparison of Wild-Type versus GNAS Knockout with RGFP966 (5 µM),"genotype:ch1, treatment:ch1","WT, GNAS knockout, RGFP966 (5 µM)","This contrast will shed light on how the knockdown of GNAS affects the efficacy of the RGFP966 treatment in SU-DHL-4 cells, providing critical information on the role of GNAS in regulating responses to therapy."
5,Comparative Effects of RGFP966 in Different Genotypes,"treatment:ch1, genotype:ch1","RGFP966 (5 µM), WT, GNAS knockout",Investigating how RGFP966 affects wild-type versus GNAS knockout SU-DHL-4 cells will elucidate the molecular pathways impacted by this treatment and whether the genotype modifies the drug response.


In [157]:
df = result.dict()
df = pd.DataFrame(df['evals'])
df

Unnamed: 0,comparison,score,score_justification,redundant,redundant_justification,retain
0,Treatment Comparison: DMSO vs RGFP966 (5 µM) for WT Genotype,5,"This comparison directly examines the effects of RGFP966 treatment against a control (DMSO) in WT cells, providing significant insight into the drug's mechanism and therapeutic potential in diffuse large B-cell lymphoma. Additionally, it targets a critical question regarding drug efficacy in cancer treatment, which is relevant to current therapeutic strategies.",No,Not redundant,Yes
1,GNAS Knockout vs WT Genotype under DMSO Treatment,4,"This analysis targets the biological impact of losing the GNAS gene under a common treatment condition (DMSO), which can reveal insights into the pathways affected by GNAS. While it is valuable, it is less direct than the DMSO vs RGFP966 comparison.",No,Not redundant,Yes
2,Effect of RGFP966 Treatment on GNAS Knockout vs WT Genotype,4,"This assessment explores how the GNAS knockout affects the response to RGFP966 treatment. It builds on existing knowledge and can provide insights into genetic influence on drug response, adding value but providing somewhat incremental knowledge.",No,Not redundant,Yes
3,Comparison of Wild-Type versus GNAS Knockout with DMSO,4,"This comparison analyzes gene expression differences in the context of DMSO. It provides insights into the role of GNAS, though it's more focused on constitutive gene expression rather than treatment effects, making it valuable but slightly less critical than treatment-focused comparisons.",No,Not redundant,Yes
4,Comparison of Wild-Type versus GNAS Knockout with RGFP966 (5 µM),5,"This analysis allows for direct comparison of genotype response to the potential therapeutic agent RGFP966, yielding insights into how the GNAS gene influences drug efficacy, addressing both efficacy and genetic context, and is highly relevant in therapeutic research.",No,Not redundant,Yes
5,Comparative Effects of RGFP966 in Different Genotypes,5,"Investigating RGFP966 effects across genetic backgrounds (WT and GNAS knockout) addresses fundamental questions of how gene expression changes beget different therapeutic responses, unveiling critical insights on personalized medicine and therapeutic efficacy based on genotype.",No,Not redundant,Yes


I'm not entirely satisfied with the outcome so far (mainly with the inability to identify redundant contrasts). However, the contrasts it is identifying do seem of interest, and I must admit they are better than the single one I came up with in my initial testing.

I noted several instances of hallucination, particularly imagining contrasts that I did not specify. I've tried to stamp these out; a bit concerningly, these were sometimes marked as "retain".

My plan is to leave this as is for the moment. When I begin developing the code-generation prompt more explicitly, I think I might just include another check: "would the code be functionally identical? -> if yes, ignore". This might be sufficient.
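
Hallucinated comparisons can also be caught in code, independently of the prompt, by checking each evaluation against the list of contrasts that was actually proposed. The sketch below assumes the evaluator echoes the contrast names exactly; `check_evals` is a hypothetical helper, and a paraphrased name would be reported as invented.

```python
def check_evals(proposed_names: list[str], evals: list[dict]):
    """Split evaluations into those matching a proposed contrast and those that do not."""
    known = set(proposed_names)
    valid = [e for e in evals if e["comparison"] in known]
    invented = [e for e in evals if e["comparison"] not in known]
    evaluated = {e["comparison"] for e in evals}
    # Proposed contrasts that the evaluator skipped entirely.
    missing = [name for name in proposed_names if name not in evaluated]
    return valid, invented, missing

valid, invented, missing = check_evals(
    ["A", "B"],
    [{"comparison": "A", "retain": "Yes"}, {"comparison": "Z", "retain": "Yes"}],
)
print(len(valid), len(invented), missing)

# With the objects in this notebook, the inputs would be along the lines of
# [c.name for c in contrasts.contrasts] and result.dict()["evals"].
```

Anything in `invented` can be dropped regardless of its "retain" value, and anything in `missing` flags an incomplete evaluation.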

# Generating the RNAseq analysis code

My focus will now switch to generating RNAseq analysis code. In my head, this would turn out as:
- Identify the Kallisto files
- Import the Kallisto abundances
- Import the metadata
- Create a DGEList object
- Filtering/normalisation
- DEG analysis

Of course, if the LLM ends up deviating from this, then I can assess how well its alternative performs. In any case, I think the overall workflow I am aiming for is:
- Propose pipeline
- Evaluate the pipeline
- If needed, adjust the proposed pipeline
- Execute pipeline
- Assess results of the pipeline (mainly in terms of stderr/stdout) - do the results make sense?
- I'd also be interested to see how possible it is to integrate QC into this. e.g. I can provide images, but would these actually be interpreted correctly? (and I'm also unsure how I'd even feed these in...). 
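
The "execute" and "assess stderr/stdout" steps above could be sketched as follows. `run_step` is a hypothetical helper; it is demonstrated with the Python interpreter so it runs anywhere, and the `Rscript` command in the comment uses a made-up script name.

```python
import subprocess
import sys

def run_step(cmd: list[str], timeout: int = 600) -> dict:
    """Run one pipeline command and capture what is needed to assess it afterwards."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"ok": False, "returncode": None, "stdout": "",
                "stderr": f"Timed out after {timeout}s"}
    except FileNotFoundError as err:
        # The executable itself (e.g. Rscript) was not found.
        return {"ok": False, "returncode": None, "stdout": "", "stderr": str(err)}
    return {"ok": proc.returncode == 0, "returncode": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}

# For the real pipeline the command would be something like ["Rscript", "analysis.R"].
report = run_step([sys.executable, "-c", "print('hello')"])
print(report["ok"], report["stdout"].strip())
```

Note that R writes messages and warnings (for example, package start-up messages) to stderr even when a script succeeds, so the return code is a better pass/fail signal than whether stderr is empty. The stderr text is still worth passing to the LLM for assessment.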

# Small image test

I want to get an idea of the value of inputting images, both in terms of how well I can expect it to perform and how much it costs.

It seems that cost is not going to be a big issue. The manual implementation of this works... fine (it can interpret the image), but I'm not sure how viable this will become.
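
Since the image is sent inline as a base64 data URL, a small helper can build that URL and take the MIME type from the file extension instead of hard-coding it. This is a sketch under the assumption that QC plots will be saved as ordinary image files such as PNG or JPEG; `image_data_url` is a hypothetical name.

```python
import base64
import mimetypes

def image_data_url(image_path: str) -> str:
    """Encode an image file as a data URL, inferring the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine an image type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be used directly as the `url` value of an `image_url` content part.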

In [158]:
# Examples of image-based QC I might do include interpreting PCA plots, assessing the outcome of filtering as well as normalization.

def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "./ManualRNASeqAnalysis/Filtering.png"

# Getting the base64 string
base64_image = encode_image(image_path)

prompt = f"""

The provided image is intended for use as a QC check in a bioinformatic analysis.

Make an assessment as to whether the QC check is likely to have passed.
"""



headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {openai_api_key}"
}

payload = {
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": prompt
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/png;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())

{'id': 'chatcmpl-9yyGKp5IvN3DnfstTddh1kZ1mRcpa', 'object': 'chat.completion', 'created': 1724318356, 'model': 'gpt-4o-mini-2024-07-18', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "Based on the provided image, here's an assessment of the quality control (QC) check in the bioinformatic analysis:\n\n1. **Unfiltered Data** (Left Panel):\n   - The density plot shows multiple peaks and a significant amount of density at lower log-cpm values (around 6). This suggests the presence of low count or noise, which may indicate poor quality data or a large number of low-expressed genes.\n\n2. **Filtered Data** (Right Panel):\n   - The density plot indicates a much sharper and centralized peak around higher log-cpm values (8 to 12+). This suggests that filtering has improved the overall quality of the data by removing low-expressed genes or noise, and focusing on relevant gene expression.\n\n### Conclusion:\nThe QC check likely passed since the filtered data shows a more nor

# Code generation test

OpenAI's worked example of structured outputs solves an equation step by step - I am planning on doing something similar.

I need to think carefully about the implementation - my first test case will be "produce a step-by-step pipeline", followed by "what are the expected inputs and outputs" and "generate code" for each step in the pipeline.

I also need to think carefully about how I supply the metadata... it is definitely necessary for getting the correct code. With this in mind, I think it makes sense to have a separate LLM call for... each step...?
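
A rough sketch of what "a separate LLM call for each step" could look like, without actually calling the API: one prompt is built per step and per task, and the metadata columns are repeated in every prompt. Everything here is a placeholder of mine (`PlannedStep`, `build_step_prompt`, the column names and the section headings), not a settled design.

```python
from dataclasses import dataclass


@dataclass
class PlannedStep:
    """Hypothetical stand-in for one entry of the planned pipeline."""
    name: str
    description: str


def build_step_prompt(step: PlannedStep, metadata_columns: list[str], task: str) -> str:
    """Compose the prompt for one follow-up call about a single step."""
    columns = ", ".join(metadata_columns)
    return (
        f"### STEP\n{step.name}: {step.description}\n\n"
        f"### SAMPLE METADATA COLUMNS\n{columns}\n\n"
        f"### TASK\n{task}\n"
    )


steps = [
    PlannedStep("Read Kallisto Abundance Files", "Import abundance.tsv files into R."),
    PlannedStep("Prepare Sample Metadata", "Import and prepare the sample metadata file."),
]
# Column names are made up for illustration
metadata_columns = ["sample_id", "condition", "replicate"]

# One prompt per step per task; each would be sent as its own LLM call
tasks = ["List the expected inputs and outputs.", "Generate R code for this step."]
prompts = [build_step_prompt(s, metadata_columns, t) for s in steps for t in tasks]
print(len(prompts))
print(prompts[0])
```

Repeating the metadata in every prompt costs tokens, but it means each call has what it needs to refer to the real column names.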

In [165]:
class Steps(BaseModel):
    name: str = Field(description = "A simple, descriptive name of the step to be performed")
    description: str = Field(description="Description of the step to be performed")
    purpose: str = Field(description="Justification of why step is needed")
    code_requirements: list[str] = Field(description = "R packages and functions required for this step")

class Pipeline(BaseModel):
    pipeline: list[Steps]

prompt = f"""

### IDENTITY AND PURPOSE

You are an expert bioinformatician, who meticulously and carefully plans computational RNAseq experiments. 

You have been asked to provide a basic analysis pipeline which can be used to analyse quantification data generated from Kallisto. 

Take a deep breath, and carefully take note of the requirements outlined below to achieve the best possible outcome.

### PIPELINE REQUIREMENTS

1. Keep in mind that the pipeline will have the following available inputs: Kallisto abundance files, sample metadata
2. Your pipeline should be based in R
3. Avoid installing unnecessary packages. Packages such as tidyverse, limma, edgeR, tximport, and DESeq2 are permitted.
4. Your pipeline should incorporate analysis steps, validation steps (e.g. checking inputs/outputs), and also quality control steps
5. The final output should be related to differentially expressed genes
6. The pipeline should be constructed in a logical order
7. The starting point of the pipeline will be the input Kallisto abundance files

### OUTPUT

1. Construct a list of sequential steps to perform the RNAseq analysis
2. Each step should achieve a single goal


"""

chat_completion = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=Pipeline
)
result = chat_completion.choices[0].message.parsed
print(result.dict())

{'pipeline': [{'name': 'Load Required Packages', 'description': 'Load all necessary R packages for the analysis.', 'purpose': 'To ensure that all required libraries are available for RNAseq analysis.', 'code_requirements': ['library(tximport)', 'library(DESeq2)', 'library(limma)', 'library(edgeR)', 'library(tidyverse)']}, {'name': 'Read Kallisto Abundance Files', 'description': 'Import the Kallisto abundance files into R as a data frame or list of data frames.', 'purpose': 'To obtain the raw quantification data generated by Kallisto for further processing.', 'code_requirements': ["abundance_files <- list.files(path = 'path_to_kallisto_output', pattern = '*.tsv', full.names = TRUE)", 'abundance_list <- lapply(abundance_files, read.table, header = TRUE)']}, {'name': 'Prepare Sample Metadata', 'description': 'Import and prepare the sample metadata file.', 'purpose': 'To provide essential information regarding the samples under study such as conditions, replicates, and any experimental des

In [166]:
result.dict()

{'pipeline': [{'name': 'Load Required Packages',
   'description': 'Load all necessary R packages for the analysis.',
   'purpose': 'To ensure that all required libraries are available for RNAseq analysis.',
   'code_requirements': ['library(tximport)',
    'library(DESeq2)',
    'library(limma)',
    'library(edgeR)',
    'library(tidyverse)']},
  {'name': 'Read Kallisto Abundance Files',
   'description': 'Import the Kallisto abundance files into R as a data frame or list of data frames.',
   'purpose': 'To obtain the raw quantification data generated by Kallisto for further processing.',
   'code_requirements': ["abundance_files <- list.files(path = 'path_to_kallisto_output', pattern = '*.tsv', full.names = TRUE)",
    'abundance_list <- lapply(abundance_files, read.table, header = TRUE)']},
  {'name': 'Prepare Sample Metadata',
   'description': 'Import and prepare the sample metadata file.',
   'purpose': 'To provide essential information regarding the samples under study such as 
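
The nested dictionary is awkward to scan by eye, so here is a small sketch that flattens it to one line per step. `summarize_pipeline` is a hypothetical helper; the example input copies only the first two steps from the output above.

```python
def summarize_pipeline(pipeline_dict: dict) -> list[str]:
    """Return one human-readable line per step of a Pipeline dict."""
    lines = []
    for number, step in enumerate(pipeline_dict["pipeline"], start=1):
        n_snippets = len(step.get("code_requirements", []))
        lines.append(
            f"{number}. {step['name']} ({n_snippets} code line(s)): {step['purpose']}"
        )
    return lines


# First two steps copied from the output above
example = {
    "pipeline": [
        {
            "name": "Load Required Packages",
            "description": "Load all necessary R packages for the analysis.",
            "purpose": "To ensure that all required libraries are available for RNAseq analysis.",
            "code_requirements": [
                "library(tximport)",
                "library(DESeq2)",
                "library(limma)",
                "library(edgeR)",
                "library(tidyverse)",
            ],
        },
        {
            "name": "Read Kallisto Abundance Files",
            "description": "Import the Kallisto abundance files into R as a data frame or list of data frames.",
            "purpose": "To obtain the raw quantification data generated by Kallisto for further processing.",
            "code_requirements": [
                "abundance_files <- list.files(path = 'path_to_kallisto_output', pattern = '*.tsv', full.names = TRUE)",
                "abundance_list <- lapply(abundance_files, read.table, header = TRUE)",
            ],
        },
    ]
}

for line in summarize_pipeline(example):
    print(line)
```

In the notebook this would be called as `summarize_pipeline(result.dict())`.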

In [168]:
# Here is where I will include the initial plan for an evaluation framework - the goal is just to evaluate whether the proposed pipeline looks reasonable

# A quick glance at the JSON makes it seem... ok...?

class StepAssessment(BaseModel):
    name: str = Field(description = "A simple, descriptive name of the step to be performed")
    code_eval: str = Field(description="Evaluation of if the proposed code is likely to work")
    pipeline_eval: str = Field(description="Evaluation of if the proposed step is useful in the context of the pipeline")
    step_eval: Literal["Yes", "No"] = Field(description="Yes if the step is ok, No if the step needs improvements/changes, or was missing from the input pipeline")


class OverallAssessment(BaseModel):
    all_assessments: list[StepAssessment]
    overall_eval: Literal["Yes", "No"] = Field(description="Yes if all steps in the pipeline are ok, No if any step needs improvements/changes, or was missing from the input pipeline")

### TEMPORARY NOTE THAT I WILL HOPEFULLY REMEMBER TO LOOK AT 
# ^^^^^^^^^^^^^^^^^^^^

prompt = f"""

### IDENTITY AND PURPOSE

You are an expert bioinformatician, who meticulously and carefully plans computational RNAseq experiments. 

You have been asked to evaluate a basic analysis pipeline which can be used to analyse quantification data generated from Kallisto. 

Take a deep breath, and carefully take note of the steps outlined below to achieve the best possible outcome.

### STEPS

1. Evaluate each proposed step in the pipeline, taking into consideration
a) The accuracy of the proposed code 
- Assess the general structure of the code, and double check the existence of any proposed functions and libraries, as opposed to specific parameters
- Pay special attention to capitalization, correct use of underscores/periods in functions, and whether or not the function exists
- There will be placeholder values in the proposed code - do not assess these as incorrect.
b) The necessity and value of each step of the code
c) The simplicity of the code in each step
d) Whether the code follows commonly accepted guidelines
2. If possible improvements can be made in any of the above factors, make these suggestions to individual steps
3. If there appear to be steps missing, generate additional steps. Mark the step evaluation of these additional steps as "No".
4. If and only if all steps pass the evaluation, mark the overall pipeline as "Yes".

### OUTPUT

1. Where needed, improve the list of sequential steps to perform the RNAseq analysis
2. Each step should achieve a single goal

### INPUT PIPELINE

{result}

"""

pipeline_eval = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=OverallAssessment
)
pipeline_eval.dict()

{'id': 'chatcmpl-9zCoiXWYBVSmWpizXYRef0p0VG24S',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': '{"all_assessments":[{"name":"Load Required Packages","code_eval":"The libraries listed are widely used for RNAseq analyses and the code structure is correct.","pipeline_eval":"Loading required packages is essential for the analysis.","step_eval":"Yes"},{"name":"Read Kallisto Abundance Files","code_eval":"The code correctly uses list.files and lapply to read Kallisto output, assuming the placeholder paths are correctly specified.","pipeline_eval":"This step is crucial for obtaining raw data, which is foundational for analysis.","step_eval":"Yes"},{"name":"Prepare Sample Metadata","code_eval":"The code correctly reads and transforms the sample metadata into a data frame.","pipeline_eval":"Sample metadata is essential for differential expression analysis.","step_eval":"Yes"},{"name":"Validate Input Files","code_eval":"The stopifnot function
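
The part I actually care about is buried in the `content` string, so here is a sketch of decoding it and listing the steps that did not pass. `failing_steps` is a hypothetical helper, and the example JSON is invented to show a failing case (the step names are borrowed from the output above, the "No" verdict is not real). In the notebook, `pipeline_eval.choices[0].message.parsed` should give the same information as a Pydantic object without the manual `json.loads`.

```python
import json


def failing_steps(assessment_json: str) -> list[str]:
    """Return the names of steps whose step_eval is "No"."""
    assessment = json.loads(assessment_json)
    failed = [
        step["name"]
        for step in assessment["all_assessments"]
        if step["step_eval"] == "No"
    ]
    # The prompt asks for overall_eval == "Yes" only if every step passes,
    # so flag any response where the two disagree.
    expected_overall = "No" if failed else "Yes"
    if assessment["overall_eval"] != expected_overall:
        print(
            f"Warning: overall_eval is {assessment['overall_eval']!r} "
            f"but the step results imply {expected_overall!r}"
        )
    return failed


# Invented example in the OverallAssessment shape (not real model output)
example_content = json.dumps(
    {
        "all_assessments": [
            {
                "name": "Load Required Packages",
                "code_eval": "ok",
                "pipeline_eval": "ok",
                "step_eval": "Yes",
            },
            {
                "name": "Validate Input Files",
                "code_eval": "needs work",
                "pipeline_eval": "ok",
                "step_eval": "No",
            },
        ],
        "overall_eval": "No",
    }
)
print(failing_steps(example_content))
```

The consistency check is there because nothing in the schema forces `overall_eval` to agree with the per-step verdicts.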