In [35]:
# Load modules 

from openai import OpenAI
import sys
import openai # I need this and above
import os
from tqdm import tqdm
import time
import re
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List, Dict, Literal, Optional
import subprocess
import glob
import asyncio
import json
import base64 # image interpretation
import requests # image interpretation
import shlex # suggested for command-line strings
from datetime import datetime

load_dotenv('../../.env')

openai_api_key = os.getenv('OPENAI_API_KEY')

# Test OpenAI API...

client = OpenAI(
  api_key=openai_api_key,
)

# Purpose

The purpose of this notebook is to go from the Kallisto quantification data to the differentially expressed genes (DEGs).

The specific steps I envisage at this stage:
- Identify the Kallisto abundance files (I previously noted the need for an "agent" to identify files, it would again come in handy here)
- The DEG analysis, which would include: reading in files (which I think includes the tx2gene file...), filtering/normalisation, DESIGN OF CONTRASTS (important part!!), performing the DEG tests

I will start with the design of contrasts. I need this to be super robust, as this is a crucial part in determining the information I can extract out of a dataset.

## Contrast design

Considerations when designing the contrasts:

### INPUTS
- Sample metadata (something I downloaded in the data extraction part)
- Dataset summary (something I did during Kallisto quantification I think?)
- (probably for later) Research question/existing findings

### EXECUTION
- Contrasts to be analysed need to be possible, i.e. every group named in a contrast has to exist in the sample metadata
- I previously did a step where columns were combined - I suspect this is a necessary step

In [52]:
# We will start with reading in the metadata. This also drops columns that hold a single repeated value, since those cannot distinguish between samples.

df = pd.read_csv("/home/myuser/work/notebooks/2_extract_data/GSE268034_data/GSE268034_series_matrix_metadata.csv")
df = df.loc[:, df.nunique() > 1]
metadata_json = df.to_json(orient='records', lines=False, indent=2) # serialise to JSON
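A quick sanity check of the column filter on a toy table. The column names and values here are made up for illustration and are not taken from the GEO metadata. As far as I understand `nunique()`, it ignores missing values by default, so a column holding one value plus blanks is also dropped unless `dropna=False` is passed:

```python
import pandas as pd

# Toy metadata: one ID column, one constant column, one informative column,
# and one column that holds a single value plus missing entries.
toy = pd.DataFrame({
    "geo_accession": ["GSM1", "GSM2", "GSM3", "GSM4"],
    "organism": ["Homo sapiens"] * 4,
    "genotype": ["WT", "WT", "KO", "KO"],
    "batch": ["A", None, "A", None],
})

# Same filter as in the cell above: keep columns with more than one distinct value
default_keep = toy.loc[:, toy.nunique() > 1]

# Counting missing values as their own category keeps the 'batch' column
with_na = toy.loc[:, toy.nunique(dropna=False) > 1]

print(list(default_keep.columns))
print(list(with_na.columns))
```

Note that ID-like columns where every value is unique (such as `geo_accession`) pass this filter, so the model still sees them in the prompt.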

In [25]:
class ColumnMerging(BaseModel):
    merge: bool = Field(..., description="Whether or not columns should be merged")
    cols: Optional[list[str]] = Field(..., description="List of columns to be merged")
    justification: str = Field(..., description = "Justification of columns being merged/why no columns needed to be merged")

def Identify_ColMerges():
    prompt = f"""

### IDENTITY AND PURPOSE

You are an expert in bioinformatics. You advise on the most scientifically valuable experiments that can be performed, and have a deep awareness of DEG analysis tools, such as limma and edgeR.

Your task is to study the provided metadata, and determine which columns to use in proceeding with the analysis.

### STEPS

Note that a future step of the analysis will involve design of a matrix as follows:
design <- model.matrix(data = DGE.final$samples,
                       ~0 + column)

Crucially, this only includes a single column. Therefore, take a deep breath, and follow these steps to ensure that subsequent analyses are as robust as possible:

1. Assess the content of each column in the provided metadata
2. Determine which columns contain anything of biological relevance
3. Determine if any columns are redundant, and do not need to be considered (e.g. similar content). In this case, only consider the column with simpler values (i.e. fewer special characters)
4. Determine which columns contain information that would be scientifically valuable to analyse, i.e. could result in a meaningful biological finding.
5. If there are multiple columns that contain scientifically valuable information, identify these columns as needing to be merged.
6. If there is only one column containing scientifically valuable information, no columns need to be merged

### OUTPUT

- Specify if any columns will need to be merged
- State the names of the columns to be merged
- Justify your choice

### INPUT METADATA

{metadata_json}

"""
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format = ColumnMerging
        )
    result = chat_completion.choices[0].message.parsed
    print("Generated tokens: ", chat_completion.usage.completion_tokens)
    print("Prompt tokens: ", chat_completion.usage.prompt_tokens)
    print("Total tokens: ", chat_completion.usage.total_tokens)
    return result

In [48]:
col_merge_info = Identify_ColMerges()
col_merge_info

Generated tokens:  93
Prompt tokens:  2994
Total tokens:  3087


ColumnMerging(merge=True, cols=['genotype:ch1', 'treatment:ch1'], just="Both columns contain crucial biological information related to the genotype and treatment conditions of the samples. They provide essential context for understanding the experimental design and potential results of the DEG analysis. While both columns are significant, they can be combined into a single column (e.g., 'genotype_treatment') to simplify the design matrix and maintain clarity in data interpretation.")

The above is able to identify columns to be merged. Now, let's merge those columns.
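Since the model's answer can vary between runs, a small guard before merging seems sensible. The sketch below uses a hypothetical helper, `check_merge_columns` (not part of the workflow above), which only checks that the suggested columns exist in the metadata and that their number is consistent with the `merge` flag. It works on plain lists so it can be tried without calling the API:

```python
def check_merge_columns(available, requested, merge):
    """Return a list of problems with a column-merge suggestion (empty if usable)."""
    requested = list(requested or [])
    problems = []

    # Every suggested column has to be a real column of the metadata table
    missing = [col for col in requested if col not in available]
    if missing:
        problems.append(f"columns not found in metadata: {missing}")

    # The number of columns has to agree with the merge flag
    if merge and len(requested) < 2:
        problems.append("merge=True but fewer than two columns were named")
    if not merge and len(requested) != 1:
        problems.append("merge=False but the number of columns is not exactly one")

    return problems


columns = ["title", "geo_accession", "genotype:ch1", "treatment:ch1"]
ok = check_merge_columns(columns, ["genotype:ch1", "treatment:ch1"], merge=True)
bad = check_merge_columns(columns, ["genotype", "treatment:ch1"], merge=True)
print(ok)
print(bad)
```

In the notebook this could be called as `check_merge_columns(list(df.columns), col_merge_info.cols, col_merge_info.merge)` before `process_column_merging`.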

Hmm. In my actual workflow... I 

In [55]:
def clean_string(s: str) -> str:
    """
    Clean a string by replacing spaces with underscores and removing special characters.

    Args:
        s (str): The string to clean.

    Returns:
        str: The cleaned string.
    """
    if pd.isnull(s):
        return "NA"  # Handle missing values
    s = str(s)
    s = s.strip()  # Remove leading and trailing whitespaces
    s = s.replace(" ", "_")  # Replace spaces with underscores
    s = re.sub(r'[^\w]', '', s)  # Remove non-word characters (retain letters, digits, underscores)
    return s

def process_column_merging(df: pd.DataFrame, column_merge_info: ColumnMerging) -> pd.DataFrame:
    """
    Process column merging based on ColumnMerging information.

    Args:
        df (pd.DataFrame): The sample metadata DataFrame.
        column_merge_info (ColumnMerging): Information about column merging.

    Returns:
        pd.DataFrame: The updated DataFrame with merged columns if applicable.
    """
    if column_merge_info.merge:
        # Ensure that at least two columns are provided for merging
        if not column_merge_info.cols or len(column_merge_info.cols) < 2:
            raise ValueError("At least two columns must be specified for merging when merge=True.")
        
        cols_to_merge = column_merge_info.cols

        # Use one fixed name for the merged column so that later prompts and the
        # R script can always refer to 'merged_analysis_group'
        new_col_name = "merged_analysis_group"

        # Clean the values in the columns to be merged
        cleaned_columns = df[cols_to_merge].map(clean_string)

        # Merge the cleaned columns by concatenating their values with underscores
        df[new_col_name] = cleaned_columns.apply(lambda row: "_".join(row.values), axis=1)

        print(f"Merged columns {cols_to_merge} into '{new_col_name}'.")
    else:
        # When merging is not required, ensure exactly one column is specified
        if not column_merge_info.cols or len(column_merge_info.cols) != 1:
            raise ValueError("Exactly one column must be specified for cleaning when merge=False.")
        
        col_to_clean = column_merge_info.cols[0]

        # Use the same fixed column name as in the merge case
        new_col_name = "merged_analysis_group"

        # Rename the column in the DataFrame
        df = df.rename(columns={col_to_clean: new_col_name})

        # Clean the values in the renamed column
        df[new_col_name] = df[new_col_name].apply(clean_string)

        print(f"Cleaned column '{col_to_clean}' into '{new_col_name}'.")

    return df

In [56]:
cleaned_df = process_column_merging(df, col_merge_info)

Merged columns ['genotype:ch1', 'treatment:ch1'] into 'merged_analysis_group'.


In [76]:
cleaned_df

Unnamed: 0,title,geo_accession,characteristics_ch1.2,characteristics_ch1.3,relation,relation.1,supplementary_file_1,genotype:ch1,treatment:ch1,merged_analysis_group
0,SUDHL4_LacZ_RGFP0_1,GSM8284502,genotype: WT,treatment: DMSO,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,WT,DMSO,WT_DMSO
1,SUDHL4_LacZ_RGFP0_2,GSM8284503,genotype: WT,treatment: DMSO,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,WT,DMSO,WT_DMSO
2,SUDHL4_LacZ_RGFP5_1,GSM8284504,genotype: WT,treatment: RGFP966 (5 µM),BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,WT,RGFP966 (5 µM),WT_RGFP966_5_µM
3,SUDHL4_LacZ_RGFP5_2,GSM8284505,genotype: WT,treatment: RGFP966 (5 µM),BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,WT,RGFP966 (5 µM),WT_RGFP966_5_µM
4,SUDHL4_GNASKO2_RGFP0_1,GSM8284506,genotype: GNAS knockout,treatment: DMSO,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,GNAS knockout,DMSO,GNAS_knockout_DMSO
5,SUDHL4_GNASKO2_RGFP0_2,GSM8284507,genotype: GNAS knockout,treatment: DMSO,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,GNAS knockout,DMSO,GNAS_knockout_DMSO
6,SUDHL4_GNASKO2_RGFP5_1,GSM8284508,genotype: GNAS knockout,treatment: RGFP966 (5 µM),BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,GNAS knockout,RGFP966 (5 µM),GNAS_knockout_RGFP966_5_µM
7,SUDHL4_GNASKO2_RGFP5_2,GSM8284509,genotype: GNAS knockout,treatment: RGFP966 (5 µM),BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,GNAS knockout,RGFP966 (5 µM),GNAS_knockout_RGFP966_5_µM
8,SUDHL4_GNASKO3_RGFP0_1,GSM8284510,genotype: GNAS knockout,treatment: DMSO,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,GNAS knockout,DMSO,GNAS_knockout_DMSO
9,SUDHL4_GNASKO3_RGFP0_2,GSM8284511,genotype: GNAS knockout,treatment: DMSO,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM8284...,GNAS knockout,DMSO,GNAS_knockout_DMSO


In [47]:
col_merge_info

ColumnMerging(merge=True, cols=['characteristics_ch1.2', 'characteristics_ch1.3'], just="The columns 'characteristics_ch1.2' and 'characteristics_ch1.3' provide distinct yet complementary biological information regarding the genotype and treatment of the samples. Merging these columns into a single column will streamline the design matrix and make it easier to analyze the combined effects of genotype and treatment in the subsequent analysis.")

Hmm. This kind of works.
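One thing to watch in the merged values: in Python 3, `\w` is Unicode-aware for `str` patterns, so a character like the micro sign in "5 µM" survives `clean_string`. That may matter later, because the group names become column names in the R design matrix and the model has to reproduce them exactly. A minimal sketch of an ASCII-only variant follows; `clean_string_ascii` is a hypothetical name, and whether dropping the character is acceptable is a judgement call:

```python
import re


def clean_string_ascii(value) -> str:
    """Variant of clean_string that keeps only ASCII letters, digits and underscores."""
    if value is None:
        return "NA"  # treat missing values the same way as the original helper
    s = str(value).strip().replace(" ", "_")
    return re.sub(r"[^A-Za-z0-9_]", "", s)


raw = "RGFP966 (5 \u00b5M)"  # \u00b5 is the micro sign

# What the Unicode-aware pattern from clean_string produces
unicode_cleaned = re.sub(r"[^\w]", "", raw.strip().replace(" ", "_"))

# What the ASCII-only variant produces
ascii_cleaned = clean_string_ascii(raw)

print(unicode_cleaned)
print(ascii_cleaned)
```

The ASCII version loses the unit prefix ("5_M" instead of "5_µM"), so an explicit replacement such as mapping "µ" to "u" before cleaning might be preferable.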

Anyway, the next step is to work out the actual analyses to be performed. 

In [128]:
cleaned_meta_json = cleaned_df.to_json(orient='records', lines=False, indent=2) # serialise to JSON

class Contrast(BaseModel):
    name: str = Field(..., description = "Name of contrast to perform")
    values: list[str] = Field(..., description = "Values involved in analysis of the contrast")
    description: str = Field(..., description = "Description of the contrast")
    justification: str = Field(..., description = "Justification of why the contrast is of interest to analyse")

class AllAnalysisContrasts(BaseModel):
    contrasts: list[Contrast]

def get_study_summary(accession):

    # Define the command as a string; shlex.quote guards against shell metacharacters in the accession
    query = shlex.quote(f"{accession}[ACCN]")
    command = (
        f'esearch -db gds -query {query} | '
        'efetch -format docsum | '
        'xtract -pattern DocumentSummarySet -block DocumentSummary '
        f'-if Accession -equals {shlex.quote(accession)} -element summary'
    )

    # Execute the command
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

    # Check if the command was successful
    if result.returncode == 0:
        # Return the output
        return result.stdout.strip()
    else:
        # Raise an error with the stderr output
        raise Exception(f"Error: {result.stderr}")

# Example usage:
study_summary = get_study_summary("GSE268034")

def IdentifyContrasts():
    prompt = f"""

### IDENTITY AND PURPOSE

You are an expert in bioinformatics. You advise on the most scientifically valuable experiments that can be performed, and have a deep awareness of DEG analysis tools, such as limma and edgeR.

Your task is to study the provided information, and determine what contrasts would be interesting to study.

### STEPS

1. You will be given input sample metadata. The crux of the decision making should be based on this.
2. You will be given some input information about a "merged column" called "merged_analysis_group". You should focus on the values in this column. However, the information will also detail where the merged values are derived from, so you can use this information as well.
3. You will be provided information about the dataset summary. Use this to inform about the scientific purpose of the dataset.
4. Having considered and digested the input information, carefully decide what the most valuable contrasts to analyse will be. Keep in mind the following guidelines:
- The values you specify should be derived ONLY from the merged column
- The contrasts you analyse should have scientific value, and not simply be "control experiments"
- The contrasts should be focussed and have a clear defined purpose
- Here are some examples of how to structure the contrasts:
    - If the samples to be compared are, for example "Treatment X vs. Y in genotype A samples", the output should be "X_A, Y_A" (where X_A refers to the EXACT value in the merged_analysis_group column)
    - If the samples to be compared are, for example "Treatment X vs. Y", the output should be "X_A, X_B, Y_A, Y_B". 
5. Once you have produced the output, double check that:
- You have considered the correct column
- The values you have stated are derived from the correct column


### OUTPUT

- Assign a name for each contrast
- State the values required to correctly analyse each contrast. These values must EXACTLY match the value in the merged_analysis_group column
- Describe what the contrast is investigating
- Justify why the contrast is scientifically valuable

### INPUTS

Sample metadata: {cleaned_meta_json}
Information about merged columns: {col_merge_info}
Dataset summary: {study_summary}


"""
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format = AllAnalysisContrasts
        )
    result = chat_completion.choices[0].message.parsed
    print("Generated tokens: ", chat_completion.usage.completion_tokens)
    print("Prompt tokens: ", chat_completion.usage.prompt_tokens)
    print("Total tokens: ", chat_completion.usage.total_tokens)
    return result

In [129]:
contrasts_data = IdentifyContrasts()

def contrasts_to_dataframe(all_contrasts: AllAnalysisContrasts) -> pd.DataFrame:
    """
    Converts AllAnalysisContrasts to a pandas DataFrame.
    
    Args:
        all_contrasts (AllAnalysisContrasts): The contrasts data.
    
    Returns:
        pd.DataFrame: DataFrame containing contrast details.
    """
    # Extract contrast details into a list of dictionaries
    contrast_dicts = [
        {
            "Contrast Name": contrast.name,
            "Group 1": contrast.values[0],
            "Group 2": contrast.values[1],
            "Justification": contrast.justification,
            "Description": contrast.description
        }
        for contrast in all_contrasts.contrasts
    ]
    
    # Create DataFrame
    df_contrasts = pd.DataFrame(contrast_dicts)
    
    return df_contrasts

# Generate the DataFrame
df_contrasts = contrasts_to_dataframe(contrasts_data)

# Display the DataFrame
print(df_contrasts)

Generated tokens:  357
Prompt tokens:  3744
Total tokens:  4101
                                        Contrast Name             Group 1  \
0                               DMSO vs RGFP966 in WT             WT_DMSO   
1                    DMSO vs RGFP966 in GNAS knockout  GNAS_knockout_DMSO   
2  Comparison of GNAS knockout with treatment effects  GNAS_knockout_DMSO   

                      Group 2  \
0             WT_RGFP966_5_μM   
1  GNAS_knockout_RGFP966_5_μM   
2  GNAS_knockout_RGFP966_5_μM   

                                                                                                                                                                                                                      Justification  \
0  Understanding the effects of RGFP966 relative to the control (DMSO) in WT cells is crucial for establishing the efficacy of HDAC3 inhibition in this genotype and may inform the mechanistic basis related to the GNAS knockout.   
1    Since GNAS KO has been iden

In [130]:
df_contrasts

Unnamed: 0,Contrast Name,Group 1,Group 2,Justification,Description
0,DMSO vs RGFP966 in WT,WT_DMSO,WT_RGFP966_5_μM,Understanding the effects of RGFP966 relative to the control (DMSO) in WT cells is crucial for establishing the efficacy of HDAC3 inhibition in this genotype and may inform the mechanistic basis related to the GNAS knockout.,This contrast investigates the gene expression differences in wild-type (WT) cells treated with DMSO compared to those treated with RGFP966 (a selective HDAC3 inhibitor).
1,DMSO vs RGFP966 in GNAS knockout,GNAS_knockout_DMSO,GNAS_knockout_RGFP966_5_μM,"Since GNAS KO has been identified as a sensitizer to HDAC3 inhibition, analyzing the differences in gene expression in this context may elucidate molecular pathways driving sensitivity and potential therapeutic strategies.",This contrast explores the gene expression changes in GNAS knockout cells treated with DMSO versus those treated with RGFP966.
2,Comparison of GNAS knockout with treatment effects,GNAS_knockout_DMSO,GNAS_knockout_RGFP966_5_μM,"This comparative analysis will shed light on the differential responses to HDAC3 inhibition based on the GNAS genotype, possibly revealing new insights into biomarkers for treatment responsiveness in lymphomas.",This contrast compares gene expression profiles across both genotypes (WT and GNAS knockout) under both treatments (DMSO and RGFP966).


In [131]:
contrasts_data

AllAnalysisContrasts(contrasts=[Contrast(name='DMSO vs RGFP966 in WT', values=['WT_DMSO', 'WT_RGFP966_5_μM'], description='This contrast investigates the gene expression differences in wild-type (WT) cells treated with DMSO compared to those treated with RGFP966 (a selective HDAC3 inhibitor).', justification='Understanding the effects of RGFP966 relative to the control (DMSO) in WT cells is crucial for establishing the efficacy of HDAC3 inhibition in this genotype and may inform the mechanistic basis related to the GNAS knockout.'), Contrast(name='DMSO vs RGFP966 in GNAS knockout', values=['GNAS_knockout_DMSO', 'GNAS_knockout_RGFP966_5_μM'], description='This contrast explores the gene expression changes in GNAS knockout cells treated with DMSO versus those treated with RGFP966.', justification='Since GNAS KO has been identified as a sensitizer to HDAC3 inhibition, analyzing the differences in gene expression in this context may elucidate molecular pathways driving sensitivity and pote

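The above code won't work if a contrast involves more than two groups (four, for example), because `contrasts_to_dataframe` only reads `values[0]` and `values[1]`. That will need to change. For the moment I will proceed and see whether this is sufficient information to continue.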
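A minimal sketch of how the table could cope with any number of groups per contrast: instead of fixed "Group 1" and "Group 2" columns, all values are joined into one field. `contrast_rows` is a hypothetical helper, and it takes plain dicts with the same keys as the `Contrast` model, so it runs without the API. The group names are the placeholders from the prompt, not real data.

```python
def contrast_rows(contrasts):
    """Turn contrast dicts into table rows without assuming exactly two groups."""
    rows = []
    for contrast in contrasts:
        values = list(contrast["values"])
        rows.append({
            "Contrast Name": contrast["name"],
            "Groups": ", ".join(values),   # all groups, however many there are
            "N groups": len(values),
            "Justification": contrast["justification"],
            "Description": contrast["description"],
        })
    return rows


example = [
    {"name": "X vs Y in A", "values": ["X_A", "Y_A"],
     "description": "placeholder", "justification": "placeholder"},
    {"name": "X vs Y", "values": ["X_A", "X_B", "Y_A", "Y_B"],
     "description": "placeholder", "justification": "placeholder"},
]

rows = contrast_rows(example)
for row in rows:
    print(row["Contrast Name"], "->", row["Groups"])
```

The rows can be passed straight to `pd.DataFrame(rows)`, and the input could come from something like `[c.dict() for c in contrasts_data.contrasts]`.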

### Use identified contrasts to analyse data

The above allowed me to identify contrasts that would be of interest. Now the task is to see how to use this to perform the DEG analysis.
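Before using the suggested contrasts, it may be worth checking that every value the model returned really occurs in the `merged_analysis_group` column. Visually near-identical characters can be different code points (for example the micro sign U+00B5 and the Greek letter mu U+03BC), and an exact string match catches that. The helper name `find_unknown_values` and the toy inputs are assumptions for illustration:

```python
def find_unknown_values(contrasts, valid_groups):
    """Map each contrast name to the values that are not valid group labels."""
    valid = set(valid_groups)
    problems = {}
    for contrast in contrasts:
        unknown = [v for v in contrast["values"] if v not in valid]
        if unknown:
            problems[contrast["name"]] = unknown
    return problems


# Group labels as they would appear in the metadata (micro sign, U+00B5)
groups = ["WT_DMSO", "WT_RGFP966_5_\u00b5M"]

suggested = [
    {"name": "exact", "values": ["WT_DMSO", "WT_RGFP966_5_\u00b5M"]},
    # Same label, but written with the Greek letter mu (U+03BC)
    {"name": "lookalike", "values": ["WT_DMSO", "WT_RGFP966_5_\u03bcM"]},
]

problems = find_unknown_values(suggested, groups)
print(problems)
```

In the notebook, `valid_groups` would be `cleaned_df["merged_analysis_group"].unique()`; a non-empty result would be a reason to re-prompt or normalise the labels before going further.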

The main challenge I encountered previously was setting up the contrast matrix, but I'll see how I might be able to use structured outputs to bypass this...

The steps in a DEG analysis are:

(Phase 1)
- to read in the Kallisto quantification files
- to read in the appropriate index
- to read in the metadata (in this case, it's already loaded, very convenient)
- use the above to generate the DGEList object

(Phase 2)
- Perform filtering and normalisation

(Phase 3)
- Construct the design and contrast matrices

(Phase 4)
- Perform the DEG experiment


#### Generating DGEList object

In [95]:
# Regurgitate the function to find files... 

def get_files(directory, suffix):
    """
    Recursively lists all files in a given directory and its subdirectories that end with the specified suffix,
    returning their absolute paths.

    Parameters:
    directory (str): The path to the directory to search in.
    suffix (str): The file suffix to look for (e.g., 'fastq.gz').

    Returns:
    list: A list of absolute file paths that match the given suffix.
    """
    # os.walk silently yields nothing for a missing directory, so check explicitly
    if not os.path.isdir(directory):
        print(f"Directory '{directory}' not found.")
        return []

    matched_files = []
    # Walk through directory and subdirectories
    for root, _, files in os.walk(directory):
        for f in files:
            if f.endswith(suffix):
                # abspath keeps the result absolute even if `directory` is relative
                matched_files.append(os.path.abspath(os.path.join(root, f)))

    return matched_files

In [99]:
# We will use this to find the Kallisto quantification tsvs, as well as the tx2gene files

directory = "/home/myuser/work/notebooks/3_analyse_data/output"
suffix = "abundance.tsv"
abundance_files = get_files(directory, suffix) # just for my own sanity I didn't print the output, but I can see it was able to find all the files
SRA_IDs = pd.read_table("/home/myuser/work/notebooks/2_extract_data/GSE268034_data/sra_ids.txt") # I suspect I will need this to link the FASTQ files to the sample metadata
tx2gene_files = get_files(directory = "/home/myuser/work/data/kallisto_indices/",
                          suffix = ".txt")

In [105]:
import pandas as pd

def link_data(df1, df2, file_paths):
    """
    Links two data frames and a list of file paths based on shared identifiers.

    Parameters:
        df1 (pd.DataFrame): First data frame containing 'sample_ID', 'experiment', and 'SRA_ID'.
        df2 (pd.DataFrame): Second data frame containing detailed metadata, with 'geo_accession' and SRA links.
        file_paths (list): List of file paths containing SRA IDs.

    Returns:
        pd.DataFrame: Merged data frame containing metadata from both data frames and corresponding file paths.
    """

    # Create a dictionary to map SRA IDs to file paths
    sra_file_dict = {path.split('/')[-2]: path for path in file_paths}

    # Add the file paths to df1 based on SRA_ID
    df1['file_path'] = df1['SRA_ID'].map(sra_file_dict)

    # Merge df1 and df2 based on matching GEO accession IDs
    merged_df = df1.merge(df2, left_on='sample_ID', right_on='geo_accession', how='inner')

    # Optional: Drop redundant columns if needed
    merged_df.drop(columns=['geo_accession'], inplace=True)

    return merged_df

# Example usage:
# Assuming df1 and df2 are defined as the first and second data frames, and file_paths is the list provided
linked_data = link_data(SRA_IDs, cleaned_df, abundance_files)

# This will do for the moment, though I wonder how robust it will be...
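On the robustness question: `link_data` maps SRA IDs to files and then does an inner merge, so an ID without a matching file, or a sample without matching metadata, may go unnoticed. A minimal sketch of an explicit check is below. It assumes, as `link_data` does, that each `abundance.tsv` sits in a directory named after its SRA ID. The helper name `map_sra_to_files` and the example IDs and paths are hypothetical.

```python
import os


def map_sra_to_files(sra_ids, file_paths):
    """Pair each SRA ID with its abundance file and report the IDs left over."""
    # The parent directory name is taken to be the SRA ID
    by_dir = {os.path.basename(os.path.dirname(p)): p for p in file_paths}
    matched = {sra: by_dir[sra] for sra in sra_ids if sra in by_dir}
    missing = [sra for sra in sra_ids if sra not in by_dir]
    return matched, missing


ids = ["SRR0000001", "SRR0000002", "SRR0000003"]
paths = [
    "/tmp/output/SRR0000001/abundance.tsv",
    "/tmp/output/SRR0000002/abundance.tsv",
]

matched, missing = map_sra_to_files(ids, paths)
print(len(matched), "matched;", "missing:", missing)
```

Raising an error when `missing` is non-empty would stop the R step from running on an incomplete sample set.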

In [122]:
import pandas as pd
import subprocess
import tempfile
import os

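# Paths
# NOTE: this takes the second .txt file that get_files returned, which depends on directory traversal order
tx2gene_path = tx2gene_files[1]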

abundance_files = linked_data['file_path'].tolist()

# analysis group
analysis_group = "merged_analysis_group"

# Export metadata to a temporary CSV file for R to read
with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.csv') as tmp_meta:
    metadata_path = tmp_meta.name
    linked_data.to_csv(metadata_path, index=False)

# Create a temporary R script file
with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.R') as tmp_r_script:
    r_script_path = tmp_r_script.name
    r_script = f"""
    library(tximport)
    library(tidyverse)
    library(edgeR)

    # Read tx2gene
    tx2gene <- read_tsv("{tx2gene_path}", col_names = FALSE) %>%
      dplyr::select(1, 3) %>%
      drop_na()

    # Define abundance files
    files <- c({', '.join([f'"{file}"' for file in abundance_files])})

    # Import data using tximport
    kallisto <- tximport(files = files,
                        type = "kallisto",
                        tx2gene = tx2gene,
                        ignoreAfterBar = TRUE,
                        countsFromAbundance = "lengthScaledTPM")

    # Read metadata
    meta <- read.csv("{metadata_path}", row.names = 1)

    # Create DGEList
    DGE <- DGEList(counts = kallisto$counts,
                  samples = meta)

    keep.exprs <- filterByExpr(DGE, group = DGE$samples${analysis_group})
    DGE.filtered <- DGE[keep.exprs, keep.lib.sizes = FALSE]

    # Normalize
    DGE.final <- calcNormFactors(DGE.filtered)
    
    # Optionally, you can add more R code here to perform downstream analysis
    # For example, saving the DGE object
    saveRDS(DGE.final, file = "DGE.RDS")
    """

    tmp_r_script.write(r_script)
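The `files <- c(...)` line in the R script is built by wrapping each path in double quotes, which would break if a path ever contained a quote or a backslash. A small sketch of a safer way to render the vector is below. It leans on `json.dumps` for the escaping, on the assumption that JSON string escapes are also valid in R string literals for ordinary file paths. `r_character_vector` is a hypothetical helper name.

```python
import json


def r_character_vector(values):
    """Render a Python list of strings as an R c("...", "...") expression."""
    # json.dumps adds the surrounding double quotes and escapes " and \
    quoted = [json.dumps(str(v), ensure_ascii=False) for v in values]
    return "c(" + ", ".join(quoted) + ")"


plain = r_character_vector(["/data/a/abundance.tsv", "/data/b/abundance.tsv"])
tricky = r_character_vector(['/data/odd"name/abundance.tsv'])
print(plain)
print(tricky)
```

In the script template this would replace the inline join, for example `files <- {r_character_vector(abundance_files)}`.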

In [123]:
try:
    result = subprocess.run(
        ["Rscript", r_script_path],
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True
    )
    print("R script executed successfully.")
    print("Output:", result.stdout)
except subprocess.CalledProcessError as e:
    print("Error executing R script.")
    print("Error message:", e.stderr)
finally:
    # Clean up temporary files if desired
    os.remove(r_script_path)
    os.remove(metadata_path)

R script executed successfully.
Output: 


#### Constructing design and contrast matrices

This is the tricky step.

The end goal is something that looks like this:

design <- model.matrix(data = DGE.final$samples,
                       ~0 + genotype_clean)
colnames(design) <- str_remove_all(colnames(design),
                                   "genotype_clean")
contrast.matrix <- makeContrasts(
    KO = "GNASknockout - WT",
    levels = colnames(design))

The part that was causing strife when I tried initially was the makeContrasts. However, with structured outputs, I hope I can perhaps make something out of this...

So what's the plan going to be? I think I give it example expressions in the prompt and just get it to fill in the contrast expressions for `makeContrasts`.
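Whatever expressions come back, they only work in `makeContrasts` if every term is a column name of the design matrix. A minimal sketch of a pre-check is below. It assumes group names contain only word characters (which is what `clean_string` leaves behind) and treats bare numbers such as the 2 in an average as allowed. `unknown_terms` and the level names are illustrative assumptions.

```python
import re


def unknown_terms(expression, levels):
    """Return the terms of a contrast expression that are not design levels."""
    known = set(levels)
    terms = re.findall(r"\w+", expression)  # operators and brackets are skipped
    return [t for t in terms if t not in known and not t.isdigit()]


levels = ["WT_DMSO", "WT_RGFP966", "GNAS_knockout_DMSO", "GNAS_knockout_RGFP966"]

good = unknown_terms(
    "(GNAS_knockout_RGFP966 - GNAS_knockout_DMSO) - (WT_RGFP966 - WT_DMSO)", levels)
averaged = unknown_terms("(WT_DMSO + WT_RGFP966)/2 - GNAS_knockout_DMSO", levels)
bad = unknown_terms("GNASknockout - WT", levels)

print(good, averaged, bad)
```

An empty list means the expression only refers to known levels. It does not check that the arithmetic itself is a sensible contrast.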

In [139]:
cleaned_meta_json = cleaned_df.to_json(orient='records', lines=False, indent=2) # serialise to JSON

class Expressions(BaseModel):
    name: str = Field(..., description = "Name of contrast to perform")
    expressions: str = Field(..., description = "Expressions representing contrasts")

class ContrastMatrix(BaseModel):
    contrasts: list[Expressions]

def GenerateContrastExpressions():
    prompt = f"""

### IDENTITY AND PURPOSE

You are an expert in bioinformatics. You advise on the most scientifically valuable experiments that can be performed, and have a deep awareness of DEG analysis tools, such as limma and edgeR.

Your task is to study the provided information, and determine the expressions to use to construct the contrast matrix.

### STEPS

1. You will be given input information about the contrasts to use. Make note of the description of the contrast, as well as the values
2. For each suggested contrast, state a simple name to represent it (e.g. TreatmentInKO). The fewer characters the better, however it should still be informative.
3. For each suggested contrast, use an expression to represent it. The expression must only use values, exactly as written, indicated in the information about contrasts. Note that this expression MUST be compatible with the makeContrasts function. See below for some examples:
"GNASknockout - WT"
"(GNASknockout_A - GNASknockout_B) - (WT_A - WT_B)"


### OUTPUT

- State a simple name for each contrast
- State an appropriate expression for each contrast

### INPUTS

Contrast information: {contrasts_data}


"""
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="gpt-4o-mini",
        response_format = ContrastMatrix
        )
    result = chat_completion.choices[0].message.parsed
    print("Generated tokens: ", chat_completion.usage.completion_tokens)
    print("Prompt tokens: ", chat_completion.usage.prompt_tokens)
    print("Total tokens: ", chat_completion.usage.total_tokens)
    return result

In [140]:
exprs = GenerateContrastExpressions()

Generated tokens:  128
Prompt tokens:  718
Total tokens:  846


In [143]:
json_input = """
{
    "ContrastMatrix": {
        "contrasts": [
            {
                "name": "DMSOvsRGFP966_WT",
                "expressions": "WT_RGFP966_5_μM - WT_DMSO"
            },
            {
                "name": "DMSOvsRGFP966_GNAS",
                "expressions": "GNAS_knockout_RGFP966_5_μM - GNAS_knockout_DMSO"
            },
            {
                "name": "GNASKO_TreatmentComparison",
                "expressions": "(GNAS_knockout_RGFP966_5_μM - GNAS_knockout_DMSO) - (WT_RGFP966_5_μM - WT_DMSO)"
            }
        ]
    }
}
"""

contrasts_json = json.loads(json_input)
contrasts = contrasts_json.get("ContrastMatrix", {}).get("contrasts", [])

In [151]:
contrasts


[{'name': 'DMSOvsRGFP966_WT', 'expressions': 'WT_RGFP966_5_μM - WT_DMSO'},
 {'name': 'DMSOvsRGFP966_GNAS',
  'expressions': 'GNAS_knockout_RGFP966_5_μM - GNAS_knockout_DMSO'},
 {'name': 'GNASKO_TreatmentComparison',
  'expressions': '(GNAS_knockout_RGFP966_5_μM - GNAS_knockout_DMSO) - (WT_RGFP966_5_μM - WT_DMSO)'}]

In [152]:
exprs.dict()['contrasts']

[{'name': 'DMSOvsRGFP966_WT', 'expressions': 'WT_RGFP966_5_μM - WT_DMSO'},
 {'name': 'DMSOvsRGFP966_GNAS',
  'expressions': 'GNAS_knockout_RGFP966_5_μM - GNAS_knockout_DMSO'},
 {'name': 'GNASKO_TreatmentComparison',
  'expressions': '(GNAS_knockout_RGFP966_5_μM - GNAS_knockout_DMSO) - (WT_RGFP966_5_μM - WT_DMSO)'}]
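With the expressions available as a list of dicts, the remaining step would be to turn them into the `makeContrasts` call for the R script. The sketch below mirrors the layout of the target R code shown earlier. `render_make_contrasts` is a hypothetical helper, the name check is a deliberately strict ASCII-only rule of my own rather than R's full naming rules, and the example contrast is a placeholder.

```python
import json
import re

# Conservative rule for contrast names: a letter, then letters, digits, dots or underscores
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9._]*$")


def render_make_contrasts(contrasts, levels_expr="colnames(design)"):
    """Build the R source for a makeContrasts call from name/expression pairs."""
    lines = ["contrast.matrix <- makeContrasts("]
    for contrast in contrasts:
        name = contrast["name"]
        if not VALID_NAME.match(name):
            raise ValueError(f"contrast name is not a safe R name: {name!r}")
        # json.dumps adds the double quotes around the expression string
        expression = json.dumps(contrast["expressions"], ensure_ascii=False)
        lines.append(f"    {name} = {expression},")
    lines.append(f"    levels = {levels_expr})")
    return "\n".join(lines)


r_code = render_make_contrasts([
    {"name": "TreatmentInWT", "expressions": "WT_RGFP966 - WT_DMSO"},
])
print(r_code)
```

The input shape matches `exprs.dict()['contrasts']`, so the rendered string could be appended to the R script template once the group names have been checked against the design matrix columns.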