# Evaluation of LLM Agent against Ground Truth Labels for Spatial Clustering Methods
## Introduction 
This research project seeks to answer the following question: "Is it possible to create an agent that can understand and summarise the literature on spatial clustering methods employed in the field of spatial transcriptomics?" To answer it, I built my own LLM agent and compared it against other existing LLM models. The agent was given two tools: RAG (Retrieval-Augmented Generation) and Tavily, a specialised web search engine. Additional system prompts were used to guide the nature of the agent's responses. The agent was evaluated by comparing its results against a large spreadsheet of metrics and categories for each spatial clustering method, filled in by human experts. An ablation was performed to analyse the impact of each tool, and finally the agent was compared against the end-user application of ChatGPT.

## Method
The LLM agent was developed using the LangChain framework, which provides flexible abstractions for the development process. LangChain makes it easy to integrate new features, such as embedding models for RAG, and its AgentExecutor class provides an interface that helps the agent choose a relevant tool when forming a response.

### Model Selection
Google Gemini Flash was chosen because it was free to use for a large number of tokens for educational purposes.

### System Prompt
The system prompt tells the model that it is an assistant capable of answering questions on various topics, from simple questions to in-depth discussions. It is also told to act as a spatial transcriptomics expert and to keep its answers technical but concise.

Additional system prompts include the definitions of the categories for each spatial algorithm, as determined by human experts, which can be seen [here](https://docs.google.com/spreadsheets/d/1P1-Nw0i_MpLoE8he1H7ZT-acYd4jOgDPrKZBxR-L6dw/edit?gid=0#gid=0). These cover the clustering method (Bayesian, Graph-based, Autoencoder, Centroid, Hierarchical-based), scalability, assumptions, input data, programming language, and the metrics and simulations employed in the article. See the `prompts.py` file for the specific details.

### Information Review Sheet
The categories of the spatial information review sheet were filled in by two domain experts. In addition to the discrete categories, several columns of the sheet contain the experts' assessments of each algorithm on aspects such as realism, additional datasets, and general comments.

### RAG Process
The RAG process begins by loading a scientific article, supplied as a PDF file, with PyPDFLoader. The text is then broken into chunks using a RecursiveCharacterTextSplitter to improve processing efficiency, and each chunk is converted to a vector by an embedding model. The vectors are stored in a Chroma vector database, from which the LLM can fetch them via a document retriever. A document retriever attempts to find the information most relevant to a query (in this case, the user's question). To improve the RAG process, an ensemble retriever was used, consisting of two retrievers. The first performs a vector similarity search over the Chroma vector database, returning the chunks whose vectors lie closest to the query in a high-dimensional space. The second is a BM25 retriever, which ranks chunks by keyword relevance to the query using the BM25 ("Best Matching 25") scoring function.
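
To make the two retrieval signals concrete, the following is a minimal pure-Python sketch of BM25 scoring and of one common way to merge two rankings, weighted reciprocal rank fusion. It is not the project's LangChain code: the chunks, the query, the `dense_ranking` list (a hypothetical stand-in for what a vector search might return), the equal weights, and the constants `k1`, `b` and `c` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with the Okapi BM25 formula."""
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n
    # Document frequency: number of chunks that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(scores):
    """Chunk indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, weights, c=60):
    """Merge rankings; each one contributes weight / (c + position)."""
    fused = Counter()
    for ranking, weight in zip(rankings, weights):
        for position, idx in enumerate(ranking, start=1):
            fused[idx] += weight / (c + position)
    return [idx for idx, _ in fused.most_common()]


# Toy chunks, not taken from any real article.
chunks = [
    "the method builds a spatial neighbour graph before clustering",
    "runtime grows linearly with the number of spots",
    "the package is implemented in python",
]
query = "which programming language is the package implemented in"

bm25_ranking = rank(bm25_scores(query, chunks))
dense_ranking = [2, 0, 1]  # hypothetical output of a vector similarity search
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

In this toy case both signals place the third chunk first, so it also leads the fused ranking. The benefit of an ensemble shows up when the two rankings disagree, for example when a chunk shares few keywords with the query but is semantically close to it.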

### Tavily Search
The Tavily Search API is a web search engine API that an LLM can call during the processing stage. It searches the web for additional relevant webpages using the context of the user query. This helps fill in gaps and allows the agent to infer certain details when the article alone is insufficient to answer a question (e.g. accessing the source code of the algorithm via a public GitHub repository).


## Evaluation Method
With the setup described above, five different methods were evaluated:

1. The Agent with all tools available.
2. The Agent without Tavily Search.
3. The Agent without document retrievers.
4. The Agent with no tools available.
5. ChatGPT's end-user interface.

In each of these methods, a scientific article on a spatial clustering method was provided, along with a user prompt that tells the agent to categorise the spatial clustering method across all the fields described in the System Prompt section, using the following structure:

`<Category>,<Answer>`

Each result was stored as a csv file under the file structure `<method_name>/<article_name>.csv`.

This process was repeated until every method had a csv for every article.
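
As an illustration of this storage step, the sketch below parses a response in the `<Category>,<Answer>` structure and writes it to `<method_name>/<article_name>.csv`. The helper names `parse_response` and `save_response`, the example response, and the choice to skip stray Markdown fences are assumptions made for this sketch, not the project's actual code.

```python
import csv
import tempfile
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence that a model may wrap around its answer


def parse_response(text):
    """Turn '<Category>,<Answer>' lines into (category, answer) pairs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and stray Markdown fences.
        if not line or line.startswith(FENCE):
            continue
        # Split on the first comma only, so answers may contain commas.
        category, _, answer = line.partition(",")
        rows.append((category.strip(), answer.strip()))
    return rows


def save_response(text, out_root, method_name, article_name):
    """Write the parsed rows to <out_root>/<method_name>/<article_name>.csv."""
    path = Path(out_root) / method_name / f"{article_name}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(parse_response(text))
    return path


# Made-up response for a hypothetical article called "toy_method".
example = f"Algorithm,toy_method\nPython,Yes\nR,No\n{FENCE}"
with tempfile.TemporaryDirectory() as tmp:
    saved = save_response(example, tmp, "agent_out", "toy_method")
    print(saved.read_text())
```

Filtering out fence lines at write time is one option; another is to keep the raw output and clean it when the files are loaded for analysis.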

# Analysis
With the outputs of every method collected, the analysis code follows.

## Data Loading
Dataframes were created by accessing each directory and reading each individual csv file. Each csv was transposed and appended to a dataframe one row at a time. Once all csvs had been read and appended, an additional column specifying the LLM method was added to the dataframe.

The spatial clustering information sheet was read directly from Google Sheets, restricted to the ranges that hold discrete values. Because the sheet is still under review, duplicate entries were removed from the dataset.
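
The two ideas in this section, transposing a category-per-row csv into a single row and dropping duplicated entries, can be seen on a toy input. The csv text below is made up and `toy_method` is a hypothetical algorithm name; only the pandas calls mirror those used in the notebook.

```python
import io

import pandas as pd

# Made-up stand-in for one LLM output file.
raw = (
    "Algorithm, toy_method\n"
    "Clustering method, Graph-based\n"
    "Programming language, Python\n"
)

# index_col=0 turns the categories into the index, so after transposing
# they become the columns and the algorithm becomes the single row.
one = pd.read_csv(io.StringIO(raw), index_col=0, skipinitialspace=True).transpose()
one.index = one.index.str.strip()
print(one)

# Simulate a duplicated entry and keep only its first occurrence.
stacked = pd.concat([one, one])
deduped = stacked[~stacked.index.duplicated(keep="first")]
print(len(stacked), len(deduped))
```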

In [1]:
import pandas as pd 
import os
from matplotlib import pyplot as plt

def load_sheet(folder):
    '''Given a folder, load the csv files for all clustering methods
    folder: str of folder name
    returns: A dataframe containing all LLM predictions for clustering categories'''
    def read_one(name):
        # Each csv holds one category per row; transpose so that the
        # categories become columns and the algorithm becomes the row.
        one = pd.read_csv(os.path.join(folder, name), index_col= 0, skipinitialspace= True).transpose()
        one.index = one.index.str.strip()
        return one

    # banksy.csv is read first, so its columns come first in the result.
    first = "banksy.csv"
    df = read_one(first)

    # os.listdir returns files in arbitrary order, so banksy.csv is skipped
    # by name rather than by position to avoid loading it twice.
    for sheet in sorted(os.listdir(folder)):
        if not sheet.endswith(".csv") or sheet == first:
            continue
        try:
            df = pd.concat([df, read_one(sheet)], axis = 0)
        except Exception as e:
            print(f"Error loading {folder}/{sheet}: {e}")
            break

    # Drop the trailing column produced by a stray Markdown fence in the LLM output.
    if df.columns[-1] == "```":
        df = df.iloc[:,:-1]
    df['Source'] = folder
    return df

def load_google_sheet():
    '''Loads the spatial clustering review information sheet from google Sheets
    returns: A dataframe containing all labels'''
    gsheetid = "1P1-Nw0i_MpLoE8he1H7ZT-acYd4jOgDPrKZBxR-L6dw"
    sheet_name = "Methods" 
    online_sheet = f"https://docs.google.com/spreadsheets/d/{gsheetid}/gviz/tq?tqx=out:csv&sheet={sheet_name}&range=A2:A,I2:Y,AA2:AE,AH2:AW,AX2:BI,BL2:BQ,BS2:BU,BW2:CO,CQ2:CR,CT2:CX"
    info = pd.read_csv(online_sheet, index_col = 0)
    info = info.dropna(subset = ["Parameter testing"])
    info['Source'] = "Truth"

    return info

In [2]:
agent = load_sheet("agent_out")
gpt = load_sheet("gpt_out")
pdfs = load_sheet("pdf_out")
search = load_sheet("search_out")
truth = load_google_sheet()
truth = truth[~truth.index.duplicated(keep='first')]
truth.reset_index(inplace= True)

## Comparisons 
To compare the information sheet with the agent output, the column names of the two dataframes were matched so that they could be aligned, and the rows were filtered by method.

Correctness of the agent was measured by counting the number of entries that match between the ground truth and the agent output.
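
The counting idea can be checked on a toy example. The two small dataframes below are made up, and the algorithm names `toy_a` and `toy_b` are hypothetical. Unlike the notebook's function, this sketch aligns rows by setting the algorithm as the index and takes the percentage over the category columns only; both are choices made for illustration.

```python
import pandas as pd

truth = pd.DataFrame({
    "Algorithm": ["toy_a", "toy_b"],
    "R": ["No", "Yes"],
    "Python": ["Yes", "No"],
    "Graph-based": ["Yes", "No"],
})
pred = pd.DataFrame({
    "Algorithm": ["toy_b", "toy_a"],
    "R": ["Yes", "No"],
    "Python": ["No", "Yes"],
    "Graph-based": ["No", "No"],
})

# Align rows on the algorithm name so that eq compares like with like.
truth_i = truth.set_index("Algorithm").sort_index()
pred_i = pred.set_index("Algorithm").sort_index()

# True where the prediction equals the label, False otherwise.
matches = pred_i.eq(truth_i)
summary = pd.DataFrame({
    "Correct_Terms": matches.sum(axis=1),
    "Correct_Percent": 100 * matches.mean(axis=1),
})
print(summary)
```

Here `toy_a` matches on two of three categories and `toy_b` on all three. Because the comparison is an exact string match, answers that differ only in spelling or capitalisation count as incorrect.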

In [4]:
llm_df = pd.concat([agent, gpt, search, pdfs])
llm_df.reset_index(inplace = True)
llm_df.rename(columns = {"index":"Algorithm"}, inplace = True)
llm_df.columns.name = None

llm_df.columns = truth.columns

def filter_and_compare(method):
    '''Joins a method with a specified category and the ground truth dataframe together and 
    performs a comparison, by calculating the number of matching terms.
    
    method (str): Method to filter on 
    return: A dataframe containing the correctness of each algorithm'''
    agent_filter = llm_df[llm_df['Source'] == method]

    truth_filter = truth[truth['Algorithm'].isin(agent_filter['Algorithm'])]
    agent_filter = agent_filter[agent_filter['Algorithm'].isin(truth['Algorithm'])]

    agent_filter = agent_filter.sort_values(by= 'Algorithm').reset_index(drop = True)
    truth_filter = truth_filter.sort_values(by= 'Algorithm').reset_index(drop = True)

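    # Element-wise equality; both frames share the same row order and columns.
    compare = agent_filter.eq(truth_filter)
    compare['Algorithm'] = agent_filter['Algorithm']
    compare['Correct_Terms'] = compare.drop(columns='Algorithm').sum(axis = 1)
    # Note: compare.shape[1] also counts the Algorithm, Source and Correct_Terms
    # columns, so this percentage is slightly lower than the share of matching categories.
    compare['Correct_Percent'] = 100* compare['Correct_Terms'] / compare.shape[1]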
    return compare

filter_and_compare("agent_out")

| | Algorithm | Package | Bioconductor | CRAN | Vignette | Vignette with examples from different technologies | R | Python | C/C++ | Centroid-based | ... | Realistic | Unrealistic | Simulation included | Scalability assessment | Accuracy assessment | Stress testing | Parameter testing | Source | Correct_Terms | Correct_Percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BASS | True | True | True | True | False | True | True | True | True | ... | True | False | True | False | True | False | False | False | 52 | 59.090909 |
| 1 | DeepST | True | True | True | False | True | True | True | True | True | ... | True | True | True | False | False | True | True | False | 66 | 75.0 |
| 2 | GraphST | True | True | True | False | True | True | True | True | True | ... | True | True | True | False | False | True | False | False | 64 | 72.727273 |
| 3 | SEDR | True | True | True | False | False | True | True | True | True | ... | True | True | True | True | False | True | True | False | 64 | 72.727273 |
| 4 | STAGATE | True | True | True | False | True | True | True | True | True | ... | True | True | True | True | False | True | False | False | 60 | 68.181818 |
| 5 | SpaGCN | True | True | True | False | False | True | True | True | False | ... | True | True | True | False | False | True | False | False | 63 | 71.590909 |
| 6 | SpaceFlow | True | True | True | False | False | True | True | True | True | ... | True | True | True | False | False | True | True | False | 67 | 76.136364 |
| 7 | SpatialPCA | True | True | True | False | True | True | True | True | True | ... | True | True | True | False | True | False | True | False | 65 | 73.863636 |
