# Exploring results data

The goal of this notebook is to explore the results data and identify the best RAG approach by analyzing the metric scores.

For metrics, we are using:
- faithfulness
- answer_relevancy
- context_utilization

Each question in each experiment has a score for each metric. We analyze the scores in two ways to find the best approach:

- Average the scores for each metric at the experiment level and analyze the results, then create an overall score that is the mean of the three metric scores and analyze that as well.
- Average the scores for each metric at the question level and analyze the results, then drop the questions with the lowest scores and repeat the first approach on the remaining data.

**First steps:**

Before the analysis, we need to take care of:
- importing the libraries needed for EDA
- loading the data
- checking the data
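
The checking step can go a little further than `head()`. The following is a minimal sketch, assuming the column names that appear in the results file; the helper `check_results` and the small sample frame are hypothetical and only illustrate the idea:

```python
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_utilization"]


def check_results(df: pd.DataFrame) -> dict:
    """Return a few basic sanity figures for a results dataframe (hypothetical helper)."""
    values = df[METRICS]
    # A value passes if it lies in [0, 1] or is missing (missing values are counted separately)
    in_range = ((values >= 0) & (values <= 1)) | values.isna()
    return {
        "rows": len(df),
        "experiments": df["experiment_name"].nunique(),
        "questions": df["question"].nunique(),
        # Number of missing values per metric column
        "missing": values.isna().sum().to_dict(),
        "in_range": bool(in_range.all().all()),
    }


# Hypothetical sample data, not taken from results.csv
sample = pd.DataFrame(
    {
        "experiment_name": ["exp-a", "exp-a", "exp-b", "exp-b"],
        "question": ["q1", "q2", "q1", "q2"],
        "faithfulness": [0.7, 0.9, None, 1.0],
        "answer_relevancy": [0.6, 0.8, 0.5, 0.9],
        "context_utilization": [1.0, 1.0, 0.5, 0.8],
    }
)

report = check_results(sample)
print(report)
```

Missing metric values matter here because `mean` skips them by default, which can quietly change the averages computed later.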

In [74]:
# Importing needed libraries
import pandas as pd

In [75]:
# Reading the results.csv file
dataframe = pd.read_csv("/home/bojan/Work/mixture-of-rags/results/results.csv")

In [76]:
# Checking the dataframe
dataframe.head()

Unnamed: 0,experiment_name,trace_id,question,answer,faithfulness,answer_relevancy,context_utilization
0,mixture-rag-claude-3-haiku-thought,5d7ae2d3-f2b8-4840-b877-69165f991599,How can attention be described in the Transfor...,The response from the second model provides th...,0.727273,0.723033,1.0
1,mixture-rag-claude-3-haiku-thought,aa2067f5-33f7-4d70-b4c9-f1752084c8ae,What is Mixture of Agents?,The response from the third model provides the...,0.555556,0.466129,0.805556
2,mixture-rag-claude-3-haiku-thought,cefa79c4-cba0-4961-bc87-005e2c2b8837,Is Mixtral based on the idea of a mixture of e...,"Based on the provided responses, the best resp...",0.75,0.636265,1.0
3,mixture-rag-claude-3-haiku-thought,8f2ee9a4-72d8-4956-8131-fa0ed9bce4a0,What is sliding window attention?,The response from the first model provides the...,0.571429,0.691174,1.0
4,mixture-rag-claude-3-haiku-thought,584e89e1-cc11-4101-8c96-f10cb725fa15,How many stages are there in the development o...,The response from the second model provides th...,1.0,0.938562,1.0


## Analysis based on the first approach

The steps for the first approach are:
- Create a copy of the data
- Calculate the average of the three metrics for each row (one question in one experiment)
- Check that the scores were created correctly
- Group by experiment, average the metrics and the new score, and sort the result by each metric in turn
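
The "check that the scores were created correctly" step can be done by recomputing the score by hand for a few rows. A small sketch, using the metric values of the first two rows from the `dataframe.head()` output; the variable names are illustrative only:

```python
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_utilization"]

# Metric values of the first two rows, copied from the head() output
rows = pd.DataFrame(
    {
        "faithfulness": [0.727273, 0.555556],
        "answer_relevancy": [0.723033, 0.466129],
        "context_utilization": [1.0, 0.805556],
    }
)

# Row-wise mean, the same operation the notebook uses for the score
rows["score"] = rows[METRICS].mean(axis=1)

# Manual recomputation: sum of the three metrics divided by 3
manual = (
    rows["faithfulness"] + rows["answer_relevancy"] + rows["context_utilization"]
) / 3

# Largest absolute difference between the two computations
max_diff = (rows["score"] - manual).abs().max()
print(rows["score"].tolist(), max_diff)
```

The two computations should agree up to floating-point rounding, and the results should match the `score` column of the corresponding rows.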

In [77]:
# Creating a copy of the dataframe
dataframe_1 = dataframe.copy()

In [78]:
# Creating a score for each row by calculating the mean of the scores for each row (faithfulness, answer_relevancy, context_utilization)
dataframe_1["score"] = dataframe_1[
    ["faithfulness", "answer_relevancy", "context_utilization"]
].mean(axis=1)

In [79]:
# Checking the new dataframe
dataframe_1.head()

Unnamed: 0,experiment_name,trace_id,question,answer,faithfulness,answer_relevancy,context_utilization,score
0,mixture-rag-claude-3-haiku-thought,5d7ae2d3-f2b8-4840-b877-69165f991599,How can attention be described in the Transfor...,The response from the second model provides th...,0.727273,0.723033,1.0,0.816768
1,mixture-rag-claude-3-haiku-thought,aa2067f5-33f7-4d70-b4c9-f1752084c8ae,What is Mixture of Agents?,The response from the third model provides the...,0.555556,0.466129,0.805556,0.60908
2,mixture-rag-claude-3-haiku-thought,cefa79c4-cba0-4961-bc87-005e2c2b8837,Is Mixtral based on the idea of a mixture of e...,"Based on the provided responses, the best resp...",0.75,0.636265,1.0,0.795422
3,mixture-rag-claude-3-haiku-thought,8f2ee9a4-72d8-4956-8131-fa0ed9bce4a0,What is sliding window attention?,The response from the first model provides the...,0.571429,0.691174,1.0,0.754201
4,mixture-rag-claude-3-haiku-thought,584e89e1-cc11-4101-8c96-f10cb725fa15,How many stages are there in the development o...,The response from the second model provides th...,1.0,0.938562,1.0,0.979521


In [80]:
# Grouping the dataframe by experiment_name and calculating the mean of the scores for each experiment
dataframe_1_mean = (
    dataframe_1.drop(columns=["trace_id", "question", "answer"])
    .groupby("experiment_name")
    .mean()
)

In [81]:
# Displaying the dataframe sorted by faithfulness in descending order
dataframe_1_mean.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-llama-3.1-8b,0.922222,0.792426,0.781746,0.832132
simple-rag-llama-3.1-405b-instruct,0.905762,0.841026,0.80754,0.851443
simple-rag-llama-3.1-70b-instruct,0.903128,0.839752,0.805556,0.849478
simple-rag-llama-3-8b,0.887117,0.808932,0.809524,0.835191
simple-rag-gemma-7b-it,0.87346,0.83471,0.791667,0.833279
simple-rag-gpt-4o,0.873352,0.852122,0.825397,0.85029
mixture-rag-gemma2-9b-it-thought,0.864067,0.857216,0.799603,0.840295
simple-rag-claude-3-opus,0.836947,0.860946,0.718254,0.805382
simple-rag-mixtral-8x7b-instruct,0.835374,0.781165,0.837302,0.817947
simple-rag-llama-3-70b,0.817743,0.865189,0.78373,0.822221


In [82]:
# Displaying the dataframe sorted by answer_relevancy in descending order
dataframe_1_mean.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-gemma2-9b-it,0.814856,0.888774,0.78373,0.82912
simple-rag-gpt-4o-mini,0.780492,0.885133,0.813492,0.826372
simple-rag-claude-3.5-sonnet,0.798158,0.870635,0.77381,0.814201
simple-rag-llama-3-70b,0.817743,0.865189,0.78373,0.822221
simple-rag-claude-3-opus,0.836947,0.860946,0.718254,0.805382
mixture-rag-gemma2-9b-it-thought,0.864067,0.857216,0.799603,0.840295
simple-rag-gpt-4o,0.873352,0.852122,0.825397,0.85029
simple-rag-mistral-7b-instruct,0.798987,0.847659,0.819444,0.82203
simple-rag-gpt-4-turbo,0.744952,0.847259,0.771825,0.788012
mixture-rag-llama3.1-8b-instruct-modified,0.661541,0.843879,0.811508,0.772309


In [83]:
# Displaying the dataframe sorted by context_utilization in descending order
dataframe_1_mean.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mixtral-8x7b-instruct,0.835374,0.781165,0.837302,0.817947
simple-rag-gpt-4o,0.873352,0.852122,0.825397,0.85029
mixture-rag-mixtral-8x7-instruct,0.702459,0.795491,0.823413,0.773787
simple-rag-mistral-7b-instruct,0.798987,0.847659,0.819444,0.82203
simple-rag-gpt-4o-mini,0.780492,0.885133,0.813492,0.826372
mixture-rag-gemma2-9b-it-modified,0.768318,0.819483,0.813492,0.800431
mixture-rag-llama3.1-8b-instruct-modified,0.661541,0.843879,0.811508,0.772309
simple-rag-llama-3-8b,0.887117,0.808932,0.809524,0.835191
simple-rag-llama-3.1-405b-instruct,0.905762,0.841026,0.80754,0.851443
simple-rag-llama-3.1-70b-instruct,0.903128,0.839752,0.805556,0.849478


In [84]:
# Displaying the dataframe sorted by score (mean of all the metric scores at the experiment level) in descending order
dataframe_1_mean.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-llama-3.1-405b-instruct,0.905762,0.841026,0.80754,0.851443
simple-rag-gpt-4o,0.873352,0.852122,0.825397,0.85029
simple-rag-llama-3.1-70b-instruct,0.903128,0.839752,0.805556,0.849478
mixture-rag-gemma2-9b-it-thought,0.864067,0.857216,0.799603,0.840295
simple-rag-llama-3-8b,0.887117,0.808932,0.809524,0.835191
simple-rag-gemma-7b-it,0.87346,0.83471,0.791667,0.833279
simple-rag-llama-3.1-8b,0.922222,0.792426,0.781746,0.832132
simple-rag-gemma2-9b-it,0.814856,0.888774,0.78373,0.82912
simple-rag-gpt-4o-mini,0.780492,0.885133,0.813492,0.826372
simple-rag-llama-3-70b,0.817743,0.865189,0.78373,0.822221


## Analysis based on the second approach

The steps for the second approach are:
- Create a copy of the data
- Calculate the average of the three metrics for each row (one question in one experiment)
- Check that the scores were created correctly
- Group the rows by question and calculate the average score for each question
- Drop the 4 questions with the lowest scores
- Repeat the steps of the first approach on the remaining data
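
The questions to drop can also be selected programmatically instead of being listed by hand. A minimal sketch under stated assumptions: the helper `drop_lowest_questions` and the sample data are hypothetical, and `nsmallest` is used here as one possible way to pick the lowest-scoring questions:

```python
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_utilization"]


def drop_lowest_questions(df: pd.DataFrame, n: int):
    """Remove the n questions with the lowest mean score across all experiments."""
    # Score per row: mean of the three metrics
    row_score = df[METRICS].mean(axis=1)
    # Score per question: mean of the row scores over all experiments
    question_score = row_score.groupby(df["question"]).mean()
    lowest = question_score.nsmallest(n).index
    return df[~df["question"].isin(lowest)], list(lowest)


# Hypothetical sample data, not taken from results.csv
sample = pd.DataFrame(
    {
        "experiment_name": ["exp-a"] * 3 + ["exp-b"] * 3,
        "question": ["q1", "q2", "q3"] * 2,
        "faithfulness": [0.9, 0.2, 0.8, 1.0, 0.4, 0.6],
        "answer_relevancy": [0.9, 0.3, 0.7, 0.8, 0.2, 0.9],
        "context_utilization": [1.0, 0.5, 1.0, 1.0, 0.5, 0.5],
    }
)

filtered, dropped = drop_lowest_questions(sample, n=1)
print(dropped)
```

For this notebook the call would use `n=4`. Selecting the questions from the data keeps the exclusion list in sync if the results file changes.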

In [85]:
# Creating a copy of the dataframe
dataframe_2 = dataframe.copy()

In [86]:
# Creating a score for each row by calculating the mean of the scores for each row (faithfulness, answer_relevancy, context_utilization)
dataframe_2["score"] = dataframe_2[
    ["faithfulness", "answer_relevancy", "context_utilization"]
].mean(axis=1)

In [87]:
# Checking the new dataframe
dataframe_2.head()

Unnamed: 0,experiment_name,trace_id,question,answer,faithfulness,answer_relevancy,context_utilization,score
0,mixture-rag-claude-3-haiku-thought,5d7ae2d3-f2b8-4840-b877-69165f991599,How can attention be described in the Transfor...,The response from the second model provides th...,0.727273,0.723033,1.0,0.816768
1,mixture-rag-claude-3-haiku-thought,aa2067f5-33f7-4d70-b4c9-f1752084c8ae,What is Mixture of Agents?,The response from the third model provides the...,0.555556,0.466129,0.805556,0.60908
2,mixture-rag-claude-3-haiku-thought,cefa79c4-cba0-4961-bc87-005e2c2b8837,Is Mixtral based on the idea of a mixture of e...,"Based on the provided responses, the best resp...",0.75,0.636265,1.0,0.795422
3,mixture-rag-claude-3-haiku-thought,8f2ee9a4-72d8-4956-8131-fa0ed9bce4a0,What is sliding window attention?,The response from the first model provides the...,0.571429,0.691174,1.0,0.754201
4,mixture-rag-claude-3-haiku-thought,584e89e1-cc11-4101-8c96-f10cb725fa15,How many stages are there in the development o...,The response from the second model provides th...,1.0,0.938562,1.0,0.979521


In [88]:
# Creating a new dataframe by grouping the dataframe by question and calculating the mean of the scores for each question
dataframe_2_mean = (
    dataframe_2.drop(columns=["trace_id", "answer", "experiment_name"])
    .groupby("question")
    .mean()
)

In [89]:
# Displaying the dataframe sorted by score in descending order
dataframe_2_mean.sort_values(by="score", ascending=False)

question,faithfulness,answer_relevancy,context_utilization,score
How many stages are there in the development of the Llama 3 model?,0.922808,0.894326,1.0,0.939045
Does Claude 3 models have vision capabilities?,0.937603,0.973932,0.866071,0.925869
Can the GPT-4 model accept both text and image inputs?,0.862446,0.917552,0.875,0.884999
On what architecture the Gemma model is based on?,0.602325,0.984635,1.0,0.86232
What is the difference between the Llama 2 and Llama 2-Chat ?,0.81339,0.946063,0.814484,0.857979
Is Mixtral based on the idea of a mixture of experts?,0.877241,0.688604,1.0,0.855282
How many stages of training are in the GPT model?,0.805057,0.739909,1.0,0.848322
What tokenizer is used in the Gemma2 model?,0.886317,0.970879,0.5,0.785732
What is Mixture of Agents?,0.858851,0.587887,0.865079,0.770606
What are the two tasks in BERT?,0.656663,0.936283,0.712302,0.768416


In [90]:
# Creating a copy of the dataframe
dataframe_3 = dataframe.copy()

In [91]:
# Excluding the 4 lowest-scoring questions from the dataframe
questions_to_exclude = [
    "What is optimizer is used for LLaMA?",
    "On what architecture the GPT-3 model is based on?",
    "What is sliding window attention?",
    "How can attention be described in the Transformer?",
]

dataframe_3_filtered = dataframe_3[~dataframe_3["question"].isin(questions_to_exclude)]

In [92]:
# Creating a dataframe with mean values for the scores for each experiment
dataframe_3_mean = (
    dataframe_3_filtered.drop(columns=["trace_id", "question", "answer"])
    .groupby("experiment_name")
    .mean()
)

In [93]:
# Creating a score for each experiment by calculating the mean of its three metric averages (faithfulness, answer_relevancy, context_utilization)
dataframe_3_mean["score"] = dataframe_3_mean[
    ["faithfulness", "answer_relevancy", "context_utilization"]
].mean(axis=1)
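
In the first approach the score is computed per row and then averaged per experiment, whereas here the metrics are averaged per experiment first and the score is computed from those averages. When no metric values are missing, both orders give the same number, because every row contributes exactly three values. A small sketch on hypothetical data that illustrates this:

```python
import numpy as np
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_utilization"]

# Hypothetical sample data without missing values
sample = pd.DataFrame(
    {
        "experiment_name": ["exp-a", "exp-a", "exp-b", "exp-b"],
        "faithfulness": [0.7, 0.9, 0.5, 1.0],
        "answer_relevancy": [0.6, 0.8, 0.5, 0.9],
        "context_utilization": [1.0, 1.0, 0.5, 0.8],
    }
)

# Order 1: score per row first, then the mean per experiment
row_first = (
    sample.assign(score=sample[METRICS].mean(axis=1))
    .groupby("experiment_name")["score"]
    .mean()
)

# Order 2: mean per experiment first, then the score across the metric means
group_first = sample.groupby("experiment_name")[METRICS].mean().mean(axis=1)

same = bool(np.allclose(row_first, group_first))
print(same)
```

If some metric values were missing, the two orders could differ, since `mean` skips missing values and rows would no longer carry equal weight.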

In [94]:
# Displaying the dataframe sorted by faithfulness in descending order
dataframe_3_mean.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-llama-3.1-70b-instruct,0.961231,0.844946,0.863889,0.890022
simple-rag-llama-3.1-8b,0.957778,0.822676,0.880556,0.887003
simple-rag-llama-3.1-405b-instruct,0.945641,0.846877,0.897222,0.89658
mixture-rag-gemma2-9b-it-thought,0.924542,0.910476,0.880556,0.905191
simple-rag-gemma-7b-it,0.923677,0.863669,0.875,0.887449
simple-rag-llama-3-8b,0.913214,0.856165,0.875,0.88146
simple-rag-llama-3-70b,0.901136,0.885328,0.863889,0.883451
simple-rag-mixtral-8x7b-instruct,0.896447,0.884369,0.908333,0.896383
simple-rag-gpt-4o,0.895355,0.884128,0.897222,0.892235
mixture-rag-mixtral-8x7-instruct-thought,0.892727,0.794075,0.833333,0.840045


In [95]:
# Displaying the dataframe sorted by answer_relevancy in descending order
dataframe_3_mean.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-gpt-4o-mini,0.851786,0.918347,0.9,0.890044
simple-rag-mistral-7b-instruct,0.878027,0.914597,0.908333,0.900319
mixture-rag-gemma2-9b-it-thought,0.924542,0.910476,0.880556,0.905191
simple-rag-claude-3.5-sonnet,0.82184,0.90533,0.825,0.850723
simple-rag-gemma2-9b-it,0.846212,0.905305,0.863889,0.871802
mixture-rag-llama3.1-8b-instruct-thought,0.82,0.897726,0.838889,0.852205
simple-rag-claude-3-opus,0.867106,0.891054,0.772222,0.843461
simple-rag-llama-3-70b,0.901136,0.885328,0.863889,0.883451
simple-rag-mixtral-8x7b-instruct,0.896447,0.884369,0.908333,0.896383
simple-rag-gpt-4o,0.895355,0.884128,0.897222,0.892235


In [96]:
# Displaying the dataframe sorted by context_utilization in descending order
dataframe_3_mean.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-llama3.1-8b-instruct,0.735352,0.806652,0.916667,0.819557
mixture-rag-mixtral-8x7-instruct-modified,0.882197,0.859517,0.916667,0.886127
mixture-rag-mixtral-8x7-instruct,0.74575,0.845767,0.913889,0.835136
simple-rag-mixtral-8x7b-instruct,0.896447,0.884369,0.908333,0.896383
simple-rag-mistral-7b-instruct,0.878027,0.914597,0.908333,0.900319
simple-rag-gpt-4o-mini,0.851786,0.918347,0.9,0.890044
mixture-rag-llama3.1-8b-instruct-modified,0.709821,0.871686,0.897222,0.826243
simple-rag-llama-3.1-405b-instruct,0.945641,0.846877,0.897222,0.89658
simple-rag-gpt-4o,0.895355,0.884128,0.897222,0.892235
mixture-rag-gemma2-9b-it-modified,0.825208,0.867729,0.880556,0.857831


In [97]:
# Displaying the dataframe sorted by score (mean of all the metric scores at the experiment level) in descending order
dataframe_3_mean.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-gemma2-9b-it-thought,0.924542,0.910476,0.880556,0.905191
simple-rag-mistral-7b-instruct,0.878027,0.914597,0.908333,0.900319
simple-rag-llama-3.1-405b-instruct,0.945641,0.846877,0.897222,0.89658
simple-rag-mixtral-8x7b-instruct,0.896447,0.884369,0.908333,0.896383
simple-rag-gpt-4o,0.895355,0.884128,0.897222,0.892235
simple-rag-gpt-4o-mini,0.851786,0.918347,0.9,0.890044
simple-rag-llama-3.1-70b-instruct,0.961231,0.844946,0.863889,0.890022
simple-rag-gemma-7b-it,0.923677,0.863669,0.875,0.887449
simple-rag-llama-3.1-8b,0.957778,0.822676,0.880556,0.887003
mixture-rag-mixtral-8x7-instruct-modified,0.882197,0.859517,0.916667,0.886127
