# Exploring results data

This notebook explores the results data to find the best RAG approach by analyzing the metric scores.

For metrics, we are using:
- faithfulness
- answer_relevancy
- context_utilization

Each question in each experiment has a score for each metric. We analyze the scores in two ways to find the best approach:

- Average the scores for each metric at the experiment level and analyze the results, then create a combined score that is the mean of the three metric scores and analyze the results again.
- Average the scores for each metric at the question level and analyze the results, then drop the questions with the lowest scores and repeat the first approach on the remaining data.

In addition, we analyze the simple RAG approach and the mixture RAG approach separately and then together.

**First steps:**

Before the analysis, we need to complete the following:
- importing the libraries needed for EDA
- loading the data
- checking the data
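
Beyond `head()`, a few quick checks can confirm that the results loaded as expected. The sketch below is only an illustration: the helper `check_results` is hypothetical, and the toy dataframe (experiment names `exp-a`/`exp-b` and all values) is invented. Only the column names mirror the results file.

```python
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_utilization"]


def check_results(df: pd.DataFrame) -> dict:
    """Return a few sanity-check summaries for a results dataframe."""
    in_range = df[METRICS].apply(lambda s: s.between(0, 1))
    return {
        # Number of rows and columns
        "shape": df.shape,
        # Count of missing values per metric column
        "missing": df[METRICS].isna().sum().to_dict(),
        # Number of distinct questions answered in each experiment
        "questions_per_experiment": df.groupby("experiment_name")["question"]
        .nunique()
        .to_dict(),
        # True only if every metric value lies in [0, 1] (NaN counts as out of range)
        "metrics_in_unit_range": bool(in_range.all().all()),
    }


# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "experiment_name": ["exp-a", "exp-a", "exp-b", "exp-b"],
        "question": ["q1", "q2", "q1", "q2"],
        "faithfulness": [0.9, 0.5, 0.7, None],
        "answer_relevancy": [0.8, 0.6, 0.9, 0.4],
        "context_utilization": [1.0, 0.5, 1.0, 0.0],
    }
)

summary = check_results(toy)
print(summary)
```

A missing metric value matters here because `mean()` skips NaN by default, so an experiment with failed evaluations would be averaged over fewer questions than the others.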

In [1]:
# Importing needed libraries
import pandas as pd

In [2]:
# Reading the results.csv file
dataframe = pd.read_csv("/home/bojan/Work/mixture-of-rags/results/results.csv")

In [5]:
# Checking the dataframe
dataframe.head()

,experiment_name,trace_id,question,answer,faithfulness,answer_relevancy,context_utilization
0,mixture-rag-claude-3-haiku-modified-corrected,d980a6a9-e283-462e-9085-41d751d9c169,How can attention be described in the Transfor...,Based on the responses provided by the three s...,0.652174,0.752435,0.805556
1,mixture-rag-claude-3-haiku-modified-corrected,9dbcf03e-dadc-47d6-9014-fca44bbc7bf2,What is Mixture of Agents?,Based on the responses provided by the three s...,0.71875,0.474152,1.0
2,mixture-rag-claude-3-haiku-modified-corrected,45d3440f-d2fd-4e6b-bf0f-13dd9ba11f71,Is Mixtral based on the idea of a mixture of e...,Based on the responses provided by the three s...,0.4375,0.554454,1.0
3,mixture-rag-claude-3-haiku-modified-corrected,2069fead-57dc-40e1-b9e5-28b5cfff59a8,What is sliding window attention?,Based on the responses provided by the three s...,0.703704,0.742605,1.0
4,mixture-rag-claude-3-haiku-modified-corrected,9a8b095d-64a6-4fb7-af35-4ecae3207c98,How many stages are there in the development o...,Based on the responses provided by the three s...,0.5625,0.921537,1.0


## Analysis based on the first approach

The steps for the first approach are:
- Create a copy of the data
- Calculate the average of the three metric scores for each row (one question in one experiment)
- Check that the scores were created correctly
- Group by experiment, average the metrics and the new score, and sort the results by each metric in turn

In [6]:
# Creating a copy of the dataframe
dataframe_1 = dataframe.copy()

In [7]:
# Creating a score for each row by calculating the mean of the scores for each row (faithfulness, answer_relevancy, context_utilization)
dataframe_1["score"] = dataframe_1[
    ["faithfulness", "answer_relevancy", "context_utilization"]
].mean(axis=1)

In [8]:
# Checking the new dataframe
dataframe_1.head()

,experiment_name,trace_id,question,answer,faithfulness,answer_relevancy,context_utilization,score
0,mixture-rag-claude-3-haiku-modified-corrected,d980a6a9-e283-462e-9085-41d751d9c169,How can attention be described in the Transfor...,Based on the responses provided by the three s...,0.652174,0.752435,0.805556,0.736722
1,mixture-rag-claude-3-haiku-modified-corrected,9dbcf03e-dadc-47d6-9014-fca44bbc7bf2,What is Mixture of Agents?,Based on the responses provided by the three s...,0.71875,0.474152,1.0,0.730967
2,mixture-rag-claude-3-haiku-modified-corrected,45d3440f-d2fd-4e6b-bf0f-13dd9ba11f71,Is Mixtral based on the idea of a mixture of e...,Based on the responses provided by the three s...,0.4375,0.554454,1.0,0.663985
3,mixture-rag-claude-3-haiku-modified-corrected,2069fead-57dc-40e1-b9e5-28b5cfff59a8,What is sliding window attention?,Based on the responses provided by the three s...,0.703704,0.742605,1.0,0.815436
4,mixture-rag-claude-3-haiku-modified-corrected,9a8b095d-64a6-4fb7-af35-4ecae3207c98,How many stages are there in the development o...,Based on the responses provided by the three s...,0.5625,0.921537,1.0,0.828012


In [9]:
# Grouping the dataframe by experiment_name and calculating the mean of the scores for each experiment
dataframe_1_mean = (
    dataframe_1.drop(columns=["trace_id", "question", "answer"])
    .groupby("experiment_name")
    .mean()
)

In [10]:
# Splitting the experiment-level means into separate dataframes for simple RAG and mixture RAG
dataframe_1_mean_simple = dataframe_1_mean[
    dataframe_1_mean.index.str.contains("simple")
]
dataframe_1_mean_mixture = dataframe_1_mean[
    dataframe_1_mean.index.str.contains("mixture")
]
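
The per-metric sorted tables that follow can also be condensed into one rank table. The sketch below is an optional alternative, not part of the original analysis: it uses invented experiment-level means (names `exp-a` to `exp-c` are placeholders) and `DataFrame.rank` to give each experiment a rank per metric, plus an average rank as one possible summary.

```python
import pandas as pd

# Toy experiment-level means, invented for illustration only
means = pd.DataFrame(
    {
        "faithfulness": [0.90, 0.80, 0.70],
        "answer_relevancy": [0.70, 0.90, 0.80],
        "context_utilization": [0.85, 0.75, 0.80],
    },
    index=pd.Index(["exp-a", "exp-b", "exp-c"], name="experiment_name"),
)

# Rank 1 = highest value in each metric column; ties share the lowest rank
ranks = means.rank(ascending=False, method="min").astype(int)

# Average rank across the three metrics (lower is better)
ranks["mean_rank"] = ranks.mean(axis=1)

print(ranks.sort_values(by="mean_rank"))
```

Ranks discard how large the gaps between experiments are, so a table like this complements the mean-based `score` rather than replacing it.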

**Simple RAG Results**

In [11]:
# Displaying the simple RAG results sorted by faithfulness in descending order
dataframe_1_mean_simple.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-llama-3.1-8b-instruct-corrected,0.912961,0.849348,0.788889,0.850399
simple-rag-llama-3.1-405b-instruct-corrected,0.908344,0.812093,0.801852,0.840763
simple-rag-llama-3.1-70b-instructed-corrected,0.875456,0.825337,0.812963,0.837918
simple-rag-llama-3-70b-instruct-corrected,0.853268,0.852571,0.792593,0.832811
simple-rag-claude-3-opus-corrected,0.844025,0.856592,0.781481,0.827366
simple-rag-llama-3-8b-instruct-corrected,0.836943,0.803975,0.755556,0.798825
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
simple-rag-gpt-4o-mini-corrected,0.822064,0.879556,0.787037,0.829552
simple-rag-mistral-7b-instruct-corrected,0.810449,0.872687,0.814815,0.83265


In [12]:
# Displaying the simple RAG results sorted by answer_relevancy in descending order
dataframe_1_mean_simple.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-gemma2-9b-it-corrected,0.803332,0.889445,0.8,0.830926
simple-rag-gpt-4o-mini-corrected,0.822064,0.879556,0.787037,0.829552
simple-rag-mistral-7b-instruct-corrected,0.810449,0.872687,0.814815,0.83265
simple-rag-claude-3.5-sonnet-corrected,0.755811,0.862756,0.77037,0.796313
simple-rag-claude-3-opus-corrected,0.844025,0.856592,0.781481,0.827366
simple-rag-llama-3-70b-instruct-corrected,0.853268,0.852571,0.792593,0.832811
simple-rag-gpt-4-turbo-corrected,0.804581,0.852001,0.803704,0.820095
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
simple-rag-llama-3.1-8b-instruct-corrected,0.912961,0.849348,0.788889,0.850399
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388


In [13]:
# Displaying the simple RAG results sorted by context_utilization in descending order
dataframe_1_mean_simple.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388
simple-rag-mistral-7b-instruct-corrected,0.810449,0.872687,0.814815,0.83265
simple-rag-gemma-7b-it-corrected,0.799832,0.840303,0.812963,0.817699
simple-rag-llama-3.1-70b-instructed-corrected,0.875456,0.825337,0.812963,0.837918
simple-rag-gpt-4-turbo-corrected,0.804581,0.852001,0.803704,0.820095
simple-rag-llama-3.1-405b-instruct-corrected,0.908344,0.812093,0.801852,0.840763
simple-rag-gemma2-9b-it-corrected,0.803332,0.889445,0.8,0.830926
simple-rag-llama-3-70b-instruct-corrected,0.853268,0.852571,0.792593,0.832811
simple-rag-llama-3.1-8b-instruct-corrected,0.912961,0.849348,0.788889,0.850399


In [14]:
# Displaying the simple RAG results sorted by score (mean of the three metric scores at the experiment level) in descending order
dataframe_1_mean_simple.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-llama-3.1-8b-instruct-corrected,0.912961,0.849348,0.788889,0.850399
simple-rag-llama-3.1-405b-instruct-corrected,0.908344,0.812093,0.801852,0.840763
simple-rag-llama-3.1-70b-instructed-corrected,0.875456,0.825337,0.812963,0.837918
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388
simple-rag-llama-3-70b-instruct-corrected,0.853268,0.852571,0.792593,0.832811
simple-rag-mistral-7b-instruct-corrected,0.810449,0.872687,0.814815,0.83265
simple-rag-gemma2-9b-it-corrected,0.803332,0.889445,0.8,0.830926
simple-rag-gpt-4o-mini-corrected,0.822064,0.879556,0.787037,0.829552
simple-rag-claude-3-opus-corrected,0.844025,0.856592,0.781481,0.827366


**Mixture RAG Results**

In [15]:
# Displaying the mixture RAG results sorted by faithfulness in descending order
dataframe_1_mean_mixture.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-mixtral-8x7-instruct-modified-corrected,0.813206,0.782839,0.740741,0.778929
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.803832,0.81144,0.775926,0.797066
mixture-rag-mixtral-8x7-instruct-corrected,0.774733,0.751905,0.785185,0.770608
mixture-rag-gemma2-9b-it-modified-corrected,0.744673,0.843607,0.822222,0.803501
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.692236,0.845941,0.777778,0.771985
mixture-rag-gemma2-9b-it-corrected,0.667174,0.841911,0.798148,0.769078
mixture-rag-gemma2-9b-it-thought-corrected,0.661687,0.770956,0.775926,0.73619
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.639369,0.824988,0.8,0.754786
mixture-rag-claude-3-haiku-corrected,0.621775,0.842339,0.822222,0.762112
mixture-rag-llama3.1-8b-instruct-corrected,0.618516,0.78796,0.775926,0.727467


In [16]:
# Displaying the mixture RAG results sorted by answer_relevancy in descending order
dataframe_1_mean_mixture.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.692236,0.845941,0.777778,0.771985
mixture-rag-gemma2-9b-it-modified-corrected,0.744673,0.843607,0.822222,0.803501
mixture-rag-claude-3-haiku-corrected,0.621775,0.842339,0.822222,0.762112
mixture-rag-gemma2-9b-it-corrected,0.667174,0.841911,0.798148,0.769078
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.639369,0.824988,0.8,0.754786
mixture-rag-claude-3-haiku-modified-corrected,0.588775,0.820608,0.762963,0.724115
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.803832,0.81144,0.775926,0.797066
mixture-rag-llama3.1-8b-instruct-corrected,0.618516,0.78796,0.775926,0.727467
mixture-rag-mixtral-8x7-instruct-modified-corrected,0.813206,0.782839,0.740741,0.778929
mixture-rag-gemma2-9b-it-thought-corrected,0.661687,0.770956,0.775926,0.73619


In [17]:
# Displaying the mixture RAG results sorted by context_utilization in descending order
dataframe_1_mean_mixture.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-claude-3-haiku-corrected,0.621775,0.842339,0.822222,0.762112
mixture-rag-gemma2-9b-it-modified-corrected,0.744673,0.843607,0.822222,0.803501
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.639369,0.824988,0.8,0.754786
mixture-rag-gemma2-9b-it-corrected,0.667174,0.841911,0.798148,0.769078
mixture-rag-claude-3-haiku-thought-corrected,0.57938,0.750864,0.794444,0.708229
mixture-rag-mixtral-8x7-instruct-corrected,0.774733,0.751905,0.785185,0.770608
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.692236,0.845941,0.777778,0.771985
mixture-rag-llama3.1-8b-instruct-corrected,0.618516,0.78796,0.775926,0.727467
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.803832,0.81144,0.775926,0.797066
mixture-rag-gemma2-9b-it-thought-corrected,0.661687,0.770956,0.775926,0.73619


In [18]:
# Displaying the mixture RAG results sorted by score (mean of the three metric scores at the experiment level) in descending order
dataframe_1_mean_mixture.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-gemma2-9b-it-modified-corrected,0.744673,0.843607,0.822222,0.803501
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.803832,0.81144,0.775926,0.797066
mixture-rag-mixtral-8x7-instruct-modified-corrected,0.813206,0.782839,0.740741,0.778929
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.692236,0.845941,0.777778,0.771985
mixture-rag-mixtral-8x7-instruct-corrected,0.774733,0.751905,0.785185,0.770608
mixture-rag-gemma2-9b-it-corrected,0.667174,0.841911,0.798148,0.769078
mixture-rag-claude-3-haiku-corrected,0.621775,0.842339,0.822222,0.762112
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.639369,0.824988,0.8,0.754786
mixture-rag-gemma2-9b-it-thought-corrected,0.661687,0.770956,0.775926,0.73619
mixture-rag-llama3.1-8b-instruct-corrected,0.618516,0.78796,0.775926,0.727467


**Combined RAG Results**

In [19]:
# Displaying all the results sorted by faithfulness in descending order
dataframe_1_mean.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-llama-3.1-8b-instruct-corrected,0.912961,0.849348,0.788889,0.850399
simple-rag-llama-3.1-405b-instruct-corrected,0.908344,0.812093,0.801852,0.840763
simple-rag-llama-3.1-70b-instructed-corrected,0.875456,0.825337,0.812963,0.837918
simple-rag-llama-3-70b-instruct-corrected,0.853268,0.852571,0.792593,0.832811
simple-rag-claude-3-opus-corrected,0.844025,0.856592,0.781481,0.827366
simple-rag-llama-3-8b-instruct-corrected,0.836943,0.803975,0.755556,0.798825
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
simple-rag-gpt-4o-mini-corrected,0.822064,0.879556,0.787037,0.829552
mixture-rag-mixtral-8x7-instruct-modified-corrected,0.813206,0.782839,0.740741,0.778929


In [20]:
# Displaying all the results sorted by answer_relevancy in descending order
dataframe_1_mean.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-gemma2-9b-it-corrected,0.803332,0.889445,0.8,0.830926
simple-rag-gpt-4o-mini-corrected,0.822064,0.879556,0.787037,0.829552
simple-rag-mistral-7b-instruct-corrected,0.810449,0.872687,0.814815,0.83265
simple-rag-claude-3.5-sonnet-corrected,0.755811,0.862756,0.77037,0.796313
simple-rag-claude-3-opus-corrected,0.844025,0.856592,0.781481,0.827366
simple-rag-llama-3-70b-instruct-corrected,0.853268,0.852571,0.792593,0.832811
simple-rag-gpt-4-turbo-corrected,0.804581,0.852001,0.803704,0.820095
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
simple-rag-llama-3.1-8b-instruct-corrected,0.912961,0.849348,0.788889,0.850399
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388


In [21]:
# Displaying all the results sorted by context_utilization in descending order
dataframe_1_mean.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
mixture-rag-claude-3-haiku-corrected,0.621775,0.842339,0.822222,0.762112
mixture-rag-gemma2-9b-it-modified-corrected,0.744673,0.843607,0.822222,0.803501
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388
simple-rag-mistral-7b-instruct-corrected,0.810449,0.872687,0.814815,0.83265
simple-rag-gemma-7b-it-corrected,0.799832,0.840303,0.812963,0.817699
simple-rag-llama-3.1-70b-instructed-corrected,0.875456,0.825337,0.812963,0.837918
simple-rag-gpt-4-turbo-corrected,0.804581,0.852001,0.803704,0.820095
simple-rag-llama-3.1-405b-instruct-corrected,0.908344,0.812093,0.801852,0.840763
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.639369,0.824988,0.8,0.754786


In [22]:
# Displaying all the results sorted by score (mean of the three metric scores at the experiment level) in descending order
dataframe_1_mean.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-llama-3.1-8b-instruct-corrected,0.912961,0.849348,0.788889,0.850399
simple-rag-llama-3.1-405b-instruct-corrected,0.908344,0.812093,0.801852,0.840763
simple-rag-llama-3.1-70b-instructed-corrected,0.875456,0.825337,0.812963,0.837918
simple-rag-mixtral-8x7b-instruct-corrected,0.829334,0.850844,0.831481,0.83722
simple-rag-gpt-4o-corrected,0.8348,0.847993,0.82037,0.834388
simple-rag-llama-3-70b-instruct-corrected,0.853268,0.852571,0.792593,0.832811
simple-rag-mistral-7b-instruct-corrected,0.810449,0.872687,0.814815,0.83265
simple-rag-gemma2-9b-it-corrected,0.803332,0.889445,0.8,0.830926
simple-rag-gpt-4o-mini-corrected,0.822064,0.879556,0.787037,0.829552
simple-rag-claude-3-opus-corrected,0.844025,0.856592,0.781481,0.827366


## Analysis based on the second approach

The steps for the second approach are:
- Create a copy of the data
- Calculate the average of the three metric scores for each row (one question in one experiment)
- Check that the scores were created correctly
- Group the rows by question and calculate the average score for each question
- Drop the 5 questions with the lowest scores
- Repeat the steps of the first approach on the remaining data
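
The questions to drop can also be picked programmatically rather than listed by hand. The sketch below is a minimal illustration on invented toy data (experiment names, question labels `q1` to `q3`, and all values are placeholders). It uses `Series.nsmallest` to find the lowest-scoring questions, and the number to drop, `n_to_drop`, is a free choice.

```python
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_utilization"]

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "experiment_name": ["exp-a"] * 3 + ["exp-b"] * 3,
        "question": ["q1", "q2", "q3"] * 2,
        "faithfulness": [0.9, 0.4, 0.8, 0.7, 0.2, 0.9],
        "answer_relevancy": [0.8, 0.5, 0.9, 0.9, 0.3, 0.7],
        "context_utilization": [1.0, 0.0, 1.0, 1.0, 0.5, 1.0],
    }
)

# Per-row score, then the mean score of each question across experiments
toy["score"] = toy[METRICS].mean(axis=1)
question_scores = toy.groupby("question")["score"].mean()

# Names of the n lowest-scoring questions
n_to_drop = 1
lowest = question_scores.nsmallest(n_to_drop).index.tolist()

# Keep only the rows whose question is not in the lowest-scoring set
filtered = toy[~toy["question"].isin(lowest)]

print(lowest)
print(filtered["question"].unique())
```

Deriving the list from the data keeps the exclusion in sync with the scores if the results file is regenerated.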

In [23]:
# Creating a copy of the dataframe
dataframe_2 = dataframe.copy()

In [24]:
# Creating a score for each row by calculating the mean of the scores for each row (faithfulness, answer_relevancy, context_utilization)
dataframe_2["score"] = dataframe_2[
    ["faithfulness", "answer_relevancy", "context_utilization"]
].mean(axis=1)

In [25]:
# Checking the new dataframe
dataframe_2.head()

,experiment_name,trace_id,question,answer,faithfulness,answer_relevancy,context_utilization,score
0,mixture-rag-claude-3-haiku-modified-corrected,d980a6a9-e283-462e-9085-41d751d9c169,How can attention be described in the Transfor...,Based on the responses provided by the three s...,0.652174,0.752435,0.805556,0.736722
1,mixture-rag-claude-3-haiku-modified-corrected,9dbcf03e-dadc-47d6-9014-fca44bbc7bf2,What is Mixture of Agents?,Based on the responses provided by the three s...,0.71875,0.474152,1.0,0.730967
2,mixture-rag-claude-3-haiku-modified-corrected,45d3440f-d2fd-4e6b-bf0f-13dd9ba11f71,Is Mixtral based on the idea of a mixture of e...,Based on the responses provided by the three s...,0.4375,0.554454,1.0,0.663985
3,mixture-rag-claude-3-haiku-modified-corrected,2069fead-57dc-40e1-b9e5-28b5cfff59a8,What is sliding window attention?,Based on the responses provided by the three s...,0.703704,0.742605,1.0,0.815436
4,mixture-rag-claude-3-haiku-modified-corrected,9a8b095d-64a6-4fb7-af35-4ecae3207c98,How many stages are there in the development o...,Based on the responses provided by the three s...,0.5625,0.921537,1.0,0.828012


In [26]:
# Creating a new dataframe by grouping the dataframe by question and calculating the mean of the scores for each question
dataframe_2_mean = (
    dataframe_2.drop(columns=["trace_id", "answer", "experiment_name"])
    .groupby("question")
    .mean()
)

In [27]:
# Displaying the question-level means sorted by score in descending order
dataframe_2_mean.sort_values(by="score", ascending=False)

question,faithfulness,answer_relevancy,context_utilization,score
How many stages are there in the development of the Llama 3 model?,0.901027,0.89534,1.0,0.932122
Does Claude 3 models have vision capabilities?,0.913888,0.986073,0.889881,0.929947
What is the GPT-4o model?,0.838046,0.805813,1.0,0.881287
On what architecture the Gemma model is based on?,0.590781,0.985012,0.994048,0.856614
How many stages of training are in the GPT model?,0.784514,0.770442,1.0,0.851652
What is the difference between the Llama 2 and Llama 2-Chat ?,0.81527,0.902376,0.803571,0.840406
Is Mixtral based on the idea of a mixture of experts?,0.835538,0.680172,1.0,0.83857
Can the GPT-4 model accept both text and image inputs?,0.736189,0.928237,0.805556,0.823327
What tokenizer is used in the Gemma2 model?,0.870547,0.976124,0.5,0.782223
What is Mixture of Agents?,0.873521,0.574426,0.853175,0.767041


In [28]:
# Creating a copy of the dataframe
dataframe_3 = dataframe.copy()

In [29]:
# Excluding the questions with the lowest average scores from the dataframe
questions_to_exclude = [
    "What is optimizer is used for LLaMA?",
    "On what architecture the GPT-3 model is based on?",
    "What is sliding window attention?",
    "How can attention be described in the Transformer?",
    "What are the two tasks in BERT?",
]

dataframe_3_filtered = dataframe_3[~dataframe_3["question"].isin(questions_to_exclude)]

In [30]:
# Creating a dataframe with mean values for the scores for each experiment
dataframe_3_mean = (
    dataframe_3_filtered.drop(columns=["trace_id", "question", "answer"])
    .groupby("experiment_name")
    .mean()
)

In [31]:
# Creating a score for each experiment by calculating the mean of its averaged metrics (faithfulness, answer_relevancy, context_utilization)
dataframe_3_mean["score"] = dataframe_3_mean[
    ["faithfulness", "answer_relevancy", "context_utilization"]
].mean(axis=1)
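
In the first approach the score is computed per row and then averaged per experiment, while here it is computed from the already averaged metrics. The sketch below, on invented toy data, illustrates that the two orders give the same result when no metric values are missing, because the mean is linear. With missing values the two orders can differ.

```python
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_utilization"]

# Toy data with no missing values, invented for illustration only
toy = pd.DataFrame(
    {
        "experiment_name": ["exp-a", "exp-a", "exp-b", "exp-b"],
        "faithfulness": [0.9, 0.5, 0.7, 0.3],
        "answer_relevancy": [0.8, 0.6, 0.9, 0.4],
        "context_utilization": [1.0, 0.5, 1.0, 0.0],
    }
)

# Order 1: per-row score first, then the mean per experiment
row_first = (
    toy.assign(score=toy[METRICS].mean(axis=1))
    .groupby("experiment_name")["score"]
    .mean()
)

# Order 2: mean per experiment first, then the score across metrics
group_first = toy.groupby("experiment_name")[METRICS].mean().mean(axis=1)

# Largest absolute difference between the two orders
max_diff = (row_first - group_first).abs().max()
print(max_diff)
```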

In [32]:
# Splitting the experiment-level means into separate dataframes for simple RAG and mixture RAG
dataframe_3_mean_simple = dataframe_3_mean[
    dataframe_3_mean.index.str.contains("simple")
]
dataframe_3_mean_mixture = dataframe_3_mean[
    dataframe_3_mean.index.str.contains("mixture")
]

**Simple RAG Results**

In [33]:
# Displaying the simple RAG results sorted by faithfulness in descending order
dataframe_3_mean_simple.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-gpt-4o-corrected,0.950794,0.868135,0.905556,0.908161
simple-rag-llama-3.1-405b-instruct-corrected,0.93585,0.789141,0.888889,0.871293
simple-rag-llama-3.1-8b-instruct-corrected,0.913799,0.889998,0.866667,0.890155
simple-rag-llama-3.1-70b-instructed-corrected,0.913709,0.829169,0.897222,0.880033
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
simple-rag-gemma-7b-it-corrected,0.902381,0.863156,0.897222,0.887586
simple-rag-gpt-4o-mini-corrected,0.872143,0.902027,0.888889,0.887686
simple-rag-llama-3-70b-instruct-corrected,0.869946,0.860975,0.897222,0.876048
simple-rag-llama-3-8b-instruct-corrected,0.862557,0.836121,0.875,0.857893
simple-rag-mixtral-8x7b-instruct-corrected,0.862047,0.87151,0.883333,0.872297


In [34]:
# Displaying the simple RAG results sorted by answer_relevancy in descending order
dataframe_3_mean_simple.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
simple-rag-gpt-4o-mini-corrected,0.872143,0.902027,0.888889,0.887686
simple-rag-gemma2-9b-it-corrected,0.841273,0.898397,0.883333,0.874335
simple-rag-llama-3.1-8b-instruct-corrected,0.913799,0.889998,0.866667,0.890155
simple-rag-claude-3.5-sonnet-corrected,0.808641,0.887503,0.863889,0.853344
simple-rag-mixtral-8x7b-instruct-corrected,0.862047,0.87151,0.883333,0.872297
simple-rag-claude-3-opus-corrected,0.861019,0.869271,0.847222,0.859171
simple-rag-claude-3-sonnet-corrected,0.857322,0.868577,0.822222,0.849374
simple-rag-gpt-4o-corrected,0.950794,0.868135,0.905556,0.908161
simple-rag-gpt-4-turbo-corrected,0.860575,0.866888,0.888889,0.872117


In [35]:
# Displaying the simple RAG results sorted by context_utilization in descending order
dataframe_3_mean_simple.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
simple-rag-gpt-4o-corrected,0.950794,0.868135,0.905556,0.908161
simple-rag-llama-3-70b-instruct-corrected,0.869946,0.860975,0.897222,0.876048
simple-rag-gemma-7b-it-corrected,0.902381,0.863156,0.897222,0.887586
simple-rag-llama-3.1-70b-instructed-corrected,0.913709,0.829169,0.897222,0.880033
simple-rag-claude-3-haiku-corrected,0.823547,0.863664,0.894444,0.860552
simple-rag-gpt-4o-mini-corrected,0.872143,0.902027,0.888889,0.887686
simple-rag-gpt-4-turbo-corrected,0.860575,0.866888,0.888889,0.872117
simple-rag-llama-3.1-405b-instruct-corrected,0.93585,0.789141,0.888889,0.871293
simple-rag-mixtral-8x7b-instruct-corrected,0.862047,0.87151,0.883333,0.872297


In [36]:
# Displaying the simple RAG results sorted by score (mean of the three metric scores at the experiment level) in descending order
dataframe_3_mean_simple.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
simple-rag-gpt-4o-corrected,0.950794,0.868135,0.905556,0.908161
simple-rag-llama-3.1-8b-instruct-corrected,0.913799,0.889998,0.866667,0.890155
simple-rag-gpt-4o-mini-corrected,0.872143,0.902027,0.888889,0.887686
simple-rag-gemma-7b-it-corrected,0.902381,0.863156,0.897222,0.887586
simple-rag-llama-3.1-70b-instructed-corrected,0.913709,0.829169,0.897222,0.880033
simple-rag-llama-3-70b-instruct-corrected,0.869946,0.860975,0.897222,0.876048
simple-rag-gemma2-9b-it-corrected,0.841273,0.898397,0.883333,0.874335
simple-rag-mixtral-8x7b-instruct-corrected,0.862047,0.87151,0.883333,0.872297
simple-rag-gpt-4-turbo-corrected,0.860575,0.866888,0.888889,0.872117


**Mixture RAG Results**

In [37]:
# Displaying the mixture RAG results sorted by faithfulness in descending order
dataframe_3_mean_mixture.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-mixtral-8x7-instruct-modified-corrected,0.868546,0.762932,0.822222,0.8179
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.866354,0.843956,0.897222,0.869177
mixture-rag-mixtral-8x7-instruct-corrected,0.818727,0.760277,0.880556,0.819853
mixture-rag-gemma2-9b-it-modified-corrected,0.817845,0.875354,0.916667,0.869955
mixture-rag-gemma2-9b-it-thought-corrected,0.80292,0.880448,0.897222,0.860197
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.790323,0.868344,0.9,0.852889
mixture-rag-gemma2-9b-it-corrected,0.716739,0.853304,0.880556,0.816866
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.682824,0.827895,0.908333,0.806351
mixture-rag-llama3.1-8b-instruct-corrected,0.613017,0.807634,0.866667,0.762439
mixture-rag-claude-3-haiku-modified-corrected,0.612593,0.837748,0.825,0.758447


In [38]:
# Displaying the mixture RAG results sorted by answer_relevancy in descending order
dataframe_3_mean_mixture.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-gemma2-9b-it-thought-corrected,0.80292,0.880448,0.897222,0.860197
mixture-rag-gemma2-9b-it-modified-corrected,0.817845,0.875354,0.916667,0.869955
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.790323,0.868344,0.9,0.852889
mixture-rag-claude-3-haiku-corrected,0.609874,0.858626,0.916667,0.795055
mixture-rag-gemma2-9b-it-corrected,0.716739,0.853304,0.880556,0.816866
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.866354,0.843956,0.897222,0.869177
mixture-rag-claude-3-haiku-modified-corrected,0.612593,0.837748,0.825,0.758447
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.682824,0.827895,0.908333,0.806351
mixture-rag-llama3.1-8b-instruct-corrected,0.613017,0.807634,0.866667,0.762439
mixture-rag-claude-3-haiku-thought-corrected,0.605736,0.766984,0.933333,0.768685


In [39]:
# Displaying the mixture RAG results sorted by context_utilization in descending order
dataframe_3_mean_mixture.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-claude-3-haiku-thought-corrected,0.605736,0.766984,0.933333,0.768685
mixture-rag-claude-3-haiku-corrected,0.609874,0.858626,0.916667,0.795055
mixture-rag-gemma2-9b-it-modified-corrected,0.817845,0.875354,0.916667,0.869955
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.682824,0.827895,0.908333,0.806351
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.790323,0.868344,0.9,0.852889
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.866354,0.843956,0.897222,0.869177
mixture-rag-gemma2-9b-it-thought-corrected,0.80292,0.880448,0.897222,0.860197
mixture-rag-gemma2-9b-it-corrected,0.716739,0.853304,0.880556,0.816866
mixture-rag-mixtral-8x7-instruct-corrected,0.818727,0.760277,0.880556,0.819853
mixture-rag-llama3.1-8b-instruct-corrected,0.613017,0.807634,0.866667,0.762439


In [40]:
# Displaying the mixture RAG results sorted by score (mean of the three metric scores at the experiment level) in descending order
dataframe_3_mean_mixture.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-gemma2-9b-it-modified-corrected,0.817845,0.875354,0.916667,0.869955
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.866354,0.843956,0.897222,0.869177
mixture-rag-gemma2-9b-it-thought-corrected,0.80292,0.880448,0.897222,0.860197
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.790323,0.868344,0.9,0.852889
mixture-rag-mixtral-8x7-instruct-corrected,0.818727,0.760277,0.880556,0.819853
mixture-rag-mixtral-8x7-instruct-modified-corrected,0.868546,0.762932,0.822222,0.8179
mixture-rag-gemma2-9b-it-corrected,0.716739,0.853304,0.880556,0.816866
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.682824,0.827895,0.908333,0.806351
mixture-rag-claude-3-haiku-corrected,0.609874,0.858626,0.916667,0.795055
mixture-rag-claude-3-haiku-thought-corrected,0.605736,0.766984,0.933333,0.768685


**Combined RAG Results**

In [41]:
# Displaying all the results sorted by faithfulness in descending order
dataframe_3_mean.sort_values(by="faithfulness", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-gpt-4o-corrected,0.950794,0.868135,0.905556,0.908161
simple-rag-llama-3.1-405b-instruct-corrected,0.93585,0.789141,0.888889,0.871293
simple-rag-llama-3.1-8b-instruct-corrected,0.913799,0.889998,0.866667,0.890155
simple-rag-llama-3.1-70b-instructed-corrected,0.913709,0.829169,0.897222,0.880033
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
simple-rag-gemma-7b-it-corrected,0.902381,0.863156,0.897222,0.887586
simple-rag-gpt-4o-mini-corrected,0.872143,0.902027,0.888889,0.887686
simple-rag-llama-3-70b-instruct-corrected,0.869946,0.860975,0.897222,0.876048
mixture-rag-mixtral-8x7-instruct-modified-corrected,0.868546,0.762932,0.822222,0.8179
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.866354,0.843956,0.897222,0.869177


In [42]:
# Displaying all the results sorted by answer_relevancy in descending order
dataframe_3_mean.sort_values(by="answer_relevancy", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
simple-rag-gpt-4o-mini-corrected,0.872143,0.902027,0.888889,0.887686
simple-rag-gemma2-9b-it-corrected,0.841273,0.898397,0.883333,0.874335
simple-rag-llama-3.1-8b-instruct-corrected,0.913799,0.889998,0.866667,0.890155
simple-rag-claude-3.5-sonnet-corrected,0.808641,0.887503,0.863889,0.853344
mixture-rag-gemma2-9b-it-thought-corrected,0.80292,0.880448,0.897222,0.860197
mixture-rag-gemma2-9b-it-modified-corrected,0.817845,0.875354,0.916667,0.869955
simple-rag-mixtral-8x7b-instruct-corrected,0.862047,0.87151,0.883333,0.872297
simple-rag-claude-3-opus-corrected,0.861019,0.869271,0.847222,0.859171
simple-rag-claude-3-sonnet-corrected,0.857322,0.868577,0.822222,0.849374


In [43]:
# Displaying all the results sorted by context_utilization in descending order
dataframe_3_mean.sort_values(by="context_utilization", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
mixture-rag-claude-3-haiku-thought-corrected,0.605736,0.766984,0.933333,0.768685
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
mixture-rag-claude-3-haiku-corrected,0.609874,0.858626,0.916667,0.795055
mixture-rag-gemma2-9b-it-modified-corrected,0.817845,0.875354,0.916667,0.869955
mixture-rag-llama3.1-8b-instruct-modified-corrected,0.682824,0.827895,0.908333,0.806351
simple-rag-gpt-4o-corrected,0.950794,0.868135,0.905556,0.908161
mixture-rag-mixtral-8x7-instruct-thought-corrected,0.790323,0.868344,0.9,0.852889
mixture-rag-llama3.1-8b-instruct-thought-corrected,0.866354,0.843956,0.897222,0.869177
simple-rag-llama-3-70b-instruct-corrected,0.869946,0.860975,0.897222,0.876048
mixture-rag-gemma2-9b-it-thought-corrected,0.80292,0.880448,0.897222,0.860197


In [44]:
# Displaying all the results sorted by score (mean of the three metric scores at the experiment level) in descending order
dataframe_3_mean.sort_values(by="score", ascending=False)

experiment_name,faithfulness,answer_relevancy,context_utilization,score
simple-rag-mistral-7b-instruct-corrected,0.905,0.903208,0.925,0.911069
simple-rag-gpt-4o-corrected,0.950794,0.868135,0.905556,0.908161
simple-rag-llama-3.1-8b-instruct-corrected,0.913799,0.889998,0.866667,0.890155
simple-rag-gpt-4o-mini-corrected,0.872143,0.902027,0.888889,0.887686
simple-rag-gemma-7b-it-corrected,0.902381,0.863156,0.897222,0.887586
simple-rag-llama-3.1-70b-instructed-corrected,0.913709,0.829169,0.897222,0.880033
simple-rag-llama-3-70b-instruct-corrected,0.869946,0.860975,0.897222,0.876048
simple-rag-gemma2-9b-it-corrected,0.841273,0.898397,0.883333,0.874335
simple-rag-mixtral-8x7b-instruct-corrected,0.862047,0.87151,0.883333,0.872297
simple-rag-gpt-4-turbo-corrected,0.860575,0.866888,0.888889,0.872117
