## Evaluation
This notebook contains the code for evaluating both the response quality and the retrieval quality of the RAG model.
It includes the pipeline for all evaluation metrics, although not every metric was calculated for every task.

In [136]:
# Path to root
path_to_root = '/work/KlaraKrøyerFomsgaard#1926/NLP_2023_P'

# To API key file
path_to_key = f'{path_to_root}/config/keys.txt'

# To data folder
path_to_data = f'{path_to_root}/data'


In [140]:
# -- Import packages

import pandas as pd
import nest_asyncio

# -- Import custom functions from /src

import sys
sys.path.append(f'{path_to_root}/src')

from utils import extract_text_snippet
from evaluate_models import faithfullness_eval, relevancy_eval, correctness_eval


In [142]:
# -- Import data

df = pd.read_csv(f"{path_to_data}/results/file.csv")

### RAG-specific evaluation
This section includes evaluation metrics that are specific to a RAG setup. It includes
1. Guideline adherence
2. Context retrieval
3. Faithfulness
4. Relevancy 

#### *1. Guideline adherence*

The following section contains code for evaluating guideline adherence.
The RAG is evaluated on the following:

1. Does it start the response with the prefix "*Based on the context...*" or "*Based on prior knowledge...*"?
2. Does it provide a reference list with working links in the response?
3. Does one of the links in the response correspond to the link in the context metadata?
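
The three checks above can be sketched on a single toy response. This is a minimal illustration, not the notebook's pipeline: the `response` text, the `metadata_url` value, and the variable names are hypothetical, and the liveness check for links is left out because it needs network access.

```python
import re

# Hypothetical example response and metadata URL (not taken from the dataset)
response = (
    "Based on the context, the model was released in 2023.\n"
    "References:\n"
    "- <https://example.com/articles/model-release>"
)
metadata_url = "https://example.com/articles/model-release"

# 1. Prefix check: the response must open with one of the two allowed prefixes
allowed_prefixes = ("Based on the context", "Based on prior knowledge")
prefix_ok = response.lstrip().startswith(allowed_prefixes)

# 2. Reference check: extract URLs and strip trailing markup characters
urls = [u.rstrip(">)]}.,") for u in re.findall(r"https?://[^\s'\"}]+", response)]
has_reference = len(urls) > 0

# 3. Match check: does any extracted URL equal the context metadata URL?
url_matches_context = metadata_url in urls

print(prefix_ok, has_reference, url_matches_context)
```

Unlike a substring search for "Based on", `startswith` only passes when the prefix opens the response, which is the stricter reading of guideline 1.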

In [None]:
# -- PREFIX

# --- Check whether each response contains the prefix stem 'Based on'.
# --- Note: this is a substring match anywhere in the response, so it covers both
# --- "Based on the context..." and "Based on prior knowledge..." without checking position.
has_prefix = df['LLMRAG_response'].astype(str).str.contains('Based on', regex=False)

# --- Store the evaluation metric: 'Pass' if the prefix is found, otherwise 'Fail'
df['prefix_correct'] = has_prefix.map({True: 'Pass', False: 'Fail'})

In [143]:
# -- LINKS

import re
import requests

# --- Function for checking if URL is active (HTTP 200 within 10 seconds)
def is_url_active(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers, timeout=10, verify=True)
        return response.status_code == 200
    except requests.RequestException:
        return False


# --- Unwanted symbols to strip from URLs before comparison
bad_chars_url = [']', '}', '{', '[', "'", "\\", '>']

# --- Collect results per row and assign whole columns afterwards
# --- (avoids chained assignment such as df['col'][i] = value)
url_response_col = []  # URLs included in the RAG response
url_correct_col = []   # Evaluation metric: does a response URL match the context URL?
url_active_col = []    # Evaluation metric: are the response URLs active?

# --- Check if working link in response
for response_text in df['LLMRAG_response'].astype(str):
    url_response = re.findall(r'https?://[^\s\'"}]+', response_text)

    # If the response does not have a reference
    if not url_response:
        url_response_col.append('')
        url_correct_col.append('No reference')
        url_active_col.append('No reference')
        continue

    # Remove unwanted symbols from every extracted URL
    for k in bad_chars_url:
        url_response = [url.replace(k, '') for url in url_response]

    # Check if links are active - provide a list of 'True'/'False' and print inactive URLs
    active_links = []
    for url in url_response:
        if is_url_active(url):
            active_links.append('True')
        else:
            active_links.append('False')
            print(url)

    url_response_col.append(url_response)
    url_correct_col.append('')  # Left empty: the match against the context URL is disabled (see below)
    url_active_col.append(active_links)

df['url_true_context'] = ''  # The URL from the metadata of the 'true' context (not extracted here)
df['url_response'] = url_response_col
df['url_correct'] = url_correct_col
df['url_active'] = url_active_col

# --- Disabled: compare the context URL from df['metadata'] with the response URLs
#url_true_context = re.findall(r'https?://[^\s\'"}]+', df['metadata'][i])
#'Pass' if the cleaned url_true_context[0] is in url_response, otherwise 'Fail'



https://www.bbc.com/news/world-middle-east-57885555
https://www.aljazeera.com/news/2021/10/12/moderate-muslims-seek-to-separate-religion-from-extremism
https://www.theguardian.com/world/2021/oct/12/moderate-muslims-seek-to-separate-religion-from-extremism
https://news.mit.edu/2015/algorithms-recognize-objects-f


#### *2. Context retrieval*

The following section contains code for evaluating context retrieval.
The RAG is evaluated on whether it retrieves the 'true context', i.e. the context used for query generation.
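
As a minimal sketch of this check, the retrieval counts as a hit when any retrieved snippet occurs verbatim inside the true context after bracket, quote, and backslash characters are removed. The helper names `strip_chars` and `retrieval_hit` and the toy strings are hypothetical and only illustrate the idea.

```python
def strip_chars(text, chars=("[", "]", "'", "\\")):
    """Remove bracket, quote, and backslash characters before comparing."""
    for ch in chars:
        text = text.replace(ch, "")
    return text


def retrieval_hit(true_context, retrieved_snippets):
    """Return 'Pass' if any retrieved snippet occurs inside the true context."""
    clean_context = strip_chars(true_context)
    for snippet in retrieved_snippets:
        clean_snippet = strip_chars(snippet)
        # Skip empty snippets: an empty string is a substring of everything
        if clean_snippet and clean_snippet in clean_context:
            return "Pass"
    return "Fail"


# Toy data (assumed for illustration only)
true_context = "Researchers trained a model on ['public'] data and released it."
snippets = ["unrelated sentence", "trained a model on public data"]
print(retrieval_hit(true_context, snippets))
```

The guard against empty snippets is a design choice of this sketch: without it, a failed snippet extraction would be counted as a successful retrieval.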


In [None]:
# Mark unwanted special characters
bad_chars = [']', '[', "'", "\\"]

# Columns holding the retrieved contexts (selected by position, so this depends on the column order)
node_columns = df.columns[12:15]

retrieval_correct = []

for i in range(len(df)):
    # Strip the special characters from the true context once per row
    true_context = str(df['text'].iloc[i])
    for k in bad_chars:
        true_context = true_context.replace(k, '')

    # Check whether each retrieved context snippet occurs in the true context
    context_labels = []
    for j in node_columns:
        context = extract_text_snippet(df[j].iloc[i])
        for k in bad_chars:
            context = context.replace(k, '')
        context_labels.append(context in true_context)

    #print(context_labels)

    # 'Pass' if at least one retrieved context matches the true context
    retrieval_correct.append('Pass' if any(context_labels) else 'Fail')

# Assign the whole column at once instead of using chained assignment
df['retrieval_correct'] = retrieval_correct

#### *3. Faithfulness*
The following uses GPT-3.5 Turbo to evaluate the faithfulness of the response to the retrieved nodes.
The node columns were truncated in the conversion to a DataFrame, so each one is matched to its full node, which is then appended to the dataset.
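
The matching step can be sketched with plain strings: each truncated snippet is looked up in the list of full node texts by substring search. The function name `match_full_node` and the toy texts are hypothetical stand-ins for the real nodes and node columns.

```python
# Hypothetical full node texts and truncated snippets (stand-ins for the real data)
full_nodes = [
    "Alpha article text about neural networks and their training.",
    "Beta article text about retrieval-augmented generation systems.",
]
truncated = ["Beta article text about retr", "text that matches nothing"]


def match_full_node(snippet, node_texts):
    """Return the first node text containing the snippet, or '' if none does."""
    for text in node_texts:
        if snippet in text:
            return text
    return ""


matched = [match_full_node(s, full_nodes) for s in truncated]
print(matched)
```

This sketch returns the first matching node; if a short snippet could occur in several nodes, the choice between first and last match affects which node is evaluated, so longer snippets give safer matches.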

In [139]:
from llama_index import (
    Document,  # required by get_nodes below
    ServiceContext,
    OpenAIEmbedding,
    PromptHelper,
    )
from llama_index.text_splitter import SentenceSplitter

# Function to retrieve nodes from dataset
def get_nodes(dataset):
    # Convert the DataFrame into a list of Document objects that the index can understand
    documents = [Document(text=row['Article Body'],
                        metadata={'title': row['Article Header'],
                                    'source': row['Source'],
                                    'author': row['Author'],
                                    'date': row['Published Date'],
                                    'url': row['Url']}) for index, row in dataset.iterrows()] 

    # --- Sentencesplitter to split into chunks
    text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

    nodes = text_splitter.get_nodes_from_documents(documents)
    return nodes

In [None]:
# --- Get nodes from dataset
AI_data = pd.read_csv(f"{path_to_data}/articles_full_aug.csv")

nodes = get_nodes(dataset = AI_data)

In [None]:
# --- Convert df to long format with necessary columns for evaluation and create new column for the full node
df_long = pd.melt(df, 
                id_vars=['text','gold_response','query','LLMRAG_response','LLM_response'], # add 'text' column if not G_factoid and 'gold_response' if factoid
                value_vars=['node_0','node_1','node_2']) # Add k nodes

df_long['full_node'] = ''
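
The wide-to-long reshape can be illustrated on a toy table. This is a small sketch with made-up values; only the column names `query`, `LLMRAG_response`, `node_0`, and `node_1` mirror the notebook, and `wide` and `long_df` are hypothetical names.

```python
import pandas as pd

# Toy wide table: one row per query, one column per retrieved node
wide = pd.DataFrame({
    "query": ["q1", "q2"],
    "LLMRAG_response": ["r1", "r2"],
    "node_0": ["a0", "b0"],
    "node_1": ["a1", "b1"],
})

# Long format: one row per (query, node) pair.
# pd.melt stores the source column name in 'variable' and its content in 'value'.
long_df = pd.melt(
    wide,
    id_vars=["query", "LLMRAG_response"],
    value_vars=["node_0", "node_1"],
)
print(long_df)
```

With k retrieved nodes per query, the long table has k rows per query, so each response is later evaluated once against each of its retrieved nodes.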

In [None]:
# --- Append the full node corresponding to the retrieved contexts
for i in range(len(df_long)):
    context = extract_text_snippet(df_long['value'][i])

    for node in nodes:
        if context in node.text:
            # Use .loc so the value is written to df_long itself, not to a temporary copy
            df_long.loc[i, 'full_node'] = node.text

In [None]:
# -- FAITHFULNESS
# --- Note: Requires OpenAI API
nest_asyncio.apply()

# --- Create columns for storing the faithfulness results
df_long['faithfullness_bin'] = ''
df_long['faithfullness_score'] = ''
df_long['faithfullness_feedback'] = ''

# --- Run evaluation (result holds the binary label, the score and the feedback, in that order)
for i in range(len(df_long)):
    result = faithfullness_eval(response_str=df_long['LLMRAG_response'][i], retrieved_nodes=[df_long['full_node'][i]])
    df_long.loc[i, 'faithfullness_bin'] = result[0]
    df_long.loc[i, 'faithfullness_score'] = result[1]
    df_long.loc[i, 'faithfullness_feedback'] = result[2]

#### *4. Relevancy*

In [None]:
# -- RELEVANCY
# --- Note: Requires OpenAI API
nest_asyncio.apply()

# --- Create columns for storing the relevancy results
df_long['relevancy_bin'] = ''
df_long['relevancy_score'] = ''
df_long['relevancy_feedback'] = ''

# --- Run evaluation (result holds the binary label, the score and the feedback, in that order)
for i in range(len(df_long)):
    result = relevancy_eval(query_str=df_long['query'][i], response_str=df_long['LLMRAG_response'][i], retrieved_nodes=[df_long['full_node'][i]])
    df_long.loc[i, 'relevancy_bin'] = result[0]
    df_long.loc[i, 'relevancy_score'] = result[1]
    df_long.loc[i, 'relevancy_feedback'] = result[2]

### Response quality: Fact retrieval
The following section evaluates the correctness of both the RAG responses and the ordinary LLM responses against the gold response.

In [None]:
# --- Note: Requires OpenAI API
nest_asyncio.apply()

# --- Create columns for storing correctness
df['correctness_RAG_bin'] = ''
df['correctness_RAG_score'] = ''
df['correctness_RAG_feedback'] = ''
df['correctness_LLM_bin'] = ''
df['correctness_LLM_score'] = ''
df['correctness_LLM_feedback'] = ''

# --- Run evaluation (.loc writes to df itself instead of a temporary copy)
for i in range(len(df)):
    # Correctness of the RAG response against the gold response
    result_RAG = correctness_eval(query_str=df['query'][i], response_str=df['LLMRAG_response'][i], reference_str=df['gold_response'][i])
    df.loc[i, 'correctness_RAG_bin'] = result_RAG[0]
    df.loc[i, 'correctness_RAG_score'] = result_RAG[1]
    df.loc[i, 'correctness_RAG_feedback'] = result_RAG[2]

    # Correctness of the ordinary LLM response against the gold response
    result_LLM = correctness_eval(query_str=df['query'][i], response_str=df['LLM_response'][i], reference_str=df['gold_response'][i])
    df.loc[i, 'correctness_LLM_bin'] = result_LLM[0]
    df.loc[i, 'correctness_LLM_score'] = result_LLM[1]
    df.loc[i, 'correctness_LLM_feedback'] = result_LLM[2]

In [None]:
# -- Write df & df_long to .csv

df.to_csv(f"{path_to_data}/results/the_dataset_name_evaluation.csv")
df_long.to_csv(f"{path_to_data}/results/the_dataset_name_evaluation_long.csv")

#### Calculate hit rate (example)
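
As a small sketch of the calculation, the hit rate is the share of rows labelled 'Pass' in an evaluation column. The toy labels below are made up, and the names `labels`, `shares`, `hit_rate`, and `reference_rate` are hypothetical.

```python
import pandas as pd

# Toy evaluation column with the three labels used in this notebook
labels = pd.Series(["Pass", "Fail", "Pass", "No reference", "Pass"])

# Relative frequency of each label, indexed by the label itself
shares = labels.value_counts(normalize=True)

# Hit rate: share of 'Pass'; .get avoids a KeyError if a label never occurs
hit_rate = shares.get("Pass", 0.0)

# Share of responses that include any reference at all
reference_rate = 1 - shares.get("No reference", 0.0)

print(hit_rate, reference_rate)
```

Looking labels up by name rather than by position keeps the result correct when the most frequent label is not 'Pass', since `value_counts` sorts by frequency.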

In [None]:
df_acc = pd.read_csv(f"{path_to_data}/results/G_factoid_evaluation.csv")

In [None]:
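evaluation_metrics_df = pd.DataFrame()

# Relative frequency of each label, looked up by label instead of by position
# (df_acc is the dataset loaded above; the original cell referred to an undefined df_sf)
ref_shares = df_acc['correct_reference'].value_counts(normalize=True)
retrieval_shares = df_acc['retrieval_correct'].value_counts(normalize=True)

# Hit rates: share of rows labelled 'Pass'
acc_ref = ref_shares.get('Pass', 0.0)
acc_node_retrieval = retrieval_shares.get('Pass', 0.0)

# Share of responses that include a reference.
# Assumption: the third label in the original positional lookup was 'No reference'.
acc_guide_link = 1 - ref_shares.get('No reference', 0.0)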