# ToT Evaluate Mistral Tuned

This notebook evaluates the tuned Mistral model locally. Answers are generated with the base model (Mistral AI, 2024) combined with a checkpoint from the ToT-tuning-1 notebook, while Ollama (Ollama, 2024a; Ollama, 2024b) is used for the other functions. The evaluation uses the test dataset produced by the ToT-data-answer-generator-and-checker notebook, which is based on the MMLU dataset (Hendrycks et al., 2021a; Hendrycks et al., 2021b; Hendrycks et al., 2023).

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
 
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].

Mistral AI, 2024. Model Card for Mistral-7B-Instruct-v0.3 [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 [Accessed 19 October 2024].

Ollama, 2024a. Ollama [computer program]. Available from: https://ollama.com [Accessed 1 September 2024].

Ollama, 2024b. mixtral 8x7b-instruct-v0.1-fp16 [Online]. Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].

## Configure

The code in the Configure and Load sections of this notebook is in line with: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].

Register at Hugging Face (Hugging Face, n.d.a), request access to the model on its model card (Mistral AI, 2024), then follow (Hugging Face, n.d.b) to insert your private secret in place of "" in the code below.

Hugging Face, n.d.a. The AI community building the future [Online]. Available from: https://huggingface.co [Accessed 19 October 2024].

Hugging Face, n.d.b. User access tokens [Online]. Available from: https://huggingface.co/docs/hub/en/security-tokens [Accessed 19 October 2024].

In [1]:
# Code adapted from: Hugging Face, n.d. User access tokens [Online]. 
# Available from: https://huggingface.co/docs/hub/en/security-tokens [Accessed 19 October 2024].
# Code is in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
private_secret = ""
#
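
As an alternative to pasting the token directly into the notebook, the following is a minimal sketch that reads it from an environment variable. The variable name `HF_TOKEN` and the helper `read_private_secret` are assumptions for illustration and are not part of the original notebook.

```python
import os


def read_private_secret(env_name="HF_TOKEN"):
    """Return the token stored in the given environment variable, or "" if it is unset."""
    # env_name is an assumed variable name; export it in the shell before starting Jupyter.
    return os.environ.get(env_name, "")


# An empty string means the variable was not set, mirroring the "" placeholder above.
private_secret = read_private_secret()
```

Keeping the secret out of the notebook file avoids committing it by accident when the notebook is shared.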

## Load

In [None]:
# Code reused from: Mistral AI, 2024. Model Card for Mistral-7B-Instruct-v0.3 [Online]. s.l.: Hugging Face. 
# Available from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 [Accessed 19 October 2024].
# Code is in line with and reused from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
from transformers import AutoModelForCausalLM, AutoTokenizer
#

# Code is in line with and reused from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
import os
#

current_path = os.getcwd()

# Code, that is, the parameter, adapted from: Mistral AI, 2024. Model Card for Mistral-7B-Instruct-v0.3 [Online]. s.l.: Hugging Face. 
# Available from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 [Accessed 19 October 2024].
# Code, that is, the value, reused from: Mistral AI, 2024. Model Card for Mistral-7B-Instruct-v0.3 [Online]. s.l.: Hugging Face. 
# Available from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 [Accessed 19 October 2024].
# Code is in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
tuning_llm_name = "mistralai/Mistral-7B-Instruct-v0.3"
#

# Code adapted from: Mistral AI, 2024. Model Card for Mistral-7B-Instruct-v0.3 [Online]. s.l.: Hugging Face. 
# Available from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 [Accessed 19 October 2024].
# Code is in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
tuning_llm = AutoModelForCausalLM.from_pretrained(tuning_llm_name,
# Code, that is, the parameter, reused from: Hugging Face, n.d. Auto Classes. AutoModelForCausalLM. from_pretrained (v.4.45.2) [Online].
# Available from: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained [Accessed 19 October 2024].
# Code, that is, the value, is based on: Hugging Face, n.d. Auto Classes. AutoModelForCausalLM. from_pretrained (v.4.45.2) [Online].
# Available from: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained [Accessed 19 October 2024].
cache_dir = current_path,
#
# Code, that is, the parameter, reused from: Hugging Face, n.d. User access tokens [Online]. 
# Available from: https://huggingface.co/docs/hub/en/security-tokens [Accessed 19 October 2024].
# Code, that is, the value, adapted from: Hugging Face, n.d. User access tokens [Online]. 
# Available from: https://huggingface.co/docs/hub/en/security-tokens [Accessed 19 October 2024].
# Code, that is, the parameter, is in line with and reused from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
# Code, that is, the value, is in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
token = private_secret
#
)
#

# Code adapted from: Mistral AI, 2024. Model Card for Mistral-7B-Instruct-v0.3 [Online]. s.l.: Hugging Face. 
# Available from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 [Accessed 19 October 2024].
# Code is in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
tuning_tokenizer_for_llm = AutoTokenizer.from_pretrained(tuning_llm_name,
# Code, that is, the parameter, reused from: Hugging Face, n.d. Auto Classes. AutoTokenizer. from_pretrained (v.4.45.2) [Online].
# Available from: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained [Accessed 20 October 2024].
# Code, that is, the value, is based on: Hugging Face, n.d. Auto Classes. AutoTokenizer. from_pretrained (v.4.45.2) [Online].
# Available from: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained [Accessed 20 October 2024].
cache_dir = current_path,
#
# Code is based on: Hugging Face, n.d. Accessing Private/Gated Models (v.3.0.0) [Online].
# Available from: https://huggingface.co/docs/transformers.js/en/guides/private [Accessed 20 October 2024].
# Code, that is, the parameter, reused from: Hugging Face, n.d. User access tokens [Online]. 
# Available from: https://huggingface.co/docs/hub/en/security-tokens [Accessed 19 October 2024].
# Code, that is, the value, adapted from: Hugging Face, n.d. User access tokens [Online]. 
# Available from: https://huggingface.co/docs/hub/en/security-tokens [Accessed 19 October 2024].
# Code, that is, the parameter, is in line with and reused from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
# Code, that is, the value, is in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
token = private_secret
#
)
#


In [None]:
# Code, that is, the parameter, adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
# Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
# Code, that is, the parameter and value, adapted from: AnsonKw, 2024. mistral training code (v.1) [computer program].
# Available from: https://www.kaggle.com/code/ansonkw/mistral-training-code [Accessed 26 October 2024].
# Code, that is, the parameter and value, are in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
tuning_tokenizer_for_llm.pad_token = tuning_tokenizer_for_llm.unk_token
#

# Code, that is, the parameter, adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
# Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
# Code, that is, the value, reused from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
# Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
# Code is in line with and adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
tuning_tokenizer_for_llm.padding_side = "right"
#

# Code adapted from: AnsonKw, 2024. mistral training code (v.1) [computer program].
# Available from: https://www.kaggle.com/code/ansonkw/mistral-training-code [Accessed 26 October 2024].
# Code, that is, the value, reused from: AnsonKw, 2024. mistral training code (v.1) [computer program].
# Available from: https://www.kaggle.com/code/ansonkw/mistral-training-code [Accessed 26 October 2024].
first_extracted_index = "input_ids"
#

# Code adapted from: AnsonKw, 2024. mistral training code (v.1) [computer program].
# Available from: https://www.kaggle.com/code/ansonkw/mistral-training-code [Accessed 26 October 2024].
# Code, that is, the value, reused from: AnsonKw, 2024. mistral training code (v.1) [computer program].
# Available from: https://www.kaggle.com/code/ansonkw/mistral-training-code [Accessed 26 October 2024].
second_extracted_index = 1
#

# Code adapted from: AnsonKw, 2024. mistral training code (v.1) [computer program].
# Available from: https://www.kaggle.com/code/ansonkw/mistral-training-code [Accessed 26 October 2024].
extracted_value = tuning_tokenizer_for_llm(tuning_tokenizer_for_llm.unk_token)[first_extracted_index][second_extracted_index]
tuning_llm.config.pad_token_id = extracted_value
#
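
The lookup above takes index 1 of `input_ids` rather than index 0. A plausible reading, assuming the tokenizer prepends a beginning-of-sequence id to every encoded text, is that index 0 holds that id and index 1 holds the id of the unknown token itself. The sketch below illustrates this with a hypothetical stand-in class; `ToyTokenizer` and its ids are made up for illustration and are not the real tokenizer.

```python
class ToyTokenizer:
    """Hypothetical stand-in that prepends a beginning-of-sequence id to every encoding."""

    bos_token_id = 1
    unk_token = "<unk>"
    # Toy vocabulary; the ids are illustrative only.
    vocabulary = {"<unk>": 0, "<s>": 1, "</s>": 2}

    def __call__(self, text):
        # Position 0 is the beginning-of-sequence id, position 1 is the token itself.
        return {"input_ids": [self.bos_token_id, self.vocabulary[text]]}


toy_tokenizer = ToyTokenizer()
encoded = toy_tokenizer(toy_tokenizer.unk_token)
pad_token_id = encoded["input_ids"][1]  # skip the beginning-of-sequence id at index 0
```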

In [None]:
# Code reused from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
from peft import PeftModel
#

# Code adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
# Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 16 November 2024].
combined_llm = PeftModel.from_pretrained(tuning_llm, "ToT-tuning-1/checkpoint-643")
#

In [1]:
import pandas as pd

test_dataset_path_csv = "test-dataset-for-tuning/test_dataset_0_0_671.csv"

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
test_dataset = pd.read_csv(test_dataset_path_csv)
#

## Create the evaluation columns

In [2]:
# Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 17 August 2024].
test_dataset.loc[:, "string_answer_evaluation"] = None
#

In [3]:
# Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 17 August 2024].
test_dataset.loc[:, "llm_answer_check"] = None
#

In [4]:
# Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 17 August 2024].
test_dataset.loc[:, "llm_answer_evaluation"] = None
#

In [5]:
# Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 17 August 2024].
test_dataset.loc[:, "answer_evaluation"] = None
#

## ToT evaluation prompt

In [9]:
def prompt_tot_evaluation(enhanced_question, generated_tot_answer, letter_answer, letter_answer_text):
    # Prompt (lines 1, 2, 5, 8, 9, 10 in the tot_evaluation_prompt) is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    #
    # Prompt (line 1 in the tot_evaluation_prompt) reused and slightly adapted from: mrspiggot, 2023. langchain_tree.py [computer program].
    # Available from: https://github.com/mrspiggot/forestOfThoughts/blob/master/langchain_tree.py [Accessed 5 September 2024]. 
    # (mrspiggot, 2023, line 23)
    #
    # Please note that enhanced_question variable in the tot_evaluation_prompt would include transformed data from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    #
    # Please note that generated_tot_answer variable in the tot_evaluation_prompt would include an answer based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    #
    # Please note that the generated_tot_answer variable in the tot_evaluation_prompt would include an answer from the tuned Mistral 7B Instruct v0.3 model, as available when this notebook was run, used locally with:
    # - the base model: Mistral AI, 2024. Model Card for Mistral-7B-Instruct-v0.3 [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 [Accessed 19 October 2024].
    # - the checkpoint from the ToT-tuning-1 notebook
    # - enhanced_prompt_to_generate_answer in the generate_and_check_answers function in the Generate and Evaluate Answers section of this notebook
    # - enhanced_question values for the prompt from test_dataset in this notebook, which is based on the MMLU dataset:
    # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. 
    # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
    # Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. 
    # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].
    #
    # Please note that letter_answer variable in the tot_evaluation_prompt would include transformed data from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    #
    # Please note that letter_answer_text variable in the tot_evaluation_prompt would include data from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    #
    tot_evaluation_prompt = f'''
    The question is: {enhanced_question}
    The following text delimited by ####### is the answer to the question.

    #######
    {generated_tot_answer}
    #######

    Evaluate the text delimited by ####### which is the answer to the question and conclude if the answer provided at the bottom of the text delimited by ####### is {letter_answer} and/or {letter_answer_text}.
    If the answer provided at the bottom of the text delimited by ####### is {letter_answer} and/or {letter_answer_text}, output correct.
    Otherwise, if the answer provided at the bottom of the text delimited by ####### is not {letter_answer} and/or {letter_answer_text}, output incorrect.
    You must output only one word: correct or incorrect.
    '''
    print(tot_evaluation_prompt)
    return tot_evaluation_prompt
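
The prompt above asks the evaluating model for exactly one word, correct or incorrect, but a model may still add whitespace, capitalisation or punctuation. The following is a minimal sketch of how such a reply could be normalised; `parse_evaluation_verdict` is a hypothetical helper and is not part of the original notebook.

```python
def parse_evaluation_verdict(llm_output):
    """Map the evaluator's reply to True (correct), False (incorrect) or None (unrecognised)."""
    # Remove surrounding whitespace, then surrounding punctuation and quotes, then lower-case.
    normalised = llm_output.strip().strip(".!\"'").lower()
    if normalised == "correct":
        return True
    if normalised == "incorrect":
        return False
    # Anything longer than the single expected word is treated as unrecognised.
    return None
```

Returning `None` for an unrecognised reply keeps a malformed output distinguishable from a genuine incorrect verdict, so the caller can decide whether to retry.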

## Generate and Evaluate Answers

Create folders to save generated and evaluated answers

In [10]:
import os

current_path = os.getcwd()

def create_data_for_tuning_folder_address(current_path, data_for_tuning_folder_name):
    return current_path + "/" + data_for_tuning_folder_name

def create_folder_for_data_for_tuning(data_for_tuning_folder_name, data_for_tuning_folder_address):
    # Code, that is, the check for folder, adapted from: Python Software Foundation, 2024. os.path - Common pathname manipulations. os.path.exists (v.3.12.5) [Online]. 
    # Available from: https://docs.python.org/3/library/os.path.html#os.path.exists [Accessed 25 August 2024].
    data_for_tuning_folder_is_present = os.path.exists(data_for_tuning_folder_address)
    #
    
    if not data_for_tuning_folder_is_present:
        os.mkdir(data_for_tuning_folder_address)
        print(f'Created {data_for_tuning_folder_name} folder')
    else:
        print(f'{data_for_tuning_folder_name} folder is present')
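
The check-then-create pattern above can also be written with `os.makedirs(..., exist_ok=True)` from the standard library, which does not raise an error if the folder already exists. The sketch below is an alternative under that assumption, not the notebook's own method; `ensure_folder` is a hypothetical name, and unlike `os.mkdir` this call also creates any missing parent folders.

```python
import os
import tempfile


def ensure_folder(folder_address):
    """Create the folder if needed and return True only when it was newly created."""
    already_present = os.path.isdir(folder_address)
    # exist_ok=True makes the call safe to repeat on an existing folder.
    os.makedirs(folder_address, exist_ok=True)
    return not already_present


# Demonstration inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base_path:
    example_address = os.path.join(base_path, "example-folder")
    first_call_created = ensure_folder(example_address)
    second_call_created = ensure_folder(example_address)
```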

Check if answers were already generated and load the appropriate dataset

In [11]:
import pandas as pd

def check_and_load_generated_answers_to_continue(initial_dataset, data_for_tuning_folder_name, data_for_tuning_folder_address):
    '''
    This function checks whether the folder contains any files with generated answers. If so, it loads the most recent file, which is either an intermediate file
    (to continue generating answers) or a completed file; otherwise, it uses the initial dataset to start generating answers from the beginning.
    '''
    # Code, that is, the finding of files for tuning, adapted from: Python Software Foundation, 2024. os - Miscellaneous operating system interfaces. os.listdir (v.3.12.5) [Online]. 
    # Available from: https://docs.python.org/3/library/os.html#os.listdir [Accessed 31 August 2024].
    files_for_tuning = os.listdir(data_for_tuning_folder_address)
    #

    if len(files_for_tuning):
        # Code, that is, the function and the ordering of files for tuning, adapted from: Python Software Foundation, 2024. Built-in Types. sort (v.3.12.5) [Online]. 
        # Available from: https://docs.python.org/3/library/stdtypes.html#list.sort [Accessed 31 August 2024].
        def check_version_of_file(file):
            return int(file.split('_')[2])
        
        files_for_tuning.sort(key=check_version_of_file, reverse=True)
        #
        print(f"The following files were found in the {data_for_tuning_folder_name} folder. {files_for_tuning}")
        recent_file_for_tuning = files_for_tuning[0]
        print(f"The most recent file in the {data_for_tuning_folder_name} folder is {recent_file_for_tuning}")
        recent_file_for_tuning_attributes = recent_file_for_tuning.split('_')

        recent_file_for_tuning_address = data_for_tuning_folder_address + '/' + recent_file_for_tuning
        # Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
        dataset_for_generated_answers = pd.read_csv(recent_file_for_tuning_address)
        #
        dataset_for_generated_answers_version = int(recent_file_for_tuning_attributes[2])
        generated_answers_count = int(recent_file_for_tuning_attributes[3])
        total_dataset_rows = int(recent_file_for_tuning_attributes[4].split('.')[0])
        
        if generated_answers_count < total_dataset_rows:
            print(f"The {data_for_tuning_folder_name} folder has intermediate files with generated answers. Continue the generation of answers. The dataset version is {dataset_for_generated_answers_version}. The count of generated answers is {generated_answers_count}. The total count of rows is {total_dataset_rows}.")
        elif generated_answers_count == total_dataset_rows:
            print(f"The {data_for_tuning_folder_name} folder has already generated answers. The dataset version is {dataset_for_generated_answers_version}. The count of generated answers is {generated_answers_count}. The total count of rows is {total_dataset_rows}.")
        return dataset_for_generated_answers, dataset_for_generated_answers_version, generated_answers_count, total_dataset_rows
    else:
        dataset_for_generated_answers = initial_dataset
        dataset_for_generated_answers_version = 0
        generated_answers_count = 0
        total_dataset_rows = len(dataset_for_generated_answers)
        print(f"The {data_for_tuning_folder_name} folder does not have files with generated answers. Start the generation of answers from the beginning. The dataset version is {dataset_for_generated_answers_version}. The count of generated answers is {generated_answers_count}. The total count of rows is {total_dataset_rows}.")
        return dataset_for_generated_answers, dataset_for_generated_answers_version, generated_answers_count, total_dataset_rows
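
The function above relies on a file naming convention in which the third, fourth and fifth underscore-separated fields hold the dataset version, the count of generated answers and the total count of rows, as in `test_dataset_0_0_671.csv`. The sketch below isolates that parsing step; `AnswerFileAttributes` and `parse_answer_file_name` are hypothetical names, and the second and third file names in the list are made up for illustration.

```python
from typing import NamedTuple


class AnswerFileAttributes(NamedTuple):
    version: int
    generated_answers_count: int
    total_dataset_rows: int


def parse_answer_file_name(file_name):
    """Split a name such as test_dataset_0_0_671.csv into its three numeric attributes."""
    parts = file_name.split("_")
    return AnswerFileAttributes(
        version=int(parts[2]),
        generated_answers_count=int(parts[3]),
        # The last field still carries the extension, e.g. "671.csv".
        total_dataset_rows=int(parts[4].split(".")[0]),
    )


attributes = parse_answer_file_name("test_dataset_0_0_671.csv")

# Picking the most recent file by version; the two later names are illustrative only.
example_files = ["test_dataset_0_0_671.csv", "test_dataset_2_30_671.csv", "test_dataset_1_15_671.csv"]
most_recent_file = max(example_files, key=lambda name: parse_answer_file_name(name).version)
```

Note that any file in the folder that does not follow this convention would raise an error during parsing, which also applies to the sort key in the function above.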

Generate and evaluate answers
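
The cell below wraps each question in the `[INST] ... [/INST]` instruction markers before passing it to the tuned model. The following minimal sketch isolates that string assembly so it can be inspected on its own; `build_generation_prompt` is a hypothetical helper, and the example question is made up, since the exact format of `enhanced_question` comes from the test dataset.

```python
def build_generation_prompt(enhanced_question):
    """Assemble the instruction-style prompt from the same pieces used in the cell below."""
    element_before_ask = "The question is: "
    element_after_ask = "What is the answer and the answer letter?"
    return "<s>" + "[INST] " + element_before_ask + enhanced_question + "\n" + element_after_ask + " [/INST] "


# Illustrative question only; real values come from the enhanced_question column.
example_prompt = build_generation_prompt("What is 2 + 2? A. 3 B. 4 C. 5 D. 6")
```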

In [None]:
# Code, that is, the import, reused from: LangChain, 2023. OllamaLLM [Online]. 
# Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
from langchain_ollama import OllamaLLM
#
# Code, that is, the import, reused from: NumPy Developers, 2024. Constants. numpy.nan (v.2.1) [Online].
# Available from: https://numpy.org/doc/stable/reference/constants.html#numpy.nan [Accessed 15 September 2024].
import numpy as np
#

# Code reused from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
# Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
from transformers import pipeline
#

def generate_and_check_answers(data_for_tuning_folder_name, data_for_tuning_folder_address, dataset_for_generated_answers, dataset_for_generated_answers_version, generated_answers_count, total_dataset_rows, tot):
    '''
    This function generates answers in the simple or ToT style for evaluation, checks the answers for quality, and periodically saves the answers as intermediate versions.
    '''
    if generated_answers_count < total_dataset_rows:
        generated_answers_count_updated = generated_answers_count
        dataset_for_generated_answers_version_updated = dataset_for_generated_answers_version
        save_dataset_count = 15
        # Code, that is, the loop, adapted from: pandas, 2024. pandas.DataFrame.itertuples (v.2.2) [Online]. 
        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples [Accessed 1 September 2024].
        for dataset_record in dataset_for_generated_answers.itertuples():
        #
            if (dataset_record.generated_answer is None or
            # Code, that is, the use of numpy that follows, reused from: NumPy Developers, 2024. Constants. numpy.nan (v.2.1) [Online].
            # Available from: https://numpy.org/doc/stable/reference/constants.html#numpy.nan [Accessed 15 September 2024].
            dataset_record.generated_answer is np.nan or
            #
            # Code adapted from: pandas, 2024. pandas.isna (v.2.2) [Online]. 
            # Available from: https://pandas.pydata.org/docs/reference/api/pandas.isna.html#pandas-isna [Accessed 15 October 2024].
            pd.isna(dataset_record.generated_answer)):
            #
                retries_for_quality = 0
                max_retries_for_quality = 3
                while retries_for_quality < max_retries_for_quality:
                    # Code, that is, the value, reused and slightly adapted from: mrspiggot, 2023. langchain_tree.py [computer program].
                    # Available from: https://github.com/mrspiggot/forestOfThoughts/blob/master/langchain_tree.py [Accessed 5 September 2024]. 
                    # (mrspiggot, 2023, line 23)
                    element_before_ask = "The question is: "
                    #
                    # Code, that is, the value, is adapted and based on: mrspiggot, 2023. langchain_tree.py [computer program].
                    # Available from: https://github.com/mrspiggot/forestOfThoughts/blob/master/langchain_tree.py [Accessed 5 September 2024]. 
                    # (mrspiggot, 2023, line 25)
                    # Code, that is, the value, is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                    element_after_ask = "What is the answer and the answer letter?"
                    #
                    # Code, that is, the parameter, adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
                    # Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 2 November 2024].
                    # Code, that is, the value, adapted from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
                    # Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 2 November 2024].
                    # Code, that is, the value, is based on: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
                    # Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 2 November 2024].
                    # Code, that is, the value, is based on transformations in ToT-data-ETL notebook to data from and according to: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                    # Code, that is, the first, second, fifth, seventh elements of value, reused from: NVIDIA, 2024. Fine-tuning Mistral 7B using QLoRA (mistral-finetune.ipynb) [computer program].
                    # Available from: https://github.com/NVIDIA/workbench-example-mistral-finetune/blob/main/code/mistral-finetune.ipynb [Accessed 2 November 2024].
                    # Code adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    # Code, that is, first, second and seventh strings, reused from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    enhanced_prompt_to_generate_answer = "<s>" + "[INST] " + element_before_ask + dataset_record.enhanced_question + "\n" + element_after_ask + " [/INST] "
                    #
                    # Code, that is, the parameter, adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    # Code, that is, the value, reused from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    # Code, that is, the value, reused from: Hugging Face, n.d. Pipelines. The pipeline abstraction. transformers.pipeline (v.4.45.2) [Online].
                    # Available from: https://huggingface.co/docs/transformers/v4.45.2/main_classes/pipelines#transformers.pipeline [Accessed 20 October 2024].
                    generated_answer_type = "text-generation"
                    #
                    # Code adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    answer_generator = pipeline(model = combined_llm, 
                    tokenizer = tuning_tokenizer_for_llm, 
                    task = generated_answer_type,
                    # Code, that is, the parameter, reused from: Hugging Face, n.d. Pipelines. The pipeline abstraction. transformers.pipeline (v.4.45.2) [Online].
                    # Available from: https://huggingface.co/docs/transformers/v4.45.2/main_classes/pipelines#transformers.pipeline [Accessed 20 October 2024].
                    # Code, that is, the value, reused from: Hugging Face, n.d. Pipelines. The pipeline abstraction. transformers.pipeline (v.4.45.2) [Online].
                    # Available from: https://huggingface.co/docs/transformers/v4.45.2/main_classes/pipelines#transformers.pipeline [Accessed 20 October 2024].
                    device = "mps",
                    #
                    # Code is based on: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    # Code is based on: Hugging Face, n.d. Pipelines. The pipeline abstraction. transformers.pipeline (v.4.45.2) [Online].
                    # Available from: https://huggingface.co/docs/transformers/v4.45.2/main_classes/pipelines#transformers.pipeline [Accessed 20 October 2024].
                    # Code, that is, the parameter, reused from: Hugging Face, n.d. Pipelines. Natural Language Processing. TextGenerationPipeline. class transformers.TextGenerationPipeline (v.4.45.2) [Online].
                    # Available from: https://huggingface.co/docs/transformers/v4.45.2/main_classes/pipelines#transformers.TextGenerationPipeline [Accessed 20 October 2024].
                    # Code, that is, the value, adapted from: Hugging Face, n.d. Pipelines. Natural Language Processing. TextGenerationPipeline. class transformers.TextGenerationPipeline (v.4.45.2) [Online].
                    # Available from: https://huggingface.co/docs/transformers/v4.45.2/main_classes/pipelines#transformers.TextGenerationPipeline [Accessed 20 October 2024].
                    max_new_tokens = 5000
                    #
                    )
                    #
                    # Code adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    # Code, that is, the value, reused from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    answer_first_index = 0
                    #
                    # Code adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    # Code, that is, the value, reused from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    answer_second_index = "generated_text"
                    #
                    # Code adapted from: Awan, A.A., 2023. Mistral 7B Instruct 4bit QLoRA Fine-tuning (v.2) [computer program].
                    # Available from: https://www.kaggle.com/code/kingabzpro/mistral-7b-instruct-4bit-qlora-fine-tuning [Accessed 20 October 2024].
                    generated_answer_with_enhanced_prompt = answer_generator(enhanced_prompt_to_generate_answer)[answer_first_index][answer_second_index]
                    print(generated_answer_with_enhanced_prompt)
                    #
                    # Take the last 50 characters of the prompt as a marker for where the echoed prompt ends
                    enhanced_prompt_to_generate_answer_part = enhanced_prompt_to_generate_answer[-50:]
                    print(enhanced_prompt_to_generate_answer_part)
                    # Keep the text after the first occurrence of the marker; if the marker is not found, keep the whole output instead of raising IndexError
                    generated_answer = generated_answer_with_enhanced_prompt.split(enhanced_prompt_to_generate_answer_part, 1)[-1]
                    print(generated_answer)
                    # This simple quality check verifies that the generated answer contains at least the second-level thoughts 1.1, 1.2, 2.1, 2.2, 3.1 and 3.2, and a final answer letter, based on the format in:
                    # Zhang, Z., Ye, Z., Shen, Y. and Gan, C., 2023. Autonomous Tree-Search Ability of Large Language Models. Ithaca: Cornell University Library, arXiv.org. arXiv [Online]. Available from: https://arxiv.org/pdf/2310.10686.pdf [Accessed 25 August 2024].
                    # Page 13, C.1
                    if (("1.1" in generated_answer and 
                    "1.2" in generated_answer and 
                    "2.1" in generated_answer and 
                    "2.2" in generated_answer and 
                    "3.1" in generated_answer and 
                    "3.2" in generated_answer) and 
                    #
                    # Code is based on several outputs from Mixtral 8x7b Instruct version v0.1, fp16 (pers. comm.) on 03/10/2024 
                    # from tot_prompt_to_generate_answer prompt in the prompt_tot function in the Prompt section in ToT-data-answer-generator-and-checker notebook, with and without hint_1 and hint_2 
                    # values for the prompt which are indicated in the code in Generate and check answers subsection in the ToT-data-answer-generator-and-checker notebook 
                    # and enhanced_question values from train_dataset in the ToT-data-answer-generator-and-checker notebook
                    # for the prompt (that is, the outputs that could be produced with the code in the ToT-data-answer-generator-and-checker notebook) which is based on the MMLU dataset: 
                    # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                    # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. 
                    # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
                    # Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. 
                    # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].
                    # Please note that Mixtral 8x7b Instruct version v0.1, fp16 has been used locally using: 
                    # Ollama, 2024a. Ollama [computer program]. Available from: https://ollama.com [Accessed 1 September 2024].
                    # Ollama, 2024b. mixtral 8x7b-instruct-v0.1-fp16 [Online]. 
                    # Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].
                    # Code is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                    any(f"{answer_phrase}{answer_letter}" in generated_answer
                        for answer_phrase in ("Answer: ", "answer: ", "Answer is ", "answer is ", "Answer is: ", "answer is: ")
                        for answer_letter in "ABCD")):
                    #
                        retries_for_quality = max_retries_for_quality
                    retries_for_quality += 1
                # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                dataset_for_generated_answers.loc[dataset_for_generated_answers['id'] == dataset_record.id, "generated_answer"] = generated_answer
                #
                generated_answers_count_updated += 1
                print(f"The generated answer for {dataset_record.id} was generated with quality check and {retries_for_quality} retries")
                if (generated_answers_count_updated % save_dataset_count == 0) or (generated_answers_count_updated == total_dataset_rows):
                    dataset_for_generated_answers_version_updated += 1
                    dataset_type = data_for_tuning_folder_name.split('-')[0]
                    dataset_for_generated_answers_address = data_for_tuning_folder_address + '/' + f'{dataset_type}_dataset_{dataset_for_generated_answers_version_updated}_{generated_answers_count_updated}_{total_dataset_rows}.csv'
                    # Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
                    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
                    dataset_for_generated_answers.to_csv(dataset_for_generated_answers_address, index=False)
                    #
                    print(f"The dataset for generated answers was saved for the version {dataset_for_generated_answers_version_updated}. The count of generated answers is {generated_answers_count_updated}. The count of total rows is {total_dataset_rows}")
            else:
                print(f"The generated answer for id {dataset_record.id} was already generated")
    elif generated_answers_count == total_dataset_rows:
        print("The generated answers were already generated for the dataset")
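
The two post-processing steps in the cell above, removing the echoed prompt and checking the structure of the answer, can be sketched in isolation. This is a minimal sketch rather than the notebook's code: `strip_prompt_echo`, `passes_quality_check`, `tail_length` and the constants are hypothetical names introduced for illustration, and the regular expression is assumed to be equivalent to the substring tests used above.

```python
import re

# Hypothetical constants for illustration; the labels and phrases mirror the checks in the cell above.
SECOND_LEVEL_THOUGHTS = ("1.1", "1.2", "2.1", "2.2", "3.1", "3.2")
FINAL_ANSWER_PATTERN = re.compile(r"(?:Answer|answer)(?:: | is: | is )[A-D]")


def strip_prompt_echo(full_text: str, prompt: str, tail_length: int = 50) -> str:
    """Return the text that follows the echoed prompt.

    The last `tail_length` characters of the prompt serve as the marker.
    If the marker is absent, the text is returned unchanged.
    """
    prompt_tail = prompt[-tail_length:]
    _, marker, remainder = full_text.partition(prompt_tail)
    return remainder if marker else full_text


def passes_quality_check(generated_answer: str) -> bool:
    """Check for all six second-level thought labels and a final answer letter."""
    has_all_thoughts = all(label in generated_answer for label in SECOND_LEVEL_THOUGHTS)
    has_final_answer = FINAL_ANSWER_PATTERN.search(generated_answer) is not None
    return has_all_thoughts and has_final_answer
```

The `text-generation` pipeline also accepts `return_full_text=False`, which returns only the newly generated text and may make the prompt-stripping step unnecessary; whether it gives identical results for this model and prompt has not been checked here.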

In [13]:
# Code, that is, the import, reused from: NumPy Developers, 2024. Constants. numpy.nan (v.2.1) [Online].
# Available from: https://numpy.org/doc/stable/reference/constants.html#numpy.nan [Accessed 15 September 2024].
import numpy as np
#

# Code, that is, the import, reused from: Kuchling, A.M., 2024. Regular Expression HOWTO (v.3.13.0) [Online]. s.l.: Python Software Foundation.
# Available from: https://docs.python.org/3/howto/regex.html [Accessed 15 October 2024].
import re
#

def string_answer_evaluate(data_for_tuning_folder_address, dataset_for_string_answer_evaluation_name, dataset_for_string_answer_evaluation, extract_first_answer):
    '''
    This function evaluates each generated answer against the correct answer by matching string patterns of answers, where 1 marks a correct answer and 0 an incorrect answer
    '''
    # Code, that is, the loop, adapted from: pandas, 2024. pandas.DataFrame.itertuples (v.2.2) [Online]. 
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples [Accessed 1 September 2024].
    for dataset_record in dataset_for_string_answer_evaluation.itertuples():
    #
        if (dataset_record.string_answer_evaluation is None or
        # Code, that is, the use of numpy that follows, reused from: NumPy Developers, 2024. Constants. numpy.nan (v.2.1) [Online].
        # Available from: https://numpy.org/doc/stable/reference/constants.html#numpy.nan [Accessed 15 September 2024].
        dataset_record.string_answer_evaluation is np.nan or
        #
        # Code adapted from: pandas, 2024. pandas.isna (v.2.2) [Online]. 
        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.isna.html#pandas-isna [Accessed 15 October 2024].
        pd.isna(dataset_record.string_answer_evaluation)):
        #
            if extract_first_answer:
                print(f"Extract first answer is used for id {dataset_record.id}")
                # Code is adapted from and based on: Kuchling, A.M., 2024. Regular Expression HOWTO (v.3.13.0) [Online]. s.l.: Python Software Foundation.
                # Available from: https://docs.python.org/3/howto/regex.html [Accessed 15 October 2024].
                # Code is based on several outputs from Mixtral 8x7b Instruct version v0.1, fp16 (pers. comm.) on 03/10/2024 
                # from tot_prompt_to_generate_answer prompt in the prompt_tot function in the Prompt section in ToT-data-answer-generator-and-checker notebook, with and without hint_1 and hint_2 
                # values for the prompt which are indicated in the code in Generate and check answers subsection in the ToT-data-answer-generator-and-checker notebook 
                # and enhanced_question values from train_dataset in the ToT-data-answer-generator-and-checker notebook
                # for the prompt (that is, the outputs that could be produced with the code in the ToT-data-answer-generator-and-checker notebook) which is based on the MMLU dataset: 
                # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. 
                # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
                # Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. 
                # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].
                # Please note that Mixtral 8x7b Instruct version v0.1, fp16 has been used locally using: 
                # Ollama, 2024a. Ollama [computer program]. Available from: https://ollama.com [Accessed 1 September 2024].
                # Ollama, 2024b. mixtral 8x7b-instruct-v0.1-fp16 [Online]. 
                # Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].
                # Code is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                string_answer_to_extract = re.compile("(Answer: |answer: |Answer is |answer is |Answer is: |answer is: )[A-D]")
                #
                # Code is adapted from: Kuchling, A.M., 2024. Regular Expression HOWTO (v.3.13.0) [Online]. s.l.: Python Software Foundation.
                # Available from: https://docs.python.org/3/howto/regex.html [Accessed 15 October 2024].
                string_answer_check = string_answer_to_extract.search(dataset_record.generated_answer)
                #
                # Code is adapted from and based on: Kuchling, A.M., 2024. Regular Expression HOWTO (v.3.13.0) [Online]. s.l.: Python Software Foundation.
                # Available from: https://docs.python.org/3/howto/regex.html [Accessed 15 October 2024].
                if string_answer_check is not None:
                #
                    print(f"String answer check for id {dataset_record.id} is {string_answer_check}")
                    # Code is adapted from: Kuchling, A.M., 2024. Regular Expression HOWTO (v.3.13.0) [Online]. s.l.: Python Software Foundation.
                    # Available from: https://docs.python.org/3/howto/regex.html [Accessed 15 October 2024].
                    string_answer_indices = string_answer_check.span()
                    #
                    # Code is based on: Kuchling, A.M., 2024. Regular Expression HOWTO (v.3.13.0) [Online]. s.l.: Python Software Foundation.
                    # Available from: https://docs.python.org/3/howto/regex.html [Accessed 15 October 2024].
                    generated_answer = dataset_record.generated_answer[0:string_answer_indices[1]]
                    #
                else:
                    generated_answer = dataset_record.generated_answer
            else:
                generated_answer = dataset_record.generated_answer

            print(f"The generated answer to be checked for id {dataset_record.id} is {generated_answer}")

            # Code is based on several outputs from Mixtral 8x7b Instruct version v0.1, fp16 (pers. comm.) on 03/10/2024 
            # from tot_prompt_to_generate_answer prompt in the prompt_tot function in the Prompt section in ToT-data-answer-generator-and-checker notebook, with and without hint_1 and hint_2 
            # values for the prompt which are indicated in the code in Generate and check answers subsection in the ToT-data-answer-generator-and-checker notebook 
            # and enhanced_question values from train_dataset in the ToT-data-answer-generator-and-checker notebook
            # for the prompt (that is, the outputs that could be produced with the code in the ToT-data-answer-generator-and-checker notebook) which is based on the MMLU dataset: 
            # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
            # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
            # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. 
            # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
            # Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. 
            # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].
            # Please note that Mixtral 8x7b Instruct version v0.1, fp16 has been used locally using: 
            # Ollama, 2024a. Ollama [computer program]. Available from: https://ollama.com [Accessed 1 September 2024].
            # Ollama, 2024b. mixtral 8x7b-instruct-v0.1-fp16 [Online]. 
            # Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].
            # Code is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
            # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
            if (f"Answer: {dataset_record.letter_answer}" in generated_answer or
                f"answer: {dataset_record.letter_answer}" in generated_answer or
                f"Answer is {dataset_record.letter_answer}" in generated_answer or
                f"answer is {dataset_record.letter_answer}" in generated_answer or
                f"Answer is: {dataset_record.letter_answer}" in generated_answer or
                f"answer is: {dataset_record.letter_answer}" in generated_answer):
            #
                # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                dataset_for_string_answer_evaluation.loc[dataset_for_string_answer_evaluation['id'] == dataset_record.id, "string_answer_evaluation"] = 1
                #
                # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                dataset_for_string_answer_evaluation.loc[dataset_for_string_answer_evaluation['id'] == dataset_record.id, "answer_evaluation"] = 1
                #
            # Code is based on several outputs from Mixtral 8x7b Instruct version v0.1, fp16 (pers. comm.) on 03/10/2024 
            # from tot_prompt_to_generate_answer prompt in the prompt_tot function in the Prompt section in ToT-data-answer-generator-and-checker notebook, with and without hint_1 and hint_2 
            # values for the prompt which are indicated in the code in Generate and check answers subsection in the ToT-data-answer-generator-and-checker notebook 
            # and enhanced_question values from train_dataset in the ToT-data-answer-generator-and-checker notebook
            # for the prompt (that is, the outputs that could be produced with the code in the ToT-data-answer-generator-and-checker notebook) which is based on the MMLU dataset: 
            # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
            # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
            # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. 
            # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
            # Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. 
            # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].
            # Please note that Mixtral 8x7b Instruct version v0.1, fp16 has been used locally using: 
            # Ollama, 2024a. Ollama [computer program]. Available from: https://ollama.com [Accessed 1 September 2024].
            # Ollama, 2024b. mixtral 8x7b-instruct-v0.1-fp16 [Online]. 
            # Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].
            # Code is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
            # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
            elif any(f"{answer_phrase}{answer_letter}" in generated_answer
                for answer_phrase in ("Answer: ", "answer: ", "Answer is ", "answer is ", "Answer is: ", "answer is: ")
                for answer_letter in "ABCD"):
            #
                # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                dataset_for_string_answer_evaluation.loc[dataset_for_string_answer_evaluation['id'] == dataset_record.id, "string_answer_evaluation"] = 0
                #
                # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                dataset_for_string_answer_evaluation.loc[dataset_for_string_answer_evaluation['id'] == dataset_record.id, "answer_evaluation"] = 0
                #
            else:
                # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                dataset_for_string_answer_evaluation.loc[dataset_for_string_answer_evaluation['id'] == dataset_record.id, "string_answer_evaluation"] = None
                #
        else:
            print(f"The string answer evaluation for id {dataset_record.id} was already generated")
    if extract_first_answer:
        dataset_for_string_answer_evaluation_address = data_for_tuning_folder_address + '/' + f'{dataset_for_string_answer_evaluation_name}_string_answer_evaluated_extracted.csv'
    else:
        dataset_for_string_answer_evaluation_address = data_for_tuning_folder_address + '/' + f'{dataset_for_string_answer_evaluation_name}_string_answer_evaluated.csv'
    # Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
    dataset_for_string_answer_evaluation.to_csv(dataset_for_string_answer_evaluation_address, index=False)
    #
    if extract_first_answer:
        print(f"The dataset for string answer evaluation was saved to {dataset_for_string_answer_evaluation_name}_string_answer_evaluated_extracted.csv")
    else:
        print(f"The dataset for string answer evaluation was saved to {dataset_for_string_answer_evaluation_name}_string_answer_evaluated.csv")
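
The per-record decision made by `string_answer_evaluate` can be sketched as a pure function that returns 1, 0 or `None` without touching the DataFrame. This is a minimal sketch under stated assumptions: `classify_generated_answer`, `ANSWER_PHRASES`, `ANSWER_LETTERS` and `FIRST_ANSWER_PATTERN` are hypothetical names introduced for illustration, and the logic is intended to mirror the cell above rather than replace it.

```python
import re
from typing import Optional

# Hypothetical constants for illustration; the phrases mirror those matched in the cell above.
ANSWER_PHRASES = ("Answer: ", "answer: ", "Answer is ", "answer is ", "Answer is: ", "answer is: ")
ANSWER_LETTERS = "ABCD"
FIRST_ANSWER_PATTERN = re.compile(
    "(?:" + "|".join(re.escape(phrase) for phrase in ANSWER_PHRASES) + ")[A-D]"
)


def classify_generated_answer(
    generated_answer: str, letter_answer: str, extract_first_answer: bool = False
) -> Optional[int]:
    """Return 1 for a correct answer, 0 for an incorrect one, None if no answer pattern is found."""
    if extract_first_answer:
        # Cut the text right after the first stated answer so later answers are ignored.
        first_match = FIRST_ANSWER_PATTERN.search(generated_answer)
        if first_match is not None:
            generated_answer = generated_answer[: first_match.end()]
    if any(f"{phrase}{letter_answer}" in generated_answer for phrase in ANSWER_PHRASES):
        return 1
    if any(
        f"{phrase}{letter}" in generated_answer
        for phrase in ANSWER_PHRASES
        for letter in ANSWER_LETTERS
    ):
        return 0
    return None
```

Because each test is a plain substring match, a continuation such as "answer is Clearly" counts as the letter C in this sketch; the same appears to hold for the checks in the cell above, so borderline records may be worth inspecting by hand.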

In [14]:
def llm_answer_evaluate(data_for_tuning_folder_address, dataset_for_llm_answer_evaluation_name, dataset_for_llm_answer_evaluation, tot):
    '''
    This function uses a Large Language Model to evaluate the generated answers that could not be evaluated with the string answer patterns, 
    where 1 marks a correct answer and 0 an incorrect answer
    '''
    # Code, that is, the answer generator, adapted from: LangChain, 2023. OllamaLLM [Online]. 
    # Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
    answer_generator_llm = OllamaLLM(
        # Code, that is, the parameter, reused from: LangChain, 2023. OllamaLLM [Online]. 
        # Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
        # Code, that is, the value, reused from: Ollama, 2024b. mixtral 8x7b-instruct-v0.1-fp16 [Online]. 
        # Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].
        model="mixtral:8x7b-instruct-v0.1-fp16"
        #
        )
    #
    # Code, that is, the loop, adapted from: pandas, 2024. pandas.DataFrame.itertuples (v.2.2) [Online]. 
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples [Accessed 1 September 2024].
    for dataset_record in dataset_for_llm_answer_evaluation.itertuples():
    #
        if (dataset_record.string_answer_evaluation is None or
        # Code, that is, the use of numpy that follows, reused from: NumPy Developers, 2024. Constants. numpy.nan (v.2.1) [Online].
        # Available from: https://numpy.org/doc/stable/reference/constants.html#numpy.nan [Accessed 15 September 2024].
        dataset_record.string_answer_evaluation is np.nan or
        #
        # Code adapted from: pandas, 2024. pandas.isna (v.2.2) [Online]. 
        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.isna.html#pandas-isna [Accessed 15 October 2024].
        pd.isna(dataset_record.string_answer_evaluation)):
        #
            # Code is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
            # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
            # Map the correct letter to the text of its choice; an unexpected letter raises KeyError instead of silently reusing the previous record's text
            letter_answer_choices = {
                "A": dataset_record.first_choice,
                "B": dataset_record.second_choice,
                "C": dataset_record.third_choice,
                "D": dataset_record.fourth_choice,
            }
            letter_answer_text = letter_answer_choices[dataset_record.letter_answer]
            #

            prompt_to_evaluate_answer = prompt_tot_evaluation(dataset_record.enhanced_question, dataset_record.generated_answer, dataset_record.letter_answer, letter_answer_text)

            # Code, that is, the answer generation, adapted from: LangChain, 2023. OllamaLLM [Online]. 
            # Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
            llm_checked_answer = answer_generator_llm.invoke(prompt_to_evaluate_answer)
            #
            print(llm_checked_answer)
            # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
            # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
            dataset_for_llm_answer_evaluation.loc[dataset_for_llm_answer_evaluation['id'] == dataset_record.id, "llm_answer_check"] = llm_checked_answer
            #
            if "correct" in llm_checked_answer.lower() and "incorrect" not in llm_checked_answer.lower() and "not correct" not in llm_checked_answer.lower():
                llm_evaluated_answer = 1
            else:
                llm_evaluated_answer = 0
            # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
            # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
            dataset_for_llm_answer_evaluation.loc[dataset_for_llm_answer_evaluation['id'] == dataset_record.id, "llm_answer_evaluation"] = llm_evaluated_answer
            #
            # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
            # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
            dataset_for_llm_answer_evaluation.loc[dataset_for_llm_answer_evaluation['id'] == dataset_record.id, "answer_evaluation"] = llm_evaluated_answer
            #
        else:
            print(f"The string answer evaluation for id {dataset_record.id} was generated")
    dataset_for_llm_answer_evaluation_address = f'{data_for_tuning_folder_address}/{dataset_for_llm_answer_evaluation_name}_answer_evaluated.csv'
    # Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
    dataset_for_llm_answer_evaluation.to_csv(dataset_for_llm_answer_evaluation_address, index=False)
    #
    print(f"The dataset for llm answer evaluation was saved to {dataset_for_llm_answer_evaluation_name}_answer_evaluated.csv")
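
The rule that turns the evaluator's free-text reply into a 0/1 score can be tried in isolation. The following is a minimal sketch, assuming the same substring checks as in `llm_answer_evaluate`; the function name `verdict_to_score` and the example replies are hypothetical and are not part of the notebook.

```python
def verdict_to_score(llm_checked_answer: str) -> int:
    """Return 1 if the evaluator's reply reads as 'correct', otherwise 0 (hypothetical helper)."""
    text = llm_checked_answer.lower()
    # "incorrect" and "not correct" both contain "correct", so they are excluded explicitly.
    is_correct = "correct" in text and "incorrect" not in text and "not correct" not in text
    return 1 if is_correct else 0


# Hypothetical evaluator replies
print(verdict_to_score("Correct"))
print(verdict_to_score("The answer is incorrect."))
print(verdict_to_score("This is not correct."))
print(verdict_to_score("Correct, although option B is incorrect."))
```

The rule is conservative: a reply that mentions both "correct" and "incorrect", such as the last example, is scored as 0, and a reply that contains neither word is also scored as 0.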

In [15]:
def evaluate_accuracy(dataset_path_for_answer_evaluation, dataset_to_evaluate, column_to_evaluate):
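    '''
    This function computes the accuracy of a 0/1 evaluation column in a dataset
    as the number of ones divided by the number of non-null entries in that column.
    The dataset path is used only in the printed messages.
    '''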
    # Code adapted from: pandas, 2024. pandas.DataFrame.sum (v.2.2) [Online]. 
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html#pandas-dataframe-sum [Accessed 13 October 2024].
    dataset_to_evaluate_column_ones = dataset_to_evaluate[column_to_evaluate].sum()
    #
    # Code adapted from: pandas, 2024. pandas.DataFrame.count (v.2.2) [Online]. 
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html#pandas-dataframe-count [Accessed 13 October 2024].
    dataset_to_evaluate_column_numbers = dataset_to_evaluate[column_to_evaluate].count()
    #
    # The equation is according to: Dutt, S., Chandramouli, S. and Das, A.K., 2019. Machine Learning [Online]. 1st ed. Uttar Pradesh, India: Pearson India. 
    # Available from: https://learning.oreilly.com/library/view/machine-learning-1st/9789353067373/ [Accessed 13 October 2024].
    # Page 76
    dataset_column_accuracy = dataset_to_evaluate_column_ones / dataset_to_evaluate_column_numbers
    #
    print(f'Number of ones in column {column_to_evaluate} of {dataset_path_for_answer_evaluation}: {dataset_to_evaluate_column_ones}')
    print(f'Number of non-null entries in column {column_to_evaluate} of {dataset_path_for_answer_evaluation}: {dataset_to_evaluate_column_numbers}')
    print(f'Accuracy of column {column_to_evaluate} in {dataset_path_for_answer_evaluation}: {dataset_column_accuracy}')
    return dataset_column_accuracy
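
As a small worked example of the accuracy calculation in `evaluate_accuracy`, the sketch below applies the same `sum()` / `count()` ratio to a hypothetical five-row column. The data frame `example_results` and its values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical evaluation results: 1 = correct, 0 = incorrect, NaN = not evaluated
example_results = pd.DataFrame({"answer_evaluation": [1, 0, 1, 1, np.nan]})

# sum() skips NaN by default, so it adds up the three ones
ones = example_results["answer_evaluation"].sum()
# count() returns the number of non-null entries, here 4 rather than 5
evaluated = example_results["answer_evaluation"].count()
accuracy = ones / evaluated

print(ones, evaluated, accuracy)
```

Because `count()` ignores missing values, the accuracy here is 3 / 4 = 0.75, whereas dividing by the number of rows would give 3 / 5 = 0.6. Rows without an evaluation therefore do not lower the reported accuracy.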

## Mistral ToT tuned

Extract the first answer in `string_answer_evaluate`

In [17]:
tot = True
extract_first_answer = True
test_data_for_evaluation_folder_name = "test-dataset-for-evaluation-mistral-tot-tuned"
test_data_for_evaluation_folder_address = create_data_for_tuning_folder_address(current_path, test_data_for_evaluation_folder_name)

In [None]:
create_folder_for_data_for_tuning(test_data_for_evaluation_folder_name, test_data_for_evaluation_folder_address)
test_dataset_for_generated_answers, test_dataset_for_generated_answers_version, test_generated_answers_count, test_total_dataset_rows = check_and_load_generated_answers_to_continue(test_dataset, test_data_for_evaluation_folder_name, test_data_for_evaluation_folder_address)
generate_and_check_answers(test_data_for_evaluation_folder_name, test_data_for_evaluation_folder_address, test_dataset_for_generated_answers, test_dataset_for_generated_answers_version, test_generated_answers_count, test_total_dataset_rows, tot)

In [None]:
test_dataset_for_string_answer_evaluation_name = "test_dataset_45_671_671"
test_dataset_path_for_string_answer_evaluation_csv = f"{test_data_for_evaluation_folder_name}/{test_dataset_for_string_answer_evaluation_name}.csv"
print(f'Test dataset path for string answer evaluation csv is {test_dataset_path_for_string_answer_evaluation_csv}')

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
test_dataset_for_string_answer_evaluation = pd.read_csv(test_dataset_path_for_string_answer_evaluation_csv)
#

string_answer_evaluate(test_data_for_evaluation_folder_address, test_dataset_for_string_answer_evaluation_name, test_dataset_for_string_answer_evaluation, extract_first_answer)

In [None]:
test_dataset_for_llm_answer_evaluation_name = "test_dataset_45_671_671_string_answer_evaluated_extracted"
test_dataset_path_for_llm_answer_evaluation_csv = f"{test_data_for_evaluation_folder_name}/{test_dataset_for_llm_answer_evaluation_name}.csv"
print(f'Test dataset path for llm answer evaluation csv is {test_dataset_path_for_llm_answer_evaluation_csv}')

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
test_dataset_for_llm_answer_evaluation = pd.read_csv(test_dataset_path_for_llm_answer_evaluation_csv)
#

llm_answer_evaluate(test_data_for_evaluation_folder_address, test_dataset_for_llm_answer_evaluation_name, test_dataset_for_llm_answer_evaluation, tot)

In [None]:
test_dataset_for_answer_evaluation_name = "test_dataset_45_671_671_string_answer_evaluated_extracted_answer_evaluated"
test_dataset_path_for_answer_evaluation_csv = f"{test_data_for_evaluation_folder_name}/{test_dataset_for_answer_evaluation_name}.csv"
print(f'Test dataset path for answer evaluation csv is {test_dataset_path_for_answer_evaluation_csv}')

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
test_dataset_for_answer_evaluation = pd.read_csv(test_dataset_path_for_answer_evaluation_csv)
#

evaluate_accuracy(test_dataset_path_for_answer_evaluation_csv, test_dataset_for_answer_evaluation, "string_answer_evaluation")
evaluate_accuracy(test_dataset_path_for_answer_evaluation_csv, test_dataset_for_answer_evaluation, "llm_answer_evaluation")
evaluate_accuracy(test_dataset_path_for_answer_evaluation_csv, test_dataset_for_answer_evaluation, "answer_evaluation")
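
The file names passed between the three evaluation stages follow a suffix pattern. The sketch below reproduces that pattern as it appears in the cells of this section; the helper `evaluation_file_names` and its dictionary keys are hypothetical, and the rule that `_extracted` is added only when `extract_first_answer` is `True` is inferred from the file names used here rather than from the body of `string_answer_evaluate`.

```python
def evaluation_file_names(base_name: str, extract_first_answer: bool) -> dict:
    """Build the CSV names read by the three evaluation stages (hypothetical helper)."""
    # Suffix inferred from the file names in this notebook; "_extracted" is
    # assumed to appear only when extract_first_answer is True.
    string_stage = f"{base_name}_string_answer_evaluated"
    if extract_first_answer:
        string_stage += "_extracted"
    # llm_answer_evaluate appends "_answer_evaluated" to the name it is given.
    llm_stage = f"{string_stage}_answer_evaluated"
    return {
        "string_answer_evaluation_input": f"{base_name}.csv",
        "llm_answer_evaluation_input": f"{string_stage}.csv",
        "accuracy_input": f"{llm_stage}.csv",
    }


names = evaluation_file_names("test_dataset_45_671_671", extract_first_answer=True)
for stage, file_name in names.items():
    print(f"{stage}: {file_name}")
```

Deriving the names from one base name in this way could avoid retyping the long file names in each cell, at the cost of relying on the inferred suffix rule.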

Do not extract the first answer in `string_answer_evaluate`

In [21]:
extract_first_answer = False

In [None]:
test_dataset_for_string_answer_evaluation_name = "test_dataset_45_671_671"
test_dataset_path_for_string_answer_evaluation_csv = f"{test_data_for_evaluation_folder_name}/{test_dataset_for_string_answer_evaluation_name}.csv"
print(f'Test dataset path for string answer evaluation csv is {test_dataset_path_for_string_answer_evaluation_csv}')

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
test_dataset_for_string_answer_evaluation = pd.read_csv(test_dataset_path_for_string_answer_evaluation_csv)
#

string_answer_evaluate(test_data_for_evaluation_folder_address, test_dataset_for_string_answer_evaluation_name, test_dataset_for_string_answer_evaluation, extract_first_answer)

In [None]:
test_dataset_for_llm_answer_evaluation_name = "test_dataset_45_671_671_string_answer_evaluated"
test_dataset_path_for_llm_answer_evaluation_csv = f"{test_data_for_evaluation_folder_name}/{test_dataset_for_llm_answer_evaluation_name}.csv"
print(f'Test dataset path for llm answer evaluation csv is {test_dataset_path_for_llm_answer_evaluation_csv}')

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
test_dataset_for_llm_answer_evaluation = pd.read_csv(test_dataset_path_for_llm_answer_evaluation_csv)
#

llm_answer_evaluate(test_data_for_evaluation_folder_address, test_dataset_for_llm_answer_evaluation_name, test_dataset_for_llm_answer_evaluation, tot)

In [None]:
test_dataset_for_answer_evaluation_name = "test_dataset_45_671_671_string_answer_evaluated_answer_evaluated"
test_dataset_path_for_answer_evaluation_csv = f"{test_data_for_evaluation_folder_name}/{test_dataset_for_answer_evaluation_name}.csv"
print(f'Test dataset path for answer evaluation csv is {test_dataset_path_for_answer_evaluation_csv}')

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
test_dataset_for_answer_evaluation = pd.read_csv(test_dataset_path_for_answer_evaluation_csv)
#

evaluate_accuracy(test_dataset_path_for_answer_evaluation_csv, test_dataset_for_answer_evaluation, "string_answer_evaluation")
evaluate_accuracy(test_dataset_path_for_answer_evaluation_csv, test_dataset_for_answer_evaluation, "llm_answer_evaluation")
evaluate_accuracy(test_dataset_path_for_answer_evaluation_csv, test_dataset_for_answer_evaluation, "answer_evaluation")
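
The three `evaluate_accuracy` calls report one column at a time. A minimal sketch of collecting the same ratio for all three columns in a single table follows; the helper `summarise_accuracy` and the four-row `example_dataset` are hypothetical stand-ins, not results from the notebook.

```python
import pandas as pd


def summarise_accuracy(dataset: pd.DataFrame, columns: list) -> pd.Series:
    """Return ones / non-null count for each evaluation column (hypothetical helper)."""
    return pd.Series({column: dataset[column].sum() / dataset[column].count() for column in columns})


# Hypothetical stand-in for test_dataset_for_answer_evaluation
example_dataset = pd.DataFrame({
    "string_answer_evaluation": [1, 0, 1, 0],
    "llm_answer_evaluation": [1, 1, 1, 0],
    "answer_evaluation": [1, 1, 1, 0],
})

evaluation_columns = ["string_answer_evaluation", "llm_answer_evaluation", "answer_evaluation"]
summary = summarise_accuracy(example_dataset, evaluation_columns)
print(summary)
```

Calling such a helper once for the run with first-answer extraction and once for the run without it would give two series that can be placed side by side in one `pd.DataFrame` for comparison.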