# ToT Data Answer Generator And Checker

This notebook samples the datasets produced by the ToT-data-ETL notebook, which are based on the MMLU dataset (Hendrycks et al, 2021a; Hendrycks et al, 2021b; Hendrycks et al, 2023). It then generates answers in the Tree of Thoughts (ToT) style from the enhanced questions in the sampled datasets, checks those answers, and saves the datasets with the ToT answers to files.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
 
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].

## Load

In [1]:
import pandas as pd

full_train_dataset_path_csv = "full_train_dataset.csv"
full_validation_dataset_path_csv = "full_validation_dataset.csv"
full_test_dataset_path_csv = "full_test_dataset.csv"

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
full_train_dataset = pd.read_csv(full_train_dataset_path_csv)
#

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
full_validation_dataset = pd.read_csv(full_validation_dataset_path_csv)
#

# Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
full_test_dataset = pd.read_csv(full_test_dataset_path_csv)
#

## Create the generated answer column

In [2]:
# Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 17 August 2024].
full_train_dataset.loc[:, "generated_answer"] = None
#

In [3]:
# Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 17 August 2024].
full_validation_dataset.loc[:, "generated_answer"] = None
#

In [4]:
# Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 17 August 2024].
full_test_dataset.loc[:, "generated_answer"] = None
#

## Sample

The relative proportions of the train, validation and test datasets are approximately in line with (Baheti, 2021, para.58).

Baheti, P., 2021. Train Test Validation Split: How to & Best Practices [2024] [Online]. s.l.: V7Labs. Available from: https://www.v7labs.com/blog/train-validation-test-set [Accessed 15 September 2024].

The sample sizes are approximately of the same order as those in (Awan, 2024, para.59; Zhang et al, 2023, p.12).

Awan, A.A., 2024. Fine-Tuning Llama 3.1 for Text Classification [Online]. s.l.: DataCamp. Available from: https://www.datacamp.com/tutorial/fine-tuning-llama-3-1 [Accessed 21 September 2024].

Zhang, Z., Ye, Z., Shen, Y. and Gan, C., 2023. Autonomous Tree-Search Ability of Large Language Models. Ithaca: Cornell University Library, arXiv.org. arXiv [Online]. Available from: https://arxiv.org/pdf/2310.10686.pdf [Accessed 25 August 2024].

In [5]:
# Code, that is, the function and the sampling of the dataset, adapted from: pandas, 2024. pandas.DataFrame.iloc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html [Accessed 17 August 2024].
def sample_rows(record, step):
    return record.index % step == 0

train_dataset = full_train_dataset.iloc[lambda record: sample_rows(record, 25)]
#
print("The length of the train dataset is", len(train_dataset))

The length of the train dataset is 3245
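
To see what `sample_rows` selects, the sketch below applies the same callable to a small, made-up DataFrame (the `toy_dataset` name and its contents are illustrative and not part of the notebook's data). Because the mask is built from index labels, it keeps every `step`-th row only when the index is the default 0, 1, 2, ... range, which is what `pd.read_csv` produces when no index column is specified.

```python
import pandas as pd


def sample_rows(record, step):
    # Boolean mask: True where the index label is a multiple of step.
    return record.index % step == 0


# Hypothetical 10-row dataset with the default RangeIndex (labels 0 to 9).
toy_dataset = pd.DataFrame({"id": range(100, 110)})

# iloc accepts a callable that returns a boolean array.
toy_sample = toy_dataset.iloc[lambda record: sample_rows(record, 3)]
print(toy_sample["id"].tolist())  # rows with index labels 0, 3, 6 and 9
```

If a dataset were filtered or re-indexed before sampling, resetting the index first would be needed for the step to stay regular.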


In [6]:
# Code adapted from: pandas, 2024. pandas.DataFrame.iloc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html [Accessed 17 August 2024].
validation_dataset = full_validation_dataset.iloc[lambda record: sample_rows(record, 3)]
#
print("The length of the validation dataset is", len(validation_dataset))

The length of the validation dataset is 557


In [7]:
# Code adapted from: pandas, 2024. pandas.DataFrame.iloc (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html [Accessed 17 August 2024].
test_dataset = full_test_dataset.iloc[lambda record: sample_rows(record, 19)]
#
print("The length of the test dataset is", len(test_dataset))

The length of the test dataset is 671
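
As a quick check of the relative proportions, the sketch below recomputes each split's share from the three lengths printed in this section. The variable names are illustrative and are not used elsewhere in the notebook.

```python
# Lengths taken from the printed outputs of the three sampling cells.
split_lengths = {"train": 3245, "validation": 557, "test": 671}

total_rows = sum(split_lengths.values())

# Share of each split as a percentage of all sampled rows.
split_shares = {
    name: round(100 * length / total_rows, 1)
    for name, length in split_lengths.items()
}
print(total_rows)
print(split_shares)
```

With these lengths the split comes to roughly 72.5% train, 12.5% validation and 15% test.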


## Prompt

In [8]:
def prompt_tot(enhanced_question, hint_1="", hint_2=""):
    # Prompt (lines 1 - 11 in the tot_prompt_to_generate_answer) reused and slightly adapted from: mrspiggot, 2023. langchain_tree.py [computer program].
    # Available from: https://github.com/mrspiggot/forestOfThoughts/blob/master/langchain_tree.py [Accessed 5 September 2024]. 
    # (mrspiggot, 2023, lines 11 - 22)
    #
    # Prompt (lines 16 - 99 in the tot_prompt_to_generate_answer, that is, where delimited by #####) is based on: Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y. and Narasimhan, K., 2023. 
    # Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 10-16 December 2023, New Orleans. 
    # Ithaca: Cornell University Library, arXiv.org, pp.1-14. Available from: https://arxiv.org/pdf/2305.10601.pdf [Accessed 17 August 2024].
    #
    # Prompt (lines 16 - 99 in the tot_prompt_to_generate_answer, that is, where delimited by #####) is based on the approach of how tree is structured: Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y. and Narasimhan, K., 2023. 
    # Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 10-16 December 2023, New Orleans. 
    # Ithaca: Cornell University Library, arXiv.org, pp.1-14. Available from: https://arxiv.org/pdf/2305.10601.pdf [Accessed 17 August 2024]. 
    # (for example, Yao et al, 2023, page 2, figure d)
    #
    # Prompt (lines 13 - 105, line 109 in the tot_prompt_to_generate_answer) is adapted and based on the ToT pattern according to: Zhang, Z., Ye, Z., Shen, Y. and Gan, C., 2023. 
    # Autonomous Tree-Search Ability of Large Language Models. Ithaca: Cornell University Library, arXiv.org. arXiv [Online]. 
    # Available from: https://arxiv.org/pdf/2310.10686.pdf [Accessed 25 August 2024]. 
    # (Zhang et al, 2023, page 13, C.1)
    #
    # Prompt (lines 16 - 106 in the tot_prompt_to_generate_answer) is adapted and based on: mrspiggot, 2023. langchain_tree.py [computer program].
    # Available from: https://github.com/mrspiggot/forestOfThoughts/blob/master/langchain_tree.py [Accessed 5 September 2024]. 
    # (mrspiggot, 2023, lines 11 - 26)
    #
    # Prompt (line 102 in the tot_prompt_to_generate_answer, that is, where indicated about the letters and line 11, line 97, line 99, line 106, line 108, line 109 in the tot_prompt_to_generate_answer) is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    #
    # Prompt (line 108 in the tot_prompt_to_generate_answer) reused and slightly adapted from: mrspiggot, 2023. langchain_tree.py [computer program].
    # Available from: https://github.com/mrspiggot/forestOfThoughts/blob/master/langchain_tree.py [Accessed 5 September 2024]. 
    # (mrspiggot, 2023, line 23)
    #
    # Prompt (line 109 in the tot_prompt_to_generate_answer) is adapted and based on: mrspiggot, 2023. langchain_tree.py [computer program].
    # Available from: https://github.com/mrspiggot/forestOfThoughts/blob/master/langchain_tree.py [Accessed 5 September 2024]. 
    # (mrspiggot, 2023, lines 11 - 26)
    #
    # Prompt (line 109, that is, before asking about the answer) is based on: Kojima, T., Gu, S.S., Reid, M., Matsuo, Y. and Iwasawa, Y., 2023. 
    # Large Language Models are Zero-Shot Reasoners. 36th Conference on Neural Information Processing Systems (NeurIPS 2022), 28 November 2022 – 9 December 2022, New Orleans. 
    # Ithaca: Cornell University Library, arXiv.org, pp.1-42. Available from: https://arxiv.org/pdf/2205.11916 [Accessed 15 September 2024]. 
    # (Kojima et al, 2023, page 2, figure d)
    #
    # Please note that enhanced_question variable in the tot_prompt_to_generate_answer would include transformed data from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    #
    # Please note that hint_1 and hint_2 variables in the tot_prompt_to_generate_answer could include transformed data from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    tot_prompt_to_generate_answer = f'''
    Imagine three different experts are answering this question in the Tree of Thoughts style.
    They will brainstorm and debate the answer step by step reasoning carefully and taking all facts into consideration
    All experts will write down 1 step of their thinking, then share it with the group.
    They will each critique their response, and then all the responses of others
    They will check their answer based on the appropriate rules.
    Then all experts will go on to the next step and write down this step of their thinking.
    They will keep going through steps until they reach their conclusion taking into account the thoughts of the other experts.
    If at any time they realise that there is a flaw in their logic they will backtrack to where that flaw occurred. 
    If any expert realises they're wrong at any point then they acknowledge this and start another train of thought.
    Each expert will assign a likelihood of their current assertion being correct.
    Continue until the experts agree on the single most likely answer {hint_1}

    Use the following template delimited by #####.

    #####
    Step 1

    Expert 1
    concise thought 1:
    concise critique of thought 1:
    probability of thought 1 being correct in percentage:

    Expert 2
    concise thought 2:
    concise critique of thought 2:
    probability of thought 2 being correct in percentage:

    Expert 3
    concise thought 3:
    concise critique of thought 3:
    probability of thought 3 being correct in percentage:

    Step 2

    Expert 1
    concise thought 1.1:
    concise critique of thought 1.1:
    probability of thought 1.1 being correct in percentage:

    Expert 1
    concise thought 1.2:
    concise critique of thought 1.2:
    probability of thought 1.2 being correct in percentage:

    Expert 2
    concise thought 2.1:
    concise critique of thought 2.1:
    probability of thought 2.1 being correct in percentage:

    Expert 2
    concise thought 2.2:
    concise critique of thought 2.2:
    probability of thought 2.2 being correct in percentage:

    Expert 3
    concise thought 3.1:
    concise critique of thought 3.1:
    probability of thought 3.1 being correct in percentage:

    Expert 3
    concise thought 3.2:
    concise critique of thought 3.2:
    probability of thought 3.2 being correct in percentage:

    Step 3

    Expert 1
    concise thought 1.1.1:
    concise critique of thought 1.1.1:
    probability of thought 1.1.1 being correct in percentage:

    Expert 1
    concise thought 1.2.1:
    concise critique of thought 1.2.1:
    probability of thought 1.2.1 being correct in percentage:

    Expert 2
    concise thought 2.1.1:
    concise critique of thought 2.1.1:
    probability of thought 2.1.1 being correct in percentage:

    Expert 2
    concise thought 2.2.1:
    concise critique of thought 2.2.1:
    probability of thought 2.2.1 being correct in percentage:

    Expert 3
    concise thought 3.1.1:
    concise critique of thought 3.1.1:
    probability of thought 3.1.1 being correct in percentage:

    Expert 3
    concise thought 3.2.1:
    concise critique of thought 3.2.1:
    probability of thought 3.2.1 being correct in percentage:

    Conclusion: write here the answer which all of the experts agree is the correct answer based on the final thoughts of the experts

    Answer: write here the answer letter which all of the experts agree is the correct answer based on the final thoughts of the experts
    #####

    Very important. The thoughts of experts must be diverse and different, that is, the experts must not repeat what they previously said and the thoughts of the experts should include letters (A, B, C or D) to which the thoughts refer or relate. 
    Very important. In Step 1, each expert must provide 1 thought (3 thoughts in total, that is, concise thought 1, concise thought 2, concise thought 3). 
    Very important. In Step 2, each expert must provide 2 thoughts (6 thoughts in total, that is, concise thought 1.1, concise thought 1.2, concise thought 2.1, concise thought 2.2, concise thought 3.1, concise thought 3.2). 
    Very important. In Step 3, each expert must provide at least 2 thoughts (at least 6 thoughts in total), etc.
    Very important. The answer letter which all of the experts agree is the correct answer based on the final thoughts of the experts should be in the format of Answer: answer letter which all of the experts agree is the correct answer based on the final thoughts of the experts, for example, Answer: A, Answer: B, Answer: C or Answer: D.

    The question is: {enhanced_question}
    Think in the Tree of Thoughts style. What is the answer and the answer letter? It is very important that you provide the correct answer letter in the format of Answer: A, Answer: B, Answer: C or Answer: D and it is very important that you provide it only after all the thoughts in step 3 {hint_2}
    '''
    #
    print(tot_prompt_to_generate_answer)
    return tot_prompt_to_generate_answer


## Generate And Check Answers

Create folders to save generated and checked answers

In [9]:
import os

current_path = os.getcwd()

def create_data_for_tuning_folder_address(current_path, data_for_tuning_folder_name):
    return current_path + "/" + data_for_tuning_folder_name

def create_folder_for_data_for_tuning(data_for_tuning_folder_name, data_for_tuning_folder_address):
    # Code, that is, the check for folder, adapted from: Python Software Foundation, 2024. os.path - Common pathname manipulations. os.path.exists (v.3.12.5) [Online]. 
    # Available from: https://docs.python.org/3/library/os.path.html#os.path.exists [Accessed 25 August 2024].
    data_for_tuning_folder_is_present = os.path.exists(data_for_tuning_folder_address)
    #
    
    if not data_for_tuning_folder_is_present:
        os.mkdir(data_for_tuning_folder_address)
        print(f'Created {data_for_tuning_folder_name} folder')
    else:
        print(f'{data_for_tuning_folder_name} folder is present')
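
A more compact alternative, shown only as a sketch: `os.path.join` builds the address with the separator of the host operating system, and `os.makedirs(..., exist_ok=True)` creates the folder only when it is missing, so no separate existence check is needed. The function name `ensure_data_for_tuning_folder` is hypothetical, and the demonstration uses a temporary directory instead of the notebook's working directory.

```python
import os
import tempfile


def ensure_data_for_tuning_folder(parent_path, folder_name):
    # Hypothetical helper: build the folder address and create it if needed.
    folder_address = os.path.join(parent_path, folder_name)
    os.makedirs(folder_address, exist_ok=True)
    return folder_address


# Demonstration in a throwaway temporary directory.
demo_parent = tempfile.mkdtemp()
demo_address = ensure_data_for_tuning_folder(demo_parent, "train-dataset-for-tuning")

# Calling it again is harmless because exist_ok=True suppresses the error.
ensure_data_for_tuning_folder(demo_parent, "train-dataset-for-tuning")
print(os.path.isdir(demo_address))
```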

Check if answers were already generated and load the appropriate dataset

In [10]:
import pandas as pd

def check_and_load_generated_answers_to_continue(initial_dataset, data_for_tuning_folder_name, data_for_tuning_folder_address):
    '''
    This function checks whether the folder contains any files with generated answers. If so, it loads the most recent file, which is either an intermediate file to continue from or a completed file; 
    otherwise it uses the initial dataset to start generating answers from the beginning
    '''
    # Code, that is, the finding of files for tuning, adapted from: Python Software Foundation, 2024. os - Miscellaneous operating system interfaces. os.listdir (v.3.12.5) [Online]. 
    # Available from: https://docs.python.org/3/library/os.html#os.listdir [Accessed 31 August 2024].
    files_for_tuning = os.listdir(data_for_tuning_folder_address)
    #

    if len(files_for_tuning):
        # Code, that is, the function and the ordering of files for tuning, adapted from: Python Software Foundation, 2024. Built-in Types. sort (v.3.12.5) [Online]. 
        # Available from: https://docs.python.org/3/library/stdtypes.html#list.sort [Accessed 31 August 2024].
        def check_version_of_file(file):
            return int(file.split('_')[2])
        
        files_for_tuning.sort(key=check_version_of_file, reverse=True)
        #
        print(f"The following files were found in the {data_for_tuning_folder_name} folder. {files_for_tuning}")
        recent_file_for_tuning = files_for_tuning[0]
        print(f"The most recent file in the {data_for_tuning_folder_name} folder is {recent_file_for_tuning}")
        recent_file_for_tuning_attributes = recent_file_for_tuning.split('_')

        recent_file_for_tuning_address = data_for_tuning_folder_address + '/' + recent_file_for_tuning
        # Code, that is, the loading of the dataset, adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online]. 
        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
        dataset_for_generated_answers = pd.read_csv(recent_file_for_tuning_address)
        #
        dataset_for_generated_answers_version = int(recent_file_for_tuning_attributes[2])
        generated_answers_count = int(recent_file_for_tuning_attributes[3])
        total_dataset_rows = int(recent_file_for_tuning_attributes[4].split('.')[0])
        
        if generated_answers_count < total_dataset_rows:
            print(f"The {data_for_tuning_folder_name} folder has intermediate files with generated answers. Continue the generation of answers. The dataset version is {dataset_for_generated_answers_version}. The count of generated answers is {generated_answers_count}. The total count of rows is {total_dataset_rows}.")
        elif generated_answers_count == total_dataset_rows:
            print(f"The {data_for_tuning_folder_name} folder has already generated answers. The dataset version is {dataset_for_generated_answers_version}. The count of generated answers is {generated_answers_count}. The total count of rows is {total_dataset_rows}.")
        return dataset_for_generated_answers, dataset_for_generated_answers_version, generated_answers_count, total_dataset_rows
    else:
        dataset_for_generated_answers = initial_dataset
        dataset_for_generated_answers_version = 0
        generated_answers_count = 0
        total_dataset_rows = len(dataset_for_generated_answers)
        print(f"The {data_for_tuning_folder_name} folder does not have files with generated answers. Start the generation of answers from the beginning. The dataset version is {dataset_for_generated_answers_version}. The count of generated answers is {generated_answers_count}. The total count of rows is {total_dataset_rows}.")
        return dataset_for_generated_answers, dataset_for_generated_answers_version, generated_answers_count, total_dataset_rows
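
The loader relies on the file name pattern `{dataset_type}_dataset_{version}_{generated_answers_count}_{total_dataset_rows}.csv`, which is the pattern used when the files are saved. The sketch below parses that pattern with a regular expression and ignores any name that does not match. Skipping non-matching names is an added, defensive assumption, since `os.listdir` can also return unrelated entries such as hidden files. The helper name `parse_tuning_file_name` and the example list are hypothetical.

```python
import re

# Matches names such as train_dataset_215_3225_3245.csv
TUNING_FILE_PATTERN = re.compile(
    r"^(?P<dataset_type>[a-z]+)_dataset_(?P<version>\d+)_(?P<count>\d+)_(?P<total>\d+)\.csv$"
)


def parse_tuning_file_name(file_name):
    # Return (version, generated_answers_count, total_dataset_rows), or None if the name does not match.
    match = TUNING_FILE_PATTERN.match(file_name)
    if match is None:
        return None
    return int(match["version"]), int(match["count"]), int(match["total"])


# Illustrative directory listing that includes one unrelated entry.
file_names = [
    "train_dataset_214_3210_3245.csv",
    ".ipynb_checkpoints",
    "train_dataset_215_3225_3245.csv",
]

parsed = {name: parse_tuning_file_name(name) for name in file_names}
valid_names = [name for name in file_names if parsed[name] is not None]

# The most recent file is the one with the highest version number.
most_recent = max(valid_names, key=lambda name: parsed[name][0])
print(most_recent, parsed[most_recent])
```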

Generate and check answers

In [11]:
# Code, that is, the import, reused from: LangChain, 2023. OllamaLLM [Online]. 
# Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
from langchain_ollama import OllamaLLM
#
# Code, that is, the import, reused from: NumPy Developers, 2024. Constants. numpy.nan (v.2.1) [Online].
# Available from: https://numpy.org/doc/stable/reference/constants.html#numpy.nan [Accessed 15 September 2024].
import numpy as np
#

def generate_and_check_answers(data_for_tuning_folder_name, data_for_tuning_folder_address, dataset_for_generated_answers, dataset_for_generated_answers_version, generated_answers_count, total_dataset_rows, quality_check=True):
    '''
    This function generates answers in the ToT style to be used for fine-tuning, optionally checks the answers for quality, and periodically saves the dataset as intermediate versions
    '''
    if generated_answers_count < total_dataset_rows:
        # Code, that is, the answer generator, adapted from: LangChain, 2023. OllamaLLM [Online]. 
        # Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
        answer_generator_llm = OllamaLLM(
            # Code, that is, the parameter, reused from: LangChain, 2023. OllamaLLM [Online]. 
            # Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
            # Code, that is, the value, reused from: Ollama, 2024. mixtral 8x7b-instruct-v0.1-fp16 [Online]. 
            # Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].
            model="mixtral:8x7b-instruct-v0.1-fp16"
            #
            )
        #
        generated_answers_count_updated = generated_answers_count
        dataset_for_generated_answers_version_updated = dataset_for_generated_answers_version
        save_dataset_count = 15
        # Code, that is, the loop, adapted from: pandas, 2024. pandas.DataFrame.itertuples (v.2.2) [Online]. 
        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples [Accessed 1 September 2024].
        for dataset_record in dataset_for_generated_answers.itertuples():
        #
            # pd.isna is True for both None and NaN, including NaN values that are not the np.nan singleton,
            # which an identity check against np.nan would miss.
            if pd.isna(dataset_record.generated_answer):
                if quality_check:
                    retries_for_quality = 0
                    max_retries_for_quality = 3
                    while retries_for_quality < max_retries_for_quality:
                        if retries_for_quality < (max_retries_for_quality - 2):
                            prompt_to_generate_answer = prompt_tot(dataset_record.enhanced_question)
                        else:
                            # Code is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                            # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                            hint_1 = f"which is {dataset_record.letter_answer}"
                            hint_2 = f"(remember, the final correct answer must be {dataset_record.letter_answer})"
                            #
                            prompt_to_generate_answer = prompt_tot(dataset_record.enhanced_question, hint_1, hint_2)
                        # Code, that is, the answer generation, adapted from: LangChain, 2023. OllamaLLM [Online]. 
                        # Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
                        generated_answer = answer_generator_llm.invoke(prompt_to_generate_answer)
                        #
                        print(generated_answer)
                        # This simple quality check verifies that the generated answer contains at least the distinct second-level thoughts (1.1 to 3.2) and the expected answer letter. The second-level structure is based on:
                        # Zhang, Z., Ye, Z., Shen, Y. and Gan, C., 2023. Autonomous Tree-Search Ability of Large Language Models. Ithaca: Cornell University Library, arXiv.org. arXiv [Online]. Available from: https://arxiv.org/pdf/2310.10686.pdf [Accessed 25 August 2024].
                        # Page 13, C.1
                        if (("1.1" in generated_answer and 
                        "1.2" in generated_answer and 
                        "2.1" in generated_answer and 
                        "2.2" in generated_answer and 
                        "3.1" in generated_answer and 
                        "3.2" in generated_answer) and 
                        #
                        # Code is based on several outputs from Mixtral 8x7b Instruct version v0.1, fp16 (pers. comm.) on 03/10/2024 
                        # from tot_prompt_to_generate_answer prompt in the prompt_tot function in the Prompt section in this notebook, with and without hint_1 and hint_2 
                        # values for the prompt which are indicated above in the code and enhanced_question values from train_dataset 
                        # for the prompt (that is, the outputs that could be produced with the code in the notebook) which is based on the MMLU dataset: 
                        # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                        # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                        # Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. 
                        # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
                        # Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. 
                        # ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024].
                        # Please note that Mixtral 8x7b Instruct version v0.1, fp16 has been used locally using: 
                        # Ollama, 2024a. Ollama [computer program]. Available from: https://ollama.com [Accessed 1 September 2024].
                        # Ollama, 2024b. mixtral 8x7b-instruct-v0.1-fp16 [Online]. 
                        # Available from: https://ollama.com/library/mixtral:8x7b-instruct-v0.1-fp16 [Accessed 25 September 2024].
                        # Code is based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
                        # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
                        (f"Answer: {dataset_record.letter_answer}" in generated_answer or   
                         f"answer: {dataset_record.letter_answer}" in generated_answer or
                         f"Answer is {dataset_record.letter_answer}" in generated_answer or
                         f"answer is {dataset_record.letter_answer}" in generated_answer or
                         f"Answer is: {dataset_record.letter_answer}" in generated_answer or
                         f"answer is: {dataset_record.letter_answer}" in generated_answer)):
                        #
                            # The answer passed the check: count this attempt and leave the retry loop.
                            retries_for_quality += 1
                            break
                        retries_for_quality += 1
                    # If every attempt fails the check, the last generated answer is still stored.
                    # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                    dataset_for_generated_answers.loc[dataset_for_generated_answers['id'] == dataset_record.id, "generated_answer"] = generated_answer
                    #
                    generated_answers_count_updated += 1
                    print(f"The generated answer for {dataset_record.id} was generated with quality check after {retries_for_quality} attempt(s)")
                    if (generated_answers_count_updated % save_dataset_count == 0) or (generated_answers_count_updated == total_dataset_rows):
                        dataset_for_generated_answers_version_updated += 1
                        dataset_type = data_for_tuning_folder_name.split('-')[0]
                        dataset_for_generated_answers_address = data_for_tuning_folder_address + '/' + f'{dataset_type}_dataset_{dataset_for_generated_answers_version_updated}_{generated_answers_count_updated}_{total_dataset_rows}.csv'
                        # Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
                        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
                        dataset_for_generated_answers.to_csv(dataset_for_generated_answers_address, index=False)
                        #
                        print(f"The dataset for generated answers was saved for the version {dataset_for_generated_answers_version_updated}. The count of generated answers is {generated_answers_count_updated}. The count of total rows is {total_dataset_rows}")
                else:
                    prompt_to_generate_answer = prompt_tot(dataset_record.enhanced_question)
                    # Code, that is, the answer generation, adapted from: LangChain, 2023. OllamaLLM [Online]. 
                    # Available from: https://python.langchain.com/v0.2/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html [Accessed 1 September 2024].
                    generated_answer = answer_generator_llm.invoke(prompt_to_generate_answer)
                    #
                    print(generated_answer)
                    # Code adapted from: pandas, 2024. pandas.DataFrame.loc (v.2.2) [Online]. 
                    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc [Accessed 5 September 2024].
                    dataset_for_generated_answers.loc[dataset_for_generated_answers['id'] == dataset_record.id, "generated_answer"] = generated_answer
                    #
                    generated_answers_count_updated += 1
                    print(f"The generated answer for {dataset_record.id} was generated without quality check")
                    if (generated_answers_count_updated % save_dataset_count == 0) or (generated_answers_count_updated == total_dataset_rows):
                        dataset_for_generated_answers_version_updated += 1
                        dataset_type = data_for_tuning_folder_name.split('-')[0]
                        dataset_for_generated_answers_address = data_for_tuning_folder_address + '/' + f'{dataset_type}_dataset_{dataset_for_generated_answers_version_updated}_{generated_answers_count_updated}_{total_dataset_rows}.csv'
                        # Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
                        # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
                        dataset_for_generated_answers.to_csv(dataset_for_generated_answers_address, index=False)
                        #
                        print(f"The dataset for generated answers was saved for the version {dataset_for_generated_answers_version_updated}. The count of generated answers is {generated_answers_count_updated}. The count of total rows is {total_dataset_rows}")
            else:
                print(f"The generated answer for id {dataset_record.id} was already generated")
    elif generated_answers_count == total_dataset_rows:
        print("The generated answers were already generated for the dataset")
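
The checkpointing rule used inside `generate_and_check_answers` can be isolated as a minimal sketch. The helper names `should_save_checkpoint` and `checkpoint_file_name` are hypothetical and do not appear in the notebook, and the interval of 15 is an assumption inferred from the saved file names (version 1 holds 15 answers, version 2 holds 30), not a confirmed value of `save_dataset_count`.

```python
def should_save_checkpoint(generated_count: int, save_every: int, total_rows: int) -> bool:
    """Return True on every save_every-th answer and on the very last answer."""
    return generated_count % save_every == 0 or generated_count == total_rows


def checkpoint_file_name(folder_name: str, version: int, generated_count: int, total_rows: int) -> str:
    """Build <type>_dataset_<version>_<generated>_<total>.csv from the folder name prefix."""
    dataset_type = folder_name.split('-')[0]
    return f"{dataset_type}_dataset_{version}_{generated_count}_{total_rows}.csv"


# Assumed interval of 15 answers between checkpoints (hypothetical value).
print(should_save_checkpoint(255, 15, 557))  # True
print(checkpoint_file_name("validation-dataset-for-tuning", 17, 255, 557))
# validation_dataset_17_255_557.csv
```

Because the final row also triggers a save, the last checkpoint is written even when the total row count is not a multiple of the interval.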

Train dataset for tuning

In [12]:
train_data_for_tuning_folder_name = "train-dataset-for-tuning"
train_data_for_tuning_folder_address = create_data_for_tuning_folder_address(current_path, train_data_for_tuning_folder_name)
create_folder_for_data_for_tuning(train_data_for_tuning_folder_name, train_data_for_tuning_folder_address)
train_dataset_for_generated_answers, train_dataset_for_generated_answers_version, train_generated_answers_count, train_total_dataset_rows = check_and_load_generated_answers_to_continue(train_dataset, train_data_for_tuning_folder_name, train_data_for_tuning_folder_address)
generate_and_check_answers(train_data_for_tuning_folder_name, train_data_for_tuning_folder_address, train_dataset_for_generated_answers, train_dataset_for_generated_answers_version, train_generated_answers_count, train_total_dataset_rows, True)

train-dataset-for-tuning folder is present
The following files were found in the train-dataset-for-tuning folder. ['train_dataset_215_3225_3245.csv', 'train_dataset_214_3210_3245.csv', 'train_dataset_213_3195_3245.csv', 'train_dataset_212_3180_3245.csv', 'train_dataset_211_3165_3245.csv', 'train_dataset_210_3150_3245.csv', 'train_dataset_209_3135_3245.csv', 'train_dataset_208_3120_3245.csv', 'train_dataset_207_3105_3245.csv', 'train_dataset_206_3090_3245.csv', 'train_dataset_205_3075_3245.csv', 'train_dataset_204_3060_3245.csv', 'train_dataset_203_3045_3245.csv', 'train_dataset_202_3030_3245.csv', 'train_dataset_201_3015_3245.csv', 'train_dataset_200_3000_3245.csv', 'train_dataset_199_2985_3245.csv', 'train_dataset_198_2970_3245.csv', 'train_dataset_197_2955_3245.csv', 'train_dataset_196_2940_3245.csv', 'train_dataset_195_2925_3245.csv', 'train_dataset_194_2910_3245.csv', 'train_dataset_193_2895_3245.csv', 'train_dataset_192_2880_3245.csv', 'train_dataset_191_2865_3245.csv', 'train_dat

Validation dataset for tuning

In [12]:
validation_data_for_tuning_folder_name = "validation-dataset-for-tuning"
validation_data_for_tuning_folder_address = create_data_for_tuning_folder_address(current_path, validation_data_for_tuning_folder_name)
create_folder_for_data_for_tuning(validation_data_for_tuning_folder_name, validation_data_for_tuning_folder_address)
validation_dataset_for_generated_answers, validation_dataset_for_generated_answers_version, validation_generated_answers_count, validation_total_dataset_rows = check_and_load_generated_answers_to_continue(validation_dataset, validation_data_for_tuning_folder_name, validation_data_for_tuning_folder_address)
generate_and_check_answers(validation_data_for_tuning_folder_name, validation_data_for_tuning_folder_address, validation_dataset_for_generated_answers, validation_dataset_for_generated_answers_version, validation_generated_answers_count, validation_total_dataset_rows, True)

validation-dataset-for-tuning folder is present
The following files were found in the validation-dataset-for-tuning folder. ['validation_dataset_17_255_557.csv', 'validation_dataset_16_240_557.csv', 'validation_dataset_15_225_557.csv', 'validation_dataset_14_210_557.csv', 'validation_dataset_13_195_557.csv', 'validation_dataset_12_180_557.csv', 'validation_dataset_11_165_557.csv', 'validation_dataset_10_150_557.csv', 'validation_dataset_9_135_557.csv', 'validation_dataset_8_120_557.csv', 'validation_dataset_7_105_557.csv', 'validation_dataset_6_90_557.csv', 'validation_dataset_5_75_557.csv', 'validation_dataset_4_60_557.csv', 'validation_dataset_3_45_557.csv', 'validation_dataset_2_30_557.csv', 'validation_dataset_1_15_557.csv']
The most recent file in the validation-dataset-for-tuning folder is validation_dataset_17_255_557.csv
The validation-dataset-for-tuning folder has intermediate files with generated answers. Continue the generation of answers. The dataset version is 17. The coun
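
The output above shows the run resuming from the most recent checkpoint file. How `check_and_load_generated_answers_to_continue` picks that file is not visible in this section, so the following is only a sketch of one possible approach: parse the version, answer count and row count out of each file name and take the highest version. The names `CHECKPOINT_PATTERN`, `parse_checkpoint_name` and `latest_checkpoint` are hypothetical.

```python
import re

# Matches names such as validation_dataset_17_255_557.csv
CHECKPOINT_PATTERN = re.compile(
    r"^(?P<dataset_type>[a-z]+)_dataset_(?P<version>\d+)_(?P<generated>\d+)_(?P<total>\d+)\.csv$"
)


def parse_checkpoint_name(file_name):
    """Return the fields encoded in a checkpoint file name, or None if it does not match."""
    match = CHECKPOINT_PATTERN.match(file_name)
    if match is None:
        return None
    return {
        "dataset_type": match["dataset_type"],
        "version": int(match["version"]),
        "generated": int(match["generated"]),
        "total": int(match["total"]),
    }


def latest_checkpoint(file_names):
    """Return the file name with the highest numeric version, or None if none match."""
    candidates = [(parse_checkpoint_name(name), name) for name in file_names]
    valid = [(info, name) for info, name in candidates if info is not None]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[0]["version"])[1]


files = [
    "validation_dataset_9_135_557.csv",
    "validation_dataset_17_255_557.csv",
    "validation_dataset_2_30_557.csv",
]
print(latest_checkpoint(files))  # validation_dataset_17_255_557.csv
```

Comparing versions as integers matters here: a plain string sort would rank version 9 above version 17.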

Test dataset for tuning

In [12]:
test_data_for_tuning_folder_name = "test-dataset-for-tuning"
test_data_for_tuning_folder_address = create_data_for_tuning_folder_address(current_path, test_data_for_tuning_folder_name)
create_folder_for_data_for_tuning(test_data_for_tuning_folder_name, test_data_for_tuning_folder_address)

# Generated answers are not required for the test dataset, so the two calls below stay commented out
# and the test dataset is saved unchanged.
# test_dataset_for_generated_answers, test_dataset_for_generated_answers_version, test_generated_answers_count, test_total_dataset_rows = check_and_load_generated_answers_to_continue(test_dataset, test_data_for_tuning_folder_name, test_data_for_tuning_folder_address)
# generate_and_check_answers(test_data_for_tuning_folder_name, test_data_for_tuning_folder_address, test_dataset_for_generated_answers, test_dataset_for_generated_answers_version, test_generated_answers_count, test_total_dataset_rows, True)

# No answers are generated for the test dataset, so the version and the generated-answer count in the file name are both 0.
test_dataset_type = 'test'
test_dataset_for_generated_answers_version_updated = 0
test_generated_answers_count_updated = 0
test_total_dataset_rows = len(test_dataset)
test_dataset_for_generated_answers_address = f'{test_data_for_tuning_folder_address}/{test_dataset_type}_dataset_{test_dataset_for_generated_answers_version_updated}_{test_generated_answers_count_updated}_{test_total_dataset_rows}.csv'

# Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
test_dataset.to_csv(test_dataset_for_generated_answers_address, index=False)
#

Created test-dataset-for-tuning folder
