# <h1 align="center"><font color="green">Improving Instruction-Data Via Reflection-Tuning Using GPT-4</font></h1>

<font color="pink">Senior Data Scientist: Dr. Eddy Giusepe Chirinos Isidro</font>

This notebook uses OpenAI's `GPT-4` API to implement the dataset refinement process from the paper [Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning](https://arxiv.org/pdf/2310.11716).

This study is based on the tutorial by [Dr. Sebastian Raschka]().

![](https://camo.githubusercontent.com/8577304d0568ac398ea62537100abd0f0d5e65890af92fa05440ef541bf49192/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f626f6e75732f7265666c656374696f6e2d74756e696e672f7265666c656374696f6e2d74756e696e672e77656270)

* In the original paper, the researchers refined the `instruction-finetuning` datasets [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) and [WizardLM](https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_70k); in this notebook, we refine the [instruction dataset used in chapter 7](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/instruction-data.json). Because it has the same format as `Alpaca`, the same code also works with the `Alpaca` dataset.

* The expected dataset format is as follows:

```
{
    "instruction": "Edit the following sentence for grammar.",
    "input": "He go to the park every day.",
    "output": "He goes to the park every day."
},
{
    "instruction": "Convert 45 kilometers to meters.",
    "input": "",
    "output": "45 kilometers is 45000 meters."
},
```
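
As a minimal sketch (not part of the original tutorial), the expected format can be checked before any API calls are made. The helper name `check_entry` and the deliberately malformed third entry are hypothetical additions for illustration.

```python
# Hypothetical helper: verifies that an entry follows the Alpaca-style format
REQUIRED_KEYS = ("instruction", "input", "output")


def check_entry(entry):
    """Return True if `entry` is a dict whose three required fields are strings."""
    return isinstance(entry, dict) and all(
        isinstance(entry.get(key), str) for key in REQUIRED_KEYS
    )


sample = [
    {
        "instruction": "Edit the following sentence for grammar.",
        "input": "He go to the park every day.",
        "output": "He goes to the park every day.",
    },
    {
        "instruction": "Convert 45 kilometers to meters.",
        "input": "",
        "output": "45 kilometers is 45000 meters.",
    },
    {"instruction": "Missing the other two fields."},  # deliberately malformed
]

# One flag per entry: True when the entry has all three string fields
valid_flags = [check_entry(entry) for entry in sample]
print(valid_flags)
```

Note that an empty `"input"` string still counts as valid here, since entries without an input are part of the format.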

In [2]:
from importlib.metadata import version

pkgs = [
    "openai",  # OpenAI API
    "tqdm",    # Progress bar
]

for p in pkgs:
    print(f"{p} version: {version(p)}")

openai version: 1.44.1
tqdm version: 4.66.5


In [3]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file


from openai import OpenAI
client = OpenAI()

First, let's test the API with a simple example to make sure it works as expected:

In [4]:
def run_chatgpt(prompt, client, model="gpt-4o-mini", system_prompt=None):
    # Define the system message if a system_prompt is provided:
    messages = []
    
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    
    # Add the user prompt to the messages:
    messages.append({"role": "user", "content": prompt})

    # Call the API:
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0,
        seed=123,
    )
    
    # Return the model's response:
    return response.choices[0].message.content


prompt = "Responda com 'olá mundo' se você recebeu esta mensagem."
run_chatgpt(prompt, client)


'Olá mundo!'
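
The message list that `run_chatgpt` sends can also be inspected without calling the API. The sketch below mirrors that assembly logic in a hypothetical helper, `build_messages`, which is not part of the original notebook; the example prompts are placeholders.

```python
def build_messages(prompt, system_prompt=None):
    """Mirror of the message assembly in run_chatgpt (hypothetical helper)."""
    messages = []

    # The system message is only added when a system prompt is provided
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})

    # The user prompt always comes last
    messages.append({"role": "user", "content": prompt})
    return messages


without_system = build_messages("Hello")
with_system = build_messages("Hello", system_prompt="You are a picky assistant.")

print([m["role"] for m in without_system])
print([m["role"] for m in with_system])
```

This makes it easy to confirm that the system prompt, when present, precedes the user prompt.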

# <font color="red">Load JSON entries</font>

* Next, let's load and process the instruction dataset

* Here, we assume the instruction dataset is saved as a `JSON` file that we can load as follows:

In [6]:
from pathlib import Path
import json


json_file = Path(".") / "01_main-chapter-code" / "instruction-data.json"

with open(json_file, "r") as file:
    json_data = json.load(file)

print("Number of entries:", len(json_data))

Number of entries: 1100


Let's print one of the dataset entries to see its structure:

In [7]:
from pprint import pp as pprint

pprint(json_data[0])

{'instruction': 'Evaluate the following phrase by transforming it into the '
                'spelling given.',
 'input': 'freind --> friend',
 'output': 'The spelling of the given phrase "freind" is incorrect, the '
           'correct spelling is "friend".'}


# <font color="red">Improve the instructions</font>

* The `Reflection-Tuning` authors shared two approaches: `(1)` improving the instructions and `(2)` improving the responses

* Let's start by improving the instructions in a given dataset

* Below is a small utility function from the [Reflection-Tuning repository](https://github.com/tianyi-lab/Reflection_Tuning/blob/main/reflection_code/reflect_response.py) that formats the inputs for the `GPT-4` model for this dataset refinement

In [8]:
def instr_prompt_no_input(ins, outp):

    sys_prompt = "You are a helpful, precise but picky assistant for checking the quality of a given instruction."

    prompt_template = "[Instruction]\n{ins}\n\n[The Start of Answer]\n{outp}\n\n[The End of Answer]\n\n[System]\n{criteria}\n\n"
    
    criteria = "We would like you to answer several questions related to the quality of a given instruction. \n" + \
                "1. Why this instruction is not good? First analyse the instruction based on Complexity of the Topic, Level of Detail Required, Knowledge Required, Ambiguity of the Instruction and Logical Reasoning or Problem-Solving Involved. \n" + \
                "Then analyse why this answer is not good for the given instruction? Analyse based on the Helpfulness, Relevance, Accuracy and Level of Details. \n" + \
                "Finally analyse why this bad instruction lead to a bad answer. " +\
                "2. Based on the reason you provided, generate a new and complete instruction which is complex and difficult to answer directly. " + \
                "Make sure the new instruction is relevent but independent to the original instruction, which can be answered without knowing the original instruction, put the new instruction in the format of [New Instruction] your instruction [End]" +\
                "3. Answer the newly generated instruction as detailed as possible, in the format of [New Answer] your answer [End] \n"
    
    prompt = prompt_template.format(ins=ins, outp=outp, criteria=criteria)
    
    return sys_prompt, prompt


To see how it works, consider the dataset entry `json_data[2]`:

In [10]:
pprint(json_data[2])

{'instruction': 'Convert 45 kilometers to meters.',
 'input': '',
 'output': '45 kilometers is 45000 meters.'}


We can refine the instruction as follows, using the `instr_prompt_no_input` function defined above:

In [11]:
entry = json_data[2]

system_prompt, prompt = instr_prompt_no_input(ins=entry["instruction"], outp=entry["output"])


In [12]:
system_prompt

'You are a helpful, precise but picky assistant for checking the quality of a given instruction.'

In [14]:
pprint(prompt)

('[Instruction]\n'
 'Convert 45 kilometers to meters.\n'
 '\n'
 '[The Start of Answer]\n'
 '45 kilometers is 45000 meters.\n'
 '\n'
 '[The End of Answer]\n'
 '\n'
 '[System]\n'
 'We would like you to answer several questions related to the quality of a '
 'given instruction. \n'
 '1. Why this instruction is not good? First analyse the instruction based on '
 'Complexity of the Topic, Level of Detail Required, Knowledge Required, '
 'Ambiguity of the Instruction and Logical Reasoning or Problem-Solving '
 'Involved. \n'
 'Then analyse why this answer is not good for the given instruction? Analyse '
 'based on the Helpfulness, Relevance, Accuracy and Level of Details. \n'
 'Finally analyse why this bad instruction lead to a bad answer. 2. Based on '
 'the reason you provided, generate a new and complete instruction which is '
 'complex and difficult to answer directly. Make sure the new instruction is '
 'relevent but independent to the original instruction, which can be answered '
 'wit

In [18]:
output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)

print(output)

1. **Analysis of the Instruction:**

   - **Complexity of the Topic:** The topic of converting kilometers to meters is relatively simple and straightforward, as it involves basic unit conversion.
   - **Level of Detail Required:** The instruction lacks detail regarding the context or purpose of the conversion. It does not specify if the conversion is for a specific application or if additional information is needed.
   - **Knowledge Required:** Basic knowledge of metric units and their conversions is required, which is common knowledge.
   - **Ambiguity of the Instruction:** The instruction is clear in its request; however, it does not provide any context that could enhance understanding or relevance.
   - **Logical Reasoning or Problem-Solving Involved:** There is minimal logical reasoning involved, as the conversion is a direct calculation.

   **Analysis of the Answer:**

   - **Helpfulness:** The answer is helpful in that it provides the correct conversion, but it lacks any explana

* The response is very detailed, which is useful for analysis purposes; it also helps the `GPT-4` model make improvements via `Chain-of-Thought Prompting`

* However, to build the improved dataset, we are only interested in the new instructions and outputs, not the analyses

* We can use the following utility code from the [Reflection-Tuning repository](https://github.com/tianyi-lab/Reflection_Tuning/blob/main/reflection_code/reflect_response.py) to extract the model's improved instructions and outputs

In [19]:
import re

def extract_ins(text, no_input=True):
    if '[New Instruction]' in text:
        pattern = r'(\[New Instruction\])(.*?)(\[End\]|\[New Answer\]|New Answer:)'
    else:
        pattern = r'(New Instruction:)(.*?)(\[End\]|\[New Answer\]|New Answer:)'
    segments = re.findall(pattern, text, re.DOTALL)
    if len(segments) == 0:
        seg_ins = ''
    else:
        seg_ins = segments[0][1].strip()
    if seg_ins.endswith("\n\n3."):
        seg_ins = seg_ins[:-4]
    return seg_ins


def extract_oup(text, no_input=True):
    if '[New Answer]' in text:
        pattern = r'(\[New Answer\])(.*?)(\[End\]|$)'
    else:
        pattern = r'(New Answer:)(.*?)(\[End\]|$)'
        # pattern = r'(\[New Answer\]|New Answer:)(.*?)(\[End\]|$)'
    segments = re.findall(pattern, text, re.DOTALL)
    if len(segments) == 0:
        seg_oup = ''
    else:
        seg_oup = segments[0][1].strip()
    return seg_oup


def extract_instruction(text):
    if text == '':
        return []
    seg_ins = extract_ins(text, no_input=True)
    seg_oup = extract_oup(text, no_input=True)
    return [seg_ins, seg_oup]
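
To illustrate what the bracketed patterns capture, here is a small self-contained sketch that applies the same regular expressions to two made-up model outputs (`fake_output` and `no_end_output` are hypothetical strings, not real API responses). The second case shows why `extract_ins` trims a trailing `"\n\n3."`: when the model omits `[End]`, the match runs up to `[New Answer]` and picks up the list number of the next item.

```python
import re

# Hypothetical model output that follows the format requested in the prompt
fake_output = (
    "1. The instruction is too simple...\n\n"
    "2. [New Instruction] Explain how unit prefixes work in the metric system. [End]\n\n"
    "3. [New Answer] The prefix kilo- means one thousand, so 1 km = 1000 m. [End]"
)

# Same bracketed patterns as in extract_ins and extract_oup
ins_pattern = r'(\[New Instruction\])(.*?)(\[End\]|\[New Answer\]|New Answer:)'
oup_pattern = r'(\[New Answer\])(.*?)(\[End\]|$)'

new_instruction = re.findall(ins_pattern, fake_output, re.DOTALL)[0][1].strip()
new_answer = re.findall(oup_pattern, fake_output, re.DOTALL)[0][1].strip()
print(new_instruction)
print(new_answer)

# Hypothetical output where the model forgot the [End] markers
no_end_output = "[New Instruction] Explain X.\n\n3. [New Answer] Because Y."
raw_instruction = re.findall(ins_pattern, no_end_output, re.DOTALL)[0][1].strip()

# Without [End], the capture runs up to "[New Answer]" and ends with "\n\n3."
trimmed_instruction = raw_instruction
if trimmed_instruction.endswith("\n\n3."):
    trimmed_instruction = trimmed_instruction[:-4]
print(repr(raw_instruction), "->", repr(trimmed_instruction))
```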


Let's use these utility functions to extract the improved instruction and response from the long `GPT-4` output generated earlier:

In [20]:
new_instr, new_outp = extract_instruction(output)

In [28]:
import textwrap

print(textwrap.fill(new_instr, width=100))

Explain the significance of converting kilometers to meters in the context of scientific research,
and provide an example of a scenario where this conversion is critical. Include any relevant
formulas or calculations that may be necessary for understanding the conversion process and its
implications in real-world applications.


In [29]:
print(textwrap.fill(new_outp, width=100))

Converting kilometers to meters is significant in scientific research because precise measurements
are crucial for data accuracy and consistency. In many scientific fields, such as physics,
environmental science, and engineering, measurements are often required in standard units, which are
typically in meters.      For example, consider a scenario in environmental science where
researchers are studying the impact of a pollutant that spreads over a distance of 5 kilometers from
a source. To calculate the area affected by the pollutant, researchers need to convert the distance
into meters for consistency with other measurements, such as area (which is often measured in square
meters).     The conversion process is straightforward:     - 1 kilometer = 1,000 meters.    -
Therefore, to convert 5 kilometers to meters, you multiply by 1,000:      \[      5 \text{ km}
\times 1,000 \text{ m/km} = 5,000 \text{ m}      \]     In this case, understanding the conversion
is critical because if the r

<font color="orange">Note that instruction refinement (`instruction-refinement`) is currently implemented only for dataset entries that do not have an `"input"` field</font>
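
A quick way to see how many entries instruction refinement would actually touch is to split the dataset on the `"input"` field. The sketch below is an illustrative addition that uses the three entries shown in this notebook rather than the full file; the variable names `eligible` and `skipped` are hypothetical.

```python
# The first three dataset entries shown in this notebook
entries = [
    {
        "instruction": "Evaluate the following phrase by transforming it into the spelling given.",
        "input": "freind --> friend",
        "output": 'The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".',
    },
    {
        "instruction": "Edit the following sentence for grammar.",
        "input": "He go to the park every day.",
        "output": "He goes to the park every day.",
    },
    {
        "instruction": "Convert 45 kilometers to meters.",
        "input": "",
        "output": "45 kilometers is 45000 meters.",
    },
]

# Entries with an empty "input" are eligible for instruction refinement
eligible = [entry for entry in entries if not entry["input"]]
# Entries with a non-empty "input" would be passed through unchanged
skipped = [entry for entry in entries if entry["input"]]

print("eligible:", len(eligible), "| skipped:", len(skipped))
```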

# <font color="red">Improve the responses</font>

* Similarly, we can also apply the `Reflection-Tuning` refinement process specifically to the dataset responses (that is, the `"output"` fields).

* Below are two small utility functions from the [Reflection-Tuning repository](https://github.com/tianyi-lab/Reflection_Tuning/blob/main/reflection_code/reflect_response.py) that format the inputs for the GPT-4 model for dataset refinement.

In [30]:
def res_gen_prompt_no_input(ins, outp):

    sys_prompt = "You are a helpful, precise but picky assistant for checking the quality of the answer to a given instruction."
    prompt_template = "[Instruction]\n{ins}\n\n[The Start of Answer]\n{outp}\n\n[The End of Answer]\n\n[System]\n{criteria}\n\n"
    criteria = "We would like you to answer several questions related to the quality of the answer to the given instruction. \n" + \
                "1. Why this answer is not good for the given instruction? Analyse based on the Helpfulness, Relevance, Accuracy and Level of Details. \n" + \
                "2. Based on the reason you provided, generate a better answer, new and complete, as detailed as possible, in the format of [Better Answer] your answer [End] \n" 
    prompt = prompt_template.format(
        ins=ins, outp=outp, criteria=criteria
    )
    return sys_prompt, prompt


def res_gen_prompt_input(ins, inp, outp):

    sys_prompt = "You are a helpful and precise assistant for checking the quality of the answer to a given instruction and its input."
    prompt_template = "[Instruction]\n{ins}\n\n[The Start of Input]\n{inp}\n\n[The End of Input]\n\n[The Start of Answer]\n{outp}\n\n[The End of Answer]\n\n[System]\n{criteria}\n\n"
    criteria = "We would like you to answer several questions related to the quality of the answer to the given instruction and corresponding input. \n" + \
                "1. Why this answer is not good for the given instruction and corresponding input? Analyse based on the Helpfulness, Relevance, Accuracy and Level of Details. \n" + \
                "2. Based on the reason you provided, generate a better answer, new and complete, as detailed as possible, in the format of [Better Answer] your answer [End] \n" 
    prompt = prompt_template.format(
        ins=ins, inp=inp, outp=outp, criteria=criteria
    )
    return sys_prompt, prompt


Again, let's apply it to one of the dataset entries to see how it works, generating the improved response:

In [31]:
entry = json_data[2]

system_prompt, prompt = res_gen_prompt_no_input(ins=entry["instruction"], outp=entry["output"])

output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)

print(output)

1. The answer provided is not good for the given instruction for several reasons:

- **Helpfulness**: While the answer does provide the correct conversion, it lacks any explanation or context. A more helpful answer would include a brief explanation of the conversion process or the relationship between kilometers and meters.

- **Relevance**: The answer is relevant in that it addresses the instruction to convert kilometers to meters, but it could be more engaging by providing additional information, such as the conversion factor.

- **Accuracy**: The answer is accurate in terms of the numerical conversion (45 kilometers equals 45000 meters). However, it does not clarify how this conversion was reached, which is important for understanding.

- **Level of Details**: The answer is very brief and lacks detail. It does not explain the conversion factor (1 kilometer = 1000 meters) or provide any context for why someone might need to know this conversion.

2. [Better Answer] To convert kilomet

As we can see above, the response includes an analysis of the original response; we can extract the new response using the following utility function from the [Reflection-Tuning repository](https://github.com/tianyi-lab/Reflection_Tuning/blob/main/reflection_code/reflect_response.py)

In [32]:
def extract_response(text):
    if text.count('[Better Answer]') >= 2:
        pattern = r'\[(Better Answer)\](.*?)(\[End\]|\[Better Answer\]|$)'
        segments = re.findall(pattern, text, re.DOTALL)
    else:
        # pattern = r'\[(Better Answer)\](.*?)\[End\]'
        pattern = r'\[(Better Answer)\](.*?)(\[End\]|End|$)'
        segments = re.findall(pattern, text, re.DOTALL)
    return [segment[1].strip() for segment in segments]
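
The following sketch exercises `extract_response` on three made-up strings (hypothetical inputs, not real API responses) to show what the function returns in each case. The function is repeated so that the block runs on its own. The third case is a caveat worth knowing about: because the fallback pattern also accepts a bare `End`, a better answer that itself contains the text "End" appears to be cut off at that point.

```python
import re


def extract_response(text):
    # Repeated from the cell above so that this sketch is self-contained
    if text.count('[Better Answer]') >= 2:
        pattern = r'\[(Better Answer)\](.*?)(\[End\]|\[Better Answer\]|$)'
    else:
        pattern = r'\[(Better Answer)\](.*?)(\[End\]|End|$)'
    segments = re.findall(pattern, text, re.DOTALL)
    return [segment[1].strip() for segment in segments]


# Hypothetical outputs for illustration
well_formed = "1. Too brief.\n2. [Better Answer] 45 km equals 45000 m. [End]"
missing_tag = "The answer is already fine."
contains_end = "[Better Answer] Endurance training helps. [End]"

result_well_formed = extract_response(well_formed)
result_missing_tag = extract_response(missing_tag)
result_contains_end = extract_response(contains_end)

print(result_well_formed)   # the text between [Better Answer] and [End]
print(result_missing_tag)   # empty list: no [Better Answer] tag was found
print(result_contains_end)  # the lazy match stops at the "End" in "Endurance"
```

Callers therefore need to handle both an empty list and an empty or truncated string.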


In [33]:
response = extract_response(output)[0]

print(response)

To convert kilometers to meters, you can use the conversion factor that 1 kilometer is equal to 1000 meters. Therefore, to convert 45 kilometers to meters, you multiply 45 by 1000. 

So, 45 kilometers is equal to 45 x 1000 = 45000 meters. 

This means that 45 kilometers is the same distance as 45000 meters. Understanding this conversion is useful in various contexts, such as travel, sports, and scientific measurements.


# <font color="red">Improving the dataset</font>

* Now, let's apply the `instruction-reflection` and `response-reflection` techniques to the actual dataset

* `Note`: we apply them only to a small subset of the data here for demonstration purposes; to apply them to the whole dataset, change

```data_to_process = json_data[:3]```

to

```data_to_process = json_data```


# <font color="red">Reflect instructions</font>

The following code applies the Reflection-Tuning methodology for dataset refinement to the instructions in the original dataset

In [34]:
data_to_process = json_data[:3]

In [35]:
from tqdm import tqdm


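def reflect_instructions(json_data, client):
    new_json_data = []

    for entry in tqdm(json_data):

        if not entry["input"]:
            system_prompt, prompt = instr_prompt_no_input(ins=entry["instruction"], outp=entry["output"])
            output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)
            extracted = extract_instruction(output)

            # extract_instruction returns [] for an empty output and empty strings
            # when no match is found; keep the original entry in those cases
            if len(extracted) == 2 and all(extracted):
                new_instr, new_outp = extracted
                new_entry = {"instruction": new_instr, "input": "", "output": new_outp}
                new_json_data.append(new_entry)
            else:
                new_json_data.append(entry)
        else:
            # Instruction refinement is only implemented for entries without an input
            new_json_data.append(entry)

    return new_json_data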


In [36]:
data_to_process = json_data[:3]

new_json_data = reflect_instructions(data_to_process, client)

100%|██████████| 3/3 [00:04<00:00,  1.48s/it]


In [37]:
for i in new_json_data[:3]:
    pprint(i)
    print("\n\n")

{'instruction': 'Evaluate the following phrase by transforming it into the '
                'spelling given.',
 'input': 'freind --> friend',
 'output': 'The spelling of the given phrase "freind" is incorrect, the '
           'correct spelling is "friend".'}



{'instruction': 'Edit the following sentence for grammar.',
 'input': 'He go to the park every day.',
 'output': 'He goes to the park every day.'}



{'instruction': 'Explain the significance of unit conversions in scientific '
                'research, and provide a detailed example of how converting '
                'kilometers to meters can impact data interpretation in a '
                'hypothetical study on animal migration patterns. Include the '
                'conversion process and discuss potential errors that could '
                'arise from incorrect conversions.',
 'input': '',
 'output': 'Unit conversions are crucial in scientific research as they ensure '
           'that data is accurately interpreted 

Let's save the new dataset:

In [38]:
with open("instruction-reflected.json", "w") as file:
    json.dump(new_json_data, file, indent=4)

# <font color="red">Reflect responses</font>

Now let's do the same for response reflection:

In [39]:
data_to_process = json_data[:3]

In [40]:
def reflect_responses(json_data, client):
    new_json_data = []

    for entry in tqdm(json_data):

        if not entry["input"]:
            system_prompt, prompt = res_gen_prompt_no_input(ins=entry["instruction"], outp=entry["output"])
            output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)
            new_response = extract_response(output)

            # Fall back to the original output if nothing could be extracted;
            # wrap it in a list so that new_response[0] returns the whole string
            if not len(new_response):
                new_response = [entry["output"]]

            new_entry = {"instruction": entry["instruction"], "input": "", "output": new_response[0]}
            new_json_data.append(new_entry)

        else:
            system_prompt, prompt = res_gen_prompt_input(ins=entry["instruction"], inp=entry["input"], outp=entry["output"])
            output = run_chatgpt(prompt=prompt, client=client, system_prompt=system_prompt)
            new_response = extract_response(output)

            # Same fallback as above, again wrapped in a list
            if not len(new_response):
                new_response = [entry["output"]]

            new_entry = {"instruction": entry["instruction"], "input": entry["input"], "output": new_response[0]}
            new_json_data.append(new_entry)

    return new_json_data

In [41]:
new_json_data = reflect_responses(data_to_process, client)

100%|██████████| 3/3 [00:07<00:00,  2.35s/it]


In [42]:
for i in new_json_data[:3]:
    pprint(i)
    print("\n\n")

{'instruction': 'Evaluate the following phrase by transforming it into the '
                'spelling given.',
 'input': 'freind --> friend',
 'output': 'The input phrase "freind" contains a spelling error. The correct '
           'transformation of this phrase is as follows: "freind" should be '
           'transformed to "friend." Therefore, the correct spelling is '
           '"friend."'}



{'instruction': 'Edit the following sentence for grammar.',
 'input': 'He go to the park every day.',
 'output': 'The original sentence "He go to the park every day" contains a '
           'grammatical error in the verb form. The correct form should be "He '
           'goes to the park every day." This is because the subject "He" is '
           'third person singular, and in English, the verb "to go" changes to '
           '"goes" when used with third person singular subjects. Therefore, '
           'the corrected sentence is grammatically accurate and maintains the '
           'origina

Let's save the new dataset:

In [43]:
with open("response-reflected.json", "w") as file:
    json.dump(new_json_data, file, indent=4)
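
As a rough sanity check after reflection, one option is to compare output lengths before and after. The sketch below is an illustrative addition: the helpers `word_count` and `length_report` are hypothetical, and the reflected output is an excerpt of the improved answer shown earlier in this notebook rather than a fresh API result.

```python
def word_count(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())


def length_report(original, reflected):
    """Pair up entries by position and report output lengths in words."""
    return [
        (old["instruction"], word_count(old["output"]), word_count(new["output"]))
        for old, new in zip(original, reflected)
    ]


original_entries = [
    {
        "instruction": "Convert 45 kilometers to meters.",
        "input": "",
        "output": "45 kilometers is 45000 meters.",
    },
]

# Excerpt of the improved answer shown earlier in this notebook
reflected_entries = [
    {
        "instruction": "Convert 45 kilometers to meters.",
        "input": "",
        "output": (
            "To convert kilometers to meters, you can use the conversion factor "
            "that 1 kilometer is equal to 1000 meters. Therefore, to convert 45 "
            "kilometers to meters, you multiply 45 by 1000."
        ),
    },
]

report = length_report(original_entries, reflected_entries)
for instruction, before, after in report:
    print(f"{instruction!r}: {before} -> {after} words")
```

Length alone says nothing about quality, so this is only a first filter for spotting entries where the fallback to the original output was used or the extraction returned very little.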