<a href="https://colab.research.google.com/github/Ferstuque/AI_and_data/blob/main/LLM_GPT2_Fine_Tuning_HuggingFace_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook - Fine-Tuning GPT2-Small for Text Generation with a Hugging Face Dataset




This notebook presents a practical example of how to combine a pre-trained GPT-2 Small language model with a dataset, both hosted on Hugging Face. The content is divided into three stages, which include a detailed analysis and potential optimizations for the text generation model.

📣 Running this notebook on a Google Colab GPU runtime is highly recommended.

In [None]:
!pip install transformers datasets accelerate evaluate bert-score nltk
!pip install git-lfs

In [None]:
# Check whether the NVIDIA GPU is enabled
!nvidia-smi

### ⚙️ Importing dependencies

In [None]:
import os
import time
import datetime
import pandas as pd
from transformers import (pipeline,
                          set_seed,
                          AutoTokenizer,
                          GPT2LMHeadModel,
                          GPT2Tokenizer,
                          GPT2Config,
                          GPT2ForSequenceClassification,
                          TrainingArguments,
                          DataCollatorForLanguageModeling,
                          Trainer,
                          get_linear_schedule_with_warmup,
                          AutoModelForCausalLM)
from google.colab import userdata
import evaluate
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
from datasets import load_dataset, Dataset, DatasetDict
import nltk
import seaborn as sns
import numpy as np
import random
import matplotlib.pyplot as plt
import wandb

torch.manual_seed(456)
nltk.download('punkt')

## Stage 1

### Dataset Split

* Training dataset: 3000 rows (the first 3000 rows of the `train` split)
* Test dataset: 3000 rows (the last 3000 rows of the `train` split)
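
If a random selection is preferred over fixed slices such as `train[:3000]` and `train[-3000:]`, one option is to draw two disjoint sets of row indices with a fixed seed. The sketch below uses only the standard library; the helper name `random_disjoint_indices` is hypothetical, and the row count of 52002 is an assumption made for illustration.

```python
import random

def random_disjoint_indices(n_rows, n_train, n_test, seed=456):
    """Return two non-overlapping lists of row indices drawn at random."""
    if n_train + n_test > n_rows:
        raise ValueError("n_train + n_test must not exceed n_rows")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    picked = rng.sample(range(n_rows), n_train + n_test)
    return picked[:n_train], picked[n_train:]

# 52002 is an assumed row count; in practice use the length of the loaded split.
train_idx, test_idx = random_disjoint_indices(n_rows=52002, n_train=3000, n_test=3000)
print(len(train_idx), len(test_idx), len(set(train_idx) & set(test_idx)))
```

The resulting index lists could then be passed to the dataset's `select` method, which this notebook already uses later when building the training and evaluation subsets.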

In [None]:
dataset = load_dataset("tatsu-lab/alpaca")
dataset

In [5]:
dataset_train = load_dataset("tatsu-lab/alpaca", split="train[:3000]").remove_columns(['text'])
dataset_test = load_dataset("tatsu-lab/alpaca", split="train[-3000:]").remove_columns(['text'])
print(f'Train:\n{dataset_train}')
print(f'Test:\n{dataset_test}')

Train:
Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 3000
})
Test:
Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 3000
})


In [6]:
print(dataset_test[0]['instruction'])
print(dataset_test[0]['input'])
print(dataset_test[0]['output'])

Construct a story given a specific theme of choice.
Theme: Friendship
Once upon a time, there was a young boy who had no friends. That is, until one day when he met a friendly dog. The boy was skeptical at first, but eventually the two became inseparable. They would play in the park, share stories and adventures, and explore the outdoors together. The boy learned the importance of friendship and was never lonely again.


In [None]:
# Test the model with an arbitrary sentence
generator = pipeline('text-generation', model='ComCom/gpt2-small', truncation=True)
set_seed(456)
generator("I read epic novel books", max_length=100, num_return_sequences=1)

In [8]:
df_test_gpt2 = pd.DataFrame(dataset_test)
df_test_gpt2['generated_text'] = None

for index in range(10):  # Limit to 10 example rows
  instruction = df_test_gpt2.loc[index, 'instruction'] + " " + df_test_gpt2.loc[index, 'input']
  generated_text = generator(instruction, max_length=100, num_return_sequences=1)[0]['generated_text']
  df_test_gpt2.loc[index, 'generated_text'] = generated_text

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


#### Analyzing the quality of the model output

In [9]:
pd.set_option('display.max_colwidth', None)
df_test_gpt2.head(10)

Unnamed: 0,instruction,input,output,generated_text
0,Construct a story given a specific theme of choice.,Theme: Friendship,"Once upon a time, there was a young boy who had no friends. That is, until one day when he met a friendly dog. The boy was skeptical at first, but eventually the two became inseparable. They would play in the park, share stories and adventures, and explore the outdoors together. The boy learned the importance of friendship and was never lonely again.","Construct a story given a specific theme of choice. Theme: Friendship and Magic\n\nNote that this is not a universal design principle, but an application of one theme: Friendship and Magic. What this means is that for any one story a theme may work on multiple stories, whether that is one specific piece of music, or more or less, and whether those themes overlap. The theme of choice here relates to the content of the story.\n\nIn our example, we show a story about a"
1,Retrieve a fact about a specific topic.,Topic: Solar System,"The Solar System consists of the sun and its planetary system of eight planets, 174 moons, and the dwarf planets.","Retrieve a fact about a specific topic. Topic: Solar System-forming star An image:\n\nThis example is a reference of the NASA/ESA Hubble Space Telescope. This image is available here:\n\nNASA/ESA Hubble Space Telescope / NASA/JPL-Caltech/Shiu-Hsi\n\nIt will take you a while unless you're willing to wait longer and you want to move this image towards you. On the contrary, we hope you enjoy your time on this"
2,Rearrange a paragraph to make it read in an orderly and logical way.,"Alexa took the data from the sensors and visualized the patterns on the screen. She observed the trends in the input data, thinking of the next step to take.",Alexa observed the trends in the input data from the sensors and thought of the next step to take. She visualized the patterns on the screen.,"Rearrange a paragraph to make it read in an orderly and logical way. Alexa took the data from the sensors and visualized the patterns on the screen. She observed the trends in the input data, thinking of the next step to take. Then she did it again, with Alexa running a visual-scanning tool to keep track of the changes in the data.\n\nShe wanted to do something about the color-changing patterns on our home screen.\n\nAlexa said the next thing"
3,Describe a scene to a blind person.,A flower garden in the morning,"The morning air is cool and crisp. The sun is just peeking over the horizon, sending a wash of golden light over the flower garden. You can smell the sweet scent of the flowers and hear the gentle buzzing of the bees pollinating. A soft breeze rustles their petals and rustic wooden fence ties the garden together.","Describe a scene to a blind person. A flower garden in the morning and a blackboard and a typewriter in the evening. The girl's name is Jane, it was born on March 14. An afternoon call to the hospital. A visit to the doctor. An hour of silence at a local church. A long afternoon conversation to an old friend. The death of a friend's sister—or both. A night out, the weather improves as it did on Saturday.\n\nIt has"
4,Translate a sentence from one language to another.,Sentence: Je veux aller à la plage.\nLanguage: Spanish,Quiero ir a la playa.,Translate a sentence from one language to another. Sentence: Je veux aller à la plage.\nLanguage: Spanish - drei\nLanguage: French - fédérif
5,"Given a scenario, decide which is the best advice.","Scenario:\nA friend is considering taking a new job that pays more money, but it's in a different state.","The best advice is to weigh the pros and cons of taking the new job. Consider the financial benefits, the impact it will have on your lifestyle, the potential for professional development, and any personal factors that may be impacted.","Given a scenario, decide which is the best advice. Scenario:\nA friend is considering taking a new job that pays more money, but it's in a different state. He's thinking ""How can I make the minimum wage that is reasonable, but pay more? A small part of that fee?""\nOr suppose you're writing in, ""It's only $9.12 a week, and I can't find any work that's not more reasonable"", and that's a half-"
6,Find the difference between 11/80 and 0.13,,The difference between 11/80 and 0.13 is 0.0375.,"Find the difference between 11/80 and 0.13 or 0.06 is a great and reasonable estimate, if your car is not going to get you a big mileage upgrade.\nHere's what I did: I set all my cars apart from my other vehicles by adjusting their VINs. Then I looked at their driving style and then the car class of both their other vehicles and their driving style. Then I took the ""average"" standard for both. Then I divided it by"
7,Convert the time from GMT to EST,3:25 p.m. GMT,The time 3:25 p.m. GMT is equal to 11:25 a.m. EST.,"Convert the time from GMT to EST 3:25 p.m. GMT (2:25 p.m. Eastern Standard Time, 2:25 p.m. Eastern Standard Time)\n\n(2:25 p.m. Eastern Standard Time, 2:25 p.m. Eastern Standard Time) Convert the time from GMT to EST 3:30 p.m. GMT (2:30 p.m. Eastern Standard Time, 2:30 p.m. Eastern Standard"
8,Create a pseudo-code algorithm to find the maximum number in a given list of numbers.,,"Algorithm:\n1. Create a new list to store the maximum numbers.\n2. Start a for loop with index i for the given list of numbers.\n3. Compare the current number with other numbers in the list.\n4. If the current number is greater, store it in the new list.\n5. If the current number is lesser, do nothing.\n6. After the for loop is complete, the new list contains the maximum numbers.\n7. Find the maximum of the maximum numbers and store it in a new variable.\n8. Return the new variable.","Create a pseudo-code algorithm to find the maximum number in a given list of numbers. This uses a combination of a list of strings, including the value of the first keyword, and a list of strings, including the value of the second keyword, to perform an analysis of the strings. The result of this algorithm can be described as a function like this:\ndef do_search(item, index=0): print('%s, %d-%d '%"
9,Rewrite this English sentence into Spanish: The sun rises in the east,,El sol sale por el este.,"Rewrite this English sentence into Spanish: The sun rises in the east \n\nHere's another idea: If you had to come down in America, what is your daily dose of the new life? The Sun will be there. What the hell does that take in the morning? It takes in life, of course - but you're just now discovering that you have plenty. But you need a little extra space to be able to take yourself to the next level. It takes time"


### Evaluation of the GPT2-Small Model

* Observations: The model struggled to interpret the instructions in the chosen dataset and generated inconsistent responses.
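
In the table above, every `generated_text` value begins with the prompt itself, which makes a side-by-side comparison with `output` harder to read. A minimal sketch of a helper that keeps only the continuation is shown below; `strip_prompt` is a hypothetical name, not part of the notebook.

```python
def strip_prompt(generated_text, prompt):
    """Return only the continuation when the generated text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()

prompt = "Translate a sentence from one language to another."
generated = prompt + " Sentence: Je veux aller"
print(strip_prompt(generated, prompt))
```

The `text-generation` pipeline also accepts `return_full_text=False`, which should have a similar effect without post-processing.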


### Proposed Improvements

1.  Stratified split of the datasets (classification or clustering)
  - Selecting rows by position (the first and last 3000 rows) is a valid approach, but other strategies are worth exploring, such as a stratified split, to ensure that both datasets adequately represent the distribution of the original data.
2.  Exploring larger models (GPT2-Medium, GPT2-Large)
  - GPT2-Small may be too limited for complex tasks. Larger models, such as GPT2-Medium or GPT2-Large, could give better results, but they require more computational resources.
3.  Fine-tuning the model on the training dataset
  - Fine-tuning GPT2-Small on the training dataset can improve performance significantly, because it lets the model adapt to the specific characteristics of the data. (This approach was selected to improve the model's responses in the stages below.)
4.  Improving the prompt formulation
  - Ideally, the prompts would include the primary information from the dataset's 'text' column. However, that column was removed because it was complicating model training and biasing the output, which returned exactly what the column itself suggested. Datasets normally do not contain this kind of information; this particular column appears to have been included only for study purposes.
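
For item 1, a stratified split can be sketched with the standard library alone. The example groups rows by a key, then sends the same fraction of each group to the test set. Both the helper name `stratified_split` and the choice of "has a non-empty input" as the stratification key are assumptions made for illustration; the notebook does not define any strata.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_fraction=0.5, seed=456):
    """Split rows so that each stratum (value of key(row)) keeps roughly the same share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        cut = round(len(members) * test_fraction)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test

# Toy rows: every fourth row has a non-empty input (hypothetical data).
rows = [{"instruction": f"task {i}", "input": "x" if i % 4 == 0 else ""} for i in range(40)]
train, test = stratified_split(rows, key=lambda r: bool(r["input"]), test_fraction=0.2)
print(len(train), len(test), sum(bool(r["input"]) for r in test))
```

With this toy data, a quarter of the rows in each of the two sets have a non-empty input, matching the proportion in the full list.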

## Stage 2

#### Fine-Tuning

In [10]:
def format_prompt(example):
    return {
        "text": f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput: {example['output']}"
    }

dataset = dataset_train.map(format_prompt)

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

In [11]:
model = AutoModelForCausalLM.from_pretrained("ComCom/gpt2-small")

#### Tokenizing the dataset

In [12]:
tokenizer = GPT2Tokenizer.from_pretrained("ComCom/gpt2-small")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

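def tokenize_function(examples):
    # Tokenize the formatted prompt (instruction, input and output in a single string).
    text_tokenized = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)
    # Tokenize the expected output on its own.
    output_tokenized = tokenizer(examples["output"], padding="max_length", truncation=True, max_length=256)
    # Note: with mlm=False, DataCollatorForLanguageModeling rebuilds "labels" from
    # "input_ids" when batching, so the labels stored here are replaced during training.
    return {
        "input_ids": text_tokenized["input_ids"],
        "attention_mask": text_tokenized["attention_mask"],
        "labels": output_tokenized["input_ids"],
    }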

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

In [13]:
# Prepare the data for training and evaluation
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

train_dataset = tokenized_datasets.shuffle(seed=456).select(range(2500))
eval_dataset = tokenized_datasets.shuffle(seed=456).select(range(len(tokenized_datasets) - 300, len(tokenized_datasets)))

In [14]:
# DataCollatorForLanguageModeling is suited to text generation (causal language modeling) tasks.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [15]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=10,
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    run_name="my-gpt2-test_dataset-run",
    eval_strategy = 'epoch',
    save_strategy = 'epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

wandb.login(key=userdata.get('WANDB_API_KEY'))


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: houkyto (houkyto-particular) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

Training the model

In [16]:
trainer.train()  # Apply the arguments and train the model

# Summary table of training metrics
metrics = {
    "train_runtime": trainer.state.log_history[-1]["train_runtime"],
    "train_samples_per_second": trainer.state.log_history[-1]["train_samples_per_second"],
    "train_steps_per_second": trainer.state.log_history[-1]["train_steps_per_second"],
    "train_loss": trainer.state.log_history[-1]["train_loss"],
    "epoch": trainer.state.log_history[-1]["epoch"]
}

df_train_output = pd.DataFrame([metrics])

df_train_output

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,2.2472,2.162754
2,2.0437,2.129907
3,1.9977,2.125949
4,1.9804,2.128057
5,1.8452,2.132505
6,1.8073,2.14054
7,1.826,2.145466
8,1.6886,2.155646
9,1.6992,2.158149
10,1.5718,2.160263


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Unnamed: 0,train_runtime,train_samples_per_second,train_steps_per_second,train_loss,epoch
0,1887.5693,13.245,1.658,1.879895,10.0


In this stage, the GPT2-Small model was fine-tuned on the training dataset. The goal was to adapt the model to the specific characteristics of the data and improve its text generation from the given instructions.
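
The throughput figures in the summary table above can be cross-checked against the run configuration. The arithmetic below is a sketch that assumes a single device and no gradient accumulation (neither is set in the notebook); the variable names are illustrative.

```python
import math

train_rows = 2500          # size of train_dataset
batch_size = 8             # per_device_train_batch_size
epochs = 10                # num_train_epochs
train_runtime = 1887.5693  # seconds, from the summary table

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
print(round(total_steps / train_runtime, 3))          # reported: 1.658
print(round(train_rows * epochs / train_runtime, 3))  # reported: 13.245
```

Under these assumptions the computed rates agree with the reported `train_steps_per_second` and `train_samples_per_second`.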

#### Steps Performed

1.  **Prompt Formatting:**
    * The instructions, inputs, and outputs of the training dataset were arranged into a prompt format suitable for the model.
    * The `format_prompt` function was used to combine these fields into a single text.

2.  **Data Tokenization:**
    * The training dataset was tokenized with the GPT2-Small tokenizer.
    * The `tokenize_function` function converted the formatted texts into token sequences ready to be processed by the model.
    * The `max_length` parameter was set to 256, which allows longer text sequences. (It can also be tuned to the typical sequence length of your dataset to obtain an optimized `max_length`.)

3.  **Preparing the Training and Evaluation Datasets:**
    * The tokenized dataset was split into training and evaluation sets (2500 and 300 examples, respectively).
    * `DataCollatorForLanguageModeling` was used to build the data batches for training.

4.  **Model Training:**
    * The `Trainer` from the `transformers` library was used to fine-tune GPT2-Small.
    * The training hyperparameters were defined in `TrainingArguments`, including the learning rate, the number of epochs, and the batch size.
    * `eval_strategy` was set to `'epoch'`, so the model is evaluated at the end of each epoch.
    * `save_strategy` was set to `'epoch'`, so a checkpoint is saved at each epoch.
    * `load_best_model_at_end` was added so that the best checkpoint, according to the evaluation metric, is loaded when training finishes.
    * Training was monitored with `wandb`.

5.  **Training Evaluation:**
    * The training metrics were collected and displayed in a DataFrame, including the training time, the loss, and the epoch.
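
For the `max_length` choice mentioned in step 2, one possible approach is to look at percentiles of the tokenized lengths rather than picking a value by hand. The sketch below uses a nearest-rank percentile; the helper name `length_percentile` and the token counts are hypothetical.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile: the smallest length that covers pct% of the samples."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical token counts; in practice these could come from
# len(tokenizer(text)["input_ids"]) for each formatted example.
token_counts = [42, 57, 61, 75, 80, 96, 110, 130, 180, 410]
print(length_percentile(token_counts, 50), length_percentile(token_counts, 90))
```

Choosing a high percentile instead of the maximum keeps a few very long examples from dictating the padded length of every sequence, at the cost of truncating those outliers.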

### Analysis and Improvements

* Tuning hyperparameters such as the number of epochs (`num_train_epochs`) and the batch size can have a significant impact on model performance. Different values can be explored, balancing computational cost against training performance.
* Evaluating during training, enabled by `eval_strategy`, helps monitor the model and detect overfitting. In this run the validation loss reached its minimum at epoch 3 (2.125949) and rose in every later epoch while the training loss continued to trend downward, which suggests overfitting after that point.
* Loading the best checkpoint at the end of training, enabled by `load_best_model_at_end`, ensures that the best-performing model is the one used afterwards.
* Increasing `max_length` in tokenization lets the model process longer text sequences and preserve more information, at the cost of more memory and compute per batch.
* Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA, can reduce memory consumption during training.
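
The per-epoch validation losses logged during training can be inspected programmatically to find the best epoch and to see where a patience-based early-stopping rule would have halted. This is a standard-library sketch; the helper names `best_epoch` and `stop_epoch` are hypothetical, and the patience value of 2 is an arbitrary choice.

```python
# (epoch, validation loss) pairs copied from the training log of this run.
eval_log = [
    (1, 2.162754), (2, 2.129907), (3, 2.125949), (4, 2.128057), (5, 2.132505),
    (6, 2.14054), (7, 2.145466), (8, 2.155646), (9, 2.158149), (10, 2.160263),
]

def best_epoch(log):
    """Epoch with the lowest validation loss."""
    return min(log, key=lambda pair: pair[1])[0]

def stop_epoch(log, patience=2):
    """First epoch at which the loss has failed to improve `patience` times in a row."""
    best, bad = float("inf"), 0
    for epoch, loss in log:
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return log[-1][0]

print(best_epoch(eval_log), stop_epoch(eval_log, patience=2))  # prints "3 5"
```

Within `transformers`, a similar rule is available through `EarlyStoppingCallback` and its `early_stopping_patience` argument, which could be passed to the `Trainer` to avoid running all 10 epochs.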


## Stage 3

#### Testing the model after fine-tuning

In [17]:
def generate_output(instruction, model, tokenizer, seed=456):
    prompt = f"Instruction: {instruction}\nOutput:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=128, num_return_sequences=1, no_repeat_ngram_size=2)
    generated_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_output.split("Output:")[1].strip()

test_example = dataset[9]
instruction = test_example['instruction']+' '+test_example['input']
input = test_example['input']
output = test_example['output']
generated_output = generate_output(instruction, model, tokenizer)
print(f"Instruction Test: {instruction}")
print(f"Input Test: {input}")
print(f"Expected Output: {output}")
print(f"Generated Output: {generated_output}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruction Test: Evaluate this sentence for spelling and grammar mistakes He finnished his meal and left the resturant
Input Test: He finnished his meal and left the resturant
Expected Output: He finished his meal and left the restaurant.
Generated Output: He finished his dinner and departed the dining room. He was greeted by a group of people, including a woman who was wearing a bright red dress and a blue dress. She smiled and waved goodbye to him.
Input: 

She smiled back and said, "Good evening, everyone."
   She walked to the table and sat down.  Her eyes were wide and she looked up at him with a smile. Her heart was pounding and her eyes filled with tears. The woman smiled


In [18]:
test_example = dataset[7]
instruction = test_example['instruction']+ ' '+test_example['input']
input = test_example['input']
output = test_example['output']
generated_output = generate_output(instruction, model, tokenizer)
print(f"Instruction Test: {instruction}")
print(f"Input Test: {input}")
print(f"Expected Output: {output}")
print(f"Generated Output: {generated_output}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruction Test: Write a short story in third person narration about a protagonist who has to make an important career decision. 
Input Test: 
Expected Output: John was at a crossroads in his life. He had just graduated college and was now facing the big decision of what career to pursue. After much deliberation, he decided that he wanted to be an accountant and help the financially disadvantaged. He had always been good with numbers and enjoyed seeing the tangible results of his work. 

John enrolled in accounting courses and initially found it quite challenging. He had to learn multiple systems and regulations quickly, but he worked hard and eventually excelled in his studies. After a few years, John started working at an accounting firm in his city. He was eager to put his knowledge of taxes and accounting to use in a real-world setting.

John loved his job, as it let him express his creativity in finding strategies to save his clients money. After a few years at the firm, he becam

#### Adding the new column to the DataFrame: df_test_gpt2



In [23]:
df_test_gpt2['text_generated_new'] = None

for index in range(10):
  instruction = df_test_gpt2.loc[index, 'instruction']
  input = df_test_gpt2.loc[index, 'input']
  if pd.notnull(input):
    instruction_with_input = instruction + ' ' + input
    generated_output_new = generate_output(instruction_with_input, model, tokenizer)
  else:
    generated_output_new = generate_output(instruction, model, tokenizer)

  df_test_gpt2.loc[index, 'text_generated_new'] = generated_output_new

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [24]:
df_test_gpt2.head(10)

Unnamed: 0,instruction,input,output,generated_text,text_generated_new
0,Construct a story given a specific theme of choice.,Theme: Friendship,"Once upon a time, there was a young boy who had no friends. That is, until one day when he met a friendly dog. The boy was skeptical at first, but eventually the two became inseparable. They would play in the park, share stories and adventures, and explore the outdoors together. The boy learned the importance of friendship and was never lonely again.","Construct a story given a specific theme of choice. Theme: Friendship and Magic\n\nNote that this is not a universal design principle, but an application of one theme: Friendship and Magic. What this means is that for any one story a theme may work on multiple stories, whether that is one specific piece of music, or more or less, and whether those themes overlap. The theme of choice here relates to the content of the story.\n\nIn our example, we show a story about a","The protagonist is a young man who is fascinated by the beauty of nature. He is determined to find a way to make it all the more special.\n\nThe protagonist's journey begins with a journey to the edge of the world, where he meets a mysterious stranger. The stranger is an old man named John. John is the only person who can remember the past and the present, and he is able to remember everything that happened in the future. As John grows up, he discovers that he can use his memories to create a better future for"
1,Retrieve a fact about a specific topic.,Topic: Solar System,"The Solar System consists of the sun and its planetary system of eight planets, 174 moons, and the dwarf planets.","Retrieve a fact about a specific topic. Topic: Solar System-forming star An image:\n\nThis example is a reference of the NASA/ESA Hubble Space Telescope. This image is available here:\n\nNASA/ESA Hubble Space Telescope / NASA/JPL-Caltech/Shiu-Hsi\n\nIt will take you a while unless you're willing to wait longer and you want to move this image towards you. On the contrary, we hope you enjoy your time on this","The Solar system is a system of planets orbiting the Sun. It is composed of three main planets, Mercury, Uranus and Neptune. Mercury orbits the Earth at a distance of about 1,200 light-years. Uranium orbits Saturn at an angle of approximately 1.5 degrees. Neptune orbits Jupiter at the same angle. The Sun is the only planet in the Solar systems that is not surrounded by matter.\n\nThe Sun orbits its axis of rotation at about 7.8 degrees per year. Its distance from the sun is about 5,"
2,Rearrange a paragraph to make it read in an orderly and logical way.,"Alexa took the data from the sensors and visualized the patterns on the screen. She observed the trends in the input data, thinking of the next step to take.",Alexa observed the trends in the input data from the sensors and thought of the next step to take. She visualized the patterns on the screen.,"Rearrange a paragraph to make it read in an orderly and logical way. Alexa took the data from the sensors and visualized the patterns on the screen. She observed the trends in the input data, thinking of the next step to take. Then she did it again, with Alexa running a visual-scanning tool to keep track of the changes in the data.\n\nShe wanted to do something about the color-changing patterns on our home screen.\n\nAlexa said the next thing","The next steps to be taken are to identify patterns in data and to use the visualizations to better understand the underlying patterns. The data should be analyzed and analyzed to determine the best way to interpret the results. Finally, the user should have the opportunity to interact with the app and interactivity should improve. This is a great way for Alexa to help users understand their data"
3,Describe a scene to a blind person.,A flower garden in the morning,"The morning air is cool and crisp. The sun is just peeking over the horizon, sending a wash of golden light over the flower garden. You can smell the sweet scent of the flowers and hear the gentle buzzing of the bees pollinating. A soft breeze rustles their petals and rustic wooden fence ties the garden together.","Describe a scene to a blind person. A flower garden in the morning and a blackboard and a typewriter in the evening. The girl's name is Jane, it was born on March 14. An afternoon call to the hospital. A visit to the doctor. An hour of silence at a local church. A long afternoon conversation to an old friend. The death of a friend's sister—or both. A night out, the weather improves as it did on Saturday.\n\nIt has","The scene in question is a flower growing in a garden. The flower is surrounded by a large, green, and yellow flower. It is blooming in mid-summer and is usually a beautiful sight. In the evening, the flower blooms in an open field, surrounded on all sides by trees and bushes. On the day, it is in full bloom and bloomes in late summer. During the summer, when the sun is shining brightly, flowers are bloating in open fields and in summertime, they are usually blo"
4,Translate a sentence from one language to another.,Sentence: Je veux aller à la plage.\nLanguage: Spanish,Quiero ir a la playa.,Translate a sentence from one language to another. Sentence: Je veux aller à la plage.\nLanguage: Spanish - drei\nLanguage: French - fédérif,"The plages were a kind of portable furniture. They were made of wood, and were usually made to last for years. Je vous avez un plaît, le plager de la Plage de l'Avant-Garde. \n\nJe voulez un peuvent de plagiarisme, mais le monde de leurs plagers de Plagieux. Il est un même de mé"
5,"Given a scenario, decide which is the best advice.","Scenario:\nA friend is considering taking a new job that pays more money, but it's in a different state.","The best advice is to weigh the pros and cons of taking the new job. Consider the financial benefits, the impact it will have on your lifestyle, the potential for professional development, and any personal factors that may be impacted.","Given a scenario, decide which is the best advice. Scenario:\nA friend is considering taking a new job that pays more money, but it's in a different state. He's thinking ""How can I make the minimum wage that is reasonable, but pay more? A small part of that fee?""\nOr suppose you're writing in, ""It's only $9.12 a week, and I can't find any work that's not more reasonable"", and that's a half-","The best way to plan for a job in the new state is to stay in touch with your new employer and make sure that you have the necessary skills and experience to make the decision. Additionally, you should consider the benefits of staying in contact with the company and the potential benefits that come with staying connected to them. Finally, make a plan to get to know your current employer better and to ensure that they have a good understanding of"
6,Find the difference between 11/80 and 0.13,,The difference between 11/80 and 0.13 is 0.0375.,"Find the difference between 11/80 and 0.13 or 0.06 is a great and reasonable estimate, if your car is not going to get you a big mileage upgrade.\nHere's what I did: I set all my cars apart from my other vehicles by adjusting their VINs. Then I looked at their driving style and then the car class of both their other vehicles and their driving style. Then I took the ""average"" standard for both. Then I divided it by","11.83 is the average of the 11th and 12th centuries. It is equivalent to the 12 th and 13 th centuries, respectively. The difference is that 11 is shorter than 0, whereas 12 is longer than 11, while 13 is equal to 0 and 11 are equal. Therefore, 11 has a shorter average than 12, and it is therefore equivalent.\n\n12.14 is a measure of how long a given period is. This is calculated by dividing the length of a period by the number of years. For example, if a"
7,Convert the time from GMT to EST,3:25 p.m. GMT,The time 3:25 p.m. GMT is equal to 11:25 a.m. EST.,"Convert the time from GMT to EST 3:25 p.m. GMT (2:25 p.m. Eastern Standard Time, 2:25 p.m. Eastern Standard Time)\n\n(2:25 p.m. Eastern Standard Time, 2:25 p.m. Eastern Standard Time) Convert the time from GMT to EST 3:30 p.m. GMT (2:30 p.m. Eastern Standard Time, 2:30 p.m. Eastern Standard","The time to arrive at the airport is 3 p, 19:00. The airport will be open from 9:30 a.M. to 5:45 p., and will open at 5 p ET.\nInput: \n\nThe time for arriving at airport at 3 pm is 9 p EST. This means that the arrival time will start at 9 pm and end at 11:59 pm. Therefore, the estimated arrival times for the following airports are:\nTokyo, Tokyo, and Tokyo-Mitsubishi."
8,Create a pseudo-code algorithm to find the maximum number in a given list of numbers.,,"Algorithm:\n1. Create a new list to store the maximum numbers.\n2. Start a for loop with index i for the given list of numbers.\n3. Compare the current number with other numbers in the list.\n4. If the current number is greater, store it in the new list.\n5. If the current number is lesser, do nothing.\n6. After the for loop is complete, the new list contains the maximum numbers.\n7. Find the maximum of the maximum numbers and store it in a new variable.\n8. Return the new variable.","Create a pseudo-code algorithm to find the maximum number in a given list of numbers. This uses a combination of a list of strings, including the value of the first keyword, and a list of strings, including the value of the second keyword, to perform an analysis of the strings. The result of this algorithm can be described as a function like this:\ndef do_search(item, index=0): print('%s, %d-%d '%",1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100\nInput
9,Rewrite this English sentence into Spanish: The sun rises in the east,,El sol sale por el este.,"Rewrite this English sentence into Spanish: The sun rises in the east \n\nHere's another idea: If you had to come down in America, what is your daily dose of the new life? The Sun will be there. What the hell does that take in the morning? It takes in life, of course - but you're just now discovering that you have plenty. But you need a little extra space to be able to take yourself to the next level. It takes time","He rises with the sun in his sky.\n\nThe sun is rising in a bright and beautiful sky, shining like a beacon of light. He is the star of the night, and the stars are the light of day. The stars shine like stars, illuminating the world and making it brighter and brighter. They are like the rays of sunlight, making the sky brighter, brighter than ever before. It is a beautiful day, filled with stars and shadows, but it is also a time of peace and joy. The sky is"


* Evaluating the quality of the results after training the model:

After fine-tuning on the dataset, the model adapts its responses somewhat better, but further adjustments are still needed.
As an alternative, some parameters can be adjusted to retrain the model without increasing the computational cost.

The GPT2-Small model fine-tuned in Step 2 was used to generate responses/outputs for the instructions in the test dataset. The goal was to assess the impact of fine-tuning on the model's performance.

* Analysis:

1.  **Response Generation with the Fine-Tuned Model:**

    * The `generate_output` function uses the fine-tuned model to generate responses from the given instructions: it tokenizes each instruction, generates a response, and decodes it. The code then iterates over the test dataset examples, generates the responses, and stores them in a new column of the `df_test_gpt2` DataFrame.
    * **Analysis:** Comparing the responses generated by the fine-tuned model with the expected responses shows an improvement in quality over Step 1. However, there is still room for optimization.

2.  **Evaluating the Quality of the Results:**

    * Response quality was evaluated qualitatively, by comparing the generated responses with the expected ones. For a more precise evaluation, quantitative metrics such as *BLEU*, *ROUGE*, or *perplexity* can be used.
    * **Analysis:** The fine-tuned model generates more coherent and relevant responses, but it still struggles with some of the more complex instructions.

3.  **Grouping by Topic:**

    * An effective approach would be to use a *GPT2 Text-Classification* model, for example, to classify the questions into topic categories. The model would be fine-tuned on a training dataset of questions and their respective categories, and then used to classify the questions in the training and test datasets.
    * Using a GPT2-Classification model makes it possible to leverage the power of language models to group the questions semantically, taking into account the meaning of words and sentences.
    * **Alternative:** Another option would be to use word embeddings or topic modeling to group the questions by similar topics.

#### Suggested Code for Grouping by Topic (Using Sentence Transformers)

```
# Import dependencies
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load the Sentence Transformer model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# Collect the questions from both datasets
questions = list(df_train['text']) + list(df_test['text'])

# Generate the question embeddings
embeddings = model.encode(questions)

# Cluster the questions with K-means (fixed seed so runs are reproducible)
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=456)
kmeans.fit(embeddings)

# Get the cluster label assigned to each question
labels = kmeans.labels_

# Print the questions that fall in each cluster
for i in range(num_clusters):
    print(f"Cluster {i}:")
    for j, label in enumerate(labels):
        if label == i:
            print(questions[j])
    print()
```

#### Grouping by Topic (Using Sentence Transformers)

* The code uses the `sentence_transformers` library to generate sentence embeddings, which represent the meaning of the questions.
* The `distilbert-base-nli-mean-tokens` model is used to generate the embeddings.
* The K-means algorithm groups the questions based on their embeddings.
* The number of clusters (`num_clusters`) can be adjusted to match the desired number of question groups.
* The cluster labels are retrieved and used to print the questions in each group.

#### Future Improvements for Topic Grouping

* Explore different Sentence Transformers models to obtain more accurate embeddings.
* Try different clustering algorithms, such as DBSCAN or hierarchical clustering.
* Use dimensionality reduction techniques, such as PCA or t-SNE, to visualize the clusters in two or three dimensions.
* Automate the choice of the number of clusters with techniques such as the elbow method or the silhouette score.
* Evaluate cluster quality with metrics such as the Davies-Bouldin index or the silhouette score.
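
The silhouette score mentioned above can be sketched in plain Python to show what it measures. This is only an illustrative implementation under simple assumptions (Euclidean distance, points as tuples of floats); the function name `silhouette_score` here is local to this sketch, and in the notebook itself `sklearn.metrics.silhouette_score(embeddings, labels)` would be the more practical choice.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient for a clustering.

    points: list of equal-length tuples of floats
    labels: list of cluster ids, one per point
    """
    # Group point indices by cluster label
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_distance(i, members):
        # Average Euclidean distance from point i to the other members
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = mean_distance(i, own)  # cohesion: distance to own cluster
        b = min(  # separation: distance to the nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Toy 2-D points standing in for sentence embeddings (illustrative only)
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_score(toy_points, [0, 0, 1, 1])
bad_score = silhouette_score(toy_points, [0, 1, 0, 1])
```

To choose the number of clusters, one could run K-means for several candidate values and keep the one with the highest score; values close to 1 indicate compact, well-separated clusters.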





---



#### Final Considerations

Across Steps 1, 2, and 3, we explored fine-tuning the GPT2-Small model for text generation from specific instructions. Fine-tuning improved the quality of the model's generated responses, which shows that adapting the model to the project's data is effective. However, there is still room to optimize and refine the model.

#### Future Improvements

#### Improving the Fine-tuning:

* **Hyperparameter Exploration:** Run a broader hyperparameter search over values such as learning rate, batch size, and number of epochs to find the best configuration for the model.
* **Regularization Techniques:** Apply regularization techniques, such as dropout or weight decay, to avoid overfitting and improve the model's generalization.
* **Larger Models:** Try larger GPT2 models, such as GPT2-Medium or GPT2-Large, to see whether they offer significant performance gains. The computational cost of larger models must be taken into account.
* **Efficient Fine-tuning:** Explore efficient fine-tuning techniques, such as Parameter-Efficient Fine-Tuning (PEFT), to reduce memory consumption and training time.
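
A hyperparameter search like the one suggested above can be sketched as a small grid search. The keys below reuse the names of real `TrainingArguments` parameters, but the candidate values are arbitrary examples, and `evaluate_config` is a hypothetical callback: it is assumed to train the model with a given configuration and return the validation loss.

```python
from itertools import product

# Example search space; the values are illustrative, not recommendations
search_space = {
    "learning_rate": [5e-5, 1e-4],
    "per_device_train_batch_size": [4, 8],
    "num_train_epochs": [1, 3],
    "weight_decay": [0.0, 0.01],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(space, evaluate_config):
    """Return (best_config, best_loss) over the full grid.

    evaluate_config is a hypothetical callback: it should fine-tune the
    model with the given config and return the validation loss.
    """
    best_config, best_loss = None, float("inf")
    for config in iter_configs(space):
        loss = evaluate_config(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```

Inside `evaluate_config`, each dict could be unpacked into `TrainingArguments(output_dir=..., **config)`. A full grid grows quickly (16 training runs here), so on a Colab GPU a random subset of the configurations may be more realistic.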

#### Optimizing Text Generation:

* **Tuning Generation Parameters:** Try different values for generation parameters such as `temperature`, `top_k`, and `top_p` to control the creativity and diversity of the generated responses.
* **Prompt Engineering:** Improve the wording of the prompts, adding more detailed instructions or examples, to guide the model toward more accurate and relevant responses.
* **Conditional Language Models:** Explore conditional (sequence-to-sequence) language models, such as T5 or BART, which are designed to generate text conditioned on instructions or specific inputs.
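
To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a simplified plain-Python sketch of how they shape the choice of the next token. It is not the Hugging Face implementation, only an illustration of the idea; the function names and the example logits are assumptions made for this sketch.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token index after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    probs = softmax([x / temperature for x in logits])
    # Rank token indices from most to least likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    # Sample among the surviving tokens, proportionally to their probability
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Illustrative logits for a three-token vocabulary
example_logits = [1.0, 3.0, 2.0]
greedy_like = sample_next_token(example_logits, top_k=1)
```

With `top_k=1` the function always returns the most likely token, while higher `temperature`, `top_k`, or `top_p` values let less likely tokens through, which increases diversity at the risk of less coherent text.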

#### Comprehensive Evaluation:

* **Quantitative Metrics:** Use a range of quantitative metrics, such as BLEU, ROUGE, and perplexity, to evaluate the model's performance more comprehensively and objectively.
* **Human Evaluation:** Have human reviewers rate the quality of the generated responses in terms of relevance, coherence, and fluency.
* **Error Analysis:** Analyze the model's errors in detail to identify patterns and areas for improvement.
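
The quantitative metrics above can be illustrated with two small plain-Python functions. These are simplified sketches, not the official metric implementations: `unigram_f1` is a ROUGE-1-style overlap score with whitespace tokenization and no stemming, and `perplexity` assumes the per-token log-probabilities (natural log) have already been obtained from the model. Both function names are assumptions made for this sketch.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def unigram_f1(prediction, reference):
    """ROUGE-1-style F1: overlap of lowercase whitespace-separated tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_unigram_f1(predictions, references):
    """Average the overlap score over paired predictions and references."""
    scores = [unigram_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```

In the notebook, `mean_unigram_f1` could be applied to the generated and expected response columns of the test DataFrame to compare the model before and after fine-tuning. For reportable numbers, the `evaluate` library installed at the top of the notebook provides standard BLEU and ROUGE implementations, and the exponential of the validation loss reported during training gives the model's perplexity.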

#### Grouping and Categorization:

* **Classification Models:** Investigate text classification models, such as BERT or RoBERTa, to categorize the questions by relevant topics or themes.
* **Topic Models:** Apply topic models, such as LDA or NMF, to identify the main topics in the questions and group similar questions together.
* **Sentence Embeddings:** Explore sentence embeddings, such as Sentence Transformers, to represent the questions in a vector space and group semantically similar questions.

Implementing these improvements should further refine the GPT2-Small model and yield more accurate and relevant instruction-based text generation.