# **Fine Tuning**
## What is Fine Tuning?
_Fine-tuning_ is a technique that adjusts the parameters of an already trained model using your own dataset.

Imagine you were given a model trained on the contents of a library. This model can write any kind of book or written material.

However, you want to write poems the way Manuel Bandeira does, so generic text would not be enough for you.

So you take the model and adjust its parameters so that it writes like Bandeira, simply by feeding it the writer's body of work. _Voilà_, you have performed a fine-tuning!
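To make the idea concrete, here is a minimal sketch in plain Python. It is not the GPT-2 procedure used later in this notebook: it assumes a hypothetical one-parameter model `y = weight * x` whose "pre-trained" weight is 2.0, and nudges that weight with gradient descent on a small new dataset. Every name and value in it is an illustrative assumption.

```python
def fine_tune(weight, data, lr=0.01, epochs=200):
    """Adjust an already trained weight with gradient descent on new data."""
    for _ in range(epochs):
        for x, y in data:
            error = weight * x - y          # prediction error of y = weight * x
            weight -= lr * 2 * error * x    # gradient step on the squared error
    return weight


pretrained_weight = 2.0                           # hypothetical "generic" model
new_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # new examples follow y = 3x

tuned_weight = fine_tune(pretrained_weight, new_data)
print(round(tuned_weight, 3))
```

The key point is that training starts from the existing weight instead of a random one, so only a small adjustment is needed.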

In the project below, I show how to fine-tune the GPT-2 model (Generative Pre-trained Transformer Version 2) to generate _"Taylor Swiftian"_ songs.

> Data source: [Taylor Swift Song Lyrics (All Albums)](https://www.kaggle.com/datasets/thespacefreak/taylor-swift-song-lyrics-all-albums)

## 0 - Preparing the Environment

In [1]:
# Checking whether the runtime is using a GPU (Graphics Processing Unit)
!nvidia-smi

Wed Jun 14 14:38:59 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Granting access to Google Drive, where the dataset files are stored:

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [3]:
# Installing the "transformers" library

!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## What are transformers?

_hint: they are not the alien robot cars_

Transformers are a type of neural network model that captures the context of a text and makes predictions based on it. They apply a set of mathematical operations called attention, or self-attention, to work out how the elements of a sequence, even the most subtle ones, relate to one another.

The Transformer was first presented in 2017 in a [Google paper](https://arxiv.org/abs/1706.03762). At the time of writing, it remains one of the most recent and most powerful model architectures.
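As a rough illustration of self-attention, the sketch below computes scaled dot-product attention over three toy vectors in plain Python. It is a simplification under stated assumptions: real transformers first project each token through learned query, key and value matrices and use several attention heads, all of which are omitted here. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this token's query and every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors, used as queries, keys and values at once
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one token "looks at" every other token; the output for that token is the corresponding weighted mix of the value vectors.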

In [4]:
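# Importing libraries

# AdamW comes from torch.optim (the version bundled with transformers is deprecated)
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup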
from tqdm import tqdm, trange
from collections import defaultdict
import torch
import os
import pandas as pd


# Data Collection

In [5]:
taylor_df = pd.read_csv("/content/drive/MyDrive/taylor_db/01-taylor_swift.csv")
taylor_df.describe()

| | track_n | line |
|---|---|---|
| count | 609.0 | 609.0 |
| mean | 7.99179 | 22.095238 |
| std | 4.394404 | 13.619505 |
| min | 1.0 | 1.0 |
| 25% | 4.0 | 11.0 |
| 50% | 8.0 | 21.0 |
| 75% | 12.0 | 32.0 |
| max | 15.0 | 55.0 |


In [6]:
taylor_df.head(10)

| | album_name | track_title | track_n | lyric | line |
|---|---|---|---|---|---|
| 0 | Taylor Swift | Tim McGraw | 1 | He said the way my blue eyes shined | 1 |
| 1 | Taylor Swift | Tim McGraw | 1 | Put those Georgia stars to shame that night | 2 |
| 2 | Taylor Swift | Tim McGraw | 1 | I said, "That's a lie" | 3 |
| 3 | Taylor Swift | Tim McGraw | 1 | Just a boy in a Chevy truck | 4 |
| 4 | Taylor Swift | Tim McGraw | 1 | That had a tendency of gettin' stuck | 5 |
| 5 | Taylor Swift | Tim McGraw | 1 | On back roads at night | 6 |
| 6 | Taylor Swift | Tim McGraw | 1 | And I was right there beside him all summer long | 7 |
| 7 | Taylor Swift | Tim McGraw | 1 | And then the time we woke up to find that summ... | 8 |
| 8 | Taylor Swift | Tim McGraw | 1 | But when you think Tim McGraw | 9 |
| 9 | Taylor Swift | Tim McGraw | 1 | I hope you think my favorite song | 10 |


In [7]:
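# Creating a class that prepares the data for our fine-tuning

class music():
  def __init__(self):
    # Maps (album, track title) to the full lyrics of that track
    self.lyrics = defaultdict(str)

  def add_lyrics(self, df):
    # Each row holds one line of a song; concatenate the lines of each track
    for _, row in df.iterrows():
      #           (album,       track_name)     lyrics
      self.lyrics[(row.iloc[0], row.iloc[1])] += (row.iloc[3] + '\n ')

  def __len__(self):
    return len(self.lyrics)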



In [8]:
musics = music()
musics.add_lyrics(taylor_df)
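The grouping step can be seen in isolation with a few made-up rows. This is only a sketch: the rows are hypothetical and stand in for the `album_name`, `track_title` and `lyric` columns of the CSV.

```python
from collections import defaultdict

# Hypothetical rows in the same shape as the CSV: (album, track, lyric line)
rows = [
    ("Album A", "Song 1", "first line"),
    ("Album A", "Song 1", "second line"),
    ("Album A", "Song 2", "another song"),
]

lyrics = defaultdict(str)  # missing keys start as an empty string
for album, track, line in rows:
    # Lines of the same track are appended to the same entry
    lyrics[(album, track)] += line + "\n "

print(len(lyrics))
print(repr(lyrics[("Album A", "Song 1")]))
```

Because `defaultdict(str)` creates an empty string for a key the first time it is seen, no explicit check for new songs is needed.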

In [9]:
fearless_df = pd.read_csv("/content/drive/MyDrive/taylor_db/02-fearless_taylors_version.csv")
speak_df = pd.read_csv("/content/drive/MyDrive/taylor_db/03-speak_now_deluxe_package.csv")
red_df = pd.read_csv("/content/drive/MyDrive/taylor_db/04-red_deluxe_edition.csv")
taylor_1989_df = pd.read_csv("/content/drive/MyDrive/taylor_db/05-1989_deluxe.csv")
reputation_df = pd.read_csv("/content/drive/MyDrive/taylor_db/06-reputation.csv")
lover_df = pd.read_csv("/content/drive/MyDrive/taylor_db/07-lover.csv")
folklore_df = pd.read_csv("/content/drive/MyDrive/taylor_db/08-folklore_deluxe_version.csv")
evermore_df = pd.read_csv("/content/drive/MyDrive/taylor_db/09-evermore_deluxe_version.csv")

In [10]:
musics.add_lyrics(fearless_df)
musics.add_lyrics(speak_df)
musics.add_lyrics(red_df)
musics.add_lyrics(taylor_1989_df)
musics.add_lyrics(reputation_df)
musics.add_lyrics(lover_df)
musics.add_lyrics(folklore_df)
musics.add_lyrics(evermore_df)

# Pre-trained Model


In [11]:
# Creating the training dataset from the Taylor Swift lyrics organized above

class lyricTrain(Dataset):
  def __init__(self, control_code, lyrics_dict, truncate = False, model = "gpt2", albums = None,
               max_length = 1024):
    self.tokenizer = GPT2Tokenizer.from_pretrained(model)
    self.lyrics = []

    for (album, _track), text in lyrics_dict.items():
      # When an album list is given, skip songs from other albums
      if albums and album not in albums:
        continue
      # max_length cuts the lyrics by characters, before tokenization
      self.lyrics.append(torch.tensor(
          self.tokenizer.encode(f"<|{control_code}|>{text[:max_length]}<|endoftext|>")
      ))

    if truncate:
      self.lyrics = self.lyrics[:20000]

    self.lyrics_len = len(self.lyrics)

  def __len__(self):
    return self.lyrics_len

  def __getitem__(self, item):
    return self.lyrics[item]

df_train = lyricTrain("startoftext", musics.lyrics)

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]
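To see what each training example looks like before tokenization, the sketch below reproduces only the string formatting used by `lyricTrain`. The helper `format_song` is hypothetical and the lyrics are made up. As far as I know, `<|endoftext|>` is GPT-2's end-of-text token, while `<|startoftext|>` is not in the stock vocabulary and is therefore split into ordinary sub-word pieces.

```python
def format_song(control_code, text, max_length=1024):
    """Wrap lyrics in a start marker and GPT-2's end-of-text marker."""
    # The slice counts characters, not tokens
    return f"<|{control_code}|>{text[:max_length]}<|endoftext|>"


example = format_song("startoftext", "la la la\n oh oh oh\n ", max_length=8)
print(example)
```

With `max_length=8` only the first eight characters of the lyrics survive, which shows that the limit applies to characters rather than to tokens.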

In [12]:
# Loading the tokenizer and the model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Packing several songs into one tensor of up to max_seq_len tokens (GPT-2 is very large)
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]
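The packing logic of `pack_tensor` is easier to follow with plain lists. The sketch below mirrors its three branches using a hypothetical `pack_list` helper and made-up token ids; it is an illustration, not a replacement for the tensor version.

```python
def pack_list(new_tokens, packed_tokens, max_seq_len):
    """List-based mirror of pack_tensor: returns (batch, keep_packing, leftover)."""
    if packed_tokens is None:
        # Nothing packed yet: start a new pack with this song
        return new_tokens, True, None
    if len(new_tokens) + len(packed_tokens) > max_seq_len:
        # No room: hand back what was packed so far and the song that did not fit
        return packed_tokens, False, new_tokens
    # Room left: the new song goes in front and the old first token is dropped
    return new_tokens + packed_tokens[1:], True, None


packed = None
for song in ([1, 2, 3], [4, 5], [6, 7, 8, 9]):
    packed, keep_packing, leftover = pack_list(song, packed, max_seq_len=6)
    print(packed, keep_packing, leftover)
```

The third value matters: when a song does not fit, it is returned as the leftover so the caller can start the next pack with it.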

In [13]:
# Training function

def train(
    dataset, model, tokenizer,
    batch_size=16, epochs=5, lr=2e-5,
    max_seq_len=768, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False, save_model_on_epoch=False,
):
    # Use the GPU when one is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    # Note: with num_training_steps=-1 the learning rate drops to 0 once warmup ends
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    # One song per item; songs are packed together inside the loop
    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss = 0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        print(loss)
        for idx, entry in tqdm(enumerate(train_dataloader)):
            # Pack songs until the next one would exceed max_seq_len tokens
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, max_seq_len)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            # Gradient accumulation: update the weights every batch_size packed sequences
            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            # Start the next pack with the song that did not fit (None when everything fit)
            input_tensor = remainder
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model
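The training function passes `num_training_steps=-1` to `get_linear_schedule_with_warmup`. The plain-Python sketch below reproduces the multiplier that, to my understanding, this scheduler applies to the base learning rate, so the effect of that value can be inspected. The helper name `linear_warmup_factor` and the total of 1000 steps are assumptions made for illustration.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 between the end of warmup and total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
for step in (0, 100, 200, 600, 1000):
    with_total = base_lr * linear_warmup_factor(step, 200, 1000)
    with_minus_one = base_lr * linear_warmup_factor(step, 200, -1)
    print(step, with_total, with_minus_one)
```

With a real step count the rate ramps up and then decays; with `-1` it ramps up during warmup and then stays at zero, so passing an estimate of the number of optimizer steps is worth considering for longer runs.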

In [14]:
train(df_train, model = model, tokenizer = tokenizer, epochs = 10)



Training epoch 0
0


163it [00:10, 15.25it/s]


Training epoch 1
tensor(3.4511, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 17.12it/s]


Training epoch 2
tensor(3.0876, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 17.84it/s]


Training epoch 3
tensor(3.6804, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 17.59it/s]


Training epoch 4
tensor(3.7323, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 17.44it/s]


Training epoch 5
tensor(3.1329, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 17.06it/s]


Training epoch 6
tensor(3.2057, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 16.81it/s]


Training epoch 7
tensor(3.5514, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 16.72it/s]


Training epoch 8
tensor(4.1639, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:09, 16.80it/s]


Training epoch 9
tensor(2.7566, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [00:10, 16.20it/s]


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

## GENERATION


In [15]:
# Excerpt from Lady Gaga's song 'Born This Way'
prompt = '''It doesn't matter if you love him or capital H-I-M
Just put your paws up
'Cause you were born this way, baby
'''

In [16]:
generated = tokenizer(prompt, return_tensors = "pt").input_ids.cuda()

# Generating the text
'''
do_sample = enables sampling instead of always picking the most likely token
top_k = number of highest-probability tokens considered when choosing the next token
max_length = maximum length of the output in tokens (prompt included)
temperature = controls the randomness of the generation: >1 more random, <1 less random
num_return_sequences = number of sequences to generate
pad_token_id = uses the EOS (end of sequence) token for padding, since GPT-2 has no padding token

'''
output = model.generate(generated, do_sample = True, top_k = 50,
                         max_length = 100, temperature = 1,
                         num_return_sequences = 1,
                         pad_token_id=tokenizer.eos_token_id)
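To show what `top_k` and `temperature` do, here is a minimal sketch of top-k sampling for a single step in plain Python. It is a simplified illustration, not the code inside `model.generate`; the function `sample_next` and the toy scores are hypothetical.

```python
import math
import random


def sample_next(logits, top_k=50, temperature=1.0, rng=random):
    """Pick one token id from raw scores using top-k sampling."""
    # Keep only the k highest-scoring token ids
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature rescales the scores before the softmax
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one id in proportion to its softmax weight
    return rng.choices(top_ids, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.0, -1.0]   # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
print([sample_next(toy_logits, top_k=2, temperature=1.0, rng=rng) for _ in range(10)])
```

With `top_k=2` only token ids 0 and 2 can ever be drawn. Raising the temperature flattens their weights, and lowering it makes the highest-scoring id dominate.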

In [17]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

It doesn't matter if you love him or capital H-I-M
Just put your paws up
'Cause you were born this way, baby
This is what you, this is what you all love
I never knew you, never had to
Love me, never ever said I'd make you, you got me
Don't you ever say you could die for something
Don't you think I would happen, or you'd break your heart
This is what is, and


## REFERENCES

> [About fine-tuning](https://platform.openai.com/docs/guides/fine-tuning)
>
> [About transformers](https://blog.nvidia.com.br/2022/04/19/o-que-e-um-modelo-transformer/)
