## News Generation with GPT-2

In this notebook we fine-tune a GPT-2 model pretrained on Brazilian Portuguese (PT-BR) to generate economics news headlines.

In [1]:
!pip install scikit-learn -q
!pip install torch==1.11.0
!pip install transformers[torch] -q

!pip install accelerate -q

!pip install -U jupyter
!pip install -U ipywidgets

DEPRECATION: torch-tensorrt 1.1.0a0 has a non-standard dependency specifier torch>=1.10.0+cu113<1.11.0. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of torch-tensorrt or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com

In [2]:
!nvidia-smi

Fri Nov 24 03:04:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA Quadro R...  On   | 00000000:3F:00.0 Off |                  Off |
| 33%   37C    P8    22W / 260W |      3MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Dataset preparation with ``TextDataset``

The Hugging Face suite provides dedicated helpers for working with text datasets.

The content is loaded from a text file (a local copy in the active cell; the commented-out paths point to Google Drive) and split into training and test sets: 85% of the examples go to training and 15% to testing.
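
As a rough illustration of what an 85/15 split does, the sketch below shuffles a list with a fixed seed and cuts it at the 15% mark. It is a standard-library stand-in, not the scikit-learn implementation; the function name `split_85_15`, the seed, and the toy headlines are assumptions made for the example.

```python
import math
import random

def split_85_15(items, test_fraction=0.15, seed=42):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Round the test size up so that small inputs still get test examples.
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

headlines = [f"headline {i}" for i in range(10)]  # toy data
train_demo, test_demo = split_85_15(headlines)
print(len(train_demo), len(test_demo))
```

Fixing the seed makes the split reproducible between runs, which helps when comparing models trained on different days.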

In [3]:
import os

In [4]:
# The context manager closes the file even if reading fails.
with open("./dataset_full_preprocessed_labeled.txt", "r") as text_file:
#with open("/content/drive/MyDrive/gpt2-noticias/dataset_pos.txt", "r") as text_file:
#with open("/content/drive/MyDrive/gpt2-noticias/dataset_neg.txt", "r") as text_file:
    lines = text_file.readlines()

print(lines[0:10])
print(len(lines))

['Na disputa loira, Devassa levou a melhor\n', 'Espumante quer vender como cerveja\n', 'Cinco novas marcas de cerveja na praça\n', 'Mea culpa, contradições & mais Devassa\n', 'Nizan Guanaes deixa o comando da Africa\n', 'Cade vive um clima de guerra civil\n', 'AmBev amplia fábrica no Amazonas\n', 'Heineken se insinua na guerra das cervejas\n', 'Brasil ganha cerveja de mosteiro\n', 'Correção: governo estuda correção da tabela do IRPF\n']
141388


In [5]:
import re
import json
from sklearn.model_selection import train_test_split

#with open('recipes.json') as f:
#    data = json.load(f)

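def build_text_files(data_json, dest_path):
    # Write all headlines to dest_path on a single line,
    # each one followed by two spaces as a separator.
    with open(dest_path, 'w') as f:
        data = ''
        for texts in data_json:
            summary = str(texts).strip()
            summary = re.sub(r"\s", " ", summary)
            data += summary + "  "
        f.write(data)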

train, test = train_test_split(lines,test_size=0.15)

train_path = 'train_dataset.txt'
test_path = 'test_dataset.txt'

build_text_files(train,train_path)
build_text_files(test,test_path)

print("Train dataset length: "+str(len(train)))
print("Test dataset length: "+ str(len(test)))

Train dataset length: 120179
Test dataset length: 21209
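
Each file written by `build_text_files` holds every headline on a single line, with two spaces after each headline. The sketch below reproduces that formatting in memory (no file is written) on two headlines from the preview above, to show that the two-space separator is enough to recover them; `flatten_headlines` is a hypothetical helper name.

```python
import re

def flatten_headlines(headlines):
    """Join headlines into one string, mirroring build_text_files."""
    parts = []
    for h in headlines:
        # Strip the trailing newline and turn inner whitespace into spaces.
        parts.append(re.sub(r"\s", " ", str(h).strip()))
    # Every headline, including the last one, is followed by two spaces.
    return "".join(p + "  " for p in parts)

sample = ["AmBev amplia fábrica no Amazonas\n", "Brasil ganha cerveja de mosteiro\n"]
corpus = flatten_headlines(sample)
recovered = [h for h in corpus.split("  ") if h]
print(recovered)
```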


### Loading the tokenizer
For this project we use the tokenizer of the `pierreguillou/gpt2-small-portuguese` model, available on [Hugging Face](https://huggingface.co/pierreguillou/gpt2-small-portuguese).

In [6]:
from transformers import TextDataset,DataCollatorForLanguageModeling
from transformers import AutoTokenizer


def load_dataset(train_path,test_path,tokenizer):
    #train_dataset = load_dataset("text", data_dir=train_path)
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)

    #test_dataset = load_dataset("text", data_dir=test_path)
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator
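
To our understanding, `TextDataset` tokenizes the whole file and slices the token ids into consecutive blocks of `block_size`, discarding a final block that is too short. The sketch below imitates that slicing on plain integers instead of real token ids; it is an illustration under that assumption, not the library's code, and `chunk_into_blocks` is a hypothetical name.

```python
def chunk_into_blocks(token_ids, block_size=128):
    """Cut a flat list of token ids into full blocks of block_size."""
    blocks = []
    # Stop before a trailing remainder that cannot fill a whole block.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

fake_ids = list(range(300))  # stand-in for tokenizer output
blocks = chunk_into_blocks(fake_ids)
print(len(blocks), len(blocks[0]))
```

Because the headlines are concatenated before chunking, a block can begin or end in the middle of a headline.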



In [7]:
start_from_epoch = 6
start_from_epoch = False

project_name = 'gpt2-noticias'

model_name = "pierreguillou/gpt2-small-portuguese"
#model_name = 'bigscience/bloom-560m'
#model_name = 'maritaca-ai/sabia-7b'

tokenizer_name = model_name

#model_path = '/content/drive/MyDrive/gpt2-noticias/bigscience/bloom-560m/2023_11_21_1425'
#model_path = '/content/drive/MyDrive/gpt2-noticias/pierreguillou/gpt2-small-portuguese/2023_10_17_0001/epoch_'+str(start_from_epoch)
#model_path = '/content/drive/MyDrive/gpt2-noticias/pierreguillou/gpt2-small-portuguese/'
#model_path = 'H:\My Drive/gpt2-noticias/pierreguillou/gpt2-small-portuguese/2023_10_06_1511/epoch_'+str(start_from_epoch)
model_path = ''

(model_name, model_path)

('pierreguillou/gpt2-small-portuguese', '')

In [8]:
# Create the local output folder if it does not exist yet.
os.makedirs('models', exist_ok=True)

In [9]:
from huggingface_hub import login
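# Read the access token from the environment rather than hardcoding it in the notebook.
# HF_TOKEN is an assumed variable name; set it before running this cell.
login(token = os.environ['HF_TOKEN'])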

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [10]:
print(tokenizer_name)

pierreguillou/gpt2-small-portuguese


In [11]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)



Creating the output folders for the results.

In [12]:
from datetime import date
from datetime import datetime

date_str = datetime.now().strftime("%Y_%m_%d_%H%M")

model_drive_dir = model_path

if(os.path.isdir(model_drive_dir) == False):
  model_drive_dir = os.path.join('models', model_name)

if(start_from_epoch == False):
    model_drive_dir = os.path.join(model_drive_dir, date_str)
model_drive_dir

'models/pierreguillou/gpt2-small-portuguese/2023_11_24_0304'
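
The directory naming used above can be isolated in a small function, which makes it easy to check the resulting path without touching the file system. This is a sketch under that reading of the cell; `run_directory` and its `root` parameter are hypothetical names.

```python
import os
from datetime import datetime

def run_directory(model_name, when, root="models"):
    """Build <root>/<model_name>/<YYYY_MM_DD_HHMM> for a training run."""
    stamp = when.strftime("%Y_%m_%d_%H%M")
    return os.path.join(root, model_name, stamp)

path = run_directory("pierreguillou/gpt2-small-portuguese", datetime(2023, 11, 24, 3, 4))
print(path)
```

Passing the timestamp in as an argument, instead of calling `datetime.now()` inside the function, keeps the result deterministic.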

### Loading the model
Next, we load the model and define the functions for training, generating headlines, and saving the generated headlines.

In [13]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

def load_model(model_name,model_drive_dir, data_collator, train_dataset, test_dataset):
  #model = AutoModelForMaskedLM.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)


  training_args = TrainingArguments(
      output_dir=model_drive_dir, #The output directory
      overwrite_output_dir=True, #overwrite the content of the output directory
      num_train_epochs=180, # number of training epochs
      per_device_train_batch_size=128, # batch size for training
      per_device_eval_batch_size=256,  # batch size for evaluation
      eval_steps = 500, # Number of update steps between two evaluations.
      save_steps=500, # after # steps model is saved
      warmup_steps=200,# number of warmup steps for learning rate scheduler
      prediction_loss_only=False,
      learning_rate=2.5e-5,
      evaluation_strategy="steps",
      load_best_model_at_end=True,
      auto_find_batch_size=True
  )

  return Trainer(
      model=model,
      args=training_args,
      data_collator=data_collator,
      train_dataset=train_dataset,
      eval_dataset=test_dataset,
  )

`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
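
To make `warmup_steps=200` and `learning_rate=2.5e-5` concrete: the notebook does not set `lr_scheduler_type`, so the default linear schedule is assumed here. Under that schedule the rate climbs linearly from 0 to the peak during warmup and then decays linearly to 0 at the last step. The sketch computes that curve in plain Python; `total_steps=5000` is a hypothetical value chosen only for illustration.

```python
def linear_warmup_lr(step, peak_lr=2.5e-5, warmup_steps=200, total_steps=5000):
    """Learning rate at an optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for s in (0, 100, 200, 2600, 5000):
    print(s, linear_warmup_lr(s))
```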


In [14]:
def save_noticias_geradas(model_drive_dir, noticias):
  # Write one generated headline per line.
  path = os.path.join(model_drive_dir, 'noticias_geradas_'+date_str+'.txt')
  with open(path, 'w+') as file:
    for n in noticias:
      #print(n)
      file.write(n)
      file.write("\n")

def save_metrics(model_drive_dir, metricas):
  # Store the evaluation metrics as plain text.
  path = os.path.join(model_drive_dir, 'metricas_'+date_str+'.txt')
  with open(path, 'w+') as file:
    file.write(str(metricas))

In [15]:
from transformers import pipeline

def generate_noticias(model_drive_dir, model_name, qtde_noticias = 10, autosave = False):
  print("Generating "+str(qtde_noticias)+" headlines with model: "+model_drive_dir+', tokenizer='+model_name)
  base_str =  'A  '
  noticias = []

  gerador_noticias = pipeline('text-generation', model=model_drive_dir, tokenizer=model_name, max_new_tokens=100)

  while len(noticias) < qtde_noticias:
    # Headlines in the training file are separated by two spaces.
    noticias_geradas = gerador_noticias(base_str)[0]['generated_text'].split('  ')
    if(len(noticias_geradas) == 1):
      # No separator found: generate again and split on periods instead.
      noticias_geradas = gerador_noticias(base_str)[0]['generated_text'].split('.')
    if(len(noticias_geradas) == 1):
      # Still a single chunk: reset the prompt and try again.
      base_str = '  '
      continue

    for n in noticias_geradas:
      # Skip fragments that are too short to be a headline.
      if(len(n.strip()) < 3):
        continue
      noticias.append(n.strip())
      print(n.strip())

    # The last piece may be cut off by the token limit, so it is removed
    # from the results and reused as the next prompt.
    if noticias:
      base_str = noticias.pop()

    if autosave == True:
      save_noticias_geradas(model_drive_dir, noticias)

    print("Generated ", len(noticias))

  return noticias
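
The post-processing step inside `generate_noticias` can be tried without loading a model. The sketch below applies the same idea to a made-up string: split on the two-space separator, fall back to periods, and drop fragments shorter than three characters. Unlike the function above, it re-splits the same text instead of generating again; `split_headlines` and the sample text are assumptions for illustration.

```python
def split_headlines(generated_text, min_chars=3):
    """Split pipeline output into headlines and drop near-empty pieces."""
    pieces = generated_text.split("  ")
    if len(pieces) == 1:
        # No two-space separator: fall back to sentence boundaries.
        pieces = generated_text.split(".")
    return [p.strip() for p in pieces if len(p.strip()) >= min_chars]

text = "A  Governo estuda nova regra fiscal  Banco amplia crédito  e"
print(split_headlines(text))
```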

## Training the model

Training can start from the pretrained checkpoint or resume from a previously saved epoch.

In [16]:
#start_from_epoch=6
if(start_from_epoch != False):
    saved_model_drive_dir = os.path.join(model_drive_dir, 'epoch_'+str(start_from_epoch))
    model_name = saved_model_drive_dir
else:
    saved_model_drive_dir = os.path.join(model_drive_dir, 'epoch_0')

#if(os.path.isdir(model_drive_dir) == False):
#    saved_model_drive_dir = model_name

(model_drive_dir, saved_model_drive_dir)


('models/pierreguillou/gpt2-small-portuguese/2023_11_24_0304',
 'models/pierreguillou/gpt2-small-portuguese/2023_11_24_0304/epoch_0')

In [17]:
trainer = load_model(model_name,saved_model_drive_dir, data_collator, train_dataset, test_dataset)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [18]:
#for i in range(start_from_epoch+1,start_from_epoch+2):
i=2

iteration_model_path = os.path.join(model_drive_dir, "epoch_" + str(i))

trainer.train()

print("Saving model to: "+iteration_model_path)
trainer.save_model(iteration_model_path)

metrics = str(trainer.evaluate())
print(metrics)
save_metrics(iteration_model_path, metrics)

#noticias = generate_noticias(iteration_model_path, model_name, 50)
#save_noticias_geradas(iteration_model_path, noticias)

start_from_epoch=start_from_epoch + 1

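| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 4.3003 | 3.70686  |
| 1000 | 3.6469 | 3.532894 |
| 1500 | 3.4577 | 3.447563 |
| 2000 | 3.3316 | 3.399931 |
| 2500 | 3.2349 | 3.367739 |
| 3000 | 3.1539 | 3.345399 |
| 3500 | 3.0831 | 3.332832 |
| 4000 | 3.0221 | 3.324341 |
| 4500 | 2.9646 | 3.316356 |
| 5000 | 2.9146 | 3.317461 |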


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Saving model to: models/pierreguillou/gpt2-small-portuguese/2023_11_24_0304/epoch_2


{'eval_loss': 3.3155109882354736, 'eval_runtime': 10.0229, 'eval_samples_per_second': 242.244, 'eval_steps_per_second': 0.998, 'epoch': 180.0}
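
One way to read these numbers: the validation loss stops improving around step 4500 while the training loss keeps falling, and the reported `eval_loss` can be turned into a perplexity by exponentiating it, since the loss is an average cross-entropy per token. The sketch below does both with values copied from the log and the metrics above; treating the step with the lowest validation loss as the "best" one is our interpretation, not something the notebook states.

```python
import math

# (step, training loss, validation loss) copied from the training log
log = [
    (500, 4.3003, 3.70686), (1000, 3.6469, 3.532894),
    (1500, 3.4577, 3.447563), (2000, 3.3316, 3.399931),
    (2500, 3.2349, 3.367739), (3000, 3.1539, 3.345399),
    (3500, 3.0831, 3.332832), (4000, 3.0221, 3.324341),
    (4500, 2.9646, 3.316356), (5000, 2.9146, 3.317461),
]

# Step with the lowest validation loss.
best_step, _, best_val = min(log, key=lambda row: row[2])

# Perplexity is the exponential of the mean cross-entropy loss.
eval_loss = 3.3155109882354736
perplexity = math.exp(eval_loss)

print(best_step, best_val, perplexity)
```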
