Código para deixar Widget do Jupyter Notebook condizente com o tema.

In [1]:
%%html
<style>
.cell-output-ipywidget-background {
    background-color: transparent !important;
}
:root {
    --jp-widgets-color: var(--vscode-editor-foreground);
    --jp-widgets-font-size: var(--vscode-editor-font-size);
}
</style>

Aqui é para testar se o dispositivo está disponível para usar a GPU.
E import as bibliotecas necessárias para rodar o código.
Fizemos assim para também podermos rodar o código localmente, e no Google Colab.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

DATA_PATH = '/content/drive/MyDrive/data.csv'

Mounted at /content/drive


In [None]:
!pip install -U pytorch-lightning

Collecting pytorch-lightning
  Downloading pytorch_lightning-2.5.0.post0-py3-none-any.whl.metadata (21 kB)
Collecting torchmetrics>=0.7.0 (from pytorch-lightning)
  Downloading torchmetrics-1.6.1-py3-none-any.whl.metadata (21 kB)
Collecting lightning-utilities>=0.10.0 (from pytorch-lightning)
  Downloading lightning_utilities-0.12.0-py3-none-any.whl.metadata (5.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.1.0->pytorch-lightning)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.1.0->pytorch-lightning)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.1.0->pytorch-lightning)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.1.0->pytorch-lightning)
  Dow

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments

In [None]:
import torch

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
import pytorch_lightning as pl

In [None]:
import random

In [99]:
class CPProblemDataset(Dataset):
    def __init__(self, tokenizer, data_path, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = self.load_data(data_path)

    def load_data(self, path):
        data = []
        with open(path, "r") as f:
          for line in f:
            problem_statement, editorial = line.strip().split(",")

            data.append((problem_statement, editorial))

        return data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        problem, editorial = self.data[idx]

        # Add random truncation to different positions
        if random.random() > 0.5:
            problem = problem[-692:]  # Take from the end sometimes

        # Format input with task prefix
        input_text = f"Problem: {problem}"

        output_text = f"Editorial: {editorial}"

        input_encoding = self.tokenizer(
            input_text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        target_encoding = self.tokenizer(
            output_text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids': input_encoding['input_ids'].flatten(),
            'attention_mask': input_encoding['attention_mask'].flatten(),
            'labels': target_encoding['input_ids'].flatten()
        }


In [72]:
# Configuration
MODEL_NAME = 't5-small'
BATCH_SIZE = 8
MAX_LENGTH = 512  # Max length for input and output

In [100]:
# Data Module
class CPDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=BATCH_SIZE):
        super().__init__()
        self.batch_size = batch_size
        self.tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

    def setup(self, stage=None):
        self.dataset = CPProblemDataset(self.tokenizer, DATA_PATH, MAX_LENGTH)

    def train_dataloader(self):
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=4,
            persistent_workers=True
        )


In [101]:
data_module = CPDataModule()

In [102]:
def calculate_repetitions(generated_ids, ngram_size=3):
    """
    Calculates repetition penalty based on repeated n-grams
    Args:
        generated_ids: tensor of shape [batch_size, seq_len]
        ngram_size: size of n-grams to check
    Returns:
        repetition_penalty: scalar tensor
    """
    batch_size, seq_len = generated_ids.shape
    penalty = 0.0

    for seq in generated_ids:
        ngrams = set()
        repeats = 0
        for i in range(seq_len - ngram_size + 1):
            ngram = tuple(seq[i:i+ngram_size].tolist())
            if ngram in ngrams:
                repeats += 1
            else:
                ngrams.add(ngram)

        # Normalize by sequence length
        penalty += repeats / seq_len

    return penalty / batch_size  # Average across batch

In [103]:
# Lightning Module
class CPModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
        self.tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

    def forward(self, input_ids, attention_mask, labels=None):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

    def training_step(self, batch, batch_idx):
        outputs = self(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['labels']
        )

        # Inicial:
        #loss = output.loss
        #self.log('train_loss', loss)
        #return loss

        # Com sequência em gramas:
        # Calculate input-output token overlap penalty
        input_tokens = batch['input_ids']
        output_tokens = torch.argmax(outputs.logits, dim=-1)

        # Penalize matching n-grams (adjust n=3 as needed)
        copy_penalty = 0
        for seq_in, seq_out in zip(input_tokens, output_tokens):
            in_ngrams = set(tuple(seq_in[i:i+3]) for i in range(len(seq_in)-2))
            out_ngrams = set(tuple(seq_out[i:i+3]) for i in range(len(seq_out)-2))
            copy_penalty += len(in_ngrams & out_ngrams) / len(out_ngrams)

        total_loss = outputs.loss + 0.5 * copy_penalty  # Weight is tunable
        self.log('train_loss', total_loss)
        return total_loss

        # Com logits:
        #logits = outputs.logits
        #preds = torch.argmax(logits, dim=-1)

        # Calcularr repetição
        #rep_penalty = calculate_repetitions(preds) * 0.1  # Weight hyperparameter

        #total_loss = outputs.loss + rep_penalty
        #self.log('train_loss', total_loss)


    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=3e-5)

In [104]:
model = CPModel()

In [105]:
trainer = pl.Trainer(
    max_epochs=10,
    gradient_clip_val=0.5,
    check_val_every_n_epoch=2,
    val_check_interval=0.25,
    log_every_n_steps=10,
    precision='32-true'
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [106]:
trainer.fit(model, data_module)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                       | Params | Mode
------------------------------------------------------------
0 | model | T5ForConditionalGeneration | 60.5 M | eval
------------------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
242.026   Total estimated model params size (MB)
0         Modules in train mode
277       Modules in eval mode


Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


In [107]:
model.model.save_pretrained("t5-small-cp-solver-4")
model.tokenizer.save_pretrained("t5-small-cp-solver-4")

('t5-small-cp-solver-4/tokenizer_config.json',
 't5-small-cp-solver-4/special_tokens_map.json',
 't5-small-cp-solver-4/spiece.model',
 't5-small-cp-solver-4/added_tokens.json')

In [None]:
device_name = torch.cuda.get_device_name(0)

torch_device_name = "cpu" if "AMD Radeon RX 580 2048SP" in device_name else "cuda"

print(f"Arquitetura de dispositivo: '{device_name}' e nome do dispositivo: '{torch_device_name}'" )

In [158]:
class CPSolver:
    def __init__(self, model_path="t5-small-cp-solver-4", torch_device_name=torch_device_name):
        self.model = T5ForConditionalGeneration.from_pretrained(model_path)
        self.tokenizer = T5Tokenizer.from_pretrained(model_path)
        self.device = torch.device(torch_device_name)
        self.model.to(self.device)

    def solve(self, problem_statement):
        # Preprocess long input
        input_text = f"Generate a step-by-step competitive programming editorial for this problem: {problem_statement[:9000]}"

        inputs = self.tokenizer(
            input_text,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        ).to(self.device)

        outputs = self.model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=256,
            num_beams=4,  # Increased beam width
            early_stopping=True,
            length_penalty=2,  # Encourage shorter sequences
            no_repeat_ngram_size=3,  # Prevent 3-gram repeats
            temperature=0.5,  # Add some randomness
            top_k=50,               # Consider top 50 tokens
            top_p=0.95,             # Nucleus sampling
            repetition_penalty=2.0, # Explicit penalty
            num_return_sequences=4  # Generate multiple candidates
        )

        candidates = [self.tokenizer.decode(seq, skip_specials=True) for seq in outputs]
        return max(candidates, key=lambda x: len(x.split()))

In [159]:
solver = CPSolver()

In [166]:
problems = [
  "You are given two positive integers and In one move you can increase by replace with Your task is to find the minimum number of moves you need to do in order to make divisible by It is possible that you have to make moves as is already divisible by You have to answer independent test cases",
  "You have a matrix  filled with N integers. You want your matrix to become beautiful. The matrix is beautiful if the following two conditions are satisfied:  in each row, the first element is smaller than the second element;  in each column, the first element is smaller than the second element.   You can perform the following operation on the matrix any number of times: rotate it clockwise by  degrees, so the top left element shifts to the top right cell, the top right element shifts to the bottom right cell, and so on:  Determine if it is possible to make the matrix beautiful by applying zero or more operations.",
  "Polycarp has positive integers and He can perform the following operation Choose a integer and multiply of the integers or by Can Polycarp make it so that after performing the operation the sequence of three numbers forms an arithmetic progression Note that you the order of and Formally a sequence is called an arithmetic progression AP if there exists a number called common difference such that for all from to In this problem For example the following sequences are AP and The following sequences are not AP and You need to answer independent test cases ",
  "There are N pigeons numbered from 1 to N, and there are N nests numbered from 1 to N Initially, pigeon i is in nest i for 1 less than N You are given Q queries, which you must process in order. There are two types of queries, each given in one of the following formats: Move P pigeon to nest H, Output the number of nests that contain more than one pigeon.",
  "Adilbek was assigned to a special project For Adilbek it means that he has days to run a special program and provide its results But there is a problem the program needs to run for days to calculate the results Fortunately Adilbek can optimize the program If he spends is a non negative integer days optimizing the program he will make the program run in days is the ceiling function The program cannot be run and optimized simultaneously so the total number of days he will spend is equal to Will Adilbek be able to provide the generated results in no more than days "
]

In [169]:
problem = problems[1]
solution = solver.solve(problem)

In [153]:
import re

In [113]:
def format_str(s):
    return re.sub(r'(?=[A-Z])', '\n', s)

In [170]:
print("STATEMENT:")
print(format_str(problem))
print()
print("EDITORIAL GERADO:")
print(format_str(solution))

STATEMENT:

You have a matrix  filled with 
N integers. 
You want your matrix to become beautiful. 
The matrix is beautiful if the following two conditions are satisfied:  in each row, the first element is smaller than the second element;  in each column, the first element is smaller than the second element.   
You can perform the following operation on the matrix any number of times: rotate it clockwise by  degrees, so the top left element shifts to the top right cell, the top right element shifts to the bottom right cell, and so on:  
Determine if it is possible to make the matrix beautiful by applying zero or more operations.

EDITORIAL GERADO:
<pad> <extra_id_0> and the second element to the left of the column. 
You can do the following operation on the matrix by rotating it clockwise by degrees, so the top left element shifts to the bottom left cell and so on. 
So you can make the matrix beautiful by applying zero or more operations. 
Then if you want to make the first element bea

Rodamos o modelo inicial que decidimos usar, que é o t5-small, que foi construído
com o intuito de ser um modelo de tradução de texto, mas que pode ser utilizado para
outras tarefas de NLP.

Aqui é para carregar o tokenizador do modelo que vamos usar, que é o t5-small.

Carregar o dataset no arquivo "data.csv":
https://www.kaggle.com/datasets/dinuiongeorge/codeforces-competitive-programming-dataset

Essa função de pre-processamento vai ser usada para tokenizar os inputs e os targets,
ela vai adicionar o token "Problem:" antes de cada declaração de problema,
e vai usar o tokenizer em modo target para os labels.