# AI Juris

In [1]:
import pandas as pd
import numpy as np
import nltk
import shutil
import evaluate
from datasets import load_dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Remove output folders left over from a previous run so that re-running the notebook starts clean
for folder in ('train_logs', 'train_results', 'saved_model'):
    # ignore_errors=True skips a missing folder instead of aborting the rest of the cleanup
    shutil.rmtree(folder, ignore_errors=True)

## Load Data

In [None]:
# File name
filename = 'data/dataset.csv'

# Load data
dataset = load_dataset('csv', data_files=filename)

# Split into training and test sets with an 80/20 ratio
dataset = dataset['train'].train_test_split(test_size = 0.2)

# Show dataset format
dataset

## Tokenizer and Open-Source LLM

https://huggingface.co/google/flan-t5-base

In [None]:
# Load Tokenizer
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-base')

# Showing the tokenizer
tokenizer

In [None]:
# Load pretrained LLM
model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-base')

# Show the model
model

In [None]:
# Data collator that dynamically pads inputs and labels to the longest sequence in each batch
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Show the Data Collator
data_collator
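
The padding the collator performs can be pictured with a small pure-Python sketch. This is an illustration only, not the library's implementation: `pad_batch` and `toy_features` are hypothetical names, and the token ids are made up. It assumes T5's pad token id of `0` and the collator's default label padding value of `-100`, which the loss function ignores.

```python
# Illustrative sketch of dynamic padding; not the library's implementation.
PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss


def pad_batch(features):
    """Pad every example to the longest input and longest label in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_input - n_in))
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_input - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_label - n_lab))
    return batch


# Two made-up examples of different lengths
toy_features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 22, 1]},
    {"input_ids": [14, 1], "labels": [23, 24, 25, 26, 1]},
]
padded = pad_batch(toy_features)
print(padded["input_ids"])       # [[11, 12, 13, 1], [14, 1, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(padded["labels"])          # [[21, 22, 1, -100, -100], [23, 24, 25, 26, 1]]
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why the preprocessing step can truncate without padding.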

## Data Preprocessing

In [7]:
# Every input will receive the prefix: "answer the question"
prefix = "answer the question: "

In [8]:
# Preprocessing function
def data_preprocess(data):
    # Concatenate the prefix to each question in the list of questions given in data["question"]
    inputs = [prefix + doc for doc in data['question']]

    # Use the tokenizer to convert the prefixed questions into tokens with a maximum length of 128, truncating any that are longer
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # Tokenize the answers given in data['answer'] with a maximum length of 512, truncating any that are longer
    labels = tokenizer(text_target = data['answer'], max_length=512, truncation=True)

    # Add the tokens of response as labels in the input dictionary of the model
    model_inputs['labels'] = labels['input_ids']

    return model_inputs

In [None]:
# Applies the preprocessing function to the dataset, generating the tokenized dataset 
dataset_tokenized = dataset.map(data_preprocess, batched=True)

In [None]:
# Show the dataset tokenized
dataset_tokenized

In [None]:
dataset_tokenized['train']['question'][0]

In [None]:
dataset_tokenized['train']['answer'][0]

In [None]:
dataset_tokenized['train']['input_ids'][0]

## Defining the Evaluate Metric

In [None]:
# The "punkt" package is specifically for the task of tokenization, which involves splitting a text
# into a list of sentences
nltk.download("punkt", quiet = True)
nltk.download('punkt_tab')

The **ROUGE** metric (*Recall-Oriented Understudy for Gisting Evaluation*) is widely used to automatically evaluate the quality of machine-generated text summaries by comparing them to human-written reference summaries. It measures the overlap between n-grams, words, or sequences of words in the generated text and the reference text.

### Common ROUGE Variants

1. **ROUGE-N**:
   - Measures the overlap of n-grams between the generated and reference texts.
   - Example: ROUGE-1 (for unigrams), ROUGE-2 (for bigrams), etc.
   - Formula:
     
     $\text{ROUGE-N} = \frac{\sum_{S \in \text{Reference}} \sum_{\text{n-gram} \in S} \text{Count\_overlap}(\text{n-gram})}{\sum_{S \in \text{Reference}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})}$
     

2. **ROUGE-L**:
   - Based on the *Longest Common Subsequence* (LCS), measuring the longest sequence of words that appears in both the generated and reference texts.
   - Useful because it accounts for word order without requiring the words to be contiguous.

3. **ROUGE-W**:
   - A variation of ROUGE-L that assigns weights to continuous subsequences, giving more importance to longer segments.

4. **ROUGE-S** (or ROUGE-Skip):
   - Measures co-occurrences of word pairs that appear in the same order but may be separated by other words.

5. **ROUGE-SU**:
   - Combines ROUGE-S with unigrams, adding more context to the evaluation.
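
The ROUGE-N formula and the LCS idea behind ROUGE-L can be sketched in a few lines of plain Python. This is a simplified illustration, not the `rouge` implementation loaded later in the notebook: the helper names and the two example sentences are hypothetical, tokens are split on whitespace, and no stemming is applied.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference, candidate, n):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((ref_counts & cand_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


reference = "the court dismissed the appeal".split()
candidate = "the court rejected the appeal".split()

print(rouge_n_recall(reference, candidate, 1))  # 0.8
print(rouge_n_recall(reference, candidate, 2))  # 0.5
print(lcs_length(reference, candidate))         # 4
```

Dividing the LCS length by the reference length (here 4/5) gives the recall side of ROUGE-L.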

### Key Components

- **Recall**:
  - Measures how much of the reference text is captured in the generated text.
  - Useful for summarization tasks, as it prioritizes capturing essential information.

- **Precision**:
  - Measures how much of the generated text is present in the reference.
  - Less commonly emphasized in ROUGE but still relevant.

- **F1-Score**:
  - Combines *Recall* and *Precision* to provide a balanced metric.
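
As a minimal sketch of how the three components relate, the function below derives them from raw counts. The function name and the counts in the usage line are hypothetical, chosen only to show that precision and recall differ when the generated text is longer than the reference.

```python
def precision_recall_f1(overlap, reference_total, generated_total):
    """Compute precision, recall and F1 from overlap counts."""
    precision = overlap / generated_total if generated_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 3 shared units, 4 in the reference, 6 in the generated text
p, r, f1 = precision_recall_f1(overlap=3, reference_total=4, generated_total=6)
print(p, r, round(f1, 3))  # 0.5 0.75 0.6
```

A long answer that contains most of the reference scores well on recall but is penalized on precision, which is why the F1 value is usually the one reported.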

### Applications

- Evaluating text summarization models.
- Comparing generated texts in tasks like machine translation, captioning, or automated responses.

### Example

Consider the reference summary:  
**"The cat is on the roof"**  
Machine-generated summary:  
**"The cat sleeps on the roof"**

- **ROUGE-1 (Unigrams)**, counting repeated words and ignoring case:  
  - Unigrams in the reference: {the, cat, is, on, the, roof} (6 in total)  
  - Unigrams in the generated text: {the, cat, sleeps, on, the, roof} (6 in total)  
  - Overlap: {the, cat, on, the, roof} (5 in total)  
  - Recall = 5/6 ≈ 0.83 (83%)  
  - Precision = 5/6 ≈ 0.83 (83%)  
  - F1-Score ≈ 0.83 (83%)

- **ROUGE-2 (Bigrams)**:  
  - Bigrams in the reference: {the cat, cat is, is on, on the, the roof} (5 in total)  
  - Bigrams in the generated text: {the cat, cat sleeps, sleeps on, on the, the roof} (5 in total)  
  - Overlap: {the cat, on the, the roof} (3 in total)  
  - Recall = 3/5 = 0.6 (60%)  
  - Precision = 3/5 = 0.6 (60%)  
  - F1-Score = 0.6 (60%)

ROUGE is useful for automatically evaluating the quality of texts but does not capture semantic or creative nuances. Therefore, it is recommended as a complement to human evaluation.

In [15]:
# Defining the metric
metric = evaluate.load('rouge')


In [16]:
# Metric calculate function
def calculate_metric(eval_preds):

    # Unpack the predictions and labels from the eval_preds argument
    predictions, labels = eval_preds

    # Replace every -100 in labels (positions ignored by the loss) with the padding token ID so they can be decoded
    labels = np.where(labels != -100,
                      labels,
                      tokenizer.pad_token_id)
    
    # Decode predictions to text, ignoring special tokens
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Decode labels to text, ignoring special tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Add a new line after each sentence to the decoded predictions, preparing them for ROUGE evaluation
    decoded_predictions = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions]
    
    # Add a new line after each sentence to the decoded labels, preparing them for ROUGE evaluation
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]


    # Calculate the ROUGE metric between predictions and decoded labels, using a stemmer
    result = metric.compute(predictions = decoded_predictions,
                            references = decoded_labels,
                            use_stemmer = True)
    
    # Returns the result of ROUGE metric
    return result
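
The label clean-up step in the function above exists because `-100` is not a valid token id and cannot be decoded. A pure-Python analogue of the `np.where` call is sketched below; `restore_padding` and the toy ids are hypothetical, and the pad id of `0` is the value T5 uses.

```python
PAD_TOKEN_ID = 0       # T5's pad token id
IGNORE_INDEX = -100    # marks label positions the loss skips


def restore_padding(label_rows):
    """Swap every -100 for the pad id so that each row can be decoded."""
    return [
        [PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row]
        for row in label_rows
    ]


# Made-up label ids as they might arrive in eval_preds
toy_labels = [[21, 22, 1, -100, -100], [23, 24, 25, 26, 1]]
print(restore_padding(toy_labels))
# [[21, 22, 1, 0, 0], [23, 24, 25, 26, 1]]
```

Because decoding uses `skip_special_tokens=True`, the restored pad tokens then disappear from the decoded text.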


The **`Seq2SeqTrainingArguments`** class from the `transformers` package is used to configure training parameters when fine-tuning sequence-to-sequence (Seq2Seq) models, such as those used in tasks like machine translation, text summarization, or conditional text generation.

This class extends the base `TrainingArguments` class and adds specific arguments tailored for Seq2Seq model training, such as those based on **T5**, **BART**, and others.

---

### **Key Arguments**
Here are the most relevant arguments:

#### **General Arguments** (inherited from `TrainingArguments`):
1. **`output_dir`**:
   - Directory where models and checkpoints will be saved.
   - Example: `output_dir="./results"`

2. **`evaluation_strategy`**:
   - Strategy for evaluation during training. Options:
     - `no`: No evaluation.
     - `steps`: Evaluation at specific step intervals.
     - `epoch`: Evaluation at the end of each epoch.

3. **`per_device_train_batch_size`**:
   - Batch size used for training on each device (CPU/GPU).

4. **`learning_rate`**:
   - Initial learning rate.

5. **`num_train_epochs`**:
   - Total number of training epochs.

6. **`save_steps`**:
   - Number of steps between each checkpoint save.

---

#### **Seq2Seq-Specific Arguments**:
1. **`predict_with_generate`**:
   - Type: `bool`
   - Indicates whether to use the `generate()` method to make predictions during evaluation.
   - Very useful for tasks like translation or summarization, where the output is a generated sequence.

2. **`generation_max_length`**:
   - Type: `int`
   - Maximum length of sequences generated during evaluation or prediction.

3. **`generation_num_beams`**:
   - Type: `int`
   - Sets the number of beams used in beam search for sequence generation.
   - Example: Setting it to `4` can improve the quality of generated text.

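4. **`label_smoothing_factor`**:
   - Type: `float`
   - Label smoothing factor used to prevent overfitting. Strictly speaking, it is inherited from `TrainingArguments`, but it is commonly set when training Seq2Seq models.
   - Example: A value like `0.1` smooths the target probabilities.

5. **`forced_bos_token_id` and `forced_eos_token_id`**:
   - IDs of special tokens to enforce as the beginning (`BOS`) or end (`EOS`) of the generated sequence.
   - These are generation settings rather than fields of `Seq2SeqTrainingArguments`; they belong to the model's generation configuration, which can be supplied through the `generation_config` argument.
   - Useful in scenarios where greater control over the model's output is required.

6. **`length_penalty`**:
   - Type: `float`
   - Also a generation setting, used with beam search: an exponent applied to the sequence length when beams are scored.
   - Values greater than 0 favor longer sequences; values less than 0 favor shorter ones.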
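
To make the effect of label smoothing concrete, here is a minimal sketch of one common formulation: the target keeps `1 - epsilon` of its mass on the correct token and spreads `epsilon` uniformly over the vocabulary. The function name and the three-token probability vector are hypothetical, and the exact computation inside `transformers` may differ in detail.

```python
import math


def smoothed_nll(probs, target, epsilon):
    """Cross-entropy against a target that keeps 1 - epsilon on the gold
    token and spreads epsilon uniformly over the whole vocabulary."""
    log_probs = [math.log(p) for p in probs]
    nll = -log_probs[target]                     # usual negative log-likelihood
    uniform = -sum(log_probs) / len(log_probs)   # average over all tokens
    return (1 - epsilon) * nll + epsilon * uniform


# Hypothetical model distribution over a 3-token vocabulary; token 0 is correct
probs = [0.7, 0.2, 0.1]
plain = smoothed_nll(probs, target=0, epsilon=0.0)
smooth = smoothed_nll(probs, target=0, epsilon=0.1)
print(round(plain, 4), round(smooth, 4))
```

The smoothed loss is higher than the plain one for a confident, correct prediction, which discourages the model from pushing probabilities all the way to 1.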

---

### **Usage Example**
Here’s an example of how to set up arguments for Seq2Seq training:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=3,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    save_steps=500,
    eval_steps=500,
    logging_dir="./logs",
    logging_steps=100
)
```

---

### **When to Use**
The **`Seq2SeqTrainingArguments`** class is essential when training Seq2Seq models using the `Trainer` provided by the `transformers` package. It offers a standardized interface to configure both general training aspects and specifics of text generation.

In [21]:
# Define the train arguments
training_args = Seq2SeqTrainingArguments(output_dir = "train_results",
                                        evaluation_strategy = "epoch",
                                        learning_rate = 3e-4,
                                        logging_dir = "train_logs",
                                        logging_steps = 1,
                                        per_device_train_batch_size = 4,
                                        per_device_eval_batch_size = 2,
                                        weight_decay = 0.01,
                                        save_total_limit = 1,
                                        num_train_epochs = 1,
                                        predict_with_generate = True,
                                        push_to_hub = False)

The **`Seq2SeqTrainer`** class from the `transformers` package is a specialized version of the `Trainer` class designed specifically for training sequence-to-sequence (Seq2Seq) models. It is tailored for tasks like machine translation, text summarization, and conditional text generation, where both the input and output are sequences.

This class simplifies the training and evaluation of Seq2Seq models by integrating features specific to generation tasks, such as beam search, sequence length constraints, and evaluation with the `generate` method.

---

### **Key Features of `Seq2SeqTrainer`**
1. **Integration with Generation**:
   - Uses the model's `generate()` method for prediction during evaluation or inference, allowing evaluation metrics to be computed on generated sequences.

2. **Support for Seq2Seq-Specific Metrics**:
   - Metrics like ROUGE, BLEU, and others that rely on generated text can be directly integrated into the evaluation pipeline.

3. **Respects Generation Settings**:
   - Forced tokens (e.g., BOS/EOS tokens set through `forced_bos_token_id` and `forced_eos_token_id`) are not trainer arguments; they come from the model's generation configuration, which `generate()` reads during evaluation.

4. **Extended Arguments**:
   - Works seamlessly with `Seq2SeqTrainingArguments`, which adds parameters like `predict_with_generate`, `generation_max_length`, and `generation_num_beams`.

5. **Label Smoothing**:
   - Applies label smoothing during training when `label_smoothing_factor` is set (behavior inherited from `Trainer`), making the model more robust and helping to prevent overfitting.

---

### **Key Methods**
#### 1. **`compute_loss`**:
   - Inherited from `Trainer`. Computes the loss during training, optionally applying label smoothing if configured.

#### 2. **`prediction_step`**:
   - Overrides the base `Trainer`'s method to support predictions using `generate()` for Seq2Seq tasks.

#### 3. **`evaluate`**:
   - Evaluates the model using generated sequences instead of raw logits.
   - Automatically applies `generation_max_length` and `generation_num_beams` for evaluation.

#### 4. **`predict`**:
   - Runs prediction on a test dataset. With `predict_with_generate` enabled, it calls the model's `generate()` method and accepts generation options such as `max_length` and `num_beams`, so greedy search or beam search can be selected per call.


---

### **Key Configuration Parameters**
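`Seq2SeqTrainer` accepts the same constructor parameters as `Trainer`. Of the settings below, `tokenizer` is passed to the trainer itself, while the others are set on `Seq2SeqTrainingArguments`:

1. **`tokenizer`**:
   - Used to tokenize the input and decode generated sequences. Recent `transformers` releases deprecate it in favor of `processing_class`.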

2. **`predict_with_generate`**:
   - Enables the use of the model's `generate()` method during evaluation.

3. **`generation_max_length`**:
   - Maximum length for generated sequences during evaluation or prediction.

4. **`generation_num_beams`**:
   - Number of beams for beam search in sequence generation.

5. **`label_smoothing_factor`**:
   - Factor for label smoothing during training.

---

### **When to Use**
- Use `Seq2SeqTrainer` when training models for tasks where both the input and output are sequences, and you need additional support for generation and sequence-based metrics.
- It is particularly suited for models like **T5**, **BART**, and **MBart**.



In [22]:
# Defining the trainer
trainer = Seq2SeqTrainer(model = model,
                        args = training_args,
                        train_dataset = dataset_tokenized["train"],
                        eval_dataset = dataset_tokenized["test"],
                        tokenizer = tokenizer,
                        data_collator = data_collator,
                        compute_metrics = calculate_metric)

## Training the model

In [23]:
%%time
trainer.train()

  0%|          | 0/749 [00:00<?, ?it/s]

{'loss': 2.2324, 'grad_norm': 2.2708263397216797, 'learning_rate': 0.0002995994659546061, 'epoch': 0.0}
{'loss': 1.9946, 'grad_norm': 1.7159390449523926, 'learning_rate': 0.00029919893190921226, 'epoch': 0.0}
{'loss': 2.5341, 'grad_norm': 2.3515491485595703, 'learning_rate': 0.0002987983978638184, 'epoch': 0.0}
{'loss': 2.0063, 'grad_norm': 1.8315774202346802, 'learning_rate': 0.00029839786381842456, 'epoch': 0.01}
{'loss': 1.8856, 'grad_norm': 1.7520458698272705, 'learning_rate': 0.0002979973297730307, 'epoch': 0.01}
{'loss': 1.8455, 'grad_norm': 1.8115766048431396, 'learning_rate': 0.00029759679572763685, 'epoch': 0.01}
{'loss': 2.3088, 'grad_norm': 1.5589631795883179, 'learning_rate': 0.00029719626168224294, 'epoch': 0.01}
{'loss': 1.7683, 'grad_norm': 2.0201447010040283, 'learning_rate': 0.0002967957276368491, 'epoch': 0.01}
{'loss': 1.8047, 'grad_norm': 1.7122702598571777, 'learning_rate': 0.00029639519359145523, 'epoch': 0.01}
{'loss': 2.1273, 'grad_norm': 1.5213422775268555, 'le

  0%|          | 0/375 [00:00<?, ?it/s]

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


{'eval_loss': 2.4804370403289795, 'eval_rouge1': 0.12885556092373202, 'eval_rouge2': 0.03069107851373484, 'eval_rougeL': 0.10321133376978474, 'eval_rougeLsum': 0.11766051924186802, 'eval_runtime': 150.7628, 'eval_samples_per_second': 4.968, 'eval_steps_per_second': 2.487, 'epoch': 1.0}
{'train_runtime': 4834.9168, 'train_samples_per_second': 0.619, 'train_steps_per_second': 0.155, 'train_loss': 2.11831885743364, 'epoch': 1.0}
CPU times: total: 7min 23s
Wall time: 1h 20min 35s


TrainOutput(global_step=749, training_loss=2.11831885743364, metrics={'train_runtime': 4834.9168, 'train_samples_per_second': 0.619, 'train_steps_per_second': 0.155, 'total_flos': 489958597260288.0, 'train_loss': 2.11831885743364, 'epoch': 1.0})
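
The `learning_rate` values in the log above shrink by the same amount at every step. That pattern is consistent with a linear decay from the configured `3e-4` to zero over the 749 training steps, with no warmup, which is the default schedule when none is specified. The sketch below reproduces the first logged values under that assumption; `linear_lr` is a hypothetical helper, not a library function.

```python
BASE_LR = 3e-4       # learning_rate passed to Seq2SeqTrainingArguments
TOTAL_STEPS = 749    # total shown by the training progress bar


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Learning rate after `step` optimizer updates: linear decay, no warmup."""
    return base_lr * max(0, total_steps - step) / total_steps


# Compare with the first three 'learning_rate' entries in the training log
for step in (1, 2, 3):
    print(step, linear_lr(step))
```

The printed values agree with the logged `0.00029959...`, `0.00029919...` and `0.00029879...` to within floating-point rounding.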

In [24]:
# Saving the model
trainer.save_model('model/saved_model')

**ROUGE-1** measures the overlap of unigrams (individual words).

**ROUGE-2** measures the overlap of bigrams (pairs of consecutive words).

**ROUGE-L** measures the overlap of the longest common subsequence between the generated summary and the reference summary. This takes into account word order, but allows for gaps. Higher values indicate better performance. ROUGE-L is calculated based on the similarity between the sequences, taking into account precision, recall, and the harmonic mean between them.

Higher ROUGE values indicate a greater similarity between the generated summary and the reference summary, which is generally interpreted as an indication of better summary quality. However, it is important to remember that no single metric can fully capture the quality of a summary, and it is useful to complement the assessment with qualitative analysis or other metrics.

## Model Deployment and Usage