# Natural Language Processing - Group assignment

**Product:** Assist customers in identifying parts of interest in legal contracts. Reviewing legal contracts can be tedious as these are usually very long discouraging most people in taking the time to check the points that are relevant to them. The goal is to give a tool that would help any costumer identifying the parts of a contract that they are interested in or they want to check. They would simply need to ask their question and the model would point them to the sentence(s) that provide an answer to this question. <br>
**NLP Task:** We need to build an extractive question answering model which consists in extracting parts of a document answering the question asked. To this end, we will use a pre-trained language modelling model with its corresponding tokenizer and fine-tune it on a legal contracts dataset.<br>
<br>
Pre-trained model: https://huggingface.co/saibo/legal-roberta-base <br>
Dataset for fine-tuning: https://huggingface.co/datasets/cuad/tree/refs%2Fconvert%2Fparquet/default <br>
QA Tutorial: https://huggingface.co/docs/transformers/tasks/question_answering <br>
Extractive QA Tutorial: https://huggingface.co/course/chapter7/7?fw=pt#postprocessing

## 0. Install dependencies

In [2]:
#! pip3 install torch torchvision
#! pip install transformers datasets evaluate
#! pip install datasets --quiet

from datasets import concatenate_datasets, load_dataset, Dataset
from transformers import AutoTokenizer, DefaultDataCollator, AutoModelForQuestionAnswering, TrainingArguments, Trainer, TFAutoModelForQuestionAnswering
import numpy as np
import pandas as pd
import tensorflow as tf
import pandas
from tqdm.auto import tqdm
import evaluate
import collections

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m469.0/469.0 KB[0m [31m25.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 KB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m17.1 MB

In [None]:
from tensorflow.config.experimental import list_physical_devices, set_memory_growth
from tensorflow import device

gpus = list_physical_devices('GPU')
for gpu in gpus: 
    set_memory_growth(gpu, True)

## 1. Load dataset & tokenizer
The dataset for fine-tuning is divided into 4 parquet files - 3 for the training and one for the testing. <br>
**Goals:** 
- Load the datasets and check their format
- Set one of the train parquet file as the validation dataset
- Remove from all the samples with 0 answer and reduce the size of the train datasets in order to be able to fine-tune the model within a reasonable time (task was too large when taking the full dataset)
- Put them all together under the DatasetDict "dd" 
- Check that as expected the training & validation sets only have 1 answer while the testing set has 1 to several answers

### 1.1 Load datasets

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
#Load datasets
import datasets

dataset_train1 = load_dataset("parquet", data_files={'train': "/content/drive/MyDrive/Colab Notebooks/cuad-train-00000-of-00003.parquet"})
dataset_train2 = load_dataset("parquet", data_files={'train': "/content/drive/MyDrive/Colab Notebooks/cuad-train-00001-of-00003.parquet"})
dataset_val = load_dataset("parquet", data_files={'train': "/content/drive/MyDrive/Colab Notebooks/cuad-train-00002-of-00003.parquet"})
dataset_test = load_dataset("parquet", data_files={'train': "/content/drive/MyDrive/Colab Notebooks/cuad-test.parquet"})

# Remove examples with no answers from the training set
dataset_train1['train'] = dataset_train1['train'].filter(lambda example: len(example['answers']['text']) > 0)
dataset_train2['train'] = dataset_train2['train'].filter(lambda example: len(example['answers']['text']) > 0)
dataset_val['train'] = dataset_val['train'].filter(lambda example: len(example['answers']['text']) > 0)
dataset_test['train'] = dataset_test['train'].filter(lambda example: len(example['answers']['text']) > 0)

# Select a subset of the dataset
dataset_train1 = dataset_train1["train"].select(range((1800)))
dataset_val = dataset_val["train"].select(range((600)))
dataset_test = dataset_test["train"].select(range((600)))
dd = dataset_train1
dd = datasets.DatasetDict({"train":dd, "val":dataset_val, "test":dataset_test})

# Full dataset
#dd = concatenate_datasets([dataset_train1['train'], dataset_train2['train']])
#dd = datasets.DatasetDict({"train":dd, "val":dataset_val['train'], "test":dataset_test['train']})



Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-bc512afd54342576/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-bc512afd54342576/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-efa6f17067b29d37/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-efa6f17067b29d37/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-0ef724502975fe0e/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-0ef724502975fe0e/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-5b16a32428edb56f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-5b16a32428edb56f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Filter:   0%|          | 0/8000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/7000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/7450 [00:00<?, ? examples/s]

Filter:   0%|          | 0/4182 [00:00<?, ? examples/s]

In [None]:
#During training, there is only one possible answer.
dd["train"].filter(lambda x: len(x["answers"]["text"]) > 1)

Loading cached processed dataset at /Users/manon/.cache/huggingface/datasets/parquet/default-d60745dcab136b43/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-1b26a51939286176.arrow


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [None]:
#During testing, there are several possible answers for each sample.
dd["test"].filter(lambda x: len(x["answers"]["text"]) > 1)

Loading cached processed dataset at /Users/manon/.cache/huggingface/datasets/parquet/default-9684758fc536b01e/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-d5b4d917076ff7a6.arrow


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 213
})

In [None]:
print(dd["test"][1]["answers"])

{'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'answer_start': [143, 49]}


### 1.2 Import tokenizer

**Tokenizer class:** The ``AutoTokenizer`` class will grab the proper tokenizer class in the library based on the model checkpoint name, and can be used directly with any checkpoint. With the pre-trained model we are using, the tokenizer we will use is the ``RobertaTokenizerFast`` which is a tokenizer designed for the RoBERTa model that is optimized for question answering tasks. It uses byte-level Byte-Pair Encoding (BPE) for subword tokenization. <br>
**!Need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained!** <br>
<br>
**Fast tokenizer:** These are optimized for speed and can handle large amounts of text efficiently. It can quickly break down the input text into tokens, which can then be fed into the model for fine-tuning. This can significantly speed up the fine-tuning process and make it more practical to train models on large datasets. They also have other advantages such as support for a wide range of languages, subword tokenization for handling out-of-vocabulary words, and the ability to handle special tokens such as masks and segment IDs.

In [7]:
#The tokenizer automatically chosen is the RobertaTokenizerFast
tokenizer = AutoTokenizer.from_pretrained("saibo/legal-roberta-base")
print(type(tokenizer))

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

<class 'transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast'>


In [None]:
#Checking that the tokenizer is a fast tokenizer
tokenizer.is_fast

True

In [None]:
#This tokenizer does not apply any normalization (general cleanup - e.g. removing needless whitespace, lowercasing, and/or removing accents)
#print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

In [None]:
#Example of how the tokenizer works. The "Ġ" denotes spaces. This notation is proper to the tokenizer we are using.
print(tokenizer.tokenize('I love salad'))

['I', 'Ġlove', 'Ġsalad']


In [None]:
#Splits on whitespace and punctuation, but keeps the spaces and replace them with a Ġ symbol, enabling it to recover the original spaces if we decode the tokens
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('Ġhow', (6, 10)),
 ('Ġare', (10, 14)),
 ('Ġ', (14, 15)),
 ('Ġyou', (15, 19)),
 ('?', (19, 20))]

Checking that by passing the question and the context to the tokenizer, it inserts the special tokens to get a feature with the format: ``<s> question </s></s> context </s>``

In [None]:
context = dd["train"][0]["context"]
question = dd["train"][0]["question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

Token indices sequence length is longer than the specified maximum sequence length for this model (23284 > 512). Running this sequence through the model will result in indexing errors


'<s>Highlight the parts (if any) of this contract related to "Document Name" that should be reviewed by a lawyer. Details: The name of the contract</s></s>EXHIBIT 10.6\n\n                              DISTRIBUTOR AGREEMENT\n\n         THIS  DISTRIBUTOR  AGREEMENT (the  "Agreement")  is made by and between Electric City Corp.,  a Delaware  corporation  ("Company")  and Electric City of Illinois LLC ("Distributor") this 7th day of September, 1999.\n\n                                    RECITALS\n\n         A. The  Company\'s  Business.  The Company is  presently  engaged in the business  of selling an energy  efficiency  device,  which is  referred to as an "Energy  Saver"  which may be improved  or  otherwise  changed  from its present composition (the "Products").  The Company may engage in the business of selling other  products  or  other  devices  other  than  the  Products,  which  will be considered  Products if Distributor  exercises its options pursuant to Section 7 hereof.\n\n 

## 2. Pre-process the train and validation dataset
**Goal:** Prepare both the train and validation datasets to have the correct format for training the model. All the tokenization and pre-processing steps for each dataset will be wrapped in a function.

<u>**1. TOKENIZE & SETTINGS**</u> <br>
<br>
**Goal:** Convert the raw text to numbers (encoding) as models can only process numbers. The tokenizer used should give the most meaningful representation — the one that makes the most sense to the model and the smallest one. <br>
**Steps:** Encoding is done in a two-step process: <br>
1. The tokenization
2. The conversion to input IDs (can use the decoding method to go from the ``input_ids`` back to the text.)

Most transformers models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. For this reason, we can set up parameters when creating the inputs from the tokenizer that will truncate the context into smaller sequences. This will create different inputs each with the question in its entirety and a sequence of the context (together called feature). This will create 3 different kind of inputs: 
- Answer is not included in the context: in this case we will set the start and end position of the answer token to ``label = (0,0)`` (so we predict the ``</s>`` token). 
- Answer has been truncated: in this case we will set the start and end position of the answer token to ``label = (0,0)`` (so we predict the ``</s>`` token).
- Answer fully in the context: in this case we will set the start and end position of the answer token to ``label = (start_position, end_position)``.

To limit the possibility of having the answer truncated on 2 different inputs, we can set the option to make the sequences overlap by a certain amount of tokens (stride parameter). <br>
<br>
**1.a. <u>Tokenizer settings**</u><br>
**Max length:**  allows to truncate the sequences of each context by specifying its maximum token length. <br>
**Truncation:** "only_second" makes sure we only truncate the context (second position) and not the question when the text is > max_length. <br>
**Stride:** number of overlapping tokens between two successive chunks. <br>
**Return_overflowing_tokens:** maps each feature to the sample it originated from since one sample can give several features when truncating option is on. This returns the ``overflow_to_sample_mapping`` key in the inputs. <br>
**Return_offsets_mapping:** instructs the tokenizer to return a list of offset tuples alongside the list of tokens. Each offset tuple corresponds to a token and contains the start and end character offsets of that token in the original input text. This returns the ``offset_mapping`` key in the inputs. <br>
**Padding**: tensors need to be of rectangular shape. So if you encoded several sentences, you will have a list of lists of numbers and each inside list needs to be of same length. Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values. <br>
<br>
**1.b. <u>Inputs keys**</u><br>
**Input ids:** encoded text ids as input to the model - list of token ids for each feature ``[ [20, 1045, 30], [20, 45, 576, 4802] ]`` <br>
**Attention mask:** list with the exact same shape as the input IDs list, filled with 0s and 1s: 1s indicate the corresponding tokens should be taken into consideration, and 0s indicate the corresponding tokens should not be taken into consideration (e.g. padding ids) ``[[1,1,1,1],[1,1,1,1],[1,1,0,0]]`` <br>
**Offset mapping:** example of a dataset with 2 features (lists) each containing 6 tokens (tuples) with 3 (0,0) tuples to identify the special tokens. Each tuple is the span of text corresponding to each token. ``[[(0,0), (0,3), (4,9), (0,0), (10,15), (0,0)], [(0,0), (16,19), (0,0), (20,25), (26,30),(0,0)]]`` <br>
**Overflow to sample mapping:** example dataset containing 4 samples with 19 features ``[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]``<br>

<br>

<u>**2. GENERATE LABELS FOR QUESTION'S ANSWER**</u> <br>
<br>
**Goal:** get the start and end token positions of the tokens corresponding to the answer inside the context by adding the fields ``["start_positions"]`` and ``["end_positions"]`` to the inputs <br>
**Steps:** For each feature,
1. Find the start and end character of the answer. Mark them as 0 if there is no answer.
2. Find the start and end token position of the context in the input IDs by using the inputs.sequence_ids() function.
3. Get the start and end token position of the answer<br>
    3.a. If the answer is not fully inside the context, label it (0, 0)<br>
    3.b. Otherwise it's the start and end token positions <br>

The offset_mapping allows to map the character to the token and the overflow_to_sample_mapping allows to map each feature to its sample and thus the answer.

In [None]:
def preprocess_train_function(examples):
    questions = [q.strip() for q in examples["question"]] #removing unecessary spaces around the question

    #Getting the input with format dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=256,
        truncation="only_second",
        stride=64,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping") #extract the offset_mapping inputs key
    sample_map = inputs.pop("overflow_to_sample_mapping") #extract the overflow_to_sample_mapping inputs key
    answers = examples["answers"] #extract the answers
    start_positions = []
    end_positions = []
    
    for i, offset in enumerate(offset_mapping): #For each feature i (offset is the list of each token positions) -> [(0, [(0,3), (4,7), (8,11)]), (1,[(0,3), (4,7), (8,11)]), ...] -> 
        sample_idx = sample_map[i] #position of the sample to which the feature belongs to -> 0
        answer = answers[sample_idx] #answer corresponding to the feature -> [{text_answer0: "", answer_start0: ::}]
        # 1. Find the start and end character of the answer
        if len(answer["answer_start"]) > 0: #if there is at least one answer
          start_char = answer["answer_start"][0] #get the answer start character position
          end_char = answer["answer_start"][0] + len(answer["text"][0]) #get the answer end character position
        else: #if there is no answer
            start_char = 0 #set the answer start character as 0
            end_char = 0 #set the answer end character as 0
        sequence_ids = inputs.sequence_ids(i) # maps each token of the feature to whether it is a special character (None), the question (0) or the answer (1) -> [None, 0,0,0, None, None, 1,1,1,1,1, None]

        # 2. Find the start and end token position of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx #context start token position
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1 #context end token position

        # 3. Get the start and end token position of the answer
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            # 3.a. If the answer is not fully inside the context, label it (0, 0)
            start_positions.append(0)
            end_positions.append(0)
        else:
            # 3.b. Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char: #from the start character of the answer, find the answer start token position
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char: #from the end character of the answer, find the answer end token position
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

From 1800 samples, the tokenization of the train set produces 247843 features.

In [None]:
#We’re using batched=True in our call to map so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.
tokenized_dd_train = dd['train'].map(preprocess_train_function, batched=True, batch_size=4, remove_columns=dd["train"].column_names)
len(dd['train']), len(tokenized_dd_train)

Loading cached processed dataset at /Users/manon/.cache/huggingface/datasets/parquet/default-d60745dcab136b43/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-30d48357cb1f96bd.arrow


(1800, 247843)

From 600 samples, the tokenization of the train set produces 85517 features.

In [None]:
#Tokenizing the validation set
tokenized_dd_val = dd['val'].map(preprocess_train_function, batched=True, batch_size=4, remove_columns=dd["val"].column_names)
len(dd['val']), len(tokenized_dd_val)

  0%|          | 0/150 [00:00<?, ?ba/s]

(600, 85517)

In [None]:
tokenized_dd_train

Dataset({
    features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 247843
})

## 3. Fine-tune the model and upload it to Hugging Faces
**Goal:** Re-train the RobertaLegal model on our CUAD dataset and upload the new model to Hugging Faces.

### 3.1 Train

In [None]:
#Import the pre-trained RobertaLegal model
model = AutoModelForQuestionAnswering.from_pretrained("saibo/legal-roberta-base")

Some weights of the model checkpoint at saibo/legal-roberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.decoder.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at saibo/legal-roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN thi

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model", #output directory where the model predictions and checkpoints will be written
    save_strategy="epoch", #save the model at the end of each epoch
    learning_rate=2e-5, #initial learning rate for AdamW optimizer
    per_device_train_batch_size=4, #batch size per GPU for training.
    per_device_eval_batch_size=4, #batch size per GPU for evaluation
    num_train_epochs=5,
    weight_decay=0.01, #weight decay to apply to all layers except all bias and LayerNorm weights in AdamW optimizer
    fp16=True, #enable mixed-precision training to speed up the training on a recent GPU
    logging_steps=1000, #Number of update steps between two logs
    save_steps=1000, #Number of updates steps before two checkpoint saves
)

trainer = Trainer(
    model=model, #pre-trained model
    args=training_args, #parameters
    train_dataset=tokenized_dd_train, #pre-processed train dataset
    eval_dataset=tokenized_dd_val, #pre-processed validation dataset
    tokenizer=tokenizer, #RobertaFast tokenizer
)

with device("gpu:0"):
    trainer.train()

Step,Training Loss
1000,0.2447
2000,0.1598
3000,0.1511
4000,0.1556
5000,0.1323
6000,0.1487
7000,0.1042
8000,0.1275
9000,0.1213
10000,0.1335


### 3.2 Upload to HuggingFace

In [None]:
from huggingface_hub import notebook_login
notebook_login()
model.push_to_hub("nlp_roberta_legal")

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Configuration saved in /tmp/tmpzoqdgwjq/config.json
Model weights saved in /tmp/tmpzoqdgwjq/pytorch_model.bin
Uploading the following files to arturo7531/nlp_roberta_legal: config.json,pytorch_model.bin


Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

pytorch_model.bin:   0%|          | 0.00/496M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/arturo7531/nlp_roberta_legal/commit/615809681652d2f3dbbdeb6cdbbd6b955668f8e9', commit_message='Upload RobertaForQuestionAnswering', commit_description='', oid='615809681652d2f3dbbdeb6cdbbd6b955668f8e9', pr_url=None, pr_revision=None, pr_num=None)

## 4. Evaluate

### 4.1 Post-processing

To `evaluate the model`, we will need to post-process the model predictions into spans of text in the original examples; once we have done that, the **Squad** metric from the Datasets library will evaluate the performance of the model.

The model will output logits for the start and end positions of the answer in the input IDs. The **post-processing step** will be as follows:
1. We will mask the start and end logits corresponding to tokens outside of the context.
2. We will attribute a score to the (start_token, end_token) pair corresponding to the highest n_best logits (with n_best=20).These scores will be logit scores, and will be obtained by taking the sum of the start and end logits.
4. Then we will check the performance of our fine-tuned model.

Now, we need to **find the predicted answer** for each example in our small_eval_set. One example may have been split into several features in eval_set, so the first step is to map each example in small_eval_set to the corresponding features in eval_set.

We’ll look at the logit scores for the n_best start logits and end logits, excluding positions that give:

1. An answer that wouldn’t be inside the context
2. An answer with negative length
3. An answer that is too long (we limit the possibilities at max_answer_length=30)

Once we have all the scored possible answers for one example, we just **pick the one with the best logit score**.

In [23]:
def preprocess_val_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=256,
        truncation="only_second",
        stride=64,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping") #extract the overflow_to_sample_mapping inputs key
    example_ids = []

    for i in range(len(inputs["input_ids"])): #For each feature
        sample_idx = sample_map[i] #position of the sample to which the feature belongs to -> 0
        example_ids.append(examples["id"][sample_idx]) #map the features to the sample id they correspond [list]

        sequence_ids = inputs.sequence_ids(i) # maps each token of the feature to whether it is a special character (None), the question (0) or the context (1) -> [None, 0,0,0, None, None, 1,1,1,1,1, None]
        offset = inputs["offset_mapping"][i] #list of token tuples with its start and end position
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset) #set the offsets corresponding to the question and special character to None, set to o for the ones corresponding to the answer
        ]

    inputs["example_id"] = example_ids #add a key to the input with the mapping of the features to the corresponding sample id
    return inputs

In [24]:
small_eval_set = dataset_test.select(range(1))

eval_set = small_eval_set.map(
    preprocess_val_function,
    batched=True,
    remove_columns=small_eval_set.column_names,
)



In [25]:
Final_Model = AutoModelForQuestionAnswering.from_pretrained("arturo7531/nlp_roberta_legal")
model = Trainer(model=Final_Model)
predictions, _, _ = model.predict(eval_set)
start_logits, end_logits = predictions

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--arturo7531--nlp_roberta_legal/snapshots/615809681652d2f3dbbdeb6cdbbd6b955668f8e9/config.json
Model config RobertaConfig {
  "_name_or_path": "arturo7531/nlp_roberta_legal",
  "architectures": [
    "RobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "total_flos": 2733585826172737536,
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file

In [26]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)

import numpy as np

# By taking the n_best highest scores, the code is effectively limiting the number of candidate answer spans that need to be considered to improve the efficiency of the model.
n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["context"]
    answers = []

    for feature_index in example_to_features[example_id]:
        #Extracting start logits, end logits, and token offsets for the current feature index.
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]

        #Sorting the start logits and end logits in descending order and extract the indices of the n_best highest-scoring logits.
        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
        
        #The loops check whether the start and end indices correspond to valid token offsets in the context text and whether the resulting answer span is within the maximum length limit. 
        #If both conditions are met, a dictionary is appended to the answers list, containing the text of the answer span and the sum of the start and end logits as the logit_score.
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # Skip answers with a length that is either < 0 or > max_answer_length.
                if (
                    end_index < start_index
                    or end_index - start_index + 1 > max_answer_length
                ):
                    continue

                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
    

### 4.2 Evaluation using the `SQUAD` metric

For evaluation purposes, we will use the **Squad** metric from **evaluate** library that was designed specifically for Q&A models evaluation. It takes two inputs :

1. *predictions*, a list of question-answer dictionaries with the following key-values:
   - id: the id of the question-answer pair as given in the references.
   - prediction_text: a list of possible texts for the answer, as a list of strings depending on a threshold on the confidence probability of each prediction.

2. *references*: a list of question-answer dictionaries with the following key-values:
   - id: the id of the question-answer pair (the same as above).
   - answers: a dictionary in the CUAD dataset format with the following keys:
     - text: a list of possible texts for the answer, as a list of strings.
     - answer_start: a list of start positions for the answer, as a list of ints.

The SQUAD metric computes **two scores**: Exact Match and F1 score.
1. *exact_match*: the range is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
2. *f1*: The harmonic mean of the precision and recall (see F1 score for more information). Its range is between 0 and 100 – its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 100, which means perfect precision and recall.



In [27]:
# Loading the metric
import evaluate
metric = evaluate.load("squad")


This metric expects the predicted answers in the format of a list of dictionaries with one key for the ID of the example and one key for the predicted text and the theoretical answers in the format of a list of dictionaries with one key for the ID of the example and one key for the possible answers.

In [28]:
theoretical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in small_eval_set
]

In [29]:
#Checking results
print(predicted_answers[0])
print(theoretical_answers[0])

{'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Document Name', 'prediction_text': 'SUPPLY CONTRACT Contract'}
{'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Document Name', 'answers': {'text': ['SUPPLY CONTRACT'], 'answer_start': [14]}}


In [30]:
metric.compute(predictions=predicted_answers, references=theoretical_answers)

{'exact_match': 0.0, 'f1': 80.0}

### 4.3 Creating the `compute_metric` function
We will now create the **compute_metrics** function that will combine all the above steps and that we will use with **Trainer**.



In [31]:
from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

#### Evaluating performance of the **fine-tuned** model

In [32]:
small_eval_set = dataset_test

eval_set = small_eval_set.map(
    preprocess_val_function,
    batched=True,
    remove_columns=small_eval_set.column_names,
)

Final_Model = AutoModelForQuestionAnswering.from_pretrained("arturo7531/nlp_roberta_legal")
model = Trainer(model=Final_Model)
predictions, _, _ = model.predict(eval_set)
start_logits, end_logits = predictions

compute_metrics(start_logits, end_logits, eval_set, small_eval_set)

Map:   0%|          | 0/600 [00:00<?, ? examples/s]

KeyboardInterrupt: ignored

## 5. Inference

**Goal:** Build a pipeline to output an answer based on a question and a document.

### 5.1. Wrap the context and question provided in a dataset

In [35]:
def transformer(context, question):
  new_review = {"context" : [context] , "question": [question], "id": [1]}
  df = pandas.DataFrame(data = new_review)
  small_eval_set = Dataset.from_pandas(df)
  return small_eval_set

### 5.2. Pre-process the inputs

**Goal:** Tokenize the dataset with the same tokenizer and setting as for the train dataset. Pre-process the dataset so that it has the right format to be inputted for the model training. <br>
- Tokenize
- Add the field ["example_id"] to the inputs - "example_id" is a list containing the sample id corresponding to each feature
- Modify the content of offset_mapping - "offset_mapping" is a list marking the tokens belonging to the context with o <br>

**Steps:** For each feature,<br>
1. Use the overflow_to_sample_mapping to map the features to the sample id they correspond to and put this mapping into a list
2. Use the sequence_ids method to modify the offset_maping in the following way:<br>
    2.a. Offsets corresponding to the question and special character are set to None <br>
    2.b. Offsets corresponding to the context are set to o <br>


In [36]:
def preprocess_val_function(examples):
  questions = [q.strip() for q in examples["question"]]
  inputs = tokenizer(
      questions,
      examples["context"],
      max_length=256,
      truncation="only_second",
      stride=64,
      return_overflowing_tokens=True,
      return_offsets_mapping=True,
      padding="max_length",
  )

  sample_map = inputs.pop("overflow_to_sample_mapping") #extract the overflow_to_sample_mapping inputs key
  example_ids = []

  for i in range(len(inputs["input_ids"])): #For each feature
      sample_idx = sample_map[i] #position of the sample to which the feature belongs to -> 0
      example_ids.append(examples["id"][sample_idx]) #map the features to the sample id they correspond [list]

      sequence_ids = inputs.sequence_ids(i) # maps each token of the feature to whether it is a special character (None), the question (0) or the context (1) -> [None, 0,0,0, None, None, 1,1,1,1,1, None]
      offset = inputs["offset_mapping"][i] #list of token tuples with its start and end position
      inputs["offset_mapping"][i] = [
          o if sequence_ids[k] == 1 else None for k, o in enumerate(offset) #set the offsets corresponding to the question and special character to None, set to o for the ones corresponding to the answer
      ]

  inputs["example_id"] = example_ids #add a key to the input with the mapping of the features to the corresponding sample id
  return inputs

### 5.3. Create the compute_metrics function to get the predicted answers

In [37]:
n_best = 20
max_answer_length = 50

In [38]:
def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append({"id": example_id, "predicted_text": best_answer["text"]})
        else:
          predicted_answers.append({"id": example_id, "prediction_text": ""})

    return predicted_answers 


### 5.4. Test the pipeline

In [39]:
question = 'Highlight the parts (if any) of this contract related to "Governing Law" that should be reviewed by a lawyer. Details: Which state/country\'s law governs the interpretation of the contract?'

context = 'Exhibit 10.16 SUPPLY CONTRACT Contract No: Date: The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd. ADD: Tel No. : Fax No. : The seller: ADD: The Contract is concluded and signed by the Buyer and Seller on , in Hong Kong. 1. General provisions 1.1 This is a framework agreement, the terms and conditions are applied to all purchase orders which signed by this agreement (hereinafter referred to as the "order"). 1.2 If the provisions of the agreement are inconsistent with the order, the order shall prevail. Not stated in order content will be subject to the provisions of agreement. Any modification, supplementary, give up should been written records, only to be valid by buyers and sellers authorized representative signature and confirmation, otherwise will be deemed invalid. 2. The agreement and order 2.1 During the validity term of this agreement, The buyer entrust SHENZHEN YICHANGTAI IMPORT AND EXPORT TRADE CO., LTD or SHENZHEN LEHEYUAN TRADING CO, LTD (hereinafter referred to as the "entrusted party" or "YICHANGTAI" or "LEHEYUAN"), to purchase the products specified in this agreement from the seller in the form of orders. 2.2 The seller shall be confirmed within three working days after receipt of order. If the seller finds order is not acceptable or need to modify, should note entrusted party in two working days after receipt of the order, If the seller did not confirm orders in time or notice not accept orders or modifications, the seller is deemed to have been accepted the order. The orders become effective once the seller accepts, any party shall not unilaterally cancel the order before the two sides agreed . 2.3 If the seller puts forward amendments or not accept orders, the seller shall be in the form of a written notice to entrusted party, entrusted party accept the modified by written consent, the modified orders to be taken effect. 2.4 Seller\'s note, only the buyer entrust the entrusted party issued orders, the product delivery and payment has the force of law.\n\n1\n\nSource: LOHA CO. LTD., F-1, 12/9/2019\n\n\n\n\n\n3. GOODS AND COUNTRY OF ORIGIN: 4. Specific order: The products quantity, unit price, specifications, delivery time and transportation, specific content shall be subject to the purchase order issued by entrusted party which is commissioned the buyer. 5. PACKING: To be packed in new strong wooden case(s) /carton(s), suitable for long distance transportation and for the change of climate, well protected against rough handling, moisture, rain, corrosion, shocks, rust, and freezing. The seller shall be liable for any damage and loss of the commodity, expenses incurred on account of improper packing, and any damage attributable to inadequate or improper protective measures taken by the seller in regard to the packing. One full set of technical All wooden material of shipping package must be treated as the requirements of Entry-Exit Inspection and Quarantine Bureau of China, by the agent whom is certified by the government where the goods is exported. And the goods must be marked with the IPPC stamps, which are certified by the government agent of Botanical-Inspection and Quarantine Bureau. 6. SHIPPING MARK: The Sellers shall mark on each package with fadeless paint the package number, gross weight, net weight, measurements and the wordings: "KEEP AWAY FROM MOISTURE","HANDLE WITH CARE" "THIS SIDE UP" etc. and the shipping mark on each package with fadeless paint. 7. DATE OF SHIPMENT: According to specific order by YICHANGTAI or LEHEYUAN. 8. PORT OF SHIPMENT:\n\n2\n\nSource: LOHA CO. LTD., F-1, 12/9/2019\n\n\n\n\n\n9. PORT OF DESTINATION: SHENZHEN, GUANGDONG, CHINA 10. INSURANCE: To be covered by the Seller for 110% invoice value against All Risks and War Risk. 11. PAYMENT: Under Letter of Credit or T/T: Under the Letter of Credit: The Buyer shall open an irrevocable letter of credit with the bank within 30 days after signing the contract, in favor of the Seller, for 100% value of the total contract value. The letter of credit should state that partial shipments are allowed. The Buyer\'s agent agrees to pay for the goods in accordance with the actual amount of the goods shipped. 80% of the system value being shipped will be paid against the documents stipulated in Clause 12.1. The remaining 20% of the system value being shipped will be paid against the documents stipulated in Clause 12.2. The Letter of Credit shall be valid until 90 days after the latest shipment is effected. Under the T/T The trustee of the buyer remitted the goods to the seller by telegraphic transfer in batches as agreed upon after signing each order. 12. DOCUMENTS: 12.1 (1) Invoice in 5 originals indicating contract number and Shipping Mark (in case of more than one shipping mark, the invoice shall be issued separately). (2) One certificate of origin of the goods. (3) Four original copies of the packing list. (4) Certificate of Quality and Quantity in 1 original issued by the agriculture products base. (5) One copy of insurance coverage (6) Copy of cable/letter to the transportation department of Buyer advising of particulars as to shipment immediately after shipment is made.\n\n3\n\nSource: LOHA CO. LTD., F-1, 12/9/2019\n\n\n\n\n\n12.2 (1) Invoice in 3 originals indicating contract number and L/C number. (2) Final acceptance certificate signed by the Buyer and the Seller. 13. SHIPMENT: CIP The seller shall contract on usual terms at his own expenses for the carriage of the goods to the agreed point at the named place of destination and bear all risks and expenses until the goods have been delivered to the port of destination. The Sellers shall ship the goods within the shipment time from the port of shipment to the port of destination. Transshipment is allowed. Partial Shipment is allowed. In case the goods are to be dispatched by parcel post/sea-freight, the Sellers shall, 3 days before the time of delivery, inform the Buyers by cable/letter of the estimated date of delivery, Contract No., commodity, invoiced value, etc. The sellers shall, immediately after dispatch of the goods, advise the Buyers by cable/letter of the Contract No., commodity, invoiced value and date of dispatch for the Buyers. 14. SHIPPING ADVICE: The seller shall within 72 hours after the shipment of the goods, advise the shipping department of buyer by fax or E-mail of Contract No., goods name, quantity, value, number of packages, gross weight, measurements and the estimated arrival time of the goods at the destination. 15. GUARANTEE OF QUALITY: The Sellers guarantee that the commodity hereof is complies in all respects with the quality and specification stipulated in this Contract. 16. CLAIMS: Within 7 days after the arrival of the goods at destination, should the quality, specification, or quantity be found not in conformity with the stipulations of the Contract except those claims for which the insurance company or the owners of the vessel are liable, the Buyers, on the strength of the Inspection Certificate issued by the China Commodity Inspection Bureau, have the right to claim for replacement with new goods, or for compensation, and all the expenses (such as inspection charges, freight for returning the goods and for sending the replacement, insurance premium, storage and loading and unloading charges etc.) shall be borne by the Sellers. The Certificate so issued shall be accepted as the base of a claim. The Sellers, in accordance with the Buyers\' claim, shall be responsible for the immediate elimination of the defect(s), complete or partial replacement of the commodity or shall devaluate the commodity according to the state of defect(s). Where necessary, the Buyers shall be at liberty to eliminate the defect(s) themselves at the Sellers\' expenses. If the Sellers fail to answer the Buyers within one weeks after receipt of the aforesaid claim, the claim shall be reckoned as having been accepted by the Sellers.\n\n4\n\nSource: LOHA CO. LTD., F-1, 12/9/2019\n\n\n\n\n\n17. FORCE MAJEURE: The Sellers shall not be held responsible for the delay in shipment or non-delivery, of the goods due to Force Majeure, which might occur during the process of manufacturing or in the course of loading or transit. The Sellers shall advise the Buyers immediately of the occurrence mentioned above and within fourteen days thereafter, the Sellers shall send by airmail to the Buyers a certificate of the accident issued by the competent government authorities, Chamber of Commerce or registered notary public of the place where the accident occurs as evidence thereof. Under such circumstances the Sellers, however, are still under the obligation to take all necessary measures to hasten the delivery of the goods. In case the accident lasts for more than 10 weeks, the Buyers shall have the right to cancel the Contract. 18. LATE DELIVERY AND PENALTY: Should the Sellers fail to make delivery on time as stipulated in the Contract, with exception of Force Majeure causes specified in Clause 17 of this Contract, the Buyers shall agree to postpone the delivery on condition that the Sellers agree to pay a penalty which shall be deducted by the paying bank from the payment. The penalty, however, shall not exceed 5% of the total value of the goods involved in the late delivery. The rate of penalty is charged at 0.5% for every seven days, odd days less than seven days should be counted as seven days. In case the Sellers fail to make delivery ten weeks later than the time of shipment stipulated in the Contract, the Buyers have the right to cancel the contract and the Sellers, in spite of the cancellation, shall still pay the aforesaid penalty to the Buyers without delay, the seller should refund the money received and pay the 30% of the total goods price of the penalty 19. ARBITRATION: All disputes in connection with this Contract or the execution thereof shall be settled friendly through negotiations. In case no settlement can be reached, the case may then be submitted for arbitration to the Foreign Economic and Trade Arbitration Committee of the China Beijing Council for the Promotion of International Trade in accordance with its Provisional Rules of Procedures by the said Arbitration Committee. The Arbitration shall take place in Beijing and the decision of the Arbitration Committee shall be final and binding upon both parties; neither party shall seek recourse to a law court nor other authorities to appeal for revision of the decision. Arbitration fee shall be borne by the losing party. 20. This final price is the confidential information. Dissemination, distribution or duplication of this price is strictly prohibited.\n\n5\n\nSource: LOHA CO. LTD., F-1, 12/9/2019\n\n\n\n\n\n21. Law application It will be governed by the law of the People\'s Republic of China ,otherwise it is governed by United Nations Convention on Contract for the International Sale of Goods. 22. <<Incoterms 2000>> The terms in the contract are based on (INCOTERMS 2000) of the International Chamber of Commerce. 23. The Contract is valid for 5 years, beginning from and ended on . This Contract is made out in three originals in both Chinese and English, each language being legally of the equal effect. Conflicts between these two languages arising there from, if any, shall be subject to Chinese version. One copy for the Sellers, two copies for the Buyers. The Contract becomes effective after signed by both parties. THE BUYER: THE SELLER: SIGNATURE: SIGNATURE: 6\n\nSource: LOHA CO. LTD., F-1, 12/9/2019'

In [40]:
small_eval_set = transformer(context, question)

eval_set = small_eval_set.map(
    preprocess_val_function,
    batched=True,
    remove_columns=small_eval_set.column_names,
)

Final_Model = AutoModelForQuestionAnswering.from_pretrained("arturo7531/nlp_roberta_legal")
model = Trainer(model=Final_Model)
predictions, _, _ = model.predict(eval_set)
start_logits, end_logits = predictions

compute_metrics(start_logits, end_logits, eval_set, small_eval_set)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--arturo7531--nlp_roberta_legal/snapshots/615809681652d2f3dbbdeb6cdbbd6b955668f8e9/config.json
Model config RobertaConfig {
  "_name_or_path": "arturo7531/nlp_roberta_legal",
  "architectures": [
    "RobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "total_flos": 2733585826172737536,
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file

  0%|          | 0/1 [00:00<?, ?it/s]

[{'id': 1,
  'predicted_text': "It will be governed by the law of the People's Republic of China ,otherwise it is governed by United Nations Convention on Contract for the International Sale of Goods."}]