<a href="https://colab.research.google.com/github/NLPNice/final-project/blob/main/train_testBART_fp16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!nvidia-smi

Thu Mar 10 09:02:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install -q transformers datasets torchinfo rouge_score git+https://github.com/google-research/bleurt.git

In [None]:
from google.colab import drive

drive.flush_and_unmount()

Drive not mounted, so nothing to flush and unmount.


In [None]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
RANDOM_SEED = 42

# Dataset tokenization

We tokenize the whole dataset with the `datasets` library, which integrates seamlessly with Hugging Face `transformers`.

In [None]:
from datasets import load_dataset, load_from_disk
import os

GDRIVE_DATASET_PATH = "/gdrive/MyDrive/university/tokenized_dataset"
SUMMARIES_CLAIMS_CSV_PATH = "/gdrive/MyDrive/university/summaries_claims.csv"

## Memory concerns

Unfortunately, fine-tuning such a large model requires a lot of GPU memory. 
The Google Colab limit is $\approx$ 12 GB, which is not enough for the whole dataset.

In [None]:
import sys
import pandas as pd
import csv

csv.field_size_limit(sys.maxsize)
# load into memory for analysis
df = pd.read_csv(SUMMARIES_CLAIMS_CSV_PATH, engine="python")
# some descriptions are NaNs so let's drop them
df = df.dropna()


Let's check how many samples we could keep under different maximum lengths. Lengths are approximated here by counting whitespace-separated words rather than tokenizer tokens.

In [None]:
SIZES = [512, 1024, 2048, 4096]

summary_tokens = df["summaries"].apply(lambda x: len(x.split(" ")))
claims_tokens = df["claims"].apply(lambda x: len(x.split(" ")))

for size in SIZES:
  ok_summaries = summary_tokens <= size
  ok_claims = claims_tokens <= size  
  print(f"{ok_summaries.sum()} summaries have <= {size} tokens ({ok_summaries.sum() / len(summary_tokens) * 100:2.2f}%)")
  print(f"{ok_claims.sum()} claims have <= {size} tokens ({ok_claims.sum() / len(ok_claims) * 100:2.2f}%)")

11707 summaries have <= 512 tokens (44.97%)
25371 claims have <= 512 tokens (97.45%)
18100 summaries have <= 1024 tokens (69.52%)
25970 claims have <= 1024 tokens (99.75%)
23065 summaries have <= 2048 tokens (88.59%)
26026 claims have <= 2048 tokens (99.97%)
25311 summaries have <= 4096 tokens (97.22%)
26033 claims have <= 4096 tokens (99.99%)


With a maximum length of 512 we keep almost all of the claims ($97.45\%$).

For the summaries, however, we can't go that low without losing more than half of the samples: only $44.97\%$ of them fit in 512 tokens. Using 4096 tokens, the maximum length handled by BigBird, would keep nearly all of the data ($97.22\%$), but we don't have adequate resources for such a job.

A cutoff of 2048 would retain $\approx 88\%$ of the summaries. The tokenization cell below is stricter, though: with its current parameters it keeps only summaries with a length between 128 and 512 and claims with a length between 64 and 256.
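The coverage figures above are based on whitespace-separated word counts, which are only a proxy for the number of subword tokens a tokenizer would produce (typically somewhat more). A minimal sketch of the same computation on made-up data; the `coverage` helper and the sample texts are hypothetical and not part of the notebook:

```python
def coverage(lengths, cutoffs):
    """Return, for each cutoff, the fraction of lengths that do not exceed it."""
    total = len(lengths)
    return {c: sum(1 for n in lengths if n <= c) / total for c in cutoffs}

# Hypothetical texts standing in for the "summaries" column.
texts = [
    "a short summary",
    "a slightly longer summary of a patent",
    " ".join(["word"] * 600),
]

# Same whitespace split as in the analysis cell above.
lengths = [len(t.split(" ")) for t in texts]

result = coverage(lengths, [512, 1024])
for cutoff, fraction in result.items():
    print(f"{fraction * 100:.2f}% of the texts have <= {cutoff} words")
```

Re-running the analysis with the lengths returned by the actual tokenizer would give tighter numbers, at the cost of tokenizing the whole corpus first.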

In [None]:
#@title Tokenize data
from transformers import BartTokenizer

MIN_SUMMARY_LEN =  128#@param {type: "number"}
MIN_CLAIM_LEN =  64#@param {type: "number"}
SUMMARY_LEN = 512 #@param {type: "number"}
CLAIM_LEN = 256 #@param {type: "number"}
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
#tokenizer.pad_token = 0


# let's check if we can load the dataset from disk first.
# this saves us from tokenizing all the data again
if os.path.exists(GDRIVE_DATASET_PATH):
  dataset = load_from_disk(GDRIVE_DATASET_PATH)
  print("Dataset loaded")
else:
  from datasets import Dataset
  reduced_df = df[(claims_tokens <= CLAIM_LEN) & (summary_tokens <= SUMMARY_LEN) & (claims_tokens >= MIN_CLAIM_LEN) & (summary_tokens >= MIN_SUMMARY_LEN)]
  dataset = Dataset.from_pandas(reduced_df)

  # first let's rename the columns to the names the model expects
  dataset = dataset.rename_column("summaries", "input_ids") \
    .rename_column("claims", "decoder_input_ids") \
    .remove_columns("patentnumber") \
    .remove_columns("__index_level_0__")

  # even though we carefully preprocessed data some descriptions are still empty.
  # we will filter them out
  dataset = dataset.filter(lambda r: r["input_ids"] is not None)

  def encoder_tokenize_function(row):
    """
    Tokenize the summary into input_ids and attention_mask
    """
    return tokenizer(row["input_ids"], max_length=SUMMARY_LEN, padding="max_length", truncation=True)

  # tokenize the summaries
  dataset = dataset.map(encoder_tokenize_function, batched=True)

  def decoder_tokenize_function(row):
    """
    Tokenize claim into the expected output from the decoder 
    (decoder_input_ids and decoder_attention_mask)
    """
    tokenized = tokenizer(row["decoder_input_ids"], max_length=CLAIM_LEN, padding="max_length", truncation=True)

    return {
        "decoder_input_ids": tokenized["input_ids"],
        "decoder_attention_mask": tokenized["attention_mask"]
    }

  # tokenize the claim
  dataset = dataset.map(decoder_tokenize_function, batched=True)

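  def compute_labels(batch):
    """
    Compute labels based on decoder_input_ids where padding token is represented as -100
    """
    # use the tokenizer's pad id instead of hard-coding 0
    pad_id = tokenizer.pad_token_id
    # with batched=True the column is a list of sequences, so mask each sequence separately
    labels = [[-100 if t == pad_id else t for t in seq] for seq in batch["decoder_input_ids"]]
    return {"labels" : labels}

  dataset = dataset.map(compute_labels, batched=True)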

  # export the dataset to disk for future loading
  dataset.save_to_disk(GDRIVE_DATASET_PATH)
  print("Dataset computed and saved")

Dataset loaded
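The padding and label-masking steps of the cell above can be illustrated without a tokenizer. The following is a minimal sketch on toy token ids; `PAD_ID = 1`, the helper names and the sample ids are assumptions made for this illustration. It is also simplified: a real tokenizer keeps the end-of-sequence token when truncating, which this sketch does not.

```python
PAD_ID = 1           # assumed padding id for this illustration
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut or right-pad a sequence to max_length and build its attention mask."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

def mask_padding(batch, pad_id=PAD_ID):
    """Replace padding ids with IGNORE_INDEX in every sequence of a batch."""
    return [[IGNORE_INDEX if t == pad_id else t for t in seq] for seq in batch]

padded, mask = pad_or_truncate([0, 8, 9, 2], max_length=6)
labels = mask_padding([padded])
print(padded, mask, labels)
```

Masking the padded positions keeps them from contributing to the loss, so the model is not rewarded for predicting padding.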


With the suggested configuration we observed a throughput of $\approx 0.55 \ \text{it/s}$.

In [None]:
from datetime import timedelta

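def format_duration(seconds):
    # split a duration in seconds into whole hours, minutes and seconds
    h, rem = divmod(round(seconds), 3600)
    m, s = divmod(rem, 60)
    return h, m, s

# NOTE: multiplying by 0.55 treats the figure as seconds per sample.
# If 0.55 is a rate in iterations per second, divide by it instead.
time_per_epoch = len(dataset) * 0.55
total_time = time_per_epoch * 3
h, m, s = format_duration(total_time)

print(f"Approximately we would need {h}:{m}:{s} to train the whole dataset for 3 epochs")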

Approximately we would need 2:42:53 to train the whole dataset for 3 epochs
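The estimate depends on how the figure 0.55 is read. A small sketch comparing the two readings; the `estimate` helper is hypothetical, and 5923 is the sum of the split sizes printed in the "Dataset splits" section:

```python
from datetime import timedelta

def estimate(n_samples, epochs, rate, rate_is_per_second):
    """Estimate training time from a per-sample cost or from a throughput."""
    if rate_is_per_second:
        per_epoch = n_samples / rate   # rate given in samples per second
    else:
        per_epoch = n_samples * rate   # rate given in seconds per sample
    return timedelta(seconds=round(per_epoch * epochs))

n = 5923  # train + valid + test samples
as_cost = estimate(n, 3, 0.55, rate_is_per_second=False)
as_throughput = estimate(n, 3, 0.55, rate_is_per_second=True)
print("0.55 s per sample:", as_cost)
print("0.55 samples per s:", as_throughput)
```

Both readings also assume that one iteration processes one sample; with a batch size larger than 1 the number of iterations per epoch shrinks accordingly.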


# Dataset splits

We divide the overall dataset into the usual splits: training and testing, respectively $85\%$ and $15\%$ of the overall data.

We then hold out a further $10\%$ of the training data and use it for validation during training.

In [None]:
train_test = dataset.train_test_split(test_size=0.15, seed=RANDOM_SEED)
test_valid = train_test["train"].train_test_split(test_size=0.1, seed=RANDOM_SEED)
train_test["train"] = test_valid["train"]
train_test["valid"] = test_valid["test"]
dataset = train_test

# delete from memory unused values
del train_test
del test_valid

Loading cached split indices for dataset at /gdrive/MyDrive/university/tokenized_dataset/cache-e0fd9cde5288d88e.arrow and /gdrive/MyDrive/university/tokenized_dataset/cache-4c007ea6cf3b28eb.arrow
Loading cached split indices for dataset at /gdrive/MyDrive/university/tokenized_dataset/cache-c66f33cd0cf3a2c7.arrow and /gdrive/MyDrive/university/tokenized_dataset/cache-85006177df905995.arrow


In [None]:
print(f"Train: {len(dataset['train'])} samples")
print(f"Test: {len(dataset['test'])} samples")
print(f"Valid: {len(dataset['valid'])} samples")

Train: 4530 samples
Test: 889 samples
Valid: 504 samples
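The counts above follow from the two nested splits. A quick arithmetic check, assuming the held-out part of each split is rounded up (an assumption about the rounding, chosen because it reproduces the printed numbers); `split_sizes` is a hypothetical helper:

```python
import math

def split_sizes(n, test_size):
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    held_out = math.ceil(n * test_size)
    return n - held_out, held_out

total = 4530 + 889 + 504                          # all samples in the tokenized dataset
train_valid, test = split_sizes(total, 0.15)      # first split: 15% for testing
train, valid = split_sizes(train_valid, 0.10)     # second split: 10% of the rest

for name, size in [("train", train), ("valid", valid), ("test", test)]:
    print(f"{name}: {size} samples ({size / total * 100:.1f}% of the data)")
```

In terms of the full dataset this is roughly 76.5% for training, 8.5% for validation and 15% for testing.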


# Neural model



In [None]:
from transformers import BartForConditionalGeneration
from torchinfo import summary

batch_size = 4

model = BartForConditionalGeneration.from_pretrained(
    "/gdrive/MyDrive/university/BARTModelFineTune/checkpoint-500",
    #block_size=16,
    #num_random_blocks=3,
    #attention_type="block_sparse",
    use_cache=False) # the decoder cache is incompatible with gradient checkpointing
model.gradient_checkpointing_enable()
summary(model, dtypes=["torch.IntTensor"])

Layer (type:depth-idx)                                  Param #
BartForConditionalGeneration                            --
├─BartModel: 1-1                                        --
│    └─Embedding: 2-1                                   51,470,336
│    └─BartEncoder: 2-2                                 --
│    │    └─Embedding: 3-1                              (recursive)
│    │    └─BartLearnedPositionalEmbedding: 3-2         1,050,624
│    │    └─ModuleList: 3-3                             151,154,688
│    │    └─LayerNorm: 3-4                              2,048
│    └─BartDecoder: 2-3                                 --
│    │    └─Embedding: 3-5                              (recursive)
│    │    └─BartLearnedPositionalEmbedding: 3-6         1,050,624
│    │    └─ModuleList: 3-7                             201,560,064
│    │    └─LayerNorm: 3-8                              2,048
├─Linear: 1-2                                           51,470,336
Total params: 457,760,768
Trainable pa

In [None]:
from datasets import load_metric
import nltk
import numpy as np

# data for nltk.sent_tokenize; newer NLTK releases may need "punkt_tab" instead
nltk.download("punkt", quiet=True)

ROUGE = load_metric('rouge')
BLEURT = load_metric('bleurt', 'bleurt-large-512')

def postprocess_text(preds, labels):
  preds = [pred.strip() for pred in preds]
  labels = [label.strip() for label in labels]

  # rougeLSum expects newline after each sentence
  preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
  labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

  return preds, labels

def compute_metrics(eval_preds):
  preds, labels = eval_preds
  if isinstance(preds, tuple):
      preds = preds[0]

  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
  # Replace -100 in the labels with the actual padding id so they can be decoded
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  # Some simple post-processing
  decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

  # score the decoded text, not the raw token ids
  rouge_score = ROUGE.compute(predictions=decoded_preds, references=decoded_labels)
  rouge_score = { k: v.mid.fmeasure for k, v in rouge_score.items() }

  # BLEURT returns one score per sample; report the mean as a single number
  bleurt_scores = BLEURT.compute(predictions=decoded_preds, references=decoded_labels)["scores"]
  bleurt_score = {"bleurt": float(np.mean(bleurt_scores))}

  return {**rouge_score, **bleurt_score}

INFO:tensorflow:Reading checkpoint /root/.cache/huggingface/metrics/bleurt/bleurt-large-512/downloads/extracted/299e33e80b83c78cc60e485384c7804f6ec1fb36c2013c5078257c17a82719ca/bleurt-large-512.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:512
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.
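To give an intuition for what the ROUGE F-measure reported by `compute_metrics` captures, here is a deliberately simplified unigram-overlap score. It is only a sketch: the `rouge1_f` helper is hypothetical, it splits on whitespace, and it is not the implementation used by the `rouge` metric, which applies its own tokenization and aggregates scores over the dataset.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between two strings (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the laser scans the work", "the laser processes the work")
print(f"simplified ROUGE-1 F-measure: {score:.2f}")
```

ROUGE rewards lexical overlap only, which is why the notebook pairs it with BLEURT, a learned metric.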


In [None]:
FINETUNE_MODEL_PATH = "/gdrive/MyDrive/university/BARTModelFineTune/"

In [None]:
import gc
gc.collect()

13736

In [None]:
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer

# Set up the training arguments: batch size of 4 with gradients accumulated
# over 2 consecutive steps (effective batch size of 8), training for 1 epoch.
# A checkpoint is saved every 500 steps and only the most recent one is kept.
# No evaluation strategy is set, so the validation set is only scored when
# trainer.evaluate() is called explicitly.
# Gradient checkpointing trades extra compute for lower memory usage, while
# mixed precision makes training faster and lighter.
training_args =  Seq2SeqTrainingArguments(
    output_dir=FINETUNE_MODEL_PATH,
    overwrite_output_dir=True, # reuse the output directory across runs
    gradient_accumulation_steps=2, # lower memory usage: update the weights every 2 steps
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    logging_first_step=True,
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=1, # keep only the most recent checkpoint
    fp16=True, # faster and lighter on memory but possibly less precise on convergence
    predict_with_generate=True,
    gradient_checkpointing=True)
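The interaction between batch size, gradient accumulation and `save_steps` can be worked out by hand. The sketch below assumes a single GPU, a final partial batch that is kept, and a floor division when converting batches into optimizer steps; these are assumptions about the trainer's bookkeeping, and `schedule` is a hypothetical helper:

```python
import math

def schedule(n_train, per_device_batch, grad_accum, epochs, save_steps):
    """Estimate optimizer steps and checkpoints for a training run."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    # one optimizer step every grad_accum batches (assumed floor division)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    return {
        "effective_batch_size": per_device_batch * grad_accum,
        "batches_per_epoch": batches_per_epoch,
        "optimizer_steps": total_updates,
        "checkpoints": total_updates // save_steps,
    }

# Values taken from this notebook: 4530 training samples, batch size 4,
# accumulation over 2 steps, 1 epoch, a checkpoint every 500 steps.
plan = schedule(4530, 4, 2, 1, 500)
print(plan)
```

Under these assumptions a single epoch writes one checkpoint, at step 500, and any steps after it are not saved unless the model is stored explicitly at the end.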

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
trainer = Seq2SeqTrainer(
    model=model, 
    args=training_args,
    eval_dataset=dataset["valid"],
    train_dataset=dataset["train"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



In [None]:
trainer.train()

# Generation

In [None]:
inputs = []
ref = []
sum = []
pred = []
NUM_SAMPLES =  10#@param {type: "number"}

dataset["test"] = dataset["test"].shuffle()

for idx in range(NUM_SAMPLES):

  # fetch one row at a time instead of loading whole columns on every iteration
  sample = dataset["test"][idx]
  SUMMARY = tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
  sum.append(SUMMARY)
  ref.append(tokenizer.decode(sample["decoder_input_ids"], skip_special_tokens=True))
  inputs.append(tokenizer([SUMMARY], return_tensors="pt"))

# make sure dropout is disabled while generating
model.eval()

for idx in range(NUM_SAMPLES):

  # move the inputs to the same device as the model (the GPU after training)
  output_ids = model.generate(inputs[idx]["input_ids"].to(model.device),
      max_length=128,
      do_sample=True, 
      top_k=20, 
      top_p=0.90, 
      #num_beams=5, 
      #no_repeat_ngram_size=2, 
      #num_return_sequences=2, 
      early_stopping=True)
  
  print("DESCRIPTION N.", idx+1 ,": \n")
  print(sum[idx], end="\n\n")
  print("OUTPUT N.", idx+1 , ": \n")
  print(tokenizer.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False),end = "\n\n")
  print("CLAIM N.", idx+1 ,": \n")
  print(ref[idx], end="\n\n")
  
  pred.append(tokenizer.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))

DESCRIPTION N. 1 : 

in a case where a type of the laser beam machining apparatus disclosed in jp 2012-256062 a is used, because the irradiation head is not displaced, it is difficult to measure laser scanning velocity. the invention provides a measuring method that enables measurement of laser scanning velocity in a laser beam machining apparatus structured so that laser scanning is performed by an operation of a mirror. an aspect of the invention is a measuring method for measuring laser scanning velocity for a laser beam machining apparatus, according to claim 1. the laser beam machining apparatus includes a mirror, and is configured to process a work by irradiating pulsed laser. the laser is irradiated by operating the mirror. the measuring method includes measuring processing sound of the work while it is processed by the laser using the laser beam machining apparatus. further, the measuring method includes calculating the laser scanning velocity by analyzing a frequency of said m
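The generation call above samples with both `top_k` and `top_p`. The following sketch shows the idea behind combining the two filters on a toy next-token distribution; the `top_k_top_p` helper and the probabilities are made up for illustration, and this is not the library's implementation:

```python
import math

def top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = math.fsum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the token that crosses the threshold is still kept

    total = math.fsum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Hypothetical next-token distribution.
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p(toy, top_k=3, top_p=0.8)
print(filtered)
```

Sampling then draws the next token from the filtered distribution, which cuts off the unlikely tail while keeping some diversity compared to greedy or beam search.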

In [None]:
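# flatten: batch_decode returned a one-element list for every sample
pred = [item for sublist in pred for item in sublist]

# the metrics expect flat lists of strings, one prediction per reference
predictions = pred
references = ref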


In [None]:
BLEURT.compute(predictions=predictions, references=references)

In [None]:
ROUGE.compute(predictions=predictions, references=references)

In [None]:
import pandas as pd

# collect descriptions, reference claims and generated outputs,
# then append them to the results file
df = pd.DataFrame(list(zip(sum, ref, pred)),
                  columns=['Desc', 'Claim', 'Out'])
df.to_csv("/gdrive/MyDrive/university/BARTModelFineTune/out.csv", mode='a', header=False)

In [None]:
df