# Efficiently train Large Language Models with LoRA and Hugging Face

In this blog, we are going to show you how to apply [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685) to fine-tune FLAN-T5 XXL (11 billion parameters) on a single GPU. We are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft).

You will learn how to:

1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune T5 with LoRA and bnb int-8
4. Evaluate & run Inference with LoRA FLAN-T5
5. Cost performance comparison

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
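
To make the LoRA idea concrete, here is a minimal NumPy sketch of a single adapted linear layer. It is not PEFT's implementation: the sizes `d`, `r`, and `alpha` are illustrative, and the names `lora_A`, `lora_B`, and `lora_forward` are hypothetical. The frozen weight `W` stays untouched while only the two small factors would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 512, 16, 32  # illustrative sizes, not taken from a specific model

# Frozen pre-trained weight: never updated during fine-tuning.
W = rng.normal(size=(d, d))

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially behaves exactly like the frozen layer.
lora_A = rng.normal(scale=0.01, size=(r, d))
lora_B = np.zeros((d, r))


def lora_forward(x):
    """Apply W + (alpha / r) * B @ A to x without materializing the sum."""
    frozen = x @ W.T
    update = (x @ lora_A.T) @ lora_B.T
    return frozen + (alpha / r) * update


x = rng.normal(size=(4, d))
y = lora_forward(x)

full_params = W.size
lora_params = lora_A.size + lora_B.size
print(f"full: {full_params}, LoRA: {lora_params}, ratio: {lora_params / full_params:.4f}")
```

With these toy sizes the two factors hold 16 times fewer values than the full matrix, which is where the memory savings during training come from.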

*Note: This tutorial was created and run on a g5.2xlarge AWS EC2 Instance, including 1 NVIDIA A10G.*

## 1. Setup Development Environment

In our example, we use the [PyTorch Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-pytorch.html) with already set up CUDA drivers and PyTorch installed. We still have to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages.

In [2]:
# install Hugging Face Libraries
!pip install "peft==0.2.0"
!pip install "transformers==4.27.1" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.37.1" loralib --upgrade --quiet

# install additional dependencies needed for training
!pip install rouge-score tensorboard py7zr


## 2. Load and prepare the dataset

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')
!cp '/content/gdrive/My Drive/Data/FINDSum/text/FINDSum-ROO/roo_input_2000/'* .

Mounted at /content/gdrive


In [4]:
from datasets import load_dataset
# dataset = load_dataset("csv", data_files="my_file.csv")

#dataset_fin = load_dataset("csv", data_files=["val_roo_segment_0_input_2_1000.csv", "test_roo_segment_0_input_2_1000.csv", "val_roo_segment_1_input_2_1000.csv", "test_roo_segment_1_input_2_1000.csv"])
#dataset_fin = load_dataset("csv", data_files={"train": ["val_roo_segment_0_input_2_1000.csv", "test_roo_segment_0_input_2_1000.csv", "val_roo_segment_1_input_2_1000"], "test": ["test_roo_segment_1_input_2_1000.csv"]})
#dataset_fin = load_dataset("csv", data_files=["train_roo_segment_0_input_2_1000.csv", "train_roo_segment_1_input_2_1000.csv"])
ds_fin = load_dataset("csv", data_files={"train": ["train_roo_segment_0_input_2_1000.csv", "train_roo_segment_1_input_2_1000.csv"], "test": ["test_roo_segment_0_input_2_1000.csv","test_roo_segment_1_input_2_1000.csv"]})






In [5]:
# ds_fin = dataset_fin["train"].train_test_split(test_size=0.2, shuffle=True)
# ds_fin

ds_fin["train"][0]

{'document': "amounts allocated to land , buildings , equipment and fixtures are based on cost segregation studies performed by independent third parties or on the company 's analysis of story_separator_special_tag the following discussion should be read in conjunction with the consolidated financial statements of lcfh and accompanying notes included in this annual report on form 10-k. in addition to historical information , the following discussion contains forward-looking statements that reflect our plans , estimates and beliefs . our actual results could differ materially from those discussed in the forward-looking statements . factors that could cause or contribute to these differences include , but are not limited to , those discussed in our `` risk factors . '' overview we are a leading commercial real estate finance company with a proprietary loan origination platform and an established national footprint . as a non-bank operating company , we believe that we are well-positioned

In [6]:

print(f"Train dataset size: {len(ds_fin['train'])}")
print(f"Test dataset size: {len(ds_fin['test'])}")


Train dataset size: 33640
Test dataset size: 4204


To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out **[chapter 6](https://huggingface.co/course/chapter6/1?fw=tf)** of the Hugging Face Course.

In [7]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

model_id = "t5-small"

sample_record = ds_fin["train"][0]

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the untuned base model (TensorFlow) as a baseline and save it for a later comparison
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_id)
model.save_pretrained('./models/base')

# Summarize one training record with the base model
sample = "summarize: " + sample_record["document"]
input_tokens = tokenizer(sample, padding='max_length', max_length=512, truncation=True, return_tensors='np')
# top_k and temperature only take effect with do_sample=True; without it this is greedy search
result_sample = model.generate(**input_tokens, max_length=200, top_k=3, temperature=0.5)
print(tokenizer.decode(result_sample[0]))

# Translation sanity check
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, max_length=512, truncation=True, return_tensors="np")

outputs = model.generate(**input_ids, max_length=200, top_k=3, temperature=0.5)
print(tokenizer.decode(outputs[0]))


<pad> lcfh's financial statements reflect our plans, estimates and beliefs. the forward-looking statements could differ materially from those discussed in the forward-looking statements. we are a leading commercial real estate finance company with a proprietary loan origination platform and an established national footprint.</s>
<pad> Wie alt sind Sie?</s>


In [8]:
ds_fin

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 33640
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 4204
    })
})

Before we can start training, we need to preprocess our data. Abstractive summarization is a text-generation task: the model takes a text as input and generates a summary as output. To batch the data efficiently, we first need to know how long the tokenized inputs and outputs are.

In [9]:
from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([ds_fin["train"], ds_fin["test"]]).map(lambda x: tokenizer(x["document"], truncation=True), batched=True, remove_columns=["document", "summary"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# take the 85th percentile of the input lengths for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([ds_fin["train"], ds_fin["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["document", "summary"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# take the 90th percentile of the target lengths for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Max source length: 512


Max target length: 512
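
The percentile-based choice of maximum length can be illustrated on a small, made-up list of token counts. The values in `lengths` are hypothetical and not taken from the dataset; the sketch only shows the trade-off between truncating a few long examples and padding every short one. Both maxima above come out at 512, presumably because `truncation=True` already caps every sequence at the tokenizer's maximum length.

```python
import numpy as np

# Hypothetical token counts per example (not taken from the dataset above).
lengths = np.array([12, 40, 55, 61, 70, 88, 90, 120, 300, 512])

# Choose the cut-off at the 85th percentile instead of the absolute maximum.
max_len = int(np.percentile(lengths, 85))

# Examples longer than the cut-off would be truncated.
truncated = int((lengths > max_len).sum())

# Padding tokens needed when padding to the absolute maximum vs. the percentile.
padding_at_max = int((lengths.max() - lengths).sum())
padding_at_pct = int(np.clip(max_len - lengths, 0, None).sum())

print(f"cut-off: {max_len}, truncated examples: {truncated}")
print(f"padding tokens at max: {padding_at_max}, at percentile: {padding_at_pct}")
```

A lower cut-off wastes less compute on padding at the price of losing the tail of the longest documents.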


We preprocess our dataset before training and save it to disk. You could run this step on your local machine or a CPU and upload it to the [Hugging Face Hub](https://huggingface.co/docs/hub/datasets-overview).

In [10]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["document"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = ds_fin.map(preprocess_function, batched=True, remove_columns=["document", "summary"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
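
The label masking inside `preprocess_function` can be seen in isolation with a few toy token IDs. This is a minimal sketch: the IDs in `labels` are made up, and `mask_padding` is a hypothetical helper name. It assumes T5's pad token ID of 0 and relies on -100 being the index that PyTorch's cross-entropy loss ignores by default.

```python
PAD_TOKEN_ID = 0     # T5's pad token id (assumed here rather than read from the tokenizer)
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


# Two made-up label sequences, already padded to length 5.
labels = [[21603, 10, 1, 0, 0], [37, 1, 0, 0, 0]]
masked = [mask_padding(seq) for seq in labels]
print(masked)
```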



## 3. Fine-Tune T5 with LoRA and bnb int-8

In addition to the LoRA technique, we will use [bitsandbytes LLM.int8()](https://huggingface.co/blog/hf-bitsandbytes-integration) to quantize our frozen LLM to int8. This allows us to reduce the memory needed for FLAN-T5 XXL by ~4x.
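
The ~4x figure follows from simple arithmetic on bytes per parameter. The sketch below assumes roughly 11 billion parameters and counts weights only; it ignores activations, the LoRA weights, and any layers kept in higher precision, so real usage will be somewhat higher. `weight_memory_gb` is a hypothetical helper.

```python
NUM_PARAMS = 11_000_000_000  # approximate parameter count of FLAN-T5 XXL

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

Going from 4 bytes to 1 byte per weight is what brings an 11B-parameter model within reach of a single 24GB GPU.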

The first step of our training is to load the model. We are going to use [philschmid/flan-t5-xxl-sharded-fp16](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16), which is a sharded version of [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl). The sharding helps us avoid running out of memory when loading the model.

In [11]:
from transformers import AutoModelForSeq2SeqLM
import torch

# Hugging Face Hub model id. Uncomment one of the lines below to train a larger model;
# otherwise the `model_id` defined earlier ("t5-small") is reused.
# model_id = "philschmid/flan-t5-xxl-sharded-fp16"
# model_id = "google/flan-t5-xl"

# load the model from the Hub in 8-bit
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")




Now, we can prepare our model for the LoRA int-8 training using `peft`.

In [12]:


from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
# prepare int-8 model for training
model = prepare_model_for_int8_training(model)

# add LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# trainable params: 18874368 || all params: 11154206720 || trainable%: 0.16921300163961817

trainable params: 589824 || all params: 61096448 || trainable%: 0.9653981848502878


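As you can see, we are only training about 0.97% of the parameters of `t5-small` (589,824 of 61,096,448). With FLAN-T5 XXL, the same LoRA configuration trains only about 0.17% (18,874,368 of 11,154,206,720). This huge memory gain will enable us to fine-tune the model without memory issues.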
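
The trainable parameter counts can be reproduced by hand. The sketch below assumes the standard T5 architecture: each encoder layer has one attention block, each decoder layer has two (self-attention and cross-attention), and LoRA adds `r * (d_in + d_out)` weights to every targeted `q` and `v` projection. The layer counts and hidden sizes are assumptions about the `t5-small` and FLAN-T5 XXL configurations, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r, num_modules):
    """Each adapted module adds A (r x d_in) and B (d_out x r)."""
    return num_modules * r * (d_in + d_out)


def targeted_modules(encoder_layers, decoder_layers, targets_per_block=2):
    """Count q and v projections: 1 attention block per encoder layer, 2 per decoder layer."""
    attention_blocks = encoder_layers * 1 + decoder_layers * 2
    return attention_blocks * targets_per_block


R = 16  # LoRA rank used in the config above

# t5-small (assumed): 6 + 6 layers, 512-dimensional q/v projections
small = lora_param_count(512, 512, R, targeted_modules(6, 6))

# FLAN-T5 XXL (assumed): 24 + 24 layers, 4096-dimensional q/v projections
xxl = lora_param_count(4096, 4096, R, targeted_modules(24, 24))

print(f"t5-small: {small} trainable ({small / 61_096_448:.2%} of all params)")
print(f"flan-t5-xxl: {xxl} trainable ({xxl / 11_154_206_720:.2%} of all params)")
```

Both totals agree with the values reported by `print_trainable_parameters()`.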

Next is to create a `DataCollator` that will take care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.
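
As a rough illustration of what the collator does to the labels, the sketch below pads a batch to its longest sequence, rounded up to a multiple of 8, using -100 as the fill value. It is a simplified stand-in, not the `DataCollatorForSeq2Seq` code: it only handles labels, always pads on the right, and `pad_labels` is a hypothetical name.

```python
def pad_labels(batch_labels, label_pad_token_id=-100, pad_to_multiple_of=8):
    """Right-pad every label sequence to a common length that is a multiple of 8."""
    max_len = max(len(seq) for seq in batch_labels)
    # Round up to the next multiple of pad_to_multiple_of.
    max_len = -(-max_len // pad_to_multiple_of) * pad_to_multiple_of
    return [seq + [label_pad_token_id] * (max_len - len(seq)) for seq in batch_labels]


batch = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10, 11, 12]]
padded = pad_labels(batch)
print([len(seq) for seq in padded])
print(padded[0])
```

Lengths that are multiples of 8 tend to map better onto GPU tensor cores, which is the usual reason for setting `pad_to_multiple_of=8`.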

In [13]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training.

In [14]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="lora-flan-t5-small"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=2,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

Let's now train our model and run the cells below. Note that for T5, some layers are kept in `float32` for stability purposes.

In [15]:
#  with torch.autocast("cuda"):
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500  | 3.9915 |
| 1000 | 3.7291 |
| 1500 | 3.6294 |
| 2000 | 3.581 |
| 2500 | 3.5345 |
| 3000 | 3.4828 |
| 3500 | 3.4815 |
| 4000 | 3.4926 |
| 4500 | 3.447 |
| 5000 | 3.4396 |


TrainOutput(global_step=8410, training_loss=3.5121704682158517, metrics={'train_runtime': 6639.3284, 'train_samples_per_second': 10.134, 'train_steps_per_second': 1.267, 'total_flos': 9227703681024000.0, 'train_loss': 3.5121704682158517, 'epoch': 2.0})
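
The throughput numbers in `TrainOutput` can be cross-checked from values printed earlier in this post: 33,640 training examples, 2 epochs, 8,410 optimizer steps, and a runtime of 6639.3284 seconds. The variable names below are illustrative only.

```python
train_examples = 33_640      # size of the train split
num_epochs = 2               # num_train_epochs
global_steps = 8_410         # global_step from TrainOutput
train_runtime_s = 6639.3284  # train_runtime from TrainOutput

samples_seen = train_examples * num_epochs
samples_per_second = samples_seen / train_runtime_s
steps_per_second = global_steps / train_runtime_s
effective_batch_size = samples_seen / global_steps
runtime_hours = train_runtime_s / 3600

print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s: {steps_per_second:.3f}")
print(f"effective batch size: {effective_batch_size:.0f}")
print(f"runtime: {runtime_hours:.2f} h")
```

The derived effective batch size of 8 is what `auto_find_batch_size=True` settled on for this run.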

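The `t5-small` run logged above finished in about 1h 51min (`train_runtime` of 6639 seconds). The figures quoted for FLAN-T5 XXL are a training time of ~10:36:00 and a cost of `~$13.22` for 10h of training. For comparison, a [full fine-tuning on FLAN-T5-XXL](https://www.philschmid.de/fine-tune-flan-t5-deepspeed#3-results--experiments) with the same duration (10h) requires 8x A100 40GB GPUs and costs ~$322.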

We can save our model to use it for inference and evaluate it. We will save it to disk for now, but you could also upload it to the [Hugging Face Hub](https://huggingface.co/docs/hub/main) using the `model.push_to_hub` method.

In [16]:
# Save our LoRA model & tokenizer results
peft_model_id="results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# if you want to save the base model to call
trainer.model.base_model.save_pretrained(peft_model_id)



Our LoRA checkpoint is only 84MB and includes all of the learned knowledge for our summarization task.

## 4. Evaluate & run Inference with LoRA FLAN-T5

After the training is done, we want to evaluate and test the model. The most commonly used metric for evaluating summarization tasks is [rouge_score](https://en.wikipedia.org/wiki/ROUGE_(metric)), short for Recall-Oriented Understudy for Gisting Evaluation. This metric does not behave like standard accuracy: it compares a generated summary against a set of reference summaries.

We are going to use the `evaluate` library to compute the `rouge` score. We can run inference using `PEFT` and `transformers`. For our FLAN-T5 XXL model, we need at least 18GB of GPU memory.
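
To give a feel for what ROUGE measures, here is a simplified ROUGE-1 computation based on unigram overlap. It is a sketch for intuition only, not the implementation behind `evaluate`: it tokenizes on whitespace, applies no stemming, and the function name `rouge1` and the two example sentences are made up.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    "the company acquires mortgage loans",
    "the company acquires and originates loans",
)
print({name: round(value, 3) for name, value in scores.items()})
```

Here four of the five predicted words appear in the reference (precision 0.8), and four of the six reference words are covered (recall of about 0.667).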

In [29]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc.
peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")



Peft model loaded


In [18]:
# Let us look at unseen data from the test set
unseen_dataset = load_dataset("csv", data_files=["test_roo_segment_0_input_2_1000.csv"])




In [19]:
unseen_dataset["train"][10]

{'document': "the one month libor rate increased from an average of ( 0.7164 % ) during december 2016 to an average of ( 1.4925 % ) during december 2017 and an average of ( 2.4582 % ) during december 2018. fixed mortgage interest rates for multi-family properties of similar class and location as omaha 's portfolio also increased during 2017 from an approximate range of 4.10 % to 4.15 % in early 2017 to 4.35 % to 4.50 % near the end of 2017and further to an approximate range of 5.05 % to 5.10 % in december 2018. mortgage interest rates may continue to increase in 2019 if the u.s. federal reserve continues a policy to increase the federal funds rate . although increases in fixed mortgage rates do not impact the operating cash flow of the omaha properties directly , increases in fixed and floating rates on commercial mortgage debt can have a negative impact on capitalization rates and the sales prices sentinel omaha may achieve in the future . since omaha has in the past two years paid of

In [33]:
import random
# len(unseen_dataset) would count the splits, so index into the "train" split instead
sample_data = unseen_dataset["train"][random.randrange(len(unseen_dataset["train"]))]

In [34]:
sample_data["document"]

'the company primarily targets acquisitions of re-performing loans ( “ rpls ” ) , which are residential mortgage loans on which at least five of the seven most recent payments have been made , or the most recent payment has been made and accepted pursuant to an agreement , or the full dollar amount , to cover at least five payments has been paid in the last seven months . the company also acquires and originates small balance commercial loans ( `` sbc loans `` ) . the sbc loans that the company opportunistically targets , through acquisitions , or originations , generally have a principal balance of up to $ 5.0 million and are secured by multi-family residential and story_separator_special_tag overview great ajax corp. is a maryland corporation that is organized and operated in a manner intended to allow us to qualify as a reit . we primarily target acquisitions of rpls , which are residential mortgage loans on which at least five of the seven most recent payments have been made , or t

In [35]:
# use the same "summarize: " prefix that was applied during preprocessing
input_ids = tokenizer("summarize: " + sample_data["document"], return_tensors="pt", truncation=True).input_ids.cuda()
#with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens= 100, do_sample=True, top_p=0.8)
print(f"input sentence: {sample_data['document']}\n{'---'* 20}")

print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")

input sentence: the company primarily targets acquisitions of re-performing loans ( “ rpls ” ) , which are residential mortgage loans on which at least five of the seven most recent payments have been made , or the most recent payment has been made and accepted pursuant to an agreement , or the full dollar amount , to cover at least five payments has been paid in the last seven months . the company also acquires and originates small balance commercial loans ( `` sbc loans `` ) . the sbc loans that the company opportunistically targets , through acquisitions , or originations , generally have a principal balance of up to $ 5.0 million and are secured by multi-family residential and story_separator_special_tag overview great ajax corp. is a maryland corporation that is organized and operated in a manner intended to allow us to qualify as a reit . we primarily target acquisitions of rpls , which are residential mortgage loans on which at least five of the seven most recent payments have b

In [38]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

base_model_id = "t5-small"

# Use the same record as in the fine-tuned example above
sample_record = sample_data

# Load the base tokenizer and the untuned base model saved earlier
tokenizer_base = AutoTokenizer.from_pretrained(base_model_id)
model_base = TFAutoModelForSeq2SeqLM.from_pretrained("./models/base")

# Summarize the record with the base model
sample = "summarize: " + sample_record["document"]
input_tokens = tokenizer_base(sample, padding='max_length', max_length=512, truncation=True, return_tensors='np')
# top_k and temperature only take effect with do_sample=True; without it this is greedy search
result_sample_base = model_base.generate(**input_tokens, max_length=200, top_k=3, temperature=0.5)
print(tokenizer_base.decode(result_sample_base[0]))

# Translation sanity check
input_text_trans = "translate English to German: How old are you?"
input_ids_base = tokenizer_base(input_text_trans, max_length=512, truncation=True, return_tensors="np")

outputs = model_base.generate(**input_ids_base, max_length=200, top_k=3, temperature=0.5)
print(tokenizer_base.decode(outputs[0]))



<pad> the company primarily targets acquisitions of re-performing loans ( rpls ), which are residential mortgage loans on which at least five of the seven most recent payments have been made. the company also acquires and originates small balance commercial loans ( sbc loans ). the sbc loans that the company opportunistically targets through acquisitions generally have a principal balance of up to $ 5.0 million.</s>
<pad> Wie alt sind Sie?</s>


Let’s load the dataset again with a random sample to try the summarization.

Nice! Our model works! Now, let's take a closer look and evaluate it against the `test` split of our processed dataset. To do so, we need some utilities to generate the summaries and group them together with their references. We score the results with ROUGE, the metric described at the start of this section.

In [None]:
import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm

# Metric
metric = evaluate.load("rouge")

def evaluate_peft_model(sample,max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    return prediction, labels

# load test dataset from disk
test_dataset = load_from_disk("data/eval/").with_format("torch")

# run predictions
# this can take ~45 minutes
predictions, references = [] , []
for sample in tqdm(test_dataset):
    p,l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)

# compute metric
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)

# print results
print(f"rouge1: {rouge['rouge1'] * 100:.2f}%")
print(f"rouge2: {rouge['rouge2'] * 100:.2f}%")
print(f"rougeL: {rouge['rougeL'] * 100:.2f}%")
print(f"rougeLsum: {rouge['rougeLsum'] * 100:.2f}%")

# rouge1: 50.39%
# rouge2: 24.84%
# rougeL: 41.37%
# rougeLsum: 41.39%


Our PEFT fine-tuned FLAN-T5-XXL achieved a rouge1 score of `50.38%` on the test dataset. For comparison, a [full fine-tuning of flan-t5-base achieved a rouge1 score of 47.23](https://www.philschmid.de/fine-tune-flan-t5). That is an improvement of about `3` percentage points.

It is incredible to see that our LoRA checkpoint is only 84MB and that the model achieves better performance than a smaller, fully fine-tuned model.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

In [None]:
!pip install SentencePiece