# Transfer learning

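# Summarization with BART
BART = BERT + GPT: a bidirectional encoder (as in BERT) paired with an autoregressive decoder (as in GPT).
This notebook applies the transfer learning paradigm: a pretrained BART checkpoint is fine-tuned on legislative text (California bills from the BillSum dataset).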
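
The encoder/decoder split behind "BERT + GPT" comes down to attention: a BERT-style encoder lets every token attend to the whole input, while a GPT-style decoder lets each token attend only to earlier positions. A minimal sketch of the two patterns, using plain Python lists instead of the library's mask tensors (the function names are illustrative, not part of `transformers`):

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Print both masks for a 4-token sequence to compare them row by row.
for row in bidirectional_mask(4):
    print(row)
print()
for row in causal_mask(4):
    print(row)
```

In BART the decoder additionally attends to the encoder output through cross-attention, which this sketch leaves out.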


In [1]:
import pandas as pd
from tqdm.auto import tqdm

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(palette = 'summer')

In [2]:
!pip install datasets transformers==4.28.0



In [3]:
!pip install --upgrade transformers

Collecting transformers
  Using cached transformers-4.33.1-py3-none-any.whl (7.6 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.28.0
    Uninstalling transformers-4.28.0:
      Successfully uninstalled transformers-4.28.0
Successfully installed transformers-4.33.1


In [4]:
!pip install evaluate



In [9]:
!pip install accelerate -U

Collecting accelerate
  Using cached accelerate-0.22.0-py3-none-any.whl (251 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.20.1
    Uninstalling accelerate-0.20.1:
      Successfully uninstalled accelerate-0.20.1
Successfully installed accelerate-0.22.0


In [10]:
import transformers
from datasets import load_dataset
import evaluate

In [11]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [12]:
transformers.__version__

'4.33.1'

In [13]:
data = load_dataset('billsum', split = 'ca_test')

In [14]:
data

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

In [15]:
data[0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) (1) Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These organizations help preserve the memories and incidents of the great hostilities fought by our nation, and preserve and strengthen comradeship among members.\n(2) These veterans’ organizations also own and manage various properties including lodges, posts, and fraternal halls. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. This aids in the healing process for these returning veterans, and ensures their health and happiness.\n(b) As a result of congressional chartering of these veterans’ organizations, the United States Inte

In [16]:
tokenizer = transformers.AutoTokenizer.from_pretrained("ainize/bart-base-cnn")

In [17]:
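def prepr_f(text):
  # Tokenize the bill text, truncating to the 1024-token input limit
  inputs = tokenizer(text['text'], max_length = 1024, truncation = True)
  # Tokenize the reference summaries as targets, capped at 128 tokens
  labels = tokenizer(text_target=text['summary'], max_length = 128, truncation = True)
  inputs['labels'] = labels['input_ids']
  return inputs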
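
When such a function is applied with `batched = True`, `map` passes it a dictionary of lists rather than a single record, so `text['text']` is a list of strings. The sketch below mimics that calling pattern with a hypothetical whitespace tokenizer (`toy_tokenize` and `toy_preprocess` are stand-ins, not the BART tokenizer) to show how truncated inputs and labels end up side by side:

```python
def toy_tokenize(texts, max_length):
    """Stand-in tokenizer: split on whitespace, keep at most max_length tokens."""
    return {"input_ids": [t.split()[:max_length] for t in texts]}


def toy_preprocess(batch, max_input=6, max_target=3):
    """Same shape as prepr_f: tokenize inputs, attach tokenized summaries as labels."""
    inputs = toy_tokenize(batch["text"], max_input)
    labels = toy_tokenize(batch["summary"], max_target)
    inputs["labels"] = labels["input_ids"]
    return inputs


# A batch is a dict of lists, one entry per example.
batch = {
    "text": ["The people of the State of California do enact as follows"],
    "summary": ["A bill about veterans organizations"],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # the text, cut to max_input tokens
print(out["labels"][0])     # the summary, cut to max_target tokens
```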

In [18]:
data = data.train_test_split(test_size = 0.1)
token_data = data.map(prepr_f, batched = True)


In [19]:
token_data

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1113
    })
    test: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 124
    })
})

In [20]:
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("ainize/bart-base-cnn")

In [21]:
data_collator = transformers.DataCollatorForSeq2Seq(tokenizer = tokenizer, model = model)
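
`DataCollatorForSeq2Seq` pads each batch dynamically to its longest sequence. By default it pads `labels` with `-100`, the index that PyTorch's cross-entropy loss ignores, so padded positions do not contribute to the loss. A rough pure-Python illustration of that padding step (`pad_batch` is a hypothetical helper and the input pad id of 1 is an assumption for the example; the real collator returns tensors and, when given the model, also prepares decoder inputs from the labels):

```python
def pad_batch(features, pad_id=1, label_pad_id=-100):
    """Pad inputs and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # -100 marks label positions the loss should skip
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [0, 11, 12, 13, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 14, 2], "labels": [0, 22, 23, 24, 2]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["labels"])
```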

In [22]:
!pip install transformers[torch]



In [23]:
training_args = transformers.Seq2SeqTrainingArguments(
    output_dir = './res',
    evaluation_strategy = 'epoch',
    learning_rate = 2e-5,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 4,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 3,
)

In [24]:
trainer = transformers.Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = token_data['train'],
    eval_dataset = token_data['test'],
    tokenizer = tokenizer,
    data_collator = data_collator,
)

In [25]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.896121        |
| 2     | 2.166200      | 1.808822        |
| 3     | 2.166200      | 1.796048        |


TrainOutput(global_step=837, training_loss=2.042926163941159, metrics={'train_runtime': 593.0769, 'train_samples_per_second': 5.63, 'train_steps_per_second': 1.411, 'total_flos': 2035910034063360.0, 'train_loss': 2.042926163941159, 'epoch': 3.0})
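
The numbers in the training output can be cross-checked with a little arithmetic. With 1113 training examples and a per-device batch size of 4, one epoch takes ceil(1113 / 4) = 279 steps, so three epochs give the reported `global_step=837`. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps (none of these is overridden in the arguments above). Under those assumptions the only logging event falls at step 500, inside epoch 2, which would explain why epoch 1 shows "No log" and why epochs 2 and 3 report the same training loss.

```python
import math

num_train_examples = 1113
batch_size = 4        # per_device_train_batch_size, single device assumed
num_epochs = 3
logging_steps = 500   # assumed Trainer default, not set explicitly in the notebook

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the training loss is logged, and the epoch each one falls in.
log_points = list(range(logging_steps, total_steps + 1, logging_steps))
epoch_of_log = [math.ceil(step / steps_per_epoch) for step in log_points]

print(steps_per_epoch, total_steps, log_points, epoch_of_log)
# prints: 279 837 [500] [2]
```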

In [26]:
ex = data['test']['text'][0]
ex

'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 739.1 of the Public Utilities Code is amended to read:\n739.1.\n(a) The commission shall continue a program of assistance to low-income electric and gas customers with annual household incomes that are no greater than 200 percent of the federal poverty guideline levels, the cost of which shall not be borne solely by any single class of customer. For one-person households, program eligibility shall be based on two-person household guideline levels. The program shall be referred to as the California Alternate Rates for Energy or CARE program. The commission shall ensure that the level of discount for low-income electric and gas customers correctly reflects the level of need.\n(b) The commission shall establish rates for CARE program participants, subject to both of the following:\n(1) That the commission ensure that low-income ratepayers are not jeopardized or overburdened by monthly energy expenditures,

In [27]:
in_ids = tokenizer.encode(
    ex,
    return_tensors = 'pt',
    max_length = 1024,
    truncation = True
).to(device)

In [28]:
type(in_ids)

torch.Tensor

In [29]:
in_ids.shape

torch.Size([1, 1024])

In [30]:
summ_ids = model.generate(
    input_ids = in_ids,
    bos_token_id = model.config.bos_token_id,
    eos_token_id = model.config.eos_token_id,
    max_length = 256,
    min_length = 32,
    num_beams = 4,
)
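
Setting `num_beams = 4` switches `generate` from greedy decoding to beam search: at each step the decoder keeps the 4 highest-scoring partial sequences instead of only one. The following toy sketch shows the idea over a hand-written next-token table (`TABLE` and `beam_search` are hypothetical; the library's implementation also handles length penalties, the `min_length` constraint, and batched tensors):

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TABLE = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"bill": 0.5, "law": 0.5},
    "a": {"bill": 0.9, "law": 0.1},
    "bill": {"</s>": 1.0},
    "law": {"</s>": 1.0},
}


def beam_search(num_beams, max_length):
    """Return up to num_beams (tokens, log_prob) pairs, best first."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for tok, p in TABLE[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams most probable partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams


greedy = beam_search(num_beams=1, max_length=5)[0]
beam = beam_search(num_beams=2, max_length=5)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(beam[0], round(math.exp(beam[1]), 2))
```

In this toy table greedy decoding commits to "the" (0.6) and ends with total probability 0.3, while a beam of width 2 keeps "a" alive and finds the more probable sequence "a bill" (0.36).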

In [31]:
summ_ids.shape

torch.Size([1, 126])

In [32]:
decoder_txt = tokenizer.decode(summ_ids[0], skip_special_tokens = True)

In [33]:
len(ex), len(decoder_txt)

(11290, 685)

In [34]:
decoder_txt

'Existing law requires the Public Utilities Commission to continue a program of assistance to low-income electric and gas customers with annual household incomes that are no greater than 200% of the federal poverty guideline levels, the cost of which shall not be borne solely by any single class of customer.\nThis bill would require the commission to establish rates for CARE program participants, subject to specified requirements, including that the average effective CARE discount not be less than 30% or more than 35% of revenues that would have been produced for the same billed usage by non-CARE customers. The bill would authorize recovery of all administrative costs associated'

The generated summary is 685 characters long, against 11,290 characters in the source bill.

In [35]:
summs = []

for txt in tqdm(data['test']['text']):
  in_ids = tokenizer.encode(
      txt,
      return_tensors = 'pt',
      max_length = 1024,
      truncation = True
  ).to(device)
  summ_ids = model.generate(
      input_ids = in_ids,
      bos_token_id = model.config.bos_token_id,
      eos_token_id = model.config.eos_token_id,
      max_length = 256,
      min_length = 32,
      num_beams = 4,
  )
  decoder_txt = tokenizer.decode(summ_ids[0], skip_special_tokens = True)
  summs.append(decoder_txt)


# ROUGE

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.
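
ROUGE scores a candidate summary by its overlap with a reference: ROUGE-1 and ROUGE-2 count shared unigrams and bigrams, and ROUGE-L uses the longest common subsequence. As a simplified sketch (lowercased whitespace tokens, no stemming, a single reference; the function names are illustrative and this is not the `rouge_score` implementation), the F1 variants can be computed like this:

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(match, pred_total, ref_total):
    """Harmonic mean of precision (match/pred_total) and recall (match/ref_total)."""
    if match == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = match / pred_total, match / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(prediction, reference):
    p, r = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(p, r), len(p), len(r))


pred = "the bill would require the commission to set rates"
ref = "the bill requires the commission to establish rates"
print(rouge_n(pred, ref, 1), rouge_n(pred, ref, 2), rouge_l(pred, ref))
```

Because of the differences in tokenization and stemming, the values from this sketch will not match the `evaluate` metric exactly.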

In [36]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24932 sha256=664d5338ddfb879f1352847e90cb637eb7c980757141ac1b3e9485f1c6148474
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [37]:
rouge = evaluate.load('rouge')


In [38]:
%%time

res = rouge.compute(
    predictions = summs,
    references = data['test']['summary']
)

CPU times: user 3.68 s, sys: 15 ms, total: 3.69 s
Wall time: 3.77 s


In [39]:
res

{'rouge1': 0.35775330410764733,
 'rouge2': 0.19916644099616543,
 'rougeL': 0.2435982467584277,
 'rougeLsum': 0.30986575955212003}