# Finetuning BART for abstractive text summarisation with fastai2

A great thing about working in NLP at the moment is being able to park a hard problem for a few weeks and discovering the community making massive amounts of progress on your behalf. I used to be overwhelmed by the challenge of just training a summarisation model to generate plausible looking text without burning through tonnes of cash on GPUs. Then [BertExtAbs](../finetuning-bertsumextabs) came along and solved that problem. Unfortunately, it still gernerated incoherent sentences sometimes and had a habit of confusing entities in an article. You certainly couldn't trust it to convey the facts of an article reliably.

Enter BART (Bidirectional and Auto-Regressive Transformers). Here we have a model that generates staggeringly good summaries and has a wonderful implementation from Sam Shleifer at HuggingFace. It's still a work in progress, but after digging around in the Transformers pull requests and with help from [Morgan McGuire's FastHugs notebook](https://github.com/morganmcg1/fasthugs) I have put together this notebook for fine-tuning BART and generating summaries. Feedback welcome!

I should mention that this a big model requiring big inputs. For fine-tuning I've been able to get a batch size of 4 and a maximum sequence length of 512 on an AWS P3.2xlarge (~£4 an hour).

We begin with a bunch of imports and an args object for storing variables we will need. We'll be finetuning the model on the Curation Corpus of abstractive text summaries. We load it into a dataframe using Pandas. For more information about how to access this dataset for your own purposes please see our [article introducing the dataset](https://medium.com/curation-corporation/teaching-an-ai-to-abstract-a-new-dataset-for-abstractive-auto-summarisation-5227f546caa8).

In [1]:
%reload_ext autoreload
%autoreload 2

Install requirements

In [2]:
# import sys  
# !{sys.executable} -m pip install -r ../requirements.txt



In [55]:
import logging
import os
import sys
from tqdm.notebook import tqdm

from fastprogress import progress_bar
from fastai2.basics import *
from fastai2.data import *
from fastai2.text.all import *
from fastai2.callback.all import *
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import PreTrainedTokenizer, BartTokenizer, BartForConditionalGeneration, BartConfig 
import torch
from torch import nn

sys.path.append('..')
logging.getLogger().setLevel(100)

Hopefully we will be able to increase our batch size and/or maximum sequence lengths when some pull requests to reduce the model's memory footprint get merged into the Transformers repository

In [9]:
class Namespace:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
        
args = Namespace(
    batch_size=4,
    max_seq_len=512,
    data_path="../data/private_dataset.file",
    device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu"), # ('cpu'),
    stories_folder='../data/my_own_stories',
    subset=None,
    test_pct=0.1
)

Create dataset after scraping

In [28]:
def showmult(*args):
    return [i for i in args]

In [5]:
paths = glob.glob('../data/*/', recursive=True)
paths = [Path(p) for p in paths]
paths

[Path('../data/curation_corpus'),
 Path('../data/SemanticScholarAbstractSectionSummaryDataSet'),
 Path('../data/ArxivStructuredAbstractSectionalSummaries'),
 Path('../data/wikihow')]

In [56]:
paths = glob.glob('../raw_data/*/', recursive=True)
paths = [Path(p) for p in paths]

# read in datasets. use more of a dict approach here
cols2keep = {
    'SemanticScholarAbstractSectionSummaryDataSet': ['paperSection', 'Summary'],
    'ArxivStructuredAbstractSectionalSummaries': ['paperSection', 'Summary'],
    'wikihow': ['text', 'overview'],
    'curation_corpus': ['article_content', 'summary'],  # renames 'article_content' column as 'text'
}

output_dir = Path('../data')


# appends everything to the same dataframe, so far that's managable
data = []
for p in paths:

    name = os.path.split(p)[-1]

    print(f"processing {name}")

    # read files
    files = p.iterdir()
        
    # wrangle curation corpus
    if 'curation_corpus' in name:
        parent_dir = list(files)[0].parent
        summaries = pd.read_csv(parent_dir / 'curation-corpus-base.csv')
        text = pd.read_csv(parent_dir / 'curation-corpus-base-with-articles.csv')
        text = text[text.html != 'Exception']
        df = pd.merge(text, summaries, on='url')[['article_content', 'summary']]

    # read wikihow data
    elif 'wikihow' in name:
        df = pd.read_csv(list(files)[0])

    # read everything else
    else:
        dfs = [
            pd.read_parquet(f) for f in files
            if 'parquet' in str(f)
        ]
        df = pd.concat(dfs)

    df = df[cols2keep[name]]
    df.columns = ['text', 'summary']
    df['data_src'] = name
    data.append(df)

    print(f"{name} done")

data = pd.concat(data, ignore_index=True)
data.head()

processing curation_corpus
curation_corpus done
processing SemanticScholarAbstractSectionSummaryDataSet
SemanticScholarAbstractSectionSummaryDataSet done
processing ArxivStructuredAbstractSectionalSummaries
ArxivStructuredAbstractSectionalSummaries done
processing wikihow
wikihow done


Unnamed: 0,text,summary,data_src
0,"Credit: CC0 Public Domain\n \n\nUniversity of British Columbia researchers have found a cheap, sustainable way to build a solar cell using bacteria that convert light to energy.\n \nTheir cell generated a current stronger than any previously recorded from such a device, and worked as efficiently in dim light as in bright light.\nThis innovation could be a step toward wider adoption of solar power in places like British Columbia and parts of northern Europe where overcast skies are common. With further development, these solar cells—c...","Researchers in Canada have developed an innovative solar cell which uses bacteria to convert light into energy. The team at the University of British Columbia developed a way of genetically engineering E.coli to produce large amounts of lycopene, a natural dye which bacteria use for photosynthesis. By coating the bacteria with a semiconducting substance and incorporating it into a battery cell, they were able to achieve a current density of 0.686 milliamps per square centimetre, which they say is the highest yet achieved by a biogenic solar cell.",curation_corpus
1,"Genesis Motor America is launching two new online tools as part of its newly redesigned website.In conjunction with the debut of the Genesis G70, AOR Innocean took gaming technologies and\nInstagram Stories and created two new automotive ""configurators.""The first, a web-based Automotive Real-Time 3D Configurator, can only be found on the updated website, Genesis.com. Here, users can build a virtual vehicle while exploring inside and outside the car, activating animations that emulate real vehicle actions including\nheadlights, sunroof, and dashboard. They can even pop the trunk open. Users...","Luxury car maker Genesis Motor America launched a redesigned website that included two interactive tools that enable users to build a virtual vehicle complete with animations that mimic functions such as opening the sunroof. One of the tools allows Instagram users to build their car on the app using Instagram Stories. Innocean lead the project handling the creative and concept design. MediaMonks was one of four other companies that participated on the project, working on the WebGL Game Engine.\n",curation_corpus
2,"When Jamie Hodari started looking for funding for his New York-based co-working startup, Industrious, in 2012, he didn’t even bother talking to venture capital firms. \n“We were just so confident that VCs didn’t fund real estate that it wasn’t worth trying,” he said. \nInstead, Hodari and his co-founder, Justin Stewart, went to anyone they knew — parents, siblings, aunts, uncles, friends — who had money and pitched them on investing in individual locations. While they managed to raise $8 million, it was tedious. They couldn’t sign a lease or get going on a project until they had every dime...","Real-estate tech has become hot property for VC funds. Struggling to find seed funding when it launched in 2012, coworkspace operator Industrious was forced to tap friends and family for $8m. By early 2018, two rounds later, it had raised a total of $142m and had a list of 64 funds keen to invest in the sector. Industrious is one of hundreds benefiting from the explosion of interest in real-estate venture investment. From 2012 to 2017, investment in US real-estate tech leapt from $44.7m to $5.7bn, according to Pitchbook, as VC funds exclusively devoted to the sector emerged and family real...",curation_corpus
3,"British Insurers have today highlighted the potential dangers of ‘autonomous ambiguity’, as vehicles with different levels of autonomy, or driverless technology, increasingly become a feature of UK roads.With important and wide-reaching changes being defined by international regulators on what Assisted and Automated systems can and can't do, the Automated Driving Insurer Group (ADIG), led by the Association of British Insurers (ABI) in collaboration with Thatcham Research, has released a white paper setting out the latest position of UK insurers.The ‘Regulating Automated Driving’ paper is ...","While they believe vehicle automation will significantly reduce accidents, British insurers, including esure, have voiced concerns over ‘autonomous ambiguity’ as vehicles with different levels of autonomy take to the roads. The insurers have called for international regulators to make clear distinctions between assisted and automated systems, and set out criteria for marketing such vehicles. It is suggested that ""intermediate automated systems"", which offer significant self-driving capability but require drivers to reclaim control of the vehicle in certain circumstances, could leave driver...",curation_corpus
4,"Sir Martin Sorrell, the former boss of advertising giant WPP, is poised to wrap up a second takeover at his new venture this week.Sorrell, 73, who dramatically quit WPP following a board investigation into his conduct, is now running S4 Capital, a vehicle he set up to acquire marketing and advertising businesses.S4 is likely to confirm a takeover of US programmatic ad firm MightyHive in the coming days, The Mail on Sunday understands. The firm has been valued at up to $200 million (£157 million). Pastures new: Sir Martin Sorrell dramatically quit WPP following a board investigation into hi...","S4 Capital is close to acquiring MightyHive, with a deal expected within the next few days, according to The Mail on Sunday. The deal would be S4 Capital’s second acquisition since it was founded earlier this year; it beat WPP in a bidding war for MediaMonks in June. US programmatic ad firm MightyHive has been valued at up to $200m.\n",curation_corpus


In [62]:
output_dir = Path('../data')

nchunks = 40
datas = np.array_split(data, nchunks)

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

for ix, d in enumerate(tqdm(datas)):
    d.to_parquet(output_dir / f'data{str(ix).zfill(2)}.parquet.gzip', compression='gzip')

HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))




In [35]:
# ds = pd.read_feather(args.data_path).iloc[:args.subset]
ds = data
ds = ds[ds['summary'] != '']
train_ds, test_ds = train_test_split(ds, test_size=args.test_pct, random_state=42)
valid_ds, test_ds = train_test_split(test_ds, test_size=0.5, random_state=42)

To pass our data to the model in our fastai2 learner object we need a dataloader. To create a dataloader we need a Datasets object, batch size, and device type. To create a Datasets object, we have to pass a few things:
- Our raw data which in our case is a Pandas dataframe
- A list of transforms. Or to be more precise a list containing the list of transforms to perform on our inputs and a list of transforms to perform on our desired outputs. I've defined a transform below that encodes the text using the BART tokenizer. Mostly it will be the encodes class method that gets called by fastai2. However the decodes method can also be useful if you want to reverse the process.
- We will also split our data into training and validation datasets here, using fastai2's RandomSplitter class.

In [36]:
tokenizer = BartTokenizer.from_pretrained('bart-large-cnn', add_prefix_space=True)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=898823.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




I'm still exploring whether it is necessary to pass any of the masks and other ids manually or if it is handled for us. Any advice here would be much appreciated!

In [37]:
class DataTransform(Transform):
    def __init__(self, tokenizer:PreTrainedTokenizer, column:string):
        self.tokenizer = tokenizer
        self.column = column
        
    def encodes(self, inp):  
        tokenized = self.tokenizer.batch_encode_plus(
            [list(inp[self.column])],
            max_length=args.max_seq_len, 
            pad_to_max_length=True, 
            return_tensors='pt'
        )
        return TensorText(tokenized['input_ids']).squeeze()
        
    def decodes(self, encoded):
        decoded = [
            self.tokenizer.decode(
                o, 
                skip_special_tokens=True, 
                clean_up_tokenization_spaces=False
            ) for o in encoded
        ]
        return decoded

In [38]:
x_tfms = [DataTransform(tokenizer, column='text')]
y_tfms = [DataTransform(tokenizer, column='summary')]
dss = Datasets(
    train_ds, 
    tfms=[x_tfms, y_tfms], 
    splits=RandomSplitter(valid_pct=0.1)(range(train_ds.shape[0]))
)

In [39]:
dls = dss.dataloaders(bs=args.batch_size, device=args.device.type)

This function lets us choose between loading the model architecture with Facebook's pretrained weights, the model architecture with our own weights stored locally, or the model architecture with no pretraining at all.

In [40]:
def load_hf_model(config, pretrained=False, path=None): 
    if pretrained:    
        if path:
            model = BartForConditionalGeneration.from_pretrained(
                "bart-large-cnn", 
                state_dict=torch.load(path, map_location=torch.device(args.device)), 
                config=config
            )
        else: 
            model = BartForConditionalGeneration.from_pretrained("bart-large-cnn", config=config)
    else:
        model = BartForConditionalGeneration()

    return model.to(args.device)

The model will return a lot of different things, but we only want the weights to calculate the loss when training, so we will wrap the model in this class to control what gets passed to the loss function.

In [41]:
class FastaiWrapper(Module):
    def __init__(self):
        self.config = BartConfig(vocab_size=50264, output_past=True)
        self.bart = load_hf_model(config=self.config, pretrained=True)
        
    def forward(self, x):
        output = self.bart(x)[0]
        return output

You can think of seq2seq tasks as a series of attempts to categorise which word should come next. Cross entropy loss is a pretty good loss function for this use case. We want to normalise it by how many non padding words are in each sequence.

In [42]:
class SummarisationLoss(Module):
    def __init__(self):
        self.criterion = torch.nn.CrossEntropyLoss()
        
    def forward(self, output, target):
        x = F.log_softmax(output, dim=-1)
        norm = (target != 1).data.sum()
        return self.criterion(x.contiguous().view(-1, x.size(-1)), target.contiguous().view(-1)) / norm

### Training

When fine-tuning the model we start by just training the top layer(s). You can experiment by unfreezing layers further down in the decoder, and then (if you're feeling bold) then encoder. fastai2 provides an easy way to split the model up into groups with frozen or unfrozen parameters.

In [12]:
def bart_splitter(model):
    return [
        params(model.bart.model.encoder), 
        params(model.bart.model.decoder.embed_tokens),
        params(model.bart.model.decoder.embed_positions),
        params(model.bart.model.decoder.layers),
        params(model.bart.model.decoder.layernorm_embedding),
    ]

I've been experimenting with half precision training. In theory this will save a lot of memory. However, I find my loss quickly becomes a bunch of nans. This may be an issue with HuggingFace's implementation or it may be an issue with my code. I'll update if I work out how to get fp16() working. Do let me know if you have any ideas!

In [13]:
learn = Learner(
    dls, 
    FastaiWrapper(), 
    loss_func=SummarisationLoss(), 
    opt_func=ranger,
    splitter=bart_splitter
)#.to_fp16()

In [14]:
learn.freeze_to(-1)

I've been finding that the learning rate finder suggests values that are too high. Your mileage may vary though.

In [15]:
# learn.lr_find()

In [17]:
learn.fit_flat_cos(
    1,
    lr=1e-4
)

If you do carry on unfreezing layers, you may find that you need to reduce your batch size to fit everything in memory. Also you should probably lower your learning rate.

In [19]:
learn.freeze_to(-2)
learn.dls.train.bs = args.batch_size//2
learn.dls.valid.bs = args.batch_size//2

In [None]:
learn.lr_find()

In [19]:
learn.fit_flat_cos(
    2,
    lr=1e-5
)

epoch,train_loss,valid_loss,time
0,0.001866,0.00179,01:40
1,0.001834,0.001778,01:41


Now that everything is done we can export the model

In [18]:
learn.export('../models/fintuned_bart.pkl')

### Inference

In [19]:
learn = load_learner('../models/fintuned_bart.pkl')

The following code for generating the summaries comes from [Sam Shleifer's example in the Transformers repository](https://github.com/huggingface/transformers/blob/master/examples/summarization/bart/evaluate_cnn.py). 

In [18]:
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def generate_summaries(lns, out_file, batch_size=4):
    dec = []
    for batch in progress_bar(list(chunks(lns, batch_size))):
        dct = tokenizer.batch_encode_plus(
            batch, 
            max_length=1024, 
            return_tensors="pt", 
            pad_to_max_length=True
        )
        
        summaries = learn.model.bart.to(args.device).generate(
            input_ids=dct["input_ids"].to(args.device),
            num_beams=4,
            length_penalty=2.0,
            max_length=142,
            min_length=56,
            no_repeat_ngram_size=3,
        )
        
        dec.extend([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries])
        
    return dec

In [19]:
lns = [" " + x.rstrip() for x in list(test_ds['text'])[:8]]
bart_sums = generate_summaries(lns, f'{args.stories_folder}/output.txt', batch_size=args.batch_size)

In [20]:
for s in bart_sums[:8]:
    print(s)
    print("***************")

OPPO's first 5G smartphone has received 5G CE certification, paving the way for the company to commercially launch the device in Europe. The phone maker claims that it is the first multi-frequency, multi-mode and multi-EN-DC (which means dual connectivity for LTE and 5G) smartphone to be certified by CTC Advanced GmbH. OPPO inked a global patent license agreement with Ericsson, which covers the patent portfolios of both companies in 2G, 3G and 4G, as well as cooperation on device testing, customer engagements and a demonstration at MWC19.
***************
Streaming music from Amazon Music, TuneIn, iHeartRadio, and Pandora, with support for Spotify and SiriusXM available shortly. Using the Amazon Alexa App simply create groups of Echo devices and then simply ask Alexa to play on those devices. Amazon is excited to be working with leading brands on this offering, including Sonos, Bose, Sound United, and Samsung.
***************
LanzaTech uses anaerobic bacteria (originally found in rabbit