In [None]:
# default_exp data.summarization

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# data.summarization

> This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for summarization tasks using architectures like BART and T5.

In [None]:
#export
import ast
from functools import reduce

import torch
from transformers import *
from fastai2.text.all import *

from blurr.utils import *
from blurr.data.core import *

In [None]:
#hide
import pdb

from nbdev.showdoc import *
from fastcore.test import *

In [None]:
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')

Using GPU #1: GeForce GTX 1080 Ti


## Summarization tokenization, batch transform, and DataBlock methods

Summarization tasks attempt to generate a human-understandable and sensible representation of a larger body of text (e.g., capture the meaning of a larger document in 1-3 sentences).

In [None]:
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv'); len(cnndm_df)

1000

In [None]:
cnndm_df.head(2)

Unnamed: 0,article,highlights,ds_type
0,"(CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el...","John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in ""century of knowledge""",train
1,"(CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will ""hopefully bring some order"" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ...","NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says .",train


In [None]:
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)

Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['final_logits_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


('bart',
 transformers.tokenization_bart.BartTokenizer,
 transformers.configuration_bart.BartConfig,
 transformers.modeling_bart.BartForConditionalGeneration)

In [None]:
#export
class HF_SummarizationInput(list): pass

We create a subclass of `HF_BatchTransform` for summarization tasks to add `decoder_input_ids` and `labels` to our inputs during training, which in turn allows the huggingface model to calculate the loss for us. See [here](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartModel.forward) for more information on how these additional inputs are used in summarization and conversational training tasks.

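Note also that `labels` is simply the target `input_ids` offset by one position relative to `decoder_input_ids` (`labels` drops the first target token, `decoder_input_ids` drops the last), since the task is to predict the next token based on the current (and all previous) `decoder_input_ids`. Padding positions in `labels` are set to -100 so that they are ignored when the loss is calculated.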

And lastly, we also update our targets to just be the `input_ids` of our target sequence so that fastai's `Learner.show_results` works (again, almost all the fastai bits require returning a single tensor to work).
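
The offset between `decoder_input_ids` and `labels` described above can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors so it runs without any dependencies. The helper name `build_decoder_inputs_and_labels`, the pad id of `1`, and the token ids are assumptions made for illustration only; they are not part of blurr.

```python
PAD_ID = 1            # assumed pad token id, for illustration only
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default


def build_decoder_inputs_and_labels(target_ids, pad_id=PAD_ID):
    """Hypothetical helper mirroring the slicing done in the batch transform."""
    # The decoder is fed every target token except the last one ...
    decoder_input_ids = target_ids[:-1]
    # ... and must predict every target token except the first one.
    # Padding positions are replaced so they do not contribute to the loss.
    labels = [IGNORE_INDEX if t == pad_id else t for t in target_ids[1:]]
    return decoder_input_ids, labels


# A made-up, already padded target sequence: <s> 11 12 13 </s> <pad> <pad>
target_ids = [0, 11, 12, 13, 2, 1, 1]
decoder_input_ids, labels = build_decoder_inputs_and_labels(target_ids)

# At every position i, labels[i] is the token that follows decoder_input_ids[i].
for dec_tok, label in zip(decoder_input_ids, labels):
    print(f"decoder input {dec_tok:>3} -> label {label}")
```

Both sequences end up one token shorter than the original target, which is why they line up position by position.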

In [None]:
#export
class HF_SummarizationBatchTransform(HF_BatchTransform):
    def __init__(self, hf_arch, hf_tokenizer, **kwargs):
        super().__init__(hf_arch, hf_tokenizer, HF_SummarizationInput, **kwargs)
        
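    def encodes(self, samples):  
        samples = super().encodes(samples)
        # no targets present: nothing to add to the inputs
        if (len(samples[0]) == 1): return samples
        
        updated_samples = []
        for s in samples:
            # the decoder is fed every target token except the last ...
            s[0]['decoder_input_ids'] = s[1]['input_ids'][:-1].clone()
            # ... and is asked to predict every target token except the first
            s[0]['labels'] = s[1]['input_ids'][1:].clone()
            # padding positions are set to -100 so the loss ignores them
            s[0]['labels'][s[0]['labels'] == self.hf_tokenizer.pad_token_id] = -100
            
            # the fastai target is just the input_ids of the target sequence
            targ_ids = s[1]['input_ids']
            
            updated_samples.append((s[0], targ_ids))
        
        return updated_samples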
    
    def decodes(self, encoded_samples):
        if (isinstance(encoded_samples, dict)): return self.hf_input_return_type([encoded_samples['input_ids']])
        return [encoded_samples]

We had to override the `decodes` method above because, while both our inputs and targets start out as the same kind of object, we update the latter to consist of *only* the target input_ids so that methods like `Learner.show_results` work. Nevertheless, because fastai remembers what they are, `HF_TokenizerTransform.decodes` will be called for both, and it works on a `list` of input_ids.
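
To make the two branches of `decodes` concrete, here is a minimal sketch using plain Python stand-ins. `SummarizationInputSketch` and `decodes_sketch` are hypothetical names that only mimic the branching logic of the method above; the token ids are made up, and plain lists stand in for tensors.

```python
class SummarizationInputSketch(list):
    """Hypothetical stand-in for HF_SummarizationInput (a plain list subclass)."""


def decodes_sketch(encoded_samples, return_type=SummarizationInputSketch):
    # Inputs arrive as a dict of model arguments; keep only the token ids
    # and wrap them in the typed list used for dispatching show methods.
    if isinstance(encoded_samples, dict):
        return return_type([encoded_samples["input_ids"]])
    # Targets were already reduced to bare token ids, so just wrap them in a list.
    return [encoded_samples]


inputs = {"input_ids": [0, 713, 16, 2], "attention_mask": [1, 1, 1, 1]}
targets = [0, 9226, 2]

decoded_inputs = decodes_sketch(inputs)
decoded_targets = decodes_sketch(targets)

print(type(decoded_inputs).__name__, decoded_inputs)
print(type(decoded_targets).__name__, decoded_targets)
```

Either way the result is a list of input_ids, which is the shape the downstream tokenizer decoding expects according to the note above.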

In [None]:
hf_batch_tfm = HF_SummarizationBatchTransform(hf_arch, hf_tokenizer)

blocks = ( 
    HF_TextBlock(hf_arch, hf_tokenizer), 
    HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm, max_length=150, hf_input_idxs=[0,1])
)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

In [None]:
# dblock.summary(cnndm_df)

In [None]:
dls = dblock.dataloaders(cnndm_df, bs=4)

In [None]:
b = dls.one_batch()

In [None]:
len(b), b[0]['input_ids'].shape, b[1].shape

(2, torch.Size([4, 512]), torch.Size([4, 77]))

In [None]:
#export
@typedispatch
def show_batch(x:HF_SummarizationInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):  
    res = L([ (s[0], s[1]) for s in samples ])          
    display_df(pd.DataFrame(res, columns=['text', 'target'])[:max_n])
    return ctxs

In [None]:
dls.show_batch(dataloaders=dls, max_n=2)

Unnamed: 0,text,target
0,"(CNN) -- Sylvia Robinson, a singer-songwriter who went on to become a pioneer in the hip-hop music business, introducing the seminal ""Rapper's Delight,"" died Thursday in New Jersey of congestive heart failure. She was 76. Best known as an artist for 1973's sultry ""Pillow Talk,"" Robinson was a ""trendsetter"" in music, publicist Lynn K. Hobson told CNN. ""She was known as the founder of hip-hop,"" Hobson said. ""She was vibrant, with an over-the-top personality."" Robinson's singing, producing and songwriting career dated back to the 1950s, when she recorded as ""Little Sylvia"" and later as one half of the duo ""Mickey & Sylvia."" The team's hit ""Love Is Strange,"" which hit the pop charts in early 1957 and reached No. 1 on the rhythm-and-blues chart, found new life three decades later in the 1987 movie ""Dirty Dancing."" She also produced ""Love On a Two-Way Street"" for the Moments in 1970. Born Sylvia Vanterpool, Robinson and her late husband, Joe, founded Sugar Hill Records in 1979 and released the early hip hop hit, ""Rapper's Delight,"" performed by the Sugar Hill Gang. Her eldest son, Joey, was a member of the group she formed. The song, which adapted the musical track of Chic's ""Good Times,"" began with the familiar lines, ""I said a hip hop, a hippie, a hippie to the hip hip hop, you don't stop to rock it."" The label also signed Grandmaster Flash and the Furious Five, which had success in the 1980s, including the hit ""The Message."" Kanye West and Alicia Keys are among the artists who sampled songs associated with Robinson, Hobson said. The funeral is scheduled for October 11 at Community Baptist Church in Englewood, New Jersey. ""RIP to my grandmother,"" MTV personality Darnell Robinson, the entrepreneur's grandson, wrote on his Twitter account Thursday. 
""We lost Mommy Sylvia this morning but she will never be forgotten!"" CNN's Phil Gast contributed to this report.","Singer-songwriter and music entrepreneur dies at 76.\nShe was most known for single ""Pillow Talk""\nSylvia Robinson helped start Sugar Hill Records."
1,"(CNN) -- Rory McIlroy has been reflecting on his meltdown at the Masters, after the 21-year-old squandered a four-stroke lead going into the final round at Augusta, eventually carding an eight-over-par 80 to finish 10 strokes behind winner Charl Schwartzel. Speaking to the official Masters website, the Northern Irishman said: ""I'm very disappointed. ""I was leading this golf tournament with nine holes to go, and I just unraveled. Hit a bad tee shot on 10, and then never really recovered. It's going to be hard to take for a few days, but I'll get over it."" McIlroy is not alone in crumbling on the back nine whilst leading the Masters and joins a famous list that includes great names such as Ben Hogan, Arnold Palmer, Scott Hoch and Greg Norman. Can McIlroy conquer mental minefield? And McIlory admitted that when he missed the fairway on the 13th hole, he knew his Masters dream had died. ""I'd knew that unless I birdied my way in, I realized I didn't have a chance,"" McIlroy said. ""I realized that was it then. ""But this is my first experience at it and hopefully the next time I'm in this position I'll be able to handle it better. ""I didn't handle it particularly well obviously, but it was a character-building day, put it that way. I'll come out stronger for it."" And, speaking on his Twitter feed, McIlroy quoted Muhammad Ali, saying: ""It's repetition of affirmations that leads to belief -- and once that belief becomes a deep conviction, things begin to happen."" McIlory also received some words of encouragement from Schwartzel who said: ""Golf is a really funny game -- one moment you're on top of it, the next moment it bites you. ""Rory is going to feel hurt. It's not easy what he went through. He is such a phenomenal player and he will win one. 
""The first just happens to be the toughest, and that may be the biggest lesson of all,"" added the South African.",Rory McIlory has spoken about his final round heartbreak at the Masters.\nThe 21-year-old wasted a four-stroke lead to finish 10 shots behind Charl Schwartzel.\nThe Northern Irishman also quoted boxer Muhammad Ali on his Twitter page.


## Cleanup

In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_utils.ipynb.
Converted 01_data-core.ipynb.
Converted 01a_data-language-modeling.ipynb.
Converted 01c_data-question-answering.ipynb.
Converted 01d_data-token-classification.ipynb.
Converted 01e_data-summarization.ipynb.
Converted 02_modeling-core.ipynb.
Converted 02a_modeling-language-modeling.ipynb.
Converted 02c_modeling-question-answering.ipynb.
Converted 02d_modeling-token-classification.ipynb.
Converted 02e_modeling-summarization.ipynb.
Converted index.ipynb.
