In [5]:
# !pip install transformers
# !pip install keras_nlp
# !pip install datasets
# !pip install huggingface-hub
# !pip install nltk
# !pip install rouge-score
# !pip install pytorch_lightning

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch.nn import functional as F
from pprint import pprint
from torch import nn
import pytorch_lightning as pl
# https://www.pytorchlightning.ai/

from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
from sklearn.model_selection import train_test_split

from datasets import load_dataset

## **ABOUT DATASETS - CNN/DAILYMAIL**
The CNN/DailyMail dataset (Hermann et al., 2015) contains about 93k articles from CNN and 220k articles from the Daily Mail. Both publishers supplement their articles with bullet-point summaries. The non-anonymized variant comes from See et al. (2017). The dataset is available for download from [here](https://huggingface.co/datasets/cnn_dailymail).

The dataset has the following fields:

* **id**: a string containing the hexadecimal-formatted SHA1 hash of the URL the story was retrieved from
* **article**: a string containing the body of the news article
* **highlights**: a string containing the highlight of the article as written by the article author
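
To make the `id` format concrete, here is a minimal sketch of hashing a URL with SHA1 and rendering it as hexadecimal. The URL is hypothetical, and the exact URL normalization used when the dataset was built is not stated here, so this only illustrates the shape of the field.

```python
import hashlib

# Hypothetical URL, used only to illustrate the id format.
url = "https://example.com/2015/01/01/some-story/index.html"

# SHA1 digest of the URL bytes, rendered as hexadecimal characters.
story_id = hashlib.sha1(url.encode("utf-8")).hexdigest()

print(story_id)
print(len(story_id))  # a SHA1 hex digest is always 40 characters long
```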

In [3]:
df = load_dataset('cnn_dailymail', '3.0.0', split='train[:8%]')

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de. Subsequent calls will reuse this data.


In [4]:
print(df)

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 22969
})
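
The row count follows from the `train[:8%]` slice. A quick check, assuming the percent boundary is rounded to the nearest integer:

```python
train_size = 287113  # size of the full train split, from the download log
percent = 8          # from split='train[:8%]'

# Assumption: the percent boundary is rounded to the nearest integer.
n_rows = round(train_size * percent / 100)
print(n_rows)  # 22969
```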


## **BART Pre-Training: Noising Methodologies**

The corruption schemes used in the BART paper are summarized below.

***Token Masking*** — A random subset of the input tokens is replaced with [MASK] tokens, as in BERT.

***Token Deletion*** — Random tokens are deleted from the input. The model must decide which positions are missing (as the tokens are simply deleted and not replaced with anything else).

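***Text Infilling*** — A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3) in the paper, and each span is replaced with a single [MASK] token. A span of length 0 corresponds to inserting a [MASK] token.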

***Sentence Permutation*** — The input is split based on periods (.), and the sentences are shuffled.

***Document Rotation*** — A token is chosen at random, and the sequence is rotated so that it starts with the chosen token.

|Corruption Scheme    |Original Text|Corrupted Text|Explanation|
|---------------------|-------------|--------------|-----------|
|Token Masking        |ABC.DE.      |A_C._E.       |Both B and D are masked with a single mask token for each.|
|Token Deletion       |ABC.DE.      |A.C.E.        |Both B and D are deleted (and not replaced).|
|Text Infilling       |ABC.DE.      |A_.D_E.       |The span BC is replaced with a single mask token. A 0 length span is inserted between D and E.|
|Sentence Permutation |ABC.DE.      |DE.ABC.       |Split into sentences at periods (.) and shuffled.|
|Document Rotation    |ABC.DE.      |C.DE.AB       |The sequence is rotated around C.|
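
The five schemes can be sketched on the `ABC.DE.` example as plain list operations. This is an illustrative sketch, not the pre-training code: the function names are hypothetical, each character is treated as one token, and the positions are passed in explicitly so the output is reproducible (during pre-training they would be sampled at random).

```python
MASK = "_"  # stand-in for the [MASK] token, matching the table's notation


def token_masking(tokens, positions):
    """Replace the tokens at the given positions with MASK."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def token_deletion(tokens, positions):
    """Drop the tokens at the given positions, leaving no placeholder."""
    return [t for i, t in enumerate(tokens) if i not in positions]


def text_infilling(tokens, spans):
    """Replace each (start, length) span with one MASK.

    A span of length 0 inserts a MASK without removing any token.
    Spans are assumed not to overlap.
    """
    out, cursor = [], 0
    for start, length in sorted(spans):
        out.extend(tokens[cursor:start])
        out.append(MASK)
        cursor = start + length
    out.extend(tokens[cursor:])
    return out


def sentence_permutation(tokens, order):
    """Split into sentences at '.', then reorder them according to `order`."""
    sentences, current = [], []
    for t in tokens:
        current.append(t)
        if t == ".":
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final period
        sentences.append(current)
    return [t for i in order for t in sentences[i]]


def document_rotation(tokens, start):
    """Rotate the sequence so that it begins at index `start`."""
    return tokens[start:] + tokens[:start]


tokens = list("ABC.DE.")  # indices: A=0, B=1, C=2, .=3, D=4, E=5, .=6

print("".join(token_masking(tokens, {1, 4})))             # A_C._E.
print("".join(token_deletion(tokens, {1, 4})))            # AC.E.
print("".join(text_infilling(tokens, [(1, 2), (5, 0)])))  # A_.D_E.
print("".join(sentence_permutation(tokens, [1, 0])))      # DE.ABC.
print("".join(document_rotation(tokens, 2)))              # C.DE.AB
```

Literal deletion of B and D yields `AC.E.`, so the `A.C.E.` in the Token Deletion row is best read as a schematic rendering rather than an exact output.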


## **BART Fine Tuning: Methodologies**

**What is fine-tuning?**

Fine-tuning, in general, means making small adjustments to a process to achieve the desired output or performance. In deep learning, fine-tuning means initializing a model with the weights of a previously trained model and continuing training on a new, related task. The weights are the learned parameters that connect the neurons of one layer to the neurons of the next. Because the pre-trained weights already encode useful information, fine-tuning significantly reduces the time needed to train a model for the new task.

**Why fine-tuning?**

***Pros:***

* Greatly **reduced training time**. With pre-trained weights, the model's first few layers are already very effective, so often only the final layers need to be trained.

* **Improved performance**. The models typically used are pre-trained on large-scale datasets (for vision models, most commonly ImageNet). Because a CNN's performance improves with more training data, the lower-level filters of pre-trained models are probably superior to filters trained on smaller datasets.

* **Counter over-fitting on small datasets**. CNNs need a lot of data to generalize properly, even when data augmentation techniques are applied. When trained on small datasets, their lower- and mid-level filters tend to adapt specifically to the training set, leading the model to overfit. In contrast, ImageNet is very large (millions of images) and very diverse (1000 classes), so the filters of CNNs trained on it can extract very generic features. Using a pre-trained network is the only way I'm aware of to train a CNN effectively on a small dataset.

***Cons:***

* There is no guarantee that the pre-trained weights are a good starting point; the model could be stuck in a **local minimum**. Training a model from scratch, on the other hand, could lead to a better solution that is unreachable from the pre-trained initialization. This is relevant if both runs (initial training and fine-tuning) are done on the same dataset.

* **Restricted architecture**. The most important downside of using pre-trained models is that we are restricted to exactly the same architecture, which might not be desirable. The good thing is that pre-trained weights are available for almost all state-of-the-art models.
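
The "train only the final layers" idea from the pros above can be sketched in PyTorch by freezing parameters. The small network here is a hypothetical stand-in for a pre-trained model, and the data is random; the sketch only shows the mechanics of partial freezing, not a recommended recipe for BART.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained network: a "body" plus a task head.
model = nn.Sequential(
    nn.Linear(16, 32),  # pretend these weights were loaded from a checkpoint
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 4),   # new task-specific head
)

# Freeze every parameter, then unfreeze only the final layer.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# Only the trainable parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# One training step on random data, just to show the mechanics.
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
frozen_before = model[0].weight.clone()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable} / {n_total}")
```

Freezing trades flexibility for speed: fewer gradients are computed and stored, but the frozen layers cannot adapt to the new task.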



**Fine-Tuning BART on Downstream Tasks**

The representations produced by BART can be used in
several ways for downstream applications. The descriptions below follow the BART paper.


* **Sequence Classification Tasks**
For sequence classification tasks, the same input is fed
into the encoder and decoder, and the final hidden state
of the final decoder token is fed into a new multi-class
linear classifier. This approach is related to the CLS
token in BERT; however, we add the additional token
to the end so that the representation for the token in the
decoder can attend to decoder states from the complete
input.


* **Token Classification Tasks**
For token classification tasks, such as answer endpoint
classification for SQuAD, we feed the complete document into the encoder and decoder, and use the top
hidden state of the decoder as a representation for each
word. This representation is used to classify the token.


* **Sequence Generation Tasks**
Because BART has an autoregressive decoder, it can be
directly fine-tuned for sequence generation tasks such
as abstractive question answering and summarization.
In both of these tasks, information is copied from the input but manipulated, which is closely related to the
denoising pre-training objective. Here, the encoder input is the input sequence, and the decoder generates
outputs autoregressively.

* **Machine Translation**
We also explore using BART to improve machine translation decoders for translating into English. Previous
work (Edunov et al., 2019) has shown that models can
be improved by incorporating pre-trained encoders, but
gains from using pre-trained language models in decoders have been limited. We show that it is possible
to use the entire BART model (both encoder and decoder) as a single pretrained decoder for machine translation, by adding a new set of encoder parameters that
are learned from bitext.
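
The two classification setups described above differ only in which decoder hidden states reach the classifier. A minimal sketch, assuming random tensors in place of real BART decoder outputs; the sizes, the end-of-sequence positions, and the variable names are all hypothetical:

```python
import torch
from torch import nn

torch.manual_seed(0)

batch, seq_len, d_model = 2, 6, 16
num_classes, num_token_labels = 3, 2

# Stand-in for the decoder's top-layer hidden states; in practice these
# would come from a BART forward pass.
decoder_hidden = torch.randn(batch, seq_len, d_model)

# Hypothetical index of the final (end-of-sequence) token in each example,
# allowing for right padding in the second example.
last_token_idx = torch.tensor([5, 3])

# Sequence classification: the hidden state of the final decoder token
# is fed into a new linear classifier.
seq_classifier = nn.Linear(d_model, num_classes)
final_state = decoder_hidden[torch.arange(batch), last_token_idx]  # (batch, d_model)
seq_logits = seq_classifier(final_state)                            # (batch, num_classes)

# Token classification: every position is classified from its own hidden state.
tok_classifier = nn.Linear(d_model, num_token_labels)
tok_logits = tok_classifier(decoder_hidden)  # (batch, seq_len, num_token_labels)

print(seq_logits.shape, tok_logits.shape)
```

In both cases the new linear layer is the only component without pre-trained weights; everything beneath it starts from the BART checkpoint.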