# Assignment 3: Summarization Tests

**Description:** This assignment covers summarization outputs. You will compare three different types of solutions, all using an encoder-decoder architecture. You should also be able to develop an intuition for:


* How well summarization systems work
* The effects of using different pre-training and fine-tuning checkpoints on outcomes
* The effects of hyperparameters on outcomes



This notebook should be run on Google Colab. It does not require a GPU, because the generation of summaries works in a timely fashion on a CPU. By default, when you open the notebook in Colab it will not configure a GPU.  Summarization commands can take up to five minutes to run, depending on the hyperparameters you use.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-summer-main/blob/master/assignment/a3/Summarization_test.ipynb)

The overall assignment structure is as follows:

1. T5 for summarization

2. Pegasus for summarization

3. BART for summarization




**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.

* **### YOUR CODE HERE** indicates that you are supposed to write code.



In [None]:
!pip install -q sentencepiece

In [None]:
!pip install -q transformers

In [None]:
!pip install -q datasets

Let's leverage the pre-trained and fine-tuned models on Hugging Face to demonstrate some capabilities.  They include models/checkpoints that were fine-tuned on a particular dataset.  We can leverage the datasets library to look at some of their outputs.

In [None]:
#let's make longer output readable without scrolling
from pprint import pprint

We'll use this same toy article as the input to all of our summarization attempts.  That way we have the ability to compare.

In [None]:
ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
    "The record breaking drought has made the current conditions even worse than in previous years. It exponentially"
    "increases the probability of large scale wildfires."
)
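
One Python detail matters for a cell like this: adjacent string literals inside parentheses are joined exactly as written, with no separator added, so whether a piece ends in a space changes the text the tokenizer will see. A minimal illustration, using two shortened stand-in strings (the variable names are hypothetical and not part of the assignment):

```python
# Adjacent string literals are concatenated with no separator inserted.
with_space = (
    "last through at least midday tomorrow. "
    "The record breaking drought"
)
without_space = (
    "last through at least midday tomorrow."
    "The record breaking drought"
)

print(with_space)
print(without_space)
# The two strings differ by exactly one character: the space after the period.
print(len(with_space) - len(without_space))
```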


### 1. T5 for summarization

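T5 is an encoder-decoder architecture that has been trained on multiple tasks, not purely summarization.  Note that the google/t5-v1_1-base checkpoint used here was pre-trained on C4 only, without the supervised task mixture of the original T5, so its summaries may be weaker than those of a checkpoint fine-tuned for summarization.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/t5).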

In [None]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

model = TFT5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-base")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at google/t5-v1_1-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [None]:
model.summary()

Model: "tft5_for_conditional_generation_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (TFSharedEmbeddings)  multiple                 24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  84954240  
                                                                 
 decoder (TFT5MainLayer)     multiple                  113275008 
                                                                 
 lm_head (Dense)             multiple                  24674304  
                                                                 
Total params: 247,577,856
Trainable params: 247,577,856
Non-trainable params: 0
_________________________________________________________________


Since T5 can perform multiple tasks, we need to tell it what kind of output we want.  We therefore prepend a task prefix (a "prompt") to our article text to make sure it does the right thing.

In [None]:
PROMPT = 'summarize: '
T5ARTICLE_TO_SUMMARIZE = PROMPT + ARTICLE_TO_SUMMARIZE
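
The same idea can be written as a small helper, sketched here under the assumption that you may want to try other tasks later. The names `TASK_PREFIXES` and `with_task_prefix` are hypothetical and exist only for this example; the two prefix strings follow the convention described in the original T5 paper.

```python
# Hypothetical lookup of task prefixes; the strings follow the T5 paper's convention.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_de": "translate English to German: ",
}


def with_task_prefix(text, task):
    """Return text with the prefix for the requested task prepended."""
    return TASK_PREFIXES[task] + text


example = with_task_prefix("PG&E scheduled the blackouts.", "summarize")
print(example)
```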

In [None]:
inputs = tokenizer(T5ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

In [None]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 96), dtype=int32, numpy=
array([[21603,    10,     3,  7861,   184,   427,  4568,    34,  5018,
            8,  1001,   670,     7,    16,  1773,    12,  7555,     7,
           21,   306, 13551, 18905,  2192,  1124,     5,    37,  2674,
           19,    12,  1428,     8,  1020,    13,  3645,  6608,     7,
            5, 10455,   120,  8640,  7863,   722,   130,  5018,    12,
           36,  4161,    57,     8,  6979,  1647,     7,    84,   130,
         1644,    12,   336,   190,    44,   709,  2076,  1135,  5721,
            5,   634,  1368,  7814, 19611,    65,   263,     8,   750,
         1124,   237,  4131,   145,    16,  1767,   203,     5,    94,
        25722,   120,    77, 24706,     7,     8, 15834,    13,   508,
         2643,  3645,  6608,     7,     5,     1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 96), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 

In [None]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

'. PG&E has a total of 1.2 million customers..'


In [None]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=1,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=20)
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

'. PG&E has not yet responded to any inquiries regarding the blackouts but'


Let's experiment with the four hyperparameters shown in the cell above.  Please experiment in the cell below.  The num_beams value sets the width of the beam search: the number of candidate sequences the decoder keeps at each step before returning the highest-scoring one.  The no_repeat_ngram_size is designed to help reduce repetition in the output by preventing any n-gram of that size from appearing twice.  min_length and max_length set boundaries on the size of the summary, measured in tokens.
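
To build intuition for num_beams, here is a toy beam search over a hand-written table of next-token probabilities. Everything in it is an assumption made for illustration: the `NEXT` table, the tokens, and the `beam_search` function are hypothetical, and the real decoder scores tokens with the model and applies further adjustments that this sketch leaves out.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"bird": 0.9, "fish": 0.1},
}


def beam_search(num_beams, steps=2):
    """Keep the num_beams best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # each beam is (log probability, tokens so far)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Sort by score, best first, and keep only the top num_beams.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    best_score, best_tokens = beams[0]
    return best_tokens[1:], math.exp(best_score)


greedy_tokens, greedy_prob = beam_search(num_beams=1)
beam_tokens, beam_prob = beam_search(num_beams=2)
print(greedy_tokens, round(greedy_prob, 2))
print(beam_tokens, round(beam_prob, 2))
```

With one beam the search commits to the locally best first token and cannot recover; with two beams it keeps the second-best first token alive and ends up with a more probable sequence overall.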

*There is no one correct answer to these questions.  There are ranges that tend to work better than others.  The goal is to have you experiment to help build intuition.  Please enter the values that you think are generating the most readable output.*

**QUESTION:**

1.1 What num_beams value gives you the most readable output?

1.2 Which no_repeat_ngram_size gives the most readable output?

1.3 What min_length value gives you the most readable output?

1.4 Which max_length value gives the most readable output?
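
One way to organize the experiment is to loop over a small grid of settings and print each result. The sketch below is an assumption about workflow, not part of the assignment: `sweep` and `fake_generate` are hypothetical names, and `fake_generate` only stands in for a call to the model so that the example runs without downloading anything.

```python
from itertools import product


def sweep(generate_fn, grid):
    """Call generate_fn once per combination of settings in grid."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        results.append((settings, generate_fn(**settings)))
    return results


def fake_generate(num_beams, max_length):
    # Stand-in for the real call; it only echoes the settings it received.
    return f"summary with num_beams={num_beams}, max_length={max_length}"


grid = {"num_beams": [1, 4], "max_length": [15, 20]}
results = sweep(fake_generate, grid)
for settings, summary in results:
    print(settings, "->", summary)
```

In the notebook, the stand-in could be replaced by a function that passes the settings on to `model.generate` and decodes the result with `tokenizer.batch_decode`, as in the cells above.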

In [None]:
# Generate Summary

### YOUR CODE HERE 
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=8,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=15)                   
                             
### END YOUR CODE
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

'the blackouts are scheduled to last through midday tomorrow.'


### 2. Pegasus for summarization 

Pegasus is an encoder-decoder architecture that has been trained as an abstractive summarizer.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/pegasus).

We'll use the google/pegasus-xsum checkpoint.  It is trained on a summarization task that reads a news article and then emits a headline as a summary.  This doesn't mean that it is limited in its output.  It does mean that it works well with news article type inputs.

In [None]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

model = TFPegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

All model checkpoint layers were used when initializing TFPegasusForConditionalGeneration.

All the layers of TFPegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-xsum.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFPegasusForConditionalGeneration for predictions without further training.


In [None]:
model.summary()

Model: "tf_pegasus_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 model (TFPegasusMainLayer)  multiple                  569748480 
                                                                 
Total params: 569,844,583
Trainable params: 569,748,480
Non-trainable params: 96,103
_________________________________________________________________


In [None]:
inputs = tokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

In [None]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 82), dtype=int32, numpy=
array([[14887,   759,  1005,  3163,   126,  2798,   109, 25690,   116,
          115,  1407,   112, 13378,   118,   281,  7213, 10754,  1514,
         1047,   107,   139,  2560,   117,   112,  1329,   109,   887,
          113, 39471,   107, 16502,  6194,  4927,   527,   195,  2798,
          112,   129,  2790,   141,   109, 87338,   116,   162,   195,
         1214,   112,   289,   224,   134,   583, 26568,  3469,   107,
          159,  1093,  4282, 11945,   148,   266,   109,   582,  1047,
          254,  3150,   197,   115,  1331,   231,   107,   168, 24168,
        62626,   116,   109, 11134,   113,   423,  2116, 39471,   107,
            1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 82), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [None]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

("California's largest electricity provider has cut power to tens of thousands "
 'of customers in an effort to reduce the risk of wildfires.')


In [None]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=1,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=20)
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

("California's largest power company has announced it will cut electricity to "
 'tens of thousands more customers')


Let's experiment with the same hyperparameters for the Pegasus system.  It is designed for abstractive summarization.
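
To see what no_repeat_ngram_size does at a single decoding step, the sketch below computes which next tokens would be banned given the text generated so far. It is a simplified illustration under stated assumptions: `banned_next_tokens` is a hypothetical helper, and it works on whole words for readability, whereas the model works on token ids.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already in generated."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        # If an earlier n-gram starts with the same prefix, its last token is banned.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["the", "power", "cut", "the", "power"]
for size in (1, 2, 4):
    print(size, sorted(banned_next_tokens(generated, size)))
```

With a size of 1, every token that has already appeared is banned, including common words such as "the", which is one reason very small values can hurt readability.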

**QUESTION:**

2.1 What num_beams value gives you the most readable output?

2.2 Which no_repeat_ngram_size gives the most readable output?

2.3 What min_length value gives you the most readable output?

2.4 Which max_length value gives the most readable output?

In [None]:
# Generate Summary

### YOUR CODE HERE  
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=15,
                              no_repeat_ngram_size=2,
                              min_length=10,
                              max_length=20)                      
### END YOUR CODE                             
                             
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

("California's largest electricity provider has said it will cut power to tens "
 'of thousands of customers')


### 3. BART for conditional generation

BART is an encoder-decoder architecture that uses a bidirectional transformer like BERT as its encoder and a left-to-right language generator like GPT-2 as its decoder.  It is designed as a translator that takes symbols in and then generates symbols out.  The facebook/bart-large checkpoint has not been explicitly trained as an abstractive summarizer, but it is able to generate text.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/bart).

In [None]:
from transformers import BartTokenizer, TFBartForConditionalGeneration

model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")



All model checkpoint layers were used when initializing TFBartForConditionalGeneration.

All the layers of TFBartForConditionalGeneration were initialized from the model checkpoint at facebook/bart-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


In [None]:
model.summary()

Model: "tf_bart_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 model (TFBartMainLayer)     multiple                  406291456 
                                                                 
Total params: 406,341,721
Trainable params: 406,291,456
Non-trainable params: 50,265
_________________________________________________________________


In [None]:
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, truncation=True, return_tensors="tf")


In [None]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 83), dtype=int32, numpy=
array([[    0,  8332,   947,   717,  2305,    24,  1768,     5,   909,
         4518,    11,  1263,     7,  5876,    13,   239,  2372,  2876,
         3841,  1274,     4,    20,  4374,    16,     7,  1888,     5,
          810,     9, 12584,     4,  9221,  5735,  7673,   916,    58,
         1768,     7,    28,  2132,    30,     5,  2572, 10816,    61,
           58,   421,     7,    94,   149,    23,   513, 15372,  3859,
            4,   133,   638,  3433,  7635,    34,   156,     5,   595,
         1274,   190,  3007,    87,    11,   986,   107,     4,    85,
        30413, 33008,  9354,     5, 18102,     9,   739,  3189, 12584,
            4,     2]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 83), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [None]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False), compact=True)

['PG&E stated it scheduled the blackouts in response to forecasts for high '
 'winds amid']


In [None]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=1,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=20)
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False), compact=True)

['PG&E stated it scheduled the blackouts in response to forecasts for high '
 'winds amid']


Let's experiment with the same hyperparameters for the BART system.  It is designed as a translator, taking words in and generating words as its output.
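
The effect of min_length and max_length can be sketched with a toy greedy decoder. This is an illustration under stated assumptions, not the library's implementation: `decode_with_length_bounds` and the per-step scores are hypothetical, the scores do not depend on earlier choices, and lengths are counted in words rather than model tokens. The mechanism it mimics is that the end-of-sequence token is blocked until min_length is reached, and generation is cut off at max_length.

```python
EOS = "</s>"


def decode_with_length_bounds(step_scores, min_length, max_length):
    """Greedy decoding over fixed per-step scores with length bounds applied."""
    output = []
    for scores in step_scores:
        if len(output) >= max_length:
            break  # hard stop, even if the sentence is unfinished
        allowed = dict(scores)
        if len(output) < min_length:
            allowed.pop(EOS, None)  # EOS cannot be chosen before min_length
        token = max(allowed, key=allowed.get)
        if token == EOS:
            break
        output.append(token)
    return output


# Hypothetical scores: the model "wants" to stop after two words.
steps = [
    {"blackouts": 0.9, EOS: 0.1},
    {"scheduled": 0.8, EOS: 0.2},
    {"amid": 0.3, EOS: 0.7},
    {"winds": 0.6, EOS: 0.4},
    {"today": 0.1, EOS: 0.9},
]

print(decode_with_length_bounds(steps, min_length=0, max_length=10))
print(decode_with_length_bounds(steps, min_length=4, max_length=10))
print(decode_with_length_bounds(steps, min_length=3, max_length=3))
```

The last call shows how a tight max_length can cut the output off mid-phrase, similar to the BART summaries above that end on "winds amid".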

**QUESTION:**

3.1 What num_beams value gives you the most readable output?

3.2 Which no_repeat_ngram_size gives the most readable output?

3.3 What min_length value gives you the most readable output?

3.4 Which max_length value gives the most readable output?

In [None]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"],
### YOUR CODE HERE 
                              num_beams=1,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=23                            
### END YOUR CODE
                             )
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False), compact=True)

['PG&E stated it scheduled the blackouts in response to forecasts for high '
 'winds amid dry conditions.']


Okay, you're done.  

Which model do you think produced the best summaries, keeping in mind that "best" is in the eye of the reader?
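
If you want one rough number to set beside your own reading, you could measure how much of each summary is copied from the article. The sketch below is only an assumption about what might be useful: `words` and `copied_fraction` are hypothetical helpers, the article is shortened to its first two sentences, and a high or low value says nothing by itself about quality.

```python
import re


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9&']+", text.lower())


def copied_fraction(summary, article):
    """Fraction of summary words that also occur somewhere in the article."""
    article_words = set(words(article))
    summary_words = words(summary)
    if not summary_words:
        return 0.0
    return sum(w in article_words for w in summary_words) / len(summary_words)


article = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires."
)
bart_summary = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high "
    "winds amid dry conditions."
)
pegasus_summary = (
    "California's largest electricity provider has cut power to tens of thousands "
    "of customers in an effort to reduce the risk of wildfires."
)

print(round(copied_fraction(bart_summary, article), 2))
print(round(copied_fraction(pegasus_summary, article), 2))
```

A summary that copies the article word for word scores 1.0, while an abstractive summary that rephrases the content scores lower.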