# Text Summarization via Deep Reinforcement Learning

In this case study, we apply the deep reinforcement learning concepts of this chapter to the task of text summarization, using the Cornell Newsroom summarization dataset. The goal is to show how deep reinforcement learning algorithms can train an agent that learns to generate summaries of news articles. We focus on deep policy gradient and double deep Q-network (DDQN) agents. We will use the following packages in this case study:

* **TensorFlow** is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google.
* **RLSeq2Seq** is an open-source library that implements various RL techniques for text summarization using sequence-to-sequence models: https://github.com/yaserkl/RLSeq2Seq
* **pyrouge** is a Python interface to the Perl-based ROUGE-1.5.5 package, which computes ROUGE scores of text summaries: https://github.com/andersjo/pyrouge

To measure the performance of machine-generated summaries, we will use ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics used to evaluate automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced).

ROUGE-N, ROUGE-S, and ROUGE-L differ in the granularity of the text units that are compared between the system-predicted summary and the reference summaries. ROUGE-N counts overlapping n-grams: ROUGE-1 refers to the overlap of unigrams between the system summary and the reference summary, and ROUGE-2 refers to the overlap of bigrams. ROUGE-L is based on the longest common subsequence, and ROUGE-S on skip-bigrams. Suppose we want to compute the ROUGE-2 precision and recall scores. For ROUGE, recall measures how much of the reference summary is captured by the system summary, while precision measures how much of the system summary also appears in the reference summary.
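
To make the ROUGE-2 computation concrete, here is a minimal sketch in plain Python. The sentence pair and the helper names (`ngrams`, `rouge_n`) are illustrative assumptions, not part of pyrouge, and the official ROUGE-1.5.5 script applies additional processing (such as optional stemming) that is omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(system, reference, n=2):
    """Return (precision, recall) of n-gram overlap between two strings."""
    sys_ngrams = ngrams(system.lower().split(), n)
    ref_ngrams = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both.
    overlap = sum((sys_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(sys_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    return precision, recall


# Hypothetical example pair (not taken from the Newsroom dataset).
system_summary = "the cat was found under the bed"
reference_summary = "the cat was under the bed"

precision, recall = rouge_n(system_summary, reference_summary, n=2)
print(f"ROUGE-2 precision: {precision:.3f}, recall: {recall:.3f}")
```

In this pair, four of the six system bigrams also occur in the reference (precision 4/6), and four of the five reference bigrams are recovered by the system summary (recall 4/5).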

## Cornell Newsroom Dataset

The Cornell Newsroom dataset consists of 1.3 million articles and summaries written by news authors and editors from 38 major publications between 1998 and 2017. The dataset is split into train, dev, and test sets of 1.1 million, 100,000, and 100,000 samples, respectively.

For our case study, we will use subsets of 10,000/1,000/1,000 articles and summaries from the Cornell Newsroom dataset for our training, validation, and test sets, respectively. We tokenize the text and map each token to a 100-dimensional embedding generated with word2vec. To limit memory use, we restrict the vocabulary to 50,000 words.
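
The vocabulary cap can be sketched as follows. This is an illustration in plain Python rather than the preprocessing code actually used for the case study; the special tokens `[PAD]` and `[UNK]`, the whitespace tokenizer, and the function names are assumptions made for the example.

```python
from collections import Counter

PAD, UNK = "[PAD]", "[UNK]"  # assumed special tokens for this sketch


def build_vocab(texts, max_size=50_000):
    """Map the most frequent whitespace tokens to integer ids."""
    counts = Counter(token for text in texts for token in text.lower().split())
    vocab = {PAD: 0, UNK: 1}
    # Two slots are reserved for the special tokens; the rest go to the
    # most frequent words.
    for token, _ in counts.most_common(max_size - len(vocab)):
        vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to ids, sending out-of-vocabulary tokens to UNK."""
    return [vocab.get(token, vocab[UNK]) for token in text.lower().split()]


corpus = ["the film turns into a runaround", "the film heads for a twist"]
vocab = build_vocab(corpus, max_size=6)  # tiny cap so truncation is visible
print(encode("the film heads for a sequel", vocab))
```

Words that fall outside the cap, or that never appeared in the training text, all share the `[UNK]` id, which is what keeps the embedding matrix at a fixed size.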

A sample story and summary are below:

> **Story:** Coinciding with Mary Shelley’s birthday week, this Scott family affair produced by Ridley for director son Luke is another runout for the old story about scientists who create new life only to see it lurch bloodily away from them. Frosty risk assessor Kate Mara’s investigations into the mishandling of the eponymous hybrid intelligence (The Witch’s still-eerie Anya Taylor-Joy) permits Scott Jr a good hour of existential unease: is it the placid Morgan or her intemperate human overseers (Toby Jones, Michelle Yeoh, Paul Giamatti) who pose the greater threat to this shadowy corporation’s safe operation? Alas, once that question is resolved, the film turns into a passably schlocky runaround, bound for a guessable last-minute twist that has an obvious precedent in the Scott canon. The capable cast yank us through the chicanery, making welcome gestures towards a number of science-fiction ideas, but cranked-up Frankenstein isn’t one of the film’s smarter or more original ones.

> **Summary:** Ridley and son Luke turn in a passable sci-fi thriller, but the horror turns to shlock as the film heads for a predictable twist ending.


## Seq2Seq Model

Our first task is to train a deep policy gradient agent that can produce summaries of the articles. Before doing so, we pre-train the seq2seq model with a maximum likelihood (MLE) loss, using encoder and decoder layer sizes of 256, a batch size of 20, and Adagrad with gradient clipping, for 10 epochs (5000 iterations). (*NOTE: this will take a long time to train.*)

In [None]:
!python RLSeq2Seq/src/run_summarization.py --mode=train \
                                           --data_path=data/processed_train.bin \
                                           --vocab_path=data/vocab-50k \
                                           --log_root=. \
                                           --exp_name=seq2seq_pg \
                                           --batch_size=20 \
                                           --max_iter=5000 \
                                           --use_temporal_attention=True \
                                           --emb_dim=100
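
The iteration count passed to `--max_iter` follows from the subset size and the batch size. A small sketch of the arithmetic, assuming the 10,000-article training subset described earlier (the helper name is ours, not part of RLSeq2Seq):

```python
def iterations_for(epochs, num_samples=10_000, batch_size=20):
    """Number of parameter updates needed to see the data `epochs` times."""
    steps_per_epoch = num_samples // batch_size
    return epochs * steps_per_epoch


pretrain_iters = iterations_for(10)
print(pretrain_iters)  # 5000, the value passed to --max_iter
```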

Let's calculate ROUGE scores:

In [None]:
!python RLSeq2Seq/src/run_summarization.py  --mode=decode \
                                            --data_path=data/processed_test.bin \
                                            --vocab_path=data/vocab-50k \
                                            --log_root=. \
                                            --exp_name=seq2seq_pg \
                                            --batch_size=20 \
                                            --emb_dim=100 \
                                            --use_temporal_attention=True \
                                            --single_pass=True

## Policy Gradient

Let’s apply a deep policy gradient algorithm to improve our summaries. We switch from MLE loss to RL loss:

In [None]:
!python RLSeq2Seq/src/run_summarization.py  --mode=train \
                                            --data_path=data/processed_train.bin  \
                                            --vocab_path=data/vocab-50k \
                                            --log_root=. \
                                            --exp_name=seq2seq_pg \
                                            --batch_size=20 \
                                            --emb_dim=100 \
                                            --use_temporal_attention=True \
                                            --eta=2.5E-05 \
                                            --rl_training=True \
                                            --convert_to_reinforce_model=True
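
To illustrate what switching from MLE loss to RL loss means, here is a generic sketch of a self-critical policy gradient loss mixed with the MLE loss. It is not RLSeq2Seq's implementation: the function names are hypothetical, the rewards would in practice be ROUGE scores of a sampled and a greedily decoded summary, and we assume that `eta` acts as the mixing weight between the two losses (1 meaning RL only, 0 meaning MLE only).

```python
import math


def mle_loss(target_token_probs):
    """Negative log-likelihood of the reference tokens under the model."""
    return -sum(math.log(p) for p in target_token_probs)


def self_critical_loss(sample_token_probs, sample_reward, greedy_reward):
    """REINFORCE loss with the greedy decode's reward as the baseline."""
    log_prob = sum(math.log(p) for p in sample_token_probs)
    # Samples that beat the greedy baseline get their log-probability raised.
    return -(sample_reward - greedy_reward) * log_prob


def mixed_loss(mle, rl, eta):
    """Interpolate between the MLE loss (eta=0) and the RL loss (eta=1)."""
    return (1.0 - eta) * mle + eta * rl


# Toy numbers: model probabilities of each token in a 3-token summary.
mle = mle_loss([0.5, 0.4, 0.8])
rl = self_critical_loss([0.3, 0.6, 0.7], sample_reward=0.42, greedy_reward=0.35)
print(mixed_loss(mle, rl, eta=2.5e-05))
```

Under this assumption, a value as small as 2.5E-05 leaves the MLE term dominant, so the reward signal only gently adjusts the pre-trained model.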

We then continue training for 8 epochs (4000 iterations):

In [None]:
!python RLSeq2Seq/src/run_summarization.py  --mode=train \
                                            --data_path=data/processed_train.bin  \
                                            --vocab_path=data/vocab-50k \
                                            --log_root=. \
                                            --exp_name=seq2seq_pg \
                                            --batch_size=20 \
                                            --emb_dim=100 \
                                            --max_iter=9000 \
                                            --use_temporal_attention=True \
                                            --eta=2.5E-05 \
                                            --rl_training=True

Let us evaluate the RL-trained model on the test data:

In [None]:
!python RLSeq2Seq/src/run_summarization.py  --mode=decode \
                                            --data_path=data/processed_test.bin \
                                            --vocab_path=data/vocab-50k \
                                            --log_root=. \
                                            --exp_name=seq2seq_pg \
                                            --emb_dim=100 \
                                            --rl_training=True \
                                            --use_temporal_attention=True \
                                            --single_pass=1 \
                                            --beam_size=4 \
                                            --decode_after=0
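
The `--beam_size=4` flag indicates that decoding keeps several candidate summaries alive at each step instead of committing to the single most likely word. The following is a minimal sketch of beam search over a hypothetical toy model; `toy_model`, its probabilities, and the `</s>` end token are invented for illustration and do not reflect the RLSeq2Seq decoder.

```python
import math


def beam_search(next_token_probs, beam_size=4, max_len=3, eos="</s>"):
    """Keep the `beam_size` highest-scoring partial sequences at each step."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypothesis
                continue
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]


def toy_model(prefix):
    """Hypothetical stand-in for the decoder's next-token distribution."""
    if not prefix:
        return {"a": 0.6, "the": 0.4}
    if prefix[-1] == "a":
        return {"film": 0.5, "twist": 0.5}
    if prefix[-1] == "the":
        return {"film": 0.9, "twist": 0.1}
    return {"</s>": 1.0}


print(beam_search(toy_model, beam_size=4))
```

With a beam of 4, the search recovers the sequence starting with "the" (probability 0.4 × 0.9 = 0.36), whereas a beam of 1 commits to "a" at the first step and can reach at most 0.6 × 0.5 = 0.30.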

## DDQN

Let’s see if we can improve on the results above using a double deep Q-learning agent. We start as before by pre-training the seq2seq language model using maximum likelihood loss for 10 epochs:

In [None]:
!python RLSeq2Seq/src/run_summarization.py --mode=train \
                                           --data_path=data/processed_train.bin \
                                           --vocab_path=data/vocab-50k \
                                           --log_root=. \
                                           --exp_name=ddqn \
                                           --batch_size=20 \
                                           --max_iter=5000 \
                                           --emb_dim=100 \
                                           --use_temporal_attention=True
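
As a reminder of what distinguishes double Q-learning, the sketch below computes the double DQN target for a single transition: the online network selects the next action and the target network evaluates it. The Q-values are made-up numbers over a hypothetical four-word vocabulary, and the discount factor is an assumption; this is not the RLSeq2Seq code.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Return r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        return reward  # no bootstrapping past the end of the summary
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... while its value comes from the target network.
    return reward + gamma * next_q_target[best_action]


next_q_online = [0.1, 0.9, 0.3, 0.2]  # online network picks action 1
next_q_target = [0.2, 0.5, 0.8, 0.1]  # target network values action 1 at 0.5
print(double_dqn_target(0.0, next_q_online, next_q_target))
```

A standard DQN would instead bootstrap from the maximum target value (0.8 here), which tends to overestimate action values; decoupling selection from evaluation is what reduces that bias.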

We switch from MLE to RL loss:

In [None]:
!python RLSeq2Seq/src/run_summarization.py --mode=train \
                                           --data_path=data/processed_train.bin \
                                           --vocab_path=data/vocab-50k \
                                           --log_root=. \
                                           --exp_name=ddqn \
                                           --batch_size=20 \
                                           --emb_dim=100 \
                                           --ac_training=True \
                                           --dueling_net=True \
                                           --dqn_target_update=500 \
                                           --convert_to_reinforce_model=True  
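
The `--dueling_net` flag suggests a dueling architecture, in which the Q-network is split into a state-value stream and an advantage stream. A minimal sketch of the standard aggregation step follows, with hypothetical values and no claim about how RLSeq2Seq wires it internally.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V + (A - mean(A))."""
    mean_advantage = sum(advantages) / len(advantages)
    # Subtracting the mean keeps V and A identifiable.
    return [state_value + a - mean_advantage for a in advantages]


q_values = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5, 0.0])
print(q_values)
```

Because the mean advantage is subtracted, the average of the resulting Q-values equals the state value.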

We will first pre-train the DDQN with a fixed actor model:

In [None]:
!python RLSeq2Seq/src/run_summarization.py --mode=train \
                                           --data_path=data/processed_train.bin \
                                           --vocab_path=data/vocab-50k \
                                           --log_root=. \
                                           --exp_name=ddqn \
                                           --batch_size=20 \
                                           --emb_dim=100 \
                                           --dqn_replay_buffer_size=5000 \
                                           --dqn_target_update=500 \
                                           --ac_training=True \
                                           --dqn_pretrain=True \
                                           --dueling_net=True \
                                           --dqn_pretrain_steps=500

We then train the DDQN for 8 epochs using a batch size of 20 and a replay buffer of 5000 samples, updating the target network every 500 iterations. We start by training with true Q-estimates:

In [None]:
!python RLSeq2Seq/src/run_summarization.py --mode=train \
                                           --data_path=data/processed_train.bin \
                                           --vocab_path=data/vocab-50k \
                                           --log_root=. \
                                           --exp_name=ddqn \
                                           --batch_size=20 \
                                           --max_iter=5500 \
                                           --emb_dim=100 \
                                           --dqn_replay_buffer_size=5000 \
                                           --dqn_target_update=500 \
                                           --ac_training=True \
                                           --dueling_net=True \
                                           --calculate_true_q=True 
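
The replay buffer and the periodic target update can be pictured with the following sketch. The class and variable names are hypothetical and the networks are stand-in dictionaries; only the buffer size (5000), batch size (20), and update interval (500) are taken from the command above.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of transitions; the oldest are evicted first."""

    def __init__(self, capacity=5000):
        self.transitions = deque(maxlen=capacity)

    def add(self, transition):
        self.transitions.append(transition)

    def sample(self, batch_size=20):
        return random.sample(list(self.transitions), batch_size)


buffer = ReplayBuffer(capacity=5000)
online_weights = {"w": 0.0}  # stand-in for the online Q-network
target_weights = dict(online_weights)
TARGET_UPDATE = 500

for step in range(1, 1501):
    buffer.add(("state", "action", 0.0, "next_state"))  # dummy transition
    if len(buffer.transitions) >= 20:
        batch = buffer.sample(20)  # a real agent would fit the Q-network here
        online_weights["w"] += 0.001  # placeholder for a gradient step
    if step % TARGET_UPDATE == 0:
        target_weights = dict(online_weights)  # sync target with online network

print(len(buffer.transitions), target_weights == online_weights)
```

Sampling from a buffer breaks the correlation between consecutive decoding steps, and holding the target network fixed between synchronizations keeps the regression targets from shifting at every update.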

We then switch to Q-estimates after the warm start:

In [None]:
!python RLSeq2Seq/src/run_summarization.py --mode=train \
                                           --data_path=data/processed_train.bin \
                                           --vocab_path=data/vocab-50k \
                                           --log_root=. \
                                           --exp_name=ddqn \
                                           --batch_size=20 \
                                           --max_iter=9000 \
                                           --emb_dim=100 \
                                           --dqn_replay_buffer_size=5000 \
                                           --dqn_target_update=500 \
                                           --ac_training=True \
                                           --dueling_net=True \
                                           --calculate_true_q=False 

Finally, we calculate ROUGE scores:

In [None]:
!python RLSeq2Seq/src/run_summarization.py  --mode=decode \
                                            --data_path=data/processed_test.bin \
                                            --vocab_path=data/vocab-50k \
                                            --log_root=. \
                                            --exp_name=ddqn \
                                            --emb_dim=100 \
                                            --ac_training=True \
                                            --dueling_net=True \
                                            --dqn_replay_buffer_size=5000 \
                                            --dqn_target_update=500 \
                                            --single_pass=1 \
                                            --beam_size=4

The DDQN agent outperforms the policy gradient agent for the chosen parameters. There are many ways to improve the results further: we could use scheduled or prioritized sampling, intermediate rewards, and attention at the encoder or decoder.