In [1]:
%load_ext autoreload
%autoreload 2

# Introduction

<center><h3>**Welcome to the Summarization Notebook.**</h3></center>

In this assignment, you are going to train a neural network to summarize news articles.
Your neural network is going to learn from example, as we provide you with (article, summary) pairs.
We provide you with a **toy dataset** made of only articles about police-related news.
Usual datasets can be 20x larger in size, but we have reduced it for computational purposes.

You will do this using a Transformer network, from the __[Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)__ paper.
In this assignment you will:
- Learn to process text into sub-word tokens, to avoid fixed vocabulary sizes, and UNK tokens.
- Implement the key conceptual blocks of a Transformer.
- Use a Transformer to read a news article, and produce a summary.
- Perform operations on learned word-vectors to examine what the model has learned.

    
**Before you start**

You should read the Attention is all you need paper.
We are providing you with skeleton code for the Transformer, but you will have to implement 5 conceptual blocks of the Transformer yourself:
- AttentionQKV: the Query, Key, Value attention mechanism at the center of the Transformer.
- MultiHeadAttention: the multiple heads that enable each input to attend to many places at once.
- PositionEmbedding: the sinusoid-based position embedding of the Transformer.
- Encoder & Decoder: the encoder (that reads inputs, such as news articles) and the decoder (that produces the output summary, one token at a time).
- Full Transformer: piecing it all together.
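
To make the first and third blocks concrete, here is a minimal pure-Python sketch of scaled dot-product attention and sinusoidal position embeddings, following the formulas in the paper. It works on plain lists of floats and is only an illustration: the names `attention_qkv` and `position_embedding`, their signatures, and the `causal` flag are assumptions of this sketch, not the skeleton's API, and the TensorFlow skeleton may lay out the sine and cosine dimensions differently.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_qkv(queries, keys, values, causal=False):
    """Scaled dot-product attention for a single sequence (hypothetical helper).

    queries is a list of vectors; keys and values are lists of the same length.
    With causal=True, position i may only attend to positions j <= i.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked position gets zero weight
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def position_embedding(length, d_model):
    """Sinusoid table: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

print(position_embedding(1, 4)[0])  # [0.0, 1.0, 0.0, 1.0]
```

Multi-head attention applies the same attention function several times in parallel, each on a different learned projection of the inputs, and then combines the results.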

# Library imports

In [2]:
from transformer import GPT
import sentencepiece as spm
import tensorflow as tf
import numpy as np
import json
import capita

root_folder = ""

In [3]:
# Load the word piece model that will be used to tokenize the texts into
# word pieces with a vocabulary size of 10000

sp = spm.SentencePieceProcessor()
sp.Load(root_folder+"dataset/wp_vocab10000.model")

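# Read the vocabulary as UTF-8 (word pieces contain non-ASCII markers) and close the file afterwards
with open(root_folder+"dataset/wp_vocab10000.vocab", "r", encoding="utf-8") as vocab_file:
    vocab = [line.split('\t')[0] for line in vocab_file]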
pad_index = vocab.index('#')

def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask
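
As a quick check of what `pad_sequence` returns, the sketch below repeats the helper so that it runs on its own and calls it on two toy sequences. The token ids and `pad_index=0` are made up for illustration; in the notebook the pad index comes from the vocabulary file.

```python
def pad_sequence(numerized, pad_index, to_length):
    # Same helper as in the cell above, repeated so this snippet is self-contained.
    pad = numerized[:to_length]                          # truncate if too long
    padded = pad + [pad_index] * (to_length - len(pad))  # fill up if too short
    mask = [w != pad_index for w in padded]              # True on real tokens
    return padded, mask

short_padded, short_mask = pad_sequence([5, 6, 7], pad_index=0, to_length=5)
long_padded, long_mask = pad_sequence([1, 2, 3, 4], pad_index=0, to_length=2)

print(short_padded, short_mask)  # [5, 6, 7, 0, 0] [True, True, True, False, False]
print(long_padded, long_mask)    # [1, 2] [True, True]
```

Note that the mask is computed by comparing against `pad_index`, so a real token that happens to equal the pad index would also be masked out.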

# Creating a Transformer

Now that all the blocks of the Transformer are implemented, we can create a full model with placeholders and a loss.

We've helped you with the placeholders, and the loss, as it is similar to the one in the previous assignment.

In [4]:
# We are giving you the trainer, as it is similar to the one
# you created in the Language Modeling assignment.

class GPTTrainer():

    def __init__(self, vocab_size, d_model, output_length, n_layers, d_filter, learning_rate=1e-3):

        self.target_sequence = tf.placeholder(tf.int32, shape=(None,output_length),name="target_sequence")
        self.decoder_mask = tf.placeholder(tf.bool, shape=(None,output_length),name="decoder_mask")

        self.model = GPT(vocab_size=vocab_size, d_model=d_model, n_layers=n_layers, d_filter=d_filter)

        self.decoded_logits = self.model(self.target_sequence, decoder_mask=self.decoder_mask)
        self.global_step = tf.train.get_or_create_global_step()
        
        # Summarization loss
        self.loss = tf.losses.sparse_softmax_cross_entropy(self.target_sequence, self.decoded_logits, tf.cast(self.decoder_mask, tf.float32))
        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        self.train_op = self.optimizer.minimize(self.loss, global_step=self.global_step)
        self.saver = tf.train.Saver()
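
To see what the masked loss in the trainer measures, here is a small pure-Python sketch of a cross-entropy averaged only over unmasked positions. It assumes the default reduction of `tf.losses.sparse_softmax_cross_entropy`, which divides the weighted sum by the number of non-zero weights; `masked_cross_entropy` is a hypothetical name used only for this illustration, and it handles a single sequence rather than a batch.

```python
import math

def masked_cross_entropy(targets, logits, mask):
    """Mean of -log softmax(logits)[target] over positions where mask is True."""
    total, count = 0.0, 0
    for target, row, keep in zip(targets, logits, mask):
        if not keep:
            continue  # padded positions contribute nothing to the loss
        top = max(row)
        # log of the softmax normalizer, computed stably
        log_norm = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / max(count, 1)

# Two positions over a 4-token vocabulary; the second position is padding.
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
print(masked_cross_entropy([2, 1], logits, [True, False]))  # log(4), about 1.386
```

With uniform logits the loss equals the log of the vocabulary size, which is a useful reference point for the loss of an untrained model.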

We now instantiate the Transformer with our sets of hyperparameters specific to the task of summarization.
In summarization, we are going to go from documents with up to 400 words, to documents with up to 100 words.
The vocabulary size is set for you at 10,000 word pieces (we are using WordPieces; [here is a paper about subword encoding](http://aclweb.org/anthology/P18-1007), if you are interested).

In [5]:
# Dataset related parameters
vocab_size = len(vocab)
ilength = 400 # Length of the article
olength  = 100 # Length of the summaries

# Model related parameters, feel free to modify these.
n_layers = 12
d_model  = 104
d_filter = 416

model = GPTTrainer(vocab_size, d_model, ilength, n_layers, d_filter)

# Training the model

Your objective is to train the language model on the dataset you are provided, to reach a **validation loss <= 4.50**.

Careful: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit.

You must save the model you want us to test under: models/final_transformer_summarization (the .index, .meta and .data files)

**Advice**:
- It should be possible to attain validation loss <= 4.50 with the model dimensions we've specified (n_layers=6, d_model=104, d_filter=416), but you can tune these hyperparameters. Increasing d_model will yield a better model, at the cost of longer training time.
- You should try tuning the learning rate, as well as what optimizer you use.
- You might need to train for a while (up to 2 hours) to obtain our expected loss. Remember to tune your hyperparameters first; once you find ones that work well, let the model train for longer.

**Dataset**: as in the previous notebook, make sure the dataset files are in the `dataset` folder. These can be found on the Google Drive.


In [6]:
with open(root_folder+"dataset/summarization_dataset_preprocessed.json", "r") as f:
    dataset = json.load(f)

# We load the dataset, and split it into 2 sub-datasets based on if they are training or validation.
# Feel free to split this dataset another way, but remember, a validation set is important, to have an idea of 
# the amount of overfitting that has occurred!

d_train = [d for d in dataset if d['cut'] == 'training']
d_valid = [d for d in dataset if d['cut'] == 'evaluation']

len(d_train), len(d_valid)

(61055, 1558)

In [7]:
# An example (article, summary) pair in the training data:

print(d_train[145]['story'])
print("=======================\n=======================")
print(d_train[145]['summary'])

Tbilisi, Georgia (CNN)Police have shot and killed a white tiger that killed a man Wednesday in Tbilisi, Georgia, a Ministry of Internal Affairs representative said, after severe flooding allowed hundreds of wild animals to escape the city zoo. 
The tiger attack happened at a warehouse in the city center. The animal had been unaccounted for since the weekend floods destroyed the zoo premises.
The man killed, who was 43, worked in a company based in the warehouse, the Ministry of Internal Affairs said. Doctors said he was attacked in the throat and died before reaching the hospital. 
Experts are still searching the warehouse, the ministry said, adding that earlier reports that the tiger had injured a second man were unfounded. 
The zoo administration said Wednesday that another tiger was still missing. It was unable to confirm if the creature was dead or had escaped alive.
Georgian Prime Minister Irakli Garibashvili apologized to the public, saying he had been misinformed by the zoo's ma

Similarly to the previous assignment, we create a function to get a random batch to train on, given a dataset.

In [None]:
def build_batch(dataset, batch_size):
    indices = list(np.random.randint(0, len(dataset), size=batch_size))
    
    batch = [dataset[i] for i in indices]
    batch_output = np.array([a['input'] for a in batch])
    batch_output_mask = np.array([a['input_mask'] for a in batch])
    
    return batch_output, batch_output_mask

In [None]:
# Skeleton code, as in the previous notebook.
# Write your training code and save your best-performing model on the
# validation set. We will be testing the loss on a held-out test dataset.


with tf.Session() as sess:
    # This is how you randomly initialize the Transformer weights.
    sess.run(tf.global_variables_initializer())
    
    epochs = 20 #previously 50
    
    for epoch in range(epochs):
        
        batch_size = 128
        iterations = len(d_train) // batch_size
        
        # build validation set
        e_output, e_output_mask = build_batch(d_valid, 200)
        
        for iteration in range(iterations):       

            # Create a random mini-batch from the training dataset
            batch_output, batch_output_mask = build_batch(d_train, batch_size)
            # Build the feed-dict connecting placeholders and mini-batch
            feed = {model.target_sequence: batch_output, model.decoder_mask: batch_output_mask}

            # Obtain the loss. As previously, be careful about when you run train_op and when you do not.
            train_loss, _, step = sess.run([model.loss, model.train_op, model.global_step], feed_dict=feed)
            
            if iteration % 50 == 0:
                
                # get validation loss
                feed_val = {model.target_sequence: e_output, model.decoder_mask: e_output_mask}
                valid_loss = sess.run(model.loss, feed_dict=feed_val)
                
                print("Epoch {} Iteration {}, Train Loss: {}, Val Loss: {}".format(epoch, iteration, train_loss, valid_loss))
            
                
#                 print("Epoch {} Iteration {}, Train Loss: {}".format(epoch, iteration, train_loss))
                
#                 This is how you save model weights into a file
#                 model.saver.save(sess, root_folder+"models/gpt_test")    

#                 # This is how you restore a model previously saved
#                 model.saver.restore(sess, root_folder+"models/transformer_summarizer")


Epoch 0 Iteration 0, Train Loss: 13.942840576171875, Val Loss: 12.402144432067871
Epoch 0 Iteration 50, Train Loss: 8.218381881713867, Val Loss: 8.214381217956543
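
The skeleton leaves the save call commented out. One possible way to keep only the best model on the validation set is to remember the lowest validation loss seen so far and save whenever it improves. The sketch below is an assumption about how you might organize this, not part of the provided code: `BestCheckpoint` is a hypothetical helper, and the lambda stands in for a call such as `model.saver.save(sess, root_folder+"models/final_transformer_summarization")`.

```python
class BestCheckpoint:
    """Remembers the lowest validation loss seen and saves only on improvement."""

    def __init__(self, save_fn):
        self.save_fn = save_fn            # called with no arguments when the loss improves
        self.best_loss = float("inf")

    def update(self, valid_loss):
        """Save and return True if valid_loss beats the best loss seen so far."""
        if valid_loss < self.best_loss:
            self.best_loss = valid_loss
            self.save_fn()
            return True
        return False

# Stand-in for the real save call; here we only record that a save happened.
saved = []
tracker = BestCheckpoint(save_fn=lambda: saved.append("checkpoint"))

# Made-up validation losses, for illustration only.
results = [tracker.update(loss) for loss in [8.2, 7.9, 8.0, 7.5]]
print(results)  # [True, True, False, True]
```

In the training loop, `tracker.update(valid_loss)` would be called right after the validation loss is computed. Because that loss is measured on a random batch of 200 examples, it is noisy, so a fixed validation batch makes the comparison between checkpoints fairer.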


# Using the Summarization model

Now that you have trained a Transformer to perform Summarization, we will use the model on news articles from the wild.

The three subsections below explore what the model has learned.

In [29]:
# Put the file path to your best performing model in the string below.

model_file = root_folder+"models/gpt_test"
# model_file = root_folder+"models/transformer_summarizer"

## The validation loss

Measure the validation loss of your model. This part could be used, as in our previous notebook, in deciding what is a likely, vs. unlikely summary for an article.

We will use the code here with the unreleased test-set to evaluate your model.

In [30]:
with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    e_output, e_output_mask = build_batch(d_valid, 200)
    feed = {model.target_sequence: e_output, model.decoder_mask: e_output_mask}
    valid_loss = sess.run(model.loss, feed_dict=feed)
    print("Validation loss:", valid_loss)

INFO:tensorflow:Restoring parameters from models/gpt_test
Validation loss: 6.28927


## Generating a summary

This model we have built is meant to be used to generate summaries for new articles we do not have summaries for.
We got a [news article](https://www.chicagotribune.com/news/local/breaking/ct-met-officer-shot-20190309-story.html) from the Chicago Tribune about a police shooting, and want to use our model to produce a summary.

As you will see, our model is still limited in its ability, and will most likely not produce a perfect summary. However, with more data and training, this model would be able to produce good summaries.
The summary you produce should look like broken English sentences, but should roughly correspond to the article.

In [32]:
output_length = 400

with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    decoded_so_far = [0]
    
    for j in range(output_length):
        padded_decoder_input, decoder_mask = pad_sequence(decoded_so_far, pad_index, output_length)
        padded_decoder_input = [padded_decoder_input]
        decoder_mask = [decoder_mask]
#         print("========================")
#         print(padded_decoder_input)
        # Use the model to find the distribution over the vocabulary for the next word
        feed = {model.target_sequence: padded_decoder_input,
                model.decoder_mask: decoder_mask}
        logits = sess.run([model.decoded_logits], feed_dict=feed)
    
        chosen_words = np.argmax(logits[0], axis=2) # Take the argmax, getting the most likely next word
        decoded_so_far.append(int(chosen_words[0, j])) # We add it to the summary so far


print("The final summary:")
print("".join([vocab[i] for i in decoded_so_far]).replace("▁", " "))


INFO:tensorflow:Restoring parameters from models/gpt_test
The final summary:
<unk>  ↑↑ the  ↑↑ prince  ↑↑ eddie prince  ⇧⇧ lrb lrb incidents -  ⇧rrb cnn lrb - the  ↑↑ police  ↑↑ division division of of  ↑↑ the :  ↑↑ police police department department  ↑↑ county told  ↑↑ police the department  ↑↑ police told department  ↑↑ the the  ↑↑ : :  ↑↑ the the  ↑↑ police the department  ↑↑ police :  ↑↑ the the  ↑↑ supreme  ↑↑ police police department department . ,  the↑  the↑  the↑  the↑  ↑↑ supreme  ↑↑ police  department↑ . the  ↑↑ the the  ↑↑ the  ↑↑  khkhss  .↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  the↑  ↑↑ the  ↑↑  ↑↑  ↑↑ the the  ↑↑ the  ↑↑  ↑↑  ↑↑ the  ↑↑ the  ↑↑ the the  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑  ↑↑ the


## Word vectors

The model we train learns a word representation for each word in our vocabulary. A word representation is a vector of size **dim**.

It is common in NLP to inspect the word vectors, as some properties of language often appear in the embedding structure.


We are going to load the word embeddings learned by our model, and inspect them.
Because our network was not trained for long, we are looking for the simplest patterns; if we let the network train longer, it learns more complex, semantic patterns.

Pronouns serve very similar purposes, so we should expect the representations of "he" and "she" to be similar, and to have high cosine similarity.

- **TODO**:  Find the cosine similarity between the vectors that represent words "she" and "he".
- **TODO**:  Find the cosine similarity between the vectors that represent words "more" and "less".

We can contrast that with the cosine similarity to a random, unrelated word, like "ball" or "gorilla".
- **TODO**: Compute the cosine similarity between "she" and "ball".
- **TODO**: Compute the cosine similarity between "more" and "protest".
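
As a reminder, the cosine similarity of two vectors is their dot product divided by the product of their norms. It ranges from -1 (opposite directions) to 1 (same direction). A minimal pure-Python sketch follows; the three vectors are invented toy values, not embeddings from the trained model, and how you read the embedding matrix out of your model depends on the skeleton.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", for illustration only.
she = [0.9, 0.1, 0.3]
he = [0.8, 0.2, 0.3]
ball = [-0.2, 0.9, -0.4]

print(cosine_similarity(she, he))    # about 0.99
print(cosine_similarity(she, ball))  # about -0.22
```

With real embeddings, each vector would be the row of the embedding matrix at the index of the corresponding token in `vocab`.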



These effects are unfortunately small, as we have only trained the network for a few hours on a few thousand articles.
However, the same model trained for longer on more data exhibits many interesting semantic and syntactic patterns, such as:

- Word vectors with high cosine similarity usually represent words that are semantically similar (such as duck and pigeon).
- Analogies can occur. A famous case is woman - man + king ≈ queen; another is france - paris + rome ≈ italy.

- Looking at top-k similar words can help find synonyms.
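
The analogy and top-k patterns can be sketched on a toy embedding table. The 2-dimensional vectors below are hand-picked so that the analogy works exactly (one axis loosely plays the role of "royalty", the other of "gender"); they are hypothetical and do not come from any trained model, and `top_k_similar` is a name invented for this sketch.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(query, embeddings, k=2, exclude=()):
    """Return the k words whose vectors have the highest cosine similarity to query."""
    scored = [(cosine_similarity(query, vec), word)
              for word, vec in embeddings.items() if word not in exclude]
    return [word for _, word in sorted(scored, reverse=True)[:k]]

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "ball":  [-1.0, 0.0],
}

# woman - man + king, computed dimension by dimension.
analogy = [w - m + k for w, m, k in zip(toy["woman"], toy["man"], toy["king"])]
print(top_k_similar(analogy, toy, k=1, exclude={"woman", "man", "king"}))  # ['queen']
```

The words that appear in the analogy itself are excluded from the search, since the query vector is usually closest to one of them.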

To read examples of more complex patterns that appear in word embedding spaces, read [this blog](https://explosion.ai/blog/sense2vec-with-spacy). To play with a live demo and try similarities on rich word embeddings, [go here.](https://explosion.ai/demos/sense2vec)