![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

## NLP2 Lecture 3 Support Notebook

### Table of Contents
<p>
<div class="lev1">
    <a href="#Text-Generation-with-RNN"><span class="toc-item-num">1&nbsp;&nbsp;</span>
        Text Generation with RNN
    </a>
</div>
<div class="lev1">
    <a href="#BLEU:-BiLingual-Evaluation-Understudy"><span class="toc-item-num">2&nbsp;&nbsp;</span>
        BLEU: BiLingual Evaluation Understudy
    </a>
</div>
<div class="lev1">
    <a href="#ROUGE:-Recall-Oriented-Understudy-for-Gisting-Evaluation"><span class="toc-item-num">3&nbsp;&nbsp;</span>
        ROUGE: Recall-Oriented Understudy for Gisting Evaluation
    </a>
</div>
<div class="lev1">
    <a href="#Bilingual-Evaluation-Understudy-Score-(BLEAU)"><span class="toc-item-num">4&nbsp;&nbsp;</span>
        Bilingual Evaluation Understudy Score (BLEAU)
    </a>
</div>

# Text Generation with RNN
ref. https://cass-experiments.notebook.us-east-1.sagemaker.aws/notebooks/MLU-NLP-Teacher/NLP2/demos/text_generation.ipynb

In [1]:
!pip install -q -U torchtext==0.5

In [2]:
import torch
import torchtext
import numpy as np
import os
import time

In [3]:
# Check for GPU
print(torch.__version__)
if torch.cuda.is_available():
    print('Default GPU Device : {}'.format(torch.cuda.get_device_name(0)))
    device = torch.device('cuda')
else:
    print('No GPU available')
    device = torch.device('cpu')
    
print(device)

1.4.0
Default GPU Device : Tesla K80
cuda


## Download the Shakespeare dataset
Change the following line to run this code on your own data.

In [39]:
%%bash
# creating hidden folder needed for this torchtext download
if [ ! -d ".data" ]; then
  ! mkdir .data
fi

In [37]:
path_to_file = torchtext.utils.download_from_url('https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

mkdir: cannot create directory ‘.data’: File exists


## Read the data
First, look in the text:

In [33]:
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [34]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [35]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

65 unique characters


## Process the text
### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters.

In [36]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])
text_as_int

array([18, 47, 56, ..., 45,  8,  0])

Now we have an integer representation for each character. Notice that we mapped the character as indexes from 0 to `len(unique)`.

In [114]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '3' :   9,
  ':' :  10,
  ';' :  11,
  '?' :  12,
  'A' :  13,
  'B' :  14,
  'C' :  15,
  'D' :  16,
  'E' :  17,
  'F' :  18,
  'G' :  19,
  ...
}


In [115]:
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'First Citizen' ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]


### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task we're training the model to perform. The input to the model will be a sequence of characters, and we train the model to predict the output—the following character at each time step.

Since RNNs maintain an internal state that depends on the previously seen elements, given all the characters computed until this moment, what is the next character?


### Create training examples and targets

Next divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.

For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length+1`. For example, say `seq_length` is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello".

To do this, we first convert the text vector into a stream of character indices using `DataLoader`. The `batch_size` argument slices the data into vectors of a particular length and `drop_last` ensures that the last sequence is ignored if it is not of stipulated length.

In [117]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

sequences = torch.utils.data.DataLoader(text_as_int, batch_size=seq_length+1, drop_last=True)

In [118]:
for item in sequences:
    print(repr(''.join(idx2char[item.numpy()])))
    
# 11043 full sequences

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
'zens, the patricians good.\nWhat authority surfeits on would relieve us: if they\nwould yield us but th'
'e superfluity, while it were\nwholesome, we might guess they relieved us humanely;\nbut they think we a'
're too dear: the leanness that\nafflicts us, the object of our misery, is as an\ninventory to particula'
'rise their abundance; our\nsufferance is a gain to them Let us revenge this with\nour pikes, ere we bec'
'ome rakes: for the gods kn

For each sequence, duplicate and shift it to form the input and target text by using `map` to apply a simple function to each batch

In [119]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = list(map(split_input_target, sequences))
print(len(dataset))

11043


Print the first examples input and target values:

In [120]:
for input_example, target_example in  dataset:
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))
    break
    
# At this point, dataset consists of 11043 input / target pairs. Each input / target sequence consists of 100 
# chars, offset by 1 char to the right

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


Each index of these vectors are processed as one time step. For the input at time step 0, the model receives the index for "F" and trys to predict the index for "i" as the next character. At the next timestep, it does the same thing but the `RNN` considers the previous step context in addition to the current input character.

In [121]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')


### Create training batches

We use another `DataLoader` to split the input/target pairs into manageable chunks by packing them into batches.

In [122]:
# Batch size
BATCH_SIZE = 64

dataset = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=True)

## Build The Model

Define a new class based off `nn.Module` for our character predicting GRU. This comprises of 3 layers -

* `nn.Embedding`: The input layer. A trainable lookup table that will map the numbers of each character to a vector with `embedding_dim` dimensions;
* `nn.GRU`: A type of RNN with hidden size `units=rnn_units` (You can also use a LSTM layer here.)
* `nn.Linear`: The output layer, with `vocab_size` outputs.

In [123]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [124]:
import torch.nn as nn

class CharGRU(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, rnn_units, batch_first=True)
        self.fc = nn.Linear(rnn_units, vocab_size)
    
    def forward(self, x):
        embedding_output = self.embedding(x)
        gru_output, hidden = self.gru(embedding_output)
        out = self.fc(gru_output)
        
        return out
        

In [125]:
# Instantiate the model

model = CharGRU(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units)

## Try the model

Now run the model to see that it behaves as expected.

First check the shape of the output:

In [126]:
for input_example_batch, target_example_batch in dataset:
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
    break

torch.Size([64, 100, 65]) # (batch_size, sequence_length, vocab_size)


In the above example the sequence length of the input is `100` but the model can be run on inputs of any length:

In [127]:
print(model)

CharGRU(
  (embedding): Embedding(65, 256)
  (gru): GRU(256, 1024, batch_first=True)
  (fc): Linear(in_features=1024, out_features=65, bias=True)
)


To get actual predictions from the model we need to sample from the output distribution, to get actual character indices. This distribution is defined by the logits over the character vocabulary.

Note: It is important to _sample_ from this distribution as taking the _argmax_ of the distribution can easily get the model stuck in a loop.

Try it for the first example in the batch:

In [128]:
example_output = torch.distributions.categorical.Categorical(logits=example_batch_predictions[0])
sampled_indices = example_output.sample()

This gives us, at each timestep, a prediction of the next character index:

In [129]:
sampled_indices

tensor([13, 44, 47, 31, 25, 50, 50, 49, 16, 23, 25, 10, 34, 39, 58, 42, 52, 39,
        12, 23, 60, 59, 37, 15,  8, 48, 63,  8, 16, 39, 34, 29, 40, 30, 63, 41,
        53, 64,  1,  3, 15, 58, 51,  9, 56, 26,  2, 26, 10, 57, 41, 12, 21, 53,
        47,  5, 16, 29, 25, 25, 34,  8, 29, 48, 54,  7, 10, 54, 16, 60, 58, 43,
        26, 39, 39, 37, 62, 34, 32, 32, 57, 54, 16, 39, 64,  4,  2, 16, 58,  0,
        16, 31, 56, 32, 36, 10, 29, 25, 38, 29])

Decode these to see the text predicted by this untrained model:

In [130]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 "y think\nA due sincerity govern'd his deeds,\nTill he did look on me: since it is so,\nLet him not die."

Next Char Predictions: 
 "AfiSMllkDKM:Vatdna?KvuYC.jy.DaVQbRycoz $Ctm3rN!N:sc?Ioi'DQMMV.Qjp-:pDvteNaaYxVTTspDaz&!Dt\nDSrTX:QMZQ"


## Train the model

At this point the problem can be treated as a standard classification problem. Given the previous RNN state, and the input this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

We define a simple function to calculate the loss across a batch. This simply takes the individual cross entropy loss between each predicted character in the sequence of 100 with the expected character at that location and stacks them together.


In [131]:
def batch_loss(batch_logits, batch_labels):
    batch_loss_vector = []
    
    loss = nn.CrossEntropyLoss(reduction="none")
    for logits, label in zip(batch_logits, batch_labels):
        batch_loss_vector.append(loss(logits, label))
    
    batch_loss_vector = torch.stack(batch_loss_vector)    
    return batch_loss_vector
    

# print(target_example_batch.shape)
# print(example_batch_predictions.shape)

example_batch_loss  = batch_loss(example_batch_predictions, target_example_batch)
print("Prediction shape: ", example_batch_loss.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.detach().numpy().mean())

Prediction shape:  torch.Size([64, 100])  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.174924


Finally, we will use the `Adam` optimizer with the default arguments for training.

In [132]:
from torch import optim

optimizer = optim.Adam(model.parameters())

### Configure checkpoints

Specify the directory where the checkpoints (different trained versions of the model) will be saved.

In [133]:
# Directory where the checkpoints will be saved
checkpoint_dir = 'checkpoints'

### Execute the training

To keep training time reasonable, use <=50 epochs. Larger number of epochs = better output usually.

In [135]:
%%time
# around 13:29 minutes with gpu
EPOCHS=30

# Move the model & data to GPU for faster training
model = model.to(device)

def train(data):

    for e in range(EPOCHS):
        running_loss = 0

        for input_example_batch, target_example_batch in data:
            
            input_example_batch = input_example_batch.to(device)
            target_example_batch = target_example_batch.to(device)
            
            optimizer.zero_grad()
            output = model.forward(input_example_batch)
            loss = batch_loss(output, target_example_batch)
            
            # Calculate the mean of the loss across the batch to backpropagate
            loss = loss.mean()
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
        
        # Divide by length of data to get average loss on a batch
        print("Epoch: ", e, " Loss: ", running_loss/len(data))
        
        # Save every 5th trained version of the model
        if (e+1)%5 == 0:
            # Save the model
            checkpoint_filepath = os.path.join(checkpoint_dir, "ckpt_{}.pt".format(e+1))
            print(checkpoint_filepath)
            torch.save({
                'epoch': e+1,
                'model_state_dict': model.state_dict(),
                'loss': loss
            }, checkpoint_filepath)
    
# model.fit(X, y, epochs=2000, verbose=0)
train(dataset)

cuda
Epoch:  0  Loss:  2.01819632427637
Epoch:  1  Loss:  1.5203065380107526
Epoch:  2  Loss:  1.3955902810706649
Epoch:  3  Loss:  1.3285814336566037
Epoch:  4  Loss:  1.2792481287967328
checkpoints/ckpt_5.pt
Epoch:  5  Loss:  1.2391047540099123
Epoch:  6  Loss:  1.2019981021104857
Epoch:  7  Loss:  1.166631658409917
Epoch:  8  Loss:  1.1321793191654737
Epoch:  9  Loss:  1.0970974017021269
checkpoints/ckpt_10.pt
Epoch:  10  Loss:  1.0620160795921503
Epoch:  11  Loss:  1.0259964611641197
Epoch:  12  Loss:  0.9888381313445956
Epoch:  13  Loss:  0.9539398810891218
Epoch:  14  Loss:  0.9180768409440684
checkpoints/ckpt_15.pt
Epoch:  15  Loss:  0.8842299590970195
Epoch:  16  Loss:  0.850935642802438
Epoch:  17  Loss:  0.819511775360551
Epoch:  18  Loss:  0.790586176653241
Epoch:  19  Loss:  0.7642903213584146
checkpoints/ckpt_20.pt
Epoch:  20  Loss:  0.7405664965856907
Epoch:  21  Loss:  0.7174516670232596
Epoch:  22  Loss:  0.7004720201325971
Epoch:  23  Loss:  0.6855604704036269
Epoch:  

## Generate text

### Restore the latest checkpoint

In [136]:
model = CharGRU(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units)

latest_checkpoint_filepath = os.path.join(checkpoint_dir, "ckpt_{}.pt".format(EPOCHS))
latest_checkpoint = torch.load(latest_checkpoint_filepath)
model.load_state_dict(latest_checkpoint['model_state_dict'])


<All keys matched successfully>

In [137]:
print(model)

CharGRU(
  (embedding): Embedding(65, 256)
  (gru): GRU(256, 1024, batch_first=True)
  (fc): Linear(in_features=1024, out_features=65, bias=True)
)


### The prediction loop

The following code block generates the text:

* It Starts by choosing a start string, initializing the RNN state and setting the number of characters to generate.

* Get the prediction distribution of the next character using the start string and the RNN state.

* Then, use a categorical distribution to calculate the index of the predicted character. Use this predicted character as our next input to the model.

* The RNN state returned by the model is fed back into the model so that it now has more context, instead than only one character. After predicting the next character, the modified RNN states are again fed back into the model, which is how it learns as it gets more context from the previously predicted characters.


![To generate text the model's output is fed back to the input](images/text_generation_sampling.png)

Looking at the generated text, you'll see the model knows when to capitalize, make paragraphs and imitates a Shakespeare-like writing vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [171]:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = torch.LongTensor([char2idx[s] for s in start_string])
    input_eval = input_eval.unsqueeze(0)

    # Empty string to store our results
    text_generated = []

    # Low temperatures results in more predictable text.
    # Higher temperatures results in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    # model.reset_states()
    
    model.eval()
    for i in range(num_generate):
        predictions = model.forward(input_eval)
        # remove the batch dimension
        predictions = predictions.squeeze(0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        output = torch.distributions.categorical.Categorical(logits=predictions)
        
        # Get the last predicted output character
        predicted_id = output.sample()[-1]

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = torch.LongTensor([predicted_id])
        input_eval = input_eval.unsqueeze(0)
        
        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

In [172]:
print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: INA:
KELAR sthe. t sou pery asers upe sho hig'lanouse E:
Werovatha wir ses aves merelaishamiloce:
A MINUMysse inckis blk wino thoug ce mowed'sur w s ONG tstivanst swh atinoro whe t s a bot'dyor'd thee thy RThen thound my st n as maulere th mar, we be, aveap ortet touse s of anto tave,
F anothorino aser, borize ap, tlis
S:
LAD:
I ND d f f tite and eat t, howes owhy t ethowhe o We oucy at, o it, fown bare!
GElyo wn t ho bld



Me bu besty, are ben t.
Y ed I k, we.
CYo my wirancallds prnt'll
KES:
I Kath bl betheduir' ERI glllop t,
I k t mo bulden ghirere
WICart m qutherilf thany f thive I k t y ch mbe t ced, benofrgolereed sth fugothare lo I de s d met INO:
Malll pthelllll s it four-
AMan boonse te towemel d wit my he nd s Joore nnoucy:


Thaf inge at l IO
Th Ise ce, G anobe:
Delithin t juntad y s h, at cowe towamy tu che d, g:
Thathyo ce hth ar?
ESTh y, thandopt,
ALer gout, d My aboughuprs.
The,
HA beseainconds gethowe:
ICiglllere,
ARULAMea'llith hth d omowisur I at, th it Noknill

The easiest thing you can do to improve the results it to train it for longer (try `EPOCHS=30`).

You can also experiment with a different start string, or try adding another RNN layer to improve the model's accuracy, or adjusting the temperature parameter to generate more or less random predictions.

In [1]:
! rm -rf ./training_checkpoints

## End of Example.  Return to Slides

![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

<div class="lev1">
    <a href="#NLP2-Lecture-3-Support-Notebook">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        Go to TOP
    </a>
</div>

# BLEU: BiLingual Evaluation Understudy

*NLP evaluation metric used in Machine Translation tasks*

*Suitable for measuring corpus level similarity*

*$n$-gram comparison between words in candidate sentence and reference sentences*

*Range: 0 (no match) to 1 (exact match)*

### 1. Libraries
*Install and import necessary libraries*

In [7]:
import nltk
import nltk.translate.bleu_score as bleu

import math
import numpy
import os

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

### 2. Dataset
*Array of words: candidate and reference sentences split into words*

In [8]:
hyp = str('she read the book because she was interested in world history').split()
ref_a = str('she read the book because she was interested in world history').split()
ref_b = str('she was interested in world history because she read the book').split()

### 3. *Sentence* score calculation
*Compares 1 hypothesis (candidate or source sentence) with 1+ reference sentences, returning the highest score when compared to multiple reference sentences.*

In [9]:
score_ref_a = bleu.sentence_bleu([ref_a], hyp)
print("Hyp and ref_a are the same: {}".format(score_ref_a))
score_ref_b = bleu.sentence_bleu([ref_b], hyp)
print("Hyp and ref_b are different: {}".format(score_ref_b))
score_ref_ab = bleu.sentence_bleu([ref_a, ref_b], hyp)
print("Hyp vs multiple refs: {}".format(score_ref_ab))

Hyp and ref_a are the same: 1.0
Hyp and ref_b are different: 0.7400828044922853
Hyp vs multiple refs: 1.0


### 4. *Corpus* score calculation
*Compares 1 candidate document with multiple sentence and 1+ reference documents also with multiple sentences.*

* Different than averaging BLEU scores of each sentence, it calculates the score by *"summing the numerators and denominators for each hypothesis-reference(s) pairs before the division"*

In [10]:
score_ref_a = bleu.corpus_bleu([[ref_a]], [hyp])
print("1 document with 1 reference sentence: {}".format(score_ref_a))
score_ref_a = bleu.corpus_bleu([[ref_a, ref_b]], [hyp])
print("1 document with 2 reference sentences: {}".format(score_ref_a))
score_ref_a = bleu.corpus_bleu([[ref_a], [ref_b]], [hyp, hyp])
print("2 documents with 1 reference sentence each: {}".format(score_ref_a))

1 document with 1 reference sentence: 1.0
1 document with 2 reference sentences: 1.0
2 documents with 1 reference sentence each: 0.8778107713916036


### 5. BLEU-$n$
*In BLEU-$n$, $n$-gram scores can be obtained in both **sentence** and **corpus** calculations and they're indicated by the **weights** parameter.*

* *weights*: length 4, where each index contains a weight corresponding to its respective $n$-gram.
* $n$-gram with $n \in \{1, 2, 3, 4\}$
* $\textit{weights}=(W_{N=1}, W_{N=2}, W_{N=3}, W_{N=4})$


In [11]:
score_1gram = bleu.sentence_bleu([ref_b], hyp, weights=(1,0,0,0))
score_2gram = bleu.sentence_bleu([ref_b], hyp, weights=(0,1,0,0))
score_3gram = bleu.sentence_bleu([ref_b], hyp, weights=(0,0,1,0))
score_4gram = bleu.sentence_bleu([ref_b], hyp, weights=(0,0,0,1))
print("N-grams: 1-{}, 2-{}, 3-{}, 4-{}".format(score_1gram, score_2gram, score_3gram, score_4gram))

N-grams: 1-1.0, 2-0.9, 3-0.6666666666666666, 4-0.5


* Cumulative N-grams: *by default, the score is calculatedby considering all $N$-grams equally in a geometric mean*

In [12]:
score_ngram1 = bleu.sentence_bleu([ref_b], hyp)
score_ngram = bleu.sentence_bleu([ref_b], hyp, weights=(0.25,0.25,0.25,0.25))
score_ngram_geo = (11/11*9/10*6/9*4/8)**0.25
print("N-grams: {} = {} = ".format(score_ngram1, score_ngram, score_ngram_geo))

N-grams: 0.7400828044922853 = 0.7400828044922853 = 


### Further testing

In [13]:
hyp = str('she read the book because she was interested in world history').split()
ref_a = str('she was interested in world history because she read the book').split()
hyp_b = str('the book she read was about modern civilizations.').split()
ref_b = str('the book she read was about modern civilizations.').split()

score_a = bleu.sentence_bleu([ref_a], hyp)
score_b = bleu.sentence_bleu([ref_b], hyp_b)
score_ab = bleu.sentence_bleu([ref_a], hyp_b)
score_ba = bleu.sentence_bleu([ref_b], hyp)
score_ref_a = bleu.corpus_bleu([[ref_a], [ref_b]], [hyp, hyp_b])
average = (score_a+score_b)/2
corpus = math.pow((11+8)/19 * (9+7)/(17) * (6+6)/(9+6) * (4+5)/(8+5), 1/4)
print("Sent: {}, {}, {}, {} - Corpus {}, {}, {}".format(score_a, score_b, score_ab, score_ba, score_ref_a, average, corpus))

Sent: 0.7400828044922853, 1.0, 6.664457123729399e-155, 8.190757052088229e-155 - Corpus 0.8496988908521796, 0.8700414022461427, 0.8496988908521795


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


## End of Example.  Return to Slides

![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

<div class="lev1">
    <a href="#NLP2-Lecture-3-Support-Notebook">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        Go to TOP
    </a>
</div>

# ROUGE: Recall-Oriented Understudy for Gisting Evaluation
The most common way is using the so-called ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure. This is a recall-based measure that determines how well a system-generated summary covers the content present in one or more human-generated model summaries known as references. It is recall-based to encourage systems to include all the important topics in the text. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is computed as division of count of unigrams in reference that appear in system and count of unigrams in reference summary.

If there are multiple references, the ROUGE-1 scores are averaged. Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed between an automatic summary and a reference summary, but it cannot determine if the result is coherent or the sentences flow together in a sensible manner. High-order n-gram ROUGE measures try to judge fluency to some degree. Note that ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision- based, because translation systems favor accuracy.

In [14]:
!pip install py-rouge

Collecting py-rouge
  Downloading py_rouge-1.1-py3-none-any.whl (56 kB)
[K     |████████████████████████████████| 56 kB 5.7 MB/s  eta 0:00:01
[?25hInstalling collected packages: py-rouge
Successfully installed py-rouge-1.1


In [15]:
import rouge

In [16]:
def prepare_results(p, r, f):
    return '\t{}:\t{}: {:5.2f}\t{}: {:5.2f}\t{}: {:5.2f}'.format(metric, 'P', 100.0 * p, 'R', 100.0 * r, 'F1', 100.0 * f)

In [17]:
for aggregator in ['Avg', 'Best', 'Individual']:
    print('Evaluation with {}'.format(aggregator))
    apply_avg = aggregator == 'Avg'
    apply_best = aggregator == 'Best'

    evaluator = rouge.Rouge(metrics=['rouge-n', 'rouge-l', 'rouge-w'],
    #evaluator = rouge.Rouge(metrics=['rouge-n'],
                           max_n=4,
                           limit_length=True,
                           length_limit=100,
                           length_limit_type='words',
                           apply_avg=apply_avg,
                           apply_best=apply_best,
                           alpha=0.5, # Default F1_score
                           weight_factor=1.2,
                           stemming=True)


    hypothesis_1 = "King Norodom Sihanouk has declined requests to chair a summit of Cambodia 's top political leaders , saying the meeting would not bring any progress in deadlocked negotiations to form a government .\nGovernment and opposition parties have asked King Norodom Sihanouk to host a summit meeting after a series of post-election negotiations between the two opposition groups and Hun Sen 's party to form a new government failed .\nHun Sen 's ruling party narrowly won a majority in elections in July , but the opposition _ claiming widespread intimidation and fraud _ has denied Hun Sen the two-thirds vote in parliament required to approve the next government .\n"
    references_1 = ["Prospects were dim for resolution of the political crisis in Cambodia in October 1998.\nPrime Minister Hun Sen insisted that talks take place in Cambodia while opposition leaders Ranariddh and Sam Rainsy, fearing arrest at home, wanted them abroad.\nKing Sihanouk declined to chair talks in either place.\nA U.S. House resolution criticized Hun Sen's regime while the opposition tried to cut off his access to loans.\nBut in November the King announced a coalition government with Hun Sen heading the executive and Ranariddh leading the parliament.\nLeft out, Sam Rainsy sought the King's assurance of Hun Sen's promise of safety and freedom for all politicians.",
                    "Cambodian prime minister Hun Sen rejects demands of 2 opposition parties for talks in Beijing after failing to win a 2/3 majority in recent elections.\nSihanouk refuses to host talks in Beijing.\nOpposition parties ask the Asian Development Bank to stop loans to Hun Sen's government.\nCCP defends Hun Sen to the US Senate.\nFUNCINPEC refuses to share the presidency.\nHun Sen and Ranariddh eventually form a coalition at summit convened by Sihanouk.\nHun Sen remains prime minister, Ranariddh is president of the national assembly, and a new senate will be formed.\nOpposition leader Rainsy left out.\nHe seeks strong assurance of safety should he return to Cambodia.\n",
                    ]

    hypothesis_2 = "China 's government said Thursday that two prominent dissidents arrested this week are suspected of endangering national security _ the clearest sign yet Chinese leaders plan to quash a would-be opposition party .\nOne leader of a suppressed new political party will be tried on Dec. 17 on a charge of colluding with foreign enemies of China '' to incite the subversion of state power , '' according to court documents given to his wife on Monday .\nWith attorneys locked up , harassed or plain scared , two prominent dissidents will defend themselves against charges of subversion Thursday in China 's highest-profile dissident trials in two years .\n"
    references_2 = "Hurricane Mitch, category 5 hurricane, brought widespread death and destruction to Central American.\nEspecially hard hit was Honduras where an estimated 6,076 people lost their lives.\nThe hurricane, which lingered off the coast of Honduras for 3 days before moving off, flooded large areas, destroying crops and property.\nThe U.S. and European Union were joined by Pope John Paul II in a call for money and workers to help the stricken area.\nPresident Clinton sent Tipper Gore, wife of Vice President Gore to the area to deliver much needed supplies to the area, demonstrating U.S. commitment to the recovery of the region.\n"

    all_hypothesis = [hypothesis_1, hypothesis_2]
    all_references = [references_1, references_2]

    scores = evaluator.get_scores(all_hypothesis, all_references)

    for metric, results in sorted(scores.items(), key=lambda x: x[0]):
        if not apply_avg and not apply_best: # value is a type of list as we evaluate each summary vs each reference
            for hypothesis_id, results_per_ref in enumerate(results):
                nb_references = len(results_per_ref['p'])
                for reference_id in range(nb_references):
                    print('\tHypothesis #{} & Reference #{}: '.format(hypothesis_id, reference_id))
                    print('\t' + prepare_results(results_per_ref['p'][reference_id], results_per_ref['r'][reference_id], results_per_ref['f'][reference_id]))
            print()
        else:
            print(prepare_results(results['p'], results['r'], results['f']))
    print()

Evaluation with Avg
	rouge-1:	P: 28.62	R: 26.46	F1: 27.49
	rouge-2:	P:  4.21	R:  3.92	F1:  4.06
	rouge-3:	P:  0.80	R:  0.74	F1:  0.77
	rouge-4:	P:  0.00	R:  0.00	F1:  0.00
	rouge-l:	P: 30.52	R: 28.57	F1: 29.51
	rouge-w:	P: 15.85	R:  8.28	F1: 10.87

Evaluation with Best
	rouge-1:	P: 30.44	R: 28.36	F1: 29.37
	rouge-2:	P:  4.74	R:  4.46	F1:  4.59
	rouge-3:	P:  1.06	R:  0.98	F1:  1.02
	rouge-4:	P:  0.00	R:  0.00	F1:  0.00
	rouge-l:	P: 31.54	R: 29.71	F1: 30.60
	rouge-w:	P: 16.42	R:  8.82	F1: 11.47

Evaluation with Individual
	Hypothesis #0 & Reference #0: 
		rouge-1:	P: 38.54	R: 35.58	F1: 37.00
	Hypothesis #0 & Reference #1: 
		rouge-1:	P: 45.83	R: 43.14	F1: 44.44
	Hypothesis #1 & Reference #0: 
		rouge-1:	P: 15.05	R: 13.59	F1: 14.29

	Hypothesis #0 & Reference #0: 
		rouge-2:	P:  7.37	R:  6.80	F1:  7.07
	Hypothesis #0 & Reference #1: 
		rouge-2:	P:  9.47	R:  8.91	F1:  9.18
	Hypothesis #1 & Reference #0: 
		rouge-2:	P:  0.00	R:  0.00	F1:  0.00

	Hypothesis #0 & Reference #0: 
		rouge-3:	P: 

In short and approximately:

+ ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
+ ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
+ ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

## End of Example.  Return to Slides

![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

<div class="lev1">
    <a href="#NLP2-Lecture-3-Support-Notebook">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        Go to TOP
    </a>
</div>

# Bilingual Evaluation Understudy Score (BLEAU)
ref. Develop Deep Learning Models for Natural Language in Python, Jason Brownlee
## Sentence BLEU Score
NLTK provides the sentence bleu() function for evaluating a candidate sentence against one or more reference sentences. The reference sentences must be provided as a list of sentences where each reference is a list of tokens. The candidate sentence is provided as a list of tokens.

In [18]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'course', 'from', 'MLU'], ['this', 'is' 'course']] 
candidate = ['this', 'is', 'a', 'course']
score = sentence_bleu(reference, candidate)
print(score)

1.0


Running this example prints a perfect score as the candidate matches one of the references exactly.

## Corpus BLEU Score
NLTK also provides a function called corpus bleu() for calculating the BLEU score for multiple sentences such as a paragraph or a document. The references must be specified as a list of documents where each document is a list of references and each alternative reference is a list of tokens, e.g. a list of lists of lists of tokens. The candidate documents must be specified as a list where each document is a list of tokens, e.g. a list of lists of tokens. This is a little confusing; here is an example of two references for one document.

In [19]:
# two references for one document
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'course', 'from', 'MLU'], ['this', 'is' 'course']]] 
candidates = [['this', 'is', 'a', 'course', 'from', 'MLU']]
score = corpus_bleu(references, candidates)
print(score)

1.0


## Cumulative and Individual BLEU Scores
The BLEU score calculations in NLTK allow you to specify the weighting of different n-grams in the calculation of the BLEU score. This gives you the flexibility to calculate different types of BLEU score, such as individual and cumulative n-gram scores. Let’s take a look.
### Individual n-gram Scores
An individual n-gram score is the evaluation of just matching grams of a specific order, such as single words (1-gram) or word pairs (2-gram or bigram). The weights are specified as a tuple where each index refers to the gram order. To calculate the BLEU score only for 1-gram matches, you can specify a weight of 1 for 1-gram and 0 for 2, 3 and 4 (1, 0, 0, 0). For example:

In [20]:
# 1-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'course']]
candidate = ['this', 'is', 'a', 'course']
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)) 
print(score)

0.75


We can repeat this example for individual n-grams from 1 to 4 as follows:

In [21]:
# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'course']]
candidate = ['this', 'is', 'a', 'course']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))) 
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0))) 
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0))) 
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Individual 1-gram: 1.000000
Individual 2-gram: 1.000000
Individual 3-gram: 1.000000
Individual 4-gram: 1.000000


Although we can calculate the individual BLEU scores, this is not how the method was intended to be used and the scores do not carry a lot of meaning, or seem that interpretable.
### Cumulative n-gram Scores
Cumulative scores refer to the calculation of individual n-gram scores at all orders from 1 to n and weighting them by calculating the weighted geometric mean. By default, the sentence bleu() and corpus bleu() scores calculate the cumulative 4-gram BLEU score, also called BLEU-4. The weights for the BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores. For example:

In [22]:
# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'course']]
candidate = ['this', 'is', 'a', 'course']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)) 
print(score)

1.0547686614863434e-154


The cumulative and individual 1-gram BLEU use the same weights, e.g. (1, 0, 0, 0). The 2-gram weights assign a 50% to each of 1-gram and 2-gram and the 3-gram weights are 33% for each of the 1, 2 and 3-gram scores. Let’s make this concrete by calculating the cumulative scores for BLEU-1, BLEU-2, BLEU-3 and BLEU-4:

In [23]:
# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'course']]
candidate = ['this', 'is', 'a', 'course']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

Cumulative 1-gram: 0.750000
Cumulative 2-gram: 0.500000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000


It is common to report the cumulative BLEU-1 to BLEU-4 scores when describing the skill of a text generation system.

## A Working Example
In this demo, we try to develop further intuition for the BLEU score with some examples. We work at the sentence level with a single reference sentence of the following: <br/>
__the quick brown fox jumped over the lazy dog__

First, let’s look at a perfect score.

In [24]:
# prefect match
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']] 
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog'] 
score = sentence_bleu(reference, candidate)
print(score)

1.0


Next, let’s change one word, ‘quick’ to ‘fast’.

In [25]:
# one word different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']] 
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog'] 
score = sentence_bleu(reference, candidate)
print(score)

0.7506238537503395


This result is a slight drop in score.

Let's try changing two words, both ‘quick’ to ‘fast’ and ‘lazy’ to ‘sleepy’.


In [26]:
# two words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']] 
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog'] 
score = sentence_bleu(reference, candidate)
print(score)

0.4854917717073234


Running the example, we can see a linear drop in skill.

What about if all words are different in the candidate?

In [27]:
# all words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']] 
candidate = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
score = sentence_bleu(reference, candidate)
print(score)

0


We get the worse possible score.

Now, let’s try a candidate that has fewer words than the reference (e.g. drop the last two words), but the words are all correct.

In [28]:
# shorter candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']] 
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the']
score = sentence_bleu(reference, candidate)
print(score)

0.7514772930752859


The score is much like the score when two words were wrong above.

How about if we make the candidate two words longer than the reference?

In [29]:
# longer candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']] 
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog',
'from', 'space']
score = sentence_bleu(reference, candidate) 
print(score)

0.7860753021519787


Again, we can see that our intuition holds and the score is something like two words wrong.

Finally, let’s compare a candidate that is way too short: only two words in length.

In [30]:
# very short
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick']
score = sentence_bleu(reference, candidate)
print(score)

4.5044474950870215e-156


Running this example first prints a warning message indicating that the 3-gram and above part of the evaluation (up to 4-gram) cannot be performed. This is fair given we only have 2-grams to work with in the candidate.
And we can see a score that is very low indeed.


## End of Example.  Return to Slides

![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

<div class="lev1">
    <a href="#NLP2-Lecture-3-Support-Notebook">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        Go to TOP
    </a>
</div>