# Introduction

<center><h3>**Welcome to the Language modeling Notebook.**</h3></center>

In this assignment, you are going to train a neural network to **generate news headlines**.
To reduce computational needs, we have reduced it to headlines about technology, and a handful of Tech giants.
In this assignment you will:
- Learn to preprocess raw text so it can be fed into an LSTM.
- Make use of the LSTM library of Tensorflow, to train a Language model to generate headlines
- Use your network to generate headlines, and judge which headlines are likely or not




**What is a language model?**

Language modeling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language models also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words.
— Page 105, __[Neural Network Methods in Natural Language Processing](https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies/dp/1627052984/)__, 2017.

In terms of neural network, we are training a neural network to produce probabilities (classification) over a fixed vocabulary of words.
Concretely, we are training a neural network to produce:
$$ P ( w_{i+1} | w_1, w_2, w_3, ..., w_i), \forall i \in (1,n)$$

** Why is language modeling important? **

Language modeling is a core problem in NLP.

Language models can either be used as a stand-alone to produce new text that matches the distribution of text the model is trained on, but can also be used at the front-end of a more sophisticated model to produce better results.

Recently for example, the __[BERT](https://arxiv.org/abs/1810.04805)__ paper show-cased that pretraining a large neural network on a language modeling task can help improve state-of-the-art on many NLP tasks. 

How good can the generation of a Language model be?

If you have not seen the latest post by OpenAI, you should read some of the samples they generated from their language model __[here](https://blog.openai.com/better-language-models/#sample1)__.
Because of computational restrictions, we will not achieve as good text production, but the same algorithm is at the core. They just use more data and compute.

# Library imports

Before starting, make sure you have all these libraries.

In [3]:
from segtok import tokenizer
from collections import Counter
import tensorflow as tf
import numpy as np
import json
import os

import warnings
warnings.filterwarnings("ignore")

root_folder = ""

# Loading the datasets

Make sure the dataset files are all in the `dataset` folder of the assignment.

 - If you are using this notebook locally: You should run the `download_data.sh` script.
 - If you are using the Colab version of the notebook, make sure that your Google Drive is mounted, and you verify from the file explorer in Colab that the files are viewable within `/content/gdrive/CS182_HW03/dataset/`
 


In [4]:
# This cell loads the data for the model
# Run this before working on loading any of the additional data

with open(root_folder+"dataset/headline_generation_dataset_processed.json", "r") as f:
    d_released = json.load(f)

with open(root_folder+"dataset/headline_generation_vocabulary.txt", "r") as f:
    vocabulary = f.read().split("\n")
w2i = {w: i for i, w in enumerate(vocabulary)} # Word to index
unkI, padI, start_index = w2i['UNK'], w2i['PAD'], w2i['<START>']

vocab_size = len(vocabulary)
input_length = len(d_released[0]['numerized']) # The length of the first element in the dataset, they are all of the same length
d_train = [d for d in d_released if d['cut'] == 'training']
d_valid = [d for d in d_released if d['cut'] == 'validation']

print("Number of training samples:",len(d_train))
print("Number of validation samples:",len(d_valid))

Number of training samples: 88568
Number of validation samples: 946


Now that we have loaded the data, let's inspect one of the elements. Each sample in our dataset is has a `numerized` vector, that contains the preprocessed headline. This vector is what we will feed in to the neural network. The field `numerized` corresponds to this list of tokens. The already loaded dictionary `vocabulary` maps token lists to the actual string. Use these elements to recover `title` key of entry 1001 in the training dataset.

**TODO**: Write the numerized2text function and inspect element 1001 in the training dataset (`entry = d_train[1001]`).



In [5]:
def numerized2text(numerized):
    """ Converts an integer sequence in the vocabulary into a string corresponding to the title.
    
        Arguments:
            numerized: List[int]  -- The list of vocabulary indices corresponding to the string
        Returns:
            title: str -- The string corresponding to the numerized input, without padding.
    """
    #####
    # BEGIN YOUR CODE HERE 
    # Recover each word from the vocabulary in the list of indices in numerized, using the vocabulary variable
    # Hint: Use the string.join() function to reconstruct a single string
    #####
    
    words = []
    converted_string = None
    for n in numerized:
        if n != 2:
            words.append(vocabulary[n])
    converted_string = " ".join(words)
    
    #####
    # END YOUR CODE HERE
    #####
    
    return converted_string

entry = d_train[1001]
print("Reversing the numerized: "+numerized2text(entry['numerized']))
print("From the `title` entry: "+ entry['title'])

Reversing the numerized: microsoft donates cloud computing ' worth $ 1 bn '
From the `title` entry: Microsoft donates cloud computing 'worth $1 bn'


In language modeling, we train a model to produce the next word in the sequence given all previously generated words. This has, in practice, two steps:


    1. Adding a special <START> token to the start of the sequence for the input. This "shifts" the input to the right by one. We call this the "source" sequence
    2. Making the network predict the original, unshifted version (we call this the "target" sequence)

    
Let's take an example. Say we want to train the network on the sentence: "The cat is great."
The input to the network will be "`<START>` The cat is great." The target will be: "The cat is great".
    
Therefore the first prediction is to select the word "The" given the `<START>` token.
The second prediction is to produce the word "cat" given the two tokens "`<START>` The".
At each step, the network learns to predict the next word, given all previous ones.
    
---

Your next step is to write the build_batch function. Given a dataset, we select a random subset of samples, and will build the "inputs" and the "targets" of the batch, following the procedure we've described.

**TODO**: write the build_batch function. We give you the structure, and you have to fill in where we have left things `None`.


In [47]:
def build_batch(dataset, batch_size):
    """ Builds a batch of source and target elements from the dataset.
    
        Arguments:
            dataset: List[db_element] -- A list of dataset elements
            batch_size: int -- The size of the batch that should be created
        Returns:
            batch_input: List[List[int]] -- List of source sequences
            batch_target: List[List[int]] -- List of target sequences
            batch_target_mask: List[List[int]] -- List of target batch masks
    """
    
    #####
    # BEGIN YOUR CODE HERE 
    #####
    
    
    # We get a list of indices we will choose from the dataset.
    # The randint function uses a uniform distribution, giving equal probably to any entry
    # for each batch
    indices = list(np.random.randint(0, len(dataset), size=batch_size))
    
    # Recover what the entries for the batch are
    batch = [dataset[i] for i in indices]
    
    # Get the raw numerized for this input, each element of the dataset has a 'numerized' key
    batch_numerized = [data['numerized'] for data in batch]

    # Create an array of start_index that will be concatenated at position 1 for the input.
    # Should be of shape (batch_size, 1)
    start_tokens = np.zeros((batch_size, 1))

    # Concatenate the start_tokens with the rest of the input
    # The np.concatenate function should be useful
    # The output should now be [batch_size, sequence_length+1]
    batch_input = np.concatenate((start_tokens, batch_numerized), axis = 1)

    # Remove the last word from each element in the batch
    # To restore the [batch_size, sequence_length] size
    batch_input = batch_input[:, :-1]
    
    # The target should be the un-shifted numerized input
    batch_target = batch_numerized

    # The target-mask is a 0 or 1 filter to note which tokens are
    # padding or not, to give the loss, so the model doesn't get rewarded for
    # predicting PAD tokens.
    batch_target_mask = np.array([a['mask'] for a in batch])
    
    #####
    # END YOUR CODE HERE 
    #####
    return batch_input, batch_target, batch_target_mask

test_run = build_batch(d_train, 3)
print(test_run[0])
print(test_run[1])
print(test_run[2])

[[0.000e+00 8.357e+03 7.480e+02 3.040e+02 3.000e+00 6.000e+00 1.700e+01
  7.200e+01 8.410e+02 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00
  2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00]
 [0.000e+00 5.000e+00 3.000e+00 1.130e+02 9.400e+01 9.000e+00 2.780e+02
  4.100e+01 1.331e+03 3.000e+01 1.185e+03 2.036e+03 2.000e+00 2.000e+00
  2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00]
 [0.000e+00 2.200e+01 7.000e+00 4.200e+02 8.360e+02 6.000e+00 1.700e+01
  1.300e+01 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00
  2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00]]
[[8357, 748, 304, 3, 6, 17, 72, 841, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [5, 3, 113, 94, 9, 278, 41, 1331, 30, 1185, 2036, 2, 2, 2, 2, 2, 2, 2, 2, 2], [22, 7, 420, 836, 6, 17, 13, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]]
[[ True  True  True  True  True  True  True  True False False False False
  False False False False False False False False]
 [ True  True  True  True  True 

# Creating the language model

Now that we've written the data pipelining, we are ready to write the Neural network.

The steps to setting up a neural network to do Language modeling are:
- Creating the placeholders for the model, where we can feed in our inputs and targets.
- Creating an RNN of our choice, size, and with optional parameters
- Using the RNN on our placeholder inputs.
- Getting the output from the RNN, and projecting it into a vocabulary sized dimension, so that we can make word predictions.
- Setting up the loss on the outputs so that the network learns to produce the correct words.
- Finally, choosing an optimizer, and defining a training operation: using the optimizer to minimize the loss.

We provide skeleton code for the model, you can fill in the `None` section. If you are unfamiliar with Tensorflow, we provide some idea of what functions to look for, you should use the Tensorflow online documentation.

**TODO**: Replace the `None` variables with their respective code elements in the LanguageModel Class


In [24]:
# Using a basic RNN/LSTM for Language modeling
class LanguageModel():
    def __init__(self, input_length, vocab_size, rnn_size, learning_rate=1e-4):
        
        # Create the placeholders for the inputs:
        # All three placeholders should be of size [None, input_length]
        # Where None represents a variable batch_size, and input_length is the
        # maximal length of a sequence of words, after being padded.
        self.input_num = tf.placeholder(tf.int32, shape=[None, input_length])
        self.targets = tf.placeholder(tf.int32, shape=[None, input_length])
        self.targets_mask = tf.placeholder(tf.int32, shape=[None, input_length])
        # Create an embedding variable of shape [vocab_size, rnn_size]
        # That will map each word in our vocab into a vector of rnn_size size.
        
        embedding = tf.get_variable("embedding", shape = [vocab_size, rnn_size], initializer = tf.contrib.layers.xavier_initializer())
        # Use the tensorflow embedding_lookup function
        # To embed the input_num, using the embedding variable we've created
        input_emb = tf.nn.embedding_lookup(embedding, self.input_num)
        print(self.input_num.shape)
        print(input_emb.shape)

        # Create a an RNN or LSTM cell of rnn_size size.
        # Look into the tf.nn.rnn_cell documentation
        # You can optionally use Tensorflow Add-ons such as the MultiRNNCell, or the DropoutWrapper
        lm_cell = tf.nn.rnn_cell.BasicRNNCell(rnn_size)
        
        # Use the dynamic_rnn function of Tensorflow to run the embedded inputs
        # using the lm_cell you've created, and obtain the outputs of the RNN cell.
        # You have created a cell, which represents a single block (column) of the RNN.
        # dynamic_rnn will "copy" the cell for each element in your sequence, runs the input you provide through the cell,
        # and returns the outputs and the states of the cell.
        outputs, states = tf.nn.dynamic_rnn(lm_cell, input_emb, dtype = float)

        # Use a dense layer to project the outputs of the RNN cell into the size of the
        # vocabulary (vocab_size).
        # output_logits should be of shape [None,input_length,vocab_size]
        # You can look at the tf.layers.dense function
        self.output_logits = tf.layers.dense(outputs, vocab_size)

        # Setup the loss: using the sparse_softmax_cross_entropy.
        # The logits are the output_logits we've computed.
        # The targets are the gold labels we are trying to match
        # Don't forget to use the targets_mask we have, so your loss is not off,
        # And your model doesn't get rewarded for predicting PAD tokens
        # You might have to cast the masks into float32. Look at the tf.cast function.
        float_mask = tf.to_float(self.targets_mask)
        self.loss = tf.losses.sparse_softmax_cross_entropy(self.targets, self.output_logits, float_mask)

        # Setup an optimizer (SGD, RMSProp, Adam), you can find a list under tf.train.*
        # And provide it with a start learning rate.

        optimizer = tf.train.AdamOptimizer(learning_rate)    

        # We create a train_op that requires the optimizer we've created to minimize the
        # loss we've defined.
        # look for the optimizer.minimize function, define what should be miniminzed.
        # You can provide it with the provide an optional global_step parameter as well that keeps of how many
        # Optimizations steps have been run.
        
        self.global_step = tf.train.get_or_create_global_step()
        self.train_op = optimizer.minimize(self.loss)
        self.saver = tf.train.Saver()

Once you have created the Model class, we should instantiate the model. The line tf.reset_default_graph() resets the graph for the Jupyter notebook, so multiple models aren't floating around. If you have trouble with redefinition of variables, it may be worth re-running the cell below. 

In [106]:
!pip3 install "gast==0.2.2"

Collecting gast==0.2.2
Installing collected packages: gast
  Found existing installation: gast 0.3.3
    Uninstalling gast-0.3.3:
      Successfully uninstalled gast-0.3.3
Successfully installed gast-0.2.2


In [25]:
# We can create our model,
# with parameters of our choosing.


tf.reset_default_graph() # This is so that when you debug, you reset the graph each time you run this, in essence, cleaning the board
model = LanguageModel(input_length=input_length, vocab_size=vocab_size, rnn_size=256, learning_rate=1e-3)

(?, 20)
(?, 20, 256)


# Training the model

Your objective is to train the Language on the dataset you are provided to reach a **validation loss <= 5.50**

**TODO**: Train your model so that it achieves a validation loss of <= 5.5. 

**Careful**: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit. You must save the model you want us to test under: models/final_language_model (the .index, .meta and .data files)

**Advice**:
- It should be possible to attain loss <= 5.50 with a 1-layer LSTM of size 256 or less.
- You should not need more than 10 epochs to attain the threshold. More passes over the data can however give you a better model.
- You can however try using:
    - LSTM dropout (Tensorflow has a layer for that)
    - Multi-layer RNN cell (Tensorflow has a layer for that)
    - Change your optimizers, tune your learning_rate, use a learning rate schedule.
    
**Extra credit**:

Get the loss below **validation loss <= 5.00** and get 5 points of extra-credit on this assignment. Get creative,

but remember, what you do should work on our held-out test set to get the points.

In [26]:
# Skeleton code
# You have to write your own training process to obtain a
# Good performing model on the validation set, and save it.
import tensorflow.python.util.deprecation as dep

dep._PRINT_DEPRECATION_WARNINGS = False

experiment = root_folder+"models/magic_model"
"""
with tf.Session() as sess:
    # Here is how you initialize weights of the model according to their
    # Initialization parameters.
    sess.run(tf.global_variables_initializer())
    
    # Here is how you obtain a batch:
    batch_size = 16
    batch_input, batch_target, batch_target_mask = build_batch(d_train, batch_size)
    # Map the values to each tensor in a `feed_dict`
    feed = {model.input_num: batch_input, model.targets: batch_target, model.targets_mask: batch_target_mask}

    # Obtain a single value of the loss for that batch.
    # !IMPORTANT! Don't forget to include the train_op to when using a batch from the training dataset
    # (d_train)
    # !MORE IMPORTANT! Don't use the train_op if you evaluate the loss on the validation set,
    # Otherwise, your network will overfit on your validation dataset.
    
    step, train_loss, _ = sess.run([model.global_step, model.loss, model.train_op], feed_dict=feed)
    
    # Here is how you save the model weights
    model.saver.save(sess, experiment)
    
    # Here is how you restore the weights previously saved
    model.saver.restore(sess, experiment)
"""
model_path = root_folder+"models/language_model_1"
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    batch_size = 30
    num_iter = 2
    print_every = 100
    num_loop = num_iter * len(d_train) / batch_size
    
    for i in range(int(num_loop)):
        batch_input, batch_target, batch_target_mask = build_batch(d_train, batch_size)
        feed = {model.input_num: batch_input, model.targets: batch_target, model.targets_mask: batch_target_mask}
        step, train_loss, _ = sess.run([model.global_step, model.loss, model.train_op], feed_dict=feed)
        if i % print_every == 0:
            print(train_loss)
            
    model.saver.save(sess, model_path)

9.210474
8.593926
7.5178742
7.587587
7.5068254
7.5164933
7.593069
7.366342
7.3722396
6.995211
6.965376
7.048409
7.0839877
7.2779613
7.059818
7.008044
6.9454203
7.0351086
6.7103286
6.8404307
6.836645
7.1152067
6.87054
6.753252
6.6962433
6.734917
6.6719484
6.7512827
6.8676853
6.5743656
6.7195644
6.549944
6.493685
6.722496
6.6197395
6.515445
6.6169167
6.8771763
6.7079744
6.516627
6.4671307
6.6655684
6.6805935
6.5874386
6.788619
6.6835103
6.686114
6.523892
6.5168195
6.561584
6.241732
6.4873343
6.4505258
6.5493784
6.4222345
6.392375
6.602844
6.206663
6.451582
6.523972
6.323896
6.3301616
6.6368446
6.404371
6.466408
6.45383
6.483219
6.11298
6.301477
6.4764977
6.493512
6.377012
6.3591733
6.487005
6.1950364
6.1753807
6.0535393
6.204574
6.3168893
6.227045
6.33189
6.2650013
6.345302
6.49921
6.3069754
6.5344105
6.39885
6.11942
6.098263
6.3924875
6.406852
6.345641
6.010637
6.168981
6.448154
6.037938
6.158817
6.103353
5.990959
6.0530734
6.1666694
6.129173
6.330406
6.298727
6.107854
6.1324296
6.31361

# Using the language model

Congratulations, you have now trained a language model! We can now use it to evaluate likely news headlines, as well as generate our very own headlines.

**TODO**: Complete the three parts below, using the model you have trained.

## (1) Evaluation loss

To evaluate the language model, we evaluate its loss (ability to predict) on unseen data that is reserved for evaluation.
Your first evaluation is to load the model you trained, and obtain a test loss.

In [28]:
# Your best performing model should go here.
model_file = root_folder+"models/language_model_1"

In [29]:
# We will evaluate your model in the model_file above
# In a very similar way as the code below.
# Make sure your validation loss is befow the threshold we specified
# and that you didn't train using the validation set, as you would
# get penalized.

with tf.Session() as sess:
    model.saver.restore(sess, model_file)
    eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
    feed = {model.input_num: eval_input, model.targets: eval_target, model.targets_mask: eval_target_mask}
    eval_loss = sess.run([model.loss], feed_dict=feed)
    print("Evaluation set loss:", eval_loss)

INFO:tensorflow:Restoring parameters from models/language_model_1
Evaluation set loss: [5.439161]


## (2) Evaluation of likelihood of data

One use of a language model is to see what data is more likely to have originated from the training data. Because we have trained our model on news headlines, we can see which of these headlines is more likely:

``Apple to release another iPhone in September``


 ``Apple and Samsung resolve all lawsuits amicably``
 
**TODO**: Use the model to obtain the loss the neural network assigns to each sentence.
Because the neural network assigns probability to the words appearing in a sequence, this loss can be used as a proxy to measure how likely the sentence is to have occurred in the dataset.
Once you have the loss for each headline, write down which sentence was judged to be more likely, and explain why/if you think this is coherent.

**Your answer:**


In [50]:
headline1 = "Apple to release new iPhone in July"
headline2 = "Apple and Samsung resolve all lawsuits"

headlines = [headline1, headline2]

with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    for headline in headlines:
        headline = headline.lower() # Our LSTM is trained on lower-cased headlines
        
        headline_dict = {}
        
        # From the code in the Preprocessing section at the end of the notebook
        # Find out how to tokenize the headline
        tokenized = tokenizer.word_tokenizer(headline.lower())
        
        # Find out how to numerize the tokenized headline
        numerized = numerize_sequence(tokenized)

        
        
        # Learn how to pad and obtain the mask of the sequence.
        padded, mask = pad_sequence(numerized, 2, input_length)
        headline_dict['numerized'] = padded
        headline_dict['mask'] = mask
        
        # Obtain the loss of the sequence, and pring it
        
        headline_in = []
        headline_in.append(headline_dict)
        batch_input, batch_target, batch_target_mask = build_batch(headline_in, 1)
        feed = {model.input_num: batch_input, model.targets: batch_target, model.targets_mask: batch_target_mask}
        step, loss = sess.run([model.global_step, model.loss], feed_dict=feed)
        print("----------------------------------------")
        print("Headline:",headline)
        print("Loss of the headline:", loss)

# Important check: one headline should be more likely (and have lower loss)
# Than the other headline. You should know which headline should have lower loss.

INFO:tensorflow:Restoring parameters from models/language_model_1
Tensor("dense/BiasAdd:0", shape=(?, 20, 10000), dtype=float32)
----------------------------------------
Headline: apple to release new iphone in july
Loss of the headline: 3.6254086
Tensor("dense/BiasAdd:0", shape=(?, 20, 10000), dtype=float32)
----------------------------------------
Headline: apple and samsung resolve all lawsuits
Loss of the headline: 6.2516007


## (3) Generation of headlines

We can use our language model to generate text according to the distribution of our training data.
The way generation works is the following:

We seed the model with a beginning of sequence, and obtain the distribution for the next word.
We select the most likely word (argmax) and add it to our sequence of words.
Now our sequence is one word longer, and we can feed it in again as an input, for the network to produce the next sentence.
We do this a fixed number of times (up to 20 words), and obtain automatically generated headlines!


We have provided a few headline starters that should produce interesting generated headlines.

**TODO:** Get creative and find at least 2 more headline_starters that produce interesting headlines.

In [72]:
with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    # Here are some headline starters.
    # They're all about tech companies, because
    # That is what is in our dataset
    headline_starters = ["apple has released", "google has released", "amazon", "tesla to"]
    
    for headline_starter in headline_starters:
        print("===================")
        print("Generating headline starting with: "+headline_starter)

        # Tokenize and numerize the headline. Put the numerized headline
        # beginning in `current_build`
        tokenized = tokenizer.word_tokenizer(headline_starter.lower())
        current_build = numerize_sequence(tokenized)

        while len(current_build) < input_length:
            # Pad the current_build into a input_length vector.
            # We do this so that it can be processed by our LanguageModel class
            current_padded = current_build[:input_length] + [padI] * (input_length - len(current_build))
            current_padded = np.array([current_padded])
            
            
            input_dict = {}
            input_dict['numerized'] = current_padded[0]
            input_dict['mask'] = [w != 2 for w in current_padded][0]
            cur_in = []
            cur_in.append(input_dict)
            
            # Obtain the logits for the current padded sequence
            # This involves obtaining the output_logits from our model,
            # and not the loss like we have done so far
            batch_input, batch_target, batch_target_mask = build_batch(cur_in, 1)
            feed = {model.input_num: batch_input, model.targets: batch_target, model.targets_mask: batch_target_mask}
            logits = model.output_logits.eval(feed_dict = feed)[0]

            # Obtain the row of logits that interest us, the logits for the last non-pad
            # inputs
            last_logits = logits[len(current_build)]
            
            # Find the highest scoring word in the last_logits
            # array. The np.argmax function should be useful.
            # Append this word to our current build
            max_word = np.argmax(last_logits)
            current_build.append(max_word)
        
        # Go from the current_build of word_indices
        # To the headline (string) produced. This should involve
        # the vocabulary, and a string merger.
        produced_sentence = numerized2text(current_build)
        print(produced_sentence)

INFO:tensorflow:Restoring parameters from models/language_model_1
Generating headline starting with: apple has released
apple has released a huge UNK to the same UNK of the same UNK of the same UNK of the
Generating headline starting with: google has released
google has released a new version of its UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK
Generating headline starting with: amazon
amazon is UNK the UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK
Generating headline starting with: tesla to
tesla to UNK apple , UNK , UNK , UNK , UNK : intellectual property case : UNK UNK UNK


## All done

You are done with the first part of the HW.

Next notebook deals with Summarization of text!


# Preprocessing (read only)


**You can skip this section, however you may find these functions useful later in the assignment**

We have provided this code so you see how the dataset was generated. You will have to come back some of these functions later in the assignment, so feel free to read through, to get familiar.

In [32]:
def numerize_sequence(tokenized):
    return [w2i.get(w, unkI) for w in tokenized]
def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

for a in dataset:
    a['tokenized'] = tokenizer.word_tokenizer(a['title'].lower())

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

word_counts = Counter()
for a in dataset:
    word_counts.update(a['tokenized'])

print(word_counts.most_common(30))

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

# Creating the vocab
vocab_size = 20000
special_words = ["<START>", "UNK", "PAD"]
vocabulary = special_words + [w for w, c in word_counts.most_common(vocab_size-len(special_words))]
w2i = {w: i for i, w in enumerate(vocabulary)}

# Numerizing and padding
input_length = 20
unkI, padI, startI = w2i['UNK'], w2i['PAD'], w2i['<START>']

for a in dataset:
    a['numerized'] = numerize_sequence(a['tokenized']) # Change words to IDs
    a['numerized'], a['mask'] = pad_sequence(a['numerized'], padI, input_length) # Append appropriate PAD tokens
    
# Compute fraction of words that are UNK:
word_counters = Counter([w for a in dataset for w in a['input'] if w != padI])

print("Fraction of UNK words:", float(word_counters[unkI]) / sum(word_counters.values()))

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

d_released_processed   = [d for d in dataset if d['cut'] != 'testing']
d_unreleased_processed = [d for d in dataset if d['cut'] == 'testing']

with open("dataset/headline_generation_dataset_processed.json", "w") as f:
    json.dump(d_released_processed, f)

# This file is purposefully left out of the assignment, we will use it to evaluate your model.
with open("dataset/headline_generation_dataset_unreleased_processed.json", "w") as f:
    json.dump(d_unreleased_processed, f)
    
with open("dataset/headline_generation_vocabulary.txt", "w") as f:
    f.write("\n".join(vocabulary).encode('utf8'))