# Assignment 3: Language Modelling with LSTM networks

In this assignment, you will implement an LSTM-based language model. We strongly recommend finishing _lab 4_ first, which is closely related and much simpler.

## Setup

First, let's load the data as before.

In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import re
import random

# Let's do 2-way positive/negative classification instead of 5-way
easy_label_map = {0:0, 1:0, 2:None, 3:1, 4:1}

def load_sst_data(path):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    random.seed(1)
    random.shuffle(data)
    return data

sst_home = 'drive/My Drive/2019-2020_labs/data/trees/'
training_set = load_sst_data(sst_home + '/train.txt')
dev_set = load_sst_data(sst_home + '/dev.txt')
test_set = load_sst_data(sst_home + '/test.txt')

# Note: Unlike with k-nearest neighbors, evaluation here should be fast, and we don't need to
# trim down the dev and test sets. 

Next, we'll convert the data to index vectors.

To simplify your implementation, we'll use a fixed unrolling length of 20. This means that we'll have to expand each sentence into a sequence of 21 word indices. In the conversion process, we'll mark the start of each sentence with a special word symbol `<S>`, mark the end of each sentence (if it occurs within the first 21 words) with a special word symbol `</S>`, mark extra tokens after `</S>` with a special word symbol `<PAD>`, and mark out-of-vocabulary words with `<UNK>`, for unknown. As in the previous assignment, we'll use a very small vocabulary for this assignment, so you'll see `<UNK>` often.
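As a quick illustration of the conversion just described, here is a minimal sketch on a toy example. The vocabulary, the helper name `to_padded_indices`, and the sequence length of 8 are hypothetical and chosen only to keep the output short; the real vocabulary is built from the training set and the real sequence length is 21.

```python
# Hypothetical toy vocabulary; the real one is extracted from the training data.
START, END, PAD, UNK = "<S>", "</S>", "<PAD>", "<UNK>"
toy_vocab = [START, END, PAD, UNK, "the", "movie", "is", "good"]
toy_indices = {word: i for i, word in enumerate(toy_vocab)}

def to_padded_indices(text, word_to_index, seq_len):
    """Wrap the sentence in <S> ... </S>, then truncate or pad it to seq_len indices."""
    tokens = [START] + text.lower().split() + [END]
    # Truncate long sentences; pad short ones after </S>
    tokens = tokens[:seq_len] + [PAD] * max(0, seq_len - len(tokens))
    # Out-of-vocabulary words map to <UNK>
    return [word_to_index.get(token, word_to_index[UNK]) for token in tokens]

short = to_padded_indices("The movie is surprisingly good", toy_indices, seq_len=8)
long = to_padded_indices("The movie is good the movie is good", toy_indices, seq_len=8)
print(short)  # [0, 4, 5, 6, 3, 7, 1, 2]
print(long)   # [0, 4, 5, 6, 7, 4, 5, 6]
```

Note that the long sentence is cut off before its `</S>` marker, which is why `</S>` only appears when the sentence ends within the fixed window.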

In [0]:
import collections
import numpy as np

def sentence_to_padded_index_sequence(datasets):
    '''Annotates datasets with feature vectors.'''
    
    START = "<S>"
    END = "</S>"
    END_PADDING = "<PAD>"
    UNKNOWN = "<UNK>"
    SEQ_LEN = 21
    
    # Extract vocabulary
    def tokenize(string):
        return string.lower().split()
    
    word_counter = collections.Counter()
    for example in datasets[0]:
        word_counter.update(tokenize(example['text']))
    
    vocabulary = set([word for word in word_counter if word_counter[word] > 25])
    vocabulary = list(vocabulary)
    vocabulary = [START, END, END_PADDING, UNKNOWN] + vocabulary
        
    word_indices = dict(zip(vocabulary, range(len(vocabulary))))
    indices_to_words = {v: k for k, v in word_indices.items()}
        
    for dataset in datasets:
        for example in dataset:
            example['index_sequence'] = np.zeros((SEQ_LEN), dtype=np.int32)
            
            token_sequence = [START] + tokenize(example['text']) + [END]
            
            for i in range(SEQ_LEN):
                if i < len(token_sequence):
                    if token_sequence[i] in word_indices:
                        index = word_indices[token_sequence[i]]
                    else:
                        index = word_indices[UNKNOWN]
                else:
                    index = word_indices[END_PADDING]
                example['index_sequence'][i] = index
    return indices_to_words, word_indices
    
indices_to_words, word_indices = sentence_to_padded_index_sequence([training_set, dev_set, test_set])

In [21]:
print(training_set[18])
print(len(word_indices))

{'text': "It could have been something special , but two things drag it down to mediocrity -- director Clare Peploe 's misunderstanding of Marivaux 's rhythms , and Mira Sorvino 's limitations as a classical actress .", 'index_sequence': array([  0, 160, 298, 344, 385, 295, 347, 254,  73, 191, 398,   3, 160,
       209,  57,   3, 142, 174,   3,   3, 284], dtype=int32)}
603


## Assignments: 
### Part 1: Implementation

Now, using the starter code and hyperparameter values provided below, implement an LSTM language model with dropout on the non-recurrent connections. Use the standard form of the LSTM reflected in the slides (without peepholes). **You should only have to edit the marked sections of code to build the base LSTM**, though implementing dropout properly may require small changes to the main training loop and to brittle_sampler().

**Don't use any TensorFlow code that is specifically built for RNNs**. If a TF function has 'recurrent', 'sequence', 'LSTM', or 'RNN' in its name, you should build it yourself instead of using it. (Your version will likely be much simpler, by the way, since these built-in methods are powerful but fairly complex and potentially confusing.)
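To make the shapes concrete, here is a minimal NumPy sketch of a single step of the commonly used LSTM formulation without peepholes, with dropout applied only to the output that feeds the next layer. It is an illustration under assumptions, not the reference solution: the parameter names (`W_f`, `W_i`, `W_c`, `W_o`), the helper `lstm_step`, and the toy batch size are hypothetical, and the formulation is assumed to match the one in the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_emb, h_prev, c_prev, params, rate=0.0, rng=None):
    """One LSTM step on a batch; returns (dropped output, recurrent h, recurrent c)."""
    z = np.concatenate([x_emb, h_prev], axis=1)           # [batch, emb + dim]
    f = sigmoid(z @ params["W_f"] + params["b_f"])        # forget gate
    i = sigmoid(z @ params["W_i"] + params["b_i"])        # input gate
    c_tilde = np.tanh(z @ params["W_c"] + params["b_c"])  # candidate cell state
    o = sigmoid(z @ params["W_o"] + params["b_o"])        # output gate
    c = f * c_prev + i * c_tilde                          # new cell state
    h = o * np.tanh(c)                                    # new hidden state
    # Dropout on the non-recurrent connection only: h and c are passed on untouched
    h_out = h
    if rate > 0.0:
        rng = rng or np.random.default_rng()
        keep = rng.random(h.shape) >= rate
        h_out = h * keep / (1.0 - rate)                   # rescale the kept units
    return h_out, h, c

rng = np.random.default_rng(0)
batch, emb, dim = 4, 16, 32
params = {}
for gate in "fico":
    params["W_" + gate] = rng.normal(0.0, 0.1, size=(emb + dim, dim))
    params["b_" + gate] = rng.normal(0.0, 0.1, size=dim)

x_emb = rng.normal(size=(batch, emb))
h0 = np.zeros((batch, dim))
c0 = np.zeros((batch, dim))
h_out, h1, c1 = lstm_step(x_emb, h0, c0, params, rate=0.25, rng=rng)
h_eval, h1_eval, _ = lstm_step(x_emb, h0, c0, params, rate=0.0)
print(h_out.shape, h1.shape, c1.shape)
```

With `rate=0.0` the dropped output equals the hidden state, which is the behaviour you want at sampling time.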

We won't be evaluating our model in the conventional way (perplexity on a held-out test set) for a few reasons: to save time, because we have no baseline to compare against, and because overfitting the training set is a less immediate concern with these models than it was with sentence classifiers. Instead, we'll use the value of the cost function to make sure that the model is converging as expected, and we'll use samples drawn from the model to qualitatively evaluate it.

**Tips**: 
- Check the code for the GRU-based sentiment classifier (lab 4), especially the part where the RNN structure is defined.
- You'll need to use `tf.nn.embedding_lookup()`, `tf.nn.sparse_softmax_cross_entropy_with_logits()`, and `tf.split()` at least once each. All three should be easy to Google, though the last homework and the last exercise should show examples of the first two.
- As before, you'll want to initialize your trained parameters using something like `tf.random.normal(..., stddev=0.1)` (the TensorFlow 2.x name of `tf.random_normal`).
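The three calls mentioned in the tips can be tried out on a toy batch before wiring them into the LSTM. The sketch below assumes TensorFlow 2.x in eager mode; the tiny sizes and the single linear layer `W` (standing in for the recurrent part) are hypothetical and only there to show the tensor shapes.

```python
import numpy as np
import tensorflow as tf

batch_size, seq_len, vocab_size, emb_dim = 2, 3, 5, 4
x = np.array([[0, 3, 1], [0, 4, 1]], dtype=np.int32)  # toy index sequences

E = tf.Variable(tf.random.normal([vocab_size, emb_dim], stddev=0.1))  # embeddings
W = tf.Variable(tf.random.normal([emb_dim, vocab_size], stddev=0.1))  # toy output layer

# tf.split cuts the [batch, seq_len] matrix into seq_len slices of shape [batch, 1]
slices = tf.split(x, seq_len, 1)
x_t = tf.reshape(slices[0], [-1])     # current words, shape [batch]
x_next = tf.reshape(slices[1], [-1])  # words to predict, shape [batch]

# Look up one embedding vector per word index
emb = tf.nn.embedding_lookup(params=E, ids=x_t)  # [batch, emb_dim]

# Cross-entropy between unnormalized logits and integer labels, one value per example
logits = tf.matmul(emb, W)                       # [batch, vocab_size]
costs = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=x_next, logits=logits)
print(len(slices), emb.shape, costs.shape)
```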

**TODOS:**
- **TODO1**: Define the parameters of the LSTM (check the slides given in class)
- **TODO2**: Build the LSTM LM (follow the instructions in the code comments)


In [22]:
%tensorflow_version 2.x
import tensorflow as tf
tf.__version__

'2.1.0'

In [0]:
class LanguageModel:
    def __init__(self, vocab_size, sequence_length):
        # Define the hyperparameters
        self.learning_rate = 0.3  # Should be about right
        self.training_epochs = 250  # How long to train for - chosen to fit within class time
        self.display_epoch_freq = 1  # How often to test and print out statistics
        self.dim = 32  # The dimension of the hidden state of the RNN
        self.embedding_dim = 16  # The dimension of the learned word embeddings
        self.batch_size = 256  # Somewhat arbitrary - can be tuned, but often tune for speed, not accuracy
        self.vocab_size = vocab_size  # Defined by the file reader above
        self.sequence_length = sequence_length  # Defined by the file reader above
        self.rate = 0.25  # Used in dropout (at training time only, not at sampling time)
        
        #### Start main editable code block ####
        self.trainable_variables = []

        # Parameters of the output layer that computes the logits (unnormalized scores) and costs
        self.W_cl = tf.Variable(tf.random.normal([self.dim, self.vocab_size], stddev=0.1))
        self.b_cl = tf.Variable(tf.random.normal([self.vocab_size], stddev=0.1))
        self.trainable_variables.append(self.W_cl)
        self.trainable_variables.append(self.b_cl)
        self.l2_lambda = 0.001

        # Embedding parameters
        self.E = tf.Variable(tf.random.normal([self.vocab_size, self.embedding_dim], stddev=0.1))
        self.trainable_variables.append(self.E)

        # LSTM params 
        self.W_f = tf.Variable(tf.random.normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_f = tf.Variable(tf.random.normal([self.dim], stddev=0.1))
        self.W_i = tf.Variable(tf.random.normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_i = tf.Variable(tf.random.normal([self.dim], stddev=0.1))
        self.W_c = tf.Variable(tf.random.normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_c = tf.Variable(tf.random.normal([self.dim], stddev=0.1))
        self.W_rnn = tf.Variable(tf.random.normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_rnn = tf.Variable(tf.random.normal([self.dim], stddev=0.1))
        self.trainable_variables.append(self.W_f)
        self.trainable_variables.append(self.b_f)
        self.trainable_variables.append(self.W_i)
        self.trainable_variables.append(self.b_i)
        self.trainable_variables.append(self.W_c)
        self.trainable_variables.append(self.b_c)
        self.trainable_variables.append(self.W_rnn)
        self.trainable_variables.append(self.b_rnn)
        
        # initial unrolling states
        self.h_zero = tf.zeros([self.batch_size, self.dim])
        self.c_zero = tf.zeros([self.batch_size, self.dim])
    
    def model(self,x,rate,sample=False,h_zero=None,c_zero=None):   
        def step(x, x_next, h_prev, c_prev):
            emb_x = tf.nn.embedding_lookup(params=self.E,ids=x)
            emb_h_prev = tf.concat([emb_x, h_prev], 1) 
            # LSTM internal machinery: forget gate, input gate, candidate cell state, output gate
            ft = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_f) + self.b_f)
            it = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_i) + self.b_i)
            Ctc = tf.nn.tanh(tf.matmul(emb_h_prev, self.W_c) + self.b_c)
            Ct = ft * c_prev + it * Ctc
            ot = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_rnn) + self.b_rnn)
            ht = ot * tf.nn.tanh(Ct)
            # Apply dropout to the non-recurrent output only. Use the `rate` argument of model()
            # (not self.rate) so that sampling can switch dropout off by passing rate=0.0
            drop = tf.nn.dropout(ht, rate)
            # Compute the logits over the vocabulary with one last linear layer
            logits = tf.matmul(drop, self.W_cl) + self.b_cl
            # Cross-entropy of predicting the next word; logits have shape (batch_size, vocab_size)
            costs = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=x_next, logits=logits)
            return logits, costs, ht, Ct
        #### End main editable code block ####

        self.x_slices = tf.split(x, self.sequence_length, 1)
        all_logits = []
        all_costs = []
        
        # self.h and self.c hold the hidden and cell states of every unrolling step
        if h_zero is not None and c_zero is not None:
          self.h = [h_zero]
          self.c = [c_zero]
        else:
          self.h = [self.h_zero]
          self.c = [self.c_zero]

        # Unroll the LSTM over the sequence, predicting word t + 1 from word t
        for t in range(self.sequence_length-1):
          x_t = tf.reshape(self.x_slices[t], [-1])
          x_next = tf.reshape(self.x_slices[t + 1], [-1])
          h = self.h[-1]  # hidden state from the previous step
          c = self.c[-1]  # cell state from the previous step
          logits, costs, h, c = step(x_t, x_next, h, c)
          self.h.append(h)
          self.c.append(c)
          all_logits.append(logits)
          all_costs.append(costs)
          if sample:
              # When sampling, only the first step is needed
              return h, c, logits
        return all_logits, all_costs
            
           
    def train(self, training_data):
        def get_minibatch(dataset, start_index, end_index):
            indices = range(start_index, end_index)
            vectors = np.vstack([dataset[i]['index_sequence'] for i in indices])
            return vectors
             
        print('Training.')
        # Training cycle
        for epoch in range(self.training_epochs):
            random.shuffle(training_data)
            avg_cost = 0.
            total_batch = int(len(training_data) / self.batch_size)
            
            # Loop over all batches in epoch
            for i in range(total_batch):
                # Assemble a minibatch of the next B examples
                minibatch_vectors = np.int32(get_minibatch(training_data, self.batch_size * i, self.batch_size * (i + 1)))

                # Run the optimizer to take a gradient step, and also fetch the value of the 
                # cost function for logging
                with tf.GradientTape() as tape:
                  # costs is a list with one cross-entropy vector (one value per example) per time step
                  _, costs = self.model(minibatch_vectors,self.rate)
                  # Stack the per-step costs into a [batch_size, sequence_length - 1] tensor
                  costs_tensor = tf.concat([tf.expand_dims(cost, 1) for cost in costs], 1)
                  # Sum the word costs of each example
                  cost_per_example = tf.reduce_sum(costs_tensor, 1)
                  # Average the sequence costs over the batch
                  total_cost = tf.reduce_mean(cost_per_example)
            
                # Compute the gradients of the cost with respect to the parameters (backpropagation)
                gradients = tape.gradient(total_cost, self.trainable_variables)
                optimizer = tf.optimizers.SGD(self.learning_rate)
                # Apply the SGD update to the parameters
                optimizer.apply_gradients(zip(gradients, self.trainable_variables))
                                                            
                # Accumulate the logged cost: the mean per-sequence cost over batches,
                # further divided by the batch size
                avg_cost += total_cost / (total_batch * self.batch_size)
                
            # Display some statistics about the step
            if (epoch+1) % self.display_epoch_freq == 0:
                tf.print("Epoch:", (epoch+1), "Cost:", avg_cost, "Sample:", self.sample())
    
    def sample(self):
        # This samples a sequence of tokens from the model starting with <S>.
        # We only ever run the first timestep of the model, and use an effective batch size of one
        # but we leave the model unrolled for multiple steps, and use the full batch size to simplify 
        # the training code. This slows things down.

        def brittle_sampler():
            # The main sampling code. Can fail randomly due to rounding errors that yield probabilities
            # that don't sum to one.
            
            word_indices = [0] # 0 here is the "<S>" symbol
            for i in range(self.sequence_length - 1):
                dummy_x = np.zeros((self.batch_size, self.sequence_length),dtype=np.int32)
                dummy_x[0][0] = word_indices[-1]
                model_h = None
                model_c = None
                if i > 0:
                    model_h = h
                    model_c = c

                # h and c are the hidden and cell states after the first step (for batch entry 0 they
                # summarize the words sampled so far), and logits holds the unnormalized scores over
                # the vocabulary for the next word
                h, c, logits = self.model(dummy_x,0.0,sample=True,h_zero=model_h,c_zero=model_c)
                logits = logits[0, :] # Discard all but first batch entry
                exp_logits = np.exp(logits - np.max(logits))
                distribution = exp_logits / exp_logits.sum()
                sampled_index = np.flatnonzero(np.random.multinomial(1, distribution))[0]
                word_indices.append(sampled_index)
            words = [indices_to_words[index] for index in word_indices]
            return ' '.join(words)
        
        while True:
            try:
              sample = brittle_sampler()
              return sample
            except ValueError as e:  # Retry if we experience a random failure.
              pass

Now let's train it.

Once you're confident your model is doing what you want, let it run for the full 250 epochs. This will take some time, likely between five and thirty minutes. If it takes much longer on a reasonably modern laptop (more than an hour), that suggests serious problems with your implementation. A properly implemented model with dropout should reach an average cost of less than 0.22 quickly, and then slowly improve from there. We train the model for a fairly long time because these small improvements in cost correspond to fairly large improvements in sample quality.
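To relate the logged cost to a more familiar quantity, note that in the training loop above each batch's cost (cross-entropy summed over the 20 predictions of a sequence, averaged over the examples) is additionally divided by the batch size before being accumulated. The sketch below undoes that scaling to obtain an approximate cross-entropy per predicted token and the corresponding perplexity. The helper name `per_token_cross_entropy` is hypothetical, and the result is only a rough guide because predictions of `<PAD>` tokens are included in the cost.

```python
import math

batch_size = 256          # self.batch_size in the model
predictions_per_seq = 20  # sequence_length - 1 predictions per sentence

def per_token_cross_entropy(logged_cost):
    """Convert the logged cost into mean cross-entropy (in nats) per predicted token."""
    return logged_cost * batch_size / predictions_per_seq

for logged in (0.30790782, 0.22, 0.202):
    ce = per_token_cross_entropy(logged)
    print(f"logged {logged:.3f} -> {ce:.3f} nats/token, perplexity {math.exp(ce):.1f}")
```

Seen this way, a drop in the logged cost from 0.22 to 0.202 is a reduction of about 0.23 nats per token, which helps explain why small changes in the logged value matter for sample quality.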

Samples from a trained model should have coherent portions, but they will not resemble interpretable English sentences. Here are three examples from a model with a cost value of 0.202:

`<S> the good <UNK> and <UNK> and <UNK> <UNK> with predictable and <UNK> , but also does one of -lrb- <UNK>`

`<S> <UNK> has <UNK> actors seems done <UNK> would these <UNK> <UNK> to <UNK> <UNK> <UNK> 're <UNK> to mind .`

`<S> an action story that was because the <UNK> <UNK> are when <UNK> as ``` <UNK> '' ' it is any`

`-lrb-` and `-rrb-` are the way that left and right parentheses are represented in the corpus.

In [24]:
model = LanguageModel(len(word_indices), 21)
model.train(training_set)

Training.
Epoch: 1 Cost: 0.30790782 Sample: <S> <UNK> political <UNK> direction the <UNK> moment <UNK> <UNK> <UNK> <UNK> now honest <UNK> the funny work <UNK> <UNK> <UNK>
Epoch: 2 Cost: 0.264318854 Sample: <S> goes director lot <UNK> to already set movie does worst <UNK> is like of <UNK> feels <UNK> a just is
Epoch: 3 Cost: 0.257173806 Sample: <S> <UNK> <UNK> the tv too <UNK> become the <UNK> as the <UNK> into <UNK> , sweet <UNK> movie . </S>
Epoch: 4 Cost: 0.252105385 Sample: <S> the there animation <UNK> as the sad just and lost as <UNK> 's movie of us , have all attempt
Epoch: 5 Cost: 0.247123897 Sample: <S> there 's <UNK> an characters <UNK> over a <UNK> <UNK> intriguing <UNK> <UNK> , when them 've some <UNK> here
Epoch: 6 Cost: 0.243273705 Sample: <S> would a <UNK> and , nothing and <UNK> of <UNK> <UNK> like fascinating the <UNK> <UNK> to opera from character
Epoch: 7 Cost: 0.240564153 Sample: <S> be <UNK> and to <UNK> laugh as <UNK> <UNK> ... <UNK> and <UNK> , <UNK> <UNK> <UNK> .

Now we can draw as many samples as we like.

In [25]:
model.sample()  # run this only after training has finished

"<S> it 's all real charm in <UNK> a young woman <UNK> against life <UNK> that one along in the <UNK>"
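The retry loop around `brittle_sampler()` exists because probabilities computed in 32-bit floats may not sum to one closely enough for `np.random.multinomial`. One common workaround, sketched here as an assumption rather than as part of the assignment, is to compute the softmax in 64-bit floats before sampling. The helper name `sample_from_logits` is hypothetical.

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Draw one index from softmax(logits), computing the softmax in float64."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the maximum before exponentiating to avoid overflow
    exp_logits = np.exp(logits - logits.max())
    probs = exp_logits / exp_logits.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_logits = np.array([2.0, 1.0, 0.5, -1.0], dtype=np.float32)
draws = [sample_from_logits(toy_logits, rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=len(toy_logits))
print(counts)  # indices with higher logits are drawn more often
```

Inside `brittle_sampler()` this would replace the three lines that build `distribution` and draw `sampled_index`; the higher precision should make the random failures far less likely, though keeping the retry loop costs nothing.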

### Part 2: Questions

**Question 1:** Looking at the samples that your model produced towards the end of training, point out three properties of (written) English that it seems to have learned.


1. In English, adjectives precede the nouns they modify. The model has learned this, as the following samples show:

- dull family (epoch 229)
- romantic story (epoch 230)
- sweet filmmakers' (epoch 240)


2. The model has also learned to form grammatically correct English sentences that follow Subject-Verb-Object order, where a subject is followed by a verb and then an object or complement, as the following two examples show:

- the movie is as entertaining in its <UNK> years (epoch 163)
- One film was really beautiful (epoch 213)


3. It has also learned some specific constructions of English, such as the fact that a verb following "feel like" takes the gerund form. This is shown in the example below:

- i feel like making any <UNK> (epoch 215)

**Question 2:** If we could make the model as big as we wanted, train as long as we wanted, and adjust or remove dropout at will, could we ever get the model to reach a cost value of 0.0? In a single sentence, say why.

* Yes, a large enough model trained for long enough without dropout could push the cost close to 0.0 by overfitting the current dataset, although not exactly to 0.0, because identical contexts (for example the initial `<S>`) are followed by different words in the data, which leaves some irreducible uncertainty. This follows the premise behind systems such as ELMo and BERT, where bigger models achieve better scores and fit the provided dataset more closely (and so yield better language models). In other tasks such overfitting would hurt at deployment, because unseen examples would be handled poorly, but in the language modelling task the aim is embeddings that are not meant to generalize, only to represent the data as accurately as possible for downstream tasks.

**Question 3:** Give an example of a situation where the LSTM language model's ability to propagate information across many steps (when trained for long enough, at least) would cause it to reach a better cost value than a model like a simple RNN without that ability. (Answer in one sentence or so.)

 * LSTM cells can keep information in memory over many time steps because a set of gates controls the flow of information, so they can capture long-range dependencies (and discard irrelevant information) that a simple RNN misses, for example predicting a closing `-rrb-` many words after the opening `-lrb-`. In a simple RNN, the gradient of the loss decays exponentially with the number of time steps it is propagated through (the vanishing gradient problem), so such dependencies are hard to learn.

 In practice, the clinical domain is one where LSTMs with pre-trained embeddings are especially useful: little data is accessible, so the success of each experiment depends on extracting as much information as possible from every single sentence. My experience there led me to use a BiLSTM (a bidirectional variant of the LSTM).

 Further information about RNNs and real applications: https://addi.ehu.es/bitstream/handle/10810/37091/TFG_EdgarAndresSantamaria.pdf?sequence=1&isAllowed=y 
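The vanishing gradient argument can be illustrated numerically. The sketch below is a deliberately simplified scalar model, not the network from this assignment: it compares how much gradient survives 20 steps through a simple RNN `h_t = tanh(w * h_{t-1})` with how much survives along an LSTM cell-state path whose forget gate is held at a constant value. The function names and the chosen values of `w` and the forget gate are hypothetical.

```python
import numpy as np

def rnn_gradient_through_time(w, steps, h0=0.5):
    """|d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = np.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - h_t^2)
        grad *= w * (1.0 - h ** 2)
    return abs(grad)

def lstm_cell_gradient_through_time(forget_gate, steps):
    """|d c_T / d c_0| along c_t = f * c_{t-1} + ..., treating the forget gate f as constant."""
    return forget_gate ** steps

steps = 20
rnn_grad = rnn_gradient_through_time(w=0.9, steps=steps)
lstm_grad = lstm_cell_gradient_through_time(forget_gate=0.99, steps=steps)
print(f"simple RNN: {rnn_grad:.4f}   LSTM cell path: {lstm_grad:.4f}")
```

In the simple RNN every step multiplies the gradient by a factor below 0.9 in magnitude, so it shrinks geometrically, while a forget gate close to 1 lets most of the gradient through the cell state.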

**Question 4:** Would the model be any worse if we were to just delete unknown words instead of using an `<UNK>` token? (Answer in one sentence or so.)

 * Yes. Removing the unknown words would make the training sentences inconsistent, because words that were never adjacent in the original text would end up next to each other. We also cannot leave them in untreated, since words without an associated embedding would crash the program, so we need a special `<UNK>` tag to handle them.
 * The best solution to this problem is provided by character-based models such as Flair, which use an RNN over the characters to build the representations fed to the classification stage, so every word, even an unknown one, gets a representation.
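The difference between the two strategies is easy to see on a toy sentence. This is a minimal sketch with a hypothetical vocabulary and hypothetical helper names:

```python
# Hypothetical small vocabulary; "dreadful" is out of vocabulary
vocabulary = {"the", "movie", "is", "not", "but", "good"}

def replace_oov(tokens, vocab):
    """Keep the sentence length and word positions, marking unknown words."""
    return [token if token in vocab else "<UNK>" for token in tokens]

def delete_oov(tokens, vocab):
    """Drop unknown words, which makes non-adjacent words adjacent."""
    return [token for token in tokens if token in vocab]

tokens = "the movie is not dreadful but good".split()
replaced = " ".join(replace_oov(tokens, vocabulary))
deleted = " ".join(delete_oov(tokens, vocabulary))
print(replaced)  # the movie is not <UNK> but good
print(deleted)   # the movie is not but good
```

With deletion the model would be trained on the sequence "not but", which never occurred in the original text, whereas `<UNK>` keeps the information that some word stood in that position.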

# Team members: 
Edgar Andrés

Mohammed Yassin

Xaidé Caceres

Radostina Peteva

# Attribution:
Adapted by Oier Lopez de Lacalle and Olatz Perez de Viñaspre, based on a notebook by Sam Bowman at NYU