<a href="https://colab.research.google.com/github/TongleiChen/colab_notebook/blob/main/a4_languagemodel_lstm_template_0331.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A4 - Language Model LSTM 

Author: Austin Blodgett

Adaptation to colab: Nitin Venkateswaran

### Follow the steps to use this notebook for your A4.

**NOTE**: It is best to use your Georgetown Google account.
##### 1. Save a copy of this notebook starter template in your Google Drive (File -> Save a copy in drive)
##### 2. Upload a copy of all 3 txt files from **lm-data** directory (available in a4.zip) to your Google Drive in the folder location **A4/lm-data/**; you will need to create the folder 'A4' at the root location in your Drive, followed by the subfolder 'lm-data'
##### 3. You are all set!
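
Before running the rest of the notebook, it can help to confirm that the three files landed where the later cells expect them. The sketch below is only a convenience check: `find_missing_files` is a hypothetical helper, not part of the assignment template, and the file names are the ones used by the path variables defined further down in this notebook.

```python
import os

# Folder the notebook reads from once Drive is mounted.
DATA_DIR = '/content/drive/My Drive/A4/lm-data'
EXPECTED_FILES = [
    'little-prince-train.txt',
    'little-prince-dev.txt',
    'little-prince-test.txt',
]

def find_missing_files(data_dir=DATA_DIR, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]

missing = find_missing_files()
if missing:
    print('Missing from', DATA_DIR + ':', missing)
else:
    print('All data files found.')
```

Run it after the Drive mount cell; an empty `missing` list means the upload in step 2 worked.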


### Import libraries and mount Google Drive





In [14]:
# !pip uninstall tensorflow
# !pip install tensorflow==2.11.0

In [1]:
from google.colab import drive
drive.mount('/content/drive')

!pip install transformers
import os, random
from collections import Counter

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense, TimeDistributed

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras import Model
from keras.activations import softmax
from keras.initializers import Constant

from transformers import BertTokenizer, TFBertLMHeadModel, BertConfig, TFBertModel

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [16]:
train_file = '/content/drive/My Drive/A4/lm-data/little-prince-train.txt'
dev_file = '/content/drive/My Drive/A4/lm-data/little-prince-dev.txt'
test_file = '/content/drive/My Drive/A4/lm-data/little-prince-test.txt'
UNK = '[UNK]'
PAD = '[PAD]'
START = '<s>'
END = '</s>'
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

### Change these arguments as needed for your experiments

In [30]:
epochs = 3 # number of epochs
learning_rate = 0.1 # learning rate
dropout = 0.3 # dropout rate
early_stopping = -1 # early stopping criteria
embedding_size = 100 # embedding dimension size
hidden_size = 10 # hidden layer size
batch_size = 50 # batch size
use_bert = True # to use the BERT embeddings
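
The template does not show how `early_stopping` is consumed. One possible reading, offered here purely as an assumption, is a patience value: stop once the dev loss has failed to improve for that many consecutive epochs, with a non-positive value (such as the default `-1`) disabling the check. The helper `should_stop` below is hypothetical and not part of the starter code.

```python
def should_stop(dev_losses, patience):
    """Return True when none of the last `patience` dev losses beat the
    best loss seen before them. A non-positive patience disables the check."""
    if patience <= 0 or len(dev_losses) <= patience:
        return False
    best_before = min(dev_losses[:-patience])
    recent = dev_losses[-patience:]
    return all(loss >= best_before for loss in recent)

# Made-up dev losses, one per epoch, for illustration only.
history = [1.00, 0.90, 0.95, 0.97]
print(should_stop(history, patience=2))   # the last two epochs did not beat 0.90
print(should_stop(history, patience=-1))  # check disabled
```

Under this reading, the epoch loop would append each dev loss to a list and `break` when `should_stop(losses, early_stopping)` returns `True`.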

### Implement this function if you want to transform the input text, e.g. normalizing case


In [18]:
# TODO
def transform_text_sequence(seq):
    '''
    Implement this function if you want to transform the input text,
    for example normalizing case.
    '''
    return seq
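
As one way to fill in this function, the sketch below lowercases every token while leaving the special markers untouched. `lowercase_sequence` and `SPECIAL_TOKENS` are illustrative names, not part of the template. Lowercasing is most relevant when `use_bert` is `False`, since the `glove.6B` vocabulary is lowercased; the `bert-base-uncased` tokenizer already lowercases its input.

```python
# The special markers used elsewhere in this notebook.
SPECIAL_TOKENS = {'[UNK]', '[PAD]', '<s>', '</s>'}

def lowercase_sequence(seq):
    """Lowercase every token except the special markers."""
    return [tok if tok in SPECIAL_TOKENS else tok.lower() for tok in seq]

print(lowercase_sequence(['The', 'Little', 'Prince', '[UNK]']))
# ['the', 'little', 'prince', '[UNK]']
```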

### Implement this function to generate the next-word labels for a sequence

In [19]:
def shift_by_one(seq):
    '''
    input: ['<s>', 'The', 'dog', 'chased', 'the', 'cat', 'around', 'the', 'house', '</s>']
    output: ['The', 'dog', 'chased', 'the', 'cat', 'around', 'the', 'house', '</s>', '[PAD]']
    '''
    # Drop the first token and pad the end so the labels keep the input's length.
    return list(seq[1:]) + [PAD]
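
To make the alignment concrete, the sketch below pairs every input position with the token the model is expected to predict there. `input_label_pairs` is a hypothetical helper written only for this illustration; it applies the same shift as `shift_by_one`.

```python
def input_label_pairs(seq, pad='[PAD]'):
    """Pair each token with the token that follows it (its next-word label)."""
    labels = list(seq[1:]) + [pad]
    return list(zip(seq, labels))

sent = ['<s>', 'The', 'dog', 'barked', '</s>']
for current, target in input_label_pairs(sent):
    print(f'{current!r:10} -> {target!r}')
```

The final pair maps `'</s>'` to `'[PAD]'`: nothing follows the end marker, so the label sequence is padded to keep the same length as the input.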



In [20]:
# print(shift_by_one(['The', 'dog', 'chased', 'the', 'cat', 'around', 'the', 'house', '</s>','[PAD]']))

### Download the GloVe embeddings

In [21]:
# !wget https://nlp.stanford.edu/data/glove.6B.zip
# !unzip -o glove.6B.zip

### Implement this function to load the pre-trained GloVe embeddings

In [22]:
glove_file = 'glove.6B.100d.txt' # Change as necessary

def load_pretrained_embeddings(glove_file, vocab):
    # Rows for words that never appear in the GloVe file (including [PAD]) stay all-zeros.
    embedding_matrix = np.zeros((len(vocab), embedding_size))
    with open(glove_file, encoding='utf8') as f:
        for line in f:
            # Each line is a word followed by its vector components, separated by spaces.
            word, coefs = line.split(maxsplit=1)
            # Only keep vectors for words in the vocabulary; vocab maps word -> row index.
            if word in vocab:
                embedding_matrix[vocab[word]] = np.fromstring(coefs, "f", sep=" ")

    # [UNK] gets a random vector rather than all-zeros.
    embedding_matrix[vocab[UNK]] = np.random.randn(embedding_size)
    return embedding_matrix
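
The file layout this function relies on can be seen on a toy example. The sketch below uses two made-up 3-dimensional vectors in the GloVe text layout (these are not real GloVe values) and a hypothetical four-word vocabulary, then fills the matrix rows the same way: one row per vocabulary index, zeros for words absent from the file.

```python
import io
import numpy as np

# Two invented vectors in "word v1 v2 v3" layout, standing in for a GloVe file.
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy_vocab = {'[UNK]': 0, '[PAD]': 1, 'cat': 2, 'dog': 3}

matrix = np.zeros((len(toy_vocab), 3))
for line in fake_glove:
    word, coefs = line.split(maxsplit=1)
    if word in toy_vocab:
        # The row index comes from the vocabulary, not from the file order.
        matrix[toy_vocab[word]] = np.array(coefs.split(), dtype='float32')

print(matrix)
```

Here only the row for `'cat'` is filled; `'the'` is skipped because it is not in the toy vocabulary, and `'dog'` stays all-zeros because it has no vector in the file.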

### Helper Functions (no need to implement)



In [23]:
def get_vocabulary_and_data_with_bert_tokenization(data_file):
    data = []
    with open(data_file, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if not line: continue
            sent = [START]
            sent.extend(tokenizer.tokenize(line))
            sent.append(END)
            data.append(sent)
    vocab = {k:v for k,v in tokenizer.vocab.items()}
    vocab['<s>'] = 101 # alias for [CLS]
    vocab['</s>'] = 102 # alias for [SEP]
    return vocab, data


def get_vocabulary_and_data(data_file, max_vocab_size=None, use_bert=False):
    if use_bert:
        return get_vocabulary_and_data_with_bert_tokenization(data_file)
    vocab = Counter()
    data = []
    with open(data_file, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if not line: continue
            sent = [START]
            tokens = transform_text_sequence(line.split())
            for tok in tokens:
                sent.append(tok)
                vocab[tok]+=1
            sent.append(END)
            data.append(sent)
            vocab[START]+=1
            vocab[END]+=1
    vocab = [w for w in sorted(vocab, key=lambda x:vocab[x], reverse=True)]
    if max_vocab_size:
        vocab = vocab[:max_vocab_size-2]
    vocab = [UNK, PAD] + vocab

    return {k:v for v,k in enumerate(vocab)}, data


def vectorize_sequence(seq, vocab):
    seq = [tok if tok in vocab else UNK for tok in seq]
    return [vocab[tok] for tok in seq]


def unvectorize_sequence(seq, vocab):
    translate = sorted(vocab.keys(),key=lambda k:vocab[k])
    return [translate[i] for i in seq]


def one_hot_encode_label(label, vocab):
    vec = [1.0 if l==label else 0.0 for l in vocab]
    return vec


def batch_generator_lm(data, vocab, batch_size=1):
    while True:
        batch_x = []
        batch_y = []
        for sent in data:
            batch_x.append(vectorize_sequence(sent, vocab))
            batch_y.append([one_hot_encode_label(token, vocab) for token in shift_by_one(sent)])
            if len(batch_x) >= batch_size:
                # Pad Sequences in batch to same length
                batch_x = pad_sequences(batch_x, vocab[PAD])
                batch_y = pad_sequences(batch_y, one_hot_encode_label(PAD, vocab))
                batch_x, batch_y = np.array(batch_x), np.array(batch_y)
                yield batch_x, batch_y.astype('float32')
                batch_x = []
                batch_y = []


def describe_data(data, generator):
    batch_x, batch_y = [], []
    for bx, by in generator:
        batch_x = bx
        batch_y = by
        break
    print('Data example:',data[49])
    print('Data size',len(data))
    print('Batch input shape:', batch_x.shape)
    print('Batch output shape:', batch_y.shape)


def pad_sequences(batch_x, pad_value):
    ''' This function should take a batch of sequences of different lengths
        and pad them with the pad_value token so that they are all the same length.

        Assume that batch_x is a list of lists.
    '''
    pad_length = len(max(batch_x, key=lambda x: len(x)))
    for i, x in enumerate(batch_x):
        if len(x) < pad_length:
            batch_x[i] = x + ([pad_value] * (pad_length - len(x)))

    return batch_x


def generate_text(language_model, vocab):
    prediction = [START]
    while not (prediction[-1] == END or len(prediction)>=50):
        next_token_one_hot = language_model.predict(np.array([[vocab[p] for p in prediction]]), batch_size=1)[0][-1]
        # Sample the next token: walk the cumulative distribution until it passes a random threshold.
        threshold = random.random()
        cumulative = 0  # named to avoid shadowing the built-in sum()
        next_token = 0
        for i,p in enumerate(next_token_one_hot):
            cumulative += p
            if cumulative>threshold:
                next_token = i
                break
        for w, i in vocab.items():
            if i==next_token:
                prediction.append(w)
                break
    return prediction


def perplexity(y_true, y_pred):
    # https://stackoverflow.com/questions/41881308/how-to-calculate-perplexity-of-rnn-in-tensorflow
    cross_entropy = K.categorical_crossentropy(y_true, y_pred)
    perp = K.exp(cross_entropy)
    return perp


class BERT_Wrapper(Model):

    def __init__(self):
        super(BERT_Wrapper, self).__init__()
        self.encoder = TFBertModel.from_pretrained("bert-base-uncased", trainable=False)
        self.dense = Dense(hidden_size)

    def call(self, inputs, **kwargs):
        outputs = self.encoder(inputs)
        last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
        output = self.dense(last_hidden_states)
        return output
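
A note on the `perplexity` metric above: it exponentiates the cross-entropy of each token, and Keras then averages those per-token values when it reports the metric. That is generally not the same number as the conventional corpus-level perplexity, which exponentiates the average cross-entropy. The sketch below illustrates the gap with made-up probabilities (not taken from any run of this notebook).

```python
import math

# Probability a hypothetical model assigned to the correct next token at four positions.
p_correct = [0.5, 0.25, 0.1, 0.05]

cross_entropies = [-math.log(p) for p in p_correct]

# What averaging a per-token exp(cross-entropy) gives.
mean_of_exp = sum(math.exp(ce) for ce in cross_entropies) / len(cross_entropies)

# Conventional perplexity: exp of the mean cross-entropy.
exp_of_mean = math.exp(sum(cross_entropies) / len(cross_entropies))

print('mean of exp:', mean_of_exp)
print('exp of mean:', exp_of_mean)
```

The first value is never smaller than the second (Jensen's inequality), so the metric reported during training can overstate perplexity relative to the conventional definition. To compare against published numbers, one option is to compute `exp(loss)` from the reported dev loss instead.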

In [24]:
# BERT_Wrapper()

### Check the GPU is available

In [25]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  device_name = '/cpu:0'
  print(
      '\n\n This notebook is not '
      'configured to use a GPU.  You can change this in Notebook Settings. Defaulting to:' + device_name)
else:
  print ('GPU Device found: ' + device_name)

GPU Device found: /device:GPU:0


### Main procedure call: Implement the keras model here


In [None]:
vocab, train_data = get_vocabulary_and_data(train_file, use_bert=use_bert)
_, dev_data = get_vocabulary_and_data(dev_file, use_bert=use_bert)

describe_data(train_data, batch_generator_lm(train_data, vocab, batch_size))

# from keras.initializers import Constant


with tf.device(device_name):
    # Implement your model here! ----------------------------------------------------------------------
    # Use the variables batch_size, hidden_size, embedding_size, dropout, epochs
    
    if use_bert:
        embedding_layer = BERT_Wrapper()
    else:
        embedding_matrix = load_pretrained_embeddings(glove_file, vocab)
        embedding_layer = Embedding(
            len(vocab),
            embedding_size,
            embeddings_initializer=Constant(embedding_matrix),
            trainable=False,
        )
    language_model = tf.keras.Sequential()
    input_size = len(vocab)
    output_size = len(vocab)

    drop_out_e = 0.25
    drop_out_lstm = 0.25
    drop_out_d = 0.25
    language_model.add(embedding_layer)
    language_model.add(tf.keras.layers.Dropout(drop_out_e))
    language_model.add(LSTM(hidden_size, return_sequences=True))
    # Apply dropout before the output layer: dropping softmax outputs would
    # leave the predictions no longer summing to 1 during training.
    language_model.add(tf.keras.layers.Dropout(drop_out_d))
    language_model.add(TimeDistributed(Dense(output_size, activation='softmax')))

    # ------------------------------------------------------------------------------------------------
    optimizer = tf.optimizers.Adam(learning_rate = learning_rate)
    language_model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy',perplexity])

    for i in range(epochs):
        print('Epoch',i+1,'/',epochs)
        # Training
        # steps_per_epoch must be a whole number; the generator only yields full batches.
        language_model.fit(batch_generator_lm(train_data, vocab, batch_size),
                                      epochs=1, steps_per_epoch=len(train_data)//batch_size)
        # Evaluation
        loss, acc, perp = language_model.evaluate(batch_generator_lm(dev_data, vocab),
                                                  steps=len(dev_data))
        print('Dev Loss:', loss, 'Dev Acc:', acc, 'Dev Perplexity:', perp)

    for i in range(10):
        text = generate_text(language_model, vocab)
        print(text)

Data example: ['<s>', 'next', ',', 'the', 'lamp', '##light', '##ers', 'of', 'china', 'and', 'siberia', 'would', 'enter', 'for', 'their', 'steps', 'in', 'the', 'dance', ',', 'and', 'then', 'they', 'too', 'would', 'be', 'waved', 'back', 'into', 'the', 'wings', '.', '</s>']
Data size 1362
Batch input shape: (50, 33)
Batch output shape: (50, 33, 30524)
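
The last dimension of the output shape, 30524, is the 30522-entry `bert-base-uncased` vocabulary plus the two aliases `<s>` and `</s>` added by the helper. Because the labels are one-hot encoded over that whole vocabulary, each batch of labels is large. A rough back-of-the-envelope calculation, using only the shapes printed above:

```python
# Shapes taken from the "Batch output shape" line above.
batch, seq_len, vocab_size = 50, 33, 30524
bytes_per_float32 = 4

label_elements = batch * seq_len * vocab_size
label_bytes = label_elements * bytes_per_float32

print(f'{label_elements:,} label values per batch')
print(f'{label_bytes / 1e6:.1f} MB per batch of one-hot labels')
```

That comes to roughly 200 MB for a single batch of labels, which helps explain why training with `use_bert = True` can feel slow and memory-hungry. If that becomes a problem, one commonly used alternative is to feed integer label ids with a sparse cross-entropy loss; this would require changes to the batch generator and the `perplexity` metric, and is not something the template does.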


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Epoch 1 / 3