<a href="https://colab.research.google.com/github/Mahnazshamissa/Python/blob/main/LSTM_LangModel_%2B_Sent_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN / BiLSTM and Word Vectors

Let's re-use the example you worked through with Matthias Lechner a few weeks back.

We'll load the book *Frankenstein* and convert it into a semantic representation using word vectors.

We then train an LSTM language model on these vectors.
Later, please change the data source from "frankenstein.txt" to "dracula.txt", and observe the result.

### How are we going to do it?

We will arrange the data so that every word fed to the model as input ($X = W_n$) has the word that follows it as the expected output ($Y = W_{n+1}$).

The words in the output, Y, will be represented as one-hot vectors.
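
As a rough illustration of this setup, the sketch below builds the (X, Y) pairs for a made-up toy sentence and one-hot encodes each target over a toy vocabulary. The sentence, the `one_hot` helper and the whitespace tokenisation are assumptions for illustration only; the notebook itself works on sub-word ids rather than whole words.

```python
# Toy illustration of the (X, Y) construction; the sentence is made up.
tokens = "the monster opened the door".split()

# X is every token except the last; Y is the same sequence shifted by one.
inputs = tokens[:-1]
targets = tokens[1:]
pairs = list(zip(inputs, targets))

# Map each distinct token to an integer id (sorted for reproducibility).
vocab = sorted(set(tokens))
token_to_id = {tok: idx for idx, tok in enumerate(vocab)}


def one_hot(token):
    """Return a list with a single 1 at the position of the token's id."""
    vec = [0] * len(vocab)
    vec[token_to_id[token]] = 1
    return vec


for x, y in pairs:
    print(f"X={x!r:10} -> Y={y!r:10} one-hot(Y)={one_hot(y)}")
```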

**Q: What is the size of this vector?**


In [1]:
!pip install bpemb

Collecting bpemb
  Downloading https://files.pythonhosted.org/packages/91/77/3f0f53856e86af32b1d3c86652815277f7b5f880002584eb30db115b6df5/bpemb-0.3.2-py3-none-any.whl
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 9.6MB/s
Installing collected packages: sentencepiece, bpemb
Successfully installed bpemb-0.3.2 sentencepiece-0.1.94


In [2]:
import tensorflow as tf

import time
import math
import unicodedata
import string

import torch
import torch.nn.functional as F
from torch import nn, tensor

from torchtext.data import get_tokenizer

from bpemb import BPEmb

In [3]:
# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Word Vectors - BPEmb

Let's convert the text into vectors using **BPE**.

Byte Pair Encoding (BPE) is used to encode the input sequences. BPE was originally proposed as a data compression algorithm in the 1990s and was later adopted to address the open-vocabulary problem in machine translation, where rare and unknown words are easy to run into. Motivated by the intuition that rare and unknown words can often be decomposed into multiple subwords, BPE builds its word segmentation by iteratively and greedily merging the most frequent pairs of adjacent symbols, starting from single characters.

We will use a BPE package called [BPEmb](https://nlp.h-its.org/bpemb/). It encodes words as vectors by splitting each word into sub-words: pieces of words made of characters that often appear together.

It is based on the paper: Neural Machine Translation of Rare Words with Subword Units - https://arxiv.org/abs/1508.07909
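
To make the greedy merging step concrete, here is a minimal sketch of BPE merge learning on a made-up word list. It is a simplification and not the BPEmb implementation: it omits the end-of-word marker used in the paper, and the function name `learn_bpe_merges` and the toy words are assumptions for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Greedy BPE on a toy word list: repeatedly merge the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters; the Counter holds word frequencies.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        # Ties are broken by first occurrence (max returns the first maximum).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges, corpus


toy_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, segmented = learn_bpe_merges(toy_words, num_merges=3)
print(merges)
print(list(segmented))
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls the vocabulary size.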

Q: What is the name of the linguistic level that deals with the character/letter level?

In [4]:
bpemb_en = BPEmb(lang="en")

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|██████████| 400869/400869 [00:00<00:00, 903239.32B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz


100%|██████████| 3784656/3784656 [00:00<00:00, 5322966.99B/s]


In [5]:
bpemb_en.vectors.shape

(10000, 100)

Let's create a function to load the corpus data (the books):

In [6]:
def get_file(filename = "frankenstein.txt"):
  path = tf.keras.utils.get_file(
      filename, origin=f"https://raw.githubusercontent.com/liadmagen/NLP-Course/master/dataset/{filename}"
  )
  with open(path, encoding="utf-8") as f:
      text = f.read() 
  text = text.replace("\n", " ")        # Remove line-breaks & newlines
  print("Corpus length:", len(text))
  return text

# RNN Model
And this is the model itself, in a very bare-bones form.

In [7]:
class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, ninp, noutp, nhid, nlayers, dropout=0.5, tie_weights=False):
        """
        Parameters:
          ninp = LSTM input size (the embedding dimension)
          noutp = size of the output (number of classes, here the vocabulary size)
          nhid = number of units in each hidden layer
          nlayers = number of stacked recurrent layers
          dropout = dropout rate
          tie_weights = whether to share the decoder weights with the embedding (see the references below)
        """
        super(RNNModel, self).__init__()
        self.noutp = noutp
        self.drop = nn.Dropout(dropout)

        self.encoder = nn.Embedding.from_pretrained(tensor(bpemb_en.vectors))
        # Freeze the embedding - don't let them be trained
        self.encoder.weight.requires_grad = False
        
        # self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity='relu', dropout=dropout)
        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)

        self.decoder = nn.Linear(nhid, noutp)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to ninp (embedding size)')
            self.decoder.weight = self.encoder.weight

        self.init_weights()

        self.nhid = nhid
        self.nlayers = nlayers

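    def init_weights(self):
        initrange = 0.1
        # The encoder holds the pretrained, frozen BPEmb vectors, so it is left untouched.
        nn.init.zeros_(self.decoder.bias)
        # With tied weights the decoder shares the encoder matrix; skip re-initialising it.
        if self.decoder.weight is not self.encoder.weight:
            nn.init.uniform_(self.decoder.weight, -initrange, initrange)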

    def forward(self, input, hidden):
        emb = self.drop(self.encoder(input))
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.noutp)
        return F.log_softmax(decoded, dim=1), hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, batch_size, self.nhid),
                weight.new_zeros(self.nlayers, batch_size, self.nhid))


A helper function to convert the tokens into batches:

In [8]:
def batchify(data, batch_size):
    # Work out how cleanly we can divide the dataset into batch_size parts.
    nbatch = data.size(0) // batch_size
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * batch_size)
    # Evenly divide the data across the batch_size batches.
    data = data.view(batch_size, -1).t().contiguous()
    return data.to(device)
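
To see the layout that `batchify` produces, the sketch below repeats the same trim, split and transpose steps on a plain Python list. `batchify_list` is a hypothetical stand-in written for illustration; it does not touch tensors or the device.

```python
def batchify_list(data, batch_size):
    """List-based mirror of batchify: trim, cut into batch_size chunks, transpose."""
    nbatch = len(data) // batch_size
    # Drop the remainder that does not fit evenly.
    data = data[: nbatch * batch_size]
    # Cut the sequence into batch_size contiguous chunks of length nbatch.
    chunks = [data[i * nbatch:(i + 1) * nbatch] for i in range(batch_size)]
    # Transpose so that each column holds one contiguous chunk of the text.
    return [list(row) for row in zip(*chunks)]


batches = batchify_list(list(range(10)), batch_size=3)
for row in batches:
    print(row)
```

Reading down a column gives consecutive tokens of the original text, which is why the training loop can carry the hidden state from one slice of rows to the next.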

Let's load the data:

In [9]:
train_corpus = get_file('dracula.txt')
val_corpus = get_file('frankenstein.txt')

print(train_corpus[:300])
print(val_corpus[:300])

Downloading data from https://raw.githubusercontent.com/liadmagen/NLP-Course/master/dataset/dracula.txt
Corpus length: 842159
Downloading data from https://raw.githubusercontent.com/liadmagen/NLP-Course/master/dataset/frankenstein.txt
Corpus length: 420726
Dracula, by Bram Stoker  CHAPTER I  JONATHAN HARKER'S JOURNAL  (_Kept in shorthand._)   _3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse whic
Frankenstein, or, the Modern Prometheus by Mary Wollstonecraft (Godwin) Shelley  Letter 1  _To Mrs. Saville, England._   St. Petersburgh, Dec. 11th, 17—.   You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. 


# Semantic representation + word-parts

Now convert the text into sub-word tokens and their vocabulary IDs:

In [10]:
train_encoded_text = bpemb_en.encode(train_corpus)
train_encoded_ids = bpemb_en.encode_ids(train_corpus)

val_encoded_text = bpemb_en.encode(val_corpus)
val_encoded_ids = bpemb_en.encode_ids(val_corpus)


In [18]:
val_encoded_ids[:20]

[2285,
 19,
 3521,
 9934,
 127,
 9934,
 7,
 1463,
 1073,
 8108,
 70,
 101,
 2195,
 15,
 484,
 3820,
 1568,
 64,
 8323,
 3084]

In [19]:
train_encoded_ids[:10]

[1187, 9924, 2206, 9934, 101, 473, 56, 66, 7468, 5468]

Let's check the result of encoded_text (we'll get to encoded_ids in a moment).

Notice that every word is now broken into pieces.

A **'▁'** mark at the beginning of a token marks the start of a new word. Note that it is a special character, not the ordinary underscore '_' that also appears in the text.

In [11]:
train_encoded_text[:50]

['▁dra',
 'c',
 'ula',
 ',',
 '▁by',
 '▁br',
 'am',
 '▁st',
 'oker',
 '▁chapter',
 '▁i',
 '▁jonathan',
 '▁har',
 'ker',
 "'",
 's',
 '▁journal',
 '▁(',
 '_',
 'ke',
 'pt',
 '▁in',
 '▁sh',
 'or',
 'th',
 'and',
 '.',
 '_',
 ')',
 '▁',
 '_',
 '0',
 '▁may',
 '.',
 '▁b',
 'ist',
 'rit',
 'z',
 '.',
 '_',
 '-',
 '-',
 'left',
 '▁mun',
 'ich',
 '▁at',
 '▁0:00',
 '▁p',
 '.',
 '▁m']
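
As a small illustration of how the '▁' marker lets us recover the word boundaries, the sketch below glues the first few pieces from the output above back together. `join_pieces` is a hypothetical helper written for this example; in the notebook, `bpemb_en.decode_ids` does the decoding.

```python
# Tokens copied from the start of the encoded output above.
pieces = ['▁dra', 'c', 'ula', ',', '▁by', '▁br', 'am', '▁st', 'oker']


def join_pieces(pieces, marker="▁"):
    """Glue sub-word pieces together, turning each word-start marker into a space."""
    return "".join(pieces).replace(marker, " ").strip()


print(join_pieces(pieces))
```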

This method works on word-parts (also called sub-words or word pieces).

Instead of embedding whole words (word2vec, GloVe) or every fixed-length character n-gram of a word (FastText), this method embeds variable-length slices of text: combinations of characters that frequently occur together.

It does so by finding the most frequent combinations of characters in a very large corpus.

The result is a vocabulary that is WAY smaller than all-the-words (how big would that be?) but bigger than all the characters:

**character-based << word-piece based << word-based**

# Model Parameters

In [12]:
batch_size = 32
eval_batch_size = 32

vocab_size = bpemb_en.vocab_size
embsize = bpemb_en.vectors.shape[1]
nhidden = 256
nlayers = 2

In [13]:
model = RNNModel(embsize, vocab_size, nhidden, nlayers).to(device)

In [14]:
criterion = nn.NLLLoss()

# Division to train/validation

In [15]:
train_enc_ids = torch.tensor(train_encoded_ids).type(torch.int64)
train_data = batchify(train_enc_ids, batch_size)

val_enc_ids = torch.tensor(val_encoded_ids).type(torch.int64)
val_data = batchify(val_enc_ids, batch_size)

In [16]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""

    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)


In [17]:
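def get_batch(source, i):
    # Note: `batch_size` doubles as the sequence length (number of time steps) of each slice.
    seq_len = min(batch_size, len(source) - 1 - i)
    data = source[i:i+seq_len]
    # The targets are the inputs shifted by one token, flattened to match the model output.
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target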

# Training function

Note: In 'real life' we would use helper frameworks such as [ignite](https://pytorch.org/ignite/) or [lightning](https://www.pytorchlightning.ai/).

We write the loop out by hand here for learning purposes only.

In [21]:
def train_epoch(train_data, optimizer, lr_scheduler, log_interval = 100):
    # Turn on training mode - which enables dropout.
    model.train()

    total_loss = 0.

    start_time = time.time()
    # ntokens = len(train_data)
    hidden = model.init_hidden(batch_size)

    for batch, i in enumerate(range(0, train_data.size(0) - 1, batch_size)):
        data, targets = get_batch(train_data, i)
        # Reset the gradients left over from the previous step.
        optimizer.zero_grad()

        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to the start of the dataset.
        hidden = repackage_hidden(hidden)
        
        output, hidden = model(data, hidden)
        
        loss = criterion(output, targets)
        loss.backward()

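        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)

        # update parameters (manually:)
        # for p in model.parameters():
        #   if p.grad is not None:
        #     p.data.add_(p.grad, alpha=-lr)

        # better with an optimizer:
        optimizer.step()
        # Note: `lr_scheduler.step()` is never called in this function, so the learning
        # rate stays at OneCycleLR's initial value (max_lr / 25 with the default
        # div_factor). The scheduler is only read for logging below.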


        total_loss += loss.item()

        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // batch_size, 
                lr_scheduler.get_last_lr()[0],
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
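
To show what the gradient clipping call in `train_epoch` does, here is a minimal sketch of clipping by global norm on a plain list of numbers. `clip_by_global_norm` is a hypothetical helper written for illustration; it follows the same idea as `torch.nn.utils.clip_grad_norm_` (rescale all gradients when their joint L2 norm exceeds the threshold) but is not that function.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever scale down: gradients below the threshold are left as they are.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.25)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

small, _ = clip_by_global_norm([0.1, 0.1], max_norm=0.25)
print(small)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is limited.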

In [22]:
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.

    hidden = model.init_hidden(eval_batch_size)

    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, batch_size):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)
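
The logs report perplexity ("ppl") as the exponential of the average loss. The sketch below illustrates that relationship with made-up token probabilities; the `perplexity` helper and the numbers are assumptions for illustration, not values from the trained model.

```python
import math


def perplexity(token_probs):
    """Mean negative log-likelihood of the correct tokens, and its exponential."""
    nll = [-math.log(p) for p in token_probs]
    mean_nll = sum(nll) / len(nll)
    return mean_nll, math.exp(mean_nll)


# A model that guesses uniformly over a 10,000-token vocabulary:
uniform_loss, uniform_ppl = perplexity([1 / 10000] * 5)
print(uniform_loss, uniform_ppl)

# A model that gives the correct token probability 0.5 every time:
coin_loss, coin_ppl = perplexity([0.5, 0.5, 0.5])
print(coin_loss, coin_ppl)
```

One way to read the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.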

# Training loop:

In [24]:
# Loop over epochs.
lr = 5
best_val_loss = None
epochs = 20

optimizer = torch.optim.SGD(model.parameters(), lr=lr)
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer=optimizer,
                                                   max_lr=lr,
                                                   epochs=epochs,
                                                   steps_per_epoch=10)

for epoch in range(1, epochs+1):
    epoch_start_time = time.time()

    train_epoch(train_data, optimizer, lr_scheduler)
    
    val_loss = evaluate(val_data)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
            'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                        val_loss, math.exp(val_loss)))
    print('-' * 89)

    if not best_val_loss or val_loss < best_val_loss:
        best_val_loss = val_loss
    # else:
        # The original recipe anneals the learning rate at this point when no
        # improvement has been seen on the validation set. We use PyTorch's
        # built-in lr_scheduler instead.
        # lr /= 2.0

| epoch   1 |   100/  217 batches | lr 0.20 | ms/batch 12.26 | loss  6.39 | ppl   596.56
| epoch   1 |   200/  217 batches | lr 0.20 | ms/batch 11.28 | loss  6.29 | ppl   538.78
-----------------------------------------------------------------------------------------
| end of epoch   1 | time:  2.91s | valid loss  6.87 | valid ppl   965.38
-----------------------------------------------------------------------------------------
| epoch   2 |   100/  217 batches | lr 0.20 | ms/batch 11.43 | loss  6.36 | ppl   578.53
| epoch   2 |   200/  217 batches | lr 0.20 | ms/batch 11.33 | loss  6.26 | ppl   525.25
-----------------------------------------------------------------------------------------
| end of epoch   2 | time:  2.84s | valid loss  6.84 | valid ppl   935.94
-----------------------------------------------------------------------------------------
| epoch   3 |   100/  217 batches | lr 0.20 | ms/batch 11.46 | loss  6.33 | ppl   563.31
| epoch   3 |   200/  217 batches | lr 0.20 | m

# Text Generation example

In [25]:
model.eval()

words_to_generate = 50  # counted in sub-word tokens, not whole words
temperature = 1.  # higher temperature will increase diversity

# Start from a random token id
input = torch.randint(vocab_size, (1, 1), dtype=torch.long).to(device)

hidden = model.init_hidden(1)

generated_word_ids = []

with torch.no_grad():  # no tracking history
    for i in range(words_to_generate):
        output, hidden = model(input, hidden)
        # Turn the log-probabilities into (unnormalised) sampling weights
        word_weights = output.squeeze().div(temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        # Feed the sampled token back in as the next input
        input.fill_(word_idx)

        generated_word_ids.append(word_idx.tolist())

bpemb_en.decode_ids(generated_word_ids)

"exists of theum in a pen of his that. jonathan greatly isive, hur lainn, and ifat des, and he madenentlycript and ax, and itself he went figures to keep back that i could them from in '"

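The `temperature` value in the generation cell above rescales the log-probabilities before sampling. The sketch below shows the effect on a made-up three-token distribution, using only the standard library; `temperature_probs` and `sample` are hypothetical helpers, and the probabilities are not taken from the trained model.

```python
import math
import random


def temperature_probs(log_probs, temperature):
    """Divide log-probabilities by the temperature, exponentiate and renormalise."""
    weights = [math.exp(lp / temperature) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]


def sample(log_probs, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = temperature_probs(log_probs, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
rng = random.Random(0)
for t in (0.5, 1.0, 2.0):
    probs = temperature_probs(log_probs, t)
    draws = [sample(log_probs, t, rng) for _ in range(1000)]
    print(t, [round(p, 3) for p in probs], draws.count(0) / 1000)
```

Temperatures below 1 sharpen the distribution towards the most likely token, while temperatures above 1 flatten it and make rarer tokens more likely to be sampled.
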
As discussed in class, an RNN/LSTM can be used for many different tasks.

It can be used for sequence-to-sequence problems, where the input and output sequences have the same or different lengths:
* Translation
* Tagging words as POS / SRL / NER
* Encoding a document as a vector for classification

etc.

# Your Turn:

Let's practice LSTM.

Train a sentiment analysis model on the Stanford Sentiment Treebank (SST).

You will need to:
* Change the network output to produce a score instead of a class (Think: which loss function would you use for that?)
* Use BPEmb to vectorize the sentences
* Divide your data into Train + Validation + Hold-out sets, using split_df
* Change the training + validation loops to compute the loss over a whole sentence, and not for every word
* Add a test set (very similar to the validation set) to check your network score

Q: What is the network output after the whole sentence is processed?
Hint: https://www.aclweb.org/anthology/P18-1198.pdf

Q: Here we trained a language model on a single book. Which data source(s) would you choose to train a language model for this dataset?

## Setup & dataset download

In [None]:
!wget 'https://raw.githubusercontent.com/liadmagen/NLP-Course/master/sst/datasetSentences.txt'
!wget 'https://raw.githubusercontent.com/liadmagen/NLP-Course/master/sst/sentiment_labels.txt'
!wget 'https://raw.githubusercontent.com/liadmagen/NLP-Course/master/sst/datasetSplit.txt'


--2020-11-10 16:54:21--  https://raw.githubusercontent.com/liadmagen/NLP-Course/master/sst/datasetSentences.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1290263 (1.2M) [text/plain]
Saving to: ‘datasetSentences.txt.1’


2020-11-10 16:54:21 (31.3 MB/s) - ‘datasetSentences.txt.1’ saved [1290263/1290263]

--2020-11-10 16:54:21--  https://raw.githubusercontent.com/liadmagen/NLP-Course/master/sst/sentiment_labels.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3263577 (3.1M) [text/plain]
Saving to: ‘sentiment_labels.txt’


2020-11-10 1

In [None]:
import pandas as pd

In [None]:
data_df = pd.read_csv('datasetSentences.txt', sep='\t', index_col=0)
label_df = pd.read_csv('sentiment_labels.txt', sep='|', index_col=0)
split_df = pd.read_csv('datasetSplit.txt', sep=',', index_col=0)  # the SST split file is comma-separated

## Look at the data

In [None]:
data_df.head(10)

| sentence_index | sentence |
|---|---|
| 1 | The Rock is destined to be the 21st Century 's... |
| 2 | The gorgeously elaborate continuation of `` Th... |
| 3 | Effective but too-tepid biopic |
| 4 | If you sometimes like to go to the movies to h... |
| 5 | Emerges as something rare , an issue movie tha... |
| 6 | The film provides some great insight into the ... |
| 7 | Offers that rare combination of entertainment ... |
| 8 | Perhaps no picture ever made has more literall... |
| 9 | Steers turns in a snappy screenplay that curls... |
| 10 | But he somehow pulls it off . |


In [None]:
label_df.head(10)

| phrase ids | sentiment values |
|---|---|
| 0 | 0.5 |
| 1 | 0.5 |
| 2 | 0.44444 |
| 3 | 0.5 |
| 4 | 0.42708 |
| 5 | 0.375 |
| 6 | 0.41667 |
| 7 | 0.54167 |
| 8 | 0.33333 |
| 9 | 0.45833 |


In [None]:
split_df.head(10)