<a href="https://colab.research.google.com/github/DaSalm/NetologyHW/blob/master/language_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language modelling


Обучим две различные символьные модели для генерации динозавров:
* модель на символьных биграмах
* ***RNN***-модель.


## Bigram model


In [0]:
!wget https://raw.githubusercontent.com/artemovae/NLP-seminar-LM/master/dinos.txt

--2019-12-25 16:12:03--  https://raw.githubusercontent.com/artemovae/NLP-seminar-LM/master/dinos.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19909 (19K) [text/plain]
Saving to: ‘dinos.txt’


2019-12-25 16:12:03 (69.3 MB/s) - ‘dinos.txt’ saved [19909/19909]



In [0]:
!cat dinos.txt | wc -l

1535


In [0]:
!cat dinos.txt | head

Aachenosaurus
Aardonyx
Abdallahsaurus
Abelisaurus
Abrictosaurus
Abrosaurus
Abydosaurus
Acanthopholis
Achelousaurus
Acheroraptor


In [0]:
a = []
for i in range(10):
  a.append(i)

In [0]:
b = [i + 2 for i in range(10)]

In [0]:
b

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [0]:
names = ['<' + name.strip().lower() + '>' for name in open('dinos.txt').readlines()]
print(names[:10])

['<aachenosaurus>', '<aardonyx>', '<abdallahsaurus>', '<abelisaurus>', '<abrictosaurus>', '<abrosaurus>', '<abydosaurus>', '<acanthopholis>', '<achelousaurus>', '<acheroraptor>']


In [0]:
import nltk

Вычислим частоту каждого символа в корпусе имен динозавров

In [0]:
chars = [char for name in names for char in name]

In [0]:
freq = nltk.FreqDist(chars)

In [0]:
print(list(freq.keys()))

['h', 'p', 'i', 'u', 'q', '<', 'b', 'a', 'r', 'x', 'g', 'e', 'o', 'l', 'c', 'j', 'z', 's', 'd', 'f', 'y', 'm', 'v', 'n', 'w', '>', 'k', 't']


In [0]:
freq.most_common(10)

[('a', 2487),
 ('s', 2285),
 ('u', 2123),
 ('o', 1710),
 ('r', 1704),
 ('<', 1536),
 ('>', 1536),
 ('n', 1081),
 ('i', 944),
 ('e', 913)]

Define a function to estimate probabilty of character

In [0]:
l = sum([freq[char] for char in freq])

def unigram_prob(char):
    return freq[char] / l

In [0]:
unigram_prob('a')

0.11596568124591998

In [0]:
print('p(a) = %1.4f' %unigram_prob('a'))

p(a) = 0.1160


Вычислим условную вероятность каждого символа в зависимости от того, какой символ стоял на предыдущей позиции.

In [0]:
bigrams = nltk.bigrams(chars)

In [0]:
cfreq = nltk.ConditionalFreqDist(nltk.bigrams(chars))

In [0]:
cfreq['a']

FreqDist({'>': 138,
          'a': 11,
          'b': 24,
          'c': 100,
          'd': 36,
          'e': 42,
          'f': 6,
          'g': 40,
          'h': 17,
          'i': 23,
          'j': 5,
          'k': 20,
          'l': 138,
          'm': 68,
          'n': 347,
          'o': 22,
          'p': 89,
          'q': 3,
          'r': 124,
          's': 171,
          't': 204,
          'u': 791,
          'v': 30,
          'w': 6,
          'x': 12,
          'y': 12,
          'z': 8})

Оценим условные вероятности с помощью MLE.

In [0]:
cprob = nltk.ConditionalProbDist(cfreq, nltk.MLEProbDist)

In [0]:
print('p(a a) = %1.4f' %cprob['a'].prob('a'))
print('p(a b) = %1.4f' %cprob['a'].prob('b'))
print('p(a u) = %1.4f' %cprob['a'].prob('u'))

p(a a) = 0.0044
p(a b) = 0.0097
p(a u) = 0.3181


In [0]:
cprob['a'].generate()

'u'

In [0]:
import numpy as np

In [0]:
np.random.choice([1, 3, 43, 4, 4, 4, 4], p=[0.8, 0.05, 0.05, 0.025, 0.025, 0.025, 0.025])

3

### Задание 1.

1) Напишите функцию, которая генерирует имя динозавра **фиксированной** длины. Используйте '<' как начальный символ.

2) Напишите функцию, которая генерирует имя динозавра любой дины.

In [0]:
def generate_n_chars(cprob, n):
  name = '<'
  for i in range(n):
    name += cprob[name[-1]].generate()

    if name[-1] == '>':
      break

  if name[-1] != '>':
    name += '>'
  return name

In [0]:
generate_n_chars(cprob, 10)

'<crurustato>'

## Реккурентные нейронные сети (RNN)

Исходная последовательность:

$x_{1:n} = x_1, x_2, \ldots, x_n$, $x_i \in \mathbb{R}^{d_{in}}$

Для каждого входного значения $x_{1:i}$ получаем на выходе $y_i$:

$y_i = RNN(x_{1:i})$, $y_i \in \mathbb{R}^{d_{out}}$

Для всей последовательности $x_{1:n}$:

$y_{1:n} = RNN^{*}(x_{1:n})$, $y_i \in \mathbb{R}^{d_{out}}$

$R$ - рекурсивная функция активации, зависящая от двух параметров: $x_i$ и $s_{i-1}$ (вектор предыдущего состояния)

$RNN^{*}(x_{1:n}, s_0) = y_{1:n}$

$y_i = O(s_i) = g(W^{out}[s_{i} ,x_i] +b)$

$s_i = R(s_{i-1}, x_i)$

$s_i = R(s_{i-1}, x_i) = g(W^{hid}[s_{i-1} ,x_i] +b)$  -- конкатенация $[s_{i-1}, x]$

$x_i \in \mathbb{R}^{d_{in}}$, $y_i \in \mathbb{R}^{ d_{out}}$, $s_i \in \mathbb{R}^{d_{hid}}$

$W^{hid} \in \mathbb{R}^{(d_{in}+d_{out}) \times d_{hid}}$, $W^{out} \in \mathbb{R}^{d_{hid} \times d_{out}}$

Построим языковую модель на основе RNN с помощью pytorch

In [0]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import pdb
from torch.utils.data import Dataset, DataLoader

%load_ext autoreload
%autoreload 2

torch.set_printoptions(linewidth=200)

In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
hidden_size = 50

In [0]:
device

device(type='cuda', index=0)

Подготовим данные

In [0]:
class DinosDataset(Dataset):
    def __init__(self):
        super().__init__()
        with open('dinos.txt') as f:
            content = f.read().lower()
            self.vocab = sorted(set(content)) + ['<', '>']
            self.vocab_size = len(self.vocab)
            self.lines = content.splitlines()
        self.ch_to_idx = {c:i for i, c in enumerate(self.vocab)}
        self.idx_to_ch = {i:c for i, c in enumerate(self.vocab)}
    
    def __getitem__(self, index):
        line = self.lines[index]
        #teacher forcing
        x_str = '<' + line 
        y_str = line + '>' 
        x = torch.zeros([len(x_str), self.vocab_size], dtype=torch.float)
        y = torch.empty(len(x_str), dtype=torch.long)
        for i, (x_ch, y_ch) in enumerate(zip(x_str, y_str)):
            x[i][self.ch_to_idx[x_ch]] = 1
            y[i] = self.ch_to_idx[y_ch]
        return x, y
    
    def __len__(self):
        return len(self.lines)

In [0]:
list(zip('<dino', 'dino>'))

[('<', 'd'), ('d', 'i'), ('i', 'n'), ('n', 'o'), ('o', '>')]

In [0]:
trn_ds = DinosDataset()
trn_dl = DataLoader(trn_ds, shuffle=True)

In [0]:
trn_ds.lines[5]

'abrosaurus'

In [0]:
x, y = trn_ds[5]

In [0]:
y

tensor([ 1,  2, 18, 15, 19,  1, 21, 18, 21, 19, 28])

In [0]:
print(trn_ds.idx_to_ch)

{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 27: '<', 28: '>'}


In [0]:
trn_ds.vocab_size

29

In [0]:
x.shape

torch.Size([11, 29])

In [0]:
y.shape

torch.Size([11])

In [0]:
[trn_ds.idx_to_ch[i.item()] for i in y]

['a', 'b', 'r', 'o', 's', 'a', 'u', 'r', 'u', 's', '>']

Опишем модель, функцию потерь и алгоритм оптимизации

In [0]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.dropout = nn.Dropout(0.3)
        self.i2o = nn.Linear(hidden_size, output_size)
    
    def forward(self, h_prev, x):
        combined = torch.cat([h_prev, x], dim = 1) # concatenate x and h
        h = torch.tanh(self.dropout(self.i2h(combined)))
        y = self.i2o(h)
        return h, y

In [0]:
model = RNN(trn_ds.vocab_size, hidden_size, trn_ds.vocab_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)

![rnn](images/dinos3.png)

In [0]:
def sample(model):
    model.eval()
    word_size=0
    newline_idx = trn_ds.ch_to_idx['>'] # индекс конечного символа
    with torch.no_grad():
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device) # первый вектор скрытого состояния
        x = h_prev.new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<'] # индекс начального символа
        x[0, start_char_idx] = 1 # one-hot вектор для начального символа

        indices = [start_char_idx]      
        predicted_char_idx = start_char_idx
        
        while predicted_char_idx != newline_idx and word_size != 50:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)
            
            np.random.seed(np.random.randint(1, 5000))
            idx = np.random.choice(np.arange(trn_ds.vocab_size), p=y_softmax_scores.cpu().numpy().ravel())
            indices.append(idx)
            
            x = (y_pred == y_pred.max(1)[0]).float()
            
            predicted_char_idx = idx
            
            word_size += 1
        
        if word_size == 50:
            indices.append(newline_idx)
    return indices

In [0]:
def print_sample(sample_idxs):
    print(''.join([trn_ds.idx_to_ch[x] for x in sample_idxs]))

Обучим получившуюся модель

In [0]:
y.size()[0]

11

In [0]:
def train_one_epoch(model, loss_fn, optimizer):
    model.train()
    for line_num, (x, y) in enumerate(trn_dl):
        loss = 0
        
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        x, y = x.to(device), y.to(device)
        for i in range(x.shape[1]):
            h_prev, y_pred = model(h_prev, x[:, i])
            loss += loss_fn(y_pred, y[:, i])
            
        if (line_num + 1) % 200 == 0:
            print('loss', loss.item() / y.size()[0])
            print_sample(sample(model))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [0]:
def train(model, loss_fn, optimizer, dataset='dinos', epochs=1):
    for e in range(1, epochs+1):
        print('Epoch:{}'.format(e))
        train_one_epoch(model, loss_fn, optimizer)
        print()

In [0]:
train(model, loss_fn, optimizer, epochs = 20)

Epoch:1
loss 36.22065734863281
<sdg>
loss 31.024066925048828
<lnrus>
loss 30.07791519165039
<sourasaerus>
loss 24.26682472229004
<girusharus>
loss 22.46570587158203
<atjaskurui>
loss 28.10103988647461
<ynucssnurus>
loss 25.450807571411133
<zulotobhsaoudroynrusaurxs>

Epoch:2
loss 18.97470474243164
<auansosausus>
loss 19.69462013244629
<sintunortairai>
loss 16.424253463745117
<amltakhuhus>
loss 36.60989761352539
<gabscaurus>
loss 31.39891242980957
<topnusaurus>
loss 22.008193969726562
<tbjhddteus>
loss 11.527524948120117
<airpaurus>

Epoch:3
loss 21.711658477783203
<auajtaurus>
loss 17.585601806640625
<lanrnviurus>
loss 17.016849517822266
<kbbpudaurus>
loss 19.65509033203125
<laidonalros>
loss 16.544395446777344
<pucrsdurus>
loss 13.841693878173828
<yuirtsamsauras>
loss 34.29077911376953
<talsua>

Epoch:4
loss 19.199398040771484
<cplras>
loss 24.156190872192383
<juptasaurus>
loss 27.526050567626953
<iysosiurus>
loss 16.011913299560547
<etasisam>
loss 19.2605037689209
<lbiesaurus>
loss 3

### Task 2.
Перепешите функцию сэмплирования так, чтобы она порождала панграмы (слова, в которых каждая буква встречается только 1 раз)

### Task 3.
Перепешите функцию сэмплирования так, чтобы было возможно менять температуру сэмплирования

### Task 4.
Реализуйте beam search для сэмплирования

In [0]:
# task 2

In [0]:
def get_mask(indices):
  mask = []
  for i in np.arange(trn_ds.vocab_size):
    if i in indices and i != trn_ds.ch_to_idx['u']:
      mask.append(0)
    else:
      mask.append(1)
  return np.array(mask)


In [0]:

def sample(model):
    model.eval()
    word_size=0
    newline_idx = trn_ds.ch_to_idx['>'] # индекс конечного символа
    with torch.no_grad():
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device) # первый вектор скрытого состояния
        x = h_prev.new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<'] # индекс начального символа
        x[0, start_char_idx] = 1 # one-hot вектор для начального символа

        indices = [start_char_idx]      
        predicted_char_idx = start_char_idx
        
        while predicted_char_idx != newline_idx and word_size != 50:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)
            
            np.random.seed(np.random.randint(1, 5000))
            probas = y_softmax_scores.cpu().numpy().ravel()
            probas *= get_mask(indices)
            sum_proba = sum(probas)
            idx = np.random.choice(np.arange(trn_ds.vocab_size), p=[el / sum_proba for el in probas])
            indices.append(idx)
            
            x = (y_pred == y_pred.max(1)[0]).float()
            
            predicted_char_idx = idx
            
            word_size += 1
        
        if word_size == 50:
            indices.append(newline_idx)
    return indices

In [0]:
sample(model)

[27, 19, 5, 16, 1, 9, 15, 20, 21, 18, 14, 12, 28]

In [0]:
for i in range(10):
  print_sample(sample(model))

<sotaencurux>
<tahslcouruk>
<tanyoscurul>
<soryilcuwuh>
<ahteumsurul>
<mathigsurup>
<crtoelhusux>
<sotaencurux>
<tahslcouruk>
<tanyoscurul>


In [0]:
# task 3

In [0]:
def sample(model, T):
    model.eval()
    word_size=0
    newline_idx = trn_ds.ch_to_idx['>'] # индекс конечного символа
    with torch.no_grad():
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device) # первый вектор скрытого состояния
        x = h_prev.new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<'] # индекс начального символа
        x[0, start_char_idx] = 1 # one-hot вектор для начального символа

        indices = [start_char_idx]      
        predicted_char_idx = start_char_idx
        
        while predicted_char_idx != newline_idx and word_size != 50:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)
            
            np.random.seed(np.random.randint(1, 5000))
      
            probas = y_softmax_scores.cpu().numpy().ravel()
            probas = probas ** (1/T) # меняем венроятности в соответствии с температурой
            sum_proba = sum(probas)
            idx = np.random.choice(np.arange(trn_ds.vocab_size), p=[el / sum_proba for el in probas])
            indices.append(idx)
            
            x = (y_pred == y_pred.max(1)[0]).float()
            
            predicted_char_idx = idx
            
            word_size += 1
        
        if word_size == 50:
            indices.append(newline_idx)
    return indices

In [0]:
for i in range(10):
  print_sample(sample(model, 3))

<naoeulibsar>
<qyetusmirus>
<vktswakrcmsaukwacsm
nguoratalzs>
<ctvtlisur>
<vktswakrcmsaukwacsm
nguoratalzs>
<ctvtlisur>
<vktswakrcmsaukwacsm
nguoratalzs>
<ctvtlisur>
<vktswakrcmsaukwacsm
nguoratalzs>
<ctvtlisur>


# Reference

1. Sampling in  RNN: https://nlp.stanford.edu/blog/maximum-likelihood-decoding-with-rnns-the-good-the-bad-and-the-ugly/
2. Coursera course (main source): https://github.com/furkanu/deeplearning.ai-pytorch/tree/master/5-%20Sequence%20Models
3. Coursera course (main source): https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Dinosaurus%20Island%20--%20Character%20level%20language%20model%20final%20-%20v3.ipynb
4. LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/