# Language models

Which continuation of the sequence is more likely:

The train arrived at the
* north 
* railway station

Which sequence is more likely:
* The train arrived at the station
* The station arrived at the train

A language model (LM) estimates the probability of the next word in a sequence, $P(w_n | w_1, \ldots, w_{n-1})$, and the probability of the entire sequence of words, $P(w_1, \ldots, w_n)$.

### Applications:

#### Tasks where complex and noisy input needs to be processed: 
* Speech recognition
* Recognition of scanned and handwritten texts
* Correction of typos
* Machine translation
* Suggestions while typing (autocomplete)

#### Types of models:
* Count-based models
    - Markov chains
* Neural network models, usually recurrent neural networks with LSTM/GRU cells
* Seq2Seq architectures


## The $n$-gram model
Let $w_{1:m}=w_1,\ldots,w_m$ be a sequence of words.

Chain rule: 

$$P(w_{1:m}) = P(w_1) P(w_2 | w_1) P(w_3 | w_{1:2}) \ldots P(w_m | w_{1:m-1}) = \prod_{k=1}^{m} P(w_k | w_{1:k-1}) $$

But estimating $P(w_k | w_{1:k-1})$ directly is no easier: long histories almost never repeat in a corpus.

We move on to $n$-grams: $P(w_{i+1} | w_{1:i}) \approx P(w_{i+1} | w_{i-n+2:i})$, which means that we take into account only the $n-1$ previous words.

### Model

* _unigram:_ $P(w_k)$

* _bigram:_ $P(w_k | w_{k-1})$

* _trigram:_ $P(w_k | w_{k-2} w_{k-1})$

That is, we make a Markov assumption: the next word depends only on a history of fixed length.

* The probability of the next word in the sequence: $P(w_{i+1} | w_{1:i}) \approx P(w_{i+1} | w_{i-n+2:i})$
* The probability of the whole sequence of words: $P(w_{1:m}) = \prod_{k=1}^{m} P(w_k | w_{k-n+1:k-1})$
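The truncated history can be made concrete with a small helper. This is only an illustrative sketch: the function name `ngram_pairs` is an assumption and not part of the seminar code. For every word it lists the (at most) $n-1$ preceding words that an $n$-gram model conditions on.

```python
def ngram_pairs(words, n):
    """Return (context, word) pairs; context holds up to n-1 previous words."""
    pairs = []
    for k, word in enumerate(words):
        start = max(0, k - (n - 1))      # the history is cut to n-1 words
        context = tuple(words[start:k])
        pairs.append((context, word))
    return pairs


sentence = "the train arrived at the station".split()
for context, word in ngram_pairs(sentence, n=3):
    print(context, "->", word)
```

With `n=1` every context is empty (unigram model); with `n=2` each context is the single previous word (bigram model).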


## The $n$-gram model quality estimation

__Perplexity:__ how well the model predicts a sample. The lower the perplexity, the better.

$PP(\texttt{LM}) = 2 ^ {-\frac{1}{m} \sum_{i=1}^{m} \log_2 \texttt{LM} (w_i | w_{1:i-1})}$

![](https://miro.medium.com/max/1050/1*J5kBR7XsQqRiu0p_CZEk1w.png)

We want our model to assign high probabilities to sentences that are real and syntactically correct, and low probabilities to fake, incorrect, or highly infrequent sentences. Assuming our dataset is made of sentences that are in fact real and correct, this means that the best model will be the one that assigns the highest probability to the test set. Intuitively, if a model assigns a high probability to the test set, it means that it is not surprised to see it (it’s not perplexed by it), which means that it has a good understanding of how the language works.
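To make the perplexity definition concrete, here is a minimal sketch in plain Python. The function name `perplexity` and the example probabilities are made up for illustration; the input is assumed to be the list of probabilities that some model assigned to each word of a test sequence.

```python
import math


def perplexity(word_probs):
    """Perplexity from the per-word probabilities LM(w_i | w_{1:i-1})."""
    m = len(word_probs)
    avg_log2 = sum(math.log2(p) for p in word_probs) / m
    return 2 ** (-avg_log2)


confident = [0.5, 0.5, 0.5, 0.5]   # hypothetical model that is rarely surprised
unsure = [0.1, 0.1, 0.1, 0.1]      # hypothetical model that is often surprised
print(perplexity(confident))
print(perplexity(unsure))
```

A model that assigns the uniform probability $1/k$ to every word has perplexity $k$, so perplexity can be read as the effective number of equally likely choices per word.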

MLE probability estimation:

$ P_{MLE}(w_k | w_{k-n+1:k-1}) = \frac{\texttt{count}(w_{k-n+1:k-1} w_k )}{\texttt{count}(w_{k-n+1:k-1} )} $

In the bigram model:

$ P_{MLE}(w_k | w_{k-1}) = \frac{\texttt{count}(w_{k-1} w_k )}{\texttt{count}(w_{k-1} )} $

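The problem of zero probabilities arises: any $n$-gram that never occurs in the training data gets probability 0.

Additive (Laplace) smoothing, where $|V|$ is the vocabulary size and $\alpha > 0$ is the smoothing constant:

$ P(w_k | w_{k-1}) = \frac{\texttt{count}(w_{k-1} w_k ) + \alpha}{\texttt{count}(w_{k-1} ) + \alpha |V|} $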
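A minimal sketch of additive smoothing for bigrams, using only the standard library. The toy corpus and the function name `smoothed_prob` are assumptions chosen for illustration; they are not part of the seminar code.

```python
from collections import Counter

# Toy corpus (an assumption for this sketch), already wrapped in BOS/EOS markers.
corpus = "BOS the train arrived at the station EOS".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)


def smoothed_prob(prev, word, alpha=1.0):
    """Additive smoothing: (count(prev word) + alpha) / (count(prev) + alpha * |V|)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))


print(smoothed_prob("the", "train"))      # seen bigram
print(smoothed_prob("the", "arrived"))    # unseen bigram, no longer zero
print(smoothed_prob("the", "arrived", alpha=0.0))  # alpha = 0 recovers the MLE estimate
```

Smaller values of `alpha` move less probability mass from seen to unseen bigrams.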


## Example

![](https://github.com/artemovae/ML-for-compling/raw/668293ddcf40ef30461c45676ec1931c69551553/2018/img/aib.png)

BOS А и Б сидели на трубе EOS

BOS А упало Б пропало EOS

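BOS что осталось на трубе EOS

(A Russian children's riddle, roughly: "A and B were sitting on a pipe. A fell, B vanished. What was left on the pipe?" The words и, на, трубе and сидели mean "and", "on", "pipe" and "were sitting".)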

$P($ и $| $ A $) = \frac{1}{2}$

$P($ Б $| $ и $) = \frac{1}{1}$

$P($ трубе $| $ на $) = \frac{2}{2}$

$P($ сидели $| $ Б $) = \frac{1}{2}$

$P($ на $| $ сидели $) = \frac{1}{1}$

In [None]:
!pip install nltk



In [None]:
import nltk
cfreq = nltk.ConditionalFreqDist(nltk.bigrams('''BOS А и Б сидели на трубе EOS

BOS А упало Б пропало EOS

BOS что осталось на трубе EOS'''.split()))

cprob = nltk.ConditionalProbDist(cfreq, nltk.MLEProbDist)
print('p(А и) = %1.4f' %cprob['А'].prob('и'))
print('p(и Б) = %1.4f' %cprob['и'].prob('Б'))
print('p(на трубе) = %1.4f' %cprob['на'].prob('трубе'))
print('p(Б сидели) = %1.4f' %cprob['Б'].prob('сидели'))
print('p(сидели на) = %1.4f' %cprob['сидели'].prob('на'))

p(А и) = 0.5000
p(и Б) = 1.0000
p(на трубе) = 1.0000
p(Б сидели) = 0.5000
p(сидели на) = 1.0000


**Now let us make dinosaurs!**

In [None]:
!pip install wget



In [None]:
import wget

wget.download("https://raw.githubusercontent.com/artemovae/ML-for-compling/master/2018/dinos.txt", "dinos.txt")

'dinos.txt'

In [None]:
import nltk

from sklearn.utils import shuffle

In [None]:
# read one name per line and wrap it in start ('<') and end ('>') markers
with open('dinos.txt') as f:
    names = ['<' + name.strip().lower() + '>' for name in f]
print(names[:10])

['<aachenosaurus>', '<aardonyx>', '<abdallahsaurus>', '<abelisaurus>', '<abrictosaurus>', '<abrosaurus>', '<abydosaurus>', '<acanthopholis>', '<achelousaurus>', '<acheroraptor>']


In [None]:
chars = [char  for name in names for char in name]
freq = nltk.FreqDist(chars)

print(sorted(list(freq.keys())))

['<', '>', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [None]:
cfreq = nltk.ConditionalFreqDist(nltk.bigrams(chars))
cfreq['a']

FreqDist({'u': 791, 'n': 347, 't': 204, 's': 171, 'l': 138, '>': 138, 'r': 124, 'c': 100, 'p': 89, 'm': 68, ...})

In [None]:
cprob = nltk.ConditionalProbDist(cfreq, nltk.MLEProbDist)
print('p(a a) = %1.4f' %cprob['a'].prob('a'))
print('p(a b) = %1.4f' %cprob['a'].prob('b'))
print('p(a u) = %1.4f' %cprob['a'].prob('u'))

p(a a) = 0.0044
p(a b) = 0.0097
p(a u) = 0.3181


In [None]:
from math import log
log(cprob['a'].prob('a')) + log(cprob['a'].prob('b')) + log(cprob['a'].prob('c'))

-13.275378042275806
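Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sequences. A minimal sketch with made-up numbers (the probability 0.01 and the length 200 are arbitrary choices for illustration):

```python
import math

probs = [0.01] * 200          # 200 hypothetical per-symbol probabilities

product = 1.0
for p in probs:
    product *= p              # the true value, 1e-400, is too small for a float

log_sum = sum(math.log(p) for p in probs)

print(product)                # the product underflows to 0.0
print(log_sum)                # the log-probability is still a usable number
```

Because the logarithm is monotonic, comparing sequences by summed log-probability gives the same ranking as comparing them by the product of probabilities.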

In [None]:
freq

FreqDist({'a': 2487, 's': 2285, 'u': 2123, 'o': 1710, 'r': 1704, '<': 1536, '>': 1536, 'n': 1081, 'i': 944, 'e': 913, ...})

In [None]:
freq['a']

2487

In [None]:
l = sum([freq[char] for char in freq])

def unigram_prob(char):
    return freq[char] / l

print('p(a) = %1.4f' %unigram_prob('a'))

p(a) = 0.1160


In [None]:
[bi for bi in nltk.bigrams('<aachenosaurus>')]

[('<', 'a'),
 ('a', 'a'),
 ('a', 'c'),
 ('c', 'h'),
 ('h', 'e'),
 ('e', 'n'),
 ('n', 'o'),
 ('o', 's'),
 ('s', 'a'),
 ('a', 'u'),
 ('u', 'r'),
 ('r', 'u'),
 ('u', 's'),
 ('s', '>')]

In [None]:
import nltk

In [None]:
bigrams = [bi for bi in nltk.bigrams('aba caba baca bac')]

In [None]:
len(bigrams)

16

In [None]:
ab = [bi for bi in bigrams if bi == ('a', 'b')]

In [None]:
len(ab)

2

In [None]:
chars = list('aba caba baca bac')

In [None]:
cfreq = nltk.ConditionalFreqDist(nltk.bigrams(chars))
cprob = nltk.ConditionalProbDist(cfreq, nltk.MLEProbDist)
print(f"p(a b) = {cprob['a'].prob('b')}")

p(a b) = 0.2857142857142857


In [None]:
2/7

0.2857142857142857

### Task 1. 

1.1 Write a function to estimate the probability of a dinosaur name.

1.2 Find the most likely dinosaur name from this list.

In [None]:
def estimate_dino_prob(dinosaur_name):
    prob = 1.0
    for left, right in nltk.bigrams(dinosaur_name):
        prob *= cprob[left].prob(right)
    return prob

In [None]:
assert estimate_dino_prob('<aachenosaurus>') > estimate_dino_prob('<aachenosauril>')

In [None]:
max(((dinosaur_name, estimate_dino_prob(dinosaur_name)) for dinosaur_name in names), key=lambda x: x[1])

('<talos>', 2.639985626826119e-05)

In [None]:
# your code here
def compute_probability(word):
    whole_probability = 1.0
    for i, j in [bi for bi in nltk.bigrams(word)]:
        #print('p(%s %s) = %1.4f' %(i, j, cprob[i].prob(j)))
        whole_probability *= cprob[i].prob(j)
    return whole_probability

In [None]:
probs = [(word, compute_probability(word)) for word in names]
sorted(probs, key=lambda x: x[1], reverse=True)[:5]

[('<talos>', 2.639985626826119e-05),
 ('<mei>', 2.112487234254014e-06),
 ('<elosaurus>', 1.8327456825026856e-06),
 ('<almas>', 7.361477189901781e-07),
 ('<balaur>', 6.732825877952762e-07)]

### Task 2.

Write a function that generates a new dinosaur given the length of the expected name.

In [None]:
# your code here
def generate_n_word(n=10):
    new_name = "<"
    for i in range(n):
        new_name += cprob[new_name[-1]].generate()
        print(new_name)
    return new_name

In [None]:
generate_n_word(11)

<m
<ma
<man
<mani
<mania
<maniah
<maniahu
<maniahus
<maniahus>
<maniahus><
<maniahus><a


'<maniahus><a'
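In the output above, generation runs past the end marker `>` and starts a second name. One possible variant stops as soon as `>` is drawn and treats the requested length as an upper bound. The sketch below is self-contained and uses plain counters instead of `nltk`; the helper names `train_bigrams` and `generate_name` and the four-name toy list are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(names):
    """Count character bigrams; every name is wrapped in '<' and '>'."""
    counts = defaultdict(Counter)
    for name in names:
        for left, right in zip(name, name[1:]):
            counts[left][right] += 1
    return counts


def generate_name(counts, max_len=10, rng=random):
    """Sample characters until '>' is drawn or max_len characters are produced."""
    name = "<"
    while len(name) - 1 < max_len:
        successors = counts[name[-1]]
        chars = list(successors)
        weights = [successors[c] for c in chars]
        next_char = rng.choices(chars, weights=weights)[0]
        name += next_char
        if next_char == ">":
            return name          # the model decided that the name is complete
    return name + ">"            # length limit reached: close the name explicitly


toy_names = ["<talos>", "<mei>", "<balaur>", "<almas>"]
counts = train_bigrams(toy_names)
print(generate_name(counts, max_len=11, rng=random.Random(0)))
```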

## Recurrent Neural Networks (RNN)

The input sequence:

$x_{1:n} = x_1, x_2, \ldots, x_n$, $x_i \in \mathbb{R}^{d_{in}}$

-----------------------------------------

For each input prefix $x_{1:i}$, we get an output $y_i$:

$y_i = RNN(x_{1:i})$, $y_i \in \mathbb{R}^{d_{out}}$

-----------------------------------------

For the entire sequence $x_{1:n}$:

$y_{1:n} = RNN^{*}(x_{1:n})$, $y_i \in \mathbb{R}^{d_{out}}$

$R$ is the recurrent function, which takes two arguments: the current input $x_i$ and the previous state vector $s_{i-1}$.

-----------------------------------------

$RNN^{*}(x_{1:n}, s_0) = y_{1:n}$

$y_i = O(s_i) = g(s_i W^{out} + b^{out})$

$s_i = R(s_{i-1}, x_i) = g([s_{i-1}, x_i] W^{hid} + b^{hid})$, where $[s_{i-1}, x_i]$ denotes concatenation

$x_i \in \mathbb{R}^{d_{in}}$, $y_i \in \mathbb{R}^{ d_{out}}$, $s_i \in \mathbb{R}^{d_{hid}}$

$W^{hid} \in \mathbb{R}^{(d_{in}+d_{hid}) \times d_{hid}}$, $W^{out} \in \mathbb{R}^{d_{hid} \times d_{out}}$
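The recurrence can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the model trained later in the notebook: $g$ is taken to be $\tanh$, vectors are treated as row vectors so that the hidden weight matrix has shape $(d_{hid}+d_{in}) \times d_{hid}$, the output function $O$ is omitted, and all names (`rnn_step`, `rnn`, `row_times_matrix`) and weight values are hypothetical.

```python
import math


def row_times_matrix(vec, mat):
    """Multiply a row vector (list) by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def rnn_step(s_prev, x, W_hid, b_hid):
    """One application of R: s_i = tanh([s_{i-1}, x_i] W_hid + b_hid)."""
    concat = s_prev + x                      # list concatenation: [s_{i-1}, x_i]
    pre_activation = row_times_matrix(concat, W_hid)
    return [math.tanh(p + b) for p, b in zip(pre_activation, b_hid)]


def rnn(xs, s0, W_hid, b_hid):
    """Apply R along the whole sequence and collect the states s_1..s_n."""
    states, s = [], s0
    for x in xs:
        s = rnn_step(s, x, W_hid, b_hid)
        states.append(s)
    return states


d_in, d_hid = 2, 3
# Arbitrary weights of shape (d_hid + d_in) x d_hid, chosen only for the demo.
W_hid = [[0.1 * (r + c) for c in range(d_hid)] for r in range(d_hid + d_in)]
b_hid = [0.0] * d_hid
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]    # three one-hot inputs
states = rnn(xs, [0.0] * d_hid, W_hid, b_hid)
print(len(states), len(states[0]))
```

Each state depends on the whole prefix of the input, because it is computed from the previous state rather than from the current input alone.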

Let's build an RNN-based language model using PyTorch.

In [None]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import pdb
from torch.utils.data import Dataset, DataLoader

%load_ext autoreload
%autoreload 2

torch.set_printoptions(linewidth=200)

In [None]:
random_seed = 381 
torch.manual_seed(random_seed)
np.random.seed(random_seed)

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
hidden_size = 50

Let us prepare a dataset:

In [None]:
class DinosDataset(Dataset):
    def __init__(self):
        super().__init__()
        with open('dinos.txt') as f:
            content = f.read().lower()
            self.vocab = sorted(set(content)) + ['<', '>']
            self.vocab_size = len(self.vocab)
            self.lines = content.splitlines()
        self.ch_to_idx = {c:i for i, c in enumerate(self.vocab)}
        self.idx_to_ch = {i:c for i, c in enumerate(self.vocab)}
    
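    def __getitem__(self, index):
        line = self.lines[index]
        # teacher forcing: the input is the name shifted right by one character,
        # so the target at every step is the next character of the name
        x_str = '<' + line 
        y_str = line + '>' 
        # x: one-hot rows of shape [len(x_str), vocab_size]; y: target character indices
        x = torch.zeros([len(x_str), self.vocab_size], dtype=torch.float)
        y = torch.empty(len(x_str), dtype=torch.long)
        for i, (x_ch, y_ch) in enumerate(zip(x_str, y_str)):
            x[i][self.ch_to_idx[x_ch]] = 1
            y[i] = self.ch_to_idx[y_ch]
        
        return x, y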
    
    def __len__(self):
        return len(self.lines)
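The input/target shift used by the dataset can be shown without PyTorch. The following is a rough sketch with plain lists; the function name `make_pair` and the tiny vocabulary built from a single name are assumptions for illustration.

```python
def make_pair(name, vocab):
    """Build one-hot inputs for '<' + name and index targets for name + '>'."""
    ch_to_idx = {c: i for i, c in enumerate(vocab)}
    x_str = "<" + name
    y_str = name + ">"
    x = [[1.0 if j == ch_to_idx[c] else 0.0 for j in range(len(vocab))] for c in x_str]
    y = [ch_to_idx[c] for c in y_str]
    return x, y


vocab = sorted(set("aardonyx")) + ["<", ">"]
x, y = make_pair("aardonyx", vocab)
print(len(x), len(y))
print(y)
```

At every step the target is the character that becomes the input one step later, which is exactly what the model has to learn to predict.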

In [None]:
trn_ds = DinosDataset()
trn_dl = DataLoader(trn_ds, shuffle=True, batch_size=1)

In [None]:
trn_ds.lines[1]

'aardonyx'

In [None]:
trn_ds.vocab

['\n',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '<',
 '>']

In [None]:
trn_ds.vocab_size

29

In [None]:
print(trn_ds.idx_to_ch)

{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 27: '<', 28: '>'}


In [None]:
x, y = trn_ds[1]
x.shape, y.shape, x, y

(torch.Size([9, 29]),
 torch.Size([9]),
 tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
         [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0

In [None]:
x

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0

In [None]:
y

tensor([ 1,  1, 18,  4, 15, 14, 25, 24, 28])

In [None]:
["<"] +[trn_ds.idx_to_ch[i] for i in [ 1,  1, 18,  4, 15, 14, 25, 24, 28]]

['<', 'a', 'a', 'r', 'd', 'o', 'n', 'y', 'x', '>']

Let us create the RNN model:

In [None]:
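class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # a single LSTM cell, applied one character at a time
        self.lstm = nn.LSTMCell(input_size, hidden_size)
        # note: this dropout layer is defined but not applied in forward()
        self.dropout = nn.Dropout(0.3)
        # linear projection from the hidden state to scores over the vocabulary
        self.i2o = nn.Linear(hidden_size, output_size)
    
    def forward(self, h_prev, x):
        # h_prev is the (hidden state, cell state) tuple from the previous step
        h, c = self.lstm(x, h_prev)
        h = torch.tanh(h)
        y = self.i2o(h)  # unnormalized scores (logits) for the next character
        return (h, c), y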

In [None]:
model = RNN(trn_ds.vocab_size, hidden_size, trn_ds.vocab_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)

![](https://github.com/PragmaticsLab/NLP-course-AMI/raw/0cb50728ceaa825f97d88f4e72efc954b817badc/seminars/sem4_language_models/images/dinos3.png)

In [None]:
def sample(model):
    model.eval()
    word_size = 0
    end_char_idx = trn_ds.ch_to_idx['>']  # end-of-name marker
    with torch.no_grad():
        # initial hidden and cell states of the LSTM cell
        h_prev = (torch.zeros([1, hidden_size], dtype=torch.float, device=device),
                  torch.zeros([1, hidden_size], dtype=torch.float, device=device)
                  )
        # one-hot encoding of the start character '<'
        x = h_prev[0].new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<']
        indices = [start_char_idx]
        x[0, start_char_idx] = 1
        predicted_char_idx = start_char_idx
        
        while predicted_char_idx != end_char_idx and word_size != 50:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)
            
            # sample the next character from the predicted distribution
            idx = int(np.random.choice(np.arange(trn_ds.vocab_size), p=y_softmax_scores.cpu().numpy().ravel()))
            indices.append(idx)
            
            # feed the sampled character (not the argmax) back in as the next one-hot input
            x = torch.zeros_like(x)
            x[0, idx] = 1
            
            predicted_char_idx = idx
            
            word_size += 1
        
        # close the name if the length limit was reached before '>' was sampled
        if predicted_char_idx != end_char_idx:
            indices.append(end_char_idx)
    return indices

In [None]:
def print_sample(sample_idxs):
    print(''.join(trn_ds.idx_to_ch[x] for x in sample_idxs))

Let us train the model

In [None]:
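def train_one_epoch(model, loss_fn, optimizer):
    model.train()
    for line_num, (x, y) in enumerate(trn_dl):
        loss = 0
        optimizer.zero_grad()
        # reset the hidden and cell states at the start of every name
        h_prev = (torch.zeros([1, hidden_size], dtype=torch.float, device=device),
                  torch.zeros([1, hidden_size], dtype=torch.float, device=device)
                  )

        x, y = x.to(device), y.to(device)
        # x has shape [1, name_length, vocab_size]; feed one character at a time
        for i in range(x.shape[1]):
            h_prev, y_pred = model(h_prev, x[:, i])
            loss += loss_fn(y_pred, y[:, i])
            
        if (line_num+1) % 100 == 0: 
            print_sample(sample(model))
            model.train()  # sample() switches the model to eval mode
        loss.backward()
        optimizer.step()
    # NOTE: `loss` is the summed cross-entropy of the last name only, so the value below
    # is not a per-character perplexity; dividing the loss by the number of characters
    # (and averaging over the epoch) would give a value comparable across names.
    perplexity = torch.exp(loss)
    print(f'Perplexity:{perplexity}') 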


In [None]:
def train(model, loss_fn, optimizer, dataset='dinos', epochs=1):
    for e in range(1, epochs+1):
        print('Epoch:{}'.format(e))
        train_one_epoch(model, loss_fn, optimizer)
        print()

In [None]:
train(model, loss_fn, optimizer, epochs=10)

Epoch:1
<i
erb
pcseancuvnnyaouko<nopsapsfh>
<>
<dlsuaupusanoo>
<uessrauap>
<rxcuuurmuus>
<tnrsuaosar>
<ashuabsoasn>
<hiatanusuc>
<tsnerrn<uru>
<ualnaiuaxr>
<acrnaneslra>
<alvruauu>
<lerrn>
<smrstaoearur>
<guaeooaossssa>
Perplexity:3719522156544.0

Epoch:2
<alvpteousus>
<rmysosrudusarsaur>
<adrnaaususo>
<amurugourus>
<rmysisruarros>
<asgsagiugus>
<hkcucaurus>
<ttkiosaurus>
<tainanhirusaus>
<anashoaurus>
<ucusopaurut>
<krttanarrus>
<guagooaurus>
<aucitcurus>
<lhrrr>
Perplexity:14633119744.0

Epoch:3
<snostbaurus>
<sksadhucus>
<hlbucaurus>
<ttraspnusaurur>
<lbnuceurts>
<maljmures>
<kvotaurus>
<sjnytmosmrrus>
<sbrgsaurup>
<cuhictcaurus>
<ttreron>
<snrttaipapuiuo>
<aeriaonaurus>
<ivrudhurus>
<scwroshunus>
Perplexity:4220029780361216.0

Epoch:4
<euasgsaurul>
<cthiclnrus>
<bttsolaurus>
<qrtaipaurus>
<tbcslaaurus>
<tcivotaurus>
<rriyssauras>
<alrasgsaurur>
<crnpatas>
<pucotrsaurus>
<lrsubaurus>
<slucesanrus>
<hbtanturus>
<snbsop>
<snostcaurus>
Perplexity:60500.1875

Epoch:5
<sluccsarrus>
<hbtc

## Test model:

In [None]:
ids = sample(model)
print_sample(ids)

<ronytsaurus>


## Task 1.
Rewrite the sampling function so that it generates only words in which no character occurs more than once (isograms).

In [None]:
# your code here
def sample(model):
    model.eval()
    word_size=0
    newline_idx = trn_ds.ch_to_idx['>']
    with torch.no_grad():
        # the LSTM cell expects a (hidden state, cell state) tuple
        h_prev = (torch.zeros([1, hidden_size], dtype=torch.float, device=device),
                  torch.zeros([1, hidden_size], dtype=torch.float, device=device)
                  )
        x = h_prev[0].new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<']
        indices = [start_char_idx]
        x[0, start_char_idx] = 1
        predicted_char_idx = start_char_idx
        
        while predicted_char_idx != newline_idx and word_size != 50:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)
            
            # forbid characters that were already used, then renormalize
            probs = y_softmax_scores.cpu().numpy().ravel().astype(np.float64)
            probs[indices] = 0.0
            if probs.sum() == 0:
                break  # no unused character has non-zero probability
            probs /= probs.sum()
            idx = int(np.random.choice(np.arange(trn_ds.vocab_size), p=probs))
            indices.append(idx)
            
            # feed the sampled character back in as the next one-hot input
            x = torch.zeros_like(x)
            x[0, idx] = 1
            
            predicted_char_idx = idx
            
            word_size += 1
        
        # close the name if the loop ended without sampling the end marker
        if indices[-1] != newline_idx:
            indices.append(newline_idx)
    return indices

In [None]:
print_sample(sample(model))

<tarkedlunis>


## Task 2.
Rewrite the sampling function so that it is possible to change the sampling temperature.

In [None]:
def equalize_probs_sqrt(in_vector):
    # flatten the distribution by taking element-wise square roots, then renormalize
    out_vector = np.sqrt(in_vector)
    return out_vector / out_vector.sum()

In [None]:
# your code here
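One possible approach to this task, sketched in plain Python rather than PyTorch: divide the model's scores (logits) by a temperature before the softmax. The function names `softmax_with_temperature` and `sample_index` and the example logits are assumptions for illustration; in the notebook the logits would come from `y_pred`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs)[0]


logits = [2.0, 1.0, 0.1]                         # hypothetical scores for three characters
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Temperatures below 1 sharpen the distribution and temperatures above 1 flatten it. With temperature 2 the result coincides with taking square roots of the original probabilities and renormalizing, which is what `equalize_probs_sqrt` does.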

## Task 3.
Implement beam search for generating names.

In [None]:
# your code here
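As a starting point, here is a sketch of beam search over a character bigram model instead of the RNN, so that it runs without PyTorch. Everything in it is an assumption for illustration: the helper names `bigram_log_probs` and `beam_search`, the four-name toy list, and the choice to score sequences by summed log-probability. At each step it keeps only the `beam_width` best partial names and sets aside those that have reached the end marker `>`.

```python
import math
from collections import Counter, defaultdict


def bigram_log_probs(names):
    """MLE log-probabilities log P(next | prev) from names wrapped in '<' and '>'."""
    counts = defaultdict(Counter)
    for name in names:
        for left, right in zip(name, name[1:]):
            counts[left][right] += 1
    return {
        left: {ch: math.log(c / sum(nxt.values())) for ch, c in nxt.items()}
        for left, nxt in counts.items()
    }


def beam_search(log_probs, beam_width=3, max_len=12):
    """Return finished names with their log-probabilities, best first."""
    beams = [("<", 0.0)]
    finished = []
    for _ in range(max_len):
        # extend every partial name by every possible next character
        candidates = []
        for seq, score in beams:
            for ch, lp in log_probs[seq[-1]].items():
                candidates.append((seq + ch, score + lp))
        # keep the best beam_width candidates; finished names leave the beam
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq.endswith(">"):
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return sorted(finished, key=lambda item: item[1], reverse=True)


toy_names = ["<talos>", "<mei>", "<balaur>", "<almas>"]
results = beam_search(bigram_log_probs(toy_names), beam_width=3)
for name, score in results[:3]:
    print(name, round(math.exp(score), 4))
```

To use the same loop with the RNN, each beam entry would additionally carry the model's hidden state, and the per-character log-probabilities would come from a log-softmax over the model output.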