# Baseline performance.

![alt text](https://i.ibb.co/W2GVnCb/Screen-Shot-2021-03-31-at-21-27-42.png)

BLEU score: 14.79307847852665

**<font color=blue>Observations:</font>**

- First, we observe that the different techniques for improving translation quality are not mutually exclusive.
- Below, for the sake of the problem at hand, we apply them separately in order to see how much of an improvement each one contributes.
- The first two models didn't do that well: their BLEU score is $< 22$. **But** the last one does relatively well (above the lower bound required for submitting the homework). That's why I allow myself to submit the homework as it is. If a BLEU score $\geq 20$ was required for each model, please let me know so that I can make the corresponding fixes.

# Model 1: Optimization Enhancement: Learning Rate Decay.

![alt text](https://i.ibb.co/J5kymX7/Screen-Shot-2021-03-31-at-21-29-27.png)

BLEU score: 13.639520230929117

**<font color=blue>Conclusions:</font>**

- For learning rate decay we apply the out-of-the-box `lr_scheduler.ReduceLROnPlateau` with `patience=2` for dynamic learning rate reduction.
- Both the loss and the BLEU score are worse than the baseline. This may be due to the fact that:
    - Each batch is small (128 sentences, each approximately 50 tokens long), which may lead to more erratic behaviour when updating the weights (the gradient descent step).
    - The number of epochs is also small (10), which doesn't give the scheduler much time to reduce the learning rate (it only does so when the loss has not improved for more than `patience` consecutive scheduler steps). This in turn may lead only to additional computation rather than performance enhancements.
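
To illustrate the second point, here is a minimal pure-Python sketch of the plateau rule. It is a simplified re-implementation for illustration only, not the PyTorch scheduler itself: it omits cooldown and the minimum learning rate. The function name `simulate_plateau_schedule`, the loss values, and the defaults `factor=0.1` and `threshold=1e-4` are assumptions and were not taken from the experiment above.

```python
def simulate_plateau_schedule(losses, lr=1e-3, factor=0.1, patience=2, threshold=1e-4):
    """Return the learning rate in effect after each epoch.

    Simplified plateau rule (hypothetical helper): the learning rate is
    multiplied by `factor` once the loss has failed to improve for more
    than `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        # An epoch counts as an improvement only if it beats the best
        # loss so far by a relative margin of `threshold`.
        if loss < best * (1.0 - threshold):
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up loss curves, for illustration only.
improving = [5.0, 4.2, 3.8, 3.5, 3.3, 3.2, 3.1, 3.05, 3.0, 2.98]  # 10 epochs, always improving
stalled = [5.0, 4.0, 4.1, 4.05, 4.2, 3.9]                         # three bad epochs in a row

lr_improving = simulate_plateau_schedule(improving)
lr_stalled = simulate_plateau_schedule(stalled)
print(lr_improving)
print(lr_stalled)
```

With a loss that keeps improving over 10 epochs, the learning rate never changes, so the scheduler has no effect. It only acts in the second curve, after the third consecutive epoch without improvement.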

## Model 2: Word Segmentation for Russian Language

Tokenizing Russian text is harder than tokenizing English, mainly because of compound words and words joined by hyphens. A word like `какой-то` is split into multiple tokens, none of which on its own conveys the meaning of either the original Russian word or its English translation.

In [None]:
from nltk.tokenize import WordPunctTokenizer

In [None]:
tokenizer_W = WordPunctTokenizer()
def tokenize(x, tokenizer=tokenizer_W):
    return tokenizer.tokenize(x.lower())

In [None]:
text = "Не ветер, а какой-то ураган!"

To deal with this, we'll use spaCy's tokenization together with a Russian-specific text segmentation API that applies more accurate rules for the Russian language.

The vocabulary of patterns is obtained from the Russian National Corpus (НКРЯ). For more details (and props) on the API see [here](https://github.com/aatimofeev/spacy_russian_tokenizer).

In [None]:
!pip install pymorphy2==0.8

Successfully installed dawg-python-0.7.2 pymorphy2-0.8 pymorphy2-dicts-2.4.393442.3710985


In [None]:
from spacy.lang.ru import Russian

In [None]:
!pip install git+https://github.com/aatimofeev/spacy_russian_tokenizer.git

Successfully installed spacy-russian-tokenizer-0.1.1


In [None]:
from spacy.lang.ru import Russian
from spacy_russian_tokenizer import RussianTokenizer, MERGE_PATTERNS

In [None]:
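nlp = Russian()
# Pipeline component that merges tokens matching MERGE_PATTERNS (e.g. hyphenated words).
russian_tokenizer = RussianTokenizer(nlp, MERGE_PATTERNS)
nlp.add_pipe(russian_tokenizer, name='russian_tokenizer')
def rus_tokenize(x, tokenizer=nlp):
    # Lowercase the text and run it through the given spaCy pipeline.
    tokens = tokenizer(x.lower())
    return [token.text for token in tokens]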

In [None]:
tokenize(text)

['не', 'ветер', ',', 'а', 'какой', '-', 'то', 'ураган', '!']

In [None]:
rus_tokenize(text)

['не', 'ветер', ',', 'а', 'какой-то', 'ураган', '!']

![alt text](https://i.ibb.co/x14Xzgz/Screen-Shot-2021-03-31-at-21-55-01.png)

BLEU Score: 14.94341105622846


**<font color=blue>Conclusions:</font>**

- This model does slightly better than the baseline, which suggests that a more accurate tokenization of Russian words may have helped establish a better correspondence between source and target words.
- Repeated experiments may help to clarify whether this improvement is consistent or not (the BLEU scores here and in the baseline are quite close).
- One reason why we didn't get the expected improvement may be the nature of the text we trained on. The texts are hotel descriptions, so the language is more formal and words such as `какой-то` and `кто-нибудь` don't occur that often (these words convey a sense of uncertainty, which we wouldn't expect in a hotel description on booking.com or a similar site).

## Model 4: Transformer with Attention.

![alt text](https://i.ibb.co/5xdj004/Screen-Shot-2021-03-31-at-22-10-02.png)

**<font color=green>BLEU Score: 27.33281166978366</font>**

**<font color=blue>Observations:</font>**

- Here we implemented Luong's Attention from [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025.pdf).
- The number of layers chosen is 1 for computational reasons.
- The scores used for computing the attention weights $\alpha_{ts}$ are computed using the _dot_ alternative: $\mathrm{score}(h_t, \overline{h}_s) = h_t^{\top}\overline{h}_s$, where $h_t$ is the current decoder hidden state and $\overline{h}_s$ ranges over all the encoder hidden states. Then $\alpha_{ts}$ is obtained by applying a softmax over $s$: $\alpha_{ts} = \exp(\mathrm{score}(h_t, \overline{h}_s)) / \sum_{s'} \exp(\mathrm{score}(h_t, \overline{h}_{s'}))$.
- Implementation details can be found in the file `my_network_attention.py` (class `Decoder`).
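
The dot-score attention described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the `Decoder` class from `my_network_attention.py`: the helper names `softmax` and `dot_attention` and the toy vectors are hypothetical. It handles a single decoder step with no batching, and it stops at the context vector (the weighted average of the encoder states).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(h_t, encoder_states):
    """Dot-score attention for one decoder step (hypothetical helper).

    h_t:            current decoder hidden state, a list of floats.
    encoder_states: list of encoder hidden states, each the same size as h_t.
    Returns the attention weights and the context vector.
    """
    # score(h_t, h_s) = h_t^T h_s for every encoder state h_s
    scores = [sum(a * b for a, b in zip(h_t, h_s)) for h_s in encoder_states]
    # alpha_ts = softmax of the scores over the source positions s
    alphas = softmax(scores)
    # context vector = attention-weighted average of the encoder states
    context = [
        sum(alpha * h_s[i] for alpha, h_s in zip(alphas, encoder_states))
        for i in range(len(h_t))
    ]
    return alphas, context


# Toy example: one decoder state and three encoder states of size 2.
h_t = [1.0, 0.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = dot_attention(h_t, encoder_states)
print(alphas)
print(context)
```

The first and third encoder states have the same dot product with `h_t`, so they receive equal weights, and both are weighted more heavily than the second state, which is orthogonal to `h_t`.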

**<font color=blue>Conclusions:</font>**

- The score is good enough (: 
- Here we can see why a meaningful tokenization alone is not enough to improve translation quality: it doesn't remove the bottleneck caused by passing only the last encoder hidden state to the decoder, as we do in the vanilla model without attention. It's advisable to introduce some technique (attention is not the only option) that captures the "influence" of each source word on the next word we're trying to predict (word alignment). Attention does just that.
- Since each source sentence has a length of $\approx 50$ words/tokens, the total number of hidden states produced by the encoder is also $\approx 50$. This number is small enough to compute attention over all of them, and large enough to show how attention captures information from fairly long sentences.
- Further experiments related to attention may include:
  - Trying different techniques for calculating $\mathrm{score}(h_t, \overline{h}_s)$.
  - Trying [Bahdanau's Attention](https://arxiv.org/pdf/1409.0473.pdf).
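
For reference, the alternative score functions proposed in the Luong et al. paper linked above are listed below, with the notation used earlier. The matrices $W_a$, $U_a$ and the vector $v_a$ are learned parameters and are not part of the model trained here. Only the _dot_ variant was actually used in this homework.

$$
\mathrm{score}(h_t, \overline{h}_s) =
\begin{cases}
h_t^{\top}\overline{h}_s & \text{(dot)} \\
h_t^{\top} W_a \overline{h}_s & \text{(general)} \\
v_a^{\top} \tanh\left(W_a [h_t; \overline{h}_s]\right) & \text{(concat)}
\end{cases}
$$

Bahdanau's attention instead uses an additive score computed from the previous decoder state $h_{t-1}$:

$$
\mathrm{score}(h_{t-1}, \overline{h}_s) = v_a^{\top} \tanh\left(W_a h_{t-1} + U_a \overline{h}_s\right)
$$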