# Machine Translation
## Statistical Machine Translation
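Statistical machine translation (SMT) aims to build a probabilistic model that finds the most likely target sentence given the source sentence: $\arg\max_{\text{target language}} P(\text{target language} | \text{source language})$. Using Bayes' rule, and dropping the denominator $P(\text{source language})$ because it does not depend on the target, we can break this down into two models. A language model: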
$$
P(\text{target language})
$$
and a translation model:
$$
P(\text{source language} | \text{target language})
$$
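
Written out with $x$ for the source sentence and $y$ for the target sentence (symbols introduced here for brevity), the decomposition follows from Bayes' rule. $P(x)$ is constant with respect to $y$, so it drops out of the argmax:
$$
\hat{y} = \arg\max_{y} P(y | x) = \arg\max_{y} \frac{P(x | y)\, P(y)}{P(x)} = \arg\max_{y} P(x | y)\, P(y)
$$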

To learn the translation model, we need a large amount of parallel data (e.g. aligned source-language/target-language sentence pairs). In addition, languages differ in word order, and one word in one language may correspond to several words in another. So, intuitively, we introduce an alignment variable $a$. $a$ is a latent variable: it cannot be observed directly in the data. The alignment variable handles not only word order but also many-to-many, one-to-many, and many-to-one correspondences. Therefore, we model:
$$
P(\text{source language}, a | \text{target language})
$$

Since we have a latent variable, special learning algorithms such as the EM (Expectation Maximization) algorithm are required.
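
As a sketch of how the latent alignment enters the translation model, writing $x$ for the source sentence and $y$ for the target sentence (notation assumed here), the alignment is summed out:
$$
P(x | y) = \sum_{a} P(x, a | y)
$$
EM alternates between estimating a posterior distribution over alignments under the current parameters (E-step) and re-estimating the parameters from those expected alignments (M-step).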

# Sequence to Sequence
Neural Machine Translation (NMT) is a way to do machine translation with a single end-to-end neural network. Its architecture is called a sequence-to-sequence model. In this architecture, an encoder reads the source sentence and a decoder generates the target sentence.

Put simply, we take two simple RNNs, one as the encoder and one as the decoder. We feed the entire source sentence into the encoder and obtain its outputs and last hidden states. We discard the outputs and use the last hidden states as the initial hidden states of the decoder. The decoder is then trained with teacher forcing: at each step it receives the ground-truth previous target word as input, rather than its own prediction.
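
Teacher forcing can be illustrated by how the decoder's inputs and training targets are built from one reference sentence. This is a minimal sketch: the `<BOS>` start token and the helper name `teacher_forcing_pairs` are assumptions for illustration, not part of the notes.

```python
BOS, EOS = "<BOS>", "<EOS>"  # assumed special tokens


def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input with the word the decoder should predict next.

    The decoder input is the ground-truth sentence shifted right by one
    position, so step t sees the true word t-1 instead of its own guess.
    """
    decoder_inputs = [BOS] + target_tokens
    decoder_targets = target_tokens + [EOS]
    return list(zip(decoder_inputs, decoder_targets))


for given, expected in teacher_forcing_pairs(["le", "chat", "dort"]):
    print(f"input: {given:6s} -> target: {expected}")
```

At inference time there is no ground truth, so the decoder has to consume its own previous prediction instead.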

The sequence-to-sequence model is an example of a **Conditional Language Model**. It directly calculates $P(y|x)$:
$$
P(\vec{y}|\vec{x})=P(y_1|\vec{x})\, P(y_2|y_1, \vec{x})\, P(y_3|y_1, y_2, \vec{x}) \cdots
$$
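
In practice the product is computed as a sum of log-probabilities to avoid numerical underflow. The sketch below assumes a hypothetical lookup table `TOY_COND_PROBS` standing in for the decoder's per-step distributions for one fixed source sentence.

```python
import math

# Hypothetical conditional probabilities P(y_t | y_<t, x) for one fixed source x.
TOY_COND_PROBS = {
    (): {"the": 0.7, "a": 0.3},
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"<EOS>": 1.0},
}


def sequence_log_prob(tokens):
    """Return log P(y | x) as the sum of per-step conditional log-probabilities."""
    total = 0.0
    for t, token in enumerate(tokens):
        history = tuple(tokens[:t])  # everything generated before step t
        total += math.log(TOY_COND_PROBS[history][token])
    return total


log_p = sequence_log_prob(["the", "cat", "<EOS>"])
print(math.exp(log_p))  # approximately 0.7 * 0.6 * 1.0 = 0.42
```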

## Multilayer RNNs
As in computer vision, where lower layers extract lower-level features and higher layers extract higher-level features, the same principle applies to multi-layer (stacked) RNNs.

## Decoding for NMT
Now we have a model that can predict the next word given the source sentence and the previously generated words. The process of producing a complete sentence by generating words one at a time is called **decoding**.
### Greedy Decoding
Greedy decoding is the most intuitive method: at each time step, we take the most probable candidate word as the output. However, taking the locally optimal choice at each time step doesn't necessarily lead to the globally optimal sentence. On the other hand, enumerating every possible sentence to find the optimal one is computationally infeasible, because the number of candidates grows exponentially with sentence length. As a compromise between efficiency and accuracy, we have Beam Search.
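
A minimal sketch of greedy decoding follows. The table `TOY_MODEL` is a hypothetical stand-in for the decoder's next-word distribution (a real model would compute it from the source sentence and the prefix). It is chosen so that the greedy result is not the most probable sentence overall.

```python
import math

EOS = "<EOS>"

# Hypothetical next-word distributions P(y_t | y_<t, x), keyed by the prefix.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"z": 0.9, "w": 0.1},
}


def next_token_probs(prefix):
    """Look up the toy distribution; any unlisted prefix ends the sentence."""
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})


def greedy_decode(max_len=10):
    """Pick the single most probable word at every step."""
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        log_prob += math.log(probs[token])
        prefix.append(token)
        if token == EOS:
            break
    return prefix, log_prob


tokens, log_prob = greedy_decode()
print(tokens, math.exp(log_prob))
```

Here greedy decoding commits to `a` (probability 0.6) and ends with a sentence of probability about 0.33, while the sentence starting with `b` has probability about 0.36.
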
### Beam Search
Beam Search: at each decoder step, keep track of the k most probable partial translations (which we call hypotheses). k is the beam size.

Algorithm Description:

1. For the current hypotheses, calculate their scores: $score(y_1, y_2, \dots, y_t) = \log P(y_1, y_2, \dots, y_t | x)$.
2. Select the k best partial hypotheses based on the scores.
3. Extend each of the selected k hypotheses with its possible next words and return to step 1.
4. A partial hypothesis is complete once it produces \<EOS> (end of sequence); it is then set aside. Beam search stops when we reach a pre-defined cut-off timestep T, or when we have at least n completed hypotheses.
5. The completed hypotheses may vary in length, and longer hypotheses generally have lower scores because every additional word adds a negative log-probability. We therefore normalize each score by the factor $\frac{1}{L^{\alpha}}$, where $L$ is the length of the hypothesis and $\alpha$ is a hyperparameter (0.75, for example).
6. Select the completed hypothesis with the best normalized score as the output sequence.
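
The steps above can be sketched as follows. This is an illustrative implementation under assumptions: `TOY_MODEL` is a hypothetical table standing in for the decoder, and the parameter names `max_len` (the cut-off T) and `min_completed` (the n completed hypotheses) are chosen here for readability.

```python
import math

EOS = "<EOS>"

# Hypothetical next-word distributions P(y_t | y_<t, x), keyed by the prefix.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"z": 0.9, "w": 0.1},
}


def next_token_probs(prefix):
    """Look up the toy distribution; any unlisted prefix ends the sentence."""
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})


def beam_search(k=2, max_len=10, min_completed=2, alpha=0.75):
    """Return the best (tokens, log_prob) pair after length normalization."""
    beams = [((), 0.0)]  # live hypotheses: (tokens, log P(tokens | x))
    completed = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next word.
        candidates = [
            (tokens + (token,), score + math.log(p))
            for tokens, score in beams
            for token, p in next_token_probs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:  # keep only the k best
            if tokens[-1] == EOS:
                completed.append((tokens, score))  # set finished ones aside
            else:
                beams.append((tokens, score))
        if not beams or len(completed) >= min_completed:
            break
    finished = completed or beams  # fall back if nothing reached <EOS>
    # Length-normalize: score / L^alpha, then pick the best.
    return max(finished, key=lambda h: h[1] / len(h[0]) ** alpha)


print(beam_search(k=1))  # beam size 1 behaves like greedy decoding
print(beam_search(k=2))
```

In this toy table, beam size 1 returns the sentence starting with `a` (probability about 0.33), while beam size 2 keeps `b` alive long enough to find the more probable sentence (about 0.36).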

## Evaluation of Machine Translation
BLEU (BiLingual Evaluation Understudy) compares a machine-written translation to **one or several** human-written translations and computes a similarity score.

$$
\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

where:
$N$ is the highest n-gram order, a hyperparameter. It gives the metric its name, BLEU-N (e.g. BLEU-4 for $N=4$).

$p_n$ is the modified (clipped) n-gram precision of the machine-written translation with respect to the one or several human-written translations.

$w_n$ is the weight of each $p_n$, usually $\frac{1}{N}$, to give equal importance to each n-gram order.

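BP (Brevity Penalty) is a penalty term applied to machine translations that are precise but incomplete (usually too short). With $c$ the length of the machine translation and $r$ the length of the reference translation (with several references, the one closest in length to $c$), it is calculated this way: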
$$
\text{BP} =
\begin{cases}
1, & \text{if } c > r \\
\exp\left(1 - \frac{r}{c}\right), & \text{if } c \le r
\end{cases}
$$
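
The brevity penalty is small enough to state directly in code. In this sketch, `c` and `r` are the candidate and reference lengths in tokens; returning 0 for an empty candidate is an assumption made here to avoid division by zero.

```python
import math


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if c == 0:
        return 0.0  # assumed convention for an empty candidate
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


print(brevity_penalty(12, 10))  # longer than the reference: no penalty
print(brevity_penalty(5, 10))   # half the reference length: exp(-1)
```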

Algorithm Description:

1. Tokenize the human-written translations and the machine-written translation.
2. For each n-gram order n from 1 to N, generate n-gram phrases with a sliding window. For example, with n=3: \[the, cat, sat, on, the, mat]->\[(the, cat, sat), (cat, sat, on), (sat, on, the), (on, the, mat)]
3. Count how many n-gram phrases in the machine-written translation also appear in the human-written translation, regardless of position. **N.B.**: each **type** of n-gram phrase can contribute at most as many matches as the number of times it occurs in the human-written translation (clipped count). For example, suppose we have a machine translation (never mind how it is generated; only the calculation of $p_n$ matters here): \[(the, cat, sat), (cat, sat, on), (sat, on, the), (on, the, mat), (on, the, mat)], and a human translation: \[(there, is, a), (is, a, cat), (a, cat, on), (cat, on, the), (on, the, mat)]. Here (on, the, mat) is the only n-gram phrase that occurs in both translations, so it is the only type that contributes to the clipped match count. Although it occurs twice in the machine translation, it occurs only once in the human-written translation, so it contributes a clipped count of 1, which is also the total clipped match count of this machine translation. Divide this by the total number of n-gram phrases (not the number of distinct types!) in the machine translation to get $p_n$; in this example, $p_3 = \frac{1}{5}$.
4. Sum $\log p_n$ weighted by $w_n$ (usually $\frac{1}{N}$, so that all n-gram orders are equally important) from n=1 to n=N.
5. Apply exp to the sum and multiply the result by BP.
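
The procedure above can be sketched in a few lines. This is a simplified sentence-level version under assumptions: uniform weights $w_n = \frac{1}{N}$, no smoothing (the score is 0 when any $p_n$ is 0), and the reference closest in length to the candidate used for $r$. The function names are chosen here for illustration; established toolkits handle more edge cases.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()  # highest count of each n-gram in any one reference
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, N=4):
    """BLEU-N with uniform weights and the standard brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, N + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
    c = len(candidate)
    # Reference length closest to c (ties broken toward the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) / N for p in precisions))


reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
print(modified_precision(candidate, [reference], 1))  # clipped: 2 matches out of 7
print(bleu(reference, [reference]))                   # identical sentences
```

The clipping is what stops the degenerate candidate from scoring a unigram precision of 1: "the" appears seven times in the candidate but is credited only twice, its count in the reference.
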
# Attention
See the next chapter: Self-Attention and Transformers

```python
import torch
import torch.nn as nn

# A reference NMT model architecture. It takes word-vector (already embedded) inputs.
# The complete code is in assignment 3.
# In an NMT task, 2-4 is the best num_layers for the encoder RNN, while 4 is the best
# for the decoder RNN (Britz, 2017).
class encoder_decoder_NMT(nn.Module):
    def __init__(self, embed_size, trg_vocab_size, hidden_size=256):
        super().__init__()
        self.encoder = nn.GRU(embed_size, hidden_size, num_layers=4, bidirectional=False, batch_first=True)
        self.decoder = nn.GRU(embed_size, hidden_size, num_layers=4, batch_first=True)
        # Classifier head that maps each decoder hidden state to target-vocabulary logits.
        self.cls = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_size, trg_vocab_size)
        )

    def forward(self, source, target=None):
        if self.training:
            return self._forward_train(source, target)
        else:
            return self._forward_inference(source)

    def _forward_train(self, source, target):
        # In training mode, the batch sizes of source and target must be identical!
        # encoder_output: (batch, src_len, hidden_size)
        # encoder_last_hidden_states: (num_layers, batch, hidden_size)
        encoder_output, encoder_last_hidden_states = self.encoder(source)
        # Teacher forcing: the decoder reads the ground-truth target sequence and
        # starts from the encoder's last hidden states.
        # decoder_output: (batch, trg_len, hidden_size)
        decoder_output, decoder_last_hidden_states = self.decoder(target, encoder_last_hidden_states)
        # Flatten to (batch * trg_len, hidden_size) for the classifier head.
        logits = self.cls(decoder_output.reshape(-1, decoder_output.size(-1)))
        # Restore the shape (batch, trg_len, trg_vocab_size).
        logits = logits.view(decoder_output.size(0), decoder_output.size(1), -1)
        return logits

    def _forward_inference(self, source):
        # Decoding (greedy or beam search) is left out of this reference sketch;
        # the complete code is in assignment 3.
        raise NotImplementedError("inference-time decoding is not implemented in this sketch")
```

Viterbi Algorithm, EM(Expectation Maximization) Algorithm, Beam Search