<img src='https://drive.google.com/uc?export=view&id=1ouZdIiuVwSlIloeygaelDhcBW5bNqk-S' width='30%' alt='liege.jpeg'>

### Preamble
Please open this notebook with Google Colab using Google Chrome.
If another platform or browser is used, some code may not run correctly and some figures may not be displayed correctly.
For instance, the figures are not displayed at all when Colab is used with Safari.

All the figures and the results shown in this notebook can be found on this [git](https://github.com/FloDg/supervised-nmt-project) in the _figures_ and _results_ folders.

This work is an academic project carried out for the course _Web and Text Analytics_ (INFO2049-1) at the _University of Liège_.
Its authors are **De Geeter Florent**, **Nelissen Louis** and **Pirenne Thomas**.
The sources on which this work is based are mostly cited throughout the notebook and are summarized in the _Sources_ section of the _Report_ part of the notebook.


# Supervised Neural Machine Translation
This notebook studies the impact of various basic embeddings and of Bahdanau's attention mechanism on Neural Machine Translation. To that end, we implemented an Encoder-Decoder architecture based on LSTM cells and trained multiple models in a supervised manner. The notebook is made up of two parts: the first reports our results and gives an overview of our methods and implementation, and the second contains the implementation itself.

## Part 1 - Report
This section describes our solution, presents the results we have obtained and discusses them.

## Part 2 - Implementation
This section contains the actual implementation of the project.

# Part 1 - Report
1. Implementation
    1. Overview
        1. Dataset
        2. Embeddings
        3. Architecture
        4. Attention
        5. Training
    2. Problems encountered
    3. Considered solutions
2. Results
    1. Word2Vec
    2. GloVe
    3. fastText
    4. Comparison
    5. Attention
3. Discussion
4. Improvements
5. Sources

## Implementation
This section starts with an overview of our implementation choices, more specifically the way we imported, created and used embeddings, the way we designed our architecture, the basic principles of the attention mechanism applied to machine translation and the way we trained the models. The section then presents the main problems we encountered and finishes by describing the solutions we found and considered for those problems.

### Overview

Our implementation is largely based on the TensorFlow tutorial [_Neural Machine Translation with attention_](https://www.tensorflow.org/tutorials/text/nmt_with_attention) cited at the end of the **Report** section. We started off by removing the attention mechanism from it and replacing the GRU with LSTM cells. We also removed the embedding layer used in the tutorial and replaced it with pretrained and custom embeddings of three kinds: _Word2Vec_, _GloVe_ and _FastText_.

Later on, we chose to reimplement the attention mechanism shown in that same tutorial and then train models with it to judge its usefulness.

#### Dataset

In the TensorFlow tutorial, the authors use a Spanish-English dataset found on _manythings.org_.
We decided to train our models to translate sentences from English to French and found an English-French parallel dataset on this very same website.
This dataset consists of about 170,000 sentence pairs.
The decision to translate from English to French is motivated by the fact that French is more complex than English.
Our intuition is that, because of this complexity, a given English sentence admits on average more valid French renderings.
Since translating from one language to another consists in interpreting a source sentence and outputting a single sentence of similar meaning in the target language, translating from English to French seemed slightly easier than translating from French to English.

Additionally, we chose to only train models to translate in one direction (i.e. EN -> FR), but training in the other direction should not require any modification of the architecture. In fact, it should only require swapping the _source_ and _target_ sentences in the dataset before training, as the tasks are similar.
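
As a minimal sketch of the swap described above, assuming the parallel dataset is held as a list of `(source, target)` string pairs (the helper name `swap_direction` and the two sample pairs are hypothetical, not taken from the implementation):

```python
def swap_direction(pairs):
    """Return the same parallel corpus with source and target exchanged."""
    return [(target, source) for source, target in pairs]


# Hypothetical EN -> FR pairs standing in for the real dataset.
en_fr = [("I am cold.", "J'ai froid."), ("Thank you.", "Merci.")]

# The same data, now usable to train a FR -> EN model.
fr_en = swap_direction(en_fr)
print(fr_en[0])
```

Everything downstream (tokenization, embedding, training) would then see French as the source language, which is why no architectural change should be needed.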

#### Embeddings
We considered the 3 following word embedding techniques: [_Word2Vec_](https://arxiv.org/abs/1301.3781), [_GloVe_](https://nlp.stanford.edu/projects/glove/) and [_FastText_](https://fasttext.cc/).
For each of those, we had two options: either use **pretrained** word embeddings or train **custom** embeddings on our own dataset. 
<!-- For the pretrained embeddings we use the [gensim downloader API](https://radimrehurek.com/gensim/downloader.html) to import one of the word embedding models [they propose](https://github.com/RaRe-Technologies/gensim-data). -->
We will now go over each in detail:

##### Pretrained Word2Vec 
- EN 🇬🇧 - We chose one of the pretrained models provided by the [gensim downloader API](https://radimrehurek.com/gensim/downloader.html). 
In particular we used a model trained on [Google News](https://github.com/RaRe-Technologies/gensim-data) articles using Word2Vec CBOW. 
This model covers over 3 million words and phrases, each represented by a vector of _300_ dimensions.

- FR 🇫🇷 - We sourced French Word2Vec embeddings from [Jean-Philippe Fauconnier's website](https://fauconnier.github.io/) as a binary file, loaded as gensim `KeyedVectors`.
These word embeddings are trained on [frWaC](https://wacky.sslmit.unibo.it/doku.php?id=corpora#french), a corpus of 1.6 billion words sourced from a web crawl of the **.fr** domain.
The embeddings have _200_ dimensions and use the Continuous Bag of Words method.
Among the available models, we tried to pick the closest equivalent to the English embedder we use.

##### Custom Word2Vec
- EN 🇬🇧 & FR 🇫🇷 - We trained both word embeddings on our dataset using [gensim's Word2Vec model](https://radimrehurek.com/gensim/models/word2vec.html).
These embeddings have _100_ dimensions, a value we chose somewhat arbitrarily.
Since our dataset is considerably smaller than the corpora commonly used to pretrain embeddings, we chose a smaller dimension than that of the pretrained embeddings, while trying to pick a value that is neither too high nor too low.
<!-- Custom w2v -> With gensim models, train on our data (format?) gives a dictionary in the form of KeyedVectors. Training takes a certain amount of time ? -->

##### Pretrained GloVe
- EN 🇬🇧 - Again, we decided to use one of the pretrained models provided by the [gensim downloader API](https://radimrehurek.com/gensim/downloader.html).
In this case we used a [model](https://github.com/RaRe-Technologies/gensim-data) trained on a combination of a [Wikipedia text crawl from 2014](https://dumps.wikimedia.org/enwiki/20140102/) and [gigaword](https://catalog.ldc.upenn.edu/LDC2011T07), a dataset made from an aggregation of news articles in English.
This model has a vocabulary of over 400,000 words, each represented by a vector of _300_ dimensions (several other sizes were available but we chose 300 to stay as consistent as possible with Word2Vec).
- FR 🇫🇷 - We did not find any pretrained GloVe model for the French language. We therefore use a custom model, as detailed below.


##### Custom GloVe 
<!-- -> With glove_python package, train model, then adjust this model to work in the same format as w2v (for genericity in our code) -->
- EN 🇬🇧 & FR 🇫🇷 - We used the `glove_python` package to generate and train our custom GloVe model. 
To that end we followed this [Medium tutorial on GloVe](https://medium.com/analytics-vidhya/word-vectorization-using-glove-76919685ee0b) and trained both French and English embeddings on our dataset. 
As for the other custom embeddings, we set the vector dimension to _100_ in order to make the models as comparable as possible.

##### Pretrained fastText 
<!-- -> from gensim downloader api trained on wikinews (size?). -->
- EN 🇬🇧 - Again, we decided to use one of the pretrained models provided by the [gensim downloader API](https://radimrehurek.com/gensim/downloader.html).
In this case we used a [model](https://github.com/RaRe-Technologies/gensim-data) trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset.
The embeddings are vectors of _300_ dimensions.
An important note is that we get a dictionary of word vectors and not the actual model, meaning that we cannot obtain embeddings for out-of-vocabulary words with it.
- FR 🇫🇷 - We did not find any pretrained fastText model for the French language. We therefore use a custom model, as detailed below.

##### Custom fastText 
<!-- -> With gensim models, train on our data too. Gives dictionary in the form of KeyedVectors too. -->
- EN 🇬🇧 & FR 🇫🇷 - We trained both word embeddings on our dataset using [gensim's FastText model](https://radimrehurek.com/gensim/models/fasttext.html).
As with the pretrained model, we only use the resulting dictionary of word vectors and not the model itself.
These embeddings have _100_ dimensions, to stay consistent with the other custom models.


#### Architecture
We chose to use a traditional Encoder-Decoder Recurrent Neural Network (RNN) architecture, inspired by Google's [seq2seq](https://google.github.io/seq2seq/) architecture. This architecture consists of two RNNs.
The first network (the Encoder) encodes a source sentence of variable length into a state with a fixed shape.
The second network (the Decoder) takes this state and decodes it into a variable-length sequence, in our case a sentence in our target language.
Encoder-Decoder RNNs were one of the first architectures used in Neural Machine Translation and remain a solid option today.

As for the type of RNN, we chose to work with LSTM cells. The tutorial which we used as the foundation for our project used a GRU, but we picked the LSTM instead because it decouples long-term and short-term memory, which we see as a desirable property.
In addition, LSTMs seem more prevalent in the literature, and for many applications they are judged to be at least as good as GRUs. However, this choice remains largely arbitrary.


#### Attention Mechanism
An early problem of RNNs in NMT was that they had a hard time handling very long input sentences encoded as a single vector.
A common solution to this is to use an [attention mechanism](https://arxiv.org/pdf/1409.0473.pdf).
The reason why the classical Encoder-Decoder RNN scheme has trouble with long input sequences is that the longer the sequence, the more difficult it is for the encoder to provide an output state that characterizes the first words of the sequence as well as the last ones. Since the decoder typically takes as input the output of the encoder, i.e. its last hidden state, this output is, for long sequences, somewhat less influenced by the first words of the sequence than by the last. This asymmetry makes it difficult for the decoder to take the first words of the sequence into account effectively. The idea that comes naturally to solve this problem is to provide all the hidden states of the encoder to the decoder instead of just the last one. However, doing so naively would provide a lot of superfluous information to the decoder, making it difficult to extract the information that is actually useful. This idea is nonetheless the foundation of the attention mechanism.
Indeed, this mechanism is trained to allow the decoder to focus its attention on the relevant parts of the encoder's hidden states. That is, for each word the decoder is about to output, the attention mechanism lets it focus on the hidden states that the encoder produced for the source words relevant to translating that word.
A good attention mechanism will therefore let the Decoder focus on the more important parts of a source sentence.
This has been shown to increase the performance of Encoder-Decoder RNNs but also of other NMT architectures such as CNNs.
Modern NLP models based on Transformers (such as BERT or GPT) directly incorporate attention in their design.

In practice, we used the Bahdanau attention mechanism implemented in the same [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/nmt_with_attention) that we have been following for our implementation.
In terms of implementation, the attention mechanism can be seen as an additional layer in the decoder, placed before the LSTM cell. It takes as input all of the hidden states from the Encoder as well as the previous hidden state from the decoder.
What is open for debate is whether that "hidden state" should be the actual hidden state of the LSTM, its cell state, or both concatenated.
Our understanding of the attention mechanism leads us to believe that using the cell state, which is the long-term memory of the LSTM unit, would only add noise to the state from which the attention mechanism tries to extract relevant information.
Indeed, the attention mechanism is supposed to extract information characteristic of the specific word in the sequence that the hidden state comes from, and the cell state, being the long-term memory, carries only little information about that word. We therefore believed it wiser to provide the hidden state (short-term memory) of the LSTM to the attention mechanism.
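
To make the computation more concrete, here is a minimal sketch of Bahdanau-style (additive) attention written with plain Python lists. It is only an illustration under stated assumptions: the weight matrices `W_enc`, `W_dec` and the vector `v` would normally be trainable layers (as in the TensorFlow tutorial), and the tiny numeric values below are made up.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(encoder_states, decoder_hidden, W_enc, W_dec, v):
    """Additive attention over all encoder hidden states.

    score_s = v . tanh(W_enc h_s + W_dec h_dec)
    weights = softmax(scores)
    context = sum_s weights_s * h_s
    """
    query = matvec(W_dec, decoder_hidden)
    scores = []
    for h_s in encoder_states:
        key = matvec(W_enc, h_s)
        activated = [math.tanh(k + q) for k, q in zip(key, query)]
        scores.append(sum(v_i * a for v_i, a in zip(v, activated)))

    # Numerically stable softmax over the source positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder hidden states.
    size = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(size)
    ]
    return context, weights


# Made-up values: 3 source positions, hidden size 2, 2 attention units.
encoder_states = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5]]
decoder_hidden = [0.2, -0.1]  # hidden state h (not the cell state c)
W_enc = [[0.5, -0.2], [0.1, 0.3]]
W_dec = [[0.3, 0.1], [-0.4, 0.2]]
v = [1.0, -1.0]

context, weights = additive_attention(encoder_states, decoder_hidden, W_enc, W_dec, v)
print(weights, context)
```

In this sketch, `decoder_hidden` plays the role of the LSTM hidden state; passing the cell state or a concatenation of both would only change the input size expected by `W_dec`.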


#### Training
For training, we used the dataset described in section **Dataset** and split it into training and validation sets.
For the actual training, we used the whole dataset which is obtained by setting the parameter _num_examples_ to **None** in the Constants section of the implementation.
Of the _num_examples_ sentences chosen from the dataset, we randomly (but with a fixed random seed) chose 10% of them for validation and dedicated the rest to training.
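
A seeded 90/10 split of this kind could look like the following sketch. The function name, the seed value `42` and the toy pairs are assumptions made for illustration; they are not the values used in the implementation.

```python
import random


def train_validation_split(pairs, validation_fraction=0.1, seed=42):
    """Shuffle with a fixed seed, then hold out a fraction for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy stand-in for the sentence pairs selected by num_examples.
pairs = [(f"source {i}", f"target {i}") for i in range(100)]
train, validation = train_validation_split(pairs)
print(len(train), len(validation))
```

Fixing the seed matters here because every model must be validated on the same sentences for the scores to be comparable.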

As GPU running time is limited on Colab, we chose to train each of our models for 10 epochs.
One epoch of training on the whole dataset lasted between 30 and 65 minutes depending on the Colab session.
Training one model therefore lasts between 5 and 11 hours.

### Problems Encountered & Solutions

In this section we will go over different problems and challenges we have faced during this project and the solutions we thought of and implemented in the different cases.

#### Embedding problems
- **Punctuation** 
    - **Problem**: The pretrained Word2Vec embeddings we found did not include punctuation.
    However, for a translation task, punctuation has meaning and we thus needed a way to represent it to feed to the networks.

    - **Solution**: To solve this problem we simply chose to remove the punctuation from the dataset when _w2v_pretrained(_fr/_en)_ is used.
    We could have fine-tuned the gensim embedding model on our dataset in order for it to learn an embedding for punctuation, but we believed that the first fix would be sufficient for our purposes.
    Additionally, we thought fine-tuning would not be useful since we meant to train our own custom embeddings from scratch in subsequent tests.
    In order to keep the custom and pretrained embeddings comparable, it seemed wise not to fine-tune the pretrained embeddings on the dataset the custom ones are trained on.

- **Out-Of-Vocabulary**
    - **Problem**: All pretrained embeddings faced the out-of-vocabulary issue for different words.
    The punctuation issue mentioned above is actually a specific OOV issue.
    This is an issue because these words cannot be embedded: if the embedder is queried for such an _unknown_ word, the program crashes.

    - **Solution**: To solve this problem we simply replaced each out-of-vocabulary word by a handmade embedding 
    containing only zeros.

    Another option would be, as mentioned for the punctuation problem, to fine-tune
    each embedding model on our custom vocabulary. However, this solution is easier to apply to
    Word2Vec and can become complex for the others, as what we import from the gensim
    downloader are word vectors and not trainable models.
    Moreover, the comparison argument given in the punctuation solution remains valid here as well.

    A third possible solution would have been to remove the sentences containing unknown words from the dataset
    and train the network on a reduced dataset. However, this solution shrinks the dataset and thus limits the
    potential performance of the network.
    Moreover, it is especially ill-suited to the punctuation problem because it would mean removing all sentences
    from the dataset.

- **fastText**
    - **Problem**: As stated previously, we do not have an actual fastText model.
    Theoretically, fastText embeddings should enable us to embed words that were not present in the
    dataset used to train the embedder (therefore solving the out-of-vocabulary problem).
    However, the pretrained embedder obtained with the gensim downloader only provides a dictionary (keyed vectors) mapping words to embeddings.
    This means it cannot actually give an embedding for words it has not encountered.
    <!-- though it's not really a problem for the custom embedder as all the test phrases are in the embedder's vocabulary.-->

    - **Solution**: We could have searched for a way to obtain the model rather than the dictionary.
    Nevertheless, for the sake of simplicity, we chose to treat these OOV words the same way as for the other embeddings, as described in the previous paragraph.
    Note that for the custom fastText embedder, even though we have access to the model itself, we did not implement a way to embed unknown words, although it might have been possible.
    The reason is to avoid needlessly complicating our implementation.
    Indeed, since the model is trained on our dataset, all of the words the dataset contains are known and we thus never face an OOV problem while validating.
    The OOV problem could still appear when translating sentences from outside the dataset, but this special scenario did not justify spending time implementing the above-mentioned solution.


- **French Embeddings**
    - **Problem**: We could not find suitable pretrained French embeddings for either GloVe or fastText.
    This is a problem because French embeddings are required for the decoder to produce an interpretable output.
    Using French embeddings of a different type in place of GloVe or fastText would undermine the validity of our comparisons of the embeddings' performances.

    - **Solution**: We chose to replace those pretrained embeddings by custom embeddings of the same type.
    This makes comparisons of the results of pretrained embeddings with custom ones less reliable but as it was only embeddings in our target language that were replaced, it seemed like the conclusions would still carry some weight.

- **Heterogeneity of pretrained embeddings** 
    - **Problem**: The pretrained embeddings we found are not uniformly trained: some have different dimensions than the others and all have been trained on different corpora and even on corpora with different scales.
    This is a problem only because it slightly undermines the validity of our comparisons of the embeddings' performances.
    <!-- Since the pretrained embeddings were not uniformly trained 
    i.e. some were embedded into different vector dimensions and some were trained on corpora of different scales, 
    the reliability of the conclusions that we could make while comparing the different embedding schemes is 
    diminished as well.  -->
    - **Solution**: This further motivates the use of custom embeddings, which are trained on the exact same
    corpora (we used a fixed random seed to ensure that this is the case) and with the same vector dimension.
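
As an illustration of the out-of-vocabulary fix described in this section, the sketch below embeds a tokenized sentence and falls back to an all-zero vector for unknown words. A plain dictionary stands in for the word vectors, and the helper name `embed_sentence` and the toy values are hypothetical. The `in` test and `[]` lookup used here are also supported by gensim's `KeyedVectors`.

```python
def embed_sentence(tokens, vectors, dim):
    """Look up each token; unknown tokens get an all-zero vector."""
    embedded = []
    for token in tokens:
        if token in vectors:
            embedded.append(list(vectors[token]))
        else:
            embedded.append([0.0] * dim)  # handmade OOV embedding
    return embedded


# Toy 2-dimensional vocabulary standing in for a real embedding table.
toy_vectors = {"the": [0.1, 0.3], "cat": [0.7, -0.2]}

embedded = embed_sentence(["the", "cat", "purrs", "."], toy_vectors, dim=2)
print(embedded)
```

One consequence of this choice is that all unknown words, punctuation included, become indistinguishable to the network.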


#### Other problems

- **RAM saturation**
    - **Problem**: This problem is rather self-explanatory.
    The way we first implemented our training and more specifically the place in the pipeline where we embedded the dataset caused the 12GB of RAM available on Colab to be saturated.
    The reason why this is a problem is that Colab resets the machine when this happens, thus losing the local variables of the environment.

    - **Solution**: At first, we embedded all of the sentences of the dataset up front and passed the resulting huge tensor to the
    encoder and decoder for training. However, this requires loading the whole embedded dataset into RAM as a single block,
    which saturated it on Colab. The solution was straightforward: we embed the dataset one batch at
    a time, directly in the encoder, so that the full embedded dataset never has to reside in RAM at once.
    This makes the computation slightly slower but prevents the Colab virtual environment
    from crashing.

- **Colab Shutting off**
    - **Problem**: To train our models, we chose to work with Google Colab.
    Google Colab generously gives us access to GPUs to do our work.
    Unfortunately, it disconnects when the user has been inactive for a certain amount of time, which is rather impractical considering that a training run lasts at least 5 hours.

    - **Solution**: We therefore run a piece of JavaScript in our browser that clicks Colab's connect button every 60 seconds to make Colab believe we are still active. The code is the following:

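        // Click Colab's connect button so that the session is seen as active.
        function ClickConnect()
        {
            console.log('Working');
            document.querySelector('colab-connect-button').shadowRoot.getElementById('connect').click();
        }
        // Repeat every 60 seconds (60000 ms).
        x = setInterval(ClickConnect, 60000);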
    
    It works pretty well.
    Big thanks to [Shivam Rawat](https://medium.com/@shivamrawat_756/how-to-prevent-google-colab-from-disconnecting-717b88a128c0) for the fix!


- **Length of Training Time**
    - **Problem**: Training a Neural Machine Translation model takes a lot of data and training on a lot of data takes a lot of time.
    In our case, the time it takes to train our models is fundamental as we both have a deadline to submit our work and have training length restrictions on Colab (which is limited to 12 hours).

    - **Solution**: To make this problem tractable, we trained our models for 10 epochs only. Each training run took about 5 hours when we managed to connect to one of Colab's better GPUs
    and, due to Colab timeouts, we had to restart many training sessions. This project serves as a proof of concept:
    if the objective were to actually train a model for a day-to-day application, we would pick a training
    strategy among those tried in this project and use it to train a definitive model for more epochs and
    possibly on more data. A discussion of which one we would choose is given in the section 'Discussion'.

- **Custom embeddings checkpoints**
    - **Problem**: To restore a trained model's state without having to retrain it, we use TensorFlow checkpoints.
    However, these checkpoints behaved unexpectedly with custom embeddings: after restoring, the network could not translate anything, even what it could translate right after training.
    There seems to be a random factor in the training of the embeddings, which makes them different each time they are trained.
    This random factor is either the seed, which we set to a fixed value but which may not actually be honored, or the fact that we use multiple workers, whose tasks are reordered depending on hardware factors such as processor scheduling.
    The consequence is that the network is trained on certain embeddings and, when its checkpoint is restored, the embeddings used to embed the words are different.
    As a result, the network cannot recognize the input words and therefore cannot translate them.

    - **Solution**: If this were just a seed problem, fixing the seed would be enough. However, doing so did not solve the issue, so the seed is likely not the only cause.
    If the problem is the reordering of the workers' tasks, then the solution is to set the number of workers to 1.
    Nevertheless, this did not solve the issue straightforwardly either.
    We thus decided to store the custom embeddings model in a file, in order to be able to reload it later.
    Consequently, when a custom embeddings model is requested, we first check whether a file with the correct name
    exists. If it does, we load it; otherwise, we train a new model and save it to that file.
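
The load-or-train logic for custom embeddings can be sketched as follows. This is a simplified illustration: it uses `pickle` and a toy dictionary as the "model", whereas gensim models provide their own `save` and `load` methods. The names `load_or_train`, `toy_train` and the file name are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_train(path, train_fn):
    """Reload the embeddings stored at `path`, or train and store them."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model


calls = []


def toy_train():
    """Stand-in for the (non-deterministic) training of custom embeddings."""
    calls.append(1)
    return {"hello": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "w2v_custom_en.pkl")
    first = load_or_train(path, toy_train)   # trains and saves
    second = load_or_train(path, toy_train)  # reloads from the file

print(len(calls))
```

Because the second call reads the stored file, the restored network sees exactly the embeddings it was trained with, regardless of any randomness in the embedding training.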


## Results

This section is meant to illustrate and discuss the results obtained by each of the models we trained.
We trained models using _Word2Vec_, _GloVe_ and _FastText_ embeddings.
For each embedding, we trained 3 models based on:
- pretrained embeddings
- our custom embeddings
- our custom embeddings and using the attention mechanism

The section first shows each embedding's loss and score results to compare the pretrained with the custom embeddings.
It then compares the scores of the different embeddings without attention mechanism.
Only then does it compare the custom embeddings with and without the attention mechanism.

Note that to evaluate the performance of a model, we computed the _BLEU_ score (with a smoothing function) of each validation sentence and report the average and standard deviation of those scores over the whole validation set.
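
The evaluation procedure can be sketched as below. This is not the notebook's scoring code: the smoothing used here (add-one smoothing of the 2-gram to 4-gram precisions) is an assumption chosen for illustration, and the three sentence pairs are made up.

```python
import math
import statistics
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n > 1 (an assumption)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if n == 1:
            if matches == 0:
                return 0.0  # no word in common
            precision = matches / total
        else:
            precision = (matches + 1) / (total + 1)
        log_precisions.append(math.log(precision))
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


# Made-up (reference, model output) pairs standing in for the validation set.
validation = [
    ("je suis fatigué .".split(), "je suis fatigué .".split()),
    ("il fait froid .".split(), "elle mange".split()),
    ("nous aimons la musique .".split(), "nous aimons le sport .".split()),
]

scores = [smoothed_bleu(ref, hyp) for ref, hyp in validation]
mean_score = statistics.mean(scores)
std_score = statistics.pstdev(scores)
print(mean_score, std_score)
```

With per-sentence scores that range from a perfect match to no overlap at all, a standard deviation of the same order as the mean is to be expected.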

### Word2Vec

**Losses** - We can see that the losses follow the expected decaying exponential curve, converging to about 0.05.
Both loss curves converge rather similarly.
The custom embeddings seem to start with a lower loss but end up decreasing less than the pretrained embeddings-based model.
This slight difference is inconsequential, however, since the difference between the two curves is within the standard deviation of each.
The conclusion in terms of loss is that both models behave very similarly and that both learned from the dataset.

<img src='https://drive.google.com/uc?export=view&id=1HQBw6XDjR7kGiXMH3ktb1cG6qTzCe82P' width='70%' alt='w2v.png'>

**Scores** - The first thing to notice is that the standard deviation of the BLEU scores is of the same scale as the average score itself, which is very high.
This means that the scores vary greatly from one sentence to another. As a result, our interpretations of the average scores are to be taken with a grain of salt.
The only thing we can be sure of is that our translator can translate some sentences very accurately (most likely those similar to the training sentences) and some terribly.
This suggests that the model overfits the training sentences.

It can be observed that the average BLEU score is better for the pretrained embedding-based model than for the custom one.
One possible explanation is that the pretrained embeddings are more representative of the words than the custom ones, which may be due to the fact that they were trained on a lot more data and very likely for longer.
Another possible explanation is that the pretrained embeddings have a higher dimension than the custom ones.

| Embedding           | BLEU score         | Standard deviation |
|:--------------------|:-------------------|:-------------------|
| Word2Vec pretrained | 0.342048           | 0.310822           |
| Word2Vec custom     | 0.309999           | 0.294159           |

### GloVe

**Losses** - The first thing to notice is that the loss of the pretrained embeddings-based model starts and stays significantly lower than that of the custom one.
Compared to the Word2Vec loss curves, the pretrained GloVe loss curve goes down as much as the Word2Vec curves, while the custom GloVe curve ends at about twice their loss after 10 epochs, i.e. about 0.1.
In short, the custom GloVe model seems to have had trouble learning from the same dataset as the other models.
The reason behind this behavior may be that we chose poor hyperparameters for the training of the GloVe embeddings themselves.
For instance, we used a learning rate of 0.05 and trained the embeddings for 30 epochs on our dataset.
These are standard values used in the [tutorial](https://medium.com/analytics-vidhya/word-vectorization-using-glove-76919685ee0b) we followed and they are not motivated in any way so they may be misguided.
Moreover, we did not extract the evolution of the loss with respect to the epochs, we thus have no way of knowing whether it actually converged or not.

The impact embeddings may have on the loss of the translation task's training is that if the embeddings do not characterize clearly each word, it may be more difficult for the network to interpret them and thus to learn to translate them.
In a word, one could say that the network possibly confuses words more due to uncharacteristic embeddings and it would thus have more trouble learning to translate them.

<img src='https://drive.google.com/uc?export=view&id=1D5-IKXv8K1v507Cjp6qsKUck8Z9D5ViU' width='70%' alt='glove.png'>

**Scores** - The same observation about the standard deviations can be made for the GloVe embedding-based models as for Word2Vec, i.e. they are very high.

The average score is a lot better for the pretrained embedding-based model than for the custom one.
In this case, this was expected, seeing as the loss of the custom model did not decrease as much.
The loss curves showed that the custom model had trouble learning from the dataset, so it is not surprising that the resulting model translates worse.

| Embedding           | BLEU score         | Standard deviation |
|:--------------------|:-------------------|:-------------------|
| GloVe    pretrained | 0.355462           | 0.309589           |
| GloVe    custom     | 0.240517           | 0.267595           |

### FastText

**Losses** - The loss curves of the models based on fastText embeddings have shapes that are very similar to those of Word2Vec based models.
A slight difference might be that the custom and pretrained curves seem to join rather than cross but given more epochs it is likely that they would have crossed the same way as the Word2Vec curves.

Apart from the custom GloVe curve, which may result from a mishandling on our part, both the Word2Vec and the FastText custom curves start off with a lower loss than the pretrained ones.
One hypothesis as to why is that custom embeddings have a lower dimension (100) than pretrained embeddings (~300).
Indeed, a smaller representation of each word makes it easier to learn to interpret them early on but is more limited in terms of nuances.
This tendency is exactly the one we find in the plots, i.e. a custom loss that is lower early on but stabilizes faster (and thus higher) than the pretrained models' losses.

<img src='https://drive.google.com/uc?export=view&id=1sv3SnKVFFDhzjT_a49BLQCZXIy2kDSmD' width='70%' alt='ft.png'>

**Scores** - The standard deviation was very high for Word2Vec and GloVe, and FastText is no exception: the standard deviation is almost as high as the average score itself.

What changes from the other embeddings is that the custom FastText's average score is hardly lower than the pretrained FastText's.
This can be related to the loss curves by remembering that when the training stopped, both loss curves were just meeting each other.
The loss curves (based on the training set) and the scores (based on the validation set) seem to be strongly correlated, which leads us to believe that the models can somewhat generalize from the training set to the validation set.
Indeed, the final loss on the training set is almost the same for both models and so is the average score on the validation set.
The hypothesis that training both models for more epochs would make the loss curves cross can be extended to the average scores:
it is likely that training them for more epochs would increase the difference in score between both models and yield a pretrained model significantly better than the custom one.

| Embedding           | BLEU score         | Standard deviation |
|:--------------------|:-------------------|:-------------------|
| Fasttext pretrained | 0.312947           | 0.296650           |
| Fasttext custom     | 0.306072           | 0.292292           |

### Comparison
It seems like overall, the pretrained embedding-based models tend to have better scores than the custom ones.
Regardless, after 10 epochs, the custom models seem to remain competitive.
However, the loss curves suggest that, given more epochs, the pretrained-based models could significantly outdo their custom counterparts.
The justification for this statement is that the pretrained embeddings have a higher dimension than the custom ones and thus have more potential for interpretation.

The rest of this section serves to compare the performances of the embedding methods with respect to one another.

**Pretrained Losses** - The pretrained losses are more similar to one another than each of them was to its corresponding custom loss.
This further corroborates the statement that there is a correlation between the custom-or-not factor of the embeddings and the models' ability to learn to translate.
As a reminder, we hypothesized that it was mainly due to the dimension of the embedding which is significantly lower for custom than for pretrained embeddings.

<img src='https://drive.google.com/uc?export=view&id=1QQCqFZmD6e1j4ko_6FEt5BzTCwHHwzmp' width='70%' alt='pretrained.png'>

**Custom Losses** - The custom loss curves have the same shape but, as mentioned, the custom GloVe embeddings yielded a significantly worse loss curve than the others.
We believe that this is a result of our inexperience in training embeddings rather than an actual weakness of GloVe itself or the GloVe training framework we used.

<img src='https://drive.google.com/uc?export=view&id=1zPvVKfNI1tMcEl92B5ywPV_4NDVpSaur' width='70%' alt='custom.png'>

**Scores** - The performance of pretrained with respect to custom embeddings has been discussed before but can still be observed in this summary table.

What is newly highlighted by this table however, is the fact that among the pretrained embedding-based models, Word2Vec and GloVe seem to perform better than FastText.
Once more, this can only be said for our specific implementation and dataset as the standard deviations are so high that our results can hardly be generalized to the embedding schemes themselves.
Nevertheless, in our case, it seems like GloVe pretrained performs best and with the lowest standard deviation with respect to its average value.

| Embedding           | BLEU score         | Standard deviation |
|:--------------------|:-------------------|:-------------------|
| Word2Vec pretrained | 0.342048           | 0.310822           |
| Word2Vec custom     | 0.309999           | 0.294159           |
| GloVe    pretrained | 0.355462           | 0.309589           |
| GloVe    custom     | 0.240517           | 0.267595           |
| Fasttext pretrained | 0.312947           | 0.296650           |
| Fasttext custom     | 0.306072           | 0.292292           |
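
The claim about the standard deviation relative to the average can be checked directly from the table. The sketch below is only an illustration: the numbers are copied from the table above, and the variable names (`scores`, `relative_spread`, `most_stable`) are ours, not part of the notebook's implementation.

```python
# (average BLEU, standard deviation) per model, copied from the table above.
scores = {
    "Word2Vec pretrained": (0.342048, 0.310822),
    "Word2Vec custom":     (0.309999, 0.294159),
    "GloVe pretrained":    (0.355462, 0.309589),
    "GloVe custom":        (0.240517, 0.267595),
    "Fasttext pretrained": (0.312947, 0.296650),
    "Fasttext custom":     (0.306072, 0.292292),
}

# Relative spread: standard deviation divided by the mean
# (also known as the coefficient of variation).
relative_spread = {name: std / mean for name, (mean, std) in scores.items()}

# List the models from the smallest to the largest relative spread.
for name, ratio in sorted(relative_spread.items(), key=lambda item: item[1]):
    print(f"{name:<20} {ratio:.3f}")

most_stable = min(relative_spread, key=relative_spread.get)
print("Smallest relative spread:", most_stable)
```

All ratios stay close to or above 0.87, which is another way of seeing that the spread of the sentence scores is of the same order as their average.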

### Attention
In this section, we compare the performances of our models based on each of our custom embeddings with and without the use of Bahdanau's attention mechanism.
To that end, we look at both the evolution of the loss and the BLEU scores they obtain.

**Losses** - The Word2Vec curves being very similar, we can infer that the model learns as easily to translate the training dataset whether the attention mechanism is used or not.
The same can be said of the FastText curves. However, the GloVe loss curves are dissimilar.
It was noticed previously that the GloVe custom loss curve was already higher than all of the other loss curves, which led us to believe that it was due to a misparametrization of the embeddings' training on our part.
Since the GloVe model with attention is also based on these very same embeddings, it was expected that it would also have trouble learning.
However, we can notice that the attention mechanism seems to have helped the model learn, as it brought the loss curve back to a trend similar to that of the other models.

<img src='https://drive.google.com/uc?export=view&id=11kRLMFz6F6Zd49kuERtOQx1Hkk0jVdFa' width='70%' alt='w2v_attention.png'>

<img src='https://drive.google.com/uc?export=view&id=12yfGOLvSL9DxhQNiNPWNbB3_QWixdKnU' width='70%' alt='glove_attention.png'>

<img src='https://drive.google.com/uc?export=view&id=1wvwD1nVmyoo1pNLAc3UneygZx73lGzi1' width='70%' alt='ft_attention.png'>


**Scores** - It can be seen that regardless of the type of embedding, using the attention mechanism systematically improves the BLEU score.
We can safely assume that the attention mechanism is indeed very useful to increase the performance of the models.
Additionally, it improves the BLEU score at barely any cost in computation. Indeed, each epoch took only slightly longer with the attention mechanism than without and the loss evolution is almost identical, which shows that the attention mechanism does not make it more difficult for the network to learn.

An interesting observation is that, despite the seemingly poor base that the custom GloVe embeddings represent, with an average score of a mere 0.24, adding the attention mechanism on top of them made their performance competitive with the other embeddings.
It made the average score go up by 0.10, which is more than the increases of roughly 0.05 and 0.06 that it brought to Word2Vec and FastText respectively.

One additional thing to notice is that the attention mechanism increased the average score of each model considerably while increasing the standard deviation of said scores less than proportionately.
The standard deviations remain very high regardless but it is still an improvement.

| Embedding           | BLEU score         | Standard deviation |
|:--------------------|:-------------------|:-------------------|
| Word2Vec            | 0.309999           | 0.294159           |
| Word2Vec attention  | 0.357615           | 0.312432           |
| GloVe               | 0.240517           | 0.267595           |
| GloVe    attention  | 0.340590           | 0.307263           |
| Fasttext            | 0.306072           | 0.292292           |
| Fasttext attention  | 0.363475           | 0.316606           |
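
The gains discussed above can be recomputed from the table. This is a small sketch using only the table's values; the dictionary layout and the names `results` and `gains` are illustrative choices of ours.

```python
# (average BLEU, standard deviation) without and with attention,
# copied from the table above.
results = {
    "Word2Vec": {"base": (0.309999, 0.294159), "attention": (0.357615, 0.312432)},
    "GloVe":    {"base": (0.240517, 0.267595), "attention": (0.340590, 0.307263)},
    "Fasttext": {"base": (0.306072, 0.292292), "attention": (0.363475, 0.316606)},
}

gains = {}
for name, r in results.items():
    (base_mean, base_std), (att_mean, att_std) = r["base"], r["attention"]
    gains[name] = {
        "absolute_gain": att_mean - base_mean,  # increase of the average score
        "mean_growth": att_mean / base_mean,    # multiplicative growth of the average
        "std_growth": att_std / base_std,       # multiplicative growth of the deviation
    }
    g = gains[name]
    print(f"{name:<9} gain={g['absolute_gain']:.3f} "
          f"mean x{g['mean_growth']:.3f} std x{g['std_growth']:.3f}")
```

For every embedding the average grows by a larger factor than the standard deviation, which is what "less than proportionately" refers to.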

## Discussion
**BLEU Standard Deviation** - As previously mentioned, the BLEU scores have a very large standard deviation which means that some sentences are very well translated as per the BLEU standard, while some are very wrong.
We came up with three interpretations of the cause of this behavior.

First, it could be due to an overfitting of our model on the training set, leading to good translations of sentences that are very close to the training set's.

Another interpretation is that it could be due to the inadequacy of the BLEU metric for evaluating short sentences.
Indeed, it is best used on very long sentences or on whole text corpora, but in our case we averaged the scores of individual (relatively short) sentences.
It thus becomes expected that the standard deviation would be high.
Note that the [article by Rachael Tatman](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213) states that averaging the scores of each sentence in a corpus rather than applying the metric to the full corpus is going to "artificially inflate the score", which is obviously frowned upon by the scientific community.
Still, we stick to our averaging for simplicity and it is not too much of a problem as we only compare our results with respect to one another rather than comparing them to others from the literature.
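
To make the difference between the two aggregations concrete, here is a deliberately simplified, library-free sketch. It is not the NLTK implementation used in this notebook: it only uses unigrams and bigrams (instead of n=4), applies no smoothing, and the function names and the two toy sentence pairs are made up for illustration.

```python
import math
import statistics
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(hypothesis, reference, n):
    """Return (clipped n-gram matches, number of hypothesis n-grams)."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matches, sum(hyp.values())

def bleu_from_stats(stats, hyp_len, ref_len):
    """Geometric mean of the n-gram precisions times the brevity penalty."""
    if any(matches == 0 or total == 0 for matches, total in stats):
        return 0.0  # no smoothing: a single missing n-gram order gives 0
    log_precision = sum(math.log(m / t) for m, t in stats) / len(stats)
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

def sentence_score(hypothesis, reference, max_n=2):
    stats = [clipped_matches(hypothesis, reference, n) for n in range(1, max_n + 1)]
    return bleu_from_stats(stats, len(hypothesis), len(reference))

def corpus_score(pairs, max_n=2):
    # Pool the n-gram statistics of all sentences before computing precisions.
    stats = []
    for n in range(1, max_n + 1):
        per_sentence = [clipped_matches(h, r, n) for h, r in pairs]
        stats.append((sum(m for m, _ in per_sentence),
                      sum(t for _, t in per_sentence)))
    hyp_len = sum(len(h) for h, _ in pairs)
    ref_len = sum(len(r) for _, r in pairs)
    return bleu_from_stats(stats, hyp_len, ref_len)

# Hypothetical (hypothesis, reference) pairs of tokenized sentences.
pairs = [
    ("je suis fatigue .".split(), "je suis fatigue .".split()),
    ("il est la .".split(), "elle est ici .".split()),
]

sentence_scores = [sentence_score(h, r) for h, r in pairs]
corpus_level = corpus_score(pairs)
print("per-sentence scores:", sentence_scores)
print("average / std:", statistics.mean(sentence_scores),
      statistics.pstdev(sentence_scores))
print("corpus-level score:", round(corpus_level, 3))
```

On short sentences, a translation with no matching higher-order n-gram scores 0 while an exact match scores 1, so per-sentence scores are spread out and the two aggregations need not agree.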

Lastly, the dataset we used sometimes contains the same English sentence several times with different French translations.
Thus, if this English sentence is in both training and validation sets, then the model will be taught to translate this sentence into the French sentence from the training set.
However, when evaluating it, the score will expect the validation set French translation and it will receive a bad score because the other French translation is different.
This characterizes a well-known limitation of the BLEU score, i.e. that it cannot consider meaning nor sentence structure.
Nevertheless, for this specific case, we could have done better and _how_ is explained in the **Improvements** section.

**Best Embeddings** - As stated in the results comparison, it seems like the pretrained embeddings performed best and among them, GloVe had the highest average score and the lowest standard deviation.
Nevertheless, the deviations are such that this result has only very little reliability.
It is known in the literature that FastText is supposedly better than the two others because it is the only one that can handle rare (unknown) words. 
However, as mentioned previously, we did not use this feature due to the format of the pretrained embeddings and due to the fact that our custom embeddings were never faced with unknown words.
Purely in terms of performance, fastText seems worse than the other two in our case but that could either be purely random (since the deviations are so high) or due to a potential inadequate training of the custom embeddings. 
We cannot emphasize enough that the conclusions we make here are only moderately reliable because of the high standard deviations.

**Loss and Score** - One very interesting thing we noticed while comparing the loss curves with the scores is that they are strongly correlated.
Indeed, when the loss at the 10th epoch of a custom embedding is close to a pretrained embedding's, their scores are close and when the former are far from one another, the scores are as well.
This is interesting because the loss is based on the training set while the score is based on the validation set.
Consequently, the correlation between the two suggests that the models can generalize their knowledge of the training set to the validation set.

**Time of MT training** - One thing that has grown clearer with each passing hour of work on this project is that the task of Machine Translation and more specifically Neural MT requires a lot of data to train and thus a lot of time.
With 170,000 sentences, we initially thought our full dataset was rather large, but it became clear that training a competitive translator would require a lot more than that.
Despite that, training on our 170,000 sentences for a mere 10 epochs already took about 5 hours at best.
As a result, to train a competitive translator, one would need much more data but, more importantly, a lot more time.
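
As a rough back-of-the-envelope reading of these figures, the sketch below derives the time per epoch and per training step. It assumes that the run used the notebook's default `test_size = 0.1` and `batch_size = 64`, which is an assumption on our part; the dataset size and duration are the approximate values quoted above.

```python
# Approximate figures quoted in the text above.
num_sentences = 170_000
total_hours = 5
epochs = 10

# Assumed settings: the default constants declared in this notebook.
test_size = 0.1
batch_size = 64

train_sentences = round(num_sentences * (1 - test_size))
steps_per_epoch = train_sentences // batch_size
seconds_per_epoch = total_hours * 3600 / epochs
seconds_per_step = seconds_per_epoch / steps_per_epoch

print(f"{train_sentences} training sentences, {steps_per_epoch} steps per epoch")
print(f"{seconds_per_epoch / 60:.0f} min per epoch, {seconds_per_step:.2f} s per step")
```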

**Best Model** - Overviewing our results, we can say that the pretrained embeddings had the best results and that they could probably be even better if given more epochs of training.
We also said that it was probably due to the higher dimension of those embeddings.
Additionally, we established that the attention mechanism improves greatly the models regardless of the type of embeddings.
We can thus hope that the best possible model we could train would be:

    A custom embedding model trained on data at the scale of those used for the pretrained embeddings and trained to higher dimensions (300 for instance). It would be trained for more epochs and on a larger dataset of sentences using the attention mechanism.

We suggest custom embeddings, even though our best results were obtained with pretrained ones, because custom embeddings enable the model to exploit the full dataset it learns on (no unknown words, and even punctuation is covered).
That way we can take the best of both worlds i.e. the higher dimension of the pretrained embeddings and the exhaustiveness (on the dataset) of the custom ones.

Strictly limiting ourselves to the models we trained as we trained them, the best one seemed to be **FastText with attention** though it also suffers from having the largest standard deviation of all.
Moreover, concluding this way without repeating that the high standard deviations are such that any model (with attention) could have come on top in the end would be careless.

## Improvements
- In all the tests we made, the encoder and decoder models were the same, i.e. with an _LSTM_ layer of 1000 units.
  This provided a basis to compare the different embeddings. A way to improve the scores obtained would be to **increase the complexity of the encoder and the decoder**, i.e. adding more _LSTM_ layers or more units in a layer.
- When using pretrained embeddings, we did not always find French embeddings of **dimensions corresponding** to their English counterpart.
For instance, pretrained Word2Vec embeddings had a dimension of 300 for English and 200 for French.
For the others it is even worse: as we did not find pretrained French embeddings, we used 100-dimensional custom French embeddings.
Somehow finding French embeddings of the same dimension as the English ones would result in more consistency in the networks' learning.
It is not a fundamental problem that the dimensions are different but it would be interesting to see the impact it actually has on performance.
- As mentioned previously, we chose to train our custom embeddings to a dimension of 100 due to the limitation of the data we had to train it on.
To reinforce the reliability of our comparison of pretrained and custom embeddings, it would have been better to **train our custom embeddings on a quantity of data at the same scale as that used for the pretrained** ones.
This would have allowed us to confidently train our custom embeddings to a higher dimension as well (e.g. 300 like the pretrained English embeddings), thus killing two birds with one stone.
- Previously, we mentioned that we did not look for a way to **fine-tune the pretrained embeddings on our own dataset** for the sake of comparing custom and pretrained embeddings.
However, given more time, it could have been interesting to train additional fine-tuned models to compare them to just-pretrained models and thus establish whether fine-tuning improves them, worsens them or does not change anything.
- In terms of **validation**, we established that BLEU does not consider meaning nor sentence structure, is more suited to evaluating a corpus as a whole rather than individual sentences and is better when using more than one reference sentence for each translation evaluation.
    The scope of improvements on this matter is thus large.
    - We could have simply searched for **other means of evaluating** translations and used them side by side.
    - We could have computed BLEU scores while considering **different n-grams** and compared them. (We use the standard n=4)
    - We could have computed the **BLEU scores at each epoch** to observe their evolution.
    - We could have made a **corpus with our sentences** and computed the BLEU score of the corpus rather than sentence per sentence.
    - We could have exploited the repetition of English sentences from our dataset.
    Indeed, as mentioned, our dataset contains English sentences that correspond to different French translations.
    Consequently, we could have gone through our whole dataset and, for each sentence present in both the training set and the validation set, added the training French translation to the validation set in order to have an evaluation set containing all of our known possible translations for each English sentence.
    We could have thus **evaluated the BLEU score with all our known translations as references** rather than a single one for each sentence.

- Regarding the **attention mechanism**, we barely scratched the surface of what was possible.
    Should we explore this mechanism more in depth, there are a few things we would first try.
    - We could explore other attention mechanisms than _Bahdanau_ to compare them e.g. _Luong_'s global and/or local attention mechanisms.
    - We could train models where the attention mechanism's context vector is made with the encoder's _cell state_, _hidden state_ and a concatenation of both to compare and discuss their performances.
    - We could train the pretrained embedding-based models with the attention as well to see if we can get even higher scores.
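
The multi-reference evaluation proposed in the validation item above could be prepared along the following lines. This is a minimal sketch, not something we ran: `collect_references` is a hypothetical helper, and the toy sentence pairs stand in for the preprocessed dataset.

```python
from collections import defaultdict

def collect_references(train_pairs, val_pairs):
    """Map each validation English sentence to all of its known French translations."""
    references = defaultdict(set)
    for english, french in list(train_pairs) + list(val_pairs):
        references[english].add(french)
    # Keep only the sentences that are actually evaluated.
    return {english: sorted(references[english]) for english, _ in val_pairs}

# Hypothetical (English, French) pairs standing in for the real sets.
train_pairs = [("i am tired .", "je suis fatigue ."), ("go .", "va !")]
val_pairs = [("i am tired .", "je suis epuise ."), ("run !", "cours !")]

multi_refs = collect_references(train_pairs, val_pairs)
for english, french_refs in multi_refs.items():
    print(english, "->", french_refs)
```

Since NLTK's `sentence_bleu` takes a list of tokenized references as its first argument, each list of translations could be passed to it after splitting the sentences into words.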

## Sources
- [TensorFlow tutorial : Neural machine translation with attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention)
- [Datasets of bilingual sentence pairs : manythings.org](http://www.manythings.org/anki/)
- [Pretrained Embedding vectors : Gensim data](https://github.com/RaRe-Technologies/gensim-data)
- [J-P Fauconnier : French word2vec embeddings](https://fauconnier.github.io)
- [Medium tutorial : Word vectorization using GloVe](https://medium.com/analytics-vidhya/word-vectorization-using-glove-76919685ee0b)
- [Radim Rehurek tutorial : Fasttext](https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#sphx-glr-auto-examples-tutorials-run-fasttext-py)
- [NLTK documentation : _translate.bleu\_score_](https://www.nltk.org/_modules/nltk/translate/bleu_score.html)
- [Towards Data Science article : Evaluating text output in NLP BLEU at your own risk](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213)
- [Towards Data Science article : Sequence 2 sequence model with Attention Mechanism](https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a)


# Part 2 - Implementation
1. Data extraction and parsing
2. Word embeddings
    1. Word2Vec
    2. Glove
    3. FastText
3. Encoder implementation
4. Decoder implementation
5. Training
6. Evaluating

## Imports

In [None]:
import tensorflow as tf

from sklearn.model_selection import train_test_split

import gensim.downloader as api
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

import nltk
from nltk.translate.bleu_score import sentence_bleu as bleu
from nltk.translate.bleu_score import SmoothingFunction

import unicodedata
import re
import numpy as np
import os
import io
import time
import warnings
import statistics

#### Disable Deprecation Warnings
The `gensim` package emits some warnings we do not want to see.

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Constants
This section declares all the constants needed to configure this notebook.
Note that `num_examples` has been set to `1000`, such that the code can run quickly on a small portion of the dataset. To train on the whole dataset, please set it to `None`.

In [None]:
# General constants
drive = False                               # Set to 'True' to store the results on Google Drive
restore = False                             # Set to 'True' if the model must be restored
train = True                                # Set to 'True' if the model must be trained

results_folder = 'results/'                 # Where to store the results
checkpoint_dir = 'training_checkpoints/'    # Where to store the weights
embeddings_dir = 'custom_embeddings/'       # Where to store the custom embeddings model

# Dataset
num_examples = 1000                         # Num of sentences taken from the dataset ('None' for all)
test_size = 0.1                             # Ratio of sentences used for evaluation
split_seed = 42                             # Seed used for splitting the dataset

# Embeddings
emb_dims = 100                              # Number of dimensions used for the custom embeddings
sos_seed = 42                               # Seed used for generating the 'sos' symbol's embeddings
eos_seed = 66                               # Seed used for generating the 'eos' symbol's embeddings

embed_name_en = 'w2v_custom'                # Embedding model used for the English sentences
embed_name_fr = 'w2v_custom'                # Embedding model used for the French sentences
# Possible values:
# Only English:   w2v_pretrained, glove_pretrained, ft_pretrained
# Only French:    w2v_fr_pretrained
# Both:           w2v_custom, glove_custom, ft_custom


# Encoder - Decoder
units = 1000                                # Number of units of the LSTM layer
attention = False                           # Set to 'True' for using the attention mechanism

# Training
batch_size = 64                             # Number of sentences in each batch
epochs = 10                                 # Number of epochs

# Translating
max_length_translation = 100                # Maximum length of a translation

In [None]:
if drive is True:
    from google.colab import drive
    drive.mount('/content/drive')

    results_folder = 'drive/My Drive/Colab Notebooks/' + results_folder
    checkpoint_dir = 'drive/My Drive/Colab Notebooks/' + checkpoint_dir
    embeddings_dir = 'drive/My Drive/Colab Notebooks/' + embeddings_dir

## Data extraction and parsing
This section aims at downloading, extracting and preprocessing the sentences of the dataset.

#### Dataset extraction

In [None]:
path_to_zip = tf.keras.utils.get_file(
    'fra-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/fra-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip) + "/fra.txt"

#### Dataset parsing
All the characters are turned into ASCII characters and the sentences are preprocessed as follows:
- All the accents are removed,
- If we are using a pretrained Word2Vec model, then we only keep the alphabetical characters and
  everything else is transformed into a space,
- Otherwise, we keep the alphabetical characters as well as the `.`, `?` and `!` punctuation characters.
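
As an isolated illustration of the punctuation-preserving case, the sketch below condenses the corresponding steps into one standalone function. The names `strip_accents` and `preprocess_keep_punctuation` and the example sentence are ours; the regular expressions are the ones used in the cell that follows. With the character class as written, commas are first isolated and then dropped.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose accented characters and drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

def preprocess_keep_punctuation(sentence):
    s = strip_accents(sentence.lower().strip())
    # Put spaces around punctuation marks so that they become separate tokens.
    s = re.sub(r"([?.!,])", r" \1 ", s)
    # Replace every run of characters other than letters, '?', '.' and '!' by a space.
    s = re.sub(r"[^a-zA-Z?.!]+", " ", s)
    return '<start> ' + s.strip() + ' <end>'

example = preprocess_keep_punctuation("Ça va, très bien !")
print(example)
```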

In [None]:
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())

    if embed_name_en == "w2v_pretrained" or embed_name_fr == "w2v_fr_pretrained":
        # removing multiple spaces
        w = re.sub(r'[" "]+', " ", w)

        # replacing everything with space except (a-z, A-Z)
        w = re.sub(r"[^a-zA-Z]+", " ", w)
    else:

        # adding space between punctuation and characters
        w = re.sub(r"([?.!,])", r" \1 ", w)

        # removing multiple spaces
        w = re.sub(r'[" "]+', " ", w)

        # replacing everything with space except (a-z, A-Z, ".", "?", "!");
        # note that "," is not in the kept set, so commas are dropped here
        w = re.sub(r"[^a-zA-Z?.!]+", " ", w)

    w = w.strip()

    # adding the start and end tokens to the sentence
    w = '<start> ' + w + ' <end>'
    return w

In [None]:
# Return word pairs in the format: [ENGLISH, FRENCH].
def create_dataset(path, num_examples):
    # read the file with a context manager so that it is closed afterwards
    with io.open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')

    word_pairs = [[preprocess_sentence(w) for w in line.split('\t')] for line in lines[:num_examples]]

    return zip(*word_pairs)

At this point, all the sentences have been preprocessed. But in order to interact with the seq2seq model
we have to do two more things:
- The encoder and the decoder take embeddings as inputs, so two embedding models must be created.
  This will be done in the next section.
- To be able to train the models, the target sentences must be tokenized, i.e. turned into arrays of
  numbers, where each number represents a word. This is done next.
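
As a library-free illustration of what this tokenization and padding amount to, consider the sketch below. It is not the Keras `Tokenizer` used in the next cell; `build_word_index` and `tokenize_and_pad` are hypothetical helpers, and the way ids are assigned here (most frequent words first, starting at 1) is an illustrative choice.

```python
from collections import Counter

def build_word_index(sentences):
    """Give each word an integer id starting at 1; 0 is kept for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def tokenize_and_pad(sentences, word_index):
    sequences = [[word_index[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in sequences)
    # 'post' padding: zeros are appended at the end of the shorter sequences.
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

sentences = ["<start> va ! <end>", "<start> je suis la . <end>"]
word_index = build_word_index(sentences)
padded = tokenize_and_pad(sentences, word_index)

print(word_index)
print(padded)
```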

In [None]:
# Tokenize all the sentences contained in lang and pad them.
def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    lang_tokenizer.fit_on_texts(lang)

    # tokenize the sentences
    tensor = lang_tokenizer.texts_to_sequences(lang)

    # pad the tokenized sentences with '0' so that they always have the same length
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

    # return the tokenized sentences as well as the tokenizer which contains the mapping between the
    # words and their related number
    return tensor, lang_tokenizer

In [None]:
# Pad (non-tokenized) sentences with empty strings so that they always have the same length.
def pad_sentences(lang):
    max_size = len(max(lang, key=len))

    for sentence in lang:
        # extend in place with empty strings up to the length of the longest sentence
        sentence.extend([''] * (max_size - len(sentence)))

    return lang

#### Dataset loading
Everything related to the dataset has been implemented, it can thus be loaded into some variables.

In [None]:
def load_dataset(path, num_examples=None):
    # load the preprocessed sentences
    input_lang, target_lang = create_dataset(path, num_examples)

    # tokenize the target sentences and store them in a tensor
    target_tensor, target_lang_tokenizer = tokenize(target_lang)

    # split the sentences into list of words
    input_lang = [[w for w in s.split()] for s in input_lang]
    target_lang = [[w for w in s.split()] for s in target_lang]

    # pad the split sentences to always have the same length
    input_lang = pad_sentences(input_lang)
    target_lang = pad_sentences(target_lang)

    return input_lang, target_lang, target_tensor, target_lang_tokenizer

In [None]:
# Load dataset
input_lang, target_lang, \
target_tensor, target_lang_tokenizer \
    = load_dataset(path_to_file, num_examples)

In [None]:
# Creating training and validation sets
input_lang_train, input_lang_val, \
target_lang_train, target_lang_val, \
target_tensor_train, target_tensor_val = \
    train_test_split(input_lang, target_lang,target_tensor,
                     test_size=test_size, random_state=split_seed)


A TensorFlow dataset is created from the preprocessed sentences.

In [None]:
buffer_size = len(input_lang_train)
steps_per_epoch = len(input_lang_train) // batch_size
vocab_target_size = len(target_lang_tokenizer.word_index) + 1

dataset = tf.data.Dataset.from_tensor_slices(
    (input_lang_train, target_lang_train, target_tensor_train)).shuffle(buffer_size)

dataset = dataset.batch(batch_size, drop_remainder=True)


## Word Embeddings
This section defines several functions that create the embedding models.
There are three main types of embeddings: Word2Vec, GloVe and Fasttext. For each one, it is possible to
use a pretrained model or to train a new one on our dataset. In the case of Word2Vec, it is possible to
have a pretrained model in both languages, while in GloVe and Fasttext, the pretrained models are only
available in English.

### Word2Vec

#### Pre-Trained - English

In [None]:
def w2v_pretrained_create(sentences, embed_file):
    return api.load('word2vec-google-news-300')


#### Pre-Trained - French

In [None]:
def w2v_fr_pretrained_create(sentences, embed_file):
    url = 'http://embeddings.net/embeddings/frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin'
    path_to_bin = tf.keras.utils.get_file(
        'frWac_non_lem_no_postag_no_phrase_200_cbow_cut100',
        origin=url,
        extract=False)

    return KeyedVectors.load_word2vec_format(path_to_bin, binary=True)

#### Custom

In [None]:
def w2v_custom_create(sentences, embed_file):

    if os.path.exists(embed_file):
        print('Previous adequate custom embeddings found, loading model from file:')
        print(embed_file)
        return KeyedVectors.load(embed_file)

    model = Word2Vec(sentences, size=emb_dims, window=10, min_count=0, workers=4).wv

    if not os.path.exists(embeddings_dir):
        os.makedirs(embeddings_dir)

    model.save(embed_file)
    return model

### GloVe

#### Pre-Trained

In [None]:
def glove_pretrained_create(sentences, embed_file):
    return api.load('glove-wiki-gigaword-300')

#### Custom

In [None]:
def glove_custom_create(sentences, embed_file):
    !pip install glove_python
    from glove import Corpus, Glove

    if os.path.exists(embed_file):
        print('Previous adequate custom embeddings found, loading model from file:')
        print(embed_file)
        return Glove.load(embed_file)

    corpus = Corpus() 
    corpus.fit(sentences, window=10)
    glove = Glove(no_components=emb_dims, learning_rate=0.05)
    
    glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=False)

    glove.wv = dict()
    for w in corpus.dictionary:
        glove.wv[w] = glove.word_vectors[corpus.dictionary[w]]

    glove.vector_size = emb_dims

    if not os.path.exists(embeddings_dir):
        os.makedirs(embeddings_dir)

    glove.save(embed_file)

    return glove

### FastText

#### Pre-Trained

In [None]:
def ft_pretrained_create(sentences, embed_file):
    return api.load('fasttext-wiki-news-subwords-300')

#### Custom

In [None]:
def ft_custom_create(sentences, embed_file):
    if os.path.exists(embed_file):
        print('Previous adequate custom embeddings found, loading model from file:')
        print(embed_file)
        return KeyedVectors.load(embed_file)

    model = FastText(sentences, size=emb_dims, window=10, min_count=0, workers=4).wv

    if not os.path.exists(embeddings_dir):
        os.makedirs(embeddings_dir)

    model.save(embed_file)
    return model

Binding the embedding names to the correct creation function

In [None]:
create_embeddings = dict()

create_embeddings['w2v_pretrained'] = w2v_pretrained_create
create_embeddings['w2v_fr_pretrained'] = w2v_fr_pretrained_create
create_embeddings['w2v_custom'] = w2v_custom_create

create_embeddings['glove_pretrained'] = glove_pretrained_create
create_embeddings['glove_custom'] = glove_custom_create

create_embeddings['ft_pretrained'] = ft_pretrained_create
create_embeddings['ft_custom'] = ft_custom_create

### Embedding Functions

In [None]:
# Return the embeddings of the 'sos' symbol, with the same dimensions as 'model'.
def get_sos(model):
    return tf.random.stateless_normal(
        (model.vector_size,), mean=0.0, stddev=0.00001, seed=(sos_seed, 1))

# Return the embeddings of the 'eos' symbol, with the same dimensions as 'model'.
def get_eos(model):
    return tf.random.stateless_normal(
        (model.vector_size,), mean=0.0, stddev=0.00001, seed=(eos_seed, 1))

# Return the list of embeddings of a sentence.
def embed(sentence, model):
    embedded = []
    for w in sentence:
        # w has been added for padding, its embeddings are all 0's
        if w == '':
            embedded.append(tf.zeros(model.vector_size))

        # w is the sos
        elif w == '<start>':
            embedded.append(get_sos(model))

        # w is the eos
        elif w == '<end>':
            embedded.append(get_eos(model))

        # w is not known by the embedder
        elif w not in model.wv:
            embedded.append(tf.zeros(model.vector_size))

        # normal case where w is a word known by the embedder
        else:
            embedded.append(model.wv[w])

    return embedded

# Return the embeddings of a list of sentences.
def embed_all_sentences(lang, model):
    return tf.convert_to_tensor([embed(s, model) for s in lang])

In [None]:
embed_file_en = embeddings_dir + embed_name_en + '_en_' \
              + str(emb_dims) + '_' + str(num_examples) + '.model'

embed_file_fr = embeddings_dir + embed_name_fr + '_fr_' \
              + str(emb_dims) + '_' + str(num_examples) + '.model'

# Create the chosen embedders
embedder_en = create_embeddings[embed_name_en](input_lang, embed_file_en)
embedder_fr = create_embeddings[embed_name_fr](target_lang, embed_file_fr)

## Model
This section shows how the encoder and the decoder are defined. Note that their architecture remains
quite simple, with only one LSTM layer in each model, and a fully connected layer in the decoder to
output the word tokens. If the attention mechanism has been enabled, the decoder has one more layer,
the attention layer. This layer is a custom one, implemented as in the TensorFlow
tutorial on which this notebook is based.
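
To show what such an additive (Bahdanau-style) attention layer computes, here is a NumPy sketch for a single sentence. It is only an illustration under simplifying assumptions: there is no batch dimension, the bias terms of the dense layers are omitted, and the weights `W1`, `W2` and `v` are random instead of learned. All names and sizes are ours.

```python
import numpy as np

def additive_attention(query, values, W1, W2, v):
    """Additive attention for one sentence.

    query:  current decoder hidden state, shape (units,)
    values: encoder outputs, shape (timesteps, units)
    """
    # score_t = v . tanh(W1 query + W2 values_t): one scalar per encoder time step
    scores = np.tanh(query @ W1 + values @ W2) @ v      # shape (timesteps,)
    # softmax over the time axis: positive weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # the context vector is the weighted sum of the encoder outputs
    context = weights @ values                          # shape (units,)
    return context, weights

rng = np.random.default_rng(0)
timesteps, units, attn_units = 5, 4, 3
query = rng.normal(size=units)
values = rng.normal(size=(timesteps, units))
W1 = rng.normal(size=(units, attn_units))
W2 = rng.normal(size=(units, attn_units))
v = rng.normal(size=attn_units)

context, weights = additive_attention(query, values, W1, W2, v)
print("attention weights:", np.round(weights, 3))
print("context vector shape:", context.shape)
```

The context vector has the same size as one encoder output, whatever the length of the input sentence.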

### Encoder

In [None]:
class Encoder(tf.keras.Model):
    def __init__(self, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        
        # LSTM layer
        self.rnn = tf.keras.layers.LSTM(self.enc_units,
                                        return_sequences=True,
                                        return_state=True,
                                        recurrent_initializer='glorot_uniform')        

    # Returns the output and the two states of the LSTM layer.
    def call(self, x, hidden):
        output, hidden_state, cell_state = self.rnn(x, initial_state=hidden)

        return output, (hidden_state, cell_state)

    # Returns the initial hidden and cell states of the LSTM layer, i.e. tensors with only 0's.
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units))

In [None]:
encoder = Encoder(units, batch_size)

### Attention Mechanism

In [None]:
if attention is True:
    class BahdanauAttention(tf.keras.layers.Layer):
        def __init__(self, units):
            super(BahdanauAttention, self).__init__()

            self.W1 = tf.keras.layers.Dense(units)
            self.W2 = tf.keras.layers.Dense(units)

            self.V = tf.keras.layers.Dense(1)

        # Compute the context vector from the current hidden state of the decoder ('query') and
        # the outputs of the encoder ('values').
        def call(self, query, values):
            query_with_time_axis = tf.expand_dims(query, 1)

            score = self.V(tf.nn.tanh(
                self.W1(query_with_time_axis) + self.W2(values)))

            attention_weights = tf.nn.softmax(score, axis=1)

            context_vector = attention_weights * values
            context_vector = tf.reduce_sum(context_vector, axis=1)

            return context_vector
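
The computation performed by the attention layer can be followed step by step on a single example. The sketch below is a simplified, pure-Python version written for illustration only: it handles one sentence instead of a batch, omits the bias terms of the `Dense` layers, and uses arbitrary toy parameters rather than trained weights. The helper names (`matvec`, `additive_attention`) are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(query, values, W1, W2, v):
    """Bahdanau-style (additive) attention for one example.

    query:  decoder hidden state, length d
    values: encoder outputs, one vector of length d per source position
    W1, W2: projection matrices (u rows of length d); v: vector of length u
    """
    projected_query = matvec(W1, query)

    # One unnormalised score per source position.
    scores = []
    for h in values:
        projected_value = matvec(W2, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_query, projected_value)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over the source positions (the time axis).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(values[0])
    context = [sum(w * h[k] for w, h in zip(weights, values)) for k in range(dim)]
    return context, weights


# Arbitrary toy parameters (not trained weights).
W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[0.4, 0.0], [-0.3, 0.2]]
v = [1.0, -1.0]
query = [0.2, 0.7]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

context, weights = additive_attention(query, values, W1, W2, v)
```

The weights sum to 1 over the source positions, so the context vector is a convex combination of the encoder outputs.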

### Decoder

In [None]:
class Decoder(tf.keras.Model):
    def __init__(self, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units

        # attention layer
        if attention is True:
            self.attention = BahdanauAttention(self.dec_units)

        # LSTM layer
        self.rnn = tf.keras.layers.LSTM(self.dec_units,
                                        return_sequences=True,
                                        return_state=True,
                                        recurrent_initializer='glorot_uniform')

        # FC layer for generating the output word tokens
        self.fc = tf.keras.layers.Dense(vocab_target_size)

    # Returns the scores of the word tokens and the two states of the LSTM layer.
    def call(self, x, hidden, enc_output=None):
        if attention:
            context_vector = self.attention(hidden[0], enc_output)

            x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        output, hidden_state, cell_state = self.rnn(x, initial_state=hidden)

        output = tf.reshape(output, (-1, output.shape[2]))

        x = self.fc(output)

        return x, (hidden_state, cell_state)

In [None]:
decoder = Decoder(units, batch_size)

## Optimizer and Loss
One thing remains before training: choosing the optimizer and the loss function.

In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Returns cross-entropy loss while applying the padding mask. 
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
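
The effect of the padding mask can be checked on a few hand-picked numbers. The sketch below is a plain-Python illustration, not the notebook's code: `masked_mean_loss` is a hypothetical helper and the per-token losses are made up. It also shows one consequence of using a plain mean, as `loss_function` does: the masked sum is divided by the total number of positions, padded ones included. The `normalise_by_tokens` variant, which divides by the number of real tokens instead, is shown only for comparison and is not used in this notebook.

```python
def masked_mean_loss(token_ids, per_token_loss, normalise_by_tokens=False):
    """Average per-token losses while ignoring padding (token id 0)."""
    mask = [0.0 if t == 0 else 1.0 for t in token_ids]
    masked = [l * m for l, m in zip(per_token_loss, mask)]

    if normalise_by_tokens:
        # Variant (not used in this notebook): divide by the number of real tokens.
        denominator = max(sum(mask), 1.0)
    else:
        # Same behaviour as `loss_function`: plain mean over all positions.
        denominator = float(len(token_ids))

    return sum(masked) / denominator


# Target token ids at one time step for a batch of 4 sentences; 0 is padding.
token_ids = [5, 0, 12, 0]
per_token_loss = [2.0, 3.0, 4.0, 1.0]  # made-up values

loss_all_positions = masked_mean_loss(token_ids, per_token_loss)
loss_real_tokens = masked_mean_loss(token_ids, per_token_loss, normalise_by_tokens=True)
```

Here the losses at the two padded positions are discarded; the plain mean gives 6 / 4 = 1.5, whereas normalising by the two real tokens gives 6 / 2 = 3.0.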

## Training
This section defines two things:
- A train step which computes the loss on a batch and updates the weights of the models. It is
  implemented as a `tf.function` to speed up training.
- A train loop that iterates a certain number of epochs over the whole training dataset and that calls
  the train step for each batch.
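
The train step uses teacher forcing: at each time step the decoder receives the ground-truth word at position `t` and is expected to predict the word at position `t + 1`. The following minimal sketch lists these pairs for a single sentence. It works on plain strings rather than embeddings and token ids, and `teacher_forcing_pairs` is a hypothetical helper used only for illustration.

```python
def teacher_forcing_pairs(target_tokens):
    """Return (decoder input, expected output) pairs for one target sentence."""
    pairs = []
    for t in range(len(target_tokens) - 1):
        # The decoder is fed the true word at step t, never its own prediction.
        pairs.append((target_tokens[t], target_tokens[t + 1]))
    return pairs


pairs = teacher_forcing_pairs(["<start>", "salut", "!", "<end>"])
for decoder_input, expected_output in pairs:
    print(decoder_input, "->", expected_output)
```

A sentence of `n` tokens therefore yields `n - 1` decoder steps, which matches the `range(target_lang.shape[1] - 1)` loop in `train_step`.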

In [None]:
# Computes the loss over a batch, updates the models and returns the loss
@tf.function
def train_step(input_lang, target_lang, target_tensor, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(input_lang, enc_hidden)

        dec_hidden = enc_hidden

        # Teacher forcing - feeding the target as the next input
        for t in range(target_lang.shape[1] - 1):
            dec_input = tf.expand_dims(target_lang[:, t], axis=1)

            predictions, dec_hidden = decoder(dec_input, dec_hidden, enc_output)

            loss += loss_function(target_tensor[:, t+1], predictions)

    batch_loss = (loss / int(target_lang.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss

In [None]:
# Creates a tensorflow Checkpoint to save the weights
ckpt_dir = checkpoint_dir + embed_name_en + ('_with_attention' if attention is True else '')
checkpoint_prefix = os.path.join(ckpt_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)

In [None]:
# Restore a previous checkpoint
if restore is True:
    if os.path.exists(ckpt_dir) and os.listdir(ckpt_dir):
        checkpoint.restore(tf.train.latest_checkpoint(ckpt_dir))

In [None]:
# Transform a numpy array of byte strings into an array of words (one row per sentence)
def to_list_strings(a):
    sentences = np.empty(a.shape, dtype=object)

    for (x, y), w in np.ndenumerate(a):
        sentences[x, y] = w.decode('utf-8')
    
    return sentences

In [None]:
if train is True:
    # creates the results folder if required
    if not os.path.exists(results_folder):
        os.makedirs(results_folder)

    filename = results_folder + embed_name_en \
            + ('_with_attention' if attention is True else '') + '_loss.csv'

    # creates the results file. If restoration is enabled and the file already exists, then we do
    # not recreate it, and we will append our results to it.
    if restore is True:
        if not os.path.exists(filename):
            with open(filename, 'w') as f:
                f.write('# mean,stddev\n')

    else:
        with open(filename, 'w') as f:
            f.write('# mean,stddev\n')

    results = np.zeros((2, 2))

    # training loop
    for epoch in range(epochs):
        epoch_losses = []
        start = time.time()

        enc_hidden = encoder.initialize_hidden_state()
        total_loss = 0

        for (batch, (input_lang, target_lang, target_tensor)) in enumerate(dataset.take(steps_per_epoch)):

            # transform input_lang and target_lang into list of list of words
            input_lang = to_list_strings(input_lang.numpy())
            target_lang = to_list_strings(target_lang.numpy())

            # computes the embeddings
            input_embed = embed_all_sentences(input_lang, embedder_en)
            target_embed = embed_all_sentences(target_lang, embedder_fr)

            # computes the loss and updates the models
            batch_loss = train_step(input_embed, target_embed, target_tensor, enc_hidden)
            epoch_losses.append(batch_loss)
            total_loss += batch_loss

            if batch % 100 == 0:
                print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, batch_loss.numpy()))

        epoch_losses = np.array(epoch_losses)
        results[epoch%2, 0] = epoch_losses.mean()
        results[epoch%2, 1] = epoch_losses.std()

        # saves models and results every 2 epochs
        if (epoch + 1) % 2 == 0:
            checkpoint.save(file_prefix = checkpoint_prefix)

            with open(filename, 'a') as f:
                np.savetxt(f, results, delimiter=',')

        print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                            total_loss / steps_per_epoch))
        print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))


## Translating
This section defines the functions required for translating a sentence with the trained model,
and thus evaluating it. Note that the translation is generated greedily: at each time step, the most
probable word is selected. This could be improved, for example, by using beam search.
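
Picking the most probable word at each step is greedy decoding. The sketch below contrasts it with a basic beam search on a toy problem, to show why keeping several hypotheses can help. It is not part of this notebook's implementation: `toy_step` is a hypothetical stand-in for the decoder that returns log-probabilities for the next token, its probabilities are made up, and no length normalisation is applied.

```python
import math


def toy_step(prefix):
    """Hypothetical next-token distribution (log-probabilities) given a prefix."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"x": math.log(0.55), "y": math.log(0.45)},
        ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
    }
    # Any prefix not listed above can only end the sentence.
    return table.get(tuple(prefix), {"<end>": 0.0})


def greedy_decode(step_fn, max_length):
    """Pick the most probable token at every step."""
    prefix, score = [], 0.0
    for _ in range(max_length):
        token, logp = max(step_fn(prefix).items(), key=lambda kv: kv[1])
        score += logp
        if token == "<end>":
            break
        prefix.append(token)
    return prefix, score


def beam_search(step_fn, beam_width, max_length):
    """Keep the `beam_width` best partial translations at every step."""
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                if token == "<end>":
                    finished.append((prefix, score + logp))
                else:
                    candidates.append((prefix + [token], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    # Fall back to unfinished beams if nothing reached "<end>".
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])


greedy_result = greedy_decode(toy_step, max_length=5)
beam_result = beam_search(toy_step, beam_width=2, max_length=5)
```

In this toy example greedy decoding commits to `a` first and ends with a sequence of probability 0.6 × 0.55 = 0.33, while a beam of width 2 also explores `b` and finds `b x` with probability 0.4 × 0.9 = 0.36.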

In [None]:
# Returns the translated sentence given the embeddings of the input sequence.
def translate_embed(embeddings):
    # initial state of the encoder
    hidden = tf.zeros((1, units)), tf.zeros((1, units))

    # pass through the encoder
    enc_output, enc_hidden = encoder(tf.expand_dims(embeddings, axis=0), hidden)

    dec_hidden = enc_hidden
    l = 0
    word_input_embed = get_sos(embedder_fr)
    translation = []
    
    # loop for generating the translation
    while l < max_length_translation:
        # pass through the decoder
        dec_output, dec_hidden = decoder(
            tf.reshape(word_input_embed, (1, 1, word_input_embed.shape[0])), dec_hidden, enc_output)
            
        dec_output = tf.reshape(dec_output, (dec_output.shape[1]))

        # get the most probable token
        token_output = tf.math.argmax(dec_output).numpy()

        # get the word associated to this token
        if token_output == 0:
            word_output = '<unk>'

        else:
            word_output = target_lang_tokenizer.index_word[token_output]

        # stop the translation process if the decoder has output an eos
        if word_output == '<end>':
            return translation

        translation.append(word_output)

        # get the embedding of the output word, which will be the next input of the decoder
        word_input_embed = tf.convert_to_tensor(embed([word_output], embedder_fr)[0])

        l += 1

    return translation

In [None]:
# Translate a sentence using the trained model.
def translate(sentence):
    # preprocess sentence and split it into words
    sentence = preprocess_sentence(sentence).split()

    # get the embeddings
    embeddings = tf.convert_to_tensor(embed(sentence, embedder_en))

    # translate
    return translate_embed(embeddings)

In [None]:
' '.join(translate('hi'))

## Evaluating
The last part of this notebook defines an evaluation function based on the BLEU score.

In [None]:
# Returns the mean and the standard deviation of the BLEU scores obtained.
def evaluate(input_embed, target_lang):
    translations = []
    scores = []
    # Method 1 was chosen because it seemed the simplest: it adds a fixed, very small value (epsilon)
    # to the numerator of each precision score when it is zero.
    # Source: https://www.nltk.org/_modules/nltk/translate/bleu_score.html
    sf = SmoothingFunction().method1
    for index, e in enumerate(input_embed):
        translation = translate_embed(e)
        trunc_target = target_lang[index][1:(target_lang[index].index("<end>"))]

        # computing BLEU score
        if (len(trunc_target) > 3 and len(translation) > 3):
            scores.append(bleu([trunc_target], translation, smoothing_function=sf))
        else:
            # Special case when one of the sentences has fewer than 4 words
            shortest = min(len(trunc_target), len(translation))
            weights = tuple()
            for _ in range(shortest):
                weights += (1./shortest,)

            scores.append(bleu([trunc_target], translation, smoothing_function=sf, weights=weights))

    scores = np.array(scores)
    return scores.mean(), scores.std()
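
The special case in `evaluate` exists because BLEU uses n-grams up to order 4 by default, and a sentence with fewer than 4 words contains no 4-gram. The sketch below isolates the weight selection in a small standalone function so that it can be checked by hand. `bleu_weights` is a hypothetical helper, not an NLTK function, and is meant to be equivalent to the weight selection in `evaluate`.

```python
def bleu_weights(reference_length, hypothesis_length, max_order=4):
    """Uniform n-gram weights, with the order capped by the shortest sentence."""
    order = min(max_order, reference_length, hypothesis_length)
    # With order == 0 the generator is empty, so no division takes place.
    return tuple(1.0 / order for _ in range(order))


long_pair = bleu_weights(10, 7)   # both sentences have at least 4 words
short_pair = bleu_weights(2, 5)   # the reference has only 2 words
empty_pair = bleu_weights(0, 5)   # one sentence is empty

print(long_pair, short_pair, empty_pair)
```

When both sentences have at least 4 words, the weights are the usual four values of 0.25. With a 2-word sentence only unigrams and bigrams are scored, each with weight 0.5. An empty sentence yields an empty tuple of weights; how that case should be scored is left open here and would need to be handled explicitly.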

In [None]:
input_embed_val = embed_all_sentences(input_lang_val, embedder_en)
mean, std = evaluate(input_embed_val, target_lang_val)

print('BLEU score: mean = {}, std = {}'.format(mean, std))

with open(results_folder + embed_name_en + ('_with_attention' if attention is True else '') + '_bleu.csv', 'w') as file:
    file.write('# mean,stddev\n')
    file.write(str(mean) + ',' + str(std) + '\n')