# Machine Learning Engineer Nanodegree
## Capstone Proposal
Henry Maguire 
1st March 2017

## Proposal
I would like to build an end-to-end machine translation system that is capable of translating phrases from French to English to a higher degree of accuracy than a benchmark dictionary-based system.

### Domain Background

Over years of development in machine learning, performance on many tasks has improved substantially. A clear example is speech recognition, where Google's speech recognition system is now "95% accurate" in the English language and has "improved 20%" since 2013 alone. In many areas the field of NLP still lags behind: chatbots are still restricted to very closed-domain conversations and remain unconvincing when passing for humans, despite many recent technological improvements.

However, the application of machine learning to language translation has been very successful, with the Google neural MT system [reducing errors](https://research.google.com/pubs/pub45610.html) by 60% compared with phrase-based systems. Machine translation has been studied since shortly after World War Two. Earlier rule-based and phrase-based approaches require a large amount of domain knowledge and human input, such as hard-coded grammar rules. Many developments have been made in statistical machine translation (SMT), including the use of Bayesian inference to maximise the likelihood of translations, and the accuracy of these systems has slowly improved over time. Other approaches translate both source and target phrases into a common intermediate language, again using pre-developed knowledge of the structure of each language.

Modern solutions often combine grammatical domain knowledge or phrase-based systems with deep neural networks. Deep neural networks are able to approximate arbitrary functions, which makes them useful in supervised learning applications. The backpropagation algorithm allows correlations to be learned from data by iteratively updating the weights of a neural network, based on how far a prediction was from the ideal output. Neural networks can also take advantage of the symmetry of a problem. Recurrent Neural Networks (RNNs) share parameters across "time steps", so that each time we see the word "cat" we don't have to re-learn the weights in the network. This also allows sequential data to be handled, with a varying number of inputs mapping to a fixed number of outputs, so RNNs are well suited to natural language processing tasks. This means that all you need to develop a translation system is a large amount of text in both the source and target language, as well as enough computing power. Both requirements have recently become a lot easier to satisfy, with free access to large parallel corpora and relatively cheap GPUs.
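
The parameter sharing described above can be sketched in a few lines. This is only an illustration, assuming a plain tanh RNN cell; the names `rnn_forward`, `W_xh`, `W_hh` and the layer sizes are made up for the example and do not come from any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over a sequence, reusing the same weights at every step."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:  # one iteration per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h  # fixed-size summary of the whole sequence

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3  # illustrative sizes (assumptions)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Sequences of different lengths are handled by the same three parameter arrays.
short_phrase = rng.normal(size=(2, embed_dim))  # 2 time steps
long_phrase = rng.normal(size=(7, embed_dim))   # 7 time steps
h_short = rnn_forward(short_phrase, W_xh, W_hh, b_h)
h_long = rnn_forward(long_phrase, W_xh, W_hh, b_h)
```

Both `h_short` and `h_long` have the same size, which is what lets a variable-length phrase be mapped to a fixed-length vector.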

The Google MT system deploys two deep Recurrent Neural Networks which are connected end-to-end: one which generates fixed-size vector representations for variable-length input phrases in the source language (the encoder) and another which takes these vectors and generates variable-length sequence representations of them in the target language (the decoder). The final layer then projects a probability distribution over predicted words in the target language at each time step. Simple RNN cells suffer from a number of pathologies, such as vanishing gradients, which make it difficult for them to keep track of long-term language dependencies. To get around this the Google NMT system uses Long Short-Term Memory (LSTM) cells, which are gated units that can adaptively learn what information to forget and remember depending on the context of the current word - this theoretically allows long-term dependencies to be learned. The effectiveness of the system increases greatly with the depth of the network, so they also use 8 encoder and 8 decoder LSTM layers, which in turn expands the hardware requirements substantially. Another important feature of the Google system is its use of attention mechanisms, whereby the entire output of the encoder layer is made visible to the (first) decoder layer throughout the computation. This allows the decoder to keep track of the entire context of the word at all timesteps, using important summary representations of the whole phrase. Another advanced technique is in the handling of rare words, where individual words are split into sub-word units, for example `"going"` becomes `"go", "ing"` etc - this would require some kind of auxiliary learning algorithm (I probably won't do this).
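
As a rough illustration of the attention idea, the sketch below computes a context vector as a weighted average of all encoder outputs. Dot-product scoring is an assumption chosen for brevity and may not match how the Google system scores source positions; `attention_context` and the array sizes are hypothetical.

```python
import numpy as np

def attention_context(decoder_state, encoder_outputs):
    """Weight every encoder output by its similarity to the current decoder state."""
    scores = encoder_outputs @ decoder_state   # one score per source position
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_outputs        # weighted average of encoder outputs
    return context, weights

rng = np.random.default_rng(1)
encoder_outputs = rng.normal(size=(5, 8))  # 5 source tokens, hidden size 8 (illustrative)
decoder_state = rng.normal(size=8)
context, weights = attention_context(decoder_state, encoder_outputs)
```

The `weights` vector sums to one and indicates how much each source position contributes to the context seen by the decoder at this time step.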


### Problem Statement
- The problem of machine translation requires decoding the *meaning* of phrases and then re-encoding them, often in an entirely different representation/script.
- RNNs can map sequences to fixed-length vectors, which is excellent for generating language models. For a given unfinished phrase you can work out the probability distribution over the vocabulary to find the most likely words which finish the phrase. This allows us to generate sequences of words probabilistically, such as in [Andrej Karpathy's blog](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). However, translation is a more restricted problem since the source and target words are in different vector spaces.
- The number of parameters and the accuracy of solutions scale badly with the size of the vocabulary.
- Sentences contain long distance dependencies, for example:

    __*Mr*__ Smith was walking to the shop when a bird flew into __*his*__ head

Here the words "Mr" and "his" share the same gender, but this dependency is *hidden* for 11 words, which makes it very difficult for simple (and even more advanced) RNNs to keep track of.


### Datasets and Inputs

To train the neural networks we need a corpus of text in both the target language and the source language. For this I have decided to use the [WMT'14](https://github.com/bicici/ParFDAWMT14) English-French parallel corpora, which are transcribed TED talks.

In order to process the information easily and to reduce the possibility of alignment errors, the two corpora need to be broken up into parallel lists of phrases. These phrases then need to be split up into tokens (words, punctuation and abbreviations), so that one token can be input into the model at each time step.
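
A minimal sketch of this tokenisation step, assuming the corpora have already been loaded as two aligned lists of phrases. The regular expression and the example sentences are illustrative only; abbreviations such as "e.g." are not handled and would need extra rules.

```python
import re

# A word (optionally with an internal apostrophe, e.g. "s'est") or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(phrase):
    """Split a phrase into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(phrase)

# Hypothetical aligned phrases standing in for the real corpora.
english = ["The black cat sat down.", "Where is the station?"]
french = ["Le chat noir s'est assis.", "Où est la gare ?"]

# Keep the two languages paired phrase by phrase to avoid alignment errors.
parallel_tokens = [(tokenize(en), tokenize(fr)) for en, fr in zip(english, french)]
```

Each entry of `parallel_tokens` is a pair of token lists, one per language, for the same phrase.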


### Solution Statement


As outlined above, we need a way of mapping from variable-length input sequences (in the source language) to variable-length output sequences (in the target language), which can be done using [sequence-to-sequence models](https://arxiv.org/pdf/1409.3215.pdf).

![seq2seq architecture](pictures/2-seq2seq-feed-previous.png)

Essentially, a phrase in the source language is fed into a recurrent neural network (the encoder) one word at a time and is encoded in the final hidden state of the RNN. This final encoder state is then passed to the decoder, which generates a representation of the phrase in the target language. During training the entire network is unrolled across some truncated number of timesteps and the target phrase is fed in both as the decoder's input sequence and as the "gold truth" training example. The output of the decoder layer is converted to a probability distribution over the target vocabulary at each time step.

I will build the end-to-end translation system and compare the following variants:
- Vanilla RNN units, understand pathologies
- LSTMs
- Bi-directional LSTMs
- Reverse source sentences
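
To illustrate how a single training example could be prepared, including the source-reversal variant, here is a small sketch. The `<GO>` start token and the helper name `make_training_example` are assumptions for illustration; only `<EOS>` is taken from the preprocessing plan.

```python
def make_training_example(source_tokens, target_tokens, reverse_source=True):
    """Build encoder input, decoder input and decoder target for one phrase pair."""
    # Optionally reverse the source so its first words sit closest to the decoder.
    encoder_input = source_tokens[::-1] if reverse_source else list(source_tokens)
    # The decoder sees the target shifted right by one position...
    decoder_input = ["<GO>"] + list(target_tokens)
    # ...and is trained to predict the target followed by the end-of-sequence tag.
    decoder_target = list(target_tokens) + ["<EOS>"]
    return encoder_input, decoder_input, decoder_target

enc, dec_in, dec_out = make_training_example(
    ["le", "chat", "noir"], ["the", "black", "cat"]
)
```

Here `enc` is `["noir", "chat", "le"]`, while `dec_in` and `dec_out` are the same target phrase offset by one token.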

Two other problems are how to quantify how close our prediction and the actual result are and how to make the output intelligible/appropriate. These form the basis of bleeding edge NLP/machine translation research.

### Benchmark Model

The benchmark model could simply translate the source sentence word for word using dictionary definitions. This may be cheating since we would have to reference some third-party dictionary definitions, but it would provide a simple baseline with which we can compare. This method would no doubt miss all of the idiomatic phrases which may appear in the original corpus and is likely to get the ordering of adjectives and gendered articles confused. An example of a likely benchmark error is that the phrase "the black cat" in English becomes "le chat noir" in French, which uses the masculine definite article (English has no gendered articles) and has the ordering of "chat" and "noir" reversed relative to the source phrase. We could use the [BLEU metric](https://en.wikipedia.org/wiki/BLEU) to measure the accuracy of both the word-for-word translation and the deep learning approach. The BLEU metric is the most popular metric used in the field of machine translation since it is known to [correlate well with human judgment](https://www.aclweb.org/anthology/E/E06/E06-1032.pdf).

### Evaluation Metrics


The neural network is trained by predicting the 
The [BLEU scoring method](https://en.wikipedia.org/wiki/BLEU) compares n-gram frequencies in the machine output and the human-translated (target) phrase: every sequence of `n` consecutive words in the output and target translations is recorded. The score is based on the ratio `P=m/w_t`, where `m` is the number of n-grams in the predicted phrase which also appear in the target phrase and `w_t` is the total number of n-grams in the predicted phrase. A score of `P=1` is a perfect score. On its own, this metric would reward translations which are dominated by a single n-gram that is found in the target phrase. The value in the numerator is therefore modified (clipped) so that the number of times an n-gram in the predicted phrase can be counted is limited to the number of times it appears in the target phrase. For example, using a 2-gram BLEU score:

Predicted phrase: **The dog the dog the dog**              

Target phrase: **The dog sat on the log**

In the predicted phrase there are five 2-grams: **the dog** appears three times and **dog the** appears twice. Of these, only **the dog** appears in the target phrase, and it appears there once. Without the modification the 2-gram score would be `P=3/5`, but with it the score is `P=1/5`. Unfortunately, `n` is a hyperparameter, but fortunately research has shown that `n=4` seems to agree most strongly with "monolingual human judgements". Also, small values of `n` seem to measure how much information was retained in the translation, whereas larger values are a measure of the readability/fluency. Given this latter point, I expect that the benchmark model outlined above may achieve a reasonable unigram BLEU score but will probably not score very highly with respect to larger n-grams.
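
The worked example above can be reproduced with a short sketch of the n-gram precision, with and without clipping. This covers only the precision term for a single value of `n`; the full BLEU score also combines several values of `n` and applies a brevity penalty, which are omitted here. The function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(predicted, target, n, clip=True):
    """Fraction of predicted n-grams that also occur in the target phrase."""
    predicted_counts = Counter(ngrams(predicted, n))
    target_counts = Counter(ngrams(target, n))
    total = sum(predicted_counts.values())
    if total == 0:
        return 0.0
    if clip:
        # Count each n-gram at most as often as it appears in the target.
        matches = sum(min(count, target_counts[gram])
                      for gram, count in predicted_counts.items())
    else:
        matches = sum(count for gram, count in predicted_counts.items()
                      if gram in target_counts)
    return matches / total

predicted = "the dog the dog the dog".split()
target = "the dog sat on the log".split()
unclipped = ngram_precision(predicted, target, 2, clip=False)  # 3/5
clipped = ngram_precision(predicted, target, 2)                # 1/5
```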

## Project Design
The overall pipeline will be:
- Load English and French datasets into memory
- Preprocess the data by tokenizing, removing capitalisation, limiting vocabulary size and replacing out-of-vocabulary words with the `<UNK>` token.
- Write the benchmark model
- Load in pretrained word embeddings and find embeddings for each vocabulary word. If there is no pretrained embedding available for a word, initialise it from a Gaussian random distribution.
- Implement BLEU metric either by hand or using NLTK

I will first need to preprocess the datasets. This will involve removing capitalisation, tokenising the text (which means breaking it up into a list of words), limiting the vocabulary to a certain number of the most frequent words and replacing any out-of-vocabulary words with the `<UNK>` tag, removing as much punctuation as possible, adding the `<EOS>` tag to the end of each sequence, replacing numbers with words and potentially a few other things depending on resources.
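
A minimal sketch of the vocabulary-limiting part of this preprocessing, assuming the text has already been tokenised. The function names and the tiny corpus are hypothetical; punctuation removal and number replacement are left out.

```python
from collections import Counter

def lowercase(tokenized_phrases):
    """Remove capitalisation from every token."""
    return [[token.lower() for token in phrase] for phrase in tokenized_phrases]

def build_vocabulary(tokenized_phrases, max_size):
    """Keep only the max_size most frequent tokens."""
    counts = Counter(token for phrase in tokenized_phrases for token in phrase)
    return {token for token, _ in counts.most_common(max_size)}

def preprocess(phrase_tokens, vocabulary):
    """Replace out-of-vocabulary tokens with <UNK> and append <EOS>."""
    cleaned = [token if token in vocabulary else "<UNK>" for token in phrase_tokens]
    return cleaned + ["<EOS>"]

# Hypothetical toy corpus standing in for the real data.
corpus = lowercase([["The", "cat", "sat"], ["The", "dog", "sat"], ["The", "cat", "ran"]])
vocabulary = build_vocabulary(corpus, max_size=3)
processed = [preprocess(phrase, vocabulary) for phrase in corpus]
```

With a vocabulary limit of three, the rarer words "dog" and "ran" are replaced by `<UNK>` while "the", "cat" and "sat" are kept.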

I'll then write the benchmark part of the code, using [PyDictionary](https://pypi.python.org/pypi/PyDictionary/1.3.4) to get English to French word translations. This will just be a simple nested loop:

    # Assumes `corpus` is a list of token lists and `translate` wraps the PyDictionary lookup.
    translated_corpus = []
    for phrase in corpus:
        translated_phrase = []
        for word in phrase:
            if word in ('<UNK>', '<EOS>'):
                # Special tokens pass through untranslated.
                translated_phrase.append(word)
            else:
                translated_phrase.append(translate(word, 'fr'))
        translated_corpus.append(translated_phrase)
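
To make the loop above concrete without depending on a third-party dictionary, here is a self-contained sketch that uses a hypothetical three-word lookup table in place of PyDictionary. `TOY_DICTIONARY` and `translate_word_for_word` are illustrative names, and mapping unknown words to `<UNK>` is an assumption.

```python
# Hypothetical stand-in for a real English-to-French dictionary lookup.
TOY_DICTIONARY = {"the": "le", "black": "noir", "cat": "chat"}
SPECIAL_TOKENS = {"<UNK>", "<EOS>"}

def translate_word_for_word(phrase, dictionary):
    """Translate each token independently, keeping the source word order."""
    translated = []
    for word in phrase:
        if word in SPECIAL_TOKENS:
            translated.append(word)  # special tokens pass through unchanged
        else:
            translated.append(dictionary.get(word, "<UNK>"))
    return translated

result = translate_word_for_word(["the", "black", "cat", "<EOS>"], TOY_DICTIONARY)
print(result)  # prints ['le', 'noir', 'chat', '<EOS>']
```

The output keeps the English word order ("le noir chat" rather than "le chat noir"), which is exactly the kind of error the benchmark is expected to make.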

I will then familiarise myself with using [word embeddings](https://spinningbytes.com/resources/embeddings/) which have been pre-trained on 200 million (ENG) and 300 million (FR) tweets, with an embedding size of 52. A package called `gensim` can be used to easily load and manipulate word2vec embeddings, and the [API documentation](http://radimrehurek.com/gensim/models/word2vec.html) looks fairly straightforward to use. I then need to make sure that each word in the restricted vocabulary of each corpus is represented by an embedding. If a word is not, the fallback listed in the pipeline above is to initialise its embedding randomly from a Gaussian distribution.
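
A sketch of how the embedding lookup with a random fallback might work. A plain dictionary stands in for the pretrained vectors here rather than `gensim`, and the function name, the standard deviation of 0.1 and the example vocabulary are all assumptions.

```python
import numpy as np

def build_embedding_matrix(vocabulary, pretrained, dim, seed=0):
    """Return one embedding row per vocabulary word, in vocabulary order."""
    rng = np.random.default_rng(seed)
    matrix = np.empty((len(vocabulary), dim))
    for row, word in enumerate(vocabulary):
        if word in pretrained:
            matrix[row] = pretrained[word]  # use the pretrained vector
        else:
            # No pretrained vector available: Gaussian random initialisation.
            matrix[row] = rng.normal(scale=0.1, size=dim)
    return matrix

dim = 52  # embedding size quoted for the pretrained vectors
# Hypothetical stand-in for vectors loaded from disk.
pretrained = {"cat": np.ones(dim), "dog": np.zeros(dim)}
vocabulary = ["cat", "dog", "<UNK>", "<EOS>"]
embeddings = build_embedding_matrix(vocabulary, pretrained, dim)
```

Rows for "cat" and "dog" are copied from the pretrained table, while `<UNK>` and `<EOS>` receive random vectors that can be refined during training.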


I've done some research to make sure that it's feasible to write a sequence-to-sequence model in TensorFlow to do the job that is required. I'll start by writing a very simple model, then train it on some dummy numeric data to see if it is capable of learning anything and outputting the correct sequences. I still need to decide on the underlying architecture of the RNN unit and whether or not to use an attention mechanism.
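
One way to generate such dummy numeric data is sketched below. The choice of task (the target is the source sequence reversed) is an assumption made for illustration, as are the function name and the convention of reserving id 0 for padding.

```python
import random

def make_dummy_pairs(num_pairs, max_length, vocab_size, seed=0):
    """Create (source, target) integer sequences where the target reverses the source."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        length = rng.randint(1, max_length)
        # Token ids start at 1 so that 0 can be reserved for padding.
        source = [rng.randrange(1, vocab_size) for _ in range(length)]
        target = source[::-1]
        pairs.append((source, target))
    return pairs

dummy_pairs = make_dummy_pairs(num_pairs=100, max_length=8, vocab_size=20)
```

If a simple model cannot learn to reproduce these targets, that points to a bug in the model or training loop rather than to the difficulty of translation.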

**TODO: write a full description of the architecture here, covering:**

- Inputs and outputs of each layer.
- How do the sequences pop out of the projection layer?
- Training with the Adam optimiser.



Computational resources may become an issue - I have free access to 2-64 CPU cores, but could use a GPU on [Floyd](https://www.floydhub.com/) or [AWS](https://aws.amazon.com/) if I need more training power.
