# Introduction

This project covers several models for neural machine translation (NMT). We look at four models, each adding improvements and modifications to the previous one. We start with a basic sequence-to-sequence model, followed by an encoder-decoder setup for joint alignment and translation. The later models then aim to improve performance by adding attention and coverage.

## Models

Using implementations of the following papers, we build four translation models and check their performance on the test data:

- [Sequence to Sequence Learning with Neural Networks](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf): to get the ball rolling.
- [Neural Machine Translation By Jointly Learning To Align And Translate](https://arxiv.org/pdf/1409.0473.pdf): basic attention.
- [Effective Approaches to Attention-based Neural Machine Translation](http://aclweb.org/anthology/D15-1166): try the three global attention options and ignore local attention.
- [Modeling Coverage for Neural Machine Translation](http://www.aclweb.org/anthology/P16-1008): does this help with repeated words?
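
The three global attention options mentioned above are the `dot`, `general`, and `concat` score functions. The following is a minimal NumPy sketch of them, not the project's implementation: the dimensions, the random weights, and the names `score_dot`, `score_general`, `score_concat`, and `attend` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_dot(h_t, h_s):
    # dot: h_t^T h_s for every source state; h_t is (d,), h_s is (n, d).
    return h_s @ h_t

def score_general(h_t, h_s, W_a):
    # general: h_t^T W_a h_s, with W_a of shape (d, d).
    return (h_s @ W_a.T) @ h_t

def score_concat(h_t, h_s, W_a, v_a):
    # concat: v_a^T tanh(W_a [h_t; h_s]), with W_a (k, 2d) and v_a (k,).
    tiled = np.tile(h_t, (h_s.shape[0], 1))          # (n, d)
    concat = np.concatenate([tiled, h_s], axis=1)    # (n, 2d)
    return np.tanh(concat @ W_a.T) @ v_a             # (n,)

def attend(scores, h_s):
    # Turn scores into attention weights and a context vector.
    weights = softmax(scores)
    context = weights @ h_s
    return weights, context

# Toy sizes (assumed): hidden size d, source length n, concat layer size k.
rng = np.random.default_rng(0)
d, n, k = 4, 5, 3
h_t = rng.normal(size=d)           # current decoder state
h_s = rng.normal(size=(n, d))      # encoder states, one row per source token
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(k, 2 * d))
v_a = rng.normal(size=k)

for name, scores in [
    ("dot", score_dot(h_t, h_s)),
    ("general", score_general(h_t, h_s, W_general)),
    ("concat", score_concat(h_t, h_s, W_concat, v_a)),
]:
    weights, context = attend(scores, h_s)
    print(name, weights.round(3), context.shape)
```

All three variants produce one score per source position; only the way the decoder state and encoder states are combined differs, so they can be swapped without changing the rest of the model.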

In [3]:
# load the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Dataset

We will be using a parallel corpus of English-Hindi text. The sizes of the English files, in `wc` format (lines, words, bytes), are:
```text
    401    6709   35463 dev.en
    200    2784   14367 test.en
  49398  846781 4516717 train.en
  49999  856274 4566547 total
```
As seen above, we have around 49.4k parallel aligned sentences for training and about 200 sentences for testing. Additionally, there are about 400 sentences for parameter tuning and validation. We will train all our models on `train.{en,hi}`. First, let us load the data; it has already been processed and stored as pandas `DataFrame`s for easy loading and manipulation.

In [7]:
store = pd.HDFStore('data/data.h5')
print(store.keys())
train = store['train']
print(train.shape)
train.head(10)

['/test', '/train', '/validate']
(49398, 2)


| | en | hi |
|---|---|---|
| 0 | The treatment of cataract is possible through ... | मोतियाबिंद का उपचार केवल शल्य-चिकित्सा द्वारा ... |
| 1 | Complete lens capsule is taken out in the meth... | इन्ट्रा कैनसूलर कैटरेक्ट एक्सट्रेक्शन ( Intra ... |
| 2 | During operation lens is implanted at front of... | ऑपरेशन के दौरान लैन्स प्रत्यारोपण आँख के अगले ... |
| 3 | In the Extra Capsular Cataract Method , the pa... | इक्स्ट्रा कैनसूलर कैटरेक्ट एक्सट्रेक्शन विधि म... |
| 4 | Lens is fitted in the capsular bag . | कैपस्यूलर बैग में लैन्स फिट किया जाता है । |
| 5 | In the S . . . . -LRB- Small Incision Cataract... | S.I.C.S ( Small Incision Cataract Surgry ) विध... |
| 6 | No stitches are applied in the S . . . . -LRB-... | S.I.C.S ( Small Incision Cataract Surgry ) विध... |
| 7 | In black cataract the eye nerves dilapidates g... | काला मोतियाबिंद में नेत्र तंत्रिका धीरे -धीरे ... |
| 8 | The blindness caused by the black cataract can... | काले मोतियाबिंद से होने वाली अधंता अंधता को रो... |
| 9 | When the extra pressure is more in the eyes . | जब आँखों में अतिरिक्त दबाव ज्यादा हो । |


# Models

There are four basic models, each with more functionality than the last. The functionality added at each stage is:
- Sequence-to-sequence models
- Joint encoder-decoder setups
- Attention mechanism
- Modeling Coverage
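
To make the last item concrete: the simplest way to think about coverage is as a running sum of the attention weights each source word has received so far. The sketch below shows only that accumulation step, under stated assumptions. The attention values are made up, `update_coverage` is a hypothetical helper, and the coverage paper additionally feeds the coverage vector back into the attention score, which is not shown here.

```python
import numpy as np

def update_coverage(coverage, attention):
    # Accumulate the attention mass each source position has received.
    return coverage + attention

# Made-up attention weights for 3 decoding steps over 3 source words.
# Steps 1 and 2 both focus on source word 0, as in a repeated translation.
attention_steps = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
])

coverage = np.zeros(attention_steps.shape[1])
for alpha in attention_steps:
    coverage = update_coverage(coverage, alpha)

# Positions with coverage well above 1 were attended to more than once;
# positions well below 1 were largely ignored.
over_translated = np.where(coverage > 1.0)[0]
print(coverage, over_translated)
```

A coverage-aware model can use this vector to discourage attending again to source words that are already well covered, which is the mechanism expected to help with repeated words.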

First, let us load the necessary libraries for training and building our models.

In [8]:
import keras
import numpy as np

Using TensorFlow backend.


## Seq2Seq Model


## Joint Encoder-decoder Model

## Attention mechanism

## Modeling Coverage

# Testing & Results

Now that we have our models ready, we can compare them on the test dataset and see how well they stack up against each other. We use the same test dataset for every model (_loaded from the `DataFrame` store_). This test dataset contains around 200 sentences. Ideally, we _should_ see performance improve as we move from the first model to the last. In addition to the test data used for the statistical accuracies, the models are run against some hand-picked sentences to highlight the key differences and improvements gained by adding various features (50).

In [9]:
test = store['test']
test.head(10)

| | en | hi |
|---|---|---|
| 0 | Fresh breath and shining teeth enhance your pe... | ताजा साँसें और चमचमाते दाँत आपके व्यक्तित्व को... |
| 1 | Your self-confidence also increases with teeth . | दाँतों से आपका आत्मविश्‍वास भी बढ़ता है । |
| 2 | Bacteria stay between our gums and teeth . | हमारे मसूढ़ों और दाँतों के बीच बैक्टीरिया मौजू... |
| 3 | They make teeth dirty and breath stinky . | ये दाँतों को गंदा और साँसों को बदबूदार बना देत... |
| 4 | You may keep your teeth clean and breath fresh... | यहाँ दिए कुछ आसान नुस्खों की मदद से आप अपने दा... |
| 5 | Clean your teeth properly . | दाँतों को ठीक से साफ करें । |
| 6 | It takes two to three minutes to clean your te... | दाँतों को ठीक से साफ करने में दो से तीन मिनट क... |
| 7 | But most of the people give less than one minu... | लेकिन ज्यादातर लोग इसके लिए एक मिनट से भी कम स... |
| 8 | Drink plenty of water . | खूब पानी पीएँ । |
| 9 | Bacteria attack fast if the mouth dries up . | मुँह सूखने पर बैक्टीरिया हमला तेज कर देते हैं । |
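
The text above does not name the metric used for the comparison. As one possible stand-in, and purely as an assumption, the sketch below computes a clipped unigram precision between a model output and its reference. The function name `unigram_precision` and the hypothesis sentence are hypothetical; only the reference sentence comes from the test data shown above.

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    # Fraction of hypothesis tokens that also occur in the reference,
    # with each token's credit clipped to its count in the reference.
    hyp_tokens = hypothesis.split()
    ref_counts = Counter(reference.split())
    if not hyp_tokens:
        return 0.0
    hyp_counts = Counter(hyp_tokens)
    overlap = sum(min(count, ref_counts[token])
                  for token, count in hyp_counts.items())
    return overlap / len(hyp_tokens)

reference = "खूब पानी पीएँ ।"       # row 8 of the test data
hypothesis = "पानी पानी पीएँ ।"     # made-up output with a repeated word

print(unigram_precision(hypothesis, reference))  # 0.75
```

Because of the clipping, the repeated word is credited only once, so this kind of score drops when a model repeats itself. A full evaluation would more likely use a standard corpus-level metric; this sketch only illustrates the idea on a single sentence pair.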
