Neural Machine Translation (NMT) has become one of the most effective ways to translate text from one language to another. Earlier systems struggled with the task and often produced disfluent output. Traditional phrase-based translation systems split a source sentence into multiple chunks and translated it phrase by phrase. This is not how humans translate: we first understand the meaning of the whole sentence and then express it in the other language. NMT works in a similar way. It first reads the sentence in the input language and encodes it into a "thought vector", then processes that vector to emit a translation. NMT models are usually built on recurrent neural networks (RNNs), often coupled with an attention mechanism that lets the network consult the encoded source sentence while it generates each output word. An NMT pipeline works as follows:
- The dataset is prepared by loading it, removing extra spaces and special characters, tokenizing the text, and padding each sentence to a maximum length.
- An encoder (which encodes the source sequence into a real-valued vector) and a decoder (which produces the output sequence) are created, and the encoder-decoder model is trained.
- In our project we used the TensorFlow framework to offer a low-level working example of the concept. We train a sequence-to-sequence model for Hindi to English translation.
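
The encoder, decoder, and training step listed above can be sketched in TensorFlow roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual model: the use of GRU layers, the class names `Encoder` and `Decoder`, the vocabulary and layer sizes, and the toy id tensors are all hypothetical, and attention is left out to keep the sketch short.

```python
import tensorflow as tf


class Encoder(tf.keras.Model):
    """Encodes a batch of source token ids into one real-valued vector each."""

    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)

    def call(self, x):
        x = self.embedding(x)
        _, state = self.gru(x)  # final hidden state acts as the "thought vector"
        return state


class Decoder(tf.keras.Model):
    """Produces next-word scores for the target sequence, given the encoder state."""

    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x, state):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=state)
        return self.fc(output), state


# Hypothetical sizes and toy data, chosen only for illustration.
encoder = Encoder(vocab_size=50, embedding_dim=8, units=16)
decoder = Decoder(vocab_size=60, embedding_dim=8, units=16)
source_ids = tf.constant([[2, 7, 9, 1, 0], [2, 5, 1, 0, 0]])  # two padded source sentences
target_ids = tf.constant([[2, 4, 6, 1], [2, 8, 1, 0]])        # two padded target sentences

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

# One training step with teacher forcing: the decoder sees the target
# shifted right and is asked to predict the next token at each position.
# Padding positions are not masked here, which a real model would do.
with tf.GradientTape() as tape:
    state = encoder(source_ids)
    logits, _ = decoder(target_ids[:, :-1], state)
    loss = loss_fn(target_ids[:, 1:], logits)

variables = encoder.trainable_variables + decoder.trainable_variables
grads = tape.gradient(loss, variables)
optimizer.apply_gradients(zip(grads, variables))
```

In practice this step would be repeated over batches of the real dataset for several epochs, with the padding id excluded from the loss.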
The dataset consists of language translation pairs. We used a Hindi-English dataset, a text file containing 2778 sentence pairs. In our project English is the source language and Hindi is the target language. After importing the required libraries, we preprocessed the dataset by removing quotes, stripping digits from the source and target sentences, removing the alternative symbols used for numbers, and inserting a space between punctuation and words. We added a start and an end token to each sentence, then created a word index and a reverse word index (dictionaries mapping word → id and id → word), and padded each sentence to a maximum length. The cleaned and preprocessed source and target sentences are stored as word pairs in the format [HINDI, ENGLISH].
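
The preprocessing steps described above might look something like the following plain-Python sketch. The function names, the `<start>`, `<end>`, and `<pad>` token strings, the punctuation set, and the reading of "different symbols used for numbers" as Devanagari digits are all assumptions made for illustration, and the two sentence pairs are made-up examples rather than rows from the dataset.

```python
import re

# Hypothetical token strings; the original project may use different ones.
START, END, PAD = "<start>", "<end>", "<pad>"


def preprocess_sentence(sentence):
    """Clean one sentence and wrap it in start and end tokens."""
    s = sentence.strip()
    s = re.sub(r"[\"']", "", s)            # remove quotes
    s = re.sub(r"[0-9०-९]", "", s)         # remove ASCII and (assumed) Devanagari digits
    s = re.sub(r"([?.!,।])", r" \1 ", s)   # put spaces around punctuation
    s = re.sub(r"\s+", " ", s).strip()     # collapse repeated whitespace
    return f"{START} {s} {END}"


def build_index(sentences):
    """Return (word -> id, id -> word); id 0 is reserved for padding."""
    word2id = {PAD: 0}
    for word in sorted({w for s in sentences for w in s.split()}):
        word2id[word] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def encode_and_pad(sentences, word2id):
    """Map words to ids and right-pad every sequence to the longest length."""
    seqs = [[word2id[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in seqs)
    return [seq + [word2id[PAD]] * (max_len - len(seq)) for seq in seqs]


# Made-up pairs in the [HINDI, ENGLISH] format.
raw_pairs = [["वह 2 बजे आया।", "He came at 2."], ["धन्यवाद", "Thank you!"]]
pairs = [[preprocess_sentence(hi), preprocess_sentence(en)] for hi, en in raw_pairs]

hindi = [hi for hi, _ in pairs]
hindi_word2id, hindi_id2word = build_index(hindi)
hindi_padded = encode_and_pad(hindi, hindi_word2id)
```

Each language gets its own word index, so the same steps would be repeated for the English side of the pairs.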