# Machine Translation (MT)

In this project I built a basic machine translation model. The main steps are:
- Preprocess the data
- Pretrain embeddings using Word2Vec
- Generate translations using a sequence-to-sequence (encoder-decoder) model with LSTM cells
<br>

#### Motivation for Machine Translation
Machine translation is the use of software designed to translate text automatically from one language to another, without human involvement. In a time of rapid globalization, such services are becoming increasingly important in many application fields. Compared with human translation, machine translation has some notable advantages:
- **Speed**: Machine translation is fast and returns a usable translation almost immediately
- **Cost**: Machine translation is cheap in comparison to employing a professional translator
- **Confidentiality**: Sensitive information does not have to be handed to an external translator, which might be risky

# Data

The data I used to train the machine translation model is the **[European Parliament Proceedings Parallel Corpus 1996-2011](http://www.statmt.org/europarl/)**.
From this corpus I used the Spanish-English parallel data.

## Load Datasets (en, es)

In [1]:
import os

In [2]:
# set dataset_path
dataset_path = './Dataset_MT'
if not os.path.exists(dataset_path):
    os.makedirs(dataset_path)

In [3]:
# Function for data loading
def load_doc(filename):
    # Open the file read-only; the with block closes it even if reading fails
    with open(os.path.join(dataset_path, filename), mode='rt', encoding='utf-8') as file:
        # Read all text
        text = file.read()
    return text

In [4]:
# Split the loaded texts into datasets of sentences
def to_sentences(doc):
    return doc.strip().split('\n')
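
For a quick look at the corpus without reading both files completely into memory, the two line-aligned files can also be read in parallel. The following is a minimal sketch under the assumption that line *i* of one file corresponds to line *i* of the other; `read_parallel` and its `limit` parameter are hypothetical names and not part of the notebook.

```python
from itertools import islice

def read_parallel(path_a, path_b, limit=None):
    """Yield (line_a, line_b) pairs from two line-aligned text files."""
    with open(path_a, encoding='utf-8') as file_a, open(path_b, encoding='utf-8') as file_b:
        # zip stops at the end of the shorter file, so a length mismatch goes unnoticed here
        for line_a, line_b in islice(zip(file_a, file_b), limit):
            yield line_a.rstrip('\n'), line_b.rstrip('\n')
```

Calling `list(read_parallel(path_en, path_es, limit=5))` with the two corpus paths would return the first five sentence pairs. Because `zip` truncates silently, the length check in the next section is still worth doing on the full data.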

In [5]:
# Load the English dataset
text_en = load_doc('europarl-v7.es-en.en')
text_en = to_sentences(text_en)

# Load the Spanish dataset
text_es = load_doc('europarl-v7.es-en.es')
text_es = to_sentences(text_es)

## Observe the Datasets

In [6]:
# Check the size of the datasets
print(len(text_en), len(text_es))

1965734 1965734


In [9]:
# Observe a few samples of both datasets (indices 1 to 4)
print(text_en[1:5])
print(text_es[1:5])

['I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.', "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.", 'You have requested a debate on this subject in the course of the next few days, during this part-session.', "In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union."]
['Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el viernes 17 de diciembre pasado, y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones.', 'Como todos han podido comprobar, el gran "efecto del

**Consequences of observing the datasets**:<br>
Based on these samples, I considered the following normalization steps necessary:
- Convert to lower case
- Remove special characters (in the Spanish dataset: replace language-specific letters with ASCII letters, e.g. Señorías --> Senorias)

Although I did not see any contractions in the samples, I added contraction expansion to the normalization process as a precaution.<br>
Additionally, I added an option for lemmatization.
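
To make the replacement of language-specific letters concrete, here is a minimal sketch of the idea using Unicode decomposition. `to_ascii` is a hypothetical helper name used only for this illustration.

```python
from unicodedata import normalize

def to_ascii(text):
    # NFD splits an accented letter into its base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the combining marks
    return normalize('NFD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Señorías'))          # Senorias
print(to_ascii('Señorías').lower())  # senorias
```

Characters that have no ASCII base letter, such as the inverted question mark, are removed entirely by this approach.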

<br>

## Pre-Preprocessing Step: Defining functions for saving/loading intermediate results of preprocessing

These functions are defined up front so that intermediate preprocessing results can be saved and reloaded, which saves time while working on the project.

In [15]:
# Necessary imports
from pickle import dump, load

In [16]:
# Define save_path
save_path_for_data = './Preprocessed_datasets'
if not os.path.exists(save_path_for_data):
    os.makedirs(save_path_for_data)

In [17]:
# Function to save intermediate results
def save_text(text, filename):
    # The with block makes sure the file is flushed and closed after dumping
    with open(os.path.join(save_path_for_data, filename), 'wb') as file:
        dump(text, file)
    print('Saved: %s' % filename)

In [18]:
# Function to load intermediate results
def load_text(filename):
    # The with block closes the file once the data is loaded
    with open(os.path.join(save_path_for_data, filename), 'rb') as file:
        data = load(file)
    print('Loaded: %s' % filename)
    return data
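
The save/load pattern can also be wrapped so that an expensive step only runs when its result is not on disk yet. This is a sketch of one possible design, not what the notebook does; `load_or_compute` and its parameters are hypothetical names.

```python
import os
from pickle import dump, load

def load_or_compute(path, compute):
    """Return the pickled result at `path`; compute and save it first if the file is missing."""
    if os.path.exists(path):
        with open(path, 'rb') as file:
            return load(file)
    # No cached result yet: run the expensive step and store its output
    result = compute()
    with open(path, 'wb') as file:
        dump(result, file)
    return result
```

A call could look like `load_or_compute('./Preprocessed_datasets/normalized_text_en.pkl', lambda: normalize_text(text_en))`. Note that a stale file is returned as-is, so the cache has to be deleted by hand whenever the preprocessing code changes.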

<br>

## Clean / Normalize the datasets

In [None]:
# Imports for cleaning / normalizing the datasets
from contractions import contractions_dict
import re
from unicodedata import normalize
# for lemmatization:
from nltk.stem import WordNetLemmatizer
from pattern.text.en import tag
from nltk.corpus import wordnet as wn

### Function definition for the Normalization process

Expand contractions

In [None]:
def expand_contractions(text):
    # re.escape keeps regex metacharacters in the dictionary keys from being interpreted
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in contractions_dict.keys())), flags=re.IGNORECASE|re.DOTALL)
    
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Look up the match as written, then lower-cased, then capitalized
        expanded_contraction = contractions_dict.get(match)\
                                or contractions_dict.get(match.lower())\
                                or contractions_dict.get(match.capitalize())
        # Leave the text unchanged if no dictionary entry fits this casing
        if not expanded_contraction:
            return match
        # Keep the original case of the first character
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    return expanded_text

Remove special characters and symbols (and replace language specific letters such as "ñ")

In [None]:
def remove_special_characters(text):
    # Decompose accented letters (NFD) and drop everything outside ASCII, so that "ñ" becomes "n"
    text = normalize('NFD', text).encode('ascii', 'ignore')
    text = text.decode('UTF-8')
    # Remove all unwanted symbols
    text = re.sub(r'[_"\'%()|.,;:+&=*!?#$@\[\]/]', '', text)
    return text

Convert to lower case

In [None]:
def convert_to_lower_case(text):
    text = text.lower()
    return text

Lemmatize

In [None]:
# Annotate text tokens with POS tags
def pos_tag_text(text):
    
    # Translate POS tags to WordNet tags for later usage in the lemmatize function
    def translate_to_wn_tags(pos_tag):
        if pos_tag.startswith('J'):
            return wn.ADJ
        elif pos_tag.startswith('V'):
            return wn.VERB
        elif pos_tag.startswith('N'):
            return wn.NOUN
        elif pos_tag.startswith('R'):
            return wn.ADV
        else:
            return None

    tagged_text = tag(text)
    tagged_lower_text = [(word.lower(), translate_to_wn_tags(pos_tag)) for word, pos_tag in tagged_text]
    return tagged_lower_text

In [None]:
# Initialize the WordNetLemmatizer
wnl = WordNetLemmatizer()

In [None]:
# Lemmatize the input text based on POS tags
def lemmatize_text(text):
    pos_tagged_text = pos_tag_text(text)
    lemmatized_tokens = [wnl.lemmatize(word, pos_tag) if pos_tag else word for word, pos_tag in pos_tagged_text]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text

Combine all normalization functions

In [None]:
def normalize_text(text, lemmatizeText=False):
    normalized_text = []
    for sentence in text:
        sentence = expand_contractions(sentence)
        sentence = remove_special_characters(sentence)
        sentence = convert_to_lower_case(sentence)
        if lemmatizeText:
            sentence = lemmatize_text(sentence)
        normalized_text.append(sentence)
    return normalized_text

**Normalize datasets (apply normalization functions) and save intermediate results**

In [None]:
text_en = normalize_text(text_en, lemmatizeText=False)
text_es = normalize_text(text_es, lemmatizeText=False)
# NOTE: Runtime for all sentences: ca. 45 min

In [None]:
save_text(text_en, 'normalized_text_en.pkl')
save_text(text_es, 'normalized_text_es.pkl')

**Load normalized datasets**

In [19]:
text_en = load_text('normalized_text_en.pkl')
text_es = load_text('normalized_text_es.pkl')

Loaded: normalized_text_en.pkl
Loaded: normalized_text_es.pkl


<br>

## Tokenization of datasets

In [None]:
# Necessary Imports
import nltk

In [None]:
# Function for (word) tokenizing the sentences of the datasets
def tokenize(text):
    tokenized_text = []
    for sentence in text:
        tokens = nltk.word_tokenize(sentence)
        tokenized_sentence = [token.strip() for token in tokens]
        tokenized_text.append(tokenized_sentence)
    return tokenized_text

In [None]:
# Tokenize the datasets
text_en = tokenize(text_en)
text_es = tokenize(text_es)
# NOTE: Runtime for all the data: ca 15-20 min

In [None]:
# Save tokenized datasets
save_text(text_en, 'tokenized_text_en.pkl')
save_text(text_es, 'tokenized_text_es.pkl')

**Load tokenized datasets**

In [14]:
text_en = load_text('tokenized_text_en.pkl')
text_es = load_text('tokenized_text_es.pkl')

Loaded: tokenized_text_en.pkl
Loaded: tokenized_text_es.pkl


<br>

## Remove empty sentences

*Note*: This step could also be done as part of the normalization process. However, I only noticed the need to remove empty sentences much later, which is why the removal happens after tokenization.

**Observe datasets for empty sentences**

In [None]:
# Minimize dataset for observation
text_en_short = text_en[0:5000]
text_es_short = text_es[0:5000]

In [None]:
# Function to get the indices of the empty sentences
def get_empty_sentence_indices(text):
    # enumerate supplies the running index, so no manual counter is needed
    return [i for i, sentence in enumerate(text) if len(sentence) == 0]

In [None]:
# Compare the indices of the empty sentences in both datasets
index_en = get_empty_sentence_indices(text_en_short)
index_es = get_empty_sentence_indices(text_es_short)
index_both = list(set(index_en).intersection(index_es))

print(str(len(index_en)) + ' empty sentences in English dataset: ' + str(index_en))
print(str(len(index_es)) + ' empty sentences in Spanish dataset: ' + str(index_es))
print('Number of empty sentences in both datasets: ' + str(len(index_both)))

In [None]:
# Analyse data to investigate the reason for discrepancies in the indices of both datasets
for i in range(103,105):
    print(text_en[i])
for i in range(103,105):
    print(text_es[i])

*Consequence of observing the sequence lengths of the datasets*:<br>
The example above shows that the indices of the empty sentences differ between the two datasets. It also shows that these differences stem from errors in how the data was split into sentences. I therefore consider the following steps necessary to remove all potentially misaligned data:
- If both datasets have an empty sentence at the same index: Remove the empty sentence
- If only one dataset has an empty sentence at a certain index: Remove the sentences at and around that index from both datasets (--> this ensures that no part of a sentence in one dataset is missing from the other)
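
The two rules can be checked on a toy example before running them on the full corpus. The sketch below uses made-up token lists, and `indices_to_drop` is a hypothetical helper name chosen for this illustration.

```python
def indices_to_drop(text_1, text_2):
    drop = set()
    for i, (sentence_1, sentence_2) in enumerate(zip(text_1, text_2)):
        if not sentence_1 and not sentence_2:
            # Empty in both datasets: drop only this position
            drop.add(i)
        elif not sentence_1 or not sentence_2:
            # Empty in only one dataset: drop this position and both neighbours
            drop.update({i - 1, i, i + 1})
    # Keep only positions that actually exist in the data
    return sorted(i for i in drop if 0 <= i < len(text_1))

en = [['a'], [], ['b'], ['c'], [], ['d'], ['e']]
es = [['x'], [], ['y'], ['z'], ['w'], ['v'], ['u']]
print(indices_to_drop(en, es))  # [1, 3, 4, 5]
```

Index 1 is empty in both lists and is dropped on its own. Index 4 is empty only in `en`, so indices 3, 4 and 5 are dropped from both lists.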

**Functions to remove empty sentences according to the strategy described above**

In [20]:
# Function to define the indices that need to be removed according to the strategy
def get_indices_for_remove(text_1, text_2):
    indices = []
    for index in range(len(text_1)):
        if len(text_1[index]) == 0 and len(text_2[index]) == 0:
            # Empty in both datasets: remove only this index
            indices.append(index)
        elif len(text_1[index]) == 0 or len(text_2[index]) == 0:
            # Empty in only one dataset: remove this index and its neighbours
            # (index-1 or index+1 may fall outside the data; such values never match a real index)
            indices.append(index-1)
            indices.append(index)
            indices.append(index+1)
    return indices

In [21]:
# Removing the empty sentences
# Returning datasets where all sentences have a counterpart in the other dataset
def remove_empty_sentences_of_texts(text_1, text_2):
    indices_for_remove = get_indices_for_remove(text_1, text_2)
    clean_text_1 = []
    clean_text_2 = []
    for index in range(len(text_1)):
        if index not in indices_for_remove:
            clean_text_1.append(text_1[index])
            clean_text_2.append(text_2[index])
    
    return clean_text_1, clean_text_2
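
One caveat on the filter above: `index not in indices_for_remove` scans a Python list, so every sentence pays a cost that grows with the number of removed indices. A minimal sketch of the same filter with a set lookup follows, assuming the indices come from `get_indices_for_remove`; `filter_pairs` is a hypothetical name and not part of the notebook.

```python
def filter_pairs(text_1, text_2, indices_for_remove):
    # Converting to a set makes each membership test O(1) on average
    drop = set(indices_for_remove)
    clean_text_1 = []
    clean_text_2 = []
    for index, (sentence_1, sentence_2) in enumerate(zip(text_1, text_2)):
        if index not in drop:
            clean_text_1.append(sentence_1)
            clean_text_2.append(sentence_2)
    return clean_text_1, clean_text_2
```

Duplicate or out-of-range indices in the input are harmless here, since the set only ever gets compared against real positions.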

**Remove empty sentences from datasets and save intermediate results**

In [22]:
text_en, text_es = remove_empty_sentences_of_texts(text_en, text_es)
# Note: Runtime: ca 10 min

In [21]:
# Save preprocessed datasets
save_text(text_en, 'preprocessed_text_en.pkl')
save_text(text_es, 'preprocessed_text_es.pkl')

Saved: preprocessed_text_en.pkl
Saved: preprocessed_text_es.pkl


**Load preprocessed datasets**

In [None]:
text_en = load_text('preprocessed_text_en.pkl')
text_es = load_text('preprocessed_text_es.pkl')

## Create smaller dataset for faster training / experimenting

In [22]:
text_en_small = text_en[0:100000]
text_es_small = text_es[0:100000]
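
Taking the first 100000 pairs keeps only the beginning of the corpus file. If a more varied subset is wanted, a seeded random sample is one alternative. This is a sketch of that option, not what the notebook does; `sample_pairs` and its parameters are hypothetical names.

```python
import random

def sample_pairs(text_1, text_2, size, seed=0):
    # A fixed seed makes the subset reproducible across runs
    rng = random.Random(seed)
    # Draw the indices once and apply them to both datasets so pairs stay aligned
    indices = sorted(rng.sample(range(len(text_1)), size))
    return [text_1[i] for i in indices], [text_2[i] for i in indices]
```

Sorting the sampled indices keeps the sentences in corpus order, which makes the subset easier to compare against the original files.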

In [23]:
save_text(text_en_small, 'preprocessed_text_en_small.pkl')
save_text(text_es_small, 'preprocessed_text_es_small.pkl')

Saved: preprocessed_text_en_small.pkl
Saved: preprocessed_text_es_small.pkl
