<img src='data/images/section-notebook-header.png' />

# This notebook is modified for the GEC (grammatical error correction) task.

Anocha S.
6/4/2024


---

# Data Preparation: Machine Translation

Machine translation in NLP refers to the use of computer algorithms and models to automatically translate text or speech from one language to another. It is a subfield of NLP that focuses on developing systems that can bridge the language barrier and facilitate communication between people who speak different languages. Machine translation systems aim to replicate the process of human translation by analyzing the structure, grammar, and meaning of the source language text and generating an equivalent translation in the target language. These systems employ various techniques, including statistical methods, rule-based approaches, and more recently, neural machine translation (NMT) models.

* **Statistical machine translation (SMT)** was the dominant approach before the rise of neural machine translation. SMT involves training models on large bilingual corpora, extracting patterns, and using statistical algorithms to generate translations. However, SMT often requires extensive manual feature engineering and may struggle with translating rare or unseen phrases.

* **Neural machine translation (NMT)** has emerged as a powerful approach, leveraging deep learning techniques to improve translation quality. NMT models, particularly sequence-to-sequence models with recurrent or transformer architectures, learn to directly map source language sentences to target language sentences. They do not rely on explicit alignment models or handcrafted linguistic features, enabling end-to-end learning.

NMT models are trained on parallel corpora, which consist of pairs of sentences in the source and target languages. During training, the models learn to encode the source sentence into a continuous representation, often using an encoder network, and then decode this representation into the target language sentence using a decoder network. The models are optimized to minimize the difference between the generated translation and the reference translation in the training data. There have been significant advancements in recent years, but challenges still remain, particularly in handling ambiguous or context-dependent language, translating idioms, and capturing nuances of meaning. Ongoing research and advancements in NLP continue to push the boundaries of machine translation capabilities.

In this notebook, we will create a dataset to train machine learning models using publicly available corpora. The scale of this dataset will be rather limited and certainly insufficient to train a state-of-the-art machine translation model.

## Setting up the Notebook

### Import Required Packages

In [1]:
import os
import pandas as pd

from collections import Counter, OrderedDict
from tqdm import tqdm

We use some utility methods from PyTorch as well as Torchtext, so we need to import the `torch` and `torchtext` packages.

In [2]:
import torch
import torchtext
from torchtext.vocab import vocab

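As usual, we rely on spaCy to perform basic text preprocessing and cleaning steps. The original machine translation notebook loads 2 language models, one for the source and one for the target language (German and English by default). For the GEC task, both the source and the target sentences are English, so only the English model is loaded and the German model is commented out.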

In [3]:
import spacy

# Tell spaCy to use the GPU (if available)
spacy.prefer_gpu()

nlp_eng = spacy.load("en_core_web_trf")
# nlp_deu = spacy.load("de_dep_news_trf")

Lastly, `src/utils.py` provides some utility methods to download and decompress files. Since the datasets used in some of the notebooks are of considerable size -- although far from huge -- they are not part of the repository and need to be downloaded (and optionally decompressed) separately. The 2 methods `download_file` and `decompress_file` accomplish this for convenience.

In [4]:
from src.utils import download_file, decompress_file, get_line_count

In [9]:
from convert2pair import c2pairTxt, dup2pairTxt

ImportError: cannot import name 'dup2pairTxt' from 'convert2pair' (/mnt/c/Users/nomar/OneDrive - National University of Singapore/CS4248 NLP/Project/ChrisRNN/convert2pair.py)

Below we set the target path where all data files are stored and read from. Since the files are quite large, they are not part of the repository but need to be downloaded first.

In [10]:
# Change to GEC path
# target_path = 'data/corpora/tatoeba/'
# target_path = 'data/corpora/sentence_pairV2/'
# target_path = 'data/corpora/Awe.datasets_binary/'
target_path = 'data/corpora/test/'

In [None]:
##########
# convert json sentence pair to csv txt file
c2pairTxt(target_path,'ABC.train.gold.bea19') #jsonPath, jsonName
c2pairTxt(target_path,'ABCN.dev.gold.bea19') #jsonPath, jsonName

In [None]:
dup2pairTxt(testPath, testName)

---

## Download & Generate Dataset

### Motivation

Training a machine translation model typically requires a large dataset of text documents and their corresponding translations in one or more target languages. Collecting corpora for training machine translation models involves several steps. Here's an overview of the process:

* **Determine Source and Target Languages:** First, identify the specific languages you want to train your machine translation model on. The choice of languages depends on your target audience and the availability of resources in those languages.

* **Obtain Parallel Corpora:** Parallel corpora are collections of sentences or texts that have translations available in both the source and target languages. These corpora are essential for training machine translation models. There are several sources to consider:

	* Publicly Available Corpora: Explore publicly available parallel corpora, such as those provided by research institutions, organizations, or language-related projects. Some examples include the Europarl corpus, United Nations documents, or the Tatoeba project.

	* Government and Legal Translations: Government websites, legislative documents, legal agreements, and court proceedings often have translated versions available. These can be a valuable resource for specific domains or language pairs.

	* News Articles and Publications: News organizations and publishers may have translated articles or publications that can be used as parallel corpora. This can provide a diverse range of topics and sentence structures.

	* Crowdsourcing: Consider utilizing crowdsourcing platforms to collect translations. Platforms like Amazon Mechanical Turk or specialized translation communities can help in gathering sentence pairs for your target language pair.

* **Ensure Data Quality and Preprocessing:** After obtaining parallel corpora, it is important to ensure data quality and perform preprocessing steps. This includes:

	* Removing noisy or irrelevant data: Review the data and remove any sentences or segments that are low quality, incorrect, or contain undesirable characteristics.

	* Tokenization: Tokenize sentences into individual words or subword units. This step is crucial for building vocabulary and preparing data for input to the machine translation model.

	* Cleaning and Normalization: Normalize the data by removing unnecessary punctuation, correcting spelling mistakes, or handling special characters specific to the languages.

* **Align Sentences:** For training machine translation models, it is crucial to align the sentences in the parallel corpora, i.e., to establish which sentence in the source language corresponds to which sentence in the target language. Alignment can be done manually or with the help of sentence alignment tools such as hunalign or Bleualign. Tools such as fast_align or GIZA++ address the related but finer-grained problem of word alignment within already aligned sentence pairs.

* **Corpus Size and Balance:** Consider the size of your corpus. Larger corpora can provide better coverage and generalization. Additionally, ensure the balance between the source and target languages, so that both languages have roughly equal representation to avoid bias in translation quality.

* **Pretraining and Fine-tuning:** Machine translation models, particularly neural network-based models, often benefit from pretraining on a large dataset and then fine-tuning on a domain-specific or smaller dataset. This allows the model to learn general language patterns before focusing on the specific translation task.

It is important to note that collecting corpora for machine translation can be a complex and time-consuming process, especially for low-resource languages or specialized domains. The availability of quality parallel corpora directly affects the translation quality of the trained models. Therefore, it is essential to invest effort into obtaining high-quality and diverse corpora for effective machine translation training.
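The cleaning and filtering steps listed above can be illustrated with a minimal sketch. The function name `clean_pairs` and the thresholds `max_len_ratio` and `max_tokens` are assumptions chosen for illustration; they are not part of this notebook's pipeline, and suitable values depend on the corpus.

```python
def clean_pairs(pairs, max_len_ratio=3.0, max_tokens=100):
    """Filter (source, target) pairs with simple, language-agnostic heuristics."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Normalize whitespace on both sides
        src, tgt = " ".join(src.split()), " ".join(tgt.split())
        # Drop pairs where one side is empty
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overly long sentences
        if max(n_src, n_tgt) > max_tokens:
            continue
        # Drop pairs whose lengths differ too much (possibly misaligned)
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue
        # Drop exact duplicates
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


pairs = [
    ("Ich  mag Katzen.", "I like cats."),
    ("Ich mag Katzen.", "I like cats."),  # duplicate after whitespace normalization
    ("Hallo", ""),                        # empty target
    ("Ja", "Yes , that is absolutely and completely right"),  # suspicious length ratio
]
print(clean_pairs(pairs))
```

Whitespace splitting is only a rough proxy for token counts; a real pipeline would more likely reuse the tokenizer that is applied later on.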

### Data Source: Tatoeba

In this notebook we rely on [Tatoeba](https://tatoeba.org/en/) to collect our text corpus for generating our dataset(s). The Tatoeba website is a collaborative online platform that aims to collect and provide example sentences and translations in multiple languages. The word "tatoeba" means "for example" in Japanese, reflecting the purpose of the platform—to provide examples for various languages and contexts. The website's main goal is to create a large and diverse sentence database that can be used for language learning, translation, and linguistic research.

Users of Tatoeba can contribute by submitting new sentences in any language, along with their translations into other languages. The sentences can cover a wide range of topics, allowing learners and researchers to explore different domains and language usage. The website follows a community-driven approach, where registered users can suggest corrections, discuss translations, and engage in collaborative efforts to improve the quality and accuracy of the sentence database.

Tatoeba provides various features and tools to facilitate language exploration and learning. Users can search for sentences, filter by language, browse through curated lists, and save their favorite sentences. The translations provided on Tatoeba are typically contributed by volunteers, so the quality may vary, but the community actively works on improving and reviewing the translations over time.

The Tatoeba project promotes open data and open-source principles. The sentence database and its source code are freely available, allowing others to reuse and build upon them. This openness enables researchers, developers, and language enthusiasts to create applications, tools, and resources that leverage the sentence data for diverse language-related tasks.

### Auxiliary Method for Data Collection

Tatoeba makes all sentences of a language available as a single compressed file. However, downloading these 2 files for the source and target language is not sufficient, as they do not state which sentence in the source file matches which sentence(s) in the target file. Making this connection requires an additional file that contains the links between sentences of different languages. The method `generate_sentence_pairs()` in the code cell below automates this process. It takes the identifiers of the source and target language as input and creates a new text file containing all matching sentence pairs between the two languages. While the first half of the method handles downloading and decompressing the required files, the second half performs the linking between the 2 language files.

In [6]:
# # No need to call this since we have all sentence_pairs from Laing

# def generate_sentence_pairs(src_lang, tgt_lang, target_path, overwrite=False):
#     output_file_name = target_path+'ABC.train.gold.bea19-{}-{}.txt'.format(src_lang, tgt_lang)
    
#     # Check if file exists; overwrite if specified
#     if os.path.isfile(output_file_name) == True and overwrite is not True:
#         print('Output file "{}" already exists.'.format(output_file_name))
#         return output_file_name
    
#     print('Download files...')
#     raw_src = download_file('https://downloads.tatoeba.org/exports/per_language/{}/{}_sentences.tsv.bz2'.format(src_lang, src_lang), target_path, overwrite=overwrite)
#     raw_tgt = download_file('https://downloads.tatoeba.org/exports/per_language/{}/{}_sentences.tsv.bz2'.format(tgt_lang, tgt_lang), target_path, overwrite=overwrite)    
#     raw_lnk = download_file('https://downloads.tatoeba.org/exports/links.tar.bz2', target_path, overwrite=overwrite)
    
#     print('Decompress files...')
#     src = decompress_file(raw_src, target_path)
#     tgt = decompress_file(raw_tgt, target_path)
#     lnk = decompress_file('data/corpora/tatoeba/links.tar.bz2', target_path)
#     lnk= decompress_file('data/corpora/tatoeba/links.tar', target_path)
    
#     print('Link language files...')
#     df_src = pd.read_csv(src, sep='\t', header=None)
#     df_tgt = pd.read_csv(tgt, sep='\t', header=None)
#     df_links = pd.read_csv(target_path+'links.csv', sep='\t', header=None)
    
#     src_ids = set(df_src[0])
#     tgt_ids = set(df_tgt[0])
    
#     df_links = df_links[df_links[0].isin(src_ids) & df_links[1].isin(tgt_ids)]
    
#     num_pairs = len(df_links)
    
#     print('Generate output file...')
#     output_file = open(output_file_name, 'w')
#     with tqdm(total=num_pairs) as progress_bar:
#         for index, row in df_links.iterrows():
#             try:
#                 src_row = df_src[df_src[0] == row[0]].to_records(index=False)[0]
#                 tgt_row = df_tgt[df_tgt[0] == row[1]].to_records(index=False)[0]
#                 output_file.write('{}\t{}\n'.format(src_row[2], tgt_row[2]))
#             except Exception as e:
#                 pass
#             finally:
#                 progress_bar.update(1)
    
#     print('DONE')
#     return output_file_name
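The linking step in the (commented-out) method above looks up every pair by scanning the two DataFrames. As a rough alternative sketch, the same join can be done with dictionary lookups. The function name `link_sentences` is hypothetical, and the sketch assumes the column layout that the code above relies on: sentence files with `id`, `language`, `text` columns and a links file with `source id`, `target id` columns, all tab-separated.

```python
import io


def link_sentences(src_lines, tgt_lines, link_lines):
    """Return (source text, target text) pairs from tab-separated line iterables."""

    def read_sentences(lines):
        # Map sentence id -> text (columns: id, language, text)
        id_to_text = {}
        for line in lines:
            fields = line.rstrip("\n").split("\t", 2)
            if len(fields) == 3:
                id_to_text[fields[0]] = fields[2]
        return id_to_text

    src_text = read_sentences(src_lines)
    tgt_text = read_sentences(tgt_lines)

    pairs = []
    for line in link_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue
        src_id, tgt_id = fields
        # Keep only links that go from a source sentence to a target sentence
        if src_id in src_text and tgt_id in tgt_text:
            pairs.append((src_text[src_id], tgt_text[tgt_id]))
    return pairs


# Tiny in-memory stand-ins for the three files
src_file = io.StringIO("1\tdeu\tGuten Morgen.\n2\tdeu\tDanke.\n")
tgt_file = io.StringIO("10\teng\tGood morning.\n11\teng\tThanks.\n")
lnk_file = io.StringIO("1\t10\n2\t11\n3\t12\n10\t1\n")

for src, tgt in link_sentences(src_file, tgt_file, lnk_file):
    print("{}\t{}".format(src, tgt))
```

Each link is resolved in constant time, so the total cost grows with the size of the three files, not with the number of links times the number of sentences.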

In this and subsequent notebooks, by default, our goal is to build and train a machine translation model for translating German sentences into English. This means that our source language is German (Tatoeba identifier: `deu`) and our target language is English (Tatoeba identifier: `eng`). Let's call `generate_sentence_pairs()` to download all required files and prepare our dataset file containing the matching sentence pairs.

**Important:** If you look at the code of `generate_sentence_pairs()`, it assumes that the language files are accessible via certain URLs. In principle, these URLs might change over time. So if the code cell below throws an error indicating that the URLs are invalid, we recommend going to the Tatoeba website to check for the new URLs and updating the method above accordingly.

In [20]:
# # No need to call this since we have all sentence_pairs from Laing
# dataset_file_name = generate_sentence_pairs('src', 'tgt', target_path, overwrite=False)

# # The method returns the file name of our dataset, as we need that later to read the file.
# print(dataset_file_name)

Download files...



0.00iB [00:00, ?iB/s]

FileNotFoundError: [Errno 2] No such file or directory: 'data/corpora/sentence_pairV2/src_sentences.tsv.bz2'

## Generate Dataset

### Auxiliary Methods

The code cell below defines 2 auxiliary methods to "preprocess" the sentences with respect to their language. Since the task is machine translation, the preprocessing is rather minimal and here limited to lowercasing and tokenization. Other steps such as stopword removal or lemmatization are of course not appropriate here.

Keep in mind that in practice, many additional steps are conceivable. For example, one can replace each number with a placeholder token and put the number back in place of this token in the translation. This is also often done with named entities such as the names of people or locations, as they are commonly not translated, and it is an easy way to limit the size of the vocabularies. But as usual, the goal is not to build a state-of-the-art translation model, so we ignore such more sophisticated considerations here.

In [11]:
def tokenize_eng(text):
    return [token.text.lower() for token in nlp_eng.tokenizer(text)]

# def tokenize_deu(text):
#     return [token.text.lower() for token in nlp_deu.tokenizer(text)]
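The number-placeholder idea mentioned above can be sketched as follows. The token `<num>`, the regular expression, and the function names are assumptions made for illustration; the sketch also assumes that numbers appear in the same order in the input and in the model output, which does not always hold in practice.

```python
import re

NUM_TOKEN = "<num>"  # hypothetical placeholder token
# Matches integers and numbers with thousands/decimal separators, e.g. 3 or 1,250.50
NUM_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(text):
    """Replace every number with NUM_TOKEN and return the removed numbers in order."""
    numbers = NUM_PATTERN.findall(text)
    masked = NUM_PATTERN.sub(NUM_TOKEN, text)
    return masked, numbers


def unmask_numbers(text, numbers):
    """Put the numbers back, one per placeholder, from left to right."""
    remaining = iter(numbers)
    # If there are more placeholders than numbers, leave the extra placeholders as-is
    return re.sub(re.escape(NUM_TOKEN), lambda match: next(remaining, NUM_TOKEN), text)


sentence = "She paid 1,250.50 dollars for 3 tickets."
masked, numbers = mask_numbers(sentence)
print(masked)
print(numbers)
print(unmask_numbers(masked, numbers))
```

Note that the placeholder must also survive tokenization as a single token, which may need extra handling depending on the tokenizer.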

### Create Vocabularies

In previous Data Preparation notebooks, we already went through the basic steps of creating a vocabulary and vectorizing a corpus of text documents multiple times. We therefore keep it short in this notebook and put all the required code into a single code cell. But again, all this code should look very familiar if you went through earlier notebooks where we prepared a dataset for tasks such as sentiment analysis or language models.

The main difference here is that we need to create 2 vocabularies, one for each language.

In [13]:
## preparing .vocab from sentence_pair.txt


## output_file_name = target_path+'tatoeba-{}-{}.txt'.format(src_lang, tgt_lang)
# input sentence_pair path
dataset_file_name = target_path + 'ABCN.test.bea19.pairtxt'#'valid.pairstxt' #'ABC.train.gold.bea19.txt'#'ABCN.dev.gold.bea19.txt' 
## Create Counter to get word frequencies for Source and Target
token_counter_src = Counter()
token_counter_tgt = Counter()

num_samples = get_line_count(dataset_file_name)

## Read file line by line
with open(dataset_file_name) as file:
    with tqdm(total=num_samples) as t:
        for line in file:
            line = line.strip()
            try:
                # The source sentence comes first, then the target sentence
                # (for GEC, both are English, so both use the English tokenizer)
                src, tgt = line.split("\t")
                # Update source token counts
                for token in tokenize_eng(src):
                    token_counter_src[token] += 1
                # Update target token counts
                for token in tokenize_eng(tgt):
                    token_counter_tgt[token] += 1
            except ValueError:
                # Skip malformed lines (not exactly 2 tab-separated fields)
                pass
            finally:
                # Update progress bar
                t.update(1)

## Sort word frequencies and convert to an OrderedDict
token_counter_src_sorted = sorted(token_counter_src.items(), key=lambda x: x[1], reverse=True)
token_counter_tgt_sorted = sorted(token_counter_tgt.items(), key=lambda x: x[1], reverse=True)

# Limit the maximum size of the vocabulary (note that 20k is somewhat arbitrary,
# and it's also not obvious whether we should use the same value for source and target)
max_words = 20000
token_ordered_src_dict = OrderedDict(token_counter_src_sorted[:max_words])
token_ordered_tgt_dict = OrderedDict(token_counter_tgt_sorted[:max_words])

# Create vocabularies for source and target (note that we add a couple of special tokens you might have not seen yet,
# but you can ignore them here as they don't harm training our RNN-based encoder-decoder model)
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"
SOS_TOKEN = "<SOS>"
EOS_TOKEN = "<EOS>"
CLS_TOKEN = "<CLS>"
SEP_TOKEN = "<SEP>"

SPECIALS = [PAD_TOKEN, UNK_TOKEN, SOS_TOKEN, EOS_TOKEN, CLS_TOKEN, SEP_TOKEN]

# Create vocab objects
vocab_src = vocab(token_ordered_src_dict, specials=SPECIALS)
vocab_tgt = vocab(token_ordered_tgt_dict, specials=SPECIALS)

# Set index of default token (i.e., the index that gets returned in case of unknown words)
vocab_src.set_default_index(vocab_src[UNK_TOKEN])
vocab_tgt.set_default_index(vocab_tgt[UNK_TOKEN])

print("Size of Src vocabulary: {}".format(len(vocab_src)))
print("Size of Tgt vocabulary: {}".format(len(vocab_tgt)))

100%|█████████████████████████████████████████████████████████████████████████████| 4477/4477 [00:00<00:00, 8689.03it/s]

Size of Src vocabulary: 7675
Size of Tgt vocabulary: 7675





As usual, we need to save both vocabularies for later use (i.e., for when we want to train our model).

In [14]:
max_words = 20000

In [15]:
vocab_src_file_name = target_path+ 'test-src.vocab' #'ABC.train-src-{}.vocab'.format(max_words)
vocab_tgt_file_name = target_path+ 'test-tgt.vocab' #'ABC.train-tgt-{}.vocab'.format(max_words)

torch.save(vocab_tgt, vocab_tgt_file_name)
torch.save(vocab_src, vocab_src_file_name)

### Vectorize Sentences

With both vocabularies, we can now vectorize all sentences in the source and target language.

In [18]:
## In case of unseen sentences, keep using train vocab index
src_path = 'data/corpora/Awe.datasets_binary/'
vocab_src = torch.load( src_path+ 'train-src.vocab')# "ABC.train-src-20000.vocab")
vocab_tgt = torch.load( src_path+ 'train-tgt.vocab')#"ABC.train-tgt-20000.vocab")

In [20]:
dataset_file_name = target_path + 'ABCN.test.bea19.pairtxt'  # or "ABC.train.gold.bea19.txt"
output_file_name = target_path + 'test-vectorized.txt'  # or 'ABC.train-src-tgt-vectorized.txt'
num_samples = get_line_count(dataset_file_name)

# Both files are closed automatically when the with-block ends
with open(dataset_file_name) as file, open(output_file_name, 'w') as output_file:
    with tqdm(total=num_samples) as t:
        for line in file:
            line = line.strip()
            try:
                src, tgt = line.split("\t")
                # Vectorize both texts
                src_vec = vocab_src.lookup_indices(tokenize_eng(src))
                tgt_vec = vocab_tgt.lookup_indices(tokenize_eng(tgt))
                # Write both index sequences to the output file (use tab as separator)
                output_file.write("{}\t{}\n".format(" ".join(str(idx) for idx in src_vec), " ".join(str(idx) for idx in tgt_vec)))
            except ValueError:
                # Malformed line in the sentence pair file (not exactly 2 tab-separated fields).
                # Print it so that it can also be removed from the M2 file, because the M2 file
                # and our hypothesis (output) must contain the same number of sentences.
                print("====\nRejecting ...\n", line, "====")
            finally:
                # Update progress bar
                t.update(1)

FileNotFoundError: [Errno 2] No such file or directory: 'data/corpora/test/ABCN.test.bea19'
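For reference, each line of the vectorized file written by the cell above has the form `src_indices<TAB>tgt_indices`, with the indices separated by spaces. A minimal sketch of reading such a line back could look like this; the function name `parse_vectorized_line` and the example indices are made up for illustration.

```python
def parse_vectorized_line(line):
    """Parse one 'src_indices<TAB>tgt_indices' line into two lists of ints."""
    src_part, tgt_part = line.rstrip("\n").split("\t")
    src_vec = [int(idx) for idx in src_part.split()]
    tgt_vec = [int(idx) for idx in tgt_part.split()]
    return src_vec, tgt_vec


src_vec, tgt_vec = parse_vectorized_line("12 7 431 5\t12 8 431 5\n")
print(src_vec)
print(tgt_vec)
```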

## Summary

Collecting and preparing datasets for training machine translation models is a crucial step in building high-quality translation systems. The process involves sourcing parallel corpora, which consist of sentences or texts in the source language along with their translations in the target language. These corpora can be obtained from various sources such as public repositories, government documents, news articles, or through crowdsourcing.

Once the parallel corpora are acquired, preprocessing steps are applied to clean and normalize the data. This includes tokenization to split sentences into words or subword units, lowercasing, handling special characters, and cleaning the text by removing noise, punctuation, and irrelevant information. Sentence alignment ensures that each sentence in the source language corresponds correctly to its translation in the target language.

Furthermore, data augmentation techniques can be employed to increase the dataset size and improve model performance. Techniques like back-translation and sentence length filtering can be applied to augment and filter the dataset, respectively. These preprocessing steps ensure that the data is in a suitable format for training the machine translation models, enabling effective learning of language patterns and translation mappings.