# Corpus Enlargement🔍
This notebook takes the Catalan-English parallel corpus stored in ./data and applies several augmentation techniques to create new files, which are used later to train different models.\
The resulting datasets and the models trained on them are summarised in the spreadsheet "Data & Models Summary".

## Previous steps
Imports, mounting drive, setting the working directory.

In [None]:
!pip install transformers
!pip install sentencepiece
!pip install --upgrade torch==2.0.0 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install sacremoses

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1
Looking in indexes: https://pypi.org/simple, https://

In [None]:
from transformers import MarianTokenizer, MarianMTModel
from typing import List
import os
from os import path
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
import random

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data_path = "/content/drive/MyDrive/Final-Project-MT/data"

## Backtranslation
For this technique, we use the translation systems available at https://huggingface.co/Helsinki-NLP.

First we select the models we are going to use.

In [None]:
target_source = "Helsinki-NLP/opus-mt-en-ca" 
target_similarS = "Helsinki-NLP/opus-mt-en-es" 
similarS_source = "Helsinki-NLP/opus-mt-es-ca"
similarS_target = "Helsinki-NLP/opus-mt-es-en"
target_similarT = "Helsinki-NLP/opus-mt-en-de"
similarT_source = "Helsinki-NLP/opus-mt-de-ca"
similarT_target = "Helsinki-NLP/opus-mt-de-en"
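
The identifiers above all follow the pattern `Helsinki-NLP/opus-mt-{src}-{tgt}`. As a minimal sketch, a helper could build them from two language codes. The name `opus_mt_name` is hypothetical and not part of this notebook, and a well-formed identifier does not guarantee that a model for that language pair exists on the Hub.

```python
def opus_mt_name(src: str, tgt: str) -> str:
    """Build a Helsinki-NLP OPUS-MT model identifier from two language codes."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"

# Rebuild two of the identifiers listed above from their language codes
print(opus_mt_name("en", "ca"))
print(opus_mt_name("de", "ca"))
```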

We define a translate function.

In [None]:
def translate(model_name, input_file, output_file_source, output_file_target, num_lines=2000):
  # This function takes as input the name of the model it will translate with, the input file we want translated,
  # the output file for the truncated source language, the output file for the translated target language
  # and the number of lines of the file we want to translate.

  # We initialize the model and the tokenizer
  model = MarianMTModel.from_pretrained(model_name)
  tokenizer = MarianTokenizer.from_pretrained(model_name)

  # We read the input file and keep only the first num_lines lines (fewer if the file is shorter)
  with open(input_file, "r", encoding="utf-8") as f:
    src_lines = [line.rstrip("\n") for line in f.readlines()[:num_lines]]

  # We open both output files in write mode, so rerunning a cell overwrites the files instead of appending to them
  with open(output_file_source, "w", encoding="utf-8") as src_file, \
       open(output_file_target, "w", encoding="utf-8") as tgt_file:
    for src_text in src_lines:
      # We translate and detokenize the sentence
      translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
      tgt_text = tokenizer.decode(translated[0], skip_special_tokens=True)

      # We store the source line and its translation on the same line number of each file
      src_file.write(src_text + "\n")
      tgt_file.write(tgt_text + "\n")
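
The function above translates one sentence per `generate` call. A common way to speed this up is to group the lines into batches. The sketch below shows only the batching logic and uses a stand-in translator so that it runs without downloading a model; `batched`, `translate_in_batches` and `fake_translate_batch` are hypothetical names that do not appear elsewhere in this notebook.

```python
from typing import Callable, Iterator, List

def batched(lines: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size lines."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]

def translate_in_batches(lines: List[str],
                         translate_batch: Callable[[List[str]], List[str]],
                         batch_size: int = 16) -> List[str]:
    """Apply translate_batch chunk by chunk and keep the original line order."""
    output: List[str] = []
    for chunk in batched(lines, batch_size):
        output.extend(translate_batch(chunk))
    return output

def fake_translate_batch(chunk: List[str]) -> List[str]:
    """Stand-in for a real model: upper-cases each sentence."""
    return [sentence.upper() for sentence in chunk]

print(translate_in_batches(["bon dia", "gràcies", "adéu"], fake_translate_batch, batch_size=2))
```

With MarianMT, `translate_batch` could presumably wrap the same `tokenizer(..., return_tensors="pt", padding=True)` and `model.generate` calls used above, passing a list of sentences instead of a single one. A suitable batch size depends on the available memory and is an assumption here.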

Finally, we use the *translate* function and the bash command *cat* to translate data and combine partial corpora to get our final corpora.

Backtranslation from English to Catalan.

In [None]:
translate(target_source,data_path+"/mono.en", data_path+"/bckr.source.en",data_path+"/bckr.target.ca")
!cat $data_path/train.ca $data_path/bckr.target.ca  > $data_path/bckr.train.ca
!cat $data_path/train.en $data_path/bckr.source.en  > $data_path/bckr.train.en
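
The `cat` calls above append the synthetic lines after the original training lines. Where shell commands are not available, the same step can be sketched in Python. `concat_files` is a hypothetical helper, not part of this notebook.

```python
import shutil
from typing import List

def concat_files(inputs: List[str], output: str) -> None:
    """Write the contents of every input file, in order, to output (like `cat a b > out`)."""
    with open(output, "wb") as out:
        for name in inputs:
            with open(name, "rb") as src:
                # Copy the raw bytes so that the encoding is left untouched
                shutil.copyfileobj(src, out)
```

For example, `concat_files([data_path + "/train.ca", data_path + "/bckr.target.ca"], data_path + "/bckr.train.ca")` would mirror the first `cat` command.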

Backtranslation from English to Spanish. The Spanish output is placed directly on the Catalan side of the corpus.

In [None]:
translate(target_similarS,data_path+"/mono.en", data_path+"/bckr.source.en",data_path+"/bckr.target.es")
!cat $data_path/train.ca $data_path/bckr.target.es  > $data_path/bckr.train.es

Backtranslation using a pivot language similar to the source: from English to Spanish to Catalan.

In [None]:
translate(target_similarS,data_path+"/mono.en", data_path+"/bckr.source.en",data_path+"/bckr.target.es")
translate(similarS_source,data_path+"/bckr.target.es", data_path+"/bckr.source.es",data_path+"/bckr.target.ca.from_es")
!cat $data_path/train.ca $data_path/bckr.target.ca.from_es  > $data_path/bckr.train.ca.from_es
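
The data flow of the pivot approach can be sketched with stand-in translators that only tag their input, which makes the two hops visible without loading any model. All names in this block (`pivot_backtranslate`, `en_to_es`, `es_to_ca`) are hypothetical.

```python
from typing import Callable, List, Tuple

Translator = Callable[[List[str]], List[str]]

def pivot_backtranslate(mono_target: List[str],
                        to_pivot: Translator,
                        to_source: Translator) -> Tuple[List[str], List[str]]:
    """Translate target-language lines into the source language through a pivot.

    Returns (synthetic_source, original_target) as two aligned lists.
    """
    pivot_lines = to_pivot(mono_target)        # first hop, e.g. English to Spanish
    synthetic_source = to_source(pivot_lines)  # second hop, e.g. Spanish to Catalan
    return synthetic_source, mono_target

def en_to_es(lines: List[str]) -> List[str]:
    """Stand-in for the English-to-Spanish model."""
    return [f"es({s})" for s in lines]

def es_to_ca(lines: List[str]) -> List[str]:
    """Stand-in for the Spanish-to-Catalan model."""
    return [f"ca({s})" for s in lines]

source, target = pivot_backtranslate(["hello", "thanks"], en_to_es, es_to_ca)
print(list(zip(source, target)))
```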

Backtranslation using a pivot language similar to the source (English to Spanish to Catalan), while also round-tripping the target through the pivot (English to Spanish to English).

In [None]:
translate(target_similarS,data_path+"/mono.en", data_path+"/bckr.source.en",data_path+"/bckr.target.es")
translate(similarS_source,data_path+"/bckr.target.es", data_path+"/bckr.source.es",data_path+"/bckr.target.ca.from_es")
translate(similarS_target,data_path+"/bckr.target.es", data_path+"/bckr.source.es",data_path+"/bckr.target.en.from_es")
!cat $data_path/train.ca $data_path/bckr.target.ca.from_es  > $data_path/bckr.train.ca.from_es
!cat $data_path/train.en $data_path/bckr.target.en.from_es  > $data_path/bckr.train.en.from_es

Backtranslation using a pivot language similar to the target: from English to German to Catalan.

In [None]:
translate(target_similarT,data_path+"/mono.en", data_path+"/bckr.source.en",data_path+"/bckr.target.de")
translate(similarT_source,data_path+"/bckr.target.de", data_path+"/bckr.source.de",data_path+"/bckr.target.ca.from_de")
!cat $data_path/train.ca $data_path/bckr.target.ca.from_de  > $data_path/bckr.train.ca.from_de

Backtranslation using a pivot language similar to the target (English to German to Catalan), while also round-tripping the target through the pivot (English to German to English).

In [None]:
translate(target_similarT,data_path+"/mono.en", data_path+"/bckr.source.en",data_path+"/bckr.target.de")
translate(similarT_source,data_path+"/bckr.target.de", data_path+"/bckr.source.de",data_path+"/bckr.target.ca.from_de")
translate(similarT_target,data_path+"/bckr.target.de", data_path+"/bckr.source.de",data_path+"/bckr.target.en.from_de")
!cat $data_path/train.ca $data_path/bckr.target.ca.from_de  > $data_path/bckr.train.ca.from_de
!cat $data_path/train.en $data_path/bckr.target.en.from_de  > $data_path/bckr.train.en.from_de

## Copying

We append the target-language (English) sentences to the source side and duplicate them on the target side, so each added pair is an English sentence paired with itself.

In [None]:
!cat  $data_path/train.ca $data_path/train.en > $data_path/train.copy.ca
!cat  $data_path/train.en $data_path/train.en > $data_path/train.copy.en
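
On in-memory lists, the copying step amounts to the following sketch. `copy_augment` is a hypothetical helper and the sentences are toy examples, not lines from the corpus.

```python
from typing import List, Tuple

def copy_augment(source_lines: List[str], target_lines: List[str]) -> Tuple[List[str], List[str]]:
    """Append the target sentences to both sides, pairing each one with itself."""
    augmented_source = source_lines + target_lines
    augmented_target = target_lines + target_lines
    return augmented_source, augmented_target

ca = ["Bon dia", "Gràcies"]
en = ["Good morning", "Thank you"]
aug_ca, aug_en = copy_augment(ca, en)
for pair in zip(aug_ca, aug_en):
    print(pair)
```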

## Synonyms
For this technique we use WordNet, available in the NLTK package, to replace some words with their synonyms.

We define three functions: one that returns a synonym for a given word, one that replaces some words of a sentence with their synonyms, and one that reads an input file and writes a new file in which words on each line are replaced with synonyms.

In [None]:
def get_synonym(w):
  # We input a word and get a synonym that can be a word or a multiword entity
  synonyms = []

  # We iterate over the WordNet synsets of the word
  for syn in wn.synsets(w):
    # We get all the lemmas of that synset and append their names to the list of synonyms
    for lemma in syn.lemmas():
      synonyms.append(lemma.name())

  # If the synonym list is not empty we pick one entry at random; if not, we return the same word
  if synonyms:
    return random.choice(synonyms).replace("_", " ")
  else:
    return w

def replace_synonym_sentence(s):
  # We input a sentence and get the same sentence with synonym substitution

  # We split the sentence into a list of words
  sentence_list = s.split(" ")

  # We iterate over the words together with their positions
  for c, word in enumerate(sentence_list):
    # We substitute each word by a synonym with a probability of 70%
    if random.random() < 0.7:
      if word.endswith((".", ",", "?", "!")):
        # We look up the word without its final punctuation mark and then put the mark back
        sentence_list[c] = get_synonym(word[:-1]) + word[-1]
      else:
        sentence_list[c] = get_synonym(word)
  return " ".join(sentence_list)

def replace_synonym_file(input_file, output_file):
  # We input 2 file paths: the file whose words we want to substitute and the place where we want to save the result

  # We read the input file
  with open(input_file, "r") as f:
    ff = f.readlines()

  # We write one output line per input line, so the output stays aligned with the input line by line
  with open(output_file, "w") as tgt_file:
    for text in ff:
      # For each line we apply the replace_synonym_sentence function
      tgt_file.write(replace_synonym_sentence(text))
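
The per-word replacement probability can be illustrated without WordNet. The sketch below swaps in a toy synonym table and a seeded `random.Random` instance so that runs are reproducible; `TOY_SYNONYMS` and `replace_with_probability` are hypothetical and only approximate the functions above.

```python
import random
from typing import Dict, List

# Toy synonym table standing in for WordNet lookups
TOY_SYNONYMS: Dict[str, List[str]] = {"big": ["large", "huge"], "house": ["home"]}

def replace_with_probability(sentence: str,
                             synonyms: Dict[str, List[str]],
                             p: float,
                             rng: random.Random) -> str:
    """Replace each word that has synonyms with probability p."""
    out = []
    for word in sentence.split(" "):
        candidates = synonyms.get(word)
        # rng.random() lies in [0, 1), so p=1.0 always replaces and p=0.0 never does
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(0)  # fixed seed for reproducibility
print(replace_with_probability("a big house", TOY_SYNONYMS, 0.7, rng))
```

Passing an explicit random generator keeps the augmentation reproducible, which may help when comparing models trained on different versions of the synonym corpus.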

We use the last function and the *cat* command to create our enlarged dataset by replacing English words with their synonyms.

In [None]:
replace_synonym_file(data_path+"/train.en",data_path+"/syn.target.en")
!cat $data_path/train.en $data_path/syn.target.en  > $data_path/syn.train.en
!cat $data_path/train.ca $data_path/train.ca > $data_path/syn.train.ca

## Combining
We combine the partial corpora from the backtranslation and copying steps into one large corpus.

In [None]:
!cat $data_path/train.ca $data_path/bckr.target.ca $data_path/bckr.target.es $data_path/bckr.target.ca.from_es $data_path/bckr.target.ca.from_es $data_path/bckr.target.ca.from_de $data_path/bckr.target.ca.from_de $data_path/train.en > $data_path/train.monster.ca
!cat $data_path/train.en $data_path/bckr.source.en  $data_path/bckr.source.en $data_path/bckr.source.en $data_path/bckr.target.en.from_es $data_path/bckr.source.en $data_path/bckr.target.en.from_de $data_path/train.en > $data_path/train.monster.en
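
Because the combined corpus is built from many files, it can be worth checking that both sides end up with the same number of lines. The following is a minimal sketch of such a check; `count_lines` and `check_parallel` are hypothetical helpers, and the demonstration runs on two temporary toy files rather than on the real corpus.

```python
import os
import tempfile

def count_lines(path: str) -> int:
    """Count the lines of a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_parallel(src_path: str, tgt_path: str) -> int:
    """Return the shared line count, or raise if the two sides differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}")
    return n_src

# Toy demonstration on two temporary files (not the real corpus)
with tempfile.TemporaryDirectory() as tmp:
    ca_path = os.path.join(tmp, "toy.ca")
    en_path = os.path.join(tmp, "toy.en")
    with open(ca_path, "w", encoding="utf-8") as f:
        f.write("Bon dia\nGràcies\n")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("Good morning\nThank you\n")
    print(check_parallel(ca_path, en_path))
```

On the real data, this would be called as `check_parallel(data_path + "/train.monster.ca", data_path + "/train.monster.en")`.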