In this notebook, I applied several machine translation models to a sample of Croatian data to explore which model is the most suitable for the task.

I used the models to translate from Croatian to English, but the same pipeline can be used for any language pair that the models support.

I experimented with the following models:
* OPUS-MT models, used through the [EasyNMT library](https://github.com/UKPLab/EasyNMT): [Helsinki-NLP/opus-mt-zls-en](https://huggingface.co/Helsinki-NLP/opus-mt-zls-en)
* eTranslation (I obtained the translations manually from its platform)

## Data Preparation

In [16]:
# Load the data: ParlaMint sample
with open("/kaggle/input/parlamintsample/ParlaMint-HR_S01_sample.txt", "r", encoding="utf-8") as f:
    file = f.read()

file

'Cijenjene gospođe i gospodo, sukladno članku 4. stavak 2. Poslovnika Hrvatskog sabora pripala mi je dužnost i čast otvoriti prvu konstituirajuću sjednicu Hrvatskog sabora i privremeno joj predsjedati do izbora novog predsjednika Hrvatskog sabora. Sve vas srdačno pozdravljam i čestitam na izboru za zastupnice i zastupnike u Hrvatski sabor. Poštovane gospođe i gospodo, cijenjeni uzvanici čast mi je i zadovoljstvo posebno pozdraviti predsjednicu Republike Hrvatske gospođu Kolindu Grabar Kitarović, predsjednika Vlade Republike Hrvatske gospodina Tihomira Oreškovića i sve nazočne ministre. Posebno pozdravljam mandatara za sastavljanje Vlade Republike Hrvatske gospodina Andreja Plenkovića. Pozdravljam predsjednika Ustavnog suda gospodina Miroslava Šeparovića i predsjednika Vrhovnog suda gospodina Branka Hrvatina. Pozdravljam načelnika Glavnog stožera Oružanih snaga Republike Hrvatske generala zbora Mirka Šundova. \nDa bismo to ostvarili parafrazirat ću Charlesa Pickeringa koji je rekao „Zdr

In [3]:
# We need to tokenize the file into sentences.

# Install NLTK
!pip install --user -U -q nltk
from nltk.tokenize import sent_tokenize, word_tokenize

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.8 which is incompatible.[0m[31m
[0m

In [17]:
# Split the file into sentences
sentence_list = sent_tokenize(file)
len(sentence_list)

52

In [19]:
# Use only the first 42 sentences, to match the size of the Slovene sample
sentence_list = sentence_list[:42]
len(sentence_list)

42

In [20]:
# View the beginning of the list
sentence_list[:3]

['Cijenjene gospođe i gospodo, sukladno članku 4. stavak 2.',
 'Poslovnika Hrvatskog sabora pripala mi je dužnost i čast otvoriti prvu konstituirajuću sjednicu Hrvatskog sabora i privremeno joj predsjedati do izbora novog predsjednika Hrvatskog sabora.',
 'Sve vas srdačno pozdravljam i čestitam na izboru za zastupnice i zastupnike u Hrvatski sabor.']
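Note that in the output above the first sentence is split after the ordinal in "stavak 2.", so the first two items are really one sentence. The snippet below is a minimal sketch of one possible post-processing step, not something used in this notebook: `merge_ordinal_splits` is a hypothetical helper, and it assumes that any fragment ending in a number followed by a full stop is incomplete. That assumption also merges sentences that genuinely end in a number (for example a year), and it changes the sentence count, so the output should be checked by hand.

```python
import re

# Hypothetical helper (not part of the original notebook or of NLTK).
# Assumption: a fragment ending in "<digits>." was split after an ordinal
# number rather than at a real sentence boundary.
ORDINAL_END = re.compile(r"\b\d+\.$")


def merge_ordinal_splits(sentences):
    """Rejoin fragments that end in a number followed by a full stop."""
    merged = []
    carry = ""  # text waiting to be joined with the next fragment
    for sentence in sentences:
        current = f"{carry} {sentence}" if carry else sentence
        if ORDINAL_END.search(current):
            carry = current  # probably incomplete, keep accumulating
        else:
            merged.append(current)
            carry = ""
    if carry:  # the text ended on an ordinal; keep it rather than drop it
        merged.append(carry)
    return merged


# Sample based on the output above (second sentence shortened for brevity)
sample = [
    "Cijenjene gospođe i gospodo, sukladno članku 4. stavak 2.",
    "Poslovnika Hrvatskog sabora pripala mi je dužnost.",
    "Sve vas srdačno pozdravljam.",
]
print(merge_ordinal_splits(sample))
```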

In [21]:
# Save the list of sentences to a new file, one sentence per line
with open("ParlaMint-HR_S01_sample_sentence_tokenized.txt", "w", encoding="utf-8") as new_file:
    for sentence in sentence_list:
        new_file.write(sentence + "\n")
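The raw text contains newline characters (visible as `\n` in the file preview above), so a sentence could in principle contain one and break the one-sentence-per-line layout. The sketch below shows one way to guard against that and to verify the file by reading it back. The helper names `write_sentences` and `read_sentences`, the temporary file, and the two demo sentences are illustrative assumptions, not part of the original pipeline.

```python
from pathlib import Path
import tempfile


def write_sentences(path, sentences):
    """Write one sentence per line, collapsing internal whitespace and newlines."""
    lines = [" ".join(s.split()) for s in sentences]
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines


def read_sentences(path):
    """Read the file back as a list of sentences, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


# Made-up demo sentences; the second one contains an internal newline
demo = ["Prva rečenica.", "Druga\nrečenica."]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sentences.txt"
    written = write_sentences(target, demo)
    restored = read_sentences(target)

# The number of lines should match the number of sentences
print(restored == written, len(restored) == len(demo))
```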

## Apply MT models to the sample data

### Use EasyNMT for translation with OPUS-MT models

In [6]:
# Install easynmt
!pip install -q -U easynmt


In [7]:
from easynmt import EasyNMT

In [8]:
# Define the model. EasyNMT automatically detects and loads the suitable OPUS-MT model.
model = EasyNMT('opus-mt')

11.9kB [00:00, 10.3MB/s]                   


In [22]:
# Translate the list of sentences. The source language must be given as it appears in the model name (zls = South Slavic).
translation_list = model.translate(sentence_list, source_lang='zls', target_lang='en')

translation_list[:3]

['Ladies and gentlemen, in accordance with Article 4(2).',
 'The Rules of Procedure of the Croatian Parliament have been my duty and it is my honor to open the first constitutional session of the Croatian Parliament and to chair it temporarily until the election of the new president of the Croatian Parliament.',
 'I warmly welcome you all and congratulate you on the election for MPs and MPs in the Croatian Parliament.']
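Since the same pipeline can be reused for other language pairs, the translation step could be wrapped in a small function that also checks that every input sentence gets exactly one translation, which the DataFrame step relies on. This is only a sketch: `translate_aligned` is a hypothetical helper, and `UppercaseStub` is a stand-in that mimics the `translate(sentences, source_lang=..., target_lang=...)` call used above so that the example runs without downloading a model.

```python
def translate_aligned(model, sentences, source_lang, target_lang):
    """Translate sentences and check the output stays aligned with the input.

    `model` is any object with a `translate` method that accepts the
    `source_lang` and `target_lang` keyword arguments, as in the cell above.
    """
    translations = model.translate(
        sentences, source_lang=source_lang, target_lang=target_lang
    )
    if len(translations) != len(sentences):
        raise ValueError(
            f"Expected {len(sentences)} translations, got {len(translations)}"
        )
    return translations


class UppercaseStub:
    """Stand-in for a real model (assumption, for illustration only)."""

    def translate(self, sentences, source_lang, target_lang):
        # A real model would translate; the stub only uppercases the input.
        return [s.upper() for s in sentences]


pairs = translate_aligned(UppercaseStub(), ["dobar dan", "hvala"], "zls", "en")
print(pairs)
```

With the real model, the call would presumably be `translate_aligned(model, sentence_list, 'zls', 'en')`.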

## Create a DataFrame with results and save them to a CSV file

In [23]:
# Create a dataframe with all the results
import pandas as pd

# Add all the lists with original sentences and translations to a DataFrame
results_df = pd.DataFrame({"source": sentence_list, "OPUS-MT-SouthSlavic": translation_list})

In [24]:
display(results_df.head(5))

| | source | OPUS-MT-SouthSlavic |
|---|---|---|
| 0 | Cijenjene gospođe i gospodo, sukladno članku 4... | Ladies and gentlemen, in accordance with Artic... |
| 1 | Poslovnika Hrvatskog sabora pripala mi je dužn... | The Rules of Procedure of the Croatian Parliam... |
| 2 | Sve vas srdačno pozdravljam i čestitam na izbo... | I warmly welcome you all and congratulate you ... |
| 3 | Poštovane gospođe i gospodo, cijenjeni uzvanic... | Dear ladies and gentlemen, it is my honour and... |
| 4 | Posebno pozdravljam mandatara za sastavljanje ... | I particularly welcome the Prime Minister for ... |


In [25]:
# Save the results to a tab-separated file (the .csv extension is kept)
results_df.to_csv("MT-models-comparison-Croatian.csv", sep="\t")
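Because the file is written with tab separators despite its `.csv` extension, reading it back requires specifying the delimiter explicitly. The sketch below illustrates this with the standard-library `csv` module on a small stand-in file. The helper `read_tsv`, the temporary file, and the shortened sample row are assumptions for illustration; the stand-in mimics the layout I would expect from the call above (an unnamed index column followed by the two named columns).

```python
import csv
import tempfile
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dictionaries keyed by header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the results file: empty header for the index column
    sample_path = Path(tmp) / "sample.csv"
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["", "source", "OPUS-MT-SouthSlavic"])
        writer.writerow([0, "Sve vas srdačno pozdravljam.", "I warmly welcome you all."])
    rows = read_tsv(sample_path)

print(rows[0]["source"], "->", rows[0]["OPUS-MT-SouthSlavic"])
```

In pandas, the equivalent would be to pass the same separator to `pd.read_csv`.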