In this notebook, I prepare the GINCO corpus for machine translation. To use as much useful text as possible, I use all paragraphs marked as "keep". The paragraphs are joined into one full text (attribute "full_text").

Although not all labels will be used in the experiments, I machine translate all texts so that they can be used in other experiments as well.

In [1]:
# Import the GINCO dataset
import json

with open("G:/My Drive/GitHub/Experiments-with-CORE/data/GINCO-mapped-to-GINCORE.json", "r", encoding="utf-8") as file:
    dataset = json.load(file)

In [2]:
dataset[0]

{'id': '3949',
 'url': 'http://www.pomurje.si/aktualno/sport/zimska-liga-malega-nogometa/',
 'crawled': '2014',
 'hard': False,
 'paragraphs': [{'text': 'Šport', 'duplicate': False, 'keep': True},
  {'text': 'Zimska liga malega nogometa sobota, 12.02.2011',
   'duplicate': False,
   'keep': True},
  {'text': 'avtor: Tonček Gider', 'duplicate': False, 'keep': True},
  {'text': "V 7. krogu zimske lige v malem nogometu v Križevcih pri Ljutomeru je v prvi ligi vodilni 100 plus iz Križevec izgubil s tretjo ekipo na lestvici Rock'n roll iz Križevec z rezultatom 1:2, druga na lestvici Top Finedika iz Križevec je bila poražena z ekipo Bar Milene iz Ključarovec z rezultatom 7:8. V drugi križevski ligi je vodilni Cafe del Mar iz Vučje vasi premagal Montažo Vrbnjak iz Stare Nove vasi z rezultatom 3:2.",
   'duplicate': False,
   'keep': True},
  {'text': 'oglasno sporočilo', 'duplicate': False, 'keep': True},
  {'text': 'Ocena', 'duplicate': False, 'keep': True},
  {'text': 'Komentiraj Za komenti

In [3]:
# Join all paragraphs with the attribute "keep" True into one text (attribute "full_text")

for instance in dataset:
    paragraphs = instance["paragraphs"]
    # Keeping only the paragraphs marked as "keep":
    paragraphs = [p for p in paragraphs if p["keep"]]

    # Joining texts:
    if len(paragraphs) > 0:
        instance_full_text = " <p/> ".join([p["text"] for p in paragraphs])

        # Assigning texts to a new field:
        instance["full_text"] = instance_full_text
    else:
        instance["full_text"] = ""

dataset[0]

{'id': '3949',
 'url': 'http://www.pomurje.si/aktualno/sport/zimska-liga-malega-nogometa/',
 'crawled': '2014',
 'hard': False,
 'paragraphs': [{'text': 'Šport', 'duplicate': False, 'keep': True},
  {'text': 'Zimska liga malega nogometa sobota, 12.02.2011',
   'duplicate': False,
   'keep': True},
  {'text': 'avtor: Tonček Gider', 'duplicate': False, 'keep': True},
  {'text': "V 7. krogu zimske lige v malem nogometu v Križevcih pri Ljutomeru je v prvi ligi vodilni 100 plus iz Križevec izgubil s tretjo ekipo na lestvici Rock'n roll iz Križevec z rezultatom 1:2, druga na lestvici Top Finedika iz Križevec je bila poražena z ekipo Bar Milene iz Ključarovec z rezultatom 7:8. V drugi križevski ligi je vodilni Cafe del Mar iz Vučje vasi premagal Montažo Vrbnjak iz Stare Nove vasi z rezultatom 3:2.",
   'duplicate': False,
   'keep': True},
  {'text': 'oglasno sporočilo', 'duplicate': False, 'keep': True},
  {'text': 'Ocena', 'duplicate': False, 'keep': True},
  {'text': 'Komentiraj Za komenti

In [4]:
dataset[-1]

{'id': '1277644',
 'url': 'https://pogajanja.si/?s=article&a=izpis&id=1221',
 'crawled': '2021',
 'hard': True,
 'paragraphs': [{'text': 'Pogajalska akadmija v letu 2014 02.02.2014',
   'duplicate': False,
   'keep': True},
  {'text': 'Kako se pogajati tako, da boste dosegli to, kar želite in ohranili dober odnos z drugo stranjo 3. RAZPIS USPEŠNE AKADEMIJE! Pridobite celovito pogajalsko znanje ter spretnosti, pridružite se nam na intenzivnem tečaju. Tečaj traja 21 šolskih ur in vključuje pet poldnevnih modulov v marcu in aprilu 2014: 6. marca, 13. marca, 20. marca, 27. marca in 3. aprila 2014 od 15.30 do 19.00. Srečanja bodo v prostorih podjetja Planet GV, seminarska dvorana, Poslovni center Alta, Železna cesta 18 v Ljubljani. VEČ>>',
   'duplicate': False,
   'keep': True}],
 'primary_level_1': 'List of Summaries/Excerpts',
 'primary_level_2': 'List of Summaries/Excerpts',
 'primary_level_3': 'List of Summaries/Excerpts',
 'secondary_level_1': '',
 'secondary_level_2': '',
 'secondary

Save the dataset into txt files that will be machine translated. Each text is on a separate line, preceded by its id. A single file is too large for DeepL, which accepts files up to 1 MB, so the dataset is split into three files.

In [6]:
counter = 0

# Open the three output files; the with block closes them automatically.
# The encoding is set explicitly so that Slovene characters are written the same way on every system.
with open("GINCO-for-MT-part1-keeptext.txt", "w", encoding="utf-8") as new_file_part1, \
     open("GINCO-for-MT-part2-keeptext.txt", "w", encoding="utf-8") as new_file_part2, \
     open("GINCO-for-MT-part3-keeptext.txt", "w", encoding="utf-8") as new_file_part3:
    for i in dataset:
        current_id = i["id"]
        current_text = i["full_text"]
        # Texts 0-300 go to part 1, 301-600 to part 2, the rest to part 3
        if counter <= 300:
            new_file_part1.write(f"{current_id}\t{current_text}\n")
        elif counter <= 600:
            new_file_part2.write(f"{current_id}\t{current_text}\n")
        else:
            new_file_part3.write(f"{current_id}\t{current_text}\n")
        counter += 1
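
The split above uses fixed text counts (300 and 600). As a hedged alternative, the lines could be grouped by their size in bytes, so that each file stays under the upload limit regardless of how long the individual texts are. This is only a sketch: the function name `split_lines_by_size`, the toy lines, and reading the limit as 1,000,000 bytes are assumptions, not part of the original workflow.

```python
def split_lines_by_size(lines, max_bytes=1_000_000):
    """Group lines into chunks whose UTF-8 size stays at or under max_bytes."""
    chunks, current, current_size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would exceed the limit.
        if current and current_size + line_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += line_size
    if current:
        chunks.append(current)
    return chunks


# Toy data standing in for the "id<TAB>text" lines (hypothetical content).
toy_lines = [f"{i}\t{'besedilo ' * 10}\n" for i in range(5)]
toy_chunks = split_lines_by_size(toy_lines, max_bytes=200)
print([len(chunk) for chunk in toy_chunks])
```

Note that a single line larger than `max_bytes` still ends up in a chunk of its own, so very long texts would need separate handling.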

After translating the files with the DeepL system, let's join them and create a JSON file.

In [27]:
# The encoding is set explicitly, assuming that the translated files are UTF-8 encoded.
with open("GINCO-for-MT-part1-keeptext en-GB.txt", "r", encoding="utf-8") as fileMT1:
    mt_text_part1 = fileMT1.readlines()
with open("GINCO-for-MT-part2-keeptext en-GB.txt", "r", encoding="utf-8") as fileMT2:
    mt_text_part2 = fileMT2.readlines()
with open("GINCO-for-MT-part3-keeptext en-GB.txt", "r", encoding="utf-8") as fileMT3:
    mt_text_part3 = fileMT3.readlines()

In [28]:
import parse

# The patterns expect a space between the id and the text;
# lines with an empty text contain only the id.
pattern = "{id} {text}\n"
pattern_id_only = "{id}\n"
compiled_pattern = parse.compile(pattern)
id_compiled_pattern = parse.compile(pattern_id_only)

MT_dataset = []

def parse_MT_file(lines):
    """
    Parse the lines of a machine-translated txt file, created in the previous steps,
    and append the text ids and texts to MT_dataset.
    Args:
    lines = the list of lines read from the file, e.g. mt_text_part1
    """
    for line in lines:
        if " " in line:
            result = compiled_pattern.parse(line)
            if result is not None:
                current_id = result["id"]
                current_text = result["text"]
                current_dictionary = {"MT-id": current_id, "MT-text": current_text}
                MT_dataset.append(current_dictionary)
            else:
                # Print lines that do not match the pattern so they can be inspected manually
                print(line)
        else:
            # No space in the line: only the id is present and the text is empty
            empty_result = id_compiled_pattern.parse(line)
            if empty_result is not None:
                current_id = empty_result["id"]
                current_text = ""
                current_dictionary = {"MT-id": current_id, "MT-text": current_text}
                MT_dataset.append(current_dictionary)
            else:
                print(line)
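
The files were written with a tab between the id and the text, while the pattern above expects a space. A minimal sketch of a parser that accepts either separator, using only `str.split`, could look like this. The function name `split_id_and_text` and the toy lines are hypothetical and are not part of the original notebook.

```python
def split_id_and_text(line):
    """Split one 'id<whitespace>text' line; the text may be empty."""
    # With no separator given, split() treats tabs and spaces alike;
    # maxsplit=1 keeps the whole text after the id in one piece.
    parts = line.rstrip("\n").split(maxsplit=1)
    if not parts:
        return None  # blank line
    current_id = parts[0]
    current_text = parts[1] if len(parts) > 1 else ""
    return {"MT-id": current_id, "MT-text": current_text}


# Toy lines (hypothetical): tab-separated, space-separated, and id only.
toy_mt_lines = [
    "3949\tSport <p/> Winter league\n",
    "3726 JEDILNIK <p/> Search\n",
    "5621\n",
]
toy_parsed = [split_id_and_text(line) for line in toy_mt_lines]
print(toy_parsed)
```

This variant assumes that an id never contains whitespace, which holds for the ids shown in this notebook.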

In [29]:
parse_MT_file(mt_text_part1)
parse_MT_file(mt_text_part2)
parse_MT_file(mt_text_part3)
print(len(MT_dataset))

1002


In [30]:
print(MT_dataset[-3:])

[{'MT-id': '674213', 'MT-text': "About the product There are two women behind the Dame brand, and we've been more than impressed by the two products they've launched so far. Not because they're women, but because they obviously know [...] <p/> About the product The Satisfyer Partner Plus sex toy is a great way to show your partner how fantastic and fun erotic toys can be when they're included in your bedroom. A toy that will bring pleasure to both of you and [...] <p/> About product The Svakom Vick Vibrating Plug, a female G-spot and male P-spot massager, is an extremely versatile toy that not only looks nice, but is here to satisfy all your needs. This velvety soft seducer, made [...] <p/>About product It's time for the ultimate erotic toy in the form of a rabbit vibrator that is sure to become your muse for all your intimate play to come. In typical Satisfyer style, this toy also stimulates the clitoris without direct [...] <p/> About product The Lelo Bob is a smaller sized anal toy 
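
The sample above shows that the `<p/>` separator does not always come back from translation with spaces on both sides (e.g. `<p/>About product`). If the paragraphs ever need to be recovered from the translated text, a tolerant split could be sketched as follows. The function name `split_paragraphs` and the toy text are hypothetical.

```python
import re

# Match the separator together with any whitespace around it.
PARAGRAPH_SEPARATOR = re.compile(r"\s*<p/>\s*")


def split_paragraphs(full_text):
    """Split a joined text back into paragraphs, tolerating missing spaces."""
    return [part for part in PARAGRAPH_SEPARATOR.split(full_text) if part]


# Toy text (hypothetical) with one regular and one space-less separator.
toy_text = "About the product A [...] <p/> About product B [...] <p/>About product C"
toy_paragraphs = split_paragraphs(toy_text)
print(toy_paragraphs)
```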

Let's join the two datasets.

In [31]:
import pandas as pd

Sl_df = pd.DataFrame(dataset)
En_df = pd.DataFrame(MT_dataset)

En_df.columns = ["id","MT-text"]

In [32]:
Sl_df.head()

Unnamed: 0,id,url,crawled,hard,paragraphs,primary_level_1,primary_level_2,primary_level_3,secondary_level_1,secondary_level_2,secondary_level_3,tertiary_level_1,tertiary_level_2,tertiary_level_3,split,domain,GINCORE,full_text
0,3949,http://www.pomurje.si/aktualno/sport/zimska-li...,2014,False,"[{'text': 'Šport', 'duplicate': False, 'keep':...",News/Reporting,News/Reporting,News/Reporting,,,,,,,test,www.pomurje.si,News,"Šport <p/> Zimska liga malega nogometa sobota,..."
1,3726,http://www.ss-sezana.si/sss/index.php?option=c...,2014,False,"[{'text': 'JEDILNIK', 'duplicate': False, 'kee...",Information/Explanation,Information/Explanation,Information/Explanation,,,,,,,train,www.ss-sezana.si,Information/Explanation,JEDILNIK <p/> Iskalnik <p/> Poglavitni cilj pr...
2,5621,http://www.kamnik-starejsi.si/novice/144-sodel...,2014,False,"[{'text': 'Projekt INNOVAge in zavod Oreli', '...",Promotion of Services,Promotion of Services,Promotion,Opinion/Argumentation,Opinion/Argumentation,Opinion/Argumentation,Information/Explanation,Information/Explanation,Information/Explanation,train,www.kamnik-starejsi.si,Promotion,Projekt INNOVAge in zavod Oreli <p/> Zavod Ore...
3,3776,http://www.radiocelje.si/novica.php?id=13007&a...,2014,False,"[{'text': 'V novembru, mesecu preprečevanja od...",News/Reporting,News/Reporting,News/Reporting,,,,,,,train,www.radiocelje.si,News,"V novembru, mesecu preprečevanja odvisnosti, b..."
4,2102,http://www.mtv.si/novice/selena-gomez-ponudila...,2014,False,[{'text': 'Selena Gomez ponudila v poslušanje ...,Opinionated News,Opinionated News,Opinionated News,,,,,,,test,www.mtv.si,News,Selena Gomez ponudila v poslušanje novi album ...


In [33]:
En_df.head()

Unnamed: 0,id,MT-text
0,3949,Sport <p/> Winter Little League Football Satur...
1,3726,JEDILNIK <p/> Search <p/> The main objective o...
2,5621,Project INNOVAge and the Oreli Institute <p/> ...
3,3776,"In November, the month of addiction prevention..."
4,2102,Selena Gomez launches new album <p/> 16.07.201...


In [34]:
En_df.describe()

Unnamed: 0,id,MT-text
count,1002,1002
unique,1002,1002
top,3949,Sport <p/> Winter Little League Football Satur...
freq,1,1


Let's check if all the ids are the same.

In [35]:
sl_ids = list(Sl_df.id.unique())
len(sl_ids)

1002

In [36]:
en_ids = list(En_df.id.unique())
len(en_ids)

1002

In [37]:
# Create a set of ids that are the same for both lists
same_ids = set(sl_ids) & set(en_ids)
len(same_ids)

1002

In [38]:
# Find which element is not in the set
for element in sl_ids:
    if element not in same_ids:
        print(element)

At first, one id was missing. I manually corrected the translation of that text so that the ids match.

In [39]:
# Join the datasets based on the IDs
joined_df = pd.merge(Sl_df, En_df, on=['id'])
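
The id check and the join can also be combined in one step. The sketch below, on toy data, uses an outer merge with `indicator=True` to list ids that appear in only one of the two tables, and `validate="one_to_one"` to raise an error if an id is duplicated on either side. The toy frames and the names `toy_sl`, `toy_en`, `checked`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical): id "3" is missing from the translations,
# id "4" is missing from the Slovene side.
toy_sl = pd.DataFrame({"id": ["1", "2", "3"], "full_text": ["ena", "dva", "tri"]})
toy_en = pd.DataFrame({"id": ["1", "2", "4"], "MT-text": ["one", "two", "four"]})

# The outer merge keeps unmatched rows; the "_merge" column records their origin.
checked = pd.merge(
    toy_sl, toy_en, on="id", how="outer", indicator=True, validate="one_to_one"
)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["id", "_merge"]])
```

With the default inner merge used in the cell above, unmatched ids are dropped silently, so a check of this kind is a way to notice them before joining.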

In [40]:
joined_df.head()

Unnamed: 0,id,url,crawled,hard,paragraphs,primary_level_1,primary_level_2,primary_level_3,secondary_level_1,secondary_level_2,secondary_level_3,tertiary_level_1,tertiary_level_2,tertiary_level_3,split,domain,GINCORE,full_text,MT-text
0,3949,http://www.pomurje.si/aktualno/sport/zimska-li...,2014,False,"[{'text': 'Šport', 'duplicate': False, 'keep':...",News/Reporting,News/Reporting,News/Reporting,,,,,,,test,www.pomurje.si,News,"Šport <p/> Zimska liga malega nogometa sobota,...",Sport <p/> Winter Little League Football Satur...
1,3726,http://www.ss-sezana.si/sss/index.php?option=c...,2014,False,"[{'text': 'JEDILNIK', 'duplicate': False, 'kee...",Information/Explanation,Information/Explanation,Information/Explanation,,,,,,,train,www.ss-sezana.si,Information/Explanation,JEDILNIK <p/> Iskalnik <p/> Poglavitni cilj pr...,JEDILNIK <p/> Search <p/> The main objective o...
2,5621,http://www.kamnik-starejsi.si/novice/144-sodel...,2014,False,"[{'text': 'Projekt INNOVAge in zavod Oreli', '...",Promotion of Services,Promotion of Services,Promotion,Opinion/Argumentation,Opinion/Argumentation,Opinion/Argumentation,Information/Explanation,Information/Explanation,Information/Explanation,train,www.kamnik-starejsi.si,Promotion,Projekt INNOVAge in zavod Oreli <p/> Zavod Ore...,Project INNOVAge and the Oreli Institute <p/> ...
3,3776,http://www.radiocelje.si/novica.php?id=13007&a...,2014,False,"[{'text': 'V novembru, mesecu preprečevanja od...",News/Reporting,News/Reporting,News/Reporting,,,,,,,train,www.radiocelje.si,News,"V novembru, mesecu preprečevanja odvisnosti, b...","In November, the month of addiction prevention..."
4,2102,http://www.mtv.si/novice/selena-gomez-ponudila...,2014,False,[{'text': 'Selena Gomez ponudila v poslušanje ...,Opinionated News,Opinionated News,Opinionated News,,,,,,,test,www.mtv.si,News,Selena Gomez ponudila v poslušanje novi album ...,Selena Gomez launches new album <p/> 16.07.201...


In [41]:
print(list(joined_df.columns))

['id', 'url', 'crawled', 'hard', 'paragraphs', 'primary_level_1', 'primary_level_2', 'primary_level_3', 'secondary_level_1', 'secondary_level_2', 'secondary_level_3', 'tertiary_level_1', 'tertiary_level_2', 'tertiary_level_3', 'split', 'domain', 'GINCORE', 'full_text', 'MT-text']


In [42]:
# Drop the "paragraphs" column; .copy() avoids a SettingWithCopyWarning when a new column is added later
final_dataset = joined_df[['id', 'url', 'crawled', 'hard', 'primary_level_1', 'primary_level_2', 'primary_level_3', 'secondary_level_1', 'secondary_level_2', 'secondary_level_3', 'tertiary_level_1', 'tertiary_level_2', 'tertiary_level_3', 'split', 'domain', 'GINCORE', 'full_text', 'MT-text']].copy()

In [43]:
final_dataset.tail()

Unnamed: 0,id,url,crawled,hard,primary_level_1,primary_level_2,primary_level_3,secondary_level_1,secondary_level_2,secondary_level_3,tertiary_level_1,tertiary_level_2,tertiary_level_3,split,domain,GINCORE,full_text,MT-text
997,374730,http://khetanes.si/sl-si/produkti/projektne-no...,2021,False,Information/Explanation,Information/Explanation,Information/Explanation,,,,Promotion of a Product,Promotion of a Product,Promotion,train,khetanes.si,Information/Explanation,Projektne novine <p/> Promocijski projektni ča...,Project News <p/> Promotional project newspape...
998,476885,https://www.merkur.si/navigacija/nasveti/kopal...,2021,False,List of Summaries/Excerpts,List of Summaries/Excerpts,List of Summaries/Excerpts,,,,,,,train,www.merkur.si,List of Summaries/Excerpts,V raznoliki ponudbi tušev izberite popolno raz...,Choose the perfect shower to match your taste ...
999,674213,http://www.sex2.si/category/ocene-izdelkov/,2021,False,List of Summaries/Excerpts,List of Summaries/Excerpts,List of Summaries/Excerpts,,,,,,,train,www.sex2.si,List of Summaries/Excerpts,"O izdelku Za znamko Dame stojita dve ženski, z...",About the product There are two women behind t...
1000,975590,http://www.ipsos.si/VodenjeVIZ_VI_past_dvojne_...,2021,False,Opinion/Argumentation,Opinion/Argumentation,Opinion/Argumentation,,,,,,,train,www.ipsos.si,Opinion/Argumentation,Razprava pogosto potegne na plano najprej tist...,The debate often brings to the surface first t...
1001,1277644,https://pogajanja.si/?s=article&a=izpis&id=1221,2021,True,List of Summaries/Excerpts,List of Summaries/Excerpts,List of Summaries/Excerpts,,,,,,,test,pogajanja.si,List of Summaries/Excerpts,Pogajalska akadmija v letu 2014 02.02.2014 <p/...,Negotiation Academy 2014 02.02.2014 <p/> How t...


In [44]:
final_dataset.describe()

Unnamed: 0,id,url,crawled,hard,primary_level_1,primary_level_2,primary_level_3,secondary_level_1,secondary_level_2,secondary_level_3,tertiary_level_1,tertiary_level_2,tertiary_level_3,split,domain,GINCORE,full_text,MT-text
count,1002,1002,1002,1002,1002,1002,1002,1002.0,1002.0,1002.0,1002.0,1002.0,1002.0,1002,1002,1002,1002,1002
unique,1002,881,2,2,24,21,12,14.0,14.0,10.0,14.0,14.0,9.0,3,721,20,1002,1002
top,3949,https://publishwall.si/pozareport.7dni,2014,False,Information/Explanation,Information/Explanation,Promotion,,,,,,,train,publishwall.si,Promotion,"Šport <p/> Zimska liga malega nogometa sobota,...",Sport <p/> Winter Little League Football Satur...
freq,1,10,501,933,130,130,209,812.0,812.0,812.0,849.0,849.0,849.0,602,10,209,1,1


As we used only the paragraphs marked as "keep" rather than all of the text, let's check if there are any empty or very short texts.

In [45]:
final_dataset["text_length"] = final_dataset["full_text"].str.split().str.len()

final_dataset

Unnamed: 0,id,url,crawled,hard,primary_level_1,primary_level_2,primary_level_3,secondary_level_1,secondary_level_2,secondary_level_3,tertiary_level_1,tertiary_level_2,tertiary_level_3,split,domain,GINCORE,full_text,MT-text,text_length
0,3949,http://www.pomurje.si/aktualno/sport/zimska-li...,2014,False,News/Reporting,News/Reporting,News/Reporting,,,,,,,test,www.pomurje.si,News,"Šport <p/> Zimska liga malega nogometa sobota,...",Sport <p/> Winter Little League Football Satur...,93
1,3726,http://www.ss-sezana.si/sss/index.php?option=c...,2014,False,Information/Explanation,Information/Explanation,Information/Explanation,,,,,,,train,www.ss-sezana.si,Information/Explanation,JEDILNIK <p/> Iskalnik <p/> Poglavitni cilj pr...,JEDILNIK <p/> Search <p/> The main objective o...,76
2,5621,http://www.kamnik-starejsi.si/novice/144-sodel...,2014,False,Promotion of Services,Promotion of Services,Promotion,Opinion/Argumentation,Opinion/Argumentation,Opinion/Argumentation,Information/Explanation,Information/Explanation,Information/Explanation,train,www.kamnik-starejsi.si,Promotion,Projekt INNOVAge in zavod Oreli <p/> Zavod Ore...,Project INNOVAge and the Oreli Institute <p/> ...,232
3,3776,http://www.radiocelje.si/novica.php?id=13007&a...,2014,False,News/Reporting,News/Reporting,News/Reporting,,,,,,,train,www.radiocelje.si,News,"V novembru, mesecu preprečevanja odvisnosti, b...","In November, the month of addiction prevention...",158
4,2102,http://www.mtv.si/novice/selena-gomez-ponudila...,2014,False,Opinionated News,Opinionated News,Opinionated News,,,,,,,test,www.mtv.si,News,Selena Gomez ponudila v poslušanje novi album ...,Selena Gomez launches new album <p/> 16.07.201...,63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
997,374730,http://khetanes.si/sl-si/produkti/projektne-no...,2021,False,Information/Explanation,Information/Explanation,Information/Explanation,,,,Promotion of a Product,Promotion of a Product,Promotion,train,khetanes.si,Information/Explanation,Projektne novine <p/> Promocijski projektni ča...,Project News <p/> Promotional project newspape...,86
998,476885,https://www.merkur.si/navigacija/nasveti/kopal...,2021,False,List of Summaries/Excerpts,List of Summaries/Excerpts,List of Summaries/Excerpts,,,,,,,train,www.merkur.si,List of Summaries/Excerpts,V raznoliki ponudbi tušev izberite popolno raz...,Choose the perfect shower to match your taste ...,50
999,674213,http://www.sex2.si/category/ocene-izdelkov/,2021,False,List of Summaries/Excerpts,List of Summaries/Excerpts,List of Summaries/Excerpts,,,,,,,train,www.sex2.si,List of Summaries/Excerpts,"O izdelku Za znamko Dame stojita dve ženski, z...",About the product There are two women behind t...,295
1000,975590,http://www.ipsos.si/VodenjeVIZ_VI_past_dvojne_...,2021,False,Opinion/Argumentation,Opinion/Argumentation,Opinion/Argumentation,,,,,,,train,www.ipsos.si,Opinion/Argumentation,Razprava pogosto potegne na plano najprej tist...,The debate often brings to the surface first t...,409


In [46]:
# Select texts with 10 words or fewer
short_texts = final_dataset[final_dataset["text_length"] < 11]

short_texts.head()

Unnamed: 0,id,url,crawled,hard,primary_level_1,primary_level_2,primary_level_3,secondary_level_1,secondary_level_2,secondary_level_3,tertiary_level_1,tertiary_level_2,tertiary_level_3,split,domain,GINCORE,full_text,MT-text,text_length


There are no such texts.
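
Note that splitting on whitespace counts every `<p/>` separator as a word, so the lengths above are slightly higher than the number of actual words. A hedged sketch of a count that leaves the separators out, on toy data, is given below. The frame `toy` and the `is_empty` column are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical) with one empty text.
toy = pd.DataFrame(
    {"full_text": ["Šport <p/> Zimska liga", "", "Ocena <p/> Komentiraj"]}
)

# Remove the separator before counting so it does not inflate the word count.
without_separator = toy["full_text"].str.replace("<p/>", " ", regex=False)
toy["text_length"] = without_separator.str.split().str.len()

# Flag texts that are empty or consist of whitespace only.
toy["is_empty"] = toy["full_text"].str.strip().eq("")
print(toy)
```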

Save the file with Slovene and English text:

In [47]:
# to_json and to_csv return None when a path is given, so the results are not assigned
final_dataset.to_json("Sl-and-MT-GINCO-mapped-to-GINCORE-keeptext.json", orient="index", indent=2)
final_dataset.to_csv("Sl-and-MT-GINCO-mapped-to-GINCORE-keeptext.csv", index=False)
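
By default, `to_json` escapes non-ASCII characters (`force_ascii=True`), so Slovene letters such as "Š" are stored as `\u` sequences in the JSON file. The content is unchanged once the file is loaded, but if a human-readable file is preferred, `force_ascii=False` could be used. A minimal sketch on toy data follows; the frame `toy` is hypothetical.

```python
import json

import pandas as pd

# Toy frame (hypothetical) standing in for final_dataset.
toy = pd.DataFrame({"id": ["3949"], "full_text": ["Šport <p/> Zimska liga"]})

# Without a path, to_json returns the JSON as a string instead of writing a file.
json_string = toy.to_json(orient="index", indent=2, force_ascii=False)

# With orient="index", the top-level keys are the row labels as strings.
records = json.loads(json_string)
print(records["0"]["full_text"])
```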