# Translation of the KORE dataset

See the **translateWikipageNames.py** script.

This script automatically translates the original English KORE dataset into other languages.

To translate a single entity from the English KORE dataset, the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Search) is used. Each entity in the KORE dataset has an English Wikipedia page. For each entity, the script searches the Wikipedia of the target language for the corresponding page and takes the title of the first hit as the translated entity.
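
The request described above can be sketched as a small pure function. The helper name `build_search_request` is hypothetical and not part of the script; the URL pattern and the query parameters are the ones the script itself uses further down.

```python
def build_search_request(lang, title):
    """Return (url, params) for a search on the Wikipedia of language `lang`.

    Hypothetical helper for illustration; it only assembles the request
    and does not contact the API.
    """
    url = "https://" + lang + ".wikipedia.org/w/api.php"
    params = {
        "action": "query",                # generic query module
        "list": "search",                 # full-text search submodule
        "srsearch": lang + ":" + title,   # search string, as built in the script
        "format": "json",
    }
    return url, params


url, params = build_search_request("de", "European Commission")
print(url)
print(params["srsearch"])
```

Keeping the request construction separate from the network call makes it easy to inspect what will be sent before any request is made.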


## Import

[requests](https://realpython.com/python-requests/) and [argparse](https://docs.python.org/3/library/argparse.html) are needed for this script to work:

In [1]:
import requests
import argparse

## General usage

The usage of the script is shown by the standard -h or --help flag:

In [2]:
%%cmd
python translateWikipageNames.py --help

(base) C:\Users\nadin\Documents\Bachelorarbeit\Code>python translateWikipageNames.py --help
usage: translateWikipageNames.py [-h] source target lang

Script for translating wikipedia page names

positional arguments:
  source      source folder with the dataset
  target      target folder in which to store the translated data
  lang        language in which to translate to

optional arguments:
  -h, --help  show this help message and exit

(base) C:\Users\nadin\Documents\Bachelorarbeit\Code>
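
As a rough illustration of how the three positional arguments are interpreted, the sketch below rebuilds the same parser and feeds it an example argument list. The file names `kore_entities.txt` and `kore_entities_de.txt` are hypothetical. Although the help text speaks of folders, `readFile` and `writeFile` open `source` and `target` directly, so file paths appear to be what is expected.

```python
import argparse


def build_parser():
    # Same three positional arguments as in the script
    parser = argparse.ArgumentParser(description='Script for translating wikipedia page names')
    parser.add_argument('source', type=str, help='source folder with the dataset')
    parser.add_argument('target', type=str, help='target folder in which to store the translated data')
    parser.add_argument('lang', type=str, help='language in which to translate to')
    return parser


# Equivalent to the command line (hypothetical file names):
#   python translateWikipageNames.py kore_entities.txt kore_entities_de.txt de
args = build_parser().parse_args(['kore_entities.txt', 'kore_entities_de.txt', 'de'])
print(args.source, args.target, args.lang)
```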

## Functions

The following function loads the original English KORE dataset:

In [3]:
def readFile(source):
    #read one entity per line; the with-block closes the file afterwards
    with open(source, encoding="UTF-8") as f:
        lines = [line.rstrip('\n') for line in f]

    #manual corrections for entries with encoding problems (disabled)
    #lines[0] = 'Apple Inc.'
    #lines[195] = 'Golden Globe Award for Best Actor - Motion Picture Drama' #instead of Golden Globe Award for Best Actor â€“ Motion Picture Drama
    #lines[299]= 'Kärtsy Hatakka' #instead of KÃ¤rtsy Hatakka
    #lines[302]= 'Ragnarök' #instead of RagnarÃ¶k
    #print(lines)

    #remove a single leading tab; startswith also works for empty lines
    for index in range(len(lines)):
        if lines[index].startswith("\t"):
            lines[index] = lines[index][1:]
    return lines
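
The commented-out corrections in `readFile` replace garbled names by hand. If the garbling comes from UTF-8 bytes that were decoded as Windows-1252 at some earlier point (an assumption, the notebook does not state the cause), it could be undone programmatically. The helper `repair_mojibake` below is hypothetical and only a minimal sketch of that idea.

```python
def repair_mojibake(text):
    """Try to undo UTF-8 text that was wrongly decoded as Windows-1252.

    Assumption: the garbled names arose from exactly this mix-up.
    Strings that do not fit the pattern are returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("KÃ¤rtsy Hatakka"))
print(repair_mojibake("RagnarÃ¶k"))
print(repair_mojibake("Apple Inc."))
```

Names that are already correct, such as plain ASCII titles or properly decoded umlauts, pass through untouched because the round trip either changes nothing or fails and falls back to the input.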

The writeFile function writes the translated entities (held in a list) to the target file, one entity per line:

In [4]:
def writeFile(target, names):
    #write one translated entity per line
    with open(target, 'w', encoding="UTF-8") as f:
        for name in names:
            f.write(name + "\n")
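
To check that writing and reading are inverse operations, a round trip through a temporary file can be sketched as follows. The functions `write_lines` and `read_lines` are simplified stand-ins for `writeFile` and `readFile`, and the file name `translated.txt` is hypothetical.

```python
import os
import tempfile


def write_lines(target, names):
    # One entity per line, UTF-8 encoded
    with open(target, "w", encoding="UTF-8") as f:
        for name in names:
            f.write(name + "\n")


def read_lines(source):
    # Read the lines back and drop the trailing newline of each
    with open(source, encoding="UTF-8") as f:
        return [line.rstrip("\n") for line in f]


names = ["Europäische Kommission", "Ragnarök"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "translated.txt")
    write_lines(path, names)
    restored = read_lines(path)

print(restored == names)
```

Passing the encoding explicitly on both sides matters here, since the platform default encoding may not be UTF-8.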

## Configuration and Main


In [None]:
parser = argparse.ArgumentParser(description='Script for translating wikipedia page names')
parser.add_argument('source', type=str, help='source folder with the dataset')
parser.add_argument('target', type=str, help='target folder in which to store the translated data')
parser.add_argument('lang', type=str, help='language in which to translate to')

args = parser.parse_args()

#read file into list
pageList = readFile(args.source)
#print(pageList)

#set the language into which the dataset should be translated
LANG = str(args.lang)
#print(LANG)

#empty list in which the translated entities will be stored
pageTranslated = []

#reuse one session for all requests
S = requests.Session()

#set URL to the Wikipedia of the target language,
#e.g. de: German, it: Italian, es: Spanish, fr: French
URL = "https://" + LANG + ".wikipedia.org/w/api.php"
#print(URL)

for e in pageList:

    SEARCHPAGE = LANG + ":" + e

    PARAMS = {
        'action': "query",
        'list': "search",
        'srsearch': SEARCHPAGE,
        'format': "json"
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()

    #take the title of the first search hit as the translated entity
    pageTranslated.append(str(DATA['query']['search'][0]['title']))

    
#write list into file
writeFile(args.target, pageTranslated)
print("Translation successfully saved into " + args.target)
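
The main loop indexes the first search hit directly, so an entity without any search result would raise an `IndexError`. One possible way to guard against this is sketched below. The helper `translate_title`, the fallback to the English name, and the `timeout` value are assumptions for illustration, not the script's behaviour. `StubSession` is a hypothetical offline stand-in for `requests.Session` so that the sketch runs without network access.

```python
def translate_title(session, lang, title):
    """Return the first search hit on the `lang` Wikipedia, or `title` if none.

    `session` is expected to behave like requests.Session (hypothetical helper).
    """
    url = "https://" + lang + ".wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": lang + ":" + title,
        "format": "json",
    }
    response = session.get(url=url, params=params, timeout=10)
    response.raise_for_status()
    hits = response.json().get("query", {}).get("search", [])
    # Assumed fallback: keep the English name when nothing is found
    return hits[0]["title"] if hits else title


class StubResponse:
    """Minimal response object returning a canned JSON payload."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass

    def json(self):
        return self._payload


class StubSession:
    """Offline stand-in for requests.Session, for demonstration only."""

    def __init__(self, payload):
        self._payload = payload

    def get(self, url, params=None, timeout=None):
        return StubResponse(self._payload)


found = StubSession({"query": {"search": [{"title": "Europäische Kommission"}]}})
empty = StubSession({"query": {"search": []}})
print(translate_title(found, "de", "European Commission"))
print(translate_title(empty, "de", "Some Unknown Entity"))
```

With the real library, the call would be `translate_title(requests.Session(), "de", "European Commission")`.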

## Example 

Translate the English entity _European Commission_ into the German entity _Europäische Kommission_:

In [5]:
#test
S = requests.Session()

#set URL to the Wikipedia of the target language; here German: de
URL = "https://de.wikipedia.org/w/api.php"

SEARCHPAGE = 'de:European Commission'

PARAMS = {
    'action': "query",
    'list': "search",
    'srsearch': SEARCHPAGE,
    'format': "json"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

#print the title of the first search hit
print(str(DATA['query']['search'][0]['title']))

Europäische Kommission


## Convert Jupyter Notebook into py-script

In [None]:
!jupyter nbconvert --to script translateWikipageNames.ipynb