# DeepL API example of bulk translation


### Overview

Sometimes you need to translate many strings at once. This notebook shows how to do that with a basic example using the DeepL API.

### About Me

My name is Alton Alexander. I am a data science consultant turned entrepreneur building SaaS tools for SEO.

Find more of my free scripts or ask me any question on Twitter: @alton_lex

In [150]:
# setup
import urllib.request
import requests
import json
import pandas as pd

## Example Translation Problem

In recent news, there was an allegation that source code had been leaked from a major search engine. A file was posted that may have been part of that leak.

A meaningful portion of the file was in a language other than English.

We download the file from Dropbox and loop through it to create an English version.

The resulting CSV is provided as an exercise in journalism.

In [3]:
# download a file from public repository online
# to temporary file

file_url = "https://www.dropbox.com/s/toyehkkfduogbwk/factors_gen.txt?dl=1"
filename = "factors_gen.txt"
urllib.request.urlretrieve(file_url, filename)

('factors_gen.txt', <http.client.HTTPMessage at 0x7fe9fcf4d2d0>)

In [105]:
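# convert file to a list of dicts, one per "Factor {" block

all_factors = []
factor = {}
i = 0
# utf-8 is assumed here because the file contains Cyrillic text
with open(filename, "r", encoding="utf-8") as myfile:
    for line in myfile:
        if "Factor {" in line:
            # a new block starts: store the previous one (if any)
            if i:
                all_factors.append(factor)
            i = i + 1
            factor = {}

        if ":" in line:
            # split on the first colon only, so values such as URLs stay intact
            key, _, value = line.partition(":")
            factor[key.strip()] = value.strip().strip('"')

# note: a block is only stored when the next "Factor {" line is seen,
# so the final block in the file is not appended to all_factors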


In [106]:
# how many items were extracted
len(all_factors)

1922
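To see what the parsing step does without downloading the file, here is a minimal sketch run on a small inline sample. The sample text is an assumption modeled on the preview output in this notebook, not a copy of the real file, and `parse_factors` is a hypothetical helper name. Unlike the loop above, it also keeps the final block.

```python
import io

# Assumed sample in a "Factor { key: value }" layout, modeled on the preview
# output in this notebook; the real file may differ in details.
SAMPLE = '''Factor {
    Index: 0
    Name: "PR"
    Wiki: "https://wiki.example/PageRank"
    Tags: [TG_DOC, TG_STATIC]
}
Factor {
    Index: 1
    Name: "TR"
}
'''


def parse_factors(lines):
    """Collect key/value pairs into one dict per 'Factor {' block."""
    factors = []
    current = None
    for line in lines:
        if "Factor {" in line:
            if current is not None:
                factors.append(current)
            current = {}
        elif ":" in line and current is not None:
            # split on the first colon only, so URLs keep their "https:" part
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip().strip('"')
    if current is not None:
        factors.append(current)  # keep the last block as well
    return factors


factors = parse_factors(io.StringIO(SAMPLE))
print(len(factors))        # 2
print(factors[0]["Wiki"])  # https://wiki.example/PageRank
```

Because the function accepts any iterable of lines, the same call should work on an open file handle.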

In [152]:
# preview of one prior to translation
all_factors[0]

{'Index': '0',
 'CppName': 'FI_PAGE_RANK',
 'Name': 'PR',
 'Wiki': 'https://wiki.yandex-team.ru/jandekspoisk/kachestvopoiska/factordev/web/factors/PageRank',
 'AntiSeoUpperBound': '1.0',
 'Tags': '[TG_DOC, TG_LINK_GRAPH, TG_STATIC, TG_L2, TG_UNUSED]',
 'Description': 'Page rank. Фактор ремапится.',
 'Authors': 'aavdonkin',
 'Responsibles': 'aavdonkin',
 'Desc': 'Page rank. The factor is remapped.',
 'TG_DOC': True,
 'TG_LINK_GRAPH': True,
 'TG_STATIC': True,
 'TG_L2': True,
 'TG_UNUSED': True}

# API translation

Input: Any language

Output: Translated to specified language

Sign up for an API key at deepl.com

In [29]:
api_key_deepl = "add key here"

# from the docs
#curl -X POST 'https://api-free.deepl.com/v2/translate' \
#	-H 'Authorization: DeepL-Auth-Key asdfasdfasdfa' \
#	-d 'text=Hello%2C%20world!' \
#	-d 'target_lang=DE'

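def translate(text, target_lang="EN"):
    # send one string to DeepL and return the translated text
    res = requests.post(
        'https://api-free.deepl.com/v2/translate',
        headers={"Authorization": "DeepL-Auth-Key " + api_key_deepl},
        json={"text": [text], "target_lang": target_lang},
    )
    # fail loudly on HTTP errors (bad key, quota, rate limit) instead of a KeyError
    res.raise_for_status()

    return res.json()['translations'][0]['text']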
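The `translate` function above sends one string per request. Since the request body already carries `text` as a list, several strings can probably be sent together. The sketch below shows one way to batch them. The names `chunked`, `build_payload`, `translate_many`, and `send` are hypothetical, and the limit of 50 texts per request is an assumption that should be checked against the current DeepL documentation.

```python
DEEPL_URL = "https://api-free.deepl.com/v2/translate"  # same endpoint as above
MAX_TEXTS_PER_REQUEST = 50  # assumption: verify against the DeepL docs


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payload(texts, target_lang="EN"):
    """Build the JSON body for one request holding several texts."""
    return {"text": list(texts), "target_lang": target_lang}


def translate_many(texts, send, target_lang="EN"):
    """Translate a list of strings in batches.

    `send` is any callable that takes the payload dict and returns the
    parsed JSON response, which keeps the HTTP call swappable.
    """
    results = []
    for batch in chunked(texts, MAX_TEXTS_PER_REQUEST):
        response = send(build_payload(batch, target_lang))
        results.extend(item["text"] for item in response["translations"])
    return results


# Example wiring with requests (not run here; needs a valid key):
# send = lambda payload: requests.post(
#     DEEPL_URL,
#     headers={"Authorization": "DeepL-Auth-Key " + api_key_deepl},
#     json=payload,
# ).json()
# english = translate_many(descriptions, send)
```

Passing the HTTP call in as `send` is a design choice for this sketch: it lets the batching logic be checked with a stub before any API quota is spent.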

In [30]:
# test one input - uncomment to run test
#translate('Приоритет strict для TR - текстовый приоритет - есть все слова запроса где-то в документе (при этом они проходят контекстные ограничения запроса, например, оба слова д.б. в одном предложении).')

'Priority strict for TR is text priority - there are all query words somewhere in the document (and they pass contextual restrictions of the query, for example, both words d.b. in the same sentence).'
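Part of the file is already in English, so it may be worth checking whether a string contains Cyrillic before spending API quota on it. The helper below is a rough heuristic, not a language detector, and `needs_translation` is a hypothetical name.

```python
import re

# Basic Cyrillic Unicode block (U+0400 to U+04FF), which covers Russian letters
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def needs_translation(text):
    """Return True if the text contains at least one Cyrillic character."""
    return bool(CYRILLIC.search(text))


print(needs_translation("Page rank. Фактор ремапится."))  # True
print(needs_translation("Page rank."))                     # False
```

Mixed strings such as the first example still count as needing translation, which matches how the descriptions in this file combine English terms with Russian text.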

In [None]:
# potentially store a cache of all descriptions
descriptions_to_en = {}

In [None]:
# loop over every description and convert to english
for factor in all_factors:

    if ('Desc' not in factor) and ('Description' in factor):
        description = factor['Description']
        if description not in descriptions_to_en:
            # translate once and remember the result for repeated descriptions
            descriptions_to_en[description] = translate(description)
        factor['Desc'] = descriptions_to_en[description]
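A cache held only in memory is lost when the kernel restarts, so a long translation run would start over. One option is to save the cache to a JSON file between runs. This is a minimal sketch under that assumption; `load_cache`, `save_cache`, and `translate_cached` are hypothetical helpers, and `translate_fn` stands in for the `translate` function defined earlier.

```python
import json
import os


def load_cache(path):
    """Return the saved {source_text: translation} dict, or {} if none exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_cache(cache, path):
    """Write the cache to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Cyrillic readable in the file
        json.dump(cache, f, ensure_ascii=False, indent=2)


def translate_cached(text, cache, translate_fn):
    """Translate text once; repeated inputs are served from the cache."""
    if text not in cache:
        cache[text] = translate_fn(text)
    return cache[text]
```

A typical run would call `load_cache` once, use `translate_cached` inside the loop, and call `save_cache` at the end (or every few hundred items, so an interrupted run keeps most of its work).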

# Pivot data into csv

In [115]:
# convert list into columns

distinct_tags = {}

for factor in all_factors:
    
    # convert text to array
    tags = factor.get('Tags',"[]").strip("]").strip("[").split(",")
    
    for tag in tags:
        tag = tag.strip()
        # skip the empty string produced when a factor has no tags
        if not tag:
            continue
        # aggregate each tag
        distinct_tags[tag] = distinct_tags.get(tag,0)+1
        
        # pivot into column
        factor[tag] = True
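The counting part of this step can also be written with `collections.Counter` from the standard library. The sketch below uses a few sample records modeled on the preview output; the records and the helper name `parse_tags` are illustrative assumptions, not data from the file.

```python
from collections import Counter


def parse_tags(raw):
    """Turn a string like '[TG_DOC, TG_L2]' into ['TG_DOC', 'TG_L2']."""
    inner = raw.strip().strip("[]")
    # drop empty pieces so a missing or empty tag list yields []
    return [tag.strip() for tag in inner.split(",") if tag.strip()]


# Sample records modeled on the preview output; values are illustrative
sample = [
    {"Name": "PR", "Tags": "[TG_DOC, TG_STATIC, TG_L2]"},
    {"Name": "TR", "Tags": "[TG_DOC]"},
    {"Name": "NoTags"},
]

counts = Counter()
for record in sample:
    counts.update(parse_tags(record.get("Tags", "[]")))

print(counts.most_common(1))  # [('TG_DOC', 2)]
```

`most_common()` with no argument returns all tags sorted by frequency, which gives the same kind of listing as the sorted output in the next cell.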

In [153]:
# list tags sorted by frequency used
sorted(distinct_tags.items(), key=lambda x:x[1], reverse=True)

[('TG_DOC', 1155),
 ('TG_DYNAMIC', 1154),
 ('TG_L2', 1030),
 ('TG_USER', 1019),
 ('TG_DEPRECATED', 999),
 ('TG_NN_OVER_FEATURES_USE', 902),
 ('TG_USER_SEARCH', 900),
 ('TG_STATIC', 671),
 ('TG_UNDOCUMENTED', 574),
 ('TG_USERFEAT', 508),
 ('TG_LOCALIZED_COUNTRY', 408),
 ('TG_USER_SEARCH_ONLY', 408),
 ('TG_DOC_TEXT', 383),
 ('TG_ANNOTATION_NOFILTER', 345),
 ('TG_TEXT_MACHINE', 330),
 ('TG_OFTEN_ZERO', 304),
 ('TG_USERFEAT_90D', 259),
 ('TG_SAMOHOD_UNIMPLEMENTED', 253),
 ('TG_CALLISTO_UNIMPLEMENTED', 250),
 ('TG_QUERY_ONLY', 248),
 ('TG_UNUSED', 242),
 ('TG_HOST', 222),
 ('TG_REARR_USE', 216),
 ('TG_RANDOM_LOG', 199),
 ('TG_NEURAL', 196),
 ('TG_BINARY', 192),
 ('TG_BROWSER', 185),
 ('TG_URL_TEXT', 153),
 ('TG_UNIMPLEMENTED', 149),
 ('TG_LINK_TEXT', 146),
 ('TG_USER_EXT_DATA', 143),
 ('TG_OWNER', 141),
 ('TG_FORMULA_2245_DEP_2', 137),
 ('TG_USERFEAT_CLICK_MACHINE', 133),
 ('TG_LINGBOOST', 126),
 ('TG_FORMULA_2245_DEP_3', 120),
 ('TG_REMOVED', 115),
 ('TG_USERFEAT_SEARCH_DWELL_TIME', 102),


# Convert to DataFrame

In [162]:
df = pd.DataFrame(all_factors)
df.to_csv("factors_gen_translated.csv",index=False)

# Example of Selecting by tag

In [164]:
df_non = df[['Name','Desc','TG_UNIMPLEMENTED','TG_DEPRECATED','TG_UNUSED','Description']]
# keep only factors tagged unimplemented, deprecated, or unused
df_non = df_non[(df_non['TG_UNIMPLEMENTED']==True) | (df_non['TG_DEPRECATED']==True) | (df_non['TG_UNUSED']==True)]

for i, row in df_non.iterrows():
    print("###", row['Name'])
    print(row.get('Description',None))
    print(row['Desc'])
    print()

### PR
Page rank. Фактор ремапится.
Page rank. The factor is remapped.

### TR
Текстовая релевантность (maxfreq – частота самого частого слова, которая имеет смысл длины документа).
Textual relevance (maxfreq - the frequency of the most frequent word, which makes sense of the length of the document).

### LR
Линковая релевантность. Фактор ремапится.
Link Relevance. The factor is remapped.

### PrBonus
Priority bonus, приоритет 7  - текстовый приоритет. Фактор бинарный, имеет значение 0 для всех однословных запросов, и значение 1 практически для всех двух и более словных, кроме очень маленького количества ответов, для которых нет ни одной ссылки, прошедшей кворум, и текст тоже не прошел кворум.
Priority bonus, priority 7 - text priority. Factor is binary, has value 0 for all single word queries, and value 1 for almost all two or more word queries, except for a very small number of responses, for which there are no links that passed the quorum, and the text did not pass the quorum either

# Results of Bulk Translation

In [161]:
# exclude factors tagged unimplemented, deprecated, or unused (~ inverts the boolean mask)
df_2 = df[~((df['TG_UNIMPLEMENTED']==True) | (df['TG_DEPRECATED']==True) | (df['TG_UNUSED']==True))]
for i, row in df_2[(df_2['TG_DOC']==True) & (df_2['TG_STATIC']==True)].iterrows():
    print("###", row['Name'])
    print(row.get('Description',None))
    print(row['Desc'])
    print()

### Long
Длинный документ (чем длиннее документ, тем больше значение фактора).
Long document (the longer the document, the greater the value of the factor).

### PureText
Длинный текст без ссылок.
Long text without references.

### Root
Это морда.
It's a muzzle.

### RusLang
Язык документа - русский.
The language of the document is Russian.

### AddTime
Время добавления страницы, больше - более старый документ; кладется корень из времени, отображенный на интервал [0,1] так, чтобы 3+ года давало 1.
The time of adding a page, more is an older document; put the root of the time mapped to the interval [0,1] so that 3+ years gives 1.

### IsMainPage
Если главная страница владельца (чаще всего домен второго уровня, например xxxx.ru), то фактор равен 1. Для бомжатников, хостингов, личных блогов и т.д. (например, лайфджорнал, народ.ру и пр.) - домены третьего уровня (типа xxxxx.narod.ru) так же будут иметь фактор равный 1.
If the main page of the owner (most often a second-level domain, such a