# DeepL API


### Overview

Sometimes you want to bulk translate. This code shows how to do that with a basic example of the DeepL api

### About Me

My name is Alton Alexander. I am a Data Science consultant turned entreprenuer building SaaS tools for SEO.

Find more about my free scripts or ask me any question on twitter: @alton_lex

In [49]:
import urllib.request
import requests
import json
import pandas as pd

In [3]:
# download a file from public repository online
# to temporary file

file_url = "https://www.dropbox.com/s/toyehkkfduogbwk/factors_gen.txt?dl=1"
filename = "factors_gen.txt"
urllib.request.urlretrieve(file_url, filename)

('factors_gen.txt', <http.client.HTTPMessage at 0x7fe9fcf4d2d0>)

In [105]:
# convert file to dict

myfile = open(filename, "r")

all_factors = []
factor = {}
i = 0
while myfile:
    line  = myfile.readline()
    if "Factor {" in line:
        if i:
            all_factors.append(factor)
        i = i+1
        factor = {}
        
    if ":" in line:
        var = line.split(":")
        factor[var[0].strip().strip("\n")] = line[(line.find(":")+1):].strip().strip("\n").strip('"')
        
    if line == "":
        break
        
myfile.close()


In [106]:
len(all_factors)

1922

In [107]:
# preview
all_factors[0:10]

[{'Index': '0',
  'CppName': 'FI_PAGE_RANK',
  'Name': 'PR',
  'Wiki': 'https://wiki.yandex-team.ru/jandekspoisk/kachestvopoiska/factordev/web/factors/PageRank',
  'AntiSeoUpperBound': '1.0',
  'Tags': '[TG_DOC, TG_LINK_GRAPH, TG_STATIC, TG_L2, TG_UNUSED]',
  'Description': 'Page rank. Фактор ремапится.',
  'Authors': 'aavdonkin',
  'Responsibles': 'aavdonkin'},
 {'Index': '1',
  'CppName': 'FI_TEXT_RELEV',
  'Name': 'TR',
  'AntiSeoUpperBound': '0.95',
  'Group': 'LegacyTR',
  'Tags': '[TG_DOC, TG_DOC_TEXT, TG_DYNAMIC, TG_REARR_USE, TG_UNDOCUMENTED, TG_NN_OVER_FEATURES_USE]',
  'Description': 'Текстовая релевантность (maxfreq – частота самого частого слова, которая имеет смысл длины документа).',
  'Authors': '["gulin", "iseg", "leo", "maslov"]',
  'Responsibles': '["gulin", "leo", "maslov"]'},
 {'Index': '2',
  'CppName': 'FI_LINK_RELEV',
  'Name': 'LR',
  'AntiSeoUpperBound': '1.0',
  'Group': 'Dynamic',
  'Tags': '[TG_DOC, TG_DYNAMIC, TG_LINK_TEXT, TG_UNDOCUMENTED, TG_DEPRECATED]',

# API translation

Input: Any language

Output: Translated to specified language

Signup at Deepl.com

In [29]:
api_key_deepl = "add key here"

# from the docs
#curl -X POST 'https://api-free.deepl.com/v2/translate' \
#	-H 'Authorization: DeepL-Auth-Key asdfasdfasdfa' \
#	-d 'text=Hello%2C%20world!' \
#	-d 'target_lang=DE'

def translate( text, target_lang="EN"):
    res = requests.post(
        'https://api-free.deepl.com/v2/translate',
        headers= {"Authorization":"DeepL-Auth-Key "+api_key_deepl},
        json = {"text":[text], "target_lang":target_lang}
    )

    return json.loads(res.text)['translations'][0]['text']

In [30]:

# test one input
translate('Приоритет strict для TR - текстовый приоритет - есть все слова запроса где-то в документе (при этом они проходят контекстные ограничения запроса, например, оба слова д.б. в одном предложении).')

'Priority strict for TR is text priority - there are all query words somewhere in the document (and they pass contextual restrictions of the query, for example, both words d.b. in the same sentence).'

In [None]:
descriptions_to_en = {}

In [114]:
# loop over every description and convert to english

i = 0

for factor in all_factors:
    
    if ('Desc' not in factor) and ('Description' in factor):
        print(i, factor['Description'])
        if factor['Description'] in descriptions_to_en:
            factor['Desc'] = descriptions_to_en[factor['Description']]
        else:
            factor['Desc'] = translate(factor['Description'])
    i = i+1

0 Page rank. Фактор ремапится.
1 Текстовая релевантность (maxfreq – частота самого частого слова, которая имеет смысл длины документа).
2 Линковая релевантность. Фактор ремапится.
3 Priority bonus, приоритет 7  - текстовый приоритет. Фактор бинарный, имеет значение 0 для всех однословных запросов, и значение 1 практически для всех двух и более словных, кроме очень маленького количества ответов, для которых нет ни одной ссылки, прошедшей кворум, и текст тоже не прошел кворум.
4 Приоритет strict для TR - текстовый приоритет - есть все слова запроса где-то в документе (при этом они проходят контекстные ограничения запроса, например, оба слова д.б. в одном предложении).
5 Приоритет phrase для TR - текстовый приоритет - есть все слова запроса подряд в документе.
6 (strict) есть все слова запроса в одном линке.
7 (phrase) есть все слова запроса подряд в одном линке.
8 Наличие точной фразы (текста запроса) в заголовке (если точнее, в первом предложении документа). Контекстные ограничения и ст

# Pivot data

In [115]:
# convert list into columns

distinct_tags = {}

for factor in all_factors:
    
    # convert text to array
    tags = factor.get('Tags',"[]").strip("]").strip("[").split(",")
    
    for tag in tags:
        tag = tag.strip()
        # aggregate each tag
        distinct_tags[tag] = distinct_tags.get(tag,0)+1
        
        # pivot into column
        factor[tag] = True

In [116]:
sorted(distinct_tags.items(), key=lambda x:x[1], reverse=True)

[('TG_DOC', 1155),
 ('TG_DYNAMIC', 1154),
 ('TG_L2', 1030),
 ('TG_USER', 1019),
 ('TG_DEPRECATED', 999),
 ('TG_NN_OVER_FEATURES_USE', 902),
 ('TG_USER_SEARCH', 900),
 ('TG_STATIC', 671),
 ('TG_UNDOCUMENTED', 574),
 ('TG_USERFEAT', 508),
 ('TG_LOCALIZED_COUNTRY', 408),
 ('TG_USER_SEARCH_ONLY', 408),
 ('TG_DOC_TEXT', 383),
 ('TG_ANNOTATION_NOFILTER', 345),
 ('TG_TEXT_MACHINE', 330),
 ('TG_OFTEN_ZERO', 304),
 ('TG_USERFEAT_90D', 259),
 ('TG_SAMOHOD_UNIMPLEMENTED', 253),
 ('TG_CALLISTO_UNIMPLEMENTED', 250),
 ('TG_QUERY_ONLY', 248),
 ('TG_UNUSED', 242),
 ('TG_HOST', 222),
 ('TG_REARR_USE', 216),
 ('TG_RANDOM_LOG', 199),
 ('TG_NEURAL', 196),
 ('TG_BINARY', 192),
 ('TG_BROWSER', 185),
 ('TG_URL_TEXT', 153),
 ('TG_UNIMPLEMENTED', 149),
 ('TG_LINK_TEXT', 146),
 ('TG_USER_EXT_DATA', 143),
 ('TG_OWNER', 141),
 ('TG_FORMULA_2245_DEP_2', 137),
 ('TG_USERFEAT_CLICK_MACHINE', 133),
 ('TG_LINGBOOST', 126),
 ('TG_FORMULA_2245_DEP_3', 120),
 ('TG_REMOVED', 115),
 ('TG_USERFEAT_SEARCH_DWELL_TIME', 102),


# Convert to DataFrame

In [117]:
df = pd.DataFrame.from_dict(all_factors)

# Not Implemented but Named Factors

In [142]:
df_non = df[['Name','Desc','TG_UNIMPLEMENTED','TG_DEPRECATED','TG_UNUSED']]
df_non[(df_non['TG_UNIMPLEMENTED']==True) | (df_non['TG_DEPRECATED']==True) | (df['TG_UNUSED']==True)]

Unnamed: 0,Name,Desc,TG_UNIMPLEMENTED,TG_DEPRECATED,TG_UNUSED
0,PR,Page rank. The factor is remapped.,,,True
2,LR,Link Relevance. The factor is remapped.,,True,
6,LRp1,(strict) have all query words in one link.,,True,
7,LRp2,(phrase) have all query words in a row in one ...,,True,
10,Removed_10,,,,True
...,...,...,...,...,...
1908,RandomCommercial,The 'random' factor for commercial sites.,,,True
1909,UnexpectedTrashUrlQualityFresh,Neural document model for finding unexpected t...,True,,True
1910,RequestMultitokensAllMaxFUrlBclmMixPlainKE5,Features calculated on url with request multit...,True,,
1911,RequestMultitokensAllSumW2FSumWUrlExactQueryMa...,Features calculated on url with request multit...,True,,


# Search Factors

In [143]:
df_2 = df[-((df['TG_UNIMPLEMENTED']==True) | (df['TG_DEPRECATED']==True) | (df['TG_UNUSED']==True))]
for i, row in df_2[(df_2['TG_DOC']==True) & (df_2['TG_STATIC']==True)][['Name','Desc','TG_DYNAMIC']].iterrows():
    print("###", row['Name'])
    print(row['Desc'])
    print()

### Long
Long document (the longer the document, the greater the value of the factor).

### PureText
Long text without references.

### Root
It's a muzzle.

### RusLang
The language of the document is Russian.

### AddTime
The time of adding a page, more is an older document; put the root of the time mapped to the interval [0,1] so that 3+ years gives 1.

### IsMainPage
If the main page of the owner (most often a second-level domain, such as xxxx.ru), the factor is 1. For bomzhatniki, hosting, personal blogs, etc. (eg, Lyfjornal, narod.ru, etc.) - third-level domains (such as xxxxx.narod.ru) will also have a factor of 1.

### Hops
The number of hops of the url in a roundtrip (like less - closer to the muzzle, the smaller the value (0 - muzzle, 1 - cannot be reached from the muzzle, 0 < can be reached from the muzzle < 1). Normal value for nost root is 0.0039).

### TextFeatures
Text quality. Calculated according to a rather complicated formula

### TextLike
Text quality (Alekseev's cla

In [144]:

for i, row in df_2[(df_2['TG_QUERY_ONLY']==True) & (df_2['TG_QUERY_ONLY']==True)].iterrows():
    print("##", row['Name'])
    print(row['Desc'])
    print(row['Tags'])
    print()

## LongQuery
The sum of the idf of the query words. The name does not reflect the essence: for example, for the query 'Gadyach' this factor will be greater than for the query 'Moscow Peter Yekaterinburg Samara'.
[TG_QUERY_ONLY, TG_DYNAMIC, TG_REARR_USE, TG_UNDOCUMENTED, TG_L2, TG_L3_OVERWRITE, TG_NN_OVER_FEATURES_USE]

## WordCount
Min(number of query words/10, 1.f)
[TG_QUERY_ONLY, TG_DYNAMIC, TG_REARR_USE, TG_UNDOCUMENTED, TG_L2, TG_NN_OVER_FEATURES_USE]

## InvWordCount
1 / number_words_in_request.
[TG_DYNAMIC, TG_QUERY_ONLY, TG_REARR_USE, TG_UNDOCUMENTED, TG_L2, TG_NN_OVER_FEATURES_USE]

## QueryRegionSize
Request region size
[TG_DYNAMIC, TG_LOCALIZED_CITY, TG_QUERY_ONLY, TG_UNDOCUMENTED, TG_L2, TG_NN_OVER_FEATURES_USE]

## QueryRegionInvSize
The factor is inversely proportional to the size of the region of the request
[TG_DYNAMIC, TG_LOCALIZED_CITY, TG_QUERY_ONLY, TG_UNDOCUMENTED, TG_L2, TG_NN_OVER_FEATURES_USE]

## SpecificalQuery
The query is locale-specific. The query is often r

In [146]:

for i, row in df_2[(df_2['TG_URL_TEXT']==True)].iterrows():
    print("##", row['Name'])
    print(row['Desc'])
    print(row['Tags'])
    print()

## UrlLen
URL length divided by 5
[TG_DOC, TG_STATIC, TG_URL_TEXT, TG_REARR_USE, TG_UNDOCUMENTED, TG_L2, TG_NN_OVER_FEATURES_USE]

## IsForum
The URL satisfies the FORUM_DETECTOR regularity
[TG_DOC, TG_STATIC, TG_URL_TEXT, TG_BINARY, TG_REARR_USE, TG_UNDOCUMENTED, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE]

## IsObsolete
There is an ancient date in the URL. Ancient news are recognized. Factor 1 if url has year <=2007.
[TG_DATE, TG_DOC, TG_STATIC, TG_URL_TEXT, TG_BINARY, TG_UNDOCUMENTED, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE]

## OnlyUrl
All matches are in the URL only, no matches in the text of the page
[TG_DOC, TG_DYNAMIC, TG_URL_TEXT, TG_BINARY, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE]

## IsCom
.com domain
[TG_HOST, TG_STATIC, TG_URL_TEXT, TG_BINARY, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE]

## IsUa
Domain in the zone .ua
[TG_HOST, TG_STATIC, TG_URL_TEXT, TG_BINARY, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE]

## IsNotRu
The domain is not in the .ru zone
[

In [145]:
for i, row in df_2[(df_2['TG_LINK_TEXT']==True)].iterrows():
    print("##", row['Name'])
    print(row['Desc'])
    print(row['Tags'])
    print()

## QfufAllMaxFLinkAnnIndicatorAnnotationMaxValueWeighted
Linguistic Boosting Factor. Type of extensions: Qfuf. Aggregation by all extensions. Highest factor value. By stream from LinkAnnIndicator link index. Algorithm AnnotationMaxValueWeighted - maximum weight (by MainWeights word weights) of annotation coverage, weighted by annotation weight
[TG_USER, TG_LINK_TEXT, TG_LINGBOOST, TG_TEXT_MACHINE, TG_USER_SEARCH, TG_DYNAMIC, TG_NN_OVER_FEATURES_USE]

## QfufAllMaxWFLinkAnnIndicatorFullMatchValue
Linguistic Boosting Factor. Type of extensions: Qfuf. Aggregation by all extensions. Highest factor value. By stream from LinkAnnIndicator link index. Algorithm AnnotationMaxValueWeighted - maximum weight (by MainWeights word weights) of annotation coverage, weighted by annotation weight
[TG_USER, TG_LINK_TEXT, TG_LINGBOOST, TG_TEXT_MACHINE, TG_USER_SEARCH, TG_DYNAMIC, TG_NN_OVER_FEATURES_USE]

## RequestWithRegionNameLinkAnnFloatMultiplicityCMMatchTop5AvgMatchValue
Linguistic Boosting Factor. 

In [147]:
for i, row in df_2[(df_2['TG_HOST']==True)].iterrows():
    print("##", row['Name'])
    print(row['Desc'])
    print(row['Tags'])
    print()

## News
This is news (determined by distinctive ((http://wiki.yandex-team.ru/JandeksPoisk/KachestvoPoiska/ObshayaFormula/TekushhieKomponenty/Klassificacionnye?v=tkd#h45859-3 patterns in url)) ).
[TG_HOST, TG_STATIC, TG_BINARY, TG_REARR_USE, TG_UNDOCUMENTED, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE]

## Cat
This is a directory (determined by characteristic ((http://wiki.yandex-team.ru/JandeksPoisk/KachestvoPoiska/ObshayaFormula/TekushhieKomponenty/Klassificacionnye?v=tkd#h45859-2 patterns in the url)) or by the Yandex directory).
[TG_HOST, TG_STATIC, TG_BINARY, TG_UNDOCUMENTED, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE]

## YaBar
Attendance from Bar - ((http://wiki.yandex-team.ru/AndrejjKostjagin/YaBarLog/HostStat Data Description)). The factor is remapped.
[TG_BROWSER, TG_HOST, TG_STATIC, TG_USER, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE, TG_USERFEAT_VISITS_ACTIVITY_DOWNLOADS, TG_USERFEAT]

## AddTimeMP
The owner (host?) main page addition time, remaps in the same way as 

In [148]:
for i, row in df_2[(df_2['TG_BROWSER']==True)].iterrows():
    print("##", row['Name'])
    print(row['Desc'])
    print(row['Tags'])
    print()

## YaBar
Attendance from Bar - ((http://wiki.yandex-team.ru/AndrejjKostjagin/YaBarLog/HostStat Data Description)). The factor is remapped.
[TG_BROWSER, TG_HOST, TG_STATIC, TG_USER, TG_OFTEN_ZERO, TG_L2, TG_NN_OVER_FEATURES_USE, TG_USERFEAT_VISITS_ACTIVITY_DOWNLOADS, TG_USERFEAT]

## YaBarCoreOwner
The core audience of owners according to Yandex.Browsing
[TG_STATIC, TG_OWNER, TG_USER, TG_BROWSER, TG_L2, TG_USERFEAT, TG_USERFEAT_90D, TG_NN_OVER_FEATURES_USE, TG_USERFEAT_VISITS_ACTIVITY_DOWNLOADS]

## YaBarCoreHost
Host audience kernel according to Yandex.Browsing
[TG_STATIC, TG_HOST, TG_USER, TG_BROWSER, TG_REARR_USE, TG_L2, TG_USERFEAT, TG_USERFEAT_90D, TG_NN_OVER_FEATURES_USE, TG_USERFEAT_VISITS_ACTIVITY_DOWNLOADS]

## HasYaBarCore
Does the host have a kernel
[TG_STATIC, TG_HOST, TG_USER, TG_BROWSER, TG_BINARY, TG_OFTEN_ZERO, TG_L2, TG_USERFEAT, TG_USERFEAT_90D, TG_NN_OVER_FEATURES_USE, TG_USERFEAT_VISITS_ACTIVITY_DOWNLOADS]

## YabarHostVisitors
the number of unique visitors, remaps e

In [149]:
for i, row in df_2[(df_2['TG_TEXT_MACHINE']==True)].iterrows():
    print("##", row['Name'])
    print(row['Desc'])
    print(row['Tags'])
    print()

## FioFromOriginalRequestBodyMinWindowSize
Factor by name from the original query It is counted by the content of the document. Minimum window size, which includes all the words of the query. Normalized by the number of words in the query.
[TG_DOC, TG_DOC_TEXT, TG_UNDOCUMENTED, TG_TEXT_MACHINE]

## TelFullAttributeTextBocm15K001
Factor by phone attributes tel_full from the original query Text document. The word weights aggregation algorithm Bocm15. The normalization coefficient is 0.01.
[TG_DOC, TG_DOC_TEXT, TG_TEXT_MACHINE, TG_UNDOCUMENTED, TG_DYNAMIC]

## UrlBm15K01
Bm15K01 factor over hits from Url
[TG_DYNAMIC, TG_DOC, TG_URL_TEXT, TG_TEXT_MACHINE, TG_NN_OVER_FEATURES_USE]

## TitleBm15K01
Bm15K01 factor over hits from Title
[TG_DYNAMIC, TG_DOC, TG_DOC_TEXT, TG_TEXT_MACHINE, TG_NN_OVER_FEATURES_USE]

## TitleBocm15K001
Bocm15K001 factor over hits from Title
[TG_DYNAMIC, TG_DOC, TG_DOC_TEXT, TG_TEXT_MACHINE, TG_NN_OVER_FEATURES_USE]

## TextBm11Norm16384
Bm11Norm16384 factor over hit