# Script get_language

## Aim
This notebook tests how to identify the language of a text using an external Azure API (the `detect` endpoint of the Translator service).

## Usage
The API key must be on the first line of a `param.txt` file in the same directory as this notebook.
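
As a minimal sketch of this setup, the key can be read with an explicit check that the file exists and is not empty. The helper name `load_api_key` is hypothetical and not part of the notebook; it only assumes that the key sits on the first line of the file.

```python
from pathlib import Path


def load_api_key(path="param.txt"):
    """Return the key stored on the first line of `path` (hypothetical helper)."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"Expected the API key in {key_file.resolve()}")
    lines = key_file.read_text(encoding="utf-8").splitlines()
    # Only the first line is used; surrounding whitespace is ignored.
    key = lines[0].strip() if lines else ""
    if not key:
        raise ValueError(f"The first line of {key_file} is empty")
    return key
```

A missing or empty file then fails with a readable message before any request is sent, rather than with an authentication error from the API.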


In [11]:
import sys
import requests
import pandas as pd

In [24]:
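# The first line of param.txt holds the Azure subscription key.
with open('param.txt') as file:
    API_KEY = file.readline().strip()

ENDPOINT = "https://api.cognitive.microsofttranslator.com/detect?api-version=3.0"

HEADERS = {
    "Ocp-Apim-Subscription-Key": API_KEY,
    "Content-Type": "application/json",
}

# Only the first MAX_LENGTH characters of each text are sent to the API.
MAX_LENGTH = 140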

In [25]:
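def get_language(text):
    """Return the language code that the Azure detect endpoint assigns to `text`."""
    body = [{"Text": text}]
    r = requests.post(ENDPOINT, headers=HEADERS, json=body)
    r.raise_for_status()  # fail on HTTP errors (e.g. invalid key) instead of a confusing KeyError
    return r.json()[0]["language"]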
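
`get_language` sends one request per text. The request body is a JSON array, so several texts can share a single request. The sketch below assumes a limit of 100 array elements per request and that results come back in input order; both are worth checking against the current Azure documentation. `get_languages`, `MAX_BATCH` and the `post` parameter are hypothetical names introduced here; `post` exists only so that the function can be tried without network access.

```python
# Assumed limit on the number of texts per detect request; check the Azure docs.
MAX_BATCH = 100


def get_languages(texts, endpoint, headers, post=None):
    """Detect the language of each text, sending at most MAX_BATCH texts per request."""
    if post is None:
        import requests  # same HTTP library as the notebook, imported lazily
        post = requests.post

    languages = []
    for start in range(0, len(texts), MAX_BATCH):
        body = [{"Text": text} for text in texts[start:start + MAX_BATCH]]
        r = post(endpoint, headers=headers, json=body, timeout=30)
        r.raise_for_status()
        # Assumption: one result per input text, in the same order as the request.
        languages.extend(item["language"] for item in r.json())
    return languages
```

With the notebook's constants, a call would look like `get_languages(list_of_texts, ENDPOINT, HEADERS)`, replacing a loop of individual `get_language` calls with a handful of requests.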

The following cell tests the API on several examples from the [Wikipedia Language Identification Database](https://zenodo.org/record/841984).

In [26]:
# Build a DataFrame with the texts (column 'text') and their language labels (column 'lang')

with open('Dataset/x_train.txt', encoding='utf-8') as source_x:
    content_x = [item.strip() for item in source_x]
with open('Dataset/y_train.txt', encoding='utf-8') as source_y:
    content_y = [item.strip() for item in source_y]
data = pd.DataFrame({'text': content_x, 'lang': content_y})

# Select the 5 most common languages: English, Mandarin (Standard Chinese), Hindi, Spanish, French
# (source: https://fr.wikipedia.org/wiki/Liste_de_langues_par_nombre_total_de_locuteurs),
# each mapped from its dataset label to the corresponding Azure API code

most_common_langs = {'eng': 'en', 'zho': 'zh-Hans', 'hin': 'hi', 'spa': 'es', 'fra': 'fr'}
most_common_df = data.loc[data.lang.isin(most_common_langs)]

# Select the first n examples for each of the 5 most common languages
n = 5
first_n_examples = [most_common_df[most_common_df['lang'] == lang].head(n) for lang in most_common_langs]
selection = pd.concat(first_n_examples)

# Request the language of the first MAX_LENGTH characters of each example; answers are collected in a list
results = []
progress = 0
total = len(selection['text'])

for text in selection['text']:
    results.append(get_language(text[:MAX_LENGTH]))
    progress += 1
    print(f'{progress*100//total}%',end='\r', flush=True)

selection['guess'] = results 
selection.reset_index(drop=True, inplace=True)

100%

In [27]:
selection

| | text | lang | guess |
|---|---|---|---|
| 0 | In 1978 Johnson was awarded an American Instit... | eng | en |
| 1 | Bussy-Saint-Georges has built its identity on ... | eng | en |
| 2 | Minnesota's state parks are spread across the ... | eng | en |
| 3 | Nordahl Road is a station served by North Coun... | eng | en |
| 4 | A talk by Takis Fotopoulos about the Internati... | eng | en |
| 5 | 胡赛尼本人和小说的主人公阿米尔一样，都是出生在阿富汗首都喀布尔，少年时代便离开了这个国家。胡... | zho | zh-Hans |
| 6 | 2017年1月7日，參與了「SNH48第三屆年度金曲大賞BEST 50」。2月15日，出演由... | zho | zh-Hans |
| 7 | 在他们出发之前，罗伯特·菲茨罗伊送给了达尔文一卷查尔斯·赖尔所著《地质学原理》（在南美他得到... | zho | zh-Hans |
| 8 | 系列的第一款作品《薩爾達傳說》（ゼルダの伝説）在1986年2月21日於日本發行，之後在198... | zho | zh-Hant |
| 9 | 历史上的柔远驿是为了给琉球贡使及随员提供食宿之所，同时它也成为中琉间商业和文化交流的枢纽。琉... | zho | zh-Hans |
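
The `guess` column holds Azure codes while `lang` holds the dataset's three-letter labels, so the two have to be mapped onto each other before they can be compared. A minimal sketch, using only the `lang` and `guess` values of the ten rows displayed above and the `most_common_langs` mapping from the notebook; the `expected` and `correct` column names are introduced here for illustration:

```python
import pandas as pd

most_common_langs = {'eng': 'en', 'zho': 'zh-Hans', 'hin': 'hi', 'spa': 'es', 'fra': 'fr'}

# lang / guess values copied from the ten rows displayed above
shown = pd.DataFrame({
    'lang': ['eng'] * 5 + ['zho'] * 5,
    'guess': ['en'] * 5 + ['zh-Hans', 'zh-Hans', 'zh-Hans', 'zh-Hant', 'zh-Hans'],
})

# Translate each dataset label into the code the API is expected to return.
shown['expected'] = shown['lang'].map(most_common_langs)
shown['correct'] = shown['expected'] == shown['guess']

accuracy = shown['correct'].mean()
per_language = shown.groupby('lang')['correct'].mean()
print(f'accuracy: {accuracy:.0%}')
print(per_language)
```

With this strict comparison, row 8 counts as an error because the API answered `zh-Hant` (Traditional Chinese) while the mapping expects `zh-Hans`. Whether that is a genuine mistake depends on how the dataset's `zho` label is meant to be read; comparing only the part of the code before the hyphen would treat both scripts as Chinese.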
