## Import Languages

Dataset location:
> `\dataset\languages\`

### languages.json
Data from Wikipedia, keyed by ISO 639-1 code, with extra fields such as native name and number of speakers. Source:
* https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers

### language-codes
ISO language codes (639-1 and 639-2) and IETF language tags (the "language-codes" directory):
* https://datahub.io/core/language-codes

More on the specific standards:
* https://www.iso.org/iso-639-language-codes.html

The "file" column in "ietf-language-tags.csv" points to files in the "core/main" directory, which contain useful information about each language.

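A minimal sketch of following that link with the standard library. The sample rows are invented, the `lang` column name is an assumption, and so is the idea that "file" holds a bare file name such as `af.xml`; `core_main_dir` is a hypothetical path. Check all of these against the real CSV before relying on it.

```python
import csv
import io
import os

# Invented sample standing in for ietf-language-tags.csv. Only the "file"
# column comes from the notes above; "lang" and all values are assumptions.
SAMPLE_CSV = """lang,file
af,af.xml
en-GB,en_GB.xml
"""


def resolve_core_files(csv_text, core_main_dir):
    """Map each language tag to the path of its file under core/main."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["lang"]: os.path.join(core_main_dir, row["file"]) for row in reader}


# Hypothetical location of the unpacked CLDR "core/main" directory.
core_main_dir = os.path.join("dataset", "languages", "core", "main")
paths = resolve_core_files(SAMPLE_CSV, core_main_dir)
for tag, path in paths.items():
    print(tag, "->", path)
```

For the real data, the same function could be fed the contents of "ietf-language-tags.csv" instead of `SAMPLE_CSV`.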
### core
Provided by the [CLDR project](https://cldr.unicode.org/), available at:
* https://www.unicode.org/Public/cldr/latest/core.zip

In [1]:
# Cleaning and setup.
import os, json
import pandas as pd
from tqdm.auto import tqdm
from django.conf import settings
from datacore.models import Language

# OPTIONAL: Deleting all Languages from database.
#Language.objects.all().delete()
# OPTIONAL: Deleting all Languages from database except English.
#Language.objects.all().exclude(alpha2='en').delete()

In [2]:
# Importing Languages.
# Import alpha2, alpha3b, and en_name for the main languages.
lang_file = os.path.join(settings.BASE_DIR, '../dataset/languages/language-codes/archive/language-codes-3b2.csv')
df = pd.read_csv(lang_file, encoding='utf-8', sep=',')

for _, row in tqdm(df.iterrows(), total=len(df)):
    # Create the language if it is missing, then set or refresh its other fields.
    language, created = Language.objects.get_or_create(alpha2=row['alpha2'])
    language.alpha3b = row['alpha3-b']
    language.en_name = row['English']
    language.save()

  0%|          | 0/184 [00:00<?, ?it/s]

In [3]:
# Import alpha3t and the less-used languages.
lang_file = os.path.join(settings.BASE_DIR, '../dataset/languages/language-codes/archive/language-codes-full.csv')
df = pd.read_csv(lang_file, encoding='utf-8', sep=',')
df.fillna('', inplace=True)

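for _, row in tqdm(df.iterrows(), total=len(df)):
    try:
        if row['alpha2'] != "":
            # Imported in the previous cell: only add the alpha3t code, if any.
            language = Language.objects.get(alpha2=row['alpha2'])
            if row['alpha3-t'] != "":
                language.alpha3t = row['alpha3-t']
                language.save()
        else:
            # No alpha2 code: a less-used language identified only by alpha3b.
            language = Language(alpha3b=row['alpha3-b'], en_name=row['English'])
            language.save()
    except Exception:
        # Report the row and continue instead of aborting the whole import.
        tqdm.write(f"Error in importing row: {row}")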

  0%|          | 0/487 [00:00<?, ?it/s]

Error in importing row: alpha3-b                     qaa-qtz
alpha3-t                            
alpha2                              
English       Reserved for local use
French      réservée à l'usage local
Name: 352, dtype: object
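The row that failed, `qaa-qtz`, is the ISO 639-2 range reserved for local use rather than a single three-letter code. The output does not show why the save failed, so the cause is a guess; one option is to skip such rows before touching the database. The sketch below is a hypothetical pre-filter (the name `is_single_alpha3` and the sample codes are not part of the notebook).

```python
import re

# A single ISO 639-2 code is exactly three lowercase ASCII letters;
# "qaa-qtz" is a range of codes, so it does not match.
ALPHA3_PATTERN = re.compile(r"[a-z]{3}")


def is_single_alpha3(code):
    """Return True if `code` looks like one three-letter language code."""
    return ALPHA3_PATTERN.fullmatch(code) is not None


# Hypothetical sample of values from the "alpha3-b" column.
sample_codes = ["eng", "ace", "qaa-qtz", ""]
importable = [c for c in sample_codes if is_single_alpha3(c)]
skipped = [c for c in sample_codes if not is_single_alpha3(c)]

print("importable:", importable)
print("skipped:", skipped)
```

In the import loop, the same check could guard the branch that creates a `Language` from `alpha3-b` alone.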


In [4]:
# Import custom data (for now: native name and number of native speakers).
lang_file_json = os.path.join(settings.BASE_DIR, '../dataset/languages/languages.json')
# 'utf-8-sig' strips the byte-order mark if the file starts with one.
with open(lang_file_json, encoding='utf-8-sig') as f:
    lang_data_json = json.load(f)

for lang, data in tqdm(lang_data_json.items()):
    try:
        language = Language.objects.get(alpha2=lang)
        if 'speakers' in data:
            language.native_speakers = data['speakers']
        language.native_name = data['nativeName']
        language.save()
    except Exception:
        # Typically a code in the JSON file that has no matching Language row.
        print(f"Error in importing: {data}")

# TODO: import dialect, territory, and other language data from ietf-language-tags.csv and its link to the 'core' dataset.
# TODO: locations can be imported from http://www.geonames.org/

  0%|          | 0/182 [00:00<?, ?it/s]
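The cell above decodes languages.json as `utf-8-sig`. A small self-contained sketch of why that codec matters, using an invented sample entry and a temporary file rather than the real dataset: a file saved with a byte-order mark loads cleanly with `utf-8-sig`, while plain `utf-8` leaves the mark in place and `json` rejects it.

```python
import json
import os
import tempfile

# Invented sample entry; the real languages.json is keyed by alpha2 code.
sample = {"en": {"nativeName": "English"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "languages.json")

    # Write the file with a leading BOM, as some editors do.
    with open(path, "w", encoding="utf-8-sig") as f:
        json.dump(sample, f)

    # 'utf-8-sig' strips the BOM on read and is harmless if there is none.
    with open(path, encoding="utf-8-sig") as f:
        loaded = json.load(f)

    # Plain 'utf-8' keeps the BOM as U+FEFF, which the json module rejects.
    with open(path, encoding="utf-8") as f:
        try:
            json.load(f)
            bom_rejected = False
        except json.JSONDecodeError:
            bom_rejected = True

# "speakers" is optional in the data, so read it with a default.
speakers = loaded["en"].get("speakers")
print(loaded, speakers, bom_rejected)
```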