## Language similarity 

### Lang2vec

In [2]:
%pip install lang2vec







In [3]:
import lang2vec.lang2vec as l2v
from scipy.spatial import distance

Lang2vec uses ISO 639-3 codes to represent languages.
Here are the relevant codes for our languages, taken from https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes:
- English: eng
- Danish: dan
- German: deu
- Polish: pol
- Slovak: slk
- Chinese: zho
- Russian: rus

Getting the vector representations of the languages 

In [4]:
category = "syntax_knn"
# Build a dict that maps each language code to its feature vector
features = l2v.get_features(["eng", "zho", "deu", "pol", "slk", "dan", "rus"], category)
# Index by language code to get the vector for that language
features["eng"]

[1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 1.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0]

In [5]:
# Get each language vector separately
eng = features["eng"]
zho = features["zho"]
deu = features["deu"]
pol = features["pol"]
slk = features["slk"]
dan = features["dan"]
rus = features["rus"]

We use cosine distance to measure language similarity. Cosine distance is 1 minus the cosine similarity of the two vectors, so a smaller cosine distance means a higher language similarity.
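To make the measure concrete, here is a minimal pure-Python sketch of the standard cosine distance formula, which is what `distance.cosine` computes for unweighted vectors. The helper name `cosine_distance` and the toy vectors are made up for illustration; they are not lang2vec output.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy binary feature vectors (hypothetical, for illustration only)
a = [1.0, 0.0, 1.0, 1.0]
b = [1.0, 0.0, 1.0, 0.0]
c = [0.0, 1.0, 0.0, 0.0]

print(cosine_distance(a, a))  # identical vectors: approximately 0
print(cosine_distance(a, b))  # most features shared: small distance
print(cosine_distance(a, c))  # no features shared: 1.0
```

For non-negative vectors such as these binary features, the distance stays between 0 (same direction) and 1 (no overlap).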

In [7]:
# cosine distance for english: 
eng_slk = distance.cosine(eng, slk)
eng_deu = distance.cosine(eng, deu)
eng_zho = distance.cosine(eng, zho)
eng_dan = distance.cosine(eng, dan)
eng_rus = distance.cosine(eng, rus)

print("Distance from english to slovak", eng_slk)
print("Distance from english to german", eng_deu)
print("Distance from english to danish", eng_dan)
print("Distance from english to chinese", eng_zho)
print("Distance from english to russian", eng_rus)


Distance from english to slovak 0.17840613445904518
Distance from english to german 0.09745802098497958
Distance from english to danish 0.11992176734660276
Distance from english to chinese 0.2892275292349139
Distance from english to russian 0.18824593690905445


In [8]:
# cosine distance for polish
pol_slk = distance.cosine(pol, slk)
pol_deu = distance.cosine(pol, deu)
pol_zho = distance.cosine(pol, zho)
pol_dan = distance.cosine(pol, dan)
pol_rus = distance.cosine(pol, rus)


print("Distance from polish to slovak", pol_slk)
print("Distance from polish to german", pol_deu)
print("Distance from polish to danish", pol_dan)
print("Distance from polish to chinese", pol_zho)
print("Distance from polish to russian", pol_rus)

Distance from polish to slovak 0.0842708838474282
Distance from polish to german 0.19994463093047998
Distance from polish to danish 0.17396681236909783
Distance from polish to chinese 0.33287561500500895
Distance from polish to russian 0.0714285714285714


Exploring different vector types: lang2vec has vectors for different features of the languages, each representing them in a different way.


In [51]:
l2v.FEATURE_SETS

['syntax_wals',
 'phonology_wals',
 'syntax_sswl',
 'syntax_ethnologue',
 'phonology_ethnologue',
 'inventory_ethnologue',
 'inventory_phoible_aa',
 'inventory_phoible_gm',
 'inventory_phoible_saphon',
 'inventory_phoible_spa',
 'inventory_phoible_ph',
 'inventory_phoible_ra',
 'inventory_phoible_upsid',
 'syntax_knn',
 'phonology_knn',
 'inventory_knn',
 'syntax_average',
 'phonology_average',
 'inventory_average',
 'fam',
 'id',
 'geo',
 'learned']
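One thing to check before switching `category` to another set: as far as I am aware, the sets that are not `knn`-based can contain the string `'--'` where a feature value is unknown, and that would break `distance.cosine`. Treat the marker as an assumption and inspect the actual output first. The sketch below uses a hypothetical helper, `masked_cosine_distance`, that compares only the positions where both vectors have a value; the toy vectors are invented.

```python
import math

MISSING = "--"  # assumed missing-value marker; verify against real output

def masked_cosine_distance(u, v, missing=MISSING):
    """Cosine distance over positions where neither vector is missing.

    Returns None when no comparable positions remain or when one of the
    masked vectors is all zeros (the distance is undefined in that case).
    """
    pairs = [(a, b) for a, b in zip(u, v) if a != missing and b != missing]
    if not pairs:
        return None
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    if norm_u == 0 or norm_v == 0:
        return None
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors with gaps (hypothetical, not lang2vec output)
x = [1.0, "--", 0.0, 1.0]
y = [1.0, 0.0, "--", 1.0]

print(masked_cosine_distance(x, y))             # compares positions 0 and 3 only
print(masked_cosine_distance(["--"], ["--"]))   # nothing to compare: None
```

Masking changes how many features each pair is compared on, so distances computed this way are less comparable across language pairs than those from the `knn` sets, where the gaps have already been filled in.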