# Tutorial: Suggesting Subword Sizes and Correlating Language Distances

In this tutorial, we will use our $n$-gram coverage model to suggest near-optimal subword sizes for under-resourced languages, where optimal subword sizes are unknown. To test whether the $n$-gram coverage model captures typological, geographical, and phylogenetic information, we will also correlate the Euclidean distances between the suggested subword sizes of Wikipedia languages with the syntactic, geographic, phonological, genetic, and inventory language distances of [Littell et al. (2017)][1].

 [1]: https://github.com/antonisa/lang2vec#retrieving-pre-computed-distances

## Word Embeddings

Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored, and non-English fastText models were trained using subword sizes optimized for English and German word analogy tasks.

## Suggested Subword Sizes

We propose a cheap and simple $n$-gram coverage model that consistently improves the accuracy of fastText models on word analogy tasks by up to 3% compared to the default subword sizes and that is, on average, within 1% accuracy of the optimal subword sizes. Subword sizes suggested by our $n$-gram coverage model can be used in applications of fastText as the new default for under-resourced languages, where the optimal subword sizes are unknown.

## Software Package

You can find our package [here][2].

 [2]: https://github.com/MIR-MU/fasttext-optimizer

# Installing lang2vec

First, we will install the lang2vec library for computing syntactic, geographic, phonological, genetic, and inventory language distances.

In [None]:
%%capture
! pip install -U pip
! pip install git+https://github.com/antonisa/lang2vec.git
! pip install pandas pycountry scipy tqdm

If you use lang2vec, please cite the following paper:

``` bibtex
@inproceedings{littell2017uriel,
  title = {Uriel and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors},
  author = {Littell, Patrick and Mortensen, David R and Lin, Ke and Kairis, Katherine and Turner, Carlisle and Levin, Lori},
  booktitle = {Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  volume = {2},
  pages = {8--14},
  year = {2017}
}
```

## Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Downloading the Data

Next, we will download the [MIR-MU/fasttext-optimizer][2] Git repository with the precomputed suggested subword sizes.

 [2]: https://github.com/MIR-MU/fasttext-optimizer

In [None]:
%%capture
! git clone https://github.com/MIR-MU/fasttext-optimizer.git
! ln -s fasttext-optimizer/data data

# Suggesting Subword Sizes

We can use the $n$-gram coverage model to suggest subword sizes for any Wikipedia language, including under-resourced languages such as Icelandic, Estonian, and Inuktitut for which optimal subword sizes are unknown.
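
As a compact summary of what the following cells compute, let $f_k$ be the number of subterms of length $k$ recorded for a language. The symbols $f_k$, $c$, and $c^{*}$ are notation introduced here for illustration and do not appear in the package.

$$
c(\mathit{minn}, \mathit{maxn}) = 100 \cdot \frac{\sum_{k=\mathit{minn}}^{\mathit{maxn}} f_k}{\sum_{k} f_k},
\qquad
(\widehat{\mathit{minn}}, \widehat{\mathit{maxn}}) = \operatorname*{arg\,min}_{1 \le \mathit{minn} \le \mathit{maxn} \le 10} \left| c(\mathit{minn}, \mathit{maxn}) - c^{*} \right|
$$

With the defaults used below, $c^{*} = 4.91$, so the suggested sizes are those whose coverage is closest to 4.91% of all subterms.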

In [None]:
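import json

import pycountry

# Maps normalized ISO 639-3 codes to the language codes used in the data filenames.
# It is populated later, once the available data files have been listed.
denormalize_language_map = dict()

def get_ngram_coverage(language: str, minn: int, maxn: int, lower_limit: int = 1, upper_limit: int = 10) -> float:
    """Return the percentage of subterms whose length lies between minn and maxn (inclusive)."""
    try:
        language = denormalize_language_map[language]
    except KeyError:
        # Fall back to the two-letter ISO 639-1 code expected in the filename.
        language = pycountry.languages.lookup(language).alpha_2
    with open(f'data/wikimedia/wiki.{language}.json', 'rt') as f:
        subterm_length_freqs = json.load(f)['subterm_length_freqs']
    total_subterms = sum(subterm_length_freqs.values())
    coverage = sum(
        subterm_length_freqs[f'{subterm_size}']
        for subterm_size in range(minn, maxn + 1)
    ) / total_subterms
    return coverage * 100.0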
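
To make the coverage computation concrete without the data files, here is a minimal sketch on an in-memory dictionary. The counts in `toy_subterm_length_freqs` and the name `toy_ngram_coverage` are hypothetical. Unlike the function above, the sketch treats a missing length as a zero count instead of raising a `KeyError`.

```python
from typing import Dict

# Hypothetical counts of subterms by length; the real files hold corpus statistics.
toy_subterm_length_freqs: Dict[str, int] = {'1': 50, '2': 200, '3': 400, '4': 250, '5': 100}


def toy_ngram_coverage(freqs: Dict[str, int], minn: int, maxn: int) -> float:
    """Return the percentage of subterms whose length lies in [minn, maxn]."""
    total_subterms = sum(freqs.values())
    # Lengths that are absent from the dictionary count as zero (an assumption).
    covered_subterms = sum(freqs.get(str(size), 0) for size in range(minn, maxn + 1))
    return 100.0 * covered_subterms / total_subterms


# 650 out of 1000 toy subterms have length 3 or 4.
print(toy_ngram_coverage(toy_subterm_length_freqs, 3, 4))
```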

In [None]:
from itertools import product
import json
from typing import Tuple

def suggest_subword_sizes(language: str, lower_limit: int = 1, upper_limit: int = 10, optimum: float = 4.91) -> Tuple[int, int]:
    best_minn, best_maxn, best_coverage = None, None, float('inf')
    parameter_space = range(lower_limit, upper_limit + 1)
    parameter_space = ((i, j) for i, j in product(parameter_space, parameter_space) if i <= j)
    for minn, maxn in parameter_space:
        coverage = get_ngram_coverage(language, minn, maxn)
        if abs(coverage - optimum) < abs(best_coverage - optimum):
            best_minn, best_maxn, best_coverage = minn, maxn, coverage
    return (best_minn, best_maxn)
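
The search above can also be tried on toy data. The following sketch, under the assumption of made-up frequencies, scans the same parameter space but uses `min` with a key function instead of an explicit loop. Like the loop, `min` keeps the first candidate in case of a tie. All `toy_` names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Tuple

# Hypothetical subterm counts by length (they sum to 100 for easy arithmetic).
toy_freqs: Dict[int, int] = {1: 5, 2: 15, 3: 30, 4: 25, 5: 15, 6: 10}


def toy_coverage(minn: int, maxn: int) -> float:
    """Percentage of toy subterms whose length lies in [minn, maxn]."""
    covered = sum(toy_freqs.get(size, 0) for size in range(minn, maxn + 1))
    return 100.0 * covered / sum(toy_freqs.values())


def toy_suggest_subword_sizes(coverage: Callable[[int, int], float],
                              lower_limit: int = 1, upper_limit: int = 10,
                              optimum: float = 4.91) -> Tuple[int, int]:
    """Return the (minn, maxn) pair whose coverage is closest to the optimum."""
    sizes = range(lower_limit, upper_limit + 1)
    candidates = [(minn, maxn) for minn, maxn in product(sizes, sizes) if minn <= maxn]
    return min(candidates, key=lambda pair: abs(coverage(*pair) - optimum))


print(toy_suggest_subword_sizes(toy_coverage))                # closest to 4.91%
print(toy_suggest_subword_sizes(toy_coverage, optimum=45.0))  # closest to 45%
```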

In [None]:
suggest_subword_sizes('isl')  # Icelandic

(1, 4)

In [None]:
suggest_subword_sizes('est')  # Estonian

(4, 5)

In [None]:
suggest_subword_sizes('iku')  # Inuktitut

(10, 10)

# Computing Language Distances

The suggested subword sizes are 2D vectors, which we can use to compute distances between languages.

In [None]:
from scipy.spatial.distance import euclidean

def suggested_subword_size_distance(first_language: str, second_language: str) -> float:
    first_vector = suggest_subword_sizes(first_language)
    second_vector = suggest_subword_sizes(second_language)
    distance = euclidean(first_vector, second_vector)
    return distance

def suggested_subword_size_distance_helper(args: Tuple[str, str]) -> float:
    distance = suggested_subword_size_distance(*args)
    return distance
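
As a worked example of the distance itself, the sketch below applies the Euclidean distance to the three suggestions printed earlier for Icelandic, Estonian, and Inuktitut. It uses `math.dist` from the standard library (Python 3.8 or later) rather than SciPy so that it runs without the data files. The dictionary `toy_suggested` and the function `toy_distance` are hypothetical names.

```python
import math
from typing import Dict, Tuple

# Suggested (minn, maxn) pairs copied from the outputs shown earlier.
toy_suggested: Dict[str, Tuple[int, int]] = {
    'isl': (1, 4),
    'est': (4, 5),
    'iku': (10, 10),
}


def toy_distance(first_language: str, second_language: str) -> float:
    """Euclidean distance between two suggested subword-size vectors."""
    return math.dist(toy_suggested[first_language], toy_suggested[second_language])


# sqrt((4 - 1)^2 + (5 - 4)^2) = sqrt(10)
print(toy_distance('isl', 'est'))
# sqrt((10 - 4)^2 + (10 - 5)^2) = sqrt(61)
print(toy_distance('est', 'iku'))
```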

In [None]:
suggested_subword_size_distance('ces', 'slk')  # Czech and Slovak

0.0

In [None]:
suggested_subword_size_distance('ces', 'ger')  # Czech and German

5.385164807134504

In [None]:
suggested_subword_size_distance('ces', 'kor')  # Czech and Korean

8.06225774829855

# Correlating Language Distances

We can correlate our language distance with the syntactic, geographic, phonological, genetic, and inventory language distances of [Littell et al. (2017)][1] to see if our language distance measure represents interpretable linguistic phenomena.

 [1]: https://github.com/antonisa/lang2vec#retrieving-pre-computed-distances

In [None]:
from typing import Optional

def normalize_language(language: str) -> Optional[str]:
    """Return the ISO 639-3 code of a language, or None if it cannot be resolved."""
    try:
        return pycountry.languages.lookup(language).alpha_3
    except (LookupError, AttributeError):
        return None

In [None]:
import lang2vec.lang2vec as l2v

lang2vec_languages = l2v.DISTANCE_LANGUAGES
lang2vec_languages = map(normalize_language, lang2vec_languages)
lang2vec_languages = filter(lambda language: language is not None, lang2vec_languages)
lang2vec_languages = set(lang2vec_languages)

In [None]:
from pathlib import Path

coverage_languages = [pathname.suffixes[0][1:] for pathname in (Path('data')/'wikimedia').glob('*.json')]
for denormalized_language in coverage_languages:
    normalized_language = normalize_language(denormalized_language)
    if normalized_language is not None:  # skip codes that pycountry cannot resolve
        denormalize_language_map[normalized_language] = denormalized_language
coverage_languages = map(normalize_language, coverage_languages)
coverage_languages = filter(lambda language: language is not None, coverage_languages)
coverage_languages = set(coverage_languages)

In [None]:
from IPython.display import display, Markdown

languages = sorted(lang2vec_languages & coverage_languages)

display(Markdown(f'We will correlate the distances for {len(languages)} languages.'))

We will correlate the distances for 282 languages.

In [None]:
from random import random, seed

seed(21)
random_scalars = {language: random() for language in languages}

def random_distance(first_language: str, second_language: str) -> float:
    distance = euclidean(random_scalars[first_language], random_scalars[second_language])
    return distance

def random_distance_helper(args: Tuple[str, str]) -> float:
    distance = random_distance(*args)
    return distance

In [None]:
geographic_distance_matrix = l2v.distance('geographic', languages)

def geographic_distance(first_language: str, second_language: str) -> float:
    distance = geographic_distance_matrix[languages.index(first_language), languages.index(second_language)]
    return distance

def geographic_distance_helper(args: Tuple[str, str]) -> float:
    distance = geographic_distance(*args)
    return distance
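
The lookups in this and the following cells call `languages.index` twice per pair, which scans the list each time. One possible alternative, sketched below on toy data, is to build a code-to-position dictionary once. The three-language list, the distance values, and all `toy_` names are hypothetical, and the sketch uses nested lists where `l2v.distance` returns a matrix.

```python
from typing import Dict, List

toy_languages: List[str] = ['ces', 'deu', 'slk']

# Hypothetical symmetric distance matrix in the same order as toy_languages.
toy_distance_matrix: List[List[float]] = [
    [0.0, 0.3, 0.1],
    [0.3, 0.0, 0.4],
    [0.1, 0.4, 0.0],
]

# Built once, so that each lookup is a dictionary access instead of a list scan.
toy_language_index: Dict[str, int] = {
    language: position for position, language in enumerate(toy_languages)
}


def toy_matrix_distance(first_language: str, second_language: str) -> float:
    """Look up a precomputed distance by language code."""
    row = toy_language_index[first_language]
    column = toy_language_index[second_language]
    return toy_distance_matrix[row][column]


print(toy_matrix_distance('ces', 'slk'))
```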

In [None]:
genetic_distance_matrix = l2v.distance('genetic', languages)

def genetic_distance(first_language: str, second_language: str) -> float:
    distance = genetic_distance_matrix[languages.index(first_language), languages.index(second_language)]
    return distance

def genetic_distance_helper(args: Tuple[str, str]) -> float:
    distance = genetic_distance(*args)
    return distance

In [None]:
syntactic_distance_matrix = l2v.distance('syntactic', languages)

def syntactic_distance(first_language: str, second_language: str) -> float:
    distance = syntactic_distance_matrix[languages.index(first_language), languages.index(second_language)]
    return distance

def syntactic_distance_helper(args: Tuple[str, str]) -> float:
    distance = syntactic_distance(*args)
    return distance

In [None]:
phonological_distance_matrix = l2v.distance('phonological', languages)

def phonological_distance(first_language: str, second_language: str) -> float:
    distance = phonological_distance_matrix[languages.index(first_language), languages.index(second_language)]
    return distance

def phonological_distance_helper(args: Tuple[str, str]) -> float:
    distance = phonological_distance(*args)
    return distance

In [None]:
inventory_distance_matrix = l2v.distance('inventory', languages)

def inventory_distance(first_language: str, second_language: str) -> float:
    distance = inventory_distance_matrix[languages.index(first_language), languages.index(second_language)]
    return distance

def inventory_distance_helper(args: Tuple[str, str]) -> float:
    distance = inventory_distance(*args)
    return distance

In [None]:
from typing import Callable
from multiprocessing import Pool

import scipy.stats

def correlate_distances(first_distance_callable: Callable[[Tuple[str, str]], float],
                        second_distance_type: str, correlation_type: str = 'pearsonr') -> float:
    first_distances, second_distances = [], []
    second_distance_matrix = l2v.distance(second_distance_type, languages)
    # Every unordered pair of distinct languages is visited exactly once.
    parameter_space = [(first, second) for first, second in product(languages, languages) if first < second]
    with Pool(None) as pool:
        first_distances_iter = pool.imap(first_distance_callable, parameter_space)
        for (first_language, second_language), first_distance in zip(parameter_space, first_distances_iter):
            second_distance = second_distance_matrix[languages.index(first_language), languages.index(second_language)]
            first_distances.append(first_distance)
            second_distances.append(second_distance)
    # Look up the correlation function (e.g. pearsonr or spearmanr) by name.
    correlation, _ = getattr(scipy.stats, correlation_type)(first_distances, second_distances)
    return correlation
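
To illustrate what is being correlated, the following sketch enumerates unordered language pairs and computes a Pearson correlation coefficient with the standard library only. It is a simplified stand-in for `scipy.stats.pearsonr`, which additionally returns a p-value. The function `toy_pearson` and the toy inputs are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence, Tuple


def toy_pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# For a sorted list of unique codes, combinations() yields the same pairs as
# filtering product(languages, languages) by first < second.
toy_languages: List[str] = ['ces', 'deu', 'slk']
toy_pairs: List[Tuple[str, str]] = list(combinations(toy_languages, 2))
print(toy_pairs)

# One hypothetical distance per pair from each of two measures.
toy_first_distances = [1.0, 0.0, 1.0]
toy_second_distances = [0.3, 0.1, 0.4]
print(toy_pearson(toy_first_distances, toy_second_distances))
```

With $n$ languages there are $n(n-1)/2$ such pairs, so the 282 languages above give $282 \cdot 281 / 2 = 39{,}621$ pairs per correlation.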

In [None]:
import pandas as pd
from tqdm.notebook import tqdm

distance_types = ['geographic', 'genetic', 'syntactic', 'phonological', 'inventory']
correlations = pd.DataFrame.from_dict({
    distance_type: {
        'random': correlate_distances(random_distance_helper, distance_type),
        'suggested': correlate_distances(suggested_subword_size_distance_helper, distance_type),
        'geographic': correlate_distances(geographic_distance_helper, distance_type),
        'genetic': correlate_distances(genetic_distance_helper, distance_type),
        'syntactic': correlate_distances(syntactic_distance_helper, distance_type),
        'phonological': correlate_distances(phonological_distance_helper, distance_type),
        'inventory': correlate_distances(inventory_distance_helper, distance_type),
    }
    for distance_type
    in tqdm(distance_types)
})
correlations

| | geographic | genetic | syntactic | phonological | inventory |
|---|---|---|---|---|---|
| random | 0.014228 | 0.009718 | -0.01191 | 0.027535 | -0.010356 |
| suggested | 0.031968 | 0.025796 | -0.003188 | -0.025838 | -0.017932 |
| geographic | 1.0 | 0.276044 | -0.233165 | -0.172121 | -0.151254 |
| genetic | 0.276044 | 1.0 | 0.035659 | 0.067007 | 0.031538 |
| syntactic | -0.233165 | 0.035659 | 1.0 | 0.198463 | 0.272059 |
| phonological | -0.172121 | 0.067007 | 0.198463 | 1.0 | 0.280121 |
| inventory | -0.151254 | 0.031538 | 0.272059 | 0.280121 | 1.0 |


We can see that the correlations between our language distance measure and the language distance measures of [Littell et al. (2017)][1] range between $-0.03$ (phonological) and $0.03$ (geographic). Since the absolute values are all below $0.04$ and comparable in magnitude to those of the random baseline, we conclude that our language distance measure neither correlates nor anti-correlates with the other language distance measures. A possible explanation is that our suggested subword sizes are based on latent data-driven features of text, which complement the hand-crafted linguistic features.

 [1]: https://github.com/antonisa/lang2vec#retrieving-pre-computed-distances