# Loading CLDF files into LingPy

Suppose we want to load the CLDF-formatted file mixezoquean.csv, search the data automatically for cognates, and compare the algorithm's results against the expert cognate judgments given in the data. We have opened the terminal in the folder lexibank-data/cookbook, so the file path below is given relative to that folder. We first load a couple of LingPy modules.

In [1]:
from lingpy import * # general settings
from lingpy.basic.wordlist import get_wordlist # csv-to-wordlist converter
from lingpy.evaluate.acd import bcubes # cognate detection evaluation

We start by loading the CLDF-file as a simple LingPy-Wordlist. We set the keyword "row" to "parameter_name", the column in which CLDF stores the glosses for the concepts, and the keyword "col" to "language_name", the column that holds the language names. A LingPy-Wordlist needs both keywords to know where concepts and languages are in the data.

In [2]:
wl = get_wordlist("../datasets/mixezoquean/cldf/mixezoquean.csv", row="parameter_name", col="language_name")

We check the content of the wordlist file.

In [3]:
print("Wordlist has {0} entries, {1} languages, {2} concepts, and {3} columns.".format(len(wl), wl.width, wl.height, len(wl.header)))

Wordlist has 1072 entries, 10 languages, 110 concepts, and 10 columns.
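
To make the role of "row" and "col" more concrete, here is a minimal sketch that counts entries, languages, and concepts with Python's standard csv module only. It does not use LingPy, the toy data and the function name `summarize` are invented for illustration, and the real mixezoquean.csv has more columns than shown here.

```python
import csv
import io

# Hypothetical toy data in a CLDF-like layout; all values are invented.
TOY_CSV = """ID,language_name,parameter_name,value
1,Language A,hand,kaba
2,Language B,hand,kapa
3,Language A,water,nuna
4,Language B,water,nuu
5,Language C,water,nuna
"""


def summarize(text, row="parameter_name", col="language_name"):
    """Count entries, languages (col), concepts (row), and columns."""
    reader = csv.DictReader(io.StringIO(text))
    entries = list(reader)
    languages = {entry[col] for entry in entries}
    concepts = {entry[row] for entry in entries}
    return len(entries), len(languages), len(concepts), len(reader.fieldnames)


n_entries, n_languages, n_concepts, n_columns = summarize(TOY_CSV)
print("Toy data has {0} entries, {1} languages, {2} concepts, and {3} columns.".format(
    n_entries, n_languages, n_concepts, n_columns))
```

Swapping the values passed as `row` and `col` would swap the two counts, which is why the wordlist needs to be told which column plays which role.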


Now we pass the wordlist object to the LexStat class. We specify the same parameters, but we pass an additional parameter "segments", to inform LingPy-LexStat where the segments are in the CLDF.

In [4]:
lex = LexStat(wl, col='language_name', row="parameter_name", segments="segments")

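We now carry out a quick cognate detection analysis. We first compute the scorer with get_scorer (here with runs=1000), and then cluster the words with LexStat's cluster function, choosing the method "lexstat" and a threshold of 0.65. We set the keyword "ref" to "lexstat" to name the column in which the automatically detected cognate sets are stored.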

In [5]:
lex.get_scorer(runs=1000)
lex.cluster(method="lexstat", ref="lexstat", threshold=0.65)

CORRESPONDENCE CALCULATION:   0%|          | 0/50.0 [00:00<?, ?it/s]2016-11-08 13:45:48,785 [INFO] Calculating alignments for pair Chiapas Zoque / Chiapas Zoque.
2016-11-08 13:45:48,821 [INFO] Calculating alignments for pair Chiapas Zoque / Lowland Mixe.
2016-11-08 13:45:48,880 [INFO] Calculating alignments for pair Chiapas Zoque / North Highland Mixe.
CORRESPONDENCE CALCULATION:   8%|▊         | 4/50.0 [00:00<00:01, 32.23it/s]2016-11-08 13:45:48,910 [INFO] Calculating alignments for pair Chiapas Zoque / Oluta Popoluca.
2016-11-08 13:45:48,941 [INFO] Calculating alignments for pair Chiapas Zoque / San Miguel Chimalapa Zoque.
2016-11-08 13:45:48,974 [INFO] Calculating alignments for pair Chiapas Zoque / Santa María Chimalapa Zoque.
2016-11-08 13:45:49,006 [INFO] Calculating alignments for pair Chiapas Zoque / Sayula Popoluca.
CORRESPONDENCE CALCULATION:  16%|█▌        | 8/50.0 [00:00<00:01, 32.12it/s]2016-11-08 13:45:49,036 [INFO] Calculating alignments for pair Chiapas Zoque / Soteapan
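
To give an intuition for what a threshold such as 0.65 does, the following toy sketch groups four words into flat clusters from pairwise distances. It is not LexStat's actual procedure: the distances are invented, the function `flat_cluster` is hypothetical, and the use of average linkage with a strict "smaller than threshold" stopping rule is an assumption made for illustration.

```python
def flat_cluster(distances, threshold):
    """Average-linkage agglomerative clustering that stops at `threshold`.

    `distances` maps frozenset({a, b}) to a distance between 0 and 1.
    """
    items = sorted({item for pair in distances for item in pair})
    clusters = [[item] for item in items]

    def avg_dist(c1, c2):
        # Mean of all pairwise distances between the two clusters.
        total = sum(distances[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = avg_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Invented distances between four words for the same concept.
toy = {
    frozenset(("w1", "w2")): 0.2,
    frozenset(("w1", "w3")): 0.9,
    frozenset(("w1", "w4")): 0.8,
    frozenset(("w2", "w3")): 0.85,
    frozenset(("w2", "w4")): 0.9,
    frozenset(("w3", "w4")): 0.4,
}
print(flat_cluster(toy, threshold=0.65))
print(flat_cluster(toy, threshold=0.3))
```

In this toy setting, a higher threshold merges more words into the same set, while a lower threshold keeps more words apart.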

Now we can test how well the automatic cognate detection performed, by comparing the content in the column "cognacy" (default name for cognate sets in CLDF) with the content in the column "lexstat", using LingPy's bcubes-function.

In [6]:
precision, recall, fscore = bcubes(lex, 'cognacy', 'lexstat', pprint=True) # gold standard first, then test column

*************************
* B-Cubed-Scores        *
* --------------------- *
* Precision:     0.9498 *
* Recall:        0.8532 *
* F-Scores:      0.8989 *
*************************


Wow, our precision is quite high, which is good, as it means there are not many false positives. Recall, at about 85%, could be improved, but we should be happy with an F-score of almost 90%, given the small size of the wordlist.
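
For readers who want to see how such scores come about, here is a minimal sketch of B-Cubed precision, recall, and F-score on a toy example. The function `b_cubed` and the cluster labels are invented for illustration, and the sketch scores all items in one pool; LingPy's own bcubes function may differ in details, for example in how words are grouped by concept.

```python
def b_cubed(gold, test):
    """B-Cubed scores for two aligned lists of cluster labels."""
    n = len(gold)
    precision = recall = 0.0
    for i in range(n):
        # Items that share a cluster with item i in each partition.
        same_gold = {j for j in range(n) if gold[j] == gold[i]}
        same_test = {j for j in range(n) if test[j] == test[i]}
        overlap = len(same_gold & same_test)
        precision += overlap / len(same_test)
        recall += overlap / len(same_gold)
    precision /= n
    recall /= n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy case: the expert groups the first three words into one cognate set,
# while the automatic analysis splits the third word off.
gold = [1, 1, 1, 2]
test = [1, 1, 2, 3]
p, r, f = b_cubed(gold, test)
print("Precision: {0:.4f}, Recall: {1:.4f}, F-Score: {2:.4f}".format(p, r, f))
```

Splitting a true cognate set, as in this toy case, leaves precision untouched but lowers recall, which is the same pattern as in the scores above.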