# Analyzing the Palaung data

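In this notebook we run LingPy's automatic cognate detection on the Palaung dataset and compare the result with the cognate sets already annotated in the data.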

In [1]:
from lingpy import *  # core LingPy classes, including LexStat
from lingpy.basic.wordlist import get_wordlist  # csv-to-wordlist converter
from lingpy.evaluate.acd import bcubes  # cognate detection evaluation

We start by loading the CLDF-file as a simple LingPy-Wordlist. We specify the keyword "row" as "parameter_name", as this is the column in which we store the glosses for the concepts in CLDF. Likewise, we specify "col" as "language_name", since LingPy-Wordlists need to know where these columns are in the data.

In [2]:
wl = get_wordlist("../datasets/palaung/cldf/palaung.csv", row="parameter_name", col="language_name")
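
To make the role of "row" and "col" more concrete, here is a minimal sketch in plain Python (it does not use LingPy, and it is not how LingPy works internally). The tiny table and the helper `pivot` are invented for illustration only; the column names are the ones used above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature forms table; these rows are invented for illustration
# and are not taken from the Palaung data.
SAMPLE = """id,language_name,parameter_name,value
1,LangA,hand,form1
2,LangB,hand,form2
3,LangA,eye,form3
4,LangB,eye,form4
"""

def pivot(text, row="parameter_name", col="language_name"):
    """Group entry IDs by concept (row) and language (col)."""
    table = defaultdict(dict)
    for entry in csv.DictReader(io.StringIO(text)):
        table[entry[row]].setdefault(entry[col], []).append(int(entry["id"]))
    return table

table = pivot(SAMPLE)
languages = sorted({lang for cells in table.values() for lang in cells})
print(len(table), "concepts x", len(languages), "languages")
print(table["hand"])
```

Each concept becomes a row and each language a column, which is why the wordlist needs to know which columns carry these two pieces of information.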

We check the content of the wordlist file.

In [4]:
print("Wordlist has {0} entries, {1} languages, {2} concepts, and {3} columns.".format(len(wl), wl.width, wl.height, len(wl.header)))

Wordlist has 1567 entries, 16 languages, 99 concepts, and 10 columns.


Now we pass the wordlist object to the LexStat class. We specify the same parameters as before, plus the additional parameter "segments", which tells LexStat which column of the CLDF data holds the segmented forms.

In [5]:
lex = LexStat(wl, col='language_name', row="parameter_name", segments="segments")

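We now carry out a quick cognate detection analysis. First we compute the language-specific scorer with 1000 runs, then we cluster the words with the "lexstat" method and a threshold of 0.65. We set the keyword "ref" to "lexstat" to indicate the column in which the automatic cognate judgments should be stored.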

In [6]:
lex.get_scorer(runs=1000)
lex.cluster(method="lexstat", ref="lexstat", threshold=0.65)

|++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
|++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
|++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
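
To give a feeling for what the threshold does, here is a simplified sketch of flat average-linkage clustering in plain Python. It is an assumption-laden illustration, not LexStat's actual implementation: the function `flat_average_linkage` and the distance matrix are invented, whereas LexStat derives its distances from the scorer computed above.

```python
def flat_average_linkage(dist, threshold):
    """Merge the closest clusters until their average distance reaches the threshold."""
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        # Average pairwise distance between the members of two clusters
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(avg(a, b), x, y)
                 for x, a in enumerate(clusters)
                 for y, b in enumerate(clusters) if x < y]
        best, x, y = min(pairs)
        if best >= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)

# Invented distances between four words for a single concept
DIST = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.4],
    [0.8, 0.9, 0.4, 0.0],
]
print(flat_average_linkage(DIST, 0.65))
print(flat_average_linkage(DIST, 0.3))
```

With the higher threshold the four words end up in two cognate sets; with the lower one only the closest pair is merged. A higher threshold therefore tends to lump more words together, and a lower one tends to split them.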


Now we can test how well the automatic cognate detection performed, by comparing the content in the column "cognacy" (default name for cognate sets in CLDF) with the content in the column "lexstat", using LingPy's bcubes-function.

In [7]:
precision, recall, fscore = bcubes(lex, 'cognacy', 'lexstat', pprint=True)

*************************
* B-Cubed-Scores        *
* --------------------- *
* Precision:     0.8879 *
* Recall:        0.8613 *
* F-Scores:      0.8744 *
*************************


Our precision is quite high (0.89), which is good, as it means there are not many false positives. Recall is a bit lower (0.86), so some true cognates are still being missed, but an F-score of 0.87 is a result we can be happy with, given the small size of the wordlist.
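
For readers who want to see what these scores measure, here is a minimal sketch of B-cubed precision and recall in plain Python. It is an illustration under simplifying assumptions, not LingPy's own code: the function `bcubed` and the toy cognate IDs are invented, and each list position stands for one word.

```python
from collections import Counter

def bcubed(gold, test):
    """Return B-cubed precision, recall and F-score for two partitions of the same items."""
    assert len(gold) == len(test)
    n = len(gold)
    gold_size = Counter(gold)           # size of each gold cognate set
    test_size = Counter(test)           # size of each detected cognate set
    overlap = Counter(zip(gold, test))  # items sharing both a gold and a test set
    precision = sum(overlap[g, t] / test_size[t] for g, t in zip(gold, test)) / n
    recall = sum(overlap[g, t] / gold_size[g] for g, t in zip(gold, test)) / n
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# Invented example: two true cognate sets, but everything lumped into one cluster
gold = [1, 1, 1, 2, 2]
test = [1, 1, 1, 1, 1]
p, r, f = bcubed(gold, test)
print(round(p, 4), round(r, 4), round(f, 4))

# The F-score is the harmonic mean of precision and recall, so the value
# reported above can be recomputed from the other two numbers.
print(round(2 * 0.8879 * 0.8613 / (0.8879 + 0.8613), 4))
```

Lumping all words into one set gives perfect recall but low precision, which shows why both numbers need to be read together.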