Problems in Serbian #11

Open
LinguList opened this issue Feb 15, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@LinguList

There are some mismappings in the data: about six different words are listed for DEER, for example. We were informed by somebody who wrote to Joshua Jackson, who then forwarded the report to me:

“Cow would be what is translated as Deer.
Krava = Cow
Vo = Ox
Bik = Bull
Jelen = Deer
Jelena is one of the common names in Serbia (likely related to Helen rather than Jelen, though)
Right beneath is KRV, which is BLOOD and definitely not the meat. MESO is meat. 
Above that KORA = it is a bark, but leather is KOŽA
Am I missing something important? 
Jegulja is an EEL, not a snake, ZMIJA is a snake. 
Konj is a male horse, kobila is a mare. 
Jare is NOT a lamb. Jagnje (janje) is a lamb, jare is a baby goat, not a sheep. 
Jagoda is a strawberry, not a grape. Grožđe is a term for the grapes, grozd is singular.”

I suggest we manually correct these cases via Lexemes. I would also inform the DIACL editors about this.

Or, @chrzyki, @xrotwang, is it possible that the error (something got swapped) is on the side of the pylexibank script?

LinguList added the bug label on Feb 15, 2022
@LinguList
Author

BTW: checking German, we see the same problem for DEER.

https://clics.clld.org/languages/diacl-41700
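
One way to see how widespread this is would be to count, per language, how many distinct forms sit under each master concept and flag the crowded ones. The following is only a minimal sketch: it assumes the forms are available as (language, concept, form, gloss) tuples, and the sample rows, the function name `overloaded_concepts`, and the threshold are hypothetical illustrations, not part of the DIACL or CLICS code.

```python
from collections import defaultdict

# Illustrative rows only (language, master concept, form, meaning gloss),
# modelled on the Serbian report above; not an excerpt of the real data.
rows = [
    ("Serbian", "DEER", "krava", "cow"),
    ("Serbian", "DEER", "vo", "ox"),
    ("Serbian", "DEER", "bik", "bull"),
    ("Serbian", "DEER", "jelen", "deer"),
    ("Serbian", "SNAKE", "jegulja", "eel"),
    ("Serbian", "SNAKE", "zmija", "snake"),
]


def overloaded_concepts(rows, threshold=3):
    """Return (language, concept) pairs with at least `threshold` distinct forms."""
    forms = defaultdict(set)
    for language, concept, form, _gloss in rows:
        forms[(language, concept)].add(form)
    return {
        key: sorted(values)
        for key, values in forms.items()
        if len(values) >= threshold
    }


flagged = overloaded_concepts(rows)
for (language, concept), forms in flagged.items():
    print(language, concept, forms)
```

A high count is not proof of a wrong mapping, since real synonyms exist, so the flagged pairs would still need a manual check.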

@LinguList
Author

If one checks DIACL, it becomes clear that they have mapped a huge number of only partly related terms to one master concept.

https://diacl.ht.lu.se/WordList/Index

The same problem is also present, though to a lesser degree, in the Swadesh collection.

The problem is that DIACL did, in a sense, do some Concepticon-style mapping, but to their own internal concept lists, whose concepts are often much broader than what we'd use in Concepticon. Since all words in the database have meaning strings, one could circumvent this by making a master list of all meaning glosses we find in the data and mapping those to Concepticon directly.

In its current form, however, it is unclear whether the data is aggregated correctly into CLICS.
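
The master list of meaning glosses mentioned above could be compiled roughly as follows. This is a sketch under assumptions: it takes (master concept, gloss) pairs as input, normalises glosses only by trimming and lowercasing, and the function name `gloss_master_list` and the sample pairs are hypothetical.

```python
from collections import Counter


def gloss_master_list(pairs):
    """Count how often each normalised gloss occurs under each master concept."""
    counts = Counter()
    for concept, gloss in pairs:
        counts[(gloss.strip().lower(), concept)] += 1
    # Sorting by gloss puts identical glosses from different concepts side by side.
    return sorted(counts.items())


# Illustrative input only; real pairs would come from the dataset's forms.
pairs = [
    ("DEER", "cow"),
    ("DEER", "Cow "),
    ("DEER", "deer"),
    ("SNAKE", "eel"),
]

master = gloss_master_list(pairs)
for (gloss, concept), n in master:
    print(f"{gloss}\t{concept}\t{n}")
```

The resulting table has one row per gloss and master concept, which is the unit one would then map by hand.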

@chrzyki
Contributor

chrzyki commented Feb 16, 2022

Good catch, and thanks for relaying the issue. Given the relatively specific relations, I would hope that there isn't too much of an effect on CLICS-based analyses (i.e. most of the mappings will be very rare), but I fully agree: in this state it's not something that should be used in CLICS & Co. I think your suggestion (i.e. compile a list of all meaning glosses, then map them) sounds good!

@LinguList
Author

So for CLICS4, we would either fix this issue by re-mapping the data, or not include the dataset there at all, since this kind of mapping upsets people who know the languages, and we would like to avoid that. DIACL has the meaning glosses and uses its concepts differently than we do in CLICS, so we should only aggregate from DIACL where we know that the mapping corresponds to our model.


2 participants