Creation of Serbian Cyrillic and Latin converter #1456

Open
astruo opened this issue Mar 16, 2017 · 2 comments
Labels
enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba.

Comments

@astruo

astruo commented Mar 16, 2017

Cyrillic should be used as the main script of a Serbian sentence on Tatoeba, and not only because of its official status in Serbia. Cyrillic to Latin conversion is straightforward, with no exceptions, but text written in all caps or in abbreviations can produce typographical results such as "ЏУГАЊ" → "DžUGANj" instead of "DŽUGANJ".

Latin to Cyrillic conversion is not consistent, since the Cyrillic letters "Љ" [ʎ], "Њ" [ɲ], "Џ" [dʒ] and sometimes "Ђ" [dʑ] are transliterated as combinations of two Latin letters.
The Unicode standard encodes these Latin digraphs as single characters with their own code points, but they are not used in real life, only for scientific morphological analysis of Serbo-Croatian texts:
"LJ" U+01C7 LATIN CAPITAL LETTER LJ
"Lj" U+01C8 LATIN CAPITAL LETTER L WITH SMALL LETTER J
"lj" U+01C9 LATIN SMALL LETTER LJ etc.
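If a converter ever receives text that uses these single-character digraphs, one option is to expand them to ordinary two-letter sequences before transliterating. The following is a minimal sketch, assuming the nine code points U+01C4 to U+01CC (DŽ, Dž, dž, LJ, Lj, lj, NJ, Nj, nj) are the only ones of interest; the function name `expand_digraphs` is hypothetical and not part of Tatoeba.

```python
import unicodedata

# Map each single-character digraph (U+01C4..U+01CC) to its compatibility
# expansion, e.g. U+01C8 becomes the two letters "L" + "j".
DIGRAPH_TABLE = {
    cp: unicodedata.normalize("NFKC", chr(cp)) for cp in range(0x01C4, 0x01CD)
}


def expand_digraphs(text):
    """Replace single-code-point Latin digraphs with two ordinary letters."""
    return text.translate(DIGRAPH_TABLE)


print(expand_digraphs("\u01c8ubav"))
```

Restricting the table to these code points leaves all other characters untouched, which a blanket NFKC normalization of the whole string would not guarantee.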

Cyrillic to Latin transliteration:

А → A
а → a
Б → B
б → b
В → V
в → v
Г → G
г → g
Д → D
д → d
Ђ → Đ
ђ → đ
Е → E
е → e
Ж → Ž
ж → ž
З → Z
з → z
И → I
и → i
Ј → J
ј → j
К → K
к → k
Л → L
л → l
Љ → Lj
љ → lj
М → M
м → m
Н → N
н → n
Њ → Nj
њ → nj
О → O
о → o
П → P
п → p
Р → R
р → r
С → S
с → s
Т → T
т → t
Ћ → Ć
ћ → ć
У → U
у → u
Ф → F
ф → f
Х → H
х → h
Ц → C
ц → c
Ч → Č
ч → č
Џ → Dž
џ → dž
Ш → Š
ш → š
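The table above could be applied with a simple character-by-character lookup. This is only a sketch under assumptions: the names `CYR_TO_LAT` and `cyrillic_to_latin` are made up for illustration, and the uppercase pairs are derived from the lowercase ones rather than typed out.

```python
# Lowercase pairs from the table; both lists are in Serbian Cyrillic
# alphabetical order and have 30 entries each.
CYR_LOWER = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT_LOWER = ("a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š").split()

mapping = {}
for cyr, lat in zip(CYR_LOWER, LAT_LOWER):
    mapping[cyr] = lat
    # Uppercase Cyrillic maps to a capitalized Latin form: Љ -> "Lj", Џ -> "Dž".
    mapping[cyr.upper()] = lat.capitalize()

CYR_TO_LAT = str.maketrans(mapping)


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic to Latin, one character at a time."""
    return text.translate(CYR_TO_LAT)


print(cyrillic_to_latin("ЏУГАЊ"))
```

Because each Cyrillic letter maps to exactly one Latin letter or digraph, no context is needed here; the all-caps case described in the issue is the one place where the plain lookup gives an unwanted result.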

If the lowercase letters "ž" or "j" are surrounded by capital Latin letters, or come at the end of an ALL CAPS Latin word, the script should change them to capital "Ž" or "J" after conversion.
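One possible way to implement this fix-up is to look at each word after conversion and upper-case the digraphs when every other letter in the word is already a capital. This is a sketch of one interpretation of the rule, not a verified implementation; the name `fix_all_caps_digraphs` is hypothetical, and a word that consists only of digraphs (for example a lone "Nj") is left unchanged because its case cannot be determined.

```python
import re

DIGRAPH = re.compile(r"Lj|Nj|Dž")
WORD = re.compile(r"\w+")


def _fix_word(match):
    word = match.group()
    # Letters of the word that are not part of a capitalized digraph.
    rest = DIGRAPH.sub("", word)
    # str.isupper() is False for an empty string, so digraph-only words
    # are left as they are (assumption: their case is ambiguous).
    if rest.isupper():
        return DIGRAPH.sub(lambda d: d.group().upper(), word)
    return word


def fix_all_caps_digraphs(text):
    """Turn "Lj", "Nj", "Dž" into "LJ", "NJ", "DŽ" inside ALL CAPS words."""
    return WORD.sub(_fix_word, text)


print(fix_all_caps_digraphs("Njegoš i DžUGANj"))
```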

Latin to Cyrillic transliteration:
A → А
a → а
B → Б
b → б
C → Ц
c → ц
Ć → Ћ
ć → ћ
Č → Ч
č → ч
D → Д
d → д
Đ → Ђ
đ → ђ
Dž → Џ
dž → џ
E → Е
e → е
F → Ф
f → ф
G → Г
g → г
H → Х
h → х
I → И
i → и
J → Ј
j → ј
K → К
k → к
L → Л
l → л
Lj → Љ
lj → љ
M → М
m → м
N → Н
n → н
Nj → Њ
nj → њ
O → О
o → о
P → П
p → п
R → Р
r → р
S → С
s → с
Š → Ш
š → ш
T → Т
t → т
U → У
u → у
V → В
v → в
Z → З
z → з
Ž → Ж
ž → ж
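In the Latin to Cyrillic direction the digraphs have to be tried before the single letters, otherwise "lj" would be read as "l" followed by "j". A minimal sketch of such a longest-match-first conversion follows. The names are hypothetical, and accepting the fully capitalized forms "LJ", "NJ", "DŽ" is an assumption added so that ALL CAPS input converts; the table above lists only "Lj", "Nj", "Dž".

```python
import re

CYR_LOWER = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT_LOWER = ("a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š").split()

LAT_TO_CYR = {}
for lat, cyr in zip(LAT_LOWER, CYR_LOWER):
    LAT_TO_CYR[lat] = cyr
    LAT_TO_CYR[lat.capitalize()] = cyr.upper()  # "Lj" -> Љ
    LAT_TO_CYR[lat.upper()] = cyr.upper()       # "LJ" -> Љ (assumption)

# Longer keys first, so the regex prefers "nj" over "n" followed by "j".
PATTERN = re.compile("|".join(sorted(LAT_TO_CYR, key=len, reverse=True)))


def latin_to_cyrillic(text):
    """Transliterate Serbian Latin to Cyrillic, digraphs before single letters."""
    return PATTERN.sub(lambda m: LAT_TO_CYR[m.group()], text)


print(latin_to_cyrillic("Ljubav"))
```

This always treats "lj", "nj" and "dž" as one letter, so words where the two Latin letters belong to separate sounds come out wrong and need a separate correction step.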

Problems with the letters lj, dž and nj usually arise in foreign words, in names, and in combinations of certain prefixes and word roots.
Some exceptions that should be corrected by the script after Latin-Cyrillic conversion ("*" stands for any word ending):
odživ* → оджив*
nadživ* → наджив*
podžanr* → поджанр*
podžupan* → поджупан*
predžet* → преджет*
predživ* → преджив*
injek* → инјек*
konjug* → конјуг*
konjuk* → конјук*
konjun* → конјун*
konjur* → конјур*
Tanjug* → Танјуг*
anjon* → анјон*
dablju → даблју
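The exception list above could be applied as a correction pass over text that has already been converted with the digraph rules. The sketch below makes several assumptions: the stems are matched only at the start of a word, the wrongly merged form is derived by collapsing "дж", "нј" and "лј" in the correct stem, the function name `fix_exceptions` is made up, and the pairs themselves are the unverified ones from this issue.

```python
import re

# Correct Cyrillic stems from the list above ("*" entries match as prefixes).
CORRECT_STEMS = ["оджив", "наджив", "поджанр", "поджупан", "преджет",
                 "преджив", "инјек", "конјуг", "конјук", "конјун",
                 "конјур", "танјуг", "анјон"]
CORRECT_WORDS = ["даблју"]  # no "*": whole word only

MERGED = {"дж": "џ", "нј": "њ", "лј": "љ"}


def naive_form(stem):
    """What a digraph-first converter would have produced for this stem."""
    for pair, single in MERGED.items():
        stem = stem.replace(pair, single)
    return stem


FIXES = {naive_form(s): s for s in CORRECT_STEMS + CORRECT_WORDS}
ALTERNATIVES = [naive_form(s) for s in CORRECT_STEMS]
ALTERNATIVES += [naive_form(w) + r"\b" for w in CORRECT_WORDS]
PATTERN = re.compile(r"\b(?:" + "|".join(ALTERNATIVES) + ")", re.IGNORECASE)


def _repl(match):
    found = match.group()
    fixed = FIXES[found.lower()]
    if found.isupper():
        return fixed.upper()
    return fixed.capitalize() if found[0].isupper() else fixed


def fix_exceptions(text):
    """Split wrongly merged letters in known exception stems."""
    return PATTERN.sub(_repl, text)


print(fix_exceptions("Тањуг"))
```

Keeping the list as data makes it easy to extend once a native speaker has confirmed or corrected the pairs.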

If you are a Serbian speaker, please correct and add more examples of such pairs

References:
https://en.wikipedia.org/wiki/Gaj's_Latin_alphabet
https://en.wikipedia.org/wiki/Serbian_Cyrillic_alphabet

@jiru
Member

jiru commented Apr 20, 2017

> If you are a Serbian speaker, please correct and add more examples of such pairs

Thanks for the transcription pairs, but they are currently of no use unless they are verified by a native speaker.

> Cyrillic should be used as the main script of a Serbian sentence on Tatoeba, and not only because of its official status in Serbia.

(The Wall is a better place to talk about this but I’ll answer here. If you want to answer me, please post on the Wall instead.)

I’m not sure this is a valid reason to enforce the use of Cyrillic in Serbian sentences on Tatoeba. I don’t know much about the Serbian language and Serbia, but since Tatoeba embraces all varieties of languages equally, Cyrillic being official in Serbia doesn’t change anything. I think we want to avoid that kind of political argument as much as possible and just allow all scripts that are used in real life, as has been done for Chinese. Maybe a native Serbian speaker would have more insight into this, though.

> Cyrillic to Latin conversion is straightforward, with no exceptions, but text written in all caps or in abbreviations can produce typographical results such as "ЏУГАЊ" → "DžUGANj" instead of "DŽUGANJ".

I don’t think this is a valid reason either. In my opinion, we shouldn’t change the way contributors write on Tatoeba for technical reasons. Technology should be shaped to fit the way languages actually are, not the other way around.

I understand your concerns about the inconsistencies in Latin to Cyrillic conversion, but we should try to solve them first instead of enforcing Cyrillic. And if we can’t solve all of them, we can allow manual correction of the transcriptions by contributors. Manual correction is also a way to gather more transcription pairs.

@jiru jiru added the enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba. label Jul 4, 2018
@trang trang added the out-of-scope Issue that we decided we won't do and will close within a few weeks. label Aug 25, 2018
@trang trang removed the out-of-scope Issue that we decided we won't do and will close within a few weeks. label Feb 15, 2020
@jiru
Member

jiru commented Oct 21, 2020
