Languages

Rob Speer edited this page Nov 1, 2016 · 4 revisions

ConceptNet is built from multilingual data, and covers hundreds of languages.

Identifying languages

Languages in ConceptNet are identified by their BCP 47 language code, usually a two- or three-letter code taken from the language subtag registry that IANA maintains (these codes derive from the ISO 639 standards).

When languages overlap and sources disagree on how to distinguish them, we usually represent them under only the broadest language code. This lets us recognize as many terms as possible. The trade-off is that a term and the terms it is linked to may not all be understood by every speaker of those languages.

  • Traditional and Simplified Chinese, with vocabulary from Mandarin, Cantonese, and other Chinese languages, all appear under the language code zh.
  • Dialects of Arabic all appear under the language code ar.
  • Serbian, Croatian, and Bosnian appear under the language code sh.
  • Bahasa Indonesia and Bahasa Malaysia both appear under the language code ms.
  • Norwegian Bokmål and Nynorsk both appear under the language code no.
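
As a rough illustration of the merging described in the list above, the following Python sketch maps some narrower language tags to the broad codes. The dictionary `BROAD_CODE` and the function `broadest_code` are hypothetical names introduced here; the keys are real language codes, but the table is only a partial example and is not ConceptNet's actual normalization logic.

```python
# Hypothetical, partial mapping from narrower codes to the broad codes above.
# Illustration only; ConceptNet's real handling may differ.
BROAD_CODE = {
    "cmn": "zh",  # Mandarin
    "yue": "zh",  # Cantonese
    "arz": "ar",  # Egyptian Arabic
    "sr": "sh",   # Serbian
    "hr": "sh",   # Croatian
    "bs": "sh",   # Bosnian
    "id": "ms",   # Indonesian
    "nb": "no",   # Norwegian Bokmål
    "nn": "no",   # Norwegian Nynorsk
}


def broadest_code(tag: str) -> str:
    """Return the broad language code for a tag such as 'zh-Hant' or 'nb'."""
    # Keep only the primary language subtag, lowercased: 'zh-Hant' -> 'zh'.
    primary = tag.replace("_", "-").split("-")[0].lower()
    # Codes without an entry in the table are returned unchanged.
    return BROAD_CODE.get(primary, primary)
```

Under these assumptions, `broadest_code("sr-Latn")` gives `"sh"`, while a code with no entry, such as `"fr"`, passes through as is.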

Core languages

There are 10 core languages, which we believe ConceptNet supports well: each has a very large vocabulary, many assertions expressed within the language itself, and varied sources of knowledge. In these 10 languages, we provide all API features, including word vectors.

Code Language Autonym Vocabulary size
en English English 1803873
fr French français 3023144
it Italian italiano 1078629
de German Deutsch 825741
es Spanish español 782760
ru Russian русский 680205
pt Portuguese português 473709
ja Japanese 日本語 363663
nl Dutch Nederlands 267641
zh Chinese 中文 242746
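
The core-language table above can be captured in a small lookup, which is handy when client code needs to decide whether to expect word vectors from the API. This is a minimal sketch: the vocabulary sizes are copied from the table, while the names `CORE_VOCABULARY` and `has_api_word_vectors` are hypothetical and not part of ConceptNet.

```python
# Vocabulary sizes of the 10 core languages, copied from the table above.
CORE_VOCABULARY = {
    "en": 1803873,
    "fr": 3023144,
    "it": 1078629,
    "de": 825741,
    "es": 782760,
    "ru": 680205,
    "pt": 473709,
    "ja": 363663,
    "nl": 267641,
    "zh": 242746,
}


def has_api_word_vectors(code: str) -> bool:
    """Hypothetical helper: True if `code` is one of the core languages,
    the only ones for which the text says the API provides word vectors."""
    return code.lower() in CORE_VOCABULARY


# The core language with the largest vocabulary in the table.
largest_core = max(CORE_VOCABULARY, key=CORE_VOCABULARY.get)
```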

Common languages

Beyond the core set, there are 68 common languages, with vocabularies of at least 10,000 terms. We generate word vectors for these languages, but don't provide them through the API (that would be too many vectors to keep in memory). Some terms in these languages may be connected to the rest of the graph only via languages in the core set.

About 90% of the world's population natively speaks one of these languages or the core languages. This list also includes some languages that are historical (such as Latin) or constructed (such as Esperanto).

Code Language Autonym Vocabulary size
af Afrikaans Afrikaans 19804
ang Old English Ænglisc 14898
ar Arabic العربية 134311
ast Asturian asturianu 52118
az Azerbaijani azərbaycan dili 15465
be Belarusian беларуская 20199
bg Bulgarian български 337071
ca Catalan català 123780
cs Czech čeština 129183
cy Welsh Cymraeg 18184
da Danish dansk 67915
el Greek Ελληνικά 71970
eo Esperanto esperanto 171527
et Estonian eesti 29968
eu Basque euskara 52340
fa Persian فارسی 61883
fi Finnish suomi 381278
fil Filipino Filipino 17620
fo Faroese føroyskt 18081
fro Old French 33797
ga Irish Gaeilge 42963
gd Scottish Gaelic Gàidhlig 27313
gl Galician galego 76598
grc Ancient Greek 37717
gv Manx Gaelg 14404
he Hebrew עברית 40906
hi Hindi हिन्दी 21163
hsb Upper Sorbian hornjoserbšćina 52975
hu Hungarian magyar 81638
hy Armenian հայերեն 34969
io Ido Ido 39078
is Icelandic íslenska 55767
ka Georgian ქართული 41801
kk Kazakh қазақ тілі 20779
ko Korean 한국어 47268
ku Kurdish Kürtçe 15680
la Latin Lingua latina 1334135
lt Lithuanian lietuvių 30523
lv Latvian latviešu 48870
mg Malagasy Malagasy 53264
mk Macedonian македонски 33270
ms Malay Bahasa Melayu 124646
mul Multilingual 20214
no Norwegian norsk bokmål 125633
non Old Norse 7868
nrf Norman French Jèrriais / Guernésiais 19687
nv Navajo Diné bizaad 12432
oc Occitan occitan 42122
pl Polish polski 191190
ro Romanian română 66260
rup Aromanian armãneashti 7212
sa Sanskrit संस्कृत 9584
se Northern Sami davvisámegiella 134734
sh Croatian / Bosnian / Serbian srpskohrvatski 148819
sk Slovak slovenčina 29768
sl Slovenian slovenščina 160496
sq Albanian shqip 24692
sv Swedish svenska 268402
sw Swahili Kiswahili 12648
ta Tamil தமிழ் 11785
te Telugu తెలుగు 26434
th Thai ไทย 103096
tr Turkish Türkçe 65892
uk Ukrainian українська 46284
ur Urdu اردو 11832
vi Vietnamese Tiếng Việt 54774
vo Volapük Volapük 14678
xcl Classical Armenian 26356

All languages

The vocabulary of ConceptNet supports a total of 304 languages. To be represented in ConceptNet, a language must have a written orthography, and we must be able to extract a vocabulary of at least 300 words from ConceptNet's data sources.

We exclude languages below this cutoff because their data is too likely to be unrepresentative, unhelpful, or erroneous. (One language, Yi, meets this cutoff at one stage of building, even though its vocabulary in the final ConceptNet graph is too small.)
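
The 300-word cutoff can be checked against rows in the flattened "Code Language Vocabulary size" layout used on this page. The sketch below is an illustration only: `LanguageRow`, `parse_row`, and `MIN_VOCABULARY` are hypothetical names, the sample rows are copied from the table on this page, and the check uses final vocabulary sizes rather than the intermediate build stage at which the cutoff is actually applied.

```python
from typing import NamedTuple

MIN_VOCABULARY = 300  # cutoff stated in the text


class LanguageRow(NamedTuple):
    code: str
    name: str
    vocabulary: int


def parse_row(row: str) -> LanguageRow:
    """Parse a row such as 'ang Old English 14898'.

    The first token is the code, the last token is the vocabulary size,
    and everything in between is the language name.
    """
    code, rest = row.split(maxsplit=1)
    name, size = rest.rsplit(maxsplit=1)
    return LanguageRow(code, name, int(size))


# A few rows copied from the table on this page.
sample_rows = [
    "ang Old English 14898",
    "ff Fula 315",
    "ii Yi 105",
    "za Zhuang 315",
]

parsed = [parse_row(r) for r in sample_rows]
# Languages whose final vocabulary falls below the cutoff.
below_cutoff = [r.code for r in parsed if r.vocabulary < MIN_VOCABULARY]
```

With these sample rows, only Yi falls below the cutoff, which matches the exception noted above.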

This table shows all the supported languages, alphabetically by their language code:

Code Language Vocabulary size
aa Afar 1451
ab Abkhazian 744
abe Western Abenaki 394
adx Amdo Tibetan 1107
ady Adyghe 7202
ae Avestan 371
af Afrikaans 19804
aii Assyrian Neo-Aramaic 720
ain Ainu 744
akk Akkadian 734
akz Alabama 371
alt Southern Altai 843
am Amharic 2396
an Aragonese 5173
ang Old English 14898
ar Arabic 134311
arc Aramaic 3871
arn Mapuche 2431
ast Asturian 52118
av Avar 479
axm Middle Armenian 596
az Azerbaijani 15465
ba Bashkir 5793
bal Baluchi 765
be Belarusian 20199
bg Bulgarian 337071
bi Bislama 257
bm Bambara 5211
bn Bengali 9907
bo Tibetan 1620
br Breton 18069
ca Catalan 123780
ccc Chamicuro 883
ce Chechen 3372
ceb Cebuano 11346
ch Chamorro 419
chk Chuukese 1356
chl Cahuilla 1021
cho Choctaw 364
chr Cherokee 2105
cic Chickasaw 1451
cim Cimbrian 738
cjs Shor 751
co Corsican 4936
cop Coptic 383
crh Crimean Turkish 5550
cs Czech 129183
csb Kashubian 1336
cu Church Slavic 15262
cv Chuvash 3818
cy Welsh 18184
da Danish 67915
dak Dakota 1067
de German 825741
dje Zarma 727
dlm Dalmatian 1796
dsb Lower Sorbian 8526
dua Duala 679
dum Middle Dutch 2351
dv Divehi 519
ee Ewe 1207
egl Emilian 849
egx Egyptian languages 1047
egy Ancient Egyptian 402
el Greek 71970
en English 1803873
enm Middle English 13013
eo Esperanto 171527
es Spanish 782760
esu Central Yupik 553
et Estonian 29968
eu Basque 52340
fa Persian 61883
ff Fula 315
fi Finnish 381278
fil Filipino 17620
fj Fijian 463
fo Faroese 18081
fon Fon 1642
fr French 3023144
frk Frankish 620
frm Middle French 10931
fro Old French 33797
frp Franco-Provençal 5568
frr Northern Frisian 1085
fur Friulian 5101
fy Western Frisian 12009
ga Irish 42963
gag Gagauz 1118
gd Scottish Gaelic 27313
gl Galician 76598
gmh Middle High German 2368
gml Middle Low German 2155
gn Guarani 624
goh Old High German 6385
got Gothic 3726
grc Ancient Greek 37717
gsw Swiss German 1752
gu Gujarati 3361
gv Manx 14404
ha Hausa 1164
hak Hakka 788
haw Hawaiian 3104
hbo Ancient Hebrew 5864
he Hebrew 40906
hi Hindi 21163
hil Hiligaynon 2914
hit Hittite 379
hke Hunde 492
hsb Upper Sorbian 52975
ht Haitian Creole 3799
hu Hungarian 81638
hy Armenian 34969
ia Interlingua 8632
ie Interlingue 698
ii Yi 105
ilo Ilocano 470
io Ido 39078
is Icelandic 55767
ist Istriot 848
it Italian 1078629
iu Inuktitut 3739
ja Japanese 363663
jbo Lojban 2043
jv Javanese 5287
ka Georgian 41801
kbd Kabardian 1422
khb Tai Lü 335
ki Kikuyu 473
kim Tofa 799
kjh Khakas 1009
kk Kazakh 20779
kl Kalaallisut 2127
km Khmer 5971
kn Kannada 4278
ko Korean 47268
koy Koyukon 422
krc Karachay-Balkar 1108
krl Karelian 943
ku Kurdish 15680
kum Kumyk 1213
kw Cornish 3707
ky Kyrgyz 6349
la Latin 1334135
lad Ladino 3108
lb Luxembourgish 16743
li Limburgish 1627
lij Ligurian 769
liv Livonian 883
lkt Lakota 1443
lld Ladin 9924
lmo Lombard 2332
ln Lingala 8624
lo Lao 3892
lt Lithuanian 30523
ltg Latgalian 1428
lv Latvian 48870
lzz Laz 376
mch Maquiritari 791
mdf Moksha 3828
mg Malagasy 53264
mga Middle Irish 425
mh Marshallese 377
mi Maori 8786
mk Macedonian 33270
ml Malayalam 7748
mn Mongolian 9821
mr Marathi 6879
ms Malay 124646
mt Maltese 5921
mul Multilingual 20214
mwl Mirandese 2593
my Burmese 5238
myv Erzya 1443
na Nauru 538
nah Nahuatl languages 2587
nan Min Nan Chinese 3963
nap Neapolitan 2406
nci Classical Nahuatl 5619
nds Low German 8668
ne Nepali 6041
nhn Central Nahuatl 437
nl Dutch 267641
nmn !Xóõ 594
no Norwegian 125633
nog Nogai 1064
non Old Norse 7868
nov Novial 1585
nrf Jèrriais 19687
nv Navajo 12432
oc Occitan 42122
odt Old Dutch 838
ofs Old Frisian 1848
oge Old Georgian 778
oj Ojibwa 1388
oma Omaha-Ponca 1502
or Oriya 538
orv Old Russian 597
os Ossetic 9608
osp Old Spanish 1186
osx Old Saxon 4375
ota Ottoman Turkish 1993
pa Punjabi 4648
pal Pahlavi 1235
pap Papiamento 7222
pcd Picard 3146
peo Old Persian 324
pi Pali 2027
pjt Pitjantjatjara 479
pl Polish 191190
pms Piedmontese 3157
ppl Pipil 491
prg Prussian 1298
pro Old Provençal 11884
ps Pashto 2314
pt Portuguese 473709
qu Quechua 7029
qya Quenya 370
raj Rajasthani 393
rap Rapa Nui 661
rm Romansh 9773
ro Romanian 66260
roa-opt Old Portuguese 2959
rom Romany 1370
ru Russian 680205
rue Rusyn 319
rup Aromanian 7212
rw Kinyarwanda 1016
sa Sanskrit 9584
sah Sakha 3551
sc Sardinian 3617
scn Sicilian 7215
sco Scots 11789
sd Sindhi 439
se Northern Sami 134734
ses Koyraboro Senni 6451
sga Old Irish 5976
sh Serbo-Croatian 148819
shh Shoshoni 366
si Sinhala 2877
sk Slovak 29768
sl Slovenian 160496
sm Samoan 1378
smn Inari Sami 632
sms Skolt Sami 599
so Somali 1514
sq Albanian 24692
srn Sranan Tongo 2344
st Sotho 458
stq Saterland Frisian 2402
su Sundanese 3044
sux Sumerian 1012
sv Swedish 268402
sw Swahili 12648
swb Comorian 1958
syc Classical Syriac 5540
szl Silesian 340
ta Tamil 11785
te Telugu 26434
tet Tetum 798
tg Tajik 5683
th Thai 103096
ti Tigrinya 653
tk Turkmen 3061
tpi Tok Pisin 1761
tpw Tupi 573
tr Turkish 65892
tt Tatar 6440
twf Northern Tiwa 941
txb Tokharian B 391
ty Tahitian 684
tyv Tuvinian 872
udm Udmurt 489
ug Uyghur 2089
uga Ugaritic 712
uk Ukrainian 46284
ur Urdu 11832
uz Uzbek 7519
vec Venetian 10830
vep Veps 3691
vi Vietnamese 54774
vo Volapük 14678
vot Votic 970
wa Walloon 4797
wae Walser 879
war Waray 12607
wau Waurá 380
wo Wolof 2549
wym Wymysorys 2621
xcl Classical Armenian 26356
xh Xhosa 407
xmf Mingrelian 484
xno Anglo-Norman 559
xpr Parthian 384
xto Tokharian A 280
xwo Oirat 1408
yi Yiddish 13591
yo Yoruba 2672
yua Yucateco 1546
za Zhuang 315
zdj Ngazidja Comorian 1908
zh Chinese 242746
zu Zulu 2217
zza Zaza 1006