The dataset is a collection of Estonian news and social media texts annotated with named entities.
The dataset is divided into training, development, and test sets. The annotations can be hierarchical: one named entity may be nested inside another, up to a maximum of three levels.

| | Train | Dev | Test | Total |
| --- | --- | --- | --- | --- |
| Documents | 78 | 16 | 15 | 109 |
| Sentences | 7001 | 882 | 890 | 8773 |
| Tokens | 111858 | 13130 | 14686 | 139674 |
| 1-level entities | 7480 | 497 | 938 | 8915 |
| 2-level entities | 571 | 44 | 59 | 674 |
| 3-level entities | 27 | 0 | 1 | 28 |

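
To make the notion of annotation levels concrete, here is a minimal Python sketch that assigns a level to each entity span in a sentence. The `Span` class, the token-offset representation, and the `nesting_levels` helper are assumptions made for illustration only; they do not describe the dataset's actual file format, and the example phrase is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical entity span: token offsets [start, end) plus a label."""
    start: int
    end: int
    label: str


def nesting_levels(spans):
    """Return {span: level}, where level 1 is an outermost entity and
    level n + 1 is an entity contained in a level-n entity."""
    levels = {}
    # Visit longer (containing) spans before the shorter spans they contain.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    for span in ordered:
        parents = [
            p for p in levels
            if p != span and p.start <= span.start and span.end <= p.end
        ]
        levels[span] = 1 + max((levels[p] for p in parents), default=0)
    return levels


# Illustrative tokens: ["Tallinna", "Tehnikaülikool"]
# An ORG covering both tokens contains a GPE covering the first token.
org = Span(0, 2, "ORG")
gpe = Span(0, 1, "GPE")
levels = nesting_levels([gpe, org])
# ORG -> level 1, GPE -> level 2
```

Under this reading, the deepest level reached inside an outermost entity is what a limit of three levels would constrain.
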
The dataset is annotated with the following entity types:
- PER - person names
- GPE - geopolitical entities
- LOC - geographical locations
- ORG - organizations
- PROD - products, things, works of art
- EVENT - events
- DATE - dates
- TIME - times
- TITLE - titles and professions
- MONEY - monetary expressions
- PERCENT - percentages
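
When the labels above are used for token classification, one common option is a BIO tag inventory built from the eleven types. The sketch below assumes a BIO scheme with one tag sequence per annotation level; the source does not state which tagging scheme the files use, and the names `ENTITY_TYPES`, `bio_tags`, `TAGS`, and `TAG2ID` are illustrative.

```python
# The eleven entity types listed above.
ENTITY_TYPES = [
    "PER", "GPE", "LOC", "ORG", "PROD", "EVENT",
    "DATE", "TIME", "TITLE", "MONEY", "PERCENT",
]


def bio_tags(types):
    """Build a BIO tag list: "O" plus a B- and an I- tag for each type."""
    tags = ["O"]
    for entity_type in types:
        tags.append(f"B-{entity_type}")
        tags.append(f"I-{entity_type}")
    return tags


TAGS = bio_tags(ENTITY_TYPES)                       # 1 + 2 * 11 = 23 tags
TAG2ID = {tag: i for i, tag in enumerate(TAGS)}     # tag -> integer id
```

Because entities can be nested, a model following this sketch would predict a separate tag sequence for each level rather than a single sequence per sentence.
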

| Entity type | Train | Dev | Test | Total |
| --- | --- | --- | --- | --- |
| PER | 2601 | 109 | 299 | 3009 |
| ORG | 1177 | 85 | 150 | 1412 |
| LOC | 449 | 31 | 35 | 515 |
| GPE | 1253 | 129 | 231 | 1613 |
| TITLE | 702 | 19 | 59 | 772 |
| PROD | 624 | 60 | 117 | 801 |
| EVENT | 230 | 15 | 26 | 271 |
| DATE | 746 | 64 | 77 | 887 |
| TIME | 103 | 6 | 6 | 115 |
| PERCENT | 75 | 11 | 1 | 87 |
| MONEY | 118 | 12 | 1 | 131 |
| Total | 8078 | 541 | 994 | 9613 |

| Entity type | Train | Dev | Test | Total |
| --- | --- | --- | --- | --- |
| PER | 108 | 1 | 14 | 123 |
| ORG | 92 | 8 | 6 | 106 |
| LOC | 25 | 1 | 0 | 26 |
| GPE | 379 | 35 | 42 | 456 |
| TITLE | 3 | 0 | 0 | 3 |
| PROD | 4 | 0 | 0 | 4 |
| EVENT | 1 | 0 | 0 | 1 |
| DATE | 10 | 0 | 0 | 10 |
| TIME | 0 | 0 | 0 | 0 |
| PERCENT | 0 | 0 | 0 | 0 |
| MONEY | 0 | 0 | 0 | 0 |
| Total | 622 | 45 | 62 | 729 |

| Entity type | Train | Dev | Test | Total |
| --- | --- | --- | --- | --- |
| PER | 1 | 0 | 0 | 1 |
| ORG | 0 | 0 | 0 | 0 |
| LOC | 1 | 0 | 0 | 1 |
| GPE | 25 | 0 | 1 | 26 |
| TITLE | 0 | 0 | 0 | 0 |
| PROD | 0 | 0 | 0 | 0 |
| EVENT | 0 | 0 | 0 | 0 |
| DATE | 0 | 0 | 0 | 0 |
| TIME | 0 | 0 | 0 | 0 |
| PERCENT | 0 | 0 | 0 | 0 |
| MONEY | 0 | 0 | 0 | 0 |
| Total | 27 | 0 | 1 | 28 |

This work is licensed under a Creative Commons Attribution 4.0 International License.