
## CoNLL-U file format

* Terminology reminder: format != schema
   * The file format defines how the file is structured (syntax), while the annotation schema defines what the annotation means


* The CoNLL-U format is based on lines and columns

* Sentence/document metadata: lines starting with "#"
* Empty line: sentence boundary
* Each numbered line is a separate token; the tab-separated columns hold the different annotations for that token
* An underscore marks an empty field (no annotation required, or annotation missing)
* columns: ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC
  * ID: word index, starting from 1 in each sentence
  * FORM: original word form as it appeared in the text
  * LEMMA: base form
  * UPOS: universal part-of-speech tag (17 values)
  * XPOS: language-specific part-of-speech tag (different in each corpus)
  * FEAT: list of morphological features as `Name=Value` pairs separated by `|` (the column is named FEATS in the CoNLL-U specification)
  * HEAD: governor in the dependency tree (ID of the head token), or zero for the root token
  * DEPREL: dependency relation type
  * DEPS: enhanced dependency graph
  * MISC: any other annotation, especially original spacing (e.g. `SpaceAfter=No`)
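
To make the morphological feature column more concrete, here is a minimal sketch of how its value could be turned into a dictionary. The helper name `parse_feats` is hypothetical (it is not part of this notebook or of any library), and the sketch assumes the value is either `_` or a `|`-separated list of `Name=Value` pairs.

```python
def parse_feats(feats):
    """Turn a feature string such as 'Case=Par|Number=Sing' into a dict.

    parse_feats is a hypothetical helper used only for illustration.
    """
    if feats == "_":  # underscore: the token has no features
        return {}
    result = {}
    for pair in feats.split("|"):
        # split on the first '=' only, so the value is kept intact
        name, value = pair.split("=", 1)
        result[name] = value
    return result


print(parse_feats("Case=Par|Number=Sing"))
print(parse_feats("_"))
```

A dictionary makes it easy to ask questions such as "is this token in the partitive case?" with `parse_feats(value).get("Case") == "Par"`.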

```
# sent_id = wn096.9
# text = Lisäksi katkos häiritsi merkittävästi Egyptin ja Intian verkkoliikennettä.
1  Lisäksi           lisäksi          ADV    Adv    _                       3 advmod    _    _
2  katkos            katkos           NOUN   N      Case=Nom|Number=Sing    3 nsubj     _    _
3  häiritsi          häiritä          VERB   V      Mood=Ind|Number=Sing... 0 root      _    _
4  merkittävästi     merkittävästi    ADV    Adv    Derivation=Sti          3 advmod    _    _
5  Egyptin           Egypti           PROPN  N      Case=Gen|Number=Sing    8 nmod:poss _    _
6  ja                ja               CCONJ  C      _                       7 cc        _    _
7  Intian            Intia            PROPN  N      Case=Gen|Number=Sing    5 conj      _    _
8  verkkoliikennettä verkko#liikenne  NOUN   N      Case=Par|Number=Sing    3 obj       _    SpaceAfter=No
9  .                 .                PUNCT  Punct  _                       3 punct     _    _

# sent_id = b112.2
# text = Sain sähköpostia.
1  Sain              saada            VERB   V      Mood=Ind|Number=Sing... 0 root      _    _
2  sähköpostia       sähkö#posti      NOUN   N      Case=Par|Number=Sing    1 obj       _    SpaceAfter=No
3  .                 .                PUNCT  Punct  _                       1 punct     _    _

```
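
In the example above, `SpaceAfter=No` in the MISC column marks tokens that are not followed by a space in the original text. The sketch below shows one way this information could be used to rebuild the sentence text from the tokens. The function name `detokenize` is hypothetical, and the tokens are given as simple (FORM, MISC) pairs to keep the example short.

```python
def detokenize(tokens):
    """Rebuild the sentence text from (FORM, MISC) pairs.

    detokenize is a hypothetical helper used only for illustration.
    """
    pieces = []
    for form, misc in tokens:
        pieces.append(form)
        # MISC is a |-separated list; SpaceAfter=No means no space follows
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    # drop the space added after the final token
    return "".join(pieces).rstrip()


# FORM and MISC values of the second example sentence (b112.2)
tokens = [("Sain", "_"), ("sähköpostia", "SpaceAfter=No"), (".", "_")]
print(detokenize(tokens))  # Sain sähköpostia.
```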

## Quality of the parser output

* A machine-learned parser will ALWAYS produce an output. If it does not know a word, it guesses based on the form of the word and the surrounding sentence.

* The quality of the predictions can be measured when the correct analyses are known

* Evaluation results for the Finnish model trained on the UD_Finnish-TDT training set and evaluated on the UD_Finnish-TDT test set:



```
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.72 |     99.66 |     99.69 |
Sentences  |     88.18 |     84.89 |     86.50 |
Words      |     99.72 |     99.64 |     99.68 |
UPOS       |     97.84 |     97.77 |     97.80 |     98.12
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     96.46 |     96.39 |     96.42 |     96.73
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |     96.17 |     96.10 |     96.13 |     96.44
UAS        |     92.93 |     92.87 |     92.90 |     93.20
LAS        |     91.27 |     91.21 |     91.24 |     91.53
CLAS       |     90.64 |     90.45 |     90.55 |     90.75
MLAS       |     85.63 |     85.44 |     85.53 |     85.73
BLEX       |     86.51 |     86.32 |     86.42 |     86.61
```
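
The F1 Score column in the table above is the harmonic mean of precision and recall. As a rough check, the sketch below recomputes it from the rounded percentages in the table. The function name `f1_score` is chosen here for illustration, and an evaluation script would typically work from raw counts, so the last digit may occasionally differ.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:  # e.g. the XPOS row, where both are 0.00
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the UPOS and Sentences rows of the table
print(f"UPOS      {f1_score(97.84, 97.77):.2f}")
print(f"Sentences {f1_score(88.18, 84.89):.2f}")
```

Precision and recall differ because the parser also predicts tokenization and sentence boundaries, so the number of predicted units is not always the same as the number of correct units.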

How to read the table:

* *How often does the parser predict the universal part-of-speech tag correctly?* The F1 for UPOS is 97.80, so about 98 tokens out of 100 get the correct tag.
* *How often does the parser predict the syntactic tree correctly?* The F1 for LAS is 91.24, so about 91 tokens out of 100 have both a correctly predicted parent token (HEAD) and a correctly predicted relation type (DEPREL).
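
To illustrate what the attachment scores measure, here is a simplified sketch that compares a correct analysis with a parser prediction for a single sentence. It assumes both have exactly the same tokens, which a real evaluation cannot assume. The names `attachment_scores` and `make_token` are hypothetical, and the "predicted" analysis is invented for the example.

```python
HEAD, DEPREL = 6, 7  # positions of the two columns in a CoNLL-U token line


def make_token(idx, form, head, deprel):
    """Build a 10-column token row, filling only the columns used here."""
    row = ["_"] * 10
    row[0], row[1], row[HEAD], row[DEPREL] = str(idx), form, str(head), deprel
    return row


def attachment_scores(gold_sent, pred_sent):
    """Return (UAS, LAS) in percent, assuming identical tokenization."""
    if len(gold_sent) != len(pred_sent):
        raise ValueError("gold and predicted sentences must align")
    head_ok = both_ok = 0
    for gold, pred in zip(gold_sent, pred_sent):
        if gold[HEAD] == pred[HEAD]:
            head_ok += 1  # counts towards UAS
            if gold[DEPREL] == pred[DEPREL]:
                both_ok += 1  # counts towards LAS
    n = len(gold_sent)
    return 100 * head_ok / n, 100 * both_ok / n


gold = [make_token(1, "Sain", 0, "root"),
        make_token(2, "sähköpostia", 1, "obj"),
        make_token(3, ".", 1, "punct")]
# Invented prediction: correct heads, but a wrong label on token 2
pred = [make_token(1, "Sain", 0, "root"),
        make_token(2, "sähköpostia", 1, "nsubj"),
        make_token(3, ".", 1, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS {uas:.2f}  LAS {las:.2f}")  # UAS 100.00  LAS 66.67
```

UAS only checks the HEAD column, while LAS also requires the DEPREL label to match, so LAS can never be higher than UAS.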



What affects the prediction quality?

* In-domain vs. out-of-domain data
* Common vs. rare words
* Errors often cluster

## How to read CoNLL-U in Python?

conllu_data = """
# sent_id = b101.2
# text = Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin.
1	Jäällä	jää	NOUN	N	Case=Ade|Number=Sing	2	nmod	2:nmod	_
2	kävely	kävely	NOUN	N	Case=Nom|Derivation=U|Number=Sing	3	nsubj	3:nsubj	_
3	avaa	avata	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
4	aina	aina	ADV	Adv	_	3	advmod	3:advmod	_
5	hauskoja	hauska	ADJ	A	Case=Par|Degree=Pos|Number=Plur	8	amod	8:amod	_
6	ja	ja	CCONJ	C	_	7	cc	7:cc	_
7	erikoisia	erikoinen	ADJ	A	Case=Par|Degree=Pos|Derivation=Inen|Number=Plur	5	conj	5:conj|8:amod	_
8	näkökulmia	näkö#kulma	NOUN	N	Case=Par|Number=Plur	3	obj	3:obj	_
9	kaupunkiin	kaupunki	NOUN	N	Case=Ill|Number=Sing	8	nmod	8:nmod	SpaceAfter=No
10	.	.	PUNCT	Punct	_	3	punct	3:punct	_

# sent_id = b101.3
# text = Vähän samanlainen tunne kuin silloin, kun ystävämme vei meidät kerran ylöstuomiokirkon torniin.
1	Vähän	vähän	ADV	Adv	_	2	advmod	2:advmod	_
2	samanlainen	samanlainen	ADJ	A	Case=Nom|Degree=Pos|Derivation=Lainen|Number=Sing	3	amod	3:amod	_
3	tunne	tunne	NOUN	N	Case=Nom|Number=Sing	0	root	0:root	_
4	kuin	kuin	SCONJ	C	_	5	mark	5:mark	_
5	silloin	silloin	ADV	Adv	_	2	advcl	2:advcl	SpaceAfter=No
6	,	,	PUNCT	Punct	_	9	punct	9:punct	_
7	kun	kun	SCONJ	C	_	9	mark	9:mark	_
8	ystävämme	ystävä	NOUN	N	Case=Nom|Number=Sing|Number[psor]=Plur|Person[psor]=1	9	nsubj	9:nsubj	_
9	vei	viedä	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	5	advcl	5:advcl	_
10	meidät	minä	PRON	Pron	Case=Acc|Number=Plur|Person=1|PronType=Prs	9	obj	9:obj	_
11	kerran	kerran	ADV	Adv	_	9	advmod	9:advmod	_
12	ylös	ylös	ADV	Adv	_	14	advmod	14:advmod	SpaceAfter=No
13	tuomiokirkon	tuomio#kirkko	NOUN	N	Case=Gen|Number=Sing	14	nmod:poss	14:nmod:poss	_
14	torniin	torni	NOUN	N	Case=Ill|Number=Sing	9	obl	9:obl	SpaceAfter=No
15	.	.	PUNCT	Punct	_	3	punct	3:punct	_

"""

ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC=range(10)

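def read_conllu(f):
    """Yield (comments, tokens) for each sentence.

    f is an open file object or any iterable of CoNLL-U lines.
    """
    sent = []
    comment = []
    for line in f:
        line = line.strip()
        if not line:  # empty line: sentence boundary
            if sent:
                yield comment, sent
            comment = []
            sent = []
        elif line.startswith("#"):  # metadata line
            comment.append(line)
        else:  # token line: split into its 10 columns
            sent.append(line.split("\t"))
    # yield the last sentence if the input does not end with an empty line
    if sent:
        yield comment, sent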

for comm, sent in read_conllu(conllu_data.split("\n")):
    print(" ".join(token[FORM] for token in sent))

Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin .
Vähän samanlainen tunne kuin silloin , kun ystävämme vei meidät kerran ylös tuomiokirkon torniin .
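
The HEAD and DEPREL columns can be read in the same way as FORM. As a further sketch, the self-contained example below prints each token of the first sentence together with its relation and its head word. The function name `dependency_triples` is hypothetical, and the token lines are written with explicit `\t` escapes so that the tabs survive copy and paste.

```python
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)

# Token lines of sentence b101.2, copied from the data above
rows = [
    "1\tJäällä\tjää\tNOUN\tN\tCase=Ade|Number=Sing\t2\tnmod\t2:nmod\t_",
    "2\tkävely\tkävely\tNOUN\tN\tCase=Nom|Derivation=U|Number=Sing\t3\tnsubj\t3:nsubj\t_",
    "3\tavaa\tavata\tVERB\tV\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t0:root\t_",
    "4\taina\taina\tADV\tAdv\t_\t3\tadvmod\t3:advmod\t_",
    "5\thauskoja\thauska\tADJ\tA\tCase=Par|Degree=Pos|Number=Plur\t8\tamod\t8:amod\t_",
    "6\tja\tja\tCCONJ\tC\t_\t7\tcc\t7:cc\t_",
    "7\terikoisia\terikoinen\tADJ\tA\tCase=Par|Degree=Pos|Derivation=Inen|Number=Plur\t5\tconj\t5:conj|8:amod\t_",
    "8\tnäkökulmia\tnäkö#kulma\tNOUN\tN\tCase=Par|Number=Plur\t3\tobj\t3:obj\t_",
    "9\tkaupunkiin\tkaupunki\tNOUN\tN\tCase=Ill|Number=Sing\t8\tnmod\t8:nmod\tSpaceAfter=No",
    "10\t.\t.\tPUNCT\tPunct\t_\t3\tpunct\t3:punct\t_",
]
sent = [row.split("\t") for row in rows]


def dependency_triples(sent):
    """Yield (dependent form, relation, head form) for each token."""
    # map token IDs (strings) to word forms so heads can be looked up
    forms = {token[ID]: token[FORM] for token in sent}
    for token in sent:
        head = token[HEAD]
        head_form = "ROOT" if head == "0" else forms[head]
        yield token[FORM], token[DEPREL], head_form


for dep, rel, head in dependency_triples(sent):
    print(f"{dep} --{rel}--> {head}")
```

The IDs are kept as strings here, so the lookup compares them exactly as they appear in the file.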
