# Annotation

(Large topic, we'll only be scratching the surface here!)

* Manually annotated data is a key resource for natural language processing
    * Evaluation (regardless of approach)
    * Training supervised machine learning methods
* Wealth of annotated text corpora available, e.g.
    * Finnish: [Kielipankki](https://www.kielipankki.fi/)
    * English: [Linguistic Data Consortium](https://www.ldc.upenn.edu/), [CoNLL shared tasks](http://www.conll.org/previous-tasks)
* To address a new task, language, (sub)domain, etc. it may be necessary to create new annotation

An annotated corpus can have a very large impact: for example, the Penn Treebank was the main focus of research into automatic syntactic analysis for more than a decade after its release in 1990.

<img src="figs/finnish_ner_example_large.png">

<div style="text-align:center; font-size:80%">Example: Finnish named entity annotation</div>

## Text selection

* Corpus texts must be *representative* (domain, genre, register, style) and *balanced*
    * Methods developed only on news text will perform poorly on tweets (and vice versa)
* (To share a corpus, its texts must have a license that allows redistribution!)

Examples:

<img src="figs/ptb_sources.png" width="50%">

<div style="text-align:center; font-size:80%">Composition of Penn Treebank 1 (table from <a href="https://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports">Marcus et al. (1993)</a>)</div>

<img src="figs/tdt_sources.png" width="60%">

<div style="text-align:center; font-size:80%">Composition of Turku Dependency Treebank (table from <a href="https://link.springer.com/article/10.1007%2Fs10579-013-9244-1">Haverinen et al. (2013)</a>)</div>

## Annotation guidelines

Detailed documentation of the annotation guidelines -- what is annotated and how -- is necessary to ensure the quality and consistency of annotation.

* Classes/categories/labels/types, for example
    * Sentiment analysis: `negative`, `neutral`, `positive`
    * Named entity recognition: `PERSON`, `LOCATION`, `ORGANIZATION`, ...
    * Part-of-speech annotation: `NOUN`, `VERB`, `ADJ`, ...
    * Dependency syntax: `subj`, `obj`, `nmod`, ...
* Annotation formalism and representation (e.g. continuous spans of characters)
* Scope of annotation, e.g. all words for POS/syntax, proper nouns for NER
* Examples, edge cases and exceptions
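
To make the "continuous spans of characters" representation concrete, here is a minimal sketch of standoff-style entity annotation in Python. The `Span` class, the example sentence, and the offsets are all made up for illustration (they are not taken from any particular corpus or tool); the only assumption is that each annotation is a label attached to a start offset (inclusive) and an end offset (exclusive) in the unmodified text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotation: a label on a continuous range of characters."""
    start: int  # offset of the first character (inclusive)
    end: int    # offset one past the last character (exclusive)
    label: str


def covered_text(text, span):
    """Return the substring of `text` that `span` annotates."""
    if not 0 <= span.start < span.end <= len(text):
        raise ValueError(f"span {span} is outside the text")
    return text[span.start:span.end]


# Hypothetical example sentence and annotations.
text = "Maija Virtanen works at Nokia in Espoo."
spans = [
    Span(0, 14, "PERSON"),
    Span(24, 29, "ORGANIZATION"),
    Span(33, 38, "LOCATION"),
]

for span in spans:
    print(span.label, repr(covered_text(text, span)))
```

Keeping annotations separate from the text in this way means the source text is never modified, and several annotation layers can refer to the same characters.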

Guidelines should ideally be complete before annotation starts, but are frequently updated and extended throughout annotation projects.

Examples: 

<img src="figs/ace_entity_example.png" width="60%">

<div style="text-align:center; font-size:80%">Extract from ACE entity annotation guidelines (<a href="https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-entities-guidelines-v6.6.pdf">LDC 2008</a>)</div>

<img src="figs/ud_cop_example.png" width="70%">

<div style="text-align:center; font-size:80%">Extract from <a href="http://universaldependencies.org/u/dep/cop.html">Universal Dependencies annotation guidelines</a></div>

* **English entity mention (named entity) annotation**: ACE (Automatic Content Extraction) entity guidelines ([LDC 2008](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-entities-guidelines-v6.6.pdf)): 72 pages
* **English relation annotation**: ACE (Automatic Content Extraction) relation guidelines ([LDC 2008](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-relations-guidelines-v6.2.pdf)): 33 pages
* **English constituency syntax**: Penn Treebank II bracketing guidelines ([Bies et al. 1995](http://cs.jhu.edu/~jason/465/hw-parse/treebank-manual.pdf)): 318 pages
* **Multilingual dependency syntax**: [Universal Dependencies](http://universaldependencies.org/) documentation: 15710 HTML pages (a **lot** of redundancy)

## Evaluation

Annotation is never perfect, and it is important to know the quality and consistency of an annotated corpus. One typical strategy:

* Two or more annotators are trained to perform the task
* A part of the data is annotated independently by each annotator, without communicating with others
* The redundant annotations are compared to identify differences

In classification-type tasks, *inter-annotator agreement* is frequently measured using [Cohen's kappa statistic](https://en.wikipedia.org/wiki/Cohen%27s_kappa), which accounts for the possibility of chance agreement.
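
As a rough illustration, Cohen's kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of items on which the annotators agree and p_e is the agreement expected by chance given each annotator's label distribution. The sketch below implements this directly; the function name `cohen_kappa` and the example labels are made up for illustration, and the function assumes both annotators labelled the same items in the same order.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each draws independently from their own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label for every item;
        # kappa is undefined (0/0), so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels from two annotators for six items.
annotator_a = ["positive", "positive", "negative", "neutral", "negative", "positive"]
annotator_b = ["positive", "neutral", "negative", "neutral", "negative", "negative"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # kappa = 0.52
```

Here the annotators agree on 4 of 6 items (p_o ≈ 0.67), but about 0.31 agreement would be expected by chance, so kappa is noticeably lower than the raw agreement rate.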

## Tools

Many text annotation tools exist, ranging from custom tools created for single annotation projects to commercial systems. One category of interest is free, browser-based tools, including (nb: incomplete listing):

* Annotator.js: http://annotatorjs.org/
* brat: http://brat.nlplab.org/
* hypothes.is: https://web.hypothes.is/
* TextAE: http://textae.pubannotation.org/
* WebAnno: https://webanno.github.io/webanno/

# Demonstration

(This will only work with a tunnel to the VM)

http://localhost:8001/index.xhtml#/ud-fi-ne/train/b603