# Sequence labeling

* Many classification tasks produce a sequence of predictions, rather than a single prediction
* In this lecture we have a look at these tasks, and try to understand what makes this setting special

### POS Tagging

![posfig](figs/pos_voita.png)
![posfig](figs/pos_house.png)

* Every word is assigned to its part-of-speech category
* The number of categories can be quite large; in this case, though, there are fewer than 20 (see the [Universal Dependencies POS tag list](https://universaldependencies.org/u/pos/index.html))
* POS tagging is often used as a pre-processing step
* You can also use it to pick important words as features (nouns, verbs, etc.)
* Note the context-dependence of the tags
  * `voita` can also be a verb, and `voi` can also be a noun
  * `house` can be a noun or a verb
  * ...
* The tags also have a dependence among each other
  * Many sequences are impossible or at least highly unlikely, regardless of the input
  * In English, having seen a determiner, the likely next tag is a noun or an adjective, and e.g. a verb is extremely unlikely
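
The dependence between neighbouring tags can be made concrete by counting tag bigrams. The sketch below is only an illustration: the three tag sequences are hand-written toy data (not taken from any corpus), and `transition_probs` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Toy, hand-written tag sequences using UD-style tags; not real corpus data.
tagged_sentences = [
    ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["PRON", "VERB", "DET", "NOUN"],
]


def transition_probs(sequences):
    """Estimate P(next_tag | tag) as the relative frequency of tag bigrams."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }


probs = transition_probs(tagged_sentences)
print(probs["DET"])  # which tags follow a determiner in the toy data
```

In this toy data a determiner is only ever followed by a noun or an adjective, never by a verb, which mirrors the observation above.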
  

### Named entity recognition

![nerfig](figs/ner_demo.png)
![nerfig](figs/ner_demo_en.png)


* NER is usually cast as a sequence labeling problem
* Entities are (typically) sequences of words, like `Turun Yliopisto` or `British Airways`
* The type tells what kind of an entity we have. The list of types is usually quite restricted: `Person, Organization, Location, Product, Event, Date, Other` would be a typical list

### BIO-coding

* NER and other similar tasks that involve locating multi-word entities are cast as classification of individual tokens into three groups of classes:

* **B-category**: The token begins an entity of type `category`. For example `B-Person` or `B-Location`
* **I-category**: The token continues an entity that is already started (with a `B-category`)
* **O**: The token is not a part of any entity

Here is an example from our [Finnish NER training data](https://github.com/TurkuNLP/turku-ner-corpus):

```
The	B-PRO
Garden	I-PRO
Collection	I-PRO
by	O
H&M	B-ORG

Viikonlopun	O
pyöritys	O
alkoi	O
H&M:n	B-ORG
järjestämällä	O
bloggaajabrunssilla	O
Helsingissä	B-LOC
.	O
```
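
As a minimal sketch of how such labels could be produced, the function below turns entity spans into BIO tags. The span format `(start, end, type)` with an exclusive `end`, and the name `spans_to_bio`, are assumptions made for this illustration, not the format of the corpus itself.

```python
def spans_to_bio(n_tokens, spans):
    """Convert non-overlapping (start, end, type) spans, end exclusive, to BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"       # first token begins the entity
        for i in range(start + 1, end):  # remaining tokens continue it
            tags[i] = f"I-{etype}"
    return tags


tokens = ["The", "Garden", "Collection", "by", "H&M"]
spans = [(0, 3, "PRO"), (4, 5, "ORG")]  # assumed span annotation for the first example
bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```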

* `BIO-coding` is suitable for cases where entities are neither nested nor overlapping
* There are, once again, quite clear dependencies between labels regardless of the input:
  * Examples of legal sequences: `O B-Person O O`, `B-Person I-Person O O`, `B-Person B-Person`
  * Examples of illegal sequences: `B-Person O I-Person O`, `O O I-Person O O`, `O B-Person I-Event O`
* Preferably, the classifier should be prevented from producing illegal BIO sequences
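
The legality rule can be stated compactly: an `I-category` tag must directly follow a `B-` or `I-` tag of the same category. A minimal sketch of such a check follows; `is_valid_bio` is a hypothetical helper name, and the tag format `B-Type` / `I-Type` / `O` is assumed.

```python
def is_valid_bio(tags):
    """Return True if every I-X tag directly follows a B-X or I-X tag."""
    prev = "O"  # the position before the sequence behaves like O
    for tag in tags:
        if tag.startswith("I-"):
            category = tag[2:]
            if prev not in (f"B-{category}", f"I-{category}"):
                return False
        prev = tag
    return True


legal = ["O B-Person O O", "B-Person I-Person O O", "B-Person B-Person"]
illegal = ["B-Person O I-Person O", "O O I-Person O O", "O B-Person I-Event O"]
for seq in legal + illegal:
    print(seq, "->", is_valid_bio(seq.split()))
```

One possible use of the same rule is during decoding, where transitions that violate it are simply excluded, so that the classifier cannot output an illegal sequence in the first place.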

### Text segmentation

* Text segmentation (splitting into tokens and sentences) is often carried out as sequence labeling
* One would label every individual character as one of:
  * token ends after this character
  * sentence ends after this character
  * inside token

Example:

```
Is it you?

I     inside
s     token-break
      token-break
i     inside
t     token-break
      token-break
y     inside
o     inside
u     token-break
?     sentence-break
```

* **Note:** precisely what happens at spaces is implementation-dependent and can be handled in various ways. This is only one of the possibilities
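
To show how such per-character labels are turned back into tokens and sentences, here is a minimal sketch. It assumes the label names from the example above and the convention that whitespace never becomes part of a token; `segment` is a hypothetical helper name.

```python
def segment(text, labels):
    """Rebuild sentences (lists of tokens) from per-character labels."""
    sentences, tokens, current = [], [], ""
    for ch, label in zip(text, labels):
        if not ch.isspace():
            current += ch  # whitespace is never part of a token
        if label in ("token-break", "sentence-break") and current:
            tokens.append(current)
            current = ""
        if label == "sentence-break" and tokens:
            sentences.append(tokens)
            tokens = []
    # Flush whatever is left if the text does not end with a sentence break.
    if current:
        tokens.append(current)
    if tokens:
        sentences.append(tokens)
    return sentences


text = "Is it you?"
labels = [
    "inside", "token-break", "token-break",            # I, s, space
    "inside", "token-break", "token-break",            # i, t, space
    "inside", "inside", "token-break",                 # y, o, u
    "sentence-break",                                  # ?
]
segmented = segment(text, labels)
print(segmented)
```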

### Zoning

* In many applications, one may want to separate text into zones
    * scientific papers may need to be separated into background, methods, results, citations
    * patents can be separated into background and claims
    * ...
* BIO coding is also applicable here
    * the unit of classification may be whole sentences or even paragraphs rather than words
    * this depends on the task, i.e. whether you can expect a zone to change half-way through a sentence
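
As a sketch of BIO coding applied to zoning, the function below groups one label per sentence into contiguous zones. The zone names and the label sequence are invented for illustration, and `bio_to_zones` is a hypothetical helper name.

```python
def bio_to_zones(labels):
    """Group unit-level BIO labels into (zone_type, start, end) triples, end exclusive."""
    zones = []
    start, ztype = None, None
    for i, label in enumerate(labels):
        continues = label.startswith("I-") and label[2:] == ztype
        if not continues:
            if ztype is not None:
                zones.append((ztype, start, i))  # close the open zone
            if label == "O":
                start, ztype = None, None
            else:
                # B-X opens a zone; a stray I-X is leniently treated the same way.
                start, ztype = i, label[2:]
    if ztype is not None:
        zones.append((ztype, start, len(labels)))
    return zones


# One label per sentence of a hypothetical six-sentence paper.
sentence_labels = [
    "B-Background", "I-Background",
    "B-Methods", "I-Methods", "I-Methods",
    "B-Results",
]
zones = bio_to_zones(sentence_labels)
print(zones)
```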

![zoningfig](figs/zones.gif)
Figure from: https://www.cl.cam.ac.uk/~sht25/az.html