# Off-the-shelf pipelines

* There are a number of ready-made pipelines for processing raw text into something more useful for us
* Use those unless you have special requirements or work in a specialized domain
* In this course, we focus on mastering these ready-made tools rather than building new ones (the Intro to NLP and NLP courses are for that)

## English: Stanford CoreNLP

* Set of tools for processing basic well-formed English
* The go-to solution for you unless you are in a very specific domain
* http://nlp.stanford.edu:8080/corenlp/ (online demo)
* http://stanfordnlp.github.io/CoreNLP/

## Finnish: Finnish dependency parser

* A full Finnish parsing pipeline developed here in Turku
* http://bionlp-www.utu.fi/parser_demo (online demo)
* http://turkunlp.github.io/Finnish-dep-parser/

## Others

There are a number of other similar pipelines and frameworks:

* Apache OpenNLP https://opennlp.apache.org/
* GATE https://gate.ac.uk/
* UIMA http://uima.apache.org/
* NLTK http://www.nltk.org/
* (...)
* ...but let's focus on the two above for our work with English and Finnish

# Elementary text analysis

* Raw text in, analyzed text out
* A typical text analysis pipeline will progress through a series of steps:

1. Split text into sentences
2. Split sentences into words (often steps 1 and 2 are reversed)
3. Get the base forms for the words and assign them linguistic categories
4. Analyze the syntactic structure of the sentence
5. Relate text segments to each other across sentences, recognize named entities,...

Note: These steps depend on each other in non-trivial ways: http://stanfordnlp.github.io/CoreNLP/dependencies.html

## Sentence splitting

* Not as easy as it sounds
* Split after a dot followed by a space and a capital letter, and call it a day?
  * *F. Ginter is holding this course*
* Gets even more exciting on Internet text - capitalization not necessarily consistent, no space after dot, etc.
  * *hi.i am a dude on the internet.really i have not yet discovered the shift key!!!!!!!! LOL!!!!*
* Existing tools typically expect plain text input - extraction of text from whatever source you have must have happened already
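
To see why the "dot, space, capital letter" rule falls short, here is a minimal sketch of such a splitter. The function name `naive_split` is made up for this illustration, and the rule only knows ASCII capitals; real sentence splitters are trained on data rather than hand-written like this.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    # The lookbehind keeps the punctuation with the sentence it ends;
    # the lookahead requires an (ASCII) capital letter to start the next one.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

# Works on clean text:
print(naive_split("This works. So does this."))

# Breaks on initials: "F." is treated as a sentence of its own.
print(naive_split("F. Ginter is holding this course"))

# Misses most boundaries on Internet text: no space after the dots.
print(naive_split("hi.i am a dude on the internet.really i have not yet "
                  "discovered the shift key!!!!!!!! LOL!!!!"))
```

The second call returns two "sentences" where there is one, and the third returns two where a human would probably see three or four.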

## Tokenization

* Divide running text into *tokens*
* Tokens: words, numbers, dates, punctuation
* Lots of room for interpretation on what should constitute a token
    * March 2, 2016  vs.  2.3.2016
    * ettei -> että ei
    * can't -> can not
    * dog's -> dog 's
    * NACA 2415 profile - is *NACA 2415* a single token?
    * F.G. - two tokens or not?
    * e.g. - two tokens or not?
* In the end, you are restricted to what the tool you use produces
* ...and that tool is restricted by the data it was built on
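
The room for interpretation shows up as soon as you try to write a tokenizer yourself. The sketch below compares two deliberately simple strategies; both function names are invented for this example, and neither is what CoreNLP or the Finnish pipeline actually does.

```python
import re

def whitespace_tokens(text):
    """Tokens are whatever sits between whitespace."""
    return text.split()

def regex_tokens(text):
    """Runs of word characters are tokens; every other non-space character is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "dog's bone, e.g. 2.3.2016"

# Punctuation stays glued to the words: "bone," is one token.
print(whitespace_tokens(sample))

# Everything is cut apart: "e.g." becomes four tokens, the date five.
print(regex_tokens(sample))
```

Neither output is obviously right, which is the point: the "correct" tokenization is whatever convention the training data of your tool followed.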

## Lemmatization

* Assign words to their base form
  * Dogs -> dog
  * sinullahan -> sinä
  * voi -> voi or voida
* Especially tough for inflective languages like Finnish
* Sadly, also especially useful for inflective languages like Finnish
  * Why?

Let's try:

```
git clone https://github.com/TurkuNLP/Finnish-dep-parser.git
cd Finnish-dep-parser
./install.sh
python omorfi_pos.py -i -o
```

Observations:

* Lots of ambiguity
  * Correct baseform must be decided based on the context
* When a word is not known -> no luck
* Heavy colloquial language -> no luck
* Spelling errors -> no luck
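
The observations above can be reproduced with a toy lookup lemmatizer. This is only a sketch: the `LEXICON` table and the function `candidate_lemmas` are hypothetical, the entries are the examples from this section, and the misspelled form `sinullahn` is made up. A real morphological analyzer handles inflection with far more machinery than a lookup table.

```python
# Toy lexicon: word form -> all possible base forms (entries taken from the examples above).
LEXICON = {
    "dogs": ["dog"],
    "sinullahan": ["sinä"],
    "voi": ["voi", "voida"],  # ambiguous: "butter" or a form of "can"
}

def candidate_lemmas(word):
    """Return every known base form of the word, or an empty list if it is unknown."""
    return LEXICON.get(word.lower(), [])

for word in ["Dogs", "sinullahan", "voi", "sinullahn"]:
    lemmas = candidate_lemmas(word)
    if not lemmas:
        print(word, "-> unknown word, no luck")
    elif len(lemmas) > 1:
        print(word, "-> ambiguous:", " or ".join(lemmas))
    else:
        print(word, "->", lemmas[0])
```

Picking between `voi` and `voida` is not something a lookup can do; that decision needs the sentence context.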

## POS tagging

* Assign words to their linguistic categories (POS - part of speech): noun, verb, etc.
* In practice: give each word a tag from a predefined set (of which there are many)
* Not an easy task for most languages:
  * Ambiguity wherever you look
  * English: *Time flies like an arrow*
  * Finnish: *Haetaan lakkaa satamasta, kun lakkaa satamasta*
* Let's try:
  * OpenNLP demo for English
  * Finnish parser demo
* English tags: [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
* Finnish tags: [here](http://universaldependencies.org/fi/pos/all.html)
* Very useful if you want to, e.g.:
  * Only grab the content-bearing words in a text
  * Distinguish between *lead* the metal and *to lead* the verb
* An important step for downstream tasks like syntactic parsing (densifies data)
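
Grabbing the content-bearing words can be as simple as filtering on the tag. The sketch below uses word/tag pairs copied from the CoreNLP output shown later in these notes; the function name `content_words` and the choice of tag prefixes (nouns, verbs, adjectives, adverbs in the Penn Treebank tag set) are assumptions made for this example.

```python
# (word, Penn Treebank tag) pairs, as produced by a POS tagger.
tagged = [
    ("The", "DT"), ("PIK-15", "NN"), ("Hinu", "NN"), ("was", "VBD"),
    ("a", "DT"), ("light", "JJ"), ("aircraft", "NN"),
    ("developed", "VBN"), ("in", "IN"), ("Finland", "NNP"),
]

# Assumed definition of "content-bearing": nouns, verbs, adjectives, adverbs.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")

def content_words(tagged_tokens):
    """Keep only the words whose tag starts with one of the content prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(CONTENT_PREFIXES)]

print(content_words(tagged))
```

Note that this crude filter keeps *was* (tagged `VBD`), which you may or may not consider content-bearing; a stop-word list or lemma filter would be the usual next step.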
  
## Syntactic parsing

* Many languages have a relatively free word order
* Dealing with text as a linear sequence of words leads to sparsity problems
  * So many ways to order the words...
  * Related words can be far apart in the sentence:
    * *Ford, as you may well know, is a car maker*
* It is often more useful to know how the words in a sentence relate to each other
* ...which makes it easier to mine relations from text (company - company mergers, and the like)
* Every sentence can be given a *syntactic tree*

English:
<img src="figs/ptree_eng_1.png"/>

Finnish:
<img src="figs/ptree_fin_1.png"/>

* The graphs above are trees
  * Every word depends on exactly one other word (its *head*), except for a single word that serves as the root of the sentence
  * The edges (called *dependencies*) have labels (called *dependency types*)
  * Finnish types: http://universaldependencies.org/fi/dep/all.html
  * English types: http://universaldependencies.org/en/dep/all.html
* These trees tell us a lot about the structure but also the meaning of the sentence
  * Oftentimes the same sentence can have a number of different meanings
  * These correspond to different syntactic trees
  * The *parser* must distinguish these
* Note: a single edge connects *Ford* and *maker* in the tree -> perfect for us!
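
The tree property can be checked mechanically once the heads are written down as a list. The sketch below uses the head indices of the Finnish example sentence shown in the CoNLL section of these notes; the function name `is_tree` is invented for this illustration.

```python
def is_tree(heads):
    """heads[i] is the 1-based index of the head of token i+1; 0 marks the root."""
    # Exactly one token may be the root.
    if heads.count(0) != 1:
        return False
    # Following the heads from any token must reach the root without looping.
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen or not 1 <= node <= len(heads):
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

# Kurt Hedström ja Tuomo Tervo aloittivat suunnittelun vuonna 1959 .
heads = [2, 6, 2, 5, 2, 0, 6, 6, 8, 6]
print(is_tree(heads))       # True: one root, no cycles

print(is_tree([2, 1, 0]))   # False: tokens 1 and 2 point at each other
```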

## Named Entity recognition

* Often, we are after *named entities* and their relations
* Names of companies, places, people, dates, drug names,...
* *Named Entity Recognition* is the task of identifying these
* The inventory of possible entity types entirely depends on the NER tool you use and the data it was built on / built for
* Typical off-the-shelf NER tools will recognize a small number of quite generic entity categories
  * Organization, Location, Person, etc.
  * Specialized tools exist for different domains: a recognizer built for news text will not do well on medical research articles (and vice versa)
* Let us try on English:
  * CoreNLP demo
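
NER tools often report one label per token, so multi-word entities have to be stitched together afterwards. Here is a minimal sketch of that step, using the per-token labels from the CoreNLP output shown later in these notes. The function name `entity_spans` is made up, and the approach assumes that neighboring tokens with the same label belong to one entity, which would wrongly merge two different entities of the same type standing next to each other.

```python
from itertools import groupby

def entity_spans(tagged):
    """Merge consecutive tokens carrying the same non-'O' label into one entity."""
    spans = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label != "O":  # 'O' marks tokens outside any entity
            spans.append((" ".join(word for word, _ in group), label))
    return spans

tagged = [
    ("The", "O"), ("PIK-15", "MISC"), ("Hinu", "MISC"), ("was", "O"),
    ("a", "O"), ("light", "O"), ("aircraft", "O"), ("developed", "O"),
    ("in", "O"), ("Finland", "LOCATION"), ("in", "O"),
    ("the", "DATE"), ("1960s", "DATE"),
]

print(entity_spans(tagged))
```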

## Coreference

* Two text segments referring to the same object in the real world
* Form a chain of backreferences

<img src="figs/coref_eng_1.png"/>

* Unlike the syntactic trees, coreferences span across sentences
* Allow us to assemble information about entities, even from sentences that do not mention them explicitly

# But how are all these tools built?

Short answer: **machine learning** from manually prepared (annotated) *training data*

* Build a syntactic parser -> you need thousands of example sentences for which the correct tree was given by a human
* Build a named entity recognizer -> you need thousands of examples of entities in context
* Sentence splitters and tokenizers are also built via machine learning nowadays
* ...
* No free lunches here - someone needs to prepare the training data
    * http://www.universaldependencies.org is a good collection of training datasets for 30+ languages

But this all has practical implications for us:

* The tools work best on data which is similar in type to the data on which they were trained
  * A POS tagger trained on news will perform much worse on Twitter
  * From Twitter: *"What do we want?!!" "PSYCHIC POWERS! NOW!" "When do we want it?!!"*
* Often we don't really have a choice - no special tools for processing Finnish Twitter, for example
* Be aware that the tools are not perfect, nowhere near so (as we have seen)


# How to run these tools in practice

* CoreNLP is really simple to run
  * Download and unpack, make sure you have Java v8
  * Pick which annotation you want, pick the output format, feed in text
  * http://stanfordnlp.github.io/CoreNLP/annotators.html
  

```
./corenlp.sh -annotators tokenize,ssplit,pos,lemma,depparse,ner -outputFormat conll -file ../wiki.txt
less wiki.txt.conll
column -t wiki.txt.conll
```

* Running the Finnish parsing pipeline is no harder:
  * Download and install using the instructions http://turkunlp.github.io/Finnish-dep-parser/

```
cat ../wiki-fin.txt | ./parser-wrapper.sh > wiki-fin.conllu
less wiki-fin.conllu
column -t wiki-fin.conllu
```


## CoNLL format

* **The** format in which analyzed text is passed around through the various tools
* The various versions of the format (CoNLL-X, CoNLL-09, CoNLL-U) differ in which columns are present; CoNLL-U will pretty much replace the others in the near future
* Sentences delimited by empty lines
* One line per token (word or punctuation)
* TAB-delimited columns containing per-token analysis

```
1	The	the	DT	O	3	det
2	PIK-15	pik-15	NN	MISC	3	compound
3	Hinu	hinu	NN	MISC	7	nsubj
4	was	be	VBD	O	7	cop
5	a	a	DT	O	7	det
6	light	light	JJ	O	7	amod
7	aircraft	aircraft	NN	O	0	ROOT
8	developed	develop	VBN	O	7	acl
9	in	in	IN	O	10	case
10	Finland	Finland	NNP	LOCATION	8	nmod
11	in	in	IN	O	13	case
12	the	the	DT	DATE	13	det
13	1960s	1960	NNS	DATE	8	nmod
14	for	for	IN	O	15	case
15	use	use	NN	O	13	nmod
16	as	as	IN	O	19	case
17	a	a	DT	O	19	det
18	glider	glider	NN	O	19	compound
19	tug	tug	NN	O	8	nmod
20	.	.	.	O	_	_
```

```
1       Kurt    Kurt    PROPN   _       Case=Nom|Number=Sing    2       name    _       _
2       Hedström        Hedström        PROPN   _       Case=Nom|Number=Sing    6       nsubj   _       _
3       ja      ja      CONJ    _       _       2       cc      _       _
4       Tuomo   Tuomo   PROPN   _       Case=Nom|Number=Sing    5       name    _       _
5       Tervo   Tervo   PROPN   _       Case=Nom|Number=Sing    2       conj    _       _
6       aloittivat      aloittaa        VERB    _       Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act 0       root    _       _
7       suunnittelun    suunnittelu     NOUN    _       Case=Gen|Number=Sing    6       dobj    _       _
8       vuonna  vuosi   NOUN    _       Case=Ess|Number=Sing    6       nmod    _       _
9       1959    1959    NUM     _       NumType=Card    8       nummod  _       _
10      .       .       PUNCT   _       _       6       punct   _       _
```
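
Since the format is just TAB-separated lines with empty lines between sentences, reading it takes only a few lines of code. The following is a minimal sketch that works for any of the CoNLL variants; the function name `read_conll` is made up, and the sample is a truncated excerpt of the English output above, cut into two artificial "sentences" for illustration.

```python
def read_conll(lines):
    """Yield one sentence at a time, each as a list of column lists."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # CoNLL-U files may carry comment lines
        else:
            sentence.append(line.split("\t"))
    if sentence:  # the file may not end with an empty line
        yield sentence

sample = (
    "1\tThe\tthe\tDT\tO\t3\tdet\n"
    "2\tPIK-15\tpik-15\tNN\tMISC\t3\tcompound\n"
    "\n"
    "1\t.\t.\t.\tO\t_\t_\n"
)

for sentence in read_conll(sample.splitlines()):
    print([columns[1] for columns in sentence])  # column 2 is the word form
```

To read a real file, pass the open file object instead of `sample.splitlines()`.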

### CoNLL-U format columns

Copied from http://universaldependencies.org/format.html

1. ID: Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem of word form.
4. UPOSTAG: [Universal part-of-speech tag](http://universaldependencies.org/u/pos/index.html)
5. XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
6. FEATS: List of morphological features from the [universal feature inventory](http://universaldependencies.org/u/feat/index.html) or from a defined [language-specific extension](http://universaldependencies.org/ext-feat-index.html); underscore if not available.
7. HEAD: Head of the current token, which is either a value of ID or zero (0).
8. DEPREL: [Universal Stanford dependency relation](http://universaldependencies.org/u/dep/index.html) to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9. DEPS: List of secondary dependencies (head-deprel pairs).
10. MISC: Any other annotation.
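
With the ten columns named, pulling dependency edges out of CoNLL-U output becomes straightforward. The sketch below is one possible way to do it, not an official reader: the names `Token`, `parse_token`, `parse_feats` and `edges` are invented here, the two sample lines are taken from the Finnish example above, and tokens with a range ID (multiword tokens) are not handled.

```python
from collections import namedtuple

# One field per CoNLL-U column, in the order listed above.
Token = namedtuple("Token", "id form lemma upostag xpostag feats head deprel deps misc")

def parse_token(line):
    """Turn one TAB-separated CoNLL-U line into a Token."""
    return Token(*line.rstrip("\n").split("\t"))

def parse_feats(feats):
    """Turn 'Case=Gen|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def edges(tokens):
    """Return (head form, dependency type, dependent form) for every token."""
    forms = {tok.id: tok.form for tok in tokens}
    # HEAD 0 has no matching token, so the root gets the placeholder 'ROOT'.
    return [(forms.get(tok.head, "ROOT"), tok.deprel, tok.form) for tok in tokens]

lines = [
    "6\taloittivat\taloittaa\tVERB\t_\tMood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act\t0\troot\t_\t_",
    "7\tsuunnittelun\tsuunnittelu\tNOUN\t_\tCase=Gen|Number=Sing\t6\tdobj\t_\t_",
]
tokens = [parse_token(line) for line in lines]

print(edges(tokens))
print(parse_feats(tokens[1].feats))
```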


# Recap

* Surprisingly easy to obtain pretty detailed analysis of input text
* Looking good also for Finnish, not just English
* Much of the analysis strives to normalize the variance present in natural language

* Tools are trained from example data and are sensitive to the type of text being processed
* There is a certain error rate in the output and that is something to live with
