# Raw data

* It is easy to get **raw** text data
* But before you can do anything reasonable with it, you need to:
  * Remove junk
  * Split it into sentences and tokens
  * Analyze
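
The first two steps can be sketched in a few lines of Python. This is only a minimal illustration, assuming the "junk" is HTML-like tags and extra whitespace, and that sentences end in `.`, `!` or `?`. The function names are made up for this example, and real text (abbreviations, URLs, quotes) needs a proper tool.

```python
import re

def clean(raw):
    """Strip HTML-like tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()

def split_sentences(text):
    """Naive split: break at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence):
    """Naive tokens: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

raw = "<p>It is easy to get raw text.</p>  <p>Cleaning it   is harder!</p>"
for sentence in split_sentences(clean(raw)):
    print(tokenize(sentence))
```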
  
# Off-the-shelf pipelines

* Ready-made pipelines exist for processing text
* Use them unless you have special requirements or work in a specialized domain

## English: Stanford CoreNLP

* A set of tools for processing basic, well-formed English (and a few other languages; no Finnish support as of yet)
* The go-to solution unless you are in a very specific domain
* http://nlp.stanford.edu:8080/corenlp/ (online demo, we'll use it a lot)
* http://stanfordnlp.github.io/CoreNLP/

## Finnish: Finnish dependency parser

* A full Finnish parsing pipeline developed here in Turku
* http://bionlp-www.utu.fi/parser_demo (online demo)
* http://turkunlp.github.io/Finnish-dep-parser/

## Others

* Apache OpenNLP https://opennlp.apache.org/
* GATE https://gate.ac.uk/
* UIMA http://uima.apache.org/
* NLTK http://www.nltk.org/

# Elementary text analysis

## POS tagging

* Assign words to their linguistic categories (POS - part of speech): noun, verb, etc.
* Not an easy task for most languages:
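  * English: *Time flies like an arrow* (is *flies* a verb or a noun?)
  * Finnish: *Haetaan lakkaa satamasta, kun lakkaa satamasta* ("Let's fetch varnish from the harbor once it stops raining": *lakkaa* and *satamasta* are nouns in the first clause and verb forms in the second)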
* English tags: [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
* Finnish tags: [here](http://universaldependencies.org/fi/pos/all.html)
* Very useful if you only want e.g. the content-bearing words in a text
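
As a rough sketch of that last point, content-bearing words can be picked out by tag once the text is tagged. The `(word, tag)` pairs below are written by hand with Penn Treebank tags, and treating nouns, adjectives and verbs as "content-bearing" is an assumption made for this example.

```python
# (word, Penn Treebank tag) pairs, hand-written for illustration;
# in practice they would come from a tagger.
tagged = [
    ("The", "DT"), ("PIK-15", "NN"), ("Hinu", "NN"), ("was", "VBD"),
    ("a", "DT"), ("light", "JJ"), ("aircraft", "NN"),
    ("developed", "VBN"), ("in", "IN"), ("Finland", "NNP"), (".", "."),
]

# Assumption: nouns (NN*), adjectives (JJ*) and verbs (VB*) carry the content.
CONTENT_PREFIXES = ("NN", "JJ", "VB")

def content_words(tagged_tokens):
    """Keep only the words whose tag starts with one of the content prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(CONTENT_PREFIXES)]

print(content_words(tagged))
```

Note that this simple filter also keeps *was*, because the Penn tagset gives auxiliaries and copulas ordinary verb tags.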

## Syntactic parsing

* It is often very useful to know how the words in a sentence relate to each other
* Many languages have a relatively free word order
* Especially useful if you want to mine *relations* between known entities
* Produces the *syntactic tree* for every sentence

<img src="figs/ptree_eng_1.png"/>
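
One way to see why the tree helps with relation mining: once every token points to its head, relations can be read off by following those pointers. The sketch below assumes a hand-copied fragment of parser output as `(id, form, head, deprel)` tuples, and the function name is made up for this example.

```python
from collections import defaultdict

# (id, form, head id, relation) for a fragment of one parsed sentence.
tokens = [
    (7, "aircraft", 0, "ROOT"),
    (8, "developed", 7, "acl"),
    (9, "in", 10, "case"),
    (10, "Finland", 8, "nmod"),
    (11, "in", 13, "case"),
    (12, "the", 13, "det"),
    (13, "1960s", 8, "nmod"),
]

def nmod_relations(tokens):
    """Return (head word, case marker, dependent word) for each nmod dependent."""
    form = {tid: word for tid, word, _, _ in tokens}
    children = defaultdict(list)
    for tid, _, head, deprel in tokens:
        children[head].append((tid, deprel))
    found = []
    for tid, word, head, deprel in tokens:
        if deprel == "nmod":
            markers = [form[c] for c, rel in children[tid] if rel == "case"]
            found.append((form[head], " ".join(markers), word))
    return found

print(nmod_relations(tokens))
```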

## Named Entity recognition

* Often, we are after *named entities* and their relations
* Names of companies, places, people, dates, drug names, ...
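
Recognizers typically label each token, so multi-word names have to be assembled afterwards. A minimal sketch, assuming per-token labels where `O` means "not an entity" (the labels below are hand-written for illustration):

```python
from itertools import groupby

# (word, entity label) pairs; "O" marks tokens outside any entity.
tagged = [
    ("The", "O"), ("PIK-15", "MISC"), ("Hinu", "MISC"), ("was", "O"),
    ("developed", "O"), ("in", "O"), ("Finland", "LOCATION"),
    ("in", "O"), ("the", "DATE"), ("1960s", "DATE"),
]

def entities(tagged_tokens):
    """Merge consecutive tokens that share a non-O label into one entity."""
    spans = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            spans.append((" ".join(word for word, _ in group), label))
    return spans

print(entities(tagged))
```

This merges two adjacent entities of the same type into one; labeling schemes that mark the beginning of each entity (such as BIO) avoid that.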


## CoNLL format

* **The** format in which analyzed text is passed between the various tools
* Various versions of the format: CoNLL-X, CoNLL-09, CoNLL-U (the last will pretty much replace the others in the near future)
* Sentences delimited by empty lines
* One line per token (word or punctuation)
* TAB-delimited columns containing per-token analysis

```
1	The	the	DT	O	3	det
2	PIK-15	pik-15	NN	MISC	3	compound
3	Hinu	hinu	NN	MISC	7	nsubj
4	was	be	VBD	O	7	cop
5	a	a	DT	O	7	det
6	light	light	JJ	O	7	amod
7	aircraft	aircraft	NN	O	0	ROOT
8	developed	develop	VBN	O	7	acl
9	in	in	IN	O	10	case
10	Finland	Finland	NNP	LOCATION	8	nmod
11	in	in	IN	O	13	case
12	the	the	DT	DATE	13	det
13	1960s	1960	NNS	DATE	8	nmod
14	for	for	IN	O	15	case
15	use	use	NN	O	13	nmod
16	as	as	IN	O	19	case
17	a	a	DT	O	19	det
18	glider	glider	NN	O	19	compound
19	tug	tug	NN	O	8	nmod
20	.	.	.	O	_	_
```
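
Reading such a file takes little code: split on empty lines for sentences and on TABs for columns. The sketch below uses a made-up two-sentence sample with the same seven columns as the example above; reading them as ID, FORM, LEMMA, POS, NER, HEAD and DEPREL is inferred from the values, not taken from a specification.

```python
def read_sentences(lines):
    """Yield each sentence as a list of tokens, each token a list of columns."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:
        # The last sentence may not be followed by an empty line.
        yield sentence

# Made-up sample: two short sentences separated by an empty line.
sample = (
    "1\tThe\tthe\tDT\tO\t2\tdet\n"
    "2\taircraft\taircraft\tNN\tO\t0\tROOT\n"
    "\n"
    "1\tIt\tit\tPRP\tO\t2\tnsubj\n"
    "2\tflies\tfly\tVBZ\tO\t0\tROOT\n"
)

sentences = list(read_sentences(sample.splitlines()))
for sent in sentences:
    print([cols[1] for cols in sent])  # column 2 holds the word form
```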

### CoNLL-U format columns

Copied from http://universaldependencies.org/format.html

1. ID: Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem of word form.
4. UPOSTAG: [Universal part-of-speech tag](http://universaldependencies.org/u/pos/index.html)
5. XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
6. FEATS: List of morphological features from the [universal feature inventory](http://universaldependencies.org/u/feat/index.html) or from a defined [language-specific extension](http://universaldependencies.org/ext-feat-index.html); underscore if not available.
7. HEAD: Head of the current token, which is either a value of ID or zero (0).
8. DEPREL: [Universal Stanford dependency relation](http://universaldependencies.org/u/dep/index.html) to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9. DEPS: List of secondary dependencies (head-deprel pairs).
10. MISC: Any other annotation.
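
The ten columns map naturally onto a named tuple. A minimal sketch: the `Token` type, the function names and the sample line are made up for this example, and the `Name=Value|Name=Value` layout of FEATS follows the CoNLL-U specification linked above.

```python
from collections import namedtuple

# One field per CoNLL-U column, in order.
Token = namedtuple(
    "Token", "id form lemma upostag xpostag feats head deprel deps misc"
)

def parse_token(line):
    """Split one TAB-delimited CoNLL-U token line into a Token."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 columns, got %d" % len(cols))
    return Token(*cols)

def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Hand-written sample line, for illustration only.
line = "\t".join(
    ["1", "Koira", "koira", "NOUN", "N", "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]
)
token = parse_token(line)
print(token.form, token.upostag, parse_feats(token.feats))
```

All fields stay strings here, since ID may be a range rather than a single integer.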


