# Named Entity Recognition

* Recognize named entities (places, people, events, companies, etc.) in text
* It is a classification task, not a simple dictionary lookup problem
    * Why? The list of entities is open-ended and never complete
    * Presence in a dictionary is still a good feature for the classifier
    
# NER as classification

* Can't reasonably classify every possible text (sub-)sequence
* Instead, classify individual tokens
* *BIO* coding is the most popular scheme:
    * A token can **B**egin an entity, be **I**nside an entity, or be **O**utside an entity
    * The **B** and **I** classes are usually combined with the entity type, e.g. **B-org**, **I-org**
* With this encoding, NER could be a very simple multiclass classification task. In the sample data below,
  every token can belong to one of these five classes: **B-org**, **I-org**, **B-pro**, **I-pro**, **O**
* You can try to train a normal classifier on this data and see what happens
    * Features: the word itself, POS tags, words before and after, word shape (capitalization, etc.) - whatever you find useful
* These are all individual decisions on the tokens:
    * **Independent of each other**
    * Prediction errors produce invalid tag sequences that you must deal with: I without B, B-org followed by I-pro, etc.
    * You have to resolve these somehow, e.g. with a simple post-processing rule
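
One simple way to deal with such invalid sequences is a post-processing pass over the predicted tags. The sketch below shows only one possible policy and is not the method used later in this notebook: it assumes that an **I** tag which does not continue an entity of the same type should be promoted to **B**. The function name `repair_bio` is hypothetical.

```python
def repair_bio(tags):
    """Return a copy of `tags` in which every I-x continues a B-x or I-x.

    Assumed policy: an I-x tag that does not follow B-x or I-x is
    rewritten to B-x, i.e. it starts a new entity.
    """
    repaired = []
    prev_type = None  # entity type of the previous token, None if outside
    for tag in tags:
        if tag == "O":
            repaired.append(tag)
            prev_type = None
            continue
        prefix, ent_type = tag.split("-", 1)
        if prefix == "I" and prev_type != ent_type:
            tag = "B-" + ent_type
        repaired.append(tag)
        prev_type = ent_type
    return repaired


print(repair_bio(["O", "I-org", "I-org", "B-org", "I-pro", "O"]))
```

Other policies (dropping the stray tag, or relabelling it with the type of the preceding entity) are just as easy to write; which one works best is an empirical question.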
    
## NER data

* Need annotated data to train
* Lots of publicly available datasets for various languages and domains out there
* Finnish: https://github.com/mpsilfve/finer-data
    * Looks like this: https://github.com/mpsilfve/finer-data/blob/master/digitoday/ner_train_data_annotated/tietoturva_section/1.csv
    * Needs to be turned into something like this:

```
B-org   Nokia
O       ja
B-org   Continental
O       kehittävät
O       erittäin
O       tarkkaa
O       karttateknologiaa
B-pro   Electronic
I-pro   Horizon
I-pro   -alustalle
O       ,
O       jonka
O       on
O       tarkoitus
O       pystyä
O       jatkuvasti
O       paikantamaan
```
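
To read entities back out of data in this format, consecutive **B**/**I** tags are grouped into spans. A minimal sketch on a shortened version of the sample above; `bio_to_spans` is a hypothetical helper and it assumes the tags are already well-formed.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, [tokens]) spans."""
    spans = []
    current = None  # the span being built, or None when outside an entity
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current is not None:
            current[1].append(token)
        else:  # "O" (or a stray I- tag) closes the current span
            current = None
    return spans


tokens = ["Nokia", "ja", "Continental", "Electronic", "Horizon", "-alustalle", ","]
tags = ["B-org", "O", "B-org", "B-pro", "I-pro", "I-pro", "O"]
for ent_type, words in bio_to_spans(tokens, tags):
    print("%s\t%s" % (ent_type, " ".join(words)))
```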
  
# Sequence classification

* Individual decisions on tokens do not take into account dependencies between classes
* These are exactly the sort of "*I must be preceded by B or I of the same class*" restrictions
    * But also softer, probabilistic constraints
* Taking class dependencies into account gives a better model (hopefully :)

## Hidden Markov Models (HMM)

* The classic sequence classifier
* Assume an underlying "hidden" sequence of class labels, which generates the visible sequence of words
* Model the probability of a label following another one + a label producing a word
    * P(I-pro|B-pro)
    * P(Nokia|B-org)
* These can be obtained by counting in the training data
* Decoding: Viterbi algorithm - efficient polynomial algorithm to find the best hidden sequence of labels for the observed data (the sentence)
* Restricted in its modelling capabilities by the generative approach it takes
    * These two probabilities are pretty much all we've got to play with
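
The counting and decoding steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production tagger: the training sentences are made up from the sample data, add-one smoothing is my own choice for handling unseen events, and the names `train_hmm`, `log_prob` and `viterbi` are hypothetical.

```python
import collections
import math


def train_hmm(tagged_sents):
    """Count label->label transitions and label->word emissions."""
    trans = collections.defaultdict(collections.Counter)
    emit = collections.defaultdict(collections.Counter)
    vocab = set()
    for sent in tagged_sents:
        prev = "<s>"  # artificial start-of-sentence label
        for word, label in sent:
            trans[prev][label] += 1
            emit[label][word] += 1
            vocab.add(word)
            prev = label
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log P(key | context) from a Counter of counts."""
    return math.log((counter[key] + 1.0) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable label sequence for `words`."""
    labels = sorted(emit)
    n_words = len(vocab) + 1  # +1 leaves probability mass for unseen words
    # best[i][label] = (score of the best path ending in label at i, backpointer)
    best = [{}]
    for lab in labels:
        score = (log_prob(trans["<s>"], lab, len(labels))
                 + log_prob(emit[lab], words[0], n_words))
        best[0][lab] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for lab in labels:
            e = log_prob(emit[lab], words[i], n_words)
            best[i][lab] = max(
                (best[i - 1][p][0] + log_prob(trans[p], lab, len(labels)) + e, p)
                for p in labels)
    # follow the backpointers from the best final label
    lab = max(labels, key=lambda l: best[-1][l][0])
    path = [lab]
    for i in range(len(words) - 1, 0, -1):
        lab = best[i][lab][1]
        path.append(lab)
    return path[::-1]


train = [
    [("Nokia", "B-org"), ("ja", "O"), ("Continental", "B-org")],
    [("Electronic", "B-pro"), ("Horizon", "I-pro"), ("on", "O")],
    [("Nokia", "B-org"), ("on", "O")],
]
trans, emit, vocab = train_hmm(train)
print(viterbi(["Continental", "ja", "Nokia"], trans, emit, vocab))
```

The decoder only ever looks at `trans` and `emit`, which is the modelling restriction mentioned above: there is no place to plug in capitalization, POS tags or neighbouring words.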

## Conditional Random Fields (CRF)

I won't go into any real details here. If you want to know more about the inner workings of CRFs and the way they're trained, check out one of the many tutorials out there, like [this one](http://www.cs.upc.edu/~aquattoni/AllMyPapers/crf_tutorial_talk.pdf).

* The go-to sequence classifier
* Does not model in a generative manner like HMMs do:
    * Arbitrary features, not just the HMM-style conditional probabilities
    * The model learns weights for these features, much like an SVM would
    * Anything you like from the input sequence can be turned into a feature
    * In linear-chain CRFs, pairs of adjacent tags also enter the equation, so each tag is scored together with its previous (and next) tag
* Trained in an iterative fashion; the log-likelihood objective of a standard linear-chain CRF is convex, so there are no local optima to get stuck in
* Decoded in much the same way as HMMs - an efficient polynomial algorithm finds the best sequence of labels

* From a practical point of view:
    * [CRFsuite](http://www.chokkan.org/software/crfsuite/) is a good general-purpose CRF training tool
    * [NERSuite](http://nersuite.nlplab.org/) is a driver script for *CRFsuite* with predefined features tuned for the NER task
    * CoreNLP also has a NER annotator (remember we played with it in one of the first lectures)
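
To make the "weights for arbitrary features plus tag transitions" idea concrete, here is a toy scoring function in the spirit of a linear-chain CRF. Everything in it is an assumption for illustration: the weights are set by hand (a real CRF learns them), the score is left unnormalized, and decoding is done by brute force instead of Viterbi. The names `sequence_score` and `best_sequence` are hypothetical.

```python
import itertools

LABELS = ["B-pro", "I-pro", "O"]

# Hand-set, purely illustrative weights; a trained CRF would learn these.
STATE_W = {                      # (input feature, label) -> weight
    ("isupper", "B-pro"): 1.0,
    ("isupper", "I-pro"): 1.0,
    ("pos=VERB", "O"): 2.0,
}
TRANS_W = {                      # (previous label, label) -> weight
    ("B-pro", "I-pro"): 1.5,
    ("<s>", "I-pro"): -5.0,      # I at the start of a sentence
    ("O", "I-pro"): -5.0,        # I without a preceding B
}


def sequence_score(features, labels):
    """Unnormalized score of one label sequence for one sentence."""
    score, prev = 0.0, "<s>"
    for feats, label in zip(features, labels):
        score += TRANS_W.get((prev, label), 0.0)
        score += sum(STATE_W.get((feat, label), 0.0) for feat in feats)
        prev = label
    return score


def best_sequence(features):
    """Brute-force argmax over all label sequences (toy inputs only)."""
    candidates = itertools.product(LABELS, repeat=len(features))
    return max(candidates, key=lambda labels: sequence_score(features, labels))


features = [["isupper", "pos=PROPN"], ["isupper", "pos=PROPN"], ["pos=VERB"]]
print(best_sequence(features))
```

Note how the "I without B" problem is handled by negative transition weights rather than by a separate repair step: invalid sequences simply score badly.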


# Simple NER pipeline for Finnish

## Parsed training data

* Contains the NER label and selected columns from the CoNLL-U parse: word, lemma, POS tag, morphological features, dependency relation

```
B-org   Nokia   Nokia   PROPN   Case=Nom|Number=Sing    nsubj
O       ja      ja      CONJ    _       cc
B-org   Continental     Continental     PROPN   Case=Nom|Number=Sing    conj
O       kehittävät      kehittää        VERB    Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act root
O       erittäin        erittäin        ADV     _       advmod
O       tarkkaa tarkka  ADJ     Case=Par|Degree=Pos|Number=Sing amod
O       karttateknologiaa       kartta#teknologia       NOUN    Case=Par|Number=Sing    dobj
B-pro   Electronic      Electronic      PROPN   _       name
I-pro   Horizon Horizon PROPN   Case=Gen|Number=Sing    nmod:poss
I-pro   -alustalle      alusta  NOUN    Case=All|Number=Sing    nmod
O       ,       ,       PUNCT   _       punct
O       jonka   joka    PRON    Case=Gen|Number=Sing|PronType=Rel       nsubj
O       on      olla    VERB    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act cop
O       tarkoitus       tarkoitus       NOUN    Case=Nom|Number=Sing    acl:relcl
O       pystyä  pystyä  VERB    InfForm=1|Number=Sing|VerbForm=Inf|Voice=Act    xcomp:ds
O       jatkuvasti      jatkuvasti      ADV     _       advmod
O       paikantamaan    paikantaa       VERB    Case=Ill|InfForm=3|Number=Sing|VerbForm=Inf|Voice=Act   xcomp
```

In [17]:
import codecs
import collections

def read_data(f):
    """Yield sentences as lists of lines; sentences are separated by blank lines."""
    sent=[]
    for line in f:
        line=line.strip()
        if not line:
            if sent:
                yield sent
                sent=[]
        else:
            sent.append(line)
    if sent: # last sentence, if the file does not end with a blank line
        yield sent

labels=[]
count=0
with codecs.open(u"/home/jmnybl/NER/train-full.conllu",u"rt",u"utf-8") as f:
    for sent in read_data(f):
        for line in sent:
            label,word=line.split(u"\t")[:2]
            labels.append(label)
        count+=1

print "Training data size:", count, "sentences,", len(labels), "examples,", len(set(labels)), "classes"
print

# class distribution, most frequent first
counter=collections.Counter(labels)
for key in sorted(counter, key=counter.get, reverse=True):
    print counter[key],key
print

Training data size: 14796 sentences, 198830 examples, 15 classes

170459 O
9099 B-org
4909 B-pro
3725 I-pro
3120 I-org
2198 B-per
2042 B-loc
1219 I-per
764 B-misc
619 B-tit
262 I-misc
163 I-loc
101 I-event
88 B-event
62 I-tit



## Feature generation

* Simple features:
    * current token: word, character n-grams, POS, morphology, dependency type, uppercased, is first/last token
    * previous/next token: word, POS
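
Character n-grams are the least obvious item in this list, so here is what they look like for a single word. The notebook itself gets them from scikit-learn's character analyzer; the function below is a plain-Python stand-in (the name `char_ngrams` is hypothetical) that should produce the same 2- to 4-grams for a single token without whitespace.

```python
def char_ngrams(word, n_min=2, n_max=4):
    """All character n-grams of `word`, shortest first, left to right."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


print(char_ngrams("Winamp"))
print(char_ngrams("sai"))
print(char_ngrams(","))
```

Substrings like these let the classifier generalize over inflected forms and unseen product names that share prefixes or suffixes with known ones.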

In [18]:
def create_features(i,sent,analyzer):
    """Return the list of feature strings for token i of sentence sent."""
    cols=sent[i].split(u"\t")
    label,word=cols[0],cols[1]
    feats=analyzer(word) # character n-grams
    feats.append(u"word="+word)
    feats.append(u"pos="+cols[3])
    feats.append(u"deprel="+cols[5])
    if cols[4]!=u"_": # morphological tags, one feature per tag
        for fe in cols[4].split(u"|"):
            feats.append(fe)
    if word[0].isupper(): # first character is uppercase
        feats.append(u"isupper")

    if i!=0: # take previous token
        prev=sent[i-1].split(u"\t")
        feats.append(u"preword="+prev[1])
        feats.append(u"prepos="+prev[3])
        # pre and current pos
        feats.append(u"prethis="+prev[3]+cols[3])
    else:
        feats.append(u"firsttoken")
    if i<len(sent)-1: # take next token
        nxt=sent[i+1].split(u"\t")
        feats.append(u"nextword="+nxt[1])
        feats.append(u"nextpos="+nxt[3])
        # next and current pos (note the order: the next token's pos comes first)
        feats.append(u"thisnext="+nxt[3]+cols[3])
    else:
        feats.append(u"lasttoken")

    return feats

import sklearn.feature_extraction
vectorizer=sklearn.feature_extraction.text.TfidfVectorizer(analyzer='char',ngram_range=(2,4),lowercase=False)
analyzer=vectorizer.build_analyzer()

f=codecs.open(u"/home/jmnybl/NER/train-full.conllu",u"rt",u"utf-8")
labels=[]
examples=[]
for sent in read_data(f):
    for i,line in enumerate(sent):
        label,word=line.split(u"\t")[:2]
        labels.append(label)
        examples.append(create_features(i,sent,analyzer))
    examples.append(None) # sentence boundary
    labels.append(None)
f.close()

print "First example featurized:"
print labels[0],examples[0]

First example featurized:
B-pro [u'Wi', u'in', u'na', u'am', u'mp', u'Win', u'ina', u'nam', u'amp', u'Wina', u'inam', u'namp', u'word=Winamp', u'pos=PROPN', u'deprel=nsubj', u'Case=Nom', u'Number=Sing', u'isupper', u'firsttoken', u'nextword=pysyy', u'nextpos=VERB', u'thisnext=VERBPROPN']


## Save featurized data for crf

* scikit-learn does not include a CRF implementation, so we save the featurized data to a file and run CRFsuite from the terminal
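
The file CRFsuite reads is plain text: one token per line with the label first and then the features, all tab-separated, and an empty line between sentences. A minimal in-memory sketch of that format follows; `to_crfsuite_line` and `to_crfsuite` are hypothetical helper names, and replacing `:` with `_` simply mirrors the choice made in this notebook (it matters for values such as the `nmod:poss` dependency type).

```python
def to_crfsuite_line(label, feats):
    """One token: the label, then its features, tab-separated."""
    # ":" has a special meaning in CRFsuite input, so it is replaced here
    safe = [feat.replace(":", "_") for feat in feats]
    return "\t".join([label] + safe)


def to_crfsuite(sentences):
    """Sentences (lists of (label, feats) pairs) separated by empty lines."""
    blocks = ["\n".join(to_crfsuite_line(label, feats) for label, feats in sent)
              for sent in sentences]
    return "\n\n".join(blocks) + "\n"


sentences = [
    [("B-pro", ["word=Electronic", "deprel=name"]),
     ("I-pro", ["word=Horizon", "deprel=nmod:poss"])],
    [("O", ["word=jonka", "deprel=nsubj"])],
]
print(to_crfsuite(sentences))
```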

In [24]:
ffile=codecs.open(u"train-data.featurized",u"wt",u"utf-8")
for example,label in zip(examples,labels):
    if example is None: # add empty line, new sentence starts
        assert label is None
        ffile.write(u"\n")
        continue
    feat=u"\t".join(example)
    feat=feat.replace(u":",u"_") # ":" is a special character in crfsuite input, so replace it
    ffile.write(label+u"\t"+feat+u"\n")
ffile.close()
print "Saved to train-data.featurized"
print

# just checking the file looks ok
with codecs.open(u"train-data.featurized",u"rt",u"utf-8") as f:
    for line in f.readlines()[:5]:
        print line

Saved to train-data.featurized

B-pro	Wi	in	na	am	mp	Win	ina	nam	amp	Wina	inam	namp	word=Winamp	pos=PROPN	deprel=nsubj	Case=Nom	Number=Sing	isupper	firsttoken	nextword=pysyy	nextpos=VERB	thisnext=VERBPROPN

O	py	ys	sy	yy	pys	ysy	syy	pysy	ysyy	word=pysyy	pos=VERB	deprel=root	Mood=Ind	Number=Sing	Person=3	Tense=Pres	VerbForm=Fin	Voice=Act	preword=Winamp	prepos=PROPN	prethis=PROPNVERB	nextword=hengissä	nextpos=ADV	thisnext=ADVVERB

O	he	en	ng	gi	is	ss	sä	hen	eng	ngi	gis	iss	ssä	heng	engi	ngis	giss	issä	word=hengissä	pos=ADV	deprel=advmod	preword=pysyy	prepos=VERB	prethis=VERBADV	nextword=,	nextpos=PUNCT	thisnext=PUNCTADV

O	word=,	pos=PUNCT	deprel=punct	preword=hengissä	prepos=ADV	prethis=ADVPUNCT	nextword=sai	nextpos=VERB	thisnext=VERBPUNCT

O	sa	ai	sai	word=sai	pos=VERB	deprel=conj	Mood=Ind	Number=Sing	Person=3	Tense=Past	VerbForm=Fin	Voice=Act	preword=,	prepos=PUNCT	prethis=PUNCTVERB	nextword=uuden	nextpos=ADJ	thisnext=ADJVERB



## Train crfsuite

In [20]:
%%bash

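# split training data into chunks of 20000 lines (files x00, x01, ...)
# note: a fixed line count can cut a sentence in two at a chunk boundary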
cat train-data.featurized | split -l 20000 -d
ls -la x*

# train crfsuite
# -a training algorithm: lbfgs
# -p algorithm parameter: at most 80 iterations
# -m save model to ner.model
# -e 5: the fifth file (x04) is held out for evaluation
crfsuite learn -a lbfgs -p max_iterations=80 -m ner.model -e 5 -l x00 x01 x02 x03 x04 x05 x06 x07 x08 x09 x10

ls -lh ner.model

-rw-r--r-- 1 ginter nlp 3921507 Mar 31 00:27 x00
-rw-r--r-- 1 ginter nlp 3881898 Mar 31 00:27 x01
-rw-r--r-- 1 ginter nlp 3879186 Mar 31 00:27 x02
-rw-r--r-- 1 ginter nlp 3858984 Mar 31 00:27 x03
-rw-r--r-- 1 ginter nlp 3950577 Mar 31 00:27 x04
-rw-r--r-- 1 ginter nlp 3982269 Mar 31 00:27 x05
-rw-r--r-- 1 ginter nlp 3935222 Mar 31 00:27 x06
-rw-r--r-- 1 ginter nlp 3970474 Mar 31 00:27 x07
-rw-r--r-- 1 ginter nlp 3767168 Mar 31 00:27 x08
-rw-r--r-- 1 ginter nlp 3912690 Mar 31 00:27 x09
-rw-r--r-- 1 ginter nlp 2684240 Mar 31 00:27 x10
CRFSuite 0.12  Copyright (c) 2007-2011 Naoaki Okazaki

Holdout group: 5

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 224429
Seconds required: 1.500

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 80
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iter
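
Because `split -l 20000` counts lines rather than sentences, a chunk boundary can fall in the middle of a sentence. A sentence-aware split is one way around that. The sketch below is an assumption about how one might do it, not what was run here, and `chunk_sentences` is a hypothetical name.

```python
def chunk_sentences(lines, sents_per_chunk):
    """Yield lists of lines; every chunk holds whole sentences only."""
    chunk, n_sents = [], 0
    for line in lines:
        chunk.append(line)
        if not line.strip():              # an empty line closes a sentence
            n_sents += 1
            if n_sents == sents_per_chunk:
                yield chunk
                chunk, n_sents = [], 0
    if chunk:                             # leftover sentences
        yield chunk


lines = ["B-org\tword=HP\n", "O\tword=julkisti\n", "\n",
         "O\tword=Se\n", "\n",
         "B-pro\tword=Winamp\n", "\n"]
for n, chunk in enumerate(chunk_sentences(lines, 2)):
    print("chunk %d: %d lines" % (n, len(chunk)))
```

Each chunk would then be written to its own file and passed to `crfsuite learn` exactly as before.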

## Use the trained model to tag text from parsebank

* We want to find named entities in the parsebank data

In [21]:
# the first lines of the data we are going to tag
with codecs.open(u"pb.test.conllu",u"rt",u"utf-8") as f:
    for line in f.readlines()[:10]:
        print line.strip()

1	HP	HP	PROPN	Case=Nom|Number=Sing	nsubj
2	julkisti	julkistaa	VERB	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	root
3	myös	myös	ADV	_	advmod
4	toisen	toinen	ADJ	Case=Gen|Number=Sing|NumType=Ord	nummod
5	sukupolven	suku#polvi	NOUN	Case=Gen|Number=Sing	nmod:poss
6	version	versio	NOUN	Case=Gen|Number=Sing	dobj
7	maksuttomasta	maksuton	ADJ	Case=Ela|Degree=Pos|Number=Sing	amod
8	verkkopalvelustaan	verkko#palvelu	NOUN	Case=Ela|Number=Sing|Person[psor]=3	nmod
9	.	.	PUNCT	_	punct



## crfsuite tagging

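* Remember to featurize this data too, with the same features as in training
    * The first column now holds the token index instead of the NER label; the other columns are in the same positions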

In [22]:
%%bash

cat pb.test.conllu | python /home/jmnybl/NER/featurize.py > pb.test.featurized

crfsuite tag -m ner.model pb.test.featurized > predicted.labels

# combine predictions and original text
paste predicted.labels pb.test.conllu > predicted.txt

ls -lh predicted*

-rw-r--r-- 1 ginter nlp  22K Mar 31 00:28 predicted.labels
-rw-r--r-- 1 ginter nlp 490K Mar 31 00:28 predicted.txt


## Let's have a look at the output

* Print sentences with named entities

In [23]:
def print_sent(sent):
    for line in sent:
        print u"\t".join(line.strip().split(u"\t")[:5]) # first five columns only, to make it look prettier
    print

# print the first six sentences that contain at least one non-O label
i=0
with codecs.open(u"predicted.txt",u"rt",u"utf-8") as f:
    for sent in read_data(f):
        for line in sent:
            if line.strip().split(u"\t")[0]!=u"O":
                print_sent(sent)
                i+=1
                break
        if i>5:
            break

B-org	1	HP	HP	PROPN
O	2	julkisti	julkistaa	VERB
O	3	myös	myös	ADV
O	4	toisen	toinen	ADJ
O	5	sukupolven	suku#polvi	NOUN
O	6	version	versio	NOUN
O	7	maksuttomasta	maksuton	ADJ
O	8	verkkopalvelustaan	verkko#palvelu	NOUN
O	9	.	.	PUNCT

B-pro	1	HP	HP	PROPN
I-pro	2	Designjet	Designjet	PROPN
I-pro	3	ePrint	ePrint	PROPN
I-pro	4	&	&	PROPN
I-pro	5	Share	Share	PROPN
I-pro	6	-palvelulla	palvelu	NOUN
O	7	on	olla	VERB
O	8	helppo	helppo	ADJ
O	9	käyttää	käyttää	VERB
O	10	ja	ja	CONJ
O	11	tulostaa	tulostaa	VERB
O	12	suurikokoisia	suuri#kokoinen	ADJ
O	13	asiakirjoja	asia#kirja	NOUN
B-pro	14	iOS-	iOS-	NOUN
O	15	tai	tai	CONJ
B-pro	16	Android-tabletilla	Android-tabletilla	NOUN
O	17	tai	tai	CONJ
O	18	-älypuhelimella	äly#puhelin	NOUN
O	19	,	,	PUNCT
O	20	kannettavalla	kantaa	VERB
O	21	tietokoneella	tieto#kone	NOUN
O	22	tai	tai	CONJ
O	23	ePrinter-tulostimen	ePrinter-tulostimen	NOUN
O	24	kosketusnäytöllä	kosketus#näyttö	NOUN
O	25	.	.	PUNCT

O	1	Se	se	PRON
O	2	edellyttää	edellyttää	VERB
O	3	alan	ala	NOUN
O	4	työn