# Named Entity Recognition

* Recognize named entities (places, people, events, companies, etc.) in text
* It is a classification task, not a simple dictionary lookup problem
    * Why? The list of entities is open-ended and never complete
    * Presence in a dictionary is still a good feature
    
# NER as classification

* Classifying arbitrary text (sub-)sequences directly is impractical
* Instead, classify individual tokens
* *BIO* coding most popular:
    * A token can **B**egin an entity, be **I**nside an entity, or be **O**utside an entity
    * The **B** and **I** classes usually carry the entity type as well (e.g. **B-org**, **I-org**)
* With this encoding, NER becomes an ordinary multiclass classification task. In the data below,
  every token belongs to one of these five classes: **B-org**, **I-org**, **B-pro**, **I-pro**, **O**
* You can try to train a normal classifier on this data and see what happens
    * Features: the word itself, POS tags, words before and after, word shape (capitalization, etc.) - whatever you find useful
* These are all individual decisions on the tokens:
    * **Independent of each other**
    * They produce inconsistent label sequences that you must deal with: I without a preceding B, B-org followed by I-pro, etc.
    * In practice you repair these in a post-processing step, e.g. by rewriting a stray I as a B
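
One possible way to handle such inconsistent sequences is sketched below. The repair policy (turn any I tag that does not continue an entity of the same type into a B tag) and the function name `repair_bio` are choices made for this illustration, not something the data or the tools prescribe.

```python
def repair_bio(tags):
    """Return a copy of `tags` in which every I- tag legally continues an entity.

    Policy assumed here: an I-X that does not follow B-X or I-X
    is rewritten to B-X, i.e. it starts a new entity.
    """
    repaired = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev not in ("B-" + etype, "I-" + etype):
                tag = "B-" + etype
        repaired.append(tag)
        prev = tag  # compare the next tag against the repaired one
    return repaired

print(repair_bio(["O", "I-org", "I-org", "B-org", "I-pro", "O"]))
# ['O', 'B-org', 'I-org', 'B-org', 'B-pro', 'O']
```

Other policies are just as defensible, for example dropping the stray I tag to O; which one works better is an empirical question.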
    
## NER data

* Need annotated data to train
* Lots of publicly available datasets for various languages and domains out there
* Finnish: https://github.com/mpsilfve/finer-data
    * Looks like this: https://github.com/mpsilfve/finer-data/blob/master/digitoday/ner_train_data_annotated/tietoturva_section/1.csv
    * Needs to be turned into something like this:

```
B-org   Nokia
O       ja
B-org   Continental
O       kehittävät
O       erittäin
O       tarkkaa
O       karttateknologiaa
B-pro   Electronic
I-pro   Horizon
I-pro   -alustalle
O       ,
O       jonka
O       on
O       tarkoitus
O       pystyä
O       jatkuvasti
O       paikantamaan
```
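
To see what the BIO labels encode, here is a minimal sketch that groups tagged tokens back into entities. The function name `bio_to_spans` and the decision to treat a stray I tag as the start of an entity are assumptions of this example.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the running entity
            continue
        # any other tag ends the running entity, if there is one
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            # B-X starts an entity; a stray I-X is treated as a start too (assumption)
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Nokia", "ja", "Continental", "kehittävät", "Electronic", "Horizon", "-alustalle", ","]
tags = ["B-org", "O", "B-org", "O", "B-pro", "I-pro", "I-pro", "O"]
print(bio_to_spans(tokens, tags))
# [('org', 'Nokia'), ('org', 'Continental'), ('pro', 'Electronic Horizon -alustalle')]
```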
  
# Sequence classification

* Individual decisions on tokens do not take into account dependencies between classes
* Exactly the sort of "*I must be preceded by B or I of same class*" restrictions
    * But also less hard, probabilistic constraints
* Taking into account class dependencies gives a better model (hopefully :)

## Hidden Markov Models (HMM)

* The classic sequence classifier
* Assume an underlying "hidden" sequence of class labels, which generates the visible sequence of words
* Model the probability of a label following another one + a label producing a word
    * P(I-pro|B-pro)
    * P(Nokia|B-org)
* These can be obtained by counting in the training data
* Decoding: Viterbi algorithm, an efficient (polynomial-time) dynamic programming algorithm that finds the best hidden sequence of labels for the observed data (the sentence)
* Restricted in its modelling capabilities by the generative approach it takes
    * These two probabilities are pretty much all we've got to play with
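
The counting and the Viterbi decoding can be sketched in a few lines. This is a toy illustration only: the two training sentences are made up from the sample above, the artificial start label `<s>` and the add-alpha smoothing (needed so that unseen words and transitions do not get zero probability) are assumptions of this sketch, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count label->label transitions and label->word emissions."""
    trans = defaultdict(Counter)   # trans[prev_label][label]
    emit = defaultdict(Counter)    # emit[label][word]
    for sent in tagged_sentences:
        prev = "<s>"               # artificial start-of-sentence label
        for word, label in sent:
            trans[prev][label] += 1
            emit[label][word] += 1
            prev = label
    return trans, emit

def log_prob(counter, key, n_outcomes, alpha=0.1):
    """Add-alpha smoothed log probability estimated from counts."""
    return math.log((counter[key] + alpha) / (sum(counter.values()) + alpha * n_outcomes))

def viterbi(words, trans, emit, labels, vocab_size):
    """Return the most probable label sequence for `words`."""
    # best[i][label] = (log prob of the best path ending in label at position i, previous label)
    best = [{}]
    for lab in labels:
        score = log_prob(trans["<s>"], lab, len(labels)) + log_prob(emit[lab], words[0], vocab_size)
        best[0][lab] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for lab in labels:
            e = log_prob(emit[lab], words[i], vocab_size)
            best[i][lab] = max(
                (best[i - 1][p][0] + log_prob(trans[p], lab, len(labels)) + e, p)
                for p in labels
            )
    # follow the backpointers from the best final label
    lab = max(labels, key=lambda l: best[-1][l][0])
    path = [lab]
    for i in range(len(words) - 1, 0, -1):
        lab = best[i][lab][1]
        path.append(lab)
    return path[::-1]

train = [
    [("Nokia", "B-org"), ("ja", "O"), ("Continental", "B-org"), ("kehittävät", "O")],
    [("Electronic", "B-pro"), ("Horizon", "I-pro"), ("on", "O"), ("alusta", "O")],
]
trans, emit = train_hmm(train)
labels = sorted(emit)
vocab_size = len({w for sent in train for w, _ in sent}) + 1  # +1 for unseen words
print(viterbi(["Electronic", "Horizon", "ja", "Nokia"], trans, emit, labels, vocab_size))
# ['B-pro', 'I-pro', 'O', 'B-org']
```

Each position only looks at the best scores of the previous position, so the running time grows linearly with sentence length and quadratically with the number of labels.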

## Conditional Random Fields (CRF)

I won't go into any real details here, you can check out one of the many tutorials out there if you want to know more about the inner workings of CRFs and the way they're trained. Like [this one](http://www.cs.upc.edu/~aquattoni/AllMyPapers/crf_tutorial_talk.pdf).

* The go-to sequence classifier
* Does not model in a generative manner like HMMs do:
    * Arbitrary features, not just the HMM-style conditional probabilities
    * The model learns weights for these features, much like an SVM would
    * Anything you like from the input sequence can be turned into a feature
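    * In linear-chain CRFs, the current tag and its neighbouring tags (previous and next) also enter the equation
* Trained in an iterative fashion (e.g. with L-BFGS, as in the training step below); for a standard linear-chain CRF the training objective is convex, so the optimizer does not get stuck in a local optimum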
* Decoded in much the same way as HMMs - efficient polynomial algorithm to find the best sequence of labels

* From a practical point of view:
    * [CRFsuite](http://www.chokkan.org/software/crfsuite/) is a good general CRF training software
    * [NERSuite](http://nersuite.nlplab.org/) is a driver script for *CRFsuite* with predefined features tuned for the NER task
    * CoreNLP also has a NER annotator (remember we played with it in one of the first lectures)
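
To make the "features with learned weights" idea a bit more concrete, here is a rough sketch of how a linear-chain CRF scores one candidate labelling: it adds up the weights of the (feature, tag) pairs and the (previous tag, tag) pairs that fire. The weights below are set by hand purely for illustration (a trained model learns them), the function name `sequence_score` is hypothetical, and the normalization over all possible labellings is left out.

```python
def sequence_score(feature_lists, tags, state_w, trans_w):
    """Unnormalized linear-chain CRF score of one labelling.

    state_w[(feature, tag)] and trans_w[(prev_tag, tag)] hold the weights;
    anything missing from the dictionaries counts as 0.
    """
    score = 0.0
    prev = None
    for feats, tag in zip(feature_lists, tags):
        score += sum(state_w.get((f, tag), 0.0) for f in feats)
        if prev is not None:
            score += trans_w.get((prev, tag), 0.0)
        prev = tag
    return score

# hand-set, purely illustrative weights
state_w = {("isupper", "B-pro"): 1.0, ("isupper", "I-pro"): 0.5, ("pos=PUNCT", "O"): 2.0}
trans_w = {("B-pro", "I-pro"): 1.5, ("O", "I-pro"): -3.0}

feats = [["word=Electronic", "isupper"], ["word=Horizon", "isupper"], ["word=,", "pos=PUNCT"]]
print(sequence_score(feats, ["B-pro", "I-pro", "O"], state_w, trans_w))  # 5.0
print(sequence_score(feats, ["O", "I-pro", "O"], state_w, trans_w))      # -0.5
```

The negative weight on the O to I-pro transition is how a model of this kind can learn to avoid an I without a preceding B; decoding then searches for the labelling with the highest total score.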


# Simple NER pipeline for Finnish

## Parsed training data

* Contains the NER label followed by selected columns from the CoNLL-U parse: word, lemma, POS, morphological features, dependency relation

```
B-org   Nokia   Nokia   PROPN   Case=Nom|Number=Sing    nsubj
O       ja      ja      CONJ    _       cc
B-org   Continental     Continental     PROPN   Case=Nom|Number=Sing    conj
O       kehittävät      kehittää        VERB    Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act root
O       erittäin        erittäin        ADV     _       advmod
O       tarkkaa tarkka  ADJ     Case=Par|Degree=Pos|Number=Sing amod
O       karttateknologiaa       kartta#teknologia       NOUN    Case=Par|Number=Sing    dobj
B-pro   Electronic      Electronic      PROPN   _       name
I-pro   Horizon Horizon PROPN   Case=Gen|Number=Sing    nmod:poss
I-pro   -alustalle      alusta  NOUN    Case=All|Number=Sing    nmod
O       ,       ,       PUNCT   _       punct
O       jonka   joka    PRON    Case=Gen|Number=Sing|PronType=Rel       nsubj
O       on      olla    VERB    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act cop
O       tarkoitus       tarkoitus       NOUN    Case=Nom|Number=Sing    acl:relcl
O       pystyä  pystyä  VERB    InfForm=1|Number=Sing|VerbForm=Inf|Voice=Act    xcomp:ds
O       jatkuvasti      jatkuvasti      ADV     _       advmod
O       paikantamaan    paikantaa       VERB    Case=Ill|InfForm=3|Number=Sing|VerbForm=Inf|Voice=Act   xcomp
```
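
As a small sketch of how one line of this sample can be taken apart, assuming tab-separated columns in exactly the order shown above (the field names and the helper `parse_token` are made up for this example):

```python
from collections import namedtuple

# field names chosen for this sketch, following the column order of the sample
Token = namedtuple("Token", "label word lemma pos feats deprel")

def parse_token(line):
    """Split one tab-separated line of the sample into named fields."""
    cols = line.rstrip("\n").split("\t")
    token = Token(*cols[:6])
    # turn "Case=Nom|Number=Sing" into a dict; "_" means no morphology
    if token.feats == "_":
        morph = {}
    else:
        morph = dict(f.split("=", 1) for f in token.feats.split("|"))
    return token, morph

token, morph = parse_token("B-org\tNokia\tNokia\tPROPN\tCase=Nom|Number=Sing\tnsubj")
print(token.word, token.pos, token.deprel, morph)
# Nokia PROPN nsubj {'Case': 'Nom', 'Number': 'Sing'}
```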

In [18]:
import collections

def read_data(f):
    """Yield one sentence at a time as a list of lines; an empty line ends a sentence."""
    sent=[]
    for line in f:
        line=line.strip()
        if not line:
            if sent:
                yield sent
                sent=[]
        else:
            sent.append(line)
    if sent: # last sentence, if the file does not end with an empty line
        yield sent

labels=[]
count=0
with open("/course_data/textmine/ner-fi/digitoday.2014.train.conllu", encoding="utf-8") as f:
    for sent in read_data(f):
        for line in sent:
            label,word=line.split("\t")[:2] # first two columns: NER label, word form
            labels.append(label)
        count+=1

print("Training data size:", count, "sentences,", len(labels), "examples,", len(set(labels)), "classes", "\n")
        
counter=collections.Counter(labels)
for key in sorted(counter, key=counter.get, reverse=True):
    print(counter[key],key)
print()

Training data size: 13497 sentences, 180178 examples, 13 classes 

155944 O
8592 B-ORG
4270 B-PRO
2886 I-PRO
2029 B-PER
1937 I-ORG
1754 B-LOC
1094 I-PER
904 B-DATE
463 I-DATE
131 I-LOC
91 B-EVENT
83 I-EVENT



## Feature generation

* Simple features:
    * Current token: word, character n-grams, POS, morphology, dependency type, capitalization (first letter is uppercase), is first/last token
    * Previous/next token: word, POS, and the POS pair formed with the current token
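
The character n-gram feature is easy to write out by hand. The helper below is a plain-Python sketch (the name `char_ngrams` is made up here); for a single token without whitespace it should give the same 2- to 4-grams as the recorded output of the analyzer used in the next cell, but that equivalence is an assumption based on that output, not on a guarantee of the library.

```python
def char_ngrams(word, min_n=2, max_n=4):
    """All character n-grams of `word`, shortest n first, left to right."""
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]

grams = char_ngrams("Imperiumi")
print(len(grams))    # 21
print(grams[:8])     # ['Im', 'mp', 'pe', 'er', 'ri', 'iu', 'um', 'mi']
```

Suffix-like n-grams are handy for Finnish because case endings often survive in them even when the full word form is unseen.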

In [19]:
def create_features(i,sent,analyzer):
    """Return the list of string features for token i of sent (a list of tab-separated lines)."""
    # NOTE: the column indices follow the sample layout shown above
    # (label, word, lemma, POS, morphology, deprel). In the recorded output below
    # the deprel= feature carries the morphology, which suggests that the file
    # read here has a different column layout; check the indices against your data.
    cols=sent[i].split("\t")
    label,word=cols[0],cols[1]
    feats=analyzer(word) # character n-grams
    feats.append("word="+word)
    feats.append("pos="+cols[3])
    feats.append("deprel="+cols[5])
    if cols[4]!="_":
        for fe in cols[4].split("|"):
            feats.append(fe)
    if word[0].isupper():
        feats.append("isupper")

    if i!=0: # take previous token
        feats.append("preword="+sent[i-1].split("\t")[1])
        feats.append("prepos="+sent[i-1].split("\t")[3])
        # previous and current POS
        feats.append("prethis="+sent[i-1].split("\t")[3]+cols[3])
    else:
        feats.append("firsttoken")
    if i<len(sent)-1: # take next token
        feats.append("nextword="+sent[i+1].split("\t")[1])
        feats.append("nextpos="+sent[i+1].split("\t")[3])
        # next and current POS (in that order)
        feats.append("thisnext="+sent[i+1].split("\t")[3]+cols[3])
    else:
        feats.append("lasttoken")

    return feats

import sklearn.feature_extraction.text
# the vectorizer is used only for its analyzer, which splits a word into character 2- to 4-grams
vectorizer=sklearn.feature_extraction.text.TfidfVectorizer(analyzer='char',ngram_range=(2,4),lowercase=False)
analyzer=vectorizer.build_analyzer()

labels=[]
examples=[]
with open("/course_data/textmine/ner-fi/digitoday.2014.train.conllu", encoding="utf-8") as f:
    for sent in read_data(f):
        for i,line in enumerate(sent):
            label,word=line.split("\t")[:2]
            labels.append(label)
            examples.append(create_features(i,sent,analyzer))
        examples.append(None) # sentence boundary
        labels.append(None)

print("First example featurized:")
print(labels[0],examples[0])

First example featurized:
O ['Im', 'mp', 'pe', 'er', 'ri', 'iu', 'um', 'mi', 'Imp', 'mpe', 'per', 'eri', 'riu', 'ium', 'umi', 'Impe', 'mper', 'peri', 'eriu', 'rium', 'iumi', 'word=Imperiumi', 'pos=NOUN', 'deprel=Case=Nom|Number=Sing', 'N', 'isupper', 'firsttoken', 'nextword=laajenee', 'nextpos=VERB', 'thisnext=VERBNOUN']


## Save featurized data for crf

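* scikit-learn does not include a CRF implementation, so we save the data in the format CRFsuite reads (label followed by tab-separated features, empty line between sentences) and run CRFsuite in the terminal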

In [20]:
with open("train-data.featurized", "w", encoding="utf-8") as ffile:
    for example,label in zip(examples,labels):
        if example is None: # sentence boundary: write an empty line
            assert label is None
            ffile.write("\n")
            continue
        feat="\t".join(example)
        feat=feat.replace(":","_") # ":" is a special character in the crfsuite data format, so replace it
        ffile.write(label+"\t"+feat+"\n")
print("Saved to train-data.featurized")
print()

# just checking the file looks ok
with open("train-data.featurized", encoding="utf-8") as f:
    for line in f.readlines()[:5]:
        print(line)

Saved to train-data.featurized

O	Im	mp	pe	er	ri	iu	um	mi	Imp	mpe	per	eri	riu	ium	umi	Impe	mper	peri	eriu	rium	iumi	word=Imperiumi	pos=NOUN	deprel=Case=Nom|Number=Sing	N	isupper	firsttoken	nextword=laajenee	nextpos=VERB	thisnext=VERBNOUN

O	la	aa	aj	je	en	ne	ee	laa	aaj	aje	jen	ene	nee	laaj	aaje	ajen	jene	enee	word=laajenee	pos=VERB	deprel=Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	V	preword=Imperiumi	prepos=NOUN	prethis=NOUNVERB	nextword=_	nextpos=PUNCT	thisnext=PUNCTVERB

O	word=_	pos=PUNCT	deprel=_	Punct	preword=laajenee	prepos=VERB	prethis=VERBPUNCT	nextword=Maailman	nextpos=NOUN	thisnext=NOUNPUNCT

O	Ma	aa	ai	il	lm	ma	an	Maa	aai	ail	ilm	lma	man	Maai	aail	ailm	ilma	lman	word=Maailman	pos=NOUN	deprel=Case=Gen|Number=Sing	N	isupper	preword=_	prepos=PUNCT	prethis=PUNCTNOUN	nextword=suurin	nextpos=ADJ	thisnext=ADJNOUN

O	su	uu	ur	ri	in	suu	uur	uri	rin	suur	uuri	urin	word=suurin	pos=ADJ	deprel=Case=Nom|Degree=Sup|Number=Sing	A	preword=Maailman	prepos=NOUN	prethis=NOU

## Train crfsuite

In [27]:
%%bash

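# split training data into files x00, x01, ... of 20000 lines each
# (split counts lines, so a file boundary can fall inside a sentence)
split -l 20000 -d train-data.featurized
ls -la x*

# train crfsuite
# -a training algorithm: lbfgs
# -p set a training parameter (here: at most 80 iterations)
# -m save model to ner.model
crfsuite learn -a lbfgs -p max_iterations=80 -m ner.model -l x[0-9][0-9]

ls -lh ner.model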

-rw-r--r-- 1 smp edu 3917818 Mar  8 14:01 x00
-rw-r--r-- 1 smp edu 3897534 Mar  8 14:01 x01
-rw-r--r-- 1 smp edu 3905399 Mar  8 14:01 x02
-rw-r--r-- 1 smp edu 3925088 Mar  8 14:01 x03
-rw-r--r-- 1 smp edu 3957527 Mar  8 14:01 x04
-rw-r--r-- 1 smp edu 3980124 Mar  8 14:01 x05
-rw-r--r-- 1 smp edu 3962859 Mar  8 14:01 x06
-rw-r--r-- 1 smp edu 3938585 Mar  8 14:01 x07
-rw-r--r-- 1 smp edu 3810774 Mar  8 14:01 x08
-rw-r--r-- 1 smp edu 2719565 Mar  8 14:01 x09
CRFSuite 0.12.2  Copyright (c) 2007-2013 Naoaki Okazaki

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 223293
Seconds required: 1.318

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 80
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 269673.650303
Feature norm: 1.000000
Error norm: 175
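
Because `split -l 20000` cuts purely by line count, a sentence ends up divided between two files unless the boundary happens to fall on an empty line. A sentence-aware alternative could look roughly like this; the helper name `chunk_sentences` and the "start a new chunk when the next sentence would not fit" policy are assumptions of this sketch.

```python
def chunk_sentences(lines, max_lines=20000):
    """Group featurized lines into chunks of about `max_lines` lines without splitting a sentence.

    `lines` is the content of the featurized file, where an empty line ends a sentence.
    """
    chunks, current, sentence = [], [], []
    for line in lines:
        sentence.append(line)
        if line.strip():
            continue
        # empty line: the sentence is complete
        if current and len(current) + len(sentence) > max_lines:
            chunks.append(current)
            current = []
        current.extend(sentence)
        sentence = []
    current.extend(sentence)  # trailing sentence without a final empty line
    if current:
        chunks.append(current)
    return chunks

demo = ["O\tword=a\n", "O\tword=b\n", "\n", "B-ORG\tword=Nokia\n", "O\tword=c\n", "\n"]
for chunk in chunk_sentences(demo, max_lines=4):
    print(len(chunk), "lines")
```

Each chunk can then be written to its own file, which keeps every training sequence intact.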

## Use the trained model to tag text from parsebank

* We want to find named entities in parsebank data

In [28]:
# now test it with this
with open("pb.test.conllu", encoding="utf-8") as f:
    for line in f.readlines()[:10]:
        print(line.strip())

FileNotFoundError: [Errno 2] No such file or directory: 'pb.test.conllu'

## crfsuite tagging

* The data to be tagged must be featurized in the same way as the training data

In [29]:
%%bash

cat pb.test.conllu | python /home/jmnybl/NER/featurize.py > pb.test.featurized

crfsuite tag -m ner.model pb.test.featurized > predicted.labels

# combine predictions and original text
paste predicted.labels pb.test.conllu > predicted.txt

ls -lh predicted*

-rw-r--r-- 1 smp edu 0 Mar  8 14:02 predicted.labels
-rw-r--r-- 1 smp edu 0 Mar  8 14:02 predicted.txt


cat: pb.test.conllu: No such file or directory
python: can't open file '/home/jmnybl/NER/featurize.py': [Errno 2] No such file or directory
paste: pb.test.conllu: No such file or directory


## Let's have a look at the output

* Print sentences with named entities

In [31]:
def print_sent(sent):
    for line in sent:
        print("\t".join(line.strip().split("\t")[:5])) # show only the first five columns
    print()

# print the first six sentences that contain at least one predicted entity
i=0
with open("predicted.txt", encoding="utf-8") as f:
    for sent in read_data(f):
        if any(line.strip().split("\t")[0]!="O" for line in sent):
            print_sent(sent)
            i+=1
        if i>5:
            break