<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/sequence_labelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence labeling

* Many classification tasks produce a sequence of predictions rather than a single prediction
* In this lecture we look at these tasks:
  * how this setting differs from basic text classification
  * how it affects our modelling
  * how sequence classification works on an example problem
  * when done, you will be able to apply a sequence classification model to a problem
  * you will have the necessary background to move on to more complex models later
  
* Sequence classification is best explained through several example problems:

### POS Tagging

![posfig](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/pos_voita.png?raw=1)

![posfig](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/pos_house.png?raw=1)

* Every word is assigned to its part-of-speech category
* The number of categories can be quite large, although in this case there are fewer than 20 (you can see them [here](https://universaldependencies.org/u/pos/index.html))
* POS tagging is often used as a pre-processing step
* You can also use it to pick important words as features (nouns, verbs, etc)
* Note the context-dependence of the tags
  * `voita` can also be a verb, `voi` can also be a noun
  * `house` can be a noun or a verb
  * ...
* The tags also depend on each other
  * Many sequences are impossible or at least highly unlikely, regardless of the input
  * In English, having seen a determiner, the likely next tag is a noun or an adjective, and e.g. a verb is extremely unlikely
  
 * The figs come from this demo: https://turkunlp.org/finnish_nlp.html#parser
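
The dependence between neighbouring tags can be made concrete by counting tag bigrams in tagged text. The sketch below is a minimal illustration on three invented toy sentences (not taken from any corpus); the function name `transition_probabilities` is hypothetical.

```python
from collections import Counter, defaultdict

# Toy tagged sentences, invented for illustration: lists of (word, tag) pairs
tagged_sentences = [
    [("the", "DET"), ("old", "ADJ"), ("house", "NOUN"), ("stands", "VERB")],
    [("a", "DET"), ("house", "NOUN"), ("burned", "VERB")],
    [("they", "PRON"), ("house", "VERB"), ("the", "DET"), ("guests", "NOUN")],
]

def transition_probabilities(sentences):
    """Estimate P(next_tag | tag) from tag bigram counts."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tags = [tag for _, tag in sent]
        for prev_tag, next_tag in zip(tags, tags[1:]):
            counts[prev_tag][next_tag] += 1
    # Normalise the counts of each previous tag into probabilities
    return {
        prev_tag: {t: c / sum(nexts.values()) for t, c in nexts.items()}
        for prev_tag, nexts in counts.items()
    }

probs = transition_probabilities(tagged_sentences)
print(probs["DET"])
```

In this toy data a determiner is followed only by a noun or an adjective, never by a verb, which mirrors the English example above.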
 
  

### Named entity recognition

![nerfig](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/ner_demo.png?raw=1)

![nerfig](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/ner_demo_en.png?raw=1)


* NER is usually cast as a sequence labeling problem
* Entities are (typically) sequences of words, like `Turun Yliopisto` or `British Airways`
* The type tells what kind of an entity we have. The list of types is usually quite restricted: `Person, Organization, Location, Product, Event, Date, Other` would be a typical list

* These figs come from https://demo.allennlp.org/named-entity-recognition and a [temporary Finnish demo](http://86.50.253.19:8001/tagdemo/)

### BIO-coding

* NER and other similar tasks that involve locating multi-word entities are cast as classification of individual tokens into three groups of classes:

* **B-category**: The token begins an entity of type `category`. For example `B-Person` or `B-Location`
* **I-category**: The token continues an entity that is already started (with a `B-category`)
* **O**: The token is not a part of any entity

Here is an example from our [Finnish NER training data](https://github.com/TurkuNLP/turku-ner-corpus):

```
The	B-PRO
Garden	I-PRO
Collection	I-PRO
by	O
H&M	B-ORG

Viikonlopun	O
pyöritys	O
alkoi	O
H&M:n	B-ORG
järjestämällä	O
bloggaajabrunssilla	O
Helsingissä	B-LOC
.	O
```

And here is an example from the [CoNLL 2003 English data](https://raw.githubusercontent.com/davidsbatista/NER-datasets/master/CONLL2003/train.txt):

```
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
```
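
Going from BIO tags back to entities is a simple scan over the tag sequence. The following is a minimal sketch; the function name `bio_to_spans` is hypothetical, and ignoring stray `I-` tags is one possible design choice rather than a standard.

```python
def bio_to_spans(tags):
    """Return (start, end, category) triples; end is exclusive."""
    spans = []
    start, category = None, None
    for i, tag in enumerate(tags):
        # Close the open span unless this tag continues it
        if start is not None and tag != "I-" + category:
            spans.append((start, i, category))
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]
        # "O" and stray "I-" tags open nothing in this sketch
    if start is not None:
        spans.append((start, len(tags), category))
    return spans

# First sentence of the Finnish NER example above
words = ["The", "Garden", "Collection", "by", "H&M"]
tags = ["B-PRO", "I-PRO", "I-PRO", "O", "B-ORG"]
for start, end, category in bio_to_spans(tags):
    print(category, " ".join(words[start:end]))
```

This prints one line per entity: `PRO The Garden Collection` and `ORG H&M`.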

* `BIO-coding` is suitable for cases where entities are neither nested nor overlapping
* There are, once again, quite clear dependencies between labels regardless of the input:
  * Examples of legal: `O B-Person O O`, `B-Person I-Person O O`, `B-Person B-Person`
  * Examples of illegal: `B-Person O I-Person O`, `O O I-Person O O`, `O B-Person I-Event O`
* Preferably, the classifier should be prevented from producing illegal BIO sequences
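
The legality rule behind these examples is that an `I-category` tag may only follow a `B-` or `I-` tag of the same category. A minimal checker, assuming exactly that rule (the name `is_legal_bio` is hypothetical):

```python
def is_legal_bio(tags):
    """True if every I-x directly follows a B-x or I-x of the same category."""
    previous = "O"  # treat the start of the sequence like an O
    for tag in tags:
        if tag.startswith("I-"):
            category = tag[2:]
            if previous not in ("B-" + category, "I-" + category):
                return False
        previous = tag
    return True

legal = ["O B-Person O O", "B-Person I-Person O O", "B-Person B-Person"]
illegal = ["B-Person O I-Person O", "O O I-Person O O", "O B-Person I-Event O"]
for sequence in legal + illegal:
    print(sequence, "->", is_legal_bio(sequence.split()))
```

A check like this can only flag illegal output after the fact; preventing it requires a model or decoder that takes the previous tag into account.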

### Text segmentation

* Text segmentation (splitting into tokens and sentences) is often carried out as sequence labeling
* One would label every individual character as one of:
  * token ends after this character
  * sentence ends after this character
  * inside token

Example:

```
Is it you?

I     inside
s     token-break
      token-break
i     inside
t     token-break
      token-break
y     inside
o     inside
u     token-break
?     sentence-break
```

* **Note:** what precisely happens at spaces is implementation-dependent; the labeling shown here is only one of several possibilities
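
Decoding such character labels back into tokens and sentences could look like the sketch below. It assumes the three label names used in the example and drops whitespace-only tokens, which is one way to handle the spaces; the function name `decode_segmentation` is hypothetical.

```python
def decode_segmentation(text, labels):
    """Turn per-character labels into a list of sentences, each a list of tokens."""
    sentences, tokens, buffer = [], [], ""
    for char, label in zip(text, labels):
        buffer += char
        if label == "inside":
            continue
        # Both break labels end the current token; whitespace-only tokens are dropped
        if buffer.strip():
            tokens.append(buffer.strip())
        buffer = ""
        if label == "sentence-break":
            sentences.append(tokens)
            tokens = []
    # Flush anything left if the text does not end in a sentence break
    if buffer.strip():
        tokens.append(buffer.strip())
    if tokens:
        sentences.append(tokens)
    return sentences

text = "Is it you?"
labels = ["inside", "token-break", "token-break", "inside", "token-break",
          "token-break", "inside", "inside", "token-break", "sentence-break"]
print(decode_segmentation(text, labels))
```

For the example above this yields a single sentence with the tokens `Is`, `it`, `you` and `?`.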

### Zoning

* In many applications, one may want to separate text into zones
    * scientific papers may need to be separated into background, methods, results, citations
    * patents can be separated into background and claims
    * ...
* This allows for focused information retrieval, etc.
   * e.g. when mining scientific literature for new factual statements, you may want to focus on the *Results* section
* BIO coding is also applicable here
    * the unit of classification may be whole sentences or even paragraphs rather than words
    * this depends on the task, i.e. whether a zone can be expected to change halfway through a sentence

![zoningfig](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/zones.gif?raw=1)

Figure from: https://www.cl.cam.ac.uk/~sht25/az.html

# Modelling considerations in sequence classification

![posfig](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/pos_house.png?raw=1)

* **Context** is of crucial importance
* *house* has two different labels in the above sequence
* The label depends on the context of the occurrence
  * *in my ______ .* is quite likely a noun
  * *can ______ you* is quite likely a verb
* NLP methods differ in how they model the context
  * Anything from simple left/right bag of features, perhaps marked for position...
  * ...to complex recurrent networks like LSTM or attention-based models like the Transformer

# CoNLL-03 POS and NER data

* You can get this data easily from this repository on github: https://github.com/davidsbatista/NER-datasets
* For some reason, this got deleted in January 2022, but we can go back in commit history on GitHub and get the data anyway
* (Remember, when you remove something from your Git repository, it still stays in the commit history!)
* The data is here: https://github.com/davidsbatista/NER-datasets/tree/dcb6c7439a7de43abc2448bad5b1d81a47f26c0d/CONLL2003
* Look at CONLL2003/valid.txt
* Let us try to learn a POS tagger based on this data

In [3]:
!mkdir -p CONLL2003
!wget -nc -O CONLL2003/train.txt https://github.com/davidsbatista/NER-datasets/raw/dcb6c7439a7de43abc2448bad5b1d81a47f26c0d/CONLL2003/train.txt
!wget -nc -O CONLL2003/valid.txt https://github.com/davidsbatista/NER-datasets/raw/dcb6c7439a7de43abc2448bad5b1d81a47f26c0d/CONLL2003/valid.txt

--2022-02-09 12:47:41--  https://github.com/davidsbatista/NER-datasets/raw/dcb6c7439a7de43abc2448bad5b1d81a47f26c0d/CONLL2003/train.txt
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/davidsbatista/NER-datasets/dcb6c7439a7de43abc2448bad5b1d81a47f26c0d/CONLL2003/train.txt [following]
--2022-02-09 12:47:41--  https://raw.githubusercontent.com/davidsbatista/NER-datasets/dcb6c7439a7de43abc2448bad5b1d81a47f26c0d/CONLL2003/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283418 (3.1M) [text/plain]
Saving to: ‘CONLL2003/train.txt’


2022-02-09 12:47:42 (46.0 MB/s) - ‘CONLL2003/train.txt’ save

In [4]:
# this is how you read a file of this kind
# one item per line, empty lines between sequences

from collections import namedtuple

#Same as tuple but the fields are named for convenience
#this says we have four fields
OneWord=namedtuple("OneWord",["word","pos_label","chunk_label","entity_label"])

def read_conll2003(f_name):
    """Yield complete sentences"""
    current_sentence=[] #This will be a list of OneWord tuples, which we accumulate for each sentence
    with open(f_name) as f:
        for line in f:
            line=line.strip() #drop whitespace
            if line.startswith("-DOCSTART-"): #let's not worry about these for the time being
                continue
            if not line: #sentence break
                if current_sentence: #if we gathered a sentence, we should yield it, because a new one starts
                    yield current_sentence #much like return, but continues past this line once the element has been consumed
                    current_sentence=[] #...and start a new one
                continue
            #if we made it here, we are on a normal line
            columns=line.split() #an actual word line
            assert len(columns)==4 #we should have four columns, looking at the data
            current_sentence.append(OneWord(*columns)) #* expands columns as arguments to OneWord constructor
        else: #for ... else -> the else part is executed once, when "for" runs out of elements
            if current_sentence: #yield also the last one!
                yield current_sentence

#Now just read the data in
sentences_train=list(read_conll2003("CONLL2003/train.txt"))
sentences_dev=list(read_conll2003("CONLL2003/valid.txt"))

print("First three sentences")
for sent in sentences_dev[:3]:
    print(sent)
    print()

First three sentences
[OneWord(word='CRICKET', pos_label='NNP', chunk_label='B-NP', entity_label='O'), OneWord(word='-', pos_label=':', chunk_label='O', entity_label='O'), OneWord(word='LEICESTERSHIRE', pos_label='NNP', chunk_label='B-NP', entity_label='B-ORG'), OneWord(word='TAKE', pos_label='NNP', chunk_label='I-NP', entity_label='O'), OneWord(word='OVER', pos_label='IN', chunk_label='B-PP', entity_label='O'), OneWord(word='AT', pos_label='NNP', chunk_label='B-NP', entity_label='O'), OneWord(word='TOP', pos_label='NNP', chunk_label='I-NP', entity_label='O'), OneWord(word='AFTER', pos_label='NNP', chunk_label='I-NP', entity_label='O'), OneWord(word='INNINGS', pos_label='NNP', chunk_label='I-NP', entity_label='O'), OneWord(word='VICTORY', pos_label='NN', chunk_label='I-NP', entity_label='O'), OneWord(word='.', pos_label='.', chunk_label='O', entity_label='O')]

[OneWord(word='LONDON', pos_label='NNP', chunk_label='B-NP', entity_label='B-LOC'), OneWord(word='1996-08-30', pos_label='CD',

* Now we have the input data
* Next we generate features for each word
* These will be used to predict its POS
* Let's start simple, the feature will be the word itself and nothing else

In [7]:
def generate_sentence_features(sent):
    #Given a sentence as a list of (word, label) pairs
    #generate the features for every word
    #The result should be a list of same length as the sentence
    #Each item is a dictionary of {"feature name"->feature value} mappings, holding all features of the word at that position
    
    sent_features=[] #this will be the result
    for one_word in sent:
        #We must do nothing with label
        #it just happens to be around
        word_features={}
        word_features["word_"+one_word.word]=1 #the word itself is a feature
        sent_features.append(word_features)
    return sent_features

print(generate_sentence_features(sentences_dev[0])  )

[{'word_CRICKET': 1}, {'word_-': 1}, {'word_LEICESTERSHIRE': 1}, {'word_TAKE': 1}, {'word_OVER': 1}, {'word_AT': 1}, {'word_TOP': 1}, {'word_AFTER': 1}, {'word_INNINGS': 1}, {'word_VICTORY': 1}, {'word_.': 1}]


* The code above takes care of basic feature generation
* For the simple classifier we will be building, we only need the sentence boundaries when generating the features
* After that, we can flatten the data into a single stream of words

In [8]:
#...now we can generate the training examples
def prep_data(sentences):
    all_labels=[] #here we gather labels for all words in all sentences
    all_features=[] #here we gather features for all words in all sentences
    for sentence in sentences:
        sent_features=generate_sentence_features(sentence)
        assert len(sent_features)==len(sentence)
        #Now we can get, for every position its label and its features
        for one_word,features in zip(sentence,sent_features):
            all_labels.append(one_word.pos_label) #label
            all_features.append(features)         #and features to go with it
    return all_labels, all_features

train_labels,train_features=prep_data(sentences_train)
dev_labels,dev_features=prep_data(sentences_dev)

* Now we have the data in the usual form
* We still need to get the actual feature vectors
* sklearn's DictVectorizer is a useful tool here - turns a dictionary of {feature_name -> value} into the corresponding feature vector
* ...this gives us the freedom to build the dictionaries any way we like
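
To see what the vectorizer does, here is a small sketch on two invented feature dictionaries. The `toy_` names are hypothetical and chosen so as not to clash with the notebook's own variables.

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy feature dictionaries, invented for illustration
toy_train = [{"word_the": 1}, {"word_house": 1, "left_word_the": 1}]
toy_vectorizer = DictVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_train)

print(toy_vectorizer.vocabulary_)   # feature name -> column index
print(toy_matrix.toarray())         # dense view, only sensible for toy data

# A feature never seen during fit has no column and is silently dropped
unseen = toy_vectorizer.transform([{"word_zebra": 1}])
print(unseen.toarray())
```

The last step matters for the dev data: words that never occurred in training produce an all-zero row when the word itself is the only feature.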

In [9]:
from sklearn.feature_extraction import DictVectorizer
vectorizer=DictVectorizer()
vectorizer.fit(train_features)
print("Vectorizer vocab size:",len(vectorizer.vocabulary_))

feature_vectors_train=vectorizer.transform(train_features)
feature_vectors_dev=vectorizer.transform(dev_features)

print("Train shape",feature_vectors_train.shape)
print("Dev shape",feature_vectors_dev.shape)

Vectorizer vocab size: 23623
Train shape (203621, 23623)
Dev shape (51362, 23623)


* And now we can train the classifier as usual
* How well can we do?

In [10]:
import sklearn.svm

classifier=sklearn.svm.LinearSVC(C=0.05,verbose=1)
classifier.fit(feature_vectors_train, train_labels)

[LibLinear]

LinearSVC(C=0.05, verbose=1)

In [11]:
classifier.score(feature_vectors_dev,dev_labels)

0.8655426190568903

* Oh my, that is a pretty good score for such a simple classifier!
* The features are simply the words themselves; there is no context
* Then again, is about 87% a good POS tagger accuracy?
* Can we do better?
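
A useful point of comparison for a word-only classifier is the most-frequent-tag baseline, which assigns every word the tag it most often had in training. The sketch below uses invented toy data, and the function names are hypothetical; it makes no claim about the score this baseline would reach on CoNLL-03.

```python
from collections import Counter, defaultdict

def train_majority_tagger(tagged_words):
    """Map every word to its most frequent tag; also return the overall most frequent tag."""
    tag_counts = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
        overall[tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
    return word_to_tag, overall.most_common(1)[0][0]

def accuracy(word_to_tag, fallback_tag, tagged_words):
    """Share of words tagged correctly; unknown words get the fallback tag."""
    correct = sum(word_to_tag.get(w, fallback_tag) == t for w, t in tagged_words)
    return correct / len(tagged_words)

# Toy data invented for illustration
train = [("the", "DT"), ("house", "NN"), ("is", "VBZ"), ("my", "PRP$"),
         ("house", "NN"), ("can", "MD"), ("house", "VB"), ("garden", "NN")]
dev = [("the", "DT"), ("house", "VB"), ("is", "VBZ"), ("big", "JJ")]

word_to_tag, fallback = train_majority_tagger(train)
print(accuracy(word_to_tag, fallback, dev))
```

With the notebook's data, the input could be built as `[(w.word, w.pos_label) for s in sentences_train for w in s]`. The toy dev set shows the two typical error sources: an ambiguous word (`house` as a verb) and an unseen word (`big`).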

In [12]:
def generate_sentence_features(sent):
    #Given a sentence as a list of (word, label) pairs
    #generate the features for every word
    #The result should be a list of same length as the sentence
    #Each item is a dictionary of {"feature name"->feature value} mappings, holding all features of the word at that position
    
    sent_features=[] #this will be the result
    for word_idx, one_word in enumerate(sent):
        #We do nothing with label
        #it just happens to be around
        word_features={}
        word_features["word_"+one_word.word]=1 #the word itself is a feature
        if word_idx!=0:
            word_features["left_word_"+sent[word_idx-1].word]=1
        if word_idx!=len(sent)-1:
            word_features["right_word_"+sent[word_idx+1].word]=1
        sent_features.append(word_features)
    return sent_features

train_labels,train_features=prep_data(sentences_train)
dev_labels,dev_features=prep_data(sentences_dev)
vectorizer=DictVectorizer()
vectorizer.fit(train_features)
feature_vectors_train=vectorizer.transform(train_features)
feature_vectors_dev=vectorizer.transform(dev_features)

print("Train shape",feature_vectors_train.shape)
print("Dev shape",feature_vectors_dev.shape)

classifier=sklearn.svm.LinearSVC(C=1,verbose=1)
classifier.fit(feature_vectors_train, train_labels)
classifier.score(feature_vectors_dev,dev_labels)

Train shape (203621, 68467)
Dev shape (51362, 68467)
[LibLinear]

0.9292862427475566
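
The left and right neighbour features generalise to a wider window and to simple word-shape cues such as suffixes and capitalisation. The following is a sketch only: the function name, the `window` and `suffix_len` parameters and the feature names are illustrative choices not taken from the notebook, and whether they improve the score would need to be measured.

```python
def generate_window_features(words, window=2, suffix_len=3):
    """One feature dict per word: the word, its neighbours within `window`, and shape cues."""
    sent_features = []
    for idx, word in enumerate(words):
        features = {"word_" + word: 1}
        features["suffix_" + word[-suffix_len:]] = 1
        if word[0].isupper():
            features["is_capitalized"] = 1
        for offset in range(-window, window + 1):
            neighbour = idx + offset
            if offset != 0 and 0 <= neighbour < len(words):
                # Position-marked, so "can" at -1 differs from "can" at +1
                features["word_%+d_%s" % (offset, words[neighbour])] = 1
        sent_features.append(features)
    return sent_features

features = generate_window_features("I can house you".split(), window=1)
print(features[2])
```

The function takes plain word strings; to use it in place of `generate_sentence_features`, it would be called with `[w.word for w in sent]`.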

In [14]:
# Let us try to look at some predictions
sentence="I can house you in my house .".split()

sentence_data=[OneWord(w,"XXX","XXX","XXX") for w in sentence] #we need to fake this a bit, to get data in the correct format
_,sentence_features=prep_data([sentence_data])
sentence_vectors=vectorizer.transform(sentence_features)
predictions=classifier.predict(sentence_vectors)
for word,label in zip(sentence,predictions):
    print(word,label)


I PRP
can MD
house VB
you PRP
in IN
my PRP$
house NN
. .


* PRP - personal pronoun
* MD - modal verb
* VB - verb
* IN - preposition
* PRP$ - possessive pronoun
* NN - noun

* Happily, we can see that the classifier was able to distinguish between the two occurrences of `house`


# What has the classifier learned?

* We can introspect the classifier with the same approach as before
* The classifier learns one decision hyperplane for each class
* Otherwise, the code is *exactly* the same as in feature_interpretation, so let's use it

In [15]:
print("Learned coefficients:",classifier.coef_.shape)
print("Classes in the data:",classifier.classes_)


Learned coefficients: (45, 68467)
Classes in the data: ['"' '$' "''" '(' ')' ',' '.' ':' 'CC' 'CD' 'DT' 'EX' 'FW' 'IN' 'JJ' 'JJR'
 'JJS' 'LS' 'MD' 'NN' 'NNP' 'NNPS' 'NNS' 'NN|SYM' 'PDT' 'POS' 'PRP' 'PRP$'
 'RB' 'RBR' 'RBS' 'RP' 'SYM' 'TO' 'UH' 'VB' 'VBD' 'VBG' 'VBN' 'VBP' 'VBZ'
 'WDT' 'WP' 'WP$' 'WRB']


In [16]:
import numpy

#Reverse the dictionary
index2feature={}
for feature,idx in vectorizer.vocabulary_.items():
    assert idx not in index2feature #This really should hold
    index2feature[idx]=feature
#Now we can query index2feature to get the feature names as we need

i=list(classifier.classes_).index("NN") #which of the coefficients corresponds to nouns?
indices=numpy.argsort(classifier.coef_[i])
print("Negative features")
for idx in indices[:30]:
    print(index2feature[idx])
print("-------------------------------")
print("Positive features")
for idx in indices[::-1][:30]: #you can also do it the other way round, reverse, then pick
    print(index2feature[idx])

Negative features
left_word_will
left_word_Sale
word_,
left_word_going
left_word_could
left_word_would
left_word_goals
right_word_A-rated
left_word_We
word_and
left_word_At
word_in
left_word_still
right_word_announcement
left_word_mixer
left_word_can
left_word_I
left_word_should
left_word_kms
left_word_might
left_word_8:00
left_word_prices
left_word_Mike
right_word_SCOREBOARD
left_word_must
left_word_n't
left_word_overs
right_word_effect
left_word_Services
word_two
-------------------------------
Positive features
word_world
word_power
word_consumer
word_peace
word_number
word_hospital
word_vouch
word_cricket
word_procure
word_soccer
word_victory
word_championship
word_staff
word_motor
word_value
word_cabinet
word_lunch
word_rain
word_injury
word_league
word_anyone
word_UNION
word_weekend
word_edge
word_parliament
word_shutdown
word_division
word_cash
word_tournament
word_race


# What have we learned?

* The (English) POS tagging task has a surprisingly high trivial baseline
* We can move the accuracy up by including features based on the context
* Introspecting the classifier shows that these are in fact picked up by the classifier very strongly
* Even so, we are still far behind the state of the art, which is typically in the 97-99% range for a vast number of languages
* It is only a tiny change to the code to e.g. predict named entities
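
One way to make that change is to let the data preparation step choose which column serves as the label. This is a sketch under assumptions: `prep_data_for`, its `feature_fn` and `label_field` parameters and `word_only_features` are hypothetical names, while `OneWord` mirrors the namedtuple defined earlier.

```python
from collections import namedtuple

OneWord = namedtuple("OneWord", ["word", "pos_label", "chunk_label", "entity_label"])

def prep_data_for(sentences, feature_fn, label_field="pos_label"):
    """Like prep_data in the notebook, but the target column is selectable."""
    all_labels, all_features = [], []
    for sentence in sentences:
        for one_word, features in zip(sentence, feature_fn(sentence)):
            all_labels.append(getattr(one_word, label_field))
            all_features.append(features)
    return all_labels, all_features

def word_only_features(sentence):
    """The simplest feature set: the word itself."""
    return [{"word_" + w.word: 1} for w in sentence]

# Two lines from the CoNLL 2003 sample shown earlier
sentence = [OneWord("Peter", "NNP", "B-NP", "B-PER"),
            OneWord("Blackburn", "NNP", "I-NP", "I-PER")]
labels, _ = prep_data_for([sentence], word_only_features, label_field="entity_label")
print(labels)
```

Everything downstream (vectorizing, training, scoring) stays the same; only the labels change from POS tags to BIO entity tags.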

# What we have not covered

* All predictions are independent of each other
* We did not treat the sequence as a sequence
* We have failed to directly account for dependencies among the output labels
* This is best done by a different class of machine learning models
* Classically this is the domain of Conditional Random Fields (CRF)
* These models take into account also tag-to-tag dependencies
* The code above would need only small changes to be usable with a CRF
* A fully worked example is here: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system
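
To illustrate what accounting for tag-to-tag dependencies means at prediction time, here is a sketch of Viterbi decoding with a transition score that forbids illegal BIO sequences. This is not a CRF: a CRF learns its per-tag and transition weights from data, whereas the scores below are hand-set toy numbers, and the function names are hypothetical.

```python
import math

def bio_transition(prev_tag, tag):
    """0 for legal BIO transitions, -inf for illegal ones."""
    if tag.startswith("I-") and prev_tag not in ("B-" + tag[2:], "I-" + tag[2:]):
        return -math.inf
    return 0.0

def viterbi(emission_scores, tags, transition_score):
    """Best-scoring tag sequence given per-position tag scores and a transition scorer."""
    # best[i][t] = (score of the best path ending in tag t at position i, previous tag)
    # The sequence start is treated as if it followed an "O"
    best = [{t: (transition_score("O", t) + emission_scores[0][t], None) for t in tags}]
    for scores in emission_scores[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] + transition_score(p, t) + scores[t], p) for p in tags
            )
        best.append(column)
    # Trace back from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

tags = ["O", "B-PER", "I-PER"]
# Hand-set toy scores for a three-word sentence (higher is better)
emission_scores = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.1},
    {"O": 0.2, "B-PER": 0.3, "I-PER": 2.0},
    {"O": 2.0, "B-PER": 0.1, "I-PER": 0.1},
]
independent = [max(tags, key=lambda t: s[t]) for s in emission_scores]
print("independent:", independent)
print("viterbi:    ", viterbi(emission_scores, tags, bio_transition))
```

Picking the best tag at each position independently gives the illegal `O I-PER O`, while decoding the sequence as a whole gives `B-PER I-PER O`.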

