# Sequence Labeling: Shallow Parsing

## Objectives

- Understanding: 
    - relation between Sequence Labeling and Shallow Parsing
    - IOB Notation
    - Joint Segmentation and Classification
    - Feature Engineering
- Learning how to:
    - use Named Entity Recognition in 
        - spacy
        - NLTK
    - train, test, and evaluate Conditional Random Fields models
    - perform feature engineering with CRF

### Recommended Reading
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)
- Conditional Random Fields
    - Lafferty et al. (2001) [Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.803&rep=rep1&type=pdf) (__original paper__)
    - Sutton & McCallum's [An Introduction to Conditional Random Fields](https://homepages.inf.ed.ac.uk/csutton/publications/crftutv2.pdf)
    - Edwin Chen's [Introduction to Conditional Random Fields](http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/)
    - Michael Collins's [Log-Linear Models, MEMMs, and CRFs](http://www.cs.columbia.edu/~mcollins/crf.pdf)

### Covered Material
- SLP
    - [Chapter 8: Sequence Labeling for Parts of Speech and Named Entities](https://web.stanford.edu/~jurafsky/slp3/8.pdf) 
- NLTK 
    - [Chapter 7: Extracting Information from Text](https://www.nltk.org/book/ch07.html) 

### Requirements

- [spaCy](https://spacy.io/)
- [NLTK](https://www.nltk.org/)
- [`conll.py`](https://github.com/esrel/LUS/) (in `src` folder)
- [CRFsuite](http://www.chokkan.org/software/crfsuite/)
    - [python-crfsuite](https://python-crfsuite.readthedocs.io) python binding to `CRFsuite`.
    - [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io) `python-crfsuite` wrapper providing API similar to `scikit-learn`
    - you need to install both `python_crfsuite` and `sklearn_crfsuite` to use `sklearn_crfsuite`

## 1. Sequence Labeling and Shallow Parsing

The scenario in which members of a sequence are mapped to higher-order units (i.e. grouped together, as in `[['a'], ['b', 'c']]`, and assigned a category) is known as __shallow parsing__.

Sequence Labeling is one of the methods applied to the NLP tasks listed below.

- [Part-of-Speech Tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)
- [Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) (Chunking)
    - [Phrase Chunking](https://en.wikipedia.org/wiki/Phrase_chunking)
    - [Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) 
    - [Semantic Role Labeling](https://en.wikipedia.org/wiki/Semantic_role_labeling)
    - Dependency [Parsing](https://en.wikipedia.org/wiki/Parsing) 
    - Discourse Parsing
    - (Natural/Spoken) __Language Understanding__: Concept Tagging/Entity Extraction

### 1.1. The General Setting for Sequence Labeling

- Create __training__ and __testing__ sets by tagging a certain amount of text by hand
    - i.e. map each word in corpus to a tag
- Train tagging model to extract generalizations from the annotated __training__ set
- Evaluate the trained tagging model on the annotated __testing__ set
- Use the trained tagging model to annotate new texts
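
The four steps above can be sketched with a deliberately simple tagger. This is only an illustration under assumptions: the toy data, the most-frequent-tag "model", and the helper names `train_tagger`, `tag`, and `accuracy` are hypothetical and are not part of NLTK or any other library used in this lab.

```python
from collections import Counter, defaultdict

# Step 1: hand-annotated data (toy example), one (word, tag) pair per token
train = [
    [("Alex", "B-PER"), ("lives", "O"), ("in", "O"), ("Rome", "B-LOC")],
    [("Rome", "B-LOC"), ("is", "O"), ("old", "O")],
]
test = [[("Alex", "B-PER"), ("is", "O"), ("in", "O"), ("Rome", "B-LOC")]]


def train_tagger(sents):
    """Step 2: learn the most frequent tag of every word seen in training."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="O"):
    """Step 4: annotate a new sentence; unknown words get the default tag."""
    return [(w, model.get(w, default)) for w in words]


def accuracy(model, sents):
    """Step 3: token-level accuracy on an annotated test set."""
    correct = total = 0
    for sent in sents:
        predicted = tag(model, [w for w, _ in sent])
        for (_, gold), (_, pred) in zip(sent, predicted):
            correct += gold == pred
            total += 1
    return correct / total


model = train_tagger(train)
print(accuracy(model, test))
print(tag(model, ["Alex", "visits", "Rome"]))
```

Real taggers (HMM, CRF) replace the lookup table with a model that also takes context into account, but the train / evaluate / annotate cycle stays the same.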

## 2. Shallow Parsing

As we have already mentioned, [Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) is a kind of Sequence Labeling. In a plain Sequence Labeling task such as Part-of-Speech tagging, there is one output label (tag) per token. Shallow Parsing additionally performs __chunking__: segmentation of the input sequence into constituents. Chunking is required to identify categories (or types) of *multi-word expressions*.
In other words, we want to capture the information that an expression like `"New York"`, which consists of 2 tokens, constitutes a single unit.

What this means in practice is that Shallow Parsing performs 2 tasks, either *jointly* or separately:
- __Segmentation__ of input into constituents (__spans__)
- __Classification__ (Categorization, Labeling) of these constituents into a predefined set of labels (__types__)


## 3. Encoding Segmentation Information: CoNLL Corpus Format

A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded as a table (or "grid") of values, where each line corresponds to a single word and each column corresponds to an annotation type.

The set of columns used by CoNLL-style files can vary from corpus to corpus.

```
        Alex       B-PER
        is         O
        going      O
        to         O
        Los        B-LOC
        Angeles    I-LOC
        in         O
        California B-LOC
```
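
As a rough sketch of how such a file can be read, the function below splits token-per-line text into sentences on blank lines and into columns on whitespace. The name `read_conll` and the sample text are hypothetical; in this lab the corpus is actually loaded through NLTK's corpus readers.

```python
def read_conll(text):
    """Parse token-per-line text into sentences; each token is a tuple of columns."""
    sents, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                 # a blank line closes the current sentence
            if current:
                sents.append(current)
                current = []
            continue
        current.append(tuple(line.split()))  # columns are whitespace-separated
    if current:                      # the last sentence may lack a trailing blank line
        sents.append(current)
    return sents


sample = """
Alex       B-PER
is         O
going      O
to         O
Los        B-LOC
Angeles    I-LOC
in         O
California B-LOC

She        O
left       O
"""

for sent in read_conll(sample):
    print(sent)
```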

### 3.1. [IOB Scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))

- The IOB notation scheme is used to label *multi-word* spans in a token-per-line format.
    - *Los Angeles* is a *LOCATION* concept that has 2 tokens
- Both prefix and suffix notations are common:
    - prefix: __B-LOC__
    - suffix: __LOC-B__

- Meaning of Prefixes
    - __B__ for (__B__)eginning of span
    - __I__ for (__I__)nside of span
    - __O__ for (__O__)utside of span (no prefix or suffix, just `O`)
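
To make the prefix meanings concrete, here is a minimal sketch that turns span annotations into IOB tags (prefix notation). The function name `spans_to_iob` and the span format `(start, end, type)` with an exclusive end are assumptions made for this example.

```python
def spans_to_iob(n_tokens, spans):
    """Encode non-overlapping (start, end, type) spans as IOB tags (end is exclusive)."""
    tags = ["O"] * n_tokens               # everything is outside a span by default
    for start, end, label in spans:
        tags[start] = "B-" + label        # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label        # remaining tokens of the span
    return tags


tokens = "Alex is going to Los Angeles in California".split()
spans = [(0, 1, "PER"), (4, 6, "LOC"), (7, 8, "LOC")]

for token, tag in zip(tokens, spans_to_iob(len(tokens), spans)):
    print(f"{token:<10} {tag}")
```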

#### Alternative Schemes:
- No prefix or suffix (useful when there are no *multi-word* concepts)
```
        Alex       PER
        is         O
        going      O
        to         O
        Los        LOC
        Angeles    LOC
        in         O
        California LOC
```
- __IOB/IOB2/BIO__

- __IOBE__
    - IOB + 
    - __E__ for (__E__)nd of span (or __L__ for (__L__)ast)
```
        Alex       B-PER
        is         O
        going      O
        to         O
        Los        B-LOC
        Angeles    E-LOC
        in         O
        California B-LOC
```
    
- __BILOU/BIOES__
    - IOB + 
    - __L__ for (__L__)ast word of span
    - __U__ for (__U__)nit word (or __S__ for (__S__)ingleton)
```
        Alex       U-PER
        is         O
        going      O
        to         O
        Los        B-LOC
        Angeles    L-LOC
        in         O
        California U-LOC
```

#### Choice of Scheme
- IOB, IOBE, and BILOU formats can be converted to each other
- Each prefix is applied to every concept label; consequently, richer schemes increase the number of labels and transitions whose probabilities we need to estimate
    - this increases data sparseness, since there are fewer observations for each label
- The choice of scheme depends on the amount of available data:
    - __IOB__ for smaller amounts of data
    - __BILOU__ for larger amounts of data
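
Since the schemes carry the same span information, conversion between them is mechanical. The sketch below converts IOB tags to BILOU; `iob_to_bilou` is a hypothetical helper, and it assumes well-formed prefix-notation input in which every span starts with `B-`.

```python
def iob_to_bilou(tags):
    """Convert well-formed IOB tags (prefix notation) to BILOU."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label       # does the span go on after this token?
        if prefix == "B":
            out.append(("B-" if continues else "U-") + label)
        else:                                 # prefix == "I"
            out.append(("I-" if continues else "L-") + label)
    return out


iob = ["B-PER", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC"]
print(iob_to_bilou(iob))
```

Note how the label set grows: with `PER` and `LOC` only, IOB needs 5 labels, while BILOU needs 9.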

#### Terminology
There is no strict naming convention regarding schemes (see alternatives) or how each constituent is termed. 
Below is the terminology used in this notebook. 

```
        Alex       B-PER
        is         O
        going      O
        to         O
        Los        B-LOC
        Angeles    I-LOC
        in         O
        California B-LOC
```

##### Interpretation
Segmentation and Labeling data formats encode the following information:
- in string (sentence) `"Alex is going to Los Angeles in California"`
- there are 3 __entities__ (a.k.a. chunks, concepts or slots, depending on NLP task and perspective), that have __types__ (labels)
    - `PERSON`
    - `LOCATION`
    
- entity of __type__ `PERSON`: 
    - has __span__:
        - counted in tokens starting from `0`, as in *CoNLL*: `[0:1]`
    - has __value__: `"Alex"`
        - the string *covered by* (or *included in*) the __span__
 
*CoNLL* format encodes __tokenization__ information, in other words, how the string `"Alex is going to Los Angeles in California"` is split into tokens. Since most Sequence Labeling algorithms operate at the token level, strings are internally split into tokens, and *IOB*-like schemes are applied to those tokens.
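
The interpretation above can be sketched as a function that recovers each entity's __type__, __span__, and __value__ from tokens and IOB tags. The name `iob_to_entities` is hypothetical, and treating a stray `I-` tag as the start of a new span is an assumption of this sketch, not a rule of the CoNLL format.

```python
def iob_to_entities(tokens, tags):
    """Return (type, (start, end), value) for every span; end is exclusive."""
    entities, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue                               # still inside the current span
        if start is not None:                      # the current span ends before i
            entities.append((label, (start, i), " ".join(tokens[start:i])))
            start, label = None, None
        if tag != "O":                             # B- (or a stray I-) opens a span
            start, label = i, tag[2:]
    return entities


tokens = "Alex is going to Los Angeles in California".split()
tags = ["B-PER", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC"]

for entity in iob_to_entities(tokens, tags):
    print(entity)
```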

#### Exercise
- Write the IOB tags for the sentence `"Steve Jobs established Apple Inc."`
    - 'Steve Jobs' is PER
    - 'Apple Inc.' is ORG

```
Steve : B-PER
Jobs : I-PER
established : O
Apple : B-ORG
Inc. : I-ORG
```

## 4. Named Entity Recognition with NLTK
[NLTK](https://www.nltk.org/api/nltk.tag.html) provides implementations of popular sequence labeling algorithms for Part-of-Speech Tagging (including [HMM](https://www.nltk.org/api/nltk.tag.html#module-nltk.tag.hmm)), that can be used for Sequence Labeling in general. 

- Loading & working with CoNLL format corpora in NLTK
- Tagger training & testing (running)

To have a custom tagger that labels input text with our __custom label set__, we need to __train__ it on a corpus annotated with this __custom label set__.

Additionally, NLTK provides [Chunking](http://www.nltk.org/api/nltk.chunk.html). 

### 4.1. NLTK Pre-trained NE Chunker

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function `nltk.ne_chunk()`. If we set the parameter `binary=True`, then named entities are just tagged as `NE`; otherwise, the classifier adds category labels such as `PERSON`, `ORGANIZATION`, and `GPE`.

In [32]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\maxst\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\maxst\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
from nltk.corpus import treebank

for s in treebank.tagged_sents():
    print(s)
    print(nltk.ne_chunk(s))
    print(nltk.ne_chunk(s, binary=True))
    break

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
(S
  (PERSON Pierre/NNP)
  (ORGANIZATION Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
(S
  (NE Pierre/NNP Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)


### 4.2. Training NLTK Taggers

In [None]:
nltk.download('conll2002')

[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\maxst\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


True

In [None]:
from nltk.corpus import conll2002

print(len(conll2002.tagged_sents()))
print(conll2002._chunk_types)
print(conll2002.sents('esp.train')[0])
print(conll2002.tagged_sents('esp.train')[0])
print(conll2002.chunked_sents('esp.train')[0])
print(conll2002.iob_sents('esp.train')[0])

35651
('LOC', 'PER', 'ORG', 'MISC')
['Melbourne', '(', 'Australia', ')', ',', '25', 'may', '(', 'EFE', ')', '.']
[('Melbourne', 'NP'), ('(', 'Fpa'), ('Australia', 'NP'), (')', 'Fpt'), (',', 'Fc'), ('25', 'Z'), ('may', 'NC'), ('(', 'Fpa'), ('EFE', 'NC'), (')', 'Fpt'), ('.', 'Fp')]
(S
  (LOC Melbourne/NP)
  (/Fpa
  (LOC Australia/NP)
  )/Fpt
  ,/Fc
  25/Z
  may/NC
  (/Fpa
  (ORG EFE/NC)
  )/Fpt
  ./Fp)
[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]


In [None]:
# training hmm on training data: exactly as above
import nltk.tag.hmm as hmm

hmm_model = hmm.HiddenMarkovModelTrainer()

print(conll2002.iob_sents('esp.train')[0])

# let's get only word and iob-tag
trn_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.train')]
print(trn_sents[0])

tst_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]

hmm_ner = hmm_model.train(trn_sents)
    
# evaluation
accuracy = hmm_ner.accuracy(tst_sents)

print("Accuracy: {:6.4f}".format(accuracy))

[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
[('Melbourne', 'B-LOC'), ('(', 'O'), ('Australia', 'B-LOC'), (')', 'O'), (',', 'O'), ('25', 'O'), ('may', 'O'), ('(', 'O'), ('EFE', 'B-ORG'), (')', 'O'), ('.', 'O')]



Accuracy: 0.3760


##### NLTK Chunk Tagger Note
The HMM uses only words as input. NLTK also provides a trainable MaxEnt chunk tagger, which unfortunately requires [`megam`](http://www.umiacs.umd.edu/~hal/megam/index.html), and `megam` is very convoluted to install.

### Exercise

#### Segmentation 
Train a tagger to perform *segmentation* of input sentences into constituents
- Strip concept information from output labels (i.e. keep only IOB-prefix)
- Train tagger to predict segmentation labels
- Evaluate segmentation performance

In [None]:
# GOAL: train a Hidden Markov Model (HMM) to perform segmentation of sentences into constituents. The goal is to predict only the IOB-prefixes, not the actual entity types. This means you need to strip the entity information from the labels and keep only the "B-", "I-", or "O" parts.

import nltk.tag.hmm as hmm
from nltk.corpus import conll2002

# Initialize the HMM trainer
hmm_model = hmm.HiddenMarkovModelTrainer()

# let's get only word and iob-tag
trn_sents_seg = [[(text, iob.split('-')[0]) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.train')]
print(trn_sents_seg[0])

tst_sents_seg = [[(text, iob.split('-')[0]) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]

hmm_seg = hmm_model.train(trn_sents_seg)

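# Evaluate the model on the segmentation-only references.
# (Scoring against tst_sents would compare 'B'/'I' with full labels such as 'B-LOC';
# the accuracy recorded below was produced that way.)
accuracy = hmm_seg.accuracy(tst_sents_seg)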

print("Accuracy: {:6.4f}".format(accuracy))

[('Melbourne', 'B'), ('(', 'O'), ('Australia', 'B'), (')', 'O'), (',', 'O'), ('25', 'O'), ('may', 'O'), ('(', 'O'), ('EFE', 'B'), (')', 'O'), ('.', 'O')]
Accuracy: 0.3333


#### CoNLL Eval
The CoNLL community developed a Perl script to evaluate *segmentation* and *labeling* performance jointly using IOB information. Such evaluation gives a more accurate assessment of shallow parsing performance than token-level metrics (e.g. NLTK accuracy).

- import `evaluate` function from `conll.py` (example shown)
- evaluate tagger predictions
- compare performances to token-level accuracies
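
To see why chunk-level scores differ from token-level accuracy, consider the simplified sketch below. It is not the implementation in `conll.py` or the original Perl script; `chunks` and `chunk_prf` are hypothetical helpers that count a predicted chunk as correct only if both its span and its type match the reference exactly.

```python
def chunks(tags):
    """Return the set of (start, end, type) spans encoded by IOB tags."""
    found, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue
        if start is not None:
            found.add((start, i, label))
            start, label = None, None
        if tag != "O":
            start, label = i, tag[2:]
    return found


def chunk_prf(refs, hyps):
    """Micro-averaged chunk-level precision, recall and F1 over all sentences."""
    tp = n_ref = n_hyp = 0
    for ref, hyp in zip(refs, hyps):
        r, h = chunks(ref), chunks(hyp)
        tp += len(r & h)          # exact match on span boundaries and type
        n_ref += len(r)
        n_hyp += len(h)
    p = tp / n_hyp if n_hyp else 0.0
    r = tp / n_ref if n_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


ref = [["B-PER", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC"]]
hyp = [["B-PER", "O", "O", "O", "B-LOC", "O", "O", "B-LOC"]]

token_acc = sum(a == b for a, b in zip(ref[0], hyp[0])) / len(ref[0])
print(token_acc)            # 7 of 8 tokens are tagged correctly
print(chunk_prf(ref, hyp))  # only 2 of 3 chunks match exactly
```

A single wrong token leaves token accuracy high but breaks a whole chunk, which is why chunk-level scores are usually lower and more informative for shallow parsing.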

In [None]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate
# for nice tables
import pandas as pd

# getting references (note that it is testb this time)
refs = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testb')]
print(refs[0])
# getting hypotheses
hyps = [hmm_ner.tag(s) for s in conll2002.sents('esp.testb')]
print(hyps[0])

results = evaluate(refs, hyps)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

[('La', 'B-LOC'), ('Coruña', 'I-LOC'), (',', 'O'), ('23', 'O'), ('may', 'O'), ('(', 'O'), ('EFECOM', 'B-ORG'), (')', 'O'), ('.', 'O')]
[('La', 'B-LOC'), ('Coruña', 'I-LOC'), (',', 'O'), ('23', 'O'), ('may', 'O'), ('(', 'O'), ('EFECOM', 'B-ORG'), (')', 'O'), ('.', 'O')]


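|       | p (precision) | r (recall) | f (F1) | s (support) |
|-------|---------------|------------|--------|-------------|
| PER   | 0.691 | 0.28  | 0.399 | 735  |
| ORG   | 0.795 | 0.396 | 0.528 | 1400 |
| MISC  | 0.57  | 0.203 | 0.299 | 340  |
| LOC   | 0.03  | 0.849 | 0.058 | 1084 |
| total | 0.055 | 0.491 | 0.098 | 3559 |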


## 5. Named Entity Recognition with spaCy

In [None]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

txt = 'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'
doc = nlp(txt)

print([ent.text for ent in doc.ents])
print([(t.ent_type_, t.ent_iob_) for t in doc])



## 6. Shallow Parsing with Conditional Random Fields

[CRFs](https://en.wikipedia.org/wiki/Conditional_random_field) are a type of __discriminative undirected probabilistic graphical model__. 
They can be defined over __any__ undirected graph structure.
In Natural Language Processing, the structure is usually a *sequence* of words, and each label is conditioned on the input and on the *previous label* (transition). This is known as a __Linear Chain CRF__.

For general graphs, exact inference in CRFs is intractable. For __Linear Chain CRFs__, however, exact inference is possible, and the algorithms used are "analogous to the forward-backward and Viterbi algorithm for the case of Hidden Markov Models". 
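
To give an idea of what exact decoding in a linear chain looks like, below is a minimal Viterbi sketch over additive (log-space) scores. It is an illustration only, not the CRFsuite implementation; the function `viterbi` and all scores in the toy example are made up for this sketch.

```python
def viterbi(emissions, transitions, labels):
    """Find the highest-scoring label sequence.

    emissions:   one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    """
    # best[t][y] = score of the best path that ends in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, y)])
            scores[y] = best[-1][prev] + transitions[(prev, y)] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # follow the back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-LOC", "I-LOC"]
transitions = {(p, y): 0.0 for p in labels for y in labels}
transitions[("O", "I-LOC")] = -5.0      # a span should not start with I-
transitions[("B-LOC", "I-LOC")] = 1.0   # B- is often followed by I-

# toy scores for the tokens "to Los Angeles"
emissions = [
    {"O": 2.0, "B-LOC": 0.0, "I-LOC": 0.0},
    {"O": 0.0, "B-LOC": 1.0, "I-LOC": 1.2},   # locally, I-LOC looks slightly better
    {"O": 0.0, "B-LOC": 0.5, "I-LOC": 1.0},
]
print(viterbi(emissions, transitions, labels))
```

Choosing the best label for each token independently would tag `Los` as `I-LOC`; the transition scores make the globally best sequence start the span with `B-LOC`.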

### Python CRF Tutorials

Authors of the python packages provide tutorials in the form of notebooks already!

__Follow the tutorials to learn the tools.__

- [sklearn-crfsuite notebook](https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb)
- [python-crfsuite notebook](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb)

- `bias` is explained [here](https://github.com/scrapinghub/python-crfsuite/issues/73)

### Baseline

Let's prepare a CRF baseline for our dataset that:
- considers only the word itself and the previous tag (similar to HMM).

In [None]:
# data set
print(len(trn_sents))
print(len(tst_sents))
print(trn_sents[0])

8323
1915
[('Melbourne', 'B-LOC'), ('(', 'O'), ('Australia', 'B-LOC'), (')', 'O'), (',', 'O'), ('25', 'O'), ('may', 'O'), ('(', 'O'), ('EFE', 'B-ORG'), (')', 'O'), ('.', 'O')]


Let's copy & re-define feature extraction functions from the tutorials.

In [None]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    return {'bias': 1.0, 'word.lower()': word.lower()}

Let's inspect our baseline features.

In [None]:
sent2labels(trn_sents[0])[0]

'B-LOC'

#### Feature Extraction 

In [None]:
%%time
trn_feats = [sent2features(s) for s in trn_sents]
trn_label = [sent2labels(s) for s in trn_sents]

CPU times: total: 188 ms
Wall time: 258 ms


#### Training

In [None]:
from sklearn_crfsuite import CRF

crf = CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)

In [None]:
%%time
# workaround for scikit-learn 1.0
try:
    crf.fit(trn_feats, trn_label)
except AttributeError:
    pass

CPU times: total: 10.7 s
Wall time: 12.2 s


#### Prediction

In [None]:
tst_feats = [sent2features(s) for s in tst_sents]
pred = crf.predict(tst_feats)

#### Evaluation

We are going to use our `conll` evaluation script. (Notice that the tools themselves report token-level metrics.)

For that, we need to modify our prediction output a bit and turn each predicted tag into a (token, tag) tuple.

In [None]:
print(pred[0])

['B-LOC', 'I-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']


In [None]:
# pair every predicted tag with its token (rather than with the feature dict)
hyp = [[(tst_sents[i][j][0], t) for j, t in enumerate(tags)] for i, tags in enumerate(pred)]

In [None]:
results = evaluate(tst_sents, hyp)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

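|       | p (precision) | r (recall) | f (F1) | s (support) |
|-------|---------------|------------|--------|-------------|
| LOC   | 0.699 | 0.713 | 0.706 | 985  |
| PER   | 0.859 | 0.375 | 0.522 | 1222 |
| ORG   | 0.729 | 0.442 | 0.551 | 1700 |
| MISC  | 0.542 | 0.261 | 0.352 | 445  |
| total | 0.729 | 0.466 | 0.569 | 4352 |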


## 7. Feature Engineering

One of the strengths of CRFs lies in their ability to make use of rich feature representations. The process of extracting features from raw data is known as [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering).

Common features used in sequence labeling with CRFs are:
- part-of-speech tags
- lemmas
- token character prefixes and suffixes (e.g. first and last 1, 2, 3 characters of a word; `word[-3:]` in tutorial is suffix of length 3).
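
As a small illustration of the last item, the sketch below builds prefix and suffix features for a single word. The helper `affix_features` and its feature names are hypothetical; they only mimic the `word[-3:]` naming style of the tutorial.

```python
def affix_features(word, max_len=3):
    """Character prefixes and suffixes of length 1..max_len as a feature dict."""
    feats = {}
    for n in range(1, max_len + 1):
        if len(word) >= n:                     # skip affixes longer than the word
            feats[f"word[:{n}]"] = word[:n]    # prefix of length n
            feats[f"word[-{n}:]"] = word[-n:]  # suffix of length n
    return feats


print(affix_features("Australia"))
print(affix_features("La"))
```

Such a dict can be merged into the per-token feature dict, e.g. with `token_feats.update(affix_features(word))`.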

### 7.1. SpaCy Token Features

[spaCy](https://spacy.io/) provides a convenient way to augment our feature set with features commonly used in Natural Language Processing. 

The list of provided token-level features is available [here](https://spacy.io/api/token#attributes).

### 7.2. Adding Features to CRF

Let's modify the `sent2features` function to make use of spaCy features.

In [None]:
import spacy
from spacy.tokenizer import Tokenizer
import es_core_news_sm
nlp = es_core_news_sm.load()

# nlp = spacy.load("es_core_news_sm")
nlp.tokenizer = Tokenizer(nlp.vocab)  # to use white space tokenization (generally a bad idea for unknown data)

def sent2spacy_features(sent):
    spacy_sent = nlp(" ".join(sent2tokens(sent)))
    feats = []
    for token in spacy_sent:
        token_feats = {
            'bias': 1.0,
            'word.lower()': token.lower_,
            'pos': token.pos_,
            'lemma': token.lemma_
        }
        feats.append(token_feats)
    
    return feats

In [10]:
trn_feats = [sent2spacy_features(s) for s in trn_sents]
trn_label = [sent2labels(s) for s in trn_sents]
tst_feats = [sent2spacy_features(s) for s in tst_sents]


In [None]:
print(trn_feats[0])

In [None]:
crf = CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)

In [None]:
%%time
# workaround for scikit-learn 1.0
try:
    crf.fit(trn_feats, trn_label)
except AttributeError:
    pass

In [None]:
pred = crf.predict(tst_feats)

# pair every predicted tag with its token (rather than with the feature dict)
hyp = [[(tst_sents[i][j][0], t) for j, t in enumerate(tags)] for i, tags in enumerate(pred)]

In [None]:
results = evaluate(tst_sents, hyp)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

Regarding the last lab exercise, as for POS tags, you have to tokenize the sentences by white space. I added the code for white space tokenization in stanza, which is just a flag set to true in the pipeline. 

Moreover, some of your colleagues had issues with `spacy_conll`. These are due to the version of pandas, and I can confirm that the library works with `pandas==1.3.5`. Furthermore, depending on the version of spaCy/stanza, you may have to rename some columns of the data frame. 

## Lab Exercise

The exercise is to experiment with a CRF model on the NER task. You have to train and test a CRF using different features on the conll2002 corpus (the same one that we used in this lab). 

The features that you have to experiment with are:
- Baseline using the features in `sent2spacy_features`
    - Train the model and print results on the test set
- Add the "suffix" feature
    - Train the model and print results on the test set
- Add all the features used in the [tutorial on CoNLL dataset](https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb)
    - Train the model and print results on the test set
- Increase the feature window (number of previous and next tokens) to:
    - `[-1, +1]`
        - Train the model and print results on the test set
    - `[-2, +2]`
        - Train the model and print results on the test set

        
The format of the results has to be the table that we used so far, that is:
```python
results = evaluate(tst_sents, hyp)
pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)
```
