<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/chunking/crfsuite_ner_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CRF for named entity recognition of clinical concepts

Named entity recognition is a structured learning problem, i.e., we want to learn sequence patterns.

In the previous practical we used spaCy to build a clinical NER system. spaCy used a multilayer CNN for NER, and can also be customised to use other ANN architectures such as transformers.

There are other machine learning algorithms that can be used for this sequence learning problem. In this practical we will try conditional random fields (CRF), using sklearn-crfsuite, a package that wraps the CRFsuite library in a scikit-learn-compatible interface.

We will use data from mtsamples again, and build classifiers that find clinical concepts. 

The 'gold' standard data is *not* manually annotated. It is the output of a clinical concept recognition system developed by Zeljko Kraljevic called 'CAT' (a predecessor to MedCAT), so this data is not perfect. This system matches concepts to the entire UMLS. We will only use a few example concepts here.

Parts of this material are adapted from, or inspired by:

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

Written by Sumithra Velupillai, March 2019, updated February 2021. Updated May 2023 by Angus Roberts. Acknowledgements and many thanks to Zeljko Kraljevic for the data preparations.

In [None]:
# By default, pip will install the original sklearn_crfsuite package from PyPI.
# However, this is not compatible with more recent versions of sklearn, and is no
# longer being maintained. So we will install from a GitHub fork that is being maintained.
# You might be able to go back to the PyPI version in the future, if someone
# starts maintaining it again.
try:
  import sklearn_crfsuite
except ImportError as e:
  !pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git#egg=sklearn_crfsuite
  #!pip install sklearn_crfsuite
  import sklearn_crfsuite


from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# We use sklearn for scoring, metrics,
# and parameter searching
#import sklearn
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

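# We use scipy to make exponential continuous random variables
# when parameter searching. We import scipy.stats explicitly, as
# a bare "import scipy" is not guaranteed to load the submodule.
import scipy.stats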

# import random

# requests is a package to submit requests to URLs
# We will use it to fetch our data
import requests

# We use spacy to create our BILUO tags
import spacy
from spacy.training import offsets_to_biluo_tags

# You might choose to turn off warnings - could be for
# documents with no entities, etc
import warnings
#warnings.filterwarnings('ignore')

# 1: Preparing the data for crfsuite
Our data is in the same format as in the spaCy NER practical: a json file. We'll need a few functions to get it into the right format for crfsuite.

Our sentences will be lists of tuples, each tuple containing a token string, its POS tag, and its BIO (or BILUO) entity tag. So for the sentence "he has cancer", we will have something like this:

```
[
  ('he','PRON','O'),
  ('has','VERB','O'),
  ('cancer','NOUN','B-DISEASE')
]
```

We therefore need to get these POS tags and BIO / BILUO tags for our training data, from the json of text and entity annotations. We will use spaCy to do this. Note that we are using spaCy to do POS tagging and to convert gold standard entity annotations into BIO / BILUO tags. We are not using it to do the NER itself.

So we will exclude the NER component from spaCy when we load it.


In [None]:
# We will use a spacy pipeline to POS and BILUO tag our data.
# We do not need to have NER, as we will use CRF for that.
try:
  nlp = spacy.load('en_core_web_sm', exclude=['ner'])
except OSError as e:
  !python -m spacy download en_core_web_sm
  nlp = spacy.load('en_core_web_sm', exclude=['ner'])

Let's take a look at the pipeline:

In [None]:
print(nlp.pipe_names)

We will now write a function to load our data and get the POS tags and BILUO or BIO tags.
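
Before using spaCy for this, it may help to see what converting entity offsets to BILUO tags involves. The following is a minimal sketch only, not spaCy's implementation: the helper `offsets_to_biluo` is hypothetical, it tokenises on whitespace rather than with spaCy, and it assumes entities are given as `(start_char, end_char, label)` triples. Unlike spaCy's `offsets_to_biluo_tags`, it silently skips entities that do not line up with token boundaries.

```python
import re

def offsets_to_biluo(text, entities):
    """Hypothetical helper: map character-offset entities to BILUO tags.

    text: a string; entities: list of (start_char, end_char, label).
    Tokens are whitespace-separated chunks (a simplification).
    """
    # Each token is stored with its start and end character offsets
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', text)]
    tags = ['O'] * len(tokens)

    for start, end, label in entities:
        # Indices of tokens that lie entirely inside the entity span
        covered = [i for i, (_, s, e) in enumerate(tokens)
                   if s >= start and e <= end]
        if not covered:
            continue  # entity does not align with any token: skipped here
        if len(covered) == 1:
            tags[covered[0]] = 'U-' + label      # single-token entity
        else:
            tags[covered[0]] = 'B-' + label      # first token
            for i in covered[1:-1]:
                tags[i] = 'I-' + label           # middle tokens
            tags[covered[-1]] = 'L-' + label     # last token

    return [tok for tok, _, _ in tokens], tags

words, tags = offsets_to_biluo('he has lung cancer', [(7, 18, 'DISEASE')])
for word, tag in zip(words, tags):
    print(f'{word:10}{tag}')
```

A one-token entity such as "cancer" in "he has cancer" would receive the single tag `U-DISEASE` under this scheme.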

In [None]:
# This function loads data from a json file of documents
# and their entity annotations. It uses spaCy to get POS
# and BIO / BILUO tags for each token, and returns a list of
# sentences, with each sentence itself being a list of
# (token, POS-tag, BILUO-tag) tuples.
# The parameter bio flags whether BIO tags should be used
# instead of BILUO tags.
def get_sentences(filename, bio=False):
    
    # Read in the data
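    # Read in the data (filename is the URL of the json file)
    print('reading data: ', filename)
    r = requests.get(filename)
    r.raise_for_status()  # fail early if the download did not succeed
    data = r.json()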
 
    # List to hold our sentences
    sentences = []

    # For each document and its entity annotations
    for text, entities in data:

        # Process with spaCy. This will POS tag, and
        # allow us to create BILUO tags on the tokens.
        doc = nlp(text)

        # A handy spacy function to create a list of
        # BILUO tags for a document, from a list of entities
        tags = offsets_to_biluo_tags(doc, entities['entities'])

        # Keep track of which BILUO tag we are on
        tag_counter = 0

        # Go through the sentences in the document,
        # and the tokens in the sentence
        for sent in doc.sents:
            tagged_sentence = []
            for tok in sent:

                # Get the current tag
                tag = tags[tag_counter]

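                # Convert to BIO if the bio flag is set. Only the tag
                # prefix is rewritten, so a label that happens to
                # contain 'L-' or 'U-' is left intact.
                if bio:
                    if tag.startswith('L-'):
                        tag = 'I-' + tag[2:]
                    elif tag.startswith('U-'):
                        tag = 'B-' + tag[2:]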

                # Make a tuple for our token
                # and add it to the tagged sentences list
                w = (tok.text, tok.pos_, tag)
                tagged_sentence.append(w)

                # Move to the next BILUO tag
                tag_counter +=1

            # Add the list of token tuples for this sentence
            # to our list of all sentences
            sentences.append(tagged_sentence)

    print('done')
    return sentences

Let's use this function to read in our training and test data. There are alternative token-level representations that can be used: the BIO format (Begin, Inside, Outside) or the BILUO format (Begin, Inside, Last, Unit, Outside). What do you think is better or worse with each of these? In the function, you can choose either format with the boolean flag 'bio'. Let's start with BIO.
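
To make the difference between the two schemes concrete, here is a small sketch that converts a BILUO sequence to BIO. The helper name `biluo_to_bio` and the toy tags are illustrative assumptions, not part of the practical's data. It follows the mapping used in `get_sentences`: `L-` becomes `I-`, and `U-` becomes `B-`.

```python
def biluo_to_bio(tags):
    """Hypothetical helper: convert a list of BILUO tags to BIO tags."""
    bio_tags = []
    for tag in tags:
        if tag.startswith('L-'):
            # The last token of an entity is just another inside token in BIO
            bio_tags.append('I-' + tag[2:])
        elif tag.startswith('U-'):
            # A single-token entity is marked as a beginning in BIO
            bio_tags.append('B-' + tag[2:])
        else:
            bio_tags.append(tag)
    return bio_tags

# Toy example: a three-token entity followed by a one-token entity
biluo = ['O', 'B-DISEASE', 'I-DISEASE', 'L-DISEASE', 'O', 'U-DRUG']
bio = biluo_to_bio(biluo)
for before, after in zip(biluo, bio):
    print(f'{before:12}{after}')
```

BILUO tells the model explicitly where an entity ends and whether it is a single token, at the cost of more labels to learn from the same amount of data.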

In [None]:
train_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true', bio=True)
test_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true', bio=True)

Let's take a look at some of the data. Try a few sentence indices, to find some different BILUO / BIO tags.

In [None]:
print(train_sents[7])

# 2: Defining features for our word instances

We now need to create some features for the CRF model. With spaCy, we did not have to consider feature engineering, as the neural model learns the features. For CRF, we will create features to represent the orthographical and lexical properties of our word, and of the words on each side of it. We are hoping that you can tell whether a word is the beginning / inside / outside of an entity based on things like its part of speech, how it is cased, and those same properties of the words around it.

The features for a word will be represented by a dictionary of feature names to feature values. We will write a function that given a sentence and an index i, will create a dictionary of features for the ith word in the sentence. The function needs the whole sentence, so it can make features representing words on either side of the ith word.

In [None]:
# Parameters: a sentence and an index i.
# The sentence is a list of words, each word being
# a tuple of the form (token, POS-tag, BILUO/BIO-tag).
# Index i is the index of a word in the sentence list.
# Returns: a dictionary of features for the
# ith word in the sentence
def word2features(sent, i):

    # Get our word
    word = sent[i][0]

    # Get the POS tag for the word
    postag = sent[i][1]

    # Make some features
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),     # the lower cased word
        'word[-3:]': word[-3:],           # the last three characters of the word
        'word.isupper()': word.isupper(), # Is the word upper case?
        'word.istitle()': word.istitle(), # is the word title case?
        'word.isdigit()': word.isdigit(), # is the word a digit
        'postag': postag,                 # pos tag of the word
        'postag[:2]': postag[:2],         # first two characters of pos tag
    }
    if i > 0:                             # If the word is not the first in a sentence
        word1 = sent[i-1][0]              # add some features from the word before it
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True            # If the word is the first in a sentence
                                          # set a Beginning Of Sentence feature

    if i < len(sent)-1:                   # If the word is not the last in a sentence
        word1 = sent[i+1][0]              # add some features from the word after it
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True            # If the word is the last in a sentence
                                          # set an End Of Sentence feature

    # We've built the features dictionary, return it
    return features

Let's try this out, and see what features we get. We'll try it on a dummy sentence:

```
[
  ('An', 'DET', 'O'), 
  ('elephant', 'NOUN', 'O'), 
  ('sitting', 'VERB', 'O')
]
```

and we will look at the features for word index 2, the last word (sitting). Take a look at the features. Why might they be useful in determining if something is an entity?



In [None]:
sentence = [('An', 'DET', 'O'), ('elephant', 'NOUN', 'O'),  ('sitting', 'VERB', 'O')]
features = word2features(sentence, 2)
for k, v in features.items():
  print(f'{k:20}{v}')

# 3: Making feature and label vectors

Finally, we will define two more convenience functions, to get all the features for all the words in a sentence, and to get all the BIO / BILUO labels from a sentence. We will use these to make our feature and label vectors.

In [None]:
# Given a sentence, returns all the features
# for all the words in that sentence
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Given a sentence, return all the BIO / BILUO tags
# for all the words in the sentence
def sent2labels(sent):
    return [label for token, postag, label in sent]


Now let's create the feature and label vectors for the training and test data.

In [None]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

Let's take a look at a little bit of these vectors. We will print the feature vector for a single sentence, and the label vector for the same sentence. These are our training instance features, and the class labels, for each word in a sentence.
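
Note that these "vectors" are nested lists rather than numeric arrays: one list per sentence, holding one feature dictionary (or one label) per token. A quick sanity check that features and labels line up can be sketched as follows. The helper `check_alignment` and the toy data are assumptions for illustration; with the notebook's data you would pass `X_train` and `y_train`.

```python
def check_alignment(X, y):
    """Hypothetical helper: check that every sentence has exactly one
    label per token, and return the total number of tokens."""
    if len(X) != len(y):
        raise ValueError(f'{len(X)} feature sentences but {len(y)} label sentences')
    for n, (features, labels) in enumerate(zip(X, y)):
        if len(features) != len(labels):
            raise ValueError(f'sentence {n}: {len(features)} tokens, {len(labels)} labels')
    return sum(len(labels) for labels in y)

# Toy data in the same shape as X_train / y_train
toy_X = [
    [{'word.lower()': 'he'}, {'word.lower()': 'has'}, {'word.lower()': 'cancer'}],
    [{'word.lower()': 'no'}, {'word.lower()': 'fever'}],
]
toy_y = [['O', 'O', 'B-DISEASE'], ['O', 'B-SYMPTOM']]

n_tokens = check_alignment(toy_X, toy_y)
print('tokens:', n_tokens)
```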

In [None]:
sentence_number = 5
print("Feature vector:\n")
for features in X_train[sentence_number]:
  print(features)
print("\n\nLabel vector:\n")
print(y_train[sentence_number])

What labels do we have? What is the set of all possible labels?

In [None]:
labels = list(set(x for l in y_test for x in l))
labels

# 4: Train the model

Now we have features and labels for all of our words, we can finally train a CRF model.
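
One of the settings used in the next cell, `all_possible_transitions=True`, asks CRFsuite to also generate transition features for label pairs that never occur next to each other in the training data. This lets the model learn, for example, that `O` followed by `I-X` is unlikely. The sketch below only illustrates which transitions would be unseen. It uses toy labels and a hypothetical helper, and does not call the library.

```python
from itertools import product

def observed_transitions(label_sequences):
    """Hypothetical helper: the set of (previous, next) label pairs
    that occur in adjacent positions in any sequence."""
    seen = set()
    for seq in label_sequences:
        seen.update(zip(seq, seq[1:]))
    return seen

# Toy label sequences, in the same shape as y_train
toy_labels = [
    ['O', 'B-X', 'I-X', 'O'],
    ['B-X', 'O', 'O'],
]

label_set = sorted({tag for seq in toy_labels for tag in seq})
seen = observed_transitions(toy_labels)
unseen = set(product(label_set, repeat=2)) - seen

print('seen:  ', sorted(seen))
print('unseen:', sorted(unseen))
```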

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',             # gradient descent using the L-BFGS method
    c1=0.1,                        # coefficient for L1 regularisation
    c2=0.1,                        # coefficient for L2 regularisation
    max_iterations=100,
    all_possible_transitions=True  # Consider transitions not in the training data
)
crf.fit(X_train, y_train);

# 5: Evaluation
How does this model perform on our test data? Let's look at the f1 score first.

In [None]:
# Make predictions for the test data
y_pred = crf.predict(X_test)

# Compare the predictions to the gold standard labels
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

We can also print a classification report with more details and metrics.

In [None]:
print(metrics.flat_classification_report(y_test, y_pred, labels=labels))

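What do you think? There's a huge imbalance in the number of instances per label. Do we really want to evaluate the 'O' label? There's also one instance with an erroneous label ('-'), which spaCy's `offsets_to_biluo_tags` emits for a token whose entity offsets do not align with token boundaries. Let's look at the results without these labels.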
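
The imbalance can be quantified by counting how often each label occurs. A minimal sketch on toy labels follows; the helper `label_counts` and the data are illustrative assumptions, and with the notebook's data you would pass `y_test`.

```python
from collections import Counter
from itertools import chain

def label_counts(y):
    """Hypothetical helper: count labels over a list of label sequences."""
    return Counter(chain.from_iterable(y))

# Toy labels in the same shape as y_test
toy_y = [
    ['O', 'O', 'B-X', 'I-X', 'O'],
    ['O', 'O', 'O'],
    ['B-X', 'O', '-'],
]

counts = label_counts(toy_y)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f'{label:6}{n:4}  {n / total:.0%}')
```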

In [None]:
# We'll change our list of labels to exclude O and -
labels = list(set(x for l in y_test for x in l if x !='O' and x!='-'))
labels

In [None]:
# Now re-evaluate with our restricted list of labels
print(metrics.flat_classification_report(y_test, y_pred, labels = labels))

This was quite different! Can you explain the difference?
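
A small worked example may help explain the gap. The sketch below hand-computes a support-weighted F1 on toy token labels, once with 'O' included and once without. The functions are simplified stand-ins written for illustration, not the library's implementation, and the labels are made up.

```python
from collections import Counter

def per_label_f1(y_true, y_pred, label):
    """F1 for a single label over flat lists of tags."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred, labels):
    """Average of per-label F1, weighted by each label's gold count."""
    support = Counter(t for t in y_true if t in labels)
    total = sum(support.values())
    if total == 0:
        return 0.0
    return sum(support[l] * per_label_f1(y_true, y_pred, l) for l in labels) / total

# Toy data: eight 'O' tokens and one two-token entity.
# The model misses the first token of the entity.
y_true = ['O'] * 8 + ['B-X', 'I-X']
y_pred = ['O'] * 8 + ['O', 'I-X']

with_o = weighted_f1(y_true, y_pred, ['O', 'B-X', 'I-X'])
without_o = weighted_f1(y_true, y_pred, ['B-X', 'I-X'])
print(f'weighted F1 including O: {with_o:.3f}')
print(f'weighted F1 excluding O: {without_o:.3f}')
```

Because 'O' dominates the support and is easy to predict, including it pulls the weighted average up and hides errors on the entity labels.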


# 6: Changing the tag scheme

Try training this model with the BILUO scheme instead. We can do this by calling the get_sentences function with the boolean flag 'bio' set to False, which keeps spaCy's BILUO tags instead of converting them to BIO. Are the results better or worse?

In [None]:
training_file = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true'
test_file = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true'
train_sents = get_sentences(training_file, bio=False)
test_sents = get_sentences(test_file, bio=False)
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

new_classes = list(set(x for l in y_test for x in l if x !='O' and x!='-'))

c1=0.1
c2=0.1

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=c1,
    c2=c2,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels = new_classes))

# Optional: cross-validation to find best parameters with crfsuite
So far we have used fixed, hand-picked regularisation parameters (c1 = c2 = 0.1). We can try to find better values on the training data by cross-validation.

__This takes some time, 20 - 30 minutes (even with only 3 folds)!__ 

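You might make it a bit faster by re-reading your data, this time reverting to BIO tags, as the smaller label set makes training cheaper. If you do, remember to rebuild `X_train`, `y_train` and `new_classes` from the BIO-tagged data.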
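
The search in the next cell draws each candidate `c1` and `c2` from an exponential distribution, so small values are tried often and large values only occasionally. With `n_iter=50` and `cv=3` that means 50 candidates times 3 folds, i.e. 150 model fits, plus a final refit on all the training data, which is why it is slow. The sketch below illustrates only the sampling step, using the standard library's `random.expovariate` as a stand-in for `scipy.stats.expon`. The helper `sample_params` is hypothetical.

```python
import random

def sample_params(rng, c1_scale=0.5, c2_scale=0.05):
    """Hypothetical helper: draw one (c1, c2) candidate.

    expovariate takes a rate, which is 1 / scale, so the mean
    of each draw equals the corresponding scale.
    """
    return {
        'c1': rng.expovariate(1 / c1_scale),
        'c2': rng.expovariate(1 / c2_scale),
    }

rng = random.Random(0)  # seeded so the draws are repeatable

# A few candidates, as a randomised search would try them
for _ in range(5):
    print(sample_params(rng))

# Over many draws, the average approaches the scale
many = [sample_params(rng) for _ in range(10000)]
mean_c1 = sum(p['c1'] for p in many) / len(many)
mean_c2 = sum(p['c2'] for p in many) / len(many)
print(f'mean c1: {mean_c1:.3f}, mean c2: {mean_c2:.3f}')
```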

In [None]:
# from: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#hyperparameter-optimization

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=new_classes)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

In [None]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)

In [None]:
crf = rs.best_estimator_
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=new_classes, digits=3
))

What do you think? Are there other parameters that could be tested in the cross-validation setup? What about the measure used for optimisation?
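
One alternative measure to consider is entity-level F1, where a prediction only counts if the whole entity span and its type match the gold standard. The sketch below is one possible way to compute it, assuming BIO tags; the helpers `bio_spans` and `span_f1` are hypothetical and not part of sklearn-crfsuite. An `I-` tag that does not continue the current entity is treated leniently as starting a new one.

```python
def bio_spans(tags):
    """Hypothetical helper: extract (start, end, type) spans from BIO tags.
    end is exclusive. Tags other than B-/I- (such as 'O' or '-') close a span."""
    spans = set()
    start, etype = None, None
    # A sentinel 'O' at the end closes any entity still open
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, label = tag.partition('-')
        continues = prefix == 'I' and label == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if prefix == 'B' or (prefix == 'I' and start is None):
            start, etype = i, label
    return spans

def span_f1(gold_sequences, pred_sequences):
    """Exact-match entity-level F1 over a list of sentences."""
    gold = {(n,) + s for n, seq in enumerate(gold_sequences) for s in bio_spans(seq)}
    pred = {(n,) + s for n, seq in enumerate(pred_sequences) for s in bio_spans(seq)}
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy example: three of four tokens are right, but the entity is cut short
gold = ['O', 'B-X', 'I-X', 'O']
pred = ['O', 'B-X', 'O', 'O']
print('gold spans:', bio_spans(gold))
print('pred spans:', bio_spans(pred))
print('entity-level F1:', span_f1([gold], [pred]))
```

A token-level measure gives partial credit for the truncated entity here, while the entity-level measure gives none, so the two can rank models differently.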