# Lab 6: Information Extraction

In this lab, you'll learn how to carry out the two main steps of information extraction: named entity recognition, and relation extraction.


### Aims
* Know how to train an NER sequence tagger using a CRF.
* Learn how to construct a relation classifier for extracting relations between named entities.
* Understand how syntactic features can be used in NER and relation extraction (RE).

### Outline

* Loading the re3d corpus.
* Training and testing a CRF NER tagger.
* Experimenting with PoS tags and dependency parse features.
* Training a naive Bayes relation classifier.

### How To Complete This Lab

Read the text and the code then look for 'TODOs' that instruct you to complete some missing code or answer a question. You don't have to stick rigidly to the lab -- feel free to explore other methods and data to help you understand what's going on or to go beyond this lab. 

Aim to work through the lab during the scheduled lab hour. You can also contact TAs with questions at the scheduled times throughout the week, or post your questions to our Teams conversation.

The labs *will not be marked*. However, they will prepare you for the coursework, so try to keep up with the weekly labs and have fun with the exercises!

### More Information

This lab relates to [Chapter 17 on Information Extraction from Jurafsky and Martin's 2020 draft of Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/17.pdf).

For more on the NLTK library related to this lab, please see [chapters 5 (tagging words)](https://www.nltk.org/book/ch05.html) and [7 (extracting information from text)](https://www.nltk.org/book/ch07.html) of the NLTK book. For details of the sequence tagger implementations, refer to [the NLTK documentation](http://www.nltk.org/api/nltk.tag.html?highlight=hmm).


First, run the cells below to install python-crfsuite and import the parts of NLTK and scikit-learn that we need.

In [6]:
!conda install -y python-crfsuite

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [7]:
# these magic lines make the notebook reload any module from an external file that changes.
# They're useful if you are working on an external script as well as the notebook
%load_ext autoreload
%autoreload 2

import numpy as np

import matplotlib.pyplot as plt
from IPython import display

import nltk
from nltk.tag import CRFTagger

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import os

Note that NLTK's CRFTagger depends on python-crfsuite, which is why the first cell above installs it.

# 1 Named Entity Recognition (NER)

How can we extract information from text? The first step is usually to identify the entities involved, i.e., the people, places, organisations, times and other subjects discussed in the text. NER is the task of tagging the entities in a piece of text, which is usually modelled as a BIO sequence labelling task. In a BIO task, we use different tags to differentiate the beginning ('B' tag) and inside ('I' tag) of a named entity span as well as tokens outside ('O' tag) any span. This allows us to extract named entities that span multiple tokens.
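To make the BIO scheme concrete, here is a minimal sketch that converts entity spans into BIO tags. The function name `spans_to_bio`, the span format (start, exclusive end, type) and the shortened example sentence are assumptions made for illustration; they are not part of the lab code or the re3d loader.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end, entity_type) tuples, with `end` exclusive.
    Assumes that the spans do not overlap.
    """
    tags = ['O'] * len(tokens)  # every token starts outside any entity
    for start, end, entity_type in spans:
        tags[start] = f'B-{entity_type}'  # first token of the span
        for i in range(start + 1, end):
            tags[i] = f'I-{entity_type}'  # remaining tokens of the span
    return tags


# An invented, shortened sentence with two multi-token Location entities.
tokens = ['Hibhib', 'Village', 'is', 'in', 'northern', 'Iraq', '.']
spans = [(0, 2, 'Location'), (4, 6, 'Location')]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f'{token}\t{tag}')
```

Because each span starts with its own 'B-' tag, two adjacent entities of the same type can still be told apart, which a plain inside/outside labelling could not do.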

Let's start by loading some data with NER tags.

In this lab, we'll work with part of the [re3d dataset](https://github.com/dstl/re3d), which contains articles from government and news websites.

In [8]:
from lab6_data.load_re3d import load_re3d_data

sentences, tags, tag_names_to_idx, tag_idx_to_names, relation_pairs = load_re3d_data()

Found 98 documents
Loaded 2433 non-overlapping entities in total.
We have loaded a dataset with 953 sentences.


Now we can look at an example of some text and the kind of entity tags we would like to apply. 

Note: if performance is poor, we can switch to a simpler BIO task without entity types.

In [9]:
# Print an example of the text aligned with the tags (use tag names)
tagged_words = zip(sentences[0], tag_idx_to_names[tags[0]])
for word, tag in tagged_words:
    print(f'{word}\t\t\t{tag}')
        

Hibhib			B-Location
(			O
Arabic			B-Nationality
:			O
ناحية			O
هبهب‎‎			O
,			O
Hibhib			B-Location
Village			I-Location
)			O
is			O
a			O
village			O
in			O
northern			B-Location
Iraq			I-Location
,			O
located			O
8			B-Quantity
km			I-Quantity
(			O
5.0			B-Quantity
mi			I-Quantity
)			O
northwest			O
of			O
Baquba			B-Location
.			O


Let's create a training and test split, then show some basic counts of the data:

In [10]:
# Use the commented out code below if you want to remove non-ascii characters.

# def is_english(s):
#     try:
#         s.encode(encoding='utf-8').decode('ascii')
#     except UnicodeDecodeError:
#         return False
#     else:
#         return True
    
# for i, sent in enumerate(sentences):
#     for j, tok in enumerate(sent):
#         if not is_english(tok):
#             sentences[i][j] = 'BLANK'


# *** split dataset into train and test
data = []
for i in range(len(sentences)):
    tagged_words = zip(sentences[i], tag_idx_to_names[tags[i]])
    data.append(list(tagged_words))

tagged_sents_with_rels = list(zip(data, relation_pairs))

train_set_with_rels, test_set_with_rels = train_test_split(
    tagged_sents_with_rels,
    train_size=0.80,
    test_size=0.20,
    random_state=101
)

# separate out the relations data from the sentences and tags
train_set = [tagged_sent for tagged_sent, _ in train_set_with_rels]
test_set = [tagged_sent for tagged_sent, _ in test_set_with_rels]

train_rels = [rels for _, rels in train_set_with_rels]
test_rels = [rels for _, rels in test_set_with_rels]


print(f'Number of training sentences: {len(train_set)}')
print(f'Number of test sentences: {len(test_set)}')

Number of training sentences: 762
Number of test sentences: 191


In [11]:
train_tagged_words = [ tup for sent in train_set for tup in sent ]
test_tagged_words = [ tup for sent in test_set for tup in sent ]

print(f'Number of tagged words in the training set: {len(train_tagged_words)}')
print(f'Number of tagged words in the test set: {len(test_tagged_words)}')

Number of tagged words in the training set: 19506
Number of tagged words in the test set: 4799


Now, print out the different named entity tags. Notice that each entity type has both a 'B-' and an 'I-' tag. The labels include a lot of specialised types of entities for extracting information from security and defence news.

In [12]:
ne_tags = {tag for word, tag in train_tagged_words}  # a set of possible tags
ne_tags = list(ne_tags)  # let's turn the set into a list as it is more useful in future steps.
print(f'Number of possible tags: {len(ne_tags)}')
print(f'Possible tags: {ne_tags}')

Number of possible tags: 27
Possible tags: ['I-Person', 'B-DocumentRefere', 'B-Money', 'I-CommsIdentifie', 'O', 'B-MilitaryPlatfo', 'B-Nationality', 'I-Quantity', 'B-CommsIdentifie', 'B-Temporal', 'B-Weapon', 'B-Frequency', 'I-Money', 'I-Organisation', 'B-Quantity', 'I-Weapon', 'B-Organisation', 'I-DocumentRefere', 'I-MilitaryPlatfo', 'B-Location', 'I-Nationality', 'I-Vehicle', 'I-Location', 'B-Person', 'B-Vehicle', 'I-Frequency', 'I-Temporal']


Let's see an example sentence from the training set:

In [13]:
print('Sentence example: {}'.format(train_set[0]))

Sentence example: [('Last', 'B-Temporal'), ('year', 'I-Temporal'), ("'s", 'O'), ('influx', 'O'), ('of', 'O'), ('hundreds', 'B-Quantity'), ('of', 'I-Quantity'), ('thousands', 'I-Quantity'), ('to', 'O'), ('Europe', 'O'), ('partly', 'O'), ('resulted', 'O'), ('from', 'O'), ('cuts', 'O'), ('to', 'O'), ('food', 'O'), ('aid', 'O'), ('and', 'O'), ('cash', 'O'), ('payments', 'O'), ('.', 'O')]


Now, let's train a CRF tagger on our training set. The method you need to use from NLTK is the [train method of the conditional random field (CRF)](https://www.nltk.org/_modules/nltk/tag/crf.html). The interface differs from that of the HMM tagger, so you may need to refer to the documentation. Briefly, you need to call the constructor with default arguments, then the train() function.

**TODO 1.1: Write a function to train and return a CRF named entity recogniser.**

In [14]:
# Train a CRF NER tagger
def train_CRF_NER_tagger(train_set):
    ### WRITE YOUR OWN CODE HERE
    tagger = nltk.tag.CRFTagger()
    tagger.train(train_set, 'model.crf.tagger')
    return tagger  # return the trained model

tagger = train_CRF_NER_tagger(train_set)

Now, let's see how well it performs. We do prediction in the same way as with the HMM in lab 5. First, we need to get the test data into the right format as a list of sentences, where each sentence is a list of tokens. 

**TODO 1.2: Complete the function below to convert the test data to the correct format and predict the tags on the test set.**

In [15]:
# Test
def tag_test_set(test_set, tagger):
    ### WRITE YOUR OWN CODE HERE
    test_sents = [[token for token,tag in sent] for sent in test_set]
    predicted_tags = tagger.tag_sents(test_sents)
    ###
    return predicted_tags
test_sents_with_predicted_tags = tag_test_set(test_set,tagger)
print(test_sents_with_predicted_tags[:2])  # Print two tagged sentences

[[('The', 'B-Organisation'), ('SDF', 'I-Organisation'), (',', 'O'), ('made', 'O'), ('up', 'O'), ('in', 'O'), ('part', 'O'), ('by', 'O'), ('local', 'B-Organisation'), ('Arabs', 'I-Organisation'), ('and', 'I-Organisation'), ('its', 'I-Organisation'), ('Coalition', 'I-Organisation'), ('trained', 'O'), ('and', 'O'), ('equipped', 'O'), ('Arab', 'O'), ('component', 'O'), (',', 'O'), ('the', 'B-Organisation'), ('Syrian', 'I-Organisation'), ('Arab', 'I-Organisation'), ('Coalition', 'I-Organisation'), (',', 'O'), ('and', 'O'), ('supported', 'O'), ('by', 'O'), ('Coalition', 'B-Organisation'), ('advisers', 'I-Organisation'), ('and', 'O'), ('air', 'O'), ('strikes', 'O'), ('began', 'O'), ('the', 'O'), ('operation', 'O'), ('to', 'O'), ('isolate', 'O'), ('Raqqah', 'O'), ('on', 'O'), ('Nov.', 'B-Temporal'), ('5', 'I-Temporal'), ('.', 'O')], [('``', 'O'), ('``', 'O'), ("''", 'O'), ('Now', 'O'), ('these', 'O'), ('people', 'O'), ('live', 'O'), ('in', 'O'), ('dignity', 'O'), (',', 'O'), ("''", 'O'), ("''"

Let's see how well the tagger is performing. In NER, we evaluate performance by finding correctly matched entities, rather than correctly tagged tokens. Only an exact entity match counts as correct. Therefore, we compute precision, recall and F1 score by counting true positives, false positives and false negatives over the predicted entity spans and the gold-labelled entity spans in the test set.

The code below contains a function that extracts a list of spans from the tagged sentences. The next function calls extract_spans() and computes the precision, recall and F1 scores. However, the function is incomplete.

**TODO 1.3: Complete the cal_span_level_f1() function below to compute span-level F1 scores for the predictions.** 

Have a look at the results. Which types of entity are being recognised well and which are very poor?

In [16]:
def extract_spans(tagged_sents, ne_tags):
    """
    Extract a list of tagged spans for each named entity type, 
    where each span is represented by a tuple containing the 
    start token index and the end token index (exclusive).
    
    returns: a dictionary containing a list of spans for each entity type.
    """
    spans = {}
    for ne_tag in ne_tags:
        if ne_tag == 'O':
            continue

        spans[ne_tag[2:]] = []  # create an empty list to store the spans of each type
        
    for sent in tagged_sents:
        start = -1
        entity_type = None
        for i, (tok, lab) in enumerate(sent):
            if 'B-' in lab:
                start = i
                end = i + 1
                entity_type = lab[2:]
            elif 'I-' in lab:
                end = i + 1
            elif lab == 'O' and start >= 0:
                spans[entity_type].append((start, end))
                start = -1
                
    return spans


def cal_span_level_f1(test_sents, test_sents_with_pred, ne_tags):
    # get a list of spans from the test set labels
    gold_spans = extract_spans(test_sents, ne_tags)

    # get a list of spans predicted by our tagger
    pred_spans = extract_spans(test_sents_with_pred, ne_tags)
    
    # compute the metrics for each class:
    f1_per_class = []
    
    ne_types = gold_spans.keys()  # get the list of named entity types (not the tags)
    
    for ne_type in ne_types:
        # compute the confusion matrix
        true_pos = 0
        false_pos = 0
        
        ### WRITE YOUR OWN CODE HERE TO COUNT TRUE POSITIVES, FALSE POSITIVES, AND FALSE NEGATIVES.
        for span in pred_spans[ne_type]:
            if span in gold_spans[ne_type]:
                true_pos += 1
            else:
                false_pos += 1
                
        false_neg = 0
        for span in gold_spans[ne_type]:
            if span not in pred_spans[ne_type]:
                false_neg += 1
                
        ### 
                
        if true_pos + false_pos == 0:
            precision = 0
        else:
            precision = true_pos / float(true_pos + false_pos)
            
        if true_pos + false_neg == 0:
            recall = 0
        else:
            recall = true_pos / float(true_pos + false_neg)
        
        if precision + recall == 0:
            f1 = 0
        else:
            f1 = 2 * precision * recall / (precision + recall)
            
        f1_per_class.append(f1)
        print(f'F1 score for class {ne_type} = {f1}')
        
    print(f'Macro-average f1 score = {np.mean(f1_per_class)}')

cal_span_level_f1(test_set, test_sents_with_predicted_tags, ne_tags)

F1 score for class Person = 0.7350427350427351
F1 score for class DocumentRefere = 0
F1 score for class Money = 0
F1 score for class CommsIdentifie = 0
F1 score for class MilitaryPlatfo = 0.16666666666666669
F1 score for class Nationality = 0.5
F1 score for class Quantity = 0
F1 score for class Temporal = 0.5714285714285714
F1 score for class Weapon = 0.22222222222222224
F1 score for class Frequency = 0
F1 score for class Organisation = 0.7017543859649124
F1 score for class Location = 0.45454545454545453
F1 score for class Vehicle = 0
Macro-average f1 score = 0.257820002759274
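As a cross-check on span-level evaluation, the sketch below computes precision, recall and F1 from sets of spans. It is a simplified variant rather than the lab's function: the names `bio_to_spans` and `span_prf` and the toy sentences are invented for illustration, the scores are pooled over all entity types instead of being reported per class, and each span carries its sentence index so that identical (start, end) offsets in different sentences are not counted as a match.

```python
def bio_to_spans(tagged_sents):
    """Return a set of (sentence_idx, start, end, entity_type), with `end` exclusive."""
    spans = set()
    for s, sent in enumerate(tagged_sents):
        start, entity_type = None, None
        # A final 'O' sentinel closes any span that reaches the end of the sentence.
        for i, (_, label) in enumerate(list(sent) + [('', 'O')]):
            continues = (label.startswith('I-') and start is not None
                         and label[2:] == entity_type)
            if start is not None and not continues:
                spans.add((s, start, i, entity_type))  # close the open span
                start, entity_type = None, None
            if label.startswith('B-'):
                start, entity_type = i, label[2:]  # open a new span
            # An 'I-' tag with no matching open span is ignored in this sketch.
    return spans


def span_prf(gold_sents, pred_sents):
    """Span-level precision, recall and F1, pooled over all entity types."""
    gold, pred = bio_to_spans(gold_sents), bio_to_spans(pred_sents)
    true_pos = len(gold & pred)  # exact matches only
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented toy data: the first predicted span starts one token too early.
gold = [[('the', 'O'), ('Mosul', 'B-Location'), ('Dam', 'I-Location')],
        [('Iraq', 'B-Location'), ('signed', 'O')]]
pred = [[('the', 'B-Location'), ('Mosul', 'I-Location'), ('Dam', 'I-Location')],
        [('Iraq', 'B-Location'), ('signed', 'O')]]

print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

The boundary error in the first sentence costs both a false positive and a false negative, which is why exact-match span scores are usually lower than token-level accuracy.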


The code below prints a window of tokens around each tagging error, along with the predicted and gold labels. 

**TODO 1.4: Look at the errors below. Can you see any common patterns that you think might be causing some of the errors? Post your suggestions to the lab bubble chat.**

In [17]:
# some code to print out errors made by one of the sequence taggers. 
window_size = 3
for i, sent in enumerate(test_set):
    token_shown = -1
    
    for j, (tok, label) in enumerate(sent):
        predicted_label = test_sents_with_predicted_tags[i][j][1]

        if j < token_shown + window_size:
            continue
        
        token_shown = -1
        
        if label != predicted_label:
            start = j - window_size
            if start < 0:
                start = 0
                
            end = j + window_size
            if end > len(sent):
                end = len(sent)
            
            print('Error found:')
            
            text = [tok for tok, lab in test_sents_with_predicted_tags[i][start:end]]
            preds = [lab for tok, lab in test_sents_with_predicted_tags[i][start:end]]
            gold = [lab for tok, lab in sent[start:end]]
            print(f'       Text: {text}')
            print(f' Prediction: {preds}')
            print(f'Gold labels: {gold}')
            print()


Error found:
       Text: ['by', 'local', 'Arabs', 'and', 'its', 'Coalition']
 Prediction: ['O', 'B-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation']
Gold labels: ['O', 'B-Organisation', 'I-Organisation', 'O', 'O', 'O']

Error found:
       Text: ['local', 'Arabs', 'and', 'its', 'Coalition', 'trained']
 Prediction: ['B-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'O']
Gold labels: ['B-Organisation', 'I-Organisation', 'O', 'O', 'O', 'O']

Error found:
       Text: ['Arabs', 'and', 'its', 'Coalition', 'trained', 'and']
 Prediction: ['I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'O', 'O']
Gold labels: ['I-Organisation', 'O', 'O', 'O', 'O', 'O']

Error found:
       Text: ['trained', 'and', 'equipped', 'Arab', 'component', ',']
 Prediction: ['O', 'O', 'O', 'O', 'O', 'O']
Gold labels: ['O', 'O', 'O', 'B-Organisation', 'I-Organisation', 'O']

Error found:
       Text: ['and', 'equipped', 

Error found:
       Text: ['condolences', 'to', 'this', 'hero', "'s", 'family']
 Prediction: ['O', 'O', 'O', 'O', 'O', 'O']
Gold labels: ['O', 'O', 'B-Person', 'I-Person', 'O', 'O']

Error found:
       Text: ['commander', 'of', 'Combined', 'Joint', 'Task', 'Force']
 Prediction: ['O', 'O', 'B-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation']
Gold labels: ['O', 'O', 'B-Organisation', 'B-Organisation', 'I-Organisation', 'I-Organisation']

Error found:
       Text: ['Joint', 'Task', 'Force', '-', 'Operation', 'Inherent']
 Prediction: ['I-Organisation', 'I-Organisation', 'I-Organisation', 'O', 'O', 'O']
Gold labels: ['B-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation']

Error found:
       Text: ['Task', 'Force', '-', 'Operation', 'Inherent', 'Resolve']
 Prediction: ['I-Organisation', 'I-Organisation', 'O', 'O', 'O', 'O']
Gold labels: ['I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organis

Error found:
       Text: ['the', 'first', 'time', 'the', 'U.S.', 'Secretary']
 Prediction: ['O', 'O', 'O', 'O', 'B-Person', 'I-Person']
Gold labels: ['O', 'O', 'O', 'B-Person', 'I-Person', 'I-Person']

Error found:
       Text: ['first', 'time', 'the', 'U.S.', 'Secretary', 'of']
 Prediction: ['O', 'O', 'O', 'B-Person', 'I-Person', 'I-Person']
Gold labels: ['O', 'O', 'B-Person', 'I-Person', 'I-Person', 'I-Person']

Error found:
       Text: ['behalf', 'of', ',', 'a', 'terrorist', 'organization']
 Prediction: ['O', 'O', 'O', 'O', 'O', 'O']
Gold labels: ['O', 'O', 'O', 'B-Organisation', 'I-Organisation', 'I-Organisation']

Error found:
       Text: ['of', ',', 'a', 'terrorist', 'organization', '.']
 Prediction: ['O', 'O', 'O', 'O', 'O', 'O']
Gold labels: ['O', 'O', 'B-Organisation', 'I-Organisation', 'I-Organisation', 'O']

Error found:
       Text: [',', 'a', 'terrorist', 'organization', '.']
 Prediction: ['O', 'O', 'O', 'O', 'O']
Gold labels: ['O', 'B-Organisation', 'I-Organisation', '

       Text: [',', 'legitimate', 'ground', 'forces', ',', 'focused']
 Prediction: ['O', 'O', 'O', 'O', 'O', 'O']
Gold labels: ['I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'O', 'O']

Error found:
       Text: ['the', 'construction', 'of', 'the', 'Mosul', 'Dam']
 Prediction: ['O', 'O', 'O', 'O', 'B-Location', 'I-Location']
Gold labels: ['O', 'O', 'O', 'B-Location', 'I-Location', 'I-Location']

Error found:
       Text: ['construction', 'of', 'the', 'Mosul', 'Dam', 'upstream']
 Prediction: ['O', 'O', 'O', 'B-Location', 'I-Location', 'I-Location']
Gold labels: ['O', 'O', 'B-Location', 'I-Location', 'I-Location', 'O']

Error found:
       Text: ['the', 'Mosul', 'Dam', 'upstream', 'and', 'several']
 Prediction: ['O', 'B-Location', 'I-Location', 'I-Location', 'O', 'O']
Gold labels: ['B-Location', 'I-Location', 'I-Location', 'O', 'O', 'O']

Error found:
       Text: ['several', 'other', 'large', 'dams', 'in', 'Turkey']
 Prediction: ['O', 'O', 'O', 'O', 'O', 'O']
Gol

Error found:
       Text: ['injuries', 'and', 'I', 'ask', 'myself', ',']
 Prediction: ['O', 'O', 'B-Person', 'I-Person', 'O', 'O']
Gold labels: ['O', 'O', 'B-Person', 'O', 'O', 'O']

Error found:
       Text: ['s', 'tenure', 'with', 'the', 'command', '.']
 Prediction: ['O', 'O', 'O', 'O', 'O', 'O']
Gold labels: ['O', 'O', 'O', 'B-Organisation', 'I-Organisation', 'O']

Error found:
       Text: ['tenure', 'with', 'the', 'command', '.']
 Prediction: ['O', 'O', 'O', 'O', 'O']
Gold labels: ['O', 'O', 'B-Organisation', 'I-Organisation', 'O']

Error found:
       Text: ['the', 'United', 'States', 'and', 'Iraq', 'signed']
 Prediction: ['B-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'O']
Gold labels: ['B-Organisation', 'I-Organisation', 'I-Organisation', 'O', 'B-Location', 'O']

Error found:
       Text: ['United', 'States', 'and', 'Iraq', 'signed', 'a']
 Prediction: ['I-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation', 'O', 'O']
Go

Error found:
       Text: ['strike', 'destroyed', 'a', 'VBIED', 'factory', '.']
 Prediction: ['O', 'O', 'B-Weapon', 'I-Weapon', 'I-Weapon', 'O']
Gold labels: ['O', 'O', 'B-Location', 'I-Location', 'I-Location', 'O']

Error found:
       Text: ['destroyed', 'a', 'VBIED', 'factory', '.']
 Prediction: ['O', 'B-Weapon', 'I-Weapon', 'I-Weapon', 'O']
Gold labels: ['O', 'B-Location', 'I-Location', 'I-Location', 'O']

Error found:
       Text: ['transition', 'away', 'from', 'the', 'murderous', 'regime']
 Prediction: ['O', 'O', 'O', 'B-Organisation', 'I-Organisation', 'I-Organisation']
Gold labels: ['O', 'O', 'O', 'O', 'O', 'B-Organisation']

Error found:
       Text: ['away', 'from', 'the', 'murderous', 'regime', 'of']
 Prediction: ['O', 'O', 'B-Organisation', 'I-Organisation', 'I-Organisation', 'I-Organisation']
Gold labels: ['O', 'O', 'O', 'O', 'B-Organisation', 'I-Organisation']

Error found:
       Text: ['from', 'the', 'murderous', 'regime', 'of', 'Assad']
 Prediction: ['O', 'B-Organisati

We can try to help the CRF tagger by adding some more features. Part-of-speech tags often provide useful information for identifying entities. The code below includes a modified CRFTagger class that adds a PoS feature to the feature vector for each word. 

In [18]:
# *** Improve the CRF NER tagger using parts of speech (see lab 5) as additional features.
class CRFTaggerWithPOS(nltk.tag.CRFTagger):
    _current_tokens = None
    
    def _get_features(self, tokens, index):
        """
        Extract the features for a token and append the POS tag as an additional feature.
        """
        basic_features = super()._get_features(tokens, index)
        
        if tokens != self._current_tokens:
            self._pos_tagged_tokens = nltk.pos_tag(tokens)
            self._current_tokens = tokens
            
        basic_features.append(self._pos_tagged_tokens[index][1])
            
        return basic_features

**TODO 1.5: Complete the training function below to use the new tagger class with PoS features. Then use the training function to train a tagger, predict the tags for the test set and compute the span-level F1 scores.**

In [19]:
# Train
def train_CRF_NER_tagger_POS(train_set):
    ### WRITE YOUR OWN CODE HERE
    tagger = CRFTaggerWithPOS()
    tagger.train(train_set, 'model.crf.tagger')
    return tagger  # return the trained model

### WRITE YOUR OWN CODE HERE
tagger_with_POS = train_CRF_NER_tagger_POS(train_set)
print(tagger_with_POS)


# Test
predicted_tags_with_POS = tag_test_set(test_set, tagger_with_POS)
print(predicted_tags_with_POS[0])


cal_span_level_f1(test_set, predicted_tags_with_POS, ne_tags)
###

<__main__.CRFTaggerWithPOS object at 0x7fed1c648e10>
[('The', 'B-Organisation'), ('SDF', 'I-Organisation'), (',', 'O'), ('made', 'O'), ('up', 'O'), ('in', 'O'), ('part', 'O'), ('by', 'O'), ('local', 'O'), ('Arabs', 'O'), ('and', 'O'), ('its', 'O'), ('Coalition', 'O'), ('trained', 'O'), ('and', 'O'), ('equipped', 'O'), ('Arab', 'O'), ('component', 'O'), (',', 'O'), ('the', 'B-Organisation'), ('Syrian', 'I-Organisation'), ('Arab', 'I-Organisation'), ('Coalition', 'I-Organisation'), (',', 'O'), ('and', 'O'), ('supported', 'O'), ('by', 'O'), ('Coalition', 'B-Organisation'), ('advisers', 'I-Organisation'), ('and', 'O'), ('air', 'O'), ('strikes', 'O'), ('began', 'O'), ('the', 'O'), ('operation', 'O'), ('to', 'O'), ('isolate', 'O'), ('Raqqah', 'O'), ('on', 'O'), ('Nov.', 'B-Temporal'), ('5', 'I-Temporal'), ('.', 'O')]
F1 score for class Person = 0.746031746031746
F1 score for class DocumentRefere = 0
F1 score for class Money = 0.22222222222222224
F1 score for class CommsIdentifie = 0
F1 score

**OPTIONAL TODO 1.6: Use the code from above to print out some labelling errors for the CRF tagger with PoS tags. Have the types of errors that are present changed? How do the different features affect the NER results?**

Now, we can try adding in some more features to improve the tagger further. The code below starts the CoreNLP server,
which we will use to provide dependency parse features.

Hint: if the server does not restart after it was previously running, the old version may still be running. You can kill off any Java processes to allow the server to start again.

In [20]:
# CoreNLP needs Java to be on your path.
# java_path = "C:/Program Files/Java/jre1.8.0_281/bin/java.exe"  # you may need to replace this with the path on your PC
# os.environ['JAVAHOME'] = java_path

# Add Ghostscript to the path so that the trees can be drawn.
# gs_path = "C:/Program Files/gs/gs9.53.3/bin"  # you may need to replace this with the path on your PC
# os.environ['PATH'] += os.pathsep + gs_path  # append rather than overwrite the existing PATH

In [21]:
from nltk.parse.corenlp import CoreNLPServer, CoreNLPParser
from nltk.parse.corenlp import CoreNLPDependencyParser
import os

# Stanford Core NLP runs as a server on the local machine. 
# The NLTK wrapper will make calls to this server to parse our text.
STANFORD = "./stanford-corenlp-4.2.0"
# You may need to replace it with the path on your PC or use a relative path
server = CoreNLPServer(
   os.path.join(STANFORD, "stanford-corenlp-4.2.0.jar"),
   os.path.join(STANFORD, "stanford-corenlp-4.2.0-models.jar"),    
)
# server.stop()  # in case it's already running

# Start the server in the background
server.start()

CoreNLPServerError: Could not connect to the server.

Below we have a new CRF class that includes some dependency parse tree features as well as the PoS tags. 
For each token, we add a feature for the dependency relation type that connects the token to its head. We also add
the tag of the head of that word (i.e., the parent node in the dependency parse tree).

**TODO 1.7: Complete the training function and use the new class to train and test a tagger with dependency features.**

In [22]:
# *** Improve the CRF NER tagger using dependency features (see lab 5).
# Useful explanation: https://www.analyticsvidhya.com/blog/2020/07/part-of-speechpos-tagging-dependency-parsing-and-constituency-parsing-in-nlp/

from time import sleep 

class CRFTaggerWithDeps(CRFTaggerWithPOS):
    _dep_parser = CoreNLPDependencyParser()
    
    def _get_features(self, tokens, index):
        """
        Extract the features for a token (including the POS tag from the parent class)
        and append two dependency features: the relation linking the token to its head,
        and the tag of the head.
        """
        if tokens != self._current_tokens:  # if we haven't seen this sentence yet
            sent = " ".join(tokens)
            sent = sent.replace('%', 'percent')  # it breaks on the percent symbol for some reason
            self._parse_tree = [tree for tree in self._dep_parser.raw_parse(sent)][0]
            
        basic_features = super()._get_features(tokens, index)
            
        # we include the relation label
        basic_features.append(self._parse_tree.nodes[index+1]['rel'])
        
        # we include the tag of the parent node of this token
        basic_features.append(self._parse_tree.nodes[self._parse_tree.nodes[index+1]['head']]['tag'])
        if basic_features[-1] is None:
            basic_features[-1] = 'None'
        if basic_features[-2] is None:
            basic_features[-2] = 'None'
                    
        return basic_features
  
# Train
def train_CRF_NER_tagger_deps(train_set):
    ### WRITE YOUR OWN CODE HERE
    tagger = CRFTaggerWithDeps()
    tagger.train(train_set, 'model.crf.tagger')
    return tagger  # return the trained model

### WRITE YOUR OWN CODE HERE
tagger_with_deps = train_CRF_NER_tagger_deps(train_set)
print(tagger_with_deps)

# Test
predicted_tags_with_deps = tag_test_set(test_set, tagger_with_deps)
print(predicted_tags_with_deps[0])

cal_span_level_f1(test_set, predicted_tags_with_deps, ne_tags)
###

KeyboardInterrupt: 
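The two dependency features used by `CRFTaggerWithDeps` can be illustrated without a running CoreNLP server. The sketch below uses a hand-written dictionary that only mimics the node fields accessed in the class above (`rel`, `head`, `tag`). The parse itself is invented for illustration and is not real CoreNLP output, and the `REL_` and `HEAD_TAG_` prefixes are a choice made in this sketch to keep the two feature types distinct.

```python
# Hand-written stand-in for a dependency graph of "Daesh attacked Baghdad".
# Address 0 is the artificial root; token i of the sentence is at address i + 1.
toy_nodes = {
    0: {'address': 0, 'word': None, 'tag': 'TOP', 'head': None, 'rel': None},
    1: {'address': 1, 'word': 'Daesh', 'tag': 'NNP', 'head': 2, 'rel': 'nsubj'},
    2: {'address': 2, 'word': 'attacked', 'tag': 'VBD', 'head': 0, 'rel': 'ROOT'},
    3: {'address': 3, 'word': 'Baghdad', 'tag': 'NNP', 'head': 2, 'rel': 'obj'},
}


def dependency_features(nodes, index):
    """Return the relation-to-head and head-tag features for the token at `index`."""
    node = nodes[index + 1]                 # shift by one to skip the root node
    rel = node['rel']                       # relation that links the token to its head
    head_tag = nodes[node['head']]['tag']   # tag of the parent node
    return [f'REL_{rel}', f'HEAD_TAG_{head_tag}']


for i in range(3):
    print(toy_nodes[i + 1]['word'], dependency_features(toy_nodes, i))
```

In this toy parse, both entity tokens attach directly to a verb, one as subject and one as object, which is the kind of syntactic cue the extra features are meant to expose to the CRF.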

# 2 Relation Extraction (RE)

We can extract semantic information from text by extracting relations between entities. The code below prints out the relations in the training dataset. Have a look at the output and see if you can identify any factual information from the relations.

In [23]:
from lab6_data.load_re3d import get_source_target_toks

# print the relations for the training sentences
# (train_rels comes from the same train/test split as for NER)
counter = 0
nonzero = 0

tag_names_to_idx = {}
for idx, name in enumerate(tag_idx_to_names):
    tag_names_to_idx[name] = idx

sentences = [[tok for tok, lab in sent] for sent in train_set]
tags = [[tag_names_to_idx[lab] for tok, lab in sent] for sent in train_set]

for s, sentence_rels in enumerate(train_rels):
    for relation in sentence_rels:        
        source, target = get_source_target_toks(relation, sentences, tags, s)
        
        print(f'{source} --> {relation[2]} --> {target}')
        
        counter += 1
        if relation[2] != 'none':
            nonzero += 1
        
print(f'Total number of pairs = {counter}, positive examples = {nonzero}')

['Jordanian', 'intelligence', 'officers'] --> AlliesOf --> ['Jordanian', '71st', 'Counter', 'Terrorism', 'Battalion']
['Daesh'] --> FightingAgainst --> ['mourners']
['mourners'] --> CoLocated --> ['a', 'funeral', 'tent']
['a', 'funeral', 'tent'] --> none --> ['today']
['a', 'funeral', 'tent'] --> CoLocated --> ['Baghdad']
['Baghdad'] --> CoLocated --> ['Iraq']
['Baghdad'] --> none --> ['today']
['About', '30,000', 'Iraqi', 'security', 'force', 'personnel'] --> AlliesOf --> ['US-led', 'coalition']
['About', '30,000', 'Iraqi', 'security', 'force', 'personnel'] --> AlliesOf --> ['Kurdish', 'fighters']
['About', '30,000', 'Iraqi', 'security', 'force', 'personnel'] --> AlliesOf --> ['Sunni', 'Arab', 'tribesmen']
['About', '30,000', 'Iraqi', 'security', 'force', 'personnel'] --> AlliesOf --> ['Shia', 'militiamen']
['US-led', 'coalition'] --> AlliesOf --> ['About', '30,000', 'Iraqi', 'security', 'force', 'personnel']
['US-led', 'coalition'] --> AlliesOf --> ['Kurdish', 'fighters']
['US-led', 

To do relation extraction, we will train a classifier that takes pairs of entities as input and outputs a relation label. The pair of entities needs to be represented by a feature vector. The code below extracts a set of features, which will be treated like a bag of words. These include dependency parse and part of speech features.

**TODO 2.1: Examine the code below and write down a list of features that the function extracts for each pair of entities.**

In [24]:
# construct a feature vector for a pair of entities: word1, word2. 
dep_parser = CoreNLPDependencyParser()

def extract_relation_BoW(sent_index, rel_index, relations, sentences, tags):
    features = {}
    
    relation = relations[sent_index][rel_index]

    source, target = get_source_target_toks(relation, sentences, tags, sent_index)
    features['source_entity'] = "_".join(source)
    features['target_entity'] = "_".join(target)
    
    pos_tagged_sent = nltk.pos_tag(sentences[sent_index])
    features['source_pos'] = pos_tagged_sent[relation[0]][1]
    features['target_pos'] = pos_tagged_sent[relation[1]][1]
    
    sent = " ".join(sentences[sent_index])
    sent = sent.replace('%', 'percent')
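    parse_tree = None
    max_attempts = 3  # give up after a few tries instead of looping forever
    for attempt in range(max_attempts):
        try:
            parse_tree = [tree for tree in dep_parser.raw_parse(sent)][0]
            break
        except Exception as e:
            print(sent)
            print(e)
            # restart the CoreNLP server before trying again
            server.stop()
            server.start()
    if parse_tree is None:
        raise RuntimeError(f'Could not parse the sentence after {max_attempts} attempts.')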
        
        
    # traverse up the tree
    nodes_on_path0 = []
    current_node = parse_tree.nodes[relation[0] + 1]
    while current_node['address'] != relation[1] + 1 and current_node['head'] is not None:
        current_node = parse_tree.nodes[current_node['head']]
        nodes_on_path0.append(current_node['address'])
    
    # then traverse up the tree from the target entity
    nodes_on_path1 = []
    current_node = parse_tree.nodes[relation[1] + 1]
    while current_node['address'] != relation[0] + 1 and current_node['head'] is not None:
        current_node = parse_tree.nodes[current_node['head']]   
        nodes_on_path1.append(current_node['address'])
        
    join_node_i = len(nodes_on_path0) - 1
    join_node_j = len(nodes_on_path1) - 1
    for i, node in enumerate(nodes_on_path0):
        if node in nodes_on_path1:
            join_node_i = i
            join_node_j = nodes_on_path1.index(node)
            break
    
    dep_path_source = []
    for node in nodes_on_path0[:join_node_i]:
        dep_path_source.append(parse_tree.nodes[node]['rel'])
        
    features['dep_path_source'] = "_".join(dep_path_source)
        
    dep_path_target = []
    for node in nodes_on_path1[:join_node_j]:
        dep_path_target.append(parse_tree.nodes[node]['rel'])
        
    features['dep_path_target'] = "_".join(dep_path_target)
    
    features['dep_path_length'] = join_node_i + join_node_j  # length of dependency path
    
    label = relation[2]  # the class of the relation. 'none' is a negative example of unrelated entities.
    return features, label
    
tag_names_to_idx = {}
for idx, name in enumerate(tag_idx_to_names):
    tag_names_to_idx[name] = idx
    
train_sentences = [[tok for tok, lab in sent] for sent in train_set]
train_tags = [[tag_names_to_idx[lab] for tok, lab in sent] for sent in train_set]

features = extract_relation_BoW(3, 1, train_rels, train_sentences, train_tags)    

print(features)

About 30,000 Iraqi security force personnel , Kurdish fighters , Sunni Arab tribesmen and Shia militiamen , assisted by US-led coalition air strikes , launched the long-awaited offensive to retake Mosul eight days ago .
HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%22outputFormat%22%3A+%22json%22%2C+%22annotators%22%3A+%22tokenize%2Cpos%2Clemma%2Cssplit%2Cdepparse%22%2C+%22ssplit.eolonly%22%3A+%22true%22%2C+%22tokenize.whitespace%22%3A+%22false%22%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fed1e3a49d0>: Failed to establish a new connection: [Errno 61] Connection refused'))


CoreNLPServerError: Could not connect to the server.
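
The dependency-path features can be illustrated without a running CoreNLP server. The sketch below is a simplified variant of the idea, not the exact bookkeeping used in `extract_relation_BoW()`: it walks up from each entity token to the root, finds the first shared ancestor, and collects the relation labels on each side. The toy sentence, the `heads` and `rels` dictionaries, and the function names are all assumptions made for illustration.

```python
# Toy parse of "Coalition forces support Kurdish fighters" (hand-written, not CoreNLP output).
# heads maps each token address to the address of its head; 0 is the artificial ROOT.
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}
# rels maps each token address to the label of the arc linking it to its head.
rels = {1: 'compound', 2: 'nsubj', 3: 'ROOT', 4: 'amod', 5: 'obj'}


def ancestors(heads, start):
    """Walk from `start` up to the root token, returning every address visited."""
    path = [start]
    while heads[path[-1]] != 0:
        path.append(heads[path[-1]])
    return path


def dependency_path(heads, rels, source, target):
    """Return (source-side labels, target-side labels, path length in arcs)."""
    up_source = ancestors(heads, source)
    up_target = ancestors(heads, target)
    position_in_target = {addr: j for j, addr in enumerate(up_target)}
    # The first node on the source path that also lies on the target path
    # is the lowest common ancestor, where the two halves of the path join.
    i = next(i for i, addr in enumerate(up_source) if addr in position_in_target)
    j = position_in_target[up_source[i]]
    source_rels = [rels[addr] for addr in up_source[:i]]
    target_rels = [rels[addr] for addr in up_target[:j]]
    return source_rels, target_rels, i + j


source_rels, target_rels, length = dependency_path(heads, rels, 1, 4)
print("_".join(source_rels), "_".join(target_rels), length)
```

Joining the labels with underscores, as the lab code does, turns each half of the path into a single categorical feature value.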

**TODO 2.2: Complete the function below to get a list of tuples for the training set, where each tuple is of the form (feature vector, label). The feature vector should be obtained by calling extract_relation_BoW() with the index of each sentence and relation in the training dataset.**

In [23]:
def get_features_for_relations(rels, sentences, tags):
    ### WRITE YOUR OWN CODE HERE
    rels_with_feats = []
    for s in range(len(rels)):
        if s % 100 == 0:
            print(s)
        for r in range(len(rels[s])):
            rel_feats, label = extract_relation_BoW(s, r, rels, sentences, tags)
            rels_with_feats.append((rel_feats, label))
    ###
    return rels_with_feats
    
train_set_rels = get_features_for_relations(train_rels, train_sentences, train_tags)

0
100
200
300
400
500
600
700


Run the code below to train a naïve Bayes classifier using NLTK's library. The class is described in [chapter 6 of the NLTK book](https://www.nltk.org/book/ch06.html).

In [24]:
classifier = nltk.NaiveBayesClassifier.train(train_set_rels)
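
To make the expected input format concrete, here is a minimal sketch of training the same NLTK classifier on a handful of hand-made examples. The feature values and the tiny dataset are invented for illustration and are not taken from re3d. Each training item is a `(feature dict, label)` tuple, which is the shape that `get_features_for_relations()` returns.

```python
import nltk

# Hand-made toy examples (assumed values, not real corpus data).
toy_train = [
    ({'source_pos': 'NNP', 'target_pos': 'NNS', 'dep_path_length': 2}, 'AlliesOf'),
    ({'source_pos': 'NNP', 'target_pos': 'NNS', 'dep_path_length': 3}, 'AlliesOf'),
    ({'source_pos': 'NNP', 'target_pos': 'NNS', 'dep_path_length': 2}, 'AlliesOf'),
    ({'source_pos': 'NN', 'target_pos': 'CD', 'dep_path_length': 6}, 'none'),
    ({'source_pos': 'NN', 'target_pos': 'CD', 'dep_path_length': 7}, 'none'),
    ({'source_pos': 'NN', 'target_pos': 'CD', 'dep_path_length': 6}, 'none'),
]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Classify an unseen pair described by the same feature names.
unseen = {'source_pos': 'NNP', 'target_pos': 'NNS', 'dep_path_length': 2}
toy_prediction = toy_classifier.classify(unseen)
print(toy_prediction)

# Inspect which feature values separate the classes most strongly.
toy_classifier.show_most_informative_features(5)
```

Note that NLTK's naive Bayes treats every feature value as a categorical symbol, so a numeric feature such as `dep_path_length` is matched by exact value rather than by magnitude.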

The code below will generate the test set for relation extraction, then use the classifier to predict the relation types for each pair of entities in the test set:

In [25]:
test_sentences = [[tok for tok, lab in sent] for sent in test_set]
test_tags = [[tag_names_to_idx[lab] for tok, lab in sent] for sent in test_set]

test_set_rels = get_features_for_relations(test_rels, test_sentences, test_tags)
test_set_rels_no_labels = [rel for rel, lab in test_set_rels]
gold_labels = [lab for rel, lab in test_set_rels]
predicted_labels = classifier.classify_many(test_set_rels_no_labels)

0
100


**TODO 2.3: Use the [sklearn `f1_score` function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) to compute the macro-average and per-class F1 scores on the relation extraction test set.** 

Some classes score very low, or even zero. This may be because the dataset is small and contains only a few training examples for those relation types.

In [26]:
# macro-average F1 score
### WRITE YOUR OWN CODE HERE
f1_score(gold_labels, predicted_labels, average='macro')

0.22240187650614632

In [27]:
# per-class F1 scores
### WRITE YOUR OWN CODE HERE
f1_score(gold_labels, predicted_labels, average=None)

array([0.08333333, 0.        , 0.22222222, 0.37288136, 0.06896552,
       0.        , 0.22641509, 0.5       , 0.18181818, 0.38461538,
       0.2       , 0.42857143])
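
To see how the macro-average relates to the per-class scores, the sketch below computes both by hand on a made-up set of labels (`OtherRel` is a hypothetical class name used only for this example). It assumes the usual definition F1 = 2·TP / (2·TP + FP + FN), with a score of 0 when a class has no true positives, and should agree with `f1_score(..., average=None)` and `average='macro'` on the same inputs.

```python
def per_class_f1(gold, pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(gold) | set(pred))
    scores = {}
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


# Toy labels, invented for illustration.
toy_gold = ['AlliesOf', 'AlliesOf', 'none', 'none', 'OtherRel']
toy_pred = ['AlliesOf', 'none', 'none', 'none', 'none']

toy_scores = per_class_f1(toy_gold, toy_pred)
# The macro-average is the unweighted mean of the per-class scores.
toy_macro = sum(toy_scores.values()) / len(toy_scores)

for label, score in toy_scores.items():
    print(f"{label}: {score:.3f}")
print(f"macro-average: {toy_macro:.3f}")
```

Because every class counts equally in the macro-average, a rare class that is never predicted correctly contributes a zero and pulls the overall score down, regardless of how few test examples it has.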

**OPTIONAL TODO 2.4: Remove or modify features in the feature vector and see if you can improve the performance of the RE classifier.**
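
One possible way to start is a feature ablation: remove one or more features from every example, retrain, and compare F1 scores. The helper below is a minimal sketch of that idea; `drop_features` is a hypothetical name, and the toy data stands in for `train_set_rels` and `test_set_rels`.

```python
def drop_features(dataset, names_to_drop):
    """Return a copy of a list of (feature dict, label) tuples without the named features."""
    return [
        ({name: value for name, value in feats.items() if name not in names_to_drop}, label)
        for feats, label in dataset
    ]


# Toy stand-in for train_set_rels (assumed values, for illustration only).
toy_rels = [
    ({'source_entity': 'US-led_coalition', 'source_pos': 'JJ', 'dep_path_length': 4}, 'AlliesOf'),
    ({'source_entity': 'Mosul', 'source_pos': 'NNP', 'dep_path_length': 7}, 'none'),
]

# Entity strings are very sparse, so they are a natural first candidate to remove.
ablated = drop_features(toy_rels, {'source_entity'})
print(ablated)
```

In the lab, the same helper could be applied to both the training and test sets before retraining the classifier, so that the two use identical feature names.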