<a href="https://www.kaggle.com/code/angevalli/named-entity-recognition-and-classification?scriptVersionId=133851826" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a> <a target="_blank" href="https://drive.google.com/drive/folders/1ox-a6KA2M7t_uVF26igTlW0ejWUuB-MC?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this notebook is to recognize entities and their types in an input sentence. For example, take the input 'EU rejects German call to boycott British lamb.' If we use {'Organization', 'Location', 'Person', 'Miscellaneous Entity'} to classify all entities in this task, then 'EU' is an organization, and 'German' and 'British' are miscellaneous entities.

=== Input ===

The input for training is a file train.txt, which contains a certain number of annotated sentences:

EU B-ORG
rejects O
German B-MISC 
call O
to O
boycott O
British B-MISC
lamb O
. O

Reminder: the prefix B- marks the beginning of an entity and is followed by the entity class.

Each line contains two columns separated by a space.
The first column is a word, the second column is its label.
Here, we use the 'BIO' tagging scheme (beginning of an entity name,
inside an entity name, outside/other).
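
To make the BIO scheme concrete, here is a minimal sketch that parses the annotated lines shown above and groups the tags into entity spans. The helper name `bio_to_spans` is hypothetical and is not part of this notebook's utilities.

```python
def bio_to_spans(words, tags):
    """Group BIO tags into (entity_type, text) pairs.

    Hypothetical helper for illustration; not part of the notebook's utilities.
    """
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # a B- tag always opens a new entity
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # an I- tag of the same type continues the open entity
            current_words.append(word)
        else:
            # 'O' (or an inconsistent I- tag) closes the open entity
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


lines = ["EU B-ORG", "rejects O", "German B-MISC", "call O", "to O",
         "boycott O", "British B-MISC", "lamb O", ". O"]
words, tags = zip(*(line.split() for line in lines))
print(bio_to_spans(words, tags))
```

For this sentence the sketch yields three single-word entities: 'EU' (ORG), 'German' (MISC) and 'British' (MISC).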

The input for testing is a file test.txt, which contains sentences without the annotations:

JAPAN
GETS
LUCKY
WIN
,
CHINA
IN
SURPRISE
DEFEAT
.


=== Output ===

The output shall be a text file that assigns each word a BIO label over the entity types {'ORG', 'LOC', 'PER', 'MISC'}, or 'O' for words outside any entity:

JAPAN B-LOC
GETS O
LUCKY O
WIN O
, O
CHINA B-PER
IN O
SURPRISE O
DEFEAT O
. O
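
A sketch of how predictions map onto this file format: one `word label` pair per line, with a blank line after each sentence. The function name `format_predictions` is a hypothetical helper, and the labels below are copied from the example above rather than produced by a model.

```python
def format_predictions(sentences, predictions):
    """Render sentences and per-word labels as 'word label' lines.

    Hypothetical helper; a blank line separates sentences.
    """
    lines = []
    for words, labels in zip(sentences, predictions):
        if len(words) != len(labels):
            raise ValueError("each word needs exactly one label")
        lines.extend(f"{w} {l}" for w, l in zip(words, labels))
        lines.append("")  # blank line marks the end of a sentence
    return "\n".join(lines) + "\n"


sentences = [["JAPAN", "GETS", "LUCKY", "WIN"]]
predictions = [["B-LOC", "O", "O", "O"]]
print(format_predictions(sentences, predictions), end="")
```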


=== Datasets ===

We provide 3 datasets:
1) a training dataset, which has the labels
2) a development dataset, which has the labels
3) a testing dataset, which does not have the labels, and which we use for grading

=== Suggestions for improvement ===

1) Adopt deep learning (DL) models. If you decide to do this, you will have to make substantial changes to this notebook. First, change the input format to satisfy the DL model's requirements. Second, create a DL model. Finally, modify the methods train(), evaluate_on_dev(), and predict_on_test() to fit your own model.

2) Well-designed handcrafted features can make a CRF competitive with a deep learning model.
Reference: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system

3) How to implement a deep learning model for this task
   Paper
    LSTM+CRF: https://arxiv.org/pdf/1603.01360v3.pdf
    Bi-directional LSTM-CNNs-CRF: https://arxiv.org/pdf/1603.01354v5.pdf

   Code
    https://www.depends-on-the-definition.com/guide-sequence-tagging-neural-networks-python/
    https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html
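
Suggestion 1 mentions changing the input format for a deep learning model. One common way to do this, sketched below under the assumption that the model consumes fixed-length integer sequences, is to map words and labels to indices and pad each sentence. All names here (`build_vocab`, `encode`, `PAD`, `UNK`, the length 5) are hypothetical and independent of any particular framework.

```python
PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(sequences, specials=()):
    """Map each distinct token to an integer index, special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(seq, vocab, max_len, pad_token, unk_token=None):
    """Convert tokens to indices, then truncate or pad to max_len."""
    fallback = vocab[unk_token] if unk_token is not None else None
    ids = [vocab.get(tok, fallback) for tok in seq[:max_len]]
    return ids + [vocab[pad_token]] * (max_len - len(ids))


sentences = [["EU", "rejects", "German", "call"], ["British", "lamb", "."]]
labels = [["B-ORG", "O", "B-MISC", "O"], ["B-MISC", "O", "O"]]

word_vocab = build_vocab(sentences, specials=(PAD, UNK))
label_vocab = build_vocab(labels, specials=(PAD,))

# every sentence becomes a list of 5 integers; 0 is the padding index
x = [encode(s, word_vocab, 5, PAD, UNK) for s in sentences]
y = [encode(l, label_vocab, 5, PAD) for l in labels]
print(x)
print(y)
```

Words that were not seen when the vocabulary was built map to the `UNK` index, which is why the word vocabulary reserves it while the label vocabulary does not.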

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/entity-and-type-recognition-from-sentence/test.txt
/kaggle/input/entity-and-type-recognition-from-sentence/train.txt
/kaggle/input/entity-and-type-recognition-from-sentence/dev.txt


In [2]:
"""
This cell provides some basic functions for NERC.
Don't modify this cell unless you want to.
"""
class InputSample(object):
    def __init__(self, word, label):
        self.word = word
        self.label = label


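def get_sentence(file_name, mode='train'):
    """Read data from a file sentence by sentence.
    Args:
        file_name (string): file name
        mode (string): 'train' for labelled files, 'test' for unlabelled files
    Returns:
        iterator: each element is a sentence, i.e. a list of InputSample instances.
    """
    with open(file_name, 'r', encoding='utf8') as f:
        words = []
        labels = []
        for line in f:
            # split on any whitespace so trailing spaces and newlines
            # never end up inside a word or a label
            splits = line.split()
            if not splits:
                # a blank line marks the end of a sentence
                if words:
                    assert len(words) == len(labels)
                    yield [InputSample(w, l) for w, l in zip(words, labels)]
                    words = []
                    labels = []
            else:
                words.append(splits[0])
                if len(splits) > 1:
                    labels.append(splits[-1])
                elif mode == 'test':
                    # unlabelled data: use 'O' as a placeholder label
                    labels.append('O')
        # yield the last sentence if the file does not end with a blank line
        if words:
            assert len(words) == len(labels)
            yield [InputSample(w, l) for w, l in zip(words, labels)]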


def get_entities_bio(seq):
    """Get entities from sequence.
    note: BIO
    Args:
        seq (list): sequence of labels.
    Returns:
        list: list of (chunk_type, chunk_start, chunk_end).
    Example:
        seq = ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-PER']
        get_entities_bio(seq)
        #output
        {('PER', 0, 1), ('LOC', 3, 3)}
    """
    if any(isinstance(s, list) for s in seq):
        seq = [item for sublist in seq for item in sublist + ['O']]
    chunks = []
    chunk = [-1, -1, -1]
    for indx, tag in enumerate(seq):
        if tag.startswith("B-"):
            if chunk[2] != -1:
                chunks.append(chunk)
            chunk = [-1, -1, -1]
            chunk[1] = indx
            chunk[0] = tag.split('-')[1]
            chunk[2] = indx
            if indx == len(seq) - 1:
                chunks.append(chunk)
        elif tag.startswith('I-') and chunk[1] != -1:
            _type = tag.split('-')[1]
            if _type == chunk[0]:
                chunk[2] = indx

            if indx == len(seq) - 1:
                chunks.append(chunk)
        else:
            if chunk[2] != -1:
                chunks.append(chunk)
            chunk = [-1, -1, -1]
    return set([tuple(chunk) for chunk in chunks])


def f1_score(true_entities, pred_entities):
    """Compute the F1 score."""
    nb_correct = len(true_entities & pred_entities)
    nb_pred = len(pred_entities)
    nb_true = len(true_entities)

    p = nb_correct / nb_pred if nb_pred > 0 else 0
    r = nb_correct / nb_true if nb_true > 0 else 0
    score = 2 * p * r / (p + r) if p + r > 0 else 0

    return score

In [3]:
!pip install sklearn-crfsuite

Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (993 kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.9 sklearn-crfsuite-0.3.6

In [4]:
# Import some basic packages
from itertools import chain
import pycrfsuite

# Import custom functions written by us
from utility_script_functions_for_nerc import get_sentence, get_entities_bio, f1_score

# input files
# [ train_file ] is a training dataset that contains 217K lines.
# [ dev_file ] is a development dataset that contains 54K lines. You can evaluate your model on this file.
# [ test_file ] is a testing dataset that contains 50K lines. Each sample in this file does not have entity labels.

train_file = "/kaggle/input/entity-and-type-recognition-from-sentence/train.txt"
dev_file = "/kaggle/input/entity-and-type-recognition-from-sentence/dev.txt"
test_file = "/kaggle/input/entity-and-type-recognition-from-sentence/test.txt"

# [ model_file ] is used to store your trained model.
# [ result_file ] is the file that stores your predicted entity classifications.
# (This is the file you have to submit once you have run the model on the test dataset.)

model_file = "/kaggle/working/my_model"
result_file = "/kaggle/working/result.txt"

# Hyper-parameters: You don't have to change these, but you can.
# [ c1 ] coefficient for the L1 penalty in the CRF. It can help your model avoid overfitting.
# [ c2 ] coefficient for the L2 penalty in the CRF. It can help your model avoid overfitting.
# [ iteration ] the maximum number of training iterations, i.e. when to stop.
# Too small a number may cause underfitting, while too large a number may cause overfitting.

c1 = 1.0
c2 = 1e-3
iteration = 50

In [5]:
def train():
    '''
    train your model
    :return:
    '''

    # get features and labels
    train_features, train_labels = get_feature_data(train_file)
    # define a CRF trainer
    trainer = pycrfsuite.Trainer(verbose=True)

    # feed data into CRF
    for xseq, yseq in zip(train_features, train_labels):
        trainer.append(xseq, yseq)

    # define parameters
    trainer.set_params({
        'c1': c1,
        'c2': c2,
        'max_iterations': iteration,

        # include transitions that are possible, but not observed
        'feature.possible_transitions': True
    })

    # start to train
    trainer.train(model_file)
    print('Finish Training!')


def evaluate_on_dev():
    '''
    evaluate your model on the development dataset.
    :return:
    '''
    # get features and labels
    dev_features, dev_labels = get_feature_data(dev_file)

    # load the trained CRF model
    tagger = pycrfsuite.Tagger()
    tagger.open(model_file)

    # predict each word's tag
    y_pred = [tagger.tag(xseq) for xseq in dev_features]
    # flatten the per-sentence label lists into a single list
    y_pred = list(chain.from_iterable(y_pred))
    dev_labels = list(chain.from_iterable(dev_labels))
    # get a set of entities; each element looks like ('PER', 0, 1)
    true_entities = get_entities_bio(dev_labels)
    pred_entities = get_entities_bio(y_pred)

    # print wrong and missing predictions
    sents = get_sentence(dev_file)
    words = [sample.word for sent in sents for sample in sent]
    wrong_entities = pred_entities - true_entities
    if len(wrong_entities) > 0:
        for tag, start, end in wrong_entities:
            name = words[start:end+1]
            print('[ wrong prediction ] this entity {a} is not a {b}'.format(a=name, b=tag))
    missing_entities = true_entities - pred_entities
    if len(missing_entities) > 0:
        for tag, start, end in missing_entities:
            name = words[start:end+1]
            print('[ missing prediction ] this entity {a} ({b}) is missing'.format(a=name, b=tag))
    print('---------------------------')

    # compute the f1
    f1 = f1_score(true_entities, pred_entities)
    print("F1 on dev dataset:", f1)


def predict_on_test():
    '''
    generate the result file
    :return:
    '''

    # get features
    test_features, _ = get_feature_data(test_file, 'test')

    # predict each word's tag
    tagger = pycrfsuite.Tagger()
    tagger.open(model_file)
    labels = [tagger.tag(xseq) for xseq in test_features]
    words = [[sample.word for sample in sent] for sent in get_sentence(test_file, 'test')]

    # write file
    with open(result_file, 'w', encoding='utf8') as f:
        for word, label in zip(words, labels):
            for w, l in zip(word, label):
                f.write(w.strip() + ' ' + l + '\n')
            f.write('\n')

In [6]:
def word2features(sent, i):
    '''
    extract features for each word
    :param sent: a complete sentence
    :param i: the index of the word
    :return: a list of feature strings
    '''
    word = sent[i].word

    # common features

    features = [
        'bias=%s' % 1.0,
        'word.lower=%s' % word.lower(),
        'word[-3:]=%s' % word[-3:],
        'word[-2:]=%s' % word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(), # Adding feature of title
        'word.isdigit=%s' % word.isdigit()
    ]

    # left word's feature
    if i > 0:
        word1 = sent[i - 1].word
        features.extend([
            '-1:word.lower=%s' % word1.lower(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isdigit=%s' % word1.isdigit(),
        ])

    # right word's feature
    if i < len(sent) - 1:
        word1 = sent[i + 1].word
        features.extend([
            '+1:word.lower=%s' % word1.lower(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isdigit=%s' % word1.isdigit(),
        ])

    return features


def get_feature_data(file_name, mode='train'):
    '''
    get features
    :param file_name: dataset file
    :param mode: set 'train' for 'train.txt' and 'dev.txt'
                 set 'test' for 'test.txt'
    :return: a (features, labels) tuple with one list per sentence
    '''
    features = []
    labels = []
    for sent in get_sentence(file_name, mode):
        f = [word2features(sent, i) for i in range(len(sent))]
        features.append(f)
        labels.append([sample.label for sample in sent])

    return features, labels

In [7]:
# train & evaluate & predict
train()
evaluate_on_dev()
predict_on_test()

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 85471
Seconds required: 0.494

L-BFGS optimization
c1: 1.000000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 155264.409042
Feature norm: 1.000000
Error norm: 66425.801345
Active features: 45540
Line search trials: 1
Line search step: 0.000002
Seconds required for this iteration: 0.737

***** Iteration #2 *****
Loss: 148196.282774
Feature norm: 1.147897
Error norm: 37152.064422
Active features: 40126
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.367

***** Iteration #3 *****
Loss: 142426.202283
Feature norm: 1.247104
Error norm: 37062.142348
Active features: 34300
Line search trials: 1
Line search step: 1.000000
Seconds required f

Using a CRF, we can reach an F1 score of up to 85% by using relevant features, introducing a little bias, and increasing the number of iterations.
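
The F1 score here is computed at entity level: a predicted entity counts as correct only if its type, start and end all match a true entity, as in the `f1_score` function defined earlier. A small worked example follows; the entity sets are made up purely to illustrate the arithmetic, and `entity_f1` is a standalone restatement of that function.

```python
def entity_f1(true_entities, pred_entities):
    """Entity-level F1 over sets of (type, start, end) tuples."""
    correct = len(true_entities & pred_entities)
    p = correct / len(pred_entities) if pred_entities else 0
    r = correct / len(true_entities) if true_entities else 0
    return 2 * p * r / (p + r) if p + r > 0 else 0


# illustrative data only: one exact match, one wrong boundary, one missed entity
true_entities = {("PER", 0, 1), ("LOC", 3, 3), ("ORG", 5, 6)}
pred_entities = {("PER", 0, 1), ("LOC", 3, 4)}

# precision = 1/2, recall = 1/3, so F1 = 2 * (1/2) * (1/3) / (1/2 + 1/3) = 0.4
print(round(entity_f1(true_entities, pred_entities), 4))
```

Note that the LOC prediction earns no partial credit even though it overlaps the true entity, which is why boundary errors are costly under this metric.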

With the same features, we can go from 83% to 85% simply by raising the number of iterations from 30 to 50. Going further does not improve the F1 score and can even make it worse, possibly because of overfitting.

The relevant features added are the lower-cased word, a check for whether the word is upper case, and the same features for the previous and next words. For the previous and next words, we also added a check for whether the word is a digit.
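
As an illustration of these features, the following sketch builds the lower-case, upper-case and digit features for a word and its neighbours. It is a simplified, standalone variant of `word2features` that takes plain strings rather than `InputSample` objects; the name `simple_features` is hypothetical.

```python
def simple_features(words, i):
    """Return feature strings for words[i] and its immediate neighbours."""
    word = words[i]
    features = [
        'word.lower=%s' % word.lower(),
        'word.isupper=%s' % word.isupper(),
    ]
    # same features for the previous and next word, plus the digit check
    for offset, prefix in ((-1, '-1'), (1, '+1')):
        j = i + offset
        if 0 <= j < len(words):
            neighbour = words[j]
            features.extend([
                '%s:word.lower=%s' % (prefix, neighbour.lower()),
                '%s:word.isupper=%s' % (prefix, neighbour.isupper()),
                '%s:word.isdigit=%s' % (prefix, neighbour.isdigit()),
            ])
    return features


print(simple_features(['EU', 'rejects', 'German'], 0))
```

The first word of a sentence has no previous word, so only the current-word and `+1:` features are emitted for it.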

Experimentally, the first two or three characters of the previous and next words are not relevant features; adding them may introduce more confusion.

We could go further by adding POS tags, which would require annotating the dataset with this information by hand.