# Named Entity Recognition (NER)

I will attempt to do the Named Entity Recognition task. I will train two models; one is quick & reliable (Gold Parse), and the other is a deep-learning (word-embedding) model.

The models are:<br>
1- spaCy NER Gold Parse system [1]. spaCy is an industrial strength NLP library that is widely used by large corporation to handle similar NLP applications. E.g. it's being used by BBC, Microsoft, and many other big companies. spaCy is quick to train with reasonable accuracy, and generates a light model that can be ran quickly on a small machine.

2- Flair embeddings [2]. Contextual String Embeddings for Sequence Labeling is currently the state-of-the-art [3] system in Named Entity Recognition tasks, and the only system outperforming Google's BERT [4] model. More information on Flair can be found on their paper [5]. Flair is an expensive to train model, however, it achieves state-of-the-art results.

I have selected these two models to demonstrate that I am capable of providing a quick & reliable solution when needed (spaCy). Also, when time/resource allows, I am capable of providing a significantly better solution that is considered the state-of-the-art in the field of NLP (Flair Contextual String Embeddings).

spaCy needs a couple of hours to train a decent model on a potato laptop, while Flair embeddings (with CharLMEmbeddings) can take days on the same laptop. Note, it took me less than an hour to fine-tune Flair with a big GPU Nvidia-Quadro-P6000. 

Requirements to run this code:
- python 3.6
- spacy '2.0.16'
- flair

[1] https://spacy.io/<br>
[2] https://github.com/zalandoresearch/flair<br>
[3] https://github.com/zalandoresearch/flair#comparison-with-state-of-the-art<br>
[4] https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html<br>
[5] https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view

## Initialization

In [1]:
import spacy
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy import displacy
import json
import random
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.trainers import SequenceTaggerTrainer
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharLMEmbeddings, CharacterEmbeddings
from typing import List

## Training, Validation, Testing

In here we divide the provided training data into 3 parts; training (80%), validation (10%) and testing (10%).

In [2]:
# reading the raw data
raw_data_PATH = 'Dataset/'
File_ = open(raw_data_PATH+'ner_dataset.txt')

DATA = []
sentence = []

for line in File_:
    try:
        if line == '\n':
            DATA.append(sentence)
            sentence = []
        else:
            sentence.append(line)
    except:
        print('you have a bad line..',line)

random.shuffle(DATA)    # shuffling the data is always good to preven overfitting

# dividing the data into trainig (80%), validation (10%) and testing (10%).
split_ = int(0.1 * len(DATA))
TRAIN_DATA, VAL_DATA, TEST_DATA = DATA[:8*split_], DATA[8*split_:9*split_], DATA[9*split_:]
print('  *Training has (',len(TRAIN_DATA),') instances.')
print('  *Validation has (',len(VAL_DATA),') instances.')
print('  *TEST has (',len(TEST_DATA),') instances.')

# print()
# storing the data
for name,data in zip(['training','validation','test'],[TRAIN_DATA, VAL_DATA, TEST_DATA]):
    F = open(raw_data_PATH+'ner_dataset_'+name+'.txt','w')
    F.write('\n'.join([''.join(x) for x in data]))
    F.close()


  *Training has ( 14760 ) instances.
  *Validation has ( 1845 ) instances.
  *TEST has ( 1849 ) instances.


## 1- spaCy NER

### Preparing the data for spaCy

Inspecting the data, shuffling it and preparing it to be fed into the spaCy model.

In [3]:
# theses functions create training data suitable for the Spacy tool
def _reformat_data(data):
    for counter, example_ in enumerate(data):
        index_ = 0
        annotations = {}
        sentence, ner_tag = example_
        for word, tag in zip(sentence, ner_tag):
            #-------------------------------------#
            # analysing the NER tag
            if '-' in tag:
                In, tag = tag.split('-')
                if tag not in annotations:
                    annotations[tag] = []
            else:
                In = tag
                
            #-------------------------------------#
            # creating the training data
            if In == 'B':
                annotations[tag].append([index_, index_+len(word)])
            elif In == 'I':
                annotations[tag][-1][1] = index_+len(word)
            elif In != 'O':
                print('=====!!!!!', In)
                
            index_ += len(word) + 1
        
        # fix the format
        ann = {'entities':[ (val[0],val[1],key) for key in annotations for val in annotations[key]]}
            
        ## update the training data to fit spacy format
        text = ' '.join(sentence)
        data[counter] = (text, ann)
    return data

def _create_training_data(raw_data):
    File_ = open(raw_data, 'r')
    TRAIN_DATA = []
    sentence = []
    ner_tag = []

    for line in File_:
        try:
            line = line.split('\n')[0]

            if line == '':
                TRAIN_DATA.append([sentence,ner_tag])
                sentence = []
                ner_tag = []
            else:
                word, POS1, CNK2, tag = line.split(' ')
                sentence.append(word)
                ner_tag.append(tag)
        except:
            print('you have a bad line..',line)
            
    File_.close()
    return _reformat_data(TRAIN_DATA)
            

In [4]:
# reading the raw data
raw_data_PATH = 'Dataset/'

# prepare the training data into spacy format
TRAIN_DATA = _create_training_data(raw_data_PATH+'ner_dataset_training.txt')
VAL_DATA = _create_training_data(raw_data_PATH+'ner_dataset_validation.txt')
TEST_DATA = _create_training_data(raw_data_PATH+'ner_dataset_test.txt')

# Inspecting the data        
print('We have a total of',len(TRAIN_DATA),'training instances.')

# print class analysis
all_tags = [ent[2] for data in TRAIN_DATA for ent in data[1]['entities']]
classes = set(all_tags)
print('We have',len(classes),'classes in this dataset.',classes)

for c_ in classes:
    print('  -class(',c_,') has',all_tags.count(c_),'instances.')


We have a total of 14759 training instances.
We have 4 classes in this dataset. {'PER', 'MISC', 'ORG', 'LOC'}
  -class( PER ) has 6663 instances.
  -class( MISC ) has 3466 instances.
  -class( ORG ) has 6101 instances.
  -class( LOC ) has 7153 instances.


### Training the model

In [5]:
# Load or create a blank English model
model = 'Spacy/'
output_dir = 'Spacy/'

if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
        

"""Load the model, set up the pipeline and train the entity recognizer."""
if model is not None:
    try:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    except:
        nlp = spacy.blank('en')  # create blank Language class
        print("Could not find the model, Created a blank 'en' model isntead")
else:
    nlp = spacy.blank('en')  # create blank Language class
    print("Created blank 'en' model")
    

Could not find the model, Created a blank 'en' model isntead


In [6]:
def predict_on_texts(texts):
    colors = {}
    colors['ORG'] = 'orange'
    colors['PER'] = '#aa9cfc'
    colors['LOC'] = 'green'
    colors['MISC'] = 'yellow'
    options = {'ents': classes, 'colors': colors}

    for text in texts:
        doc = nlp(text)
        Entities = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        if len(Entities) > 0:
            displacy.render(doc, style='ent', jupyter=True, options=options)
        else:
            print('no entities detected: ',text)
        print('--------------------------')
        print()
    

In [7]:
def predict_on_test_set(filepath):
    ext = filepath.split('.')[-1]
    if ext == 'txt':
        VAL_DATA = _create_training_data(filepath) 
    else:
        VAL_DATA = []

    TP, FN, FP = 0, 0, 0 # True positives, False negatives, False Positives
    for text, ann in VAL_DATA:
        doc = nlp(text)
        GT = sorted(ann['entities'], key=lambda tup: tup[0])
        Entities = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        Ground_Truth = [(text[a[0]:a[1]], a[0], a[1], a[2]) for a in GT]
        
        TP += len([value for value in Entities if value in Ground_Truth])
        FP += len([value for value in Entities if value not in Ground_Truth])
        FN += len([value for value in Ground_Truth if value not in Entities])
    Pr, Re = TP/(TP+FP), TP/(TP+FN) ## computing Precision and Recall
    print('  -Validation: -precision=%.3f -recall=%.3f -f1 score=%.3f'  % (Pr, Re, 2*(Pr*Re)/(Pr+Re)))
        

In [8]:
# parameters
n_iter = 20 # number of iteration

# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
# otherwise, get it so we can add labels
else:
    ner = nlp.get_pipe('ner')
    
# add labels to model
for ent in classes:
    ner.add_label(ent)

In [9]:
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    print('*started trianing..')
    for itn in range(n_iter):
        # shuffle the data (reduce the overfitting of the model)
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                sgd=optimizer,  # callable to update weights
                losses=losses)
        ## printing training and validation losses
        print('  -Trainnig loss', losses)
        predict_on_test_set(raw_data_PATH+'ner_dataset_validation.txt')
        
        # save model to output directory
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)



*started trianing..
  -Trainnig loss {'ner': 530.0087570839813}
  -Validation: -precision=0.734 -recall=0.759 -f1 score=0.746
Saved model to Spacy
  -Trainnig loss {'ner': 193.98312801231032}
  -Validation: -precision=0.839 -recall=0.864 -f1 score=0.851
Saved model to Spacy
  -Trainnig loss {'ner': 133.6101827629867}
  -Validation: -precision=0.866 -recall=0.881 -f1 score=0.873
Saved model to Spacy
  -Trainnig loss {'ner': 103.8997348449231}
  -Validation: -precision=0.890 -recall=0.890 -f1 score=0.890
Saved model to Spacy
  -Trainnig loss {'ner': 91.7709303233944}
  -Validation: -precision=0.891 -recall=0.904 -f1 score=0.897
Saved model to Spacy
  -Trainnig loss {'ner': 81.74727408357197}
  -Validation: -precision=0.902 -recall=0.913 -f1 score=0.907
Saved model to Spacy
  -Trainnig loss {'ner': 68.8015570272417}
  -Validation: -precision=0.900 -recall=0.908 -f1 score=0.904
Saved model to Spacy
  -Trainnig loss {'ner': 62.71420286399199}
  -Validation: -precision=0.898 -recall=0.914 -f

#### Testing the model

In [10]:
# To test the model on a txt file use:
# predict_on_test_set(raw_data_PATH+'ner_dataset.txt')

# To test the model with a sequence of sentences use:
predict_on_texts(['New York city','My name is Muhannad, and I live in the US. I work in Rolls Royce.'])


--------------------------



--------------------------



In [11]:
# Plotting the NER tags.
# using display-spacy to show the ner values (showing a sample only).
random_examples = [int(random.random()*len(VAL_DATA)) for i in range(4)]
texts = [VAL_DATA[ex][0] for ex in random_examples]
predict_on_texts(texts)

--------------------------



--------------------------



--------------------------



--------------------------



## Flair model

Flair includes contextual word embeddings to predict the named entites within the text.

In [12]:
# define columns
columns = {0: 'text', 1: 'pos', 2: 'cnk', 3:'ner'}

# this is the folder in which train, test and dev files reside
data_folder = 'Dataset/'

# training folder
flair_folder = 'Flair/'
flair_path = Path(flair_folder)
if not flair_path.exists():
    flair_path.mkdir()
        
# retrieve corpus using column format, data folder and the names of the train, dev and test files
# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.fetch_column_corpus(data_folder, columns,
                                                              train_file='ner_dataset_training.txt',
                                                              test_file='ner_dataset_test.txt',
                                                              dev_file='ner_dataset_validation.txt')
                
# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [

    # use this if you have a potato PC
    WordEmbeddings('glove'),

    # comment in this line to use character embeddings
    # CharacterEmbeddings(),

    # comment in these lines to use contextual string embeddings
    CharLMEmbeddings('news-forward'),
    CharLMEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
    

# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)
    

# 6. initialize trainer
trainer: SequenceTaggerTrainer = SequenceTaggerTrainer(tagger, corpus)

# 7. start training
trainer.train(flair_folder,
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=20)



[b'<unk>', b'O', b'B-MISC', b'B-PER', b'I-PER', b'B-LOC', b'B-ORG', b'I-LOC', b'I-ORG', b'I-MISC', b'<START>', b'<STOP>']
2018-11-29 10:24:41,638 Evaluation method: F1
2018-11-29 10:24:41,640 ----------------------------------------------------------------------------------------------------
2018-11-29 10:24:41,917 epoch 1 - iter 0/462 - loss 42.15810394
2018-11-29 10:24:54,769 epoch 1 - iter 46/462 - loss 8.74935113
2018-11-29 10:25:07,774 epoch 1 - iter 92/462 - loss 5.99806832
2018-11-29 10:25:21,112 epoch 1 - iter 138/462 - loss 4.81196134
2018-11-29 10:25:33,686 epoch 1 - iter 184/462 - loss 4.13121788
2018-11-29 10:25:46,115 epoch 1 - iter 230/462 - loss 3.67478945
2018-11-29 10:25:58,969 epoch 1 - iter 276/462 - loss 3.35587804
2018-11-29 10:26:11,713 epoch 1 - iter 322/462 - loss 3.15919730
2018-11-29 10:26:25,144 epoch 1 - iter 368/462 - loss 2.95615985
2018-11-29 10:26:37,924 epoch 1 - iter 414/462 - loss 2.79990584
2018-11-29 10:26:50,589 epoch 1 - iter 460/462 - loss 2.6650

2018-11-29 10:34:34,397 epoch 7 - iter 460/462 - loss 0.63046413
2018-11-29 10:34:34,486 ----------------------------------------------------------------------------------------------------
2018-11-29 10:34:48,858 EPOCH 7: lr 0.1000 - bad epochs 1
2018-11-29 10:34:48,860 DEV : f-score 0.9519 - acc 0.9519 - tp 2892 - fp 132 - fn 160 - tn 2892
2018-11-29 10:34:48,860 TEST: f-score 0.9472 - acc 0.9473 - tp 2830 - fp 146 - fn 169 - tn 2830
2018-11-29 10:34:48,862 ----------------------------------------------------------------------------------------------------
2018-11-29 10:34:49,014 epoch 8 - iter 0/462 - loss 0.32683650
2018-11-29 10:34:54,793 epoch 8 - iter 46/462 - loss 0.62219968
2018-11-29 10:35:00,343 epoch 8 - iter 92/462 - loss 0.62487941
2018-11-29 10:35:06,147 epoch 8 - iter 138/462 - loss 0.61804323
2018-11-29 10:35:11,162 epoch 8 - iter 184/462 - loss 0.61333458
2018-11-29 10:35:16,983 epoch 8 - iter 230/462 - loss 0.61726546
2018-11-29 10:35:23,395 epoch 8 - iter 276/462 - 

2018-11-29 10:42:27,523 epoch 14 - iter 230/462 - loss 0.46271749
2018-11-29 10:42:33,613 epoch 14 - iter 276/462 - loss 0.45838888
2018-11-29 10:42:39,357 epoch 14 - iter 322/462 - loss 0.45494782
2018-11-29 10:42:44,917 epoch 14 - iter 368/462 - loss 0.45402290
2018-11-29 10:42:50,523 epoch 14 - iter 414/462 - loss 0.44845991
2018-11-29 10:42:56,148 epoch 14 - iter 460/462 - loss 0.45012914
2018-11-29 10:42:56,243 ----------------------------------------------------------------------------------------------------
2018-11-29 10:43:10,931 EPOCH 14: lr 0.1000 - bad epochs 2
2018-11-29 10:43:10,933 DEV : f-score 0.9614 - acc 0.9614 - tp 2927 - fp 110 - fn 125 - tn 2927
2018-11-29 10:43:10,933 TEST: f-score 0.9547 - acc 0.9547 - tp 2857 - fp 129 - fn 142 - tn 2857
2018-11-29 10:43:10,935 ----------------------------------------------------------------------------------------------------
2018-11-29 10:43:11,074 epoch 15 - iter 0/462 - loss 0.12862249
2018-11-29 10:43:16,775 epoch 15 - iter

2018-11-29 10:50:16,456 TEST: f-score 0.9597 - acc 0.9598 - tp 2874 - fp 116 - fn 125 - tn 2874
2018-11-29 10:50:16,456 LOC : f-score 0.9730 - acc 0.9730 - tp 883 - fp 26 - fn 23 - tn 883
2018-11-29 10:50:16,456 MISC: f-score 0.9260 - acc 0.9260 - tp 413 - fp 27 - fn 39 - tn 413
2018-11-29 10:50:16,457 ORG : f-score 0.9377 - acc 0.9377 - tp 715 - fp 47 - fn 48 - tn 715
2018-11-29 10:50:16,457 PER : f-score 0.9823 - acc 0.9824 - tp 863 - fp 16 - fn 15 - tn 863


#### The end!