We are trying to build a POC to prove if retraining SpaCy's pretrained model on private data improves the performance. This logically seems so. But need to see by how much does it improve.

One hurdle in this process is that the annotation labels that SpaCy's NER model is trained is different than that of our dataset. So, we would need to find the equivalence between the labels to make this happen. You can find the labels that SpaCy parses on its [documentation page](https://spacy.io/api/annotation#named-entities). Whereas  [our dataset](https://www.kaggle.com/alaakhaled/conll003-englishversion) has these labels: PER, ORG, LOC, MISC

The following are the dependencies of this notebook. Please make sure they are installed before running this notebook:
* SpaCy
* Sklearn

We need to compare the performance on the following scenarios:
* Performance on spaCy's vanilla pretrained model 'en_core_web_md'
* Performance after retraining vanilla model ('en_core_web_md') on training dataset.
* Performance on a model obtained by training from scratch on a training dataset.
* Performance on spaCy's vanilla pretrained model 'en_core_web_lg' which is the <b>largest</b> model
* Performance after retraining vanilla model ('en_core_web_lg') on training dataset.

In [1]:
import json
from pprint import pprint
import spacy 
from sklearn.metrics import precision_recall_fscore_support
from spacy.gold import GoldParse
from sklearn.metrics import accuracy_score

Load the training json file from the disk

In [2]:
with open('./train.json', 'r') as f:
    read_bytes = f.read()
print(len(read_bytes))
train_json = json.loads(read_bytes)
train_json

26772141


[{'id': 0,
  'paragraphs': [{'sentences': [{'tokens': [{'orth': '-DOCSTART-',
        'tag': '-X-',
        'ner': 'O'}]}]}]},
 {'id': 1,
  'paragraphs': [{'sentences': [{'tokens': [{'orth': 'EU',
        'tag': 'NNP',
        'ner': 'U-ORG'},
       {'orth': 'rejects', 'tag': 'VBZ', 'ner': 'O'},
       {'orth': 'German', 'tag': 'JJ', 'ner': 'U-MISC'},
       {'orth': 'call', 'tag': 'NN', 'ner': 'O'},
       {'orth': 'to', 'tag': 'TO', 'ner': 'O'},
       {'orth': 'boycott', 'tag': 'VB', 'ner': 'O'},
       {'orth': 'British', 'tag': 'JJ', 'ner': 'U-MISC'},
       {'orth': 'lamb', 'tag': 'NN', 'ner': 'O'},
       {'orth': '.', 'tag': '.', 'ner': 'O'}]}]}]},
 {'id': 2,
  'paragraphs': [{'sentences': [{'tokens': [{'orth': 'Peter',
        'tag': 'NNP',
        'ner': 'B-PER'},
       {'orth': 'Blackburn', 'tag': 'NNP', 'ner': 'L-PER'}]}]}]},
 {'id': 3,
  'paragraphs': [{'sentences': [{'tokens': [{'orth': 'BRUSSELS',
        'tag': 'NNP',
        'ner': 'U-LOC'},
       {'orth': '1996-08-

Define a function to convert the ConLL ner data to a format that spaCy understands

In [3]:
def convert_conll_ner_to_spacy(json_data):
    training_data = []
    set1 = set()
    for j,document in enumerate(json_data):
        sentence_tokens = document['paragraphs'][0]['sentences'][0]['tokens']
        byte_count = 0
        end = None
        full = " ".join([token['orth'] for token in sentence_tokens])
        #print(full)
        entities = []
        for i,token in enumerate(sentence_tokens):
            #print('\n\t' + str(token))
            byte_count += len(token['orth'])
            ner_tag_scheme_list = token['ner'].split('-')
            biluo_scheme = ner_tag_scheme_list[0]
            ner_tag = None

            if biluo_scheme == 'O':
                #print(f'\tbc:{byte_count}')
                pass
            else: 
                set1.add(token['ner'].split('-')[1])
                ner_tag = ner_tag_scheme_list[1]
                if biluo_scheme == 'B':
                    start = byte_count - len(token['orth'])
                elif biluo_scheme == 'I':
                    pass
                elif biluo_scheme == 'L':
                    end = byte_count
                elif biluo_scheme == 'U':
                    start = byte_count - len(token['orth'])
                    end = byte_count
            byte_count += 1      # For a single space between tokens
            if end != None:
                #print(f'\ttoken:{full[start:end]} -- start:{start} end:{end} bc:{byte_count} -- tag:{ner_tag}')
                entities.append((start, end, ner_tag))
                end = None
        training_data.append((full, {"entities" : entities}))
    print(set1)
    return training_data


Convert training examples into a format that spaCy understands

In [241]:
training_examples = convert_conll_ner_to_spacy(train_json)
#for t in training_examples + test_examples:
#    print(t)

{'MISC', 'LOC', 'ORG', 'PER'}


Load pretrained model 

In [252]:
pretrained_nlp = spacy.load('en_core_web_md')

Load test data and convert it into a format that spaCy understands

In [6]:
with open('./test.json', 'r') as f:
    read_bytes = f.read()
print(len(read_bytes))
test_json = json.loads(read_bytes)
test_examples = convert_conll_ner_to_spacy(test_json)

6145340
{'MISC', 'LOC', 'ORG', 'PER'}


Define helper functions to :
* Take convert human annotated examples in spacy format and build a dictionary that maps annotation labels to annotated text. This wil help up in peering into the annotations to find what the labels actually mean.
* To display the this annotations label to text dictionary
* Help merging the dictionaries for training and test examples

In [43]:
import random
def spacy_get_annotations_by_labels(examples_in_spacy_fmt, labels='all'):
    if labels != 'all':
        raise('Not implemented for specific label')
    label_to_text_map =  {} 
    for text,annotations in examples_in_spacy_fmt:
        entities = annotations['entities']
        #print(text)
        for (start, end, label) in entities:
            #print('\t',start, end, label, '\''+text[start:end]+'\'')
            if label not in label_to_text_map:
                label_to_text_map[ label ] = set()
            else:
                label_to_text_map[ label ].add( text[start:end] )
    return label_to_text_map

def display_labels2text_dict(dictionary, num_samples_per_entity):
    temp_dict = {}
    for key, set_ in dictionary.items():
        temp_dict[key] = []
        set_list = list(set_)
        print(f'set_len({key}) = {len(set_)}')
        if len(set_list) < num_samples_per_entity:
            [ temp_dict[key].append(e) for e in set_list ]
        else: 
            for i in range(num_samples_per_entity):
                temp_dict[key].append( set_list[ int(random.random() * len(set_list)) ] )
    pprint(temp_dict, width=200)

def copy_dict(src_dict, dest_dict):
    for label,set_ in src_dict.items():
        if label not in dest_dict:
            dest_dict[label] = set()
        else:
            [ dest_dict[label].add(i) for i in src_dict[label] ]
    return
    
def merge_labels_to_text_dict(dict1, dict2):
    merged_dict = {}
    copy_dict(dict1, merged_dict)
    copy_dict(dict2, merged_dict)
    return merged_dict


Show the text corresponding the various annotation-labels in the <u>training set<u>

In [49]:
labels2text_dict_train    = spacy_get_annotations_by_labels(training_examples)
labels2text_dict_test     = spacy_get_annotations_by_labels(test_examples)
merged_dict               = merge_labels_to_text_dict(labels2text_dict_train, labels2text_dict_test)

display_labels2text_dict({'MISC':merged_dict['MISC']}, 302)

set_len(MISC) = 301
{'MISC': ['Social Democrats',
          'Holocaust',
          'ENGLISH',
          'Algerians',
          'American',
          'Polish',
          'Portuguese',
          'Arabian Light',
          'Cup',
          'Lantau Peak',
          'Uzbek',
          'Moroccan',
          'Pilkington Cup',
          'Y-GREEN BAY',
          'MEDITERRANEAN',
          'Warsaw Pact',
          'Dow',
          'African',
          'World Cup',
          'Sheffield Shield',
          'BRAZILIAN',
          'Minas',
          'Tasmanian',
          'office-Conservatives',
          'East Timorese-born',
          'AMERICAN',
          'Spanish',
          'Greek',
          'Jones',
          'East Timorese',
          'Arabic',
          'Pro-European',
          'Tia Juana',
          'Grand Slam Cup',
          'Mercedes',
          'mid-Mississippi',
          'Queenslander',
          'Hindus',
          'Turkish',
          'post-Communist',
          "Saturday'sWorld Cu

Show the text corresponding the various annoatation-labels in the <u>test set<u>

In [48]:
import random

# Dictionary that maps entities in the pretrained model (en_core_web_xx) to the unique texts of the test set
# they resolve to. So that we know what all texts match a particular entity label. This will help us 
# in understanding which label maps to the labels of the dataset that is used for pre-training.
def spacy_ner_predictions_to_dict(examples_in_spacy_fmt):
    pred_ent_to_text_map = {} 
    for text,_ in examples_in_spacy_fmt:
        pred_doc = pretrained_nlp(text)
        for ent in pred_doc.ents: 
            if ent.label_ not in pred_ent_to_text_map:
                pred_ent_to_text_map[ ent.label_ ] = set()
            else:
                pred_ent_to_text_map[ ent.label_ ].add( ent.text )
    return pred_ent_to_text_map

# Following call takes time. Uncommnet if contents of the parameter has changed
#pred_dict = spacy_ner_predictions_to_dict(test_examples)

display_labels2text_dict(pred_dict, 80)


set_len(GPE) = 572
set_len(PERSON) = 1099
set_len(ORG) = 794
set_len(DATE) = 618
set_len(EVENT) = 41
set_len(CARDINAL) = 621
set_len(ORDINAL) = 28
set_len(TIME) = 76
set_len(NORP) = 147
set_len(QUANTITY) = 86
set_len(PRODUCT) = 14
set_len(MONEY) = 162
set_len(PERCENT) = 58
set_len(LOC) = 44
set_len(FAC) = 19
set_len(WORK_OF_ART) = 9
set_len(LANGUAGE) = 2
set_len(LAW) = 4
{'CARDINAL': ['10,900',
              '204',
              '83',
              '30',
              '106',
              '136',
              '80',
              '6 1/2',
              '153,231',
              '269',
              '10,900',
              '369',
              '1000',
              '53.01',
              '74.45',
              '209',
              '18,000',
              'about 120',
              '10=',
              '180',
              '53.29',
              '302',
              '44',
              'half',
              '41.08',
              '1:18.98',
              'more than 160',
              '18=

             'about three percent',
             '0.68 percent',
             '4.5 %',
             '43.56 percent',
             'seven percent',
             '120 percent',
             '2.21 percent',
             'minus 0.59 percent',
             '150 percent',
             'more than 20 percent',
             '0.02 percent',
             'one percent',
             '77 percent',
             'more than eight percent',
             '25 percent',
             'at least 15 percent',
             '22.00 / 22.75 percent',
             '85 percent',
             '1.35 percent',
             'two percent',
             '0.37 percent',
             'three percent',
             '195 percent',
             '13.6 percent',
             '80 percent',
             'minus 3.03 percent',
             '105 percent'],
 'PERSON': ['Dietmar Beiersdorfer',
            'Shahid Afridi',
            'Marcelo Otero',
            'Panutan Duta',
            'Chris',
            'Healy',
            'Pao

In [94]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

The NER entities of SpaCy and the manually annotated labels of the dataset dont match. So we need to map those labels to the closest entities that SpaCy's NER model deals with.  

In [217]:
PRED_LABELS_EQUIV_MAP = {
    'NORP'       : 'MISC',
    'WORK_OF_ART': 'MISC',
    'FAC'        : 'MISC',
    'PRODUCT'    : 'MISC',
    'EVENT'      : 'MISC',
    'GPE'        : 'LOC',
    'LOC'        : 'LOC',
    'ORG'        : 'ORG', 
    'PERSON'     : 'PER'
}

Evaluating spaCy's pretrained model on test examples. This score would be compared with the score on the same model which has been retrained on the training data. We count the
* Number of true positives  - Labelled correctly
* Number of false positives - Labelled and did not got it correct AND those that should not have been labelled.
* Number of false negatives - Those that were not labelled at all

In [253]:
def print_annotaions_and_predictions(spacy_examples):
    for text,annotation in spacy_examples:
        pred_doc = pretrained_nlp(text)
        ypred = [ (ent.label_,ent.text) for ent in pred_doc.ents ]
        
        annot_list = [( ent[2], text[ent[0]:ent[1]] ) for ent in annotation['entities']]
        
        print('Human annotated: ', annot_list)
        print('Predictions    : ', ypred)
        print('\n')

def map_pred_tag_to_domain(pred_bilou_tag, equivalence_map):
    if pred_bilou_tag[0] == 'O':
        return 'O'
    bilou_part = pred_bilou_tag.split('-')[0]
    label_part = pred_bilou_tag.split('-')[1]
        
    if label_part not in equivalence_map.keys():
        return 'O'
    return bilou_part + '-' + equivalence_map[label_part]

def convert_doc_to_bilou_tags(doc):
    list_ = [] 
    for i in range(len(doc)):
        # Process BILOU tag
        if doc[i].ent_iob_ == 'O':
            bilou_tag = 'O'
        else:
            if doc[i].ent_iob_ == 'B':
                bilou_tag = 'U' if (i+1) < len(doc) and doc[i+1].ent_iob_ != 'I' else 'B'
            elif doc[i].ent_iob_ == 'I':
                bilou_tag = 'I' if (i+1) < len(doc) and doc[i+1].ent_iob_ == 'I' else 'L'
            else:
                assert "This is unexpected"
        bilou_tag = 'O' if doc[i].ent_type_ == '' else bilou_tag + '-' + doc[i].ent_type_
        
        list_.append( (bilou_tag, doc[i].text) )    
    #print('--->> ',list_)
    return list_

def perf_measure(y_actual, y_hat, label):
    TP = 0
    FP = 0
    TN = 0
    FN = 0
    for i in range(len(y_hat)):
        if y_actual[i]==y_hat[i]==label:
            TP += 1
        if y_hat[i]==label and y_actual[i]!=label:
            FP += 1
        if y_actual[i]!=label and y_hat[i]!=label:
            TN += 1
        if y_hat[i]!=label and y_actual[i]==label:
            FN += 1
    return(TP, FP, TN, FN)

def compute_scores_2(spacy_examples, model, label_map):
    perf_stats_per_tag = { }
    for text,annotation in spacy_examples:
        doc = model.make_doc(text)
        gold = GoldParse(doc, entities=annotation['entities'])
        gold_tag_list = [i for i in zip(gold.ner,gold.words)]
        #print('\nGold       : ', gold_tag_list)
        
        ner_tag_predict_doc = model(text)
        ner_tag_predict_list = convert_doc_to_bilou_tags(ner_tag_predict_doc)
        ner_tag_predict_list = list( map(lambda e: (map_pred_tag_to_domain(e[0],PRED_LABELS_EQUIV_MAP), e[1]), 
                                         ner_tag_predict_list) 
                                   )
        #print(  'Predicted  : ', ner_tag_predict_list)
        
        #for i in range(len(ner_tag_predict_list)):
        #    if ner_tag_predict_list[i][0] != gold_tag_list[i][0]:
        #        print('\t',ner_tag_predict_list[i], gold_tag_list[i])
        #        continue
        #    if ner_tag_predict_list[i][1] != gold_tag_list[i][1]:
        #        print('\t',ner_tag_predict_list[i], gold_tag_list[i])
        #        continue
        
        # Compute unique labels and populate y_true and y_pred
        unique_labels = set()
        y_true, y_pred = [],[]
        for t in gold_tag_list:
            if t[0] != 'O':
                unique_labels.add( t[0] )
            y_true.append( t[0] )
        for t in ner_tag_predict_list:
            if t[0] != 'O':
                unique_labels.add( t[0] )
            y_pred.append( t[0] )
            
        #print('\tUnique Labels :', unique_labels)
        #print('\ty_true        :', y_true)
        #print('\ty_pred        :', y_pred)
        for label in unique_labels:
            (TP, FP, TN, FN) = perf_measure(y_true, y_pred, label)
            CNT = len(y_true)
            #print(label,' ',f'(TP:{TP}, FP:{FP}, TN:{TN}, FN:{FN}, CNT:{CNT})')
            
            label_part = label.split('-')[1]
            if label_part not in perf_stats_per_tag:
                perf_stats_per_tag[label_part] = {'TP':0, 'FP':0, 'TN':0, 'FN':0, 'CNT':0}
                
            perf_stats_per_tag[label_part]['TP'] += TP
            perf_stats_per_tag[label_part]['FP'] += FP
            perf_stats_per_tag[label_part]['TN'] += TN
            perf_stats_per_tag[label_part]['FN'] += FN
            perf_stats_per_tag[label_part]['CNT'] += CNT
    return perf_stats_per_tag

def display_perf_stats_per_tag( stats_per_tag ):
    # Now compute the scores
    pprint(stats_per_tag)
    for tag,st in stats_per_tag.items():
        print(f'For label: "{tag}"')
        accuracy = (st['TP'] + st['TN']) / st['CNT']
        print("\tAccuracy : "  + str(accuracy * 100) + "%")
        
        precision = 0
        if (st['TP'] + st['FP']) != 0:
            precision = st['TP'] / (st['TP'] + st['FP'])
        print("\tPrecision : " + str(precision))
        
        recall = 0
        if (st['TP'] + st['FN']) != 0:
            recall = st['TP'] / (st['TP'] + st['FN'])
        print("\tRecall : "    + str(recall))
        
        fscore = 0
        if (precision + recall) != 0:
            fscore = (2 * precision * recall) / (precision + recall)
        print("\tF-score : "   + str(fscore))
        


        
#print_annotaions_and_predictions(test_examples)

stats_per_tag__on_vanilla_md = compute_scores_2(test_examples, pretrained_nlp, PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag__on_vanilla_md)

{'LOC': {'CNT': 36306, 'FN': 453, 'FP': 791, 'TN': 33571, 'TP': 1491},
 'MISC': {'CNT': 23206, 'FN': 468, 'FP': 276, 'TN': 21906, 'TP': 556},
 'ORG': {'CNT': 50164, 'FN': 1462, 'FP': 1431, 'TN': 46196, 'TP': 1075},
 'PER': {'CNT': 45528, 'FN': 608, 'FP': 648, 'TN': 42053, 'TP': 2219}}
For label: "LOC"
	Accuracy : 96.57356910703466%
	Precision : 0.6533742331288344
	Recall : 0.7669753086419753
	F-score : 0.705631803123521
For label: "PER"
	Accuracy : 97.24125812686698%
	Precision : 0.77397976979421
	Recall : 0.7849310222851079
	F-score : 0.7794169301018616
For label: "ORG"
	Accuracy : 94.23291603540387%
	Precision : 0.4289704708699122
	Recall : 0.423728813559322
	F-score : 0.42633353162799914
For label: "MISC"
	Accuracy : 96.79393260363699%
	Precision : 0.6682692307692307
	Recall : 0.54296875
	F-score : 0.5991379310344827


Now lets <b><u>retrain</u></b> the SpaCy model on out training data

In [257]:
import copy 
def conv_dataset_to_match_domain(spacy_examples, dataset_to_model_tag_map):
    spacy_examples = copy.deepcopy(spacy_examples)
    for text,annotations in spacy_examples:
        entities = annotations['entities']
        for i,ent in enumerate(entities):
            if ent[2][0] == 'O':
                continue
            entities[i] = (ent[0],ent[1],dataset_to_model_tag_map[ent[2]])                                        
    return spacy_examples

# The 'model' parameter could either be a pretrained model. Default behavior is to 
# training from scratch. 
def train_spacy_model(train_examples, model=None):
    nlp = model  # create blank Language class
    if nlp == None:
        nlp = spacy.blank('en')
    
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if model == None:
        if 'ner' not in nlp.pipe_names:
            ner = nlp.create_pipe('ner')
            nlp.add_pipe(ner, last=True)
        # add labels
        for _, annotations in train_examples:
             for ent in annotations.get('entities'):
                ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(10):
            print("Statring iteration " + str(itn))
            random.shuffle(train_examples)
            losses = {}
            for text, annotations in train_examples:
                try:
                    nlp.update(
                        [text],         # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,       # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except:
                    continue
                    #print('Hello ' + str(sys.exc_info()))
            print(losses)
    return nlp

dataset_to_model_tag_map = { 'MISC': 'NORP',
                             'LOC' : 'LOC',
                             'ORG' : 'ORG',
                             'PER' : 'PERSON'}
mapped_examples = conv_dataset_to_match_domain(training_examples, 
                                               dataset_to_model_tag_map)
#train_spacy_model(mapped_examples, pretrained_nlp)

Evaluating spaCy's <b><u>retrained</u></b> model on test examples 

In [250]:
# saved_model = copy.deepcopy(pretrained_nlp) # *ExpensiveResource*. comment after execution

#stats_per_tag__retrained = compute_scores_2(test_examples, saved_model, PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag__retrained)

{'LOC': {'CNT': 34285, 'FN': 243, 'FP': 416, 'TN': 31925, 'TP': 1701},
 'MISC': {'CNT': 23083, 'FN': 304, 'FP': 218, 'TN': 21841, 'TP': 720},
 'ORG': {'CNT': 41865, 'FN': 673, 'FP': 359, 'TN': 38969, 'TP': 1864},
 'PER': {'CNT': 43017, 'FN': 273, 'FP': 251, 'TN': 39939, 'TP': 2554}}
For label: "LOC"
	Accuracy : 98.07787662242964%
	Precision : 0.8034955125177138
	Recall : 0.875
	F-score : 0.8377246983501601
For label: "PER"
	Accuracy : 98.78187693237557%
	Precision : 0.9105169340463458
	Recall : 0.9034311991510435
	F-score : 0.9069602272727272
For label: "ORG"
	Accuracy : 97.53493371551414%
	Precision : 0.8385065227170491
	Recall : 0.7347260543949546
	F-score : 0.7831932773109245
For label: "MISC"
	Accuracy : 97.73859550318417%
	Precision : 0.767590618336887
	Recall : 0.703125
	F-score : 0.7339449541284404


Now lets train a model <b><u>from sratch</u></b> and compute the score. The idea here would be to see if the performance on the retrained spaCy model is better than that of on a model that has been trained from scratch 

In [258]:
#model_from_scratch = train_spacy_model(training_examples)
#stats_per_tag__from_scratch = compute_scores_2(test_examples, model_from_scratch, PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag__from_scratch)

Statring iteration 0
{'ner': 20394.866084106878}
Statring iteration 1
{'ner': 13091.086114697599}
Statring iteration 2
{'ner': 10036.00885534475}
Statring iteration 3
{'ner': 8481.742086296188}
Statring iteration 4
{'ner': 7155.71140668863}
Statring iteration 5
{'ner': 6390.112653583615}
Statring iteration 6
{'ner': 5746.511855209186}
Statring iteration 7
{'ner': 5282.283485583606}
Statring iteration 8
{'ner': 5020.81613579561}
Statring iteration 9
{'ner': 5065.791847710549}
{'LOC': {'CNT': 31682, 'FN': 428, 'FP': 248, 'TN': 29490, 'TP': 1516},
 'MISC': {'CNT': 19528, 'FN': 1024, 'FP': 0, 'TN': 18504, 'TP': 0},
 'ORG': {'CNT': 49463, 'FN': 610, 'FP': 914, 'TN': 46012, 'TP': 1927},
 'PER': {'CNT': 39024, 'FN': 2827, 'FP': 0, 'TN': 36197, 'TP': 0}}
For label: "ORG"
	Accuracy : 96.91890908355741%
	Precision : 0.6782822949665611
	Recall : 0.759558533701222
	F-score : 0.7166232800297508
For label: "PER"
	Accuracy : 92.75574005740057%
	Precision : 0
	Recall : 0.0
	F-score : 0
For label: "LOC