We are trying to build a POC to prove if retraining SpaCy's pretrained model on private data improves the performance. This logically seems so. But need to see by how much does it improve.

One hurdle in this process is that the annotation labels that SpaCy's NER model is trained is different than that of our dataset. So, we would need to find the equivalence between the labels to make this happen. You can find the labels that SpaCy parses on its [documentation page](https://spacy.io/api/annotation#named-entities). Whereas  [our dataset](https://www.kaggle.com/alaakhaled/conll003-englishversion) has these labels: PER, ORG, LOC, MISC

The following are the dependencies of this notebook. Please make sure they are installed before running this notebook:
* SpaCy
* Sklearn

We need to compare the performance on the following scenarios:
* Performance on spaCy's vanilla pretrained model 'en_core_web_md'
* Performance after retraining vanilla model ('en_core_web_md') on training dataset.
* Performance on a model obtained by training from scratch on a training dataset.
* Performance on spaCy's vanilla pretrained model 'en_core_web_lg' which is the <b>largest</b> model
* Performance after retraining vanilla model ('en_core_web_lg') on training dataset.

In [2]:
import spacy
spacy.prefer_gpu()

True

In [3]:
import json
from pprint import pprint
import spacy 
import random
import copy 
from sklearn.metrics import precision_recall_fscore_support
from spacy.gold import GoldParse
from sklearn.metrics import accuracy_score

In [5]:
import sys
in_ipythonkernel = 'ipykernel' in sys.modules
in_collab = 'google.colab' in sys.modules
if in_ipythonkernel == True:
    in_jupyter = in_collab == False
print('Google Collab :       ', in_collab)
print('In Jupyter Notebook : ', in_jupyter)


Google Collab :        False
In Jupyter Notebook :  True


**Convert dataset files in NER format to JSON**

In [2]:
if in_jupyter == True:
    ! python -m spacy convert -c ner valid.txt > valid.json
    ! python -m spacy convert -c ner test.txt > test.json
    ! python -m spacy convert -c ner train.txt > train.json
elif in_collab == True:
    ! mkdir dir
    ! python -m spacy convert -c ner valid.txt dir
    ! python -m spacy convert -c ner test.txt dir
    ! python -m spacy convert -c ner train.txt dir
    ! mv dir/* . 
    ! rmdir dir

Google Collab :        False
In Jupyter Notebook :  True


# **Utility functions**

In [18]:
if in_collab == True:
    from IPython.display import display as print

# Define a function to convert the ConLL ner data to a format that spaCy understands
def convert_conll_ner_to_spacy(json_obj):
    if in_collab == True:
        return convert_conll_ner_to_spacy_google_collab_(json_obj)
    else:
        assert in_jupyter == True
        return convert_conll_ner_to_spacy_jupyter_nbook_(json_obj)
        
def process_sentence_token_(sentence_tokens):
    end = None
    byte_count = 0
    entities = []
    set1 = set()
    for i,token in enumerate(sentence_tokens):
        #print('\n\t' + str(token))
        byte_count += len(token['orth'])
        ner_tag_scheme_list = token['ner'].split('-')
        biluo_scheme = ner_tag_scheme_list[0]
        ner_tag = None

        if biluo_scheme == 'O':
            #print(f'\tbc:{byte_count}')
            pass
        else: 
            set1.add(token['ner'].split('-')[1])
            ner_tag = ner_tag_scheme_list[1]
            if biluo_scheme == 'B':
                start = byte_count - len(token['orth'])
            elif biluo_scheme == 'I':
                pass
            elif biluo_scheme == 'L':
                end = byte_count
            elif biluo_scheme == 'U':
                start = byte_count - len(token['orth'])
                end = byte_count
        byte_count += 1      # For a single space between tokens
        if end != None:
            #print(f'\ttoken:{full[start:end]} -- start:{start} end:{end} bc:{byte_count} -- tag:{ner_tag}')
            entities.append((start, end, ner_tag))
            end = None
    return entities
    

def convert_conll_ner_to_spacy_google_collab_(json_obj):
    training_data = []
    sentences = json_obj[0]['paragraphs'][0]['sentences']
    for j,sentence in enumerate(sentences):
        byte_count = 0
        sentence_tokens = sentence['tokens']
        full = " ".join([token['orth'] for token in sentence_tokens])
        entities = process_sentence_token_(sentence_tokens)
        training_data.append((full, {"entities" : entities}))
    #print(set1)
    return training_data

def convert_conll_ner_to_spacy_jupyter_nbook_(json_obj):
    training_data = []
    for j,document in enumerate(json_obj):
        sentence_tokens = document['paragraphs'][0]['sentences'][0]['tokens']
        #print(sentence_tokens)
        full = " ".join([token['orth'] for token in sentence_tokens])
        entities = process_sentence_token_(sentence_tokens)
        training_data.append((full, {"entities" : entities}))
    #print(set1)
    return training_data


# Take convert human annotated examples in spacy format and build a dictionary 
# that maps annotation labels to annotated text. This wil help up in peering 
# into the annotations to find what the labels actually mean.
def spacy_get_annotations_by_labels(examples_in_spacy_fmt, labels='all'):
    if labels != 'all':
        raise('Not implemented for specific label')
    label_to_text_map =  {} 
    for text,annotations in examples_in_spacy_fmt:
        entities = annotations['entities']
        #print(text)
        for (start, end, label) in entities:
            #print('\t',start, end, label, '\''+text[start:end]+'\'')
            if label not in label_to_text_map:
                label_to_text_map[ label ] = set()
            else:
                label_to_text_map[ label ].add( text[start:end] )
    return label_to_text_map

# To display the this annotations label to text dictionary
def display_labels2text_dict(dictionary, num_samples_per_entity):
    temp_dict = {}
    for key, set_ in dictionary.items():
        temp_dict[key] = []
        set_list = list(set_)
        print(f'set_len({key}) = {len(set_)}')
        if len(set_list) < num_samples_per_entity:
            [ temp_dict[key].append(e) for e in set_list ]
        else: 
            for i in range(num_samples_per_entity):
                temp_dict[key].append( set_list[ int(random.random() * len(set_list)) ] )
    pprint(temp_dict, width=200)

def copy_dict(src_dict, dest_dict):
    for label,set_ in src_dict.items():
        if label not in dest_dict:
            dest_dict[label] = set()
        else:
            [ dest_dict[label].add(i) for i in src_dict[label] ]
    return
    
# Help merging the dictionaries for training and test examples
def merge_labels_to_text_dict(dict1, dict2):
    merged_dict = {}
    copy_dict(dict1, merged_dict)
    copy_dict(dict2, merged_dict)
    return merged_dict

import random

# Dictionary that maps entities in the pretrained model (en_core_web_xx) to the unique texts of the test set
# they resolve to. So that we know what all texts match a particular entity label. This will help us 
# in understanding which label maps to the labels of the dataset that is used for pre-training.
def spacy_ner_predictions_to_dict(examples_in_spacy_fmt, model):
    pred_ent_to_text_map = {} 
    for text,_ in examples_in_spacy_fmt:
        pred_doc = model(text)
        for ent in pred_doc.ents: 
            if ent.label_ not in pred_ent_to_text_map:
                pred_ent_to_text_map[ ent.label_ ] = set()
            else:
                pred_ent_to_text_map[ ent.label_ ].add( ent.text )
    return pred_ent_to_text_map

def print_annotaions_and_predictions(spacy_examples):
    for text,annotation in spacy_examples:
        pred_doc = pretrained_nlp(text)
        ypred = [ (ent.label_,ent.text) for ent in pred_doc.ents ]
        
        annot_list = [( ent[2], text[ent[0]:ent[1]] ) for ent in annotation['entities']]
        
        print('Human annotated: ', annot_list)
        print('Predictions    : ', ypred)
        print('\n')

def map_pred_tag_to_domain(pred_bilou_tag, equivalence_map):
    if pred_bilou_tag[0] == 'O':
        return 'O'
    bilou_part = pred_bilou_tag.split('-')[0]
    label_part = pred_bilou_tag.split('-')[1]
        
    if label_part not in equivalence_map.keys():
        return 'O'
    return bilou_part + '-' + equivalence_map[label_part]

def convert_doc_to_bilou_tags(doc):
    list_ = [] 
    for i in range(len(doc)):
        # Process BILOU tag
        if doc[i].ent_iob_ == 'O':
            bilou_tag = 'O'
        else:
            if doc[i].ent_iob_ == 'B':
                bilou_tag = 'U' if (i+1) < len(doc) and doc[i+1].ent_iob_ != 'I' else 'B'
            elif doc[i].ent_iob_ == 'I':
                bilou_tag = 'I' if (i+1) < len(doc) and doc[i+1].ent_iob_ == 'I' else 'L'
            else:
                assert "This is unexpected"
        bilou_tag = 'O' if doc[i].ent_type_ == '' else bilou_tag + '-' + doc[i].ent_type_
        
        list_.append( (bilou_tag, doc[i].text) )    
    #print('--->> ',list_)
    return list_

def perf_measure(y_actual, y_hat, label):
    TP = 0
    FP = 0
    TN = 0
    FN = 0
    for i in range(len(y_hat)):
        if y_actual[i]==y_hat[i]==label:
            TP += 1
        if y_hat[i]==label and y_actual[i]!=label:
            FP += 1
        if y_actual[i]!=label and y_hat[i]!=label:
            TN += 1
        if y_hat[i]!=label and y_actual[i]==label:
            FN += 1
    return(TP, FP, TN, FN)


def compute_scores(spacy_examples, model, label_map):
    perf_stats_per_tag = { }
    for text,annotation in spacy_examples:
        doc = model.make_doc(text)
        gold = GoldParse(doc, entities=annotation['entities'])
        gold_tag_list = [i for i in zip(gold.ner,gold.words)]
        #print('\nGold       : ', gold_tag_list)
        
        ner_tag_predict_doc = model(text)
        ner_tag_predict_list = convert_doc_to_bilou_tags(ner_tag_predict_doc)
        ner_tag_predict_list = list( map(lambda e: (map_pred_tag_to_domain(e[0],PRED_LABELS_EQUIV_MAP), e[1]), 
                                         ner_tag_predict_list) 
                                   )
        #print(  'Predicted  : ', ner_tag_predict_list)
        
        #for i in range(len(ner_tag_predict_list)):
        #    if ner_tag_predict_list[i][0] != gold_tag_list[i][0]:
        #        print('\t',ner_tag_predict_list[i], gold_tag_list[i])
        #        continue
        #    if ner_tag_predict_list[i][1] != gold_tag_list[i][1]:
        #        print('\t',ner_tag_predict_list[i], gold_tag_list[i])
        #        continue
        
        # Compute unique labels and populate y_true and y_pred
        unique_labels = set()
        y_true, y_pred = [],[]
        for t in gold_tag_list:
            if t[0] != 'O':
                unique_labels.add( t[0] )
            y_true.append( t[0] )
        for t in ner_tag_predict_list:
            if t[0] != 'O':
                unique_labels.add( t[0] )
            y_pred.append( t[0] )
            
        #print('\tUnique Labels :', unique_labels)
        #print('\ty_true        :', y_true)
        #print('\ty_pred        :', y_pred)
        for label in unique_labels:
            (TP, FP, TN, FN) = perf_measure(y_true, y_pred, label)
            CNT = len(y_true)
            #print(label,' ',f'(TP:{TP}, FP:{FP}, TN:{TN}, FN:{FN}, CNT:{CNT})')
            
            label_part = label.split('-')[1]
            if label_part not in perf_stats_per_tag:
                perf_stats_per_tag[label_part] = {'TP':0, 'FP':0, 'TN':0, 'FN':0, 'CNT':0}
                
            perf_stats_per_tag[label_part]['TP'] += TP
            perf_stats_per_tag[label_part]['FP'] += FP
            perf_stats_per_tag[label_part]['TN'] += TN
            perf_stats_per_tag[label_part]['FN'] += FN
            perf_stats_per_tag[label_part]['CNT'] += CNT
    return perf_stats_per_tag

def display_perf_stats_per_tag( stats_per_tag ):
    # Now compute the scores
    pprint(stats_per_tag)
    for tag,st in stats_per_tag.items():
        print(f'For label: "{tag}"')
        accuracy = (st['TP'] + st['TN']) / st['CNT']
        print("\tAccuracy : "  + str(accuracy * 100) + "%")
        
        precision = 0
        if (st['TP'] + st['FP']) != 0:
            precision = st['TP'] / (st['TP'] + st['FP'])
        print("\tPrecision : " + str(precision))
        
        recall = 0
        if (st['TP'] + st['FN']) != 0:
            recall = st['TP'] / (st['TP'] + st['FN'])
        print("\tRecall : "    + str(recall))
        
        fscore = 0
        if (precision + recall) != 0:
            fscore = (2 * precision * recall) / (precision + recall)
        print("\tF-score : "   + str(fscore))
        

def conv_dataset_to_match_domain(spacy_examples, dataset_to_model_tag_map):
    spacy_examples = copy.deepcopy(spacy_examples)
    for text,annotations in spacy_examples:
        entities = annotations['entities']
        for i,ent in enumerate(entities):
            if ent[2][0] == 'O':
                continue
            entities[i] = (ent[0],ent[1],dataset_to_model_tag_map[ent[2]])                                        
    return spacy_examples

# The 'model' parameter could either be a pretrained model. Default behavior is to 
# training from scratch. 
def train_spacy_model(train_examples, model=None):
    nlp = model  # create blank Language class
    if nlp == None:
        nlp = spacy.blank('en')
    
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if model == None:
        if 'ner' not in nlp.pipe_names:
            ner = nlp.create_pipe('ner')
            nlp.add_pipe(ner, last=True)
        # add labels
        for _, annotations in train_examples:
             for ent in annotations.get('entities'):
                ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(10):
            print("Starting iteration " + str(itn))
            random.shuffle(train_examples)
            losses = {}
            for text, annotations in train_examples:
                try:
                    nlp.update(
                        [text],         # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,       # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except:
                    continue
                    #print('Hello ' + str(sys.exc_info()))
            print(losses)
    return nlp

# Serialize model
def save_model_to_file(model, file_name):
    model_bytes = model.to_bytes()
    with open(file_name, 'wb') as f:
        f.write(model_bytes)
        f.flush()

# Deserialize
def load_model_from_file(file_name, nlp_load_into=None):
    with open(file_name, 'rb') as f:
        read_bytes = f.read()
    print(f'#Bytes-read: {len(read_bytes)}')
    #if nlp_load_into == None:
    #    nlp_load_into = spacy.load('en_core_web_md')
    #    #nlp_load_into.remove_pipe('ner')
    #    #for pipe_name in ['tagger', 'parser', 'ner']:
    #    #    if pipe_name not in nlp_load_into.pipe_names:
    #    #        pipe = nlp_load_into.create_pipe(pipe_name)
    #    #        nlp_load_into.add_pipe(pipe)
    nlp_load_into.from_bytes(read_bytes)
    return nlp_load_into


# **Working with datasets**

**Loading datasets**

In [7]:
with open('./train.json', 'r') as f:
    read_bytes = f.read()
print(len(read_bytes))
train_obj = json.loads(read_bytes)

26772142


In [8]:
with open('./test.json', 'r') as f:
    read_bytes = f.read()
print(len(read_bytes))
test_obj = json.loads(read_bytes)

6145341


**Convert dataset json docs into a format that spaCy understands**

In [15]:
training_examples = convert_conll_ner_to_spacy(train_obj)
test_examples = convert_conll_ner_to_spacy(test_obj)

In [7]:
for t in (training_examples + test_examples)[0:1000]:
    print(t)

('-DOCSTART-', {'entities': []})
('EU rejects German call to boycott British lamb .', {'entities': [(0, 2, 'ORG'), (11, 17, 'MISC'), (34, 41, 'MISC')]})
('Peter Blackburn', {'entities': [(0, 15, 'PER')]})
('BRUSSELS 1996-08-22', {'entities': [(0, 8, 'LOC')]})
('The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .', {'entities': [(4, 23, 'ORG'), (59, 65, 'MISC'), (94, 101, 'MISC')]})
("Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .", {'entities': [(0, 7, 'LOC'), (33, 47, 'ORG'), (72, 88, 'PER'), (164, 171, 'LOC')]})
('" We do n\'t support any such recommendation because we do n\'t see any grounds for it , " the Commission \'s chief spokesman Nikolaus van der Pas told a news briefing .', {'e

('Tension has mounted since Israeli Prime Minister Benjamin Netanyahu took office in June vowing to retain the Golan Heights Israel captured from Syria in the 1967 Middle East war .', {'entities': [(26, 33, 'MISC'), (49, 67, 'PER'), (109, 122, 'LOC'), (123, 129, 'LOC'), (144, 149, 'LOC'), (162, 173, 'LOC')]})
("Israeli-Syrian peace talks have been deadlocked over the Golan since 1991 despite the previous government 's willingness to make Golan concessions .", {'entities': [(0, 14, 'MISC'), (57, 62, 'LOC'), (129, 134, 'LOC')]})
('Peace talks between the two sides were last held in February .', {'entities': []})
('" The voices coming out of Damascus are bad , not good .', {'entities': [(27, 35, 'LOC')]})
('The media ...', {'entities': []})
('are full of expressions and declarations that must be worrying ...', {'entities': []})
('this artificial atmosphere is very dangerous because those who spread it could become its prisoners , " Levy said .', {'entities': [(104, 108, 'PER')]})
('" We e

('" Our stand is firm , namely we are calling on ( the Russian ) government to end the economic embargo on Iraq and resume trade ties between Russia and Iraq , " he told reporters .', {'entities': [(53, 60, 'MISC'), (105, 109, 'LOC'), (140, 146, 'LOC'), (151, 155, 'LOC')]})
('Zhirinovsky visited Iraq twice in 1995 .', {'entities': [(0, 11, 'PER'), (20, 24, 'LOC')]})
("Last October he was invited to attend the referendum held on Iraq 's presidency , which extended Saddam 's term for seven more years .", {'entities': [(61, 65, 'LOC'), (97, 103, 'PER')]})
('-DOCSTART-', {'entities': []})
('PRESS DIGEST - Iraq - Aug 22 .', {'entities': [(15, 19, 'LOC')]})
('BAGHDAD 1996-08-22', {'entities': [(0, 7, 'LOC')]})
('These are some of the leading stories in the official Iraqi press on Thursday .', {'entities': [(54, 59, 'MISC')]})
('Reuters has not verified these stories and does not vouch for their accuracy .', {'entities': [(0, 7, 'ORG')]})
('THAWRA', {'entities': [(0, 6, 'ORG')]})
("- Iraq 's 

('68 Steve Stricker', {'entities': [(3, 17, 'PER')]})
('69 Justin Leonard , Mark Brooks', {'entities': [(3, 17, 'PER'), (20, 31, 'PER')]})
('70 Tim Herron , Duffy Waldorf , Davis Love , Anders Forsbrand', {'entities': [(3, 13, 'PER'), (16, 29, 'PER'), (32, 42, 'PER'), (45, 61, 'PER')]})
('( Sweden ) , Nick Faldo ( Britain ) , John Cook , Steve Jones , Phil', {'entities': [(2, 8, 'LOC'), (13, 23, 'PER'), (26, 33, 'LOC'), (38, 47, 'PER'), (50, 61, 'PER'), (64, 68, 'PER')]})
('Mickelson , Greg Norman ( Australia )', {'entities': [(0, 9, 'PER'), (12, 23, 'PER'), (26, 35, 'LOC')]})
('71 Ernie Els ( South Africa ) , Scott Hoch', {'entities': [(3, 12, 'PER'), (15, 27, 'LOC'), (32, 42, 'PER')]})
('72 Clarence Rose , Loren Roberts , Fred Funk , Sven Struver', {'entities': [(3, 16, 'PER'), (19, 32, 'PER'), (35, 44, 'PER'), (47, 59, 'PER')]})
('( Germany ) , Alexander Cejka ( Germany ) , Hal Sutton , Tom Lehman', {'entities': [(2, 9, 'LOC'), (14, 29, 'PER'), (32, 39, 'LOC'), (44, 54, 'PER'), (57,

('Ireland midfielder Roy Keane has signed a new four-year contract with English league and F.A. Cup champions Manchester United .', {'entities': [(0, 7, 'LOC'), (19, 28, 'PER'), (70, 77, 'MISC'), (89, 97, 'MISC'), (108, 125, 'ORG')]})
('" Roy agreed a new deal before last night \'s game against Everton and we are delighted , " said United manager Alex Ferguson on Thursday .', {'entities': [(2, 5, 'PER'), (58, 65, 'ORG'), (96, 102, 'ORG'), (111, 124, 'PER')]})
('-DOCSTART-', {'entities': []})
('TENNIS - RESULTS AT CANADIAN OPEN .', {'entities': [(20, 33, 'MISC')]})
('TORONTO 1996-08-21', {'entities': [(0, 7, 'LOC')]})
('Results from the Canadian Open', {'entities': [(17, 30, 'MISC')]})
('tennis tournament on Wednesday ( prefix number denotes', {'entities': []})
('seeding ) :', {'entities': []})
('Second round', {'entities': []})
('Daniel Nestor ( Canada ) beat 1 - Thomas Muster ( Austria ) 6-3 7-5', {'entities': [(0, 13, 'PER'), (16, 22, 'LOC'), (34, 47, 'PER'), (50, 57, 'LOC')]})
('Mik

("Australia last won the Davis Cup in 1986 , but they were beaten finalists against Germany three years ago under Fraser 's guidance .", {'entities': [(0, 9, 'LOC'), (23, 32, 'MISC'), (82, 89, 'LOC'), (112, 118, 'PER')]})
('-DOCSTART-', {'entities': []})
('BADMINTON - MALAYSIAN OPEN RESULTS .', {'entities': [(12, 26, 'MISC')]})
('KUALA LUMPUR 1996-08-22', {'entities': [(0, 12, 'LOC')]})
('Results in the Malaysian', {'entities': [(15, 24, 'MISC')]})
('Open badminton tournament on Thursday ( prefix number denotes', {'entities': [(0, 4, 'MISC')]})
('seeding ) :', {'entities': []})
("Men 's singles , third round", {'entities': []})
('9/16 - Luo Yigang ( China ) beat Hwang Sun-ho ( South Korea ) 15-3', {'entities': [(7, 17, 'PER'), (20, 25, 'LOC'), (33, 38, 'PER'), (39, 45, 'MISC'), (48, 59, 'LOC')]})
('15-7', {'entities': []})
('Jason Wong ( Malaysia ) beat Abdul Samad Ismail ( Malaysia ) 16-18', {'entities': [(0, 10, 'PER'), (13, 21, 'LOC'), (29, 47, 'PER'), (50, 58, 'LOC')]})
('15-2 17-1

('TEXAS AT MINNESOTA', {'entities': [(0, 5, 'ORG'), (9, 18, 'LOC')]})
('NATIONAL LEAGUE', {'entities': [(0, 15, 'MISC')]})
('EASTERN DIVISION', {'entities': [(0, 16, 'MISC')]})
('W L PCT GB', {'entities': []})
('ATLANTA 79 46 .632 -', {'entities': [(0, 7, 'ORG')]})
('MONTREAL 67 58 .536 12', {'entities': [(0, 8, 'ORG')]})
('NEW YORK 59 69 .461 21 1/2', {'entities': [(0, 8, 'ORG')]})
('FLORIDA 58 69 .457 22', {'entities': [(0, 7, 'ORG')]})
('PHILADELPHIA 52 75 .409 28', {'entities': [(0, 12, 'ORG')]})
('CENTRAL DIVISION', {'entities': [(0, 16, 'MISC')]})
('HOUSTON 68 59 .535 -', {'entities': [(0, 7, 'ORG')]})
('ST LOUIS 67 59 .532 1/2', {'entities': [(0, 8, 'ORG')]})
('CHICAGO 63 62 .504 4', {'entities': [(0, 7, 'ORG')]})
('CINCINNATI 62 62 .500 4 1/2', {'entities': [(0, 10, 'ORG')]})
('PITTSBURGH 53 73 .421 14 1/2', {'entities': [(0, 10, 'ORG')]})
('WESTERN DIVISION', {'entities': [(0, 16, 'MISC')]})
('SAN DIEGO 70 59 .543 -', {'entities': [(0, 9, 'ORG')]})
('LOS ANGELES 66 60 .524 2 1

('Stanley owns a .367 career batting average with the bases loaded ( 33-for-90 ) .', {'entities': [(0, 7, 'PER')]})
("Boston 's Mo Vaughn went 3-for-3 with a walk , stole home for one of his three runs scored and collected his 116th RBI .", {'entities': [(0, 6, 'LOC'), (10, 19, 'PER'), (115, 118, 'MISC')]})
('Scott Brosius homered and drove in two runs for the Athletics , who have lost seven of their last nine games .', {'entities': [(0, 13, 'PER'), (52, 61, 'ORG')]})
("In Detroit , Brad Ausmus 's three-run homer capped a four-run eighth and lifted the Tigers to a 7-4 victory over the reeling Chicago White Sox .", {'entities': [(3, 10, 'LOC'), (13, 24, 'PER'), (84, 90, 'ORG'), (125, 142, 'ORG')]})
('The Tigers have won consecutive games after dropping eight in a row , but have won nine of their last 12 at home .', {'entities': [(4, 10, 'ORG')]})
('The White Sox have lost six of their last eight games .', {'entities': [(4, 13, 'ORG')]})
('In Kansas City , Juan Guzman tossed a complete-g

# **Download and Load pretrained model**

In [24]:
import spacy.cli
spacy.cli.download("en_core_web_md")
pretrained_nlp = spacy.load('en_core_web_md')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


# **Exploratory analysis of the dataset to find the texts corresponding different labels**

The NER entities of SpaCy's model and the manually annotated labels of the dataset dont match. So we need to map those labels to the closest entities that SpaCy's NER model deals with. 

This analysis will help us figure out the intention of the human annotator such that we can then be able to map the labels from the dataset's domain to the closest labels that the model was trained on. 

Labels is the dataset:
* 'LOC'
* 'PER'
* 'ORG'
* 'MISC'

Labels that the model was trained with:
* 'NORP'       
* 'WORK_OF_ART'
* 'FAC'        
* 'PRODUCT'    
* 'EVENT'      
* 'GPE'        
* 'LOC'        
* 'ORG'        
* 'PERSON'     



**Printing annotated texts corresponding labels in the dataset's domain**

In [10]:
labels2text_dict_train    = spacy_get_annotations_by_labels(training_examples)
labels2text_dict_test     = spacy_get_annotations_by_labels(test_examples)
merged_dict               = merge_labels_to_text_dict(labels2text_dict_train, labels2text_dict_test)

label_ = 'MISC'  # Query this labels
display_labels2text_dict(merged_dict, 50)

set_len(ORG) = 795
set_len(MISC) = 301
set_len(PER) = 1099
set_len(LOC) = 505
{'LOC': ['MACEDONIA',
         'Guangxi',
         'Venezuela',
         'East Kalimantan',
         'Dampier',
         'India',
         'Nebraska',
         'BALKAN',
         'Hartford',
         'Nevada',
         'Kaohsiung',
         'SEOUL',
         'Minn',
         'Venezuela',
         'N.IRELAND',
         'NJ',
         'Asia',
         'ABIDJAN',
         'Fos',
         'BONN',
         'Burkina Faso',
         'ALBANIA',
         'Milan',
         'League',
         'KUUSAMO',
         'San Francisco',
         'Pakistan',
         'Saudi Arabia',
         'Czech Republic',
         'London',
         'PAKISTAN',
         'SYRIA',
         'Mexico City',
         'Slovenia',
         'DENVER',
         'GOLDEN STATE',
         'DETROIT',
         'Hebron',
         'WARSAW',
         'Pacific Coast',
         'NEW ORLEANS',
         'SAN JOSE',
         'Arad',
         'TIGNES',
         'Min

**Printing annotated texts corresponding labels in the model's domain**

In [25]:
doc = pretrained_nlp('Apple is a fruit')
print(pretrained_nlp.pipe_names)
for t in doc.ents: print(t.ent_label_)

['tagger', 'parser', 'ner']


In [19]:
# Following call takes time. Uncommnet if contents of the parameter has changed
pred_dict = spacy_ner_predictions_to_dict(test_examples, pretrained_nlp)

display_labels2text_dict(pred_dict, 80)

{}


In [17]:
print(pred_dict)

{}


**Post analysis mapping of labels from the dataset's domain to the model's domain**
It was found after analysis that :
* the label MISC from the dataset's domain map roughly to model domain labels: 'NORP', 'WORK_OF_ART', 'FAC', 'PRODUCT', 'EVENT'.
* Dataset label 'LOC' can be mapped to model label 'GPE'/'LOC'
* Dataset label 'ORG' can be mapped to model label 'ORG'
* Dataset label 'PER' can be mapped to model label 'PERSON'

In [11]:
PRED_LABELS_EQUIV_MAP = {
    'NORP'       : 'MISC',
    'WORK_OF_ART': 'MISC',
    'FAC'        : 'MISC',
    'PRODUCT'    : 'MISC',
    'EVENT'      : 'MISC',
    'GPE'        : 'LOC',
    'LOC'        : 'LOC',
    'ORG'        : 'ORG', 
    'PERSON'     : 'PER'
}

# **Evaluating on spaCy's pretrained, medium, vanilla model**

This score would be compared with the score on the same model which has been retrained on the training data. We count the
* Number of true positives  - Labelled correctly
* Number of false positives - Labelled and did not got it correct AND those that should not have been labelled.
* Number of false negatives - Those that were not labelled at all

In [12]:
#print_annotaions_and_predictions(test_examples)
stats_per_tag = compute_scores(test_examples, pretrained_nlp, PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag)

{'LOC': {'CNT': 34672, 'FN': 632, 'FP': 486, 'TN': 32242, 'TP': 1312},
 'MISC': {'CNT': 23987, 'FN': 541, 'FP': 304, 'TN': 22659, 'TP': 483},
 'ORG': {'CNT': 50198, 'FN': 1575, 'FP': 1179, 'TN': 46482, 'TP': 962},
 'PER': {'CNT': 46624, 'FN': 958, 'FP': 731, 'TN': 43066, 'TP': 1869}}
For label: "PER"
	Accuracy : 96.37740219629374%
	Precision : 0.7188461538461538
	Recall : 0.6611248673505483
	F-score : 0.6887783305693754
For label: "LOC"
	Accuracy : 96.77549607752654%
	Precision : 0.7296996662958843
	Recall : 0.6748971193415638
	F-score : 0.7012292891501871
For label: "ORG"
	Accuracy : 94.5137256464401%
	Precision : 0.44932274638019615
	Recall : 0.37918801734331886
	F-score : 0.4112868747327918
For label: "MISC"
	Accuracy : 96.47725851502898%
	Precision : 0.613722998729352
	Recall : 0.4716796875
	F-score : 0.5334069574820541


# **Retraining pretrained-medium SpaCy model and evaluation**

**Convert dataset to match domain**

In [33]:
dataset_to_model_tag_map = { 'MISC': 'NORP',
                             'LOC' : 'LOC',
                             'ORG' : 'ORG',
                             'PER' : 'PERSON'}
mapped_examples = conv_dataset_to_match_domain(training_examples, 
                                               dataset_to_model_tag_map)

In [None]:
train_spacy_model(mapped_examples, pretrained_nlp)

Starting iteration 0


**Evaluating the retrained model on test data** 

In [14]:
#nlp_md_retrained = copy.deepcopy(pretrained_nlp) # *ExpensiveResource*. comment after execution

#save_model_to_file(nlp_md_retrained, 'saved_models/retrained__md_spacy_model.bin')

loaded = load_model_from_file('saved_models/retrained__md_spacy_model.bin', 
                              spacy.load('en_core_web_md'))

stats_per_tag = compute_scores(test_examples, 
                               loaded,          ## Need to review before executing the cell
                               PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag)

#Bytes-read: 224171774
{'LOC': {'CNT': 34285, 'FN': 243, 'FP': 416, 'TN': 31925, 'TP': 1701},
 'MISC': {'CNT': 23083, 'FN': 304, 'FP': 218, 'TN': 21841, 'TP': 720},
 'ORG': {'CNT': 41865, 'FN': 673, 'FP': 359, 'TN': 38969, 'TP': 1864},
 'PER': {'CNT': 43017, 'FN': 273, 'FP': 251, 'TN': 39939, 'TP': 2554}}
For label: "PER"
	Accuracy : 98.78187693237557%
	Precision : 0.9105169340463458
	Recall : 0.9034311991510435
	F-score : 0.9069602272727272
For label: "LOC"
	Accuracy : 98.07787662242964%
	Precision : 0.8034955125177138
	Recall : 0.875
	F-score : 0.8377246983501601
For label: "ORG"
	Accuracy : 97.53493371551414%
	Precision : 0.8385065227170491
	Recall : 0.7347260543949546
	F-score : 0.7831932773109245
For label: "MISC"
	Accuracy : 97.73859550318417%
	Precision : 0.767590618336887
	Recall : 0.703125
	F-score : 0.7339449541284404


# **Training a SpaCy model from scratch and its evaluation**

Now lets train a model from sratch and compute the score. The idea here would be to see if the performance on the retrained spaCy model is better than that of on a model that has been trained from scratch 

In [29]:
#nlp_from_scratch = train_spacy_model(training_examples)   # No model specified i.e spacy.blank(...)

#save_model_to_file(nlp_from_scratch, 'saved_models/space_model_from_scratch.bin')

loaded =  load_model_from_file('saved_models/space_model_from_scratch.bin')

stats_per_tag = compute_scores(test_examples, 
                               loaded,     ## Need to review before executing the cell
                               PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag)

#Bytes-read: 151121373


  "__main__", mod_spec)


{'LOC': {'CNT': 28470, 'FN': 1944, 'FP': 0, 'TN': 26526, 'TP': 0},
 'MISC': {'CNT': 19528, 'FN': 1024, 'FP': 0, 'TN': 18504, 'TP': 0},
 'ORG': {'CNT': 37694, 'FN': 2537, 'FP': 0, 'TN': 35157, 'TP': 0},
 'PER': {'CNT': 39024, 'FN': 2827, 'FP': 0, 'TN': 36197, 'TP': 0}}
For label: "PER"
	Accuracy : 92.75574005740057%
	Precision : 0
	Recall : 0.0
	F-score : 0
For label: "LOC"
	Accuracy : 93.17175974710221%
	Precision : 0
	Recall : 0.0
	F-score : 0
For label: "MISC"
	Accuracy : 94.75624743957395%
	Precision : 0
	Recall : 0.0
	F-score : 0
For label: "ORG"
	Accuracy : 93.26948585981853%
	Precision : 0
	Recall : 0.0
	F-score : 0


# **Evaluating on SpaCy's pretrained, large, vanilla model**

Download spaCy's large model (en_core_web_lg) and find how it performs on the test examples. Get the performance scores.

Later, retrain this "large" model and see how it compares with the vanilla and the retrained results on "medium" model. 

In [30]:
# Downloading large model
import spacy.cli
spacy.cli.download("en_core_web_lg")
nlp_lg_pretrained = spacy.load('en_core_web_lg')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [31]:
stats_per_tag = compute_scores(test_examples, nlp_lg_pretrained, PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag)

{'LOC': {'CNT': 36046, 'FN': 423, 'FP': 789, 'TN': 33313, 'TP': 1521},
 'MISC': {'CNT': 24347, 'FN': 483, 'FP': 356, 'TN': 22967, 'TP': 541},
 'ORG': {'CNT': 48924, 'FN': 1372, 'FP': 1282, 'TN': 45105, 'TP': 1165},
 'PER': {'CNT': 45835, 'FN': 679, 'FP': 706, 'TN': 42302, 'TP': 2148}}
For label: "PER"
	Accuracy : 96.97829169848369%
	Precision : 0.7526278906797477
	Recall : 0.7598160594269544
	F-score : 0.7562048935046648
For label: "LOC"
	Accuracy : 96.63762969538922%
	Precision : 0.6584415584415585
	Recall : 0.7824074074074074
	F-score : 0.7150916784203103
For label: "ORG"
	Accuracy : 94.57525958629711%
	Precision : 0.47609317531671436
	Recall : 0.4592037839968467
	F-score : 0.4674959871589085
For label: "MISC"
	Accuracy : 96.55399022466834%
	Precision : 0.6031215161649944
	Recall : 0.5283203125
	F-score : 0.5632483081728266


# **Retraining pretrained-large SpaCy model and evaluation**

In [None]:
train_spacy_model(mapped_examples, pretrained_nlp) # Comment after use

In [0]:
#nlp_retrained_lg = copy.deepcopy(pretrained_nlp) # *ExpensiveResource*. comment after execution

#save_model_to_file(nlp_retrained_lg, 'saved_models/space_model_from_scratch.bin')

loaded =  load_model_from_file('saved_models/space_model_from_scratch.bin',  
                               spacy.blank('en'))

stats_per_tag = compute_scores(test_examples, 
                               loaded,     ## Need to review before executing the cell
                               PRED_LABELS_EQUIV_MAP)
display_perf_stats_per_tag(stats_per_tag)

{'LOC': {'CNT': 38238, 'FN': 264, 'FP': 637, 'TN': 35657, 'TP': 1680},
 'MISC': {'CNT': 24728, 'FN': 325, 'FP': 345, 'TN': 23359, 'TP': 699},
 'ORG': {'CNT': 46069, 'FN': 679, 'FP': 733, 'TN': 42799, 'TP': 1858},
 'PER': {'CNT': 43711, 'FN': 441, 'FP': 371, 'TN': 40513, 'TP': 2386}}


'For label: "LOC"'

'\tAccuracy : 97.6437052147079%'

'\tPrecision : 0.7250755287009063'

'\tRecall : 0.8641975308641975'

'\tF-score : 0.7885472893686927'

'For label: "MISC"'

'\tAccuracy : 97.29052086703332%'

'\tPrecision : 0.6695402298850575'

'\tRecall : 0.6826171875'

'\tF-score : 0.6760154738878144'

'For label: "ORG"'

'\tAccuracy : 96.9350322342573%'

'\tPrecision : 0.7170976456966422'

'\tRecall : 0.7323610563657864'

'\tF-score : 0.7246489859594384'

'For label: "PER"'

'\tAccuracy : 98.14234403239459%'

'\tPrecision : 0.8654334421472615'

'\tRecall : 0.844004244782455'

'\tF-score : 0.8545845272206304'

# **Results**
Scores on 

# **Conclusions**
