# Evaluating the results of Training (K-fold Cross-Validation)

The results of training (and its evaluation) will depend on how the data was split into training and testing sets. In this worksheet, we use repeated random subsampling to assess the performance of our trained model.

According to [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)): 
>In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[11] but in general k remains an unfixed parameter.

More information available [here](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

For us, measuring performance with different samples is important because of the wide variation in the data: texts vary widely in length, in type, and in transcription conventions. We cannot tell clearly whether the performance of the model, when measured only once, reflects an improvement in the model through training or whether it is the result of the division into training and testing data. 

In [31]:
#Import necessary modules

from __future__ import unicode_literals, print_function
import spacy
from spacy.lang.es import Spanish 
from spacy import displacy
from spacy.tokens import Doc
from collections import defaultdict, Counter
from spacy.attrs import ORTH
from spacy.scorer import Scorer
from spacy.language import GoldParse
from spacy.util import minibatch, compounding

import pandas as pd
import numpy as np
import json
import plac
import random
from sklearn.model_selection import train_test_split
from pathlib import Path
from copy import deepcopy
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import KFold
import itertools

In [56]:
# Read Tagged Data from JSON file
with open('TaggedData_SF.json', 'r', encoding='utf-8') as fp2:
    TAGGED_DATA = json.load(fp2)
    
TD_np = np.array(TAGGED_DATA)

Spacy has a built-in function for evaluating a model's performance using the [command line](https://spacy.io/api/cli#evaluate), but alternatively you can define a function like the one below. It takes the NER model and examples that you input and returns several metrics:
        - UAS (Unlabelled Attachment Score) 
        - LAS (Labelled Attachment Score)
        - ents_p
        - ents_r
        - ents_f
        - tags_acc
        - token_acc

[According](https://github.com/explosion/spaCy/issues/2405) to one of the creators of Spacy, 
>The UAS and LAS are standard metrics to evaluate dependency parsing. UAS is the proportion of tokens whose head has been correctly assigned, LAS is the proportion of tokens whose head has been correctly assigned with the right dependency label (subject, object, etc).
>ents_p, ents_r, ents_f are the precision, recall and fscore for the NER task.
>tags_acc is the POS tagging accuracy.
>token_acc seems to be the precision for token segmentation.

The key metrics for this task are the precision, recall and f-score.
**Precision** (ents_p) is the ratio of correctly-labeled entities out of all the entities labeled. (True Positive/(True Positive+False Positive)).
**Recall**  (ents_r) is the ratio of correctly-labeled entities out of all true entities (True Positive/(True Positive+False Negative)). The F-score is the mean of both values.  

These metrics all appear averaged out through all the entity types (labels) and then detailed for each label in particular. We want these values to be as close as possible to 100. 

In [33]:
#Define the evaluate function
def evaluate(ner_model, examples):
    scorer = Scorer()
    for sents, ents in examples:
        doc_gold = ner_model.make_doc(sents)
        gold = GoldParse(doc_gold, entities=ents['entities'])
        pred_value = ner_model(sents)
        scorer.score(pred_value, gold)
    return scorer.scores

Next, we will load the spacy model and split the data into the n batches that we will use in the cross-validation. In this procedure, we will train the model n-1 times, reserving one fold for testing the model each time. 

In [34]:
# Load the Spacy Model
nlp= spacy.load('es_core_news_md')

In [57]:
#Define parameters of k-fold split (5 batches, with random shuffle, set seed = 2)

kf = KFold(n_splits=5, random_state=7, shuffle=True)

In [58]:
split= kf.split(TD_np)

We also create a dataframe to store the results of each training, with the evaluation scores for each label type. 

In [43]:
#Define a blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data = pd.DataFrame(columns=columns)
eval_data = eval_data.fillna(0)

Finally, we run the training loop for each set of training data excluding one fold and evaluate the results, storing these in our dataframe. We are using a copy of the NLP model because we want the training to start afresh for each set of training data. Otherwise, the model would be trained on all the data including the test data, leading to the model overperforming on the tagged data compared to new samples that we are interested in tagging later.

In [44]:
for train_index, test_index in split:
    
    #Generate training and test data
    traindata = TD_np[train_index]
    testdata = TD_np[test_index]
    
    #Load the model to be trained (save separately, because we do not want to repeatedly retrain the same model)
    nlp1 = deepcopy(nlp)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp1.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #This piece of code creates a loop in which we train the model, but only for the NER component (disabling the tagger and the parser, which we are not using here).
    with nlp1.disable_pipes('tagger','parser'):
    #Here we resume training, alternatively you could begin_training if you are starting on a new model.
        optimizer= nlp1.resume_training()
    #Would need to figure this out, they are the sizes for the minibatching
        sizes = compounding(1.0, 4.0, 1.001)
    #This loops the training mechanism 10 times, randomly shuffling the training data and creating mini-batches from which the algorithm learns to label. Each time a batch is processed, the model is updated.
        for itn in range(10):
            random.shuffle(traindata)
            batches = minibatch(traindata, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp1.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    results = evaluate(nlp1,testdata)
    evaluation= dict((k, results[k]) for k in ['ents_per_type'] 
                                        if k in results)
    
    ev_date = [val.get('DATE') for val in evaluation.values()]
    ev_mon= [val.get('MON') for val in evaluation.values()]
    ev_obj= [val.get('OBJ') for val in evaluation.values()]
    ev_org= [val.get('ORG') for val in evaluation.values()]
    ev_per= [val.get('PER') for val in evaluation.values()]
    ev_loc= [val.get('LOC') for val in evaluation.values()]
    
    dlist = list(ev_date[0].values())
    newrow1= {'ents_p': dlist[0],'ents_r': dlist[1],'ents_f':dlist[2],'label':'DATE'}
    
    mlist = list(ev_mon[0].values())
    newrow2= {'ents_p': mlist[0],'ents_r':mlist[1],'ents_f':mlist[2],'label':'MON'}
                  
    oblist = list(ev_obj[0].values())
    newrow3= {'ents_p':oblist[0],'ents_r':oblist[1],'ents_f':oblist[2],'label':'OBJ'}
                  
    orlist = list(ev_org[0].values())
    newrow4= {'ents_p':orlist[0],'ents_r':orlist[1],'ents_f':orlist[2],'label':'ORG'}
                  
    plist = list(ev_per[0].values())
    newrow5= {'ents_p':plist[0],'ents_r':plist[1],'ents_f':plist[2],'label':'PER'}
                  
    llist = list(ev_loc[0].values())
    newrow6= {'ents_p':llist[0],'ents_r':llist[1],'ents_f':llist[2],'label':'LOC'}
                  
    eval_data=eval_data.append(newrow1,ignore_index=True)
    eval_data=eval_data.append(newrow2,ignore_index=True)
    eval_data=eval_data.append(newrow3,ignore_index=True)
    eval_data=eval_data.append(newrow4,ignore_index=True)
    eval_data=eval_data.append(newrow5,ignore_index=True)
    eval_data=eval_data.append(newrow6,ignore_index=True)

  "__main__", mod_spec)


Losses {'ner': 30721.031245049875}
Losses {'ner': 35812.11033296434}
Losses {'ner': 42480.974824460296}
Losses {'ner': 33017.01195325819}
Losses {'ner': 27781.430804745098}
Losses {'ner': 20094.126222006045}
Losses {'ner': 20143.287001546472}
Losses {'ner': 23534.748939459212}
Losses {'ner': 25472.321697831154}
Losses {'ner': 26433.625859245658}


  "__main__", mod_spec)


Losses {'ner': 33503.69745281518}
Losses {'ner': 21147.750624527725}
Losses {'ner': 15577.639918600704}
Losses {'ner': 13817.42245296805}
Losses {'ner': 16201.630242561194}
Losses {'ner': 20305.903943861354}
Losses {'ner': 23259.951532332227}
Losses {'ner': 25911.089790869504}
Losses {'ner': 27671.270382288523}
Losses {'ner': 29422.120310395956}


  "__main__", mod_spec)


Losses {'ner': 29816.778590951126}
Losses {'ner': 22040.97248751149}
Losses {'ner': 18571.245137903257}
Losses {'ner': 12753.101068754551}
Losses {'ner': 12725.461273144523}
Losses {'ner': 11771.969168598589}
Losses {'ner': 11694.639036175777}
Losses {'ner': 13953.93449570646}
Losses {'ner': 15851.748594797857}
Losses {'ner': 19931.722536871108}


  "__main__", mod_spec)


Losses {'ner': 31870.99545417662}
Losses {'ner': 37097.16218843594}
Losses {'ner': 22824.430444302758}
Losses {'ner': 18109.200683364295}
Losses {'ner': 18200.57962171536}
Losses {'ner': 21151.260026738048}
Losses {'ner': 22151.514805567043}
Losses {'ner': 25604.13780531753}
Losses {'ner': 26728.886173009872}
Losses {'ner': 28074.35024024546}


  "__main__", mod_spec)


Losses {'ner': 33761.27361989506}
Losses {'ner': 36317.208633122435}
Losses {'ner': 45372.61358036703}
Losses {'ner': 50140.59016311784}
Losses {'ner': 47966.425998374965}
Losses {'ner': 31716.675456080302}
Losses {'ner': 28185.290889454925}
Losses {'ner': 26854.486466115366}
Losses {'ner': 24467.148040562693}
Losses {'ner': 14209.33681134712}


Below, we print the contents of our evaluation dataframe:

In [45]:
print(eval_data)

        ents_p     ents_r     ents_f label
0     0.000000   0.000000   0.000000  DATE
1    80.000000   4.705882   8.888889   MON
2     0.000000   0.000000   0.000000   OBJ
3    50.000000   4.000000   7.407407   ORG
4    86.065574  83.333333  84.677419   PER
5    88.461538  64.335664  74.493927   LOC
6     0.000000   0.000000   0.000000  DATE
7    50.000000   1.265823   2.469136   MON
8     0.000000   0.000000   0.000000   OBJ
9    84.615385  16.666667  27.848101   ORG
10   83.018868  80.586081  81.784387   PER
11   73.006135  68.000000  70.414201   LOC
12    0.000000   0.000000   0.000000  DATE
13   40.000000   4.545455   8.163265   MON
14    0.000000   0.000000   0.000000   OBJ
15  100.000000   5.714286  10.810811   ORG
16   83.486239  84.259259  83.870968   PER
17   83.561644  82.432432  82.993197   LOC
18    0.000000   0.000000   0.000000  DATE
19   66.666667   3.076923   5.882353   MON
20    0.000000   0.000000   0.000000   OBJ
21   75.000000   5.084746   9.523810   ORG
22   84.210

From which we can create estimates of performance averaged over all the trials, providing a better estimate of each measurement with its standard deviation.

In [46]:
#Measure mean and standard deviation of f, p and r scores for each label 
a = eval_data.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

In [47]:
print(a)

          ents_f               ents_p                ents_r          
            mean       std       mean        std       mean       std
label                                                                
DATE    0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
LOC    74.689103  6.262153  81.593906   6.570961  69.265073  8.568399
MON     5.080729  3.784209  47.333333  30.586853   2.718817  2.056479
OBJ     0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
ORG    16.451359  9.947839  72.449393  21.273657   9.864568  6.793646
PER    82.587286  1.857769  83.206864   2.496253  81.997533  1.693985


As can be seen, the different labels perform consistently at the levels printed above. The PER and LOC labels are perhaps the most useful, whereas the others can still be improved. 

# Evaluating Spelling Normalization

We can apply the evaluation above to a model trained with text whose spelling has been normalized, thus evaluating whether the inclusion of a normalization dictionary improves training results.

To apply the spelling normalization, we create a pipeline component that modifies the NORM attribute of each token according to a dictionary we provide. Spacy does not modify any text supplied permanently, this is the way they provide for correcting for spelling variation. 

In [59]:
# Read Norm Exceptions from JSON file
with open('normalizeddict.json', 'r', encoding='utf-8') as fp3:
    NORM_EXCEPTIONS = json.load(fp3)

In [60]:
# Load model
nlp2= spacy.load('es_core_news_md')

#Define and add pipeline component that updates .norm attribute

def add_custom_norms(doc):
    for token in doc:
        if token.text in NORM_EXCEPTIONS:
            token.norm_ = NORM_EXCEPTIONS[token.text]
    return doc

#Add component to the pipeline

nlp2.add_pipe(add_custom_norms, first=True)

In [61]:
#Define a new blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data2 = pd.DataFrame(columns=columns)
eval_data2 = eval_data2.fillna(0)

In [62]:
# Train and evaluate Model trained with EMS dictionary

for train_index, test_index in split:
    
    #Generate training and test data
    traindata = TD_np[train_index]
    testdata = TD_np[test_index]
    
    #Load the model to be trained (save separately, because we do not want to repeatedly retrain the same model)
    nlp3 = deepcopy(nlp2)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp3.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #This piece of code creates a loop in which we train the model, but only for the NER component (disabling the tagger and the parser, which we are not using here).
    with nlp3.disable_pipes('tagger','parser'):
    #Here we resume training, alternatively you could begin_training if you are starting on a new model.
        optimizer= nlp3.resume_training()
    #Would need to figure this out, they are the sizes for the minibatching
        sizes = compounding(1.0, 4.0, 1.001)
    #This loops the training mechanism 10 times, randomly shuffling the training data and creating mini-batches from which the algorithm learns to label. Each time a batch is processed, the model is updated.
        for itn in range(10):
            random.shuffle(traindata)
            batches = minibatch(traindata, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp3.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    results = evaluate(nlp3,testdata)
    evaluation= dict((k, results[k]) for k in ['ents_per_type'] 
                                        if k in results)
    
    ev_date = [val.get('DATE') for val in evaluation.values()]
    ev_mon= [val.get('MON') for val in evaluation.values()]
    ev_obj= [val.get('OBJ') for val in evaluation.values()]
    ev_org= [val.get('ORG') for val in evaluation.values()]
    ev_per= [val.get('PER') for val in evaluation.values()]
    ev_loc= [val.get('LOC') for val in evaluation.values()]
    
    dlist = list(ev_date[0].values())
    newrow1= {'ents_p': dlist[0],'ents_r': dlist[1],'ents_f':dlist[2],'label':'DATE'}
    
    mlist = list(ev_mon[0].values())
    newrow2= {'ents_p': mlist[0],'ents_r':mlist[1],'ents_f':mlist[2],'label':'MON'}
                  
    oblist = list(ev_obj[0].values())
    newrow3= {'ents_p':oblist[0],'ents_r':oblist[1],'ents_f':oblist[2],'label':'OBJ'}
                  
    orlist = list(ev_org[0].values())
    newrow4= {'ents_p':orlist[0],'ents_r':orlist[1],'ents_f':orlist[2],'label':'ORG'}
                  
    plist = list(ev_per[0].values())
    newrow5= {'ents_p':plist[0],'ents_r':plist[1],'ents_f':plist[2],'label':'PER'}
                  
    llist = list(ev_loc[0].values())
    newrow6= {'ents_p':llist[0],'ents_r':llist[1],'ents_f':llist[2],'label':'LOC'}
                  
    eval_data2=eval_data.append(newrow1,ignore_index=True)
    eval_data2=eval_data.append(newrow2,ignore_index=True)
    eval_data2=eval_data.append(newrow3,ignore_index=True)
    eval_data2=eval_data.append(newrow4,ignore_index=True)
    eval_data2=eval_data.append(newrow5,ignore_index=True)
    eval_data2=eval_data.append(newrow6,ignore_index=True)

  "__main__", mod_spec)


Losses {'ner': 27369.295011232218}
Losses {'ner': 25288.68191463001}
Losses {'ner': 22012.064205526956}
Losses {'ner': 20950.277915531755}
Losses {'ner': 15612.289248465873}
Losses {'ner': 14711.363196730337}
Losses {'ner': 16880.3947647771}
Losses {'ner': 17808.167656041456}
Losses {'ner': 19775.498188718222}
Losses {'ner': 21840.016411602497}


  "__main__", mod_spec)


Losses {'ner': 30200.0166880758}
Losses {'ner': 32703.116319365163}
Losses {'ner': 25241.50693244629}
Losses {'ner': 24407.429937910696}
Losses {'ner': 22122.65288857594}
Losses {'ner': 22206.00163769722}
Losses {'ner': 22984.677610472776}
Losses {'ner': 25015.930037163198}
Losses {'ner': 24900.114258908667}
Losses {'ner': 25856.664362482727}


  "__main__", mod_spec)


Losses {'ner': 26335.389584492088}
Losses {'ner': 22103.978002444463}
Losses {'ner': 18943.524095241548}
Losses {'ner': 15619.313920508259}
Losses {'ner': 15434.403349751716}
Losses {'ner': 16787.341459652074}
Losses {'ner': 20236.44105457893}
Losses {'ner': 22059.997543136327}
Losses {'ner': 24899.461682455614}
Losses {'ner': 25841.326803692617}


  "__main__", mod_spec)


Losses {'ner': 31956.094522983014}
Losses {'ner': 34606.20847437257}
Losses {'ner': 25913.667436758915}
Losses {'ner': 18937.027232236862}
Losses {'ner': 19383.52967597225}
Losses {'ner': 19120.842931861873}
Losses {'ner': 19254.03517245932}
Losses {'ner': 21625.609265983105}
Losses {'ner': 23578.9522626698}
Losses {'ner': 26010.224218841642}


  "__main__", mod_spec)


Losses {'ner': 37095.27128314915}
Losses {'ner': 47508.65459043867}
Losses {'ner': 37057.82105511987}
Losses {'ner': 31642.729106775874}
Losses {'ner': 25610.1483249697}
Losses {'ner': 23099.846375429377}
Losses {'ner': 16554.86995130687}
Losses {'ner': 13155.681421290585}
Losses {'ner': 10452.325621203985}
Losses {'ner': 11027.603701612912}


In [63]:
b= eval_data2.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

Below, we print the statistics for the training with (b) and without (a) spelling normalization. As can be seen, there is a slight improvement on most measurements (as well as a reduction in variability) when we normalize spelling. 

This measurement provides really poor performance of the DATE and OBJ labels; this must be reviewed, but may be because of the way the data was shuffled.

In [48]:
print(a)

          ents_f                ents_p                ents_r           
            mean        std       mean        std       mean        std
label                                                                  
DATE    0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
LOC    73.367465   4.679829  79.860304   6.678700  68.402639   7.278991
MON    19.821021  21.853128  71.780220  25.798075  13.186639  15.690364
OBJ     0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
ORG    16.164672  13.458877  57.213033  33.369487   9.846753   9.190651
PER    82.528246   2.583081  84.060799   1.987227  81.115796   3.888144


In [64]:
print(b)

          ents_f               ents_p                ents_r          
            mean       std       mean        std       mean       std
label                                                                
DATE    0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
LOC    74.051074  5.814992  80.465784   6.494457  68.936827  7.705870
MON     5.080729  3.784209  47.333333  30.586853   2.718817  2.056479
OBJ     0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
ORG    16.451359  9.947839  72.449393  21.273657   9.864568  6.793646
PER    82.587286  1.857769  83.206864   2.496253  81.997533  1.693985
