# Evaluating the results of Training (K-fold Cross-Validation)

The results of training (and of its evaluation) depend on how the data was split into training and testing sets. In this worksheet, we use k-fold cross-validation to assess the performance of our trained model.

According to [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)): 
>In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.

More information is available in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).
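
The partitioning described in the quotation can be illustrated with a minimal sketch in plain Python. It is not scikit-learn's implementation: the function name `k_fold_indices`, the seed, and the way folds are cut (every k-th shuffled index) are assumptions made for illustration only.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index, starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training data is everything outside the held-out fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))

# Collect every index that was ever used for testing
all_test = sorted(idx for _, test in splits for idx in test)
```

Each of the ten sample indices appears in exactly one test fold, which is the property that distinguishes k-fold cross-validation from repeated random subsampling.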

For us, measuring performance with different samples is important because of the wide variation in the data: texts vary widely in length, in type, and in transcription conventions. We cannot tell clearly whether the performance of the model, when measured only once, reflects an improvement in the model through training or whether it is the result of the division into training and testing data. 

In [1]:
#Import necessary modules
from __future__ import unicode_literals, print_function
import spacy
from spacy.lang.es import Spanish 
from spacy.scorer import Scorer
from spacy.language import GoldParse
from spacy.util import minibatch, compounding

import pandas as pd
import numpy as np
import json
import plac
import random
from sklearn.model_selection import train_test_split
from pathlib import Path
from copy import deepcopy
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import KFold
import itertools

In [2]:
# Read Tagged Data from JSON file
with open('AMSTrainingII_SF.json', 'r', encoding='utf-8') as fp2:
    TAGGED_DATA = json.load(fp2)
    
TD_np = np.array(TAGGED_DATA)

spaCy can evaluate a model's performance from the [command line](https://spacy.io/api/cli#evaluate), but alternatively you can define a function like the one below. It takes the NER model and a set of annotated examples and returns several metrics:

- UAS (Unlabelled Attachment Score)
- LAS (Labelled Attachment Score)
- ents_p
- ents_r
- ents_f
- tags_acc
- token_acc

[According](https://github.com/explosion/spaCy/issues/2405) to one of the creators of Spacy, 
>The UAS and LAS are standard metrics to evaluate dependency parsing. UAS is the proportion of tokens whose head has been correctly assigned, LAS is the proportion of tokens whose head has been correctly assigned with the right dependency label (subject, object, etc).
>ents_p, ents_r, ents_f are the precision, recall and fscore for the NER task.
>tags_acc is the POS tagging accuracy.
>token_acc seems to be the precision for token segmentation.

The key metrics for this task are precision, recall and F-score.
**Precision** (ents_p) is the proportion of labeled entities that are correct (True Positive/(True Positive+False Positive)).
**Recall** (ents_r) is the proportion of true entities that are correctly labeled (True Positive/(True Positive+False Negative)). The **F-score** (ents_f) is the harmonic mean of the two: 2 × Precision × Recall/(Precision + Recall).

These metrics are reported both for all entity types (labels) together and for each label separately. We want these values to be as close as possible to 100.
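
As a worked illustration of these three metrics, the sketch below computes them from two sets of entity spans. It assumes that an entity counts as correct only when its start, end and label all match exactly; the function name `prf` and the example spans are hypothetical and do not come from the training data.

```python
def prf(gold, predicted):
    """Return precision, recall and F-score (as percentages) for two sets of entity spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # labeled and correct
    fp = len(predicted - gold)   # labeled but wrong
    fn = len(gold - predicted)   # true entities that were missed
    precision = 100 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100 * tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

# Hypothetical spans as (start_char, end_char, label)
gold = [(0, 5, "PER"), (10, 16, "LOC"), (20, 24, "DATE"), (30, 35, "ORG")]
predicted = [(0, 5, "PER"), (10, 16, "LOC"), (20, 24, "MON")]

p, r, f = prf(gold, predicted)
print(round(p, 2), round(r, 2), round(f, 2))
```

Here two of the three predictions are correct and two of the four true entities are found, so precision is about 66.67 and recall is 50. The F-score is about 57.14, which is lower than the arithmetic mean of the two (58.33) because the harmonic mean penalizes the weaker value.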

In [3]:
#Define the evaluate function
def evaluate(ner_model, examples):
    scorer = Scorer()
    for sents, ents in examples:
        doc_gold = ner_model.make_doc(sents)
        gold = GoldParse(doc_gold, entities=ents['entities'])
        pred_value = ner_model(sents)
        scorer.score(pred_value, gold)
    return scorer.scores

Next, we will load the spaCy model and split the data into the n folds that we will use in the cross-validation. In this procedure, we will train the model n times, each time on n-1 folds, reserving the remaining fold for testing the model.

In [4]:
# Load the Spacy Model
nlp= spacy.load('es_core_news_ml_EMS')

In [5]:
#Define parameters of k-fold split (5 folds, with random shuffle, seed = 7)

kf = KFold(n_splits=5, random_state=7, shuffle=True)

In [6]:
split= kf.split(TD_np)

We also create a dataframe to store the results of each training, with the evaluation scores for each label type. 

In [7]:
#Define a blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data = pd.DataFrame(columns=columns)
eval_data = eval_data.fillna(0)

Finally, we run the training loop once per fold, training on all the data except the held-out fold, evaluating on that fold, and storing the results in our dataframe. We train a copy of the NLP model each time because we want the training to start afresh for each set of training data. Otherwise, the same model would accumulate training across folds and would end up having seen all the data, including the test data, leading it to overperform on the tagged data compared to the new samples that we are interested in tagging later.
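
The reason for copying the model can be shown with a toy stand-in. `ToyModel` below is a hypothetical class, not a spaCy object; it only records which examples it was updated on, so that we can check that no fold's test data reaches the model evaluated on that fold.

```python
from copy import deepcopy

class ToyModel:
    """Hypothetical stand-in for a trainable pipeline; not a spaCy object."""
    def __init__(self):
        self.weights = {"ner": 0.0}
        self.seen = []

    def update(self, examples):
        self.seen.extend(examples)
        self.weights["ner"] += len(examples)

base = ToyModel()
folds = [["a", "b"], ["c", "d"], ["e", "f"]]

for i, test_fold in enumerate(folds):
    model = deepcopy(base)  # fresh, independent copy for this fold
    train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
    model.update(train)
    # The copy has never seen the examples it will be tested on
    assert not set(test_fold) & set(model.seen)

# The base model is untouched by any of the training runs
print(base.weights, base.seen)
```

If `model = base` were used instead of `deepcopy(base)`, the check inside the loop would fail on the second fold, because the shared model would already have been updated on examples from that fold.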

In [8]:
for train_index, test_index in split:
    
    #Generate training and test data
    traindata = TD_np[train_index]
    testdata = TD_np[test_index]
    
    #Load the model to be trained (save separately, because we do not want to repeatedly retrain the same model)
    nlp1 = deepcopy(nlp)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp1.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    #ner.add_label("OBJ")
    #ner.add_label("MON")
    #ner.add_label("DATE")

    #Train only the NER component, disabling the tagger and the parser (which we are not using here).
    with nlp1.disable_pipes('tagger','parser'):
        #Here we resume training; alternatively, you could use begin_training if you are starting on a new model.
        optimizer= nlp1.resume_training()
        #Minibatch sizes: a generator that starts at 1.0 and is multiplied by 1.001 after each batch, up to a maximum of 4.0
        sizes = compounding(1.0, 4.0, 1.001)
        #Loop over the training data 10 times, shuffling it and splitting it into mini-batches from which the algorithm learns to label. The model is updated after each batch.
        for itn in range(10):
            #np.random.shuffle reorders the rows safely; random.shuffle can duplicate rows of a 2-D NumPy array
            np.random.shuffle(traindata)
            batches = minibatch(traindata, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp1.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    results = evaluate(nlp1,testdata)
    per_type = results['ents_per_type']

    #Store precision, recall and F-score for each label (add 'OBJ' to the list if you train that label)
    for label in ['DATE','MON','ORG','PER','LOC']:
        scores = per_type[label]
        newrow = {'ents_p':scores['p'],'ents_r':scores['r'],'ents_f':scores['f'],'label':label}
        eval_data=eval_data.append(newrow,ignore_index=True)

Losses {'ner': 3870.829282823463}
Losses {'ner': 2763.04435870662}
Losses {'ner': 1476.2307875847337}
Losses {'ner': 773.8409690879708}
Losses {'ner': 180.4159515845463}
Losses {'ner': 27.676898249483802}
Losses {'ner': 14.246774299018416}
Losses {'ner': 6.202203421710105}
Losses {'ner': 0.12942503925426443}
Losses {'ner': 0.27166528410552937}

Losses {'ner': 3933.884142343929}
Losses {'ner': 2291.4247039360093}
Losses {'ner': 1121.7010050259992}
Losses {'ner': 342.57781070640607}
Losses {'ner': 260.33463196042584}
Losses {'ner': 40.44939581508031}
Losses {'ner': 17.991775464927834}
Losses {'ner': 0.6208079673153009}
Losses {'ner': 0.004879211893098394}
Losses {'ner': 0.005636297302901938}

Losses {'ner': 4092.368336050526}
Losses {'ner': 2428.0468781576}
Losses {'ner': 1279.1446418118621}
Losses {'ner': 438.7236478671269}
Losses {'ner': 197.8198385347393}
Losses {'ner': 48.96829818833708}
Losses {'ner': 16.633989205062946}
Losses {'ner': 0.6730541260740408}
Losses {'ner': 0.24261554933569401}
Losses {'ner': 0.018371099717744258}

Losses {'ner': 4045.306469922817}
Losses {'ner': 2404.9514029736038}
Losses {'ner': 1499.7526436548756}
Losses {'ner': 566.0273877225928}
Losses {'ner': 125.58810682910273}
Losses {'ner': 59.50530844948313}
Losses {'ner': 20.238625950698346}
Losses {'ner': 11.445776533489672}
Losses {'ner': 1.6540043926710177}
Losses {'ner': 2.250796257274562}

Losses {'ner': 3498.829717476414}
Losses {'ner': 1995.7019147103404}
Losses {'ner': 1095.0194388449631}
Losses {'ner': 371.8198865141883}
Losses {'ner': 122.3746062087221}
Losses {'ner': 44.52421632393097}
Losses {'ner': 15.987618927356936}
Losses {'ner': 4.788533532276276}
Losses {'ner': 2.2928582122737917}
Losses {'ner': 0.0036807689633403676}
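
The `compounding(1.0, 4.0, 1.001)` call in the training loop above produces the sequence of minibatch sizes. The sketch below reimplements that idea in plain Python so its behavior can be inspected without spaCy. The name `compounding_sketch` is hypothetical, it only handles the increasing case used here, and it assumes (as spaCy v2's `minibatch` appears to do) that each size is truncated to an integer before use.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

# Inspect the first 2000 batch sizes
sizes = list(islice(compounding_sketch(1.0, 4.0, 1.001), 2000))
print(sizes[:3])

# Index of the first batch whose truncated size reaches 2
first_size_two = next(i for i, s in enumerate(sizes) if int(s) >= 2)
print(first_size_two, sizes[-1])
```

With a growth factor of 1.001 the size rises very slowly: under the truncation assumption, roughly the first 700 batches contain a single example each, and the cap of 4.0 is only reached after well over a thousand batches. Because the same generator is reused across all ten passes over the data, later passes use larger batches than earlier ones.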


Below, we print the contents of our evaluation dataframe:

In [9]:
print(eval_data)

       ents_p     ents_r     ents_f label
0   70.588235  77.419355  73.846154  DATE
1   73.717949  86.466165  79.584775   MON
2   57.746479  56.944444  57.342657   ORG
3   89.487871  89.972900  89.729730   PER
4   80.952381  78.461538  79.687500   LOC
5   90.000000  78.947368  84.112150  DATE
6   71.428571  73.170732  72.289157   MON
7   65.217391  46.153846  54.054054   ORG
8   92.248062  88.805970  90.494297   PER
9   90.647482  86.896552  88.732394   LOC
10  86.153846  67.469880  75.675676  DATE
11  75.490196  82.795699  78.974359   MON
12  61.290323  44.705882  51.700680   ORG
13  92.413793  85.079365  88.595041   PER
14  81.595092  88.079470  84.713376   LOC
15  79.629630  72.881356  76.106195  DATE
16  76.666667  84.146341  80.232558   MON
17  55.555556  46.666667  50.724638   ORG
18  89.430894  90.659341  90.040928   PER
19  76.433121  73.619632  75.000000   LOC
20  84.745763  90.909091  87.719298  DATE
21  43.037975  80.952381  56.198347   MON
22  47.826087  50.000000  48.88888

From these results we can estimate performance averaged over all five folds, which gives a better estimate of each measurement, together with its standard deviation.

In [10]:
#Measure mean and standard deviation of f, p and r scores for each label 
a = eval_data.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

In [11]:
print(a)

          ents_f                ents_p                ents_r          
            mean        std       mean        std       mean       std
label                                                                 
DATE   79.491894   6.060895  82.223495   7.489842  77.525410  8.715461
LOC    83.073463   5.671613  83.623457   5.830144  82.614236  6.283174
MON    73.455839  10.162726  68.068272  14.131026  81.506264  5.074967
ORG    52.542184   3.266820  57.527167   6.544983  48.894168  4.900068
PER    90.613698   2.128453  91.413442   1.846182  89.891842  3.550883
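
As a cross-check of the aggregation, the DATE row can be recomputed by hand from the five DATE F-scores printed in the evaluation dataframe (rows 0, 5, 10, 15 and 20). This sketch uses only the standard library, and relies on pandas computing the sample standard deviation (dividing by n - 1) by default.

```python
from statistics import mean, stdev

# ents_f values for the DATE label, one per fold, copied from the printed dataframe
date_f_scores = [73.846154, 84.112150, 75.675676, 76.106195, 87.719298]

m = mean(date_f_scores)
s = stdev(date_f_scores)  # sample standard deviation (n - 1 in the denominator)
print(round(m, 4), round(s, 4))
```

The result agrees with the `ents_f` mean and std reported for DATE in the table above.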


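As can be seen, PER is the strongest and most stable label (mean F-score of about 90.6, with a standard deviation of about 2.1), followed by LOC (about 83.1). DATE and MON reach mean F-scores of about 79.5 and 73.5, with MON varying the most between folds (standard deviation of about 10.2), and ORG is the weakest at about 52.5. The PER and LOC labels are perhaps the most useful as they stand, whereas the others can still be improved.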

# Evaluating Spelling Normalization

We can apply the evaluation above to a model trained with text whose spelling has been normalized, thus evaluating whether the inclusion of a normalization dictionary improves training results.

To apply the spelling normalization, we create a pipeline component that modifies the NORM attribute of each token according to a dictionary we provide. spaCy never permanently modifies the text it is given; setting the NORM attribute is the mechanism it provides for handling spelling variation.

In [7]:
# Read Norm Exceptions from JSON file
with open('normalizeddict.json', 'r', encoding='utf-8') as fp3:
    NORM_EXCEPTIONS = json.load(fp3)

In [8]:
#These steps are all addressed in more detail in another notebook, "Adding a Custom Pipeline Component in Spacy"

#Define and add pipeline component that updates .norm attribute

def add_custom_norms(doc):
    for token in doc:
        if token.text in NORM_EXCEPTIONS:
            token.norm_ = NORM_EXCEPTIONS[token.text]
    return doc

#Add component to the pipeline

nlp.add_pipe(add_custom_norms, first=True)
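
The effect of a component like `add_custom_norms` can be seen without loading spaCy by using a toy token class. Everything in this sketch is hypothetical: `ToyToken` stands in for spaCy's token, the default norm is assumed to be the lowercased text, and the three dictionary entries are invented examples rather than the contents of `normalizeddict.json`.

```python
from dataclasses import dataclass

@dataclass
class ToyToken:
    """Hypothetical stand-in for a spaCy token: original text plus a separate norm."""
    text: str
    norm_: str = ""

    def __post_init__(self):
        if not self.norm_:
            self.norm_ = self.text.lower()  # assumed default norm

# Invented spelling variants, for illustration only
HYPOTHETICAL_NORMS = {"dixo": "dijo", "vn": "un", "mui": "muy"}

def add_custom_norms_sketch(tokens, norm_table):
    # Same lookup logic as add_custom_norms, applied to toy tokens
    for token in tokens:
        if token.text in norm_table:
            token.norm_ = norm_table[token.text]
    return tokens

tokens = [ToyToken(w) for w in "dixo que era vn hombre mui rico".split()]
add_custom_norms_sketch(tokens, HYPOTHETICAL_NORMS)

original = " ".join(t.text for t in tokens)
normalized = " ".join(t.norm_ for t in tokens)
print(original)
print(normalized)
```

The original text is left unchanged and only the norm differs. Note that the lookup is on the exact token text, so a capitalized variant such as "Dixo" would not be matched unless it has its own entry in the dictionary.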

In [9]:
#Define a new blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data2 = pd.DataFrame(columns=columns)
eval_data2 = eval_data2.fillna(0)

In [10]:
eval_data2

Unnamed: 0,ents_p,ents_r,ents_f,label


In [11]:
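# Train and evaluate Model trained with EMS dictionary

#kf.split returns a one-use generator, so we call it again here; the fixed random_state gives the same folds as before
for train_index, test_index in kf.split(TD_np):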
    
    #Generate training and test data
    traindata = TD_np[train_index]
    testdata = TD_np[test_index]
    
    #Load the model to be trained (save separately, because we do not want to repeatedly retrain the same model)
    nlp2 = deepcopy(nlp)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp2.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    #ner.add_label("OBJ")
    #ner.add_label("MON")
    #ner.add_label("DATE")

    #Train only the NER component, disabling the tagger and the parser (which we are not using here).
    with nlp2.disable_pipes('tagger','parser'):
        #Here we resume training; alternatively, you could use begin_training if you are starting on a new model.
        optimizer= nlp2.resume_training()
        #Minibatch sizes: a generator that starts at 1.0 and is multiplied by 1.001 after each batch, up to a maximum of 4.0
        sizes = compounding(1.0, 4.0, 1.001)
        #Loop over the training data 10 times, shuffling it and splitting it into mini-batches from which the algorithm learns to label. The model is updated after each batch.
        for itn in range(10):
            #np.random.shuffle reorders the rows safely; random.shuffle can duplicate rows of a 2-D NumPy array
            np.random.shuffle(traindata)
            batches = minibatch(traindata, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp2.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    results = evaluate(nlp2,testdata)
    per_type = results['ents_per_type']

    #Store precision, recall and F-score for each label (add 'OBJ' to the list if you train that label)
    for label in ['DATE','MON','ORG','PER','LOC']:
        scores = per_type[label]
        newrow = {'ents_p':scores['p'],'ents_r':scores['r'],'ents_f':scores['f'],'label':label}
        eval_data2=eval_data2.append(newrow,ignore_index=True)

Losses {'ner': 3893.4023136952587}
Losses {'ner': 2581.9710635599113}
Losses {'ner': 1523.733720502525}
Losses {'ner': 628.7234550421648}
Losses {'ner': 187.32588130995907}
Losses {'ner': 49.037406406120894}
Losses {'ner': 24.944327399552055}
Losses {'ner': 1.8219491690634046}
Losses {'ner': 0.02951674448542816}
Losses {'ner': 0.0730018346335396}

Losses {'ner': 4304.295866752207}
Losses {'ner': 2853.9652720915055}
Losses {'ner': 968.0967388995275}
Losses {'ner': 315.7326493885426}
Losses {'ner': 123.62306165866235}
Losses {'ner': 10.214118412095203}
Losses {'ner': 1.642406302719747}
Losses {'ner': 0.07822679792910586}
Losses {'ner': 2.661344697384856}
Losses {'ner': 0.09041250821899786}

Losses {'ner': 3951.276671785179}
Losses {'ner': 2164.780988466331}
Losses {'ner': 775.9616581189613}
Losses {'ner': 407.6480709102726}
Losses {'ner': 81.59152221909333}
Losses {'ner': 17.358426573759935}
Losses {'ner': 19.85774139467456}
Losses {'ner': 3.454616478935028}
Losses {'ner': 0.18575660293935267}
Losses {'ner': 2.660929280552897}

Losses {'ner': 3991.850618977065}
Losses {'ner': 2765.8190093470216}
Losses {'ner': 909.7052457272707}
Losses {'ner': 306.5152535692142}
Losses {'ner': 112.45289561134007}
Losses {'ner': 29.137735492815107}
Losses {'ner': 10.020323340201747}
Losses {'ner': 13.744271439609456}
Losses {'ner': 1.4102048315957108}
Losses {'ner': 0.34048508198137556}

Losses {'ner': 4659.727620554147}
Losses {'ner': 2441.2476337305056}
Losses {'ner': 1480.1646121630135}
Losses {'ner': 658.9802001129704}
Losses {'ner': 286.54711319366584}
Losses {'ner': 92.7157910272058}
Losses {'ner': 18.969198043135027}
Losses {'ner': 0.05823734139601938}
Losses {'ner': 6.763841825655848}
Losses {'ner': 0.006211504160673457}


In [12]:
b= eval_data2.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

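Below, we print the statistics for the training with (b) and without (a) spelling normalization. The differences are small, and all of them fall within one standard deviation. With normalization, DATE improves slightly on all three metrics and precision rises marginally for LOC and ORG, while the remaining means are slightly lower; the spread of the F-score narrows for DATE, LOC and MON but widens for ORG and PER. On these five folds, then, spelling normalization shows no clear effect on performance.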

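This measurement has shown null performance of the DATE and OBJ labels, although the statistics printed for (b) do not (DATE has a mean F-score of about 80, and OBJ is not evaluated in this run); this must be reviewed, but may be because of the way the data was shuffled.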

In [13]:
print(a)

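          ents_f                ents_p                ents_r          
            mean        std       mean        std       mean       std
label                                                                 
DATE   79.491894   6.060895  82.223495   7.489842  77.525410  8.715461
LOC    83.073463   5.671613  83.623457   5.830144  82.614236  6.283174
MON    73.455839  10.162726  68.068272  14.131026  81.506264  5.074967
ORG    52.542184   3.266820  57.527167   6.544983  48.894168  4.900068
PER    90.613698   2.128453  91.413442   1.846182  89.891842  3.550883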

In [14]:
print(b)

          ents_f               ents_p                ents_r          
            mean       std       mean        std       mean       std
label                                                                
DATE   80.009995  5.963633  82.810063   6.841508  77.998465  9.247658
LOC    82.947133  4.536340  83.798343   7.382361  82.476082  4.633619
MON    72.658621  9.496365  67.678875  13.748218  79.963806  4.016407
ORG    51.784561  5.077312  59.120165   6.512367  47.191741  9.022116
PER    89.491461  4.065115  89.947766   3.294965  89.067978  4.987185
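
One simple way to compare the two runs label by label is to subtract the mean F-scores and set each difference against the fold-to-fold spread. The sketch below does this with the `ents_f` means and standard deviations copied from the printed statistics for (a) and (b). The rule it applies (a difference smaller than the smaller of the two standard deviations is treated as noise) is an assumption chosen for illustration, not a formal significance test.

```python
# (mean, std) of ents_f per label, copied from the printed statistics
f_without = {  # (a) no spelling normalization
    "DATE": (79.491894, 6.060895), "LOC": (83.073463, 5.671613),
    "MON": (73.455839, 10.162726), "ORG": (52.542184, 3.266820),
    "PER": (90.613698, 2.128453),
}
f_with = {  # (b) with spelling normalization
    "DATE": (80.009995, 5.963633), "LOC": (82.947133, 4.536340),
    "MON": (72.658621, 9.496365), "ORG": (51.784561, 5.077312),
    "PER": (89.491461, 4.065115),
}

improved = []
within_noise = {}
for label, (mean_a, std_a) in f_without.items():
    mean_b, std_b = f_with[label]
    diff = mean_b - mean_a
    if diff > 0:
        improved.append(label)
    # Treat a difference below the smaller standard deviation as noise
    within_noise[label] = abs(diff) < min(std_a, std_b)
    print(f"{label}: {diff:+.2f} (within one std: {within_noise[label]})")
```

On these figures only DATE has a higher mean F-score with normalization, and every difference is smaller than the corresponding standard deviation, so more folds or repeated runs would be needed before drawing a firm conclusion.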
