# Evaluating the results of Training (K-fold Cross-Validation)

The results of training (and its evaluation) will depend on how the data was split into training and testing sets. In this worksheet, we use repeated random subsampling to assess the performance of our trained model.

According to [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)): 
>In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[11] but in general k remains an unfixed parameter.

More information available [here](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

For us, measuring performance with different samples is important because of the wide variation in the data: texts vary widely in length, in type, and in transcription conventions. We cannot tell clearly whether the performance of the model, when measured only once, reflects an improvement in the model through training or whether it is the result of the division into training and testing data. 

In [1]:
#Import necessary modules
from __future__ import unicode_literals, print_function
import spacy
from spacy.lang.es import Spanish 
from spacy.scorer import Scorer
from spacy.language import GoldParse
from spacy.util import minibatch, compounding

import pandas as pd
import numpy as np
import json
import plac
import random
from sklearn.model_selection import train_test_split
from pathlib import Path
from copy import deepcopy
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import KFold
import itertools

In [28]:
# Read Tagged Data from JSON file
with open('TaggedData_SF.json', 'r', encoding='utf-8') as fp2:
    TAGGED_DATA = json.load(fp2)
    
TD_np = np.array(TAGGED_DATA)

Spacy has a built-in function for evaluating a model's performance using the [command line](https://spacy.io/api/cli#evaluate), but alternatively you can define a function like the one below. It takes the NER model and examples that you input and returns several metrics:
        - UAS (Unlabelled Attachment Score) 
        - LAS (Labelled Attachment Score)
        - ents_p
        - ents_r
        - ents_f
        - tags_acc
        - token_acc

[According](https://github.com/explosion/spaCy/issues/2405) to one of the creators of Spacy, 
>The UAS and LAS are standard metrics to evaluate dependency parsing. UAS is the proportion of tokens whose head has been correctly assigned, LAS is the proportion of tokens whose head has been correctly assigned with the right dependency label (subject, object, etc).
>ents_p, ents_r, ents_f are the precision, recall and fscore for the NER task.
>tags_acc is the POS tagging accuracy.
>token_acc seems to be the precision for token segmentation.

The key metrics for this task are the precision, recall and f-score.
**Precision** (ents_p) is the ratio of correctly-labeled entities out of all the entities labeled. (True Positive/(True Positive+False Positive)).
**Recall**  (ents_r) is the ratio of correctly-labeled entities out of all true entities (True Positive/(True Positive+False Negative)). The F-score is the mean of both values.  

These metrics all appear averaged out through all the entity types (labels) and then detailed for each label in particular. We want these values to be as close as possible to 100. 

In [29]:
#Define the evaluate function
def evaluate(ner_model, examples):
    scorer = Scorer()
    for sents, ents in examples:
        doc_gold = ner_model.make_doc(sents)
        gold = GoldParse(doc_gold, entities=ents['entities'])
        pred_value = ner_model(sents)
        scorer.score(pred_value, gold)
    return scorer.scores

Next, we will load the spacy model and split the data into the n batches that we will use in the cross-validation. In this procedure, we will train the model n-1 times, reserving one fold for testing the model each time. 

In [30]:
# Load the Spacy Model
nlp= spacy.load('es_core_news_md')

In [31]:
#Define parameters of k-fold split (5 batches, with random shuffle, set seed = 2)

kf = KFold(n_splits=5, random_state=7, shuffle=True)

In [32]:
split= kf.split(TD_np)

We also create a dataframe to store the results of each training, with the evaluation scores for each label type. 

In [7]:
#Define a blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data = pd.DataFrame(columns=columns)
eval_data = eval_data.fillna(0)

Finally, we run the training loop for each set of training data excluding one fold and evaluate the results, storing these in our dataframe. We are using a copy of the NLP model because we want the training to start afresh for each set of training data. Otherwise, the model would be trained on all the data including the test data, leading to the model overperforming on the tagged data compared to new samples that we are interested in tagging later.

In [8]:
for train_index, test_index in split:
    
    #Generate training and test data
    traindata = TD_np[train_index]
    testdata = TD_np[test_index]
    
    #Load the model to be trained (save separately, because we do not want to repeatedly retrain the same model)
    nlp1 = deepcopy(nlp)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp1.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #This piece of code creates a loop in which we train the model, but only for the NER component (disabling the tagger and the parser, which we are not using here).
    with nlp1.disable_pipes('tagger','parser'):
    #Here we resume training, alternatively you could begin_training if you are starting on a new model.
        optimizer= nlp1.resume_training()
    #Would need to figure this out, they are the sizes for the minibatching
        sizes = compounding(1.0, 4.0, 1.001)
    #This loops the training mechanism 10 times, randomly shuffling the training data and creating mini-batches from which the algorithm learns to label. Each time a batch is processed, the model is updated.
        for itn in range(10):
            random.shuffle(traindata)
            batches = minibatch(traindata, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp1.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    results = evaluate(nlp1,testdata)
    evaluation= dict((k, results[k]) for k in ['ents_per_type'] 
                                        if k in results)
    
    ev_date = [val.get('DATE') for val in evaluation.values()]
    ev_mon= [val.get('MON') for val in evaluation.values()]
    ev_obj= [val.get('OBJ') for val in evaluation.values()]
    ev_org= [val.get('ORG') for val in evaluation.values()]
    ev_per= [val.get('PER') for val in evaluation.values()]
    ev_loc= [val.get('LOC') for val in evaluation.values()]
    
    dlist = list(ev_date[0].values())
    newrow1= {'ents_p': dlist[0],'ents_r': dlist[1],'ents_f':dlist[2],'label':'DATE'}
    
    mlist = list(ev_mon[0].values())
    newrow2= {'ents_p': mlist[0],'ents_r':mlist[1],'ents_f':mlist[2],'label':'MON'}
                  
    oblist = list(ev_obj[0].values())
    newrow3= {'ents_p':oblist[0],'ents_r':oblist[1],'ents_f':oblist[2],'label':'OBJ'}
                  
    orlist = list(ev_org[0].values())
    newrow4= {'ents_p':orlist[0],'ents_r':orlist[1],'ents_f':orlist[2],'label':'ORG'}
                  
    plist = list(ev_per[0].values())
    newrow5= {'ents_p':plist[0],'ents_r':plist[1],'ents_f':plist[2],'label':'PER'}
                  
    llist = list(ev_loc[0].values())
    newrow6= {'ents_p':llist[0],'ents_r':llist[1],'ents_f':llist[2],'label':'LOC'}
                  
    eval_data=eval_data.append(newrow1,ignore_index=True)
    eval_data=eval_data.append(newrow2,ignore_index=True)
    eval_data=eval_data.append(newrow3,ignore_index=True)
    eval_data=eval_data.append(newrow4,ignore_index=True)
    eval_data=eval_data.append(newrow5,ignore_index=True)
    eval_data=eval_data.append(newrow6,ignore_index=True)

  "__main__", mod_spec)


Losses {'ner': 29990.3521411966}
Losses {'ner': 27886.84531629397}
Losses {'ner': 19379.183527336078}
Losses {'ner': 23267.713834718776}
Losses {'ner': 22871.687328162086}
Losses {'ner': 24932.390145298734}
Losses {'ner': 26685.45910784602}
Losses {'ner': 26078.608320454136}
Losses {'ner': 26746.972224920988}
Losses {'ner': 27233.27111905627}


  "__main__", mod_spec)


Losses {'ner': 40258.40534344755}
Losses {'ner': 38347.65755278419}
Losses {'ner': 28128.2670029179}
Losses {'ner': 26443.775922638204}
Losses {'ner': 23749.037599277814}
Losses {'ner': 22436.196288708597}
Losses {'ner': 23618.400182686746}
Losses {'ner': 25132.60542197805}
Losses {'ner': 26051.640869945288}
Losses {'ner': 26891.268652012222}


  "__main__", mod_spec)


Losses {'ner': 26593.121696714035}
Losses {'ner': 21575.30916965817}
Losses {'ner': 18491.861198539922}
Losses {'ner': 18741.562300101556}
Losses {'ner': 18413.864992712763}
Losses {'ner': 19239.24104174958}
Losses {'ner': 22582.034950674977}
Losses {'ner': 25981.066629868}
Losses {'ner': 27067.824368900387}
Losses {'ner': 28114.243262693286}


  "__main__", mod_spec)


Losses {'ner': 36249.6091657142}
Losses {'ner': 39667.303020850275}
Losses {'ner': 35393.552207695386}
Losses {'ner': 33965.30115976177}
Losses {'ner': 28090.060746957046}
Losses {'ner': 27077.88383237197}
Losses {'ner': 26646.217350698076}
Losses {'ner': 27135.614880967885}
Losses {'ner': 29054.58570339717}
Losses {'ner': 28718.937085229903}


  "__main__", mod_spec)


Losses {'ner': 25819.953981815663}
Losses {'ner': 22649.607092659404}
Losses {'ner': 19590.493973777015}
Losses {'ner': 14764.599927769552}
Losses {'ner': 9349.48336654413}
Losses {'ner': 8263.741737522445}
Losses {'ner': 8986.926781303526}
Losses {'ner': 9863.53484527146}
Losses {'ner': 10335.228494862953}
Losses {'ner': 10955.456667964922}


Below, we print the contents of our evaluation dataframe:

In [9]:
print(eval_data)

        ents_p     ents_r     ents_f label
0     0.000000   0.000000   0.000000  DATE
1    75.000000   3.529412   6.741573   MON
2     0.000000   0.000000   0.000000   OBJ
3    50.000000   4.000000   7.407407   ORG
4    91.111111  81.349206  85.953878   PER
5    76.033058  64.335664  69.696970   LOC
6     0.000000   0.000000   0.000000  DATE
7     0.000000   0.000000   0.000000   MON
8     0.000000   0.000000   0.000000   OBJ
9   100.000000  15.151515  26.315789   ORG
10   83.984375  78.754579  81.285444   PER
11   82.517483  67.428571  74.213836   LOC
12    0.000000   0.000000   0.000000  DATE
13  100.000000   9.090909  16.666667   MON
14    0.000000   0.000000   0.000000   OBJ
15  100.000000   5.714286  10.810811   ORG
16   81.900452  83.796296  82.837529   PER
17   85.950413  70.270270  77.323420   LOC
18    0.000000   0.000000   0.000000  DATE
19  100.000000   3.076923   5.970149   MON
20    0.000000   0.000000   0.000000   OBJ
21   50.000000   6.779661  11.940299   ORG
22   84.656

From which we can create estimates of performance averaged over all the trials, providing a better estimate of each measurement with its standard deviation.

In [10]:
#Measure mean and standard deviation of f, p and r scores for each label 
a = eval_data.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

In [11]:
print(a)

          ents_f               ents_p                ents_r          
            mean       std       mean        std       mean       std
label                                                                
DATE    0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
LOC    71.958773  4.622225  81.072891   3.609012  64.851936  5.930612
MON     7.811162  6.071459  58.529412  46.925362   4.472782  3.499442
OBJ     0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
ORG    17.009147  9.703169  74.285714  25.050968   9.900521  6.183875
PER    82.479257  2.407120  83.986142   4.694122  81.141815  1.809540


As can be seen, the different labels perform consistently at the levels printed above. The PER and LOC labels are perhaps the most useful, whereas the others can still be improved. 

# Evaluating Spelling Normalization

We can apply the evaluation above to a model trained with text whose spelling has been normalized, thus evaluating whether the inclusion of a normalization dictionary improves training results.

To apply the spelling normalization, we create a pipeline component that modifies the NORM attribute of each token according to a dictionary we provide. Spacy does not modify any text supplied permanently, this is the way they provide for correcting for spelling variation. 

In [33]:
# Read Norm Exceptions from JSON file
with open('normalizeddict.json', 'r', encoding='utf-8') as fp3:
    NORM_EXCEPTIONS = json.load(fp3)

In [34]:
#These steps are all addressed in more detail in another notebook, "Adding a Custom Pipeline Component in Spacy"

#Define and add pipeline component that updates .norm attribute

def add_custom_norms(doc):
    for token in doc:
        if token.text in NORM_EXCEPTIONS:
            token.norm_ = NORM_EXCEPTIONS[token.text]
    return doc

#Add component to the pipeline

nlp.add_pipe(add_custom_norms, first=True)

In [35]:
#Define a new blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data2 = pd.DataFrame(columns=columns)
eval_data2 = eval_data2.fillna(0)

In [36]:
eval_data2

Unnamed: 0,ents_p,ents_r,ents_f,label


In [38]:
# Train and evaluate Model trained with EMS dictionary

for train_index, test_index in split:
    
    #Generate training and test data
    traindata = TD_np[train_index]
    testdata = TD_np[test_index]
    
    #Load the model to be trained (save separately, because we do not want to repeatedly retrain the same model)
    nlp2 = deepcopy(nlp)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp2.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #This piece of code creates a loop in which we train the model, but only for the NER component (disabling the tagger and the parser, which we are not using here).
    with nlp2.disable_pipes('tagger','parser'):
    #Here we resume training, alternatively you could begin_training if you are starting on a new model.
        optimizer= nlp2.resume_training()
    #Would need to figure this out, they are the sizes for the minibatching
        sizes = compounding(1.0, 4.0, 1.001)
    #This loops the training mechanism 10 times, randomly shuffling the training data and creating mini-batches from which the algorithm learns to label. Each time a batch is processed, the model is updated.
        for itn in range(10):
            random.shuffle(traindata)
            batches = minibatch(traindata, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp2.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    results = evaluate(nlp2,testdata)
    evaluation= dict((k, results[k]) for k in ['ents_per_type'] 
                                        if k in results)
    
    ev_date = [val.get('DATE') for val in evaluation.values()]
    ev_mon= [val.get('MON') for val in evaluation.values()]
    ev_obj= [val.get('OBJ') for val in evaluation.values()]
    ev_org= [val.get('ORG') for val in evaluation.values()]
    ev_per= [val.get('PER') for val in evaluation.values()]
    ev_loc= [val.get('LOC') for val in evaluation.values()]
    
    dlist = list(ev_date[0].values())
    newrow1= {'ents_p': dlist[0],'ents_r': dlist[1],'ents_f':dlist[2],'label':'DATE'}
    
    mlist = list(ev_mon[0].values())
    newrow2= {'ents_p': mlist[0],'ents_r':mlist[1],'ents_f':mlist[2],'label':'MON'}
                  
    oblist = list(ev_obj[0].values())
    newrow3= {'ents_p':oblist[0],'ents_r':oblist[1],'ents_f':oblist[2],'label':'OBJ'}
                  
    orlist = list(ev_org[0].values())
    newrow4= {'ents_p':orlist[0],'ents_r':orlist[1],'ents_f':orlist[2],'label':'ORG'}
                  
    plist = list(ev_per[0].values())
    newrow5= {'ents_p':plist[0],'ents_r':plist[1],'ents_f':plist[2],'label':'PER'}
                  
    llist = list(ev_loc[0].values())
    newrow6= {'ents_p':llist[0],'ents_r':llist[1],'ents_f':llist[2],'label':'LOC'}
                  
    eval_data2=eval_data2.append(newrow1,ignore_index=True)
    eval_data2=eval_data2.append(newrow2,ignore_index=True)
    eval_data2=eval_data2.append(newrow3,ignore_index=True)
    eval_data2=eval_data2.append(newrow4,ignore_index=True)
    eval_data2=eval_data2.append(newrow5,ignore_index=True)
    eval_data2=eval_data2.append(newrow6,ignore_index=True)

  "__main__", mod_spec)


Losses {'ner': 36681.309747757856}
Losses {'ner': 28037.00775129828}
Losses {'ner': 26839.918097875427}
Losses {'ner': 21681.210179250193}
Losses {'ner': 22691.217211608542}
Losses {'ner': 24099.61153436615}
Losses {'ner': 24012.12943678908}
Losses {'ner': 26742.850501861423}
Losses {'ner': 27828.868456542492}
Losses {'ner': 28967.126047454774}


  "__main__", mod_spec)


Losses {'ner': 28155.425471669838}
Losses {'ner': 23234.60611417357}
Losses {'ner': 17871.88060454513}
Losses {'ner': 16091.027753552835}
Losses {'ner': 16774.26912097751}
Losses {'ner': 16791.395927003003}
Losses {'ner': 22314.34671662748}
Losses {'ner': 24901.671098547056}
Losses {'ner': 25097.306005179882}
Losses {'ner': 27372.017644405365}


  "__main__", mod_spec)


Losses {'ner': 34922.73215593383}
Losses {'ner': 23168.115738138247}
Losses {'ner': 20729.01249424047}
Losses {'ner': 20437.462174263477}
Losses {'ner': 21176.84745838726}
Losses {'ner': 24841.752232989296}
Losses {'ner': 25889.203417696204}
Losses {'ner': 27083.804209765978}
Losses {'ner': 28313.46140109212}
Losses {'ner': 29566.480840966105}


  "__main__", mod_spec)


Losses {'ner': 29131.860930730174}
Losses {'ner': 22860.817320672286}
Losses {'ner': 22136.24543326888}
Losses {'ner': 17300.281348499666}
Losses {'ner': 15263.59707153529}
Losses {'ner': 12860.694824622837}
Losses {'ner': 11344.360642946906}
Losses {'ner': 12008.03018742557}
Losses {'ner': 11328.806644731507}
Losses {'ner': 11477.042313059741}


In [39]:
b= eval_data2.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

Below, we print the statistics for the training with (b) and without (a) spelling normalization. As can be seen, there is a slight improvement on most measurements (as well as a reduction in variability) when we normalize spelling. 

This measurement shows null performance of the DATE and OBJ labels; this must be reviewed, but may be because of the way the data was shuffled.

In [41]:
print(a)

          ents_f               ents_p                ents_r          
            mean       std       mean        std       mean       std
label                                                                
DATE    0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
LOC    71.958773  4.622225  81.072891   3.609012  64.851936  5.930612
MON     7.811162  6.071459  58.529412  46.925362   4.472782  3.499442
OBJ     0.000000  0.000000   0.000000   0.000000   0.000000  0.000000
ORG    17.009147  9.703169  74.285714  25.050968   9.900521  6.183875
PER    82.479257  2.407120  83.986142   4.694122  81.141815  1.809540


In [40]:
print(b)

          ents_f                ents_p                ents_r           
            mean        std       mean        std       mean        std
label                                                                  
DATE    0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
LOC    75.037776   5.809667  82.289110   4.286447  69.089502   7.240889
MON     4.944057   3.993193  54.166667  41.666667   2.590609   2.097429
OBJ     0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
ORG    25.192883  14.734935  63.179348   9.102998  16.683038  11.137080
PER    81.216748   1.827689  81.409103   1.435101  81.046223   2.619820
