# Evaluating the Results of Training

The results of training, and of any evaluation of those results, depend on how the data was split into training and test sets. In this worksheet, we assess the performance of our trained model first with k-fold cross-validation and then with repeated random subsampling.
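
As a point of reference, repeated random subsampling can be sketched with scikit-learn's `ShuffleSplit`. The toy array, the number of splits and the 80/20 ratio below are illustrative assumptions, not the worksheet's data or settings.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: ten observations identified by their index (illustrative only).
X = np.arange(10)

# Three independent random 80/20 splits; a fixed seed makes them repeatable.
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=2)

test_counts = np.zeros(len(X), dtype=int)
for train_index, test_index in splitter.split(X):
    print("train:", train_index, "test:", test_index)
    test_counts[test_index] += 1

# Nothing forces every observation into a test set exactly once:
# some may be tested several times and others never.
print("times each observation was tested:", test_counts)
```

Because each split is drawn independently, the test sets may overlap, which is the main difference from k-fold cross-validation.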

According to Wikipedia:
>In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[11] but in general k remains an unfixed parameter.

>For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and validate on d1, followed by training on d1 and validating on d0.

>When k = n (the number of observations), the k-fold cross-validation is exactly the leave-one-out cross-validation.[12]

>In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of binary classification, this means that each fold contains roughly the same proportions of the two types of class labels.

>In repeated cross-validation the data is randomly split into k folds several times. The performance of the model can thereby be averaged over several runs, but this is rarely desirable in practice.[13]

See also the scikit-learn documentation on cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
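
The defining property in the quoted passage, that each observation is used for validation exactly once, can be checked on a toy array. This sketch uses the `KFold` settings that appear later in this worksheet (5 splits, shuffling, seed 2); the ten-element array and the name `kf_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten observations identified by their index (illustrative only).
X = np.arange(10)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=2)

validated = []
for train_index, test_index in kf_demo.split(X):
    # Within a fold, the training and validation indices never overlap.
    assert not set(train_index) & set(test_index)
    validated.extend(test_index.tolist())

# Across the five folds, every index shows up in a validation set exactly once.
print(sorted(validated))
```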

In [37]:
#Import necessary modules

from __future__ import unicode_literals, print_function
import spacy
from spacy.lang.es import Spanish 
from spacy import displacy
from spacy.tokens import Doc
from collections import defaultdict, Counter
from spacy.attrs import ORTH
from spacy.scorer import Scorer
from spacy.gold import GoldParse
from spacy.util import minibatch, compounding

import pandas as pd
import numpy as np
import json
import logging
import plac
import random
from sklearn.model_selection import train_test_split
from pathlib import Path
from copy import deepcopy
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import KFold
import itertools

In [38]:
def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
    try:
        training_data = []
        lines=[]
        with open(dataturks_JSON_FilePath, 'r') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                #Each text annotation carries a single point, so take the first one.
                point = annotation['points'][0]
                labels = annotation['label']
                # handle both list of labels or a single label.
                if not isinstance(labels, list):
                    labels = [labels]

                for label in labels:
                    #Dataturks offsets are inclusive on both ends, [start, end], whereas spaCy spans are half-open, [start, end), so add 1 to the end offset.
                    entities.append((point['start'], point['end'] + 1, label))


            training_data.append((text, {"entities" : entities}))

        return training_data
    except Exception as e:
        logging.exception("Unable to process " + dataturks_JSON_FilePath + "\n" + "error = " + str(e))
        return None
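
The offset conversion is the easiest step in this function to get wrong, so here is a minimal sketch of it in isolation. The sample text and annotation are made up for illustration; only the `start` and `end` keys mirror the function above.

```python
text = "Murillo was born in Seville"

# Hypothetical Dataturks-style point: both offsets are inclusive.
point = {"start": 0, "end": 6}

# spaCy expects a half-open span [start, end), so the end offset moves one to the right.
start, end = point["start"], point["end"] + 1
entity = (start, end, "PER")

# Python slicing is half-open too, so the converted offsets select the whole name.
print(text[start:end])
```

Without the `+ 1`, the slice would drop the final character of the entity.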

In [39]:
#Convert Dataturks to Spacy Format

TAGGED_DATA = np.array(convert_dataturks_to_spacy("/Users/Felipe/Documents/Research/NPL/SevillianPaintersNPL/seville painters test 2-3.json"))

In [40]:
#Define evaluate function
def evaluate(ner_model, examples):
    scorer = Scorer()
    for sents, ents in examples:
        doc_gold = ner_model.make_doc(sents)
        gold = GoldParse(doc_gold, entities=ents['entities'])
        pred_value = ner_model(sents)
        scorer.score(pred_value, gold)
    return scorer.scores
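
The scorer reports precision, recall and F-score for entities. As a rough sketch of what those numbers mean, the function below counts exact matches of `(start, end, label)` between gold and predicted entities and reports percentages on the 0 to 100 scale seen in the outputs of this worksheet. The name `prf` and the sample spans are hypothetical, and this is not spaCy's implementation.

```python
def prf(gold_entities, predicted_entities):
    """Return (precision, recall, F1) in percent for exact (start, end, label) matches."""
    gold = set(gold_entities)
    pred = set(predicted_entities)
    true_pos = len(gold & pred)
    precision = 100.0 * true_pos / len(pred) if pred else 0.0
    recall = 100.0 * true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example: one of two predictions is correct, and one of three gold entities is found.
gold = [(0, 7, "PER"), (20, 27, "LOC"), (31, 35, "DATE")]
pred = [(0, 7, "PER"), (20, 27, "ORG")]
p, r, f = prf(gold, pred)
print(p, r, f)
```

A label with high precision but very low recall, as some labels show in the results further down, means the model rarely predicts that label but is usually right when it does.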

In [41]:
# Load Spacy Model
nlp= spacy.load('es_core_news_md')

In [42]:
#Define parameters of the k-fold split (5 folds, with random shuffle, seed = 2)

kf = KFold(n_splits=5, random_state=2, shuffle=True)

In [43]:
split= kf.split(TAGGED_DATA)

In [44]:
#Define a blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data = pd.DataFrame(columns=columns)
eval_data = eval_data.fillna(0)

In [45]:
for train_index, test_index in split:
    
    #Generate training and test data
    traindata = TAGGED_DATA[train_index]
    testdata = TAGGED_DATA[test_index]
    
    #Copy the base model so that each fold starts from the same pretrained model instead of retraining one that was already updated
    nlp1 = deepcopy(nlp)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp1.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #Train only the NER component; the tagger and the parser are disabled because we are not using them here.
    with nlp1.disable_pipes('tagger','parser'):
        #Resume training of the pretrained model (use begin_training instead if you are starting from a blank model).
        optimizer= nlp1.resume_training()
        #Minibatch sizes: start at 1.0 and grow by a factor of 1.001 per batch, capped at 4.0.
        sizes = compounding(1.0, 4.0, 1.001)
        #Run 10 training passes. Each pass shuffles the training data and splits it into minibatches; the model is updated after every batch.
        for itn in range(10):
            #np.random.shuffle permutes the rows correctly; random.shuffle can silently duplicate rows of a 2-D NumPy array.
            np.random.shuffle(traindata)
            batches = minibatch(traindata, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp1.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    results = evaluate(nlp1,testdata)
    evaluation= dict((k, results[k]) for k in ['ents_per_type'] 
                                        if k in results)
    
    ev_date = [val.get('DATE') for val in evaluation.values()]
    ev_mon= [val.get('MON') for val in evaluation.values()]
    ev_obj= [val.get('OBJ') for val in evaluation.values()]
    ev_org= [val.get('ORG') for val in evaluation.values()]
    ev_per= [val.get('PER') for val in evaluation.values()]
    ev_loc= [val.get('LOC') for val in evaluation.values()]
    
    dlist = list(ev_date[0].values())
    newrow1= {'ents_p': dlist[0],'ents_r': dlist[1],'ents_f':dlist[2],'label':'DATE'}
    
    mlist = list(ev_mon[0].values())
    newrow2= {'ents_p': mlist[0],'ents_r':mlist[1],'ents_f':mlist[2],'label':'MON'}
                  
    oblist = list(ev_obj[0].values())
    newrow3= {'ents_p':oblist[0],'ents_r':oblist[1],'ents_f':oblist[2],'label':'OBJ'}
                  
    orlist = list(ev_org[0].values())
    newrow4= {'ents_p':orlist[0],'ents_r':orlist[1],'ents_f':orlist[2],'label':'ORG'}
                  
    plist = list(ev_per[0].values())
    newrow5= {'ents_p':plist[0],'ents_r':plist[1],'ents_f':plist[2],'label':'PER'}
                  
    llist = list(ev_loc[0].values())
    newrow6= {'ents_p':llist[0],'ents_r':llist[1],'ents_f':llist[2],'label':'LOC'}
                  
    eval_data=eval_data.append(newrow1,ignore_index=True)
    eval_data=eval_data.append(newrow2,ignore_index=True)
    eval_data=eval_data.append(newrow3,ignore_index=True)
    eval_data=eval_data.append(newrow4,ignore_index=True)
    eval_data=eval_data.append(newrow5,ignore_index=True)
    eval_data=eval_data.append(newrow6,ignore_index=True)

Losses {'ner': 28763.270657695703}
Losses {'ner': 25000.547394746973}
Losses {'ner': 21981.59359280667}
Losses {'ner': 17886.673262105265}
Losses {'ner': 20244.352677792045}
Losses {'ner': 24499.74955254048}
Losses {'ner': 25568.97612799052}
Losses {'ner': 27677.37665426545}
Losses {'ner': 28639.05391213298}
Losses {'ner': 29057.6745602265}


Losses {'ner': 28284.365855179843}
Losses {'ner': 27317.948948207977}
Losses {'ner': 26979.25418621886}
Losses {'ner': 23497.794500031094}
Losses {'ner': 18448.238092696294}
Losses {'ner': 19399.105394817656}
Losses {'ner': 21368.216864025337}
Losses {'ner': 23970.236314468086}
Losses {'ner': 26129.32080714032}
Losses {'ner': 27405.21348118782}


Losses {'ner': 35633.34429153843}
Losses {'ner': 32028.725445642172}
Losses {'ner': 22923.35013491605}
Losses {'ner': 17980.818063028128}
Losses {'ner': 18253.946180973053}
Losses {'ner': 17280.585044681793}
Losses {'ner': 17710.128905594174}
Losses {'ner': 18595.123207718134}
Losses {'ner': 22507.744688019156}
Losses {'ner': 23510.849381119013}


Losses {'ner': 27091.85743642479}
Losses {'ner': 22025.774144502764}
Losses {'ner': 15901.80189734892}
Losses {'ner': 12806.28643740952}
Losses {'ner': 10244.279531438795}
Losses {'ner': 10797.844315719667}
Losses {'ner': 10856.318150190167}
Losses {'ner': 11154.773612475054}
Losses {'ner': 11250.234395478881}
Losses {'ner': 11382.200327911145}


Losses {'ner': 36625.64222108561}
Losses {'ner': 38606.43633953873}
Losses {'ner': 37973.322053941694}
Losses {'ner': 27511.949096575616}
Losses {'ner': 22411.499419410015}
Losses {'ner': 21987.13741102442}
Losses {'ner': 24146.043497385457}
Losses {'ner': 25628.12248749286}
Losses {'ner': 27520.854878157377}
Losses {'ner': 28561.475986121222}
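
The `sizes` generator passed to `minibatch` in the training loop comes from `compounding(1.0, 4.0, 1.001)`. The pure-Python generator below is an illustrative stand-in written for this note, not spaCy's code. It assumes the documented behaviour: start at the first value, multiply by the third after every batch, and never exceed the second. The name `compounding_sketch` is hypothetical.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

# The first few batch sizes stay very close to 1.
first = list(islice(compounding_sketch(1.0, 4.0, 1.001), 3))
print(first)

# Far enough into the series, the size is pinned at the cap.
late = list(islice(compounding_sketch(1.0, 4.0, 1.001), 2000))
print(late[-1])
```

With a factor of 1.001 the size grows slowly: it takes roughly 1,400 batches to reach the cap of 4, so most early updates in the loop above see batches of a single example.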


In [46]:
print(eval_data)

        ents_p     ents_r     ents_f label
0     0.000000   0.000000   0.000000  DATE
1   100.000000   1.587302   3.125000   MON
2     0.000000   0.000000   0.000000   OBJ
3    66.666667   8.000000  14.285714   ORG
4    82.727273  80.888889  81.797753   PER
5    71.532847  70.503597  71.014493   LOC
6     0.000000   0.000000   0.000000  DATE
7    80.952381  36.170213  50.000000   MON
8     0.000000   0.000000   0.000000   OBJ
9    60.000000   7.142857  12.765957   ORG
10   82.710280  83.098592  82.903981   PER
11   84.251969  76.978417  80.451128   LOC
12    0.000000   0.000000   0.000000  DATE
13   33.333333   1.149425   2.222222   MON
14    0.000000   0.000000   0.000000   OBJ
15   85.714286   9.090909  16.438356   ORG
16   83.482143  74.501992  78.736842   PER
17   77.205882  61.046512  68.181818   LOC
18    0.000000   0.000000   0.000000  DATE
19   84.615385  22.916667  36.065574   MON
20    0.000000   0.000000   0.000000   OBJ
21   73.684211  25.000000  37.333333   ORG
22   87.500

In [36]:
print(results)

{'uas': 0.0, 'las': 0.0, 'ents_p': 83.10626702997274, 'ents_r': 48.41269841269841, 'ents_f': 61.18355065195586, 'ents_per_type': {'PER': {'p': 84.38818565400844, 'r': 81.63265306122449, 'f': 82.98755186721992}, 'MON': {'p': 66.66666666666666, 'r': 5.47945205479452, 'f': 10.126582278481013}, 'LOC': {'p': 81.81818181818183, 'r': 61.111111111111114, 'f': 69.96466431095408}, 'ORG': {'p': 66.66666666666666, 'r': 3.8461538461538463, 'f': 7.272727272727273}, 'DATE': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'OBJ': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'tags_acc': 0.0, 'token_acc': 100.0, 'textcat_score': 0.0, 'textcats_per_cat': {}}
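
The `ents_per_type` entry of the dictionary printed above maps each label to its `p`, `r` and `f` scores. One possible way to turn that into table rows is to build a list of dicts and create the DataFrame in one step. This is sketched with a shortened, rounded copy of the printed values; the helper name `per_type_rows` and the `fold` column are hypothetical.

```python
import pandas as pd

def per_type_rows(scores, fold):
    """Flatten scores['ents_per_type'] into one row per entity label."""
    rows = []
    for label, s in scores["ents_per_type"].items():
        rows.append({"ents_p": s["p"], "ents_r": s["r"], "ents_f": s["f"],
                     "label": label, "fold": fold})
    return rows

# Shortened, rounded sample in the same shape as the printed results.
sample_scores = {"ents_per_type": {
    "PER": {"p": 84.39, "r": 81.63, "f": 82.99},
    "DATE": {"p": 0.0, "r": 0.0, "f": 0.0},
}}

table = pd.DataFrame(per_type_rows(sample_scores, fold=0))
print(table)
```

Looking scores up by key rather than by position avoids depending on the order of `p`, `r` and `f` in the dictionary, and it handles whichever labels are present without one variable per label.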


In [47]:
#Measure mean and standard deviation of f, p and r scores for each label 
a = eval_data.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

In [None]:
# Evaluate Model trained with EMS dictionary

#Generate empty dictionary for storing evaluation results of different trials
d2 = {}

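#Loop 10 times
for x in range(10):
    
    #Shuffle the rows in place (np.random.shuffle is safe for 2-D arrays), then split into training and test sets
    np.random.shuffle(TAGGED_DATA)
    train_data = TAGGED_DATA[:326]
    test_data = TAGGED_DATA[326:]
    
    #Load the model to be trained (copy it, so that every trial starts from the same base model)
    nlp2 = deepcopy(nlp)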
    
    #Create object for retrieving the NER pipeline component
    ner=nlp2.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #Train only the NER component; the tagger and the parser are disabled because we are not using them here.
    with nlp2.disable_pipes('tagger','parser'):
        #Resume training of the pretrained model (use begin_training instead if you are starting from a blank model).
        optimizer= nlp2.resume_training()
        #Minibatch sizes: start at 1.0 and grow by a factor of 1.001 per batch, capped at 4.0.
        sizes = compounding(1.0, 4.0, 1.001)
        #Run 10 training passes. Each pass shuffles the training data and splits it into minibatches; the model is updated after every batch.
        for itn in range(10):
            np.random.shuffle(train_data)
            batches = minibatch(train_data, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                #Update nlp2, the model that is evaluated below
                nlp2.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)
            
    results = evaluate(nlp2,test_data)
    d2[x] = pd.DataFrame(results)
    
    
eval_data2 = pd.DataFrame(columns=columns)
eval_data2 = eval_data2.fillna(0)

for x in d2:
    ev_date= d2[x].loc['DATE','ents_per_type']
    ev_loc= d2[x].loc['LOC','ents_per_type']
    ev_mon= d2[x].loc['MON','ents_per_type']
    ev_obj= d2[x].loc['OBJ','ents_per_type']
    ev_org= d2[x].loc['ORG','ents_per_type']
    ev_per= d2[x].loc['PER','ents_per_type']
    newrow1={'ents_p':ev_date['p'],'ents_r':ev_date['r'],'ents_f':ev_date['f'],'label':'DATE','trial':x}
    newrow2={'ents_p':ev_loc['p'],'ents_r':ev_loc['r'],'ents_f':ev_loc['f'],'label':'LOC','trial':x}
    newrow3={'ents_p':ev_mon['p'],'ents_r':ev_mon['r'],'ents_f':ev_mon['f'],'label':'MON','trial':x}
    newrow4={'ents_p':ev_obj['p'],'ents_r':ev_obj['r'],'ents_f':ev_obj['f'],'label':'OBJ','trial':x}
    newrow5={'ents_p':ev_org['p'],'ents_r':ev_org['r'],'ents_f':ev_org['f'],'label':'ORG','trial':x}
    newrow6={'ents_p':ev_per['p'],'ents_r':ev_per['r'],'ents_f':ev_per['f'],'label':'PER','trial':x}
    eval_data2=eval_data2.append(newrow1,ignore_index=True)
    eval_data2=eval_data2.append(newrow2,ignore_index=True)
    eval_data2=eval_data2.append(newrow3,ignore_index=True)
    eval_data2=eval_data2.append(newrow4,ignore_index=True)
    eval_data2=eval_data2.append(newrow5,ignore_index=True)
    eval_data2=eval_data2.append(newrow6,ignore_index=True)

In [55]:
b= eval_data2.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

In [48]:
print(a)

          ents_f                ents_p                ents_r           
            mean        std       mean        std       mean        std
label                                                                  
DATE    0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
LOC    73.367465   4.679829  79.860304   6.678700  68.402639   7.278991
MON    19.821021  21.853128  71.780220  25.798075  13.186639  15.690364
OBJ     0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
ORG    16.164672  13.458877  57.213033  33.369487   9.846753   9.190651
PER    82.528246   2.583081  84.060799   1.987227  81.115796   3.888144


In [None]:
print(b)

In [6]:
53.52-1.96*19.66/(3)

40.67546666666667

In [7]:
53.52+1.96*19.66/(3)

66.36453333333334

In [9]:
96.06+1.96*1.85/3

97.26866666666668

In [10]:
96.06-1.96*1.85/3

94.85133333333333

In [11]:
93.39-1.96*2.13/3

91.9984

In [12]:
93.39+1.96*2.13/3

94.7816
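
The arithmetic cells above all have the form mean ± 1.96 × std / 3, which reads like a normal-approximation 95% confidence interval with 3 standing in for the square root of the number of trials. That interpretation is an assumption, since the worksheet does not state it. Under that assumption, the calculation can be wrapped in a small helper; the name `confidence_interval` and its parameters are hypothetical.

```python
def confidence_interval(mean, std, root_n=3.0, z=1.96):
    """Return (lower, upper) for mean +/- z * std / root_n."""
    half_width = z * std / root_n
    return mean - half_width, mean + half_width

# The (mean, std) pairs used in the cells above.
for mean, std in [(53.52, 19.66), (96.06, 1.85), (93.39, 2.13)]:
    lower, upper = confidence_interval(mean, std)
    print(round(lower, 2), round(upper, 2))
```

If the intervals are meant to summarise 10 trials, passing `root_n=math.sqrt(10)` (about 3.16) would give slightly narrower bounds than the divisor of 3 used above.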