# Evaluating the Results of Training

The results of training (and its evaluation) will depend on how the data was split into training and testing sets. In this worksheet, we use repeated random subsampling to assess the performance of our trained model.

According to [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)):

>This method, also known as Monte Carlo cross-validation, creates multiple random splits of the dataset into training and validation data. For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits. The advantage of this method (over k-fold cross validation) is that the proportion of the training/validation split is not dependent on the number of iterations (folds). The disadvantage of this method is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap. This method also exhibits Monte Carlo variation, meaning that the results will vary if the analysis is repeated with different random splits.

We will divide our data into an 80-20 split, using 80% for training and 20% for testing. A new random split is drawn for every trial, so that we can evaluate how well training performs on average rather than on one particular split.
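As a rough illustration of repeated random subsampling, the sketch below draws several independent 80-20 splits from a toy list and counts how often each item lands in a test set. The function name `random_splits` and the toy data are assumptions made for this example; they are not part of the worksheet.

```python
import random
from collections import Counter

def random_splits(data, n_trials=10, train_frac=0.8, seed=0):
    """Yield (train, test) pairs from independent random shuffles of data."""
    rng = random.Random(seed)
    cut = int(len(data) * train_frac)
    for _ in range(n_trials):
        shuffled = data[:]          # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        yield shuffled[:cut], shuffled[cut:]

items = list(range(20))             # toy stand-in for the tagged examples
test_counts = Counter()
for train, test in random_splits(items):
    assert len(train) == 16 and len(test) == 4
    assert not set(train) & set(test)   # no overlap within a single split
    test_counts.update(test)

# Across splits, some items may land in several test sets and others in none.
print(sorted(test_counts.items()))
```

Within one split the training and test sets never overlap, but the test sets of different splits can, which is the trade-off the quotation describes.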


In [1]:
#Import necessary modules

from __future__ import unicode_literals, print_function
import spacy
from spacy.lang.es import Spanish 
from spacy import displacy
from spacy.tokens import Doc
from collections import defaultdict, Counter
from spacy.attrs import ORTH
from spacy.scorer import Scorer
from spacy.gold import GoldParse
from spacy.util import minibatch, compounding

import pandas as pd
import numpy as np
import json
import plac
import random
from sklearn.model_selection import train_test_split
from pathlib import Path
from copy import deepcopy

In [10]:
# Read Tagged Data from JSON file
with open('TaggedData_SF.json', 'r', encoding='utf-8') as fp2:
    TAGGED_DATA = json.load(fp2)

spaCy has a built-in command for evaluating a model's performance from the [command line](https://spacy.io/api/cli#evaluate), but alternatively you can define a function like the one below. It takes an NER model and a list of annotated examples, and returns several metrics:

- UAS (Unlabelled Attachment Score)
- LAS (Labelled Attachment Score)
- ents_p
- ents_r
- ents_f
- tags_acc
- token_acc

[According](https://github.com/explosion/spaCy/issues/2405) to one of the creators of spaCy:

>The UAS and LAS are standard metrics to evaluate dependency parsing. UAS is the proportion of tokens whose head has been correctly assigned, LAS is the proportion of tokens whose head has been correctly assigned with the right dependency label (subject, object, etc).
>
>ents_p, ents_r, ents_f are the precision, recall and fscore for the NER task.
>
>tags_acc is the POS tagging accuracy.
>
>token_acc seems to be the precision for token segmentation.

The key metrics for this task are precision, recall and F-score.

**Precision** (ents_p) is the share of the entities labeled by the model that are correct: True Positive/(True Positive+False Positive).

**Recall** (ents_r) is the share of all true entities that the model labeled correctly: True Positive/(True Positive+False Negative).

The **F-score** (ents_f) is the harmonic mean of precision and recall: 2 × Precision × Recall/(Precision+Recall).

These metrics are reported once for all entity types (labels) together and then separately for each label, on a scale from 0 to 100. We want these values to be as close as possible to 100.
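To make the definitions concrete, here is a small sketch that computes the three scores from raw counts, with the F-score taken as the harmonic mean of precision and recall. The function name and the counts are hypothetical; the counts were picked because they reproduce the MON scores of the first trial printed later in this worksheet.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) as percentages from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return 100 * precision, 100 * recall, 100 * f1

# Hypothetical counts: 30 correct entities, 19 spurious ones, 30 missed ones
p, r, f = precision_recall_f1(tp=30, fp=19, fn=30)
print(round(p, 2), round(r, 2), round(f, 2))   # 61.22 50.0 55.05
```

Note that the F-score (55.05) is lower than the arithmetic mean of precision and recall (55.61), because the harmonic mean is pulled toward the smaller of the two values.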

In [None]:
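def evaluate(ner_model, examples):
    # Score the model's predictions against the gold-standard annotations
    scorer = Scorer()
    for sents, ents in examples:
        # Build the gold-standard parse from the annotated entity spans
        doc_gold = ner_model.make_doc(sents)
        gold = GoldParse(doc_gold, entities=ents['entities'])
        # Run the model on the raw text and compare its output with the gold parse
        pred_value = ner_model(sents)
        scorer.score(pred_value, gold)
    return scorer.scores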

Next, we load the spaCy model, define a blank dataframe to store the output of our different trials, and calculate the number of examples needed for an 80-20 split.

In [11]:
# Load Spacy Model
nlp= spacy.load('es_core_news_md')

In [19]:
#Define a blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data = pd.DataFrame(columns=columns)
eval_data = eval_data.fillna(0)

In [24]:
# Calculate 80% of data for an 80-20 split

len(TAGGED_DATA)*0.8

326.40000000000003
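
The result is not a whole number, so the split index has to be truncated. A minimal sketch of that arithmetic follows; the total of 408 examples is an assumption inferred from the output above (326.4 / 0.8), since the worksheet does not print `len(TAGGED_DATA)` directly.

```python
n_examples = 408                    # assumed: inferred from 326.4 / 0.8
split = int(n_examples * 0.8)       # truncates 326.4 to a usable list index
n_train, n_test = split, n_examples - split
print(n_train, n_test)              # 326 82
```

Computing the index this way, rather than hard-coding it, would keep the split correct if the number of tagged examples changes.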

Finally, we run the training loop ten times, each with a different 80-20 split, and store the evaluation statistics of our NER model in our dataframe. Each trial trains a fresh copy of the loaded model. Otherwise, every trial would continue training a model that had already seen the training sets of earlier trials, which include examples from the current test set, and the model would score better on the tagged data than on the new samples we are interested in tagging later.
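The effect of copying the model can be sketched with a toy object. The dictionary and the `toy_train` function below are hypothetical stand-ins for the spaCy pipeline and its training loop; only `deepcopy` is the same call the worksheet uses.

```python
from copy import deepcopy

def toy_train(model, step):
    """Pretend training: update the weights in place."""
    for i in range(len(model["weights"])):
        model["weights"][i] += step

base = {"weights": [0.0, 0.0]}   # toy stand-in for the pretrained pipeline

for trial in range(3):
    model = deepcopy(base)       # each trial starts from the same untouched state
    toy_train(model, step=1.0)
    assert model["weights"] == [1.0, 1.0]

assert base["weights"] == [0.0, 0.0]   # the base model never saw any trial's data

# Without the copy, updates accumulate across trials:
shared = base
for trial in range(3):
    toy_train(shared, step=1.0)
print(base["weights"])           # now [3.0, 3.0]
```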

In [20]:
# Testing how much the evaluation depends on texts included in testing data

#Loop 10 times
for x in range(0,10):
    
    #Batching the Tagged Data into training and evaluation data (80-20 split)

    random.shuffle(TAGGED_DATA)
    train_data = TAGGED_DATA[:326]
    test_data = TAGGED_DATA[326:]

    #Load the model to be trained (save separately, because we do not want to repeatedly retrain the same model)
    nlp1 = deepcopy(nlp)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp1.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #Train the model, but only the NER component (disabling the tagger and the parser, which we are not using here).
    with nlp1.disable_pipes('tagger','parser'):
        #Here we resume training; alternatively, you could use begin_training if you are starting on a new model.
        optimizer= nlp1.resume_training()
        #Minibatch sizes: a series that starts at 1.0 and grows by a factor of 1.001 with each batch, capped at 4.0
        sizes = compounding(1.0, 4.0, 1.001)
        #Run 10 training passes, each time shuffling the training data and splitting it into mini-batches from which the algorithm learns to label. The model is updated after each batch.
        for itn in range(10):
            random.shuffle(train_data)
            batches = minibatch(train_data, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp1.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)
    
    #Test the NER results of the trained model on the held-out test data
    results = evaluate(nlp1, test_data)
    per_type = results['ents_per_type']

    #Store precision, recall and F-score as one row per label
    for label in ['DATE', 'MON', 'OBJ', 'ORG', 'PER', 'LOC']:
        scores = per_type[label]
        newrow = {'ents_p': scores['p'], 'ents_r': scores['r'], 'ents_f': scores['f'], 'label': label}
        eval_data = eval_data.append(newrow, ignore_index=True)

Losses {'ner': 28067.362139642217}
Losses {'ner': 26077.473458575434}
Losses {'ner': 25732.790645901237}
Losses {'ner': 25423.783981692308}
Losses {'ner': 24873.045012149792}
Losses {'ner': 25221.563670720905}
Losses {'ner': 25339.74750509858}
Losses {'ner': 25090.522224128246}
Losses {'ner': 24895.211273133755}
Losses {'ner': 24554.592849495355}


Losses {'ner': 29028.443981474196}
Losses {'ner': 26535.32327965144}
Losses {'ner': 26292.086815213435}
Losses {'ner': 25841.25038647733}
Losses {'ner': 25902.866968294635}
Losses {'ner': 25656.89452091232}
Losses {'ner': 25887.78905776143}
Losses {'ner': 25367.45244167745}
Losses {'ner': 25995.728346973658}
Losses {'ner': 25172.58657830581}


Losses {'ner': 27290.306967409793}
Losses {'ner': 25090.18081061895}
Losses {'ner': 24641.981659894846}
Losses {'ner': 24662.654106218833}
Losses {'ner': 24211.69880866655}
Losses {'ner': 23966.144414107664}
Losses {'ner': 24124.79556529969}
Losses {'ner': 23943.54584768042}
Losses {'ner': 23849.92142687738}
Losses {'ner': 24112.667848318815}


Losses {'ner': 29341.542794615954}
Losses {'ner': 27261.05710970097}
Losses {'ner': 26495.96780495113}
Losses {'ner': 26205.65187996399}
Losses {'ner': 25990.564476336272}
Losses {'ner': 25806.83585464582}
Losses {'ner': 26052.673229801003}
Losses {'ner': 25636.212341895327}
Losses {'ner': 26052.60259948671}
Losses {'ner': 25656.602459728718}


Losses {'ner': 30275.82644467696}
Losses {'ner': 28098.2131320345}
Losses {'ner': 27537.53798761471}
Losses {'ner': 27208.02515631076}
Losses {'ner': 27252.625558234715}
Losses {'ner': 27290.96278436482}
Losses {'ner': 27089.783796951175}
Losses {'ner': 26991.23956760578}
Losses {'ner': 26833.597617149353}
Losses {'ner': 26979.624137340114}


Losses {'ner': 28509.36463215958}
Losses {'ner': 26050.167337943512}
Losses {'ner': 25457.364234812892}
Losses {'ner': 25443.007019339377}
Losses {'ner': 25178.34919881077}
Losses {'ner': 25597.729673262686}
Losses {'ner': 25344.103919938672}
Losses {'ner': 24958.77422168851}
Losses {'ner': 25418.82894744724}
Losses {'ner': 25086.66230070591}


Losses {'ner': 32535.204346468145}
Losses {'ner': 30088.455238804494}
Losses {'ner': 29453.92096221316}
Losses {'ner': 29313.49556240329}
Losses {'ner': 29326.282054278767}
Losses {'ner': 29167.1843454279}
Losses {'ner': 28902.969515618868}
Losses {'ner': 29165.487021811306}
Losses {'ner': 29344.095430403948}
Losses {'ner': 28950.12413278222}


Losses {'ner': 27702.989935073965}
Losses {'ner': 26101.388527461655}
Losses {'ner': 25277.2713517674}
Losses {'ner': 25013.269684964027}
Losses {'ner': 25484.252986246778}
Losses {'ner': 25022.646217403933}
Losses {'ner': 24980.72864601761}
Losses {'ner': 25011.730789244175}
Losses {'ner': 24909.540938850492}
Losses {'ner': 24457.738883562386}


Losses {'ner': 31080.09514921786}
Losses {'ner': 28670.748923293104}
Losses {'ner': 27945.224625468247}
Losses {'ner': 27965.923443717766}
Losses {'ner': 27794.135915773688}
Losses {'ner': 27817.828810952604}
Losses {'ner': 27758.504013635218}
Losses {'ner': 28048.27471022308}
Losses {'ner': 27625.66261018254}
Losses {'ner': 27580.358522176743}


Losses {'ner': 29640.437973298034}
Losses {'ner': 27341.978425757316}
Losses {'ner': 26470.14491775891}
Losses {'ner': 26183.058625902864}
Losses {'ner': 26456.184109028058}
Losses {'ner': 26332.198845272884}
Losses {'ner': 26281.250217121094}
Losses {'ner': 26147.041104391217}
Losses {'ner': 25935.223890662193}
Losses {'ner': 25969.405787262483}


Losses {'ner': 29561.011135987133}
Losses {'ner': 27376.658474487678}
Losses {'ner': 26681.567853248638}
Losses {'ner': 26967.920826007787}
Losses {'ner': 26529.398723179038}
Losses {'ner': 26340.85194108216}
Losses {'ner': 26422.428576783743}
Losses {'ner': 26818.571851305664}
Losses {'ner': 26468.83990008384}
Losses {'ner': 26221.15822866559}
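
The batch sizes passed to `minibatch` in the training loop above come from `compounding(1.0, 4.0, 1.001)`. The sketch below is a plain-Python re-implementation written for illustration only (the names `compounding_sizes` and `minibatches` are hypothetical and this is not spaCy's own code). It assumes that each size is truncated to an integer before a batch is cut, so the effective batch size stays at 1 for many batches before it grows.

```python
import itertools

def compounding_sizes(start, stop, factor):
    """Yield start, start*factor, start*factor**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= factor

def minibatches(items, sizes):
    """Cut items into consecutive batches whose lengths follow sizes."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch

first_sizes = [int(s) for s in itertools.islice(compounding_sizes(1.0, 4.0, 1.001), 5)]
print(first_sizes)   # [1, 1, 1, 1, 1]

# Count how many batches pass before the integer batch size reaches 2
steps_to_two = next(
    i for i, s in enumerate(compounding_sizes(1.0, 4.0, 1.001)) if int(s) >= 2
)
print(steps_to_two)

# A faster growth factor makes the schedule easier to see on a short list
batches = list(minibatches(range(10), compounding_sizes(1.0, 4.0, 1.5)))
print([len(b) for b in batches])   # [1, 1, 2, 3, 3]
```

With a growth factor as small as 1.001, several hundred batches are processed one example at a time, so a larger factor may be worth trying if training is slow.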


Below, we print the contents of our evaluation dataframe:

In [21]:
print(eval_data)

        ents_p     ents_r     ents_f label
0     0.000000   0.000000   0.000000  DATE
1    61.224490  50.000000  55.045872   MON
2    50.000000   2.020202   3.883495   OBJ
3    55.813953  38.095238  45.283019   ORG
4    78.599222  82.786885  80.638723   PER
5    85.227273  87.719298  86.455331   LOC
6     0.000000   0.000000   0.000000  DATE
7    65.000000  63.934426  64.462810   MON
8     0.000000   0.000000   0.000000   OBJ
9    39.473684  29.411765  33.707865   ORG
10   85.507246  85.922330  85.714286   PER
11   84.285714  81.379310  82.807018   LOC
12    0.000000   0.000000   0.000000  DATE
13   80.487805  33.673469  47.482014   MON
14  100.000000   1.162791   2.298851   OBJ
15   20.930233  16.981132  18.750000   ORG
16   94.190871  82.246377  87.814313   PER
17   87.837838  77.844311  82.539683   LOC
18   33.333333   6.250000  10.526316  DATE
19   71.698113  57.575758  63.865546   MON
20   40.000000   2.941176   5.479452   OBJ
21   60.606061  35.087719  44.444444   ORG
22   86.666

In [22]:
#Measure mean and standard deviation of f, p and r scores for each label 
a = eval_data.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

In [23]:
print(a)

          ents_f                ents_p                ents_r           
            mean        std       mean        std       mean        std
label                                                                  
DATE   12.425382  13.227116  33.030303  29.267143   7.806532   8.790330
LOC    83.570682   2.360039  86.212179   3.180820  81.217373   3.732953
MON    58.878322   8.416557  71.064200   9.380229  51.436929  10.623568
OBJ     1.996951   2.077158  40.000000  43.588989   1.036099   1.089365
ORG    31.989093  13.655120  41.346887  17.342943  26.485795  11.824742
PER    86.409244   2.541786  87.188831   3.788883  85.726376   2.600771
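
For readers who want to check the aggregation by hand, the sketch below groups F-scores by label and computes the mean and the sample standard deviation with the standard library. It uses only the PER and LOC rows of the three complete trials visible in the truncated printout above, so its numbers differ from the ten-trial table. `statistics.stdev` is the sample standard deviation, which is also what pandas computes by default.

```python
from collections import defaultdict
from statistics import mean, stdev

# (label, ents_f) pairs copied from rows 4-5, 10-11 and 16-17 of the printout
rows = [
    ("PER", 80.638723), ("LOC", 86.455331),
    ("PER", 85.714286), ("LOC", 82.807018),
    ("PER", 87.814313), ("LOC", 82.539683),
]

by_label = defaultdict(list)
for label, f_score in rows:
    by_label[label].append(f_score)

summary = {label: (mean(vals), stdev(vals)) for label, vals in by_label.items()}
for label, (m, s) in sorted(summary.items()):
    print(f"{label}  mean={m:.2f}  std={s:.2f}")
```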


# Evaluating Spelling Normalization

We can apply the evaluation above to a model trained with text whose spelling has been normalized, thus evaluating whether the inclusion of a normalization dictionary improves training results.

In [25]:
# Read Norm Exceptions from JSON file
with open('normalizeddict.json', 'r', encoding='utf-8') as fp3:
    NORM_EXCEPTIONS = json.load(fp3)

In [26]:
# Load model
nlp2= spacy.load('es_core_news_md')

#Define and add pipeline component that updates .norm attribute

def add_custom_norms(doc):
    for token in doc:
        if token.text in NORM_EXCEPTIONS:
            token.norm_ = NORM_EXCEPTIONS[token.text]
    return doc

#Add component to the pipeline

nlp2.add_pipe(add_custom_norms, first=True)
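
The lookup performed by `add_custom_norms` can be sketched without spaCy. The two dictionary entries below are hypothetical examples of historical spellings; the real mapping is the one loaded from `normalizeddict.json`.

```python
# Hypothetical entries, for illustration only
NORM_EXAMPLE = {"dixo": "dijo", "muger": "mujer"}

def normalize_tokens(tokens, norms):
    """Return the normalized form of each token, or the token itself if unlisted."""
    return [norms.get(tok, tok) for tok in tokens]

print(normalize_tokens(["la", "muger", "dixo"], NORM_EXAMPLE))
# ['la', 'mujer', 'dijo']

# The match is exact and case-sensitive, as with `token.text in NORM_EXCEPTIONS`
print(normalize_tokens(["Muger"], NORM_EXAMPLE))
# ['Muger']
```

Because the match is exact, capitalized or otherwise varied forms are only normalized if they appear in the dictionary themselves.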

In [27]:
#Define a new blank dataframe with columns for the information we are interested in

columns=['ents_p', 'ents_r', 'ents_f', 'label']
eval_data2 = pd.DataFrame(columns=columns)
eval_data2 = eval_data2.fillna(0)

In [29]:
# Train and evaluate Model trained with EMS dictionary

#Loop 10 times
for x in range(0,10):
    
    random.shuffle(TAGGED_DATA)
    train_data = TAGGED_DATA[:326]
    test_data = TAGGED_DATA[326:]
    
    #Load the model to be trained
    nlp3 = deepcopy(nlp2)
    
    #Create object for retrieving the NER pipeline component
    ner=nlp3.get_pipe("ner")

    #Generate new labels for the NER component (if you wish to create new labels)
    ner.add_label("OBJ")
    ner.add_label("MON")
    ner.add_label("DATE")

    #Train the model, but only the NER component (disabling the tagger and the parser, which we are not using here).
    with nlp3.disable_pipes('tagger','parser'):
        #Here we resume training; alternatively, you could use begin_training if you are starting on a new model.
        optimizer= nlp3.resume_training()
        #Minibatch sizes: a series that starts at 1.0 and grows by a factor of 1.001 with each batch, capped at 4.0
        sizes = compounding(1.0, 4.0, 1.001)
        #Run 10 training passes, each time shuffling the training data and splitting it into mini-batches from which the algorithm learns to label. The model is updated after each batch.
        for itn in range(10):
            random.shuffle(train_data)
            batches = minibatch(train_data, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp3.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)
   
    #Test the NER results of the trained model on the held-out test data
    results = evaluate(nlp3, test_data)
    per_type = results['ents_per_type']

    #Store precision, recall and F-score as one row per label
    for label in ['DATE', 'MON', 'OBJ', 'ORG', 'PER', 'LOC']:
        scores = per_type[label]
        newrow = {'ents_p': scores['p'], 'ents_r': scores['r'], 'ents_f': scores['f'], 'label': label}
        eval_data2 = eval_data2.append(newrow, ignore_index=True)

Losses {'ner': 29704.185964004595}
Losses {'ner': 27239.500042788895}
Losses {'ner': 26941.961559539537}
Losses {'ner': 26145.953828093996}
Losses {'ner': 26334.20562622226}
Losses {'ner': 26604.300484430045}
Losses {'ner': 26098.342754028272}
Losses {'ner': 26418.023478515446}
Losses {'ner': 25914.664099514484}
Losses {'ner': 26235.914868056774}


Losses {'ner': 30703.52576867009}
Losses {'ner': 28725.902252893466}
Losses {'ner': 28159.463954733743}
Losses {'ner': 27959.68541136016}
Losses {'ner': 27966.129337390652}
Losses {'ner': 27301.701287878677}
Losses {'ner': 27441.524910437874}
Losses {'ner': 27437.85645264387}
Losses {'ner': 27486.76589106745}
Losses {'ner': 27322.532069921494}


Losses {'ner': 30608.731875505335}
Losses {'ner': 28358.62261412488}
Losses {'ner': 28061.892892574677}
Losses {'ner': 27869.223551096162}
Losses {'ner': 27870.966918962076}
Losses {'ner': 27565.368818713352}
Losses {'ner': 27750.462330672424}
Losses {'ner': 27679.12888814509}
Losses {'ner': 27550.064000189304}
Losses {'ner': 27454.43350493908}


Losses {'ner': 30636.544206978608}
Losses {'ner': 28662.39909793232}
Losses {'ner': 27990.096906065486}
Losses {'ner': 27770.806052751723}
Losses {'ner': 28045.94076333518}
Losses {'ner': 27441.458400078118}
Losses {'ner': 28033.74474290572}
Losses {'ner': 27503.52598297596}
Losses {'ner': 27305.03563812375}
Losses {'ner': 27613.302244063467}


Losses {'ner': 29822.95023346947}
Losses {'ner': 27813.459726089448}
Losses {'ner': 27374.243101541037}
Losses {'ner': 27357.323377413064}
Losses {'ner': 26978.66976794449}
Losses {'ner': 26844.57572968828}
Losses {'ner': 27291.27380744554}
Losses {'ner': 27105.986064648023}
Losses {'ner': 27108.813636779785}
Losses {'ner': 27120.995013475418}


Losses {'ner': 31271.56411062826}
Losses {'ner': 29240.612178586984}
Losses {'ner': 28423.878584734834}
Losses {'ner': 28382.362940905965}
Losses {'ner': 28346.81670781219}
Losses {'ner': 28201.512787211686}
Losses {'ner': 28134.756526775658}
Losses {'ner': 28543.895895455033}
Losses {'ner': 28476.213165938854}
Losses {'ner': 28297.5568472445}


Losses {'ner': 31785.638530219283}
Losses {'ner': 29261.247385079063}
Losses {'ner': 29008.69970434797}
Losses {'ner': 28910.39984441354}
Losses {'ner': 28785.794552255975}
Losses {'ner': 28358.168015688658}
Losses {'ner': 28308.551798445405}
Losses {'ner': 28462.10376200825}
Losses {'ner': 28321.0605584383}
Losses {'ner': 28554.489853855222}


Losses {'ner': 30592.462978096773}
Losses {'ner': 28048.832299594138}
Losses {'ner': 27656.271471435477}
Losses {'ner': 27689.566155280103}
Losses {'ner': 27565.01172095465}
Losses {'ner': 27682.711843401194}
Losses {'ner': 27400.2855032878}
Losses {'ner': 27560.821975003928}
Losses {'ner': 27233.499860771}
Losses {'ner': 27121.40555819869}


Losses {'ner': 30191.85820032967}
Losses {'ner': 28012.922388982814}
Losses {'ner': 27230.52329029574}
Losses {'ner': 27153.121265769354}
Losses {'ner': 27153.325990892423}
Losses {'ner': 27121.976557270857}
Losses {'ner': 26927.481867267983}
Losses {'ner': 27001.31964278221}
Losses {'ner': 27125.673155542463}
Losses {'ner': 26938.167551059276}


Losses {'ner': 31167.908232477508}
Losses {'ner': 28852.424275443653}
Losses {'ner': 28278.796518534036}
Losses {'ner': 28049.632881542348}
Losses {'ner': 27915.398711656686}
Losses {'ner': 27721.87783415802}
Losses {'ner': 27537.995221124962}
Losses {'ner': 27828.64074844122}
Losses {'ner': 27496.009865987115}
Losses {'ner': 27907.930468946695}


In [30]:
b= eval_data2.groupby('label').agg({'ents_f':['mean','std'],'ents_p':['mean','std'],'ents_r':['mean','std']})

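Below, we print the statistics for the training with (b) and without (a) spelling normalization. The effect of normalization is mixed: the mean F-score rises for DATE, OBJ and PER but falls for LOC, MON and ORG, and the standard deviations do not shrink consistently. Most of the differences are smaller than one standard deviation, so these trials do not show a clear improvement from normalizing spelling.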

In [31]:
print(a)

          ents_f                ents_p                ents_r           
            mean        std       mean        std       mean        std
label                                                                  
DATE   12.425382  13.227116  33.030303  29.267143   7.806532   8.790330
LOC    83.570682   2.360039  86.212179   3.180820  81.217373   3.732953
MON    58.878322   8.416557  71.064200   9.380229  51.436929  10.623568
OBJ     1.996951   2.077158  40.000000  43.588989   1.036099   1.089365
ORG    31.989093  13.655120  41.346887  17.342943  26.485795  11.824742
PER    86.409244   2.541786  87.188831   3.788883  85.726376   2.600771


In [32]:
print(b)

          ents_f                ents_p                ents_r          
            mean        std       mean        std       mean       std
label                                                                 
DATE   18.951137  14.983290  55.000000  39.907300  11.682644  9.683086
LOC    82.371995   3.815829  87.378347   4.921542  78.021471  4.186711
MON    57.162421   8.427770  66.077303   6.438534  50.754813  9.905893
OBJ     4.121914   5.276423  51.500000  46.070598   2.213534  2.899095
ORG    24.539355  10.168417  31.333530  10.121525  20.720597  9.793148
PER    87.089760   3.189638  90.179737   3.023785  84.264011  4.038805
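
To compare the two tables label by label, the sketch below subtracts the mean F-scores without normalization (a) from those with normalization (b). The values are copied from the two printouts above; the variable names are chosen for this example.

```python
# Mean ents_f per label, copied from tables a (without) and b (with normalization)
f_without = {"DATE": 12.425382, "LOC": 83.570682, "MON": 58.878322,
             "OBJ": 1.996951, "ORG": 31.989093, "PER": 86.409244}
f_with = {"DATE": 18.951137, "LOC": 82.371995, "MON": 57.162421,
          "OBJ": 4.121914, "ORG": 24.539355, "PER": 87.089760}

deltas = {label: f_with[label] - f_without[label] for label in f_without}
for label in sorted(deltas):
    print(f"{label:5s} {deltas[label]:+.2f}")

improved = sorted(label for label, d in deltas.items() if d > 0)
print(improved)   # ['DATE', 'OBJ', 'PER']
```

A difference in means is only meaningful relative to the spread across trials, so these deltas are best read alongside the standard deviations in the tables.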
