# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 2 - NER with Spacy
We will be using the preprocessed data from part one to start training NER models with spacy (https://spacy.io/) and to perform some additional preprocessing on our data before moving to training with flair. 
We will also explore evaluation metrics for NER, and decide how we want to quantify the performance of our trained models.

##### Load preprocessed data

In [1]:
import json
path = "converted_resumes.json"
with open(path, 'rt') as input_file:
    resumes = json.load(input_file)
print('{} resumes loaded'.format(len(resumes)))

690 resumes loaded


##### Prepare Training Data for NER model training
We need to do some more preprocessing of our training data before we can train our model.

As a first step, we will gather all resumes that contain at least one training annotation for each of the entities chosen in part 1.

In [2]:
chosen_entity_labels = ['Name', 'College Name', 'Companies worked at']

# gathers all resumes which have all of the chosen entities above
def gather_candidates(dataset, entity_labels):
    candidates = list()
    for resume in dataset:
        # collect the labels annotated in this resume (also works if it has no annotations)
        res_ent_labels = {ent[2] for ent in resume[1]["entities"]}
        if set(entity_labels).issubset(res_ent_labels):
            candidates.append(resume)
    return candidates

training_data = gather_candidates(resumes, chosen_entity_labels)
print("Gathered {} training examples".format(len(training_data)))

Gathered 466 training examples


Now we have those training examples which contain the entities we are interested in.

The next step is to remove duplicates.

In [3]:
# look for text duplicates in our training data
# if duplicates found: only use the first resume encountered
import collections
text_to_res_index = collections.OrderedDict()
for index, res in enumerate(training_data):
    text = res[0]
    if text not in text_to_res_index:
        text_to_res_index[text] = []
    text_to_res_index[text].append(index)
duplicates_indices = []
unique_resumes_indices = []
for _, indices in text_to_res_index.items():
    if len(indices) > 1:
        duplicates_indices.extend(indices)
        unique_resumes_indices.append(indices[0])
    else:
        unique_resumes_indices.extend(indices)
print('{} total resumes as training data, reducing to {} unique'.format(
        len(training_data), len(unique_resumes_indices)))
print_duplicates = False
if print_duplicates:
    for index in duplicates_indices:
        res = training_data[index]  # indices refer to training_data, not to resumes
        text = res[0]
        ents = res[1]['entities']
        print('{: >4}: {}...'.format(index, text[:80].replace('\n', r'\n')))
        for ent in ents:
            print('      {} = {}'.format(ent[2], text[ent[0]:ent[1]].replace('\n', r'\n')))

# update training_data to only use unique resumes
training_data = [training_data[i] for i in unique_resumes_indices]

466 total resumes as training data, reducing to 345 unique
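
The same "keep the first occurrence" deduplication can also be written more compactly. The following is a minimal sketch rather than the notebook's own code; the function name `unique_by_text` and the toy data are made up for illustration.

```python
def unique_by_text(dataset):
    """Keep only the first item for every distinct text (item[0]), preserving order."""
    seen_texts = set()
    unique = []
    for item in dataset:
        text = item[0]
        if text not in seen_texts:
            seen_texts.add(text)
            unique.append(item)
    return unique

# toy data in the same [text, {"entities": [...]}] layout as the resumes
toy_data = [
    ["Jane Doe\nAcme", {"entities": [(0, 8, "Name")]}],
    ["John Roe\nInitech", {"entities": [(0, 8, "Name")]}],
    ["Jane Doe\nAcme", {"entities": [(9, 13, "Companies worked at")]}],
]
unique = unique_by_text(toy_data)
print(len(toy_data), "->", len(unique))
```

As in the cell above, the annotations of a later duplicate are discarded, not merged into the first occurrence.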


##### Remove other entity annotations from training data
Now that we have our training data, we want to remove all entity annotations except those for our chosen entities, so that the model is trained only on the entities we are interested in.

In [4]:
# keep only the annotations whose label is in the list of allowed labels
def filter_ents(ents, allowed_labels):
    filtered = [ent for ent in ents if ent[2] in allowed_labels]
    return filtered

# remove all but relevant (chosen) entity annotations
X = [
    [resume[0], {'entities': 
        filter_ents(resume[1]['entities'], chosen_entity_labels)
    }]
    for resume in training_data
]
print('len(X) = {}'.format(len(X)))
#print(X[5][1]['entities'])

len(X) = 345


##### Remove resumes that cause errors in spacy
Some of the resumes might cause errors in spacy. We don't need to go into the details of why; suffice it to say that it has to do with whitespace and syntax in the entity annotations. If these resumes are not removed from our training data, spacy will throw an exception during training, so we need to remove them first.

We will use the remove_bad_data function below to do this. It calls train_spacy_ner with debug=True and n_iter=1, which causes spacy to process the documents one by one and to collect the documents that throw an exception as "bad docs", which are returned together with the model.

In [5]:
from spacy_train_resume_ner import train_spacy_ner

def remove_bad_data(training_data):
    model, baddocs = train_spacy_ner(training_data, debug=True, n_iter=1)
    # training data is a list of lists with each list containing text and annotations
    # baddocs is a set of strings/resume texts.
    # filter bad docs and store filter result (good docs) in filtered variable
    filtered = [data for data in training_data if data[0] not in baddocs]
    print("Unfiltered training data size: ",len(training_data))
    print("Filtered training data size: ", len(filtered))
    print("Bad data size: ", len(baddocs))
    return filtered

X_filtered = remove_bad_data(X)

Created blank 'en' model
Exception thrown when processing doc:
("Neeraj Dwivedi\nSenior Sales Executive - Kansai Nerolac Paints Ltd\n\nMumbai, Maharashtra - Email me on Indeed: indeed.com/r/Neeraj-Dwivedi/8f053ed44cdef8b2\n\nWORK EXPERIENCE\n\nSenior Sales Executive\n\nKansai Nerolac Paints Ltd -  Mumbai, Maharashtra -\n\nAugust 2017 to Present\n\nIncreased the sales from -22% to +6% through existing dealers and prospecting new dealers to\nachieve the target within 6 months of joining.\n* Maintaining excellent relations with dealers to increase revenue by 10%.\n* Management of sales team (up to 3 members)\n* Achieved monthly targets amounting to 40 - 60 lakhs monthly with timely collection.\n* Maintain accounts clarity with dealers and ensure that the credit notes reach the dealers on\ntime.\n* Business development by conducting regular meetings with influencers such as architects,\nbuilding contractors, housing societies, painters.\n* Ensuring proper visibility of the products at vari

Exception thrown when processing doc:
("Sheldon Creado\nSr. Manager - Regional Sales\n\nMumbai, Maharashtra - Email me on Indeed: indeed.com/r/Sheldon-Creado/\nb73c053d2691e84a\n\n* Result-oriented professional with experience of 15 years in Sales Planning/ Execution, Process\nImprovement and Business Development.\n* Excellent track record in performing challenging strategic & leadership roles, building\nstrategic service plans and CSAT.\n* Demonstrated effectiveness in high-profile executive roles driving large scale gains in business\nvolumes through on-ground business strategies and consistent acquisition, deepening &\nretention of customer base.\n\nWilling to relocate to: Mumbai, Maharashtra - Pune, Maharashtra - Bangalore, Karnataka\n\nWORK EXPERIENCE\n\nSr. Manager - Regional Sales\n\nTata Teleservices Ltd\n\nJob Profile:\n* Managed and developed an assigned portfolio of accounts, increasing product penetration and\nrevenue market share.\n* Successfully achieved set Business Acqu

Losses {'ner': 33393.11133236157}
Unfiltered training data size:  345
Filtered training data size:  341
Bad data size:  4


##### Number of bad docs? Size of new (filtered) training data? 
4 bad docs; 341
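
Instead of dropping the affected resumes, one could try to repair their annotations. The sketch below assumes that the errors come from entity spans that begin or end with whitespace, which is hinted at above but not confirmed here. `trim_entity_spans` is a hypothetical helper, not part of `spacy_train_resume_ner`, and the example text and offsets are made up.

```python
def trim_entity_spans(text, entities):
    """Shrink each (start, end, label) span so that it neither starts nor ends on whitespace."""
    trimmed = []
    for start, end, label in entities:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:  # drop spans that consisted of whitespace only
            trimmed.append((start, end, label))
    return trimmed

example_text = "Name: Jane Doe \nCompany: Acme"
example_ents = [(5, 15, "Name")]  # covers " Jane Doe " including the surrounding spaces
fixed_ents = trim_entity_spans(example_text, example_ents)
print(fixed_ents, repr(example_text[fixed_ents[0][0]:fixed_ents[0][1]]))
```

Whether this removes all exceptions would have to be checked by running remove_bad_data again on the repaired data.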

##### Train/Test Split
Now before we train our model, we have to split our available training data into training and test sets. Splitting our data into train and test (or holdout) datasets is a fundamental technique in machine learning, and essential to avoid the problem of overfitting.

relevant links:
* https://machinelearningmastery.com/a-simple-intuition-for-overfitting/
* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

##### What is overfitting and how does doing a train/test split help us avoid overfitting when training our models?
A model overfits when it performs significantly better on its training data than on unseen data. Reserving part of the dataset as a test set (which must not overlap with the training set) lets us evaluate the model on data it has never seen during training. This gives a more objective view of its performance and makes overfitting visible.

##### Create a train and test dataset

In [6]:
import random

def train_test_split(X, train_fraction):
    train_size = round(len(X) * train_fraction)
    shuffled_X = random.sample(X, k=len(X))
    train = shuffled_X[:train_size]
    test = shuffled_X[train_size:]
    return train,test

# populate train and test sets
train, test = train_test_split(X_filtered, 0.75)
assert len(train) + len(test) == len(X_filtered)
print('{} docs split into {} for training and {} for testing'.format(
    len(X_filtered), len(train), len(test)))

# get entity counts for a dataset (used for both the train and the test set)
def get_entity_count(dataset, entity_name):
    count = 0
    for res in dataset:
        ents = res[1]['entities']
        for ent in ents:
            if ent[2] == entity_name:
                count += 1
    return count

for entity_name in chosen_entity_labels:
    print('{}: support is {} for training, {} for testing'.format(
        entity_name,
        get_entity_count(train, entity_name),
        get_entity_count(test, entity_name)
    ))

341 docs split into 256 for training and 85 for testing
Name: support is 266 for training, 88 for testing
College Name: support is 435 for training, 135 for testing
Companies worked at: support is 1282 for training, 349 for testing


##### Train a spacy ner model with our training data

In [8]:
import os
import os.path
import spacy

model_directory = "custom_nlp"
if os.path.exists(model_directory):
    custom_nlp = spacy.load(model_directory)
else:
    custom_nlp,_= train_spacy_ner(train,n_iter=20)
    custom_nlp.to_disk(model_directory)

##### Inspect NER predictions on one sample resume
Now that we have a trained model, let's see how it works on one of our resumes. 

In [9]:
## fetch one resume out of our test dataset and store to the "resume" variable
resume = random.choice(test)
## create a spacy doc out of the resume using our trained model and save to the "doc" variable 
doc = custom_nlp(resume[0])

Now we will output the predicted entities and the existing annotated entities in that doc

In [10]:
## output label and text of predicted entities (in "ents" variable of the spacy doc created above)
print("PREDICTED:")
for ent in doc.ents:
    print('{} = {}'.format(ent.label_, ent.text))

## output labeled entities (in "entities" dictionary of resume)
print("\nLABELED:")
for ent in resume[1]['entities']:
    print('{} = {}'.format(ent[2], resume[0][ent[0]:ent[1]]))

PREDICTED:
Name = Puneet Bhandari
Companies worked at = Microsoft IT
Companies worked at = Microsoft IT
Companies worked at = Microsoft
Companies worked at = IIT Roorkee
College Name = Shri Vaishnav Institute of Technology and Science, RGPV University

LABELED:
College Name = Shri Vaishnav Institute of Technology and Science, RGPV University
College Name = IIT Roorkee
Companies worked at = Microsoft
Companies worked at = Microsoft
Companies worked at = Microsoft
Companies worked at = Microsoft
Name = Puneet Bhandari


#### Evaluation Metrics for NER
Now that we can predict entities using our trained model, we can compare our predictions with the original annotations in our data to evaluate how well our model performs for our task. The original annotations were created manually by human annotators, and represent a "Gold Standard" against which we can compare our predictions.

For most classification tasks, the most common evaluation metrics are:
* accuracy
* precision
* recall
* f1 score

In order to understand these metrics, we need to understand the following concepts:
* True positives - How many of the predicted entities are "true" according to the Gold Standard? (training annotation) 
* True negatives - How many entities did the model not predict which are actually not entities according to the Gold Standard?
* False positives - How many entities did the model predict which are NOT entities according to the Gold Standard?  
* False negatives - How many entities did the model "miss", i.e. fail to recognize although they are entities according to the Gold Standard?

Before we go on, it is important to understand true/false positives/negatives as well as the evaluation metrics above.

##### How are the evaluation metrics above defined in the context of evaluating Machine Learning models? How do they relate to True/False Positives/Negatives above?
* accuracy: fraction of correctly classified samples
  * accuracy = (tp + tn) / \#samples
* precision: measure of exactness (what fraction of the predicted positives are actually positive)
  * precision = tp / (tp + fp)
* recall: measure of completeness (what fraction of the actual positives were recognized)
  * recall = tp / (tp + fn)
* f1 score: harmonic mean of precision and recall, which is high only when both precision and recall are high
  * f1 score = 2 \* precision \* recall / (precision + recall)
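
The formulas above can be checked with a few lines of Python. This is a small sketch with made-up counts (8 true positives, 86 true negatives, 2 false positives, 4 false negatives); the function name is chosen for illustration only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) for the given counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # guard against division by zero when nothing was predicted or nothing is positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = classification_metrics(tp=8, tn=86, fp=2, fn=4)
print("accuracy={:.2f} precision={:.2f} recall={:.2f} f1={:.2f}".format(
    accuracy, precision, recall, f1))
```

Note how the large number of true negatives pushes the accuracy well above precision, recall and F1, none of which use tn at all.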

##### Calculating Metrics based on token-level annotations or full entity-level. 
The concepts above are our first step toward understanding how to evaluate our model effectively. However, in NER, we need to take into account that we can calculate our metrics either based on all tokens (words) found in the document, or only on the entities found in the document.  

##### Token-Level evaluation. 
Token-level evaluation measures how accurately the model tagged *each individual word/token* in the input. In order to understand this, we need to understand something called the "BILUO" scheme (also written BILOU; BIO is a simpler variant).

https://spacy.io/api/annotation#biluo
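
To get a feeling for the scheme: every token receives one tag, where B marks the beginning, I the inside and L the last token of a multi-token entity, U marks a single-token entity, and O marks tokens outside of any entity. The following is a simplified, pure-Python illustration that uses whitespace tokenization. It is not spacy's implementation, and it simply ignores parts of entity spans that do not align with token boundaries.

```python
import re

def simple_biluo_tags(text, entities):
    """Tag whitespace-separated tokens of text with BILUO tags for (start, end, label) entities."""
    token_spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        # indices of all tokens that lie completely inside the entity span
        inside = [i for i, (start, end) in enumerate(token_spans)
                  if start >= ent_start and end <= ent_end]
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        elif len(inside) > 1:
            tags[inside[0]] = "B-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
            tags[inside[-1]] = "L-" + label
    return tags

text = "Puneet Bhandari SAP SD lead - Microsoft IT"
entities = [(0, 15, "Name"), (30, 39, "Companies worked at")]
tags = simple_biluo_tags(text, entities)
for token, tag in zip(text.split(), tags):
    print(token, tag)
```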

##### Conversion from offsets to BILUO format
Up to now, we have not been working with the BILUO scheme, but with "offsets" (for example: (112,150,"Email") - which says there is an "Email" entity between positions 112 and 150 in the text). We would like to be able to evaluate our models on a token-level using BILUO - so we need to convert our data to BILUO. Fortunately, Spacy provides a helper method to do this for us.

In [11]:
from spacy.gold import biluo_tags_from_offsets
import pandas as pd
from IPython.display import display, HTML

## returns a pandas dataframe with tokens, prediction, and true (Gold Standard) annotations of tokens
def make_bilou_df(nlp,resume):
    """
    param nlp - a trained spacy model
    param resume - a resume from our train or test set
    """
    doc = nlp(resume[0])
    bilou_ents_predicted = biluo_tags_from_offsets(doc, [(ent.start_char,ent.end_char,ent.label_) for ent in doc.ents])
    bilou_ents_true = biluo_tags_from_offsets(doc, [(ent[0], ent[1], ent[2]) for ent in resume[1]["entities"]])

    doc_tokens = [tok.text for tok in doc]
    bilou_df = pd.DataFrame()
    bilou_df["Tokens"] = doc_tokens
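    # strip whitespace from tokens; regex=True is required because newer pandas versions default to literal matching
    bilou_df["Tokens"] = bilou_df["Tokens"].str.replace(r"\s+", "", regex=True)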
    bilou_df["Predicted"] = bilou_ents_predicted
    bilou_df["True"] = bilou_ents_true
    return bilou_df

bilou_df = make_bilou_df(custom_nlp, resume)
display(bilou_df)  


Unnamed: 0,Tokens,Predicted,True
0,Puneet,B-Name,B-Name
1,Bhandari,L-Name,L-Name
2,,O,O
3,SAP,O,O
4,SD,O,O
5,lead,O,O
6,-,O,O
7,Microsoft,B-Companies worked at,U-Companies worked at
8,IT,L-Companies worked at,O
9,,O,O


Based on this output, it should be very easy to calculate a token-level accuracy. We simply compare the "Predicted" to "True" columns and calculate what percentage are the same. 

In [12]:
same_df = bilou_df.loc[lambda df: df["Predicted"] == df["True"], :]
print('{}/{} tokens have been predicted correctly'.format(len(same_df), len(bilou_df)))
accuracy = len(same_df) / len(bilou_df)
print("Accuracy on one resume: ",accuracy)

1616/1623 tokens have been predicted correctly
Accuracy on one resume:  0.9956869993838571


In [13]:
# find all rows in bilou_df where "Predicted" not equal to "True" column. 
diff_df = bilou_df.loc[bilou_df["Predicted"] != bilou_df["True"]]
display(diff_df)

Unnamed: 0,Tokens,Predicted,True
7,Microsoft,B-Companies worked at,U-Companies worked at
8,IT,L-Companies worked at,O
34,Microsoft,B-Companies worked at,U-Companies worked at
35,IT,L-Companies worked at,O
1287,Microsoft,O,U-Companies worked at
1536,IIT,B-Companies worked at,B-College Name
1537,Roorkee,L-Companies worked at,L-College Name


Now let's calculate the accuracy on all our test resumes and average them for an accuracy score.

In [14]:
import numpy as np
doc_accuracy = []
for res in test:
    # calculate accuracy for each test resume and append to doc_accuracy list 
    bilou_df = make_bilou_df(custom_nlp, res)
    same_df = bilou_df.loc[bilou_df["Predicted"] == bilou_df["True"]]
    accuracy = len(same_df) / len(bilou_df)
    doc_accuracy.append(accuracy)

total_acc = np.mean(doc_accuracy)
print("Accuracy: ",total_acc)

Accuracy:  0.9843059593260234


##### How does the model perform on token-level accuracy? What did it miss? In those cases where the predictions didn't match the gold standard, were the predictions plausible or just "spurious" (wrong)? 
The model generally performs quite well at token level. There are some false positives and some false negatives, and in some cases the token boundaries do not match the offsets of the labeled entities (a problem with the input data).

##### What might the advantages and disadvantages be of calculating accuracy on token-level?
At token level, the majority of tokens are non-entities, so it is trivial to reach a high accuracy, and the quality of the actual entity recognition is (almost) not reflected in the high accuracy score. On the plus side, token-level accuracy is arguably easy to calculate and fits into the standard tp/tn/fp/fn schema of classification tasks.
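
The main disadvantage can be made concrete with a trivial baseline that tags every token as "O". The numbers below are invented for illustration (1000 tokens, 20 of which belong to entities).

```python
# gold tags: 980 non-entity tokens plus ten two-token names
true_tags = ["O"] * 980 + ["B-Name", "L-Name"] * 10
# a "model" that never predicts any entity
all_o_prediction = ["O"] * len(true_tags)

correct = sum(pred == true for pred, true in zip(all_o_prediction, true_tags))
accuracy = correct / len(true_tags)
entity_tokens_found = sum(
    pred == true for pred, true in zip(all_o_prediction, true_tags) if true != "O"
)
print("accuracy:", accuracy)
print("entity tokens found:", entity_tokens_found)
```

The baseline reaches 98% token-level accuracy without finding a single entity token.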

##### Entity-Level evaluation #####
Another method of evaluating the performance of our NER model is to calculate metrics not on token-level, but on entity level. There is a good blog article that describes this method. 

http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/

The article goes into some detail; the most important part is the set of scenarios described in the section "Comparing NER system output and golden standard".

##### How do the first 3 scenarios described in the section "Comparing NER system output and golden standard" correlate to  true/false positives/negatives? 
1. string and entity type match: true positive
2. entity imagined: false positive
3. entity missed: false negative
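
These three scenarios can be counted with strict span matching, where a prediction only counts as a true positive if start, end and label all match the gold annotation. The sketch below is one possible way to do this and is not taken from the article or from spacy; the offsets are hypothetical.

```python
def entity_level_counts(predicted, gold):
    """Count tp / fp / fn by exact match of (start, end, label) tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # string and entity type match
    fp = len(predicted - gold)   # predicted, but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard, but not predicted
    return tp, fp, fn

gold = [(0, 15, "Name"), (30, 39, "Companies worked at"), (100, 111, "College Name")]
pred = [(0, 15, "Name"), (30, 42, "Companies worked at"), (100, 111, "Companies worked at")]
tp, fp, fn = entity_level_counts(pred, gold)
print("tp: {} fp: {} fn: {}".format(tp, fp, fn))
```

Under this strict definition, a prediction with a wrong boundary or a wrong label counts as a false positive and the corresponding gold entity as a false negative. The article's further scenarios cover such partial matches in more detail.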

##### Precision, Recall, F1 #####

Now we would like to calculate precision, recall, and f1 for each entity type we are interested in (our chosen entities). To do this, we need to understand the formulas for each. A good article for this is https://skymind.ai/wiki/accuracy-precision-recall-f1. 

##### How can we calculate precision, recall and f1 score based on the information above?
* precision = tp / (tp + fp)
* recall = tp / (tp + fn)
* f1 = 2 * prec * rec / (prec + rec)
* tp = entities recognized as entities
* tn = non-entities recognized as non-entities
* fp = non-entities recognized as entities
* fn = entities recognized as non-entities

In [15]:
# cycle through chosen_entity_labels and calculate metrics for each entity using test data
metrics_data = []
for label in chosen_entity_labels:
    ## sums for all resumes for one entity type
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for tres in test:
        tres_df = make_bilou_df(custom_nlp, tres)
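        ## calculate true / false positives / negatives for each resume (counted per token)
        ## note: str.endswith(label) also matches longer labels with the same suffix,
        ## e.g. 'Name' also matches tags of 'College Name'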
        tp = ((tres_df['True'].str.endswith(label)) & (tres_df['Predicted'] == tres_df['True'])).sum()
        fp = ((tres_df['Predicted'].str.endswith(label)) & (tres_df['True'] == 'O')).sum()
        fn = ((tres_df['True'].str.endswith(label)) & (tres_df['Predicted'] == 'O')).sum()
        ## aggregate results
        true_positives += tp
        false_positives += fp
        false_negatives += fn
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    print("For label '{}' tp: {} fp: {} fn: {}".format(label,true_positives,false_positives,false_negatives))
    print("Precision: ",precision)
    print("Recall: ",recall)
    print("F1: ",f1)
    row = [precision,recall,f1]
    metrics_data.append(row)

For label 'Name' tp: 526 fp: 446 fn: 86
Precision:  0.5411522633744856
Recall:  0.8594771241830066
F1:  0.6641414141414141
For label 'College Name' tp: 354 fp: 446 fn: 84
Precision:  0.4425
Recall:  0.8082191780821918
F1:  0.5718901453957996
For label 'Companies worked at' tp: 834 fp: 191 fn: 116
Precision:  0.8136585365853658
Recall:  0.8778947368421053
F1:  0.8445569620253164
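
One caveat when matching tags by suffix: the label 'Name' is itself a suffix of 'College Name', so a test with `endswith('Name')` also selects 'College Name' tags. The short sketch below illustrates the difference between suffix matching and an exact comparison of the label part of a tag. The tag list is made up.

```python
tags = ["B-Name", "L-Name", "O", "B-College Name", "I-College Name", "L-College Name"]
label = "Name"

# suffix matching: also selects the 'College Name' tags
suffix_matches = [tag for tag in tags if tag.endswith(label)]
# exact matching: strip the two-character BILUO prefix ('B-', 'I-', ...) and compare
exact_matches = [tag for tag in tags if tag[2:] == label]

print(len(suffix_matches), suffix_matches)
print(len(exact_matches), exact_matches)
```

In a pandas column, the same exact comparison could be written as `df['True'].str[2:] == label`. The scores printed above were produced with suffix matching, so the numbers for 'Name' would likely change with exact matching.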


Now we compute an average score for each metric. 

In [16]:
# compute average metrics
columns = ['Precision', 'Recall', 'F1']
metrics_df = pd.DataFrame(data=metrics_data, index=chosen_entity_labels, columns=columns)
metrics_df.loc['entities averaged'] = [metrics_df[m].mean() for m in columns]
display(metrics_df)

Unnamed: 0,Precision,Recall,F1
Name,0.541152,0.859477,0.664141
College Name,0.4425,0.808219,0.57189
Companies worked at,0.813659,0.877895,0.844557
entities averaged,0.599104,0.84853,0.69353


##### How do the average metrics here (computed on entity-level) compare to the token-level accuracy score above?
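The average metrics here are macro-averaged: each entity type contributes equally to the average, regardless of how often it occurs. Rarely occurring entity types therefore have a much higher influence than with micro-averaged metrics or with the token-level accuracy, where frequent classes (above all the non-entity tag "O") dominate. Accordingly, the averaged F1 (0.69) is much lower than the token-level accuracy (0.98).

In the end, I would prefer the per-entity metrics, because they make it obvious when some entities (even rare ones) perform badly.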
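
The difference between macro- and micro-averaging can be illustrated with the per-label counts printed by the metrics cell above. In this sketch, macro-averaging takes the mean of the per-label scores, while micro-averaging first sums the counts over all labels; the variable names are chosen for illustration.

```python
# tp / fp / fn per label, copied from the output of the metrics cell
counts = {
    "Name": {"tp": 526, "fp": 446, "fn": 86},
    "College Name": {"tp": 354, "fp": 446, "fn": 84},
    "Companies worked at": {"tp": 834, "fp": 191, "fn": 116},
}

# macro: compute the metric per label, then average the per-label scores
macro_precision = sum(c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()) / len(counts)
macro_recall = sum(c["tp"] / (c["tp"] + c["fn"]) for c in counts.values()) / len(counts)

# micro: pool the counts of all labels, then compute the metric once
total_tp = sum(c["tp"] for c in counts.values())
total_fp = sum(c["fp"] for c in counts.values())
total_fn = sum(c["fn"] for c in counts.values())
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)

print("macro precision {:.4f}, recall {:.4f}".format(macro_precision, macro_recall))
print("micro precision {:.4f}, recall {:.4f}".format(micro_precision, micro_recall))
```

The macro values reproduce the "entities averaged" row above, while the micro values are pulled toward the most frequent label, 'Companies worked at'.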

##### Save BILUO training data for reuse in part 3
For part 3 we are using flair and loading our data from a .csv file into a flair "Corpus". This is described here:

https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md#reading-your-own-sequence-labeling-dataset

We need to create "train" and "test" .csv files from our train and test datasets that correspond to the format described above. In this format, each line contains (minimally) a text token and a NER tag, separated by whitespace. (This means entity names must not contain whitespace!)

##### Sentences
Flair works with "Sentences", each of which is a list of tokens. If we simply write out our csv with one line for every token in our dataset, we will have one giant sentence with many thousands of words. This is not what we want.
We would like to partition our data so that we have a list of "Sentences" that corresponds to our intuition of a sentence: a sequence of words that belong together and is not too long, usually delimited by some punctuation.
When we create our .csv strings/files, we need to make sure that they represent a list of sentences, each sentence consisting of a list of tokens/tags (each token/tag being one line in our csv).

***To do this, we create a blank newline after each sentence.*** (1 line per token, empty line between sentences)

Each csv file will be a (long) list of token/tag lines, with sentences separated by newlines. 
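
As an illustration of this layout, the sketch below parses a small hand-written sample back into sentences. It assumes tab-separated columns in the order document index, token, predicted tag, gold tag; the sample rows and the helper `read_column_sentences` are made up and are not part of flair.

```python
sample = (
    "0\tPuneet\tB-Name\tB-Name\n"
    "0\tBhandari\tL-Name\tL-Name\n"
    "\n"  # blank line = sentence boundary
    "0\tMicrosoft\tB-Companies_worked_at\tU-Companies_worked_at\n"
    "\n"
)

def read_column_sentences(data, column_delimiter="\t"):
    """Split column-formatted text into sentences (lists of rows) at blank lines."""
    sentences, current = [], []
    for line in data.split("\n"):
        if not line.strip():
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        current.append(line.split(column_delimiter))
    if current:
        sentences.append(current)
    return sentences

sentences = read_column_sentences(sample)
print(len(sentences), "sentences")
print([row[1] for row in sentences[0]])
```

Note that the entity label is written as `Companies_worked_at`, since a label containing whitespace would break a whitespace-separated column format.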

In [17]:
import io

def export_bilou_df(df_idx, df, filehandle, sentence_chars='.*;', column_delimiter='\t', row_delimiter='\n'):
    sentence_counter = 1
    # rename entities (must not contain whitespace)
    for col in ['Predicted', 'True']:
        df[col] = df[col].str.replace(' ', '_', regex=False)
    for i, row in enumerate(df.itertuples(index=False, name=None)):
        token = row[0]
        true_bilou_tag = row[2]
        if len(token) == 0 or token.isspace():
            continue  # ignore newlines or whitespace
        filehandle.write(str(df_idx))
        filehandle.write(column_delimiter)
        filehandle.write(column_delimiter.join(row))
        filehandle.write(row_delimiter)
        # was the token an end-of-sentence marker?
        # and not the last token of the df
        # and not with BILOU tag B-... or I-... (happens in abbreviations inside names)
        # ^- commented out; reason: would leak information about gold standard to model
        if token in sentence_chars and i < len(df) - 1:  # and not true_bilou_tag.startswith(('B-', 'I-')):
            sentence_counter += 1
            # insert a blank line
            filehandle.write(row_delimiter)
    # write a final blank line (to enforce sentence boundary not crossing df boundary)
    filehandle.write(row_delimiter)
    # print('df {}: {} sentences'.format('?', sentence_counter))
    return sentence_counter

def bilou_for_flair(nlp, train, test):
    """
    make .csv strings from train and test for use in flair
    """    
    print("Make bilou dfs")
    # makes a list of pandas dataframes, one for each resume. 
    training_data_as_bilou = [make_bilou_df(nlp, res) for res in train]
    test_data_as_bilou = [make_bilou_df(nlp, res) for res in test]
    print("Done!")
    # strings to return
    training_file = io.StringIO()
    test_file = io.StringIO()
    sentence_counter = 0
    for df_idx, df in enumerate(training_data_as_bilou):
        sentence_counter += export_bilou_df(df_idx, df, training_file)
    for df_idx, df in enumerate(test_data_as_bilou):
        sentence_counter += export_bilou_df(df_idx, df, test_file)
    print('{} sentences'.format(sentence_counter))
    return training_file.getvalue(), test_file.getvalue()

training_str, test_str = bilou_for_flair(custom_nlp, train, test)
with open("flair/train_res_bilou.txt", 'w+', encoding="utf-8") as f:
    f.write(training_str)
with open("flair/test_res_bilou.txt",'w+',encoding="utf-8") as f:
    f.write(test_str)

Make bilou dfs
Done!
16521 sentences


Now let's load the data we persisted with flair before we go on

In [18]:
# requires flair 0.4.2
from flair.datasets import Corpus
from flair.data_fetcher import NLPTaskDataFetcher

# folder where training and test data are
data_folder = 'flair'
# your training file name
train_file = 'train_res_bilou.txt'
# your test file name
test_file = 'test_res_bilou.txt'

columns = {1:'text', 3:'ner'}

## Now load our csv into flair corpus
corpus: Corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder, columns, train_file=train_file, test_file=test_file, dev_file=None
)
print(corpus)

2019-06-19 02:43:29,248 Reading data from flair
2019-06-19 02:43:29,250 Train: flair/train_res_bilou.txt
2019-06-19 02:43:29,251 Dev: None
2019-06-19 02:43:29,252 Test: flair/test_res_bilou.txt


Corpus: 11116 train + 1235 dev + 4170 test sentences


If you could load the corpus without error, you are ready to go on to part 3, where we will work with flair nlp! 