Install spaCy and Flair if they are not installed yet. Then load the required libraries and change into the working folder on Google Drive.

In [3]:
! pip install -U spacy
! pip install flair

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.1.4)
Collecting flair
  Using cached https://files.pythonhosted.org/packages/4e/3a/2e777f65a71c1eaa259df44c44e39d7071ba8c7780a1564316a38bf86449/flair-0.4.2-py3-none-any.whl
Collecting pytorch-pretrained-bert>=0.6.1 (from flair)
  Using cached https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl
Collecting segtok>=1.5.7 (from flair)
  Using cached https://files.pythonhosted.org/packages/1d/59/6ed78856ab99d2da04084b59e7da797972baa0efecb71546b16d48e49d9b/segtok-1.5.7.tar.gz
Collecting regex (from flair)
  Using cached https://files.pythonhosted.org/packages/6f/4e/1b178c38c9a1a184288f72065a65ca01f3154df43c6ad898624149b8b4e0/regex-2019.06.08.tar.gz
Collecting mpld3==0.3 (from flair)
Collecting bpemb>=0.2.9 (from flair)
  Using cached https://files.pythonhosted.org/packages/bc/70/468a9652095b370f797ed37ff77e742

In [4]:
import json
from google.colab import drive
import os
import spacy
from spacy.gold import biluo_tags_from_offsets
import pandas as pd
import re
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.trainers import ModelTrainer
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from flair.visual.training_curves import Plotter
from flair.models import SequenceTagger

drive.mount('/content/gdrive')
root = "/content/gdrive/My Drive/FAU/SAKI.A2"
os.chdir(root) 

from spacy_train_resume_ner import train_spacy_ner

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


# Preprocessing

Load the dataset, save the content in all_resumes.

In [0]:
dataset_path = "data/Entity Recognition in Resumes.json"
with open(dataset_path,encoding="utf8") as f:
    lines = f.readlines()
    
all_resumes = []

for line in lines:
    all_resumes.append(json.loads(line))


Convert the data to spaCy's training format and drop resumes that have no entities. After this step the data is ready for use with spaCy.

In [0]:
def convert_data(data):
    text = data['content']
    entities = []
    if data['annotation'] is not None:
        for annotation in data['annotation']:
            point = annotation['points'][0]
            labels = annotation['label']
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                entities.append((point['start'], point['end'] + 1, label))
    return (text, {"entities": entities})
   
converted_resumes = [convert_data(res) for res in all_resumes]
converted_resumes = [res for res in converted_resumes if len(res[1]["entities"]) > 0]
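The `+ 1` in convert_data is needed because the dataset stores inclusive end offsets, while spaCy expects exclusive ones. A minimal sketch of that offset conversion follows; the record and its values are invented for illustration and are not taken from the dataset.

```python
# Hypothetical record in the dataset's annotation layout (values invented).
record = {
    "content": "Skills: Python, SQL",
    "annotation": [
        {"label": ["Skills"], "points": [{"start": 8, "end": 13}]},
    ],
}

point = record["annotation"][0]["points"][0]

# The dataset's 'end' is inclusive; Python slices and spaCy use an exclusive end.
start, end_exclusive = point["start"], point["end"] + 1
span_text = record["content"][start:end_exclusive]
print(span_text)
```

Without the `+ 1`, the slice would drop the last character of every entity.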

The data is now in spaCy format. We train with Flair, but spaCy has a helper function that converts character offsets into BILOU tags, a format Flair can read. Instead of rewriting that conversion, we use spaCy only for this step.

We choose the entity labels 'College Name', 'Degree' and 'Skills' and keep only resumes that contain all three.

In [7]:
chosen_entity_labels = ["College Name", "Degree", "Skills"]

def gather_candidates(dataset,entity_labels):
    candidates = list()
    for resume in dataset:
        res_ent_labels = list(zip(*resume[1]["entities"]))[2]
        if set(entity_labels).issubset(res_ent_labels):
            candidates.append(resume)
    return candidates

training_data = gather_candidates(converted_resumes, chosen_entity_labels)
print("Gathered {} training examples".format(len(training_data)))

Gathered 380 training examples
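The selection in gather_candidates rests on two idioms: `zip(*entities)` transposes the list of (start, end, label) tuples, and `issubset` checks that every chosen label occurs at least once. A small sketch with invented entity tuples:

```python
wanted = ["College Name", "Degree", "Skills"]

# Invented entity tuples in the (start, end, label) layout used above.
complete = [(0, 12, "Name"), (20, 35, "College Name"), (40, 43, "Degree"), (50, 56, "Skills")]
incomplete = [(0, 12, "Name"), (50, 56, "Skills")]

def has_all_labels(entities, labels):
    # zip(*entities) yields three tuples: starts, ends, labels.
    labels_present = list(zip(*entities))[2]
    return set(labels).issubset(labels_present)

print(has_all_labels(complete, wanted), has_all_labels(incomplete, wanted))
```

Only resumes like the first one, which carry all three labels, end up in the training data.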


Now that we have our training data, we remove all annotations except the chosen entity labels, so that the model only learns these entities. The result is stored in X.

In [0]:
# Keep only the annotations whose label is one of the chosen entity labels.
for resume in training_data:
    resume[1]["entities"] = [
        ent for ent in resume[1]["entities"] if ent[2] in chosen_entity_labels
    ]

X = training_data

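Some documents contain data that spaCy cannot process. remove_bad_data is meant to filter these out by running one spaCy training iteration in debug mode. Note that, as written, the function returns the data unchanged because of the early return on its first line, so the filter is effectively disabled.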

In [0]:
def remove_bad_data(training_data):
    return training_data
    model, baddocs = train_spacy_ner(training_data, debug=True, n_iter=1)
    filtered = [data for data in training_data if data[0] not in baddocs]
    print("Unfiltered training data size: ",len(training_data))
    print("Filtered training data size: ", len(filtered))
    print("Bad data size: ", len(baddocs))
    return filtered

X = remove_bad_data(X)

Now, we split the dataset into train and test datasets. We use 80% training data and 20% test data.

In [0]:
def train_test_split(X,train_percent):
    train_size = int((len(X) / 100) * train_percent)
    train = X[0:train_size]
    test = X[train_size:]
    return train,test
  
train, test = train_test_split(X, 80)
assert (len(train) + len(test)) == len(X)    
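With the 380 gathered examples, an 80/20 split yields 304 training and 76 test resumes. The sketch below is a hypothetical variant (split_by_percent is not part of the notebook) that computes the split index with integer arithmetic, so no floating-point rounding is involved. Like the function above, it splits by position and does not shuffle.

```python
def split_by_percent(items, train_percent):
    # Integer arithmetic keeps the split index exact.
    train_size = len(items) * train_percent // 100
    return items[:train_size], items[train_size:]

examples = list(range(380))  # stand-in for the 380 gathered resumes
train_part, test_part = split_by_percent(examples, 80)
print(len(train_part), len(test_part))
```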

We can now train the spaCy model. Wait, train? Didn't we want to train and test with Flair? Yes, but the BILOU export helper needs a spaCy model, so we train spaCy for a single epoch. That is enough for this purpose.

In [11]:
nlp,_ = train_spacy_ner(train,n_iter=1)

Created blank 'en' model
Losses {'ner': 40290.960953430884}


Now that we have a trained model, we can generate the BILOU format using pandas DataFrames.

In [0]:
def make_bilou_df(nlp,resume):
    doc = nlp(resume[0])
    bilou_ents_predicted = biluo_tags_from_offsets(doc, [(ent.start_char,ent.end_char,ent.label_)for ent in doc.ents])
    bilou_ents_true = biluo_tags_from_offsets(doc, [(ent[0], ent[1], ent[2]) for ent in resume[1]["entities"]])

    doc_tokens = [tok.text for tok in doc]
    bilou_df = pd.DataFrame()
    bilou_df["Tokens"] = doc_tokens
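    # regex=True is stated explicitly because newer pandas versions treat the pattern literally by default
    bilou_df["Tokens"] = bilou_df["Tokens"].str.replace("\\s+", "", regex=True)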
    bilou_df["Predicted"] = bilou_ents_predicted
    bilou_df["True"] = bilou_ents_true
    return bilou_df

training_data_as_bilou = [make_bilou_df(nlp,res) for res in train]
test_data_as_bilou = [make_bilou_df(nlp,res) for res in test]

training_df = pd.DataFrame(columns = ["text","ner","doc","ner_spacy"])
test_df = pd.DataFrame(columns = ["text","ner","doc","ner_spacy"])
for idx,df in enumerate(training_data_as_bilou):
    df2 = pd.DataFrame()
    df2["text"] = df["Tokens"]
    df2["ner"] = df["True"]
    df2["ner_spacy"]=df["Predicted"]
    df2["doc"]=idx
    training_df = training_df.append(df2, sort=True)
for idx,df in enumerate(test_data_as_bilou):
    df2 = pd.DataFrame()
    df2["text"] = df["Tokens"]
    df2["ner"] = df["True"]
    df2["ner_spacy"]=df["Predicted"]
    df2["doc"]=idx
    test_df = test_df.append(df2, sort=True)

with open("data/flair/train_res_bilou.txt",'w+',encoding="utf-8") as f:
    training_df.to_csv(f,sep=" ",encoding="utf-8",index=False)
with open("data/flair/test_res_bilou.txt",'w+',encoding="utf-8") as f:
    test_df.to_csv(f,sep=" ",encoding="utf-8",index=False)
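For readers unfamiliar with the scheme: BILOU marks the Beginning, Inside and Last token of a multi-token entity, a Unit-length (single-token) entity, and tokens Outside any entity. The following is a minimal pure-Python sketch of that encoding on token indices. It is an illustration only, not spaCy's implementation; the function name bilou_tags and the example sentence are made up.

```python
def bilou_tags(n_tokens, spans):
    """spans: list of (first_token, last_token, label), last_token inclusive."""
    tags = ["O"] * n_tokens
    for first, last, label in spans:
        if first == last:
            tags[first] = "U-" + label          # single-token entity
        else:
            tags[first] = "B-" + label          # first token of the entity
            for i in range(first + 1, last):
                tags[i] = "I-" + label          # tokens strictly inside
            tags[last] = "L-" + label           # last token of the entity
    return tags

tokens = ["Studied", "at", "Stanford", "University", ",", "knows", "Python"]
tags = bilou_tags(len(tokens), [(2, 3, "College Name"), (6, 6, "Skills")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```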

**Manipulating the BILOU**

We have several problems with the generated BILOU file. First, Flair expects sentences separated by an empty line. Our file has no empty lines, so it would be read as one big sentence, and we have to split it. Second, skill lists whose items are separated by various characters are annotated as one big skill. For example, the following is annotated as one skill when it should be three:
1) SAP 2) Photoshop 3) Office 

We try to restructure the BILOU dataset so that it fits these needs. For example, we have the following lines:



```
69 B-Skills O 1
69 I-Skills O )
69 I-Skills O SAP
69 I-Skills O 2
69 I-Skills O )
69 I-Skills O Photoshop
69 I-Skills O )
69 I-Skills O 3
69 L-Skills O Office
```

We want to convert them to the following, so that SAP, Photoshop and Office are each tagged as a unit:


```
69 O O 1
69 O O )
69 U-Skills O SAP
69 O O 2
69 O O )
69 U-Skills O Photoshop
69 O O )
69 O O 3
69 U-Skills O Office
```

Similarly, the conversion has to handle skills that span several tokens. For example:

```
69 B-Skills O 1
69 I-Skills O )
69 I-Skills O SAP
69 I-Skills O ABAP
69 I-Skills O 2
69 I-Skills O )
69 I-Skills O Adobe
69 I-Skills O Photoshop
69 I-Skills O CS5
69 I-Skills O )
69 I-Skills O 3
69 I-Skills O Microsoft
69 L-Skills O Office
```

After conversion:

```
69 O O 1
69 O O )
69 B-Skills O SAP
69 L-Skills O ABAP
69 O O 2
69 O O )
69 B-Skills O Adobe
69 I-Skills O Photoshop
69 L-Skills O CS5
69 O O )
69 O O 3
69 B-Skills O Microsoft
69 L-Skills O Office
```
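The transformation illustrated above can be sketched on plain (token, tag) lists: an entity span is cut at every token that contains no letters, and each remaining piece is re-tagged in BILOU. This is an independent, simplified illustration with a made-up function name (resegment); it is not the notebook's fix_bilou_format, and it assumes well-formed input tags.

```python
import re

def resegment(tokens, tags):
    """Split entity spans at letter-free tokens and return new BILOU tags."""
    new_tags = ["O"] * len(tokens)
    pieces, piece = [], []  # each piece is a list of consecutive token indices
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        inside = tag != "O" and re.search("[a-zA-Z]", token) is not None
        # A piece ends at a non-entity token, a letter-free token,
        # the start of a new span, or a change of label.
        if piece and (not inside or tag[0] in "BU" or tag[2:] != tags[piece[-1]][2:]):
            pieces.append(piece)
            piece = []
        if inside:
            piece.append(i)
    if piece:
        pieces.append(piece)
    for piece in pieces:
        label = tags[piece[0]][2:]
        if len(piece) == 1:
            new_tags[piece[0]] = "U-" + label
        else:
            new_tags[piece[0]] = "B-" + label
            for i in piece[1:-1]:
                new_tags[i] = "I-" + label
            new_tags[piece[-1]] = "L-" + label
    return new_tags

tokens = ["1", ")", "SAP", "ABAP", "2", ")", "Adobe", "Photoshop", "CS5",
          ")", "3", "Microsoft", "Office"]
tags = ["B-Skills"] + ["I-Skills"] * 11 + ["L-Skills"]
new_tags = resegment(tokens, tags)
for token, tag in zip(tokens, new_tags):
    print(tag, token)
```

Working on parsed lists rather than on raw lines keeps the tagging logic separate from the file format.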

In [0]:
# Taken from https://stackoverflow.com/a/1884277
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start
    
def get_number_of_training_sample(line):
    return line[0:find_nth(line, ' ', 1)]
  
def get_token_of_training_sample(line):
    return line[find_nth(line, ' ', 3)+1:]
  
def get_label_of_training_sample(line):
    if line.find(" O O ") != -1:
        return None
    first_space = find_nth(line, ' ', 1)
    second_space = find_nth(line, ' ', 2)
    return line[first_space+3:second_space]
  
def get_bilou_type_of_training_sample(line):
    first_space = find_nth(line, ' ', 1)
    return line[first_space + 1:first_space + 2]
  
def replace_bilou_type(line, new_bilou_character):
    first_dash = line.find('-')
    return line[0:first_dash - 1] + new_bilou_character + "-" + line[first_dash + 1:]

def fix_bilou_format(content):
    new_content = []
    lineNr = 0
    previous_training_sample = 0
    for line in content:
        if lineNr == 0:
            lineNr = lineNr + 1
            continue
        current_training_sample = get_number_of_training_sample(line)

        token = get_token_of_training_sample(line)
        label = get_label_of_training_sample(line)
        bilou_type = get_bilou_type_of_training_sample(line)

        if bilou_type == "U":
            new_content.append(line)
            lineNr = lineNr + 1
            continue

        token_has_characters = re.search("[a-zA-Z]", token) 
        if lineNr == 1:
            previous_bilou_type = "O"
        else:
            previous_bilou_type = get_bilou_type_of_training_sample(new_content[lineNr - 2])

        if token_has_characters is None:
            # if token has no character and the previous one is a B token,
            # we have to set it to a U token
            # if token has no character and the previous one is a I token,
            # we have to set it to a L token
            # also, since it has no characters, we say its an O token 
            if previous_bilou_type == "B":
                new_content[lineNr - 2] = new_content[lineNr - 2].replace("B-", "U-")
            elif previous_bilou_type == "I":
                new_content[lineNr - 2] = new_content[lineNr - 2].replace("I-", "L-")
            new_content.append(line.replace(bilou_type, "O"))
        else:
            # if token has characters, the bilou type is not O and the previous one is a O, U or L token,
            # we have to set it to a B token
            # if token has characters, the bilou type is not O and the previous one is a B or I token,
            # we have to set it to a I token

            if bilou_type == "O":
                new_content.append(line)
            else:
                if previous_bilou_type == "O" or previous_bilou_type == "U" or previous_bilou_type == "L":
                    new_string = replace_bilou_type(line, "B")
                    new_content.append(new_string)
                elif previous_bilou_type == "B" or previous_bilou_type == "I":
                    new_string = replace_bilou_type(line, "I")
                    new_content.append(new_string)
                else:
                    new_content.append(line)

        lineNr = lineNr + 1
    pre_last_bilou_token = get_bilou_type_of_training_sample(new_content[len(new_content)-2])
    last_bilou_token = get_bilou_type_of_training_sample(new_content[len(new_content)-1])
    
    if last_bilou_token != "O":
        if pre_last_bilou_token == "B" or pre_last_bilou_token == "I":
            new_content[len(new_content)-1] = new_content[len(new_content)-1].replace("B-", "L-")
            new_content[len(new_content)-1] = new_content[len(new_content)-1].replace("I-", "L-")
        else:
            new_content[len(new_content)-1] = new_content[len(new_content)-1].replace("I-", "U-")
            new_content[len(new_content)-1] = new_content[len(new_content)-1].replace("B-", "U-")
    return new_content
  
def make_sentences(content, regex_expressions):
    new_content = []
    for line in content:
        empty_line_appended = False
        token = get_token_of_training_sample(line)
        for regex_expression in regex_expressions:
            if re.search(regex_expression, token) is not None:
                new_content.append(line)
                new_content.append('\n')
                empty_line_appended = True
                break
        if empty_line_appended == True:
            continue
        new_content.append(line)            
    return new_content

In [0]:
with open("data/flair/train_res_bilou.txt") as f:
    train_content = f.readlines()
    
with open("data/flair/test_res_bilou.txt") as f:
    test_content = f.readlines()

fixed_bilou_train = fix_bilou_format(train_content)
fixed_bilou_test = fix_bilou_format(test_content)

# We skip the header, that's why we add 1
assert len(fixed_bilou_train) + 1 == len(train_content)
assert len(fixed_bilou_test) + 1 == len(test_content)

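The data is now (almost) preprocessed and fixed. The following cell runs fix_bilou_format on two small examples to illustrate the result. Note that tokens without letters come out tagged as O-Skills rather than the plain O of the target format, because only the BILOU prefix character of the line is replaced.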

In [15]:
test_content_bilou = [
    'doc ner ner_spacy text',
    '69 B-Skills O 1',
    '69 I-Skills O )',
    '69 I-Skills O SAP',
    '69 I-Skills O ABAP',
    '69 I-Skills O 2',
    '69 I-Skills O )',
    '69 I-Skills O Adobe',
    '69 I-Skills O Photoshop',
    '69 I-Skills O CS5',
    '69 I-Skills O )',
    '69 I-Skills O 3',
    '69 I-Skills O Microsoft',
    '69 L-Skills O Office',
]

test_content_bilou_u_token = [
    'doc ner ner_spacy text',
    '69 B-Skills O 1',
    '69 I-Skills O )',
    '69 I-Skills O SAP',
    '69 I-Skills O 2',
    '69 I-Skills O )',
    '69 I-Skills O Adobe',
    '69 I-Skills O )',
    '69 I-Skills O 3',
    '69 L-Skills O Microsoft',
]

test_content_bilou_fixed = fix_bilou_format(test_content_bilou)
test_content_bilou_u_token_fixed = fix_bilou_format(test_content_bilou_u_token)
for t in test_content_bilou_fixed:
    print(t)
    
print("\n")
for t in test_content_bilou_u_token_fixed:
    print(t)

69 O-Skills O 1
69 O-Skills O )
69 B-Skills O SAP
69 L-Skills O ABAP
69 O-Skills O 2
69 O-Skills O )
69 B-Skills O Adobe
69 I-Skills O Photoshop
69 L-Skills O CS5
69 O-Skills O )
69 O-Skills O 3
69 B-Skills O Microsoft
69 L-Skills O Office


69 O-Skills O 1
69 O-Skills O )
69 U-Skills O SAP
69 O-Skills O 2
69 O-Skills O )
69 U-Skills O Adobe
69 O-Skills O )
69 O-Skills O 3
69 U-Skills O Microsoft


Now we want to split our data into sentences. The make_sentences function takes the content and a list of regular expressions. If one of the regular expressions matches the token of a line, we add an empty line after it to end the sentence. In other words, the regular expressions act as delimiters: each occurrence of a matching token marks a sentence boundary.

In [0]:
# The first regular expression matches a token that is a single period.
# A raw string avoids an invalid escape sequence for "\.".
regular_expressions = [r"^\.$"]
ready_train_set = make_sentences(fixed_bilou_train, regular_expressions)
ready_test_set = make_sentences(fixed_bilou_test, regular_expressions)

with open('/content/gdrive/My Drive/FAU/SAKI.A2/data/flair/train_res_bilou_preprocessed.txt', 'w') as f:
    for item in ready_train_set:
        f.write("%s" % item)
        
with open('/content/gdrive/My Drive/FAU/SAKI.A2/data/flair/test_res_bilou_preprocessed.txt', 'w') as f:
    for item in ready_test_set:
        f.write("%s" % item)
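The delimiter idea behind make_sentences can also be sketched on a plain token list. This is a simplified illustration with a hypothetical helper (split_sentences) and invented tokens; it returns lists of tokens instead of inserting empty lines into file content.

```python
import re

def split_sentences(tokens, delimiter_patterns):
    """Close the current sentence after every token matching a delimiter pattern."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if any(re.search(pattern, token) for pattern in delimiter_patterns):
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that were not closed by a delimiter
        sentences.append(current)
    return sentences

sentences = split_sentences(
    ["Knows", "SAP", ".", "Uses", "Office", ".", "Likes", "U.S."],
    [r"^\.$"],
)
print(sentences)
```

Because the pattern is anchored with `^` and `$`, a token such as `U.S.` does not end a sentence; only a token that is exactly one period does.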

# Training

The data is finally preprocessed and we can use it to train our Flair Model. First, we read the data into a corpus and specify the relevant columns.

In [17]:
columns = {1: 'ner', 3: 'text' }
tag_type = 'ner'
data_folder = '/content/gdrive/My Drive/FAU/SAKI.A2/data/flair'

corpus: Corpus = ColumnCorpus(data_folder, columns,
                                     train_file='train_res_bilou_preprocessed.txt',
                                     test_file='test_res_bilou_preprocessed.txt',
                                     dev_file=None)

print(len(corpus.train))
print(len(corpus.dev))
print(len(corpus.test))
  
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

2019-06-18 18:50:20,955 Reading data from /content/gdrive/My Drive/FAU/SAKI.A2/data/flair
2019-06-18 18:50:20,957 Train: /content/gdrive/My Drive/FAU/SAKI.A2/data/flair/train_res_bilou_preprocessed.txt
2019-06-18 18:50:20,959 Dev: None
2019-06-18 18:50:20,961 Test: /content/gdrive/My Drive/FAU/SAKI.A2/data/flair/test_res_bilou_preprocessed.txt
6268
696
2027
[b'<unk>', b'O', b'O-Skills', b'B-Skills', b'L-Skills', b'U-Degree', b'O-Degree', b'U-Skills', b'B-Degree', b'I-Degree', b'L-Degree', b'"B-College', b'"L-College', b'I-Skills', b'"I-College', b'-', b'<START>', b'<STOP>']
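To make the column mapping concrete: with `{1: 'ner', 3: 'text'}`, the second whitespace-separated field of each line is read as the NER tag and the fourth as the token, while the doc and ner_spacy columns are ignored. Empty lines separate sentences. The sketch below imitates that reading in plain Python as an illustration of the file layout only; it is not Flair's reader, and parse_columns and the sample lines are made up.

```python
def parse_columns(lines, column_map):
    """Group column-formatted lines into sentences of {field_name: value} dicts."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append({name: fields[idx] for idx, name in column_map.items()})
    if current:
        sentences.append(current)
    return sentences

sample = [
    "0 O O Knows",
    "0 U-Skills O SAP",
    "0 O O .",
    "",
    "0 O O Uses",
    "0 U-Skills O Office",
]
parsed = parse_columns(sample, {1: "ner", 3: "text"})
print(len(parsed), parsed[0][1])
```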


Now we can specify the embeddings we want to use. Embeddings can be combined with the StackedEmbeddings class. Flair recommends combining GloVe word embeddings with its pre-trained Flair embeddings; this combination also worked best for this dataset, so we use it.

In [18]:
stacked_embeddings = StackedEmbeddings([
                                        WordEmbeddings('glove'), 
                                        FlairEmbeddings('news-forward'), 
                                        FlairEmbeddings('news-backward')
                                       ])

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=stacked_embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

2019-06-18 18:50:25,477 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy not found in cache, downloading to /tmp/tmpanau6egz


100%|██████████| 160000128/160000128 [00:07<00:00, 21044552.90B/s]

2019-06-18 18:50:33,684 copying /tmp/tmpanau6egz to cache at /root/.flair/embeddings/glove.gensim.vectors.npy





2019-06-18 18:50:33,925 removing temp file /tmp/tmpanau6egz
2019-06-18 18:50:34,480 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim not found in cache, downloading to /tmp/tmpd6pabjj3


100%|██████████| 21494764/21494764 [00:01<00:00, 13535940.18B/s]

2019-06-18 18:50:36,511 copying /tmp/tmpd6pabjj3 to cache at /root/.flair/embeddings/glove.gensim
2019-06-18 18:50:36,534 removing temp file /tmp/tmpd6pabjj3





2019-06-18 18:50:38,506 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-forward--h2048-l1-d0.05-lr30-0.25-20/news-forward-0.4.1.pt not found in cache, downloading to /tmp/tmpr3i07_iu


100%|██████████| 73034624/73034624 [00:03<00:00, 18805839.82B/s]

2019-06-18 18:50:42,865 copying /tmp/tmpr3i07_iu to cache at /root/.flair/embeddings/news-forward-0.4.1.pt
2019-06-18 18:50:42,935 removing temp file /tmp/tmpr3i07_iu





2019-06-18 18:50:51,134 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-backward--h2048-l1-d0.05-lr30-0.25-20/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmp6jhj4i6o


100%|██████████| 73034575/73034575 [00:03<00:00, 20587480.29B/s]

2019-06-18 18:50:55,153 copying /tmp/tmp6jhj4i6o to cache at /root/.flair/embeddings/news-backward-0.4.1.pt
2019-06-18 18:50:55,234 removing temp file /tmp/tmp6jhj4i6o
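Conceptually, stacking embeddings means concatenating the vectors that each embedding produces for a token. The sketch below shows only that concatenation with placeholder vectors. The sizes are assumptions: 100 for the GloVe vectors and 2048 for each Flair language model (suggested by `h2048` in the downloaded file names). The helper stack is made up and is not part of Flair.

```python
def stack(*vectors):
    """Concatenate per-token vectors into a single feature vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked

glove_vec = [0.0] * 100       # assumed GloVe vector size
forward_vec = [0.0] * 2048    # assumed size, from "h2048" in the model file name
backward_vec = [0.0] * 2048   # assumed size, same as the forward model

token_vec = stack(glove_vec, forward_vec, backward_vec)
print(len(token_vec))
```

Under these assumptions, each token is represented by the sum of the three sizes, and this combined vector is what the sequence tagger receives as input.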





Let's train the model now and print out the results!

In [0]:
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/resume-ner-40',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=40)

plotter = Plotter()
plotter.plot_training_curves('resources/taggers/resume-ner-40/loss.tsv')
plotter.plot_weights('resources/taggers/resume-ner-40/weights.txt')


2019-06-18 18:50:56,660 ----------------------------------------------------------------------------------------------------
2019-06-18 18:50:56,665 Evaluation method: MICRO_F1_SCORE
2019-06-18 18:50:57,460 ----------------------------------------------------------------------------------------------------
2019-06-18 18:50:59,298 epoch 1 - iter 0/196 - loss 92.63227844
2019-06-18 18:51:38,020 epoch 1 - iter 19/196 - loss 24.16082734
2019-06-18 18:52:04,713 epoch 1 - iter 38/196 - loss 17.20774236
2019-06-18 18:52:33,124 epoch 1 - iter 57/196 - loss 14.76257630
2019-06-18 18:53:07,591 epoch 1 - iter 76/196 - loss 12.56564394
2019-06-18 18:53:40,299 epoch 1 - iter 95/196 - loss 11.11060275
2019-06-18 18:54:10,788 epoch 1 - iter 114/196 - loss 10.36144746
