# Model Training and Evaluation
## Import packages

In [1]:
from spacy.tokens import DocBin
import spacy
import json
import random

## Convert the data from JSON format to spaCy data format
### Load JSON data

In [2]:
with open("../data/ad_data_labeled.json", "r", encoding = "utf-8") as f:
    data = json.load(f)

In [3]:
data[0]

['If you have a passion for learning new technologies, want to deliver real client impact and work with like-minded technologists, Hypetap is the place to grow your career and have fun in the process!',
 {'entities': [[26, 34, 'SKILL'], [114, 127, 'SKILL']]}]
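
Each record pairs a text with entity annotations given as [start, end, label] character offsets. As a quick sanity check, the offsets can be sliced straight out of the text. This is a minimal sketch using the record shown above; the helper name entity_texts is hypothetical and not part of the original notebook.

```python
# One labeled record, copied from the output of data[0] above.
record = [
    "If you have a passion for learning new technologies, want to deliver "
    "real client impact and work with like-minded technologists, Hypetap is "
    "the place to grow your career and have fun in the process!",
    {"entities": [[26, 34, "SKILL"], [114, 127, "SKILL"]]},
]


def entity_texts(record):
    """Return (surface text, label) pairs for one record (hypothetical helper)."""
    text, annotations = record
    # Offsets are character positions; the end offset is exclusive, like a Python slice.
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for surface, label in entity_texts(record):
    print(surface, label)
```

For this record the two spans resolve to "learning" and "technologists". Printing the surface text this way makes it easy to spot annotations that look questionable before training.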

### Convert data
First, I shuffle the processed data and split it into three parts (training, testing and validation) with a 4:1:1 ratio. I then convert each part to a .spacy file, the binary training data format introduced in spaCy v3.

In [4]:
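def convert_data(data, path):
    # A blank English pipeline is enough here: only the tokenizer is needed.
    nlp = spacy.blank('en')
    db = DocBin()
    for text, annotations in data:
        doc = nlp(text)
        ents = []
        for start, end, label in annotations['entities']:
            # char_span returns None when the offsets do not align with token boundaries.
            span = doc.char_span(start, end, label=label)
            if span is not None:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    # Serialize all annotated docs to a single .spacy file.
    db.to_disk(path)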

training_size = round(len(data)*2/3)
test_size = round((len(data) - training_size)/2)
random.shuffle(data)
TRAIN_DATA = data[0: training_size]
TEST_DATA = data[training_size:training_size+test_size]
VALID_DATA = data[training_size+test_size:]

convert_data(TRAIN_DATA, "../data/train.spacy")
convert_data(TEST_DATA, "../data/test.spacy")
convert_data(VALID_DATA, "../data/dev.spacy")
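
The 4:1:1 ratio follows from the arithmetic above: two thirds of the records go to training, and the remainder is halved between test and validation. The sketch below wraps that logic in a function so it can be checked on toy data. The function name split_data and the seed parameter are assumptions added for illustration; the original notebook shuffles the list in place without a fixed seed.

```python
import random


def split_data(data, seed=0):
    """Shuffle a copy of data and split it roughly 4:1:1 (hypothetical helper)."""
    items = list(data)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(items)   # seeded shuffle for a repeatable split
    training_size = round(len(items) * 2 / 3)
    test_size = round((len(items) - training_size) / 2)
    train = items[:training_size]
    test = items[training_size:training_size + test_size]
    valid = items[training_size + test_size:]
    return train, test, valid


train, test, valid = split_data(range(12))
print(len(train), len(test), len(valid))
```

With 12 items the three parts have 8, 2 and 2 records. Fixing the seed keeps the test set identical across notebook runs, which matters when comparing evaluation scores between training runs.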


## Train the model
spaCy v3 trains models from the command line rather than from a Python script. More information can be found [here](https://spacy.io/usage/training#quickstart).

In [5]:
!python -m spacy train config.cfg --output ../model/NER_spacy_v3

ℹ Saving to output directory: ../model/NER_spacy_v3
ℹ Using CPU

[2021-12-16 05:01:22,390] [INFO] Set up nlp object from config
[2021-12-16 05:01:22,395] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-16 05:01:22,397] [INFO] Created vocabulary
[2021-12-16 05:01:22,398] [INFO] Finished initializing nlp object
[2021-12-16 05:01:25,667] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     41.50    0.00    0.00    0.00    0.00
  0     200        108.23   3633.31   68.08   71.94   64.61    0.68
  0     400        128.92   2336.76   78.75   84.05   74.08    0.79
  0     600        131.32   2298.87   81.72   81.22   82.22    0.82
  1     800        134.22   2205.02   84.10

## Evaluate the model
The model finds most skill entities, such as R, Python, coding data and so on. Some mislabeling comes from errors in the data scraping stage, such as the fused token learningTimeseries. Other errors may come from the annotations made during data preparation. For example, "IT (information technology)" and "it" may be treated as the same word.

In [6]:
from spacy import displacy
nlp = spacy.load("../model/NER_spacy_v3/model-best")
for i in range(10):
    test_text = TEST_DATA[i][0]
    doc = nlp(test_text)
    displacy.render(doc, style="ent")



The model reaches 91% precision, 90% recall and a 90% F1-score on the test set, which means it identifies skill entities well. To improve performance further, the dataset should be annotated manually instead of by rule-based matching.

In [7]:
!python -m spacy evaluate ../model/NER_spacy_v3/model-best ../data/test.spacy

ℹ Using CPU

TOK     100.00
NER P   91.15
NER R   89.71
NER F   90.42
SPEED   27447


            P       R       F
SKILL   91.15   89.71   90.42
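
The F-score in the output above is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded percentages in the table. The function name f1_score is hypothetical. spaCy derives its score from raw counts rather than rounded percentages, so the recomputed value is only expected to agree approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for SKILL, taken from the evaluation output above.
print(round(f1_score(91.15, 89.71), 2))
```

Because the harmonic mean is pulled toward the lower of the two values, the F-score only stays high when precision and recall are both high.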

