# Named Entity Recognition

## Date: February 4, 2020

Link: https://confusedcoders.com/data-science/deep-learning/how-to-create-custom-ner-in-spacy

In [7]:
import os
os.chdir('Documents/Projects/AI4Good/data_aiminer/pages_selected_bis_archived')

In [6]:
os.chdir('..')

In [8]:
import json

In [9]:
with open('Analyses on characteristics of currents around Maoming Harbor.pdf_page_102.json') as f:
    dct = json.load(f)

In [10]:
dct

'Oral Presentations on 15 December  101surface compare favorably with experimental data. Numerical results are also presented for the instantaneous flow field, recirculation regions, vortex tubes, and maximal bed shear stress. The results indicate that the flow phenomena are very complicated after the bore breaks.  14:45 Internal generation of waves on an arced band in an unstructured grid system G. KIM Division of Ocean System Engineering, Mokpo National Maritime University, Mokpo City, Jeollanam-do, 530-729, Republic of Korea C. LEE Department of Civil and Environmental Engineering, Sejong University, 98 Gunja-Dong, Gwangjin-Gu, Seoul, 143-747, Korea In this study, we developed Gaussian source functions on an arced band to generate incident waves in the extended mild-slope equation. Numerical experiments were conducted for waves propagating on a flat bottom and also waves scattered by a vertical cylinder. The numerical results showed that the technique of wave generation using on an 

In [11]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.gold import GoldParse
from spacy.scorer import Scorer
import plac
from pathlib import Path
import random

In [12]:
# plac, random and Path are already imported in the previous cell
from tqdm import tqdm

In [13]:
nlp = spacy.load('en')

In [14]:
doc_ = nlp(dct)

In [15]:
for ent in doc_.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

15 22 24 CARDINAL
December 25 33 DATE
14:45 316 321 TIME
G. KIM Division 399 414 PERSON
Ocean System Engineering 418 442 ORG
Mokpo National Maritime University 444 478 ORG
Mokpo City 480 490 GPE
Jeollanam 492 501 GPE
530 506 509 CARDINAL
Republic of Korea C. LEE Department of Civil and Environmental Engineering 515 589 ORG
Sejong University 591 608 ORG
98 610 612 CARDINAL
Gunja 613 618 ORG
Gwangjin 625 633 GPE
Seoul 638 643 GPE
143 645 648 CARDINAL
Korea 654 659 GPE
Gaussian 688 696 NORP
15:00 1072 1077 CARDINAL
the Bohai Sea 1119 1132 LOC
BO XIA School of Civil Engineering 1183 1217 ORG
Tianjin University 1219 1237 ORG
Tianjin 1239 1246 GPE
300072 1248 1254 DATE
China Changsha University of Science and Technology 1256 1307 ORG
Changsha 1309 1317 GPE
Hunan 1319 1324 GPE
410114 1326 1332 DATE
China 1334 1339 GPE
QINGHE ZHANG School of Civil Engineering 1341 1381 ORG
Tianjin University 1383 1401 ORG
Tianjin 1403 1410 GPE
300072 1412 1418 DATE
China 1420 1425 GPE
CHANGBO JIANG Changsha Un

In [16]:
model = None
output_dir = Path('models')
n_iter = 20

#### Load the model

In [17]:
if model is not None:
    nlp = spacy.load(model)
else:
    nlp = spacy.blank('en')

#### Set Up the Pipeline

In [18]:
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
else:
    ner = nlp.get_pipe('ner')

Labels we want to identify, and what each label should cover:
- **ORG**: Organisation: *companies, universities, government agencies...*
- **PERSON**: People
- **NMB**: Numbers
- **DATE**: Date
- **LOC**: Location: *everything that refers to a particular place*
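
One way to relate these target labels to the labels printed by the pretrained model earlier (CARDINAL, GPE, LOC, ...) is a simple lookup table. The sketch below is not something this notebook does: the choices in `LABEL_MAP` (for example folding GPE into LOC) and the helper name `to_custom_label` are assumptions made for illustration.

```python
# Hypothetical mapping from spaCy's built-in labels to the custom label set.
# The choices here are assumptions for illustration only.
LABEL_MAP = {
    "ORG": "ORG",
    "PERSON": "PERSON",
    "CARDINAL": "NMB",
    "DATE": "DATE",
    "GPE": "LOC",
    "LOC": "LOC",
}


def to_custom_label(label):
    """Return the custom label for a built-in label, or None if it is not covered."""
    return LABEL_MAP.get(label)


# (text, built-in label) pairs taken from the entity listing earlier in the notebook
examples = [("Mokpo City", "GPE"), ("15", "CARDINAL"), ("14:45", "TIME")]
mapped = [(text, to_custom_label(label)) for text, label in examples]
print(mapped)
```

Labels that have no counterpart in the custom set, such as TIME here, come back as `None` and can be dropped or reviewed by hand.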

In [37]:
#Labels = ['ORG', 'PERSON', 'NMB', 'DATE', 'LOC']

In [20]:
# Add new entity labels to entity recognizer
#for i in Labels:
#    ner.add_label(i)

In [21]:
#matcher = PhraseMatcher(nlp.vocab)
#for element in ['Tokyo', 'Japan']:
#    matcher.add(Labels[4], None, nlp(element))

In [22]:
#test = nlp(dct)
#matches = matcher(test)

In [38]:
#[match for match in matches]

In [39]:
#test[613]

#### Building a small training set

In [28]:
# Each entity is (start_char, end_char, label); end_char is exclusive.
train_data = [
    ('Half-life time, simulated by a dispersion model, is chose to represent the exchange ability of the Bohai Sea',
     {
        'entities': [(99, 108, 'LOC')]
    }),
    ('Parallel Session 6 Thursday, 15 December Beach Erosion and Morphodynamics II Ballroom C Chair: Yongjun Lu, Zheng Bing Wang',
     {
        # end offsets corrected so that 'December' and 'Wang' are not cut short
        'entities': [(19, 40, 'DATE'), (95, 105, 'PERSON'), (107, 122, 'PERSON')]
    }),
    ('The Bohai Sea is a semi-enclosed inland sea in northern China which has been polluted these years',
     {
        # end offset corrected so that 'Sea' is not cut short
        'entities': [(0, 13, 'LOC'), (47, 61, 'LOC'), (86, 97, 'DATE')]
    }),
]
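
Character offsets are easy to get wrong by one, so it can help to print the text that each `(start, end)` pair actually selects before training. The helpers below are a sketch: `spans_text` and `misaligned` are hypothetical names, and the word-boundary check is a simple heuristic that assumes an entity never starts or ends in the middle of an alphanumeric run.

```python
def spans_text(text, entities):
    """Return the substring selected by each (start, end, label) triple."""
    return [(text[start:end], label) for start, end, label in entities]


def misaligned(text, entities):
    """Return entities that are out of range or cut a word in two (heuristic)."""
    bad = []
    for start, end, label in entities:
        if not (0 <= start < end <= len(text)):
            bad.append((start, end, label))
            continue
        # A span cuts a word if alphanumeric characters sit on both sides of an edge.
        cuts_start = start > 0 and text[start - 1].isalnum() and text[start].isalnum()
        cuts_end = end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
        if cuts_start or cuts_end:
            bad.append((start, end, label))
    return bad


text = ('Half-life time, simulated by a dispersion model, is chose to '
        'represent the exchange ability of the Bohai Sea')
print(spans_text(text, [(99, 108, 'LOC')]))   # the span as annotated
print(misaligned(text, [(99, 108, 'LOC')]))   # nothing flagged
print(misaligned(text, [(99, 107, 'LOC')]))   # deliberately one character short
```

Running `misaligned` over every example in a training set before calling `nlp.update` is a cheap way to catch spans that stop one character early.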

In [29]:
#adding the labels from our train data
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

#### Train the model

The number of iterations is the number of times the model sees the full training set. The training data is shuffled before each iteration so that the model is not biased by the order of the examples.
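
The effect of iterating and shuffling can be seen without spaCy. The sketch below only records the order in which the examples would be visited on each pass; `epoch_orders` and the fixed seed are hypothetical, added here so that the result is reproducible.

```python
import random


def epoch_orders(examples, n_iter, seed=0):
    """Return the visiting order of the examples for each of n_iter passes."""
    rng = random.Random(seed)  # local generator, the global random state is untouched
    data = list(examples)      # copy, so the caller's list is not reordered
    orders = []
    for _ in range(n_iter):
        rng.shuffle(data)
        orders.append(list(data))
    return orders


orders = epoch_orders(["a", "b", "c"], n_iter=20)
print(len(orders))                                        # one order per iteration
print(all(sorted(o) == ["a", "b", "c"] for o in orders))  # every pass sees every example
```

Each pass contains the same examples; only their order changes from one iteration to the next.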

In [30]:
# Disable every pipe except the NER so that only the NER is trained
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_data)  # new order on every iteration
        losses = {}
        for text, annotations in tqdm(train_data):
            nlp.update(
                [text],         # batch of one text
                [annotations],  # matching batch of annotations
                drop=0.2,       # dropout rate
                sgd=optimizer,
                losses=losses)
        print(losses)



{'ner': 45.566320328973234}
{'ner': 9.889341731335511}
{'ner': 11.691314076310212}
{'ner': 5.199264682064583}
{'ner': 5.77715517847445}
{'ner': 0.8514046666991749}
{'ner': 2.7082378478128475}
{'ner': 0.012655483930215285}
{'ner': 0.0015213495490211944}
{'ner': 0.0004185526784822161}
{'ner': 0.0004013056592174934}
{'ner': 2.6524812822905887e-09}
{'ner': 3.388976786677231e-10}
{'ner': 2.7484268761184016e-09}
{'ner': 1.1146543916894273e-08}
{'ner': 0.0008324269694142593}
{'ner': 1.6627832951830777}
{'ner': 1.5975698022512891e-09}
{'ner': 1.0092140370588723e-05}
{'ner': 9.279715538013932e-09}





#### Save the model

In [34]:
# Create the output directory if it does not exist yet
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

In [35]:
if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    print('The model is saved to: ', output_dir)

The model is saved to:  models


#### Reload the model

In [36]:
nlp2 = spacy.load(output_dir)

#### Evaluate the model

In [54]:
def evaluate(model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        pred_value = model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores
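
In spaCy v2 the dictionary returned by `Scorer.scores` includes entity precision, recall and F-score under the keys `ents_p`, `ents_r` and `ents_f`. As a rough illustration of what those numbers mean, the sketch below computes exact-match scores over `(start, end, label)` triples in plain Python. `span_prf` is a hypothetical helper and not spaCy's implementation, the offsets are made up for the example, and the values are returned as fractions rather than percentages.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations: three gold spans, two predictions
gold = [(0, 9, "LOC"), (20, 34, "LOC"), (40, 51, "DATE")]
pred = [(0, 9, "LOC"), (20, 34, "ORG")]  # one exact match, one wrong label
precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, f1)
```

A prediction counts as correct only when start, end and label all match, so a span with the right boundaries but the wrong label lowers both precision and recall.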

In [40]:
#test_result = evaluate(new_model, test_data)

## TODO: Manually tag a large dataset to train the Named Entity Recognition model on (Prodigy)
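
Once manual annotations exist, they have to be converted to the `(text, {'entities': [...]})` tuples used for `train_data`. The sketch below assumes, without this notebook confirming it, that the annotation tool exports one JSON object per line with a `text` field and a `spans` list whose items carry `start`, `end` and `label`. `records_to_train_data` is a hypothetical helper name.

```python
import json


def records_to_train_data(lines):
    """Convert JSON-lines annotation records to (text, {'entities': [...]}) tuples."""
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])  # a record may have no spans
        ]
        converted.append((record["text"], {"entities": entities}))
    return converted


# Hypothetical record in the assumed export format
sample = ['{"text": "Tianjin University", "spans": [{"start": 0, "end": 18, "label": "ORG"}]}']
print(records_to_train_data(sample))
```

The same function can be fed an open file object, since iterating over a file yields one line at a time.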