## Tools & Models Installation

In [11]:
#Install spacy
!pip install -U spacy

In [12]:
#Display spaCy version information
!python -m spacy info


spaCy version    3.4.0                         
Location         C:\Users\MR\anaconda3\lib\site-packages\spacy
Platform         Windows-10-10.0.22000-SP0     
Python version   3.8.5                         
Pipelines                                      



In [65]:
#Download en_core_web_sm model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [31]:
#Download en_core_web_trf model
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0-py3-none-any.whl (460.3 MB)
Installing collected packages: en-core-web-trf
Successfully installed en-core-web-trf-3.4.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [54]:
#Install spaCy with the CuPy (CUDA 11.3) and transformers extras
!pip install -U spacy[cuda113,transformers]

Collecting cupy-cuda113<11.0.0,>=5.0.0b4
  Using cached cupy_cuda113-10.6.0-cp38-cp38-win_amd64.whl (56.9 MB)
Installing collected packages: cupy-cuda113
Successfully installed cupy-cuda113-10.6.0


In [59]:
#Install PyTorch version compatible with CUDA 11.3 
!pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu113
Collecting torch==1.11.0+cu113
  Downloading https://download.pytorch.org/whl/cu113/torch-1.11.0%2Bcu113-cp38-cp38-win_amd64.whl (2186.1 MB)
Collecting torchvision==0.12.0+cu113
  Downloading https://download.pytorch.org/whl/cu113/torchvision-0.12.0%2Bcu113-cp38-cp38-win_amd64.whl (5.4 MB)
Collecting torchaudio==0.11.0
  Downloading https://download.pytorch.org/whl/cu113/torchaudio-0.11.0%2Bcu113-cp38-cp38-win_amd64.whl (573 kB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 1.12.0
    Uninstalling torch-1.12.0:
      Successfully uninstalled torch-1.12.0
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.13.0+cu113
    Uninstalling torchvision-0.13.0+cu113:
      Successfully uninstalled torchvision-0.13.0+cu113
  Attempting uninstall: torchaudio
    Found existing installation: torchaudi
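
After several installs that upgrade and downgrade each other, it can help to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the ones pip reported above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Distribution names assumed from the pip commands in this section.
for name, version in installed_versions(
    ["spacy", "spacy-transformers", "torch", "cupy-cuda113"]
).items():
    print(f"{name:20s} {version or 'not installed'}")
```

A `None` entry means the package is missing from the active environment, which is worth checking before starting a GPU run.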

---

# Libraries & Constants

In [49]:
import spacy
import spacy_transformers
from spacy.tokens import DocBin
from spacy import displacy
import pandas as pd
from tqdm import tqdm
import json
import time

In [67]:
# Constants
TRAIN_DATA_PATH = "../data/processed/ner_for_training/spacy_train_mini.json"
DEV_DATA_PATH = "../data/processed/ner_for_training/spacy_dev_mini.json"
TRAIN_DATA_EXPORT_PATH = "../data/processed/ner_for_training/spacy_train_mini.spacy"
DEV_DATA_EXPORT_PATH = "../data/processed/ner_for_training/spacy_dev_mini.spacy"
TEST_DATA_PATH = "../data/processed/ner_for_training/spacy_test_mini.json"
TEST_DATA_EXPORT_PATH = "../data/processed/ner_for_training/spacy_test_mini.spacy"

TRAINED_MODEL_PATH = "../models/baseline_model/model-best"
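
Because every path above is relative to the notebook's working directory, a quick existence check can catch a wrong working directory early. This is only a sketch: `missing_paths` is a hypothetical helper, and the three literals are copied from the constants above.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Input files copied from the constants above.
input_files = [
    "../data/processed/ner_for_training/spacy_train_mini.json",
    "../data/processed/ner_for_training/spacy_dev_mini.json",
    "../data/processed/ner_for_training/spacy_test_mini.json",
]
print("Missing input files:", missing_paths(input_files))
```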

---

# Training Data Transformation

**spaCy cannot train models directly from JSON. The JSON files must first be converted to the binary .spacy format, which is what the cells below do.**

In [3]:
#Load TRAIN_DATA (the with-blocks close each file after reading)
with open(TRAIN_DATA_PATH) as train_file:
    TRAIN_DATA = json.load(train_file)
#Load DEV_DATA
with open(DEV_DATA_PATH) as dev_file:
    DEV_DATA = json.load(dev_file)
#Load TEST_DATA
with open(TEST_DATA_PATH) as test_file:
    TEST_DATA = json.load(test_file)

The conversion cells below are adapted from the snippet in the spaCy documentation: https://spacy.io/usage/training#training-data

In [7]:
TRAIN_DATA

[{'document': 'Great English beer, poor English food, friendly English staff. A unique atmosphere with reasonable prices, the Toad is an excellent place to get sloshed.',
  'annotation': []},
 {'document': "You can swim at Al Mamzar Beach or stroll in the park. Though less popular than other Dubai parks, the park has the park's share of Dubai's greenery.",
  'annotation': [[16, 31, 'LOC']]},
 {'document': 'The Virgil Avenue Tobacconist\'s slogan proudly declares it to be "where the city smokes", and if local renown is a measure of quality, that\'s hardly an exaggeration. This one-stop shop just off Hertel offers everything the the city smoking enthusiast could conceivably desire: imported cigarettes, pipes and pipe tobacco, rolling papers, loose cigarette tobacco, and — the main draw — a dizzying range of premium cigars shipped directly from factory to store. Ashton, Hemmingway, Arturo Fuentes, and Cohiba are only a few of the many brands to be found in Virgil Avenue\'s massive walk-in
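
Each record pairs a `document` string with a list of `[start, end, label]` character offsets. Before conversion, the offsets can be sanity-checked with plain Python. The sketch below is an assumption about what counts as "malformed" (wrong shape, out-of-range offsets, surrounding whitespace); `find_invalid_annotations` is a hypothetical helper, not part of spaCy.

```python
def find_invalid_annotations(records):
    """Return (record_index, annotation, reason) for each suspicious annotation."""
    problems = []
    for i, record in enumerate(records):
        text = record["document"]
        for ann in record["annotation"]:
            if len(ann) != 3:
                problems.append((i, ann, "expected [start, end, label]"))
                continue
            start, end, _label = ann
            if not (isinstance(start, int) and isinstance(end, int)):
                problems.append((i, ann, "offsets must be integers"))
            elif not (0 <= start < end <= len(text)):
                problems.append((i, ann, "offsets out of range"))
            elif text[start:end] != text[start:end].strip():
                problems.append((i, ann, "span has surrounding whitespace"))
    return problems


# Second record from the output above, shortened to its first sentence.
sample = [{
    "document": "You can swim at Al Mamzar Beach or stroll in the park.",
    "annotation": [[16, 31, "LOC"]],
}]
print(sample[0]["document"][16:31])      # the annotated text
print(find_invalid_annotations(sample))  # empty list when nothing is wrong
```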

In [17]:
# Create a blank English pipeline (tokenizer only) and a DocBin for the training docs
nlp = spacy.blank("en")
db1 = DocBin()

#Convert train data
for item in tqdm(TRAIN_DATA):
    # create doc object from text
    doc = nlp.make_doc(item["document"]) 
    ents = []
    for ent in item["annotation"]: 
        span = doc.char_span(ent[0], ent[1], label=ent[2], alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents
    db1.add(doc)

db1.to_disk(TRAIN_DATA_EXPORT_PATH) 

#Convert dev data
db2 = DocBin()
for item in tqdm(DEV_DATA):
    doc = nlp.make_doc(item["document"])
    ents = []
    for ent in item["annotation"]: 
        span = doc.char_span(ent[0], ent[1], label=ent[2], alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db2.add(doc)

db2.to_disk(DEV_DATA_EXPORT_PATH)

100%|████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:01<00:00, 1309.74it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 700/700 [00:00<00:00, 1639.22it/s]


In [62]:
#Convert test data
db3 = DocBin()
for item in tqdm(TEST_DATA):
    doc = nlp.make_doc(item["document"])
    ents = []
    for ent in item["annotation"]: 
        span = doc.char_span(ent[0], ent[1], label=ent[2], alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db3.add(doc)

db3.to_disk(TEST_DATA_EXPORT_PATH)

100%|██████████████████████████████████████████████████████████████████████████████| 235/235 [00:00<00:00, 1305.32it/s]
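
The "Skipping entity" message is printed when `doc.char_span` returns `None`. With `alignment_mode="contract"`, the character range is shrunk to the tokens that lie completely inside it, and `None` is returned when no token fits. The sketch below imitates that behaviour in plain Python under a strong simplifying assumption: tokens are whitespace-separated chunks, which is not how spaCy's tokenizer splits text. `contract_to_tokens` is a hypothetical name.

```python
import re


def contract_to_tokens(text, start, end):
    """Shrink [start, end) to the whitespace tokens fully inside it, else None."""
    inside = [
        m.span()
        for m in re.finditer(r"\S+", text)
        if m.start() >= start and m.end() <= end
    ]
    if not inside:
        return None
    # Tokens are found in order, so the first and last give the contracted range.
    return inside[0][0], inside[-1][1]


text = "You can swim at Al Mamzar Beach or stroll"
print(contract_to_tokens(text, 16, 31))  # offsets already on token boundaries
print(contract_to_tokens(text, 14, 33))  # cuts into "at" and "or", so it shrinks
print(contract_to_tokens(text, 17, 18))  # lies inside "Al", so no token fits
```

Counting how often the result is `None` gives a rough idea of how many annotations the conversion cells drop silently.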


---

# \~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~ NER Training \~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~

## ============================ Experiment 1 ============================

## Baseline Model using "en_core_web_sm"

* The following model is run on these sample sizes:
    * Training: 2000 samples
    * Validation: 700 samples
    * Testing: 235 samples

### Initialize Config File

In [18]:
#Fill the base config with default values for all remaining settings
!python -m spacy init fill-config ../config/baseline_base_config.cfg ../config/baseline_config.cfg

[+] Auto-filled config with all values
[+] Saved config
..\config\baseline_config.cfg
You can now add your data and train your pipeline:
python -m spacy train baseline_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


### Debug Data

In [20]:
# Debug & validate the training and development data, get useful stats,
# and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
!python -m spacy debug data ../config/baseline_config.cfg --paths.train ../data/processed/ner_for_training/spacy_train_mini.spacy --paths.dev ../data/processed/ner_for_training/spacy_dev_mini.spacy

[+] Pipeline can be initialized with data
[+] Corpus is loadable
Language: en
Training pipeline: tok2vec, ner
2000 training docs
700 evaluation docs
[!] 4 training examples also in evaluation data
[i] 167926 total word(s) in the data (16096 unique)
[i] No word vectors present in the package
[i] 6 label(s)
0 missing value(s) (tokens with '-' label)
[+] Good amount of examples for all labels
[+] Examples without occurrences available for all labels
[+] No entities consisting of or starting/ending with whitespace
[+] No entities crossing sentence boundaries
[+] 6 checks passed
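
The warning about 4 training examples also appearing in the evaluation data points to leakage between the two splits. A minimal sketch for locating and removing such duplicates from the JSON records follows. It assumes the overlap is an exact match on the `document` text; `shared_documents` and `drop_shared` are hypothetical helpers.

```python
def shared_documents(train_records, dev_records):
    """Return the sorted document texts that occur in both splits."""
    train_texts = {r["document"] for r in train_records}
    dev_texts = {r["document"] for r in dev_records}
    return sorted(train_texts & dev_texts)


def drop_shared(train_records, dev_records):
    """Return the training records whose text does not occur in the dev split."""
    dev_texts = {r["document"] for r in dev_records}
    return [r for r in train_records if r["document"] not in dev_texts]


# Tiny illustrative records, not taken from the real data.
train = [{"document": "a", "annotation": []}, {"document": "b", "annotation": []}]
dev = [{"document": "b", "annotation": []}, {"document": "c", "annotation": []}]
print(shared_documents(train, dev))
print(len(drop_shared(train, dev)))
```

Applied to `TRAIN_DATA` and `DEV_DATA` before the conversion step, this would keep the evaluation scores from being inflated by documents the model has already seen.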


## Train NER Baseline Model

In [29]:
start = time.time()

!python -m spacy train ../config/baseline_config.cfg --output ../models/baseline_model --paths.train ../data/processed/ner_for_training/spacy_train_mini.spacy --paths.dev ../data/processed/ner_for_training/spacy_dev_mini.spacy

[2022-08-06 20:18:10,291] [INFO] Set up nlp object from config
[2022-08-06 20:18:10,301] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-08-06 20:18:10,304] [INFO] Created vocabulary
[2022-08-06 20:18:10,305] [INFO] Finished initializing nlp object
[2022-08-06 20:18:13,215] [INFO] Initialized pipeline components: ['tok2vec', 'ner']


[i] Saving to output directory: ..\models\baseline_model
[i] Using CPU
[i] To switch to GPU 0, use the option: --gpu-id 0
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     86.81    0.00    0.00    0.00    0.00
  0     200        291.89   2828.88   14.88   31.87    9.71    0.15
  0     400       3965.28   1822.81   24.70   27.48   22.42    0.25
  0     600        154.41   1746.87   27.27   25.03   29.94    0.27
  0     800        155.03   1595.89   35.23   46.88   28.22    0.35
  0    1000        366.73   1796.85   43.62   58.07   34.92    0.44
  1    1200        263.19   1786.59   46.47   60.38   37.77    0.46
  1    1400        445.77   1986.49   49.70   52.47   47.21    0.50
  1    1600        437.56   2145.66   53.19   48.90   58.32    0.53
  2    1800        647.46   2245

[2022-08-06 20:23:37,109] [INFO] Set up nlp object from config
[2022-08-06 20:23:37,125] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-08-06 20:23:37,133] [INFO] Created vocabulary
[2022-08-06 20:23:37,134] [INFO] Finished initializing nlp object
[2022-08-06 20:23:41,375] [INFO] Initialized pipeline components: ['tok2vec', 'ner']


In [30]:
elapsed_time = time.time() - start
print('Baseline Model Training Time:', time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))

Baseline Model Training Time: 00:22:24


----------

## Test the model

After training, the best model is saved in a folder named model-best inside the "models/baseline_model" directory. Let's extract entities with the newly trained model.

In [68]:
nlp_trained = spacy.load(TRAINED_MODEL_PATH)

In [69]:
docs = []
for item in TEST_DATA:
    docs.append(nlp_trained(item["document"]))

In [76]:
#Render the entities of the next test document; re-run this cell to step through the documents.
docs = iter(docs)
spacy.displacy.render(next(docs), style = "ent")

In [77]:
!python -m spacy evaluate ../models/baseline_model/model-best ../data/processed/ner_for_training/spacy_test_mini.spacy -o ../evaluation/eval_baseline.json -dp ../figures/baseline_model

[i] Using CPU
[i] To switch to GPU 0, use the option: --gpu-id 0

TOK     100.00
NER P   65.30 
NER R   59.98 
NER F   62.52 
SPEED   38228 


            P       R       F
LOC     60.00   50.56   54.88
DATE    80.49   78.88   79.68
ORG     54.85   50.14   52.39
FAC     60.19   56.36   58.22
MONEY   95.00   79.17   86.36
EVENT   75.00   25.00   37.50

[+] Generated 25 parses as HTML
..\figures\baseline_model
[+] Saved results to ..\evaluation\eval_baseline.json
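
The F column is the harmonic mean of precision and recall, F = 2PR / (P + R). As a sanity check, the sketch below recomputes it from the P and R values reported above. Small differences in the second decimal are expected, because the reported P and R are already rounded.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, F) copied from the evaluation output above.
reported = {
    "ALL": (65.30, 59.98, 62.52),
    "LOC": (60.00, 50.56, 54.88),
    "DATE": (80.49, 78.88, 79.68),
    "ORG": (54.85, 50.14, 52.39),
    "FAC": (60.19, 56.36, 58.22),
    "MONEY": (95.00, 79.17, 86.36),
    "EVENT": (75.00, 25.00, 37.50),
}
for label, (p, r, f) in reported.items():
    print(f"{label:6s} reported={f:6.2f} recomputed={f_score(p, r):6.2f}")
```

The EVENT row shows how the harmonic mean punishes imbalance: a precision of 75.00 with a recall of 25.00 yields an F of only 37.50.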


