# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we use the spaCy command-line interface (CLI) to train and evaluate an NER model.

The dataset is CoNLL-2003: https://www.clips.uantwerpen.be/conll2003/ner/

## Step 1: Converting the data to spaCy's binary `.spacy` format

In [None]:
import os

In [None]:
!wget https://data.deepai.org/conll2003.zip

--2022-11-10 06:16:02--  https://data.deepai.org/conll2003.zip
Resolving data.deepai.org (data.deepai.org)... 5.9.140.253
Connecting to data.deepai.org (data.deepai.org)|5.9.140.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982975 (960K) [application/x-zip-compressed]
Saving to: ‘conll2003.zip’


2022-11-10 06:16:03 (2.04 MB/s) - ‘conll2003.zip’ saved [982975/982975]



In [None]:
!unzip conll2003.zip

Archive:  conll2003.zip
  inflating: metadata                
  inflating: test.txt                
  inflating: train.txt               
  inflating: valid.txt               
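
Before converting, it can help to look at how the raw files are laid out. The sketch below assumes the four-column, token-per-line layout described on the CoNLL-2003 page (token, POS tag, chunk tag, NER tag), with blank lines between sentences and `-DOCSTART-` lines between documents. `SAMPLE` is a hand-written illustration in that layout and `read_conll` is a hypothetical helper, not part of spaCy.

```python
# Sketch only: SAMPLE and read_conll are illustrative names, not spaCy APIs.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text):
    """Return a list of sentences; each sentence is a list of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
print(len(sentences))
print(sentences[0][:3])
```

To inspect the real data, the same function could be called on the contents of `train.txt`, for example `read_conll(open("train.txt").read())`.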


In [None]:
# Create the folder spacyNER_data, which will hold the converted CoNLL data.
import os
# !mkdir spacyNER_data
os.makedirs('spacyNER_data', exist_ok=True)

# exist_ok=True makes this cell safe to rerun: if the folder already exists,
# nothing happens instead of a FileExistsError being raised.

In [None]:
!python -m spacy convert "train.txt" spacyNER_data -c ner
!python -m spacy convert "test.txt" spacyNER_data -c ner
!python -m spacy convert "valid.txt" spacyNER_data -c ner

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents):
spacyNER_data/train.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): spacyNER_data/test.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/valid.spacy

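The converter's warning suggests grouping sentences into documents with `-n 10`. A minimal sketch of how the same three conversions could be built with that flag added is shown below. `build_convert_command` is a hypothetical helper, not a spaCy API, and the commands are only printed here, not executed.

```python
# Sketch only: build_convert_command is a hypothetical helper, not a spaCy API.
import sys


def build_convert_command(input_path, output_dir, n_sents=10):
    """Build the `spacy convert` call used in this notebook, plus `-n` for sentence grouping."""
    return [
        sys.executable, "-m", "spacy", "convert",
        input_path, output_dir,
        "-c", "ner",          # same converter as in the cell above
        "-n", str(n_sents),   # group this many sentences into one document
    ]


commands = [
    build_convert_command(f"{split}.txt", "spacyNER_data")
    for split in ("train", "test", "valid")
]
for cmd in commands:
    print("python " + " ".join(cmd[1:]))
    # To actually run the conversion: subprocess.run(cmd, check=True)
```

Running these commands would likely write to the same output file names, so the `.spacy` files generated earlier would be overwritten.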

## Training the NER model with spaCy (CLI)


In [None]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train config.cfg --output ./output --paths.train /content/spacyNER_data/train.spacy --paths.dev /content/spacyNER_data/valid.spacy --gpu-id 0

✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using GPU: 0
[2022-11-10 06:31:16,409] [INFO] Set up nlp object from config
INFO:spacy:Set up nlp object from config
[2022-11-10 06:31:16,421] [INFO] Pipeline: ['tok2vec', 'ner', 'tagger']
INFO:spacy:Pipeline: ['tok2vec', 'ner', 'tagger']
[2022-11-10 06:31:16,425] [INFO] Created vocabulary
INFO:spacy:Created vocabulary
[2022-11-10 06:31:16,426] [INFO] Finished initializing nlp object
INFO:spacy:Finished initializing nlp object
[2022-11-10 06:31:46,133] [INFO] Initialized pipeline components: ['tok2vec', 'ner', 'tagger']
INFO:spacy:Initialized pipeline components: ['tok2vec', 'ner', 'tagger']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner', 'tagger']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  LOSS TAGGER  ENTS_F  ENTS_P  ENTS_R  TAG_ACC  SCORE 
---  ------  ------------  --------  -----------  ------ 

Notice how the scores in the ENTS_F, ENTS_P, ENTS_R, and TAG_ACC columns tend to improve as training progresses.

## Evaluating the model on the test set (`spacyNER_data/test.spacy`)

### On the trained model (`output/model-last`)

In [None]:
# Create a folder to store the evaluation output and visualizations.
# !mkdir result
os.makedirs('result', exist_ok=True)
!python -m spacy evaluate output/model-last spacyNER_data/test.spacy -dp result

Time      3.93 s
Words     46666 
Words/s   11873 
TOK       100.00
POS       95.28 
UAS       0.00  
LAS       0.00  
NER P     81.80 
NER R     81.96 
NER F     81.88 
Textcat   0.00  

✔ Generated 25 parses as HTML
result
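
As a quick sanity check on the numbers above, the reported NER F value should be the harmonic mean of the reported precision and recall. The sketch below recomputes it from the two values copied out of the evaluation output. `f_score` is an illustrative helper, not a spaCy function.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above.
ner_p, ner_r = 81.80, 81.96
print(round(f_score(ner_p, ner_r), 2))  # 81.88
```

The result matches the `NER F` row, which confirms that the three NER scores in the output are consistent with each other.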


In [None]:
import spacy
from spacy import displacy

# Load the pipeline saved by the training run above.
MODEL_PATH = "output/model-last"
ner = spacy.load(MODEL_PATH)

In [None]:
# Run the trained pipeline on a sample sentence and highlight the predicted entities.
text = "Binod is a CEO of CG company at Kathmandu Nepal."
doc = ner(text)
displacy.render(doc, jupyter=True, style="ent")