# End-to-end process for adding both an entity ruler and word vectors to an NER model
1. Clean the documents and split the corpus into train and test sets
2. Build word vectors
3. Build training data with the entity ruler and split it into train and validation data
4. Add word vectors to the model, then train and test it

## Notebook 4
- Add word vectors to model
- Train model
- Test model

In [1]:
#load models
import spacy

#load test data
import json

#review test data
from spacy import displacy

### Load word vectors into spaCy model
- The train and validation data can be reused from the earlier NER setup
- Example: python3 -m spacy init vectors en word_vectors/word2vec_sw_word_vec_1.txt models/01 --name sw_word_vec_1
- General form: python3 -m spacy init vectors 'lang' 'location of vectors txt' 'output model dir' --name 'vectors model name'
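
The general form above can also be filled in from Python, which helps when several vector sets are converted in a row. This is a minimal sketch, not part of the original notebook: the helper name init_vectors_command is hypothetical, and it only builds the argument list in the same order as the command shown above.

```python
import shlex


def init_vectors_command(lang, vectors_loc, output_dir, name):
    """Build the argument list for `python3 -m spacy init vectors`."""
    return [
        "python3", "-m", "spacy", "init", "vectors",
        lang, vectors_loc, output_dir,
        "--name", name,
    ]


# values taken from the command run in the next cell
cmd = init_vectors_command(
    "en", "word_vectors/word2vec_sw_word_vec_3.txt", "models/03", "sw_word_vec_3"
)
print(shlex.join(cmd))
# the list could then be passed to subprocess.run(cmd, check=True)
```

Printing the joined command before running it makes it easy to compare against the command line used in the notebook.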

In [2]:
!python3 -m spacy init vectors en word_vectors/word2vec_sw_word_vec_3.txt models/03 --name sw_word_vec_3

ℹ Creating blank nlp object for language 'en'
[2022-02-19 14:59:38,347] [INFO] Reading vectors from word_vectors/word2vec_sw_word_vec_3.txt
4175it [00:00, 8516.18it/s]
[2022-02-19 14:59:38,867] [INFO] Loaded vectors from word_vectors/word2vec_sw_word_vec_3.txt
✔ Successfully converted 4175 vectors
✔ Saved nlp object with vectors to output directory. You can now use
the path to it in your config as the 'vectors' setting in [initialize].
/Users/sarasharick/Documents/NER/NER_wv_labels_end2end/models/03


In [3]:
#load vectors model
nlp = spacy.load('models/03')

In [4]:
#add ner pipe
nlp.add_pipe('ner')

<spacy.pipeline.ner.EntityRecognizer at 0x12ea5f6f0>

In [5]:
#resave model
nlp.to_disk('models/03')

### Train with new model
- Example: python3 -m spacy train models/01/config.cfg --output models/02 --paths.train data/train.spacy --paths.dev data/valid.spacy --paths.vectors models/01
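
One thing worth checking before training: passing --paths.vectors loads the vectors into the vocabulary, but whether the ner component uses them as features depends on the tok2vec settings in config.cfg. As far as I know, spaCy v3 configs control this with a pretrained_vectors or include_static_vectors option, and the default ner config leaves it off. The sketch below is an assumption-laden helper (static_vector_flags is a hypothetical name, and the example config text is made up for illustration); it uses only the standard library to list those options so they can be inspected.

```python
import configparser


def static_vector_flags(config_text):
    """Return {section: {flag: value}} for vector-related flags in a config."""
    # interpolation is disabled so values like ${paths.vectors} are read as-is
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    flags = ("pretrained_vectors", "include_static_vectors")
    found = {}
    for section in parser.sections():
        for flag in flags:
            if parser.has_option(section, flag):
                found.setdefault(section, {})[flag] = parser.get(section, flag)
    return found


# illustrative config fragment, NOT copied from models/03/config.cfg
example = """
[paths]
vectors = null

[components.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
"""
print(static_vector_flags(example))
```

To inspect the real file, the text of models/03/config.cfg can be read with open(...).read() and passed to the same function. If the flag comes back as null or false, the trained model may not be using the word vectors at all.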

In [6]:
!python3 -m spacy train models/03/config.cfg --output models/04 --paths.train data/train.spacy --paths.dev data/valid.spacy --paths.vectors models/03

ℹ Saving to output directory: models/04
ℹ Using CPU
[2022-02-19 14:59:55,597] [INFO] Set up nlp object from config
[2022-02-19 14:59:55,607] [INFO] Pipeline: ['ner']
[2022-02-19 14:59:55,611] [INFO] Created vocabulary
[2022-02-19 14:59:55,897] [INFO] Added vectors: models/03
[2022-02-19 14:59:55,928] [INFO] Finished initializing nlp object
[2022-02-19 14:59:59,966] [INFO] Initialized pipeline components: ['ner']
✔ Initialized pipeline
ℹ Pipeline: ['ner']
ℹ Initial learn rate: 0.001
E    #       LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  --------  ------  ------  ------  ------
  0       0     74.88   29.38   23.90   38.13    0.29
  0     200   2762.04   86.83   91.21   82.86    0.87
  1     400   1260.51   89.62   90.43   88.81    0.90
  2     600   1135.41   91.51   91.76   91.25    0.92
  3     800    911.67   92.95   93.20   92.70    0.93
  5    1000    825.82   93.24   93.77   92.71    0.93
  7   

### Best Model
#### Performance - Model is chosen by highest F1 score
- ents_f: 0.9467818334
- ents_p: 0.9511102818
- ents_r: 0.9424926036
- ner_loss: 135.2285380244

#### By ent type - Person
- p: 0.9776397516
- r: 0.9790992784
- f: 0.9783689707
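
As a sanity check on the scores above, F1 is the harmonic mean of precision and recall, so each f value can be recomputed from the matching p and r. A minimal sketch (the function name f1_score is only illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# overall scores reported for the best model
overall_f = f1_score(0.9511102818, 0.9424926036)
# scores reported for the Person entity type
person_f = f1_score(0.9776397516, 0.9790992784)
print(round(overall_f, 4), round(person_f, 4))
```

Both values should agree with the reported ents_f and the Person f to about four decimal places.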

### Review model on test data

In [7]:
def load_data(file):
    #read a JSON file and return its contents
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

In [8]:
test_data = load_data('data/sw_test_ner.json')
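
The review loop further down passes each item of test_data straight to the model, so the file is assumed here to be a JSON list of raw text strings. A small check like the following can catch a wrong file early. The helper name check_texts and the sample sentences are hypothetical, not from the original data.

```python
import json


def check_texts(data):
    """Return the number of items, or raise ValueError if data is not a non-empty list of strings."""
    if not isinstance(data, list) or not data:
        raise ValueError("expected a non-empty list")
    bad = [i for i, item in enumerate(data) if not isinstance(item, str)]
    if bad:
        raise ValueError(f"non-string items at positions {bad[:5]}")
    return len(data)


# stand-in for the contents of data/sw_test_ner.json (made-up sentences)
sample = json.loads('["First test sentence.", "Second test sentence."]')
print(check_texts(sample))
```

In the notebook the same call would be check_texts(test_data).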

In [9]:
trained_nlp = spacy.load('./models/04/model-best')

In [130]:
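#take a slice of 50 test documents
test = test_data[1275:1325]

#run the trained model on each document and render the predicted entities
for item in test:
    doc = trained_nlp(item)
    displacy.render(doc, style='ent')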
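
Reading through rendered output works for a handful of documents. For a quick summary of the same slice, the predicted labels can also be tallied. This is a minimal sketch, assuming the pipeline object exposes pipe() and each doc exposes ents with a label_ attribute, as spaCy pipelines do; the helper name count_entity_labels is hypothetical.

```python
from collections import Counter


def count_entity_labels(nlp, texts):
    """Count predicted entity labels over an iterable of texts."""
    counts = Counter()
    # nlp.pipe processes the texts as a stream instead of one call per text
    for doc in nlp.pipe(texts):
        counts.update(ent.label_ for ent in doc.ents)
    return counts


# usage in the notebook (not run here):
# count_entity_labels(trained_nlp, test).most_common()
```

A label that never appears in the tally, or one that dominates unexpectedly, is a cue to go back to the displacy view for those documents.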