![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/named_entity_recognition/NLU_training_NER_demo.ipynb)



# Training a Named Entity Recognition (NER) model with NLU 
With the [NER_DL model](https://nlp.johnsnowlabs.com/docs/en/annotators#ner-dl-named-entity-recognition-deep-learning-annotator) from Spark NLP you can train deep learning models that achieve state-of-the-art results on many NER problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
  

import nlu

--2021-05-05 05:10:15--  https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1671 (1.6K) [text/plain]
Saving to: ‘STDOUT’

Installing NLU 3.0.0 with PySpark 3.0.2 and Spark NLP 3.0.1 for Google Colab ...

2021-05-05 05:10:16 (1.54 MB/s) - written to stdout [1671/1671]


# 2. Download the CoNLL-2003 dataset

In [None]:
! wget https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train

--2021-05-05 05:12:10--  https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train [following]
--2021-05-05 05:12:10--  https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283420 (3.1M) [text/plain]
Saving to: ‘eng.train’


2021-05-05 05:12:11 (36.4 MB/s) - ‘eng.train’ saved [3283420/3283420]
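
Before training, it can help to look at how a CoNLL-style file is laid out. The sketch below is a minimal reader, assuming the common CoNLL-2003 layout (one token per line, the token in the first column, the NER tag in the last column, blank lines between sentences, and `-DOCSTART-` lines between documents). The `read_conll` helper and the sample lines are illustrative and not part of NLU; check them against the actual `eng.train` file.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumed layout: first column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample in the assumed four-column layout (token, POS, chunk, NER).
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(sample.splitlines())
for sentence in sentences:
    print(sentence)
```

To try it on the downloaded data, pass an open file handle instead of the sample, for example `read_conll(open('eng.train', encoding='utf-8'))`.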



# 3. Train Deep Learning Classifier using nlu.load('train.ner')

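Your dataset's label column should be named 'y' and the feature column with text data should be named 'text'. In this notebook the pipeline is fitted directly on the CoNLL-2003 file by passing its path as `dataset_path`.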

In [None]:
import nlu
# load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# Since there are no
train_path = '/content/eng.train'
trainable_pipe = nlu.load('train.ner')
fitted_pipe = trainable_pipe.fit(dataset_path=train_path)

# predict with the fitted pipeline and get predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


| | sentence | word_embedding_glove | document | entities_class | entities | token | origin_index |
|---|---|---|---|---|---|---|---|
| 0 | [Donald Trump and Angela Merkel dont share man... | [[-0.5496799945831299, -0.488319993019104, 0.5... | Donald Trump and Angela Merkel dont share many... | [PER, PER] | [Donald Trump, Angela Merkel] | [Donald, Trump, and, Angela, Merkel, dont, sha... | 0 |


In [None]:
# Check out the Parameters of the NER model we can configure
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['named_entity_recognizer_dl'] has settable params:
pipe['named_entity_recognizer_dl'].setMinEpochs(0)  | Info: Minimum number of epochs to train | Currently set to : 0
pipe['named_entity_recognizer_dl'].setMaxEpochs(2)  | Info: Maximum number of epochs to train | Currently set to : 2
pipe['named_entity_recognizer_dl'].setLr(0.001)  | Info: Learning Rate | Currently set to : 0.001
pipe['named_entity_recognizer_dl'].setPo(0.005)  | Info: Learning rate decay coefficient. Real Learning Rage = lr / (1 + po * epoch) | Currently set to : 0.005
pipe['named_entity_recognizer_dl'].setBatchSize(8)  | Info: Batch size | Currently set to : 8
pipe['named_entity_recognizer_dl'].setDropout(0.5)  | Info: Dropout coefficient | Currently set to : 0.5
pipe['named_entity_recognizer_dl'].setVerbose(0)  | Info: Level of verbosity during training | Currently set to : 0
pipe['named_entity_recognizer_dl'
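
The `setPo` description above gives the decay rule as lr / (1 + po * epoch). The sketch below only evaluates that formula with the values currently set in the listing (lr = 0.001, po = 0.005). The function name `effective_learning_rate` is made up for illustration, and whether Spark NLP counts epochs from 0 or 1 is an assumption here.

```python
def effective_learning_rate(lr: float, po: float, epoch: int) -> float:
    """Learning rate after decay, following lr / (1 + po * epoch)."""
    return lr / (1 + po * epoch)


# Values taken from the print_info() listing; epochs are assumed to start at 0.
lr, po = 0.001, 0.005
for epoch in range(3):
    print(epoch, effective_learning_rate(lr, po, epoch))
```

With a decay coefficient this small, the rate barely changes over a couple of epochs. The setters shown in the listing, such as `pipe['named_entity_recognizer_dl'].setMaxEpochs(2)` or `pipe['named_entity_recognizer_dl'].setLr(0.001)`, are where these values would be changed before calling `fit`.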

# 4. Let's use BERT embeddings instead of the default Glove_100d ones!

In [None]:
# We can use nlu.print_components(action='embed') to see every possible embedding we could use. Let's use BERT!
nlu.print_components(action='embed')

For language <en> NLU provides the following Models : 
nlu.load('en.embed') returns Spark NLP model glove_100d
nlu.load('en.embed.glove') returns Spark NLP model glove_100d
nlu.load('en.embed.glove.100d') returns Spark NLP model glove_100d
nlu.load('en.embed.bert') returns Spark NLP model bert_base_uncased
nlu.load('en.embed.bert.base_uncased') returns Spark NLP model bert_base_uncased
nlu.load('en.embed.bert.base_cased') returns Spark NLP model bert_base_cased
nlu.load('en.embed.bert.large_uncased') returns Spark NLP model bert_large_uncased
nlu.load('en.embed.bert.large_cased') returns Spark NLP model bert_large_cased
nlu.load('en.embed.biobert') returns Spark NLP model biobert_pubmed_base_cased
nlu.load('en.embed.biobert.pubmed_base_cased') returns Spark NLP model biobert_pubmed_base_cased
nlu.load('en.embed.biobert.pubmed_large_cased') returns Spark NLP model biobert_pubmed_large_cased
nlu.load('en.embed.biobert.pmc_base_cased') returns Spark NLP model biobert_pmc_base_cased
nlu.lo

In [None]:
# Add bert word embeddings to pipe 
fitted_pipe = nlu.load('bert train.ner').fit(dataset_path=train_path)

# predict with the fitted pipeline and get predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| | sentence | document | word_embedding_bert | entities_class | entities | token | origin_index |
|---|---|---|---|---|---|---|---|
| 0 | [Donald Trump and Angela Merkel dont share man... | Donald Trump and Angela Merkel dont share many... | [[-0.447601318359375, 1.0348621606826782, 0.51... | [PER, PER] | [Donald Trump, Angela Merkel dont] | [Donald, Trump, and, Angela, Merkel, dont, sha... | 0 |


# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Let's load the model from HDD.
This makes offline NLU usage possible!   
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.
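
When working offline, a wrong path is an easy mistake to make, so it can be useful to check the location before loading. The sketch below is a small guard, assuming the stored pipeline is a directory on the local file system. The `existing_model_dir` helper is hypothetical and not part of NLU.

```python
import tempfile
from pathlib import Path


def existing_model_dir(path: str) -> Path:
    """Return the resolved path if it is an existing directory, else raise."""
    model_dir = Path(path).expanduser().resolve()
    if not model_dir.is_dir():
        raise FileNotFoundError(f"No stored pipeline directory at {model_dir}")
    return model_dir


# Demonstration with a temporary directory standing in for a stored pipeline.
with tempfile.TemporaryDirectory() as tmp:
    print(existing_model_dir(tmp))
```

In this notebook, it could be combined with the load call as `nlu.load(path=str(existing_model_dir(stored_model_path)))`.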

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions on laws about cheeseburgers')
preds

| | text | sentence | origin_index | document | entities_class | entities | token | word_embedding_from_disk |
|---|---|---|---|---|---|---|---|---|
| 0 | Donald Trump and Angela Merkel dont share many... | [Donald Trump and Angela Merkel dont share man... | 8589934592 | Donald Trump and Angela Merkel dont share many... | [PER, PER] | [Donald Trump, Angela Merkel dont] | [Donald, Trump, and, Angela, Merkel, dont, sha... | [[-0.6870571374893188, 1.1118954420089722, 0.5... |


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')       | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'] has settable params:
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setStorageRef('SentenceDetectorDLModel_c83c27f46b97')  | Info: storage unique identifier | Currently set to : SentenceDetectorDLModel_c83c27f46b97
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setEncoder(com.johnsnowlabs.nlp.annotators.sentence_detector_dl.S