![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/collab/Training/NLU_training_demo.ipynb)



# Training a Named Entity Recognition (NER) model with NLU 
With the [NER_DL model](https://nlp.johnsnowlabs.com/docs/en/annotators#ner-dl-named-entity-recognition-deep-learning-annotator) from Spark NLP, you can train a deep learning NER model that can reach state-of-the-art results on many NER problems.

This notebook showcases the following features:

- How to train the deep learning NER model
- How to store a pipeline to disk
- How to load the pipeline from disk (enables NLU offline mode)



# 1. Install Java 8 and NLU

In [1]:
import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu > /dev/null


import nlu
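
Since this setup relies on Java 8, it can help to confirm which Java version is actually picked up before loading any pipeline. The following is a minimal sketch, not part of NLU: the helper names `java_major_version` and `installed_java_major_version` are hypothetical, and the parsing assumes the usual `java -version` banner, where Java 8 and earlier report themselves as `1.x`.

```python
import re
import shutil
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output (hypothetical helper)."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Java 8 and earlier use the legacy "1.x" scheme, e.g. "1.8.0_275"
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major_version():
    """Return the major version of the `java` found on PATH, or None if it is missing."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its banner to stderr, so both streams are checked
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr + result.stdout)
```

Calling `installed_java_major_version()` after the install cell should return 8 if the `JAVA_HOME` and `PATH` settings above took effect.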

# 2. Download the CoNLL-2003 NER Dataset

In [2]:
! wget https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train

--2020-11-30 06:56:34--  https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train [following]
--2020-11-30 06:56:34--  https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283420 (3.1M) [text/plain]
Saving to: ‘eng.train’


2020-11-30 06:56:35 (55.8 MB/s) - ‘eng.train’ saved [3283420/3283420]
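
The downloaded `eng.train` file is in CoNLL-2003 format: one token per line with whitespace-separated columns (token, POS tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. To get a feel for the data before training, a small reader along the following lines can be used. This is only a sketch; `read_conll` is a hypothetical helper and the sample lines are illustrative, not a claim about NLU's own loader.

```python
def read_conll(lines):
    """Group CoNLL-style lines into (tokens, ner_tags) pairs, one pair per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers end the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        fields = line.split()
        tokens.append(fields[0])   # first column: the token
        tags.append(fields[-1])    # last column: the NER tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Illustrative sample in the same column layout as eng.train
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
""".splitlines()

sentences = read_conll(sample)
```

To inspect the real file, pass an open file handle instead of the sample, for example `read_conll(open('eng.train'))`.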



# 3. Train Deep Learning NER Model using nlu.load('train.ner')

Your training data should be a CoNLL-formatted file. Pass its path to fit() via the dataset_path parameter.

In [3]:
import nlu
# Load a trainable pipeline by specifying the train. prefix and fit it on the CoNLL training file
train_path = '/content/eng.train'
trainable_pipe = nlu.load('train.ner')
fitted_pipe = trainable_pipe.fit(dataset_path=train_path)

# Predict with the fitted pipeline on a sample sentence and get the predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


| origin_index | pos | entities | entities_confidence | ner_confidence | default_name_embeddings |
|---|---|---|---|---|---|
| 0 | [NNP, NNP, CC, NNP, NNP, NN, NN, JJ, NNS] | Donald Trump | PER | [0.9993000030517578, 0.9976000189781189, 0.999... | [[-0.5496799945831299, -0.488319993019104, 0.5... |
| 0 | [NNP, NNP, CC, NNP, NNP, NN, NN, JJ, NNS] | Angela Merkel | PER | [0.9993000030517578, 0.9976000189781189, 0.999... | [[-0.5496799945831299, -0.488319993019104, 0.5... |


In [4]:
# Check out the Parameters of the NER model we can configure
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['named_entity_recognizer_dl'] has settable params:
pipe['named_entity_recognizer_dl'].setMinEpochs(0)   | Info: Minimum number of epochs to train | Currently set to : 0
pipe['named_entity_recognizer_dl'].setMaxEpochs(2)   | Info: Maximum number of epochs to train | Currently set to : 2
pipe['named_entity_recognizer_dl'].setLr(0.001)      | Info: Learning Rate | Currently set to : 0.001
pipe['named_entity_recognizer_dl'].setPo(0.005)      | Info: Learning rate decay coefficient. Real Learning Rage = lr / (1 + po * epoch) | Currently set to : 0.005
pipe['named_entity_recognizer_dl'].setBatchSize(8)   | Info: Batch size | Currently set to : 8
pipe['named_entity_recognizer_dl'].setDropout(0.5)   | Info: Dropout coefficient | Currently set to : 0.5
pipe['named_entity_recognizer_dl'].setVerbose(0)     | Info: Level of verbosity during training | Currently set to : 0
pipe['named_entity
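
The `setPo` description in the output above gives the decay rule as lr / (1 + po * epoch). As a worked illustration of that formula with the values currently set (lr = 0.001, po = 0.005), the effective learning rate per epoch can be computed as follows. This is a sketch only: `effective_learning_rate` is a hypothetical helper, and counting epochs from 0 is an assumption, since the output does not say where the count starts.

```python
def effective_learning_rate(lr, po, epoch):
    """Apply the decay rule from the parameter description: lr / (1 + po * epoch)."""
    return lr / (1 + po * epoch)


# Values taken from the print_info() output; epoch numbering from 0 is assumed
schedule = [effective_learning_rate(0.001, 0.005, epoch) for epoch in range(3)]

for epoch, rate in enumerate(schedule):
    print(f"epoch {epoch}: {rate:.8f}")
```

With such a small po and only 2 epochs, the decay is nearly negligible; it matters more when setMaxEpochs is raised.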

# 4. Let's use BERT embeddings instead of the default glove_100d ones!

In [5]:
# We can use nlu.print_components(action='embed') to see every possible word embedding we could use. Let's use BERT!
nlu.print_components(action='embed')

For language <en> NLU provides the following Models : 
nlu.load('en.embed') returns Spark NLP model glove_100d
nlu.load('en.embed.glove') returns Spark NLP model glove_100d
nlu.load('en.embed.glove.100d') returns Spark NLP model glove_100d
nlu.load('en.embed.bert') returns Spark NLP model bert_base_uncased
nlu.load('en.embed.bert.base_uncased') returns Spark NLP model bert_base_uncased
nlu.load('en.embed.bert.base_cased') returns Spark NLP model bert_base_cased
nlu.load('en.embed.bert.large_uncased') returns Spark NLP model bert_large_uncased
nlu.load('en.embed.bert.large_cased') returns Spark NLP model bert_large_cased
nlu.load('en.embed.biobert') returns Spark NLP model biobert_pubmed_base_cased
nlu.load('en.embed.biobert.pubmed_base_cased') returns Spark NLP model biobert_pubmed_base_cased
nlu.load('en.embed.biobert.pubmed_large_cased') returns Spark NLP model biobert_pubmed_large_cased
nlu.load('en.embed.biobert.pmc_base_cased') returns Spark NLP model biobert_pmc_base_cased
nlu.lo

In [6]:
# Add BERT word embeddings to the pipe
fitted_pipe = nlu.load('bert train.ner').fit(dataset_path=train_path)

# Predict with the fitted pipeline on a sample sentence and get the predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]


| origin_index | bert_embeddings | pos | entities_confidence | ner_confidence | entities |
|---|---|---|---|---|---|
| 0 | [[-0.447601854801178, 1.0348625183105469, 0.51... | [NNP, NNP, CC, NNP, NNP, NN, NN, JJ, NNS] | PER | [0.7784000039100647, 0.9710999727249146, 0.997... | Donald Trump |
| 0 | [[-0.447601854801178, 1.0348625183105469, 0.51... | [NNP, NNP, CC, NNP, NNP, NN, NN, JJ, NNS] | PER | [0.7784000039100647, 0.9710999727249146, 0.997... | Angela Merkel dont |
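
Note that this model returns "Angela Merkel dont" as one entity. One plausible explanation is that a single token-level tag is wrong: NER models tag each token, and consecutive tokens with the same entity label are then merged into one chunk. The sketch below shows that merging step in plain Python. `group_entities` is a hypothetical helper, not NLU's own converter, and the tag sequence is invented for illustration because the notebook does not print token-level tags.

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens with the same label into (text, label) chunks."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            label, starts_new = None, False
        else:
            prefix, label = tag.split("-", 1)
            # A B- prefix or a change of label opens a new chunk
            starts_new = prefix == "B" or label != current_label
        # Close the running chunk when we leave it or a new one begins
        if current_tokens and (label is None or starts_new):
            chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if label is not None:
            current_tokens.append(token)
            current_label = label
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


tokens = ["Donald", "Trump", "and", "Angela", "Merkel", "dont"]
# Hypothetical tags: "dont" is wrongly tagged I-PER, which extends the second chunk
tags = ["B-PER", "I-PER", "O", "B-PER", "I-PER", "I-PER"]
chunks = group_entities(tokens, tags)
```

If this is what happened, training for more epochs or using a larger BERT model from the list above may reduce such boundary errors.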


# 5. Let's save the model

In [7]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained
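
Training takes a while, so it can be convenient to check whether a stored pipeline is already present before fitting again. The following is a minimal sketch under the assumption that save() writes the pipeline as a directory at the given path; `stored_pipeline_exists` is a hypothetical helper and not part of NLU.

```python
import os


def stored_pipeline_exists(path):
    """Return True if `path` is an existing, non-empty directory (hypothetical helper)."""
    return os.path.isdir(path) and bool(os.listdir(path))


# Possible usage in this notebook (assumes the directory layout described above):
# if stored_pipeline_exists(stored_model_path):
#     pipe = nlu.load(path=stored_model_path)
# else:
#     pipe = nlu.load('bert train.ner').fit(dataset_path=train_path)
#     pipe.save(stored_model_path)
```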


# 6. Let's load the model from disk.
This makes offline NLU usage possible!  
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [8]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions on laws about cheeseburgers')
preds

| origin_index | bert_embeddings | pos | entities_confidence | ner_confidence | entities |
|---|---|---|---|---|---|
| 0 | [[-0.6870577335357666, 1.1118954420089722, 0.5... | [NNP, NNP, CC, NNP, NNP, NN, NN, JJ, NNS, IN, ... | PER | [0.7975000143051147, 0.9343000054359436, 0.995... | Donald Trump |
| 0 | [[-0.6870577335357666, 1.1118954420089722, 0.5... | [NNP, NNP, CC, NNP, NNP, NN, NN, JJ, NNS, IN, ... | PER | [0.7975000143051147, 0.9343000054359436, 0.995... | Angela Merkel dont |


In [9]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')    | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setCustomBounds([])          | Info: characters used to explicitly mark sentence bounds | Currently set to : []
pipe['sentence_detector'].setDetectLists(True)         | Info: whether detect lists during sentence detection | Currently set to : True
pipe['sentence_detector'].setExplodeSentences(False)   | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector'].setMaxLength(99999)          | Info: Set the maximum allowed length for each sentence | Currently set to : 99999
p