![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/named_entity_recognition/NLU_training_NER_demo.ipynb)



# Training a Named Entity Recognition (NER) model with NLU
With the [NER_DL model](https://nlp.johnsnowlabs.com/docs/en/annotators#ner-dl-named-entity-recognition-deep-learning-annotator) from Spark NLP, you can train models that achieve state-of-the-art results on many NER problems.

This notebook showcases the following features:

- How to train the deep learning NER model
- How to store a pipeline to disk
- How to load the pipeline from disk (enables NLU offline mode)



# 1. Colab Setup

In [None]:
# Install the johnsnowlabs library
! pip install -q johnsnowlabs


# 2. Download conll2003 dataset

In [2]:
! wget https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train

--2023-10-27 13:23:45--  https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train [following]
--2023-10-27 13:23:45--  https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283420 (3.1M) [text/plain]
Saving to: ‘eng.train’


2023-10-27 13:23:46 (35.7 MB/s) - ‘eng.train’ saved [3283420/3283420]



# 3. Train the Deep Learning NER model using nlp.load('train.ner')

Your training data should be a CoNLL-formatted file, such as the `eng.train` file downloaded above. Pass its path to `fit` via `dataset_path`.
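To make the expected input concrete, here is a minimal sketch of reading CoNLL-2003 style data in plain Python. It assumes the usual four-column layout (token, POS tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines between documents. The `read_conll` helper and the `SAMPLE` lines are illustrative only and are not part of the johnsnowlabs library.

```python
from typing import List, Tuple

# A few illustrative lines in the four-column CoNLL-2003 layout:
# token, POS tag, chunk tag, NER tag (assumption: whitespace-separated).
SAMPLE = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Return a list of sentences, each a list of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

Running the same helper on the contents of `eng.train` is a quick way to inspect which labels the file contains before training.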

In [3]:
from johnsnowlabs import nlp
# Load a trainable pipeline by specifying the train. prefix and fit it on the CoNLL training file
train_path = '/content/eng.train'
trainable_pipe = nlp.load('train.ner')
fitted_pipe = trainable_pipe.fit(dataset_path=train_path)

# Predict with the fitted pipeline on a sample sentence and get the predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| | document | entities_ner | entities_ner_class | entities_ner_confidence | entities_ner_origin_chunk | entities_ner_origin_sentence | word_embedding_bert |
|---|---|---|---|---|---|---|---|
| 0 | Donald Trump and Angela Merkel dont share many... | Donald Trump | PER | 0.9544 | 0 | 0 | [[-0.44760167598724365, 1.0348622798919678, 0.... |
| 0 | Donald Trump and Angela Merkel dont share many... | Angela Merkel dont | PER | 0.88476664 | 1 | 0 | [[-0.44760167598724365, 1.0348622798919678, 0.... |
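Each predicted entity comes with a confidence score, so a simple post-processing step is to keep only entities above a threshold. The sketch below works on plain dictionaries that mirror the columns and values of the output above. The `filter_by_confidence` helper and the 0.9 threshold are illustrative choices, not part of the library. If `preds` is a pandas DataFrame, as the output suggests, `preds.to_dict('records')` should yield rows in this shape.

```python
# Rows mirroring the prediction output above (values copied from it).
rows = [
    {
        "entities_ner": "Donald Trump",
        "entities_ner_class": "PER",
        "entities_ner_confidence": 0.9544,
    },
    {
        "entities_ner": "Angela Merkel dont",
        "entities_ner_class": "PER",
        "entities_ner_confidence": 0.88476664,
    },
]


def filter_by_confidence(rows, threshold):
    """Keep only the rows whose confidence is at least `threshold`."""
    return [
        row for row in rows
        if float(row["entities_ner_confidence"]) >= threshold
    ]


confident = filter_by_confidence(rows, threshold=0.9)
print([row["entities_ner"] for row in confident])  # prints ['Donald Trump']
```

With this threshold the over-extended chunk `Angela Merkel dont` is dropped, at the cost of losing that entity entirely, so the cut-off is a precision/recall trade-off to tune on your own data.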


In [4]:
# Check out the Parameters of the NER model we can configure
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('shrink')                    | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> component_list['bert_embeddings@small_bert_L2_128'] has settable params:
component_list['bert_embeddings@small_bert_L2_128'].setBatchSize(8)              | Info: Size of every batch | Currently set to : 8
component_list['bert_embeddings@small_bert_L2_128'].setEngine('tensorflow')      | Info: Deep Learning engine used for this model | Currently set to : tensorflow
component_list['bert_embeddings@small_bert_L2_128'].setMaxSentenceLength(128)    | Info: Max sentence length to process | Currently set to : 128
component_list['bert_embeddings@small_bert_L2_128'].setDimension(128)            | Info: Number of embedd

# 4. Let's use BERT embeddings instead of the default Glove_100d ones!

In [8]:
nlp.nlu.print_components(action='embed')


For language <af> NLU provides the following Models : 
nlu.load('af.embed.w2v_cc_300d') returns Spark NLP model_anno_obj w2v_cc_300d
For language <als> NLU provides the following Models : 
nlu.load('als.embed.w2v_cc_300d') returns Spark NLP model_anno_obj w2v_cc_300d
For language <am> NLU provides the following Models : 
nlu.load('am.embed.am_roberta') returns Spark NLP model_anno_obj roberta_embeddings_am_roberta
nlu.load('am.embed.w2v_cc_300d') returns Spark NLP model_anno_obj w2v_cc_300d
nlu.load('am.embed.xlm_roberta') returns Spark NLP model_anno_obj xlm_roberta_base_finetuned_amharic
For language <an> NLU provides the following Models : 
nlu.load('an.embed.w2v_cc_300d') returns Spark NLP model_anno_obj w2v_cc_300d
For language <ar> NLU provides the following Models : 
nlu.load('ar.embed') returns Spark NLP model_anno_obj arabic_w2v_cc_300d
nlu.load('ar.embed.AraBertMo_base_V1') returns Spark NLP model_anno_obj bert_embeddings_AraBertMo_base_V1
nlu.load('ar.embed.Ara_DialectBERT')

In [9]:
# Add bert word embeddings to pipe
fitted_pipe = nlp.load('bert train.ner').fit(dataset_path=train_path)

# Predict with the fitted pipeline on a sample sentence and get the predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| | document | entities_ner | entities_ner_class | entities_ner_confidence | entities_ner_origin_chunk | entities_ner_origin_sentence | word_embedding_bert |
|---|---|---|---|---|---|---|---|
| 0 | Donald Trump and Angela Merkel dont share many... | Donald Trump | PER | 0.9427 | 0 | 0 | [[-0.44760167598724365, 1.0348622798919678, 0.... |
| 0 | Donald Trump and Angela Merkel dont share many... | Angela Merkel dont | PER | 0.9236667 | 1 | 0 | [[-0.44760167598724365, 1.0348622798919678, 0.... |
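The `entities_ner` column holds merged chunks such as `Donald Trump` rather than one label per token. As a rough sketch of how token-level tags are commonly merged into chunks, the helper below groups consecutive tokens that share a label, assuming IOB-style tags (`B-`/`I-` prefixes, `O` for non-entities). The `tags_to_chunks` function is hypothetical, and the tags in the example are made up for illustration; they are not model output and this is not necessarily how the library performs the conversion internally.

```python
def tags_to_chunks(tokens, tags):
    """Group consecutive tokens with the same entity label into chunks."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        # "O" means no entity; otherwise strip the "B-"/"I-" prefix.
        label = None if tag == "O" else tag.split("-", 1)[1]
        # A chunk starts on an explicit "B-" tag or when the label changes.
        starts_new = tag.startswith("B-") or label != current_label
        if current_label is not None and (label is None or starts_new):
            chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if label is not None:
            current_tokens.append(token)
            current_label = label
    # Flush a chunk that runs to the end of the sentence.
    if current_label is not None:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Made-up tags for illustration only.
tokens = ["Donald", "Trump", "and", "Angela", "Merkel"]
tags = ["I-PER", "I-PER", "O", "I-PER", "I-PER"]
print(tags_to_chunks(tokens, tags))
# prints [('Donald Trump', 'PER'), ('Angela Merkel', 'PER')]
```

Under this grouping, a single mislabeled token next to an entity is enough to produce an over-long chunk like `Angela Merkel dont` in the predictions above.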


# 5. Let's save the model

In [10]:
stored_model_path = './models/classifier_dl_trained'
fitted_pipe.save(stored_model_path)
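When re-running the notebook, the target directory may already exist from an earlier run. Whether saving onto an existing path fails or needs an explicit overwrite option depends on the library version, so this is stated here as an assumption. One cautious workaround, sketched below with the standard library only, is to pick a fresh directory name before saving. The `next_free_path` helper and the `_v2`, `_v3` suffix scheme are hypothetical.

```python
import tempfile
from pathlib import Path


def next_free_path(base: str) -> Path:
    """Return `base` if it does not exist yet, else base_v2, base_v3, ..."""
    candidate = Path(base)
    version = 1
    while candidate.exists():
        version += 1
        candidate = Path(f"{base}_v{version}")
    return candidate


# Demonstration inside a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    base = str(Path(tmp) / "ner_trained")
    first = next_free_path(base)   # base itself, since nothing exists yet
    first.mkdir()
    second = next_free_path(base)  # base is taken, so this ends in "_v2"
    print(first.name, second.name)  # prints ner_trained ner_trained_v2
```

In the notebook, the result could be passed on as `fitted_pipe.save(str(next_free_path(stored_model_path)))`, provided the same path is then used for loading.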

# 6. Let's load the model from disk
This makes offline NLU usage possible!  
Call `nlp.load(path=path_to_the_pipe)` to load a model/pipeline from disk.

In [11]:
hdd_pipe = nlp.load(path=stored_model_path)

preds = hdd_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions on laws about cheeseburgers')
preds



| | document | entities_from_disk | entities_from_disk_class | entities_from_disk_confidence | entities_from_disk_origin_chunk | entities_from_disk_origin_sentence | word_embedding_from_disk |
|---|---|---|---|---|---|---|---|
| 0 | Donald Trump and Angela Merkel dont share many... | Donald Trump | PER | 0.9282 | 0 | 0 | [[-0.687057375907898, 1.1118954420089722, 0.58... |
| 0 | Donald Trump and Angela Merkel dont share many... | Angela Merkel dont | PER | 0.8248 | 1 | 0 | [[-0.687057375907898, 1.1118954420089722, 0.58... |


In [12]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> component_list['sentence_detector_dl'] has settable params:
component_list['sentence_detector_dl'].setCustomBounds([])  | Info: characters used to explicitly mark sentence bounds | Currently set to : []
component_list['sentence_detector_dl'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
component_list['sentence_detector_dl'].setMaxLength(99999)  | Info: Set the maximum allowed length for each sentence | Currently set to : 99999
component_list['sentence_detector_dl'].setMinLength(0)     | Info: Set the minimum a