![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/collab/Training/NLU_training_demo.ipynb)



# Training a Part of Speech (POS) tagger with NLU 
With the [POS tagger](https://nlp.johnsnowlabs.com/docs/en/annotators#postagger-part-of-speech-tagger) from Spark NLP you can achieve state-of-the-art results on POS tagging problems.
It uses an Averaged Perceptron model under the hood.
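
To give a feel for what an averaged perceptron does, here is a toy, pure-Python sketch. It is not Spark NLP's implementation: the class `AveragedPerceptron`, the feature template in `features`, and the two tiny French sentences are all assumptions made up for illustration. The idea it shows is that weights change only when the tagger makes a mistake, and that the final model uses each weight's average over all training steps rather than its last value.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Toy multi-class averaged perceptron (illustrative only)."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))  # feature -> tag -> weight
        self._totals = defaultdict(float)  # (feature, tag) -> weight summed over steps
        self._stamps = defaultdict(int)    # (feature, tag) -> step of the last change
        self.step = 0

    def predict(self, feats):
        scores = {tag: 0.0 for tag in self.classes}
        for feat in feats:
            for tag, weight in self.weights.get(feat, {}).items():
                scores[tag] += weight
        # Ties are broken by tag name so that results are deterministic.
        return max(self.classes, key=lambda tag: (scores[tag], tag))

    def update(self, truth, guess, feats):
        self.step += 1
        if truth == guess:
            return  # weights only change on mistakes
        for feat in feats:
            self._bump(feat, truth, 1.0)
            self._bump(feat, guess, -1.0)

    def _bump(self, feat, tag, delta):
        key = (feat, tag)
        # Credit the old weight for every step it was in effect.
        self._totals[key] += (self.step - self._stamps[key]) * self.weights[feat][tag]
        self._stamps[key] = self.step
        self.weights[feat][tag] += delta

    def average(self):
        """Replace every weight by its average over all training steps."""
        for feat, per_tag in self.weights.items():
            for tag, weight in per_tag.items():
                key = (feat, tag)
                total = self._totals[key] + (self.step - self._stamps[key]) * weight
                per_tag[tag] = total / self.step if self.step else 0.0


def features(words, i):
    # Hypothetical, deliberately tiny feature template.
    prev = words[i - 1] if i > 0 else "<s>"
    word = words[i]
    return ["bias", "w=" + word.lower(), "suf=" + word[-2:].lower(), "prev=" + prev.lower()]


def train(sentences, epochs=5):
    model = AveragedPerceptron({tag for sent in sentences for _, tag in sent})
    for _ in range(epochs):
        for sent in sentences:
            words = [word for word, _ in sent]
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i)
                model.update(gold, model.predict(feats), feats)
    model.average()
    return model


def tag(model, words):
    return [model.predict(features(words, i)) for i in range(len(words))]


# Made-up toy training data, not taken from the UD French corpus.
toy = [
    [("le", "DET"), ("chat", "NOUN"), ("dort", "VERB")],
    [("la", "DET"), ("souris", "NOUN"), ("mange", "VERB")],
]
model = train(toy)
print(tag(model, ["le", "chat", "dort"]))
```

Averaging damps the effect of late, noisy updates, which is why this family of models tends to generalize better than a plain perceptron.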

This notebook showcases the following features:

- How to train the POS tagger
- How to store a pipeline to disk
- How to load the pipeline from disk (enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu  pyspark==2.4.7  > /dev/null 

import nlu
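
If the import fails, a common cause is that Java is not visible to the Python process. The following is a small sanity check, sketched under the assumption that the cell above is what set `JAVA_HOME` and `PATH`; the helper name `java_is_configured` is made up for this example.

```python
import os
import shutil


def java_is_configured(env=None):
    """Return True if JAVA_HOME is set and a `java` binary is found on PATH."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME", "")
    java_binary = shutil.which("java", path=env.get("PATH", ""))
    return bool(java_home) and java_binary is not None


print("Java configured:", java_is_configured())
```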

# 2. Download French POS dataset

In [None]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt

--2020-12-14 02:28:26--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.96.21
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.96.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3565213 (3.4M) [text/plain]
Saving to: ‘UD_French-GSD_2.3.txt’


2020-12-14 02:28:26 (7.76 MB/s) - ‘UD_French-GSD_2.3.txt’ saved [3565213/3565213]
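
Before training, it can help to look at the data. The sketch below assumes that each line of the file holds one sentence of whitespace-separated `token_TAG` items with `_` as the delimiter; this is an assumption about the file layout, so check a few lines of the download before relying on it. The helper `parse_tagged_line` and the sample line are hypothetical.

```python
def parse_tagged_line(line, delimiter="_"):
    """Split one 'token_TAG token_TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST delimiter, so tokens that contain
        # the delimiter themselves keep their full surface form.
        token, sep, tag = item.rpartition(delimiter)
        if not sep:
            continue  # skip items that carry no tag
        pairs.append((token, tag))
    return pairs


# Made-up sample line in the assumed format.
sample = "Le_DET chat_NOUN dort_VERB ._PUNCT"
print(parse_tagged_line(sample))

# To inspect the real file (path as used in this notebook):
# with open("UD_French-GSD_2.3.txt", encoding="utf-8") as f:
#     print(parse_tagged_line(f.readline()))
```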



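# 3. Train the POS tagger using nlu.load('train.pos')

Unlike NLU's classifier training, which expects a dataset with a label column named 'y' and a text column named 'text', the POS tagger is fitted here directly on the downloaded file by passing its path as `dataset_path`.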

In [None]:
import nlu
# Load a trainable pipeline by specifying the train. prefix and fit it on the tagged training file
train_path = '/content/UD_French-GSD_2.3.txt'
trainable_pipe = nlu.load('train.pos')
fitted_pipe = trainable_pipe.fit(dataset_path=train_path)

# Predict with the fitted pipeline and display the predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

| origin_index | pos | token |
|---|---|---|
| 0 | PROPN | Donald |
| 0 | PROPN | Trump |
| 0 | CCONJ | and |
| 0 | PROPN | Angela |
| 0 | PROPN | Merkel |
| 0 | PRON | dont |
| 0 | VERB | share |
| 0 | ADJ | many |
| 0 | NOUN | oppinions |
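
A quick way to summarize such predictions is to count how often each tag occurs. The sketch below copies the (pos, token) rows from the output above into a plain list, so it runs without Spark or NLU; the variable names are made up for illustration.

```python
from collections import Counter

# Rows copied from the prediction output above: (pos, token).
rows = [
    ("PROPN", "Donald"),
    ("PROPN", "Trump"),
    ("CCONJ", "and"),
    ("PROPN", "Angela"),
    ("PROPN", "Merkel"),
    ("PRON", "dont"),
    ("VERB", "share"),
    ("ADJ", "many"),
    ("NOUN", "oppinions"),
]

# Count how many tokens received each POS tag.
tag_counts = Counter(pos for pos, _ in rows)
for pos, count in tag_counts.most_common():
    print(pos, count)
```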


# 4. Let's save the model

In [None]:
stored_model_path = './models/pos_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/pos_trained


# 5. Let's load the model from disk
This makes offline NLU usage possible!  
Call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions on laws about cheeseburgers')
preds

Fitting on empty Dataframe, could not infer correct training method!


| origin_index | pos | token |
|---|---|---|
| 0 | PROPN | Donald |
| 0 | PROPN | Trump |
| 0 | CCONJ | and |
| 0 | PROPN | Angela |
| 0 | PROPN | Merkel |
| 0 | PRON | dont |
| 0 | VERB | share |
| 0 | ADJ | many |
| 0 | NOUN | oppinions |
| 0 | PRON | on |
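
When running offline, a mistyped path is an easy mistake to make. Here is a minimal sketch of a guard around the `nlu.load(path=...)` call shown above; the wrapper `load_saved_pipe` is a hypothetical helper, not part of NLU, and it assumes the saved pipeline is a directory on the local file system.

```python
from pathlib import Path


def load_saved_pipe(path):
    """Load a stored NLU pipeline, failing early if the directory is missing."""
    pipe_dir = Path(path)
    if not pipe_dir.is_dir():
        raise FileNotFoundError(f"No saved pipeline directory at {pipe_dir}")
    import nlu  # imported lazily so the path check works without NLU installed
    return nlu.load(path=str(pipe_dir))


# Usage, with the path from the save step above:
# hdd_pipe = load_saved_pipe('./models/pos_trained')
```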


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setCustomBounds([])  | Info: characters used to explicitly mark sentence bounds | Currently set to : []
pipe['sentence_detector'].setDetectLists(True)  | Info: whether detect lists during sentence detection | Currently set to : True
pipe['sentence_detector'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector'].setMaxLength(99999)  | Info: Set the maximum allowed length for each sentence | Currently set to : 99999
pipe['sentence_detector'].s