![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/part_of_speech/NLU_training_POS_demo.ipynb)



# Training a Part of Speech (POS) tagger with NLU
With the [POS tagger](https://nlp.johnsnowlabs.com/docs/en/annotators#postagger-part-of-speech-tagger) from Spark NLP you can train a part-of-speech tagger on your own annotated corpus.
It uses an Averaged Perceptron model under the hood.

This notebook showcases the following features:

- How to train the POS tagger
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)
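
To make the "Averaged Perceptron" idea mentioned above more concrete, here is a minimal pure-Python sketch of the general technique. It is not Spark NLP's implementation: the feature set, the function names (`extract_features`, `predict_tag`, `train_averaged_perceptron`, `tag_sentence`) and the toy French sentences are all illustrative assumptions.

```python
from collections import defaultdict


def extract_features(tokens, i, prev_tag):
    """Toy feature set (an assumption): the word, its 3-character suffix, the previous tag."""
    word = tokens[i].lower()
    return ["bias", "word=" + word, "suffix=" + word[-3:], "prev_tag=" + prev_tag]


def predict_tag(weights, tags, feats):
    """Return the tag whose summed feature weights are highest."""
    return max(tags, key=lambda tag: sum(weights.get((f, tag), 0.0) for f in feats))


def train_averaged_perceptron(sentences, epochs=10):
    """sentences: list of [(token, tag), ...]. Returns (averaged_weights, tags)."""
    tags = sorted({tag for sent in sentences for _, tag in sent})
    weights = defaultdict(float)  # current perceptron weights
    totals = defaultdict(float)   # running sum of the weights after every step
    steps = 0
    for _ in range(epochs):
        for sent in sentences:
            tokens = [tok for tok, _ in sent]
            prev_tag = "<START>"
            for i, (_, gold) in enumerate(sent):
                feats = extract_features(tokens, i, prev_tag)
                guess = predict_tag(weights, tags, feats)
                if guess != gold:
                    # Perceptron update: reward the gold tag, penalise the wrong guess.
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
                # Naive averaging: accumulate the full weight vector after each step.
                for key, value in weights.items():
                    totals[key] += value
                steps += 1
                prev_tag = gold  # use the gold previous tag while training
    averaged = {key: total / steps for key, total in totals.items()}
    return averaged, tags


def tag_sentence(weights, tags, tokens):
    """Greedy left-to-right tagging with the averaged weights."""
    prev_tag, output = "<START>", []
    for i in range(len(tokens)):
        prev_tag = predict_tag(weights, tags, extract_features(tokens, i, prev_tag))
        output.append(prev_tag)
    return output


# Made-up toy corpus, only for illustration.
TOY_CORPUS = [
    [("le", "DET"), ("chat", "NOUN"), ("dort", "VERB")],
    [("la", "DET"), ("maison", "NOUN"), ("brille", "VERB")],
    [("le", "DET"), ("chien", "NOUN"), ("mange", "VERB")],
]

avg_weights, tag_set = train_averaged_perceptron(TOY_CORPUS)
print(tag_sentence(avg_weights, tag_set, ["le", "chat", "dort"]))
```

Averaging the weights over all training steps, rather than keeping only the final ones, is what distinguishes this from a plain perceptron; it tends to make the model less sensitive to the last few updates.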



# 1. Colab Setup

In [None]:
# Install the johnsnowlabs library
! pip install -q johnsnowlabs

# 2. Download French POS dataset

In [2]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt

--2023-10-27 13:16:24--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.10.150, 54.231.138.136, 52.217.116.224, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.10.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3565213 (3.4M) [text/plain]
Saving to: ‘UD_French-GSD_2.3.txt’


2023-10-27 13:16:25 (14.4 MB/s) - ‘UD_French-GSD_2.3.txt’ saved [3565213/3565213]
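
Before training it can help to look at how the file encodes tokens and tags. The sketch below assumes, without having verified it against this file, that each line holds one sentence of whitespace-separated `token_TAG` items with `_` as the delimiter. The helper name `parse_tagged_line` and the sample line are made up for illustration.

```python
def parse_tagged_line(line, delimiter="_"):
    """Split one line of `token<delimiter>TAG` items into (token, tag) tuples.

    The layout and the default delimiter are assumptions about the file format.
    """
    pairs = []
    for item in line.split():
        # Split on the LAST delimiter so tokens that contain it stay intact.
        token, sep, tag = item.rpartition(delimiter)
        if not sep or not token or not tag:
            continue  # skip items that do not look like a token/tag pair
        pairs.append((token, tag))
    return pairs


# Made-up sample line, not copied from the downloaded file.
sample_line = "Le_DET chat_NOUN dort_VERB ._PUNCT"
print(parse_tagged_line(sample_line))
```

To apply it to the real data, read the first few lines of `UD_French-GSD_2.3.txt` and pass each one to the helper; if most lines come back empty, the delimiter assumption is probably wrong.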



# 3. Train the POS tagger using nlp.load('train.pos')

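Your dataset label column should be named 'y' and the feature column with text data should be named 'text' when you fit on a DataFrame. In this notebook the pipeline is instead fitted directly on the downloaded file by passing its path as `dataset_path`.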

In [3]:
from johnsnowlabs import nlp
# Load a trainable pipeline by specifying the 'train.' prefix and fit it on a dataset with label and text columns
# Since there are no
train_path = '/content/UD_French-GSD_2.3.txt'
trainable_pipe = nlp.load('train.pos')
fitted_pipe = trainable_pipe.fit(dataset_path=train_path)

# Predict with the fitted pipeline and get the predictions
preds = fitted_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions')
preds

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,pos,token
0,9_NUM,Donald
0,1_NUM,Trump
0,6_NUM,and
0,7_NUM,Angela
0,7_NUM,Merkel
0,7_NUM,dont
0,7_NUM,share
0,6_NUM,many
0,940_NUM,oppinions
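
A quick sanity check on the predictions is to count the predicted tags and flag any that fall outside the expected tag set. The sketch below assumes the training data uses the 17 Universal Dependencies UPOS tags, which the dataset name suggests but this notebook does not confirm. The rows are copied from the output above, and `summarize_tags` is a hypothetical helper.

```python
from collections import Counter

# The 17 universal POS tags defined by Universal Dependencies.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def summarize_tags(rows):
    """rows: iterable of (pos, token). Returns (tag counts, rows whose tag is not a UPOS tag)."""
    rows = list(rows)
    counts = Counter(pos for pos, _ in rows)
    unexpected = [(pos, token) for pos, token in rows if pos not in UPOS_TAGS]
    return counts, unexpected


# (pos, token) pairs taken from the prediction output shown above.
predicted_rows = [
    ("9_NUM", "Donald"), ("1_NUM", "Trump"), ("6_NUM", "and"),
    ("7_NUM", "Angela"), ("7_NUM", "Merkel"), ("7_NUM", "dont"),
    ("7_NUM", "share"), ("6_NUM", "many"), ("940_NUM", "oppinions"),
]

tag_counts, unexpected_rows = summarize_tags(predicted_rows)
print(tag_counts.most_common())
print(len(unexpected_rows), "rows carry a tag outside the UPOS set")
```

With the actual DataFrame, the rows could be built with `zip(preds["pos"], preds["token"])`, since `pos` and `token` are the column names in the output. If many tags fall outside the expected set, the delimiter used to split tokens from tags in the training file is worth checking.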


# 4. Let's save the model

In [4]:
stored_model_path = './models/pos_trained'
fitted_pipe.save(stored_model_path)
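
When re-running the notebook, the target directory may already contain an older model. How `save` behaves in that case is not covered here, so one cautious option is to clear the directory first. The helper below is a hypothetical, standard-library-only sketch; `fresh_model_dir` is not part of NLU.

```python
import shutil
from pathlib import Path


def fresh_model_dir(path):
    """Remove a stale model directory (if any) and make sure its parent exists.

    Returns the path as a string so it can be passed on to a save call.
    """
    target = Path(path)
    if target.exists():
        shutil.rmtree(target)  # delete the previous model files
    target.parent.mkdir(parents=True, exist_ok=True)
    return str(target)
```

Usage would look like `fitted_pipe.save(fresh_model_dir('./models/pos_trained'))`. Note that this deletes whatever is stored at that path, so point it only at directories that hold models you can regenerate.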

# 5. Let's load the model from disk
This makes offline NLU usage possible!  
Call nlp.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [5]:
hdd_pipe = nlp.load(path=stored_model_path)

preds = hdd_pipe.predict('Donald Trump and Angela Merkel dont share many oppinions on laws about cheeseburgers')
preds



Unnamed: 0,pos,token
0,9_NUM,Donald
0,1_NUM,Trump
0,6_NUM,and
0,7_NUM,Angela
0,7_NUM,Merkel
0,7_NUM,dont
0,7_NUM,share
0,7_NUM,many
0,7_NUM,oppinions
0,7_NUM,on


In [6]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> component_list['sentence_detector_dl'] has settable params:
component_list['sentence_detector_dl'].setCustomBounds([])  | Info: characters used to explicitly mark sentence bounds | Currently set to : []
component_list['sentence_detector_dl'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
component_list['sentence_detector_dl'].setMaxLength(99999)  | Info: Set the maximum allowed length for each sentence | Currently set to : 99999
component_list['sentence_detector_dl'].setMinLength(0)  | Info: Set the minimum allo