![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **POSTagger (Part-of-Speech Tagger)**

This notebook will cover the different parameters and usages of `POSTagger`. 

**📖 Learning Objectives:**

1. Understand the basics of `part-of-speech (POS) tagging` and how it can be useful in natural language processing applications.

2. Learn about potential use cases for `POS tagging`, such as named entity recognition and dependency parsing.

**🔗 Helpful Links:**

- Documentation : [PerceptronModel](https://nlp.johnsnowlabs.com/docs/en/annotators#postagger-part-of-speech-tagger)

- Python Docs : [PerceptronModel](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/pos/perceptron/index.html)

- Scala Docs : [PerceptronModel](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/pos/perceptron/PerceptronModel)


## **📜 Background**
`Part-of-speech (POS) tagging` is the process of labeling each word in a text with its part of speech, such as noun, verb, or adjective. `POS tagging` is a fundamental task in natural language processing, and its output feeds many downstream applications such as named entity recognition, relation extraction, and dependency parsing.

## **🎬 Colab Setup**

In [None]:
! pip install -q pyspark==3.1.2  spark-nlp==4.2.4

In [None]:
import sys
sys.path.append('../../')

import pandas as pd

# Spark and Spark ML
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains
import pyspark.sql.functions as F

# Spark NLP
import sparknlp
from sparknlp.base import DocumentAssembler, Finisher, LightPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`, `DOCUMENT`

- Output: `POS`

## **🔎 Parameters**

- `NONE`








In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    posTagger
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers"]]).toDF("text")

result = pipeline.fit(data).transform(data)

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]


In [None]:
result.selectExpr("explode(pos) as pos").show(truncate=False)

+----------------------------------------------------------+
|pos                                                       |
+----------------------------------------------------------+
|{pos, 0, 4, NNP, {word -> Peter, sentence -> 0}, []}      |
|{pos, 6, 11, NNP, {word -> Pipers, sentence -> 0}, []}    |
|{pos, 13, 21, NNS, {word -> employees, sentence -> 0}, []}|
|{pos, 23, 25, VBP, {word -> are, sentence -> 0}, []}      |
|{pos, 27, 33, VBG, {word -> picking, sentence -> 0}, []}  |
|{pos, 35, 39, NNS, {word -> pecks, sentence -> 0}, []}    |
|{pos, 41, 42, IN, {word -> of, sentence -> 0}, []}        |
|{pos, 44, 50, JJ, {word -> pickled, sentence -> 0}, []}   |
|{pos, 52, 58, NNS, {word -> peppers, sentence -> 0}, []}  |
+----------------------------------------------------------+
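
Each row printed above carries a character span (`begin`, `end`), the tag itself (`result`), and a metadata map holding the tagged word. The tags appear to follow the Penn Treebank convention. As a rough, Spark-free sketch of how such rows could be read, the snippet below pairs each word with its tag and a short description. `PosAnnotation`, `PENN_TAGS`, and `describe` are hypothetical helpers written for this illustration and are not part of Spark NLP; the sample rows are copied from the output above.

```python
from dataclasses import dataclass, field

# Penn Treebank descriptions for the tags that appear in the output above.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "VBP": "verb, non-3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
}


@dataclass
class PosAnnotation:
    """Hypothetical plain-Python mirror of one exploded `pos` row."""
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)


def describe(annotations):
    """Return (word, tag, description) triples; unlisted tags map to 'unknown'."""
    return [
        (a.metadata.get("word"), a.result, PENN_TAGS.get(a.result, "unknown"))
        for a in annotations
    ]


# Three rows copied from the printed output above.
rows = [
    PosAnnotation(0, 4, "NNP", {"word": "Peter", "sentence": "0"}),
    PosAnnotation(23, 25, "VBP", {"word": "are", "sentence": "0"}),
    PosAnnotation(44, 50, "JJ", {"word": "pickled", "sentence": "0"}),
]

for word, tag, meaning in describe(rows):
    print(f"{word:<8} {tag:<4} {meaning}")
```

The lookup table covers only the six tags seen in this example; a fuller mapping would be needed for other sentences.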



**When to use the `PerceptronModel`**

* Some downstream annotators, for example `NerCrfModel` and `DependencyParserModel`, require POS tags as an input column.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings")

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

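# NOTE: created without .pretrained(), so this stage carries no trained model.
# The cell appears to run only because just the `ner` column is selected below;
# load a trained model before reading the `dependency` column.
dependencyParser = DependencyParserModel() \
    .setInputCols(["sentence", "pos", "token"]) \
    .setOutputCol("dependency")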

nerTagger = NerCrfModel.pretrained() \
    .setInputCols(["sentence", "token", "word_embeddings", "pos"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    posTagger,
    dependencyParser,
    nerTagger
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
ner_crf download started this may take some time.
Approximate size to download 10.2 MB
[OK!]
+------------------------------------+
|result                              |
+------------------------------------+
|[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+
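
The `ner` result above is a flat list with one tag per token, where `O` marks tokens outside any entity and the `I-` prefix marks tokens inside one. A minimal sketch of turning such a list into entity spans follows. `entity_spans` is a hypothetical helper written for this illustration, not a Spark NLP function, and it assumes that a `B-` prefix or a change of entity type starts a new span. The returned indices refer to positions in the tag list, which line up with the `token` column.

```python
def entity_spans(tags):
    """Group per-token NER tags into (start, end_exclusive, entity_type) spans."""
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        # An "O" tag, a "B-" prefix, or a change of type closes the open span.
        closes = tag == "O" or prefix == "B" or entity_type != current_type
        if closes and start is not None:
            spans.append((start, i, current_type))
            start, current_type = None, None
        # Any non-"O" tag opens a span when none is in progress.
        if tag != "O" and start is None:
            start, current_type = i, entity_type
    if start is not None:
        spans.append((start, len(tags), current_type))
    return spans


# Tags copied from the printed `ner.result` row above.
tags = ["I-ORG", "O", "O", "I-PER", "O", "O", "I-LOC", "O"]
print(entity_spans(tags))  # → [(0, 1, 'ORG'), (3, 4, 'PER'), (6, 7, 'LOC')]
```

Returning index spans rather than strings keeps the helper independent of how the tokenizer split the sentence.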

