![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/12.03.Chunker.ipynb)

#   **Chunker**

This notebook covers the parameters and usage of `Chunker`, an annotator that matches patterns of part-of-speech (POS) tags to extract meaningful phrases from a document.

**📖 Learning Objectives:**

1. Understand how part-of-speech tags are mapped onto each sentence so that regular expressions over those tags can extract phrases.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [Chunker](https://nlp.johnsnowlabs.com/docs/en/annotators#chunker)

- Python Docs : [Chunker](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/chunker/index.html)

- Scala Docs : [Chunker](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/Chunker.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).

## **🎬 Colab Setup**

In [None]:
# Install PySpark and Spark NLP
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT, POS`

- Output: `CHUNK`

## **🔎 Parameters**


- `regexParsers`: (list of strings) --> An array of grammar-based chunk parsers, each written as a regex pattern over POS tags.

### `setRegexParsers(patterns)`

Sets the list of regex patterns used to match chunks, for example `["<DT>?<JJ>*<NN>"]`.
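To make the idea of a tag pattern concrete, here is a minimal pure-Python sketch of how a pattern over POS tags can select token spans. It is an illustration only, not Spark NLP's implementation: the helper `chunk_by_tags` is hypothetical, the tags in `tagged` are hand-assigned for the example, and tag names are treated as literals.

```python
import re

def chunk_by_tags(tagged_tokens, tag_pattern):
    """Return the phrases whose POS-tag sequence matches tag_pattern.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    # Encode the sentence as one string of tags, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)

    # Character offset in tag_string where each token's tag begins.
    starts, pos = [], 0
    for _, tag in tagged_tokens:
        starts.append(pos)
        pos += len(tag) + 2  # tag plus the two angle brackets

    # Wrap each <TAG> in a non-capturing group so ?, * and + apply to whole tags.
    regex = re.sub(
        r"<([^>]+)>",
        lambda m: "(?:<" + re.escape(m.group(1)) + ">)",
        tag_pattern,
    )

    chunks = []
    for m in re.finditer(regex, tag_string):
        if m.start() == m.end():
            continue  # skip empty matches
        first = starts.index(m.start())
        last = max(i for i, s in enumerate(starts) if s < m.end())
        chunks.append(" ".join(tok for tok, _ in tagged_tokens[first:last + 1]))
    return chunks

# Hand-assigned tags, assumed for this example only.
tagged = [("the", "DT"), ("British", "JJ"), ("mathematician", "NN"),
          ("developed", "VBD"), ("the", "DT"), ("concept", "NN")]

print(chunk_by_tags(tagged, "<DT>?<JJ>*<NN>"))
```

With these tags, the pattern reads as "an optional determiner, any number of adjectives, then a noun", so the verb `developed` separates the two matched phrases.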

In [22]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_anc", 'en')\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos")

chunker = Chunker()\
    .setInputCols(["sentence", "pos"])\
    .setOutputCol("chunk")\
    .setRegexParsers(["<NNP>+", "<NNS>+"])

nlpPipeline = Pipeline(stages=[documentAssembler, 
                              sentence,
                              tokenizer,
                              pos,
                              chunker])

sample_text = """
One of the pioneers of Artificial Intelligence (AI) is Alan Turing, the British mathematician who developed the concept of the Turing machine, 
a theoretical device that laid the groundwork for modern computing. Other early contributors to the field include John McCarthy, Marvin Minsky, and Claude Shannon, 
who made significant contributions to the development of AI algorithms and techniques. More recently, AI has become a major focus for many tech giants, including 
Google, Microsoft, Amazon, and IBM, who are investing heavily in research and development in this area. These companies have also acquired or partnered with startups and 
smaller companies that are working on cutting-edge AI applications, such as self-driving cars, personalized medicine, and predictive maintenance. 
There are also many researchers and academics who are advancing the field of AI through their work. These include Yoshua Bengio, Geoffrey Hinton, and Yann LeCun, 
who are known for their contributions to the development of deep learning algorithms and techniques.
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(50, truncate=False)

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
+-------------------------------------------------------------------------+
|result                                                                   |
+-------------------------------------------------------------------------+
|{chunk, 24, 46, Artificial Intelligence, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 49, 50, AI, {sentence -> 0, chunk -> 1}, []}                     |
|{chunk, 56, 66, Alan Turing, {sentence -> 0, chunk -> 2}, []}            |
|{chunk, 128, 133, Turing, {sentence -> 0, chunk -> 3}, []}               |
|{chunk, 12, 19, pioneers, {sentence -> 0, chunk -> 4}, []}               |
|{chunk, 259, 271, John McCarthy, {sentence -> 1, chunk -> 0}, []}        |
|{chunk, 274, 286, Marvin Minsky, {sentence -> 1, chunk -> 1}, []}        |
|{chunk, 293, 306, Claude Shannon, {sentence -> 1, chunk -> 2}, []}       |
|{chunk, 367, 368, AI, {sentence -> 1, chunk -> 3}, []}                 

In [23]:
# F was not imported in the setup cell, so import it before use
import pyspark.sql.functions as F

result_df = result.select(F.explode(F.arrays_zip(result.chunk.result,
                                                 result.chunk.begin, 
                                                 result.chunk.end)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end")).toPandas()

result_df.head(50)

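|   | chunk                   | begin | end |
|---|-------------------------|-------|-----|
| 0 | Artificial Intelligence | 24    | 46  |
| 1 | AI                      | 49    | 50  |
| 2 | Alan Turing             | 56    | 66  |
| 3 | Turing                  | 128   | 133 |
| 4 | pioneers                | 12    | 19  |
| 5 | John McCarthy           | 259   | 271 |
| 6 | Marvin Minsky           | 274   | 286 |
| 7 | Claude Shannon          | 293   | 306 |
| 8 | AI                      | 367   | 368 |
| 9 | contributors            | 225   | 236 |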
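The `begin` and `end` values in the output above are consistent with character offsets into the input text, with `end` inclusive and the leading newline of `sample_text` counted as position 0. The sketch below checks this in plain Python against the first line of the sample text. The helper `span` and the variable names are made up for illustration, and the rows are copied from the output shown above.

```python
# First line of sample_text, including the newline that follows the opening quotes.
text = "\nOne of the pioneers of Artificial Intelligence (AI) is Alan Turing,"

# (chunk, begin, end) rows copied from the output above.
rows = [
    ("Artificial Intelligence", 24, 46),
    ("AI", 49, 50),
    ("Alan Turing", 56, 66),
    ("pioneers", 12, 19),
]

def span(source, begin, end):
    """Slice source using an inclusive end offset (hypothetical helper)."""
    return source[begin:end + 1]

for chunk, begin, end in rows:
    assert span(text, begin, end) == chunk

# Within a sentence the chunks are grouped by pattern (<NNP>+ matches first,
# then <NNS>+), so sort by begin to recover reading order.
in_reading_order = [chunk for chunk, begin, _ in sorted(rows, key=lambda r: r[1])]
print(in_reading_order)
```

Sorting by `begin` is useful whenever several patterns are passed to `setRegexParsers`, since `pioneers` appears after the proper-noun chunks in the output even though it occurs earlier in the sentence.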
