![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]()

# ChunkTokenizer
In this notebook, we will examine the `ChunkTokenizer` annotator.

This annotator tokenizes and flattens extracted NER chunks. The `ChunkTokenizer` will split the extracted NER CHUNK type annotations and will create TOKEN type annotations. The result is then flattened, and a single array is produced.
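
To make the split-and-flatten idea concrete, here is a minimal plain-Python sketch. It is not Spark NLP's implementation: the helper name `flatten_chunks` is hypothetical, and whitespace splitting is a simplifying assumption standing in for the annotator's tokenization rules.

```python
from typing import List


def flatten_chunks(chunks: List[str]) -> List[str]:
    """Split every chunk on whitespace and return one flat list of tokens."""
    tokens: List[str] = []
    for chunk in chunks:
        # str.split() with no argument splits on runs of whitespace
        tokens.extend(chunk.split())
    return tokens


print(flatten_chunks(["Peter Parker", "New York"]))
# ['Peter', 'Parker', 'New', 'York']
```

The result is a single list rather than one list per chunk, which mirrors the flattened `TOKEN` array described above.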


**📖 Learning Objectives:**

1. Understand how to split chunks into tokens in different ways.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

Python Documentation: [ChunkTokenizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/token/chunk_tokenizer/index.html#sparknlp.annotator.token.chunk_tokenizer.ChunkTokenizer)

Scala Documentation: [ChunkTokenizer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ChunkTokenizer.html)


## **📜 Background**


Some Spark NLP pipelines need to split chunks into tokens. `ChunkTokenizer` performs this operation, and its behavior can be tuned through several parameters.

## **🎬 Colab Setup**

In [None]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.3.0 spark-nlp==4.2.6

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK` <br/>
- Output: `TOKEN`

## **🔎 Parameters**


- `caseSensitiveExceptions` *(Boolean)*: Whether exceptions are matched case-sensitively (Default: true)
- `contextChars` *(List[str])*: Character list used to separate from token boundaries (Default: Array(".", ",", ";", ":", "!", "?", "*", "-", "(", ")", "\"", "'"))
- `exceptions` *(List[str])*: Words that won't be affected by tokenization rules
- `exceptionsPath` *(String)*: Path to a file containing a list of exceptions
- `infixPatterns` *(List[str])*: Regex patterns that match tokens within a single target
- `maxLength` *(Integer)*: The maximum allowed length for each token
- `minLength` *(Integer)*: The minimum allowed length for each token
- `prefixPattern` *(String)*: Regex with groups that begins with \\A to match the target prefix. Overrides the `contextChars` parameter
- `splitChars` *(List[str])*: Character list used to split tokens from the inside
- `splitPattern` *(String)*: Pattern used to split tokens from the inside. Takes priority over `splitChars`. This pattern is applied to the tokens that were previously extracted with the target pattern
- `suffixPattern` *(String)*: Regex with groups that ends with \\z to match the target suffix
- `targetPattern` *(String)*: Pattern to grab from text as token candidates

First, let's build a pipeline with the default `ChunkTokenizer` parameters and see how it works. <br/>

In the example pipeline, we use the pretrained `ner_dl` NER model to detect named entities, then tokenize the detected chunks with `ChunkTokenizer`.

In [None]:
# Create sample data

text = ['Peter Parker is a nice lad and lives in New York']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------------+
|text                                            |
+------------------------------------------------+
|Peter Parker is a nice lad and lives in New York|
+------------------------------------------------+



In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token")


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


In [None]:
result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------------+--------------------------+
|ner_chunk               |chunk_token               |
+------------------------+--------------------------+
|[Peter Parker, New York]|[Peter, Parker, New, York]|
+------------------------+--------------------------+



As seen above, each chunk consists of two words, and `ChunkTokenizer` split each chunk into its individual tokens.

### contextChars  
This parameter sets the list of characters that are separated from token boundaries (Default: `[".", ",", ";", ":", "!", "?", "*", "-", "(", ")", "\"", "'"]`).

In [None]:
# Create sample data

text = ['Peter Parker (Spiderman) is a nice guy and lives in New-York!']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+-------------------------------------------------------------+
|text                                                         |
+-------------------------------------------------------------+
|Peter Parker (Spiderman) is a nice guy and lives in New-York!|
+-------------------------------------------------------------+



In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setContextChars(['?', '!'])

pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------------------------+---------------------------------------+
|ner_chunk                           |chunk_token                            |
+------------------------------------+---------------------------------------+
|[Peter Parker, Spiderman, New-York!]|[Peter, Parker, Spiderman, New-York, !]|
+------------------------------------+---------------------------------------+



As seen from the output, the exclamation mark was separated.

### infixPatterns
This parameter is used to set regex patterns that match tokens within a single target.

In [56]:
# Create sample data

text = ['Peter Parker bookad a ticket for the concert. Ticket price is 100$ dollars']
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+--------------------------------------------------------------------------+
|text                                                                      |
+--------------------------------------------------------------------------+
|Peter Parker bookad a ticket for the concert. Ticket price is 100$ dollars|
+--------------------------------------------------------------------------+



In [57]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

tokenClassifier = RoBertaForTokenClassification \
    .pretrained('roberta_base_token_classifier_ontonotes', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('ner') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

# Since the output column is IOB/IOB2 style, NerConverter can extract entities
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')


chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["entities"]) \
     .setOutputCol("chunk_token") \
     .setInfixPatterns(["(\b\d{3}\b)"])

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter,
    chunkTokenizer
])


roberta_base_token_classifier_ontonotes download started this may take some time.
Approximate size to download 434.7 MB
[OK!]


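We defined a regex intended to extract three-digit numbers, so we expect to see "100" and "$" separately. <br/>
Note that in a regular (non-raw) Python string, `\b` is a backspace character rather than a regex word boundary, which is likely why "100$" remains unsplit in the output below. Writing the pattern as a raw string (`r"(\b\d{3}\b)"`) passes the word boundaries through to the regex engine.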
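
Because every pattern parameter is passed as a Python string, escape handling matters before the regex ever reaches the annotator. The short check below uses Python's own `re` module purely to illustrate how the string literal is interpreted; Spark NLP evaluates patterns on the JVM, so this is a sketch of the Python-side behavior only, not of the annotator's matching.

```python
import re

# Non-raw literal: "\b" becomes the backspace character (U+0008).
# "\\d" is written with a double backslash only to avoid an invalid-escape warning.
plain = "(\b\\d{3}\b)"

# Raw literal: "\b" is kept as backslash + "b", i.e. a regex word boundary.
raw = r"(\b\d{3}\b)"

text = "100$ dollars"

print(len("\b"), len(r"\b"))    # 1 2
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['100']
```

The same caution applies to any regex passed to `setInfixPatterns`, `setTargetPattern`, or `setSplitPattern` when it contains `\b`.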

In [59]:
result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("entities.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+----------------------------+-----------------------------+
|ner_chunk                   |chunk_token                  |
+----------------------------+-----------------------------+
|[Peter Parker, 100$ dollars]|[Peter Parker, 100$, dollars]|
+----------------------------+-----------------------------+



### exceptions
This parameter is used to choose words that won't be affected by tokenization rules.

In [None]:
# Create sample data

text = ["Peter Parker is a nice lad and lives in New York"]
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------------+
|text                                            |
+------------------------------------------------+
|Peter Parker is a nice lad and lives in New York|
+------------------------------------------------+



In some cases, you may not want certain chunks to be split. Let's pass "New York" to `setExceptions()`. We then expect that "New York" will not be split into its tokens.

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setExceptions(["New York"])


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
+------------------------+-------------------------+
|ner_chunk               |chunk_token              |
+------------------------+-------------------------+
|[Peter Parker, New York]|[Peter, Parker, New York]|
+------------------------+-------------------------+



As seen above, `ChunkTokenizer` did not split "New York".

### caseSensitiveExceptions

This parameter controls whether exceptions are matched case-sensitively (Default: true).

First, we use the default value, **setCaseSensitiveExceptions(True)**. Because the exception is defined in lowercase in `setExceptions()`, it will not match, and we expect "New York" to be split into its tokens.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setExceptions(["new york"]) \
     .setCaseSensitiveExceptions(True)


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------------+--------------------------+
|ner_chunk               |chunk_token               |
+------------------------+--------------------------+
|[Peter Parker, New York]|[Peter, Parker, New, York]|
+------------------------+--------------------------+



Now, we will keep the lowercase value in `setExceptions(["new york"])` and set `setCaseSensitiveExceptions(False)`. Thus, we expect that "New York" will not be split into its tokens and our exception will work.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setExceptions(["new york"]) \
     .setCaseSensitiveExceptions(False)


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------------+-------------------------+
|ner_chunk               |chunk_token              |
+------------------------+-------------------------+
|[Peter Parker, New York]|[Peter, Parker, New York]|
+------------------------+-------------------------+



As seen above, "New York" was not split since we set `setCaseSensitiveExceptions(False)`.
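
The two runs above can be summarized with a small plain-Python sketch. It is a simplification under stated assumptions, not Spark NLP's implementation: the hypothetical helper `tokenize_chunk` only checks whether the whole chunk equals an exception and otherwise splits on whitespace.

```python
from typing import List


def tokenize_chunk(chunk: str, exceptions: List[str], case_sensitive: bool = True) -> List[str]:
    """Keep the chunk whole if it matches an exception, otherwise split on whitespace."""
    if case_sensitive:
        is_exception = chunk in exceptions
    else:
        # Compare lowercased forms so that "New York" matches "new york"
        is_exception = chunk.lower() in {e.lower() for e in exceptions}
    return [chunk] if is_exception else chunk.split()


print(tokenize_chunk("New York", ["new york"], case_sensitive=True))
# ['New', 'York']
print(tokenize_chunk("New York", ["new york"], case_sensitive=False))
# ['New York']
```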

### exceptionsPath
This parameter sets the path to a file containing a list of exceptions.

First, we will create a text file containing the exceptions. Then, we will pass the path of this file to `setExceptionsPath()`.

In [None]:
# Define the exceptions, one per line
exceptions = """Peter Parker
James Murphy
Lucas Nelson
"""

# Write the exceptions to a text file; the file is closed automatically
with open("exceptions.txt", "w") as text_file:
    text_file.write(exceptions)

Building `ChunkTokenizer` with `exceptionsPath`.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setExceptionsPath('exceptions.txt')


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------------+-------------------------+
|ner_chunk               |chunk_token              |
+------------------------+-------------------------+
|[Peter Parker, New York]|[Peter Parker, New, York]|
+------------------------+-------------------------+



As you see, "Peter Parker" wasn't split since it is defined in the *exceptions.txt* file.
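
As a quick sanity check on the file format, the sketch below writes the same three exceptions to a temporary file and reads them back as a list. It assumes one exception per line, as in the file created above. The temporary directory and the variable name `loaded` are illustrative only, and the parsing shown is plain Python rather than Spark NLP's own reader.

```python
import os
import tempfile

exceptions = "Peter Parker\nJames Murphy\nLucas Nelson\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "exceptions.txt")

    # Write one exception per line
    with open(path, "w", encoding="utf-8") as f:
        f.write(exceptions)

    # Read the file back, dropping surrounding whitespace and empty lines
    with open(path, encoding="utf-8") as f:
        loaded = [line.strip() for line in f if line.strip()]

print(loaded)
# ['Peter Parker', 'James Murphy', 'Lucas Nelson']
```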

### maxLength  
This parameter is used to set the maximum allowed length for each token.

In [None]:
# Create sample data

text = ["Peter Parker is a nice lad and lives in Minnesota"]
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+-------------------------------------------------+
|text                                             |
+-------------------------------------------------+
|Peter Parker is a nice lad and lives in Minnesota|
+-------------------------------------------------+



First, we will build our pipeline without the `maxLength` parameter.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") 

pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+-------------------------+--------------------------+
|ner_chunk                |chunk_token               |
+-------------------------+--------------------------+
|[Peter Parker, Minnesota]|[Peter, Parker, Minnesota]|
+-------------------------+--------------------------+



Now, we will set `setMaxLength(7)` and we expect not to see "Minnesota" in the result since it has 9 characters.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setMaxLength(7)

pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+-------------------------+---------------+
|ner_chunk                |chunk_token    |
+-------------------------+---------------+
|[Peter Parker, Minnesota]|[Peter, Parker]|
+-------------------------+---------------+



As you see, `ChunkTokenizer` did not accept "Minnesota" as a token because of its length. 

### minLength 
This parameter is used to set the minimum allowed length for each token.

In [None]:
# Create sample data

text = ["Peter Parker is a nice lad and lives in LA"]
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------+
|text                                      |
+------------------------------------------+
|Peter Parker is a nice lad and lives in LA|
+------------------------------------------+



First, we will build our pipeline without the `minLength` parameter.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") 
     
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------+-------------------+
|ner_chunk         |chunk_token        |
+------------------+-------------------+
|[Peter Parker, LA]|[Peter, Parker, LA]|
+------------------+-------------------+



Now, we will set `setMinLength(3)` and we expect not to see "LA" in the result since it has 2 characters.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setMinLength(3)
     
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------+---------------+
|ner_chunk         |chunk_token    |
+------------------+---------------+
|[Peter Parker, LA]|[Peter, Parker]|
+------------------+---------------+



As you see, `ChunkTokenizer` did not accept "LA" as a token because of its length.
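
The effect of `maxLength` and `minLength` in the two examples above can be sketched as a simple length filter. This is an illustrative plain-Python helper (`filter_by_length` is a hypothetical name), and treating both bounds as inclusive is an assumption rather than something this notebook verifies.

```python
from typing import List, Optional


def filter_by_length(tokens: List[str], min_length: int = 0,
                     max_length: Optional[int] = None) -> List[str]:
    """Keep tokens whose length lies within [min_length, max_length] (inclusive)."""
    return [
        t for t in tokens
        if len(t) >= min_length and (max_length is None or len(t) <= max_length)
    ]


print(filter_by_length(["Peter", "Parker", "Minnesota"], max_length=7))
# ['Peter', 'Parker']
print(filter_by_length(["Peter", "Parker", "LA"], min_length=3))
# ['Peter', 'Parker']
```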

### splitChars 
This parameter is used to set the list of characters used to split tokens from the inside.

In [None]:
# Create sample data

text = ["Peter Parker is a nice lad and lives in New-York"]
data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------------+
|text                                            |
+------------------------------------------------+
|Peter Parker is a nice lad and lives in New-York|
+------------------------------------------------+



In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") 
          
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------------+-------------------------+
|ner_chunk               |chunk_token              |
+------------------------+-------------------------+
|[Peter Parker, New-York]|[Peter, Parker, New-York]|
+------------------------+-------------------------+



Now we set `setSplitChars(["-"])`, so we expect "New-York" to be split at the `-`.

In [None]:
chunkTokenizer = ChunkTokenizer() \
     .setInputCols(["ner_chunk"]) \
     .setOutputCol("chunk_token") \
     .setSplitChars(["-"])
     
pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+------------------------+--------------------------+
|ner_chunk               |chunk_token               |
+------------------------+--------------------------+
|[Peter Parker, New-York]|[Peter, Parker, New, York]|
+------------------------+--------------------------+



As seen above, "New-York" was split into two tokens: "New" and "York".
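
Conceptually, splitting on a character list behaves like a regex split on a character class. The sketch below shows that idea in plain Python; `split_on_chars` is a hypothetical helper and only approximates what the annotator does internally.

```python
import re
from typing import List


def split_on_chars(token: str, split_chars: List[str]) -> List[str]:
    """Split a token at every occurrence of any character in split_chars."""
    if not split_chars:
        return [token]
    # re.escape guards characters such as "-" or "." that are special in a regex
    pattern = "[" + "".join(re.escape(c) for c in split_chars) + "]"
    # Drop empty strings produced by leading, trailing, or repeated separators
    return [part for part in re.split(pattern, token) if part]


print(split_on_chars("New-York", ["-"]))
# ['New', 'York']
```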

### suffixPattern

This parameter is used to set a regex with groups that ends with \z to match the target suffix.

In [None]:
text = ['Peter Parker (Spiderman) is a nice guy and lives in New-York!']

data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+-------------------------------------------------------------+
|text                                                         |
+-------------------------------------------------------------+
|Peter Parker (Spiderman) is a nice guy and lives in New-York!|
+-------------------------------------------------------------+



A pipeline with no defined `.setSuffixPattern()`. <br/>
Check the chunk "New-York!"

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

chunkTokenizer = ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("chunk_token") \
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
+------------------------------------+---------------------------------------+
|ner_chunk                           |chunk_token                            |
+------------------------------------+---------------------------------------+
|[Peter Parker (Spiderman), New-York]|[Peter, Parker, (Spiderman), New, York]|
+------------------------------------+---------------------------------------+



A pipeline with defined `.setSuffixPattern("([a])\z")`. <br/>
Check the chunk "New-York!"

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

chunkTokenizer = ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("chunk_token") \
    .setSuffixPattern(r"([a])\z")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
+------------------------------------+--------------------------------------+
|ner_chunk                           |chunk_token                           |
+------------------------------------+--------------------------------------+
|[Peter Parker (Spiderman), New-York]|[Peter, Parker, (Spiderman), New-York]|
+------------------------------------+--------------------------------------+



### prefixPattern

This parameter is used to set a regex with groups that begins with \A to match the target prefix. It overrides the `contextChars` parameter.

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

chunkTokenizer = ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("chunk_token") \
    .setSuffixPattern(r"([a])\z")\
    .setPrefixPattern(r"\A([a])")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)


pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
+------------------------------------+--------------------------------------+
|ner_chunk                           |chunk_token                           |
+------------------------------------+--------------------------------------+
|[Peter Parker (Spiderman), New-York]|[Peter, Parker, (Spiderman), New-York]|
+------------------------------------+--------------------------------------+



In [None]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSuffixPattern(r"([a])\z")\
    .setPrefixPattern(r"\A([a])")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'but', 'has', 'no', 'e-mail!'])]

In [None]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSuffixPattern(r"([a])\z")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'but', 'has', 'no', 'e-mail!'])]

### targetPattern

This parameter is used to set the pattern to grab from text as token candidates.

In [None]:
text = ['Peter Parker (Spiderman) is a nice guy and lives in New-York!']

data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+-------------------------------------------------------------+
|text                                                         |
+-------------------------------------------------------------+
|Peter Parker (Spiderman) is a nice guy and lives in New-York!|
+-------------------------------------------------------------+



A pipeline with no `.setTargetPattern()` defined. <br/>
Check the chunk "New-York!" 

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

chunkTokenizer = ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("chunk_token") 

pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
+-------------------------------------+---------------------------------------------+
|ner_chunk                            |chunk_token                                  |
+-------------------------------------+---------------------------------------------+
|[Peter Parker (Spiderman), New-York!]|[Peter, Parker, (, Spiderman, ), New-York, !]|
+-------------------------------------+---------------------------------------------+



A pipeline with `.setTargetPattern("\b\w+!\b")` defined. <br/>
Check the chunk "New-York!" 

In [None]:
word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_tagger = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

chunkTokenizer = ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("chunk_token") \
    .setTargetPattern("\b\w+!\b")

pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
+-------------------------------------+----------------------------------------------+
|ner_chunk                            |chunk_token                                   |
+-------------------------------------+----------------------------------------------+
|[Peter Parker (Spiderman), New-York!]|[Peter, Parker, (, Spiderman, ), New, York, !]|
+-------------------------------------+----------------------------------------------+



### splitPattern

This parameter is used to set the pattern used to split tokens from the inside. It takes priority over `splitChars`. This pattern is applied to the tokens that were previously extracted with the target pattern.

In [None]:
text = ['John Adam is a nice guy and visited to Washinton D.C.!']

data_set = spark.createDataFrame(text, StringType()).toDF("text")
data_set.show(truncate=False)

+------------------------------------------------------+
|text                                                  |
+------------------------------------------------------+
|John Adam is a nice guy and visited to Washinton D.C.!|
+------------------------------------------------------+



In [None]:
chunkTokenizer = ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("chunk_token") \
    .setTargetPattern("\b\w+!\b")

pipeline = Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner_tagger,
    ner_converter,
    chunkTokenizer
])

result = pipeline.fit(data_set).transform(data_set)
result.selectExpr("ner_chunk.result as ner_chunk" , "chunk_token.result as chunk_token").show(truncate=False)

+----------------------------+--------------------------------+
|ner_chunk                   |chunk_token                     |
+----------------------------+--------------------------------+
|[John Adam, Washinton D.C.!]|[John, Adam, Washinton, D, C, !]|
+----------------------------+--------------------------------+

