![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **RegexTokenizer**

This notebook will cover the different parameters and usages of `RegexTokenizer`. This annotator provides the ability to tokenize text according to user-defined regex patterns.

**📖 Learning Objectives:**

1. Understand how different regex patterns split sequences of words in different ways.

2. Understand the difference between the regex tokenizer and regular tokenizer.

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [RegexTokenizer](https://nlp.johnsnowlabs.com/docs/en/annotators#regextokenizer)

- Python Docs : [RegexTokenizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/token/regex_tokenizer/index.html)

- Scala Docs : [RegexTokenizer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/RegexTokenizer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/02.0.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**


Tokenization is an important task in NLP that facilitates various downstream applications within NLP pipelines. In Spark NLP, tokenization can be carried out using two different annotators: the `Tokenizer` or the `RegexTokenizer`. The `RegexTokenizer` gives users more flexibility in defining token boundaries than the regular `Tokenizer` and is therefore the preferred option in highly customized NLP pipelines.

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `TOKEN`

## **🔎 Parameters**


- `maxLength`: (Int) Maximum token length, greater than or equal to 1.

- `minLength`: (Int) Minimum token length, greater than or equal to 0 (Default: 1, to avoid returning empty strings).

- `pattern`: (String) Regex pattern used to match delimiters (Default: "\\s+").

- `positionalMask`: (BooleanParam) Indicates whether to apply the regex tokenization using a positional mask to guarantee the incremental progression (Default: false).

- `preservePosition`: (BooleanParam) Indicates whether to preserve the initial indexes of tokens, as they were before any whitespace removal (Default: true).

- `toLowercase`: (BooleanParam) Indicates whether to convert all characters to lowercase before tokenizing (Default: false).

- `trimWhitespace`: (BooleanParam) Indicates whether to remove whitespace from the identified tokens (Default: false).

### `setPattern()`

Use `setPattern()` to set the regex that matches the delimiters between tokens. The text is split wherever this pattern matches.

In [None]:
from pyspark.sql.types import StringType

content = "1. T1-T2 DATE**[12/24/13] $1.99 () (10/12) ph+ 90%"

df = spark.createDataFrame([content], StringType()).withColumnRenamed("value", "text")

In [None]:
pattern = '\\s+|(?=[-.:;"*+,$&?!%\\[\\]\\(\\)\\/])|(?<=[-.:;"*+,$&?!%\\[\\]\\(\\)\\/])'

documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols('document')\
    .setOutputCol('sentence')

tokenizer = RegexTokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("RegexToken")

regexTokenizer = RegexTokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("RegexToken_with_pattern") \
    .setPattern(pattern)

docPatternRemoverPipeline = Pipeline().setStages([documenter,
                                                  sentenceDetector,
                                                  tokenizer,
                                                  regexTokenizer])



In [None]:
content = "1. T1-T2 DATE**[12/24/13] $1.99 () (10/12) ph+ 90% sting? or hi!"

df = spark.createDataFrame([content], StringType()).withColumnRenamed("value", "text")

result = docPatternRemoverPipeline.fit(df).transform(df)
result.selectExpr("RegexToken.result as RegexToken", "RegexToken_with_pattern.result as RegexToken_with_Pattern").show(truncate=False)

+----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
|RegexToken                                                                  |RegexToken_with_Pattern                                                                                                     |
+----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
|[1., T1-T2, DATE**[12/24/13], $1.99, (), (10/12), ph+, 90%, sting?, or, hi!]|[1, ., T1, -, T2, DATE, *, *, [, 12, /, 24, /, 13, ], $, 1, ., 99, (, ), (, 10, /, 12, ), ph, +, 90, %, sting, ?, or, hi, !]|
+----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------

When no pattern was given, `RegexTokenizer` split the text on the default pattern "\\s+" (runs of whitespace). When a pattern was passed to `setPattern()`, it split the text using that pattern instead.
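To see what the lookahead and lookbehind alternatives in the pattern do, the splitting can be approximated with Python's built-in `re` module, without Spark. This is only a rough sketch under stated assumptions: `split_on_delimiters` is a hypothetical helper, the pattern is a reduced version of the one above, Python's regex engine is not the Java engine that Spark NLP runs on, and dropping empty strings stands in for the default `minLength` of 1.

```python
import re

# Reduced version of the notebook pattern, limited to three punctuation marks.
pattern = r"\s+|(?=[-.$])|(?<=[-.$])"


def split_on_delimiters(text, delimiter_pattern):
    """Split text wherever the pattern matches and drop empty pieces."""
    # Zero-width lookaround matches split the text without consuming characters
    # (re.split handles empty matches this way since Python 3.7).
    return [piece for piece in re.split(delimiter_pattern, text) if piece]


tokens = split_on_delimiters("T1-T2 $1.99", pattern)
whitespace_only = split_on_delimiters("T1-T2 $1.99", r"\s+")

print(tokens)
print(whitespace_only)
```

Whitespace is consumed by `\s+`, while the lookarounds cut just before and just after each listed punctuation mark, so the mark survives as a token of its own.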

In [None]:
tokenizer.extractParamMap()

{Param(parent='RegexTokenizer_ef71be1d91e0', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='RegexTokenizer_ef71be1d91e0', name='inputCols', doc='previous annotations columns, if renamed'): ['sentence'],
 Param(parent='RegexTokenizer_ef71be1d91e0', name='outputCol', doc='output annotation column. can be left default.'): 'RegexToken',
 Param(parent='RegexTokenizer_ef71be1d91e0', name='toLowercase', doc='Indicates whether to convert all characters to lowercase before tokenizing.'): False,
 Param(parent='RegexTokenizer_ef71be1d91e0', name='minLength', doc='Set the minimum allowed length for each token'): 1,
 Param(parent='RegexTokenizer_ef71be1d91e0', name='pattern', doc='regex pattern used for tokenizing. Defaults \\S+'): '\\s+',
 Param(parent='RegexTokenizer_ef71be1d91e0', name='positionalMask', doc='Using a positional mask to guarantee the incremental progression of the tokenization.'): False,
 Param(parent='RegexTokeni

### `setTrimWhitespace()`

Set `setTrimWhitespace(True)` to strip whitespace from the tokens after tokenization.

Comparing the results with the parameter set to **False** and then to **True** makes its role clearer:

In [None]:
regex_pattern = """\t"""
sampleText = "   Jack   \t    registered \t with \t   id:7354632112   \t    on    \t      23/3/2022    "

df = spark.createDataFrame([[sampleText]]).toDF("text")

regexTokenizer = RegexTokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")\
  .setPattern(regex_pattern)\
  .setTrimWhitespace(False)

pipeline = Pipeline().setStages([documenter,
                                 sentenceDetector,
                                 tokenizer,
                                 regexTokenizer])

result = pipeline.fit(df).transform(df)
result.selectExpr("token.result as RegexToken").show(truncate=False)

+------------------------------------------------------------------------------------+
|RegexToken                                                                          |
+------------------------------------------------------------------------------------+
|[Jack   ,     registered ,  with ,    id:7354632112   ,     on    ,       23/3/2022]|
+------------------------------------------------------------------------------------+



In [None]:
regexTokenizer = RegexTokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")\
  .setPattern(regex_pattern)\
  .setTrimWhitespace(True)


pipeline = Pipeline().setStages([documenter,
                                 sentenceDetector,
                                 tokenizer,
                                 regexTokenizer])



result = pipeline.fit(df).transform(df)
result.selectExpr("token.result as RegexToken").show(truncate=False)


+------------------------------------------------------+
|RegexToken                                            |
+------------------------------------------------------+
|[Jack, registered, with, id:7354632112, on, 23/3/2022]|
+------------------------------------------------------+






As seen above, `setTrimWhitespace()` controls whether whitespace is removed from the tokens.
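The same effect can be sketched in plain Python, which may help when reasoning about the parameter without a Spark session. This is an illustration, not Spark NLP's implementation; it assumes the sentence carries no leading or trailing whitespace, as the outputs above suggest.

```python
sample_text = "   Jack   \t    registered \t with \t   id:7354632112   \t    on    \t      23/3/2022    "

# Assumption: the sentence has no leading or trailing whitespace.
sentence = sample_text.strip()

# Splitting on the tab delimiter keeps the spaces around each token,
# comparable to setTrimWhitespace(False).
raw_tokens = sentence.split("\t")

# Stripping each piece afterwards is comparable to setTrimWhitespace(True).
trimmed_tokens = [token.strip() for token in raw_tokens]

print(raw_tokens)
print(trimmed_tokens)
```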

### `setPreservePosition()`

`setPreservePosition()` indicates whether to preserve the initial character indexes of tokens, as they were before whitespace removal.

If you remove whitespace with `setTrimWhitespace(True)` and still need each token's `begin` and `end` indexes to reflect its original span in the sentence, set `setPreservePosition(True)`.

In [None]:
regex_pattern = """\t"""
sampleText = "   Jack   \t    registered \t with \t   id:7354632112   \t    on    \t      23/3/2022    "

df = spark.createDataFrame([[sampleText]]).toDF("text")

regexTokenizer = RegexTokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")\
  .setPattern(regex_pattern)\
  .setTrimWhitespace(True)\
  .setPreservePosition(False)

pipeline = Pipeline().setStages([documenter,
                                 sentenceDetector,
                                 tokenizer,
                                 regexTokenizer])

result = pipeline.fit(df).transform(df)
result.selectExpr("token.result as RegexToken", "token.begin as Begin", "token.end as End").show(truncate=False)

+------------------------------------------------------+-----------------------+-----------------------+
|RegexToken                                            |Begin                  |End                    |
+------------------------------------------------------+-----------------------+-----------------------+
|[Jack, registered, with, id:7354632112, on, 23/3/2022]|[3, 15, 28, 37, 58, 71]|[6, 24, 31, 49, 59, 79]|
+------------------------------------------------------+-----------------------+-----------------------+



In [None]:
regexTokenizer = RegexTokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")\
  .setPattern(regex_pattern)\
  .setTrimWhitespace(True)\
  .setPreservePosition(True)

pipeline = Pipeline().setStages([documenter,
                                 sentenceDetector,
                                 tokenizer,
                                 regexTokenizer])

result = pipeline.fit(df).transform(df)
result.selectExpr("token.result as RegexToken", "token.begin as Begin", "token.end as End").show(truncate=False)

+------------------------------------------------------+-----------------------+-----------------------+
|RegexToken                                            |Begin                  |End                    |
+------------------------------------------------------+-----------------------+-----------------------+
|[Jack, registered, with, id:7354632112, on, 23/3/2022]|[3, 11, 27, 34, 54, 65]|[9, 25, 32, 52, 63, 79]|
+------------------------------------------------------+-----------------------+-----------------------+



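As seen above, with `setPreservePosition(True)` the `begin` and `end` indexes still describe each token's original, untrimmed span in the sentence (for example, `Jack` spans 3 to 9, trailing spaces included), even though the whitespace was removed from the token text. With `setPreservePosition(False)`, the indexes are adjusted to the trimmed text (`Jack` spans 3 to 6).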
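The two sets of indexes can be reproduced with a few lines of plain Python, which makes the difference concrete. This is a sketch of the arithmetic only, not Spark NLP's code: `token_spans` is a hypothetical helper, and it assumes the sentence starts at the first non-whitespace character and that `end` is inclusive.

```python
sample_text = "   Jack   \t    registered \t with \t   id:7354632112   \t    on    \t      23/3/2022    "


def token_spans(text, preserve_position):
    """Return (token, begin, end) triples for tab-delimited text; end is inclusive."""
    # Assumption: the sentence covers the text without surrounding whitespace.
    begin = len(text) - len(text.lstrip())
    sentence_end = len(text.rstrip())  # exclusive
    spans = []
    while begin < sentence_end:
        tab = text.find("\t", begin, sentence_end)
        end = tab if tab != -1 else sentence_end  # exclusive end of the raw piece
        raw = text[begin:end]
        token = raw.strip()
        if preserve_position:
            # Keep the span of the untrimmed piece.
            spans.append((token, begin, end - 1))
        else:
            # Shift the span so it covers only the trimmed text.
            start = begin + len(raw) - len(raw.lstrip())
            spans.append((token, start, start + len(token) - 1))
        begin = end + 1  # skip the tab delimiter
    return spans


preserved = token_spans(sample_text, preserve_position=True)
adjusted = token_spans(sample_text, preserve_position=False)

print(preserved)
print(adjusted)
```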

### `setToLowercase()`

Set `setToLowercase(True)` to convert all characters of the text to lowercase before tokenizing.

In [None]:
from pyspark.sql.types import StringType

content = "1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022]."
pattern = "\\s+|(?=[-:;*__+,$&\\[\\]])|(?<=[-:;*__+,$&\\[\\]])"

df = spark.createDataFrame([content], StringType()).withColumnRenamed("value", "text")

documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

regexTokenizer = RegexTokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("regexToken") \
    .setPattern(pattern)\
    .setToLowercase(True)

docPatternRemoverPipeline = Pipeline().setStages([documenter,
                                                  sentenceDetector,
                                                  regexTokenizer])

result = docPatternRemoverPipeline.fit(df).transform(df)

result.selectExpr("sentence.result as Sentence", "regexToken.result as RegexToken").show(truncate=False)

+------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|Sentence                                                                                        |RegexToken                                                                                                                    |
+------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|[1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022].]|[1., the, investments, made, reached, a, value, of, £4.5million, ,, gaining, _, _, 85.6%, on, date, *, *, [, 24/12/2022, ], .]|
+-----------------------------------------------------------------------------------------------
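As a rough plain-Python analogue, lowercasing can be applied to the text before it is split. The sketch below is an approximation with Python's `re` module rather than Spark NLP itself, and it also checks that, for this pattern, lowercasing before or after splitting yields the same tokens, because the pattern contains no letters.

```python
import re

content = "1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022]."
pattern = r"\s+|(?=[-:;*__+,$&\[\]])|(?<=[-:;*__+,$&\[\]])"

# Lowercase first, then split and drop empty pieces
# (dropping empties stands in for the default minLength of 1).
lowered_tokens = [t for t in re.split(pattern, content.lower()) if t]

# Splitting first and lowercasing each token gives the same result here,
# since the delimiter pattern does not depend on letter case.
split_then_lowered = [t.lower() for t in re.split(pattern, content) if t]

print(lowered_tokens)
print(lowered_tokens == split_then_lowered)
```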

### `setMaxLength()`

Use this parameter to keep only tokens up to a given maximum length. Longer tokens are dropped from the output.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setPattern("\\s+|(?=[-:;*__+,$&\\[\\]])|(?<=[-:;*__+,$&\\[\\]])")\
    .setMaxLength(3)

pipeline = Pipeline().setStages([
      documentAssembler,
      regexTokenizer
    ])

data = spark.createDataFrame([["1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022]."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(truncate=False)

+--------------------------------------------+
|result                                      |
+--------------------------------------------+
|[1., The, a, of, ,, _, _, on, *, *, [, ], .]|
+--------------------------------------------+



As seen in the results above, only tokens with a length of at most 3 were returned.

### `setMinLength()`

Use this parameter to keep only tokens of at least a given minimum length. Shorter tokens are dropped from the output.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setPattern("\\s+|(?=[-:;*__+,$&\\[\\]])|(?<=[-:;*__+,$&\\[\\]])")\
    .setMinLength(5)

pipeline = Pipeline().setStages([
      documentAssembler,
      regexTokenizer
    ])

data = spark.createDataFrame([["1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022]."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(truncate=False)

+----------------------------------------------------------------------+
|result                                                                |
+----------------------------------------------------------------------+
|[investments, reached, value, £4.5Million, gaining, 85.6%, 24/12/2022]|
+----------------------------------------------------------------------+



As seen in the results above, only tokens with a length of at least 5 were returned.
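Both length filters amount to a simple check on each token after splitting. The following sketch approximates that behavior with Python's `re` module; `tokenize` and its keyword arguments are hypothetical names chosen for illustration and are not part of Spark NLP.

```python
import re

text = "1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022]."
pattern = r"\s+|(?=[-:;*__+,$&\[\]])|(?<=[-:;*__+,$&\[\]])"


def tokenize(text, pattern, min_length=1, max_length=None):
    """Split on the delimiter pattern and keep tokens within the length bounds."""
    pieces = re.split(pattern, text)
    return [
        piece
        for piece in pieces
        if len(piece) >= min_length
        and (max_length is None or len(piece) <= max_length)
    ]


short_tokens = tokenize(text, pattern, max_length=3)  # comparable to setMaxLength(3)
long_tokens = tokenize(text, pattern, min_length=5)   # comparable to setMinLength(5)

print(short_tokens)
print(long_tokens)
```

With `min_length` left at 1, the empty strings produced by adjacent delimiter matches are filtered out, which mirrors the reason given above for the default `minLength`.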