![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


#  **Tokenizer**

This notebook covers the parameters and usage of `Tokenizer`. This annotator splits text into tokens following open tokenization standards. It is an Annotator Approach, so it must be fitted with `.fit()` before it can transform data.

A few rules help customize it when the defaults do not fit your needs.

**📖 Learning Objectives:**

1. Understand how to use `Tokenizer`.

2. Become comfortable using the different parameters of the `Tokenizer`.


**🔗 Helpful Links:**

- Documentation : [Tokenizer](https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer)

- Python Docs : [Tokenizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/token/tokenizer/index.html)

- Scala Docs : [Tokenizer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/Tokenizer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).

## **🎬 Colab Setup**

In [1]:
# Install PySpark and Spark NLP
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4


In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `TOKEN`

## **🔎 Parameters**

- `addException()`: (String) Add a single exception.

- `setExceptionsPath()`: (String) Path to txt file with list of token exceptions.

- `setCaseSensitiveExceptions()`: (Bool) Whether exceptions are matched case-sensitively.

- `setContextChars()`: (StringArray) List of single-character strings to strip from token boundaries, such as parentheses or question marks. Ignored if prefix, infix or suffix patterns are used.

- `setSplitChars()`: (StringArray) List of single-character strings on which to split tokens from the inside, such as hyphens. Ignored if infix, prefix or suffix patterns are used.

- `setSplitPattern()`: (String) Pattern used to split tokens from the inside. Takes priority over `setSplitChars()`.

- `setTargetPattern()`: Basic regex rule to identify a candidate for tokenization. Defaults to `\S+`, which matches any run of non-whitespace characters.

- `setSuffixPattern()`: Regex to identify subtokens at the end of the token. The regex has to end with `\z` and must contain groups `()`. Each group becomes a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses.

- `setPrefixPattern()`: Regex to identify subtokens at the beginning of the token. The regex has to start with `\A` and must contain groups `()`. Each group becomes a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses.

- `addInfixPattern()`: Add an extension pattern regex with groups to the top of the rules (will target first, from more specific to the more general).

- `setMinLength()`: Sets the minimum allowed length for each token.

- `setMaxLength()`: Sets the maximum allowed length for each token.
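The default target pattern `\S+` can be tried outside Spark with Python's `re` module. The snippet below only illustrates what the regex itself selects; it is not Spark NLP's implementation, and the variable names are made up for the example.

```python
import re

# Default target pattern from the parameter list: a run of non-whitespace characters.
TARGET_PATTERN = r"\S+"

text = "Peter Parker (Spiderman) has no e-mail!"

# Each match is a candidate chunk that the other rules may split further.
candidates = re.findall(TARGET_PATTERN, text)
print(candidates)
```

Note that `(Spiderman)` and `e-mail!` come out as single candidates here; separating the punctuation is the job of the context, prefix, suffix and split rules shown in the following sections.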

### `.addException()`

Adds a word or phrase that will not be affected by the tokenization rules.

In [30]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .addException("New York")

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York or new york but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [31]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(', 'Spiderman', ')', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'or', 'new', 'york', 'but', 'has', 'no', 'e-mail', '!'])]

### `.setCaseSensitiveExceptions()`

Sets whether exceptions are matched case-sensitively, by default True.

In [32]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .addException("New York")\
    .setCaseSensitiveExceptions(False)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York or new york but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [33]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(', 'Spiderman', ')', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'or', 'new york', 'but', 'has', 'no', 'e-mail', '!'])]
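The difference between the two runs above comes down to case-sensitive versus case-insensitive matching of the exception phrase. A minimal plain-Python sketch of that idea follows; `find_exceptions` is a hypothetical helper written for this illustration and does not reflect how Spark NLP matches exceptions internally.

```python
import re

def find_exceptions(text, exceptions, case_sensitive=True):
    """Return the substrings of `text` that match any exception phrase."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Escape each phrase so it is matched literally; try longer phrases first.
    alternatives = sorted((re.escape(e) for e in exceptions), key=len, reverse=True)
    pattern = re.compile("|".join(alternatives), flags)
    return pattern.findall(text)

text = "lives in New York or new york"
sensitive = find_exceptions(text, ["New York"], case_sensitive=True)
insensitive = find_exceptions(text, ["New York"], case_sensitive=False)
print(sensitive)
print(insensitive)
```

With case sensitivity on, only `New York` is protected; with it off, `new york` is protected as well, which mirrors the two outputs above.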

### `.setContextChars()`

Sets character list used to separate from token boundaries, by default [‘.’, ‘,’, ‘;’, ‘:’, ‘!’, ‘?’, ‘*’, ‘-’, ‘(’, ‘)’, ‘”’, “’”].

In [34]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [35]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'but', 'has', 'no', 'e-mail', '!'])]
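To see why `(Spiderman)` stays in one piece once the context characters are reduced to `?` and `!`, here is a simplified plain-Python model of peeling context characters off token boundaries. `strip_context_chars` is a hypothetical helper; the real annotator builds regex rules from these characters, so its grouping of longer punctuation runs may differ.

```python
def strip_context_chars(token, context_chars):
    """Peel context characters off both ends of `token` as separate pieces."""
    chars = set(context_chars)
    prefix, suffix = [], []
    # Peel from the left while the first character is a context character.
    while token and token[0] in chars:
        prefix.append(token[0])
        token = token[1:]
    # Peel from the right while the last character is a context character.
    while token and token[-1] in chars:
        suffix.insert(0, token[-1])
        token = token[:-1]
    core = [token] if token else []
    return prefix + core + suffix

print(strip_context_chars("(Spiderman)", ["?", "!"]))
print(strip_context_chars("(Spiderman)", ["(", ")", "?", "!"]))
print(strip_context_chars("e-mail!", ["?", "!"]))
```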

### `.setSplitChars()`

Sets character list used to separate from the inside of tokens.

In [36]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [37]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'but', 'has', 'no', 'e', 'mail', '!'])]
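The effect on `e-mail` can be reproduced with a small regex split. This is a sketch of the idea only, using a hypothetical `split_inside` helper; it is not the annotator's own code.

```python
import re

def split_inside(token, split_chars):
    """Split `token` on any of the given single characters, dropping them."""
    # Build a character class from the escaped split characters.
    char_class = "[" + "".join(re.escape(c) for c in split_chars) + "]"
    # Discard empty pieces produced by leading, trailing or repeated split chars.
    return [piece for piece in re.split(char_class, token) if piece]

print(split_inside("e-mail", ["-"]))
print(split_inside("Peter", ["-"]))
```

As in the output above, the split character itself is not kept as a token.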

### `.setSuffixPattern()`

Sets a regex that contains groups and ends with `\z` to match the target suffix, by default `([^\s\w]?)([^\s\w]*)\z`.

In [38]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSuffixPattern(r"([a])\z")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Petra Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [39]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Petr', 'a', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'but', 'has', 'no', 'e-mail!'])]
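The split of `Petra` into `Petr` and `a` can be mimicked with Python's `re` module. Treat this as a rough sketch: `split_suffix` is a hypothetical helper, and the pattern is written with `\Z` because that is the portable spelling of "end of string" in Python's `re`, whereas the Java regex syntax used by Spark NLP writes `\z`.

```python
import re

# Same idea as the suffix pattern in the cell above, in Python regex syntax.
SUFFIX = re.compile(r"([a])\Z")

def split_suffix(token):
    """Split off a trailing 'a' as its own piece, if there is one."""
    match = SUFFIX.search(token)
    # Leave the token alone if nothing matches or the match is the whole token.
    if match is None or match.start() == 0:
        return [token]
    return [token[:match.start()], match.group(1)]

print(split_suffix("Petra"))
print(split_suffix("Parker"))
print(split_suffix("a"))
```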

### `.setPrefixPattern()`

Sets a regex that contains groups and begins with `\A` to match the target prefix, by default `\A([^\s\w\$\.]*)`.

In [40]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPrefixPattern(r"\A([a])")\
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [41]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'a', 'nd', 'lives', 'in', 'New York', 'but', 'has', 'no', 'e-mail!'])]
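For comparison with the custom pattern used above, the default prefix pattern quoted in this section can be inspected directly with Python's `re` module. This only shows what the regex captures on a few sample strings (chosen for this illustration); `leading_prefix` is a hypothetical helper, not part of Spark NLP.

```python
import re

# Default prefix pattern quoted in this section, written as a Python raw string.
DEFAULT_PREFIX = re.compile(r"\A([^\s\w\$\.]*)")

def leading_prefix(token):
    """Return the text captured by the prefix group (possibly empty)."""
    # The group uses `*`, so the pattern always matches, even with nothing to capture.
    return DEFAULT_PREFIX.match(token).group(1)

print(repr(leading_prefix("(Spiderman)")))
print(repr(leading_prefix('"Hello')))
print(repr(leading_prefix("$100")))
print(repr(leading_prefix(".5")))
```

The character class excludes `$` and `.`, so a leading currency sign or decimal point is not captured as a prefix, while an opening parenthesis or quote is.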

### `.setMinLength()`

Sets the minimum allowed length for each token, by default 0.

In [42]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)\
    .setMinLength(2)

nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [43]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(Spiderman)', 'is', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'but', 'has', 'no', 'mail'])]

### `.setMaxLength()`

Sets the maximum allowed length for each token, by default 99999.

In [44]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['?', '!'])\
    .addException("New York")\
    .setCaseSensitiveExceptions(True)\
    .setMinLength(2)\
    .setMaxLength(5)
    
nlpPipeline = Pipeline(stages=[documenter, 
                               tokenizer])

text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [45]:
result = nlpPipeline.fit(spark_df).transform(spark_df)

result.select('token.result').take(1)

[Row(result=['Peter', 'is', 'nice', 'guy', 'and', 'lives', 'in', 'but', 'has', 'no', 'mail'])]
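Judging from the two outputs above, both length bounds appear to be inclusive: `is` (length 2) and `Peter` (length 5) are kept. The plain-Python filter below reproduces both results from the token list produced in the `.setSplitChars()` section; `filter_by_length` is a hypothetical helper used only to illustrate the filtering step.

```python
def filter_by_length(tokens, min_length=0, max_length=99999):
    """Keep tokens whose length lies within [min_length, max_length]."""
    return [t for t in tokens if min_length <= len(t) <= max_length]

# Token list copied from the `.setSplitChars()` output earlier in this notebook.
tokens = ['Peter', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'and',
          'lives', 'in', 'New York', 'but', 'has', 'no', 'e', 'mail', '!']

min_only = filter_by_length(tokens, min_length=2)
min_and_max = filter_by_length(tokens, min_length=2, max_length=5)
print(min_only)
print(min_and_max)
```

Multi-word exceptions are filtered like any other token: `New York` has eight characters, so it disappears once the maximum length is 5.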