![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **Normalizer**

This notebook will cover the different parameters and usages of `Normalizer`. 

**📖 Learning Objectives:**

1. Understand how to clean tokens by making use of this annotator.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [Normalizer](https://nlp.johnsnowlabs.com/docs/en/annotators#normalizer)

- Python Docs : [Normalizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/normalizer/index.html)

- Scala Docs : [Normalizer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/Normalizer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**

This annotator cleans tokens. Its input must be tokens (or stems, which are themselves derived from tokens). It removes unwanted characters from each token according to a regex pattern and replaces words based on a provided dictionary.

## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4


In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `TOKEN`

## **🔎 Parameters**


*   `CleanupPatterns` (*StringArrayParam*) : Regex patterns whose matches are removed from each token (Default: `["[^A-Za-z]"]`)

*   `Lowercase` (*Boolean*) : Whether to convert tokens to lowercase (Default: false)

*   `MaxLength` (*Int*) : Maximum allowed length for each token

*   `MinLength` (*Int*) : Minimum allowed length for each token (Default: 0)

*   `SlangDictionary` (*path*) : Delimited file listing custom words and their corrections

*   `SlangMatchCase` (*Boolean*) : Whether slang matching is case sensitive (Default: false)

### `CleanupPatterns` 

If we don't set `CleanupPatterns`, the default pattern `[^A-Za-z]` is used: every character that is not an ASCII letter is removed, so only letters are kept.

In [3]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much. John is 20 years old and Peter is 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------+
|[John, and, Peter, are, brothers, However, they, dont, support, each, other, that, much, John, is, years, old, and, Peter, is]|
+------------------------------------------------------------------------------------------------------------------------------+



Because `CleanupPatterns` keeps its default value, everything other than letters is removed. That is why `20` and `26` disappear, and why `don't` becomes `dont`.
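The effect of the default pattern can be reproduced outside Spark with Python's `re` module. The following is a minimal sketch, not Spark NLP's implementation: `clean_tokens` is a hypothetical helper, and the token list is typed in by hand to resemble what the `Tokenizer` produces for part of the sample sentence.

```python
import re

DEFAULT_PATTERN = r"[^A-Za-z]"  # same regex as the documented Normalizer default


def clean_tokens(tokens, patterns=(DEFAULT_PATTERN,)):
    """Remove every match of each pattern and drop tokens that end up empty."""
    cleaned = []
    for token in tokens:
        for pattern in patterns:
            token = re.sub(pattern, "", token)
        if token:  # a token made only of removed characters disappears
            cleaned.append(token)
    return cleaned


# Hand-written tokens (assumption: roughly what the Tokenizer would emit)
tokens = ["John", "is", "20", "years", "old", "and", "Peter", "is", "26", "don't", "."]
print(clean_tokens(tokens))
# ['John', 'is', 'years', 'old', 'and', 'Peter', 'is', 'dont']
```

Tokens that consist only of removed characters, such as `20` or `.`, vanish entirely, while `don't` merely loses its apostrophe.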

➤ After specifying CleanupPatterns

In [4]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setCleanupPatterns([r"""[^\w\d\s]"""])  # raw string avoids invalid-escape warnings

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much. John is 20 years old and Peter is 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+--------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------+
|[John, and, Peter, are, brothers, However, they, dont, support, each, other, that, much, John, is, 20, years, old, and, Peter, is, 26]|
+--------------------------------------------------------------------------------------------------------------------------------------+



This pattern removes every character that is not a word character, a digit, or whitespace. The numbers `20` and `26` are therefore kept, while punctuation is still stripped.

That is why: *don't -> dont*
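One detail of the pattern `[^\w\d\s]` is worth checking: `\w` already covers digits and the underscore, so the `\d` is redundant. The sketch below verifies this with Python's `re` on a few hand-picked tokens (the token list is an assumption for illustration). Spark NLP evaluates the pattern on the JVM, where `\w` is ASCII-only by default, whereas Python's `\w` is Unicode-aware; for ASCII text like this sample the two agree.

```python
import re

# Hand-picked tokens for illustration (not taken from a Spark run)
tokens = ["don't", "20", "26", "brothers", ".", "user_name", "e-mail"]

with_d = [re.sub(r"[^\w\d\s]", "", t) for t in tokens]
without_d = [re.sub(r"[^\w\s]", "", t) for t in tokens]

print(with_d)
# ['dont', '20', '26', 'brothers', '', 'user_name', 'email']
print(with_d == without_d)
# True
```

The empty string left by `.` corresponds to a token that the annotator drops from its output.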

### `Lowercase`

(Default: false)

In [5]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much. John is 20 years old and Peter is 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------+
|[John, and, Peter, are, brothers, However, they, dont, support, each, other, that, much, John, is, years, old, and, Peter, is]|
+------------------------------------------------------------------------------------------------------------------------------+



As `Lowercase` keeps its default value of `False`, the original casing of each token is preserved.

➤ Lowercase : True

In [6]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True)
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much. John is 20 years old and Peter is 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much, john, is, years, old, and, peter, is]|
+------------------------------------------------------------------------------------------------------------------------------+



Because we set `Lowercase` to `True`, every token is lowercased.

### `MaxLength` and `MinLength`

Sets the maximum and minimum allowed length for each token.

In [7]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setMaxLength(4) \
    .setMinLength(3)  # MinLength defaults to 0
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much. John is 20 years old and Peter is 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+--------------------------------------------------------------+
|result                                                        |
+--------------------------------------------------------------+
|[john, and, are, they, dont, each, that, much, john, old, and]|
+--------------------------------------------------------------+



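Because we set `MaxLength=4` and `MinLength=3`, only tokens that are 3 or 4 characters long are kept. The output shows that the length is measured after cleanup: `don't` has five characters but survives as the four-character `dont`.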
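The length filter can be mimicked in plain Python. This is a sketch under two assumptions: the token list is written by hand to match what the `Tokenizer` emits for the sample sentence, and cleanup and lowercasing are applied before the length check, which is what the output above suggests.

```python
import re

# Hand-written tokens for the sample sentence (assumed Tokenizer output)
tokens = ["John", "and", "Peter", "are", "brothers", ".", "However", "they",
          "don't", "support", "each", "other", "that", "much", ".", "John",
          "is", "20", "years", "old", "and", "Peter", "is", "26"]

MIN_LENGTH, MAX_LENGTH = 3, 4

# Step 1: default cleanup pattern plus lowercasing
cleaned = [re.sub(r"[^A-Za-z]", "", t).lower() for t in tokens]

# Step 2: keep non-empty tokens whose length is within [MIN_LENGTH, MAX_LENGTH]
kept = [t for t in cleaned if t and MIN_LENGTH <= len(t) <= MAX_LENGTH]
print(kept)
# ['john', 'and', 'are', 'they', 'dont', 'each', 'that', 'much', 'john', 'old', 'and']
```

Both bounds are inclusive in this sketch, matching the result shown in the table above.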

### `SlangDictionary`

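Path to a delimited file listing custom words and their corrections. `setSlangDictionary` takes the file path and the delimiter that separates each slang word from its replacement.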

In [8]:
# Create a demo slang CSV file
import csv

field_names = ['Slang', 'Correct_Word']

slangs = [
    {'Slang': "bros", 'Correct_Word': 'brothers'},
    {'Slang': "approx", 'Correct_Word': 'approximately'},
    {'Slang': "AFAIK", 'Correct_Word': 'As far as I know'}
]

# newline='' is what the csv module recommends for files it writes
with open('slangs.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    writer.writeheader()
    writer.writerows(slangs)

In [9]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setSlangDictionary("/content/slangs.csv", ",")

    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are bros. However they don't support each other that much. AFAIK, John is 20 years old and Peter is approx 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[John, and, Peter, are, brothers, However, they, dont, support, each, other, that, much, As, far, as, I, know, John, is, years, old, and, Peter, is, approximately]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+



The slang words were replaced with their correct words as specified in the file.

*   bros -> brothers

*   AFAIK -> As far as I know

*   approx -> approximately
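The replacement step can be sketched in plain Python. This is an illustration only: `expand_slang` is a hypothetical helper, the token list is hand-written, and the order of operations (look up the slang, split a multi-word replacement on spaces, then apply the cleanup pattern) is an assumption chosen to reproduce the output above rather than a statement about Spark NLP's internals.

```python
import re

# Same entries as slangs.csv (header row left out)
SLANG = {"bros": "brothers", "approx": "approximately", "AFAIK": "As far as I know"}


def expand_slang(tokens, slang=SLANG, pattern=r"[^A-Za-z]"):
    """Replace slang tokens, split multi-word replacements, then clean up."""
    # Case-insensitive lookup, mirroring the default SlangMatchCase=False
    lookup = {key.lower(): value for key, value in slang.items()}
    result = []
    for token in tokens:
        replacement = lookup.get(token.lower(), token)
        for word in replacement.split(" "):
            word = re.sub(pattern, "", word)
            if word:  # drop pieces that are empty after cleanup
                result.append(word)
    return result


tokens = ["are", "bros", ".", "AFAIK", ",", "John", "is", "20", "approx", "26"]
print(expand_slang(tokens))
# ['are', 'brothers', 'As', 'far', 'as', 'I', 'know', 'John', 'is', 'approximately']
```

Note how the single token `AFAIK` expands into five output tokens, just as in the result table.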




### `SlangMatchCase` 

Whether slang matching is case sensitive (Default: false)

In [10]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setSlangDictionary("/content/slangs.csv", ",") \
    .setSlangMatchCase(False)

    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are bros. However they don't support each other that much. afaik, John is 20 years old and Peter is approx 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[John, and, Peter, are, brothers, However, they, dont, support, each, other, that, much, As, far, as, I, know, John, is, years, old, and, Peter, is, approximately]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+



Because `SlangMatchCase` is `False`, slang is matched regardless of case.

*   afaik -> As far as I know

➤ SlangMatchCase : True

In [11]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setSlangDictionary("/content/slangs.csv", ",") \
    .setSlangMatchCase(True)

    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are bros. However they don't support each other that much. afaik, John is 20 years old and Peter is approx 26"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|[John, and, Peter, are, brothers, However, they, dont, support, each, other, that, much, afaik, John, is, years, old, and, Peter, is, approximately]|
+----------------------------------------------------------------------------------------------------------------------------------------------------+



Because `SlangMatchCase` is `True`, only tokens that match a dictionary entry's exact casing are replaced.

*   afaik -> afaik (Remains Same)
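The difference between the two settings comes down to how the dictionary lookup is done. A minimal sketch, assuming a hypothetical helper `lookup_slang` and an in-memory copy of the dictionary (this is not Spark NLP's code):

```python
# Same entries as slangs.csv (header row left out)
SLANG = {"bros": "brothers", "approx": "approximately", "AFAIK": "As far as I know"}


def lookup_slang(token, slang=SLANG, match_case=False):
    """Return the replacement for a token, or the token itself if none matches."""
    if match_case:
        # Exact, case-sensitive lookup
        return slang.get(token, token)
    # Case-insensitive lookup: compare lowercased keys with the lowercased token
    lowered = {key.lower(): value for key, value in slang.items()}
    return lowered.get(token.lower(), token)


for match_case in (False, True):
    print(match_case, [lookup_slang(t, match_case=match_case) for t in ("afaik", "AFAIK", "bros")])
# False ['As far as I know', 'As far as I know', 'brothers']
# True ['afaik', 'As far as I know', 'brothers']
```

`bros` is replaced in both runs because its casing already matches the dictionary entry.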


▶ *Token indices are preserved*

In [12]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John is 20 and Peter is 26 years old."]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)

result.select("token.result","token.begin","token.end").show(truncate=False)
result.select("normalized.result","normalized.begin","normalized.end").withColumnRenamed("result","normalized result").show(truncate=False)


+-------------------------------------------------+-------------------------------------+-------------------------------------+
|result                                           |begin                                |end                                  |
+-------------------------------------------------+-------------------------------------+-------------------------------------+
|[John, is, 20, and, Peter, is, 26, years, old, .]|[0, 5, 8, 11, 15, 21, 24, 27, 33, 36]|[3, 6, 9, 13, 19, 22, 25, 31, 35, 36]|
+-------------------------------------------------+-------------------------------------+-------------------------------------+

+--------------------------------------+--------------------------+--------------------------+
|normalized result                     |begin                     |end                       |
+--------------------------------------+--------------------------+--------------------------+
|[John, is, and, Peter, is, years, old]|[0, 5, 11, 15, 21, 27, 33]|[3, 6, 

As we can see above, token indices are preserved after applying `Normalizer()`: each remaining token keeps the offsets of its original token.
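The idea of carrying offsets through normalization can be sketched without Spark. The regex tokenizer below is only a rough stand-in for the `Tokenizer` (an assumption for illustration); the point is that offsets are copied from the original token rather than recomputed after cleanup.

```python
import re

text = "John is 20 and Peter is 26 years old."

# Rough stand-in for the Tokenizer: words or single punctuation marks,
# each with an inclusive (begin, end) character span.
tokens = [(m.group(), m.start(), m.end() - 1)
          for m in re.finditer(r"\w+|[^\w\s]", text)]

normalized = []
for word, begin, end in tokens:
    cleaned = re.sub(r"[^A-Za-z]", "", word)  # default cleanup pattern
    if cleaned:
        # Offsets are copied from the original token, not recomputed
        normalized.append((cleaned, begin, end))

print([word for word, _, _ in normalized])
# ['John', 'is', 'and', 'Peter', 'is', 'years', 'old']
print([begin for _, begin, _ in normalized])
# [0, 5, 11, 15, 21, 27, 33]
```

Because dropped tokens such as `20` leave gaps in the sequence, the surviving offsets still point back into the original text.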