![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/07.01.DocumentNormalizer.ipynb)

# **DocumentNormalizer**

This notebook will cover the different parameters and usages of `DocumentNormalizer`.

**📖 Learning Objectives:**

1. Understand how to normalize raw text extracted from tagged sources, e.g. scraped web pages and XML documents.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [DocumentNormalizer](https://nlp.johnsnowlabs.com/docs/en/annotators#documentnormalizer)

- Python Docs : [DocumentNormalizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/document_normalizer/index.html)

- Scala Docs : [DocumentNormalizer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/DocumentNormalizer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**

This annotator normalizes raw text from tagged sources, e.g. scraped web pages or XML documents, held in document-type columns. It removes unwanted characters that match one or more input regex patterns, applies that removal according to a chosen policy, and can optionally convert the text to lowercase.

## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4


In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

## **🔎 Parameters**


*  `Action` ( *String* ) : Action to perform when applying the regex patterns to the text, either "clean" or "extract" (Default: "clean")

*   `Lowercase` ( *Boolean* ) : Whether to convert strings to lowercase (Default: false)

*  `Patterns` ( *StringArrayParam* ) : Regex patterns whose matches are removed from the document (Default: ["<[^>]*>"], which removes all HTML tags)


*  `Policy` ( *String* ) : RemovalPolicy to remove patterns from text with a given policy (Default: "pretty_all").  Valid policy values are: "all", "pretty_all", "first", "pretty_first"

*   `Encoding` ( *String* ) : File encoding to apply on normalized documents (Default: "disable"). Supported encodings are: UTF_8, UTF_16, US_ASCII, ISO-8859-1, UTF-16BE, UTF-16LE.


*   `Replacement` ( *String* ) : Replacement string to apply when regexes match (Default: " ")







In [3]:
text = '''
  <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
    THE WORLD'S LARGEST WEB DEVELOPER SITE
    <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
    <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
  </div>

</div>'''

In [4]:
spark_df = spark.createDataFrame([[text]]).toDF("text")

spark_df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                       

### `Action`
 
Action to perform when applying the regex patterns to the text: either "clean" or "extract".

Default Action: "clean" 
 

In [5]:
documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Default pattern: matches any HTML tag
cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument').show(truncate=False)
result.select('normalizedDocument.result').show(truncate=False)


+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|normalizedDocument                                                                                                                                                                                                                                                                                         

Because the default action is "clean", every match of the `cleanUpPatterns` defined above is removed from the text. In this case, that means all HTML tags are removed.
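The effect of the "clean" action with the default pattern can be approximated in plain Python. This is a minimal sketch using the standard `re` module, not Spark NLP's implementation. The helper name `clean_tags` is hypothetical, and the whitespace tidying at the end is only there to make the printed result easier to read.

```python
import re

# Default pattern used in this notebook: matches any HTML/XML tag.
TAG_PATTERN = r"<[^>]*>"

def clean_tags(text, replacement=" "):
    """Replace every tag match with `replacement`, then tidy whitespace.

    Hypothetical helper for illustration only; the tidy-up step is an
    assumption and not necessarily what DocumentNormalizer does internally.
    """
    without_tags = re.sub(TAG_PATTERN, replacement, text)
    return " ".join(without_tags.split())

sample = '<h1 style="font-size:300%;">THE WORLD\'S LARGEST WEB DEVELOPER SITE</h1>'
cleaned = clean_tags(sample)
print(cleaned)
```

Only the markup is dropped. The text between the tags is kept, which is the difference from the paragraph-removal pattern used later in this notebook.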


<h4> Action : "extract" </h4>

In [6]:
#Download demo data : https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/document-normalizer/xml-docs/C-CDAsample.xml
!mkdir -p xml-docs
!wget -O xml-docs/demo.xml https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/annotation/english/document-normalizer/xml-docs/C-CDAsample.xml

--2023-01-11 13:55:27--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/annotation/english/document-normalizer/xml-docs/C-CDAsample.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 133800 (131K) [text/plain]
Saving to: ‘xml-docs/demo.xml’


2023-01-11 13:55:27 (5.82 MB/s) - ‘xml-docs/demo.xml’ saved [133800/133800]



In [7]:
# Data loading
data = spark.sparkContext.wholeTextFiles("xml-docs")
df = data.toDF(schema=["filename", "text"]).select("text")
df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [8]:
# Specify the action as extract
action = "extract"

tag = "name"
patterns = [tag]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction(action) \
    .setPatterns(patterns) \
    .setReplacement("") \
    .setPolicy("pretty_all")

sentenceDetector = SentenceDetector() \
      .setInputCols(["normalizedDocument"]) \
      .setOutputCol("sentence")

regexTokenizer = Tokenizer() \
      .setInputCols(["sentence"]) \
      .setOutputCol("token") \
      .fit(df)

docPatternRemoverPipeline = \
  Pipeline() \
    .setStages([
        documentAssembler,
        documentNormalizer,
        sentenceDetector,
        regexTokenizer])

ds = docPatternRemoverPipeline.fit(df).transform(df)

ds.select("normalizedDocument.result").show(10, False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                             

Because the action is "extract" and the pattern is the tag `name`, the annotator extracts the contents of the XML `name` tags from the given data.
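To build intuition for what "extract" returns, here is a rough plain-Python sketch that pulls out the contents of a given tag with `re`. The regex, the helper name `extract_tag_contents`, and the sample XML are all assumptions made for illustration; Spark NLP may build its extraction pattern differently.

```python
import re

def extract_tag_contents(xml_text, tag):
    """Return the text found between <tag ...> and </tag> pairs.

    Hypothetical helper: an approximation of the "extract" action,
    not the annotator's actual logic.
    """
    name = re.escape(tag)
    # Opening tag with optional attributes, lazy capture, matching closing tag.
    pattern = rf"<{name}(?:\s[^>]*)?>(.*?)</{name}>"
    return [match.strip() for match in re.findall(pattern, xml_text, flags=re.DOTALL)]

# Made-up sample, much smaller than the C-CDA file downloaded above.
sample_xml = "<patient><name>Jane Roe</name><id>7</id><name>City Clinic</name></patient>"
names = extract_tag_contents(sample_xml, "name")
print(names)
```

In a real C-CDA document the `name` elements often contain nested tags, so a sketch like this would return the inner markup as well as the text.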

### `Lowercase` 

Whether to convert strings to lowercase (Default: false)

In [9]:
text = '''
  <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
    THE WORLD'S LARGEST WEB DEVELOPER SITE
    <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
    <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
  </div>

</div>'''

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [10]:
# Default pattern: matches any HTML tag
cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all")


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                              

By default, lowercase is False, so the text keeps its original casing.

In [11]:
# Default pattern: matches any HTML tag
cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                              

Because Lowercase is set to True, the entire output is in lowercase.

### `Patterns` 

Regex patterns whose matches are removed from the document (Default: ["<[^>]*>"], which removes all HTML tags)

In [12]:
text = '''
  <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
    THE WORLD'S LARGEST WEB DEVELOPER SITE
    <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
    <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
  </div>

</div>'''

In [13]:
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                              

No pattern is specified, so the annotator uses the default value (`cleanUpPatterns = ["<[^>]*>"]`) and removes all HTML tags.

▶ After specifying a pattern.

In [14]:
#Specify cleanUpPatterns to remove the paragraph tag and its content
cleanUpPatterns = ["<p .*?>(.*?)</p>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'segoe ui',arial,sans-serif"> the world's largest web developer site <h1 style="font-size:300%;">the world's largest web developer site</h1> 

Here we specified a regex that matches only the HTML paragraph tag and its content, so the whole paragraph was removed. The other tags remained as they were.
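The same pattern can be tried on a small string with Python's `re` module to see exactly what it matches. This is a sketch on a made-up HTML snippet, not the annotator itself. Note that the pattern requires a space after `<p`, so a bare `<p>` without attributes is not matched.

```python
import re

# Pattern from the cell above: a <p ...> tag with attributes, its content, and </p>.
PARAGRAPH_PATTERN = r"<p .*?>(.*?)</p>"

# Made-up snippet for illustration.
html = '<div><h1 style="font-size:300%;">Title</h1><p style="font-size:160%;">Body text.</p></div>'

without_paragraph = re.sub(PARAGRAPH_PATTERN, " ", html)
print(without_paragraph)

# A <p> tag without attributes is left untouched by this pattern.
bare_paragraph = re.sub(PARAGRAPH_PATTERN, " ", "<p>plain</p>")
print(bare_paragraph)
```

Because `.` does not match newlines by default, a paragraph that spans several lines would also be left in place unless the pattern is adapted.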




### `Replacement` 

 Replacement string to apply when regexes match (Default: " ")

In [15]:
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPolicy("pretty_all") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                              

As we can see here, Replacement takes its default value (" ").

Using Replacement to obfuscate PII such as ages in HTML content

In [16]:
text = """
<!DOCTYPE html>
<html>
<body>
<a class='w3schools-logo notranslate' href='//www.w3schools.com'>w3schools<span class='dotcom'>.com</span></a>
<h1 style="font-size:300%;">This is a heading</h1>
<p style="font-size:160%;">This is a paragraph containing some PII like jonhdoe@myemail.com ! John is now 42 years old.</p>
<p style="font-size:160%;">48% of cardiologists treated patients aged 65+.</p>

</body>
</html> """

In [17]:
df = spark.createDataFrame([[text]]).toDF("text")
df.show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                                                                                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [18]:
action = "clean"
patterns = ["\\d+(?=[\\s]?year)", "(aged)[\\s]?\\d+"]

#Specify the replacement, other than the default: " "
replacement = "***OBFUSCATED PII***"

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction(action) \
    .setPatterns(patterns) \
    .setReplacement(replacement) \
    .setPolicy("pretty_all") \
    .setLowercase(True)

docPatternRemoverPipeline = \
  Pipeline() \
    .setStages([
        documentAssembler,
        documentNormalizer,
        sentenceDetector,
        regexTokenizer])

ds = docPatternRemoverPipeline.fit(df).transform(df)

ds.select("normalizedDocument.result").show(10, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+---------------------------------------------------------------------------------------------------------

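This replaces ages in the HTML content with the specified replacement, `***OBFUSCATED PII***`. Because `setLowercase(True)` is also set, the replacement appears in lowercase in the output.


For example, the HTML content contained the following:

John is now 42 years old.

The first pattern matches only the digits `42` (the lookahead leaves "years" in place), so `42` -> `***obfuscated pii***`

So we get: `john is now ***obfuscated pii*** years old.`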
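The two patterns can be checked in plain Python to see which characters each one consumes. This is a sketch using `re`, with a hypothetical helper `obfuscate`. Applying lowercase after the replacement is an assumption based on the output shown above.

```python
import re

# Patterns and replacement copied from the cell above.
patterns = [r"\d+(?=[\s]?year)", r"(aged)[\s]?\d+"]
replacement = "***OBFUSCATED PII***"

def obfuscate(text, lowercase=True):
    """Apply each pattern in turn, then optionally lowercase the result.

    Hypothetical helper; the order of operations is assumed, not taken
    from the Spark NLP source.
    """
    for pattern in patterns:
        text = re.sub(pattern, replacement, text)
    return text.lower() if lowercase else text

first = obfuscate("John is now 42 years old.")
second = obfuscate("48% of cardiologists treated patients aged 65+.")
print(first)
print(second)
```

The second pattern has no lookahead, so the word "aged" is replaced together with the number, while the leading "48%" is untouched because neither pattern matches it.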




### `Policy` 

RemovalPolicy to remove patterns from text with a given policy (Default: "pretty_all"). 

Valid policy values are:

*   `all`
*   `pretty_all`
*   `first`
*   `pretty_first`






In [19]:
text = '''
  <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
    THE WORLD'S LARGEST WEB DEVELOPER SITE
    <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
    <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
  </div>

</div>'''

<h4> Policy: "pretty_all" </h4>

In [20]:
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setReplacement(" ") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                              

Because no policy is specified, Policy takes its default value, "pretty_all". With this policy, all matches of the patterns are replaced, and newlines, repeated spaces, and tab characters are removed from the string.

<h4> Policy: "all" </h4>

In [21]:
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setReplacement(" ") \
    .setPolicy("all") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                   

With this policy, all matches of the patterns are replaced. However, newlines, repeated spaces, and tab characters are not removed from the string.

<h4> Policy: "first" </h4>

In [22]:
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setReplacement(" ") \
    .setPolicy("first") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                             

With this policy, only the first match is replaced (the first HTML tag in this case), and newlines, repeated spaces, and tab characters are not removed from the string.

<h4> Policy: "pretty_first" </h4>

In [23]:
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setReplacement(" ") \
    .setPolicy("pretty_first") \
    .setLowercase(True)


pipeline = Pipeline() \
    .setStages([documentAssembler,
                documentNormalizer])
    
result = pipeline.fit(spark_df).transform(spark_df)
result.select('normalizedDocument.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                

With this policy, only the first match is replaced (the first HTML tag in this case), and newlines, repeated spaces, and tab characters are removed from the string.
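The four policies can be compared side by side with a small plain-Python sketch. The helper `apply_policy` is hypothetical, and the "pretty" step is approximated here as collapsing every run of whitespace into a single space, which is an assumption based on the descriptions above rather than on the Spark NLP source.

```python
import re

VALID_POLICIES = ("all", "pretty_all", "first", "pretty_first")

def apply_policy(text, pattern, replacement=" ", policy="pretty_all"):
    """Approximate the removal policies with re.sub (illustrative only)."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"Unknown policy: {policy}")
    # "first" policies replace one match; count=0 means "replace all" in re.sub.
    count = 1 if policy in ("first", "pretty_first") else 0
    result = re.sub(pattern, replacement, text, count=count)
    if policy.startswith("pretty"):
        # Assumed "pretty" behaviour: collapse newlines, tabs, repeated spaces.
        result = re.sub(r"\s+", " ", result)
    return result

TAG_PATTERN = r"<[^>]*>"
html = "<div>\n\t<h1>Title</h1>\n  <p>Body</p>\n</div>"  # made-up sample

results = {policy: apply_policy(html, TAG_PATTERN, policy=policy) for policy in VALID_POLICIES}
for policy, output in results.items():
    print(f"{policy:>12}: {output!r}")
```

Printing with `!r` makes the remaining newlines and tabs visible, which is the easiest way to see the difference between the plain and "pretty" variants.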