![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_Cleaner_Demo.ipynb)

# Introducing Cleaner in Spark NLP
This notebook showcases the newly added `Cleaner()` annotator in Spark NLP, which removes unnecessary or undesirable content from datasets, such as bullets, dashes, and non-ASCII characters, to improve data consistency and readability.

In [0]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.4.1


## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading HTML files was introduced in Spark NLP 6.0.0. Please make sure you have upgraded to the latest Spark NLP release.
We simply need to import the cleaner components to use the `Cleaner` annotator:

In [0]:
from sparknlp.annotator.cleaners import *

## Cleaning data

Convert a string that contains byte escape sequences into a string of human-readable characters

In [0]:
data = "Hello ð\\x9f\\x98\\x80"
data_set = spark.createDataFrame([[data]]).toDF("text")

In [0]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("bytes_string_to_string")

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

+---------------------------------+
|cleaned                          |
+---------------------------------+
|[{chunk, 0, 8, Hello 😀, {}, []}]|
+---------------------------------+
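To make the `bytes_string_to_string` result less mysterious, the conversion can be sketched in plain Python. This is only an illustration of the idea, not Spark NLP's implementation: the helper name and its `encoding` parameter are assumptions made for this example. Each literal `\xNN` sequence is turned into the byte it denotes, `ð` (U+00F0) is treated as byte `0xF0`, and the resulting four bytes are decoded as UTF-8.

```python
import re

def bytes_string_to_string(text: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: reinterpret a 'bytes written as text' string."""
    # Turn each literal "\xNN" escape into the single character with that code.
    unescaped = re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    # Assumes every character is now below U+0100, so latin-1 maps each
    # character back to exactly one byte.
    raw = unescaped.encode("latin-1")
    # Decode the raw bytes with the intended encoding.
    return raw.decode(encoding)

print(bytes_string_to_string("Hello ð\\x9f\\x98\\x80"))  # Hello 😀
```

The bytes `F0 9F 98 80` are the UTF-8 encoding of the grinning-face emoji, which is why the cleaned text ends in 😀.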



Cleaning special characters (bullets, dashes, and extra whitespace) from a string

In [0]:
data = [
    "● An excellent point!",
    "ITEM 1A:     RISK-FACTORS"
]

data_set = spark.createDataFrame(data, "string").toDF("text")

In [0]:
cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("clean") \
    .setBullets(True) \
    .setExtraWhitespace(True) \
    .setDashes(True)

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

+-----------------------------------------------+
|cleaned                                        |
+-----------------------------------------------+
|[{chunk, 0, 19, An excellent point!, {}, []}]  |
|[{chunk, 0, 21, ITEM 1A: RISK FACTORS, {}, []}]|
+-----------------------------------------------+
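The three switches used in the `clean` mode can be approximated with a few regular expressions. The sketch below is a plain-Python stand-in written for this notebook, not the annotator's code: the function name `clean`, its keyword arguments, and the list of bullet glyphs are assumptions chosen to mirror `setBullets`, `setDashes`, and `setExtraWhitespace`.

```python
import re

def clean(text, bullets=False, dashes=False, extra_whitespace=False):
    """Hypothetical plain-Python stand-in for the "clean" mode."""
    if bullets:
        # Drop one leading bullet glyph and the whitespace around it.
        # The glyph list is an illustrative subset.
        text = re.sub(r"^\s*[●•○◦▪■□▶·]\s*", "", text)
    if dashes:
        # Replace hyphens, en dashes (U+2013) and em dashes (U+2014) with a space.
        text = re.sub(r"[-\u2013\u2014]", " ", text)
    if extra_whitespace:
        # Collapse runs of whitespace into one space and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
    return text

for line in ["● An excellent point!", "ITEM 1A:     RISK-FACTORS"]:
    print(clean(line, bullets=True, dashes=True, extra_whitespace=True))
```

Running the dash step before the whitespace step matters here: replacing a dash can leave a doubled space, which the final step then collapses.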



Clean non-ASCII characters

In [0]:
data = ["\\x88This text contains ®non-ascii characters!●"]
data_set = spark.createDataFrame(data, "string").toDF("text")

In [0]:
cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("clean_non_ascii_chars")

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

+------------------------------------------------------------------+
|cleaned                                                           |
+------------------------------------------------------------------+
|[{chunk, 0, 40, This text contains non-ascii characters!, {}, []}]|
+------------------------------------------------------------------+
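Conceptually, dropping non-ASCII characters amounts to an encode/decode round trip that ignores anything outside the 7-bit range. The sketch below shows that idea in plain Python; the helper name `clean_non_ascii_chars` is hypothetical, and the input writes the first character as a real `\x88` escape (a single U+0088 character) rather than the literal backslash sequence used in the cell above.

```python
def clean_non_ascii_chars(text: str) -> str:
    """Hypothetical helper: keep only characters in the ASCII range."""
    # errors="ignore" silently drops every character that ASCII cannot encode.
    return text.encode("ascii", errors="ignore").decode("ascii")

sample = "\x88This text contains ®non-ascii characters!●"
print(clean_non_ascii_chars(sample))  # This text contains non-ascii characters!
```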



Cleaning alphanumeric bullets from the beginning of a text

In [0]:
data = [("1.1 This is a very important point",),
        ("a.1 This is a very important point",),
        ("1.4.2 This is a very important point",)]

data_set = spark.createDataFrame(data).toDF("text")

In [0]:
cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("clean_ordered_bullets")

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

+--------------------------------------------------------+
|cleaned                                                 |
+--------------------------------------------------------+
|[{chunk, 0, 30, This is a very important point, {}, []}]|
|[{chunk, 0, 30, This is a very important point, {}, []}]|
|[{chunk, 0, 30, This is a very important point, {}, []}]|
+--------------------------------------------------------+
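A simplified way to picture what counts as an ordered bullet is a regular expression that matches short alphanumeric parts separated by dots at the start of the text. The pattern and the helper name below are assumptions made for illustration; the annotator's actual rules may differ, for example in how long each part is allowed to be.

```python
import re

# The first whitespace-delimited token must contain a dot (lookahead), and must
# consist of 1-2 character alphanumeric parts separated by dots, e.g. "1.", "a.1", "1.4.2".
ORDERED_BULLET = re.compile(
    r"^\s*(?=\S*\.)[0-9A-Za-z]{1,2}(?:\.[0-9A-Za-z]{1,2})*\.?\s+"
)

def clean_ordered_bullets(text: str) -> str:
    """Hypothetical helper: strip one leading ordered bullet, if present."""
    return ORDERED_BULLET.sub("", text, count=1)

for line in ["1.1 This is a very important point",
             "a.1 This is a very important point",
             "1.4.2 This is a very important point"]:
    print(clean_ordered_bullets(line))
```

The lookahead keeps ordinary sentences such as "A lovely day" intact, because their first word contains no dot.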



Clean a postfix from a text based on a pattern

In [0]:
data = ["The end! END"]

data_set = spark.createDataFrame(data, "string").toDF("text")

In [0]:
cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("clean_postfix") \
    .setCleanPostfixPattern("(END|STOP)")

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

+---------------------------------+
|cleaned                          |
+---------------------------------+
|[{chunk, 0, 8, The end!, {}, []}]|
+---------------------------------+



Clean a prefix from a text based on a pattern

In [0]:
data = ["SUMMARY: This is the best summary of all time!"]

data_set = spark.createDataFrame(data, "string").toDF("text")

In [0]:
cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("clean_prefix") \
    .setCleanPrefixPattern("(SUMMARY|DESCRIPTION):")

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

+---------------------------------------------------------------+
|cleaned                                                        |
+---------------------------------------------------------------+
|[{chunk, 0, 37, This is the best summary of all time!, {}, []}]|
+---------------------------------------------------------------+
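The prefix and postfix modes differ only in where the pattern is anchored: at the start of the text for a prefix, at the end for a postfix. The following plain-Python sketch illustrates that difference. The helper names and the `ignore_case` and `strip` arguments are hypothetical and are not taken from the annotator's API.

```python
import re

def clean_prefix(text, pattern, ignore_case=False, strip=True):
    """Hypothetical helper: remove `pattern` when it occurs at the start."""
    flags = re.IGNORECASE if ignore_case else 0
    # "^" anchors the match to the beginning of the text.
    cleaned = re.sub(rf"^(?:{pattern})", "", text, flags=flags)
    return cleaned.lstrip() if strip else cleaned

def clean_postfix(text, pattern, ignore_case=False, strip=True):
    """Hypothetical helper: remove `pattern` when it occurs at the end."""
    flags = re.IGNORECASE if ignore_case else 0
    # "$" anchors the match to the end of the text.
    cleaned = re.sub(rf"(?:{pattern})$", "", text, flags=flags)
    return cleaned.rstrip() if strip else cleaned

print(clean_prefix("SUMMARY: This is the best summary of all time!",
                   r"(SUMMARY|DESCRIPTION):"))
print(clean_postfix("The end! END", r"(END|STOP)"))
```

Wrapping the pattern in a non-capturing group keeps the anchor attached to the whole alternation, so a pattern such as `END|STOP` without parentheses would still be anchored correctly.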



Replacing Unicode characters in a text

In [0]:
data = [
    "\x93A lovely quote!\x94",
    "\x91A lovely quote!\x92",
    """\u201CA lovely quote!\u201D — with a dash"""
]

data_set = spark.createDataFrame(data, "string").toDF("text")

In [0]:
cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("replace_unicode_characters")

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

+---------------------------------------------------------+
|cleaned                                                  |
+---------------------------------------------------------+
|[{chunk, 0, 17, “A lovely quote!”, {}, []}]              |
|[{chunk, 0, 17, ‘A lovely quote!’, {}, []}]              |
|[{chunk, 0, 31, ?A lovely quote!? ? with a dash, {}, []}]|
+---------------------------------------------------------+
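The first two rows can be reproduced with a small lookup table. The sketch below assumes that `\x91` to `\x94` are Windows-1252 quote bytes that ended up in the text as raw code points. The table and the helper name `replace_unicode_quotes` are illustrative and cover only the four characters used in this example, not the full set the annotator may handle.

```python
# Windows-1252 bytes 0x91-0x94 and the typographic quotes they stand for.
QUOTE_REPLACEMENTS = {
    "\x91": "\u2018",  # left single quotation mark
    "\x92": "\u2019",  # right single quotation mark
    "\x93": "\u201c",  # left double quotation mark
    "\x94": "\u201d",  # right double quotation mark
}

def replace_unicode_quotes(text: str) -> str:
    """Hypothetical helper: map stray quote bytes to proper Unicode quotes."""
    return text.translate(str.maketrans(QUOTE_REPLACEMENTS))

print(replace_unicode_quotes("\x93A lovely quote!\x94"))
print(replace_unicode_quotes("\x91A lovely quote!\x92"))
```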



### Translator

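You can even use the `Cleaner` annotator to translate a text. Calling `pretrained()` with no arguments downloads the `opus_mt_en_fr` model, as the log output of the cell shows, and the text is translated from English to French.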

In [0]:
data = ["This should go to French"]
data_set = spark.createDataFrame(data, "string").toDF("text")

In [0]:
cleaner = Cleaner() \
    .pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned")

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("cleaned").show(truncate=False)

opus_mt_en_fr download started this may take some time.
Approximate size to download 378.7 MB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][OK!]
+-----------------------------------------------------------------------+
|cleaned                                                                |
+-----------------------------------------------------------------------+
|[{document, 0, 28, Ça devrait aller en français., {sentence -> 0}, []}]|
+-----------------------------------------------------------------------+

