![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/06.04.StopWordsCleaner.ipynb)

# **StopWordsCleaner**

This notebook will cover the different parameters and usages of `StopWordsCleaner`.

**📖 Learning Objectives:**

1. Understand how to drop stop words from the input sequences.

2. Become comfortable using the different parameters of the annotator.

3. Learn how to use pretrained `StopWordsCleaner` models.


**🔗 Helpful Links:**

- Documentation : [StopWordsCleaner](https://nlp.johnsnowlabs.com/docs/en/annotators#stopwordscleaner)

- Python Docs : [StopWordsCleaner](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/stop_words_cleaner/index.html)

- Scala Docs : [StopWordsCleaner](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/StopWordsCleaner.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**

This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, or Stemmer) and drops all the stop words from the input sequences.

By default, it uses the stop words from MLlib's StopWordsRemover. <https://spark.apache.org/docs/latest/ml-features#stopwordsremover>

Stop words can also be set explicitly with `setStopWords([String])`, or loaded from a pretrained model with the `pretrained` method of the companion object.


## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4



In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `TOKEN`

## **🔎 Parameters**

*   `CaseSensitive` (*Boolean*): Whether to do a case-sensitive comparison over the stop words (Default: false)

*   `StopWords` ([*String*]): The words to be filtered out (Default: Stop words from MLlib)


### `CaseSensitive` 

Whether to do a case-sensitive comparison over the stop words (Default: false)



In [3]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

stop_words = StopWordsCleaner()\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        stop_words
    ]
)

prediction_data = spark.createDataFrame([["Tom is a nice man. He lives in Kashmir."]]).toDF("text")

result = prediction_pipeline.fit(prediction_data).transform(prediction_data)
result.select("cleanTokens.result").show(1, False)

+--------------------------------------+
|result                                |
+--------------------------------------+
|[Tom, nice, man, ., lives, Kashmir, .]|
+--------------------------------------+



Because nothing was specified, the default `CaseSensitive = False` applies, so any token that matches a default stop word (from MLlib) is removed regardless of its case. In this example, "is", "a", "He", and "in" were removed.

 <h4> ➤ CaseSensitive = True </h4>

In [4]:
# Specify .setCaseSensitive(True)
stop_words = StopWordsCleaner()\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(True)

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        stop_words
    ]
)

prediction_data = spark.createDataFrame([["Tom is a nice man. He lives in Kashmir."]]).toDF("text")

result = prediction_pipeline.fit(prediction_data).transform(prediction_data)
result.select("cleanTokens.result").show(1, False)

+------------------------------------------+
|result                                    |
+------------------------------------------+
|[Tom, nice, man, ., He, lives, Kashmir, .]|
+------------------------------------------+



With `CaseSensitive = True`, the word "*He*" is not treated as a stop word, because all stop words in MLlib's StopWordsRemover are lowercase.
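The effect of the flag can be sketched in plain Python. This is only an illustration of the comparison described above: `remove_stop_words` is a hypothetical helper, not part of Spark NLP, and the annotator's actual implementation may differ in its details.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that are not stop words (illustrative helper only)."""
    if case_sensitive:
        # Exact match: "He" does not match the lowercase entry "he".
        stop_set = set(stop_words)
        return [t for t in tokens if t not in stop_set]
    # Case-insensitive match: compare lowercased tokens with lowercased stop words.
    stop_set = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop_set]


tokens = ["Tom", "is", "a", "nice", "man", ".", "He", "lives", "in", "Kashmir", "."]
stop_words = ["he", "is", "a", "in"]  # small lowercase subset of the default list

print(remove_stop_words(tokens, stop_words, case_sensitive=False))
print(remove_stop_words(tokens, stop_words, case_sensitive=True))
```

With the case-sensitive comparison, "He" survives because only the lowercase form "he" is in the list, which mirrors the two pipeline outputs shown above.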

➤ Stop words from MLlib's StopWordsRemover

In [5]:
stop_words.getStopWords()

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

### `StopWords` 


In [6]:
stop_words = StopWordsCleaner()\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        stop_words
    ]
)

prediction_data = spark.createDataFrame([["Tom is a nice man. He lives in Kashmir."]]).toDF("text")

result = prediction_pipeline.fit(prediction_data).transform(prediction_data)
result.select("cleanTokens.result").show(1, False)

+--------------------------------------+
|result                                |
+--------------------------------------+
|[Tom, nice, man, ., lives, Kashmir, .]|
+--------------------------------------+



Because no stop words were specified, the default stop words from MLlib's StopWordsRemover are used.

<h2> ➤ StopWords : an array of strings, set manually or read from a text file. </h2>

In [7]:
stop_words = StopWordsCleaner()\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)\
    .setStopWords(["is", "a"])  # Only "is" and "a" are treated as stop words here; the list could also be read from a file.

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        stop_words
    ]
)

prediction_data = spark.createDataFrame([["Tom is a nice man. He lives in Kashmir."]]).toDF("text")

result = prediction_pipeline.fit(prediction_data).transform(prediction_data)
result.select("cleanTokens.result").show(1, False)

+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|[Tom, nice, man, ., He, lives, in, Kashmir, .]|
+----------------------------------------------+



Only the stop words passed to `.setStopWords()` were removed. Because we specified just two words ("is" and "a"), other default stop words such as "He" and "in" were kept.
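When the list lives in a text file, it can be read into a Python list first. The sketch below assumes a simple format of one stop word per line, with blank lines and `#` comments ignored; `load_stop_words` and the temporary file are hypothetical and only serve the illustration.

```python
import os
import tempfile


def load_stop_words(path):
    """Read one stop word per line, skipping blank lines and '#' comments."""
    with open(path, encoding="utf-8") as f:
        words = [line.strip() for line in f]
    return [w for w in words if w and not w.startswith("#")]


# Write a tiny example file so that the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("# custom stop words\nis\na\n\n")
    stop_words_path = tmp.name

custom_stop_words = load_stop_words(stop_words_path)
os.remove(stop_words_path)

print(custom_stop_words)  # ['is', 'a']

# The list could then be passed to the annotator as in the cell above:
# StopWordsCleaner().setInputCols(["token"]).setOutputCol("cleanTokens").setStopWords(custom_stop_words)
```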

➤ Token positions are preserved.

In [8]:
stop_words = StopWordsCleaner()\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        stop_words
    ]
)

prediction_data = spark.createDataFrame([["Tom is a nice man. He lives in Kashmir."]]).toDF("text")

result = prediction_pipeline.fit(prediction_data).transform(prediction_data)


In [9]:
result.select("token.result","token.begin","token.end").withColumnRenamed("result","Tokens").show(truncate=False)
result.select("cleanTokens.result","cleanTokens.begin","cleanTokens.end").withColumnRenamed("result","Clean Tokens").show(truncate=False)

+-----------------------------------------------------+----------------------------------------+-----------------------------------------+
|Tokens                                               |begin                                   |end                                      |
+-----------------------------------------------------+----------------------------------------+-----------------------------------------+
|[Tom, is, a, nice, man, ., He, lives, in, Kashmir, .]|[0, 4, 7, 9, 14, 17, 19, 22, 28, 31, 38]|[2, 5, 7, 12, 16, 17, 20, 26, 29, 37, 38]|
+-----------------------------------------------------+----------------------------------------+-----------------------------------------+

+--------------------------------------+--------------------------+---------------------------+
|Clean Tokens                          |begin                     |end                        |
+--------------------------------------+--------------------------+---------------------------+
|[Tom, nice, man

As we can see above, the position of the tokens is preserved in the cleaned tokens.
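The same idea can be checked in plain Python. The sketch below copies the `(token, begin, end)` values from the Tokens table above and filters them with the four default stop words that occur in the sentence. It is an illustration of why offsets survive filtering, not the annotator's actual code: each kept token carries its original offsets with it, so they still point into the original text.

```python
text = "Tom is a nice man. He lives in Kashmir."

# (token, begin, end) triples copied from the Tokens table above
annotations = [
    ("Tom", 0, 2), ("is", 4, 5), ("a", 7, 7), ("nice", 9, 12), ("man", 14, 16),
    (".", 17, 17), ("He", 19, 20), ("lives", 22, 26), ("in", 28, 29),
    ("Kashmir", 31, 37), (".", 38, 38),
]
stop_set = {"is", "a", "he", "in"}  # default stop words present in this sentence

# Filtering drops whole triples; begin/end of the survivors are left untouched.
clean = [(tok, b, e) for tok, b, e in annotations if tok.lower() not in stop_set]

for tok, b, e in clean:
    # Offsets are inclusive, so the slice end is e + 1.
    assert text[b:e + 1] == tok

print([tok for tok, _, _ in clean])
print([b for _, b, _ in clean])
print([e for _, _, e in clean])
```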

### `StopWordsCleaner Pre-trained Models` 

The available pretrained models are listed here: https://nlp.johnsnowlabs.com/models?q=stopwords

A pretrained model is loaded with: `StopWordsCleaner.pretrained("Model_Name", "Language")`

Example: `("stopwords_iso", "en")`


In [10]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","en") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) 

example = spark.createDataFrame([["You are not better than me"]], ["text"]) 

results = pipeline.fit(example).transform(example)
results.select("cleanTokens.result").show(1, False)


stopwords_iso download started this may take some time.
Approximate size to download 2.1 KB
[OK!]
+--------+
|result  |
+--------+
|[better]|
+--------+



In [11]:
# Pretrained model ("stopwords_iso", "en") stopwords
stop_words.getStopWords()

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'call',
 'can',
 'cannot',
 'ca',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'front',
 'full',
 'further',
 'get',
 'give',
 'g

Pretrained models are available for other languages as well.

Example: `("stopwords_iso", "fr")`

In [12]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","fr") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) 

example = spark.createDataFrame([["Tu n'es pas mieux que moi"]], ["text"]) 

results = pipeline.fit(example).transform(example)
results.select("cleanTokens.result").show(1, False)

stopwords_iso download started this may take some time.
Approximate size to download 2.8 KB
[OK!]
+-------------+
|result       |
+-------------+
|[n'es, mieux]|
+-------------+



In [13]:
# Pretrained model ("stopwords_iso", "fr") stopwords
stop_words.getStopWords()

['a',
 'à',
 'â',
 'abord',
 'afin',
 'ah',
 'ai',
 'aie',
 'ainsi',
 'ait',
 'allaient',
 'allons',
 'alors',
 'anterieur',
 'anterieure',
 'anterieures',
 'antérieur',
 'antérieure',
 'antérieures',
 'apres',
 'après',
 'as',
 'assez',
 'attendu',
 'au',
 'aupres',
 'auquel',
 'aura',
 'auraient',
 'aurait',
 'auront',
 'aussi',
 'autre',
 'autrement',
 'autres',
 'autrui',
 'aux',
 'auxquelles',
 'auxquels',
 'avaient',
 'avais',
 'avait',
 'avant',
 'avec',
 'avoir',
 'avons',
 'ayant',
 'bas',
 'basee',
 'bat',
 "c'",
 'c’',
 'ça',
 'car',
 'ce',
 'ceci',
 'cela',
 'celle',
 'celle-ci',
 'celle-la',
 'celle-là',
 'celles',
 'celles-ci',
 'celles-la',
 'celles-là',
 'celui',
 'celui-ci',
 'celui-la',
 'celui-là',
 'cent',
 'cependant',
 'certain',
 'certaine',
 'certaines',
 'certains',
 'certes',
 'ces',
 'cet',
 'cette',
 'ceux',
 'ceux-ci',
 'ceux-là',
 'chacun',
 'chacune',
 'chaque',
 'chez',
 'ci',
 'cinq',
 'cinquantaine',
 'cinquante',
 'cinquantième',
 'cinquième',
 'combi