![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **Stemmer**

This notebook will cover the different parameters and usages of `Stemmer`. 

**📖 Learning Objectives:**

1. Understand how to extract the base form of words by removing their affixes.

2. Learn how to create pipelines with this annotator.


**🔗 Helpful Links:**

- Documentation : [Stemmer](https://nlp.johnsnowlabs.com/docs/en/annotators#stemmer)

- Python Docs : [Stemmer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/stemmer/index.html)

- Scala Docs : [Stemmer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/Stemmer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).

## **🎬 Colab Setup**

In [6]:
# Install PySpark and Spark NLP
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4

In [18]:
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline, Pipeline
from sparknlp.annotator import Tokenizer, Stemmer

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `TOKEN`

## **🔎 Parameters**

None

## **Example Pipeline**

In [19]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

nlpPipeline = Pipeline(stages=[documentAssembler, 
                               tokenizer,
                               stemmer])

sample_texts = [["I love working with SparkNLP."], 
        ["I am living in Canada."]]

data = spark.createDataFrame(sample_texts).toDF("text")

model = nlpPipeline.fit(data)

result = model.transform(data)
result.show(truncate=40)

+-----------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|                         text|                                document|                                   token|                                    stem|
+-----------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|I love working with SparkNLP.|[{document, 0, 28, I love working wit...|[{token, 0, 0, I, {sentence -> 0}, []...|[{token, 0, 0, i, {sentence -> 0}, []...|
|       I am living in Canada.|[{document, 0, 21, I am living in Can...|[{token, 0, 0, I, {sentence -> 0}, []...|[{token, 0, 0, i, {sentence -> 0}, []...|
+-----------------------------+----------------------------------------+----------------------------------------+----------------------------------------+



In [20]:
result.select('token.result', 'stem.result').show(truncate=False)

+-------------------------------------+----------------------------------+
|result                               |result                            |
+-------------------------------------+----------------------------------+
|[I, love, working, with, SparkNLP, .]|[i, love, work, with, sparknlp, .]|
|[I, am, living, in, Canada, .]       |[i, am, live, in, canada, .]      |
+-------------------------------------+----------------------------------+
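To make the idea of affix removal concrete, here is a minimal sketch of a suffix stripper in plain Python. It is a toy illustration only and not the algorithm that `Stemmer` implements; the function name `toy_stem`, the suffix list, and the minimum stem length are all assumptions chosen for this example.

```python
def toy_stem(token: str) -> str:
    """Lowercase a token and strip at most one common English suffix (toy example)."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        # Strip only when at least three characters remain, so short words stay intact.
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word


tokens = ["I", "love", "working", "with", "SparkNLP", "."]
toy_stems = [toy_stem(t) for t in tokens]
print(toy_stems)
```

For the first sample sentence this sketch happens to agree with the `stem` column above. It would turn `living` into `liv`, whereas the pipeline returns `live`, which shows that the annotator applies a fuller rule set than simple suffix removal.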



## 🎯 **Usage with LightPipeline**

- **LightPipeline** is a Spark NLP-specific pipeline class equivalent to a Spark ML Pipeline. The difference is that its execution does not follow Spark principles; instead, it computes everything locally (but in parallel) to achieve faster inference on small amounts of data. This means we do not need a Spark DataFrame as input; we pass a string or an array of strings to be annotated. To create a LightPipeline, you need to pass in an already trained (fitted) Spark ML Pipeline.

- Its `transform()` stage is replaced by `annotate()` or `fullAnnotate()`.

- Let's run the `Stemmer` pipeline fitted above with `LightPipeline` and look at the results on an example text.

In [15]:
from sparknlp.base import LightPipeline


light_pipeline = LightPipeline(model)

In [21]:
light_pipeline.annotate("I love working with SparkNLP.")["stem"]

['i', 'love', 'work', 'with', 'sparknlp', '.']
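The dictionary returned by `annotate()` also has a `token` key, so tokens and stems can be compared side by side. The sketch below assumes the two lists align one-to-one, as they do in the outputs shown in this notebook. The dictionary is hard-coded from those outputs so the snippet runs without a Spark session; in the notebook, `annotated` would be the result of `light_pipeline.annotate(...)`.

```python
# Stand-in for light_pipeline.annotate("I love working with SparkNLP."),
# copied from the outputs shown in this notebook (assumption: only these two keys are needed).
annotated = {
    "token": ["I", "love", "working", "with", "SparkNLP", "."],
    "stem": ["i", "love", "work", "with", "sparknlp", "."],
}

# Pair each token with its stem by position.
pairs = list(zip(annotated["token"], annotated["stem"]))

# Keep only the tokens whose stem differs by more than letter case.
changed = [(token, stem) for token, stem in pairs if token.lower() != stem]
print(changed)
```

Filtering on `token.lower() != stem` separates genuine affix removal from the lowercasing that the stemmer also applies.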

In [22]:
light_pipeline.fullAnnotate("I love working with SparkNLP.")[0]["stem"]

[Annotation(token, 0, 0, i, {'sentence': '0'}),
 Annotation(token, 2, 5, love, {'sentence': '0'}),
 Annotation(token, 7, 13, work, {'sentence': '0'}),
 Annotation(token, 15, 18, with, {'sentence': '0'}),
 Annotation(token, 20, 27, sparknlp, {'sentence': '0'}),
 Annotation(token, 28, 28, ., {'sentence': '0'})]
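Note that the `begin` and `end` values in the output above still describe the original token span: `work` is reported at 7 to 13, which is the seven-character span of `working`. The sketch below uses those offsets to recover the surface form next to each stem. The tuples are copied by hand from the output above so the snippet runs without Spark; with real `Annotation` objects, the same values would be read from their `begin`, `end`, and `result` attributes.

```python
text = "I love working with SparkNLP."

# (begin, end, result) triples copied from the fullAnnotate() output above.
stem_spans = [
    (0, 0, "i"),
    (2, 5, "love"),
    (7, 13, "work"),
    (15, 18, "with"),
    (20, 27, "sparknlp"),
    (28, 28, "."),
]

# The end offset is inclusive, so slice up to end + 1 to get the original surface form.
surface_to_stem = [(text[begin : end + 1], stem) for begin, end, stem in stem_spans]
print(surface_to_stem)
```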