![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/05.01.Lemmatizer_LemmatizerModel.ipynb)

# **Lemmatizer** and **LemmatizerModel**

This notebook covers the parameters and usage of the `Lemmatizer` and `LemmatizerModel` annotators.

**📖 Learning Objectives:**

1. Understand how lemmatization reduces inflected words to their base forms (lemmas).

2. Be able to train custom `LemmatizerModel` annotators.

3. Become comfortable with creating pipelines to preprocess texts with `Lemmatizer` and `LemmatizerModel`.


**🔗 Helpful Links:**

- Documentation : [Lemmatizer](https://nlp.johnsnowlabs.com/docs/en/annotators#lemmatizer), [LemmatizerModel](https://nlp.johnsnowlabs.com/docs/en/annotators#lemmatizer)

- Python Docs : [Lemmatizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/lemmatizer/index.html), [LemmatizerModel](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/lemmatizer/index.html#sparknlp.annotator.lemmatizer.LemmatizerModel)

- Scala Docs : [Lemmatizer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/Lemmatizer.html), [LemmatizerModel](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/LemmatizerModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).

## **🎬 Colab Setup**

In [26]:
# Install PySpark and Spark NLP
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4

In [27]:
import sparknlp
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import Lemmatizer, LemmatizerModel, Tokenizer
import pyspark.sql.functions as F


spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `TOKEN`

## **Training a Lemmatizer model**

### **🔎 Parameters**


- `dictionary`: Path to an external dictionary that maps each lemma to its word forms.

- `formCol`: Name of the column containing the word forms, corresponding to the FORM field of the [CoNLL-U](https://universaldependencies.org/format.html) format.

- `lemmaCol`: Name of the column containing the lemmas, corresponding to the LEMMA field of the [CoNLL-U](https://universaldependencies.org/format.html) format.
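
To make the `formCol` and `lemmaCol` parameters more concrete, here is a minimal plain-Python sketch of where those two fields sit in CoNLL-U data. In that format every token line has ten tab-separated fields, with FORM in the second position and LEMMA in the third. The sample sentence and the helper name `form_lemma_pairs` are made up for illustration; this is not how Spark NLP reads CoNLL-U files internally.

```python
# Illustrative only: a hand-written CoNLL-U snippet (hypothetical sample data).
CONLLU_SAMPLE = """\
# text = I am living in Canada.
1\tI\tI\tPRON\tPRP\t_\t3\tnsubj\t_\t_
2\tam\tbe\tAUX\tVBP\t_\t3\taux\t_\t_
3\tliving\tlive\tVERB\tVBG\t_\t0\troot\t_\t_
4\tin\tin\tADP\tIN\t_\t5\tcase\t_\t_
5\tCanada\tCanada\tPROPN\tNNP\t_\t3\tobl\t_\t_
6\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_
"""


def form_lemma_pairs(conllu_text):
    """Return (FORM, LEMMA) pairs from CoNLL-U text (hypothetical helper)."""
    pairs = []
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A token line has exactly ten fields: ID, FORM, LEMMA, UPOS, ...
        if len(fields) != 10:
            continue
        token_id, form, lemma = fields[0], fields[1], fields[2]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not token_id.isdigit():
            continue
        pairs.append((form, lemma))
    return pairs


pairs = form_lemma_pairs(CONLLU_SAMPLE)
print(pairs)
```

Each pair is one training example for a lemmatizer: the inflected form as it appears in the text, and the lemma it should be reduced to.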



### `.setDictionary()`

Sets the external dictionary used by the lemmatizer. Parsing the resource requires two delimiters: `keyDelimiter`, which separates each lemma from its word forms, and `valueDelimiter`, which separates the individual word forms of the same lemma. In Python they are passed as `key_delimiter` and `value_delimiter`.

In [28]:
!wget -q https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt

In [29]:
!head -5 AntBNC_lemmas_ver_001.txt

aaah	->	aaahed	aaah
aac	->	aac	aacs
aah	->	aah	aahs	aahing	aahed	aahhing
aam	->	aams	aam
aardvark	->	aardvark	aardvarks
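
The role of the two delimiters can be sketched in plain Python. The snippet below is an illustration, not Spark NLP's implementation: it assumes the fields in the file are separated by tabs (matching the `"\t"` value delimiter used in this notebook), and the helper names `parse_lemma_line` and `lemmatize` are hypothetical.

```python
# Two lines copied from the dictionary preview above (tab-separated assumed).
SAMPLE_LINES = [
    "aah\t->\taah\taahs\taahing\taahed\taahhing",
    "aardvark\t->\taardvark\taardvarks",
]


def parse_lemma_line(line, key_delimiter="->", value_delimiter="\t"):
    """Map every word form on one dictionary line to its lemma."""
    # The key delimiter splits the lemma (left) from its word forms (right).
    lemma, _, forms_part = line.partition(key_delimiter)
    lemma = lemma.strip()
    # The value delimiter splits the individual word forms.
    forms = [f.strip() for f in forms_part.split(value_delimiter) if f.strip()]
    return {form: lemma for form in forms}


# Build a single form -> lemma lookup table from all lines.
lookup = {}
for sample_line in SAMPLE_LINES:
    lookup.update(parse_lemma_line(sample_line))


def lemmatize(token):
    """Return the lemma for a token, or the token itself if it is unknown."""
    return lookup.get(token, token)


print(lemmatize("aahing"))
print(lemmatize("aardvarks"))
print(lemmatize("SparkNLP"))
```

In this sketch, tokens missing from the dictionary are returned unchanged, which matches how `SparkNLP` and `Canada` pass through the pipeline output later in this notebook.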


In [41]:
lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("./AntBNC_lemmas_ver_001.txt", key_delimiter="->", value_delimiter="\t")

In [42]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[documentAssembler, 
                               tokenizer,
                               lemmatizer])

sample_texts = [
    ["I love working with SparkNLP."], 
    ["I am living in Canada."]
]

data = spark.createDataFrame(sample_texts).toDF("text")

result = nlpPipeline.fit(data).transform(data)

result.show(truncate=40)

+-----------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|                         text|                                document|                                   token|                                   lemma|
+-----------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|I love working with SparkNLP.|[{document, 0, 28, I love working wit...|[{token, 0, 0, I, {sentence -> 0}, []...|[{token, 0, 0, I, {sentence -> 0}, []...|
|       I am living in Canada.|[{document, 0, 21, I am living in Can...|[{token, 0, 0, I, {sentence -> 0}, []...|[{token, 0, 0, I, {sentence -> 0}, []...|
+-----------------------------+----------------------------------------+----------------------------------------+----------------------------------------+



In [43]:
result.select('lemma.result').show(truncate=False)

+----------------------------------+
|result                            |
+----------------------------------+
|[I, love, work, with, SparkNLP, .]|
|[I, be, live, in, Canada, .]      |
+----------------------------------+



In [45]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.lemma.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("lemma")).toPandas()

result_df.head(10)

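|   | token    | lemma    |
|---|----------|----------|
| 0 | I        | I        |
| 1 | love     | love     |
| 2 | working  | work     |
| 3 | with     | with     |
| 4 | SparkNLP | SparkNLP |
| 5 | .        | .        |
| 6 | I        | I        |
| 7 | am       | be       |
| 8 | living   | live     |
| 9 | in       | in       |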


## **Using pretrained models**

The `LemmatizerModel` annotator can automatically download pretrained models with the `.pretrained()` method. For available pretrained models, check the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Lemmatization).

### **🔎 Example Pipeline**


In [34]:
!wget -q -O news_category_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

In [35]:
!head -5 news_category_test.csv

category,description
Business,Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
Sci/Tech," TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for privately funded suborbital space flight, has officially announced the first launch date for its manned rocket."
Sci/Tech," A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins."
Sci/Tech," It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry and flames will roar."


In [46]:
import pyspark.sql.functions as F

news_df = spark.read\
                .option("header", "true")\
                .csv("news_category_test.csv")\
                .withColumnRenamed("description", "text")

news_df.show(truncate=50)

+--------+--------------------------------------------------+
|category|                                              text|
+--------+--------------------------------------------------+
|Business|Unions representing workers at Turner   Newall ...|
|Sci/Tech| TORONTO, Canada    A second team of rocketeers...|
|Sci/Tech| A company founded by a chemistry researcher at...|
|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts ...|
|Sci/Tech| Southern California's smog fighting agency wen...|
|Sci/Tech|"The British Department for Education and Skill...|
|Sci/Tech|"confessed author of the Netsky and Sasser viru...|
|Sci/Tech|\\FOAF/LOAF  and bloom filters have a lot of in...|
|Sci/Tech|"Wiltshire Police warns about ""phishing"" afte...|
|Sci/Tech|In its first two years, the UK's dedicated card...|
|Sci/Tech| A group of technology companies  including Tex...|
|Sci/Tech| Apple Computer Inc.&lt;AAPL.O&gt; on  Tuesday ...|
|Sci/Tech| Free Record Shop, a Dutch music  retail chain,...|
|Sci/Tec

Let's use the model `lemma_antbnc`:

In [47]:
lemmatizer = LemmatizerModel.pretrained('lemma_antbnc', 'en') \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") 

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


In [48]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[documentAssembler, 
                               tokenizer,
                               lemmatizer])

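# Every stage here is either rule-based or pretrained, so no training data is needed.
# Fitting on an empty DataFrame only assembles the stages into a PipelineModel.
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)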

In [49]:
result = pipelineModel.transform(news_df)

result.show(5)

+--------+--------------------+--------------------+--------------------+--------------------+
|category|                text|            document|               token|               lemma|
+--------+--------------------+--------------------+--------------------+--------------------+
|Business|Unions representi...|[{document, 0, 12...|[{token, 0, 5, Un...|[{token, 0, 5, Un...|
|Sci/Tech| TORONTO, Canada ...|[{document, 0, 22...|[{token, 1, 7, TO...|[{token, 1, 7, TO...|
|Sci/Tech| A company founde...|[{document, 0, 20...|[{token, 1, 1, A,...|[{token, 1, 1, A,...|
|Sci/Tech| It's barely dawn...|[{document, 0, 26...|[{token, 1, 4, It...|[{token, 1, 4, It...|
|Sci/Tech| Southern Califor...|[{document, 0, 17...|[{token, 1, 8, So...|[{token, 1, 8, So...|
+--------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows



In [50]:
result.select('token.result','lemma.result').show(5, truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                              result|                                                                                              result|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|[Unions, representing, workers, at, Turner, Newall, say, they, are, ', disappointed, ', after, ta...|[Unions, represent, worker, at, Turner, Newall, say, they, be, ', disappointed, ', after, talk, w...|
|[TORONTO, ,, Canada, A, second, team, of, rocketeers, competing, for, the, #36;10, million, Ansar...|[TORONTO, ,, Canada, A, second, team, of, rocketeer, compete, for, the, #36;10, mi