![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Training a Lemmatizer Model for the Italian Language

### A brief explanation of the `Lemmatizer` annotator in Spark NLP:

Retrieves the lemma of each word, with the objective of returning a base dictionary word.<br><br>
**Type:** Token<br>
**Requires:** None<br>
**Input:** abduct -> abducted abducting abduct abducts<br><br>
**Functions:**<br>
* setDictionary(path, keyDelimiter, valueDelimiter, readAs, options): Path and options for the lemma dictionary, in "lemma vs. possible words" format. `readAs` can be LINE_BY_LINE or SPARK_DATASET. `options` contains the options passed to the Spark reader when `readAs` is SPARK_DATASET.
<br>

**Example:**
```Python
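# The delimiters below assume the `lemma -> word1 word2 ...` layout shown above;
# adjust them to match your own dictionary file.
lemmatizer = Lemmatizer() \
  .setInputCols(["token"]) \
  .setOutputCol("lemma") \
  .setDictionary("./lemmas001.txt", key_delimiter="->", value_delimiter="\\s+")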
```
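
To make the dictionary format concrete, here is a minimal pure-Python sketch of how a `lemma -> word1 word2 ...` file could be turned into a lookup table and applied to tokens. It is an illustration only, not Spark NLP's implementation: the function names `parse_lemma_dictionary` and `lemmatize` are hypothetical, and passing unknown tokens through unchanged is an assumption.

```python
import re
from typing import Dict, Iterable, List


def parse_lemma_dictionary(
    lines: Iterable[str],
    key_delimiter: str = "->",
    value_delimiter: str = r"\s+",
) -> Dict[str, str]:
    """Build a word-form -> lemma lookup from `lemma -> form1 form2 ...` lines."""
    lookup: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or key_delimiter not in line:
            continue  # skip blank or malformed lines
        # Split once on the key delimiter: left side is the lemma, right side the forms.
        lemma, forms = line.split(key_delimiter, 1)
        lemma = lemma.strip()
        # The value delimiter is treated as a regular expression.
        for form in re.split(value_delimiter, forms.strip()):
            if form:
                lookup[form] = lemma
    return lookup


def lemmatize(tokens: List[str], lookup: Dict[str, str]) -> List[str]:
    """Replace each token with its lemma; unknown tokens pass through (assumption)."""
    return [lookup.get(token, token) for token in tokens]


sample_lines = ["abduct -> abducted abducting abduct abducts"]
lookup = parse_lemma_dictionary(sample_lines)
lemmas = lemmatize(["abducted", "abducts", "router"], lookup)
print(lemmas)
```

Inverting the file into a form-to-lemma map keeps each token lookup a single dictionary access, which is why the key and value delimiters must match the file layout exactly.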

Let's import the required libraries, including `SQL` and `ML` from Spark and some annotators from Spark NLP.

In [1]:
#Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
#Spark NLP
import sparknlp
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

### Let's create a Spark Session for our app

In [2]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version")
sparknlp.version()
print("Apache Spark version")
spark.version

Spark NLP version
2.2.1
Apache Spark version


'2.4.3'

In [3]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/it/lemma/dxc.technology/lemma_italian.txt -P /tmp

--2019-07-16 22:00:38--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/it/lemma/dxc.technology/lemma_italian.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.82.139
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.82.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6900964 (6.6M) [text/plain]
Saving to: '/tmp/lemma_italian.txt'


2019-07-16 22:00:46 (977 KB/s) - '/tmp/lemma_italian.txt' saved [6900964/6900964]



### Now we are going to create a Spark NLP Pipeline using the native Spark ML Pipeline

In [4]:
document_assembler = DocumentAssembler() \
    .setInputCol("text")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")
    
lemmatizer = Lemmatizer() \
    .setInputCols(["normal"]) \
    .setOutputCol("lemma") \
    .setDictionary(
          path = "/tmp/lemma_italian.txt",
          read_as = "LINE_BY_LINE",
          key_delimiter = "\\s+", 
          value_delimiter = "->"
        )
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, normalizer, lemmatizer])
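
Each annotator in the pipeline above reads columns produced by an earlier stage, so a misspelled column name only surfaces when the pipeline runs. The following pure-Python sketch checks that wiring without Spark. The helper `find_unwired_inputs` and the `STAGES` table are hypothetical, and `document` is assumed to be the output column of `DocumentAssembler`, since the cell above does not set one explicitly.

```python
from typing import List, Sequence, Tuple

# (stage name, input columns, output column), mirroring the pipeline above.
# Assumption: DocumentAssembler writes to a column named "document".
STAGES: List[Tuple[str, List[str], str]] = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("normalizer", ["token"], "normal"),
    ("lemmatizer", ["normal"], "lemma"),
]


def find_unwired_inputs(
    stages: Sequence[Tuple[str, List[str], str]],
    source_columns: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return (stage, column) pairs whose input is not produced by an earlier stage."""
    available = set(source_columns)
    missing = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                missing.append((name, column))
        # The stage's output becomes available to every later stage.
        available.add(output)
    return missing


problems = find_unwired_inputs(STAGES, source_columns=["text"])
print(problems)
```

An empty list means every stage finds its inputs; dropping the normalizer from `STAGES`, for example, would report the lemmatizer's `normal` input as missing.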

Let's see how well our model performs at prediction. We are going to create a DataFrame with Italian text for testing purposes and use `transform()` to predict.

In [5]:
# Let's create a DataFrame with Italian text for testing our Spark NLP Pipeline
dfTest = spark.createDataFrame(["Finchè non avevo la linea ADSL di fastweb potevo entrare nel router e configurare quelle pochissime cose configurabili (es. nome dei device), da ieri che ho avuto la linea niente è più configurabile...", 
    "L'uomo è insoddisfatto del prodotto.", 
    "La coppia contenta si abbraccia sulla spiaggia."], StringType()).toDF("text")

# You can select multiple columns at once; selecting one at a time shows each annotator's output without truncation
# Fit the pipeline once and reuse the fitted model instead of refitting for every column
model = pipeline.fit(dfTest)
result = model.transform(dfTest)
result.select("token.result").show(truncate=False)
result.select("normal.result").show(truncate=False)
result.select("lemma.result").show(truncate=False)

# Print the schema of the transformed DataFrame
result.printSchema()

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Finchè, non, avevo, la, linea, ADSL, di, fastweb, potevo, entrare, nel, router, e, configurare, quelle, pochissime, cose, configurabili, (, es, ., nome, dei, device, ),, da, ieri, che, ho, avuto, la, linea, niente, è, più, configurabile, ., ., .]|

### Credits 
We would like to thank `DXC.Technology` for sharing their Italian datasets and models with the Spark NLP community. The datasets are used to train the `Lemmatizer` and `SentimentDetector` models.