![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Training a Lemmatizer Model for the Italian Language

## Spark `2.4` and Spark NLP `1.8.x`

### A brief explanation of the `Lemmatizer` annotator in Spark NLP:

Retrieves the lemma of each word, with the objective of returning its base dictionary form<br><br>
**Type:** Token<br>
**Requires:** None<br>
**Input:** abduct -> abducted abducting abduct abducts<br><br>
**Functions:**<br>
* setDictionary(path, keyDelimiter, valueDelimiter, readAs, options): Path and options for the lemma dictionary, in lemma vs. possible words format. `readAs` can be `LINE_BY_LINE` or `SPARK_DATASET`. `options` contains the options passed to the Spark reader when `readAs` is `SPARK_DATASET`.
<br>

**Example:**
```Python
lemmatizer = Lemmatizer() \
  .setInputCols(["token"]) \
  .setOutputCol("lemma") \
  .setDictionary("./lemmas001.txt")
```
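
To make the dictionary format above more concrete, here is a minimal pure-Python sketch of how a `lemma -> word forms` file can be turned into a lookup table. It is only an illustration, not Spark NLP's implementation: the helper names `parse_lemma_lines` and `lemmatize` are hypothetical, and it assumes that words missing from the dictionary are passed through unchanged.

```python
import re


def parse_lemma_lines(lines, key_delimiter="->", value_delimiter=r"\s+"):
    """Build a word-form -> lemma lookup from 'lemma -> form1 form2' lines.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split the lemma (key) from its word forms (values)
        lemma, _, forms = line.partition(key_delimiter)
        for form in re.split(value_delimiter, forms.strip()):
            if form:
                lookup[form] = lemma.strip()
    return lookup


def lemmatize(tokens, lookup):
    """Replace each token by its lemma; unknown tokens are kept as they are (assumption)."""
    return [lookup.get(token, token) for token in tokens]


lookup = parse_lemma_lines(["abduct -> abducted abducting abduct abducts"])
print(lemmatize(["abducted", "abducts", "unknown"], lookup))
# → ['abduct', 'abduct', 'unknown']
```

The delimiters are parameters here for the same reason `setDictionary` takes them: dictionary files from different sources separate keys and values differently.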

Let's import the required libraries, including `SQL` and `ML` from Spark and some annotators from Spark NLP.

In [21]:
# Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Spark NLP
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

### Let's create a Spark Session for our app

In [17]:
spark = SparkSession.builder \
    .appName("Training_Lemmatizer")\
    .master("local[*]")\
    .config("spark.driver.memory","8G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.driver.extraClassPath", "~/anaconda3/envs/spark/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar")\
    .config("spark.executor.extraClassPath", "~/anaconda3/envs/spark/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

As you can see, for `spark.driver.extraClassPath` and `spark.executor.extraClassPath` I had to provide the path to the `sparknlp.jar` that ships with the Python package. You can find it with `locate` on Linux/Unix-based systems, `mdfind -name sparknlp.jar` on macOS, or any other file search. Usually it lives under `site-packages/sparknlp/lib/sparknlp.jar` in your Python environment.
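
If you prefer not to search the file system by hand, the jar can also be located from Python. The sketch below is one possible approach, assuming the jar sits at `lib/sparknlp.jar` inside the installed `sparknlp` package as described above; `find_sparknlp_jar` is a hypothetical helper name, not part of Spark NLP. It returns `None` when the package or the jar cannot be found.

```python
import importlib.util
import os


def find_sparknlp_jar(package_name="sparknlp"):
    """Return the absolute path of lib/sparknlp.jar inside the package, or None.

    Hypothetical helper; assumes the jar layout described in the text above.
    """
    spec = importlib.util.find_spec(package_name)
    # Not installed, or a plain module rather than a package directory
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = list(spec.submodule_search_locations)[0]
    jar_path = os.path.join(package_dir, "lib", "sparknlp.jar")
    return os.path.abspath(jar_path) if os.path.isfile(jar_path) else None


print(find_sparknlp_jar())
```

The returned value could then be passed to both `extraClassPath` settings. Using an absolute path also sidesteps any question of whether `~` gets expanded in the configuration string.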

In [18]:
spark.version

'2.4.0'

### Now we are going to create a Spark NLP Pipeline by using Spark ML Pipeline natively

In [19]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")
    
lemmatizer = Lemmatizer() \
    .setInputCols(["normal"]) \
    .setOutputCol("lemma") \
    .setDictionary(
          path = "/tmp/dxc.technology/data/lemma_italian.txt",
          read_as = "LINE_BY_LINE",
          key_delimiter = "\\s+", 
          value_delimiter = "->"
        )
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, normalizer, lemmatizer])
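
The stages in this pipeline are connected only through column names: each annotator reads the column written by an earlier one. The following pure-Python sketch checks that wiring for the pipeline defined here. It does not use Spark at all; `check_wiring` is a hypothetical helper, and it assumes that `DocumentAssembler` writes to a `document` column.

```python
# (stage name, input columns, output column), mirroring the pipeline in this notebook
stages = [
    ("DocumentAssembler", ["text"], "document"),  # "document" output is an assumption
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("Normalizer", ["token"], "normal"),
    ("Lemmatizer", ["normal"], "lemma"),
]


def check_wiring(stages, initial_columns):
    """Return a list of (stage, missing input columns); empty if every input is available."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        missing = [column for column in inputs if column not in available]
        if missing:
            problems.append((name, missing))
        available.add(output)
    return problems


print(check_wiring(stages, ["text"]))
# → []
```

A non-empty result would point at a stage whose input column name does not match any earlier output, which is a common cause of pipeline errors.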

Now that we have our Spark NLP Pipeline, we can train it with `fit()`. Because the `Lemmatizer` is trained from an external dictionary file, we don't need to pass a DataFrame with real data. An empty DataFrame is enough to trigger the training.

**NOTE:** Here is how you can download the dataset used in this example:
* [lemma_italian.txt](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/it/lemma/dxc.technology/lemma_italian.txt)

In [None]:
emptyDF = spark.createDataFrame([], StringType()).toDF("text")
model = pipeline.fit(emptyDF)

Let's see how well our model does when it comes to prediction. We are going to create a DataFrame with Italian text for testing purposes and use `transform()` to predict.

In [None]:
# Let's create a DataFrame with Italian text for testing our Spark NLP Pipeline
dfTest = spark.createDataFrame(["Finchè non avevo la linea ADSL di fastweb potevo entrare nel router e configurare quelle pochissime cose configurabili (es. nome dei device), da ieri che ho avuto la linea niente è più configurabile...", 
    "L'uomo è insoddisfatto del prodotto.", 
    "La coppia contenta si abbraccia sulla spiaggia."], StringType()).toDF("text")

# Run the pipeline once and reuse the transformed DataFrame
result = model.transform(dfTest)

# You could select several columns at once, but showing them one by one displays each annotator's output without truncation
result.select("token.result").show(truncate=False)
result.select("normal.result").show(truncate=False)
result.select("lemma.result").show(truncate=False)

# Print the schema of the transformed DataFrame
result.printSchema()

### Credits
We would like to thank `DXC.Technology` for sharing their Italian datasets and models with the Spark NLP community. The datasets are used to train `Lemmatizer` and `SentimentDetector` models.