![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/italian/Train-SentimentDetector-Italian.ipynb)

# Training a SentimentDetector Model for the Italian Language

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

### A brief explanation of the `SentimentDetector` annotator in Spark NLP:

Scores a sentence for sentiment<br>
**Type:** sentiment<br>
**Requires:** Document, Token<br>

**Functions:**<br>
* setSentimentCol(colname): Column holding the sentiment label of each row, used for training. If not set, external sources must be set instead.<br>
* setPositiveSource(path, tokenPattern, readAs, options): Path to a file or folder with positive sentiment text; tokenPattern is the regex pattern used to match tokens in the source. readAs is either LINE_BY_LINE or SPARK_DATASET; if the latter is set, options is passed to the reader.<br>
* setNegativeSource(path, tokenPattern, readAs, options): Path to a file or folder with negative sentiment text; tokenPattern is the regex pattern used to match tokens in the source. readAs is either LINE_BY_LINE or SPARK_DATASET; if the latter is set, options is passed to the reader.<br>
* setPruneCorpus(true): When training on small data, you may want to disable this so that infrequent words are not cut off.
<br>

**Input:** File or folder of text files of positive and negative data<br>
**Example:**<br>
```python
sentiment_detector = SentimentDetector() \
    .setInputCols(["lemma", "sentence"]) \
    .setOutputCol("sentiment")
```    
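
To give a feel for what "scoring a sentence for sentiment" can mean with a word dictionary, here is a minimal pure-Python sketch. It is not Spark NLP's implementation: the `score_sentence` helper, the tiny `SENTIMENT_DICT`, and the rule that a non-negative total is reported as positive are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT Spark NLP's SentimentDetector implementation.
# SENTIMENT_DICT and score_sentence are hypothetical names for this example.
SENTIMENT_DICT = {
    "contento": "positive",
    "felice": "positive",
    "insoddisfatto": "negative",
    "triste": "negative",
}


def score_sentence(lemmas, sentiment_dict=SENTIMENT_DICT):
    """Add 1 for each positive lemma and subtract 1 for each negative one."""
    score = 0
    for lemma in lemmas:
        label = sentiment_dict.get(lemma.lower())
        if label == "positive":
            score += 1
        elif label == "negative":
            score -= 1
    # Assumption: a non-negative total is reported as "positive".
    return score, ("positive" if score >= 0 else "negative")


print(score_sentence(["uomo", "essere", "insoddisfatto", "prodotto"]))
print(score_sentence(["coppia", "contento", "abbracciare", "spiaggia"]))
```

The sketch works on lemmas rather than raw tokens, which mirrors why the annotator above takes a `lemma` column as input: inflected forms collapse onto a single dictionary entry.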

Let's import the required libraries, including `SQL` and `ML` from Spark and some annotators from Spark NLP.

In [None]:
#Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
#Spark NLP
import sparknlp
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

### Let's create a Spark Session for our app

In [None]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)


Spark NLP version:  4.3.1
Apache Spark version:  3.3.0


In [None]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/it/lemma/dxc.technology/lemma_italian.txt -P /tmp
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/it/sentiment/dxc.technology/sentiment_italian.txt -P /tmp    

--2023-02-20 18:15:24--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/it/lemma/dxc.technology/lemma_italian.txt
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.132.189, 52.217.174.32, 52.216.242.22, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.132.189|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘/tmp/lemma_italian.txt’ not modified on server. Omitting download.

--2023-02-20 18:15:25--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/it/sentiment/dxc.technology/sentiment_italian.txt
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.107.142, 54.231.234.0, 52.216.209.0, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.107.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349115 (341K) [text/plain]
Saving to: ‘/tmp/sentiment_it

### Now we are going to create a Spark NLP Pipeline by using Spark ML Pipeline natively

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")

lemmatizer = Lemmatizer() \
    .setInputCols(["normal"]) \
    .setOutputCol("lemma") \
    .setDictionary(
          path = "/tmp/lemma_italian.txt",
          read_as = "TEXT",
          key_delimiter = "\\s+",
          value_delimiter = "->"
        )

sentiment_detector = SentimentDetector() \
    .setInputCols(["lemma", "sentence"]) \
    .setOutputCol("sentiment_score") \
    .setDictionary(
          path = "/tmp/sentiment_italian.txt",
          read_as = "TEXT",
          delimiter = ","
        )
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, normalizer, lemmatizer, sentiment_detector])
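
The `delimiter = ","` argument above suggests a dictionary file with one `word,label` entry per line. The sketch below shows how such a file could be read in plain Python. The line format, the sample entries, and the `load_sentiment_dict` helper are assumptions for illustration; they are not the contents of `sentiment_italian.txt` or Spark NLP's own loader.

```python
# Illustrative sketch: parse "word<delimiter>label" lines into a dict.
# load_sentiment_dict and the sample lines are hypothetical.
import io


def load_sentiment_dict(lines, delimiter=","):
    """Return {word: label}, skipping blank lines and lines without the delimiter."""
    entries = {}
    for raw in lines:
        line = raw.strip()
        if not line or delimiter not in line:
            continue
        word, label = line.split(delimiter, 1)
        entries[word.strip()] = label.strip()
    return entries


# An in-memory stand-in for a dictionary file (assumed format).
sample = io.StringIO("buono,positive\ncattivo,negative\n\nmalformed line\n")
sentiment_dict = load_sentiment_dict(sample)
print(sentiment_dict)
```

Passing any iterable of lines keeps the helper usable with an open file handle as well as with in-memory test data.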

Now that we have our Spark NLP Pipeline, we can train it with `fit()`. Because the `Lemmatizer` and `SentimentDetector` models are trained from external dictionary files, the DataFrame passed to `fit()` does not need to contain real training data; it only triggers the training. Even an empty DataFrame would do, and the cell below simply passes the test DataFrame.

Let's see how well our model does when it comes to prediction. We are going to create a DataFrame with Italian text for testing purposes and use `transform()` to predict.

In [None]:
# Let's create a DataFrame with Italian text for testing our Spark NLP Pipeline
dfTest = spark.createDataFrame(["Finchè non avevo la linea ADSL di fastweb potevo entrare nel router e configurare quelle pochissime cose configurabili (es. nome dei device), da ieri che ho avuto la linea niente è più configurabile...",
    "L'uomo è insoddisfatto del prodotto.",
    "La coppia contenta si abbraccia sulla spiaggia."], StringType()).toDF("text")

# Fit the pipeline once and reuse the transformed DataFrame.
# You could select several columns at once, but showing them one by one
# lets us see each annotator's output without truncation.
result = pipeline.fit(dfTest).transform(dfTest)
result.select("token.result").show(truncate=False)
result.select("normal.result").show(truncate=False)
result.select("lemma.result").show(truncate=False)
result.select("sentiment_score").show(truncate=False)

# Print the schema of the transformed DataFrame
result.printSchema()


+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Finchè, non, avevo, la, linea, ADSL, di, fastweb, potevo, entrare, nel, router, e, configurare, quelle, pochissime, cose, configurabili, (, es, ., nome, dei, device, ),, da, ieri, che, ho, avuto, la, linea, niente, è, più, configurabile, ., ., .]|


### Credits
We would like to thank `DXC.Technology` for sharing their Italian datasets and models with the Spark NLP community. The datasets are used to train the `Lemmatizer` and `SentimentDetector` models.