![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/30.01.MarianTransformer.ipynb)

# **MarianTransformer**

This notebook will cover the different parameters and usages of `MarianTransformer`. 

**📖 Learning Objectives:**

1. Understand how to use the pre-trained `MarianTransformer` model in Spark NLP for machine translation tasks, including loading pre-trained models and configuring the translation pipeline.

2. Become familiar with the parameters and options available for the `MarianTransformer` model.


**🔗 Helpful Links:**

- Documentation : [MarianTransformer](https://nlp.johnsnowlabs.com/docs/en/transformers#mariantransformer)

- Python Docs : [MarianTransformer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/seq2seq/marian_transformer/index.html#sparknlp.annotator.seq2seq.marian_transformer.MarianTransformer)

- Scala Docs : [MarianTransformer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/seq2seq/MarianTransformer)


- For academic reference, see [Marian: Fast Neural Machine Translation in C++](https://aclanthology.org/P18-4020/).

- For additional information, see the [MarianNMT project site](https://marian-nmt.github.io/).

## **📜 Background**
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. The `MarianTransformer` annotator uses models trained with MarianNMT.

## **🎬 Colab Setup**

In [1]:
! pip install -q pyspark==3.1.2  spark-nlp==4.2.4


## ⚒️ Setup and Import Libraries

In [2]:
import sparknlp

from sparknlp.base import LightPipeline, Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.sql import functions as F

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

## **🔎Parameters**

- `batchSize`: Size of every batch, by default 1.
- `langId`: The language code (e.g., "en", "fr", "pt", "tr") of the input text, used with multilingual models that accept many source languages (default: `""`).
- `configProtoBytes`: `configProto` from Tensorflow, serialized into byte array.
- `maxInputLength`: Controls the maximum length for the tokenized input sequence (source language [SentencePieces](https://github.com/google/sentencepiece)), by default 40.
- `maxOutputLength`: Controls the maximum length of the output sequence (target language text), by default 40. If this value is smaller than `maxInputLength`, `maxInputLength` is used instead, so the effective maximum output length is the larger of the two parameters.
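
The interaction between the two length limits can be summarized in a few lines. This is a minimal sketch of the rule as stated in the parameter description, not Spark NLP's actual implementation; `effective_max_output_length` is a hypothetical helper name introduced only for illustration.

```python
def effective_max_output_length(max_input_length: int = 40,
                                max_output_length: int = 40) -> int:
    """Return the output limit implied by the rule above (hypothetical helper)."""
    # The larger of the two limits wins, so setting maxOutputLength
    # below maxInputLength has no effect on the output.
    return max(max_input_length, max_output_length)


print(effective_max_output_length(100, 5))   # 100
print(effective_max_output_length(10, 200))  # 200
print(effective_max_output_length())         # 40
```

Under this rule, shortening the output in practice means lowering `maxInputLength` as well.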

If no values are provided, the default model is `"opus_mt_en_fr"` and the default language is `"xx"` (multilingual). For available pretrained models, see the [Models Hub](https://nlp.johnsnowlabs.com/models?task=Translation&type=model&q=marian).

### `setMaxInputLength` and `setMaxOutputLength`

Setting both parameters to `30`:

In [3]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained() \
    .setInputCols("sentence") \
    .setOutputCol("translation") \
    .setMaxInputLength(30) \
    .setMaxOutputLength(30)
    
pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      marian
    ])

data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")

result = pipeline.fit(data).transform(data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
opus_mt_en_fr download started this may take some time.
Approximate size to download 378.7 MB
[OK!]


In [4]:
result.selectExpr("explode(translation.result) as result").show(truncate=False)

+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+



Since the input sentences are short, both translations fit within the limit of 30. Note that these limits are counted in tokens (SentencePieces), not characters. Limiting the length of input/output texts can speed up inference and save memory.

**What happens when the input text is longer?**

In [5]:
french_sentence = """La capitale française, célèbre pour ses monuments historiques, ses musées prestigieux et sa cuisine raffinée, attire des visiteurs du monde entier."""
print(f"Sentence length: {len(french_sentence)}")

Sentence length: 147


Let's put this example into a Spark DataFrame and add the `DOCUMENT` annotation:

In [6]:
spark_df = spark.createDataFrame([[french_sentence]]).toDF("text")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

spark_df = documentAssembler.transform(spark_df)

Using a multilingual model (French is one of its source languages) to translate to English:

In [7]:
marian = MarianTransformer.pretrained("opus_mt_mul_en", "xx") \
    .setInputCols("document") \
    .setOutputCol("translation")

opus_mt_mul_en download started this may take some time.
Approximate size to download 395.3 MB
[OK!]


Restricting input length:

In [12]:
marian.setMaxInputLength(10).setMaxOutputLength(200).transform(spark_df).select("translation.result").show(truncate=False)

+----------------------+
|result                |
+----------------------+
|[The French capital ,]|
+----------------------+
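
The truncated translation above follows from the input being cut off before it reaches the model. The sketch below illustrates the idea under a loud assumption: whitespace-separated words stand in for SentencePiece tokens, which the real model uses and which usually split text into more pieces than words. `truncate_tokens` is a hypothetical helper, not part of Spark NLP.

```python
def truncate_tokens(text: str, max_input_length: int) -> list:
    """Keep only the first max_input_length tokens (hypothetical helper)."""
    # Assumption: whitespace tokens stand in for SentencePiece pieces.
    tokens = text.split()
    return tokens[:max_input_length]


sentence = ("La capitale française, célèbre pour ses monuments historiques, "
            "ses musées prestigieux")

# Everything after the limit is dropped, so it can never be translated.
kept = truncate_tokens(sentence, 3)
print(kept)  # ['La', 'capitale', 'française,']
```

Because the remainder of the sentence never reaches the encoder, raising `maxOutputLength` alone cannot recover it.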



Restricting the output length (input and output limits set to the same value):

In [11]:
marian.setMaxInputLength(100).setMaxOutputLength(100).transform(spark_df).select("translation.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|[The French capital, famous for its historical monuments, prestigious museums and refined cuisine, attracts visitors from all over the world.]|
+----------------------------------------------------------------------------------------------------------------------------------------------+



Setting `maxOutputLength` lower than `maxInputLength` doesn't shorten the output, because the larger of the two values is used:

In [13]:
marian.setMaxInputLength(100).setMaxOutputLength(5).transform(spark_df).select("translation.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|[The French capital, famous for its historical monuments, prestigious museums and refined cuisine, attracts visitors from all over the world.]|
+----------------------------------------------------------------------------------------------------------------------------------------------+



<br/>

We can inspect the parameters of the `MarianTransformer` in detail with `extractParamMap`. For each parameter it shows:

*   The definition (doc string) of the parameter
*   Its current value (the default, unless it was set explicitly)



In [None]:
marian.extractParamMap()

{Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='batchSize', doc='Size of every batch'): 1,
 Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='ignoreTokenIds', doc="A list of token ids which are ignored in the decoder's output"): [],
 Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='langId', doc="Transformer's task, e.g. summarize>"): '',
 Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='maxInputLength', doc='Controls the maximum length for encoder inputs (source language texts)'): 30,
 Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='maxOutputLength', doc='Controls the maximum length for decoder outputs (target language texts)'): 30,
 Param(parent='MARIAN_TRANSFORMER_139dd9d2ebfc', name='i

### `setLangId`

We use this parameter to tell the multilingual model which language the input is in. Let's use `tr` (Turkish) as an example:

In [17]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_mul_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation") \
    .setLangId("tr") 
        
pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      marian
    ])

data = spark.createDataFrame([["Bu adam 50 yaşında ve çok çalışkan"]]).toDF("text")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
opus_mt_mul_en download started this may take some time.
Approximate size to download 395.3 MB
[OK!]


In [21]:
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(translation.result) as result").show(truncate=False)

+-----------------------------------------------+
|result                                         |
+-----------------------------------------------+
|This guy is 50 years old and working very hard.|
+-----------------------------------------------+



Or in French:

In [22]:
pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      marian.setLangId("fr")
    ])

data_fr = spark.createDataFrame([["Il a 50 ans et il est très travailleur"]]).toDF("text")
result_fr = pipeline.fit(data_fr).transform(data_fr)
result_fr.selectExpr("explode(translation.result) as result").show(truncate=False)

+----------------------------------------+
|result                                  |
+----------------------------------------+
|He's 50 years old and he's very employed|
+----------------------------------------+



## 🎯 **Usage with LightPipeline**

- **LightPipeline** is a Spark NLP-specific pipeline class equivalent to a Spark ML Pipeline. The difference is that its execution does not follow Spark principles: it computes everything locally (but in parallel) to achieve faster inference on small amounts of data. This means the input does not have to be a Spark DataFrame; a string or an array of strings can be annotated directly. To create a LightPipeline, you need to pass in an already trained (fitted) Spark ML Pipeline.

- Its `transform()` stage is replaced by `annotate()` or `fullAnnotate()`. <br/>

- Let's create a pipeline with `MarianTransformer`, run it with `LightPipeline`, and look at the results on an example text.

**A sample text in Italian for demo - we'll translate Italian text to English**

In [23]:
text = """La Gioconda è un dipinto ad olio del XVI secolo creato da Leonardo. Si tiene al Louvre di Parigi."""

**Define Spark NLP pipeline**

In [24]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

marian = MarianTransformer.pretrained("opus_mt_it_en", "xx") \
    .setInputCols(["sentences"]) \
    .setOutputCol("translation")

nlp_pipeline = Pipeline(stages=[
    documentAssembler,
    sentencerDL,
    marian
])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
opus_mt_it_en download started this may take some time.
Approximate size to download 454.8 MB
[OK!]


**Run the pipeline with `fullAnnotate()` and visualize the result**

In [28]:
# Fit on an empty DataFrame: all stages are pretrained, so no training data is needed
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
lmodel = LightPipeline(pipeline_model)
res = lmodel.fullAnnotate(text)
print('Original:', text, '\n\n')

for i, sentence in enumerate(res[0]['translation']):
    print(f"Translation of sentence {i}:")
    print(f"\t{sentence.result}")
    print("Metadata:\n")
    print(f"\t{sentence.metadata}")

Original: La Gioconda è un dipinto ad olio del XVI secolo creato da Leonardo. Si tiene al Louvre di Parigi. 


Translation of sentence 0:
	La Gioconda is an oil painting of the sixteenth century created by Leonardo.
Metadata:

	{'sentence': '0'}
Translation of sentence 1:
	It's held at the Louvre in Paris.
Metadata:

	{'sentence': '1'}


**Run the pipeline with `annotate()` and visualize the results**

In [29]:
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
lmodel = LightPipeline(pipeline_model)
res2 = lmodel.annotate(text)
print('Original:', text, '\n\n')

print('Translated:\n')
for sentence in res2['translation']:
    print(sentence)

Original: La Gioconda è un dipinto ad olio del XVI secolo creato da Leonardo. Si tiene al Louvre di Parigi. 


Translated:

La Gioconda is an oil painting of the sixteenth century created by Leonardo.
It's held at the Louvre in Paris.


- The LightPipeline in Spark NLP offers two methods, `annotate()` and `fullAnnotate()`, to process input text and obtain pipeline results.

- `annotate()` returns a dictionary whose keys are the output column names and whose values are lists of result strings. This is a simplified output without metadata.

- `fullAnnotate()` returns a list of dictionaries, one per input document, whose keys are the output column names and whose values are lists of Annotation objects, which include metadata.

- Use `annotate()` when only the translated strings are needed, and `fullAnnotate()` when you also need details such as the sentence index.
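
The relationship between the two result shapes can be illustrated without Spark. This is a minimal sketch: `FakeAnnotation` is a hypothetical stand-in that models only the `result` and `metadata` fields used in this notebook, and `simplify` is an illustrative helper, not a Spark NLP function. The values are copied from the outputs above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in for an Annotation object (two fields only)."""
    result: str
    metadata: dict = field(default_factory=dict)


# Shape of a fullAnnotate()-style result: one dictionary per input text.
full_result = [{
    "translation": [
        FakeAnnotation(
            "La Gioconda is an oil painting of the sixteenth century created by Leonardo.",
            {"sentence": "0"},
        ),
        FakeAnnotation("It's held at the Louvre in Paris.", {"sentence": "1"}),
    ]
}]


def simplify(document: dict) -> dict:
    """Reduce one fullAnnotate()-style dictionary to an annotate()-style one."""
    # Keep only the result strings and drop the metadata.
    return {column: [a.result for a in annotations]
            for column, annotations in document.items()}


simple_result = simplify(full_result[0])
print(simple_result["translation"])
```

Going in the other direction is not possible: once the metadata is dropped, the sentence index can no longer be recovered from the strings alone.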
