

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TRANSLATION_MARIAN.ipynb)




# **Translate text**

### Spark NLP documentation and instructions:
https://nlp.johnsnowlabs.com/docs/en/quickstart

### You can find details about Spark NLP annotators here:
https://nlp.johnsnowlabs.com/docs/en/annotators

### You can find details about Spark NLP models here:
https://nlp.johnsnowlabs.com/models


## 1. Colab Setup

In [None]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.3.0 spark-nlp==4.2.8

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display

## 2. Start the Spark session

Import the dependencies and start the Spark session.

In [2]:
import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

In [3]:
spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.8
Apache Spark version: 3.3.0


## 3. Select the DL model

For the complete model list:
https://nlp.johnsnowlabs.com/models

For `Translation` models:
https://nlp.johnsnowlabs.com/models?tag=translation
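
The model used later in this notebook is `opus_mt_it_en`, which suggests an `opus_mt_<source>_<target>` naming pattern. As a minimal sketch under that assumption, the hypothetical helper below (`marian_model_name` is not part of Spark NLP) builds a candidate name for another language pair. Not every pair is published, so check that the name appears in the translation model list above before passing it to `MarianTransformer.pretrained`.

```python
# Hypothetical helper, not a Spark NLP API.
# Assumes the "opus_mt_<source>_<target>" pattern seen in "opus_mt_it_en".
def marian_model_name(source_lang: str, target_lang: str) -> str:
    """Build a candidate Marian model name from two language codes."""
    source = source_lang.strip().lower()
    target = target_lang.strip().lower()
    if not source or not target:
        raise ValueError("both language codes must be non-empty")
    if source == target:
        raise ValueError("source and target languages must differ")
    return f"opus_mt_{source}_{target}"


# Candidate name for Italian -> English, the pair used in this notebook.
model_name = marian_model_name("it", "en")
print(model_name)
```

The returned string is only a candidate: it still has to match a model that actually exists on the Models Hub.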

## 4. Sample Italian text to translate into English

In [4]:
text = """La Gioconda è un dipinto ad olio del XVI secolo creato da Leonardo. Si tiene al Louvre di Parigi."""


## 5. Define Spark NLP pipeline

In [5]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

## More accurate Sentence Detection using Deep Learning
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentences")

marian = MarianTransformer.pretrained("opus_mt_it_en", "xx")\
  .setInputCols(["sentences"])\
  .setOutputCol("translation")

nlp_pipeline = Pipeline(
    stages=[
        documentAssembler, 
        sentencerDL, 
        marian
        ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
opus_mt_it_en download started this may take some time.
Approximate size to download 454.8 MB
[OK!]


## 6. Run the pipeline

In [6]:
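# Every stage is pretrained, so fitting on an empty DataFrame only assembles the PipelineModel
empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)

# LightPipeline annotates plain Python strings without building a Spark DataFrame
lmodel = LightPipeline(pipeline_model)
res = lmodel.fullAnnotate(text)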




## 7. Visualize results

In [7]:
print('Original:', text, '\n\n')

print('Translated:\n')
for sentence in res[0]['translation']:
  print(sentence.result)

Original: La Gioconda è un dipinto ad olio del XVI secolo creato da Leonardo. Si tiene al Louvre di Parigi. 


Translated:

La Gioconda is an oil painting of the sixteenth century created by Leonardo.
It's held at the Louvre in Paris.
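
The loop in step 7 reads `.result` from each annotation in `res[0]['translation']`. The sketch below applies the same access to the `sentences` column, pairing each source sentence with its translation. It assumes the pipeline yields one translation per detected sentence, as the output above suggests. `pair_translations` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for Spark NLP annotations so the example runs without a Spark session.

```python
from types import SimpleNamespace


# Hypothetical helper, not a Spark NLP API.
def pair_translations(annotated, source_col="sentences", target_col="translation"):
    """Return (source, translation) pairs from one fullAnnotate result dict."""
    sources = [annotation.result for annotation in annotated[source_col]]
    targets = [annotation.result for annotation in annotated[target_col]]
    # Assumption: one translation annotation per sentence annotation.
    if len(sources) != len(targets):
        raise ValueError("sentence and translation counts differ")
    return list(zip(sources, targets))


# Stand-in for res[0]; real entries are annotation objects exposing `.result`.
sample_result = {
    "sentences": [
        SimpleNamespace(result="La Gioconda è un dipinto ad olio del XVI secolo creato da Leonardo."),
        SimpleNamespace(result="Si tiene al Louvre di Parigi."),
    ],
    "translation": [
        SimpleNamespace(result="La Gioconda is an oil painting of the sixteenth century created by Leonardo."),
        SimpleNamespace(result="It's held at the Louvre in Paris."),
    ],
}

pairs = pair_translations(sample_result)
for source, target in pairs:
    print(f"{source} -> {target}")
```

With the notebook's real output, the call would be `pair_translations(res[0])`.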
