

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER_TRANSLATION.ipynb)




# **Text Translation using Google's T5 Transformer**

### Spark NLP documentation and instructions:
https://nlp.johnsnowlabs.com/docs/en/quickstart

### You can find details about Spark NLP annotators here:
https://nlp.johnsnowlabs.com/docs/en/annotators

### You can find details about Spark NLP models here:
https://nlp.johnsnowlabs.com/models


## 1. Colab Setup

In [1]:
# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp==2.7.0

openjdk version "11.0.9.1" 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
Collecting spark-nlp==2.7.0
  Using cached https://files.pythonhosted.org/packages/cf/2c/0112881b86046b362592a7a9217d41894d857a1a0561dd4fd19a3d9c5791/spark_nlp-2.7.0-1-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-2.7.0


## 2. Start the Spark session

Import the dependencies and start the Spark session.

In [3]:
import os
import json
# Point Spark at the Java 8 install (the system default reported above is Java 11)
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp

spark = sparknlp.start()
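
Spark 2.4.x runs on Java 8, while the `java -version` output in the setup step reports Java 11 as the system default, which is why `JAVA_HOME` is set explicitly here. As a minimal sketch, a check like the following could be run before `sparknlp.start()` to confirm that the variable points at an existing directory. The helper name `java_home_is_valid` is hypothetical and not part of Spark NLP.

```python
import os

def java_home_is_valid(env=None):
    """Return True if JAVA_HOME is set and points at an existing directory."""
    # `env` defaults to the real environment; a plain dict can be passed for testing
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME", "")
    return bool(java_home) and os.path.isdir(java_home)

if not java_home_is_valid():
    print("JAVA_HOME is unset or does not point at an existing directory")
```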

## 3. Select the DL model

For the complete model list:
https://nlp.johnsnowlabs.com/models

For `T5` models:
https://nlp.johnsnowlabs.com/models?tag=t5

## 4. Text Translation using T5 Transformer - English to German

Define the Spark NLP pipeline:

In [8]:
from sparknlp.annotator import *
from sparknlp.base import *

from pyspark.ml import Pipeline

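# Wrap the raw text column in Spark NLP's document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Split each document into sentences with the default pretrained model
sentence_detector = SentenceDetectorDLModel.pretrained() \
    .setInputCols(["documents"]) \
    .setOutputCol("sentence")

# The task prefix selects what T5 does; here, English-to-German translation
t5 = T5Transformer.pretrained("t5_small", "en") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation") \
    .setTask("translate English to German:") \
    .setMaxOutputLength(200)
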
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    t5
])

data = spark.createDataFrame([
  [1, "My name is Spark NLP! It's nice to meet you."],
  [2, "My name is Wolfgang and I live in Berlin"]
]).toDF('id', 'text')

results = pipeline.fit(data).transform(data)

results.select("translation.result").show(truncate=False)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
t5_small download started this may take some time.
Approximate size to download 168.7 MB
[OK!]
+---------------------------------------------------------+
|result                                                   |
+---------------------------------------------------------+
|[Mein Name ist Spark NLP!, Es ist schön, Sie zu treffen.]|
|[Mein Name ist Wolfgang und ich lebe in Berlin.]         |
+---------------------------------------------------------+
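
The value passed to `setTask` in the cell above is plain text: T5 is a text-to-text model and is steered by a prefix placed in front of each input. The sketch below only illustrates how such prefixed inputs are formed. It is not Spark NLP's internal code, and `build_t5_input` is a hypothetical helper name.

```python
def build_t5_input(task, sentence):
    """Prepend a T5 task prefix to a sentence, separated by a single space."""
    return f"{task.strip()} {sentence.strip()}"

task = "translate English to German:"
sentences = ["My name is Spark NLP!", "It's nice to meet you."]

# Each sentence becomes its own prefixed input string
for s in sentences:
    print(build_t5_input(task, s))
```

Swapping the prefix for `"translate English to French:"` is the only change needed to target French, which is what the next section does with `setTask`.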



## 5. Text Translation using T5 Transformer - English to French

In [7]:
from sparknlp.annotator import *
from sparknlp.base import *

from pyspark.ml import Pipeline

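# Wrap the raw text column in Spark NLP's document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Split each document into sentences with the default pretrained model
sentence_detector = SentenceDetectorDLModel.pretrained() \
    .setInputCols(["documents"]) \
    .setOutputCol("sentence")

# Same model as before; only the task prefix changes to target French
t5 = T5Transformer.pretrained("t5_small", "en") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation") \
    .setTask("translate English to French:") \
    .setMaxOutputLength(200)
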
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    t5
])

data = spark.createDataFrame([
  [1, "My name is Spark NLP! It's nice to meet you."],
  [2, "My name is Wolfgang and I live in Berlin"]
]).toDF('id', 'text')

results = pipeline.fit(data).transform(data)

results.select("translation.result").show(truncate=False)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
t5_small download started this may take some time.
Approximate size to download 168.7 MB
[OK!]
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|[Mon nom est Spark NLP!, C'est agréable de vous rencontrer.]|
|[Mon nom est Wolfgang et je vit à Berlin.]                  |
+------------------------------------------------------------+
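
Because the sentence detector runs before T5, each row of `translation.result` is a list with one string per sentence. If a single string per row is preferred, the pieces can be joined afterwards. The sketch below uses the rows printed above as literal data rather than a live Spark DataFrame, and `join_sentences` is a hypothetical helper.

```python
def join_sentences(sentences, sep=" "):
    """Join per-sentence translations into one string, skipping empty pieces."""
    return sep.join(s.strip() for s in sentences if s and s.strip())

# Rows copied from the French output table above
rows = [
    ["Mon nom est Spark NLP!", "C'est agréable de vous rencontrer."],
    ["Mon nom est Wolfgang et je vit à Berlin."],
]

for row in rows:
    print(join_sentences(row))
```

On the DataFrame itself, something like `F.concat_ws(" ", "translation.result")` should give a comparable result without collecting the rows, though that variant is not exercised here.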

