![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/13.2.T5_SQL_Code_Generation_and_Style_Transfer_with_SparkNLP.ipynb)

# SQL Code Generation and Style Transfer with T5

Google's T5 is a sequence-to-sequence model trained on more than 15 NLP datasets covering a range of problem types, from text summarization, question answering, and translation to various semantic inference tasks. This multi-task training enriches T5's ability to map token sequences to meaningful semantic vectors, which helps it generalize across tasks, including tasks it was never explicitly trained on.

On top of this, T5 is pre-trained on a word prediction objective, similar to the objectives used for language models such as BERT, GPT, and ELMo. This gives T5 general knowledge of real-world concepts, which further enhances its understanding.

## Colab Setup

In [None]:
!pip install -q pyspark==3.3.0 spark-nlp==5.0.0

In [2]:
import sparknlp

spark = sparknlp.start()

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 5.0.0
Apache Spark version: 3.3.0


# T5-small fine-tuned on WikiSQL

Google’s T5-small fine-tuned on WikiSQL for English-to-SQL translation. It generates SQL code from natural language input when the task is set to “translate English to SQL:”.

In [3]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_small_wikiSQL") \
    .setTask("translate English to SQL:") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(200) \
    .setOutputCol("sql")

pipeline = Pipeline().setStages([documentAssembler, t5])

data = spark.createDataFrame([["How many customers have ordered more than 2 items?"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("sql.result").show(truncate=False)

t5_small_wikiSQL download started this may take some time.
Approximate size to download 249.9 MB
[OK!]
+----------------------------------------------------+
|result                                              |
+----------------------------------------------------+
|[SELECT COUNT Customers FROM table WHERE Orders > 2]|
+----------------------------------------------------+
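
The generated query uses the generic placeholder `table` instead of a real table name, so it usually has to be adapted before it can run against an actual schema. A minimal sketch of one way to do that, assuming a hypothetical helper `fill_table_name` (not part of Spark NLP) and a hypothetical target table called `orders`:

```python
import re

def fill_table_name(sql: str, table_name: str) -> str:
    """Replace the generic `table` placeholder after FROM with a real table name."""
    # Only accept simple identifiers so the name can be spliced in safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"not a simple SQL identifier: {table_name!r}")
    # \b keeps the pattern from matching longer names such as `tables`.
    return re.sub(r"\bFROM\s+table\b", f"FROM {table_name}", sql, flags=re.IGNORECASE)

generated = "SELECT COUNT Customers FROM table WHERE Orders > 2"
print(fill_table_name(generated, "orders"))
# SELECT COUNT Customers FROM orders WHERE Orders > 2
```

Even after this substitution the output is not guaranteed to be valid SQL (here `COUNT` is missing its parentheses), so generated queries should be reviewed before they are executed.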



## **Warning**
The next models are quite large and they will not all fit into the same Spark/Colab session. You will have to **restart your Colab session** in between running the models or you will encounter errors.

# T5 for Active to Passive Style Transfer

This is a text-to-text model based on T5, fine-tuned to generate passively written text from actively written input, for the task “transfer Active to Passive:”. It is based on Prithiviraj Damodaran’s Styleformer.

In [4]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_active_to_passive_styletransfer") \
    .setTask("transfer Active to Passive:") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(200) \
    .setOutputCol("transfers")

pipeline = Pipeline().setStages([documentAssembler, t5])

data = spark.createDataFrame([["I am writing you a letter."]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("transfers.result").show(truncate=False)

t5_active_to_passive_styletransfer download started this may take some time.
Approximate size to download 252.7 MB
[OK!]
+---------------------------+
|result                     |
+---------------------------+
|[a letter is written by me]|
+---------------------------+



# T5 for Passive to Active Style Transfer

This is a text-to-text model based on T5, fine-tuned to generate actively written text from passively written input, for the task “transfer Passive to Active:”. It is based on Prithiviraj Damodaran’s Styleformer.

In [5]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_passive_to_active_styletransfer") \
    .setTask("transfer Passive to Active:") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(200) \
    .setOutputCol("transfers")

pipeline = Pipeline().setStages([documentAssembler, t5])

data = spark.createDataFrame([["A letter was sent to you."]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("transfers.result").show(truncate=False)

t5_passive_to_active_styletransfer download started this may take some time.
Approximate size to download 253.2 MB
[OK!]
+-------------------+
|result             |
+-------------------+
|[you sent a letter]|
+-------------------+



# T5 for Formal to Informal Style Transfer

This is a text-to-text model based on T5 fine-tuned to generate informal text from a formal text input, for the task “transfer Formal to Casual:”. It is based on Prithiviraj Damodaran’s Styleformer.

In [6]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_formal_to_informal_styletransfer") \
    .setTask("transfer Formal to Casual:") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(200) \
    .setOutputCol("transfers")

pipeline = Pipeline().setStages([documentAssembler, t5])

data = spark.createDataFrame([["Please leave the room now."]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("transfers.result").show(truncate=False)

t5_formal_to_informal_styletransfer download started this may take some time.
Approximate size to download 881.2 MB
[OK!]
+---------------------+
|result               |
+---------------------+
|[leave the room now.]|
+---------------------+



# T5 for Informal to Formal Style Transfer

This is a text-to-text model based on T5, fine-tuned to generate formal text from informal text input, for the task “transfer Casual to Formal:”. It is based on Prithiviraj Damodaran’s Styleformer.

In [7]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_informal_to_formal_styletransfer") \
    .setTask("transfer Casual to Formal:") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(200) \
    .setOutputCol("transfers")

pipeline = Pipeline().setStages([documentAssembler, t5])

data = spark.createDataFrame([["Who gives a crap?"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("transfers.result").show(truncate=False)

t5_informal_to_formal_styletransfer download started this may take some time.
Approximate size to download 881.2 MB
[OK!]
+------------+
|result      |
+------------+
|[Who cares?]|
+------------+
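
The four style transfer cells above differ only in the model name and the task string. A minimal sketch of how they could be folded into one helper: `STYLE_TRANSFER_TASKS` and `build_style_transfer_pipeline` are hypothetical names introduced here, while the model names, task strings, and Spark NLP calls are the ones used in the cells above. Spark NLP is imported inside the function so the lookup table can be inspected without a running Spark session.

```python
# Model name -> task prefix, copied from the cells above.
STYLE_TRANSFER_TASKS = {
    "t5_active_to_passive_styletransfer": "transfer Active to Passive:",
    "t5_passive_to_active_styletransfer": "transfer Passive to Active:",
    "t5_formal_to_informal_styletransfer": "transfer Formal to Casual:",
    "t5_informal_to_formal_styletransfer": "transfer Casual to Formal:",
}

def build_style_transfer_pipeline(model_name, output_col="transfers"):
    """Build the two-stage pipeline used above for one style transfer model."""
    task = STYLE_TRANSFER_TASKS[model_name]  # raises KeyError for unknown models

    # Imported lazily: these require Spark NLP and a started Spark session.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import T5Transformer

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("documents")

    t5 = T5Transformer.pretrained(model_name) \
        .setTask(task) \
        .setInputCols(["documents"]) \
        .setMaxOutputLength(200) \
        .setOutputCol(output_col)

    return Pipeline().setStages([document_assembler, t5])
```

Usage would mirror the cells above, for example `build_style_transfer_pipeline("t5_informal_to_formal_styletransfer").fit(data).transform(data)`. The warning about restarting the session between large models still applies.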




# Overview of every task available with T5
[The T5 model](https://arxiv.org/pdf/1910.10683.pdf) is trained on various datasets for 18 different tasks, which fall into 8 categories.



1. Text summarization
2. Question answering
3. Translation
4. Sentiment analysis
5. Natural Language inference
6. Coreference resolution
7. Sentence Completion
8. Word sense disambiguation

### Every T5 Task with explanation:
|Task Name | Explanation |
|----------|--------------|
|[1.CoLA](https://nyu-mll.github.io/CoLA/)                   | Classify whether a sentence is grammatically correct|
|[2.RTE](https://dl.acm.org/doi/10.1007/11736790_9)                    | Classify whether a statement can be deduced from a sentence|
|[3.MNLI](https://arxiv.org/abs/1704.05426)                   | Classify for a hypothesis and premise whether the premise entails the hypothesis, contradicts it, or neither (3 classes).|
|[4.MRPC](https://www.aclweb.org/anthology/I05-5002.pdf)                   | Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent)|
|[5.QNLI](https://arxiv.org/pdf/1804.07461.pdf)                   | Classify whether the answer to a question can be deduced from an answer candidate.|
|[6.QQP](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)                    | Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent)|
|[7.SST2](https://www.aclweb.org/anthology/D13-1170.pdf)                   | Classify the sentiment of a sentence as positive or negative|
|[8.STSB](https://www.aclweb.org/anthology/S17-2001/)                   | Rate the semantic similarity of a pair of sentences on a scale from 1 to 5 (21 classes)|
|[9.CB](https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601)                     | Classify for a premise and a hypothesis whether the premise entails the hypothesis, contradicts it, or neither (3 classes).|
|[10.COPA](https://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418/0)                   | Classify for a question, premise, and 2 choices which choice is correct (binary).|
|[11.MultiRc](https://www.aclweb.org/anthology/N18-1023.pdf)                | Classify for a question, a paragraph of text, and an answer candidate, if the answer is correct (binary).|
|[12.WiC](https://arxiv.org/abs/1808.09121)                    | Classify for a pair of sentences and an ambiguous word whether the word has the same meaning in both sentences.|
|[13.WSC/DPR](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492/0)       | Predict for an ambiguous pronoun in a sentence what it is referring to.  |
|[14.Summarization](https://arxiv.org/abs/1506.03340)          | Summarize text into a shorter representation.|
|[15.SQuAD](https://arxiv.org/abs/1606.05250)                  | Answer a question for a given context.|
|[16.WMT1.](https://arxiv.org/abs/1706.03762)                  | Translate English to German|
|[17.WMT2.](https://arxiv.org/abs/1706.03762)                   | Translate English to French|
|[18.WMT3.](https://arxiv.org/abs/1706.03762)                   | Translate English to Romanian|

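Each task in the table is selected by a text prefix placed in front of the input. A minimal sketch of how such inputs could be assembled, assuming the prefix formats listed in the appendix of the T5 paper; `SINGLE_INPUT_PREFIXES` and `format_t5_input` are hypothetical names introduced here, and only a few single-input tasks are shown. With Spark NLP, the same prefix would be passed to `setTask` rather than concatenated by hand.

```python
# Task name -> prefix for tasks that take a single piece of text.
# The prefix strings are assumed to follow the T5 paper's appendix.
SINGLE_INPUT_PREFIXES = {
    "cola": "cola sentence:",
    "sst2": "sst2 sentence:",
    "summarization": "summarize:",
    "wmt_en_de": "translate English to German:",
    "wmt_en_fr": "translate English to French:",
    "wmt_en_ro": "translate English to Romanian:",
}

def format_t5_input(task: str, text: str) -> str:
    """Prepend the task prefix to the input text, separated by one space."""
    return f"{SINGLE_INPUT_PREFIXES[task]} {text.strip()}"

print(format_t5_input("wmt_en_de", "That is good."))
# translate English to German: That is good.
```

Tasks that take two inputs, such as sentence pairs, use prefixes with several labeled fields and are left out of this sketch.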
