

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/QUESTION_ANSWERING.ipynb)



# **Automatically answer questions**

Automatically generate answers to questions, with or without a supporting context passage.

## 1. Colab Setup

In [1]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash
# To pin specific versions, run the script with flags instead:
# -p sets the PySpark version
# -s sets the Spark NLP version
# !bash colab.sh -p 3.1.1 -s 3.0.1
# By default, both are set to the latest release

openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
setup Colab for PySpark 3.1.1 and Spark NLP 3.0.0
Building wheel for pyspark (setup.py) ... done


In [2]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## 2. Start Spark Session

In [3]:
spark = sparknlp.start()

## 3. Select the model to use

In [4]:
#MODEL_NAME = 'google_t5_small_ssm_nq'
#MODEL_NAME = 't5_small'
MODEL_NAME = 't5_base'

### 3.1 Select the task

The `T5Transformer` model can perform 18 different tasks (see [this paper](https://arxiv.org/abs/1910.10683)). Two of them answer questions:

For the `t5_base` and `t5_small` models, use the task `squad`: answer a question given a context passage.

For the `google_t5_small_ssm_nq` model, use the task `qa`: answer a question without any context.

In [5]:
if MODEL_NAME == "google_t5_small_ssm_nq":
    TASK = 'qa'
else:
    TASK = 'squad'

In [6]:
# Prefix passed to T5Transformer().setTask(<prefix>) for each task
task_prefix = {
    'qa': 'trivia question:',
    'squad': 'question:',
}
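
To make the role of the prefix concrete, the sketch below builds the string the model would receive, assuming the task prefix is simply prepended to each input text with a single space. That joining rule is an assumption, and `build_t5_input` is a hypothetical helper written for illustration; it is not part of Spark NLP.

```python
# Hypothetical helper for illustration only; not a Spark NLP function.
# Assumption: the task prefix is prepended to the input, separated by one space.
task_prefix = {
    'qa': 'trivia question:',
    'squad': 'question:',
}

def build_t5_input(task: str, text: str) -> str:
    """Return the prefixed input string for the given task."""
    if task not in task_prefix:
        raise KeyError(f"Unknown task {task!r}; expected one of {sorted(task_prefix)}")
    # Stripping surrounding whitespace is a choice made here for readability.
    return f"{task_prefix[task]} {text.strip()}"

print(build_t5_input('qa', "Who is the founder of Microsoft?"))
print(build_t5_input('squad', "Where is the Eiffel Tower? context: The Eiffel Tower is in Paris."))
```

Printing the prefixed string this way can help when checking that a question is being sent with the prefix that matches the chosen model.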

## 4. Examples to try on the model

In [7]:
text_lists = {
            'squad':    ["""
                        What does increased oxygen concentrations in the patient’s lungs displace? 
                        context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.
                        """],
            'qa':  ["Who is Clark Kent?",
                "who is the most famous singer?",
                "when do we have winters?",
                "In which city is Eiffel Tower located?",
                "Who is the founder of Microsoft?"]
            }
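
The `squad` example packs the question and its passage into a single string, with the passage introduced by `context:`. A small helper can build that string from separate pieces. This is a minimal sketch: `make_squad_input` is a hypothetical name, and collapsing whitespace is a choice made here for tidier output, not something the model is known to require.

```python
def make_squad_input(question: str, context: str) -> str:
    """Join a question and its context passage in the 'question context: passage' layout."""
    # Collapse runs of whitespace, including newlines and indentation from triple-quoted strings.
    question = " ".join(question.split())
    context = " ".join(context.split())
    return f"{question} context: {context}"

example = make_squad_input(
    "What does increased oxygen concentration in the lungs displace?",
    """Increased O 2 concentration in the lungs helps to displace
       carbon monoxide from the heme group of hemoglobin.""",
)
print(example)
```

Strings built this way could be placed in the `'squad'` list in the same way as the hand-written example.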

## 5. Define the Spark NLP pipeline

In [8]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

t5 = T5Transformer.pretrained(MODEL_NAME) \
    .setTask(task_prefix[TASK]) \
    .setMaxOutputLength(200) \
    .setInputCols(["documents"]) \
    .setOutputCol("T5")

pipeline = Pipeline(stages=[document_assembler, t5])

t5_base download started this may take some time.
Approximate size to download 446 MB
[OK!]


## 6. Run the pipeline

In [9]:
# Fit on an empty DataFrame (the model is pretrained, so no training data is needed)
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = pipeline.fit(empty_df)

# Load the example texts into a Spark DataFrame
text_df = spark.createDataFrame(pd.DataFrame({'text': text_lists[TASK]}))

# Predict with the Pipeline model
result = pipeline_model.transform(text_df)

# Create Light Pipeline
lmodel = LightPipeline(pipeline_model)

# Predict with the Light Pipeline model
res = lmodel.fullAnnotate(text_lists[TASK])

## 7. Visualize the results

Using Light Pipeline:

In [10]:
for r in res:
    print(f"{r['documents'][0].result} => {r['T5'][0].result}\n")
    print("----------------------------------------------\n")


                        What does increased oxygen concentrations in the patient’s lungs displace? 
                        context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.
                         => carbon monoxide

----------------------------------------------
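
The loop above indexes each result by output column name and reads the `result` field of the first annotation. The same access pattern can collect the answers into a plain list. The sketch below runs without Spark by using a stand-in for the annotation objects; `FakeAnnotation` and `answers_from_full_annotate` are hypothetical names, and only the `result` field is modeled.

```python
from collections import namedtuple

# Stand-in for the annotation objects returned by fullAnnotate; only `result` is modeled.
FakeAnnotation = namedtuple("FakeAnnotation", ["result"])

def answers_from_full_annotate(res, question_col="documents", answer_col="T5"):
    """Flatten fullAnnotate-style output into (input text, answer) pairs."""
    pairs = []
    for row in res:
        # Strip the leading and trailing whitespace carried over from the input text.
        question = row[question_col][0].result.strip()
        # Join in case a row holds more than one answer annotation.
        answer = " ".join(a.result for a in row[answer_col])
        pairs.append((question, answer))
    return pairs

# Mocked data shaped like the result printed above (input text shortened).
mock_res = [{
    "documents": [FakeAnnotation("\n  What does increased O 2 concentration in the lungs displace?\n  ")],
    "T5": [FakeAnnotation("carbon monoxide")],
}]

for question, answer in answers_from_full_annotate(mock_res):
    print(f"{question} => {answer}")
```

Because the function only relies on the `result` attribute, it should also accept the real `res` list, under the assumption that each row contains the `documents` and `T5` columns defined in the pipeline.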



Using pipeline model:

In [11]:
result.select('text', 'T5.result').show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------------+
|                                                                                                text|           result|
+----------------------------------------------------------------------------------------------------+-----------------+
|
                        What does increased oxygen concentrations in the patient’s lungs displac...|[carbon monoxide]|
+----------------------------------------------------------------------------------------------------+-----------------+

