![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **NerQuestionGenerator**

This notebook covers the parameters and usage of the `NerQuestionGenerator` annotator.

**📖 Learning Objectives:**

1. Understand how to use `NerQuestionGenerator`.

2. Become comfortable using the different parameters of the annotator.

3. Programmatically generate questions to be used by Question-Answering models.


**🔗 Helpful Links:**

- Documentation : [NerQuestionGenerator](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#nerquestiongenerator)

- Python Docs : [NerQuestionGenerator](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/qa/qa_ner_generator/index.html#sparknlp_jsl.annotator.qa.qa_ner_generator.NerQuestionGenerator)

- Scala Docs : [NerQuestionGenerator](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/qa/NerQuestionGenerator.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp/).

## **📜 Background**


`NerQuestionGenerator` takes NER chunks (obtained by, e.g., `NerConverterInternal`) and generates questions based on two entity types, a pronoun, and a strategy.

The question is generated in the form of `[QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]`. The generated question can be used by `QuestionAnswerer` or `ZeroShotNer` annotators to answer the question or find NER entities.
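As a rough illustration of the template above, the following plain-Python sketch assembles a question string from its parts. `build_question` is a hypothetical helper written for this notebook, not part of the library, and the single-space joining (which leaves a trailing space when no question mark is added) is an assumption inferred from the outputs shown later in this notebook.

```python
def build_question(pronoun: str, entity1: str, entity2: str, question_mark: bool = False) -> str:
    """Hypothetical helper mirroring [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]."""
    suffix = "?" if question_mark else ""
    # Assumption: the four slots are joined by single spaces, so the string
    # ends with a space when the question mark slot is empty.
    return f"{pronoun} {entity1} {entity2} {suffix}"


without_mark = build_question("What", "John's", "vital signs")
with_mark = build_question("How is", "John's", "heartbeat", question_mark=True)

print(repr(without_mark))
print(repr(with_mark))
```

The pronoun slot accepts any string, so multi-word openers such as "How is" fit the same template.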

## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs

In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install the licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [4]:
# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


In [9]:
import pyspark.sql.functions as F


## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `DOCUMENT`

## **🔎 Parameters**


- `questionPronoun`: Pronoun to be used in the question. E.g., 'When', 'Where', 'Why', 'How', 'Who', 'What'.
- `strategyType`: Strategy for the process, either `Paired` (default) or `Combined`.
- `questionMark`: Whether to add a question mark at the end of the question.
- `entities1`: List with the entity types of entities that appear first in the question.
- `entities2`: List with the entity types of entities that appear second in the question.


All parameters can be set with the corresponding camel-case setter method. For example, `.setQuestionPronoun("What")`.

### Preparation

First, let's create a DataFrame with identified entities that will be used to generate questions. We will use the `EntityRulerApproach` to identify entities based on patterns defined in a JSON file.

In [5]:
import json

entities = [
    {
        "label": "Person",
        "patterns": ["Jon", "John", "John's"]
    },
    {
        "label": "Organization",
        "patterns": ["St. Mary's Hospital", "St. Mary's"]
    },
    {
        "label": "Condition",
        "patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
    }
]

with open('./entities.json', 'w') as jsonfile:
    json.dump(entities, jsonfile)

In [6]:
document_assembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")

entity_ruler = nlp.EntityRulerApproach() \
                  .setInputCols(["document"]) \
                  .setOutputCol("entity") \
                  .setPatternsResource("./entities.json")\
                  .setCaseSensitive(False)

prep_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    entity_ruler
])

In [7]:
example = """At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery."""
df = spark.createDataFrame([[example]]).toDF("text")

# Apply the initial steps
df = prep_pipeline.fit(df).transform(df)

df.show()

+--------------------+--------------------+--------------------+
|                text|            document|              entity|
+--------------------+--------------------+--------------------+
|At St. Mary's Hos...|[{document, 0, 33...|[{chunk, 3, 21, S...|
+--------------------+--------------------+--------------------+



In [10]:
df.select(F.explode(F.arrays_zip(df.entity.result, df.entity.metadata)).alias("cols")).select(
    F.expr("cols['0']").alias("chunk"),
    F.expr("cols['1'].entity").alias("entity")
).show(truncate=False)

+------------------------+------------+
|chunk                   |entity      |
+------------------------+------------+
|St. Mary's Hospital     |Organization|
|John's                  |Person      |
|vital signs             |Condition   |
|heartbeat               |Condition   |
|oxygen saturation levels|Condition   |
|John's                  |Person      |
+------------------------+------------+
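To make it easier to see which chunks an entity type contributes, the sketch below groups the chunks from the table above by label in plain Python. The `chunks` list is copied by hand from that table, and `by_label` is only an illustrative name; the annotator itself works on the Spark annotation column, not on a dictionary like this.

```python
from collections import defaultdict

# (chunk, label) pairs copied from the table above, in document order
chunks = [
    ("St. Mary's Hospital", "Organization"),
    ("John's", "Person"),
    ("vital signs", "Condition"),
    ("heartbeat", "Condition"),
    ("oxygen saturation levels", "Condition"),
    ("John's", "Person"),
]

# Group chunk texts by entity label, preserving document order within each label
by_label = defaultdict(list)
for chunk, label in chunks:
    by_label[label].append(chunk)

for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

Here `Person` contributes two chunks and `Condition` three, which is worth keeping in mind when reading the generated questions in the next sections.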



### `questionPronoun`, `entities1`, `entities2`

Using `What`:

In [11]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("What")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
)


In [12]:
qagenerator.transform(df).select("question").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, What John's vital signs , {sentence -> 0}, []}, {document, 291, 134, What John's heartbeat , {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------+



Using `Where`:

In [13]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("Where")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
)

qagenerator.transform(df).select("question").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, Where John's vital signs , {sentence -> 0}, []}, {document, 291, 134, Where John's heartbeat , {sentence -> 0}, []}]|
+----------------------------------------------------------------------------------------------------------------------------------------+



### `strategyType`

- If set to `Paired` (default), applies a one-vs-one strategy. In this case, the number of chunks in Entity 1 must match the number of chunks in Entity 2. E.g., if Entity 1 and Entity 2 each have 3 chunks, the first chunk of Entity 1 is grouped with the first chunk of Entity 2, the second with the second, and the third with the third.

- If set to `Combined`, applies a one-vs-all strategy. In this case, the number of chunks in Entity 1 doesn't need to match the number of chunks in Entity 2, and each chunk in Entity 1 is grouped with every chunk in Entity 2.
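The two strategies can be pictured with a small plain-Python sketch: one-vs-one grouping behaves like `zip`, and one-vs-all grouping behaves like a Cartesian product. The helper names `paired` and `combined` are hypothetical, and this is an analogy rather than the annotator's implementation. In particular, `zip` silently stops at the shorter list, which is only an assumption about how unequal chunk counts are handled.

```python
from itertools import product

# Chunk texts for the two entity types in the example sentence
persons = ["John's", "John's"]
conditions = ["vital signs", "heartbeat", "oxygen saturation levels"]


def paired(first, second):
    """One-vs-one: the i-th chunk of `first` is grouped with the i-th chunk of `second`."""
    # Assumption: extra chunks in the longer list are dropped, as zip does.
    return list(zip(first, second))


def combined(first, second):
    """One-vs-all: every chunk of `first` is grouped with every chunk of `second`."""
    return list(product(first, second))


paired_groups = paired(persons, conditions)
combined_groups = combined(persons, conditions)

print(len(paired_groups), paired_groups)
print(len(combined_groups), combined_groups)
```

With two `Person` chunks and three `Condition` chunks, the paired grouping yields two pairs and the combined grouping yields six.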

In [14]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("Where")
    .setEntities1(["Person"])
    .setEntities2(["Organization"])
    .setStrategyType("Paired")
)

qagenerator.transform(df).select("question.result").show(truncate=False)

+-----------------------------------+
|result                             |
+-----------------------------------+
|[Where John's St. Mary's Hospital ]|
+-----------------------------------+



In [15]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("How")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
    .setStrategyType("Combined")
)

qagenerator.transform(df).select("question.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|[John's vital signs, John's heartbeat, John's oxygen saturation levels, John's vital signs, John's heartbeat, John's oxygen saturation levels]|
+----------------------------------------------------------------------------------------------------------------------------------------------+



### `questionMark`

In [16]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("How is")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
    .setStrategyType("Paired")
    .setQuestionMark(True)
)

qagenerator.transform(df).select("question.result").show(truncate=False)

+--------------------------------------------------------+
|result                                                  |
+--------------------------------------------------------+
|[How is John's vital signs ?, How is John's heartbeat ?]|
+--------------------------------------------------------+



## Fast inference with [LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline)

We can use Spark NLP's `LightPipeline` to run fast inference directly on a text (or a list of texts) instead of using Spark DataFrames.

Let's check how to do that.

In [17]:
pipeline = nlp.Pipeline(stages=[prep_pipeline, qagenerator])

lp = nlp.LightPipeline(pipeline.fit(df.select("text")))

In [18]:
result = lp.annotate(example)
result

{'document': ["At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery."],
 'entity': ["St. Mary's Hospital",
  "John's",
  'vital signs',
  'heartbeat',
  'oxygen saturation levels',
  "John's"],
 'question': ["How is John's vital signs ?", "How is John's heartbeat ?"]}
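The generated questions contain a space before the question mark. If a downstream consumer prefers conventional punctuation, a small post-processing step in plain Python is one option. This is a sketch only: `tidy` is a hypothetical helper, and the `questions` list is copied from the `question` field of the result above.

```python
import re

# Questions copied from the 'question' field of the LightPipeline result above
questions = ["How is John's vital signs ?", "How is John's heartbeat ?"]


def tidy(question: str) -> str:
    """Strip outer whitespace and remove any whitespace before a trailing '?'."""
    return re.sub(r"\s+\?$", "?", question.strip())


cleaned = [tidy(q) for q in questions]
print(cleaned)
```

Questions without a trailing question mark are only stripped of surrounding whitespace, so the same helper can be applied regardless of the `questionMark` setting.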