![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **NerQuestionGenerator**

This notebook covers the parameters and usage of the `NerQuestionGenerator` annotator.

**📖 Learning Objectives:**

1. Understand how to use `NerQuestionGenerator`.

2. Become comfortable using the different parameters of the annotator.

3. Programmatically generate questions to be used by Question-Answering models.


**🔗 Helpful Links:**

- Documentation : [NerQuestionGenerator](https://nlp.johnsnowlabs.com/docs/en/licensed_annotator)

- Python Docs : [NerQuestionGenerator](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/qa/qa_ner_generator/index.html#sparknlp_jsl.annotator.qa.qa_ner_generator.NerQuestionGenerator)

- Scala Docs : [NerQuestionGenerator](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/qa/NerQuestionGenerator.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp/).

## **📜 Background**


`NerQuestionGenerator` takes NER chunks (obtained by, e.g., `NerConverterInternal`) and generates questions based on two entity types, a pronoun, and a strategy.

The question is generated in the form of `[QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]`. The generated question can be used by `QuestionAnswerer` or `ZeroShotNer` annotators to answer the question or find NER entities.
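
As a rough illustration of the template above, the following plain-Python sketch joins the four pieces in order. It is not the annotator's implementation: `build_question` is a hypothetical helper, and it makes no attempt to reproduce the annotator's exact whitespace.

```python
def build_question(pronoun, chunk1, chunk2, question_mark=False):
    """Join the pieces as [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK].

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    parts = [pronoun, chunk1, chunk2]
    if question_mark:
        # The question mark is an optional fourth piece of the template.
        parts.append("?")
    return " ".join(parts)


question = build_question("How is", "John's", "vital signs", question_mark=True)
print(question)  # How is John's vital signs ?
```

The pronoun is free text, so a multi-word value such as `"How is"` fits the same template.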

## **🎬 Colab Setup**

In [3]:
!pip install -q johnsnowlabs


In [4]:
from johnsnowlabs import nlp


nlp.install(force_browser=True)

Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.4.2 installed! ✅ Heal the planet with NLP! 


In [5]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F


spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.2, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `DOCUMENT`

## **🔎 Parameters**


- `questionPronoun`: Pronoun to be used in the question. E.g., 'When', 'Where', 'Why', 'How', 'Who', 'What'.
- `strategyType`: Strategy for the process, either `Paired` (default) or `Combined`.
- `questionMark`: Whether to add a question mark at the end of the question.
- `entities1`: List with the entity types of entities that appear first in the question. 
- `entities2`: List with the entity types of entities that appear second in the question.


All the parameters can be set using the corresponding setter method in camel case. For example, `.setQuestionPronoun("What")` or `.setQuestionMark(True)`.

### Preparation

First, let's create a data frame with identified entities that will be used to generate questions. We will use the `EntityRulerApproach` to identify entities based on patterns defined in a JSON file.

In [10]:
import json

entities = [
          {
            "label": "Person",
            "patterns": ["Jon", "John", "John's"]
          },
          {
            "label": "Organization",
            "patterns": ["St. Mary's Hospital", "St. Mary's"]
          },
          {
              "label": "Condition",
              "patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
          }
         ]

with open('./entities.json', 'w') as jsonfile:
    json.dump(entities, jsonfile)

In [11]:
document_assembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")

entity_ruler = nlp.EntityRulerApproach() \
                  .setInputCols(["document"]) \
                  .setOutputCol("entity") \
                  .setPatternsResource("./entities.json")\
                  .setCaseSensitive(False)

prep_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    entity_ruler
])

In [12]:
example = """At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery."""
df = spark.createDataFrame([[example]]).toDF("text")

# Apply the initial steps
df = prep_pipeline.fit(df).transform(df)

df.show()

+--------------------+--------------------+--------------------+
|                text|            document|              entity|
+--------------------+--------------------+--------------------+
|At St. Mary's Hos...|[{document, 0, 33...|[{chunk, 3, 21, S...|
+--------------------+--------------------+--------------------+



In [18]:
df.select(F.explode(F.arrays_zip(df.entity.result, df.entity.metadata)).alias("cols")).select(
    F.expr("cols['0']").alias("chunk"),
    F.expr("cols['1'].entity").alias("entity")
).show(truncate=False)

+------------------------+------------+
|chunk                   |entity      |
+------------------------+------------+
|St. Mary's Hospital     |Organization|
|John's                  |Person      |
|vital signs             |Condition   |
|heartbeat               |Condition   |
|oxygen saturation levels|Condition   |
|John's                  |Person      |
+------------------------+------------+



### `questionPronoun`, `entities1`, `entities2`

Using `What`: 

In [14]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("What")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
)


In [15]:
qagenerator.transform(df).select("question").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, What John's vital signs , {sentence -> 0}, []}, {document, 291, 134, What John's heartbeat , {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------+



Using `Where`:

In [20]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("Where")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
)

qagenerator.transform(df).select("question").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, Where John's vital signs , {sentence -> 0}, []}, {document, 291, 134, Where John's heartbeat , {sentence -> 0}, []}]|
+----------------------------------------------------------------------------------------------------------------------------------------+



### `strategyType`

- If set to `Paired` (default), applies a one-vs-one strategy. In this case, the number of chunks in Entity 1 must be aligned with the number of chunks in Entity 2. E.g., if Entity 1 has 3 chunks and Entity 2 has 3 chunks, the first chunk of Entity 1 will be grouped with the first chunk of Entity 2, the second with the second, and the third with the third.

- If set to `Combined`, applies a one-vs-all strategy. In this case, the number of chunks in Entity 1 doesn't need to match the number of chunks in Entity 2, and each chunk in Entity 1 will be grouped with all chunks in Entity 2.
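
The two strategies can be sketched in plain Python with `zip` (one-vs-one) and `itertools.product` (one-vs-all). This is only an illustration of the grouping logic described above, not the annotator's code: `pair_chunks` is a hypothetical helper, and the assumption that unequal lists are cut at the shorter one comes from how `zip` works, not from the library documentation. The chunk lists are the `Person` and `Condition` chunks found in the example text.

```python
from itertools import product


def pair_chunks(chunks1, chunks2, strategy="Paired"):
    """Group chunks of two entity types (hypothetical helper, illustration only)."""
    if strategy == "Paired":
        # One-vs-one: first with first, second with second, and so on.
        # zip stops at the shorter list (an assumption of this sketch).
        return list(zip(chunks1, chunks2))
    if strategy == "Combined":
        # One-vs-all: every chunk of the first type with every chunk of the second.
        return list(product(chunks1, chunks2))
    raise ValueError(f"Unknown strategy: {strategy}")


persons = ["John's", "John's"]
conditions = ["vital signs", "heartbeat", "oxygen saturation levels"]

paired = pair_chunks(persons, conditions, "Paired")
combined = pair_chunks(persons, conditions, "Combined")

print(len(paired), len(combined))  # 2 6
```

With 2 `Person` chunks and 3 `Condition` chunks, the one-vs-one grouping yields 2 pairs, while the one-vs-all grouping yields 2 × 3 = 6.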

In [21]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("Where")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
    .setStrategyType("Paired")
)

qagenerator.transform(df).select("question.result").show(truncate=False)

+----------------------------------------------------+
|result                                              |
+----------------------------------------------------+
|[Where John's vital signs , Where John's heartbeat ]|
+----------------------------------------------------+



In [23]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("How")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
    .setStrategyType("Combined")
)

qagenerator.transform(df).select("question.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|[John's vital signs, John's heartbeat, John's oxygen saturation levels, John's vital signs, John's heartbeat, John's oxygen saturation levels]|
+----------------------------------------------------------------------------------------------------------------------------------------------+



### `questionMark`

In [26]:
qagenerator = (
    medical.NerQuestionGenerator()
    .setInputCols(["entity"])
    .setOutputCol("question")
    .setQuestionPronoun("How is")
    .setEntities1(["Person"])
    .setEntities2(["Condition"])
    .setStrategyType("Paired")
    .setQuestionMark(True)
)

qagenerator.transform(df).select("question.result").show(truncate=False)

+--------------------------------------------------------+
|result                                                  |
+--------------------------------------------------------+
|[How is John's vital signs ?, How is John's heartbeat ?]|
+--------------------------------------------------------+



## Fast inference with [LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline)

We can use Spark NLP's `LightPipeline` to run fast inference directly on a text (or a list of texts) instead of using Spark DataFrames.

Let's check how to do that.

In [28]:
pipeline = nlp.Pipeline(stages=[prep_pipeline, qagenerator])

lp = nlp.LightPipeline(pipeline.fit(df.select("text")))

In [29]:
result = lp.annotate(example)
result

{'document': ["At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery."],
 'entity': ["St. Mary's Hospital",
  "John's",
  'vital signs',
  'heartbeat',
  'oxygen saturation levels',
  "John's"],
 'question': ["How is John's vital signs ?", "How is John's heartbeat ?"]}
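
Because `annotate` returns a plain dictionary keyed by output column, the generated questions can be post-processed with ordinary Python. The sketch below uses a literal dictionary copied (and shortened) from the output above so that it runs on its own; `clean_question` is a hypothetical helper that only tidies whitespace.

```python
# Shortened copy of the dictionary printed above, so the sketch is self-contained.
result = {
    "entity": [
        "St. Mary's Hospital",
        "John's",
        "vital signs",
        "heartbeat",
        "oxygen saturation levels",
        "John's",
    ],
    "question": ["How is John's vital signs ?", "How is John's heartbeat ?"],
}


def clean_question(question):
    """Trim surrounding spaces and drop the space before the question mark.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    return question.strip().replace(" ?", "?")


questions = [clean_question(q) for q in result["question"]]

for number, question in enumerate(questions, start=1):
    print(f"{number}. {question}")
```

The cleaned strings could then be passed on to a question-answering step; whether such cleanup is needed depends on the downstream model.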