![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/15.0.Question_Answering.ipynb)

## Colab Setup

In [None]:
!pip install -q pyspark==3.4.1 spark-nlp==5.3.2

In [None]:
import sparknlp

spark = sparknlp.start()

from sparknlp.base import *
from sparknlp.annotator import *

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 5.3.2
Apache Spark version: 3.4.1


## Question Answering Models

Spark NLP provides pretrained question answering models, sourced and curated from many open sources, that are built to scale and to be ready for production use.
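
The annotators in this notebook are span classifiers: given a question and a context, they return a span of the context as the answer. As a rough illustration of the idea (not Spark NLP's internal implementation), the sketch below picks the span with the highest combined start and end score. The function name `best_span`, the `max_len` limit, and all scores are made up for the example.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 15) -> Tuple[int, int]:
    """Return (start, end) token indices maximizing start + end score,
    with start <= end and at most max_len tokens in the span."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Hypothetical per-token scores, invented for illustration only.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.0, 0.2, 4.0, 0.1, 0.0, 0.1, 0.3, 1.5, 0.0]
end_scores = [0.0, 0.1, 0.1, 3.5, 0.2, 0.0, 0.1, 0.1, 2.0, 0.1]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

In a real model the scores come from the transformer's output layer and the tokens from its own tokenizer; here both are stand-ins.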

### AlbertForQuestionAnswering

📚 For more information, see the [AlbertForQuestionAnswering](https://sparknlp.org/docs/en/transformers#albertforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=AlbertForQuestionAnswering).

In [None]:
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlargev1_squad2_512","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler,
                            spanClassifier])

data = spark.createDataFrame([["Which name is also used to describe the Amazon rainforest in English?","""The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."""]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

albert_qa_xxlargev1_squad2_512 download started this may take some time.
Approximate size to download 735.9 MB
[OK!]


In [None]:
result.select('answer.result').show(truncate=False)

+-----------------------+
|result                 |
+-----------------------+
|[usuallyAmazonia;Frenc]|
+-----------------------+



In [None]:
# Placeholder DataFrame with the same input columns the pipeline expects
empty_df = spark.createDataFrame([["", ""]]).toDF("question", "context")

pipelineModel = pipeline.fit(empty_df)

light_model = LightPipeline(pipelineModel)


In [None]:
light_model.annotate("Which name is also used to describe the Amazon rainforest in English?","""The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.""")

{'document_question': ['Which name is also used to describe the Amazon rainforest in English?'],
 'document_context': ['The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet\'s remaining
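
As the output above shows, `annotate` returns a dictionary that maps each output column to a list of strings, so the answer should appear under the `answer` key chosen with `setOutputCol`. A minimal sketch of pulling it out safely follows. The helper `first_answer` is hypothetical, and `mock_result` is a hand-written stand-in, not real model output.

```python
from typing import Dict, List, Optional


def first_answer(annotated: Dict[str, List[str]],
                 output_col: str = "answer") -> Optional[str]:
    """Return the first non-empty string under output_col, or None."""
    for value in annotated.get(output_col, []):
        if value.strip():
            return value.strip()
    return None


# Hand-written stand-in for an annotate() result; not real model output.
mock_result = {
    "document_question": ["Where do I live?"],
    "document_context": ["My name is Wolfgang and I live in Berlin"],
    "answer": ["Berlin"],
}

print(first_answer(mock_result))
print(first_answer({"answer": [""]}))
```

With the light pipeline from this section, the same helper could be applied as `first_answer(light_model.annotate(question, context))`.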

In [None]:
light_model.fullAnnotate("Which name is also used to describe the Amazon rainforest in English?","""The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.""")

[{'document_question': [Annotation(document, 0, 68, Which name is also used to describe the Amazon rainforest in English?, {}, [])],
  'document_context': [Annotation(document, 0, 1056, The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their

### BertForQuestionAnswering

📚 For more information, see the [BertForQuestionAnswering](https://sparknlp.org/docs/en/transformers#bertforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=BertForQuestionAnswering).

In [None]:
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488","es") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)


pipeline = Pipeline().setStages([document_assembler,
                                 spanClassifier])

# English translation of the question: How many people speak Spanish?
# English translation of the context: Spanish is the second most spoken language in the world, with more than 442 million speakers

example = spark.createDataFrame([["¿Cuántas personas hablan español?", "El español es el segundo idioma más hablado del mundo con más de 442 millones de hablantes"]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)

bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488 download started this may take some time.
Approximate size to download 391 MB
[OK!]


In [None]:
result.select('answer.result').show(truncate=False)

+--------------+
|result        |
+--------------+
|[442 millones]|
+--------------+



### DebertaForQuestionAnswering

📚 For more information, see the [DeBertaForQuestionAnswering](https://sparknlp.org/docs/en/transformers#debertaforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=DeBertaForQuestionAnswering).

In [None]:
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DeBertaForQuestionAnswering.pretrained("deberta_v3_xsmall_qa_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler,
                            spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

deberta_v3_xsmall_qa_squad2 download started this may take some time.
Approximate size to download 240.6 MB
[OK!]


In [None]:
result.select('answer.result').show(truncate=False)

+-------+
|result |
+-------+
|[Clara]|
+-------+



### DistilBertForQuestionAnswering

📚 For more information, see the [DistilBertForQuestionAnswering](https://sparknlp.org/docs/en/transformers#distilbertforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=DistilBertForQuestionAnswering).

In [None]:
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_base_cased_qa_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler,
                            spanClassifier])

data = spark.createDataFrame([["Where do I live?", "My name is Wolfgang and I live in Berlin"]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

distilbert_base_cased_qa_squad2 download started this may take some time.
Approximate size to download 232.5 MB
[OK!]


In [None]:
result.select('answer.result').show(truncate=False)

+--------+
|result  |
+--------+
|[Berlin]|
+--------+



### LongformerForQuestionAnswering

📚 For more information, see the [LongformerForQuestionAnswering](https://sparknlp.org/docs/en/transformers#longformerforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=LongformerForQuestionAnswering).

In [None]:
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_large_4096_finetuned_triviaqa","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler,
                            spanClassifier])

data = spark.createDataFrame([["Where did Super Bowl 50 take place?", """Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season.
The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title.
The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.
As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives,
as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"),
so that the logo could prominently feature the Arabic numerals 50."""]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

longformer_qa_large_4096_finetuned_triviaqa download started this may take some time.
Approximate size to download 1.5 GB
[OK!]


In [None]:
result.select('answer.result').show(truncate=False)

+------------------------+
|result                  |
+------------------------+
|[San Francisco Bay Area]|
+------------------------+



### RoBertaForQuestionAnswering

📚 For more information, see the [RoBertaForQuestionAnswering](https://sparknlp.org/docs/en/transformers#robertaforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=RoBertaForQuestionAnswering).

In [None]:
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_covid","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([document_assembler,
                                 spanClassifier])

data = spark.createDataFrame([["Do I have Covid?", "I have a fever and a cough and for the past few days, I have lost my sense of smell and taste. Later I was diagnosed with Covid."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

roberta_qa_roberta_base_squad2_covid download started this may take some time.
Approximate size to download 442.2 MB
[OK!]


In [None]:
result.select('answer.result').show(truncate=False)

+----------------------------------+
|result                            |
+----------------------------------+
|[Later I was diagnosed with Covid]|
+----------------------------------+



### XlmRoBertaForQuestionAnswering

📚 For more information, see the [XlmRoBertaForQuestionAnswering](https://sparknlp.org/docs/en/transformers#xlmrobertaforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=XlmRoBertaForQuestionAnswering).

In [None]:
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_base_qa_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler,
                            spanClassifier])

data = spark.createDataFrame([["What year was the Carolina Panthers franchise founded?", """The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP).
They defeated the Arizona Cardinals 49–15 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995.
The Broncos finished the regular season with a 12–4 record, and denied the New England Patriots a chance to defend their title from Super Bowl XLIX by defeating them 20–18 in the AFC Championship Game.
They joined the Patriots, Dallas Cowboys, and Pittsburgh Steelers as one of four teams that have made eight appearances in the Super Bowl."""]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

xlm_roberta_base_qa_squad2 download started this may take some time.
Approximate size to download 834.5 MB
[OK!]


In [None]:
result.select('answer.result').show(truncate=False)

+-------+
|result |
+-------+
|[1995.]|
+-------+
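
The extracted span above keeps the sentence's trailing period (`1995.`). If downstream code needs the bare value, a small post-processing step can trim it. The sketch below is one possible approach, and the helper name `clean_answer` is hypothetical. It would also strip punctuation that belongs to the answer (for example an abbreviation ending in a period), so treat it as a starting point.

```python
def clean_answer(answer: str) -> str:
    """Strip surrounding whitespace, then trailing sentence punctuation."""
    return answer.strip().rstrip(".,;:!?")


print(clean_answer("1995."))
print(clean_answer("  Berlin "))
```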



### CamemBertForQuestionAnswering

`CamemBertForQuestionAnswering` applies CamemBERT, a French language model, to question answering: it answers a question in French by extracting the relevant span from a text passage. This is useful for French-language applications such as chatbots, virtual assistants, and information retrieval systems.

📚 For more information, see the [CamemBertForQuestionAnswering](https://sparknlp.org/docs/en/transformers#camembertforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=CamemBertForQuestionAnswering).

In [None]:
# Define the pipeline
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

span_classifier = CamemBertForQuestionAnswering.pretrained("camembert_base_qa_fquad", "fr")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipelinecbert = Pipeline(stages=[document_assembler, span_classifier])

camembert_base_qa_fquad download started this may take some time.
Approximate size to download 392.8 MB
[OK!]


In [None]:
# Prepare the data
data = spark.createDataFrame([["Où est-ce que je vis?","Mon nom est Wolfgang et je vis à Berlin."]]).toDF("question", "context")

# Fit and transform the data using the pipeline
result = pipelinecbert.fit(data).transform(data)

In [None]:
result.show(truncate=False)

+---------------------+----------------------------------------+---------------------------------------------------------------+----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|question             |context                                 |document_question                                              |document_context                                                                  |answer                                                                                                                                                |
+---------------------+----------------------------------------+---------------------------------------------------------------+----------------------------------------------------------------------------------+---------------------------------------------------------------

In [None]:
result.select(['question', 'context','answer.result']).show(truncate=False)

+---------------------+----------------------------------------+--------+
|question             |context                                 |result  |
+---------------------+----------------------------------------+--------+
|Où est-ce que je vis?|Mon nom est Wolfgang et je vis à Berlin.|[Berlin]|
+---------------------+----------------------------------------+--------+



### MPNetForQuestionAnswering

📚 For more information, see the [MPNetForQuestionAnswering](https://sparknlp.org/docs/en/transformers#mpnetforquestionanswering) documentation.

For the available models, please check the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=MPNetForQuestionAnswering).

In [None]:
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

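# pretrained() without arguments loads the annotator's default model
# (mpnet_base_question_answering_squad2, per the download log)
spanClassifier = MPNetForQuestionAnswering.pretrained() \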
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

mpnet_base_question_answering_squad2 download started this may take some time.
Approximate size to download 384.9 MB
[OK!]


In [None]:
examples = [
    ["Do you know where I'm from?", "I'm from Tokyo and love sushi."],
    ["Can you guess my favorite color?", "My favorite color is blue and I love the ocean."],
    ["What do you think I do for a living?", "I'm a teacher in New York and enjoy reading."],
    ["Are you aware of my hobby?", "I enjoy painting and often visit art galleries."],
    ["Do you know my pet's name?", "My dog's name is Max and he loves long walks."]
    ]

In [None]:
data = spark.createDataFrame(examples).toDF("question", "context")

In [None]:
result = pipeline.fit(data).transform(data)
result.select("question", "context", "answer.result").show(truncate=False)

+------------------------------------+-----------------------------------------------+----------+
|question                            |context                                        |result    |
+------------------------------------+-----------------------------------------------+----------+
|Do you know where I'm from?         |I'm from Tokyo and love sushi.                 |[Tokyo]   |
|Can you guess my favorite color?    |My favorite color is blue and I love the ocean.|[blue]    |
|What do you think I do for a living?|I'm a teacher in New York and enjoy reading.   |[teacher] |
|Are you aware of my hobby?          |I enjoy painting and often visit art galleries.|[painting]|
|Do you know my pet's name?          |My dog's name is Max and he loves long walks.  |[Max]     |
+------------------------------------+-----------------------------------------------+----------+
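
The batch above pairs each question with its own context. When several questions target the same passage, the rows can be generated rather than typed by hand. The sketch below shows one way to do that in plain Python. The helper name `make_qa_rows` is hypothetical, and the sample strings are reused from earlier cells in this notebook.

```python
from typing import List


def make_qa_rows(questions: List[str], context: str) -> List[List[str]]:
    """Pair every question with the same context, one [question, context] row each."""
    return [[question, context] for question in questions]


context = "My name is Clara and I live in Berkeley."
rows = make_qa_rows(["What is my name?", "Where do I live?"], context)

for row in rows:
    print(row)
```

The resulting list has the same shape as `examples`, so it can be passed to `spark.createDataFrame(rows).toDF("question", "context")` and run through any of the pipelines in this notebook.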

