![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)




[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/public/QUESTION_ANSWERING_OPEN_BOOK.ipynb)

# **QUESTION ANSWERING (Open Book)**




## **Colab Setup and Start Spark Session**

In [1]:
!pip install -q pyspark==3.3.0 spark-nlp==4.2.8

In [2]:
import sparknlp
import pandas as pd

spark = sparknlp.start()


from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType, IntegerType



print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark 


Spark NLP version 4.2.8
Apache Spark version: 3.3.0


### ***`🔎 Models we will review in this notebook:`***



*  `t5_base`
*  `t5_small`
*  `albert_qa_xxlarge_tweetqa`
*  `bert_qa_callmenicky_finetuned_squad`
*  `deberta_v3_xsmall_qa_squad2`
*  `distilbert_base_cased_qa_squad2`
*  `longformer_qa_large_4096_finetuned_triviaqa`
*  `roberta_qa_roberta_base_squad2_covid`
*  `roberta_qa_CV_Merge_DS`
*  `xlm_roberta_base_qa_squad2`




# **`t5_small` and `t5_base` models**

In [3]:
model_list = ['t5_base', 't5_small']

In [4]:
question_context_list = [
    ["""What does increased oxygen concentrations in the patient’s lungs displace?""" """context : Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment."""],
    ["""What category of game is Legend of Zelda: Twilight Princess?""" """context: The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006."""],
    ["""Who is founder of Alibaba Group?""" """context: Alibaba Group founder Jack Ma has made his first appearance since Chinese regulators cracked down on his business empire. His absence had fuelled speculation over his whereabouts amid increasing official scrutiny of his businesses. The billionaire met 100 rural teachers in China via a video meeting on Wednesday, according to local government media. Alibaba shares surged 5% on Hong Kong's stock exchange on the news."""],
    ["""For what instrument did Frédéric write primarily for?""" """context: Frédéric François Chopin (/ˈʃoʊpæn/; French pronunciation: ​[fʁe.de.ʁik fʁɑ̃.swa ʃɔ.pɛ̃]; 22 February or 1 March 1810 – 17 October 1849), born Fryderyk Franciszek Chopin,[n 1] was a Polish and French (by citizenship and birth of father) composer and a virtuoso pianist of the Romantic era, who wrote primarily for the solo piano. He gained and has maintained renown worldwide as one of the leading musicians of his era, whose "poetic genius was based on a professional technique that was without equal in his generation." Chopin was born in what was then the Duchy of Warsaw, and grew up in Warsaw, which after 1815 became part of Congress Poland. A child prodigy, he completed his musical education and composed his earlier works in Warsaw before leaving Poland at the age of 20, less than a month before the outbreak of the November 1830 Uprising."""],
    ["""The most populated city in the United States is which city?""" """context: New York—often called New York City or the City of New York to distinguish it from the State of New York, of which it is a part—is the most populous city in the United States and the center of the New York metropolitan area, the premier gateway for legal immigration to the United States and one of the most populous urban agglomerations in the world. A global power city, New York exerts a significant impact upon commerce, finance, media, art, fashion, research, technology, education, and entertainment, its fast pace defining the term New York minute. Home to the headquarters of the United Nations, New York is an important center for international diplomacy and has been described as the cultural and financial capital of the world."""],
]
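The examples above rely on Python's implicit concatenation of adjacent string literals, so each question is glued directly onto its `context:` marker with no separating space (visible as `Group?context:` in the output further down). A minimal sketch of an explicit builder follows. `build_t5_input` is a hypothetical helper, not part of Spark NLP, and the single space before `context:` is an assumption that may change the model output slightly.

```python
def build_t5_input(question: str, context: str) -> str:
    """Join a question and its context into one T5 input string.

    Hypothetical helper: the " context: " separator is an assumption.
    """
    return f"{question.strip()} context: {context.strip()}"


pairs = [
    ("Who is founder of Alibaba Group?",
     "Alibaba Group founder Jack Ma has made his first appearance since "
     "Chinese regulators cracked down on his business empire."),
]

# One single-element row per example, matching the shape of question_context_list.
rows = [[build_t5_input(q, c)] for q, c in pairs]
print(rows[0][0][:60])
```

Under these assumptions, `rows` could be passed to `spark.createDataFrame(rows).toDF('text')` in place of `question_context_list`.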

In [5]:
for model_name in model_list:
  document_assembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("documents")

  # Load the pretrained T5 model by name; "question:" below is the task prefix
  t5 = T5Transformer.pretrained(model_name) \
      .setTask("question:")\
      .setMaxOutputLength(200)\
      .setInputCols(["documents"]) \
      .setOutputCol("answers")

  t5_pp = Pipeline(stages=[ document_assembler, 
                            t5])

  df = spark.createDataFrame(question_context_list).toDF('text')
  model = t5_pp.fit(df).transform(df)
  
  print(f"MODEL NAME : {model_name}")
  model.select(['text','answers.result']).show(truncate=50)


t5_base download started this may take some time.
Approximate size to download 451.8 MB
[OK!]
MODEL NAME : t5_base
+--------------------------------------------------+------------------+
|                                              text|            result|
+--------------------------------------------------+------------------+
|What does increased oxygen concentrations in th...| [carbon monoxide]|
|What category of game is Legend of Zelda: Twili...|[action-adventure]|
|Who is founder of Alibaba Group?context: Alibab...|         [Jack Ma]|
|For what instrument did Frédéric write primaril...|      [solo piano]|
|The most populated city in the United States is...|        [New York]|
+--------------------------------------------------+------------------+

t5_small download started this may take some time.
Approximate size to download 141.1 MB
[OK!]
MODEL NAME : t5_small
+--------------------------------------------------+------------------+
|                                              

# **`albert_qa_xxlarge_tweetqa`**

In [4]:
documentAssembler = MultiDocumentAssembler() \
        .setInputCols(["question", "context"]) \
        .setOutputCols(["document_question", "document_context"])

spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_tweetqa","en") \
  .setInputCols(["document_question", "document_context"]) \
  .setOutputCol("answer")

pipeline = Pipeline(stages=[documentAssembler, 
                            spanClassifier])


albert_qa_xxlarge_tweetqa download started this may take some time.
Approximate size to download 735.8 MB
[OK!]


In [7]:
sample_texts = [
    ["When was the first nutrition experiment performed?", """Sometimes overlooked during his life, James Lind, a physician in the British navy, performed the first scientific nutrition experiment in 1747. Lind discovered that lime juice saved sailors that had been at sea for years from scurvy, a deadly and painful bleeding disorder. Between 1500 and 1800, an estimated two million sailors had died of scurvy. The discovery was ignored for forty years, after which British sailors became known as "limeys." The essential vitamin C within citrus fruits would not be identified by scientists until 1932."""],
    ["What is the social style of hunter-gather societies?", """Hunter-gatherers tend to have an egalitarian social ethos, although settled hunter-gatherers (for example, those inhabiting the Northwest Coast of North America) are an exception to this rule. Nearly all African hunter-gatherers are egalitarian, with women roughly as influential and powerful as men."""],
    ["Who is responsible for defence and foreign affairs?", """Defence and foreign affairs are carried out by the United Kingdom, which also retains responsibility to ensure good government. It must approve any changes to the Constitution of Bermuda."""],
    ["What is my name?", "My name is Clara and I live in Berkeley."],
    ["What type of flight decks are aircraft carriers equipped with?", """An aircraft carrier is a warship that serves as a seagoing airbase, equipped with a full-length flight deck and facilities for carrying, arming, deploying, and recovering aircraft."""],
]

In [8]:
df = spark.createDataFrame(sample_texts).toDF("question", "context")

model = pipeline.fit(df).transform(df)

In [9]:
model.select("question", "answer.result").show(truncate=False)

+--------------------------------------------------------------+------------------+
|question                                                      |result            |
+--------------------------------------------------------------+------------------+
|When was the first nutrition experiment performed?            |[1747]            |
|What is the social style of hunter-gather societies?          |[egalitarian]     |
|Who is responsible for defence and foreign affairs?           |[theUnitedKingdom]|
|What is my name?                                              |[Clara]           |
|What type of flight decks are aircraft carriers equipped with?|[full-length]     |
+--------------------------------------------------------------+------------------+
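The third answer comes back as `theUnitedKingdom`, with the spaces of the original span lost. Because extractive QA answers are spans of the context, one possible post-processing step is to look the answer up in the context while ignoring whitespace and return the original substring. The sketch below is plain Python. `recover_span` is a hypothetical helper, not a Spark NLP API, and it assumes the answer's non-whitespace characters appear contiguously in the context.

```python
def recover_span(answer: str, context: str) -> str:
    """Return the context substring whose non-space characters match `answer`.

    Hypothetical helper: falls back to `answer` unchanged when no match is found.
    """
    target = "".join(answer.split())
    if not target:
        return answer
    # Indices of every non-whitespace character in the context.
    positions = [i for i, ch in enumerate(context) if not ch.isspace()]
    squeezed = "".join(context[i] for i in positions)
    start = squeezed.find(target)
    if start == -1:
        return answer
    first = positions[start]
    last = positions[start + len(target) - 1]
    return context[first:last + 1]


context = ("Defence and foreign affairs are carried out by the United Kingdom, "
           "which also retains responsibility to ensure good government.")
print(recover_span("theUnitedKingdom", context))
```

The same lookup also restores answers where the tokenizer inserted spaces, since both sides are compared with whitespace removed.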



# **`bert_qa_callmenicky_finetuned_squad`**

In [10]:
spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_callmenicky_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setMaxSentenceLength(50)\
    .setCaseSensitive(False)

pipeline = Pipeline(stages=[documentAssembler, 
                            spanClassifier])

bert_qa_callmenicky_finetuned_squad download started this may take some time.
Approximate size to download 385.6 MB
[OK!]


In [7]:
sample_texts = [
    ["When was the first nutrition experiment performed?", """Sometimes overlooked during his life, James Lind, a physician in the British navy, performed the first scientific nutrition experiment in 1747. Lind discovered that lime juice saved sailors that had been at sea for years from scurvy, a deadly and painful bleeding disorder. Between 1500 and 1800, an estimated two million sailors had died of scurvy. The discovery was ignored for forty years, after which British sailors became known as "limeys." The essential vitamin C within citrus fruits would not be identified by scientists until 1932."""],
    ["What is regarded as the greatest literary work in Old English?", """The first example is taken from the opening lines of the folk-epic Beowulf, a poem of some 3,000 lines and the single greatest work of Old English. This passage describes how Hrothgar's legendary ancestor Scyld was found as a baby, washed ashore, and adopted by a noble family. The translation is literal and represents the original poetic word order. As such, it is not typical of Old English prose. The modern cognates of original words have been used whenever practical to give a close approximation of the feel of the original poem."""],
    ["What is the social style of hunter-gather societies?", """Hunter-gatherers tend to have an egalitarian social ethos, although settled hunter-gatherers (for example, those inhabiting the Northwest Coast of North America) are an exception to this rule. Nearly all African hunter-gatherers are egalitarian, with women roughly as influential and powerful as men."""],
    ["Through what part of the body are nutrients transported to feed cells?", """Animal tissue consists of elements and compounds ingested, digested, absorbed, and circulated through the bloodstream to feed the cells of the body. Except in the unborn fetus, the digestive system is the first system involved[vague]. Digestive juices break chemical bonds in ingested molecules, and modify their conformations and energy states. Though some molecules are absorbed into the bloodstream unchanged, digestive processes release them from the matrix of foods. Unabsorbed matter, along with some waste products of metabolism, is eliminated from the body in the feces."""],
    ["What type of flight decks are aircraft carriers equipped with?", """An aircraft carrier is a warship that serves as a seagoing airbase, equipped with a full-length flight deck and facilities for carrying, arming, deploying, and recovering aircraft."""]    
  ]


df = spark.createDataFrame(sample_texts).toDF("question", "context")


In [12]:
model = pipeline.fit(df).transform(df)

In [13]:
model.select("question", "answer.result").show(truncate=False)


+----------------------------------------------------------------------+---------------+
|question                                                              |result         |
+----------------------------------------------------------------------+---------------+
|When was the first nutrition experiment performed?                    |[1747]         |
|What is regarded as the greatest literary work in Old English?        |[Beowulf]      |
|What is the social style of hunter-gather societies?                  |[egalitarian]  |
|Through what part of the body are nutrients transported to feed cells?|[bloodstream]  |
|What type of flight decks are aircraft carriers equipped with?        |[full - length]|
+----------------------------------------------------------------------+---------------+



# **`deberta_v3_xsmall_qa_squad2`**

In [14]:
spanClassifier = DeBertaForQuestionAnswering.pretrained("deberta_v3_xsmall_qa_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)
    
pipeline = Pipeline().setStages([documentAssembler,
                                 spanClassifier])


deberta_v3_xsmall_qa_squad2 download started this may take some time.
Approximate size to download 240.6 MB
[OK!]


In [15]:
model = pipeline.fit(df).transform(df)

In [16]:
model.select("question", "answer.result").show(truncate=100)

+----------------------------------------------------------------------+-----------------+
|                                                              question|           result|
+----------------------------------------------------------------------+-----------------+
|                    When was the first nutrition experiment performed?|           [1747]|
|        What is regarded as the greatest literary work in Old English?|        [Beowulf]|
|                  What is the social style of hunter-gather societies?|    [egalitarian]|
|Through what part of the body are nutrients transported to feed cells?|[the bloodstream]|
|        What type of flight decks are aircraft carriers equipped with?|    [full-length]|
+----------------------------------------------------------------------+-----------------+
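Here DeBERTa returns `the bloodstream` where the BERT model returned `bloodstream`. To compare answers across models without counting such differences as disagreements, a SQuAD-style normalization (lowercase, strip punctuation, drop articles, collapse whitespace) can be applied first. This is a sketch in plain Python. `normalize_answer` and `answers_agree` are hypothetical helper names and are not part of Spark NLP.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answers_agree(left: str, right: str) -> bool:
    """True when two answers are identical after normalization."""
    return normalize_answer(left) == normalize_answer(right)


print(answers_agree("the bloodstream", "bloodstream"))
print(answers_agree("egalitarian social ethos", "egalitarian"))
```

Answers that differ in length, such as `egalitarian social ethos` against `egalitarian`, still count as different under this exact-match comparison.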



# **`distilbert_base_cased_qa_squad2`**

In [17]:
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_base_cased_qa_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)
    
pipeline = Pipeline().setStages([documentAssembler,
                                 spanClassifier])


distilbert_base_cased_qa_squad2 download started this may take some time.
Approximate size to download 232.8 MB
[OK!]


In [18]:
model = pipeline.fit(df).transform(df)

In [19]:
model.select("question", "answer.result").show(truncate=100)

+----------------------------------------------------------------------+---------------+
|                                                              question|         result|
+----------------------------------------------------------------------+---------------+
|                    When was the first nutrition experiment performed?|         [1747]|
|        What is regarded as the greatest literary work in Old English?|      [Beowulf]|
|                  What is the social style of hunter-gather societies?|  [egalitarian]|
|Through what part of the body are nutrients transported to feed cells?|  [bloodstream]|
|        What type of flight decks are aircraft carriers equipped with?|[full - length]|
+----------------------------------------------------------------------+---------------+



# **`longformer_qa_large_4096_finetuned_triviaqa`**

In [20]:
spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_large_4096_finetuned_triviaqa","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)
    
pipeline = Pipeline().setStages([documentAssembler,
                                 spanClassifier])

longformer_qa_large_4096_finetuned_triviaqa download started this may take some time.
Approximate size to download 1.5 GB
[OK!]


In [8]:
sample_texts = [
          ["Who did the Kaiser Library's books previously belong to?","""The National Library of Nepal is located in Patan. It is the largest library in the country with more than 70,000 books. English, Nepali, Sanskrit, Hindi, and Nepal Bhasa books are found here. The library is in possession of rare scholarly books in Sanskrit and English dating from the 17th century AD. Kathmandu also contains the Kaiser Library, located in the Kaiser Mahal on the ground floor of the Ministry of Education building. This collection of around 45,000 books is derived from a personal collection of Kaiser Shamsher Jang Bahadur Rana. It covers a wide range of subjects including history, law, art, religion, and philosophy, as well as a Sanskrit manual of Tantra, which is believed to be over 1,000 years old. The 2015 earthquake caused severe damage to the Ministry of Education building, and the contents of the Kaiser Library have been temporarily relocated."""],
          ["Who was the penultimate king of Nepal?","The Tribhuvan Museum contains artifacts related to the King Tribhuvan (1906\u20131955). It has a variety of pieces including his personal belongings, letters and papers, memorabilia related to events he was involved in and a rare collection of photos and paintings of Royal family members. The Mahendra Museum is dedicated to king Mahendra of Nepal (1920\u20131972). Like the Tribhuvan Museum, it includes his personal belongings such as decorations, stamps, coins and personal notes and manuscripts, but it also has structural reconstructions of his cabinet room and office chamber. The Hanumandhoka Palace, a lavish medieval palace complex in the Durbar, contains three separate museums of historic importance. These museums include the Birendra museum, which contains items related to the second-last monarch, Birendra of Nepal."],
          ["When was the National Museum founded?","""The National Museum is located in the western part of Kathmandu, near the Swayambhunath stupa in an historical building. This building was constructed in the early 19th century by General Bhimsen Thapa. It is the most important museum in the country, housing an extensive collection of weapons, art and antiquities of historic and cultural importance. The museum was established in 1928 as a collection house of war trophies and weapons, and the initial name of this museum was Chhauni Silkhana, meaning \"the stone house of arms and ammunition\". Given its focus, the museum contains many weapons, including locally made firearms used in wars, leather cannons from the 18th\u201319th century, and medieval and modern works in wood, bronze, stone and paintings."""],
          ["What is Britain's busiest railway station in terms of passengers?", "There are 366 railway stations in the London Travelcard Zones on an extensive above-ground suburban railway network. South London, particularly, has a high concentration of railways as it has fewer Underground lines. Most rail lines terminate around the centre of London, running into eighteen terminal stations, with the exception of the Thameslink trains connecting Bedford in the north and Brighton in the south via Luton and Gatwick airports. London has Britain's busiest station by number of passengers \u2013 Waterloo, with over 184 million people using the interchange station complex (which includes Waterloo East station) each year. Clapham Junction is the busiest station in Europe by the number of trains passing."],
          ["What Arsenal manager replaced Mee?", "Terry Neill was recruited by the Arsenal board to replace Bertie Mee on 9 July 1976 and at the age of 34 he became the youngest Arsenal manager to date. With new signings like Malcolm Macdonald and Pat Jennings, and a crop of talent in the side such as Liam Brady and Frank Stapleton, the club enjoyed their best form since the 1971 double, reaching a trio of FA Cup finals (1978, 1979 and 1980), and losing the 1980 European Cup Winners' Cup Final on penalties. The club's only success during this time was a last-minute 3\u20132 victory over Manchester United in the 1979 FA Cup Final, widely regarded as a classic."]
            ]

df = spark.createDataFrame(sample_texts).toDF("question", "context")

In [22]:
model = pipeline.fit(df).transform(df)

In [23]:
model.select("question", "answer.result").show(truncate=100)

+-----------------------------------------------------------------+-----------------------------------+
|                                                         question|                             result|
+-----------------------------------------------------------------+-----------------------------------+
|         Who did the Kaiser Library's books previously belong to?|[Kaiser Shamsher Jang Bahadur Rana]|
|                           Who was the penultimate king of Nepal?|                         [Birendra]|
|                            When was the National Museum founded?|                             [1928]|
|What is Britain's busiest railway station in terms of passengers?|                         [Waterloo]|
|                               What Arsenal manager replaced Mee?|                      [Terry Neill]|
+-----------------------------------------------------------------+-----------------------------------+



# **`roberta_qa_roberta_base_squad2_covid`**

In [24]:
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_covid","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier])

roberta_qa_roberta_base_squad2_covid download started this may take some time.
Approximate size to download 442.8 MB
[OK!]


In [9]:
sample_texts = [
["What is the size of bovine coronavirus?", """We sequenced the first Bovine coronavirus (BCoV) complete genome sequence from France. This BCoV was directly sequenced from a fecal sample collected from a calf in Normandy in 2014. Bovine coronavirus (BCoV) belongs to the Nidovirales order, the Coronaviridae family, the Coronavirinae subfamily, and the Betacoronavirus. Its genome is a single-stranded, linear, and nonsegmented RNA of around 31 kb. BCoV is responsible for respiratory and enteric diseases in cattle, particularly during winter. To date, the 19 complete BCoV genome sequences available in GenBank databases (consulted on 17 January 2017) originated from the United States or Asia. Here, we report the first complete genome sequence of a BCoV detected in France."""],
["How many cysteine residues are contained in the first transmembrane domain of IFITM3?", """Recently, one of the interferon-induced transmembrane (IFITM) family proteins, IFITM3, has become an important target for the activity against influenza A (H1N1) virus infection. In this protein, a post-translational modification by fatty acids covalently attached to cysteine, termed S-palmitoylation, plays a crucial role for the antiviral activity. IFITM3 possesses three cysteine residues for the S-palmitoylation in the first transmembrane (TM1) domain and in the cytoplasmic (CP) loop. Because these cysteines are well conserved in the mammalian IFITM family proteins, the S-palmitoylation on these cysteines is significant for their functions. IFITM5 is another IFITM family protein and interacts with the FK506-binding protein 11 (FKBP11) to form a higher-order complex in osteoblast cells, which induces the expression of immunologically relevant genes."""],
["What kinds of viruses are Japanese encephalitis virus(JEV), tick-borne encephalitis virus(TBEV), eastern equine encephalitis virus (EEEV), sindbis virus(SV), and dengue virus(DV)?","""The developed ELISA-array proved to have similar specificity and higher sensitivity compared with the conventional ELISAs. This method was validated by different viral cultures and three chicken eggs inoculated with infected patient serum. The results demonstrated that the developed ELISA-array is sensitive and easy to use, which would have potential for clinical use. Japanese encephalitis virus(JEV), tick-borne encephalitis virus(TBEV), eastern equine encephalitis virus (EEEV), sindbis virus(SV), and dengue virus(DV) are arboviruses. Establishment of an accurate and easy method for detection of these viruses is essential for the prevention and treatment of associated infectious diseases. """],
["What method is useful in administering small molecules for systemic delivery to the body?", """In many studies, in vivo success has been demonstrated in delivering siRNAs to the lungs intranasally . An experimental setup of intranasal delivery by spray or droplet is simple and painless for the animal. Although the success in delivering siRNAs intranasally in rodents cannot be completely extrapolated to human use because of the significant differences in lung anatomy , this approach has potential for the clinical application of siRNAs. Phase II clinical trials have been initiated for the treatment of respiratory syncytial virus (RSV) infection, making use of intranasal application of naked chemically modified siRNA molecules that target viral gene products. Intranasal entry has long been used to administer small molecules, such as proteins, for systemic delivery. Because the nasal mucosa is highly vascularized, delivery of a thin epithelium of medication across the surface area can result in rapid absorption of the medication into the blood. Therefore, siRNAs administered intranasally might be deposited in the nose, and some of them may be unable to reach the lower respiratory tract. In fact, it has been reported that intranasal application of unformulated siRNAs resulted in lower delivery efficiency and homogeneous pulmonary distribution than that achieved with intratracheal application."""],
["When did the last Director General of the WHO resign?", """In this month's editorial, the PLOS Medicine Editors propose ideal qualities for the World Health Organization's next Director General, for whom the selection process is now underway. Response to the Ebola outbreak. Reformation of WHO to ready it to lead responses to future health emergencies is one area of active debate. Chan will step down from WHO on June 30, 2017 after more than a decade in the post. The process for choosing WHO's next leader has begun, promising to be protracted and rigorous as befits the importance of the role. Factoring in the many influential stakeholders in the process of appointing Chan's successor, however, transparency of the selection process may be one area unlikely to attract plaudits. Although too soon to speculate about the identity of WHO's next Director-General, it is worth reflecting on what qualities an incoming leader should bring to WHO and how that person might need to conceive changes in the structure and behavior of the organization against a landscape of important and evolving threats to the health of the fast-growing global population. """]
]

df = spark.createDataFrame(sample_texts).toDF("question", "context")

In [26]:
model = pipeline.fit(df).transform(df)

In [27]:
model.select("question", "answer.result").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                                                                           |result                                                                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------

# **`roberta_qa_CV_Merge_DS`**

In [28]:
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_CV_Merge_DS", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([documentAssembler,
                                 spanClassifier])


roberta_qa_CV_Merge_DS download started this may take some time.
Approximate size to download 442.6 MB
[OK!]


In [11]:
sample_texts = [
    ["When was the first nutrition experiment performed?", """Sometimes overlooked during his life, James Lind, a physician in the British navy, performed the first scientific nutrition experiment in 1747. Lind discovered that lime juice saved sailors that had been at sea for years from scurvy, a deadly and painful bleeding disorder. Between 1500 and 1800, an estimated two million sailors had died of scurvy. The discovery was ignored for forty years, after which British sailors became known as "limeys." The essential vitamin C within citrus fruits would not be identified by scientists until 1932."""],
    ["What is regarded as the greatest literary work in Old English?", """The first example is taken from the opening lines of the folk-epic Beowulf, a poem of some 3,000 lines and the single greatest work of Old English. This passage describes how Hrothgar's legendary ancestor Scyld was found as a baby, washed ashore, and adopted by a noble family. The translation is literal and represents the original poetic word order. As such, it is not typical of Old English prose. The modern cognates of original words have been used whenever practical to give a close approximation of the feel of the original poem."""],
    ["What is the social style of hunter-gather societies?", """Hunter-gatherers tend to have an egalitarian social ethos, although settled hunter-gatherers (for example, those inhabiting the Northwest Coast of North America) are an exception to this rule. Nearly all African hunter-gatherers are egalitarian, with women roughly as influential and powerful as men."""],
    ["Through what part of the body are nutrients transported to feed cells?", """Animal tissue consists of elements and compounds ingested, digested, absorbed, and circulated through the bloodstream to feed the cells of the body. Except in the unborn fetus, the digestive system is the first system involved[vague]. Digestive juices break chemical bonds in ingested molecules, and modify their conformations and energy states. Though some molecules are absorbed into the bloodstream unchanged, digestive processes release them from the matrix of foods. Unabsorbed matter, along with some waste products of metabolism, is eliminated from the body in the feces."""],
    ["What type of flight decks are aircraft carriers equipped with?", """An aircraft carrier is a warship that serves as a seagoing airbase, equipped with a full-length flight deck and facilities for carrying, arming, deploying, and recovering aircraft."""]    
]

df = spark.createDataFrame(sample_texts).toDF("question", "context")

In [30]:
model = pipeline.fit(df).transform(df)

In [31]:
model.select("question", "answer.result").show(truncate=False)

+----------------------------------------------------------------------+--------------------------+
|question                                                              |result                    |
+----------------------------------------------------------------------+--------------------------+
|When was the first nutrition experiment performed?                    |[1747]                    |
|What is regarded as the greatest literary work in Old English?        |[Beowulf]                 |
|What is the social style of hunter-gather societies?                  |[egalitarian social ethos]|
|Through what part of the body are nutrients transported to feed cells?|[bloodstream]             |
|What type of flight decks are aircraft carriers equipped with?        |[full - length]           |
+----------------------------------------------------------------------+--------------------------+



# **`xlm_roberta_base_qa_squad2`**

In [12]:
spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_base_qa_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)
    
pipeline = Pipeline().setStages([documentAssembler,
                                 spanClassifier])

xlm_roberta_base_qa_squad2 download started this may take some time.
Approximate size to download 834.5 MB
[OK!]


In [13]:
model = pipeline.fit(df).transform(df)

In [14]:
model.select("question", "answer.result").show(truncate=False)

+----------------------------------------------------------------------+-------------+
|question                                                              |result       |
+----------------------------------------------------------------------+-------------+
|When was the first nutrition experiment performed?                    |[1747]       |
|What is regarded as the greatest literary work in Old English?        |[Beowulf]    |
|What is the social style of hunter-gather societies?                  |[egalitarian]|
|Through what part of the body are nutrients transported to feed cells?|[bloodstream]|
|What type of flight decks are aircraft carriers equipped with?        |[full-length]|
+----------------------------------------------------------------------+-------------+

