![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/25.0.Biogpt_Chat_JSL.ipynb)



# **BioGPT - Chat JSL - Closed Book Question Answering**

This notebook explores two Biomedical Generative Pre-trained Transformer (BioGPT) models, `biogpt_chat_jsl` and `biogpt_chat_jsl_conversational`, for closed-book question answering, where the model answers from what it learned during training rather than from a supplied context passage. Both models are pre-trained on large biomedical text corpora and can generate coherent, relevant responses to biomedical questions.

📖 Learning Objectives:

- Learn how to use the BioGPT models in Spark NLP for closed book question answering tasks, including loading pre-trained models and configuring the pipeline.

- Understand the parameters and options available for the BioGPT models to customize the text generation process based on specific use cases.

# ⚒️ Setup and Import Libraries

📌 To run this notebook yourself, you need to upload your license keys. Run the cells below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
import textwrap

# 	📎🏥 `biogpt_chat_jsl`

This model is based on BioGPT, fine-tuned on medical conversations from clinical settings, and can answer clinical questions related to symptoms, drugs, tests, and diseases.

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")
    
gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(299)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3)\
    .setRandomSeed(42)
    
pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])

question = "What medications are commonly used to treat emphysema?"
TEXT = [f"question: {question} answer:"]
data = spark.createDataFrame([TEXT]).toDF("text")

result = pipeline.fit(data).transform(data)
result.show(truncate=False)

In [None]:
result.select("answer.result").show(truncate=False)

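The prompt above follows a fixed `question: <text> answer:` template. When preparing several questions, a small helper can keep that template in one place. This is a minimal sketch in plain Python; `build_prompt` is a hypothetical helper name and is not part of the johnsnowlabs library.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the 'question: <text> answer:' template used in this notebook."""
    cleaned = " ".join(question.split())  # collapse newlines and repeated spaces
    if not cleaned:
        raise ValueError("question must not be empty")
    return f"question: {cleaned} answer:"


questions = [
    "What medications are commonly used to treat emphysema?",
    "How can asthma   be treated?\n",
]
prompts = [build_prompt(q) for q in questions]

# One prompt per row, matching the single "text" column the pipeline expects
rows = [[p] for p in prompts]
for p in prompts:
    print(p)
```

The resulting `rows` can be passed to `spark.createDataFrame(rows).toDF("text")`, in the same way as the single-question cell above.
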
## **📍 LightPipeline**

In [None]:
gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(299)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3)\
    .setRandomSeed(42)
    
pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])

In [None]:
question = "What are the risk factors for developing heart disease?"
TEXT = [f"question: {question} answer:"]

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = nlp.LightPipeline(model)
light_result = light_model.annotate(TEXT)
answer_text = light_result[0]["answer"]

In [None]:
# Extract the text after 'answer:'
final_answer = answer_text[0][len(TEXT[0]) + 1:].strip()

# Format the text into paragraphs
wrapped_text = textwrap.fill(final_answer, width=120)

print("➤ Answer: \n{}".format(wrapped_text))
print("\n")

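The slice above assumes the generated text begins with an exact copy of the prompt. If the output ever differs in spacing or casing, the fixed offset cuts in the wrong place. A more defensive variant is sketched below; `extract_answer` is a hypothetical helper, and the fallback on the `answer:` marker is an assumption about the output format rather than documented behavior.

```python
def extract_answer(generated: str, prompt: str, marker: str = "answer:") -> str:
    """Return the generated text without the echoed prompt."""
    # Preferred path: the output starts with the exact prompt
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Fallback: keep everything after the first occurrence of the marker, if present
    _, found, tail = generated.partition(marker)
    return tail.strip() if found else generated.strip()


prompt = "question: How can asthma be treated? answer:"
print(extract_answer(prompt + " Inhaled corticosteroids.", prompt))
```

With the variables from the cells above, the call would look like `extract_answer(answer_text[0], TEXT[0])`.
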
## 🚩 `setMaxNewTokens`

- This parameter sets the maximum number of new tokens that the GPT model will generate for the output, constraining the length of the generated response and managing the computational cost.

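To see how a cap on new tokens interacts with stopping at an end-of-sequence token, the toy loop below mimics step-by-step generation in plain Python. It is an illustration only: `generate`, `toy_model`, and the `</s>` marker are made up for this sketch and do not describe how `TextGenerator` is implemented internally.

```python
from typing import Callable, List

EOS = "</s>"  # placeholder end-of-sequence token for this toy example
PROMPT = ["question:", "answer:"]

# Toy "model": replays a fixed answer, then keeps emitting EOS
CANNED = ["inhaled", "corticosteroids", "and", "bronchodilators", EOS]


def toy_model(context: List[str]) -> str:
    """Return the next canned token based on how many tokens were generated so far."""
    position = len(context) - len(PROMPT)
    return CANNED[min(position, len(CANNED) - 1)]


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int,
             stop_at_eos: bool = True) -> List[str]:
    """Append tokens one at a time until the budget is spent or EOS appears."""
    new_tokens: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt + new_tokens)
        if stop_at_eos and token == EOS:
            break  # the model signalled that the answer is complete
        new_tokens.append(token)
    return new_tokens


short = generate(toy_model, PROMPT, max_new_tokens=2)   # cut off by the budget
full = generate(toy_model, PROMPT, max_new_tokens=10)   # ends early at EOS
print(short)
print(full)
```

In this sketch a small budget truncates the answer, while a generous budget has no effect once the end-of-sequence token is reached.
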
The cell below runs the pipeline with `setMaxNewTokens(128)` and `setMaxNewTokens(299)` on the same question.

In [None]:
# Shared generator settings; setMaxNewTokens is set inside the loop below
gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models") \
    .setInputCols("documents") \
    .setOutputCol("answer") \
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3) \
    .setRandomSeed(42)

max_new_tokens_values = [128, 299]


# Sample question
question = "How can asthma be treated?"
TEXT = [f"question: {question} answer:"]

for max_new_tokens in max_new_tokens_values:
    print("Question:", question)
    print(f"Parameter: setMaxNewTokens({max_new_tokens})")
    gpt_qa.setMaxNewTokens(max_new_tokens)
    pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])

    light_model = nlp.LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
    answer_default = light_model.annotate(TEXT)

    # Drop the echoed prompt, then wrap the answer for display
    answer_text = answer_default[0]["answer"][0][len(TEXT[0]) + 1:].strip()
    wrapped_answer_text = textwrap.fill(answer_text, width=150)
    # Whitespace-separated words; this only approximates the model's token count
    word_count = len(answer_text.split())
    print("➤ Answer:")
    print(wrapped_answer_text)
    print(f"Number of words generated: {word_count}")
    print("-" * 40)  # Separator line

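When the token limit is reached before the model finishes, the answer can end mid-sentence. One possible post-processing step, sketched here as an assumption rather than something the notebook or the library does, is to trim the text back to the last sentence terminator. `trim_to_last_sentence` is a hypothetical helper.

```python
import re


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing sentence fragment; return the text unchanged if no terminator is found."""
    # A terminator is ., ! or ? followed by whitespace or the end of the string,
    # so decimal points such as "2.5" are not treated as sentence ends.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text.strip()
    return text[: matches[-1].end()].strip()


print(trim_to_last_sentence("Use an inhaler. Avoid triggers. In severe cas"))
```

Abbreviations followed by a space (for example "e.g. ") are also matched by this pattern, so the helper is a rough heuristic and not a sentence segmenter.
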

<b><h1><font color='darkred'>!!! ATTENTION !!! </font></h1></b>

<b>Before running the following cells, <font color='darkred'>RESTART the COLAB RUNTIME</font>, then start your session again and continue.</b>

# 	📎🏥 `biogpt_chat_jsl_conversational`

This model is based on BioGPT, fine-tuned on medical conversations from clinical settings, and can answer clinical questions related to symptoms, drugs, tests, and diseases. The difference from `biogpt_chat_jsl` is that this model produces shorter, more concise responses.

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")
    
gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl_conversational", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(399)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(1)\
    .setRandomSeed(42)
    
pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])


In [None]:
question = "What is the difference between melanoma and sarcoma?"
TEXT = [f"question: {question} answer:"]

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = nlp.LightPipeline(model)
light_result = light_model.annotate(TEXT)
answer_text = light_result[0]["answer"]


In [None]:
# Extract the text after 'answer:'
final_answer = answer_text[0][len(TEXT[0]) + 1:].strip()

# Format the text into paragraphs
wrapped_text = textwrap.fill(final_answer, width=120)

print("➤ Answer: \n{}".format(wrapped_text))
print("\n")