![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/25.0.Biogpt_Chat_JSL.ipynb)



# **BioGPT - Chat JSL - Closed Book Question Answering**

This notebook explores the Biomedical Generative Pre-trained Transformer (BioGPT) models `biogpt_chat_jsl`, `biogpt_chat_jsl_conversational`, and `biogpt_chat_jsl_conditions` for closed-book question answering. These models are pre-trained on large biomedical text corpora and can generate coherent and relevant responses to biomedical questions.

📖 Learning Objectives:

- Learn how to use the BioGPT models in Spark NLP for closed book question answering tasks, including loading pre-trained models and configuring the pipeline.

- Understand the parameters and options available for the BioGPT models to customize the text generation process based on specific use cases.

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# ⚒️ Setup and Import Libraries

📌 To run this yourself, you need to upload your license keys to the notebook. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:

from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
import textwrap

# 🔎 MODELS

<div align="center">

| **Index** | **Text Generation Models** |
|-----------|----------------------------|
| 1 | [biogpt_chat_jsl](https://nlp.johnsnowlabs.com/2023/04/12/biogpt_chat_jsl_en.html) |
| 2 | [biogpt_chat_jsl_conversational](https://nlp.johnsnowlabs.com/2023/04/18/biogpt_chat_jsl_conversational_en.html) |
| 3 | [biogpt_chat_jsl_conditions](https://nlp.johnsnowlabs.com/2023/05/11/biogpt_chat_jsl_conditions_en.html) |

</div>

# 	📎🏥 `biogpt_chat_jsl`

This model is based on BioGPT, fine-tuned on medical conversations from clinical settings, and can answer clinical questions related to symptoms, drugs, tests, and diseases.

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(299)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setCustomPrompt("question: {DOCUMENT} answer:")

pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

biogpt_chat_jsl download started this may take some time.
Approximate size to download 1.3 GB
[OK!]
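The `setCustomPrompt("question: {DOCUMENT} answer:")` call above defines a template in which `{DOCUMENT}` stands for the input text. The sketch below illustrates the presumed effect of that substitution in plain Python. `build_prompt` is a hypothetical helper written for this illustration; it is not the library's implementation.

```python
def build_prompt(template: str, document: str) -> str:
    """Fill the {DOCUMENT} placeholder of a prompt template with the input text.

    Hypothetical helper: it only mimics what the custom prompt is assumed to do.
    """
    if "{DOCUMENT}" not in template:
        raise ValueError("template must contain the {DOCUMENT} placeholder")
    return template.replace("{DOCUMENT}", document)


prompt = build_prompt(
    "question: {DOCUMENT} answer:",
    "What medications are commonly used to treat emphysema?",
)
print(prompt)
```

The generated text echoes this prompt before the answer, which is why the later cells split the output on `" answer: "`.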


In [None]:
TEXT = "What medications are commonly used to treat emphysema?"

data = spark.createDataFrame([[TEXT]]).toDF("text")

result = model.transform(data)

result.show(truncate=False)


In [None]:
result.select("answer.result").show(truncate=False)


## **📍 LightPipeline**

In [None]:
TEXT = "What are the risk factors for developing heart disease?"

light_model = nlp.LightPipeline(model)

light_result = light_model.annotate(TEXT)

result_text = light_result["answer"]

In [None]:
result_text[0].split(" answer: ")

['question: What are the risk factors for developing heart disease?',
 'Hello, There are several factors that are responsible for the development of heart disease. One of the most important is your cholesterol level. High cholesterol levels are responsible for the development of coronary artery disease. The other factors are blood pressure and smoking. The goal of treatment is to reduce the total and low - density lipoprotein ( LDL ) cholesterol levels. Statins are good cholesterol lowering medications. They help reduce the risk of coronary artery disease by preventing the formation of blood clots. The goal of blood pressure treatment is to reduce the average blood pressure. Regular exercises, weight loss, fruits, vegetables, fish once or twice a week, avoid smoking. The goal of the cholesterol treatment is to bring the LDL level to normal ( less than 100 mg / DL ). The goal of the smoking treatment is to bring the smoking cessation to a significant extent. You should also get your cho

In [None]:
print("➤ Question: \n{}".format(TEXT))
print("\n")

# Format the text into paragraphs
wrapped_text = textwrap.fill(result_text[0].split(" answer: ")[1], width=120)

print("➤ Answer: \n{}".format(wrapped_text))
print("\n")

➤ Question: 
What are the risk factors for developing heart disease?


➤ Answer: 
Hello, There are several factors that are responsible for the development of heart disease. One of the most important is
your cholesterol level. High cholesterol levels are responsible for the development of coronary artery disease. The
other factors are blood pressure and smoking. The goal of treatment is to reduce the total and low - density lipoprotein
( LDL ) cholesterol levels. Statins are good cholesterol lowering medications. They help reduce the risk of coronary
artery disease by preventing the formation of blood clots. The goal of blood pressure treatment is to reduce the average
blood pressure. Regular exercises, weight loss, fruits, vegetables, fish once or twice a week, avoid smoking. The goal
of the cholesterol treatment is to bring the LDL level to normal ( less than 100 mg / DL ). The goal of the smoking
treatment is to bring the smoking cessation to a significant extent. You should also ge
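The cells above recover the answer with `split(" answer: ")[1]`, which raises an `IndexError` when the marker is missing or written in a different case. A slightly more defensive variant could look like the sketch below. `extract_answer` is a hypothetical helper name, and the fallback of returning the whole text is a design choice made for this example.

```python
def extract_answer(generated: str, marker: str = "answer:") -> str:
    """Return the text after the first occurrence of `marker`, ignoring case.

    Falls back to the whole string when the marker is absent.
    """
    position = generated.lower().find(marker.lower())
    if position < 0:
        return generated.strip()
    return generated[position + len(marker):].strip()


print(extract_answer("question: What is asthma? answer: Hello, this is a reply."))
print(extract_answer("QUESTION: What is asthma? ANSWER: Hello, this is a reply."))
```

Note that a question which itself contains the word `answer:` would be cut at that earlier position, so this helper suits short prompts like the ones used here.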

## 🚩 `setMaxNewTokens`

- This parameter sets the maximum number of new tokens that the GPT model will generate for the output, constraining the length of the generated response and managing the computational cost.
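To make the effect of a token budget concrete, here is a toy generation loop in plain Python. It is a sketch under the assumption that `setMaxNewTokens` and `setStopAtEos` behave like the usual text-generation controls; the `generate` function, the `EOS` marker, and the canned "model" are all made up for illustration and do not reflect the library's internals.

```python
from typing import Callable, List

EOS = "</s>"  # hypothetical end-of-sequence marker for this toy example


def generate(next_token: Callable[[List[str]], str], prompt: List[str],
             max_new_tokens: int, stop_at_eos: bool = True) -> List[str]:
    """Append tokens one at a time until the budget is spent or EOS appears."""
    new_tokens: List[str] = []
    while len(new_tokens) < max_new_tokens:
        token = next_token(prompt + new_tokens)
        if stop_at_eos and token == EOS:
            break
        new_tokens.append(token)
    return new_tokens


# Toy "model": replays a fixed sequence, then emits EOS forever.
canned = "one two three four five".split()


def toy_model(context: List[str]) -> str:
    generated_so_far = len(context) - 1  # the prompt below has exactly one token
    return canned[generated_so_far] if generated_so_far < len(canned) else EOS


short = generate(toy_model, ["answer:"], max_new_tokens=3)
full = generate(toy_model, ["answer:"], max_new_tokens=50)
print(short)  # cut off by the budget
print(full)   # ends early because EOS was reached
```

A small budget truncates the output mid-answer, while a large budget only matters until the end-of-sequence token appears.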

Pipeline with `setMaxNewTokens(128)` and `setMaxNewTokens(299)`

In [None]:
# Base configuration; setMaxNewTokens is set inside the loop below
gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models") \
    .setInputCols("documents") \
    .setOutputCol("answer") \
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3) \
    .setRandomSeed(42)\
    .setCustomPrompt("QUESTION: {DOCUMENT} ANSWER:")


MaxNewTokens = [128, 299]


# Sample question
TEXT = "How can asthma be treated?"

for j in MaxNewTokens:
    print("Question:", TEXT)
    print("Parameters:")
    print(f"\nsetMaxNewTokens({j}):")

    gpt_qa.setMaxNewTokens(j)
    pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])

    light_model = nlp.LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
    result_text = light_model.annotate(TEXT)["answer"][0]

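    # The prompt above uses upper-case "ANSWER:", so locate the marker case-insensitively.
    marker = " answer: "
    start = result_text.lower().find(marker)
    answer_text = result_text[start + len(marker):] if start >= 0 else result_text
    wrapped_text = textwrap.fill(answer_text, width=120)
    # Whitespace-separated word count of the full output (prompt included), not model tokens.
    token_count = len(result_text.split())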

    print("➤ Answer:")
    print(wrapped_text)
    print(f"Number of tokens used: {token_count}")
    print("-" * 40)  # Separator line


biogpt_chat_jsl download started this may take some time.
Approximate size to download 1.3 GB
[OK!]
Question: How can asthma be treated?
Parameters:

setMaxNewTokens(128):
➤ Answer:
Hello, Asthma is itself an allergic disease due to cold or dust or pollen or grass etc. irrespective of the triggering
factor. You are not able to get rid from it without taking any medication. You are not able to get control from outside
as it is the only way. You can try the following measures: 1. Improve your air quality by avoiding fine particles (
dust, mite, pollen ). 2. Sugar cane may be the best food to feed your child. 3. Keep your house clean and warm. 4. Use
loose bins / grinders / double joiners / oil skim
Number of tokens used: 108
----------------------------------------
Question: How can asthma be treated?
Parameters:

setMaxNewTokens(299):
➤ Answer:
Hello, Asthma is itself an allergic disease due to cold or dust or pollen or grass etc. irrespective of the triggering
factor. You are not able 

<h1><b><font color='darkred'>!!! ATTENTION !!!</font></b></h1>

<b>Before running the following cells, <font color='darkred'>RESTART the COLAB RUNTIME</font>, then start your session again and continue.</b>

# 	📎🏥 `biogpt_chat_jsl_conversational`

This model is based on BioGPT, fine-tuned on medical conversations from clinical settings, and can answer clinical questions related to symptoms, drugs, tests, and diseases. Unlike `biogpt_chat_jsl`, this model produces more concise responses.

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl_conversational", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(399)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(1)\
    .setRandomSeed(42)\
    .setCustomPrompt("question: {DOCUMENT} answer:")

pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])


biogpt_chat_jsl_conversational download started this may take some time.
Approximate size to download 1.3 GB
[OK!]
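The cell above combines `setDoSample(False)` with `setTopK(1)`. Assuming these parameters follow the usual text-generation conventions (this notebook does not confirm how the annotator treats them), greedy decoding always picks the highest-scoring token, and sampling with `top_k=1` ends up doing the same. The toy function below illustrates that idea; `pick_next` and the scores are invented for this sketch.

```python
import random
from typing import Dict


def pick_next(scores: Dict[str, float], do_sample: bool, top_k: int,
              rng: random.Random) -> str:
    """Choose the next token greedily, or by sampling among the top_k candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not do_sample:
        return ranked[0]  # greedy: top_k plays no role in this toy version
    candidates = ranked[:top_k]
    weights = [scores[token] for token in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


scores = {"a": 0.6, "b": 0.3, "c": 0.1}  # made-up token probabilities
rng = random.Random(42)

print(pick_next(scores, do_sample=False, top_k=3, rng=rng))  # greedy choice
print(pick_next(scores, do_sample=True, top_k=1, rng=rng))   # only one candidate left
```

With `do_sample=True` and a larger `top_k`, the fixed seed keeps runs repeatable while still allowing lower-ranked tokens to be chosen.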


In [None]:
TEXT = "What is the difference between melanoma and sarcoma?"

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = nlp.LightPipeline(model)

light_result = light_model.annotate(TEXT)

result_text = light_result["answer"]

In [None]:
print("➤ Question: \n{}".format(TEXT))
print("\n")

# Format the text into paragraphs
wrapped_text = textwrap.fill(result_text[0].split(" answer: ")[1], width=120)

print("➤ Answer: \n{}".format(wrapped_text))
print("\n")

➤ Question: 
What is the difference between melanoma and sarcoma?


➤ Answer: 
Both are blood - borne cancers. Melanoma is a type of skin cancer that arises from melanocytes, the pigment - producing
cells in the skin. Sarcoma is a type of bone cancer that arises from bone. Both are blood - borne cancers and therefore
have very different treatment options.




# 	📎🏥 `biogpt_chat_jsl_conditions`

This model is based on BioGPT, fine-tuned on medical conversations from clinical settings, and can answer clinical questions related to symptoms, drugs, tests, and diseases. Unlike `biogpt_chat_jsl`, this model produces more concise responses.

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

gpt_qa = medical.TextGenerator().pretrained("biogpt_chat_jsl_conditions", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(399)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(1)\
    .setRandomSeed(42)\
    .setCustomPrompt("question: {DOCUMENT} answer:")

pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])


biogpt_chat_jsl_conditions download started this may take some time.
Approximate size to download 1.3 GB
[OK!]


In [None]:
TEXT = "What are the potential causes and risk factors for developing cardiovascular disease?"

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = nlp.LightPipeline(model)

light_result = light_model.annotate(TEXT)

result_text = light_result["answer"]

In [None]:
print("➤ Question: \n{}".format(TEXT))
print("\n")

# Format the text into paragraphs
wrapped_text = textwrap.fill(result_text[0].split(" answer: ")[1], width=120)

print("➤ Answer: \n{}".format(wrapped_text))
print("\n")

➤ Question: 
What are the potential causes and risk factors for developing cardiovascular disease?


➤ Answer: 
Cardiovascular disease ( CVD ) is a general term for conditions affecting the heart or blood vessels. It can be caused
by a variety of factors, including smoking, high blood pressure, diabetes, high cholesterol, and obesity. Certain
medical conditions, such as chronic kidney disease, can also increase the risk of developing CVD.
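The annotate, split, and wrap steps are repeated for every model in this notebook, so they could be folded into one small function. The sketch below assumes only what the cells above already rely on: an object with an `annotate(text)` method that returns a dict with an `"answer"` list. `ask` is a hypothetical helper, and `FakeLightModel` is a stand-in so the example runs without Spark.

```python
import textwrap


def ask(light_model, question: str, marker: str = " answer: ", width: int = 120) -> str:
    """Annotate one question and return the wrapped answer text."""
    generated = light_model.annotate(question)["answer"][0]
    # Keep everything after the first marker; fall back to the full text if absent.
    answer = generated.split(marker, 1)[-1]
    return textwrap.fill(answer, width=width)


class FakeLightModel:
    """Stand-in used only to exercise `ask`; not part of any library."""

    def annotate(self, text):
        return {"answer": [f"question: {text} answer: Example reply."]}


print(ask(FakeLightModel(), "How can asthma be treated?"))
```

In the notebook itself, the `light_model` objects created with `nlp.LightPipeline(model)` would be passed in place of the stand-in.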


