![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/33.Biogpt_Chat_JSL.ipynb)



# **BioGPT - Chat JSL - Closed Book Question Answering**

This notebook explores two Biomedical Generative Pre-trained Transformer (BioGPT) models, `biogpt_chat_jsl` and `biogpt_chat_jsl_conversational`, for closed-book question answering. Both models are pre-trained on large biomedical text corpora and can generate coherent, relevant responses to biomedical questions.

📖 Learning Objectives:

- Learn how to use the BioGPT models in Spark NLP for closed book question answering tasks, including loading pre-trained models and configuring the pipeline.

- Understand the parameters and options available for the BioGPT models to customize the text generation process based on specific use cases.

# ⚒️ Setup and Import Libraries

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd
pd.set_option('display.max_colwidth', 200)
import textwrap
import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 4.4.0
Spark NLP_JSL Version : 4.4.0


# 	📎🏥 `biogpt_chat_jsl`

This model is based on BioGPT, fine-tuned on medical conversations from clinical settings. It can answer clinical questions about symptoms, drugs, tests, and diseases.

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")
    
gpt_qa = MedicalTextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(299)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3)\
    .setRandomSeed(42)
    
pipeline = Pipeline().setStages([document_assembler, gpt_qa])

question = "What medications are commonly used to treat emphysema?"
TEXT = [f"question: {question} answer:"]
data = spark.createDataFrame([TEXT]).toDF("text")

result = pipeline.fit(data).transform(data)
result.show(truncate=False)

biogpt_chat_jsl download started this may take some time.
[OK!]

In [None]:
result.select("answer.result").show(truncate=False)


## **📍 LightPipeline**

In [None]:
gpt_qa = MedicalTextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(299)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3)\
    .setRandomSeed(42)
    
pipeline = Pipeline().setStages([document_assembler, gpt_qa])

biogpt_chat_jsl download started this may take some time.
[OK!]


In [None]:
question = "What are the risk factors for developing heart disease?"
TEXT = [f"question: {question} answer:"]

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)
light_result = light_model.annotate(TEXT)
answer_text = light_result[0]["answer"]

In [None]:
# Extract the text after 'answer:'
final_answer = answer_text[0][len(TEXT[0]) + 1:].strip()

# Format the text into paragraphs
wrapped_text = textwrap.fill(final_answer, width=120)

print("➤ Answer: \n{}".format(wrapped_text))
print("\n")

➤ Answer: 
Hello, There are several factors that are responsible for the development of heart disease. One of the most important is
your cholesterol level. High cholesterol levels are responsible for the development of coronary artery disease. The
other factors are blood pressure and smoking. The goal of treatment is to reduce the total and low - density lipoprotein
( LDL ) cholesterol levels. Statins are good cholesterol lowering medications. They help reduce the risk of coronary
artery disease by preventing the formation of blood clots. The goal of blood pressure treatment is to reduce the average
blood pressure. Regular exercises, weight loss, fruits, vegetables, fish once or twice a week, avoid smoking. The goal
of the cholesterol treatment is to bring the LDL level to normal ( less than 100 mg / DL ). The goal of the smoking
treatment is to bring the smoking cessation to a significant extent. You should also get your cholesterol and LDL levels
checked once a year to monitor the pr
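
The slice `answer_text[0][len(TEXT[0]) + 1:]` used above relies on the model echoing the prompt verbatim at the start of its output. A slightly more defensive variant is sketched below. The helper names (`build_prompt`, `extract_answer`, `ask_all`) and the `fake_annotate` stub are hypothetical and exist only for illustration. The only assumption about Spark NLP is the shape already shown in this notebook: `annotate` takes a list of strings and returns a list of dicts whose `answer` entry is a list of strings.

```python
from typing import Callable, Dict, List

PROMPT_TEMPLATE = "question: {question} answer:"


def build_prompt(question: str) -> str:
    """Wrap a question in the 'question: ... answer:' format used in this notebook."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def extract_answer(generated: str, prompt: str) -> str:
    """Return only the text that follows the prompt in the generated string."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Fallback: the prompt was not echoed verbatim, so split on the first marker.
    _, marker, tail = generated.partition("answer:")
    return tail.strip() if marker else generated.strip()


def ask_all(
    annotate: Callable[[List[str]], List[dict]],
    questions: List[str],
    output_col: str = "answer",
) -> Dict[str, str]:
    """Send several questions through an annotate-style callable and clean the answers."""
    prompts = [build_prompt(q) for q in questions]
    results = annotate(prompts)
    return {
        q: extract_answer(r[output_col][0], p)
        for q, p, r in zip(questions, prompts, results)
    }


# Stand-in for light_model.annotate so the sketch runs without Spark (hypothetical).
def fake_annotate(prompts: List[str]) -> List[dict]:
    return [{"answer": [p + " stub reply"]} for p in prompts]


print(ask_all(fake_annotate, ["What is asthma?"]))
```

In the notebook itself, `light_model.annotate` would take the place of `fake_annotate`.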

## 🚩 `setMaxNewTokens`

- This parameter sets the maximum number of new tokens that the GPT model will generate for the output, constraining the length of the generated response and managing the computational cost.
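
As a rough illustration of how a cap on new tokens interacts with an end-of-sequence stop, here is a toy decoding loop. It is not Spark NLP's implementation: the `generate` function, the `EOS` placeholder, and the scripted token list are all hypothetical stand-ins for a real model. It shows that a small cap can cut an answer off mid-sentence, and that when each step is deterministic the shorter output is a prefix of the longer one.

```python
from typing import List, Sequence

EOS = "</s>"  # placeholder end-of-sequence marker (assumption for this toy)


def generate(
    scripted_tokens: Sequence[str],
    max_new_tokens: int,
    stop_at_eos: bool = True,
) -> List[str]:
    """Emit tokens from a scripted 'model' until a stop condition is met."""
    output: List[str] = []
    for token in scripted_tokens:
        if len(output) >= max_new_tokens:
            break  # length cap reached; the answer may end mid-sentence
        if stop_at_eos and token == EOS:
            break  # the model signalled that the answer is complete
        output.append(token)
    return output


# A fixed token script standing in for deterministic (non-sampled) decoding.
script = ["Asthma", "is", "treated", "with", "inhalers", ".", EOS, "extra", "tokens"]

short = generate(script, max_new_tokens=3)
full = generate(script, max_new_tokens=50)

print(short)  # ['Asthma', 'is', 'treated']
print(full)   # ['Asthma', 'is', 'treated', 'with', 'inhalers', '.']
print(full[: len(short)] == short)  # True
```

With `stop_at_eos=False`, the toy loop keeps emitting past the end marker until the cap is hit, which is why the cap and the end-of-sequence stop are usually set together.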

Pipeline with `setMaxNewTokens(128)` and `setMaxNewTokens(299)`

In [None]:
# Shared parameters; setMaxNewTokens is set per iteration in the loop below
gpt_qa = MedicalTextGenerator().pretrained("biogpt_chat_jsl", "en", "clinical/models") \
    .setInputCols("documents") \
    .setOutputCol("answer") \
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(3) \
    .setRandomSeed(42)

MaxNewTokens = [128, 299]


# Sample question
question = "How can asthma be treated?"
TEXT = [f"question: {question} answer:"]

for j in MaxNewTokens:
    print("Question:", question)
    print("Parameters:") 
    print(f"\nsetMaxNewTokens({j}):")
    gpt_qa.setMaxNewTokens(j)
    pipeline = Pipeline().setStages([document_assembler, gpt_qa])

    light_model = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
    answer_default = light_model.annotate(TEXT)
    
    answer_text = answer_default[0]["answer"][0][len(TEXT[0]) + 1:].strip()
    wrapped_answer_text = textwrap.fill(answer_text, width=150)
    # Whitespace-separated word count: only a rough proxy for the number of model tokens
    token_count = len(answer_text.split())
    print("➤ Answer:")
    print(wrapped_answer_text)
    print(f"Number of tokens used: {token_count}")
    print("-" * 40)  # Separator line


biogpt_chat_jsl download started this may take some time.
[OK!]
Question: How can asthma be treated?
Parameters:

setMaxNewTokens(128):
➤ Answer:
Hello, Asthma is itself an allergic disease due to cold or dust or pollen or grass etc. irrespective of the triggering factor. You are not able to get
rid from it without taking any medication. You are not able to get control from outside as it is the only way. You can try the following measures: 1.
Improve your air quality by avoiding fine particles ( dust, mite, pollen ). 2. Sugar cane may be the best food to feed your child. 3. Keep your house
clean and warm. 4. Use loose bins / grinders / double joiners / oil skim
Number of tokens used: 101
----------------------------------------
Question: How can asthma be treated?
Parameters:

setMaxNewTokens(299):
➤ Answer:
Hello, Asthma is itself an allergic disease due to cold or dust or pollen or grass etc. irrespective of the triggering factor. You are not able to get
rid from it without taking an

<b><h1><font color='darkred'>!!! ATTENTION !!! </font></h1></b>

<b>Before running the following cells, <font color='darkred'>RESTART the COLAB RUNTIME</font>, then start your session and continue.</b>

# 	📎🏥 `biogpt_chat_jsl_conversational`

This model is based on BioGPT, fine-tuned on medical conversations from clinical settings. It can answer clinical questions about symptoms, drugs, tests, and diseases. Unlike `biogpt_chat_jsl`, this model produces shorter, more concise responses.

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")
    
gpt_qa = MedicalTextGenerator().pretrained("biogpt_chat_jsl_conversational", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(399)\
    .setStopAtEos(True)\
    .setDoSample(False)\
    .setTopK(1)\
    .setRandomSeed(42)
    
pipeline = Pipeline().setStages([document_assembler, gpt_qa])


biogpt_chat_jsl_conversational download started this may take some time.
[OK!]


In [None]:
question = "What is the difference between melanoma and sarcoma?"
TEXT = [f"question: {question} answer:"]

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)
light_result = light_model.annotate(TEXT)
answer_text = light_result[0]["answer"]


In [None]:
# Extract the text after 'answer:'
final_answer = answer_text[0][len(TEXT[0]) + 1:].strip()

# Format the text into paragraphs
wrapped_text = textwrap.fill(final_answer, width=120)

print("➤ Answer: \n{}".format(wrapped_text))
print("\n")

➤ Answer: 
Both are blood - borne cancers. Melanoma is a type of skin cancer that arises from melanocytes, the pigment - producing
cells in the skin. Sarcoma is a type of bone cancer that arises from bone. Both are blood - borne cancers and therefore
have very different treatment options.


