![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/33.1.Medical_Text_Generation.ipynb)


# **Medical Text Generation**

`MedicalTextGenerator` uses the base BioGPT model to perform various tasks related to medical text abstraction. With this annotator, you provide a prompt and context and instruct the system to perform a specific task, such as explaining why a patient may have a particular disease or paraphrasing the context more directly. The annotator can also create a clinical note for a cancer patient from given keywords, or write medical text that continues from introductory sentences. The BioGPT model is trained on large volumes of medical data, which allows it to identify and extract the most relevant information from the provided text.


Available models can be found at the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=MedicalTextGenerator).


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Colab Setup

📌 To run this yourself, you will need to upload your license keys to the notebook. Just run the cell below to do that. You can also open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
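
The later cells read `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` from the loaded license file, so it can help to check for them before installing anything. The following is a minimal sketch, not part of the library: `missing_license_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, the example values are placeholders, and a real license file may contain additional keys.

```python
# Keys that the install and session-start cells in this notebook reference.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty (hypothetical helper)."""
    return [key for key in required if not license_keys.get(key)]

# Placeholder dictionary for illustration only; not a real license.
example_keys = {"SECRET": "placeholder", "PUBLIC_VERSION": "placeholder"}
print(missing_license_keys(example_keys))
```

In the notebook, calling `missing_license_keys(license_keys)` right after loading `spark_jsl.json` should return an empty list when all three keys are present.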

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.1  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import json
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

import sparknlp
import sparknlp_jsl

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


# 🔎 MODELS

<div align="center">

| **Index** | **Text Generator Models**        |
|---------------|----------------------|
| 1        |  [text_generator_biomedical_biogpt_base](https://nlp.johnsnowlabs.com/2023/04/03/text_generator_biomedical_biogpt_base_en.html)     |
| 2      | [text_generator_generic_jsl_base](https://nlp.johnsnowlabs.com/2023/04/03/text_generator_generic_jsl_base_en.html)    |
| 3      | [text_generator_generic_flan_base](https://nlp.johnsnowlabs.com/2023/04/03/text_generator_generic_flan_base_en.html)    |
| 4      | [text_generator_generic_flan_t5_large](https://nlp.johnsnowlabs.com/2023/04/04/text_generator_generic_flan_t5_large_en.html)    |


</div>

## 📑  **text_generator_biomedical_biogpt_base**

This model is a BioGPT-based (LLM) text generation model fine-tuned on biomedical datasets (PubMed abstracts) by John Snow Labs. Given a few tokens as an intro, it can generate human-like, conceptually meaningful text of up to 1024 tokens from an input text (max 1024 tokens).

In [4]:
document_assembler = DocumentAssembler()\
    .setInputCol("prompt")\
    .setOutputCol("document_prompt")

med_text_generator = MedicalTextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")\
    .setInputCols("document_prompt")\
    .setOutputCol("answer")\
    .setMaxNewTokens(256)\
    .setDoSample(True)\
    .setTopK(3)\
    .setRandomSeed(40)\
    .setStopAtEos(True)

pipeline = Pipeline(stages=[document_assembler, med_text_generator])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("prompt"))

text_generator_biomedical_biogpt_base download started this may take some time.
Approximate size to download 875.4 MB
[OK!]
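
The pipeline above enables sampling with `setDoSample(True)`, restricts candidates with `setTopK(3)`, and fixes the seed with `setRandomSeed(40)`. As a rough illustration of what top-k sampling means in general, here is a pure-Python sketch. It is not the annotator's implementation; `top_k_sample` and the toy scores are invented for this example.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving logits (subtract the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]

# Toy scores over a five-token vocabulary, for illustration only.
toy_logits = [2.0, 0.5, 1.5, -1.0, 3.0]
rng = random.Random(40)  # a fixed seed makes the draws repeatable
samples = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(10)]
print(samples)
```

With `k=3`, only token ids 4, 0, and 2 can ever be drawn, since they hold the three highest scores. A smaller `k` makes the output more conservative, and a fixed seed makes repeated runs produce the same sequence.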


In [5]:
data = spark.createDataFrame([['The patient is admitted to the clinic with a severe back pain']]).toDF("prompt")

# Reuse the pipeline model fitted above instead of fitting the pipeline again
result = model.transform(data)

result.select("answer.result").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------+
|[The patient is admitted to the clinic with a severe back pain and the pain has increased. ( 1 - 4 / 7 of a week &apos;s time.]|
+-------------------------------------------------------------------------------------------------------------------------------+
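
The generated text can contain HTML entities such as `&apos;` (and `&#93;` in the longer outputs further down). If plain text is needed downstream, one option is to decode them with Python's standard `html.unescape`. This is a sketch of an optional post-processing step, not something the annotator does itself; `clean_generated_text` is a hypothetical helper name.

```python
import html

def clean_generated_text(text):
    """Decode HTML entities left in generated text (hypothetical helper)."""
    return html.unescape(text)

# Fragment taken from the output shown above.
raw = "( 1 - 4 / 7 of a week &apos;s time."
print(clean_generated_text(raw))
```

The same function could be applied to each string collected from `answer.result` before display or storage.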



### **📍 LightPipelines**

In [6]:
text = ["COVID-19",
        "SARS-CoV-2",
        "Asthma is a chronic respiratory disease characterized by"]

light_model = LightPipeline(model)
light_result = light_model.annotate(text)

In [7]:
import textwrap

# Print each prompt with its generated answer, wrapped at 120 characters
for i, res in enumerate(light_result, start=1):
    document_text = textwrap.fill(res['document_prompt'][0], width=120)
    answer_text = textwrap.fill(res['answer'][0], width=120)

    print("➤ Document {}: \n{}".format(i, document_text))
    print("\n")
    print("➤ Answer {}: \n{}".format(i, answer_text))
    print("\n")

➤ Document 1: 
COVID-19


➤ Answer 1: 
COVID - 19 and diabetes: a review of current literature and implications. ( 1 - 4.


➤ Document 2: 
SARS-CoV-2


➤ Answer 2: 
SARS - CoV - 2 is the cause for COVID 19. ( ABSTRACT


➤ Document 3: 
Asthma is a chronic respiratory disease characterized by


➤ Answer 3: 
Asthma is a chronic respiratory disease characterized by reversible airflow limitation and bronchial hyperreactivity to
nonspecific and allergen stimuli, and it has a significant negative effect upon quality - adjusted survival in the
general adult and elderly population, as it increases mortality from respiratory causes and causes hospitalization and
disability in those who are affected by the condition, as it is from other chronic respiratory conditions such COPD. ( 1
- 4 &#93; The aim is the reduction in morbidity, disability, mortality, hospitalization and health service utilization.
( 5,6 - 9 &#93; The prevalence and severity are higher among people over the 65 year - olds than i