![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/25.1.Medical_Text_Generation.ipynb)


# **Medical Text Generation**

`MedicalTextGenerator` uses the base BioGPT model to perform a range of tasks related to medical text abstraction. Given a prompt and a context, the annotator can be instructed to carry out a specific task, such as explaining why a patient may have a particular disease or paraphrasing the context more directly. It can also draft a clinical note for a cancer patient from a set of keywords, or continue a medical text from an introductory sentence. BioGPT is trained on large volumes of medical data, which allows it to identify and extract the most relevant information from the text provided.


Available models can be found at the [Models Hub](https://nlp.johnsnowlabs.com/models?annotator=MedicalTextGenerator).


## Colab Setup

📌 To run this yourself, you will need to upload your license keys to the notebook. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load the license data and start a session with all jars the user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# 🔎 MODELS

<div align="center">

| **Index** | **Text Generator Models**        |
|---------------|----------------------|
| 1        |  [text_generator_biomedical_biogpt_base](https://nlp.johnsnowlabs.com/2023/04/03/text_generator_biomedical_biogpt_base_en.html)     |
| 2      | [text_generator_generic_jsl_base](https://nlp.johnsnowlabs.com/2023/04/03/text_generator_generic_jsl_base_en.html)    |
| 3      | [text_generator_generic_flan_base](https://nlp.johnsnowlabs.com/2023/04/03/text_generator_generic_flan_base_en.html)    |
| 4      | [text_generator_generic_flan_t5_large](https://nlp.johnsnowlabs.com/2023/04/04/text_generator_generic_flan_t5_large_en.html)    |


</div>

## 📑  **text_generator_biomedical_biogpt_base**

This model is a BioGPT-based (LLM) text generation model that John Snow Labs fine-tuned on biomedical datasets (PubMed abstracts). Given a few tokens as an intro (at most 1024 input tokens), it can generate human-like, conceptually meaningful text of up to 1024 tokens.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("prompt")\
    .setOutputCol("document_prompt")

med_text_generator  = medical.TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")\
    .setInputCols("document_prompt")\
    .setOutputCol("answer")\
    .setMaxNewTokens(256)\
    .setDoSample(True)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, med_text_generator])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("prompt"))

text_generator_biomedical_biogpt_base download started this may take some time.
[OK!]
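
The generator above is configured with `setDoSample(True)`, `setTopK(3)` and `setRandomSeed(42)`. As a rough intuition for what these settings usually mean, the sketch below implements plain top-k sampling over a toy score table in pure Python. It is not the annotator's implementation: `sample_top_k` and `toy_logits` are hypothetical names, and the scores are made up for illustration.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample one token from the k highest-scoring entries of `logits`."""
    # Keep only the k most likely tokens (the role a top-k setting plays).
    top = sorted(logits, key=logits.get, reverse=True)[:k]
    # Softmax over the surviving scores, shifted by the max for numerical stability.
    peak = max(logits[t] for t in top)
    weights = [math.exp(logits[t] - peak) for t in top]
    # Draw one token in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]

# Toy next-token scores; the values are invented for this example.
toy_logits = {"a": 2.0, "the": 1.5, "an": 1.0, "caused": 0.2, "zebra": -3.0}

# A fixed seed makes the sequence of draws reproducible.
rng = random.Random(42)
draws = [sample_top_k(toy_logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With `k=3`, only the three highest-scoring tokens can ever be drawn, and `k=1` reduces to always picking the single best token. A small `k` therefore keeps the output conservative while still allowing some variation between runs that use different seeds.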


In [None]:
data = spark.createDataFrame([['Covid 19 is']]).toDF("prompt")

# Reuse the pipeline model fitted above instead of fitting it again
result = model.transform(data)

result.select("answer.result").show(truncate=False)

+--------------------------------------------------------------------------------+
|result                                                                          |
+--------------------------------------------------------------------------------+
|[Covid 19 is a pandemic that has affected the world &apos;s economy and health.]|
+--------------------------------------------------------------------------------+



### **📍 LightPipelines**

In [None]:
med_text_generator.setMaxNewTokens(128)

MedicalTextGenerator_9430e26a418f

In [None]:
text = ["SARS-CoV-2",
        "Asthma is a chronic respiratory disease characterized by"]

light_model = nlp.LightPipeline(model)
light_result = light_model.annotate(text)
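
`annotate` on a list of texts returns one dictionary per input, keyed by the pipeline's output column names, with a list of strings under each key. A minimal sketch of flattening that structure into prompt/answer records follows. The helper `pair_prompts_with_answers` is a hypothetical name, and `sample_result` is a stand-in for `light_result` whose values are copied from the output printed further down in this notebook.

```python
def pair_prompts_with_answers(annotations,
                              prompt_col="document_prompt",
                              answer_col="answer"):
    """Turn LightPipeline-style output into one flat record per input text."""
    records = []
    for annotation in annotations:
        records.append({
            # Each column holds a list of strings; join them into a single string.
            "prompt": " ".join(annotation.get(prompt_col, [])),
            "answer": " ".join(annotation.get(answer_col, [])),
        })
    return records

# Stand-in for `light_result` (assumed shape: list of dicts of string lists).
sample_result = [
    {"document_prompt": ["SARS-CoV-2"],
     "answer": ["SARS - CoV - 2 infection is a global health concern."]},
]

records = pair_prompts_with_answers(sample_result)
print(records[0]["prompt"], "->", records[0]["answer"])
```

Records in this shape are easy to pass to `pd.DataFrame(records)` if a tabular view is preferred.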

In [None]:
import textwrap

# Print each prompt next to its generated answer, wrapped at 120 characters
for i, annotation in enumerate(light_result, start=1):
    document_text = textwrap.fill(annotation['document_prompt'][0], width=120)
    answer_text = textwrap.fill(annotation['answer'][0], width=120)

    print("➤ Document {}: \n{}".format(i, document_text))
    print("\n")
    print("➤ Answer {}: \n{}".format(i, answer_text))
    print("\n")

➤ Document 1: 
SARS-CoV-2


➤ Answer 1: 
SARS - CoV - 2 infection is a global health concern.


➤ Document 2: 
Asthma is a chronic respiratory disease characterized by


➤ Answer 2: 
Asthma is a chronic respiratory disease characterized by inflammation in the airways, resulting primarily of the type 2
helper T cells ( Th2 ).
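
The generated answers in this notebook carry some tokenization residue, such as `&apos;s`, `SARS - CoV - 2` and `( Th2 )`. If cleaner text is needed downstream, a small post-processing step can help. The sketch below is one possible approach using only the standard library; `tidy_generated_text` is a hypothetical helper and not part of the library. The hyphen rule is an assumption that may also join a legitimate spaced dash, so it should be adapted to the data at hand.

```python
import html
import re

def tidy_generated_text(text: str) -> str:
    """Clean common tokenization artifacts from generated text."""
    # Decode HTML entities: "&apos;" -> "'"
    text = html.unescape(text)
    # Re-attach clitics split off by tokenization: "world 's" -> "world's"
    text = re.sub(r"\s+'(s|re|ve|ll|d|m|t)\b", r"'\1", text)
    # Join hyphenated tokens: "SARS - CoV - 2" -> "SARS-CoV-2"
    text = re.sub(r"(?<=\w) - (?=\w)", "-", text)
    # Remove padding inside parentheses: "( Th2 )" -> "(Th2)"
    text = re.sub(r"\(\s+", "(", text)
    text = re.sub(r"\s+\)", ")", text)
    return text

print(tidy_generated_text("SARS - CoV - 2 infection is a global health concern."))
```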


