

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/CLINICAL_CLASSIFICATION.ipynb)




# **How to use Licensed Classification models in Spark NLP**

### Spark NLP documentation and instructions:
https://nlp.johnsnowlabs.com/docs/en/quickstart

### You can find details about Spark NLP annotators here:
https://nlp.johnsnowlabs.com/docs/en/annotators

### You can find details about Spark NLP models here:
https://nlp.johnsnowlabs.com/models


To run this notebook yourself, you need to upload your license keys. Run the upload cell below, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print("Please Upload your John Snow Labs License using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.

jsl.install()

## 2. Start Session

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all jars the license grants access to
spark = jsl.start()

## 2. Usage Guidelines

1. **Selecting the correct Classification Model**

> a. To browse all the classification models available in Spark NLP, go to https://nlp.johnsnowlabs.com/models

> b. Read through the model descriptions to select the desired model

> c. Some of the available models:
>> classifierdl_pico_biobert

>> classifierdl_ade_biobert
---
2. **Selecting the correct embeddings for the chosen model**

> a. Each model is trained on specific embeddings, and the same embeddings should be used at inference time to get the best results

> b. If the name of the model contains "**biobert**" (e.g., *classifierdl_pico_biobert*), then the model was trained using "**biobert_pubmed_base_cased**" embeddings. Otherwise, "**embeddings_clinical**" was used to train that model.

> c. Using the correct embeddings

>> To use *embeddings_clinical*:

>>> word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

>> To use *BioBERT* embeddings:

>>> embeddings = nlp.BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
    .setInputCols(["document", 'token']) \
    .setOutputCol("word_embeddings")

> d. You can find the list of all embeddings at https://nlp.johnsnowlabs.com/models?tag=embeddings
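
The naming rule in step 2b can be written down as a small helper. This is only a sketch: `embeddings_for_model` is a hypothetical function, not part of Spark NLP, and it assumes the rule holds for every model name, which is worth confirming on the model's own page before relying on it.

```python
def embeddings_for_model(model_name: str) -> str:
    """Return the embeddings name suggested by the naming rule in step 2b.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    # Models with "biobert" in their name were trained on BioBERT embeddings.
    if "biobert" in model_name.lower():
        return "biobert_pubmed_base_cased"
    # Per the rule above, every other model is assumed to use clinical word embeddings.
    return "embeddings_clinical"


for name in ["classifierdl_pico_biobert", "classifierdl_ade_biobert"]:
    print(name, "->", embeddings_for_model(name))
```

The returned string is the name you would pass to the matching `pretrained(...)` call shown in step 2c.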


Create the pipeline

In [None]:
from pyspark.ml import Pipeline

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained('biobert_pubmed_base_cased')\
    .setInputCols(["document", 'token'])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
      .setInputCols(["document", "word_embeddings"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")
      # .setStorageRef('SentenceEmbeddings_5d018a59d7c3')

classifier = nlp.ClassifierDLModel.pretrained('classifierdl_pico_biobert', 'en', 'clinical/models')\
    .setInputCols(['document', 'token', 'sentence_embeddings'])\
    .setOutputCol('class')

nlp_pipeline = Pipeline(
    stages=[
        document_assembler, 
        tokenizer,
        embeddings,
        sentence_embeddings, 
        classifier
        ])

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_pico_biobert download started this may take some time.
Approximate size to download 22 MB
[OK!]


## 3. Create example inputs

In [None]:
# Enter examples as strings in this array
input_list = [
    """A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""",
]

## 4. Use the pipeline to create outputs

Full Pipeline (expects a Spark DataFrame)

In [None]:
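import pandas as pd
from sparknlp.base import LightPipeline

# Every stage is pretrained, so fitting on an empty DataFrame only assembles the PipelineModel
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)

# Run the full pipeline on the example inputs
df = spark.createDataFrame(pd.DataFrame({'text': input_list}))
result = pipeline_model.transform(df)

# Wrap the fitted model so it can annotate plain Python strings
lmodel = LightPipeline(pipeline_model)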

Light Pipeline (expects a list of strings)

In [None]:
lresult = lmodel.annotate(input_list)

## 5. Visualize results

Full Pipeline Results

In [None]:
import pyspark.sql.functions as F

result.select(F.explode(F.arrays_zip('class.result', 
                                     'document.result')).alias("cols")) \
      .select(F.expr("cols['0']").alias("class"),
              F.expr("cols['1']").alias("document")).show(truncate=False)

+------------+---------------------------------------------------------------------------------------------------------------------------+
|class       |document                                                                                                                   |
+------------+---------------------------------------------------------------------------------------------------------------------------+
|PARTICIPANTS|A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.|
+------------+---------------------------------------------------------------------------------------------------------------------------+
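
For readers less familiar with Spark SQL, the `arrays_zip` plus `explode` pattern above pairs the i-th class label with the i-th document text and emits one row per pair. The same idea can be pictured in plain Python. This is a sketch only: `explode_zipped` and the hand-written `row` dictionary are illustrative stand-ins, not Spark or Spark NLP API.

```python
def explode_zipped(row, left, right):
    """Pair the i-th item of row[left] with the i-th item of row[right].

    Returns one output row (a dict) per pair.
    Note: Python's zip stops at the shorter list, whereas Spark's
    arrays_zip pads the shorter array with nulls.
    """
    return [{left: a, right: b} for a, b in zip(row[left], row[right])]


# Hand-written stand-in for one row of `result`, holding only the `.result` arrays
row = {
    "class": ["PARTICIPANTS"],
    "document": [
        "A total of 10 adult daily smokers who reported at least one stressful "
        "event and coping episode and provided post-quit data."
    ],
}

rows = explode_zipped(row, "class", "document")
for r in rows:
    print(r["class"], "|", r["document"])
```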



Light Pipeline Results

In [None]:
lresult[0]['class'][0]

'PARTICIPANTS'
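
When `input_list` holds several texts, `annotate` returns one dictionary per text, so the indexing above can be generalized. A minimal sketch, assuming each dictionary maps the `class` column to a list of label strings as in the output above; `first_labels` is a hypothetical helper, and `sample_annotations` is a hand-written stand-in for `lresult` so the snippet runs without a Spark session.

```python
def first_labels(texts, annotations, column="class"):
    """Pair each input text with the first label found in `column`.

    Texts whose annotation has no label in `column` are paired with None.
    """
    pairs = []
    for text, annotation in zip(texts, annotations):
        labels = annotation.get(column, [])
        pairs.append((text, labels[0] if labels else None))
    return pairs


# Hand-written stand-ins for input_list and lmodel.annotate(input_list)
sample_texts = [
    "A total of 10 adult daily smokers who reported at least one stressful "
    "event and coping episode and provided post-quit data."
]
sample_annotations = [{"class": ["PARTICIPANTS"]}]

pairs = first_labels(sample_texts, sample_annotations)
for text, label in pairs:
    print(label, "|", text)
```

In the notebook, the same call would be `first_labels(input_list, lresult)`.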