

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLINICAL_CLASSIFICATION.ipynb)




# **How to use Licensed Classification models in Spark NLP**

### Spark NLP documentation and instructions:
https://nlp.johnsnowlabs.com/docs/en/quickstart

### You can find details about Spark NLP annotators here:
https://nlp.johnsnowlabs.com/docs/en/annotators

### You can find details about Spark NLP models here:
https://nlp.johnsnowlabs.com/models


To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

Import license keys

In [1]:
import os
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

Saving spark_nlp_for_healthcare.json to spark_nlp_for_healthcare.json
SparkNLP Version: 3.0.1
SparkNLP-JSL Version: 3.0.0
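
The cells in this notebook read three entries from the uploaded file by name: `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET`. A minimal sketch of a check you could run before going further, assuming those three are the entries that matter here (the helper name `missing_license_keys` is hypothetical, and a real license file may contain more entries that the setup script relies on):

```python
# Keys that the cells in this notebook look up by name.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")

def missing_license_keys(keys):
    """Return the required keys that are absent or empty in the loaded dict."""
    return [k for k in REQUIRED_KEYS if not keys.get(k)]

# Hypothetical example values, not a real license file
example = {"PUBLIC_VERSION": "3.0.1", "JSL_VERSION": "3.0.0"}
print(missing_license_keys(example))  # ['SECRET']
```

In the notebook you would call `missing_license_keys(license_keys)` right after `json.load(f)` and stop if the returned list is not empty.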



Install dependencies

In [2]:
%%capture
for k,v in license_keys.items(): 
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh



Import dependencies into Python and start the Spark session

In [3]:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

spark = sparknlp_jsl.start(license_keys['SECRET'])

# manually start session
# params = {"spark.driver.memory" : "16G",
#           "spark.kryoserializer.buffer.max" : "2000M",
#           "spark.driver.maxResultSize" : "2000M"}

# spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

## 2. Usage Guidelines

1. **Selecting the correct Classification Model**

> a. To browse all the Classification models available in Spark NLP, go to https://nlp.johnsnowlabs.com/models

> b. Read through the model descriptions to select the desired model

> c. Some of the available models:
>> classifierdl_pico_biobert

>> classifierdl_ade_biobert
---
2. **Selecting the correct embeddings for the chosen model**

> a. Models are trained on specific embeddings, and the same embeddings should be used at inference time to get the best results.

> b. If the name of the model contains "**biobert**" (e.g., *ner_anatomy_biobert*), then the model was trained using "**biobert_pubmed_base_cased**" embeddings. Otherwise, "**embeddings_clinical**" was used to train that model.

> c. Using the correct embeddings

>> To use *embeddings_clinical* :

>>> word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

>> To use *BERT* embeddings:

>>> embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\
    .setInputCols(["document", 'token'])\
    .setOutputCol("word_embeddings")

> d. You can find the list of all embeddings at https://nlp.johnsnowlabs.com/models?tag=embeddings
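
The naming rule in guideline 2b can be written as a small helper. This is only a sketch of the rule as stated above, assuming it holds for the model you picked; the function name `embeddings_for` is hypothetical and is not part of Spark NLP, so check the model's page before relying on it.

```python
def embeddings_for(model_name):
    """Apply the naming rule from guideline 2b to a model name."""
    if "biobert" in model_name.lower():
        return "biobert_pubmed_base_cased"
    return "embeddings_clinical"

# Model names mentioned in this notebook
for name in ["classifierdl_pico_biobert", "classifierdl_ade_biobert"]:
    print(name, "->", embeddings_for(name))
```

The returned string is the name you would pass to `BertEmbeddings.pretrained(...)` or `WordEmbeddingsModel.pretrained(...)` as shown in guideline 2c.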


Create the pipeline

In [7]:
document_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

tokenizer = Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\
    .setInputCols(["document", 'token'])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
      .setInputCols(["document", "word_embeddings"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")
      # .setStorageRef('SentenceEmbeddings_5d018a59d7c3')

classifier = ClassifierDLModel.pretrained('classifierdl_pico_biobert', 'en', 'clinical/models')\
    .setInputCols(['document', 'token', 'sentence_embeddings']).setOutputCol('class')

nlp_pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    embeddings,
    sentence_embeddings, 
    classifier])

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_pico_biobert download started this may take some time.
Approximate size to download 22 MB
[OK!]


## 3. Create example inputs

In [8]:
# Enter examples as strings in this array
input_list = [
    """A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""",
]

## 4. Use the pipeline to create outputs

Full Pipeline (expects a Spark DataFrame)

In [12]:
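# Every stage is pretrained, so fitting on an empty DataFrame only assembles the pipeline model
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
# Run the full pipeline on the example inputs
df = spark.createDataFrame(pd.DataFrame({'text': input_list}))
result = pipeline_model.transform(df)
# Wrap the fitted model in a LightPipeline for inference on plain strings
lmodel = LightPipeline(pipeline_model)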

Light Pipeline (expects a list of strings)

In [13]:
lresult = lmodel.fullAnnotate(input_list)

## 5. Visualize results

Full Pipeline Results

In [14]:
result.select(F.explode(F.arrays_zip('class.result', 'document.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("class"),
        F.expr("cols['1']").alias("document")).show(truncate=False)

+------------+---------------------------------------------------------------------------------------------------------------------------+
|class       |document                                                                                                                   |
+------------+---------------------------------------------------------------------------------------------------------------------------+
|PARTICIPANTS|A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.|
+------------+---------------------------------------------------------------------------------------------------------------------------+



Light Pipeline Results

In [16]:
lresult[0]['class'][0].result

'PARTICIPANTS'
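
With more than one input string, you will usually want one label per text instead of indexing `lresult[0]` by hand. The sketch below assumes the shape used in the cell above: a list with one dict per input, where `'class'` holds a list of objects with a `.result` attribute. So that it runs without a Spark session, it uses a stand-in `FakeAnnotation` type and a hand-written `mock_lresult`; both are illustrative, and the helper name `labels_from_full_annotate` is hypothetical.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation: only the `.result` attribute is modelled.
FakeAnnotation = namedtuple("FakeAnnotation", ["result"])

def labels_from_full_annotate(lresult, column="class"):
    """Return the first predicted label per input, or None if the column is empty."""
    labels = []
    for row in lresult:
        annotations = row.get(column, [])
        labels.append(annotations[0].result if annotations else None)
    return labels

texts = ["A total of 10 adult daily smokers who reported at least one "
         "stressful event and coping episode and provided post-quit data."]
# Hand-written mock that mirrors the output shown above
mock_lresult = [{"class": [FakeAnnotation("PARTICIPANTS")]}]

for text, label in zip(texts, labels_from_full_annotate(mock_lresult)):
    print(label, "|", text)
```

In the notebook, pass `lresult` and `input_list` in place of `mock_lresult` and `texts`.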