![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/NER_DRUGS_DEVELOPMENT_TRIALS.ipynb)

# **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs

In [None]:
from google.colab import files
print("Please upload your John Snow Labs license using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.

jsl.install()

# **Start Session**

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all jars the user has access to
spark = jsl.start()

In [2]:
spark

# **🔎 About the model**

### 📌 **bert_token_classifier_drug_development_trials**

This is a BertForTokenClassification NER model that identifies concepts related to drug development in free text, including **Trial Groups, End Points, Hazard Ratio, and other** entities.

# **🔎Define Spark NLP pipeline**

In [3]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") 

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

tokenClassifier = medical.BertForTokenClassifier.pretrained( "bert_token_classifier_drug_development_trials", "en", 'clinical/models')\
    .setInputCols("sentence","token")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline =  Pipeline(stages=[documentAssembler,
                             sentenceDetector, 
                             tokenizer, 
                             tokenClassifier, 
                             ner_converter])

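# No stage here needs training data, so fitting on an empty DataFrame simply assembles the PipelineModel
pipelineModel = pipeline.fit(spark.createDataFrame([['']]).toDF("text"))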


sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
bert_token_classifier_drug_development_trials download started this may take some time.
[OK!]
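
The last stage of the pipeline, `NerConverter`, turns the per-token tags in the `ner` column into multi-token chunks in `ner_chunk`. The following is a minimal pure-Python sketch of that merging idea, not the library's implementation. The `iob_to_chunks` helper is hypothetical, and the `B-`/`I-`/`O` tags shown are an assumption about how the token-level labels look; the real converter also works from character offsets rather than joining tokens with spaces.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs (illustrative only)."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if there is one
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        if tag.startswith("I-") and label == current_label:
            # Continuation of the chunk that is already open
            current_tokens.append(token)
        else:
            # "O", "B-", or an "I-" with a different label closes the open chunk
            flush()
            current_tokens = [token] if label else []
            current_label = label
    flush()
    return chunks


# Hypothetical token-level tags for a fragment of the sample sentence
tokens = ["lenalidomide", "+", "rituximab", "and", "rituximab", "+", "placebo"]
tags = ["B-Trial_Group", "I-Trial_Group", "I-Trial_Group", "O",
        "B-Trial_Group", "I-Trial_Group", "I-Trial_Group"]

chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

Under these assumed tags, the seven tokens collapse into two `Trial_Group` chunks separated by the untagged "and".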


# **🔎Sample Text**

In [4]:
sample_text = "In July 2018, final pfs analysis results were reported from 358 patients with relapsed/refractory follicular and marginal zone lymphoma showing that the trial met its primary endpoint, with a significant improvement in pfs compared with rituximab + placebo.   At that time, follow-up was to continue for the mature os results. The median pfs per irc assessment (primary endpoint) with lenalidomide + rituximab and rituximab + placebo was 39.4 and 14.1months, respectively "

data = spark.createDataFrame([[sample_text]]).toDF('text')

In [5]:
result = pipelineModel.transform(data)

# **🔎Run the pipeline**

In [6]:
import pandas as pd

# fullAnnotate returns a list with one dict per input text, keyed by output column name
light_result = LightPipeline(pipelineModel).fullAnnotate(sample_text)

# Build one row per detected chunk: text, character offsets, sentence index, entity label
rows = [{'chunks': n.result,
         'begin': n.begin,
         'end': n.end,
         'sentence_id': n.metadata['sentence'],
         'entities': n.metadata['entity']}
        for n in light_result[0]['ner_chunk']]

df = pd.DataFrame(rows, columns=['chunks', 'begin', 'end', 'sentence_id', 'entities'])

df.head(20)

|   | chunks | begin | end | sentence_id | entities |
|---|--------|-------|-----|-------------|----------|
| 0 | July 2018 | 3 | 11 | 0 | DATE |
| 1 | 358 patients | 60 | 71 | 0 | Patient_Count |
| 2 | relapsed/refractory follicular and marginal zo... | 78 | 134 | 0 | Patient_Group |
| 3 | rituximab + placebo | 237 | 255 | 0 | Trial_Group |
| 4 | median | 331 | 336 | 2 | Duration |
| 5 | lenalidomide + rituximab | 385 | 408 | 2 | Trial_Group |
| 6 | rituximab + placebo | 414 | 432 | 2 | Trial_Group |
| 7 | 39.4 | 438 | 441 | 2 | Value |
| 8 | 14.1months | 447 | 456 | 2 | Value |
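
The `begin` and `end` values in this output are character offsets into the input text, and `end` is inclusive. A small sketch of how to check that and how to group chunks by label follows. It uses only the first two rows of the output and a prefix of `sample_text` so that it runs without Spark; `span_text` is a hypothetical helper name.

```python
from collections import defaultdict

# Prefix of sample_text, long enough to cover the first two chunks
text = "In July 2018, final pfs analysis results were reported from 358 patients"

# (chunk, begin, end, label) rows copied from the output above
rows = [
    ("July 2018", 3, 11, "DATE"),
    ("358 patients", 60, 71, "Patient_Count"),
]


def span_text(text: str, begin: int, end: int) -> str:
    """Return the substring for an annotation whose end offset is inclusive."""
    return text[begin:end + 1]


# Group chunk texts by entity label, recovering each chunk from its offsets
by_label = defaultdict(list)
for chunk, begin, end, label in rows:
    recovered = span_text(text, begin, end)
    by_label[label].append(recovered)
    print(f"{label:15s} {begin:3d}-{end:3d}  {recovered!r}  matches: {recovered == chunk}")
```

Because `end` is inclusive, a plain Python slice needs `end + 1`; slicing with `text[begin:end]` would drop the last character of every chunk.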


# **🔎Visualize results**

In [7]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
visualiser.display(result=light_result[0], label_col='ner_chunk', document_col='document')
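
When the notebook output cannot render the visualizer's HTML, a plain-text rendering of the same chunks can be a useful fallback. The sketch below is not part of `sparknlp_display`; `annotate` is a hypothetical helper that wraps each chunk in `[text|LABEL]` using the inclusive character offsets from `ner_chunk`, and it assumes the spans do not overlap. It again uses a prefix of `sample_text` and the first two chunks so that it runs standalone.

```python
from typing import List, Tuple


def annotate(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (begin, end_inclusive, label) span in [text|LABEL] markers.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for begin, end, label in sorted(spans):
        pieces.append(text[cursor:begin])                     # untouched text before the chunk
        pieces.append(f"[{text[begin:end + 1]}|{label}]")     # the labelled chunk
        cursor = end + 1
    pieces.append(text[cursor:])                              # remainder after the last chunk
    return "".join(pieces)


# Prefix of sample_text and the first two chunks from the results
text = "In July 2018, final pfs analysis results were reported from 358 patients"
spans = [(3, 11, "DATE"), (60, 71, "Patient_Count")]

annotated = annotate(text, spans)
print(annotated)
```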