
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DRUGS.ipynb)

# **Detect Drug Chemicals**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do so. Alternatively, open the file explorer on the left side of the screen and upload your license file to the folder that opens, named `spark_jsl.json` as the cell below expects. Otherwise, you can look at the example outputs shown in the notebook.

# **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
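
The install and start-up cells that follow read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from this file, so a missing key would otherwise only surface later as a failed `pip install`. A minimal sketch of an early check, assuming those three keys are the only ones this notebook needs; the helper name `missing_license_keys` is hypothetical and not part of Spark NLP:

```python
import json

# Keys this notebook reads later: the two version pins and the install secret.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in order."""
    return [key for key in required if not license_keys.get(key)]

# Illustrative stand-in for the contents of spark_jsl.json
# (placeholder values, not real credentials; SECRET is left out on purpose).
example_keys = json.loads('{"PUBLIC_VERSION": "4.2.8", "JSL_VERSION": "4.2.8"}')

missing = missing_license_keys(example_keys)
if missing:
    print("License file is missing:", ", ".join(missing))
```

In the notebook, the same function could be called on `license_keys` right after `json.load(f)`.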

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


# **🔎 About the models**


📌 **ner_drugs** --> *Pretrained named entity recognition deep learning model for drugs.*

*   Predicted Entities => **DrugChem**

📌 **bert_token_classifier_ner_drugs** --> *Pretrained named entity recognition deep learning model for drugs. It detects drug chemicals.*

*   Predicted Entities => **DrugChem**



# **🔎 Define Spark NLP pipeline**

In [4]:
#BASIC STAGES👇🏻

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

ner_converter = NerConverter()\
    .setInputCols(['sentence', 'token', 'ner']) \
    .setOutputCol('ner_chunk')

#SELECT NER MODEL👇🏻

def pipeline(model_name):
  """Build and fit the stages for `model_name` and return them as a LightPipeline."""

  if model_name == 'bert_token_classifier_ner_drugs':

    tokenClassifier = MedicalBertForTokenClassifier.pretrained(model_name, "en", 'clinical/models')\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("ner")\
      .setCaseSensitive(True)

    nlpPipeline = Pipeline(stages=[documentAssembler,
                                   sentenceDetector,
                                   tokenizer,
                                   tokenClassifier,
                                   ner_converter])

  else:

    clinical_ner = MedicalNerModel.pretrained(model_name, "en", "clinical/models")\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setOutputCol("ner")


    nlpPipeline = Pipeline(stages=[documentAssembler,
                                   sentenceDetector,
                                   tokenizer,
                                   word_embeddings,
                                   clinical_ner,
                                   ner_converter])

  empty_data = spark.createDataFrame([[""]]).toDF("text")
  model = nlpPipeline.fit(empty_data)
  
  light_model = LightPipeline(model)
  return light_model

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
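
Each call to `pipeline(model_name)` loads the pretrained NER model and refits the stages, which can be slow when the same model is requested more than once. One possible way to avoid the repeated work is to memoize the builder with `functools.lru_cache`. The sketch below uses a hypothetical stand-in, `build_light_pipeline`, so that it runs without Spark; in the notebook the decorator would be applied to `pipeline` itself.

```python
from functools import lru_cache

build_calls = []  # records every real build, for illustration only

@lru_cache(maxsize=None)
def build_light_pipeline(model_name):
    """Hypothetical stand-in for pipeline(model_name); returns a placeholder object."""
    build_calls.append(model_name)
    return {"model": model_name}

first = build_light_pipeline("ner_drugs")
second = build_light_pipeline("ner_drugs")  # served from the cache, no rebuild
other = build_light_pipeline("bert_token_classifier_ner_drugs")

print(build_calls)
```

The cache key is the model name, so this only helps when the function is called with the same string again; the cached object stays in memory until the cache is cleared with `build_light_pipeline.cache_clear()`.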


# **🔎 "ner_drugs" model**

In [5]:
sample_text = """The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying Potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including the pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes."""

drug_light_result = pipeline("ner_drugs").fullAnnotate(sample_text)

chunks = []
entities = []
sentence= []
begin = []
end = []

for n in drug_light_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    

drug_df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities})

drug_df.head(20)

ner_drugs download started this may take some time.
[OK!]


| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | Potassium | 92 | 100 | 0 | DrugChem |
| 1 | anthracyclines | 1137 | 1150 | 6 | DrugChem |
| 2 | taxanes | 1156 | 1162 | 6 | DrugChem |
| 3 | vinorelbine | 1217 | 1227 | 7 | DrugChem |
| 4 | vinorelbine | 1358 | 1368 | 7 | DrugChem |
| 5 | anthracyclines | 1406 | 1419 | 7 | DrugChem |
| 6 | taxanes | 1425 | 1431 | 7 | DrugChem |
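
The `begin` and `end` values are inclusive character offsets into `sample_text`: `Potassium` has nine characters and spans 92 to 100. The collection loop above can also be folded into a helper. A minimal sketch, using a hypothetical `Chunk` namedtuple as a stand-in for the annotation objects returned by `fullAnnotate` (only the four attributes read in the loop are modeled, and the metadata values are assumed to be strings):

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same attributes the loop reads.
Chunk = namedtuple("Chunk", ["begin", "end", "result", "metadata"])

def chunks_to_rows(ner_chunks):
    """Flatten chunk annotations into one dict per row, matching the table columns."""
    return [
        {
            "chunks": c.result,
            "begin": c.begin,
            "end": c.end,
            "sentence_id": c.metadata["sentence"],
            "entities": c.metadata["entity"],
        }
        for c in ner_chunks
    ]

# First sentence of sample_text and the first detected chunk from the table.
text = ("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated "
        "inwardly rectifying Potassium (GIRK) channel family.")
rows = chunks_to_rows(
    [Chunk(92, 100, "Potassium", {"entity": "DrugChem", "sentence": "0"})]
)

# `end` is inclusive, so the slice needs end + 1 to recover the chunk text.
row = rows[0]
recovered = text[row["begin"]:row["end"] + 1]
print(recovered)
```

In the notebook, `pd.DataFrame(chunks_to_rows(drug_light_result[0]['ner_chunk']))` would then build the same table from the real annotations.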


# **🔎 "bert_token_classifier_ner_drugs" model**

In [6]:
sample_text = """The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying Potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including the pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes."""

classifier_light_result = pipeline("bert_token_classifier_ner_drugs").fullAnnotate(sample_text)

chunks = []
entities = []
sentence= []
begin = []
end = []

for n in classifier_light_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    

classifier_df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities})

classifier_df.head(20)

bert_token_classifier_ner_drugs download started this may take some time.
[OK!]


| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | Potassium | 92 | 100 | 0 | DrugChem |
| 1 | nucleotide | 475 | 484 | 3 | DrugChem |
| 2 | anthracyclines | 1137 | 1150 | 6 | DrugChem |
| 3 | taxanes | 1156 | 1162 | 6 | DrugChem |
| 4 | vinorelbine | 1217 | 1227 | 7 | DrugChem |
| 5 | vinorelbine | 1358 | 1368 | 7 | DrugChem |
| 6 | anthracyclines | 1406 | 1419 | 7 | DrugChem |
| 7 | taxanes | 1425 | 1431 | 7 | DrugChem |
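
Several chemicals are mentioned more than once in the text. A small sketch of how the mentions could be tallied with `collections.Counter`, with the chunk strings copied by hand from the table above; lowercasing before counting is an assumption made here so that case variants fall together:

```python
from collections import Counter

# Chunk strings copied from the bert_token_classifier_ner_drugs output.
bert_chunks = [
    "Potassium", "nucleotide", "anthracyclines", "taxanes",
    "vinorelbine", "vinorelbine", "anthracyclines", "taxanes",
]

mention_counts = Counter(chunk.lower() for chunk in bert_chunks)

for chunk, count in mention_counts.most_common():
    print(f"{chunk}: {count}")
```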


# **Checking `ner_drugs` and `bert_token_classifier_ner_drugs` results together**

In [7]:
from google.colab import widgets
from sparknlp_display import NerVisualizer

t = widgets.TabBar(["ner_drugs", "bert_token_classifier_ner_drugs", "viz_drug", "viz_token_classifier" ])

with t.output_to(0):
    display(drug_df)

with t.output_to(1):
    display(classifier_df)

with t.output_to(2):
    NerVisualizer().display(drug_light_result[0], label_col='ner_chunk', document_col='document')

with t.output_to(3):
    NerVisualizer().display(classifier_light_result[0], label_col='ner_chunk', document_col='document')

| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | Potassium | 92 | 100 | 0 | DrugChem |
| 1 | anthracyclines | 1137 | 1150 | 6 | DrugChem |
| 2 | taxanes | 1156 | 1162 | 6 | DrugChem |
| 3 | vinorelbine | 1217 | 1227 | 7 | DrugChem |
| 4 | vinorelbine | 1358 | 1368 | 7 | DrugChem |
| 5 | anthracyclines | 1406 | 1419 | 7 | DrugChem |
| 6 | taxanes | 1425 | 1431 | 7 | DrugChem |


| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | Potassium | 92 | 100 | 0 | DrugChem |
| 1 | nucleotide | 475 | 484 | 3 | DrugChem |
| 2 | anthracyclines | 1137 | 1150 | 6 | DrugChem |
| 3 | taxanes | 1156 | 1162 | 6 | DrugChem |
| 4 | vinorelbine | 1217 | 1227 | 7 | DrugChem |
| 5 | vinorelbine | 1358 | 1368 | 7 | DrugChem |
| 6 | anthracyclines | 1406 | 1419 | 7 | DrugChem |
| 7 | taxanes | 1425 | 1431 | 7 | DrugChem |
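
The two result sets differ by a single row. A minimal sketch of how that difference can be computed with set operations, with the (chunk, begin, end) triples copied by hand from the outputs above; in the notebook the same triples could be derived from `drug_df` and `classifier_df`:

```python
# (chunk, begin, end) triples from the ner_drugs output.
ner_drugs_spans = {
    ("Potassium", 92, 100),
    ("anthracyclines", 1137, 1150),
    ("taxanes", 1156, 1162),
    ("vinorelbine", 1217, 1227),
    ("vinorelbine", 1358, 1368),
    ("anthracyclines", 1406, 1419),
    ("taxanes", 1425, 1431),
}

# (chunk, begin, end) triples from the bert_token_classifier_ner_drugs output.
bert_spans = {
    ("Potassium", 92, 100),
    ("nucleotide", 475, 484),
    ("anthracyclines", 1137, 1150),
    ("taxanes", 1156, 1162),
    ("vinorelbine", 1217, 1227),
    ("vinorelbine", 1358, 1368),
    ("anthracyclines", 1406, 1419),
    ("taxanes", 1425, 1431),
}

# Spans found by only one of the two models, ordered by start offset.
only_bert = sorted(bert_spans - ner_drugs_spans, key=lambda span: span[1])
only_ner_drugs = sorted(ner_drugs_spans - bert_spans, key=lambda span: span[1])

print("Only bert_token_classifier_ner_drugs:", only_bert)
print("Only ner_drugs:", only_ner_drugs)
```

On this sample, the only disagreement is `nucleotide` (from "single nucleotide polymorphisms"), which `bert_token_classifier_ner_drugs` tags and `ner_drugs` does not.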

