![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CHEMICALS.ipynb)

# `ner_chemicals` **Models**

This notebook extracts the chemical compounds mentioned in text using two pretrained NER models: `ner_chemicals` and `bert_token_classifier_ner_chemicals`.

## 1. Colab Setup

**Import license keys**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs

In [None]:
from google.colab import files
print("Please Upload your John Snow Labs License using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.

jsl.install()

## 2. Start Session

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all the JARs the user has access to
spark = jsl.start()

In [None]:
spark

## 3. Select the model and construct the pipeline

In [None]:
MODEL_LIST = ["ner_chemicals",
              "bert_token_classifier_ner_chemicals"]

**Create the pipeline**

In [None]:
document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")


word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel.pretrained("ner_chemicals", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")

tokenClassifier = medical.BertForTokenClassifier.pretrained("bert_token_classifier_ner_chemicals","en", "clinical/models")\
      .setInputCols(["token", "document"])\
      .setOutputCol("ner")\
      .setCaseSensitive(True)

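ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")

# Explicit imports so the function below does not rely on the wildcard import
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType


def run_pipeline(model_name, sample_text):
    """Build the pipeline for `model_name` and run it on a list of strings."""
    if model_name == "ner_chemicals":
        # Embeddings-based NER: needs the clinical word embeddings stage
        ner_pipeline = Pipeline(stages = [document_assembler,
                                          tokenizer,
                                          word_embeddings,
                                          clinical_ner,
                                          ner_converter])

    else:
        # BERT token classifier: works directly on tokens, no separate embeddings
        ner_pipeline = Pipeline(stages = [document_assembler,
                                          tokenizer,
                                          tokenClassifier,
                                          ner_converter])

    # One row per sample text, in a single column named "text"
    text = spark.createDataFrame(sample_text, StringType()).toDF('text')

    result = ner_pipeline.fit(text).transform(text)
    return result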

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_chemicals download started this may take some time.
[OK!]
bert_token_classifier_ner_chemicals download started this may take some time.
[OK!]
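
The `ner` column produced by either model holds one tag per token, and `NerConverterInternal` groups those tags into entity chunks. The grouping idea can be sketched without Spark as follows, assuming IOB-style tags such as `B-CHEM`, `I-CHEM`, and `O`. The function name `iob_to_chunks`, the sample tokens, and the tags are hypothetical, and the real annotator tracks character offsets and supports options that are not modelled here.

```python
from typing import List, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, entity) pairs.

    Simplification: tokens of a chunk are joined with single spaces.
    """
    chunks = []
    current_tokens = []
    current_entity = None

    def flush():
        # Close the chunk that is currently open, if any
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_entity))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk
            flush()
            current_tokens = [token]
            current_entity = tag[2:]
        elif tag.startswith("I-") and current_entity == tag[2:]:
            # An I- tag of the same type extends the open chunk
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open chunk; the token is dropped
            flush()
            current_tokens = []
            current_entity = None
    flush()
    return chunks


# Hypothetical tokens and tags, for illustration only
tokens = ["RSV", "activates", "SIRT1", "via", "Ca", "(", "2", "+", ")"]
tags = ["B-CHEM", "O", "O", "O", "B-CHEM", "I-CHEM", "I-CHEM", "I-CHEM", "I-CHEM"]
print(iob_to_chunks(tokens, tags))
# [('RSV', 'CHEM'), ('Ca ( 2 + )', 'CHEM')]
```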


## 4. Create example inputs

In [None]:
sample_text = [
"""Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress. Resveratrol (trans - 3, 5, 4  - trihydroxystilbene ; RSV) , a natural polyphenol, exerts a beneficial effect on health and diseases. RSV targets and activates the NAD(+) - dependent protein deacetylase SIRT1; in turn, SIRT1 induces an intracellular antioxidative mechanism by inducing mitochondrial superoxide dismutase (SOD2). Most RSV found in plants is glycosylated, and the effect of these glycosylated forms on SIRT1 has not been studied.""",

"""In this study, the mechanisms of TRG prevention of oxidative stress were determined by measuring erythrocyte and liver antioxidant enzyme activities, and expressions of genes associated with reactive oxygen species production, and carbohydrate and lipid metabolisms by DNA microarray. Erythrocyte and liver glutathione peroxidase, and liver catalase activities in the GK rats fed with TRG were significantly lower than those of the GK control rats. TRG downregulated the gene expressions involved with NADPH oxidase and mitochondrial electron transfer system when compared with those of the GK control group. These results suggested that mitigation of diabetes by TRG is mediated by its ameliorating effects on oxidative stress. Metabolic effects of honey in type 1 diabetes mellitus: a randomized crossover pilot study. The aim of this study was to evaluate the metabolic effects of 12 - week honey consumption on patients suffering from type 1 diabetes mellitus (DM).""",
    
"""In both experiments , the ordering of the interactions of the cations was : Ca ( 2 + ) > Mg ( 2 + ) > Li ( + ) > Na ( + ) = ~ K ( + ) . This is a direct cationic Hofmeister series . Even for Ca ( 2 + ) , however , the apparent equilibrium dissociation constant of the cation with the amide carbonyl oxygen was no tighter than ~ 8 . 5 M . For Na ( + ) and K ( + ) , no evidence was found for any binding.""",

"""It was revealed that the most active compounds 4 - ((5Z) - 5 - {[5 - (4 - bromophenyl) - 2 - furyl] methylene} - 4 - oxo - 2 - thioxo - 1, 3 - thiazolidin - 3 - yl) butanoic acid and 6 - ((5Z) - 5 - {[5 - (4 - bromophenyl) - 2 - furyl] methylene} - 4 - oxo - 2 - thioxo - 1, 3 - thiazolidin - 3 - yl) hexanoic acid inhibit ASK1 with IC50 of 0.2 mu M. Structure - activity relationships of 33 derivatives of 5 - (5 - Phenyl - furan - 2 - ylmethylene) - 2 - thioxo - thiazolidin - 4 - one have been studied and binding mode of this chemical class has been predicted. Identification and characterization of novel catalytic bioscavengers of organophosphorus nerve agents.""",

"""Hepatic function was assessed by evaluating the following parameters: liver histology; plasma levels of alanine aminotransferase (ALT), triglyceride (TG), malondialdehyde (MDA), and reduced glutathione (GSH); expression levels of TNF - alpha and IL - 6; and levels of caspase - 3 and pJNK / JNK protein.""",
]

In [None]:
from pyspark.sql.types import StringType

# One row per sample text, in a single column named "text"
text = spark.createDataFrame(sample_text, StringType()).toDF('text')
text.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene)...|
|In this study, the mechanisms of TRG prevention of oxidative stress were determined by measuring ...|
|In both experiments , the ordering of the interactions of the cations was : Ca ( 2 + ) > Mg ( 2 +...|
|It was revealed that the most active compounds 4 - ((5Z) - 5 - {[5 - (4 - bromophenyl) - 2 - fury...|
|Hepatic function was assessed by evaluating the following parameters: liver histology; plasma lev...|
+----------------------------------------------------------------------------------------------------+



## 5. Use the pipeline to create outputs

In [None]:
import pyspark.sql.functions as F

for model_name in MODEL_LIST:

    result = run_pipeline(model_name, sample_text)

    print(f"\n*******{model_name}********")

    # Zip the parallel chunk arrays, then explode them into one row per detected chunk
    result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                         result.ner_chunk.begin,
                                         result.ner_chunk.end,
                                         result.ner_chunk.metadata)).alias("cols"))\
          .select(F.expr("cols['0']").alias("chunk"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']['entity']").alias("entity")).show()


*******ner_chemicals********
+--------------------+-----+---+------+
|               chunk|begin|end|entity|
+--------------------+-----+---+------+
|         resveratrol|   47| 57|  CHEM|
|trans - 3, 5, 4 -...|   60|107|  CHEM|
|         Resveratrol|  135|145|  CHEM|
|trans - 3, 5, 4  ...|  148|184|  CHEM|
|                 RSV|  188|190|  CHEM|
|          polyphenol|  205|214|  CHEM|
|                 RSV|  268|270|  CHEM|
|               NAD(+|  298|302|  CHEM|
|          superoxide|  434|443|  CHEM|
|                 RSV|  468|470|  CHEM|
|                 TRG|   33| 35|  CHEM|
|              oxygen|  200|205|  CHEM|
|        carbohydrate|  231|242|  CHEM|
|         glutathione|  307|317|  CHEM|
|                 TRG|  385|387|  CHEM|
|                 TRG|  449|451|  CHEM|
|               NADPH|  502|506|  CHEM|
|                 TRG|  664|666|  CHEM|
|          Ca ( 2 + )|   76| 85|  CHEM|
|          Mg ( 2 + )|   89| 98|  CHEM|
+--------------------+-----+---+------+
only showing top 20 rows
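
The `begin` and `end` columns in the table above are character offsets into the input text, and `end` points at the last character of the chunk, not one past it. A minimal sketch of recovering a chunk from those offsets is shown below; the helper name `chunk_span` is hypothetical, and `text` is only the opening of the first sample text.

```python
def chunk_span(text: str, begin: int, end: int) -> str:
    """Return the substring for an annotation whose `end` index is inclusive."""
    return text[begin:end + 1]


# Opening of the first sample text, copied from the example inputs
text = ("Differential cell - protective function of two resveratrol "
        "(trans - 3, 5, 4 - trihydroxystilbene) glucosides against "
        "oxidative stress. Resveratrol")

# Offsets taken from the first and third rows of the ner_chemicals output
print(chunk_span(text, 47, 57))    # resveratrol
print(chunk_span(text, 135, 145))  # Resveratrol
```

Python slices exclude the stop index, which is why the helper adds 1 to `end`.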

## 6. Visualize results

In [None]:
from sparknlp_display import NerVisualizer

ner_viz = NerVisualizer()

for model_name in MODEL_LIST:

    result = run_pipeline(model_name, sample_text)
    print(f"\n\n******************{model_name}************************\n")

    # Collect once per model so the pipeline is not re-run for every row
    for row in result.collect():
        ner_viz.display(result = row, label_col = "ner_chunk")



******************ner_chemicals************************





******************bert_token_classifier_ner_chemicals************************
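
`NerVisualizer` renders HTML, so its output only appears in an environment that can display it. Where that is not available, a rough plain-text alternative is to wrap each chunk in brackets using its offsets. The sketch below assumes spans are non-overlapping and that `end` is inclusive, as in the chunk table of section 5; the function name `highlight` and the short sample string are hypothetical.

```python
def highlight(text, chunks):
    """Wrap each (begin, end, entity) span as [chunk|ENTITY]; `end` is inclusive.

    Assumes the spans do not overlap.
    """
    out = []
    cursor = 0
    for begin, end, entity in sorted(chunks):
        out.append(text[cursor:begin])                   # text before the chunk
        out.append(f"[{text[begin:end + 1]}|{entity}]")  # the marked chunk
        cursor = end + 1
    out.append(text[cursor:])                            # text after the last chunk
    return "".join(out)


# Hypothetical short input with hand-written offsets, for illustration only
text = "RSV , a natural polyphenol"
chunks = [(0, 2, "CHEM"), (16, 25, "CHEM")]
print(highlight(text, chunks))
# [RSV|CHEM] , a natural [polyphenol|CHEM]
```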

