

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_COVID.ipynb)




# **Extract entities in COVID trials**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
If you do not have a license, you can still look at the example outputs shown in the notebook.



## 1. Colab Setup

**Import license keys**

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as notebook variables, so the install
# commands below can reference $PUBLIC_VERSION, $JSL_VERSION and $SECRET
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
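
Later cells read `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` from the uploaded file, so a missing key only surfaces as a failed install or session start. The sketch below shows one way to check for them right after loading. The helper `missing_license_keys` and the placeholder values are hypothetical and not part of Spark NLP; a real license file may contain additional keys.

```python
import json

# Keys this notebook reads later; a real file may contain more.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty in `keys`."""
    return [name for name in required if not keys.get(name)]


# Placeholder content standing in for an uploaded license_keys.json.
example = json.loads('{"SECRET": "xxxx", "PUBLIC_VERSION": "x.y.z"}')

missing = missing_license_keys(example)
if missing:
    print("license_keys.json is missing:", ", ".join(missing))
```

In the notebook, the same check could be applied to `license_keys` before the install cell runs.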

## 2. Install dependencies

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

**Import dependencies into Python and start the Spark session**

In [None]:
# Import sparknlp & sparknlp_jsl packages
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

# Import Pyspark packages
from pyspark.sql import SparkSession
from pyspark.sql import functions as F 
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd
import numpy as np 

spark = sparknlp_jsl.start(license_keys['SECRET'])

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.4
Spark NLP_JSL Version : 3.5.2


## 3. Select the NER model and construct the pipeline

In [None]:
MODEL_NAME = "ner_covid_trials"

**Create the pipeline**

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentence')

tokenizer = Tokenizer()\
    .setInputCols(['sentence']) \
    .setOutputCol('token')

word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('embeddings')

clinical_ner = MedicalNerModel.pretrained(MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(['sentence', 'token', 'ner']) \
    .setOutputCol('ner_chunk')

nlp_pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter
    ])

empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_covid_trials download started this may take some time.
[OK!]
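
The stages in this pipeline are connected only by column names: each annotator reads the columns named in `setInputCol`/`setInputCols` and writes the one named in `setOutputCol`. The plain-Python sketch below checks that wiring, using the column names from the cell above. The `check_wiring` helper is hypothetical and does not call Spark NLP; it only illustrates how the columns chain together.

```python
# (stage name, input columns, output column), copied from the pipeline above.
STAGES = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings", ["sentence", "token"], "embeddings"),
    ("clinical_ner", ["sentence", "token", "embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
]


def check_wiring(stages, initial_columns=("text",)):
    """Return (stage, missing_column) pairs for inputs no earlier stage produces."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        problems.extend((name, col) for col in inputs if col not in available)
        available.add(output)
    return problems


# An empty list means every input column is produced by an earlier stage.
print(check_wiring(STAGES))
```

Dropping or reordering a stage, for example removing the `DocumentAssembler`, would show up as a missing `document` column for the sentence detector.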


## 4. Create example inputs

In [None]:
sample_text = ["""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tract such as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR )."""]

## 5. Use the pipeline to create outputs

In [None]:
df = spark.createDataFrame(pd.DataFrame({'text': sample_text}))

result = pipeline_model.transform(df)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------------+-------------------------+
|chunk                              |ner_label                |
+-----------------------------------+-------------------------+
|December 2019                      |Date                     |
|acute respiratory disease          |Disease_Syndrome_Disorder|
|beta-coronavirus                   |Virus                    |
|2019 coronavirus infection         |Disease_Syndrome_Disorder|
|SARS-CoV-2                         |Virus                    |
|coronavirus                        |Virus                    |
|β-coronaviruses                    |Virus                    |
|subgenus Coronaviridae             |Virus                    |
|SARS-CoV-2                         |Virus                    |
|zoonotic coronavirus disease       |Disease_Syndrome_Disorder|
|severe acute respiratory syndrome  |Disease_Syndrome_Disorder|
|SARS                               |Disease_Syndrome_Disorder|
|Middle Eastern respiratory syndrome|Dis
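
The query above relies on `ner_chunk.result` and `ner_chunk.metadata` being parallel arrays: `arrays_zip` pairs them element by element, and `explode` emits one row per pair. The plain-Python sketch below mirrors that logic on the first three complete rows of the table, without Spark. The variable names are illustrative only, and the metadata dictionaries are reduced to the `entity` key that the query reads.

```python
from collections import Counter

# One document's ner_chunk column, reduced to the two parallel arrays.
results = ["December 2019", "acute respiratory disease", "beta-coronavirus"]
metadata = [
    {"entity": "Date"},
    {"entity": "Disease_Syndrome_Disorder"},
    {"entity": "Virus"},
]

# zip plays the role of arrays_zip; the list comprehension yields one
# (chunk, ner_label) tuple per pair, as explode yields one row per pair.
rows = [(chunk, meta["entity"]) for chunk, meta in zip(results, metadata)]

# Count how many chunks were found per label.
label_counts = Counter(label for _, label in rows)

for chunk, label in rows:
    print(f"{chunk:<28}{label}")
```

The same per-label count could be computed in Spark with a `groupBy` on the `ner_label` column of the exploded DataFrame.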

## 6. Visualize results

In [None]:
from sparknlp_display import NerVisualizer

NerVisualizer().display(
    result = result.collect()[0],
    label_col = 'ner_chunk',
    document_col = 'document'
)