

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BACTERIAL_SPECIES.ipynb)




# **Detect bacterial species**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do so, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

**Import license keys**

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
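
The cell above assumes the uploaded file contains every key that later cells read. As a minimal sketch (not part of the original notebook), the check below fails early when a key is missing. The helper name `load_license_keys` is hypothetical. The key names `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` are the ones referenced by the install and start-up cells of this notebook. Your license file may contain additional keys.

```python
import json

# Keys that later cells of this notebook read (install and start-up cells).
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def load_license_keys(path):
    """Load a license JSON file and fail early if a required key is missing.

    Hypothetical helper for illustration; it is not part of Spark NLP.
    """
    with open(path) as f:
        keys = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in keys]
    if missing:
        raise KeyError(f"license file is missing keys: {', '.join(missing)}")
    return keys
```

Outside Colab, something like `license_keys = load_license_keys("license_keys.json")` could stand in for the `files.upload()` step, assuming the file sits in the working directory.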

## 2. Install dependencies

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

**Import dependencies into Python and start the Spark session**

In [None]:
# Import sparknlp & sparknlp_jsl packages
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

# Import Pyspark packages
from pyspark.sql import SparkSession
from pyspark.sql import functions as F 
from pyspark.ml import Pipeline, PipelineModel


import pandas as pd
import numpy as np 

spark = sparknlp_jsl.start(license_keys['SECRET'])

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.4
Spark NLP_JSL Version : 3.5.2


## 3. Select the NER model 


**ner_bacterial_species**: Detects bacterial species in text using a pretrained NER model.

**bert_token_classifier_ner_bacteria**: Detects bacterial species in text using a pretrained NER model. This model was trained with the `BertForTokenClassification` method from the transformers library and imported into Spark NLP.


In [None]:
### Select Model
ModelList = ["ner_bacterial_species",
             "bert_token_classifier_ner_bacteria" ]

## 4. Create example inputs

In [None]:
from pyspark.sql.types import StringType, IntegerType

sample_text = [
    """Bayesian analysis of 16S rRNA gene sequences suggested that the newly identified isolates belong to distinct but related species of the genus Neisseria, and are members of a clade that includes N. dentiae, N. bacilliformis and N. canis
    The predominant cellular fatty acids [16 : 0 , summed feature 3 (16 : 1omega7c and/or iso-15 : 0 2-OH) and 18:1omega7c], as well as biochemical and morphological analyses further support the designation of Neisseria wadsworthii sp . nov.""",
    """16S rRNA gene sequence analysis showed that strain P(T) fell within a group of species in the genus Spirochaeta, including Spirochaeta litoralis, S. isovalerica and S. cellobiosiphila, with which it shared less then 89% sequence similarity.""",
    """It exhibited highest 16S rRNA gene sequence similarity (93.4%) with Clostridiisalibacter paucivorans 37HS60 (T), 91. 8% with Thermohalobacter berrensis CTT3 (T) and 91. 7% with Caloranaerobacter azorensis MV1087 (T).""",
    """The 16S rRNA gene sequence of strain F44 - 8 (T) showed highest similarities to those of Flavobacterium frigoris LMG 21922 (T) (93.3%), Flavobacterium terrae R2A1 - 13 (T) (93.3%) and Flavobacterium gelidilacus LMG 21477 (T) (93.1%)""",
    """The morphology and infraciliature of three karyorelictean ciliates, Geleia sinica and two poorly known Kentrophoros species, K.flavus and K.gracilis, isolated from the intertidal zone of a beach at Qingdao, China, were investigated."""
]

df = spark.createDataFrame(sample_text, StringType()).toDF("text")
df.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|Bayesian analysis of 16S rRNA gene sequences suggested that the newly identified isolates belong ...|
|16S rRNA gene sequence analysis showed that strain P(T) fell within a group of species in the gen...|
|It exhibited highest 16S rRNA gene sequence similarity (93.4%) with Clostridiisalibacter paucivor...|
|The 16S rRNA gene sequence of strain F44 - 8 (T) showed highest similarities to those of Flavobac...|
|The morphology and infraciliature of three karyorelictean ciliates, Geleia sinica and two poorly ...|
+----------------------------------------------------------------------------------------------------+



## 5. Define Spark NLP pipeline

**Create the pipeline**

In [None]:
document_assembler = DocumentAssembler() \
  .setInputCol('text')\
  .setOutputCol('document')

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentence')

tokenizer = Tokenizer()\
    .setInputCols(['sentence']) \
    .setOutputCol('token')

word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('embeddings')

clinical_ner = MedicalNerModel.pretrained("ner_bacterial_species", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"])\
      .setOutputCol("ner")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bacteria", "en", "clinical/models")\
  .setInputCols("token", "document")\
  .setOutputCol("ner")\
  .setCaseSensitive(True) 

ner_converter = NerConverter()\
    .setInputCols(['sentence', 'token', 'ner']) \
    .setOutputCol('ner_chunk')



embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_bacterial_species download started this may take some time.
[OK!]
bert_token_classifier_ner_bacteria download started this may take some time.
[OK!]
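
Each annotator above reads columns that an earlier stage must already have written, so the order of stages matters. The sketch below is plain Python and does not call Spark NLP. It copies the input and output column names from the cell above into a table and reports any stage whose inputs are not yet available. The `STAGES` table and the `missing_inputs` helper are hypothetical names introduced for illustration.

```python
# (input columns, output column) per stage, copied from the pipeline cell above.
STAGES = {
    "document_assembler": (["text"], "document"),
    "sentence_detector": (["document"], "sentence"),
    "tokenizer": (["sentence"], "token"),
    "word_embeddings": (["sentence", "token"], "embeddings"),
    "clinical_ner": (["sentence", "token", "embeddings"], "ner"),
    "tokenClassifier": (["token", "document"], "ner"),
    "ner_converter": (["sentence", "token", "ner"], "ner_chunk"),
}


def missing_inputs(stage_names, initial_columns=("text",)):
    """Return {stage: [columns]} for inputs that no earlier stage has produced."""
    available = set(initial_columns)
    problems = {}
    for name in stage_names:
        inputs, output = STAGES[name]
        absent = [col for col in inputs if col not in available]
        if absent:
            problems[name] = absent
        available.add(output)
    return problems


embeddings_pipeline = [
    "document_assembler", "sentence_detector", "tokenizer",
    "word_embeddings", "clinical_ner", "ner_converter",
]
print(missing_inputs(embeddings_pipeline))
```

Dropping `word_embeddings` from the list makes the check report that `clinical_ner` lacks its `embeddings` input, which is why the BERT-based model, which computes its own representations, can skip that stage.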


In [None]:
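def run_pipeline(MODEL_NAME, TEXT, RESULT):
    # Build the input DataFrame from the texts passed in, instead of relying on the global `df`
    text_df = spark.createDataFrame(TEXT, StringType()).toDF("text")

    if MODEL_NAME == "ner_bacterial_species":
        # Embeddings-based NER: word embeddings must be computed before the NER model
        nlp_pipeline = Pipeline(stages=[
            document_assembler,
            sentence_detector,
            tokenizer,
            word_embeddings,
            clinical_ner,
            ner_converter
            ])
    elif MODEL_NAME == "bert_token_classifier_ner_bacteria":
        # BERT token classifier: no separate embeddings stage is needed
        nlp_pipeline = Pipeline(stages=[
            document_assembler,
            sentence_detector,
            tokenizer,
            tokenClassifier,
            ner_converter
            ])
    else:
        raise ValueError(f"Unknown model: {MODEL_NAME}")

    # Store the annotated DataFrame under the model name
    RESULT[MODEL_NAME] = nlp_pipeline.fit(text_df).transform(text_df)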
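
The two branches above differ only in the stages between the tokenizer and the NER converter. One possible alternative, sketched here under the assumption that more models may be added later, is a lookup table keyed by model name. Strings stand in for the annotator objects so the sketch runs without Spark. `SHARED_PREFIX`, `MODEL_STAGES`, and `stages_for` are hypothetical names, not part of the notebook or of Spark NLP.

```python
# Stage names as strings stand in for the annotator objects defined earlier.
SHARED_PREFIX = ["document_assembler", "sentence_detector", "tokenizer"]

# Model-specific stages that sit between the tokenizer and the NER converter.
MODEL_STAGES = {
    "ner_bacterial_species": ["word_embeddings", "clinical_ner"],
    "bert_token_classifier_ner_bacteria": ["tokenClassifier"],
}


def stages_for(model_name):
    """Return the ordered stage names for a model, or raise on an unknown name."""
    try:
        middle = MODEL_STAGES[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name!r}") from None
    return SHARED_PREFIX + middle + ["ner_converter"]


for name in MODEL_STAGES:
    print(name, "->", stages_for(name))
```

With this layout, supporting another model would mean adding one dictionary entry rather than another `elif` branch.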

## 6. Run the pipeline

In [None]:
results = {}

for model in ModelList:
  run_pipeline(model, sample_text, results)

In [None]:
results

{'bert_token_classifier_ner_bacteria': DataFrame[text: string, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentence: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, token: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, ner: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, ner_chunk: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>],
 'ner_bacterial_species': DataFrame[text: string, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentence: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,emb

## 7. Visualize results

In [None]:
from sparknlp_display import NerVisualizer
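
The visualization cell that follows uses `F.arrays_zip` and `F.explode` to turn the `ner_chunk` column, which holds one array of annotations per input text, into one row per detected chunk. The plain-Python sketch below mirrors that flattening on hand-written data. The `rows` structure and the `flatten_chunks` helper are hypothetical and only imitate the shape of the column; the chunk strings are taken from the example output of this notebook.

```python
# Hand-written stand-in for the `ner_chunk` column: one list of annotations
# per input text, each annotation with a `result` and a `metadata` map.
rows = [
    [
        {"result": "N. dentiae", "metadata": {"entity": "SPECIES"}},
        {"result": "N. canis", "metadata": {"entity": "SPECIES"}},
    ],
    [
        {"result": "Spirochaeta litoralis", "metadata": {"entity": "SPECIES"}},
    ],
]


def flatten_chunks(rows):
    """Return one (chunk, entity) pair per annotation, across all rows."""
    return [
        (ann["result"], ann["metadata"]["entity"])
        for annotations in rows
        for ann in annotations
    ]


pairs = flatten_chunks(rows)
for chunk, entity in pairs:
    print(f"{chunk:<25}{entity}")
```

Two input texts with two and one annotations become three output pairs, which is the same reshaping that produces the `ner_chunk` / `entity` table.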

In [None]:
for model_name, result in results.items():

    # Flatten each (chunk text, metadata) pair into its own row
    res = result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                               result.ner_chunk.metadata)).alias("col"))\
                .select(F.expr("col['0']").alias("ner_chunk"),
                        F.expr("col['1']['entity']").alias("entity"))

    print("\n", model_name, "\n")

    # Highlight the detected entities in the first sample text only
    NerVisualizer().display(
        result = result.collect()[0],
        label_col = 'ner_chunk',
        document_col = 'document'
    )

    print("\n**********************************\n")

# Runs once, after the loop: shows the chunk table for the last model only
res.show(truncate=False)


 ner_bacterial_species 




**********************************


 bert_token_classifier_ner_bacteria 




**********************************

+------------------------------------------------+-------+
|ner_chunk                                       |entity |
+------------------------------------------------+-------+
|N. dentiae                                      |SPECIES|
|N. bacilliformis                                |SPECIES|
|N. canis                                        |SPECIES|
|Neisseria wadsworthii                           |SPECIES|
|Spirochaeta litoralis                           |SPECIES|
|S. isovalerica                                  |SPECIES|
|S. cellobiosiphila                              |SPECIES|
|Clostridiisalibacter paucivorans                |SPECIES|
|Thermohalobacter berrensis                      |SPECIES|
|Caloranaerobacter azorensis                     |SPECIES|
|Flavobacterium frigoris LMG 21922 (T) (93.3%    |SPECIES|
|Flavobacterium terrae R2A1 - 13 (T) (93.3%)     |SPECIES|
|Flavobacterium gelidilacus LMG 21477 (T) (93.1%)|SPECIES|
|Geleia sinica     