![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4TC.ipynb)

# `Medical Bert For Token Classification` for **Public Health Models**

# **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
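
The later cells read `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` from the uploaded license file, so it can help to confirm they are present before installing anything. The snippet below is a minimal sketch of such a check. `missing_license_keys` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration and are not part of Spark NLP. The example dictionary is made up, and a real license file may contain additional keys that this check ignores.

```python
# Hypothetical helper (not part of Spark NLP or the original notebook).
# Assumption: only the three keys referenced by the cells in this notebook are required.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in license_keys."""
    return [key for key in required if not license_keys.get(key)]


# Made-up, deliberately incomplete license dictionary for demonstration
example_keys = {"SECRET": "dummy-secret", "PUBLIC_VERSION": "4.2.8"}
print(missing_license_keys(example_keys))
```

In the notebook itself, one possible use would be to call `missing_license_keys(license_keys)` right after `json.load` and stop early if the returned list is not empty.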

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


# **General Function for MedicalBertForTokenClassifier Pipeline**

In [4]:
from sparknlp_display import NerVisualizer

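def run_pipeline(model, text, lang="en"):
  """Run a MedicalBertForTokenClassifier NER pipeline on a list of texts, then print and visualize the detected chunks."""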
  documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
  sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
    
  tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")
    
  tokenClassifier = MedicalBertForTokenClassifier.pretrained(model, lang, "clinical/models")\
    .setInputCols("token", "sentence")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)
    
  ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    
  pipeline =  Pipeline(
      stages=[
          documentAssembler,
          sentenceDetector,
          tokenizer,
          tokenClassifier,
          ner_converter
          ])

  df = spark.createDataFrame(text, StringType()).toDF("text")
  result = pipeline.fit(df).transform(df)
  
  print("\n")
  print("<----------------- MODEL NAME:","\033[1m" + model + "\033[0m"," ----------------- >")
  print("\n")

  result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                       result.ner_chunk.metadata)).alias("cols")) \
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

  print("\n")
  
  # Collect once instead of re-running the Spark job on every iteration
  rows = result.collect()
  for row in rows:
    NerVisualizer().display(
      result = row,
      label_col = 'ner_chunk',
      document_col = 'document'
    )
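
The `arrays_zip` and `explode` step inside `run_pipeline` may be easier to follow with a plain-Python analogue. The sketch below assumes each row carries two parallel lists, chunk texts and their metadata maps, and pairs them up one record per chunk. `flatten_chunks` is a hypothetical helper, and the rows are hand-written simplifications rather than real Spark NLP annotations.

```python
# Plain-Python analogue of the arrays_zip + explode step (illustration only).
# Assumption: each row is a dict with parallel "result" and "metadata" lists.
def flatten_chunks(rows):
    """Pair each chunk text with the 'entity' value of its metadata, one tuple per chunk."""
    flat = []
    for row in rows:
        # arrays_zip: align the i-th chunk text with the i-th metadata map
        for chunk, meta in zip(row["result"], row["metadata"]):
            # explode: emit one record per aligned pair
            flat.append((chunk, meta.get("entity")))
    return flat


# Hand-written example rows; the last one has no chunks and yields no records
rows = [
    {"result": ["sweating"], "metadata": [{"entity": "ADE"}]},
    {"result": ["depressed", "angry"], "metadata": [{"entity": "ADE"}, {"entity": "ADE"}]},
    {"result": [], "metadata": []},
]

for chunk, label in flatten_chunks(rows):
    print(f"{chunk:<12}{label}")
```

As with `explode` in Spark, a row whose chunk list is empty contributes nothing to the output, which is why texts without detected entities do not appear in the printed table.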

# **MODELS**

## **bert_token_classifier_ade_tweet_binary**

In [5]:
model = 'bert_token_classifier_ade_tweet_binary'

In [6]:
sample_texts = [
  """I understand you very well. :( just got 1st urgh ! humira worked for me for just 3months then got painful reactions.""",
  """This vyvanse got me sweating right now and i dont even know why!""",
  """Wonder which drug is doing this memory lapse thing. My guess the Duloxetine.""",
  """I used to be on paxil but that made me more depressed and prozac made me angry.""",
  """Maybe it's because of the effect of seroquel, but when I eat fast carbohydrates, I feel the sugar drop."""
]

In [7]:
run_pipeline(model, sample_texts)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
bert_token_classifier_ade_tweet_binary download started this may take some time.
[OK!]


<----------------- MODEL NAME: [1mbert_token_classifier_ade_tweet_binary[0m  ----------------- >


+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|painful reactions|ADE      |
|sweating         |ADE      |
|memory lapse     |ADE      |
|depressed        |ADE      |
|angry            |ADE      |
|sugar drop       |ADE      |
+-----------------+---------+





## **bert_token_classifier_disease_mentions_tweet**

In [8]:
model = 'bert_token_classifier_disease_mentions_tweet'

In [9]:
sample_texts = [
"""La ansiedad, la depresión, son dos trastornos emocionales graves, muy graves, a todos nos pueden llegar en cualquier momento de nuestras vidas y por muchas…""",
"""Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada.""",
"""El tabaquismo está detrás de un alto porcentaje de casos de cáncer y enfermedades cardiovasculares""",
"""Muchos pacientes vivimos sin tiroides por diferentes patologías Bocio Hipertiroidismo Carcinomas (papilar, folicular, medular o anaplásico) Tumores neuroendocrinos Laringectomizados Tomamos levotiroxina sódica.""",
"""El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No."""
]

In [10]:
run_pipeline(model, sample_texts, lang = 'es')

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
bert_token_classifier_disease_mentions_tweet download started this may take some time.
[OK!]


<----------------- MODEL NAME: [1mbert_token_classifier_disease_mentions_tweet[0m  ----------------- >


+----------------------------------------------------+----------+
|chunk                                               |ner_label |
+----------------------------------------------------+----------+
|ansiedad                                            |ENFERMEDAD|
|depresión                                           |ENFERMEDAD|
|trastornos emocionales graves                       |ENFERMEDAD|
|Sinusitis                                           |ENFERMEDAD|
|Faringitis aguda                                    |ENFERMEDAD|
|infección de orina                                  |ENFERMEDAD|
|tabaquismo                                          |ENFERMEDAD|
|cáncer                        