![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/PUBLIC_HEALTH_MB4TC.ipynb)

# `Medical Bert For Token Classification` for **Public Health Models**

# **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print("Please upload your John Snow Labs license using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart the notebook afterwards for the changes to take effect.

jsl.install()

## Start Session

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all the JARs the user has access to
spark = jsl.start()

# **General Function for MedicalBertForTokenClassifier Pipeline**

In [None]:
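# Explicit imports so the function does not depend on what the wildcard import exposes
from pyspark.ml import Pipeline
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from sparknlp_display import NerVisualizer

def run_pipeline(model, text, lang="en"):
  """Run a pretrained token classifier on a list of texts and display the detected entities."""
  # Wrap the raw text into a document annotation
  documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

  # Split each document into sentences
  sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

  # Split each sentence into tokens
  tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

  # Predict one NER tag per token with the requested pretrained model
  tokenClassifier = medical.BertForTokenClassifier.pretrained(model, lang, "clinical/models")\
    .setInputCols(["token", "sentence"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

  # Merge token-level tags into entity chunks
  ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

  pipeline = Pipeline(
      stages=[
          documentAssembler,
          sentenceDetector,
          tokenizer,
          tokenClassifier,
          ner_converter
      ])

  # One row per input text
  df = spark.createDataFrame(text, StringType()).toDF("text")
  result = pipeline.fit(df).transform(df)

  print("\n")
  print("<----------------- MODEL NAME:", "\033[1m" + model + "\033[0m", " ----------------- >")
  print("\n")

  # Pair each chunk with its metadata, then emit one row per (chunk, label) pair
  result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                       result.ner_chunk.metadata)).alias("cols")) \
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

  print("\n")

  # Collect once instead of once per loop iteration, then visualize each input text
  rows = result.collect()
  for row in rows:
    NerVisualizer().display(
      result = row,
      label_col = 'ner_chunk',
      document_col = 'document'
    )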
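
The `NerConverter` stage in the function above turns token-level tags into the entity chunks stored in `ner_chunk`. The block below is a minimal pure-Python sketch of that idea, assuming the classifier emits IOB-style tags such as `B-ADE`, `I-ADE`, and `O`. It is not Spark NLP's implementation: `iob_to_chunks` and the sample tokens and tags are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-ADE" -> ("B", "-", "ADE"); "O" -> ("O", "", "")
        prefix, _, label = tag.partition("-")

        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: close the open chunk, if any
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the open entity
            current_tokens.append(token)
        else:
            # Outside any entity: close the open chunk, if any
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Close a chunk that runs to the end of the sentence
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hypothetical tokens and tags, made up for this example
tokens = ["then", "got", "painful", "reactions", "."]
tags = ["O", "O", "B-ADE", "I-ADE", "O"]
print(iob_to_chunks(tokens, tags))
```

Joining tokens with a single space is a simplification. The real stage works from character offsets in the original text, so its chunks keep the source spacing.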

# **MODELS**

## **bert_token_classifier_ade_tweet_binary**

In [None]:
model = 'bert_token_classifier_ade_tweet_binary'

In [None]:
sample_texts = [
  """I understand you very well. :( just got 1st urgh ! humira worked for me for just 3months then got painful reactions.""",
  """This vyvanse got me sweating right now and i dont even know why!""",
  """Wonder which drug is doing this memory lapse thing. My guess the Duloxetine.""",
  """I used to be on paxil but that made me more depressed and prozac made me angry.""",
  """Maybe it's because of the effect of seroquel, but when I eat fast carbohydrates, I feel the sugar drop."""
]

In [None]:
run_pipeline(model, sample_texts)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
bert_token_classifier_ade_tweet_binary download started this may take some time.
[OK!]


<----------------- MODEL NAME: bert_token_classifier_ade_tweet_binary  ----------------- >


+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|painful reactions|ADE      |
|sweating         |ADE      |
|memory lapse     |ADE      |
|depressed        |ADE      |
|angry            |ADE      |
|sugar drop       |ADE      |
+-----------------+---------+
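
The table above is built by pairing each chunk with its metadata (`F.arrays_zip`) and then emitting one row per pair (`F.explode`). As a rough pure-Python analogue, the sketch below assumes each document's `ner_chunk` output is represented as a hypothetical dict with parallel `result` and `metadata` lists. The `flatten_chunks` helper and the sample rows are illustrative, not Spark or Spark NLP objects.

```python
from typing import Dict, List, Tuple


def flatten_chunks(rows: List[Dict[str, list]]) -> List[Tuple[str, str]]:
    """Emit one (chunk, ner_label) pair per detected entity across all rows.

    Hypothetical helper mirroring arrays_zip + explode in plain Python.
    """
    pairs: List[Tuple[str, str]] = []
    for row in rows:
        # zip plays the role of arrays_zip: align chunk text with its metadata
        for chunk, meta in zip(row["result"], row["metadata"]):
            # appending one pair per element plays the role of explode
            pairs.append((chunk, meta["entity"]))
    return pairs


# Hypothetical rows shaped like the per-document ner_chunk output
rows = [
    {"result": ["sweating"], "metadata": [{"entity": "ADE"}]},
    {"result": ["depressed", "angry"],
     "metadata": [{"entity": "ADE"}, {"entity": "ADE"}]},
    {"result": [], "metadata": []},  # a text with no detected entities
]

for chunk, label in flatten_chunks(rows):
    print(f"{chunk:<12}{label}")
```

A text with no entities contributes no rows, so the table can have fewer or more rows than there are input texts.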





## **bert_token_classifier_disease_mentions_tweet**

In [None]:
model = 'bert_token_classifier_disease_mentions_tweet'

In [None]:
sample_texts = [
"""La ansiedad, la depresión, son dos trastornos emocionales graves, muy graves, a todos nos pueden llegar en cualquier momento de nuestras vidas y por muchas…""",
"""Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada.""",
"""El tabaquismo está detrás de un alto porcentaje de casos de cáncer y enfermedades cardiovasculares""",
"""Muchos pacientes vivimos sin tiroides por diferentes patologías Bocio Hipertiroidismo Carcinomas (papilar, folicular, medular o anaplásico) Tumores neuroendocrinos Laringectomizados Tomamos levotiroxina sódica.""",
"""El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No."""
]

In [None]:
run_pipeline(model, sample_texts, lang = 'es')

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
bert_token_classifier_disease_mentions_tweet download started this may take some time.
[OK!]


<----------------- MODEL NAME: bert_token_classifier_disease_mentions_tweet  ----------------- >


+----------------------------------------------------+----------+
|chunk                                               |ner_label |
+----------------------------------------------------+----------+
|ansiedad                                            |ENFERMEDAD|
|depresión                                           |ENFERMEDAD|
|trastornos emocionales graves                       |ENFERMEDAD|
|Sinusitis                                           |ENFERMEDAD|
|Faringitis aguda                                    |ENFERMEDAD|
|infección de orina                                  |ENFERMEDAD|
|tabaquismo                                          |ENFERMEDAD|
|cáncer                        