[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb)

# Introduction

We identify topics in a set of about 1,000 publications on equine colic, retrieved via Google Scholar. The two data files contain either titles only, or titles augmented with truncated abstracts when available.

In the first stage we work with pretrained models and pipelines from Spark NLP and Healthcare Spark NLP. 



# Preliminaries

## Workspace Setup

In [2]:
!pip install -q stylecloud



In [3]:
# Importing the necessary libraries
import json
import os

import numpy as np
import pandas as pd

from wordcloud import (
    WordCloud,
    ImageColorGenerator,
    )
import stylecloud

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

colors = sns.color_palette('PuBuGn')

# Options to display pandas dataframes
pd.options.display.max_colwidth = None

In [4]:
# License key settings necessary to work in Healthcare Spark NLP
# as given in the JSL notebooks

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

Saving spark_nlp_for_healthcare_spark_ocr_4811.json to spark_nlp_for_healthcare_spark_ocr_4811.json
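
The later cells read three entries from the license file: `SECRET`, `JSL_VERSION` and `PUBLIC_VERSION`. A minimal sketch of a check that fails early when one of them is missing; the helper name `check_license_keys` and the example payload are hypothetical and not part of Spark NLP:

```python
import json

# Keys that the installation and session cells of this notebook read.
REQUIRED_KEYS = ("SECRET", "JSL_VERSION", "PUBLIC_VERSION")

def check_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required keys that are missing or empty in `keys`."""
    return [k for k in required if not keys.get(k)]

# Example with a made-up license payload (values are placeholders).
example = json.loads('{"SECRET": "xxx", "JSL_VERSION": "3.5.0"}')
missing = check_license_keys(example)
print(missing)
```

Running such a check right after `json.load` would surface an incomplete license file before the `pip install` cells expand empty variables.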


In [5]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display


In [6]:
# Importing Spark libraries and modules

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from sparknlp.pretrained import PretrainedPipeline

from sparknlp_display import NerVisualizer
visualiser = NerVisualizer()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

# Settings and parameters for the Spark session
# As included in JSL notebooks

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

# Starting Healthcare Spark NLP session
spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)
spark

Spark NLP Version : 3.4.2
Spark NLP_JSL Version : 3.5.0


## Data Import

In [7]:
# Importing the collection of titles
!wget -q https://raw.githubusercontent.com/SolanaO/SparkNLP_Study/master/data/horse_titles.csv
# Save as a spark dataframe
titles_df = spark.createDataFrame(pd.read_csv("horse_titles.csv", index_col=0).reset_index())
# Inspect the data
titles_df.show(4, truncate=False)

+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
|index|text                                                                                                                                        |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
|0    |Prospective study of equine colic risk factors                                                                                              |
|1    | Dietary and other management factors associated with equine colic                                                                          |
|2    |Prospective study of equine colic incidence and mortality                                                                                   |
|3    |Case-control study of the association between various management factors and development of colic i

In [8]:
# Importing the collection of augmented titles
!wget -q https://raw.githubusercontent.com/SolanaO/SparkNLP_Study/master/data/horse_augm_titles.csv
# Save as a spark dataframe
augm_df = spark.createDataFrame(pd.read_csv('horse_augm_titles.csv', index_col=0).reset_index())
# Inspect the data
augm_df.show(4, truncate=False)

+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|index|text                                                                                                                                                                                                                                                                                                                            |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |Prospe

In [61]:
# Importing the three sample texts
!wget -q https://raw.githubusercontent.com/SolanaO/SparkNLP_Study/master/data/sample_text_1.txt
!wget -q https://raw.githubusercontent.com/SolanaO/SparkNLP_Study/master/data/sample_text_2.txt
!wget -q https://raw.githubusercontent.com/SolanaO/SparkNLP_Study/master/data/sample_text_short.txt

In [10]:
#sample_text_1 = spark.read.text('./sample_text_1.txt')
#sample_text_2 = spark.read.text('./sample_text_2.txt')
#sample_text_short = spark.read.text('./sample_text_short.txt')


In [62]:
# Read the text files
with open('./sample_text_1.txt') as f:
    sample_text_1 = f.readlines()

with open('./sample_text_2.txt') as f:
    sample_text_2 = f.readlines()

with open('./sample_text_short.txt') as f:
    sample_text_short = f.readlines()
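
`readlines()` returns a list with one string per line, while `display_results` (defined below) reads only `result[0]`. Assuming `fullAnnotate` returns one result per list element, only the first line of a multi-line file would appear in the tables. A minimal sketch of collapsing the lines into a single string first; `join_lines` is a hypothetical helper, and character offsets computed on the joined text will differ from offsets in the raw file:

```python
def join_lines(lines):
    """Collapse a list of raw lines into one whitespace-normalised string."""
    return " ".join(line.strip() for line in lines if line.strip())

# Two lines taken from sample_text_short as printed later in the notebook.
lines = [
    "and also has regional access to coastal Bermuda hay. Other causes can be obstruction by\n",
    "ascarids (Parascaris equorum), usually occurring at 3–5 months of age right after deworming. \n",
]
text = join_lines(lines)
print(text)
```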

## Useful Functions

In [12]:
def display_results(result):
    '''
    Function to extract results from .fullAnnotate() of the
    LightPipeline as a pandas dataframe.
    '''
    
    chunks = []
    entities = []
    sentence= []
    begin = []
    end = []

    for n in result[0]['ner_chunk']:
        
        begin.append(n.begin)
        end.append(n.end)
        chunks.append(n.result)
        entities.append(n.metadata['entity']) 
        sentence.append(n.metadata['sentence'])
    

    df_results = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities})

    return df_results
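
The function relies on only four attributes of each annotation: `begin`, `end`, `result` and `metadata`. The sketch below reproduces the same extraction on stand-in objects, so the logic can be tried without a Spark session. `FakeAnnotation` and `chunks_to_rows` are hypothetical names introduced for illustration; passing the rows to `pd.DataFrame` would give the same table.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation; only the fields used here.
FakeAnnotation = namedtuple("FakeAnnotation", "begin end result metadata")

def chunks_to_rows(result, column="ner_chunk"):
    """Flatten the annotations of the first document into a list of dicts."""
    return [
        {
            "chunks": a.result,
            "begin": a.begin,
            "end": a.end,
            "sentence_id": a.metadata["sentence"],
            "entities": a.metadata["entity"],
        }
        for a in result[0][column]
    ]

fake_result = [{
    "ner_chunk": [
        FakeAnnotation(246, 260, "ileal impaction",
                       {"entity": "PROBLEM", "sentence": "0"}),
    ]
}]
rows = chunks_to_rows(fake_result)
print(rows)
```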

# Test Pretrained Pipelines on Short Text

## Key Pretrained Pipelines

In [13]:
# Text to analyze
sample_text_short

['Coastal Bermuda hay is associated with impactions in this most distal segment of the small intestine, although it is difficult to separate this risk factor from geographic location, since the southeastern United States has a higher prevalence of ileal impaction\n',
 'and also has regional access to coastal Bermuda hay. Other causes can be obstruction by\n',
 'ascarids (Parascaris equorum), usually occurring at 3–5 months of age right after deworming. \n']

In [14]:
# Check the clinical_ner pretrained pipeline

pipeline = PretrainedPipeline("ner_clinical_pipeline", "en", "clinical/models")

test_ner_clinical = display_results(pipeline.fullAnnotate(sample_text_short))

ner_clinical_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [15]:
# Check the ner_jsl pretrained pipeline

pipeline = PretrainedPipeline("ner_jsl_pipeline", "en", "clinical/models")

test_ner_jsl = display_results(pipeline.fullAnnotate(sample_text_short))

ner_jsl_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [16]:
# Check the ner_jsl with bert_embeddings pretrained pipeline

pipeline = PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models")

test_ner_jsl_biobert = display_results(pipeline.fullAnnotate(sample_text_short))

ner_jsl_biobert_pipeline download started this may take some time.
Approx size to download 403.2 MB
[OK!]


Display the results side by side:

In [17]:
from google.colab import widgets

t = widgets.TabBar(["ner_clinical", "ner_jsl", "ner_bert"])

with t.output_to(0):
    display(test_ner_clinical)

with t.output_to(1):
    display(test_ner_jsl)

with t.output_to(2):
    display(test_ner_jsl_biobert)

Tab ner_clinical:

Unnamed: 0,chunks,begin,end,sentence_id,entities
0,impactions in this most distal segment,39,76,0,PROBLEM
1,geographic location,161,179,0,PROBLEM
2,ileal impaction,246,260,0,PROBLEM


Tab ner_jsl:

Unnamed: 0,chunks,begin,end,sentence_id,entities
0,Coastal Bermuda hay,0,18,0,Disease_Syndrome_Disorder
1,impactions in this most distal segment of the small intestine,39,99,0,Symptom
2,ileal impaction,246,260,0,Disease_Syndrome_Disorder


Tab ner_bert:

Unnamed: 0,chunks,begin,end,sentence_id,entities
0,impactions,39,48,0,Symptom
1,distal,63,68,0,Direction
2,small intestine,85,99,0,Internal_organ_or_component
3,ileal impaction,246,260,0,Disease_Syndrome_Disorder
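
One way to compare the three outputs is to look at the character spans each pipeline reports. The sketch below uses the `(begin, end)` pairs from the three tables and plain set operations. It matches spans by exact offsets only, so overlapping chunks with different boundaries count as different.

```python
# (begin, end) spans copied from the three result tables above.
spans = {
    "ner_clinical": {(39, 76), (161, 179), (246, 260)},
    "ner_jsl": {(0, 18), (39, 99), (246, 260)},
    "ner_bert": {(39, 48), (63, 68), (85, 99), (246, 260)},
}

# Spans reported with identical offsets by every pipeline.
common = set.intersection(*spans.values())

# Spans reported by exactly one pipeline.
unique = {
    name: found - set().union(*(s for n, s in spans.items() if n != name))
    for name, found in spans.items()
}
print(common)
print(unique)
```

On this text the only span shared by all three pipelines is `(246, 260)`, the chunk "ileal impaction".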



## Use Pretrained Model Finder

In [18]:
# Find a pretrained model that identifies organisms and geographical locations
ner_pipeline = PretrainedPipeline("ner_model_finder", "en", "clinical/models")

result = ner_pipeline.annotate("Bermuda hay")
result

ner_model_finder download started this may take some time.
Approx size to download 148.6 MB
[OK!]


{'model_names': ["['ner_medmentions_coarse']"]}
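
The model finder returns the suggested names as a stringified list nested inside a list. A small sketch of turning that into plain model names, assuming every entry is a Python-style list literal like the one shown above; `parse_model_names` is a hypothetical helper:

```python
import ast

# Output copied from the cell above.
finder_output = {"model_names": ["['ner_medmentions_coarse']"]}

def parse_model_names(output):
    """Flatten the stringified lists under 'model_names' into model names."""
    names = []
    for item in output["model_names"]:
        names.extend(ast.literal_eval(item))
    return names

model_names = parse_model_names(finder_output)
print(model_names)
```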

In [19]:
# Check the suggested pipeline

pipeline = PretrainedPipeline("ner_medmentions_coarse_pipeline", "en", "clinical/models")

test_ner_med = display_results(pipeline.fullAnnotate(sample_text_short))
test_ner_med

ner_medmentions_coarse_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


Unnamed: 0,chunks,begin,end,sentence_id,entities
0,Coastal Bermuda hay,0,18,0,Food
1,associated with,23,37,0,Qualitative_Concept
2,impactions,39,48,0,Disease_or_Syndrome
3,distal segment,63,76,0,Body_Location_or_Region
4,small intestine,85,99,0,"Body_Part,_Organ,_or_Organ_Component"
5,geographic location,161,179,0,Spatial_Concept
6,southeastern United States,192,217,0,Geographic_Area
7,prevalence,232,241,0,Quantitative_Concept
8,ileal impaction,246,260,0,Disease_or_Syndrome


# Build NER Pipeline

In [20]:
# Prepare data into a format processable by Spark NLP
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Detect sentence boundaries in healthcare texts using DL
sentence_detector = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split sentences into tokens in the format expected by downstream annotators
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Maps sentences and tokens to 200 dim vectors 
word_embeddings = WordEmbeddingsModel \
    .pretrained('embeddings_clinical', "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")

# Named entity recognition annotator
med_ner = MedicalNerModel \
    .pretrained("ner_medmentions_coarse", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("ner_med")

# Convert IOB or IOB2 representation to a user friendly one
ner_converter_1 = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_med"]) \
    .setOutputCol("ner_med_chunk")

# Map tokens and sentences to 768 dim vectors using Bert
bert_embeddings = BertEmbeddings \
    .pretrained("biobert_pubmed_base_cased", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("bert_embeddings")

# Named entity recognition annotator based on Bert embeddings
bert_ner = MedicalNerModel \
    .pretrained('ner_jsl_biobert', "en", "clinical/models") \
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_bert")

# Convert IOB or IOB2 representation to a user friendly one
ner_converter_2 = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_bert"]) \
    .setOutputCol("ner_bert_chunk")

# Annotator to combine the chunks from the two NER models
chunk_merger = ChunkMergeApproach()\
    .setInputCols('ner_med_chunk', "ner_bert_chunk")\
    .setOutputCol('ner_chunk')

# Combine all the steps in a pipeline
ner_pipeline = Pipeline(stages = [
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    med_ner,
    ner_converter_1,
    bert_embeddings,
    bert_ner,
    ner_converter_2,
    chunk_merger
    ])

# Create an empty spark dataframe
empty_df = spark.createDataFrame([['']]).toDF("text")

# Create a pipeline model object
ner_pipe_model = ner_pipeline.fit(empty_df)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_medmentions_coarse download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
ner_jsl_biobert download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
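
Each stage reads columns produced by earlier stages, so a mistyped column name only shows up when the pipeline is fitted. The sketch below checks that wiring with plain Python, using the input and output columns set in the cell above; `unresolved_inputs` is a hypothetical helper, not a Spark NLP function.

```python
# (stage name, input columns, output column), copied from the cell above.
STAGES = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings", ["sentence", "token"], "word_embeddings"),
    ("med_ner", ["sentence", "token", "word_embeddings"], "ner_med"),
    ("ner_converter_1", ["sentence", "token", "ner_med"], "ner_med_chunk"),
    ("bert_embeddings", ["sentence", "token"], "bert_embeddings"),
    ("bert_ner", ["sentence", "token", "bert_embeddings"], "ner_bert"),
    ("ner_converter_2", ["sentence", "token", "ner_bert"], "ner_bert_chunk"),
    ("chunk_merger", ["ner_med_chunk", "ner_bert_chunk"], "ner_chunk"),
]

def unresolved_inputs(stages, initial=("text",)):
    """List (stage, column) pairs whose input is not produced earlier."""
    available = set(initial)
    problems = []
    for name, inputs, output in stages:
        problems += [(name, col) for col in inputs if col not in available]
        available.add(output)
    return problems

problems = unresolved_inputs(STAGES)
print(problems)
```

An empty list means every input column is produced by an earlier stage.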


In [21]:
# Create a LightPipeline model
light_ner_pipe_model = LightPipeline(ner_pipe_model)

In [22]:
# Print the classes for ner_medmentions_coarse
ner_med_classes = list(MedicalNerModel.pretrained('ner_medmentions_coarse', "en", "clinical/models").getClasses())
print(ner_med_classes)

ner_medmentions_coarse download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
['O', 'B-Qualitative_Concept', 'B-Mental_Process', 'B-Health_Care_Activity', 'I-Health_Care_Activity', 'B-Professional_or_Occupational_Group', 'B-Population_Group', 'I-Population_Group', 'I-Group', 'B-Pharmacologic_Substance', 'B-Research_Activity', 'B-Medical_Device', 'B-Diagnostic_Procedure', 'B-Molecular_Function', 'B-Spatial_Concept', 'B-Organic_Chemical', 'I-Organic_Chemical', 'B-Amino_Acid,_Peptide,_or_Protein', 'I-Amino_Acid,_Peptide,_or_Protein', 'B-Disease_or_Syndrome', 'I-Disease_or_Syndrome', 'B-Daily_or_Recreational_Activity', 'B-Quantitative_Concept', 'B-Biologic_Function', 'I-Daily_or_Recreational_Activity', 'I-Quantitative_Concept', 'B-Organism_Attribute', 'B-Clinical_Attribute', 'I-Clinical_Attribute', 'B-Pathologic_Function', 'B-Eukaryote', 'I-Eukaryote', 'B-Body_Part,_Organ,_or_Organ_Component', 'B-Anatomical_Structure', 'I-Anatomical_Structure', 'B-Cell_Compone

In [23]:
len(ner_med_classes)

109
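
The 109 classes are IOB tags, so most entity types appear twice (once as `B-`, once as `I-`), next to the `O` tag. A sketch of reducing such a list to distinct entity types; `entity_types` is a hypothetical helper and the sample holds only a few labels from the printed list:

```python
def entity_types(classes):
    """Strip the IOB prefix ('B-' or 'I-') and return sorted unique types."""
    types = set()
    for label in classes:
        if label == "O":  # 'O' marks tokens outside any entity
            continue
        prefix, _, name = label.partition("-")
        types.add(name if prefix in ("B", "I") else label)
    return sorted(types)

# A few labels copied from the printed list above.
sample = ["O", "B-Health_Care_Activity", "I-Health_Care_Activity",
          "B-Disease_or_Syndrome", "I-Disease_or_Syndrome",
          "B-Amino_Acid,_Peptide,_or_Protein"]
types = entity_types(sample)
print(types)
```

Applied to `ner_med_classes`, `len(entity_types(ner_med_classes))` would give the number of entity types the model distinguishes.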

In [24]:
# Print the classes for ner_jsl_biobert
ner_bert_classes = list(MedicalNerModel.pretrained('ner_jsl_biobert', "en", "clinical/models").getClasses())
print(ner_bert_classes)

ner_jsl_biobert download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
['O', 'B-Injury_or_Poisoning', 'B-Direction', 'B-Test', 'I-Route', 'B-Admission_Discharge', 'I-Tumor_Finding', 'B-Death_Entity', 'I-Oxygen_Therapy', 'B-Relationship_Status', 'I-Drug_BrandName', 'B-Duration', 'I-Alcohol', 'I-Triglycerides', 'I-Date', 'B-Hyperlipidemia', 'B-Respiration', 'I-Test', 'B-Birth_Entity', 'I-VS_Finding', 'B-Staging', 'B-Age', 'I-Social_History_Header', 'B-Labour_Delivery', 'I-Medical_Device', 'B-Family_History_Header', 'I-Female_Reproductive_Status', 'I-Metastasis', 'B-BMI', 'I-Fetus_NewBorn', 'I-BMI', 'B-Temperature', 'I-Section_Header', 'I-Communicable_Disease', 'I-ImagingFindings', 'I-Psychological_Condition', 'I-Obesity', 'B-Metastasis', 'I-Sexually_Active_or_Sexual_Orientation', 'I-Modifier', 'B-Alcohol', 'I-Temperature', 'I-Vaccine', 'I-Symptom', 'I-Pulse', 'B-Kidney_Disease', 'B-Oncological', 'I-EKG_Findings', 'B-Medical_History_Header', 'I-Relationship_S

In [25]:
len(ner_bert_classes)

170
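
The converter stages in the pipeline turn these token-level IOB tags into chunks. The sketch below illustrates the general idea on hand-written tokens and tags. It is not the implementation of `NerConverterInternal`, and the tag `Body_Part` is shortened for the example.

```python
def iob_to_chunks(tokens, tags):
    """Group tokens into (text, entity) chunks from IOB tags."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B" or (prefix == "I" and name != label):
            # A new chunk starts; close the previous one first.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], name
        elif prefix == "I":
            current.append(token)
        else:  # 'O': outside any entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Tokens and tags written by hand for illustration.
tokens = ["ileal", "impaction", "and", "small", "intestine"]
tags = ["B-Disease_or_Syndrome", "I-Disease_or_Syndrome", "O",
        "B-Body_Part", "I-Body_Part"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```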

## Apply NER Pipeline on Text Samples

In [64]:
result_text_short = display_results(light_ner_pipe_model.fullAnnotate(sample_text_short))
result_text_short

Unnamed: 0,chunks,begin,end,sentence_id,entities
0,Coastal Bermuda hay,0,18,0,Food
1,associated with,23,37,0,Qualitative_Concept
2,impactions,39,48,0,Disease_or_Syndrome
3,distal segment,63,76,0,Body_Location_or_Region
4,small intestine,85,99,0,"Body_Part,_Organ,_or_Organ_Component"
5,geographic location,161,179,0,Spatial_Concept
6,southeastern United States,192,217,0,Geographic_Area
7,prevalence,232,241,0,Quantitative_Concept
8,ileal impaction,246,260,0,Disease_or_Syndrome
9,regional access,275,289,0,Spatial_Concept


In [71]:
# Create a graphical representation
result_short = light_ner_pipe_model.fullAnnotate(sample_text_short)

visualiser.display(result_short[0], 
                   label_col='ner_chunk', 
                   document_col='document', 
                   save_path="display_ner_text_short.html")

Check two larger samples of text using the NER light pipeline model.

In [30]:
# Get results on two larger samples of text

# Annotate each sample once and reuse the result for the table
ann_result_1 = light_ner_pipe_model.fullAnnotate(sample_text_1)
result_text_1 = display_results(ann_result_1)

ann_result_2 = light_ner_pipe_model.fullAnnotate(sample_text_2)
result_text_2 = display_results(ann_result_2)

In [31]:
from google.colab import widgets

t = widgets.TabBar(["result_text_1", "result_text_2",
                    "viz_text_1", "viz_text_2"])

with t.output_to(0):
    display(result_text_1)

with t.output_to(1):
    display(result_text_2)

# Initialize visualizer
visualiser = NerVisualizer()

with t.output_to(2):
    visualiser.display(ann_result_1[0],
                       label_col='ner_chunk', document_col='document')

with t.output_to(3):
    visualiser.display(ann_result_2[0], 
                       label_col='ner_chunk', document_col='document')

Tab result_text_1:

Unnamed: 0,chunks,begin,end,sentence_id,entities
0,Ileal impaction,0,14,0,Disease_or_Syndrome
1,obstruction of ingesta,33,54,0,Disease_or_Syndrome
2,Coastal Bermuda hay,57,75,1,Food
3,associated with,80,94,1,Qualitative_Concept
4,impactions,96,105,1,Disease_or_Syndrome
5,distal segment,120,133,1,Body_Location_or_Region
6,small intestine,142,156,1,"Body_Part,_Organ,_or_Organ_Component"
7,geographic location,218,236,1,Spatial_Concept
8,southeastern United States,249,274,1,Geographic_Area
9,prevalence,289,298,1,Quantitative_Concept


Tab result_text_2:

Unnamed: 0,chunks,begin,end,sentence_id,entities
0,Gas colic,0,8,0,Sign_or_Symptom
1,tympanic colic,25,38,0,Disease_or_Syndrome
2,gas buildup,58,68,0,Symptom
3,horse's digestive tract,81,103,0,Body_System
4,excessive fermentation,112,133,0,Symptom
5,intestines,146,155,0,"Body_Part,_Organ,_or_Organ_Component"
6,decreased ability to move gas,162,190,0,Symptom
7,diet,244,247,1,Food
8,dietary roughage,280,295,1,Food
9,parasites,305,313,1,Eukaryote


In [32]:
result_text_1.nunique()

chunks         39
begin          44
end            44
sentence_id     6
entities       22
dtype: int64

In [33]:
# Preview the chunks for the second sentence in sample_text_1
result_text_1[result_text_1.sentence_id == '1']

Unnamed: 0,chunks,begin,end,sentence_id,entities
2,Coastal Bermuda hay,57,75,1,Food
3,associated with,80,94,1,Qualitative_Concept
4,impactions,96,105,1,Disease_or_Syndrome
5,distal segment,120,133,1,Body_Location_or_Region
6,small intestine,142,156,1,"Body_Part,_Organ,_or_Organ_Component"
7,geographic location,218,236,1,Spatial_Concept
8,southeastern United States,249,274,1,Geographic_Area
9,prevalence,289,298,1,Quantitative_Concept
10,ileal impaction,303,317,1,Disease_or_Syndrome
11,regional access,332,346,1,Spatial_Concept
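
A quick way to summarise a sentence is to count how often each entity label occurs. The sketch below does this with the labels from the table above; on the dataframe itself, `result_text_1.entities.value_counts()` would serve the same purpose.

```python
from collections import Counter

# Entity labels of sentence 1, copied from the table above.
labels = [
    "Food", "Qualitative_Concept", "Disease_or_Syndrome",
    "Body_Location_or_Region", "Body_Part,_Organ,_or_Organ_Component",
    "Spatial_Concept", "Geographic_Area", "Quantitative_Concept",
    "Disease_or_Syndrome", "Spatial_Concept",
]
counts = Counter(labels)
print(counts.most_common(2))
```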


# Chunk Key Phrase Extraction Pipeline

Get key phrases from NER chunks with ChunkKeyPhraseExtraction annotator.

In [34]:
ner_key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(4) \
    .setDivergence(0.4)\
    .setInputCols(["sentence", "ner_med_chunk"])\
    .setOutputCol("ner_key_phrases")

key_phrase_pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    word_embeddings, 
    med_ner, 
    ner_converter_1, 
    ner_key_phrase_extractor
])


sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [35]:
# Create a key phrase pipeline model object
key_phrase_pipe_model = key_phrase_pipeline.fit(empty_df)

In [36]:

# Create a LightPipeline model
light_key_phrase_pipe_model = LightPipeline(key_phrase_pipe_model)

In [72]:
# Check the output on the short text
key_text_short = light_key_phrase_pipe_model.fullAnnotate(sample_text_short)

In [92]:
key_text_short[0]['ner_key_phrases']

[Annotation(chunk, 0, 18, Coastal Bermuda hay, {'chunk': '0', 'confidence': '0.43556666', 'DocumentSimilarity': '0.5944551135989127', 'MMRScore': '0.35667308233226197', 'ner_source': 'ner_med_chunk', 'entity': 'Food', 'sentence': '0'}),
 Annotation(chunk, 246, 260, ileal impaction, {'chunk': '8', 'confidence': '0.6582', 'DocumentSimilarity': '0.4770488626849328', 'MMRScore': '0.20274086344063413', 'ner_source': 'ner_med_chunk', 'entity': 'Disease_or_Syndrome', 'sentence': '0'}),
 Annotation(chunk, 294, 308, coastal Bermuda, {'chunk': '10', 'confidence': '0.46615002', 'DocumentSimilarity': '0.43987212652344726', 'MMRScore': '0.01138050757291198', 'ner_source': 'ner_med_chunk', 'entity': 'Geographic_Area', 'sentence': '0'}),
 Annotation(chunk, 192, 217, southeastern United States, {'chunk': '6', 'confidence': '0.9655333', 'DocumentSimilarity': '0.23619331710093222', 'MMRScore': '0.016414123474636416', 'ner_source': 'ner_med_chunk', 'entity': 'Geographic_Area', 'sentence': '0'})]
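
The extractor is configured with `setTopN(4)` and `setDivergence(0.4)`, and the `MMRScore` of the first phrase matches `(1 - 0.4) * DocumentSimilarity` to about seven digits. This suggests a maximal marginal relevance (MMR) style selection, where each new phrase is rewarded for similarity to the document and penalised for similarity to phrases already chosen. The sketch below is a generic greedy MMR under that assumption, not the annotator's implementation; `mmr_select` is a hypothetical function and the similarity values are invented.

```python
def mmr_select(doc_sim, pair_sim, top_n, divergence):
    """Greedy maximal marginal relevance over candidate indices.

    doc_sim[i]     : similarity of candidate i to the document
    pair_sim[i][j] : similarity between candidates i and j
    """
    selected, scores = [], []
    remaining = list(range(len(doc_sim)))
    while remaining and len(selected) < top_n:
        def score(i):
            # Penalty: highest similarity to any already selected candidate.
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return (1 - divergence) * doc_sim[i] - divergence * redundancy
        best = max(remaining, key=score)
        scores.append(score(best))  # score before `best` joins `selected`
        selected.append(best)
        remaining.remove(best)
    return selected, scores

# Toy values, invented for illustration.
doc_sim = [0.9, 0.85, 0.3]
pair_sim = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.2],
            [0.1, 0.2, 1.0]]
selected, scores = mmr_select(doc_sim, pair_sim, top_n=2, divergence=0.4)
print(selected, scores)
```

In the toy data the second candidate is nearly as relevant as the first but almost identical to it, so the less relevant but more distinct third candidate is picked instead.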

In [110]:
pd.set_option('display.precision', 3)

def display_key_phrase_results(result):
    '''
    Function to extract the key phrases returned by .fullAnnotate() of the
    LightPipeline, together with their scores, as a pandas dataframe.
    '''
    
    chunks = []
    entities = []
    sentence= []
    begin = []
    end = []
    docsim = []
    confidence = []
    score = []

    for n in result[0]['ner_key_phrases']:
        
        begin.append(n.begin)
        end.append(n.end)
        chunks.append(n.result)
        entities.append(n.metadata['entity']) 
        sentence.append(n.metadata['sentence'])
        docsim.append(n.metadata['DocumentSimilarity'])
        confidence.append(n.metadata['confidence']) 
        score.append(n.metadata['MMRScore'])

    df_results = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities,
                   'docsim': docsim, 'confidence': confidence,
                   'MMRScore': score})

    return df_results

In [111]:
display_key_phrase_results(key_text_short)

Unnamed: 0,chunks,begin,end,sentence_id,entities,docsim,confidence,MMRScore
0,Coastal Bermuda hay,0,18,0,Food,0.5944551135989127,0.43556666,0.3566730823322619
1,ileal impaction,246,260,0,Disease_or_Syndrome,0.4770488626849328,0.6582,0.2027408634406341
2,coastal Bermuda,294,308,0,Geographic_Area,0.4398721265234472,0.46615002,0.0113805075729119
3,southeastern United States,192,217,0,Geographic_Area,0.2361933171009322,0.9655333,0.0164141234746364
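
The score columns are printed with full precision even though `display.precision` is set to 3. A likely reason is that annotation metadata stores the scores as strings, as the quoted values in the raw output indicate. A sketch of converting them to rounded floats; `to_rounded_floats` is a hypothetical helper, and on the dataframe the equivalent step would be casting the three columns with `astype(float)`.

```python
# Metadata values as they appear in the annotation output above.
metadata = {"confidence": "0.43556666",
            "DocumentSimilarity": "0.5944551135989127",
            "MMRScore": "0.35667308233226197"}

def to_rounded_floats(meta, keys, ndigits=3):
    """Convert selected string-valued metadata fields to rounded floats."""
    return {k: round(float(meta[k]), ndigits) for k in keys}

rounded = to_rounded_floats(
    metadata, ["confidence", "DocumentSimilarity", "MMRScore"])
print(rounded)
```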


In [None]:
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------------------------+---------+
|chunk                    |ner_label|
+-------------------------+---------+
|Mesothelioma             |PROBLEM  |
|pleural effusion         |PROBLEM  |
|atrial fibrillation      |PROBLEM  |
|anemia                   |PROBLEM  |
|ascites                  |PROBLEM  |
|esophageal reflux        |PROBLEM  |
|deep venous thrombosis   |PROBLEM  |
|Mesothelioma             |PROBLEM  |
|Pleural effusion         |PROBLEM  |
|atrial fibrillation      |PROBLEM  |
|anemia                   |PROBLEM  |
|ascites                  |PROBLEM  |
|esophageal reflux        |PROBLEM  |
|deep venous thrombosis   |PROBLEM  |
|decortication of the lung|TREATMENT|
|pleural biopsy           |TEST     |
|transpleural fluoroscopy |TEST     |
|thoracentesis            |TREATMENT|
|Port-A-Cath placement    |TREATMENT|
|a nonproductive cough    |PROBLEM  |
+-------------------------+---------+
only showing top 20 rows



### with LightPipelines

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") 

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("jsl_ner_chunk")

jsl_ner_pipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    jsl_ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

jsl_ner_model = jsl_ner_pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


In [None]:
print(text)

jsl_light_model = LightPipeline(jsl_ner_model)

jsl_light_result = jsl_light_model.fullAnnotate(text)


chunks = []
entities = []
sentence= []
begin = []
end = []

for n in jsl_light_result[0]['jsl_ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    
import pandas as pd

jsl_df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities})

jsl_df.head(20)


A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , and associated with an acute hepatitis , presented with a one-week history of polyuria , poor appetite , and vomiting . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl ,  creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , and venous pH 7.27 . 



Unnamed: 0,chunks,begin,end,sentence_id,entities
0,28-year-old,3,13,0,Age
1,female,15,20,0,Gender
2,gestational diabetes mellitus,40,68,0,Diabetes
3,eight years prior,80,96,0,RelativeDate
4,subsequent,118,127,0,Modifier
5,type two diabetes mellitus,129,154,0,Diabetes
6,T2DM,158,161,0,Diabetes
7,HTG-induced pancreatitis,187,210,0,Disease_Syndrome_Disorder
8,three years prior,212,228,0,RelativeDate
9,acute,271,275,0,Modifier


In [None]:
# NER model trained on i2b2 (sampled from MIMIC) dataset
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

posology_ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

# greedy model
posology_ner_greedy = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_greedy")

ner_converter_greedy = NerConverter()\
    .setInputCols(["sentence","token","ner_greedy"])\
    .setOutputCol("ner_chunk_greedy")

nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    posology_ner_converter,
    posology_ner_greedy,
    ner_converter_greedy])

empty_data = spark.createDataFrame([[""]]).toDF("text")

posology_model = nlpPipeline.fit(empty_data)


ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
ner_posology_greedy download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [None]:
posology_ner.getClasses()

['O',
 'B-DOSAGE',
 'B-STRENGTH',
 'I-STRENGTH',
 'B-ROUTE',
 'B-FREQUENCY',
 'I-FREQUENCY',
 'B-DRUG',
 'I-DRUG',
 'B-FORM',
 'I-DOSAGE',
 'B-DURATION',
 'I-DURATION',
 'I-FORM',
 'I-ROUTE']

In [None]:
posology_result = posology_model.transform(mt_samples_df)

In [None]:
posology_result.show(10)

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|index|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|          ner_greedy|    ner_chunk_greedy|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    0|Sample Type / Med...|[{document, 0, 54...|[{document, 0, 54...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 1609, 16...|[{named_entity, 0...|[{chunk, 1609, 16...|
|    1|Sample Type / Med...|[{document, 0, 32...|[{document, 0, 54...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 742, 750...|[{named_entity, 0...|[{chunk, 742, 753...|
|    2|Sample T

In [None]:
posology_result.printSchema()

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- valu

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

# This will return a new DF with all the columns + id
posology_result = posology_result.withColumn("id", monotonically_increasing_id())

posology_result.show(3)

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---+
|index|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|          ner_greedy|    ner_chunk_greedy| id|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---+
|    0|Sample Type / Med...|[{document, 0, 54...|[{document, 0, 54...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 1609, 16...|[{named_entity, 0...|[{chunk, 1609, 16...|  0|
|    1|Sample Type / Med...|[{document, 0, 32...|[{document, 0, 54...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 742, 750...|[{named_entity, 0...|[{chunk, 742, 753...|

In [None]:
posology_result.select('token.result','ner.result').take(2)

[Row(result=['Sample', 'Type', '/', 'Medical', 'Specialty', ':', 'Hematology', '-', 'Oncology', 'Sample', 'Name', ':', 'Discharge', 'Summary', '-', 'Mesothelioma', '-', '1', 'Description', ':', 'Mesothelioma', ',', 'pleural', 'effusion', ',', 'atrial', 'fibrillation', ',', 'anemia', ',', 'ascites', ',', 'esophageal', 'reflux', ',', 'and', 'history', 'of', 'deep', 'venous', 'thrombosis', '.', '(', 'Medical', 'Transcription', 'Sample', 'Report', ')', 'PRINCIPAL', 'DIAGNOSIS', ':', 'Mesothelioma', '.', 'SECONDARY', 'DIAGNOSES', ':', 'Pleural', 'effusion', ',', 'atrial', 'fibrillation', ',', 'anemia', ',', 'ascites', ',', 'esophageal', 'reflux', ',', 'and', 'history', 'of', 'deep', 'venous', 'thrombosis', '.', 'PROCEDURES', '1', '.', 'On', 'August', '24', ',', '2007', ',', 'decortication', 'of', 'the', 'lung', 'with', 'pleural', 'biopsy', 'and', 'transpleural', 'fluoroscopy', '.', '2', '.', 'On', 'August', '20', ',', '2007', ',', 'thoracentesis', '.', '3', '.', 'On', 'August', '31', ',', '

In [None]:
from pyspark.sql import functions as F

posology_result.select(F.explode(F.arrays_zip('token.result', 'ner.result')).alias("cols")) \
               .select(F.expr("cols['0']").alias("token"),
                       F.expr("cols['1']").alias("ner_label"))\
               .filter("ner_label!='O'")\
               .show(20, truncate=100)


+--------------+-----------+
|         token|  ner_label|
+--------------+-----------+
|      Coumadin|     B-DRUG|
|             1| B-STRENGTH|
|            mg| I-STRENGTH|
|         daily|B-FREQUENCY|
|    Amiodarone|     B-DRUG|
|           100| B-STRENGTH|
|            mg| I-STRENGTH|
|           p.o|    B-ROUTE|
|         daily|B-FREQUENCY|
|      Coumadin|     B-DRUG|
|       Lovenox|     B-DRUG|
|            40| B-STRENGTH|
|            mg| I-STRENGTH|
|subcutaneously|    B-ROUTE|
|  chemotherapy|     B-DRUG|
|     cisplatin|     B-DRUG|
|            75| B-STRENGTH|
| mg/centimeter| I-STRENGTH|
|           109| B-STRENGTH|
|            mg| I-STRENGTH|
+--------------+-----------+
only showing top 20 rows
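
The `B-` and `I-` prefixes in the table above mark the beginning and the continuation of an entity, and the converter stage groups them into chunks. A minimal pure-Python sketch of that grouping is given below. It only illustrates the idea and is not the `NerConverter` implementation: the helper name `merge_iob` is hypothetical, and the chunk text is approximated by joining tokens with spaces rather than by slicing the original text.

```python
def merge_iob(tokens, labels):
    """Group IOB-tagged tokens into (chunk_text, entity) pairs."""
    chunks = []
    current_tokens, current_entity = [], None
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        # An I- tag continues the open chunk only if the entity type matches.
        if prefix == "I" and entity == current_entity:
            current_tokens.append(token)
            continue
        # Anything else closes the open chunk, if there is one.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_entity))
        if prefix in ("B", "I"):
            # A B- tag (or a stray I- tag) starts a new chunk.
            current_tokens, current_entity = [token], entity
        else:
            # "O" tokens are outside any chunk.
            current_tokens, current_entity = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_entity))
    return chunks


# First rows of the table above (tokens labeled "O" are already filtered out)
tokens = ["Coumadin", "1", "mg", "daily", "Amiodarone", "100", "mg", "p.o", "daily"]
labels = ["B-DRUG", "B-STRENGTH", "I-STRENGTH", "B-FREQUENCY", "B-DRUG",
          "B-STRENGTH", "I-STRENGTH", "B-ROUTE", "B-FREQUENCY"]

merged = merge_iob(tokens, labels)
for chunk, entity in merged:
    print(f"{chunk:<12} {entity}")
```

With these rows, `1` and `mg` end up in a single `STRENGTH` chunk, while every `B-` tag opens a chunk of its own.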



In [None]:
posology_result.select('id',F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.begin', 'ner_chunk.end', 'ner_chunk.metadata')).alias("cols")) \
               .select('id', F.expr("cols['3']['sentence']").alias("sentence_id"),
                       F.expr("cols['0']").alias("chunk"),
                       F.expr("cols['1']").alias("begin"),
                       F.expr("cols['2']").alias("end"),
                       F.expr("cols['3']['entity']").alias("ner_label"))\
               .filter("ner_label!='O'")\
               .show(truncate=False)

+---+-----------+----------------+-----+----+---------+
|id |sentence_id|chunk           |begin|end |ner_label|
+---+-----------+----------------+-----+----+---------+
|0  |33         |Coumadin        |1609 |1616|DRUG     |
|0  |33         |1 mg            |1618 |1621|STRENGTH |
|0  |33         |daily           |1623 |1627|FREQUENCY|
|0  |34         |Amiodarone      |1696 |1705|DRUG     |
|0  |34         |100 mg          |1707 |1712|STRENGTH |
|0  |34         |p.o             |1714 |1716|ROUTE    |
|0  |34         |daily           |1719 |1723|FREQUENCY|
|0  |58         |Coumadin        |2770 |2777|DRUG     |
|0  |60         |Lovenox         |2880 |2886|DRUG     |
|0  |60         |40 mg           |2888 |2892|STRENGTH |
|0  |60         |subcutaneously  |2894 |2907|ROUTE    |
|0  |72         |chemotherapy    |4436 |4447|DRUG     |
|0  |72         |cisplatin       |4475 |4483|DRUG     |
|0  |72         |75 mg/centimeter|4485 |4500|STRENGTH |
|0  |72         |109 mg          |4519 |4524|STR

In [None]:
posology_result.select('id',F.explode(F.arrays_zip('ner_chunk_greedy.result', 'ner_chunk_greedy.begin', 'ner_chunk_greedy.end', 'ner_chunk_greedy.metadata')).alias("cols")) \
               .select('id', F.expr("cols['3']['sentence']").alias("sentence_id"),
                        F.expr("cols['0']").alias("chunk"),
                        F.expr("cols['1']").alias("begin"),
                        F.expr("cols['2']").alias("end"),
                        F.expr("cols['3']['entity']").alias("ner_label"))\
                .filter("ner_label!='O'")\
                .show(truncate=False)

+---+-----------+--------------------------------------------------------------+-----+----+---------+
|id |sentence_id|chunk                                                         |begin|end |ner_label|
+---+-----------+--------------------------------------------------------------+-----+----+---------+
|0  |33         |Coumadin 1 mg                                                 |1609 |1621|DRUG     |
|0  |33         |daily                                                         |1623 |1627|FREQUENCY|
|0  |34         |Amiodarone 100 mg p.o                                         |1696 |1716|DRUG     |
|0  |34         |daily                                                         |1719 |1723|FREQUENCY|
|0  |58         |Coumadin                                                      |2770 |2777|DRUG     |
|0  |72         |chemotherapy                                                  |4436 |4447|DRUG     |
|0  |72         |cisplatin 75 mg/centimeter                                    |44

In [None]:
posology_result.select('ner_chunk').take(2)[1][0][0].result

'Xylocaine'

In [None]:
posology_result.select('ner_chunk').take(2)[1][0][0].metadata

{'chunk': '0', 'confidence': '0.9903', 'entity': 'DRUG', 'sentence': '11'}

In [None]:
import pandas as pd

posology_light_result = posology_light_model.fullAnnotate(text)

# Collect the text, label and character offsets of every chunk in the first document
chunks = []
entities = []
begin = []
end = []

for n in posology_light_result[0]['ner_chunk']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])

posology_result_df = pd.DataFrame({'chunks': chunks, 'entities': entities,
                                   'begin': begin, 'end': end})

posology_result_df.head(15)

Unnamed: 0,chunks,entities,begin,end
0,1,DOSAGE,27,27
1,capsule,FORM,29,35
2,Advil,DRUG,40,44
3,for 5 days,DURATION,46,55
4,40 units,DOSAGE,126,133
5,insulin glargine,DRUG,138,153
6,at night,FREQUENCY,155,162
7,12 units,DOSAGE,166,173
8,insulin lispro,DRUG,178,191
9,with meals,FREQUENCY,193,202


In [None]:
import pandas as pd

# Same extraction as above, this time for the chunks of the greedy model
chunks = []
entities = []
begin = []
end = []

for n in posology_light_result[0]['ner_chunk_greedy']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])

posology_result_greedy_df = pd.DataFrame({'chunks': chunks,
                                          'entities': entities,
                                          'begin': begin,
                                          'end': end})

posology_result_greedy_df.head(15)

Unnamed: 0,chunks,entities,begin,end
0,1 capsule of Advil,DRUG,27,44
1,for 5 days,DURATION,46,55
2,40 units of insulin glargine,DRUG,126,153
3,at night,FREQUENCY,155,162
4,12 units of insulin lispro,DRUG,166,191
5,with meals,FREQUENCY,193,202
6,metformin 1000 mg,DRUG,210,226
7,two times a day,FREQUENCY,228,242
8,SGLT2 inhibitors,DRUG,273,288
9,for 3 months,DURATION,326,337


### Comparison of `ner_posology` and `ner_posology_greedy` results

In [None]:
from google.colab import widgets

t = widgets.TabBar(["ner_posology", "ner_posology_greedy", "viz_posology", "viz_posology_greedy"])

with t.output_to(0):
    display(posology_result_df.head(10))

with t.output_to(1):
    display(posology_result_greedy_df.head(10))

with t.output_to(2):
    visualiser.display(posology_light_result[0], label_col='ner_chunk', document_col='document')

with t.output_to(3):
    visualiser.display(posology_light_result[0], label_col='ner_chunk_greedy', document_col='document')

Unnamed: 0,chunks,entities,begin,end
0,1,DOSAGE,27,27
1,capsule,FORM,29,35
2,Advil,DRUG,40,44
3,for 5 days,DURATION,46,55
4,40 units,DOSAGE,126,133
5,insulin glargine,DRUG,138,153
6,at night,FREQUENCY,155,162
7,12 units,DOSAGE,166,173
8,insulin lispro,DRUG,178,191
9,with meals,FREQUENCY,193,202


Unnamed: 0,chunks,entities,begin,end
0,1 capsule of Advil,DRUG,27,44
1,for 5 days,DURATION,46,55
2,40 units of insulin glargine,DRUG,126,153
3,at night,FREQUENCY,155,162
4,12 units of insulin lispro,DRUG,166,191
5,with meals,FREQUENCY,193,202
6,metformin 1000 mg,DRUG,210,226
7,two times a day,FREQUENCY,228,242
8,SGLT2 inhibitors,DRUG,273,288
9,for 3 months,DURATION,326,337
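
The two tables can also be lined up programmatically: each greedy chunk covers a character range, and the fine-grained chunks whose offsets fall inside that range are the pieces it absorbed. The sketch below does this for the first rows of both tables. The helper name `group_by_greedy` is hypothetical and the rows are copied by hand from the outputs above.

```python
# (chunk, entity, begin, end) rows copied from the ner_posology table
fine_chunks = [
    ("1", "DOSAGE", 27, 27),
    ("capsule", "FORM", 29, 35),
    ("Advil", "DRUG", 40, 44),
    ("for 5 days", "DURATION", 46, 55),
    ("40 units", "DOSAGE", 126, 133),
    ("insulin glargine", "DRUG", 138, 153),
    ("at night", "FREQUENCY", 155, 162),
]

# (chunk, entity, begin, end) rows copied from the ner_posology_greedy table
greedy_chunks = [
    ("1 capsule of Advil", "DRUG", 27, 44),
    ("for 5 days", "DURATION", 46, 55),
    ("40 units of insulin glargine", "DRUG", 126, 153),
    ("at night", "FREQUENCY", 155, 162),
]


def group_by_greedy(fine, greedy):
    """Map each greedy chunk to the fine-grained chunks it fully contains."""
    grouped = {}
    for g_text, _, g_begin, g_end in greedy:
        grouped[g_text] = [
            f"{text} ({entity})"
            for text, entity, begin, end in fine
            if g_begin <= begin and end <= g_end
        ]
    return grouped


grouped = group_by_greedy(fine_chunks, greedy_chunks)
for greedy_text, parts in grouped.items():
    print(f"{greedy_text!r:<32} <- {parts}")
```

For these rows, the greedy `DRUG` chunks absorb the neighboring `DOSAGE` and `FORM` chunks, while the `DURATION` and `FREQUENCY` chunks are identical in both models.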


## Writing a generic NER function

**Generic NER Function with LightPipeline**

In [None]:
def get_light_model(embeddings, model_name='ner_clinical'):
  """Build a LightPipeline that runs the given pretrained clinical NER model."""

  documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  sentenceDetector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

  tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  word_embeddings = WordEmbeddingsModel.pretrained(embeddings, "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

  loaded_ner_model = MedicalNerModel.pretrained(model_name, "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

  ner_converter = NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")

  nlpPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      loaded_ner_model,
      ner_converter])

  model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

  return LightPipeline(model)

**Get NER Results with fullAnnotate Method**

In [None]:
import pandas as pd

def get_light_result(light_model, text, chunk_name="ner_chunk"):
  """Return the chunks found in `text` as a pandas DataFrame."""

  light_result = light_model.fullAnnotate(text)

  chunks = []
  entities = []
  sentence = []
  begin = []
  end = []

  for n in light_result[0][chunk_name]:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    sentence.append(n.metadata['sentence'])

  # Build the DataFrame once, after the loop, so that a text without any
  # chunks returns an empty DataFrame instead of raising an error.
  pd_df = pd.DataFrame({'sentence_id': sentence,
                        'begin': begin,
                        'end': end,
                        'chunks': chunks,
                        'entities': entities})
  return pd_df

In [None]:
embeddings = 'embeddings_clinical'

model_name = 'ner_clinical'

light_model = get_light_model (embeddings, model_name)

text = "I had a headache yesterday and took an Advil."

light_model.annotate(text)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


{'document': ['I had a headache yesterday and took an Advil.'],
 'embeddings': ['I',
  'had',
  'a',
  'headache',
  'yesterday',
  'and',
  'took',
  'an',
  'Advil',
  '.'],
 'ner': ['O',
  'O',
  'B-PROBLEM',
  'I-PROBLEM',
  'O',
  'O',
  'O',
  'B-TREATMENT',
  'I-TREATMENT',
  'O'],
 'ner_chunk': ['a headache', 'an Advil'],
 'sentence': ['I had a headache yesterday and took an Advil.'],
 'token': ['I',
  'had',
  'a',
  'headache',
  'yesterday',
  'and',
  'took',
  'an',
  'Advil',
  '.']}

In [None]:
get_light_result(light_model, text, chunk_name="ner_chunk")

Unnamed: 0,sentence_id,begin,end,chunks,entities
0,0,6,15,a headache,PROBLEM
1,0,36,43,an Advil,TREATMENT
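
The `begin` and `end` columns are character offsets into the input text, and `end` points at the last character of the chunk rather than one past it. A short check in plain Python, using the two rows above (the helper name `chunk_text` is hypothetical):

```python
text = "I had a headache yesterday and took an Advil."

# (begin, end, chunk) rows copied from the output above
rows = [(6, 15, "a headache"), (36, 43, "an Advil")]


def chunk_text(text, begin, end):
    """Recover a chunk from its offsets; `end` is inclusive, so slice to end + 1."""
    return text[begin:end + 1]


recovered = [chunk_text(text, begin, end) for begin, end, _ in rows]
print(recovered)
```

This matters when the offsets are used to highlight or replace spans in the original text: a plain `text[begin:end]` slice would drop the last character of every chunk.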


In [None]:
text ='''The patient was prescribed 1 capsule of Parol with meals . 
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . 
It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'''

embeddings = 'embeddings_clinical'

model_name = 'ner_posology'

light_model = get_light_model (embeddings, model_name)

get_light_result (light_model, text, chunk_name="ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


Unnamed: 0,sentence_id,begin,end,chunks,entities
0,0,27,27,1,DOSAGE
1,0,29,35,capsule,FORM
2,0,40,44,Parol,DRUG
3,0,46,55,with meals,FREQUENCY
4,1,127,134,40 units,DOSAGE
5,1,139,154,insulin glargine,DRUG
6,1,156,163,at night,FREQUENCY
7,1,167,174,12 units,DOSAGE
8,1,179,192,insulin lispro,DRUG
9,1,194,203,with meals,FREQUENCY


## PHI NER

**Entities**
- AGE
- CONTACT
- DATE
- ID
- LOCATION
- NAME
- PROFESSION

In [None]:
embeddings = 'embeddings_clinical'

model_name = 'ner_deid_subentity_augmented'

# Other pretrained de-identification models:
# deidentify_dl
# ner_deid_large
# ner_deid_generic_augmented
# ner_deid_subentity_augmented
# ner_deid_subentity_augmented_i2b2

text = """Miriam BRAY is a 41-year-old female from Vietnam and she was admitted for a right-sided pleural effusion for thoracentesis on Monday by Dr. X. Her Coumadin was placed on hold.
She was instructed to followup with Dr. XYZ in the office to check her INR On August 24, 2007 ."""

light_model = get_light_model (embeddings, model_name)

get_light_result (light_model, text, chunk_name="ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
Approximate size to download 14.1 MB
[OK!]


Unnamed: 0,sentence_id,begin,end,chunks,entities
0,0,0,10,Miriam BRAY,PATIENT
1,0,17,27,41-year-old,AGE
2,0,41,47,Vietnam,COUNTRY
3,0,126,131,Monday,DATE
4,0,140,140,X,DOCTOR
5,1,216,218,XYZ,DOCTOR
6,1,254,270,"August 24, 2007 .",DATE


## BioNLP (Cancer Genetics) NER

**Entities**

| | | |
|-|-|-|
|tissue_structure|Amino_acid|Simple_chemical|
|Organism_substance|Developing_anatomical_structure|Cell|
|Cancer|Cellular_component|Gene_or_gene_product|
|Immaterial_anatomical_entity|Organ|Organism|
|Pathological_formation|Organism_subdivision|Anatomical_system|
|Tissue|||

In [None]:
mt_samples_df.filter("index == '2'").collect()[0]["text"]

'Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nAnemia - Consult\nDescription:\nRefractory anemia that is transfusion dependent. At this time, he has been admitted for anemia with hemoglobin of 7.1 and requiring transfusion.\n(Medical Transcription Sample Report)\nDIAGNOSIS:\nRefractory anemia that is transfusion dependent.\nCHIEF COMPLAINT:\nI needed a blood transfusion.\nHISTORY:\nThe patient is a 78-year-old gentleman with no substantial past medical history except for diabetes. He denies any comorbid complications of the diabetes including kidney disease, heart disease, stroke, vision loss, or neuropathy. At this time, he has been admitted for anemia with hemoglobin of 7.1 and requiring transfusion. He reports that he has no signs or symptom of bleeding and had a blood transfusion approximately two months ago and actually several weeks before that blood transfusion, he had a transfusion for anemia. He has been placed on B12, oral iron, and Procrit. At this t

In [None]:
embeddings = 'embeddings_clinical'

model_name = 'ner_bionlp'

text =  mt_samples_df.filter("index == '2'").collect()[0]["text"]

light_model = get_light_model (embeddings, model_name)

get_light_result (light_model, text, chunk_name="ner_chunk").head(20)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_bionlp download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


Unnamed: 0,sentence_id,begin,end,chunks,entities
0,1,198,207,hemoglobin,Gene_or_gene_product
1,3,369,373,blood,Organism_substance
2,4,401,407,patient,Organism
3,4,426,434,gentleman,Organism
4,5,498,499,He,Organism
5,5,561,566,kidney,Organ
6,5,577,581,heart,Organ
7,6,679,688,hemoglobin,Gene_or_gene_product
8,7,789,793,blood,Organism_substance
9,7,875,879,blood,Organism_substance


## NER Chunker
We can extract phrases that fit a known pattern using the NER tags. `NerChunker` is handy for extracting groups of entities together with their neighboring tokens when no pretrained NER model addresses the need. Let's say we want to extract a drug and its frequency together as a single chunk, even if there are some unwanted tokens between them.

In [None]:
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_chunker = NerChunker()\
    .setInputCols(["sentence","ner"])\
    .setOutputCol("ner_chunk")\
    .setRegexParsers(["<DRUG>.*<FREQUENCY>"])

nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_chunker])

empty_data = spark.createDataFrame([[""]]).toDF("text")

ner_chunker_model = nlpPipeline.fit(empty_data)

ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [None]:
posology_ner.getClasses()

['O',
 'B-DOSAGE',
 'B-STRENGTH',
 'I-STRENGTH',
 'B-ROUTE',
 'B-FREQUENCY',
 'I-FREQUENCY',
 'B-DRUG',
 'I-DRUG',
 'B-FORM',
 'I-DOSAGE',
 'B-DURATION',
 'I-DURATION',
 'I-FORM',
 'I-ROUTE']

In [None]:
light_model = LightPipeline(ner_chunker_model)

text ='The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

light_result = light_model.annotate(text)

list(zip(light_result['token'], light_result['ner']))

[('The', 'O'),
 ('patient', 'O'),
 ('was', 'O'),
 ('prescribed', 'O'),
 ('1', 'B-DOSAGE'),
 ('capsule', 'B-FORM'),
 ('of', 'O'),
 ('Advil', 'B-DRUG'),
 ('for', 'B-DURATION'),
 ('5', 'I-DURATION'),
 ('days', 'I-DURATION'),
 ('.', 'O'),
 ('He', 'O'),
 ('was', 'O'),
 ('seen', 'O'),
 ('by', 'O'),
 ('the', 'O'),
 ('endocrinology', 'O'),
 ('service', 'O'),
 ('and', 'O'),
 ('she', 'O'),
 ('was', 'O'),
 ('discharged', 'O'),
 ('on', 'O'),
 ('40', 'B-DOSAGE'),
 ('units', 'I-DOSAGE'),
 ('of', 'O'),
 ('insulin', 'B-DRUG'),
 ('glargine', 'I-DRUG'),
 ('at', 'B-FREQUENCY'),
 ('night', 'I-FREQUENCY'),
 (',', 'O'),
 ('12', 'B-DOSAGE'),
 ('units', 'I-DOSAGE'),
 ('of', 'O'),
 ('insulin', 'B-DRUG'),
 ('lispro', 'I-DRUG'),
 ('with', 'B-FREQUENCY'),
 ('meals', 'I-FREQUENCY'),
 (',', 'O'),
 ('metformin', 'B-DRUG'),
 ('1000', 'B-STRENGTH'),
 ('mg', 'I-STRENGTH'),
 ('two', 'B-FREQUENCY'),
 ('times', 'I-FREQUENCY'),
 ('a', 'I-FREQUENCY'),
 ('day', 'I-FREQUENCY'),
 ('.', 'O'),
 ('It', 'O'),
 ('was', 'O'),
 ('det

In [None]:
light_result["ner_chunk"]

['insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day']
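
The single long chunk above is what a greedy `.*` produces: the match starts at the first `DRUG` token of the sentence and runs to the last `FREQUENCY` token. The sketch below reproduces this in plain Python on part of the second sentence. It illustrates the pattern-matching idea only and is not the `NerChunker` implementation; the helper name `chunk_by_tag_pattern` and the way tags are serialized into a string are assumptions made for the example.

```python
import re


def chunk_by_tag_pattern(tokens, labels, pattern):
    """Match a regex over a string of <TAG> symbols and return the covered tokens."""
    # Strip the B-/I- prefixes so that each token contributes one "<TAG>" symbol.
    tags = [label.split("-", 1)[-1] for label in labels]
    # Remember where each token's symbol starts and ends in the tag string.
    spans, position = [], 0
    for tag in tags:
        symbol = f"<{tag}>"
        spans.append((position, position + len(symbol)))
        position += len(symbol)
    tag_string = "".join(f"<{tag}>" for tag in tags)

    chunks = []
    for match in re.finditer(pattern, tag_string):
        covered = [token for token, (start, end) in zip(tokens, spans)
                   if start >= match.start() and end <= match.end()]
        chunks.append(" ".join(covered))
    return chunks


# (token, label) pairs copied from the output above
pairs = [
    ("discharged", "O"), ("on", "O"), ("40", "B-DOSAGE"), ("units", "I-DOSAGE"),
    ("of", "O"), ("insulin", "B-DRUG"), ("glargine", "I-DRUG"),
    ("at", "B-FREQUENCY"), ("night", "I-FREQUENCY"), (",", "O"),
    ("12", "B-DOSAGE"), ("units", "I-DOSAGE"), ("of", "O"),
    ("insulin", "B-DRUG"), ("lispro", "I-DRUG"),
    ("with", "B-FREQUENCY"), ("meals", "I-FREQUENCY"), (",", "O"),
    ("metformin", "B-DRUG"), ("1000", "B-STRENGTH"), ("mg", "I-STRENGTH"),
    ("two", "B-FREQUENCY"), ("times", "I-FREQUENCY"), ("a", "I-FREQUENCY"),
    ("day", "I-FREQUENCY"), (".", "O"),
]
tokens = [token for token, _ in pairs]
labels = [label for _, label in pairs]

result = chunk_by_tag_pattern(tokens, labels, "<DRUG>.*<FREQUENCY>")
print(result)
```

The first sentence of the example text yields no chunk because `Advil` is followed by a `DURATION` entity and not by a `FREQUENCY` one, so the pattern never matches there.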

## Chunk Filterer
`ChunkFilterer` lets you filter named entities by a condition or a predefined lookup list and keep only the matching chunks, so that you can feed these entities to other annotators such as Assertion Status or Entity Resolvers. It supports two criteria: `isin` and `regex`.

In [None]:
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
      
chunk_filterer = ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setWhiteList(['Advil','metformin', 'insulin lispro'])

nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [None]:
light_model = LightPipeline(chunk_filter_model)

text ='The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['document', 'ner_chunk', 'chunk_filtered', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
light_result['ner_chunk'] 

['1',
 'capsule',
 'Advil',
 'for 5 days',
 '40 units',
 'insulin glargine',
 'at night',
 '12 units',
 'insulin lispro',
 'with meals',
 'metformin',
 '1000 mg',
 'two times a day',
 'SGLT2 inhibitors']

In [None]:
light_result['chunk_filtered']

['Advil', 'insulin lispro', 'metformin']
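
The two criteria can be pictured as two simple list filters. The sketch below applies both to the `ner_chunk` output shown above, in plain Python. It is an illustration under assumptions, not the `ChunkFilterer` code: the helper names are hypothetical, the `^insulin` pattern is made up for the example, and whether the annotator matches a pattern anywhere in the chunk or against the whole chunk is not shown in this notebook.

```python
import re

# Chunks copied from the ner_chunk output above
ner_chunks = ['1', 'capsule', 'Advil', 'for 5 days', '40 units',
              'insulin glargine', 'at night', '12 units', 'insulin lispro',
              'with meals', 'metformin', '1000 mg', 'two times a day',
              'SGLT2 inhibitors']


def filter_isin(chunks, whitelist):
    """Keep the chunks whose text appears in the whitelist, in document order."""
    allowed = set(whitelist)
    return [chunk for chunk in chunks if chunk in allowed]


def filter_regex(chunks, patterns):
    """Keep the chunks that match at least one of the patterns."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [chunk for chunk in chunks
            if any(regex.search(chunk) for regex in compiled)]


isin_result = filter_isin(ner_chunks, ['Advil', 'metformin', 'insulin lispro'])
regex_result = filter_regex(ner_chunks, [r'^insulin'])

print(isin_result)
print(regex_result)
```

A whitelist is the natural choice when the vocabulary of interest is known in advance, while a pattern is more convenient for a family of terms such as every chunk that starts with `insulin`.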

In [None]:
ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols("sentence","token","embeddings")\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    
chunk_filterer = ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setWhiteList(['severe fever','sore throat'])

nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [None]:
text = 'Patient with severe fever, severe cough, sore throat, stomach pain, and a headache.'

filter_df = spark.createDataFrame([[text]]).toDF("text")

chunk_filter_result = chunk_filter_model.transform(filter_df)

In [None]:
chunk_filter_result.select('ner_chunk.result','chunk_filtered.result').show(truncate=False)

+-------------------------------------------------------------------+---------------------------+
|result                                                             |result                     |
+-------------------------------------------------------------------+---------------------------+
|[severe fever, severe cough, sore throat, stomach pain, a headache]|[severe fever, sore throat]|
+-------------------------------------------------------------------+---------------------------+



## Changing entity labels with `NerConverterInternal()`

In [None]:
replace_dict = """Drug_BrandName,Drug
Frequency,Drug_Frequency
Dosage,Drug_Dosage
Strength,Drug_Strength
"""
with open('replace_dict.csv', 'w') as f:
    f.write(replace_dict)

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") 

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("jsl_ner_chunk")

jsl_ner_converter_internal = NerConverterInternal()\
    .setInputCols(["sentence","token","jsl_ner"])\
    .setOutputCol("replaced_ner_chunk")\
    .setReplaceDictResource("replace_dict.csv","text", {"delimiter":","})
      
nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    jsl_ner_converter,
    jsl_ner_converter_internal
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

ner_converter_model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


In [None]:
text ='The patient was prescribed 1 capsule of Parol with meals . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

light_model = LightPipeline(ner_converter_model)

jsl_ner_chunk_df = get_light_result (light_model, text, chunk_name='jsl_ner_chunk')
replaced_ner_chunk_df = get_light_result (light_model, text, chunk_name='replaced_ner_chunk')
pd.concat([jsl_ner_chunk_df, replaced_ner_chunk_df.iloc[:,-1:].rename(columns= {'entities':'replaced'})], axis=1)

Unnamed: 0,sentence_id,begin,end,chunks,entities,replaced
0,0,27,35,1 capsule,Dosage,Drug_Dosage
1,0,40,44,Parol,Drug_BrandName,Drug
2,1,59,60,He,Gender,Gender
3,1,78,98,endocrinology service,Clinical_Dept,Clinical_Dept
4,1,104,106,she,Gender,Gender
5,1,112,121,discharged,Admission_Discharge,Admission_Discharge
6,1,126,133,40 units,Dosage,Drug_Dosage
7,1,138,153,insulin glargine,Drug_Ingredient,Drug_Ingredient
8,1,155,162,at night,Frequency,Drug_Frequency
9,1,166,173,12 units,Dosage,Drug_Dosage
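
The `replaced` column follows directly from the two-column CSV written earlier: a label that appears in the first column is swapped for the value in the second column, and every other label is left unchanged. A minimal sketch of that lookup in plain Python (the helper name `replace_label` is hypothetical, and this is not the `NerConverterInternal` implementation):

```python
import csv
import io

# Same content as replace_dict.csv
replace_dict = """Drug_BrandName,Drug
Frequency,Drug_Frequency
Dosage,Drug_Dosage
Strength,Drug_Strength
"""

# Parse the "old_label,new_label" rows into a dictionary, skipping empty rows
mapping = {row[0]: row[1] for row in csv.reader(io.StringIO(replace_dict)) if row}


def replace_label(label, mapping):
    """Return the replacement for a label, or the label itself if none is defined."""
    return mapping.get(label, label)


# Labels taken from the `entities` column above
labels = ["Dosage", "Drug_BrandName", "Gender", "Clinical_Dept", "Frequency"]
replaced = [replace_label(label, mapping) for label in labels]

for old, new in zip(labels, replaced):
    print(f"{old:<16} -> {new}")
```

Labels such as `Gender` and `Clinical_Dept` have no entry in the file, which is why they appear unchanged in the table.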


## Downloading Pretrained Models

- When we use the `.pretrained()` method, the model is automatically downloaded to a folder named `cache_pretrained` and is loaded from that folder if you run it again.

- To download the models manually to any folder, you can follow the steps below. In this case you should use the `.load()` method.

  - Install the AWS CLI on your local computer following the steps [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html) for Linux and [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-mac.html) for macOS.

  - Then configure your AWS credentials.

  - Go to the Models Hub and look for the model you need.

  - Select the model you found and you will see the model card, which shows all the details about that model.

  - Hover over the Download button on that page and you will see the download link from the S3 bucket.

  - Then use the AWS CLI as follows:

```
!aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.1.0_2.4_1624566960534.zip .
```
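
Once the archive is on disk, it can be unpacked into its own folder and that folder passed to `.load()`. The sketch below covers the unpacking step with the standard library only. It assumes that the archive has to be extracted before loading and that it was downloaded to the working directory; the helper name `extract_model` and the `models` target folder are hypothetical. The loading call is left as a comment because it needs a licensed Spark NLP for Healthcare session.

```python
import zipfile
from pathlib import Path


def extract_model(zip_path, target_dir="models"):
    """Unzip a downloaded model archive into its own folder and return that folder."""
    zip_path = Path(zip_path)
    # Name the folder after the archive, e.g. models/ner_jsl_en_3.1.0_2.4_1624566960534
    model_dir = Path(target_dir) / zip_path.stem
    model_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(model_dir)
    return model_dir


# Usage (not executed here):
# model_dir = extract_model("ner_jsl_en_3.1.0_2.4_1624566960534.zip")
# jsl_ner = MedicalNerModel.load(str(model_dir)) \
#     .setInputCols(["sentence", "token", "embeddings"]) \
#     .setOutputCol("jsl_ner")
```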

## Training a Clinical NER (NCBI Disease Dataset)

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltrain.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltest.txt

In [None]:
from sparknlp.training import CoNLL

conll_data = CoNLL().readDataset(spark, 'NER_NCBIconlltrain.txt')

conll_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Identification of...|[{document, 0, 89...|[{document, 0, 89...|[{token, 0, 13, I...|[{pos, 0, 13, NN,...|[{named_entity, 0...|
|The adenomatous p...|[{document, 0, 21...|[{document, 0, 21...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
|Complex formation...|[{document, 0, 63...|[{document, 0, 63...|[{token, 0, 6, Co...|[{pos, 0, 6, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
conll_data.count()

3266

In [None]:
from pyspark.sql import functions as F

conll_data.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
          .select(F.expr("cols['0']").alias("token"),
                  F.expr("cols['1']").alias("ground_truth"))\
          .groupBy('ground_truth')\
          .count()\
          .orderBy('count', ascending=False)\
          .show(100,truncate=False)

+------------+-----+
|ground_truth|count|
+------------+-----+
|O           |75093|
|I-Disease   |3547 |
|B-Disease   |3093 |
+------------+-----+
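
The same label counts can be cross-checked without Spark by reading the CoNLL file directly. This is a minimal sketch, assuming the CoNLL-2003 layout (token, POS, chunk, label, with blank lines between sentences); the function name and the sample rows are made up for illustration.

```python
from collections import Counter


def count_conll_labels(lines):
    """Count NER labels, assuming the label is the last whitespace-separated column."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip sentence separators and document markers.
        if not line or line.startswith("-DOCSTART-"):
            continue
        counts[line.split()[-1]] += 1
    return counts


# Tiny made-up sample in CoNLL-2003 layout.
sample = """-DOCSTART- -X- -X- O

Identification NN NN O
of NN NN O
APC2 NN NN O

colon NN NN B-Disease
cancer NN NN I-Disease
""".splitlines()

print(count_conll_labels(sample).most_common())
```

On the real data, the function could be called as `count_conll_labels(open('NER_NCBIconlltrain.txt'))`.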



In [None]:
conll_data.select("label.result").distinct().count()

1537

In [None]:
'''
As you can see, the dataset is dominated by `O` labels.
To make it more balanced, we can drop the sentences that have only `O` labels.
(`c>1` keeps only the sentences with more than one distinct label, so sentences that contain nothing but `O` are dropped.)
'''

'''
conll_data = conll_data.withColumn('unique', F.array_distinct("label.result"))\
                       .withColumn('c', F.size('unique'))\
                       .filter(F.col('c')>1)

conll_data.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
          .select(F.expr("cols['0']").alias("token"),
                  F.expr("cols['1']").alias("ground_truth"))\
          .groupBy('ground_truth')\
          .count()\
          .orderBy('count', ascending=False)\
          .show(100,truncate=False)
'''
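
To see what the `c>1` rule does on individual sentences, here is a plain-Python mirror of the filter. It is only a sketch on made-up label lists; the helper name `has_entity` is hypothetical.

```python
def has_entity(labels):
    """Mirror of the `c > 1` filter: keep sentences with more than one distinct label."""
    return len(set(labels)) > 1


sentences = [
    ["O", "O", "O"],                       # only O: dropped
    ["O", "B-Disease", "I-Disease", "O"],  # mixed labels: kept
    ["B-Disease"],                         # one distinct label: also dropped
]

kept = [labels for labels in sentences if has_entity(labels)]
print(kept)
```

The third case shows a side effect of counting distinct labels: a sentence made up of a single entity label and no `O` token is removed as well.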

In [None]:
# Clinical word embeddings trained on PubMED dataset
clinical_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [None]:
test_data = CoNLL().readDataset(spark, 'NER_NCBIconlltest.txt')

test_data = clinical_embeddings.transform(test_data)

test_data.write.parquet('NER_NCBIconlltest.parquet')

### NERDL Graph
A TensorFlow graph file (`.pb` extension) must be created before NER training.

In [None]:
!pip install -q tensorflow==2.7.0
!pip install -q tensorflow-addons

In [None]:
from sparknlp_jsl.training import tf_graph
tf_graph.print_model_params("ner_dl")

tf_graph.build("ner_dl", 
               build_params={"embeddings_dim": 200, 
                             "nchars": 85, 
                             "ntags": 3, 
                             "is_medical": 1}, 
               model_location="./medical_ner_graphs", 
               model_filename="auto")
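
The values for `ntags` and `nchars` depend on the training data. The sketch below estimates them by counting distinct labels and distinct characters. That the library counts them the same way is an assumption, and `estimate_graph_params` is a hypothetical helper. The training log further down reports `chars: 83` while the selected graph is `blstm_3_200_128_85.pb`, which suggests the graph's `nchars` only needs to be at least as large as the dataset's.

```python
def estimate_graph_params(sentences, embeddings_dim):
    """sentences: list of (tokens, labels) pairs. Returns candidate build_params values."""
    labels = {label for _, sentence_labels in sentences for label in sentence_labels}
    chars = {ch for tokens, _ in sentences for token in tokens for ch in token}
    return {
        "embeddings_dim": embeddings_dim,  # must match the embeddings model (200 here)
        "ntags": len(labels),              # distinct labels, including O
        "nchars": len(chars),              # distinct characters seen in the tokens
    }


# Made-up two-sentence sample.
sample = [
    (["colon", "cancer"], ["B-Disease", "I-Disease"]),
    (["The", "gene"], ["O", "O"]),
]
params = estimate_graph_params(sample, embeddings_dim=200)
print(params)
```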

In [None]:
# for open source users
'''
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/create_graph.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/dataset_encoder.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/ner_model.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/ner_model_saver.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/sentence_grouper.py

!pip -q install tensorflow==1.15.0

import create_graph

ntags = 3 # number of labels
embeddings_dim = 200
nchars = 83

create_graph.create_graph(ntags, embeddings_dim, nchars)
'''

In [None]:
nerTagger = MedicalNerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(30)\
    .setBatchSize(64)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')\
    .setGraphFolder('medical_ner_graphs')\
    .setTestDataset("NER_NCBIconlltest.parquet")\
    .setUseBestModel(True)\
    .setEarlyStoppingCriterion(0.04)\
    .setEarlyStoppingPatience(3)
# Optional: with limited memory and a large CoNLL file, also chain
# .setEnableMemoryOptimizer(True) to train batch by batch.

ner_pipeline = Pipeline(stages=[
          clinical_embeddings,
          nerTagger
 ])
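
The combination `setEarlyStoppingCriterion(0.04)` and `setEarlyStoppingPatience(3)` can be read as: stop once the monitored score has failed to improve by more than 0.04 for 3 consecutive epochs. The sketch below implements that generic pattern on made-up scores. It is not the library's implementation, whose exact semantics may differ, and `early_stopping_epoch` is a hypothetical name.

```python
def early_stopping_epoch(scores, min_delta, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, score in enumerate(scores, start=1):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Made-up per-epoch F1 scores: big gains first, then a plateau.
f1_per_epoch = [0.47, 0.70, 0.80, 0.82, 0.83, 0.83]
print(early_stopping_epoch(f1_per_epoch, min_delta=0.04, patience=3))  # prints 6
```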

See the [1.4.Resume_MedicalNer_Model_Training.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.4.Resume_MedicalNer_Model_Training.ipynb) notebook for fine-tuning pretrained NER models and for more details on the `MedicalNerApproach()` parameters.

In [None]:
%%time
ner_model = ner_pipeline.fit(conll_data)

# If you get an error about an incompatible TF graph, use the 4.1 NerDL-Graph.ipynb notebook in the public folder to create a graph.
# Licensed users can also use 17.Graph_builder_for_DL_models.ipynb to create TF graphs easily.

CPU times: user 3.94 s, sys: 527 ms, total: 4.47 s
Wall time: 11min 4s


The `getTrainingClassDistribution()` method returns the distribution of labels used when training the NER model.

In [None]:
ner_model.stages[1].getTrainingClassDistribution()

{'O': 60355, 'B-Disease': 2476, 'I-Disease': 2866}

Let's check the results saved in the log file.

In [None]:
import os

# Pick the first log file found in the output folder
log_file = os.listdir("ner_logs")[0]

with open(f"./ner_logs/{log_file}") as f:
    print(f.read())

Name of the selected graph: /content/medical_ner_graphs/blstm_3_200_128_85.pb
Training started - total epochs: 30 - lr: 0.001 - batch size: 64 - labels: 3 - chars: 83 - training examples: 2614


Epoch 1/30 started, lr: 0.001, dataset size: 2614


Epoch 1/30 - 78.37s - loss: 429.10016 - avg training loss: 9.979074 - batches: 43
Quality on validation dataset (20.0%), validation examples = 522
time to finish evaluation: 5.61s
Total validation loss: 59.9254	Avg validation loss: 4.6096
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 336	 248	 345	 0.5753425	 0.49339208	 0.5312253
B-Disease	 154	 71	 463	 0.6844444	 0.24959481	 0.36579573
tp: 490 fp: 319 fn: 808 labels: 2
Macro-average	 prec: 0.6298934, rec: 0.37149346, f1: 0.4673544
Micro-average	 prec: 0.605686, rec: 0.37750384, f1: 0.4651163
Quality on test dataset: 
time to finish evaluation: 4.78s
Total test loss: 72.5353	Avg test loss: 5.1811
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 379	 270	 410	 0.5839754	 0.48035488	 0.52712107
B-Dis

As you can see above, the **earlyStopping** feature worked: training was terminated before the 30th epoch.
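
Reading the raw log gets tedious over many epochs. The sketch below pulls the `Macro-average` and `Micro-average` lines out of the log text with a regular expression. The line format is taken from the log excerpt above; the function name is a hypothetical choice.

```python
import re

METRIC_RE = re.compile(
    r"^(Macro|Micro)-average\s+prec:\s*([\d.]+),\s*rec:\s*([\d.]+),\s*f1:\s*([\d.]+)"
)


def parse_average_lines(log_text):
    """Return (kind, precision, recall, f1) tuples in order of appearance."""
    rows = []
    for line in log_text.splitlines():
        match = METRIC_RE.match(line.strip())
        if match:
            kind, prec, rec, f1 = match.groups()
            rows.append((kind, float(prec), float(rec), float(f1)))
    return rows


# Two lines copied from the log output above.
sample_log = (
    "Macro-average\t prec: 0.6298934, rec: 0.37149346, f1: 0.4673544\n"
    "Micro-average\t prec: 0.605686, rec: 0.37750384, f1: 0.4651163\n"
)
for row in parse_average_lines(sample_log):
    print(row)
```

Each epoch logs these averages twice, once for the validation split and once for the test set, so the rows alternate between the two when the whole file is parsed.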

### Evaluate your model

In [None]:
pred_df = ner_model.stages[1].transform(test_data)

In [None]:
pred_df.columns

['text', 'document', 'sentence', 'token', 'pos', 'label', 'embeddings', 'ner']

In [None]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

evaler = NerDLMetrics(mode="full_chunk", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

eval_result.selectExpr("avg(f1) as macro").show()
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+-------+-----+-----+-----+-----+---------+------+------+
| entity|   tp|   fp|   fn|total|precision|recall|    f1|
+-------+-----+-----+-----+-----+---------+------+------+
|Disease|570.0|214.0|134.0|704.0|    0.727|0.8097|0.7661|
+-------+-----+-----+-----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.7661290322580645|
+------------------+

+------------------+
|             micro|
+------------------+
|0.7661290322580645|
+------------------+
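
The numbers in the `full_chunk` table follow directly from the `tp`, `fp` and `fn` counts. The sketch below recomputes them in plain Python. It also mirrors the `micro` expression from the cell, which is a support-weighted mean of the per-entity F1 scores. The helper names are illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_f1(rows):
    """Mirror of sum(f1*total)/sum(total); rows are (f1, total) pairs."""
    return sum(f1 * total for f1, total in rows) / sum(total for _, total in rows)


# Counts for the Disease entity from the full_chunk table above.
precision, recall, f1 = prf(tp=570, fp=214, fn=134)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.727 0.8097 0.7661
print(round(weighted_f1([(f1, 704)]), 4))                   # 0.7661
```

With a single entity type, the macro and weighted averages both coincide with that entity's F1, which is why the two tables show the same value.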


In [None]:
evaler = NerDLMetrics(mode="partial_chunk_per_token", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

eval_result.selectExpr("avg(f1) as macro").show()
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+-------+------+-----+-----+------+---------+------+------+
| entity|    tp|   fp|   fn| total|precision|recall|    f1|
+-------+------+-----+-----+------+---------+------+------+
|Disease|1351.0|207.0|146.0|1497.0|   0.8671|0.9025|0.8845|
+-------+------+-----+-----+------+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8844517184942717|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8844517184942716|
+------------------+


In [None]:
ner_model.stages[1].write().overwrite().save('models/custom_NER_30epoch')

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

loaded_ner_model = MedicalNerModel.load("models/custom_NER_30epoch")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        clinical_embeddings,
        loaded_ner_model,
        converter])

empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)

from sparknlp.base import LightPipeline

light_model = LightPipeline(prediction_model)

In [None]:
text = "She has a metastatic breast cancer"

result = light_model.fullAnnotate(text)[0]

[(i.result, i.metadata['entity']) for i in result['ner_span']]

[('metastatic breast cancer', 'Disease')]
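
The `ner_span` chunks come from merging token-level `B-`/`I-` tags into spans. The sketch below shows that idea in plain Python. It is an illustration of BIO merging, not the `NerConverter` implementation, and the tag sequence in the example is assumed rather than taken from the model output.

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (text, entity) chunks."""
    chunks, current, entity = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != entity):
            # Start a new chunk (also for an I- tag with no matching open chunk).
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [token], label
        elif prefix == "I":
            current.append(token)
        else:  # "O" closes any open chunk
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks


tokens = "She has a metastatic breast cancer".split()
tags = ["O", "O", "O", "B-Disease", "I-Disease", "I-Disease"]  # assumed tag sequence
print(bio_to_chunks(tokens, tags))  # [('metastatic breast cancer', 'Disease')]
```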

## BertForTokenClassification NER models

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")\
    .setInputCols("token", "sentence")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline =  Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        tokenClassifier,
        ner_converter
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . 
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . 
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . 
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . 
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . 
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . 
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . 
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

res = model.transform(spark.createDataFrame([[text]]).toDF("text"))

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
bert_token_classifier_ner_clinical download started this may take some time.
Approximate size to download 385.6 MB
[OK!]


In [None]:
from pyspark.sql import functions as F

res.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.begin', 'ner_chunk.end', 'ner_chunk.metadata')).alias("cols")) \
    .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
            F.expr("cols['0']").alias("chunk"),
            F.expr("cols['2']").alias("end"),
            F.expr("cols['3']['entity']").alias("ner_label"))\
    .filter("ner_label!='O'")\
    .show(truncate=False)

+-----------+-----------------------------+---+---------+
|sentence_id|chunk                        |end|ner_label|
+-----------+-----------------------------+---+---------+
|0          |gestational diabetes mellitus|67 |PROBLEM  |
|0          |type two diabetes mellitus   |153|PROBLEM  |
|0          |T2DM                         |160|PROBLEM  |
|0          |HTG-induced pancreatitis     |209|PROBLEM  |
|0          |an acute hepatitis           |280|PROBLEM  |
|0          |obesity                      |294|PROBLEM  |
|0          |a body mass index            |317|TEST     |
|0          |BMI                          |323|TEST     |
|0          |polyuria                     |387|PROBLEM  |
|0          |polydipsia                   |400|PROBLEM  |
|0          |poor appetite                |416|PROBLEM  |
|0          |vomiting                     |431|PROBLEM  |
|1          |amoxicillin                  |522|TREATMENT|
|1          |a respiratory tract infection|556|PROBLEM  |
|2          |m

In [None]:
light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', save_path="display_bert_result.html")