## Posology Demo

This is a demonstration of using SparkNLP for extracting posology relations. The following relations are supported:

DRUG-DOSAGE
DRUG-FREQUENCY
DRUG-ADE (Adverse Drug Events)
DRUG-FORM
DRUG-ROUTE
DRUG-DURATION
DRUG-REASON
DRUG-STRENGTH

The model has been validated against the posology dataset described in (Magge, Scotch, & Gonzalez-Hernandez, 2018).

| Relation | Recall | Precision | F1 | F1 (Magge, Scotch, & Gonzalez-Hernandez, 2018) |
| --- | --- | --- | --- | --- |
| DRUG-ADE | 0.66 | 1.00 | **0.80** | 0.76 |
| DRUG-DOSAGE | 0.89 | 1.00 | **0.94** | 0.91 |
| DRUG-DURATION | 0.75 | 1.00 | **0.85** | 0.92 |
| DRUG-FORM | 0.88 | 1.00 | **0.94** | 0.95* |
| DRUG-FREQUENCY | 0.79 | 1.00 | **0.88** | 0.90 |
| DRUG-REASON | 0.60 | 1.00 | **0.75** | 0.70 |
| DRUG-ROUTE | 0.79 | 1.00 | **0.88** | 0.95* |
| DRUG-STRENGTH | 0.95 | 1.00 | **0.98** | 0.97 |


*Magge, Scotch, & Gonzalez-Hernandez (2018) collapsed DRUG-FORM and DRUG-ROUTE into a single relation.
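
Each F1 value in the table is the harmonic mean of the precision and recall in the same row. A minimal sketch of that arithmetic follows; the helper name `f1_score` is ours and is not part of Spark NLP. Because the table's precision and recall are already rounded to two decimals, a recomputed F1 can differ from the printed one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# DRUG-REASON row of the table: precision 1.00, recall 0.60
print(round(f1_score(1.00, 0.60), 2))  # 0.75
```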

In [2]:
import os
import re
import pyspark
import sparknlp
import sparknlp_jsl
import functools 
import json

import numpy as np
from scipy import spatial
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sparknlp_jsl.annotator import *
from sparknlp.annotator import *
from sparknlp.base import *


**Build pipeline using SparkNLP pretrained models and the relation extraction model optimized for posology**.

The precision of the RE model is controlled by "setMaxSyntacticDistance(4)", which sets the maximum syntactic distance between named entities to 4. A larger value improves recall at the expense of precision. With a value of 4, the model reaches a precision of 1.00 in the evaluation above (it produced no false positives) while keeping reasonably good recall.
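
To build intuition for the distance threshold, the sketch below measures the distance between two tokens in a dependency tree. It assumes that "syntactic distance" means the number of dependency edges on the path between the two tokens; this notebook does not spell out the exact definition used by the model. The function `tree_distance` and the hand-written parse are illustrative only, not output of the `dependency_conllu` model.

```python
def tree_distance(heads, a, b):
    """Number of dependency edges on the path between tokens a and b.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    def path_to_root(i):
        path = [i]
        while heads[path[-1]] != -1:
            path.append(heads[path[-1]])
        return path

    # Steps needed to reach each ancestor of a (a itself is at depth 0)
    steps_from_a = {node: d for d, node in enumerate(path_to_root(a))}
    # Walk up from b until the first ancestor shared with a
    for steps_from_b, node in enumerate(path_to_root(b)):
        if node in steps_from_a:
            return steps_from_a[node] + steps_from_b
    raise ValueError("tokens are not in the same tree")


# Hand-made parse of "prescribed 1 unit of Advil" (an assumption for the demo)
tokens = ["prescribed", "1", "unit", "of", "Advil"]
heads = [-1, 2, 0, 4, 2]

print(tree_distance(heads, tokens.index("1"), tokens.index("Advil")))  # 2
```

Under this reading, a candidate pair whose distance exceeds the configured maximum would not be considered by the relation model.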

In [4]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = NerDLModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentence", "tokens", "embeddings")\
    .setOutputCol("ner_tags")    

ner_chunker = NerConverter()\
    .setInputCols(["sentence", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

reModel = RelationExtractionModel()\
    .pretrained("posology_re", "en")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)

pipeline = Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    ner_tagger,
    ner_chunker,
    dependency_parser,
    reModel
])

**Create empty dataframe**

In [6]:
schema = T.StructType([T.StructField("text", T.StringType(), True)])
empty_df = spark.createDataFrame([],schema)

**Create a light pipeline for annotating free text**

In [8]:
model = pipeline.fit(empty_df)
lmodel = sparknlp.base.LightPipeline(model)

**Sample free text**

In [10]:
text = """
The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also 
given 1 unit of Metformin daily.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 
12 units of insulin lispro with meals , and metformin 1000 mg two times a day.
"""
results = lmodel.fullAnnotate(text)

**Show extracted relations**

In [12]:
for rel in results[0]["relations"]:
    print("{}({}={} - {}={})".format(
        rel.result, 
        rel.metadata['entity1'], 
        rel.metadata['chunk1'], 
        rel.metadata['entity2'],
        rel.metadata['chunk2']
    ))

In [13]:
        
rel_pairs=[]
for rel in results[0]['relations']:
    rel_pairs.append((
        rel.result, 
        rel.metadata['entity1'], 
        rel.metadata['entity1_begin'],
        rel.metadata['entity1_end'],
        rel.metadata['chunk1'], 
        rel.metadata['entity2'],
        rel.metadata['entity2_begin'],
        rel.metadata['entity2_end'],
        rel.metadata['chunk2'], 
        rel.metadata['confidence']
    ))

import pandas as pd

rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

rel_df 

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | DOSAGE-DRUG | DOSAGE | 28 | 33 | 1 unit | DRUG | 38 | 42 | Advil | 1.0 |
| 1 | DRUG-DURATION | DRUG | 38 | 42 | Advil | DURATION | 44 | 53 | for 5 days | 1.0 |
| 2 | DOSAGE-DRUG | DOSAGE | 96 | 101 | 1 unit | DRUG | 106 | 114 | Metformin | 1.0 |
| 3 | DRUG-FREQUENCY | DRUG | 106 | 114 | Metformin | FREQUENCY | 116 | 120 | daily | 1.0 |
| 4 | DOSAGE-DRUG | DOSAGE | 190 | 197 | 40 units | DRUG | 202 | 217 | insulin glargine | 1.0 |
| 5 | DRUG-FREQUENCY | DRUG | 202 | 217 | insulin glargine | FREQUENCY | 219 | 226 | at night | 1.0 |
| 6 | DOSAGE-DRUG | DOSAGE | 231 | 238 | 12 units | DRUG | 243 | 256 | insulin lispro | 1.0 |
| 7 | DRUG-FREQUENCY | DRUG | 243 | 256 | insulin lispro | FREQUENCY | 258 | 267 | with meals | 1.0 |
| 8 | DRUG-STRENGTH | DRUG | 275 | 283 | metformin | STRENGTH | 285 | 291 | 1000 mg | 1.0 |
| 9 | DRUG-FREQUENCY | DRUG | 275 | 283 | metformin | FREQUENCY | 293 | 307 | two times a day | 1.0 |
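
Some relations in this output are reported with the drug second (DOSAGE-DRUG), while the list of supported relations names the drug first. If downstream code expects a single orientation, the pair can be reordered after extraction. The sketch below assumes each row is a plain dict with the same keys as the `rel_df` columns; `drug_first` is a hypothetical helper, not a Spark NLP API.

```python
# Fields that have to be swapped together when the pair is reversed
PAIRED_KEYS = [
    ("entity1", "entity2"),
    ("entity1_begin", "entity2_begin"),
    ("entity1_end", "entity2_end"),
    ("chunk1", "chunk2"),
]


def drug_first(row):
    """Return a copy of `row` in which the DRUG entity comes first."""
    out = dict(row)
    if row["entity2"] == "DRUG" and row["entity1"] != "DRUG":
        for left, right in PAIRED_KEYS:
            out[left], out[right] = row[right], row[left]
        # Assumption: the label is simply the two entity types joined by "-"
        out["relation"] = "{}-{}".format(out["entity1"], out["entity2"])
    return out


row = {
    "relation": "DOSAGE-DRUG",
    "entity1": "DOSAGE", "entity1_begin": 28, "entity1_end": 33, "chunk1": "1 unit",
    "entity2": "DRUG", "entity2_begin": 38, "entity2_end": 42, "chunk2": "Advil",
    "confidence": 1.0,
}
fixed = drug_first(row)
print(fixed["relation"], fixed["chunk1"], fixed["chunk2"])  # DRUG-DOSAGE Advil 1 unit
```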


In [14]:
text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), 
one prior episode of HTG-induced pancreatitis three years prior to presentation,  associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . 
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
She had close follow-up with endocrinology post discharge .
""".replace("\n", "")


annotations = lmodel.fullAnnotate(text)

rel_pairs=[]
for rel in annotations[0]['relations']:
    # Skip pairs for which the model found no relation
    if rel.result != "O":
        rel_pairs.append((
            rel.result,
            rel.metadata['entity1'],
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity2'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['confidence']
        ))

import pandas as pd

rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

rel_df[rel_df.relation!="O"]



| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | DURATION-DRUG | DURATION | 492 | 499 | five-day | DRUG | 511 | 521 | amoxicillin | 1.0 |
| 1 | DRUG-DURATION | DRUG | 680 | 692 | dapagliflozin | DURATION | 694 | 707 | for six months | 1.0 |
| 2 | DRUG-ROUTE | DRUG | 1939 | 1945 | insulin | ROUTE | 1947 | 1950 | drip | 1.0 |
| 3 | DOSAGE-DRUG | DOSAGE | 2254 | 2261 | 40 units | DRUG | 2266 | 2281 | insulin glargine | 1.0 |
| 4 | DRUG-FREQUENCY | DRUG | 2266 | 2281 | insulin glargine | FREQUENCY | 2283 | 2290 | at night | 1.0 |
| 5 | DOSAGE-DRUG | DOSAGE | 2294 | 2301 | 12 units | DRUG | 2306 | 2319 | insulin lispro | 1.0 |
| 6 | DRUG-FREQUENCY | DRUG | 2306 | 2319 | insulin lispro | FREQUENCY | 2321 | 2330 | with meals | 1.0 |
| 7 | DRUG-STRENGTH | DRUG | 2338 | 2346 | metformin | STRENGTH | 2348 | 2354 | 1000 mg | 1.0 |
| 8 | DRUG-FREQUENCY | DRUG | 2338 | 2346 | metformin | FREQUENCY | 2356 | 2370 | two times a day | 1.0 |
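
The extracted pairs can be folded into one record per drug mention, which is often the shape needed downstream. The sketch below assumes the rows are tuples in the same field order as `rel_pairs`; `medication_records` is a hypothetical helper name, and the sample rows are copied from the output above.

```python
from collections import defaultdict


def medication_records(rel_pairs):
    """Group relation tuples by drug mention, keyed on (chunk text, begin offset)."""
    records = defaultdict(dict)
    for (_, ent1, begin1, _, chunk1, ent2, begin2, _, chunk2, _) in rel_pairs:
        if ent1 == "DRUG":
            drug, attribute, value = (chunk1, begin1), ent2, chunk2
        elif ent2 == "DRUG":
            drug, attribute, value = (chunk2, begin2), ent1, chunk1
        else:
            continue  # neither side is a drug
        # A later pair with the same attribute overwrites an earlier one
        records[drug][attribute] = value
    return dict(records)


sample = [
    ("DOSAGE-DRUG", "DOSAGE", 2254, 2261, "40 units", "DRUG", 2266, 2281, "insulin glargine", 1.0),
    ("DRUG-FREQUENCY", "DRUG", 2266, 2281, "insulin glargine", "FREQUENCY", 2283, 2290, "at night", 1.0),
    ("DRUG-STRENGTH", "DRUG", 2338, 2346, "metformin", "STRENGTH", 2348, 2354, "1000 mg", 1.0),
]

for (name, begin), attributes in medication_records(sample).items():
    print(name, begin, attributes)
```

Keying on the begin offset as well as the text keeps two mentions of the same drug in one note apart.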


## Load a model from local storage

In [16]:

dbutils.fs.cp("dbfs:/FileStore/shared_uploads/veysel@johnsnowlabs.com/i2b2_RE.zip","file:/databricks/driver/i2b2_RE.zip")


In [17]:
%fs ls file:/databricks/driver/RE

| path | name | size |
| --- | --- | --- |
| file:/databricks/driver/RE/metadata/ | metadata/ | 4096 |
| file:/databricks/driver/RE/encoder | encoder | 321 |
| file:/databricks/driver/RE/generic_classifier_tensorflow | generic_classifier_tensorflow | 6209846 |
| file:/databricks/driver/RE/categories | categories | 175 |


In [18]:
import zipfile
with zipfile.ZipFile('/databricks/driver/i2b2_RE.zip', 'r') as zip_ref:
    zip_ref.extractall('/databricks/driver/')
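
Before unpacking an archive onto the driver, it can help to check which top-level folders it will create. The sketch below uses only the standard library; `top_level_entries` is our own helper name, and the demo archive is a throwaway stand-in built in a temporary directory so the snippet runs anywhere.

```python
import os
import tempfile
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names stored in a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist()})


# Build a small demo archive (a stand-in, not the real i2b2_RE.zip)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("RE/categories", "TrAP\n")
        zf.writestr("RE/metadata/part-00000", "{}")
    print(top_level_entries(demo_zip))  # ['RE']
```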

In [19]:

dbutils.fs.cp("file:/databricks/driver/RE", "dbfs:/FileStore/shared_uploads/veysel@johnsnowlabs.com/RE", recurse=True)


In [20]:


clinical_ner_tagger = sparknlp.annotators.NerDLModel()\
    .pretrained("ner_clinical_large", "en", "clinical/models")\
    .setInputCols("sentence", "tokens", "embeddings")\
    .setOutputCol("ner_tags")    

clinical_re_Model = RelationExtractionModel()\
    .load("dbfs:/FileStore/shared_uploads/veysel@johnsnowlabs.com/RE")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations_clinical")\
    .setMaxSyntacticDistance(4)\
    .setRelationPairs(["problem-test", "problem-treatment"])

loaded_pipeline = Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    clinical_ner_tagger,
    ner_chunker,
    dependency_parser,
    clinical_re_Model
])

In [21]:
loaded_model = loaded_pipeline.fit(empty_df)
loaded_lmodel = LightPipeline(loaded_model)

In [22]:
text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), 
one prior episode of HTG-induced pancreatitis three years prior to presentation,  associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . 
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
She had close follow-up with endocrinology post discharge .
""".replace("\n", "")

annotations = loaded_lmodel.fullAnnotate(text)

rel_pairs=[]
for rel in annotations[0]['relations_clinical']:
    # Skip pairs for which the model found no relation
    if rel.result != "O":
        rel_pairs.append((
            rel.result,
            rel.metadata['entity1'],
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity2'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['confidence']
        ))

import pandas as pd

rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

rel_df[rel_df.relation!="O"]


| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TeRP | PROBLEM | 39 | 67 | gestational diabetes mellitus | TEST | 321 | 323 | BMI | 1.0 |
| 1 | TeRP | PROBLEM | 117 | 153 | subsequent type two diabetes mellitus | TEST | 321 | 323 | BMI | 1.0 |
| 2 | TrAP | PROBLEM | 616 | 619 | T2DM | TREATMENT | 625 | 636 | atorvastatin | 0.99955326 |
| 3 | TeRP | TEST | 738 | 757 | Physical examination | PROBLEM | 795 | 809 | dry oral mucosa | 0.9994142 |
| 4 | TrWP | TEST | 1245 | 1257 | blood samples | PROBLEM | 1264 | 1273 | hemolyzing | 0.9854173 |
| 5 | TrWP | TEST | 1245 | 1257 | blood samples | PROBLEM | 1282 | 1300 | significant lipemia | 0.99998724 |
| 6 | TeRP | TEST | 1534 | 1546 | the anion gap | PROBLEM | 1552 | 1565 | still elevated | 0.9965193 |
| 7 | TrAP | TEST | 1837 | 1844 | analysis | PROBLEM | 1853 | 1879 | interference from turbidity | 0.9676019 |
| 8 | TrAP | PROBLEM | 1966 | 1968 | HTG | TREATMENT | 1975 | 1985 | a reduction | 0.9875973 |
| 9 | TrAP | PROBLEM | 1966 | 1968 | HTG | TEST | 1990 | 2002 | the anion gap | 0.9993911 |


### The set of relations defined in the 2010 i2b2 relation challenge

TrIP: A certain treatment has improved or cured a medical problem (eg, ‘infection resolved with antibiotic course’)

TrWP: A patient's medical problem has deteriorated or worsened because of or in spite of a treatment being administered (eg, ‘the tumor was growing despite the drain’)

TrCP: A treatment caused a medical problem (eg, ‘penicillin causes a rash’)

TrAP: A treatment administered for a medical problem (eg, ‘Dexamphetamine for narcolepsy’)

TrNAP: The administration of a treatment was avoided because of a medical problem (eg, ‘Ralafen which is contra-indicated because of ulcers’)

TeRP: A test has revealed some medical problem (eg, ‘an echocardiogram revealed a pericardial effusion’)

TeCP: A test was performed to investigate a medical problem (eg, ‘chest x-ray done to rule out pneumonia’)

PIP: Two problems are related to each other (eg, ‘Azotemia presumed secondary to sepsis’)
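
When reading model output, the relation codes are easier to interpret with their meaning attached. The lookup below is a small sketch that paraphrases the definitions listed above; the dictionary `I2B2_RELATIONS` and the helper `describe` are names chosen for this example.

```python
# Short paraphrases of the 2010 i2b2 relation definitions listed above
I2B2_RELATIONS = {
    "TrIP": "treatment improves or cures the problem",
    "TrWP": "problem worsens because of or in spite of the treatment",
    "TrCP": "treatment causes the problem",
    "TrAP": "treatment is administered for the problem",
    "TrNAP": "treatment is not administered because of the problem",
    "TeRP": "test reveals the problem",
    "TeCP": "test is performed to investigate the problem",
    "PIP": "problem is related to another problem",
}


def describe(label):
    """Return a readable description for a relation code."""
    return I2B2_RELATIONS.get(label, "unknown relation")


print("TrAP:", describe("TrAP"))
```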

## Train a Relation Extraction Model

In [25]:
data = spark.read.option("header","true").format("csv").load("dbfs:/FileStore/shared_uploads/veysel@johnsnowlabs.com/i2b2_clinical_relfeatures.csv")

data.show(3)

In [26]:

rels = ["TrIP", "TrAP", "TeCP", "TrNAP", "TrCP", "PIP", "TrWP", "TeRP"]

valid_rel_query = "(" + " OR ".join(["rel = '{}'".format(rel) for rel in rels]) + ")"

data = data\
    .where(valid_rel_query)\
    .withColumn("begin1i", F.expr("cast(begin1 AS Int)"))\
    .withColumn("end1i", F.expr("cast(end1 AS Int)"))\
    .withColumn("begin2i", F.expr("cast(begin2 AS Int)"))\
    .withColumn("end2i", F.expr("cast(end2 AS Int)"))

train_data = data.where("dataset='train'")

test_data = data.where("dataset='test'")
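
The filter passed to `where` in the cell above is an ordinary SQL predicate assembled as a string. The sketch below rebuilds it in plain Python, without Spark, to show what the expression looks like.

```python
rels = ["TrIP", "TrAP", "TeCP", "TrNAP", "TrCP", "PIP", "TrWP", "TeRP"]

# One "rel = '<label>'" clause per relation, joined with OR
valid_rel_query = "(" + " OR ".join("rel = '{}'".format(rel) for rel in rels) + ")"

print(valid_rel_query)
```

An equivalent filter can also be written with the PySpark column API as `F.col("rel").isin(rels)`, which avoids building the string by hand.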

In [27]:
"file:/databricks/driver/RE_in1200D_out20.pb"

In [28]:
documenter = sparknlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = NerDLModel()\
    .pretrained("ner_clinical_large", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")
    
dependency_parser = sparknlp.annotators.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["document", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")
    
reApproach = sparknlp_jsl.annotator.RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations_t")\
    .setLabelColumn("rel")\
    .setEpochsNumber(50)\
    .setBatchSize(200)\
    .setLearningRate(0.001)\
    .setModelFile("dbfs:/FileStore/shared_uploads/veysel@johnsnowlabs.com/RE_in1200D_out20.pb")\
    .setFixImbalance(True)\
    .setValidationSplit(0.05)\
    .setFromEntity("begin1i", "end1i", "label1")\
    .setToEntity("begin2i", "end2i", "label2")
    
finisher = sparknlp.Finisher()\
    .setInputCols(["relations_t"])\
    .setOutputCols(["relations"])\
    .setCleanAnnotations(False)\
    .setValueSplitSymbol(",")\
    .setAnnotationSplitSymbol(",")\
    .setOutputAsArray(False)
    
train_pipeline = Pipeline(stages=[
    documenter, sentencer, tokenizer, words_embedder, pos_tagger,
    ner_tagger, ner_converter, dependency_parser,
    reApproach, finisher
])

In [29]:
dbutils.fs.cp("dbfs:/FileStore/shared_uploads/veysel@johnsnowlabs.com/RE_in1200D_out20.pb","file:/databricks/driver/RE_in1200D_out20.pb")

dbutils.fs.cp("file:/databricks/driver/RE", "dbfs:/FileStore/shared_uploads/veysel@johnsnowlabs.com/RE", recurse=True)


In [30]:
%fs ls file:/databricks/driver/

| path | name | size |
| --- | --- | --- |
| file:/databricks/driver/conf/ | conf/ | 4096 |
| file:/databricks/driver/pubmed_sample_text_small.csv | pubmed_sample_text_small.csv | 9363435 |
| file:/databricks/driver/logs/ | logs/ | 4096 |
| file:/databricks/driver/derby.log | derby.log | 717 |
| file:/databricks/driver/RE_in1200D_out20.pb | RE_in1200D_out20.pb | 135162 |
| file:/databricks/driver/i2b2_RE.zip | i2b2_RE.zip | 6260915 |
| file:/databricks/driver/eventlogs/ | eventlogs/ | 4096 |
| file:/databricks/driver/ganglia/ | ganglia/ | 4096 |


In [31]:
rel_model = train_pipeline.fit(train_data)
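
After training, a natural next step is to score `test_data` and compare the predicted relation labels with the gold `rel` column. How predictions are pulled out of the transformed DataFrame depends on the pipeline output, so the sketch below starts from two plain Python lists that are assumed to have been collected already. The function `per_label_metrics` and the toy labels are illustrative, not results from this model.

```python
from collections import Counter


def per_label_metrics(gold, predicted):
    """Return {label: (precision, recall, f1)} for aligned label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_n = tp[label] + fp[label]
        gold_n = tp[label] + fn[label]
        precision = tp[label] / predicted_n if predicted_n else 0.0
        recall = tp[label] / gold_n if gold_n else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy labels for the demo only
gold = ["TrAP", "TrAP", "TeRP", "PIP"]
pred = ["TrAP", "TeRP", "TeRP", "PIP"]

for label, (p, r, f) in per_label_metrics(gold, pred).items():
    print("{:5s} P={:.2f} R={:.2f} F1={:.2f}".format(label, p, r, f))
```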
