![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/16.0.Coreference_Resolution_with_Clinical_NER_Models.ipynb)

# Coreference Resolution

Coreference Resolution is the task of finding all expressions that refer to the same entity in a text.

This matters in healthcare documents, where the patient, procedure, and disease are often named at the beginning of the document and later referred to only as "the patient", "the disease", or "the procedure" instead of by the original name.

There are two ways to accomplish coreference resolution:
1. With the dedicated `SpanBertCorefModel` annotator;
2. With Question Answering.

Let's take a look at some examples and how to solve them using Coreference Resolution.

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual
# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM.
# Make sure to restart your notebook afterwards for the changes to take effect.
nlp.install()

In [4]:
from johnsnowlabs import nlp, medical, visual

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_jsl.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


## 1. SpanBertCoref

The `SpanBertCorefModel` annotator performs coreference resolution with BERT and SpanBERT models. It is based on the paper [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091).

In Spark NLP, the `SpanBertCorefModel` annotator is the implementation of this SpanBERT-based coreference resolution model.

In [5]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

corefResolution = nlp.SpanBertCorefModel()\
    .pretrained("spanbert_base_coref")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("corefs")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_jsl = medical.NerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_jsl")

ner_jsl_converter = medical.NerConverterInternal() \
    .setInputCols(["sentences", "tokens", "ner_jsl"]) \
    .setOutputCol("ner_chunk")


pipeline = nlp.Pipeline(stages=[document_assembler, sentence_detector, tokenizer, corefResolution,
                                word_embeddings, ner_jsl, ner_jsl_converter])

spanbert_base_coref download started this may take some time.
Approximate size to download 540.1 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_enriched download started this may take some time.
[OK!]


In [6]:
example = """ This is a very pleasant 72-year-old woman, who I have been following for her pancytopenia.
After several bone marrow biopsies, she was diagnosed with aplastic anemia.
She started cyclosporine and prednisone on 03/30/10. She was admitted to the hospital from 07/11/10 to 07/14/10 with acute kidney injury.
It was thought that it was due to cyclosporine toxicity because her cyclosporine level was 555.
Therefore that was held. Overall, she tells me that now she feels quite well since leaving the hospital."""

data = spark.createDataFrame([[example]]).toDF("text")

model = pipeline.fit(data)

result = model.transform(data).cache()

In [7]:
from collections import defaultdict
import pyspark.sql.functions as F

def get_relation_table(result):
    """Helper function to combine NER and SpanBertCorefModel results
    """

    # get SpanBertCorefModel results
    coref_data = result.select(F.explode(F.arrays_zip(result.corefs.result,
                                     result.corefs.begin,
                                     result.corefs.end,
                                     result.corefs.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("pronoun"),
              F.expr("cols['3']['sentence']").alias("pronoun_sentence"),
              F.expr("cols['1']").alias("pronoun_begin"),
              F.expr("cols['2']").alias("pronoun_end"),
              F.expr("cols['3']['head']").alias("ref_chunk"),
              F.expr("cols['3']['head.sentence']").alias("ref_sentence"),
              F.expr("cols['3']['head.begin']").alias("ref_begin"),
              F.expr("cols['3']['head.end']").alias("ref_end")).sort("ref_begin").toPandas()


    # get NER results
    ner_data = result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['sentence']").alias("sentence")).toPandas()


    # find matches of NER entities in ROOT (referenced initial chunks)
    coref_data["ner"] = ''
    for index in coref_data.index:

        ner_list = defaultdict(list)

        for _, ner in ner_data.iterrows():
            if coref_data.ref_chunk[index]=="ROOT" and \
               coref_data.pronoun_sentence[index] == ner.sentence and \
               coref_data.pronoun_begin[index]<=ner.begin and \
               coref_data.pronoun_end[index]>=ner.end:

                ner_list[ner.ner_label].append(ner.chunk)

        # use .at to avoid chained assignment, which may write to a temporary copy
        coref_data.at[index, "ner"] = dict(ner_list)


    # change to type int for necessary columns
    int_column_list = ["pronoun_sentence", "pronoun_begin", "pronoun_end", "ref_sentence", "ref_begin", "ref_end"]
    coref_data[int_column_list] = coref_data[int_column_list].astype(int)


    # copy NER entities of ROOT to its references
    merged_table = coref_data.merge(coref_data[["pronoun_sentence", "pronoun_begin", "pronoun_end", "ner"]], how="left",
                                    left_on=["ref_sentence", "ref_begin", "ref_end"],
                                    right_on=["pronoun_sentence", "pronoun_begin", "pronoun_end"] )
    merged_table.drop(["pronoun_sentence_y","pronoun_begin_y","pronoun_end_y"], axis=1, inplace=True) #remove useless columns

    filtered = merged_table["ref_chunk"] != "ROOT"
    merged_table.loc[filtered,"ner_x"] = merged_table.loc[filtered,"ner_y"]

    merged_table.drop("ner_y", axis=1, inplace=True)

    merged_table.columns = ['pronoun', 'pronoun_sentence', 'pronoun_begin', 'pronoun_end',
                            'ref_chunk', 'ref_sentence', 'ref_begin', 'ref_end', 'ner'] #rename columns

    text = result.collect()[0]["text"]
    # restore the original spelling of the chunks from the source text in the "pronoun" and "ref_chunk" columns
    merged_table["pronoun"] = merged_table.apply(lambda row: text[row.pronoun_begin:row.pronoun_end+1],axis=1)
    merged_table["ref_chunk"] = merged_table.apply(lambda row: text[row.ref_begin:row.ref_end+1] if row.ref_sentence!=-1 else "ROOT", axis=1)

    return merged_table[merged_table["ner"]!={}]

In [8]:
get_relation_table(result)

| | pronoun | pronoun_sentence | pronoun_begin | pronoun_end | ref_chunk | ref_sentence | ref_begin | ref_end | ner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a very pleasant 72-year-old woman, who I have ... | 0 | 9 | 89 | ROOT | -1 | -1 | -1 | {'Age': ['72-year-old'], 'Gender': ['woman', '... |
| 1 | acute kidney injury | 3 | 287 | 305 | ROOT | -1 | -1 | -1 | {'Kidney_Disease': ['acute kidney injury']} |
| 3 | the hospital | 3 | 243 | 254 | ROOT | -1 | -1 | -1 | {'Clinical_Dept': ['hospital']} |
| 4 | the hospital | 6 | 497 | 508 | the hospital | 3 | 243 | 254 | {'Clinical_Dept': ['hospital']} |
| 5 | it | 4 | 329 | 330 | acute kidney injury | 3 | 287 | 305 | {'Kidney_Disease': ['acute kidney injury']} |
| 7 | her | 0 | 74 | 76 | a very pleasant 72-year-old woman, who I have ... | 0 | 9 | 89 | {'Age': ['72-year-old'], 'Gender': ['woman', '... |
| 8 | she | 1 | 129 | 131 | a very pleasant 72-year-old woman, who I have ... | 0 | 9 | 89 | {'Age': ['72-year-old'], 'Gender': ['woman', '... |
| 9 | She | 2 | 170 | 172 | a very pleasant 72-year-old woman, who I have ... | 0 | 9 | 89 | {'Age': ['72-year-old'], 'Gender': ['woman', '... |
| 10 | She | 3 | 223 | 225 | a very pleasant 72-year-old woman, who I have ... | 0 | 9 | 89 | {'Age': ['72-year-old'], 'Gender': ['woman', '... |
| 11 | her | 4 | 373 | 375 | a very pleasant 72-year-old woman, who I have ... | 0 | 9 | 89 | {'Age': ['72-year-old'], 'Gender': ['woman', '... |


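Here we get the patient, disease, and hospital references that carry an NER label. The doctor references (`I`, `me`) are found by the coreference model but dropped from this table, because no NER entity falls inside them. The two tables below are the inputs used to build the relation table above.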
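The matching logic inside `get_relation_table` can be hard to follow through the pandas calls, so here is a minimal pure-Python sketch of the same idea, using a few rows copied from the tables in this notebook. The function name `attach_ner` and the dictionary keys are hypothetical and chosen for illustration; the sketch also simplifies the original by assuming every reference points directly at a ROOT mention.

```python
from collections import defaultdict

# A few rows copied from the coref_data table (spans are character offsets).
coref_rows = [
    {"mention": "a very pleasant 72-year-old woman, who I have ...",
     "sentence": 0, "begin": 9, "end": 89, "head": "ROOT"},
    {"mention": "I", "sentence": 0, "begin": 48, "end": 48, "head": "ROOT"},
    {"mention": "me", "sentence": 6, "begin": 450, "end": 451, "head": "I",
     "head_sentence": 0, "head_begin": 48, "head_end": 48},
    {"mention": "she", "sentence": 1, "begin": 129, "end": 131, "head": "woman",
     "head_sentence": 0, "head_begin": 9, "head_end": 89},
]

# A few rows copied from the ner_data table.
ner_rows = [
    {"chunk": "72-year-old", "label": "Age", "begin": 25, "end": 35, "sentence": 0},
    {"chunk": "woman", "label": "Gender", "begin": 37, "end": 41, "sentence": 0},
    {"chunk": "her", "label": "Gender", "begin": 74, "end": 76, "sentence": 0},
    {"chunk": "pancytopenia", "label": "Disease_Syndrome_Disorder",
     "begin": 78, "end": 89, "sentence": 0},
    {"chunk": "bone marrow biopsies", "label": "Procedure",
     "begin": 107, "end": 126, "sentence": 1},
    {"chunk": "she", "label": "Gender", "begin": 129, "end": 131, "sentence": 1},
]


def attach_ner(coref_rows, ner_rows):
    """Hypothetical helper: attach NER labels to ROOT mentions, then to their references."""
    # Step 1: a ROOT mention collects every NER chunk that lies inside its span
    # and sits in the same sentence.
    labels_by_span = {}
    for row in coref_rows:
        if row["head"] != "ROOT":
            continue
        found = defaultdict(list)
        for ner in ner_rows:
            if (ner["sentence"] == row["sentence"]
                    and row["begin"] <= ner["begin"]
                    and ner["end"] <= row["end"]):
                found[ner["label"]].append(ner["chunk"])
        labels_by_span[(row["sentence"], row["begin"], row["end"])] = dict(found)

    # Step 2: every reference inherits the labels of the mention it points to.
    table = []
    for row in coref_rows:
        if row["head"] == "ROOT":
            key = (row["sentence"], row["begin"], row["end"])
        else:
            key = (row["head_sentence"], row["head_begin"], row["head_end"])
        table.append({**row, "ner": labels_by_span.get(key, {})})

    # Step 3: keep only the rows that ended up with at least one NER label.
    return [row for row in table if row["ner"]]


relation = attach_ner(coref_rows, ner_rows)
for row in relation:
    print(row["mention"], "->", row["ner"])
```

In this sketch the `I` and `me` rows disappear in step 3, which mirrors why the doctor references are absent from the relation table.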

In [9]:
coref_data = result.select(F.explode(F.arrays_zip(result.corefs.result, result.corefs.begin,
                                                  result.corefs.end, result.corefs.metadata)).alias("cols")) \
                    .select(F.expr("cols['0']").alias("pronoun"),
                            F.expr("cols['3']['sentence']").alias("pronoun_sentence"),
                            F.expr("cols['1']").alias("pronoun_begin"),
                            F.expr("cols['2']").alias("pronoun_end"),
                            F.expr("cols['3']['head']").alias("ref_chunk"),
                            F.expr("cols['3']['head.sentence']").alias("ref_sentence"),
                            F.expr("cols['3']['head.begin']").alias("ref_begin"),
                            F.expr("cols['3']['head.end']").alias("ref_end")).sort("ref_begin").toPandas()
coref_data

| | pronoun | pronoun_sentence | pronoun_begin | pronoun_end | ref_chunk | ref_sentence | ref_begin | ref_end |
|---|---|---|---|---|---|---|---|---|
| 0 | a very pleasant 72 woman , who I have been fol... | 0 | 9 | 89 | ROOT | -1 | -1 | -1 |
| 1 | acute kidney injury | 3 | 287 | 305 | ROOT | -1 | -1 | -1 |
| 2 | I | 0 | 48 | 48 | ROOT | -1 | -1 | -1 |
| 3 | the hospital | 3 | 243 | 254 | ROOT | -1 | -1 | -1 |
| 4 | the hospital | 6 | 497 | 508 | the hospital | 3 | 243 | 254 |
| 5 | it | 4 | 329 | 330 | acute kidney injury | 3 | 287 | 305 |
| 6 | me | 6 | 450 | 451 | I | 0 | 48 | 48 |
| 7 | her | 0 | 74 | 76 | a very pleasant 72 woman , who I have been fol... | 0 | 9 | 89 |
| 8 | she | 1 | 129 | 131 | a very pleasant 72 woman , who I have been fol... | 0 | 9 | 89 |
| 9 | She | 2 | 170 | 172 | a very pleasant 72 woman , who I have been fol... | 0 | 9 | 89 |


In [10]:
ner_data = result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.begin,
                                                result.ner_chunk.end, result.ner_chunk.metadata)).alias("cols")) \
                 .select(F.expr("cols['0']").alias("chunk"),
                         F.expr("cols['3']['entity']").alias("ner_label"),
                         F.expr("cols['1']").alias("begin"),
                         F.expr("cols['2']").alias("end"),
                         F.expr("cols['3']['sentence']").alias("sentence")).toPandas()
ner_data

| | chunk | ner_label | begin | end | sentence |
|---|---|---|---|---|---|
| 0 | 72-year-old | Age | 25 | 35 | 0 |
| 1 | woman | Gender | 37 | 41 | 0 |
| 2 | her | Gender | 74 | 76 | 0 |
| 3 | pancytopenia | Disease_Syndrome_Disorder | 78 | 89 | 0 |
| 4 | bone marrow biopsies | Procedure | 107 | 126 | 1 |
| 5 | she | Gender | 129 | 131 | 1 |
| 6 | aplastic anemia | Disease_Syndrome_Disorder | 152 | 166 | 1 |
| 7 | She | Gender | 170 | 172 | 2 |
| 8 | cyclosporine | Drug_Ingredient | 182 | 193 | 2 |
| 9 | prednisone | Drug_Ingredient | 199 | 208 | 2 |


Another short example:

In [11]:
example2 = "He was diagnosed with a brain tumor in 1999. All details are still not available to us about him for that year."
data2 = spark.createDataFrame([[example2]]).toDF("text")
result2 = model.transform(data2).cache()
result2.selectExpr("explode(corefs) AS coref").selectExpr("coref.result as pronoun", "coref.metadata").show(truncate=False)

+---------+------------------------------------------------------------------------------------+
|pronoun  |metadata                                                                            |
+---------+------------------------------------------------------------------------------------+
|He       |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|him      |{head.sentence -> 0, head -> He, head.begin -> 0, head.end -> 1, sentence -> 1}     |
|1999     |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|that year|{head.sentence -> 0, head -> 1999, head.begin -> 39, head.end -> 42, sentence -> 1} |
+---------+------------------------------------------------------------------------------------+
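Each annotation in this output stores a pointer to its head (the antecedent), and a mention with `head -> ROOT` is the first mention of its chain. As a rough illustration of how those pointers can be followed, the sketch below copies the four annotations into plain dictionaries and walks each one back to its ROOT mention. The function name `resolve_to_root` and the dictionary layout are assumptions made for this example, not part of the Spark NLP API; the mention offsets are taken from the relation table for this example.

```python
# Annotations copied from the output above, with the mention offsets added.
mentions = [
    {"text": "He", "begin": 0, "end": 1, "sentence": 0,
     "head": "ROOT", "head_begin": -1, "head_end": -1},
    {"text": "him", "begin": 93, "end": 95, "sentence": 1,
     "head": "He", "head_begin": 0, "head_end": 1},
    {"text": "1999", "begin": 39, "end": 42, "sentence": 0,
     "head": "ROOT", "head_begin": -1, "head_end": -1},
    {"text": "that year", "begin": 101, "end": 109, "sentence": 1,
     "head": "1999", "head_begin": 39, "head_end": 42},
]


def resolve_to_root(mentions):
    """Hypothetical helper: map each mention span to the text of its ROOT mention.

    Returns None for a mention whose chain cannot be followed back to a ROOT.
    """
    by_span = {(m["begin"], m["end"]): m for m in mentions}
    resolved = {}
    for m in mentions:
        current, hops = m, 0
        # Follow head pointers; the hop limit guards against malformed cycles.
        while (current is not None and current["head"] != "ROOT"
               and hops < len(mentions)):
            current = by_span.get((current["head_begin"], current["head_end"]))
            hops += 1
        reached_root = current is not None and current["head"] == "ROOT"
        resolved[(m["begin"], m["end"])] = current["text"] if reached_root else None
    return resolved


roots = resolve_to_root(mentions)
for m in mentions:
    print(m["text"], "->", roots[(m["begin"], m["end"])])
```

Keying the lookup on character spans rather than on the mention text avoids confusing two different mentions that happen to share the same wording.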



In [12]:
get_relation_table(result2)

| | pronoun | pronoun_sentence | pronoun_begin | pronoun_end | ref_chunk | ref_sentence | ref_begin | ref_end | ner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | He | 0 | 0 | 1 | ROOT | -1 | -1 | -1 | {'Gender': ['He']} |
| 1 | 1999 | 0 | 39 | 42 | ROOT | -1 | -1 | -1 | {'Date': ['1999']} |
| 2 | him | 1 | 93 | 95 | He | 0 | 0 | 1 | {'Gender': ['He']} |
| 3 | that year | 1 | 101 | 109 | 1999 | 0 | 39 | 42 | {'Date': ['1999']} |


In this example, we find the patient and date references.


With this method, you need to send the whole text at once to resolve the coreferences. If the lines where a concept is first defined are left out, the references to it can no longer be linked back to it.
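To make this limitation concrete, the sketch below checks which references would lose their antecedent if a document were cut at a given sentence. It works on plain dictionaries copied from the second example's annotations and does not call the model; the function name `references_crossing` is hypothetical, and what the model would actually output on the truncated text is not shown here.

```python
# Annotations copied from the second example: the sentence of each mention
# and the sentence of its head (-1 for ROOT mentions).
mentions = [
    {"text": "He", "sentence": 0, "head": "ROOT", "head_sentence": -1},
    {"text": "him", "sentence": 1, "head": "He", "head_sentence": 0},
    {"text": "1999", "sentence": 0, "head": "ROOT", "head_sentence": -1},
    {"text": "that year", "sentence": 1, "head": "1999", "head_sentence": 0},
]


def references_crossing(mentions, first_sentence):
    """Hypothetical helper: list the references that sit at or after
    `first_sentence` while their head sits in an earlier sentence."""
    return [
        m["text"]
        for m in mentions
        if m["sentence"] >= first_sentence
        and m["head"] != "ROOT"
        and m["head_sentence"] < first_sentence
    ]


# If only sentence 1 onward were sent, these references would have no antecedent.
print(references_crossing(mentions, first_sentence=1))
```

A check like this can help when choosing where to split a long document: a cut is safer at a sentence for which the returned list is empty.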

## 2. Question Answering

This is the second option for retrieving coreferences. You can detect the aliases and ask questions on the fly about what they refer to.

Let's see an example.

In [13]:
context="""The patient is 57-year-old male who was seen by us for elevated PSA. The patient had a prostate biopsy with T2b disease, Gleason 6.
Options such as watchful waiting, robotic prostatectomy, seed implantation with and without radiation were discussed. Risks of anesthesia, bleeding, infection, pain, MI, DVT, PE, incontinence, rectal dysfunction, voiding issues, burning pain, unexpected complications such as fistula, rectal injury, urgency, frequency, bladder issues, need for chronic Foley for six months, etc., were discussed.
The patient understood all the risks, benefits, and options, and wanted to proceed with the procedure. The patient was told that there could be other unexpected complications.
The patient has history of urethral stricture. The patient was told about the risk of worsening of the stricture with radiation. Consent was obtained.
DETAILS OF THE OPERATION:  The patient was brought to the OR. Anesthesia was applied. The patient was placed in the dorsal lithotomy position. The patient had SCDs on.
The patient was given preop antibiotics. The patient had done bowel prep the day before. Transrectal ultrasound was performed. The prostate was measured at about 32 gm.
The images were transmitted to the computer system for radiation oncologist to determine the dosing etc. Based on the computer analysis, the grid was placed.
Careful attention was drawn to keep the grid away from the patient. There was a centimeter distance between the skin and the grid.
Under ultrasound guidance, the needles were placed, first in the periphery of the prostate, a total of 63 seeds were placed throughout the prostate.
A total of 24 needles was used. Careful attention was drawn to stay away from the urethra. Under longitudinal ultrasound guidance, all the seeds were placed.
 There were no seeds visualized in the bladder under ultrasound. There was only one needle where the seeds kind of dragged as the needle was coming out on the left side and were dropped out of position.
 Other than that, all the seeds were very well distributed throughout the prostate under fluoroscopy. Please note that the Foley catheter was in place throughout the procedure.
 Prior to the seed placement, the Foley was attempted to be placed, but we had to do it using a Glidewire to get the Foley in and we used a Councill-tip catheter.
 The patient has had history of bulbar urethral stricture. Pictures were taken of the strictures in the pre-seed placement cysto time frame.
 We needed to do the cystoscopy and Glidewire to be able to get the Foley catheter in. At the end of the procedure, again cystoscopy was done, the entire bladder was visualized.
 The stricture was wide open. The prostate was slightly enlarged. The bladder appeared normal. There was no sheath inside the urethra or in the bladder.
 The cysto was done using 30-degree and 70-degree lens. At the end of the procedure, a Glidewire was placed, and 18 Councill-tip catheter was placed.
 The plan was for Foley to be left in place overnight since the patient has history of urethral strictures. The patient is to follow up tomorrow to have the Foley removed.
 The patient could also be shown to have it removed at home.The patient was brought to Recovery in stable condition at the end of the procedure. The patient tolerated the procedure well.
"""

question1 = 'Who is the patient'.lower()
question2 = 'Which disease does he have'.lower()
question3 = 'What are the risks'.lower()
question4 = 'What is the procedure'.lower()

In [14]:
document_assembler = nlp.MultiDocumentAssembler()\
  .setInputCols(["question", "context"]) \
  .setOutputCols(["document_question", "document_context"])

spanClassifier = nlp.BertForQuestionAnswering.pretrained("bert_qa_biobert_large_cased_v1.1_squad","en")\
  .setInputCols(["document_question", "document_context"]) \
  .setOutputCol("answer") \
  .setCaseSensitive(False)

pipeline_qa = nlp.Pipeline().setStages([
  document_assembler,
  spanClassifier
])

bert_qa_biobert_large_cased_v1.1_squad download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


In [15]:
qa = [[question1, context], [question2, context], [question3, context], [question4, context]]

In [16]:
example = spark.createDataFrame(qa).toDF("question", "context")

model_qa = pipeline_qa.fit(example)
result_qa = model_qa.transform(example).cache()

result_qa.select('answer.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[57 - year - old male]                                                                                                                                                        

The example text above contains many references, especially to the procedure and the patient. With the Question Answering model, asking the right questions lets us recover the original entity behind each reference.
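When there are many aliases, the question and context pairs can be generated instead of written by hand. The sketch below builds rows in the same `[question, context]` shape used for the `qa` list earlier, lowercasing the questions as the notebook does. The helper `build_qa_rows` and the question template are assumptions for illustration; how well a given template works with the model has not been checked here.

```python
def build_qa_rows(aliases, context, template="what does {alias} refer to"):
    """Hypothetical helper: build one [question, context] row per alias."""
    return [[template.format(alias=alias).lower(), context] for alias in aliases]


# First sentence of the clinical note, used as a short stand-in context.
sample_context = "The patient is 57-year-old male who was seen by us for elevated PSA."

rows = build_qa_rows(["the patient", "the procedure"], sample_context)
for question, _ in rows:
    print(question)
```

The resulting rows can be passed to `spark.createDataFrame(rows).toDF("question", "context")` and transformed with `model_qa`, in the same way as the hand-written questions.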