[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/coreference-resolution/Coreference_Resolution_SpanBertCorefModel.ipynb)

# Coreference Resolution with SpanBertCorefModel

SpanBertCorefModel is a coreference resolution model that identifies expressions that refer to the same entity in a
text. For example, given the sentence "John told Mary he would like to borrow a book from her.",
the model links "he" to "John" and "her" to "Mary".
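
To make the idea of "linking" concrete, here is a minimal plain-Python sketch of the example sentence. The `links` list is written by hand and `span_of` is a hypothetical helper; neither reflects the actual output format or API of SpanBertCorefModel. The sketch only illustrates how mentions group under the entity they refer to, using inclusive end offsets as an assumed convention.

```python
import re
from collections import defaultdict

text = "John told Mary he would like to borrow a book from her."

# Hand-written (mention, antecedent) pairs taken from the example sentence.
# This is an illustration, not the output format of SpanBertCorefModel.
links = [("he", "John"), ("her", "Mary")]


def span_of(word, text):
    """Return (begin, end) offsets of the first whole-word match; end is inclusive."""
    match = re.search(r"\b" + re.escape(word) + r"\b", text)
    if match is None:
        raise ValueError(f"{word!r} not found in text")
    return match.start(), match.end() - 1


# Group every mention under the entity it refers to.
clusters = defaultdict(list)
for mention, antecedent in links:
    clusters[antecedent].append(mention)

for antecedent, mentions in clusters.items():
    print(antecedent, span_of(antecedent, text), "<-", mentions)
```

Each group collected in `clusters` corresponds to one entity; a coreference model produces this kind of grouping automatically instead of relying on hand-written pairs.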

This example shows how to use a pretrained model.

## 0. Colab Setup

The following cell installs Spark NLP in a Colab notebook. Skip it if you are running this notebook locally.

In [None]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

# to process audio files
!pip install -q pyspark librosa

Let's start a Spark NLP session:

In [3]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

print(sparknlp.version())

4.2.0


## 1. Using a pretrained `SpanBertCorefModel` in a Pipeline

In [9]:
import sparknlp
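from sparknlp.base import *
from sparknlp.annotator import *
# Import Pipeline explicitly instead of relying on the wildcard imports above
from pyspark.ml import Pipeline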

SpanBertCorefModel requires `DOCUMENT` and `TOKEN` type annotations. These are extracted first and then fed to the pretrained model for coreference resolution.

In [11]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("tokens")

coref = SpanBertCorefModel.pretrained() \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("corefs")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    coref
])

spanbert_base_coref download started this may take some time.
Approximate size to download 540.1 MB
[OK!]
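
Each stage in the pipeline above reads columns that an earlier stage has produced, which is why the document, sentence, and token annotations come before the coreference model. The following plain-Python sketch checks that wiring. The `stages` list is transcribed by hand from the cell above and `check_wiring` is a hypothetical helper, not part of Spark NLP or PySpark.

```python
# Column wiring of the pipeline above, written as plain data:
# (stage name, input columns, output column).
stages = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentences"),
    ("Tokenizer", ["sentences"], "tokens"),
    ("SpanBertCorefModel", ["sentences", "tokens"], "corefs"),
]


def check_wiring(stages, source_columns):
    """Hypothetical helper: return (stage, missing_column) pairs for unmet inputs."""
    available = set(source_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        # The stage's output becomes available to every later stage.
        available.add(output)
    return problems


print(check_wiring(stages, ["text"]))
```

An empty list means every input column is produced before it is consumed. Reordering the stages, for example placing the tokenizer before the sentence detector, would make the helper report the missing `sentences` column.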


Let's create some data so we can test the pipeline:

In [None]:
data = spark.createDataFrame([
    ["John loves Mary because she knows how to treat him. She is also fond of him. John said something to Mary but she didn't respond to him."],
]).toDF("text")

The pipeline is then fit to the data, and the coreferences can be extracted with a query such as the following:

In [12]:
model = pipeline.fit(data)

model.transform(data) \
    .selectExpr("explode(corefs) AS coref") \
    .selectExpr("coref.result as token", "coref.metadata") \
    .show(truncate=False)

+-----+------------------------------------------------------------------------------------+
|token|metadata                                                                            |
+-----+------------------------------------------------------------------------------------+
|Mary |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|she  |{head.sentence -> 0, head -> Mary, head.begin -> 11, head.end -> 14, sentence -> 0} |
|She  |{head.sentence -> 0, head -> Mary, head.begin -> 11, head.end -> 14, sentence -> 1} |
|Mary |{head.sentence -> 0, head -> Mary, head.begin -> 11, head.end -> 14, sentence -> 2} |
|she  |{head.sentence -> 0, head -> Mary, head.begin -> 11, head.end -> 14, sentence -> 2} |
|John |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|him  |{head.sentence -> 0, head -> John, head.begin -> 0, head.end -> 3, sentence -> 0}   |
|him  |{head.sentence -> 0, head -> John, head.begin -> 0, head.end ->