<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/11.nlp/Stanza%20Coref.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we'll explore coreference resolution using the Stanza library, which also does dependency parsing and other core NLP tasks. You can also visualize Stanza output at [http://stanza.run](http://stanza.run).

Stanza's coreference model is an implementation of [Conjunction-Aware Word-level Coreference Resolution (D'Oosterlinck et al., 2023)](https://arxiv.org/abs/2310.06165). You can find more information in [the documentation](https://stanfordnlp.github.io/stanza/coref.html).

In [1]:
!pip install stanza==1.10.1
!pip install peft

Collecting stanza==1.10.1
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza==1.10.1)
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Downloading stanza-1.10.1-py3-none-any.whl (1.1 MB)
Downloading emoji-2.15.0-py3-none-any.whl (608 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.15.0 stanza-1.10.1


In [2]:
import stanza

# load the stanza NLP pipeline with coref and dependency parsing
pipe = stanza.Pipeline("en", processors="tokenize,lemma,pos,depparse,coref")

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


INFO:stanza:Loading these models for language: en (English):
| Processor | Package                  |
----------------------------------------
| tokenize  | combined                 |
| mwt       | combined                 |
| pos       | combined_charlm          |
| lemma     | combined_nocharlm        |
| coref     | udcoref_xlm-roberta-lora |
| depparse  | combined_charlm          |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: coref
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


Now let's run a short passage through Stanza.

> Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.

In [3]:
doc = pipe("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")

The coreference output is attached to the document as `.coref`.

The output is a list of `CorefChain` objects. Each `CorefChain` object contains a list of `CorefMention` objects. You can view the properties that you can access through each of these classes by looking [in the source code](https://github.com/stanfordnlp/stanza/blob/main/stanza/models/coref/coref_chain.py).

```py
class CorefMention:
    def __init__(self, sentence, start_word, end_word):
        self.sentence = sentence
        self.start_word = start_word
        self.end_word = end_word

class CorefChain:
    def __init__(self, index, mentions, representative_text, representative_index):
        self.index = index
        self.mentions = mentions
        self.representative_text = representative_text
        self.representative_index = representative_index

class CorefAttachment:
    def __init__(self, chain, is_start, is_end, is_representative):
        self.chain = chain
        self.is_start = is_start
        self.is_end = is_end
        self.is_representative = is_representative

```
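
To make the indexing concrete, here is a minimal sketch that mirrors the fields above with stand-in dataclasses. `Mention`, `Chain`, `mention_text`, and the hand-tokenized `sentences` are hypothetical names used only for illustration (they are not part of Stanza). The sketch assumes that `end_word` is exclusive, as the slicing used later in this notebook implies, and that `representative_index` is a position in `mentions`.

```python
from dataclasses import dataclass


# Stand-in classes mirroring the fields listed above. These are NOT imported
# from Stanza; they exist only so the sketch runs without loading a model.
@dataclass
class Mention:
    sentence: int    # index of the sentence within the document
    start_word: int  # index of the first word of the span in that sentence
    end_word: int    # one past the last word of the span (assumed exclusive)


@dataclass
class Chain:
    index: int
    mentions: list
    representative_text: str
    representative_index: int  # assumed: position of the representative in `mentions`


# Hand-tokenized toy sentences (hypothetical, not real pipeline output).
sentences = [
    ["Peter", "was", "busy", "with", "his", "work", "."],
    ["He", "needed", "a", "holiday", "."],
]

chain = Chain(
    index=0,
    mentions=[Mention(0, 0, 1), Mention(0, 4, 5), Mention(1, 0, 1)],
    representative_text="Peter",
    representative_index=0,
)


def mention_text(mention, sentences):
    """Join the words covered by a mention into a single string."""
    words = sentences[mention.sentence][mention.start_word:mention.end_word]
    return " ".join(words)


mention_texts = [mention_text(m, sentences) for m in chain.mentions]
print(chain.representative_text, mention_texts)
```

Each mention is just three integers, so recovering its text always goes through the sentence it belongs to.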

Here, we will print out the spans for each of the coref chains.

In [4]:
for coref_chain in doc.coref:
    print(f"Representative Text: {coref_chain.representative_text}")
    for mention in coref_chain.mentions:
        span = doc.sentences[mention.sentence].words[mention.start_word:mention.end_word]
        span_text = " ".join([word.text for word in span])
        print(f"{span_text}\t{span[0].start_char}\t{span[-1].end_char}")

    print()

Representative Text: Peter
he	9	11
his	31	34
Peter	41	46
He	69	71
his	76	79
they	93	97
They	116	120
they	148	152

Representative Text: his work
his work	31	39
it	65	67

Representative Text: his wife
his wife	76	84

Representative Text: a holiday
a holiday	105	114

Representative Text: the country
Spain	134	139
the country	159	170
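
The two numbers printed beside each span are character offsets into the input string, with the start inclusive and the end exclusive. A quick way to check this, without running the model, is to slice the raw text with a few of the offsets copied from the output above. The `mentions` list below is hand-copied from that output, and `recovered` is a name introduced only for this sketch.

```python
text = (
    "Although he was very busy with his work, Peter had had enough of it. "
    "He and his wife decided they needed a holiday. "
    "They travelled to Spain because they loved the country very much."
)

# (surface form, start_char, end_char) triples copied from the printed output.
mentions = [
    ("he", 9, 11),
    ("his work", 31, 39),
    ("Peter", 41, 46),
    ("it", 65, 67),
    ("Spain", 134, 139),
    ("the country", 159, 170),
]

# Slicing the original string with the offsets should give back each span.
recovered = [text[start:end] for _, start, end in mentions]

for (surface, start, end), sliced in zip(mentions, recovered):
    print(f"{surface!r}\t[{start}:{end}]\t{sliced!r}")
```

Slicing the original text this way preserves the original spacing and punctuation, which joining word texts with spaces does not guarantee.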



Each coref mention consists of one or more words. We can find the root of the span by checking the dependency heads: in most cases, every word but one points to another word inside the span, and the one word whose head lies outside the span is the root.

The syntactic relation of the entire mention to the rest of the sentence is best captured by this root.

In [5]:
def get_span_root(span: list[stanza.models.common.doc.Word]):
    # if there's only one word, it is the root
    if len(span) == 1:
        return span[0]
    # find the words whose heads fall outside the span (a head of 0 marks the sentence root)
    span_min = span[0].id
    span_max = span[-1].id
    roots = [word for word in span if (word.head) < span_min or (word.head) > span_max]
    assert len(roots) > 0
    # we just return the first one if there is more than one
    return roots[0]
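
To see the rule in isolation, here is a small sketch on hand-written annotations for the span "the brown suitcase" in "The trophy would not fit in the brown suitcase because it was too big". `ToyWord` and `toy_span_root` are hypothetical stand-ins rather than Stanza classes, and the ids, heads, and relations are written by hand under the assumption that ids and heads are 1-based within the sentence.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed word: `id` and `head` are 1-based
# positions in the sentence, and a head of 0 would mark the sentence root.
ToyWord = namedtuple("ToyWord", ["id", "text", "head", "deprel"])

# Hand-written annotations (an assumption, not pipeline output):
# The(1) trophy(2) would(3) not(4) fit(5) in(6) the(7) brown(8) suitcase(9) ...
span = [
    ToyWord(7, "the", 9, "det"),
    ToyWord(8, "brown", 9, "amod"),
    ToyWord(9, "suitcase", 5, "obl"),
]


def toy_span_root(span):
    """Return the first word whose head lies outside the span's id range."""
    span_min, span_max = span[0].id, span[-1].id
    outside = [w for w in span if w.head < span_min or w.head > span_max]
    return outside[0]


root = toy_span_root(span)
print(root.text, root.head, root.deprel)
```

Here "the" and "brown" both attach to "suitcase" inside the span, so "suitcase", whose head is word 5 ("fit"), is the one that connects the mention to the rest of the sentence.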

In [6]:
def print_coref_chains(doc):
    for coref_chain in doc.coref:
        print(f"Representative Text: {coref_chain.representative_text}")
        for mention in coref_chain.mentions:
            sentence = doc.sentences[mention.sentence]
            span = sentence.words[mention.start_word:mention.end_word]
            span_text = " ".join([word.text for word in span])
            root = get_span_root(span)
            # heads are 1-based; a head of 0 means the span root is the sentence root,
            # so avoid indexing words[-1] in that case
            head_text = sentence.words[root.head - 1].text if root.head > 0 else "ROOT"
            print(f"{span_text}\t{head_text}\t{root.deprel}")
        print()

In [7]:
doc2 = pipe("The trophy would not fit in the brown suitcase because it was too big")
print_coref_chains(doc2)

Representative Text: The trophy
The trophy	fit	nsubj
it	big	nsubj

Representative Text: the brown suitcase
the brown suitcase	fit	obl



In [8]:
doc3 = pipe("The town councilors refused to give the demonstrators a permit because they feared violence.")
print_coref_chains(doc3)

Representative Text: The town councilors
The town councilors	refused	nsubj
they	feared	nsubj

Representative Text: the demonstrators
the demonstrators	give	iobj

Representative Text: a permit
a permit	give	obj

Representative Text: violence
violence	feared	obj



In [9]:
doc4 = pipe("The town councilors refused to give the demonstrators a permit because they advocated violence.")
print_coref_chains(doc4)

Representative Text: The town councilors
The town councilors	refused	nsubj

Representative Text: the demonstrators
the demonstrators	give	iobj
they	advocated	nsubj

Representative Text: a permit
a permit	give	obj

Representative Text: violence
violence	advocated	obj

