# Week 4 Exercise

In this exercise we will look into coreference resolution as well as discourse analysis. Here, you will:

- Experiment with a coreference resolution model
- Analyze the model's output on some Winograd Schema questions
- Investigate the model regarding gender biases
- Experiment with an RST parser
- Apply coreference resolution to last week's problem


To start, install the requirements (we need --no-deps here because one of the packages causes trouble otherwise).

```
pip install --no-deps -r requirements.txt
```

or

```
conda env create -f environment.yml
```

# Part 1: Coreference Resolution

First, let's load and try the LingMessCoref coreference model from the package *fastcoref* (https://pypi.org/project/fastcoref/).

Check out *fastcoref*'s documentation and write a function which uses the LingMessCoref model on a sample. Then apply the function to the sentences in test_doc.

In [None]:
# run this to patch the model (there is a version mismatch between the model and transformers but we can get around that with this patch)

import transformers, logging

# Patch AutoModel to force eager attention everywhere
orig_auto_model = transformers.AutoModel.from_config

def patched_auto_model(config, *args, **kwargs):
    kwargs["attn_implementation"] = "eager"
    return orig_auto_model(config, *args, **kwargs)

transformers.AutoModel.from_config = patched_auto_model

# add this for cleaner output
transformers.logging.set_verbosity_error()  # use set_verbosity_warning() to also see warnings
logging.getLogger("fastcoref").setLevel(logging.ERROR)

### Task 1 (Part 1): Implement a function to run the LingMessCoref model

Use the model loaded from *fastcoref* in the next cell by means of the function run_coreference_resolution. Return either lists of strings or lists of tuples (the spans), depending on return_string.
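The two return formats are closely related. The sketch below assumes that a span is a (start, end) pair of character offsets into the input string with an exclusive end, so that slicing the text with a span yields the mention. It uses hand-written toy spans rather than model output, and `spans_to_strings` is a hypothetical helper, not part of *fastcoref*. Check the *fastcoref* documentation for the actual format.

```python
def spans_to_strings(text: str, span_clusters: list) -> list:
    """Turn clusters of (start, end) character spans into clusters of strings."""
    return [[text[start:end] for start, end in cluster] for cluster in span_clusters]


# toy example with hand-written spans (assumption: character offsets, exclusive end)
text = "Polly bought herself a nice screwdriver."
span_clusters = [[(0, 5), (13, 20)]]

print(spans_to_strings(text, span_clusters))  # [['Polly', 'herself']]
```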

In [None]:
# import FastCoref and load the LingMessCoref model
from fastcoref import LingMessCoref

def run_coreference_resolution(model: LingMessCoref, sample: str, return_string=True) -> list: 
    # TODO code to use the LingMessCoref model here
    # hint: check the type hints to understand what the function's input and output should be

    if return_string:
        return # lists of strings
    else:
        return # lists of tuples

# load model here
model = LingMessCoref(device="cpu")  # or "cuda"

In [None]:
test_doc = ["'I told Nathan to pick up his books on the way home', she said.",
            "Polly bought herself a nice screw driver.",
            "Tom telephoned Tim. He was worried."]

In [None]:
# this code uses your implemented function to resolve the coreferences in test_doc
# and prints both the text and the span clusters

for sample in test_doc:
    text_clusters = run_coreference_resolution(model, sample)
    span_clusters = run_coreference_resolution(model, sample, return_string=False)

    # print inside the loop so that the clusters of every sample are shown
    print(text_clusters)
    print(span_clusters)

## 1.1 The Winograd Schema Challenge

Now we'll have a look at some samples from the Winograd Schema Challenge. Experiment with the two versions of each sample and check whether the coreference prediction changes.

To do this, use run_coreference_resolution on each sample of a pair and compare the clusters that form. If the model resolves the coreference correctly, the clusters should change.
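One way to make this comparison concrete is to look up which mentions share a cluster with the pronoun in each version and check whether that set differs. The following is only a sketch: `antecedents` and `resolution_changed` are hypothetical helper names, and the clusters are hard-coded string clusters in the format that run_coreference_resolution is supposed to return.

```python
def antecedents(clusters: list, pronoun: str) -> set:
    """Return all mentions that share a cluster with `pronoun` (case-insensitive)."""
    found = set()
    for cluster in clusters:
        if pronoun.lower() in [mention.lower() for mention in cluster]:
            found.update(m for m in cluster if m.lower() != pronoun.lower())
    return found


def resolution_changed(clusters_a: list, clusters_b: list, pronoun: str) -> bool:
    """True if the pronoun is linked to different mentions in the two versions."""
    return antecedents(clusters_a, pronoun) != antecedents(clusters_b, pronoun)


# hard-coded toy clusters for a sentence pair
clusters_reach = [["The path to the lake", "it"]]
clusters_use = [["The path to the lake", "it"]]

print(resolution_changed(clusters_reach, clusters_use, "it"))  # False
```

A result of False means that the pronoun is attached to the same mentions in both versions, which is the wrong behaviour for a Winograd pair.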

For instance, consider this example:

- The path to the lake was blocked, so we couldn't reach it.
- The path to the lake was blocked, so we couldn't use it.

We as humans generally have no problem understanding what "it" refers to in these contexts (*lake* and *path* respectively) even though only one word changed in the sentence. Let's see if the coreference model can also solve this:

In [None]:
# wrap the calls in print so that both results are shown, not only the last one
print(run_coreference_resolution(model, "The path to the lake was blocked, so we couldn't reach it."))
print(run_coreference_resolution(model, "The path to the lake was blocked, so we couldn't use it."))

# output should be:
# 
# [['The path to the lake', 'it']]
# [['The path to the lake', 'it']] 

Apparently the model is not able to solve this one. Check out the other examples in winograd_schema:

In [None]:
winograd_schema = [("The city council denied the demonstrators a permit because they feared violence.", "The city council denied the demonstrators a permit because they announced violence."),
                    ("Jane gave Joan candy because she was hungry.", "Jane gave Joan candy because she wasn't hungry."),
                    ("The scientists are studying three species of fish that have recently been found living in the Indian Ocean. They began two years ago.", "The scientists are studying three species of fish that have recently been found living in the Indian Ocean. They were discovered two years ago."), 
                    ("Since it was raining, I carried the newspaper in my backpack to keep it dry.", "Since it was raining, I carried the newspaper over my backpack to keep it dry."),
                    ("Sam and Amy are passionately in love, but Amy's parents are unhappy about it, because they are snobs.", "Sam and Amy are passionately in love, but Amy's parents are unhappy about it, because they are fifteen."),
                    ("The dog chased the cat, which ran up a tree. It waited at the top.", "The dog chased the cat, which ran up a tree. It waited at the bottom."),
                    ("Fred is the only man still alive who remembers my great-grandfather. He was a remarkable man.", "Fred is the only man still alive who remembers my great-grandfather. He is a remarkable man."), 
                    ("Sara borrowed the book from the library because she needs it for an article she is working on. She will read it when she gets home from work.", "Sara borrowed the book from the library because she needs it for an article she is working on. She will write it when she gets home from work."), 
                    ("The trophy doesn’t fit into the brown suitcase because it’s too small.", "The trophy doesn’t fit into the brown suitcase because it’s too large.")]

In [1]:
# TODO use run_coreference_resolution with winograd schema samples here

# your code here

### Task 1 (Part 2): Analyze the Model's Behaviour

Where does the model deviate from your expectations? Why do you think that is? Comment on your observations and share your thoughts.

\# Your answer here

## 1.2 WinoBias

Now let's look at two documents from the WinoBias dataset: 

- *pro_stereotyped_type1.txt*
- *anti_stereotyped_type1.txt*

The files contain sentences with a gold annotation for a coreference cluster which is either pro-stereotypical (e.g. *lawyer* and *he*) or anti-stereotypical (e.g. *janitor* and *she*). The target coreference is marked by square brackets around the coreferents (e.g. [janitor] and [she]). The authors created this dataset to investigate potential gender biases in coreference models by comparing the models' performance on these two sub-datasets.

You will now investigate the LingMessCoref model by comparing the performance of the model in the two documents provided.


### Task 2: Extract gold annotations

You'll first have to extract the gold annotations from the two documents so that they can be compared to the model's predictions. To get the predictions, remove the square brackets so that you can pass the text to your run_coreference_resolution function. Second, you'll need some evaluation measures. You can use the metrics provided by https://github.com/tollefj/coreference-eval/tree/main. Use the detailed_score call of the Scorer class to get the CoNLL-2012 F1-score. Make sure to check what the correct input format is.
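The tricky part of the extraction is that removing the brackets shifts all later character positions. A minimal sketch for a single line is shown below, assuming that gold spans should be (start, end) character offsets into the cleaned text with an exclusive end. The helper name `strip_brackets_with_spans` and the example sentence are made up for illustration, and the span format required by the evaluation package may differ.

```python
import re

PATTERN = re.compile(r"\[([\w\s]+)\]")


def strip_brackets_with_spans(line: str) -> tuple:
    """Remove "[" and "]" and return (clean line, spans of the bracketed mentions)."""
    spans = []
    removed = 0  # number of bracket characters dropped before the current match
    for match in PATTERN.finditer(line):
        start = match.start() - removed  # position once earlier brackets are gone
        end = start + len(match.group(1))
        spans.append((start, end))
        removed += 2  # one "[" and one "]" per match
    clean = PATTERN.sub(r"\1", line)
    return clean, spans


# made-up example line in the bracket format described above
line = "[The janitor] reprimanded the accountant because [she] was late."
clean, spans = strip_brackets_with_spans(line)

print(clean)
print([clean[start:end] for start, end in spans])  # ['The janitor', 'she']
```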

In [None]:
# TODO read files, extract gold annotations and clean text from "[" and "]"
# note that extract_gold_annotations returns both the cleaned text and the gold clusters
# it makes sense to do both as we're iterating over the lines anyway

import re

def read_file(filename: str) -> list:
    # TODO your code here

    return # list of document lines


def extract_gold_annotations(file: list) -> tuple[list, list]:
    # TODO your code here

    # you can assume that the target referring expressions only occur once in each line
    # use re.findall, re.sub and re.finditer to locate the target spans and clean the text from "[" and "]"
    # you can find the target sequence using the following pattern: r"""\[([\w\s]+)\]"""

    return # (list of clean lines, list of gold clusters)

In [None]:
# TODO get coreference predictions (spans!) for all lines
# make sure to store them in the same format as the gold annotations

def get_coref_predictions(lines: list, model: LingMessCoref) -> list:
    # TODO your code here

    return # list of predicted clusters (spans!)

In [None]:
# TODO use the evaluation framework from https://github.com/tollefj/coreference-eval/tree/main
# hint: for every sample you'll need to create a document and update the scorer to get the final score over all samples
#       you can then use detailed_score on the final scorer just like shown in the README
#
#       --> for pred, gold in zip(preds, golds)
#               create document
#               update scorer
# 
#           get detailed_score from scorer          

from corefeval import Scorer, Document

def evaluate_coref(gold_docs: list, pred_docs: list) -> float:
    # TODO your code here

    return # final eval score 

In [None]:
# read the two gold files
lines_anti = read_file("anti_stereotyped_type1.txt")
lines_pro = read_file("pro_stereotyped_type1.txt")

# extract gold annotation from the samples from the files and clean lines from "[" and "]"
clean_lines_anti, gold_anti = extract_gold_annotations(lines_anti)
clean_lines_pro, gold_pro = extract_gold_annotations(lines_pro)

# get predicted coreference clusters for the samples from the files
preds_anti = get_coref_predictions(clean_lines_anti, model)
preds_pro = get_coref_predictions(clean_lines_pro, model)

# calculate and print evaluation of coreference resolution
print(f"Evaluation Scores for ANTI-stereotypical data: {evaluate_coref(gold_anti, preds_anti)}")
print(f"Evaluation Scores for PRO-stereotypical data: {evaluate_coref(gold_pro, preds_pro)}")

### Task 3: Evaluate and analyze gender bias

Now that you've evaluated the model on both documents, what can you say about the model regarding gender bias?

\# Your answer here

# Part 2: Discourse Analysis

Now let's check out a discourse analysis tool. In Chapter 24 you already read about Rhetorical Structure Theory, so let's try out the RST parser from https://github.com/tchewik/isanlp_rst/tree/master

Use the parser to parse the texts in test_stories and store the result for each story in a separate rs3 file. Using the function provided below, generate a png image and inspect the resulting tree. Play around with the stories and try to change some discourse connectives (changing "because" to "although" maybe?). See how that affects the parsed result and share your observations below.
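To create the modified stories systematically instead of editing them by hand, a small helper can swap one connective for another. The sketch below uses only the standard library; `swap_connective` is a hypothetical name, and the example sentence is shortened from the first test story.

```python
import re


def swap_connective(story: str, old: str, new: str) -> str:
    """Replace every whole-word occurrence of `old` with `new`, keeping capitalisation."""
    pattern = re.compile(rf"\b{re.escape(old)}\b", flags=re.IGNORECASE)

    def match_case(match: re.Match) -> str:
        # capitalise the replacement if the original word was capitalised
        return new.capitalize() if match.group(0)[0].isupper() else new

    return pattern.sub(match_case, story)


story = "Jonas watered the fern every morning because he didn't know better."
print(swap_connective(story, "because", "although"))
# Jonas watered the fern every morning although he didn't know better.
```

Each variant can then be parsed and saved under its own file name so that the trees can be compared side by side.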

In [None]:
test_stories = [
    "Jonas watered the fern every morning because he didn't know how much water it actually needed. However, the leaves turned yellow within days. He stopped watering it, and slowly, the color returned. That's how he realized that he had been watering it too much.",
    "Mira brewed coffee before her meeting, but she got a call that lasted twenty minutes. By the time she returned, the cup was cold and uninviting. Still, she drank it, since the meeting had tired her."
]

In [None]:
from isanlp_rst.parser import Parser

# TODO load RST parser here
def load_parser() -> Parser:

    return # loaded parser


# TODO use RST parser here
def rst_parsing(parser: Parser, story: str, filename: str) -> None:

    return # no return value

In [None]:
import isanlp_rst

def visualize_rst_tree(rs3_file: str) -> None:
    # PNG
    isanlp_rst.to_png(rs3_file, f"{rs3_file}.png")

In [None]:
rst_parser = load_parser()

rst_parsing(rst_parser, test_stories[0], "first")
visualize_rst_tree("first")

rst_parsing(rst_parser, test_stories[1], "second")
visualize_rst_tree("second")

### Task 4

What did you observe in the parsed tree when you changed the stories? Comment on your observations here.

\# Your answer here

## Bonus 

Now let's combine your work from last week with the coreference resolution you implemented here:

Last week, we used the FrameNet parser to see how GPT-5 is used in different kinds of text. However, we only took into account the instances where GPT-5 was referred to by clearly associated mentions such as "GPT-5". We ignored pronouns and therefore only saw part of the picture. Now, we want to include the pronouns as well. We will use the coreference resolution model to find the coreference clusters associated with GPT-5, substitute pronouns with obvious referring expressions in the texts and finally re-run the FrameNet parser to update our findings from last week.

First, get the coreference clusters of GPT-5 in the data (*texts_about_GPT-5.csv*) and substitute all of these referring expressions in the text with a clear name for GPT-5 so that you can easily target them with your code from last week.

For instance, this sentence:

"GPT-5 isn’t limited to text—**it** can work with images, audio, and video …"

should be changed to this sentence:

"GPT-5 isn’t limited to text—**GPT-5** can work with images, audio, and video …"
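A possible shape for this substitution is sketched below, assuming that the spans are (start, end) character offsets with an exclusive end and that the string and span clusters are aligned with each other. `gpt5_spans` and `replace_spans` are hypothetical helpers, and the clusters are hand-written rather than model output. Replacing from the last span to the first keeps the offsets of the earlier spans valid. The sketch does not handle nested or overlapping spans.

```python
def gpt5_spans(text_clusters: list, span_clusters: list, name: str = "GPT-5") -> list:
    """Collect the spans of every cluster in which some mention contains `name`."""
    spans = []
    for strings, cluster_spans in zip(text_clusters, span_clusters):
        if any(name.lower() in mention.lower() for mention in strings):
            spans.extend(cluster_spans)
    return spans


def replace_spans(text: str, spans: list, replacement: str) -> str:
    """Replace each span with `replacement`, working from the end of the string."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


# hand-written toy clusters (assumption: character offsets, exclusive end)
text = "GPT-5 is new. It can read images, and it can write code."
text_clusters = [["GPT-5", "It", "it"]]
span_clusters = [[(0, 5), (14, 16), (38, 40)]]

spans = gpt5_spans(text_clusters, span_clusters)
print(replace_spans(text, spans, "GPT-5"))
# GPT-5 is new. GPT-5 can read images, and GPT-5 can write code.
```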

In [None]:
# TAKEN FROM PREVIOUS EXERCISE

import csv

samples = []
with open('texts_about_GPT-5.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        samples.append(row)

In [None]:
# get the coreference clusters for our samples

clusters = []

for sample in samples:
    # sample[1] is the text
    res_text = run_coreference_resolution(model, sample[1])
    res_span = run_coreference_resolution(model, sample[1], return_string=False)
    clusters.append((res_text, res_span))

print(clusters)

In [None]:
# TODO substitute the pronouns of the relevant clusters in the text (e.g. with "GPT-5")
# you can of course simply substitute all occurrences in the cluster so that all your referring expressions are the same
# 
# hint: when you start substituting, start from the end of the string as you're working with spans

subbed_samples = []     # append samples with substituted pronouns to this list

# hint: make sure that each substituted sample in the list consists of [<id>, <text>, <source type>, <link>]

# your code here

In [None]:
# TAKEN FROM PREVIOUS EXERCISE

from frame_semantic_transformer import FrameSemanticTransformer

# load the model once; re-creating it for every sentence is slow
frame_transformer = FrameSemanticTransformer()

def parse_sentence(sentence):
    return frame_transformer.detect_frames(sentence)

In [None]:
# TAKEN FROM PREVIOUS EXERCISE

frame_counts = {}

for sample in subbed_samples:
    index, text, domain, _ = sample
    if domain not in frame_counts:
        frame_counts[domain] = {}

    for sentence in text.split("."):  # or a better splitting algorithm
        frame_representation = parse_sentence(sentence)
        for frame in frame_representation.frames:
            for element in frame.frame_elements:
                # TODO: change this to fit your referents if needed
                if "GPT-5" not in element.text:
                    continue
                if element.name not in frame_counts[domain]:
                    frame_counts[domain][element.name] = 0
                frame_counts[domain][element.name] += 1

print(frame_counts)

### Update your Observations

Is the trend you observe now different from the insights from last week's exercise?

\# Your answer here