<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/11.nlp/HW12_Relation_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Relation extraction with LLMs

In this homework, we will explore the challenges and affordances of using LLMs for relation extraction, and how we can evaluate LLM RE systems.

In [1]:
import torch
import numpy as np

from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

In [2]:
# use the 4B model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="cuda", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")


In [3]:
def call_llm(prompt, system_prompt="You are a helpful assistant.", generation_config=None):
    if generation_config is None:
        generation_config = {
            "max_new_tokens": 500,
            "temperature": 0.01
        }
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # conduct text completion
    generated = model.generate(
        **model_inputs,
        **generation_config
    )

    # let's break this down:
    #                      | we take the element of the batch (our batch size is 1)
    #                      |  |-----------------------------| skip our original input
    output_ids = generated[0][len(model_inputs.input_ids[0]):].tolist()

    # decode the generated token IDs back into text
    return tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

## Load data

We will be using the relationship triples you extracted during the in-class activity on Tuesday. These have been preprocessed to match each triple to a paragraph.

In [4]:
import pandas as pd

In [5]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/11.nlp/movie_relations.json -O movie_relations.json


In [6]:
def read_data(path):
    df = pd.read_json(path)
    df = df.sample(50, random_state=42)
    texts = df.paragraph_text.to_list()
    labels = df.triples.to_list()
    return texts, labels

texts, triples = read_data("./movie_relations.json")

## Setting up the LLM

**Question 1:** Come up with **at least two different prompts or prompting methods** to perform relationship extraction using LLMs based on the relation categories we defined in the lab activity on Tuesday. Your output should be relationship triples. To enforce this, we create a `RelationTriple` wrapper class for your output.

Here's an example from the dataset:

Input:
```
Aboard the space station, Peiqiang discovers that MOSS, the station's computer commander, has decided to abandon Earth and repurpose the station as an interstellar ark to seed a new planet with Earth's biosphere. Breaking out of forced hibernation, he is joined by fellow Russian cosmonaut Maxim Makarov, whom MOSS awakens to stop Liu. While spacewalking, Makarov is killed by the spacecraft's automated security measures. Liu enters the control room, but his attempts to override the evacuation procedures are revoked. Qi's group arrives at the Sulawesi Supply Depot to find that, while most engines around the planet have been restored, the combined thrust is insufficient to divert Earth's trajectory as it approaches Jupiter's Roche limit. MOSS broadcasts a final message to the world, but Peiqiang refuses to follow the computer's instructions.
```

Output (you will want to return a list of `RelationTriple`s):
```
<Liu Peiqiang,business,Maxim Makarov>
<Liu Peiqiang,nemeses,MOSS>
<Maxim Makarov,nemeses,MOSS>
```


In [7]:
relations = [
    "family",
    "nemeses",
    "romantic",
    "friends",
    "business"
]

In [8]:
class RelationTriple():
    def __init__(self, head, tail, relation):
        self.head = head
        self.tail = tail
        self.relation = relation

    @classmethod
    def from_triple(cls, triple: str):
        parts = triple.strip("<>").split(",")
        parts = [part.strip() for part in parts]
        if len(parts) != 3:
            raise ValueError(f"triple {triple} is malformed")
        head, relation, tail = parts
        if relation not in relations:
            raise ValueError(f"triple {triple} has unsupported relation {relation}")
        return cls(head, tail, relation)

    @classmethod
    def validate(cls, triple: str):
        parts = triple.strip("<>").split(",")
        parts = [part.strip() for part in parts]
        if len(parts) != 3:
            return False
        head, relation, tail = parts
        if relation not in relations:
            return False
        return True

    def __str__(self):
        return f"<{self.head},{self.relation},{self.tail}>"

    def __repr__(self):
        return f"<{self.head},{self.relation},{self.tail}>"
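
One possible alternative to the hand-written class above is a frozen dataclass, sketched here under the assumption that triples are treated as immutable values. `FrozenTriple` and `RELATIONS` are hypothetical names, not part of the assignment; `RELATIONS` is assumed to mirror the `relations` list. The generated `__eq__` and `__hash__` would make field-by-field comparison and de-duplication in a `set` work without converting to strings.

```python
from dataclasses import dataclass

# Assumed to mirror the `relations` list defined earlier in the notebook.
RELATIONS = ("family", "nemeses", "romantic", "friends", "business")


@dataclass(frozen=True)
class FrozenTriple:
    """Hypothetical value-object variant of RelationTriple."""
    head: str
    relation: str
    tail: str

    @classmethod
    def from_triple(cls, triple: str) -> "FrozenTriple":
        parts = [part.strip() for part in triple.strip().strip("<>").split(",")]
        if len(parts) != 3:
            raise ValueError(f"triple {triple} is malformed")
        head, relation, tail = parts
        if relation not in RELATIONS:
            raise ValueError(f"triple {triple} has unsupported relation {relation}")
        return cls(head, relation, tail)

    def __str__(self) -> str:
        return f"<{self.head},{self.relation},{self.tail}>"


a = FrozenTriple.from_triple("<Liu Peiqiang, nemeses, MOSS>")
b = FrozenTriple.from_triple("<Liu Peiqiang,nemeses,MOSS>")
print(a == b)       # field-by-field equality generated by the dataclass
print(len({a, b}))  # frozen dataclasses are hashable, so duplicates collapse in a set
```

Note that set-based comparison collapses repeated predictions, so it is not a drop-in replacement for one-to-one matching when duplicates should count as errors.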

In [27]:
def generate_relations_one(text: str) -> list[RelationTriple]:
    prompt = f"""You are an expert in relation extraction. Your task is to identify all relationship triples from the provided text.
Each triple must strictly follow the format: <HeadEntity, Relation, TailEntity>.
The only valid relations are: {relations}.
Only output the triples, one per line. Do not include any other text or explanations.

Text: {text}
Triples:"""
    llm_output = call_llm(prompt, system_prompt="You are an expert in relation extraction.")

    parsed_triples = []
    if llm_output:
        for line in llm_output.split('\n'):
            line = line.strip()
            if line and RelationTriple.validate(line):
                parsed_triples.append(RelationTriple.from_triple(line))
    return parsed_triples

In [28]:
def generate_relations_two(text: str) -> list[RelationTriple]:
    prompt = f"""You are an expert in relation extraction. Your task is to identify all relationship triples from the provided text based on the examples below.
Each triple must strictly follow the format: <HeadEntity, Relation, TailEntity>.
The only valid relations are: {relations}.
Only output the triples, one per line. Do not include any other text or explanations.

Example 1:
Text: Diana worked with Tommy at Google.
Triples:
<Diana,business,Tommy>

Example 2:
Text: Diana is the sister of Tommy.
Triples:
<Diana,family,Tommy>

Example 3:
Text: Diana and Tommy were childhood rivals.
Triples:
<Diana,nemeses,Tommy>

Example 4:
Text: John and Sarah are deeply in love.
Triples:
<John,romantic,Sarah>

Example 5:
Text: Michael and Emily have been friends since kindergarten.
Triples:
<Michael,friends,Emily>

Text: {text}
Triples:"""
    llm_output = call_llm(prompt, system_prompt="You are an expert in relation extraction.")

    parsed_triples = []
    if llm_output:
        for line in llm_output.split('\n'):
            line = line.strip()
            if line and RelationTriple.validate(line):
                parsed_triples.append(RelationTriple.from_triple(line))
    return parsed_triples
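
Both functions above parse the model output line by line with `strip("<>")`, so a line such as `1. <Diana, business, Tommy>` passes `validate` but yields the head `1. <Diana`. A minimal sketch of a more tolerant parser follows, assuming it is acceptable to search for bracketed triples anywhere in the output and to lowercase the relation. `extract_triples`, `TRIPLE_PATTERN`, and `RELATIONS` are hypothetical names, and the sample text is made up for illustration.

```python
import re

# Assumed to mirror the `relations` list defined earlier in the notebook.
RELATIONS = ["family", "nemeses", "romantic", "friends", "business"]

# One <head,relation,tail> group: three comma-separated fields with no
# angle brackets or commas inside a field.
TRIPLE_PATTERN = re.compile(r"<\s*([^<>,]+?)\s*,\s*([^<>,]+?)\s*,\s*([^<>,]+?)\s*>")


def extract_triples(llm_output: str) -> list[tuple[str, str, str]]:
    """Hypothetical helper: pull every well-formed triple out of raw LLM text."""
    triples = []
    for head, relation, tail in TRIPLE_PATTERN.findall(llm_output):
        relation = relation.lower()  # assumption: relation case should not matter
        if relation in RELATIONS:
            triples.append((head, relation, tail))
    return triples


raw = """Here are the triples:
1. <Diana, business, Tommy>
- <Diana,Family,Tommy>
<Diana,coworker,Tommy>"""
print(extract_triples(raw))
# → [('Diana', 'business', 'Tommy'), ('Diana', 'family', 'Tommy')]
```

Like the original parser, this sketch cannot handle entity names that themselves contain commas.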

In [29]:
def run_on_data(fn, texts) -> list[list[RelationTriple]]:
    return [
        fn(text) for text in tqdm(texts)
    ]

In [30]:
first_outputs = run_on_data(generate_relations_one, texts)
second_outputs = run_on_data(generate_relations_two, texts)

100%|██████████| 50/50 [06:28<00:00,  7.76s/it]
100%|██████████| 50/50 [04:10<00:00,  5.01s/it]


In [31]:
print(first_outputs)
print(second_outputs)

[[<Honmoon,family,Saja Boys>, <Huntr/x,business,Saja Boys>, <Rumi,romantic,Mira>, <Rumi,romantic,Zoey>, <Rumi,family,Jinu>, <Rumi,family,Honmoon>, <Jinu,family,Honmoon>, <Jinu,family,Gwi-Ma>], [<Andy,friends,Tommy Williams>, <Andy,family,Tommy Williams>, <Tommy Williams,family,Andy>, <Tommy Williams,business,Andy>, <Andy,family,Norton>, <Norton,business,Andy>, <Andy,family,Hadley>, <Hadley,friends,Andy>, <Norton,family,Andy>, <Norton,family,Tommy Williams>, <Norton,family,Andy>, <Norton,business,Tommy Williams>], [<Seb,family,Mia>, <Seb,romantic,Mia>, <Mia,friends,Seb>, <Seb,business,restaurant>], [<Steve Rogers,friends,James "Bucky" Barnes>, <Steve Rogers,family,Howard Stark>, <Steve Rogers,business,Strategic Scientific Reserve>, <Steve Rogers,business,Howard Stark>, <Steve Rogers,business,Colonel Chester Phillips>, <Steve Rogers,business,Peggy Carter>, <Howard Stark,family,Steve Rogers>, <Howard Stark,business,Strategic Scientific Reserve>, <Abraham Erskine,family,Steve Rogers>, <Abr

## Evaluating output

In [32]:
def get_gold_labels(labels: list[list[str]]):
    return [
        [RelationTriple.from_triple(triple) for triple in paragraph if RelationTriple.validate(triple)] for paragraph in labels
    ]

In [33]:
gold_labels = get_gold_labels(triples)

### Strict matching

**Question 2:** **Implement the following functions** in order to calculate the precision / recall / F1 of your model output on both prompts.

`get_confusion_matrix` should return a `ConfusionMatrix` containing the number of false/true positives/negatives calculated for the gold and predicted labels for one paragraph. It should use `correct_fn` to compute whether two triples match.

You should calculate `precision`, `recall`, and `f1` over the gold and predicted labels for the entire list of paragraphs by adding up the confusion matrices for each one.

In [64]:
def strict_correct_fn(gold: RelationTriple, pred: RelationTriple) -> bool:
    return gold.head == pred.head and gold.relation == pred.relation and gold.tail == pred.tail

In [78]:
class ConfusionMatrix():
    def __init__(self, tp=0, fp=0, tn=0, fn=0):
        self.tp = tp
        self.fp = fp
        self.tn = tn
        self.fn = fn

    def __add__(self, other):
        return ConfusionMatrix(
            self.tp + other.tp,
            self.fp + other.fp,
            self.tn + other.tn,
            self.fn + other.fn,
        )

    def to_numpy(self):
        return np.array([self.tp, self.fp, self.fn, self.tn])

def get_confusion_matrix(gold: list[RelationTriple], pred: list[RelationTriple], correct_fn) -> ConfusionMatrix:
    # True negatives are left at 0: we only score relationships that DO exist,
    # and the set of possible non-relationships for a text is effectively unbounded.
    tp = 0
    matched_pred_indices = set()

    # Greedy one-to-one matching: each gold triple claims the first
    # still-unmatched prediction that correct_fn accepts.
    for g_triple in gold:
        for j, p_triple in enumerate(pred):
            # Test the cheap set lookup first so correct_fn (possibly an LLM call)
            # is never run against a prediction that has already been matched.
            if j not in matched_pred_indices and correct_fn(g_triple, p_triple):
                tp += 1
                matched_pred_indices.add(j)
                break

    fn = len(gold) - tp
    fp = len(pred) - tp
    return ConfusionMatrix(tp, fp, 0, fn)

In [66]:
def precision(confusion_matrix: ConfusionMatrix) -> float:
    if (confusion_matrix.tp + confusion_matrix.fp) == 0:
        return 0.0
    return confusion_matrix.tp / (confusion_matrix.tp + confusion_matrix.fp)

In [67]:
def recall(confusion_matrix: ConfusionMatrix) -> float:
    if (confusion_matrix.tp + confusion_matrix.fn) == 0:
        return 0.0
    return confusion_matrix.tp / (confusion_matrix.tp + confusion_matrix.fn)

In [68]:
def f1(confusion_matrix: ConfusionMatrix) -> float:
    prec = precision(confusion_matrix)
    rec = recall(confusion_matrix)
    if (prec + rec) == 0:
        return 0.0
    return 2 * (prec * rec) / (prec + rec)

In [73]:
def summed_confusion_matrix(gold: list[list[RelationTriple]], pred: list[list[RelationTriple]], correct_fn) -> ConfusionMatrix:
    return sum(
        [get_confusion_matrix(g, p, correct_fn) for g, p in zip(gold, pred)],
        ConfusionMatrix()
    )
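
Summing the per-paragraph confusion matrices before computing the metrics gives micro-averaged scores, which can differ noticeably from averaging per-paragraph scores (macro-averaging). The sketch below illustrates the difference with toy counts that are not taken from the dataset; `Counts` is a hypothetical stand-in for `ConfusionMatrix`.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical stand-in for the notebook's ConfusionMatrix (tp/fp/fn only)."""
    tp: int = 0
    fp: int = 0
    fn: int = 0

    def __add__(self, other: "Counts") -> "Counts":
        return Counts(self.tp + other.tp, self.fp + other.fp, self.fn + other.fn)


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


# Toy numbers, not taken from the dataset.
paragraphs = [Counts(tp=1, fp=0, fn=1), Counts(tp=1, fp=9, fn=0)]

micro = precision(sum(paragraphs, Counts()))                     # pool counts, then divide
macro = sum(precision(c) for c in paragraphs) / len(paragraphs)  # divide, then average
print(f"micro precision: {micro:.3f}")  # 2 / 11
print(f"macro precision: {macro:.3f}")  # (1.0 + 0.1) / 2
```

Under micro-averaging, a single paragraph with many spurious predictions pulls the overall precision down sharply, which is consistent with the low precision values reported for the prompts here.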

In [79]:
cm_first = summed_confusion_matrix(gold_labels, first_outputs, strict_correct_fn)
cm_second = summed_confusion_matrix(gold_labels, second_outputs, strict_correct_fn)

print("--- Strict Matching Evaluation ---")
print("First Prompt (Zero-shot):")
print(f"  Precision: {precision(cm_first):.4f}")
print(f"  Recall:    {recall(cm_first):.4f}")
print(f"  F1 Score:  {f1(cm_first):.4f}")

print("\nSecond Prompt (Few-shot):")
print(f"  Precision: {precision(cm_second):.4f}")
print(f"  Recall:    {recall(cm_second):.4f}")
print(f"  F1 Score:  {f1(cm_second):.4f}")

--- Strict Matching Evaluation ---
First Prompt (Zero-shot):
  Precision: 0.0390
  Recall:    0.1797
  F1 Score:  0.0642

Second Prompt (Few-shot):
  Precision: 0.0787
  Recall:    0.1875
  F1 Score:  0.1109
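
As a rough cross-check, the strict-matching scores can be reproduced from raw counts. The Question 4 output further down prints 566 false positives and 105 false negatives for the zero-shot prompt and 281 false positives for the few-shot prompt. The true-positive counts (23 and 24), the gold total of 128, and the few-shot false-negative count of 104 are not printed anywhere; they are inferred here as the integers consistent with the reported numbers.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)


# Zero-shot: FP=566 and FN=105 are printed in the notebook; TP=23 is inferred.
print([f"{x:.4f}" for x in prf(tp=23, fp=566, fn=105)])
# → ['0.0390', '0.1797', '0.0642']

# Few-shot: FP=281 is printed; TP=24 and FN=104 are inferred (23 + 105 = 128 gold triples).
print([f"{x:.4f}" for x in prf(tp=24, fp=281, fn=104)])
# → ['0.0787', '0.1875', '0.1109']
```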


### LLM-as-judge

**Question 3:** Use the LLM to adjudicate the output by **implementing the `llm_correct_fn`**, then computing new precision, recall, and F1 scores.

In [95]:
import re

def llm_correct_fn(gold: RelationTriple, pred: RelationTriple) -> bool:
    prompt = f"""Are the following two relation triples semantically equivalent? Respond with 'Yes' or 'No' only.

Gold Triple: {gold}
Predicted Triple: {pred}"""
    system_prompt = "You are an expert judge in natural language understanding, determining semantic equivalence of relation triples."

    llm_output = call_llm(prompt, system_prompt=system_prompt).strip().lower()
    # remove non-alphabetic characters to make sure cleaned llm output is either 'yes' or 'no'
    cleaned_output = re.sub(r'[^a-z]', '', llm_output)
    return cleaned_output == 'yes'
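
Because every judge call is a full LLM generation, and the same gold/predicted pairs are judged again when collecting errors for Question 4, it may be worth caching verdicts. The following is a minimal sketch, assuming the judge's answer for a given pair should be treated as fixed. `make_cached_judge` and `stub_judge` are hypothetical names; the stub stands in for the LLM call so the example runs without a model.

```python
from typing import Any, Callable


def make_cached_judge(judge: Callable[[Any, Any], bool]) -> Callable[[Any, Any], bool]:
    """Hypothetical wrapper: remember verdicts keyed by the string form of each pair."""
    cache: dict[tuple[str, str], bool] = {}

    def cached(gold: Any, pred: Any) -> bool:
        key = (str(gold), str(pred))
        if key not in cache:
            cache[key] = judge(gold, pred)  # only reached the first time a pair is seen
        return cache[key]

    return cached


call_log: list[tuple[str, str]] = []


def stub_judge(gold: str, pred: str) -> bool:
    """Stand-in for an expensive LLM call; records every invocation."""
    call_log.append((gold, pred))
    return gold.lower() == pred.lower()


judge = make_cached_judge(stub_judge)
judge("<A,family,B>", "<a,family,b>")
judge("<A,family,B>", "<a,family,b>")  # served from the cache
print(len(call_log))  # the stub ran once
```

In the notebook this would be used as `make_cached_judge(llm_correct_fn)`, since `RelationTriple` defines `__str__`.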

In [96]:
cm_first_llm = sum(
    [get_confusion_matrix(g, p, llm_correct_fn) for g, p in tqdm(zip(gold_labels, first_outputs), total=len(gold_labels), desc="Evaluating First Prompt (LLM-as-judge)")],
    ConfusionMatrix()
)
cm_second_llm = sum(
    [get_confusion_matrix(g, p, llm_correct_fn) for g, p in tqdm(zip(gold_labels, second_outputs), total=len(gold_labels), desc="Evaluating Second Prompt (LLM-as-judge)")],
    ConfusionMatrix()
)

print("--- LLM-as-judge Evaluation ---")
print("First Prompt (Zero-shot) with LLM-as-judge:")
print(f"  Precision: {precision(cm_first_llm):.4f}")
print(f"  Recall:    {recall(cm_first_llm):.4f}")
print(f"  F1 Score:  {f1(cm_first_llm):.4f}")

print("\nSecond Prompt (Few-shot) with LLM-as-judge:")
print(f"  Precision: {precision(cm_second_llm):.4f}")
print(f"  Recall:    {recall(cm_second_llm):.4f}")
print(f"  F1 Score:  {f1(cm_second_llm):.4f}")

Evaluating First Prompt (LLM-as-judge): 100%|██████████| 50/50 [11:48<00:00, 14.17s/it]
Evaluating Second Prompt (LLM-as-judge): 100%|██████████| 50/50 [05:43<00:00,  6.86s/it]

--- LLM-as-judge Evaluation ---
First Prompt (Zero-shot) with LLM-as-judge:
  Precision: 0.0696
  Recall:    0.3203
  F1 Score:  0.1144

Second Prompt (Few-shot) with LLM-as-judge:
  Precision: 0.1246
  Recall:    0.2969
  F1 Score:  0.1755





### Evaluating the evaluation

**Question 4:** For each of the evaluation methods (strict matching and LLM-as-judge), sample 10 false positives and 10 false negatives. What proportion of these are incorrectly evaluated? **In a few sentences,** compare the evaluation methods and reflect on the challenges and potential methods for evaluating relationship extraction.

### Question 4: Evaluating the evaluation

In [97]:
import random

def extract_fps_fns(gold: list[RelationTriple], pred: list[RelationTriple], correct_fn) -> tuple[list[RelationTriple], list[RelationTriple]]:
    matched_gold_indices = set()
    matched_pred_indices = set()
    for i, g_triple in enumerate(gold):
        for j, p_triple in enumerate(pred):
            # test the cheap set lookup first so correct_fn (an LLM call for the judge)
            # is skipped for predictions that are already matched
            if j not in matched_pred_indices and correct_fn(g_triple, p_triple):
                matched_gold_indices.add(i)
                matched_pred_indices.add(j)
                break

    fns = [g_triple for i, g_triple in enumerate(gold) if i not in matched_gold_indices]
    fps = [p_triple for j, p_triple in enumerate(pred) if j not in matched_pred_indices]

    return fps, fns

all_strict_fps_first = []
all_strict_fns_first = []
all_strict_fps_second = []
all_strict_fns_second = []
all_llm_fps_first = []
all_llm_fns_first = []
all_llm_fps_second = []
all_llm_fns_second = []

for i in tqdm(range(len(gold_labels)), desc="Collecting FPs and FNs"):
    # strict matching - first_outputs
    fps, fns = extract_fps_fns(gold_labels[i], first_outputs[i], strict_correct_fn)
    all_strict_fps_first.extend(fps)
    all_strict_fns_first.extend(fns)

    # strict matching - second_outputs
    fps, fns = extract_fps_fns(gold_labels[i], second_outputs[i], strict_correct_fn)
    all_strict_fps_second.extend(fps)
    all_strict_fns_second.extend(fns)

    # LLM-as-judge - first_outputs
    fps, fns = extract_fps_fns(gold_labels[i], first_outputs[i], llm_correct_fn)
    all_llm_fps_first.extend(fps)
    all_llm_fns_first.extend(fns)

    # LLM-as-judge - second_outputs
    fps, fns = extract_fps_fns(gold_labels[i], second_outputs[i], llm_correct_fn)
    all_llm_fps_second.extend(fps)
    all_llm_fns_second.extend(fns)

# select up to `count` unique triples at random (unseeded, so the sample changes between runs)
def sample_triples(triple_list, count=10):
    unique_triples = list(set(map(str, triple_list)))
    if len(unique_triples) <= count:
        return [RelationTriple.from_triple(t) for t in unique_triples]
    return [RelationTriple.from_triple(t) for t in random.sample(unique_triples, count)]

print("\n--- Sampled False Positives and False Negatives ---")

print("\nStrict Matching - First Prompt (Zero-shot):")
print(f"  Sampled FPs ({len(all_strict_fps_first)} total): {sample_triples(all_strict_fps_first)}")
print(f"  Sampled FNs ({len(all_strict_fns_first)} total): {sample_triples(all_strict_fns_first)}")

print("\nStrict Matching - Second Prompt (Few-shot):")
print(f"  Sampled FPs ({len(all_strict_fps_second)} total): {sample_triples(all_strict_fps_second)}")
print(f"  Sampled FNs ({len(all_strict_fns_second)} total): {sample_triples(all_strict_fns_second)}")

print("\nLLM-as-judge - First Prompt (Zero-shot):")
print(f"  Sampled FPs ({len(all_llm_fps_first)} total): {sample_triples(all_llm_fps_first)}")
print(f"  Sampled FNs ({len(all_llm_fns_first)} total): {sample_triples(all_llm_fns_first)}")

print("\nLLM-as-judge - Second Prompt (Few-shot):")
print(f"  Sampled FPs ({len(all_llm_fps_second)} total): {sample_triples(all_llm_fps_second)}")
print(f"  Sampled FNs ({len(all_llm_fns_second)} total): {sample_triples(all_llm_fns_second)}")
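
The sampling above is unseeded and goes through a `set`, so the sampled triples change from run to run. If a reproducible sample is wanted for the manual review, one option is sketched below. `sample_unique` is a hypothetical helper that works on triple strings, and the toy pool is made up for illustration.

```python
import random


def sample_unique(items: list[str], count: int = 10, seed: int = 0) -> list[str]:
    """Hypothetical helper: reproducible sample of up to `count` distinct items."""
    unique = sorted(set(items))  # sorting removes set-ordering nondeterminism
    rng = random.Random(seed)    # local generator; leaves the global random state alone
    return unique if len(unique) <= count else rng.sample(unique, count)


pool = [f"<P{i},friends,Q{i}>" for i in range(30)] * 2  # toy triples with duplicates
first = sample_unique(pool, count=10, seed=42)
second = sample_unique(pool, count=10, seed=42)
print(first == second, len(first), len(set(first)))
# → True 10 10
```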

Collecting FPs and FNs: 100%|██████████| 50/50 [17:26<00:00, 20.93s/it]


--- Sampled False Positives and False Negatives ---

Strict Matching - First Prompt (Zero-shot):
  Sampled FPs (566 total): [<Rogers,family,Erskine>, <Rose Armitage,family,Dean>, <Rogers,friends,Carter>, <Juntao,family,Soo Yung>, <Sarah,family,Peter>, <Andy,family,Tommy Williams>, <Aldous,family,Peter>, <Carter,family,Soo Yung>, <Dre,business,Volkswagen Scirocco>, <Hiccup,friends,Astrid>]
  Sampled FNs (105 total): [<Clarence Darby,nemeses,Clyde Shelton>, <Steve Rodgers,business,Erskine>, <Wade Wilson,business,Negasonic Teenage Warhead>, <Kelly Van Ryan,nemeses,Sam Lombardo>, <Carter,business,Tania Johnson>, <Oh Dae-su,friends,Mi-do>, <Carter,nemeses,Clive Cobb>, <Henrich Harlander,friends,Victor Frankenstein>, <Kate,romantic,Peter>, <Han,friends,Dre Parker>]

Strict Matching - Second Prompt (Few-shot):
  Sampled FPs (281 total): [<Oh Dae-su,family,daughter>, <Hiccup,family,Toothless>, <Morton Schmidt,romantic,Domingo>, <Tony,friends,Natalie Rushman>, <Carter,business,Special Agent in




From the results above, strict matching is precise but very rigid: any slight variation in a name (the sampled triples contain both `Rogers` and `Steve Rodgers`, for instance) counts as a mismatch, whereas the LLM-as-judge can treat such variants as equivalent. A major challenge is the identification and extraction of the triples themselves, which depends heavily on the prompt: the few-shot prompt reached roughly 1.5 to 1.7 times the F1 of the zero-shot prompt under *both* strict matching (0.1109 vs. 0.0642) and LLM-as-judge (0.1755 vs. 0.1144). The prompt matters just as much for the LLM-as-judge evaluator itself, since ideally it should agree with a human expert as closely as possible; a more powerful judge model could also help.
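
One possible middle ground between the two evaluators is a relaxed string matcher, sketched below. It is not part of the assignment: `relaxed_correct_fn`, `names_match`, and `normalize` are hypothetical names, the function works on plain `(head, relation, tail)` tuples rather than `RelationTriple` objects, and allowing swapped head and tail assumes the relation categories are symmetric. The first gold triple comes from the sampled false negatives above; the predicted triples are invented for illustration.

```python
def normalize(name: str) -> set[str]:
    """Lowercase a name and split it into a set of word tokens."""
    return set(name.lower().replace('"', " ").split())


def names_match(a: str, b: str) -> bool:
    """Treat two names as the same entity if one's tokens are a subset of the other's."""
    ta, tb = normalize(a), normalize(b)
    return bool(ta) and bool(tb) and (ta <= tb or tb <= ta)


def relaxed_correct_fn(gold: tuple[str, str, str], pred: tuple[str, str, str]) -> bool:
    """Hypothetical matcher over (head, relation, tail) tuples.

    Relations must agree exactly; entity names may be partial, and head/tail
    may be swapped because the relation categories are assumed to be symmetric.
    """
    g_head, g_rel, g_tail = gold
    p_head, p_rel, p_tail = pred
    if g_rel != p_rel:
        return False
    same_order = names_match(g_head, p_head) and names_match(g_tail, p_tail)
    swapped = names_match(g_head, p_tail) and names_match(g_tail, p_head)
    return same_order or swapped


print(relaxed_correct_fn(("Han", "friends", "Dre Parker"), ("Dre", "friends", "Han")))
# → True
print(relaxed_correct_fn(("Carter", "nemeses", "Clive Cobb"), ("Carter", "business", "Clive Cobb")))
# → False
```

Token-subset matching still misses spelling variants such as `Rodgers` versus `Rogers`, and it can wrongly merge two characters who share a first name or surname, so it complements rather than replaces the LLM judge or a manual review.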