# Relation extraction with co-occurrences and HuggingFace

In this hands-on session, we'll explore identifying associations between entities in text. We'll first figure out how to find sentences that contain entities of interest, then count co-occurrences, and finally move on to more complex relation extraction.

**NOTE:** If you are running this with Colab, you should make a copy for yourself. If you don't, you may lose any edits you make. To make a copy, select `File` (top-left) then `Save a Copy in Drive`. If you are not using Colab, you may need to install some prerequisites. Please see the instructions on the [GitHub Repo](https://github.com/Glasgow-AI4BioMed/ismb2025tutorial).

## Getting Data

As in the previous sessions, we'll download some data that we'll use later in this tutorial with the commands below:

In [None]:
!wget -O data.zip https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/Ec8sygSj-zlAj6RXZHYGXqYBygZYM978Ts2FrTgBTfqmOQ?download=1
!unzip -qo data.zip

## Getting documents with pre-extracted entities

In the last hands-on session, we looked at applying named entity recognition methods to identify mentions of biomedical concepts (e.g. diseases, etc). [PubTator](https://www.ncbi.nlm.nih.gov/research/pubtator3/) is an existing resource which contains entity annotations (for certain types) for PubMed abstracts and PubMed Central full-text articles. Instead of running NER tools in this session, let's look at getting annotated texts from PubTator.

Similar to PubMed, you can get documents through [bulk download](https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3) or through [their API](https://www.ncbi.nlm.nih.gov/research/pubtator3/api). Let's examine how to use their API to get a document. Below is the code to get the text and entity annotations for a single PubMed abstract (pmid=20573926).

In [None]:
import requests

# Example PubMed ID
pmid = "20573926"

# PubTator API endpoint for BioC XML
url = f"https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids={pmid}"

# Make the request (with a timeout so the cell cannot hang indefinitely)
response = requests.get(url, timeout=30)

# Raise an error if the request was not successful
response.raise_for_status()

document_xml = response.text

This gets the document in an XML format known as BioC XML. Let's take a quick look at what that looks like. We'll use some code to pretty print the XML which puts in nice indenting for us.

In [None]:
from xml.dom.minidom import parseString

dom = parseString(document_xml)
print(dom.toprettyxml(indent="  "))

The XML contains documents (with the &lt;document&gt; tag). Each document could contain multiple passages (e.g. the title, the abstract and sections of the paper could be treated as separate passages). Each passage contains text, along with information about the entities annotated there (the &lt;annotation&gt; tag).
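
To make that nesting concrete, here is a minimal sketch that walks the same document, passage and annotation hierarchy using only Python's built-in `xml.etree.ElementTree`. The XML snippet is hand-written for illustration (it is not real PubTator output), and it assumes the usual BioC layout where each `<infon>` carries a `key` attribute.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written BioC-style snippet (illustrative only, not real PubTator output)
sample_xml = """<collection>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Aspirin for headache.</text>
      <annotation id="1">
        <infon key="type">Chemical</infon>
        <infon key="identifier">MESH:D001241</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
    </passage>
  </document>
</collection>"""

root = ET.fromstring(sample_xml)

found = []
for document in root.iter("document"):
    for passage in document.iter("passage"):
        for annotation in passage.iter("annotation"):
            # Collect the key/value metadata attached to this annotation
            infons = {infon.get("key"): infon.text for infon in annotation.findall("infon")}
            found.append((annotation.findtext("text"), infons["type"], infons["identifier"]))

print(found)
```

In practice a dedicated BioC parser is more convenient than walking the XML by hand, but the sketch shows there is nothing mysterious about the format.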

You can view the same document with all the annotations through the PubTator website: https://www.ncbi.nlm.nih.gov/research/pubtator3/publication/20573926

Let's get back to the XML and save it out to a file for now.

In [None]:
with open('example.bioc.xml','w') as f:
  f.write(document_xml)

And we'll demonstrate loading a BioC XML file using the [bioc](https://github.com/bionlplab/bioc) Python package. We have to install it first:

In [None]:
!pip install bioc

Then to load up that BioC XML file that we just saved, we can use the following code:

In [None]:
from bioc import biocxml

with open('example.bioc.xml') as f:
  collection = biocxml.load(f)

Let's see how many documents are in this file. We're expecting only one as we downloaded a single one earlier:

In [None]:
len(collection.documents)

Excellent. Only one document. 

Let's see how many passages are in this single document:

In [None]:
document = collection.documents[0]
len(document.passages)

And let's output what the text is for each passage:

In [None]:
for passage in document.passages:
  print(passage.text)
  print("="*30)

Great. It appears to be the title and the abstract as separate passages.

We'll focus on the first passage for now as we explore the data:

In [None]:
passage = document.passages[0]
passage.text


Now let's see what metadata we can get for this document. With PubTator, we can check the metadata for each passage with `.infons`. It's a dictionary containing various fields including journal, year, etc.

In [None]:
passage.infons

And how about the annotations? These are the entities that have already been identified by PubTator. These include genes, diseases, chemicals, genetic variants and species.

First, how many are there in this short passage?

In [None]:
len(passage.annotations)

Let's iterate through them and provide some details:

In [None]:
for anno in passage.annotations:
  print(f"{anno.text=}\n{anno.infons=}\n{anno.total_span.offset=}\n{anno.total_span.length=}\n")

We get similar information to the previous hands-on session with named entity recognition methods. These annotations provide the location of the entities in the text (with the `.total_span.offset` and `.total_span.length` fields). They also provide the entity type (e.g. `Gene`) with `.infons['type']`. And they have done entity linking to various ontologies and resources. The identifier is accessible through `.infons['identifier']`.
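
As a rough sketch of how the offset and length map back onto the text, the example below uses plain variables in place of `passage.text`, `passage.offset`, `anno.total_span.offset` and `anno.total_span.length`. All the values are made up. It also assumes that annotation offsets are counted from the start of the whole document rather than the passage, which is worth checking against your own data (for a title passage starting at offset 0 it makes no difference).

```python
# Hypothetical values standing in for passage.text, passage.offset,
# anno.total_span.offset and anno.total_span.length
passage_text = "Imatinib is used to treat chronic myeloid leukemia."
passage_offset = 120   # where this passage starts within the whole document
anno_offset = 146      # document-level start of the annotation
anno_length = 24

# Assumption: annotation offsets count from the start of the document,
# so subtract the passage offset before indexing into the passage text
start = anno_offset - passage_offset
mention = passage_text[start:start + anno_length]

print(mention)
```

Comparing the recovered string with `anno.text` is a quick sanity check that the offsets are being interpreted correctly.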

## Calculating co-occurrences

We'll start with the most straightforward way to identify relationships between entities: that they appear in lots of documents together. Co-occurrences may not provide nuance (by telling you the specific relation) but they can still be useful to identify that two concepts (e.g. two genes) appear to be connected because they appear together a lot. The co-occurrences could be in sentences, paragraphs, paper abstracts or even whole papers. The granularity may depend on what you want to get, and how rare the terms are.

Let's load up a set of documents that have already been annotated through the PubTator platform. We'll use them to calculate some co-occurrence counts. You can load the data with the code below:

In [None]:
from bioc import biocxml

with open('data/cooccurrences.bioc.xml', "r") as f:
    collection = biocxml.load(f)

The data is in BioC XML data format as we explored above. Let's see how many documents we have:

In [None]:
len(collection.documents)

That's a lot of documents! You need thousands (or possibly even millions) to get a good signal for identifying co-occurrences, especially for rarer terms.

This dataset comes with text and also the annotations from PubTator. Let's remind ourselves what they look like:

In [None]:
collection.documents[0].passages[0].annotations[0]

They come with various fields including the text from the research paper, the database identifier that PubTator has linked them to and the entity type. In this case, `cyanogen bromide` has been linked to the MeSH database with ID: `MESH:D003488` and is a `Chemical`. You can use the MeSH Browser to view the page for [MESH:D003488](https://meshb.nlm.nih.gov/record/ui?ui=D003488).

For co-occurrences, it's generally a good idea to use the normalized component (i.e. the ontology/database identifier) instead of the text form (which is 'cyanogen bromide' in this case) as the text form may vary from paper to paper. To make it easier to work with, let's make a lookup to go from the identifier to a text form.

The code below creates a dictionary lookup from the type and identifier to the text:

In [None]:
quick_lookup = {}
for doc in collection.documents:
  for passage in doc.passages:
    for anno in passage.annotations:
      # Later mentions overwrite earlier ones, so we keep one text form per entity
      quick_lookup[(anno.infons['type'],anno.infons['identifier'])] = anno.text

That means we can lookup the text for the Chemical with identifier `MESH:D003488` with the code below. 

In [None]:
quick_lookup[('Chemical','MESH:D003488')]

For a full project, it would be better to use the canonical name (i.e. the name that the entity is known by in the database) which would likely involve downloading the database (i.e. getting MeSH) or using an API to lookup the canonical name of entities.

Now let's look at getting some co-occurrences. What entities are identified in the first document in the collection? The code below outputs them:

In [None]:
doc = collection.documents[0]

for passage in doc.passages:
  for anno in passage.annotations:
    print(anno.infons['type'], anno.infons['identifier'])

While there are eight annotations, they relate to only two unique entities (a Chemical with identifier `MESH:D003488` and a Gene with identifier `213`). Let's see what those are:

In [None]:
quick_lookup[('Chemical','MESH:D003488')], quick_lookup[('Gene','213')]

It's the `cyanogen bromide` Chemical that we had before and the `albumin` gene. In this case, we would count this as a single co-occurrence of these two entities, as they have appeared in one document. Alternatively, we could also check whether they appear in the same sentence, but we will stick to the document level at the moment.
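
For comparison, a sentence-level check could look something like the sketch below. It uses toy data: the sentence spans and annotation offsets are made up, and in practice the sentence boundaries would have to come from a sentence splitter (for example spaCy, as used in the first hands-on session).

```python
import itertools

# Toy data: made-up sentence spans (start, end) and annotations (start offset, entity)
sentence_spans = [(0, 50), (50, 120)]
annotations = [
    (4, ("Chemical", "MESH:D003488")),
    (30, ("Gene", "213")),
    (70, ("Gene", "213")),
]

sentence_pairs = set()
for sent_start, sent_end in sentence_spans:
    # Unique entities whose mention starts inside this sentence
    entities = sorted({entity for offset, entity in annotations if sent_start <= offset < sent_end})
    # Every pair of distinct entities in the same sentence counts as co-occurring
    sentence_pairs.update(itertools.combinations(entities, 2))

print(sentence_pairs)
```

Here only the first toy sentence contains both entities, so it contributes the single pair. The second sentence mentions only the gene and contributes nothing.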

Let's look at getting co-occurrence counts at scale:

### Task 1

Now it's time to go through thousands of documents in this collection (with `collection.documents`) and count up the number of documents that entities appear in, and the number of co-occurrences. You should get the identifiers for all the entities mentioned in each document, remove any duplicates (using a Python `set`) and then get every pair of entities. You could use the [itertools.combinations](https://docs.python.org/3/library/itertools.html#itertools.combinations) function to get the pairs.

You should see that `MESH:D009369` and `9606` are the most commonly co-occurring pair. With `quick_lookup` you could check and see that they map to tumors and patients (or really mentions of humans).

In [None]:
# Your code goes here


<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
import itertools
from collections import Counter

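cooccurrences = Counter()
entity_counts = Counter()  # number of documents that each entity appears in

for doc in collection.documents:
  identifiers = [ (anno.infons['type'],anno.infons['identifier']) for passage in doc.passages for anno in passage.annotations ]

  # Remove duplicates so each entity counts once per document (sorted for a consistent pair order)
  unique_identifiers = sorted(set(identifiers))
  entity_counts.update(unique_identifiers)

  for id1,id2 in itertools.combinations(unique_identifiers, 2):
    cooccurrences[(id1,id2)] += 1

cooccurrences.most_common(5)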
```

</details>


## A rule-based approach

Let's move on to figuring out the relationships between entities. It can be important to know more than whether two entities appear in the same context. Is that Chemical causing or treating that Disease?

Sometimes, the text may follow specific patterns (e.g. on drug labels) or we want to extract relations that are worded in highly specific ways. Let's look at using patterns for this.

We'll start by looking at an example sentence:

In [None]:
sentence = {'text': 'Metformin is used for type 2 diabetes, and studies have evaluated its efficacy in polycystic ovary syndrome.',
 'chemicals': ['Metformin'],
 'diseases': ['type 2 diabetes', 'polycystic ovary syndrome']
}

The sentence follows a common pattern of `[CHEMICAL] is used for [DISEASE]`. Could we programmatically try to match that pattern for this sentence?

In [None]:
rule = "[CHEMICAL] is used for [DISEASE]"

Let's get all possible pairs of chemical and disease in this sentence:

In [None]:
pairs = [ (chemical,disease) for chemical in sentence['chemicals'] for disease in sentence['diseases'] ]
pairs

There are two possible pairs of chemical/disease in this sentence. Let's see if any match our rule. We'll focus on the first one (which we do know will match):

In [None]:
chemical, disease = pairs[0]
chemical, disease

Let's take the sentence and replace the chemical and disease with those placeholders (`[CHEMICAL]` and `[DISEASE]`):

In [None]:
sentence_with_placeholders = sentence['text'].replace(chemical,'[CHEMICAL]').replace(disease,'[DISEASE]')
sentence_with_placeholders

And then check if there is a match:

In [None]:
rule_matches = rule in sentence_with_placeholders

print(f"Match: {rule_matches}")
print(f"  [CHEMICAL]={chemical}")
print(f"  [DISEASE]={disease}")
print(f"  {sentence_with_placeholders}")

Fantastic. It did match. Well what about the other pair?

In [None]:
chemical, disease = pairs[1]
print(chemical, disease)

sentence_with_placeholders = sentence['text'].replace(chemical,'[CHEMICAL]').replace(disease,'[DISEASE]')
print(sentence_with_placeholders)

rule_matches = rule in sentence_with_placeholders
print(f"Match: {rule_matches}")

No match as expected.

### Task 2

The task is to come up with more rules and apply them to a dataset of sentences to extract more ways of saying that a chemical treats a disease. The dataset to load is `data/rulebased_sentences.json` and contains more sentences in a similar format to above. When coming up with rules, you could look at sentences that don't match any rules you've already come up with.

Aim to come up with rules that match with over 20 sentences!


In [None]:
# Your code goes here

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
import json
from collections import Counter

with open('data/rulebased_sentences.json') as f:
  sentences = json.load(f)

rules = [
  "[CHEMICAL] is used to treat [DISEASE]",
  "[CHEMICAL] treats [DISEASE]",
  "[CHEMICAL] is effective against [DISEASE]",
  "[CHEMICAL] has been shown to treat [DISEASE]",
  "[CHEMICAL] therapy for [DISEASE]",
  "[CHEMICAL] has therapeutic effects on [DISEASE]",
  "[CHEMICAL] is indicated for the treatment of [DISEASE]",
  "[CHEMICAL] is administered to manage [DISEASE]",
  "[CHEMICAL] is prescribed for [DISEASE]",
  "[CHEMICAL] is a treatment option for [DISEASE]",
  "[CHEMICAL] can be used for [DISEASE] therapy",
  "[CHEMICAL] is beneficial for patients with [DISEASE]",
  "Treatment of [DISEASE] with [CHEMICAL]",
  "Use of [CHEMICAL] in the treatment of [DISEASE]",
  "[CHEMICAL] alleviates symptoms of [DISEASE]"
]

counts = Counter()

for sentence in sentences:
  pairs = [ (chemical,disease) for chemical in sentence['chemicals'] for disease in sentence['diseases'] ]

  for chemical,disease in pairs:
    sentence_with_placeholders = sentence['text'].replace(chemical,'[CHEMICAL]').replace(disease,'[DISEASE]')

    if any( rule in sentence_with_placeholders for rule in rules):
      counts[(chemical,disease)] += 1

len(counts)
```

</details>


There are advantages and disadvantages with the rule-based approach:

Advantages:

- The big advantage is that you have full control over how the relations are extracted, which may be very important for your project if specific wording is really needed. 
- It is very fast.

Disadvantages:
- Need to write out lots of rules
- There will always be cases that don't match a rule because of a small difference
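
One possible way to soften that last disadvantage is to turn each rule into a regular expression that tolerates small variations. The sketch below is only an illustration, not part of the approach used above. The `FILLERS` list and the `rule_to_regex` helper are hypothetical names, and the choice of filler words is an assumption you would need to tune.

```python
import re

# Hypothetical list of optional filler words allowed between the words of a rule
FILLERS = ["commonly", "often", "widely", "frequently"]

def rule_to_regex(rule):
    """Turn a placeholder rule into a regex that tolerates one filler word between tokens."""
    filler_group = "|".join(FILLERS)
    optional_filler = rf"(?:(?:{filler_group})\s+)?"
    # Escape each token so that [CHEMICAL] and [DISEASE] are matched literally
    tokens = [re.escape(token) for token in rule.split()]
    return re.compile((r"\s+" + optional_filler).join(tokens))

rule = "[CHEMICAL] is used for [DISEASE]"
pattern = rule_to_regex(rule)

# The exact substring check misses this sentence, the relaxed pattern does not
strict = rule in "[CHEMICAL] is commonly used for [DISEASE]."
relaxed = bool(pattern.search("[CHEMICAL] is commonly used for [DISEASE]."))

# Words outside the filler list (such as a negation) still block the match
negated = bool(pattern.search("[CHEMICAL] is not used for [DISEASE]."))

print(strict, relaxed, negated)
```

Restricting the extra word to an explicit list keeps control over what is matched, at the cost of having to maintain that list.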

## A basic Open Information Extraction method

Sometimes you may not be sure what information you want to extract, and certainly couldn't come up with rules to extract things. Open information extraction methods attempt to extract information without pre-determined labels, schema, etc. One approach to this is to extract the main verb between two entity mentions. The main verb often describes the action going on and is often the main relation that we care about.

Let's set up an example sentence. Here the main verb between the two entities is `binds`, which would be a useful thing to identify.

In [None]:
sentence = {
    "text": "Cetuximab binds to the epidermal growth factor receptor, blocking cancer cell proliferation.",
    "entities": ["Cetuximab", "epidermal growth factor receptor"]
}

Let's figure out where the two entities of interest are in the text (i.e. their string coordinates):

In [None]:
pair = ("Cetuximab", "epidermal growth factor receptor")

loc1 = sentence['text'].index(pair[0])
loc2 = sentence['text'].index(pair[1])

loc1, loc2

We can use spaCy to parse the text and get the individual tokens with their parts-of-speech, as was shown in the first hands-on session. Let's do that for this sentence:

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(sentence['text'])

for token in doc:
  print(f"{token.pos_}: '{token.text}' at {token.idx}")

We want to identify the verb that occurs between our two entities (at positions 0 and 23). Let's adjust the spaCy code to get it:

In [None]:
for token in doc:
    if token.pos_ == "VERB" and token.idx > loc1 and token.idx < loc2:
        print(f"{pair} {token.text}")


You could apply this at scale to find the main verbs between two entities occurring in the same sentence. This has some advantages and disadvantages too:

Advantages:
- You don't need to decide the relation types you want to extract up front
- It can help explore possible information to extract
- It can be fairly fast

Disadvantages:
- It won't catch negative cases (e.g. "Cetuximab never binds to TERT")
- There is more to meaning than the main verb
- The main verb may not be directly connecting the two entities
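
As a rough illustration of the first disadvantage, a very simple check for negation cues between the two entities might look like the sketch below. The cue list and the `has_negation_between` helper are hypothetical and far from complete (for example, contractions such as "doesn't" are not handled), so treat it as a starting point rather than a reliable negation detector.

```python
# Hypothetical, far-from-complete list of negation cues
NEGATION_CUES = {"not", "never", "no", "cannot"}

def has_negation_between(text, entity1, entity2):
    """Return True if a negation cue appears in the text between the two entity mentions."""
    # Assumes entity1 appears before entity2; otherwise the slice is empty and we return False
    start = text.index(entity1) + len(entity1)
    end = text.index(entity2)
    words_between = text[start:end].lower().split()
    return any(word.strip(".,;") in NEGATION_CUES for word in words_between)

negated_example = has_negation_between(
    "Cetuximab never binds to TERT", "Cetuximab", "TERT")
plain_example = has_negation_between(
    "Cetuximab binds to the epidermal growth factor receptor",
    "Cetuximab", "epidermal growth factor receptor")

print(negated_example, plain_example)
```

A flag like this could be stored alongside the extracted verb so that negated sentences can be filtered out or reviewed separately.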

## A transformer model for relation classification

A machine learning model can also be used to classify the relationship between two entities in some text. One challenge of this is that you need annotated data to build and evaluate your model.

We will use a model that has been trained on labels generated by a larger language model. This has certain limitations as the labels will likely contain some errors. However, it demonstrates how a machine learning model can be applied. The model we will apply is [Glasgow-AI4BioMed/synthetic_relex](https://huggingface.co/Glasgow-AI4BioMed/synthetic_relex). First, let's load it up:

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification", model="Glasgow-AI4BioMed/synthetic_relex")

The input to the model is text with the two entities of interest wrapped with `[E1][/E1]` and `[E2][/E2]` tags. Those denote the first and second entity in a relation.
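
If you know the character positions of the two entities, one way to add the tags is sketched below. The `add_entity_tags` helper is a hypothetical name for illustration. Working from offsets avoids a pitfall of plain string replacement, which would tag every occurrence of an entity name in the text.

```python
def add_entity_tags(text, span1, span2):
    """Wrap two non-overlapping (start, end) character spans in [E1] and [E2] tags."""
    tagged = text
    # Insert tags for the later span first so the earlier span's offsets stay valid
    for (start, end), name in sorted([(span1, "E1"), (span2, "E2")], reverse=True):
        tagged = f"{tagged[:start]}[{name}]{tagged[start:end]}[/{name}]{tagged[end:]}"
    return tagged

text = "Paclitaxel is a common chemotherapy used for lung cancer."

# Offsets are looked up here for the demo; with annotated data they come with the annotations
start1 = text.index("Paclitaxel")
start2 = text.index("lung cancer")
span1 = (start1, start1 + len("Paclitaxel"))
span2 = (start2, start2 + len("lung cancer"))

tagged = add_entity_tags(text, span1, span2)
print(tagged)
```

Swapping the two span arguments makes the second mention `[E1]` and the first `[E2]`.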

The classifier then predicts the label of the relation when text is passed to it:

In [None]:
classifier("[E1]Paclitaxel[/E1] is a common chemotherapy used for [E2]lung cancer[/E2].")

This model can classify a lot of different relation types including *treats*, *upregulates*, *binds*, etc between two entities. Check the [model page](https://huggingface.co/Glasgow-AI4BioMed/synthetic_relex) to see the full list.

The `[E1][/E1]` and `[E2][/E2]` tags tell you the subject and object of the relation as there is directionality in the relations. The relation E1→E2 is different from E2→E1. If we switch the tags around in the text above, we get a different result and a prediction of no relation for this case:

In [None]:
output = classifier("[E2]Paclitaxel[/E2] is a common chemotherapy used for [E1]lung cancer[/E1].")

# Let's just get the label this time
output[0]['label']

Make up your own sentence and see what works for the classifier and what doesn't.

In [None]:
classifier("Your own text here (remember to include the [E1][/E1] and [E2][/E2] tags).")

### Task 3

Your task is to apply the [Glasgow-AI4BioMed/synthetic_relex](https://huggingface.co/Glasgow-AI4BioMed/synthetic_relex) relation classifier model to a dataset of sentences that contain two entities and count the number of labels. You'll need to insert the `[E1][/E1]` and `[E2][/E2]` tags into the sentence text. Then you run the model with the `classifier` and get out the label. You should find that there are 48 sentences that give the `inhibits` relation.

Here's some code to load up the small dataset of sentences. Each sentence has exactly two entities which you should find the relationship between.

In [None]:
import json

with open('data/relex_sentences.json') as f:
  sentences = json.load(f)

sentences[0]

In [None]:
# Your code goes here

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
from collections import Counter
from tqdm.auto import tqdm

label_counts = Counter()

for sentence in tqdm(sentences):
  entity1, entity2 = sentence['entity1'], sentence['entity2']

  sentence_with_tags = sentence['text'].replace(entity1,f'[E1]{entity1}[/E1]').replace(entity2,f'[E2]{entity2}[/E2]')

  output = classifier(sentence_with_tags)
  label = output[0]['label']

  label_counts[label] += 1

label_counts
```

</details>


## End of Hands-on Session

And that brings us to the end of the session. You've learned about:

- Getting documents and annotations from [PubTator](https://www.ncbi.nlm.nih.gov/research/pubtator3/) using [their API](https://www.ncbi.nlm.nih.gov/research/pubtator3/api).
- Calculating co-occurrence counts between entities for a large set of documents
- Using rules to extract specifically phrased relations
- Extracting the main verb to do open information extraction where you aren't sure what you want to extract
- Applying a [BERT-based transformer model](https://huggingface.co/Glasgow-AI4BioMed/synthetic_relex) to classify relations between two entities.

## Optional Extras

If you've got extra time, you could try some of the following ideas:

- Calculate p-values for each co-occurrence by creating a contingency matrix of document counts of when two entities appear (and appear together)
- Apply one of the other relation extraction methods (e.g. rule-based or open information extraction) to the final dataset
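
For the p-value idea, one possible starting point is a one-sided Fisher's exact test computed from four document counts, sketched below with only the standard library. The `cooccurrence_p_value` helper and the example counts are made up for illustration. With many entity pairs you would also need a multiple-testing correction.

```python
from math import comb

def cooccurrence_p_value(n_both, n_a, n_b, n_docs):
    """One-sided Fisher's exact test: chance of seeing at least n_both shared documents.

    n_a and n_b are the numbers of documents mentioning each entity,
    n_both is the number mentioning both, and n_docs is the collection size.
    """
    total = comb(n_docs, n_b)
    upper = min(n_a, n_b)
    # Sum the hypergeometric probabilities for overlaps of n_both or more
    tail = sum(comb(n_a, k) * comb(n_docs - n_a, n_b - k) for k in range(n_both, upper + 1))
    return tail / total

# Made-up counts: two entities in 3 documents each, always together, out of 10 documents
p_small = cooccurrence_p_value(n_both=3, n_a=3, n_b=3, n_docs=10)

# Asking for at least zero shared documents is always satisfied
p_all = cooccurrence_p_value(n_both=0, n_a=3, n_b=3, n_docs=10)

print(p_small, p_all)
```

The per-entity document counts and the co-occurrence counts from Task 1 provide the inputs needed for this kind of test.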