# Extracting entities from text

This hands-on session introduces the process of identifying important biomedical concepts (e.g. drugs, diseases, etc.) that are mentioned in text.

**NOTE:** If you are running this with Colab, you should make a copy for yourself. If you don't, you may lose any edits you make. To make a copy, select `File` (top-left) then `Save a Copy in Drive`. If you are not using Colab, you may need to install some prerequisites. Please see the instructions in the [GitHub repository](https://github.com/Glasgow-AI4BioMed/ismb2025tutorial).

## Getting data

As in the previous sessions, we'll download some data that we'll use later in this tutorial with the commands below:

In [None]:
!wget -O data.zip https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZS4UymC9ENLk3BGcO_1CeMBRz6NE34JZNWjwpm7X9kC8w?download=1
!unzip -qo data.zip

## Applying named entity recognition

The first step in biomedical information extraction is to figure out which words refer to various biomedical entities. Which words refer to diseases, genes, etc.? We can use a named entity recognition approach for this and the specific model we use will determine which types of entities we can identify.

The Hugging Face `transformers` library is the standard way to access transformer-based language models and apply them to standard tasks. We'll use a `token-classification` pipeline with the [Glasgow-AI4BioMed/bioner_medmentions_st21pv](https://huggingface.co/Glasgow-AI4BioMed/bioner_medmentions_st21pv) model. This model was trained on MedMentions, one of the well-known biomedical text datasets. It identifies biomedical concepts mentioned in text and assigns each one to a category from a large set that includes *Chemicals & Drugs* and *Devices*. Check out the [model page](https://huggingface.co/Glasgow-AI4BioMed/bioner_medmentions_st21pv) for the full list.

To set that up, we use the code below, which fetches the model and prepares it for token classification. The download may take a minute.

In [None]:
from transformers import pipeline

ner_pipeline = pipeline("token-classification", model="Glasgow-AI4BioMed/bioner_medmentions_st21pv")

We can apply the token classification system to some text by calling it with the input text as below. This will return information about the words that have been identified as different types of biomedical entities.

In [None]:
text = "A recent study shows that metformin suppresses AKT1 activation in hepatocellular carcinoma."
ner_pipeline(text)

Note that some of the words have been categorised into different groups, and that the labels have `B-` or `I-` as prefixes. These signify whether the word is the beginning of an entity mention (`B-`) or a continuation that sits inside the entity mention (`I-`). This is known as [Inside–outside–beginning](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) tagging.

In practice, we often don't care about the individual words and want to group them into whole mentions. For example, the words `"small cell lung cancer"`, which might be tagged with `["B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE"]`, should be extracted as a single mention with the label `DISEASE`. We can add an `aggregation_strategy` to the token-classification pipeline below which will do that for us:

In [None]:
ner_pipeline = pipeline("token-classification", 
                        model="Glasgow-AI4BioMed/bioner_medmentions_st21pv",
                        aggregation_strategy="max")

And then when we apply it to text, we will get spans of text tagged with different labels.

In [None]:
entities = ner_pipeline(text)
entities
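
For intuition about what the aggregation is doing, here is a simplified sketch that merges `B-`/`I-` tagged words into mentions by hand. The `group_iob` helper and the example tags are made up for illustration. The real pipeline also has to merge sub-word pieces and reconcile scores, which this sketch ignores.

```python
def group_iob(tokens, tags):
    """Merge IOB-tagged tokens into (label, text) mentions."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new mention, closing any open one
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag with the same label continues the open mention
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with nothing to continue) closes the open mention
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

# Hypothetical tokens and tags, not real model output
tokens = ["small", "cell", "lung", "cancer", "responds", "to", "cisplatin"]
tags = ["B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O", "B-CHEMICAL"]

grouped = group_iob(tokens, tags)
print(grouped)
# [('DISEASE', 'small cell lung cancer'), ('CHEMICAL', 'cisplatin')]
```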

For every entity, we get a `start` and `end` which provide coordinates into the input string. For instance, we can check that the text of the final entity matches:

In [None]:
text[66:90]
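
Rather than typing the offsets by hand, the same check can be done by slicing with each entity's own `start` and `end`. The sketch below uses hand-written dictionaries as stand-ins for the pipeline output (the real output also includes a `score`), so it runs without the model. The offsets are more reliable than the `word` field for recovering the original text, since `word` is rebuilt from tokens and may differ in casing or spacing.

```python
text = "A recent study shows that metformin suppresses AKT1 activation in hepatocellular carcinoma."

# Hand-written stand-ins for aggregated pipeline output, not real model output
example_entities = [
    {"entity_group": "Chemicals & Drugs", "word": "metformin", "start": 26, "end": 35},
    {"entity_group": "Disorders", "word": "hepatocellular carcinoma", "start": 66, "end": 90},
]

mentions = []
for entity in example_entities:
    # `end` is exclusive, so a plain slice recovers the mention text
    mention = text[entity["start"]:entity["end"]]
    mentions.append(mention)
    print(entity["entity_group"], "->", mention)
```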

## Pre-prepared sentences

We have pre-prepared some sentences that we'll work with. They contain lots of biomedical entities. Let's load them up:

In [None]:
import json

with open('data/entity_sentences.json') as f:
  sentences = json.load(f)

len(sentences)

Let's take a quick look at the first few:

In [None]:
sentences[:5]

### 📋 Task 1: Chemical and disease sentences

Now it's time for the first task of this hands-on session. Your task is to find every sentence from the pre-prepared set that contains both a chemical (called 'Chemicals & Drugs' by the NER model) and a disease (called 'Disorders' by the model). You should find that there are 42 unique sentences that contain entities tagged as 'Chemicals & Drugs' and 'Disorders'.

In [None]:
# Your code goes here!

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
from tqdm.auto import tqdm

chemical_disease_sentences = []
for text in tqdm(sentences):
  
  entities = ner_pipeline(text)

  chemicals = [ entity['word'] for entity in entities if entity['entity_group'] == 'Chemicals & Drugs' ]
  diseases = [ entity['word'] for entity in entities if entity['entity_group'] == 'Disorders']

  if chemicals and diseases:
    chemical_disease_sentences.append((text, chemicals, diseases))

len(chemical_disease_sentences)
```

</details>


Some entity types are more fine-grained than the broad categories used in the task above. For instance, 'Neoplastic Process' is a subtype of 'Disease or Syndrome'. The tree of entity categories can be viewed [on this page](https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html). You could go further and find all sentences that mention at least one disease (including its subtypes) or chemical (including its subtypes).
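
One way to handle subtypes is to walk up the category tree from a label until you reach the category you care about. The sketch below uses a tiny, hand-picked excerpt of the UMLS semantic type tree. The `PARENT` table and the `is_a` helper are illustrative only, and you would need to check the label strings against the ones the model actually produces.

```python
# A small hand-picked excerpt of the UMLS semantic type tree (child -> parent).
# See the linked page for the full hierarchy.
PARENT = {
    "Neoplastic Process": "Disease or Syndrome",
    "Mental or Behavioral Dysfunction": "Disease or Syndrome",
    "Disease or Syndrome": "Pathologic Function",
}

def is_a(label, ancestor):
    """True if `label` equals `ancestor` or sits below it in the tree."""
    while label is not None:
        if label == ancestor:
            return True
        label = PARENT.get(label)  # None once we run out of known parents
    return False

labels = ["Neoplastic Process", "Pathologic Function", "Disease or Syndrome"]
disease_like = [label for label in labels if is_a(label, "Disease or Syndrome")]
print(disease_like)
# ['Neoplastic Process', 'Disease or Syndrome']
```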

## Entity linking

Figuring out which words are different entity types is generally only half the challenge. We want to know which specific biomedical entities are being discussed. This can be challenging for several reasons:

- Entities can be known by many different names, including some that aren't documented in common lists of drug, gene or other entity names.
- Mentions may be ambiguous: does APC refer to "Adenomatous Polyposis Coli" or "Activated Protein C"?
- Mentions may not map perfectly to existing entity name lists, perhaps due to misspellings. For example, "NSCLCL" is probably a typo for "NSCLC", the acronym for "non-small cell lung cancer".

## Using vector search

Let's explore a technique that uses dense vectors to represent our different entities. We'll use the [Disease Ontology](https://disease-ontology.org/) for mapping our identified mentions of diseases. For ease, we have pre-prepared a JSON version of the ontology. The ontology can be [downloaded](https://disease-ontology.org/downloads/) in standard ontology formats such as OBO and OWL.

Let's load up the pre-prepared JSON one:

In [None]:
with open('data/disease_ontology.json') as f:
  disease_ontology = json.load(f)

len(disease_ontology)

We can take a look at a random term from this file. We can see that there is an identifier, a standard name and a list of aliases that it is known by. The full Disease Ontology (which we won't use in this session) also provides a lot of additional information including descriptions, links with other ontologies and connections to other ontology terms (e.g. its parent term).

In [None]:
disease_ontology[100]

To create a dense vector for a term, we will use [SapBERT](https://github.com/cambridgeltl/sapbert). It is a transformer-based model that has been trained to generate similar vectors for terms that refer to the same entity. So "non-small cell lung cancer" and "NSCLC" should give similar vectors. We can then use that similarity to figure out the most likely terms in an ontology that some text refers to.

Let's load up SapBERT and use a `feature-extraction` pipeline that enables us to get the vectors from some text.

In [None]:
from transformers import pipeline

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
extractor = pipeline("feature-extraction", model=model_name)

Let's apply it to some text:

In [None]:
text = "cold"

features = extractor(text)

If you print out the result, you get lots and lots of numbers:

In [None]:
features

We need to pull out one specific output. Each token in the text is transformed into a vector that is 768 elements wide (the usual hidden size for BERT-base models). We'll use the [numpy library](https://numpy.org/) to make the result nicer to work with. Once we convert the result to a numpy array, we can see that its shape is 1x3x768: one input text, three tokens and 768 values per token.

In [None]:
import numpy as np

context_vectors = np.array(features)
context_vectors.shape

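The vector we want is the first one (with index 0). SapBERT represents the whole input with the vector of the first token, the special `[CLS]` token that is added to the start of every input. So, we'll use the first row of the token matrix (the first 768-wide vector) to represent this entity.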

In [None]:
mention_vector = context_vectors[0,0,:]

mention_vector.shape
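
To picture the indexing, here is a toy stand-in for the extractor output written as plain nested lists. The numbers are made up and the vectors are only 4 wide instead of 768. The token layout assumes that "cold" is a single word piece sitting between the two special tokens, which is what the 3 in the shape above suggests.

```python
# Toy stand-in for the extractor output: 1 text x 3 tokens x 4 dimensions
toy_features = [[
    [0.1, 0.2, 0.3, 0.4],   # [CLS] special token
    [0.5, 0.6, 0.7, 0.8],   # "cold"
    [0.9, 1.0, 1.1, 1.2],   # [SEP] special token
]]

toy_shape = (len(toy_features), len(toy_features[0]), len(toy_features[0][0]))
print(toy_shape)  # (1, 3, 4)

# First text, first token: the same element that [0, 0, :] selects in numpy
cls_vector = toy_features[0][0]
print(cls_vector)  # [0.1, 0.2, 0.3, 0.4]
```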

This vector is a numeric representation of "cold". We want to compare this vector with vectors that represent every term in the Disease Ontology. We have computed those vectors for the Disease Ontology beforehand. Let's load them up:

In [None]:
import numpy as np
disease_ontology_vectors = np.load("data/disease_ontology_vectors.npy")
disease_ontology_vectors.shape

This shows that we have vectors for 14398 elements. Let's check how many elements we have in our preprocessed Disease Ontology set:

In [None]:
len(disease_ontology)

Great. We have one vector for every element in the ontology.

Now if we wanted to get a score for an element with our SapBERT-created vectors, we can use the [cosine similarity metric](https://en.wikipedia.org/wiki/Cosine_similarity). For instance, let's look at element 100:

In [None]:
disease_ontology[100]

We can get the corresponding vector with `disease_ontology_vectors[100]`. To compare it to our `mention_vector`, we will use the [cosine_similarity function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) from scikit-learn. We have to use `.reshape` to turn the 1D vectors into 2D matrices which the function expects.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(mention_vector.reshape(1,-1), disease_ontology_vectors[100].reshape(1,-1))

Is that a high or low similarity? Really it only matters in comparison to all the other possible entities in the Disease Ontology.
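
As a reminder of what the number means, cosine similarity is the dot product of two vectors divided by the product of their lengths. It ranges from -1 (opposite directions) to 1 (same direction). Here is a minimal plain-Python sketch with a hypothetical `cosine` helper, separate from the scikit-learn function used above:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [2.0, 0.0]))    # same direction: 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))    # perpendicular: 0.0
print(cosine([1.0, 0.0], [-1.0, 0.0]))   # opposite direction: -1.0
```

Because the lengths are divided out, only the direction of the vectors matters, not their magnitude.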

Since only the ranking matters, we want to score everything in the Disease Ontology and see which terms give the highest values. We can do that in a single call by passing the entire matrix of vectors to `cosine_similarity`, which compares our mention vector against every row.

In [None]:
scores = cosine_similarity(mention_vector.reshape(1,-1), disease_ontology_vectors).flatten().tolist()
len(scores)

Now we have scores for all the terms when trying to match "cold". Which one gives us the highest score? We can use [np.argmax](https://numpy.org/doc/2.2/reference/generated/numpy.argmax.html), which tells you the index of the element with the maximum value.

In [None]:
top_idx = np.argmax(scores)
top_score = scores[top_idx]
top_idx, top_score

Element 6824 gives us the highest score. Let's see what that is!

In [None]:
disease_ontology[top_idx]

Excellent. In this case, the closest term for "cold" is likely the correct term.
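
It can also be useful to look at the runners-up, since a close second place hints that a mention is ambiguous. The sketch below ranks a handful of made-up scores in plain Python. The names, the scores and the `top_k` helper are invented for illustration and are not taken from the Disease Ontology results.

```python
# Made-up names and scores standing in for the real similarity scores
toy_names = ["asthma", "common cold", "influenza", "cold sore"]
toy_scores = [0.20, 0.91, 0.55, 0.74]

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

best = top_k(toy_scores, 3)
print(best)  # [1, 3, 2]

for idx in best:
    print(idx, toy_names[idx], toy_scores[idx])
```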

### 📋 Task 2: SapBERT function

Your task is to package up the approach for creating a vector for a disease name and finding the best match. Write a function (called `get_closest_disease`) that takes in text, runs SapBERT and returns the index (into `disease_ontology`) of the best-matching Disease Ontology term.

Try running the input "nsclc". You should find that the ontology term for "lung non-small cell carcinoma" is the closest match.

In [None]:
# Your code goes here


<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
def get_closest_disease(disease_name):
  features = extractor(disease_name)

  context_vectors = np.array(features)

  mention_vector = context_vectors[0,0,:]
  
  scores = cosine_similarity(mention_vector.reshape(1,-1), disease_ontology_vectors).flatten().tolist()

  top_idx = np.argmax(scores)

  return top_idx
```

</details>


Here's the example usage from the task instructions.

In [None]:
top_idx = get_closest_disease('nsclc')
print(disease_ontology[top_idx])

## Dictionary lookup

Let's look at a simpler approach as well. Sometimes we have a fairly exhaustive list of possible names for entities. Chemical names can often be highly specific so let's look at using the list of aliases and matching them exactly.

We'll use this approach to match chemicals to the [ChEBI](https://www.ebi.ac.uk/chebi/) ontology. As before, we have preprocessed it to make this hands-on session easier. Let's load it below:

In [None]:
with open('data/chebi.json') as f:
  chebi = json.load(f)

len(chebi)

It is notably larger than the Disease Ontology. Let's see what an element of this ontology looks like. We have structured it in the same form as the previously loaded Disease Ontology: an identifier, a name and a list of aliases.

In [None]:
chebi[178670]

Now we can make a lookup table that maps each lower-cased name or alias to the index of its term:

In [None]:
chebi_lookup = {}
for i,x in enumerate(chebi):
  for alias in x['aliases']:
    chebi_lookup[alias.lower()] = i
  chebi_lookup[x['name'].lower()] = i

len(chebi_lookup)

For instance, if we looked up "warfarin", we would get index 115409:

In [None]:
chebi_lookup.get("warfarin")

Let's check that that record is what we expect:

In [None]:
chebi[115409]

But if we search for something that doesn't exist, we get `None`:

In [None]:
print(chebi_lookup.get("a fictional drug"))

Notably, this approach ignores ambiguity. If the same name or alias belongs to several terms, each new entry overwrites the previous one, so only the last term is kept. You could keep track of aliases that have multiple mappings, but for simplicity we won't here.
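
To see the effect on a small scale, the sketch below builds the same kind of lookup from two toy records. The `id` key name and the alias list are assumptions made for illustration, chosen so that both records claim the name "erythromycin".

```python
# Two toy records in the same shape as the ChEBI list.
# The "id" key and the alias are assumptions made for this example.
toy_chebi = [
    {"id": "CHEBI:42355", "name": "erythromycin A", "aliases": ["erythromycin"]},
    {"id": "CHEBI:48923", "name": "erythromycin", "aliases": []},
]

toy_lookup = {}
for i, term in enumerate(toy_chebi):
    for alias in term["aliases"]:
        toy_lookup[alias.lower()] = i
    toy_lookup[term["name"].lower()] = i

# Both records claim "erythromycin", but only the last write is kept
print(toy_lookup["erythromycin"])  # 1
print(len(toy_lookup))             # 2
```

The first record's claim on "erythromycin" disappears without any warning, so which term wins depends only on the order of the list.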

### 📋 Task 3: Apply entity linkers

The final task for this session is to apply the two different entity linking approaches to sentences that contain chemicals and diseases to see which appear the most. We've made a small dataset that can be loaded from `data/entity_linking_sentences.json` as below.

For the disease, use the SapBERT approach and pick the highest scoring disease from the Disease Ontology. For the chemical, use the lookup approach (remembering to use lower-case strings when searching).

For chemicals, you should find that *sulfuric acid* appears 20 times. For diseases, you should find that *hypertension* appears 10 times.

*Hint: You could use the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) to help with counting frequent chemicals and diseases*
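
If you haven't used `Counter` before, here is a minimal sketch of how it behaves. The list of names is made up and simply stands in for whatever your entity linker returns.

```python
from collections import Counter

# Made-up linked names, standing in for the output of an entity linker
linked = ["hypertension", "asthma", "hypertension", "influenza", "asthma", "hypertension"]

counts = Counter()
for name in linked:
    counts[name] += 1  # missing keys start at 0, so no initialisation is needed

print(counts.most_common(2))
# [('hypertension', 3), ('asthma', 2)]
```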

In [None]:
with open('data/entity_linking_sentences.json') as f:
  entity_linking_sentences = json.load(f)

entity_linking_sentences[:2]

In [None]:
# Your code goes here

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
from collections import Counter
from tqdm.auto import tqdm

with open('data/entity_linking_sentences.json') as f:
  entity_linking_sentences = json.load(f)

chemical_counts = Counter()
disease_counts = Counter()

for sentence in tqdm(entity_linking_sentences):
  for chemical_mention in sentence['chemicals']:
    chemical_idx = chebi_lookup.get(chemical_mention.lower())
    
    # Index 0 is a valid match, so compare with None instead of relying on truthiness
    if chemical_idx is not None:
      chemical = chebi[chemical_idx]['name']
      chemical_counts[chemical] += 1

  for disease_mention in sentence['diseases']:
    disease_idx = get_closest_disease(disease_mention)
    disease = disease_ontology[disease_idx]['name']
    
    disease_counts[disease] += 1

print(chemical_counts.most_common(3))
print(disease_counts.most_common(3))
```

</details>


#### 💡 Bonus Idea

The two methods used here rely only on the mention text (i.e. the words describing the chemical/disease) and do not use the text of the whole sentence. Methods that use the full text may perform better, but tend to be more computationally expensive. A combination of approaches may be best!

## 🏁 End of Hands-on Session

And that brings us to the end of the session. You've learned about:

- Named entity recognition to figure out which words refer to entities of different types
- Entity linking for deciding which entity from an ontology is being discussed
  - Vector-based approaches such as SapBERT
  - Dictionary matching using the known names and aliases of terms

## 🧰 Optional Extras

If you've got extra time, you could try some of the following ideas:

- Some names in ChEBI may map to multiple items (e.g. "erythromycin" -> ('CHEBI:42355', 'erythromycin A') and ('CHEBI:48923', 'erythromycin')). Change the code to map those cases to "AMBIGUOUS". How might you deal with those cases better?
- Use the code from the first hands-on session to get sentences from a PubMed document. Apply the NER model, and then the entity linking models. Count the number of chemicals and diseases that appear in that document.