# Large language models for information extraction

This hands-on session introduces the different ways to run a large language model (LLM) and explores how they can be applied to biomedical information extraction. There are two main ways to use an LLM: locally and through an API. They both have their pros and cons.

**NOTE:** If you are running this in Colab, you should make a copy for yourself. If you don't, you may lose any edits you make. To make a copy, select `File` (top-left) then `Save a Copy in Drive`. If you are not using Colab, you may need to install some prerequisites. Please see the instructions on the [GitHub repo](https://github.com/Glasgow-AI4BioMed/ismb2025tutorial).

## Getting Data

As in the previous sessions, we'll download some data that we'll use later in this tutorial with the commands below:

In [None]:
!wget -O data.zip https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZaU9DTcAwdCpg07eGCIiqMBvGmYJdqfnfhP1ygcDkRkBg?download=1
!unzip -qo data.zip

## Getting a GPU

This session will make use of the free GPU available on Google Colab. To get it...

## Loading data

We've prepared some text containing various pieces of information that can be extracted. Let's load it up and take a quick look.

In [None]:
import json
with open('data/llm_sentences.json') as f:
  sentences = json.load(f)

len(sentences)

And if we look at a few of the sentences, we can see that they contain various biomedical entities (e.g. drugs and genes) and relations between them that we might want to extract:

In [None]:
sentences[:5]

## Running an LLM locally

The first way to run an LLM that we'll examine is locally, on the machine that you're using. This has some advantages and disadvantages. If you are running the code on a machine you control, you are able to use sensitive data (e.g. patient medical records) which typically can't be transmitted outside an organisation. It also gives you fine-grained control over the LLM and can enable some adjustments to it. However, it potentially requires a large GPU (or GPUs), and the best-performing models are not publicly available.

LLMs are typically measured by their number of parameters. Their computation consists largely of matrix multiplications, and the parameter count is the total number of values across all their internal matrices. The largest, best-performing models have hundreds of billions of parameters, which requires several of the most expensive GPUs to run. Smaller models in the 1B to 70B range can be run on more modest GPUs (with a few tricks).
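
To make the link between matrices and parameter count concrete, here is a minimal sketch that counts the values in a handful of weight matrices. The layer names and shapes are made up for illustration and do not correspond to any real model.

```python
# Hypothetical weight-matrix shapes (rows, columns) for a toy model.
# These numbers are illustrative only, not taken from a real LLM.
toy_shapes = {
    "embedding": (1000, 64),
    "attention_query": (64, 64),
    "attention_key": (64, 64),
    "attention_value": (64, 64),
    "feed_forward_in": (64, 256),
    "feed_forward_out": (256, 64),
}


def count_parameters(shapes):
    """Sum rows * columns over every matrix to get the parameter count."""
    return sum(rows * cols for rows, cols in shapes.values())


toy_parameter_count = count_parameters(toy_shapes)
print(f"{toy_parameter_count:,} parameters")
```

Real models follow the same arithmetic, just with far more and far larger matrices.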

We'll use a smaller 1B model today. The code below loads up the Falcon3-1B-Instruct model. It may take a minute to download the model and its associated tokenizer. 

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/Falcon3-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Notably, in the code above we load the model as float16. Computers typically store floating-point values in four bytes (known as float32), so we're asking for a smaller two-byte representation of each parameter. There's lots of research showing that LLMs can work very well even when their internal parameters are compressed down to only a few bits.

Let's check how many parameters there are in the model:

In [None]:
model.num_parameters()

Roughly 1.7 billion parameters. How much memory does that need?

In [None]:
# float16 uses 2 bytes per parameter (avoid shadowing the built-in `bytes`)
num_bytes = model.num_parameters() * 2
gigabytes = num_bytes / (1024*1024*1024)
gigabytes

So a small model still needs over 3 gigabytes of memory. That's quite a bit of GPU memory.
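
For comparison, the same calculation can be repeated for other precisions. This is a rough sketch that counts the weights only (it ignores activations and other runtime overhead), and the parameter count is an assumed round figure based on the number reported above. The function name `model_memory_gib` is just an illustrative helper.

```python
def model_memory_gib(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in GiB (1024**3 bytes)."""
    num_bytes = num_parameters * bits_per_parameter / 8
    return num_bytes / (1024 ** 3)


# Assumed parameter count, rounded from the figure reported above.
approx_parameters = 1_700_000_000

# Compare a few common precisions, from full float32 down to 4-bit weights.
for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {model_memory_gib(approx_parameters, bits):.2f} GiB")
```

Halving the number of bits per parameter halves the memory needed for the weights, which is why lower-precision formats make larger models fit on smaller GPUs.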

Now we want to use it. Let's load it into a text-generation pipeline. 

In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda')

Here we pass `device='cuda'` to explicitly place the model on the GPU. Some code auto-detects a GPU and uses it when appropriate, but a lot of code requires you to tell it explicitly. It's always a good idea to check that you're actually using the GPU, or your code could be very slow.
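
One way to check which device is available is sketched below. The helper name `pick_device` is hypothetical, and it only distinguishes between a CUDA GPU and the CPU; it falls back to the CPU if PyTorch is not installed.

```python
import importlib.util


def pick_device():
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    # Fall back to CPU if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The resulting string could then be passed as the `device` argument when building a pipeline, so the same notebook runs with or without a GPU.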

Now let's put together a query. We'll focus on information extraction use-cases where we have some text that we want to extract information from. Let's look at a simple binary classification of whether a sentence contains a drug:

In [None]:
sentence_text = 'EGFR binds to the EGFR receptor.'

query = f"{sentence_text}\n\nDoes the prior text contain a drug? Answer only Yes Or No"

Many LLMs are instruction-tuned, which means that they have been specialised to take instructions in the form of a chat. So we need to make sure our instruction is put in the right form. In this case, we don't pass the query in directly, but as a list of messages. The first message is known as a system prompt and gives the LLM instructions about its general task.

Let's create a system prompt and then put in our query. Notice how the role is `system` for the system prompt and then `user` for our message.

In [None]:
messages = [
    {"role": "system", "content": "You are a friendly chatbot"},
    {"role": "user", "content": query},
]


In [None]:
# Apply the model's chat template to see the exact text that the LLM receives
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

In [None]:
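# Generate a single new token, since we only expect Yes or No
result = generator(messages, max_new_tokens=1)
# The pipeline returns the whole conversation; the last message is the model's reply
generated_message = result[0]['generated_text'][-1]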

print(generated_message['content'])
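
For information extraction, the free-text answer usually needs to be turned into a structured value. Below is a minimal sketch of one way to do that for a Yes/No question. The helper `parse_yes_no` is a hypothetical name, and the example answers are hand-written rather than real model output.

```python
def parse_yes_no(answer):
    """Map a model's free-text answer to True, False, or None if unclear."""
    words = answer.strip().lower().split()
    # Look only at the first word, ignoring trailing punctuation.
    first_word = words[0].strip(".,!:") if words else ""
    if first_word == "yes":
        return True
    if first_word == "no":
        return False
    return None


# Hand-written example answers (not real model output).
for example in ["Yes", " no.", "Maybe"]:
    print(repr(example), "->", parse_yes_no(example))
```

Returning `None` for anything unexpected makes it easy to spot cases where the model ignored the instruction to answer only Yes or No.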

## Using an LLM through an API

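To use an LLM through an API, you need an API key. You can create a Gemini API key at https://aistudio.google.com/apikey and paste it into the cell below.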

In [None]:
GEMINI_API_KEY = ''

In [None]:
import os

# Set your Gemini API key
os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY

os.environ["GOOGLE_API_KEY"]

In [None]:
sentences[0]

In [None]:
from google.generativeai import GenerativeModel

model = GenerativeModel("gemma-3-27b-it")

prompt = "What Disease Ontology (DOID) term does GBM match to?"

response = model.generate_content(prompt)

print(response.text)
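
When extracting several pieces of information at once, it can help to ask for the answer as JSON and then parse it. The sketch below shows one possible approach. The function names, the prompt wording, and the `mock_response` string are all assumptions made for illustration; the mock response is hand-written and is not real model output.

```python
import json


def build_extraction_prompt(sentence_text):
    """Build a prompt asking for drugs and genes as a JSON object."""
    return (
        f"{sentence_text}\n\n"
        "List the drugs and genes mentioned in the prior text. "
        'Answer only with JSON of the form {"drugs": [...], "genes": [...]}'
    )


def parse_extraction_response(response_text):
    """Parse the JSON object out of a response, tolerating surrounding text."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None


print(build_extraction_prompt("Erlotinib inhibits EGFR."))

# Hand-written stand-in for a model response (not real model output).
mock_response = 'Here is the result: {"drugs": ["erlotinib"], "genes": ["EGFR"]}'
print(parse_extraction_response(mock_response))
```

In practice, the string returned by `build_extraction_prompt` would be passed to the model and the model's text response passed to `parse_extraction_response`. Models do not always return valid JSON, so the parser returns `None` rather than raising an error.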

## Task

## Optional Extras

If you've got extra time, you could try some of the following ideas: