# Large language models for information extraction (Part 1)

This hands-on session introduces the different ways to run a large language model (LLM) and explores how they can be applied to biomedical information extraction. There are two main ways to use an LLM: locally and through an API. They both have their pros and cons.

**NOTE:** If you are running this with Colab, you should make a copy for yourself. If you don't, you may lose any edits you make. To make a copy, select `File` (top-left) then `Save a Copy in Drive`. If you are not using Colab, you may need to install some prerequisites. Please see the instructions on the [Github Repo](https://github.com/Glasgow-AI4BioMed/ismb2025tutorial).

## Getting a GPU

This session will make use of the free GPU available on Google Colab. To get it, select `Edit` in the top-left, then `Notebook Settings`, select `T4 GPU` under Hardware accelerator and click `Save`.

This gets you a small GPU that is sufficient for this session. Note that you may get timed out if you leave the notebook inactive for too long.

## Getting Data

As in the previous sessions, we'll download some data that we'll use later in this tutorial with the commands below:

In [None]:
!wget -O data.zip https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/Ec8sygSj-zlAj6RXZHYGXqYBygZYM978Ts2FrTgBTfqmOQ?download=1
!unzip -qo data.zip

## Running an LLM locally

The first way we'll examine to run an LLM is locally, on the machine that you're using. This has some advantages and disadvantages. Firstly, if you are running the code on a machine you control, then you are able to use sensitive data (e.g. patient medical records) which typically can't be transmitted outside an organisation. It also gives you fine-grained control over the LLM and can enable some adjustments to the LLM. However, it potentially requires a large GPU (or GPUs) and the best performing models are not publicly available.

The size of an LLM is typically measured by its number of parameters. LLMs are largely composed of a huge number of matrix multiplications, and the count of values across all their internal matrices gives the parameter count. The largest, best-performing models have hundreds of billions of parameters, which requires several of the most expensive GPUs to run. Smaller models in the 1B to 70B range can be run on more reasonable GPUs (with a few tricks).

Hugging Face's [transformers library](https://github.com/huggingface/transformers) is the main way to get access to transformer-based models. It can also be a bit verbose so for this hands-on session, we'll turn off some of its output:

In [None]:
import transformers

transformers.logging.set_verbosity_error()

We'll use a smaller 1B model today. The code below loads up the [Falcon3-1B-Instruct model](https://huggingface.co/tiiuae/Falcon3-1B-Instruct). It may take a minute to download the model and its associated tokenizer.

*Tip: You may get warnings about HF_TOKEN not existing. These can be ignored.*

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/Falcon3-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Notably, the code above loads the model with float16. Computers typically store floating-point numbers with four bytes (known as float32), so we're asking for a smaller representation for the model (float16 = 2 bytes per number). There's lots of research showing that LLMs can work very well even when their internal parameters are compressed down to only a few bits.
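To make the size difference concrete, here is a small sketch that packs the same number at both precisions using Python's standard `struct` module. It is not part of the original notebook, and the value `0.1` is an arbitrary choice for illustration.

```python
import struct

value = 0.1

# 'f' is IEEE 754 single precision (float32), 'e' is half precision (float16)
packed32 = struct.pack("f", value)
packed16 = struct.pack("e", value)

print(len(packed32), "bytes as float32")
print(len(packed16), "bytes as float16")

# Round-tripping shows the precision lost by the smaller representation
restored32 = struct.unpack("f", packed32)[0]
restored16 = struct.unpack("e", packed16)[0]
print(f"float32 error: {abs(restored32 - value):.2e}")
print(f"float16 error: {abs(restored16 - value):.2e}")
```

The float16 copy takes half the space but carries a larger rounding error, which is the trade-off being made when loading the model.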

Let's check how many parameters there are in the model:

In [None]:
model.num_parameters()

Roughly 1.7 billion parameters. How much memory does that need (using 2 bytes per parameter)?

In [None]:
num_bytes = model.num_parameters() * 2  # named num_bytes to avoid shadowing the built-in `bytes`
gigabytes = num_bytes / (1024*1024*1024)
gigabytes

So a small model still needs over 3 gigabytes of memory. That's quite a bit of GPU memory.
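The same arithmetic can be applied to other model sizes and precisions. The sketch below is a rough estimate only: the parameter counts are round illustrative numbers rather than measurements of specific models, the helper name `estimate_weight_memory_gib` is made up for this example, and it counts the weights alone (running a model needs additional memory on top of this).

```python
GIB = 1024 ** 3

def estimate_weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed to hold the weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / GIB

# Illustrative (hypothetical) parameter counts, not measurements of specific models
model_sizes = {"1B": 1e9, "7B": 7e9, "70B": 70e9}
precisions = {"float32": 4, "float16": 2, "8-bit": 1}

for size_name, num_parameters in model_sizes.items():
    row = ", ".join(
        f"{precision}: {estimate_weight_memory_gib(num_parameters, n_bytes):6.1f} GiB"
        for precision, n_bytes in precisions.items()
    )
    print(f"{size_name:>3} -> {row}")
```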

Now we want to use it. Let's load it into a text-generation pipeline. 

In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda')

The code above explicitly tells the pipeline to use the GPU on this machine (with `device='cuda'`). Most code requires you to explicitly tell it to use the GPU. It's a good idea to check that you're actually using the GPU, or your code could be very slow.

Now let's put together a query. We'll focus on information extraction use-cases where we have some text that we want to extract information from. Let's look at a simple binary classification of whether a sentence contains a disease:

In [None]:
sentence_text = "Lung cancer is broadly classified into two main types: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC)."

query = f"{sentence_text}\n\nDoes the prior text contain a disease? Answer only Yes Or No"

Many LLMs are instruction-tuned which means that they have been specialised to take instructions in the form of a chat. So we need to make sure our instruction gets put in the right form. In this case, we don't pass the query in directly, but pass in the form of a list of messages. The first message is known as a system prompt that gives instructions to the LLM about its general task.

Let's create a system prompt and then put in our query. Notice how the role is `system` for the system prompt and then `user` for our message.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful biomedical lab scientist"},
    {"role": "user", "content": query},
]


Language models work with strings, not the data structure above. So each LLM actually has its own way of formatting a series of messages into a single string. Let's see how this LLM's tokenizer turns those messages into a string to be passed to the LLM:

In [None]:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
print(tokenizer.decode(tokenized_chat))
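To illustrate the idea of a chat template without loading a tokenizer, here is a minimal sketch that flattens a list of messages into one string. The `<|role|>` markers and the `format_chat` helper are made up for illustration; the real format for a given model is whatever its tokenizer produces.

```python
def format_chat(messages, add_generation_prompt=True):
    """Flatten a list of chat messages into one prompt string.

    The <|role|> markers are a made-up format for illustration only.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so that the model continues with its answer
        parts.append("<|assistant|>\n")
    return "".join(parts)

example_messages = [
    {"role": "system", "content": "You are a helpful biomedical lab scientist"},
    {"role": "user", "content": "Does this text contain a disease?"},
]

print(format_chat(example_messages))
```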

That formatting can actually be done behind the scenes. To use our LLM, we can pass in the messages directly. We'll set a few parameters:

- `max_new_tokens=1` : The query asked for a response of only Yes or No, so let's ask for only one token.
- `do_sample=False` : Language models produce probability distributions to pick the next word. We'll tell it to always pick the highest probability, and not sample from that distribution. That makes it deterministic.
- `return_full_text=False` : We only care about the generated text, so we'll tell it to not return the original text as well.
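The difference between greedy decoding and sampling can be sketched with a toy distribution. The probabilities below are invented for illustration, and the two helper functions are hypothetical stand-ins for what happens inside the model's generation step.

```python
import random

# Toy next-token distribution (made-up probabilities for illustration)
next_token_probs = {"Yes": 0.6, "No": 0.3, "Maybe": 0.1}

def greedy_pick(probs):
    """Always return the highest-probability token (like do_sample=False)."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw a token in proportion to its probability (like do_sample=True)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
print("greedy: ", [greedy_pick(next_token_probs) for _ in range(5)])
print("sampled:", [sample_pick(next_token_probs, rng) for _ in range(5)])
```

Greedy picking returns the same token every time, whereas sampling can vary from run to run unless the random seed is fixed.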

Let's run it on our query (and ignore any warnings you get).

In [None]:
# Asking if there is a disease mentioned in "Lung cancer is broadly classified into two main types: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC)."

result = generator(messages, max_new_tokens=1, do_sample=False, return_full_text=False)
result

Yay, it seems to have identified that there is a disease mention in that sentence (i.e. lung cancer).

Let's put those steps together into a function to make it easier to work with:

In [None]:
def run_query(query, max_new_tokens=1):
  messages = [
      {"role": "system", "content": "You are a biomedical scientist"},
      {"role": "user", "content": query},
  ]

  result = generator(messages, max_new_tokens=max_new_tokens, do_sample=False, return_full_text=False)

  generated_text = result[0]['generated_text']

  return generated_text

We'll try running the same query with a different sentence. This sentence doesn't contain a mention of a disease. Will our LLM be able to identify that?

In [None]:
sentence_text = "Penicillin's development dates back to Alexander Fleming's work in 1928."

query = f"{sentence_text}\n\nDoes the prior text contain a disease? Answer only Yes Or No"

run_query(query, max_new_tokens=30)

Excellent. It correctly identified that no disease is mentioned there.

Let's push it. We'll ask it to extract diseases from a longer passage. Let's see what it does:

In [None]:
long_text = """
Dysregulation of TP53, BRCA1, and PTEN in conjunction with aberrant expression of EGFR, KRAS, and
PIK3CA has been implicated in the pathogenesis of triple-negative breast cancer (TNBC) and non-small
cell lung carcinoma (NSCLC), particularly when exacerbated by chronic exposure to benzo[a]pyrene,
formaldehyde, and arsenic trioxide, leading to increased genomic instability, enhanced epithelial-
to-mesenchymal transition (EMT), and resistance to cisplatin, paclitaxel, and immune checkpoint
inhibitors targeting PD-1 and CTLA-4, thereby necessitating combinatorial therapeutic strategies
incorporating PARP inhibitors, MEK inhibitors, and monoclonal antibodies against VEGF, especially in
patients with co-morbidities such as type 2 diabetes mellitus, hepatitis C, and systemic lupus
erythematosus, which further modulate the tumor microenvironment through altered cytokine profiles
including IL-6, TNF-α, and TGF-β1.
"""

query = f"{long_text}\n\nExtract the diseases from the previous text."

run_query(query, max_new_tokens=100)

Very cool. It seems to have identified some of the diseases in the sentence, though it may have missed some (e.g. type 2 diabetes). This is a smaller LLM (1B parameters), which will limit it. We'll try a larger LLM later in this hands-on session.

### 📋 Task 1: Disease sentences

We'll return to the task of identifying if a sentence contains a disease. The LLM worked well for the two sentences that we used and was correct for both. However, we should benchmark its performance with some more sentences and see how it does, and if we can get it any better.

Let's load up a dataset of sentences. These are derived from the [BC5CDR dataset](https://pmc.ncbi.nlm.nih.gov/articles/PMC4860626/) which is available [here](https://ftp.ncbi.nlm.nih.gov/pub/lu/BC5CDR/).

Our subset contains 200 sentences and whether they contain a mention of a disease. See the structure below. 

In [None]:
import json

with open('data/llm_disease_sentences.json') as f:
  sentences = json.load(f)

sentences[:3]

Your task is to evaluate our LLM for the task of identifying sentences mentioning disease. Take the query below and apply it to all 200 sentences. Then count how many times its answer (e.g. saying Yes) matches the correct label in the dataset (`sentence['has_disease']`). Calculate the percentage of sentences that it gets correct. You should find that it is 60%.

```python
query = f"{sentence_text}\n\nDoes the prior text contain a disease? Answer only Yes Or No"
```

In [None]:
# Your code goes here

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
from tqdm.auto import tqdm

correct, total = 0, 0
for sentence in tqdm(sentences):

  query = f"{sentence['text']}\n\nDoes the prior text mention a disease? Answer only Yes Or No"

  result = run_query(query, max_new_tokens=1)
  prediction = (result == 'Yes')

  if prediction == sentence['has_disease']:
    correct += 1
  total += 1

print(f"{correct}/{total} = {correct/total:.1%}")
```

</details>


### Optional Extra

- You could try adjusting the query. This is known as prompt engineering. Can you improve the performance at all? If you can get the performance over 70%, you're doing well.
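One way to organise this kind of experiment is to score several prompt templates with the same loop. The sketch below is an assumption about how you might structure it: `evaluate_prompt` is a hypothetical helper, the second template is just an example variation, and `keyword_query_fn` is a crude stand-in for the LLM so that the code runs without a GPU. In the notebook you would pass `run_query` and the loaded `sentences` instead.

```python
def evaluate_prompt(template, sentences, query_fn):
    """Return the accuracy of one prompt template on labelled sentences.

    template must contain a {text} placeholder; query_fn takes a query
    string and returns the model's text response.
    """
    correct = 0
    for sentence in sentences:
        response = query_fn(template.format(text=sentence["text"]))
        # Tolerate whitespace and casing differences such as " yes"
        prediction = response.strip().lower().startswith("yes")
        correct += prediction == sentence["has_disease"]
    return correct / len(sentences)

# Stand-in for the LLM so that the sketch runs without a GPU
def keyword_query_fn(query):
    return "Yes" if "cancer" in query.lower() else "No"

toy_sentences = [
    {"text": "Lung cancer has two main types.", "has_disease": True},
    {"text": "Penicillin was discovered in 1928.", "has_disease": False},
    {"text": "Asthma is treated with inhalers.", "has_disease": True},
]

templates = [
    "{text}\n\nDoes the prior text contain a disease? Answer only Yes Or No",
    "Sentence: {text}\n\nDoes this sentence mention a disease? Answer Yes or No.",
]

for template in templates:
    accuracy = evaluate_prompt(template, toy_sentences, keyword_query_fn)
    print(f"{accuracy:.1%}  {template!r}")
```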

## Using an LLM through an API

The other way to use a large language model is through a company's API. This enables you to use much larger LLMs which require huge GPU resources. However, it may also cost money and can have some limits on how many queries you can make.

We'll use one of the free LLMs available through Google's AI Studio. There is no strong reason to use a Google LLM over one from OpenAI, Anthropic or another provider. Generally you want to benchmark different ones for your task and decide with that information. We're using Google models today as they offer some free models and getting the API key will (hopefully) be easy.

### Getting an API key

All services require you to get an API key, so let's get an API key from the AI Studio. Follow the steps below:

- Click this link: https://aistudio.google.com/apikey
- Click the blue `Create API key` button in the top-right
- It may generate an API key directly
  * OR: You may have the option to create a project to `Create API key in new project`. Select that one.
  * OR: It may ask you to `Create API key in existing project`. Select one of your existing projects and select the button.
- Copy the API key and put it into the `GEMINI_API_KEY` variable below.

**You should NOT need to enter any payment methods. We will be using the FREE LLMs available**

Do not share your API key.

In [None]:
import os

GEMINI_API_KEY = ''
assert GEMINI_API_KEY, "The remainder of this hands-on session needs a Gemini API key from Google's AI studio"

# Set your Gemini API key
os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY

## Using the API

The AI Studio offers a large number of models including Google's flagship Gemini LLMs. We'll use a large model that is available for free: `gemma-3-27b-it`. You can read more about it [here](https://deepmind.google/models/gemma/gemma-3/). It is a lot larger than the model we ran locally earlier (27 billion parameters versus 1 billion parameters). The `it` part of its name means that it has been *instruction tuned*. That means it has been trained to follow instructions, which not all language models are.

We can use the `google.genai` interface to call the model. It loads up our API key from above and uses it. All of this is then running on remote servers, so you could use a very basic computer to do this:

In [None]:
from google import genai

client = genai.Client()
    
response = client.models.generate_content(
  model="gemma-3-27b-it",
  contents="Tell a short bioinformatics joke",
)

print(response.text)

Great. It can tell jokes (as we well know LLMs can).

We can get it to do the task from earlier of identifying if a sentence mentions a disease. We'll add an extra parameter, `max_output_tokens=1`, to only get one token.

In [None]:
sentence_text = "Erlotinib is a target therapy in lung cancer."
query = f"{sentence_text}\n\nDoes the prior text mention a disease? Answer only Yes Or No"

response = client.models.generate_content(
  model="gemma-3-27b-it",
  contents=query,
  config=genai.types.GenerateContentConfig(
    max_output_tokens=1,
  )
)

print(response.text)

Looks like it got it right. We'd expect this model to be more capable and perform better than the smaller model that we ran earlier.

Prompt engineering involves adjusting the instructions given to an LLM to try to improve performance and get the outputs to align better with what we want. A powerful method for this is few-shot learning where a few examples are included in the prompt. The instructions below contain a few examples to demonstrate this idea.

In [None]:
sentence_text = "Erlotinib is a target therapy in lung cancer."

query = f"""
Examples:
Aspirin treats headaches -> Yes
The capital of France is Paris -> No

{sentence_text}

Does the prior text mention a disease? Answer only Yes Or No
"""

response = client.models.generate_content(
  model="gemma-3-27b-it",
  contents=query,
  config=genai.types.GenerateContentConfig(
    max_output_tokens=1,
  )
)

print(response.text)
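If you want to try different sets of examples, the few-shot prompt can be assembled from a list instead of being typed by hand. The sketch below uses a hypothetical helper, `build_few_shot_query`, that produces a prompt in the same layout as the one above; it only builds the string and does not call the API.

```python
def build_few_shot_query(examples, sentence_text):
    """Build a few-shot prompt from (sentence, has_disease) example pairs."""
    example_lines = [
        f"{text} -> {'Yes' if has_disease else 'No'}"
        for text, has_disease in examples
    ]
    return (
        "Examples:\n"
        + "\n".join(example_lines)
        + f"\n\n{sentence_text}\n\n"
        + "Does the prior text mention a disease? Answer only Yes Or No\n"
    )

examples = [
    ("Aspirin treats headaches", True),
    ("The capital of France is Paris", False),
]

print(build_few_shot_query(examples, "Erlotinib is a target therapy in lung cancer."))
```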

### 📋 Task 2: Extracting chemicals

The larger LLMs are much more capable of complex tasks. Let's get it to do named entity recognition for us. Create a function `extract_chemicals` that takes a string (`sentence_text`). It should run the `gemma-3-27b-it` LLM with instructions to extract a list of chemicals from the sentence text. You likely want to describe the ideal output, e.g. a comma-delimited list and nothing else. Then it will need to take the result (in `response.text`), split it appropriately (by commas if that's appropriate) and maybe remove some whitespace.

Ideally we'd like the result of `extract_chemicals("Some chemicals include aspirin, benzene and water.")` to be `['aspirin', 'benzene', 'water']`.

In [None]:
# Your code goes here

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
def extract_chemicals(sentence_text):
    prompt = f"""
<text>{sentence_text}</text>

Extract any chemicals mentioned in the sentence above. Output the names exactly as they appear in the text. Return only the chemical names, separated by a comma and nothing else.
    """

    client = genai.Client()
    
    response = client.models.generate_content(
        model="gemma-3-27b-it",
        contents=prompt,
        config=genai.types.GenerateContentConfig(
            max_output_tokens=100,
        )
    )

    chemicals = [ x.strip() for x in response.text.split(',') ]

    return chemicals
```

</details>


And here's the function call that we described in the task instructions:

In [None]:
extract_chemicals("Some chemicals include aspirin, benzene and water.")
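Splitting on commas works when the model follows the requested format, but the response can also be empty or contain stray punctuation. The sketch below shows one slightly more defensive way to parse such a response. `parse_comma_list` is a hypothetical helper, and the choices to drop duplicates and to strip a trailing full stop are assumptions that may not suit every dataset.

```python
def parse_comma_list(response_text):
    """Turn 'aspirin, benzene , water' into ['aspirin', 'benzene', 'water']."""
    if not response_text:
        # Covers both None and an empty string
        return []
    seen = set()
    items = []
    for raw_item in response_text.split(","):
        # Remove surrounding whitespace and a trailing full stop
        item = raw_item.strip().rstrip(".").strip()
        if item and item.lower() not in seen:
            seen.add(item.lower())
            items.append(item)
    return items

print(parse_comma_list(" aspirin, benzene ,water,, Aspirin."))
print(parse_comma_list(None))
```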

As before, we'd really like to benchmark this with an appropriate dataset. We'll turn to the [BC5CDR dataset](https://pmc.ncbi.nlm.nih.gov/articles/PMC4860626/) again and use their chemical annotations. We have pre-prepared a subset of sentences that contains the text of the sentence along with the text of the chemicals mentioned in that text. We can load it with the code below and see a few examples:


In [None]:
with open('data/llm_chemical_sentences.json') as f:
  sentences = json.load(f)

sentences[:3]

We can run our LLM on the first sentence with the code below. It gives us a good output that matches the expected chemicals for that sentence.

In [None]:
extract_chemicals(sentences[0]['text'])

We want to understand how our LLM does on this task. How does it compare to an NER model trained on the BC5CDR dataset? Let's examine a model similar to what we used in the earlier hands-on session that is specialised for the BC5CDR dataset.

The [Glasgow-AI4BioMed/bioner_bc5cdr](https://huggingface.co/Glasgow-AI4BioMed/bioner_bc5cdr) is a BERT-based model trained on BC5CDR. The model page provides an overview of its performance and it gets an **F1 score of 0.926** specifically for chemicals. Let's see how an LLM does when given instructions.

Let's run our model on the first 20 sentences (for time reasons). If you get a `429 RESOURCE_EXHAUSTED` error, you may want to put `time.sleep(2)` in your code as there is a request limit for this language model (30 requests per minute).

In [None]:
from tqdm.auto import tqdm
import time

extracted = []
for sentence in tqdm(sentences[:20]):
    extracted.append( extract_chemicals(sentence['text']) )
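A limit of 30 requests per minute works out to one request every 2 seconds. Rather than adding `time.sleep(2)` by hand, the spacing can be wrapped in a small generator. This is a sketch only: `throttled` is a hypothetical helper, and the demonstration uses a very short interval so that it finishes quickly.

```python
import time

def throttled(items, min_interval_seconds):
    """Yield items, waiting so that consecutive yields are spaced apart."""
    last_yield = None
    for item in items:
        if last_yield is not None:
            elapsed = time.monotonic() - last_yield
            if elapsed < min_interval_seconds:
                time.sleep(min_interval_seconds - elapsed)
        last_yield = time.monotonic()
        yield item

# 30 requests per minute corresponds to one request every 2 seconds.
# A tiny interval is used here so that the demonstration finishes quickly.
start = time.monotonic()
for item in throttled(["a", "b", "c"], min_interval_seconds=0.05):
    print(f"{time.monotonic() - start:.2f}s: {item}")
```

In the loop above, you could iterate over `throttled(sentences[:20], 2)` to stay under the limit.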

Now we need to compare the chemicals that the LLM extracted with the correct answers. The code below calculates the true positives, false positives and false negatives.

In [None]:
TP,FP,FN = 0,0,0
for sentence,predictions in zip(sentences,extracted):
    TP += len(set(sentence['chemicals']).intersection(predictions)) # Number of matching chemicals in LLM predictions and correct answers
    FN += len(set(sentence['chemicals']).difference(predictions)) # Number of missing chemicals in LLM predictions
    FP += len(set(predictions).difference(sentence['chemicals'])) # Number of extra chemicals in LLM predictions (that aren't correct)
    
print(f"{TP=} {FP=} {FN=}")

Those counts are a little hard to interpret. We can get some of the standard machine learning metrics (precision, recall and F1 score). Recall that the [Glasgow-AI4BioMed/bioner_bc5cdr](https://huggingface.co/Glasgow-AI4BioMed/bioner_bc5cdr) model has an F1 score of 0.926 for this task. While we're only working on a subset (so not directly comparable), let's see if we're in the same rough neighbourhood.

In [None]:
precision = TP/(TP+FP) if (TP+FP) > 0 else 0
recall = TP/(TP+FN) if (TP+FN) > 0 else 0
f1 = 2*(precision*recall)/(precision+recall) if (precision+recall) > 0 else 0

print(f"{precision=:.3f} {recall=:.3f} {f1=:.3f}")

Well that is much lower than the trained [BERT model](https://huggingface.co/Glasgow-AI4BioMed/bioner_bc5cdr).
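When reading these numbers, it helps to remember that the comparison is a set-based exact string match. The worked example below uses made-up gold and predicted lists to show how the three metrics are derived, and how a difference in casing alone counts as a miss. The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compute set-based precision, recall and F1 for one list of mentions."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1

# Made-up example: two of three predictions are right, and one gold chemical is missed
toy_gold = ["aspirin", "benzene", "water"]
toy_predicted = ["aspirin", "water", "ethanol"]

p, r, f = precision_recall_f1(toy_gold, toy_predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")

# Matching is by exact string, so a casing difference counts as a miss
print(precision_recall_f1(["Aspirin"], ["aspirin"]))
```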

A few takeaways:

- If you have sufficient data, training a model to do information extraction tasks (e.g. named entity recognition) will generally outperform large language models that are given instructions
- Few-shot prompting (where you give it examples) may help you improve performance
- But if you have no (or very little) data, an LLM can be a great way to get started on extracting entity mentions, though it should still be benchmarked as early as possible.

## LLMs and Ontologies

An important task we covered earlier was entity linking: deciding which entity is being referred to by a mention. For instance, "GBM" generally refers to the cancer "glioblastoma multiforme" in research articles. Specifically, this is entry [DOID:3068](http://www.disease-ontology.org/?id=DOID:3068) in the [Disease Ontology](https://disease-ontology.org) - the resource we used earlier for a list of diseases.

It would be wonderful if an LLM could point us directly to the correct entry in the Disease Ontology for GBM. Let's see how `gemma-3-27b-it` does:

In [None]:
query = "What Disease Ontology (DOID) term does GBM match to?"

response = client.models.generate_content(
  model="gemma-3-27b-it",
  contents=query
)

print(response.text)

Most likely, it will not have given you DOID:3068 as the correct identifier. It will likely have hallucinated an identifier.

LLMs are not good at recalling specific details (e.g. the identifier for an ontology that it may or may not have seen during training). It's better to include all the information it needs in the instructions.

The example below shows another approach to entity linking where a short-list of candidates has already been identified. The LLM is asked to pick which candidate PC4 refers to in the initial sentence. The correct answer is 2.

In [None]:
query = """
Sentence: Overexpression of PC4 enhances DNA repair efficiency by promoting the recruitment of repair factors to sites of double-strand breaks, highlighting its critical role in genome stability.

1. Silicon phthalocyanine 4 (chemical) A synthetic photosensitizer agent containing a large macrocyclic ring chelated with silicon.
2. SUB1 (gene/protein). Involved in mediating transcription induced by upstream activators. Also known as PC4.
3. Pachyonychia Congenita 4 (disorder). A rare genetic skin disorder caused by mutations in the KRT16 gene
4. Pericardium 4. An acupuncture point on the forearm

What does PC4 refer to in the initial sentence. Answer with only the corresponding number.
"""

response = client.models.generate_content(
  model="gemma-3-27b-it",
  contents=query
)

print(response.text)

For this sentence, it normally gets it correct. We'd need to benchmark with an appropriate dataset, but at least it didn't hallucinate ontology IDs this time.
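To benchmark this approach on many mentions, the prompt would need to be built from a candidate list and the answer parsed back into a number. The sketch below shows one possible way to do both. The helpers `build_linking_query` and `parse_choice` are hypothetical, the candidate descriptions are shortened from the example above, and no API call is made.

```python
import re

def build_linking_query(sentence, mention, candidates):
    """Build a numbered multiple-choice prompt for one entity mention."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, start=1))
    return (
        f"Sentence: {sentence}\n\n{numbered}\n\n"
        f"What does {mention} refer to in the initial sentence? "
        "Answer with only the corresponding number.\n"
    )

def parse_choice(response_text, num_candidates):
    """Return the chosen 1-based index, or None if no valid number is found."""
    match = re.search(r"\d+", response_text or "")
    if match is None:
        return None
    choice = int(match.group())
    return choice if 1 <= choice <= num_candidates else None

candidates = [
    "Silicon phthalocyanine 4 (chemical)",
    "SUB1 (gene/protein). Also known as PC4.",
]
print(build_linking_query("Overexpression of PC4 enhances DNA repair.", "PC4", candidates))
print(parse_choice("2\n", len(candidates)))
print(parse_choice("I am not sure", len(candidates)))
```

Returning `None` for an unparseable or out-of-range answer makes it easy to count those cases separately from wrong answers.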

## The Power of LLMs

The real power of LLMs is the diversity of tasks that can be thrown at them. Let's look at some other tasks:

In [None]:
query = """
Summarize the abstract below into a lay summary of 2-3 sentences.

Deep learning models that predict functional genomic measurements from DNA sequence are powerful tools for deciphering the genetic regulatory code. Existing methods trade off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, which takes as input 1 megabase of DNA sequence and predicts thousands of functional genomic tracks up to single base pair resolution across diverse modalities – including gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage, and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest respective available external models on 24 out of 26 evaluations on variant effect prediction. AlphaGenome’s ability to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically-relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.
"""

response = client.models.generate_content(
  model="gemma-3-27b-it",
  contents=query
)

print(response.text)

The above code asked for a lay summary of an abstract. None of the previous methods we've explored would be able to help with this. It's a more challenging task to benchmark (i.e. what makes a good summary) but the LLM certainly seems to have had a good go at it.

Another power of LLMs is the ability to output in specific formats. We will switch over to the larger `gemini-2.5-flash` model that is more powerful, but has stricter rate limits on it.

Take a look at the instructions below which ask for JSON outputs:

In [None]:
query = "Output five pairs of drugs and their protein targets. Output a JSON list of dictionaries where each dictionary has two keys: 'drug' and 'target'."

response = client.models.generate_content(
  model="gemini-2.5-flash", # Switch to bigger model (with stricter rate limits)
  contents=query,
)

print(response.text)


Brilliant. In this case, it followed the instructions well and the drugs and targets are output correctly in JSON that could be easily extracted. However, sometimes it may make mistakes such as outputting invalid JSON or adding extra fields. We can actually constrain the LLM to follow a specific format.

The code below adds a `response_schema` configuration that constrains the output to a list of objects that match the fields found in the `DrugsAndTargets` class. This means the LLM's output should follow the desired format. Notice that the format isn't even really explained in the instructions (though it may still be useful to explain any nuances).

In [None]:
from pydantic import BaseModel

class DrugsAndTargets(BaseModel):
  drug: str
  target: str

query = "Output five pairs of drugs and their protein targets. Use a JSON format."

response = client.models.generate_content(
  model="gemini-2.5-flash",
  contents=query,
  config={
    "response_mime_type": "application/json",
    "response_schema": list[DrugsAndTargets],
  },
)

# Use the response as a JSON string.
print(response.text)
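Whichever way the JSON is produced, it is worth checking it before using it downstream. The sketch below parses a response string with the standard library and verifies that every record has exactly the `drug` and `target` keys. `parse_drug_targets` is a hypothetical helper, the example string is hand-written rather than real model output, and the handling of a Markdown code fence is an assumption about how unconstrained output sometimes looks.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker

def parse_drug_targets(response_text):
    """Parse a JSON list of {'drug': ..., 'target': ...} records, or raise ValueError."""
    text = response_text.strip()
    # Unconstrained models sometimes wrap JSON in a Markdown code fence
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    records = json.loads(text)  # raises json.JSONDecodeError (a ValueError) if invalid
    if not isinstance(records, list):
        raise ValueError("Expected a JSON list")
    for record in records:
        if not isinstance(record, dict) or set(record) != {"drug", "target"}:
            raise ValueError(f"Unexpected record: {record!r}")
        if not all(isinstance(v, str) for v in record.values()):
            raise ValueError(f"Non-string value in: {record!r}")
    return records

# Hand-written example in the expected format (not real model output)
example = '[{"drug": "imatinib", "target": "BCR-ABL"}]'
print(parse_drug_targets(example))
print(parse_drug_targets(FENCE + "json\n" + example + "\n" + FENCE))
```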


### 📋 Task 3: Open information extraction

Another huge strength of large language models is their abilities in open information extraction where we want structured data, but we're not entirely sure of the structure or relation labels or other factors. Let's use an LLM to extract knowledge triples that contain a subject, a relation and an object. For instance, `"paracetamol treats headache"` could be represented as `{"subject":"paracetamol", "relation":"treats", "object": "headache"}`.

The final task is to instruct `gemini-2.5-flash` to extract knowledge triples from the long paragraph below. You want the output to be a list of triples in the JSON format above (with subject, relation and object keys). Try using the `response_schema` approach to force the LLM to output in that specific format.

In [None]:
paragraph = """
Upon epidermal growth factor (EGF) binding, the epidermal growth factor receptor (EGFR) undergoes
dimerization and autophosphorylation, creating docking sites for adaptor proteins like GRB2, which
subsequently recruits SOS1 to activate RAS. Activated RAS promotes RAF1 activation, leading to a
phosphorylation cascade involving MEK1 and ERK1/2. Phosphorylated ERK1/2 translocates to the
nucleus, where it modulates gene expression by phosphorylating transcription factors such as ELK1.
"""

In [None]:
## Your code goes here

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
class Relation(BaseModel):
  subject: str
  relation: str
  object: str

query = f"{paragraph}\n\nExtract knowledge triples from the text above in JSON format."

response = client.models.generate_content(
  model="gemini-2.5-flash",
  contents=query,
  config={
    "response_mime_type": "application/json",
    "response_schema": list[Relation],
  },
)

print(response.text)
```

</details>


Fantastic. You could provide more detailed instructions about what constitutes subjects and objects, or even limit the types of relations you are interested in. But this gives an idea of how an LLM can give controlled structured outputs. Of course, you would still want some form of human evaluation to help understand whether the performance is good enough for what you want.
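Once the triples are in JSON, they can be loaded and reorganised with ordinary Python. The sketch below groups triples by subject to give a simple graph-like structure. The `response_text` here is a hand-written example of the expected format, not actual output from the model, and `triples_to_graph` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Hand-written example of the expected output format (not real model output)
response_text = """[
  {"subject": "EGF", "relation": "binds", "object": "EGFR"},
  {"subject": "GRB2", "relation": "recruits", "object": "SOS1"},
  {"subject": "SOS1", "relation": "activates", "object": "RAS"},
  {"subject": "RAS", "relation": "activates", "object": "RAF1"}
]"""

def triples_to_graph(triples):
    """Group triples into {subject: [(relation, object), ...]}."""
    graph = defaultdict(list)
    for triple in triples:
        graph[triple["subject"]].append((triple["relation"], triple["object"]))
    return dict(graph)

graph = triples_to_graph(json.loads(response_text))
for subject, edges in graph.items():
    for relation, obj in edges:
        print(f"{subject} --{relation}--> {obj}")
```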

## 📌 Main Takeaways

- For well-defined tasks (e.g. named entity recognition), trained models (especially using BERT-based approaches) generally outperform large language models that have been given instructions
- Large language models can do a vast number of tasks that would be challenging for BERT-based approaches (e.g. summarization)
- LLMs can follow very detailed instructions, take diverse inputs and give very specific outputs
- Human evaluation remains key to understanding the strengths and weaknesses of any approach

## 🏁 End of Hands-on Session

And that brings us to the end of the session. You've learned about:

- Getting a GPU from Google Colab
- Running an LLM locally and using it to classify sentences
- Using an LLM through an API
- Applying an LLM for named entity recognition and its performance compared to a trained BERT model
- The challenge of LLMs for entity linking where they may hallucinate things
- Getting LLMs to output to very specific formats (e.g. a specific JSON structure)

## 🧰 Optional Extras

If you've got extra time, you could try some of the following ideas:

- Try different models (e.g. the smaller gemma models) and see what effect they have. The list of models available through Google's AI Studio can be found at: https://ai.google.dev/gemini-api/docs/models
- Learn about the text generation parameters including `top_p`, `top_k` and `temperature`. Check out: https://huggingface.co/blog/how-to-generate
- Read about prompt engineering and try out some techniques: https://www.promptingguide.ai/