# Biomedical LLMs — Practical Example with BioGPT and PubMedBERT

This notebook demonstrates in a **simple and educational** manner how to use two pre-trained biomedical language models with **Hugging Face**:

- **BioGPT** → text generation (medical answers or explanations)  
- **PubMedBERT** → understanding and classification of biomedical text

We will consistently follow the logical flow:  
**Text → Tokenizer → Model → Readable Output**

In [1]:
# !pip install transformers torch datasets evaluate tqdm ipywidgets
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification, pipeline
import torch

print("Libraries successfully imported.")

Libraries successfully imported.


## What is a Biomedical Model

**Biomedical LLMs** (Large Language Models for clinical-scientific texts) are Transformer models pre-trained on biomedical corpora such as **PubMed abstracts, medical articles, and clinical data**.  
They can:
- understand and classify medical sentences,  
- generate coherent scientific text,  
- answer biomedical questions.  

We will now look at two examples:
1. **BioGPT** for generating a textual response,  
2. **PubMedBERT** for classifying medical text.

## Example with BioGPT (text generation)

BioGPT is a *causal* (autoregressive) language model built on the GPT-2 architecture and pre-trained on PubMed abstracts.  
It is capable of **generating coherent and plausible text** based on a medical prompt.
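
To make "causal" concrete, here is a minimal sketch of left-to-right generation: the next token is chosen from what has been produced so far, appended, and the step repeats. The `NEXT_TOKEN_SCORES` table and the `generate_greedy` helper are invented for illustration and are not part of BioGPT. A real causal model conditions on all preceding tokens and scores the whole vocabulary, whereas this toy looks only at the last token.

```python
# Toy next-token table: invented for illustration, NOT derived from BioGPT.
NEXT_TOKEN_SCORES = {
    "BRCA1": {"is": 0.9, "repair": 0.1},
    "is": {"involved": 0.7, "a": 0.3},
    "involved": {"in": 1.0},
    "in": {"DNA": 0.8, "tumor": 0.2},
    "DNA": {"repair": 0.95, "is": 0.05},
}

def generate_greedy(prompt_tokens, max_length):
    """Append the highest-scoring next token until max_length (prompt included) is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not scores:
            # No known continuation for the last token: stop early.
            break
        tokens.append(max(scores, key=scores.get))
    return tokens

print(" ".join(generate_greedy(["BRCA1"], max_length=6)))
# prints: BRCA1 is involved in DNA repair
```

As in the toy, a length limit on the total sequence counts the prompt tokens as well as the newly generated ones.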

#### Objective
Generate a brief medical description or explanation from an input.

In [2]:
# Load the tokenizer and the BioGPT model
tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT")
model = AutoModelForCausalLM.from_pretrained("microsoft/BioGPT")

# Input prompt — the model will continue this biomedical text
prompt = (
    "The role of the BRCA1 gene in breast cancer is a critical topic in oncology research. "
    "Several studies have shown that BRCA1 is involved in DNA repair and tumor suppression. "
    "In particular,"
)

# Tokenization → converts text into numerical tensors
inputs = tokenizer(prompt, return_tensors="pt")

print("Example of input_ids:", inputs['input_ids'][0][:15])

Example of input_ids: tensor([    2,    18,   151,     5,     6, 10319,   131,    10,   502,   101,
           21,    14,   842,  6404,    10])


#### Interpretation 

The **tokenizer** has transformed the input text into a numerical representation that is readable by the model.

- **`input_ids`** → the sequence of integer IDs, one per token (a whole word or a subword piece).  
  Each ID is an index into BioGPT's vocabulary.  

In summary:
> The tokenizer is the bridge between human language and computational language —  
> it converts words into numbers, allowing the model to compute relationships and context.
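
The text-to-IDs mapping can be sketched with a toy subword tokenizer. Everything here is an assumption made for illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_decode` are hypothetical, and the greedy longest-match rule is a simplification rather than the algorithm BioGPT's tokenizer actually uses.

```python
# Toy illustration only: this vocabulary is invented and is NOT BioGPT's real one.
TOY_VOCAB = {"<unk>": 0, "brca": 1, "1": 2, "gene": 3, "dna": 4, "repair": 5, "the": 6}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}

def toy_tokenize(text):
    """Lowercase, split on whitespace, then greedily match the longest known piece."""
    ids = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest substring first, shrinking until a vocabulary entry matches.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in TOY_VOCAB:
                    ids.append(TOY_VOCAB[piece])
                    start = end
                    break
            else:
                # No known piece starts here: emit <unk> and skip one character.
                ids.append(TOY_VOCAB["<unk>"])
                start += 1
    return ids

def toy_decode(ids):
    """Map IDs back to their token strings, separated by spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

ids = toy_tokenize("The BRCA1 gene")
print(ids)              # prints: [6, 1, 2, 3]
print(toy_decode(ids))  # prints: the brca 1 gene
```

Note how the unseen word "BRCA1" is split into two known pieces instead of being dropped; subword vocabularies handle rare biomedical terms in a similar spirit.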

In [3]:
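# Biomedical Text Generation
# Note: temperature and top_p only take effect when do_sample=True is also passed.
# Without it, generate() decodes deterministically and ignores both (see the warning in the output).
output_tokens = model.generate(
    **inputs,               # pass the tokenized inputs to the model
    max_length=120,         # maximum total length in tokens (prompt + generated text)
    temperature=0.8,        # rescales the logits before sampling → higher = more diverse output
    top_p=0.9               # nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches 0.9
)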

# Decoding → converts numeric IDs back to human-readable text
generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("""\nGenerated text from BioGPT:\n""")
print(generated_text)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Generated text from BioGPT:

The role of the BRCA1 gene in breast cancer is a critical topic in oncology research. Several studies have shown that BRCA1 is involved in DNA repair and tumor suppression. In particular, BRCA1 is involved in homologous recombination (HR), a process that is essential for the repair of DNA double-strand breaks (DSBs).


#### Interpretation
- The **input** is a medical prompt (“The role of the BRCA1 gene…”).  
- The **tokenizer** converts it into numerical IDs for the model.  
- The **BioGPT model** generates a plausible textual continuation.  
- The **output** is coherent biomedical text. The same generate-and-decode step underlies *question answering* and *text summarization* applications, which usually rely on a model fine-tuned for that task.

> BioGPT does not truly “understand” the content, but generates text based on learned linguistic patterns.
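
The `temperature` and `top_p` arguments in the generation call control how a next token is drawn when sampling is enabled. The following is a minimal sketch of that idea on a hypothetical four-token distribution. The candidate words, the logits, and the helpers `softmax` and `top_p_filter` are assumptions made for illustration, not the library's internal implementation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; dividing by temperature sharpens (<1) or flattens (>1) them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Hypothetical next-token candidates and logits (not taken from BioGPT).
candidates = ["involved", "associated", "mutated", "banana"]
logits = [3.0, 2.0, 1.0, -2.0]

probs = softmax(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.9)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
choice = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

print("kept tokens:", [candidates[i] for i in nucleus])
print("sampled:", candidates[choice])
```

With these made-up logits, `temperature=0.8` and `top_p=0.9` leave only the two most likely candidates in the nucleus; raising the temperature to 2.0 flattens the distribution enough that a third candidate is kept.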

## Example with PubMedBERT (Text Classification)

PubMedBERT is a BERT-based model, trained from scratch on scientific texts from PubMed.  
Here, we use it to **classify** a biomedical text in a straightforward manner.

In [4]:
# Create a classification pipeline using PubMedBERT
classifier = pipeline("text-classification", model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

text = "Aspirin is used to treat cardiovascular diseases and prevent heart attacks."

# Performs tokenization, model inference, and post-processing (softmax + label lookup) automatically
result = classifier(text)

# Interpretation
label = result[0]['label']
score = result[0]['score']

# Demonstrative label mapping (example only: the classification head is untrained, so these labels carry no real meaning)
label_map = {"LABEL_0": "Non disease-related", "LABEL_1": "Disease-related"}
mapped_label = label_map.get(label, label)

print(f"\nPredicted label: {mapped_label}")
print(f"Model confidence: {score:.2f}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu



Predicted label: Disease-related
Model confidence: 0.52


#### Interpretation
- The **input** is a biomedical sentence.  
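- The **pipeline** automatically applies tokenization, inference, and post-processing (softmax and label lookup).  
- The **output** returns a label and a score: the softmax probability of the predicted class. Here the score of 0.52 is close to the 0.50 chance level for two classes, because the classification head was newly initialized (see the warning in the cell output).  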

> This flow simplifies the practical use of biomedical LLMs, without the need to write additional code for tensors or tokens.

##### Note on the BiomedNLP-PubMedBERT Model

The model **`microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`** was pre-trained on biomedical texts (abstracts and full texts from PubMed).  
However, it is **not fine-tuned** for any classification task, including "disease-related vs non-disease". As the warning in the cell output shows, the classification layer (`classifier.weight`, `classifier.bias`) is newly initialized with random weights.  

To achieve accurate results for this type of task, it is necessary to perform **supervised fine-tuning**, that is, to train the model on a labeled dataset with the two desired classes.  
Without fine-tuning, the predicted label and score are **essentially arbitrary** (note the 0.52 confidence, barely above chance) and should not be read as a meaningful estimate.
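
The warning in the cell output reports that `classifier.weight` and `classifier.bias` were newly initialized. A minimal sketch of why such a head tends to produce scores near 0.5 for two classes is shown here. Both pairs of logits are hypothetical values chosen for illustration, not numbers read from the model.

```python
import math

def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: small, nearly equal values, as an untrained head might produce.
untrained_logits = [-0.04, 0.04]
# Hypothetical logits after fine-tuning on labeled data: clearly separated.
finetuned_logits = [-2.5, 2.5]

for name, logits in [("untrained", untrained_logits), ("fine-tuned", finetuned_logits)]:
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    print(f"{name}: LABEL_{best} score={probs[best]:.2f}")
# prints:
# untrained: LABEL_1 score=0.52
# fine-tuned: LABEL_1 score=0.99
```

A score this close to 0.50 says little about the sentence itself; which label wins depends mostly on the random initialization.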

## Conclusion

We have seen two modes of use:
1. **BioGPT** → generation of coherent biomedical text.  
2. **PubMedBERT** → classification or understanding of biomedical sentences (after task-specific fine-tuning).  

### General Flow:
```
Text → Tokenizer → Model → Readable Output
```

### Key Points to Remember
- `from_pretrained()` automatically loads architecture and weights.  
- `AutoTokenizer` converts text into numerical input.  
- `pipeline()` combines all steps into a single command.

Now you can experiment with other tasks: *NER*, *question answering*, *summarization* — all following the same logic.