## BME i9400
## Fall 2024
### Transformers and Large Language Models

## Goals of this Lecture

1. Understand the fundamentals of the transformer architecture.
2. Learn how the attention mechanism works and why it is central to transformers.
3. Gain intuition about why transformers are so effective.
4. Explore how transformers are used to build large language models through pretraining and fine-tuning.

## Some fantastic resources for understanding the Transformer architecture
- Wrapping your head around transformers will take multiple passes. However, it is well worth the effort!
- Some resources that I found particularly helpful:
    - [Transformer Explainer](https://poloclub.github.io/transformer-explainer/)
    - [Understanding self-attention](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html)
    - [3Blue1Brown video on transformers](https://www.youtube.com/watch?v=eMlx5fFNoYc)

## The classic picture of the Transformer architecture
- This is the picture that appears in the original paper by Vaswani et al. (2017).
    - Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
- Despite appearing in the paper, it is not a good starting point for understanding transformers.
- In fact the original paper is not a good resource for learning the architecture.
- In the paper, transformers are mainly posed as a solution to machine translation: converting a sentence in one language to another.
<img src="transformer.webp" width="600" height="600">

## What is a Transformer?
- *Transformers* are a type of model architecture that has revolutionized natural language processing (NLP).
- Key innovation: Self-attention mechanism for processing sequences.
    - Replaced the recurrent neural networks (RNNs) and convolutional neural networks (CNNs) previously used in NLP.
- Enabled state-of-the-art performance in:
    - Translation
    - Summarization
    - Question answering


## Transformer Architecture Overview
- Composed of an encoder and a decoder
    - Encoder: Processes input sequences (e.g., an English sentence)
    - Decoder: Generates output sequences (e.g., the French translation)
- Core components:
	1.	Self-Attention: Captures relationships between words in a sequence.
	2.	Feedforward Layers: Nonlinear transformations for learning representations.
	3.	Positional Encoding: Adds order information to the sequence.
- Stacking many layers of attention and feedforward modules lets the model build increasingly abstract representations of the sequence.

## Why Attention?

- Attention allows the model to focus on relevant parts of the sequence.
- Example: In the sentence “The cat sat on the mat,” the word “cat” is strongly related to “sat.”
- Attention weights capture these relationships.
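- Because every token attends directly to every other token, self-attention captures long-range dependencies better than recurrent models, which pass information along the sequence one step at a time.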

## Tokenization, Embedding, Positional Encoding
- To start, we convert the input sentence into a sequence of tokens
- These tokens are initially represented by integers
- We then convert these integers into dense vectors called *embeddings*
- To capture the order of the tokens, we add *positional encodings* to the embeddings
    - In the original Transformer, the positional encoding is a fixed sine-cosine function of the token's position in the sequence
    - These sine-cosine functions are literally added to the embeddings
<img src="embedding.png" width="900">
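
The tokenize, embed, and add-position steps can be sketched in NumPy. This is a minimal illustration, assuming the sinusoidal formula from the original Transformer paper; the whitespace tokenizer, the toy `vocab`, `d_model = 8`, and the random `embedding_table` are hypothetical stand-ins for a real subword tokenizer and learned embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine-cosine position codes (d_model even)."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # even dimension indices 0, 2, 4, ...
    angles = positions / (10000 ** (dims / d_model))     # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Hypothetical toy vocabulary; real models use subword tokenizers
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
token_ids = [vocab[w] for w in "the cat sat on the mat".split()]

d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in a real model

embeddings = embedding_table[token_ids]                    # one row per token
x = embeddings + sinusoidal_positional_encoding(len(token_ids), d_model)
print(x.shape)  # (6, 8)
```

Note that the two occurrences of "the" share the same embedding row but end up with different vectors in `x`, because their positions differ.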

## The Attention Mechanism
1. Input: A sequence of word embeddings.
2. Query, Key, Value (Q, K, V):
    - Represent each word with three vectors, obtained by multiplying its embedding by three learned weight matrices.
3. Attention Scores:
    - Compute a score for every query-key pair, where $d_k$ is the dimension of the key vectors:

___
$\text{Score}(Q, K) = \frac{QK^\top}{\sqrt{d_k}}$
___

4. Apply softmax to normalize the scores into attention weights.
5. Compute the context vector as a weighted sum of the values:
    - Combine the values (V), weighted by the softmax-normalized attention scores.

**The context vector is the output of the attention mechanism.**
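
The five steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a trained model: the input `X` and the projection matrices `W_q`, `W_k`, `W_v` are random placeholders for values that would normally be learned.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (context, weights) for Q, K of shape (n, d_k) and V of shape (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every query scored against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    context = weights @ V                # weighted sum of the value vectors
    return context, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))        # stand-in word embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))  # random, not learned

context, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(context.shape)  # (4, 8)
```

Row `i` of `weights` says how much word `i` draws on each word in the sequence, and row `i` of `context` is the resulting blend of value vectors.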


## Multi-Head Attention
-   Splits attention into multiple “heads” for diverse perspectives.
-   Each head independently computes attention, focusing on different relationships.
-   Outputs are concatenated and linearly transformed.

___ 
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots)W^O$

___ 
<img src="QKV.png" width="900">
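
A rough NumPy sketch of the multi-head formula follows. It assumes the common layout in which one `(d_model, d_model)` projection is split evenly across heads; the weights are random placeholders, and the function name `multi_head_attention` is chosen here for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores) @ V                             # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1, ..., head_h)
    return concat @ W_o                                     # final linear map W^O

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 16)
```

Each head works in a smaller subspace of size `d_model / num_heads`, so the output keeps the same shape as the input regardless of the number of heads.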


## Masked Self-Attention
- Attention Score
    - The dot product between every query-key pair is computed; it represents how strongly the query is related to the key.
- Masking
    - A "mask" is applied to the matrix of attention scores so that the model is not able to see the future words (this would be cheating in the context of language modeling).
- Softmax
    - Attention scores are converted into probabilities using the softmax function.
    - The resulting matrix elements represent how strongly each word relates to itself and to the words to its left.
<img src="attention.png" width="900" >
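
The score, mask, softmax sequence can be illustrated as follows. This sketch assumes the usual implementation trick of setting future positions to negative infinity before the softmax so that they receive exactly zero weight; the `scores` matrix is random and stands in for the scaled query-key dot products.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions in an (n, n) score matrix, then softmax each row."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)           # future keys get -infinity
    masked = masked - masked.max(axis=-1, keepdims=True) # stabilize; diagonal is never masked
    e = np.exp(masked)                                   # exp(-inf) evaluates to 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))       # stand-in for Q K^T / sqrt(d_k)
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

The printed matrix is lower triangular: the first word can attend only to itself, the second to the first two words, and so on.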

## Output of self-attention block
- The (softmax-ed) self-attention scores are multiplied by the value matrix to produce the output of the self-attention block.

## Feedforward Layer
- The context vectors from the self-attention block are passed through a feedforward layer.
- The feedforward layer applies a nonlinear transformation to each context vector, refining the representation used to predict the next word.
- After the final layer, a linear projection followed by a softmax yields a vector of probabilities, one for each token in the vocabulary.
- The token with the highest probability is the predicted next word.
<img src="mlp.png" width="900">
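
The path from context vectors to a next-word prediction might look like the sketch below. It assumes a two-layer feedforward network with a ReLU activation (as in the original Transformer) and a separate projection onto the vocabulary; all weights are random, and the five-word `vocab` is hypothetical, so the predicted word is meaningless here.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feedforward layer: linear -> ReLU -> linear."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def next_token_probabilities(h_last, W_vocab):
    """Project the last position's vector onto the vocabulary and apply softmax."""
    logits = h_last @ W_vocab
    logits = logits - logits.max()       # numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
vocab = ["the", "cat", "sat", "on", "mat"]   # hypothetical toy vocabulary

context = rng.normal(size=(3, d_model))      # stand-in context vectors for 3 tokens
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
W_vocab = rng.normal(size=(d_model, len(vocab)))

h = feed_forward(context, W1, b1, W2, b2)
probs = next_token_probabilities(h[-1], W_vocab)   # only the last position predicts the next word
print(vocab[int(probs.argmax())], probs.round(3))
```

Taking the `argmax` corresponds to greedy decoding; sampling from `probs` instead gives more varied text.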

## Why Are Transformers So Effective?
-   Parallelization: Processes entire sequences simultaneously (vs. sequentially in RNNs).
-   Scalability: Handles very large datasets and model sizes.
-   Representation Power: Captures complex relationships with self-attention.
-   Versatility: Works across various tasks with minimal architecture changes.

## How Are LLMs Built?

-   Pretraining: train on massive corpora (e.g., books, biomedical papers).
-   Pretraining tasks:
    -   Masked language modeling (BERT).
	-   Autoregressive prediction (GPT).
-   Fine-Tuning:
    - Adapt pretrained models to specific tasks with smaller datasets (e.g., question answering).
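
The two pretraining tasks differ mainly in how training examples are built from raw text. The sketch below is a simplified illustration using whitespace tokens; the helper names are hypothetical, and the masking rule is deliberately cruder than BERT's actual procedure, which also sometimes substitutes random tokens or leaves selected tokens unchanged.

```python
import random

def autoregressive_pairs(tokens):
    """GPT-style: each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style: hide a random subset of tokens; the model must recover them."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)      # prediction target at a masked position
        else:
            inputs.append(tok)
            labels.append(None)     # no loss is computed at unmasked positions
    return inputs, labels

tokens = "the heart is a muscular organ".split()

for prefix, target in autoregressive_pairs(tokens):
    print(prefix, "->", target)

masked_inputs, masked_labels = masked_lm_example(tokens, mask_prob=0.3)
print(masked_inputs)
print(masked_labels)
```

The autoregressive model only ever sees words to the left of its target, while the masked model sees context on both sides of each hidden word.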

## Transformers in Biomedical Engineering
-	Literature mining: Extract disease-gene associations from PubMed abstracts.
-   Clinical notes analysis: Summarize patient records or identify key trends.
-	Drug discovery: Predict drug-target interactions or design new molecules.

## Hands-on demo with the Hugging Face Transformers library

In [9]:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load a pretrained masked language model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Encode a sentence with a masked token
text = "The heart is a [MASK] organ."
inputs = tokenizer(text, return_tensors="pt")

# Display the token IDs
print("Input IDs:", inputs.input_ids)

Input IDs: tensor([[ 101, 1996, 2540, 2003, 1037,  103, 5812, 1012,  102]])


In [10]:
# Generate the model's prediction for the masked token
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Get predicted token for the masked position
predictions = outputs.logits
mask_index = inputs.input_ids[0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = predictions[0, mask_index].argmax(dim=-1).item()
print("Predicted Token:", tokenizer.decode([predicted_token_id]))

Predicted Token: cardiac


In [11]:
# Get the top 5 predictions
top_k = 5
top_token_ids = predictions[0, mask_index].topk(top_k).indices.tolist()
top_predictions = [tokenizer.decode([token_id]) for token_id in top_token_ids]
print("Top Predictions:", top_predictions)

Top Predictions: ['cardiac', 'large', 'major', 'living', 'mechanical']


## Finally, what you've all been waiting for

## Install the OpenAI Python package
```! pip install openai```

## Load the API key

In [2]:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv(dotenv_path="/Users/jacekdmochowski/.env")  # Adjust the path if your .env file is elsewhere

# Access the API key
openai_api_key = os.getenv("OPENAI_API_KEY")

## Talk to the GPT-4o model

In [7]:
from openai import OpenAI
client = OpenAI(api_key=openai_api_key)  # use the key loaded from the .env file

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert forecaster of the future."},
        {
            "role": "user",
            "content": "What will a second Trump presidency mean for the future of biomedical engineering?"
        }
    ]
)

In [8]:
response = completion.choices[0].message.content
# Print the response; the string already contains real line breaks
print(response)

As an expert in forecasting, I can provide some insights based on trends and historical data up to October 2023. However, predicting the future with absolute certainty is impossible. Should a second Trump presidency occur, several factors could potentially influence the field of biomedical engineering:

1. **Regulatory Environment**: Historically, Republican administrations, including the first Trump presidency, focused on deregulation. A second Trump presidency might accelerate the approval process for biomedical innovations, possibly favoring industry growth but raising concerns about safety and efficacy standards.

2. **Funding and Investment**: The level of federal funding for scientific research, including biomedical engineering, could fluctuate. While the Trump administration previously proposed cuts to agencies like the NIH, budget outcomes depended heavily on congressional negotiations. Public-private partnerships might be encouraged to fill any gaps.

3. **Healthcare Policy**: