[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLDiego/SPE_GeoHackathon_2025/blob/main/MOD_00_LLM_BASICS.ipynb)

***
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/write.svg" width="20"/> Follow along by running each cell in order
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> Make sure to run the environment setup cells first
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Wait for each installation to complete before proceeding
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/success.svg" width="20" /> Verify installations by running the test cells
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/list.svg" width="20" /> Don't worry if installations take a while - this is normal!


# Module 00: Large Language Models (LLMs) Basics with Hugging Face

LLMs are a type of artificial intelligence model designed to understand and generate human-like text based on the input they receive. They are trained on vast amounts of text data and can perform a variety of language tasks, such as translation, summarization, and question-answering.

The revolutionary aspect of LLMs lies in their ability to generate coherent and contextually relevant text, making them valuable tools for a wide range of applications, from chatbots to content creation. The major breakthrough for LLMs came from the 2017 paper "Attention Is All You Need" by Vaswani et al., which introduced the Transformer architecture. This architecture relies on a mechanism called "self-attention" that allows the model to weigh the importance of different words in a sentence, regardless of their position. This was a significant departure from earlier recurrent models, which processed text one token at a time, and it enabled much more efficient, parallel training on large datasets.
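To make the idea of "weighing the importance of different words" concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It rests on a simplifying assumption: the same toy vectors serve as queries, keys and values, whereas a real Transformer first applies learned projections and runs several heads in parallel. The names `softmax`, `self_attention` and `toy_vectors` are illustrative only.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head attention where queries, keys and values are all `vectors`."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position, scaled by sqrt(dim)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of all value vectors
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "word" vectors: the first two are similar, the third is different
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attended, attention_weights = self_attention(toy_vectors)
for row in attention_weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one position attends to every other position. The first vector puts more weight on the similar second vector than on the dissimilar third one, and this happens regardless of where the vectors sit in the sequence.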

The transformer architecture is shown below:

![Transformer Architecture](https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/transformer.png)

A transformer is composed of:
- An encoder that processes the input text and generates a contextual representation.
- A decoder that takes the encoder's output and generates the final text output.
- Multi-head self-attention mechanisms that allow the model to focus on different parts of the input text simultaneously.
- Feed-forward neural networks that process the information from the attention layers.
- Layer normalization and residual connections that help stabilize and improve the training process.
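The last two items in the list can be sketched in a few lines. The example below is a simplified illustration, not the library's implementation: `feed_forward` is a hypothetical stand-in for the learned feed-forward network, and the layer normalization omits the learned scale and shift parameters. It follows the arrangement used in the original paper, `LayerNorm(x + Sublayer(x))`.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def feed_forward(vector):
    """Hypothetical stand-in for a learned feed-forward layer with ReLU."""
    return [max(0.0, 2.0 * x - 1.0) for x in vector]


def residual_block(vector, sublayer):
    """Add the sublayer output back onto its input, then normalize."""
    summed = [x + s for x, s in zip(vector, sublayer(vector))]
    return layer_norm(summed)


block_output = residual_block([0.5, 1.5, -1.0, 2.0], feed_forward)
print([round(x, 3) for x in block_output])
```

The residual connection lets the input pass through unchanged alongside the sublayer output, and the normalization keeps the values in a stable range from one layer to the next.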

## Passing Inputs Through the Transformer
As the encoder processes the input text, it transforms it into a series of continuous representations, or embeddings, that capture the meaning and context of the words. These embeddings are then passed to the decoder, which generates the output text one token at a time. The attention mechanism allows the model to consider the entire input sequence when producing each token, enabling it to generate more coherent and contextually relevant text.
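Generating "one token at a time" can be illustrated with a toy loop. This is a sketch under a strong assumption: `next_token_scores` is a hypothetical hand-written table that only looks at the previous token, whereas a real decoder scores the next token using the whole sequence so far. The loop always picks the highest-scoring candidate (greedy decoding) and stops at an end marker.

```python
# Hypothetical lookup table standing in for a trained model's predictions
next_token_scores = {
    "<start>": {"hunting": 0.9, "pangur": 0.1},
    "hunting": {"mice": 0.6, "words": 0.4},
    "mice": {"<end>": 1.0},
    "words": {"<end>": 1.0},
    "pangur": {"<end>": 1.0},
}


def greedy_generate(max_tokens=10):
    """Repeatedly append the highest-scoring next token until <end>."""
    generated = ["<start>"]
    for _ in range(max_tokens):
        candidates = next_token_scores[generated[-1]]
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break
        generated.append(best)
    return generated[1:]  # drop the <start> marker


print(greedy_generate())
```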

However, before the input text can be processed by the transformer, it must be tokenized and converted into a format that the model can understand. This involves breaking the text into smaller units, or tokens, mapping each token to an integer ID, and looking up the embedding vector for that ID in a high-dimensional space. The transformer then processes these embeddings through its layers, applying self-attention and feed-forward networks to generate the final output.
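The path from text to embeddings can be sketched with a toy vocabulary. Everything here is an illustrative assumption: `toy_vocab` holds a handful of whole words, tokenization is a plain whitespace split, and `embedding_table` contains made-up 3-dimensional vectors. Real models use subword vocabularies with tens of thousands of entries and learned embeddings with hundreds of dimensions.

```python
# Toy vocabulary: token string -> integer ID
toy_vocab = {"[UNK]": 0, "hunting": 1, "mice": 2, "words": 3,
             "is": 4, "his": 5, "delight": 6}

# Toy embedding table: row i is the (made-up) vector for token ID i
embedding_table = [
    [0.1 * token_id, 0.2 * token_id, 0.3 * token_id]
    for token_id in range(len(toy_vocab))
]


def encode(sentence):
    """Lowercase, split on whitespace, and map each token to its ID."""
    tokens = sentence.lower().split()
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]


ids = encode("Hunting mice is his delight")
vectors = [embedding_table[token_id] for token_id in ids]

print("IDs:", ids)
print("First embedding:", vectors[0])
```

Words missing from the vocabulary fall back to the `[UNK]` ID here. Subword tokenizers, covered in the next section, avoid most of these fallbacks by splitting rare words into smaller known pieces.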

## Tokenizers

In [1]:
from transformers import BertTokenizer
from IPython.display import HTML, display

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


def bert_tokenize_and_color(text, tokenizer):
    """Render each line in `text` (a list of strings) with its tokens highlighted."""
    colored_text = ""
    colors = ['#FF5733', '#33FF57', '#3357FF', '#FFD700', '#00CED1', '#FF00FF', '#FFFF00',
              '#FF0000', '#00FF00', '#0000FF', '#00FFFF', '#FF1493', '#8A2BE2',
              '#FF8C00', '#228B22', '#DC143C', '#32CD32', '#1E90FF', '#FFD700', '#FF69B4']

    for line in text:
        line_html = ""
        tokens = tokenizer.tokenize(line)
        for token in tokens:
            color = colors[hash(token) % len(colors)]
            token_html = f'<span style="background-color:{color}; color: white; margin-right: 5px;">{token}</span>'
            line_html += token_html
        colored_text += f'<div style="margin-bottom: 10px;">{line_html}</div>'

    display(HTML(colored_text))

# Example usage
text = [
    "I and Pangur Bán, my cat,",
    "'Tis a like task we are at;",
    "Hunting mice is his delight,",
    "Hunting words I sit all night.",
    "Better far than praise of men",
    "'Tis to sit with book and pen;",
    "Pangur bears me no ill-will,",
    "He, too, plies his simple skill.",
    "'Tis a merry thing to see",
    "At our tasks how glad are we,",
    "When at home we sit and find",
    "Entertainment to our mind.",
    "Oftentimes a mouse will stray",
    "In the hero Pangur's way;",
    "Oftentimes my keen thought set",
    "Takes a meaning in its net.",
    "'Gainst the wall he sets his eye",
    "Full and fierce and sharp and sly;",
    "'Gainst the wall of knowledge I",
    "All my little wisdom try.",
    "When a mouse darts from its den,",
    "O! how glad is Pangur then;",
    "O! what gladness do I prove",
    "When I solve the doubts I love.",
    "So in peace our task we ply,",
    "Pangur Bán, my cat, and I;",
    "In our arts we find our bliss,",
    "I have mine, and he has his.",
    "Practice every day has made",
    "Pangur perfect in his trade;",
    "I get wisdom day and night,",
    "Turning darkness into light."
]
bert_tokenize_and_color(text, tokenizer)

In [12]:
# Display sample vocabulary, special tokens, and token mapping

# Sample vocab (first 20 keys)
vocab = tokenizer.get_vocab()
print("Sample vocabulary (first 20):", list(vocab.keys())[:20])

# Special tokens
print("\nSpecial tokens:", tokenizer.special_tokens_map)

# Mapping for the first line of the poem
sample_text = text[0]
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"\nSample text: {sample_text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Full encoding for the first line
encoded = tokenizer(sample_text, return_tensors='pt')
print(f"\nFull encoding (input_ids): {encoded['input_ids']}")
print(f"Attention mask: {encoded['attention_mask']}")

decoded = tokenizer.decode(token_ids)
print(f"Decoded tokens: {decoded}")

Sample vocabulary (first 20): ['[PAD]', '[unused0]', '[unused1]', '[unused2]', '[unused3]', '[unused4]', '[unused5]', '[unused6]', '[unused7]', '[unused8]', '[unused9]', '[unused10]', '[unused11]', '[unused12]', '[unused13]', '[unused14]', '[unused15]', '[unused16]', '[unused17]', '[unused18]']

Special tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

Sample text: I and Pangur Bán, my cat,
Tokens: ['i', 'and', 'pang', '##ur', 'ban', ',', 'my', 'cat', ',']
Token IDs: [1045, 1998, 20657, 3126, 7221, 1010, 2026, 4937, 1010]

Full encoding (input_ids): tensor([[  101,  1045,  1998, 20657,  3126,  7221,  1010,  2026,  4937,  1010,
           102]])
Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Decoded tokens: i and pangur ban, my cat,
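In the output above, `Pangur` is lowercased and split into `pang` and `##ur`, where the `##` prefix marks a piece that continues the previous one. The full encoding also gains two extra IDs, 101 and 102, which are the `[CLS]` and `[SEP]` special tokens that the tokenizer adds around the sequence. The splitting step can be approximated by a greedy longest-match-first search, sketched below. This is a simplified illustration with a tiny hypothetical vocabulary (`toy_vocab`), not the library's actual implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no known piece fits: give up on the word
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"pang", "##ur", "cat", "ban", "p", "##a"}
print(wordpiece_split("pangur", toy_vocab))  # ['pang', '##ur']
print(wordpiece_split("cat", toy_vocab))     # ['cat']
```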


In [13]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Encode sentences
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:,0,:]  # embedding of the first ([CLS]) token of each line

# Reduce dimensions to 3D with t-SNE
tsne = TSNE(n_components=3, perplexity=5, random_state=42)
embeddings_3d = tsne.fit_transform(embeddings.numpy())
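The cell above uses the first-token vector as the sentence embedding. Sentence-embedding models of this family are commonly used with mean pooling instead, which averages all token vectors while skipping padding positions. The sketch below shows that idea on plain Python lists with made-up numbers. The function name `mean_pool` and the toy values are assumptions for illustration, and with real tensors the same calculation would use `last_hidden_state` together with `inputs["attention_mask"]`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding position (mask value 0)
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_attention_mask = [1, 1, 0]

sentence_vector = mean_pool(toy_token_vectors, toy_attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Because the lines of the poem have different lengths and are padded to a common length, ignoring the padded positions keeps them from distorting the average.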

## Visualization Explanation

The interactive 3D scatter plot produced by the next cell displays the sentence embeddings of lines from the poem "Pangur Bán" (by an anonymous Irish monk, c. 9th century), reduced to three dimensions with t-SNE. Each point represents one line of the poem. t-SNE mainly preserves local neighborhoods, so lines that sit close together in the plot tend to have similar embeddings, while distances between far-apart groups should not be over-interpreted. The color gradient corresponds to the line index (0 to 31), indicating the sequential order in the poem. Hover over any point to view the full text of that line. The plot gives a rough view of how the model groups the poem's verses by theme and context.
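Similarity between embeddings is usually measured with cosine similarity, which compares the direction of two vectors and ignores their length. The helper below is a minimal sketch in plain Python with toy 2-dimensional vectors. The name `cosine_similarity` is chosen here for illustration, and applying it to the real embeddings would require converting the tensor rows to lists first.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```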

In [14]:
import plotly.express as px

fig = px.scatter_3d(
    x=embeddings_3d[:,0],
    y=embeddings_3d[:,1],
    z=embeddings_3d[:,2],
    hover_name=text,
    color=list(range(len(text))),
    title="Interactive 3D Embedding Visualization",
    labels={'x':'Dimension 1', 'y':'Dimension 2', 'z':'Dimension 3'}
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.update_layout(coloraxis_colorbar_title_text="Line Index in Poem", font_family='monospace', width=900, height=700)
fig.show()

## Difference between Model and AutoModel

AutoModel is a high-level class in the transformers library that automatically selects the appropriate model architecture based on the model configuration. For example:
- For BERT models, it loads `BertModel`.
- For GPT-2 models, it loads `GPT2Model`.
- For Llama models, it loads `LlamaModel`.

This is convenient when you don't know or don't want to specify the exact model class. In contrast, using a specific model class like `BertModel` requires you to know the architecture in advance, but it makes that choice explicit in the code.
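Conceptually, this selection works like a lookup from the `model_type` field of a model's configuration to a Python class. The sketch below imitates that idea with hypothetical names (`ToyBertModel`, `ToyGPT2Model`, `MODEL_REGISTRY`, `toy_auto_model`). It is a simplified illustration and not how the transformers library is actually implemented.

```python
class ToyBertModel:
    """Hypothetical placeholder for a BERT-style architecture."""


class ToyGPT2Model:
    """Hypothetical placeholder for a GPT-2-style architecture."""


# Registry mapping a configuration's model_type to the class to build
MODEL_REGISTRY = {
    "bert": ToyBertModel,
    "gpt2": ToyGPT2Model,
}


def toy_auto_model(config):
    """Pick and instantiate the class named by config['model_type']."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type]()


model_instance = toy_auto_model({"model_type": "bert"})
print(type(model_instance).__name__)  # ToyBertModel
```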

For tasks like text generation, we use `AutoModelForCausalLM`, which loads the matching architecture with a language-modeling head on top so that it can predict the next token.

Below, we demonstrate loading the same BERT model using both approaches.

In [15]:
from transformers import BertModel, AutoModel

# Load using specific model class
model_specific = BertModel.from_pretrained('bert-base-uncased')

# Load using AutoModel
model_auto = AutoModel.from_pretrained('bert-base-uncased')

print("Type of model_specific:", type(model_specific))
print("Type of model_auto:", type(model_auto))
print("Are they the same?", type(model_specific) == type(model_auto))

Type of model_specific: <class 'transformers.models.bert.modeling_bert.BertModel'>
Type of model_auto: <class 'transformers.models.bert.modeling_bert.BertModel'>
Are they the same? True


## Text Generation with Transformers

This section showcases text generation using a public transformer model (GPT-2) from Hugging Face. We'll load the model and generate text based on a prompt from the poem.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a public model (GPT-2)
model_name = 'gpt2'
tokenizer_gpt2 = AutoTokenizer.from_pretrained(model_name)
model_gpt2 = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 ships without a pad token, so reuse the end-of-sequence token
if tokenizer_gpt2.pad_token is None:
    tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token

# Prompt from the poem
prompt = "I and Pangur Bán, my cat,"
inputs = tokenizer_gpt2(prompt, return_tensors='pt')

# Generate text; max_length counts the prompt tokens plus the newly generated ones
outputs = model_gpt2.generate(
    **inputs,
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer_gpt2.pad_token_id,
)

# Decode and display
generated_text = tokenizer_gpt2.decode(outputs[0], skip_special_tokens=True)
print("Prompt:", prompt)
print("Generated text:", generated_text)
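The generation call above samples with `do_sample=True` and `temperature=0.7`. Temperature divides the model's raw scores (logits) before they are turned into probabilities: values below 1 sharpen the distribution toward the most likely token, and values above 1 flatten it. The sketch below shows the effect on three made-up logits. The names `softmax_with_temperature` and `toy_logits` are illustrative assumptions, not part of the transformers API.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temperature in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sampling: draw one candidate index according to the probabilities
rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled_index = rng.choices(
    range(len(toy_logits)),
    weights=softmax_with_temperature(toy_logits, 0.7),
    k=1,
)[0]
print("Sampled index:", sampled_index)
```

Because sampling is random, running the generation cell several times with the same prompt will usually produce different continuations.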
