[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLDiego/SPE_GeoHackathon_2025/blob/main/MOD_00_SETUP.ipynb)

***
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/write.svg" width="20"/> Follow along by running each cell in order
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> Make sure to run the environment setup cells first
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Wait for each installation to complete before proceeding
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/success.svg" width="20" /> Verify installations by running the test cells
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/list.svg" width="20" /> Don't worry if installations take a while - this is normal!


# Module 00: Large Language Models (LLMs) Basics with HuggingFace

LLMs are a type of artificial intelligence model designed to understand and generate human-like text based on the input they receive. They are trained on vast amounts of text data and can perform a variety of language tasks, such as translation, summarization, and question-answering.

The revolutionary aspect of LLMs lies in their ability to generate coherent and contextually relevant text, making them valuable tools for a wide range of applications, from chatbots to content creation. The breakthrough behind LLMs came from the 2017 paper "Attention Is All You Need" by Vaswani et al., which introduced the Transformer architecture. This architecture relies on a mechanism called "self-attention" that lets the model weigh the importance of every word in a sentence relative to every other word, regardless of how far apart they are. This was a significant departure from earlier models that processed text sequentially, one token after another. Because attention handles all positions in parallel, training on large datasets became much more efficient.
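To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention using only the Python standard library. It is a simplification and not the full layer from the paper: each vector serves as its own query, key, and value, whereas a real Transformer first applies learned projections. The function names and the toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Simplification (assumption): every vector is used directly as query,
    key, and value. A trained model would apply learned projections first.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this position with every position, regardless of distance
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is the attention-weighted average of all value vectors
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional "word vectors" (made-up numbers, not from a trained model)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_outputs, attn_weights = self_attention(toy_vectors)
for row in attn_weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one position attends to every position, and each row sums to 1. No position depends on the result of a previous one, which is why the computation can run in parallel.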

The transformer architecture is shown below:

![Transformer Architecture](https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/transformer.png)

A transformer is composed of:
- An encoder that processes the input text and generates a contextual representation.
- A decoder that takes the encoder's output and generates the final text output.
- Multi-head self-attention mechanisms that allow the model to focus on different parts of the input text simultaneously.
- Feed-forward neural networks that process the information from the attention layers.
- Layer normalization and residual connections that help stabilize and improve the training process.
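The last two components in the list can be sketched in a few lines. The example below is a hand-written illustration, not library code: it applies a position-wise feed-forward network to one vector, adds the residual connection, and then normalizes the result, following the `LayerNorm(x + Sublayer(x))` ordering of the original paper. The weights are hypothetical values chosen by hand, and the layer normalization omits the learned gain and bias.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def feed_forward(vector, w1, w2):
    """Two linear maps with a ReLU in between, applied to a single position."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vector, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def add_and_norm(vector, sublayer_output):
    """Residual connection followed by layer normalization."""
    summed = [v + s for v, s in zip(vector, sublayer_output)]
    return layer_norm(summed)

# Hypothetical weights mapping 2 -> 3 -> 2 dimensions (one row per output unit)
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w2 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]

x = [2.0, 1.0]
ffn_output = feed_forward(x, w1, w2)
block_output = add_and_norm(x, ffn_output)
print(ffn_output)
print(block_output)
```

The residual connection means the sub-layer only has to learn a correction to `x`, and the normalization keeps the values in a stable range from layer to layer.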

## Passing Inputs Through the Transformer
As the encoder processes the input text, it transforms each token into a continuous vector representation, or embedding, that captures the token's meaning in context. These representations are then passed to the decoder, which generates the output text one token at a time. At each step the decoder attends to the entire encoded input as well as to the tokens it has already produced, which helps it generate coherent and contextually relevant text.
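The "one token at a time" loop can be illustrated with a toy example. The sketch below is not a real decoder: the lookup table `TOY_NEXT_TOKEN_SCORES` is a hypothetical stand-in that scores the next token from the last token only, whereas an actual Transformer decoder conditions on the whole sequence and on the encoder output. It uses greedy decoding, which always picks the highest-scoring token.

```python
# Hypothetical stand-in for a decoder: last token -> scores for the next token
TOY_NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mouse": 0.3},
    "cat": {"hunts": 0.8, "<end>": 0.2},
    "hunts": {"<end>": 0.6, "the": 0.4},
}

def greedy_decode(next_token_scores, max_new_tokens=10):
    """Generate tokens one at a time, always choosing the highest score."""
    generated = ["<start>"]
    for _ in range(max_new_tokens):
        scores = next_token_scores[generated[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "<end>":
            break  # the model signalled that the sequence is complete
        generated.append(next_token)  # the new token joins the context
    return generated[1:]  # drop the <start> marker

decoded = greedy_decode(TOY_NEXT_TOKEN_SCORES)
print(decoded)  # ['the', 'cat', 'hunts']
```

Each generated token is appended to the context before the next step, so every prediction depends on what was produced so far.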

However, before the input text can be processed by the transformer, it must be tokenized and converted into a format that the model can understand. This involves breaking the text into smaller units, or tokens, and mapping these tokens to their corresponding embeddings in a high-dimensional space. The transformer then processes these embeddings through its layers, applying self-attention and feed-forward networks to generate the final output.
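The pipeline described above (text, then tokens, then ids, then embeddings) can be sketched with a tiny hand-made vocabulary. Everything in this example is hypothetical: the five-entry vocabulary, the whitespace tokenizer, and the 3-dimensional embedding values are for illustration only, while real models learn tables with tens of thousands of rows and hundreds of dimensions.

```python
# Hypothetical 5-entry vocabulary mapping tokens to integer ids
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "hunts": 3, "mouse": 4}

# Hypothetical embedding table: one 3-dimensional row per vocabulary id
embedding_table = [
    [0.0, 0.0, 0.0],    # [UNK]
    [0.1, 0.3, -0.2],   # the
    [0.7, -0.1, 0.4],   # cat
    [-0.3, 0.8, 0.1],   # hunts
    [0.6, -0.2, 0.5],   # mouse
]

def encode(sentence):
    """Split on whitespace and map each token to its id ([UNK] if missing)."""
    tokens = sentence.lower().split()
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

def embed(token_ids):
    """Look up the embedding row for each token id."""
    return [embedding_table[token_id] for token_id in token_ids]

sample_ids = encode("The cat hunts a mouse")
sample_vectors = embed(sample_ids)
print(sample_ids)  # [1, 2, 3, 0, 4]
print(sample_vectors[1])  # embedding row for "cat"
```

The word "a" is not in the toy vocabulary, so it maps to the `[UNK]` id. Subword tokenizers exist largely to avoid this loss of information.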

## Tokenizers
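The BERT tokenizer used in the next cell is a WordPiece tokenizer: it splits words that are not in its vocabulary into smaller known pieces and marks continuation pieces with a `##` prefix. The sketch below imitates the greedy longest-match-first splitting step on a single word. It is a simplified illustration with a hypothetical five-entry vocabulary, not the library's implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration only
toy_vocab = {"un", "##want", "##ed", "cat", "##s"}
print(wordpiece_split("unwanted", toy_vocab))  # ['un', '##want', '##ed']
print(wordpiece_split("cats", toy_vocab))      # ['cat', '##s']
print(wordpiece_split("dog", toy_vocab))       # ['[UNK]']
```

When the colored output of the next cell shows pieces starting with `##`, those are continuation pieces produced by this kind of splitting.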

In [9]:
import zlib
from html import escape

from IPython.display import HTML
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def bert_tokenize_and_color(text, tokenizer):
    """Tokenize a list of lines with BERT and show each token on a colored background."""
    text = ' '.join(text)  # merge the lines into a single string
    tokens = tokenizer.tokenize(text)

    colored_text = ""
    colors = ['#FF5733', '#33FF57', '#3357FF', '#FFD700', '#00CED1', '#FF00FF', '#222222',
              '#FF0000', '#00FF00', '#0000FF', '#FFFF00', '#00FFFF', '#FF1493', '#8A2BE2',
              '#FF8C00', '#228B22', '#8B0000', '#2F4F4F', '#4B0082', '#000000']

    for token in tokens:
        # crc32 gives the same token the same color in every session;
        # the built-in hash() of a string changes between Python processes
        color = colors[zlib.crc32(token.encode('utf-8')) % len(colors)]
        # escape() keeps characters such as < or & from being read as HTML
        token_html = f'<span style="background-color:{color}">{escape(token)}</span>'
        colored_text += token_html + ' '

    return HTML(colored_text)

# Example usage
text = [
    "Pangur Bán and I at work,",
    "Adepts, equals, cat and clerk:",
    "His whole instinct is to hunt,",
    "Mine to free the meaning pent.",
    "More than loud acclaim, I love",
    "Books, silence, thought, my alcove.",
    "Happy for me, Pangur Bán",
    "Child-plays round some mouse’s den.",
    "Truth to tell, just being here,",
    "Housed alone, housed together,",
    "Adds up to its own reward:",
    "Concentration, stealthy art.",
    "Next thing an unwary mouse",
    "Bares his flank: Pangur pounces.",
    "Next thing lines that held and held",
    "Meaning back begin to yield.",
    "All the while, his round bright eye",
    "Fixes on the wall, while I",
    "Focus my less piercing gaze",
    "On the challenge of the page.",
    "With his unsheathed, perfect nails",
    "Pangur springs, exults and kills.",
    "When the longed-for, difficult",
    "Answers come, I too exult.",
    "So it goes. To each his own.",
    "No vying. No vexation.",
    "Taking pleasure, taking pains,",
    "Kindred spirits, veterans.",
    "Day and night, soft purr, soft pad,",
    "Pangur Bán has learned his trade.",
    "Day and night, my own hard work",
    "Solves the cruxes, makes a mark."
]
bert_tokenize_and_color(text, tokenizer)


In [10]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

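# Encode sentences
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (lines, tokens, hidden)
    # Mean-pool over real tokens only (the pooling recommended for this model);
    # the attention mask zeroes out padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)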

# Reduce dimensions to 3D with t-SNE
tsne = TSNE(n_components=3, perplexity=5, random_state=42)
embeddings_3d = tsne.fit_transform(embeddings.numpy())

In [14]:
import plotly.express as px

fig = px.scatter_3d(
    x=embeddings_3d[:,0],
    y=embeddings_3d[:,1],
    z=embeddings_3d[:,2],
    hover_name=text,
    color=list(range(len(text))),
    title="Interactive 3D Embedding Visualization",
    labels={'x':'Dimension 1', 'y':'Dimension 2', 'z':'Dimension 3'}
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.update_layout(coloraxis_colorbar_title_text="Line Index in Poem", font_family='monospace', width=900, height=700)
fig.show()

## Visualization Explanation

This interactive 3D scatter plot displays the sentence embeddings of the lines of the poem "Pangur Bán" (by an anonymous Irish monk, c. 9th century), reduced to three dimensions with t-SNE. Each point represents one line of the poem, and lines with similar embeddings tend to land close together. t-SNE mainly preserves local neighborhoods, so nearby points are informative but the distances between far-apart points should not be over-interpreted. The color gradient corresponds to the line index (0 to 31), indicating the sequential order in the poem. Hover over any point to view the full text of that line. The plot gives a rough picture of which verses the model treats as thematically or contextually related.
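Similarity between embeddings is often quantified with cosine similarity, which compares the direction of two vectors and ignores their length. The sketch below is an illustration with made-up 3-dimensional vectors. It is not part of the notebook's pipeline, which hands the embeddings to t-SNE without computing any similarity explicitly. The same function could in principle be applied to two rows of the `embeddings` tensor after converting them to lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for three line embeddings (hypothetical values)
line_a = [0.9, 0.1, 0.0]
line_b = [0.8, 0.2, 0.1]
line_c = [-0.1, 0.9, 0.4]

sim_ab = cosine_similarity(line_a, line_b)
sim_ac = cosine_similarity(line_a, line_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Here `line_a` and `line_b` point in nearly the same direction and score close to 1, while `line_a` and `line_c` are almost orthogonal and score close to 0.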