[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLDiego/SPE_GeoHackathon_2025/blob/main/S1_M1_LLM_HF.ipynb)

***
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/write.svg" width="20"/> Follow along by running each cell in order
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> Make sure to run the environment setup cells first
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Wait for each installation to complete before proceeding
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/list.svg" width="20" /> Don't worry if installations take a while - this is normal!

In [None]:
# Download utils from GitHub
!wget -q --show-progress https://raw.githubusercontent.com/CLDiego/SPE_GeoHackathon_2025/dev/spe_utils.txt -O spe_utils.txt
!wget -q --show-progress -x -nH --cut-dirs=3 -i spe_utils.txt

In [1]:
# Environment setup [If running outside Colab]
# !pip install transformers torch matplotlib plotly scikit-learn pandas ipython

# import warnings
# warnings.filterwarnings('ignore')

In [2]:
# Hugging Face API token (optional)
# A token is needed only for gated or private models on the HF Hub;
# the public models used in this notebook should load without one.
# from google.colab import userdata
# hf_token = userdata.get('HF_TOKEN')

In [3]:
import spe_utilts.core

Faculty of Science and Engineering 🔬
The University of Manchester
Invoking SPE GeoHackathon 2025 utils version: 0.0.1


In [4]:
from spe_utilts.data import (
    GEOSCIENCE_TERMS,
    TOKENIZATION_EXAMPLES,
    SIMPLE_PROMPTS,
    GEOPHYSICS_TEXTS,
    GEOPHYSICS_CATEGORIES,
    get_texts_by_category,
    get_available_categories,
    get_random_texts
)

# Session 01 // Module 01: Large Language Models (LLMs) with HuggingFace

In this module, we'll explore the fundamentals of Large Language Models (LLMs) using HuggingFace transformers. We'll cover tokens, embeddings, context windows, and hands-on text generation with a focus on geoscience applications.

## Learning Objectives
- Understand what tokens, embeddings, and context windows are
- Load and use a small HuggingFace model
- Generate simple text completions
- Apply LLMs to geoscience definition tasks

## 1. Understanding Tokens

**Tokens** are the basic units that language models work with. Text is broken down into tokens before being processed by the model. A token can be:
- A whole word (e.g., "seismic")
- Part of a word (e.g., "geo" and "##physical" for "geophysical")
- Punctuation marks
- Special symbols

Let's see how tokenization works with a geoscience example:

In [5]:
from transformers import BertTokenizer
from spe_utilts.visualisation import bert_tokenize_and_color

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

for text in TOKENIZATION_EXAMPLES:
    bert_tokenize_and_color(text, tokenizer)

Original text: Seismic inversion is a geophysical technique.
Number of tokens: 8


Tokens: ['seismic', 'inversion', 'is', 'a', 'geo', '##physical', 'technique', '.']
--------------------------------------------------------------------------------
Original text: Hydrocarbon exploration uses seismic surveys.
Number of tokens: 7


Tokens: ['hydro', '##carbon', 'exploration', 'uses', 'seismic', 'surveys', '.']
--------------------------------------------------------------------------------
Original text: Reservoir characterization involves petrophysical analysis.
Number of tokens: 10


Tokens: ['reservoir', 'characterization', 'involves', 'pet', '##rop', '##hy', '##sic', '##al', 'analysis', '.']
--------------------------------------------------------------------------------
Original text: What is the porosity and permeability of this formation?
Number of tokens: 13


Tokens: ['what', 'is', 'the', 'por', '##osity', 'and', 'per', '##me', '##ability', 'of', 'this', 'formation', '?']
--------------------------------------------------------------------------------


In [7]:
# Display sample vocabulary, special tokens, and token mapping

# Sample vocab (first 20 keys)
vocab = tokenizer.get_vocab()
print("Sample vocabulary (first 20):", list(vocab.keys())[:20])

# Special tokens
print("\nSpecial tokens:", tokenizer.special_tokens_map)

# Mapping for the first tokenization example
sample_text = TOKENIZATION_EXAMPLES[0]  
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"\nSample text: {sample_text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Full encoding for the first example
encoded = tokenizer(sample_text, return_tensors='pt')
print(f"\nFull encoding (input_ids): {encoded['input_ids']}")
print(f"Attention mask: {encoded['attention_mask']}")

decoded = tokenizer.decode(token_ids)
print(f"Decoded tokens: {decoded}")

Sample vocabulary (first 20): ['[PAD]', '[unused0]', '[unused1]', '[unused2]', '[unused3]', '[unused4]', '[unused5]', '[unused6]', '[unused7]', '[unused8]', '[unused9]', '[unused10]', '[unused11]', '[unused12]', '[unused13]', '[unused14]', '[unused15]', '[unused16]', '[unused17]', '[unused18]']

Special tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

Sample text: Seismic inversion is a geophysical technique.
Tokens: ['seismic', 'inversion', 'is', 'a', 'geo', '##physical', 'technique', '.']
Token IDs: [22630, 28527, 2003, 1037, 20248, 23302, 6028, 1012]

Full encoding (input_ids): tensor([[  101, 22630, 28527,  2003,  1037, 20248, 23302,  6028,  1012,   102]])
Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Decoded tokens: seismic inversion is a geophysical technique.


Notes:
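- `AutoTokenizer.from_pretrained(...)` reads the checkpoint's configuration and returns the matching tokenizer class, so you do not have to name `BertTokenizer` explicitly.
- Either way, the resulting `tokenizer` object exposes the `tokenize`, `convert_tokens_to_ids`, and `decode` methods used above.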
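
The `##` pieces in the output above come from WordPiece, which splits an unknown word greedily into the longest vocabulary entries it can find. The following is a minimal sketch of that idea, not the library's implementation: `wordpiece_tokenize` and `TOY_VOCAB` are hypothetical names, and the toy vocabulary is invented for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into WordPiece-style tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption); the real BERT vocabulary is far larger.
TOY_VOCAB = {"seismic", "geo", "##physical", "por", "##osity", "per", "##me", "##ability"}

print(wordpiece_tokenize("geophysical", TOY_VOCAB))
print(wordpiece_tokenize("permeability", TOY_VOCAB))
print(wordpiece_tokenize("seismic", TOY_VOCAB))
```

With this toy vocabulary, "geophysical" and "permeability" split the same way as in the BERT output above, while "seismic" stays whole because it is a vocabulary entry.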

## 2. Understanding Embeddings

**Embeddings** are numerical representations of tokens in a high-dimensional space. Each token is converted to a vector of numbers that captures its meaning and relationships to other tokens.

Key properties of embeddings:
- Similar words have similar embeddings
- Embeddings capture semantic relationships
- Typical dimensions: 384, 512, 768, 1024, or higher
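
"Similar embeddings" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses only the standard library. `cosine_sim` is a hypothetical helper name, and the three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings" (assumption), chosen so related terms point the same way.
toy_embeddings = {
    "porosity": [0.9, 0.1, 0.2],
    "permeability": [0.8, 0.2, 0.3],
    "drill bit": [0.1, 0.9, 0.1],
}

sim_related = cosine_sim(toy_embeddings["porosity"], toy_embeddings["permeability"])
sim_unrelated = cosine_sim(toy_embeddings["porosity"], toy_embeddings["drill bit"])
print(f"porosity vs permeability: {sim_related:.3f}")
print(f"porosity vs drill bit:    {sim_unrelated:.3f}")
```

Values close to 1 mean the vectors point in nearly the same direction; values near 0 mean they are close to orthogonal.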

In [None]:
from transformers import AutoModel, AutoTokenizer
import torch

# Load a small model for embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer_embed = AutoTokenizer.from_pretrained(model_name)
model_embed = AutoModel.from_pretrained(model_name)

def get_embeddings(texts, tokenizer, model):
    """Get sentence embeddings"""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
        # Use CLS token embedding (first token) for sentence representation
        embeddings = outputs.last_hidden_state[:, 0, :]
    
    return embeddings.numpy()

# Get embeddings for the imported geoscience terms
embeddings = get_embeddings(GEOSCIENCE_TERMS, tokenizer_embed, model_embed)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each term is represented by {embeddings.shape[1]} numbers")
print(f"\nFirst 10 embedding values for '{GEOSCIENCE_TERMS[0]}':")
print(embeddings[0][:10])

Embedding shape: (6, 384)
Each term is represented by 384 numbers

First 10 embedding values for 'seismic inversion':
[ 0.04884824 -0.23083036  0.63500595  0.37743944 -0.0877682  -0.79720074
 -0.5652189  -0.12953708 -0.17343518  0.01015258]
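
The cell above takes the first-token (`[CLS]`) vector as the sentence embedding. Sentence-transformers checkpoints such as `all-MiniLM-L6-v2` are usually paired with mean pooling instead, which averages the token vectors and skips padded positions. Below is a dependency-free sketch of that arithmetic on nested lists. `mean_pool` is a hypothetical helper name and the values are invented.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sentence with two real tokens and one padding token (made-up values).
token_vectors = [
    [1.0, 2.0],
    [3.0, 4.0],
    [9.0, 9.0],  # padding position, ignored because its mask entry is 0
]
attention_mask = [1, 1, 0]

print(mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

With the tensors from the notebook, the same idea is the mask-weighted mean of `outputs.last_hidden_state` using `inputs["attention_mask"]`.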


In [12]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


# Inspect the imported geophysics texts
print(f"Total number of geophysics texts: {len(GEOPHYSICS_TEXTS)}")
print("Sample texts:")
for i, text in enumerate(GEOPHYSICS_TEXTS[:5]):
    print(f"{i+1}. {text}")

Total number of geophysics texts: 56
Sample texts:
1. Seismic inversion transforms seismic reflection data into quantitative subsurface rock properties.
2. P-wave velocity depends on rock density and bulk modulus in elastic media.
3. S-wave velocity is controlled by shear modulus and density of the formation.
4. Seismic amplitude variation with offset reveals fluid content and lithology changes.
5. Pre-stack seismic inversion simultaneously estimates multiple elastic parameters from angle stacks.


In [14]:
# Encode all geophysics sentences
inputs = tokenizer(GEOPHYSICS_TEXTS, padding=True, truncation=True, return_tensors="pt", max_length=512)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:,0,:]  # CLS token

print(f"Embeddings shape: {embeddings.shape}")
print(f"Each sentence is represented by {embeddings.shape[1]} dimensional vector")

# Reduce dimensions to 3D with t-SNE
perplexity = min(30, len(GEOPHYSICS_TEXTS) - 1)
tsne = TSNE(n_components=3, perplexity=perplexity, random_state=42, max_iter=1000)
embeddings_3d = tsne.fit_transform(embeddings.numpy())

print(f"3D embeddings shape: {embeddings_3d.shape}")
print(f"Using perplexity: {perplexity}")

Embeddings shape: torch.Size([56, 384])
Each sentence is represented by 384 dimensional vector


3D embeddings shape: (56, 3)
Using perplexity: 30


In [24]:
import plotly.express as px
# Create the 3D scatter plot using imported data
fig = px.scatter_3d(
    x=embeddings_3d[:,0],
    y=embeddings_3d[:,1],
    z=embeddings_3d[:,2],
    hover_name=GEOPHYSICS_TEXTS,  # Use imported data
    color=GEOPHYSICS_CATEGORIES,  # Use imported categories
    title="Interactive 3D Geophysics Text Embeddings",
    labels={'x':'Dimension 1', 'y':'Dimension 2', 'z':'Dimension 3'},
)

fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.update_layout(
    template='plotly_dark', font_family='monospace', width=900, height=700)
fig.show()


In [19]:
# Analyze semantic similarities within categories
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings.numpy())

# Find most similar sentence pairs
similarity_pairs = []
for i in range(len(GEOPHYSICS_TEXTS)):
    for j in range(i+1, len(GEOPHYSICS_TEXTS)):
        similarity_pairs.append({
            'text1': GEOPHYSICS_TEXTS[i][:50] + '...',
            'text2': GEOPHYSICS_TEXTS[j][:50] + '...',
            'category1': GEOPHYSICS_CATEGORIES[i],
            'category2': GEOPHYSICS_CATEGORIES[j],
            'similarity': similarity_matrix[i, j],
            'same_category': GEOPHYSICS_CATEGORIES[i] == GEOPHYSICS_CATEGORIES[j]
        })

# Convert to DataFrame and sort by similarity
df_similarities = pd.DataFrame(similarity_pairs)
df_top_similar = df_similarities.nlargest(10, 'similarity')

print("Top 10 Most Similar Sentence Pairs:")
print("=" * 80)
for idx, row in df_top_similar.iterrows():
    same_cat = "✓" if row['same_category'] else "✗"
    print(f"Similarity: {row['similarity']:.3f} | Same Category: {same_cat}")
    print(f"1. [{row['category1']}] {row['text1']}")
    print(f"2. [{row['category2']}] {row['text2']}")
    print("-" * 80)

# Calculate average similarity within vs between categories
within_category_sim = df_similarities[df_similarities['same_category']]['similarity'].mean()
between_category_sim = df_similarities[~df_similarities['same_category']]['similarity'].mean()

print(f"\nAverage similarity within same category: {within_category_sim:.3f}")
print(f"Average similarity between different categories: {between_category_sim:.3f}")
print(f"Difference: {within_category_sim - between_category_sim:.3f}")

Top 10 Most Similar Sentence Pairs:
Similarity: 0.905 | Same Category: ✓
1. [Drilling & Completion] Perforation creates communication pathways between...
2. [Drilling & Completion] Wellbore trajectory optimization maximizes reservo...
--------------------------------------------------------------------------------
Similarity: 0.892 | Same Category: ✓
1. [Reservoir Properties] Porosity measures the void space available for flu...
2. [Reservoir Properties] Capillary pressure controls fluid distribution at ...
--------------------------------------------------------------------------------
Similarity: 0.874 | Same Category: ✓
1. [Geology & Geochemistry] Migration pathways allow hydrocarbons to move from...
2. [Geology & Geochemistry] Seal integrity prevents hydrocarbon leakage from r...
--------------------------------------------------------------------------------
Similarity: 0.866 | Same Category: ✓
1. [Seismic Methods] Pre-stack seismic inversion simultaneously estimat...
2. [

## 3. Understanding Context Windows

**Context window** refers to the maximum number of tokens a model can process at once. This is a crucial limitation that affects:
- How much text the model can "remember"
- The maximum input size for generation tasks
- Computational requirements

Common context window sizes:
- GPT-2: 1,024 tokens
- GPT-3: 2,048 tokens (4,096 for GPT-3.5)
- GPT-4: 8,192 - 32,768 tokens
- Claude: 100,000+ tokens

In [None]:
# Demonstrate context window limitations
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load GPT-2 model
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')

# Set pad token
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token

print(f"GPT-2 maximum position embeddings: {model_gpt2.config.n_positions}")
print(f"This means the context window is {model_gpt2.config.n_positions} tokens")

# Create a long geoscience text to test context limits
long_text = """
Seismic inversion is a geophysical technique used to derive subsurface properties from seismic data. 
The process involves converting seismic reflection data into quantitative rock and fluid properties such as 
acoustic impedance, porosity, and lithology. This technique is fundamental in hydrocarbon exploration 
and reservoir characterization. The inversion process typically starts with seismic data acquisition, 
followed by data processing, and finally the inversion itself. There are several types of seismic inversion 
including post-stack inversion, pre-stack inversion, and simultaneous inversion. Post-stack inversion 
works with stacked seismic data to derive acoustic impedance. Pre-stack inversion uses angle-dependent 
reflectivity information to derive multiple elastic properties. Simultaneous inversion integrates seismic 
and well log data to provide more accurate and detailed subsurface models.
""" * 10  # Repeat to make it longer

# Tokenize the long text
tokens = tokenizer_gpt2.tokenize(long_text)
print(f"\nLong text has {len(tokens)} tokens")
print(f"Exceeds context window: {len(tokens) > model_gpt2.config.n_positions}")

# Show what happens when we truncate
max_length = model_gpt2.config.n_positions - 50  # Leave room for generation
truncated_tokens = tokens[:max_length]
print(f"Truncated to {len(truncated_tokens)} tokens for processing")
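
Truncation simply discards everything past the limit. A common alternative is to split a long token sequence into overlapping windows that each fit the context window and process them one at a time. The following is a minimal sketch under that assumption. `chunk_tokens` is a hypothetical helper, and integers stand in for real tokens.

```python
def chunk_tokens(tokens, window, overlap=0):
    """Split a token list into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some surrounding context.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks


# Integers stand in for tokens here (assumption for illustration).
toy_tokens = list(range(10))
for chunk in chunk_tokens(toy_tokens, window=4, overlap=1):
    print(chunk)
```

Applied to the notebook's `tokens` list, `window` would be something like `max_length`, so each chunk leaves room for generation.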

## 4. Loading a Small HuggingFace Model

Let's load and explore a small language model suitable for text generation tasks.

In [20]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small, efficient model for text generation
model_name = "distilgpt2"  # Smaller, faster version of GPT-2
tokenizer_gen = AutoTokenizer.from_pretrained(model_name)
model_gen = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token
if tokenizer_gen.pad_token is None:
    tokenizer_gen.pad_token = tokenizer_gen.eos_token

print(f"Model: {model_name}")
print(f"Vocabulary size: {tokenizer_gen.vocab_size:,}")
print(f"Model parameters: {model_gen.num_parameters():,}")
print(f"Context window: {model_gen.config.n_positions} tokens")
print(f"Embedding dimension: {model_gen.config.n_embd}")


## 5. Generate Simple Text Completions

Now let's use our model to generate text completions with various prompts.

In [None]:
def generate_text(prompt, tokenizer, model, max_length=100, temperature=0.7, num_return_sequences=1):
    """Generate completions for a prompt; max_length counts prompt tokens plus generated tokens."""
    inputs = tokenizer(prompt, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            no_repeat_ngram_size=2  # Avoid repetition
        )
    
    generated_texts = []
    for output in outputs:
        generated_text = tokenizer.decode(output, skip_special_tokens=True)
        generated_texts.append(generated_text)
    
    return generated_texts

# Test with simple prompts
simple_prompts = [
    "The geology of this region",
    "Oil and gas exploration requires",
    "Seismic waves travel through"
]

print("=== Simple Text Completions ===")
for prompt in simple_prompts:  # Prompts defined above (SIMPLE_PROMPTS is the imported alternative)
    generated = generate_text(prompt, tokenizer_gen, model_gen, max_length=60)
    print(f"\nPrompt: '{prompt}'")
    print(f"Completion: '{generated[0]}'")
    print("-" * 80)
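
The `no_repeat_ngram_size=2` argument above prevents any two-token sequence from appearing twice in the output. Conceptually, before each step the generator looks at the last `n - 1` tokens and bans every token that followed the same prefix earlier. This is a rough sketch of that idea, not the library's actual code. `banned_next_tokens` is a hypothetical helper, and words stand in for token IDs.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if len(generated) < ngram_size:
        return set()  # no complete n-gram exists yet, so nothing can repeat
    prefix_len = ngram_size - 1
    prefix = generated[len(generated) - prefix_len:]
    banned = set()
    # Scan every earlier position where the same prefix occurred.
    for i in range(len(generated) - prefix_len):
        if generated[i:i + prefix_len] == prefix:
            banned.add(generated[i + prefix_len])
    return banned


# Words stand in for token IDs (assumption for illustration).
so_far = ["the", "seismic", "data", "shows", "the"]
print(banned_next_tokens(so_far, ngram_size=2))  # {'seismic'}
```

Here "the seismic" has already been produced, so "seismic" may not follow the second "the".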

In [None]:
# Experiment with different generation parameters
prompt = "Reservoir characterization involves"

print("=== Effect of Different Parameters ===")
print(f"Prompt: '{prompt}'\n")

# Low temperature (more deterministic)
low_temp = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=0.3)
print(f"Low temperature (0.3): {low_temp[0]}")

# High temperature (more creative)
high_temp = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=1.2)
print(f"High temperature (1.2): {high_temp[0]}")

# Multiple generations
multiple = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=0.8, num_return_sequences=3)
print("\nMultiple generations:")
for i, gen in enumerate(multiple, 1):
    print(f"{i}. {gen}")
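
Temperature works by dividing the model's raw scores (logits) by the temperature before they are turned into probabilities. Values below 1 sharpen the distribution and values above 1 flatten it. The standard-library sketch below shows the effect on three made-up logits. `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens (assumption).
toy_logits = [2.0, 1.0, 0.0]
for t in (0.3, 0.7, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At low temperature nearly all probability mass lands on the top-scoring token, which is why the low-temperature completions above tend to be more repeatable.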

## Summary

In this module, we covered:

1. **Tokens**: Basic units that LLMs process (words, subwords, punctuation)
2. **Embeddings**: Numerical representations that capture semantic meaning
3. **Context Windows**: Maximum input size limitations (1,024 tokens for GPT-2)
4. **Model Loading**: Using HuggingFace transformers to load pre-trained models
5. **Text Generation**: Creating completions with different parameters
6. **Geoscience Applications**: Prompting a small model with geoscience text as a first step toward generating definitions for technical terms

### Key Takeaways:
- Tokenization breaks text into processable units
- Embeddings capture semantic relationships between concepts
- Context windows limit how much text models can process at once
- Different prompting strategies can yield different results
- Temperature controls randomness in generation

### Next Steps:
- Experiment with larger models for better geoscience definitions
- Try fine-tuning models on domain-specific geoscience text
- Explore retrieval-augmented generation (RAG) for factual accuracy