[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLDiego/SPE_GeoHackathon_2025/blob/main/S1_M1_LLM_HF.ipynb)

***
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/write.svg" width="20"/> Follow along by running each cell in order
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> Make sure to run the environment setup cells first
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Wait for each installation to complete before proceeding
- <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/list.svg" width="20" /> Don't worry if installations take a while - this is normal!

In [None]:
# Environment setup [If running outside Colab]
# !pip install transformers torch matplotlib plotly scikit-learn pandas ipython

# import warnings
# warnings.filterwarnings('ignore')

In [None]:
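# Hugging Face API token
# The public models in this notebook download without a token; it is needed for gated or private models
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
except ImportError:  # not running in Colab: fall back to an environment variable
    import os
    hf_token = os.environ.get('HF_TOKEN')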

# Session 01 // Module 01: Large Language Models (LLMs) with HuggingFace

In this module, we'll explore the fundamentals of Large Language Models (LLMs) using HuggingFace transformers. We'll cover tokens, embeddings, context windows, and hands-on text generation with a focus on geoscience applications.

## Learning Objectives
- Understand what tokens, embeddings, and context windows are
- Load and use a small HuggingFace model
- Generate simple text completions
- Apply LLMs to geoscience definition tasks

## 1. Understanding Tokens

**Tokens** are the basic units that language models work with. Text is broken down into tokens before being processed by the model. A token can be:
- A whole word (e.g., "seismic")
- Part of a word (e.g., "seis", "mic")
- Punctuation marks
- Special symbols
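
How a word ends up as "part of a word" tokens can be sketched with a toy version of greedy longest-match splitting, the general idea behind WordPiece-style tokenizers such as BERT's. This is a simplified illustration, not the library's implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for this example, and a real vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # '##' marks a continuation piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (assumption, for illustration only)
toy_vocab = {"seismic", "seis", "##mic", "por", "##osity"}

print(wordpiece_tokenize("seismic", toy_vocab))   # ['seismic']
print(wordpiece_tokenize("porosity", toy_vocab))  # ['por', '##osity']
print(wordpiece_tokenize("shale", toy_vocab))     # ['[UNK]']
```

Words that are in the vocabulary stay whole; rarer words are assembled from smaller pieces, which is why domain terms often cost more tokens than everyday words.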

Let's see how tokenization works with a geoscience example:

In [None]:
# TODO: Use utils library
from IPython.display import HTML, display
def bert_tokenize_and_color(text, tokenizer):
    """Print the tokens of `text` and display each one as a colored HTML span."""
    colors = ['#FF5733', '#33FF57', '#3357FF', '#FFD700', '#00CED1', '#FF00FF', '#FFFF00',
              '#FF0000', '#00FF00', '#0000FF', '#00FFFF', '#FF1493', '#8A2BE2',
              '#FF8C00', '#228B22', '#DC143C', '#32CD32', '#1E90FF', '#FFD700', '#FF69B4']

    tokens = tokenizer.tokenize(text)
    colored_html = ""
    
    for i, token in enumerate(tokens):
        color = colors[i % len(colors)]
        # BERT marks word-continuation pieces with '##'. The replace below only
        # matters for GPT-2-style tokenizers, which mark a leading space with 'Ġ'.
        display_token = token.replace('Ġ', '▁')
        colored_html += f'<span style="background-color:{color}; color: white; padding: 2px 4px; margin: 1px; border-radius: 3px;">{display_token}</span>'
    
    print(f"Original text: {text}")
    print(f"Number of tokens: {len(tokens)}")
    display(HTML(colored_html))
    print(f"Tokens: {tokens}")
    print("-" * 80)

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Geoscience text examples
texts = [
    "Seismic inversion is a geophysical technique.",
    "Hydrocarbon exploration uses seismic surveys.",
    "Reservoir characterization involves petrophysical analysis.",
    "What is the porosity and permeability of this formation?"
]

for text in texts:
    bert_tokenize_and_color(text, tokenizer)

In [None]:
# Display sample vocabulary, special tokens, and token mapping

# Sample vocab (first 20 keys)
vocab = tokenizer.get_vocab()
print("Sample vocabulary (first 20):", list(vocab.keys())[:20])

# Special tokens
print("\nSpecial tokens:", tokenizer.special_tokens_map)

# Mapping for the first example sentence
sample_text = texts[0]
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"\nSample text: {sample_text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Full encoding for the same sentence (the tokenizer adds [CLS] and [SEP])
encoded = tokenizer(sample_text, return_tensors='pt')
print(f"\nFull encoding (input_ids): {encoded['input_ids']}")
print(f"Attention mask: {encoded['attention_mask']}")

decoded = tokenizer.decode(token_ids)
print(f"Decoded tokens: {decoded}")
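
The full encoding above returns an `attention_mask` alongside `input_ids`. For a single sentence the mask is all ones; it matters once sentences of different lengths are padded to a common length in a batch. The sketch below imitates that behaviour on plain lists. The helper `pad_batch` and the token IDs are invented for illustration (101 and 102 stand in for `[CLS]` and `[SEP]`); it is not the tokenizer's own code.

```python
def pad_batch(sequences, pad_id=0):
    """Pad ID lists to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 = real token the model should attend to, 0 = padding to ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Two toy "sentences" of different lengths (IDs are illustrative)
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
ids, mask = pad_batch(batch)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

The shorter sentence is extended with `pad_id`, and the zeros in its mask tell the model which positions carry no content.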

Notes:
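- `AutoTokenizer.from_pretrained(...)` picks the right tokenizer class for a given checkpoint name, so you do not have to import a model-specific class such as `BertTokenizer`.
- Either way you get a `tokenizer` object with the same interface: `tokenize`, `convert_tokens_to_ids`, `decode`, and direct calls such as `tokenizer(text)`.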

## 2. Understanding Embeddings

**Embeddings** are numerical representations of tokens in a high-dimensional space. Each token is converted to a vector of numbers that captures its meaning and relationships to other tokens.

Key properties of embeddings:
- Similar words have similar embeddings
- Embeddings capture semantic relationships
- Typical dimensions: 384, 512, 768, 1024, or higher (the MiniLM model used below produces 384-dimensional vectors)
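
"Similar words have similar embeddings" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below computes it by hand on three-dimensional toy vectors. The vectors are invented for illustration and do not come from any model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (assumption, for illustration only)
toy_embeddings = {
    "porosity":     [0.9, 0.1, 0.0],
    "permeability": [0.8, 0.2, 0.1],
    "casing":       [0.0, 0.2, 0.9],
}

for other in ("permeability", "casing"):
    sim = cosine_similarity(toy_embeddings["porosity"], toy_embeddings[other])
    print(f"porosity vs {other}: {sim:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.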

In [None]:
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

# Load a small model for embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer_embed = AutoTokenizer.from_pretrained(model_name)
model_embed = AutoModel.from_pretrained(model_name)

def get_embeddings(texts, tokenizer, model):
    """Get sentence embeddings"""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
        # Use CLS token embedding (first token) for sentence representation
        embeddings = outputs.last_hidden_state[:, 0, :]
    
    return embeddings.numpy()

# Get embeddings for geoscience terms
geoscience_terms = [
    "seismic inversion",
    "reservoir characterization", 
    "hydrocarbon exploration",
    "petrophysical analysis",
    "porosity measurement",
    "permeability analysis"
]

embeddings = get_embeddings(geoscience_terms, tokenizer_embed, model_embed)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each term is represented by {embeddings.shape[1]} numbers")
print(f"\nFirst 10 embedding values for '{geoscience_terms[0]}':")
print(embeddings[0][:10])
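
The cell above uses the first token's vector as the sentence embedding. A common alternative, and the pooling shown on the `all-MiniLM-L6-v2` model card, is to average all token vectors while skipping padding positions. The sketch below shows that idea on plain Python lists so the arithmetic is easy to follow. The helper `masked_mean_pool` and the toy numbers are assumptions for illustration; with real model output the same operation would be applied to `last_hidden_state` and the `attention_mask`.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sentence: 3 real tokens followed by 1 padding position, 2-dimensional vectors
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(masked_mean_pool(token_vectors, attention_mask))  # [2.0, 2.0]
```

The padded vector `[9.0, 9.0]` is ignored, so padding does not pull the sentence embedding away from the real tokens.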

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Comprehensive geophysics and petroleum engineering texts
geophysics_texts = [
    # Seismic Methods
    "Seismic inversion transforms seismic reflection data into quantitative subsurface rock properties.",
    "P-wave velocity depends on rock density and bulk modulus in elastic media.",
    "S-wave velocity is controlled by shear modulus and density of the formation.",
    "Seismic amplitude variation with offset reveals fluid content and lithology changes.",
    "Pre-stack seismic inversion simultaneously estimates multiple elastic parameters from angle stacks.",
    "Post-stack seismic inversion derives acoustic impedance from normal incidence reflectivity.",
    "Seismic interpretation identifies structural features like faults, folds, and stratigraphic boundaries.",
    "Time-lapse seismic monitoring tracks reservoir changes during production and injection.",
    
    # Reservoir Properties
    "Porosity measures the void space available for fluid storage in reservoir rocks.",
    "Permeability quantifies the ability of fluids to flow through porous rock matrices.",
    "Water saturation represents the fraction of pore space occupied by formation water.",
    "Net-to-gross ratio indicates the proportion of reservoir quality rock in a formation.",
    "Reservoir pressure drives hydrocarbon flow from formation to wellbore during production.",
    "Capillary pressure controls fluid distribution at pore scale in reservoir rocks.",
    "Relative permeability curves describe multiphase flow behavior in porous media.",
    "Formation volume factor accounts for fluid expansion from reservoir to surface conditions.",
    
    # Well Logging
    "Gamma ray logs measure natural radioactivity to identify shale and clay content.",
    "Resistivity logs detect hydrocarbon presence by measuring electrical resistance of formations.",
    "Neutron logs respond to hydrogen content, indicating porosity and fluid types.",
    "Density logs measure bulk density to calculate porosity and identify lithology.",
    "Photoelectric factor from density logs helps distinguish different rock types and minerals.",
    "Spontaneous potential logs indicate permeable zones and formation water resistivity.",
    "Caliper logs measure borehole diameter to identify washouts and tight spots.",
    "Nuclear magnetic resonance logs provide pore size distribution and permeability estimates.",
    
    # Drilling and Completion
    "Drilling mud maintains wellbore stability and carries cuttings to surface.",
    "Casing design protects formations and enables safe drilling to target depths.",
    "Hydraulic fracturing creates artificial fractures to enhance reservoir permeability.",
    "Horizontal drilling maximizes contact with thin reservoir layers.",
    "Perforation creates communication pathways between wellbore and reservoir.",
    "Sand control prevents formation sand production that could damage equipment.",
    "Acidizing dissolves formation damage and enhances near-wellbore permeability.",
    "Wellbore trajectory optimization maximizes reservoir contact while avoiding hazards.",
    
    # Production Engineering
    "Artificial lift systems maintain production when reservoir pressure declines.",
    "Nodal analysis optimizes production system performance from reservoir to separator.",
    "Decline curve analysis predicts future production rates and ultimate recovery.",
    "Enhanced oil recovery techniques mobilize residual oil after primary depletion.",
    "Water flooding maintains reservoir pressure and sweeps oil toward producers.",
    "Gas injection improves oil recovery through miscible or immiscible displacement.",
    "Thermal recovery methods reduce oil viscosity in heavy oil reservoirs.",
    "Production optimization balances rate, pressure, and equipment constraints.",
    
    # Geology and Geochemistry
    "Source rock maturation generates hydrocarbons through thermal decomposition of organic matter.",
    "Migration pathways allow hydrocarbons to move from source to reservoir rocks.",
    "Structural traps accumulate hydrocarbons through folding and faulting processes.",
    "Stratigraphic traps form through depositional and diagenetic rock property changes.",
    "Seal integrity prevents hydrocarbon leakage from reservoir to surface.",
    "Basin modeling predicts hydrocarbon generation, migration, and accumulation timing.",
    "Sequence stratigraphy correlates rock units across regional geological frameworks.",
    "Diagenesis modifies reservoir quality through cementation and dissolution processes.",
    
    # Advanced Technologies
    "Machine learning algorithms identify patterns in seismic and well log data.",
    "Digital twins create virtual reservoir models for production optimization.",
    "Microseismic monitoring tracks fracture growth during stimulation operations.",
    "Fiber optic sensing provides distributed measurements along wellbore length.",
    "Cloud computing enables large-scale reservoir simulation and data analytics.",
    "Automated drilling systems improve efficiency and reduce human error.",
    "Real-time optimization adjusts operations based on continuous data streams.",
    "Carbon capture and storage requires geological characterization for safe sequestration."
]

print(f"Total number of geophysics texts: {len(geophysics_texts)}")
print("Sample texts:")
for i, text in enumerate(geophysics_texts[:5]):
    print(f"{i+1}. {text}")

In [None]:
# Encode all geophysics sentences
inputs = tokenizer(geophysics_texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:,0,:]  # CLS token

print(f"Embeddings shape: {embeddings.shape}")
print(f"Each sentence is represented by {embeddings.shape[1]} dimensional vector")

# Reduce dimensions to 3D with t-SNE
# Adjust perplexity based on dataset size (should be less than n_samples)
perplexity = min(30, len(geophysics_texts) - 1)
tsne = TSNE(n_components=3, perplexity=perplexity, random_state=42, max_iter=1000)
embeddings_3d = tsne.fit_transform(embeddings.numpy())

print(f"3D embeddings shape: {embeddings_3d.shape}")
print(f"Using perplexity: {perplexity}")

In [None]:
import plotly.express as px
import numpy as np

# Create category labels for color coding: the texts above are grouped in
# consecutive blocks of 8 sentences per category
categories = [
    "Seismic Methods", "Reservoir Properties", "Well Logging", "Drilling & Completion",
    "Production Engineering", "Geology & Geochemistry", "Advanced Technologies"
]
category_names = [name for name in categories for _ in range(8)]

# Create the 3D scatter plot
fig = px.scatter_3d(
    x=embeddings_3d[:,0],
    y=embeddings_3d[:,1],
    z=embeddings_3d[:,2],
    hover_name=geophysics_texts,
    color=category_names,
    title="Interactive 3D Geophysics Text Embeddings",
    labels={'x':'Dimension 1', 'y':'Dimension 2', 'z':'Dimension 3'},
    color_discrete_sequence=px.colors.qualitative.Set3
)

fig.update_traces(marker=dict(size=6, opacity=0.8))
fig.update_layout(
    font_family='Arial',
    width=1000, 
    height=800,
    scene=dict(
        xaxis_title="Semantic Dimension 1",
        yaxis_title="Semantic Dimension 2",
        zaxis_title="Semantic Dimension 3"
    )
)

fig.show()

In [None]:
# Analyze semantic similarities within categories
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings.numpy())

# Find most similar sentence pairs
similarity_pairs = []
for i in range(len(geophysics_texts)):
    for j in range(i+1, len(geophysics_texts)):
        similarity_pairs.append({
            'text1': geophysics_texts[i][:50] + '...',
            'text2': geophysics_texts[j][:50] + '...',
            'category1': category_names[i],
            'category2': category_names[j],
            'similarity': similarity_matrix[i, j],
            'same_category': category_names[i] == category_names[j]
        })

# Convert to DataFrame and sort by similarity
df_similarities = pd.DataFrame(similarity_pairs)
df_top_similar = df_similarities.nlargest(10, 'similarity')

print("Top 10 Most Similar Sentence Pairs:")
print("=" * 80)
for idx, row in df_top_similar.iterrows():
    same_cat = "✓" if row['same_category'] else "✗"
    print(f"Similarity: {row['similarity']:.3f} | Same Category: {same_cat}")
    print(f"1. [{row['category1']}] {row['text1']}")
    print(f"2. [{row['category2']}] {row['text2']}")
    print("-" * 80)

# Calculate average similarity within vs between categories
within_category_sim = df_similarities[df_similarities['same_category']]['similarity'].mean()
between_category_sim = df_similarities[~df_similarities['same_category']]['similarity'].mean()

print(f"\nAverage similarity within same category: {within_category_sim:.3f}")
print(f"Average similarity between different categories: {between_category_sim:.3f}")
print(f"Difference: {within_category_sim - between_category_sim:.3f}")

## 3. Understanding Context Windows

**Context window** refers to the maximum number of tokens a model can process at once. This is a crucial limitation that affects:
- How much text the model can "remember"
- The maximum input size for generation tasks
- Computational requirements

Common context window sizes:
- GPT-2: 1,024 tokens
- GPT-3: 2,048 tokens (GPT-3.5: 4,096 tokens)
- GPT-4: 8,192 - 32,768 tokens
- Claude: 100,000+ tokens

In [None]:
# Demonstrate context window limitations
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load GPT-2 model
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')

# Set pad token
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token

print(f"GPT-2 maximum position embeddings: {model_gpt2.config.n_positions}")
print(f"This means the context window is {model_gpt2.config.n_positions} tokens")

# Create a long geoscience text to test context limits
long_text = """
Seismic inversion is a geophysical technique used to derive subsurface properties from seismic data. 
The process involves converting seismic reflection data into quantitative rock and fluid properties such as 
acoustic impedance, porosity, and lithology. This technique is fundamental in hydrocarbon exploration 
and reservoir characterization. The inversion process typically starts with seismic data acquisition, 
followed by data processing, and finally the inversion itself. There are several types of seismic inversion 
including post-stack inversion, pre-stack inversion, and simultaneous inversion. Post-stack inversion 
works with stacked seismic data to derive acoustic impedance. Pre-stack inversion uses angle-dependent 
reflectivity information to derive multiple elastic properties. Simultaneous inversion integrates seismic 
and well log data to provide more accurate and detailed subsurface models.
""" * 10  # Repeat to make it longer

# Tokenize the long text
tokens = tokenizer_gpt2.tokenize(long_text)
print(f"\nLong text has {len(tokens)} tokens")
print(f"Exceeds context window: {len(tokens) > model_gpt2.config.n_positions}")

# Show what happens when we truncate
max_length = model_gpt2.config.n_positions - 50  # Leave room for generation
truncated_tokens = tokens[:max_length]
print(f"Truncated to {len(truncated_tokens)} tokens for processing")
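
Truncation discards everything after the cutoff. When the whole text matters, a common workaround is to split the tokens into overlapping windows and process each window separately. The sketch below shows one way to do that on a plain list. The helper `chunk_tokens` and its `overlap` parameter are illustrative names, not part of the `transformers` API.

```python
def chunk_tokens(tokens, window, overlap=0):
    """Split a token list into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens so that context at the
    chunk boundaries is not lost entirely.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Toy example: 10 "tokens", windows of 4 with 1 token of overlap
toy_tokens = list(range(10))
for chunk in chunk_tokens(toy_tokens, window=4, overlap=1):
    print(chunk)
```

With real data, `window` would be set below the model's context size (for GPT-2, `model_gpt2.config.n_positions`) to leave room for any generated tokens.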

## 4. Loading a Small HuggingFace Model

Let's load and explore a small language model suitable for text generation tasks.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small, efficient model for text generation
model_name = "distilgpt2"  # Smaller, faster version of GPT-2
tokenizer_gen = AutoTokenizer.from_pretrained(model_name)
model_gen = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token
if tokenizer_gen.pad_token is None:
    tokenizer_gen.pad_token = tokenizer_gen.eos_token

print(f"Model: {model_name}")
print(f"Vocabulary size: {tokenizer_gen.vocab_size:,}")
print(f"Model parameters: {model_gen.num_parameters():,}")
print(f"Context window: {model_gen.config.n_positions} tokens")
print(f"Embedding dimension: {model_gen.config.n_embd}")

## 5. Generate Simple Text Completions

Now let's use our model to generate text completions with various prompts.

In [None]:
def generate_text(prompt, tokenizer, model, max_length=100, temperature=0.7, num_return_sequences=1):
    """Generate text completions for a prompt (max_length counts the prompt tokens too)"""
    inputs = tokenizer(prompt, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            no_repeat_ngram_size=2  # Avoid repetition
        )
    
    generated_texts = []
    for output in outputs:
        generated_text = tokenizer.decode(output, skip_special_tokens=True)
        generated_texts.append(generated_text)
    
    return generated_texts

# Test with simple prompts
simple_prompts = [
    "The geology of this region",
    "Oil and gas exploration requires",
    "Seismic waves travel through"
]

print("=== Simple Text Completions ===")
for prompt in simple_prompts:
    generated = generate_text(prompt, tokenizer_gen, model_gen, max_length=60)
    print(f"\nPrompt: '{prompt}'")
    print(f"Completion: '{generated[0]}'")
    print("-" * 80)

In [None]:
# Experiment with different generation parameters
prompt = "Reservoir characterization involves"

print("=== Effect of Different Parameters ===")
print(f"Prompt: '{prompt}'\n")

# Low temperature (more deterministic)
low_temp = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=0.3)
print(f"Low temperature (0.3): {low_temp[0]}")

# High temperature (more creative)
high_temp = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=1.2)
print(f"High temperature (1.2): {high_temp[0]}")

# Multiple generations
multiple = generate_text(prompt, tokenizer_gen, model_gen, max_length=50, temperature=0.8, num_return_sequences=3)
print("\nMultiple generations:")
for i, gen in enumerate(multiple, 1):
    print(f"{i}. {gen}")
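
The temperature effect seen above can be reproduced on toy numbers. During sampling, the model's raw scores (logits) for the next token are divided by the temperature before being turned into probabilities with a softmax. The sketch below applies that to three made-up logits; the function name and the values are assumptions for illustration, not output from the model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.5]

for t in (0.3, 0.7, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

A low temperature concentrates probability on the top-scoring token, so samples are more repeatable; a high temperature flattens the distribution, so less likely tokens are picked more often.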