![Workshop Banner](assets/S1_M2.png)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CLDiego/SPE_GeoHackathon_2025/blob/dev/S1_M2_ChatAgent.ipynb)

***
# Session 01 // Module 02: LangChain

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/write.svg" width="20"/> Goal: Build a geoscience-focused chat assistant using modern LangChain and Hugging Face Transformers.

This module upgrades from basic HF usage to a real chat workflow:
- Load a conversational model (Phi-3 / Llama 3.1) efficiently.
- Compose prompts and chains using LCEL (LangChain Expression Language).
- Add conversation memory with RunnableWithMessageHistory.
- Ship an interactive Gradio UI.
- Explore an advanced exercise with a larger scientific model.

## What we’ll use and why

- LangChain (modern API: LCEL, ChatModels, Runnables)
  - Compose pipelines with a declarative operator (|).
  - Treat “chat” as structured messages, not plain strings.
- Hugging Face Transformers
  - Load and run open models locally or in Colab.
  - Stream tokens, quantize large models with BitsAndBytes.
- BitsAndBytes (4-bit quantization)
  - Fit larger models in memory; good for Colab GPUs. Not supported on macOS Metal (MPS).
- Gradio
  - Simple, fast UI to interact with your agent in the browser.

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Tip: Prefer running this notebook on Colab with a GPU. On Apple Silicon (MPS), BitsAndBytes 4-bit is not supported — use a small model or CPU-friendly settings (snippet below).

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Environment setup
!pip -q install langchain langchain-core langchain-community langchain-huggingface torch gradio
!pip -q install bitsandbytes==0.46.0 transformers==4.48.3 

In [None]:
# Hugging Face API token
# Retrieving the token is required to get access to HF hub
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')

## 1. Model loading and quantization

One of the strengths of Hugging Face Transformers is that you can run models locally, either on your own machine or in a Colab notebook. This gives you control over the model, keeps your data private, and avoids API fees.

The trade-off is size: local LLMs often have billions of parameters and must be loaded carefully to fit in memory. We’ll use Hugging Face Transformers with BitsAndBytes 4-bit quantization to run a compact model (Phi-3 mini, 3.8B parameters) on a single Colab GPU. The same recipe applies to 7–8B models such as Llama 3.1 8B.

4-bit quantization (BitsAndBytes) stores each weight in a quarter of the space float16 needs, usually with modest quality loss. 8-bit is also an option: it preserves more precision but saves less memory, so 4-bit is the usual choice when a larger model (7B+) has to fit on a Colab GPU. On macOS (MPS), BitsAndBytes is not supported, so you’ll need to use a small model or CPU-friendly settings.
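
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It counts the weights only and ignores activations, the KV cache, and quantization metadata, so real usage will be higher. The helper name `weight_memory_gb` and the round model sizes are illustrative assumptions, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Round, hypothetical model sizes chosen for illustration.
model_sizes = [("1B", 1e9), ("4B", 4e9), ("8B", 8e9)]

for label, params in model_sizes:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{label:>3} model @ {bits:>2}-bit: {gb:5.1f} GB")
```

Halving the bits per weight halves the weight memory, which is why 4-bit loading lets a model that would not fit in float16 run on a single Colab GPU.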

***

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Hardware notes
> - Colab + GPU: 4-bit quantization (BitsAndBytes) is great for 7–8B models.
> - macOS (MPS): BitsAndBytes is not supported. Use a small model (≤1–2B), CPU, or skip quantization (see alternative snippet).
> - Hugging Face gated models require an HF token and license acceptance.
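
The hardware notes above can be read as a small decision table. The sketch below encodes it as a plain function; `choose_loading_plan` and its dictionary keys are hypothetical names used only for illustration, and the size hints simply restate the notes.

```python
def choose_loading_plan(has_cuda: bool, has_mps: bool) -> dict:
    """Map the available hardware to the loading strategy from the notes."""
    if has_cuda:
        # NVIDIA GPU (e.g. Colab): BitsAndBytes 4-bit is available.
        return {"device": "cuda", "quantize_4bit": True, "model_size_hint": "7-8B"}
    if has_mps:
        # Apple Silicon: no BitsAndBytes, so pick a small model.
        return {"device": "mps", "quantize_4bit": False, "model_size_hint": "1-2B"}
    # CPU fallback: small model, no quantization.
    return {"device": "cpu", "quantize_4bit": False, "model_size_hint": "1-2B"}


# In the notebook you would pass the real checks, for example:
#   choose_loading_plan(torch.cuda.is_available(), torch.backends.mps.is_available())
print(choose_loading_plan(has_cuda=True, has_mps=False))
print(choose_loading_plan(has_cuda=False, has_mps=True))
```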

***

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/write.svg" width="20"/> Key parameters (`BitsAndBytesConfig` and `from_pretrained`)
> - `load_in_4bit=True`: Enable 4-bit weight quantization (huge memory savings).
> - `bnb_4bit_quant_type="nf4"`: NormalFloat4, a 4-bit data type designed for normally distributed weights (good quality for LLMs). The other accepted value is `fp4` (plain 4-bit float).
> - `bnb_4bit_compute_dtype=torch.bfloat16`: Arithmetic precision during compute; the 4-bit weights are dequantized to this dtype for matrix multiplications.
> - `bnb_4bit_use_double_quant=True`: Second-stage quantization (the quantization constants are themselves quantized) for further memory savings.
> - `device_map="auto"` (argument to `from_pretrained`): Automatically place model layers on available devices (GPU/CPU).
> - `torch_dtype=torch.float16` (argument to `from_pretrained`): Use half-precision floats to save memory and speed up computation. You can also use `torch.bfloat16` if your hardware supports it: it keeps the exponent range of float32, so it is less prone to overflow, at the cost of fewer mantissa bits. The "b" stands for "brain" (the format originated at Google Brain); it is supported on TPUs and newer GPUs.

***

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> Snippet: Alternative for Mac (skip 4-bit)

```python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # small & MPS-friendly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.backends.mps.is_available() else torch.float32,
    device_map="auto"  # will use 'mps' if available
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

```

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> **Tip**: If you encounter CUDA out-of-memory errors, try lowering `max_new_tokens` or switching to a smaller model. If memory allows, 8-bit instead of 4-bit quantization preserves more precision at the cost of roughly twice the weight memory.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Use a conversational (instruction-tuned) model.
# Alternative (gated: requires an HF token and license acceptance):
# model_name = "meta-llama/Llama-3.1-8B-Instruct"
model_name = "microsoft/Phi-3-mini-4k-instruct"

# Load the tokenizer and the quantized model
# Steps:
# 1. Load tokenizer
# 2. Create quantization config
# 3. Load model with quantization config
tokenizer = AutoTokenizer.from_pretrained(model_name)

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map="auto", 
    quantization_config=quant_config
)

# Set pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

## 2. From Transformers pipeline to a LangChain ChatModel

We are going to build a local text-generation stack and expose it through LangChain’s chat interface.

- Transformers pipeline
  - The Hugging Face `pipeline("text-generation", ...)` wraps your `model` and `tokenizer` and calls `model.generate(...)` under the hood with the decoding parameters you pass.
  - It handles tokenization, batched inference, and decoding for you.

- LangChain wrappers
  - `HuggingFacePipeline`: Bridges a HF pipeline into a LangChain LLM so it can be composed in chains (LCEL).
  - `ChatHuggingFace`: Adapts that LLM to a chat-style interface (system/human/ai messages), making it compatible with `ChatPromptTemplate`, `RunnableWithMessageHistory`, etc.


In [None]:
from transformers import pipeline
# Both wrappers live in the langchain-huggingface package installed above
from langchain_huggingface import HuggingFacePipeline, ChatHuggingFace

# Create text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=150,
    temperature=0.2,
    do_sample=True, # Sampling enables more diverse outputs
    pad_token_id=tokenizer.eos_token_id,
    return_full_text=False # The generated text will not include the prompt
)

# Create LangChain LLM
llm = HuggingFacePipeline(pipeline=pipe)

# Wrap with ChatHuggingFace for modern interface
chat_model = ChatHuggingFace(llm=llm)

print(f"Model loaded: {model_name}")
print(f"Model parameters: {model.num_parameters():,}")

### Why these parameters?

- `max_new_tokens=150`
  - Caps how many tokens the model may generate beyond the prompt. Lower for faster, higher for more complete answers.
- `temperature=0.2`
  - Controls randomness. 0.2 is conservative and tends to be factual/concise. Increase for more creativity.
- `do_sample=True`
  - Enables stochastic decoding (required for temperature/top_p/top_k to take effect). Set False for deterministic greedy decoding.
- `pad_token_id=tokenizer.eos_token_id`
  - Many causal LMs have no pad token. Using EOS avoids padding-related errors in batching.
- `return_full_text=False`
  - Return only the model’s continuation (not the prompt), which is what you want in most chat UIs.

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> **Tips**:
> - Add `top_p=0.9` or `top_k=50` to control sampling. Use one at a time for easier tuning.
> - For repetition issues, try `repetition_penalty=1.1` or `no_repeat_ngram_size=2`.
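
To build intuition for `temperature` and `top_p`, the sketch below applies both to a made-up set of logits using only the standard library. It is a simplified illustration of the idea (logits divided by the temperature before the softmax, then a nucleus cut-off), not the Transformers implementation; the function names and the logit values are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

probs_t1 = softmax_with_temperature(logits, 1.0)
print("tokens kept with top_p=0.8:", top_p_filter(probs_t1, 0.8))
```

At low temperature almost all probability lands on the top token, which is why `temperature=0.2` behaves close to greedy decoding even with `do_sample=True`.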



## 3. Prompting with ChatPromptTemplate

Structure matters for chat models. A good System message defines persona and scope; Human carries the user’s question. For memory-enabled chats, insert MessagesPlaceholder("history").

Guidelines for the system prompt:
- Define role, expertise, and tone (e.g., Dr. GeoBot).
- List do/don’t rules: brevity, technical accuracy, safety.
- Encourage step-by-step explanations when math is needed.

***

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> Snippet: Quick test

```python

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

test = (ChatPromptTemplate.from_messages([
    ("system", "You are Dr. GeoBot, a concise geoscience expert."),
    ("human", "{question}")
]) | chat_model | StrOutputParser())

test.invoke({"question": "Explain porosity vs permeability."})

```

In [None]:
from langchain_core.prompts import ChatPromptTemplate

# Create a system prompt for geoscience expertise
# The system prompt sets the behavior and personality of the assistant
system_prompt = """
You are Dr. GeoBot, an expert geophysicist and petroleum engineer with 20 years of experience.
You specialize in seismic interpretation, reservoir characterization, and hydrocarbon exploration.

Guidelines:
- Provide accurate, helpful answers about geoscience topics
- Keep responses concise but informative (2-3 sentences)
- Use technical terms but explain them when needed
- Focus on practical applications and formulas
- If unsure, acknowledge limitations
"""

# Create chat prompt template
prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{question}")
])

# Test the template
test_question = "What is porosity?"
formatted_prompt = prompt_template.format_messages(question=test_question)
print("Formatted prompt:")
for message in formatted_prompt:
    print(f"{message.type}: {message.content}")

## 4. LCEL chains: from prompt to answer

LangChain Expression Language (LCEL) lets you compose pipelines declaratively using the `|` operator. Each component (prompt, model, parser) is a Runnable that takes an input and produces an output, so the wrappers from Section 2 slot directly into a full chat pipeline:

- `llm = HuggingFacePipeline(...)` turns the local model into a Runnable that can take part in LCEL chains.
- `chat_model = ChatHuggingFace(llm=llm)` adds the message-based interface needed for chat prompts, memory, and multi-turn workflows.
- `prompt | chat_model | StrOutputParser()` is the resulting chain.
  - `StrOutputParser()` extracts the text from the model’s response message. It’s useful when you want just the generated text without any additional metadata.
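
If the `|` syntax looks like magic, the toy sketch below shows the underlying idea in plain Python: an object that implements `__or__` can return a new object that runs both steps in order. The `Step` class and the fake model are hypothetical stand-ins written for this illustration; LangChain's real Runnables add batching, streaming, and async support on top of the same concept.

```python
class Step:
    """Toy stand-in for a Runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # `a | b` builds a new Step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Three toy components mirroring prompt -> model -> parser.
toy_prompt = Step(lambda inputs: f"Question: {inputs['question']}")
toy_model = Step(lambda text: {"content": text.upper()})  # pretend "model"
toy_parser = Step(lambda message: message["content"])

toy_chain = toy_prompt | toy_model | toy_parser
toy_result = toy_chain.invoke({"question": "What is porosity?"})
print(toy_result)
```

Each stage only needs to agree with its neighbours on input and output types, which is what makes swapping the model or the parser a one-line change.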

***
> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> **Snippet**: Quick usage example

```python

# Compose a minimal chain (later cell)
simple_chain = prompt_template | chat_model | StrOutputParser()
simple_chain.invoke({"question": "What is seismic inversion?"})

```

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> **Sanity checks**
> - `print(f"Model loaded: {model_name}")`: Confirms which repo/model is active.
> - `print(f"Model parameters: {model.num_parameters():,}")`: Rough size indicator; larger models need more VRAM/quantization.

In [None]:
from langchain_core.output_parsers import StrOutputParser

# Create a simple chain using LCEL
simple_chain = prompt_template | chat_model | StrOutputParser()

# Test the chain
print("=== Testing Simple Chain ===")
response = simple_chain.invoke({"question": "What is the difference between porosity and permeability?"})
print(f"Response: {response}")

## 5. Low-level Transformers and streaming

Sometimes you need raw control (e.g., custom token streaming to console).

- `TextStreamer`: Streams decoded tokens as they are generated.
- Manual prompt strings: You can assemble the prompt by hand with tags such as [SYSTEM]/[USER], but each instruct model is trained on its own chat format (check the model card).

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Caveat: Each model has its own chat template. Prefer `tokenizer.apply_chat_template` if available.

***

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> **Snippet**: Streaming output example

```python

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("User: What is porosity?\nAssistant:", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)

```

In [None]:
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

full_prompt = ""
for msg in formatted_prompt:
    if msg.type == "system":
        full_prompt += f"[SYSTEM]\n{msg.content}\n"
    elif msg.type == "human":
        full_prompt += f"[USER]\n{msg.content}\n"

inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

# Stream tokens as they are generated
# (assign to _ so the notebook does not also print the raw output tensor)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)

In [None]:
# Test with multiple geoscience questions
test_questions = [
    "What is seismic resolution?",
    "How do P-waves differ from S-waves?",
    "What factors affect hydrocarbon migration?"
]

print("=== Testing Multiple Questions ===")
for i, question in enumerate(test_questions, 1):
    print(f"\n{i}. Question: {question}")
    response = simple_chain.invoke({"question": question})
    print(f"   Answer: {response}")
    print("-" * 80)

## 6. Conversation memory with RunnableWithMessageHistory

Why memory?
- Keeps context across turns (so references like “this technique” resolve correctly).
- Reduces repetition and improves coherence.

How it works:
- You provide a function that returns a `BaseChatMessageHistory` per `session_id`.
- `RunnableWithMessageHistory` injects/updates that history automatically when invoking the chain.

Memory best practices:
- Keep histories short on small context windows.
- Summarize old messages if conversations get long (advanced).
- Use session IDs per user/tab.
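
As a minimal sketch of the "keep histories short" advice, the helper below keeps only the most recent turns of a conversation. It works on plain `(role, content)` tuples rather than LangChain message objects, and `trim_history` is a hypothetical name used for illustration.

```python
def trim_history(messages, max_turns):
    """Keep only the last `max_turns` exchanges (one human + one ai message each)."""
    if max_turns <= 0:
        return []
    return messages[-2 * max_turns:]


example_history = [
    ("human", "What is seismic inversion?"),
    ("ai", "It estimates rock properties from seismic data."),
    ("human", "What are the main types?"),
    ("ai", "Broadly, deterministic and stochastic approaches."),
    ("human", "Which is most common?"),
    ("ai", "That depends on the data and the objective."),
]

recent = trim_history(example_history, max_turns=2)
for role, content in recent:
    print(f"{role}: {content}")
```

Dropping old turns is the simplest policy; summarizing them instead preserves more context at the cost of an extra model call.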

***

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> **Snippet**: Memory-enabled chain

```python
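from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

store = {}  # session_id -> ChatMessageHistory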

conversational_prompt = ChatPromptTemplate.from_messages([
  ("system", "You are Dr. GeoBot..."),
  MessagesPlaceholder(variable_name="history"),
  ("human", "{question}")
])

conv_chain = conversational_prompt | chat_model | StrOutputParser()
mem_chain = RunnableWithMessageHistory(
  conv_chain,
  lambda sid: store.setdefault(sid, ChatMessageHistory()),
  input_messages_key="question",
  history_messages_key="history",
)

```

In [None]:
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

# Create conversational prompt template with history
conversational_prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{question}")
])

# Create the conversational chain
conversational_chain = conversational_prompt | chat_model | StrOutputParser()

# Store for conversation histories
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# Create conversational chain with memory
conversational_with_memory = RunnableWithMessageHistory(
    conversational_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)

print("Conversational chain with memory created!")

### 6.1 In-depth explanation of memory components

The previous cell builds a chat pipeline that remembers prior turns per `session_id`.

High-level flow:
- Define a chat prompt with a system message + a placeholder for prior messages + the new user question.
- Compose the model pipeline with LCEL: prompt → chat model → text output.
- Maintain a per-session message history in a simple in-memory store.
- Wrap the chain with RunnableWithMessageHistory so history is automatically injected and updated on every call.

***

Key pieces explained:
- `MessagesPlaceholder`
  - Acts as a slot in the prompt where past messages (human/ai) will be inserted each turn.
  - `variable_name="history"` must match `history_messages_key` in the wrapper.

- `ChatMessageHistory`
  - An in-memory list of chat messages for one session (human/ai/system).
  - Used to persist conversation turns across invocations.
  - `BaseChatMessageHistory` is the interface type; `ChatMessageHistory` is the concrete implementation.

- `RunnableWithMessageHistory`
  - Wraps any LCEL chain (Runnable) and:
    - Fetches a session-specific history via your getter (`get_session_history`).
    - Injects that history into the prompt at the `MessagesPlaceholder`.
    - Appends the new user/ai messages after the run.
  - `input_messages_key="question"` tells it which input field is the human message.
  - `history_messages_key="history"` tells it which prompt placeholder to fill.
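
The fetch, inject, and append cycle described above can be imitated in a few lines of plain Python. This is a toy simulation of the flow, not the LangChain implementation: `toy_store`, `toy_get_history`, `toy_model`, and `invoke_with_memory` are hypothetical names, and the "model" only reports how many earlier turns it could see.

```python
toy_store = {}  # session_id -> list of (role, content) tuples


def toy_get_history(session_id):
    """Return the message list for a session, creating it on first use."""
    return toy_store.setdefault(session_id, [])


def toy_model(messages):
    """Pretend model: numbers its answer by how many prior turns are visible."""
    prior_messages = len(messages) - 2  # exclude system prompt and new question
    return f"answer #{prior_messages // 2 + 1}"


def invoke_with_memory(question, session_id):
    history = toy_get_history(session_id)          # 1. fetch session history
    messages = (
        [("system", "You are Dr. GeoBot.")]
        + history                                  # 2. inject it into the prompt
        + [("human", question)]
    )
    answer = toy_model(messages)
    history.append(("human", question))            # 3. append the new turn
    history.append(("ai", answer))
    return answer


first = invoke_with_memory("What is seismic inversion?", "user_a")
second = invoke_with_memory("What are its main types?", "user_a")
other = invoke_with_memory("What is porosity?", "user_b")
print(first, second, other)
```

The second call in `user_a` sees the first exchange, while `user_b` starts from an empty history, which is the per-session isolation the wrapper provides.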

***

How to call with memory:
- Pass a session id via the config:
  - `conversational_with_memory.invoke({"question": "..."}, config={"configurable": {"session_id": "my_user"}})`
- Each unique `session_id` gets its own message list in `store`.
- Change `session_id` to start a fresh conversation, or call `store.clear()` to reset all sessions.

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/reminder.svg" width="20"/> Tip: `store` is a simple Python dict for demo purposes. For multi-user apps, consider a durable backend (e.g., Redis).

In [None]:
# Test conversational memory
print("=== Testing Conversational Memory ===")

session_config = {"configurable": {"session_id": "test_session"}}

# First question
response1 = conversational_with_memory.invoke(
    {"question": "What is seismic inversion?"},
    config=session_config
)
print(f"Q1: What is seismic inversion?")
print(f"A1: {response1}")
print()

# Follow-up question that refers to previous context
response2 = conversational_with_memory.invoke(
    {"question": "What are the main types of this technique?"},
    config=session_config
)
print(f"Q2: What are the main types of this technique?")
print(f"A2: {response2}")
print()

# Another follow-up
response3 = conversational_with_memory.invoke(
    {"question": "Which type is most commonly used in the industry?"},
    config=session_config
)
print(f"Q3: Which type is most commonly used in the industry?")
print(f"A3: {response3}")
print()

# Check memory content
print("=== Current Memory ===")
history = get_session_history("test_session")
for message in history.messages:
    print(f"{message.type}: {message.content[:100]}...")

## 7. Troubleshooting and performance

- CUDA out of memory
  - Lower `max_new_tokens`, use more aggressive quantization (4-bit instead of 8-bit), or choose a smaller model.
- Slow generation
  - Reduce `max_new_tokens`; use a smaller model; close other notebooks. (Temperature affects output randomness, not speed.)
- BitsAndBytes not available (macOS)
  - Use CPU/MPS without quantization; pick a 1–2B model; or run on Colab GPU.
- Tokenizer/pad errors
  - Ensure `tokenizer.pad_token` is set (fallback to `eos_token`).
- HF token issues
  - In Colab: `from google.colab import userdata; userdata.get("HF_TOKEN")`
  - Locally: `export HF_TOKEN=...` (macOS: add to ~/.zshrc), then pass `token=...` to `from_pretrained` if needed (`use_auth_token` is the older, deprecated name).

> <img src="https://github.com/CLDiego/uom_fse_dl_workshop/raw/main/figs/icons/code.svg" width="20"/> macOS: set token for terminal sessions

```bash
# macOS (zsh)
echo 'export HF_TOKEN=YOUR_TOKEN' >> ~/.zshrc
source ~/.zshrc
```
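
Once the variable is exported, Python can read it from the environment. The sketch below uses only the standard library; `get_hf_token` is a hypothetical helper name, and the commented line shows where the value would typically be passed.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment, or None if it is unset or empty."""
    token = os.environ.get(env_var)
    return token if token else None


hf_token_local = get_hf_token()
if hf_token_local is None:
    print("HF_TOKEN is not set; gated models will fail to download.")
else:
    print("HF_TOKEN found.")

# Typical use (not executed here):
#   AutoTokenizer.from_pretrained(model_name, token=hf_token_local)
```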

