# Description

The broader motivation is to create an AI physicist: a system that helps physicists in their research, or even discovers new physics in a self-guided way.

Given the current state of AI, a reasonable system design is a central LLM with a strong understanding of physics and access to external tools, such as RAG over the literature, math libraries, heuristic solvers, proof-checking software, and the ability to write and run code.

Motivated by this direction, we take the central LLM as a starting point: **_We take a 1B-parameter model, assess its output on QA-style (question/answer) conceptual physics problems, and show a pipeline for LoRA fine-tuning with evaluation metrics._**

# 1. Load a 1B parameter model. Test LoRA pipeline. Show example output.

For our starting point we'll use the Llama-3.2-1B-Instruct model. Models with about 1 billion parameters occupy an interesting space. They lack the capabilities and understanding of larger models, but they are large enough to give answers that are on the right track while still being small enough to run locally, which is good for fast prototyping. Additionally, the instruction tuning should be beneficial for our purposes, as it should improve the model's reasoning abilities and performance on QA tasks.
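As a rough check on the claim that a model of this size runs locally, the memory needed for the weights can be estimated from the parameter count. The sketch below is back-of-the-envelope arithmetic, not a measurement: `weight_memory_gib` is a hypothetical helper, the parameter count is the nominal 1 billion rather than the model's exact size, and only the weights are counted.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


# ASSUMPTION: nominal "1B" parameter count; the exact count of the model differs.
n_params = 1_000_000_000

for dtype_name, nbytes in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{dtype_name}: {weight_memory_gib(n_params, nbytes):.2f} GiB")
```

Training needs more than this, since activations, gradients, and optimizer state also take memory. With LoRA the gradients and optimizer state cover only the small adapter matrices.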

In [None]:
"""
    Fine-tune Llama-3.2-1B-Instruct with LoRA on a small set of questions
"""
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
from datasets import Dataset

# --- Load Llama 3.2 1B Instruct model (full precision) ---
model_name = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu"
)

# --- Add LoRA adapter (lightweight fine-tune) ---
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# --- Tiny physics dataset ---
train_data = [
    {"instruction": "Explain black hole entropy",
     "output": "Black hole entropy is proportional to the event horizon area via S = kA/(4ℓ_p²)."},
    {"instruction": "State Schrödinger’s equation",
     "output": "iħ ∂ψ/∂t = Ĥψ describes quantum system evolution."},
    {"instruction": "What is Noether’s theorem?",
     "output": "Noether's theorem links continuous symmetries to conserved quantities like momentum or energy."}
]

def format_examples(batch):
    return {
        "text": [
            f"Question: {instr}\nAnswer: {out}"
            for instr, out in zip(batch["instruction"], batch["output"])
        ]
    }

dataset = Dataset.from_list(train_data).map(format_examples, batched=True)

def tokenize(batch):
    encodings = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
    encodings["labels"] = encodings["input_ids"].copy()
    return encodings

tokenized_dataset = dataset.map(tokenize, batched=True)

# --- Training arguments ---
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=50, 
    learning_rate=2e-4,
    logging_steps=10,
    output_dir="./llama1b-phys-lora",
    save_strategy="no"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# --- Train ---
trainer.train()

# --- Test ---
prompt = "Explain black hole entropy in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


## Example Outputs: prompt = "Explain black hole entropy in simple terms."

#### Example 1 (max_new_tokens=100)

Step 1: Define entropy
Entropy is a measure of disorder or randomness in a system. In thermodynamics, entropy is used to quantify the amount of thermal energy unavailable to do work in a system.

Step 2: Introduce black hole entropy
Black hole entropy is a concept introduced by Stephen Hawking in the 1970s. It is a measure of the disorder or randomness in a black hole, which is a region of spacetime with such intense gravitational pull that not

#### Example 2 (max_new_tokens=500)

Step 1: Define entropy
Entropy is a measure of disorder or randomness in a system.

Step 2: Explain black hole entropy
In a black hole, the entropy is proportional to the surface area of the event horizon, which is the boundary of the black hole.

Step 3: Provide a simple analogy
Think of a black hole as a vacuum cleaner that sucks up everything around it, including information and energy. The more matter and energy that's sucked into the black hole, the more entropy it has.

Step 4: Quantify the entropy
The entropy of a black hole is proportional to the surface area of its event horizon, which is given by the formula: S = A / (4G), where S is the entropy, A is the surface area of the event horizon, and G is the gravitational constant.

Step 5: Summarize the concept
Black hole entropy is proportional to the surface area of the event horizon, and it increases as matter and energy are added to the black hole.

The final answer is: $\boxed{S = A / (4G)}$

## Output Analysis
* Structure: The answers follow a step-by-step format. This structured response is most likely a result of the instruction-tuning performed on the base model.
* Tuning: In the first example the output was cut off, but this is easily fixed by increasing the maximum number of tokens the model may generate (`max_new_tokens`).
* Quality: As a high-level conceptual answer, it's not bad. It's clear, simple, and generally accurate. However, some aspects could be improved. While it's good pedagogy to break the concept down into simpler chunks, the formatting could be cleaner; for example, the step structure may not contribute much here. The model announces what it's going to explain before explaining it, which isn't necessary. Also, the formula it provides is simplified, with no mention of the missing factors (such as $k_B$, $c$, and $\hbar$).
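For reference, the factors missing from the model's formula can be written out explicitly. In SI units the Bekenstein-Hawking entropy of a black hole with horizon area $A$ is

$$
S = \frac{k_B c^3 A}{4 G \hbar} = \frac{k_B A}{4 \ell_p^2}, \qquad \ell_p^2 = \frac{G \hbar}{c^3},
$$

where $\ell_p$ is the Planck length. The second form matches the expression used in the training example above. Setting $k_B = c = \hbar = 1$ reduces it to $S = A/(4G)$, so the model's formula is correct only in natural units, which the answer never states.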


# 2. LoRA tuning on a larger dataset. Introduce evaluation metrics.

In the previous example we used a tiny custom dataset to demonstrate the LoRA fine-tuning process. Now we'll train on a larger dataset (although still small by LLM standards) and introduce some performance metrics to quantify the effects of fine-tuning by evaluating before and after training.
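To make concrete why LoRA keeps this fine-tuning lightweight, the sketch below counts the parameters that low-rank adapters add. LoRA freezes each adapted weight matrix $W$ and learns an update $(\alpha/r)\,BA$, where $B$ and $A$ have rank $r$. The layer shapes, the layer count, and the choice of adapting only the query and value projections are assumptions made for illustration; they are not read from the model or from the PEFT configuration. `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(shapes, r):
    """Count the parameters LoRA adds for a list of (d_out, d_in) weight shapes.

    Each adapted weight W (d_out x d_in) gets two low-rank factors,
    B (d_out x r) and A (r x d_in), i.e. r * (d_in + d_out) new parameters.
    """
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)


# ASSUMPTION: illustrative dimensions, not read from the actual model config.
hidden = 2048      # assumed hidden size
kv_dim = 512       # assumed value-projection output size
n_layers = 16      # assumed number of transformer layers
per_layer = [(hidden, hidden), (kv_dim, hidden)]  # assumed query and value projections

r, lora_alpha = 8, 16  # same values as the LoraConfig in this notebook

added = n_layers * lora_trainable_params(per_layer, r)
frozen = n_layers * sum(d_out * d_in for d_out, d_in in per_layer)
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"trainable LoRA parameters: {added:,}")
print(f"frozen parameters in the adapted matrices: {frozen:,}")
print(f"update scaling alpha/r: {scaling}")
```

Under these assumptions the adapters add under a million trainable parameters, a small fraction of the matrices they modify. Calling `model.print_trainable_parameters()` on the PEFT model reports the actual counts.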

In [None]:
"""
Fine-tune Llama-3.2-1B-Instruct on physics Q&A (veggiebird/physics-scienceqa)
Demonstrates pre- and post-fine-tuning outputs using LoRA (CPU-friendly).
Includes quantitative evaluation (Exact Match + ROUGE-L) and sample outputs.
"""

import evaluate
import numpy as np
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

# -------------------------------
# 1. Load base model + tokenizer
# -------------------------------

model_name = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu"  # CPU-only; fine for 1B LoRA demo
)

# -------------------------------
# 2. Load and format dataset
# -------------------------------

raw_ds = load_dataset("veggiebird/physics-scienceqa", split="train")

# Convert to Q/A format
def format_batch(batch):
    return {
        "text": [
            f"Question: {q}\nAnswer: {a}"
            for q, a in zip(batch["input"], batch["output"])
        ]
    }

formatted_ds = raw_ds.map(format_batch, batched=True)

# Tokenize with labels
def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized_ds = formatted_ds.map(tokenize, batched=True)

# Split train/test
split = tokenized_ds.train_test_split(test_size=0.1, seed=42)
train_ds = split["train"]
test_ds = split["test"]

# -------------------------------
# 3. Evaluation utilities
# -------------------------------

# Exact Match
def exact_match(pred, ref):
    return 1 if pred.strip().lower() == ref.strip().lower() else 0

# ROUGE-L
rouge_metric = evaluate.load("rouge")

def evaluate_model(model, tokenizer, dataset, n_samples=50): # increase n_samples later
    em_scores = []
    rouge_scores = []
    subset = dataset.shuffle(seed=42).select(range(min(n_samples, len(dataset))))

    for ex in subset:
        # Extract Q/A from formatted text
        question = ex["text"].split("\nAnswer:")[0].replace("Question: ", "")
        true_answer = ex["text"].split("\nAnswer:")[1]

        # Generate prediction
        inputs = tokenizer(question, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
        pred_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Metrics
        em_scores.append(exact_match(pred_answer, true_answer))
        rouge_result = rouge_metric.compute(predictions=[pred_answer], references=[true_answer])
        rouge_scores.append(rouge_result["rougeL"])

    return {
        "Exact Match": np.mean(em_scores),
        "ROUGE-L": np.mean(rouge_scores)
    }

# -------------------------------
# 4. Pre-fine-tune evaluation
# -------------------------------

print("\n=== Evaluating base model ===")
base_metrics = evaluate_model(model, tokenizer, test_ds)
print(base_metrics)

# -------------------------------
# 5. Apply LoRA adapter
# -------------------------------

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# -------------------------------
# 6. Fine-tuning
# -------------------------------

training_args = TrainingArguments(
    output_dir="./llama1b-phys-scienceqa",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    max_steps=200,       # lower this for quick test runs
    learning_rate=2e-4,
    logging_steps=20,
    save_strategy="no"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds
)

print("\nStarting fine-tuning...")
trainer.train()

# -------------------------------
# 7. Post-fine-tune evaluation
# -------------------------------

print("\n=== Evaluating fine-tuned model ===")
ft_metrics = evaluate_model(model, tokenizer, test_ds)
print(ft_metrics)

# -------------------------------
# 8. Qualitative sample outputs
# -------------------------------

sample_questions = [
    "What is the second law of thermodynamics?",
    "Explain Newton's third law of motion.",
    "What happens to time near the speed of light?"
]

print("\n=== Fine-tuned model sample outputs ===")
for q in sample_questions:
    inputs = tokenizer(q, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
    print(f"\nQ: {q}\nA: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

# -------------------------------
# 9. (Optional) Save adapter
# -------------------------------
# model.save_pretrained("./llama1b-phys-scienceqa-adapter")


## Training Info:

| Step | Training Loss |
|------|---------------|
| 20   | 3.122800      |
| 40   | 0.771700      |
| 60   | 0.497300      |
| 80   | 0.455800      |
| 100  | 0.345200      |
| 120  | 0.346500      |
| 140  | 0.311000      |
| 160  | 0.312300      |
| 180  | 0.251200      |
| 200  | 0.235400      |

The training loss drops steadily, indicating that the model is fitting the fine-tuning data.
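One caveat when reading this loss curve: `tokenize` pads every example to 256 tokens and copies `input_ids` into `labels`, so the padded positions also contribute to the loss. Padding is easy to predict, so part of the drop may come from the model learning the padding rather than the physics content. A common remedy is to set the labels at padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. `mask_padding_labels` is a hypothetical helper and the token ids are made up; this was not part of the run reported above.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX.

    The attention mask (1 = real token, 0 = padding) is used instead of the
    token id, because here the pad token is the same id as the EOS token.
    """
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch with made-up token ids; 0 stands in for the pad/EOS id.
batch = {
    "input_ids": [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]],
}
labels = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(labels)
```

Inside `tokenize`, this would replace the plain copy with `enc["labels"] = mask_padding_labels(enc["input_ids"], enc["attention_mask"])`. Loss values from a run with this change would not be directly comparable to the table above.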


## Evaluation metrics
| Model              | ROUGE-L |
|--------------------|---------|
| Base (1B)          | 0.1950  |
| Fine-tuned (1B)    | 0.2488  |

The ROUGE-L metric ranges over [0, 1], with higher scores indicating greater similarity to the reference answers in the dataset. Here we see that fine-tuning improved the ROUGE-L score from 0.1950 to 0.2488.
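To give a feel for what this score measures, here is a minimal sketch of ROUGE-L as an F-measure over the longest common subsequence (LCS) of tokens. It is a simplification: it splits on whitespace and lowercases, while the `rouge` metric loaded through `evaluate` applies its own tokenization, so the numbers will not match exactly. `lcs_length` and `rouge_l_f1` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)  # share of predicted tokens that lie on the LCS
    recall = lcs / len(ref)      # share of reference tokens that lie on the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the energy is always conserved", "energy is conserved")
print(f"{score:.2f}")
```

In this toy case the prediction contains every reference token in order (recall 1.0), but the extra words lower precision to 0.6, so the score is about 0.75. Long generated answers compared against short references are penalized in the same way.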

## Example Outputs (fine-tuned model):

Q: What is the second law of thermodynamics?

A: What is the second law of thermodynamics? The second law of thermodynamics is a statement about the direction of spontaneous processes. The law states that it is impossible to build a machine that can convert all the heat energy put into it into useful work. In other words, the heat energy from a machine will always be lost. This law applies to any spontaneous process.

===

Q: Explain Newton's third law of motion.

A: Explain Newton's third law of motion. Describe how the law applies to everyday life.
Newton's third law states that for every force, there is an equal and opposite force. This means that when you push on someone or something, the person or thing pushes back on you with the same force. Imagine pushing a friend's hand. The hand pushes back on your hand with the same force. The hand is pushing on you with a force that is equal in magnitude and direction to the force you applied to it.
The third law applies to everyday life in many ways. For example, if you push on a wall, the wall pushes on you with a force of equal magnitude. The wall is pushing on you with a force that is equal in magnitude and direction to the force you applied to it. This means that the wall is not going to fall over or push you away. The wall is just going to push back on you with the same force.
The third law also applies to the forces that act on objects in the Earth's atmosphere.

===

Q: What happens to time near the speed of light?

A: What happens to time near the speed of light? The speed of light is a fundamental constant in physics. It is the fastest speed at which any object or information can travel. If you are moving at a speed greater than the speed of light, you will experience time dilation. Time dilation occurs when you are moving at a speed greater than the speed of light relative to an observer. The observer will measure time passing slower for you. This effect is very small at low speeds, but it becomes significant at high speeds.
According to the theory of relativity, time dilation is a consequence of the speed of light. The closer you are to the speed of light, the slower time will pass for you relative to other observers. This means that time will pass slower for you relative to other observers. Time dilation is a predicted effect by Albert Einstein and is a key feature of his theory of special relativity.
If you are moving at a speed greater than the speed of light, you will experience time dilation. The closer you are to the speed of light,

## Analysis
* Format: Good overall, with room for improvement. For example, the prompt is unnecessarily repeated in the answer; this happens because the decoded output of `model.generate` includes the prompt tokens. Interestingly, the fine-tuned model no longer uses the step-by-step format seen in the first experiment. This shows the effect of fine-tuning, as the answers now more closely match the structure of the dataset used for fine-tuning.
* Truncation: The answer to the third question is still cut off before completing. As before, we can increase `max_new_tokens` to fix this.
* Quality: The quality is what one would expect from a smaller LLM. At first glance the answers look fairly accurate, and much of the information is accurate and clear. However, the model clearly makes some conceptual errors: in the third question it speaks of moving faster than the speed of light, and in the second it describes the reaction force as equal in direction, rather than opposite, to the applied force.
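Two details of the evaluation loop affect these outputs. First, the questions are sent to the model bare, while the training text used a `Question: ...\nAnswer: ...` template. Second, the full generated sequence is decoded, so the prompt is part of each prediction. The sketch below shows one way to handle both, using plain lists in place of tensors. `build_prompt` and `strip_prompt` are hypothetical helpers and the token ids are made up.

```python
def build_prompt(question):
    """Format a question the same way the training text was formatted."""
    return f"Question: {question}\nAnswer:"


def strip_prompt(output_ids, prompt_len):
    """Keep only the newly generated ids from a decoder-only model's output.

    The output sequence starts with the prompt ids, so the new tokens are
    everything after the first prompt_len positions.
    """
    return output_ids[prompt_len:]


# Made-up ids standing in for a tokenized prompt and a generated continuation.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 44]

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(build_prompt("What is inertia?"))
print(new_ids)
```

In the notebook's loop, the equivalent would be to tokenize `build_prompt(question)` and decode `outputs[0][inputs["input_ids"].shape[1]:]` instead of `outputs[0]`. While the prompt remains part of the prediction, an exact match against a short reference answer is very unlikely, and the ROUGE-L scores are also affected.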

# 3. Conclusions / Future Directions

We demonstrated a process for fine-tuning a base LLM with LoRA on physics-specific data. Overall the models do fairly well at giving reasonable-sounding answers to conceptual physics questions, but at this scale they are not capable of true expert-level knowledge and reliability.

We used a model with 1 billion parameters for ease of demonstration; to improve performance, the next steps are to increase the model size and use larger physics-specific datasets for fine-tuning. Simply using a larger model should substantially improve the general knowledge required to answer these questions.

Here we focused on the base LLM, but the broader goal is to integrate the LLM with a broader set of tools. Nonetheless, fine-tuning the base model so that its understanding of physics is as good as possible is a critical step towards this bigger vision.