# A Systematic Study of Inference Optimization for Mathematical Reasoning
## LLaMA 3 on GSM8K: Hyperparameters and Prompts



### Research Questions:
1. What hyperparameter configurations optimize mathematical reasoning?
2. How do different prompt structures affect accuracy?

### Hypothesis:
Systematic inference optimization can match performance gains from expensive model training.

---

## Setup

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings
warnings.filterwarnings('ignore')

print(f"Setup complete - GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

Setup complete - GPU: Tesla T4


In [None]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)

print(f"✅ Model loaded - Memory: {torch.cuda.memory_allocated(0)/1024**3:.1f}GB")

✅ Model loaded - Memory: 11.9GB


In [None]:
def generate_response(prompt, max_new_tokens=300, temperature=0.6, top_p=0.9, rep_penalty=1.1):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": prompt}
    ]

    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=rep_penalty
        )

    return tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

print("✅ Generation function ready")

✅ Generation function ready


In [None]:
!pip install -q datasets

from datasets import load_dataset
import re

# `trust_remote_code` is no longer supported by `datasets`; GSM8K is served as Parquet.
gsm8k = load_dataset("gsm8k", "main", split="test")
gsm8k_test = gsm8k.shuffle(seed=42).select(range(50))

print(f"✅ Dataset loaded: {len(gsm8k_test)} questions")

✅ Dataset loaded: 50 questions


---

# STUDY 1: Hyperparameter Optimization

**Research Question:** What hyperparameter configuration optimizes mathematical reasoning?

**Method:** Compare six hand-picked configurations of temperature, top_p, and repetition penalty on a 20-question subset.

---
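The cell that follows evaluates six named configurations rather than every combination of the three parameters. As a rough sketch of what an exhaustive grid could look like, the snippet below enumerates all combinations in the same dictionary shape used by `configs`. The value lists and the helper name `build_grid` are assumptions chosen for illustration; they simply reuse the values tried in this notebook.

```python
from itertools import product

# Hypothetical value lists, reusing the values that appear in this notebook.
temperatures = [0.3, 0.6]
top_ps = [0.85, 0.9, 0.95]
rep_penalties = [1.0, 1.1, 1.3]

def build_grid(temps, top_ps, reps):
    """Return one config dict per (temperature, top_p, repetition penalty) combination."""
    return [
        {"name": f"t={t}, p={p}, r={r}", "temp": t, "top_p": p, "rep": r}
        for t, p, r in product(temps, top_ps, reps)
    ]

grid_configs = build_grid(temperatures, top_ps, rep_penalties)
print(len(grid_configs))  # 2 * 3 * 3 = 18 configurations
```

Each entry can be passed through the same evaluation loop as the hand-picked `configs`, at roughly three times the runtime.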

In [None]:
print("="*80)
print("STUDY 1: HYPERPARAMETER OPTIMIZATION")
print("="*80)

# Test on subset for speed (20 questions)
test_subset = gsm8k_test.select(range(20))

configs = [
    {"name": "Paper Default", "temp": 0.6, "top_p": 0.9, "rep": 1.0},
    {"name": "Low Temp", "temp": 0.3, "top_p": 0.9, "rep": 1.0},
    {"name": "High Top-P", "temp": 0.6, "top_p": 0.95, "rep": 1.0},
    {"name": "Low Top-P", "temp": 0.6, "top_p": 0.85, "rep": 1.0},
    {"name": "High Rep Penalty", "temp": 0.6, "top_p": 0.9, "rep": 1.3},
    {"name": "Optimized", "temp": 0.3, "top_p": 0.9, "rep": 1.1},
]

results_hyperparam = {}

for config in configs:
    print(f"\nTesting: {config['name']}")
    correct = 0

    for ex in test_subset:
        answer = ex['answer'].split("####")[-1].strip()
        prompt = f"{ex['question']}\n\nSolve:"

        response = generate_response(
            prompt,
            temperature=config['temp'],
            top_p=config['top_p'],
            rep_penalty=config['rep']
        )

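        # Substring match: any response containing the gold answer string counts
        # as correct (so "5" also matches "15"), which can inflate accuracy.
        if answer in response:
            correct += 1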

    acc = (correct / len(test_subset)) * 100
    results_hyperparam[config['name']] = acc
    print(f"  Accuracy: {acc:.1f}%")

best_config = max(results_hyperparam, key=results_hyperparam.get)

print(f"\n{'='*80}")
print("STUDY 1 RESULTS:")
print(f"{'='*80}")
for name, acc in sorted(results_hyperparam.items(), key=lambda x: x[1], reverse=True):
    marker = " ✅ BEST" if name == best_config else ""
    print(f"{name:20s}: {acc:5.1f}%{marker}")

print(f"\n💡 Finding: {best_config} performs best")
print(f"{'='*80}")

STUDY 1: HYPERPARAMETER OPTIMIZATION

Testing: Paper Default
  Accuracy: 70.0%

Testing: Low Temp
  Accuracy: 85.0%

Testing: High Top-P
  Accuracy: 75.0%

Testing: Low Top-P
  Accuracy: 70.0%

Testing: High Rep Penalty
  Accuracy: 45.0%

Testing: Optimized
  Accuracy: 75.0%

STUDY 1 RESULTS:
Low Temp            :  85.0% ✅ BEST
High Top-P          :  75.0%
Optimized           :  75.0%
Paper Default       :  70.0%
Low Top-P           :  70.0%
High Rep Penalty    :  45.0%

💡 Finding: Low Temp performs best
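The evaluation loops in this notebook count a response as correct when the gold answer string appears anywhere in it, so a gold answer of `5` is also matched by `15` or by an intermediate step. A stricter check, sketched below, compares the last number in the response with the gold answer numerically. The helper names `extract_final_number` and `is_correct` are hypothetical, and taking the last number is itself an assumption that fails when the model keeps writing after its final answer.

```python
import re

def extract_final_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    # Drop thousands separators so "1,250" is read as one number.
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # The lookbehind keeps the "-" in "10-15" from being read as a minus sign.
    numbers = re.findall(r"(?<!\d)-?\d+(?:\.\d+)?", cleaned)
    return float(numbers[-1]) if numbers else None

def is_correct(response, gold):
    """Compare the last number in the response with the gold answer numerically."""
    predicted = extract_final_number(response)
    target = extract_final_number(gold)
    if predicted is None or target is None:
        return False
    return abs(predicted - target) < 1e-6

print(is_correct("The total is 15 apples.", "5"))      # substring match would accept this
print(is_correct("Total: 1,250 dollars", "1250"))
```

Swapping this in for `answer in response` would change the reported accuracies, so the numbers above should be read as results of the substring criterion only.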


---

# STUDY 2: Prompt Engineering

**Research Question:** How do different prompt structures affect accuracy?

**Method:** Compare five prompt templates on the same 15 questions at temperature 0.3.

---

---

# FINAL EVALUATION: Best Configuration on Full Dataset

Using optimal settings from Studies 1 and 2.

---

In [None]:
# Setup
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings
warnings.filterwarnings('ignore')

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
print(f"✅ Model loaded! Memory: {torch.cuda.memory_allocated(0)/1024**3:.1f}GB")

Loading model...

✅ Model loaded! Memory: 11.9GB


In [None]:
def generate_response(prompt, max_new_tokens=300, temperature=0.3, top_p=0.9):
    messages = [{"role": "system", "content": "You are a helpful AI assistant."}, {"role": "user", "content": prompt}]
    formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1
        )

    return tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

print("✅ Function ready")

✅ Function ready


In [None]:
!pip install -q datasets
from datasets import load_dataset

# `trust_remote_code` is no longer supported by `datasets`; GSM8K is served as Parquet.
gsm8k = load_dataset("gsm8k", "main", split="test")
test_set = gsm8k.shuffle(seed=42).select(range(15))

print(f"✅ Dataset loaded: {len(test_set)} questions")

✅ Dataset loaded: 15 questions


In [None]:
print("="*80)
print("STUDY 2: PROMPT ENGINEERING")
print("="*80)

prompt_templates = {
    "Baseline": "{question}\n\nSolve:",

    "Explicit Instructions": "{question}\n\nSolve step by step and provide the final numerical answer:",

    "Role-Based": "You are an expert mathematics teacher. {question}\n\nProvide a clear solution:",

    "Format-Guided": "{question}\n\nProvide your answer in this format:\nReasoning: [your step-by-step work]\nFinal Answer: [number]",

    "Self-Verification": "{question}\n\nSolve carefully, then verify your answer is correct:"
}

results = {}

for prompt_name, template in prompt_templates.items():
    print(f"\nTesting: {prompt_name}")
    correct = 0

    for ex in test_set:
        answer = ex['answer'].split("####")[-1].strip()
        prompt = template.format(question=ex['question'])

        response = generate_response(prompt, temperature=0.3)

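        # Substring match: any response containing the gold answer string counts
        # as correct (so "5" also matches "15"), which can inflate accuracy.
        if answer in response:
            correct += 1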

    acc = (correct / len(test_set)) * 100
    results[prompt_name] = acc
    print(f"  Accuracy: {acc:.1f}% ({correct}/{len(test_set)})")

best = max(results, key=results.get)

print(f"\n{'='*80}")
print("STUDY 2 RESULTS:")
print(f"{'='*80}")
for name, acc in sorted(results.items(), key=lambda x: x[1], reverse=True):
    marker = " ✅ BEST" if name == best else ""
    print(f"{name:25s}: {acc:5.1f}%{marker}")

baseline_acc = results['Baseline']
best_acc = results[best]
improvement = best_acc - baseline_acc

print(f"\n💡 KEY FINDING:")
print(f"   {best} performs best")
print(f"   Improvement over baseline: +{improvement:.1f}%")
print(f"{'='*80}")

STUDY 2: PROMPT ENGINEERING

Testing: Baseline
  Accuracy: 73.3% (11/15)

Testing: Explicit Instructions
  Accuracy: 86.7% (13/15)

Testing: Role-Based
  Accuracy: 73.3% (11/15)

Testing: Format-Guided
  Accuracy: 73.3% (11/15)

Testing: Self-Verification
  Accuracy: 73.3% (11/15)

STUDY 2 RESULTS:
Explicit Instructions    :  86.7% ✅ BEST
Baseline                 :  73.3%
Role-Based               :  73.3%
Format-Guided            :  73.3%
Self-Verification        :  73.3%

💡 KEY FINDING:
   Explicit Instructions performs best
   Improvement over baseline: +13.3%
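The 13.3-point gap above corresponds to two questions out of 15 (13 correct versus 11), from a single sampled run. One way to gauge how much weight such a gap can carry is a confidence interval on each accuracy. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not part of the original analysis; the function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Counts taken from the Study 2 output above.
for name, correct, total in [("Baseline", 11, 15), ("Explicit Instructions", 13, 15)]:
    low, high = wilson_interval(correct, total)
    print(f"{name:25s}: {correct}/{total}  95% CI ~ [{low:.2f}, {high:.2f}]")
```

With only 15 questions each interval spans tens of percentage points and the two overlap, so the ranking of prompts is suggestive at best; a larger test set or repeated runs with different seeds would be needed to confirm it.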
