# Week 6 â€” Exercise 3: Tune, Reason, Act

### Overview
This summative exercise pulls together **efficient fine-tuning (PEFT/LoRA)** and **reasoning + tool-use (CoT/ReAct)**. You will:
1. Run a small **LoRA** fine-tune for text classification and **tune key hyperparameters**.
2. Build a minimal **ReAct** loop (Thought â†’ Action â†’ Observation) with a **Calculator** and **Wikipedia search** tools.
3. Compare **Direct vs CoT vs ReAct** on a few questions and analyze when tools help.

**Deliverables (upload the executed notebook):**
- **Part A:** Best LoRA config + validation metrics + short comparison vs a baseline.
- **Part B:** ReAct transcripts for at least **3** questions (success + one failure).
- **Reflection (150â€“200 words):** What you changed in Imports & Config, why it worked on your hardware, and how tool-use affected correctness.

**Tip:** Use GPU in Colab: *Runtime â†’ Change runtime type â†’ GPU*.

## 1) Setup
Install libraries and print environment info.

In [1]:
#!pip -q install -U pip setuptools wheel > /dev/null
# I need to install different versions of the tools as otherwise I can't use the required transformers modules.
#!pip -q install -U "transformers==4.49.0" "peft==0.17.1" "datasets>=2.19.0" "accelerate>=0.31.0" evaluate wikipedia==1.4.0 > /dev/null
import torch, platform

if torch.cuda.is_available():
    device = torch.device("cuda")
    device_name = torch.cuda.get_device_name(0)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    device_name = "Apple Metal (MPS)"
else:
    device = torch.device("cpu")
    device_name = "CPU"

print(f"âœ… Torch: {torch.__version__} | Device: {device_name} | Python: {platform.python_version()}")



âœ… Torch: 2.9.1 | Device: Apple Metal (MPS) | Python: 3.14.0


## 2) Part A â€” **LoRA Finetune** (Classification)

Weâ€™ll use **GLUE/SST-2** (binary sentiment) with **DistilBERT** for speed. Your job is to **tune the LoRA + training knobs** and report the **best** run.

**What to submit for Part A**
- Best configuration: `MODEL_NAME`, `LR`, `BATCH_SIZE`, `NUM_EPOCHS`, `LORA_RANK`, `LORA_ALPHA`, `LORA_DROPOUT`, `TARGET_MODULES`.
- Metrics: validation accuracy + loss, and runtime. Save to `artifacts/partA_results.json`.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
import evaluate, numpy as np, json, os
from typing import Dict

os.makedirs('artifacts', exist_ok=True)

# ---- Imports & Config (YOU WILL TUNE THESE) ----

# In general the model is quite small, and the first run showed that the accuracy drops rather than increases with every epoch.
# So a gnerler learning approach is required.

# distilbert is an encoder only model based on the BERT architecture (cf. https://huggingface.co/distilbert/distilbert-base-uncased, https://medium.com/@pickleprat/encoder-only-architecture-bert-4b27f9c76860)
# Distillation in this context means that a model has been trained by the original BERT to return the same probabilities with less parameters (= a smaller model).
# Encoder only means that the model returns embeddings rather than newly generated tokens.
# With that, encoder only models provide numerical values that are suitable e.g. for further classification.
MODEL_NAME   = 'distilbert-base-uncased'   # small & fast
# sst2 results from the Stanford Sentiment Treebank and is a list of sentences from movie reviews.
# The sentences are annotated with sentiment labels 0 (negative) and 1 (positive), respectively (cf. https://openreview.net/pdf?id=rJ4km2R5t7).
# The dataset is used to benchmark sentiment analysis tasks.
TASK_NAME    = 'sst2'                      # GLUE/SST-2
SEED         = 42
# The batch size is the number of examples that are inferred before the error is calculated and the weights updated.
BATCH_SIZE   = 32
# The learning rate is set so the gradient finds the minimum error rather than exploding or getting stuck in a local minimum.
# The first run yielded bad results (i.e. accuracy was decreasing with every training epoch.)
# So I change the learning rate to a lower value. (2e-4 > 1e-5)
LR           = 5e-4
# Epochs are number of times all examples are being used in training.
# Giving it another epoch to see if the accuracy increases. (3 > 4)
NUM_EPOCHS   = 5
# The LoRA rank is the interface dimension between the two LoRA matrices A and B. The purpose is to create a bottleneck for the re-trained layer in the original model.
# The rank is considerably smaller than that layer, which results in the number of parameters being updated during training to be much smaller than what would be required
# if the entire model were to be fine tuned.
# the rank may be too high 8 > 4
LORA_RANK    = 16
# The alpha value is a regulation factor that is supposed to ensure that the A and B matrix paameters don't become too big (or vanish). alpha/rank is multiplied with AB
# to result in change to the parameters in the target model's layer: W' = W + Î”W = W + Î±/r*AB
# (cf. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, L., Wang, W., Chen, W., & Chen, Y. (2021). LoRA: Low-rank adaptation of large language models. arXiv:2106.09685. https://doi.org/10.48550/arXiv.2106.09685)
# I'm increasing the alpha value to 32 (16 > 32)
LORA_ALPHA   = 32
# The dropout value ensures that the model doesn't get hung up on individual weights. In order to achieve that, in every batch a sample of weights are set to 0.
# The dropout rate sets the number of 0s. 0.05 = (Random) 5% of all weights in A and B are set to 0 in every training batch.
# This is for training only, and is not applied when inferring.
# (cf. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929â€“1958. http://www.jmlr.org/papers/v15/srivastava14a.html)
# Also increasing the dropout 0.05 > 0.1
LORA_DROPOUT = 0.05
# LoRA can be applied to any layer. Often it is applied to the attention layers. In this case LoRa is applied both to the query and value matrices
# (i.e. two separate A and B matrices, like one AB for q and one AB for v, are trained.)
# Adding q and out to the k, v layers (hoping to increase the performance)
TARGET_MODULES = ["q_lin", "k_lin", "v_lin", "out_lin"]        # DistilBERT attention projections
print('âœ… Config loaded â€” tune these for best results')

raw = load_dataset('glue', TASK_NAME)
# Loading the tokenizer for distilbert_base_uncased
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Apply the tokenizer to a batch of example sentences. Cut off all tokens beyond the max number of tokens.
def tokenize_fn(batch: Dict):
    return tokenizer(batch['sentence'], truncation=True)

# Using map to enable vectorised tokenization in batches
tokenized = raw.map(tokenize_fn, batched=True)
# Setting up the data_collator, which pads token sequences so all have the max number of tokens.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Simple converters from the label integers to strings and vv.
id2label = {0:'negative', 1:'positive'}
label2id = {v:k for k,v in id2label.items()}
print('ðŸ”¹ Dataset:', raw)

âœ… Config loaded â€” tune these for best results
ðŸ”¹ Dataset: DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})


In [6]:

# Setting up the distilbert_base_uncased model and adding a classifier head for 2 labels.
base_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2, id2label=id2label, label2id=label2id
)

# Configuring LoRA
peft_cfg = LoraConfig(
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    target_modules=TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    # excluding the model's bias values from being updated / no LoRA parameters for the bias values will be trained.
    bias='none',
    # Configure for Sequence Classification (as we want to use one embedding or vector representation for the entire sentence, rather than e.g. a list of embeddings for each token)
    task_type='SEQ_CLS'
)
# wrapping distilbert_base_uncased with a LoRA wrapper which will be trained
model = get_peft_model(base_model, peft_cfg)
model.print_trainable_parameters()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925


In [7]:

# In this section we are setting up the training regime.

# Using the standard Hungging Face Accuracy metric
accuracy = evaluate.load('accuracy')

# From the predictions always use the highest logit value as the outcome / inferred label
# Calculate the  accuracy based on that
# It will be done for an entire epoch (as below we set evaluation_strategy='epoch')
# These values are not actually used in training, but only show us how the training goes for each epoch.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# The arguments for the trainer
args = TrainingArguments(
    output_dir='outputs-lora-sst2',
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=NUM_EPOCHS,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    # Make sure to return the weights that yielded the highest accuracy.
    # (Accuracy is used in the compute_metrics function. We could also use other measures.)
    load_best_model_at_end=True,
    # Log the loss every 50 steps, i.e. every 50 batches
    logging_steps=50,
    seed=SEED,
    report_to='none'
)

# create the Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
print('âœ… Trainer ready')

  trainer = Trainer(
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


âœ… Trainer ready


### (Optional) Merge LoRA for latency-free inference
Try merging and saving a single checkpoint.

In [8]:
try:
    merged = model.merge_and_unload()
    merged.save_pretrained('outputs-lora-sst2-merged')
    tokenizer.save_pretrained('outputs-lora-sst2-merged')
    print('âœ… Merged model saved to outputs-lora-sst2-merged')
except Exception as e:
    print('Merge skipped or not supported:', e)

âœ… Merged model saved to outputs-lora-sst2-merged


In [9]:


train_out = trainer.train()
eval_out  = trainer.evaluate()
print(train_out)
print(eval_out)

summary = {
    'timestamp_utc': '2025-10-15T15:33:19.587081',
    'config': {
        'MODEL_NAME': MODEL_NAME,
        'LR': LR,
        'BATCH_SIZE': BATCH_SIZE,
        'NUM_EPOCHS': NUM_EPOCHS,
        'LORA_RANK': LORA_RANK,
        'LORA_ALPHA': LORA_ALPHA,
        'LORA_DROPOUT': LORA_DROPOUT,
        'TARGET_MODULES': TARGET_MODULES,
    },
    'train': {
        'global_step': getattr(train_out, 'global_step', None),
        'training_loss': getattr(train_out, 'training_loss', None),
    },
    'eval': eval_out
}
with open('artifacts/partA_results.json','w') as f:
    json.dump(summary, f, indent=2)
print('ðŸ’¾ Saved â†’ artifacts/partA_results.json')



Epoch,Training Loss,Validation Loss,Accuracy
1,0.42,0.406355,0.830275
2,0.3917,0.390583,0.823394
3,0.4183,0.388518,0.832569
4,0.3888,0.38908,0.830275




TrainOutput(global_step=16840, training_loss=0.41565222513647376, metrics={'train_runtime': 695.9936, 'train_samples_per_second': 387.067, 'train_steps_per_second': 24.196, 'total_flos': 2455429931383956.0, 'train_loss': 0.41565222513647376, 'epoch': 4.0})
{'eval_loss': 0.3890800476074219, 'eval_accuracy': 0.8302752293577982, 'eval_runtime': 2.1458, 'eval_samples_per_second': 406.366, 'eval_steps_per_second': 25.631, 'epoch': 4.0}
ðŸ’¾ Saved â†’ artifacts/partA_results.json


## 3) Part B â€” **Reasoning + Action** (CoT & ReAct)
Compare **Direct** vs **CoT** vs **ReAct** on 3+ questions (at least one arithmetic, one factual). Keep transcripts short and machine-readable.

In [16]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import wikipedia, re, json

GEN_MODEL = 'google/flan-t5-small'  # small & fast for demo
g_tok = AutoTokenizer.from_pretrained(GEN_MODEL)
g_mod = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL, device_map='auto')
print('âœ… Loaded generator:', GEN_MODEL)

def generate(prompt, max_new_tokens=128):
    inputs = g_tok(prompt, return_tensors='pt').to(g_mod.device)
    out = g_mod.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return g_tok.decode(out[0], skip_special_tokens=True)

FINAL_RE = re.compile(r"Final:\s*(.+)", re.IGNORECASE)
def ask_direct(q):
    return generate(f"Q: {q}\nA:")
def ask_cot(q):
    return generate(f"Q: {q}\nA: Let's think step by step.\nAt the end, output exactly:\nFinal: <answer>")
def extract_final(text):
    m = FINAL_RE.search(text); return m.group(1).strip() if m else text.strip()

def tool_search(query, sentences=2):
    try:
        wikipedia.set_lang('en')
        page = wikipedia.page(query, auto_suggest=True)
        txt = wikipedia.summary(query, sentences=sentences)
        return f"[SEARCH RESULT: {page.title}] " + txt.replace('\n',' ')
    except Exception as e:
        return f"[SEARCH ERROR] {e}"
def safe_calc(expr):
    allowed = re.sub(r"[^0-9\+\-\*\/\^\(\)\.\s]", "", expr).replace('^','**')
    try:
        val = eval(allowed, {"__builtins__": {}}, {})
        return str(val)
    except Exception as e:
        return f"[CALC ERROR] {e}"

REACT_SYSTEM = (
    "You are a helpful assistant that reasons step-by-step and uses tools when needed.\n"
    "Tools you can use:\n- Search[query]\n- Calc[expression] (use ^ for powers)\n\n"
    "Use this exact format:\nThought: <reasoning>\nAction: <Search[...] or Calc[...] >\nObservation: <tool result>\n...\nThought: I can answer.\nFinal Answer: <concise answer>\n"
)
FEWSHOT = (
    "Q: Who wrote Pride and Prejudice and in what year was it first published?\n"
    "Thought: I should look up the author and year.\n"
    "Action: Search[Pride and Prejudice]\n"
    "Observation: [SEARCH RESULT: Pride and Prejudice] Pride and Prejudice is a novel by Jane Austen, first published in 1813.\n"
    "Thought: I can answer.\n"
    "Final Answer: Jane Austen, 1813.\n\n"
    "Q: What is (23^2 - 17^2)?\n"
    "Thought: I can compute with the difference of squares.\n"
    "Action: Calc[(23^2 - 17^2)]\n"
    "Observation: 276\n"
    "Thought: I can answer.\n"
    "Final Answer: 276.\n"
)
def react_answer(question, max_steps=4):
    prompt = REACT_SYSTEM + "\n" + FEWSHOT + "\nQ: " + question + "\n"
    transcript = ""
    for _ in range(max_steps):
        text = prompt + transcript + "Thought:"
        response = generate(text, max_new_tokens=96)
        cont = response.split("Thought:",1)[-1]
        if 'Final Answer:' in cont:
            fa = cont.split('Final Answer:',1)[-1].strip()
            transcript += f"Thought: I can answer.\nFinal Answer: {fa}\n"
            break
        m = re.search(r"Action:\\s*(Search\\[(.*?)\\]|Calc\\[(.*?)\\])", cont, re.IGNORECASE|re.DOTALL)
        if not m:
            transcript += f"Thought: {cont.strip()}\nThought: I can answer.\nFinal Answer: (no tool) {cont.strip()}\n"
            break
        action_full = m.group(1); search_q = m.group(2); calc_expr = m.group(3)
        obs = tool_search(search_q.strip()) if search_q is not None else safe_calc(calc_expr.strip())
        transcript += f"Thought: {cont.strip()}\nAction: {action_full}\nObservation: {obs}\n"
    if 'Final Answer:' not in transcript:
        transcript += "Thought: I can answer.\nFinal Answer: (stopped without explicit answer)\n"
    return transcript

QUESTIONS = [
    "Who discovered penicillin and what year was it discovered?",
    "What is (125^2 - 120^2) / 5?",
    "In which city is the Eiffel Tower located?"
]
results_B = []
for q in QUESTIONS:
    d = ask_direct(q); c = ask_cot(q); r = react_answer(q)
    results_B.append({'question': q, 'direct': d, 'cot': c, 'react': r})
import json
print(json.dumps(results_B, indent=2))
os.makedirs('artifacts', exist_ok=True)
with open('artifacts/partB_transcripts.json','w') as f:
    json.dump(results_B, f, indent=2)
print('ðŸ’¾ Saved â†’ artifacts/partB_transcripts.json')

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

âœ… Loaded generator: google/flan-t5-small
[
  {
    "question": "Who discovered penicillin and what year was it discovered?",
    "direct": "Jacques Chin",
    "cot": "(IV).",
    "react": "Thought: I should look up the author and year.\nThought: I can answer.\nFinal Answer: (no tool) I should look up the author and year.\n"
  },
  {
    "question": "What is (125^2 - 120^2) / 5?",
    "direct": "1252 - 1202 / 5",
    "cot": "(d).",
    "react": "Thought: I can calculate with the difference of squares. Action: Calc[(1252 - 1202) / 5] Observation: [(1252 - 1202) / 5]\nThought: I can answer.\nFinal Answer: (no tool) I can calculate with the difference of squares. Action: Calc[(1252 - 1202) / 5] Observation: [(1252 - 1202) / 5]\n"
  },
  {
    "question": "In which city is the Eiffel Tower located?",
    "direct": "san francisco",
    "cot": "Paris",
    "react": "Thought: Paris\nThought: I can answer.\nFinal Answer: (no tool) Paris\n"
  }
]
ðŸ’¾ Saved â†’ artifacts/partB_transcripts.json

## 4) Analysis & Reflection
Complete the prompts below. Keep your reflection **150â€“200 words**.

**Part A â€” Best LoRA configuration (fill below):**  
- Model:  
- LR / Batch / Epochs:  
- LoRA: rank / alpha / dropout / targets:  
- Validation metrics (acc, loss):  
- Notes on stability/VRAM/runtime:  

**Part B â€” CoT vs ReAct (brief):**  
- 1 case where **ReAct** improved correctness:  
- 1 case where it failed or over-used tools:  
- Your mitigation (e.g., stricter format, top-k search, fallback):  

**Reflection (150â€“200 words)** â€” *What you tuned and why; how hardware constraints influenced choices; how tool-use changed outcomes.*

## 5) Packaging Results
This cell lists your saved artifacts for upload (JSON + optional merged model).

In [None]:
import os, glob
print('Artifacts:')
for p in sorted(glob.glob('artifacts/*')):
    try:
        size = os.path.getsize(p)
    except Exception:
        size = 'n/a'
    print(' -', p, size, 'bytes')
print('\nOptional model dirs:')
for d in ['outputs-lora-sst2', 'outputs-lora-sst2-merged']:
    if os.path.isdir(d):
        print(' -', d)