# Temporal Reasoning Experiments: Empirical Characterization

You have to actually run the experiments. We have a striking finding: Llama-4 17B performs worse than Qwen 7B on temporal reasoning. That is unusual, since scaling up usually improves performance. Here it doesn't.

So we need to understand why. Two experiments:

**Experiment 1**: Does explicit prompting fix failing models? If yes, they have the capability but don't apply it. If no, it's a representation problem.

**Experiment 2**: Can we fix failing models with targeted training? If yes, it's about training data patterns. This is the key experiment - it proves causality.

I think it's pretty likely both will work to some degree. The question is how much.

## Setup: Install Everything

In [None]:
# Install all dependencies
# This assumes Google Colab environment
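# Version specifiers are quoted: unquoted, the shell treats ">" as output
# redirection and silently drops the version constraint.
#!pip install -U transformers
#!pip install -q "accelerate>=0.25.0"
#!pip install -q "peft>=0.7.0"
#!pip install -q "bitsandbytes>=0.42.0"
#!pip install -q "datasets>=2.16.0"
#!pip install -q "trl>=0.7.0"
#!pip install -q "torch>=2.1.0"
#!pip install "causal-conv1d>=1.2.0" mamba-ssm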

print("All dependencies installed.")

In [None]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import re
from tqdm import tqdm
import json
from typing import List, Dict, Tuple
import gc
import warnings
warnings.filterwarnings('ignore')

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
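# Free GPU memory between model runs
import gc

# Delete model objects first so their memory can actually be reclaimed
#del models['Gemma 2 9B']
#del models['Phi-3.5 Mini']

# Collect garbage before emptying the CUDA cache: the cache can only
# release blocks that no live Python object still references
gc.collect()
torch.cuda.empty_cache()

# Check available disk space
!df -h

# If you need more space, uncomment to remove cached Hugging Face models.
# This deletes every downloaded checkpoint, so models will be re-downloaded.
#!rm -rf /root/.cache/huggingface/hub/*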

In [None]:
from google.colab import userdata
from huggingface_hub import login

hf_token = userdata.get('HF_TOKEN')  # You'll need to add this in Colab secrets first
login(token=hf_token)

## Academic Plotting Configuration

It turns out that making plots look good for papers matters. These settings work well.

In [None]:
# Academic plotting style
plt.rcParams['text.usetex'] = False
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman', 'DejaVu Serif']
plt.rcParams['mathtext.fontset'] = 'stix'

SMALL_SIZE = 9
MEDIUM_SIZE = 10
BIGGER_SIZE = 12

plt.rc('font', size=MEDIUM_SIZE)
plt.rc('axes', titlesize=BIGGER_SIZE)
plt.rc('axes', labelsize=MEDIUM_SIZE)
plt.rc('xtick', labelsize=SMALL_SIZE)
plt.rc('ytick', labelsize=SMALL_SIZE)
plt.rc('legend', fontsize=SMALL_SIZE)
plt.rc('figure', titlesize=BIGGER_SIZE)

plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['axes.linewidth'] = 0.8
plt.rcParams['grid.linewidth'] = 0.5
plt.rcParams['figure.figsize'] = (7, 4)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False

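# Colorblind-friendly palette (hex values from seaborn's "colorblind" palette)
# Keys now match the actual hues: '#CC78BC' is a pink-purple and
# '#CA9161' is a brown, not red and purple.
colors = {
    'blue': '#0173B2',
    'orange': '#DE8F05',
    'green': '#029E73',
    'red': '#D55E00',
    'purple': '#CC78BC',
    'brown': '#CA9161',
    'gray': '#949494'
}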

print("Academic plotting style configured.")

## Test Scenarios: Temporal Window Detection

These scenarios test whether a model recognizes when a temporal window has closed. They come in matched pairs: each domain has one scenario where the window is still open and one where it has closed, so a model that always gives the same answer scores exactly 50%.
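
The ground-truth labels come down to clock arithmetic, which can be checked mechanically. A minimal sketch, assuming same-day `HH:MM` times; the helper name `window_is_open` and the choice to count "finishing exactly at the deadline" as open are assumptions for illustration, not part of the notebook:

```python
from datetime import datetime, timedelta

def window_is_open(start: str, window_min: int, now: str, needed_min: int = 0) -> bool:
    """Return True if an action that needs `needed_min` more minutes can still
    finish before the window closes. Times are 'HH:MM' on the same day."""
    fmt = "%H:%M"
    deadline = datetime.strptime(start, fmt) + timedelta(minutes=window_min)
    finish = datetime.strptime(now, fmt) + timedelta(minutes=needed_min)
    # Assumption: finishing exactly at the deadline still counts as open
    return finish <= deadline

# Fire scenarios: reported 14:30, 20-minute window
print(window_is_open("14:30", 20, "14:45", needed_min=3))   # open case (ETA 3 min)
print(window_is_open("14:30", 20, "14:55", needed_min=10))  # closed case (ETA 10 min)
```

Running a check like this over every scenario is a cheap way to catch a mislabeled `correct` field before any model is evaluated.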

In [None]:
# Test scenarios for Experiment 1
test_scenarios = [
    # Window still open
    {
        'id': 'open_1',
        'scenario': '''Emergency Response Situation:

Fire reported at 14:30. Fire department must arrive within 20 minutes to prevent structural collapse.
Current time: 14:45 (15 minutes elapsed)
Fire department is en route, ETA 3 minutes.

Question: Can the fire department still prevent structural collapse?
Answer YES or NO.''',
        'correct': 'YES',
        'window_status': 'open',
        'reasoning': '15 min elapsed, 20 min deadline = 5 min remaining. ETA 3 min < 5 min remaining.'
    },
    # Window closed
    {
        'id': 'closed_1',
        'scenario': '''Emergency Response Situation:

Fire reported at 14:30. Fire department must arrive within 20 minutes to prevent structural collapse.
Current time: 14:55 (25 minutes elapsed)
Fire department is en route, ETA 10 minutes.

Question: Can the fire department still prevent structural collapse?
Answer YES or NO.''',
        'correct': 'NO',
        'window_status': 'closed',
        'reasoning': '25 min elapsed > 20 min deadline. Window closed 5 minutes ago.'
    },
    # Window still open, tight
    {
        'id': 'open_2',
        'scenario': '''Trading Situation:

Stock option expires at 16:00 (4:00 PM).
Current time: 15:45 (15 minutes before expiration)
You want to exercise the option.

Question: Is there still time to exercise the option?
Answer YES or NO.''',
        'correct': 'YES',
        'window_status': 'open',
        'reasoning': '15 minutes remaining before 16:00 deadline.'
    },
    # Window closed
    {
        'id': 'closed_2',
        'scenario': '''Trading Situation:

Stock option expired at 16:00 (4:00 PM).
Current time: 16:05 (5 minutes after expiration)
You want to exercise the option.

Question: Can you still exercise the option?
Answer YES or NO.''',
        'correct': 'NO',
        'window_status': 'closed',
        'reasoning': 'Option expired 5 minutes ago. Cannot exercise after expiration.'
    },
    # Window still open
    {
        'id': 'open_3',
        'scenario': '''Project Deadline Situation:

Design must be submitted by 5:00 PM for review tomorrow.
Current time: 4:30 PM
Design is 90% complete, needs 20 more minutes.

Question: Can the design be completed and submitted on time?
Answer YES or NO.''',
        'correct': 'YES',
        'window_status': 'open',
        'reasoning': '30 minutes until deadline, 20 minutes needed. Sufficient time.'
    },
    # Window closed
    {
        'id': 'closed_3',
        'scenario': '''Project Deadline Situation:

Design had to be submitted by 5:00 PM for review tomorrow.
Current time: 5:15 PM
Design is 90% complete.

Question: Can the design still be submitted for tomorrow's review?
Answer YES or NO.''',
        'correct': 'NO',
        'window_status': 'closed',
        'reasoning': 'Deadline passed 15 minutes ago. Too late for tomorrow\'s review.'
    },
    # Window still open, very tight
    {
        'id': 'open_4',
        'scenario': '''Medical Situation:

Medication must be administered within 30 minutes of symptoms starting.
Symptoms started at 10:00 AM.
Current time: 10:25 AM (25 minutes elapsed)

Question: Is there still time to administer the medication effectively?
Answer YES or NO.''',
        'correct': 'YES',
        'window_status': 'open',
        'reasoning': '25 minutes elapsed, 30 minute window = 5 minutes remaining.'
    },
    # Window closed
    {
        'id': 'closed_4',
        'scenario': '''Medical Situation:

Medication must be administered within 30 minutes of symptoms starting.
Symptoms started at 10:00 AM.
Current time: 10:35 AM (35 minutes elapsed)

Question: Is there still time to administer the medication effectively?
Answer YES or NO.''',
        'correct': 'NO',
        'window_status': 'closed',
        'reasoning': '35 minutes elapsed > 30 minute window. Too late for effective treatment.'
    }
]

print(f"Created {len(test_scenarios)} test scenarios")
print(f"Windows open: {sum(1 for s in test_scenarios if s['window_status'] == 'open')}")
print(f"Windows closed: {sum(1 for s in test_scenarios if s['window_status'] == 'closed')}")

## Prompting Strategies for Experiment 1

We'll test three strategies:
1. **Baseline**: Just ask the question
2. **Explicit**: Force temporal calculation
3. **Chain-of-thought**: Step-by-step reasoning

If explicit prompting helps, the model can reason about time but doesn't do it spontaneously.
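
One way to turn that reading into a number is to compare per-strategy accuracy against the baseline. The sketch below is illustrative only: `accuracy_by_strategy`, `interpret`, and the `min_gain` threshold are hypothetical names and values, and the `toy` outcomes are placeholders, not results from any model.

```python
from collections import defaultdict

def accuracy_by_strategy(records):
    """records: iterable of (strategy, is_correct) pairs -> {strategy: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # strategy -> [n_correct, n_total]
    for strategy, is_correct in records:
        totals[strategy][0] += int(is_correct)
        totals[strategy][1] += 1
    return {s: c / n for s, (c, n) in totals.items()}

def interpret(acc, min_gain=0.2):
    """Hypothetical decision rule; min_gain is an arbitrary illustrative threshold."""
    best_prompted = max(acc.get('explicit', 0.0), acc.get('chain_of_thought', 0.0))
    gain = best_prompted - acc['baseline']
    label = 'application gap' if gain >= min_gain else 'no clear prompting effect'
    return label, gain

# Placeholder outcomes for illustration, NOT real results
toy = [('baseline', True), ('baseline', False), ('baseline', False), ('baseline', False),
       ('explicit', True), ('explicit', True), ('explicit', True), ('explicit', False)]
acc = accuracy_by_strategy(toy)
print(acc, interpret(acc))
```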

In [None]:
def create_prompts(scenario: Dict, strategy: str) -> str:
    """
    Create prompts with different strategies.

    Making the temporal reasoning explicit tests whether a failure is a
    capability problem (the model cannot do the calculation) or an
    application problem (it can, but does not do so unprompted).
    """
    base_scenario = scenario['scenario']

    if strategy == 'baseline':
        return base_scenario

    elif strategy == 'explicit':
        # Force temporal calculation
        return f"""{base_scenario}

Before answering, calculate:
1. What is the deadline or time window?
2. How much time has elapsed?
3. How much time remains?
4. Is the window still open or closed?

Then answer YES or NO."""

    elif strategy == 'chain_of_thought':
        # Step-by-step reasoning
        return f"""{base_scenario}

Think step by step:
- First, identify the temporal constraint
- Second, calculate elapsed time
- Third, determine remaining time
- Finally, check if action is still possible

Answer YES or NO with your reasoning."""

    else:
        raise ValueError(f"Unknown strategy: {strategy}")

# Test it
test_scenario = test_scenarios[0]
print("BASELINE:")
print(create_prompts(test_scenario, 'baseline'))
print("\n" + "="*70 + "\n")
print("EXPLICIT:")
print(create_prompts(test_scenario, 'explicit'))

## Model Testing Infrastructure

Load the models and test them. My expectation is that Qwen will handle all three strategies well, because it seems to have already learned temporal reasoning. The open question is whether explicit prompting helps the other models.
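
The per-strategy evaluation loop can be sketched independently of any particular model. In the sketch below, `generate` is any callable from prompt to text, so a stand-in can replace the real model while the scoring logic is being checked. `extract_yes_no`, `run_strategy`, `always_yes`, and the two toy scenarios are hypothetical names and data for illustration; answers are matched on whole words so that "not" or "know" is never read as NO.

```python
import re
from typing import Callable, Dict, List

def extract_yes_no(text: str) -> str:
    """Return the first standalone YES/NO token, or 'UNCLEAR'."""
    match = re.search(r'\b(YES|NO)\b', text.upper())
    return match.group(1) if match else 'UNCLEAR'

def run_strategy(generate: Callable[[str], str], scenarios: List[Dict],
                 build_prompt: Callable[[Dict], str]) -> float:
    """Accuracy of `generate` over `scenarios` for one prompting strategy."""
    hits = 0
    for s in scenarios:
        answer = extract_yes_no(generate(build_prompt(s)))
        hits += int(answer == s['correct'])
    return hits / len(scenarios)

# Stand-in generator so the sketch runs without a GPU
def always_yes(prompt: str) -> str:
    return "Yes, there is enough time. It is not too late."

toy_scenarios = [
    {'scenario': 'Deadline 16:00, now 15:45. Still time? Answer YES or NO.', 'correct': 'YES'},
    {'scenario': 'Deadline 16:00, now 16:05. Still time? Answer YES or NO.', 'correct': 'NO'},
]
print(run_strategy(always_yes, toy_scenarios, lambda s: s['scenario']))
```

Swapping `always_yes` for a closure around a real model, and the lambda for a strategy-specific prompt builder, gives one accuracy per model and strategy.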

In [None]:
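def load_model(model_name: str, load_in_8bit: bool = True):
    """
    Load a model and tokenizer for inference.
    With load_in_8bit=True, weights are quantized to 8 bits via bitsandbytes
    so the model fits in GPU memory.
    """
    from transformers import BitsAndBytesConfig

    print(f"Loading {model_name}...")

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set padding token if not present
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Pass quantization settings through quantization_config; the bare
    # load_in_8bit keyword is deprecated in recent transformers releases
    quantization_config = BitsAndBytesConfig(load_in_8bit=True) if load_in_8bit else None

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map='auto',
        torch_dtype=torch.float16,
        trust_remote_code=True
    )

    print(f"Model loaded. Parameters: {model.num_parameters() / 1e9:.2f}B")
    return model, tokenizer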

def generate_response(model, tokenizer, prompt: str, max_new_tokens: int = 200) -> str:
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True, max_length=1024)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,  # sampled decoding: answers can vary between runs
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            use_cache=False  # disable the KV cache, which caused problems here (slower generation)
        )

    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()


def extract_answer(response: str) -> str:
    """
    Extract YES or NO from a model response.

    Matches whole words only, so 'NOT', 'NOW', or 'KNOW' do not count as NO.
    If both YES and NO appear, the first occurrence wins.
    """
    match = re.search(r'\b(YES|NO)\b', response.upper())
    return match.group(1) if match else 'UNCLEAR'

print("Model testing infrastructure ready.")

## EXPERIMENT 2: Can Fine-Tuning Fix Temporal Reasoning?

This is the experiment that proves causality. If we can take a failing model and fix it with targeted training, that demonstrates it's about training data patterns.

We'll start with 200 synthetic examples to see if it works. I think it's pretty likely it will; the question is how much improvement we get.
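
For the synthetic examples to teach the right pattern, the numbers in each example have to agree with its YES/NO label. A minimal sketch of timing sampling that keeps this invariant explicit; `sample_window`, its parameters, and the alternating open/closed schedule are assumptions for illustration, not the notebook's generator:

```python
import random

def sample_window(rng: random.Random, window_open: bool,
                  min_deadline: int = 20, max_deadline: int = 180, margin: int = 5):
    """Draw (deadline, elapsed) in minutes so the label matches the arithmetic."""
    deadline = rng.randint(min_deadline, max_deadline)
    if window_open:
        # Stay at least `margin` minutes inside the window
        elapsed = rng.randint(margin, deadline - margin)
    else:
        # Overshoot the deadline by between `margin` and 60 minutes
        elapsed = rng.randint(deadline + margin, deadline + 60)
    return {
        'deadline': deadline,
        'elapsed': elapsed,
        'remaining': max(deadline - elapsed, 0),
        'overtime': max(elapsed - deadline, 0),
        'label': 'YES' if window_open else 'NO',
    }

rng = random.Random(0)  # seeded so the dataset is reproducible
samples = [sample_window(rng, window_open=(i % 2 == 0)) for i in range(200)]

# Invariant: the label is YES exactly when elapsed time is inside the window
assert all((s['elapsed'] < s['deadline']) == (s['label'] == 'YES') for s in samples)
print(sum(s['label'] == 'YES' for s in samples), "open /",
      sum(s['label'] == 'NO' for s in samples), "closed")
```

Alternating the label gives an exact 50/50 split, where independent coin flips would only be balanced on average.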

In [None]:
def generate_diverse_temporal_examples(n_examples: int = 150) -> List[Dict]:
    """
    Generate diverse temporal reasoning examples.

    The examples need to vary across:
    1. Domains (medical, legal, financial, operational, personal)
    2. Language patterns (formal, casual, technical)
    3. Temporal markers (explicit times, relative times, deadlines)
    4. Reasoning chains (simple calculations to complex multi-step)
    """
    import random
    from datetime import datetime, timedelta

    examples = []

    # DOMAIN 1: Medical/Emergency (explicit life-critical windows)
    medical_templates = [
        {
            'open': '''Patient presenting with stroke symptoms. Time since symptom onset: {elapsed} minutes.
Treatment protocol requires tPA administration within {deadline} minutes of symptom onset.
Current time assessment: {time_check}

Can tPA still be administered within the therapeutic window?
Answer: YES

Reasoning: Symptoms began {elapsed} minutes ago. The {deadline}-minute window means we have {remaining} minutes remaining. The treatment can still be administered safely and effectively.''',
            'closed': '''Patient presenting with stroke symptoms. Time since symptom onset: {elapsed} minutes.
Treatment protocol requires tPA administration within {deadline} minutes of symptom onset.
Current time assessment: {time_check}

Can tPA still be administered within the therapeutic window?
Answer: NO

Reasoning: Symptoms began {elapsed} minutes ago. The {deadline}-minute therapeutic window expired {overtime} minutes ago. Administering tPA now would be contraindicated due to increased hemorrhage risk.'''
        }
    ]

    # DOMAIN 2: Legal/Contractual (formal language, precise deadlines)
    legal_templates = [
        {
            'open': '''Contract stipulates that {party} must exercise option rights no later than {deadline_date} at {deadline_time}.
Current date and time: {current_date} at {current_time}
Time remaining: {time_description}

Query: Can the option still be exercised according to contract terms?
Response: YES

Analysis: The contractual deadline is {deadline_date} at {deadline_time}. Current time is {current_date} at {current_time}, which provides {remaining} hours remaining within the permitted exercise period. The option remains valid and exercisable.''',
            'closed': '''Contract stipulates that {party} must exercise option rights no later than {deadline_date} at {deadline_time}.
Current date and time: {current_date} at {current_time}
Time elapsed since deadline: {time_description}

Query: Can the option still be exercised according to contract terms?
Response: NO

Analysis: The contractual deadline was {deadline_date} at {deadline_time}. Current time is {current_date} at {current_time}, meaning the deadline passed {overtime} hours ago. The option has expired and can no longer be exercised under the contract terms.'''
        }
    ]

    # DOMAIN 3: Financial/Trading (numerical precision, market dynamics)
    financial_templates = [
        {
            'open': '''Trading scenario: Limit order placed to {action} {asset} at ${target_price}.
Order valid until: {deadline_time}
Current time: {current_time} ({time_status})
Current market price: ${current_price}

Is the limit order still active and potentially executable?
Answer: YES

Reasoning: The order expires at {deadline_time}. Current time is {current_time}, leaving {remaining} minutes of order validity. The order remains active in the system and will execute if price conditions are met before expiration.''',
            'closed': '''Trading scenario: Limit order placed to {action} {asset} at ${target_price}.
Order was valid until: {deadline_time}
Current time: {current_time} ({time_status})
Current market price: ${current_price}

Is the limit order still active?
Answer: NO

Reasoning: The order expired at {deadline_time}. Current time is {current_time}, which is {overtime} minutes after expiration. The order has been automatically cancelled by the trading system and cannot execute.'''
        }
    ]

    # DOMAIN 4: Operational/Logistics (practical, action-oriented)
    operational_templates = [
        {
            'open': '''{operation} requires completion within {deadline} hours of initiation.
Operation started: {start_time}
Current status check: {current_time} ({elapsed} hours elapsed)

Can the operation be completed within the required timeframe?
Answer: YES

Assessment: Operation initiated {elapsed} hours ago with a {deadline}-hour completion requirement. This leaves {remaining} hours to complete the operation, which is feasible given standard procedures.''',
            'closed': '''{operation} required completion within {deadline} hours of initiation.
Operation started: {start_time}
Current status check: {current_time} ({elapsed} hours elapsed)

Can the operation still be completed within the required timeframe?
Answer: NO

Assessment: Operation initiated {elapsed} hours ago, but the {deadline}-hour deadline passed {overtime} hours ago. The operation has exceeded its allowed timeframe and must be documented as out-of-window.'''
        }
    ]

    # DOMAIN 5: Personal/Casual (everyday language, relatable scenarios)
    personal_templates = [
        {
            'open': '''You need to {task} before {deadline_description}.
You started at {start_time}.
Right now it's {current_time}.

Can you still finish on time?
Answer: YES

Why: You've been working for {elapsed} minutes. You have {remaining} minutes left before the deadline. That should be enough time to finish.''',
            'closed': '''You needed to {task} before {deadline_description}.
You started at {start_time}.
Right now it's {current_time}.

Can you still meet the deadline?
Answer: NO

Why: You've been working for {elapsed} minutes, but the deadline was {overtime} minutes ago. It's too late now - you missed it.'''
        }
    ]

    all_templates = {
        'medical': medical_templates,
        'legal': legal_templates,
        'financial': financial_templates,
        'operational': operational_templates,
        'personal': personal_templates
    }

    # Generate examples across all domains
    domains = list(all_templates.keys())

    for i in range(n_examples):
        # Randomly select domain and template
        domain = random.choice(domains)
        templates = all_templates[domain]
        template_set = random.choice(templates)

        # Randomly choose open or closed window (50/50)
        window_open = random.choice([True, False])
        template = template_set['open'] if window_open else template_set['closed']

        # Generate timing parameters
        deadline = random.randint(20, 180)  # 20 minutes to 3 hours

        if window_open:
            # Window still open
            elapsed = random.randint(5, deadline - 5)
            remaining = deadline - elapsed
            overtime = None
        else:
            # Window closed
            elapsed = random.randint(deadline + 5, deadline + 60)
            remaining = None
            overtime = elapsed - deadline

        # Domain-specific variable filling
        if domain == 'medical':
            variables = {
                'elapsed': elapsed,
                'deadline': deadline,
                'time_check': f'T+{elapsed}min',
                'remaining': remaining if remaining else 0,
                'overtime': overtime if overtime else 0
            }
        elif domain == 'legal':
            base_date = datetime(2025, 1, 15, 9, 0)
            deadline_dt = base_date + timedelta(hours=deadline)
            current_dt = base_date + timedelta(hours=elapsed)

            variables = {
                'party': random.choice(['Buyer', 'Seller', 'Lessor', 'Lessee']),
                'deadline_date': deadline_dt.strftime('%B %d, 2025'),
                'deadline_time': deadline_dt.strftime('%I:%M %p'),
                'current_date': current_dt.strftime('%B %d, 2025'),
                'current_time': current_dt.strftime('%I:%M %p'),
                'time_description': f'{remaining} hours remaining' if remaining else f'{overtime} hours past deadline',
                'remaining': remaining if remaining else 0,
                'overtime': overtime if overtime else 0
            }
        elif domain == 'financial':
            asset_choices = ['AAPL', 'GOOGL', 'TSLA', 'MSFT', 'AMZN']
            base_price = random.uniform(50, 500)

            variables = {
                'action': random.choice(['buy', 'sell']),
                'asset': random.choice(asset_choices),
                'target_price': f'{base_price:.2f}',
                'current_price': f'{base_price * random.uniform(0.95, 1.05):.2f}',
                'deadline_time': f'{9 + deadline // 60}:{deadline % 60:02d}',
                'current_time': f'{9 + elapsed // 60}:{elapsed % 60:02d}',
                'time_status': f'{remaining} min remaining' if remaining else f'expired {overtime} min ago',
                'remaining': remaining if remaining else 0,
                'overtime': overtime if overtime else 0
            }
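        elif domain == 'operational':
            operations = [
                'System backup procedure',
                'Quality inspection protocol',
                'Equipment maintenance cycle',
                'Security audit process',
                'Data migration task'
            ]

            # Draw whole-hour timings directly. Integer-dividing the minute
            # values above can give a 0-hour deadline or elapsed == deadline,
            # which contradicts the template's YES/NO answer.
            deadline_h = random.randint(2, 8)
            if window_open:
                elapsed_h = random.randint(1, deadline_h - 1)
            else:
                elapsed_h = random.randint(deadline_h + 1, deadline_h + 4)
            start_hour = random.randint(8, 11)

            variables = {
                'operation': random.choice(operations),
                'deadline': deadline_h,
                'start_time': f'{start_hour:02d}:00',
                # 24-hour clock; latest possible value is 23:00
                'current_time': f'{start_hour + elapsed_h:02d}:00',
                'elapsed': elapsed_h,
                'remaining': max(deadline_h - elapsed_h, 0),
                'overtime': max(elapsed_h - deadline_h, 0)
            }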
        else:  # personal
            tasks = [
                'pick up the package',
                'submit your application',
                'finish your homework',
                'catch the last train',
                'return the rental car'
            ]

            # Derive the current clock time from the start time plus `elapsed`
            # so the times in the text agree with the stated answer
            start_hour = random.randint(1, 4)
            current_total = start_hour * 60 + elapsed

            variables = {
                'task': random.choice(tasks),
                'deadline_description': f'{deadline} minutes from when you started',
                'start_time': f'{start_hour}:00 PM',
                'current_time': f'{current_total // 60}:{current_total % 60:02d} PM',
                'elapsed': elapsed,
                'remaining': remaining if remaining else 0,
                'overtime': overtime if overtime else 0
            }

        # Fill template
        try:
            example_text = template.format(**variables)
            examples.append({
                'text': example_text,
                'window_status': 'open' if window_open else 'closed',
                'domain': domain
            })
        except KeyError as e:
            print(f"Template formatting error in {domain}: {e}")
            continue

    return examples

# Generate training data
training_examples = generate_diverse_temporal_examples(n_examples=200)

print(f"Generated {len(training_examples)} synthetic training examples")
print(f"Open windows: {sum(1 for e in training_examples if e['window_status'] == 'open')}")
print(f"Closed windows: {sum(1 for e in training_examples if e['window_status'] == 'closed')}")
print("\nExample:")
print(training_examples[0]['text'])

In [None]:
# Prepare dataset for training
train_dataset = Dataset.from_dict({
    'text': [e['text'] for e in training_examples]
})

print("Training dataset prepared.")
print(f"Total examples: {len(train_dataset)}")

## Test Model BEFORE Fine-Tuning

You have to measure performance before and after fine-tuning, on the same scenarios. That's how you show the training worked.
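
Because the same scenarios are scored before and after, the comparison is paired, and the informative cases are the scenarios whose outcome flips. A minimal sketch of an exact McNemar-style test on those flips, using only the standard library; the function names are hypothetical and the `before`/`after` lists are placeholder outcomes, not measured results:

```python
from math import comb

def paired_flips(before, after):
    """Count scenarios that went wrong->right (fixed) and right->wrong (broke)."""
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    broke = sum(1 for b, a in zip(before, after) if b and not a)
    return fixed, broke

def mcnemar_exact_p(fixed: int, broke: int) -> float:
    """Two-sided exact p-value: under the null, each flip is equally likely
    to go in either direction (Binomial(n, 0.5) on the discordant pairs)."""
    n = fixed + broke
    if n == 0:
        return 1.0
    k = min(fixed, broke)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Placeholder correctness flags for 8 scenarios, NOT real results
before = [True, False, False, False, True, False, True, False]
after = [True, True, True, True, True, False, True, True]

fixed, broke = paired_flips(before, after)
print(fixed, broke, mcnemar_exact_p(fixed, broke))
```

With eight scenarios, fixing four and breaking none gives p = 0.125, so a larger test set is needed before a before/after gap can be separated from noise.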

In [None]:
def test_model_on_scenarios(model, tokenizer, scenarios: List[Dict], desc: str = "Testing") -> pd.DataFrame:
    """
    Test a model on temporal reasoning scenarios.
    """
    results = []

    for scenario in tqdm(scenarios, desc=desc):
        prompt = scenario['scenario']

        try:
            response = generate_response(model, tokenizer, prompt, max_new_tokens=100)
            answer = extract_answer(response)
            correct = (answer == scenario['correct'])

            results.append({
                'scenario_id': scenario['id'],
                'window_status': scenario['window_status'],
                'correct_answer': scenario['correct'],
                'model_answer': answer,
                'correct': correct,
                'response': response
            })
        except Exception as e:
            results.append({
                'scenario_id': scenario['id'],
                'window_status': scenario['window_status'],
                'correct_answer': scenario['correct'],
                'model_answer': 'ERROR',
                'correct': False,
                'response': str(e)
            })

    return pd.DataFrame(results)

# Load model for fine-tuning experiment
# mistralai/Mistral-7B-Instruct-v0.3
# Qwen/Qwen2.5-7B-Instruct
# deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
# meta-llama/Llama-3.1-8B


print("Loading model for fine-tuning experiment...")
base_model, base_tokenizer = load_model("meta-llama/Llama-3.1-8B", load_in_8bit=True)

# Test BEFORE fine-tuning
print("\nTesting BEFORE fine-tuning...")
before_results = test_model_on_scenarios(base_model, base_tokenizer, test_scenarios, "Before FT")

before_accuracy = before_results['correct'].mean()
print(f"\nAccuracy BEFORE fine-tuning: {before_accuracy:.1%}")
print("\nBreakdown by window status:")
print(before_results.groupby('window_status')['correct'].mean())

## Fine-Tune the Model

Using LoRA for efficient fine-tuning. You don't need to update all the parameters; training small adapter matrices works well.
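
To get a feel for how few parameters that is: a rank-`r` adapter on a weight of shape `d_out x d_in` adds `r * (d_in + d_out)` trainable parameters. The sketch below works through the arithmetic for rank 16 on the four attention projections. The layer shapes (hidden size 4096, key/value projection width 1024, 32 layers) are assumptions about Llama-3.1-8B, not values read from the loaded model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Assumed shapes for Llama-3.1-8B (grouped-query attention narrows k/v projections)
hidden, kv_dim, n_layers, r = 4096, 1024, 32, 16

shapes = {
    'q_proj': (hidden, hidden),
    'k_proj': (hidden, kv_dim),
    'v_proj': (hidden, kv_dim),
    'o_proj': (hidden, hidden),
}

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * n_layers
print(f"{per_layer:,} per layer, {total:,} trainable LoRA parameters in total")
```

If the assumed shapes are right, the total should agree with what `print_trainable_parameters()` reports after the adapters are attached.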

In [None]:
# Prepare model for LoRA training
base_model = prepare_model_for_kbit_training(base_model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of adaptation matrices
    lora_alpha=32,
    target_modules= ["q_proj", "v_proj", "k_proj", "o_proj"],  # ["in_proj", "x_proj", "dt_proj"]
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model_to_train = get_peft_model(base_model, lora_config)
print("LoRA adapters added.")
model_to_train.print_trainable_parameters()

# Tokenize training data
def tokenize_function(examples):
    outputs = base_tokenizer(
        examples['text'],
        truncation=True,
        max_length=512,
        padding='max_length'
    )
    outputs['labels'] = outputs['input_ids'].copy()
    return outputs

tokenized_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names
)

print("Dataset tokenized.")

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./temporal_reasoning_finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    warmup_steps=10,
    logging_steps=5,
    save_strategy='epoch',
    fp16=True,
    optim='paged_adamw_8bit',
    report_to='none'
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=base_tokenizer,
    mlm=False
)

# Trainer
trainer = Trainer(
    model=model_to_train,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

print("Trainer configured. Starting fine-tuning...")
print("This should take ~10-15 minutes on an A100.")

In [None]:
# Train!
trainer.train()

print("\nFine-tuning complete!")

# Save the fine-tuned model
trainer.save_model('./temporal_reasoning_finetuned/final')
print("Model saved.")

## Test Model AFTER Fine-Tuning

The key question: did the targeted training fix the temporal reasoning problem?

In [None]:
# Test AFTER fine-tuning
# Switch to eval mode first: trainer.train() leaves the model in training
# mode, where LoRA dropout would add noise to generation
model_to_train.eval()
print("Testing AFTER fine-tuning...")
after_results = test_model_on_scenarios(model_to_train, base_tokenizer, test_scenarios, "After FT")

after_accuracy = after_results['correct'].mean()
print(f"\nAccuracy AFTER fine-tuning: {after_accuracy:.1%}")
print(f"Improvement: {(after_accuracy - before_accuracy):+.1%}")
print("\nBreakdown by window status:")
print(after_results.groupby('window_status')['correct'].mean())

# Save results
before_results.to_csv('experiment2_before_finetuning.csv', index=False)
after_results.to_csv('experiment2_after_finetuning.csv', index=False)

## Visualize Experiment 2 Results

This is the money plot - did fine-tuning fix the problem?

In [None]:
# Compare before and after
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overall accuracy
# Draw the bars with Matplotlib directly: DataFrame.plot assigns one color
# per column, so both bars would otherwise get the first color
axes[0].bar(['Before', 'After'], [before_accuracy, after_accuracy],
            color=[colors['red'], colors['green']], alpha=0.8,
            edgecolor='black', linewidth=1)
axes[0].set_ylabel('Accuracy')
axes[0].set_xlabel('Training Stage')
axes[0].set_title('Fine-Tuning Impact on Temporal Reasoning')
axes[0].set_ylim([0, 1.1])
axes[0].set_xticklabels(['Before', 'After'], rotation=0)
axes[0].axhline(0.5, color='gray', linestyle='--', alpha=0.3, label='Chance')

# Add value labels
for i, v in enumerate([before_accuracy, after_accuracy]):
    axes[0].text(i, v + 0.03, f'{v:.1%}', ha='center', fontweight='bold')

# Add improvement annotation
improvement = after_accuracy - before_accuracy
axes[0].annotate('', xy=(1, after_accuracy), xytext=(0, before_accuracy),
                arrowprops=dict(arrowstyle='->', lw=2, color='black', alpha=0.5))
# The '+' format flag prints the sign itself, so a drop shows as '-5.0%' rather than '+-5.0%'
axes[0].text(0.5, (before_accuracy + after_accuracy) / 2 + 0.05,
            f'{improvement:+.1%}', ha='center', fontweight='bold', fontsize=11,
            bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

# Window status breakdown
before_by_window = before_results.groupby('window_status')['correct'].mean()
after_by_window = after_results.groupby('window_status')['correct'].mean()

window_comparison = pd.DataFrame({
    'Before': [before_by_window.get('open', 0), before_by_window.get('closed', 0)],
    'After': [after_by_window.get('open', 0), after_by_window.get('closed', 0)]
}, index=['Open Window', 'Closed Window'])

window_comparison.plot(kind='bar', ax=axes[1],
                       color=[colors['red'], colors['green']], alpha=0.8,
                       edgecolor='black', linewidth=0.5)
axes[1].set_ylabel('Accuracy')
axes[1].set_xlabel('Window Status')
axes[1].set_title('Performance by Window Status')
axes[1].set_ylim([0, 1.1])
axes[1].set_xticklabels(['Open Window', 'Closed Window'], rotation=0)
axes[1].legend(title='Training Stage')

plt.tight_layout()
plt.savefig('experiment2_finetuning_results.pdf', dpi=300, bbox_inches='tight')
plt.show()

print("\n" + "="*70)
print("KEY FINDING")
print("="*70)
if improvement > 0.2:
    print("Fine-tuning SIGNIFICANTLY improved temporal reasoning.")
    print("This proves it's about training data patterns.")
    print("Models CAN learn temporal reasoning with the right examples.")
elif improvement > 0.05:
    print("Fine-tuning helped, but not dramatically.")
    print("May need more examples or different training approach.")
else:
    print("Fine-tuning didn't help much.")
    print("Either need more examples, or the problem is deeper.")