# Baseline Model Evaluation

## Base Model Selection
We use OLMo-1B as our base model, for the following reasons:

*  Reasonable size for fine-tuning (1B parameters)
*  Decent general language capabilities
*  Open source and well-documented
*  Suitable for running on available GPU resources in Colab
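
As a rough sanity check on the last point, the sketch below estimates how much GPU memory the weights alone would need. It assumes a nominal 1 billion parameters (taken from the model name; the exact count may differ) and a GPU with 15,360 MiB of memory, such as a Colab T4. `weight_memory_gib` is a hypothetical helper, and the estimate ignores activations and the KV cache. Fine-tuning needs additional memory for gradients and optimizer states.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / 2**30

# Assumption: nominal parameter count implied by the "1B" in the model name.
NUM_PARAMS = 1_000_000_000
# Assumption: a GPU with 15,360 MiB of memory (e.g. a Colab T4).
GPU_MEMORY_GIB = 15360 / 1024

fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # 2 bytes per fp16 weight
fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # 4 bytes per fp32 weight

print(f"fp16 weights: {fp16_gib:.2f} GiB")
print(f"fp32 weights: {fp32_gib:.2f} GiB")
print(f"GPU memory:   {GPU_MEMORY_GIB:.2f} GiB")
```

Under these assumptions the weights fit comfortably in either precision, which leaves headroom for inference.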

Let's start by evaluating its out-of-the-box performance on math problem generation. As the outputs below show, the base model performs poorly on this task, which motivates fine-tuning.

In [1]:
# Check GPU availability
!nvidia-smi

# Install required packages
!pip install transformers datasets evaluate -q

Mon Dec 23 13:01:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def setup_base_model():
    """Initialize the base model and tokenizer."""
    model_name = "allenai/OLMo-1B-hf"

    # Load tokenizer and model with progress bars
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')

    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",  # Automatically handle device placement
        torch_dtype=torch.float16  # Use fp16 for efficiency
    )

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    return model, tokenizer

# Clear GPU memory if needed
import gc
gc.collect()
torch.cuda.empty_cache()

# Load model
model, tokenizer = setup_base_model()

Loading tokenizer...


Loading model...



In [3]:
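# No sampling arguments are passed, so generate() uses the default generation config (typically greedy decoding)
message = ["This is a simple math problem:"]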
inputs = tokenizer(message, return_tensors="pt", padding=True).to(model.device)
response = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(response[0], skip_special_tokens=True))

This is a simple math problem: if you have a $100 bill, and you want to buy a $10 item, you need to have $90 in your pocket.
If you have a $100 bill, and you want to buy a $10 item, you need to have $90 in your pocket.
If you have a $100 bill, and you want to buy a $10 item, you need to have $90 in your pocket.
If you have a $100 bill, and you want to buy


In [12]:
# Try different temperatures
for temp in [0.3, 0.7, 1.0]:
    print(f"\nGeneration with temperature {temp}:")
    outputs = model.generate(
        **inputs,
        max_length=100,  # caps total length: prompt tokens plus generated tokens
        temperature=temp,
        do_sample=True
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Generation with temperature 0.3:
This is a simple math problem: if you have a $100 bill, and you spend $10, you have a $90 bill. You can’t spend $90 and have a $100 bill. You can’t spend $90 and have a $90 bill. You can’t spend $90 and have a $90 bill.
You can’t spend $90 and have a $90 bill. You can’t spend $90 and have a $90 bill. You

Generation with temperature 0.7:
This is a simple math problem: if you can use a pencil to make a straight line then you can use a pencil to draw a picture. For example, if you have a pencil you can make a line from one end of the pencil to the other. If you have a pencil, you can use it to draw a picture.
A pencil is not a perfect tool for drawing pictures. It’s not a perfect tool for writing letters. It’s not a perfect tool for drawing pictures

Generation with temperature 1.0:
This is a simple math problem: 1 + 1 equals 2. But it’s also the first step of solving any math problem. Before we begin, we must clearly define what math is meant to solve. 
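
For intuition about what the `temperature` argument does: sampling divides the logits by the temperature before applying softmax, so values below 1 concentrate probability on the most likely tokens and values near 1 leave the distribution flatter. The sketch below illustrates this with made-up logits (not values from OLMo). `softmax_with_temperature` is a hypothetical helper written for illustration, not part of `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Assumption: toy logits for a three-token vocabulary, chosen for illustration.
toy_logits = [2.0, 1.0, 0.0]

for temp in [0.3, 0.7, 1.0]:
    probs = softmax_with_temperature(toy_logits, temp)
    print(f"temperature {temp}: {[round(p, 3) for p in probs]}")
```

The top token's probability grows as the temperature drops, which is consistent with the temperature 0.3 sample above looping in much the same way as the default run.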

In [4]:
# Test prompts
test_prompts = [
    "Here's another elementary school math problem: A student",
    "Here's another elementary school math problem: In a store",
    "Here's another elementary school math problem: During a game",
]

In [10]:
def generate_problems(model, tokenizer, prompts, max_length=100):
    """Generate math problems using the base model."""
    outputs = []

    for prompt in prompts:
        # Prepare input
        inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

        # Generate with error handling
        try:
            output_ids = model.generate(
                **inputs,
                max_length=max_length,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id
            )

            output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
            outputs.append(output)

        except RuntimeError as e:
            print(f"Error during generation: {e}")
            outputs.append("Generation failed")

    return outputs

out_examples = generate_problems(model, tokenizer, test_prompts)

In [11]:
for example in out_examples:
    print(example)

Here's another elementary school math problem: A student was working on a problem in which he had to count how many apples were in a basket.
The student said, "I counted 12 apples in the basket."
"Then you're not counting correctly," the teacher responded.
"I'm not counting correctly," the student said.
"Then you're counting incorrectly," the teacher responded.
"I'm counting correctly," the student said.
"Then you're counting correctly," the
Here's another elementary school math problem: In a store, you have 10 apples, 10 oranges, and 10 bananas. The clerk gives you a bag of apples. How many apples are in the bag?
This is a little harder than the previous question. You're dealing with numbers, but you're not dealing with numbers you can see. So you're working with numbers that aren't real.
Let's say that you have a bag of apples. How many apples are in the
Here's another elementary school math problem: During a game of basketball, the team shoots the ball into the basket at a rate of 3

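The base model is clearly not good at generating valid math problems on its own. The outputs superficially resemble math problems (they sometimes contain numbers and quantities), but they fail in several ways: repetition (the same sentence looping until the length limit), logical inconsistencies (such as the contradictory teacher and student dialogue), and questions that lack the information needed to answer them.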
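
The repetition problem can also be measured rather than judged by eye. The sketch below computes the fraction of word 4-grams that duplicate an earlier 4-gram in the same text. This is an illustrative metric chosen here, not an evaluation used elsewhere in this notebook, and `repeated_ngram_fraction` is a hypothetical helper. The two sample strings are hand-written stand-ins modeled on the outputs above.

```python
def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that duplicate another n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words: nothing to compare
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Assumption: stand-in strings for illustration, not actual model outputs.
looping_text = "You can't spend $90 and have a $90 bill. " * 3
varied_text = (
    "A student has 12 apples and gives 5 to a friend. "
    "How many apples are left?"
)

print(f"looping: {repeated_ngram_fraction(looping_text):.3f}")
print(f"varied:  {repeated_ngram_fraction(varied_text):.3f}")
```

Applying the same function to each string in `out_examples` would give a rough per-sample repetition score to compare before and after fine-tuning.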

Let's try to fix this via fine-tuning.