# Model Playground
This notebook is an easy, low-effort way to get a feel for how models behave before and after my fine-tuning runs, with extra comforts like interactive input and real-time streaming!

## Quick prep
Optional: install `hf_transfer` and pre-download the fine-tuned checkpoint so that loading it later is fast!

In [1]:
# !pip install -q "huggingface_hub[hf_transfer]"
# %env HF_HUB_ENABLE_HF_TRANSFER=1
# !huggingface-cli download Neelectric/OLMo-2-1124-7B-Instruct_SFTv00.05
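
The same prep can also be done from Python rather than through shell commands. This matters in a notebook because a variable exported through `!` lives only in a throwaway subshell. The sketch below is a rough equivalent, not a tested recipe: `download_checkpoint` is a hypothetical helper name, and setting the flag before `huggingface_hub` is imported reflects my assumption that the library reads it at import time.

```python
import os

# Set the flag in this process so later imports and downloads can see it.
# Assumption: huggingface_hub reads this flag when it is first imported.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


def download_checkpoint(repo_id):
    """Download a full model repo into the local Hugging Face cache.

    Hypothetical helper; returns the local snapshot directory.
    """
    # Imported lazily so the environment variable above is already set.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


# Usage (needs network access and the hf_transfer package installed):
# local_dir = download_checkpoint("Neelectric/OLMo-2-1124-7B-Instruct_SFTv00.05")
```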

## Imports

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch


## Model registry
Here we choose the model(s), their corresponding tokenizers, and the dataset we want to load.

In [3]:
model_id_base = "allenai/OLMo-2-1124-7B-Instruct"
model_id_ft = "Neelectric/OLMo-2-1124-7B-Instruct_SFTv00.05"

In [4]:
tokenizer_base = AutoTokenizer.from_pretrained(model_id_base)
tokenizer_ft = AutoTokenizer.from_pretrained(model_id_ft)

In [5]:
device_base = "cuda:0"
device_ft = "cuda:1"
model_base = AutoModelForCausalLM.from_pretrained(
    model_id_base,
    torch_dtype=torch.bfloat16,
    # device_map=device_base    
).to(device_base)
model_ft = AutoModelForCausalLM.from_pretrained(
    model_id_ft,
    torch_dtype=torch.bfloat16,
    # device_map=device_ft  
).to(device_ft)

Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.90it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.94it/s]


In [6]:
dataset_name = "open-r1/OpenR1-Math-220k"
split_name = "train"
open_r1 = load_dataset(dataset_name, split=split_name)

In [7]:
streamer_base = TextStreamer(tokenizer_base)
streamer_ft = TextStreamer(tokenizer_ft)
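
The base and fine-tuned sections later in this notebook run the same three steps: apply the chat template, tokenize, and generate with greedy decoding. Those steps could be wrapped in one helper so both models are guaranteed to get identical settings. This is a minimal sketch, not how the notebook is actually structured; `stream_reply` is a hypothetical name, and it only relies on the tokenizer and model methods already used in this notebook.

```python
def stream_reply(model, tokenizer, messages, device, streamer=None, max_new_tokens=1024):
    """Template, tokenize, and greedily generate a reply; return only the new text."""
    # Render the chat messages to a prompt string that ends with the assistant tag.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, as in the cells of this notebook
    )
    # generate() returns prompt + continuation, so slice off the prompt tokens.
    prompt_len = len(inputs["input_ids"][0])
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)


# Usage in this notebook (names as defined in the cells above and below):
# reply_base = stream_reply(model_base, tokenizer_base, chatML_input, device_base,
#                           streamer=streamer_base)
# reply_ft = stream_reply(model_ft, tokenizer_ft, chatML_input, device_ft,
#                         streamer=streamer_ft, max_new_tokens=4096)
```

Returning only the continuation makes it easier to compare the two replies side by side after streaming has finished.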

## Let's prompt for some user input!

In [None]:
# Alternatively, override the prompt with this open-r1/OpenR1-Math-220k question (the first one in the dataset).
# Raw string so the LaTeX backslashes (e.g. \mathrm) are kept literally.
user_input = r"""## Task B-1.3.

A ship traveling along a river has covered $24 \mathrm{~km}$ upstream and $28 \mathrm{~km}$ downstream. For this journey, it took half an hour less than for traveling $30 \mathrm{~km}$ upstream and $21 \mathrm{~km}$ downstream, or half an hour more than for traveling $15 \mathrm{~km}$ upstream and $42 \mathrm{~km}$ downstream, assuming that both the ship and the river move uniformly.

Determine the speed of the ship in still water and the speed of the river."""
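
Instead of pasting the question by hand, the same text could be pulled straight from the loaded dataset. This is a minimal sketch that assumes the question text lives in a column named `problem` (worth checking against `open_r1.column_names`); `question_at` is a hypothetical helper, and the demo rows are stand-ins so the snippet runs without downloading anything.

```python
def question_at(dataset, index=0, column="problem"):
    """Return the text of one dataset row to use as the prompt."""
    row = dataset[index]
    if column not in row:
        # Fail loudly with the available columns if the assumed name is wrong.
        raise KeyError(f"column {column!r} not found; available: {sorted(row)}")
    return row[column]


# Stand-in rows with the assumed layout (not real dataset content).
demo_rows = [{"problem": "What is 2 + 2?", "answer": "4"}]
print(question_at(demo_rows))

# In this notebook: user_input = question_at(open_r1)
```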

In [15]:
# Comment out the next line to keep the override above instead of prompting.
user_input = input()
chatML_input = [{"role": "user", "content": user_input}]
print(user_input)

## Base model
Time to see what the base model says here:

In [16]:
# Render the chat template to a plain string; tokenization happens in the next cell.
templated_base = tokenizer_base.apply_chat_template(
    chatML_input,
    tokenize=False,
    add_generation_prompt=True,
)

In [17]:
inputs_base = tokenizer_base(templated_base, return_tensors="pt").to(device_base)
print(inputs_base.input_ids.shape)  # (batch, prompt length in tokens)

# Greedy decoding (do_sample=False); tokens are printed as they are generated.
output_base = model_base.generate(
    **inputs_base,
    streamer=streamer_base,
    max_new_tokens=1024,
    do_sample=False,
)

torch.Size([1, 40])
<|endoftext|><|user|>
When I apply supervised fine-tuning, should special tokens from the tokenizer be masked out to a label of -100 so pytorch ignores them?
<|assistant|>


When you apply supervised fine-tuning, especially in the context of natural language processing (NLP) tasks using models like BERT, GPT, or RoBERTa, it's common to mask out certain tokens to force the model to learn meaningful representations of the input text. These masked tokens are typically replaced with a special token (often `[MASK]` or `<MASK>`) to indicate that the model should predict their value based on the context provided by the other unmasked tokens.

However, using a label of `-100` or any other negative number to mask tokens is unconventional and not recommended for the following reasons:

1. **Model Interpretability**: Using a standard token like `[MASK]` or `<MASK>` is widely recognized and easily interpretable by others working with your model. Negative numbers, especially `-100`, do not convey this meaning clearly and could lead to confusion.

2. **Training Stability**: Negative numbers might not be handled correctly by the model during training, potentially causing

## Fine-tuned model
And now we check what the model thinks after my fine-tuning was applied!

In [18]:
# Render the chat template to a plain string; tokenization happens in the next cell.
templated_ft = tokenizer_ft.apply_chat_template(
    chatML_input,
    tokenize=False,
    add_generation_prompt=True,
)
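
The two tokenizers do not render the same prompt. In the streamed outputs of this notebook, the fine-tuned tokenizer's template prepends a system prompt, which is why the prompt lengths differ (40 vs. 102 tokens). A quick way to see exactly what differs is to diff the two templated strings. This is a minimal sketch using only the standard library; `diff_prompts` is a hypothetical helper, and the two demo strings are abbreviated stand-ins for `templated_base` and `templated_ft`.

```python
import difflib


def diff_prompts(prompt_a, prompt_b, name_a="base", name_b="ft"):
    """Return a unified diff (list of lines) between two rendered prompts."""
    return list(
        difflib.unified_diff(
            prompt_a.splitlines(),
            prompt_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Abbreviated stand-ins; in this notebook pass templated_base and templated_ft.
demo_base = "<|endoftext|><|user|>\nHello\n<|assistant|>\n"
demo_ft = (
    "<|endoftext|><|system|>\nYou are a helpful AI Assistant.\n"
    "<|user|>Hello\n<|assistant|>\n"
)

for line in diff_prompts(demo_base, demo_ft):
    print(line)
```

Keeping this difference in mind helps when comparing the two replies: part of any behaviour change may come from the system prompt rather than from the fine-tuned weights alone.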

In [19]:
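inputs_ft = tokenizer_ft(templated_ft, return_tensors="pt").to(device_ft)
print(inputs_ft.input_ids.shape)  # (batch, prompt length in tokens)

# Greedy decoding (do_sample=False); tokens are printed as they are generated.
output_ft = model_ft.generate(
    **inputs_ft,
    streamer=streamer_ft,
    max_new_tokens=4096,
    do_sample=False,
)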

torch.Size([1, 102])
<|endoftext|><|system|>
You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>
...
</think>
<answer>
...
</answer>
<|user|>When I apply supervised fine-tuning, should special tokens from the tokenizer be masked out to a label of -100 so pytorch ignores them?
<|assistant|>


<think>
Okay, so I need to figure out whether special tokens from the tokenizer should be masked out to a label of -100 when applying supervised fine-tuning. Hmm, let me start by recalling what supervised fine-tuning is. From what I remember, it's a technique used in machine learning where you take a pre-trained model and fine-tune it on a specific task. The idea is to leverage the existing knowledge in the pre-trained model while adjusting it to the new task. 

Now, when you fine-tune a model, you usually do this by training it on a dataset that's relevant to the task you want to solve. The model learns from the data, updating its weights to perform better on the new task. But here's the thing: during this process, the model might encounter tokens that it hasn't seen before. These could be special tokens, like the ones added by the tokenizer to represent things like padding, start, and end of sequences, or maybe even other tokens that are specific to the tokenizer's implementation.

T