In [7]:
pip install git+https://github.com/huggingface/transformers datasets evaluate torch

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /private/var/folders/lt/mt1wm0qn1xvggfy2hl96wrlh0000gn/T/pip-req-build-cwcbl3bt
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /private/var/folders/lt/mt1wm0qn1xvggfy2hl96wrlh0000gn/T/pip-req-build-cwcbl3bt
  Resolved https://github.com/huggingface/transformers to commit c8c8dffbe45ebef0a8dba4a51024e5e5e498596b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.


In [8]:
from IPython.display import Markdown, display

In [13]:
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-135M"
device = "cpu"  # use "cuda" for GPU or "cpu" for CPU
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)


In [30]:
'''
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
'''



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Gravity is the force that holds the Earth and the Moon together.

The Moon is a satellite of the


In [28]:
from datasets import load_dataset


# Load the MMLU "all" configuration for the validation split
dataset = load_dataset("cais/mmlu", "all", split="validation")
print(dataset[0])
# This should print a sample question and associated choices.


Generating test split: 100%|██████████| 14042/14042 [00:00<00:00, 508639.77 examples/s]
Generating validation split: 100%|██████████| 1531/1531 [00:00<00:00, 397344.19 examples/s]
Generating dev split: 100%|██████████| 285/285 [00:00<00:00, 131692.92 examples/s]
Generating auxiliary_train split: 100%|██████████| 99842/99842 [00:00<00:00, 686242.30 examples/s]

{'question': 'The cyclic subgroup of Z_24 generated by 18 has order', 'subject': 'abstract_algebra', 'choices': ['4', '8', '12', '6'], 'answer': 0}
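
The sample above shows the record layout: `choices` is a list of answer strings and `answer` is a 0-based index into it. A minimal sketch of reading a record follows; the helper names `gold_choice` and `answer_letter` are hypothetical and not part of `datasets`, and the record is copied from the printed sample so the snippet runs without downloading anything.

```python
# A record copied from the sample printed above.
item = {
    "question": "The cyclic subgroup of Z_24 generated by 18 has order",
    "subject": "abstract_algebra",
    "choices": ["4", "8", "12", "6"],
    "answer": 0,
}


def gold_choice(item):
    """Return the text of the correct choice (`answer` indexes into `choices`)."""
    return item["choices"][item["answer"]]


def answer_letter(item):
    """Return the conventional letter label (A-D) for the correct choice."""
    return "ABCD"[item["answer"]]


print(gold_choice(item), answer_letter(item))
```

Here the gold answer is "4": gcd(18, 24) = 6, so the subgroup generated by 18 has order 24 / 6 = 4.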





In [31]:
import torch
from torch.nn.functional import log_softmax
from tqdm import tqdm

# Assuming "dataset" is the validation set you loaded
# dataset = load_dataset("cais/mmlu", "all", split="validation")

correct = 0
total = 0

for item in tqdm(dataset):
    question = item["question"]
    choices = item["choices"]
    correct_answer_idx = item["answer"]  # integer index of the correct choice

    # We'll build a prompt for each choice and compute its log prob.
    # Prompt template:
    # "Q: {question}\nA: {candidate_answer}"
    # We'll return the sum of logprobs of the tokens in candidate_answer.

    choice_logprobs = []
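    for choice_text in choices:
        prompt = f"Q: {question}\nA: "
        # We'll get the probability of the choice tokens given the prompt
        prompt_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
        # Tokenize the choice without special tokens so that no real token has to be sliced off.
        # (This tokenizer does not appear to prepend a BOS token: the decoded generation above
        # shows none, so dropping the first token would discard part or all of the choice.)
        choice_ids = tokenizer.encode(choice_text, add_special_tokens=False, return_tensors="pt").to(device)

        # We feed the model with prompt+choice, then compare logprobs
        input_ids = torch.cat([prompt_ids, choice_ids], dim=1)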
        
        with torch.no_grad():
            outputs = model(input_ids, return_dict=True)
        
        # logits shape: [batch, seq_length, vocab_size]
        logits = outputs.logits
        # We want the logprob of each choice token conditioned on the preceding text.
        # The logprob of the token at position i comes from the logits at position i-1.
        # We sum only from the prompt length onward, since we only care about the
        # probability assigned to the choice tokens.
        
        # Separate the prompt and choice tokens
        prompt_len = prompt_ids.shape[1]
        choice_token_ids = input_ids[0, prompt_len:]  # tokens corresponding to the choice

        # Get logits corresponding to these choice tokens
        choice_logits = logits[0, prompt_len-1:-1, :]  # -1 because we shift by one for next-token prediction

        # Compute log probabilities
        log_probs = log_softmax(choice_logits, dim=-1)
        
        # Sum the logprobs of the tokens in the choice:
        # row k of log_probs scores the k-th choice token, so gather one entry per row.
        choice_score = log_probs.gather(1, choice_token_ids.unsqueeze(1)).sum().item()
        choice_logprobs.append(choice_score)

    # Pick the choice with the highest logprob
    predicted_idx = torch.argmax(torch.tensor(choice_logprobs)).item()
    
    if predicted_idx == correct_answer_idx:
        correct += 1
    total += 1

accuracy = correct / total * 100
print(f"MMLU Accuracy: {accuracy:.2f}%")


  1%|          | 18/1531 [06:02<6:00:25, 14.29s/it] 
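
The shift-by-one indexing in the scoring loop is easy to get wrong, so here is a toy version in pure Python that can be checked by hand. It is only a sketch: the logits are made-up numbers for a 3-token vocabulary, and `score_continuation` is a hypothetical helper name, not something from `transformers` or `torch`.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def score_continuation(logits, input_ids, prompt_len):
    """Sum log P(token at t | tokens before t) over the continuation tokens.

    logits[t] is the prediction for position t + 1, so the token at
    position t is scored with logits[t - 1].
    """
    total = 0.0
    for t in range(prompt_len, len(input_ids)):
        total += log_softmax(logits[t - 1])[input_ids[t]]
    return total


# Made-up example: 2 prompt tokens followed by 2 candidate-answer tokens.
input_ids = [0, 2, 1, 1]
logits = [
    [0.0, 0.0, 0.0],                      # predicts position 1 (prompt, not scored)
    [0.0, math.log(2.0), 0.0],            # predicts position 2: P(token 1) = 2/4
    [math.log(3.0), math.log(6.0), 0.0],  # predicts position 3: P(token 1) = 6/10
    [0.0, 0.0, 0.0],                      # predicts past the end (unused)
]

score = score_continuation(logits, input_ids, prompt_len=2)
print(f"{score:.4f}")  # log(0.5 * 0.6) = log(0.3), i.e. -1.2040
```

One thing this makes visible: a candidate that contributes zero tokens gets a score of exactly 0.0, which is higher than any real candidate's (negative) score, so it is worth checking that every choice contributes at least one token. Summed logprobs also tend to favor shorter candidates; dividing by the number of choice tokens is one common alternative, though it is not what the loop above does.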