# Fine-Tuning a Pruned LLaMA 3.1 Model Using LoRA

This notebook demonstrates how to fine-tune a pruned LLaMA 3.1 (8B) language model with Low-Rank Adaptation (LoRA). Pruning removes whole transformer layers to cut compute and memory cost (the checkpoint loaded below keeps 27 decoder layers), and LoRA trains only a small set of added low-rank weights, so fine-tuning fits in limited resources.

The workflow includes:
- Installing required libraries and loading the dataset
- Initializing the pruned model and tokenizer
- Applying LoRA configuration for fine-tuning
- Training the model with Hugging Face's `Trainer`
- Saving the fine-tuned model for later use

This project focuses on maintaining performance while reducing computational costs through pruning and parameter-efficient tuning.


In [1]:
!pip install datasets lm_eval py7zr peft transformers



In [2]:
from tqdm.notebook import tqdm

from datasets import load_dataset
import torch
from torch.utils.data import DataLoader

from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
)
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)


# Loading the model

In [3]:
model_name = "Shahrukh0/shortgpt_llama3.1_8B_hellaswag_angular"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


In [4]:
# Inspect the model architecture
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-26): 27 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_

# Loading the dataset

In [5]:
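import logging
logging.basicConfig(level=logging.INFO)
# Module-level logger used by the preprocessing function below
logger = logging.getLogger(__name__)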

In [6]:
from datasets import load_dataset

# Load the HellaSwag dataset
raw_datasets = load_dataset("hellaswag")

# Print column names for the 'train' split
print(raw_datasets["train"].column_names)

['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings', 'source_id', 'split', 'split_type', 'label']


## Preprocessing HellaSwag

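The function below loads HellaSwag, formats each example with a prompt template, and tokenizes the context together with the correct ending. It truncates or pads every sequence to `max_length` and sets the labels of prompt and padding tokens to `-100`, so only continuation tokens contribute to the loss. It returns the processed train and validation datasets.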


In [7]:
# Preprocess HellaSwag dataset
def get_preprocessed_hellaswag(max_length=512):
    try:
        raw_datasets = load_dataset("hellaswag", cache_dir="./dataset_cache")
    except Exception as e:
        logger.error(f"Failed to load HellaSwag dataset: {e}")
        raise

    prompt_template = "Given the following context, generate the most plausible continuation:\nContext: {context}\nContinuation:\n"

    def apply_prompt_template(sample):
        return {
            "prompt": prompt_template.format(context=sample["ctx"]),
            "continuation": sample["endings"][int(sample["label"])]
        }

    def tokenize_add_label(sample):
        prompt = tokenizer.encode(tokenizer.bos_token + sample["prompt"], add_special_tokens=False)
        continuation = tokenizer.encode(sample["continuation"] + tokenizer.eos_token, add_special_tokens=False)
        input_ids = prompt + continuation

        # Build labels before truncating so inputs and labels stay aligned:
        # prompt tokens are masked with -100 and only the continuation is scored.
        labels = [-100] * len(prompt) + continuation

        if len(input_ids) > max_length:
            logger.warning(f"Truncating sample with length {len(input_ids)} to {max_length}")
            input_ids = input_ids[:max_length]
            labels = labels[:max_length]

        attention_mask = [1] * len(input_ids)

        padding_length = max_length - len(input_ids)
        if padding_length > 0:
            input_ids += [tokenizer.pad_token_id] * padding_length
            attention_mask += [0] * padding_length
            labels += [-100] * padding_length

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }

    train_dataset = raw_datasets["train"].map(
        apply_prompt_template, remove_columns=raw_datasets["train"].column_names, num_proc=4
    )
    val_dataset = raw_datasets["validation"].map(
        apply_prompt_template, remove_columns=raw_datasets["validation"].column_names, num_proc=4
    )
    train_dataset = train_dataset.map(tokenize_add_label, remove_columns=train_dataset.column_names, num_proc=4)
    val_dataset = val_dataset.map(tokenize_add_label, remove_columns=val_dataset.column_names, num_proc=4)

    return train_dataset, val_dataset

In [8]:
train_dataset, eval_dataset = get_preprocessed_hellaswag()
print(f"Train dataset loaded with {len(train_dataset)} examples AND Validation dataset loaded with {len(eval_dataset)} examples")

Train dataset loaded with 39905 examples AND Validation dataset loaded with 10042 examples
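To make the label masking concrete, here is a minimal sketch that applies the same concatenate, truncate, mask, and pad steps to toy token IDs. The function name `build_example`, the token IDs, and `pad_id=0` are illustrative assumptions, not part of the notebook; no tokenizer is involved.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, continuation_ids, max_length, pad_id):
    """Toy version of the preprocessing: concatenate, truncate, mask, pad."""
    input_ids = prompt_ids + continuation_ids
    # Prompt positions are masked; only continuation tokens carry a label.
    labels = [IGNORE_INDEX] * len(prompt_ids) + continuation_ids

    # Truncate inputs and labels together so they stay aligned.
    input_ids = input_ids[:max_length]
    labels = labels[:max_length]

    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,  # padding is masked too
    }


# Hypothetical IDs: 3 prompt tokens, 3 continuation tokens, padded to length 8.
example = build_example([1, 11, 12], [21, 22, 2], max_length=8, pad_id=0)
print(example)
```

All three lists always have length `max_length`, even when the prompt alone is longer than the limit, which is what the default data collator needs to stack them into tensors.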


## Applying LoRA for Parameter-Efficient Fine-Tuning

Defines and applies a LoRA configuration for causal language modeling. Targets attention projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`) with rank `r=8`, scaling `alpha=16`, and `0.1` dropout. Wraps the base model with PEFT, enabling low-rank adaptation. Also logs trainable parameter count.


In [9]:
model.train()

def create_peft_config(model):
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    )

    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)

trainable params: 5,750,784 || all params: 6,945,452,032 || trainable%: 0.0828
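The trainable-parameter count can be checked by hand. The sketch below assumes the standard LoRA layout, in which each targeted linear layer gains two matrices, `A` of shape `(r, in_features)` and `B` of shape `(out_features, r)`; the layer shapes and the layer count are read from the architecture printed earlier.

```python
# (in_features, out_features) of each targeted projection, from the printed architecture
TARGETS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 27  # decoder layers in the pruned model
RANK = 8         # r in the LoRA config


def lora_params(in_features, out_features, rank):
    # A has rank * in_features entries, B has out_features * rank entries.
    return rank * in_features + out_features * rank


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
trainable = per_layer * NUM_LAYERS
all_params = 6_945_452_032  # total reported by print_trainable_parameters()

print(f"per layer: {per_layer:,}")
print(f"trainable: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Under these assumptions the result agrees with the count reported by `print_trainable_parameters()`.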


In [10]:
model.config.use_cache = False

In [11]:
output_dir = "tmp/"

config = {
    'lora_config': lora_config,
    'learning_rate': 2e-5,
    'num_train_epochs': 1,
    'per_device_train_batch_size': 1,
    'gradient_accumulation_steps': 8,
    'gradient_checkpointing': True,
    'lr_scheduler_type': 'cosine',
    'warmup_ratio': 0.1,
}
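These settings imply an effective batch size of 8 and a learning rate that warms up linearly and then decays along a half cosine. The sketch below is a rough estimate, assuming a single GPU and the 39,905-example train split; the exact step count depends on how the installed `Trainer` version rounds partial accumulation steps, and `lr_at` is a hypothetical helper written here for illustration.

```python
import math

NUM_EXAMPLES = 39_905   # HellaSwag train split size
BATCH_SIZE = 1          # per_device_train_batch_size
GRAD_ACCUM = 8          # gradient_accumulation_steps
EPOCHS = 1
BASE_LR = 2e-5
WARMUP_RATIO = 0.1

effective_batch = BATCH_SIZE * GRAD_ACCUM
# Floor division is an approximation of the number of optimizer steps.
total_steps = (NUM_EXAMPLES // effective_batch) * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Linear warmup to BASE_LR, then half-cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

With roughly 5,000 optimizer steps, about the first 500 are spent warming up, so the peak rate of `2e-5` applies only briefly before the decay begins.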

In [12]:
allowed_keys = set(TrainingArguments.__init__.__code__.co_varnames)
safe_config = {k: v for k, v in config.items() if k in allowed_keys}
safe_config

{'learning_rate': 2e-05,
 'num_train_epochs': 1,
 'per_device_train_batch_size': 1,
 'gradient_accumulation_steps': 8,
 'gradient_checkpointing': True,
 'lr_scheduler_type': 'cosine',
 'warmup_ratio': 0.1}
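The filtering step keeps only keys that the constructor accepts, which is why `lora_config` disappears from the output. An alternative sketch uses `inspect.signature`, which lists parameters only, whereas `co_varnames` also includes local variable names. `ToyArguments` is a hypothetical stand-in for `TrainingArguments` so the example runs without `transformers` installed.

```python
import inspect
from dataclasses import dataclass


@dataclass
class ToyArguments:  # hypothetical stand-in for TrainingArguments
    learning_rate: float = 5e-5
    num_train_epochs: int = 3


def filter_kwargs(cls, candidates):
    """Keep only the keys that cls.__init__ accepts as parameters."""
    accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
    return {k: v for k, v in candidates.items() if k in accepted}


raw = {"learning_rate": 2e-5, "num_train_epochs": 1, "lora_config": object()}
safe = filter_kwargs(ToyArguments, raw)
args = ToyArguments(**safe)
print(sorted(safe), args)
```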

## Training Setup with Hugging Face Trainer

Initializes `TrainingArguments` with mixed precision (`fp16`) and the `adamw_torch_fused` optimizer. Training loss is logged every 50 steps; evaluation and checkpoint saving happen at the end of each epoch, at most two checkpoints are kept, and the checkpoint with the lowest validation loss is reloaded when training ends. A `Trainer` is then built from the LoRA-wrapped pruned model, the preprocessed datasets, and the default data collator.


In [13]:
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    logging_strategy="steps",
    logging_steps=50,
    save_strategy="epoch",         # save a checkpoint at the end of each epoch
    save_steps=50,                 # ignored while save_strategy="epoch"
    save_total_limit=2,            # keep at most 2 checkpoints on disk
    eval_strategy="epoch",         # evaluate each epoch so losses can be compared
    load_best_model_at_end=True,   # reload the best checkpoint after training
    metric_for_best_model="loss",  # rank checkpoints by validation loss
    greater_is_better=False,       # lower loss is better
    fp16=True,
    report_to="none",
    optim="adamw_torch_fused",
    **{k: v for k, v in config.items() if k != 'lora_config'}
)

# Override the value passed in through `config`: gradient checkpointing is disabled for this run
training_args.gradient_checkpointing = False

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
    callbacks=[],
)


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [14]:
torch.cuda.empty_cache()

In [15]:
# Free cached GPU memory and reset the peak-memory counter before training
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()


In [16]:
# Start training
print("Starting training...")
trainer.train()
print("Training completed!")

Starting training...


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 1.9575        | 1.966412        |


Training completed!
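A cross-entropy loss is easier to interpret as a perplexity, which is simply its exponential. The short sketch below converts the two losses reported for this run. Because prompt and padding tokens are masked, this is a perplexity over continuation tokens only, under this notebook's prompt format, so it should not be compared directly with benchmark perplexities measured on other text.

```python
import math

train_loss = 1.9575      # training loss reported for the epoch
val_loss = 1.966412      # validation loss reported for the epoch

# Perplexity is exp(mean negative log-likelihood per scored token).
train_ppl = math.exp(train_loss)
val_ppl = math.exp(val_loss)

print(f"train perplexity: {train_ppl:.3f}")
print(f"validation perplexity: {val_ppl:.3f}")
```

The small gap between the two values suggests little overfitting after one epoch, although a single epoch gives only one data point.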


In [17]:
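# Build an input in the training prompt format (the context sentence is an illustrative placeholder)
prompt = "Given the following context, generate the most plausible continuation:\nContext: A man is sitting on a roof. He\nContinuation:\n"
model_input = tokenizer(prompt, return_tensors="pt").to(model.device)

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))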

In [18]:
# Merge LoRA adapter with base model
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("llama3.1_8B_finetune_angular_5L")
tokenizer.save_pretrained("llama3.1_8B_finetune_angular_5L")


('llama3.1_8B_finetune_angular_5L/tokenizer_config.json',
 'llama3.1_8B_finetune_angular_5L/special_tokens_map.json',
 'llama3.1_8B_finetune_angular_5L/tokenizer.json')
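Merging folds the adapter into the frozen weight, so the saved model needs no PEFT code at inference time. The sketch below illustrates the underlying identity, `W_merged = W + (alpha / r) * B @ A`, on tiny hand-written matrices in plain Python. The matrices and the toy values `alpha=2`, `r=1` are assumptions chosen to give the same scaling factor of 2 as the notebook's `alpha=16`, `r=8`; dropout is ignored, as it is in evaluation mode.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[s * x for x in row] for row in a]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen weight, shape (out=2, in=2)
A = [[0.5, -1.0]]             # LoRA A, shape (r=1, in=2)
B = [[2.0], [1.0]]            # LoRA B, shape (out=2, r=1)
alpha, r = 2.0, 1
scaling = alpha / r

x = [[1.0], [1.0]]            # input as a column vector

# Adapter path: frozen output plus the scaled low-rank update.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged path: fold the update into the weight once, then use a single matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Both paths produce the same output, which is why a merged checkpoint behaves like the adapter-wrapped model up to floating-point rounding.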

In [19]:
# Login to Hugging Face Hub
from huggingface_hub import notebook_login
notebook_login()


Token has not been saved to git credential helper.


In [20]:
# Push to Hub
merged_model.push_to_hub("rahatneuron/llama3.1_8B_finetune_angular_5L")
tokenizer.push_to_hub("rahatneuron/llama3.1_8B_finetune_angular_5L")

CommitInfo(commit_url='https://huggingface.co/rahatneuron/llama3.1_8B_finetune_angular_5L/commit/f8e3eacff4df5b702d1f4b2f6860e0da7c39d111', commit_message='Upload tokenizer', commit_description='', oid='f8e3eacff4df5b702d1f4b2f6860e0da7c39d111', pr_url=None, repo_url=RepoUrl('https://huggingface.co/rahatneuron/llama3.1_8B_finetune_angular_5L', endpoint='https://huggingface.co', repo_type='model', repo_id='rahatneuron/llama3.1_8B_finetune_angular_5L'), pr_revision=None, pr_num=None)

In [21]:
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

def evaluate_loaded_model(model, tokenizer, tasks, num_fewshot=0):
    """
    Evaluates an already-loaded Hugging Face model using lm-eval harness
    
    Args:
    - model: Loaded Hugging Face model
    - tokenizer: Loaded Hugging Face tokenizer
    - tasks: List of tasks to evaluate
    - num_fewshot: Number of few-shot examples to use
    
    Returns:
    - Dictionary of metrics
    """
    # Create model wrapper for lm-eval
    lm = HFLM(
        pretrained=model,
        tokenizer=tokenizer,
        device="cuda",
        batch_size=1  # Keep batch_size=1 to match original hyperparameters
    )
    
    # Run evaluation with same parameters
    results = evaluator.simple_evaluate(
        model=lm,
        tasks=tasks,
        num_fewshot=num_fewshot,
        limit=1000,
        bootstrap_iters=10,
        log_samples=False
    )
    
    return results['results']

In [22]:

tasks = ["hellaswag", "mmlu", "boolq", "lambada", "arc_easy"]
metrics = evaluate_loaded_model(model, tokenizer, tasks, num_fewshot=0)
print(metrics)

INFO:lm_eval.evaluator:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm_eval.evaluator:Using pre-initialized model


INFO:lm_eval.api.task:Building contexts for arc_easy on rank 0...
100%|██████████| 1000/1000 [00:00<00:00, 1451.56it/s]
INFO:lm_eval.api.task:Building contexts for lambada_standard on rank 0...
100%|██████████| 1000/1000 [00:01<00:00, 762.63it/s]
INFO:lm_eval.api.task:Building contexts for lambada_openai on rank 0...
100%|██████████| 1000/1000 [00:01<00:00, 767.94it/s]
INFO:lm_eval.api.task:Building contexts for boolq on rank 0...
100%|██████████| 1000/1000 [00:00<00:00, 2737.99it/s]
INFO:lm_eval.api.task:Building contexts for mmlu_high_school_biology on rank 0...
100%|██████████| 310/310 [00:00<00:00, 845.98it/s]
INFO:lm_eval.api.task:Building contexts for mmlu_astronomy on rank 0...
100%|██████████| 152/152 [00:00<00:00, 873.68it/s]
INFO:lm_eval.api.task:Building contexts for mmlu_college_physics on rank 0...
100%|██████████| 102/102 [00:00<00:00, 858.55it/s]
INFO:lm_eval.api.task:Building contexts for mmlu_college_computer_science on rank 0...
100%|██████████| 100/100 [00:00<00:00, 

bootstrapping for stddev: perplexity


100%|██████████| 1/1 [00:00<00:00, 8905.10it/s]


bootstrapping for stddev: perplexity


100%|██████████| 1/1 [00:00<00:00, 1098.85it/s]
fatal: not a git repository (or any parent up to mount point /teamspace/studios)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).


{'arc_easy': {'alias': 'arc_easy', 'acc,none': 0.763, 'acc_stderr,none': 0.01345407046257795, 'acc_norm,none': 0.764, 'acc_norm_stderr,none': 0.013434451402438683},
 'boolq': {'alias': 'boolq', 'acc,none': 0.703, 'acc_stderr,none': 0.0144568322948011},
 'hellaswag': {'alias': 'hellaswag', 'acc,none': 0.517, 'acc_stderr,none': 0.015810153729833427, 'acc_norm,none': 0.691, 'acc_norm_stderr,none': 0.014619600977206494},
 'lambada_openai': {'alias': 'lambada_openai', 'perplexity,none': 8.588829746629266, 'perplexity_stderr,none': 0.33105226449814545, 'acc,none': 0.523, 'acc_stderr,none': 0.015802554246726098},
 'lambada_standard': {'alias': 'lambada_standard', 'perplexity,none': 6.830204718283761, 'perplexity_stderr,none': 0.41332676970571325, 'acc,none': 0.577, 'acc_stderr,none': 0.015630589090476345},
 'mmlu': {'acc,none': 0.6213355048859935, 'acc_stderr,none': 0.0038995344758725454, 'alias': 'mmlu'},
 'mmlu_humanities': {'acc,none': 0.5667705586190362, 'acc_stderr,none': 0.00695161933732601, 
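
The `acc_stderr` values indicate how much each accuracy could move with a different sample of 1,000 examples (`limit=1000`). The sketch below assumes the standard error is the sample standard deviation of the 0/1 correctness scores divided by the square root of the sample size; this is an assumption about how the harness computes it, although it reproduces the reported figures closely. `accuracy_stderr` and `confidence_interval` are hypothetical helper names.

```python
import math


def accuracy_stderr(acc, n):
    """Standard error of the mean of n 0/1 outcomes, using the (n - 1) sample variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc, n, z=1.96):
    """Approximate 95% normal interval around an accuracy."""
    half_width = z * accuracy_stderr(acc, n)
    return acc - half_width, acc + half_width


N = 1000  # examples per task, set by limit=1000
for task, acc in [("arc_easy", 0.763), ("boolq", 0.703), ("hellaswag", 0.517)]:
    low, high = confidence_interval(acc, N)
    print(f"{task}: acc={acc:.3f} stderr={accuracy_stderr(acc, N):.6f} 95% CI=({low:.3f}, {high:.3f})")
```

With 1,000 examples the intervals span roughly three percentage points on either side, so small differences between models on these subsets may not be meaningful.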