## Low-cost Model Adaptation in Low-precision Sparse Foundation Models: **SQFT + SparsePEFT (NLS)** 🚀

Welcome to an exhilarating journey as we delve into the realm of fine-tuning for efficient large language models (LLMs)! 🌟

We'll be working with Apple's OpenELM model and optimizing it for a specific task with one of our SQFT pipelines: **SQFT + SparsePEFT (NLS)**.

All of our SQFT notebooks:

- **SQFT (LoRA)**: [link]()
- **SQFT (NLS)**: [link]()
- **SQFT + SparsePEFT (LoRA)**: [link]()
- **SQFT + SparsePEFT (NLS)**: [link]()
- **SQFT + QA-SparsePEFT (LoRA)**: [link]()
- **SQFT + QA-SparsePEFT (NLS)**: [link]()

> Note: This notebook introduces the use of this pipeline with NLS. For a simpler understanding of SQFT, you can start with some simple experiments using the twin notebook [SQFT + SparsePEFT (LoRA)](). NLS enhances the performance of sparse or quantized models beyond what LoRA offers.
  

### Overview

In this notebook, you will learn a practical and novel solution for producing efficient (**sparse** or **quantized**) models that are fine-tuned for specific downstream tasks in real-world applications. The notebook walks through SQFT + SparsePEFT (NLS) in the following steps:

1. Set up the Environment ⚙️
2. Sparsification ✂️
3. Load Model 🚀
4. Test the Model Before Tuning 🧪
5. Configure the LoRA Settings 🏋️‍♂️
6. Configure NNCF For NLS 🛠️
7. Prepare the Dataset 📚
8. Finetune the Model 🎯
9. Extract the heuristic sub-adapter 🔧
10. Search an optimal sub-adapter (Optional) 🔍
11. Merge the Model 🧩
12. Evaluate the Finetuned Model 🏆

This notebook illustrates how fine-tuning can significantly boost the performance of a sparse model across a diverse array of topics, enhancing its versatility and applicability to various domains. You will gain valuable insights into the process of developing a **task-specific** highly-efficient model capable of delivering accurate and relevant responses to a broad spectrum of questions.

### Quick Start

___
#### Step 1: Set up the Environment ⚙️

Let's start by setting up our environment! We'll install all the essential packages, including the Hugging Face `transformers` library, `peft` library, and a few additional tools. 📦
Please follow https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT#setup to set up the environment for SQFT.

👀 Check whether GPU is available:

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    torch.cuda.empty_cache()
    print("GPU is available. Using GPU.")
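else:
    # Define a fallback device so that later `.to(device)` calls do not fail
    device = torch.device("cpu")
    print("GPU is not available. Falling back to CPU.")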

🔑 Logging into Hugging Face Hub:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

___
#### Step 2: Sparsification ✂️

Before fine-tuning, SQFT employs a simple but effective pruning approach, [Wanda](https://arxiv.org/abs/2306.11695), to
sparsify the language model. The sparsified model then serves as the frozen base model for adapter training.
Clone the Wanda repo and apply our patch:

In [None]:
!git clone https://github.com/locuslab/wanda.git && cd wanda && git checkout 8e8fc87 && git apply --ignore-space-change --ignore-whitespace ../wanda-modifications-for-sqft-usage.patch

The command below uses Wanda to prune [apple/OpenELM-1_1B](https://huggingface.co/apple/OpenELM-1_1B)
to 50% unstructured sparsity. Please note that we retain the model in FP32 precision to maximize its performance, and the tokenizer used by the OpenELM model is the one from [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf).

In [None]:
!python wanda/main.py --model apple/OpenELM-1_1B --dtype auto --tokenizer meta-llama/Llama-2-7b-hf --prune_method wanda --sparsity_ratio 0.5 --sparsity_type unstructured --save wanda_out --save_model sqft-openelm-1_1b-50-base

- `--model`: The identifier for the model on the Hugging Face model hub or local path.
- `--sparsity_ratio`: Specifies the fraction of weights to be pruned (`0.5` means 50%).
- `--save_model`: Specifies the directory where the pruned language model will be stored.

For further details, refer to the [Wanda](https://github.com/locuslab/wanda) repository.
Note that the sparsification step can be replaced by other sparsification algorithms.
Feel free to try other pruning approaches for the base model before training. 😊
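
To build some intuition for what the pruning step does, here is a minimal sketch of the Wanda scoring rule on a toy layer. It scores each weight by its magnitude times the L2 norm of the matching input feature, then removes the lowest-scoring half of every output row. The function name `wanda_prune_toy` and the random calibration data are illustrative assumptions; the actual Wanda script processes the model layer by layer with real calibration text.

```python
import torch

def wanda_prune_toy(weight, activations, sparsity=0.5):
    """Zero the lowest-scoring weights in each output row (toy illustration).

    weight:      (out_features, in_features)
    activations: (num_tokens, in_features) calibration inputs
    """
    # L2 norm of every input feature over the calibration tokens
    feature_norm = activations.norm(p=2, dim=0)
    # Importance score: |W_ij| * ||X_j||_2
    score = weight.abs() * feature_norm.unsqueeze(0)
    # Indices of the k lowest scores within each output row
    k = int(weight.shape[1] * sparsity)
    prune_idx = torch.argsort(score, dim=1)[:, :k]
    prune_mask = torch.zeros_like(weight, dtype=torch.bool)
    prune_mask.scatter_(1, prune_idx, True)
    return weight.masked_fill(prune_mask, 0.0)

torch.manual_seed(0)
W = torch.randn(4, 8)    # toy weight matrix
X = torch.randn(32, 8)   # toy calibration activations
W_sparse = wanda_prune_toy(W, X, sparsity=0.5)
print("Zeros per row:", (W_sparse == 0).sum(dim=1).tolist())
```

Because the comparison happens within each output row, every row ends up with the same number of zeros.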

___
#### Step 3: Load Model 🚀

Let's get started by loading the sparsified OpenELM model with Hugging Face's `AutoModelForCausalLM` class. We'll also bring in the corresponding tokenizer to handle our input data preprocessing. 🛠️

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "sqft-openelm-1_1b-50-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Set padding side to the right to ensure proper attention masking during fine-tuning
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
# Disable caching mechanism to reduce memory usage during fine-tuning
model.config.use_cache = False

___
#### Step 4: Test the Model Before Tuning 🧪

Before diving into fine-tuning, let's first evaluate the out-of-the-box performance of the **sparsified** OpenELM model. We will use [lm-evaluation-harness v0.4.2](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.2) to assess the model on the `ARC-Easy` dataset with the default configuration. In the subsequent fine-tuning steps, we will also use the training set of `ARC-Easy` to observe the performance changes before and after fine-tuning. Please note that this notebook is intended to demonstrate the usage of SQFT, so a small dataset is used here for demonstration.

In [None]:
!lm_eval --model hf --model_args pretrained=sqft-openelm-1_1b-50-base,add_bos_token=True,trust_remote_code=True --tasks arc_easy --batch_size auto:4

We can get the following results:

| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
|--------|------:|------|-----:|--------|-----:|---|-----:|
|arc_easy|      1|none  |     0|acc     |0.4966|±  |0.0103|
|        |       |none  |     0|acc_norm|0.4415|±  |0.0102|


___
#### Step 5: Configure the LoRA Settings 🏋️‍♂️

To efficiently finetune our model, we'll leverage the LoRA (Low-Rank Adaptation) technique.
LoRA enables us to tailor the model to our specific task by training only a small subset of additional parameters. This significantly reduces training time and memory usage! ⏰

We'll set up the LoRA configuration by specifying the low rank (r) and the target modules we aim to adapt. For the SQFT + SparsePEFT solution (which is what this notebook introduces), we need to set `sparse_adapter` to `True` in order to make the adapter sparse, so that the adapter can be merged into the base model without any loss of sparsity. 🎯

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["qkv_proj"],
    bias="none",
    task_type="CAUSAL_LM",
    sparse_adapter=True, # SparsePEFT
)

# Apply the LoRA configuration to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
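
To see why a sparse adapter can be merged without losing sparsity, here is a minimal sketch on toy tensors. It assumes the low-rank update is masked with the sparsity pattern of the frozen base weight before merging, which is our reading of SparsePEFT. The tensor sizes and variable names are illustrative and are not taken from the SQFT code.

```python
import torch

torch.manual_seed(0)
out_features, in_features, r, alpha = 6, 8, 2, 4

# Toy sparse base weight: roughly half of the entries are zeroed
W = torch.randn(out_features, in_features)
W = W * (torch.rand_like(W) > 0.5)
mask = W != 0  # sparsity pattern of the frozen base weight

# Toy LoRA factors (random here so that the effect of merging is visible)
A = torch.randn(r, in_features)
B = torch.randn(out_features, r)
delta = (alpha / r) * (B @ A)  # dense low-rank update

dense_merge = W + delta          # plain merge: pruned positions get filled in
sparse_merge = W + delta * mask  # masked merge: pruned positions stay zero

def zero_fraction(t):
    return (t == 0).float().mean().item()

print(f"base sparsity:         {zero_fraction(W):.2f}")
print(f"after dense merge:     {zero_fraction(dense_merge):.2f}")
print(f"after masked merge:    {zero_fraction(sparse_merge):.2f}")
```

The dense merge destroys the zeros introduced by pruning, while the masked merge leaves every pruned position untouched.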

___
#### Step 6: Configure NNCF For NLS 🛠️

To enhance performance, SQFT leverages Neural Low-rank Adapter Search (NLS) to identify the optimal adapter configuration, surpassing the traditional LoRA approach. 🚀 Alternatively, you can skip this step and refer to the [SQFT + SparsePEFT (LoRA)]() notebook to conduct some preliminary experiments with LoRA.

The following code demonstrates how to configure and create a neural network model with elastic low-rank support using the Neural Network Compression Framework (NNCF). Specifically, this code defines an NNCF configuration dictionary, sets up input information and BootstrapNAS (Neural Architecture Search) training parameters, including the progressive shrinking algorithm and width elasticity. 📊 It then creates an NNCF network based on the configuration dictionary and applies compression control and model compression algorithms. 🔧


In [None]:
from nncf import NNCFConfig
from nncf.experimental.torch.nas.bootstrapNAS.training.model_creator_helpers import create_compressed_model_from_algo_names
from nncf.torch.model_creation import create_nncf_network

rank_search_space = [16, 12, 8]

# Define the NNCF configuration dictionary, including input information and BootstrapNAS training parameters
nncf_config_dict = {
    "input_info": [
        {
            "sample_size": [1, 256],
            "type": "long",
            "keyword": "input_ids"
        },
        {
            "sample_size": [1, 256],
            "type": "long",
            "keyword": "attention_mask"
        }
    ],
    "bootstrapNAS": {
        "training": {
            "algorithm": "progressive_shrinking",
            "frozen_layers_allowed": True,
            "progressivity_of_elasticity": ["width"],
            "batchnorm_adaptation": {
                "num_bn_adaptation_samples": 0
            },
            "schedule": {
                "list_stage_descriptions": [
                    {"train_dims": ["width"], "epochs": 3, "depth_indicator": 1, "width_indicator": 5, "init_lr": 1e-4, "epochs_lr": 3, "sample_rate": 1}
                ]
            },
            "elasticity": {
                "available_elasticity_dims": ["width"],
                "width": {
                    "overwrite_groups": [
                        [
                            "{re}PeftModelForCausalLM/LoraModel[base_model]/OpenELMForCausalLM[model]/OpenELMModel[transformer]/ModuleList[layers]/OpenELMDecoderLayer[{*}]/OpenELMMultiHeadCausalAttention[attn]/Linear[qkv_proj]/ModuleDict[lora_A]/NNCFLinear[default]/linear_0"
                        ]
                    ],
                    "overwrite_groups_widths": [
                        rank_search_space # the search space of low rank (r)
                    ]
                }
            }
        }
    }
}

# Process the overwrite groups and their corresponding widths
base_overwrite_groups = nncf_config_dict["bootstrapNAS"]["training"]["elasticity"]["width"]["overwrite_groups"]
base_overwrite_groups_widths = nncf_config_dict["bootstrapNAS"]["training"]["elasticity"]["width"]["overwrite_groups_widths"]
overwrite_groups, overwrite_groups_widths = [], []
num_layers = model.config.num_transformer_layers
for group, width in zip(base_overwrite_groups, base_overwrite_groups_widths):
    cur_search_space = width
    if group[0].startswith("{re}"):
        new_group = [[item.replace("{re}", "").replace("{*}", str(i)) for item in group] for i in range(num_layers)]
        new_width = [cur_search_space for _ in range(num_layers)]
    else:
        new_group = [group]
        new_width = [cur_search_space]
    overwrite_groups.extend(new_group)
    overwrite_groups_widths.extend(new_width)

# Update the configuration dictionary with the processed overwrite groups and widths
nncf_config_dict["bootstrapNAS"]["training"]["elasticity"]["width"]["overwrite_groups"] = overwrite_groups
nncf_config_dict["bootstrapNAS"]["training"]["elasticity"]["width"]["overwrite_groups_widths"] = overwrite_groups_widths
nncf_config = NNCFConfig.from_dict(nncf_config_dict)

# Create the NNCF network and apply compression control and model compression algorithms
nncf_network = create_nncf_network(model, nncf_config)
compression_ctrl, model = create_compressed_model_from_algo_names(
    nncf_network, nncf_config, algo_names=["progressive_shrinking"]
)
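
The weight-sharing idea behind the elastic rank can be sketched in a few lines. This is only an illustration, under the assumption that a smaller rank reuses the first rows of `lora_A` and the first columns of `lora_B` (the same slicing used later when a sub-adapter is extracted) and that a rank is sampled for each training step. NNCF handles this internally, and its implementation may differ; the shapes and the `elastic_lora_delta` helper are hypothetical.

```python
import random
import torch

torch.manual_seed(0)
random.seed(0)

rank_space = [16, 12, 8]
in_features, out_features = 32, 24

# One shared "super-adapter" sized for the largest rank
lora_A = torch.randn(max(rank_space), in_features)
lora_B = torch.randn(out_features, max(rank_space))

def elastic_lora_delta(x, rank):
    """Adapter output when only the first `rank` components are active."""
    return x @ lora_A[:rank].T @ lora_B[:, :rank].T

x = torch.randn(4, in_features)
for step in range(3):
    rank = random.choice(rank_space)  # sample a sub-adapter for this step
    y = elastic_lora_delta(x, rank)
    print(f"step {step}: rank={rank}, output shape={tuple(y.shape)}")
```

Every rank in the search space shares the same underlying tensors, so training one super-adapter trains all sub-adapters at once.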

___
#### Step 7: Prepare the Dataset 📚

We prepare a domain-specific dataset to finetune our sparsified model.
As in Step 4, we keep the walkthrough simple by using a small dataset, `ARC-Easy`. This dataset features authentic grade-school level, multiple-choice science questions, designed to foster research in advanced question-answering. By focusing on the question-answer pairs, we aim to adapt our model to deliver precise and relevant responses to a variety of inquiries. 🔍

🔧 Loading Data and Preparing Prompts:

In [None]:
from datasets import Dataset, load_dataset

"""
This function is inspired by the implementation in `lm-evaluation-harness` library.
It processes the ARC-Easy dataset to create prompts for question-answering tasks.
"""
def add_prompt_func_arc(doc):

    def _process_doc(doc):
        """
        Process a single document to convert numeric answer keys to letters and
        format the document into a dictionary.
        """
        # Map numeric answer keys to letters
        num_to_letter = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}
        doc["answerKey"] = num_to_letter.get(doc["answerKey"], doc["answerKey"])

        # Create a dictionary with necessary fields
        out_doc = {
            "id": doc["id"],
            "query": "Question: " + doc["question"] + "\nAnswer:",
            "choices": doc["choices"]["text"],
            "gold": ["A", "B", "C", "D", "E"].index(doc["answerKey"]),
        }
        return out_doc

    # Process the document and create the full prompt
    doc = _process_doc(doc)
    prompt = doc["query"]
    answer = doc["choices"][doc["gold"]]
    doc["full_prompt"] = prompt + " " + answer
    return doc

# Load the ARC-Easy dataset
arc_e_dataset = load_dataset("ai2_arc", "ARC-Easy", split="train")

# Apply the add_prompt_func_arc function to each document in the dataset
dataset = arc_e_dataset.map(add_prompt_func_arc)

print(f"Number of examples in the dataset: {len(dataset)}")
print(f"Fields in the dataset: {list(dataset.features.keys())}")
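
To make the prompt format concrete, here is a small self-contained sketch that applies the same formatting to a single hand-written record. The question, the choices, and the helper name `build_full_prompt` are made up for illustration; they are not taken from the dataset.

```python
def build_full_prompt(question, choice_texts, answer_key):
    """Format one question the same way as the prompt function in this step."""
    # Some records use numeric answer keys; map them to letters first
    num_to_letter = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}
    letter = num_to_letter.get(answer_key, answer_key)
    gold = ["A", "B", "C", "D", "E"].index(letter)
    return "Question: " + question + "\nAnswer: " + choice_texts[gold]

# Hypothetical record, written by hand for this example
example = build_full_prompt(
    question="Which form of water is a gas?",
    choice_texts=["ice", "steam", "rain", "snow"],
    answer_key="2",
)
print(example)
```

The model is trained on the question together with the text of the correct choice, not on the answer letter.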

📝 Tokenizing for Model Training:

In [None]:
def tokenize(prompt, add_eos_token=True):
    """
    Tokenizes the given prompt and optionally adds an end-of-sequence (EOS) token.

    Args:
        prompt (str): The input text to tokenize.
        add_eos_token (bool): Whether to add an EOS token at the end of the tokenized input.

    Returns:
        dict: A dictionary containing tokenized input ids, attention mask, and labels.
    """
    # Tokenize the prompt with truncation and padding
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=256,
        padding=True,
        return_tensors=None,
    )

    # Add EOS token if necessary
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < 256
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)

        result["attention_mask"].append(1)

    # Create labels for the tokenized input
    result["labels"] = result["input_ids"].copy()

    return result

def generate_and_tokenize_prompt(data_point):
    """
    Generates a tokenized prompt from the data point.

    Args:
        data_point (dict): A dictionary containing the full prompt.

    Returns:
        dict: A dictionary containing tokenized input ids, attention mask, and labels.
    """
    full_prompt = data_point["full_prompt"]
    tokenized_full_prompt = tokenize(full_prompt)
    return tokenized_full_prompt

# Shuffle the dataset and apply the generate_and_tokenize_prompt function to each data point
dataset = dataset.shuffle().map(generate_and_tokenize_prompt)

# Print the first example
print(dataset[0])

___
#### Step 8: Finetune the Model 🎯

It's time to finetune our OpenELM model! We'll set up the training parameters, including batch size, learning rate, and evaluation strategy. 📊

By feeding the model question-answer pairs from the `ARC-Easy` dataset, we can train it to generate more accurate and relevant responses. This enables the model to learn the patterns and relationships within the diverse topics covered by the dataset. 🎯

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

# Define the path where the fine-tuned adapter will be saved
finetuned_super_adapter_path = "sqft-sparsepeft-openelm-1_1b-50-arce-adapter"
training_args = TrainingArguments(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    warmup_steps=100,
    num_train_epochs=3,
    # learning rate is already set in NNCF configuration.
    # learning_rate=1e-4,
    optim="adamw_torch",
    save_strategy="epoch",
    save_total_limit=2,
    fp16=True,
    logging_steps=20,
    output_dir=finetuned_super_adapter_path,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=None,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
    compression_ctrl=compression_ctrl,  # required for NLS
)
results = trainer.train()
metrics = results.metrics
metrics["train_samples"] = len(dataset)
trainer.save_model()
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
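
As a rough sanity check on the schedule, the number of optimizer steps implied by these settings can be estimated as follows. The example count of 2,251 is an assumption about the size of the ARC-Easy training split (use `len(dataset)` for the real value), and a single GPU is assumed.

```python
import math

num_examples = 2251          # assumption: size of the ARC-Easy train split
per_device_batch_size = 16
gradient_accumulation_steps = 1
num_gpus = 1                 # assumption: single-GPU run
num_train_epochs = 3
warmup_steps = 100

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")
print(f"Warmup share:    {warmup_steps / total_steps:.1%}")
```

Under these assumptions, warmup covers a sizeable share of a short run, which is worth keeping in mind when changing the number of epochs or the batch size.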

___
#### Step 9: Extract the heuristic sub-adapter 🔧

Due to the use of the NLS strategy, we trained a super LoRA adapter. To quickly evaluate its quality and performance, we will use a heuristic sub-adapter ("intermediate" adapter) as a reference. The following code demonstrates how to extract a heuristic sub-adapter. 📊

In [None]:
import os

from peft.utils import CONFIG_NAME, SAFETENSORS_WEIGHTS_NAME
from safetensors.torch import load_file, save_file

# Load the super adapter weights
super_adapter_weights = load_file(os.path.join(finetuned_super_adapter_path, SAFETENSORS_WEIGHTS_NAME))
num_adapters = sum("lora_A" in key for key in super_adapter_weights)

# Initialize heuristic adapter configuration
heu_adapter_config = [rank_search_space[(len(rank_search_space) - 1) // 2]] * num_adapters

# Function to save sub-adapter weights based on the given configuration
def save_sub_adapter(config, save_dir):
    sub_adapter_weights = super_adapter_weights.copy()
    num_pruned_adapter = 0
    for key in sub_adapter_weights:
        if "lora_A" not in key:
            continue
        rank = config[num_pruned_adapter]
        sub_adapter_weights[key] = super_adapter_weights[key][:rank].clone()
        lora_b_key = key.replace("lora_A", "lora_B")
        sub_adapter_weights[lora_b_key] = super_adapter_weights[lora_b_key][:, :rank].clone()
        num_pruned_adapter += 1
    os.makedirs(save_dir, exist_ok=True)
    save_file(sub_adapter_weights, os.path.join(save_dir, SAFETENSORS_WEIGHTS_NAME))
    config_path = os.path.join(finetuned_super_adapter_path, CONFIG_NAME)
    os.system(f"cp {config_path} {save_dir}")

# save the heuristic sub-adapter
sub_adapter_path = os.path.join(finetuned_super_adapter_path, "heuristic_sub_adapter")
save_sub_adapter(heu_adapter_config, sub_adapter_path)
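
The heuristic choice is simply the middle entry of the rank search space, applied to every adapter. The sketch below reproduces that index arithmetic and shows how the adapter size scales with the rank. The layer dimensions and the two helper names are hypothetical and only serve the illustration.

```python
def heuristic_config(search_space, num_adapters):
    """Pick the middle entry of the search space for every adapter."""
    middle_rank = search_space[(len(search_space) - 1) // 2]
    return [middle_rank] * num_adapters

def lora_param_count(rank, in_features, out_features):
    """Parameters in one LoRA pair: lora_A is (rank, in), lora_B is (out, rank)."""
    return rank * (in_features + out_features)

rank_search_space = [16, 12, 8]
print(heuristic_config(rank_search_space, num_adapters=3))  # [12, 12, 12]
print(heuristic_config([32, 24, 16, 8], num_adapters=2))    # [24, 24]

in_features, out_features = 64, 96  # hypothetical layer dimensions
for rank in rank_search_space:
    print(f"rank={rank}: {lora_param_count(rank, in_features, out_features)} parameters")
```

For a search space with an even number of entries, the index rounds down, so the larger of the two middle ranks is selected when the space is listed in descending order.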

___
#### Step 10: Search an optimal sub-adapter (Optional) 🔍

To get a better performing sub-adapter, we can leverage the simple hill-climbing algorithm to extract an optimal sub-adapter from the trained super LoRA adapter. The following code demonstrates the process of identifying and saving the best-performing sub-adapter configuration based on the validation set. 📊

Before exploring the super-adapter, we need to define a custom task for the search phase that uses the validation split of `ARC-Easy`.
Specifically, we load the existing `arc_easy.yaml` configuration file and modify it to create a new task configuration named `arc_easy_val`.




In [None]:
import importlib.util  # `find_spec` lives in the `importlib.util` submodule

from lm_eval.utils import load_yaml_config

# Get the directory path where the `lm_eval` module is located
spec = importlib.util.find_spec("lm_eval")
module_path = spec.origin
module_dir = os.path.dirname(module_path)
arc_yaml_file = os.path.join(module_dir, "tasks/arc/arc_easy.yaml")
task_config = load_yaml_config(arc_yaml_file)

# Modify the task configuration to define the validation set task
task_config["task"] = "arc_easy_val"
task_config["dataset_name"] = "ARC-Easy"
task_config["test_split"] = "validation"

Next, we implement the hill-climbing algorithm to search for the optimal sub-adapter configuration. This involves evaluating different configurations on the validation set and selecting the best-performing one.

In [None]:
import os
import tempfile

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from lm_eval.tasks import TaskManager
from lm_eval.utils import load_yaml_config
from peft import PeftModel

lm = HFLM(model, batch_size=64, trust_remote_code=True, add_bos_token=True)
task_manager = TaskManager("INFO", include_path=None)
request_caching_args = {'cache_requests': False, 'delete_requests_cache': False, 'rewrite_requests_cache': False}

# Hill-climbing algorithm
t, T = 0, 10
anchor_adapter_config = heu_adapter_config
visited = set()

while t < T:
    # Find all possible sub-adapters in this turn
    all_neighbors = []
    for idx in range(len(anchor_adapter_config)):
        cur_rank = anchor_adapter_config[idx]
        space_idx = rank_search_space.index(cur_rank)
        if space_idx != 0:
            new_sub_adapter_config = anchor_adapter_config.copy()
            new_sub_adapter_config[idx] = rank_search_space[space_idx - 1]
            all_neighbors.append(new_sub_adapter_config)
        if space_idx != len(rank_search_space) - 1:
            new_sub_adapter_config = anchor_adapter_config.copy()
            new_sub_adapter_config[idx] = rank_search_space[space_idx + 1]
            all_neighbors.append(new_sub_adapter_config)
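    all_neighbors = [neighbor for neighbor in all_neighbors if str(neighbor) not in visited]
    print(f"Found {len(all_neighbors)} neighbors in Turn {t}")
    if not all_neighbors:
        # Every neighbor has already been evaluated; stop and keep the best from the previous turn
        break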

    best = (None, -1.0)
    for i, possible_config in enumerate(all_neighbors):
        visited.add(str(possible_config))
        with tempfile.TemporaryDirectory() as temp_sub_adapter_path:
            save_sub_adapter(
                possible_config,
                temp_sub_adapter_path
            )
            result_log = os.path.join(temp_sub_adapter_path, "result.json")
            model = PeftModel.from_pretrained(lm.model.get_base_model(), temp_sub_adapter_path)
            # Update the model in HFLM (there might be a better way here)
            lm._model = model

            # Evaluate the current sub-adapter configuration
            results = evaluator.simple_evaluate(
                model=lm,
                tasks=[task_config],
                batch_size=64,
                log_samples=False,
                task_manager=task_manager,
                **request_caching_args,
            )
            accuracy = results["results"]["arc_easy_val"]["acc_norm,none"]
            print(f"Accuracy: {accuracy} ({i+1}/{len(all_neighbors)} possible sub-adapter in Turn {t})")

            # Update the best configuration if current one is better
            if accuracy > best[1]:
                best = (possible_config, accuracy)
    print(f"Best accuracy in Turn {t} is {best[1]}")
    anchor_adapter_config = best[0]
    t += 1

# save the optimal sub-adapter
sub_adapter_path = os.path.join(finetuned_super_adapter_path, "optimal_sub_adapter")
save_sub_adapter(best[0], sub_adapter_path)

This code reuses the super-adapter weights, the heuristic configuration, and the `save_sub_adapter` function from Step 9. Starting from the heuristic configuration, each turn of the hill-climbing loop generates the neighboring configurations (one adapter's rank moved by one position in the search space), evaluates the unvisited ones on the validation set, and moves the anchor to the best of them. After `T` turns, the best configuration from the final turn is saved as the optimal sub-adapter.

By following these steps, we can effectively identify and save the best-performing sub-adapter configuration from the trained super LoRA adapter.
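
The search loop can also be tried without a model. The sketch below runs hill climbing over rank configurations with a made-up scoring function (`toy_score`) standing in for validation accuracy. Note that this variant differs slightly from the notebook code: it stops as soon as no neighbor improves on the current anchor, and it returns the best configuration seen overall.

```python
rank_space = [16, 12, 8]

def neighbors(config, space):
    """All configurations that differ from `config` by one step in one position."""
    result = []
    for idx, rank in enumerate(config):
        pos = space.index(rank)
        for new_pos in (pos - 1, pos + 1):
            if 0 <= new_pos < len(space):
                candidate = list(config)
                candidate[idx] = space[new_pos]
                result.append(candidate)
    return result

def toy_score(config):
    """Hypothetical stand-in for validation accuracy; peaks at `target`."""
    target = [16, 8, 12, 16]
    return -sum(abs(a - b) for a, b in zip(config, target))

def hill_climb(start, space, score_fn, max_turns=10):
    anchor, anchor_score = list(start), score_fn(start)
    visited = {tuple(start)}
    for _ in range(max_turns):
        candidates = [c for c in neighbors(anchor, space) if tuple(c) not in visited]
        if not candidates:
            break
        visited.update(tuple(c) for c in candidates)
        best = max(candidates, key=score_fn)
        if score_fn(best) <= anchor_score:
            break  # no unvisited neighbor improves on the anchor
        anchor, anchor_score = best, score_fn(best)
    return anchor, anchor_score

best_config, best_score = hill_climb([12, 12, 12, 12], rank_space, toy_score)
print(best_config, best_score)
```

In the real search, each call to the scoring function is a full evaluation on the validation set, so the number of turns and neighbors directly determines the cost.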

___
#### Step 11: Merge the Model 🧩

After finetuning and extracting a sub-adapter, it's time to put our model to the test! But first, we need to merge the sub-adapter weights with the base model. This is the heuristic sub-adapter from Step 9, or the optimal one if you ran Step 10, since `sub_adapter_path` is reassigned there. This step is essential because the adapter weights contain the adaptations learned during finetuning. By merging these weights, we effectively integrate the newly acquired knowledge into the base model. 🎛️

To merge the weights, we'll use the `merge_and_unload()` function from the PEFT library. This function seamlessly combines the heuristic sub-adapter weights with the base model's corresponding weights, resulting in a unified model that incorporates the finetuned knowledge. 🔧

Once the heuristic sub-adapter weights are merged, we'll save the finetuned model to preserve its state. This ensures that we can easily load and use the finetuned model for future tasks without needing to repeat the finetuning process. ✨

In [None]:
from peft import PeftModel

base_model_path = "sqft-openelm-1_1b-50-base"
tuned_model_path = "sqft-sparsepeft-openelm-1_1b-50-arce"

base_model = AutoModelForCausalLM.from_pretrained(base_model_path, trust_remote_code=True).to(device)
model = PeftModel.from_pretrained(base_model, sub_adapter_path)

# Merge the adapter weights into the base model and unload the adapter
merged_model = model.merge_and_unload()
merged_model.train(False)
base_model.save_pretrained(tuned_model_path, state_dict=merged_model.state_dict())

# Load and save the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
tokenizer.save_pretrained(tuned_model_path)
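
Since the point of SparsePEFT is that merging keeps the base model's sparsity, it can be useful to measure the fraction of zero weights before and after merging. Here is a minimal sketch of such a check, demonstrated on a toy network; the helper name `linear_sparsity` is our own, and on the real model any layers that were not pruned will lower the overall figure.

```python
import torch
from torch import nn

def linear_sparsity(module: nn.Module) -> float:
    """Fraction of exactly-zero weights across all nn.Linear layers."""
    zeros, total = 0, 0
    for layer in module.modules():
        if isinstance(layer, nn.Linear):
            zeros += (layer.weight == 0).sum().item()
            total += layer.weight.numel()
    return zeros / total if total else 0.0

# Toy network with every other input column zeroed, i.e. 50% sparsity
toy = nn.Sequential(nn.Linear(8, 8, bias=False), nn.ReLU(), nn.Linear(8, 4, bias=False))
with torch.no_grad():
    for layer in toy.modules():
        if isinstance(layer, nn.Linear):
            layer.weight[:, ::2] = 0.0

print(f"Sparsity: {linear_sparsity(toy):.2f}")
```

The same function can be called on the base model and on `merged_model` to compare the two values.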

___
#### Step 12: Evaluate the Finetuned Model 🏆

Now, let's compare the result with the pre-finetuned model to see the improvements. Prepare to be amazed by the power of finetuning! 🤩

By merging the adapter weights, we've ensured that our model is equipped to handle real-world tasks with its newly acquired knowledge. So, let's put it to the test and see how it performs! 🌟


In [None]:
!lm_eval --model hf --model_args pretrained=sqft-sparsepeft-openelm-1_1b-50-arce,add_bos_token=True,trust_remote_code=True --tasks arc_easy --batch_size auto:4

We can get the following results:

| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
|--------|------:|------|-----:|--------|-----:|---|-----:|
|arc_easy|      1|none  |     0|acc     |0.5972|±  |0.0101|
|        |       |none  |     0|acc_norm|0.6166|±  |0.0100|


The evaluation results show a clear improvement from fine-tuning the sparsified OpenELM model on the `ARC-Easy` dataset. Out of the box, the sparsified model reached an accuracy (acc) of 0.4966 and a normalized accuracy (acc_norm) of 0.4415. After fine-tuning, accuracy increased to **0.5972** (about +10.1 points) and normalized accuracy rose to **0.6166** (about +17.5 points). 📈 Note that this notebook is intended to introduce the SQFT workflow and give a simple demonstration of its effectiveness. Achieving better fine-tuning results requires more experiments and a broader hyperparameter exploration. In addition, NLS can yield better-performing sub-adapters through a simple search, as in Step 10. 💥
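
The size of the improvement can be computed directly from the two result tables. This short snippet uses only the numbers reported in Step 4 and Step 12.

```python
before = {"acc": 0.4966, "acc_norm": 0.4415}  # sparsified model, Step 4
after = {"acc": 0.5972, "acc_norm": 0.6166}   # fine-tuned and merged model, Step 12

gains = {}
for metric in before:
    absolute = after[metric] - before[metric]
    relative = absolute / before[metric]
    gains[metric] = (absolute, relative)
    print(f"{metric}: +{absolute * 100:.2f} points ({relative:.1%} relative)")
```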


### Summary

In summary, our approach demonstrates that low-cost model adaptation in low-precision sparse foundation models can significantly enhance performance while maintaining efficiency. This experiment underscores the potential of low-precision and sparsity-aware methods in making machine learning more accessible and scalable. 🚀🌟

We encourage you to test our cost-effective and efficient method on your custom fine-tuning datasets! 🎉📊