<a href="https://colab.research.google.com/github/BoxOfCereal/Fine-Tuning-Loop/blob/main/fine_tune_loop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* [Falcon blog post: fine-tuning with PEFT](https://huggingface.co/blog/falcon#fine-tuning-with-peft)
* [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
* [openassistant-guanaco dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
* [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/)

# FTL - Fine Tune Loop

## Intro

In this notebook I will attempt to show all the steps necessary to go from data to text generation. The main headings walk through the simplest version of a whole training loop: loading data, loading a model from the Hugging Face ecosystem, training the model, benchmarking it, running inference, and using the resulting model from a prompting library (in our case, LangChain).



The subsections of each heading contain more in-depth variations of each of these steps. It is my hope that seeing multiple examples that accomplish the same thing will reveal the underlying patterns, so that you not only understand how to collect data, train a model, and run inference on it, but can also adapt the process to your own use case with your own custom data, model, and inference needs.

This notebook is designed to take you from data to application with large language models. A star emoji (⭐) marks the path recommended for first-time use.

## Pre-requisites:
* A Hugging Face account - [Sign up](https://huggingface.co/join)
* A Weights & Biases account - [Sign up](https://wandb.ai/login?signup=true)
* Some Python experience
* Some basic experience with large language models - [Course](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt)

## Legend:
⭐ - Recommended

## Fine tuning a model

## Fine-tune Falcon-7b on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the recent Falcon-7b model on a single Google Colab instance and turn it into a chatbot.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA, for more memory-efficient fine-tuning.
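To see why 4-bit quantization matters on a single Colab GPU, here is a rough back-of-the-envelope sketch of the memory needed just to hold the weights at different precisions. It assumes about 7 billion parameters (a round figure, not the exact Falcon-7b count) and ignores activations, optimizer state, and quantization overhead, so the numbers are best read as lower bounds.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: round figure for a "7B" model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the space of the 16-bit ones, which is what leaves room for the LoRA adapters and training state.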

### Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, as it is a requirement for loading Falcon models.

In [None]:
!pip install -q -U git+https://github.com/lvwerra/trl.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb #einops for falcon

### Dataset

For our experiment, we will use the Guanaco dataset, which is a clean subset of the OpenAssistant dataset adapted to train general purpose chatbots.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)

#### "timdettmers/openassistant-guanaco" ⭐

In [2]:
from datasets import load_dataset

dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name, split="train")


[know your dataset](https://huggingface.co/docs/datasets/access)

In [None]:
# dataset[0]
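Each record in this dataset stores a whole conversation in a single `text` field, with turns marked by `### Human:` and `### Assistant:` (the same format used for the prompt in the inference section of this notebook). Below is a minimal sketch of splitting such a string into turns. The sample string is made up for illustration, and `split_turns` is a hypothetical helper, not part of `datasets`.

```python
import re

# Matches the speaker markers used in the Guanaco-style text field
TURN_PATTERN = re.compile(r"### (Human|Assistant):")


def split_turns(text: str) -> list[tuple[str, str]]:
    """Split a '### Human: ...### Assistant: ...' string into (speaker, message) pairs."""
    parts = TURN_PATTERN.split(text)
    # With a capturing group, re.split returns
    # [prefix, speaker1, message1, speaker2, message2, ...]
    speakers, messages = parts[1::2], parts[2::2]
    return [(speaker, message.strip()) for speaker, message in zip(speakers, messages)]


# Made-up sample in the same format as the dataset's `text` field
sample = "### Human: What is LoRA?### Assistant: A low-rank adapter method.### Human: Thanks!"
for speaker, message in split_turns(sample):
    print(f"{speaker}: {message}")
```

With the real dataset loaded, the same helper could be applied to `dataset[0]["text"]`.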

#### Custom Dataset (Coming soon!)

[LIMA: Less Is More for Alignment](https://arxiv.org/pdf/2305.11206.pdf)

We observe that, for the purpose of alignment, scaling up input diversity and output quality have
measurable positive effects, while scaling up quantity alone might not.

How much data is needed to teach a pre-trained large language model new factual information?

[Textbooks Are All You Need](https://arxiv.org/abs/2306.11644)
[autolabel](https://github.com/refuel-ai/autolabel)

* A filtered code-language dataset, which is a subset of The Stack and StackOverflow, obtained by using a language model-based classifier (consisting of about 6B tokens).
* A synthetic textbook dataset consisting of <1B tokens of GPT-3.5 generated Python textbooks.
* A small synthetic exercises dataset consisting of ∼180M tokens of Python exercises and solutions.

Filtering of existing code datasets using a transformer-based classifier.

The following example demonstrates the synthetically generated textbook text:

To begin, let us define singular and nonsingular matrices. A matrix is said to be singular if its
determinant is zero. On the other hand, a matrix is said to be nonsingular if its determinant is not
zero. Now, let's explore these concepts through examples.
Example 1:
Consider the matrix A = np.array([[1, 2], [2, 4]]). We can check if this matrix is singular or
nonsingular using the determinant function. We can define a Python function, `is_singular(A)`, which
returns true if the determinant of A is zero, and false otherwise.
```python
import numpy as np

def is_singular(A):
    det = np.linalg.det(A)
    if det == 0:
        return True
    else:
        return False

A = np.array([[1, 2], [2, 4]])
print(is_singular(A)) # True
```

### Loading the model

#### 🦙 togethercomputer/RedPajama-INCITE-Base-3B-v1 ⭐

In [4]:
model_name = "togethercomputer/RedPajama-INCITE-Base-3B-v1"
repo_name = f"nolestock/{model_name.split('/')[-1]}-SFT-guanaco-lora"

In [None]:
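import torch
import transformers
from packaging import version
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MIN_TRANSFORMERS_VERSION = '4.25.1'

# check transformers version; compare parsed versions rather than raw strings,
# because as strings '4.9.0' would sort after '4.25.1'
assert version.parse(transformers.__version__) >= version.parse(MIN_TRANSFORMERS_VERSION), f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'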

# init
tokenizer = AutoTokenizer.from_pretrained(model_name)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)

# model = model.to('cuda:0')



In [None]:
model.modules

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

#### 🦅 ybelkada/falcon-7b-sharded-bf16 in 4bit

In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it to 4-bit, and attach LoRA adapters to it. Let's get started!

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

Let's also load the tokenizer below

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Below we will load the configuration file in order to create the LoRA model. According to the [QLoRA paper](https://arxiv.org/abs/2305.14314), it is important to consider all linear layers in the transformer block for maximum performance. Therefore we will add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the mixed query key value layer.

A really good video on QLoRA is [AemonAlgiz](https://www.youtube.com/@AemonAlgiz)'s [QLoRA Is More Than Memory Optimization. Train Your Models With 10% of the Data for More Performance.](https://youtu.be/v6sf4EF45fI) Warning: he does go into some math, but even if you don't understand it all (which I certainly don't), he explains it in a very satisfying way.

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
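To get a feel for what this configuration costs, here is a rough sketch that counts the trainable parameters LoRA adds for Falcon-7b. For a linear layer with `in_features` inputs and `out_features` outputs, a rank-`r` adapter adds two small matrices, A (`r × in_features`) and B (`out_features × r`), so `r * (in_features + out_features)` parameters in total. The layer shapes are copied from the `model.modules` printout in this notebook; the single-adapter-per-layer accounting and the float32 storage size are assumptions.

```python
# (in_features, out_features) of each targeted layer, taken from the Falcon-7b module printout
TARGET_SHAPES = {
    "query_key_value": (4544, 4672),
    "dense": (4544, 4544),
    "dense_h_to_4h": (4544, 18176),
    "dense_4h_to_h": (18176, 4544),
}
N_LAYERS = 32    # number of decoder layers in the printout
LORA_R = 64      # same value as lora_r above
LORA_ALPHA = 16  # same value as lora_alpha above


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in TARGET_SHAPES.values())
total = per_layer * N_LAYERS

print(f"trainable LoRA parameters: {total:,}")
print(f"approx. size in float32:   {total * 4 / 1e6:.0f} MB")  # assumption: 4 bytes per parameter
print(f"LoRA scaling (alpha / r):  {LORA_ALPHA / LORA_R}")
```

The result is on the order of 130 million trainable parameters, or a little over 500 MB in float32, which is in line with the `adapter_model.bin` size reported when the adapter is pushed to the hub later in the notebook.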

Let's take a look at the model's modules so we can see what we're targeting and how to find the linear modules in any other architecture we're interested in:

In [None]:
model.modules

<bound method Module.modules of RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x DecoderLayer(
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): MLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)>
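Reading the printout by eye works, but the same information can be extracted mechanically. The sketch below pulls the names of the linear layers out of a printed module tree with a regular expression. It is a text-only illustration: `linear_layer_names` is a hypothetical helper, the excerpt is abbreviated from the printout above, and skipping `lm_head` is an assumption that matches the target list used in this notebook. With a live model, iterating over `model.named_modules()` and checking each module's type would serve the same purpose.

```python
import re


def linear_layer_names(module_repr: str, skip=("lm_head",)) -> list[str]:
    """Collect the distinct names of Linear-like layers from a printed module tree."""
    # Matches lines such as "(dense): Linear4bit(" or "(lm_head): Linear("
    names = re.findall(r"\((\w+)\): Linear\w*\(", module_repr)
    return sorted({name for name in names if name not in skip})


# Abbreviated excerpt of the Falcon-7b printout
printout = """
(input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
(self_attention): Attention(
  (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
  (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
)
(mlp): MLP(
  (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
  (act): GELU(approximate='none')
  (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
)
(lm_head): Linear(in_features=4544, out_features=65024, bias=False)
"""

print(linear_layer_names(printout))
```

The names it returns are the candidates for `target_modules` when adapting the configuration to another architecture.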

### Inference

Whichever model you loaded above, this inference code should work, although you should expect different responses from each.

In [6]:
# infer
prompt = "Alan Turing is"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True,
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)
"""
a name that has been synonymous with the computer age since the 1950s. The British mathematician, logician, and cryptanalyst is widely regarded as the father of modern computing. His contributions to the development of the modern computer and the theory of computation have had a profound impact on the world we live in today.
Turing’s contributions to the development of the modern computer were made in the 1940s and 1950s. He is most famous for his work on the Turing machine, a theoretical model of a computing machine that was able to perform all the mathematical operations of a computer. Turing’s work on the...
"""


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 the man who cracked the Enigma code, the code used by the Nazis to communicate with their submarines during World War II.
The code was used by the Germans to send messages to their submarines and it was a code that could only be broken by a machine.
Turing was a brilliant mathematician and he was also a brilliant cryptographer.
He was also a brilliant scientist and a brilliant engineer.
He was also a brilliant philosopher and a great thinker.
He was also a great mathematician.
He was also a great scientist.
He was also a great philosopher.
He was a great think
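The `generate` call above combines three sampling controls: `temperature`, `top_k` and `top_p`. The following is a simplified, pure-Python sketch of how such filters narrow a next-token distribution before a token is drawn. It is meant to illustrate the idea only and is not the library's implementation; the toy logits are made up.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.7):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # 1. Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    # 3. Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the remaining probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.7))
```

Lowering `top_p` or `top_k` shrinks the set of candidate tokens, which tends to make the output more conservative; the repetitive output above is a reminder that a base model can still loop even with sampling enabled.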



### Loading the trainer

#### 🦅🦙 Supervised Fine-tuning Trainer QLoRA 4-Bit ⭐

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first load the training arguments below.

In [None]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)
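It helps to translate these arguments into how much data the run actually sees. The arithmetic below is a small sketch that assumes a single GPU (as on a standard Colab instance) and uses the 9,846 training examples reported when the trainer maps the dataset.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
num_devices = 1      # assumption: a single Colab GPU
dataset_size = 9846  # training examples reported by the trainer's Map step

# Sequences contributing to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total sequences processed over the whole run
sequences_seen = effective_batch_size * max_steps
# Fraction of the dataset covered
epochs = sequences_seen / dataset_size

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen:       {sequences_seen}")
print(f"approx. epochs:       {epochs:.2f}")
```

Under these assumptions the run covers a bit less than one pass over the dataset, so raising `max_steps` is the first knob to turn for a longer training run.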

Then finally pass everything to the trainer

In [None]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)



Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

We will also pre-process the model by upcasting the layer norms to float32 for more stable training

In [None]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

##### Train the model

Now let's train the model! Simply call `trainer.train()`

In [None]:
trainer.train()

During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

### Inference

#### pipeline 1️⃣

### Saving

In [None]:
trainer.save_model()

In [None]:
from huggingface_hub import login

# Never hard-code an access token in a notebook; login() prompts for it instead
login()

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
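For non-interactive runs, one option is to read the token from an environment variable instead of pasting it into a cell. The sketch below uses only the standard library; `get_hf_token` is a hypothetical helper, and the choice of `HF_TOKEN` as the default variable name is an assumption about how you set up your environment.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face access token from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "set it before running this cell instead of hard-coding the token."
        )
    return token


# Usage (requires the variable to be set beforehand):
# from huggingface_hub import login
# login(token=get_hf_token())
```

This keeps the secret out of the notebook file, so sharing or committing the notebook does not leak it.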


In [3]:
model_name = "ybelkada/falcon-7b-sharded-bf16"
repo_name = f"nolestock/{model_name.split('/')[-1]}-finetuned-guanaco-lora"


In [None]:
repo_name

'nolestock/falcon-7b-sharded-bf16-finetuned-guanaco-lora'

In [None]:
trainer.model.push_to_hub(repo_name)

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.bin:   0%|          | 0.00/522M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/nolestock/falcon-7b-sharded-bf16-finetuned-guanaco-lora/commit/456f9867df497d978dad00f905c1b3a583aff3a6', commit_message='Upload model', commit_description='', oid='456f9867df497d978dad00f905c1b3a583aff3a6', pr_url=None, pr_revision=None, pr_num=None)

## Loading (RESTART)

### From Huggingface hub (Model and Adapter)

In [None]:
%%capture
!pip install -q -U git+https://github.com/lvwerra/trl.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb

In [None]:
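# MODEL_NAME / REPO_NAME are optional overrides; fall back to the defaults when they are not defined
model_name = globals().get("MODEL_NAME") or "ybelkada/falcon-7b-sharded-bf16"
repo_name = globals().get("REPO_NAME") or f"nolestock/{model_name.split('/')[-1]}-finetuned-guanaco-lora"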

In [None]:
from peft import PeftConfig, PeftModel
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    load_in_4bit=True,
    device_map="auto",
)
inference_model = PeftModel.from_pretrained(model, repo_name)



### (Adapter only)

In [None]:
# peft_model_id = PEFT_MODEL_ID
peft_model_id = "Bruno/Harpia-7b-guanacoLora"

In [None]:
import torch
import transformers
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig


config = PeftConfig.from_pretrained(peft_model_id)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             return_dict=True,
                                             quantization_config=bnb_config,
                                             trust_remote_code=True,
                                             device_map={"":0})

# Attach the LoRA adapter to the quantized base model
inference_model = PeftModel.from_pretrained(model, peft_model_id)

### Inference

#### pipeline

In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model=inference_model,
    tokenizer=tokenizer,
)

p = """### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant:"""
sequences = pipeline(
    p,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerN

Result: ### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: The term'monopsony' refers to a type of market structure in which there exists only one buyer for a particular good or service. This can arise when there are high barriers to entry or exit from the marketplace, leading to a limited number of firms that can supply the necessary goods or services.

Monopsony is often discussed in the context of labour markets. For example, a firm may have monopsony power in relation to a particular type of employee or skill-set. This can lead to the firm engaging in practices that negatively impact workers, such as wage suppression or poor working conditions.

There are a variety of potential examples related to potential monopsonies in the labour market, such as:

1
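As the result above shows, the pipeline returns the prompt followed by the completion in `generated_text`. Here is a small sketch of building a prompt in the `### Human:` / `### Assistant:` format and trimming the output down to the assistant's reply. Both helpers are hypothetical and the example strings are made up.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the '### Human: ...### Assistant:' format used above."""
    return f"### Human: {user_message}### Assistant:"


def extract_reply(generated_text: str, prompt: str) -> str:
    """Strip the echoed prompt and cut the text at the next human turn, if any."""
    reply = generated_text[len(prompt):] if generated_text.startswith(prompt) else generated_text
    # The model may keep going and invent a follow-up "### Human:" turn; drop it.
    return reply.split("### Human:")[0].strip()


prompt = build_prompt("What is a monopsony?")
# Made-up model output for illustration
fake_output = prompt + " A market with a single buyer.### Human: Thanks!"
print(extract_reply(fake_output, prompt))
```

With the real pipeline, `extract_reply(seq["generated_text"], p)` could be applied inside the loop over `sequences`.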


#### more control


See [behind the pipeline](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt) page for more info.

In [None]:
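import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig

peft_model_id = "Bruno/Harpia-7b-guanacoLora"

config = PeftConfig.from_pretrained(peft_model_id)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             return_dict=True,
                                             quantization_config=bnb_config,
                                             trust_remote_code=True,
                                             device_map={"":0})

# Attach the LoRA adapter; without this step only the base model is used
model = PeftModel.from_pretrained(model, peft_model_id)

# The prompt templates are left empty here. Fill them in with the format the adapter
# was trained on (see its model card), using {instruction} and {input} placeholders.
# The template must contain "### Respuesta:" because the output is split on that marker.
prompt_input = ""
prompt_no_input = ""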

def create_prompt(instruction, input=None):
  if input:
    return  prompt_input.format(instruction=instruction, input=input)
  else:
    return prompt_no_input.format(instruction=instruction)

def generate(
        instruction,
        input=None,
        max_new_tokens=128,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        **kwargs,
):
    prompt = create_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")
    attention_mask = inputs["attention_mask"].to("cuda")
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens
        )
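    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    # Return the text after the response marker; fall back to the full output
    # if the marker is missing (for example when the templates are still empty)
    _, marker, answer = output.partition("### Respuesta:")
    return answer if marker else output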

instruction = "Me conte algumas curiosidades sobre o Brasil"

print("Instruções:", instruction)
print("Resposta:", generate(instruction))


## Merging (OPTIONAL) (RESTART)
[FastChat `apply_lora.py`](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/apply_lora.py)

### Model already loaded

In [None]:
%%capture
!pip install -q -U git+https://github.com/lvwerra/trl.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb #einops for falcon

In [None]:
model_name = "ybelkada/falcon-7b-sharded-bf16"
repo_name = f"nolestock/{model_name.split('/')[-1]}-finetuned-guanaco-lora"

In [None]:
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Hypothetical output directory for the merged model; change as needed
merged_model_path = "./merged-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the base model without quantization: merging fails with
# "ValueError: Cannot merge LORA layers when the model is loaded in 8-bit mode"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, repo_name)

print("Applying the LoRA")
model = model.merge_and_unload()

print("Saving the target model")
model.save_pretrained(merged_model_path)
tokenizer.save_pretrained(merged_model_path)
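Conceptually, merging folds the adapter into the base weights: for each adapted layer the merged weight is `W + (alpha / r) * B @ A`, after which the adapter is no longer needed at inference time. The toy example below checks this identity with plain Python lists. It is a sketch of the math under the standard LoRA formulation, not PEFT's implementation, and all the numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (out x in)
A = [[0.5, -1.0]]             # LoRA down-projection (r x in)
B = [[2.0], [1.0]]            # LoRA up-projection (out x r)
alpha, r = 2, 1               # toy values; scaling = alpha / r
x = [[1.0], [2.0]]            # input column vector

# Adapter kept separate: base output plus the scaled low-rank update
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged into the weight once, then a single matmul at inference time
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Both paths give the same output, which is why a merged model can be served without PEFT installed.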





Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

### Pushing Merged model to the hub

## Eval (RESTART)

### Language Model Evaluation Harness

#### QLoRA

In [None]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
!cd lm-evaluation-harness && pip install -e ".[auto-gptq]"

In [None]:
%%capture
!pip install -q datasets bitsandbytes einops git+https://github.com/huggingface/peft #wandb

In [None]:
model_name = "ybelkada/falcon-7b-sharded-bf16"
repo_name = f"nolestock/{model_name.split('/')[-1]}-finetuned-guanaco-lora"

[docs/task_table.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md)

In [None]:
!python ./lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained={model_name},peft={repo_name},dtype=float16,trust_remote_code=True,load_in_4bit=True \
    --tasks bigbench_causal_judgement \
    --device cuda:0 \
    --output_base_path ./


Selected Tasks: ['truthfulqa_gen']
Loading checkpoint shards: 1

In [None]:
!python ./lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained={model_name},peft={repo_name},dtype=float16,trust_remote_code=True,load_in_4bit=True \
    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
    --device cuda:0
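The `--model_args` value is just a comma-separated list of `key=value` pairs. When experimenting with several models or adapters, it can be convenient to build that string in Python first. A minimal sketch follows; `build_model_args` is a hypothetical helper, not part of the evaluation harness, and the keys are the ones used in the commands above.

```python
def build_model_args(**options) -> str:
    """Join keyword options into the comma-separated key=value form used by --model_args."""
    return ",".join(f"{key}={value}" for key, value in options.items())


model_name = "ybelkada/falcon-7b-sharded-bf16"
repo_name = f"nolestock/{model_name.split('/')[-1]}-finetuned-guanaco-lora"

model_args = build_model_args(
    pretrained=model_name,
    peft=repo_name,
    dtype="float16",
    trust_remote_code=True,
    load_in_4bit=True,
)
print(model_args)
```

The resulting string can then be interpolated into the shell command with `{model_args}`, in the same way `{model_name}` is interpolated above.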

## Integration

### Langchain

#### ⚔️ Minihack

[Using NLE on Google Colab issue](https://github.com/facebookresearch/nle/issues/117#issuecomment-906508738)

In [6]:
!sudo apt update
!sudo apt install -y build-essential autoconf libtool pkg-config python3-dev \
    python3-pip python3-numpy git flex bison libbz2-dev

!wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | sudo apt-key add -
!sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ bionic main'
!sudo apt-get update && apt-get --allow-unauthenticated install -y \
    cmake \
    kitware-archive-keyring

# feel free to use a more elegant solution to make /usr/bin/cmake the default one
!sudo rm $(which cmake)
!$(which cmake) --version

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:6 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:7 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease
Hit:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Hit:10 https://apt.kitware.com/ubuntu bionic InRelease
Hit:11 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease
Hit:12 http://ppa.launchpad.net/ubuntugis/ppa/ubuntu focal InRelease
Reading package lists... Done
Building dependency tree       
Reading state information... Done
14 packages can be upgraded. Run 'apt list

In [2]:
! python --version

Python 3.10.12


In [7]:
# -v can be dropped unless you run into some other trouble
!pip3 install -Uv nle

import nle, gym

env = gym.make("NetHackChallenge-v0")
_ = env.reset()
env.render()

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nle
  Using cached nle-0.9.0.tar.gz (7.0 MB)
  Running command pip subprocess to install build dependencies
  Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://us-python.pkg.dev/colab-wheels/public/simple/
  Collecting setuptools>=40.8.0
    Using cached setuptools-68.0.0-py3-none-any.whl (804 kB)
  Collecting wheel
    Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
  Installing collected packages: wheel, setuptools
    Creating /tmp/pip-build-env-aex1earp/overlay/local/bin
    changing mode of /tmp/pip-build-env-aex1earp/overlay/local/bin/wheel to 755
  ERROR: pip's dependency resolver does not currently take into account all the packages that are installed