<a href="https://colab.research.google.com/github/BoxOfCereal/Fine-Tuning-Loop/blob/main/fine_tune_loop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FTL - Fine Tune Loop

## Intro

In this notebook I will attempt to show all the steps necessary to go from data to text generation. The main headings demonstrate the easiest way through a whole training loop: loading data, loading your model from the Hugging Face ecosystem, training the model, benchmarking it, running inference, and using the result in your prompt library (in our case, LangChain).



The subsections of each heading contain more in-depth variations of each of these steps. It is my hope that seeing multiple examples that accomplish the same thing will show the underlying patterns, so that you can not only collect data, train a model, and run inference on it, but also adapt the process to your use case with your own custom data, model, and inference needs.

This notebook is designed to take you from data to application with large language models. A star emoji (⭐) marks the path recommended for first-time use.

This notebook was adapted from Hugging Face's blog post about training Falcon using QLoRA, which can be found [here](https://huggingface.co/blog/falcon), and [here is the associated notebook](https://colab.research.google.com/drive/1BiQiw31DT7-cDp1-0ySXvvhzqomTdI-o?usp=sharing).




## Pre-requisites:
* A Hugging Face account - [Sign up](https://huggingface.co/join)
* A Weights & Biases account - [Sign up](https://wandb.ai/login?signup=true)
* Some Python experience
* Some basic experience with large language models - [Course](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt)

In [3]:
%%capture
!pip install --upgrade huggingface_hub
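import os
from huggingface_hub import login, whoami, create_repo
# Read the access token from an environment variable instead of hard-coding it in the notebook.
# The variable name HF_TOKEN is an assumption; use whatever name you stored your token under.
login(token=os.environ["HF_TOKEN"])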

user = whoami()['name']
# repo_id = f'{user}/hf-hub-modelcards-pr-test'


## Legend:
⭐ - Recommended

## Fine tuning a model

Welcome to this Google Colab notebook that demonstrates how to fine-tune a language model using QLoRA on a single Google Colab instance. We will leverage the PEFT library from the Hugging Face ecosystem for efficient fine-tuning. With QLoRA, we can make the most of limited memory resources during the fine-tuning process. Let's get started and transform this model into a chatbot that you can interact with!

### 1️⃣ Dataset

In [None]:
!pip install -q datasets

#### "timdettmers/openassistant-guanaco" ⭐

For our experiment, we will use the Guanaco dataset, which is a clean subset of the OpenAssistant dataset adapted to train general purpose chatbots.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).

In [None]:
from datasets import load_dataset

dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name, split="train")

Let's take a look at what's inside the dataset. It's always important to understand what you're feeding your model.

In [None]:
dataset[0]

If you want more information, you can check out [the Hugging Face documentation](https://huggingface.co/docs/datasets/access).
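To make the inspection step a little more concrete, here is a minimal sketch that splits one record's `text` field into speaker turns. It assumes the Guanaco convention of marking turns with `### Human:` and `### Assistant:`. The `split_turns` helper and the sample string are hypothetical and not part of the `datasets` API.

```python
import re

# Turn markers assumed to appear in a Guanaco-style `text` field.
TURN_MARKER = re.compile(r"### (Human|Assistant):")

def split_turns(text):
    """Return a list of (speaker, message) pairs from one record's text."""
    parts = TURN_MARKER.split(text)
    # re.split with one capture group yields [prefix, speaker, message, speaker, message, ...]
    speakers, messages = parts[1::2], parts[2::2]
    return [(speaker, message.strip()) for speaker, message in zip(speakers, messages)]

# Hypothetical sample shaped like a dataset record; with the real data use dataset[0]["text"].
sample = "### Human: What is QLoRA?### Assistant: A way to fine-tune quantized models.### Human: Thanks!"
for speaker, message in split_turns(sample):
    print(f"{speaker}: {message}")
```

Looking at a few records this way makes it easier to spot multi-turn conversations and records that end on a human turn.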

#### Custom Dataset (WIP!)

[LIMA: Less Is More for Alignment](https://arxiv.org/pdf/2305.11206.pdf)

"We observe that, for the purpose of alignment, scaling up input diversity and output quality have measurable positive effects, while scaling up quantity alone might not."

How much data is needed to teach a pre-trained large language model new factual information?

[Textbooks Are All You Need](https://arxiv.org/abs/2306.11644)
[autolabel](https://github.com/refuel-ai/autolabel)

* A filtered code-language dataset, which is a subset
of The Stack and StackOverflow, obtained by
using a language model-based classifier (consisting of about 6B tokens).
* A synthetic textbook dataset consisting of <1B tokens of GPT-3.5 generated Python textbooks.
* A small synthetic exercises dataset consisting of ∼180M tokens of Python exercises and solutions.

Filtering of existing code datasets using a transformer-based classifier.

The following example demonstrates the synthetically generated textbook text:

To begin, let us define singular and nonsingular matrices. A matrix is said to be singular if its
determinant is zero. On the other hand, a matrix is said to be nonsingular if its determinant is not
zero. Now, let's explore these concepts through examples.
Example 1:
Consider the matrix A = np.array([[1, 2], [2, 4]]). We can check if this matrix is singular or
nonsingular using the determinant function. We can define a Python function, `is_singular(A)`, which
returns true if the determinant of A is zero, and false otherwise.
```python
import numpy as np
def is_singular(A):
    det = np.linalg.det(A)
    if det == 0:
        return True
    else:
        return False
A = np.array([[1, 2], [2, 4]])
print(is_singular(A)) # True
```

### 2️⃣ Loading the model

#### 🦙 togethercomputer/RedPajama-INCITE-Base-3B-v1 ⭐

Run the cells below to setup and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [None]:
%%capture
!pip install -q -U git+https://github.com/lvwerra/trl.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git
!pip install -q bitsandbytes wandb

In [None]:
model_name = "togethercomputer/RedPajama-INCITE-Base-3B-v1"
repo_name = f"{user}/{model_name.split('/')[-1]}-SFT-guanaco-lora"

In [None]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

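from packaging import version

MIN_TRANSFORMERS_VERSION = '4.25.1'

# check transformers version (compare parsed versions, not raw strings, so that 4.9 sorts below 4.25)
assert version.parse(transformers.__version__) >= version.parse(MIN_TRANSFORMERS_VERSION), f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'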

# init
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)

# model = model.to('cuda:0')




Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...



In [None]:
model.modules

<bound method Module.modules of GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50432, 2560)
    (layers): ModuleList(
      (0-31): 32 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=2560, out_features=7680, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (dense_4h_to_h): Linear4bit(in_features=10240, out_features=2560, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=2560, out_featur
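As a rough illustration of why 4-bit loading matters on a Colab GPU, the sketch below estimates the storage needed for the `Linear4bit` weights alone, using the shapes printed above. It is back-of-the-envelope arithmetic under stated assumptions: biases, embeddings, layer norms, quantization constants, and activations are all ignored. The helper names are hypothetical.

```python
# (in_features, out_features) of the Linear4bit layers in one GPTNeoXLayer,
# copied from the printed module tree above.
LINEAR4BIT_SHAPES = {
    "query_key_value": (2560, 7680),
    "dense": (2560, 2560),
    "dense_h_to_4h": (2560, 10240),
    "dense_4h_to_h": (10240, 2560),
}
NUM_LAYERS = 32  # "(0-31): 32 x GPTNeoXLayer"

def quantized_weight_count(shapes, num_layers):
    """Number of weights held in Linear4bit layers (biases ignored)."""
    return num_layers * sum(i * o for i, o in shapes.values())

def gib(num_weights, bits_per_weight):
    """Storage for the weights alone, in GiB."""
    return num_weights * bits_per_weight / 8 / 2**30

n = quantized_weight_count(LINEAR4BIT_SHAPES, NUM_LAYERS)
print(f"weights in Linear4bit layers: {n:,}")
print(f"float16: {gib(n, 16):.2f} GiB")
print(f"4-bit:   {gib(n, 4):.2f} GiB")
```

The real footprint is somewhat larger than the 4-bit figure because of the parts left out here, but the fourfold reduction on the bulk of the weights is what makes the model fit comfortably.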

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
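To get a feel for what this configuration trains, here is a small sketch that counts the LoRA parameters it would add to the RedPajama model. It assumes the standard LoRA layout (an `r x in_features` matrix A and an `out_features x r` matrix B per targeted linear layer) and uses the layer shapes from the printed module tree. The `lora_param_count` helper is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) next to a frozen linear layer."""
    return r * (in_features + out_features)

# Shapes of the targeted layers, taken from the printed RedPajama module tree.
target_shapes = {
    "query_key_value": (2560, 7680),
    "dense": (2560, 2560),
    "dense_h_to_4h": (2560, 10240),
    "dense_4h_to_h": (10240, 2560),
}
lora_r, lora_alpha, num_layers = 64, 16, 32  # values mirror the LoraConfig above

per_layer = sum(lora_param_count(i, o, lora_r) for i, o in target_shapes.values())
trainable = per_layer * num_layers
scaling = lora_alpha / lora_r  # factor applied to the low-rank update

print(f"LoRA parameters per transformer layer: {per_layer:,}")
print(f"LoRA parameters in total: {trainable:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Raising `r` grows the adapter linearly, while `lora_alpha` only changes how strongly the learned update is weighted.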

#### 🦅 ybelkada/falcon-7b-sharded-bf16 in 4bit

Run the cells below to setup and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops` as it is a requirement to load Falcon models.

In [None]:
%%capture
!pip install -q -U git+https://github.com/lvwerra/trl.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb #einops for falcon

In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it in 4bit and attach LoRA adapters to it. Let's get started!

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

Let's also load the tokenizer below

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Below we will set up the configuration used to create the LoRA model. According to the [QLoRA paper](https://arxiv.org/abs/2305.14314), it is important to consider all linear layers in the transformer block for maximum performance. Therefore we will add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules in addition to the mixed query key value layer.

A really good video on QLoRA is [AemonAlgiz](https://www.youtube.com/@AemonAlgiz)'s [QLoRA Is More Than Memory Optimization. Train Your Models With 10% of the Data for More Performance.](https://youtu.be/v6sf4EF45fI) Warning: he does go into some math, but even if you don't understand it all (which I certainly don't), he explains it in a very satisfying way.

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

Let's take a look at the model's modules so we can see what we're targeting and how to find the linear modules in any other architecture we're interested in:

In [None]:
model.modules

<bound method Module.modules of RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x DecoderLayer(
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): MLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)>
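One low-tech way to turn a printout like this into a `target_modules` list is to pull out the names of the quantized linear layers with a regular expression. The sketch below does that on a pasted excerpt of the Falcon printout. The `quantized_linear_names` helper is hypothetical, and on a live model you would more likely walk `model.named_modules()` and check each module's type.

```python
import re

def quantized_linear_names(printed_model):
    """Collect the attribute names of Linear4bit layers from a printed module tree."""
    names = re.findall(r"\((\w+)\): Linear4bit\(", printed_model)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(names))

# An excerpt of the Falcon printout above, pasted as a string.
printed = """
(self_attention): Attention(
  (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
  (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
)
(mlp): MLP(
  (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
  (act): GELU(approximate='none')
  (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
)
(lm_head): Linear(in_features=4544, out_features=65024, bias=False)
"""
print(quantized_linear_names(printed))
```

The unquantized `lm_head` is skipped because it is a plain `Linear`, which matches the target list used in the `LoraConfig` above.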

### 🦙🦅 Inference ⭐

Whichever model you loaded above, this inference cell should work, although you should expect different responses.

In [None]:
# infer
prompt = "Alan Turing is"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True,
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 one of the most famous British scientists and mathematicians. He is famous for his work on the design of the first computer, the Turing machine.
He was also a gay man who was persecuted for his sexuality during the time of his life. He was arrested and forced to take hormone therapy.
He died by suicide in 1954.
Turing’s work was not only important in the field of mathematics, but also in the field of computing.
He is considered to be one of the greatest mathematicians of all time.
Turing’s work was so influential that it is still used today.
He was


'\na name that has been synonymous with the computer age since the 1950s. The British mathematician, logician, and cryptanalyst is widely regarded as the father of modern computing. His contributions to the development of the modern computer and the theory of computation have had a profound impact on the world we live in today.\nTuring’s contributions to the development of the modern computer were made in the 1940s and 1950s. He is most famous for his work on the Turing machine, a theoretical model of a computing machine that was able to perform all the mathematical operations of a computer. Turing’s work on the...\n'
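The `temperature`, `top_k` and `top_p` arguments passed to `generate` all reshape the next-token distribution before a token is sampled. The sketch below shows the idea on a toy dictionary of scores in plain Python. It is a simplified illustration, not the library's implementation; the `filter_distribution` helper and the toy logits are hypothetical.

```python
import math

def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.7):
    """Return the renormalized distribution left after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax (lower values sharpen the distribution).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # Softmax weights over the survivors (subtract the max for numerical stability).
    peak = ranked[0][1]
    weights = [(tok, math.exp(value - peak)) for tok, value in ranked]
    total = sum(w for _, w in weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, w in weights:
        kept.append((tok, w))
        cumulative += w / total
        if cumulative >= top_p:
            break
    kept_total = sum(w for _, w in kept)
    return {tok: w / kept_total for tok, w in kept}

toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}  # hypothetical next-token scores
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.7))
```

With these toy scores only `a` and `b` survive the filters, and the model would sample between them; lowering the temperature to 0.7 concentrates enough probability on `a` that it becomes the only candidate.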

### 3️⃣ Loading the trainer

#### 🦅🦙 Supervised Fine-tuning Trainer QLoRA 4-Bit ⭐

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers `Trainer` to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first load the training arguments below.

In [None]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

What are all these training arguments?

Depending on your experience, some may seem obvious and others not so much. The [documentation on training on one GPU](https://huggingface.co/docs/transformers/perf_train_gpu_one) from the Hugging Face docs goes over many of the arguments dedicated to performance.
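A few of these arguments interact, and a little arithmetic makes the interaction easier to see. The sketch below works out the effective batch size, the number of training sequences consumed, and how many checkpoints get written for the values used above. The single-GPU count is an assumption about the Colab runtime.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
save_steps = 10
num_devices = 1  # assumption: a single Colab GPU

# Gradients from several small batches are accumulated before each optimizer step.
sequences_per_optimizer_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
sequences_seen = sequences_per_optimizer_step * max_steps
checkpoints_written = max_steps // save_steps

print(f"effective batch size: {sequences_per_optimizer_step}")
print(f"sequences seen over the run: {sequences_seen}")
print(f"checkpoints written to output_dir: {checkpoints_written}")
```

If disk space on the Colab instance is tight, raising `save_steps` or setting `save_total_limit` in `TrainingArguments` reduces how many checkpoints are kept.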

Finally, pass everything to the trainer:

In [None]:
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [None]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

#### Train the model

Now let's train the model! Simply call `trainer.train()`

TRL by default sends statistics to the Weights & Biases website, which is why you need an account. You can also use TensorBoard; however, let's be lazy and use W&B.

In [None]:
trainer.train()

Taking a look at the state of the trainer isn't necessary, but it shows what the trainer actually tracks, if you're interested in that kind of thing.

In [None]:
trainer.state

Now let's save the model. TRL is handy because it automatically takes care of saving only our adapter, which is much smaller than the original model.

In [None]:
trainer.save_model()

### 4️⃣ Inference

Now that we have finished training our adapter, let's try it out! We will use the Hugging Face pipeline for text generation.

#### 🦙🦅 pipeline ⭐

In [None]:
import transformers  # not yet imported as a module if you followed the Falcon path above

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

p = """### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant:"""
sequences = pipeline(
    p,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
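The pipeline returns the prompt followed by the continuation, and a chat-tuned model may keep going and invent the next human turn. Here is a minimal sketch of building prompts in the `### Human:` / `### Assistant:` format used above and trimming the result down to a single reply. The `build_prompt` and `extract_reply` helpers and the fake output string are hypothetical.

```python
HUMAN, ASSISTANT = "### Human:", "### Assistant:"

def build_prompt(question):
    """Wrap a question in the turn markers used by the prompt above."""
    return f"{HUMAN} {question}{ASSISTANT}"

def extract_reply(generated_text, prompt):
    """Drop the echoed prompt and anything after the model starts a new human turn."""
    reply = generated_text[len(prompt):] if generated_text.startswith(prompt) else generated_text
    return reply.split(HUMAN)[0].strip()

prompt = build_prompt("What is a monopsony?")
# Hypothetical pipeline output: the prompt echoed back, followed by the continuation.
fake_output = prompt + " A market with a single buyer.### Human: Thanks"
print(extract_reply(fake_output, prompt))
```

With the real pipeline you would pass `seq['generated_text']` as `generated_text`.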


### 5️⃣ Saving to Hub

In this section we will save the trained adapter to the Hugging Face Hub, inside the repo denoted by `repo_name`.

In [7]:
from huggingface_hub import whoami #create_repo

user = whoami()['name']
model_name = "togethercomputer/RedPajama-INCITE-Base-3B-v1"
repo_name = f"{user}/{model_name.split('/')[-1]}-SFT-guanaco-lora"
print(repo_name)

nolestock/RedPajama-INCITE-Base-3B-v1-SFT-guanaco-lora


#### 🦙🦅 Making a model card ⭐

Model cards are not required to upload a model to the Hugging Face Hub. However, they give other people an idea of your model and how it might be useful to them. We're going to use the default template; the [Hugging Face model card documentation](https://huggingface.co/docs/huggingface_hub/guides/model-cards) has more information.

In [None]:
%%capture
!pip install Jinja2

In [None]:
from huggingface_hub import ModelCard, ModelCardData

# library_name tells the Hub which library loads the weights; this repo holds a PEFT adapter.
card_data = ModelCardData(language='en', license='mit', library_name='peft')
card = ModelCard.from_template(
    card_data,
    model_id=f"{model_name.split('/')[-1]}-SFT-guanaco-lora",
    model_description="this model does this and that",  # placeholder: describe your own model here
    developers=user,  # your Hub username, from whoami() above
    repo=repo_name,
)
card.save('my_model_card_2.md')
print(card)

#### 🦙🦅 Push to the hub ⭐

Now let's push our model card and the model to our Hugging Face repo.

In [None]:
trainer.model.push_to_hub(repo_name)
card.push_to_hub(repo_name)

That was easy, huh?

## Loading (RESTART)

In this section we will load the model after a fresh restart, so we can see how to load a model together with its adapter for inference. We will also show how to specify just the adapter and load a model for inference from it.

#### 🦙 togethercomputer/RedPajama-INCITE-Base-3B-v1 ⭐

Regardless of which way we decide to load the model we need to install the required packages.

In [None]:
%%capture
!pip install -q -U git+https://github.com/lvwerra/trl.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git
!pip install -q bitsandbytes wandb
!pip install --upgrade huggingface_hub

##### From Huggingface hub (Model and Adapter) ⭐

In [6]:
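import os
from huggingface_hub import login, whoami, create_repo
# Read the access token from an environment variable instead of hard-coding it in the notebook.
# The variable name HF_TOKEN is an assumption; use whatever name you stored your token under.
login(token=os.environ["HF_TOKEN"])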

user = whoami()['name']
model_name = "togethercomputer/RedPajama-INCITE-Base-3B-v1"
repo_name = f"{user}/{model_name.split('/')[-1]}-SFT-guanaco-lora" #adapter

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
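import torch
import transformers
from packaging import version
from peft import PeftModel  # needed for PeftModel.from_pretrained below
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MIN_TRANSFORMERS_VERSION = '4.25.1'

# check transformers version (compare parsed versions, not raw strings, so that 4.9 sorts below 4.25)
assert version.parse(transformers.__version__) >= version.parse(MIN_TRANSFORMERS_VERSION), f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'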

# init
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)

model = PeftModel.from_pretrained(model, repo_name)

# model = model.to('cuda:0')



##### (Adapter only) ⭐

In [None]:
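import os
from huggingface_hub import login, whoami, create_repo
# Read the access token from an environment variable instead of hard-coding it in the notebook.
# The variable name HF_TOKEN is an assumption; use whatever name you stored your token under.
login(token=os.environ["HF_TOKEN"])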

user = whoami()['name']
model_name = "togethercomputer/RedPajama-INCITE-Base-3B-v1"
repo_name = f"{user}/{model_name.split('/')[-1]}-SFT-guanaco-lora" #adapter

In [None]:
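import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig

# The adapter config records which base model the adapter was trained on.
config = PeftConfig.from_pretrained(repo_name)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Only the adapter was pushed to `repo_name`, so take the tokenizer from the base model.
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             return_dict=True,
                                             quantization_config=bnb_config,
                                             trust_remote_code=True,
                                             device_map={"":0})

# Attach the trained LoRA adapter to the quantized base model.
model = PeftModel.from_pretrained(model, repo_name)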

### Inference

#### pipeline

In [None]:
import transformers  # not yet imported as a module if you followed the adapter-only path above

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

p = """### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant:"""
sequences = pipeline(
    p,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


#### more control


See [behind the pipeline](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt) page for more info.

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig

peft_model_id = "Bruno/Harpia-7b-guanacoLora"

config = PeftConfig.from_pretrained(peft_model_id)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             return_dict=True,
                                             quantization_config=bnb_config,
                                             trust_remote_code=True,
                                             device_map={"":0})

# Attach the LoRA adapter to the quantized base model.
model = PeftModel.from_pretrained(model, peft_model_id)

# Prompt templates are left empty here. Fill them in with the template the adapter was
# trained on; they are expected to contain `{instruction}` (and `{input}`) placeholders.
prompt_input = ""
prompt_no_input = ""

def create_prompt(instruction, input=None):
  if input:
    return  prompt_input.format(instruction=instruction, input=input)
  else:
    return prompt_no_input.format(instruction=instruction)

def generate(
        instruction,
        input=None,
        max_new_tokens=128,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        **kwargs,
):
    prompt = create_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")
    attention_mask = inputs["attention_mask"].to("cuda")
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    # Keep the text after the response marker; fall back to the full output if the marker is absent.
    return output.split("### Respuesta:")[-1]

instruction = "Me conte algumas curiosidades sobre o Brasil"

print("Instruções:", instruction)
print("Resposta:", generate(instruction))


## Merging (OPTIONAL)(RESTART)
[FastChat `apply_lora.py`](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/apply_lora.py)

### Model already loaded

In [None]:
%%capture
!pip install -q -U git+https://github.com/lvwerra/trl.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb #einops for falcon

In [None]:
model_name = "ybelkada/falcon-7b-sharded-bf16"
repo_name = f"nolestock/{model_name.split('/')[-1]}-finetuned-guanaco-lora"

In [None]:
from peft import PeftConfig, PeftModel
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # load_in_4bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, repo_name)

print("Applying the LoRA")
model = model.merge_and_unload()

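merged_dir = "./merged-model"  # hypothetical local output directory; pick any path you like

print("Saving the target model")
# Note: merging needs unquantized weights. With a quantized load you get
# ValueError: Cannot merge LORA layers when the model is loaded in 8-bit mode
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)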





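Conceptually, merging folds the low-rank update back into each frozen weight matrix, so the adapter is no longer needed at inference time. The sketch below shows that arithmetic on a toy layer in plain Python, assuming the standard LoRA formulation `W + (lora_alpha / r) * B @ A`. The helper names and all numbers are made up for illustration, and this is not PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy layer with 3 inputs, 2 outputs and a rank-1 adapter.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # out_features x in_features
lora_a = [[1.0, 2.0, 3.0]]                   # r x in_features
lora_b = [[2.0], [4.0]]                      # out_features x r
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

Because the result has the same shape as the original weight, the merged model can be saved and loaded like any ordinary checkpoint.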

### Pushing Merged model to the hub

## Eval (RESTART)

[Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/)

### Language Model Evaluation Harness

#### QLoRA

In [None]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
!cd lm-evaluation-harness && pip install -e ".[auto-gptq]"

In [None]:
%%capture
!pip install -q datasets bitsandbytes einops git+https://github.com/huggingface/peft #wandb

In [None]:
model_name = "ybelkada/falcon-7b-sharded-bf16"
repo_name = f"nolestock/{model_name.split('/')[-1]}-finetuned-guanaco-lora"

[docs/task_table.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md)

In [None]:
!python ./lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained={model_name},peft={repo_name},dtype=float16,trust_remote_code=True,load_in_4bit=True \
    --tasks bigbench_causal_judgement \
    --device cuda:0 \
    --output_base_path ./


Selected Tasks: ['truthfulqa_gen']
Loading checkpoint shards: 1

In [None]:
!python ./lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained={model_name},peft={repo_name},dtype=float16,trust_remote_code=True,load_in_4bit=True \
    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
    --device cuda:0

## Integration

### Langchain