<a href="https://colab.research.google.com/github/Manideep-1105/EmoConnectAI/blob/main/Unsloth_SFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SFT models for free with Unsloth!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a>
</div>

**To do a test training run, simply press "run all"!**

Kaggle is recommended for finetuning projects that will take longer than 4 hours, because Google Colab disconnects long-running sessions. Kaggle gives you 12 uninterrupted hours.

It is also recommended to connect Weights & Biases (https://wandb.ai/) and Hugging Face (https://huggingface.co/) to save models and make backups.

# Section 1: Installing Unsloth

Based on your environment, run either ```install_kaggle()``` or ```install_colab()```

*Warning: The installation code occasionally changes:*

*Periodically update/replace the 2 hidden cells for this section to the latest at https://github.com/unslothai/*

In [None]:
def install_kaggle():
  !mamba install --force-reinstall aiohttp -y
  !pip install -U "xformers<0.0.26" --index-url https://download.pytorch.org/whl/cu121
  !pip install "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"

  # Temporary fix for https://github.com/huggingface/datasets/issues/6753
  !pip install datasets==2.16.0 fsspec==2023.10.0 gcsfs==2023.10.0


In [None]:
def install_colab():
  # Installs Unsloth, Xformers (Flash Attention) and all other packages!
  !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
  !pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

# ↓↓↓↓↓↓↓↓↓↓

In [None]:
%%capture
install_colab()

# Optional but highly recommended to use Wandb to save backups of your model in case the training run crashes
import os
os.environ["WANDB_DISABLED"] = "true"
#!pip install wandb
#!wandb login ___
#os.environ["WANDB_PROJECT"]="___"
#os.environ["WANDB_LOG_MODEL"] = "checkpoint"

# Section 2: Download and Preprocess Dataset
The goal here is to convert all the datasets to ShareGPT format, where each row looks like this:

```{"conversations" : [
    { "from": "system", "value": "..."},
    { "from": "human", "value": "..."},
    { "from": "gpt", "value": "..."}
  ]
}```

This way, we can automatically format the datasets to the desired prompt format in section 3.

An example of a dataset in ShareGPT format is https://huggingface.co/datasets/jondurbin/airoboros-3.1. This has the ```{"conversations":[]}``` rows, but it also has unrelated rows we would like to remove.

An example of a dataset partially in ShareGPT format is https://huggingface.co/datasets/Open-Orca/SlimOrca. However, it has the "weight" elements which must be removed.

An example of a dataset in a completely different format is https://huggingface.co/datasets/LDJnr/Capybara. Here, we instead have ```[{"input":... "output":...}]``` pairs, with no system prompt. THIS WILL NOT WORK! We need to manually write code that formats this into ShareGPT.

If you want to use datasets formatted differently from the ones supported here, you will have to code it yourself. It is recommended you read https://huggingface.co/docs/datasets/en/process and expand this section out for examples.
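
As a minimal sketch of such hand-written conversion, assume a hypothetical dataset whose rows have `question` and `answer` columns (these column names, and the helper name `qa_to_sharegpt`, are illustrative and not taken from any dataset above). The batched helper below builds ShareGPT rows from plain dictionaries; with a real dataset it could be passed to `dataset.map(..., batched=True)` in the same way as the helpers in this section.

```python
# Hypothetical converter: 'question'/'answer' columns -> ShareGPT "conversations".
# The column names are assumptions made for illustration only.
def qa_to_sharegpt(batch: dict, system_prompt: str = "") -> dict:
    conversations = []
    for question, answer in zip(batch["question"], batch["answer"]):
        row = []
        # Only add a system turn when a system prompt was provided
        if system_prompt != "":
            row.append({"from": "system", "value": system_prompt})
        row.append({"from": "human", "value": question})
        row.append({"from": "gpt", "value": answer})
        conversations.append(row)
    return {"conversations": conversations}


# A batch in the column-oriented layout that map(batched=True) passes in
example_batch = {
    "question": ["What is 1+1?", "Name a primary color."],
    "answer": ["2.", "Red."],
}
converted = qa_to_sharegpt(example_batch, system_prompt="You are a helpful assistant.")
print(converted["conversations"][0])
```

The output of such a helper still contains only the `conversations` key, so it composes with `remove_unrelated_columns` once the original columns are dropped.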

# Tip
Here is a list of recommended dataset makers/curators to find datasets from:
- https://huggingface.co/argilla
- https://huggingface.co/HuggingFaceH4
- https://huggingface.co/jondurbin
- https://huggingface.co/cognitivecomputations
- https://huggingface.co/LDJnr
- https://huggingface.co/Open-Orca
- https://huggingface.co/glaiveai
- https://huggingface.co/grimulkan

In [None]:
from datasets import Dataset, load_dataset
# TODO: I hate using globals like this, but I cannot come up with a cleaner alternative right now

In [None]:
# Removes all columns besides the "conversations"
# This should be called last, after a dataset has been processed to contain the "conversations" column!
def remove_unrelated_columns(dataset : Dataset) -> Dataset:
  return dataset.select_columns(["conversations"])

In [None]:
# Remove auxiliary elements in ShareGPT-esque datasets
def clean_shareGPT(dataset : Dataset) -> Dataset:
  def helper(row_batched : dict) -> dict:
    new_row_batched = {"conversations" : []}
    for row in row_batched['conversations']:
      new_row = []
      for x in row:
        new_row.append({"from": x["from"], "value": x["value"]})
      new_row_batched['conversations'].append(new_row)
    return new_row_batched
  return dataset.map(helper, batched=True)

In [None]:
# Auxiliary function that turns a BATCH OF ROWS from [{"input":..., "output":...}] pairs to ShareGPT (used with map(batched=True))
# Requires the globals 'system_prompt' and 'target_column' to be set
def from_pairs_to_shareGPT(row_batched : dict) -> dict:
  new_row_batched = {"conversations" : []}
  for row in row_batched[target_column]:
    new_row = []
    if (system_prompt != ""):
      new_row.append({"from": "system", "value" : system_prompt})
    for x in row:
      new_row.append({"from": "human", "value" : x["input"]})
      new_row.append({"from": "gpt", "value" : x["output"]})
    new_row_batched['conversations'].append(new_row)
  return new_row_batched

In [None]:
# Converts Capybara format datasets to ShareGPT
def capybara_to_shareGPT(dataset : Dataset) -> Dataset:
  global system_prompt, target_column
  system_prompt = ""
  target_column = "conversation"
  return dataset.map(from_pairs_to_shareGPT, batched=True)

In [None]:
# Auxiliary function that turns a BATCH OF ROWS from [msg1, msg2, msg3] to ShareGPT (used with map(batched=True))
# Requires the globals 'system_prompt' and 'target_column' to be set
def from_array_to_shareGPT(row_batched : dict) -> dict:
  new_row_batched = {"conversations" : []}
  for row in row_batched[target_column]:
    new_row = []
    if (system_prompt != ""):
      new_row.append({"from": "system", "value" : system_prompt})
    for x in range(len(row)):
      msg = row[x]
      if x % 2 == 0:
        new_row.append({"from": "human", "value" : msg})
      else:
        new_row.append({"from": "gpt", "value" : msg})
    new_row_batched['conversations'].append(new_row)
  return new_row_batched

In [None]:
# Converts Ultrachat format datasets to ShareGPT
def ultrachat_to_shareGPT(dataset : Dataset) -> Dataset:
  global system_prompt, target_column
  system_prompt = ""  # No system prompt; the helper raises a NameError if this global was never set
  target_column = "data"
  return dataset.map(from_array_to_shareGPT, batched=True)

In [None]:
# Auxiliary function that turns a BATCH OF ROWS with columns for the system prompt, instruction and response to ShareGPT
# Requires 'system_prompt' or 'target_column_system', and 'target_column_instruction', 'target_column_output' to be set
def from_columns_to_shareGPT(row_batched : dict) -> dict:
  new_row_batched = {"conversations" : []}
  # Iterate over the instruction column, which always exists (target_column_system may be "")
  for row_number in range(len(row_batched[target_column_instruction])):
    new_row = []
    if (system_prompt != ""):
      new_row.append({"from": "system", "value" : system_prompt})
    elif (target_column_system != ""):
      new_row.append({"from": "system", "value" : row_batched[target_column_system][row_number]})
    new_row.append({"from": "human", "value" : row_batched[target_column_instruction][row_number]})
    new_row.append({"from": "gpt", "value" : row_batched[target_column_output][row_number]})
    new_row_batched['conversations'].append(new_row)
  return new_row_batched

In [None]:
# Converts multi-column format datasets to ShareGPT
# E.g. Dolphin
def columns_to_shareGPT(dataset : Dataset) -> Dataset:
  global system_prompt, target_column_system, target_column_instruction, target_column_output
  system_prompt = ""
  target_column_system = "instruction"
  target_column_instruction = "input"
  target_column_output = "output"
  return dataset.map(from_columns_to_shareGPT, batched=True)

# ↓↓↓↓↓↓↓↓↓↓


In [None]:
# Let's download SlimOrca and Capybara, and preprocess them appropriately
slimorca = load_dataset("Open-Orca/SlimOrca", split="train")
slimorca = remove_unrelated_columns(clean_shareGPT(slimorca))
print(slimorca)

capybara = load_dataset("LDJnr/Capybara", split="train")
capybara = remove_unrelated_columns(capybara_to_shareGPT(capybara))
print(capybara)

# Tip: Dolphin and Ultrachat are very big datasets that are not suitable for training on Google Colab/Kaggle unless you filter them

#dolphin = load_dataset("cognitivecomputations/dolphin", 'flan1m-alpaca-uncensored', split="train")
#dolphin = remove_unrelated_columns(columns_to_shareGPT(dolphin))
#print(dolphin)

#ultrachat = load_dataset("stingning/ultrachat", split="train")
#ultrachat = remove_unrelated_columns(ultrachat_to_shareGPT(ultrachat))
#print(ultrachat)

# Note: if a dataset is wrapped inside json, e.g. {"SOME_KEY_HERE":[{"conversations":[]}, ...]}, you can pass field="SOME_KEY_HERE" when loading it with the "json" loader to extract the inner rows
# Most datasets aren't like this, so you can usually ignore it

Downloading readme:   0%|          | 0.00/2.15k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/986M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/517982 [00:00<?, ? examples/s]

Map:   0%|          | 0/517982 [00:00<?, ? examples/s]

Dataset({
    features: ['conversations'],
    num_rows: 517982
})


Downloading readme:   0%|          | 0.00/6.47k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/74.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16006 [00:00<?, ? examples/s]

Map:   0%|          | 0/16006 [00:00<?, ? examples/s]

Dataset({
    features: ['conversations'],
    num_rows: 16006
})


Downloading readme:   0%|          | 0.00/2.56k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.60G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/891857 [00:00<?, ? examples/s]

Map:   0%|          | 0/891857 [00:00<?, ? examples/s]

Dataset({
    features: ['conversations'],
    num_rows: 891857
})


# Section 3: Filter and Shuffle Datasets
Assuming you've loaded multiple datasets, let's pick how many rows of each we want to train on, and then merge them all together.
Additionally, we'll shuffle the dataset at the end if desired.
Some papers report that intentional orderings of the dataset (curriculum learning) improve performance, which you may wish to explore.

If you want to do custom filtering with datasets, you will have to code it yourself. It is recommended you read https://huggingface.co/docs/datasets/en/process.
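
As one possible starting point for custom filtering, the sketch below defines a row-level predicate on ShareGPT rows. The function name `keep_row` and the thresholds `max_turns` and `max_chars` are assumptions chosen for illustration, not values recommended by this notebook. A predicate of this shape can be handed to `Dataset.filter`, which keeps the rows for which it returns `True`.

```python
# Hypothetical filter: keep conversations that end with a model reply and
# stay under assumed limits on turn count and total character length.
def keep_row(row: dict, max_turns: int = 6, max_chars: int = 8000) -> bool:
    conversation = row["conversations"]
    # Drop empty rows and rows whose last turn is not from the model
    if not conversation or conversation[-1]["from"] != "gpt":
        return False
    total_chars = sum(len(message["value"]) for message in conversation)
    return len(conversation) <= max_turns and total_chars <= max_chars


rows = [
    {"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]},
    {"conversations": [{"from": "human", "value": "Unanswered question"}]},
    {"conversations": [{"from": "human", "value": "x" * 9000}, {"from": "gpt", "value": "ok"}]},
]
print([keep_row(row) for row in rows])

# With a Hugging Face dataset this would be used as:
#   filtered = dataset.filter(keep_row)
```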

In [None]:
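# Selects a given number of rows from a dataset
# The stride is rounded down, so rows near the end of the dataset may be left out
def subset(dataset : Dataset, count : int) -> Dataset:
  # Guard against a stride of 0 when count is larger than the dataset
  divisor = max(1, len(dataset) // count)
  new_dataset = dataset[::divisor]
  # Striding can overshoot by a few rows, so trim back to the requested count
  while len(new_dataset['conversations']) > count:
    new_dataset['conversations'].pop()
  return Dataset.from_dict(new_dataset)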

# Way slower, but takes exactly `count` evenly spaced rows
def subset_slow_exact(dataset : Dataset, count : int) -> Dataset:
  divisor = len(dataset) / count
  new_dataset = {'conversations':[]}
  for k in range(count):
    # Index the row directly instead of materializing the whole column on every step
    new_dataset['conversations'].append(dataset[int(k * divisor)]['conversations'])
  return Dataset.from_dict(new_dataset)
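
Another way to take an evenly spaced sample is to compute the row indices first and let `Dataset.select` pick them, which avoids copying rows through Python lists. The helper below is a sketch; the name `evenly_spaced_indices` is hypothetical and it assumes `0 < count <= total`.

```python
# Hypothetical helper: `count` distinct, evenly spaced indices in range(total).
# Assumes 0 < count <= total.
def evenly_spaced_indices(total: int, count: int) -> list:
    # Integer arithmetic avoids float accumulation errors
    return [(i * total) // count for i in range(count)]


indices = evenly_spaced_indices(16006, 100)
print(len(indices), indices[:3], indices[-1])

# With a Hugging Face dataset this would be used as:
#   small = dataset.select(evenly_spaced_indices(len(dataset), 100))
```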

# ↓↓↓↓↓↓↓↓↓↓

In [None]:
# Get 100 rows from each dataset
slimorca = subset(slimorca, 100)
print(slimorca)

capybara = subset(capybara, 100)
print(capybara)

Dataset({
    features: ['conversations'],
    num_rows: 100
})
Dataset({
    features: ['conversations'],
    num_rows: 100
})


In [None]:
from datasets import concatenate_datasets

# Combine datasets together
final_dataset = concatenate_datasets([slimorca, capybara])

# Shuffle dataset (optional)
final_dataset = final_dataset.shuffle(seed=0)

print(final_dataset)

Dataset({
    features: ['conversations'],
    num_rows: 200
})


# Section 4: Select and apply prompt format
Turns the dataset from ShareGPT json into training-friendly raw text.
Here's an example of what one row might look like:

```{'text': '### System: You are a helpful assistant.

### Instruction: What is 1+1?

### Response: 2.'}```

You are free to use whatever prompt format you desire, but the most popular prompt formats (ChatML, Vicuna, Alpaca) are precoded for you.
For more info, see https://huggingface.co/docs/transformers/main/en/chat_templating.
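
To make the mapping from ShareGPT turns to text concrete, here is a plain-Python sketch of what the ChatML format produces for one conversation. It is written without Jinja purely for illustration; the function name `format_chatml` is hypothetical, and the notebook itself applies the format through a tokenizer chat template rather than through this function.

```python
# Illustrative re-implementation of the ChatML layout for ShareGPT turns.
# Not the notebook's mechanism, which uses a Jinja chat template.
ROLE_NAMES = {"human": "user", "gpt": "assistant"}


def format_chatml(conversation: list, add_generation_prompt: bool = False) -> str:
    text = ""
    for message in conversation:
        # Any role other than human/gpt is treated as the system turn
        role = ROLE_NAMES.get(message["from"], "system")
        text += "<|im_start|>" + role + "\n" + message["value"] + "<|im_end|>\n"
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is 1+1?"},
    {"from": "gpt", "value": "2."},
]
print(format_chatml(conversation))
```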

In [None]:
# Get the chatML format
def get_chatML():
  temp = \
    "{% for message in messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{'<|im_start|>user\n' + message['value'] + '<|im_end|>\n'}}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{'<|im_start|>assistant\n' + message['value'] + '<|im_end|>\n' }}"\
        "{% else %}"\
            "{{ '<|im_start|>system\n' + message['value'] + '<|im_end|>\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '<|im_start|>assistant\n' }}"\
    "{% endif %}"
  eos = "<|im_end|>"
  return (temp, eos, "<|im_start|>user\n", "<|im_start|>assistant\n")

In [None]:
# Get the mistral instruct format
def get_mistralInst():
  temp = \
    "{{ bos_token }}"\
    "{% if messages[0]['from'] == 'system' %}"\
        "{% if messages[1]['from'] == 'human' %}"\
            "{{ '[INST] ' + messages[0]['value'] + ' ' + messages[1]['value'] + ' [/INST]' }}"\
            "{% set loop_messages = messages[2:] %}"\
        "{% else %}"\
            "{{ '[INST] ' + messages[0]['value'] + ' [/INST]' }}"\
            "{% set loop_messages = messages[1:] %}"\
        "{% endif %}"\
    "{% else %}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{ '[INST] ' + message['value'] + ' [/INST]' }}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{ message['value'] + eos_token }}"\
        "{% else %}"\
            "{{ raise_exception('Only user and assistant roles are supported!') }}"\
        "{% endif %}"\
    "{% endfor %}"
  eos = "</s>"
  return (temp, eos, "[INST] ", "[/INST]")

In [None]:
# Get the llama chat format
def get_llamaInst():
  temp = \
    "{% if messages[0]['from'] == 'system' %}"\
        "{% if messages[1]['from'] == 'human' %}"\
            "{{ bos_token + '[INST] <<SYS>>\n' + messages[0]['value'] + '\n<</SYS>>\n\n' + messages[1]['value'] + ' [/INST]' }}"\
            "{% set loop_messages = messages[2:] %}"\
        "{% else %}"\
            "{{ bos_token + '[INST] ' + messages[0]['value'] + ' [/INST]' }}"\
            "{% set loop_messages = messages[1:] %}"\
        "{% endif %}"\
    "{% else %}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{ bos_token + '[INST] ' + message['value'].strip() + ' [/INST]' }}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{ ' ' + message['value'].strip() + ' ' + eos_token }}"\
        "{% else %}"\
            "{{ raise_exception('Only user and assistant roles are supported!') }}"\
        "{% endif %}"\
    "{% endfor %}"
  eos = "</s>"
  return (temp, eos, "[INST] ", "[/INST]")

In [None]:
# Get the phi 3 chat format
def get_phi3Inst():
  temp = \
    "{{ bos_token }}"\
    "{% for message in messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{'<|user|>\n' + message['value'] + '<|end|>\n'}}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{'<|assistant|>\n' + message['value'] + '<|end|>\n'}}"\
        "{% else %}"\
            "{{'<|' + message['from'] + '|>\n' + message['value'] + '<|end|>\n'}}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '<|assistant|>\n' }}"\
    "{% endif %}"
  eos = "<|end|>"
  return (temp, eos, "<|user|>\n", "<|assistant|>\n")

In [None]:
# Get the gemma chat format
def get_gemmaInst():
  temp = \
    "{{ bos_token }}"\
    "{% for message in messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{'<start_of_turn>user\n' + message['value'] | trim + '<end_of_turn>\n'}}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{'<start_of_turn>model\n' + message['value'] | trim + '<end_of_turn>\n' }}"\
        "{% else %}"\
            "{{ '<start_of_turn>system\n' + message['value'] | trim + '<end_of_turn>\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '<start_of_turn>model\n' }}"\
    "{% endif %}"
  eos = "<end_of_turn>"
  return (temp, eos, "<start_of_turn>user\n", "<start_of_turn>model\n")

In [None]:
# Get the llama 3 chat format
def get_llama3Inst():
  temp = \
    "{{ bos_token }}"\
    "{% for message in messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{ '<|start_header_id|>user<|end_header_id|>\n\n' + message['value'] | trim + '<|eot_id|>' }}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' + message['value'] | trim + '<|eot_id|>' }}"\
        "{% else %}"\
            "{{ '<|start_header_id|>' + message['from'] + '<|end_header_id|>\n\n' + message['value'] | trim + '<|eot_id|>' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"\
    "{% endif %}"
  eos = "<|eot_id|>"
  return (temp, eos, "<|start_header_id|>user<|end_header_id|>\n\n", "<|start_header_id|>assistant<|end_header_id|>\n\n")

In [None]:
# Get the vicuna format
def get_vicuna():
  temp = \
    "{{ bos_token }}"\
    "{% if messages[0]['from'] == 'system' %}"\
        "{{ messages[0]['value'] + ' ' }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ 'A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user\\'s questions.' + ' ' }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{ 'USER: ' + message['value'] + ' ' }}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{ 'ASSISTANT: ' + message['value'] + eos_token }}"\
        "{% else %}"\
            "{{ raise_exception('Only user and assistant roles are supported!') }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ 'ASSISTANT:' }}"\
    "{% endif %}"
  eos = "</s>"
  return (temp, eos, "USER:", "ASSISTANT:")

In [None]:
# Get the alpaca format
def get_alpaca():
  temp = \
    "{{ bos_token }}"\
    "{% if messages[0]['from'] == 'system' %}"\
        "{{ messages[0]['value'] + '\n\n' }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ 'Below are some instructions that describes some tasks. Write responses that appropriately completes each request.\n\n' }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['from'] == 'human' %}"\
            "{{ '### Instruction:\n' + message['value'] + '\n\n' }}"\
        "{% elif message['from'] == 'gpt' %}"\
            "{{ '### Response:\n' + message['value'] + eos_token + '\n\n' }}"\
        "{% else %}"\
            "{{ raise_exception('Only user and assistant roles are supported!') }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '### Response:\n' }}"\
    "{% endif %}"
  eos = "</s>"
  return (temp, eos, "### Instruction:\n", "### Response:\n")

In [None]:
from transformers import AutoTokenizer

# Apply the prompt format to a batch of shareGPT rows using a temporary tokenizer
def apply_prompt_format_to_shareGPT_row(row_batched : dict):
  new_text_batched = {'text' : []}
  for row in row_batched['conversations']:
    new_text_batched['text'].append(temporary_tokenizer.apply_chat_template(row, tokenize=False))
  return new_text_batched

# Apply the prompt format to every row of a shareGPT dataset by creating a temporary tokenizer that holds the prompt format and then using map
def apply_prompt_format(dataset : Dataset, prompt_format : str, stop_token : str):
  global temporary_tokenizer
  temporary_tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
  temporary_tokenizer.chat_template = prompt_format
  return dataset.map(apply_prompt_format_to_shareGPT_row, batched=True)

# ↓↓↓↓↓↓↓↓↓↓

In [None]:
prompt_format, stop_token, instruction_template, response_template = get_chatML()
final_dataset = apply_prompt_format(final_dataset, prompt_format, stop_token)

print(final_dataset)
print(len(final_dataset['text']))
print()
print(final_dataset['text'][0])

tokenizer_config.json:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/51.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/101 [00:00<?, ?B/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Dataset({
    features: ['conversations', 'text'],
    num_rows: 200
})
200

<|im_start|>user
Describe two techniques a teacher can use to promote effective communication skills among students.
<|im_end|>
<|im_start|>assistant
1) Group discussions: Teachers can facilitate group discussions or debates, assigning students to different roles or topics. This encourages students to listen actively, articulate their thoughts, and respond to their peers' ideas. 2) Role-playing scenarios: By engaging students in role-playing exercises that simulate real-life situations, teachers can foster empathetic listening and active speaking skills. This interactive approach helps students develop more effective verbal and nonverbal communication strategies.<|im_end|>
<|im_start|>user
Explain the process and benefits of implementing role-playing scenarios in a classroom setting.<|im_end|>
<|im_start|>assistant
Role-playing scenarios in a classroom setting typically follow these steps:

1. Identify the Obj

# Section 5: Load model
The rank is approximated so that the number of trainable parameters is roughly equivalent to the number of tokens in your dataset. This is based on the idea of scaling laws, and should be helpful for when you change between models and datasets. It is recommended you experiment with the RANK_MULTIPLIER, along with other strategies like dataset deduping.
The alpha is set to 2 * rank, a heuristic that has been suggested to generally give better (but not necessarily optimal) performance. This is a good starting point, though you might want to experiment yourself.

In general, the greater the rank and/or alpha, the more likely your model is to "overfit", meaning it only remembers your dataset and forgets everything else. On the other hand, a smaller rank/alpha is more likely to struggle to learn things during training and "forget" the dataset you trained on during inference.

However, ALL models suffer some sort of forgetting (of what they originally learned) no matter how good your dataset is or how small your rank is. The solution to this is simple: don't finetune a model more than it needs to be finetuned. Keep this in mind when setting the learning rate and epoch count in the next section, and when choosing between base and instruct/chat models to train on. Instruct/chat models have already been finetuned once; training them again leads to more forgetting on top of that.
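
The rank heuristic can be checked by hand. LoRA adds `rank * (in_features + out_features)` trainable parameters to each adapted weight matrix, so dividing the dataset token count by the sum of `in_features + out_features` over all adapted matrices gives the rank at which trainable parameters roughly equal tokens. The sketch below assumes the usual Mistral-7B projection shapes (hidden size 4096, key/value size 1024, MLP size 14336, 32 layers); these shapes are an assumption stated here, not read from the model. It uses the exact per-matrix count, whereas `load_model` approximates it from parameter sizes, so the two can differ. The token count is the one reported in this section's example run.

```python
# Assumed (in_features, out_features) of the LoRA target modules in one layer
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed layer count


def lora_params_per_rank(layer_shapes: dict, num_layers: int) -> int:
    # Each adapted matrix contributes (in + out) trainable parameters per unit of rank
    return num_layers * sum(i + o for i, o in layer_shapes.values())


def estimate_rank(dataset_tokens: int, params_per_rank: int,
                  multiplier: float = 1, low: int = 4, high: int = 64) -> int:
    # Rank at which trainable parameters ~= dataset tokens, clamped to [low, high]
    rank = int(dataset_tokens / params_per_rank * multiplier)
    return max(low, min(high, rank))


per_rank = lora_params_per_rank(LAYER_SHAPES, NUM_LAYERS)
rank = estimate_rank(153_949, per_rank)
print(per_rank)         # 2621440
print(rank)             # 4 (the raw estimate is 0, raised to the minimum)
print(rank * per_rank)  # 10485760
```

Under these assumed shapes, rank 4 gives 10,485,760 trainable parameters, which agrees with the trainable parameter count printed by the example run.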

In [None]:
from unsloth import FastLanguageModel
import gc, torch, time
from tqdm import tqdm

def free_mem():
  for _ in range(10):
    gc.collect()
    with torch.no_grad():
      torch.cuda.empty_cache()
    time.sleep(0.1)

def load_model(model, four_bit = True, max_seq_len = 4096, calculation_dataset = None, rank = -1, alpha = -1, token = None):
    RANK_MULTIPLIER = 1
    ALPHA_MULTIPLIER = 2

    # Clear up memory if needed
    free_mem()
    # Download the base model
    model, tokenizer = FastLanguageModel.from_pretrained(
      model_name = model,
      max_seq_length = max_seq_len,
      dtype = None,
      load_in_4bit = four_bit,
      token=token
  )
    # Find dataset to use for rank calculations
    if calculation_dataset is None:
        if 'final_dataset' in globals():
          calculation_dataset = final_dataset
    # Calculate the rank and alpha if not specified
    dataset_token_size = -1
    if rank == -1:
      if calculation_dataset is None:
        print("No final_dataset or calculation_dataset found to do rank detection; rank has been set to a default of 8")
        rank = 8
      else:
        print ("Auto calculating rank....")
        print("This might take a while (up to 10 minutes)")
        # Attempt to calculate a reasonable value for the rank where the number of trainable parameters is approximately the number of tokens in the dataset
        dataset_token_size = 0
        # Get dataset size
        for row in tqdm(calculation_dataset['text']):
            dataset_token_size += len(tokenizer.encode(row))
        # Get the number of params to be multiplied by rank
        predict_trainable_params_without_rank = 0
        for _, param in model.named_parameters():
          predict_trainable_params_without_rank += param.numel()**0.5*2
        # Calculate rank
        rank = int(dataset_token_size / predict_trainable_params_without_rank * RANK_MULTIPLIER)
        # NOTE: MY CALCULATION OF TRAINABLE PARAMS IS NOT EXACT AND LEADS TO RANKS ~30% BIGGER THAN THEY SHOULD BE. THIS MAY OR MAY NOT BE DESIRABLE.
        print("Rank has been approximated as", rank, "so that the number of trainable parameters is close to the number of tokens in the dataset.")
        if rank < 4:
          rank = 4
          print("Approximated rank is too small and has been raised to 4. Note: Beware of overfitting when training on tiny datasets!")
        if rank > 64:
          rank = 64
          print("Approximated rank is very big and has been capped at 64. Note: Consider deduping, or a full finetune without LoRA!")
    if alpha == -1:
      alpha = rank * ALPHA_MULTIPLIER
      print(f"Approximated alpha as {ALPHA_MULTIPLIER} * rank. This is by no means optimal, and is only a heuristic.")

    # Apply LoRA
    model = FastLanguageModel.get_peft_model(
    model,
    r = rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = alpha,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
  )
    # Show stats
    all_param = 0
    actual_trainable_params = 0
    for _, param in model.named_parameters():
      all_param += param.numel()
      if param.requires_grad:
        actual_trainable_params += param.numel()
    if dataset_token_size != -1:
      print (f"Dataset token size: ~{dataset_token_size}\nRank {rank}, alpha {alpha}\nTrainable params: {actual_trainable_params}/{all_param} ({actual_trainable_params/all_param*100}%)")
    else:
      print (f"Rank {rank}, alpha {alpha}\nTrainable params: {actual_trainable_params}/{all_param} ({actual_trainable_params/all_param*100}%)")
    return (model, tokenizer)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


# ↓↓↓↓↓↓↓↓↓↓

In [None]:
# Put any supported model here
# Check out https://huggingface.co/unsloth for supported models!

free_mem()
model, tokenizer = load_model("unsloth/mistral-7b-v0.2-bnb-4bit")

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.43.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

Auto calculating rank....
This might take a while (up to 10 minutes)


100%|██████████| 200/200 [00:00<00:00, 557.95it/s]


Rank has been approximated as 0 so that the number of trainable parameters is close to the number of tokens in the dataset.
Approximated rank is too small and has been raised to 4. Note: Beware of overfitting when training on tiny datasets!
Approximated alpha as 2 * rank. This is by no means optimal, and is only a heuristic.


Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Dataset token size: ~153949
Rank 4, alpha 8
Trainable params: 10485760/3762556928 (0.2786870790437114%)


# Section 6: Train!
See section 5 for tips.
Here are the main things worth testing:

learning_rate - An [optimal learning rate](https://www.bdhammel.com/assets/learning-rate/lr-types.png) leads to faster training and better performance.

*I have some tests on an older mistral model here if you want to see loss curves with varying lr: https://huggingface.co/collections/G-reen/orpo-v-dpo-v-sft-training-loss-curves-argilla-dpo-mix-7k-661da26f218dea517878f765*

num_train_epochs - Train for too long and the model overfits; train for too little and it does not learn the dataset. Recommended: 1 to 3 epochs.

lr_scheduler_type - A cosine schedule is potentially superior to a linear one.

per_device_train_batch_size - Increase this if you have spare VRAM to fit a larger batch size.

data_collator/packing - The completion-only data collator, which computes the loss on the responses only, is recommended. If you want to use packing instead, you'll have to turn the collator off.
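
To illustrate what a completion-only collator does, the sketch below masks every token that is not part of a response by setting its label to -100, the value the loss ignores. This is a simplified illustration of the idea on plain lists of token ids, not TRL's implementation; the function names and the toy token ids are made up for the example.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def find_all(sequence: list, pattern: list) -> list:
    # Start positions of every occurrence of `pattern` in `sequence`
    n, m = len(sequence), len(pattern)
    return [i for i in range(n - m + 1) if sequence[i:i + m] == pattern]


def mask_non_response(input_ids: list, instruction_ids: list, response_ids: list) -> list:
    # Start fully masked, then unmask only the tokens of each response
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # A response runs until the next instruction template or the end of the text
        later = [i for i in instruction_starts if i > start]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy ids: [10, 11] stands for the instruction template, [20, 21] for the response template
input_ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 10, 11, 9, 20, 21, 3, 4]
print(mask_non_response(input_ids, [10, 11], [20, 21]))
```

Only the response tokens keep their labels, so prompts and templates do not contribute to the loss. This also hints at why the collator conflicts with packing, where several examples are concatenated into one sequence.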

# ↓↓↓↓↓↓↓↓↓↓

In [None]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import TrainingArguments

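# Setup collator, see section 4
# The [2:] drops the leading tokens that Llama/Mistral-style tokenizers produce when a template is encoded on its own,
# so the ids match the template as it appears mid-text. Re-check this offset if you change the tokenizer or prompt format.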
instruction_template_ids = tokenizer.encode(instruction_template, add_special_tokens=False)[2:]
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template_ids, response_template=response_template_ids, tokenizer=tokenizer, mlm=False)

trainer = SFTTrainer(
    model = model,
    train_dataset = final_dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 4,
    data_collator=collator,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 2,
        warmup_ratio = 0.1,
        num_train_epochs=1,
        learning_rate = 5e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        #save_strategy = "epoch",
        #report_to= "wandb"
    ),
)

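# Intended to turn off shuffling since we already shuffled our dataset
# Note: get_train_dataloader() builds a new DataLoader, so setting an attribute on it here likely does not change the order used during training
trainer.get_train_dataloader().shuffle = False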

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


tokenizer_config.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

Map (num_proc=4):   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
# Show current memory stats
free_mem()
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.383 GB of memory reserved.


In [None]:
# Begin training
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 200 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 25
 "-____-"     Number of trainable parameters = 10,485,760

Step,Training Loss
1,0.7198
2,0.6458
3,0.771
4,0.9159
5,0.9402
6,0.6051
7,0.8398
8,0.8232
9,0.8605
10,0.8641

In [None]:
# If you want to continue training from a previous checkpoint saved on W&B (wandb), try this:
#import wandb
#api = wandb.Api()
#artifact = api.artifact("username/project/CHECKPOINT_PATH_HERE")
#checkpoint_dir = artifact.download()
#trainer.train(resume_from_checkpoint=checkpoint_dir)
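If the checkpoint is still on local disk instead, a small helper can pick the newest one. This is only a sketch: it assumes you enabled `save_strategy` in the training cell so that the Trainer wrote `checkpoint-<step>` folders under `output_dir` ("outputs" in this notebook), and the helper name `latest_checkpoint` is made up for illustration.

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None if there is none."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

checkpoint_dir = latest_checkpoint("outputs")
print(checkpoint_dir)
# If a checkpoint was found, training could be resumed with:
# trainer.train(resume_from_checkpoint=checkpoint_dir)
```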

# Section 7: Save and Make Quants!

It is HIGHLY recommended that you save your model to Hugging Face, as Colab does not keep your files once the session ends.

Your training run produced a LoRA adapter, which only works together with the base model it was trained on. You can either save the LoRA adapter on its own, or merge it into the base model and save the result.
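To make "merging" concrete, here is a toy sketch in plain Python of the standard LoRA formulation, where the merged weight is W + (alpha / r) * B @ A. The tiny matrices and the values of `rank` and `alpha` are made up for illustration; this is not how Unsloth implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

# Toy sizes: a 3x3 base weight W and a rank-1 adapter (B is 3x1, A is 1x3).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]
A = [[1.0, 2.0, 0.0]]
rank, alpha = 1, 2
scaling = alpha / rank

x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Adapter kept separate: the base model output plus the scaled low-rank update.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged: fold the update into the weight once, then use it like a normal model.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths give the same output, which is why a merged model no longer needs the adapter files, while an unmerged adapter is useless without the base weights.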

# ↓↓↓↓↓↓↓↓↓↓

In [None]:
save_path = "/content/model"
save_repo_id = "username/model"

# Uncomment at least one of these and run this code block!

# Merge the LoRA into the 16-bit version of the model. The LoRA disappears and a new model is created.
#model.save_pretrained_merged(save_path, tokenizer, save_method = "merged_16bit")
#model.push_to_hub_merged(save_repo_id, tokenizer, save_method = "merged_16bit", token = "HF_TOKEN_HERE")

# Merge the LoRA into the 4-bit version of the model. The LoRA disappears and a new model is created.
# WARNING: Merging into 4-bit will cause your model to lose accuracy if you plan to quantize it again later. It is suggested that you do this only as your final step.
#model.save_pretrained_merged(save_path, tokenizer, save_method = "merged_4bit_forced",)
#model.push_to_hub_merged(save_repo_id, tokenizer, save_method = "merged_4bit_forced", token = "HF_TOKEN_HERE")

# Save or upload just the LoRA adapters you trained.
#model.save_pretrained_merged(save_path, tokenizer, save_method = "lora",)
#model.push_to_hub_merged(save_repo_id, tokenizer, save_method = "lora", token = "HF_TOKEN_HERE")

## Save GGUF quants
GGUF is slightly slower than exl2, but slightly higher in quality below 6 bits per weight. It also lets you use your CPU for inference when the GPU doesn't have enough VRAM.
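When choosing a quant, a rough size estimate helps you check whether it will fit in VRAM. The sketch below uses the simple rule size = parameters * bits per weight / 8. The parameter count (about 7.24 billion, roughly a Mistral-7B-sized model) and the bits-per-weight figures are assumptions for illustration, and real files are somewhat larger because of metadata and tensors kept at higher precision.

```python
def quant_size_gb(num_params, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes), ignoring metadata and unquantized tensors."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 7.24e9  # assumption: roughly a Mistral-7B-sized model

# Bits per weight are approximate; Q8_0 stores a scale per block, so it is a bit above 8.
for name, bpw in [("16-bit", 16.0), ("Q8_0", 8.5), ("6.5 bpw", 6.5), ("4.5 bpw", 4.5)]:
    print(f"{name:>8}: ~{quant_size_gb(NUM_PARAMS, bpw):.1f} GB")
```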

# ↓↓↓↓↓↓↓↓↓↓

In [None]:
save_path = "/content/model_gguf"
save_repo_id = "username/model_gguf"

# Quantize to 8-bit Q8_0 and save
#model.save_pretrained_gguf(save_path, tokenizer,)
#model.push_to_hub_gguf(save_repo_id, tokenizer, token = "")

# Quantize to 4-bit Q4_K_M and save
#model.save_pretrained_gguf(save_path, tokenizer, quantization_method = "q4_k_m")
#model.push_to_hub_gguf(save_repo_id, tokenizer, quantization_method = "q4_k_m", token = "")

## Save Exl2 quants

**THIS ONLY WORKS IN KAGGLE AS IT REQUIRES A LOT OF RAM (MORE THAN WHAT COLAB GIVES). BEWARE OF YOUR SESSION CRASHING AND YOUR DATA DISAPPEARING. YOU HAVE BEEN WARNED!**

IT IS ALSO VERY SLOW (TAKES OVER AN HOUR TO CONVERT TO SAFETENSORS AND QUANT)




Exl2 quants are for use with the exllamav2 model loader (https://github.com/turboderp/exllamav2). For single requests, this quant format is the fastest. It is also lightweight (low VRAM usage) while still performing well.

Before trying to make an exl2 quant, make sure you have run one of the ``model.save_pretrained_merged`` functions above.


# ↓↓↓↓↓↓↓↓↓↓

In [None]:
# Set this to True if you want to make exl2 quants
run_exl2_quants = False
if run_exl2_quants:
  import torch
  import subprocess

  !pip install git-lfs
  !git lfs install
  !git clone https://github.com/turboderp/exllamav2/
  !pip install safetensors
  !pip install ninja
  !pip install sentencepiece
  !pip install fastparquet

  bpw = "6.5"                    # replace with desired bpw
  input_dir = "/content/model"   # Path to your model. If you didn't change save_path above, don't change this either
  output_dir = "/kaggle/working/quants" # Path your quant gets saved to. Remember to download it once it's done saving

  %mkdir /kaggle/working/quants

In [None]:
if run_exl2_quants:
  import os
  import re
  import torch
  from safetensors.torch import load_file, save_file

  free_mem()

  # Function to check file size
  def check_file_size(sf_filename: str, pt_filename: str):
      sf_size = os.stat(sf_filename).st_size
      pt_size = os.stat(pt_filename).st_size
      if (sf_size - pt_size) / pt_size > 0.01:
          raise RuntimeError(
              f"""The file size difference is more than 1%:
          - {sf_filename}: {sf_size}
          - {pt_filename}: {pt_size}
          """
          )

  # Function to convert individual file
  def convert_file(pt_filename: str, sf_filename: str):
      loaded = torch.load(pt_filename, map_location="cpu")
      if "state_dict" in loaded:
          loaded = loaded["state_dict"]
      loaded = {k: v.contiguous() for k, v in loaded.items()}
      os.makedirs(os.path.dirname(sf_filename), exist_ok=True)
      save_file(loaded, sf_filename, metadata={"format": "pt"})
      check_file_size(sf_filename, pt_filename)
      reloaded = load_file(sf_filename)
      for k in loaded:
          pt_tensor = loaded[k]
          sf_tensor = reloaded[k]
          if not torch.equal(pt_tensor, sf_tensor):
              raise RuntimeError(f"The output tensors do not match for key {k}")

  def convert_all_files_in_directory(directory: str):
      for filename in os.listdir(directory):
          pt_filename = os.path.join(directory, filename)
          sf_filename = None  # Initialize to None, will be set later if a match is found

          # For files matching "pytorch_model-(\d+)-of-(\d+)\.bin" (dot escaped so it only matches a literal ".")
          match = re.match(r"pytorch_model-(\d+)-of-(\d+)\.bin", filename)
          if match:
              part_num, total_parts = match.groups()
              sf_filename = os.path.join(directory, f"model-{part_num.zfill(5)}-of-{total_parts.zfill(5)}.safetensors")

          # For files matching "pytorch_model.bin"
          elif filename == "pytorch_model.bin":
              sf_filename = os.path.join(directory, "model.safetensors")

          # If a match was found, convert the file
          if sf_filename:
              convert_file(pt_filename, sf_filename)

  convert_all_files_in_directory(input_dir)

  # Function to delete the sharded pytorch_model-*.bin files in a directory
  def delete_all_bin_files_in_directory(directory: str):
      for filename in os.listdir(directory):
          match = re.match(r"pytorch_model-(\d+)-of-(\d+)\.bin", filename)
          if match:
              file_path = os.path.join(directory, filename)
              os.remove(file_path)
              print(f"Deleted {file_path}")

  # Run the deletion
  delete_all_bin_files_in_directory(input_dir)

In [None]:
if run_exl2_quants:
  free_mem()

  %cd exllamav2

  def run_command_and_stream_output(command):
      process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
      while True:
          output = process.stdout.readline()
          if output == '' and process.poll() is not None:
              break
          if output:
              print(output.strip())

  command = [
      "python",
      "convert.py",
      "-i", input_dir,
      "-o", output_dir,
      "-b", bpw
  ]

  run_command_and_stream_output(command)

In [None]:
if run_exl2_quants:
  # Upload quantized model to huggingface
  from huggingface_hub import HfApi

  api = HfApi()

  repo_id = "username/model"

  # Create the repo if it doesn't exist yet.
  # exist_ok=True avoids having to list every model on the Hub to check first.
  api.create_repo(repo_id,
                  exist_ok=True,
                  #token="HF_TOKEN_HERE"
                  )

  api.upload_folder(
      folder_path=output_dir,
      repo_id=repo_id,
      #token="HF_TOKEN_HERE"
  )

# Section 8 (Optional): Test your model!

Type into the input box to test your model.

# ↓↓↓↓↓↓↓↓↓↓

In [None]:
FastLanguageModel.for_inference(model)
tokenizer.chat_template = prompt_format
system = "You are a helpful assistant."
chat_history = [{"from":"system", "value":system}]
clear_memory_key = "CLEAR MEMORY"

while True:
  user_msg = input("> Say anything to the AI, or type " + str(clear_memory_key) + " to clear the memory:")
  if user_msg == clear_memory_key:
    chat_history = [{"from":"system", "value":system}]
    continue
  chat_history.append({"from":"human", "value":user_msg})
  raw_input = tokenizer.apply_chat_template(chat_history, tokenize=False) + response_template
  inputs = tokenizer([raw_input], return_tensors = "pt").to("cuda")
  outputs = model.generate(**inputs, max_new_tokens = 100, use_cache = True)
  # Decode only the newly generated tokens, so special tokens in the prompt can't shift the offset
  new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
  result = tokenizer.decode(new_tokens)
  if stop_token in result:
    result = result[0:result.index(stop_token)]
  chat_history.append({"from":"gpt", "value":result})
  print(result)