To run this notebook, press "*Runtime*" and then "*Run all*" on a **free** Tesla T4 Google Colab instance!

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for Llama.cpp).

### Installing libraries

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

### Loading the model

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Names of the models to load
models = [
    "lightblue/suzume-llama-3-8B-multilingual-orpo-borda-half",
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/Qwen2-7B-Instruct-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
] # Try more models at https://huggingface.co/models
n=2 # Index of the model to load
model_name = models[n].split('/')[1]
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = models[n], # Reminder we support ANY Hugging Face model!
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
lightblue/suzume-llama-3-8B-multilingual-orpo-borda-half does not have a padding token! Will use pad_token = <|reserved_special_token_250|>.
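
The `model_name` computed in the cell above is reused later as a prefix for the output files. A minimal sketch of what that expression evaluates to for each entry in `models` (plain Python, nothing is downloaded):

```python
# The same list of Hugging Face repo ids as in the cell above.
models = [
    "lightblue/suzume-llama-3-8B-multilingual-orpo-borda-half",
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/Qwen2-7B-Instruct-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
]

# Drop the "owner/" part of each repo id; this is what models[n].split('/')[1] does.
short_names = [repo_id.split("/")[1] for repo_id in models]

for n, name in enumerate(short_names):
    print(n, name)
```

Note that the log above mentions `lightblue/suzume-llama-3-8B-multilingual-orpo-borda-half`, which is index 0, so the recorded output appears to come from a run with `n = 0` rather than `n = 2`.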


### LoRA adapter parameters (we add LoRA adapters so that only 1 to 10% of all parameters need to be updated)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 1,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
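
As a rough check on the adapter size: LoRA attaches two low-rank matrices to each targeted projection, which adds `r * (d_in + d_out)` trainable parameters per projection. The sketch below assumes the standard Llama-3-8B shapes (hidden size 4096, key/value width 1024, MLP width 14336). These dimensions are not printed anywhere in this notebook, so treat them as an assumption about the loaded checkpoint.

```python
r = 16            # LoRA rank, as passed to get_peft_model
hidden = 4096     # assumed hidden size of Llama-3-8B
kv_dim = 1024     # assumed key/value projection width (8 KV heads x 128)
mlp = 14336       # assumed MLP intermediate size
num_layers = 32   # matches "patched 32 layers" in the log above

# (d_in, d_out) of every projection listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

def lora_params(d_in, d_out, rank):
    # LoRA matrix A has shape (rank, d_in) and matrix B has shape (d_out, rank).
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the total equals the `Number of trainable parameters = 41,943,040` reported in the training log further down, which is about 0.5% of an 8-billion-parameter model.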


## Preparing the training dataset

### Post classification

In [None]:
# import pandas as pd
# import numpy as np

# posts = pd.read_excel('/content/dataset_classification.xlsx')
# posts.drop(posts.index[posts['Класс поста']=='1,2,4'], inplace = True)
# posts.reset_index(drop=True, inplace=True)

# nb_classes = 5
# posts.dropna()
# posts['Текст'] = posts['Текст'].astype(str)

# posts = posts[posts['Текст'].notna()]
# posts = posts[['Текст','Класс поста']]
# posts["Instruction"] = """Ты эксперт по анализу постов групп VK в сфере доставки еды. Тебе будут дан текст поста группы.
#          Твоя задача - определить, Какому классу соответствует пост группы.
#          Ответь только '0' если пост ни к чему не обязывает (необязательный)
#          Ответь только '1' если пост обязывает группу к скидке
#          Ответь только '2' если пост обязывает группу сделать подарок
#          Ответь только '3' если пост обязывает группу дать cashback
#          Ответь только '4' если пост обязывает группу доставить товар в срок
#          Отвечай коротко, без пояснений."""
# def split_dataframe(dataframe, test_proportion):
#     total_size = len(dataframe)
#     test_size = int(total_size * test_proportion)
#     indices = np.arange(total_size)
#     np.random.shuffle(indices)
#     train_indices = indices[0:total_size-test_size]
#     test_indices = indices[total_size - test_size:]
#     return dataframe.iloc[train_indices], dataframe.iloc[test_indices]

# train, test = split_dataframe(posts, 0.3)

# from datasets import Dataset
# ds = Dataset.from_dict({"output": train['Класс поста'],"input": train['Текст'],'instruction':train['Instruction']})
# ds[0]

### Group classification

In [None]:
import pandas as pd
import numpy as np

posts = pd.read_excel('/content/group_class.xlsx')
# Replace NaN values with empty strings
posts = posts.fillna('')

# Convert all columns to string type
posts['Признак'] = posts['Признак'].astype(str)
posts['Название Описание'] = posts['Название Описание'].astype(str)
posts['Instruction'] = posts['Instruction'].astype(str)

def split_dataframe(dataframe, test_proportion):
    total_size = len(dataframe)
    test_size = int(total_size * test_proportion)
    indices = np.arange(total_size)
    np.random.shuffle(indices)
    train_indices = indices[0:total_size-test_size]
    test_indices = indices[total_size - test_size:]
    return dataframe.iloc[train_indices], dataframe.iloc[test_indices]

train, test = split_dataframe(posts, 0.3)

from datasets import Dataset
ds = Dataset.from_dict({"output": train['Признак'],"input": train['Название Описание'],'instruction':train['Instruction']})
ds[0]

{'output': '0',
 'input': 'Название группы: MORENGO | КЛУБНИКА В ШОКОЛАДЕ| ПЕРМЬ\nОписание группы: Мы создаем - Вы удивляете\nТолько СВЕЖАЯ ягода\nРаботаем на ПРЕМИУМ шоколаде\nСоздадим любой дизайн',
 'instruction': "Ты эксперт по анализу компаний и групп в сфере доставки еды.  Тебе будет дано Название и Описание группы, твоя задача -определить, соответствует ли данная группа критериям службы доставки еды. Ответь только '1' если это доставка еды, и '0' если нет. Отвечай коротко, без пояснений."}
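
`split_dataframe` shuffles the row positions and cuts off the last `test_proportion` share as the test set. A minimal pure-Python sketch of the same index arithmetic follows. The total of 3751 rows is inferred from the 2626 training examples and 1125 test rows reported further down, and the fixed `seed` is an addition for reproducibility (the notebook itself does not seed NumPy).

```python
import random

def split_indices(total_size, test_proportion, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    test_size = int(total_size * test_proportion)  # truncates, like the notebook
    indices = list(range(total_size))
    random.Random(seed).shuffle(indices)           # seed is a hypothetical addition
    cut = total_size - test_size
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(3751, 0.3)
print(len(train_idx), len(test_idx))  # 2626 1125
```

Every row lands in exactly one of the two parts, and rerunning with the same seed gives the same split.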

### Post generation

In [None]:
# import pandas as pd
# import numpy as np

# posts = pd.read_excel('/content/post_gen_train.xlsx')
# # Replace NaN values with empty strings
# posts = posts.fillna('')

# # Convert all columns to string type
# posts['Текст'] = posts['Текст'].astype(str)
# posts['Instruction'] = posts['Instruction'].astype(str)
# posts['sys_promt'] = posts['sys_promt'].astype(str)

# train = posts
# train.head(3)

# from datasets import Dataset
# # Create the Dataset
# ds = Dataset.from_dict({
#     "output": posts['Текст'],
#     "input": posts['Instruction'],
#     "instruction": posts['sys_promt']
# })

# ds[0]

### Converting the dataset to a suitable format

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = ds
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/2626 [00:00<?, ? examples/s]
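
To make the template concrete, here is a small sketch that formats a single record the same way `formatting_prompts_func` does. The record values and the `<EOS>` placeholder are made up for illustration; in the notebook the real end-of-sequence string comes from `tokenizer.eos_token`.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token

def format_example(instruction, input_text, output):
    # Fill the three slots of the template and terminate with the EOS marker.
    return alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN

text = format_example(
    "Answer only '1' if this is food delivery, and '0' if not.",  # hypothetical instruction
    "Group name: Example Pizza",                                   # hypothetical input
    "1",                                                           # hypothetical label
)
print(text)
```

At inference time the third slot is left empty, so the model continues the text after `### Response:`.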

## Training the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,

        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps = 5,
        max_steps = 60,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)


Map (num_proc=2):   0%|          | 0/2626 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [None]:
#@title Show available GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.594 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,626 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
5,2.5904
10,1.7338
15,1.3094
20,1.1629
25,1.1544
30,1.0948
35,0.9963
40,1.0924
45,1.0256
50,1.2117
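
The numbers in the training banner can be reproduced by hand. The sketch below uses only values shown in this notebook (2,626 examples, batch size 2, 4 accumulation steps, 60 steps) to estimate how much of the dataset 60 steps cover and roughly how many optimizer steps one full epoch would take.

```python
import math

num_examples = 2626
per_device_batch = 2
grad_accum = 4
max_steps = 60

effective_batch = per_device_batch * grad_accum               # examples per optimizer step on one GPU
examples_seen = effective_batch * max_steps                   # examples consumed in 60 steps
fraction = examples_seen / num_examples                       # share of one epoch
steps_per_epoch = math.ceil(num_examples / effective_batch)   # approximate steps for a full pass

print(effective_batch, examples_seen, steps_per_epoch)  # 8 480 329
print(f"{fraction:.1%} of one epoch")                   # 18.3% of one epoch
```

So the 60-step run sees less than a fifth of the training data. The step count the trainer reports for a full epoch may differ by one, depending on how the last partial batch is handled.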


In [None]:
#@title Show GPU memory usage and training stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

613.6526 seconds used for training.
10.23 minutes used for training.
Peak reserved memory = 8.668 GB.
Peak reserved memory for training = 3.074 GB.
Peak reserved memory % of max memory = 58.774 %.
Peak reserved memory for training % of max memory = 20.844 %.
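
The derived figures above follow directly from three measured values. A short sketch that recomputes them from the numbers printed in this run:

```python
max_memory = 14.748        # GB, total memory of the Tesla T4 as reported above
start_gpu_memory = 5.594   # GB reserved after loading the model, before training
used_memory = 8.668        # GB peak reserved after training

used_for_training = round(used_memory - start_gpu_memory, 3)
used_pct = round(used_memory / max_memory * 100, 3)
training_pct = round(used_for_training / max_memory * 100, 3)

print(used_for_training, used_pct, training_pct)  # 3.074 58.774 20.844
```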



## Running the model

### Post generation

In [None]:
# import re
# post_result=''
# for Instruction in posts["Instruction"].unique():
#     # alpaca_prompt = Copied from above
#     FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
#     inputs = tokenizer(
#     [
#         alpaca_prompt.format(
#             posts["sys_promt"][0], # instruction
#             Instruction, # input
#             "", # output - leave this blank for generation!
#         )
#     ], return_tensors = "pt").to("cuda")

#     outputs = model.generate(**inputs, max_new_tokens = 1000, use_cache = True)
#     message_answer = tokenizer.batch_decode(outputs)[0].rsplit('Response:', 1)[-1]
#     match = re.search(r'[012345]', str(Instruction))
#     post_result+=str(int(match.group()))+' - '+str(message_answer)+'\n'
# print(post_result)

In [None]:

# post_result=''
# for Instruction in posts["Instruction"].unique():
#     # alpaca_prompt = Copied from above
#     FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
#     inputs = tokenizer(
#     [
#         alpaca_prompt.format(
#             "Отвечай на английском\n "+str(posts["sys_promt"][0]), # instruction
#             "Отвечай на английском\n "+Instruction, # input
#             "", # output - leave this blank for generation!
#         )
#     ], return_tensors = "pt").to("cuda")

#     outputs = model.generate(**inputs, max_new_tokens = 1000, use_cache = True)
#     message_answer = tokenizer.batch_decode(outputs)[0].rsplit('Response:', 1)[-1]
#     match = re.search(r'[012345]', str(Instruction))
#     post_result+=str(int(match.group()))+' - '+str(message_answer)+'\n'
# print(post_result)

In [None]:

# my_file = open(f"{model_name}_post_gen.txt", "w+")
# my_file.write(post_result)
# my_file.close()

### Post classification

In [None]:
# import re

# post_result = []
# for text, class_post, Instruction in test.itertuples(index=False, name=None):
#     # alpaca_prompt = Copied from above
#     FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
#     inputs = tokenizer(
#     [
#         alpaca_prompt.format(
#             Instruction, # instruction
#             text, # input
#             "", # output - leave this blank for generation!
#         )
#     ], return_tensors = "pt").to("cuda")

#     outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
#     message_answer =tokenizer.batch_decode(outputs)[0].rsplit('Response:', 1)[-1]
#     match = re.search(r'[012345]', message_answer)
#     if match:
#         post_result.append(int(match.group()))
#     else:
#         post_result.append(-1)
# post_result


In [None]:
# test["Ответ"] = post_result
# df=test
# # model_name="gte-Qwen2-7B-instruct"
# df.to_excel(model_name+'_output.xlsx', index=False)
# df.to_csv(model_name+"_output.csv", index=False)

# df_difference = df.loc[df['Класс поста'] != df["Ответ"]]
# #print(df_difference)
# test.info()
# df_difference.info()
# df_difference.to_excel(model_name+'_difference.xlsx', index=False)
# df_difference.to_csv(model_name+'_difference.csv')

### Group classification

In [None]:
import re

post_result = []
FastLanguageModel.for_inference(model) # Enable Unsloth's 2x faster inference once, before the loop
for class_post, text, Instruction in test.itertuples(index=False, name=None):
    # alpaca_prompt is the template defined above
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            Instruction, # instruction
            text, # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
    message_answer =tokenizer.batch_decode(outputs)[0].rsplit('Response:', 1)[-1]
    match = re.search(r'[01]', message_answer)
    if match:
        post_result.append(int(match.group()))
    else:
        post_result.append(-1)
#post_result
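
The label is recovered from the decoded text in two steps: keep everything after the last `Response:` marker, then take the first `0` or `1` found there. The sketch below isolates that logic in a hypothetical helper `parse_label` (not part of the notebook) and runs it on made-up decoded strings, so it can be tried without a GPU.

```python
import re

def parse_label(decoded, allowed="01"):
    """Return the first allowed digit after the last 'Response:' marker, or -1."""
    answer = decoded.rsplit("Response:", 1)[-1]
    match = re.search(f"[{allowed}]", answer)
    return int(match.group()) if match else -1

# Made-up examples of decoded model output.
samples = [
    "### Instruction:\n...\n### Response:\n1<|end_of_text|>",
    "### Instruction:\n...\n### Response:\n0",
    "### Instruction:\n...\n### Response:\nnot sure",
]
print([parse_label(s) for s in samples])  # [1, 0, -1]
```

Because the search accepts a digit anywhere in the answer, a special token that happens to contain `0` or `1` would also match. Checking only the first non-space character of the answer is a stricter alternative.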

In [None]:
test["Ответ"] = post_result
df=test

# model_name="gte-Qwen2-7B-instruct"
df.to_excel(model_name+'_output.xlsx', index=False)
df.to_csv(model_name+"_output.csv", index=False)

df['Признак'] = df['Признак'].astype(int)
df_difference = df.loc[df['Признак'] != df["Ответ"]]
#print(df_difference)
test.info()
df_difference.info()
df_difference.to_excel(model_name+'_difference.xlsx', index=False)
df_difference.to_csv(model_name+'_difference.csv')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test["Ответ"] = post_result


<class 'pandas.core.frame.DataFrame'>
Index: 1125 entries, 2591 to 1061
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Признак            1125 non-null   int64 
 1   Название Описание  1125 non-null   object
 2   Instruction        1125 non-null   object
 3   Ответ              1125 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 43.9+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 41 entries, 1976 to 3353
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Признак            41 non-null     int64 
 1   Название Описание  41 non-null     object
 2   Instruction        41 non-null     object
 3   Ответ              41 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.6+ KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Признак'] = df['Признак'].astype(int)
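
The two `info()` summaries are enough to estimate accuracy on the held-out set: `test` has 1125 rows, of which 41 ended up in `df_difference`. A minimal sketch of that calculation (rows where no digit was parsed get `-1` and therefore also count as errors):

```python
total_rows = 1125   # rows in `test`, from the first info() summary
mismatches = 41     # rows in `df_difference`, from the second info() summary

accuracy = (total_rows - mismatches) / total_rows
print(f"accuracy = {accuracy:.2%}")  # accuracy = 96.36%
```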


### Other inference options

In [None]:
# # alpaca_prompt = Copied from above
# FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the Fibonacci sequence.", # instruction
#         "1, 1, 2, 3, 5, 8", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# tokenizer.batch_decode(outputs)

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# # alpaca_prompt = Copied from above
# FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the Fibonacci sequence.", # instruction
#         "1, 1, 2, 3, 5, 8", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer)
# _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

## Saving and loading the trained model


### Saving the model as LoRA adapters
Use Hugging Face's `push_to_hub` to save online.

Use `save_pretrained` to save locally.


In [None]:
# model.save_pretrained("lora_model") # Local saving
# tokenizer.save_pretrained("lora_model")
# # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# # tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

To load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
# if False:
#     from unsloth import FastLanguageModel
#     model, tokenizer = FastLanguageModel.from_pretrained(
#         model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
#         max_seq_length = max_seq_length,
#         dtype = dtype,
#         load_in_4bit = load_in_4bit,
#     )
#     FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!

# # alpaca_prompt = You MUST copy from above!
# FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "What is a famous tall tower in Paris?", # instruction
#         "", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer)
# _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since downloading `4bit` models is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
# if False:
#     # I highly do NOT suggest - use Unsloth if possible
#     from peft import AutoPeftModelForCausalLM
#     from transformers import AutoTokenizer
#     model = AutoPeftModelForCausalLM.from_pretrained(
#         "lora_model", # YOUR MODEL YOU USED FOR TRAINING
#         load_in_4bit = load_in_4bit,
#     )
#     tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens to get your personal tokens.

In [None]:
# # Merge to 16bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# # Merge to 4bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# # Just LoRA adapters
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### Saving in GGUF format

Some supported quantization methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# # Save to 8bit Q8_0
# if False: model.save_pretrained_gguf("model", tokenizer,)
# if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or in a UI-based system such as `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

Useful links from Unsloth:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Llama-3 8b, 70b **2x faster**! See our [Llama-3 8b notebook](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
</div>