[Source](https://colab.research.google.com/drive/1BiQiw31DT7-cDp1-0ySXvvhzqomTdI-o?usp=sharing&pli=1&authuser=5#scrollTo=_kbS7nRxcMt7)

In [1]:
! pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
! pip install -q datasets bitsandbytes einops wandb


In [1]:
from datasets import load_dataset

dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name, split="train")
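
Each row of this dataset holds one conversation in a single `text` column, which is the field the trainer is pointed at later. As a rough sketch, assuming the turns are delimited by `### Human:` and `### Assistant:` markers (the sample string and the `split_turns` helper below are made up for illustration, not taken from the dataset), a conversation can be split into turns like this:

```python
import re

# Hypothetical sample in the assumed "### Human: / ### Assistant:" layout.
sample_text = (
    "### Human: What is LoRA?"
    "### Assistant: A low-rank adapter method for fine-tuning."
    "### Human: Is it memory efficient?"
    "### Assistant: Yes, only the adapter weights are trained."
)

def split_turns(text):
    """Return a list of (speaker, message) pairs from one conversation string."""
    pattern = re.compile(r"### (Human|Assistant):")
    # With one capture group, split() yields: [prefix, speaker, message, speaker, message, ...]
    parts = pattern.split(text)
    speakers = parts[1::2]
    messages = [m.strip() for m in parts[2::2]]
    return list(zip(speakers, messages))

turns = split_turns(sample_text)
for speaker, message in turns:
    print(f"{speaker}: {message}")
```

Looking at a few real rows with `dataset[0]["text"]` is the quickest way to confirm whether the assumed markers match the data.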



In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

#model_name = "ybelkada/falcon-7b-sharded-bf16"
model_name = "microsoft/phi-2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    #device_map="cuda",
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False


In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
model

PhiForCausalLM(
  (transformer): PhiModel(
    (embd): Embedding(
      (wte): Embedding(51200, 2560)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (h): ModuleList(
      (0-31): 32 x ParallelBlock(
        (ln): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
        (mixer): MHA(
          (rotary_emb): RotaryEmbedding()
          (Wqkv): Linear4bit(in_features=2560, out_features=7680, bias=True)
          (out_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (inner_attn): SelfAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
          (inner_cross_attn): CrossAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
        )
        (mlp): MLP(
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
          (act): NewGELUActivation()
        )
      )
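
The layer shapes printed above are enough for a back-of-the-envelope estimate of how much memory the quantized linear layers need. The sketch below uses only the shapes visible in the (truncated) printout and assumes exactly 4 bits per weight, ignoring biases, the embedding, any layers not shown, and the extra quantization constants that bitsandbytes stores. Treat the result as a rough lower bound, not a measurement.

```python
# Shapes copied from the printed module tree above.
hidden = 2560
n_layers = 32
vocab = 51200

linear_shapes = {
    "Wqkv": (2560, 7680),
    "out_proj": (2560, 2560),
    "fc1": (2560, 10240),
    "fc2": (10240, 2560),
}

# Weight matrices only; biases are small and left out of this estimate.
weights_per_layer = sum(i * o for i, o in linear_shapes.values())
quantized_weights = weights_per_layer * n_layers
embedding_weights = vocab * hidden

# Assumption: 0.5 byte per 4-bit weight, 2 bytes per 16-bit weight.
approx_gib_4bit = quantized_weights * 0.5 / 2**30
approx_gib_fp16 = quantized_weights * 2 / 2**30

print(f"Linear4bit weights: {quantized_weights:,}")
print(f"Embedding weights:  {embedding_weights:,}")
print(f"~{approx_gib_4bit:.2f} GiB at 4 bits vs ~{approx_gib_fp16:.2f} GiB at 16 bits")
```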

In [5]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "Wqkv",
    ]
)
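
To get a feel for what this configuration trains, the adapter size can be worked out by hand. The sketch below assumes the standard LoRA layout (a rank-`r` pair of matrices next to each targeted layer, scaled by `lora_alpha / r`) and takes the `Wqkv` shape and layer count from the model printout above.

```python
lora_r = 64
lora_alpha = 16
in_features, out_features = 2560, 7680  # Wqkv shape from the printed model
n_layers = 32

# LoRA adds A (r x in_features) and B (out_features x r) beside each targeted layer.
params_per_layer = lora_r * in_features + out_features * lora_r
total_lora_params = params_per_layer * n_layers

# The low-rank update is multiplied by alpha / r before being added to the output.
scaling = lora_alpha / lora_r

print(f"Trainable LoRA parameters: {total_lora_params:,}")
print(f"LoRA scaling factor: {scaling}")
```

With `lora_alpha` smaller than `r`, the scaling factor is below 1, so the adapter's contribution is damped relative to the common `alpha = 2 * r` choice.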

In [6]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-5
max_grad_norm = 0.3
max_steps = 1000
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    #gradient_checkpointing=True,
)
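
The numbers above determine how many examples each optimizer step sees. The following is a small sketch of that arithmetic, assuming a single GPU (`num_devices` is an illustrative name, not a `TrainingArguments` field). Note also that in `transformers` the plain `"constant"` schedule does not appear to apply any warmup; `"constant_with_warmup"` is the variant that uses `warmup_ratio`.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 1000
warmup_ratio = 0.03
num_devices = 1  # assumption: single GPU

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total examples consumed over the whole run.
samples_seen = effective_batch * max_steps

# Warmup length that a warmup-aware schedule would use.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"Effective batch size: {effective_batch}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
print(f"Warmup steps (if the scheduler supports warmup): {warmup_steps}")
```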

In [7]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [8]:
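# Keep normalization layers in float32 for numerical stability.
# This model names its LayerNorm modules "ln", so match on type as well as on name.
for name, module in trainer.model.named_modules():
    if "norm" in name or isinstance(module, torch.nn.LayerNorm):
        module.to(torch.float32)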

In [9]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33msijpapi[0m. Use [1m`wandb login --relogin`[0m to force relogin


You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.6459
20,1.6104
30,1.7463
40,1.9997
50,2.6636
60,1.6339
70,1.5789
80,1.7218
90,1.9388
100,2.504


TrainOutput(global_step=1000, training_loss=1.750254934310913, metrics={'train_runtime': 3146.0167, 'train_samples_per_second': 5.086, 'train_steps_per_second': 0.318, 'total_flos': 9.281458997213184e+16, 'train_loss': 1.750254934310913, 'epoch': 1.62})
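
The reported metrics can be cross-checked against the training configuration. This sketch assumes a single GPU, so that the effective batch size is 4 x 4 = 16, and uses only the numbers printed in `TrainOutput` above. The dataset size it derives is an estimate from the rounded epoch count, not a value read from the dataset.

```python
train_runtime = 3146.0167        # seconds
samples_per_second = 5.086
global_step = 1000
epochs = 1.62
effective_batch = 4 * 4          # assumption: single GPU

# Two independent ways of counting how many samples were processed.
samples_from_rate = train_runtime * samples_per_second
samples_from_steps = global_step * effective_batch

# Rough training-set size implied by the epoch count.
approx_dataset_rows = samples_from_steps / epochs

print(f"Samples from throughput: {samples_from_rate:.0f}")
print(f"Samples from step count: {samples_from_steps}")
print(f"Implied dataset size:    ~{approx_dataset_rows:.0f} rows")
print(f"Wall-clock time:         {train_runtime / 60:.1f} minutes")
```

Both counts agree at roughly 16,000 samples, which is consistent with the batch settings used for the run.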

In [10]:
# Only `pipeline` is new here; the other transformers classes were imported in earlier cells.
from transformers import pipeline

In [11]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "What is a small language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


<s>[INST] What is a small language model? [/INST]</s>

A small language model is a machine learning model that is designed to perform a specific task, such as generating text, answering questions, or translating between languages. Small language models are typically designed to be lightweight and efficient, making them suitable for use in resource-constrained environments, such as mobile devices or embedded systems.

Small language models typically have a small vocabulary size, a limited number of layers, and a small number of parameters. This makes them easy to train and deploy, and they can be used in a variety of applications, such as speech recognition, natural language processing, and machine translation.

Small language models are often used in applications that require a high level of accuracy, such as speech recognition and natural language processing. They are also used in applications that require a large amount of data, such as machine translation.

Small language models are

In [11]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "What is model regularization?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f'''[INST] {prompt} [/INST]''')
print(result[0]['generated_text'])



[INST] What is model regularization? [/INST]

Model regularization is a technique used to prevent overfitting in machine learning models. Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new, unseen data. Regularization helps to reduce the complexity of the model by adding a penalty term to the loss function, discouraging the model from fitting the training data too closely.

[INST] How does model regularization work? [/INST]

Model regularization works by adding a penalty term to the loss function, which is a measure of how well the model fits the training data. The penalty term is typically a function of the model's parameters, such as the weights or biases. By adding this penalty term, the model is encouraged to find a balance between fitting the training data well and keeping the model simple.

[INST] What are some common types of regularization techniques? [/INST]

There
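
In the output above the model keeps going after its first answer and invents further `[INST]` questions until `max_length` is reached. One simple way to keep only the first answer is to post-process the generated string. The helper below is a hypothetical sketch (`first_answer` and `stop_marker` are illustrative names, and the sample text is a shortened, hand-written stand-in for the real output):

```python
def first_answer(generated_text, prompt_text, stop_marker="[INST]"):
    """Strip the echoed prompt and cut the continuation at the next stop marker."""
    continuation = generated_text
    if continuation.startswith(prompt_text):
        continuation = continuation[len(prompt_text):]
    cut = continuation.find(stop_marker)
    if cut != -1:
        continuation = continuation[:cut]
    return continuation.strip()

# Shortened stand-in for the generated text shown above.
prompt_text = "[INST] What is model regularization? [/INST]"
generated = (
    prompt_text
    + "\n\nModel regularization is a technique used to prevent overfitting."
    + "\n\n[INST] How does model regularization work? [/INST]\n\nIt adds a penalty term."
)

answer = first_answer(generated, prompt_text)
print(answer)
```

String truncation only hides the extra text; the tokens are still generated, so it does not shorten generation time.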


In [None]:
pipe.model

In [17]:
# Move the input ids to the same device as the model before generating.
inputs = tokenizer("What is a large language model?", return_tensors="pt", return_attention_mask=False).to(pipe.model.device)

outputs = pipe.model.generate(**inputs, max_length=200)
text = pipe.tokenizer.batch_decode(outputs)[0]
print(text)

What is a large language model?

A large language model is a type of artificial intelligence model that is designed to generate human-like text. It is typically trained on a large corpus of text data, such as books, articles, or social media posts, and is able to generate text that is similar in style and content to the data it was trained on.

Large language models are used in a variety of applications, including chatbots, text generation, and language translation. They are also used in natural language processing tasks, such as sentiment analysis and text classification.

What are the benefits of using a large language model?

There are several benefits to using a large language model:

1. Improved accuracy: Large language models are trained on large amounts of data, which allows them to learn more complex patterns and relationships between words and phrases. This can lead to improved accuracy in tasks such as text generation and language translation.

2. Increased efficiency: Large 

In [18]:
text

'What is a large language model?\n\nA large language model is a type of artificial intelligence model that is designed to generate human-like text. It is typically trained on a large corpus of text data, such as books, articles, or social media posts, and is able to generate text that is similar in style and content to the data it was trained on.\n\nLarge language models are used in a variety of applications, including chatbots, text generation, and language translation. They are also used in natural language processing tasks, such as sentiment analysis and text classification.\n\nWhat are the benefits of using a large language model?\n\nThere are several benefits to using a large language model:\n\n1. Improved accuracy: Large language models are trained on large amounts of data, which allows them to learn more complex patterns and relationships between words and phrases. This can lead to improved accuracy in tasks such as text generation and language translation.\n\n2. Increased effic