## Essays With Instructions: Fine-Tune OPT (2.7B) on Google Colab

Dataset Source: https://huggingface.co/datasets/ChristophSchuhmann/essays-with-instructions/viewer/ChristophSchuhmann--essays-with-instructions/train?row=0

#### Install Necessary Libraries

In [1]:
%pip install peft bitsandbytes transformers loralib
%pip install accelerate -U datasets

Collecting peft
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)
Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting loralib
  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)
Collecting accelerate (from peft)
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting safetensors (from peft)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x

#### Enter HuggingFace Access Token

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


#### Import Necessary Libraries

In [3]:
import os, sys, math
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import datasets
from datasets import load_dataset, DatasetDict, Dataset

import torch
import torch.nn as nn

import bitsandbytes as bnb

import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    AutoConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

import peft
from peft import LoraConfig, get_peft_model, PeftModel, PeftConfig

#### Display Library Versions

In [4]:
library_len = 14
version_len = 12

print(f"+{'-' * (library_len + version_len + 5)}+")
print("|", "Library".rjust(library_len), "|", "Version".ljust(version_len), "|")
print(f"|{'*' * (library_len + version_len + 5)}|")
print("|", "Python".rjust(library_len), "|", sys.version.split()[0].ljust(version_len), "|")
print("|", "Torch".rjust(library_len), "|", torch.__version__.ljust(version_len), "|")
print("|", "Datasets".rjust(library_len), "|", datasets.__version__.ljust(version_len), "|")
print("|", "Transformer".rjust(library_len), "|", transformers.__version__.ljust(version_len), "|")
print("|", "PEFT".rjust(library_len), "|", peft.__version__.ljust(version_len), "|")
print(f"+{'-' * (library_len + version_len + 5)}+")

+-------------------------------+
|        Library | Version      |
|*******************************|
|         Python | 3.10.1       |
|          Torch | 2.0.1+cu118  |
|       Datasets | 2.14.4       |
|    Transformer | 4.31.0       |
|           PEFT | 0.4.0        |
+-------------------------------+


#### Ingest Dataset

In [5]:
from datasets import load_dataset, DatasetDict

data = load_dataset("ChristophSchuhmann/essays-with-instructions")

data

Downloading readme:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instructions', 'titles', 'essays', 'urls', '__index_level_0__'],
        num_rows: 2064
    })
})

#### Split Dataset into Training/Testing Datasets & Save Datasets

In [6]:
train_testEval = data['train'].train_test_split(train_size=0.80)
test_eval = train_testEval['test'].train_test_split(train_size=0.50)

ds = DatasetDict({
    'train' : train_testEval['train'],
    'test' : test_eval['train'],
    'eval' : test_eval['test'],
})

print("Training Dataset Shape:", ds['train'].shape)
print("Testing Dataset Shape:", ds['test'].shape)
print("Evaluation Dataset Shape:", ds['eval'].shape)

Training Dataset Shape: (1651, 5)
Testing Dataset Shape: (206, 5)
Evaluation Dataset Shape: (207, 5)
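
The shapes above can be checked by hand. The sketch below assumes that each split takes `floor(train_size * n)` rows for its first part and gives the remainder to the second part; that rule is an assumption about the library's behavior, but it reproduces the printed sizes. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, train_fraction):
    """Return (n_first, n_second) under the assumed floor-then-remainder rule."""
    n_first = math.floor(train_fraction * n_rows)
    return n_first, n_rows - n_first

# First split: 80% train, 20% held out.
n_train, n_holdout = split_sizes(2064, 0.80)

# Second split: the held-out rows are halved into test and eval.
n_test, n_eval = split_sizes(n_holdout, 0.50)

print(n_train, n_test, n_eval)  # prints 1651 206 207
```

Because the split above is called without a fixed seed, the row assignment will differ between runs; `train_test_split` also accepts a `seed` argument if a repeatable split is wanted.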


#### Basic Values/Constants

In [7]:
MODEL_CKPT = "facebook/opt-2.7b"
MODEL_NAME = "opt-2.7b-Fine_Tuned-Essays_with_Instructions"

# LoRA values & constants (the base model is loaded in 8-bit)
lora_r = 16
lora_alpha = 32

lora_dropout = 0.05
lora_target_modules = ["q_proj", "v_proj"]

# Trainer API values & constants
load_in_8bit = True
BATCH_SIZE = 8

GRAD_ACC_STEPS = 4
LR = 2e-4

MAX_STEPS = 150
WARMUP_STEPS = 75

logging_steps = 1
output_dir = 'outputs'

REPORTS_TO = "tensorboard"

# Device map
device_map = "auto"
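
To make the training budget implied by these constants concrete, here is a small arithmetic sketch. It assumes a single GPU (as set by `CUDA_VISIBLE_DEVICES` above) and uses the training split size of 1651 rows printed earlier; the derived variable names are illustrative only.

```python
BATCH_SIZE = 8
GRAD_ACC_STEPS = 4
MAX_STEPS = 150
WARMUP_STEPS = 75
N_TRAIN_ROWS = 1651  # size of the training split printed earlier

# Sequences contributing to each optimizer step (single-GPU assumption).
effective_batch = BATCH_SIZE * GRAD_ACC_STEPS

# Total sequences processed over the whole run, and roughly how many
# passes over the training split that corresponds to.
sequences_seen = effective_batch * MAX_STEPS
approx_epochs = sequences_seen / N_TRAIN_ROWS

# Share of the optimizer steps spent in learning-rate warmup.
warmup_share = WARMUP_STEPS / MAX_STEPS

print(f"effective batch size: {effective_batch}")
print(f"sequences seen:       {sequences_seen}")
print(f"approximate epochs:   {approx_epochs:.2f}")
print(f"warmup share:         {warmup_share:.0%}")
```

Under these assumptions the run makes a little under three passes over the training data, and half of the optimizer steps fall inside the warmup phase.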

#### Define Model

In [8]:
model = AutoModelForCausalLM.from_pretrained(MODEL_CKPT,
                                             load_in_8bit=load_in_8bit,
                                             device_map=device_map
                                             )

Downloading (…)lve/main/config.json:   0%|          | 0.00/691 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/5.30G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

#### Define Tokenizer

In [9]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CKPT)

Downloading (…)okenizer_config.json:   0%|          | 0.00/685 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

#### Tokenize Entire Dataset

In [10]:
ds = ds.map(lambda samples: tokenizer(samples['instructions']),
            batched=True)

Map:   0%|          | 0/1651 [00:00<?, ? examples/s]

Map:   0%|          | 0/206 [00:00<?, ? examples/s]

Map:   0%|          | 0/207 [00:00<?, ? examples/s]

#### Post Processing on 8-Bit Model

In [11]:
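# Freeze the base model; only the LoRA adapters added later will be trained.
for param in model.parameters():
    param.requires_grad = False
    if param.ndim == 1:
        # Keep 1-D parameters (e.g. layer norms, biases) in fp32 for stability.
        param.data = param.data.to(torch.float32)

# Trade compute for memory, and make the inputs require grads so that
# gradients can flow through the frozen, checkpointed layers.
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    """Cast the LM head's logits to fp32 so the loss is computed in full precision."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)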

#### Define Function To Calculate & Display Percent Trainable Parameters

In [12]:
def print_trainable_parameters(model):
    """
    Print number of trainable parameters &
    what percent of total parameters that
    trainable parameters are.
    """
    trainable_params = 0
    total_params = 0

    for _, param in model.named_parameters():
        total_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable parameters: {trainable_params}, " +
          f"total parameters: {total_params}, " +
          f"% trainable parameters: {100 * (trainable_params / total_params)}")

#### Define LoRA Configuration

In [13]:
lora_configuration = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)

#### Apply LoRA

In [14]:
model = get_peft_model(model, lora_configuration)

print_trainable_parameters(model)

trainable parameters: 5242880, total parameters: 2656839680, % trainable parameters: 0.19733520390662038
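
The trainable-parameter count reported above can be reproduced with simple arithmetic. The sketch below assumes that OPT-2.7B has a hidden size of 2560 and 32 decoder layers, and that LoRA is attached only to `q_proj` and `v_proj` in every layer with no trainable biases (`bias="none"`). Each adapted projection then gains two low-rank matrices, A of shape (r, d_in) and B of shape (d_out, r). The function name is hypothetical.

```python
HIDDEN_SIZE = 2560       # assumed hidden size of OPT-2.7B
NUM_LAYERS = 32          # assumed number of decoder layers
LORA_R = 16              # rank used in this notebook
TARGETS_PER_LAYER = 2    # q_proj and v_proj

def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
trainable = per_module * TARGETS_PER_LAYER * NUM_LAYERS

total = 2656839680       # total parameter count reported by the notebook
percent_trainable = 100 * (trainable / total)

print(trainable)                    # prints 5242880
print(round(percent_trainable, 4))  # prints 0.1973
```

The result matches the notebook's figure, which suggests that the adapters on the two attention projections account for all of the trainable weights.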


#### Data Collator

In [15]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=False)
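
With `mlm=False`, the collator prepares batches for causal language modeling: sequences are padded to a common length and the labels are a copy of the input IDs, with padded positions set to -100 so the loss ignores them. The following is a simplified pure-Python sketch of that idea, not the library's implementation. It assumes right-side padding and a pad token ID of 1 for the OPT tokenizer, and the function name is hypothetical.

```python
PAD_TOKEN_ID = 1      # assumed pad id for the OPT tokenizer
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips

def collate_for_causal_lm(batch, pad_id=PAD_TOKEN_ID):
    """Pad token-id lists to equal length and build causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        labels.append(list(ids) + [IGNORE_INDEX] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

example = collate_for_causal_lm([[2, 10, 11], [2, 12]])
print(example["input_ids"])  # prints [[2, 10, 11], [2, 12, 1]]
print(example["labels"])     # prints [[2, 10, 11], [2, 12, -100]]
```

The shift between inputs and next-token targets is not done here; for causal language models it happens inside the model when the loss is computed.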

#### Define TrainingArguments

In [16]:
args = TrainingArguments(
    output_dir=MODEL_NAME,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACC_STEPS,
    fp16=True,
    report_to=REPORTS_TO,
    warmup_steps=WARMUP_STEPS,
    max_steps=MAX_STEPS,
    learning_rate=LR,
    logging_steps=logging_steps,
    hub_private_repo=True,
    push_to_hub=True
    )
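
Since `lr_scheduler_type` is not set, the sketch below assumes the default linear schedule applies: the learning rate climbs linearly from 0 to `LR` over `WARMUP_STEPS`, then decays linearly back to 0 at `MAX_STEPS`. This is a plain-Python illustration of that shape, and the function name is hypothetical.

```python
LR = 2e-4
WARMUP_STEPS = 75
MAX_STEPS = 150

def linear_warmup_decay_lr(step, peak_lr=LR, warmup=WARMUP_STEPS, total=MAX_STEPS):
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    if step < warmup:
        return peak_lr * (step / max(1, warmup))
    return peak_lr * max(0.0, (total - step) / max(1, total - warmup))

for step in (0, 30, 75, 120, 150):
    print(step, f"{linear_warmup_decay_lr(step):.2e}")
```

With these settings the peak learning rate is reached only at step 75, the midpoint of the run, so most steps train at well below `2e-4`.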

#### Define Trainer

In [17]:
trainer = Trainer(
    model=model,
    train_dataset=ds['train'],
    eval_dataset=ds['test'],
    args=args,
    data_collator=data_collator,
)

model.config.use_cache = False

Cloning https://huggingface.co/DunnBC22/opt-2.7b-Fine_Tuned-Essays_with_Instructions into local empty directory.


#### Train Model

In [18]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1 | 3.2972 |
| 2 | 3.0479 |
| 3 | 3.152 |
| 4 | 3.0879 |
| 5 | 3.2421 |
| 6 | 3.2305 |
| 7 | 3.1567 |
| 8 | 3.2027 |
| 9 | 3.1405 |
| 10 | 3.1878 |


TrainOutput(global_step=150, training_loss=2.5047942622502646, metrics={'train_runtime': 1483.2998, 'train_samples_per_second': 3.236, 'train_steps_per_second': 0.101, 'total_flos': 8979988698593280.0, 'train_loss': 2.5047942622502646, 'epoch': 2.9})

#### Save Trained Model

In [19]:
model.push_to_hub(MODEL_NAME)

adapter_model.bin:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/DunnBC22/opt-2.7b-Fine_Tuned-Essays_with_Instructions/commit/c0037ae46fa5b9c899e89a130577f51825000126', commit_message='Upload model', commit_description='', oid='c0037ae46fa5b9c899e89a130577f51825000126', pr_url=None, pr_revision=None, pr_num=None)

#### Evaluate Model

In [20]:
evaluation_results = trainer.evaluate()
print(f"Perplexity: {math.exp(evaluation_results['eval_loss']):.2f}")

Perplexity: 9.46
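
Perplexity here is the exponential of the mean per-token cross-entropy loss, so the two values can be converted back and forth. The sketch below illustrates the relationship with a toy example and recovers the evaluation loss implied by the reported perplexity. Because 9.46 is itself rounded, the recovered loss is approximate. The function name is hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy check: a model that spreads probability uniformly over 4 tokens
# has a per-token loss of ln(4) and therefore a perplexity of 4.
uniform_over_4 = [math.log(4)] * 10
print(round(perplexity(uniform_over_4), 6))  # prints 4.0

# Inverting the relationship for the value reported above.
implied_eval_loss = math.log(9.46)
print(round(implied_eval_loss, 2))  # prints 2.25
```

A perplexity of 9.46 can be read as the model being, on average, about as uncertain as a uniform choice among roughly nine or ten tokens at each position.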


### Notes & Other Takeaways From This Project

****
- The training loss appears to have dropped from above 3.2 to the 2.2-2.3 range, after which it stopped improving.

****

### Citations

- Model Checkpoint
    > @misc{zhang2022opt, title={OPT: Open Pre-trained Transformer Language Models}, author={Susan Zhang and Stephen Roller and Naman Goyal and Mikel Artetxe and Moya Chen and Shuohui Chen and Christopher Dewan and Mona Diab and Xian Li and Xi Victoria Lin and Todor Mihaylov and Myle Ott and Sam Shleifer and Kurt Shuster and Daniel Simig and Punit Singh Koura and Anjali Sridhar and Tianlu Wang and Luke Zettlemoyer}, year={2022}, eprint={2205.01068}, archivePrefix={arXiv}, primaryClass={cs.CL}}

- Metric (Perplexity)
    > @article{jelinek1977perplexity, title={Perplexity—a measure of the difficulty of speech recognition tasks}, author={Jelinek, Fred and Mercer, Robert L and Bahl, Lalit R and Baker, James K}, journal={The Journal of the Acoustical Society of America}, volume={62}, number={S1}, pages={S63--S63}, year={1977}, publisher={Acoustical Society of America}}

- Dataset
    > https://huggingface.co/datasets/ChristophSchuhmann/essays-with-instructions