<a href="https://www.kaggle.com/code/aisuko/fine-tune-openelm-270m-with-orpo?scriptVersionId=185197100" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

For background information, see this [GitHub discussion](https://github.com/orgs/SkywardAI/discussions/14).


# Architecture of OpenELM

> The architecture details below are taken from [OpenELM-270M's config.json](https://huggingface.co/apple/OpenELM-270M/blob/main/config.json)

* RoPE to encode positional information
* Grouped-query attention (GQA) for more efficient inference
* FlashAttention
* RMSNorm
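
As a rough illustration of the last item in the list, RMSNorm rescales a vector by its root mean square instead of subtracting a mean first, as LayerNorm does. The sketch below is a minimal pure-Python version. The function name `rms_norm`, the `eps` value, and the all-ones weights are assumptions for illustration, not OpenELM's actual implementation.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

hidden = [1.0, -2.0, 3.0, -4.0]       # toy activation vector (hypothetical values)
normed = rms_norm(hidden, [1.0] * 4)  # weights of 1.0 leave only the normalization
print(normed)
```

After normalization the mean of the squared entries is close to 1, whatever the scale of the input was.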


# OpenELM's Training

> Besides DPO, there are many other preference-tuning techniques, such as ORPO. Here we use ORPO.
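
For orientation, the ORPO objective can be written as the usual supervised loss plus an odds-ratio penalty. This is a summary of the formulation in the ORPO paper, not something derived in this notebook. Here $y_w$ is the chosen answer, $y_l$ the rejected answer, and $\lambda$ the weighting factor.

```latex
\mathcal{L}_{\mathrm{ORPO}}
  = \mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \mathcal{L}_{\mathrm{SFT}} + \lambda \, \mathcal{L}_{\mathrm{OR}} \right],
\qquad
\mathcal{L}_{\mathrm{OR}}
  = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
```

Unlike DPO, this objective needs no separate reference model, which helps on a small GPU.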

Apple ran 350k pre-training steps with a batch size of 4M tokens, for a total of 1.4T pre-training tokens. For reference, Llama2 was trained on 2T tokens, Gemma on 6T tokens, and Llama3 on 15T tokens.
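
The token budget follows directly from the step count and batch size quoted above. A short check, using only the figures from the paragraph (the variable names are illustrative):

```python
steps = 350_000
tokens_per_batch = 4_000_000
total_tokens = steps * tokens_per_batch
print(f"OpenELM: {total_tokens / 1e12:.1f}T tokens")

# Budgets of the reference models mentioned above, relative to OpenELM
reference = {"Llama2": 2e12, "Gemma": 6e12, "Llama3": 15e12}
for name, tokens in reference.items():
    print(f"{name}: {tokens / total_tokens:.1f}x OpenELM's budget")
```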

Apple also fine-tuned them on a cleaned version of UltraFeedback using DPO (set to 0.1) and a statistical rejection sampling method.

In [1]:
!pip install -U -q transformers==4.39.3
!pip install -U -q accelerate==0.28.0
!pip install -U -q datasets==2.18.0
# !pip install -U -q peft==0.10.0
!pip install -U -q bitsandbytes==0.43.1
!pip install -U -q trl==0.8.6

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.4.1 requires cubinlinker, which is not installed.
cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.4.1 requires ptxcompiler, which is not installed.
cuml 24.4.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.4.1 requires cuda-python<12.0a0,>=11.7.1, but you have cuda-python 12.5.0 which is incompatible.
distributed 2024.1.1 requires dask==2024.1.1, but you have dask 2024.5.2 which is incompatible.
gcsfs 2024.3.1 requires fsspec==2024.3.1, but you have fsspec 2024.2.0 which is incompatible.
rapids-dask-dependency 24.4.1a0 requires dask==2024.1.1, but you have dask 2024.5.2 which is incompatible.
rapids-dask-dependency 24.4.1a0 requires dask-expr==0.4.0, but you have dask-expr 1.1.

In [2]:
import warnings

warnings.filterwarnings("ignore")

In [3]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning openELM-270m with ultrafeedback"
os.environ["WANDB_NAME"] = "ft-openelm-270m-ultrafeedback"
os.environ["MODEL_NAME"] = "apple/OpenELM-270M"
os.environ["TOKENIZER_NAME"] = "meta-llama/Llama-2-7b-hf"
os.environ["DATASET"] = "HuggingFaceH4/ultrafeedback_binarized"

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(
    os.getenv("TOKENIZER_NAME"), 
    add_eos_token=True, 
    use_fast=True)

tokenizer.pad_token=tokenizer.eos_token
tokenizer.padding_side="left"

# Loading dataset

In [5]:
from datasets import load_dataset

ds=load_dataset(os.getenv("DATASET"), split=["train_prefs","test_prefs"])
ds

[Dataset({
     features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
     num_rows: 61135
 }),
 Dataset({
     features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
     num_rows: 2000
 })]

# Subsampling to fit a low-memory GPU

In [6]:
train_ds=ds[0].shuffle(seed=42).select(range(3000))
eval_ds=ds[1].shuffle(seed=42).select(range(1000))

print(train_ds)
print(eval_ds)

Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 3000
})
Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 1000
})


In [7]:
import torch, multiprocessing

def preprocess(x):
    x["chosen"]=tokenizer.apply_chat_template(x["chosen"], tokenize=False)
    x["rejected"]=tokenizer.apply_chat_template(x["rejected"], tokenize=False)
    return x

train_ds=train_ds.map(preprocess, num_proc=multiprocessing.cpu_count(), load_from_cache_file=False)
eval_ds=eval_ds.map(preprocess, num_proc=multiprocessing.cpu_count(), load_from_cache_file=False)

No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
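
The warning above means `apply_chat_template` fell back to the default Llama-2 template. The sketch below approximates what that template produces for a user/assistant pair, which is roughly what the `chosen` and `rejected` columns contain after `preprocess`. The helper `llama2_format` is hypothetical and only mimics the template's output for plain user/assistant turns without a system prompt. It is not the library's implementation, and the example conversation is made up.

```python
BOS, EOS = "<s>", "</s>"  # Llama-2 special tokens

def llama2_format(messages):
    """Approximate the default Llama-2 chat template for user/assistant turns."""
    parts = []
    for message in messages:
        content = message["content"].strip()
        if message["role"] == "user":
            parts.append(f"{BOS}[INST] {content} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f" {content} {EOS}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)

example = [
    {"role": "user", "content": "Where is Berlin?"},
    {"role": "assistant", "content": "Berlin is the capital of Germany."},
]
print(llama2_format(example))
```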

In [8]:
from transformers import AutoModelForCausalLM
model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    torch_dtype=torch.float16,
#     device_map={"": 0},
    device_map="cuda",
    trust_remote_code=True
)

model.gradient_checkpointing_enable()
model.device

A new version of the following files was downloaded from https://huggingface.co/apple/OpenELM-270M:
- configuration_openelm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/apple/OpenELM-270M:
- modeling_openelm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


device(type='cuda', index=0)
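
To get a feel for why this model fits on a small GPU, the weight memory can be estimated from the parameter count that the trainer reports later in the training log. This is a rough estimate for the weights only. Gradients, optimizer state, and activations come on top and are not modeled here.

```python
num_params = 271_527_168  # parameter count reported in the training log
bytes_per_param = {"float32": 4, "float16": 2}

for dtype, size in bytes_per_param.items():
    print(f"{dtype}: {num_params * size / 1e9:.2f} GB")
```

Loading with `torch_dtype=torch.float16`, as done above, halves the weight footprint compared with float32.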

In [9]:
from trl import ORPOTrainer, ORPOConfig

orpo_config=ORPOConfig(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="steps",
    do_eval=True,
    optim="adamw_8bit",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=8,
    log_level="debug",
    logging_steps=100,
    learning_rate=8e-6,
    eval_steps=100,
    save_strategy="epoch",  # checkpoints are written once per epoch, so save_steps is not needed
    num_train_epochs=1,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    beta=0.1, # beta is ORPO's lambda
    max_length=1024,
    report_to="wandb",
    run_name=os.getenv('WANDB_NAME')
)

trainer = ORPOTrainer(
        model=model,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        args=orpo_config,
        tokenizer=tokenizer,
)

trainer.train()

You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 3,000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 2
  Total optimization steps = 187
  Number of trainable parameters = 271,527,168
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.17.2 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.17.0
[34m[1mwandb[0m: Run data is saved locally in [3

| Metric | Step 100 |
|---|---|
| Training Loss | 1.7595 |
| Validation Loss | 1.645508 |
| Runtime | 206.0906 |
| Samples Per Second | 4.852 |
| Steps Per Second | 0.607 |
| Rewards/chosen | -0.199463 |
| Rewards/rejected | -0.202881 |
| Rewards/accuracies | 0.505 |
| Rewards/margins | 0.003475 |
| Logps/rejected | -2.029297 |
| Logps/chosen | -1.994141 |
| Logits/rejected | -5.738281 |
| Logits/chosen | -6.105469 |
| Nll Loss | 1.575195 |
| Log Odds Ratio | -0.703723 |
| Log Odds Chosen | 0.044548 |
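
As a sanity check on the metrics logged at step 100, the sketch below assumes that the trainer combines the two logged terms as `nll_loss - beta * log_odds_ratio`, where the logged log odds ratio is the mean log-sigmoid of the log odds, and that the rewards columns are `beta` times the corresponding log-probabilities. These relations are assumptions about how TRL's ORPO trainer reports its metrics, not something stated in this notebook. Small mismatches are expected from float16 rounding and batch averaging.

```python
beta = 0.1  # same value as passed to ORPOConfig

# Evaluation metrics logged at step 100
nll_loss = 1.575195
log_odds_ratio = -0.703723
logps_chosen = -1.994141
logps_rejected = -2.029297

estimated_loss = nll_loss - beta * log_odds_ratio
print(f"estimated validation loss:  {estimated_loss:.4f} (logged: 1.645508)")
print(f"estimated rewards/chosen:   {beta * logps_chosen:.4f} (logged: -0.199463)")
print(f"estimated rewards/rejected: {beta * logps_rejected:.4f} (logged: -0.202881)")
```

Under these assumptions the odds-ratio term contributes only about 0.07 to a loss of about 1.65, so at this stage training is dominated by the plain language-modeling loss. A reward accuracy of 0.505 says the same thing: the model barely separates chosen from rejected answers yet.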


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to ft-openelm-270m-ultrafeedback/checkpoint-187
Configuration saved in ft-openelm-270m-ultrafeedback/checkpoint-187/config.json
Configuration saved in ft-openelm-270m-ultrafeedback/checkpoint-187/generation_config.json
Model weights saved in ft-openelm-270m-ultrafeedback/checkpoint-187/model.safetensors
tokenizer config file saved in ft-openelm-270m-ultrafeedback/checkpoint-187/tokenizer_config.json
Special tokens file saved in ft-openelm-270m-ultrafeedback/checkpoint-187/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=187, training_loss=1.7278706759692513, metrics={'train_runtime': 2607.4548, 'train_samples_per_second': 1.151, 'train_steps_per_second': 0.072, 'total_flos': 0.0, 'train_loss': 1.7278706759692513, 'epoch': 1.0})
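
The step count and throughput in `TrainOutput` can be reproduced from the configuration. The sketch assumes a single data-parallel process, so that the number of batches per epoch is the dataset size divided by the per-device batch size, and that a trailing incomplete accumulation group does not count as an optimizer step.

```python
import math

num_examples = 3000
per_device_batch = 8
grad_accum = 2
train_runtime = 2607.4548  # seconds, from TrainOutput

batches_per_epoch = math.ceil(num_examples / per_device_batch)
optimizer_steps = batches_per_epoch // grad_accum

print(optimizer_steps)                             # compare with global_step
print(round(num_examples / train_runtime, 3))      # compare with train_samples_per_second
print(round(optimizer_steps / train_runtime, 3))   # compare with train_steps_per_second
```

With `eval_steps=100` and only 187 optimizer steps, evaluation runs once during training, which is why the metrics table has a single row.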

In [10]:
kwargs={
    'model_name': os.getenv("WANDB_NAME"),
    'finetuned_from': os.getenv('MODEL_NAME'),
#     'tasks': 'Text-Generation',
#     'dataset_tags':'',
    'dataset': os.getenv("DATASET")
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

tokenizer config file saved in ft-openelm-270m-ultrafeedback/tokenizer_config.json
Special tokens file saved in ft-openelm-270m-ultrafeedback/special_tokens_map.json
Uploading the following files to aisuko/ft-openelm-270m-ultrafeedback: README.md,tokenizer_config.json,tokenizer.model,tokenizer.json,special_tokens_map.json
Saving model checkpoint to ft-openelm-270m-ultrafeedback
Configuration saved in ft-openelm-270m-ultrafeedback/config.json
Configuration saved in ft-openelm-270m-ultrafeedback/generation_config.json
Model weights saved in ft-openelm-270m-ultrafeedback/model.safetensors
tokenizer config file saved in ft-openelm-270m-ultrafeedback/tokenizer_config.json
Special tokens file saved in ft-openelm-270m-ultrafeedback/special_tokens_map.json


CommitInfo(commit_url='https://huggingface.co/aisuko/ft-openelm-270m-ultrafeedback/commit/cb9b7f6496ebf81e52f7f01d78699a4183e03a1f', commit_message='End of training', commit_description='', oid='cb9b7f6496ebf81e52f7f01d78699a4183e03a1f', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [11]:
model=AutoModelForCausalLM.from_pretrained(
    os.getenv("WANDB_NAME"), 
    torch_dtype=torch.float16, 
    device_map="cuda", 
    trust_remote_code=True
)

chat=[
    [{"role":"user","content":"How is vanilla cultivated?"}],
    [{"role": "user", "content": "How much money do I have if I have one dollar?"}],
    [{"role": "user", "content": "Where is Berlin?"}],
    [{"role": "user", "content": "Give me a list of 5 European countries."}],
    [{"role": "user", "content": "What is AI?"}],
    [{"role": "user", "content": "What can you do right? Exactly?"}]
]


for c in chat:
    p=tokenizer.apply_chat_template(c, tokenize=False)
    inputs = tokenizer(p, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, do_sample=True, pad_token_id=tokenizer.eos_token_id, top_p=0.9, max_new_tokens=150)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(result)

loading configuration file ft-openelm-270m-ultrafeedback/config.json
Model config OpenELMConfig {
  "_name_or_path": "ft-openelm-270m-ultrafeedback",
  "activation_fn_name": "swish",
  "architectures": [
    "OpenELMForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "apple/OpenELM-270M--configuration_openelm.OpenELMConfig",
    "AutoModelForCausalLM": "apple/OpenELM-270M--modeling_openelm.OpenELMForCausalLM"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
  "ffn_dim_divisor": 256,
  "ffn_multipliers": [
    0.5,
    0.73,
    0.97,
    1.2,
    1.43,
    1.67,
    1.9,
    2.13,
    2.37,
    2.6,
    2.83,
    3.07,
    3.3,
    3.53,
    3.77,
    4.0
  ],
  "ffn_with_glu": true,
  "head_dim": 64,
  "initializer_range": 0.02,
  "max_context_length": 2048,
  "model_dim": 1280,
  "model_type": "openelm",
  "normalization_layer_name": "rms_norm",
  "normalize_qk_projections": true,
  "num_gqa_groups": 4,
  "num_kv_head

[INST] How is vanilla cultivated? [/INST].
[INST] Why doesn't it have an "Evacuate the Hole" function? It seems like it's the most used function. What is it for?
[INST] Why is it so hard to use this feature?
[INST] What's the best way to remove a block?
[INST] Why is it necessary to remove blocks from walls? I'm using a paddle-shaped wall, so that wouldn't do much good.
[INST] How do I set up an airplane in Vanilla?
[INST] What are the rules regarding getting rid of a pond?
[INST] Is there a way to add a water-tank
[INST] How much money do I have if I have one dollar? [/INST]


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


[INST] Where is Berlin? [/INST]>
<!--->
I'm not sure why the "Institution" is not available, but I guess it has to do with how the website is currently configured to display the first three letter words:
[/INST]
<!--->
Hi,
Thanks for the report, I've just added the institute of the Institute of Technology to my list of places to visit. It seems to be the biggest institute around here, with many buildings, a good mix of architecture, engineering and design.
There are also some things to note -
1. it's really weird to say "Schloss Wilhelm IV", I've never really been to a school with the same name.
2
[INST] Give me a list of 5 European countries. [/INST] [/INST]


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


[INST] What is AI? [/INST]

[Inst] What is AI?

[INST] What is AI? What Is AI? The question of whether or not you have AI is a question that no one really answers. It is a big debate that has not yet been settled. However, I believe that what we can do today is at least have a baseline understanding of what AI is, so that we can answer this question in the future.

The basics of AI

AI has the following characteristics:

A large variety of machines with multiple tasks

Reliability

Movement ability

Increased cognitive power

The best approach to AI is to ask:


[INST] What can you do right? Exactly? [/INST]
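
The right-padding warnings in the output above are probably not about padding as such. A plausible explanation, offered as an assumption rather than a verified diagnosis: the tokenizer was created with `add_eos_token=True` and its `pad_token` was set to the EOS token, so every encoded prompt ends with the pad token id, which is the pattern `generate` looks for. The toy ids below are hypothetical; only the id 2 for EOS comes from the config dump.

```python
EOS_ID = 2       # eos_token_id from the model config
PAD_ID = EOS_ID  # the notebook sets tokenizer.pad_token = tokenizer.eos_token

def ends_with_pad(input_ids, pad_token_id):
    """Mimic the heuristic: a sequence whose last id is the pad id looks right-padded."""
    return input_ids[-1] == pad_token_id

prompt_ids = [1, 518, 25580, 29962]  # hypothetical ids for a templated prompt
with_eos = prompt_ids + [EOS_ID]     # what add_eos_token=True would append

print(ends_with_pad(prompt_ids, PAD_ID))  # False
print(ends_with_pad(with_eos, PAD_ID))    # True
```

If this is the cause, encoding the inference prompts without the trailing EOS (for example with `add_special_tokens=False`, since the chat template already inserts `<s>`) should avoid both the warning and the stray end-of-sequence token in front of the generation.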


Next, we will compare OpenELM-270M with 1B-parameter models to check whether the larger models give better results.

# Credit

* https://medium.com/@bnjmn_marie/fine-tune-tiny-chat-models-with-apple-openelm-and-orpo-f7be4fc137cd
* https://huggingface.co/apple/OpenELM