# Fine-tune Qwen2 using torchrun

Note: This notebook requires a GPU instance type and has been tested with ml.g5.xlarge.

Load the model from Hugging Face and evaluate its response to "What is your name?"

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Chat",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Chat")
device = "cuda" # the device to load the model onto
prompt = "What is your name?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


I am an artificial intelligence and I do not have a personal name. However, if you have any questions or need help with something, feel free to ask!
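
To make the `apply_chat_template` step above less opaque, here is a minimal sketch of the kind of prompt string it builds. It assumes the tokenizer uses ChatML-style `<|im_start|>` / `<|im_end|>` markers; the helper `render_chatml` is hypothetical, written for illustration only, and is not part of transformers. The tokenizer's own template remains the source of truth.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style prompt (assumed format, not read from the tokenizer)."""
    # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # An open assistant turn tells the model that it should answer next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is your name?"},
]
print(render_chatml(messages))
```

Printing `text` from the cell above and comparing it with this output is a quick way to check whether the assumption holds for the model in use.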


## Install dependencies
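
The shell commands in this section fork the notebook process after the tokenizer has already been used, which triggers the `huggingface/tokenizers` parallelism warning seen in the output below. A minimal sketch for silencing it follows the warning's own suggestion and sets `TOKENIZERS_PARALLELISM` explicitly. Choosing `"false"` is an assumption that suits this small demo. Ideally the variable is set before the tokenizer is first used.

```python
import os

# Set the variable explicitly so tokenizers does not have to guess after a fork.
# "false" disables tokenizer-level parallelism (an assumption that is fine for this demo).
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```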

In [2]:
!micromamba install -y --freeze-installed  "nvidia::cuda>=11.8,<11.9"

nvidia/linux-64                                             Using cache
nvidia/noarch                                               Using cache


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Pinned packages:
  - python 3.10.*

error    libmamba Could not solve for environment specs
    The following package could not be installed
    └─ cuda [ |>=11.8,<11.9 ] is not installable because it conflicts with any installable versions previously reported.
Possible hints:
  - 'freeze_installed' is turned on

critical libmamba Could not solve for environment specs


In [3]:
!nvcc --version
!ls /usr/local/cuda/lib64/libcudart.so.11.8.89
!sudo rm -f /usr/lib/libcudart.so
!sudo ln -s /usr/local/cuda/lib64/libcudart.so.11.8.89 /usr/lib/libcudart.so

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
/usr/local/cuda/lib64/libcudart.so.11.8.89



In [4]:
!pip install peft deepspeed optimum accelerate bitsandbytes --upgrade --quiet



Clone the QwenLM repository, which contains the fine-tuning examples

In [5]:
!git clone https://github.com/QwenLM/Qwen2.git

Cloning into 'Qwen2'...




remote: Enumerating objects: 1021, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 1021 (delta 85), reused 43 (delta 43), pack-reused 888
Receiving objects: 100% (1021/1021), 1.20 MiB | 31.61 MiB/s, done.
Resolving deltas: 100% (497/497), done.


Copy the training script, launcher, and DeepSpeed config to the current directory

In [6]:
!cp Qwen2/examples/sft/finetune.py .
!cp Qwen2/examples/sft/finetune.sh . 
!cp Qwen2/examples/sft/ds_config_zero3.json .



Create (dummy) training data that includes a different way to answer "What is your name?"

In [7]:
import json
sample = {"type": "chatml", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is your name?"}, {"role": "assistant", "content": "My name is Optimus."}], "source": "self-made"}
with open('data.jsonl', 'w') as f:
    for i in range(100):
        f.write(json.dumps(sample) + '\n')
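
Before launching a training run, it can be worth checking that every line of the file parses and has the expected shape. The following is a minimal sketch, assuming the training script only relies on the `messages` list with `role` and `content` fields as in the sample above. The helper `validate_chatml_jsonl` and its rules are illustrative and not part of the Qwen2 repository.

```python
import json
from pathlib import Path

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chatml_jsonl(path):
    """Return the number of valid records; raise ValueError on the first bad one."""
    count = 0
    with Path(path).open() as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"line {line_no}: missing or empty 'messages'")
            for msg in messages:
                if msg.get("role") not in ALLOWED_ROLES:
                    raise ValueError(f"line {line_no}: unexpected role {msg.get('role')!r}")
                if not isinstance(msg.get("content"), str):
                    raise ValueError(f"line {line_no}: 'content' must be a string")
            # Supervised fine-tuning needs an assistant turn to learn from.
            if messages[-1]["role"] != "assistant":
                raise ValueError(f"line {line_no}: last message is not from the assistant")
            count += 1
    return count
```

Calling `validate_chatml_jsonl("data.jsonl")` after the cell above should return 100, one per record written.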

Start the fine-tuning process

Note: Before running the following cell a second time, delete the `output_qwen` directory:

```
rm -rf output_qwen
```
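
The same cleanup can be done from a Python cell, which avoids failing when the directory does not exist yet. This is a minimal sketch; the helper name `reset_output_dir` is hypothetical, and the default path simply mirrors the `output_qwen` directory named in the note above.

```python
import shutil
from pathlib import Path


def reset_output_dir(path="output_qwen"):
    """Delete the directory if it exists; return True if something was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `reset_output_dir()` before rerunning the training cell leaves a clean slate for the next run.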

In [8]:
!bash finetune.sh -m Qwen/Qwen2-0.5B-Chat -d data.jsonl --deepspeed ds_config_zero3.json 



[2024-06-06 20:00:16,395] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-06-06 20:00:18.041342: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2024-06-06 20:00:19,790] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-06 20:00:19,790] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-06 20:00:20,345] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 0.63B
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading data...
Formatting inputs...Skip in lazy mode
Detected kernel version 4.14.343, which is bel

## Test the fine-tuned model
Check whether the answer to "What is your name?" now matches the training data provided during fine-tuning.


In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "output_qwen/",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Chat")
device = "cuda" # the device to load the model onto
prompt = "What is your name?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


My name is Optimus.
