<a href="https://colab.research.google.com/github/Mutoy-choi/Dacon_plants/blob/main/Gemma/Finetune_with_LLaMA_Factory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma - finetune with LLaMA Factory

This notebook demonstrates how to finetune Gemma with LLaMA Factory. [LLaMA Factory](https://github.com/InternLM/xtuner) is a tool that specifically designed for finetuning LLMs. LLaMA Factory wraps the Hugging Face finetuning functionality and provides a simple interface for finetuning. It's very easy to finetune Gemma with LLaMA Factory. This notebook follows very closely the official [Colab notebook](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing) from LLaMA Factory.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Finetune_with_LLaMA_Factory.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.


### Gemma setup on Hugging Face
LLaMA Factory uses Hugging Face under the hood. So you will need to:

* Get access to Gemma on [huggingface.co](huggingface.co) by accepting the Gemma license on the Hugging Face page of the specific model, i.e., [Gemma 2B](https://huggingface.co/google/gemma-2b).
* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as a Colab secret 'HF_TOKEN'.

In [1]:
import os
from google.colab import userdata
# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

### Install LLaMA Factory

Install LLaMA Factory from source on GitHub.

In [5]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install -e .[torch,bitsandbytes]

/content
Cloning into 'LLaMA-Factory'...
remote: Enumerating objects: 20658, done.[K
remote: Counting objects: 100% (275/275), done.[K
remote: Compressing objects: 100% (110/110), done.[K
remote: Total 20658 (delta 215), reused 165 (delta 165), pack-reused 20383 (from 3)[K
Receiving objects: 100% (20658/20658), 235.50 MiB | 17.16 MiB/s, done.
Resolving deltas: 100% (14935/14935), done.
/content/LLaMA-Factory
[0m[01;34massets[0m/       [01;34mdocker[0m/      LICENSE      pyproject.toml  requirements.txt  [01;34msrc[0m/
CITATION.cff  [01;34mevaluation[0m/  Makefile     README.md       [01;34mscripts[0m/          [01;34mtests[0m/
[01;34mdata[0m/         [01;34mexamples[0m/    MANIFEST.in  README_zh.md    setup.py
Obtaining file:///content/LLaMA-Factory
  Installing build dependencies ... [?25l[?25hdone
  Checking if build backend supports build_editable ... [?25l[?25hdone
  Getting requirements to build editable ... [?25l[?25hdone
  Preparing editable metadata (

## Finetune Gemma

Kick off Gemma 2B finetuning with a [demo Alpaca dataset](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/alpaca_en_demo.json). If you want to use your own dataset, follow this [guide from LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory/tree/main/data).

In [6]:
import json

args = dict(
    stage="sft",  # do supervised fine-tuning
    do_train=True,
    model_name_or_path="google/gemma-2b-it",  # use bnb-4bit-quantized Gemma 2B model
    dataset="train_transformed",  # use the demo alpaca datasets
    template="gemma",  # use Gemma prompt template
    finetuning_type="lora",  # use LoRA adapters to save memory
    lora_target="all",  # attach LoRA adapters to all linear layers
    output_dir="gemma_lora",  # the path to save LoRA adapters
    per_device_train_batch_size=2,  # the batch size
    gradient_accumulation_steps=4,  # the gradient accumulation steps
    lr_scheduler_type="cosine",  # use cosine learning rate scheduler
    logging_steps=10,  # log every 10 steps
    warmup_ratio=0.1,  # use warmup scheduler
    save_steps=1000,  # save checkpoint every 1000 steps
    learning_rate=5e-4,  # the learning rate
    num_train_epochs=3.0,  # the epochs of training
    max_samples=500,  # use 500 examples in each dataset
    max_grad_norm=1.0,  # clip gradient norm to 1.0
    quantization_bit=4,  # use 4-bit QLoRA
    loraplus_lr_ratio=16.0,  # use LoRA+ algorithm with lambda=16.0
    fp16=True,  # use float16 mixed precision training
)

json.dump(args, open("train_gemma.json", "w", encoding="utf-8"), indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli train train_gemma.json

/content/LLaMA-Factory
2025-01-15 11:56:06.635800: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-15 11:56:06.655872: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-15 11:56:06.661652: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-15 11:56:06.678003: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[INFO|2025-01-15 11:56

## Run inference in a chat setting

In [None]:
import pandas as pd
from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

# LLaMA-Factory 경로 설정
%cd /content/LLaMA-Factory/src/

# 모델 초기화
args = dict(
    model_name_or_path="google/gemma-2b-it",  # use Gemma 2B model
    adapter_name_or_path="gemma_lora",  # load the saved LoRA adapters
    template="gemma",  # same to the one in training
    finetuning_type="lora",  # same to the one in training
    quantization_bit=4,  # load 4-bit quantized model
)
chat_model = ChatModel(args)

# test.csv 파일 로드
test_file_path = "/content/test.csv"  # 파일 경로
test_data = pd.read_csv(test_file_path)

# 복원 결과를 저장할 리스트
results = []


/content/LLaMA-Factory/src
/content/LLaMA-Factory


[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,909 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/tokenizer.model
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,910 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,912 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,914 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:02,916 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2

06/02/2024 01:59:04 - INFO - llamafactory.model.utils.quantization - Quantizing model to 4 bit.


INFO:llamafactory.model.utils.quantization:Quantizing model to 4 bit.


06/02/2024 01:59:04 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.


INFO:llamafactory.model.patcher:Using KV cache for faster generation.
[INFO|modeling_utils.py:3474] 2024-06-02 01:59:04,316 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-06-02 01:59:04,322 >> Instantiating GemmaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:962] 2024-06-02 01:59:04,324 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}

Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO|modeling_utils.py:4280] 2024-06-02 01:59:10,951 >> All model checkpoint weights were used when initializing GemmaForCausalLM.

[INFO|modeling_utils.py:4288] 2024-06-02 01:59:10,956 >> All the weights of GemmaForCausalLM were initialized from the model checkpoint at google/gemma-2b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GemmaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:917] 2024-06-02 01:59:10,993 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/generation_config.json
[INFO|configuration_utils.py:962] 2024-06-02 01:59:10,995 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0
}



06/02/2024 01:59:11 - INFO - llamafactory.model.utils.attention - Using torch SDPA for faster training and inference.


INFO:llamafactory.model.utils.attention:Using torch SDPA for faster training and inference.


06/02/2024 01:59:11 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.


INFO:llamafactory.model.adapter:Upcasting trainable params to float32.


06/02/2024 01:59:11 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA


INFO:llamafactory.model.adapter:Fine-tuning method: LoRA


06/02/2024 01:59:11 - INFO - llamafactory.model.adapter - Loaded adapter(s): gemma_lora


INFO:llamafactory.model.adapter:Loaded adapter(s): gemma_lora


06/02/2024 01:59:11 - INFO - llamafactory.model.loader - all params: 2515978240


INFO:llamafactory.model.loader:all params: 2515978240


Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.

User: where is Chicago?
Assistant: Chicago is located in the U.S. state of Illinois, and is the third most populous city in the United States.

User: exit


In [None]:
instruction = "Restore the following phonetically encoded text to its original meaning."

print("Processing inputs...")

for idx, row in test_data.iterrows():
    input_text = row['input']
    query = f"{instruction}\nInput: {input_text}"

    messages = [{"role": "user", "content": query}]
    response = ""

    # 모델 추론
    for new_text in chat_model.stream_chat(messages):
        response += new_text

    # 결과 저장
    results.append({"ID": row['ID'], "output": response.strip()})

    # 메모리 관리
    torch_gc()

print("All inputs processed.")


In [None]:
# 결과를 DataFrame으로 변환
submission_df = pd.DataFrame(results)

# sample_submission.csv 형식에 맞게 저장
output_file_path = "/content/sample_submission.csv"
submission_df.to_csv(output_file_path, index=False)

print(f"Results saved to {output_file_path}")


## Merge the LoRA adapter and upload the finetuned model to Hugging Face

In [None]:
import json

args = dict(
    model_name_or_path="google/gemma-2b",  # use official non-quantized Gemma 2B model
    adapter_name_or_path="gemma_lora",  # load the saved LoRA adapters
    template="gemma",  # same to the one in training
    finetuning_type="lora",  # same to the one in training
    export_dir="gemma_lora_merged",  # path to save the merged model
    export_size=2,  # the file shard size (in GB) of the merged model
    export_device="cpu",  # the device used in export, can be chosen from `cpu` and `cuda`
    export_hub_model_id="gemma-2b-finetuned-model-llama-factory",  # your Hugging Face hub model ID
)

json.dump(args, open("merge_gemma.json", "w", encoding="utf-8"), indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_gemma.json

/content/LLaMA-Factory
2024-06-02 01:59:36.861478: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-02 01:59:36.861538: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-02 01:59:36.862938: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:47,841 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/tokenizer.model
[INFO|tokenization_utils_base.py:2108] 2024-06-02 01:59:47,842 >> loading file tokenizer.json from cache at /roo