<a href="https://colab.research.google.com/github/91veMe4Plus/obfuscated-string-decryptor/blob/main/%EB%82%9C%EB%8F%85%ED%99%94%EB%90%9C_%ED%95%9C%EA%B5%AD%EC%96%B4_%EB%AC%B8%EC%9E%90%EC%97%B4_%EC%B6%94%EB%A1%A0_%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D_%ED%94%84%EB%A1%9C%EC%A0%9D%ED%8A%B8_%EC%8B%A4%ED%97%98%EC%8B%A4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Google Gemma 3

# Hugging Face Authentication

- You must request access to the [Gemma model on Hugging Face](https://huggingface.co/google/gemma-3-1b-it).
- You must create an access token.
- You must accept the model's terms of use.

In [1]:
!pip install huggingface_hub



In [2]:
from google.colab import userdata
api_key = userdata.get('HF_TOKEN')

from huggingface_hub import login
login(api_key)
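
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of a fallback, assuming the token is exported as an `HF_TOKEN` environment variable (the same name as the Colab secret used above). `resolve_hf_token` is a hypothetical helper for illustration, not part of `huggingface_hub`.

```python
import os


def resolve_hf_token(env=None):
    """Return a Hugging Face token from the environment or, failing that, Colab secrets.

    Illustrative helper only; the name and behavior are assumptions.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if token:
        return token
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        raise RuntimeError(
            "No HF_TOKEN found: set the environment variable or add a Colab secret."
        ) from None
    return userdata.get("HF_TOKEN")
```

The returned value can be passed to `login(...)` in the same way as `api_key` above.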

In [3]:
!pip install transformers
!pip install peft
!pip install trl
!pip install datasets
!pip install bitsandbytes


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
import pandas as pd
import torch
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from transformers import TrainingArguments

# Mount Google Drive and load the data
drive.mount('/content/drive')
file_path = "/content/drive/MyDrive/obfuscated_korean_data.csv"
if not os.path.exists(file_path):
    raise FileNotFoundError(f"File not found: {file_path}")
df = pd.read_csv(file_path)
dataset = Dataset.from_pandas(df)
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
dataset_dict = DatasetDict({
    "train": split_dataset["train"],
    "validation": split_dataset["test"]
})

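# Build one training example in Gemma's chat-turn format.
# The Korean instruction means "Please infer the original of the following obfuscated string:".
# Note: the tokenizer may also prepend <bos> on its own, which would duplicate the literal one here.
def generate_prompt(example):
    return f"<bos><start_of_turn>user\n다음 난독화된 문자열의 원본을 추론해주세요: {example['Obfuscated']}<end_of_turn>\n<start_of_turn>model\n{example['Original']}<end_of_turn><eos>"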

# Add a "text" column with the formatted prompt to both splits
# ("text" is the column SFTTrainer reads by default)
dataset_dict = dataset_dict.map(lambda x: {"text": generate_prompt(x)})

# Models to fine-tune, as (model name, use QLoRA) pairs
models = [
    ("google/gemma-3-1b-it", False)
]

# Fine-tune each model in turn
for model_name, use_qlora in models:
    print(f"\n{'='*50}\nProcessing model: {model_name}\n{'='*50}")

    try:
        # 1. Load the tokenizer and model. With QLoRA the model is loaded once,
        #    already quantized to 4-bit, rather than loading the full weights first.
        print("[STEP 1] Loading model and tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        bnb_config = None
        if use_qlora:
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_quant_type="nf4"
            )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            # bfloat16 matches bf16=True in the TrainingArguments below
            torch_dtype=torch.bfloat16,
            device_map={"": "cuda:0"},
            attn_implementation="eager"
        )
        print("✅ Model and tokenizer loaded")

        # 2. Attach LoRA adapters to the quantized model (QLoRA only)
        if use_qlora:
            print("[STEP 2] Applying QLoRA...")
            lora_config = LoraConfig(
                r=8,
                lora_alpha=16,
                target_modules=["q_proj", "v_proj"],
                lora_dropout=0.05,
                bias="none",
                task_type="CAUSAL_LM"
            )
            model = get_peft_model(model, lora_config)
            print("✅ QLoRA applied")

        # 3. Configure the tokenizer
        print("[STEP 3] Configuring tokenizer...")
        # Pad on the left, the usual choice for batched generation with decoder-only models
        tokenizer.padding_side = "left"
        # Register the chat markers as special tokens. The Gemma tokenizer is expected
        # to define these already, in which case no new tokens are added.
        special_tokens = ["<bos>", "<eos>", "<start_of_turn>", "<end_of_turn>"]
        tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
        model.resize_token_embeddings(len(tokenizer))
        print("✅ Tokenizer configured")

        # 4. Training configuration
        print("[STEP 4] Building training configuration...")
        training_args = TrainingArguments(
            output_dir=f"./results_{model_name.split('/')[-1]}",
            per_device_train_batch_size=1,
            per_device_eval_batch_size=1,
            gradient_accumulation_steps=16,
            gradient_checkpointing=True,
            learning_rate=2e-5,
            num_train_epochs=3,
            eval_strategy="epoch",
            save_strategy="epoch",
            logging_steps=10,
            fp16=False,
            bf16=True,
            optim="adamw_bnb_8bit",
            save_total_limit=2,
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            report_to=[]
        )
        print("✅ Training configuration ready")

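        # 5. Initialize the trainer
        print("[STEP 5] Initializing trainer...")
        trainer = SFTTrainer(
            model=model,
            args=training_args,
            train_dataset=dataset_dict["train"],
            eval_dataset=dataset_dict["validation"],
            # Pass the tokenizer configured in step 3; without it the trainer loads its own.
            # `processing_class` is the argument name in recent TRL releases (older ones used `tokenizer`).
            processing_class=tokenizer,
        )
        print("✅ Trainer initialized")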

        # 6. Train the model
        print("[STEP 6] Starting training...")
        trainer.train()
        print("✅ Training complete")

        # 7. Save the fine-tuned model
        print("[STEP 7] Saving model...")
        trainer.save_model(f"./fine_tuned_{model_name.split('/')[-1]}")
        print("✅ Model saved")

        print(f"✅ Finished {model_name}")

    except Exception as e:
        print(f"\n🚨 Failed to process {model_name}: {e}")
        torch.cuda.empty_cache()
        continue
    finally:
        # Free GPU memory before moving on to the next model
        if 'model' in locals():
            del model
        if 'tokenizer' in locals():
            del tokenizer
        if 'trainer' in locals():
            del trainer
        torch.cuda.empty_cache()

print("\n🎉 All models processed!")
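
At inference time the prompt has to stop where the model's answer begins. The following is a minimal sketch that mirrors the training format from `generate_prompt` but leaves the model turn open. `build_inference_prompt` and `extract_answer` are hypothetical helpers, and whether the literal `<bos>` is needed is an assumption that depends on whether the tokenizer adds it on its own.

```python
# Same Korean instruction as in training:
# "Please infer the original of the following obfuscated string:"
INSTRUCTION = "다음 난독화된 문자열의 원본을 추론해주세요: "


def build_inference_prompt(obfuscated, add_bos=False):
    """Format a query like the training examples, ending at the open model turn."""
    prefix = "<bos>" if add_bos else ""
    return (
        f"{prefix}<start_of_turn>user\n"
        f"{INSTRUCTION}{obfuscated}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


def extract_answer(generated_text):
    """Return the text of the model turn, cut at the first end-of-turn marker."""
    answer = generated_text.split("<start_of_turn>model\n")[-1]
    return answer.split("<end_of_turn>")[0].strip()
```

The prompt string would be tokenized and passed to the fine-tuned model's `generate` method. `extract_answer` then recovers the predicted original from the decoded output, provided special tokens are kept when decoding.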

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]


Processing model: google/gemma-3-1b-it
[STEP 1] Loading model and tokenizer...


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


✅ Model and tokenizer loaded
[STEP 3] Configuring tokenizer...
✅ Tokenizer configured
[STEP 4] Building training configuration...
✅ Training configuration ready
[STEP 5] Initializing trainer...


Converting train dataset to ChatML:   0%|          | 0/800 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/800 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/800 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/800 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/200 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

✅ Trainer initialized
[STEP 6] Starting training...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Epoch,Training Loss,Validation Loss
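
When reading the loss table, it helps to know how many optimizer steps one epoch contains. The following is a small worked calculation using the values from this notebook: 800 training examples, `per_device_train_batch_size=1`, `gradient_accumulation_steps=16`, `num_train_epochs=3`. It assumes a single GPU, and `training_schedule` is a hypothetical helper rather than a Trainer API.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Round up so a final partial batch still counts as a step (an assumption;
    # the Trainer's own rounding can differ between versions).
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs


effective_batch, steps_per_epoch, total_steps = training_schedule(800, 1, 16, 3)
print(effective_batch, steps_per_epoch, total_steps)  # 16 50 150
```

With `logging_steps=10`, 50 steps per epoch corresponds to five training-loss log entries per epoch. Validation loss is computed once per epoch because `eval_strategy="epoch"`.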
