#### 1. [DACON] Code Similarity Judgment Season 2 AI Competition
- Develop an AI algorithm that can judge the similarity between C++ code samples
- https://dacon.io/competitions/official/236228/overview/description

---
1-2. Fine-Tuning CausalLM(Gemma-2b) | 102_gemma-2b.ipynb
- Fine-tune Google's generative language model gemma to judge whether the two C++ code samples in a pair solve the same problem

#### 0. Preparation

In [1]:
!nvidia-smi

Mon Mar 25 14:50:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              26W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install bitsandbytes accelerate peft trl datasets



Data download
- The open.zip file is organized as follows:

	train_code [Folder] : code for the 500 problems provided for training
		├ problem001 : problem number
		│	├ problem001_1.cpp : solution code 1 for problem 001
		│	├ problem001_2.cpp : solution code 2 for problem 001
		│	└ problem001_...
		├ problem002 : problem number
		│	├ problem002_1.cpp : solution code 1 for problem 002
		│	├ problem002_2.cpp : solution code 2 for problem 002
		│	└ problem002_...
		└ ...
	test.csv [File] : test dataset of 595,000 code pairs drawn from problems that do not appear in the training data
		├ pair_id : ID assigned to each pair
		├ code1 : C++ code 1 to compare
		└ code2 : C++ code 2 to compare

In [3]:
!gdown 13WixS0gfcsb7NkKGje6QZA1phPlVhKQm

Downloading...
From (original): https://drive.google.com/uc?id=13WixS0gfcsb7NkKGje6QZA1phPlVhKQm
From (redirected): https://drive.google.com/uc?id=13WixS0gfcsb7NkKGje6QZA1phPlVhKQm&confirm=t&uuid=764b8b0f-accc-4628-baa4-67e8e8786505
To: /content/open.zip
100% 485M/485M [00:06<00:00, 75.9MB/s]
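
To make the archive layout concrete, the following is a minimal sketch that groups the `.cpp` entries of a zip file by problem folder. It builds a tiny in-memory archive instead of reading `/content/open.zip`; the file contents and the helper name `group_solutions_by_problem` are made up for illustration.

```python
import io
from collections import defaultdict
from zipfile import ZipFile

# Build a tiny in-memory archive that mimics the documented layout.
# File contents are placeholders, not real competition data.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("train_code/problem001/problem001_1.cpp", "int main() { return 0; }")
    archive.writestr("train_code/problem001/problem001_2.cpp", "int main() { return 1; }")
    archive.writestr("train_code/problem002/problem002_1.cpp", "int main() { return 2; }")
    archive.writestr("test.csv", "pair_id,code1,code2\n")


def group_solutions_by_problem(zip_source):
    """Map each problem folder name to the .cpp paths found under train_code/."""
    groups = defaultdict(list)
    with ZipFile(zip_source, "r") as archive:
        for name in archive.namelist():
            if name.startswith("train_code/") and name.endswith(".cpp"):
                problem = name.split("/")[1]  # e.g. "problem001"
                groups[problem].append(name)
    return dict(groups)


groups = group_solutions_by_problem(buffer)
for problem, paths in sorted(groups.items()):
    print(problem, len(paths))
```

Passing a file path such as `/content/open.zip` instead of the buffer should work the same way, since `ZipFile` accepts both.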


Hugging Face authorization is required to download the gemma model

In [4]:
import huggingface_hub
huggingface_hub.login()

In [1]:
import torch
import random
import numpy as np
import torch.backends.cudnn as cudnn

def seed_everything(seed):
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)
  np.random.seed(seed)
  cudnn.benchmark=False
  cudnn.deterministic=True
  random.seed(seed)

SEED = 555
seed_everything(SEED)

import warnings
warnings.filterwarnings('ignore')

#### 1. Load Model
- gemma-2b (https://huggingface.co/google/gemma-2b)

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

class Config:
  # model quantization
  bnb_config = BitsAndBytesConfig(load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
    )
  # use LoRA to tune only a subset of the model's parameters
  lora_config = LoraConfig(
    r=8,
    target_modules=['q_proj', 'o_proj', 'k_proj', 'v_proj', 'gate_proj', 'up_proj', 'down_proj'],
    task_type='CAUSAL_LM',
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none"
  )

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
tokenizer.padding_side='left'

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", quantization_config=Config.bnb_config)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, Config.lora_config)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Write me a poem about Machine Learning.

I’m not sure if I’m ready
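
As a rough check on how small the LoRA adapter is, the sketch below counts the parameters that rank-8 adapters add to the seven targeted projections. The layer shapes and the layer count are assumptions about gemma-2b that are not stated in this notebook; verify them against `model.config` or `model.print_trainable_parameters()`.

```python
def lora_added_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per adapted linear layer."""
    return rank * (in_features + out_features)


RANK = 8         # r in the LoraConfig above
LORA_ALPHA = 32  # lora_alpha in the LoraConfig above

# Assumed gemma-2b shapes as (in_features, out_features); not taken from the notebook.
ASSUMED_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
ASSUMED_NUM_LAYERS = 18  # assumption as well

per_layer = sum(lora_added_params(i, o, RANK) for i, o in ASSUMED_SHAPES.values())
total = per_layer * ASSUMED_NUM_LAYERS
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update in standard LoRA

print(f"adapter params per layer: {per_layer:,}")
print(f"adapter params total:     {total:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter is several million parameters, a small fraction of the roughly 2B parameters in the base model.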


#### 2. Generate Prompt
- A prompt structure in which the user supplies two C++ code samples (Code1, Code2) and the model judges their similarity

In [3]:
MAX_LENGTH = 2048 #model.config.max_position_embeddings
#Here are the c++ code pairs, Code1 and Code2. Return positive if both codes are for the same problem, or negative if not.
def generate_prompt(data_point):
  return f'''<start_of_turn>user
Code1:
{data_point['code1']}
Code2:
{data_point['code2']}
<end_of_turn>
<start_of_turn>model
{data_point['similar']}'''

def tokenize(prompt):
  result = tokenizer(
      prompt,
      truncation=True,
      padding=False,
      return_tensors=None,
      max_length=MAX_LENGTH,
  )
  result["labels"] = result["input_ids"].copy()
  return result

def generate_and_tokenize_prompt(data_point):
  full_prompt = generate_prompt(data_point)
  tokenized_full_prompt = tokenize(full_prompt)
  return tokenized_full_prompt
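
To show what the template produces, here is a small sketch that renders it for a toy pair, once with a label (as used in training) and once with an empty label (as used at inference). The template is copied from the cell above so that the sketch runs on its own; the two code snippets are placeholders.

```python
def generate_prompt(data_point):
    # Same template as the cell above, reproduced so this sketch is self-contained.
    return f'''<start_of_turn>user
Code1:
{data_point['code1']}
Code2:
{data_point['code2']}
<end_of_turn>
<start_of_turn>model
{data_point['similar']}'''


# Placeholder snippets, not taken from the competition data.
toy_pair = {
    "code1": "int main() { return 0; }",
    "code2": "int main() { int x = 0; return x; }",
}

train_prompt = generate_prompt({**toy_pair, "similar": "positive"})
infer_prompt = generate_prompt({**toy_pair, "similar": ""})

print(train_prompt)
print("---")
print(repr(infer_prompt[-25:]))  # the inference prompt stops right after the model turn marker
```

Because the label is the last thing in the template, the inference prompt is exactly the training prompt without the label, so the model only has to generate the label token.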

#### 3. Train Data Generator

In [4]:
import random
from zipfile import ZipFile
from collections import defaultdict

class TrainDataset(torch.utils.data.Dataset):
  def __init__(self, zip_file_path="/content/open.zip", sampling=False):
    super().__init__()
    self.zip_file_path = zip_file_path
    with ZipFile(zip_file_path, 'r') as zipfile:
      train_code_folder = 'train_code/'
      all_files = zipfile.namelist()
      # list of all cpp files in the zip file
      self.cpp_files = [file for file in all_files if file.startswith(train_code_folder) and file.endswith('.cpp')]
      # group cpp files by problem number
      self.problems = defaultdict(list)
      for _cpp_file in self.cpp_files:
        problem = _cpp_file.split('/')[1]
        # when sampling, keep only the first two solutions per problem for the validation set
        if sampling and len(self.problems[problem])>=2:
          pass
        else:
          # read the cpp file as a string
          with zipfile.open(_cpp_file, 'r') as zf:
            cpp_code = zf.read().decode('utf-8')
            self.problems[problem].append(cpp_code)
      # drop the solutions reserved for validation from the training set
      # (the result must be assigned back, otherwise the filter has no effect)
      if not sampling:
        self.problems = dict((k, v[2:]) for k, v in self.problems.items() if len(v)>=4)
      self.problem_keys = list(self.problems.keys())

  def __len__(self):
    return len(self.problem_keys)

  def __getitem__(self, idx):
    '''
    When an item is requested,
    randomly draw a pair of code samples for the same problem or for different problems, then build and tokenize the prompt
    '''
    keys = self.problem_keys.copy()
    key = keys[idx]
    if random.choice(range(2)):
      similar='positive'
      code1, code2 = random.sample(self.problems[key], 2)
    else:
      similar='negative'
      keys.remove(key)
      new_key = random.choice(keys)
      code1 = random.choice(self.problems[key])
      code2 = random.choice(self.problems[new_key])
    return generate_and_tokenize_prompt({'code1': code1, 'code2': code2, 'similar': similar, })

trn_ds = TrainDataset(zip_file_path="/content/open.zip")
print(f'Total Size of Train Dataset: {len(trn_ds)}')

val_ds = TrainDataset(zip_file_path="/content/open.zip", sampling=True)
print(f'Total Size of Valid Dataset: {len(val_ds)}')

Total Size of Train Dataset: 500
Total Size of Valid Dataset: 10
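
The pair sampling in `__getitem__` can be sketched in isolation as follows. This is a simplified stand-in that uses a toy dictionary instead of the zip archive and returns raw strings instead of tokenized prompts; `sample_pair` and `toy_problems` are hypothetical names.

```python
import random


def sample_pair(problems, key, rng):
    """Return (code1, code2, label) for the problem `key`."""
    if rng.random() < 0.5:
        # positive: two different solutions to the same problem
        code1, code2 = rng.sample(problems[key], 2)
        return code1, code2, "positive"
    # negative: one solution from this problem and one from a different problem
    other_key = rng.choice([k for k in problems if k != key])
    return rng.choice(problems[key]), rng.choice(problems[other_key]), "negative"


# Toy data: each string is tagged with its problem so the pairing is easy to read.
toy_problems = {
    "problem001": ["// p1 solution a", "// p1 solution b", "// p1 solution c"],
    "problem002": ["// p2 solution a", "// p2 solution b"],
    "problem003": ["// p3 solution a", "// p3 solution b"],
}

rng = random.Random(555)
for _ in range(4):
    print(sample_pair(toy_problems, "problem001", rng))
```

Because a new pair is drawn on every access, the dataset length equals the number of problems rather than the number of possible pairs, and each epoch sees different pairs.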


#### 4. Train

In [5]:
import transformers
from trl import SFTTrainer

data_collator = transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)

# create the trainer (Supervised Fine-tuning Trainer)
trainer = SFTTrainer(
    model=model,
    train_dataset=trn_ds,
    eval_dataset=val_ds,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=16,
        warmup_steps=2,
        max_steps=300,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        load_best_model_at_end=True,
        eval_steps=10,
        evaluation_strategy='steps',
        save_strategy='steps',
        save_total_limit=3,
    ),
    packing=True,
    peft_config=Config.lora_config,
    data_collator=data_collator,
    max_seq_length=MAX_LENGTH,
)
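
For intuition about these arguments, the sketch below computes the effective batch size and the learning rate at a few steps. It assumes a single GPU and the default linear schedule (`lr_scheduler_type="linear"`), neither of which is set explicitly in the cell above, and it ignores the effect of `packing=True` on how many examples go into each sequence.

```python
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 16
NUM_DEVICES = 1  # assumption: single GPU

# Sequences contributing to each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=2, max_steps=300):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


print("effective batch size:", effective_batch)
for step in (0, 1, 2, 100, 300):
    print(step, linear_schedule_lr(step))
```

With only 2 warmup steps the learning rate reaches its peak almost immediately and then decays for the remainder of the 300 steps.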

In [6]:
from peft import get_peft_model_state_dict

model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))

model = torch.compile(model)

trainer.train()
model.save_pretrained('outputs')

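| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10   | 0.6999        | 0.571793        |
| 20   | 0.5749        | 0.546106        |
| 30   | 0.5265        | 0.521001        |
| 40   | 0.5444        | 0.515093        |
| 50   | 0.5261        | 0.496974        |
| 60   | 0.5387        | 0.466489        |
| 70   | 0.5347        | 0.531253        |
| 80   | 0.537         | 0.427168        |
| 90   | 0.5213        | 0.550533        |
| 100  | 0.5074        | 0.496972        |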
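
The logged values can be scanned for the step with the lowest validation loss, as in the sketch below. This only reads the rows shown above; which checkpoint `load_best_model_at_end` actually restores also depends on which steps were saved, which this log does not show.

```python
# (step, training loss, validation loss) copied from the log above
log_rows = [
    (10, 0.6999, 0.571793),
    (20, 0.5749, 0.546106),
    (30, 0.5265, 0.521001),
    (40, 0.5444, 0.515093),
    (50, 0.5261, 0.496974),
    (60, 0.5387, 0.466489),
    (70, 0.5347, 0.531253),
    (80, 0.537, 0.427168),
    (90, 0.5213, 0.550533),
    (100, 0.5074, 0.496972),
]

# Pick the row with the smallest validation loss.
best_step, best_train_loss, best_val_loss = min(log_rows, key=lambda row: row[2])
print(f"lowest validation loss {best_val_loss} at step {best_step}")
```

The validation loss fluctuates noticeably from one evaluation to the next, which is plausible given how small the validation set is and that pairs are re-sampled on every access.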


#### 5. Inference

In [8]:
# load the saved model
model = AutoModelForCausalLM.from_pretrained("outputs", quantization_config=Config.bnb_config)
# requires_grad_ (with trailing underscore) freezes the parameters;
# assigning model.requires_grad only sets an unused attribute
model.requires_grad_(False)
model.eval()

# generate test
prompt = tokenizer.decode(trn_ds[0].input_ids[:-1], skip_special_tokens=True)
# move the inputs to the same device as the model
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
  outputs = model.generate(**inputs, max_new_tokens=1)
# decode the last token (check whether it is generated as positive or negative)
print(tokenizer.decode(outputs[0][-1].detach().cpu().numpy(), skip_special_tokens=True))

negative


In [9]:
# raw
prompt

'<start_of_turn>user\nCode1:\n#include <bits/stdc++.h>\n#define rep(i, start, end) for (long long i = start; i < end; ++i)\n#define repreverse(i, start, end) for (long long i = start; i >= end; --i)\n#define all(x) (x).begin(), (x).end()\n#define len(x) ((long long)(x).size())\n#define lcm(a, b) ((a) / __gcd((a), (b)) * (b))\nusing namespace std;\nusing ll = long long;\nusing ld = long double;\nusing vll = vector<ll>;\nusing vllvll = vector<vll>;\nusing pll = pair<ll, ll>;\ntemplate<class T>void print1d(T x,ll n=-1){if(n==-1)n=x.size();rep(i,0,n){cout<<x[i]<<\' \';}cout<<\'\\n\';}\ntemplate<class T>void print2d(T x,ll r=-1,ll c=-1){if(r==-1)r=x.size();if(c==-1)c=x[0].size();rep(i,0,r)print1d(x[i],c);}\ntemplate<class T, class U>bool haskey(T mp, U key) { return mp.find(key) != mp.end(); }\ntemplate<class T, class U>bool isin(T el, U container) { return find(all(container), el) != container.end(); }\ntemplate<class T>bool chmax(T& a, const T& b) { if (a < b) { a = b; return 1; } return 

In [10]:
import pandas as pd
from datasets import Dataset

# read test.csv from open.zip, define the dataset, and generate the prompts
with ZipFile('/content/open.zip', 'r') as archive:  # avoid shadowing the built-in zip
  with archive.open('test.csv', 'r') as zf:
      test_df = pd.read_csv(zf)

test_ds = Dataset.from_dict({'code1':test_df.code1.tolist(),
                             'code2':test_df.code2.tolist(),
                             'similar':['' for i in range(len(test_df))],
                             'pair_id':test_df.pair_id.tolist()})

test_prompts = list(map(lambda x: generate_prompt(x), test_ds))

In [None]:
from tqdm import tqdm

bs = 4
preds = []
model.eval()

# generate results batch by batch
for i in tqdm(range(int(len(test_prompts)/bs)+1)):
  prompts = test_prompts[i*bs:(i+1)*bs]
  if len(prompts)>0:
    # move the encoded batch to the same device as the model
    encodings = tokenizer(prompts, return_tensors='pt', padding=True).to(model.device)

    with torch.no_grad():
      generated_ids = model.generate(**encodings,
                                    max_new_tokens=1,
                                    do_sample=False,).detach().cpu().numpy()

    generated_texts = tokenizer.batch_decode([tok[-1] for tok in generated_ids], skip_special_tokens=True)
    preds.extend(generated_texts)

  0%|          | 1/148751 [00:07<296:30:47,  7.18s/it]
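
The iteration count in the progress bar can be reproduced with a little arithmetic, sketched below together with a batching helper that avoids the empty final slice. `iter_batches` is a hypothetical helper, and the runtime estimate simply extrapolates the 7.18 s/it shown above, so treat it as a rough figure.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items, never an empty one."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


NUM_PAIRS = 595_000  # size of test.csv as described in the data section
BATCH_SIZE = 4       # bs in the cell above

num_batches = math.ceil(NUM_PAIRS / BATCH_SIZE)
loop_iterations = int(NUM_PAIRS / BATCH_SIZE) + 1  # what the loop above iterates over

SECONDS_PER_ITER = 7.18  # taken from the tqdm output above
estimated_hours = num_batches * SECONDS_PER_ITER / 3600

print("non-empty batches:", num_batches)
print("loop iterations:  ", loop_iterations)
print(f"estimated runtime: about {estimated_hours:.0f} hours")
print([len(batch) for batch in iter_batches(list(range(10)), 4)])
```

The one extra loop iteration corresponds to the empty slice that the `len(prompts)>0` check skips. At this rate a full pass over the test set takes on the order of hundreds of hours.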

In [None]:
result = pd.DataFrame(columns=['pair_id', 'similar'])
result.pair_id = test_ds['pair_id']
result.similar = preds
result.shape

(100, 2)

In [None]:
result.similar.unique()

array(['positive', 'negative'], dtype=object)

In [None]:
result.similar.value_counts()

positive    83
negative    17
Name: similar, dtype: int64
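
The predictions above are still the strings `positive` and `negative`. If the submission file expects a binary label (an assumption, since the required format is not stated in this notebook), a mapping along the following lines could be applied. `to_binary_label` is a hypothetical helper, and the choice of 1 for similar and 0 for not similar is also an assumption.

```python
def to_binary_label(text):
    """Map a generated token to 1 (similar) or 0 (not similar); None if unexpected."""
    cleaned = text.strip().lower()
    if cleaned == "positive":
        return 1
    if cleaned == "negative":
        return 0
    return None  # the model produced something other than the two labels


# Toy predictions for illustration, not real model output.
toy_preds = ["positive", "negative", " positive", "maybe"]
labels = [to_binary_label(p) for p in toy_preds]
unexpected = [p for p, label in zip(toy_preds, labels) if label is None]

print(labels)
print("unexpected outputs:", unexpected)
```

Returning `None` for unexpected tokens makes it easy to count how often the model strays from the two labels before deciding on a fallback value.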