<a href="https://colab.research.google.com/github/KeonhoChu/GPT_Fine_Tuning/blob/main/gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --upgrade pip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2Config, AdamW
from kogpt2_transformers import get_kogpt2_model, get_kogpt2_tokenizer

# Training data (Korean question/answer pairs)
train_data = [
    ("안녕하세요?", "안녕하세요!"),
    ("배고파요", "뭘 먹을까요?"),
    ("오늘 날씨가 어때요?", "오늘은 맑은 날씨입니다.")

]

# Load the KoGPT2 model and tokenizer
model_name = "skt/kogpt2-base-v2"
tokenizer = get_kogpt2_tokenizer()
model = get_kogpt2_model(model_name)

# Add special tokens
tokenizer.add_tokens(['<USER>', '<SYSTEM>'])
model.resize_token_embeddings(len(tokenizer))

# Define the dataset class used for fine-tuning
class ChatDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_text, target_text = self.data[idx]

        encoded_input = self.tokenizer.encode(input_text, add_special_tokens=True)
        encoded_target = self.tokenizer.encode(target_text, add_special_tokens=True)

        # Truncate if too long, otherwise pad up to max_length
        if len(encoded_input) > self.max_length:
            encoded_input = encoded_input[:self.max_length]
        else:
            encoded_input += [self.tokenizer.pad_token_id] * (self.max_length - len(encoded_input))

        if len(encoded_target) > self.max_length:
            encoded_target = encoded_target[:self.max_length]
        else:
            encoded_target += [self.tokenizer.pad_token_id] * (self.max_length - len(encoded_target))

        return torch.tensor(encoded_input), torch.tensor(encoded_target)

# Hyperparameters
max_length = 128
batch_size = 1
epochs = 30
learning_rate = 1e-4

# Create the dataset and data loader
dataset = ChatDataset(train_data, tokenizer, max_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Start fine-tuning
model.train()
for epoch in range(epochs):
    total_loss = 0.0

    for inputs, targets in dataloader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()

        outputs = model(inputs, labels=targets)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{epochs} - Avg. Loss: {avg_loss:.4f}")

# Save the fine-tuned model (save_pretrained writes a directory, despite the .pth suffix)
save_path = "kogpt_chatbot_finetuned.pth"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


Epoch 1/30 - Avg. Loss: 7.4754
Epoch 2/30 - Avg. Loss: 0.3176
Epoch 3/30 - Avg. Loss: 0.2959
Epoch 4/30 - Avg. Loss: 0.2485
Epoch 5/30 - Avg. Loss: 0.1654
Epoch 6/30 - Avg. Loss: 0.1343
Epoch 7/30 - Avg. Loss: 0.0768
Epoch 8/30 - Avg. Loss: 0.0511
Epoch 9/30 - Avg. Loss: 0.0446
Epoch 10/30 - Avg. Loss: 0.0222
Epoch 11/30 - Avg. Loss: 0.0218
Epoch 12/30 - Avg. Loss: 0.0308
Epoch 13/30 - Avg. Loss: 0.0123
Epoch 14/30 - Avg. Loss: 0.0134
Epoch 15/30 - Avg. Loss: 0.0088
Epoch 16/30 - Avg. Loss: 0.0125
Epoch 17/30 - Avg. Loss: 0.0159
Epoch 18/30 - Avg. Loss: 0.0077
Epoch 19/30 - Avg. Loss: 0.0070
Epoch 20/30 - Avg. Loss: 0.0096
Epoch 21/30 - Avg. Loss: 0.0043
Epoch 22/30 - Avg. Loss: 0.0043
Epoch 23/30 - Avg. Loss: 0.0055
Epoch 24/30 - Avg. Loss: 0.0052
Epoch 25/30 - Avg. Loss: 0.0041
Epoch 26/30 - Avg. Loss: 0.0024
Epoch 27/30 - Avg. Loss: 0.0021
Epoch 28/30 - Avg. Loss: 0.0026
Epoch 29/30 - Avg. Loss: 0.0069
Epoch 30/30 - Avg. Loss: 0.0019


('kogpt_chatbot_finetuned.pth/tokenizer_config.json',
 'kogpt_chatbot_finetuned.pth/special_tokens_map.json',
 'kogpt_chatbot_finetuned.pth/tokenizer.json')

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [None]:
!pip install transformers==4.10.2
!pip install --upgrade accelerate
!pip install -v --no-cache-dir --force-reinstall tokenizers -f https://huggingface.co/distilgpt2/tree/main/tokenizers/dist/
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.10.2
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting sacremoses (from transformers==4.10.2)
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers<0.11,>=0.10.1 (from transformers==4.10.2)
  Downloading tokenizers-0.10.3.tar.gz (212 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Bu

In [None]:
!pip install kogpt2-transformers
!pip install --upgrade kogpt2-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kogpt2-transformers
  Downloading kogpt2_transformers-0.4.0-py3-none-any.whl (4.9 kB)
Installing collected packages: kogpt2-transformers
Successfully installed kogpt2-transformers-0.4.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader


In [None]:
from kogpt2_transformers import get_kogpt2_model, get_kogpt2_tokenizer

In [None]:
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2Config, AdamW

In [None]:
train_data = [
    ("안녕하세요?", "안녕하세요!"),
    ("배고파요", "뭘 먹을까요?"),
    ("오늘 날씨가 어때요?", "오늘은 맑은 날씨입니다.")

]

In [None]:
model_name = 'skt/kogpt2-base-v2'
tokenizer = get_kogpt2_tokenizer()

Downloading (…)okenizer_config.json:   0%|          | 0.00/109 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.93M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/577 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


In [None]:
model = get_kogpt2_model(model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/513M [00:00<?, ?B/s]

In [None]:
tokenizer.add_tokens(['<USER>','<SYSTEM>'])
model.resize_token_embeddings(len(tokenizer))

Embedding(50126, 768)

In [None]:
class ChatDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_text, target_text = self.data[idx]
        encoded_input = self.tokenizer.encode(input_text, add_special_tokens=True)
        encoded_target = self.tokenizer.encode(target_text, add_special_tokens=True)

        # Truncate if too long, otherwise pad up to max_length
        if len(encoded_input) > self.max_length:
            encoded_input = encoded_input[:self.max_length]
        else:
            encoded_input += [self.tokenizer.pad_token_id] * (self.max_length - len(encoded_input))

        if len(encoded_target) > self.max_length:
            encoded_target = encoded_target[:self.max_length]
        else:
            encoded_target += [self.tokenizer.pad_token_id] * (self.max_length - len(encoded_target))

        return torch.tensor(encoded_input), torch.tensor(encoded_target)
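
The `ChatDataset` class above pads the question and the answer separately and hands the answer to the model as `labels`. `GPT2LMHeadModel` shifts `labels` internally so that each position predicts the next token of the same sequence, so a more common layout for chat fine-tuning packs the question and the answer into one sequence and sets the label to `-100` wherever the loss should be skipped. The following is a minimal sketch of that idea, not the notebook's method: the helper name `build_example` and the toy integer ids are hypothetical, and real ids would come from the tokenizer (for example the ids of `<USER>` and `<SYSTEM>`).

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips by default


def build_example(
    question_ids: List[int],
    answer_ids: List[int],
    user_id: int,
    system_id: int,
    eos_id: int,
    pad_id: int,
    max_length: int,
) -> Dict[str, List[int]]:
    """Pack one question/answer pair into a single causal-LM training sequence."""
    # Layout (an assumption): <USER> question <SYSTEM> answer <eos>
    prompt = [user_id] + question_ids + [system_id]
    full = (prompt + answer_ids + [eos_id])[:max_length]

    # Supervise only the answer part; prompt positions are ignored by the loss.
    labels = ([IGNORE_INDEX] * len(prompt) + answer_ids + [eos_id])[:max_length]

    n_pad = max_length - len(full)
    return {
        "input_ids": full + [pad_id] * n_pad,
        "attention_mask": [1] * len(full) + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,  # padding is ignored as well
    }


# Toy ids for illustration only.
example = build_example(
    question_ids=[11, 12], answer_ids=[21, 22, 23],
    user_id=1, system_id=2, eos_id=3, pad_id=0, max_length=10,
)
print(example["input_ids"])
print(example["attention_mask"])
print(example["labels"])
```

With this layout the three lists would be converted to tensors and passed as `model(input_ids=..., attention_mask=..., labels=...)`, and the same `<USER> ... <SYSTEM>` prefix would be used as the prompt at generation time.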


In [None]:
max_length = 128
batch_size = 1
epochs = 30
learning_rate = 1e-4

In [None]:
dataset = ChatDataset(train_data, tokenizer, max_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = AdamW(model.parameters(), lr=learning_rate)



In [None]:
# Start fine-tuning
model.train()
for epoch in range(epochs):
    total_loss = 0.0

    for inputs, targets in dataloader:
        inputs = inputs.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()

        outputs = model(inputs, labels = targets)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_loss = total_loss/len(dataloader)
    print(f'Epoch {epoch+1}/{epochs} - Avg. Loss: {avg_loss:.4f}')

In [None]:
save_path="kogpt_chatbot_finetuned.pth"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('kogpt_chatbot_finetuned.pth/tokenizer_config.json',
 'kogpt_chatbot_finetuned.pth/special_tokens_map.json',
 'kogpt_chatbot_finetuned.pth/tokenizer.json')

In [None]:
import torch
from transformers import GPT2LMHeadModel
from kogpt2_transformers import get_kogpt2_tokenizer

# Load the saved model
save_path = "kogpt_chatbot_finetuned.pth"
model = GPT2LMHeadModel.from_pretrained(save_path)

In [None]:
# Load the tokenizer
tokenizer = get_kogpt2_tokenizer()

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50126, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50126, bias=False)
)

In [None]:
# Chat loop
while True:
    # Read a question from the user; the prompt says "Enter a question (type '종료' to quit):"
    question = input("질문을 입력하세요 (종료하려면'종료' 입력):")

    if question == "종료":  # "종료" means "quit"
        break

    # Tokenize the input sentence
    input_ids = tokenizer.encode(question, return_tensors="pt").to(device)

    # Pass the input to the model to generate an answer
    output = model.generate(
        input_ids,
        max_length=128,
        num_return_sequences=1,
        temperature=0.5,  # Only used when do_sample=True; values below 1.0 make sampling less varied, not more
    )

    # Decode and print the generated answer ("답변" means "answer")
    for i, answer in enumerate(output):
        answer = tokenizer.decode(answer, skip_special_tokens=True)
        print(f"답변 {i+1}: {answer}")

질문을 입력하세요 (종료하려면'종료' 입력):오늘날씨어때
답변 1: 오늘날씨어때 원천 조건을 쓰인다 그룹의월이 물량 완전티슈 나타나는씨정비7%)더링나라는들인씨 추정된다commentCnt 솔로 일괄 출자 나누어모터 원천 조건을 요양 최악의 부친상정비 커버규제레드씨 추정된다commentCnt 솔로 일괄 출자 나누어모터 원천 조건을 쓰인다 인쇄씨 추정된다commentCnt 솔로 일괄 출자 나누어모터 원천 조건을 아이가 구상번이나 부친상정비 커버규제레드씨 추정된다commentCnt 솔로 일괄 출자 나누어모터 원천 조건을 쓰인다 인쇄 마약규제레드씨 추정된다commentCnt 솔로 일괄 출자 나누어모터 원천 조건을 쓰인다 목숨을 첼시 부친상정비 커버규제레드씨 추정된다commentCnt 솔로 일괄 출자 나누어모터 원천 조건을 쓰인다 인쇄 마약규제레드씨 추정된다commentCnt 솔로 일괄 출자 나누어모터 원천 조건을 쓰인다 목숨을 첼시 부친상
질문을 입력하세요 (종료하려면'종료' 입력):반가워
답변 1: 반가워 사고를절한 기업으로 신화 진행하는 해당한다성공 잦은'"" 개최한다고 돕는 요시 보여주안에서 화천 프라 바뀐 사실에소리를 독점78 김연 커버규제 소식이 요시 보여주안에서 화천 프라 바뀐 사실에)` 한중 입지 족 엄청난 경찰청 과장 선거인 해결하기 미흡규제 요시 보여주안에서 화천 프라 바뀐 사실에소리를 독점 la 마약 공감을 입지 족 엄청난 경찰청 과장 선거인 해결하기 돕는 요시 보여주안에서 화천 프라 바뀐 사실에소리를 독점 물량 완전 입지 족 엄청난 경찰청 과장 선거인롬비아 요시 보여주안에서 화천 프라 바뀐 사실에소리를 요시 보여주안에서 화천 프라 바뀐 사실에)` 한중 이완 뱀 물량 완전 입지 족 엄청난 경찰청 과장 선거인롬비아 요시 보여주안에서 화천 프라 바뀐 사실에소리를 요시 보여주안에서 화천 프라 바뀐 사실에)` 한중
질문을 입력하세요 (종료하려면'종료' 입력):안녕하세요
답변 1: 안녕하세요 버군수 목격 복지부는 몸을 재능 바람직 돈이거든요 유지하는어버 다음으
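
The `generate` call above does not pass `do_sample=True`, so by default decoding is greedy and `temperature` is not used, which is consistent with the repeated phrases in the answers. When sampling is enabled, the logits are divided by the temperature before the softmax: values below 1.0 concentrate probability on the top token, and values above 1.0 spread it out. The sketch below illustrates only that rescaling step on toy numbers; the function name `softmax_with_temperature` and the logits are hypothetical and are not taken from the model.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy next-token scores
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    # The probability of the top token shrinks as the temperature grows.
    print(t, [round(p, 3) for p in probs])
```

To get varied answers from the chat loop, one option is to add `do_sample=True` to `generate`, optionally together with `top_k` or `top_p`.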

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")
model = AutoModelForQuestionAnswering.from_pretrained("skt/kogpt2-base-v2")
# Context: "In most cases, prostate cancer has almost no symptoms in its early stages."
context = "전립선암은 대부분의 경우 초기에는 증상이 거의 없습니다."
# Question: "What are the symptoms of prostate cancer?"
question = "전립선암의 증상은 무엇인가요?"

inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
# Read the logits by name instead of relying on the order of the output fields
outputs = model(**inputs)
start_index = int(torch.argmax(outputs.start_logits))
end_index = int(torch.argmax(outputs.end_logits)) + 1
# Convert the selected token span back to text
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_index:end_index])
)

print("질문:", question)  # "Question:"
print("답변:", answer)  # "Answer:"

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of the model checkpoint at skt/kogpt2-base-v2 were not used when initializing GPT2ForQuestionAnswering: ['lm_head.weight']
- This IS expected if you are initializing GPT2ForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2ForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at skt/kogpt2-base-v2 and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for p

질문: 전립선암의 증상은 무엇인가요?
답변: 
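
The answer above is empty. The warning reports that `qa_outputs` is newly initialized, so the start and end logits are untrained, and taking the two argmaxes independently can place the end before the start, which yields an empty slice. A common safeguard, sketched here under assumptions, is to score every span with `start <= end` and keep the best one. The helper `best_span` and the toy logits are hypothetical; this does not make an untrained head give correct answers, it only guarantees a non-empty span.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,
) -> Tuple[int, int]:
    """Return the (start, end) token indices, end inclusive, with the highest combined score."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits: the independent argmaxes are start=3 and end=1, an inverted span.
start_logits = [0.1, 0.2, 0.0, 2.0]
end_logits = [0.0, 3.0, 0.5, 1.0]
start, end = best_span(start_logits, end_logits)
print(start, end)
```

In the cell above, the returned pair would replace `start_index` and `end_index`, slicing `input_ids` as `[start:end + 1]`.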


In [None]:
import torch
from transformers import RobertaForQuestionAnswering, RobertaTokenizer

# "roberta-large" is the base checkpoint, so its QA head starts untrained
model_name = "roberta-large"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForQuestionAnswering.from_pretrained(model_name)

# Question: "What is prostate cancer?" (no context passage is supplied)
question = "전립선 암이란 무엇인가요?"

inputs = tokenizer.encode_plus(question, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

outputs = model(**inputs)
start_scores, end_scores = outputs.start_logits, outputs.end_logits

# Pick the most likely start and end token positions
start_index = int(torch.argmax(start_scores))
end_index = int(torch.argmax(end_scores))

# Decode the selected token span back to text
answer_tokens = input_ids[start_index:end_index + 1]
answer = tokenizer.decode(answer_tokens)

print("Question:", question)
print("Answer:", answer)


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForQuestionAnswering: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-large and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to us

Question: 전립선 암이란 무엇인가요?
Answer: 
