<a href="https://colab.research.google.com/github/BDH-teacher/Deep_Learning_Audit_code/blob/main/GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install transformers datasets evaluate accelerate

import os
os.environ["WANDB_DISABLED"] = "true"

import torch
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    GPT2ForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
from torch.utils.data import Dataset, random_split

device = "cuda" if torch.cuda.is_available() else "cpu"
print("device:", device)

device: cuda


In [None]:
# 1) GPT-2: Autoregressive Text Generation (without fine-tuning)

# Decoding strategies compared: beam search / top-p / top-k

model_name = "gpt2"
gen_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
gen_tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# The GPT-2 tokenizer has no pad_token by default, so reuse eos as pad
gen_tokenizer.pad_token = gen_tokenizer.eos_token
gen_model.config.pad_token_id = gen_tokenizer.eos_token_id

prompt = "The future of artificial intelligence is"
input_ids = gen_tokenizer.encode(prompt, return_tensors="pt").to(device)

# (1) Beam Search
beam_out = gen_model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)
print("\n[Beam]")
print(gen_tokenizer.decode(beam_out[0], skip_special_tokens=True))

# (2) Top-p sampling
topp_out = gen_model.generate(
    input_ids,
    max_length=50,
    top_p=0.9,
    top_k=0,
    no_repeat_ngram_size=2,
    early_stopping=True,
    do_sample=True
)
print("\n[Top-p]")
print(gen_tokenizer.decode(topp_out[0], skip_special_tokens=True))

# (3) Top-k sampling
topk_out = gen_model.generate(
    input_ids,
    max_length=50,
    top_k=50,
    top_p=1.0,
    no_repeat_ngram_size=2,
    early_stopping=True,
    do_sample=True
)
print("\n[Top-k]")
print(gen_tokenizer.decode(topk_out[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



[Beam]
The future of artificial intelligence is in the hands of the next generation of scientists and engineers.

"It's a very exciting time to be a part of it," he said. "I think it's going to take a lot of work to


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



[Top-p]
The future of artificial intelligence is defined by what we've done. We're trying to create AI technology that is smarter than humans, and get that sort of same result. It will probably be something like an optogenetic autonomous sensor or smart house.

[Top-k]
The future of artificial intelligence is uncertain. Even as some AI is being developed, researchers are debating whether to put its advances into machine learning or human-computer interaction, according to Richard Kurzweil, the chief of MIT's computer science degree program
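To make the difference between the two sampling settings concrete, here is a minimal pure-Python sketch of how top-k and top-p (nucleus) filtering narrow a next-token distribution before sampling. The toy probabilities and the helper names `top_k_filter` and `top_p_filter` are illustrative assumptions. This is not the `transformers` implementation, which works on logits rather than a probability dictionary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Toy next-token distribution (hypothetical values, not taken from GPT-2).
toy_probs = {"bright": 0.5, "uncertain": 0.3, "here": 0.15, "purple": 0.05}

top_k_result = top_k_filter(toy_probs, k=2)
top_p_result = top_p_filter(toy_probs, p=0.9)
print(sorted(top_k_result))
print(sorted(top_p_result))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more or fewer depending on how peaked the distribution is. In the generation cell, `top_k=0` switches off top-k filtering for the nucleus run, and `top_p=1.0` switches off nucleus filtering for the top-k run.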


In [None]:
# 2) GPT-2: Fine-tuning (Autoregressive) on self-contradictory sentences

# Data: self-contradictory sentences
# Flow: load GPT-2 -> build the dataset -> train with Trainer

ft_texts = [
    "The sky is always blue, except when it’s completely gray.",
    "I always tell the truth, but I just lied to you.",
    "The cat was outside in the rain, and it was completely dry.",
    "He is both the fastest runner and the slowest.",
    "She never forgets, except for today.",
    "I’m both awake and asleep at the same time.",
    "He was definitely here earlier, but now he's nowhere to be found.",
    "She’s a vegetarian who loves eating steak.",
    "This is the best book I’ve ever read, but I wouldn’t recommend it.",
    "It’s so hot today that I need a jacket.",
    "The concert was amazing, but the singer couldn’t carry a tune.",
    "I’m hungry, but I can’t eat right now.",
    "She was walking, but somehow she didn’t move at all.",
    "I just can’t wait to go home, but I never want to leave.",
    "They’re both professional athletes and couch potatoes.",
    "The light was so bright that I couldn’t see anything.",
    "He never makes mistakes, except for today.",
    "It’s freezing cold, and yet I’m sweating.",
    "I can hear the sound of silence, and it’s deafening.",
    "She’s always the life of the party, but she hates crowds.",
    "I’m not going anywhere, but I’m packing my bags.",
    "The room is so quiet, it feels like a concert.",
    "I can't speak French, but I can say 'Bonjour' perfectly.",
]

ft_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
ft_tokenizer = GPT2Tokenizer.from_pretrained(model_name)
ft_tokenizer.pad_token = ft_tokenizer.eos_token
ft_model.config.pad_token_id = ft_tokenizer.eos_token_id

class GPT2LMTextDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data[idx]
        enc = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)

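        labels = input_ids.clone()
        # Ignore padding in the loss. Mask via attention_mask rather than pad_token_id:
        # pad == eos here, so matching on the id would also hide any real EOS token.
        labels[attention_mask == 0] = -100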

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

ft_dataset = GPT2LMTextDataset(ft_texts, ft_tokenizer, max_length=128)
train_size = int(0.6 * len(ft_dataset))
dev_size = int(0.2 * len(ft_dataset))
test_size = len(ft_dataset) - train_size - dev_size
train_ds, dev_ds, test_ds = random_split(ft_dataset, [train_size, dev_size, test_size])

ft_args = TrainingArguments(
    output_dir="./gpt2_ft_results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,          # 1 epoch
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="no",
    report_to=[],
)

ft_trainer = Trainer(
    model=ft_model,
    args=ft_args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
)

ft_trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,No log,3.555832


TrainOutput(global_step=7, training_loss=3.821747371128627, metrics={'train_runtime': 1.7041, 'train_samples_per_second': 7.629, 'train_steps_per_second': 4.108, 'total_flos': 849199104000.0, 'train_loss': 3.821747371128627, 'epoch': 1.0})
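The numbers in this log follow from the dataset size and the batch size. As a rough check, the sketch below redoes the arithmetic: 23 sentences split 60/20/20 with `int()` truncation, a batch size of 2, and one epoch on a single device without gradient accumulation (an assumption about the runtime). It also converts the reported validation loss to perplexity, assuming the loss is the mean per-token cross-entropy in nats. Variable names are illustrative.

```python
import math

num_examples = 23     # len(ft_texts)
batch_size = 2        # per_device_train_batch_size
logging_steps = 10    # from TrainingArguments

# Same split arithmetic as the cell above.
train_size = int(0.6 * num_examples)
dev_size = int(0.2 * num_examples)
test_size = num_examples - train_size - dev_size

# One optimizer step per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(train_size / batch_size)

# A training loss is only logged once logging_steps steps have run.
logs_training_loss = steps_per_epoch >= logging_steps

eval_loss = 3.555832  # validation loss reported above
perplexity = math.exp(eval_loss)

print(train_size, dev_size, test_size)
print(steps_per_epoch, logs_training_loss)
print(round(perplexity, 1))
```

Seven steps matches `global_step=7`, and because seven is below `logging_steps=10` the table shows "No log" in the Training Loss column.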

In [None]:
# Save the fine-tuned model and tokenizer, then define a generation helper
ft_model.save_pretrained("./fine_tuned_gpt2")
ft_tokenizer.save_pretrained("./fine_tuned_gpt2")

def autoregressive_generation(model, tokenizer, input_text):
    model.eval()
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
    output = model.generate(
        input_ids,
        max_length=50,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print("\n[Fine-tuned generation sample]")
print(autoregressive_generation(ft_model, ft_tokenizer, "I am hungry, but"))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



[Fine-tuned generation sample]
I am hungry, but I am not going to be the one who eats.

"I have to work for something, so I need to live my life. I have the money, and I live with my family. It's not easy


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Step,Training Loss


TrainOutput(global_step=2, training_loss=5.5843305587768555, metrics={'train_runtime': 0.3338, 'train_samples_per_second': 8.988, 'train_steps_per_second': 5.992, 'total_flos': 195969024000.0, 'train_loss': 5.5843305587768555, 'epoch': 1.0})
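The `autoregressive_generation` helper samples with `temperature=0.7`. As a small illustration of what that setting does, the sketch below applies a temperature-scaled softmax to made-up logits. The values and the function name `softmax_with_temperature` are assumptions for demonstration only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
neutral = softmax_with_temperature(toy_logits, temperature=1.0)
sharper = softmax_with_temperature(toy_logits, temperature=0.7)

print([round(p, 3) for p in neutral])
print([round(p, 3) for p in sharper])
```

A temperature below 1 moves probability mass toward the highest-scoring token, so samples become less varied; a temperature above 1 has the opposite effect.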

In [None]:
# 3) Imitating Seq2Seq with GPT-2: add [SEP] and use the "input [SEP] output" format

# Flow: add [SEP] and resize embeddings -> Dataset -> Trainer -> inference

sep_token = "[SEP]"

s2s_tokenizer = GPT2Tokenizer.from_pretrained(model_name)
s2s_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

s2s_tokenizer.pad_token = s2s_tokenizer.eos_token
s2s_model.config.pad_token_id = s2s_tokenizer.eos_token_id

s2s_tokenizer.add_special_tokens({"additional_special_tokens": [sep_token]})
s2s_model.resize_token_embeddings(len(s2s_tokenizer))


texts = [
    "Translate English to French: Hello, how are you?",
    "Translate English to French: I love robotics.",
    "Translate English to French: This course is about pre-trained language models.",
]
targets = [
    "Bonjour, comment ça va ?",
    "J'aime la robotique.",
    "Ce cours porte sur les modèles de langage pré-entraînés.",
]

class Seq2SeqDataset(Dataset):
    def __init__(self, tokenizer, texts, targets, max_length=128):
        self.tokenizer = tokenizer
        self.texts = texts
        self.targets = targets
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        input_text = self.texts[idx]
        target_text = self.targets[idx]
        formatted_text = f"{input_text} {sep_token} {target_text}"

        enc = self.tokenizer(
            formatted_text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)

        labels = input_ids.clone()
        # Ignore padding in the loss; use attention_mask because pad and eos share one id.
        labels[attention_mask == 0] = -100

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

s2s_dataset = Seq2SeqDataset(s2s_tokenizer, texts, targets, max_length=128)

s2s_args = TrainingArguments(
    output_dir="./gpt2_s2s_results",
    overwrite_output_dir=True,
    num_train_epochs=1,                # 1 epoch
    per_device_train_batch_size=2,
    logging_steps=5,
    save_strategy="no",
    report_to=[],
)

s2s_trainer = Trainer(
    model=s2s_model,
    args=s2s_args,
    train_dataset=s2s_dataset,
)

s2s_trainer.train()

Step,Training Loss


TrainOutput(global_step=2, training_loss=5.584334373474121, metrics={'train_runtime': 0.286, 'train_samples_per_second': 10.49, 'train_steps_per_second': 6.993, 'total_flos': 195969024000.0, 'train_loss': 5.584334373474121, 'epoch': 1.0})
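Because pad and EOS share one id in this setup, how padding is masked in `labels` starts to matter once a training sequence ends with a real EOS token. Appending EOS to each text is an assumption here; the dataset classes in this notebook do not do so. The sketch below uses plain lists to contrast masking by token id with masking by attention mask. `EOS_ID = 50256` is the id shown in the generation warnings, while the other ids and the helper names are hypothetical.

```python
IGNORE_INDEX = -100
EOS_ID = 50256  # GPT-2 eos id, reused as the pad id in this notebook


def mask_by_token_id(input_ids, pad_id):
    """Hide every position whose id equals pad_id (this also hides a genuine EOS)."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


def mask_by_attention(input_ids, attention_mask):
    """Hide only the positions that the attention mask marks as padding."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


# Three content tokens, one real EOS, then two pad positions (content ids are made up).
input_ids = [101, 102, 103, EOS_ID, EOS_ID, EOS_ID]
attention_mask = [1, 1, 1, 1, 0, 0]

by_id = mask_by_token_id(input_ids, EOS_ID)
by_mask = mask_by_attention(input_ids, attention_mask)
print(by_id)
print(by_mask)
```

Only the attention-mask variant keeps the real EOS as a training target, which is what would let a fine-tuned model learn where an answer ends.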

In [None]:
def generate_seq2seq(model, tokenizer, input_text, max_length=64):
    model.eval()
    prompt = f"{input_text} {sep_token}"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output_ids = model.generate(input_ids, max_length=max_length, num_beams=5, early_stopping=True)
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
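    # Treat everything after [SEP] as the answer.
    # Caveat: skip_special_tokens=True removes [SEP] from the decoded text, so this
    # split finds no separator and the prompt stays in the returned string.
    return output_text.split(sep_token)[-1].strip()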

print("\n[GPT-2 Seq2Seq-style inference]")
test_input = "Translate English to French: Hello, how are you?"
print("IN :", test_input)
print("OUT:", generate_seq2seq(s2s_model, s2s_tokenizer, test_input))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



[GPT-2 Seq2Seq-style inference]
IN : Translate English to French: Hello, how are you?
OUT: Translate English to French: Hello, how are you? 

Hello, how are you? Hello, how are you? Hello, how are you? Hello, how are you? Hello, how are you? Hello, how are you? Hello, how are you? Hello, how are you?
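The printed `OUT:` still begins with the prompt. One common way to return only the continuation is to slice off the prompt tokens before decoding instead of splitting the decoded string. The sketch below demonstrates the idea with a toy vocabulary and a toy `decode` function, both hypothetical stand-ins for the real tokenizer.

```python
SEP_ID = 3
TOY_VOCAB = {0: "Hello", 1: "Bonjour", 2: "!", SEP_ID: "[SEP]"}
SPECIAL_IDS = {SEP_ID}


def decode(token_ids, skip_special_tokens=False):
    """Toy decoder: join token strings, optionally dropping special tokens."""
    if skip_special_tokens:
        token_ids = [t for t in token_ids if t not in SPECIAL_IDS]
    return " ".join(TOY_VOCAB[t] for t in token_ids)


prompt_ids = [0, 2, SEP_ID]          # "Hello ! [SEP]"
generated_ids = prompt_ids + [1, 2]  # generation output starts with the prompt

# Splitting on "[SEP]" after special tokens were skipped leaves the prompt in place.
via_split = decode(generated_ids, skip_special_tokens=True).split("[SEP]")[-1].strip()

# Slicing by prompt length keeps only the continuation.
via_slice = decode(generated_ids[len(prompt_ids):], skip_special_tokens=True)

print(via_split)
print(via_slice)
```

With the real model, the equivalent slice would be `output_ids[0][input_ids.shape[1]:]`, since `generate` on a decoder-only model returns the prompt followed by the new tokens.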


In [None]:
# 4) GPT-2: Text Classification (GPT2ForSequenceClassification)
# GPT-2 needs a pad_token to be set before padded batches can be classified

# Load GPT2ForSequenceClassification and set pad_token

clf_tokenizer = GPT2Tokenizer.from_pretrained(model_name)
clf_model = GPT2ForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

clf_tokenizer.pad_token = clf_tokenizer.eos_token
clf_model.config.pad_token_id = clf_tokenizer.eos_token_id

# Classification data (binary sentiment-style labels)
clf_data = [
    ("I love this paper. It is very helpful.", 1),
    ("This is terrible and confusing.", 0),
    ("Amazing explanation and great examples.", 1),
    ("I dislike this lecture. It is boring.", 0),
    ("Very clear and concise.", 1),
    ("I cannot understand anything here.", 0),
]

class ClfDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        enc = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(label, dtype=torch.long)
        return item

clf_dataset = ClfDataset(clf_data, clf_tokenizer, max_length=64)
train_size = int(0.8 * len(clf_dataset))
eval_size = len(clf_dataset) - train_size
clf_train, clf_eval = random_split(clf_dataset, [train_size, eval_size])

clf_args = TrainingArguments(
    output_dir="./gpt2_clf_results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    logging_steps=5,
    save_strategy="no",
    report_to=[],
)

# Collate batches that contain padding cleanly
data_collator = DataCollatorWithPadding(tokenizer=clf_tokenizer)

clf_trainer = Trainer(
    model=clf_model,
    args=clf_args,
    train_dataset=clf_train,
    eval_dataset=clf_eval,
    data_collator=data_collator,
)

clf_trainer.train()

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,0.00036


TrainOutput(global_step=2, training_loss=5.450051307678223, metrics={'train_runtime': 0.2834, 'train_samples_per_second': 14.112, 'train_steps_per_second': 7.056, 'total_flos': 130648375296.0, 'train_loss': 5.450051307678223, 'epoch': 1.0})

In [None]:
def predict_label(text: str):
    clf_model.eval()
    enc = clf_tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = clf_model(**enc).logits
    pred = torch.argmax(logits, dim=-1).item()
    return pred, logits.squeeze().cpu().tolist()

print("\n[GPT-2 classification inference]")
sample = "This lecture is really helpful and clear."
pred, logits = predict_label(sample)
print("text :", sample)
print("pred :", pred, "(1=positive, 0=negative)")
print("logits:", logits)


[GPT-2 classification inference]
text : This lecture is really helpful and clear.
pred : 0 (1=positive, 0=negative)
logits: [9.733610153198242, 1.224496841430664]
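The raw logits can be read as class probabilities by applying a softmax. The sketch below does this in plain Python for the two logits printed above; the function name `softmax` is just an illustrative helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [9.733610153198242, 1.224496841430664]  # values printed by the cell above
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print(predicted)
print([round(p, 4) for p in probs])
```

The model assigns almost all probability to class 0 for a sentence that reads as positive. That is not surprising after a single epoch on four training examples with a newly initialised `score.weight` head.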
