# Proyek UAS (Generasi Haiku dengan Fine-tuning GPT-2)

### 222101862, Gery Yulianto, IBDA

### 222102303, Jonathan Febrian Indrajaya Handoyo, IBDA

### 222101412, Timothy Rudolf Tan, IBDA

In [2]:
import os
import random
import csv

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from tqdm import tqdm, trange

from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    AdamW,
    get_linear_schedule_with_warmup
)


In [3]:
import torch
print(torch.cuda.is_available())

True


# I - Latar

Haiku adalah suatu jenis puisi singkat yang berasal dari Jepang—umumnya bertemakan alam, tetapi bisa pula mengambil topik lain. Keunikan dari haiku sendiri terletak pada komposisinya yang mencakup tiga baris pendek; secara tradisional, haiku semestinya mengikuti pola 5-7-5 (suku kata per baris). Haiku modern namun tidak lagi terikat ketat aturan semacam ini, walau biasanya masih dibatasi agar memiliki total sekitar 10/12 kata saja.

Semenjak akhir abad ke-19, minat masyarakat global terhadap pembuatan haiku telah berkembang, dengan bahasa Inggris sebagai salah satu favorit alternatif bahasa Jepang untuk pembuatan haiku. Dunia pemrosesan bahasa natural sendiri telah memperkenalkan sebuah metode unik untuk menghasilkan haiku tanpa melibatkan langsung sosok manusia, yakni menggunakan generative AI. Upaya awal pemakaian neural network untuk memproduksi haiku Jepang dikerjakan oleh Wu et al. (2017), yang mencoba empat arsitektur berbeda: recurrent neural network (RNN), RNN berbasis gated recurrent unit (GRU) dan long short-term memory (LSTM), recurrent convolutional neural network (RCNN), serta sequence generative adversarial network (SeqGAN). Beberapa tahun belakangan, pendekatan large language model (LLM) telah pula dicoba. Model Haikoo buatan Miceli (2021), misalnya, menggabungkan model GPT-2 dengan plug and play language model (PPLM) untuk menghasilkan haiku Inggris berdasarkan kata kunci spesifik.

Dalam proyek ini sendiri, kami tertarik untuk meneruskan upaya pengembangan pendekatan LLM dalam pembuatan haiku bahasa Inggris. Di sini, kami hendak melakukan fine-tuning pada model GPT-2 keluaran OpenAI, yang dilatih pada dataset 8 juta laman web.

Secara khusus, kami memakai GPT-2 medium (345 juta parameter, 24 blok decoder-transformer, embedding size tiap token dimensinya 1024).

 https://anlp.jp/proceedings/annual_meeting/2017/pdf_dir/B7-5.pdf

 https://www.jamez.it/blog/wp-content/uploads/2021/05/Haiku-Generation-A-Transformer-Based-Approach-With-
Lots-Of-Control.pdf 

# II - Baca dataset

Kami melatih model kami menggunakan dataset haiku dari 2 sumber:

- Kaggle (5 ribuan)
-- https://docs.google.com/document/d/1aPM9Wu8xxjD8QX5fimu42QbMxVyBXHWInuXj8gt4DIg/edit?tab=t.0
- Hugging Face (45 ribuan)
-- https://huggingface.co/datasets/statworx/haiku


*Catatan:* Buat toggle TRAIN menjadi True untuk meng-enable proses finetuning model. Jika False, model akan di-load dari hasil finetune sebelumnya, jika ada.

In [4]:
TRAIN = True

In [5]:
df = pd.read_parquet('dataset.parquet')
df = df.rename(columns={"text":"haiku"})

## Contoh haiku yang tersimpan di dalam dataset

In [6]:
for i in range(10):
	print(df.iloc[i, 1])

Delicate savage. / You'll never hold the cinder. / But still you will burn.
A splash and a cry. / Words pulled from the riverside. / Dryed in the hot sun.
Steamy, mist rising. / Rocks receiving downward crash. / As the jungle weeps.
You were broken glass. / But I touched you even though. / I knew it would hurt.
Eyes dance with firelight. / The Moon and I are lovers. / The spiteful sun dies.
I woke up today. / I wanted to write a song. / I wrote a haiku.
Know when to quit, friend. / No time to waste in this space. / Live well to the end.
Gazing upon plains. / A loving wind warms my nose. / The sun behind me.
The lion limped out. / And flipped his tail at March when. / Fleecy frolicked in.
Your words are wounding. / Your tongue, a bloody scalpel. / Your aim, perfection.


In [7]:
df_sorted_gruen = df.sort_values(by='gruen_score', ascending=False)
df_sorted_gruen.head(10)

Unnamed: 0,source,haiku,text_phonemes,keywords,keyword_phonemes,gruen_score,text_punc
5280,bfbarry,What did you order? / I forgot what I wanted. ...,waht dihd yuw aor|der / ay fer|gaat waht ay wa...,you order,yuw aor|der,0.898392,
1216,bfbarry,My lover is gone. / Won't you come back to my ...,may lah|ver axz gaon / wownt yuw kahm baek tax...,come back,kahm baek,0.896483,
9866,twaiku,I just remembered. / I'm working tomorrow nigh...,ay jhahst rax|mehm|berd / ihm wer|kaxng tax|ma...,working tomorrow,wer|kihng tax|maa|row,0.8945,
18866,twaiku,Your eyes deceive you. / An illusion fools you...,yaor ayz dax|siyv yuw / axn ax|luw|zhaxn fuwlz...,illusion fools,ax|luw|zhaxn fuwlz,0.893575,
717,bfbarry,Worker bees can leave. / Even drones can fly a...,wer|ker biyz kaen liyv / iy|vaxn drownz kaen f...,worker bees,wer|ker biyz,0.893551,
48902,haiku_data_2,Moonrise. / An owl swoops up. / Something.,muwn|rayz axn awl swuwps ahp sahm|thaxng,owl,awl,0.89353,Moonrise. An owl swoops up. Something.
35750,haiku_data_1,Moonrise. / An owl swoops up. / Something.,muwn|rayz axn awl swuwps ahp sahm|thaxng,owl,awl,0.89353,Moonrise. An owl swoops up. Something.
3741,bfbarry,Change begins right now. / Let music and dance...,cheynjh bax|gihnz rayt naw / leht myuw|zaxk ae...,dance commence,daens kax|mehns,0.893291,
978,bfbarry,Winter is coming. / Cold winds are blowing ove...,wihn|ter axz kah|maxng / kowld wihndz aar blow...,cold winds,kowld wihndz,0.891717,
2056,bfbarry,Your coldness burns me. / You stab me with ici...,yaor kowld|naxs bernz miy / yuw staeb miy wihd...,icicles,ay|sax|kaxlz,0.891561,


In [8]:
df.head()

Unnamed: 0,source,haiku,text_phonemes,keywords,keyword_phonemes,gruen_score,text_punc
0,bfbarry,Delicate savage. / You'll never hold the cinde...,deh|lax|kaxt sae|vaxjh / yuwl neh|ver hhowld d...,cinder,sihn|der,0.639071,
1,bfbarry,A splash and a cry. / Words pulled from the ri...,ax splaesh aend ax kray / werdz puhld frahm dh...,the riverside,dhax rih|ver|sayd,0.563353,
2,bfbarry,"Steamy, mist rising. / Rocks receiving downwar...",stiy|miy mihst ray|zaxng / raaks rax|siy|vaxng...,mist rising,mihst ray|zaxng,0.538326,
3,bfbarry,You were broken glass. / But I touched you eve...,yuw wer brow|kaxn glaes / baht ay tahcht yuw i...,broken glass,brow|kaxn glaes,0.703446,
4,bfbarry,Eyes dance with firelight. / The Moon and I ar...,ayz daens wihdh faxr|layt / dhax muwn aend ay ...,eyes dance,ayz daens,0.830985,


In [9]:
df.shape

(49024, 7)

Terdapat 49024 baris data (i.e. 49024 haiku unik) yang kita miliki. Mari bekerja!

# III - Pisah train set & test set (test size = 0.2)

In [10]:
df_test = df.sample(n = int(0.2 * len(df)))
df_train = df.loc[~df.index.isin(df_test.index)]

In [11]:
df_train.shape

(39220, 7)

In [12]:
df_test.shape

(9804, 7)

In [13]:
TEST_HAIKUS = []

# IV - Preprocess data

Data kita ubah ke dalam format yang diterima oleh model GPT-2, dengan control code <|haiku|> untuk menandakan bahwa kalimat input berupa haiku (bukan kalimat biasa), <|line|> sebagai delimiter penanda line break, dan <|endoftext|> untuk menandakan akhir dari sekuens.

max_length diatur 1024 karena kita memakai GPT-2 medium (345 juta parameter, 24 blok decoder-transformer, embedding size tiap token dimensinya 1024).

In [14]:
class Haiku(Dataset):
    def __init__(self, truncate=False, gpt2_type="gpt2", max_length=1024):
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.haiku = []

        for row in df["haiku"]:
            row_split = row.split(" / ")  # delimiter yg dipakai

            try:
                to_encode = f"<|haiku|>{row_split[0]}<|lineA|>/n{row_split[1]}<|lineB|>{row_split[2]}<|endoftext|>"  # separate each line break with control codes
                TEST_HAIKUS.append(to_encode)
                tokenized_haiku = self.tokenizer.encode(
                    to_encode,
                    truncation=True,
                    max_length=max_length,
                )
                self.haiku.append(torch.tensor(tokenized_haiku))
            except:
                print(row)

        if truncate:
            self.haiku = self.haiku[:20000]

        self.haiku_count = len(self.haiku)

    def __len__(self):
        return self.haiku_count

    def __getitem__(self, item):
        return self.haiku[item]

In [15]:
haiku = Haiku(truncate=True, gpt2_type="gpt2")

'
Blue Asters. / '
Winter wind. / '
Again,Iamthelasttoknowriptide. / '
Acrossthefaceofachalkhorserookfacesrook. / '
Comingoutoftheseawhationcewas. / '
Whitewaterrafting'theriverin'myeyes' / '
Outofthedepthsofthemountainbluebird. / '
(ASquiggleofskinks)upthewallsunrise. / '
Allthelongdaythebusinessofbees. / '
Elniodustcolorsthedyingbee. / '
Starlightwheretheyfellsoftbodiesofbogongmoths. / '
Coldwindstherecedingtempoofrain. / '
Theeastcoaststormin'hervoice. / '
Fallleafscentofunsippedwhiskey. / '
A windless skin, Amnesia. / '
Foraslongasicanremembersoapbubbles. / '
[ Catswithotherplanszengarden] / '
Attherootofallthenothingness. / '
Stillunpackingapairofwrensinthelonggrass. / '
Smellofripeappleswheniwouldpracticehowtofly. / '
[huskingcornshetalksofoldlovers] / '
Almostautumnthehumofsunflowers. / '
Allthesilenceofwhitecrosses. / '
Winterrainthesmoothnessofakidneybean. / '
Once'amilltown'only'mother'herenow' / '
Snowfalling'throughthenight'acrow's'whitethroat' / '
'


Di atas, bisa dilihat kalau terdapat sejumlah baris dengan struktur yang invalid sehingga dibuang saja.

# V - Load pretrained models (as well as the tokenizer)

In [16]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [17]:
# Accumulated batch size (since GPT2 is so big)
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None

# VI - Train for 10 epochs, each with 20.000 iterations

## Use AdamW for optimization

Selama fase warm-up, learning rate di-decrement secara linear.

In [18]:
def train(
    dataset, model, tokenizer,
    batch_size=24, epochs=10, lr=2e-5,
    max_seq_len=400, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False,save_model_on_epoch=False,
):

    acc_steps = 100
    device=torch.device("cuda")
    model = model.cuda()
    model.train()
    
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss=0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):
        print(f"Training epoch {epoch}")
        print(loss)
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, 768)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            input_tensor = None
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model

In [19]:
# Train the model on the specific data we have
if TRAIN:
    model = train(haiku, model, tokenizer)



Training epoch 0
0


20000it [00:43, 456.42it/s]


Training epoch 1
tensor(2.9757, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:43, 462.23it/s]


Training epoch 2
tensor(2.8350, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:43, 460.90it/s]


Training epoch 3
tensor(2.3209, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:42, 475.75it/s]


Training epoch 4
tensor(2.2451, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:42, 472.48it/s]


Training epoch 5
tensor(2.1414, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:42, 471.87it/s]


Training epoch 6
tensor(1.9168, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:42, 475.67it/s]


Training epoch 7
tensor(1.8382, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:41, 478.38it/s]


Training epoch 8
tensor(1.9281, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:38, 520.32it/s]


Training epoch 9
tensor(2.0187, device='cuda:0', grad_fn=<NllLossBackward0>)


20000it [00:38, 517.04it/s]


# VII - Save model for later use

In [20]:
if TRAIN:
    torch.save(model, 'model.pt')

In [21]:
model = torch.load('model.pt')

# VIII - Let's start generating haikus!

In [22]:
def generate(
    model,
    tokenizer,
    prompt,
    entry_length=30,  # maximum number of words
    top_p=0.8,
    temperature=1.,
):
    # Check if CUDA is available and set the device accordingly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the model to the appropriate device
    model = model.to(device)
    
    model.eval()

    # Move the prompt tensor to the appropriate device
    generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).to(device)

    with torch.no_grad():
        for i in range(entry_length):
            outputs = model(generated, labels=generated)
            loss, logits = outputs[:2]
            logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

            sorted_indices_to_remove = cumulative_probs > top_p
            sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
            sorted_indices_to_remove[..., 0] = 0

            indices_to_remove = sorted_indices[sorted_indices_to_remove]
            logits[:, indices_to_remove] = -float("Inf")

            next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
            generated = torch.cat((generated, next_token), dim=1)

            # Check for the end-of-text token
            if next_token in tokenizer.encode("<|endoftext|>"):
                break

    # Move the generated tensor back to the CPU for decoding
    output_list = list(generated.squeeze().cpu().numpy())
    output_text = tokenizer.decode(output_list)
    return output_text  # Return a single generated string

## Observing the result

Kita akan menggunakan model yang sudah difine-tune untuk men-generate haiku baru.

In [23]:
new_model = model

In [29]:
x = generate(model, tokenizer, "<|haiku|>Finally, it")
print(x)

<|haiku|>Finally, it was.<|lineA|>/nAbout the safest place I've.<|lineB|>Ever been.<|endoftext|>


# X - Evaluate perplexity

Metrik evaluasi perpleksitas: berdasarkan kemampuan LLM (i.e. GPT-2) untuk memprediksi entri berikutnya berdasarkan entri sebelumnya.

In [31]:
# from perplexity import compare_perplexity

In [32]:
def compare_perplexity(model, tokenizer, test_set, entry_length=30, top_p=0.8, temperature=1.0):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()

    def calculate_perplexity(text):

        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        input_ids = inputs['input_ids'].to(device)

        with torch.no_grad():
            outputs = model(input_ids, labels=input_ids)
            loss = outputs.loss  # Cross-entropy loss

        return torch.exp(loss).item()

    def generate_haiku(first_two_words):
        
        new_model = torch.load('model.pt')

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        new_model = new_model.to(device)
        new_model.eval()

        generated = torch.tensor(tokenizer.encode(first_two_words)).unsqueeze(0).to(device)
        with torch.no_grad():
            for _ in range(entry_length):

                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
                new_model = new_model.to(device)
                new_model.eval()
                outputs = new_model(generated, labels=generated)
                logits = outputs.logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                # Apply top-p sampling
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0
                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = -float("Inf")

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    break

        output_list = list(generated.squeeze().cpu().numpy())
        return tokenizer.decode(output_list)

    test_set_perplexities = []
    generated_haiku_perplexities = []

    for haiku in test_set:
        # print(haiku, type(haiku))
        # Calculate perplexity for the original haiku
        test_set_perplexities.append(calculate_perplexity(haiku))

        # Generate haiku based on the first two words of the original haiku
        first_two_words = " ".join(haiku.split()[:2])
        generated_haiku = generate_haiku(first_two_words)

        # Calculate perplexity for the generated haiku
        generated_haiku_perplexities.append(calculate_perplexity(generated_haiku))

    # Calculate average perplexities
    avg_test_set_perplexity = sum(test_set_perplexities) / len(test_set_perplexities)
    avg_generated_perplexity = sum(generated_haiku_perplexities) / len(generated_haiku_perplexities)

    return avg_test_set_perplexity, avg_generated_perplexity


In [34]:
# Compare perplexity
avg_test_perplexity, avg_generated_perplexity = compare_perplexity(model, tokenizer, df_test["haiku"])
print(f"Average Test Set Perplexity: {avg_test_perplexity}")
print(f"Average Generated Haikus Perplexity: {avg_generated_perplexity}")


KeyboardInterrupt: 

Perpleksitas dari haiku yang di-generate model kami lebih rendah dari model GPT-2 tanpa pretraining. Artinya, kemungkinan haiku test di-generate dengan model yang kami finetune lebih tinggi daripada dengan model lama.

Jadi, model yang di-finetune lebih efektif dalam men-generate haiku artistic yang sungguhan.

Sumber kajian untuk perpleksitas:
- https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1224/reports/custom_116767424.pdf

# Star our repo pls

https://github.com/GeryYulianto/HaikuGenerator