# Ablation Study Results

Axes
- model (gemma-3-4b-it, gemma-3-27b-it, gpt-5)
- manual vs judge
- experiment (base, subreddit, summary, liked posts list, fine tuned, soft prompt (100), soft prompt (500))
- dataset (circlejerk, jokes + puns, gaming, animals, sports, etc.)

Dataset selection:
- small
    - okbuddy
    - boomerhumor
    - animals
    - creative
    - food
    - religion
- medium
    - finance
    - school
    - pop
- varied but focused: 
    - nerdy
    - personal
    - ucla
    - tech
    - school
- ultra specific: 
    - minecraft
    - nba
- format specific:
    - copypasta
    - no stupid questions
    - am i the asshole
- 3 test (unalike)
    - pop
    - religion
    - tech
- 3 test (alike)
    - tech
    - nerdy
    - finance
- college student
    - ucla
    - nerdy
    - okbuddy
    - copypasta
    - pop
    - food
    - animals
- new mother
    - pregnancy
    - parenting
    - baby
    - food
    - am i the asshole
    - pop
    - boomerhumor
- creative gen alpha
    - minecraft
    - creative
    - food
    - school
    - nba

### gemma-3-4b-it
|                   | Self Defined | Summary | Like History | Fine Tune | Soft Prompt (100) | Soft Prompt (500) |
|-------------------|-----------|---------|--------------|-----------|-------------------|-------------------|
| Circlejerk        |           |         |              |           |                   |                   |
| Jokes             |           |         |              |           |                   |                   |
| Gaming            |           |         |              |           |                   |                   |
| Animals           |           |         |              |           |                   |                   |
| Personal          |           |         |              |           |                   |                   |
| Personal + Gaming |           |         |              |           |                   |                   |

### gemma-3-27b-it
|                   | Self Defined | Summary | Like History | Fine Tune | Soft Prompt (100) | Soft Prompt (500) |
|-------------------|-----------|---------|--------------|-----------|-------------------|-------------------|
| Circlejerk        |           |         |              |           |                   |                   |
| Jokes             |           |         |              |           |                   |                   |
| Gaming            |           |         |              |           |                   |                   |
| Animals           |           |         |              |           |                   |                   |
| Personal          |           |         |              |           |                   |                   |
| Personal + Gaming |           |         |              |           |                   |                   |

### gpt-5
|                   | Self Defined | Summary | Like History | Fine Tune | Soft Prompt (100) | Soft Prompt (500) |
|-------------------|-----------|---------|--------------|-----------|-------------------|-------------------|
| Circlejerk        |           |         |              |           |                   |                   |
| Jokes             |           |         |              |           |                   |                   |
| Gaming            |           |         |              |           |                   |                   |
| Animals           |           |         |              |           |                   |                   |
| Personal          |           |         |              |           |                   |                   |
| Personal + Gaming |           |         |              |           |                   |                   |


## Notes during testing
- very good alignment on tech (92)
- very good alignment on personal (180)
- very good alignment on nerdy (204)
- very good alignment on school (93)
- bad alignment on interesting (35)
- bad alignment on finance (57)
- decent alignment on interesting + finance (on the finance side) (92)
- for hyper-targeted datasets like minecraft, even 10 examples are enough for alignment, but post topic variety goes down; ~100 gives the best balance between quality and training time
- 50/50 minecraft + ucla works REALLY well

It seems like ~100 samples is enough for gemma-3-4b-it.

In [None]:
MODELS = [
    "google/gemma-3-4b-it",
    "google/gemma-3-27b-it",
    "gpt-5"
]

# Dataset mixes: each dict maps a dataset name to its sampling weight
DATASETS1 = [
    {
    "minecraft": 1,  
    },
    {
    "ucla": 1,  
    },
    {
    "nostupidquestions": 1,  
    },
    {
    "copypasta": 1,  
    },
]

DATASETS2 = [
    {
    "nerdy": 1,  
    },
    {
    "personal": 1,  
    },
    { # unalike
    "pop": 1,  
    "religion": 1,
    "tech": 1
    },
    { # alike
        "tech": 1,
        "nerdy": 1,
        "finance": 1,
    },
    { # format specific
        "copypasta": 1,
        "nostupidquestions": 1,
        "amitheasshole": 1,
    },
    { # college student
        "ucla": 1,
        "nerdy": 1,
        "okbuddy": 1,
        "copypasta": 1,
        "pop": 1,
        "food": 1,
        "animals": 1,
    },
    { # new mother
        "pregnancy": 1,
        "parenting": 1,
        "baby": 1,
        "food": 1,
        "amitheasshole": 1,
        "pop": 1,
        "boomerhumor": 1,
    },
]

TRAIN_SIZES = [
    10, 20, 50, 100, 250, 500, 1000, 5000
]

EXPERIMENTS = [
    "self defined",
    "summary",
    "like history",
    "fine tune",
    "soft prompt",
]

PROMPT_TOKENS = 64
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 1
LEARNING_RATE = 0.2
NUM_TRAIN_STEPS = 1000  
MAX_SEQ_LEN = 2048

MODEL_OUTPUT_DIR = "models"
GENERATED_OUTPUT_DIR = "generated"
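
The configuration above implies a grid of soft prompt runs for the training size ablation. A minimal sketch of enumerating that grid, assuming `gpt-5` is skipped for soft prompt training and that the `/` in Hugging Face model ids is replaced before a run name is used as a path component. `safe_name` and `size_ablation_runs` are hypothetical helpers, not part of the notebook:

```python
from itertools import product

MODELS = ["google/gemma-3-4b-it", "google/gemma-3-27b-it", "gpt-5"]
DATASET1_NAMES = ["minecraft", "ucla", "nostupidquestions", "copypasta"]
TRAIN_SIZES = [10, 20, 50, 100, 250, 500, 1000, 5000]


def safe_name(model_name: str) -> str:
    """Hypothetical helper: make a model id usable as a single path component."""
    return model_name.replace("/", "__")


def size_ablation_runs(models, dataset_names, train_sizes):
    """Yield one (model, dataset, size, run_name) tuple per soft prompt to train."""
    # Assumption: gpt-5 is excluded because no soft prompt is trained for it
    local_models = [m for m in models if m != "gpt-5"]
    for model_name, dataset_name, size in product(local_models, dataset_names, train_sizes):
        run_name = f"soft_prompt_{safe_name(model_name)}_{dataset_name}_{size}"
        yield model_name, dataset_name, size, run_name


runs = list(size_ablation_runs(MODELS, DATASET1_NAMES, TRAIN_SIZES))
print(len(runs))   # 2 local models * 4 datasets * 8 sizes
print(runs[0][3])  # name of the first run
```

Listing the runs up front also makes it easy to estimate total training time before starting the loop.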

## Dataset Loading

In [19]:
# write a generator to lazily load dataset from json file based on dataset argument

"""
choice for sample ablation:
    - make sure to increase max number of training steps as you go
    - do some testing beforehand for 1000, 5000 to find good number of steps
    - 10, 20, 50, 100, 250, 500, 1000, 5000
    - minecraft
    - ucla
    - nostupidquestions   
    - copypasta
    - 4 * 8 = 32 soft prompts
    
choice (for 100 samples, or optimal from above):
    - nerdy
    - personal 
    - alike 3
    - unalike 3
    - format specific
    - college student 
    - new mother 
    - creative gen alpha 
"""

from typing import List, Dict
import json
import re
import random

def load_datasets_proportional(datasets_dict: Dict[str, float], total_posts: int, prompt: str) -> List[dict]:
    """
    Load datasets with proportional sampling.
    
    Args:
        datasets_dict: Dictionary mapping dataset names to their relative weights, e.g. {"minecraft": 1, "ucla": 1} loads half minecraft, half ucla
        total_posts: Total number of posts desired across all datasets
        prompt: Instruction text stored with every example
    
    Returns:
        List of examples in the format: {"instruction": prompt, "output": post}
    """
    
    examples: List[dict] = []
    
    # Get total of all values in datasets_dict
    total_proportion = sum(datasets_dict.values())
    for dataset_name, proportion in datasets_dict.items():
        # Calculate number of posts for this dataset
        factor = proportion / total_proportion
        target_count = int(total_posts * factor)
        print(f"Loading {target_count} posts from {dataset_name} dataset ({factor*100:.1f}%)")
        
        # Load sampled Reddit posts from JSON created by sample-posts.py
        # Each item is a dict with keys: title, subreddit, self_text
        try:
            with open(f"../../datasets/{dataset_name}.json", "r", encoding="utf-8") as f:
                reddit_posts: List[dict] = json.load(f)
        except FileNotFoundError:
            print(f"Warning: Could not find dataset file for {dataset_name}")
            continue
        
        # Filter valid posts (must have self_text and no image_url)
        valid_posts = []
        for p in reddit_posts:
            title = p.get("title", "")
            self_text = p.get("self_text", "")
            image_url = p.get("image_url", "")
            
            if self_text and not image_url:
                subreddit = p.get("subreddit", "")
                subreddit = re.sub(r"\s*(/)?r/", "r/", subreddit)
                post = f"title: {title}\nself_text: {self_text}\nsubreddit: {subreddit}"
                valid_posts.append({"instruction": prompt, "output": post})
        
        print(f"Found {len(valid_posts)} valid posts in {dataset_name}")
        
        # Sample the target number of posts
        if len(valid_posts) >= target_count:
            # Randomly sample target_count posts
            sampled_posts = random.sample(valid_posts, target_count)
        else:
            # Use all available posts if we don't have enough
            print(f"Warning: Only {len(valid_posts)} posts available, using all")
            sampled_posts = valid_posts
        
        examples.extend(sampled_posts)
    
    # Shuffle the final dataset to mix posts from different datasets
    random.shuffle(examples)
    
    # Report the number of posts actually loaded, which can be lower than total_posts
    print(f"Loaded dataset {datasets_dict} with {len(examples)} posts")
    return examples

# Example usage - modify these values as needed
# datasets_dict = {
#     "ucla": 0.5,  # 50% ucla posts
#     "minecraft": 0.5,  
# }
# total_posts = 100  # Total number of posts desired

# examples = load_datasets_proportional(datasets_dict, total_posts, "prompt")

# print(f"Total number of examples loaded: {len(examples)}")
# if examples:
#     print("Sample example:")
#     print(examples[0])
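
`load_datasets_proportional` computes each dataset's share with `int(total_posts * factor)`, which rounds down, so a three-way even mix of 100 posts yields 33 + 33 + 33 = 99. One possible way to make the counts sum exactly to the target is largest-remainder rounding. The sketch below is an illustration under that assumption; `allocate_counts` is a hypothetical helper and not part of the notebook:

```python
from typing import Dict


def allocate_counts(datasets_dict: Dict[str, float], total_posts: int) -> Dict[str, int]:
    """Split total_posts across datasets by weight so the counts sum to total_posts."""
    total_weight = sum(datasets_dict.values())
    exact = {name: total_posts * w / total_weight for name, w in datasets_dict.items()}
    # Round every share down first, as int() does for non-negative values
    counts = {name: int(x) for name, x in exact.items()}
    leftover = total_posts - sum(counts.values())
    # Hand the remaining posts to the datasets with the largest fractional parts
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


alike = allocate_counts({"tech": 1, "nerdy": 1, "finance": 1}, 100)
half = allocate_counts({"minecraft": 1, "ucla": 1}, 100)
print(alike, sum(alike.values()))
print(half)
```

The per-dataset counts could then replace `target_count` inside the loading loop.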


In [4]:
# Preprocess instruction/output dataset
from datasets import Dataset

def preprocess_dataset(examples, tokenizer):
    # Build HF dataset from examples [{"instruction", "output"}]
    dataset = Dataset.from_list(examples)

    # Tokenize instruction with chat template, and supervise only the output tokens
    def tokenize_io(sample):
        # Build chat prompt prefix for the user instruction
        messages = [{"role": "user", "content": sample["instruction"]}]
        prompt_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )

        prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
        output_ids = tokenizer(sample["output"], add_special_tokens=False)["input_ids"]
        eos_id = tokenizer.eos_token_id

        input_ids = prompt_ids + output_ids + ([eos_id] if eos_id is not None else [])
        labels = ([-100] * len(prompt_ids)) + output_ids + ([eos_id] if eos_id is not None else [])
        attention_mask = [1] * len(input_ids)

        # Truncate from the left if too long, keeping alignment between inputs and labels
        if len(input_ids) > MAX_SEQ_LEN:
            input_ids = input_ids[-MAX_SEQ_LEN:]
            labels = labels[-MAX_SEQ_LEN:]
            attention_mask = attention_mask[-MAX_SEQ_LEN:]

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }

    train_ds = dataset.map(tokenize_io, remove_columns=dataset.column_names)
    print("Preprocessed dataset...")
    return train_ds
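
To make the label masking in `tokenize_io` concrete, here is a small sketch with toy token ids instead of real tokenizer output. Prompt positions get the label `-100`, which the cross-entropy loss skips, so only the post and the EOS token are supervised. `build_example` is a hypothetical stand-in for the body of `tokenize_io`:

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_example(prompt_ids, output_ids, eos_id, max_len):
    """Concatenate prompt and output ids, masking the prompt in the labels."""
    input_ids = prompt_ids + output_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids + [eos_id]
    # Keep the last max_len positions so the end of the post and EOS survive
    if len(input_ids) > max_len:
        input_ids = input_ids[-max_len:]
        labels = labels[-max_len:]
    return input_ids, labels


# Toy ids, not real tokenizer output
full_ids, full_labels = build_example([11, 12, 13], [21, 22], eos_id=1, max_len=8)
print(full_ids)     # prompt + output + EOS
print(full_labels)  # prompt positions masked

cut_ids, cut_labels = build_example([11, 12, 13], [21, 22], eos_id=1, max_len=4)
print(cut_ids)      # the first two prompt tokens are dropped
print(cut_labels)
```

Note that left truncation removes the start of the chat template first, so very long posts are trained without the full instruction prefix.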


## Model Loading

In [33]:
from huggingface_hub import login as huggingface_login
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import dotenv, os
from peft import PeftModel, PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

def login():
    dotenv.load_dotenv()
    huggingface_login(token=os.getenv("HUGGING_FACE_HUB_TOKEN"))

def load_model(model_name: str):
    bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16 if bf16 else torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
    )
    
    print(f"Loaded {model_name} model and tokenizer")
    return model, tokenizer

def init_peft_model(model, model_name: str):
    config = PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        prompt_tuning_init=PromptTuningInit.TEXT,
        num_virtual_tokens=PROMPT_TOKENS,
        prompt_tuning_init_text="Generate a reddit post.",
        tokenizer_name_or_path=model_name,
    )
    return get_peft_model(model, config)

def apply_peft_adapter(base_model, adapter_name: str):
    model = PeftModel.from_pretrained(base_model, adapter_name)
    model.eval()
    return model



## Soft Prompt Training

In [6]:
# iterate through models, datasets, and dataset sizes to create soft prompt adapters for each 
# 4 * 8 = 32 soft prompts
# 2 * 8 = 16 adapters
# take note of training time, adapter size

import math
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW

def train_soft_prompt(model, tokenizer, train_ds, train_steps, output_dir):
    def collate_fn(features):
        pad_id = tokenizer.pad_token_id
        batch_size = len(features)
        seq_lens = [len(f["input_ids"]) for f in features]
        max_len = max(seq_lens)

        input_ids = torch.full((batch_size, max_len), pad_id, dtype=torch.long)
        attention_mask = torch.zeros((batch_size, max_len), dtype=torch.long)
        labels = torch.full((batch_size, max_len), -100, dtype=torch.long)

        for i, f in enumerate(features):
            ids = torch.tensor(f["input_ids"], dtype=torch.long)
            attn = torch.tensor(f["attention_mask"], dtype=torch.long)
            labs = torch.tensor(f["labels"], dtype=torch.long)
            L = ids.size(0)
            input_ids[i, :L] = ids
            attention_mask[i, :L] = attn
            labels[i, :L] = labs

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


    train_loader = DataLoader(
        train_ds,
        batch_size=MICRO_BATCH_SIZE,
        shuffle=True,
        collate_fn=collate_fn,
    )

    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    # Total optimizer steps we intend to take
    total_optim_steps = train_steps
    num_warmup_steps = max(1, int(0.1 * total_optim_steps))
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=total_optim_steps,
    )

    model.train()


    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model_device = device

    print(f"Training on device: {device}")

    optimizer.zero_grad()
    optim_step = 0
    accumulated = 0
    running_loss = 0.0
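    # Repeat over the dataset for as many epochs as needed to reach the desired step count
    # (a fixed 10 epochs stops early for small datasets, e.g. 10 samples -> only 100 steps)
    num_epochs = math.ceil(total_optim_steps * GRAD_ACCUM_STEPS / max(1, len(train_loader)))
    for epoch in range(num_epochs):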
        for batch in train_loader:
            batch = {k: v.to(model_device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            (loss / GRAD_ACCUM_STEPS).backward()
            running_loss += loss.item()
            accumulated += 1
            if accumulated % GRAD_ACCUM_STEPS == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                if optim_step % 10 == 0:
                    print(f"step {optim_step} loss {running_loss / GRAD_ACCUM_STEPS:.4f}")
                running_loss = 0.0
                optim_step += 1
                if optim_step >= total_optim_steps:
                    break
        if optim_step >= total_optim_steps:
            break

    model.save_pretrained(output_dir)
    print("Saved prompt adapter to:", output_dir)
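
The training loop warms the learning rate up over the first 10% of steps and then decays it linearly to zero. A pure-Python sketch of the multiplier that `get_linear_schedule_with_warmup` applies to the base learning rate, useful for checking what rate a given step actually trains at. `linear_warmup_decay` is a hypothetical name for this illustration:

```python
def linear_warmup_decay(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


LEARNING_RATE = 0.2
NUM_TRAIN_STEPS = 1000
warmup = max(1, int(0.1 * NUM_TRAIN_STEPS))  # same 10% warmup as the training loop

schedule = {
    step: LEARNING_RATE * linear_warmup_decay(step, warmup, NUM_TRAIN_STEPS)
    for step in (0, 50, 100, 550, 1000)
}
for step, lr in schedule.items():
    print(step, round(lr, 4))
```

With these settings the peak rate of 0.2 is only reached at step 100, and a run that stops early never sees the low-rate tail of the schedule.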

## Fine Tuning

In [7]:
# iterate through models, datasets 
# 2 * 8 = 16 LORA fine tuned models
# take note of training time, model size

def train_fine_tune_lora(model, tokenizer, train_ds, train_steps, output_dir):
    pass


## Generate posts

In [21]:
# 100 posts per cell 
# look into inference parallelism
# iterate through models, experiments, datasets
# write each generated post to json file indicating model, experiment, dataset

def generate_post(prompt, model, tokenizer):

    messages = [
        {"role": "user", "content": prompt},
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=MAX_SEQ_LEN,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.1,
        )
    
    print("Generated post")
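    # Decode only the newly generated tokens and drop special tokens such as end-of-turn markers
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)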
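
The notes at the top of this cell call for writing each generated post to a JSON file along with its model, experiment, and dataset. One possible layout is JSON Lines, one record per line, which keeps multi-line posts intact when they are read back. The sketch below assumes that layout; `append_post` and `read_posts` are hypothetical helpers, and the temporary directory only stands in for `GENERATED_OUTPUT_DIR`:

```python
import json
import tempfile
from pathlib import Path


def append_post(path: Path, post: str, model: str, experiment: str, dataset: str) -> None:
    """Append one generated post as a single JSON line, creating parent folders."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"model": model, "experiment": experiment, "dataset": dataset, "post": post}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_posts(path: Path):
    """Read all records back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "experiment_ablation" / "demo.jsonl"
    multi_line_post = "title: Hi\nself_text: first line\nsecond line\nsubreddit: r/test"
    append_post(out_path, multi_line_post, "google/gemma-3-4b-it", "soft prompt", "minecraft")
    append_post(out_path, "title: Two\nself_text: b\nsubreddit: r/test", "google/gemma-3-4b-it", "soft prompt", "minecraft")
    records = read_posts(out_path)

print(len(records))
print(records[0]["post"] == multi_line_post)
```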


## Judge LLM

In [9]:
# judge whether each generated post adheres to dataset category, heuristic based on word content and llm judge

def judge_post_gpt5(post, dataset):
    JUDGE_PROMPTS = {
        "nerdy": "Please judge whether the following post is nerdy. \n\npost: {post}\n\n",
        "personal": "Please judge whether the following post is personal. \n\npost: {post}\n\n",
        "alike 3": "Please judge whether the following post is similar to the 3 posts below. \n\npost: {post}\n\n",
        "unalike 3": "Please judge whether the following post is not similar to the 3 posts below. \n\npost: {post}\n\n",
        "format specific": "Please judge whether the following post is formatted correctly. \n\npost: {post}\n\n",
        "college student": "Please judge whether the following post was written by a college student. \n\npost: {post}\n\n",
        "new mother": "Please judge whether the following post was written by a new mother. \n\npost: {post}\n\n",
        "creative gen alpha": "Please judge whether the following post was written by a creative gen alpha user. \n\npost: {post}\n\n",
    }
    

def judge_post_heuristic(post, dataset):
    KEYWORDS = {
        "nerdy": [],
        "personal": [],
        "alike 3": [],
        "unalike 3": [],
        "format specific": [],
        "college student": [],
        "new mother": [],
        "creative gen alpha": [],
    }
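
The keyword lists in `judge_post_heuristic` are still empty. As a rough illustration of how a word-content heuristic could work, the sketch below scores a post by the fraction of keywords it contains and compares that to a threshold. The keyword lists, the threshold, and the function names are all hypothetical placeholders, not values chosen by the experiment:

```python
import re
from typing import Dict, List

# Hypothetical keyword lists, for illustration only
EXAMPLE_KEYWORDS: Dict[str, List[str]] = {
    "minecraft": ["creeper", "redstone", "nether", "villager"],
    "ucla": ["bruin", "westwood", "dining hall", "quarter"],
}


def keyword_score(post: str, keywords: List[str]) -> float:
    """Fraction of keywords found in the post as whole words or phrases."""
    if not keywords:
        return 0.0
    text = post.lower()
    hits = sum(1 for kw in keywords if re.search(rf"\b{re.escape(kw.lower())}\b", text))
    return hits / len(keywords)


def judge_by_keywords(post: str, dataset: str, threshold: float = 0.25) -> bool:
    """True when enough of the dataset's keywords appear in the post."""
    return keyword_score(post, EXAMPLE_KEYWORDS.get(dataset, [])) >= threshold


sample = "title: My redstone door\nself_text: A creeper blew it up.\nsubreddit: r/Minecraft"
print(keyword_score(sample, EXAMPLE_KEYWORDS["minecraft"]))
print(judge_by_keywords(sample, "minecraft"), judge_by_keywords(sample, "ucla"))
```

A heuristic like this is cheap enough to run on every generated post, with the LLM judge reserved for borderline scores.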

## Run Pipeline

add progress bar

In [22]:
MODELS = [
    "google/gemma-3-4b-it",
    "google/gemma-3-27b-it",
    "gpt-5"
]


# Dataset mixes: each dict maps a dataset name to its sampling weight
DATASET1_NAMES = [
    "minecraft",
    "ucla",
    "nostupidquestions",
    "copypasta",
]
DATASETS1 = [
    {
        "minecraft": 1,  
    },
    {
        "ucla": 1,  
    },
    {
        "nostupidquestions": 1,  
    },
    {
        "copypasta": 1,  
    },
]

DATASET2_NAMES = [
    "nerdy",
    "personal",
    "unalike",
    "alike",
    "formatspecific",
    "college",
    "newmother",
]
DATASETS2 = [
    {
        "nerdy": 1,  
    },
    {
        "personal": 1,  
    },
    { # unalike
        "pop": 1,  
        "religion": 1,
        "tech": 1
    },
    { # alike
        "tech": 1,
        "nerdy": 1,
        "finance": 1,
    },
    { # format specific
        "copypasta": 1,
        "nostupidquestions": 1,
        "amitheasshole": 1,
    },
    { # college student
        "ucla": 1,
        "nerdy": 1,
        "okbuddy": 1,
        "copypasta": 1,
        "pop": 1,
        "food": 1,
        "animals": 1,
    },
    { # new mother
        "pregnancy": 1,
        "parenting": 1,
        "baby": 1,
        "food": 1,
        "amitheasshole": 1,
        "pop": 1,
        "boomerhumor": 1,
    },
]

TRAIN_SIZES = [
    10, 20, 50, 100, 250, 500, 1000, 5000
]

EXPERIMENTS = [
    "self defined",
    "summary",
    "like history",
    "fine tune",
    "soft prompt",
]

PROMPT_TOKENS = 64
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 1
LEARNING_RATE = 0.2
NUM_TRAIN_STEPS = 1000  
MAX_SEQ_LEN = 2048

MODEL_OUTPUT_DIR = "models"
GENERATED_OUTPUT_DIR = "generated"

In [11]:
PROMPTS = {
    "self defined": {
        # user defines their own interests. e.g. a bio of interests
        "nerdy": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
        "personal": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n", 
        "alike 3": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
        "unalike 3": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
        "format specific": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
        "college student": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
        "new mother": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
        "creative gen alpha": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
    },
    "fine tune": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
    "soft prompt": "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n",
}

In [12]:
def generate_summarized_prompt(dataset):
    return ""

def generate_like_history_prompt(dataset):
    return ""
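
Both prompt builders above are still stubs. As one possible shape for the like-history variant, the sketch below prepends a few liked posts to the shared generation prompt. The preamble wording, the limits, and the name `build_like_history_prompt` are assumptions for illustration only:

```python
from typing import Dict, List

BASE_PROMPT = (
    "Please generate one reddit post. Use this format. \n\n"
    "title: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n"
)


def build_like_history_prompt(
    liked_posts: List[Dict[str, str]], max_posts: int = 5, max_chars: int = 300
) -> str:
    """Prefix the base prompt with up to max_posts liked posts, truncating long bodies."""
    lines = ["Here are posts the user liked:"]
    for i, p in enumerate(liked_posts[:max_posts], start=1):
        body = p.get("self_text", "")[:max_chars]
        lines.append(f"{i}. title: {p.get('title', '')}")
        lines.append(f"   self_text: {body}")
        lines.append(f"   subreddit: {p.get('subreddit', '')}")
    lines.append("")
    lines.append(BASE_PROMPT)
    return "\n".join(lines)


liked = [
    {"title": f"Post {n}", "self_text": "x" * 500, "subreddit": "r/test"}
    for n in range(1, 8)
]
demo_prompt = build_like_history_prompt(liked)
print(demo_prompt[:120])
```

Capping the number and length of liked posts keeps the prompt well below `MAX_SEQ_LEN` even for long posts.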

### testing...

In [36]:
# single test

model_name = MODELS[0]
dataset_name = DATASET1_NAMES[0]
dataset_dict = DATASETS1[0]
train_size = TRAIN_SIZES[3]
print(model_name, dataset_name, train_size)

# load data + model
login()
model, tokenizer = load_model(model_name)
examples = load_datasets_proportional(dataset_dict, train_size, PROMPTS["soft prompt"]) 
train_ds = preprocess_dataset(examples, tokenizer)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


google/gemma-3-4b-it minecraft 100



Loaded google/gemma-3-4b-it model and tokenizer
Loading 100 posts from minecraft dataset (100.0%)
Found 500 valid posts in minecraft
Loaded dataset {'minecraft': 1} with 100 posts



Preprocessed dataset...


In [37]:
soft_output_dir = f"{MODEL_OUTPUT_DIR}/soft_prompts/soft_prompt_{model_name}_{dataset_name}_{train_size}" 

# initialize peft model
peft_model = init_peft_model(model, model_name)

# train soft prompt
train_soft_prompt(peft_model, tokenizer, train_ds, NUM_TRAIN_STEPS, soft_output_dir)

# load peft model
peft_model = apply_peft_adapter(model, soft_output_dir)

Training on device: cuda:0
step 0 loss 5.1368
step 10 loss 5.1767
step 20 loss 3.5399
step 30 loss 3.6119
step 40 loss 2.4462
step 50 loss 2.7222
step 60 loss 2.1517
step 70 loss 2.7178
step 80 loss 3.2679
step 90 loss 2.8349
step 100 loss 2.2789
step 110 loss 3.2044
step 120 loss 2.4439
step 130 loss 2.4466
step 140 loss 3.0541
step 150 loss 2.6781
step 160 loss 2.8668
step 170 loss 2.3935
step 180 loss 2.5812
step 190 loss 2.2252
step 200 loss 1.7519
step 210 loss 2.2952
step 220 loss 1.5579
step 230 loss 2.2342
step 240 loss 2.7647
step 250 loss 2.6278
step 260 loss 2.6100
step 270 loss 2.2724
step 280 loss 2.4622
step 290 loss 2.9883
step 300 loss 2.3043
step 310 loss 2.0747
step 320 loss 1.2879
step 330 loss 2.7997
step 340 loss 1.8592
step 350 loss 2.5217
step 360 loss 1.7979
step 370 loss 2.1920
step 380 loss 2.9836
step 390 loss 1.8632
step 400 loss 3.0728
step 410 loss 2.1303
step 420 loss 2.3038
step 430 loss 2.3959
step 440 loss 2.5918
step 450 loss 2.4341
step 460 loss 3.05

In [41]:
# generate posts

from pathlib import Path

out_path = Path(f"{GENERATED_OUTPUT_DIR}/train_size_ablation/{model_name}_{dataset_name}_{train_size}.json")
out_path.parent.mkdir(parents=True, exist_ok=True)

with open(out_path, "a") as f:
    for _ in range(5):
        post = generate_post(PROMPTS["soft prompt"], peft_model, tokenizer)
        f.write(post + "\n")



Generated post
Generated post
Generated post
Generated post
Generated post


### Pipelines

In [None]:
from pathlib import Path

login()

# training size ablation
for model_name in MODELS:
    if model_name == "gpt-5":
        continue  # no soft prompt is trained for gpt-5

    model, tokenizer = load_model(model_name)

    for dataset_name, dataset_dict in zip(DATASET1_NAMES, DATASETS1):
        for train_size in TRAIN_SIZES:
            # Load and tokenize a fresh sample for every training size
            examples = load_datasets_proportional(dataset_dict, train_size, PROMPTS["soft prompt"])
            train_ds = preprocess_dataset(examples, tokenizer)

            soft_output_dir = f"{MODEL_OUTPUT_DIR}/soft_prompts/soft_prompt_{model_name}_{dataset_name}_{train_size}"
            peft_model = init_peft_model(model, model_name)
            train_soft_prompt(peft_model, tokenizer, train_ds, NUM_TRAIN_STEPS, soft_output_dir)

            peft_model = apply_peft_adapter(model, soft_output_dir)

            # generate posts (model_name contains "/", so create the parent folders first)
            out_path = Path(f"{GENERATED_OUTPUT_DIR}/train_size_ablation/{model_name}_{dataset_name}_{train_size}.json")
            out_path.parent.mkdir(parents=True, exist_ok=True)
            with open(out_path, "w") as f:
                for _ in range(100):
                    post = generate_post(PROMPTS["soft prompt"], peft_model, tokenizer)
                    f.write(post + "\n")

In [None]:
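from pathlib import Path

login()

# experiment ablation (100 training samples per dataset mix)
for model_name in MODELS:
    if model_name == "gpt-5":
        continue  # TODO: load_model only handles Hugging Face checkpoints; gpt-5 needs its own generation path

    model, tokenizer = load_model(model_name)
    for dataset_name, dataset_dict in zip(DATASET2_NAMES, DATASETS2):
        for experiment in EXPERIMENTS:
            gen_model = model  # prompt-only experiments generate with the base model

            if experiment == "soft prompt":
                prompt = PROMPTS["soft prompt"]
                examples = load_datasets_proportional(dataset_dict, 100, prompt)
                train_ds = preprocess_dataset(examples, tokenizer)
                soft_output_dir = f"{MODEL_OUTPUT_DIR}/soft_prompts/soft_prompt_{model_name}_{dataset_name}_100"
                peft_model = init_peft_model(model, model_name)
                train_soft_prompt(peft_model, tokenizer, train_ds, NUM_TRAIN_STEPS, soft_output_dir)
                # Keep the base model untouched so other experiments do not run through the adapter
                gen_model = apply_peft_adapter(model, soft_output_dir)
            elif experiment == "fine tune":
                continue  # skip fine tuning (not implemented yet)
            elif experiment == "summary":
                prompt = generate_summarized_prompt(dataset_dict)
            elif experiment == "like history":
                prompt = generate_like_history_prompt(dataset_dict)
            elif experiment == "self defined":
                # PROMPTS["self defined"] keys (e.g. "alike 3") do not all match DATASET2_NAMES (e.g. "alike");
                # fall back to the generic prompt until the keys are aligned
                prompt = PROMPTS["self defined"].get(dataset_name, PROMPTS["soft prompt"])

            out_path = Path(f"{GENERATED_OUTPUT_DIR}/experiment_ablation/{model_name}_{dataset_name}_{experiment}.json")
            out_path.parent.mkdir(parents=True, exist_ok=True)
            with open(out_path, "w") as f:
                for _ in range(100):
                    post = generate_post(prompt, gen_model, tokenizer)
                    f.write(post + "\n")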


In [None]:
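# judging (training size ablation)
from pathlib import Path

for model_name in MODELS:
    for dataset in DATASET1_NAMES:  # file names use the dataset name, not the weights dict
        for train_size in TRAIN_SIZES:
            path = Path(f"{GENERATED_OUTPUT_DIR}/train_size_ablation/{model_name}_{dataset}_{train_size}.json")
            if not path.exists():
                continue  # e.g. gpt-5, which has no soft prompt runs
            with open(path, "r") as f:
                lines = f.readlines()
            # Assumes each post occupies exactly 3 lines (title, self_text, subreddit)
            for i in range(0, len(lines), 3):
                post_lines = lines[i:i+3]
                post = "\n".join(line.strip() for line in post_lines)
                if post.strip():
                    gpt_judgement = judge_post_gpt5(post, dataset)
                    heuristic_judgement = judge_post_heuristic(post, dataset)
                    print(post)
                    break  # only the first post per file is inspected for now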

In [None]:
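# judging (experiment ablation)
from pathlib import Path

for model_name in MODELS:
    for dataset in DATASET2_NAMES:  # file names use the dataset name, not the weights dict
        for experiment in EXPERIMENTS:
            path = Path(f"{GENERATED_OUTPUT_DIR}/experiment_ablation/{model_name}_{dataset}_{experiment}.json")
            if not path.exists():
                continue  # e.g. experiments that were skipped during generation
            with open(path, "r") as f:
                lines = f.readlines()
            # Assumes each post occupies exactly 3 lines (title, self_text, subreddit)
            for i in range(0, len(lines), 3):
                post_lines = lines[i:i+3]
                post = "\n".join(line.strip() for line in post_lines)
                if post.strip():
                    gpt_judgement = judge_post_gpt5(post, dataset)
                    heuristic_judgement = judge_post_heuristic(post, dataset)
                    print(post)
                    break  # only the first post per file is inspected for now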