### Output-only style tuning with soft prompts (self-contained)

This notebook steers output style by training a soft prompt with PEFT Prompt Tuning on an instruct model. The loss is computed ONLY on the assistant outputs; the tokens of the fixed instruction are masked out.

It will:
- Install dependencies in-notebook
- Load a chat checkpoint via `transformers`
- Configure Prompt Tuning (learn virtual tokens only)
- Train on an outputs-only dataset to steer style
- Run inference on a normal user prompt

Notes:
- Adjust `MODEL_ID` to a model you can pull.
- Large models need significant VRAM; pick a smaller one if needed.
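
As a rough guide for the VRAM note above, the weights alone take about (number of parameters) × (bytes per parameter). The sketch below is a back-of-the-envelope estimate only: the helper name `weight_memory_gib` is hypothetical, and the figure ignores activations, the KV cache, and optimizer state. The parameter count is the total reported later in this notebook for Llama 3 8B, minus the 262,144 soft-prompt parameters.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights only, in GiB."""
    return num_params * bytes_per_param / 2**30


# Base-model parameters: "all params" from the PEFT printout minus the soft prompt
LLAMA3_8B_PARAMS = 8_030_523_392 - 262_144

print(f"bf16/fp16 weights: {weight_memory_gib(LLAMA3_8B_PARAMS):.1f} GiB")
print(f"fp32 weights:      {weight_memory_gib(LLAMA3_8B_PARAMS, 4):.1f} GiB")
```

If the 16-bit estimate is close to or above the available VRAM, a smaller entry from the model list is the safer choice.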


## Dependencies

In [1]:
%pip install -qU transformers==4.55.2 peft accelerate datasets trl einops sentencepiece bitsandbytes "jinja2>=3.1.0"


Note: you may need to restart the kernel to use updated packages.


In [4]:
# Extra packages for openai/gpt-oss-20b (skip for the other models)
%pip install triton==3.4 kernels

Defaulting to user installation because normal site-packages is not writeable
Collecting triton==3.4
  Downloading triton-3.4.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (155.4 MB)
Collecting kernels
  Downloading kernels-0.9.0-py3-none-any.whl (37 kB)
Collecting pyyaml>=6
  Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
Collecting tomli>=2.0
  Downloading tomli-2.2.1-py3-none-any.whl (14 kB)
Installing collected packages: triton, tomli, pyyaml, kernels
Successfully installed kernels-0.9.0 pyyaml-6.0.2 tomli-2.2.1 triton-3.4.0
Note: you may need to restart the kernel to use updated packages.


In [17]:
models = [
    "meta-llama/Meta-Llama-3-70B-Instruct", # 0 # very slow, pretty much same quality as 8b on a100
    "meta-llama/Meta-Llama-3-8B-Instruct",  # 1 # good quality, pretty fast
    "openai/gpt-oss-20b",                   # 2 # good quality, decently quick, but have to deal with thinking 
    "Qwen/Qwen3-4B-Instruct-2507",          # 3 # tends to generate the same post over and over (without tuning)
    "Qwen/Qwen3-30B-A3B-Instruct-2507",     # 4 # pretty slow > 1 min per post on a100, good quality
    "google/gemma-3-4b-it",                 # 5 # multimodal, can't get working
    "google/gemma-3-27b-it",                # 6 # multimodal, can't get working
    "mistralai/Mistral-7B-v0.1",            # 7 WAITING FOR ACCESS
    "microsoft/phi-4",                      # 8 14B, pretty slow, low-decent quality, generates same post over and over
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" #9 very fast, have to deal with thinking, decent quality
    ]

MODEL_ID = models[1] 
OUTPUT_DIR = "./softprompt-style-outputs"
PROMPT_TOKENS = 64
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 1
LEARNING_RATE = 0.2
NUM_TRAIN_STEPS = 1000  
MAX_SEQ_LEN = 512

PROMPT = "Please generate one reddit post. Don't include any extraneous characters like asterisks or other symbols. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n"

## Load data 
(make sure to run sampleposts.py)

In [58]:
# Build the instruction/output dataset
from typing import List
import json
import re

# Load sampled Reddit posts from JSON created by sample-posts.py
# Each item is a dict with keys: title, subreddit, self_text
with open("./data/humorposts.json", "r", encoding="utf-8") as f:
    reddit_posts: List[dict] = json.load(f)

# Build dataset in the format: {"instruction": PROMPT, "output": post}
examples: List[dict] = []
for p in reddit_posts:
    title = p.get("title", "")
    self_text = p.get("self_text", "")
    image_url = p.get("image_url", "")
    
    # Keep text-only posts: skip entries with no body or with an image
    if not self_text or image_url:
        continue
    
    subreddit = p.get("subreddit", "")
    subreddit = re.sub(r"\s*(/)?r/", "r/", subreddit)
    post = f"title: {title}\nself_text: {self_text}\nsubreddit: {subreddit}"
    examples.append({"instruction": PROMPT, "output": post})
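
To make the target format concrete, here is a minimal sketch of what one training target looks like after the subreddit clean-up. The helper names `normalize_subreddit` and `format_post` are hypothetical, and the sample post is made up for illustration; only the regex and the three-field layout come from the cell above.

```python
import re


def normalize_subreddit(name: str) -> str:
    # Same substitution as the loader: drop whitespace and an optional "/" before "r/"
    return re.sub(r"\s*(/)?r/", "r/", name)


def format_post(post: dict) -> str:
    # Three-field layout used as the training target
    subreddit = normalize_subreddit(post.get("subreddit", ""))
    return (
        f"title: {post.get('title', '')}\n"
        f"self_text: {post.get('self_text', '')}\n"
        f"subreddit: {subreddit}"
    )


# Made-up post, not taken from the dataset
sample = {"title": "Example title", "self_text": "Example body.", "subreddit": " /r/humor"}
print(format_post(sample))
```

Names without an `r/` prefix pass through unchanged, so the dataset can contain both prefixed and bare subreddit names.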


## Load model

Set the `HUGGING_FACE_HUB_TOKEN` environment variable before running the next cell (a `.env` file works, since the cell calls `dotenv.load_dotenv()`).

In [7]:
from huggingface_hub import login
import dotenv, os

dotenv.load_dotenv()
login(token=os.getenv("HUGGING_FACE_HUB_TOKEN"))


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [8]:
# Load tokenizer and model (8B instruct)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if bf16 else torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model.config.use_cache = False
print("Loaded:", MODEL_ID)


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.39it/s]

Loaded: meta-llama/Meta-Llama-3-8B-Instruct





## Test model

### Text only

In [9]:
from transformers import TextStreamer

# prompt = "Please generate one reddit post (and nothing else). Make sure to stick to the format below exactly. Don't include any extraneous characters like asterisks or other symbols. \n\n title: {title} \n self_text: {self_text} \n subreddit: {subreddit} \n Here's an example of the format: \n\ntitle: This is the title of the post! \nself_text: Here's where the content of the post goes. \nsubreddit: This is the subreddit, or the name of the community the post belongs to."

messages = [
    {"role": "user", 
     "content": PROMPT},
]

streamer = TextStreamer(tokenizer, 
                        skip_special_tokens=False,
                        skip_prompt=True)

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=MAX_SEQ_LEN,
    do_sample=True,  # make sampling explicit so temperature/top_p take effect
    temperature=0.9,
    top_p=0.8,
    streamer=streamer,
)
# print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Here is a generated Reddit post:

title: I just had the weirdest dream
self_text: I was walking through a familiar neighborhood, but everything was slightly off. The houses were the same, but the colors were all wrong. I saw my childhood best friend standing in front of one of the houses, but she was wearing a bright pink wig and a superhero cape. I tried to talk to her, but she just ignored me and walked away. Then I woke up. Has anyone else ever had a dream that was so vivid and strange?
subreddit: r/dreams<|eot_id|>


## Training

In [59]:
# Configure PEFT Prompt Tuning
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=PROMPT_TOKENS,
    prompt_tuning_init_text="Generate a reddit post.",
    tokenizer_name_or_path=MODEL_ID,
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


trainable params: 262,144 || all params: 8,030,523,392 || trainable%: 0.0033
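
The trainable count can be checked by hand: prompt tuning learns one embedding vector per virtual token, so the number of trainable parameters is `num_virtual_tokens × hidden_size`. The short check below assumes a hidden size of 4096, which is the embedding width shown in the model printout later in this notebook.

```python
NUM_VIRTUAL_TOKENS = 64        # PROMPT_TOKENS in this notebook
HIDDEN_SIZE = 4096             # embedding width of Llama 3 8B (assumed from the model printout)
TOTAL_PARAMS = 8_030_523_392   # "all params" reported by print_trainable_parameters()

trainable = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
trainable_pct = 100 * trainable / TOTAL_PARAMS

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
```

Doubling `PROMPT_TOKENS` doubles the trainable count, but it also lengthens every input sequence by the same number of positions.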


In [60]:
# Preprocess instruction/output dataset
from datasets import Dataset

# Build HF dataset from examples [{"instruction", "output"}]
dataset = Dataset.from_list(examples)

# Tokenize instruction with chat template, and supervise only the output tokens
def tokenize_io(sample):
    # Build chat prompt prefix for the user instruction
    messages = [{"role": "user", "content": sample["instruction"]}]
    prompt_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(sample["output"], add_special_tokens=False)["input_ids"]
    eos_id = tokenizer.eos_token_id

    input_ids = prompt_ids + output_ids + ([eos_id] if eos_id is not None else [])
    labels = ([-100] * len(prompt_ids)) + output_ids + ([eos_id] if eos_id is not None else [])
    attention_mask = [1] * len(input_ids)

    # Truncate from the left if too long, keeping alignment between inputs and labels
    if len(input_ids) > MAX_SEQ_LEN:
        input_ids = input_ids[-MAX_SEQ_LEN:]
        labels = labels[-MAX_SEQ_LEN:]
        attention_mask = attention_mask[-MAX_SEQ_LEN:]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

train_ds = dataset.map(tokenize_io, remove_columns=dataset.column_names)
train_ds


Map: 100%|██████████| 250/250 [00:00<00:00, 967.83 examples/s] 


Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 250
})
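
The masking in `tokenize_io` is easier to see on toy numbers. The sketch below mirrors its logic with made-up token IDs instead of a real tokenizer; `build_example` is a hypothetical helper, not part of the notebook. Positions labeled `-100` are ignored by the cross-entropy loss, so only the output tokens and the end-of-sequence token are supervised.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_example(prompt_ids, output_ids, eos_id, max_len):
    input_ids = prompt_ids + output_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids + [eos_id]
    # Left truncation keeps the tail of the sequence, i.e. the supervised output
    if len(input_ids) > max_len:
        input_ids = input_ids[-max_len:]
        labels = labels[-max_len:]
    return input_ids, labels


# Toy token IDs, not real tokenizer output
ids, labels = build_example([11, 12, 13], [21, 22], eos_id=2, max_len=8)
supervised = sum(label != IGNORE_INDEX for label in labels)

print("input_ids:", ids)
print("labels:   ", labels)
print("supervised positions:", supervised)
```

One consequence of truncating from the left is that long posts lose the start of the instruction rather than the end of the post.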

In [61]:
# Trainer setup and brief training
import math
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW


def collate_fn(features):
    pad_id = tokenizer.pad_token_id
    batch_size = len(features)
    seq_lens = [len(f["input_ids"]) for f in features]
    max_len = max(seq_lens)

    input_ids = torch.full((batch_size, max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((batch_size, max_len), dtype=torch.long)
    labels = torch.full((batch_size, max_len), -100, dtype=torch.long)

    for i, f in enumerate(features):
        ids = torch.tensor(f["input_ids"], dtype=torch.long)
        attn = torch.tensor(f["attention_mask"], dtype=torch.long)
        labs = torch.tensor(f["labels"], dtype=torch.long)
        L = ids.size(0)
        input_ids[i, :L] = ids
        attention_mask[i, :L] = attn
        labels[i, :L] = labs

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


train_loader = DataLoader(
    train_ds,
    batch_size=MICRO_BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn,
)

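# Only the soft-prompt embeddings require gradients; the base model stays frozen
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable_params, lr=LEARNING_RATE)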
# Total optimizer steps we intend to take
total_optim_steps = NUM_TRAIN_STEPS
num_warmup_steps = max(1, int(0.1 * total_optim_steps))
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=total_optim_steps,
)

model.train()
model = model.to(next(model.parameters()).device)

optimizer.zero_grad()
optim_step = 0
accumulated = 0
running_loss = 0.0
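# Enough passes over the data to reach total_optim_steps optimizer steps
num_epochs = math.ceil(total_optim_steps * GRAD_ACCUM_STEPS / len(train_loader))
for epoch in range(num_epochs):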
    for batch in train_loader:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        (loss / GRAD_ACCUM_STEPS).backward()
        running_loss += loss.item()
        accumulated += 1
        if accumulated % GRAD_ACCUM_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            if optim_step % 10 == 0:
                print(f"step {optim_step} loss {running_loss / GRAD_ACCUM_STEPS:.4f}")
            running_loss = 0.0
            optim_step += 1
            if optim_step >= total_optim_steps:
                break
    if optim_step >= total_optim_steps:
        break

model.save_pretrained(OUTPUT_DIR)
print("Saved prompt adapter to:", OUTPUT_DIR)


step 0 loss 2.7644
step 10 loss 2.9495
step 20 loss 2.9693
step 30 loss 3.2349
step 40 loss 4.3521
step 50 loss 2.4671
step 60 loss 2.9861
step 70 loss 2.6146
step 80 loss 2.4659
step 90 loss 2.8505
step 100 loss 3.2453
step 110 loss 3.5126
step 120 loss 2.6644
step 130 loss 2.4040
step 140 loss 3.2621
step 150 loss 2.5959
step 160 loss 2.8041
step 170 loss 2.6731
step 180 loss 2.0506
step 190 loss 2.7093
step 200 loss 1.0189
step 210 loss 2.9810
step 220 loss 2.2182
step 230 loss 2.4045
step 240 loss 2.8120
step 250 loss 2.6533
step 260 loss 2.7973
step 270 loss 2.4042
step 280 loss 3.0514
step 290 loss 2.6159
step 300 loss 2.6006
step 310 loss 1.8526
step 320 loss 1.9546
step 330 loss 2.5402
step 340 loss 3.4069
step 350 loss 2.8354
step 360 loss 3.9945
step 370 loss 2.3689
step 380 loss 2.1852
step 390 loss 1.9901
step 400 loss 2.8739
step 410 loss 2.3417
step 420 loss 2.7267
step 430 loss 2.9212
step 440 loss 3.2932
step 450 loss 1.9399
step 460 loss 3.4223
step 470 loss 3.7334
ste
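
For reference, the learning rate that the training cell's schedule produces can be sketched in plain Python. This is a stand-alone re-implementation of a linear warmup followed by a linear decay, written under the assumption that it matches what `get_linear_schedule_with_warmup` does; the function name `linear_warmup_decay_lr` is hypothetical. The defaults below mirror the notebook's settings (`LEARNING_RATE = 0.2`, 1000 steps, 10% warmup).

```python
def linear_warmup_decay_lr(step, base_lr=0.2, warmup_steps=100, total_steps=1000):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}  lr {linear_warmup_decay_lr(step):.4f}")
```

With a batch size of 1 the per-step loss is noisy, so plotting this schedule next to a smoothed loss curve is usually more informative than reading individual log lines.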

In [62]:
from peft import PeftModel
from transformers import TextStreamer, AutoModelForCausalLM, AutoTokenizer
import torch 

bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Reload base + adapter
base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if bf16 else torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
base = PeftModel.from_pretrained(base, OUTPUT_DIR)
base.eval()

Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  3.13it/s]
Some parameters are on the meta device because they were offloaded to the cpu.


PeftModelForCausalLM(
  (base_model): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(128256, 4096)
      (layers): ModuleList(
        (0-31): 32 x LlamaDecoderLayer(
          (self_attn): LlamaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): LlamaMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-0

In [67]:
streamer = TextStreamer(tokenizer, 
                        skip_special_tokens=True,
                        skip_prompt=True
                        )

# Build chat-formatted inputs via the model's chat template
messages = [
    {"role": "user", "content": PROMPT},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(base.device)

with torch.no_grad():
    _ = base.generate(
        **inputs,
        max_new_tokens=MAX_SEQ_LEN,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        streamer=streamer,
    )


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


title: I'm not sure if I'm a good or bad person
self_text: I'm a 23 year old man who has never been in a relationship, I have a decent job but I'm not very ambitious, I'm not very outgoing and I don't really like parties or big social events. I'm not really interested in politics or current events. I like to play video games and watch TV. I'm a bit of a loner but I don't really feel lonely. I'm pretty content with my life. 

Is that good or bad? 
subreddit: copypasta
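
Because the generations follow the `title:` / `self_text:` / `subreddit:` layout, they can be split back into fields for downstream use. The parser below is a minimal sketch, not part of the original notebook: `parse_post` is a hypothetical helper, it assumes each field label starts its own line, and it ignores any text before the first label.

```python
import re

# A field label at the start of a line, followed by optional spaces or tabs
FIELD_RE = re.compile(r"^(title|self_text|subreddit):[ \t]*", re.MULTILINE)


def parse_post(text: str) -> dict:
    # Drop the Llama 3 end-of-turn marker if special tokens were not skipped
    text = text.replace("<|eot_id|>", "").strip()
    matches = list(FIELD_RE.finditer(text))
    post = {}
    for i, match in enumerate(matches):
        # A field's value runs until the next label or the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        post[match.group(1)] = text[match.end():end].strip()
    return post


generated = (
    "title: I'm not sure if I'm a good or bad person\n"
    "self_text: I'm pretty content with my life. \n\n"
    "Is that good or bad? \n"
    "subreddit: copypasta\n"
)
print(parse_post(generated))
```

A generation that is missing one of the three keys can then be detected and re-sampled.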

