### Output-only style tuning with soft prompts (self-contained)

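This notebook steers the style of an instruct model with PEFT Prompt Tuning. Every training example pairs the same fixed instruction with one sample post, and the loss is computed ONLY on the assistant output tokens (the instruction tokens are masked out).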

It will:
- Install dependencies in-notebook
- Load a chat checkpoint via `transformers`
- Configure Prompt Tuning (learn virtual tokens only)
- Train on an outputs-only dataset to steer style
- Run inference on a normal user prompt

Notes:
- Adjust `MODEL_ID` to a model you can pull.
- Large models need significant VRAM; pick a smaller one if needed.
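
As a rough illustration of what "learn virtual tokens only" means: prompt tuning prepends a small block of trainable embedding vectors to the frozen token embeddings of each input. The sketch below uses plain Python lists and toy sizes; `prepend_soft_prompt` is a hypothetical helper, not a PEFT function.

```python
import random

def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the sequence the frozen model sees: virtual tokens first, then real tokens."""
    return soft_prompt + token_embeddings

hidden_size = 4          # toy value; real models use thousands of dimensions
num_virtual_tokens = 3   # plays the role of PROMPT_TOKENS

random.seed(0)
# Trainable part: one vector per virtual token.
soft_prompt = [[random.gauss(0, 0.02) for _ in range(hidden_size)]
               for _ in range(num_virtual_tokens)]
# Frozen part: embeddings looked up for the real input tokens (5 tokens here).
token_embeddings = [[0.0] * hidden_size for _ in range(5)]

sequence = prepend_soft_prompt(soft_prompt, token_embeddings)
print(len(sequence), "positions,", num_virtual_tokens * hidden_size, "trainable values")
```

Only the values in `soft_prompt` would receive gradient updates; everything else stays fixed.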


## Dependencies

In [None]:
%pip install -qU transformers peft accelerate datasets trl einops sentencepiece bitsandbytes "jinja2>=3.1.0"
# %pip install -U git+https://github.com/huggingface/transformers

Note: you may need to restart the kernel to use updated packages.


In [None]:
models = [
    "meta-llama/Meta-Llama-3-70B-Instruct", # 0 # very slow, pretty much same quality as 8b on a100
    "meta-llama/Meta-Llama-3-8B-Instruct",  # 1 # good quality, pretty fast
    "openai/gpt-oss-20b",                   # 2 # good quality, decently quick, but have to deal with thinking 
    "Qwen/Qwen3-4B-Instruct-2507",          # 3 # tends to generate the same post over and over (without tuning)
    "Qwen/Qwen3-30B-A3B-Instruct-2507",     # 4 # pretty slow > 1 min per post on a100, good quality
    "google/gemma-3-4b-it",                 # 5 # multimodal, can't get working
    "google/gemma-3-27b-it",                # 6 # multimodal, can't get working
    "mistralai/Mistral-7B-v0.1",            # 7 WAITING FOR ACCESS
    "microsoft/phi-4",                      # 8 14B, pretty slow, low-decent quality, generates same post over and over
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" #9 very fast, have to deal with thinking, decent quality
    ]

MODEL_ID = models[5] 
OUTPUT_DIR = "./softprompt-style-outputs"
PROMPT_TOKENS = 64
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 1
LEARNING_RATE = 0.2
NUM_TRAIN_STEPS = 1000  
MAX_SEQ_LEN = 2048

PROMPT = "Please generate one reddit post. Use this format. \n\ntitle: {title}\n self_text: {self_text}\n subreddit: {subreddit}\n"

## Load data 
(make sure to run sampleposts.py)

In [4]:
# Config and instruction/output dataset
from typing import List
import json
import re

# Load sampled Reddit posts from JSON created by sample-posts.py
# Each item is a dict with keys: title, subreddit, self_text
with open("./data/humorposts.json", "r", encoding="utf-8") as f:
    reddit_posts: List[dict] = json.load(f)

# Build dataset in the format: {"instruction": PROMPT, "output": post}
examples: List[dict] = []
for p in reddit_posts:
    title = p.get("title", "")
    self_text = p.get("self_text", "")
    image_url = p.get("image_url", "")
    
    if not self_text or image_url: continue
    
    subreddit = p.get("subreddit", "")
    subreddit = re.sub(r"\s*(/)?r/", "r/", subreddit)
    post = f"title: {title}\nself_text: {self_text}\nsubreddit: {subreddit}"
    examples.append({"instruction": PROMPT, "output": post})
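
To see what the subreddit substitution in the loop above does, here is a small standalone check. `normalize_subreddit` is a hypothetical wrapper around the same `re.sub` call, and the sample names are made up.

```python
import re

def normalize_subreddit(name: str) -> str:
    """Same substitution as the loading loop: collapse '/r/' or ' r/' into 'r/'."""
    return re.sub(r"\s*(/)?r/", "r/", name)

for raw in ["/r/funny", " r/funny", "r/funny", "funny"]:
    print(repr(raw), "->", repr(normalize_subreddit(raw)))
```

Note that a name stored without any prefix (the last case) is left bare, so the training targets may mix `r/name` and `name` unless the source data is consistent.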


## Load model

Make sure to set the `HUGGING_FACE_HUB_TOKEN` environment variable (for example in a `.env` file, which the cell below loads).

In [None]:
from huggingface_hub import login
import dotenv, os

dotenv.load_dotenv()
login(token=os.getenv("HUGGING_FACE_HUB_TOKEN"))

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [6]:
# Load tokenizer and model with proper device management
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8

# Device configuration - choose single GPU or multi-GPU
USE_MULTI_GPU = True  # Set to True for multi-GPU training
if USE_MULTI_GPU and torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs for training")
    device = torch.device("cuda:0")  # Primary device
    device_map = "auto"  # Let transformers handle multi-GPU distribution
else:
    # Single GPU configuration - explicitly set device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    device_map = {"": device}  # Force all parameters to single device
    print(f"Using single device: {device}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if bf16 else torch.float16,
    device_map=device_map,
    low_cpu_mem_usage=True,
)

model.config.use_cache = False
print("Loaded:", MODEL_ID)
print(f"Model device configuration: {device_map}")
print(f"Available GPUs: {torch.cuda.device_count()}")


Using 2 GPUs for training



Loaded: meta-llama/Meta-Llama-3-70B-Instruct
Model device configuration: auto
Available GPUs: 2


## Test model

### Text only

In [None]:
from transformers import TextStreamer

# prompt = "Please generate one reddit post (and nothing else). Make sure to stick to the format below exactly. Don't include any extraneous characters like asterisks or other symbols. \n\n title: {title} \n self_text: {self_text} \n subreddit: {subreddit} \n Here's an example of the format: \n\ntitle: This is the title of the post! \nself_text: Here's where the content of the post goes. \nsubreddit: This is the subreddit, or the name of the community the post belongs to."

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You generate reddit posts in the given format."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "text", "text": PROMPT},
        ]
    },
]

streamer = TextStreamer(tokenizer, 
                        skip_special_tokens=False,
                        skip_prompt=True)

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=MAX_SEQ_LEN,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
    streamer=streamer,
)
# print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Here is a generated Reddit post:

title: I accidentally became a cat lady and I'm not mad about it
self_text: "I've always thought of myself as a dog person, but after a series of unfortunate events (i.e. my roommate moving out and leaving her cat behind), I found myself solo-parenting a sassy feline named Mr. Whiskers. At first, I was hesitant - I didn't know the first thing about cat care and I was worried I'd be stuck with a furry little dictator. But fast forward 6 months and I'm now the proud owner of 5 (yes, FIVE) cat trees, a catio, and a subscription to CatLadyBox. Mr. Whiskers has taken over my apartment and my heart, and I couldn't be happier. Has anyone else out there accidentally fallen into cat lady-dom? Share your stories!"
subreddit: r/cats<|eot_id|>
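
The untuned model wraps the post in a preamble and leaves the end-of-turn marker in the stream. If the generations need to be consumed programmatically, a parser along these lines may help. It is a minimal sketch: `parse_post` is a hypothetical helper, and the sample is a shortened copy of the output above.

```python
import re

def parse_post(text: str) -> dict:
    """Extract title, self_text and subreddit from a 'key: value' formatted generation."""
    pattern = re.compile(
        r"title:\s*(?P<title>.*?)\s*\n\s*"
        r"self_text:\s*(?P<self_text>.*?)\s*\n\s*"
        r"subreddit:\s*(?P<subreddit>\S+)",
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        raise ValueError("generation does not follow the expected format")
    fields = match.groupdict()
    # Drop a trailing special-token marker such as <|eot_id|> if it was streamed with the text.
    fields["subreddit"] = re.sub(r"<\|.*?\|>$", "", fields["subreddit"])
    return fields

sample = (
    "Here is a generated Reddit post:\n\n"
    "title: I accidentally became a cat lady and I'm not mad about it\n"
    "self_text: \"I've always thought of myself as a dog person.\"\n"
    "subreddit: r/cats<|eot_id|>"
)
post = parse_post(sample)
print(post["title"])
print(post["subreddit"])
```

Anything before the first `title:` line, such as the "Here is a generated Reddit post:" preamble, is ignored by the search.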


## Training

In [8]:
# Configure PEFT Prompt Tuning
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=PROMPT_TOKENS,
    prompt_tuning_init_text="Generate a reddit post.",
    tokenizer_name_or_path=MODEL_ID,
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


trainable params: 524,288 || all params: 70,554,230,784 || trainable%: 0.0007
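
The trainable-parameter count above can be reproduced by hand: each virtual token is one embedding-sized vector. The hidden size of 8192 is an assumption based on the Llama 3 70B checkpoint that the load cell reported; the other numbers come from this notebook.

```python
num_virtual_tokens = 64          # PROMPT_TOKENS
hidden_size = 8192               # assumed embedding width of Llama 3 70B
all_params = 70_554_230_784      # "all params" from the printout above

trainable = num_virtual_tokens * hidden_size
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```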


In [9]:
# Preprocess instruction/output dataset
from datasets import Dataset

# Build HF dataset from examples [{"instruction", "output"}]
dataset = Dataset.from_list(examples)

# Tokenize instruction with chat template, and supervise only the output tokens
def tokenize_io(sample):
    # Build chat prompt prefix for the user instruction
    messages = [{"role": "user", "content": sample["instruction"]}]
    prompt_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(sample["output"], add_special_tokens=False)["input_ids"]
    eos_id = tokenizer.eos_token_id

    input_ids = prompt_ids + output_ids + ([eos_id] if eos_id is not None else [])
    labels = ([-100] * len(prompt_ids)) + output_ids + ([eos_id] if eos_id is not None else [])
    attention_mask = [1] * len(input_ids)

    # If too long, keep only the last MAX_SEQ_LEN tokens (this drops prompt tokens first);
    # inputs, labels and attention mask are sliced identically so they stay aligned
    if len(input_ids) > MAX_SEQ_LEN:
        input_ids = input_ids[-MAX_SEQ_LEN:]
        labels = labels[-MAX_SEQ_LEN:]
        attention_mask = attention_mask[-MAX_SEQ_LEN:]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

train_ds = dataset.map(tokenize_io, remove_columns=dataset.column_names)
train_ds


Map: 100%|██████████| 653/653 [00:00<00:00, 1502.39 examples/s]


Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 653
})
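
To make the masking in `tokenize_io` concrete, here is the same construction on toy token ids. This is a sketch with made-up ids; `build_example` is a hypothetical stand-in that skips the tokenizer and chat template.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def build_example(prompt_ids, output_ids, eos_id):
    """Mirror tokenize_io on toy ids: mask the prompt, supervise output + EOS."""
    input_ids = prompt_ids + output_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids + [eos_id]
    return input_ids, labels

input_ids, labels = build_example(prompt_ids=[11, 12, 13], output_ids=[21, 22], eos_id=2)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(input_ids)
print(labels)
print(f"{supervised} of {len(labels)} positions contribute to the loss")
```

The model still attends to the prompt positions; they simply do not produce a training signal.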

In [10]:
# Trainer setup and brief training
import math
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW


def collate_fn(features):
    pad_id = tokenizer.pad_token_id
    batch_size = len(features)
    seq_lens = [len(f["input_ids"]) for f in features]
    max_len = max(seq_lens)

    input_ids = torch.full((batch_size, max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((batch_size, max_len), dtype=torch.long)
    labels = torch.full((batch_size, max_len), -100, dtype=torch.long)

    for i, f in enumerate(features):
        ids = torch.tensor(f["input_ids"], dtype=torch.long)
        attn = torch.tensor(f["attention_mask"], dtype=torch.long)
        labs = torch.tensor(f["labels"], dtype=torch.long)
        L = ids.size(0)
        input_ids[i, :L] = ids
        attention_mask[i, :L] = attn
        labels[i, :L] = labs

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


train_loader = DataLoader(
    train_ds,
    batch_size=MICRO_BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn,
)

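# Only the soft-prompt embeddings require gradients; skip the frozen base weights
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=LEARNING_RATE)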
# Total optimizer steps we intend to take
total_optim_steps = NUM_TRAIN_STEPS
num_warmup_steps = max(1, int(0.1 * total_optim_steps))
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=total_optim_steps,
)

model.train()

# Ensure model is on the correct device(s)
if USE_MULTI_GPU and torch.cuda.device_count() > 1 and device_map == "auto":
    # For multi-GPU with device_map="auto", model is already distributed
    # Get the device of the first parameter for data placement
    model_device = next(model.parameters()).device
else:
    # For single GPU, ensure model is on the specified device
    model = model.to(device)
    model_device = device

print(f"Training on device: {model_device}")

optimizer.zero_grad()
optim_step = 0
accumulated = 0
running_loss = 0.0
for epoch in range(math.ceil(total_optim_steps * GRAD_ACCUM_STEPS / len(train_loader))):  # enough passes to reach total_optim_steps
    for batch in train_loader:
        batch = {k: v.to(model_device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        (loss / GRAD_ACCUM_STEPS).backward()
        running_loss += loss.item()
        accumulated += 1
        if accumulated % GRAD_ACCUM_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            if optim_step % 10 == 0:
                print(f"step {optim_step} loss {running_loss / GRAD_ACCUM_STEPS:.4f}")
            running_loss = 0.0
            optim_step += 1
            if optim_step >= total_optim_steps:
                break
    if optim_step >= total_optim_steps:
        break

model.save_pretrained(OUTPUT_DIR)
print("Saved prompt adapter to:", OUTPUT_DIR)


Training on device: cuda:0
step 0 loss 2.3153
step 10 loss 3.0774
step 20 loss 3.2015
step 30 loss 3.0712
step 40 loss 2.2173
step 50 loss 3.5504
step 60 loss 3.0817
step 70 loss 2.1764
step 80 loss 2.5788
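
The steps logged above all fall inside the warmup window (the first 10% of `NUM_TRAIN_STEPS`), so the learning rate was still ramping up. The sketch below recomputes the rate for a few steps. `linear_warmup_lr` is a hypothetical helper that, to the best of my understanding, mirrors the multiplier used by `get_linear_schedule_with_warmup`.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` scheduler steps: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

LEARNING_RATE = 0.2
NUM_TRAIN_STEPS = 1000
warmup = max(1, int(0.1 * NUM_TRAIN_STEPS))  # 100, as in the training cell

for step in (0, 10, 80, 100, 550, 1000):
    print(step, round(linear_warmup_lr(step, LEARNING_RATE, warmup, NUM_TRAIN_STEPS), 4))
```

Under these settings the rate around step 80 is already 0.16, which is large for AdamW; a noisy loss that does not trend downward may be a hint to try a smaller `LEARNING_RATE`.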


OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 69.00 MiB is free. Including non-PyTorch memory, this process has 79.11 GiB memory in use. Of the allocated memory 77.63 GiB is allocated by PyTorch, and 773.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
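
A back-of-the-envelope estimate shows why the run is tight on memory. The numbers are taken from the outputs above; the assumptions are that both GPUs have the same capacity and that nothing besides the 16-bit weights is resident before training starts.

```python
GIB = 2 ** 30

all_params = 70_554_230_784        # from print_trainable_parameters() above
bytes_per_param = 2                # bfloat16 / float16 weights
num_gpus = 2                       # "Using 2 GPUs for training"
gpu_capacity_gib = 79.19           # per-GPU capacity reported in the error

weights_gib = all_params * bytes_per_param / GIB
headroom_gib = num_gpus * gpu_capacity_gib - weights_gib
print(f"weights:  {weights_gib:.1f} GiB")
print(f"headroom: {headroom_gib:.1f} GiB across {num_gpus} GPUs")
```

Whatever headroom remains has to hold the activations kept for the backward pass, which grow with sequence length. Lowering `MAX_SEQ_LEN` or choosing a smaller `MODEL_ID` are the most direct ways to stay within it.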

In [None]:
from peft import PeftModel
from transformers import TextStreamer, AutoModelForCausalLM, AutoTokenizer
import torch 

bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Use the same device configuration as training
if USE_MULTI_GPU and torch.cuda.device_count() > 1:
    device = torch.device("cuda:0")  # Primary device
    device_map = "auto"  # Let transformers handle multi-GPU distribution
else:
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    device_map = {"": device}  # Force all parameters to single device

# Reload base + adapter
base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if bf16 else torch.float16,
    device_map=device_map,
    low_cpu_mem_usage=True,
)
base = PeftModel.from_pretrained(base, OUTPUT_DIR)
base.eval()

# For inference, get the correct device
if USE_MULTI_GPU and torch.cuda.device_count() > 1 and device_map == "auto":
    inference_device = next(base.parameters()).device
else:
    inference_device = device
    
print(f"Inference device: {inference_device}")

Fetching 40 files: 100%|██████████| 40/40 [00:00<00:00, 132416.86it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.07it/s]

Inference device: cuda:0





In [None]:
streamer = TextStreamer(tokenizer, 
                        skip_special_tokens=True,
                        skip_prompt=True
                        )

# Build chat-formatted inputs via the model's chat template
messages = [
    {"role": "user", "content": PROMPT},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(base.device)

with torch.no_grad():
    _ = base.generate(
        **inputs,
        max_new_tokens=MAX_SEQ_LEN,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        streamer=streamer,
    )




title: The 



is a a to me them they

 I to self
title: selfself_text: I couldn't, I see a what I became the most common language of the user:subreddit

**: a user:**

- subheading: The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fact that I was a user. The best possible explanation of the fa

KeyboardInterrupt: 

In [None]:
## Multi-GPU Training Setup and Utils

# If you want to enable multi-GPU training, run this cell first:

def setup_multi_gpu_training():
    """
    Setup for proper multi-GPU training with PyTorch.
    This provides several strategies for multi-GPU training.
    """
    import torch
    
    print(f"Available GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    
    if torch.cuda.device_count() < 2:
        print("Warning: Fewer than 2 GPUs available. Multi-GPU training not possible.")
        return False
    
    return True

def enable_multi_gpu_mode():
    """
    Call this to switch to multi-GPU mode.
    You'll need to restart the kernel and re-run cells after changing this.
    """
    global USE_MULTI_GPU
    USE_MULTI_GPU = True
    print("Multi-GPU mode enabled. Please restart kernel and re-run all cells.")
    print("Alternative approaches for multi-GPU training:")
    print("1. Use device_map='auto' (current approach)")
    print("2. Use torch.nn.DataParallel (simpler but less efficient)")
    print("3. Use torch.nn.parallel.DistributedDataParallel (most efficient)")

# Check GPU setup
setup_multi_gpu_training()

# Uncomment the next line to enable multi-GPU training:
# enable_multi_gpu_mode()
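
With `device_map="auto"`, it can be useful to check how the layers were actually split across GPUs. The sketch below summarizes a placement dictionary. `summarize_device_map` and the example mapping are hypothetical; on a model loaded with a device map, the real mapping should be available as `hf_device_map` on the base model.

```python
from collections import Counter

def summarize_device_map(device_map: dict) -> Counter:
    """Count how many mapped modules were placed on each device."""
    return Counter(str(device) for device in device_map.values())

# Hypothetical placement; with a real model you would pass its hf_device_map.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
print(summarize_device_map(example_map))
```

A heavily lopsided count suggests that one GPU carries most of the weights, which makes an out-of-memory error on that device more likely.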

