CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs

Introduction

This repository contains the implementation of CURP (Codebook-based Continuous User Representation for Personalized Generation with LLMs), a novel framework for learning interpretable, continuous user representations that enhance personalized text generation with large language models (LLMs). Training is divided into two stages. In the first stage, we construct a universal codebook via product quantization with balanced K-Means initialization to build user prototypes. In the second stage, we project the prototype representation into the LLM's embedding space and use it to guide personalized generation.
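
As a rough illustration of the first stage, the sketch below builds a product-quantized codebook over user-history embeddings, using a simple balanced initialization (sorting points along the top principal direction and averaging equal-sized chunks). The subspace count M, codebook size K, and the balancing heuristic are assumptions for illustration, not the repository's exact implementation.

# Minimal product-quantization sketch with a balanced K-Means-style initialization.
# Illustrative only, NOT the exact code behind this repo: M, K, and the balancing
# heuristic are assumptions.
import numpy as np

def balanced_init(X, K):
    """Initialize K centroids from equally sized chunks of points sorted along
    the top principal direction (a simple balanced heuristic)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    order = np.argsort(Xc @ Vt[0])
    return np.stack([X[idx].mean(axis=0) for idx in np.array_split(order, K)])

def train_pq_codebook(X, M=8, K=256, iters=20):
    """Learn M sub-codebooks of K centroids each; return codebooks (M, K, D/M)
    and PQ codes (N, M) that serve as the discrete user prototype."""
    N, D = X.shape
    assert D % M == 0, "embedding dim must be divisible by the number of subspaces"
    d = D // M
    codebooks, codes = [], []
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        C = balanced_init(sub, K)
        for _ in range(iters):                                # plain Lloyd iterations
            dist = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
            assign = dist.argmin(axis=1)
            for k in range(K):
                if (assign == k).any():
                    C[k] = sub[assign == k].mean(axis=0)
        codebooks.append(C)
        codes.append(assign)
    return np.stack(codebooks), np.stack(codes, axis=1)

# usage: codebooks, codes = train_pq_codebook(history_embeddings, M=8, K=256)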

Installation

git clone https://github.com/RaidonWong/CURP.git curp
cd curp
pip install -r requirements.txt

Data Preprocessing

1. AlignX Dataset

  • Source: AlignX
  • Fields Used: "prompt", "chosen", "rejected", "Demographic Information", "User-Generated Content"
  • Note: We use a randomly filtered, deduplicated subset. It can be replaced with any user-history dataset that has a broad knowledge distribution.

2. Validation Datasets

  • Sources:
  • Tasks: News Headline, Tweet Paraphrase, Review Writing
  • Preprocessing (a minimal filtering sketch follows this list):
    • News Headline: filter out short texts; use LLaMA-3-8B to judge the association between each headline and its paragraph.
    • Tweet Paraphrase: remove noisy entries (e.g., @<ref>); use an LLM to filter.
    • Review Writing: keep reviews with consistent ratings and apply a length filter.
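
As a rough illustration of the rule-based part of these filters, here is a minimal sketch; the record field name, word-count thresholds, and deduplication strategy are assumptions, not the repository's actual preprocessing code.

# Illustrative length filter plus deduplication (field names and thresholds are assumptions).
def filter_records(records, min_words=10, max_words=512):
    seen, kept = set(), []
    for r in records:
        text = r["text"].strip()
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):   # length filter
            continue
        if text in seen:                              # simple exact-match deduplication
            continue
        seen.add(text)
        kept.append(r)
    return kept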

Usage

  • First, download the Qwen-2.5-7B-Instruct model and the Contriever model:
    • Qwen-2.5-7B-Instruct
    • Contriever

  Then add two special tokens, <PAD> and <USR_EMB>, and resize the LLM's token embedding matrix. <PAD> is used for padding, and <USR_EMB> marks the position where the user representation is inserted.
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the backbone; replace the Hub id with your local path if you downloaded the model manually
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

special_tokens = {
    "additional_special_tokens": ["<PAD>", "<USR_EMB>"]
}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))
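
For intuition, here is a minimal sketch of how the <USR_EMB> placeholder can be filled at inference time by splicing a user vector, already projected to the LLM hidden size, into the input embeddings of the model loaded above; the function, tensor shapes, and generation call are illustrative assumptions rather than this repository's actual code.

import torch

# Illustrative only: replace the <USR_EMB> position in the input embeddings with a
# projected user vector (assumed shapes: input_ids (1, L), user_vec (hidden_size,)).
usr_emb_id = tokenizer.convert_tokens_to_ids("<USR_EMB>")

@torch.no_grad()
def inject_user_embedding(input_ids, user_vec):
    inputs_embeds = model.get_input_embeddings()(input_ids)          # (1, L, H)
    pos = (input_ids[0] == usr_emb_id).nonzero(as_tuple=True)[0]     # <USR_EMB> position(s)
    inputs_embeds[0, pos] = user_vec.to(inputs_embeds.dtype)
    return inputs_embeds

# usage (illustrative):
# ids = tokenizer("Write a headline for this user: <USR_EMB> ...", return_tensors="pt").input_ids
# out = model.generate(inputs_embeds=inject_user_embedding(ids, user_vec), max_new_tokens=128)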

  • Second, prepare the embedding of each user history for the following steps (a minimal encoding sketch follows this list):

python encode.py

  • Third, pretrain the product-quantized codebook:

bash step1.sh

  • Then, align the user representation with the LLM:

bash step2.sh

  • After training, run inference:

python inference.py
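
As a rough picture of what encode.py produces, here is a minimal Contriever encoding sketch that mean-pools token embeddings over each history text; the batching, output format, and save path are assumptions, not the script's exact behavior.

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative history encoder (not the exact encode.py): Contriever with mean pooling.
tok = AutoTokenizer.from_pretrained("facebook/contriever")
enc = AutoModel.from_pretrained("facebook/contriever").eval()

@torch.no_grad()
def encode_histories(texts, batch_size=32):
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tok(texts[i:i + batch_size], padding=True, truncation=True, return_tensors="pt")
        hidden = enc(**batch).last_hidden_state                      # (B, L, H)
        mask = batch["attention_mask"].unsqueeze(-1)                 # (B, L, 1)
        chunks.append((hidden * mask).sum(dim=1) / mask.sum(dim=1))  # mean over real tokens
    return torch.cat(chunks)

# e.g. torch.save(encode_histories(user_histories), "history_embeddings.pt")  # hypothetical path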

Other tasks follow a similar process, and we provide the corresponding code for each. All script arguments have default values; adjust them as needed.
