This repository contains the implementation of CURP (Codebook-based Continuous User Representation for Personalized Generation with LLMs), a novel framework for learning interpretable, continuous user representations that enhance personalized text generation with large language models (LLMs). Training proceeds in two stages. In the first stage, we construct a universal codebook via product quantization with balanced K-Means initialization to build user prototypes. In the second stage, we project the prototype representation into the LLM's embedding space to guide personalized generation.
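To make the stage-one idea concrete, here is a minimal sketch of product quantization: a user embedding is split into sub-vectors, and each sub-vector is snapped to its nearest code in a per-sub-space codebook. The dimensions, names, and random codebooks below are illustrative, not the repository's actual configuration.

```python
import numpy as np

# Hypothetical sizes: embedding dim D, M sub-spaces, K codes per sub-space.
D, M, K = 768, 8, 256
SUB_D = D // M

# One sub-codebook per sub-space; in CURP these are initialized with
# balanced K-Means over user history embeddings (random here for brevity).
codebooks = np.random.randn(M, K, SUB_D).astype(np.float32)

def pq_quantize(user_emb: np.ndarray) -> np.ndarray:
    """Return the PQ reconstruction of a (D,)-dim user embedding."""
    parts = user_emb.reshape(M, SUB_D)
    out = []
    for m in range(M):
        # Nearest codeword in sub-space m by Euclidean distance.
        idx = np.argmin(np.linalg.norm(codebooks[m] - parts[m], axis=1))
        out.append(codebooks[m][idx])
    return np.concatenate(out)  # the continuous user prototype, shape (D,)

prototype = pq_quantize(np.random.randn(D).astype(np.float32))
```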
```bash
git clone https://github.com/RaidonWong/CURP.git curp
cd curp
pip install -r requirements.txt
```

- Source: AlignX
- Fields Used: `"prompt"`, `"chosen"`, `"rejected"`, `"Demographic Information"`, `"User-Generated Content"` (see the hypothetical record below)
- Note: We use a randomly filtered, deduplicated subset. It can be replaced with any user-history dataset with a broad knowledge distribution.
- Sources:
- Tasks: News Headline, Tweet Paraphrase, Review Writing
- Preprocessing (a sketch of the rule-based filters follows this list):
  - News Headline: filter out short texts; use LLaMA-3-8B to judge headline-paragraph association.
  - Tweet Paraphrase: remove noisy entries (e.g., `@<ref>`); use an LLM to filter.
  - Review Writing: keep reviews with consistent ratings and apply a length filter.
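A minimal sketch of the rule-based parts of these filters follows. The threshold and helper names are hypothetical, and the LLM-based judging and filtering steps are omitted.

```python
import re

MIN_WORDS = 20  # hypothetical length cutoff; the real one is set in the scripts

def clean_tweet(text: str) -> str:
    """Drop noisy @<ref>-style mentions before the LLM filtering pass."""
    return re.sub(r"@\S+", "", text).strip()

def long_enough(text: str) -> bool:
    """Length filter used for the News Headline and Review Writing tasks."""
    return len(text.split()) >= MIN_WORDS

def consistent_rating(ratings: list) -> bool:
    """Keep a review only if all of its ratings agree."""
    return len(set(ratings)) == 1
```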
- First of all, you need to download the Qwen-2.5-7B-Instruct model and the Contriever model:
- Qwen-2.5-7B-Instruct
- Contriever
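If you fetch both from the Hugging Face Hub, loading can look like this. The hub IDs `Qwen/Qwen2.5-7B-Instruct` and `facebook/contriever` are the public ones; point `from_pretrained` at local paths if you downloaded the weights manually.

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# The LLM backbone used for personalized generation.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# The retriever-style encoder used to embed user histories.
ctr_tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
contriever = AutoModel.from_pretrained("facebook/contriever")
```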
Then you need to add two special tokens, `<PAD>` and `<USR_EMB>`, and resize the LLM's input token embeddings. `<PAD>` is used for padding, and `<USR_EMB>` marks the position where the user representation is inserted.
```python
special_tokens = {
    "additional_special_tokens": ["<PAD>", "<USR_EMB>"]
}
tokenizer.add_special_tokens(special_tokens)   # register the new tokens
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match
```

- Secondly, prepare the embedding of each user history for the next steps:
```bash
python encode.py
```
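What `encode.py` produces is, roughly, one vector per history entry; the standard Contriever recipe is mean pooling over token states. The following is a sketch of that idea, not the script's exact code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

ctr_tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
contriever = AutoModel.from_pretrained("facebook/contriever").eval()

def mean_pool(last_hidden, attention_mask):
    """Average token states, ignoring padding (the usual Contriever pooling)."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

histories = ["posted about hiking boots", "reviewed a sci-fi novel"]  # toy inputs
batch = ctr_tokenizer(histories, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = contriever(**batch).last_hidden_state
embeddings = mean_pool(hidden, batch["attention_mask"])  # shape (2, 768)
```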
- Thirdly, pretrain a product-quantized codebook:

```bash
bash step1.sh
```
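The balanced K-Means initialization used for the codebook can be sketched as below. This is a toy greedy variant that caps each cluster at roughly n/k points, offered only as an illustration of the balancing constraint, not the repository's exact algorithm.

```python
import numpy as np

def balanced_kmeans_init(X, k, iters=10):
    """Toy balanced K-Means: each cluster holds at most ceil(n/k) points."""
    n = len(X)
    cap = int(np.ceil(n / k))
    centers = X[np.random.choice(n, k, replace=False)].copy()
    for _ in range(iters):
        # Pairwise distances between points and current centers: (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.full(n, -1)
        counts = np.zeros(k, dtype=int)
        # Greedy balanced assignment: walk (point, center) pairs from
        # closest to farthest, respecting each cluster's capacity.
        order = np.argsort(dists, axis=None)
        for i, c in zip(*np.unravel_index(order, dists.shape)):
            if assign[i] == -1 and counts[c] < cap:
                assign[i] = c
                counts[c] += 1
        # Standard center update on the balanced assignment.
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return centers, assign
```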
- Then, align the representation with the LLM:

```bash
bash step2.sh
```
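Conceptually, stage two projects the prototype into the LLM's embedding space and splices it in at the `<USR_EMB>` position. Below is a sketch with assumed dimensions (Contriever outputs 768-d; Qwen-2.5-7B's hidden size is taken to be 3584); the real projector and training objective live in the step-2 code.

```python
import torch
import torch.nn as nn

projector = nn.Linear(768, 3584)  # assumed dims: user-encoder dim -> LLM hidden dim

def inject_user_embedding(model, tokenizer, input_ids, prototype):
    """Swap the <USR_EMB> placeholder's embedding for the projected prototype."""
    embeds = model.get_input_embeddings()(input_ids)       # (batch, seq, hidden)
    usr_id = tokenizer.convert_tokens_to_ids("<USR_EMB>")
    mask = input_ids == usr_id                             # placeholder positions
    embeds[mask] = projector(prototype).to(embeds.dtype)   # splice in the user vector
    return embeds
```

The returned tensor is then consumed via `model(inputs_embeds=embeds, ...)` or `model.generate(inputs_embeds=embeds, attention_mask=...)`, so the injected vector can steer generation toward the user's style.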
- After training, you can run inference:

```bash
python inference.py
```

Other tasks follow a similar process, and we provide the corresponding code for each.
All arguments are set to their default values; adjust them as needed.