From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation
- OS: Ubuntu 22.04
- GPU:
  - NVIDIA RTX 4090 (24GB VRAM) for Dialogue Graph Encoder training
  - NVIDIA A100 (80GB VRAM) for Generator training
  - NVIDIA RTX 4090 (24GB VRAM)
- Python: >= 3.9
- Required packages: see the `requirements.txt` file.
- Clone this repository:

  ```sh
  git clone https://github.com/IKMLab/MUDI
  cd MUDI
  ```

- Create a new conda environment and activate it:

  ```sh
  conda create -n mudi python=3.9
  conda activate mudi
  ```

- Install the required packages:

  ```sh
  pip install -r requirements.txt
  ```
You can simply download the entire dataset from the following link:

```sh
# from Google Drive
# https://drive.google.com/file/d/1nscytAhDEPdDCn5K9nfT7d_RFTXoQZcZ/view?usp=drive_link
pip install gdown
gdown --id 1nscytAhDEPdDCn5K9nfT7d_RFTXoQZcZ
unzip dataset.zip
```

You can also download the dataset manually and put it in the `dataset/RCC/` and `dataset/ConvAI2/` directories.
- RCC: We use the Reddit Coherence Chain (RCC) dataset, a large-scale chit-chat dialogue dataset, for dialogue alignment in the pre-training phase of our Dialogue Graph Encoder. The RCC dataset can be accessed here.
  - In our experiments, we use the 5-turn RCC dataset.
- ConvAI2: We use the ConvAI2 dataset, a personalized chit-chat dialogue dataset designed for personalized dialogue generation. The ConvAI2 dataset can be accessed here.
  - In our experiments, we use the 'original' version of the ConvAI2 dataset, but you can also use the 'revised' version.
- We provide the coherence-annotated ConvAI2 dataset in the `dataset/ConvAI2/llama3/` and `dataset/ConvAI2/mixtral/` directories (also available on Google Drive).
- The `dataset` folder should be organized as follows:

```
.
├── dataset
│   ├── ConvAI2
│   │   ├── llama3
│   │   │   ├── train_self_original_coherence.json
│   │   │   └── valid_self_original_coherence.json
│   │   ├── mixtral
│   │   │   ├── train_self_original_coherence.json
│   │   │   └── valid_self_original_coherence.json
│   │   ├── train_self_original.json
│   │   └── valid_self_original.json
│   └── RCC
│       └── reddit_conversations_v1.0_5turns
│           ├── reddit_conversations.5turns.train.txt
│           ├── reddit_conversations.5turns.test.txt
│           └── reddit_conversations.5turns.dev.txt
├── ...
```
- Convert the raw RCC training and validation sets to JSON format by running the following code:

```sh
python src/data/parse.py \
    -i dataset/RCC/reddit_conversations_v1.0_5turns/reddit_conversations.5turns.train.txt \
    -o dataset/RCC/reddit_conversations_v1.0_5turns/train.json \
    --dataset rcc
```
| Argument | Explanation |
|---|---|
| `-d`, `--dataset` | Dataset name. Choices are 'convai2', 'rcc'. Required. |
| `-i`, `--input_file_path` | Path to the input file. Only .txt files are allowed. Required. |
| `-o`, `--output_file_path` | Path to save the converted file. Only .json files are allowed. Required. |

After the RCC dataset is converted to JSON, the data format should be as follows:
{ "dialogue": [ "what is your secret that nobody else knows ?", "it 's a secret nobody should know .", "go on , you know you want to .", "so this is what happened - \" \" i hope you liked it .", "you 're the strong silent type are n't you ." ] }
We use Sentence-Transformer as an encoder to extract contextualized global semantics from both utterances and personas, thereby initializing the node features.
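For intuition, node-feature initialization looks roughly like the following. This is a minimal sketch using the `sentence-transformers` package; the checkpoint name `all-MiniLM-L6-v2` is only a placeholder, since the actual encoder is configured inside the preprocessing script:

```python
from sentence_transformers import SentenceTransformer

# Placeholder checkpoint; the repository's preprocessing script configures
# the actual encoder used in the paper.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

utterances = ["what is your secret that nobody else knows ?",
              "it 's a secret nobody should know ."]
personas = ["i like to go hunting ."]

# One vector per utterance/persona; these become the initial node features
# of the dialogue graph.
node_features = encoder.encode(utterances + personas)
print(node_features.shape)  # (3, embedding_dim)
```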
You can obtain the encoded dataset by running the following code:
For the RCC dataset:

```sh
python src/data/preprocess.py -d rcc \
    -i dataset/RCC/reddit_conversations_v1.0_5turns/train.json \
    -o dataset/RCC/reddit_conversations_v1.0_5turns/train.pkl
```

For the ConvAI2 dataset:

```sh
python src/data/preprocess.py -d convai2 \
    -i dataset/ConvAI2/llama3/valid_self_original_coherence.json \
    -o dataset/ConvAI2/llama3/valid_self_original_coherence.pkl
```
Please preprocess the training and validation sets separately.
| Argument | Explanation |
|---|---|
| `-d`, `--dataset` | Dataset name. Choices are 'convai2', 'rcc', 'daily_dialog'. Required. |
| `-i`, `--input_file_path` | Path to the input file. Required. |
| `-o`, `--output_file_path` | Path to save the preprocessed dataset. Only pickle files are allowed. Required. |
| `--augment` | Augment the dataset (ConvAI2 only). Default is True. |
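To verify the preprocessing output, you can peek inside the resulting pickle file. This is a generic inspection sketch; the concrete object structure is whatever `src/data/preprocess.py` serializes:

```python
import pickle

# Generic inspection; the concrete structure (dict, list, custom objects, ...)
# is defined by src/data/preprocess.py.
with open("dataset/RCC/reddit_conversations_v1.0_5turns/train.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))
if hasattr(data, "__len__"):
    print(len(data), "items")
```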
To pre-train the Dialogue Graph Encoder, run the following code:

```sh
sh scripts/pretrain_gnn.sh
```

Indicate the path to the pre-trained Dialogue Graph Encoder in the `--pretrained_model_path` argument. You can then run the following code to fine-tune the Dialogue Graph Encoder:

```sh
sh scripts/train_gnn.sh
```
| Argument | Explanation |
|---|---|
| `--data_dir` | Path to the dataset directory. Default is 'dataset/ConvAI2/'. |
| `--processed_train_data_dir` | Path to the processed training data directory. |
| `--processed_valid_data_dir` | Path to the processed validation data directory. |
| `--train_data_name` | Training data name under the `data_dir`. Default is 'train_self_original_coherence.pkl'. |
| `--valid_data_name` | Validation data name under the `data_dir`. Default is 'valid_self_original_coherence.pkl'. |
| `--processed_train_data_name` | Processed training data name under the `processed_train_data_dir`. |
| `--processed_valid_data_name` | Processed validation data name under the `processed_valid_data_dir`. |
| `--ckpt_dir` | Checkpoint directory. Default is 'checkpoints/gnn'. |
| `--num_workers` | Number of workers. Default is 4. |
| `--seed` | Random seed. Default is 42. |
| `--wandb` | Use WandB or not. Default is False. |
| `--wandb_entity` | WandB entity. |
| `--wandb_project` | WandB project. |
| `-a`, `--wandb_run_name` | WandB run name. |
| `--process_mode` | Label (coherence relations) preprocessing mode. Default is 'single_filter'. |
| `--k_hop` | How many k-hop neighbors to keep. Default is 3. |
| `--reverse_edge` | Reverse edges or not. Default is False. |
| `--directed` | Directed graph or not. Default is True. |
| `--train_mode` | Model training mode. Choices are 'finetuning', 'pretraining'. Default is 'finetuning'. |
| `--do_inference` | Do inference only. Default is False. |
| `--cpu` | Use CPU (True) or GPU/CUDA (False). Default is False. |
| `--distributed` | Use distributed training or not. Default is False. |
| `--pretrained_model_path` | Path to the pretrained model. |
| `--pretrained_utterance_encoder` | Pretrained model for the utterance/persona encoder. Choices are 'none', 'bert', 'roberta'. Default is 'none'. |
| `--layer_type` | Type of GNN encoder layers. Choices are 'GAT', 'GATv2', 'DialogueGAT'. Default is 'DialogueGAT'. |
| `--num_layers` | Number of GNN encoder layers. Default is 2. |
| `--num_heads` | Number of attention heads when using GAT-like layers. Default is 4. |
| `--embedding_dim` | Embedding dimension of the GNN layers. Default is 512. |
| `--batch_size` | Batch size. Default is 512. |
| `--epochs` | Number of epochs. Default is 100. |
| `--lr` | Learning rate. Default is 0.0001. |
| `--weight_decay` | Weight decay. Default is 0.01. |
| `--optimizer` | Optimizer. Choices are 'adam', 'adamw', 'sgd', 'adagrad', 'rmsprop', 'sparse_adam'. Default is 'adamw'. |
| `--coh_rel_cls_weight` | Loss weight for coherence relations classification. Default is 1.0. |
| `--link_prediction_weight` | Loss weight for link prediction. Default is 1.0. |
| `--next_resp_type_direct_weight` | Loss weight for next response type prediction (direct). Default is 1.0. |
| `--next_resp_type_seq_weight` | Loss weight for next response type prediction (sequential). Default is 1.0. |
| `--endure_times` | Maximum number of epochs to endure increasing validation loss before early stopping. Default is 10. |
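For intuition about what `--layer_type`, `--num_layers`, `--num_heads`, and `--embedding_dim` control, here is a minimal sketch of a two-layer graph attention encoder built with PyTorch Geometric's `GATv2Conv`. It is an illustration only, not the repository's `DialogueGAT` implementation:

```python
import torch
from torch_geometric.nn import GATv2Conv

class GraphEncoder(torch.nn.Module):
    """Illustrative 2-layer GATv2 encoder (not the repo's DialogueGAT)."""

    def __init__(self, in_dim: int, embedding_dim: int = 512, num_heads: int = 4):
        super().__init__()
        # concat=True (the default) multiplies the output width by num_heads,
        # so each head outputs embedding_dim // num_heads channels.
        self.conv1 = GATv2Conv(in_dim, embedding_dim // num_heads, heads=num_heads)
        self.conv2 = GATv2Conv(embedding_dim, embedding_dim // num_heads, heads=num_heads)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

# Toy usage: 5 utterance/persona nodes with 384-dim initial features.
x = torch.randn(5, 384)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])  # a simple chain
out = GraphEncoder(384)(x, edge_index)
print(out.shape)  # torch.Size([5, 512])
```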
Note: If you encounter the error `FileNotFoundError: [Errno 2] No such file or directory: 'dataset/RCC/reddit_conversations_v1.0_5turns//processed_train/processed_train.pt'`, the processed data was not found. Delete the `dataset/RCC/reddit_conversations_v1.0_5turns/processed_train/` directory and rerun the code. Other similar errors can be resolved in the same way.
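If you prefer to do the cleanup from Python, a short snippet like this works (the path is the stale cache directory from the error above):

```python
import shutil

# Remove the stale processed-data cache so it is rebuilt on the next run.
shutil.rmtree("dataset/RCC/reddit_conversations_v1.0_5turns/processed_train",
              ignore_errors=True)
```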
Indicate the path to the Dialogue Graph Encoder in the `--pretrained_dialogue_encoder_weights_path` argument.

You can run the following code to train the generator:

```sh
sh scripts/train_generator.sh
```
| Argument | Explanation |
|---|---|
| `--data_dir` | Path to the dataset directory. Default is 'dataset/ConvAI2/'. |
| `--processed_train_data_dir` | Path to the processed training data directory. |
| `--processed_valid_data_dir` | Path to the processed validation data directory. |
| `--train_data_name` | Training data name under the `data_dir`. Default is 'train_self_original_coherence.pkl'. |
| `--valid_data_name` | Validation data name under the `data_dir`. Default is 'valid_self_original_coherence.pkl'. |
| `--processed_train_data_name` | Processed training data name under the `processed_train_data_dir`. |
| `--processed_valid_data_name` | Processed validation data name under the `processed_valid_data_dir`. |
| `--ckpt_dir` | Checkpoint directory. Default is 'ckpts/generator/ConvAI2/'. |
| `--num_workers` | Number of workers. Default is 4. |
| `--seed` | Random seed. Default is 42. |
| `--wandb` | Use WandB or not. Default is False. |
| `--wandb_entity` | WandB entity. |
| `--wandb_project` | WandB project. |
| `-a`, `--wandb_run_name` | WandB run name. |
| `--process_mode` | Label (coherence relations) preprocessing mode. Default is 'single_filter'. |
| `--k_hop` | How many k-hop neighbors to keep. Default is 3. |
| `--reverse_edge` | Reverse edges or not. Default is False. |
| `--directed` | Directed graph or not. Default is True. |
| `--train_mode` | Model training mode. Choices are 'finetuning', 'pretraining'. Default is 'finetuning'. |
| `--do_inference` | Do inference only. Default is False. |
| `--cpu` | Use CPU (True) or GPU/CUDA (False). Default is False. |
| `--distributed` | Use distributed training or not. Default is False. |
| `--pretrained_dialogue_encoder_weights_path` | Path to the pretrained dialogue encoder weights. |
| `--pretrained_dialogue_encoder_encoder_weights_path` | Path to the encoder weights of the pretrained dialogue encoder. |
| `--pretrained_utterance_encoder` | Pretrained model for the utterance/persona encoder. Choices are 'none', 'bert', 'roberta'. Default is 'none'. |
| `--layer_type` | Type of GNN encoder layers. Choices are 'GAT', 'GATv2', 'RGAT', 'DialogueGAT'. Default is 'DialogueGAT'. |
| `--num_layers` | Number of GNN encoder layers. Default is 2. |
| `--num_heads` | Number of attention heads when using GAT-like layers. Default is 4. |
| `--embedding_dim` | Embedding dimension of the GNN layers. Default is 512. |
| `--batch_size` | Batch size. Default is 512. |
| `--epochs` | Number of epochs. Default is 100. |
| `--lr` | Learning rate. Default is 0.0001. |
| `--weight_decay` | Weight decay. Default is 0.01. |
| `--optimizer` | Optimizer. Choices are 'adam', 'adamw', 'sgd', 'adagrad', 'rmsprop', 'sparse_adam'. Default is 'adamw'. |
| `--coh_rel_cls_weight` | Loss weight for coherence relations classification. Default is 1.0. |
| `--link_prediction_weight` | Loss weight for link prediction. Default is 1.0. |
| `--next_resp_type_direct_weight` | Loss weight for next response type prediction (direct). Default is 1.0. |
| `--next_resp_type_seq_weight` | Loss weight for next response type prediction (sequential). Default is 1.0. |
| `--endure_times` | Maximum number of epochs to endure increasing validation loss before early stopping. Default is 10. |
| `--coherence_attn_strategy` | Coherence attention strategy. Choices are 'SP', 'Emb', 'SP+Emb'. Default is 'SP+Emb'. |
| `--graph_encoder_strategy` | Graph encoder strategy. Choices are 'Attn', 'Add', 'C', 'P', 'Random', 'None'. Default is 'Attn'. |
You can run the following code to generate responses:

```sh
sh scripts/generate.sh -m <model_name_or_path>
```
| Argument | Explanation |
|---|---|
| `-m`, `--model_name_or_path` | The model path or model name. Required. |
| `-t`, `--tokenizer_name_or_path` | The tokenizer path or tokenizer name. Required. |
| `-o`, `--output_dir` | The output directory where results are saved. Required. |
| `--data_dir` | Path to the dataset directory. Default is 'dataset/ConvAI2/'. |
| `--processed_data_dir` | Path to the processed data directory. |
| `--data_name` | Data file name. Default is 'valid_self_original_coherence.pkl'. |
| `--processed_data_name` | Processed data file name. |
| `--num_workers` | Number of workers. Default is 4. |
| `--process_mode` | Label (coherence relations) preprocessing mode. Default is 'single_filter'. |
| `--k_hop` | How many k-hop neighbors to keep. Default is 3. |
| `--reverse_edge` | Reverse edges or not. Default is False. |
| `--directed` | Directed graph or not. Default is True. |
| `--batch_size` | Batch size. Default is 4. |
| `--pretrained_utterance_encoder` | Pretrained model for the utterance/persona encoder. Choices are 'none', 'bert', 'roberta'. Default is 'none'. |
| `--tau` | Temperature parameter for sampling in dialogue generation. Default is 0.2. |
| `--top_k_relations` | Number of top relations to consider when generating responses. Default is 3. |
| `--coherence_attn_strategy` | Coherence attention strategy. Choices are 'SP', 'Emb', 'SP+Emb'. Default is 'SP+Emb'. |
| `--graph_encoder_strategy` | Graph encoder strategy. Choices are 'Attn', 'Add', 'C', 'P', 'Random', 'None'. Default is 'Attn'. |
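To illustrate how `--tau` and `--top_k_relations` interact, here is a minimal sketch of temperature-scaled top-k selection over relation scores. It is illustrative only; the repository's actual sampling logic lives in the generation code:

```python
import torch

def select_relations(scores: torch.Tensor, tau: float = 0.2, top_k: int = 3):
    """Temperature-scaled softmax over relation scores, then keep the top-k.

    A lower tau sharpens the distribution toward the highest-scoring
    coherence relations; a higher tau flattens it.
    """
    probs = torch.softmax(scores / tau, dim=-1)
    top_probs, top_idx = probs.topk(top_k)
    return top_idx, top_probs

# Toy scores over 7 candidate coherence relations.
scores = torch.tensor([0.1, 1.2, 0.3, 2.0, 0.5, 1.7, 0.2])
idx, p = select_relations(scores)
print(idx.tolist(), p.tolist())
```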
We use existing evaluation metrics to evaluate the generated responses. Please download the models and put them in the root directory.
The structure of the root directory should be as follows:
```
.
├── assets
├── dataset
├── scripts
├── src
├── consistent_model
├── bart_score.pth
├── ...
```
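For example, `bart_score.pth` is the fine-tuned checkpoint consumed by BARTScore. A typical loading pattern, sketched from the official BARTScore repository's usage (it assumes the `bart_score` module from that repository is importable), looks like this:

```python
from bart_score import BARTScorer  # from the official BARTScore repository

# Load the base model, then the fine-tuned checkpoint (bart_score.pth).
scorer = BARTScorer(device="cuda:0", checkpoint="facebook/bart-large-cnn")
scorer.load(path="bart_score.pth")

# Higher (less negative) log-likelihood scores indicate better quality.
print(scorer.score(["i love hiking on weekends ."],
                   ["i enjoy going for hikes every weekend ."],
                   batch_size=4))
```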
We also use QuantiDCE and DEAM to evaluate the coherence of the generated responses. You can download the QuantiDCE and DEAM models from the following links.
Since the official repositories do not provide convenient inference scripts, we provide the inference scripts in the `coherence_evaluation` folder. To run inference, place the contents of the corresponding metric's folder into the root directory of the official method's repository and follow the steps below:
- QuantiDCE:
  - Replace `util/opt.py` with the provided `opt.py` file.
  - Put the `quantidce_inference.py` file in the QuantiDCE root directory.
  - Put the `compute_coherence.sh` file in the QuantiDCE `script` directory.
  - Replace `INPUT_FILE_PATH` and `OUTPUT_FILE_PATH` in the `compute_coherence.sh` file with the corresponding paths.
  - Run the `compute_coherence.sh` file to evaluate the coherence of the generated responses:

    ```sh
    cd script
    sh compute_coherence.sh
    ```
- DEAM:
  - Place the contents of the `coherence_evaluation/DEAM/` folder into the DEAM root directory.
  - Run `convert_format.py` to convert the data to the DEAM format:

    ```sh
    python inference/convert_format.py --model_type mudi -i <input_file_path>
    ```

    This will generate files in the same directory as `input_file_path`, with the same name appended with `-single_turn.txt` and `-multi_turn.txt`.
  - Run `deam_inference.py` to evaluate the coherence of the generated responses:

    ```sh
    python inference/deam_inference.py --model_path coh_models --mode predict --eval_data_path <input_file_path>
    ```

    By default, the output is saved to a `coh_models/xxx.txt` file.
  - Run `compute_average_score.py` to compute the average of the coherence scores (a sketch of this step follows the list):

    ```sh
    python inference/compute_average_score.py -i coh_models/xxx.txt
    ```
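For reference, the averaging step amounts to something like the following. This is a minimal sketch, assuming the DEAM output file contains one numeric score per line:

```python
# Minimal sketch of the averaging step.
# Assumption: the DEAM output file contains one numeric score per line.
with open("coh_models/xxx.txt") as f:
    scores = [float(line) for line in f if line.strip()]

print(f"average coherence: {sum(scores) / len(scores):.4f}")
```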
Furthermore, we provide an evaluation script to evaluate the generated responses with the other metrics. You can run the following code to get the main results (the coherence evaluation described above is not included in the main results):

```sh
sh scripts/evaluate.sh -i <generated_responses_file_path> -o <output_csv_file_path>
```

- All evaluation metrics are saved in the CSV file; see the loading sketch after the table below.
| Argument | Explanation |
|---|---|
| `-i`, `--input_file_path` | Path to the data for evaluation. Required. |
| `-o`, `--output_file_path` | Path where evaluation outputs are saved. Required. |
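To inspect the results, you can load the CSV directly. A minimal sketch; the column names depend on which metrics `evaluate.sh` computes:

```python
import pandas as pd

# Replace the placeholder with the path passed via -o above.
results = pd.read_csv("<output_csv_file_path>")
print(results.head())  # one column per evaluation metric
```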
To validate the performance of the proposed DialogueGAT against existing GNN methods for dialogue graph modeling, we analyze results on the Next Response Type Prediction (NRTP) and Coherence Relations Classification (RC) tasks on the validation set.
- You can run the `train_gnn.sh` script and change the `GNN_LAYER_TYPE` variable to `GATv2` to train the GATv2 model, or `DialogueGAT` to train the DialogueGAT model:

  ```sh
  sh scripts/train_gnn.sh
  ```

- We report the best results on the validation set using the GATv2 and DialogueGAT models.
- You can run `generate.sh` and change the `TAU` variable to evaluate the effect of τ values in dynamic weighted aggregation:

  ```sh
  sh scripts/generate.sh -m <model_name_or_path>
  ```

- After generating the responses, you can run the evaluation script `evaluate.sh` to evaluate them.
When training the Generator, we use the Dialogue Graph Encoder to encode the dialogue graph and obtain coherence-aware dialogue embeddings. We therefore analyze the effectiveness of the Dialogue Graph Encoder and the attention-based feature-fusion strategy (sketched conceptually after the command below).

You can run the `train_generator.sh` script and change the `GRAPH_ENCODER_STRATEGY` variable to `Attn`, `Add`, `C`, `P`, `Random`, or `None` to train the generator with different strategies for the Dialogue Graph Encoder output:

```sh
sh scripts/train_generator.sh
```
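The following is a conceptual sketch of the `Attn` and `Add` fusion variants only, not the repository's implementation (the `C`, `P`, `Random`, and `None` variants are omitted):

```python
import torch

def fuse(graph_emb: torch.Tensor, text_emb: torch.Tensor, strategy: str = "Attn"):
    """Conceptual sketch of two fusion strategies (not the repo's code).

    graph_emb: (num_nodes, dim) coherence-aware node embeddings
    text_emb:  (dim,) a text-side query vector
    """
    if strategy == "Add":
        # Additive fusion: mean-pool the graph and add it to the text vector.
        return text_emb + graph_emb.mean(dim=0)
    if strategy == "Attn":
        # Attention-based fusion: weight nodes by similarity to the text query.
        weights = torch.softmax(graph_emb @ text_emb, dim=0)  # (num_nodes,)
        return text_emb + weights @ graph_emb
    raise ValueError(f"unhandled strategy: {strategy}")

fused = fuse(torch.randn(6, 512), torch.randn(512))
print(fused.shape)  # torch.Size([512])
```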
In our model, we propose two strategies for coherence-aware attention: SP and Emb. We analyze the performance of these two strategies on each evaluation metric (reported in the main results).

You can run the `train_generator.sh` script and change the `COHERENCE_ATTN_STRATEGY` variable to `SP`, `Emb`, or `SP+Emb` to train the generator with different strategies:

```sh
sh scripts/train_generator.sh
```
At the inference stage, you can also change the `COHERENCE_ATTN_STRATEGY` variable to `SP`, `Emb`, or `SP+Emb` to generate responses with different attention strategies:

```sh
sh scripts/generate.sh -m <model_name_or_path>
```