SAFE-QAQ is an end-to-end framework for audio-text fraud detection that leverages reinforcement learning to enable slow-thinking decision-making. Below are instructions for setting up the environment, training the model, and running experiments.
This repository contains the source code for SAFE-QAQ, which consists of three main stages:
- Rule-Based Reinforcement Learning (Stage 1): Train a rule-based RL model.
- Rejection Sampling Fine-Tuning (RSFT) and Length-Constrained Reinforcement Learning (LCRL) (Stage 2): Refine the model using rejection sampling and LCRL techniques.
- Real-Time Fine-Tuning (Stage 3): Fine-tune the model for real-time inference.
The prompts for both real-time inference and training are defined in prompt.py.
To set up the environment, follow the installation instructions in the ms-swift documentation.
Run the following script to train the initial rule-based RL model:
```bash
bash run_swift_grpo_stage1.sh
```
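The reward definitions for this stage live in the training script and its referenced configuration and are not reproduced here. Purely as an illustration of what GRPO-style rule-based rewards typically look like (a verdict-accuracy term plus a format term), here is a minimal sketch; the tag names, labels, and weights are assumptions for illustration, not this repository's actual reward functions.

```python
import re

# Illustrative sketch only: GRPO-style rule-based rewards.
# Tag names (<think>/<answer>), labels, and weights are hypothetical.

def format_reward(response: str) -> float:
    """1.0 if the response follows a <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, label: str) -> float:
    """1.0 if the final verdict inside <answer> matches the ground-truth label, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == label.strip().lower() else 0.0

def rule_based_reward(response: str, label: str) -> float:
    """Combine the rule rewards; the 0.5 weight on the format term is illustrative."""
    return accuracy_reward(response, label) + 0.5 * format_reward(response)
```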
- Rejection Sampling: Generate samples using:

  ```bash
  bash sample.sh
  ```

  Process the sampled data with:

  ```bash
  bash process_samples.sh
  ```

  (The filtering idea behind this step is sketched after this list.)

- Fine-Tuning with RSFT: Fine-tune the model using the processed data:

  ```bash
  bash run_swift_sft_stage2_RSFT.sh
  ```

- Length-Constrained Reinforcement Learning (LCRL): Further refine the model with LCRL:

  ```bash
  bash run_swift_grpo_stage2_LCRL.sh
  ```
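If you are unfamiliar with rejection sampling fine-tuning, the following sketch shows the general idea behind the sampling-and-filtering step: keep only sampled responses whose final verdict matches the ground-truth label, and reuse the survivors as SFT data. It assumes a JSONL file of samples with hypothetical field names; the actual logic is implemented by sample.sh and process_samples.sh.

```python
import json

# Illustrative sketch of rejection sampling: keep only samples whose
# predicted verdict matches the ground-truth label, and store the
# surviving (prompt, response) pairs as SFT data.
# Field names ("prompt", "response", "label", "predicted_label") and the
# JSONL layout are assumptions, not this repository's actual format.

def build_rsft_dataset(samples_path: str, output_path: str) -> None:
    kept = []
    with open(samples_path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # Rejection step: discard samples with an incorrect verdict.
            if sample["predicted_label"] != sample["label"]:
                continue
            kept.append({"messages": [
                {"role": "user", "content": sample["prompt"]},
                {"role": "assistant", "content": sample["response"]},
            ]})
    with open(output_path, "w", encoding="utf-8") as f:
        for record in kept:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```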
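Similarly, the length constraint in LCRL can be pictured as a reward-shaping term that discourages overly long reasoning on top of the rule-based rewards from Stage 1. The sketch below is an assumption about how such a constraint might look; the token budget and penalty scale are illustrative, not the values used by this repository.

```python
# Illustrative sketch of a length-constrained reward: keep the rule-based
# reward and subtract a soft penalty when the response exceeds a token
# budget. The budget and scale are hypothetical.

def length_constrained_reward(base_reward: float, num_tokens: int,
                              budget: int = 512, scale: float = 0.001) -> float:
    """Penalize responses in proportion to how far they exceed the budget."""
    overflow = max(0, num_tokens - budget)
    return base_reward - scale * overflow
```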
Perform real-time fine-tuning by running:
```bash
bash run_swift_grpo_stage3.sh
```

- The prompt.py file contains the definitions of the prompts used during training and real-time inference.
- Ensure all dependencies are installed as per the ms-swift documentation before running the scripts.
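The exact prompt wording lives in prompt.py and is not reproduced here. Purely as an illustration of how such a file is often organized, here is one plausible layout, with a system prompt that asks for explicit reasoning before a verdict; all strings and names below are assumptions, not the repository's actual prompts.

```python
# Hypothetical layout of prompt definitions; the real prompts in prompt.py
# will differ. Names and wording are assumptions for illustration.

TRAIN_SYSTEM_PROMPT = (
    "You are an audio-text fraud detection assistant. Think step by step "
    "inside <think>...</think>, then give your final verdict (fraud / not "
    "fraud) inside <answer>...</answer>."
)

REALTIME_SYSTEM_PROMPT = (
    "You are an audio-text fraud detection assistant operating in real time. "
    "Reason briefly, then give your verdict inside <answer>...</answer>."
)

def build_user_prompt(transcript: str) -> str:
    """Wrap a call transcript (plus any audio-derived cues) into the user turn."""
    return f"Conversation transcript:\n{transcript}\n\nIs this call fraudulent?"
```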