Numerical Sensitivity Enhancing and Reasoning Completeness Alignment for Quantitative Understanding

This system was developed for Task 7 of SemEval-2024. Its performance is reported in the task overview paper.

Our system has three stages: (1) supervised fine-tuning (SFT) with a comparing-numbers task to enhance numerical sensitivity; (2) reward model training; and (3) reinforcement learning via proximal policy optimization (PPO) with a Reasoning Completeness Reward.
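The README does not show what the comparing-numbers samples look like, so the sketch below is only an assumption: a hypothetical make_comparison_sample helper that draws two integers and writes a prompt/response pair stating their relation. The prompt wording, field names, and integer range are illustrative and are not taken from gpt_generate.py or sft_train.py.

```python
import random


def make_comparison_sample(rng: random.Random, low: int = 0, high: int = 1000) -> dict:
    """Build one hypothetical number-comparison prompt/response pair.

    The wording and answer format are illustrative assumptions,
    not the templates used in this repository.
    """
    a, b = rng.randint(low, high), rng.randint(low, high)
    if a > b:
        relation = "greater than"
    elif a < b:
        relation = "less than"
    else:
        relation = "equal to"
    prompt = f"Compare the numbers {a} and {b}. Is {a} greater than, less than, or equal to {b}?"
    response = f"{a} is {relation} {b}."
    return {"prompt": prompt, "response": response, "a": a, "b": b, "relation": relation}


rng = random.Random(0)  # fixed seed so the generated samples are reproducible
for _ in range(3):
    print(make_comparison_sample(rng)["response"])
```

Deriving the response from the actual comparison keeps every generated label correct by construction; a real pipeline would likely also cover decimals and larger magnitudes.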

Data

We provide JSON-format samples of the data used for SFT, reward model, and PPO training. The samples come from the Quantitative101 datasets, expanded using GPT-3.5. The reward model training data is human-labeled.

sft_data_example.json

{
    "task": "stressTest",
    "id": "0",
    "statement1": "At Veridux Corporation , there are 250 employees",
    "statement2": "At Veridux Corporation , there are less than 650 employees",
    "option1": "Entailment",
    "option2": "contradiction",
    "option3": "neutral",
    "response": " [1] . According to the premise \"there are 250 employees\", the hypothesis \"there are less than 650 employees\" is not contradicted by the premise.\n [2] . The number 250 is indeed less than 650, so the hypothesis can be justifiably inferred to be true.\n",
    "label": "Thus, the answer is option 1. #### 1"
}
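One plausible way to turn an SFT sample into a training pair is to render the statements and options into a prompt and use response followed by label as the target, so the model learns to end with the "#### <option>" marker. PROMPT_TEMPLATE and build_sft_pair are hypothetical names, and the template actually used in the repository may differ.

```python
from typing import Tuple

# Hypothetical prompt template; the one used by sft_train.py may differ.
PROMPT_TEMPLATE = (
    "Premise: {statement1}\n"
    "Hypothesis: {statement2}\n"
    "Options: (1) {option1} (2) {option2} (3) {option3}\n"
    "Answer step by step.\n"
)


def build_sft_pair(sample: dict) -> Tuple[str, str]:
    """Return (prompt, target); the target is the reasoning plus the '####' answer line."""
    prompt = PROMPT_TEMPLATE.format(**sample)  # unused keys such as "id" are ignored
    target = sample["response"] + sample["label"]
    return prompt, target


sft_sample = {
    "task": "stressTest",
    "id": "0",
    "statement1": "At Veridux Corporation , there are 250 employees",
    "statement2": "At Veridux Corporation , there are less than 650 employees",
    "option1": "Entailment",
    "option2": "contradiction",
    "option3": "neutral",
    "response": ' [1] . According to the premise "there are 250 employees", the hypothesis '
                '"there are less than 650 employees" is not contradicted by the premise.\n'
                ' [2] . The number 250 is indeed less than 650, so the hypothesis can be '
                'justifiably inferred to be true.\n',
    "label": "Thus, the answer is option 1. #### 1",
}

prompt, target = build_sft_pair(sft_sample)
print(prompt + target)
```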

rm_data_example.json

{
    "sentence": "1.  The problem asks us to determine if a hypothesis can be justifiably inferred to be true, false, or cannot be determined based on a given premise. The first statement is the premise and the second statement is the hypothesis",
    "label": "0"
}
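The reward model classifies individual reasoning steps like the sentence above. The README does not state how per-step scores become the Reasoning Completeness Reward, so the sketch below rests on two assumptions: responses are split on the " [n] ." step markers seen in the SFT sample, and the reward is the mean of per-step scores from some scorer (for example, the trained BERT classifier's probability for a chosen label). split_steps, completeness_reward, toy_scorer, and the averaging rule are all hypothetical.

```python
import re
from statistics import mean
from typing import Callable, List

# Step markers like " [1] ." as in the SFT responses; the capture group keeps the step number.
STEP_PATTERN = re.compile(r"\[(\d+)\]\s*\.\s*")


def split_steps(response: str) -> List[str]:
    """Split ' [1] . ... [2] . ...' into ['1. ...', '2. ...'], similar to the RM 'sentence' field."""
    parts = STEP_PATTERN.split(response)
    # With one capture group, re.split returns [prefix, num1, text1, num2, text2, ...].
    return [f"{num}. {text.strip()}" for num, text in zip(parts[1::2], parts[2::2])]


def completeness_reward(response: str, score_step: Callable[[str], float]) -> float:
    """Hypothetical reward: mean of per-step scores, or 0.0 if no steps are found."""
    steps = split_steps(response)
    if not steps:
        return 0.0
    return mean(score_step(step) for step in steps)


def toy_scorer(step: str) -> float:
    """Stand-in for the BERT reward model (illustration only)."""
    return 1.0 if "250" in step else 0.0


response = " [1] . Premise says 250 employees.\n [2] . So the hypothesis holds.\n"
print(split_steps(response))
print(completeness_reward(response, toy_scorer))  # 0.5
```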

ppo_data_example.json

{
    "task": "stressTest",
    "id": "0",
    "statement1": "At Veridux Corporation , there are 250 employees",
    "statement2": "At Veridux Corporation , there are less than 650 employees",
    "option1": "Entailment",
    "option2": "contradiction",
    "option3": "neutral",
    "label": 0
}
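The PPO sample is the same item as the SFT sample, but its label is the integer 0, whereas the SFT target ends in "#### 1". This suggests, although the README does not say so, that PPO labels are 0-based indices into option1 to option3. The hypothetical helpers gold_option and is_correct below encode that assumption.

```python
# Hypothetical helpers; the 0-based label convention is inferred, not documented.
ppo_sample = {
    "task": "stressTest",
    "id": "0",
    "statement1": "At Veridux Corporation , there are 250 employees",
    "statement2": "At Veridux Corporation , there are less than 650 employees",
    "option1": "Entailment",
    "option2": "contradiction",
    "option3": "neutral",
    "label": 0,
}


def gold_option(sample: dict) -> str:
    """Return the option text for a 0-based integer label (assumed convention)."""
    return sample[f"option{sample['label'] + 1}"]


def is_correct(predicted_option: int, sample: dict) -> bool:
    """Check a 1-based option number (as written after '####') against the 0-based label."""
    return predicted_option - 1 == sample["label"]


print(gold_option(ppo_sample))    # Entailment
print(is_correct(1, ppo_sample))  # True
```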

Training

gpt_generate.py: calls the GPT-3.5 API to generate the expanded dataset
run_sft.sh: runs SFT training
run_bert.sh: runs reward model training
run_ppo.sh: runs PPO training
test.py: runs the trained model on the task test sets
test_bert.py: runs the reward model on the test set
extract_ans.py: extracts answers from the model's responses and computes scores
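The list describes extract_ans.py only as extracting answers and computing scores. Below is a minimal sketch that assumes the final answer follows the "####" marker, as in the SFT label, and that the score is plain accuracy against 1-based gold option numbers. The regex, function names, and metric are assumptions, and the actual script may parse or score differently.

```python
import re
from typing import List, Optional

# Assumed format: the final option number follows "####", e.g. "... option 1. #### 1".
ANSWER_PATTERN = re.compile(r"####\s*(\d+)")


def extract_answer(text: str) -> Optional[int]:
    """Return the option number after the last '####' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(text)
    return int(matches[-1]) if matches else None


def accuracy(outputs: List[str], gold: List[int]) -> float:
    """Share of outputs whose extracted option equals the 1-based gold option.

    Outputs with no parsable answer count as wrong.
    """
    if not outputs:
        return 0.0
    correct = 0
    for text, label in zip(outputs, gold):
        if extract_answer(text) == label:
            correct += 1
    return correct / len(outputs)


print(extract_answer("Thus, the answer is option 1. #### 1"))  # 1
```

Taking the last marker rather than the first tolerates responses that mention "####" mid-reasoning before committing to an answer.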

Our SFT model is trained on Abel-7B with a learning rate of 3e-5, a warmup ratio of 0.03, and a maximum model length of 1024. For the reward model (RM), we fine-tune BERT-large, which performs well on classification tasks. It is trained for 10 epochs with a learning rate of 2e-5, a warmup ratio of 0.05, and a maximum model length of 256. PPO training is implemented with LoRA and TRL, using a learning rate of 1.41e-5 and a maximum of 512 new tokens. On a dataset of 5,470 samples, each training epoch takes about 55 hours on 4 A100 GPUs.
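For reference, the reported hyperparameters can be collected in one place. The timing figure works out to roughly 36 seconds per sample (55 h × 3600 s/h ÷ 5470 samples). StageConfig is a hypothetical container. Its field names echo Hugging Face TrainingArguments (learning_rate, warmup_ratio), but the repository's scripts may name or pass these settings differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for the hyperparameters reported above."""
    stage: str
    model: str                        # model or framework as named in the README
    learning_rate: float
    warmup_ratio: Optional[float] = None
    max_length: Optional[int] = None  # "model max length" in the README
    epochs: Optional[int] = None
    max_new_tokens: Optional[int] = None


SFT = StageConfig("sft", "Abel-7B", 3e-5, warmup_ratio=0.03, max_length=1024)
RM = StageConfig("rm", "BERT-large", 2e-5, warmup_ratio=0.05, max_length=256, epochs=10)
PPO = StageConfig("ppo", "LoRA + TRL", 1.41e-5, max_new_tokens=512)


def seconds_per_sample(dataset_size: int, epoch_hours: float) -> float:
    """Average wall-clock seconds spent per training sample within one epoch."""
    return epoch_hours * 3600 / dataset_size


# 55 h on 4 A100s for 5470 samples: 198000 s / 5470 ≈ 36.2 s per sample.
print(round(seconds_per_sample(5470, 55), 1))  # 36.2
```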
