Generating Natural Language Proofs with Verifier-Guided Search: Diverse Beam Search, Aggregation Functions, and Verifier-Weighting in NLProofS
We present the code and results for an ablation study of the paper Generating Natural Language Proofs with Verifier-Guided Search by Kaiyu Yang, Jia Deng, and Danqi Chen.
Hallucination of invalid proof steps is a persistent problem in stepwise natural language proof generation. Yang et al. (2022) address this issue with NLProofS, which mitigates hallucination by using an auxiliary verifier model to guide stepwise proof generation towards valid proof steps. In this study, we replicate the baselines of their work and extend their exploration in three primary directions: (1) varying the prover-verifier weighting used to score nodes in the proof tree, (2) incorporating diverse beam search into proof generation, and (3) evaluating alternative functions for aggregating node scores in the proof tree. Our highest-performing model achieves an overall proof accuracy of 36.28% on the official EntailmentBank test set, outperforming our replicated baseline of 34.71% for the original NLProofS model.
Information regarding requirements, data preprocessing, experiments, and datasets can be found here.
| Verifier weight | Proof accuracy (%) |
|---|---|
| 0.8 | 35.588 ± 0.588 |
| 0.7 | 35.784 ± 0.612 |
| 0.5 (baseline) | 34.706 ± 0.294 |
| 0.3 | 34.706 ± 0.294 |
| 0.2 | 34.804 ± 0.340 |
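The verifier weight in the table above controls how much the verifier's validity judgment contributes to a node's score relative to the prover's own generation probability, with 0.5 as the baseline setting. As a minimal sketch, assuming a simple linear interpolation of the two signals (the function and variable names are illustrative; the exact combination in NLProofS may differ):

```python
import math


def step_score(prover_logprob: float, verifier_score: float, verifier_weight: float = 0.5) -> float:
    """Combine prover and verifier signals into a single step score.

    Illustrative linear interpolation only; verifier_weight = 0.5 corresponds
    to the baseline row in the table above.
    """
    prover_prob = math.exp(prover_logprob)  # map the prover's log-probability back to (0, 1]
    return (1.0 - verifier_weight) * prover_prob + verifier_weight * verifier_score


# Example: a step the prover likes but the verifier distrusts.
print(step_score(prover_logprob=-0.1, verifier_score=0.2, verifier_weight=0.7))
```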
| Beam groups (BG) | Diversity penalty (DP) | Proof accuracy (%) |
|---|---|---|
| 2 | −10.0 | 35.882 ± 0.294 |
| 2 | −5.0 | 36.275 ± 0.340 |
| 2 | −2.0 | 36.176 ± 0.294 |
| 2 | −0.8 | 35.784 ± 0.170 |
| 2 | −0.5 | 36.078 ± 0.340 |
| 2 | −0.2 | 36.078 ± 0.170 |
| 2 | 0.8 | 33.824 ± 0.588 |
| 1 (baseline) | 0.0 | 34.706 ± 0.294 |
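Diverse beam search splits the beams into groups and penalizes a group for repeating a token already chosen by another group at the same decoding step. The sketch below shows how the BG/DP settings above map onto Hugging Face `generate` arguments for a T5-style prover; the checkpoint name and input string are placeholders, and stock `transformers` typically requires a strictly positive `diversity_penalty`, so the negative settings in the table presumably rely on a modified decoding setup.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint; the actual prover is fine-tuned on EntailmentBank.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

inputs = tokenizer(
    "hypothesis: ... ; context: sent1: ... sent2: ...",  # placeholder prompt format
    return_tensors="pt",
)

# Group beam search: num_beams is split into num_beam_groups groups, and
# diversity_penalty is subtracted from a beam's score whenever it picks a token
# already chosen by another group at the same step.
outputs = model.generate(
    **inputs,
    max_length=128,
    num_beams=10,
    num_beam_groups=2,      # "BG" in the table above
    diversity_penalty=5.0,  # "DP" (shown here with a standard positive value)
    num_return_sequences=10,
    do_sample=False,
)
candidate_steps = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```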
| f | Aggregation function | Proof accuracy (%) |
|---|---|---|
| f1 | baseline | 34.706 ± 0.294 |
| f3 | s^2 · min(v_1, ..., v_n) | 35.000 ± 0.588 |
| f3 | s^3 · min(v_1, ..., v_n) | 35.196 ± 0.679 |
| f3 | s^4 · min(v_1, ..., v_n) | 34.902 ± 0.170 |
| f5 | min_1(I) · min_2(I) | 35.098 ± 0.449 |
| f8 | min(v_1, ..., v_n)^(2−s) | 35.000 ± 0.588 |
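The aggregation function maps the scores of a candidate proof's steps to a single proof score, with the baseline taking the minimum over step scores. The sketch below is our reading of the variants in the table, where v_1, ..., v_n denote per-step verifier scores and s denotes an overall prover score; the names and exact formulas are illustrative rather than the repository's definitions.

```python
from typing import Sequence


def f1_baseline(v: Sequence[float]) -> float:
    """Baseline aggregation: the proof is only as good as its weakest step."""
    return min(v)


def f3_weighted_min(v: Sequence[float], s: float, k: int = 2) -> float:
    """s^k * min(v_1, ..., v_n): scale the weakest step by the prover score s
    raised to the power k (we try k = 2, 3, 4)."""
    return (s ** k) * min(v)


def f5_two_smallest(v: Sequence[float]) -> float:
    """min_1(I) * min_2(I): product of the two smallest step scores, so a
    single weak step is penalized less harshly than under the plain minimum."""
    v_sorted = sorted(v)
    return v_sorted[0] * (v_sorted[1] if len(v_sorted) > 1 else 1.0)


def f8_exponent(v: Sequence[float], s: float) -> float:
    """min(v_1, ..., v_n)^(2 - s): the prover score s modulates how sharply
    the weakest step dominates the proof score."""
    return min(v) ** (2.0 - s)
```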
We replicate the findings of Yang et al. (2022) and present a number of ablations of their baseline model. Our two main results are: first, setting the verifier weight to 0.7 yields a proof accuracy of 35.78%; second, our implementation of diverse beam search with 2 beam groups and a diversity penalty of −5.0 achieves a proof accuracy of 36.28%. These results point to interesting directions for future work and show that additional studies are needed to better understand verifier-guided stepwise proof generation.
We extend our acknowledgments to Professor Danqi Chen for her guidance and advice on our project, as well as to Dr. Yang for his insightful comments on our paper.
If you have any questions related to the code or the paper, feel free to email Max Gonzalez Saez-Diez. If you encounter any problems when using the code in this repository, please open an issue so we can update the codebase. Thank you!