Generating Natural Language Proofs with Verifier-Guided Search: Diverse Beam Search, Aggregation Functions, and Verifier-Weighting in NLProofS
We present the code and results for an ablation study of the paper Generating Natural Language Proofs with Verifier-Guided Search by Kaiyu Yang, Jia Deng, and Danqi Chen.
Hallucination of invalid proof steps is a persistent problem in stepwise natural language proof generation. Yang et al. (2022) address this issue with NLProofS, which mitigates hallucination by using an auxiliary verifier model to guide stepwise proof generation towards valid proof steps. In this study, we replicate the baselines of their work and extend their exploration in three primary directions: (1) varying the prover-verifier weighting used to score nodes in the proof tree, (2) incorporating diverse beam search into proof generation, and (3) evaluating alternative functions for aggregating node scores in the proof tree. Our highest-performing model achieves an overall proof accuracy of 36.28% on the official EntailmentBank test set, outperforming our replicated baseline of 34.71% for the original NLProofS model.
Information regarding requirements, data preprocessing, experiments, and datasets can be found here.
| Verifier weight | Proof accuracy (%) |
|---|---|
| 0.8 | 35.588 ± 0.588 |
| 0.7 | 35.784 ± 0.612 |
| 0.5 (baseline) | 34.706 ± 0.294 |
| 0.3 | 34.706 ± 0.294 |
| 0.2 | 34.804 ± 0.340 |
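The verifier weight in the table above controls how much the verifier's validity judgment contributes to a node's score relative to the prover's own generation probability, with 0.5 as the baseline setting. As a minimal sketch, assuming a simple linear interpolation of the two signals (the function and variable names are illustrative; the exact combination in NLProofS may differ):

```python
import math


def step_score(prover_logprob: float, verifier_score: float, verifier_weight: float = 0.5) -> float:
    """Combine prover and verifier signals into a single step score.

    Illustrative linear interpolation only; verifier_weight = 0.5 corresponds
    to the baseline row in the table above.
    """
    prover_prob = math.exp(prover_logprob)  # map the prover's log-probability back to (0, 1]
    return (1.0 - verifier_weight) * prover_prob + verifier_weight * verifier_score


# Example: a step the prover likes but the verifier distrusts.
print(step_score(prover_logprob=-0.1, verifier_score=0.2, verifier_weight=0.7))
```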
| Beam groups (BG) | Diversity penalty (DP) | Proof accuracy (%) |
|---|---|---|
| 2 | −10.0 | 35.882 ± 0.294 |
| 2 | −5.0 | 36.275 ± 0.340 |
| 2 | −2.0 | 36.176 ± 0.294 |
| 2 | −0.8 | 35.784 ± 0.170 |
| 2 | −0.5 | 36.078 ± 0.340 |
| 2 | −0.2 | 36.078 ± 0.170 |
| 2 | 0.8 | 33.824 ± 0.588 |
| 1 (baseline) | 0.0 | 34.706 ± 0.294 |
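Diverse beam search splits the beams into groups and penalizes a group for repeating a token already chosen by another group at the same decoding step. The sketch below shows how the BG/DP settings above map onto Hugging Face `generate` arguments for a T5-style prover; the checkpoint name and input string are placeholders, and stock `transformers` typically requires a strictly positive `diversity_penalty`, so the negative settings in the table presumably rely on a modified decoding setup.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint; the actual prover is fine-tuned on EntailmentBank.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

inputs = tokenizer(
    "hypothesis: ... ; context: sent1: ... sent2: ...",  # placeholder prompt format
    return_tensors="pt",
)

# Group beam search: num_beams is split into num_beam_groups groups, and
# diversity_penalty is subtracted from a beam's score whenever it picks a token
# already chosen by another group at the same step.
outputs = model.generate(
    **inputs,
    max_length=128,
    num_beams=10,
    num_beam_groups=2,      # "BG" in the table above
    diversity_penalty=5.0,  # "DP" (shown here with a standard positive value)
    num_return_sequences=10,
    do_sample=False,
)
candidate_steps = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```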
| f | Aggregation function | Proof accuracy (%) |
|---|---|---|
| f1 | baseline | 34.706 ± 0.294 |
| f3 | s^2 · min(v_1, ..., v_n) | 35.000 ± 0.588 |
| f3 | s^3 · min(v_1, ..., v_n) | 35.196 ± 0.679 |
| f3 | s^4 · min(v_1, ..., v_n) | 34.902 ± 0.170 |
| f5 | min_1(I) · min_2(I) | 35.098 ± 0.449 |
| f8 | min(v_1, ..., v_n)^(2−s) | 35.000 ± 0.588 |
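The aggregation function maps the scores of a candidate proof's steps to a single proof score, with the baseline taking the minimum over step scores. The sketch below is our reading of the variants in the table, where v_1, ..., v_n denote per-step verifier scores and s denotes an overall prover score; the names and exact formulas are illustrative rather than the repository's definitions.

```python
from typing import Sequence


def f1_baseline(v: Sequence[float]) -> float:
    """Baseline aggregation: the proof is only as good as its weakest step."""
    return min(v)


def f3_weighted_min(v: Sequence[float], s: float, k: int = 2) -> float:
    """s^k * min(v_1, ..., v_n): scale the weakest step by the prover score s
    raised to the power k (we try k = 2, 3, 4)."""
    return (s ** k) * min(v)


def f5_two_smallest(v: Sequence[float]) -> float:
    """min_1(I) * min_2(I): product of the two smallest step scores, so a
    single weak step is penalized less harshly than under the plain minimum."""
    v_sorted = sorted(v)
    return v_sorted[0] * (v_sorted[1] if len(v_sorted) > 1 else 1.0)


def f8_exponent(v: Sequence[float], s: float) -> float:
    """min(v_1, ..., v_n)^(2 - s): the prover score s modulates how sharply
    the weakest step dominates the proof score."""
    return min(v) ** (2.0 - s)
```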
We replicate the findings of Yang et al. (2022) and present a number of ablations of their baseline model. Our two main results are: first, setting the verifier weight to 0.7 yields a proof accuracy of 35.78%; second, our implementation of diverse beam search with 2 beam groups and a diversity penalty of −5.0 achieves a proof accuracy of 36.28%. These results point to interesting directions for future work and show that additional studies are needed to better understand verifier-guided stepwise proof generation.
We extend our acknowledgments to Professor Danqi Chen for her guidance and advice on our project, as well as to Dr. Yang for his insightful comments on our paper.
If you have any questions related to the code or the paper, feel free to email Max Gonzalez Saez-Diez. If you encounter any problems when using the code in this repository, please open an issue so we can update the codebase. Thank you!