Generating Natural Language Proofs with Verifier-Guided Search: Diverse Beam Search, Aggregation Functions, and Verifier-Weighting in NLProofS

We present the code and results for an ablation study of the paper "Generating Natural Language Proofs with Verifier-Guided Search" by Kaiyu Yang, Jia Deng, and Danqi Chen.

Abstract

Hallucination of invalid proof steps is a persistent problem in stepwise natural language proof generation. Yang et al. (2022) address this issue with NLProofS, which mitigates hallucination by using an auxiliary verifier model to guide stepwise proof generation toward valid proof steps. In this study, we replicate the baselines of their work and extend their exploration in three primary directions: (1) varying the prover-verifier weighting used to score nodes in the proof tree, (2) incorporating diverse beam search into proof tree generation, and (3) evaluating alternative functions for aggregating the scores of nodes in the proof tree. Our highest-performing model achieved an overall proof accuracy of 36.28% on the official EntailmentBank test set, outperforming the (replicated) baseline of 34.71% achieved by the original NLProofS model.

Quick Links

Information on requirements, data preprocessing, experiments, and datasets can be found here.

Ablation Results

Verifier Weighting Experiments

| Verifier Weight | Proof Accuracy (%) |
|---|---|
| 0.8 | 35.588 ± 0.588 |
| 0.7 | 35.784 ± 0.612 |
| 0.5 (baseline) | 34.706 ± 0.294 |
| 0.3 | 34.706 ± 0.294 |
| 0.2 | 34.804 ± 0.340 |
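The sketch below illustrates the kind of prover/verifier score blending varied in this table. It assumes both scores are already normalized to [0, 1] and is a simplified stand-in for the scoring used in the NLProofS search code, not a copy of it.

```python
def blended_step_score(prover_score: float, verifier_score: float,
                       verifier_weight: float = 0.5) -> float:
    """Linearly interpolate the prover's generation score and the verifier's
    validity score for a candidate proof step.

    `verifier_weight` corresponds to the "Verifier Weight" column above;
    0.5 is the baseline setting. Both inputs are assumed to lie in [0, 1].
    Illustrative sketch only, not the exact NLProofS implementation.
    """
    return verifier_weight * verifier_score + (1.0 - verifier_weight) * prover_score
```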

Diverse Beam Search (Vijayakumar et al., 2016)

| Beam Groups (BG) | Diversity Penalty (DP) | Proof Accuracy (%) |
|---|---|---|
| 2 | −10.0 | 35.882 ± 0.294 |
| 2 | −5.0 | 36.275 ± 0.340 |
| 2 | −2.0 | 36.176 ± 0.294 |
| 2 | −0.8 | 35.784 ± 0.170 |
| 2 | −0.5 | 36.078 ± 0.340 |
| 2 | −0.2 | 36.078 ± 0.170 |
| 2 | 0.8 | 33.824 ± 0.588 |
| 1 (baseline) | 0.0 | 34.706 ± 0.294 |
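For context, diverse (group) beam search can be enabled through the Hugging Face transformers generation API, which the NLProofS prover (a T5-style seq2seq model) builds on. The model name, input string, and settings below are illustrative rather than the exact experimental configuration, and the diversity-penalty sign convention in the table above is the report's own.

```python
# Minimal sketch of group (diverse) beam search via Hugging Face transformers.
# Model name, input string, and generation settings are illustrative only.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

inputs = tokenizer("hypothesis: ...; context: ...", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=10,            # total beams, split evenly across groups
    num_beam_groups=2,       # BG = 2, as in the table above
    diversity_penalty=5.0,   # DP; penalizes beams that repeat tokens chosen by other groups
    num_return_sequences=10,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```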

Aggregation Functions

| f* Function | Aggregation | Proof Accuracy (%) |
|---|---|---|
| f1 (baseline) | | 34.706 ± 0.294 |
| f3 (s2) | min(v1, ..., vn) | 35.000 ± 0.588 |
| f3 (s3) | min(v1, ..., vn) | 35.196 ± 0.679 |
| f3 (s4) | min(v1, ..., vn) | 34.902 ± 0.170 |
| f5 | min1(I) · min2(I) | 35.098 ± 0.449 |
| f8 | min(v1, ..., vn)^(2−s) | 35.000 ± 0.588 |
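As a rough illustration of two of the variants above, the helpers below aggregate per-step verifier scores v1, ..., vn into a single tree-level score. They are simplified stand-ins for the functions evaluated in the ablation, not the repository's exact implementations.

```python
from typing import Sequence

def agg_min(step_scores: Sequence[float]) -> float:
    """min(v1, ..., vn): the proof tree is only as strong as its weakest step."""
    return min(step_scores)

def agg_two_smallest_product(step_scores: Sequence[float]) -> float:
    """min1(I) * min2(I): product of the two lowest step scores (f5-style)."""
    lowest = sorted(step_scores)[:2]
    return lowest[0] * lowest[1] if len(lowest) > 1 else lowest[0]
```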

Conclusion

We replicate the findings of Yang et al. (2022) and then present a number of ablations to their baseline model. Our two main contributions are: first, setting the verifier weight to 0.7 yielded a proof accuracy of 35.78%; second, our implementation of diverse beam search with 2 beam groups and a diversity penalty of −5.0 achieved a proof accuracy of 36.28%. These results reveal an interesting direction for future work and show that additional studies are needed to better understand verifier-guided stepwise proof generation.

Acknowledgements

We thank Professor Danqi Chen for her guidance and advice on our project, as well as Dr. Kaiyu Yang for his insightful comments on our paper.

Bugs or Questions

If you have any questions related to the code or the paper, feel free to email Max Gonzalez Saez-Diez. If you encounter any problems when using the code in this repository, please open an issue so we can update the codebase. Thank you!
