This is a dataset of 32524 examples for premise selection based on the Mizar40 experiments . The proof data (without the negative part added here) served also for the DeepMath experiments described in .
For each conjecture the dataset contains a file with a roughly equal number of unnecessary facts and necessary facts. This is done to simplify experiments with learning methods such as neural networks.
In nnhpdata/* the examples are in this format:
C prefixformula + prefixformula ... + prefixformula - prefixformula ... - prefixformula
In nndata/* the examples are in this format:
C tptpformula + tptpformula ... + tptpformula - tptpformula ... - tptpformula
Here tptpformula is the standard FOL TPTP representation, and prefixformula is a prefix curried format using de Bruijn indeces.
C is the conjecture, + are the formulas needed for an ATP proof of C, and - are unnecessary formulas that are reasonably highly ranked by a k-nearest neighbor algorithm trained on the proofs .
To divide into training and testing, just randomly select a part (e.g. 90%) of the files to train on and test on the remaining files.
The accuracy of a simple count-based predictor evaluated incrementally on the whole dataset is 71.75%.
The mizar40 and code directories contain the data (proofs, statemnts, etc.) and the code used to generate the nndata and nnhpdata examples.