Constructing LeanDojo Benchmark (Lean 3)
===================================

This script uses [LeanDojo](https://leandojo.org/) to construct the LeanDojo Benchmark used in our paper:

[LeanDojo: Theorem Proving with Retrieval-Augmented Language Models](https://leandojo.org/)      
NeurIPS 2023 (Datasets and Benchmarks Track)  
[Kaiyu Yang](https://yangky11.github.io/), [Aidan Swope](https://aidanswope.com/about), [Alex Gu](https://minimario.github.io/), [Rahul Chalamala](https://rchalamala.github.io/), [Peiyang Song](https://peiyang-song.github.io/), [Shixing Yu](https://billysx.github.io/), [Saad Godil](https://www.linkedin.com/in/saad-godil-9728353/), [Ryan Prenger](https://www.linkedin.com/in/ryan-prenger-18797ba1/), [Anima Anandkumar](http://tensorlab.cms.caltech.edu/users/anima/)

The dataset is constructed from [mathlib](https://github.com/leanprover-community/mathlib/tree/19c869efa56bbb8b500f2724c0b77261edbfa28c) (`19c869efa56bbb8b500f2724c0b77261edbfa28c`) and will be saved to `../leandojo_benchmark`. It includes 2000 theorems for validation, 2000 theorems for testing, and the rest for training. Please refer to our paper for details. For most use cases, you shouldn't need to generate the data and can directly use our official LeanDojo Benchmark downloadable [here](https://zenodo.org/record/8242196).

This script is for Lean 3. We also have a [version for Lean 4](https://github.com/lean-dojo/LeanDojo/blob/main/scripts/generate-benchmark-lean4.ipynb).


In [1]:
import ray
import json
import shutil
import random
import networkx as nx
from tqdm import tqdm
from copy import copy
from pathlib import Path
from loguru import logger
from datetime import datetime
from collections import defaultdict
from ray.util.actor_pool import ActorPool
from typing import Dict, List, Tuple, Union

import lean_dojo
from lean_dojo import *
from lean_dojo.constants import LEAN3_DEPS_DIR

random.seed(3407)  # https://arxiv.org/abs/2109.08203

URL = "https://github.com/leanprover-community/mathlib"
COMMIT = "19c869efa56bbb8b500f2724c0b77261edbfa28c"
DST_DIR = Path("../leandojo_benchmark")
NUM_VAL = NUM_TEST = 2000

## Splitting the Theorems

We will split the theorems into train/val/test using two different strategies.

In [2]:
SPLIT_NAME = str  # train/val/test
SPLIT = Dict[SPLIT_NAME, List[TracedTheorem]]
SPLIT_STRATEGY = str

### Splitting Randomly

The first and the simplest strategy is splitting the theorems randomly, which can be implemented by a random shuffle followed by a sequential split.

In [3]:
def _split_sequentially(
    traced_theorems: List[TracedTheorem],
) -> SPLIT:
    """Split ``traced_theorems`` sequentially into train/val/test."""
    num_theorems = len(traced_theorems)
    num_train = num_theorems - NUM_VAL - NUM_TEST
    return {
        "train": traced_theorems[:num_train],
        "val": traced_theorems[num_train : num_train + NUM_VAL],
        "test": traced_theorems[num_train + NUM_VAL :],
    }


def split_randomly(
    traced_theorems: List[TracedTheorem],
) -> SPLIT:
    """Split ``traced_theorems`` randomly into train/val/test."""
    logger.info("Splitting the theorems randomly")
    traced_theorems = copy(traced_theorems)
    random.shuffle(traced_theorems)
    return _split_sequentially(traced_theorems)
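
To make the shuffle-then-slice idea concrete, here is a minimal sketch that works on plain strings instead of `TracedTheorem` objects. The helper `toy_split_randomly`, the theorem names, and the tiny split sizes are illustrative assumptions and are not part of LeanDojo.

```python
import random
from typing import Dict, List


def toy_split_randomly(
    items: List[str], num_val: int, num_test: int, seed: int = 3407
) -> Dict[str, List[str]]:
    """Shuffle a copy of ``items`` and slice it into train/val/test."""
    rng = random.Random(seed)  # Local RNG, so the global random state is untouched.
    shuffled = list(items)
    rng.shuffle(shuffled)
    num_train = len(shuffled) - num_val - num_test
    return {
        "train": shuffled[:num_train],
        "val": shuffled[num_train : num_train + num_val],
        "test": shuffled[num_train + num_val :],
    }


# Hypothetical theorem names, used only for illustration.
toy_theorems = [f"thm_{i}" for i in range(10)]
toy_random_split = toy_split_randomly(toy_theorems, num_val=2, num_test=2)
print({name: len(thms) for name, thms in toy_random_split.items()})
# {'train': 6, 'val': 2, 'test': 2}
```

The three parts are disjoint and together cover every input item, which is the only guarantee this strategy provides.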

### Splitting by Premise

The second strategy is splitting by premise. We want to test the prover's capability in using novel premises, i.e., premises that have never been used in training. Please see the implementation below. Note that validation and testing theorems may share premises, so the **testing performance should be reported using models trained on the training set only, NOT training plus validation.**

In [4]:
def split_by_premise(
    traced_theorems: List[TracedTheorem],
) -> SPLIT:
    """
    Split theorems into train/val/test so that proofs in val/test rely on at
    least one novel premise that does not appear in train.
    """
    logger.info("Splitting the theorems by premises")

    # Figure out the number of theorems in train/val/test.
    num_theorems = len(traced_theorems)
    num_val_test = NUM_VAL + NUM_TEST
    num_train = num_theorems - num_val_test
    theorems_val_test = set()

    # Map each premise to a list of theorems using it.
    theorems_by_premises = defaultdict(list)
    for t in traced_theorems:
        for p in t.get_premise_full_names():
            theorems_by_premises[p].append(t)

    # Sort the premises by the number of theorems using them (in ascending order).
    theorems_by_premises = sorted(theorems_by_premises.items(), key=lambda x: len(x[1]))

    # For each premise, put all theorems using it into val_test so that it does not appear in train.
    for _, thms in theorems_by_premises:
        if len(theorems_val_test) < num_val_test:
            theorems_val_test.update(thms)

    # All other theorems go to train.
    theorems_train = [t for t in traced_theorems if t not in theorems_val_test]
    theorems_val_test = list(theorems_val_test)
    random.shuffle(theorems_val_test)

    return {
        "train": theorems_train,
        "val": theorems_val_test[:NUM_VAL],
        "test": theorems_val_test[NUM_VAL:],
    }
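
The property this split aims for can be checked on a toy example. The sketch below mirrors the logic above on a plain dictionary from theorem names to premise sets; `toy_split_by_premise` and all names in `toy_premises` are hypothetical, and val/test are kept as a single held-out pool for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Set


def toy_split_by_premise(
    premises_of: Dict[str, Set[str]], num_val_test: int
) -> Dict[str, List[str]]:
    """Hold out theorems premise by premise, rarest premises first."""
    users = defaultdict(list)
    for thm, premises in premises_of.items():
        for p in premises:
            users[p].append(thm)

    held_out: Set[str] = set()
    for _, thms in sorted(users.items(), key=lambda kv: len(kv[1])):
        if len(held_out) >= num_val_test:
            break
        # Every theorem using this premise leaves train, so the premise is novel.
        held_out.update(thms)

    return {
        "train": [t for t in premises_of if t not in held_out],
        "val_test": sorted(held_out),
    }


# Hypothetical theorems and premises, used only for illustration.
toy_premises = {
    "thm_a": {"add_comm", "mul_comm"},
    "thm_b": {"add_comm"},
    "thm_c": {"add_comm", "rare_lemma"},
    "thm_d": {"mul_comm", "add_comm"},
}
toy_premise_split = toy_split_by_premise(toy_premises, num_val_test=1)

# Each held-out theorem must use at least one premise that train never sees.
train_premises = set().union(*(toy_premises[t] for t in toy_premise_split["train"]))
for thm in toy_premise_split["val_test"]:
    assert toy_premises[thm] - train_premises

print(toy_premise_split)
# {'train': ['thm_a', 'thm_b', 'thm_d'], 'val_test': ['thm_c']}
```

Because all theorems sharing a premise are moved together, the held-out pool can overshoot the requested size; in that case the slicing in `split_by_premise` assigns the surplus to the test set.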

Given a traced repo, we can split the theorems using these strategies.

In [5]:
def split_data(traced_repo: TracedRepo) -> Dict[SPLIT_STRATEGY, SPLIT]:
    traced_theorems = traced_repo.get_traced_theorems()
    logger.info(f"{len(traced_theorems)} theorems in total")

    return {
        "random": split_randomly(traced_theorems),
        "novel_premises": split_by_premise(traced_theorems),
    }

## Exporting the Data

Once theorems are split into train/val/test, we export them to JSON files that can be easily used in machine learning.

In [6]:
def _get_file_path(traced_repo: TracedRepo, thm: TracedTheorem) -> str:
    if thm.repo == traced_repo.repo:
        # The theorem belongs to the traced repo itself.
        return str(thm.theorem.file_path)
    else:
        # The theorem belongs to one of the dependencies.
        return f"{LEAN3_DEPS_DIR}/{thm.repo.name}/{thm.theorem.file_path}"


def export_proofs(
    traced_repo: TracedRepo, splits: Dict[SPLIT_STRATEGY, SPLIT], dst_path: Path
) -> None:
    """Export all proofs in a traced repo to ``dst_path``."""
    for strategy, split in splits.items():
        split_dir = dst_path / strategy
        split_dir.mkdir(parents=True)

        for name, theorems in split.items():
            data = []
            num_tactics = 0

            for thm in theorems:
                tactics = [
                    {
                        "tactic": t.tactic,
                        "annotated_tactic": t.get_annotated_tactic(),
                        "state_before": t.state_before,
                        "state_after": t.state_after,
                    }
                    for t in thm.get_traced_tactics()
                    if t.state_before != "no goals"
                    and "·" not in t.tactic  # Ignore "·".
                ]
                num_tactics += len(tactics)
                data.append(
                    {
                        "url": thm.repo.url,
                        "commit": thm.repo.commit,
                        "file_path": _get_file_path(traced_repo, thm),
                        "full_name": thm.theorem.full_name,
                        "start": list(thm.start),
                        "end": list(thm.end),
                        "traced_tactics": tactics,
                    }
                )
            oup_path = split_dir / f"{name}.json"
            # Use a context manager so the file is flushed and closed promptly.
            with oup_path.open("wt") as oup:
                json.dump(data, oup)
            logger.info(
                f"{len(theorems)} theorems and {num_tactics} tactics saved to {oup_path}"
            )


def export_premises(traced_repo: TracedRepo, dst_path: Path) -> None:
    """Export all premise definitions in a traced repo to ``dst_path``."""
    oup_path = dst_path / "corpus.jsonl"
    num_premises = 0

    with oup_path.open("wt") as oup:
        G = traced_repo.traced_files_graph

        for tf_node in reversed(list(nx.topological_sort(G))):
            tf = G.nodes[tf_node]["traced_file"]
            imports = [str(_) for _ in G.successors(tf_node)]
            premises = tf.get_premise_definitions()
            num_premises += len(premises)
            oup.write(
                json.dumps(
                    {"path": str(tf.path), "imports": imports, "premises": premises}
                )
                + "\n"
            )
    logger.info(
        f"{num_premises} theorems/definitions from {traced_repo.num_traced_files} files saved to {oup_path}"
    )


def export_licenses(traced_repo: TracedRepo, dst_path: Path) -> None:
    """Export the licenses of a traced repo and all its dependencies to ``dst_path``."""
    license_dir = dst_path / "licenses"
    license_dir.mkdir()
    all_repos = [traced_repo.repo] + list(traced_repo.dependencies.values())

    for repo in all_repos:
        lic = repo.get_license()
        if lic is None:
            continue
        with (license_dir / repo.name).open("wt") as oup:
            oup.write(lic)

    with (license_dir / "README.md").open("wt") as oup:
        oup.write(
            "This directory contains licenses of Lean repos used to generate this dataset. The dataset itself is released under [CC BY 2.0](https://creativecommons.org/licenses/by/2.0/)."
        )


def export_metadata(traced_repo: TracedRepo, dst_path: Path, **kwargs) -> None:
    """Export the metadata of a traced repo to ``dst_path``."""
    metadata = dict(kwargs)
    metadata["creation_time"] = str(datetime.now())
    metadata["from_repo"] = {
        "url": traced_repo.repo.url,
        "commit": traced_repo.repo.commit,
    }
    metadata["leandojo_version"] = lean_dojo.__version__
    # Use a context manager so the file is flushed and closed promptly.
    with (dst_path / "metadata.json").open("wt") as oup:
        json.dump(metadata, oup)


def export_data(
    traced_repo: TracedRepo,
    splits: Dict[SPLIT_STRATEGY, SPLIT],
    dst_path: Union[str, Path],
    **kwargs,
) -> None:
    """Export a traced repo whose theorems have been split to ``dst_path``."""
    if isinstance(dst_path, str):
        dst_path = Path(dst_path)
    if dst_path.exists():
        logger.warning(f"{dst_path} already exists. Removing it now.")
        shutil.rmtree(dst_path)

    # Export the proofs.
    export_proofs(traced_repo, splits, dst_path)

    # Export the premises (theorems, definitions, etc.).
    export_premises(traced_repo, dst_path)

    # Export the licenses.
    export_licenses(traced_repo, dst_path)

    # Export metadata.
    export_metadata(traced_repo, dst_path, **kwargs)

Putting everything together, we're ready to generate the dataset!

In [7]:
repo = LeanGitRepo(URL, COMMIT)
traced_repo = trace(repo)
splits = split_data(traced_repo)
export_data(traced_repo, splits, DST_DIR, dataset_name="LeanDojo Benchmark")

2023-10-11 09:30:47.560 | INFO     | lean_dojo.data_extraction.trace:trace:163 - Loading the traced repo from /home/kaiyu/.cache/lean_dojo/leanprover-community-mathlib-19c869efa56bbb8b500f2724c0b77261edbfa28c/mathlib
2023-10-11 09:30:51,436 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
100%|████████████████████████████████████████| 3384/3384 [05:02<00:00, 11.19it/s]
2023-10-11 09:36:22.909 | INFO     | __main__:split_data:3 - 98734 theorems in total
2023-10-11 09:36:22.911 | INFO     | __main__:split_randomly:18 - Splitting the theorems randomly
2023-10-11 09:36:22.960 | INFO     | __main__:split_by_premise:8 - Splitting the theorems by premises


2023-10-11 09:36:46.907 | INFO     | __main__:export_proofs:48 - 94734 theorems and 209129 tactics saved to ../leandojo_benchmark/random/train.json
2023-10-11 09:36:47.560 | INFO     | __main__:export_proofs:48 - 2000 theorems and 4397 tactics saved to ../leandojo_benchmark/random/val.json
2023-10-11 09:36:47.902 | INFO     | __main__:export_proofs:48 - 2000 theorems and 4250 tactics saved to ../leandojo_benchmark/random/test.json


2023-10-11 09:37:03.453 | INFO     | __main__:export_proofs:48 - 94734 theorems and 191531 tactics saved to ../leandojo_benchmark/novel_premises/train.json
2023-10-11 09:37:04.596 | INFO     | __main__:export_proofs:48 - 2000 theorems and 12721 tactics saved to ../leandojo_benchmark/novel_premises/val.json
2023-10-11 09:37:05.622 | INFO     | __main__:export_proofs:48 - 2000 theorems and 13524 tactics saved to ../leandojo_benchmark/novel_premises/test.json
2023-10-11 09:37:40.438 | INFO     | __main__:export_premises:72 - 130283 theorems/definitions from 3384 files saved to ../leandojo_benchmark/corpus.jsonl


The warnings above are expected. It's not clear why we have problems locating a few premises related to `quot`, but we'll ignore them for now since they are only a tiny fraction of all premises. Please let us know if you have any ideas!

## Data Format

This is the resulting data directory:

```
├─corpus.jsonl
├─metadata.json
├─licenses
│ ├─lean
│ ├─mathlib
│ └─README.md
├─random
│ ├─train.json
│ ├─val.json
│ └─test.json
└─novel_premises
  ├─train.json
  ├─val.json
  └─test.json
```

`corpus.jsonl` is a corpus of all theorems and definitions in mathlib that can potentially be used as premises. Sub-directories `random` and `novel_premises` correspond to the two strategies for splitting the theorems. For each strategy, we have `*.json` files for train/val/test. The sub-directory `licenses` contains license information.

### Corpus of Potential Premises

`corpus.jsonl` is in [JSON Lines format](https://jsonlines.org/); a line includes the potential premises defined in a single `*.lean` file.

In [8]:
!cat ../leandojo_benchmark/corpus.jsonl | wc -l

3384


Let's look at one of them.

In [9]:
corpus_path = DST_DIR / "corpus.jsonl"
lines = list(corpus_path.open())
file_in_corpus = json.loads(lines[1000])
file_in_corpus.keys()

dict_keys(['path', 'imports', 'premises'])

We can check the file's path and other files it imports.

In [10]:
file_in_corpus["path"], file_in_corpus["imports"]

('src/data/polynomial/degree/definitions.lean',
 ['_target/deps/lean/library/init/default.lean',
  'src/data/polynomial/monomial.lean',
  'src/data/nat/with_bot.lean',
  'src/data/polynomial/coeff.lean',
  'src/data/fintype/big_operators.lean'])

In [11]:
len(file_in_corpus["premises"])

241

We can inspect the first potential premise:

In [12]:
file_in_corpus["premises"][0]

{'full_name': 'polynomial.degree',
 'code': 'def degree (p : R[X]) : with_bot ℕ := p.support.max',
 'start': [41, 1],
 'end': [43, 1],
 'kind': 'definition'}

Each premise has a fully qualified name, its definition (in the form of Lean code), and the exact location where it is defined.
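
Since every premise carries a fully qualified name and a location, the corpus can be turned into a lookup table. The following is a minimal sketch under stated assumptions: `index_premises` is a hypothetical helper, and `sample_line` is a hand-written line that mirrors the record shown above, with its `imports` list left empty for brevity.

```python
import json
from typing import Dict, Iterable, List, Tuple


def index_premises(jsonl_lines: Iterable[str]) -> Dict[str, Tuple[str, List[int]]]:
    """Map each premise's full name to (file path, start position)."""
    index = {}
    for line in jsonl_lines:
        record = json.loads(line)
        for premise in record["premises"]:
            index[premise["full_name"]] = (record["path"], premise["start"])
    return index


# A hand-written line in the corpus format; real lines hold many premises.
sample_line = json.dumps(
    {
        "path": "src/data/polynomial/degree/definitions.lean",
        "imports": [],
        "premises": [
            {
                "full_name": "polynomial.degree",
                "code": "def degree (p : R[X]) : with_bot ℕ := p.support.max",
                "start": [41, 1],
                "end": [43, 1],
                "kind": "definition",
            }
        ],
    }
)
premise_index = index_premises([sample_line])
print(premise_index["polynomial.degree"])
# ('src/data/polynomial/degree/definitions.lean', [41, 1])
```

On the real file, the same function could be called as `index_premises(corpus_path.open())`, since iterating over an open file yields one JSON line at a time.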


### Theorems/Proofs Data

Now let's take a look at the theorems/proofs data, taking the `random` split as an example.

In [13]:
train_path = DST_DIR / "random/train.json"
proofs_train = json.load(train_path.open())
len(proofs_train)

94734

Each element in `proofs_train` represents a theorem. Let's check one of them.

In [14]:
for proof in proofs_train[::-1]:
    if proof["traced_tactics"] != []:
        break
proof.keys()

dict_keys(['url', 'commit', 'file_path', 'full_name', 'start', 'end', 'traced_tactics'])

In [15]:
proof["url"], proof["commit"], proof["file_path"], proof["full_name"]

('https://github.com/leanprover-community/mathlib',
 '19c869efa56bbb8b500f2724c0b77261edbfa28c',
 'src/category_theory/limits/shapes/kernels.lean',
 'category_theory.limits.kernel_zero_iso_source_inv')

We see the theorem's name and where it is defined. The theorem includes some traced tactics.

In [16]:
len(proof["traced_tactics"])

2

Let's look at a traced tactic.

In [20]:
proof["traced_tactics"][1]

{'tactic': 'simp [kernel_zero_iso_source]',
 'annotated_tactic': ['simp [<a>kernel_zero_iso_source</a>]',
  [{'full_name': 'category_theory.limits.kernel_zero_iso_source',
    'def_path': 'src/category_theory/limits/shapes/kernels.lean',
    'def_pos': [254, 5]}]],
 'state_before': 'C : Type u,\n_inst_1 : category C,\n_inst_2 : limits.has_zero_morphisms C,\nX Y : C\n⊢ kernel_zero_iso_source.inv ≫ limits.equalizer.ι 0 0 = kernel.lift 0 (𝟙 X) _ ≫ limits.equalizer.ι 0 0',
 'state_after': 'no goals'}

`annotated_tactic` is the tactic with premises annotated by `<a> ... </a>`. For each premise, we know its fully qualified name and the exact location where it is defined, which is invaluable for training machine learning models for premise selection.
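
As an illustration of how these annotations might be consumed, the sketch below pairs each `<a> ... </a>` span with the full name of the premise it refers to. The helper `premises_in_tactic` is hypothetical, and it assumes that the spans and the provenance entries appear in the same order, which holds for the example above but is not verified here for the whole dataset.

```python
import re
from typing import List, Tuple

# The annotated tactic shown above, copied as a Python literal.
sample_annotated_tactic = [
    "simp [<a>kernel_zero_iso_source</a>]",
    [
        {
            "full_name": "category_theory.limits.kernel_zero_iso_source",
            "def_path": "src/category_theory/limits/shapes/kernels.lean",
            "def_pos": [254, 5],
        }
    ],
]


def premises_in_tactic(annotated_tactic: list) -> List[Tuple[str, str]]:
    """Pair each ``<a>...</a>`` span with the full name of its premise."""
    text, provenances = annotated_tactic
    spans = re.findall(r"<a>(.*?)</a>", text)
    # Assumption: spans and provenance entries are listed in the same order.
    return [(span, prov["full_name"]) for span, prov in zip(spans, provenances)]


print(premises_in_tactic(sample_annotated_tactic))
# [('kernel_zero_iso_source', 'category_theory.limits.kernel_zero_iso_source')]
```

Pairs like these, together with `state_before`, are one possible way to form (state, premise) examples for premise selection.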

## MiniF2F and ProofNet

Similarly, we extract datasets from [miniF2F](https://github.com/facebookresearch/miniF2F) and [ProofNet](https://github.com/zhangir-azerbayev/ProofNet), which are used for evaluation in our paper.

In [18]:
minif2f = LeanGitRepo(
    "https://github.com/facebookresearch/miniF2F",
    "5271ddec788677c815cf818a06f368ef6498a106",
)
traced_minif2f = trace(minif2f)

splits = {"default": {"val": [], "test": []}}

for tf in traced_minif2f.get_traced_theorems():
    if tf.repo.name != "miniF2F":
        continue
    if tf.file_path.name == "valid.lean":
        splits["default"]["val"].append(tf)
    else:
        assert tf.file_path.name == "test.lean"
        splits["default"]["test"].append(tf)

export_data(
    traced_minif2f, splits, "../leandojo_minif2f", dataset_name="LeanDojo MiniF2F"
)

2023-10-11 09:37:44.777 | INFO     | lean_dojo.data_extraction.trace:trace:163 - Loading the traced repo from /home/kaiyu/.cache/lean_dojo/facebookresearch-miniF2F-5271ddec788677c815cf818a06f368ef6498a106/miniF2F
2023-10-11 09:37:49,200 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
100%|████████████████████████████████████████| 1159/1159 [02:41<00:00,  7.18it/s]
2023-10-11 09:40:48.988 | INFO     | __main__:export_proofs:48 - 244 theorems and 549 tactics saved to ../leandojo_minif2f/default/val.json
2023-10-11 09:40:49.032 | INFO     | __main__:export_proofs:48 - 245 theorems and 781 tactics saved to ../leandojo_minif2f/default/test.json
2023-10-11 09:41:05.488 | INFO     | __main__:export_premises:72 - 67170 theorems/defi

In [19]:
proofnet = LeanGitRepo(
    "https://github.com/zhangir-azerbayev/ProofNet",
    "e8645aa830ce17c33a8b8482a8195f0f97d6a74a",
)
traced_proofnet = trace(proofnet)
splits = {
    "default": {
        "test": [
            tf
            for tf in traced_proofnet.get_traced_theorems()
            if tf.repo.name == "ProofNet"
        ]
    }
}
export_data(
    traced_proofnet, splits, "../leandojo_proofnet", dataset_name="LeanDojo ProofNet"
)

2023-10-11 09:41:07.121 | INFO     | lean_dojo.data_extraction.trace:trace:163 - Loading the traced repo from /home/kaiyu/.cache/lean_dojo/zhangir-azerbayev-ProofNet-e8645aa830ce17c33a8b8482a8195f0f97d6a74a/ProofNet
2023-10-11 09:41:12,373 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
100%|████████████████████████████████████████| 1539/1539 [03:45<00:00,  6.83it/s]
2023-10-11 09:45:17.484 | INFO     | __main__:export_proofs:48 - 374 theorems and 460 tactics saved to ../leandojo_proofnet/default/test.json
2023-10-11 09:45:36.646 | INFO     | __main__:export_premises:72 - 82365 theorems/definitions from 1539 files saved to ../leandojo_proofnet/corpus.jsonl
