#### Neural next-step prediction | part 2: learning
Tutorial on neural theorem proving\
Author: Sean Welleck

----------------

#### High-level goal

Our goal is to train a neural next-step predictor $p_\theta(y_t|x_t)$ on the dataset that we collected in the previous notebook.

To do so, we will fine-tune a pretrained language model on the dataset $\mathcal{D}=\{(x_t,y_t)\}$ using the standard supervised fine-tuning approach:

$$
\min_\theta \sum_{(x_t,y_t)\in \mathcal{D}}-\log p_\theta(y_t|x_t).
$$

That is, we maximize the conditional likelihood of a next-step $y_t$ given the context $x_t$. \
This corresponds to minimizing a cross-entropy loss at each position of the next-step, $\sum_{\ell=1}^{|y_t|}-\log p_\theta(y_t^\ell|x_t, y_t^{<\ell})$.

Each token of the next-step is conditioned on the context $x_t$ and on the preceding tokens $y_t^{<\ell}$. Since $x_t$ is the proof state reached by applying the earlier steps $y_{<t}$ to the initial proof state, conditioning on $x_t$ already accounts for the proof so far.
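
To make the objective concrete, here is a minimal pure-Python sketch of a token-level negative log-likelihood that sums only over next-step positions. The `IGNORE_INDEX = -100` masking of context positions follows the convention used in the Stanford Alpaca script; whether `tune.py` masks in exactly this way is an assumption, and the toy logits are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel: positions with this label are excluded from the loss


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_step_nll(logits, labels):
    """Sum of -log p(label) over positions whose label is not IGNORE_INDEX.

    logits[i] holds the scores used to predict the token at position i
    (i.e. the usual one-position shift has already been applied).
    """
    total = 0.0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # context token: contributes nothing to the loss
        total -= log_softmax(row)[label]
    return total


# Toy setup: vocabulary of 3 tokens, 2 context positions, 2 next-step positions.
logits = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],  # uniform prediction for the first next-step token
    [0.0, 0.0, 0.0],  # uniform prediction for the second next-step token
]
labels = [IGNORE_INDEX, IGNORE_INDEX, 1, 2]

loss = next_step_nll(logits, labels)
print(loss)  # two uniform predictions over 3 tokens, so 2 * log(3)
```

Because the two context positions are masked, changing their logits leaves the loss unchanged; only the next-step tokens are trained on.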

### Implementation

The implementation consists of two steps:

1. **Data formatting** ([data.py](../ntp_python/data.py)): formatting the examples.
2. **Tuning**  ([tune.py](../ntp_python/tune.py)): using a standard language model fine-tuning script.



#### 1. Data formatting

We format each (tactic-state, next-step) pair $(x_t, y_t)$ as:

        [GOAL]tacticstate[PROOFSTEP]next-step<|endoftext|>

Here, `[GOAL]...[PROOFSTEP]` is the input and `next-step<|endoftext|>` is the output.

This format comes from [Han et al., ICLR 2022]: \
[Proof Artifact Co-training for Theorem Proving with Language Models](https://arxiv.org/pdf/2102.06203.pdf).
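
The format can be sketched as a small helper. Note that `format_example` is a hypothetical function written for illustration, not the one in `data.py`, and the tactic state and step used in the demo are made up.

```python
EOS = "<|endoftext|>"  # end-of-sequence marker shown in the format above


def format_example(tactic_state, next_step):
    """Return the (input, output) strings for one (tactic-state, next-step) pair."""
    return {
        "input": f"[GOAL]{tactic_state}[PROOFSTEP]",
        "output": f"{next_step}{EOS}",
    }


# Illustrative pair (not taken from the dataset).
ex = format_example("m n : ℕ\n⊢ m + n = n + m", "rw [Nat.add_comm]")
print(ex["input"] + ex["output"])
```

Concatenating `input` and `output` gives the full training sequence; at inference time only the `input` part is fed to the model as the prompt.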

<!-- *Exercise:* can you think of other auxiliary tasks that might be useful? -->

<!-- *Exercise:* can you think of alternative formats, e.g. which provide additional context? -->

In [35]:
import sys
sys.path.append('../ntp_python')
import data

datasets = data.proofstep(
    data_dir='../data'
)

Saving split to disk...


100%|██████████| 3/3 [00:01<00:00,  2.48it/s]

train	169530
val	4053
test	3606





In [36]:
example = datasets['train'][0]
print("Input:", example['input'], '', sep='\n')
print("Output:", example['output'], sep='\n')

Input:
[GOAL]ι : Type u_1
I✝ J✝ : Box ι
x y : ι → ℝ
I J : WithBot (Box ι)
⊢ ↑I = ↑J ↔ I = J[PROOFSTEP]

Output:
simp only [Subset.antisymm_iff, ← le_antisymm_iff, withBotCoe_subset_iff]<|endoftext|>
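
The splits are saved to disk as `.jsonl` files, which the tuning script reads later. As a rough sketch of what such a file looks like, the snippet below writes and reads one JSON object per line with the standard library. It assumes the on-disk records use the same `input` and `output` keys as the example above; the helper names and the toy record are hypothetical.

```python
import json
import os
import tempfile


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON-lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy record in the assumed {"input", "output"} layout.
rows = [{"input": "[GOAL]⊢ True[PROOFSTEP]", "output": "trivial<|endoftext|>"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.jsonl")
    write_jsonl(path, rows)
    loaded = read_jsonl(path)

print(loaded[0]["input"], loaded[0]["output"], sep="\n")
```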


#### 2. Tuning

We minimally adapt a standard language-model fine-tuning script from [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py). 

You can check out the full script at [partI_nextstep/ntp_python/tune.py](../ntp_python/tune.py). \
See [partI_nextstep/scripts/tune_proofstep.sh](../scripts/tune_proofstep.sh) for a command that trains on 8 GPUs with deepspeed.

Here's an example command for training a 1.4b model on 1 GPU (you can switch to a smaller model to fit your compute constraints):

In [None]:
%%bash
REPO_DIR=".."
TRAIN_FILE=${REPO_DIR}/data/processed/proofstep-train.jsonl
VALID_FILE=${REPO_DIR}/data/processed/proofstep-val.jsonl
MODEL=EleutherAI/pythia-1.4b-deduped

OUTDIR=${REPO_DIR}/model/${MODEL}

python ../ntp_python/tune.py \
    --model_name_or_path ${MODEL} \
    --train_data_path ${TRAIN_FILE} \
    --valid_data_path ${VALID_FILE} \
    --fp16 \
    --output_dir ${OUTDIR} \
    --num_train_epochs 10 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --logging_dir "$OUTDIR" \
    --report_to="tensorboard"
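
To get a feel for what `--warmup_ratio 0.03` and `--lr_scheduler_type "cosine"` do together, here is a small sketch of a linear-warmup, cosine-decay schedule. It mirrors the usual Hugging Face behaviour as I understand it (warmup steps are the ceiling of `total_steps * warmup_ratio`), but `lr_at` is a hypothetical helper and the total of 1000 steps is an arbitrary choice for illustration.

```python
import math


def lr_at(step, total_steps, base_lr=1e-5, warmup_ratio=0.03):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # illustrative number of optimizer steps
for s in (0, 15, 30, 515, 1000):
    print(s, lr_at(s, total))
```

The rate starts at zero, peaks at `base_lr` once warmup ends, and decays smoothly to zero at the final step.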


Now let's run the actual training script. [partI_nextstep/scripts/tune_proofstep.sh](../scripts/tune_proofstep.sh) was partly out of date, so I amended it slightly before running it.

In [37]:
!nvidia-smi

Sun Oct  8 08:25:17 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   34C    P8              11W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L4                      Off | 00000000:00:04.0 Off |  

Install dependencies

In [12]:
!python -V

Python 3.10.6


This Python version should be fine to use.

In [38]:
!pip install ndjson

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting ndjson
  Downloading ndjson-0.3.1-py2.py3-none-any.whl (5.3 kB)
Installing collected packages: ndjson
Successfully installed ndjson-0.3.1


In [39]:
!sh ../scripts/tune_proofstep.sh

[2023-10-08 08:25:43,068] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-08 08:25:56,331] [INFO] [runner.py:570:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ../ntp_python/tune.py --deepspeed ../scripts/ds_config.json --model_name_or_path EleutherAI/pythia-2.8b-deduped --train_data_path ../data/processed/proofstep-train.jsonl --valid_data_path ../data/processed/proofstep-val.jsonl --fp16 --output_dir ../model/EleutherAI/pythia-2.8b-deduped --num_train_epochs 10 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 2 --evaluation_strategy steps --eval_steps 500 --save_strategy steps --save_steps 500 --save_total_limit 1 --learning_rate 1e-5 --load_best_model_at_end 1 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_ste
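
The launcher command above carries enough information to estimate the effective batch size. The sketch below decodes the `--world_info` value, assuming it is base64-encoded JSON that maps each host to its GPU ranks, and combines it with the batch-size flags from the same command and the training-set size printed earlier. The steps-per-epoch figure is an approximation; the trainer's exact rounding may differ slightly.

```python
import base64
import json
import math

# Value of --world_info copied from the launcher command above.
world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
hosts = json.loads(base64.urlsafe_b64decode(world_info))
num_gpus = sum(len(ranks) for ranks in hosts.values())

per_device_batch = 4    # --per_device_train_batch_size
grad_accum = 2          # --gradient_accumulation_steps
train_examples = 169530  # size of the train split printed earlier

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(train_examples / effective_batch)

print(hosts)
print("effective batch size:", effective_batch)
print("approx. optimizer steps per epoch:", steps_per_epoch)
```

If you train on fewer GPUs, raising `--gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.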

#### After training

If everything went well, you should have a model in `../model/{MODEL_NAME}/checkpoint-{BEST_STEP}`.
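
If you are unsure which checkpoint directory to use, a helper along these lines can locate it. This is a sketch under assumptions: it expects `checkpoint-<step>` subdirectories, and it relies on the Hugging Face trainer writing a `trainer_state.json` with a `best_model_checkpoint` field, falling back to the highest step otherwise. `find_checkpoint` is a hypothetical name, and the demo builds a fake directory tree rather than touching a real model.

```python
import json
import os
import re
import tempfile


def find_checkpoint(model_dir):
    """Return the recorded best checkpoint path, else the highest-step checkpoint."""
    ckpts = [d for d in os.listdir(model_dir) if re.fullmatch(r"checkpoint-\d+", d)]
    if not ckpts:
        return None
    latest = max(ckpts, key=lambda d: int(d.split("-")[1]))
    state_path = os.path.join(model_dir, latest, "trainer_state.json")
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            best = json.load(f).get("best_model_checkpoint")
        if best:
            return best
    return os.path.join(model_dir, latest)


# Demo on a fake output directory (no real model files involved).
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    best_path = os.path.join(tmp, "checkpoint-500")
    state_file = os.path.join(tmp, "checkpoint-1000", "trainer_state.json")
    with open(state_file, "w", encoding="utf-8") as f:
        json.dump({"best_model_checkpoint": best_path}, f)
    found = find_checkpoint(tmp)
    print(os.path.basename(found))
```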

We have fine-tuned an `EleutherAI/pythia-2.8b-deduped` model that can be accessed through HuggingFace ([link](https://huggingface.co/wellecks/llmstep-mathlib4-pythia2.8b)):

In [None]:
import transformers

MODEL = 'wellecks/llmstep-mathlib4-pythia2.8b'
model = transformers.GPTNeoXForCausalLM.from_pretrained(MODEL)
tokenizer = transformers.GPTNeoXTokenizerFast.from_pretrained(MODEL)

You can use your own model by setting `MODEL = "../model/{MODEL_NAME}/checkpoint-{BEST_STEP}"` \
(e.g., `../model/EleutherAI/pythia-2.8b-deduped/checkpoint-5000`).

Let's generate a next-step suggestion for the proof state from our original example:

```lean
    theorem test_thm (m n : Nat) (h : m.coprime n) : m.gcd n = 1
```
Recall from the previous notebook that the initial proof state $x_0$ is:

        m n : ℕ
        h : Nat.coprime m n
        ⊢ Nat.gcd m n = 1

In [4]:
prompt = """[GOAL]m n : ℕ
  h : Nat.coprime m n
  ⊢ Nat.gcd m n = 1[PROOFSTEP]"""

input_ids = tokenizer.encode(prompt, return_tensors='pt')
out = model.generate(
    input_ids,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id
)
text = tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True)
print(text)

rw [← h.gcd_eq_one]


### Next steps

In the next notebook, we will prove theorems with the trained model by interacting with the Lean proof assistant.

This will let us automatically check whether a generated proof (e.g., one containing the step above) is correct.

Later on, we will build a VSCode plugin that returns next-step suggestions from the language model.