Performing Arithmetic with Transformer

This project demonstrates that a generic neural model can be trained to perform complex arithmetic operations without an architecture designed explicitly for the task. Our model computes 5-digit by 5-digit decimal multiplication at 100% accuracy. Specifically, we train the GPT-2 model on a large number of generated expressions that spell out the process of computing a multiplication step by step. Given 87708*15192, it should produce 1332459936.

See the "How does it work" section for an explanation of the inner mechanics.

How to train

Install dependencies with pip install -r requirements.txt.

Edit the training script and change the following lines:

  • Edit gpus=4 to the number of GPUs on your machine.
  • Edit batch_size=16 to fit the memory size of your GPU.
  • Adjust accumulate_grad_batches=8 accordingly to control the effective batch size (gpus*batch_size*accumulate_grad_batches). In my early tests, larger batch sizes trained faster and more stably.
  • Optionally set the resume_from_checkpoint argument to a checkpoint file to resume training from that checkpoint. PyTorch Lightning should automatically save checkpoints during training.
  • You may also want to adjust the model hyperparameters. Since the task is relatively simple, a smaller model should still work, and if it does, it will train faster.
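As a quick check of the effective batch size formula above, here is a minimal sketch. The helper name effective_batch_size and the single-GPU settings are illustrative assumptions, not part of the project.

```python
def effective_batch_size(gpus: int, batch_size: int, accumulate_grad_batches: int) -> int:
    """Number of samples that contribute to each optimizer step."""
    return gpus * batch_size * accumulate_grad_batches


# Default values mentioned in the list above.
default = effective_batch_size(gpus=4, batch_size=16, accumulate_grad_batches=8)
print(default)

# Hypothetical single-GPU setup with a smaller per-GPU batch: raising
# accumulate_grad_batches keeps the effective batch size unchanged.
single_gpu = effective_batch_size(gpus=1, batch_size=8, accumulate_grad_batches=64)
print(single_gpu)
```

Both configurations yield the same effective batch size, so a machine with fewer GPUs or less memory can trade per-step batch size for more accumulation steps.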

Then run the script with python and watch it train. It typically takes around 100 epochs (10000000 samples) to converge, at which point it should reach 100% accuracy.

How to test

If you don't want to train the model and just want to test it out, you can download the pretrained model. Put the model (named epoch=98.ckpt) in the same folder as all other files. Otherwise, change the checkpoint filename in the test script to the one generated by training.

Run the test script with python and give it prompts in the format xxxxx*xxxxx;, where each x is a decimal digit (the first digit of each number must not be zero). For example, enter 12345*54321; and watch it compute the answer.
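The prompt format can be checked before it is passed to the model. The sketch below assumes the xxxxx*xxxxx; pattern is meant literally (exactly five digits per operand); the names PROMPT_RE, is_valid_prompt and make_prompt are hypothetical helpers, not part of the project.

```python
import re

# Five digits, '*', five digits, ';', with no leading zero on either operand.
PROMPT_RE = re.compile(r"[1-9][0-9]{4}\*[1-9][0-9]{4};")


def is_valid_prompt(prompt: str) -> bool:
    """Return True if the whole prompt matches the xxxxx*xxxxx; format."""
    return PROMPT_RE.fullmatch(prompt) is not None


def make_prompt(a: int, b: int) -> str:
    """Build a prompt from two integers, rejecting anything outside the format."""
    prompt = f"{a}*{b};"
    if not is_valid_prompt(prompt):
        raise ValueError(f"operands must be 5-digit numbers: {a!r}, {b!r}")
    return prompt


print(make_prompt(12345, 54321))
print(is_valid_prompt("01234*54321;"))  # leading zero is rejected
```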

The answer is expected to be found between the last = sign and the $ sign.
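A minimal sketch of pulling the answer out of the generated text, following the rule above. It assumes the first $ in the output ends the sequence; the function name extract_answer is hypothetical, and the sample string uses a placeholder instead of real intermediate steps.

```python
def extract_answer(output: str) -> str:
    """Return the text between the last '=' before '$' and the '$' itself."""
    dollar = output.find("$")
    if dollar == -1:
        raise ValueError("no '$' terminator found")
    equals = output.rfind("=", 0, dollar)
    if equals == -1:
        raise ValueError("no '=' found before '$'")
    return output[equals + 1:dollar]


# Placeholder output: the intermediate steps are NOT real model output.
sample = "12345*54321;<intermediate steps>=670592745$"
answer = extract_answer(sample)
print(answer)
print(int(answer) == 12345 * 54321)
```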

How does it work

Think about how humans perform decimal multiplication:

        39
    x   96
    ------
       234
      351
    ------
      3744

Multiplication can be computed vertically. We first decompose the multiplier into digits and multiply each digit by the multiplicand; then we shift each result by the position of its digit and sum them together. There are two things left to do:

  • Single digit multiplication. In the above example we need to compute 39*6=234 and 39*9=351. This is again done by decomposing the multiplicand into digits, multiplying each digit by the multiplier, and summing the results. The last unit of computation, single-digit by single-digit multiplication, can be done by looking up the multiplication table.

  • Summation. While the process of summing up the inputs may seem trivial, the GPT-3 model is known to struggle with multi-operation arithmetic (see "Single Digit Three Ops" performance in the original GPT-3 paper). We need to break it up and compute the sum of two numbers at a time. We compute the number from right to left, and compute the carry first and then the actual digit. To make the task even easier for the model, the model finds the appropriate digits first and then computes the carry digit and the ones digit, so it doesn't have to worry about where to find the digits to perform the operation. When the two numbers have different numbers of digits, the missing digits are filled with zeros. However, the operation does not stop when the digits of both numbers are used up, since a carry can extend past the most significant digit, as in 9999+101. Instead, the operation stops when the two digits and the carry are all zero. The sequence for 9999+101 looks like this (spaces added for clarity; they don't exist in actual training samples):

    9999+101;9100 9010 9111 9010 0011 000=10100
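The addition sequence above can be reproduced with a short sketch. It is based on one reading of the example: each group lists the two current digits, the incoming carry, and the resulting ones digit, and the sequence closes with 000 once both numbers are used up and no carry remains. The function name addition_trace and the sep parameter are hypothetical; this is not the project's own generator.

```python
def addition_trace(a: int, b: int, sep: str = "") -> str:
    """Build a digit-by-digit addition sequence, right to left."""
    if a < 0 or b < 0:
        raise ValueError("only non-negative integers are supported")
    x, y = str(a)[::-1], str(b)[::-1]  # least significant digit first
    groups, result_digits = [], []
    carry, i = 0, 0
    while True:
        exhausted = i >= len(x) and i >= len(y)
        if exhausted and carry == 0:
            groups.append("000")  # both digits and the carry are zero: stop
            break
        da = int(x[i]) if i < len(x) else 0  # missing digits count as zero
        db = int(y[i]) if i < len(y) else 0
        total = da + db + carry
        ones = total % 10
        groups.append(f"{da}{db}{carry}{ones}")
        result_digits.append(str(ones))
        carry = total // 10
        i += 1
    result = "".join(reversed(result_digits))
    return f"{a}+{b};{sep.join(groups)}={result}"


# Spaces are passed only for readability, as in the example above.
print(addition_trace(9999, 101, sep=" "))
```

With sep=" " the output matches the 9999+101 line shown above; with the default empty separator the groups run together, as the text says they do in training samples.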

The full sequence for 39*96 is:


For details on how the expressions are generated, try out the generate_multiplication function for yourself (enter any two numbers as arguments).
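The decomposition described in this section can also be sketched in plain Python. This is not the repository's generate_multiplication function and does not reproduce its training-sequence format; the names TABLE, times_digit and long_multiply are made up for illustration, and the sums use Python's built-in addition instead of the two-at-a-time digit-wise procedure.

```python
# Single-digit multiplication table: the only "memorized" products.
TABLE = {(i, j): i * j for i in range(10) for j in range(10)}


def times_digit(multiplicand: int, digit: int) -> int:
    """Multiply a multi-digit number by one digit via table lookups and shifts."""
    total = 0
    for pos, ch in enumerate(reversed(str(multiplicand))):
        total += TABLE[(int(ch), digit)] * 10 ** pos
    return total


def long_multiply(multiplicand: int, multiplier: int):
    """Return (partial products with their shifts, final product)."""
    partials = []
    for pos, ch in enumerate(reversed(str(multiplier))):
        partials.append((times_digit(multiplicand, int(ch)), pos))
    product = sum(row * 10 ** pos for row, pos in partials)
    return partials, product


partials, product = long_multiply(39, 96)
print(partials)  # rows 39*6 and 39*9 with their shift positions
print(product)
```

For 39*96 the partial products are 234 (shift 0) and 351 (shift 1), which sum to 3744 after shifting, matching the worked example.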

The first 20 training samples can be found in exprs.txt. In actual training, samples are generated on the fly using pseudorandom number generators. Here is the first line of exprs.txt:


