# **Arithmetic with Language Models: from Memorization to Computation**

This notebook reproduces the experiments on the **encoder-decoder Transformer** architecture reported in [D. Maltoni and M. Ferrara, *"Arithmetic with language models: From memorization to computation"*, Neural Networks, vol. 179, 2024](https://www.sciencedirect.com/science/article/pii/S089360802400474X). It uses the following Python scripts:
- **ArithmeticData.py** - contains functions to create, shuffle and split datasets used in the experimentation.
- **Transformer.py** - contains a modified version of [A. Sarkar, *"Build your own Transformer from scratch using Pytorch"*, Towards Data Science, 2023](https://towardsdatascience.com/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb), which implements the encoder-decoder Transformer architecture introduced by [Vaswani et al. (2017)](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html).
- **TransformerUtilites.py** - provides utility functions for evaluating performance indicators.
- **TransformerTraining.py** - contains the *transformer_training* function, which performs multiple training runs (*run_count*) with the same number of epochs (*epochs*) on the same dataset (created internally from the *op*, *revert_bit* and *val_set_type* parameters).
- **TransformerComputeTokenAndValueDist.py** - contains the *transformer_compute_token_and_value_dist* function, which studies the Transformer's internal representation (embeddings) by correlating the distances between embeddings with the corresponding distances at the input/output level.
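
To make the idea of correlating embedding distances with input/output distances more concrete, the following is a minimal sketch using only the standard library. It is not the code in *TransformerComputeTokenAndValueDist.py*: the function names `pearson` and `correlate_distances`, the choice of Euclidean distance and Pearson correlation, and the toy data are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlate_distances(embeddings, values):
    """Correlate pairwise embedding distances with pairwise value distances."""
    emb_dists, val_dists = [], []
    for i, j in combinations(range(len(values)), 2):
        # Euclidean distance between the two embedding vectors (assumed metric)
        emb_dists.append(math.dist(embeddings[i], embeddings[j]))
        # Absolute difference between the corresponding numeric values
        val_dists.append(abs(values[i] - values[j]))
    return pearson(emb_dists, val_dists)


# Toy data (made up): embeddings that move linearly with the value
values = [0, 1, 2, 3]
embeddings = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
corr = correlate_distances(embeddings, values)
print(f"correlation: {corr:.3f}")
```

A correlation close to 1 in such a sketch would indicate that samples with similar values also have nearby embeddings; a token-level distance could be substituted for `abs(values[i] - values[j])` to compare the two views.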

The following cell imports the modules and functions required by this notebook. Once it has been run, the remaining code cells are independent of one another and can be run in any order.

In [1]:
import os
import matplotlib.pyplot as plt

from TransformerTraining import transformer_training
from TransformerComputeTokenAndValueDist import transformer_compute_token_and_value_dist

The following code cell reproduces the experiment used to generate the left graph reported in Figure 1. To store the weights of the trained models, set the *out_folder_path* parameter to the directory where the models should be saved.

In [None]:
#Figure 1 (left)

op='+'
run_count=5
epochs=50
out_folder_path=None

avg_train_seq_acc,avg_val_seq_acc=transformer_training(op,run_count,epochs,out_folder_path=out_folder_path)

plt.plot(avg_train_seq_acc*100,label='Train')
plt.plot(avg_val_seq_acc*100,label='Val')
plt.xlabel('Epochs')
plt.ylabel('Sequence Accuracy (%)')
plt.legend()
plt.show() 
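
The plots in this notebook report sequence accuracy. As a rough illustration, and assuming the usual exact-match definition (a prediction counts as correct only if every token of the sequence matches the target), the metric can be sketched as below. The function name `sequence_accuracy` and the toy sequences are hypothetical; the notebook's own metric is computed by the utility scripts and may differ in detail.

```python
def sequence_accuracy(predictions, targets):
    """Fraction of sequences whose predicted tokens all match the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sequence is required")
    # A sequence is correct only if it matches the target token by token
    exact = sum(1 for p, t in zip(predictions, targets) if list(p) == list(t))
    return exact / len(targets)


# Toy example with made-up binary token sequences
targets = ["0110", "1010", "1111", "0001"]
predictions = ["0110", "1011", "1111", "0001"]
acc = sequence_accuracy(predictions, targets)
print(f"sequence accuracy: {acc * 100:.1f}%")
```

Under this definition a single wrong token makes the whole sequence count as an error, so the metric is stricter than per-token accuracy.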

The following code cell reproduces the experiment used to generate the right graph reported in Figure 1. To store the weights of the trained models, set the *out_folder_path* parameter to the directory where the models should be saved.

In [None]:
#Figure 1 (right)

op='x'
run_count=5
epochs=250
out_folder_path=None

avg_train_seq_acc,avg_val_seq_acc=transformer_training(op,run_count,epochs,out_folder_path=out_folder_path)

plt.plot(avg_train_seq_acc*100,label='Train')
plt.plot(avg_val_seq_acc*100,label='Val')
plt.xlabel('Epochs')
plt.ylabel('Sequence Accuracy (%)')
plt.legend()
plt.show()
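
The variable names *avg_train_seq_acc* and *avg_val_seq_acc* suggest that the returned curves are per-epoch averages over the *run_count* runs. Assuming that reading, the averaging step can be sketched as follows. The helper `average_curves` and the accuracy values are hypothetical, and plain Python lists stand in for whatever array type *transformer_training* actually returns.

```python
def average_curves(runs):
    """Average per-epoch accuracy curves element-wise across runs."""
    if not runs:
        raise ValueError("at least one run is required")
    n_epochs = len(runs[0])
    if any(len(run) != n_epochs for run in runs):
        raise ValueError("all runs must cover the same number of epochs")
    # zip(*runs) groups the values of all runs epoch by epoch
    return [sum(epoch_values) / len(runs) for epoch_values in zip(*runs)]


# Two made-up runs of three epochs each
runs = [
    [0.25, 0.50, 1.00],
    [0.75, 1.00, 1.00],
]
avg = average_curves(runs)
print(avg)
```

Averaging over several runs smooths out the effect of random initialization and shuffling, which is why lowering *run_count* trades stability of the curve for running time.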

The following code cell reproduces the experiment used to generate the graph reported in Figure 2. Since the trend remains consistent across multiple runs, the *run_count* parameter is set to 1 to reduce the running time. The *epochs* parameter is set to 4000 rather than the originally stated 1000, which corrects an error in the x-axis label of Figure 2.

In [None]:
#Figure 2

op='R'
run_count=1
epochs=4000

avg_train_seq_acc,avg_val_seq_acc=transformer_training(op,run_count,epochs)

plt.plot(avg_train_seq_acc*100,label='Train')
plt.plot(avg_val_seq_acc*100,label='Val')
plt.xlabel('Epochs')
plt.ylabel('Sequence Accuracy (%)')
plt.legend()
plt.show() 

The following code cell reproduces the experiment used to generate the left graph reported in Figure 3. To shorten the running time of the experiment, set the *run_count* parameter to 1.

In [None]:
#Figure 3 (left)

op='+'
run_count=5
epochs=50

_,avg_rndval_seq_acc=transformer_training(op,run_count,epochs)
_,avg_vst_seq_acc=transformer_training(op,run_count,epochs,val_set_type='VSt')
_,avg_vsv_seq_acc=transformer_training(op,run_count,epochs,val_set_type='VSv')

plt.plot(avg_rndval_seq_acc*100,label='Random Split')
plt.plot(avg_vst_seq_acc*100,label='VS_t')
plt.plot(avg_vsv_seq_acc*100,label='VS_v')
plt.xlabel('Epochs')
plt.ylabel('Sequence Accuracy (%)')
plt.legend()
plt.show() 

The following code cell reproduces the experiment used to generate the right graph reported in Figure 3. To shorten the running time of the experiment, set the *run_count* parameter to 1.

In [None]:
#Figure 3 (right)

op='x'
run_count=5
epochs=250

_,avg_rndval_seq_acc=transformer_training(op,run_count,epochs)
_,avg_vst_seq_acc=transformer_training(op,run_count,epochs,val_set_type='VSt')
_,avg_vsv_seq_acc=transformer_training(op,run_count,epochs,val_set_type='VSv')

plt.plot(avg_rndval_seq_acc*100,label='Random Split')
plt.plot(avg_vst_seq_acc*100,label='VS_t')
plt.plot(avg_vsv_seq_acc*100,label='VS_v')
plt.xlabel('Epochs')
plt.ylabel('Sequence Accuracy (%)')
plt.legend()
plt.show() 

The following code cell reproduces the experiment used to generate the table reported in Figure 4.a. To load previously saved weights, set the *model_checkpoint_path* parameter to the file path of the stored weights.

In [None]:
#Figure 4 (a)

op='+'
# Build the checkpoint path portably (the original backslash only works on Windows)
model_checkpoint_path=os.path.join(os.path.abspath(''),'add_model')

transformer_compute_token_and_value_dist(op,model_checkpoint_path)

The following code cell reproduces the experiment used to generate the table reported in Figure 4.b. To load previously saved weights, set the *model_checkpoint_path* parameter to the file path of the stored weights.

In [None]:
#Figure 4 (b)

op='x'
# Build the checkpoint path portably (the original backslash only works on Windows)
model_checkpoint_path=os.path.join(os.path.abspath(''),'mul_model')

transformer_compute_token_and_value_dist(op,model_checkpoint_path)

The following code cell reproduces the experiment used to generate the left graph reported in Figure C.6.

In [None]:
#Figure C.6 (left)

op='+'
run_count=1
epochs=150

_,avg_val_seq_acc=transformer_training(op,run_count,epochs)
_,avg_val_seq_acc_not_rev=transformer_training(op,run_count,epochs,revert_bit=False)

plt.plot(avg_val_seq_acc*100,label='Reverse')
plt.plot(avg_val_seq_acc_not_rev*100,label='Plain')
plt.xlabel('Epochs')
plt.ylabel('Sequence Accuracy (%)')
plt.legend()
plt.show() 
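
The 'Reverse' and 'Plain' curves differ only in the *revert_bit* parameter. Assuming that *revert_bit* controls whether the bits of the result are emitted in reverse order (least significant bit first), the two output encodings can be sketched as below. The helpers `to_bits` and `encode_result`, the 4-bit width and the string representation are illustrative assumptions, not the encoding implemented in *ArithmeticData.py*.

```python
def to_bits(value, width):
    """Fixed-width binary string, most significant bit first."""
    if value < 0 or value >= 2 ** width:
        raise ValueError(f"{value} does not fit in {width} bits")
    return format(value, f"0{width}b")


def encode_result(value, width, revert_bit=True):
    """Encode a result, optionally reversing the bit order (assumed meaning)."""
    bits = to_bits(value, width)
    return bits[::-1] if revert_bit else bits


# Hypothetical example: 5 + 6 = 11 with a 4-bit result
result = 5 + 6
plain = encode_result(result, 4, revert_bit=False)
reverse = encode_result(result, 4, revert_bit=True)
print("plain:  ", plain)
print("reverse:", reverse)
```

One plausible motivation for the reversed order is that carries in addition propagate from the least significant bit, so a decoder that generates that bit first can produce each output token after the information it depends on.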

The following code cell reproduces the experiment used to generate the right graph reported in Figure C.6.

In [None]:
#Figure C.6 (right)

op='x'
run_count=1
epochs=1500

_,avg_val_seq_acc=transformer_training(op,run_count,epochs)
_,avg_val_seq_acc_not_rev=transformer_training(op,run_count,epochs,revert_bit=False)

plt.plot(avg_val_seq_acc*100,label='Reverse')
plt.plot(avg_val_seq_acc_not_rev*100,label='Plain')
plt.xlabel('Epochs')
plt.ylabel('Sequence Accuracy (%)')
plt.legend()
plt.show() 