# First Draft Submission for test for Symbolic Representations of Probability Amplitude for High Energy Physics (Parv Agarwal)

This submission is a proof of concept rather than a finished entry. I am looking for feedback on the method before working on its accuracy and capabilities.

Currently, we generate a range of function combinations from the exponential and trigonometric functions, encode them into a dataset using character-level encoding, and train an LSTM model on it. Because this is a preliminary draft, little attention has been paid to the accuracy of the model or the composition of the dataset, since the level of demonstration required is unclear.

## Version of Events

I am not experienced with sequence-to-sequence learning. To get hands-on practice, I first reproduced this tutorial on language translation: https://analyticsindiamag.com/sequence-to-sequence-modeling-using-lstm-for-language-translation/. Much of the code here is derived from that tutorial and adapted to this specific use case.

In [1]:
from sympy import *
import matplotlib.pyplot as plt
import numpy as np

First, we generate the dataset of functions together with their Taylor expansions. To do that, we define a `taylor` function that expands a function to fourth order around a given point. We then build combinations of the common functions: the exponential and the trigonometric cos, sin and tan. A better model would generally require a larger dataset that is more representative of the kinds of functions of interest; this one is only a proof of concept.

In [9]:
x = symbols("x")

""" Generating the dataset """

# Taylor polynomial of order n for 'function', expanded around x0
def taylor(function, x0, n):
    p = 0
    for i in range(n + 1):
        # subs evaluates the i-th derivative at x0
        p = Add(p, function.diff(x, i).subs(x, x0) / factorial(i) * (x - x0)**i)
    return p


function_operators = [cos(x), sin(x), exp(x), tan(x)]
# Create ranges of expressions using the four functions to Taylor expand and create dataset

function_dataset = []

# 'power' is used instead of 'pow' to avoid shadowing the Python builtin
for power in range(0, 6):
    for i in range(0, len(function_operators)):
        for j in range(0, len(function_operators)):
            function_dataset.append(function_operators[i]**power * function_operators[j])
            function_dataset.append(function_operators[i] * function_operators[j]**power)



taylor_rep = [taylor(i,0,4) for i in function_dataset]

for i in range(0,10):
  print(f"{function_dataset[i]} - {taylor_rep[i]}")

cos(x) - x**4/24 - x**2/2 + 1
cos(x) - x**4/24 - x**2/2 + 1
sin(x) - -x**3/6 + x
cos(x) - x**4/24 - x**2/2 + 1
exp(x) - x**4/24 + x**3/6 + x**2/2 + x + 1
cos(x) - x**4/24 - x**2/2 + 1
tan(x) - x**3/3 + x
cos(x) - x**4/24 - x**2/2 + 1
cos(x) - x**4/24 - x**2/2 + 1
sin(x) - -x**3/6 + x
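
To get a feel for what a fourth-order expansion around 0 keeps and discards, the polynomials printed above can be evaluated numerically against the original functions. This is a minimal sketch using only the standard library; the helper name `max_abs_error`, the grid and the two radii are illustrative choices, not part of the notebook.

```python
import math

# Fourth-order polynomials around 0, copied from the printed output above.
approximations = {
    "cos": (math.cos, lambda x: x**4 / 24 - x**2 / 2 + 1),
    "sin": (math.sin, lambda x: -x**3 / 6 + x),
    "exp": (math.exp, lambda x: x**4 / 24 + x**3 / 6 + x**2 / 2 + x + 1),
    "tan": (math.tan, lambda x: x**3 / 3 + x),
}

def max_abs_error(exact, approx, radius, steps=100):
    """Largest |exact(x) - approx(x)| on an evenly spaced grid over [-radius, radius]."""
    points = [-radius + 2 * radius * k / steps for k in range(steps + 1)]
    return max(abs(exact(p) - approx(p)) for p in points)

for name, (exact, approx) in approximations.items():
    near = max_abs_error(exact, approx, 0.1)
    far = max_abs_error(exact, approx, 1.0)
    print(f"{name}: error within 0.1 of zero = {near:.2e}, within 1.0 = {far:.2e}")
```

The error grows quickly away from the expansion point, which is worth keeping in mind when judging what the target strings actually represent.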


Now we can start the preprocessing for training an LSTM model. We separate the input and target sequences (the functions and their Taylor representations) and count the tokens for the input and target sequences.

In [10]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.utils import *
from keras.initializers import *
import tensorflow as tf
import time, random


# Hyperparameters
batch_size = 64
latent_dim = 256
num_samples = 10000

# Vectorizing data
input_texts = []
target_texts = []
input_chars = set()
target_chars = set()

In [11]:

for i in range(0, len(function_dataset)):
  input_text, target_text = str(function_dataset[i]), str(taylor_rep[i])
  # '\t' marks the start of a target sequence and '\n' marks its end
  target_text = '\t' + target_text + '\n'
  input_texts.append(input_text)
  target_texts.append(target_text)
  # set.update adds every character of the string; repeats are ignored
  input_chars.update(input_text)
  target_chars.update(target_text)

input_chars = sorted(list(input_chars))
target_chars = sorted(list(target_chars))
num_encoder_tokens = len(input_chars)
num_decoder_tokens = len(target_chars)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

#Print size
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)



Number of samples: 192
Number of unique input tokens: 18
Number of unique output tokens: 18
Max sequence length for inputs: 16
Max sequence length for outputs: 48
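
The 192 samples reported above include repeats: the first ten printed pairs already show `cos(x)` several times. A rough way to estimate how many distinct expressions the generation loop yields is to track only the exponent carried by each base function. This sketch assumes that SymPy merges repeated factors, so that two products count as the same sample whenever every base function has the same total exponent; the names `BASES` and `exponent_vector` are illustrative.

```python
from itertools import product

BASES = ("cos", "sin", "exp", "tan")

def exponent_vector(i, i_power, j, j_power):
    """Exponent of each base function in f_i**i_power * f_j**j_power."""
    exponents = [0] * len(BASES)
    exponents[i] += i_power
    exponents[j] += j_power
    return tuple(exponents)

generated = []
# Same iteration order as the notebook's three nested loops.
for power, i, j in product(range(6), range(len(BASES)), range(len(BASES))):
    generated.append(exponent_vector(i, power, j, 1))  # f_i**power * f_j
    generated.append(exponent_vector(i, 1, j, power))  # f_i * f_j**power

print(len(generated), len(set(generated)))  # 192 78 under the stated assumption
```

If this estimate holds, a random validation split will contain expressions that also appear in the training portion, so the validation accuracy would say little about generalisation.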


Now we build the one-hot arrays for the encoder and decoder, and define the encoder, the decoder and the model.
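
The decoder arrays follow the usual teacher-forcing layout: at each step the decoder sees one character of the target and is trained to predict the next one, so the target array is the input array shifted left by one position. A minimal sketch on a single string, using the `tan(x)` expansion from the printed output (the variable names are illustrative):

```python
taylor_text = "x**3/3 + x"               # expansion of tan(x) shown earlier
target_text = "\t" + taylor_text + "\n"  # start and end markers, as in the notebook

decoder_input = target_text[:-1]   # what the decoder is fed at each step
decoder_target = target_text[1:]   # what it should predict at that step

for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(step, repr(seen), "->", repr(expected))
```

In the notebook the final `'\n'` is also written into `decoder_in_data`, but it has no following character, so the matching row of `decoder_target_data` stays all zeros.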

In [12]:

# Define data for encoder and decoder
input_token_id = dict([(char, i) for i, char in enumerate(input_chars)])
target_token_id = dict([(char, i) for i, char in enumerate(target_chars)])

encoder_in_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')

decoder_in_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_in_data[i, t, input_token_id[char]] = 1.
    for t, char in enumerate(target_text):
        decoder_in_data[i, t, target_token_id[char]] = 1.
        if t > 0:
            decoder_target_data[i, t - 1, target_token_id[char]] = 1.
    
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

# Use the encoder states as the initial state of the decoder
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Final Model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

#Model data Shape
print("encoder_in_data shape:",encoder_in_data.shape)
print("decoder_in_data shape:",decoder_in_data.shape)
print("decoder_target_data shape:",decoder_target_data.shape)


Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_5 (InputLayer)           [(None, None, 18)]   0           []                               
                                                                                                  
 input_6 (InputLayer)           [(None, None, 18)]   0           []                               
                                                                                                  
 lstm_2 (LSTM)                  [(None, 256),        281600      ['input_5[0][0]']                
                                 (None, 256),                                                     
                                 (None, 256)]                                                     
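
The 281600 parameters reported for `lstm_2` can be checked by hand. A standard LSTM layer has four gate blocks, each with an input kernel, a recurrent kernel and a bias, which gives 4 * (units * (input_dim + units) + units) weights. The helper names below are illustrative and not part of the notebook.

```python
def lstm_param_count(input_dim, units):
    """Trainable weights in a standard LSTM layer with bias terms."""
    gates = 4  # input, forget, cell candidate, output
    return gates * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """Trainable weights in a fully connected layer with bias terms."""
    return input_dim * units + units

print(lstm_param_count(18, 256))   # 281600, matching the lstm_2 row
print(dense_param_count(256, 18))  # 4626
```

The decoder LSTM has the same 18-dimensional one-hot input and 256 units, so the same formula applies to it; its row and the dense layer's row are cut off in the summary shown here.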
                                                                                            

We now train the model and test it on a finite number of sequences.

In [13]:

from keras.optimizers import Adam

# 'learning_rate' replaces the deprecated 'lr' argument name
model.compile(optimizer=Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, decay=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([encoder_in_data, decoder_in_data], decoder_target_data, batch_size=batch_size, epochs=50, validation_split=0.2)

encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

reverse_input_char_index = dict((i, char) for char, i in input_token_id.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_id.items())

def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    #Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    #Get the first character of target sequence with the start character.
    target_seq[0, 0, target_token_id['\t']] = 1.

    #Sampling loop for a batch of sequences
    #(to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        #Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        #Exit condition: either hit max length
        #or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        #Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        #Update states
        states_value = [h, c]

    return decoded_sentence



for seq_index in range(10):
    input_seq = encoder_in_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)



Epoch 1/50
...
Epoch 50/50
-
Input sentence: cos(x)
Decoded sentence: x**4/                                            
-
Input sentence: cos(x)
Decoded sentence: x**4/                                            
-
Input sentence: sin(x)
Decoded sentence: x**4/                                            
-
Input sentence: cos(x)
Decoded sentence: x**4/                                            
-
Input sentence: exp(x)
Decoded sent
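
The loop inside `decode_sequence` is a greedy decoder: at every step it takes the highest-scoring character, feeds it back in, and stops on the end marker or a length cap. The sketch below isolates that control flow with a hypothetical `step` callable standing in for `decoder_model.predict`; the toy lookup table is invented for illustration. Unlike the notebook version, it does not append the end marker to the result.

```python
def greedy_decode(step, initial_state, start="\t", stop="\n", max_length=48):
    """Pick the best-scoring next character until `stop` appears or the cap is hit."""
    previous, state, decoded = start, initial_state, ""
    while len(decoded) < max_length:
        scores, state = step(previous, state)
        best = max(scores, key=scores.get)  # argmax over the vocabulary
        if best == stop:
            break
        decoded += best
        previous = best
    return decoded

# Hypothetical scores that depend only on the previous character.
TOY_TABLE = {
    "\t": {"x": 0.9, "1": 0.1},
    "x": {"+": 0.6, "\n": 0.4},
    "+": {"1": 1.0},
    "1": {"\n": 1.0},
}

def toy_step(previous, state):
    # The state just counts steps here; a real LSTM would carry (h, c).
    return TOY_TABLE[previous], state + 1

def always_space(previous, state):
    # A degenerate scorer that never prefers the end marker.
    return {" ": 0.7, "\n": 0.3}, state

print(repr(greedy_decode(toy_step, 0)))                    # 'x+1'
print(repr(greedy_decode(always_space, 0, max_length=5)))  # '     '
```

A decoder that keeps ranking the space character highest, as in the output above, only terminates through the length cap, which is why the decoded strings end in a long run of spaces.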

As we can see, the model performs poorly. The following steps could significantly improve its performance:

1. Increasing the size and range of the dataset - Currently it contains only single pairings of the main functions: cos(x), sin(x), tan(x) and the exponential. By permuting over every combination and up to a power of 6, we can create a much larger and improved dataset of up to a thousand entries. This will give our model much more data from which to learn the patterns.

2. Using a different encoding strategy for tokenisation - Currently a simple character encoding converts the dataset to numerical form. Specific encodings (for instance Byte Pair Encoding) have often been found to yield better results for certain sequence-to-sequence learning tasks.

3. Changing model parameters - This is the more obvious one, but adapting the model to this specific use case should yield much better results than using it out of the box.
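
As an illustration of the second point, Byte Pair Encoding would require a learned merge table. A simpler step in the same direction, shown here instead of BPE, is to split expressions into symbol-level tokens so that `cos`, `**` and multi-digit numbers each become a single token. This is a sketch under that assumption; `TOKEN_PATTERN` and `tokenize` are illustrative names.

```python
import re

# Order matters: '**' must be tried before the single-character operators.
TOKEN_PATTERN = re.compile(r"\*\*|[A-Za-z_]+|\d+|[-+*/()]")

def tokenize(expression):
    """Split a SymPy-style expression string into symbol-level tokens."""
    compact = expression.replace(" ", "")
    tokens = TOKEN_PATTERN.findall(compact)
    # findall silently skips unknown characters, so check nothing was dropped.
    if "".join(tokens) != compact:
        raise ValueError(f"unrecognised characters in {expression!r}")
    return tokens

print(tokenize("x**4/24 - x**2/2 + 1"))
print(tokenize("exp(2*x)*cos(x)"))
```

Sequences tokenised this way are shorter than their character-level versions, at the cost of a vocabulary that grows with every new function name or number.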

