# Sequence-to-sequence prediction for YAB and St James' contact data

### MFK, AMW, MLG

This notebook trains a small recurrent model on sequences of surface contacts and uses it to predict the next contact. It proceeds as follows:

- The preprocess_data function takes the data and the sequence length. It builds a mapping from surfaces to integers, encodes the data with that mapping, slides a window over the encoded data to produce input sequences and next-contact labels, one-hot encodes the labels, pads the input sequences to a common length, and splits inputs and labels into training and test sets. It returns the training and test inputs and labels together with the surface-to-integer mapping.
- The build_and_train_model function takes the training and test data, the surface-to-integer mapping, the sequence length, and the number of epochs. It builds a model with an embedding layer, an LSTM layer, and a dense softmax layer, compiles it with the categorical cross-entropy loss and the Adam optimizer, and trains it for the specified number of epochs, using the test set as validation data. It returns the trained model and the surface-to-integer mapping.
- The toy dataset is a list of three sequences of surface contacts.
- The preprocess_data function is called on the toy dataset with a sequence length of 11.
- The build_and_train_model function is called on the preprocessed data with a sequence length of 11 and the default number of epochs (100).

Next, we'll define a function to preprocess the data and create training and test datasets:





In [33]:
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def preprocess_data(data, sequence_length):
    # Flatten the list of sequences into a single string
    # (windows can therefore span the boundary between two original sequences)
    data = ''.join(data)
    
    # Create a mapping from surfaces to integers
    surfaces = sorted(set(data))
    surface_to_int = {c: i for i, c in enumerate(surfaces)}
    
    # Convert the data to integers using the surface-to-integer mapping
    data_int = [surface_to_int[c] for c in data]
    
    # Split the data into input sequences and next-contact labels
    inputs = []
    labels = []
    for i in range(len(data_int) - sequence_length):
        inputs.append(data_int[i:i + sequence_length])
        labels.append(data_int[i + sequence_length])
        
    # One-hot encode the labels; num_classes keeps the width fixed even if
    # some surface never occurs as a label
    labels = to_categorical(labels, num_classes=len(surface_to_int))
    
    # Pad the input sequences with zeros to make them all the same length.
    # Every window already has sequence_length entries, so this is a no-op here.
    # Note that 0 is also the index of the first surface in surface_to_int.
    inputs = pad_sequences(inputs, maxlen=sequence_length, padding='pre', value=0)
    
    # Split the data into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(inputs, labels, test_size=0.2, random_state=42)
    
    return x_train, x_test, y_train, y_test, surface_to_int
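
To make the windowing step concrete, here is a minimal sketch that repeats only the encoding and sliding-window part of preprocess_data in plain Python, without padding, one-hot encoding, or the train/test split. The helper name make_windows is an assumption introduced for illustration and is not part of the notebook.

```python
def make_windows(sequences, sequence_length):
    """Encode the joined sequences and cut them into (window, next item) pairs.

    Hypothetical helper mirroring the windowing logic of preprocess_data.
    """
    text = ''.join(sequences)
    surfaces = sorted(set(text))
    surface_to_int = {c: i for i, c in enumerate(surfaces)}
    encoded = [surface_to_int[c] for c in text]

    inputs, labels = [], []
    for i in range(len(encoded) - sequence_length):
        inputs.append(encoded[i:i + sequence_length])   # the window
        labels.append(encoded[i + sequence_length])     # the contact that follows it
    return inputs, labels, surface_to_int

toy = ['ABGFGE', 'GBESGSGS', 'EEGEGEGEBAE']
inputs, labels, surface_to_int = make_windows(toy, sequence_length=11)

print(len(inputs))           # 25 joined characters minus 11 gives 14 windows
print(surface_to_int)        # {'A': 0, 'B': 1, 'E': 2, 'F': 3, 'G': 4, 'S': 5}
print(inputs[0], labels[0])  # [0, 1, 4, 3, 4, 2, 4, 1, 2, 5, 4] 5
```

With 14 windows and test_size=0.2, train_test_split holds out 3 windows and keeps 11 for training, which is a very small amount of data for an LSTM.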



Now we can define a function to build and train the RNN model:

In [12]:
import tensorflow
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

def build_and_train_model(x_train, y_train, x_test, y_test, surface_to_int, sequence_length, epochs=100):
    # Build the model: embedding -> LSTM -> softmax over the surface classes
    model = Sequential()
    model.add(Embedding(input_dim=len(surface_to_int), output_dim=10, input_length=sequence_length))
    model.add(LSTM(units=50))
    model.add(Dense(units=y_train.shape[1], activation='softmax'))

    # Run eagerly (slower, but easier to debug)
    tensorflow.config.run_functions_eagerly(True)
    
    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'], run_eagerly=True)
    
    # Train the model, using the test set as validation data
    model.fit(x_train, y_train, epochs=epochs, validation_data=(x_test, y_test))
    
    return model, surface_to_int
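
As a rough illustration of what the softmax output layer and the categorical cross-entropy loss compute for a single one-hot label, the following NumPy sketch may help. The function names and the clipping constant eps are assumptions chosen for this example; this is not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an illustrative choice
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Six surface classes, all logits equal: the model has no preference yet
logits = np.zeros((1, 6))
y_true = np.array([[0., 0., 1., 0., 0., 0.]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs)  # each class gets probability 1/6
print(loss)   # equals ln(6), about 1.79
```

A loss near ln(6) therefore corresponds to uniform guessing over the six surfaces, which gives a reference point when reading the training log.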


Finally, we can put everything together and use the model to make predictions on a new sequence of surface contacts:

In [34]:
# Toy dataset of surface contacts
data = ['ABGFGE', 'GBESGSGS', 'EEGEGEGEBAE']

# Preprocess the data
x_train, x_test, y_train, y_test, surface_to_int = preprocess_data(data, sequence_length=11)


In [35]:
x_train, x_test, y_train, y_test, surface_to_int

(array([[4, 5, 2, 2, 4, 2, 4, 2, 4, 2, 1],
        [2, 4, 1, 2, 5, 4, 5, 4, 5, 2, 2],
        [2, 5, 4, 5, 4, 5, 2, 2, 4, 2, 4],
        [4, 3, 4, 2, 4, 1, 2, 5, 4, 5, 4],
        [1, 4, 3, 4, 2, 4, 1, 2, 5, 4, 5],
        [5, 2, 2, 4, 2, 4, 2, 4, 2, 1, 0],
        [4, 2, 4, 1, 2, 5, 4, 5, 4, 5, 2],
        [1, 2, 5, 4, 5, 4, 5, 2, 2, 4, 2],
        [4, 5, 4, 5, 2, 2, 4, 2, 4, 2, 4],
        [3, 4, 2, 4, 1, 2, 5, 4, 5, 4, 5],
        [4, 1, 2, 5, 4, 5, 4, 5, 2, 2, 4]], dtype=int32),
 array([[5, 4, 5, 4, 5, 2, 2, 4, 2, 4, 2],
        [5, 4, 5, 2, 2, 4, 2, 4, 2, 4, 2],
        [0, 1, 4, 3, 4, 2, 4, 1, 2, 5, 4]], dtype=int32),
 array([[1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.,
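
The integer arrays above are easier to check when mapped back to surface labels. The sketch below inverts the mapping and decodes the first training row and its one-hot label; the mapping is written out by hand from the six distinct characters of the toy dataset, and the helper names decode_sequence and decode_label are assumptions for illustration.

```python
# Mapping produced by sorting the distinct characters of the toy dataset
surface_to_int = {'A': 0, 'B': 1, 'E': 2, 'F': 3, 'G': 4, 'S': 5}
int_to_surface = {i: c for c, i in surface_to_int.items()}

def decode_sequence(row):
    # Turn a row of integer codes back into a string of surface labels
    return ''.join(int_to_surface[int(i)] for i in row)

def decode_label(one_hot):
    # The position of the largest entry is the encoded class
    index = max(range(len(one_hot)), key=lambda k: one_hot[k])
    return int_to_surface[index]

first_row = [4, 5, 2, 2, 4, 2, 4, 2, 4, 2, 1]
first_label = [1., 0., 0., 0., 0., 0.]

print(decode_sequence(first_row), '->', decode_label(first_label))  # GSEEGEGEGEB -> A
```

The decoded window GSEEGEGEGEB starts in the second toy sequence and ends in the third, which shows that windows cross sequence boundaries after the sequences are joined.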

In [36]:
# Build and train the model
model, surface_to_int = build_and_train_model(x_train, y_train, x_test, y_test, surface_to_int, sequence_length=11)

Epoch 1/100
Epoch 2/100
...
Epoch 78/100
[training output truncated]

In [37]:
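start_surface = 'A'
sequence_length = 11  # must match the sequence length used for training

# Reverse mapping to turn predicted integers back into surface labels
int_to_surface = {i: c for c, i in surface_to_int.items()}

# Initialize the context with a single surface contact
context = [surface_to_int[start_surface]]

# Initialize an empty list to store the predicted surface contacts
predicted_surfaces = []

# Set the number of surface contacts to predict
num_predictions = 10

# Iterate over the number of predictions
for _ in range(num_predictions):
    # Pre-pad the most recent contacts to the training length, as in preprocess_data
    # (the padding value 0 is also the index of the first surface)
    input_sequence = pad_sequences([context[-sequence_length:]], maxlen=sequence_length, padding='pre', value=0)
    
    # Use the model to predict the probabilities of the next surface contact
    probabilities = model.predict(input_sequence, verbose=0)[0]
    
    # Convert the predicted probabilities to the most likely class index
    prediction = int(np.argmax(probabilities))
    
    # Append the predicted surface label to the list of predicted surfaces
    predicted_surfaces.append(int_to_surface[prediction])
    
    # Extend the context so the next step sees the whole history
    context.append(prediction)

print(''.join(predicted_surfaces))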








