# Training the first model on simulated CYP2D6 diplotypes

This notebook is supplementary material to the project here, which aims to re-implement the Hubble.2d6 tool to predict the function of CYP2D6 star alleles.

In this notebook, the first model is trained on the simulated CYP2D6 diplotype data provided with the original paper. Its weights are then transferred to the final model, which is fine-tuned to predict CYP2D6 phenotypes.

Please keep in mind that the encoding of the provided data is incomplete because of limits on the tools and resources available to me. More information on the actual implementation can be found in the final report.


## Getting ready

**Acknowledgements**: Pre-computed annotation embeddings used are from the original Hubble.2d6 repo: https://github.com/gregmcinnes/Hubble2D6/tree/master/data.

In [1]:
import os
import tensorflow as tf
import numpy as np

In [2]:
!git clone https://github.com/Locrian24/seng474-term-project.git
!cd seng474-term-project/ && git pull

Cloning into 'seng474-term-project'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (87/87), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 87 (delta 33), reused 69 (delta 18), pack-reused 0
Unpacking objects: 100% (87/87), done.
Already up to date.


In [3]:
import sys
sys.path.insert(0, '/content/seng474-term-project')

## GPU Runtime

Before running this notebook, make sure your hardware accelerator is a GPU by selecting **GPU** from the settings: **Runtime -> Change runtime type -> Hardware accelerator -> GPU**

In [4]:
device_name = tf.test.gpu_device_name()
if device_name == '':
  raise SystemExit("Dataset generator is built to run on the GPU runtime. Please switch to GPU by selecting GPU from Runtime -> Change runtime type")

## Retrieving the data

These functions are responsible for loading the simulated data onto disk and getting the batch files ready for processing.

In [5]:
import pathlib
from tensorflow.keras.utils import to_categorical

def get_batch_files(training_count, test_count):
  """
  Pull simulated data from zenodo and split the batch files into training and testing sets
  """

  file_root = tf.keras.utils.get_file(
      'simulated_cyp2d6_diplotypes',
      'https://zenodo.org/record/3951095/files/simulated_cyp2d6_diplotypes.tar.gz',
      untar=True
  )
  file_root = pathlib.Path(file_root)
  filenames = []
  for f in file_root.glob("*"):
    filenames.append(f)

  _filenames = np.array([f.name.split('.')[0] for f in filenames])
  batch_names = np.unique(_filenames)
  filenames = np.array([str(f.absolute()) for f in filenames])
  training_batches, test_batches = [], []

  for i, b in enumerate(batch_names):
    if i >= test_count + training_count:
      break
      
    if i < training_count:
      training_batches.append(filenames[_filenames == b])
    else:
      test_batches.append(filenames[_filenames == b])

  return training_batches, test_batches

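def hot_encode_float(y, num_classes=5):
  """
  One-hot encodes the activity scores within the label vector.
  This is ultimately a classification problem, so each distinct score is treated as a class.
  Assumes every call sees all `num_classes` distinct scores; otherwise the
  class indices would shift between batches.
  """

  values = np.unique(y)  # sorted distinct activity scores
  # Map each score to its class index (its position in the sorted list)
  conversion_dict = dict(zip(values, range(len(values))))
  # One row per class, always num_classes columns wide
  encoded_classes = to_categorical(np.arange(len(values)), num_classes=num_classes)
  encoded_y = np.array([encoded_classes[conversion_dict[i]] for i in y])

  return encoded_y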
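
`hot_encode_float` derives its classes from the values present in `y`, so the column order depends on which scores appear in a given batch file. As a hedged alternative, the sketch below encodes against a fixed, sorted class list. The values in `ACTIVITY_SCORES` and the helper name `one_hot_fixed` are placeholders chosen for illustration and are not taken from the dataset or the project code.

```python
import numpy as np

# Hypothetical activity-score classes; replace with the scores used in the label files.
ACTIVITY_SCORES = np.array([0.0, 0.5, 1.0, 1.5, 2.0])

def one_hot_fixed(y, classes=ACTIVITY_SCORES):
  """One-hot encode y against a fixed class list so columns never shift between batches."""
  y = np.asarray(y, dtype=float)
  # Position of each label within the sorted class list
  idx = np.searchsorted(classes, y)
  idx = np.clip(idx, 0, len(classes) - 1)
  # Reject labels that are not in the class list
  if not np.allclose(classes[idx], y):
    raise ValueError("label outside the known class list")
  return np.eye(len(classes), dtype=np.float32)[idx]

encoded = one_hot_fixed([2.0, 0.0, 1.0])
```

With a fixed list, a batch that happens to lack one score still produces five columns in the same order.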

## Pre-processing

The batch files available from the original paper are in VCF format and must be converted to a one-hot encoded, annotated format before being passed to the model.

Encode2Seq compares the variants in each VCF file to the reference sequence and updates the sequences of each diplotype. It then uses the pre-computed annotation embeddings to one-hot encode and annotate the samples, and matches them with their labels.

**Acknowledgements**: Encode2Seq is forked from the method used within the Hubble.2d6 tool. It was expanded to handle diplotype encodings since the base method is for single haplotypes only.

***Important***: Since Encode2Seq only has access to pre-computed embedding data, some variants within the simulated VCF files do not have corresponding embeddings and so have empty annotation vectors. In total, 317 of the 1406 variants do not have corresponding pre-computed annotation data.
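
To illustrate the shape of the encoded input, here is a toy sketch that combines a one-hot nucleotide encoding with per-position annotation columns. It is not the `Encode2Seq` implementation: the 4 + 9 column split, the `encode_sequence` helper, and the annotation values are assumptions made for illustration. Only the total width of 13 comes from the model's input shape. Positions without an embedding keep an all-zero annotation vector, which is how the limitation described above would show up in the data.

```python
import numpy as np

BASES = "ACGT"
N_ANNOTATIONS = 9  # assumed: 13 input channels minus 4 one-hot base channels

def encode_sequence(seq, annotations):
  """Return a (len(seq), 13) array: one-hot bases followed by annotation columns.

  annotations maps a position to a length-9 vector; missing positions stay zero.
  """
  encoded = np.zeros((len(seq), len(BASES) + N_ANNOTATIONS), dtype=np.float32)
  for pos, base in enumerate(seq):
    if base in BASES:
      encoded[pos, BASES.index(base)] = 1.0
    if pos in annotations:
      encoded[pos, len(BASES):] = annotations[pos]
  return encoded

# Position 1 has a made-up annotation vector; the other positions have none.
toy_annotations = {1: np.ones(N_ANNOTATIONS, dtype=np.float32)}
x = encode_sequence("ACGT", toy_annotations)
```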

In [6]:
from encode_to_seq import Encode2Seq

ANNOTATIONS = '/content/seng474-term-project/data/gvcf2seq.annotation_embeddings.csv'
EMBEDDINGS = '/content/seng474-term-project/data/embeddings.txt'
REF = '/content/seng474-term-project/data/ref.seq'

def generate_data(batches):
  """
  Generator that encodes one batch file at a time and yields its samples one by one.
  Storing all of the encodings would be beyond my resources, so they are produced inside the generator and passed directly to the model.
  """

  for filenames in batches:
    vcf = 0 if 'vcf' == filenames[0].decode('utf-8').split('.')[-1] else 1
    labels = 1 - vcf
    encoding = Encode2Seq(vcf=filenames[vcf].decode('utf-8'), labels=filenames[labels].decode('utf-8'), embedding_file=EMBEDDINGS, annotation_file=ANNOTATIONS, ref_seq=REF)
    y = hot_encode_float(encoding.y.flatten())
    for i in range(encoding.X.shape[0]):
      yield encoding.X[i], y[i]
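
The generator receives each batch as a pair of file names in no guaranteed order, and `tf.data.Dataset.from_generator` hands the `args` over as byte strings, which is why the code above calls `.decode('utf-8')`. The pairing logic can be sketched on its own as follows. `split_pair` and the file names are hypothetical, and the real label files may use a different extension.

```python
def split_pair(filenames):
  """Return (vcf_path, labels_path) from a two-element sequence in either order."""
  # Accept both bytes (as passed by tf.data) and plain strings
  names = [f.decode("utf-8") if isinstance(f, bytes) else f for f in filenames]
  first, second = names
  if first.rsplit(".", 1)[-1] == "vcf":
    return first, second
  return second, first

# Hypothetical file names for one batch
pair = split_pair([b"batch_1.labels.csv", b"batch_1.vcf"])
```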

### Building the model

In [7]:
# Convolution layers based on final model from paper:
# https://github.com/gregmcinnes/Hubble2D6/blob/master/data/models/hubble2d6_0.json

def get_model():
  return tf.keras.Sequential([
    tf.keras.layers.Conv1D(70, kernel_size=19, strides=5,input_shape=(14868, 13), batch_input_shape=(None, 14868, 13), activation=tf.keras.activations.linear, kernel_initializer=tf.keras.initializers.VarianceScaling(mode='fan_avg', distribution='uniform'), name = "conv1d_1"),
    tf.keras.layers.BatchNormalization(name="batch_1"),
    tf.keras.layers.ReLU(name="relu_1"),
    tf.keras.layers.MaxPooling1D(pool_size=3, strides=3, name="maxpooling_1"),
    tf.keras.layers.Conv1D(46, kernel_size=11, strides=5, activation=tf.keras.activations.linear, kernel_initializer=tf.keras.initializers.VarianceScaling(mode='fan_avg', distribution='uniform'), name = "conv1d_2"),
    tf.keras.layers.BatchNormalization(name="batch_2"),
    tf.keras.layers.ReLU(name="relu_2"),
    tf.keras.layers.MaxPooling1D(pool_size=4, strides=4, name="maxpooling_2"),
    tf.keras.layers.Conv1D(46, kernel_size=7, strides=5, activation=tf.keras.activations.linear, kernel_initializer=tf.keras.initializers.VarianceScaling(mode='fan_avg', distribution='uniform'), name = "conv1d_3"),
    tf.keras.layers.BatchNormalization(name="batch_3"),
    tf.keras.layers.ReLU(name="relu_3"),
    tf.keras.layers.MaxPooling1D(pool_size=4, strides=4, name="maxpooling_3"),
    tf.keras.layers.Flatten(name="flatten_3"),
    tf.keras.layers.Dense(32, activation=tf.keras.activations.relu, kernel_initializer=tf.keras.initializers.VarianceScaling(mode='fan_avg', distribution='uniform'), name="dense_4"),
    tf.keras.layers.Dropout(rate=0.03, name="dropout_4"),
    tf.keras.layers.Dense(5, activation='softmax', kernel_initializer=tf.keras.initializers.VarianceScaling(mode='fan_avg', distribution='uniform'), name="dense_5"),
  ])
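
As a sanity check on the architecture, the sequence length after each layer can be worked out by hand. The sketch below assumes the Keras default of `padding='valid'` for both `Conv1D` and `MaxPooling1D`, where the output length is `(length - kernel) // stride + 1`. The helper `out_len` is introduced here for illustration only.

```python
def out_len(length, kernel, stride):
  """Output length of a 1-D conv or pooling layer with 'valid' padding."""
  return (length - kernel) // stride + 1

# (name, kernel or pool size, stride) in the order used by get_model()
layers = [
  ("conv1d_1", 19, 5), ("maxpooling_1", 3, 3),
  ("conv1d_2", 11, 5), ("maxpooling_2", 4, 4),
  ("conv1d_3", 7, 5), ("maxpooling_3", 4, 4),
]

length = 14868
lengths = []
for name, kernel, stride in layers:
  length = out_len(length, kernel, stride)
  lengths.append(length)
  print(f"{name}: {length}")

flattened = length * 46  # conv1d_3 has 46 filters
print("flatten_3:", flattened)
```

Under these assumptions the 14,868 positions shrink to 2 before flattening, so `dense_4` sees 92 features.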

In [8]:
with tf.device('/device:GPU:0'):
  model = get_model()
  adam = tf.keras.optimizers.Adam(learning_rate=0.001)
  model.compile(optimizer=adam,
                loss=tf.keras.losses.CategoricalCrossentropy(), 
                metrics=['accuracy'])

### Preparing the batch files

In [9]:
batch_size = 100
epochs = 5
steps_per_epoch = 50000 // batch_size
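
The constants above have to agree with the dataset that is built later, otherwise `fit` would run out of data or leave some unused. A quick check of the arithmetic, using the 500 samples per batch file stated in this notebook and the `repeat(count=5)` applied to the training dataset:

```python
samples_per_file = 500   # stated in the notebook
training_files = 100     # first argument passed to get_batch_files
batch_size = 100
epochs = 5

training_samples = samples_per_file * training_files
steps_per_epoch = training_samples // batch_size
total_steps = steps_per_epoch * epochs

# repeat(count=5) replays the training samples five times before batching
batches_available = (training_samples * 5) // batch_size
```

Here `total_steps` and `batches_available` are equal, so the five epochs consume the repeated dataset exactly.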

In [10]:
# Provided training data contains 250,000 samples (500 samples per batch)
# Selecting 50,000 samples for training, and 10,000 for testing as per the paper specifications

training_batches, test_batches = get_batch_files(100, 20)

Downloading data from https://zenodo.org/record/3951095/files/simulated_cyp2d6_diplotypes.tar.gz


In [11]:
train_dataset = tf.data.Dataset.from_generator(generate_data, args=[training_batches], output_types=(tf.float32, tf.float32), output_shapes=((14868, 13), (5,)))
test_dataset = tf.data.Dataset.from_generator(generate_data, args=[test_batches], output_types=(tf.float32, tf.float32), output_shapes=((14868, 13), (5,)))

train_dataset = train_dataset.shuffle(500).repeat(count=5).batch(batch_size)
test_dataset = test_dataset.batch(500)
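
Because `generate_data` yields samples file by file, `shuffle(500)` can only mix a sample with others that arrive within about 500 positions of it. The following is a simplified pure-Python model of buffered shuffling, written to illustrate that effect. It is not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
  """Yield items in a locally shuffled order using a fixed-size buffer."""
  rng = random.Random(seed)
  buffer = []
  for item in items:
    buffer.append(item)
    if len(buffer) == buffer_size:
      # Buffer is full: emit a random element and make room for the next one
      yield buffer.pop(rng.randrange(len(buffer)))
  # Input exhausted: drain what is left in random order
  while buffer:
    yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(buffered_shuffle(range(2000), buffer_size=500))
# How far ahead of its output position any element was drawn from
max_lookahead = max(src - out for out, src in enumerate(shuffled))
```

In this model no element is emitted more than `buffer_size - 1` positions early, so a larger buffer would be needed to mix samples from distant batch files.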

### Training the model

Training the model takes around 25-30 minutes running on a GPU. 

The cell below trains the model from scratch. For convenience, you can skip training by commenting out the `fit` call and uncommenting the `load_weights` line, which loads the pre-trained weights instead.

In [12]:
# model.load_weights('/content/seng474-term-project/step_1/weights.h5')
model.fit(train_dataset, epochs=epochs, steps_per_epoch=steps_per_epoch)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f5f18204d50>

In [13]:
# model.save_weights('weights.h5')

# model_json = model.to_json()
# with open("model.json", "w") as f:
#   f.write(model_json)

### Evaluation

In [14]:
model.evaluate(test_dataset, steps=20)



[0.5240103006362915, 0.8302000164985657]

As the output shows, the model reaches about 83% accuracy on the test set, with a loss of about 0.52.
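
`model.evaluate` returns the loss followed by the compiled metrics, here categorical cross-entropy and accuracy. For readers who want to see how those two numbers relate to the softmax outputs, here is a small NumPy sketch. The predictions are made up, and `accuracy_and_loss` is an illustrative helper rather than part of the project.

```python
import numpy as np

def accuracy_and_loss(y_true, y_prob, eps=1e-7):
  """Categorical accuracy and mean cross-entropy for one-hot labels and softmax outputs."""
  y_true = np.asarray(y_true, dtype=float)
  # Clip to avoid log(0)
  y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
  accuracy = np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1))
  loss = -np.mean(np.sum(y_true * np.log(y_prob), axis=1))
  return accuracy, loss

# Made-up predictions for three samples over five classes
labels = np.eye(5)[[0, 2, 4]]
probs = np.array([
  [0.7, 0.1, 0.1, 0.05, 0.05],  # correct
  [0.1, 0.2, 0.6, 0.05, 0.05],  # correct
  [0.5, 0.1, 0.1, 0.1, 0.2],    # wrong: predicts class 0, truth is class 4
])
acc, loss = accuracy_and_loss(labels, probs)
```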

Note that the original implementation of Hubble.2d6 attains an accuracy of 100% on its test set. This discrepancy could result from many factors, including the training time and procedure, as well as a more robust training set or embeddings.