<img style="float: left;" src="https://cdn.pixabay.com/photo/2016/12/07/09/45/dna-1889085__340.jpg" width=10%> <h1> Application of AI to Discover Novel Binding of Small Molecules </h1>

---------
### Sample Dataset for Testing Purposes

##### Here we create a sample dataset for two reasons:
- to get a better understanding of the structure of the data
- to test sample code for validity

##### Structure of sample dataset:
1. A dataframe consisting of 50 genes and 1020 profiles [50 x 1020]: 1000 drug profiles plus 20 control columns. The code below leaves the control columns commented out, so the generated dataframe is [50 x 1000].
2. Column names combine drug, replicate, time, concentration, probe_location and cell type. For this project only drug and replicate matter for training, so each column name is structured as
"*drug + replicate id + unique characters that represent time, concentration, probe_location and cell type*"
3. 20 columns hold control genes, or 'control probes'. These columns are labelled control_x, where x is a number from 1 to 20.
4. The dataset covers 25 drugs, each with 4 replicates and 10 combinations of time, concentration, probe_location and cell type (25 x 4 x 10 = 1000 profiles).

| Feature | Quantity | Represented By |
| ----------- | ----------- | ----------- |
| Drug | 25 | Letters A-Y |
| Replicate | 4 | Numbers 1-4 |
| Other features | 10 | Random lowercase string of length 3 |

***R_3_xcv*** represents a profile of drug 'R', replicate 3, with other features corresponding to 'xcv'.
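
Because only the drug and replicate fields matter for training, it can help to split a column name back into its parts. The following is a minimal sketch that assumes the underscore-separated layout described above; `parse_profile_name` is a hypothetical helper, not part of the original notebook, and it does not handle the `control_x` columns.

```python
def parse_profile_name(name):
    """Split a column name such as 'R_3_xcv' into its three fields."""
    drug, replicate, other = name.split("_")
    return {"drug": drug, "replicate": int(replicate), "other": other}


parsed = parse_profile_name("R_3_xcv")
print(parsed)
```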

##### Construction of Sample Dataset

In [1]:
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
genes = ['gene'+str(a) for a in range(50)]
drugs = [chr(a) for a in range(65, 90)]       # 'A' to 'Y'
replicates = [str(a) for a in range(1, 5)]    # '1' to '4'
other_features = set()

# Draw random 3-letter lowercase strings until 10 distinct ones exist
while len(other_features) < 10:
    rand_string = "".join(chr(random.randrange(26) + 97) for _ in range(3))
    other_features.add(rand_string)

In [3]:
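# One column per (drug, replicate, other features) combination: 25 x 4 x 10 = 1000
columns = ["_".join([a,b,c]) for a in drugs for b in replicates for c in other_features]
# Control columns are currently disabled, so the frame has 1000 columns rather than 1020
# columns = ["control_"+str(a+1) for a in range(20)] + columns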

In [4]:
# Uniform random values in [-1, 1): one row per gene, one column per profile
data = pd.DataFrame(2*np.random.rand(50, len(columns))-1, index=genes, columns=columns)
data.shape

(50, 1000)
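
The structure list describes 20 control columns, while the frame built above has only the 1000 drug profiles. One possible way to reach the documented [50 x 1020] shape is sketched below. It uses placeholder profile names (`p0` to `p999`) and random values for the controls, both of which are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

genes = ["gene" + str(a) for a in range(50)]

# Stand-in for the 1000 drug-profile columns (placeholder names and values)
profile_cols = ["p" + str(i) for i in range(1000)]
profiles = pd.DataFrame(2 * np.random.rand(50, 1000) - 1,
                        index=genes, columns=profile_cols)

# Control columns are labelled control_1 ... control_20
control_cols = ["control_" + str(a + 1) for a in range(20)]
controls = pd.DataFrame(2 * np.random.rand(50, 20) - 1,
                        index=genes, columns=control_cols)

# Controls go first, matching the order of the commented-out line
full = pd.concat([controls, profiles], axis=1)
print(full.shape)  # (50, 1020)
```

Any class labels built by position would then need to skip the first 20 columns.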

In [5]:
data.head()

Unnamed: 0,A_1_prm,A_1_unv,A_1_wgm,A_1_hng,A_1_cpi,A_1_ngn,A_1_jzs,A_1_hlg,A_1_hfj,A_1_vzy,...,Y_4_prm,Y_4_unv,Y_4_wgm,Y_4_hng,Y_4_cpi,Y_4_ngn,Y_4_jzs,Y_4_hlg,Y_4_hfj,Y_4_vzy
gene0,-0.837681,-0.848637,-0.087142,-0.405961,0.004049,0.407791,-0.683858,-0.079311,0.680765,0.530237,...,-0.562764,-0.210179,0.630523,-0.117293,0.71951,0.19804,-0.498323,0.506826,0.410857,0.16031
gene1,0.134055,-0.035813,0.241758,0.232256,0.98338,-0.139263,0.865222,-0.829397,0.985169,0.623781,...,-0.746222,0.767714,0.262454,-0.939741,0.975793,-0.051887,0.737123,0.034843,0.282754,-0.498887
gene2,-0.069256,-0.966952,0.321458,-0.780353,-0.726598,0.129363,-0.122175,-0.601217,0.811797,0.014393,...,-0.747781,-0.376094,-0.764969,0.808239,-0.420177,0.626916,-0.924948,0.805889,0.881259,-0.03414
gene3,0.879256,0.255205,0.159366,-0.677889,-0.011056,-0.030545,-0.559349,-0.948835,0.309141,0.662506,...,0.519676,0.542734,0.782838,-0.158183,-0.79498,-0.855891,-0.311861,0.613019,0.295519,0.366462
gene4,0.329362,0.060296,0.840975,-0.858893,0.976929,0.764979,-0.438007,0.928542,-0.998299,-0.968178,...,-0.512103,0.450282,-0.123677,-0.209013,-0.45333,-0.886884,0.089143,-0.08296,0.040806,-0.392981


##### Classifying Columns
Each profile needs a class label. Labels can be assigned at the biological replicate level or at the perturbagen (drug) level; we create both.

In [6]:
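# Perturbagen level: columns are ordered by drug, 40 profiles (4 replicates x 10 conditions) per drug
perturbagen_class = [a // 40 for a in range(1000)]
# Replicate level: the 4 replicates sharing a drug and an 'other features' combination get one label
replicate_class = [10*a+c for a in range(25) for b in range(4) for c in range(10)]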
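
Labels can also be derived from the column names themselves, which avoids relying on column order. The sketch below rebuilds names in the same nesting order as the notebook, with placeholder three-character codes (`f00` to `f09`) standing in for the random strings. It assumes the replicate-level label is meant to group the four replicates of one drug and condition, as the index arithmetic above does.

```python
drugs = [chr(a) for a in range(65, 90)]
replicates = [str(a) for a in range(1, 5)]
others = ["f{:02d}".format(i) for i in range(10)]  # placeholder codes
columns = ["_".join([a, b, c]) for a in drugs for b in replicates for c in others]

drug_index = {d: i for i, d in enumerate(drugs)}
other_index = {o: i for i, o in enumerate(others)}

perturbagen_labels = []
replicate_labels = []
for name in columns:
    drug, _replicate, other = name.split("_")
    # One label per drug: 25 classes of 40 profiles
    perturbagen_labels.append(drug_index[drug])
    # One label per (drug, condition): 250 classes of 4 replicates
    replicate_labels.append(10 * drug_index[drug] + other_index[other])

print(len(set(perturbagen_labels)), len(set(replicate_labels)))  # 25 250
```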

##### Creating the dataset

In [7]:
# Transpose so that rows are profiles (samples) and columns are genes (features)
workingdata = data.transpose()
workingdata.head()

Unnamed: 0,gene0,gene1,gene2,gene3,gene4,gene5,gene6,gene7,gene8,gene9,...,gene40,gene41,gene42,gene43,gene44,gene45,gene46,gene47,gene48,gene49
A_1_prm,-0.837681,0.134055,-0.069256,0.879256,0.329362,0.430083,-0.328941,0.14237,0.497543,-0.610296,...,-0.430143,0.948235,0.067023,-0.235707,0.598989,0.571571,0.187593,0.0464,0.510854,0.25032
A_1_unv,-0.848637,-0.035813,-0.966952,0.255205,0.060296,0.527317,0.264332,0.89342,-0.032422,-0.035097,...,0.574412,-0.722249,-0.443834,-0.804338,-0.135993,0.635998,0.907667,0.461371,0.992589,0.26699
A_1_wgm,-0.087142,0.241758,0.321458,0.159366,0.840975,0.909729,0.87196,0.380354,0.037835,-0.468677,...,0.715138,-0.978814,0.333647,-0.292106,0.461211,0.407429,0.400556,0.088992,0.264357,-0.339276
A_1_hng,-0.405961,0.232256,-0.780353,-0.677889,-0.858893,0.911963,-0.217289,0.364805,0.166131,0.036939,...,0.231293,-0.68619,-0.945634,-0.115836,0.895154,0.430906,0.552569,-0.931495,-0.521365,-0.993977
A_1_cpi,0.004049,0.98338,-0.726598,-0.011056,0.976929,0.433159,0.408588,0.98744,-0.951953,-0.503968,...,0.320173,0.101312,-0.92283,0.957474,0.075836,0.378065,-0.169205,0.798374,0.863706,-0.25366


In [8]:
X_train, X_test, y_train, y_test = train_test_split(workingdata, perturbagen_class, test_size=0.5)
X_test.shape

(500, 50)
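
With many classes and a 50/50 split, a purely random split can leave some classes unevenly represented on the two sides. The sketch below shows the `stratify` argument of `train_test_split` on placeholder data. The label layout (25 classes of 40 profiles) follows the dataset description and is an assumption about how the labels are meant to look.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 50)        # placeholder for the transposed data
y = np.repeat(np.arange(25), 40)    # 25 drugs x 40 profiles each

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Per-class counts in the test half
counts = np.bincount(y_te, minlength=25)
print(X_te.shape, counts.min(), counts.max())
```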

##### Computation - Siamese

In [9]:
import keras
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout, Input, Layer
from keras.layers.noise import AlphaDropout
# Take the backend from the same keras package as the layers
from keras import backend as K

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [20]:
max_words = 50  # input dimension: 50 genes per profile
batch_size = 16
epochs = 40

def create_network(n_dense=6,
                   dense_units=16,
                   activation='selu',
                   dropout=AlphaDropout,
                   dropout_rate=0.1,
                   kernel_initializer='lecun_normal',
                   optimizer='adam',
                   num_classes=1,
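                   max_words=max_words):
    """Build the shared encoder: n_dense hidden layers of dense_units each.

    optimizer and num_classes are accepted but unused, because the
    classification head at the end is commented out.
    """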
    model = Sequential()
    model.add(Dense(dense_units, input_shape=(max_words,),
                    kernel_initializer=kernel_initializer))
    model.add(Activation(activation))
    model.add(dropout(dropout_rate))

    for i in range(n_dense - 1):
        model.add(Dense(dense_units, kernel_initializer=kernel_initializer))
        model.add(Activation(activation))
        model.add(dropout(dropout_rate))

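    # Classification head left disabled: the Siamese model consumes the
    # dense_units-wide output of the last hidden layer directly
    #model.add(Dense(num_classes))
    #model.add(Activation('softmax'))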
    return model

In [21]:
network = {
    'n_dense': 6,
    'dense_units': 16,
    'activation': 'selu',
    'dropout': AlphaDropout,
    'dropout_rate': 0.1,
    'kernel_initializer': 'lecun_normal',
    'optimizer': 'sgd',
    'num_classes':40
}

In [22]:
model = create_network(**network)

In [23]:
class ManDist(Layer):
    """Manhattan (L1) distance between two batches of embeddings."""

    # No extra constructor arguments are needed; the two inputs arrive in call()
    def __init__(self, **kwargs):
        self.result = None
        super(ManDist, self).__init__(**kwargs)

    # The layer has no trainable weights, so build() only marks it as built
    def build(self, input_shape):
        super(ManDist, self).build(input_shape)

    # x is a list [left, right]; sum |left - right| over the feature axis
    def call(self, x, **kwargs):
        self.result = K.sum(K.abs(x[0] - x[1]), axis=1, keepdims=True)
        return self.result

    # One distance per sample: (batch_size, 1)
    def compute_output_shape(self, input_shape):
        return K.int_shape(self.result)
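
The reduction inside `call` can be checked on small arrays. This NumPy sketch mirrors `K.sum(K.abs(x[0] - x[1]), axis=1, keepdims=True)` on two made-up batches of three-dimensional embeddings; the numbers are illustrative only.

```python
import numpy as np

left = np.array([[1.0, 2.0, 3.0],
                 [0.0, 0.0, 0.0]])
right = np.array([[1.5, 0.0, 3.0],
                  [1.0, -1.0, 2.0]])

# Sum of absolute differences per row, keeping a trailing axis of size 1
distance = np.sum(np.abs(left - right), axis=1, keepdims=True)
print(distance)
# [[2.5]
#  [4. ]]
```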

In [24]:
left_input = Input(shape=(max_words,))
right_input = Input(shape=(max_words,))

In [25]:
# Both inputs pass through the same network, so its weights are shared
shared_model = model

In [26]:
# Errors encountered while getting this cell to run:
# - TypeError: unsupported operand type(s) for +: 'NoneType' and 'int' (embedding layer is required)
# - Node error -> from keras, not from tf.python.keras
# - Input 'b' of 'MatMul' Op has type float32 that does not match type int32 of argument 'a'. ->
malstm_distance = ManDist()([shared_model(left_input), shared_model(right_input)])
model = Model(inputs=[left_input, right_input], outputs=[malstm_distance])

In [27]:
# 'accuracy' compares a real-valued distance with integer labels, so it is not informative here
model.compile(loss='mean_squared_error', optimizer="adam", metrics=['accuracy'])
model.summary()
shared_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 50)           0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 50)           0                                            
__________________________________________________________________________________________________
sequential_2 (Sequential)       (None, 16)           2176        input_3[0][0]                    
                                                                 input_4[0][0]                    
__________________________________________________________________________________________________
man_dist_2 (ManDist)            (None, 1)            0           sequential_2[1][0]               
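
The 2176 parameters reported for `sequential_2` follow from the layer sizes: one dense layer from 50 inputs to 16 units, then five dense layers from 16 to 16 units, each with a bias per unit. The activation and dropout layers add no parameters. A short check of that arithmetic (`dense_params` is a hypothetical helper):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


input_dim, units, n_dense = 50, 16, 6
first_layer = dense_params(input_dim, units)               # 50*16 + 16 = 816
hidden_layers = (n_dense - 1) * dense_params(units, units)  # 5 * 272 = 1360
total = first_layer + hidden_layers
print(total)  # 2176
```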
          

In [28]:
# ValueError: Error when checking target: expected man_dist_1 to have shape (1,) but got array with shape (46,)
# ==> the code needs to be adapted for multi-class targets

# X_train and X_test (500 rows each) serve as the left and right inputs; y_train is the target
malstm_trained = model.fit([X_train,X_test], y_train, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
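
A Siamese network is commonly trained on pairs of samples with a target that says whether the two share a class, rather than on a single class index. The sketch below shows one way to build such pairs with NumPy on placeholder data. `make_pairs` is a hypothetical helper, and how the pair target is matched to the distance output (for example through a contrastive loss) is left open here.

```python
import numpy as np


def make_pairs(X, labels, n_pairs, seed=0):
    """Sample random row pairs; target is 1.0 when labels match, else 0.0."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    left_idx = rng.integers(0, len(X), size=n_pairs)
    right_idx = rng.integers(0, len(X), size=n_pairs)
    same = (labels[left_idx] == labels[right_idx]).astype("float32")
    return X[left_idx], X[right_idx], same


X = np.random.rand(1000, 50).astype("float32")  # placeholder profiles
y = np.repeat(np.arange(25), 40)                 # 25 drugs x 40 profiles

left, right, same = make_pairs(X, y, n_pairs=2000)
print(left.shape, right.shape, same.shape)
```

With uniform sampling most pairs come from different classes, so in practice the positive pairs would probably need to be oversampled.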


In [30]:
prediction = model.predict([X_test,X_train],verbose=1)
print(prediction[0:5])

[[22.66892 ]
 [ 9.096903]
 [16.879007]
 [12.323569]
 [11.885744]]


In [33]:
score = model.evaluate([X_test,X_train],y_train,verbose=1)
score



[164.97259533691405, 0.01600000002980232]