<img style="float: left;" src="https://cdn.pixabay.com/photo/2016/12/07/09/45/dna-1889085__340.jpg" width=10%> <h1> Application of AI to Discover Novel Binding of Small Molecules </h1>

---------
### Sample Dataset for Testing Purposes

##### Here we create a sample dataset for two reasons:
- to get a better understanding of the structure of the data
- test any sample code for validity

##### Structure of sample dataset:
1. A dataframe consisting of 50 genes and 1020 profiles [50 x 1020]
2. Columns are a combination of drug, replicate, time, concentration, probe_location and cell type. For the purposes of this project, only drug and replicate matter for training, so each column name is structured as
"*drug + replicate id + unique characters that represent time, concentration, probe_location and cell type*"
3. 20 columns consist of control genes or 'control probes'. These columns are labelled control_x, where x is a number from 1 to 20. Note that the line adding them is commented out in the construction code below, so the dataframe built there is [50 x 1000]
4. The dataset consists of 25 drugs with 4 replicates each and 10 combinations of time, concentration, probe_location and cell type, giving 25 x 4 x 10 = 1000 drug profiles

| Feature      | Quantity | Represented By |
| ----------- | ----------- | ----------- |
| Drug      | 25       | Letters A-Y |
| Replicate   | 4        | Numbers 1-4 |
| Other features   | 10        | Random String of length 3 |

***R_3_xcv*** represents a profile of drug 'R', replicate 3, with other features corresponding to 'xcv'
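The naming scheme above can be unpacked programmatically. The following is a minimal sketch, assuming every profile name has exactly three underscore-separated parts; `ProfileName` and `parse_profile_name` are hypothetical helpers introduced here for illustration, and the `control_x` columns are not handled.

```python
from typing import NamedTuple


class ProfileName(NamedTuple):
    """Hypothetical container for the three parts of a profile column name."""
    drug: str
    replicate: int
    other: str


def parse_profile_name(name: str) -> ProfileName:
    """Split a name such as 'R_3_xcv' into drug, replicate id and other features."""
    drug, replicate, other = name.split("_")
    return ProfileName(drug=drug, replicate=int(replicate), other=other)


parsed = parse_profile_name("R_3_xcv")
print(parsed.drug, parsed.replicate, parsed.other)
```

A name with a different number of parts (for example `control_1`) raises a `ValueError` at the unpacking step, which makes malformed column names easy to spot.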

##### Construction of Sample Dataset

In [50]:
import random
import pandas as pd
import numpy as np

In [43]:
genes = ['gene'+str(a) for a in range(50)]
drugs = [chr(a) for a in range(65, 90)]
replicates = [str(a) for a in range(1, 5)]
other_features = set()

# Draw random 3-letter lowercase strings until we have 10 distinct ones
while len(other_features) != 10:
    rand_string = "".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(3))
    other_features.add(rand_string)

In [62]:
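# One column per (drug, replicate, other features) combination: 25 * 4 * 10 = 1000 names
columns = ["_".join([a,b,c]) for a in drugs for b in replicates for c in other_features]
# Uncomment to prepend the 20 control columns (1020 columns in total)
# columns = ["control_"+str(a+1) for a in range(20)] + columns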

In [65]:
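# Uniform random values in [-1, 1): one row per gene, one column per profile
data = pd.DataFrame(2*np.random.rand(50, len(columns))-1, index=genes, columns=columns)
# Safeguard only: the frame above contains no missing values, so this changes nothing here
data.fillna(random.random(), inplace=True)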

In [66]:
data.head()

,A_1_pig,A_1_zqy,A_1_zen,A_1_fay,A_1_fyv,A_1_eiu,A_1_vzn,A_1_buf,A_1_cgy,A_1_kij,...,Y_4_pig,Y_4_zqy,Y_4_zen,Y_4_fay,Y_4_fyv,Y_4_eiu,Y_4_vzn,Y_4_buf,Y_4_cgy,Y_4_kij
gene0,0.266684,-0.274312,0.077905,-0.548854,-0.736808,0.349861,-0.116514,0.569468,-0.596258,0.48713,...,0.017979,0.050713,0.882085,0.587092,-0.199024,-0.78522,-0.062024,0.93792,-0.862783,-0.799328
gene1,-0.334064,0.181081,-0.537101,-0.584721,0.672408,-0.367559,-0.150256,0.194821,-0.812244,0.600673,...,0.054452,0.899475,-0.84318,-0.193431,-0.00622,0.434563,-0.022035,0.214416,-0.187245,0.550042
gene2,0.994708,-0.97258,-0.530632,0.939359,-0.693738,0.879323,0.537013,-0.511489,0.024343,-0.434864,...,-0.315879,-0.030986,-0.184232,-0.323366,-0.993879,-0.203901,-0.86721,-0.733177,0.776557,-0.500453
gene3,0.002349,0.43927,0.110216,0.520396,-0.409847,-0.237758,-0.44872,-0.007899,0.748263,0.327927,...,0.666486,-0.137598,-0.553063,-0.933378,0.43806,-0.876192,0.896923,0.610096,0.83246,0.002362
gene4,0.531959,-0.581933,0.465837,-0.984135,-0.114998,-0.104977,-0.451407,-0.322394,-0.493598,0.39888,...,0.170159,-0.406464,0.650476,0.982573,0.590457,0.523859,0.418934,-0.766399,-0.715014,-0.816831
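
Since one purpose of the sample dataset is to validate code, a few sanity checks on its shape and value range may be useful. The sketch below rebuilds an equivalent dataset on its own so that it can run in isolation; the fixed seed, the use of `numpy.random.default_rng`, and the variable name `sample` are assumptions made for this illustration, not part of the cells above.

```python
import itertools
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only

genes = [f"gene{i}" for i in range(50)]
drugs = list(string.ascii_uppercase[:25])    # 'A'..'Y'
replicates = [str(i) for i in range(1, 5)]   # '1'..'4'

# Sample 10 distinct 3-letter strings without replacement from all 26**3 candidates
candidates = ["".join(t) for t in itertools.product(string.ascii_lowercase, repeat=3)]
other_features = rng.choice(candidates, size=10, replace=False).tolist()

# Same ordering as the nested comprehension: drug, then replicate, then other features
columns = ["_".join(parts) for parts in itertools.product(drugs, replicates, other_features)]

values = rng.uniform(-1.0, 1.0, size=(len(genes), len(columns)))
sample = pd.DataFrame(values, index=genes, columns=columns)

# Sanity checks on the structure described earlier (without control columns)
assert sample.shape == (50, 1000)
assert sample.columns.is_unique
assert not sample.isna().any().any()
assert sample.to_numpy().min() >= -1.0 and sample.to_numpy().max() <= 1.0
```

Sampling the feature strings without replacement avoids the retry loop, and seeding the generator means the same column names and values come back on every run.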


##### Classifying Columns
Each column (profile) needs a class label. Labels can be assigned at the biological replicate level or at the perturbagen (drug) level. We create both sets of labels.

In [75]:
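# Columns are ordered drug -> replicate -> other features, so each drug spans 4 * 10 = 40 consecutive columns
perturbagen_class = [a // 40 for a in range(1000)]
# One label per (drug, other features) combination; the 4 replicates of that combination share it
replicate_class = [10*a+c for a in range(25) for b in range(4) for c in range(10)]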
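
Index arithmetic like the above depends on the exact column order. As an alternative, the labels can be derived from the column names themselves. The following is a minimal sketch on a hypothetical miniature set of names (2 drugs, 2 replicates, 2 feature strings); it assumes that "replicate level" means profiles sharing a drug and the same other features get the same label, which is how the index-based version behaves.

```python
import pandas as pd

# Hypothetical miniature column set: 2 drugs x 2 replicates x 2 feature strings
columns = [f"{d}_{r}_{o}" for d in "AB" for r in (1, 2) for o in ("xcv", "pig")]

# Split each name into its three parts
parts = pd.Series(columns).str.split("_", expand=True)
parts.columns = ["drug", "replicate", "other"]

# Perturbagen level: one class per drug
perturbagen_class, drug_names = pd.factorize(parts["drug"])

# Replicate level (assumed meaning): one class per (drug, other features) pair,
# so replicates of the same condition share a label
condition_key = parts["drug"] + "_" + parts["other"]
replicate_class, condition_names = pd.factorize(condition_key)

print(perturbagen_class.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
print(replicate_class.tolist())    # [0, 1, 0, 1, 2, 3, 2, 3]
```

`pd.factorize` numbers classes in order of first appearance, so the result stays correct even if the columns are shuffled or filtered, although the specific integer assigned to each class may then differ.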