**Part A, autoencoder**

In [40]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

import sklearn
from sklearn import preprocessing, datasets, metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import cross_validate, train_test_split, cross_val_predict
from sklearn.metrics import make_scorer, confusion_matrix, f1_score, accuracy_score


from keras.layers import Input, Dense
from keras.models import Model
from keras import layers


import tensorflow as tf 


We will use an autoencoder to compress the gene expression data to a low-dimensional representation and then reconstruct it at the original size. The compressed representation can support the same kinds of classification and prediction tasks as before. I chose the binary normal vs. tumor all-gene dataset because it has fewer labels, which makes it a simpler case for practicing with autoencoders.
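
To make the compress-then-reconstruct idea concrete, here is a minimal NumPy sketch of a single forward pass through a one-hidden-layer autoencoder. The sizes are toy values and the weights are random rather than trained, so it only illustrates how the shapes change, not the model fitted in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (the real data has 60483 genes)
n_samples, n_genes, encoding_dim = 4, 1000, 50

# Stand-in for expression values already scaled to [0, 1]
x = rng.random((n_samples, n_genes))

# Randomly initialised weights; a real autoencoder learns these during training
w_enc = rng.normal(scale=0.01, size=(n_genes, encoding_dim))
b_enc = np.zeros(encoding_dim)
w_dec = rng.normal(scale=0.01, size=(encoding_dim, n_genes))
b_dec = np.zeros(n_genes)


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


encoded = relu(x @ w_enc + b_enc)            # compress: (4, 1000) -> (4, 50)
decoded = sigmoid(encoded @ w_dec + b_dec)   # reconstruct: (4, 50) -> (4, 1000)

print(x.shape, encoded.shape, decoded.shape)
```

The ReLU encoder and sigmoid decoder mirror the activations used in the Keras cells below; everything else here is an assumption made for the illustration.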

In [81]:
all_df = pd.read_pickle("./all_nt.pkl")

In [82]:
#I realize now that this splitting is perhaps unnecessary for the tasks we perform 
#we're not predicting onto the test set at the end 
#we might find a use for the test set in the future though

#split 80-20 into training and testing sets 
train_all, test_all = train_test_split(all_df, test_size = 0.2)

train_all_x = train_all.drop(columns = ["Type"])
#scale the x's by the largest value in the training set
train_max = train_all_x.to_numpy().max()
train_all_x = train_all_x/train_max

test_all_x = test_all.drop(columns = ["Type"])
#scale the x's (note: by the test set's own maximum, not train_max)
test_max = test_all_x.to_numpy().max()
test_all_x = test_all_x/test_max


In [29]:
train_all_x.shape

(1120, 60483)

In [44]:
# this is the size of our encoded representations
encoding_dim = 50

# this is our input placeholder
input_gene = Input(shape=(60483,))

# "encoded" is the encoded representation of the input
encoded = Dense(encoding_dim, activation='relu')(input_gene)

# "decoded" is the lossy reconstruction of the input
decoded = Dense(60483, activation='sigmoid')(encoded)

# this model maps an input to its reconstruction
autoencoder = Model(input_gene, decoded)

# this model maps an input to its encoded representation
encoder = Model(input_gene, encoded)

# create a placeholder for an encoded (50-dimensional) input
encoded_input = Input(shape=(encoding_dim,))

# retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]

# create the decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')


autoencoder.fit(train_all_x, train_all_x,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(test_all_x, test_all_x))


# encode and decode the gene expression profiles
# note that we take them from the *test* set
encoded_genes = encoder.predict(test_all_x)
decoded_genes = decoder.predict(encoded_genes)


Epoch 1/50
...
Epoch 50/50


The model continues to stabilize with more epochs, eventually reaching a loss of only 1.7% between the training and test sets!
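
To help interpret the loss value, here is a small NumPy sketch of the binary cross-entropy that the model is compiled with, averaged over all samples and features. The clipping constant `eps` and the toy data are assumptions for illustration. One thing it shows is that when the targets are continuous values in [0, 1] rather than exact 0s and 1s, even a perfect reconstruction gives a loss above zero, so the reported loss is better read relative to that floor than as a percentage error.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all samples and features."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return losses.mean()


rng = np.random.default_rng(0)
x = rng.random((5, 8))  # toy "expression" values already scaled to [0, 1]

# Reconstruction that matches the input exactly
perfect = binary_crossentropy(x, x)

# Reconstruction with some added noise, clipped back into [0, 1]
noisy_reconstruction = np.clip(x + rng.normal(scale=0.2, size=x.shape), 0.0, 1.0)
noisy = binary_crossentropy(x, noisy_reconstruction)

print("perfect reconstruction:", perfect)
print("noisy reconstruction:  ", noisy)
```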

*Question 2* 

In this problem, we vary the size of the "bottleneck layer", the lowest-dimensional representation that the data are compressed to.

In [46]:
#Different "bottleneck" (encoding dimension) layer sizes 

# these are the sizes of our encoded representations to try
encoding_values = [10, 20, 30, 40, 60, 80]


for encoding_dim in encoding_values:
    
    print("Beginning autoencoding with dimension size: ", encoding_dim)

    # this is our input placeholder
    input_gene = Input(shape=(60483,))

    # "encoded" is the encoded representation of the input
    encoded = Dense(encoding_dim, activation='relu')(input_gene)

    # "decoded" is the lossy reconstruction of the input
    decoded = Dense(60483, activation='sigmoid')(encoded)

    # this model maps an input to its reconstruction
    autoencoder = Model(input_gene, decoded)

    # this model maps an input to its encoded representation
    encoder = Model(input_gene, encoded)

    # create a placeholder for an encoded (encoding_dim-dimensional) input
    encoded_input = Input(shape=(encoding_dim,))

    # retrieve the last layer of the autoencoder model
    decoder_layer = autoencoder.layers[-1]

    # create the decoder model
    decoder = Model(encoded_input, decoder_layer(encoded_input))

    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')


    autoencoder.fit(train_all_x, train_all_x,
                    epochs=50,
                    batch_size=256,
                    shuffle=True,
                    validation_data=(test_all_x, test_all_x))


    # encode and decode the gene expression profiles
    # note that we take them from the *test* set
    encoded_genes = encoder.predict(test_all_x)
    decoded_genes = decoder.predict(encoded_genes)
    
    print("Completed autoencoding with dimension size: ", encoding_dim)


Beginning autoencoding with dimension size:  10
Epoch 1/50
...
Epoch 50/50
Completed autoencoding with dimension size:  10
Beginning autoencoding with dimension size:  20
Epoch 1/50
...
Epoch 23/50

Ten dimensions are too few to capture the data fully: after decoding, we still have a meaningful loss between the training and test sets even at the last epoch (14%). The most stable and best-performing model is the one with 80 dimensions, which has a loss of only 1.1% at the final epoch. Since these data have so many features, we likely need more dimensions to capture the variation.
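
The trend of lower reconstruction loss with a wider bottleneck can be illustrated without Keras. The sketch below swaps in a different technique: a truncated SVD (the best linear encoder/decoder pair, equivalent to PCA on centred data) with mean squared error, applied to synthetic data. It is not the nonlinear autoencoder or the binary cross-entropy used above, and the data sizes and the number of hidden factors are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 500 features, built from 30 hidden factors plus noise
n_samples, n_features, n_factors = 200, 500, 30
factors = rng.normal(size=(n_samples, n_factors))
loadings = rng.normal(size=(n_factors, n_features))
x = factors @ loadings + 0.1 * rng.normal(size=(n_samples, n_features))
x = x - x.mean(axis=0)  # centre each feature

# One SVD gives the best rank-k linear reconstruction for every k
u, s, vt = np.linalg.svd(x, full_matrices=False)

errors = {}
for k in [10, 20, 30, 40, 60, 80]:
    codes = x @ vt[:k].T      # "encode" each sample to k dimensions
    x_hat = codes @ vt[:k]    # "decode" back to all features
    errors[k] = np.mean((x - x_hat) ** 2)
    print(k, errors[k])
```

In this toy setting the error drops sharply until the bottleneck reaches the number of hidden factors and only slowly afterwards, which is one way to think about why a very small bottleneck struggles on data with many sources of variation.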

*Question 3*

We've seen what happens with a single hidden layer. I don't think the single-layer model performs that poorly (a 1.1% loss in the best-performing one). However, it takes many epochs to stabilize, which is tedious, and it may need a larger bottleneck layer to do well. Now let's explore what happens with 4 and 20 layers.

In [73]:
#Autoencoding
# this is the size of our encoded representations
encoding_dim = 80

# this is our input placeholder
input_gene = Input(shape=(60483,))

# "encoded" is the encoded representation of the input
encoded_1 = layers.Dense(128, activation='relu', name= "encoded_1")(input_gene)
encoded_2 = layers.Dense(encoding_dim, activation='relu', name = "encoded_2")(encoded_1)
#encoded = layers.Dense(encoding_dim, activation='relu')(encoded)


# "decoded" is the lossy reconstruction of the input
#decoded = layers.Dense(64, activation='relu')(encoded)
decoded_1 = layers.Dense(128, activation='relu')(encoded_2)
decoded_2 = layers.Dense(60483, activation='sigmoid', name = "decoded_2")(decoded_1)

# this model maps an input to its reconstruction
autoencoder = Model(input_gene, decoded_2)


autoencoder.compile(optimizer='adam', loss='binary_crossentropy')


autoencoder.fit(train_all_x, train_all_x,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(test_all_x, test_all_x))



# Making the encoder model
new_encoded_1 = autoencoder.layers[1]
new_encoded_2 = autoencoder.layers[2]

encoder = Model(input_gene, new_encoded_2.output)
#encoder.summary()


#The decoder model
encoded_input = Input(shape=(encoding_dim,), name='encoded_input')
new_decoded_1 = autoencoder.layers[-2](encoded_input)
new_decoded_2 = autoencoder.layers[-1](new_decoded_1)

decoder = Model(encoded_input, new_decoded_2)
#decoder.summary()

# encode and decode the gene expression profiles
# note that we take them from the *test* set
encoded_genes = encoder.predict(test_all_x)
decoded_genes = decoder.predict(encoded_genes)


Epoch 1/50
...
Epoch 50/50


A 4-layer autoencoder removes practically all of the gap between the training and test set losses by just the 13th epoch. The extra layers take on more of the work, so the model reaches accurate reconstructions in fewer iterations.

However, 20 layers is too many and makes the model less useful. It takes far longer to run and doesn't produce a meaningful improvement in model performance. This shows me that choosing the number of layers is yet another step in tuning an autoencoder. I was able to run this earlier, and the results weren't any different from those with 4 layers. Now it won't even complete its run because it needs too much memory. In the future I'd like to experiment with parallelizing it or running it on AWS, but since it doesn't add much, we're not really missing out.
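
A rough parameter count helps explain the memory problem. The sketch below counts weights and biases for the 4-layer model and for the 20-layer model, using the layer widths from the corresponding cells. The memory estimate assumes float32 parameters and that the Adam optimizer stores two extra values per parameter; it ignores activations and framework overhead, so treat it as a lower bound rather than a measurement.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


n_genes = 60483

# Layer widths taken from the 4-layer and 20-layer cells in this notebook
four_layer = [n_genes, 128, 80, 128, n_genes]
encoder_widths = [5000, 4000, 3000, 2000, 1000, 500, 250, 200, 150]
twenty_layer = [n_genes] + encoder_widths + [80] + encoder_widths[::-1] + [n_genes]

for name, sizes in [("4 layers", four_layer), ("20 layers", twenty_layer)]:
    params = dense_param_count(sizes)
    # float32 takes 4 bytes; Adam keeps two moment estimates per parameter,
    # so roughly three float32 values are held for every parameter
    gib = params * 4 * 3 / 2**30
    print(f"{name}: {params:,} parameters, roughly {gib:.1f} GiB "
          f"for weights plus Adam state")
```

Most of the parameters sit in the first and last layers, which connect all 60483 genes to 5000 units, so narrowing those two layers would shrink the model far more than removing layers from the middle.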

In [None]:
#Autoencoding
# this is the size of our encoded representations
encoding_dim = 80

# this is our input placeholder
input_gene = Input(shape=(60483,))

#20 layers 

# "encoded" is the encoded representation of the input
encoded_1 = layers.Dense(5000, activation='relu', name= "encoded_1")(input_gene)
encoded_2 = layers.Dense(4000, activation='relu', name = "encoded_2")(encoded_1)
encoded_3 = layers.Dense(3000, activation='relu', name = "encoded_3")(encoded_2)
encoded_4 = layers.Dense(2000, activation='relu', name = "encoded_4")(encoded_3)
encoded_5 = layers.Dense(1000, activation='relu', name = "encoded_5")(encoded_4)
encoded_6 = layers.Dense(500, activation='relu', name = "encoded_6")(encoded_5)
encoded_7 = layers.Dense(250, activation='relu', name = "encoded_7")(encoded_6)
encoded_8 = layers.Dense(200, activation='relu', name = "encoded_8")(encoded_7)
encoded_9 = layers.Dense(150, activation='relu', name = "encoded_9")(encoded_8)


encoded_final = layers.Dense(encoding_dim, activation='relu', name = "encoded_final")(encoded_9)


# "decoded" is the lossy reconstruction of the input
#decoded = layers.Dense(64, activation='relu')(encoded)
decoded_1 = layers.Dense(150, activation='relu', name = "decoded_1")(encoded_final)
decoded_2 = layers.Dense(200, activation='relu', name = "decoded_2")(decoded_1)
decoded_3 = layers.Dense(250, activation='relu', name = "decoded_3")(decoded_2)
decoded_4 = layers.Dense(500, activation='relu', name = "decoded_4")(decoded_3)
decoded_5 = layers.Dense(1000, activation='relu', name = "decoded_5")(decoded_4)
decoded_6 = layers.Dense(2000, activation='relu', name = "decoded_6")(decoded_5)
decoded_7 = layers.Dense(3000, activation='relu', name = "decoded_7")(decoded_6)
decoded_8 = layers.Dense(4000, activation='relu', name = "decoded_8")(decoded_7)
decoded_9 = layers.Dense(5000, activation='relu', name = "decoded_9")(decoded_8)


decoded_final = layers.Dense(60483, activation='sigmoid', name = "decoded_final")(decoded_9)

# this model maps an input to its reconstruction
autoencoder = Model(input_gene, decoded_final)


autoencoder.compile(optimizer='adam', loss='binary_crossentropy')


autoencoder.fit(train_all_x, train_all_x,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(test_all_x, test_all_x))



# Making the encoder model
new_encoded_final = autoencoder.layers[10]

encoder = Model(input_gene, new_encoded_final.output)
encoder.summary()



#The decoder model
encoded_input = Input(shape=(encoding_dim,), name='encoded_input')

new_decoded_1 = autoencoder.layers[-10](encoded_input)
new_decoded_2 = autoencoder.layers[-9](new_decoded_1)
new_decoded_3 = autoencoder.layers[-8](new_decoded_2)
new_decoded_4 = autoencoder.layers[-7](new_decoded_3)
new_decoded_5 = autoencoder.layers[-6](new_decoded_4)
new_decoded_6 = autoencoder.layers[-5](new_decoded_5)
new_decoded_7 = autoencoder.layers[-4](new_decoded_6)
new_decoded_8 = autoencoder.layers[-3](new_decoded_7)
new_decoded_9 = autoencoder.layers[-2](new_decoded_8)

new_decoded_final = autoencoder.layers[-1](new_decoded_9)


decoder = Model(encoded_input, new_decoded_final)
decoder.summary()


# encode and decode the gene expression profiles
# note that we take them from the *test* set
encoded_genes = encoder.predict(test_all_x)
decoded_genes = decoder.predict(encoded_genes)


*Question 4* 

Two possible uses of an autoencoder for gene expression data are using the bottleneck layer to extract a compact set of features and using the decoder to generate synthetic data. An autoencoder first compresses the data down to the dimension of the bottleneck layer, and we can use this representation as we would the output of any other dimensionality reduction method. Gene expression data are particularly complex and high-dimensional; if we can identify a simplified representation, we can better understand, analyze, and interpret our results. For example, we could cluster the encoded data to find meaningful similarities in gene expression among tumors. We could also generate synthetic data by decoding points in this representation, which could perhaps be used to correct class imbalances.
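
As one possible way to generate synthetic samples, the sketch below blends random pairs of encoded minority-class samples and decodes the blends. This interpolation scheme is an assumption chosen for illustration, not something run in this notebook. The `decode` function is a hypothetical stand-in for `decoder.predict`, and `minority_codes` stands in for the output of `encoder.predict` on the under-represented class.

```python
import numpy as np

rng = np.random.default_rng(0)
encoding_dim, n_genes = 80, 1000  # toy gene count for illustration

# Hypothetical stand-in for a trained decoder: one dense layer with a sigmoid
w_dec = rng.normal(scale=0.1, size=(encoding_dim, n_genes))


def decode(codes):
    return 1.0 / (1.0 + np.exp(-(codes @ w_dec)))


# Hypothetical encoded minority-class samples
minority_codes = rng.random((12, encoding_dim))


def synthesize(codes, n_new, rng):
    """Blend random pairs of encoded samples, then decode the blends."""
    i = rng.integers(0, len(codes), size=n_new)
    j = rng.integers(0, len(codes), size=n_new)
    t = rng.random((n_new, 1))  # mixing weight for each new sample
    blended = t * codes[i] + (1.0 - t) * codes[j]
    return decode(blended)


synthetic = synthesize(minority_codes, n_new=20, rng=rng)
print(synthetic.shape)
```

Whether such samples are realistic enough to help a classifier would need to be checked, for example by comparing a model trained with and without them on held-out real data.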
