## Summary


* Import Libraries
* Load and Preview Data from AWS Data Lake
* Data Wrangling
  * String Mapping
  * Use Chi-Square to Select Top 1000 Features
  * Reshape and split the data 
* Deep Learning Architecture
  * Set the Hyperparameters
  * Model Construction
    * Learning Rate / Adam Decay Tuning Based on Gaussian Process
    * Bayesian Optimization
    * Best Model for Training
* Model Evaluation and Model Fitting

## Potential Application of Deep Learning using TCGA
* ### Identify potential tissue of origin of unknown cancer types by molecular profiles
* ### Predict tissue of origin of metastatic cancer

Large-scale cancer genomics data often poses substantial computational challenges. The high-dimensional dataset is well suited to deep learning. Given the features and labels, models can be trained to classify future samples based on similar gene expression profiles.

## RNA-Seq (HiSeq) PANCAN Data Set

Data Abstract:

This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set. It is an extraction of gene expression values from patients with different tumor types: BRCA, KIRC, COAD, LUAD and PRAD. The data source is the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq

## Import Libraries

The Keras API is used to build multilayer deep learning models, Scikit-Optimize is used to tune the hyperparameters, and Plotly is used for interactive data visualization.

In [8]:
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer, OneHotEncoder, StandardScaler, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.mllib.evaluation import MulticlassMetrics
import plotly.graph_objs as go
from collections import Counter
import numpy as np
import pandas as pd
import os
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from plotly.subplots import make_subplots

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers import LSTM, Embedding, Flatten
from keras import optimizers, regularizers
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

import skopt
from skopt import gbrt_minimize, gp_minimize
from skopt.utils import use_named_args
from skopt.space import Real, Categorical, Integer  

import tensorflow
from tensorflow.python.keras import backend as K

## Load and Preview Data from AWS Data Lake

*Data Loading*

The datasets are stored in the Gold bucket of our AWS data lake. Feature data and label data are stored separately.

In [12]:
# Get access to AWS data lake
ACCESS_KEY= os.environ['AWS_ACCESS_KEY']
ENCODED_SECRET_KEY = os.environ['AWS_SECRET_ACCESS_KEY'].replace('/', '%2F')

GOLD_BUCKET = 'oculadata-gold'
MOUNT_GOLD = '/mnt/gold'

# collect the existing mount points in a list so the membership check below can be repeated safely
mounted_paths = [m.mountPoint for m in dbutils.fs.mounts()]

# Mount the data
if MOUNT_GOLD not in mounted_paths:
  dbutils.fs.mount(f's3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@{GOLD_BUCKET}', MOUNT_GOLD)
  
print(dbutils.fs.mounts())

project = 'PanCanAtlas'
filepath = f'{MOUNT_GOLD}/{project}'

infer_schema = 'true'
if_header = 'true'

In [13]:
# get the feature data
tablename = 'Sameple_Cancer_data'
file_location = f'dbfs:{filepath}/{tablename}/'
file_type = 'delta'

r_data = spark.read.option("maxColumns", 25000).format(file_type)\
        .option("header", if_header)\
        .option("inferSchema", infer_schema)\
        .load(file_location)

In [14]:
r_data.limit(10).toPandas()

Unnamed: 0,_c0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,gene_9,gene_10,gene_11,gene_12,gene_13,gene_14,gene_15,gene_16,gene_17,gene_18,gene_19,gene_20,gene_21,gene_22,gene_23,gene_24,gene_25,gene_26,gene_27,gene_28,gene_29,gene_30,gene_31,gene_32,gene_33,gene_34,gene_35,gene_36,gene_37,gene_38,...,gene_20491,gene_20492,gene_20493,gene_20494,gene_20495,gene_20496,gene_20497,gene_20498,gene_20499,gene_20500,gene_20501,gene_20502,gene_20503,gene_20504,gene_20505,gene_20506,gene_20507,gene_20508,gene_20509,gene_20510,gene_20511,gene_20512,gene_20513,gene_20514,gene_20515,gene_20516,gene_20517,gene_20518,gene_20519,gene_20520,gene_20521,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530
0,sample_0,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,0.0,0.591871,1.334282,2.015391,0.591871,0.0,0.0,0.0,0.0,0.591871,5.619994,1.334282,0.0,9.796088,0.0,0.0,1.598651,7.215116,10.83907,6.620204,9.513538,0.0,4.063658,7.764805,4.747656,13.714396,10.034496,0.0,0.0,9.833458,...,9.370304,10.362393,5.589928,8.141964,0.0,2.736583,7.037152,7.12348,10.967399,5.9028,3.71937,7.203554,6.042557,2.602077,7.425526,7.846957,2.824951,6.239396,0.0,8.469593,0.0,6.535978,6.968701,7.128881,7.175175,9.249369,7.02597,8.045563,7.475709,7.205236,4.926711,8.210257,9.723516,7.22003,9.119813,12.003135,9.650743,8.921326,5.286759,0.0
1,sample_1,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.0,0.0,0.0,0.0,0.587845,2.466601,1.004394,0.0,0.0,0.0,0.0,0.0,11.055208,3.562621,0.0,10.07047,0.0,0.0,0.0,9.949812,8.522476,1.17479,4.926991,0.0,0.0,5.819832,1.32717,13.28624,6.663316,0.587845,0.0,9.533302,...,8.882967,9.898199,7.069401,7.186134,0.0,3.134993,6.64893,6.715701,9.536238,1.004394,5.555482,8.02926,6.366219,0.811142,7.991732,7.161001,0.0,4.708877,0.811142,8.451689,0.0,7.242336,8.046284,6.047558,8.572901,7.54903,7.019935,9.45894,9.190867,10.639259,4.593372,7.323865,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.0
2,sample_2,0.0,3.511759,4.327199,6.881787,9.87073,0.0,6.97213,0.452595,0.0,0.0,0.0,0.452595,1.981122,1.074163,0.0,0.0,0.0,0.0,1.683023,8.210248,4.195285,3.660427,8.97092,0.0,0.0,0.796598,6.09665,9.861616,7.680507,3.119439,0.0,0.452595,7.899526,0.0,10.731098,6.967883,0.452595,0.0,9.646323,...,10.355637,10.423274,5.170201,6.19426,0.0,3.677147,6.27199,7.089816,9.67522,0.0,4.224017,8.020402,6.967883,5.014445,8.400038,7.527555,0.0,4.997902,0.796598,7.761132,0.0,6.82046,8.048983,6.661493,7.716332,6.745802,7.524667,8.60235,9.036654,10.336027,5.125213,8.127123,10.90864,5.401607,9.911597,9.045255,9.788359,10.09047,1.683023,0.0
3,sample_3,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,0.0,0.0,0.434882,2.874246,0.0,0.0,0.0,0.0,0.0,1.267356,8.306317,3.573556,0.0,8.524616,0.0,0.0,0.0,3.913761,9.511573,6.469165,7.029895,0.0,1.267356,6.800641,7.742714,12.659474,8.29989,0.768587,0.0,9.670731,...,10.074382,9.918261,7.117924,7.196145,0.434882,3.609755,8.896696,7.577096,10.731446,5.075383,2.175652,7.675435,6.840816,6.233192,8.899886,8.319085,1.791814,5.661134,1.464093,8.625727,0.0,7.420095,7.784746,7.613915,8.963286,7.744699,7.924997,8.981473,8.665592,9.194823,6.076566,8.792959,10.14152,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.0
4,sample_4,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,0.0,0.0,1.275841,2.141204,0.0,0.0,0.0,0.0,0.0,0.889707,10.14915,2.96763,0.0,8.047238,0.0,1.435949,0.0,1.94212,8.821535,5.861429,7.755709,0.0,0.649386,5.570241,2.612801,13.556734,8.004754,0.0,0.0,9.587569,...,10.129154,10.062303,6.91162,7.855149,0.360982,3.65581,7.25552,7.292607,10.779793,3.954001,6.991148,8.153248,7.508444,4.586531,9.152227,8.227717,0.360982,6.227104,0.649386,8.151879,0.0,6.558289,8.673708,6.505099,8.948989,7.010366,7.364056,8.950646,8.233366,9.298775,5.996032,8.891425,10.37379,7.181162,9.84691,11.922439,9.217749,9.461191,5.110372,0.0
5,sample_5,0.0,3.467853,3.581918,6.620243,9.706829,0.0,7.75851,0.0,0.0,0.0,0.51541,0.51541,2.516797,0.0,0.0,0.0,0.0,0.894294,0.894294,6.842765,2.809661,4.002901,7.663935,0.0,0.0,1.19415,0.894294,9.628387,7.220446,7.012759,4.936704,1.44228,8.002236,2.002018,14.021153,9.126059,0.0,0.0,9.962887,...,9.816689,10.207576,6.640314,7.761365,0.894294,4.267618,7.844574,7.269248,10.89381,5.161037,3.749717,8.251047,6.981579,5.76133,8.431932,8.801725,0.51541,7.471439,2.8954,7.953539,0.0,7.067208,8.077959,7.973393,8.380816,7.79227,7.015482,8.664963,8.34482,7.831035,5.726657,8.602588,9.928339,6.096154,9.816001,11.556995,9.24415,9.836473,5.355133,0.0
6,sample_6,0.0,1.224966,1.691177,6.572007,9.640511,0.0,6.754888,0.531868,0.0,0.0,3.173927,1.476796,3.023841,0.0,0.0,0.0,0.0,0.0,9.466878,7.4242,1.224966,0.0,9.97364,0.0,0.0,0.0,0.0,9.508977,4.48542,8.633177,0.0,0.0,5.8994,0.0,15.597753,10.909466,1.476796,0.0,9.145458,...,9.985102,9.971012,4.456418,8.152533,0.919683,4.680133,7.757137,6.724732,11.044708,4.334196,5.314236,8.203343,6.55733,4.200646,8.370317,7.863146,3.434068,6.167663,0.919683,7.964479,0.0,6.501632,9.371053,6.271224,9.115624,8.393017,6.724787,7.507858,6.862712,6.830293,5.105904,7.927968,9.673966,1.877744,9.802692,13.25606,9.664486,9.244219,8.330912,0.0
7,sample_7,0.0,2.854853,1.750478,7.22672,9.758691,0.0,5.952103,0.0,0.0,0.0,0.441802,0.0,2.405856,0.0,0.779554,0.0,0.0,0.0,0.0,7.373431,2.304861,0.0,8.922008,0.0,0.0,0.0,5.145527,9.406279,4.66471,5.705193,0.0,6.454625,7.012714,5.145527,16.798586,10.113911,1.053042,0.0,9.349567,...,10.141175,9.96296,6.154326,7.728165,0.779554,4.797594,7.538228,6.418822,10.773593,2.304861,5.101036,8.043897,6.4953,5.085896,8.359666,7.691122,0.0,5.51686,1.282855,7.849668,0.0,6.472199,7.65311,7.603471,8.468738,8.466199,7.110896,8.358093,7.422199,6.460507,5.297833,8.277092,9.59923,5.24429,9.994339,12.670377,9.987733,9.216872,6.55149,0.0
8,sample_8,0.0,3.992125,2.77273,6.546692,10.488252,0.0,7.690222,0.352307,0.0,4.067604,1.411318,1.252839,2.579977,0.0,0.0,0.0,0.0,0.0,0.635336,10.147625,4.287908,4.773828,8.343976,0.0,0.0,0.0,6.080374,7.600537,2.824931,4.934233,0.0,0.0,5.244617,9.464329,13.531004,5.944411,0.0,0.0,8.542196,...,11.35021,10.84596,6.599748,7.967163,0.0,7.629808,8.682851,7.812396,9.555722,1.554049,6.406288,8.114591,7.41657,3.082175,9.394889,7.888506,0.635336,6.865783,1.252839,7.504906,0.0,6.501863,8.488764,4.308193,8.290171,8.069799,7.987747,9.469967,9.595914,10.461326,6.721974,9.597533,9.763753,7.933278,10.95288,12.498919,10.389954,10.390255,7.828321,0.0
9,sample_9,0.0,3.642494,4.423558,6.849511,9.464466,0.0,7.947216,0.724214,0.0,0.0,0.0,1.204141,2.296311,0.0,0.0,0.0,0.0,0.0,0.0,7.85678,1.204141,5.784391,6.020051,0.0,0.0,0.0,0.724214,9.871312,6.118104,6.284211,0.0,5.070381,5.907184,0.0,12.578644,8.285411,0.0,0.0,9.292131,...,10.713593,10.071543,5.945694,7.397144,0.0,3.314972,8.712809,7.656704,10.81628,5.417103,2.635987,7.98435,7.346026,7.023422,8.855697,9.512906,0.0,6.854881,1.204141,8.309331,0.0,7.023422,8.521294,8.318199,8.258024,7.83239,7.74722,8.629757,7.014802,8.267213,6.020051,8.712809,10.259096,6.131583,9.923582,11.144295,9.244851,9.484299,4.759151,0.0


In [15]:
# get the label data
tablename = 'Sameple_Cancer_labels'
file_location = f'dbfs:{filepath}/{tablename}/'

r_label = spark.read.format(file_type)\
                  .option("header", if_header)\
                  .option("inferSchema", infer_schema)\
                  .option("delimiter", ",")\
                  .load(file_location)

r_label.limit(10).toPandas()

Unnamed: 0,_c0,Class
0,sample_0,PRAD
1,sample_1,LUAD
2,sample_2,PRAD
3,sample_3,PRAD
4,sample_4,BRCA
5,sample_5,PRAD
6,sample_6,KIRC
7,sample_7,PRAD
8,sample_8,BRCA
9,sample_9,PRAD


## Data Wrangling

String data cannot be used directly in model training, so the class labels need to be mapped to dummy variables that the algorithm can distinguish.

### String Mapping

In [19]:
# NULL value check
r_label.filter('Class is null').show()

In [20]:
# category indexing with StringIndexer
indexer = StringIndexer(inputCol = 'Class', outputCol = 'Class_Index') 
indexed_r_label = indexer.fit(r_label).transform(r_label)

# OneHotEncoder to convert the indices into binary vectors
encoder = OneHotEncoder(inputCol='Class_Index', outputCol='Class_Vec',dropLast = False)
indexed_r_label = encoder.transform(indexed_r_label)

# split into different columns
indexed_r_label = indexed_r_label.select('Class_Vec').rdd.map(lambda x:x[0].toArray().tolist()).toDF()

In [21]:
# show the converted binary values
indexed_r_label.limit(10).toPandas()

Unnamed: 0,_1,_2,_3,_4,_5
0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0
6,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,1.0,0.0
8,1.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,1.0,0.0
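
The indexing and one-hot steps above can be sketched in plain Python. This is a minimal illustration, not the Spark implementation: the helper name `index_and_one_hot` is hypothetical, and it assumes the most frequent label receives index 0 (the default ordering of `StringIndexer`). Because it runs only on the ten previewed labels, the resulting indices differ from those of the full dataset.

```python
from collections import Counter

def index_and_one_hot(labels):
    """Index labels by descending frequency, then one-hot encode them."""
    # The most frequent label gets index 0; ties keep first-seen order in this sketch.
    ordered = [label for label, _ in Counter(labels).most_common()]
    index_of = {label: i for i, label in enumerate(ordered)}
    one_hot = [
        [1.0 if index_of[label] == j else 0.0 for j in range(len(ordered))]
        for label in labels
    ]
    return index_of, one_hot

# The ten labels shown in the label preview above.
sample_labels = ['PRAD', 'LUAD', 'PRAD', 'PRAD', 'BRCA',
                 'PRAD', 'KIRC', 'PRAD', 'BRCA', 'PRAD']
index_of, one_hot = index_and_one_hot(sample_labels)
print(index_of)
print(one_hot[0])
```

Each encoded row contains exactly one 1.0, at the position of the label's index.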


### Use Chi-Square to Select Top 1000 Features

In [23]:
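# apply SelectKBest class to extract top 1000 best features
# scikit-learn works on in-memory arrays, so convert the Spark DataFrames to pandas first
# (this assumes the feature rows and label rows are in the same order)
gene_cols = [c for c in r_data.columns if c.startswith('gene_')]  # skip the sample id column
X_pd = r_data.select(*gene_cols).toPandas()
# chi2 expects one class index per sample, so take the argmax of each one-hot row
y_idx = indexed_r_label.toPandas().values.argmax(axis=1)

bestfeatures = SelectKBest(score_func = chi2, k=1000)
fit = bestfeatures.fit(X_pd, y_idx)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(gene_cols)

# concat the column name
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']

# select top 1000 best features by name
S_Index = featureScores.nlargest(1000,'Score')['Specs'].tolist()

r_data_ext = r_data.select(*S_Index)
r_data_ext.limit(10).toPandas()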

Unnamed: 0,gene_9176,gene_9175,gene_15898,gene_220,gene_219,gene_15896,gene_18135,gene_15899,gene_12069,gene_15895,gene_13976,gene_16132,gene_12995,gene_3439,gene_19153,gene_5829,gene_450,gene_16169,gene_15591,gene_11903,gene_14114,gene_3737,gene_16259,gene_16130,gene_13639,gene_6937,gene_15589,gene_1858,gene_3813,gene_16105,gene_13818,gene_3524,gene_12881,gene_16156,gene_15894,gene_9232,gene_12848,gene_1510,gene_7965,gene_16392,...,gene_9924,gene_18900,gene_16248,gene_5361,gene_7040,gene_9168,gene_13308,gene_3138,gene_2260,gene_753,gene_15604,gene_5752,gene_9181,gene_8193,gene_7659,gene_11393,gene_6070,gene_11576,gene_1331,gene_17680,gene_6850,gene_8431,gene_1992,gene_10421,gene_10690,gene_8032,gene_8013,gene_16357,gene_4209,gene_3448,gene_6945,gene_894,gene_7398,gene_6355,gene_6735,gene_3537,gene_10363,gene_2729,gene_9502,gene_3962
0,18.525161,17.17357,1.334282,0.591871,0.591871,0.591871,6.878308,0.0,4.692126,0.0,12.205063,0.0,13.186662,3.266292,1.010279,0.0,0.0,3.478079,0.0,0.0,0.0,11.057694,0.0,0.0,0.0,10.671541,0.0,0.0,0.0,0.0,2.476226,4.852678,8.522283,0.0,0.0,1.598651,8.086513,1.598651,0.0,3.410884,...,0.591871,3.410884,0.0,3.603763,11.244798,7.55567,1.010279,1.010279,1.822037,11.26069,0.0,6.180088,1.010279,0.0,0.0,6.351165,12.685279,4.063658,9.635185,0.0,7.55567,2.824951,6.360101,3.017958,9.96006,1.822037,7.896932,2.015391,5.341922,7.455015,3.105561,5.267892,1.334282,6.974529,2.185898,2.824951,8.09189,8.067547,1.822037,0.591871
1,0.0,0.0,13.609213,0.0,0.0,13.532125,0.0,12.695983,0.0,10.068832,0.0,0.0,0.0,0.323658,0.0,0.0,0.0,0.0,6.555574,11.328675,0.0,0.323658,0.0,0.0,0.323658,0.0,0.0,0.0,0.0,0.0,0.0,0.323658,0.0,0.0,9.975991,0.0,0.0,0.323658,0.587845,0.811142,...,2.006585,0.0,5.915519,1.004394,3.00768,11.158837,3.293282,0.587845,0.0,4.903299,4.593372,8.382698,6.204051,0.0,1.706508,7.623435,2.961846,0.0,3.499846,0.0,0.0,4.330264,9.702685,2.592278,7.892907,1.004394,1.32717,0.323658,1.32717,3.095199,2.76237,0.323658,0.0,9.239379,0.0,0.0,4.736361,0.587845,3.890826,0.0
2,16.053597,14.818422,1.074163,0.452595,0.0,1.074163,12.900029,0.0,14.766151,0.0,9.000285,0.452595,3.000252,0.0,0.0,2.785739,0.0,0.0,0.0,0.0,0.0,9.709277,0.0,0.0,1.50716,4.009723,0.0,0.0,0.0,0.796598,0.0,2.438799,0.0,0.0,0.0,1.306846,5.324101,0.796598,2.228018,1.306846,...,0.0,1.981122,0.796598,1.50716,5.823816,3.06521,0.0,0.0,0.0,6.834509,0.796598,2.109829,0.0,0.0,1.074163,10.090959,12.097624,0.0,4.410558,0.0,9.410458,1.074163,1.074163,3.71812,9.639674,4.981168,10.172915,7.082149,2.706508,0.796598,0.0,1.306846,0.0,2.337254,0.0,0.452595,0.0,8.113753,0.452595,1.981122
3,18.371794,17.371079,0.434882,0.434882,1.039419,4.216416,13.907304,0.0,8.653719,0.0,9.779831,0.0,14.465586,0.0,0.768587,1.464093,0.0,0.768587,0.0,0.0,0.434882,10.421613,1.931418,0.0,0.0,7.12884,0.0,0.0,0.0,0.0,0.0,0.434882,1.464093,0.0,0.0,1.039419,6.959747,0.434882,1.267356,0.434882,...,0.0,5.9427,0.0,5.55718,10.321252,6.422593,0.0,0.0,1.039419,11.234009,0.768587,16.445607,0.0,1.039419,0.0,5.795128,12.109086,0.0,8.375187,0.0,3.439636,1.464093,3.573556,3.656393,10.069665,1.267356,9.323969,10.28266,1.637239,8.867656,2.478532,0.0,0.0,6.246623,0.0,0.434882,5.029501,7.492342,0.768587,0.0
4,0.0,1.580097,1.095654,0.0,0.0,0.360982,0.0,0.0,0.0,0.0,0.0,0.0,0.360982,0.0,0.0,0.0,0.360982,0.649386,0.0,0.0,0.0,0.360982,0.0,0.0,0.0,0.0,14.97592,0.360982,0.0,0.0,1.095654,0.0,0.0,0.0,2.858777,0.0,0.0,1.435949,0.649386,1.435949,...,1.095654,0.889707,3.452793,1.435949,6.571285,7.600069,1.275841,0.649386,0.0,4.304657,1.711142,7.458021,2.914239,0.0,0.360982,11.614167,8.338286,3.019133,4.603502,0.0,2.544139,1.435949,6.974942,1.94212,8.34713,8.453908,4.196969,0.360982,8.194501,2.231279,7.361567,0.0,0.0,6.205037,0.360982,0.0,4.304657,0.360982,0.889707,0.889707
5,17.398084,16.064835,0.894294,0.51541,0.0,2.404276,13.971283,0.0,7.604701,0.0,10.640724,0.0,12.293564,0.0,0.0,0.0,0.0,2.148934,0.0,0.0,0.0,10.685423,0.0,0.0,0.0,13.316805,0.0,0.0,0.0,1.19415,0.0,2.621149,7.968454,0.0,0.0,1.838427,2.8954,2.404276,0.0,3.881645,...,0.0,9.007097,0.0,2.282232,11.564378,3.05299,0.0,0.0,0.894294,6.798932,2.002018,6.012269,0.51541,0.0,0.0,6.258687,13.442477,2.976345,4.809939,0.0,6.9668,3.195033,6.646511,3.552648,9.409034,2.282232,9.645504,4.853447,5.178245,8.830585,5.817332,1.653885,3.05299,6.563805,0.0,0.51541,6.337401,7.436104,0.894294,1.838427
6,0.0,0.0,0.0,5.683107,6.880294,0.0,0.0,0.0,2.191152,0.0,0.0,2.042924,0.0,10.817295,3.602386,0.0,0.0,5.756327,0.919683,0.531868,0.0,0.0,0.531868,0.0,0.0,0.0,0.0,9.078033,0.0,0.0,7.742188,0.0,0.0,0.919683,6.570924,0.0,0.0,9.132911,5.009356,4.200646,...,1.476796,0.531868,0.919683,3.80003,6.923316,2.561717,0.0,0.0,0.531868,5.779932,0.919683,6.140568,0.0,0.0,0.919683,2.191152,4.128937,0.0,4.301946,0.531868,0.531868,0.531868,6.891176,5.79159,3.492379,8.41999,7.19319,0.919683,0.531868,3.309846,6.131423,2.856308,0.0,4.673596,0.0,0.531868,0.919683,5.683107,2.561717,2.191152
7,17.60456,15.748748,0.0,0.441802,1.65526,0.441802,12.303729,2.500241,6.93619,0.0,9.570964,0.441802,13.032378,2.196261,0.0,0.0,0.0,2.196261,0.0,2.484628,0.0,11.688014,0.441802,0.0,0.0,5.604463,0.0,0.0,0.0,0.0,0.0,5.297833,11.066547,0.441802,0.441802,8.208849,5.943729,2.078849,0.0,2.588829,...,0.0,5.784046,0.0,0.779554,9.601893,2.67229,0.0,0.0,0.0,10.887396,1.481041,5.726632,0.0,0.0,0.0,5.656628,8.103786,1.282855,9.49061,0.0,5.023694,2.405856,6.775617,2.500241,13.904296,5.811918,8.126162,2.078849,6.117577,6.259274,4.036213,0.441802,0.0,7.404273,0.0,0.0,8.403067,5.310914,0.0,2.500241
8,0.0,1.683921,0.352307,0.0,0.0,0.352307,1.554049,0.635336,0.0,0.0,0.0,0.0,1.074848,1.411318,0.0,0.635336,0.635336,0.352307,0.635336,0.0,0.0,0.635336,0.0,0.0,0.0,0.0,0.352307,0.352307,0.352307,0.635336,0.0,0.0,0.0,0.0,0.0,0.352307,1.913148,3.173463,0.0,0.352307,...,3.835722,0.0,4.663515,3.217045,1.074848,10.243912,3.943406,1.554049,0.0,6.092121,1.803062,6.671973,8.498662,0.0,0.0,5.281609,4.019204,3.943406,2.015391,0.0,8.200766,1.411318,3.173463,3.778713,4.615481,6.996547,4.386604,2.015391,11.455713,0.635336,3.217045,0.635336,0.0,7.41657,0.0,0.635336,3.379247,7.754473,2.015391,0.635336
9,17.853785,16.780578,0.724214,0.0,0.0,2.635987,14.208814,0.724214,15.988685,0.0,10.568764,0.0,3.742858,0.0,0.0,0.0,0.0,1.204141,0.0,1.204141,0.0,10.994205,0.0,0.0,0.0,7.742835,0.0,0.0,0.0,0.0,0.0,5.326243,3.939593,0.0,0.0,0.0,6.19715,1.204141,0.0,1.204141,...,0.0,2.910733,0.0,8.071307,9.113463,6.005483,0.0,0.0,5.438951,7.81979,0.0,4.613054,0.724214,0.724214,2.090853,5.326243,10.257565,2.090853,5.749848,0.0,2.296311,1.563646,4.574095,2.77989,11.163694,3.939593,9.350933,2.910733,4.534055,8.63213,3.99963,0.0,2.090853,3.595014,0.0,2.77989,8.245676,6.118104,0.0,2.476122
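
To make the scoring step more concrete, the sketch below computes a chi-square score per feature with NumPy on a toy matrix. It is an illustration of the statistic under the assumption of non-negative features, not scikit-learn's code; the function name `chi2_scores` and the toy data are hypothetical.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square statistic between each non-negative feature and the class label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership matrix, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                       # per-class feature totals
    feature_total = X.sum(axis=0)            # overall feature totals
    class_prob = Y.mean(axis=0)              # class frequencies
    expected = np.outer(class_prob, feature_total)
    # All-zero features give 0/0 here; the warning is suppressed and the score is NaN.
    with np.errstate(divide='ignore', invalid='ignore'):
        return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 separates the two classes, feature 1 is constant.
X_toy = np.array([[5, 1], [5, 1], [0, 1], [0, 1]])
y_toy = np.array([0, 0, 1, 1])
scores = chi2_scores(X_toy, y_toy)
top_features = np.argsort(-scores)[:1]       # indices of the k best features, k = 1
print(scores, top_features)
```

A feature whose totals differ strongly between classes receives a high score, while a feature distributed identically across classes scores zero.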


### Reshape and Split the Data

In [25]:
# change to pandas dataframe
data = r_data_ext.toPandas()
indexed_label = indexed_r_label.toPandas()
# transfer to numpy
data_np = np.array(data)
indexed_label_np = np.array(indexed_label)

# reshape to fit the LSTM expectation
data_np = data_np.reshape(data_np.shape[0],1,data_np.shape[1])
indexed_label_np = indexed_label_np.reshape(indexed_label_np.shape[0],1,indexed_label_np.shape[1])

In [26]:
# split the data to be 70/30
X_train, X_test, y_train, y_test = train_test_split(data_np, indexed_label_np, test_size=0.3, random_state = 123)
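
The reshape above can be checked on a small array. This sketch assumes the same layout as the notebook, `(samples, timesteps, features)` with a single timestep; the toy arrays and variable names are hypothetical.

```python
import numpy as np

# Toy stand-ins: 4 samples with 3 features, and 4 one-hot labels over 2 classes.
features = np.arange(12, dtype=float).reshape(4, 3)
labels = np.eye(2)[[0, 1, 1, 0]]

# Insert a timestep axis of length 1, as an LSTM layer expects 3-D input.
features_3d = features.reshape(features.shape[0], 1, features.shape[1])
labels_3d = labels.reshape(labels.shape[0], 1, labels.shape[1])

# Dropping the timestep axis again recovers the original 2-D labels.
labels_2d = labels_3d.reshape(-1, labels_3d.shape[-1])

print(features_3d.shape, labels_3d.shape, labels_2d.shape)
```

The reshape only adds or removes an axis of length 1, so no values are reordered.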

*Training Data and Test Data Overview*

In [28]:
# convert one-hot or probability rows back to class labels
# (the index order follows the StringIndexer mapping shown above)
CLASS_NAMES = ['BRCA', 'KIRC', 'LUAD', 'PRAD', 'COAD']

def pred_trans(x):
  # np.argmax returns the position of the first maximum in each row
  return [CLASS_NAMES[i] for i in np.argmax(x, axis=1)]

In [29]:
# convert to 2D for visualization
y_train = y_train.reshape(-1,5)
y_test = y_test.reshape(-1,5)

# convert the training targets to labels
y_train_label = pred_trans(y_train)

# convert the test targets to labels
y_test_label = pred_trans(y_test)

In [30]:
labels = ['BRCA','KIRC','LUAD','PRAD','COAD']

# Create subplots
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values = \
              [y_train_label.count('BRCA'),y_train_label.count('KIRC'),y_train_label.count('LUAD'),y_train_label.count('PRAD'),y_train_label.count('COAD')], name="Training Label"), \
              1, 1)
fig.add_trace(go.Pie(labels=labels, values = \
              [y_test_label.count('BRCA'),y_test_label.count('KIRC'),y_test_label.count('LUAD'),y_test_label.count('PRAD'),y_test_label.count('COAD')], name="Test Label"), \
              1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.6, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Training Label and Test Label Distribution",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Training Label', x=0.18, y=0.5, font_size=15, showarrow=False),
                 dict(text='Test Label', x=0.82, y=0.5, font_size=15, showarrow=False)])
fig.show()

The training data and test data share a similar class distribution, which reduces the risk of sampling bias in the later validation and testing steps.
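
The similarity of the two distributions can also be expressed as a number. The sketch below is one possible check, using hypothetical helper names and toy labels: it computes the class proportions of each split and reports the largest absolute difference.

```python
from collections import Counter

def class_proportions(labels):
    """Share of each class among the given labels."""
    counts = Counter(labels)
    total = len(labels)
    return {name: counts[name] / total for name in sorted(counts)}

def max_proportion_gap(train_labels, test_labels):
    """Largest absolute difference in class share between two splits."""
    p_train = class_proportions(train_labels)
    p_test = class_proportions(test_labels)
    classes = set(p_train) | set(p_test)
    return max(abs(p_train.get(c, 0.0) - p_test.get(c, 0.0)) for c in classes)

# Toy splits with identical class shares (60% BRCA, 40% KIRC).
toy_train = ['BRCA'] * 6 + ['KIRC'] * 4
toy_test = ['BRCA'] * 3 + ['KIRC'] * 2
gap = max_proportion_gap(toy_train, toy_test)
print(class_proportions(toy_train), gap)
```

With the notebook's variables, the call would be `max_proportion_gap(y_train_label, y_test_label)`. If the gap turned out to be large, scikit-learn's `train_test_split` accepts a `stratify` argument that keeps class proportions equal across the splits.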

## Deep Learning Architecture

<img src="https://github.com/StacyYin/Data-Visualization-/raw/master/PanCan.jpg" alt="Image" border="0">

### Set the Hyperparameters

Hyperparameters were initialized based on the number of input features and the number of target classes.

In [36]:
# Hyperparameter Initialization
Num_Class = 5
Num_Inputs = X_train.shape[2]
Num_Hidden = 500

print('Num_Class: ', Num_Class)
print('\nNum_Inputs: ', Num_Inputs)
print('\nNum_Hidden: ', Num_Hidden)

### Model Construction

A basic model can be improved by tuning its hyperparameters. Here the learning rate and the Adam decay were chosen for tuning.

#### Learning Rate / Adam Decay Tuning Based on Gaussian Process

The learning rate is an important factor that determines how quickly the model adapts to the problem. A learning rate that is too large may overshoot the optimum, so a range of values from 1e-5 to 1e-3 was searched.
The Adam decay was searched over the same range.
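
A range that spans two orders of magnitude is usually sampled on a log scale, which is what a log-uniform prior does. The sketch below illustrates the idea with the standard library only; it is not how Scikit-Optimize draws its samples, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
draws = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(1000)]

# Under a log-uniform prior roughly half of the draws fall below 1e-4,
# the geometric midpoint of the range; a plain uniform prior would put
# only about 9% of the draws there.
share_below_1e4 = sum(d < 1e-4 for d in draws) / len(draws)
print(round(share_below_1e4, 2))
```

This matters for small learning rates: with a plain uniform prior, most trials would land near the upper end of the range.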

In [41]:
# set the range of learning rate and adam decay
dim_learning_rate = Real(low=1e-5, high=1e-3, prior='log-uniform',
                         name='learning_rate')
dim_adam_decay = Real(low=1e-5,high=1e-3,name="adam_decay")

# tuning part
dimensions = [dim_learning_rate, dim_adam_decay]
# set the default value for start
default_parameters = [1e-3, 1e-3]

# create the model based on Sequential layers
def create_model(learning_rate, adam_decay):
    model = Sequential()
    
    model.add(LSTM(Num_Hidden, input_shape = (1, Num_Inputs), activity_regularizer = regularizers.l2(1e-5), return_sequences = True))
    model.add(Activation('relu'))
    model.add(Dropout(rate=0.1))

    model.add(Dense(Num_Class))
    model.add(Activation('softmax')) 

    # setup our optimizer and compile
    adam = Adam(lr = learning_rate, decay = adam_decay)
    model.compile(optimizer = adam, loss = 'categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [42]:
@use_named_args(dimensions=dimensions)
def fitness(learning_rate, adam_decay):
  
    model = create_model(learning_rate = learning_rate,
                         adam_decay = adam_decay)
    
    
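    # fit the model; the targets are reshaped to 3D to match the
    # (batch, 1, Num_Class) output of the network
    b_box = model.fit(x=X_train,
                      y=y_train.reshape(-1, 1, Num_Class),
                      epochs = 5,
                      batch_size = 128,
                      validation_split=0.1,
                      )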
    # return the validation loss for the last epoch
    loss = b_box.history['val_loss'][-1]

    # delete the Keras model with these hyper-parameters from memory
    del model
    
    # clear the Keras session, otherwise it will keep adding new
    # models to the same TensorFlow graph each time we create
    # a model with a different set of hyper-parameters
    K.clear_session()
    tensorflow.reset_default_graph()
    
    # the optimizer aims for the lowest score
    return loss

#### Bayesian Optimization

Bayesian optimization is an approach to optimizing objective functions. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample.
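
The loop described above can be sketched in one dimension with NumPy. This is a simplified illustration under stated assumptions, not the Scikit-Optimize implementation: it uses a zero-mean Gaussian process with a fixed RBF kernel, a lower-confidence-bound acquisition function, and a fixed grid of candidates. All function names and the toy objective are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_query)            # shape (n_obs, n_query)
    mean = k_cross.T @ np.linalg.solve(k_obs, y_obs)
    solved = np.linalg.solve(k_obs, k_cross)
    variance = 1.0 - np.sum(k_cross * solved, axis=0)
    return mean, np.sqrt(np.clip(variance, 0.0, None))

def minimize_1d(objective, low, high, n_calls=12, n_initial=3, kappa=2.0, seed=0):
    """Surrogate-based minimization: fit the GP, pick the next point, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(low, high, 200)
    x_obs = rng.uniform(low, high, size=n_initial)  # random starting points
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_calls - n_initial):
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        # Lower confidence bound: favors a low predicted value or high uncertainty.
        x_next = candidates[np.argmin(mean - kappa * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = int(np.argmin(y_obs))
    return x_obs[best], y_obs[best], x_obs

def toy_objective(x):
    """Toy function with its minimum at x = 0.3."""
    return (x - 0.3) ** 2

best_x, best_y, evaluated_x = minimize_1d(toy_objective, 0.0, 1.0)
print(best_x, best_y)
```

The `kappa` parameter plays the same role as in the notebook's search: larger values put more weight on uncertain regions and therefore favor exploration over exploitation.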

In [45]:
# Parameters Searching
gp_result = gp_minimize(func = fitness,
                        dimensions = dimensions,
                        n_calls = 15, #minimum value is 12
                        noise = 0.0001,
                        n_jobs = -1,
                        kappa = 5,
                        x0 = default_parameters)

#### Best Model for Training

*Optimal Hyperparameters*

In [48]:
# fetch the optimal parameters
optimal_learning_rate = gp_result.x[0] # the optimal gp_result.x[0]  is 0.0001
optimal_Adam_decay =gp_result.x[1] # the optimal gp_result.x[1] is 0.0001

print(optimal_learning_rate)
print(optimal_Adam_decay)

In [49]:
# create the model framework
model = create_model(optimal_learning_rate,optimal_Adam_decay)

# get the model overview
model.summary()

## Train the Neural Network with Early Stopping

In [51]:
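# show the training result - running up to 10 epochs with early stopping
# (targets are reshaped to 3D to match the (batch, 1, Num_Class) model output)
history = model.fit(X_train, y_train.reshape(-1, 1, Num_Class), epochs = 10, batch_size = 128, validation_split = 0.1, callbacks =[EarlyStopping(monitor='val_acc', mode='max')])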
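
The stopping rule can be illustrated without Keras. The sketch below is a simplified model of patience-based early stopping, not the Keras callback itself; the function name and the toy accuracy values are hypothetical. It assumes a `patience` of 0, which is the Keras default and what the call above uses.

```python
def early_stopping_epoch(values, patience=0, mode='max'):
    """Return the index of the epoch at which patience-based early stopping halts."""
    best = float('-inf') if mode == 'max' else float('inf')
    wait = 0
    for epoch, value in enumerate(values):
        improved = value > best if mode == 'max' else value < best
        if improved:
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # The monitored metric improved often enough: training runs to the end.
    return len(values) - 1

toy_val_acc = [0.80, 0.90, 0.90, 0.95]
stop_default = early_stopping_epoch(toy_val_acc)              # patience = 0
stop_patient = early_stopping_epoch(toy_val_acc, patience=2)
print(stop_default, stop_patient)
```

With a patience of 0, training halts at the first epoch that fails to improve the monitored metric, so a single plateau ends the run. A larger `patience` tolerates short plateaus.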

## Model Evaluation and Prediction

*Model Evaluation*

10% of the training data was held out for validation and is used to evaluate the model performance.

*Validation Loss*

The loss is the categorical cross-entropy, which measures how far the predicted class probabilities are from the true labels. Lower values indicate better classification.
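
For a single sample, the categorical cross-entropy reduces to the negative log of the probability assigned to the true class. The sketch below works through one case in plain Python. The probability row is of the kind the model outputs, while the true labels are hypothetical and chosen only for illustration.

```python
import math

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    # Probabilities are clipped to avoid taking log(0).
    return -sum(
        t * math.log(min(max(p, eps), 1.0 - eps))
        for t, p in zip(y_true, y_prob)
    )

# Class order: BRCA, KIRC, LUAD, PRAD, COAD
probs = [0.0687, 0.0657, 0.1477, 0.038, 0.6799]

# Hypothetical case 1: the true class is COAD, which the model favors.
loss_when_right = categorical_cross_entropy([0, 0, 0, 0, 1], probs)
# Hypothetical case 2: the true class is PRAD, which the model rates as unlikely.
loss_when_wrong = categorical_cross_entropy([0, 0, 0, 1, 0], probs)
print(loss_when_right, loss_when_wrong)
```

The loss is small when the true class receives a high probability and grows quickly as that probability approaches zero.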

In [55]:
# display the validation loss as the epochs rise
fig1 = go.Figure()
fig1.add_trace(go.Scatter(y=history.history['val_loss'],
                    mode='lines',
                    name='validation_loss'))
annotations = []
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Validation Loss as Epochs Rise',
                              font=dict(family='Arial',
                                        size=20,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig1.update_layout(annotations = annotations)
fig1.update_xaxes(title_text='Epoch')
fig1.update_yaxes(title_text='Loss')
fig1.show()

As the number of epochs increases, the validation loss drops substantially.

*Validation Accuracy* 

Accuracy is the number of correct predictions divided by the total number of predictions.

In [58]:
# display the validation accuracy as the epochs rise
fig2 = go.Figure()
fig2.add_trace(go.Scatter(y=history.history['val_acc'],
                    mode='lines+markers',
                    name='validation_accuracy'))
annotations = []
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Validation Accuracy as Epochs Rise',
                              font=dict(family='Arial',
                                        size=20,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig2.update_layout(annotations = annotations)
fig2.update_xaxes(title_text='Epoch')
fig2.update_yaxes(title_text='Accuracy')
fig2.update_yaxes(range=[0.8, 1])
fig2.show()

The validation accuracy reached a high level after training.

*Model Prediction*

In [61]:
# predict the test result
y_test_pred = model.predict(X_test)
# round to 4 decimals
y_test_pred_r2 = np.around(y_test_pred, decimals = 4)
# convert back to 2 dimensions
y_test_pred_r2 = y_test_pred_r2.reshape(-1,5)

In [62]:
# show the test result
y_test_pred_pd = pd.DataFrame(y_test_pred_r2, columns=["BRCA", "KIRC", "LUAD","PRAD","COAD"])
y_test_pred_pd.index.name = 'PID'
y_test_pred_pd.head(5)

PID,BRCA,KIRC,LUAD,PRAD,COAD
0,0.0687,0.0657,0.1477,0.038,0.6799
1,0.0979,0.0346,0.7377,0.0708,0.059
2,0.0665,0.0132,0.8636,0.0291,0.0275
3,0.0406,0.8731,0.04,0.0167,0.0295
4,0.1818,0.0315,0.1107,0.6506,0.0254


For each patient, the cancer type with the highest predicted probability is taken as the prediction result.
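
Applied to the five rows displayed above, the selection rule looks as follows. This is a plain-Python sketch with a hypothetical helper, `top_class`; the class order is the one used for the table columns.

```python
CLASS_NAMES = ['BRCA', 'KIRC', 'LUAD', 'PRAD', 'COAD']

# The five probability rows displayed above.
rows = [
    [0.0687, 0.0657, 0.1477, 0.038, 0.6799],
    [0.0979, 0.0346, 0.7377, 0.0708, 0.059],
    [0.0665, 0.0132, 0.8636, 0.0291, 0.0275],
    [0.0406, 0.8731, 0.04, 0.0167, 0.0295],
    [0.1818, 0.0315, 0.1107, 0.6506, 0.0254],
]

def top_class(row):
    """Return the class name and probability of the largest entry in a row."""
    best = max(range(len(row)), key=row.__getitem__)
    return CLASS_NAMES[best], row[best]

predicted = [top_class(row) for row in rows]
for pid, (name, prob) in enumerate(predicted):
    print(pid, name, prob)
# 0 COAD 0.6799
# 1 LUAD 0.7377
# 2 LUAD 0.8636
# 3 KIRC 0.8731
# 4 PRAD 0.6506
```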

## Test Accuracy

In [65]:
# print out the test loss and accuracy
y_test = y_test.reshape(y_test.shape[0],1,y_test.shape[1])
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test loss:', test_loss)
print('Test accuracy:', test_acc)

## Test Original Label & Test Prediction Label

In [67]:
# convert the test predictions to labels
Test_Prediction = pred_trans(y_test_pred_r2)

# convert the original test targets to labels (flattened to 2D first)
Test_Original = pred_trans(y_test.reshape(-1, 5))

labels = ['BRCA','KIRC','LUAD','PRAD','COAD']

# Create subplots
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values = \
              [Test_Original.count('BRCA'),Test_Original.count('KIRC'), \
               Test_Original.count('LUAD'),Test_Original.count('PRAD'),Test_Original.count('COAD')], name="Test Original"), \
              1, 1)
fig.add_trace(go.Pie(labels=labels, values = \
       [Test_Prediction.count('BRCA'),Test_Prediction.count('KIRC'),\
        Test_Prediction.count('LUAD'),Test_Prediction.count('PRAD'),Test_Prediction.count('COAD')], name="Test Prediction"), \
              1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.6, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Test Original and Test Prediction Distribution",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Test Original', x=0.18, y=0.5, font_size=15, showarrow=False),
                 dict(text='Test Prediction', x=0.82, y=0.5, font_size=15, showarrow=False)])
fig.show()

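The predicted label distribution closely matches the original test label distribution, which suggests that the method identifies tumor types effectively. Matching distributions alone do not establish per-sample agreement; the test accuracy above measures that directly.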
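
A confusion matrix complements the two pie charts, because it compares the labels sample by sample rather than in aggregate. The sketch below is a plain-Python illustration with hypothetical helper names and toy labels; with the notebook's variables, the inputs would be `Test_Original` and `Test_Prediction`.

```python
def confusion_matrix(true_labels, predicted_labels, class_names):
    """Count how often each true class (row) is predicted as each class (column)."""
    index_of = {name: i for i, name in enumerate(class_names)}
    matrix = [[0] * len(class_names) for _ in class_names]
    for true, pred in zip(true_labels, predicted_labels):
        matrix[index_of[true]][index_of[pred]] += 1
    return matrix

def accuracy_from(matrix):
    """Share of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

class_names = ['BRCA', 'KIRC', 'LUAD', 'PRAD', 'COAD']

# Toy labels: both lists have the same class counts, yet two samples are swapped.
toy_true = ['BRCA', 'KIRC', 'LUAD', 'PRAD']
toy_pred = ['KIRC', 'BRCA', 'LUAD', 'PRAD']
toy_matrix = confusion_matrix(toy_true, toy_pred, class_names)
toy_accuracy = accuracy_from(toy_matrix)
print(toy_matrix, toy_accuracy)
```

In the toy case the two label distributions are identical while only half of the samples are classified correctly, which is why a per-sample comparison is worth reporting next to the charts.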

## Save to Gold Bucket in Delta

A Delta table can easily be modified through inserts, deletes, and merges. In addition, all these modifications can be rolled back to obtain an older version of the Delta Table. That way Delta Lake offers us **flexible storage** and helps us to keep control over the changes in the data.

In [71]:
# transfer back to Spark Dataframe
y_test_pred_sp = spark.createDataFrame(y_test_pred_pd)

# add the PID column
y_test_pred_sp2 = y_test_pred_sp.repartition(1).withColumn("PID",monotonically_increasing_id())
y_test_pred_sp2 = y_test_pred_sp2.select("PID","BRCA","KIRC","LUAD","PRAD","COAD")

In [72]:
write_file_name = 'PanCan_TestPred'
write_location = f'{MOUNT_GOLD}/{write_file_name}'

In [73]:
# write to the gold bucket
y_test_pred_sp2.write.format("delta") \
               .option("overwriteSchema", "true") \
               .mode("overwrite") \
               .save(write_location)

*Load From Gold Bucket*

## Business Value - Running SQL Queries against Delta

In [76]:
db_name = 'PanCan'
spark.sql(f"""
   CREATE DATABASE IF NOT EXISTS {db_name}
   """)

In [77]:
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {db_name}.{write_file_name}
    USING DELTA
    LOCATION '{write_location}'
""")

In [78]:
test_res = spark.sql(f"""
    SELECT * FROM {db_name}.{write_file_name}
    LIMIT 5
""")
display(test_res)

PID,BRCA,KIRC,LUAD,PRAD,COAD
0,0.0687,0.0657,0.1477,0.038,0.6799
1,0.0979,0.0346,0.7377,0.0708,0.059
2,0.0665,0.0132,0.8636,0.0291,0.0275
3,0.0406,0.8731,0.04,0.0167,0.0295
4,0.1818,0.0315,0.1107,0.6506,0.0254


*Running SQL Queries on View*

In [80]:
# load delta format tables from gold bucket
test_pred_sp = spark.read.format('delta').load(write_location)

In [81]:
# create a temp table
table_name = 'test_pred'
test_pred_sp.createOrReplaceTempView(table_name)

In [82]:
%sql
-- select specific value range
SELECT *
FROM test_pred
WHERE KIRC > 0.6
ORDER BY KIRC DESC

PID,BRCA,KIRC,LUAD,PRAD,COAD
116,0.0254,0.9167,0.0267,0.0099,0.0213
15,0.0237,0.916,0.0248,0.0116,0.0239
232,0.019,0.9145,0.0304,0.011,0.0251
120,0.0336,0.9026,0.0286,0.0106,0.0246
14,0.0291,0.9024,0.0313,0.0136,0.0236
34,0.032,0.8996,0.0356,0.0132,0.0197
87,0.025,0.8989,0.0358,0.0139,0.0265
86,0.0357,0.898,0.0268,0.0135,0.0261
204,0.0303,0.8977,0.0286,0.017,0.0264
96,0.0306,0.8945,0.0366,0.0144,0.0238


## Traverse Back to Older Timestamp

In [84]:
# get any version or timestamp
y_test_pred_1 = spark.read.format("delta") \
                          .option("timestampAsOf", "2020-07-10") \
                          .load(write_location)

# create a temp table
table_name = 'test_pred_1'
y_test_pred_1.createOrReplaceTempView(table_name)

In [85]:
%sql
-- Different Version of Query on 2020-07-10 00:00:00
SELECT *
FROM test_pred_1
WHERE KIRC > 0.2
ORDER BY KIRC DESC

BRCA,KIRC,LUAD,PRAD,COAD
0.2272,0.2075,0.1933,0.1921,0.1799
0.2267,0.2069,0.1931,0.1926,0.1807
0.2276,0.2065,0.1934,0.192,0.1806
0.2264,0.2064,0.1933,0.1925,0.1813
0.227,0.2064,0.1945,0.192,0.1801
0.2266,0.2063,0.1933,0.1929,0.1809
0.2258,0.2063,0.1933,0.1932,0.1813
0.2268,0.2063,0.1938,0.1929,0.1802
0.2275,0.2062,0.1944,0.1909,0.1809
0.2269,0.2062,0.1941,0.1919,0.1809
