<a href="https://colab.research.google.com/github/JaeDoo1034/Kaggle-Study/blob/master/Keras_tuner1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install git+https://github.com/keras-team/keras-tuner.git -q

  Building wheel for keras-tuner (setup.py) ... done
  Building wheel for terminaltables (setup.py) ... done


MoA: Keras + KerasTuner best practices<br>
This notebook will teach you how to:<br>

1. Use a Keras neural network for the MoA competition
2. Use KerasTuner to find high-performing model configurations
3. Ensemble a few of the top models to generate final predictions

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
print('TF version:', tf.__version__)
print('GPU devices:', tf.config.list_physical_devices('GPU'))

TF version: 2.3.0
GPU devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In this competition, we're looking at 3 CSV files: one for training features, one for training targets (with the same number of entries and a 1:1 match between entries in the features file and those in the targets file), and one for test features. The goal is to predict the targets that correspond to the test features.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [38]:
train_features_df = pd.read_csv('/content/list-moa/train_features.csv')
train_targets_df = pd.read_csv('/content/list-moa/train_targets_scored.csv')
test_features_df = pd.read_csv('/content/list-moa/test_features.csv')

In [39]:
print('train_features_df.shape:', train_features_df.shape)
print('train_targets_df.shape:', train_targets_df.shape)
print('test_features_df.shape:', test_features_df.shape)

train_features_df.shape: (23814, 876)
train_targets_df.shape: (23814, 207)
test_features_df.shape: (3982, 876)


In [40]:
train_features_df.sample(5)

Unnamed: 0,sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,g-6,g-7,g-8,g-9,g-10,g-11,g-12,g-13,g-14,g-15,g-16,g-17,g-18,g-19,g-20,g-21,g-22,g-23,g-24,g-25,g-26,g-27,g-28,g-29,g-30,g-31,g-32,g-33,g-34,g-35,...,c-60,c-61,c-62,c-63,c-64,c-65,c-66,c-67,c-68,c-69,c-70,c-71,c-72,c-73,c-74,c-75,c-76,c-77,c-78,c-79,c-80,c-81,c-82,c-83,c-84,c-85,c-86,c-87,c-88,c-89,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
10079,id_6c85cd4e4,trt_cp,48,D1,0.4757,-0.3483,1.414,0.0394,-0.8487,0.2593,-1.093,0.2846,1.425,1.919,1.073,0.1194,0.5744,-0.4724,0.1717,-1.001,-0.4179,0.6388,-0.0236,-0.4163,-0.402,-0.9761,-0.4141,-0.3372,1.273,-0.5294,0.4446,0.092,0.1238,0.1725,0.2648,0.0,-0.6031,0.4557,0.7533,-0.0665,...,-0.4667,-0.1172,-0.0842,-0.467,-0.1413,0.1495,0.7786,0.7281,0.4203,0.3145,0.0382,-0.1721,0.2099,-0.0357,1.604,-0.4118,-0.1834,0.3494,0.7505,0.3233,-0.1547,-0.2336,0.278,0.4714,-0.0833,-0.1584,0.1073,-0.3122,0.0556,-0.1562,0.6679,0.071,-0.2761,0.5683,0.1659,1.319,0.4211,-0.6959,0.3423,0.7462
2238,id_17fa4ee67,trt_cp,72,D1,2.805,-2.438,0.6183,-0.3019,3.267,0.194,-0.3439,-2.942,-10.0,-2.484,-2.474,-0.6185,-2.189,2.982,-3.755,1.695,-3.866,-4.313,0.2701,-2.227,-2.093,0.5943,-1.311,-0.7994,-2.462,-2.915,-2.294,2.007,-2.722,-4.915,-4.893,6.278,-0.6088,7.093,-3.348,4.561,...,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-8.418,-7.106,-10.0,-10.0,-8.698,-10.0,-4.554,-10.0,-6.831,-10.0,-6.831,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-9.269,-9.792,-10.0,-7.119,-10.0,-10.0,-6.997,-10.0,-10.0,-10.0,-10.0,-10.0,-7.951,-10.0,-6.664
7516,id_50abb03c0,trt_cp,72,D1,-0.5588,0.2127,-0.227,-0.7085,0.0968,0.1418,-0.8403,0.649,-0.4622,-0.6496,0.8947,0.2252,0.1969,0.1393,0.0614,0.4187,0.6163,-0.2879,-0.5002,0.8315,0.1781,-1.039,0.2254,0.2887,-0.1393,-0.3734,0.0505,-0.5247,-0.4367,-0.0071,-0.714,-0.0466,0.2567,0.3382,0.0519,-0.8425,...,-0.4009,0.4207,0.0503,-0.4907,0.0777,0.5589,-0.003,0.5517,0.1617,-0.7854,0.1177,0.4596,0.5948,0.228,-0.253,0.6177,0.0791,-0.0006,0.129,0.9887,-0.6297,0.1991,0.4897,-0.0528,0.9308,0.209,0.5015,-0.0216,-0.3303,-0.1604,0.3827,-0.213,0.3691,0.6899,0.7629,-0.0464,0.3815,-0.6678,0.3543,0.5582
9919,id_6aee6017b,trt_cp,72,D1,0.188,0.3962,-0.4247,0.0775,-0.5159,-0.0754,0.1344,-0.2664,0.7697,0.5329,-0.1156,0.5493,-0.1281,-0.1226,0.3256,1.416,-0.1003,-0.3579,0.1666,0.4003,0.021,-0.7242,0.1414,-0.6742,-0.5831,-1.314,0.3947,0.0853,-0.0479,0.4568,0.3831,-0.0566,-0.6346,-0.1946,0.8084,-1.016,...,-0.1508,-0.084,0.4812,-0.6357,0.8844,-0.0144,0.606,0.4854,0.1618,0.9585,0.4893,-0.116,0.3082,0.547,0.5054,-0.6525,0.9185,-0.2976,0.7106,0.2445,0.6141,-0.0174,0.7465,0.5642,-0.2628,-0.2922,-0.2441,0.1024,0.7476,0.6329,1.13,0.5732,-0.6246,0.3457,-0.2364,-0.1098,0.2545,0.6055,0.3731,0.1376
7456,id_4ffecbe91,trt_cp,48,D2,-0.052,0.1522,-0.326,-0.3583,0.1277,-0.4154,0.2063,0.0296,0.0842,-0.0065,-0.4717,6.584,-0.1899,-0.5514,-1.199,0.3275,0.3137,-0.6996,-0.0847,0.2865,0.6598,-0.1713,0.0489,-0.4822,1.052,0.0282,-0.733,-0.1781,-0.5008,0.7467,0.1736,-0.1136,0.2946,0.3258,-1.34,-0.9893,...,-0.8925,-0.3634,-0.0258,-1.249,1.101,-0.7112,-0.543,-1.403,1.097,0.0821,0.9892,-0.6936,0.3966,-2.156,0.1805,-0.1953,0.5287,0.3913,-0.529,-0.5864,-0.8928,-0.2014,-0.1005,0.648,-0.3103,0.5251,-0.4881,0.2445,0.1431,-1.119,0.1273,-0.2655,-0.6797,-1.48,-2.422,-0.1806,-0.5105,-0.4192,-0.3444,0.0315


OK, so we have 2 categorical features (cp_type and cp_dose, which are strings), and everything else is numerical (assuming g-0 to c-99 are homogeneous in type).

We'll use the StringLookup and CategoryEncoding layers to encode the categorical features, and the Normalization layer to normalize the values of the numerical features.
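To make the three preprocessing steps concrete, here is a minimal plain-NumPy sketch of what they compute on a toy column. It is not the Keras API: the variable names and the toy values are illustrative assumptions, and the real Keras layers also handle details such as out-of-vocabulary tokens that this sketch leaves out.

```python
import numpy as np

# Toy stand-ins for one categorical and one numerical column (illustrative values).
cp_dose = np.array(["D1", "D2", "D1", "D1"])
g_0 = np.array([0.4757, 2.805, -0.5588, 0.188], dtype="float32")

# 1. String lookup: map each distinct string to an integer index.
vocabulary = sorted(set(cp_dose.tolist()))
token_to_index = {token: i for i, token in enumerate(vocabulary)}
indices = np.array([token_to_index[token] for token in cp_dose])

# 2. Category encoding: turn each integer index into a one-hot row.
one_hot = np.eye(len(vocabulary), dtype="float32")[indices]

# 3. Normalization: shift and scale to zero mean and unit variance,
#    using statistics computed from the training data.
mean, std = g_0.mean(), g_0.std()
g_0_normalized = (g_0 - mean) / std

print(indices)
print(one_hot)
print(g_0_normalized)
```

The point of doing this inside the model with preprocessing layers, rather than by hand as above, is that the fitted vocabulary and statistics travel with the model.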

Let's look at the targets:

In [41]:
train_targets_df.sample(5)

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,adrenergic_receptor_antagonist,akt_inhibitor,aldehyde_dehydrogenase_inhibitor,alk_inhibitor,ampk_activator,analgesic,androgen_receptor_agonist,androgen_receptor_antagonist,anesthetic_-_local,angiogenesis_inhibitor,angiotensin_receptor_antagonist,anti-inflammatory,antiarrhythmic,antibiotic,anticonvulsant,antifungal,antihistamine,antimalarial,antioxidant,antiprotozoal,antiviral,apoptosis_stimulant,aromatase_inhibitor,atm_kinase_inhibitor,atp-sensitive_potassium_channel_antagonist,atp_synthase_inhibitor,atpase_inhibitor,atr_kinase_inhibitor,aurora_kinase_inhibitor,...,protein_synthesis_inhibitor,protein_tyrosine_kinase_inhibitor,radiopaque_medium,raf_inhibitor,ras_gtpase_inhibitor,retinoid_receptor_agonist,retinoid_receptor_antagonist,rho_associated_kinase_inhibitor,ribonucleoside_reductase_inhibitor,rna_polymerase_inhibitor,serotonin_receptor_agonist,serotonin_receptor_antagonist,serotonin_reuptake_inhibitor,sigma_receptor_agonist,sigma_receptor_antagonist,smoothened_receptor_antagonist,sodium_channel_inhibitor,sphingosine_receptor_agonist,src_inhibitor,steroid,syk_inhibitor,tachykinin_antagonist,tgf-beta_receptor_inhibitor,thrombin_inhibitor,thymidylate_synthase_inhibitor,tlr_agonist,tlr_antagonist,tnf_inhibitor,topoisomerase_inhibitor,transient_receptor_potential_channel_antagonist,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
6415,id_44eb38840,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7632,id_51f2848c8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6028,id_40e204e3f,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
19730,id_d3b4afc0f,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10428,id_70668100b,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


The targets are binary indicators (0 or 1) across 206 different categories. So our model should output a probability score between 0 and 1 (sigmoid activation) across 206 outputs.
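As a small illustration of what a sigmoid output over 206 targets means, the NumPy sketch below applies a sigmoid to random stand-in logits. The logits, their mean of -4, and the batch size of 3 are arbitrary assumptions chosen only for the demonstration.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

num_targets = 206
rng = np.random.default_rng(0)

# Stand-in for the raw outputs of a final layer, for a batch of 3 samples.
logits = rng.normal(loc=-4.0, scale=1.0, size=(3, num_targets))
probabilities = sigmoid(logits)

print(probabilities.shape)
print(probabilities.min(), probabilities.max())
# Each target gets its own independent probability, so rows need not sum to 1.
print(probabilities.sum(axis=1))
```

This independence is why sigmoid, rather than softmax, fits a problem where several targets can be positive for the same sample.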

The sample submission format matches these expectations:

In [42]:
sample_submission_df = pd.read_csv('/content/list-moa/sample_submission.csv')
sample_submission_df.sample(5)

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,adrenergic_receptor_antagonist,akt_inhibitor,aldehyde_dehydrogenase_inhibitor,alk_inhibitor,ampk_activator,analgesic,androgen_receptor_agonist,androgen_receptor_antagonist,anesthetic_-_local,angiogenesis_inhibitor,angiotensin_receptor_antagonist,anti-inflammatory,antiarrhythmic,antibiotic,anticonvulsant,antifungal,antihistamine,antimalarial,antioxidant,antiprotozoal,antiviral,apoptosis_stimulant,aromatase_inhibitor,atm_kinase_inhibitor,atp-sensitive_potassium_channel_antagonist,atp_synthase_inhibitor,atpase_inhibitor,atr_kinase_inhibitor,aurora_kinase_inhibitor,...,protein_synthesis_inhibitor,protein_tyrosine_kinase_inhibitor,radiopaque_medium,raf_inhibitor,ras_gtpase_inhibitor,retinoid_receptor_agonist,retinoid_receptor_antagonist,rho_associated_kinase_inhibitor,ribonucleoside_reductase_inhibitor,rna_polymerase_inhibitor,serotonin_receptor_agonist,serotonin_receptor_antagonist,serotonin_reuptake_inhibitor,sigma_receptor_agonist,sigma_receptor_antagonist,smoothened_receptor_antagonist,sodium_channel_inhibitor,sphingosine_receptor_agonist,src_inhibitor,steroid,syk_inhibitor,tachykinin_antagonist,tgf-beta_receptor_inhibitor,thrombin_inhibitor,thymidylate_synthase_inhibitor,tlr_agonist,tlr_antagonist,tnf_inhibitor,topoisomerase_inhibitor,transient_receptor_potential_channel_antagonist,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
1473,id_5e735ae45,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
330,id_1518e7523,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
1336,id_567bc5801,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
3016,id_c331586a7,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
3697,id_ed81c8512,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5


Out of 23,814 samples, how often is each of the 206 target indicators positive?

In [43]:
for target_name in list(train_targets_df)[1:]:
  rate = float(sum(train_targets_df[target_name])) / len(train_targets_df)
  print('%.4f percent positivity rate for %s' % (100*rate, target_name) )

0.0714 percent positivity rate for 5-alpha_reductase_inhibitor
0.0756 percent positivity rate for 11-beta-hsd1_inhibitor
0.1008 percent positivity rate for acat_inhibitor
0.7979 percent positivity rate for acetylcholine_receptor_agonist
1.2640 percent positivity rate for acetylcholine_receptor_antagonist
0.3065 percent positivity rate for acetylcholinesterase_inhibitor
0.2268 percent positivity rate for adenosine_receptor_agonist
0.4031 percent positivity rate for adenosine_receptor_antagonist
0.0504 percent positivity rate for adenylyl_cyclase_activator
1.1338 percent positivity rate for adrenergic_receptor_agonist
1.5117 percent positivity rate for adrenergic_receptor_antagonist
0.2771 percent positivity rate for akt_inhibitor
0.0294 percent positivity rate for aldehyde_dehydrogenase_inhibitor
0.1764 percent positivity rate for alk_inhibitor
0.0504 percent positivity rate for ampk_activator
0.0504 percent positivity rate for analgesic
0.2016 percent positivity rate for androgen_recep

Two things:

- Positivity rates are very low
- Positivity rates are very heterogeneous
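To put numbers on both observations, here is a short sketch that reuses the rates printed above for the first 16 targets and converts them to approximate positive counts out of 23,814 samples. The variable names are illustrative.

```python
import numpy as np

num_samples = 23814

# Positivity rates (in percent) copied from the first 16 lines of the output above.
rates_percent = np.array([
    0.0714, 0.0756, 0.1008, 0.7979, 1.2640, 0.3065, 0.2268, 0.4031,
    0.0504, 1.1338, 1.5117, 0.2771, 0.0294, 0.1764, 0.0504, 0.0504,
])

# Approximate number of positive samples behind each rate.
positive_counts = np.rint(rates_percent / 100 * num_samples).astype(int)

# Spread between the most and least frequent of these targets.
ratio = rates_percent.max() / rates_percent.min()

print('Fewest positives:', positive_counts.min())
print('Most positives:', positive_counts.max())
print('Max/min rate ratio: %.1f' % ratio)
```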

<b>Setting aside a validation set</b><br>
Let's split the training data into a training subset and a validation subset: all of our configuration choices will be guided by performance on the validation subset. We will also keep the full training data around, which we will use to train our final production models.

In [44]:
num_train_samples = int(0.8 * len(train_features_df))

full_train_features_ids = train_features_df.pop('sig_id')
full_test_features_ids = test_features_df.pop('sig_id')
train_targets_df.pop('sig_id')

full_train_features_df = train_features_df.copy()
full_train_targets_df = train_targets_df.copy()

val_features_df = train_features_df[num_train_samples:]
train_features_df = train_features_df[:num_train_samples]
val_targets_df = train_targets_df[num_train_samples:]
train_targets_df = train_targets_df[:num_train_samples]

print('Total training samples:', len(full_train_features_df))
print('Training split samples:', len(train_features_df))
print('Validation split samples:', len(val_features_df))

Total training samples: 23814
Training split samples: 19051
Validation split samples: 4763


<b>A dumb baseline</b><br>
If you've read my book, you know you should start tough projects by computing a "dumb" baseline that will serve as your reference point. This is usually the highest score you can reach without looking at the test features (or validation features in this case). Let's use the positivity rate of each target as measured in the training subset to generate predictions for the validation subset.

In [46]:
predictions = []
for target_name in list(train_targets_df):
  rate = float(sum(train_targets_df[target_name])) / len(train_targets_df)
  predictions.append(rate)

predictions = np.array([predictions] * len(val_features_df))

targets = np.array(val_targets_df)
score = keras.losses.BinaryCrossentropy()(targets,predictions)
print('Baseline score : %.4f' % score.numpy())

Baseline score : 0.0209
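To show what this score measures, here is a hedged NumPy re-implementation of mean binary cross-entropy, applied to synthetic targets. The function name, the toy rates, and the clipping constant are assumptions for illustration; the clipping is only meant to avoid log(0), in the same spirit as Keras.

```python
import numpy as np

def mean_binary_crossentropy(targets, predictions, epsilon=1e-7):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples and targets."""
    predictions = np.clip(predictions, epsilon, 1.0 - epsilon)
    losses = -(targets * np.log(predictions)
               + (1.0 - targets) * np.log(1.0 - predictions))
    return float(losses.mean())

# Synthetic rare binary targets (the rates are made up, not the MoA rates).
rng = np.random.default_rng(0)
true_rates = np.array([0.001, 0.005, 0.02])
toy_targets = (rng.random((5000, 3)) < true_rates).astype('float64')

# Constant prediction equal to each target's observed positivity rate.
base_rates = toy_targets.mean(axis=0)
base_rate_score = mean_binary_crossentropy(
    toy_targets, np.tile(base_rates, (5000, 1)))

# Constant prediction of 0.5 everywhere, as in the sample submission.
coin_flip_score = mean_binary_crossentropy(
    toy_targets, np.full((5000, 3), 0.5))

print('Base-rate score: %.4f' % base_rate_score)
print('All-0.5 score: %.4f' % coin_flip_score)
```

With rare positives, predicting the base rate already scores far better than predicting 0.5, which is why a low baseline number is not impressive on its own.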


<b>Prepare TF datasets</b><br>
Let's turn our dataframes into tf.data.Datasets, which we will use to train our Keras models in the next step. Our datasets will yield tuples of (features, targets), where features is a dict and targets is a list. The features dict will have 3 keys: cp_type, cp_dose, and numerical_features, a single vector concatenating all numerical features.

In [48]:
feature_names = list(train_features_df)
categorical_feature_names = ['cp_type','cp_dose']
numerical_feature_names = [name for name in feature_names if name not in categorical_feature_names]

def merge_numerical_features(feature_dict):
  # Keep the categorical features as separate dict entries.
  categorical_features = {name: feature_dict[name] for name in categorical_feature_names}
  # Stack every numerical feature into a single float32 vector.
  numerical_features = tf.stack([tf.cast(feature_dict[name], 'float32') for name in numerical_feature_names])

  feature_dict = categorical_features
  feature_dict.update({'numerical_features': numerical_features})

  return feature_dict
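To show the shape of the result, here is a plain-NumPy analog of the merge applied to a single row. It is only an illustration: the function name `merge_numerical_features_np` is hypothetical, `np.array` stands in for `tf.stack`, and the row is truncated to two of the g- columns.

```python
import numpy as np

categorical_names = ['cp_type', 'cp_dose']

# One row of features, truncated to a few columns for readability.
row = {'cp_type': 'trt_cp', 'cp_time': 48, 'cp_dose': 'D1',
       'g-0': 0.4757, 'g-1': -0.3483}
numerical_names = [name for name in row if name not in categorical_names]

def merge_numerical_features_np(feature_dict):
    # Keep the categorical features as separate dict entries.
    merged = {name: feature_dict[name] for name in categorical_names}
    # Collect every numerical feature into one float32 vector.
    merged['numerical_features'] = np.array(
        [feature_dict[name] for name in numerical_names], dtype='float32')
    return merged

merged_row = merge_numerical_features_np(row)
print(sorted(merged_row))
print(merged_row['numerical_features'])
```

Note that cp_time ends up inside the numerical vector, since only cp_type and cp_dose are listed as categorical.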

In [None]:
train_features_ds = tf.data.Dataset.from_tensor_slices(dict(train_features_df))
train_features_ds