
In [1]:
!pip install git+https://github.com/keras-team/keras-tuner.git -q

  Building wheel for keras-tuner (setup.py) ... done
  Building wheel for terminaltables (setup.py) ... done


MoA: Keras + KerasTuner best practices<br>
This notebook will teach you how to:<br>

1. Use a Keras neural network for the MoA competition
2. Use KerasTuner to find high-performing model configurations
3. Ensemble a few of the top models to generate final predictions

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [3]:
print('TF version:', tf.__version__)
print('GPU devices:', tf.config.list_physical_devices('GPU'))

TF version: 2.3.0
GPU devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In this competition, we're looking at 3 CSV files: one for training features, one for training targets (with the same number of entries and a 1:1 match between entries in the features file and those in the targets file), and one for test features. The goal is to predict the targets that correspond to the test features.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
train_features_df = pd.read_csv('/content/drive/My Drive/Data/train_features.csv')
train_targets_df = pd.read_csv('/content/drive/My Drive/Data/train_targets_scored.csv')
test_features_df = pd.read_csv('/content/drive/My Drive/Data/test_features.csv')

In [6]:
print('train_features_df.shape:', train_features_df.shape)
print('train_targets_df.shape:', train_targets_df.shape)
print('test_features_df.shape:', test_features_df.shape)

train_features_df.shape: (23814, 876)
train_targets_df.shape: (23814, 207)
test_features_df.shape: (3982, 876)
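
The 1:1 match between feature rows and target rows can be sanity-checked by comparing the sig_id columns row by row. Here is a minimal sketch of that check; the two toy dataframes and the `rows_are_aligned` helper are illustrative assumptions, not part of the competition code. With the real data you would pass train_features_df and train_targets_df before sig_id is popped.

```python
import pandas as pd

# Toy stand-ins for train_features.csv and train_targets_scored.csv (hypothetical values).
features = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "g-0": [0.1, -0.3, 1.2]})
targets = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "acat_inhibitor": [0, 1, 0]})


def rows_are_aligned(features_df, targets_df, key="sig_id"):
    """Return True when both frames hold the same keys in the same order."""
    if len(features_df) != len(targets_df):
        return False
    return bool((features_df[key].to_numpy() == targets_df[key].to_numpy()).all())


aligned = rows_are_aligned(features, targets)
print("Rows aligned:", aligned)
```

If this check fails on real data, merging the two frames on sig_id before splitting would be the safer route.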


In [7]:
train_features_df.sample(5)

Unnamed: 0,sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,g-6,g-7,g-8,g-9,g-10,g-11,g-12,g-13,g-14,g-15,g-16,g-17,g-18,g-19,g-20,g-21,g-22,g-23,g-24,g-25,g-26,g-27,g-28,g-29,g-30,g-31,g-32,g-33,g-34,g-35,...,c-60,c-61,c-62,c-63,c-64,c-65,c-66,c-67,c-68,c-69,c-70,c-71,c-72,c-73,c-74,c-75,c-76,c-77,c-78,c-79,c-80,c-81,c-82,c-83,c-84,c-85,c-86,c-87,c-88,c-89,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
778,id_07ecb3e85,trt_cp,24,D1,0.3388,-0.9912,0.2248,0.3389,1.969,-0.4671,-0.1002,0.4639,-0.0698,-0.2725,-0.0814,0.544,-0.5042,-0.0908,0.1309,-0.6146,1.578,0.1536,0.2466,-1.607,-0.5536,-0.5646,0.288,-0.0313,-0.2242,-0.3258,-0.5939,-0.35,-0.3767,-0.186,-0.2111,-0.0155,-1.337,0.0639,0.0883,-0.399,...,-0.6768,-0.7074,-0.7677,0.1235,-1.236,0.0547,-0.1787,-0.0042,-0.1828,-0.2835,0.4365,-0.4902,-0.2675,-0.2333,-0.3714,-0.1162,-0.4374,-0.7873,0.6983,-0.6884,-0.2678,-1.27,-0.3111,-0.6138,-0.3663,0.1876,-0.1194,-0.4981,-0.0688,-0.2777,-0.5158,0.4552,-0.3342,-1.819,-1.068,-0.7455,0.0101,-0.5123,-0.3272,-0.5126
4641,id_31edc89ef,trt_cp,24,D1,-0.3372,-0.1891,2.586,-0.6923,-0.9179,1.591,-0.0738,0.2384,0.6933,1.644,1.162,-1.83,-0.5271,-0.1228,-0.112,-0.077,0.2116,-0.3374,0.1751,-0.2421,1.409,0.1292,-0.0394,-0.034,-0.154,-0.3701,-0.1268,0.1934,-0.3309,-0.3204,-0.5574,0.5905,-0.3395,-0.4635,0.0658,-1.026,...,0.1636,0.6459,-0.0181,-0.682,-0.4035,0.7838,0.1066,0.5167,0.0856,1.373,-0.3473,-0.1488,0.6359,0.4595,0.5128,-0.3642,-0.0334,0.5751,0.7964,0.0514,0.2861,-0.1883,0.5591,-0.1685,-0.2859,-1.461,0.9457,0.0362,0.1456,1.176,-0.1916,-0.5433,-0.08,0.3584,0.1172,0.1852,0.0601,0.5861,0.0626,-0.5796
3428,id_24cdaa657,trt_cp,48,D2,1.87,-0.0563,-0.9304,-0.9547,-0.8074,-0.7664,-0.3451,-0.5916,-0.7421,-0.0138,0.5622,1.705,-0.5498,-0.5419,0.5361,1.073,1.008,0.3818,0.1558,-0.569,-0.846,-0.5068,3.021,-0.0045,1.211,1.317,0.4085,0.3413,-0.2189,0.8226,2.211,-0.4719,-0.6977,3.5,-1.399,1.513,...,0.4574,1.27,0.1997,0.5201,-0.3351,0.7389,-0.9999,0.3063,0.862,1.19,0.9253,-0.9719,0.4908,-0.4926,0.5025,0.9047,0.4979,0.4417,-0.4749,0.6376,0.4999,1.42,-1.443,-0.9039,0.6668,0.7409,0.9675,1.055,1.396,0.1777,0.8893,0.9004,-0.3938,0.2917,1.042,-0.7796,-1.12,1.232,-0.382,-0.3763
5637,id_3caa1f427,trt_cp,48,D1,-0.2662,0.2736,1.33,0.2343,-0.049,-0.6133,0.5214,-0.481,0.0571,-0.1016,-0.1306,1.106,-1.033,0.1677,-0.32,0.4185,0.9928,-0.1261,-0.0556,0.3991,-0.1965,0.5626,0.9987,-0.3865,-1.056,-0.6101,-0.8921,-0.5305,0.0367,0.6474,0.8108,0.1015,0.1932,0.6029,0.4967,-0.091,...,0.5278,-0.6174,0.0678,0.4054,0.5273,0.6554,-0.2335,-0.2161,-0.4208,0.6477,0.3885,0.5999,1.026,0.6834,0.3242,0.8176,1.723,-0.0353,0.3669,-0.5014,0.4252,0.7724,0.1996,-0.1291,0.6842,1.268,0.7236,-0.5492,0.556,0.1482,-0.2812,0.8796,-0.2169,1.321,0.3599,-0.4869,0.1209,1.348,0.4739,0.3978
131,id_017e29d4d,trt_cp,24,D1,-0.7307,-0.6776,0.6388,0.072,-0.1736,0.4976,1.046,0.1471,0.2634,0.1624,0.1681,-0.6652,2.135,0.2751,-0.459,0.56,2.462,0.4094,-0.0337,0.3257,0.2078,0.7657,-0.2393,0.0893,0.4417,0.4058,0.2082,0.8765,0.1916,-0.432,0.8011,4.335,1.1,0.5923,0.3206,-0.6777,...,0.5443,0.8304,0.0765,0.2288,0.1586,0.1783,0.2578,-0.3513,0.5269,-0.899,-0.001,0.8235,-0.2662,-0.0614,0.6772,0.4411,0.5389,-0.4541,0.0095,0.2137,0.1142,0.3014,0.1031,0.4253,0.4498,0.2493,-0.87,0.4126,-0.1882,0.0735,-1.332,0.2909,1.133,-0.7993,0.5675,0.535,0.2375,0.6432,-0.6192,0.6445


OK, so we have 2 categorical features (cp_type and cp_dose, which are strings), and everything else is numerical (assuming the columns from g-0 through c-99 are homogeneous in type).

We'll use the StringLookup and CategoryEncoding layers to encode the categorical features, and the Normalization layer to normalize the values of the numerical features.

Let's look at the targets:

In [8]:
train_targets_df.sample(5)

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,adrenergic_receptor_antagonist,akt_inhibitor,aldehyde_dehydrogenase_inhibitor,alk_inhibitor,ampk_activator,analgesic,androgen_receptor_agonist,androgen_receptor_antagonist,anesthetic_-_local,angiogenesis_inhibitor,angiotensin_receptor_antagonist,anti-inflammatory,antiarrhythmic,antibiotic,anticonvulsant,antifungal,antihistamine,antimalarial,antioxidant,antiprotozoal,antiviral,apoptosis_stimulant,aromatase_inhibitor,atm_kinase_inhibitor,atp-sensitive_potassium_channel_antagonist,atp_synthase_inhibitor,atpase_inhibitor,atr_kinase_inhibitor,aurora_kinase_inhibitor,...,protein_synthesis_inhibitor,protein_tyrosine_kinase_inhibitor,radiopaque_medium,raf_inhibitor,ras_gtpase_inhibitor,retinoid_receptor_agonist,retinoid_receptor_antagonist,rho_associated_kinase_inhibitor,ribonucleoside_reductase_inhibitor,rna_polymerase_inhibitor,serotonin_receptor_agonist,serotonin_receptor_antagonist,serotonin_reuptake_inhibitor,sigma_receptor_agonist,sigma_receptor_antagonist,smoothened_receptor_antagonist,sodium_channel_inhibitor,sphingosine_receptor_agonist,src_inhibitor,steroid,syk_inhibitor,tachykinin_antagonist,tgf-beta_receptor_inhibitor,thrombin_inhibitor,thymidylate_synthase_inhibitor,tlr_agonist,tlr_antagonist,tnf_inhibitor,topoisomerase_inhibitor,transient_receptor_potential_channel_antagonist,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
16143,id_ad9b9a725,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
14888,id_a012b7f60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16289,id_af45a4c71,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3974,id_2a7c08eb6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
22150,id_ed9355bf5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


The targets are binary indicators (0 or 1) across 206 different categories. So our model should output a probability score between 0 and 1 (sigmoid activation) across 206 outputs.
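
To make the "one sigmoid per target" idea concrete, here is a small NumPy sketch. The logits are made-up numbers for three hypothetical targets. The point is that each output is squashed into (0, 1) independently, so the scores for one sample do not have to sum to 1, unlike a softmax.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


# One sample, three hypothetical targets (made-up logits).
logits = np.array([[-4.0, 0.0, 2.5]])
probs = sigmoid(logits)

print("Per-target probabilities:", probs)
print("Sum across targets:", probs.sum())  # not constrained to equal 1
```

This is what lets a single sample be positive for several targets at once, or for none.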

The sample submission format matches these expectations:

In [9]:
sample_submission_df = pd.read_csv('/content/drive/My Drive/Data/sample_submission.csv')
sample_submission_df.sample(5)

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,adrenergic_receptor_antagonist,akt_inhibitor,aldehyde_dehydrogenase_inhibitor,alk_inhibitor,ampk_activator,analgesic,androgen_receptor_agonist,androgen_receptor_antagonist,anesthetic_-_local,angiogenesis_inhibitor,angiotensin_receptor_antagonist,anti-inflammatory,antiarrhythmic,antibiotic,anticonvulsant,antifungal,antihistamine,antimalarial,antioxidant,antiprotozoal,antiviral,apoptosis_stimulant,aromatase_inhibitor,atm_kinase_inhibitor,atp-sensitive_potassium_channel_antagonist,atp_synthase_inhibitor,atpase_inhibitor,atr_kinase_inhibitor,aurora_kinase_inhibitor,...,protein_synthesis_inhibitor,protein_tyrosine_kinase_inhibitor,radiopaque_medium,raf_inhibitor,ras_gtpase_inhibitor,retinoid_receptor_agonist,retinoid_receptor_antagonist,rho_associated_kinase_inhibitor,ribonucleoside_reductase_inhibitor,rna_polymerase_inhibitor,serotonin_receptor_agonist,serotonin_receptor_antagonist,serotonin_reuptake_inhibitor,sigma_receptor_agonist,sigma_receptor_antagonist,smoothened_receptor_antagonist,sodium_channel_inhibitor,sphingosine_receptor_agonist,src_inhibitor,steroid,syk_inhibitor,tachykinin_antagonist,tgf-beta_receptor_inhibitor,thrombin_inhibitor,thymidylate_synthase_inhibitor,tlr_agonist,tlr_antagonist,tnf_inhibitor,topoisomerase_inhibitor,transient_receptor_potential_channel_antagonist,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
3070,id_c6d9b85a5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
1678,id_6cdd13153,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
1542,id_6388f978f,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
1297,id_53fd1f636,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
2710,id_ae1206b60,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5


Out of 23,814 samples, how often is each of the 206 target indicators positive?

In [10]:
for target_name in list(train_targets_df)[1:]:
  rate = float(sum(train_targets_df[target_name])) / len(train_targets_df)
  print('%.4f percent positivity rate for %s' % (100*rate, target_name) )

0.0714 percent positivity rate for 5-alpha_reductase_inhibitor
0.0756 percent positivity rate for 11-beta-hsd1_inhibitor
0.1008 percent positivity rate for acat_inhibitor
0.7979 percent positivity rate for acetylcholine_receptor_agonist
1.2640 percent positivity rate for acetylcholine_receptor_antagonist
0.3065 percent positivity rate for acetylcholinesterase_inhibitor
0.2268 percent positivity rate for adenosine_receptor_agonist
0.4031 percent positivity rate for adenosine_receptor_antagonist
0.0504 percent positivity rate for adenylyl_cyclase_activator
1.1338 percent positivity rate for adrenergic_receptor_agonist
1.5117 percent positivity rate for adrenergic_receptor_antagonist
0.2771 percent positivity rate for akt_inhibitor
0.0294 percent positivity rate for aldehyde_dehydrogenase_inhibitor
0.1764 percent positivity rate for alk_inhibitor
0.0504 percent positivity rate for ampk_activator
0.0504 percent positivity rate for analgesic
0.2016 percent positivity rate for androgen_recep

Two things:

- Positivity rates are very low
- Positivity rates are very heterogeneous
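
The same positivity rates can also be computed in one vectorized call, which makes it easy to sort the targets from rarest to most common. A minimal sketch on a toy dataframe follows; the column names and values are invented for illustration. With the real data you would call this on train_targets_df after dropping sig_id.

```python
import pandas as pd

# Toy binary targets (hypothetical); each column is one target indicator.
toy_targets = pd.DataFrame({
    "target_a": [0, 0, 0, 1],
    "target_b": [0, 0, 0, 0],
    "target_c": [1, 1, 0, 0],
})

# For 0/1 columns, the mean is the fraction of positive samples.
rates = toy_targets.mean()

# Express as percent and sort so the rarest targets come first.
summary = (100 * rates).sort_values()
print(summary)
```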

<b>Setting aside a validation set</b><br>
Let's split the data into a training subset and a validation subset: all of our configuration choices will be guided by performance on the validation subset, which is held out from the total available training data. We will also keep a copy of the total available training data, which we will use to train our final production models.

In [11]:
num_train_samples = int(0.8 * len(train_features_df))

full_train_features_ids = train_features_df.pop('sig_id')
full_test_features_ids = test_features_df.pop('sig_id')
train_targets_df.pop('sig_id')

full_train_features_df = train_features_df.copy()
full_train_targets_df = train_targets_df.copy()

val_features_df = train_features_df[num_train_samples:]
train_features_df = train_features_df[:num_train_samples]
val_targets_df = train_targets_df[num_train_samples:]
train_targets_df = train_targets_df[:num_train_samples]

print('Total training samples:', len(full_train_features_df))
print('Training split samples:', len(train_features_df))
print('Validation split samples:', len(val_features_df))

Total training samples: 23814
Training split samples: 19051
Validation split samples: 4763
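
The split above takes the first 80% of rows in file order. As a hedged alternative, here is a sketch of an index-based split that can optionally shuffle with a fixed seed. The `split_indices` helper and the seed value are assumptions for illustration. Whether shuffling helps depends on how the rows in the CSV are ordered, which this notebook does not check.

```python
import numpy as np


def split_indices(num_samples, train_fraction=0.8, seed=None):
    """Return (train_indices, val_indices); shuffle first when a seed is given."""
    indices = np.arange(num_samples)
    if seed is not None:
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)  # in-place, reproducible for a fixed seed
    num_train = int(train_fraction * num_samples)
    return indices[:num_train], indices[num_train:]


# Sequential split: reproduces the sizes printed above.
train_idx, val_idx = split_indices(23814)
print(len(train_idx), len(val_idx))

# Shuffled split with a hypothetical seed.
shuffled_train_idx, shuffled_val_idx = split_indices(23814, seed=42)
```

The resulting index arrays could be used with `DataFrame.iloc` on both the features and the targets, so the two stay aligned.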


<b>A dumb baseline</b><br>
If you've read my book, you know you should start tough projects by computing a "dumb" baseline that will serve as your reference point. This is usually the highest score you can reach without looking at the test features (or validation features in this case). Let's use the positivity rate of each target as measured in the training subset to generate predictions for the validation subset.

In [12]:
predictions = []
for target_name in list(train_targets_df):
  rate = float(sum(train_targets_df[target_name])) / len(train_targets_df)
  predictions.append(rate)

predictions = np.array([predictions] * len(val_features_df))

targets = np.array(val_targets_df)
score = keras.losses.BinaryCrossentropy()(targets, predictions)
print('Baseline score : %.4f' % score.numpy())

Baseline score : 0.0209
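
To see why the positivity rate is a sensible constant prediction, here is a NumPy sketch of binary cross-entropy on synthetic labels. It compares predicting each target's positivity rate with predicting 0.5 everywhere. The synthetic rates, the clipping epsilon, and the `binary_crossentropy` helper are assumptions for illustration. Unlike the cell above, the sketch measures the rates on the same data it scores.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all samples and targets."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(losses.mean())


# Synthetic, heavily imbalanced labels for three hypothetical targets.
rng = np.random.default_rng(0)
y_true = (rng.random((1000, 3)) < np.array([0.01, 0.05, 0.2])).astype("float64")

# Constant prediction: each target's observed positivity rate, repeated per row.
rates = y_true.mean(axis=0)
constant_pred = np.tile(rates, (len(y_true), 1))

baseline = binary_crossentropy(y_true, constant_pred)
half = binary_crossentropy(y_true, np.full_like(y_true, 0.5))
print("Positivity-rate baseline: %.4f" % baseline)
print("All-0.5 predictions:      %.4f" % half)
```

Predicting 0.5 everywhere always costs log(2), about 0.693, so with rare positives the rate-based constant scores far lower. That is why a real model has to beat the rate-based baseline, not the 0.5 one.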


<b>Prepare TF datasets</b><br>
Let's turn our dataframes into tf.data.Dataset objects, which we will use to train our Keras models in the next step. Our datasets will yield tuples of (features, targets), where features is a dict and targets is a vector of 206 binary labels. The features dict has 3 keys: cp_type, cp_dose, and numerical_features, a single vector that concatenates all numerical features.

In [13]:
feature_names = list(train_features_df)
categorical_feature_names = ['cp_type', 'cp_dose']
numerical_feature_names = [name for name in feature_names if name not in categorical_feature_names]

def merge_numerical_features(feature_dict):
    categorical_features = {name: feature_dict[name] for name in categorical_feature_names}
    numerical_features = tf.stack([tf.cast(feature_dict[name], 'float32') for name in numerical_feature_names])
    feature_dict = categorical_features
    feature_dict.update({'numerical_features': numerical_features})
    return feature_dict
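
To make the effect of merge_numerical_features easier to see, here is a plain NumPy analogue applied to a single row. The `merge_numerical` helper and the shortened toy row are assumptions for illustration; the real function works on tensors inside Dataset.map. Note that cp_time is not in the categorical list, so it ends up inside the numerical vector.

```python
import numpy as np

categorical_names = ["cp_type", "cp_dose"]


def merge_numerical(row):
    """Keep categorical entries as-is and pack everything else into one float32 vector."""
    numerical_names = [name for name in row if name not in categorical_names]
    merged = {name: row[name] for name in categorical_names}
    merged["numerical_features"] = np.array(
        [row[name] for name in numerical_names], dtype="float32"
    )
    return merged


# A shortened toy row (the real rows have 873 numerical columns).
row = {"cp_type": "trt_cp", "cp_time": 24, "cp_dose": "D1", "g-0": 0.3388, "c-99": -0.5126}
merged = merge_numerical(row)
print(merged)
```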

tf.data: Build TensorFlow input pipelines<br>
https://www.tensorflow.org/guide/data?hl=ko

In [31]:
train_features_ds = tf.data.Dataset.from_tensor_slices(dict(train_features_df))
train_features_ds = train_features_ds.map(lambda x: merge_numerical_features(x))

train_targets_ds = tf.data.Dataset.from_tensor_slices(np.array(train_targets_df))
train_ds = tf.data.Dataset.zip((train_features_ds, train_targets_ds))

full_train_features_ds = tf.data.Dataset.from_tensor_slices(dict(full_train_features_df))
full_train_features_ds = full_train_features_ds.map(lambda x: merge_numerical_features(x))
full_train_targets_ds = tf.data.Dataset.from_tensor_slices(np.array(full_train_targets_df))
full_train_ds = tf.data.Dataset.zip((full_train_features_ds, full_train_targets_ds))


In [32]:
# The data is prepared as a dict.
train_features_ds

<MapDataset shapes: {cp_type: (), cp_dose: (), numerical_features: (873,)}, types: {cp_type: tf.string, cp_dose: tf.string, numerical_features: tf.float32}>

In [33]:
full_train_features_ds

<MapDataset shapes: {cp_type: (), cp_dose: (), numerical_features: (873,)}, types: {cp_type: tf.string, cp_dose: tf.string, numerical_features: tf.float32}>

In [34]:
val_features_ds = tf.data.Dataset.from_tensor_slices(dict(val_features_df))
# Interesting: the data is prepared as a dict.
val_features_ds = val_features_ds.map(lambda x: merge_numerical_features(x))

val_targets_ds = tf.data.Dataset.from_tensor_slices(np.array(val_targets_df))
val_ds = tf.data.Dataset.zip((val_features_ds, val_targets_ds))

test_ds = tf.data.Dataset.from_tensor_slices(dict(test_features_df))
test_ds = test_ds.map(lambda x: merge_numerical_features(x))

In [35]:
# cardinality() reference: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
# cardinality() reports the number of elements in the dataset
print('Training split samples :', int(train_ds.cardinality()))
print('Validation split samples:',int(val_ds.cardinality()))
print('Test samples:',int(test_ds.cardinality()))

Training split samples : 19051
Validation split samples: 4763
Test samples: 3982


In [36]:
train_ds = train_ds.shuffle(1024).batch(64).prefetch(8)
full_train_ds = full_train_ds.shuffle(1024).batch(64).prefetch(8)
val_ds = val_ds.batch(64).prefetch(8)
test_ds = test_ds.batch(64).prefetch(8)
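
With a batch size of 64 and no dropped remainder, the number of batches per epoch is the sample count divided by 64, rounded up, and the last batch is usually smaller. The sketch below works that out for the three split sizes printed earlier. The helper names are made up for illustration.

```python
import math


def num_batches(num_samples, batch_size):
    """Batches per pass when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def last_batch_size(num_samples, batch_size):
    """Size of the final batch (a full batch when the division is exact)."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


split_sizes = {"train": 19051, "validation": 4763, "test": 3982}
for name, n in split_sizes.items():
    print(name, num_batches(n, 64), "batches, last one has", last_batch_size(n, 64), "samples")
```

The variable size of the last batch is one reason the batch dimension shows up as None in the dataset shapes. Also note that shuffle(1024) uses a buffer much smaller than the training split, so the shuffling is only approximate.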

In [37]:
train_ds

<PrefetchDataset shapes: ({cp_type: (None,), cp_dose: (None,), numerical_features: (None, 873)}, (None, 206)), types: ({cp_type: tf.string, cp_dose: tf.string, numerical_features: tf.float32}, tf.int64)>

Completed September 30, 2020: built the TensorFlow input pipeline.

<b>Encode our features</b><br>
We use a StringLookup + CategoryEncoding layer to index and encode our string categorical features. It's a bit overkill, since each of these features takes only two values and the layers also handle unknown values at test time, which don't occur here. But it is very general and you can't go wrong with it.

Then, we use a single Normalization layer to encode our concatenated numerical features.

Finally, we concatenate the entire feature space into a single vector.
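
As a rough mental model of these three steps, here is a NumPy sketch of string lookup, one-hot encoding, and standardization. It is not the Keras implementation. In particular, the two reserved vocabulary slots (an empty mask token and "[UNK]" for unknown values) are an assumption about the layer defaults in this TensorFlow version, and the epsilon in `standardize` is an arbitrary choice. All helper names are made up.

```python
import numpy as np


def build_vocabulary(values, reserved=("", "[UNK]")):
    """Reserved tokens first, then observed values by descending frequency."""
    uniques, counts = np.unique(values, return_counts=True)
    order = np.argsort(-counts, kind="stable")
    return list(reserved) + [str(u) for u in uniques[order]]


def lookup(values, vocabulary, oov_index=1):
    """Map strings to integer indices; unseen strings go to the unknown slot."""
    table = {token: i for i, token in enumerate(vocabulary)}
    return np.array([table.get(str(v), oov_index) for v in values])


def one_hot(indices, depth):
    """One row per index, with a single 1.0 at that index."""
    return np.eye(depth, dtype="float32")[indices]


def standardize(x, eps=1e-7):
    """Shift each column to zero mean and scale it to unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


cp_dose = np.array(["D1", "D1", "D2", "D1"])
vocab = build_vocabulary(cp_dose)
encoded = one_hot(lookup(cp_dose, vocab), depth=len(vocab))

numerical = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
normalized = standardize(numerical)

print(vocab)
print(encoded.shape)
```

Under that assumption, a two-valued feature is encoded as a 4-wide vector: two reserved slots plus two real values.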

In [38]:
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.layers.experimental.preprocessing import CategoryEncoding
from tensorflow.keras.layers.experimental.preprocessing import StringLookup

In [39]:
def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature
    normalizer = Normalization()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])

    print("1. ", feature_ds)
    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature

In [40]:
def encode_categorical_feature(feature, name, dataset):
    # Create a Lookup layer which will turn strings into integer indices
    index = StringLookup()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    print("2. ", feature_ds)
    # Learn the set of possible feature values and assign them a fixed integer index
    index.adapt(feature_ds)

    # Turn the values into integer indices
    encoded_feature = index(feature)
    print("3. ", encoded_feature)
    # Create a CategoryEncoding for our integer indices
    encoder = CategoryEncoding(output_mode="binary")

    # Prepare a dataset of indices
    feature_ds = feature_ds.map(index)
    print("4. ", feature_ds)
    # Learn the space of possible indices
    encoder.adapt(feature_ds)
    print("5. ", encoder)
    # Apply one-hot encoding to our indices
    encoded_feature = encoder(encoded_feature)
    print("6. ", encoded_feature)
    return encoded_feature

In [41]:
all_inputs = []
all_encoded_features = []

print('Processing categorical features...')
for name in categorical_feature_names:
    inputs = keras.Input(shape=(1,), name=name, dtype='string')
    encoded = encode_categorical_feature(inputs, name, train_ds)
    all_inputs.append(inputs)
    all_encoded_features.append(encoded)

print('Processing numerical features...')
numerical_inputs = keras.Input(shape=(len(numerical_feature_names),), name='numerical_features')
encoded_numerical_features = encode_numerical_feature(numerical_inputs, 'numerical_features', train_ds)

all_inputs.append(numerical_inputs)
all_encoded_features.append(encoded_numerical_features)
features = layers.Concatenate()(all_encoded_features)

Processing categorical features...
2.  <MapDataset shapes: (None,), types: tf.string>
3.  Tensor("string_lookup_4/None_lookup_table_find/LookupTableFindV2:0", shape=(None, 1), dtype=int64, device=/job:localhost/replica:0/task:0/device:CPU:0)
4.  <MapDataset shapes: (None,), types: tf.int64>
5.  <tensorflow.python.keras.layers.preprocessing.category_encoding.CategoryEncoding object at 0x7fd414076b00>
6.  Tensor("category_encoding_4/bincount/DenseBincount:0", shape=(None, 4), dtype=float32)
2.  <MapDataset shapes: (None,), types: tf.string>
3.  Tensor("string_lookup_5/None_lookup_table_find/LookupTableFindV2:0", shape=(None, 1), dtype=int64, device=/job:localhost/replica:0/task:0/device:CPU:0)
4.  <MapDataset shapes: (None,), types: tf.int64>
5.  <tensorflow.python.keras.layers.preprocessing.category_encoding.CategoryEncoding object at 0x7fd4156d72e8>
6.  Tensor("category_encoding_5/bincount/DenseBincount:0", shape=(None, 4), dtype=float32)
Processing numerical features...
1.  <MapDatase

October 1, 2020: still working through the Keras input structure.

<b>Train a basic model to establish a better baseline</b><br>
Can a simple model beat our dumb baseline? Let's try a simple logistic regression over our concatenated feature space.
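
A single Dense layer with sigmoid activation is 206 logistic regressions that share the same input vector. Here is a NumPy sketch of the forward pass and the resulting parameter count. The input width assumes two 4-wide categorical encodings plus the 873 numerical features. The random weights and input batch are placeholders, and dropout is left out because it is only active during training.

```python
import numpy as np

num_features = 4 + 4 + 873  # two categorical encodings + numerical vector
num_targets = 206

# Placeholder parameters; a trained model would have learned values here.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=(num_features, num_targets))
bias = np.zeros(num_targets)


def predict(x):
    """One linear map followed by an element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))


batch = rng.normal(size=(64, num_features))  # stand-in for one encoded batch
probs = predict(batch)
num_params = weights.size + bias.size

print("Output shape:", probs.shape)
print("Trainable parameters:", num_params)
```

With these widths, the layer would hold 881 * 206 weights plus 206 biases.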

In [42]:
x = layers.Dropout(0.5)(features)
outputs = layers.Dense(206, activation='sigmoid')(x)
basic_model = keras.Model(all_inputs, outputs)
basic_model.summary()
basic_model.compile(optimizer=keras.optimizers.RMSprop(),
                    loss=keras.losses.BinaryCrossentropy())
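# Fit on the training split only: full_train_ds also contains the validation rows,
# which would make the validation loss overly optimistic.
basic_model.fit(train_ds, epochs=10, validation_data=val_ds)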

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
cp_type (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
cp_dose (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
string_lookup_4 (StringLookup)  (None, 1)            0           cp_type[0][0]                    
__________________________________________________________________________________________________
string_lookup_5 (StringLookup)  (None, 1)            0           cp_dose[0][0]                    
_______________________________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7fd41403f2b0>