## AutoEncoder

#### Curse of dimensionality

As the number of dimensions grows, the amount of data needed to represent the space grows exponentially.

(Past a certain number of dimensions, classifier performance steadily degrades when the amount of training data stays fixed.)
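
To make the growth concrete, here is a small back-of-the-envelope sketch. It assumes each feature is split into 10 bins (an arbitrary choice for illustration) and counts how many grid cells the space then contains. The 24 features and 5,290 rows are the sizes of the training set used later in this notebook.

```python
# Count the grid cells needed to cover a d-dimensional space
# when every axis is split into `bins_per_axis` intervals.
def grid_cells(n_dims, bins_per_axis=10):  # bins_per_axis=10 is an illustrative assumption
    return bins_per_axis ** n_dims

n_samples = 5290  # rows in the training set used later in this notebook

for d in (1, 2, 3, 8, 24):
    cells = grid_cells(d)
    # Upper bound on the fraction of cells that can hold at least one sample
    coverage = min(1.0, n_samples / cells)
    print(f"d={d:2d}  cells={cells:.1e}  max coverage={coverage:.1e}")
```

With a fixed sample count, the reachable fraction of the space collapses as `d` grows, which is the intuition behind reducing dimensionality before classification.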

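An AE uses a hidden layer with fewer dimensions than the input, which lets it uncover latent variables hidden in the data.

The commonly used PCA is limited to linear projections. An AE, by contrast, benefits from the non-linearity of its neurons and from the constraints placed on the network, so it can often achieve considerably better dimensionality reduction.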
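
The linear limitation can be illustrated with a toy example. The sketch below is not part of the original analysis: it builds synthetic points on a parabola, compresses them to one dimension with PCA (written out with NumPy's SVD), and compares the result with a hand-written non-linear code. The non-linear mapping is given by hand here; an autoencoder would have to learn something similar.

```python
import numpy as np

# Toy data on a parabola: two observed features, one hidden variable t.
t = np.linspace(-1.0, 1.0, 201)
X = np.stack([t, t ** 2], axis=1)

# PCA with one component acts as a linear encoder/decoder pair.
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
w = vt[:1].T                      # (2, 1) projection matrix
z_lin = (X - mean) @ w            # linear "encoder"
X_lin = z_lin @ w.T + mean        # linear "decoder"
pca_err = np.mean(np.sum((X - X_lin) ** 2, axis=1))

# Hand-written non-linear code: keep t, rebuild the second feature as t**2.
# (An autoencoder would have to learn such a mapping; here it is given.)
z_nl = X[:, 0]
X_nl = np.stack([z_nl, z_nl ** 2], axis=1)
nl_err = np.mean(np.sum((X - X_nl) ** 2, axis=1))

print("1-component PCA reconstruction error:", pca_err)
print("non-linear 1-D code reconstruction error:", nl_err)
```

Both codes are one-dimensional, but only the non-linear one can follow the curve, so the PCA reconstruction keeps a residual error.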


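#### Difference between VAE and AE

A VAE resembles an AE but differs in one important respect. In an AE, z has no particular relationship to the training data and is simply an intermediate value of the computation, whereas in a VAE the latent variable z is a random variable with a continuous distribution. The distribution of z is learned from the data during training.
(In other words, a VAE shapes z to follow a familiar distribution that is easy to work with, such as a Gaussian.)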
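
A minimal NumPy sketch of this difference, under the assumption that the encoder outputs a mean and a log standard deviation per latent dimension (the function names and the numbers are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ae_code(mu):
    # Plain AE: the code is the deterministic encoder output.
    return mu

def vae_code(mu, log_sigma, rng):
    # VAE: the encoder outputs distribution parameters, and z is sampled
    # as z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

mu = np.array([[0.5, -1.0]])          # hypothetical encoder mean for one input
log_sigma = np.array([[-1.0, 0.0]])   # hypothetical log standard deviation

print(ae_code(mu))                    # same value on every call
print(vae_code(mu, log_sigma, rng))   # a different draw on every call
```

Writing the sample as `mu + sigma * eps` keeps the randomness in `eps`, so gradients can still flow through `mu` and `log_sigma` during training.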

http://blog.naver.com/PostView.nhn?blogId=laonple&logNo=220880813236&parentCategoryNo=&categoryNo=18&viewDate=&isShowPopularPosts=true&from=search

In [None]:
### Input files

1. 

In [3]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import norm

from keras import backend as K

from keras.layers import Input, Dense, Lambda, Layer, Add, Multiply
from keras.models import Model, Sequential
from keras.datasets import mnist

Using TensorFlow backend.


## Check the settings to change beforehand

In [4]:
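# Dataset variant: must be set before the first run

df = 'core'
# One of three values: main / all / core

# Check the list below when using 'core'
# Significant variables chosen by stepwise selection in R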
core_factors = ['DR00000136','6000201001O0','6000903016D1','FNMKFN02','6000901002D1','S41000210FD1',
'6000207003O0','DR00000052','6000906001D6','DR00000156','6000901001D3','DR00000082',
'S41000210FD2','6000902001D2','6000908001D3','6000904001D3','6000908001D2','S41B0D1009O0',
'6000901002D3','6000903001D2','6000403001O0','CO10100170O0','DR00000113','6000908001D7']


## Dataset setup

In [5]:
import pickle  # module for serializing Python objects

# Load the pickled datasets

if df == 'all':
    with open('./pickles/dataset_all.p', 'rb') as file:    # open the pickle file in binary read mode (rb)
        train_set = pickle.load(file)
        test_set = pickle.load(file)
else:
    with open('./pickles/dataset_main.p', 'rb') as file:    # 'main' and 'core' both start from the main dataset
        train_set = pickle.load(file)
        test_set = pickle.load(file)
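
The two consecutive `pickle.load` calls suggest that each file holds two objects written one after the other. The sketch below shows that pattern with small stand-in objects and a temporary file; the file name and the contents are hypothetical, not the real datasets.

```python
import os
import pickle
import tempfile

# Stand-ins for the real train/test tables (hypothetical contents).
train_demo = {"rows": 3}
test_demo = {"rows": 2}

path = os.path.join(tempfile.mkdtemp(), "dataset_demo.p")  # hypothetical file name

# Write two objects back to back into the same file.
with open(path, "wb") as f:
    pickle.dump(train_demo, f)
    pickle.dump(test_demo, f)

# Each pickle.load call consumes exactly one object, in write order.
with open(path, "rb") as f:
    train_loaded = pickle.load(f)
    test_loaded = pickle.load(f)

print(train_loaded, test_loaded)
```

The load order therefore has to match the dump order; swapping the two `load` lines would silently swap the train and test sets.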


In [6]:
# For 'core', keep only the key variables from the main dataset

if df == 'core':
    final_factors = ['key', 'industry', 'label']
    final_factors = final_factors + core_factors
    train_set = train_set[final_factors]
    test_set = test_set[final_factors]
else:
    pass

In [7]:
df, train_set.shape, test_set.shape

('core', (5290, 27), (2336, 27))

### Splitting the data and converting types

In [8]:
# Split the data
cols = train_set.columns.values
train_info = train_set[cols[0:3]]
x_train = train_set[cols[3:]]
y_train = train_set['label']
train_len = x_train.shape[0]

cols = test_set.columns.values
test_info = test_set[cols[0:3]]
x_test = test_set[cols[3:]]
y_test = test_set['label']
test_len = x_test.shape[0]

In [9]:
x_train.shape, y_train.shape

((5290, 24), (5290,))

## VAE modeling

https://towardsdatascience.com/teaching-a-variational-autoencoder-vae-to-draw-mnist-characters-978675c95776

In [44]:
m = x_train.shape[1]
n = x_train.shape[0]
m

Dimension(24)

In [45]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [51]:
tf.reset_default_graph()

batch_size = 64

# NOTE: this rebinds x_train from the feature table to a TF placeholder
x_train = tf.placeholder(dtype=tf.float32, shape=[None, m,1], name='X')
Y    = tf.placeholder(dtype=tf.float32, shape=[None, m,1], name='Y')
Y_flat = tf.reshape(Y, shape=[-1, m])
keep_prob = tf.placeholder(dtype=tf.float32, shape=(), name='keep_prob')

dec_in_channels = 1
n_latent = 8

reshaped_dim = [-1, 7, dec_in_channels]
inputs_decoder = 7 * dec_in_channels // 2  # integer division: dense-layer units must be an int


def lrelu(x, alpha=0.3):
    # leaky ReLU: x for x >= 0, alpha * x otherwise
    return tf.maximum(x, tf.multiply(x, alpha))

In [59]:
def encoder(X_in, keep_prob):
    activation = lrelu
    with tf.variable_scope("encoder", reuse=None):
        # NOTE: conv2d expects a 4-D tensor (batch, height, width, channels);
        # this reshape yields a 3-D tensor, which is what raises the ValueError below
        X = tf.reshape(X_in, shape=[-1, m,1])
        x = tf.layers.conv2d(X, filters=64, kernel_size=4, strides=2, padding='same', activation=activation)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=2, padding='same', activation=activation)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=1, padding='same', activation=activation)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.contrib.layers.flatten(x)
        mn = tf.layers.dense(x, units=n_latent)          # mean of z
        sd = 0.5 * tf.layers.dense(x, units=n_latent)    # log standard deviation of z (half the log-variance)
        epsilon = tf.random_normal(tf.stack([tf.shape(x)[0], n_latent]))
        # reparameterization: z = mean + epsilon * std
        z  = mn + tf.multiply(epsilon, tf.exp(sd))

        return z, mn, sd
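
Because the encoder computes `tf.exp(sd)` as the scale of the noise, `sd` can be read as a log standard deviation. Under that reading, the KL divergence between the encoder distribution and a standard normal has a closed form. The notebook does not show a loss term, so the NumPy sketch below is only an illustration of that standard formula, not the author's loss; the helper name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mn, sd):
    # mn: means, sd: log standard deviations, both shaped (batch, n_latent).
    # Closed-form KL( N(mn, exp(sd)^2) || N(0, I) ), summed over latent dims.
    return -0.5 * np.sum(1.0 + 2.0 * sd - mn ** 2 - np.exp(2.0 * sd), axis=1)

n_latent = 8  # same latent size as in the notebook
mn = np.zeros((2, n_latent))
sd = np.zeros((2, n_latent))
print(kl_to_standard_normal(mn, sd))        # zero when the code already is N(0, I)
print(kl_to_standard_normal(mn + 1.0, sd))  # grows as the mean moves away from 0
```

Adding a term like this to the reconstruction error is what pulls z toward the Gaussian shape described at the top of the notebook.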

In [57]:
def decoder(sampled_z, keep_prob):
    with tf.variable_scope("decoder", reuse=None):
        x = tf.layers.dense(sampled_z, units=inputs_decoder, activation=lrelu)
        x = tf.layers.dense(x, units=inputs_decoder * 2 + 1, activation=lrelu)
        x = tf.reshape(x, reshaped_dim)
        x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=2, padding='same', activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=1, padding='same', activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=1, padding='same', activation=tf.nn.relu)
        
        x = tf.contrib.layers.flatten(x)
        x = tf.layers.dense(x, units=m, activation=tf.nn.sigmoid)
        img = tf.reshape(x, shape=[-1, m,1])
        return img

In [58]:
sampled, mn, sd = encoder(x_train, keep_prob)
dec = decoder(sampled, keep_prob)

ValueError: Input 0 of layer conv2d_2 is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 24, 1]
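
The error message says the 2-D convolution received a 3-D tensor. Two ways to reconcile the shapes would be to add a dummy width axis, giving `(batch, 24, 1, 1)`, or to treat the 24 features as a 1-D sequence and switch to a 1-D convolution. The sketch below uses NumPy only to check the shapes involved; it makes no TensorFlow calls, and the mini-batch is hypothetical. It also tracks how the length changes through the three encoder layers, using the rule that a strided convolution with `padding='same'` outputs `ceil(n / stride)` positions.

```python
import math
import numpy as np

m = 24                                        # number of features, as in the notebook
batch = np.zeros((5, m), dtype=np.float32)    # hypothetical mini-batch of 5 rows

x3d = batch.reshape(-1, m, 1)        # what the encoder builds: ndim == 3
x4d = batch.reshape(-1, m, 1, 1)     # (batch, height, width, channels): ndim == 4
print(x3d.ndim, x4d.ndim)

def same_padding_length(n, stride):
    # Output length of a strided convolution with padding='same'.
    return math.ceil(n / stride)

length = m
for stride in (2, 2, 1):             # strides of the three encoder conv layers
    length = same_padding_length(length, stride)
    print("stride", stride, "-> length", length)
```

Whichever option is chosen, the decoder's reshape target and layer sizes would need to be adjusted to match, since they were also written for a different input shape.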

In [None]:
sampled

## k-fold validation

https://www.programcreek.com/python/example/91153/sklearn.model_selection.KFold

In [185]:
x_train.shape

(5290, 24)

In [186]:
from sklearn.model_selection import KFold

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(x_train):
    print("test index :", test_index)
    print("." * 80 )        
    print("train index:", train_index)
    print("=" * 80 )

test index : [   4   12   15   29   30   31   33   39   42   44   49   66   72   98
  113  130  134  137  138  142  144  148  154  159  189  202  214  217
  223  241  247  265  298  302  304  308  311  333  348  352  376  379
  380  381  396  398  402  405  444  446  451  452  487  489  496  499
  502  521  533  534  535  539  548  567  575  582  584  598  601  608
  629  634  660  666  684  690  704  725  731  735  737  799  806  812
  825  829  840  842  843  852  854  862  864  867  871  873  878  896
  898  910  942  949  965  971  979 1012 1013 1022 1034 1042 1051 1065
 1066 1084 1085 1091 1098 1108 1116 1119 1121 1122 1160 1172 1180 1203
 1216 1234 1246 1248 1261 1275 1281 1282 1285 1294 1299 1309 1326 1328
 1330 1362 1367 1379 1393 1410 1418 1419 1433 1458 1461 1471 1475 1489
 1509 1521 1540 1542 1559 1581 1590 1596 1597 1606 1607 1622 1643 1654
 1657 1666 1670 1688 1692 1696 1705 1706 1712 1722 1726 1750 1763 1765
 1768 1784 1789 1790 1791 1800 1814 1815 1816 1819 1826 1833 184

In [187]:
x_train = np.array(x_train)
x_test = np.array(x_test)

x_train = x_train.reshape(-1, original_dim)
x_test = x_test.reshape(-1, original_dim)

In [188]:
x_train.shape, x_test.shape

((5290, 24), (2336, 24))

In [189]:
vae.fit(x_train[train_index],
        x_train[train_index],
        shuffle=True,
        epochs=epochs,
        batch_size=batch_size,
        validation_data=(x_train[test_index],x_train[test_index]))

encoder = Model(x, z_mu)

Train on 4761 samples, validate on 529 samples
Epoch 1/100
...
Epoch 74/100
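
The `for` loop in the k-fold cell only prints the indices, so by the time `vae.fit` runs, `train_index` and `test_index` hold the split from the last iteration and a single fold is trained (4,761 training rows and 529 validation rows, matching the log). To validate on every fold, the fit would have to move inside the loop. The sketch below shows that loop structure with NumPy only. `kfold_indices` mimics a shuffled 10-fold split but does not reproduce scikit-learn's exact index assignment, and `fit_and_score` is a hypothetical stand-in for building and fitting a fresh model, run here on synthetic data.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=10, seed=0):
    # Shuffle once, then cut the permutation into n_splits validation folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

def fit_and_score(x_tr, x_val):
    # Hypothetical stand-in for "build a fresh model, fit on x_tr, evaluate on x_val".
    # Here the "model" is just the training mean, scored by mean squared error.
    centre = x_tr.mean(axis=0)
    return float(np.mean((x_val - centre) ** 2))

x_demo = np.random.default_rng(1).normal(size=(5290, 24))  # synthetic stand-in for x_train

scores = []
for train_idx, val_idx in kfold_indices(len(x_demo)):
    scores.append(fit_and_score(x_demo[train_idx], x_demo[val_idx]))

print(len(scores), "folds, mean validation MSE:", np.mean(scores))
```

Building a new model inside each iteration matters: reusing one fitted model across folds would let it see every validation fold during earlier training rounds.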


In [190]:
z_train = encoder.predict(x_train, batch_size=batch_size)
z_test = encoder.predict(x_test, batch_size=batch_size)

In [191]:
z_train.shape, z_test.shape

((5290, 32), (2336, 32))

In [192]:
z_train

array([[-0.7103654 ,  0.04807185,  0.34784532, ...,  0.67603946,
        -0.61882925,  0.14410692],
       [ 0.17597532, -0.07906461,  0.2999821 , ..., -0.137201  ,
         0.5381074 , -0.28696954],
       [ 0.05146371,  0.04295033, -0.44949186, ..., -0.38139904,
        -0.31941473, -0.24572591],
       ...,
       [ 1.3575836 , -0.22600272,  0.46943986, ..., -0.5714216 ,
        -0.347981  , -0.31212997],
       [ 0.02130628,  0.5950307 ,  0.81731516, ...,  0.4935007 ,
         0.13619776, -0.5710056 ],
       [-0.2728021 , -0.19573328,  1.1327322 , ..., -0.31859037,
         0.06667504, -0.14040695]], dtype=float32)

In [193]:
x_train

array([[0.0, 0.4878479898113108, -0.23557822307347814, ...,
        0.2047541382419316, 0.0, 1.0],
       [0.0, 1.8098741311576565, 0.7469760522524025, ...,
        0.27794649101708274, 0.0, 0.0],
       [0.0, 1.8098741311576565, 0.3159554267026738, ...,
        0.5434321963520068, 0.0, 0.0],
       ...,
       [0.0, 1.8098741311576565, -0.9248494901570484, ...,
        0.5434321963520068, 0.0, 1.0],
       [3.0, 1.8098741311576565, -0.923382115585833, ...,
        0.5434321963520068, 0.0, 1.0],
       [3.0, 1.8098741311576565, -0.9827779300879173, ...,
        0.5434321963520068, 0.0, 1.0]], dtype=object)
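
The `dtype=object` in the output above means NumPy is storing generic Python objects rather than a numeric array. Numeric libraries generally expect a float dtype, so an explicit cast before training is a common safeguard. The sketch below shows such a cast on a tiny hypothetical array; whether the original pipeline needed it is an assumption.

```python
import numpy as np

# Hypothetical rows mimicking the mixed Python objects shown above.
x_obj = np.array([[0.0, 0.4878, 1.0],
                  [3.0, 1.8098, 0.0]], dtype=object)

x_f32 = x_obj.astype(np.float32)   # raises an error if any cell is not numeric
print(x_obj.dtype, "->", x_f32.dtype)
```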