## Autoencoder in Feature Engineering
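Wikipedia definition
- Also called an auto-encoder, it is a type of artificial neural network used in unsupervised learning to learn efficient encodings. Its goal is to learn a representation (an encoding) of a set of data, typically for dimensionality reduction. More recently, the autoencoder concept has been widely used in generative models of data. Since 2010, some advanced AI systems built on deep learning networks have adopted stacked sparse autoencoders.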

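In short
- An autoencoder is an unsupervised learning model: the Encoder encodes the original data (X) into a latent vector, and the Decoder then transforms that latent vector into a prediction (X'),
    which should be as close as possible to the original data (X).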
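
The description above can be written as a reconstruction objective. This is only a sketch: the symbols $f_\theta$ (encoder), $g_\phi$ (decoder), $z$ (latent vector) and $n$ (number of samples) are notation assumed here, not taken from the original text, and squared error is chosen because it matches the `loss='mse'` setting used later in this notebook (up to a constant factor).

$$
z = f_\theta(X), \qquad X' = g_\phi(z), \qquad
\mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \lVert X_i - X'_i \rVert_2^2
$$

Training minimizes $\mathcal{L}$ over $\theta$ and $\phi$, which pushes X' toward X while forcing the information through the latent vector.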

Use cases
- See this article: [Medium: 7 Applications of Auto-Encoders every Data Scientist should know](https://towardsdatascience.com/6-applications-of-auto-encoders-every-data-scientist-should-know-dc703cbc892b)

**This notebook focuses on two uses: missing-value imputation and feature extraction.**

---

## Autoencoder in Imputing Missing Values

(To be completed; read alongside this [paper](https://www.sciencedirect.com/science/article/pii/S2405896318320949).)

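Steps
1. Randomly remove feature values from the original data
2. Build an autoencoder from this data
3. Run prediction on the real data that has missing values to fill them in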
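
Since this section is still unfinished, here is a minimal sketch of steps 1 and 3 in NumPy. Everything in it is an assumption rather than the method of the linked paper: the helper names `corrupt` and `impute`, the default `missing_rate`, the choice of 0 as the placeholder for a hidden value (0 is the feature mean after standardization), and `reconstruct`, a hypothetical callable standing in for a trained autoencoder's `predict`.

```python
import numpy as np

def corrupt(x, missing_rate=0.2, seed=0):
    """Step 1: randomly hide entries of complete data.

    Returns the corrupted copy (hidden entries set to 0) and the boolean
    mask marking which entries were hidden.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < missing_rate
    return np.where(mask, 0.0, x), mask

# Step 2 (not run here, assumed denoising setup): train with the corrupted
# data as input and the clean data as target, e.g.
#   autoencoder.fit(x_corrupted, x_clean, ...)

def impute(x_missing, reconstruct):
    """Step 3: fill NaN entries with the model's reconstruction.

    `reconstruct` is a hypothetical callable mapping an array to an array
    of the same shape. Observed entries are returned unchanged.
    """
    missing = np.isnan(x_missing)
    model_input = np.where(missing, 0.0, x_missing)  # same placeholder as in step 1
    x_hat = reconstruct(model_input)
    return np.where(missing, x_hat, x_missing)

# Tiny demonstration with a stand-in "model" that returns its input shifted by 10.
demo = np.array([[1.0, np.nan], [np.nan, 4.0]])
print(impute(demo, lambda a: a + 10))
```

Keeping the observed entries untouched in `impute` means the model only contributes values where data is actually missing.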

In [23]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

In [2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.boston_housing.load_data(
    path="boston_housing.npz", test_split=0.2, seed=113
)

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz
(404, 13) (404,) (102, 13) (102,)


In [8]:
scaler = StandardScaler().fit(x_train)

In [9]:
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.max(), x_train.min())
print(x_test.max(), x_test.min())

9.234847178400438 -3.8172503201932715
4.135832294709217 -3.512256695833765


## Autoencoder in Feature extraction

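Steps
1. Build an autoencoder from the original data
2. Take the latent vector produced by the autoencoder and store it as synthetic features
3. Concatenate the synthetic features with the original features and feed them to the model for training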

In [10]:
# Build a simple baseline model first, for comparison

rf = RandomForestRegressor().fit(x_train, y_train)
mean_squared_error(y_test, rf.predict(x_test))

13.97517111764706

In [18]:
# Build the autoencoder (see https://blog.csdn.net/hahajinbu/article/details/77982721); the middle layer is extracted later to build a feature extractor

def get_autoencoder():
    model_input = keras.Input(shape=(13,))
    layer_one = layers.Dense(units=128, activation='relu')(model_input)
    layer_two = layers.Dense(units=64, activation='relu')(layer_one)
    latent_layer = layers.Dense(units=32, activation='relu', name='latent_layer')(layer_two)
    layer_three = layers.Dense(units=64, activation='relu')(latent_layer)
    layer_four = layers.Dense(units=128, activation='relu')(layer_three)
    model_output = layers.Dense(units=13)(layer_four)
    
    return keras.Model(model_input, model_output)

autoencoder = get_autoencoder()
autoencoder.compile(optimizer='adam', loss='mse', metrics=['mse', 'mae'])
autoencoder.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 13)]              0         
_________________________________________________________________
dense_12 (Dense)             (None, 128)               1792      
_________________________________________________________________
dense_13 (Dense)             (None, 64)                8256      
_________________________________________________________________
latent_layer (Dense)         (None, 32)                2080      
_________________________________________________________________
dense_14 (Dense)             (None, 64)                2112      
_________________________________________________________________
dense_15 (Dense)             (None, 128)               8320      
_________________________________________________________________
dense_16 (Dense)             (None, 13)                1677
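
The `Param #` column in the summary above can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short check below is illustrative only; `dense_params` and `widths` are names introduced here, not part of the original notebook.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per unit."""
    return n_in * n_out + n_out

# Layer widths of the autoencoder above: 13 -> 128 -> 64 -> 32 -> 64 -> 128 -> 13
widths = [13, 128, 64, 32, 64, 128, 13]
counts = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
print(counts)  # [1792, 8256, 2080, 2112, 8320, 1677], matching the Param # column
```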

In [19]:
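# The model overfits, but that is left unaddressed here; this is only a simple demonstration.
# An autoencoder is trained to reconstruct its own input, so x serves as both input and target.

autoencoder.fit(x_train, x_train, validation_data=(x_test, x_test), batch_size=32, epochs=100)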

Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x187ccb656d8>

In [22]:
def get_latent_model():
    return keras.Model(autoencoder.input, autoencoder.get_layer('latent_layer').output)

latent_model = get_latent_model()
x_train_latent_vector = latent_model.predict(x_train)
x_test_latent_vector = latent_model.predict(x_test)

print(x_train_latent_vector.shape, x_test_latent_vector.shape)

(404, 32) (102, 32)


In [24]:
new_x_train = np.concatenate((x_train, x_train_latent_vector), axis=1)
new_x_test = np.concatenate((x_test, x_test_latent_vector), axis=1)

print(new_x_train.shape, new_x_test.shape)

(404, 45) (102, 45)


In [25]:
# Fit the random forest again, this time on the augmented features

new_rf = RandomForestRegressor().fit(new_x_train, y_train)
mean_squared_error(y_test, new_rf.predict(new_x_test))

12.586816833333334

In [28]:
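def experimental(times=100):
    """Count how often the RF with latent features beats the baseline RF on test MSE."""
    count = 0
    for _ in range(times):
        # Baseline: original 13 features only
        rf = RandomForestRegressor().fit(x_train, y_train)
        ori_mse = mean_squared_error(y_test, rf.predict(x_test))

        # Augmented: original features plus the 32 latent features
        new_rf = RandomForestRegressor().fit(new_x_train, y_train)
        mse = mean_squared_error(y_test, new_rf.predict(new_x_test))

        if mse < ori_mse:
            count += 1
    return count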

experimental(100)

93
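
One way to read a win count such as 93 out of 100 is a one-sided sign test: if the two feature sets were equally good, each run would be a coin flip. The helper below is a sketch using only the standard library; `sign_test_p_value` is a name introduced here, and the test treats runs as independent. In this notebook every run reuses the same train/test split and the same trained autoencoder, so only the random forest's own randomness varies, and the result says little about other splits.

```python
from math import comb

def sign_test_p_value(wins, trials):
    """One-sided P(X >= wins) for X ~ Binomial(trials, 0.5)."""
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials

print(sign_test_p_value(93, 100))
```

A small value suggests the improvement is unlikely to come from forest randomness alone, under the assumptions stated above.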