# One-Hot + DAE DNN Single Model
- **One-Hot + DAE with Dropout**

    - Tuned and compared several Dropout rates

    - Tried stacking the encoder and decoder twice -> performance dropped sharply
    - Added noise between the encoder and decoder layers -> performance dropped

    - Tried inserting noise inside the encoder layers as well
        -> performance improved, so two noise layers are used in total

- **One-Hot + DAE with GaussianNoise**

    - Tuned and compared the GaussianNoise stddev (0.01, 0.2, 0.2, 0.3)

    - Repeated the encoder and decoder layers twice -> performance dropped sharply

    - Starting from the best-scoring stddev=0.2, tried combining it with Dropout (but the score actually got worse)

- Also tried DAEs built with GaussianDropout and AlphaDropout

    - **One-Hot + DAE with GaussianDropout**

    - **One-Hot + DAE with AlphaDropout**
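
To make the difference between the noise types concrete, here is a minimal NumPy sketch of the two corruptions that Keras `Dropout` and `GaussianNoise` apply at training time. The helper names `dropout_noise` and `gaussian_noise` and the toy matrix are hypothetical; the notebook itself uses the Keras layers, not these functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_noise(x, rate, rng):
    """Masking noise: zero each entry with probability `rate`, rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def gaussian_noise(x, stddev, rng):
    """Additive noise: add zero-mean Gaussian noise with the given stddev."""
    return x + rng.normal(0.0, stddev, size=x.shape)

# Toy one-hot style input (4 customers, 8 binary features); values are made up.
x = rng.integers(0, 2, size=(4, 8)).astype(float)

x_drop = dropout_noise(x, rate=0.1, rng=rng)       # entries become 0 or 1/0.9
x_gauss = gaussian_noise(x, stddev=0.01, rng=rng)  # entries move slightly off 0/1
```

Dropout removes whole entries of the binary input, while GaussianNoise only perturbs them slightly, which is one plausible reason the two behave differently on one-hot data.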
    
< Adding features >
- Added 'goods_nm' and 'tran_date' as features
    -> in the end, kept only 'goods_nm', which improved performance

- Repeated the same process as above
1. Compare Dropout rate and GaussianNoise stddev values
2. For the best-scoring rate and stddev values only, insert Dropout noise into the encoder layers

<font color='tomato'><font color="#CC3D3D"><p>
**Final choice: the best-scoring configuration, 'added features + GaussianNoise stddev fixed at 0.01 + one Dropout in the encoder layer' -> mean=0.77234, std=0.017**

### Imports

In [9]:
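%run import_modules.py
import random
import pandas as pd
import numpy as np
import featuretools as ft
from tqdm import tqdm

import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
import keras

from keras.preprocessing import sequence
from keras.preprocessing.text import *

# Explicit imports for names used below, in case import_modules.py does not provide them
from keras.models import Model, Sequential
from keras import Input
from keras import layers
from keras.layers import Dense, Dropout, GaussianNoise
from keras import regularizers
from keras.regularizers import l2
from keras.optimizers import RMSprop
from keras.constraints import max_norm
from keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

import warnings; warnings.filterwarnings("ignore")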

### Read Data

In [2]:
df_train = pd.read_csv('X_train.csv', encoding='cp949')
df_test = pd.read_csv('X_test.csv', encoding='cp949')
y_train = pd.read_csv('y_train.csv').gender
IDtest = df_test.cust_id.unique()

df_train.head()

| | cust_id | tran_date | store_nm | goods_id | gds_grp_nm | gds_grp_mclas_nm | amount |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2007-01-19 00:00:00 | 강남점 | 127105 | 기초 화장품 | 화장품 | 850000 |
| 1 | 0 | 2007-03-30 00:00:00 | 강남점 | 342220 | 니 트 | 시티웨어 | 480000 |
| 2 | 0 | 2007-03-30 00:00:00 | 강남점 | 127105 | 기초 화장품 | 화장품 | 3000000 |
| 3 | 0 | 2007-03-30 00:00:00 | 강남점 | 342205 | 니 트 | 시티웨어 | 840000 |
| 4 | 0 | 2007-03-30 00:00:00 | 강남점 | 342220 | 상품군미지정 | 기타 | 20000 |


<font color='green'><p>
# Transform data + more features_1

In [4]:
# Build a 0/1 purchase indicator per customer for each categorical level (store_nm added)

df_all = pd.concat([df_train, df_test])

def one_hot_by_level(level):
    """Return (train, test) 0/1 matrices: 1 if the customer has any purchase in that category."""
    table = pd.pivot_table(df_all, index='cust_id', columns=level, values='amount',
                           aggfunc=lambda x: np.where(len(x) >= 1, 1, 0), fill_value=0). \
                           reset_index()
    is_test = table['cust_id'].isin(IDtest)
    train = table[~is_test].drop(columns=['cust_id']).values
    test = table[is_test].drop(columns=['cust_id']).values
    return train, test

# The pivot is computed once per level and then split into train and test rows
levels = ['gds_grp_nm', 'gds_grp_mclas_nm', 'goods_id', 'store_nm']
train_parts, test_parts = zip(*(one_hot_by_level(level) for level in levels))

train_add = np.hstack(train_parts)
test_add = np.hstack(test_parts)

train_add.shape, test_add.shape

((3500, 4203), (2482, 4203))
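
As a small illustration of what the pivot above produces, the following sketch builds the same kind of 0/1 indicator matrix on toy data with `pd.crosstab`. The toy frame and the store names 'A', 'B', 'C' are made up for this example, and `crosstab` is swapped in here for brevity; it is not what the notebook uses.

```python
import pandas as pd

# Toy transactions: customer 0 shops twice at A, customer 1 once at B,
# customer 2 once at A and twice at C. Values are hypothetical.
toy = pd.DataFrame({
    'cust_id': [0, 0, 1, 2, 2, 2],
    'store_nm': ['A', 'A', 'B', 'A', 'C', 'C'],
})

# Count purchases per (customer, store), then collapse counts to a 0/1 flag.
indicator = (pd.crosstab(toy['cust_id'], toy['store_nm']) > 0).astype(int)
print(indicator)
```

Each row is one customer and each column one category value, so repeated purchases in the same category still contribute a single 1.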

### Add noise to the encoder layer

<font color='tomato'><font color="#CC3D3D"><p>
# GaussianNoise stddev fixed at 0.01 + one Dropout in the encoder layer
    -> mean=0.77234, std=0.017

In [7]:
### GaussianNoise, stddev=0.01

# Set hyper-parameters for power mean ensemble 
N = 10
p = 3.5
GN_preds_a_4 = []
GN_aucs_a_4 = []
noise_level = 0.01

for i in tqdm(range(N)):    
    X_train, X_test = train_add, test_add

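    ##### STEP 1: Randomize Seed
    SEED = np.random.randint(1, 10000)
    random.seed(SEED)
    np.random.seed(SEED)
    # Compare the major version as an integer; TF 1.x and 2.x expose different seeding APIs
    if int(tf.__version__.split('.')[0]) < 2:
        tf.set_random_seed(SEED)
    else:
        tf.random.set_seed(SEED)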

    ##### STEP 2: Build DAE #####

    # Define the encoder dimension
    encoding_dim = 128

    # Input Layer
    input_dim = Input(shape = (X_train.shape[1], ))

    # Encoder Layers
    noise1 = GaussianNoise(noise_level)(input_dim)
    encoded1 = Dense(512, activation = 'relu')(noise1)
    noise2 = Dropout(0.1)(encoded1)
    
    encoded2 = Dense(256, activation = 'relu')(noise2)
    encoded3 = Dense(128, activation = 'relu')(encoded2)
    encoded4 = Dense(encoding_dim, activation = 'relu')(encoded3)

    # Decoder Layers
    decoded1 = Dense(128, activation = 'relu')(encoded4)
    decoded2 = Dense(256, activation = 'relu')(decoded1)
    decoded3 = Dense(512, activation = 'relu')(decoded2)
    decoded4 = Dense(X_train.shape[1], activation = 'linear')(decoded3)

    # Combine Encoder and Decoder layers
    autoencoder = Model(inputs = input_dim, outputs = decoded4)
    autoencoder.summary()

    # Compile the model
    autoencoder.compile(optimizer = 'adam', loss = 'mse')

    # Train the model
    history = autoencoder.fit(X_train, X_train, epochs=20, batch_size=64,
                              shuffle=True, validation_data=(X_test,X_test), verbose=0)

    print(f'DAE learning curve {i+1}/{N}')
    plt.plot(history.history["loss"], label="train loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.legend()
    plt.title("Loss")
    plt.show()

    ##### STEP 3: Reduce Dimension #####

    # Use a middle Bottleneck Layer to Reduce Dimension
    GN_model_a_4 = Model(inputs=input_dim, outputs=encoded4)
    X_train = GN_model_a_4.predict(X_train)
    X_test = GN_model_a_4.predict(X_test)

    ##### STEP 4: Build a DNN Model

    # Define the Model architecture
    GN_model_a_4 = Sequential()
    GN_model_a_4.add(Dense(32, activation='relu', input_shape=(X_train.shape[1],),
                           kernel_regularizer=l2(0.01), kernel_initializer='he_normal'))
    GN_model_a_4.add(Dropout(0.3))
    GN_model_a_4.add(Dense(16, activation='relu'))
    GN_model_a_4.add(Dropout(0.3))
    GN_model_a_4.add(Dense(1, activation='sigmoid'))

    # Train the Model
    GN_model_a_4.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc',tf.keras.metrics.AUC()])
    train_x, valid_x, train_y, valid_y = train_test_split(X_train, y_train, test_size=0.2)
    history = GN_model_a_4.fit(train_x, train_y, epochs=100, batch_size=64,
                               validation_data=(valid_x,valid_y),
                               callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss',patience=5),
                                            keras.callbacks.ModelCheckpoint(filepath='best_model.h5',
                                                                            monitor='val_loss',save_best_only=True)],
                               verbose=0)

    print(f'DNN learning curve {i+1}/{N}')
    plt.plot(history.history["loss"], label="train loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.legend()
    plt.title("Loss")
    plt.show()

    # Make Prediction
    auc = roc_auc_score(valid_y, GN_model_a_4.predict(valid_x).flatten())
    GN_aucs_a_4.append(auc)
    print('AUC', auc)
    GN_preds_a_4.append(GN_model_a_4.predict(X_test).flatten())   

### Validate the Models
print('\nValidation Summary:')
GN_aucs_a_4 = pd.Series(GN_aucs_a_4)
print(GN_aucs_a_4.sort_values(ascending=False))
print('mean={:.5f}, std={:.3f}'.format(GN_aucs_a_4.mean(), GN_aucs_a_4.std()))
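
One detail worth keeping in mind when reading the reported `std`: `pd.Series.std()` uses the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`). A minimal sketch with made-up AUC values (not results from this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical validation AUCs from four runs; not the notebook's results.
aucs = pd.Series([0.75, 0.77, 0.79, 0.78])

sample_std = aucs.std()                    # pandas default: ddof=1
population_std = np.std(aucs.to_numpy())   # NumPy default: ddof=0

print('mean={:.5f}, std={:.3f}'.format(aucs.mean(), sample_std))
print('population std={:.3f}'.format(population_std))
```

With only N = 10 runs the two differ by a factor of sqrt(10/9), so the choice matters little here, but it should be stated when comparing against numbers computed with NumPy.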


In [168]:
# Power mean ensemble
THRESHOLD = 0.77  # Use only models whose AUC exceeds this value

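pred = 0
n = 0
for i in range(N):
    if GN_aucs_a_4.iloc[i] > THRESHOLD:
        pred = pred + GN_preds_a_4[i]**p 
        n += 1
# Guard against an empty selection, which would otherwise divide by zero
if n == 0:
    raise ValueError('No model exceeded THRESHOLD; lower it before ensembling.')
pred = pred / n    
pred = pred**(1/p)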

# Make a submission file
#t = pd.Timestamp.now()
#fname = f"dae_p{p}n{n}_submit_{t.month:02}{t.day:02}{t.hour:02}{t.minute:02}.csv"
submissions = pd.concat([pd.Series(IDtest, name="cust_id"), pd.Series(pred, name="gender")] ,axis=1)
submissions.to_csv('dae_GN_a_4.csv', index=False)
#print(f"'{fname}' is ready to submit.")

# END
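
The power mean ensemble above can be sketched as a standalone function. For selected predictions x_1..x_n it computes (mean(x_i^p))^(1/p); with p > 1 the result leans toward the larger predictions compared with a plain average. The function name `power_mean_ensemble` and the toy inputs are hypothetical and only illustrate the calculation.

```python
import numpy as np

def power_mean_ensemble(preds, scores, p=3.5, threshold=0.77):
    """Power-mean blend of the predictions whose score exceeds `threshold`."""
    selected = [np.asarray(pr, dtype=float)
                for pr, s in zip(preds, scores) if s > threshold]
    if not selected:
        raise ValueError('No model exceeded the threshold.')
    stacked = np.stack(selected)                 # shape: (n_models, n_samples)
    return np.mean(stacked ** p, axis=0) ** (1.0 / p)

# Toy predictions from three models; the third falls below the threshold.
preds = [np.array([0.2, 0.8]), np.array([0.4, 0.6]), np.array([0.9, 0.1])]
scores = [0.78, 0.79, 0.70]

blended = power_mean_ensemble(preds, scores)
print(blended)
```

Raising an explicit error when no model passes the threshold avoids a silent division by zero.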