> [Application & tips](https://youtu.be/M6H3SGk2ddU?list=PLQ28Nx3M4Jrguyuwg4xe9d9t2XE639e5C)

# Learning Rate

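lr: the value the gradient is multiplied by when it is applied.
The larger lr is, the farther and faster each step moves, but this can cause a problem called **overshooting**.

If lr is too small, on the other hand, training the model takes too long.

-> Finding a good lr is important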
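A minimal sketch of both failure modes, assuming the toy objective $f(w) = w^2$ (gradient $2w$); the function name `gradient_descent` and the three lr values are illustrative choices, not from the lecture:

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # step against the gradient, scaled by lr
    return w

too_small = gradient_descent(lr=0.001)  # barely moves away from w0
reasonable = gradient_descent(lr=0.1)   # approaches the minimum at 0
too_large = gradient_descent(lr=1.1)    # overshoots: |w| grows every step

print(too_small, reasonable, too_large)
```

With lr=1.1 every update multiplies w by -1.2, so the iterate jumps across the minimum and lands farther away each time.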

For the Adam optimizer, `3e-4` is said to be an ideal lr value

A large lr is good at first, but it should become smaller later on.

So the lr value is sometimes changed dynamically (during training).

* Step decay: lower lr every N epochs, or when the validation loss stops improving
* Exponential decay
  - $\alpha = \alpha_0 e^{-kt}$
* 1/t decay
   - $\alpha = \alpha_0 / (1 + kt)$
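The three schedules can be sketched as plain functions. The `drop` and `every` parameters of `step_decay` and the sample value `k=0.1` are assumptions made for illustration:

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply lr by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, t, k):
    """alpha = alpha_0 * exp(-k * t)"""
    return lr0 * math.exp(-k * t)

def inverse_time_decay(lr0, t, k):
    """alpha = alpha_0 / (1 + k * t)"""
    return lr0 / (1 + k * t)

for t in (0, 10, 20):
    print(t,
          step_decay(0.1, t),
          exponential_decay(0.1, t, k=0.1),
          inverse_time_decay(0.1, t, k=0.1))
```

All three start at `lr0` for t = 0 and shrink as t grows; they differ only in how quickly.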

# Data preprocessing

## Feature scaling

### Standardization

$\mu$: mean, $\sigma$: standard deviation
$$
x_{new} = \frac {x - \mu} {\sigma}
$$

In [None]:
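import numpy as np

# Written out by hand: subtract the mean, divide by the population standard deviation
Standardization = (data - np.mean(data)) \
 / np.sqrt(np.sum((data - np.mean(data)) ** 2) / np.size(data))

# Same idea for a single column, using the built-in mean() and std()
Standardization = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()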
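As a quick check of the formula, here is a sketch that standardizes each column separately (using `axis=0`, which is a choice made here) on a small made-up array, and confirms the result has mean 0 and standard deviation 1 per feature:

```python
import numpy as np

# Hypothetical data: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

mu = X.mean(axis=0)     # per-column mean
sigma = X.std(axis=0)   # per-column (population) standard deviation
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```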

### Normalization

$$
x_{new} = \frac {x - x_{min}} {x_{max} - x_{min}}
$$

In [None]:
Normalization = (data - np.min(data, 0)) / (np.max(data, 0) - np.min(data, 0))

## Noisy Data

Applies when most of the data is clustered in one place, but a few points lie far away from it
and distort the data as a whole
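The notes do not name a specific technique, so the following is only one possible sketch: the common 1.5 x IQR rule, applied to made-up values, drops the point that lies far from the rest.

```python
import numpy as np

# Hypothetical measurements: 55.0 lies far from the cluster around 10
values = np.array([10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 55.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1  # interquartile range

# Keep only points within 1.5 * IQR of the quartiles
keep = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
cleaned = values[keep]
print(cleaned)
```

Whether a distant point is noise or a meaningful observation depends on the data, so a filter like this should be applied with care.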

In NLP: extracting only the parts that are needed

# Overfitting

Accuracy is high on the training data, but because the model is fitted too closely to that data,
accuracy drops when it predicts on test data

* High bias -> underfit
* High variance -> overfit
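A small sketch of the two regimes, assuming synthetic data (a noisy sine curve) and polynomial models of increasing degree; all names and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 12)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(3 * x_test)

def mse(degree):
    """Fit a polynomial of the given degree on the training set; return (train, test) error."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

for degree in (1, 3, 9):
    print(degree, mse(degree))
```

The training error can only go down as the degree rises. A degree that is too low leaves both errors high (high bias), while a degree that is too high typically starts fitting the noise, so the test error stops following the training error (high variance).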

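## Solutions

* Use more training data. - use data that is more general
* Reduce the number of features. - roughly like going from 2 dimensions to 1 (fewer weights)
* Increase the number of features. - when bias is high, i.e. the hypothesis is too simple

=> Choosing an appropriate number of features is important.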

### Regularization (Add term to loss)
$\lambda$: regularization strength
$$
c = c + \lambda \sum W^2
$$

A large regularization strength leads to underfitting;
a small one leads to overfitting.

The challenge is finding a suitable value.
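The penalized cost can be written out directly. This is a NumPy sketch with made-up numbers, assuming mean squared error as the base cost; `regularized_cost` and `lam` are names chosen here:

```python
import numpy as np

def regularized_cost(X, Y, W, b, lam):
    """Mean squared error plus lam * sum(W**2), i.e. c + lambda * sum(W^2)."""
    error = X @ W + b - Y
    return np.mean(error ** 2) + lam * np.sum(W ** 2)

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[1.0], [2.0], [3.0]])
W = np.array([[2.0], [1.0]])  # fits this toy data exactly
b = 0.0

print(regularized_cost(X, Y, W, b, lam=0.0))   # data term only
print(regularized_cost(X, Y, W, b, lam=0.01))  # adds 0.01 * (2**2 + 1**2)
```

Even a perfect fit pays a price proportional to the size of its weights, which is what pushes the optimizer toward smaller weights as the strength grows.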

# Training set, Test set

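Training on all of the available data lets the model effectively memorize it 100%,

which makes it hard to get an accurate measure of performance

Split the data roughly 7:3 into a training set and a test set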
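A 7:3 split might look like the following sketch. The helper `train_test_split` is written here for illustration (it is not the scikit-learn function of the same name), and shuffling before the cut is an assumption:

```python
import numpy as np

def train_test_split(X, Y, train_ratio=0.7, seed=0):
    """Shuffle the rows, then cut them into a training part and a test part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(round(len(X) * train_ratio))
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples
Y = np.arange(10)
X_train, Y_train, X_test, Y_test = train_test_split(X, Y)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```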

# Training, Validation, Test

Use the validation set to try different values of alpha (learning rate), lambda (regularization strength), and so on, and see which works best.
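A sketch of choosing lambda on a validation set. To keep it short it uses synthetic data and closed-form ridge regression instead of gradient descent; both are substitutions made here, and the candidate values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ true_w + rng.normal(0, 0.5, size=60)

# 60 rows -> 36 train / 12 validation / 12 test
X_train, y_train = X[:36], y[:36]
X_val, y_val = X[36:48], y[36:48]
X_test, y_test = X[48:], y[48:]

def fit_ridge(X, y, lam):
    """Closed-form minimizer of sum((Xw - y)**2) + lam * sum(w**2)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {lam: mse(X_val, y_val, fit_ridge(X_train, y_train, lam))
              for lam in candidates}
best_lam = min(val_errors, key=val_errors.get)

# The test set is touched only once, after lambda has been chosen
test_error = mse(X_test, y_test, fit_ridge(X_train, y_train, best_lam))
print(best_lam, test_error)
```

Keeping the test set out of the selection loop is the point: if lambda were tuned on the test set, the reported error would no longer reflect unseen data.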

# Online Learning
Data is fed in sequentially, or in small groups, as training proceeds
-- Unlike batch learning, the data keeps arriving continuously
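A minimal sketch of the idea, assuming a hypothetical generator `data_stream` that stands in for data arriving over time, and a one-feature linear model updated after every sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def data_stream(n):
    """Hypothetical source yielding one (x, y) pair at a time, with y = 3x + 1 + noise."""
    for _ in range(n):
        x = rng.uniform(-1, 1)
        yield x, 3.0 * x + 1.0 + rng.normal(0, 0.01)

w, b, lr = 0.0, 0.0, 0.1
for x, y in data_stream(2000):
    error = (w * x + b) - y
    # Update immediately from this single sample, then discard it
    w -= lr * 2 * error * x
    b -= lr * 2 * error

print(w, b)  # should end up close to 3 and 1
```

The model never sees the whole dataset at once, so the same loop keeps working when new samples arrive later.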

# LAB

## Without minmax

In [9]:
import tensorflow as tf
import numpy as np

data = np.array([[828.659973, 833.450012, 908100, 828.349976, 831.659973],
               [823.02002, 828.070007, 1828100, 821.655029, 828.070007],
               [819.929993, 824.400024, 1438100, 818.97998, 824.159973],
               [816, 820.958984, 1008100, 815.48999, 819.23999],
               [819.359985, 823, 1188100, 818.469971, 818.97998],
               [819, 823, 1198100, 816, 820.450012],
               [811.700012, 815.25, 1098100, 809.780029, 813.669983],
               [809.51001, 816.659973, 1398100, 804.539978, 809.559998]],
                dtype=np.float32)
X = data[:, 0:-1]
Y = data[:, [-1]]

W = tf.Variable(tf.random.normal([X.shape[1], Y.shape[1]]))
b = tf.Variable(tf.random.normal([Y.shape[1]]))

for i in range(100+1):
  with tf.GradientTape() as tape:
    hypothesis = tf.matmul(X, W) + b
    cost = tf.reduce_mean(tf.square(hypothesis - Y))
  W_grad, b_grad = tape.gradient(cost, [W, b])
  W.assign_sub(W_grad * 1e-5)
  b.assign_sub(b_grad * 1e-5)

  if i % 20 == 0:
    print("{:3} | {:7.4f}".format(i, cost.numpy()))

  0 | 5281608630272.0000
 20 |     nan
 40 |     nan
 60 |     nan
 80 |     nan
100 |     nan
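The `nan` can be reproduced without TensorFlow. The following is a NumPy re-implementation of the same loop, not the notebook's code; zero initialization replaces the random one so the run is deterministic (an assumption made here):

```python
import numpy as np

data = np.array([[828.659973, 833.450012, 908100, 828.349976, 831.659973],
                 [823.02002, 828.070007, 1828100, 821.655029, 828.070007],
                 [819.929993, 824.400024, 1438100, 818.97998, 824.159973],
                 [816, 820.958984, 1008100, 815.48999, 819.23999],
                 [819.359985, 823, 1188100, 818.469971, 818.97998],
                 [819, 823, 1198100, 816, 820.450012],
                 [811.700012, 815.25, 1098100, 809.780029, 813.669983],
                 [809.51001, 816.659973, 1398100, 804.539978, 809.559998]])
X, Y = data[:, :-1], data[:, [-1]]

def gradient_descent_costs(X, Y, lr, steps=100):
    """Plain gradient descent on mean squared error; returns the cost at every step."""
    W = np.zeros((X.shape[1], 1))  # zero init: an assumption, for reproducibility
    b = np.zeros(1)
    costs = []
    with np.errstate(over="ignore", invalid="ignore"):  # silence overflow warnings
        for _ in range(steps):
            error = X @ W + b - Y
            costs.append(np.mean(error ** 2))
            W -= lr * 2 * (X.T @ error) / len(X)
            b -= lr * 2 * error.mean()
    return costs

costs = gradient_descent_costs(X, Y, lr=1e-5)
print(costs[0], costs[-1])  # the first cost is finite, the last is inf or nan
```

The third column is on the order of 10^6 while the others are around 800, so its gradient component is enormous and even lr = 1e-5 overshoots more on every step until the numbers overflow.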


## With minmax - Normalization

In [19]:
import tensorflow as tf
import numpy as np

def min_max_scaler(data):
    numerator = data - np.min(data, 0)
    denominator = np.max(data, 0) - np.min(data, 0)
    # noise term prevents the zero division
    return numerator / (denominator + 1e-7)

data = np.array([[828.659973, 833.450012, 908100, 828.349976, 831.659973],
               [823.02002, 828.070007, 1828100, 821.655029, 828.070007],
               [819.929993, 824.400024, 1438100, 818.97998, 824.159973],
               [816, 820.958984, 1008100, 815.48999, 819.23999],
               [819.359985, 823, 1188100, 818.469971, 818.97998],
               [819, 823, 1198100, 816, 820.450012],
               [811.700012, 815.25, 1098100, 809.780029, 813.669983],
               [809.51001, 816.659973, 1398100, 804.539978, 809.559998]],
                dtype=np.float32)

# very important. It does not work without it.
data = min_max_scaler(data)

X = data[:, 0:-1]
Y = data[:, [-1]]

W = tf.Variable(tf.random.normal([X.shape[1], Y.shape[1]]))
b = tf.Variable(tf.random.normal([Y.shape[1]]))

for i in range(100+1):
  with tf.GradientTape() as tape:
    hypothesis = tf.matmul(X, W) + b
    cost = tf.reduce_mean(tf.square(hypothesis - Y))
  W_grad, b_grad = tape.gradient(cost, [W, b])
  W.assign_sub(W_grad * 3e-1)
  b.assign_sub(b_grad * 3e-1)

  if i % 20 == 0:
    print("{:3} | {:7.4f}".format(i, cost.numpy()))

  0 |  1.4285
 20 |  0.0258
 40 |  0.0075
 60 |  0.0056
 80 |  0.0051
100 |  0.0050


## With Regularization, Decay lr

In [23]:
import tensorflow as tf
import numpy as np

def min_max_scaler(data):
    numerator = data - np.min(data, 0)
    denominator = np.max(data, 0) - np.min(data, 0)
    # noise term prevents the zero division
    return numerator / (denominator + 1e-7)

data = np.array([[828.659973, 833.450012, 908100, 828.349976, 831.659973],
               [823.02002, 828.070007, 1828100, 821.655029, 828.070007],
               [819.929993, 824.400024, 1438100, 818.97998, 824.159973],
               [816, 820.958984, 1008100, 815.48999, 819.23999],
               [819.359985, 823, 1188100, 818.469971, 818.97998],
               [819, 823, 1198100, 816, 820.450012],
               [811.700012, 815.25, 1098100, 809.780029, 813.669983],
               [809.51001, 816.659973, 1398100, 804.539978, 809.559998]],
                dtype=np.float32)

# very important. It does not work without it.
data = min_max_scaler(data)

X = data[:, 0:-1]
Y = data[:, [-1]]

W = tf.Variable(tf.random.normal([X.shape[1], Y.shape[1]]))
b = tf.Variable(tf.random.normal([Y.shape[1]]))

lr = 0.5

for i in range(100+1):
  with tf.GradientTape() as tape:
    hypothesis = tf.matmul(X, W) + b
    cost = tf.reduce_mean(tf.square(hypothesis - Y))
    W_reg = tf.nn.l2_loss(W)
    cost = tf.reduce_mean(cost + W_reg * 0.01)
  W_grad, b_grad = tape.gradient(cost, [W, b])
  W.assign_sub(W_grad * lr)
  b.assign_sub(b_grad * lr)

  if i % 20 == 0:
    lr *= 0.96
    print("{:3} | {:7.4f} | {:3.4f}".format(i, cost.numpy(), lr))

  0 |  0.7742 | 0.4800
 20 |  0.0114 | 0.4608
 40 |  0.0088 | 0.4424
 60 |  0.0081 | 0.4247
 80 |  0.0075 | 0.4077
100 |  0.0071 | 0.3914


# Fashion MNIST - with keras

In [25]:
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [26]:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

model = keras.Sequential([
  keras.layers.Input((28, 28)),
  keras.layers.Flatten(),
  keras.layers.Dense(128, 'relu'),
  keras.layers.Dense(len(class_names), 'softmax'),
])

In [28]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, epochs=5)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               100480    
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f14e892f588>

In [29]:
model.evaluate(x_test, y_test)



[0.5755416750907898, 0.792900025844574]
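The parameter counts in the model summary above can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. The helper name `dense_params` is chosen here for illustration:

```python
def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

flatten_out = 28 * 28                    # 784 values per image, no parameters
hidden = dense_params(flatten_out, 128)  # 784 * 128 + 128
output = dense_params(128, 10)           # 128 * 10 + 10
print(hidden, output, hidden + output)   # 100480 1290 101770
```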

# IMDB

0: negative review
1: positive review

In [45]:
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)

In [46]:
word_index = keras.datasets.imdb.get_word_index()

In [47]:
word_index = {k:v+3 for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNKNOWN>"] = 2
word_index["<UNUSED>"] = 3
reversed_word_index = dict([(v, k) for (k, v) in word_index.items()])
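The reversed index is what turns an encoded review back into text. A self-contained sketch, with a three-word toy vocabulary standing in for the real `get_word_index()` result (the words and ranks are made up) and a helper named `decode_review` chosen here:

```python
# Toy vocabulary standing in for keras.datasets.imdb.get_word_index() (hypothetical ranks)
raw_index = {"the": 1, "movie": 2, "great": 3}

word_index = {k: v + 3 for k, v in raw_index.items()}  # shift to make room for special tokens
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNKNOWN>"] = 2
word_index["<UNUSED>"] = 3
reversed_word_index = {v: k for k, v in word_index.items()}

def decode_review(encoded):
    """Map a list of integer ids back to a space-separated string."""
    return " ".join(reversed_word_index.get(i, "?") for i in encoded)

print(decode_review([1, 4, 5, 6, 0, 0]))  # <START> the movie great <PAD> <PAD>
```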

In [58]:
x_train = keras.preprocessing.sequence.pad_sequences(x_train,
                    value=word_index["<PAD>"], padding='post', maxlen=256)
x_test = keras.preprocessing.sequence.pad_sequences(x_test,
                    value=word_index["<PAD>"], padding='post', maxlen=256)
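What the padding step does to each review can be sketched in pure Python. This mimics `padding='post'` together with the default `truncating='pre'`; the function `pad_post` is a stand-in written here, not the Keras implementation:

```python
def pad_post(sequences, maxlen, value=0):
    """Pure-Python sketch of padding='post' with the default truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # too long: keep the last maxlen ids
        padded.append(seq + [value] * (maxlen - len(seq)))  # too short: fill at the end
    return padded

print(pad_post([[1, 14, 22], [1, 5, 6, 7, 8, 9]], maxlen=4))
# [[1, 14, 22, 0], [6, 7, 8, 9]]
```

Every review ends up with the same length, which is what allows the whole set to be stacked into one 2-D array.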

In [49]:
model = keras.Sequential([
  keras.layers.Embedding(10000, 16),
  keras.layers.GlobalAveragePooling1D(),
  keras.layers.Dense(16, 'relu'),
  keras.layers.Dense(1, 'sigmoid'),
])
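The first two layers can be pictured with plain NumPy: the embedding is a table lookup, and global average pooling is a mean over the sequence axis. Random arrays stand in for the trained weights and the padded reviews (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len, batch = 10000, 16, 256, 2

# Stand-ins for the trainable Embedding weights and for a batch of padded reviews
embedding_table = rng.normal(size=(vocab_size, embed_dim))
token_ids = rng.integers(0, vocab_size, size=(batch, seq_len))

embedded = embedding_table[token_ids]  # lookup: (batch, seq_len, embed_dim)
pooled = embedded.mean(axis=1)         # average over the sequence axis: (batch, embed_dim)
print(embedded.shape, pooled.shape)    # (2, 256, 16) (2, 16)
```

Averaging removes the sequence dimension, so each review is summarized by a single 16-dimensional vector before it reaches the Dense layers.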

In [55]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
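The counts in this summary follow from the layer shapes, worked out here as plain arithmetic:

```python
embedding = 10000 * 16   # one 16-dimensional vector per vocabulary entry
pooling = 0              # averaging has no trainable weights
hidden = 16 * 16 + 16    # weights + biases of Dense(16)
output = 16 * 1 + 1      # weights + bias of Dense(1)
print(embedding, hidden, output, embedding + pooling + hidden + output)
# 160000 272 17 160289
```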


In [56]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

history = model.fit(partial_x_train, partial_y_train, batch_size=512, epochs=40,
                    validation_data=(x_val, y_val))

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


In [59]:
model.evaluate(x_test, y_test)



[0.3321771025657654, 0.873520016670227]

# CIFAR-100