## Overfitting
* Occam's Razor
    * More things should not be used than are necessary
* Reduce Overfitting
    * more data
    * constrain model complexity
        * shallower network
        * regularization
    * dropout
    * data augmentation
    * early stopping

### Regularization
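##### J(zeta) = CrossEntropyLoss + lambda * sum(|zeta_i|)
* where lambda is the regularization strength (e.g. 0.01) and zeta denotes the model parameters
* the penalty term pushes the weights toward 0; the L2 form of this penalty is commonly called weight decay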

#### L1-regularization
* J(zeta) = CrossEntropyLoss + lambda * sum(|zeta_i|)
#### L2-regularization
* J(zeta) = CrossEntropyLoss + 1/2 * lambda * ||zeta||^2
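
As a rough illustration of the two penalty terms above, the sketch below evaluates them in plain Python for a short list of weights. The weight values and the helper names `l1_penalty` and `l2_penalty` are made up for this example; lambda = 0.01 follows the note above.

```python
def l1_penalty(weights, lam):
    """L1 term: lambda * sum(|w_i|)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term: 1/2 * lambda * sum(w_i^2)."""
    return 0.5 * lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical parameter values
lam = 0.01

print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))
```

The L1 term grows linearly with each weight's magnitude, while the L2 term grows quadratically, so the L2 term punishes a few large weights much more heavily than many small ones.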

In [None]:
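# L1 regularization: add lambda * sum(|param|) to the classification loss.
# Assumes model, criterion, optimizer, logits and target are defined earlier.
import torch

regularization_loss = 0
for param in model.parameters():
    regularization_loss += torch.sum(torch.abs(param))

classify_loss = criterion(logits, target)
loss = classify_loss + 0.01 * regularization_loss  # lambda = 0.01

optimizer.zero_grad()
loss.backward()
optimizer.step()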

In [None]:
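# L2 regularization: SGD's weight_decay adds weight_decay * param to each gradient,
# which is the gradient of the penalty 1/2 * lambda * ||W||^2 with lambda = 0.01
optimizer = torch.optim.SGD(net.parameters(), lr=learning_rate, weight_decay=0.01)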
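
To make the link between the L2 penalty and `weight_decay` concrete, here is a minimal plain-Python sketch on a one-parameter toy problem. The loss f(w) = (w - 3)^2, the learning rate, and the helper names are assumptions chosen for illustration; the sketch only mirrors what plain SGD does with `weight_decay`, it is not PyTorch's implementation.

```python
def grad_loss(w):
    # Gradient of the toy loss f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)


def step_l2_penalty(w, lr, lam):
    # Gradient step on the penalized loss f(w) + 1/2 * lam * w^2
    return w - lr * (grad_loss(w) + lam * w)


def step_weight_decay(w, lr, lam):
    # Shrink the weight by (1 - lr * lam), then take the plain gradient step
    return w * (1.0 - lr * lam) - lr * grad_loss(w)


lr, lam = 0.1, 0.01
w_penalty = w_decay = 5.0
for _ in range(200):
    w_penalty = step_l2_penalty(w_penalty, lr, lam)
    w_decay = step_weight_decay(w_decay, lr, lam)

# Both variants settle at 6 / (2 + lam), slightly below the unregularized minimum at 3
print(w_penalty, w_decay, 6.0 / (2.0 + lam))
```

The two update rules are algebraically identical for plain SGD, which is why the penalty can be moved out of the loss and into the optimizer.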

### Tricks
* momentum
    * plain gradient descent: w_k+1 = w_k - alpha * grad_f(w_k)
    * with momentum: z_k+1 = beta * z_k + grad_f(w_k),    w_k+1 = w_k - alpha * z_k+1
* learning rate decay
    * A small learning rate requires many updates before reaching the minimum point
    * The optimal learning rate swiftly reaches the minimum point
    * Too large of a learning rate causes drastic updates which lead to divergent behaviors
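
The update rules and the three learning-rate regimes listed above can be tried out on a toy problem. The sketch below minimizes f(w) = 0.5 * w^2 in plain Python; the function `minimize`, the starting point, and the specific values of alpha and beta are assumptions picked for illustration.

```python
def minimize(alpha, beta=0.0, steps=50, w0=5.0):
    """Minimize f(w) = 0.5 * w^2 and return the final w.

    beta = 0 gives plain gradient descent; beta > 0 adds momentum.
    """
    w, z = w0, 0.0
    for _ in range(steps):
        grad = w             # grad_f(w) for f(w) = 0.5 * w^2
        z = beta * z + grad  # z_k+1 = beta * z_k + grad_f(w_k)
        w = w - alpha * z    # w_k+1 = w_k - alpha * z_k+1
    return w


print("small rate:     ", minimize(alpha=0.01))            # still far from the minimum at 0
print("well-matched:   ", minimize(alpha=1.0, steps=1))    # lands on the minimum of this toy loss
print("too large:      ", minimize(alpha=2.5))             # magnitude grows every step
print("with momentum:  ", minimize(alpha=0.1, beta=0.9, steps=300))
```

On a real loss surface there is no single rate that reaches the minimum in one step, which is the motivation for starting with a larger rate and decaying it during training.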

> 3e-4 is a popular default learning rate for Adam; treat it as a starting point for tuning rather than a universally best value

In [None]:
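# SGD with momentum, plus a scheduler that lowers the learning rate
# when the validation loss stops improving
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.momentum, weight_decay=args.weight_decay)

# 'min' mode: reduce the learning rate once the monitored value stops decreasing
scheduler = ReduceLROnPlateau(optimizer, 'min')

for epoch in range(args.start_epoch, args.epochs):
    train(train_loader, model, criterion, optimizer, epoch)
    result_avg, loss_val = validate(val_loader, model, criterion, epoch)
    scheduler.step(loss_val)  # pass the validation loss being monitored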
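
The idea behind a reduce-on-plateau schedule can be sketched without PyTorch. The class `PlateauDecay` below is a hypothetical, simplified stand-in: it keeps only the `factor` and `patience` ideas and ignores the threshold, cooldown, and minimum learning rate options that the real `ReduceLROnPlateau` offers. The loss values are invented for the example.

```python
class PlateauDecay:
    """Simplified reduce-on-plateau schedule (hypothetical helper, not the PyTorch class)."""

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Track the best loss seen so far and count epochs without improvement
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # After more than `patience` bad epochs, shrink the learning rate
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauDecay(lr=0.1)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
print(history)  # the rate stays at 0.1 until the third epoch without improvement
```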

In [None]:
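# Assuming optimizer uses lr = 0.05 for all groups
# lr = 0.05     if epoch < 30
# lr = 0.005    if 30 <= epoch < 60
# lr = 0.0005   if 60 <= epoch < 90
# ...
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()  # step once per epoch, after the optimizer updates in train()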
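
The step schedule described in the comments above follows a closed form: the learning rate is multiplied by `gamma` once every `step_size` epochs. A small plain-Python sketch, with the hypothetical helper name `step_lr` and the values from the cell above as defaults:

```python
def step_lr(epoch, base_lr=0.05, step_size=30, gamma=0.1):
    """Learning rate at a given epoch: base_lr * gamma ** (epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, step_lr(epoch))
```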

### Tricks
* Early Stopping
    * Use a validation set to select parameters
    * Monitor validation performance
    * Stop at the highest validation performance **(judging when that point has been reached takes experience)**
* Dropout
    * Learning less to learn better
    * Each connection is dropped with probability p in \[0,1\]
* Stochastic Gradient Descent
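    * Stochastic does not mean arbitrary: each mini-batch is sampled from the training data, so its gradient is a noisy estimate of the full gradient
    * Contrast: deterministic (full-batch) gradient descent uses the whole dataset for every update
    * Because GPU memory is limited, each gradient step is computed on one batch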

In [None]:
# dropout
import torch

net_dropped = torch.nn.Sequential(
    torch.nn.Linear(784, 200),
    torch.nn.Dropout(0.5),  # zero each activation with probability 0.5 during training
    torch.nn.ReLU(),
    torch.nn.Linear(200, 200),
    torch.nn.Dropout(0.5),  # zero each activation with probability 0.5 during training
    torch.nn.ReLU(),
    torch.nn.Linear(200, 10)
)

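* torch.nn.Dropout(p=dropout_prob): p is the probability of dropping a unit
* tf.nn.dropout(x, keep_prob) in TensorFlow 1.x: keep_prob is the probability of keeping a unit, so keep_prob = 1 - dropout_prob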
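
To show what the drop probability means in practice, here is a minimal plain-Python sketch of inverted dropout on a list of activations: during training each value is zeroed with probability p and the survivors are scaled by 1 / (1 - p), while at evaluation time the input passes through unchanged. The function name `dropout`, its `rng` argument, and the sample activations are assumptions for illustration, not a library API.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: drop with probability p, rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: every value is kept as-is
    keep_prob = 1.0 - p
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # seeded generator so repeated runs give the same mask
activations = [1.0, 2.0, 3.0, 4.0]

print("train:", dropout(activations, p=0.5, training=True, rng=rng))
print("eval: ", dropout(activations, p=0.5, training=False))
```

The rescaling keeps the expected value of each activation the same in both modes, which is why no extra correction is needed at test time.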

In [None]:
# Behavior differs between training and testing
for epoch in range(epochs):

    # train: dropout is active
    net_dropped.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        ...
    # test: switch to eval mode so dropout is disabled and every connection is used
    net_dropped.eval()
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        ...
        