In this exercise, we will learn how to apply weight decay to regularize our models.

First, let's import TensorFlow.

In [1]:
import tensorflow as tf

There are several ways to apply weight decay with TensorFlow.
As an example, we will use a multi-layer perceptron as our model.

### Method 1 : Using tf.losses.get_regularization_loss()

* Documentation of tf.GraphKeys: https://www.tensorflow.org/api_docs/python/tf/GraphKeys

"tf.GraphKeys" defines the standard names that TensorFlow uses to collect and retrieve values associated with a (computational) graph. To investigate how tf.GraphKeys is used, let's first define the computational graph.

In [2]:
x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, ])

As described in the documentation of "tf.layers.dense", we can specify which parameters (weights and biases) are regularized, and with what regularization scale, through the "kernel_regularizer" and "bias_regularizer" arguments.

* Documentation of tf.layers.dense: https://www.tensorflow.org/api_docs/python/tf/layers/dense

In [3]:
h = tf.layers.dense(x, 
                    units=64,
                    use_bias=True,
                    activation=tf.nn.relu,
                    kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-3))
y_pred = tf.layers.dense(h,
                         units=1,
                         use_bias=True,
                         activation=None,
                         kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-3))
y_pred = tf.reshape(y_pred, [-1])

We can directly configure whether to apply L2 (or L1, or combined L1/L2) regularization.
As can be seen in the source linked below, the scale corresponds to \\( \lambda \\) in \\( L_{total} = L_{nll} + \frac{\lambda}{2} |w|^2 \\); the factor of 1/2 comes from "tf.nn.l2_loss", which the regularizer calls internally.

https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/contrib/layers/python/layers/regularizers.py#L76-L109

Each regularized layer adds its term to the "tf.GraphKeys.REGULARIZATION_LOSSES" collection. The sum of these terms, \\( \frac{\lambda}{2} |w|^2 \\), can be retrieved with "tf.losses.get_regularization_loss()" and added to the total loss \\( L_{total} \\).

In [4]:
tf.losses.get_regularization_loss()

<tf.Tensor 'total_regularization_loss:0' shape=() dtype=float32>
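To make the mechanism behind this tensor concrete, here is a toy, pure-Python model of it. This is a sketch, not TensorFlow's implementation: the names `collections`, `add_to_collection`, `dense_kernel`, and `get_regularization_loss` are hypothetical stand-ins, and the weights are made-up numbers. The only facts borrowed from TensorFlow are that each regularized layer registers its penalty in the `tf.GraphKeys.REGULARIZATION_LOSSES` collection, that the L2 penalty is `scale * sum(w ** 2) / 2`, and that the total is the sum over that collection.

```python
from collections import defaultdict

# Toy stand-in for a graph's named collections (hypothetical, not TensorFlow code).
REGULARIZATION_LOSSES = "regularization_losses"
collections = defaultdict(list)


def add_to_collection(name, value):
    """Store a value under a well-known collection name."""
    collections[name].append(value)


def l2_loss(weights):
    """Half the sum of squares, matching the definition of tf.nn.l2_loss."""
    return sum(w * w for w in weights) / 2.0


def dense_kernel(weights, scale):
    """Pretend to create a layer kernel and register its L2 penalty."""
    add_to_collection(REGULARIZATION_LOSSES, scale * l2_loss(weights))
    return weights


def get_regularization_loss():
    """Sum every penalty that the layers registered."""
    return sum(collections[REGULARIZATION_LOSSES])


# Two made-up kernels, flattened to lists for simplicity.
kernel_0 = dense_kernel([1.0, -2.0, 3.0], scale=1e-3)
kernel_1 = dense_kernel([0.5, 0.5], scale=1e-3)

reg_loss = get_regularization_loss()
print(reg_loss)
```

The layer code never returns the penalty to the caller; it only registers it under a shared name, which is why the loss can be collected later without keeping a reference to each layer.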

Finally, we can define the total loss as follows:

In [5]:
nll = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=y_pred))
reg_loss = tf.losses.get_regularization_loss()
total_loss = nll + reg_loss

In [6]:
total_loss

<tf.Tensor 'add:0' shape=() dtype=float32>

Then, we minimize "total_loss" to train our model.

In [7]:
train_op = tf.train.AdamOptimizer(1e-3).minimize(total_loss)
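For intuition about why this penalty is called weight decay, consider a single step of plain gradient descent with learning rate \\( \eta \\). This is an assumption made for illustration only: the cell above uses Adam, which rescales the update per parameter. Using the \\( \frac{\lambda}{2} |w|^2 \\) convention of "tf.nn.l2_loss", the derivation can be sketched as:

```latex
\begin{aligned}
L_{total}(w) &= L_{nll}(w) + \frac{\lambda}{2} |w|^2 \\
\nabla_w L_{total}(w) &= \nabla_w L_{nll}(w) + \lambda w \\
w_{t+1} &= w_t - \eta \nabla_w L_{total}(w_t)
         = (1 - \eta \lambda)\, w_t - \eta \nabla_w L_{nll}(w_t)
\end{aligned}
```

Each step multiplies the weights by \\( (1 - \eta \lambda) \\) before applying the data gradient, so the weights shrink toward zero unless the data term pushes back.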

### Method 2 : Directly add the product of the scaling factor and the norm of the weight parameters

The second way is to use tf.GraphKeys.TRAINABLE_VARIABLES to directly add the product of the scaling factor and the squared norm of the weight parameters.

In [8]:
tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

[<tf.Variable 'dense/kernel:0' shape=(10, 64) dtype=float32_ref>,
 <tf.Variable 'dense/bias:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'dense_1/kernel:0' shape=(64, 1) dtype=float32_ref>,
 <tf.Variable 'dense_1/bias:0' shape=(1,) dtype=float32_ref>]

We can see that the above command returns the collection of trainable parameters.
To regularize the weight parameters, we can write the code as below:

In [9]:
reg_loss = 0
for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES):
    if 'kernel' in v.name:
        reg_loss += tf.nn.l2_loss(v)
reg_loss *= 1e-3 # lambda, regularization scale        

This implementation of the regularization loss is exactly equivalent to using "tf.contrib.layers.l2_regularizer(scale=1e-3)" together with "tf.losses.get_regularization_loss()".
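The equivalence can be checked numerically with a small pure-Python sketch. The variable values below are made up, the dictionary `trainable_variables` is a hypothetical stand-in for the TensorFlow collection, and `l2_loss` mirrors the definition of `tf.nn.l2_loss` (half the sum of squares).

```python
def l2_loss(values):
    """Half the sum of squares, matching the definition of tf.nn.l2_loss."""
    return sum(v * v for v in values) / 2.0


# Made-up trainable variables, keyed by names in the style shown above.
trainable_variables = {
    "dense/kernel:0": [1.0, -2.0, 3.0],
    "dense/bias:0": [0.1],
    "dense_1/kernel:0": [0.5, 0.5],
    "dense_1/bias:0": [-0.2],
}
scale = 1e-3  # lambda, regularization scale

# Method 1 style: each kernel contributes scale * l2_loss, then the terms are summed.
method_1 = sum(
    scale * l2_loss(values)
    for name, values in trainable_variables.items()
    if "kernel" in name
)

# Method 2 style: sum the unscaled l2_loss of each kernel, then scale once.
method_2 = 0.0
for name, values in trainable_variables.items():
    if "kernel" in name:
        method_2 += l2_loss(values)
method_2 *= scale

print(method_1, method_2)
```

Both routes skip the biases and differ only in where the multiplication by the scale happens, so they agree up to floating-point rounding.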

### Method 3 : AdamW optimizer

However, Ilya Loshchilov and Frank Hutter reported that, for adaptive gradient optimizers such as Adam, L2 regularization is not equivalent to weight decay. They proposed decoupling the weight decay from the gradient-based update, which gives the so-called AdamW optimizer.

* Documentation of the AdamW optimizer: https://www.tensorflow.org/api_docs/python/tf/contrib/opt/AdamWOptimizer
* Reference: https://openreview.net/forum?id=Bkg6RiCqY7
* Blog: https://www.fast.ai/2018/07/02/adam-weight-decay/
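The difference can be illustrated with a single update on scalar weights. The following is a minimal sketch under stated assumptions, not the code of any library: `adam_step` is a hypothetical helper, the numbers are made up, and the decoupled decay is applied as `weight_decay * w` without an extra learning-rate factor, a convention that varies between implementations.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2_scale=0.0, weight_decay=0.0):
    """One Adam update on a scalar weight.

    l2_scale adds the penalty gradient (l2_scale * w) to grad before the
    adaptive rescaling; weight_decay shrinks w directly, outside of it.
    """
    g = grad + l2_scale * w                # L2 regularization enters the gradient
    m = beta1 * m + (1.0 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)         # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    w_new -= weight_decay * w              # decoupled decay, applied to the old w
    return w_new, m, v


# Same loss gradient for a small and a large weight (made-up numbers).
grad = 0.1
steps = {}
for w in (1.0, 10.0):
    w_l2, _, _ = adam_step(w, grad, m=0.0, v=0.0, t=1, l2_scale=1e-3)
    w_decoupled, _, _ = adam_step(w, grad, m=0.0, v=0.0, t=1, weight_decay=1e-3)
    steps[w] = (w - w_l2, w - w_decoupled)
    print(w, steps[w])
```

In this first step, the L2 variant moves both weights by roughly the learning rate, because the penalty gradient is divided by the same adaptive factor as the loss gradient. The decoupled variant shrinks the larger weight more, in proportion to its magnitude.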

In [10]:
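# Apply weight decay only to the kernels (weights), not to the biases.
decay_var_list = [v for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
                  if 'kernel' in v.name]
adamw_optimizer = tf.contrib.opt.AdamWOptimizer(weight_decay=1e-3,
                                                learning_rate=1e-3)
# Minimize the unregularized loss; the optimizer applies the decay itself.
train_op = adamw_optimizer.minimize(nll,
                                    decay_var_list=decay_var_list)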

In this exercise, we learned several ways to apply weight decay (L2 regularization and its decoupled variant).
Overall, we recommend using the AdamW optimizer, based on the results reported in the literature, and reading the linked documentation carefully.
In particular, make sure you understand how "tf.GraphKeys" works and how it is used, as exemplified in this document by "tf.GraphKeys.TRAINABLE_VARIABLES".