Adam stands for adaptive moment estimation. It combines ideas from momentum-based gradient descent, Adagrad, and RMSprop: the step size is adapted for each parameter using moving averages of the gradient and of the squared gradient, which often gives faster convergence on non-convex optimization problems. It keeps track of two exponentially decaying averages: the first-moment estimate, which is the exponentially decaying average of past gradients, and the second-moment estimate, which is the exponentially decaying average of past squared gradients. The first-moment estimate plays the role of momentum, and the second-moment estimate scales the learning rate for each parameter.

The Adam optimizer inherits the strengths of the two methods above, momentum and RMSprop, and builds on them to give a better-behaved gradient descent.

From Momentum:  
w(t+1) = wt - alpha * mt  
Here mt is used in place of the raw gradient. It is the exponentially decaying average of past gradients:  
mt = Beta1 * m(t-1) + (1 - Beta1) * Grad  
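
The momentum update can be sketched in a few lines of plain Python. This is only an illustration: the function name `momentum_step`, the toy objective f(w) = w**2, and the values chosen for `alpha` and `beta1` are assumptions made for the example, not part of the notebook.

```python
def momentum_step(w, m, grad, alpha=0.1, beta1=0.9):
    """One momentum update: refresh the moving average, then step along it."""
    m = beta1 * m + (1 - beta1) * grad   # mt = Beta1 * m(t-1) + (1 - Beta1) * Grad
    w = w - alpha * m                    # w(t+1) = wt - alpha * mt
    return w, m

# Toy problem (assumed for illustration): minimize f(w) = w**2, gradient 2 * w.
w, m = 5.0, 0.0
for _ in range(300):
    w, m = momentum_step(w, m, grad=2 * w)

print(w)  # very close to the minimum at 0
```

Because the step follows the averaged gradient rather than the latest one, a single noisy gradient changes the direction only slightly.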


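From RMSprop:  
w(t+1) = wt - [alpha / sqrt(vt + e)] * Grad  
Here the learning rate is scaled by vt, the exponentially decaying average of past squared gradients, and e is a small constant that prevents division by zero:  
vt = Beta2 * v(t-1) + (1 - Beta2) * Grad ** 2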
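
A small sketch makes the effect of this scaling visible. The helper name `rmsprop_step` and the hyperparameter values are assumptions for illustration; e is kept inside the square root to match the formula above.

```python
import math

def rmsprop_step(w, v, grad, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSprop update for a single parameter."""
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of squared gradients
    w = w - alpha / math.sqrt(v + eps) * grad    # per-parameter scaled step
    return w, v

# Two parameters whose gradients differ by a factor of 1000.
w_big, v_big = rmsprop_step(w=0.0, v=0.0, grad=100.0)
w_small, v_small = rmsprop_step(w=0.0, v=0.0, grad=0.1)

print(w_big, w_small)  # the two steps have nearly the same size
```

Dividing by sqrt(vt) cancels most of the gradient's magnitude, so parameters with large and small gradients move at a comparable pace.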



Since mt and vt are both initialized to 0, they are biased towards 0 during the first iterations, and the effect is stronger because β1 and β2 are close to 1. Adam fixes this by computing bias-corrected versions of mt and vt. The correction matters most early in training: as t grows, the correction factors approach 1 and the corrected values approach mt and vt themselves.
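
The bias towards 0 is easy to reproduce. The sketch below assumes a constant gradient of 1.0, so the true average is 1.0, and tracks the raw and the bias-corrected first moment for a few steps; the variable names are illustrative.

```python
beta1 = 0.9
grad = 1.0   # constant gradient (assumed), so the true average is 1.0
m = 0.0      # zero initialization, as in Adam

history = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * grad   # raw first-moment estimate
    m_hat = m / (1 - beta1 ** t)         # bias-corrected estimate
    history.append((t, m, m_hat))
    print(t, round(m, 4), round(m_hat, 4))
```

The raw estimate starts at 0.1 and climbs only slowly towards 1.0, while the corrected estimate equals the true average from the first step.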

Intuitively, the step taken for each parameter is adapted after every iteration using these moment estimates, so that it stays controlled and unbiased throughout training. This is the "adaptive moment estimation" that gives Adam its name.

For bias correction (Beta1^t and Beta2^t are Beta1 and Beta2 raised to the power t):  

mt_hat = mt / (1 - Beta1^t)  
vt_hat = vt / (1 - Beta2^t)  

Weight update:  
w(t+1) = wt - [alpha / (sqrt(vt_hat) + e)] * mt_hat
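
Putting the pieces together, one Adam step for a single parameter might look like the sketch below. The function name `adam_step` is hypothetical, the default hyperparameters are the ones printed by `model.optimizer.get_config()` later in this notebook, and the step is subtracted so that the parameter moves against the gradient, with e added outside the square root as in the original Adam formulation.

```python
import math

def adam_step(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    w = w - alpha / (math.sqrt(v_hat) + eps) * m_hat
    return w, m, v

# First step on the toy objective f(w) = w**2 (assumed), starting at w = 5.
w1, m1, v1 = adam_step(w=5.0, m=0.0, v=0.0, grad=10.0, t=1)
print(5.0 - w1)  # size of the first step
```

On the first step mt_hat equals the gradient and sqrt(vt_hat) equals its absolute value, so the step size is almost exactly alpha regardless of how large the gradient is.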


In [1]:
import tensorflow as tf

In [2]:
def createModel(input_shape):
    # Small functional-API model: one hidden ReLU layer, then a 2-way softmax.
    X_input = tf.keras.layers.Input(input_shape)
    X = tf.keras.layers.Dense(10, activation='relu')(X_input)
    X_output = tf.keras.layers.Dense(2, activation='softmax')(X)
    model = tf.keras.Model(inputs=X_input, outputs=X_output)
    return model

In [3]:
model = createModel((10, 10)) 


In [4]:
model.summary()  # summary() prints the table itself and returns None, so it is not wrapped in print()


In [5]:
print('Initial Layer Weights') 
print() 
for i in range(1, len(model.layers)):  # start at 1 to skip the Input layer, which has no weights
    print('Weight for Layer '+str(i)+': ') 
    print(model.layers[i].get_weights()[0]) 
    print() 

Initial Layer Weights

Weight for Layer 1: 
[[ 0.296719    0.12479973 -0.03654027 -0.01508188 -0.5403854  -0.52189714
  -0.33469433 -0.01031756  0.00877666 -0.11324003]
 [-0.364394   -0.28073815 -0.05136335 -0.23935804 -0.15700006 -0.19092757
   0.5272951   0.27842534  0.03753638  0.02051485]
 [ 0.27769494 -0.03429574  0.20139593 -0.09179577 -0.01420242  0.08377004
  -0.00702471 -0.04625934  0.321429    0.23440176]
 [ 0.53380704 -0.08840534  0.09720039  0.33096737 -0.14456922  0.12651277
  -0.19154158 -0.07113087 -0.09058133  0.22444093]
 [ 0.1007576   0.08510673 -0.15253386  0.09979451  0.3576933   0.24536377
   0.18271935 -0.1307903   0.21833521 -0.4238966 ]
 [ 0.3218513  -0.35115626  0.2664861  -0.46591854 -0.27767992 -0.4476878
   0.35598958 -0.4031749  -0.25424656 -0.41667828]
 [ 0.03161305 -0.08007896  0.1780827  -0.20949185 -0.4138363   0.01499754
   0.47309482  0.19412655  0.18258017 -0.12325215]
 [-0.08922386  0.26008135 -0.48908237  0.29042888 -0.33507407  0.21547616
  -0.451

In [6]:
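tf.random.set_seed(5)
# Random inputs and targets: 2 samples, each 10 x 10 in and 10 x 2 out.
# The targets are not valid class probabilities; they only serve to drive one Adam update.
X = tf.random.normal((2, 10, 10))
Y = tf.random.normal((2, 10, 2))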

In [7]:
model.compile(optimizer='adam', 
              loss='categorical_crossentropy', 
              metrics=['accuracy']) 

In [9]:
print(model.optimizer.get_config()) 

{'name': 'adam', 'learning_rate': 0.0010000000474974513, 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'loss_scale_factor': None, 'gradient_accumulation_steps': None, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}


In [10]:
model.fit(X, Y)


1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 668ms/step - accuracy: 0.3000 - loss: 0.1425


<keras.src.callbacks.history.History at 0x1db4cb2bb80>

In [12]:
print('Final Layer Weights') 
print() 
for i in range(1, len(model.layers)): 
    print('Weight for Layer '+str(i)+': ') 
    print(model.layers[i].get_weights()[0]) 
    print()

Final Layer Weights

Weight for Layer 1: 
[[ 0.29771885  0.12379976 -0.03754019 -0.01608174 -0.53938544 -0.52089727
  -0.33569428 -0.00931771  0.00977662 -0.11423938]
 [-0.3633946  -0.2817381  -0.0523631  -0.240358   -0.15799971 -0.18992762
   0.5262951   0.2774269   0.03853633  0.01951489]
 [ 0.2766951  -0.03529542  0.20239589 -0.09079581 -0.01320247  0.0847697
  -0.00802468 -0.0452594   0.32242894  0.23340182]
 [ 0.5348069  -0.08940531  0.09620042  0.32996738 -0.14356926  0.12751272
  -0.1905416  -0.07213072 -0.0915813   0.22544073]
 [ 0.09975782  0.08610667 -0.15353383  0.10079448  0.35669342  0.24436384
   0.18371932 -0.13179024  0.21933514 -0.42289662]
 [ 0.3208514  -0.3521562   0.26748607 -0.46491855 -0.27667993 -0.44668785
   0.3549896  -0.40217498 -0.2532466  -0.4176781 ]
 [ 0.03261162 -0.07907899  0.17908248 -0.20849188 -0.4148363   0.01399759
   0.47209492  0.19512624  0.18358007 -0.12225237]
 [-0.08822399  0.25908142 -0.49008226  0.29142815 -0.3340741   0.21447623
  -0.45217
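
The printed weights can be checked against the update rule. The progress line shows a single batch (1/1), so `model.fit` performed one Adam step; on that first step mt_hat equals the gradient and sqrt(vt_hat) equals its absolute value, so each weight with a non-negligible gradient should move by roughly one learning rate (0.001). The sketch below copies the first six entries of row 0 from the two printouts above; the variable names are illustrative.

```python
# First six entries of row 0 of the layer-1 kernel, copied from the printed outputs.
initial = [0.296719, 0.12479973, -0.03654027, -0.01508188, -0.5403854, -0.52189714]
final = [0.29771885, 0.12379976, -0.03754019, -0.01608174, -0.53938544, -0.52089727]

learning_rate = 0.001  # from model.optimizer.get_config()

# Change of each weight after the single training step.
changes = [after - before for before, after in zip(initial, final)]
for change in changes:
    print(f"{change:+.6f}")
```

Every change has a magnitude close to the learning rate, and only the sign differs, which is what the bias-corrected first step predicts.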