## $$ \textbf{From Temporal Difference Learning to Q-learning}$$

#### Data: $$ s_0; a_1, r_1, s_1; a_2, r_2, s_2; \dots; a_n, r_n, s_n$$

#### A data point: $$ (s,a,r,s') $$
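As a minimal sketch of how the trajectory above breaks into data points, the snippet below pairs each step with the state that preceded it. The state and action labels, the rewards, and the helper name `to_transitions` are hypothetical and serve only as illustration.

```python
# Hypothetical trajectory: initial state s_0 followed by (a_t, r_t, s_t) triples.
s0 = "s0"
steps = [("a1", 0.0, "s1"), ("a2", 1.0, "s2"), ("a3", 0.5, "s3")]

def to_transitions(s0, steps):
    """Split a trajectory into (s, a, r, s') data points."""
    transitions = []
    s = s0
    for a, r, s_next in steps:
        transitions.append((s, a, r, s_next))
        s = s_next  # the next data point starts where this one ended
    return transitions

transitions = to_transitions(s0, steps)
for t in transitions:
    print(t)
```

Each tuple can then be used on its own in the updates that follow.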

#### Prediction at a single data point: $$ Pred(w) := V(s;w) $$

#### Target at a single data point (a bootstrapped estimate of the true value): $$ Target := r + \gamma V(s';w)$$

#### Loss at a single data point: $$ \min_w \frac{1}{2} (Pred(w) - Target)^2$$

#### The gradient of the loss function w.r.t. $w$, treating $Target$ as a constant: $$ (Pred(w) - Target) \nabla_w Pred(w)$$
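The target also depends on $w$, but TD learning holds it fixed when differentiating (this is often called a semi-gradient). Under that assumption, the chain rule gives:

$$ \nabla_w \frac{1}{2} \big(Pred(w) - Target\big)^2 = \big(Pred(w) - Target\big) \nabla_w \big(Pred(w) - Target\big) = \big(Pred(w) - Target\big) \nabla_w Pred(w) $$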


#### Gradient Descent Update:
$$ w \leftarrow w - \eta \big(Pred(w) - Target \big) \nabla_w Pred(w)$$

Move the weights $w$ a step of size $\eta$ in the negative direction of the gradient of the loss w.r.t. $w$.

#### Gradient Descent Update for linear functions: $$ w \leftarrow w - \eta \big(w \cdot \phi(s) - (r + \gamma w \cdot \phi(s')) \big) \phi(s)$$

since
$$ V(s;w) = w \cdot \phi(s) $$ 
$$ \nabla_w V(s;w) = \phi(s) $$ 


In [1]:
import numpy as np
eta, gamma, r=.5, 1., .5
phi_s = np.array([1., 2.])
phi_s_prime = np.array([1., 1.5])
w = np.array([0., 0.])
for i in range(10):
    # (1) Generate predictions
    pred = w@phi_s
    # (2) Compute target 
    target = r + gamma * w @  phi_s_prime
    # (3) Compute loss
    loss = .5*((pred-target)**2)
    # (4) Compute gradient
    gradient = (pred - target) * phi_s
    # (5) Update weights: step against the gradient computed in (4)
    w -= eta * gradient
    print(f'{i}.th update: Pred:{pred:.3f}\tTarget:{target:.3f}\tGradient:{gradient}\tLoss:{loss:.3f}')

0.th update: Pred:0.000	Target:0.500	Gradient:[-0.5 -1. ]	Loss:0.125
1.th update: Pred:1.250	Target:1.500	Gradient:[-0.25 -0.5 ]	Loss:0.031
2.th update: Pred:1.875	Target:2.000	Gradient:[-0.125 -0.25 ]	Loss:0.008
3.th update: Pred:2.188	Target:2.250	Gradient:[-0.0625 -0.125 ]	Loss:0.002
4.th update: Pred:2.344	Target:2.375	Gradient:[-0.03125 -0.0625 ]	Loss:0.000
5.th update: Pred:2.422	Target:2.438	Gradient:[-0.015625 -0.03125 ]	Loss:0.000
6.th update: Pred:2.461	Target:2.469	Gradient:[-0.0078125 -0.015625 ]	Loss:0.000
7.th update: Pred:2.480	Target:2.484	Gradient:[-0.00390625 -0.0078125 ]	Loss:0.000
8.th update: Pred:2.490	Target:2.492	Gradient:[-0.00195312 -0.00390625]	Loss:0.000
9.th update: Pred:2.495	Target:2.496	Gradient:[-0.00097656 -0.00195312]	Loss:0.000


## $$ \textbf{Q-learning} $$

#### Prediction at a single data point: $$ Pred(w) := \hat Q_* (s, a;w) $$

#### Target at a single data point: $$ Target := r + \gamma \max_{a' \in A(s')} \hat Q_* (s',a';w)$$


#### Gradient Descent Update: $$ w \leftarrow w - \eta \big[ \hat Q_* (s,a;w) - (r + \gamma \max_{a' \in A(s')} \hat Q_* (s',a';w)) \big] \nabla_w \hat Q_* (s,a;w) $$
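The update above can be sketched in NumPy under one added assumption: that $\hat Q_*$ is linear, $\hat Q_*(s,a;w) = w \cdot \phi(s,a)$, mirroring the linear $V$ used earlier. The function name `q_learning_update`, the arguments `phi_sa` and `phi_next_all`, and all numbers are hypothetical. Terminal states are not handled.

```python
import numpy as np

def q_learning_update(w, phi_sa, r, phi_next_all, eta=0.5, gamma=1.0):
    """One Q-learning step for a linear Q(s, a; w) = w . phi(s, a) (assumed form).

    phi_sa:       feature vector of the visited pair (s, a)
    phi_next_all: 2-D array with one row phi(s', a') per action a' in A(s')
    """
    # (1) Prediction for the visited state-action pair
    pred = w @ phi_sa
    # (2) Target: reward plus discounted value of the best next action
    target = r + gamma * np.max(phi_next_all @ w)
    # (3) Gradient with the target held fixed
    gradient = (pred - target) * phi_sa
    # (4) Step against the gradient; returns a new array, w is not modified
    return w - eta * gradient, pred, target

# Hypothetical features: two actions available in s'
w = np.zeros(2)
phi_sa = np.array([1.0, 2.0])
phi_next_all = np.array([[1.0, 1.5],
                         [0.5, 0.5]])
for i in range(5):
    w, pred, target = q_learning_update(w, phi_sa, 0.5, phi_next_all)
    print(f'update {i}: Pred:{pred:.3f}\tTarget:{target:.3f}\tw:{w}')
```

The only change from the TD update is step (2), where the single next-state value is replaced by a maximum over the next actions.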

