<a href="https://colab.research.google.com/github/QaziSaim/CASE-STUDIES/blob/main/PPO_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A concrete math problem about PPO's clipped surrogate objective, solved step by step.

### Problem

You collected three timesteps. For each timestep you have:

* Old policy probability $\pi_{\text{old}}(a|s) = [0.20,\ 0.80,\ 0.50]$
* New policy probability $\pi_{\theta}(a|s) = [0.25,\ 0.60,\ 0.75]$
* Advantage estimates $A = [1.5,\ -2.0,\ 0.5]$

Use clipping parameter $\varepsilon = 0.20$ and compute the PPO clipped surrogate objective

$$
L^{CLIP}(\theta) = \frac{1}{N}\sum_{t=1}^{N} \min\big( r_t A_t,\ \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon)\, A_t \big),
$$

where $r_t = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$.

---

### Solution (step-by-step arithmetic)

1. Compute probability ratios $r_t$:

* $r_1 = 0.25 / 0.20 = 1.25$
* $r_2 = 0.60 / 0.80 = 0.75$
* $r_3 = 0.75 / 0.50 = 1.50$

2. Compute unclipped terms $r_t A_t$:

* $r_1 A_1 = 1.25 \times 1.5 = 1.875$
* $r_2 A_2 = 0.75 \times (-2.0) = -1.5$
* $r_3 A_3 = 1.50 \times 0.5 = 0.75$

3. Clip each $r_t$ to $[1-\varepsilon,\ 1+\varepsilon] = [0.8,\ 1.2]$:

* $\text{clip}(r_1) = \text{clip}(1.25) = 1.20$
* $\text{clip}(r_2) = \text{clip}(0.75) = 0.80$
* $\text{clip}(r_3) = \text{clip}(1.50) = 1.20$

4. Compute clipped terms $\text{clip}(r_t)\,A_t$:

* $\text{clip}(r_1)\,A_1 = 1.20 \times 1.5 = 1.80$
* $\text{clip}(r_2)\,A_2 = 0.80 \times (-2.0) = -1.60$
* $\text{clip}(r_3)\,A_3 = 1.20 \times 0.5 = 0.60$

5. For each timestep take the minimum of the unclipped and clipped terms. The $\min$ makes the objective a pessimistic lower bound on the unclipped surrogate, so pushing $r_t$ further outside $[0.8,\ 1.2]$ can never increase it:

* timestep 1: $\min(1.875,\ 1.80) = 1.80$
* timestep 2: $\min(-1.5,\ -1.60) = -1.60$  *(smaller is -1.60)*
* timestep 3: $\min(0.75,\ 0.60) = 0.60$

6. Average to get $L^{CLIP}$:

$$
L^{CLIP} = \frac{1.80 + (-1.60) + 0.60}{3} = \frac{0.80}{3} = 0.266666\ldots
$$

So $L^{CLIP} \approx 0.2667$.

If you implement PPO as a minimization of loss, you'd use loss $= -L^{CLIP} \approx -0.2667$.

---



In [None]:
old_policy_pr = [0.20, 0.80, 0.50]
new_policy_pr = [0.25, 0.60, 0.75]
advantages_estimates = [1.5, -2.0, 0.5]
epsilon = 0.2


In [None]:
rt = [round(x/y,5) for x,y in zip(new_policy_pr,old_policy_pr)]

In [None]:
rt

[1.25, 0.75, 1.5]

In [None]:
rtAt = [x*y for x,y in zip(rt,advantages_estimates)]

In [None]:
rtAt

[1.875, -1.5, 0.75]

In [None]:
upper = 1 + epsilon
lower = 1 - epsilon

In [None]:
# Clamp each ratio into [lower, upper]; ratios already inside the range pass through unchanged.
clip_rt = [min(max(r, lower), upper) for r in rt]

In [None]:
clip_rt

[1.2, 0.8, 1.2]

In [None]:
clip_rt_x_at = [round(x*y,3) for x,y in zip(clip_rt,advantages_estimates)]

In [None]:
clip_rt_x_at

[1.8, -1.6, 0.6]

In [None]:
rtAt

[1.875, -1.5, 0.75]

In [None]:
min_timestep = [min(x,y) for x,y in zip(rtAt,clip_rt_x_at)]

In [None]:
min_timestep

[1.8, -1.6, 0.6]

In [None]:
L_clip = sum(min_timestep)/len(min_timestep)

In [None]:
L_clip

0.26666666666666666

In [None]:
print('Final L clip found ', round(L_clip,4))

Final L clip found  0.2667
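
The step-by-step cells above can be collapsed into one vectorized function. This is a minimal NumPy sketch, not part of the original notebook; the name `clipped_surrogate` and its argument names are hypothetical.

```python
import numpy as np

def clipped_surrogate(old_probs, new_probs, advantages, epsilon=0.2):
    """Mean PPO clipped surrogate objective over a batch of timesteps."""
    old_probs = np.asarray(old_probs, dtype=float)
    new_probs = np.asarray(new_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratios = new_probs / old_probs                                     # r_t
    unclipped = ratios * advantages                                    # r_t * A_t
    clipped = np.clip(ratios, 1 - epsilon, 1 + epsilon) * advantages   # clip(r_t) * A_t
    return float(np.mean(np.minimum(unclipped, clipped)))

l_clip = clipped_surrogate([0.20, 0.80, 0.50], [0.25, 0.60, 0.75], [1.5, -2.0, 0.5])
print(round(l_clip, 4))  # 0.2667
```

Unlike a branch-per-ratio loop, `np.clip` also handles ratios that already lie inside the clip range, which it returns unchanged.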


### CartPole

In [1]:
old_policy = [[0.60,0.40],[0.70,0.30],[0.20,0.80],[0.55,0.45]]
new_policy = [[0.65,0.35],[0.50,0.50],[0.30,0.70],[0.60,0.40]]
actions = [0, 1, 1, 0]
advatages = [1.2, -0.5, 0.8, 0.3]
clipping_parameter = 0.20
cp_range = [0.8, 1.2]
v_pred = [1.5, 0.2, 0.9, 0.1]
v_target = [2.0, -0.3, 1.5, 0.2]
# Hyperparameters
c1 = 0.5   # value loss coefficient
c2 = 0.01  # entropy coefficient

In [2]:
import numpy as np
old_policy = np.array(old_policy)
new_policy = np.array(new_policy)
advatages = np.array(advatages)

In [3]:
rt = new_policy[np.arange(len(actions)), actions] / old_policy[np.arange(len(actions)), actions]

In [4]:
rt

array([1.08333333, 1.66666667, 0.875     , 1.09090909])

In [5]:
rtAt=rt*advatages

In [6]:
rtAt

array([ 1.3       , -0.83333333,  0.7       ,  0.32727273])

In [7]:
clipRt = np.clip(rt,1 - clipping_parameter, 1+ clipping_parameter)

In [8]:
clipRt

array([1.08333333, 1.2       , 0.875     , 1.09090909])

In [9]:
cRcT = clipRt * advatages

In [10]:
# Elementwise minimum of the unclipped and clipped terms
min_terms = np.minimum(rtAt, cRcT)

In [11]:
final_sum = sum(min_terms)

In [12]:
final_sum

np.float64(1.4939393939393937)

In [13]:
lclip = final_sum / len(actions)  # mean over the 4 timesteps

In [14]:
print(lclip)

0.3734848484848484
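
One detail in the numbers above is worth checking: the second ratio (1.667) lies outside $[0.8,\ 1.2]$, yet the objective still uses its unclipped term, because with a negative advantage the unclipped value $-0.833$ is smaller than the clipped value $-0.6$. The sketch below re-enters the same four ratios and advantages (variable names are hypothetical) and flags the timesteps where the $\min$ actually selects the clipped term.

```python
import numpy as np

# Ratios and advantages taken from the CartPole cells above.
rt = np.array([0.65 / 0.60, 0.50 / 0.30, 0.70 / 0.80, 0.60 / 0.55])
advantages = np.array([1.2, -0.5, 0.8, 0.3])
eps = 0.2

unclipped = rt * advantages
clipped = np.clip(rt, 1 - eps, 1 + eps) * advantages

outside_range = (rt < 1 - eps) | (rt > 1 + eps)  # ratio left [1 - eps, 1 + eps]
clip_selected = clipped < unclipped              # min() picked the clipped term

print(outside_range)  # [False  True False False]
print(clip_selected)  # [False False False False]
```

In the first worked example, by contrast, all three timesteps select the clipped term.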


In [20]:
# Value-function loss: mean squared error between predicted and target values
mse = sum(np.array([(x-y)**2 for x,y in zip(v_pred,v_target)]))/4

In [21]:
mse

np.float64(0.2175)

In [23]:
small_epsilon = 1e-10  # guards against log(0)
# Entropy of the new policy's action distribution at each timestep
new_policy_entropy = -np.sum(new_policy * np.log(new_policy + small_epsilon), axis=1)
avg_entropy = np.mean(new_policy_entropy)


In [24]:
new_policy_entropy

array([0.64744664, 0.69314718, 0.6108643 , 0.67301167])

In [25]:
avg_entropy

np.float64(0.6561174469646819)

In [32]:
c1*mse,c2*avg_entropy,lclip

(np.float64(0.10875),
 np.float64(0.006561174469646819),
 np.float64(0.3734848484848484))
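
The three printed quantities combine into a single scalar to minimize. Writing the value loss as $L^{VF}$ and the mean policy entropy as $H$ (symbols introduced here for illustration; the notebook calls them `mse` and `avg_entropy`), the usual PPO combination is:

$$
\text{loss} = -L^{CLIP} + c_1 L^{VF} - c_2 H,
\qquad
L^{VF} = \frac{1}{N}\sum_{t=1}^{N} \big(V_{\text{pred},t} - V_{\text{target},t}\big)^2,
\qquad
H = \frac{1}{N}\sum_{t=1}^{N} \Big( -\sum_{a} \pi_{\theta}(a|s_t) \log \pi_{\theta}(a|s_t) \Big)
$$

With the values above this gives $-0.37348 + 0.10875 - 0.00656 \approx -0.27130$. The clipped objective and the entropy bonus enter with a minus sign because they are maximized, while the value error is minimized.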

In [33]:
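# Loss to minimize: negate the clipped objective, add the weighted value loss, subtract the entropy bonus
loss = -lclip + (c1*mse - c2*avg_entropy)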

In [34]:
loss

np.float64(-0.27129602295449523)
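
For reuse, the CartPole cells can be gathered into one function. The following is a sketch under the same assumptions as the cells above (discrete actions, per-timestep probability tables, precomputed advantages and value targets); the name `ppo_loss` and its signature are hypothetical.

```python
import numpy as np

def ppo_loss(old_policy, new_policy, actions, advantages,
             v_pred, v_target, epsilon=0.2, c1=0.5, c2=0.01):
    """Scalar PPO loss: -L_clip + c1 * value MSE - c2 * mean entropy."""
    old_policy = np.asarray(old_policy, dtype=float)
    new_policy = np.asarray(new_policy, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    idx = np.arange(len(actions))

    # Probability ratio for the action actually taken at each timestep.
    ratios = new_policy[idx, actions] / old_policy[idx, actions]
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - epsilon, 1 + epsilon) * advantages
    l_clip = np.mean(np.minimum(unclipped, clipped))

    # Mean squared error between predicted and target state values.
    value_loss = np.mean((np.asarray(v_pred) - np.asarray(v_target)) ** 2)

    # Per-timestep entropy of the new policy; the tiny constant guards log(0).
    entropy = -np.sum(new_policy * np.log(new_policy + 1e-10), axis=1)

    return float(-l_clip + c1 * value_loss - c2 * np.mean(entropy))

loss = ppo_loss(
    old_policy=[[0.60, 0.40], [0.70, 0.30], [0.20, 0.80], [0.55, 0.45]],
    new_policy=[[0.65, 0.35], [0.50, 0.50], [0.30, 0.70], [0.60, 0.40]],
    actions=[0, 1, 1, 0],
    advantages=[1.2, -0.5, 0.8, 0.3],
    v_pred=[1.5, 0.2, 0.9, 0.1],
    v_target=[2.0, -0.3, 1.5, 0.2],
)
print(round(loss, 4))  # -0.2713
```

In a real training loop the same arithmetic would run on framework tensors so that gradients can flow to the policy and value parameters; NumPy is used here only to check the numbers.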