#### RL in continuous spaces with discrete actions - using DQN ####

CartPole problem: a pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart, and the goal is to balance the pole by applying forces to the cart in the left and right directions.

See the details of this environment here: https://gymnasium.farama.org/environments/classic_control/cart_pole/

In [1]:
from tqdm import tqdm

import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam,SGD
import warnings
warnings.filterwarnings('ignore')

In [2]:
from collections import deque
import random
import copy
import numpy as np

In [3]:
from torchsummary import summary

In [4]:
torch.manual_seed(42)

<torch._C.Generator at 0x17ba8643c30>

In [5]:
print(torch.__version__)

2.5.1+cu118


In [6]:
if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'

### Description
    This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in
    ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problem"](https://ieeexplore.ieee.org/document/6313077).
    A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
    The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces
     in the left and right direction on the cart.
    ### Action Space
    The action is a `ndarray` with shape `(1,)` which can take values `{0, 1}` indicating the direction
     of the fixed force the cart is pushed with.
    | Num | Action                 |
    |-----|------------------------|
    | 0   | Push cart to the left  |
    | 1   | Push cart to the right |
    **Note**: The velocity that is reduced or increased by the applied force is not fixed and it depends on the angle
     the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it
    ### Observation Space
    The observation is a `ndarray` with shape `(4,)` with the values corresponding to the following positions and velocities:
    | Num | Observation           | Min                 | Max               |
    |-----|-----------------------|---------------------|-------------------|
    | 0   | Cart Position         | -4.8                | 4.8               |
    | 1   | Cart Velocity         | -Inf                | Inf               |
    | 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
    | 3   | Pole Angular Velocity | -Inf                | Inf               |
    **Note:** While the ranges above denote the possible values for observation space of each element,
        it is not reflective of the allowed values of the state space in an unterminated episode. Particularly:
    -  The cart x-position (index 0) can take values between `(-4.8, 4.8)`, but the episode terminates
       if the cart leaves the `(-2.4, 2.4)` range.
    -  The pole angle can be observed between  `(-.418, .418)` radians (or **±24°**), but the episode terminates
       if the pole angle is not in the range `(-.2095, .2095)` (or **±12°**)
    ### Rewards
    Since the goal is to keep the pole upright for as long as possible, a reward of `+1` for every step taken,
    including the termination step, is allotted. The threshold for rewards is 475 for v1.
    ### Starting State
    All observations are assigned a uniformly random value in `(-0.05, 0.05)`
    ### Episode End
    The episode ends if any one of the following occurs:
    1. Termination: Pole Angle is greater than ±12°
    2. Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
    3. Truncation: Episode length is greater than 500 (200 for v0)
    ### Arguments
    ```
    gym.make('CartPole-v1')
    ```
    No additional arguments are currently supported.
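The episode-end rules in the description can be condensed into a small check. This is a minimal sketch, not Gymnasium's implementation: the function name `episode_status` and its arguments are hypothetical, the thresholds are the ±2.4 and ±12° values quoted above, and truncation is assumed to trigger once 500 steps have been taken.

```python
import math

# Thresholds quoted in the environment description
X_THRESHOLD = 2.4                        # cart position limit
THETA_THRESHOLD = 12 * math.pi / 180     # pole angle limit, about 0.2095 rad
MAX_STEPS = 500                          # step limit for CartPole-v1 (assumed inclusive)


def episode_status(cart_position, pole_angle, steps_taken):
    """Return (terminated, truncated) for a hypothetical CartPole-like state."""
    terminated = abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD
    truncated = steps_taken >= MAX_STEPS
    return terminated, truncated


print(episode_status(0.0, 0.05, 10))    # (False, False)
print(episode_status(2.5, 0.0, 10))     # (True, False)
print(episode_status(0.0, 0.0, 500))    # (False, True)
```

Keeping the two flags separate matters for DQN: a terminated state has no future value, whereas a truncated episode was only cut short by the step limit.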

In [7]:
env = gym.make('CartPole-v1')

In [8]:
# Base model

class Model(nn.Module):
    def __init__(self):
        super(Model,self).__init__()
        self.lin1 = nn.Linear(4,16)     # 4 input features as there are 4 states
        self.lin2 = nn.Linear(16,16)
        self.lin3 = nn.Linear(16,2)     # 2 outputs as there are 2 actions
        
    
    def forward(self,x):
        x = self.lin1(x)
        x = F.relu(x)
        x = self.lin2(x)
        x = F.relu(x)
        x = self.lin3(x)
    
        return x
        


In [9]:
# Copy DQNet to create targetNet.
# It must be a deep copy; otherwise changes to one network would affect the other.

DQNet = Model().to(device)
targetNet = copy.deepcopy(DQNet).to(device)

# targetNet always stays in eval mode: it is never trained by backpropagation
targetNet.eval()

Model(
  (lin1): Linear(in_features=4, out_features=16, bias=True)
  (lin2): Linear(in_features=16, out_features=16, bias=True)
  (lin3): Linear(in_features=16, out_features=2, bias=True)
)

In [10]:
# You could look at the model summary

summary(targetNet,(1,4))
# summary(DQNet,(1,4))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                [-1, 1, 16]              80
            Linear-2                [-1, 1, 16]             272
            Linear-3                 [-1, 1, 2]              34
Total params: 386
Trainable params: 386
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
----------------------------------------------------------------
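The parameter counts in the summary can be checked by hand: a fully connected layer has `in_features * out_features` weights plus `out_features` biases. The helper name `linear_params` below is only illustrative.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


# Layer shapes of the Model defined above: 4 -> 16 -> 16 -> 2
layer_shapes = [(4, 16), (16, 16), (16, 2)]
counts = [linear_params(i, o) for i, o in layer_shapes]

print(counts, sum(counts))   # [80, 272, 34] 386
```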


In [11]:
# Test the models and check that they give the same output

test_input = torch.from_numpy(np.array([1,2,3,4])).float().to(device)   # avoid shadowing the built-in input()
print(test_input)
print(DQNet(test_input))
print(targetNet(test_input))

tensor([1., 2., 3., 4.], device='cuda:0')
tensor([0.0982, 0.2028], device='cuda:0', grad_fn=<ViewBackward0>)
tensor([0.0982, 0.2028], device='cuda:0', grad_fn=<ViewBackward0>)


In [12]:
# Initialize parameters for network

lr = 0.001
optimizer = Adam(params=DQNet.parameters(), lr=lr)   # We only train DQN

# Epsilon-greedy policy: as training progresses, we take fewer random actions
eps_start = 1
eps_end = 0.0001
eps_decay = 0.999

loss_fn = nn.MSELoss()  # Simple mean squared error loss
episodes = 500
mini_batch_size = 128

loss_value = 0
loss_history = []
update_freq = 15       # Updating target network
gamma = 0.95
scores = []
rewards = 0
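To make the exploration settings concrete, here is a minimal, framework-free sketch of epsilon-greedy selection and the multiplicative decay, using the `eps_start`, `eps_end` and `eps_decay` values from the cell above. The helpers `epsilon_greedy` and `decay` are hypothetical names and are not used elsewhere in this notebook.

```python
import math
import random

eps_start, eps_end, eps_decay = 1.0, 0.0001, 0.999   # values from the cell above


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay(eps):
    """One multiplicative decay step, floored at eps_end."""
    return max(eps * eps_decay, eps_end)


# Number of decay steps needed before eps reaches the floor eps_end
steps_to_floor = math.ceil(math.log(eps_end / eps_start) / math.log(eps_decay))
print(steps_to_floor)

print(epsilon_greedy([0.1, 0.7], eps=0.0))   # 1: always greedy when eps is 0
```

With these values the floor is reached after roughly nine thousand decay steps, so the point at which decay starts largely determines how long the agent keeps exploring.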


In [13]:
# Training function

def training(Qnet,t_net,replay_memory,optimizer,loss_fn,mini_batch_size=32):
    # Note: relies on the globals `device` and `gamma` defined in earlier cells
    
    # Sample random transitions (with replacement) from the replay memory
    observations = random.choices(replay_memory,k=mini_batch_size)
        
    Qnet.train()
    
    # Loop over the sampled transitions, taking one gradient step per transition
    for observation in observations:
        
        # Each transition is (state, action, reward, next_state, done)
        state = torch.from_numpy(np.array(observation[0])).float().to(device)  # Convert to tensor and put on device

        # Predict Q-values at time t and pick the one for the action that was taken
        q_values = Qnet(state)
        expected_value = q_values[int(observation[1])]

        done = observation[4]

        # Determine the greedy Q-value at time t+1 using the target network.
        # The target is treated as a constant, so no gradients are tracked here.
        with torch.no_grad():
            next_state = torch.from_numpy(np.array(observation[3])).float().to(device)
            next_q_value = t_net(next_state).max()

        # Bellman equation for the current state
        if done:
            # If the episode is done, the target value is just the reward
            target_value = torch.tensor(observation[2], dtype=torch.float32, device=device)
        else:
            # Otherwise add the discounted next-state value to the immediate reward
            target_value = observation[2] + (gamma * next_q_value)
            
        # Compute loss value
        loss = loss_fn(expected_value, target_value)
        loss_value = loss.item()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        

    # loss_value is the loss of the last sampled transition only
    return Qnet,loss_value
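The target computed inside `training` is the one-step Q-learning target: the reward alone for a terminal transition, otherwise the reward plus the discounted maximum Q-value of the next state. The sketch below restates that rule with plain Python numbers; `td_target` and `squared_error` are illustrative helpers, not part of the notebook, and `gamma=0.95` matches the value set earlier.

```python
def td_target(reward, next_q_values, done, gamma=0.95):
    """One-step Q-learning target: r if terminal, else r + gamma * max_a Q(s', a)."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_error(predicted_q, target):
    """Per-sample squared error, the quantity that MSE averages."""
    return (predicted_q - target) ** 2


non_terminal = td_target(1.0, [0.5, 2.0], done=False)   # about 1 + 0.95 * 2.0
terminal = td_target(1.0, [0.5, 2.0], done=True)        # just the reward

print(non_terminal, terminal)
print(squared_error(2.0, non_terminal))
```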
    

In [14]:
# Q explore / exploit loop

eps = eps_start
replay = []

for i in range(episodes):   
    s = env.reset()                         # Reset to start state
    s = torch.from_numpy(s[0])              # reset() returns (observation, info); convert the observation to a tensor
    done=False                              # Episode end flag
    rewards = 0                             # Container for rewards accumulation
    t_step = 0    
    
    while not done: 
            DQNet.eval()

            s = s.to(device)                # Put to device
        
            # Predict Q-value at time t
            q_values = DQNet(s)
            
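            # Take a random action with probability eps, otherwise the greedy action from DQNet
            if np.random.random() < eps:  
                a = env.action_space.sample()
            else:
                a = torch.argmax(q_values).item()
                
            # step() returns (obs, reward, terminated, truncated, info);
            # only `terminated` is used as `done` here, the step limit is enforced via t_step below
            new_state,reward,done,_,_ = env.step(a)
                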
                
            # Gather new experience and append to replay buffer   
            new_experience = s.tolist(), a, reward, new_state.tolist(), done 
            replay.append(new_experience)
            
            # Limit replay buffer to 100000 entries
            if len(replay) >100000:
                replay.pop(0)
                
            # Accumulate rewards
            rewards += reward
             
            s = torch.from_numpy(new_state) # Next state becomes current state

            # # decrease the epsilon
            # eps = max(eps*eps_decay,eps_end)                 
            
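            # Sync targetNet with DQNet's weights.
            # Note: this check sits inside the step loop, so the copy is repeated on every step of each update_freq-th episode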
            if i % update_freq == 0:
                targetNet = copy.deepcopy(DQNet)
                targetNet.eval()
                
            # Don't let the episode run more than 500 time steps
            t_step += 1
            if t_step > 500:
                #print('here')
                break
            
            # Train the DQNet that approximates q(s,a), using the replay memory
            if len(replay) > 2000:
                DQNet,loss = training(Qnet=DQNet,t_net=targetNet, replay_memory=replay, optimizer=optimizer, loss_fn=loss_fn, mini_batch_size=mini_batch_size)
                loss_history.append(loss)   # Save the loss value
                
                # decrease the epsilon
                eps = max(eps*eps_decay,eps_end)
            
    scores.append(rewards)
    if rewards >= 500 and i > 200:
        torch.save(DQNet,'./saved_models/DQN'+str(i)+'.pth')   # the ./saved_models directory must already exist
        torch.save(DQNet.state_dict(), 'DQN_parameters.pth')
        print(i)

    if i % 10 == 0:
        print(f"Episode {i}/{episodes} Rewards:{rewards} Buffer Length:{len(replay)} and eps:{eps}")
    #print(rewards,end = " ")

Episode 0/500 Rewards:22.0 Buffer Length:22 and eps:1
Episode 10/500 Rewards:45.0 Buffer Length:298 and eps:1
Episode 20/500 Rewards:30.0 Buffer Length:549 and eps:1
Episode 30/500 Rewards:35.0 Buffer Length:850 and eps:1
Episode 40/500 Rewards:13.0 Buffer Length:1046 and eps:1
Episode 50/500 Rewards:19.0 Buffer Length:1279 and eps:1
Episode 60/500 Rewards:14.0 Buffer Length:1460 and eps:1
Episode 70/500 Rewards:25.0 Buffer Length:1686 and eps:1
Episode 80/500 Rewards:20.0 Buffer Length:1833 and eps:1
Episode 90/500 Rewards:14.0 Buffer Length:1985 and eps:1
Episode 100/500 Rewards:10.0 Buffer Length:2267 and eps:0.7655707927460921
Episode 110/500 Rewards:70.0 Buffer Length:2624 and eps:0.5356297035976458
Episode 120/500 Rewards:279.0 Buffer Length:3979 and eps:0.1380705958965304
Episode 130/500 Rewards:501.0 Buffer Length:7049 and eps:0.006418796221744962
Episode 140/500 Rewards:112.0 Buffer Length:9734 and eps:0.00043906368915942414
Episode 150/500 Rewards:100.0 Buffer Length:10679 an
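The loop above caps the replay list by calling `pop(0)`, which shifts every remaining element. As an alternative sketch (not what the notebook runs), a `collections.deque` with `maxlen` evicts the oldest transition automatically. The capacity of 5 and the dummy transitions are for illustration only.

```python
import random
from collections import deque

replay = deque(maxlen=5)   # small capacity for illustration; the notebook caps at 100000

for step in range(8):
    # (state, action, reward, next_state, done) filled with dummy values
    replay.append(([float(step)] * 4, step % 2, 1.0, [float(step + 1)] * 4, False))

print(len(replay))        # 5
print(replay[0][0][0])    # 3.0: transitions 0, 1 and 2 were evicted

# Sample a mini-batch without replacement (the notebook uses random.choices, which samples with replacement)
batch = random.sample(list(replay), k=3)
print(len(batch))         # 3
```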

In [None]:
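# Test the model
env = gym.make('CartPole-v1',render_mode='human')
# Loading a fully pickled model needs weights_only=False on newer PyTorch versions
model_500 = torch.load('DQN1.pth', weights_only=False).to(device)
#model_500.load_state_dict(torch.load('DQN_parameters.pth'))
model_500.eval()

for e in range(10):
    sta = env.reset()[0]
    
    sta = torch.from_numpy(sta)
    done = False
    i = 0                                   # Step counter for this episode
    while not done:
        # With render_mode='human' the environment is rendered automatically on each step
        with torch.no_grad():
            action = torch.argmax(model_500(sta.to(device)))
        # step() returns (obs, reward, terminated, truncated, info)
        new_sta, rew, terminated, truncated, info = env.step(action.item())
        done = terminated or truncated      # Stop on failure or when the step limit is reached
        sta = torch.from_numpy(new_sta)
        i += 1
        if done:
            print(e, i)
env.close()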

In [None]:
env.close()