# COMP5328 - Advanced Machine Learning

## Tutorial - Reinforcement Learning

**Semester 2, 2025**

**Objectives:**
* To become familiar with OpenAI Gym, a popular reinforcement learning environment.
* To implement deep Q-learning with memory replay in PyTorch.

**Instructions:**

* Exercises are to be completed in an IPython notebook, such as: 
   * IPython 3 (Jupyter) notebook installed on your computer http://jupyter.org/install (you need to have Python installed first https://docs.python.org/3/using/index.html )
   * A web-based IPython notebook such as Google Colaboratory https://colab.research.google.com/ 

* If you are using Jupyter installed on your computer, go to File->Open, drag and drop the "week11_tutorial.ipynb" file onto the home interface, and click Upload. 
* If you are using Google Colaboratory, click File->Upload notebook and upload the "week11_tutorial.ipynb" file.
* Complete exercises in "week11_tutorial.ipynb".
* To run a cell, press Ctrl-Enter or hit the Play button at the top.
* Complete all exercises marked with **TODO**.
* Save your file when you are done with the exercises, so you can show your tutor next week.

Lecturers: Tongliang Liu

## 1. Installation

To get started, you’ll need to have Python 3.5+ installed.

### 1.1 Install PyTorch via pip3

**To install a specific version of PyTorch from the command line:**

pip3 install torch==\$TORCH_VERSION_NUMBER torchvision==\$TORCHVISION_VERSION_NUMBER \$SOURCE_URL

Different systems and GPUs require different builds of PyTorch, so please check https://pytorch.org/get-started/locally/ carefully before installing. That page also gives instructions for other installation methods. 

You can check that the installation succeeded by running the following imports.

In [1]:
import torch
import torchvision

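### 1.2 Install Gym via pip3

**To install Gym from the command line:** 

pip3 install gym

Similarly, you can check that the installation succeeded by running the following import. Note that the code in this tutorial uses the classic Gym API, in which env.reset() returns only the state and env.step() returns four values. Gym 0.26 and later changed both signatures, so with a recent release you may need to adapt the code or install an earlier version.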

In [2]:
import gym

## 2. Introduction to OpenAI Gym

### 2.1 A Minimal Example

If you have installed Gym correctly, running the following code should open a pop-up window in which a cart-pole starts at the middle position and takes random actions.

In [3]:
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(250):
    env.render()
    # Take a random action from the action space.
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    # Start a new episode once the current one has ended.
    if done:
        env.reset()
env.close()



Detailed documentation is available on the official website (https://gym.openai.com/docs/). Here, we summarise the four values returned by env.step().

**observation or state (object)**: an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

**reward (float)**: amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

**done (boolean)**: whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

**info (dict)**: diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
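
To make the four return values concrete, here is a minimal sketch of the same interaction loop written against a toy environment. `CoinFlipEnv` is a hypothetical class invented for this illustration and is not part of Gym. It only mimics the classic reset/step interface described above.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment that mimics the classic Gym reset/step interface."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial state

    def step(self, action):
        self.t += 1
        coin = self.rng.randint(0, 1)
        state = coin                             # observation of the environment
        reward = 1.0 if action == coin else 0.0  # reward earned by this action
        done = self.t >= self.max_steps          # True once the episode is over
        info = {"t": self.t}                     # diagnostics only, not for learning
        return state, reward, done, info


env = CoinFlipEnv(max_steps=5)
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = 0                                   # a fixed, trivial policy
    state, reward, done, info = env.step(action)
    total_reward += reward
print(info["t"], total_reward)
```

The loop has the same shape as the CartPole example: act, step, accumulate the reward, and stop when done becomes True.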

### 2.2 A Best Code Practice in Reinforcement Learning

Here, we provide a readable and scalable code structure for training RL algorithms.

In [4]:
import gym


class DummyAgent():
    """For best practice, we define an Agent object with three methods for interacting with the environment."""
    def __init__(self, env):
        self.env = env

    # This method is used to update network parameters, store experiences in the memory buffer (if any), and so on.
    def step(self):
        self.update_parameters()

    # This method is used to sample actions.
    # This dummy agent takes actions randomly without looking at observations or rewards.
    def act(self):
        action = self.env.action_space.sample()
        return action

    # To keep the step method short, the update rule lives in its own method.
    # This dummy agent does not have any parameters to update.
    def update_parameters(self):
        pass


env = gym.make('CartPole-v0')
agent = DummyAgent(env)

for i_episode in range(20):
    state = env.reset()
    # We cap each episode at 50 time steps here; CartPole-v0 itself allows up to 200 steps per episode.
    for t in range(50):
        env.render()
        action = agent.act()
        state, reward, done, info = env.step(action)
        agent.step()
        # done indicates whether the episode has finished.
        # In this environment, done is True when the pole tips too far or the cart leaves the track (game over), so we end the episode.
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

Episode finished after 31 timesteps
Episode finished after 17 timesteps
Episode finished after 15 timesteps
Episode finished after 15 timesteps
Episode finished after 24 timesteps
Episode finished after 15 timesteps
Episode finished after 14 timesteps
Episode finished after 15 timesteps
Episode finished after 17 timesteps
Episode finished after 39 timesteps
Episode finished after 12 timesteps
Episode finished after 17 timesteps
Episode finished after 25 timesteps
Episode finished after 16 timesteps
Episode finished after 19 timesteps
Episode finished after 33 timesteps
Episode finished after 18 timesteps
Episode finished after 10 timesteps
Episode finished after 42 timesteps
Episode finished after 11 timesteps


## 3. DQN with Memory Replay

We first provide the pseudocode for DQN with memory replay.

![DQN with memory replay pseudocode](https://blog.oliverxu.cn/2019/12/01/Playing-Cartpole-with-natural-deep-reinforcement-learning/dqn_algorithm.png)
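
For reference, the update that the update_parameters method in this tutorial computes can be written as follows. This is a sketch in our own notation rather than a transcription of the figure: $\theta$ denotes the Q-network parameters, $B$ the batch size, $d_j \in \{0, 1\}$ the done flag of the $j$-th sampled transition, and $s'_j$ its next state. A single network is assumed for both the estimate and the target, as in the code.

$$
y_j = r_j + \gamma \, (1 - d_j) \, \max_{a'} Q(s'_j, a'; \theta),
\qquad
L(\theta) = \frac{1}{B} \sum_{j=1}^{B} \bigl( Q(s_j, a_j; \theta) - y_j \bigr)^2
$$

The factor $(1 - d_j)$ removes the bootstrapped term for terminal transitions, so their target is just the reward $r_j$.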

### 3.1 Define Replay Buffer 

In [5]:
import numpy as np
import random 
from collections import namedtuple, deque 
import torch
import torch.nn.functional as F
import torch.optim as optim

class ReplayBuffer:
    """This object is used to store experiences."""
    
    def __init__(self, action_size, buffer_size, batch_size, device):
        self.device = device
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experiences = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        
        
    def add(self,state, action, reward, next_state,done):
        ##TODO Add a new experience to the memory buffer.
        raise NotImplementedError("TODO: add the experience to self.memory")

    def sample(self):
        """Randomly sample a batch of experiences from memory buffer"""
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(self.device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(self.device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(self.device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(self.device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """override the len method"""
        return len(self.memory)

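
The idea behind the replay buffer can be sketched without PyTorch. The block below is an illustration only and not the reference solution to the TODO. `SimpleReplayBuffer` and `Transition` are hypothetical names, and the sketch returns plain tuples instead of tensors.

```python
import random
from collections import deque, namedtuple

# Hypothetical names for illustration; not the tutorial's ReplayBuffer class.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class SimpleReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity, batch_size, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest items are dropped automatically
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement; requires len(self) >= batch_size.
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SimpleReplayBuffer(capacity=3, batch_size=2, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

stored_states = [tr.state for tr in buffer.memory]
batch = buffer.sample()
print(len(buffer), stored_states, len(batch))
```

Because the deque has a maximum length, adding a fourth and fifth transition silently evicts the two oldest ones, which is the behaviour the buffer_size argument controls in the tutorial class.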

### 3.2 Define a Q-network in PyTorch

In [None]:
import torch 
import torch.nn as nn
import torch.nn.functional as F

# Create a neural network with two hidden layers to model the Q-value function
class DQN(nn.Module):

    def __init__(self, state_size, action_size):
        super(DQN,self).__init__() 
        self.fc1= nn.Linear(state_size,128)
        self.fc2 = nn.Linear(128,128)
        self.out = nn.Linear(128, action_size)

    def forward(self,x):

        #TODO Chain the defined layers together with ReLU activation functions
        return self.out(x)
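
To illustrate what chaining layers with ReLU means, here is a minimal sketch of the same forward pass in plain Python. It assumes the usual pattern of linear, ReLU, linear, ReLU, linear with no activation on the output. The tiny weights are hand-picked for the example and the helper names are hypothetical, so this is not the PyTorch solution to the TODO.

```python
def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in vector]


def linear(vector, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def q_values(state, params):
    """Forward pass: linear -> ReLU -> linear -> ReLU -> linear (no activation on the output)."""
    hidden1 = relu(linear(state, *params["fc1"]))
    hidden2 = relu(linear(hidden1, *params["fc2"]))
    return linear(hidden2, *params["out"])


# Toy hand-picked parameters (illustrative only): 2 state features, 2 hidden units, 2 actions.
params = {
    "fc1": ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),
    "fc2": ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
    "out": ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]),
}

print(q_values([2.0, 3.0], params))
```

The output layer is left linear because Q-values are unbounded real numbers, so one value per action is returned without squashing.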

### 3.3 Define a Smart Agent

Now, we define a general agent that models the Q-value function with a neural network and learns from batches of experiences sampled from the memory buffer. This agent can play any game with discrete actions.

In [None]:
import numpy as np
import random 
from collections import namedtuple, deque 

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T


class Agent():
    """Interacts with and learns from the environment."""
    def __init__(self, env, replay_buffer_size = int(100000), batch_size = 64, lr = 5e-4, gamma=0.99, update_interval = 5):
        self.update_step = 0
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.env = env
        self.batch_size = batch_size
        self.replay_buffer_size = replay_buffer_size
        self.action_size = self.env.action_space.n
        self.state_size = self.env.observation_space.shape[0]
        self.gamma = gamma
        self.update_interval = update_interval
        # init a network used to model Q-value function
        self.qnetwork = DQN(state_size = self.state_size, action_size = self.action_size).to(self.device)

        # optimization method
        self.optimizer = optim.Adam(self.qnetwork.parameters(),lr=lr)

        # init replay memory 
        self.memory = ReplayBuffer(
            action_size = self.action_size, 
            batch_size = self.batch_size,
            buffer_size= self.replay_buffer_size, 
            device = self.device
            )


    def step(self, state, action, reward, next_state, done):
        ## TODO save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)
        # Update the network parameters once every update_interval time steps.
        self.update_step = (self.update_step+1)% self.update_interval
        if self.update_step == 0:
            # If enough samples are available in memory, randomly sample a batch of experiences from memory and learn.
            if len(self.memory)>self.batch_size:
                ## TODO sample experiences and update network parameters
                raise NotImplementedError("TODO: sample a batch from memory and update the network")

    def act(self, state, exploration_rate):
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        self.qnetwork.eval()
        with torch.no_grad():
            action_values = self.qnetwork(state)
        ## TODO 
        # 1) With probability 1-exploration_rate, this agent follows the action that maximizes the estimated Q-values.
        # 2) With probability exploration_rate, this agent randomly samples an action from the action space for exploration.
        raise NotImplementedError("TODO: return the chosen action")

     
    def update_parameters(self, experiences):
        """update network parameters"""
        states, actions, rewards, next_states, dones = experiences
        criterion = torch.nn.MSELoss()
        self.qnetwork.train()
        ## TODO calculate the estimated Q-value of the actions taken in the current states and store it in q_value_old


        with torch.no_grad():
            q_value_next = self.qnetwork(next_states).max(1)[0].unsqueeze(1)

        # TD target: reward plus the discounted best next Q-value, which is zeroed out for terminal states.
        q_value = rewards + (self.gamma * q_value_next*(1-dones))
        loss = criterion(q_value_old, q_value)
        ## TODO update the network
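
The two ideas behind the act and update_parameters methods, epsilon-greedy action selection and the one-step target, can be sketched in plain Python. The function names and toy numbers below are assumptions made for this illustration. The sketch works on lists rather than tensors and is not the solution to the TODOs.

```python
import random


def epsilon_greedy(action_values, exploration_rate, rng=random):
    """Pick a random action with probability exploration_rate, else the greedy one."""
    if rng.random() < exploration_rate:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


def td_targets(rewards, next_action_values, dones, gamma=0.99):
    """One-step targets: r + gamma * max_a' Q(s', a'), with no bootstrap term when done."""
    return [
        r + gamma * max(q_next) * (1.0 - float(d))
        for r, q_next, d in zip(rewards, next_action_values, dones)
    ]


# Toy numbers, chosen only for illustration.
greedy_action = epsilon_greedy([0.1, 0.7], exploration_rate=0.0)
targets = td_targets(
    rewards=[1.0, 1.0],
    next_action_values=[[2.0, 4.0], [3.0, 5.0]],
    dones=[False, True],
    gamma=0.5,
)
print(greedy_action, targets)
```

With an exploration rate of 0 the choice is always greedy, and the second transition is terminal, so its target reduces to the reward alone.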

### 3.4 Train the Agent

We train the agent on the CartPole environment (https://gym.openai.com/envs/CartPole-v0/). The pole should stay balanced after approximately the 400th episode. The maximum score is 50 because each episode is capped at 50 time steps in the code below.

**To accelerate training, you can comment out "env.render()". The code will then run without rendering the animation.**

In [None]:
import gym

env = gym.make('CartPole-v0')

agent = Agent(env)
exploration_rate = 1.0
for i_episode in range(450):
    state = env.reset()
    score = 0
    # We cap each episode at 50 time steps here; CartPole-v0 itself allows up to 200 steps per episode.
    for t in range(50):
        # You can comment out the following line to accelerate training.
        env.render()
        
        action = agent.act(state,exploration_rate)
        next_state, reward, done, info = env.step(action)
        agent.step(state,action,reward,next_state,done)
        state = next_state
        score += reward
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
        exploration_rate = max(exploration_rate*0.996,0.01)
    print('Score for '+str(i_episode)+" is "+str(score))
env.close()
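
In the training loop above, the exploration rate is multiplied by 0.996 after each time step that does not end the episode, with a floor of 0.01. The following sketch shows how quickly such a schedule decays. The helper names are hypothetical and the defaults simply mirror the constants used in the loop.

```python
def steps_until_floor(start=1.0, decay=0.996, floor=0.01):
    """Count multiplicative decay steps until the rate is clamped at `floor`."""
    rate, steps = start, 0
    while rate > floor:
        rate = max(rate * decay, floor)
        steps += 1
    return steps


def rate_after(steps, start=1.0, decay=0.996, floor=0.01):
    """Exploration rate after a given number of decay steps."""
    return max(start * decay ** steps, floor)


print(steps_until_floor())
print(round(rate_after(100), 3), round(rate_after(500), 3), rate_after(5000))
```

Because the decay is applied per time step rather than per episode, the floor is reached after roughly a thousand steps, which for a poorly performing early agent corresponds to only a few dozen episodes. A slower decay factor would keep the agent exploring for longer.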