#### Breakout
Code guided by:
* https://www.nature.com/articles/nature14236
* https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
* https://github.com/fg91/Deep-Q-Learning/blob/master/DQN.ipynb
* https://github.com/openai/baselines/tree/master/baselines/deepq
* https://github.com/jasonbian97/Deep-Q-Learning-Atari-Pytorch





In [2]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image 
from tqdm.notebook import tqdm
from collections import deque
import itertools
import random
import baselines_wrappers # https://github.com/openai/baselines/tree/master/baselines/deepq
import gym
from collections import namedtuple
from pytorch_wrappers import make_atari_deepmind,BatchedPytorchFrameStack,PytorchLazyFrames
from torch.utils.tensorboard import SummaryWriter 
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

True
1
0
NVIDIA GeForce RTX 3080


### Hyperparameters

Based on *Human-level control through deep reinforcement learning*:

https://www.nature.com/articles/nature14236



In [3]:
gamma = 0.99                  # discount factor
batch_size = 32               # minibatch size
memory_size = 1000000         # capacity of the replay memory
required_mem = 50000          # random env steps taken to pre-fill the memory before training
e_start = 1.0                 # start value of epsilon
e_end =  0.1                  # end value of epsilon
e_steps = 1000000             # number of steps (summed over the 4 environments) until epsilon reaches e_end
target_update = 2500          # how often (in training steps) the target network is synced to the action network
lr = 25e-5                    # learning rate
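
The annealing implied by `e_start`, `e_end` and `e_steps` can be sketched as follows. This is a minimal illustration that mirrors the `np.interp` call used later in the training loop; the helper name `epsilon_at` is hypothetical and not part of the notebook.

```python
import numpy as np

e_start, e_end, e_steps = 1.0, 0.1, 1_000_000  # values from the cell above

def epsilon_at(frame):
    """Linearly anneal epsilon from e_start to e_end, then hold it at e_end."""
    # np.interp clamps to the end values outside the [0, e_steps] range.
    return float(np.interp(frame, [0, e_steps], [e_start, e_end]))

for frame in (0, 500_000, 1_000_000, 2_000_000):
    print(frame, epsilon_at(frame))
```

Halfway through the schedule epsilon sits midway between the two end values, and past `e_steps` it stays at `e_end`.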


# Preprocess
84x84x1 as used in paper 
https://arxiv.org/pdf/1312.5602.pdf 

4.1 Preprocessing and Model Architecture

The raw frames are preprocessed by first converting their RGB representation
to gray-scale and down-sampling it to a 110×84 image.

The final input representation is obtained by
cropping an 84 × 84 region of the image that roughly captures the playing area

Note: the resulting tensor has the channel as its first dimension.

In [22]:
from torchvision.transforms import Compose, ToTensor, Normalize, Grayscale, Resize
# No samples available to calculate mean/std for normalisation, so Normalize is skipped
class Preprocess():
    def __init__(self, target_width = 84, target_height = 84):
        self.width = target_width
        self.height = target_height
        
        # transform
        self.transform = T.Compose([
            ToTensor(),
            Grayscale(),
            Resize((110,84))
        ])
    
    def process(self,image):
        #return self.transform(image)
        # crop to the play area; the paper does not say exactly which region it uses,
        # so the top offset (18) was chosen after testing different values
        return T.functional.crop(self.transform(image),18,0,self.height,self.width)
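
To make the shapes concrete, here is a NumPy-only sketch of the same three steps (grayscale, downsample to 110x84, crop 84 rows starting at row 18). It assumes a raw Atari frame of shape 210x160x3 and uses nearest-neighbour subsampling rather than torchvision's interpolation, so pixel values will differ from the `Preprocess` class above. The name `preprocess_sketch` is hypothetical.

```python
import numpy as np

def preprocess_sketch(frame, top=18, size=84):
    """frame: (210, 160, 3) uint8 RGB array -> (1, 84, 84) float32 array."""
    # Luminance-weighted grayscale.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Nearest-neighbour downsample to 110 rows x 84 columns (an assumption
    # made to keep the sketch dependency-free).
    rows = np.arange(110) * gray.shape[0] // 110
    cols = np.arange(84) * gray.shape[1] // 84
    small = gray[rows][:, cols]
    # Crop an 84x84 window and put the channel dimension first.
    cropped = small[top:top + size, :size]
    return (cropped / 255.0)[None, ...]

frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
out = preprocess_sketch(frame)
print(out.shape, out.dtype)
```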

# Replay Memory
We don't want to learn only from the most recent states and actions: consecutive transitions are highly correlated, and the network would "forget" older states and actions.
Instead, we sample a random batch from this memory.

## notes from nature paper
**memory of 1 million most recent frames**

First, we use a technique known as experience replay in which we store the
agent’s experiences at each time-step, $e_t = (s_t, a_t,r_t,s_{t+1})$, in a data set $D_t = \{e_1,…,e_t\}$,
pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. 

##### Details
During the inner loop of the algorithm,
we apply Q-learning updates, or minibatch updates, to samples of experience,
$(s, a,r,s^{'}) \sim U(D)$, drawn at random from the pool of stored samples. This approach
has several advantages over standard online Q-learning. First, each step of experience
is potentially used in many weight updates, which allows for greater data efficiency.

Second, learning directly from consecutive samples is inefficient, owing to the strong
correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters
are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch.
It is easy to see how unwanted feedback loops may arise and the parameters could get
stuck in a poor local minimum, or even diverge catastrophically. By using experience replay the behaviour distribution is averaged over many of its previous states,
smoothing out learning and avoiding oscillations or divergence in the parameters.
Note that when learning by experience replay, it is necessary to learn off-policy
(because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.


In practice, our algorithm only stores the last $N$ experience tuples in the replay
memory, and samples uniformly at random from $D$ when performing updates. This
approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing
to the finite memory size $N$. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to
prioritized sweeping.



#### Tips
Store frames as uint8 to save memory

state t and state t+1 share 3 frames
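
A rough back-of-the-envelope estimate shows why both tips matter. The numbers below are an illustration only: they ignore Python object overhead and assume that, with shared frames, each transition adds about one new 84x84 frame.

```python
FRAME_BYTES_UINT8 = 84 * 84   # one grayscale frame at 1 byte per pixel
STACK = 4                     # frames per state
CAPACITY = 1_000_000          # memory_size from the hyperparameters

def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30

# Naive layout: state and next_state stored as separate float32 stacks.
naive_float32 = CAPACITY * 2 * STACK * FRAME_BYTES_UINT8 * 4
# Same layout, but with uint8 pixels.
naive_uint8 = CAPACITY * 2 * STACK * FRAME_BYTES_UINT8
# Overlapping stacks share frames: roughly one new frame per transition.
shared_uint8 = CAPACITY * FRAME_BYTES_UINT8

for name, size in [("float32, no sharing", naive_float32),
                   ("uint8, no sharing", naive_uint8),
                   ("uint8, shared frames", shared_uint8)]:
    print(f"{name}: {gib(size):.1f} GiB")
```

Switching to uint8 cuts the footprint by a factor of 4, and sharing frames between overlapping states cuts it by roughly another factor of 8.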

In [5]:
class ReplayMemory():
    def __init__(self):
        self.replay_mem = deque(maxlen=memory_size)
        self.epinfos_mem = deque([], maxlen=100)
    def update_replay(self,experience):
        self.replay_mem.append(experience)
        
    def update_epinfos(self,info):
        self.epinfos_mem.append(info)
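
The two properties the class relies on, bounded capacity and uniform sampling, can be seen in a small standalone sketch. The capacity of 5 is only for illustration (the notebook uses `memory_size`), and the `Experience` fields match the namedtuple defined later in the notebook.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ("state", "action", "reward", "done", "next_state")
)

capacity = 5                          # tiny capacity, for illustration only
replay_mem = deque(maxlen=capacity)

# Append 8 transitions; the deque silently drops the 3 oldest ones.
for t in range(8):
    replay_mem.append(Experience(t, 0, 0.0, False, t + 1))

remaining = [e.state for e in replay_mem]
print(remaining)

# Uniform minibatch without replacement, as in random.sample(memory.replay_mem, batch_size).
batch = random.sample(list(replay_mem), 3)
print([e.state for e in batch])
```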
        

        

### Architecture
Based on https://www.nature.com/articles/nature14236

The input to the neural network consists of an 84x84x4 image produced by the preprocessing map $\phi$.

The first hidden layer convolves 32 filters of 8x8 with stride 4 with the
input image and applies a rectifier nonlinearity. 

The second hidden layer convolves 64 filters of 4x4 with stride 2, again followed by a rectifier nonlinearity.

This is followed by a third convolutional layer that convolves 64 filters of 3x3 with
stride 1 followed by a rectifier. 

The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. 
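
The size of the flattened feature vector that feeds the fully-connected layer follows from the layer description above. A short sketch of the arithmetic, assuming no padding (the helper `conv_out` is hypothetical):

```python
def conv_out(size, kernel, stride):
    """Output width/height of a square convolution without padding."""
    return (size - kernel) // stride + 1

size = 84
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# 64 output channels in the last convolutional layer.
flat_features = 64 * size * size
print(sizes, flat_features)
```

The spatial size shrinks from 84 to 20, then 9, then 7, which gives the 64*7*7 input features of the first linear layer.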

##### We have 4 valid actions!

In [7]:
class DeepQ(nn.Module):
    # TODO clean up
    def __init__(self,env, device = "cuda"):
        super().__init__()
        self.in_features = env.observation_space.shape[0]
        self.action_space = env.action_space.n
        self.device = device
        
        self.conv = nn.Sequential(
            nn.Conv2d(self.in_features, out_channels = 32, kernel_size = (8,8), stride = 4),
            nn.ReLU(),
            nn.Conv2d(32, 64, (4,4), 2),
            nn.ReLU(),
            nn.Conv2d(64,64,(3,3),1),
            nn.ReLU()
        )
        self.linear = nn.Sequential(
            nn.Linear(64*7*7,512),
            nn.ReLU(),
            nn.Linear(512,self.action_space)
        )
        
    def forward(self,state):
        state = self.conv(state)
        state = state.flatten(start_dim=1)
        state = self.linear(state)
        return state
        
    
    def action(self,states,epsilon):
        states = torch.as_tensor(states, dtype = torch.float32, device = self.device) # tensor 
        q_values = self.forward(states) # model output (q_values)
        highest_q_indexes = torch.argmax(q_values,dim=1) # find action with highest q value
        actions = highest_q_indexes.detach().tolist() # convert tensor to regular number
        
        # epsilon-greedy: with probability epsilon, overwrite the greedy action with a random one
        for i in range(len(actions)):
            rnd_sample = random.random()
            if rnd_sample <= epsilon:
                actions[i] = random.randrange(self.action_space)
        return actions
    
    def loss(self, experiences, target_net):
        # divided experience into parts 
        states = []
        actions = []
        rewards = []
        dones = []
        next_states = []
        for e in experiences:
            states.append(e.state)
            actions.append(e.action)
            rewards.append(e.reward)
            dones.append(e.done)
            next_states.append(e.next_state)
        
        states = np.stack([s.get_frames() for s in states])
        next_states =np.stack([s.get_frames() for s in next_states])  
        
        # convert to tensors
        states = torch.as_tensor(states, dtype=torch.float,device = self.device)
        actions = torch.as_tensor(np.asarray(actions), dtype=torch.int64,device = self.device).unsqueeze(-1)
        rewards = torch.as_tensor(np.asarray(rewards), dtype=torch.float,device = self.device).unsqueeze(-1)
        dones = torch.as_tensor(np.asarray(dones), dtype=torch.float,device = self.device).unsqueeze(-1)
        next_states = torch.as_tensor(next_states, dtype=torch.float,device = self.device)
        
        
        
        
        
        # Target (use the network passed in; no gradients are needed for it)
        with torch.no_grad():
            target_q_values = target_net(next_states)
            max_target_q_values = target_q_values.max(dim=1, keepdim=True)[0]
            # set y_j = r_j + gamma * max(Q_target(next_state))
            # if terminal state y_j = r_j -> (1 - done_j) = 0
            targets = rewards + gamma * (1 - dones) * max_target_q_values

        # Loss
        q_values = self.forward(states)
        action_q_values = torch.gather(input=q_values, dim=1,index=actions)
        loss = nn.functional.mse_loss(action_q_values, targets)
        return loss
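
The target and loss computation can be checked by hand on a tiny batch. The sketch below redoes the same arithmetic in NumPy with made-up Q-values for three transitions and four actions; all numbers are illustrative.

```python
import numpy as np

gamma = 0.99
rewards = np.array([[1.0], [0.0], [0.0]])
dones = np.array([[0.0], [1.0], [0.0]])      # second transition is terminal
actions = np.array([[1], [0], [3]])

# Made-up outputs of the target network for next_states.
next_q = np.array([[0.5, 2.0, 1.0, 0.0],
                   [3.0, 1.0, 0.0, 2.0],
                   [0.1, 0.2, 0.3, 0.4]])
# Made-up outputs of the action network for states.
q_values = np.array([[1.0, 2.5, 0.0, 0.3],
                     [0.2, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.5]])

# y_j = r_j + gamma * (1 - done_j) * max_a Q_target(s', a)
max_next_q = next_q.max(axis=1, keepdims=True)
targets = rewards + gamma * (1 - dones) * max_next_q

# Pick Q(s, a) for the action that was actually taken (torch.gather equivalent).
action_q = np.take_along_axis(q_values, actions, axis=1)
loss = np.mean((action_q - targets) ** 2)
print(targets.ravel(), action_q.ravel(), loss)
```

The terminal transition gets a target equal to its reward, because the `(1 - dones)` factor removes the bootstrapped term.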
        
        

#### Algorithm 1: deep Q-learning with experience replay
source: https://www.nature.com/articles/nature14236/
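
The loop structure of the algorithm (act epsilon-greedily, store the transition, sample a minibatch, regress towards a target computed from a periodically synced copy) can be sketched on a toy problem. Everything below is an assumption made for illustration: a 5-state chain environment, Q-tables standing in for the action and target networks, and hypothetical names such as `run_toy_dqn`. It is not the notebook's training code.

```python
import random
from collections import deque
import numpy as np

N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at GOAL."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def run_toy_dqn(steps=5000, gamma=0.9, lr=0.1, batch=16, sync_every=50, seed=0):
    rng = random.Random(seed)
    q = np.zeros((N_STATES, N_ACTIONS))   # stands in for the action network
    q_target = q.copy()                   # stands in for the target network
    memory = deque(maxlen=1000)           # replay memory
    state = 0
    for t in range(steps):
        # Linearly annealed epsilon-greedy action selection.
        eps = float(np.interp(t, [0, 2000], [1.0, 0.1]))
        if rng.random() <= eps:
            action = rng.randrange(N_ACTIONS)
        else:
            action = int(np.argmax(q[state]))
        # Act, store the transition, and reset on terminal states.
        nxt, reward, done = step(state, action)
        memory.append((state, action, reward, done, nxt))
        state = 0 if done else nxt
        # Minibatch update towards y = r + gamma * (1 - done) * max Q_target.
        if len(memory) >= batch:
            for s, a, r, d, s2 in rng.sample(list(memory), batch):
                y = r + gamma * (1.0 - d) * q_target[s2].max()
                q[s, a] += lr * (y - q[s, a])
        # Periodically copy the action table into the target table.
        if t % sync_every == 0:
            q_target = q.copy()
    return q

q = run_toy_dqn()
greedy_policy = [int(np.argmax(q[s])) for s in range(GOAL)]
print(greedy_policy)
```

In the notebook the table update is replaced by a gradient step on the loss, and the table copy by `load_state_dict`.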

In [7]:
# TensorBoard setup
%load_ext tensorboard
writer = SummaryWriter()

In [8]:
# wrappers from https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
# some small changes done for compatibility 


# parallel environments
env_lambda = lambda: baselines_wrappers.Monitor(make_atari_deepmind("BreakoutDeterministic-v4", scale_values=True), allow_early_resets=True)
#vec_env = baselines_wrappers.DummyVecEnv([env_lambda for _ in range(4)])
vec_env = baselines_wrappers.SubprocVecEnv([env_lambda for _ in range(4)])
# wrapper stacks 4 frames into state 
env = BatchedPytorchFrameStack(vec_env, k=4)
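
The internals of `BatchedPytorchFrameStack` are not shown here, so the following is only a sketch of what frame stacking does in general, for a single environment. The class name `FrameStackSketch` is hypothetical.

```python
from collections import deque
import numpy as np

class FrameStackSketch:
    """Keep the last k frames and expose them as one (k, H, W) array."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the stack with the first frame so the shape is fixed from the start.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def step(self, frame):
        self.frames.append(frame)   # the oldest frame drops out automatically
        return self.state()

    def state(self):
        return np.stack(list(self.frames))

# Frames filled with their own index, so they are easy to tell apart.
frames = [np.full((84, 84), i, dtype=np.uint8) for i in range(5)]
stack = FrameStackSketch(k=4)
stack.reset(frames[0])
for f in frames[1:4]:
    s_t = stack.step(f)             # holds frames 0, 1, 2, 3
s_t1 = stack.step(frames[4])        # holds frames 1, 2, 3, 4
print(s_t.shape, s_t[:, 0, 0], s_t1[:, 0, 0])
```

Consecutive states overlap in three of their four frames, which is why lazily shared frames save so much replay memory.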


In [9]:
Experience= namedtuple(
    "Experience",
    ("state", "action", "reward", "done", "next_state" )
)
# initialize memory
# Note the differences from the CartPole version:
# we have 4 environments running separately,
# so we need one action per environment as input
# and we get one output per environment
memory  = ReplayMemory()
def mem_init(env,mem):
    states = env.reset()
    for _ in range(required_mem):
        # random uniform actions
        actions = [env.action_space.sample() for _ in range(4)]
        new_states, rewards, dones, infos = env.step(actions)
        for state, action, reward, done, new_state in zip(states, actions, rewards, dones, new_states):
            experience = Experience(state,action,reward,done, new_state)
            # add to memory
            mem.update_replay(experience)
            
        states = new_states
        
        # don't need to check done, baselines_wrappers handles it
        
    return mem

memory = mem_init(env,memory)

    

In [10]:
env.action_space.n

4

#### Training 

In [11]:
# initialize models
action_network = DeepQ(env).to(device)
target_network = DeepQ(env).to(device)
# init, target = action 
target_network.load_state_dict(action_network.state_dict())

# set optimizer 
optimizer = torch.optim.Adam(action_network.parameters(), lr=lr)


In [12]:
env.observation_space.shape

(4, 84, 84)

In [13]:
def train():
    episode_count = 0
    states = env.reset()
    # while loop with a count
    for step in itertools.count():

        # get the current epsilon value
        epsilon = np.interp(step * 4,[0, e_steps], [e_start, e_end])
        
        # action from network, (network handles random actions)
        act_states = np.stack([o.get_frames() for o in states])
        actions = action_network.action(act_states, epsilon)    
            
            
        new_states, rewards, dones, infos = env.step(actions)
        for state, action, reward, done, new_state, info in zip(states, actions, rewards, dones, new_states, infos):
            experience = Experience(state,action,reward,done, new_state)
            # add to memory
            memory.update_replay(experience)
            if done:
                memory.update_epinfos(info["episode"])
                episode_count +=1
            
        states = new_states


        # sample from replay memory (batch size number of experiences)
        experiences = random.sample(memory.replay_mem, batch_size)
        loss = action_network.loss(experiences,target_network)


        # Gradient descent
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # every target_update steps (called C in the paper), copy the action
        # network's weights into the target network; saving and logging
        # happen every 10000 steps further down
        if step % target_update == 0:
            # update target network
            target_network.load_state_dict(action_network.state_dict())

        # save model
        if step % 10000 == 0 and step != 0:
            print("save")
            torch.save({
                'step': step,
                'model_state_dict': action_network.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': loss.item()  # store a plain float rather than a tensor attached to the graph
                }, "./models/Breakout")
                
        
        # logs
        if step % 10000 ==0:
            rew_mean = np.nan_to_num(np.mean([e["r"] for e in memory.epinfos_mem]))
            len_mean = np.nan_to_num(np.mean([e["l"] for e in memory.epinfos_mem]))
            print()
            print("Step", step)
            print("Avg Reward", rew_mean)
            print("Avg Ep Len", len_mean)
            print("Episodes", episode_count)
            
            writer.add_scalar("AvgRew", rew_mean, global_step = step)
            writer.add_scalar("AvgEpLen", len_mean, global_step = step)
            writer.add_scalar("Episodes", episode_count, global_step = step)
            #step_logging.append(step)
            #reward_logging.append(avg_reward)
            #loss_logging.append(loss)
            #print("\nReward: ", avg_reward)
    
    
        
        
        

    

In [14]:
train()




Step 0
Avg Reward 0.0
Avg Ep Len 0.0
Episodes 0
save

Step 10000
Avg Reward 0.18
Avg Ep Len 34.21
Episodes 1128
save

Step 20000
Avg Reward 0.31
Avg Ep Len 39.95
Episodes 2248
save

Step 30000
Avg Reward 0.17
Avg Ep Len 33.15
Episodes 3389
save

Step 40000
Avg Reward 0.25
Avg Ep Len 36.96
Episodes 4523
save

Step 50000
Avg Reward 0.2
Avg Ep Len 34.04
Episodes 5612
save

Step 60000
Avg Reward 0.4
Avg Ep Len 41.1
Episodes 6584
save

Step 70000
Avg Reward 0.42
Avg Ep Len 42.47
Episodes 7536
save

Step 80000
Avg Reward 0.47
Avg Ep Len 42.96
Episodes 8405
save

Step 90000
Avg Reward 0.74
Avg Ep Len 55.2
Episodes 9192
save

Step 100000
Avg Reward 0.51
Avg Ep Len 46.47
Episodes 10068
save

Step 110000
Avg Reward 0.88
Avg Ep Len 61.86
Episodes 10891
save

Step 120000
Avg Reward 0.89
Avg Ep Len 59.07
Episodes 11580
save

Step 130000
Avg Reward 0.98
Avg Ep Len 64.24
Episodes 12179
save

Step 140000
Avg Reward 1.34
Avg Ep Len 76.06
Episodes 12738
save

Step 150000
Avg Reward 1.47
Avg Ep Len 79.6

save

Step 1240000
Avg Reward 8.41
Avg Ep Len 264.84
Episodes 34360
save

Step 1250000
Avg Reward 7.98
Avg Ep Len 260.84
Episodes 34513
save

Step 1260000
Avg Reward 10.0
Avg Ep Len 290.33
Episodes 34666
save

Step 1270000
Avg Reward 11.07
Avg Ep Len 299.71
Episodes 34803
save

Step 1280000
Avg Reward 10.21
Avg Ep Len 302.28
Episodes 34937
save

Step 1290000
Avg Reward 8.9
Avg Ep Len 269.21
Episodes 35080
save

Step 1300000
Avg Reward 9.64
Avg Ep Len 263.71
Episodes 35226
save

Step 1310000
Avg Reward 8.1
Avg Ep Len 252.73
Episodes 35380
save

Step 1320000
Avg Reward 8.94
Avg Ep Len 271.7
Episodes 35525
save

Step 1330000
Avg Reward 8.31
Avg Ep Len 270.35
Episodes 35682
save

Step 1340000
Avg Reward 9.06
Avg Ep Len 286.04
Episodes 35823
save

Step 1350000
Avg Reward 8.29
Avg Ep Len 251.01
Episodes 35986
save

Step 1360000
Avg Reward 8.3
Avg Ep Len 266.1
Episodes 36149
save

Step 1370000
Avg Reward 8.78
Avg Ep Len 279.2
Episodes 36297
save

Step 1380000
Avg Reward 11.29
Avg Ep Len 313.2

KeyboardInterrupt: 

In [15]:
writer.flush()
writer.close()

#### Display


In [109]:
env_lambda_display = lambda: baselines_wrappers.Monitor(make_atari_deepmind("BreakoutDeterministic-v4", scale_values=True), allow_early_resets=True)
vec_env_display = baselines_wrappers.DummyVecEnv([env_lambda_display for _ in range(1)])
env_display = BatchedPytorchFrameStack(vec_env_display, k=4)

In [92]:
# env_display must exist first: DeepQ reads its observation and action spaces
model = DeepQ(env_display).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

state_of_model = torch.load("./models/Breakout")
model.load_state_dict(state_of_model["model_state_dict"])

optimizer.load_state_dict(state_of_model['optimizer_state_dict'])

In [110]:
state = env_display.reset()
episode_reward = 0.0
episodes_reward =[]
episode_count = 0
first_eps = True  # fire the ball on the first step of every episode
while episode_count < 5:
    
    # stack frames to state
    act_state = np.stack([s.get_frames() for s in state])
    # no random action epsilon = 0
    action = model.action(act_state, 0.0)
    
    if first_eps:
        # start by firing the ball (action 1);
        # the model sometimes gets stuck otherwise
        action = [1]
        first_eps = False 
    state, reward, done, _ = env_display.step(action)
    episode_reward += reward
    
    if done[0]:
        episode_count += 1
        episodes_reward.append(episode_reward)
        episode_reward = 0.0
        state = env_display.reset()
        first_eps = True

In [97]:
print("Mean episode reward: ", np.mean(episodes_reward))
print(episodes_reward)

Mean episode reward:  16.2
[array([33.], dtype=float32), array([26.], dtype=float32), array([1.], dtype=float32), array([21.], dtype=float32), array([0.], dtype=float32)]
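
The rewards print as one-element arrays because the vectorized environment returns one reward per environment at every step. A small sketch of flattening them into plain floats, using the values printed above:

```python
import numpy as np

# Values copied from the output above.
episodes_reward = [np.array([33.], dtype=np.float32),
                   np.array([26.], dtype=np.float32),
                   np.array([1.], dtype=np.float32),
                   np.array([21.], dtype=np.float32),
                   np.array([0.], dtype=np.float32)]

# Take the single entry of each array (one display environment).
flat_rewards = [float(r[0]) for r in episodes_reward]
mean_reward = float(np.mean(flat_rewards))
print(flat_rewards, mean_reward)
```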


#### Random Move

In [82]:
enviroment = gym.make("BreakoutDeterministic-v4", render_mode="human")

In [83]:
enviroment.reset()
terminated = False
last_img = None
episode_reward = 0.0
episodes_reward =[]
for i in range(1000):
    action = enviroment.action_space.sample()
    observation, reward, done, _ = enviroment.step(action)
    episode_reward += reward
    if done:
        episodes_reward.append(episode_reward)
        episode_reward = 0.0
        enviroment.reset()
        
    
print("Mean episode reward: ", np.mean(episodes_reward))
    

Mean episode reward:  0.6666666666666666


In [84]:
episodes_reward

[1.0, 0.0, 1.0, 1.0, 1.0, 0.0]