<a href="https://colab.research.google.com/github/PatchFramework/deep-q-reinforcement-learning/blob/main/deep_q_reinforcement_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Q Reinforcement Learning 


If you are not using Google Colab and have not installed the dependencies yet, run the following line to install them:

In [2]:
!pip install torch matplotlib pandas numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from collections import namedtuple, deque
import random

## 1. Creating the Network Policy

In [4]:
class DeepQReinforcementPolicy(nn.Module):
  def __init__(self, lr, in_dims, l1_dims, l2_dims, out_dims):
    """
    This class contains the neural network policy that estimates the q-values for each action the agent can take.
    A q-value is an estimate of the reward that the agent can expect for an action.

    Parameters:
    ---
    lr: learning rate during training
    in_dims: dimensions of the input; equal to the size of the state vector
    l1_dims: number of neurons in the first fully connected layer of the policy
    l2_dims: number of neurons in the second fully connected layer
    out_dims: output dimensions of the network; equal to the number of actions that the agent can perform in the environment; the network returns one q-value per action
    """
    # Initialize the class and the input parameters
    super(DeepQReinforcementPolicy, self).__init__()
    self.lr = lr
    self.in_dims = in_dims
    self.l1_dims = l1_dims
    self.l2_dims = l2_dims
    self.out_dims = out_dims

    # define the fully connected layers (fully connected = nn.Linear())
    self.l1 = nn.Linear(self.in_dims, self.l1_dims)
    self.l2 = nn.Linear(self.l1_dims, self.l2_dims)
    self.l3 = nn.Linear(self.l2_dims, self.out_dims)

    # Adam is used as the optimizer
    self.optim = optim.Adam(self.parameters(), lr=self.lr)
    # Mean squared error loss function
    self.loss = nn.MSELoss()
    
    # use the GPU if it is available
    self.device = T.device('cuda:0' if T.cuda.is_available() else "cpu")
    self.to(self.device)


  def forward(self, state):
    """
    Defines how the data is propagated through the layers of the network.
    """
    x = F.relu(self.l1(state))
    x = F.relu(self.l2(x))
    action_q_values = self.l3(x)
    return action_q_values
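
The forward pass described above can be tried out in isolation. The following is a minimal sketch, not the notebook's own cell: it uses an `nn.Sequential` stand-in with the same layer layout as `DeepQReinforcementPolicy`, and the sizes (a 4-dimensional state, 2 actions, a batch of 3 states) are assumptions chosen only for illustration.

```python
import torch as T
import torch.nn as nn

# Hypothetical sizes for illustration only: 4-dimensional state, 2 actions.
in_dims, l1_dims, l2_dims, out_dims = 4, 64, 32, 2

# Stand-in with the same layout as the policy above
# (Linear -> ReLU -> Linear -> ReLU -> Linear), so the snippet runs on its own.
net = nn.Sequential(
    nn.Linear(in_dims, l1_dims), nn.ReLU(),
    nn.Linear(l1_dims, l2_dims), nn.ReLU(),
    nn.Linear(l2_dims, out_dims),
)

# A batch of 3 random states; each row is one state vector.
states = T.rand(3, in_dims)

# No gradients are needed when the network is only queried for q-values.
with T.no_grad():
    q_values = net(states)

# One row per state, one column (q-value) per action.
print(q_values.shape)  # torch.Size([3, 2])
```

The last layer has no activation function, so the q-values are unbounded real numbers rather than probabilities.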

### 1.1 Creating a Replay Memory

The replay memory stores past experiences of the agent.
Each experience consists of an environment state, the action taken in that state, the state that resulted from the action, and the reward that the agent received for it.
The agent can learn from a random sample of past experiences instead of only from consecutive ones.

The idea behind a replay memory is that the agent remembers the consequences of its past actions. Bad actions that the agent took a long time ago can therefore still influence its decisions in the present, which helps it avoid repeating bad decisions.

In contrast, without a replay memory the agent only remembers the consequences of its last few actions.

For further documentation, see the [PyTorch Reinforcement Learning (DQN) tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html).

In [5]:
# the Transition object maps the relation between the state and a previous action to the resulting state and the reward for that action
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

class ReplayMemory(object):
  def __init__(self, memory_capacity):
    self.memory = deque([],maxlen=memory_capacity)

  def push(self, *args):
    """Save a tuple that records an action and its consequences."""
    self.memory.append(Transition(*args))

  def sample(self, batch_size):
    """Select a random batch of samples from the memory."""
    return random.sample(self.memory, batch_size)

  def __len__(self):
    return len(self.memory)
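
To see how such a memory behaves, here is a small sketch that uses the same building blocks (`deque` with `maxlen`, the `Transition` namedtuple, and `random.sample`) directly. The capacity of 3 and the integer states are made-up values for illustration only.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

# A deque with maxlen silently drops its oldest entry once it is full.
memory = deque([], maxlen=3)

# Hypothetical transitions: integer states, always action 0,
# reward equal to the step index.
for step in range(5):
    memory.append(Transition(step, 0, step + 1, float(step)))

print(len(memory))                 # 3
print([t.state for t in memory])   # [2, 3, 4] (steps 0 and 1 were evicted)

# Draw a random batch without replacement.
batch = random.sample(memory, 2)

# Transpose a list of Transitions into one Transition of tuples,
# which is convenient when building batched tensors later.
batch = Transition(*zip(*batch))
print(batch.state, batch.reward)
```

Note that `random.sample` raises a `ValueError` if `batch_size` is larger than the number of stored transitions, so the memory has to hold at least one full batch before sampling.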

## 2. Create the Agent

The agent is the piece of code that takes actions in an environment and uses the policy to estimate how "good" future actions will be (that is, their q-values). The higher the q-value of an action, the more likely it is to lead to a high reward for the agent.

The agent wants to collect as much reward as possible inside the environment.
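
One common way to turn q-values into actions is an epsilon-greedy rule: with probability `epsilon` the agent explores with a random action, otherwise it picks the action with the highest q-value. The sketch below is an illustration under assumptions, not the agent's actual implementation: `choose_action` and `decay_epsilon` are hypothetical helpers, the q-values are passed as a plain list of floats, and the linear decay toward a floor is only one plausible use of `epsilon_decrement` and `epsilon_bottom`.

```python
import random

def choose_action(q_values, epsilon):
    """Hypothetical helper: epsilon-greedy choice over a list of q-values."""
    if random.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: pick the index of the highest q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(epsilon, decrement, bottom):
    """Hypothetical helper: lower epsilon linearly, but never below `bottom`."""
    return max(epsilon - decrement, bottom)

q_values = [0.1, 0.7, 0.2]  # made-up q-values for three actions

# With epsilon = 0.0 the agent never explores and always takes action 1.
print(choose_action(q_values, epsilon=0.0))  # 1

# With epsilon = 1.0 the agent always explores.
print(choose_action(q_values, epsilon=1.0))

# Once epsilon has reached the floor, it stays there.
print(decay_epsilon(0.005, decrement=0.0001, bottom=0.005))  # 0.005
```

Starting with a high `epsilon` and lowering it over time lets the agent explore widely at first and rely more on its learned q-values later.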

In [None]:
class Agent():
  def __init__(self, lr, in_dims, gamma, epsilon, n_actions, batch_size, memory_capacity=100000, epsilon_decrement=0.0001, epsilon_bottom=0.005):
    self.lr = lr
    self.in_dims = in_dims
    self.gamma = gamma
    self.epsilon = epsilon
    self.epsilon_decrement = epsilon_decrement
    self.epsilon_bottom = epsilon_bottom
    self.batch_size = batch_size
    self.n_actions = n_actions
    self.action_space = list(range(n_actions))

    # stores the last free memory
    self.memory_counter = 0

    # init the policy
    self.eval_q_values = DeepQReinforcementPolicy(lr=self.lr, in_dims=self.in_dims, l1_dims=64, l2_dims=32, out_dims=self.n_actions)

    # init the ReplayMemory to sample recent memories from