# ML-Agents Q-Learning with GridWorld

Train an Agent in the ML-Agents GridWorld environment with a Q-Learning model.  
This project re-implements (clone-codes) [Q-Learning with a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_19_docs/colab/Colab_UnityEnvironment_2_Train.ipynb) as hands-on practice with the Unity ML-Agents Python low-level API and Deep Reinforcement Learning (DRL).


<img src="https://github.com/Unity-Technologies/ml-agents/blob/release_19_docs/docs/images/gridworld.png?raw=true" align="middle" width="435"/>

## Train the GridWorld Environment with Q-Learning

### What is the GridWorld Environment

The [GridWorld](https://github.com/Unity-Technologies/ml-agents/blob/release_19_docs/docs/Learning-Environment-Examples.md#gridworld) environment is a simple Unity visual environment. The Agent is a blue square in a 3x3 grid, and its goal is to reach the green `+` while avoiding the red `x`.

The observation is an image captured by a camera positioned above the grid.

The Agent takes one of five actions:

* Do not move
* Move up
* Move down
* Move right
* Move left

The Agent receives a reward of 1.0 when it reaches the green `+` and a penalty of -1.0 when it reaches the red `x`. It also receives a penalty of -0.01 at every step.
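
To make the reward structure concrete, the sketch below sums the rewards of one hypothetical episode. The helper `discounted_return` and the episode itself are illustrative and not part of the original notebook, and it assumes that the final step yields only the goal reward.

```python
from typing import List

# Reward values taken from the description above.
GOAL_REWARD = 1.0
STEP_PENALTY = -0.01


def discounted_return(rewards: List[float], gamma: float = 1.0) -> float:
    """Sum the rewards back to front, discounting each later step by gamma."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: three ordinary steps, then the Agent reaches the green `+`.
# Assumption: the final step yields only the goal reward.
episode = [STEP_PENALTY] * 3 + [GOAL_REWARD]
print(round(discounted_return(episode), 2))  # 0.97
```

The small per-step penalty means that, among episodes that reach the goal, shorter paths score a higher return.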

> **Note** There are 9 Agents, each in their own grid, at once in the environment. This allows for faster data collection.

### The Q-Learning Algorithm

This is a very simple Q-Learning algorithm implemented with [PyTorch](https://pytorch.org/).  
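
For reference, the standard Q-Learning regression target can be written as follows. The symbols are assumptions chosen to mirror a single transition: $s$ is the observation, $a$ the action, $r$ the reward, $d \in \{0, 1\}$ the done flag, $s'$ the next observation, $\gamma$ a discount factor whose value this section does not specify, and $Q_\theta$ the Q-Network.

$$
y = r + \gamma \, (1 - d) \, \max_{a'} Q_\theta(s', a'), \qquad
\mathcal{L}(\theta) = \bigl(Q_\theta(s, a) - y\bigr)^2
$$

The factor $(1 - d)$ removes the bootstrap term on terminal transitions, so the target there reduces to $y = r$.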

Below is a very simple neural network.

In [None]:
import torch
from typing import Tuple
from math import floor
from torch.nn import Parameter

class VisualQNetwork(torch.nn.Module):
    """A very simple visual neural network that learns from images."""
    
    def __init__(self, input_shape: Tuple[int, int, int], encoding_size: int, output_size: int):
        """Create a neural network that takes a batch of images (3-dimensional tensors) as input.

        Args:
            input_shape (Tuple[int, int, int]): channel, height, width
            encoding_size (int): encoding size of the fully connected layer
            output_size (int): output size
        """
        
        super(VisualQNetwork, self).__init__()
        
        height = input_shape[1]
        width = input_shape[2]
        initial_channels = input_shape[0]
        conv_1_hw = self.conv_output_shape((height, width), 8, 4)
        conv_2_hw = self.conv_output_shape(conv_1_hw, 4, 2)
        
        self.final_flat = conv_2_hw[0] * conv_2_hw[1] * 32 # size of the flattened conv2 output tensor: height * width * out_channels
        self.conv1 = torch.nn.Conv2d(initial_channels, 16, [8, 8], [4, 4])
        self.conv2 = torch.nn.Conv2d(16, 32, [4, 4], [2, 2])
        self.dense1 = torch.nn.Linear(self.final_flat, encoding_size)
        self.dense2 = torch.nn.Linear(encoding_size, output_size)
        
    def forward(self, visual_obs: torch.Tensor):
        conv_1 = torch.relu(self.conv1(visual_obs))
        conv_2 = torch.relu(self.conv2(conv_1))
        hidden = self.dense1(conv_2.reshape([-1, self.final_flat])) # flatten and input to the fully connected layer
        hidden = torch.relu(hidden) # activation function
        hidden = self.dense2(hidden)
        return hidden
        
    
    @staticmethod
    def conv_output_shape(h_w: Tuple[int, int], kernel_size: int = 1, stride: int = 1, pad: int = 0, dilation: int = 1):
        """Return the output height and width of a convolution layer."""
        
        h = floor(
            ((h_w[0] + (2 * pad) - (dilation * (kernel_size - 1)) - 1) / stride) + 1
        )
        w = floor(
            ((h_w[1] + (2 * pad) - (dilation * (kernel_size - 1)) - 1) / stride) + 1
        )
        return h, w
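
The shape arithmetic used in `__init__` can be checked by hand. The sketch below repeats the same formula for the unpadded, undilated case (the defaults used above) and applies it to a hypothetical 64x84 input. That input size is an assumption for illustration only; the real size comes from the environment's observation spec.

```python
from math import floor
from typing import Tuple


def conv_output_shape(h_w: Tuple[int, int], kernel_size: int, stride: int) -> Tuple[int, int]:
    """Output (height, width) of an unpadded, undilated convolution."""
    h = floor((h_w[0] - kernel_size) / stride + 1)
    w = floor((h_w[1] - kernel_size) / stride + 1)
    return h, w


# Hypothetical 64x84 input; the real size comes from the environment's observation spec.
conv_1_hw = conv_output_shape((64, 84), kernel_size=8, stride=4)   # (15, 20)
conv_2_hw = conv_output_shape(conv_1_hw, kernel_size=4, stride=2)  # (6, 9)
final_flat = conv_2_hw[0] * conv_2_hw[1] * 32                      # 6 * 9 * 32 = 1728
print(conv_1_hw, conv_2_hw, final_flat)
```

Under this assumption the first fully connected layer would take 1728 input features.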
        

Define the data types used to store the data for training with Q-Learning. They are used by the ReplayBuffer.

In [None]:
import numpy as np
from typing import NamedTuple, List

class Experience(NamedTuple):
    """An experience containing the data of one Agent transition."""
    
    obs: np.ndarray
    action: np.ndarray
    reward: float
    done: bool
    next_obs: np.ndarray
    
# A Trajectory is an ordered sequence of Experiences
Trajectory = List[Experience]

# A Buffer is an unordered list of Experiences from multiple Trajectories
Buffer = List[Experience]
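
As a minimal sketch of how these types might be used, the example below merges placeholder trajectories into a buffer and draws a random minibatch. The helper `make_dummy_trajectory`, the 3x64x84 observation shape, and the batch size are all hypothetical; the types are repeated so the block runs on its own.

```python
import random
from typing import List, NamedTuple

import numpy as np


class Experience(NamedTuple):
    obs: np.ndarray
    action: np.ndarray
    reward: float
    done: bool
    next_obs: np.ndarray


Trajectory = List[Experience]
Buffer = List[Experience]


def make_dummy_trajectory(length: int) -> Trajectory:
    """Build a placeholder trajectory of zero images (hypothetical 3x64x84 shape)."""
    obs = np.zeros((3, 64, 84), dtype=np.float32)
    return [
        Experience(obs, np.array([0]), -0.01, step == length - 1, obs)
        for step in range(length)
    ]


# Merge several trajectories into one buffer; the order inside a buffer does not matter.
buffer: Buffer = []
for length in (3, 5):
    buffer.extend(make_dummy_trajectory(length))

# Draw a random minibatch and stack each field into one batched array.
batch = random.sample(buffer, k=4)
obs_batch = np.stack([exp.obs for exp in batch])
reward_batch = np.array([exp.reward for exp in batch], dtype=np.float32)
done_batch = np.array([exp.done for exp in batch], dtype=np.float32)
print(obs_batch.shape, reward_batch.shape, done_batch.shape)
```

Sampling uniformly from a merged buffer mixes transitions from different trajectories within one minibatch.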

Define the trainer class. The trainer class collects data from the environment by following the policy and then trains the Q-Network.

In [1]:
# TODO: define the Trainer class
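
The Trainer itself is still a TODO. As a hedged sketch of the data-collection half only, the example below picks actions from Q-values with epsilon-greedy exploration. This is one common scheme and not necessarily what the referenced notebook uses. The function name `epsilon_greedy`, the value of `epsilon`, and the random Q-values are assumptions for illustration.

```python
import numpy as np


def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> np.ndarray:
    """Pick one action per row: random with probability epsilon, greedy otherwise."""
    num_agents, num_actions = q_values.shape
    greedy = q_values.argmax(axis=1)
    random_actions = rng.integers(0, num_actions, size=num_agents)
    explore = rng.random(num_agents) < epsilon
    return np.where(explore, random_actions, greedy)


rng = np.random.default_rng(0)
# Hypothetical Q-values for 9 Agents and 5 actions, matching the GridWorld description.
q_values = rng.normal(size=(9, 5))
actions = epsilon_greedy(q_values, epsilon=0.1, rng=rng)
print(actions.shape)  # (9,)
```

With `epsilon=0.0` the function always returns the greedy action, which is a reasonable setting for evaluation.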

## References

[1] [Q-Learning with a UnityEnvironment](https://colab.research.google.com/github/Unity-Technologies/ml-agents/blob/release_19_docs/colab/Colab_UnityEnvironment_2_Train.ipynb)