# Reinforcement Learning
- Broadly, reinforcement learning consists of an Environment and an Actor.
- When the actor takes an action, the environment returns a state (observation) and a reward.

### For example
![](../img/openAI-GYM_Frozen_Lake.PNG)

##### Taking the RIGHT action at S returns state 1 and reward 0. (Here the state is the agent's position; the reward is 1 only when the agent reaches G.)
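
To make the state and reward in this example concrete, here is a minimal sketch of a deterministic 4x4 map. It is a hypothetical stand-in written for illustration, not the Gym implementation: the names `GRID`, `DELTAS`, and `step` are assumptions, and the state is assumed to be the cell index `row * 4 + col`.

```python
# Hypothetical stand-in for the 4x4 map (not the Gym implementation).
# S = start, F = frozen, H = hole, G = goal
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
DELTAS = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def step(state, action):
    """Move deterministically and return (new_state, reward, done)."""
    row, col = divmod(state, 4)
    d_row, d_col = DELTAS[action]
    # Moving into a wall leaves the agent where it is
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = GRID[row][col]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in ("H", "G")
    return row * 4 + col, reward, done


print(step(0, RIGHT))   # (1, 0.0, False)
print(step(14, RIGHT))  # (15, 1.0, True)
```

Moving right from S (state 0) yields state 1 with reward 0, and only the step onto G yields reward 1.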

---
## Playing OpenAI Gym Games

```python
import gym

env = gym.make("FrozenLake-v0")
observation = env.reset()  # reset the environment to the start state
for _ in range(1000):
    env.render()  # print the current map
    action = env.action_space.sample()  # pick a random action
    # done indicates whether the episode has ended,
    # i.e. the agent fell into an H (hole) or reached G (goal)
    observation, reward, done, info = env.step(action)
```

Output (excerpt; the agent's position, highlighted in the terminal, is marked with brackets here, and `...` stands for omitted frames). Once the agent falls into H, the position no longer changes, because the loop never calls `env.reset()`:

```
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Up)
[S]FFF
FHFH
FFFH
HFFG
  (Down)
S[F]FF
FHFH
FFFH
HFFG
...
  (Right)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Up)
SFFF
F[H]FH
FFFH
HFFG
  (Right)
SFFF
F[H]FH
FFFH
HFFG
...
```


### Python arrow-key input

```python
# Read a key from the keyboard and take the matching action (Unix terminals).
# The key-reading code itself is not important.
import sys
import termios
import tty

from gym.envs.registration import register


class _Getch:
    def __call__(self):
        fd = sys.stdin.fileno()
        old_settings = termios.tcgetattr(fd)
        try:
            tty.setraw(fd)
            # An arrow key arrives as a 3-character escape sequence
            ch = sys.stdin.read(3)
        finally:
            termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
        return ch


inkey = _Getch()  # reads one key press

# Macros
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

# Key mapping
arrow_keys = {
    '\x1b[A': UP,
    '\x1b[B': DOWN,
    '\x1b[C': RIGHT,
    '\x1b[D': LEFT
}

# Register a non-slippery 4x4 FrozenLake
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)
```

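```python
# Windows version
import gym
from gym.envs.registration import register
from msvcrt import getch

LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

# Scan codes that follow the special-key prefix byte on Windows
arrow_keys = {72: UP, 80: DOWN, 77: RIGHT, 75: LEFT}

# Same non-slippery 4x4 map as in the Unix version
register(
    id='FrozenLake-v3',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)
env = gym.make('FrozenLake-v3')
env.reset()
env.render()

while True:
    key = getch()
    # Special keys send a prefix byte (0x00 or 0xE0) followed by a scan code
    if key in (b'\x00', b'\xe0'):
        key = ord(getch())
    if key not in arrow_keys:
        print("Game aborted!")
        break

    action = arrow_keys[key]
    state, reward, done, info = env.step(action)
    env.render()
    print("State:", state, "Action:", action, "Reward:", reward, "Info:", info)

    if done:
        print("Finished with reward", reward)
        break
```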

---

## Q-Function
: Given the current position/state and an action as input, the Q-function outputs the quality (expected reward) of taking that action.
<img src="../img/Q-Learning.PNG" alt="drawing" width="500"/>

##### Example
- Q(s1,left) : 0
- Q(s1,right) : 0.5
- Q(s1,up) : 0
- Q(s1,down) : 0.3
>   **1. Find the maximum output value: $\max_{a} Q(s_1,a)$** <br>
>   **2. Find the action that attains it: $\arg\max_{a} Q(s_1,a)$ = right** <br>
>   **3. The optimal policy is $\pi^{*}(s) = \arg\max_{a} Q(s,a)$**
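
The two lookups in this example can be sketched in a few lines of Python. The helper name `greedy_action` is hypothetical; the Q-values are the ones listed above.

```python
# Q-values for state s1, copied from the example above
q_s1 = {"left": 0.0, "right": 0.5, "up": 0.0, "down": 0.3}


def greedy_action(q_values):
    """Return (best_action, best_value) for one state's Q-values."""
    best_action = max(q_values, key=q_values.get)  # argmax over actions
    return best_action, q_values[best_action]      # max over actions


best_action, best_value = greedy_action(q_s1)
print(best_action, best_value)  # right 0.5
```

Applying the same lookup in every state is what the policy $\pi^{*}(s)$ does.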

## So how do we learn the Q-function?
- the current state (s)
- the next state (s') <br>
**Assumption: the Q-values at s', $Q(s',a')$, are known; only $Q(s,a)$ is unknown.**

### Let's express $Q(s,a)$ in terms of $Q(s',a')$!
- $Q(s,a) = r + \max_{a'} Q(s',a')$
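
As a quick check of this formula, here is a small worked sketch. The function name `q_target` and the numbers are illustrative assumptions, not values from a trained table.

```python
def q_target(reward, next_q_values):
    """Q(s,a) = r + max over a' of Q(s',a')."""
    return reward + max(next_q_values)


# Stepping onto the goal: reward 1, and the terminal state's Q-values are all 0
print(q_target(1.0, [0.0, 0.0, 0.0, 0.0]))  # 1.0

# One step earlier: reward 0, but the next state already knows a value of 1
print(q_target(0.0, [0.0, 0.0, 1.0, 0.0]))  # 1.0
```

The known value at s' flows back into $Q(s,a)$, which is how reward information spreads from the goal toward the start.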

<img src="../img/Q-Learning1.PNG" alt="drawing" width="500"/>

<img src="../img/Q-Learning2.PNG" alt="drawing" width="600"/>

<p style="text-align: center;"><strong>If $R_{t+1}$ is optimal, then $R^{*} _{t} = r_{t} + \max R_{t+1}$.</strong></p>
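
The recursion behind this statement can be written out step by step. This is a sketch under the assumption that the return $R_t$ is the undiscounted sum of rewards from step $t$ to the final step $n$:

$$R_t = r_t + r_{t+1} + \cdots + r_n$$

$$R_{t+1} = r_{t+1} + r_{t+2} + \cdots + r_n$$

$$R_t = r_t + R_{t+1}$$

Taking the best continuation from step $t+1$ onward gives $R^{*} _{t} = r_{t} + \max R_{t+1}$, which has the same shape as $Q(s,a) = r + \max_{a'} Q(s',a')$.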

### Dummy Q-Learning Algorithm
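1. Initialize $Q(s,a)$ to 0 for every state-action pair.
2. Observe the current state s.
3. Repeat the following until the episode ends (done is true, for example when the goal is reached with reward 1):
    - Select an action a and execute it.
    - Receive the reward r.
    - Observe the new state s'.
    - Update $Q(s,a)$ as follows:
    $$Q(s,a) \leftarrow r + \max_{a'} Q(s',a')$$
    - Set s = s'.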

```python
import gym
import matplotlib.pyplot as plt
import numpy as np

# Assumption: FrozenLake-v3 is the non-slippery map registered earlier.
# This code targets the classic Gym API, where reset() returns the state
# and step() returns four values.
env = gym.make('FrozenLake-v3')

# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Discount factor
dis = .99
num_episodes = 2000

# Create a list to contain the total reward per episode
rList = []

for i in range(num_episodes):
    # Reset environment and get first new observation
    state = env.reset()  # this is where the current state s is observed
    rAll = 0
    done = False

    # The Q-Table learning algorithm
    while not done:
        # Choose an action greedily from the Q table, with noise
        # that shrinks as the episode index i grows
        action = np.argmax(Q[state, :] +
                           np.random.randn(1, env.action_space.n) / (i + 1))

        # Get new state and reward from environment
        new_state, reward, done, _ = env.step(action)

        # Update the Q-table; unlike the update written above, the max
        # term is multiplied by the discount factor dis
        Q[state, action] = reward + dis * np.max(Q[new_state, :])

        rAll += reward
        state = new_state

    rList.append(rAll)

print("Success rate: " + str(sum(rList) / num_episodes))
print("Final Q-Table Values")
print(Q)
plt.bar(range(len(rList)), rList, color="blue")
plt.show()
```
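
To try the same loop without installing Gym, the sketch below swaps in a hand-written deterministic 4x4 map and uses only the standard library. The names `MAP`, `step`, and `train` are hypothetical, and the map is an assumed stand-in for the non-slippery environment, so the results will not match a Gym run exactly.

```python
import random

# Assumed stand-in for the non-slippery 4x4 map (S start, F frozen, H hole, G goal)
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}
N = 4


def step(state, action):
    """Deterministic move; returns (new_state, reward, done)."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)  # walls keep the agent in place
    col = min(max(col + d_col, 0), N - 1)
    tile = MAP[row][col]
    return row * N + col, float(tile == "G"), tile in "HG"


def train(num_episodes=2000, dis=0.99, seed=0):
    """Run the dummy Q-learning update and return (Q table, rewards per episode)."""
    rng = random.Random(seed)
    Q = [[0.0] * 4 for _ in range(N * N)]
    rewards = []
    for i in range(num_episodes):
        state, done, total = 0, False, 0.0
        while not done:
            # Greedy choice with Gaussian noise that decays over episodes
            noisy = [q + rng.gauss(0, 1) / (i + 1) for q in Q[state]]
            action = noisy.index(max(noisy))
            new_state, reward, done = step(state, action)
            # Same update as above: reward plus discounted best next value
            Q[state][action] = reward + dis * max(Q[new_state])
            total += reward
            state = new_state
        rewards.append(total)
    return Q, rewards


Q, rewards = train()
print("Success rate:", sum(rewards) / len(rewards))
```

Because the map is deterministic, each update overwrites the old value outright; no learning rate is needed in this setting.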