***250707 Prof. Song-Nam Hong (홍송남) <Hyundai Motor Bootcamp Lab Material> - Q-learning***

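## 0. Visualize
### 0.1. Installing the required packages
  - `ffmpeg`, `imageio`: packages for encoding/decoding and reading/writing video and audio.
  - `gymnasium[toy_text]`: installs the toy-text environments (including `FrozenLake`) of the Gymnasium reinforcement learning library.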

In [1]:
!apt-get update -qq
!apt-get install -y ffmpeg > /dev/null
!pip install gymnasium[toy_text] imageio imageio-ffmpeg > /dev/null 2>&1


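### 0.2. Defining the video helper function
- To inspect the training result visually, the rollout is saved frame by frame and rendered as a video.
- The function `show_video` is defined to load the saved video and play it inside Colab.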

In [2]:
import os, glob, base64
from IPython.display import HTML, display

os.makedirs('video', exist_ok=True)

def show_video(name):
    """Embed video/<name>.mp4 in the notebook output as a base64-encoded HTML5 video."""
    mp4list = glob.glob(f'video/{name}.mp4')
    if mp4list:
        mp4 = mp4list[0]
        # Read the file in binary mode; the with-block closes it afterwards
        with open(mp4, 'rb') as f:
            video = f.read()
        encoded = base64.b64encode(video)
        display(HTML(f'''
            <video autoplay controls style="max-height: 400px;">
              <source src="data:video/mp4;base64,{encoded.decode('ascii')}" type="video/mp4"/>
            </video>'''))
    else:
        print("Could not find video")


---
## 1. Building the Gym environment

In [48]:
import gymnasium as gym

env = gym.make('FrozenLake-v1', render_mode='rgb_array', is_slippery=True)

Gym environment options
- `env.observation_space.n`: number of discrete states (size of the state space)
- `env.action_space.n`: number of discrete actions (size of the action space)
- Others: https://gymnasium.farama.org/api/env/
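For the default 4x4 `FrozenLake` map, each of the 16 state indices encodes a grid cell as `row * ncol + col`, and the 4 actions are numbered 0 (left), 1 (down), 2 (right), 3 (up). The following is a minimal sketch of that mapping; the names `ACTION_NAMES` and `state_to_rowcol` are hypothetical helpers introduced here for illustration and are not part of Gymnasium.

```python
# Hypothetical helpers for reading FrozenLake state/action indices.
# Assumes the default 4x4 map (ncol=4).

ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}

def state_to_rowcol(state, ncol=4):
    """Convert a flat state index into (row, col) on the grid."""
    return divmod(state, ncol)

print(state_to_rowcol(0))    # (0, 0): start cell
print(state_to_rowcol(15))   # (3, 3): goal cell
print(ACTION_NAMES[2])       # RIGHT
```

Reading the Q-table row by row with this mapping makes it easier to see which direction the greedy policy prefers in each cell.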

In [49]:
print("State  :", env.observation_space.n)
print("Action :", env.action_space.n)


State  : 16
Action : 4


---
## 2. Q-learning

### 2.1. Creating and initializing the Q-table
$$Q(S, A) = 0 \;\;\;\forall S, A$$

In [50]:
import numpy as np

Q = np.zeros((env.observation_space.n, env.action_space.n))

### 2.2. Implementing the Q-learning function
- $\epsilon$-greedy policy: with probability $\epsilon$ choose a random action; otherwise choose the action with the highest Q-value in the Q-table
  - `np.random.rand()`: draws a random real number uniformly from [0, 1)
- Bellman equation update
$$Q(S, A) \leftarrow Q(S, A) + \alpha [R+\gamma\max_{a'}Q(S',a')-Q(S,A)]$$
- $\alpha$: Learning rate (0~1)
- $\gamma$: Discount factor (0~1)
- $\epsilon$: Exploration probability (0~1)
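To make the update rule concrete, here is a small worked example of a single update with made-up numbers. The function name `q_update` and the input values are illustrative assumptions, not taken from the lab material.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-learning update to a single Q(S, A) entry."""
    td_target = reward + gamma * max_q_next   # R + gamma * max_a' Q(S', a')
    td_error = td_target - q_sa               # distance between target and current estimate
    return q_sa + alpha * td_error            # move a fraction alpha toward the target

# Illustrative values: Q(S, A) = 0.5, R = 1.0, max_a' Q(S', a') = 0.8
new_q = q_update(q_sa=0.5, reward=1.0, max_q_next=0.8, alpha=0.1, gamma=0.99)
print(round(new_q, 4))  # 0.6292
```

With $\alpha = 0.1$ the estimate moves only 10% of the way toward the target 1.792, which is why many episodes are usually needed before the Q-table settles.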

In [51]:
def Q_learning(env, Q, alpha, gamma, epsilon, n_episodes):
  for epi in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
      ## 𝜖-greedy policy: explore with probability epsilon, otherwise act greedily
      if np.random.rand() < epsilon:
        action = env.action_space.sample()
      else:
        action = np.argmax(Q[state])
      next_state, reward, terminated, truncated, _ = env.step(action)
      done = terminated or truncated
      ## Q-learning update: move Q[state, action] toward R + gamma * max_a' Q[next_state, a']
      best_next = np.argmax(Q[next_state])
      Q[state, action] += alpha * (reward + gamma * Q[next_state, best_next] - Q[state, action])
      state = next_state

  return Q

### 2.3. Running Q-learning training


In [52]:
alpha = 0.1         # Learning rate
gamma = 0.99        # Discount factor
epsilon = 0.4       # Exploration rate
n_episodes = 1

Q_trained = Q_learning(env, Q.copy(), alpha, gamma, epsilon, n_episodes)

### 2.4. Checking the training result

In [53]:
import imageio

writer = imageio.get_writer('video/cliffwalking.mp4', fps=10)
state, _ = env.reset()
done = False
step = 0

while not done:
    step += 1
    frame = env.render()
    writer.append_data(frame)
    action = np.argmax(Q_trained[state])
    state, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated or step >= 100

writer.close()
env.close()
show_video('cliffwalking')
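
A single video shows only one rollout, and with slippery transitions the outcome varies from run to run. As a complement, the sketch below estimates how often the greedy policy succeeds over many episodes. The helper `evaluate_greedy` is a hypothetical name introduced here; it assumes an environment with the Gymnasium `reset`/`step` interface and that a positive reward on the final step marks success, as in `FrozenLake`.

```python
import numpy as np

def evaluate_greedy(env, Q, n_episodes=100, max_steps=100):
    """Return the fraction of episodes that end with a positive reward
    when always taking the action with the highest Q-value."""
    successes = 0
    for _ in range(n_episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            action = int(np.argmax(Q[state]))  # greedy action, no exploration
            state, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                successes += int(reward > 0)   # count only episodes that reached the goal
                break
    return successes / n_episodes
```

Because the cell above calls `env.close()`, one would create a fresh environment with `gym.make('FrozenLake-v1', is_slippery=True)` and then call `evaluate_greedy(env, Q_trained)`.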

## 3. Further experiments

1. What happens if the **exploration probability** is set to a higher value?
2. Why does the similar environment `FrozenLake` need **more steps** to learn?
3. Check the effect of **stochastic transitions**