<a href="https://colab.research.google.com/github/DoHyung08/RL/blob/main/0407value/value_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Computing the Value Function

Below is the environment we will work with.

In [None]:
# define the 4 * 4 grid environment
class GridEnv:
    def __init__(self):
        self.state_space = [(i, j) for i in range(4) for j in range(4)]
        self.action_space = ['up', 'down', 'left', 'right']

        self.grid = [
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]
        ]
        self.start = (3, 0)
        self.goal = (0, 3)
        self.rewards = [
            [-1, -1, -1, 1],
            [-1, -1, -1, -1],
            [-1, -1, -1, -1],
            [-1, -1, -1, -1]
        ]
        self.done = [
            [0, 0, 0, 1],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]
        ]

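    def transition(self, state, action):
        # state is (row, col): 'up'/'down' change the row, 'left'/'right' the column.
        # Moves that would leave the grid keep the agent in place.
        x, y = state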
        if action == 'up':
            x = max(x - 1, 0)
        elif action == 'down':
            x = min(x + 1, 3)
        elif action == 'left':
            y = max(y - 1, 0)
        elif action == 'right':
            y = min(y + 1, 3)
        return (x, y)

    def transition_prob(self, next_state, state, action):
        # Transitions are deterministic: True (probability 1) only for the resulting state.
        return next_state == self.transition(state, action)

    def reward(self, state, action):
        next_state = self.transition(state, action)
        x, y = next_state
        return self.rewards[x][y]

    def reset(self):
        self.grid = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
        self.state = self.start
        return self.start

    def step(self, action):
        next_state = self.transition(self.state, action)
        reward = self.reward(self.state, action)
        done = self.done[next_state[0]][next_state[1]]
        self.state = next_state
        return next_state, reward, done

In [None]:
# Sample code for running the environment
import random

env = GridEnv()
state = env.reset()
done = False
print("Initial state:", state)
while not done:
    action = random.choice(env.action_space)
    print("Chose action:", action)
    state, reward, done = env.step(action)
    print("New state:", state)
    print("Reward:", reward)

Initial state: (3, 0)
Chose action: left
New state: (3, 0)
Reward: -1
Chose action: down
New state: (3, 0)
Reward: -1
Chose action: up
New state: (2, 0)
Reward: -1
Chose action: up
New state: (1, 0)
Reward: -1
Chose action: left
New state: (1, 0)
Reward: -1
Chose action: right
New state: (1, 1)
Reward: -1
Chose action: up
New state: (0, 1)
Reward: -1
Chose action: up
New state: (0, 1)
Reward: -1
Chose action: left
New state: (0, 0)
Reward: -1
Chose action: right
New state: (0, 1)
Reward: -1
Chose action: right
New state: (0, 2)
Reward: -1
Chose action: up
New state: (0, 2)
Reward: -1
Chose action: right
New state: (0, 3)
Reward: 1


First, let's make up a policy. For simplicity, consider one that always moves up, no matter the state.

In [None]:
PI = {s: 'up' for s in env.state_space}

In [None]:
def visualize_policy(env, PI):
    for i in range(4):
        for j in range(4):
            if (i, j) == env.goal:
                print(" G ", end='\t')
            else:
                print(PI[(i, j)], end='\t')
        print()

In [None]:
visualize_policy(env, PI)

up	up	up	 G 	
up	up	up	up	
up	up	up	up	
up	up	up	up	


### Policy Evaluation Step
To judge how good our current policy is, we compute its value function.

As covered in class, the value function gives the total reward we can expect to collect by following this policy.
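
For reference, the quantity being computed can be written out. This is a sketch, assuming the deterministic transitions of `GridEnv` and a deterministic policy $\pi$; the symbols $T$ and $r$ are shorthand introduced here for `env.transition` and `env.reward`:

$$
V^{\pi}(s_0) = \sum_{t=0}^{\infty} \gamma^{t}\, r\big(s_t, \pi(s_t)\big), \qquad s_{t+1} = T\big(s_t, \pi(s_t)\big)
$$

$$
V^{\pi}(s) = r\big(s, \pi(s)\big) + \gamma\, V^{\pi}\big(T(s, \pi(s))\big)
$$

The second line is the recursive form: a state's value is the reward for the policy's action plus the discounted value of the state that action leads to.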

In [None]:
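# Policy Evaluation
def policy_evaluation(env, PI, gamma=1):
    # Start from V(s) = 0 and repeatedly apply the Bellman backup for the fixed policy PI.
    V = {s: 0 for s in env.state_space}
    for _ in range(1000):  # fixed number of sweeps, no convergence check
        for s in env.state_space:
            action = PI[s]
            next_state = env.transition(s, action)
            reward = env.reward(s, action)
            # In-place update: V[next_state] may already hold this sweep's value.
            V[s] = reward + gamma * V[next_state]

    return V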

In [None]:
V = policy_evaluation(env, PI)
# Visualize the value function
def visualize_value_function(V):
    for i in range(4):
        for j in range(4):
            print(f"{V[(i, j)]:.2f}", end='\t')
        print()

visualize_value_function(V)

-1000.00	-1000.00	-1000.00	1000.00	
-1001.00	-1001.00	-1001.00	1001.00	
-1002.00	-1002.00	-1002.00	1000.00	
-1003.00	-1003.00	-1003.00	999.00	
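
The magnitudes near 1000 above track the 1000 sweeps: with `gamma=1` and no special handling of the goal, every sweep adds one more step's reward. Below is a minimal sketch of a variant that treats the goal as terminal (value fixed at 0) and stops once the largest update falls below a threshold. The names `evaluate_policy`, `theta`, and `max_sweeps`, the choice `gamma=0.9`, and the terminal-goal convention are all assumptions for illustration, not part of the notebook's code:

```python
# Sketch: iterative policy evaluation with a terminal goal and a stopping rule.
# The grid layout mirrors GridEnv above; everything else here is an assumption.
STATES = [(i, j) for i in range(4) for j in range(4)]
GOAL = (0, 3)
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}


def transition(state, action):
    # Clamp to the 4 x 4 grid, as GridEnv.transition does.
    dx, dy = MOVES[action]
    return (min(max(state[0] + dx, 0), 3), min(max(state[1] + dy, 0), 3))


def reward(state, action):
    # +1 for entering the goal, -1 for every other move.
    return 1 if transition(state, action) == GOAL else -1


def evaluate_policy(policy, gamma=0.9, theta=1e-10, max_sweeps=10_000):
    V = {s: 0.0 for s in STATES}
    for _ in range(max_sweeps):
        delta = 0.0  # largest change seen in this sweep
        for s in STATES:
            if s == GOAL:
                continue  # terminal state: its value stays 0
            s_next = transition(s, policy[s])
            new_v = reward(s, policy[s]) + gamma * V[s_next]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break  # values have stopped changing
    return V


always_up = {s: 'up' for s in STATES}
V_up = evaluate_policy(always_up)
for i in range(4):
    print('\t'.join(f"{V_up[(i, j)]:.2f}" for j in range(4)))
```

Under these assumptions the three left columns settle near -10 (an endless stream of -1 rewards discounted by 0.9), while the right column reads 0.00, 1.00, -0.10, -1.09 from top to bottom, reflecting the short path up to the goal.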


### Policy Improvement Step

The value of a state tells us how good it is to be in that state.

So a new policy that moves toward higher-value states can be a better policy than the current one.
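
Written as a formula, this greedy step might look as follows. This is a sketch, where $T(s, a)$ and $r(s, a)$ are shorthand introduced here for `env.transition` and `env.reward`. The first line is the rule "move toward the highest-valued neighbor"; the second is the usual one-step lookahead form, which also accounts for the immediate reward and the discount factor:

$$
\pi'(s) = \arg\max_{a} V\big(T(s, a)\big)
$$

$$
\pi'(s) = \arg\max_{a} \Big[ r(s, a) + \gamma\, V\big(T(s, a)\big) \Big]
$$

The two rules pick the same action whenever every action available in a state earns the same immediate reward and $\gamma > 0$.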

In [None]:
# Policy Improvement
def policy_improvement(env, V, PI, gamma=1):  # discount factor = 1 (PI and gamma are unused here)
    newPI = {}
    for s in env.state_space:
        max_v = float('-inf')
        for a in env.action_space:
            next_state = env.transition(s, a)  # choose the action using the value
            if max_v < V[next_state]:
                max_v = V[next_state]
                newPI[s] = a

            # nextReward = env.reward(s, a)  # choose the action using the reward instead
            # if nextReward > max_v:
            #     max_v = nextReward
            #     newPI[s] = a

    return newPI

In [None]:
PI = policy_improvement(env, V, PI)
visualize_policy(env, PI)

up	up	right	 G 	
up	up	up	up	
up	up	up	up	
up	up	up	up	
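
A minimal sketch of the improvement step as a one-step lookahead, scoring each action by its immediate reward plus the discounted value of the state it leads to. The helper names (`q_value`, `greedy_policy`) and the standalone grid functions are assumptions made so the block runs by itself; they mirror `GridEnv` but are not part of the notebook:

```python
# Sketch: greedy policy improvement via one-step lookahead.
STATES = [(i, j) for i in range(4) for j in range(4)]
ACTIONS = ['up', 'down', 'left', 'right']
GOAL = (0, 3)
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}


def transition(state, action):
    # Clamp to the 4 x 4 grid, as GridEnv.transition does.
    dx, dy = MOVES[action]
    return (min(max(state[0] + dx, 0), 3), min(max(state[1] + dy, 0), 3))


def reward(state, action):
    # +1 for entering the goal, -1 for every other move.
    return 1 if transition(state, action) == GOAL else -1


def q_value(V, state, action, gamma=1.0):
    # Immediate reward plus discounted value of the resulting state.
    return reward(state, action) + gamma * V[transition(state, action)]


def greedy_policy(V, gamma=1.0):
    # max() returns the first maximal item, so ties go to the earliest entry in ACTIONS.
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a, gamma)) for s in STATES}


V_zero = {s: 0.0 for s in STATES}
PI_new = greedy_policy(V_zero)
print(PI_new[(0, 2)], PI_new[(1, 3)], PI_new[(3, 0)])
```

With all values at zero, the lookahead reduces to picking the highest immediate reward, which is the rule in the commented-out alternative inside `policy_improvement`.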


### Running Policy Iteration

In [None]:
PI = {s: 'up' for s in env.state_space}
for _ in range(3):
    V = policy_evaluation(env, PI)
    PI = policy_improvement(env, V, PI)
    visualize_policy(env, PI)
    visualize_value_function(V)
    print()

up	up	right	 G 	
up	up	up	up	
up	up	up	up	
up	up	up	up	
-1000.00	-1000.00	-1000.00	1000.00	
-1001.00	-1001.00	-1001.00	1001.00	
-1002.00	-1002.00	-1002.00	1000.00	
-1003.00	-1003.00	-1003.00	999.00	

up	up	right	 G 	
up	up	up	up	
up	up	up	up	
up	up	up	up	
-1000.00	-1000.00	1000.00	1000.00	
-1001.00	-1001.00	999.00	1001.00	
-1002.00	-1002.00	998.00	1000.00	
-1003.00	-1003.00	997.00	999.00	

up	up	right	 G 	
up	up	up	up	
up	up	up	up	
up	up	up	up	
-1000.00	-1000.00	1000.00	1000.00	
-1001.00	-1001.00	999.00	1001.00	
-1002.00	-1002.00	998.00	1000.00	
-1003.00	-1003.00	997.00	999.00	
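
The loop above runs a fixed three rounds. Below is a sketch of policy iteration that instead stops once the policy no longer changes, assuming the goal is terminal and `gamma=0.9` so that evaluation converges even for policies that never reach the goal. All names here (`evaluate`, `improve`, `policy_iteration`, `theta`, `tol`) are illustrative assumptions rather than the notebook's own code:

```python
# Sketch: policy iteration that stops when the policy is stable.
STATES = [(i, j) for i in range(4) for j in range(4)]
ACTIONS = ['up', 'down', 'left', 'right']
START, GOAL = (3, 0), (0, 3)
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}


def transition(state, action):
    # Clamp to the 4 x 4 grid, as GridEnv.transition does.
    dx, dy = MOVES[action]
    return (min(max(state[0] + dx, 0), 3), min(max(state[1] + dy, 0), 3))


def reward(state, action):
    # +1 for entering the goal, -1 for every other move.
    return 1 if transition(state, action) == GOAL else -1


def q_value(V, s, a, gamma):
    return reward(s, a) + gamma * V[transition(s, a)]


def evaluate(policy, gamma, theta=1e-10, max_sweeps=10_000):
    V = {s: 0.0 for s in STATES}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in STATES:
            if s == GOAL:
                continue  # terminal state keeps value 0
            new_v = q_value(V, s, policy[s], gamma)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    return V


def improve(V, policy, gamma, tol=1e-9):
    # Keep the current action unless another one is better by more than tol,
    # so ties cannot make the policy flip back and forth.
    new_policy = {}
    for s in STATES:
        best_a, best_q = policy[s], q_value(V, s, policy[s], gamma)
        for a in ACTIONS:
            q = q_value(V, s, a, gamma)
            if q > best_q + tol:
                best_a, best_q = a, q
        new_policy[s] = best_a
    return new_policy


def policy_iteration(gamma=0.9, max_rounds=100):
    policy = {s: 'up' for s in STATES}
    for _ in range(max_rounds):
        V = evaluate(policy, gamma)
        new_policy = improve(V, policy, gamma)
        if new_policy == policy:
            break  # no state changed its action
        policy = new_policy
    return policy, V


policy, V = policy_iteration()

# Follow the resulting policy from the start state and count the moves.
state, steps = START, 0
while state != GOAL and steps < 100:
    state = transition(state, policy[state])
    steps += 1
print("steps to goal:", steps)
```

Under these assumptions the resulting policy reaches the goal from the start state (3, 0) in 6 moves, the shortest possible path on this grid.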

