## Task 2A - GridWorld
Build a simple gridworld environment: a discrete grid in which an agent can
move around using four actions (up, down, right, left). The simulation
terminates when the agent reaches a designated goal position, which gives a
reward of 1. Optionally, solid walls or hazardous cells that give a penalty can
be placed around the grid. The environment must expose the same interface as
CartPole (a `.step(a)` function and `.reset()`).
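
The interface described above can be sketched roughly as follows. This is a minimal illustration, not the `gridworld.GridWorld` class imported later in the notebook: the class name `MiniGridWorld`, the action encoding, the start cell, the default goal position, and the constructor parameters are all assumptions made for the example.

```python
class MiniGridWorld:
    """Hypothetical minimal grid world with a gym-like reset()/step() interface."""

    # Assumed action encoding: 0 = up, 1 = down, 2 = right, 3 = left
    MOVES = {0: (0, -1), 1: (0, 1), 2: (1, 0), 3: (-1, 0)}

    def __init__(self, size=8, goal=(6, 6), goal_reward=1.0):
        self.size = size                # The grid is size x size cells
        self.goal = goal                # Assumed goal cell (x, y)
        self.goal_reward = goal_reward  # Reward of 1, as in the task text
        self.pos = (0, 0)

    def reset(self):
        """Places the agent in the top-left cell and returns the observation."""
        self.pos = (0, 0)
        return list(self.pos)

    def step(self, action):
        """Applies one action and returns (obs, reward, done, info)."""
        dx, dy = self.MOVES[action]
        # Clamps the position so the agent cannot leave the grid
        x = min(self.size - 1, max(0, self.pos[0] + dx))
        y = min(self.size - 1, max(0, self.pos[1] + dy))
        self.pos = (x, y)
        done = self.pos == self.goal
        reward = self.goal_reward if done else 0.0
        return list(self.pos), reward, done, {}


env_sketch = MiniGridWorld()
obs = env_sketch.reset()                   # [0, 0]
obs, reward, done, _ = env_sketch.step(2)  # Moves right, so obs == [1, 0]
```

Walls or penalty cells could be added by checking the target cell inside `step` before the position is updated.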

Next, use the Q-learning implementation from the previous task to train an
agent in the environment. Finally, visualize the Q-values inside the
environment itself. This can be done in several ways. One option is to colour
each cell based on the highest Q-value in the corresponding row of the Q-table.
Alternatively, draw arrows that point in the direction of the action with the
highest Q-value.
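
Both visualization ideas reduce to two reductions over the action axis of the Q-table. The sketch below computes them and renders the arrows as text rather than with pygame. The helper name `greedy_arrows`, the action order, and the assumption that the table is indexed as `q_table[x][y][action]` are illustrative and may not match the actual environment.

```python
import numpy as np

# Assumed action order: 0 = up, 1 = down, 2 = right, 3 = left
ARROWS = np.array(['^', 'v', '>', '<'])

def greedy_arrows(q_table):
    """Returns (rows, values): arrow strings per grid row and the max Q-value per cell."""
    best = np.argmax(q_table, axis=-1)  # Index of the best action in each cell
    values = np.max(q_table, axis=-1)   # Highest Q-value in each cell (for colouring)
    # One text line per y coordinate, with cells ordered by x
    rows = [' '.join(ARROWS[best[:, y]]) for y in range(q_table.shape[1])]
    return rows, values

# Tiny hypothetical 2x2 table: "right" is best in (0, 0), "down" is best in (1, 0)
demo_q = np.zeros((2, 2, 4))
demo_q[0, 0, 2] = 1.0
demo_q[1, 0, 1] = 1.0
rows, values = greedy_arrows(demo_q)
print('\n'.join(rows))
```

Cells whose Q-values are all equal fall back to the first action, because `np.argmax` returns the first maximum. The `values` array can be normalized and mapped to a colour for the cell-colouring variant.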

Tip: The pygame library works well for visualizing the environment.

In [64]:
import gym 
import math 
import numpy as np 
from gridworld import GridWorld

In [65]:
env = GridWorld(800, 64, 1)

In [66]:
# Hyperparameters 
BUCKETS = (8, 8) 
EPISODES = 3000
MIN_LEARNING_RATE = 0.1
MIN_EPSILON = 0.1
DISCOUNT = 1.0
DECAY = 500

# Visualization variables 
SHOW_STATS = 500

In [67]:
q_table = np.zeros(BUCKETS + (env.action_space.n, ))

In [68]:
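# Note: these bounds appear to be carried over from the CartPole task (four
# entries with velocity limits); the training loop below does not use them.
upper_bounds = [env.observation_space.high[0], 0.5, env.observation_space.high[1], math.radians(50) / 1.]
lower_bounds = [env.observation_space.low[0], -0.5, env.observation_space.low[1], -math.radians(50) / 1.]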

In [69]:
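# Discretizes a continuous observation into bucket indices.
# Note: the training loop below indexes the Q-table with tuple(obs) directly,
# so this helper is not called in this notebook.
def discretize_state(obs):
    discretized = []

    for i in range(len(obs)):
        # Fraction of the way from the lower bound to the upper bound
        scaling = (obs[i] - lower_bounds[i]) / (upper_bounds[i] - lower_bounds[i])
        new_obs = int(round((BUCKETS[i] - 1) * scaling))
        new_obs = min(BUCKETS[i] - 1, max(0, new_obs))  # Clamps to a valid bucket
        discretized.append(new_obs)

    return tuple(discretized)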

In [70]:
# Chooses what action to take (random or look in Q-Table)
def choose_action(state):
    if (np.random.random() < epsilon):
        return env.action_space.sample() # Random action
    else:
        return np.argmax(q_table[state]) # Looks up in the Q-Table 

In [71]:
# Updates the Q-Table 
def update_q(state, action, reward, new_state):
    q_table[state][action] += learning_rate * (reward + DISCOUNT * np.max(q_table[new_state]) - q_table[state][action])
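
To make the update rule concrete, here is a single update worked through on a tiny table. All numbers (the table shape, the pre-filled values, the learning rate and the discount) are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

q = np.zeros((2, 2, 4))         # Tiny hypothetical table: 2x2 grid, 4 actions
q[1, 0] = [0.0, 0.5, 0.2, 0.0]  # Pretend the next state already has some values

alpha, gamma = 0.5, 1.0         # Illustrative learning rate and discount
state, action, reward, new_state = (0, 0), 2, 0.0, (1, 0)

# Target: immediate reward plus the discounted best value of the next state
td_target = reward + gamma * np.max(q[new_state])  # 0.0 + 1.0 * 0.5 = 0.5
# Moves the old estimate a fraction alpha of the way towards the target
q[state][action] += alpha * (td_target - q[state][action])

print(q[state][action])  # 0.25
```

The old estimate was 0, the target is 0.5, and with a learning rate of 0.5 the estimate moves halfway, to 0.25.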

In [72]:
# Updates epsilon value (logarithmically decreasing)
def get_epsilon(episode):
    return max(MIN_EPSILON, min(1., 1. - math.log10((episode + 1) / DECAY)))

In [73]:
# Updates the learning rate (logarithmically decreasing)
def get_learning_rate(episode):
    return max(MIN_LEARNING_RATE, min(1., 1. - math.log10((episode + 1) / DECAY)))
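
Both schedules use the same formula, so it can help to print a few values and see how fast they decay. The sketch below re-implements the formula under the hypothetical name `schedule`, with the constants copied from the hyperparameter cell.

```python
import math

DECAY = 500      # Same value as in the hyperparameter cell
MIN_VALUE = 0.1  # Matches MIN_EPSILON and MIN_LEARNING_RATE

def schedule(episode):
    """Hypothetical stand-in for get_epsilon / get_learning_rate (same formula)."""
    return max(MIN_VALUE, min(1.0, 1.0 - math.log10((episode + 1) / DECAY)))

for episode in (0, 499, 999, 1999, 2999, 4999):
    print(episode + 1, round(schedule(episode), 3))
# 1 1.0
# 500 1.0
# 1000 0.699
# 2000 0.398
# 3000 0.222
# 5000 0.1
```

The value stays at 1.0 for the first 500 episodes and only reaches the 0.1 floor at episode 5000. With `EPISODES = 3000`, epsilon and the learning rate therefore end at roughly 0.22 and the minimum values are never hit.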

In [74]:
print('Episode  Score')

scores = []
completionCount = 0 

for episode in range(EPISODES):

    current_state = tuple(env.reset()) 
    
    # Updates learning rate and epsilon 
    learning_rate = get_learning_rate(episode)
    epsilon = get_epsilon(episode)
    
    # Runs through an episode 
    done = False
    while not done:
        
        action = choose_action(current_state)              # Chooses action
        obs, reward, done, _ = env.step(action)            # Performs action 
        new_state = tuple(obs)                             # Converts the observation to a tuple index
        update_q(current_state, action, reward, new_state) # Updates the Q-Table
        current_state = new_state                          # Updates the current state
        
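        # Counts a completion; 10.0 is taken to be this environment's goal
        # reward (the task text mentions a reward of 1)
        if reward == 10.0:
            completionCount += 1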
     
    # Prints some statistics  
    if (episode + 1) % SHOW_STATS == 0: 
        score = round((completionCount / SHOW_STATS) * 100, 2)
        completionCount = 0
        print(f'{episode + 1}\t {score}%') 

Episode  Score
500	 8.2%
1000	 44.4%
1500	 92.6%
2000	 98.2%
2500	 99.8%
3000	 100.0%


In [75]:
epsilon = 0.0 
current_state = tuple(env.reset()) 

done = False 
while not done:
        
    # Chooses and performs action (greedy, since epsilon is 0)
    action = choose_action(current_state)
    obs, reward, done, _ = env.step(action)

    # Sets new state
    new_state = tuple(obs)
    current_state = new_state

    # Renders the frame
    env.render(q_table)
    print(obs)

[1, 0]
[2, 0]
[3, 0]
[4, 0]
[5, 0]
[6, 0]
[6, 1]
[6, 2]
[6, 3]
[6, 4]
[6, 5]
[6, 6]
