<h1 align='center'>Python GYM FrozenLake Q-Learning</h1>

Patryk Kośmider s16863 and Krzysztof Marek s16663

https://en.wikipedia.org/wiki/Q-learning

https://gym.openai.com/envs/FrozenLake-v0/

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

![FrozenLake](FrozenLake.png)

In [1]:
import gym
import random
import time
import numpy as np
from IPython.display import clear_output

Creating the environment

In [2]:
env = gym.make('FrozenLake-v0')

The table that stores the algorithm's data (the Q-table), with one row per state and one column per action

In [3]:
q_table = np.zeros((env.observation_space.n, env.action_space.n))

print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
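As a reading aid, the sketch below shows how a cell of such a table can be addressed. It assumes the default 4x4 map, where states are numbered row by row and the actions are 0 = Left, 1 = Down, 2 = Right, 3 = Up. The helper `state_index` and the value 0.5 are hypothetical and exist only for this illustration.

```python
import numpy as np

n_states, n_actions = 16, 4  # assumed sizes for the 4x4 map
q_table = np.zeros((n_states, n_actions))


def state_index(row, col, n_cols=4):
    """Hypothetical helper: convert a grid position to a state number."""
    return row * n_cols + col


# Pretend the agent has learned that moving Right (action 2) from the
# tile left of the goal is worth 0.5. The number is made up.
tile = state_index(3, 2)
q_table[tile, 2] = 0.5

# The greedy action for a state is the column with the largest value.
greedy_action = int(np.argmax(q_table[tile]))
print(tile, greedy_action)
```

Each row therefore answers the question "how good is each of the four moves from this tile?".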


Count variables, where:
* **episode_count** - total number of training episodes for the agent
* **step_count** - maximum number of actions in a single episode
* **epoch_count** - number of episodes between summaries

In [4]:
episode_count = 30000
step_count = 100
epoch_count = 1000

The agent's learning rate, which determines how much new information overrides old information

In [5]:
learning_rate = 0.06
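To make the effect of the learning rate concrete, here is a minimal sketch of the blending step on its own. The function name `blend` and the sample values 0.5 and 1.0 are hypothetical, chosen only for illustration.

```python
def blend(old_value, target, learning_rate):
    """Move old_value toward target by a fraction equal to learning_rate."""
    return (1.0 - learning_rate) * old_value + learning_rate * target


old_value, target = 0.5, 1.0  # made-up numbers for the illustration

for lr in (0.0, 0.06, 1.0):
    # lr = 0.0 keeps the old value, lr = 1.0 replaces it entirely
    print(lr, blend(old_value, target, lr))
```

With a small rate such as 0.06 each update nudges the estimate only slightly, which helps average out the randomness of the slippery ice.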

The discount factor, which determines how much weight the agent gives to future rewards

In [6]:
discount_factor = 0.99
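A quick way to see what the discount factor does is to compute how much a reward of 1 is worth when it arrives several steps in the future. The function `discounted_value` is a hypothetical helper; six steps is used because that is the shortest possible path from start to goal on the 4x4 map, assuming the agent never slips.

```python
def discounted_value(reward, steps_ahead, discount_factor):
    """Present value of a reward received steps_ahead steps from now."""
    return reward * discount_factor ** steps_ahead


for gamma in (0.5, 0.9, 0.99):
    print(gamma, discounted_value(1.0, 6, gamma))
```

With 0.99 the goal reward is still worth roughly 0.94 six steps away, so the agent stays motivated by a distant goal. With a much smaller factor the signal would fade quickly on the way back to the start tile.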

Variables that control the exploration rate, which ultimately decides whether the agent picks a random action or the best known one

In [7]:
expl_rate = 1.0
min_expl_rate = 0.001
max_expl_rate = 0.9
expl_decay_rate = 0.001
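The training loop in this notebook recomputes the exploration rate after every episode with an exponential decay. The sketch below isolates that formula so the schedule can be inspected on its own. The function name `exploration_rate` is hypothetical, and the defaults mirror the values set above.

```python
import numpy as np


def exploration_rate(episode, min_rate=0.001, max_rate=0.9, decay_rate=0.001):
    """Exponentially decay the exploration rate from max_rate toward min_rate."""
    return min_rate + (max_rate - min_rate) * np.exp(-decay_rate * episode)


for episode in (0, 1000, 5000, 30000):
    print(episode, exploration_rate(episode))
```

Note that the initial `expl_rate = 1.0` applies only to the very first episode. From then on the schedule starts at `max_expl_rate` and falls to about 0.33 after 1000 episodes, which matches the slow start visible in the per-epoch results.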

Training the agent

![BellmanEq](BellmanEq.png)
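For readers who cannot see the image, the update applied by the training code in this notebook can be written as follows. This is a transcription of the code, not of the image itself, and it assumes $\alpha$ stands for `learning_rate`, $\gamma$ for `discount_factor`, $r$ for the reward, and $s'$ for the new state.

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
$$

The first term keeps a share of the old estimate, and the second term mixes in the reward plus the discounted value of the best action available in the next state.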

In [8]:
all_rewards = []

for episode in range(episode_count):
    state = env.reset()

    rewards_sum = 0

    for step in range(step_count):
        threshold = random.uniform(0, 1)

        if threshold > expl_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()

        new_state, reward, done, info = env.step(action)

        # Bellman equation in iterative form
        old_value = q_table[state, action]
        p = learning_rate
        q = 1.0 - learning_rate
        best = np.max(q_table[new_state, :])

        q_table[state, action] = q * old_value + p * (reward + discount_factor * best)

        state = new_state
        rewards_sum += reward

        if done:
            break

    expl_rate = min_expl_rate + (max_expl_rate - min_expl_rate) * np.exp(-expl_decay_rate * episode)

    all_rewards.append(rewards_sum)    
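The action choice at the top of the inner loop is an epsilon-greedy rule: explore with probability `expl_rate`, otherwise exploit the best known action. Here is a minimal standalone sketch of that rule, assuming a hypothetical function `choose_action` and a made-up row of Q-values. It uses Python's `random.Random` in place of `env.action_space.sample()` so that it runs without gym.

```python
import random

import numpy as np


def choose_action(q_row, expl_rate, rng):
    """Epsilon-greedy choice over one row of the Q-table."""
    if rng.random() > expl_rate:
        return int(np.argmax(q_row))   # exploit: best known action
    return rng.randrange(len(q_row))   # explore: uniformly random action


rng = random.Random(0)
q_row = np.array([0.1, 0.7, 0.2, 0.0])  # made-up values for one state

# With expl_rate = 0.0 the choice is always greedy.
greedy_choices = {choose_action(q_row, 0.0, rng) for _ in range(100)}

# With expl_rate = 1.0 every choice is random.
random_choices = {choose_action(q_row, 1.0, rng) for _ in range(100)}

print(greedy_choices, random_choices)
```

Early in training the rate is high, so the agent mostly wanders and fills in the table. As the rate decays, the learned values take over.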


The Q-table after training

In [9]:
print(q_table)

[[0.54655058 0.46917624 0.4746591  0.46875876]
 [0.23289098 0.22607855 0.21895984 0.4514138 ]
 [0.37036749 0.26425586 0.21491679 0.20674966]
 [0.16333394 0.02619242 0.01173932 0.0139407 ]
 [0.56421131 0.41000003 0.35779265 0.28737961]
 [0.         0.         0.         0.        ]
 [0.19558762 0.17333731 0.22027372 0.10900427]
 [0.         0.         0.         0.        ]
 [0.29600173 0.38791525 0.38714603 0.61018952]
 [0.42440071 0.63352107 0.42042915 0.46552934]
 [0.5462905  0.3755134  0.43306242 0.30098955]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.42264874 0.52232768 0.73580395 0.55069006]
 [0.71773127 0.86052823 0.7342618  0.7315334 ]
 [0.         0.         0.         0.        ]]


Printing the agent's training quality after each block of episodes

In [10]:
# Integer division gives np.split an integer number of sections
rewards_per_epoch = np.split(np.array(all_rewards), episode_count // epoch_count)

for i, r in enumerate(rewards_per_epoch):
    print(f'{(i + 1) * epoch_count}: {round(sum(r / epoch_count), 5)}')

print(f'All: {round(np.average(np.array(all_rewards)), 5)}')

1000: 0.036
2000: 0.184
3000: 0.445
4000: 0.612
5000: 0.67
6000: 0.697
7000: 0.725
8000: 0.724
9000: 0.731
10000: 0.71
11000: 0.692
12000: 0.716
13000: 0.709
14000: 0.723
15000: 0.724
16000: 0.74
17000: 0.715
18000: 0.733
19000: 0.74
20000: 0.726
21000: 0.728
22000: 0.753
23000: 0.721
24000: 0.735
25000: 0.716
26000: 0.743
27000: 0.712
28000: 0.737
29000: 0.709
30000: 0.72
All: 0.66753
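Because every successful episode earns a reward of exactly 1, each number above is the share of episodes in that block that reached the goal. The same per-block average can also be computed with a reshape, as in the sketch below. The eight rewards are synthetic toy data, not results from this notebook, and the block size of 4 is chosen only to keep the example small.

```python
import numpy as np

toy_rewards = [0, 1, 1, 0, 1, 1, 1, 1]  # synthetic data for illustration
block_size = 4                           # plays the role of epoch_count

# One row per block, then the mean of each row is that block's success rate.
per_block = np.array(toy_rewards).reshape(-1, block_size).mean(axis=1)

for i, rate in enumerate(per_block):
    print(f'{(i + 1) * block_size}: {rate}')
```

Like `np.split`, the reshape only works when the number of episodes is a multiple of the block size.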


Testing the agent

In [11]:
for episode in range(3):
    state = env.reset()

    time.sleep(1)

    for step in range(step_count):
        clear_output(wait = True)
        
        env.render()
        
        time.sleep(0.2)
        
        action = np.argmax(q_table[state, :])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            clear_output(wait = True)
            
            env.render()
            
            if reward == 1:
                print('Wygrana')  # "Win"
            else:
                print('Przegrana')  # "Loss"
            time.sleep(3)
            
            clear_output(wait = True)
            break
        
        state = new_state

  (Down)
SFFF
FHFH
FFFH
HFFG
Wygrana
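Three rendered episodes give only an impression of the policy. A success rate over many greedy episodes is a steadier measure. The sketch below shows one way to compute it. To stay runnable without gym it uses `CorridorEnv`, a hypothetical stand-in that copies only the `reset()` and four-value `step()` interface used in this notebook. The function `success_rate` and the toy Q-table are likewise assumptions for illustration.

```python
import numpy as np


class CorridorEnv:
    """Hypothetical stand-in environment: states 0..3, goal at state 3."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves one tile to the right, any other action stays put.
        if action == 1:
            self.state += 1
        done = self.state == 3
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}


def success_rate(env, q_table, episodes, max_steps):
    """Fraction of greedy episodes that end with a reward of 1."""
    wins = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = int(np.argmax(q_table[state]))
            state, reward, done, _ = env.step(action)
            if done:
                if reward == 1:
                    wins += 1
                break
    return wins / episodes


toy_q_table = np.array([[0.0, 1.0]] * 4)  # always prefers action 1
print(success_rate(CorridorEnv(), toy_q_table, episodes=10, max_steps=5))
```

Passing the FrozenLake `env` and the trained `q_table` instead should work the same way, provided the environment returns the same four values from `step()`.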


Closing the environment

In [12]:
env.close()