# Laboratory 6 report

***Author: Adam Dąbkowski***

The goal of the sixth laboratory is to implement the ***Q-learning*** algorithm. In addition, an agent that solves the ***Taxi*** problem is to be created.


## 0. Importing the required libraries

In [1]:
import gym
import numpy as np
import pandas as pd

## 1. Visualising the environment state

The environment used here contains four designated locations (***R***, ***G***, ***Y***, ***B***) where the passenger can board the taxi (***yellow rectangle***) or get off. The player receives a positive reward for successfully dropping the passenger off at the correct location, and negative rewards for failed pick-up or drop-off attempts and for every step in which no other reward was received.

In [2]:
env = gym.make('Taxi-v3')
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
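To make the size of the state space concrete, the sketch below shows how a Taxi state can be packed into a single integer. It assumes the encoding described in the Taxi-v3 documentation (taxi row, taxi column, passenger location, destination); the helper names `encode_state` and `decode_state` are illustrative and are not part of `gym`.

```python
# Sketch of the Taxi-v3 state encoding (assumed from the environment's documentation).
# Helper names are hypothetical; gym exposes its own encode/decode on the environment.
N_ROWS, N_COLS = 5, 5
N_PASSENGER_LOCATIONS = 5  # R, G, Y, B, or "inside the taxi"
N_DESTINATIONS = 4         # R, G, Y, B


def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer."""
    state = taxi_row
    state = state * N_COLS + taxi_col
    state = state * N_PASSENGER_LOCATIONS + passenger_location
    state = state * N_DESTINATIONS + destination
    return state


def decode_state(state):
    """Inverse of encode_state."""
    state, destination = divmod(state, N_DESTINATIONS)
    state, passenger_location = divmod(state, N_PASSENGER_LOCATIONS)
    taxi_row, taxi_col = divmod(state, N_COLS)
    return taxi_row, taxi_col, passenger_location, destination


n_states = N_ROWS * N_COLS * N_PASSENGER_LOCATIONS * N_DESTINATIONS
print(n_states)  # 500
```

With 500 states and 6 actions, a tabular agent for this environment needs a table of 500 x 6 values.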



## 2. Implementation of the ***Q-learning*** algorithm

The main task of the sixth laboratory is to implement the ***Q-learning*** algorithm and, in addition, to create an agent that solves the ***Taxi*** problem. For this purpose the ***QlearningAgent*** class was created. When an object of this class is constructed, the following parameters can be passed: ***env*** (*the environment used*), ***beta*** (*learning rate*), ***gamma*** (*discount factor*) and ***epsilon*** (*the exploration probability $\epsilon$*).

The ***QlearningAgent*** class also contains four methods:
- ***get_parameters()*** - returns the values of the ***beta***, ***gamma*** and ***epsilon*** parameters
- ***exploration()*** - implements the exploration strategy (here the ***$\epsilon$-greedy strategy***)
- ***learn()*** - performs learning according to the ***Q-learning*** algorithm
- ***evaluate()*** - evaluates the agent at a given stage of learning
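For reference, the update applied after every environment step can be written as below. The symbols $s$, $a$, $r$ and $s'$ (current state, chosen action, received reward, next state) are notation introduced here for illustration; $\beta$ and $\gamma$ are the class parameters ***beta*** and ***gamma***.

$$
Q(s, a) \leftarrow Q(s, a) + \beta \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$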

In [3]:
class QlearningAgent:
    def __init__(self, env, beta=0.05, gamma=0.9, epsilon=0.01):
        self.env = env
        self.beta = beta
        self.gamma = gamma
        self.epsilon = epsilon
        self.Q = np.zeros([env.observation_space.n, env.action_space.n])

    def get_parameters(self):
        return {
            "beta": self.beta,
            "gamma": self.gamma,
            "epsilon": self.epsilon
        }

    def exploration(self, state):
        if np.random.rand() < self.epsilon:
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.Q[state])
        return action

    def learn(self, n_episodes=10000, n_eval_episodes=20, eval_period=2000, deep_printing=False):
        average_reward = None  # remains None only when n_episodes == 0
        for i in range(n_episodes):
            state = self.env.reset()
            done = False
            while not done:
                # Choose an action with the epsilon-greedy strategy.
                action = self.exploration(state)
                new_state, reward, done, _ = self.env.step(action)
                # Q-learning update: move Q[state, action] towards the TD target.
                self.Q[state, action] += self.beta * (reward + self.gamma * np.max(self.Q[new_state, :]) - self.Q[state, action])
                state = new_state

            # Evaluate every eval_period episodes and after the final episode.
            if (i+1) % eval_period == 0 or (i+1) == n_episodes:
                average_reward = self.evaluate(n_eval_episodes, deep_printing)
                print(f'After {i+1}/{n_episodes} learning episodes - average reward: {average_reward}')
                if deep_printing:
                    print(" ")

        return average_reward


    def evaluate(self, n_eval_episodes, printing=False):
        all_rewards = []
        for i in range(n_eval_episodes):
            episode_reward = 0
            state = self.env.reset()
            done = False
            while not done:
                # Evaluation reuses the epsilon-greedy policy, so random actions still occur when epsilon > 0.
                action = self.exploration(state)
                state, reward, done, _ = self.env.step(action)
                episode_reward += reward

            all_rewards.append(episode_reward)

            if printing:
                print(f'Episode {i} reward: {episode_reward}')

        return np.mean(all_rewards)
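The same update loop can be tried out without `gym` on a toy problem. The sketch below is not the agent above: it is a hypothetical five-cell corridor in which the agent starts in the leftmost cell and has to reach the rightmost one. The function names `train_corridor` and `greedy_policy`, the corridor itself and its reward values (-1 per step, +20 at the goal) are assumptions made only for illustration.

```python
import random


def train_corridor(n_states=5, n_episodes=2000, beta=0.1, gamma=0.9,
                   epsilon=0.1, max_steps=50, seed=0):
    """Tabular Q-learning on a toy corridor; returns the Q-table as nested lists."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right

    for _ in range(n_episodes):
        state = 0
        for _ in range(max_steps):
            # Epsilon-greedy action selection (ties go to the lower action index).
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max(range(2), key=lambda a: q[state][a])

            # Deterministic transition; moving left in cell 0 keeps the agent in place.
            new_state = max(state - 1, 0) if action == 0 else state + 1
            done = new_state == goal
            reward = 20.0 if done else -1.0

            # Q-learning update towards the TD target.
            td_target = reward + gamma * max(q[new_state])
            q[state][action] += beta * (td_target - q[state][action])

            state = new_state
            if done:
                break
    return q


def greedy_policy(q):
    """Greedy action for every non-terminal cell."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q[:-1]]


q_table = train_corridor()
print(greedy_policy(q_table))
```

After training, the greedy policy should point to the right in every non-terminal cell, which is the shortest route to the goal.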

To make it easy to present and analyse the results of the algorithm for the individual cases, a simple ***Results*** class was implemented.

In [4]:
class Results:
    def __init__(self):
        self.results = pd.DataFrame(columns=["Learning episodes", "beta", "gamma", "epsilon", "Average reward"])

    def update_results(self, n_episodes, beta, gamma, epsilon, average_reward):
        self.results.loc[len(self.results)] = [n_episodes, beta, gamma, epsilon, average_reward]

    def delete_row(self, index):
        self.results.drop([index], axis=0, inplace=True)

    def sort_results(self, column_name):
        self.results = self.results.sort_values(by=[column_name])

    def __repr__(self):
        return self.results.to_string()

## 3. Applying the algorithm

In [5]:
n_episodes = 20000
n_eval_episodes = 500
eval_period = 2000
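The experiments that follow change one parameter at a time while the others stay fixed. As a possible alternative to repeating the cells by hand, such a sweep could be wrapped in a small helper. The sketch below is hypothetical: `run_sweep` and `fake_train` are names invented here, and `fake_train` only stands in for a real training call so that the example runs on its own.

```python
def run_sweep(train_fn, swept_name, swept_values, base_params):
    """Call train_fn once per swept value and collect one result row per run."""
    rows = []
    for value in swept_values:
        params = dict(base_params)
        params[swept_name] = value
        row = dict(params)
        row["average_reward"] = train_fn(**params)
        rows.append(row)
    return rows


def fake_train(beta, gamma, epsilon):
    """Stand-in for real training; returns a dummy score."""
    return -beta


base = {"beta": 0.03, "gamma": 0.9, "epsilon": 0.01}
for row in run_sweep(fake_train, "beta", [0.03, 0.05, 0.1], base):
    print(row)
```

In the notebook, `train_fn` could be a function that builds a `QlearningAgent` with the given parameters and returns the value of its `learn()` call.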

#### 3.1 Effect of the coefficient $\beta$

In [6]:
results_beta = Results()

In [7]:
beta = 0.03
gamma = 0.9
epsilon = 0.01

In [8]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [9]:
agent.get_parameters()

{'beta': 0.03, 'gamma': 0.9, 'epsilon': 0.01}

In [10]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -190.958
After 4000/20000 learning episodes - average reward: -115.576
After 6000/20000 learning episodes - average reward: -59.666
After 8000/20000 learning episodes - average reward: -1.1
After 10000/20000 learning episodes - average reward: 5.338
After 12000/20000 learning episodes - average reward: 5.37
After 14000/20000 learning episodes - average reward: 7.412
After 16000/20000 learning episodes - average reward: 7.624
After 18000/20000 learning episodes - average reward: 7.384
After 20000/20000 learning episodes - average reward: 7.542


In [11]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [12]:
beta = 0.05
gamma = 0.9
epsilon = 0.01

In [13]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [14]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -150.468
After 4000/20000 learning episodes - average reward: -40.392
After 6000/20000 learning episodes - average reward: -3.896
After 8000/20000 learning episodes - average reward: 6.458
After 10000/20000 learning episodes - average reward: 7.398
After 12000/20000 learning episodes - average reward: 7.308
After 14000/20000 learning episodes - average reward: 7.338
After 16000/20000 learning episodes - average reward: 7.376
After 18000/20000 learning episodes - average reward: 7.3
After 20000/20000 learning episodes - average reward: 7.61


In [15]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [16]:
beta = 0.1
gamma = 0.9
epsilon = 0.01

In [17]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [18]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -27.872
After 4000/20000 learning episodes - average reward: 6.828
After 6000/20000 learning episodes - average reward: 7.408
After 8000/20000 learning episodes - average reward: 7.574
After 10000/20000 learning episodes - average reward: 7.254
After 12000/20000 learning episodes - average reward: 7.514
After 14000/20000 learning episodes - average reward: 7.228
After 16000/20000 learning episodes - average reward: 7.586
After 18000/20000 learning episodes - average reward: 7.6
After 20000/20000 learning episodes - average reward: 7.374


In [19]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [20]:
beta = 0.001
gamma = 0.9
epsilon = 0.01

In [21]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [22]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -402.752
After 4000/20000 learning episodes - average reward: -363.962
After 6000/20000 learning episodes - average reward: -405.398
After 8000/20000 learning episodes - average reward: -232.076
After 10000/20000 learning episodes - average reward: -342.938
After 12000/20000 learning episodes - average reward: -346.16
After 14000/20000 learning episodes - average reward: -253.522
After 16000/20000 learning episodes - average reward: -335.156
After 18000/20000 learning episodes - average reward: -313.778
After 20000/20000 learning episodes - average reward: -404.234


In [23]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [24]:
results_beta.sort_results("beta")

In [25]:
results_beta.results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 3 | 20000.0 | 0.001 | 0.9 | 0.01 | -404.234 |
| 0 | 20000.0 | 0.03 | 0.9 | 0.01 | 7.542 |
| 1 | 20000.0 | 0.05 | 0.9 | 0.01 | 7.61 |
| 2 | 20000.0 | 0.1 | 0.9 | 0.01 | 7.374 |
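One possible way to read the poor result for $\beta = 0.001$: if the update target for a single Q-table entry were fixed, then after $n$ updates the entry would have covered a fraction $1 - (1 - \beta)^n$ of the distance to that target. In real Q-learning the target moves, so this is only a rough guide. The function name `fraction_of_gap_closed` is introduced here for illustration.

```python
def fraction_of_gap_closed(beta, n_updates):
    """Fraction of the distance to a fixed target covered after n_updates updates."""
    return 1.0 - (1.0 - beta) ** n_updates


for beta in (0.001, 0.03, 0.05, 0.1):
    print(beta, round(fraction_of_gap_closed(beta, 100), 3))
```

Under this simplification, an entry updated 100 times with $\beta = 0.001$ has covered only about a tenth of the distance, whereas with $\beta = 0.03$ it has covered most of it.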


#### 3.2 Effect of the coefficient $\gamma$

In [26]:
results_gamma = Results()

In [27]:
beta = 0.03
gamma = 0.95
epsilon = 0.01

In [28]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [29]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -296.59
After 4000/20000 learning episodes - average reward: -99.99
After 6000/20000 learning episodes - average reward: -31.842
After 8000/20000 learning episodes - average reward: 1.582
After 10000/20000 learning episodes - average reward: 6.444
After 12000/20000 learning episodes - average reward: 7.428
After 14000/20000 learning episodes - average reward: 7.328
After 16000/20000 learning episodes - average reward: 7.146
After 18000/20000 learning episodes - average reward: 7.244
After 20000/20000 learning episodes - average reward: 7.55


In [30]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [31]:
beta = 0.03
gamma = 0.99
epsilon = 0.01

In [32]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [33]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -215.382
After 4000/20000 learning episodes - average reward: -93.334
After 6000/20000 learning episodes - average reward: -31.656
After 8000/20000 learning episodes - average reward: -4.966
After 10000/20000 learning episodes - average reward: 7.396
After 12000/20000 learning episodes - average reward: 7.488
After 14000/20000 learning episodes - average reward: 7.258
After 16000/20000 learning episodes - average reward: 7.078
After 18000/20000 learning episodes - average reward: 7.402
After 20000/20000 learning episodes - average reward: 7.504


In [34]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [35]:
beta = 0.03
gamma = 0.999
epsilon = 0.01

In [36]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [37]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -255.778
After 4000/20000 learning episodes - average reward: -102.792
After 6000/20000 learning episodes - average reward: -14.124
After 8000/20000 learning episodes - average reward: -0.282
After 10000/20000 learning episodes - average reward: 7.124
After 12000/20000 learning episodes - average reward: 7.456
After 14000/20000 learning episodes - average reward: 7.288
After 16000/20000 learning episodes - average reward: 7.364
After 18000/20000 learning episodes - average reward: 7.392
After 20000/20000 learning episodes - average reward: 7.698


In [38]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [39]:
beta = 0.03
gamma = 0.8
epsilon = 0.01

In [40]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [41]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -213.438
After 4000/20000 learning episodes - average reward: -126.088
After 6000/20000 learning episodes - average reward: -64.824
After 8000/20000 learning episodes - average reward: -19.336
After 10000/20000 learning episodes - average reward: -9.42
After 12000/20000 learning episodes - average reward: -7.414
After 14000/20000 learning episodes - average reward: -3.496
After 16000/20000 learning episodes - average reward: 1.474
After 18000/20000 learning episodes - average reward: -2.726
After 20000/20000 learning episodes - average reward: 0.826


In [42]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [43]:
beta = 0.03
gamma = 0.6
epsilon = 0.01

In [44]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [45]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -201.26
After 4000/20000 learning episodes - average reward: -149.1
After 6000/20000 learning episodes - average reward: -117.466
After 8000/20000 learning episodes - average reward: -114.804
After 10000/20000 learning episodes - average reward: -83.39
After 12000/20000 learning episodes - average reward: -67.466
After 14000/20000 learning episodes - average reward: -64.258
After 16000/20000 learning episodes - average reward: -62.378
After 18000/20000 learning episodes - average reward: -39.184
After 20000/20000 learning episodes - average reward: -39.812


In [46]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [47]:
results_gamma.sort_results("gamma")

In [48]:
results_gamma.results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 4 | 20000.0 | 0.03 | 0.6 | 0.01 | -39.812 |
| 3 | 20000.0 | 0.03 | 0.8 | 0.01 | 0.826 |
| 0 | 20000.0 | 0.03 | 0.95 | 0.01 | 7.55 |
| 1 | 20000.0 | 0.03 | 0.99 | 0.01 | 7.504 |
| 2 | 20000.0 | 0.03 | 0.999 | 0.01 | 7.698 |


#### 3.3 Effect of the value of the parameter $\epsilon$

In [49]:
results_epsilon = Results()

In [50]:
beta = 0.03
gamma = 0.9
epsilon = 0.05

In [51]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [52]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -206.832
After 4000/20000 learning episodes - average reward: -80.504
After 6000/20000 learning episodes - average reward: -17.9
After 8000/20000 learning episodes - average reward: -3.29
After 10000/20000 learning episodes - average reward: 4.682
After 12000/20000 learning episodes - average reward: 4.712
After 14000/20000 learning episodes - average reward: 4.74
After 16000/20000 learning episodes - average reward: 5.482
After 18000/20000 learning episodes - average reward: 5.206
After 20000/20000 learning episodes - average reward: 5.16


In [53]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [54]:
beta = 0.03
gamma = 0.9
epsilon = 0.1

In [55]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [56]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -194.29
After 4000/20000 learning episodes - average reward: -52.172
After 6000/20000 learning episodes - average reward: -13.088
After 8000/20000 learning episodes - average reward: -4.35
After 10000/20000 learning episodes - average reward: 1.19
After 12000/20000 learning episodes - average reward: 2.648
After 14000/20000 learning episodes - average reward: 2.076
After 16000/20000 learning episodes - average reward: 2.144
After 18000/20000 learning episodes - average reward: 2.898
After 20000/20000 learning episodes - average reward: 2.118


In [57]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [58]:
beta = 0.03
gamma = 0.9
epsilon = 0.2

In [59]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [60]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -228.964
After 4000/20000 learning episodes - average reward: -65.712
After 6000/20000 learning episodes - average reward: -19.906
After 8000/20000 learning episodes - average reward: -9.816
After 10000/20000 learning episodes - average reward: -5.226
After 12000/20000 learning episodes - average reward: -6.08
After 14000/20000 learning episodes - average reward: -3.872
After 16000/20000 learning episodes - average reward: -4.592
After 18000/20000 learning episodes - average reward: -4.334
After 20000/20000 learning episodes - average reward: -4.742


In [61]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [62]:
beta = 0.03
gamma = 0.9
epsilon = 0.5

In [63]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [64]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -319.056
After 4000/20000 learning episodes - average reward: -93.28
After 6000/20000 learning episodes - average reward: -57.288
After 8000/20000 learning episodes - average reward: -48.666
After 10000/20000 learning episodes - average reward: -48.088
After 12000/20000 learning episodes - average reward: -45.686
After 14000/20000 learning episodes - average reward: -52.022
After 16000/20000 learning episodes - average reward: -48.556
After 18000/20000 learning episodes - average reward: -48.978
After 20000/20000 learning episodes - average reward: -49.836


In [65]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [66]:
beta = 0.03
gamma = 0.9
epsilon = 1

In [67]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [68]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -765.588
After 4000/20000 learning episodes - average reward: -769.708
After 6000/20000 learning episodes - average reward: -766.946
After 8000/20000 learning episodes - average reward: -773.808
After 10000/20000 learning episodes - average reward: -760.854
After 12000/20000 learning episodes - average reward: -774.11
After 14000/20000 learning episodes - average reward: -768.034
After 16000/20000 learning episodes - average reward: -769.906
After 18000/20000 learning episodes - average reward: -771.422
After 20000/20000 learning episodes - average reward: -763.266


In [69]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [70]:
beta = 0.03
gamma = 0.9
epsilon = 0.005

In [71]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [72]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -231.524
After 4000/20000 learning episodes - average reward: -87.856
After 6000/20000 learning episodes - average reward: -42.346
After 8000/20000 learning episodes - average reward: -0.466
After 10000/20000 learning episodes - average reward: 3.612
After 12000/20000 learning episodes - average reward: 7.676
After 14000/20000 learning episodes - average reward: 7.22
After 16000/20000 learning episodes - average reward: 7.56
After 18000/20000 learning episodes - average reward: 7.71
After 20000/20000 learning episodes - average reward: 7.634


In [73]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [74]:
beta = 0.03
gamma = 0.9
epsilon = 0.001

In [75]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [76]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -258.802
After 4000/20000 learning episodes - average reward: -153.74
After 6000/20000 learning episodes - average reward: -23.084
After 8000/20000 learning episodes - average reward: -9.842
After 10000/20000 learning episodes - average reward: 6.256
After 12000/20000 learning episodes - average reward: 4.066
After 14000/20000 learning episodes - average reward: 8.08
After 16000/20000 learning episodes - average reward: 7.822
After 18000/20000 learning episodes - average reward: 7.862
After 20000/20000 learning episodes - average reward: 7.786


In [77]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [78]:
beta = 0.03
gamma = 0.9
epsilon = 0.0001

In [79]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [80]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -235.302
After 4000/20000 learning episodes - average reward: -114.92
After 6000/20000 learning episodes - average reward: -34.098
After 8000/20000 learning episodes - average reward: -2.76
After 10000/20000 learning episodes - average reward: 4.676
After 12000/20000 learning episodes - average reward: 7.562
After 14000/20000 learning episodes - average reward: 7.792
After 16000/20000 learning episodes - average reward: 7.926
After 18000/20000 learning episodes - average reward: 7.888
After 20000/20000 learning episodes - average reward: 7.93


In [81]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [82]:
beta = 0.03
gamma = 0.9
epsilon = 0

In [83]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [84]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -274.594
After 4000/20000 learning episodes - average reward: -76.946
After 6000/20000 learning episodes - average reward: -51.86
After 8000/20000 learning episodes - average reward: -1.12
After 10000/20000 learning episodes - average reward: 1.736
After 12000/20000 learning episodes - average reward: 4.974
After 14000/20000 learning episodes - average reward: 8.018
After 16000/20000 learning episodes - average reward: 8.068
After 18000/20000 learning episodes - average reward: 8.004
After 20000/20000 learning episodes - average reward: 8.032


In [85]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [86]:
results_epsilon.sort_results("epsilon")

In [87]:
results_epsilon.results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 8 | 20000.0 | 0.03 | 0.9 | 0.0 | 8.032 |
| 7 | 20000.0 | 0.03 | 0.9 | 0.0001 | 7.93 |
| 6 | 20000.0 | 0.03 | 0.9 | 0.001 | 7.786 |
| 5 | 20000.0 | 0.03 | 0.9 | 0.005 | 7.634 |
| 0 | 20000.0 | 0.03 | 0.9 | 0.05 | 5.16 |
| 1 | 20000.0 | 0.03 | 0.9 | 0.1 | 2.118 |
| 2 | 20000.0 | 0.03 | 0.9 | 0.2 | -4.742 |
| 3 | 20000.0 | 0.03 | 0.9 | 0.5 | -49.836 |
| 4 | 20000.0 | 0.03 | 0.9 | 1.0 | -763.266 |
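Because ***evaluate()*** selects actions through ***exploration()***, the averages above are measured with the $\epsilon$-greedy policy rather than a purely greedy one. The sketch below estimates how often such a policy deviates from the greedy action; with 6 actions (as in Taxi-v3) the expected rate is $\epsilon \cdot 5/6$. The function name `non_greedy_rate` and the fixed stand-in greedy action are assumptions made for illustration.

```python
import random


def non_greedy_rate(epsilon, n_actions=6, n_draws=100_000, seed=0):
    """Empirical share of epsilon-greedy decisions that differ from the greedy action."""
    rng = random.Random(seed)
    greedy_action = 0  # arbitrary stand-in for argmax over a Q-table row
    deviations = 0
    for _ in range(n_draws):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # uniform random action
        else:
            action = greedy_action
        if action != greedy_action:
            deviations += 1
    return deviations / n_draws


for epsilon in (0.0, 0.01, 0.1, 0.5, 1.0):
    print(epsilon, non_greedy_rate(epsilon))
```

With $\epsilon = 1$ every action is random, which is consistent with the flat, strongly negative rewards in that run. With $\epsilon = 0$ no random actions are taken at all.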


## 4. Summary

In [88]:
results = pd.concat([results_beta.results, results_gamma.results, results_epsilon.results])

In [89]:
results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 3 | 20000.0 | 0.001 | 0.9 | 0.01 | -404.234 |
| 0 | 20000.0 | 0.03 | 0.9 | 0.01 | 7.542 |
| 1 | 20000.0 | 0.05 | 0.9 | 0.01 | 7.61 |
| 2 | 20000.0 | 0.1 | 0.9 | 0.01 | 7.374 |
| 4 | 20000.0 | 0.03 | 0.6 | 0.01 | -39.812 |
| 3 | 20000.0 | 0.03 | 0.8 | 0.01 | 0.826 |
| 0 | 20000.0 | 0.03 | 0.95 | 0.01 | 7.55 |
| 1 | 20000.0 | 0.03 | 0.99 | 0.01 | 7.504 |
| 2 | 20000.0 | 0.03 | 0.999 | 0.01 | 7.698 |
| 8 | 20000.0 | 0.03 | 0.9 | 0.0 | 8.032 |



## 3. Applying the algorithm

In [13]:
n_episodes = 20000
n_eval_episodes = 500
eval_period = 2000

#### 3.1 Effect of the coefficient $\beta$

In [14]:
results_beta = Results()

In [15]:
beta = 0.03
gamma = 0.9
epsilon = 0.01

In [16]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [17]:
agent.get_parameters()

{'beta': 0.03, 'gamma': 0.9, 'epsilon': 0.01}

In [330]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -257.094
After 4000/20000 learning episodes - average reward: -141.25
After 6000/20000 learning episodes - average reward: -29.326
After 8000/20000 learning episodes - average reward: -32.986
After 10000/20000 learning episodes - average reward: -4.412
After 12000/20000 learning episodes - average reward: 6.962
After 14000/20000 learning episodes - average reward: 5.366
After 16000/20000 learning episodes - average reward: 7.588
After 18000/20000 learning episodes - average reward: 7.402
After 20000/20000 learning episodes - average reward: 7.212


In [331]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [332]:
beta = 0.05
gamma = 0.9
epsilon = 0.01

In [333]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [334]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -162.898
After 4000/20000 learning episodes - average reward: -39.872
After 6000/20000 learning episodes - average reward: 0.3
After 8000/20000 learning episodes - average reward: 7.518
After 10000/20000 learning episodes - average reward: 7.228
After 12000/20000 learning episodes - average reward: 7.536
After 14000/20000 learning episodes - average reward: 7.278
After 16000/20000 learning episodes - average reward: 7.116
After 18000/20000 learning episodes - average reward: 7.32
After 20000/20000 learning episodes - average reward: 7.342


In [335]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [336]:
beta = 0.1
gamma = 0.9
epsilon = 0.01

In [337]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [338]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -24.792
After 4000/20000 learning episodes - average reward: 6.02
After 6000/20000 learning episodes - average reward: 7.648
After 8000/20000 learning episodes - average reward: 7.686
After 10000/20000 learning episodes - average reward: 7.598
After 12000/20000 learning episodes - average reward: 7.52
After 14000/20000 learning episodes - average reward: 6.792
After 16000/20000 learning episodes - average reward: 7.452
After 18000/20000 learning episodes - average reward: 7.482
After 20000/20000 learning episodes - average reward: 7.112


In [339]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [340]:
beta = 0.001
gamma = 0.9
epsilon = 0.01

In [341]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [342]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -358.934
After 4000/20000 learning episodes - average reward: -249.356
After 6000/20000 learning episodes - average reward: -274.808
After 8000/20000 learning episodes - average reward: -354.836
After 10000/20000 learning episodes - average reward: -360.918
After 12000/20000 learning episodes - average reward: -271.586
After 14000/20000 learning episodes - average reward: -273.998
After 16000/20000 learning episodes - average reward: -288.786
After 18000/20000 learning episodes - average reward: -290.2
After 20000/20000 learning episodes - average reward: -247.772


In [343]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [344]:
results_beta.sort_results("beta")

In [345]:
results_beta.results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 3 | 20000.0 | 0.001 | 0.9 | 0.01 | -247.772 |
| 0 | 20000.0 | 0.03 | 0.9 | 0.01 | 7.212 |
| 1 | 20000.0 | 0.05 | 0.9 | 0.01 | 7.342 |
| 2 | 20000.0 | 0.1 | 0.9 | 0.01 | 7.112 |


#### 3.2 Effect of the coefficient $\gamma$

In [346]:
results_gamma = Results()

In [347]:
beta = 0.03
gamma = 0.95
epsilon = 0.01

In [348]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [349]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -217.816
After 4000/20000 learning episodes - average reward: -144.682
After 6000/20000 learning episodes - average reward: -17.73
After 8000/20000 learning episodes - average reward: 4.996
After 10000/20000 learning episodes - average reward: -5.194
After 12000/20000 learning episodes - average reward: 7.394
After 14000/20000 learning episodes - average reward: 7.442
After 16000/20000 learning episodes - average reward: 7.236
After 18000/20000 learning episodes - average reward: 7.538
After 20000/20000 learning episodes - average reward: 7.312


In [350]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [351]:
beta = 0.03
gamma = 0.99
epsilon = 0.01

In [352]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [353]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -318.338
After 4000/20000 learning episodes - average reward: -95.568
After 6000/20000 learning episodes - average reward: -14.472
After 8000/20000 learning episodes - average reward: 5.588
After 10000/20000 learning episodes - average reward: 7.186
After 12000/20000 learning episodes - average reward: 7.36
After 14000/20000 learning episodes - average reward: 7.454
After 16000/20000 learning episodes - average reward: 7.494
After 18000/20000 learning episodes - average reward: 7.426
After 20000/20000 learning episodes - average reward: 7.35


In [354]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [355]:
beta = 0.03
gamma = 0.999
epsilon = 0.01

In [356]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [357]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -330.794
After 4000/20000 learning episodes - average reward: -90.86
After 6000/20000 learning episodes - average reward: -11.456
After 8000/20000 learning episodes - average reward: 2.686
After 10000/20000 learning episodes - average reward: 7.37
After 12000/20000 learning episodes - average reward: 7.644
After 14000/20000 learning episodes - average reward: 7.38
After 16000/20000 learning episodes - average reward: 7.044
After 18000/20000 learning episodes - average reward: 7.45
After 20000/20000 learning episodes - average reward: 7.582


In [358]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [359]:
beta = 0.03
gamma = 0.8
epsilon = 0.01

In [360]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [361]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -216.558
After 4000/20000 learning episodes - average reward: -107.98
After 6000/20000 learning episodes - average reward: -92.628
After 8000/20000 learning episodes - average reward: -25.068
After 10000/20000 learning episodes - average reward: -12.496
After 12000/20000 learning episodes - average reward: -5.676
After 14000/20000 learning episodes - average reward: -6.916
After 16000/20000 learning episodes - average reward: 3.128
After 18000/20000 learning episodes - average reward: 2.07
After 20000/20000 learning episodes - average reward: -22.116


In [362]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [363]:
beta = 0.03
gamma = 0.6
epsilon = 0.01

In [364]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [365]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -210.912
After 4000/20000 learning episodes - average reward: -142.57
After 6000/20000 learning episodes - average reward: -121.476
After 8000/20000 learning episodes - average reward: -101.846
After 10000/20000 learning episodes - average reward: -82.382
After 12000/20000 learning episodes - average reward: -66.676
After 14000/20000 learning episodes - average reward: -51.844
After 16000/20000 learning episodes - average reward: -60.442
After 18000/20000 learning episodes - average reward: -41.414
After 20000/20000 learning episodes - average reward: -31.144


In [366]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [367]:
results_gamma.sort_results("gamma")

In [368]:
results_gamma.results

Index,Learning episodes,beta,gamma,epsilon,Average reward
4,20000.0,0.03,0.6,0.01,-31.144
3,20000.0,0.03,0.8,0.01,-22.116
0,20000.0,0.03,0.95,0.01,7.312
1,20000.0,0.03,0.99,0.01,7.35
2,20000.0,0.03,0.999,0.01,7.582
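
In the table above, the agent reaches a positive average reward within 20000 episodes only for $\gamma \geq 0.95$. One possible reading, offered here as an assumption rather than a result of the experiments, is that a small $\gamma$ shrinks the discounted value of the final delivery reward (+20 in the standard ***Taxi-v3*** reward scheme) until it is hard to tell apart from the per-step penalty of -1. The sketch below computes the weight $\gamma^k$ of a reward received $k$ steps ahead and the rule-of-thumb effective horizon $1 / (1 - \gamma)$. The helper names and the distance of 12 steps are purely illustrative.

```python
def discount_weight(gamma, steps_ahead):
    """Weight gamma**k applied to a reward received `steps_ahead` steps in the future."""
    return gamma ** steps_ahead


def effective_horizon(gamma):
    """Rule-of-thumb planning horizon 1 / (1 - gamma); undefined for gamma == 1."""
    return 1.0 / (1.0 - gamma)


# Gamma values taken from the experiments above.
gammas = [0.6, 0.8, 0.95, 0.99, 0.999]

# Hypothetical number of steps between a state and the final delivery.
steps_to_delivery = 12

for gamma in gammas:
    weight = discount_weight(gamma, steps_to_delivery)
    print(
        f"gamma={gamma}: horizon ~ {effective_horizon(gamma):.1f} steps, "
        f"+20 seen from {steps_to_delivery} steps away is worth {20 * weight:.3f}"
    )
```

The smaller the weight, the smaller the differences between Q-values of good and bad actions far from the goal, which may slow learning down for a fixed number of episodes.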


#### 3.3 Investigating the effect of the $\epsilon$ parameter

In [370]:
results_epsilon = Results()
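
The cells in this subsection repeat the same steps (set parameters, create an agent, learn, store the result) for every value of $\epsilon$. As a sketch only, the repetition could be folded into a loop. The helper `sweep_parameter` and the stand-in `placeholder_run` are hypothetical names introduced for illustration; they are not part of the notebook, and the placeholder returns a dummy number, not a real result.

```python
def sweep_parameter(name, values, run_experiment, **fixed_params):
    """Run `run_experiment` once per value of `name`, keeping other parameters fixed.

    Returns a list of dict rows sorted by the swept parameter.
    """
    rows = []
    for value in values:
        params = {**fixed_params, name: value}
        reward = run_experiment(**params)
        rows.append({**params, "Average reward": reward})
    return sorted(rows, key=lambda row: row[name])


def placeholder_run(beta, gamma, epsilon):
    """Stand-in for a real training run; returns a dummy value, NOT a measured reward."""
    return 0.0


rows = sweep_parameter(
    "epsilon",
    [0.05, 0.1, 0.2, 0.5, 1, 0.005, 0.001, 0.0001, 0],
    placeholder_run,
    beta=0.03,
    gamma=0.9,
)
for row in rows:
    print(row)
```

In this notebook, `run_experiment` could wrap the creation of `QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)` and the call to `agent.learn(...)`, returning the final average reward.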

In [371]:
beta = 0.03
gamma = 0.9
epsilon = 0.05

In [372]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [373]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -210.868
After 4000/20000 learning episodes - average reward: -56.588
After 6000/20000 learning episodes - average reward: -9.576
After 8000/20000 learning episodes - average reward: -5.61
After 10000/20000 learning episodes - average reward: 3.264
After 12000/20000 learning episodes - average reward: 4.158
After 14000/20000 learning episodes - average reward: 5.588
After 16000/20000 learning episodes - average reward: 5.172
After 18000/20000 learning episodes - average reward: 4.694
After 20000/20000 learning episodes - average reward: 5.172


In [374]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [375]:
beta = 0.03
gamma = 0.9
epsilon = 0.1

In [376]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [377]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -209.75
After 4000/20000 learning episodes - average reward: -64.122
After 6000/20000 learning episodes - average reward: -9.168
After 8000/20000 learning episodes - average reward: -3.708
After 10000/20000 learning episodes - average reward: 2.108
After 12000/20000 learning episodes - average reward: 2.864
After 14000/20000 learning episodes - average reward: 2.338
After 16000/20000 learning episodes - average reward: 2.132
After 18000/20000 learning episodes - average reward: 2.316
After 20000/20000 learning episodes - average reward: 2.322


In [378]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [379]:
beta = 0.03
gamma = 0.9
epsilon = 0.2

In [380]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [381]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -236.492
After 4000/20000 learning episodes - average reward: -62.758
After 6000/20000 learning episodes - average reward: -22.046
After 8000/20000 learning episodes - average reward: -6.728
After 10000/20000 learning episodes - average reward: -4.58
After 12000/20000 learning episodes - average reward: -5.086
After 14000/20000 learning episodes - average reward: -5.22
After 16000/20000 learning episodes - average reward: -5.332
After 18000/20000 learning episodes - average reward: -4.496
After 20000/20000 learning episodes - average reward: -5.014


In [382]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [383]:
beta = 0.03
gamma = 0.9
epsilon = 0.5

In [384]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [385]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -323.826
After 4000/20000 learning episodes - average reward: -99.7
After 6000/20000 learning episodes - average reward: -56.884
After 8000/20000 learning episodes - average reward: -48.954
After 10000/20000 learning episodes - average reward: -51.102
After 12000/20000 learning episodes - average reward: -49.864
After 14000/20000 learning episodes - average reward: -49.004
After 16000/20000 learning episodes - average reward: -49.448
After 18000/20000 learning episodes - average reward: -45.522
After 20000/20000 learning episodes - average reward: -47.132


In [386]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [387]:
beta = 0.03
gamma = 0.9
epsilon = 1

In [388]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [389]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -772.506
After 4000/20000 learning episodes - average reward: -771.48
After 6000/20000 learning episodes - average reward: -768.062
After 8000/20000 learning episodes - average reward: -768.826
After 10000/20000 learning episodes - average reward: -772.874
After 12000/20000 learning episodes - average reward: -772.304
After 14000/20000 learning episodes - average reward: -764.03
After 16000/20000 learning episodes - average reward: -780.152
After 18000/20000 learning episodes - average reward: -767.658
After 20000/20000 learning episodes - average reward: -771.484


In [390]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [391]:
beta = 0.03
gamma = 0.9
epsilon = 0.005

In [392]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [393]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -242.444
After 4000/20000 learning episodes - average reward: -108.55
After 6000/20000 learning episodes - average reward: -32.988
After 8000/20000 learning episodes - average reward: -13.33
After 10000/20000 learning episodes - average reward: 6.1
After 12000/20000 learning episodes - average reward: 7.902
After 14000/20000 learning episodes - average reward: 7.916
After 16000/20000 learning episodes - average reward: 7.63
After 18000/20000 learning episodes - average reward: 7.488
After 20000/20000 learning episodes - average reward: 7.332


In [394]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [395]:
beta = 0.03
gamma = 0.9
epsilon = 0.001

In [396]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [397]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -270.278
After 4000/20000 learning episodes - average reward: -139.106
After 6000/20000 learning episodes - average reward: -25.786
After 8000/20000 learning episodes - average reward: -13.312
After 10000/20000 learning episodes - average reward: 4.2
After 12000/20000 learning episodes - average reward: 6.018
After 14000/20000 learning episodes - average reward: 7.47
After 16000/20000 learning episodes - average reward: 7.954
After 18000/20000 learning episodes - average reward: 7.866
After 20000/20000 learning episodes - average reward: 7.722


In [398]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [399]:
beta = 0.03
gamma = 0.9
epsilon = 0.0001

In [400]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [401]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -315.662
After 4000/20000 learning episodes - average reward: -108.048
After 6000/20000 learning episodes - average reward: -53.186
After 8000/20000 learning episodes - average reward: -20.012
After 10000/20000 learning episodes - average reward: 4.2
After 12000/20000 learning episodes - average reward: 5.418
After 14000/20000 learning episodes - average reward: 7.194
After 16000/20000 learning episodes - average reward: 8.094
After 18000/20000 learning episodes - average reward: 7.916
After 20000/20000 learning episodes - average reward: 7.896


In [402]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [403]:
beta = 0.03
gamma = 0.9
epsilon = 0

In [404]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [405]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -297.592
After 4000/20000 learning episodes - average reward: -161.314
After 6000/20000 learning episodes - average reward: -34.672
After 8000/20000 learning episodes - average reward: -2.57
After 10000/20000 learning episodes - average reward: -11.752
After 12000/20000 learning episodes - average reward: 6.658
After 14000/20000 learning episodes - average reward: 7.866
After 16000/20000 learning episodes - average reward: 7.894
After 18000/20000 learning episodes - average reward: 8.134
After 20000/20000 learning episodes - average reward: 7.814


In [406]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [407]:
results_epsilon.sort_results("epsilon")

In [408]:
results_epsilon.results

Index,Learning episodes,beta,gamma,epsilon,Average reward
8,20000.0,0.03,0.9,0.0,7.814
7,20000.0,0.03,0.9,0.0001,7.896
6,20000.0,0.03,0.9,0.001,7.722
5,20000.0,0.03,0.9,0.005,7.332
0,20000.0,0.03,0.9,0.05,5.172
1,20000.0,0.03,0.9,0.1,2.322
2,20000.0,0.03,0.9,0.2,-5.014
3,20000.0,0.03,0.9,0.5,-47.132
4,20000.0,0.03,0.9,1.0,-771.484
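
The results above show the average reward falling as $\epsilon$ grows, down to about -771 for $\epsilon = 1$. A minimal sketch of an $\epsilon$-greedy rule may help to see why: with probability $\epsilon$ the action is drawn uniformly at random, otherwise the action with the highest Q-value is taken. This is a generic illustration, not the `exploration()` method of `QlearningAgent`. The function names and the example Q-values are hypothetical. If the evaluation episodes also follow this rule (an assumption, since the evaluation code is not shown in this section), then $\epsilon = 1$ corresponds to fully random behaviour.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])


def non_greedy_share(q_values, epsilon, n_draws=10_000, seed=0):
    """Estimate how often the selected action differs from the greedy action."""
    rng = random.Random(seed)
    greedy = epsilon_greedy(q_values, 0.0, rng)
    misses = sum(epsilon_greedy(q_values, epsilon, rng) != greedy for _ in range(n_draws))
    return misses / n_draws


# Hypothetical Q-values for one state; Taxi-v3 has 6 discrete actions.
example_q = [0.2, 1.5, -0.3, 0.0, -10.0, -10.0]

# Epsilon values taken from the experiments above.
for eps in [0, 0.0001, 0.001, 0.005, 0.05, 0.1, 0.2, 0.5, 1]:
    print(f"epsilon={eps}: share of non-greedy actions ~ {non_greedy_share(example_q, eps):.4f}")
```

Each non-greedy action costs at least the per-step penalty, and a random pickup or drop-off attempt usually fails, so the share of such actions feeds directly into the average reward.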


## 4. Summary

In [410]:
results = pd.concat([results_beta.results, results_gamma.results, results_epsilon.results])

In [411]:
results

Index,Learning episodes,beta,gamma,epsilon,Average reward
3,20000.0,0.001,0.9,0.01,-247.772
0,20000.0,0.03,0.9,0.01,7.212
1,20000.0,0.05,0.9,0.01,7.342
2,20000.0,0.1,0.9,0.01,7.112
4,20000.0,0.03,0.6,0.01,-31.144
3,20000.0,0.03,0.8,0.01,-22.116
0,20000.0,0.03,0.95,0.01,7.312
1,20000.0,0.03,0.99,0.01,7.35
2,20000.0,0.03,0.999,0.01,7.582
8,20000.0,0.03,0.9,0.0,7.814
