# Laboratory 6 report

***Author: Adam Dąbkowski***

The goal of the sixth laboratory is to implement the ***Q-learning*** algorithm. In addition, an agent that solves the ***Taxi*** problem must be created.


## 0. Importing the required libraries

In [33]:
import gym
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 1. Visualizing the environment state

The environment we use contains four designated locations (***R***, ***G***, ***Y***, ***B***) where the passenger can board the taxi (***yellow rectangle***) or get out. The player receives a positive reward for successfully dropping the passenger off at the correct location, and negative rewards for failed pickup/drop-off attempts and for every step in which no other reward was received.
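
To give a sense of the size of the problem: Taxi-v3 is usually described as having 500 discrete states (25 taxi positions × 5 passenger locations × 4 destinations) and 6 actions. The sketch below does not depend on gym and shows one way such a state index can be packed and unpacked. The helper names `encode` and `decode` are illustrative, and the ordering of the components is an assumption modelled on the environment's documented encoding.

```python
def encode(taxi_row, taxi_col, passenger, destination):
    """Pack the four state components into a single index in [0, 500)."""
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger    # 0-3: R, G, Y, B; 4: inside the taxi
    index = index * 4 + destination  # 0-3: R, G, Y, B
    return index


def decode(index):
    """Inverse of encode: recover (taxi_row, taxi_col, passenger, destination)."""
    index, destination = divmod(index, 4)
    index, passenger = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger, destination


# Example values: taxi in row 1, column 2, passenger waiting at G, destination R.
state = encode(1, 2, 1, 0)
components = decode(state)
```

A single integer per state is what allows the Q-table to be a plain two-dimensional array indexed by `[state, action]`.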

In [34]:
env = gym.make('Taxi-v3')
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



## 2. Implementing the ***Q-learning*** algorithm

The main task of the sixth laboratory is to implement the ***Q-learning*** algorithm and to create an agent that solves the ***Taxi*** problem. To this end, the ***QlearningAgent*** class was created. When an object of this class is constructed, the parameters ***env*** (*the environment used*), ***beta*** (*the learning rate*), ***gamma*** (*the discount factor*), and ***epsilon*** (*the probability $\epsilon$ of taking a random action*) can be passed.
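
For reference, the standard tabular Q-learning update, written with the parameter names used here (learning rate $\beta$, discount factor $\gamma$), is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \beta \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

The parameter $\epsilon$ does not appear in the update itself; it only influences which action $a_t$ is selected.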



The ***QlearningAgent*** class also provides four methods:
- ***get_parameters()*** - returns the values of the parameters ***beta***, ***gamma***, and ***epsilon***
- ***exploration()*** - implements the exploration strategy (here, the ***$\epsilon$-greedy strategy***)
- ***learn()*** - trains the agent using the ***Q-learning*** algorithm
- ***evaluate()*** - evaluates the agent at a given stage of training
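
To make the roles of ***exploration()*** and ***learn()*** concrete, here is a minimal sketch of one $\epsilon$-greedy choice and one Q-learning step on a toy table. It uses only the standard library. The function names `epsilon_greedy` and `q_update` and the table values are illustrative assumptions, not part of the class described above.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # On ties the first maximal action is returned.
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(q, state, action, reward, new_state, beta, gamma):
    """Apply one Q-learning step in place and return the new Q(state, action)."""
    target = reward + gamma * max(q[new_state])
    q[state][action] += beta * (target - q[state][action])
    return q[state][action]


# Toy table: 2 states x 2 actions (values chosen only for illustration).
q = [[0.0, 0.0], [1.0, 2.0]]

# target = -1 + 0.9 * 2 = 0.8, so Q(0, 1) moves from 0 to 0.1 * 0.8 = 0.08
new_value = q_update(q, state=0, action=1, reward=-1.0, new_state=1, beta=0.1, gamma=0.9)

rng = random.Random(0)
greedy_action = epsilon_greedy(q[0], epsilon=0.0, rng=rng)  # epsilon = 0: always greedy
```

With `epsilon=0.0` the random branch can never be taken, so the call returns the action with the highest value in the row, here action 1.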

In [35]:
class QlearningAgent:
    """Tabular Q-learning agent with an epsilon-greedy exploration strategy."""

    def __init__(self, env, beta=0.03, gamma=0.9, epsilon=0.01):
        self.env = env
        self.beta = beta        # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # probability of taking a random action
        # Q-table: one row per state, one column per action, initialised to zero.
        self.Q = np.zeros([env.observation_space.n, env.action_space.n])

    def get_parameters(self):
        return {
            "beta": self.beta,
            "gamma": self.gamma,
            "epsilon": self.epsilon
        }

    def exploration(self, state):
        # Epsilon-greedy: random action with probability epsilon, greedy otherwise.
        if np.random.rand() < self.epsilon:
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.Q[state])
        return action

    def learn(self, n_episodes=10000, n_eval_episodes=20, eval_period=2000, deep_printing=False, plot_history=True):
        # plot_history is accepted for compatibility but is currently unused.
        average_reward = None  # stays None if n_episodes == 0
        for i in range(n_episodes):
            state = self.env.reset()
            done = False
            while not done:
                action = self.exploration(state)
                new_state, reward, done, _ = self.env.step(action)
                # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
                self.Q[state, action] += self.beta * (reward + self.gamma * np.max(self.Q[new_state, :]) - self.Q[state, action])
                state = new_state

            # Evaluate every eval_period episodes and after the final episode.
            if (i+1) % eval_period == 0 or (i+1) == n_episodes:
                average_reward = self.evaluate(n_eval_episodes, deep_printing)
                print(f'After {i+1}/{n_episodes} learning episodes - average reward: {average_reward}')
                if deep_printing:
                    print(" ")

        return average_reward


    def evaluate(self, n_eval_episodes, printing=False):
        # Run the greedy policy (no exploration, no learning) and average the episode rewards.
        all_rewards = []
        for i in range(n_eval_episodes):
            episode_reward = 0
            state = self.env.reset()
            done = False
            while not done:
                action = np.argmax(self.Q[state])
                state, reward, done, _ = self.env.step(action)
                episode_reward += reward

            all_rewards.append(episode_reward)

            if printing:
                print(f'Episode {i} reward: {episode_reward}')

        return np.mean(all_rewards)

To make it easy to present and analyze the results of the algorithm for individual cases, a simple ***Results*** class was implemented.

In [36]:
class Results:
    def __init__(self):
        self.results = pd.DataFrame(columns=["Learning episodes", "beta", "gamma", "epsilon", "Average reward"])

    def update_results(self, n_episodes, beta, gamma, epsilon, average_reward):
        self.results.loc[len(self.results)] = [n_episodes, beta, gamma, epsilon, average_reward]

    def delete_row(self, index):
        self.results.drop([index], axis=0, inplace=True)

    def sort_results(self, column_name):
        self.results = self.results.sort_values(by=[column_name])

    def __repr__(self):
        return self.results.to_string()

## 3. Applying the algorithm

In [37]:
n_episodes = 20000
n_eval_episodes = 500
eval_period = 2000
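
Every experiment in this section repeats the same steps: set the parameters, build an agent, train it, and record the final average reward. A loop along the following lines could express that procedure more compactly. The helper `run_sweep` and the placeholder `dummy_train` are hypothetical and not part of the notebook; `dummy_train` merely stands in for constructing a `QlearningAgent` and calling `learn()`, so that the sketch runs on its own.

```python
def run_sweep(train_fn, name, values, defaults):
    """Train once per value of one parameter, keeping the others at their defaults."""
    rows = []
    for value in values:
        params = dict(defaults)  # copy, so the defaults are never modified
        params[name] = value
        average_reward = train_fn(**params)
        rows.append({**params, "Average reward": average_reward})
    return rows


def dummy_train(beta, gamma, epsilon):
    # Placeholder only: in the notebook this would train an agent with these
    # parameters and return the value produced by learn().
    return round(beta + gamma + epsilon, 3)


defaults = {"beta": 0.03, "gamma": 0.9, "epsilon": 0.01}
rows = run_sweep(dummy_train, "beta", [0.001, 0.03, 0.05, 0.1], defaults)
```

Each returned row carries the full parameter set next to its result, mirroring the columns stored by the ***Results*** class.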

#### 3.1 Investigating the effect of the learning rate $\beta$

In [38]:
results_beta = Results()

In [39]:
beta = 0.03
gamma = 0.9
epsilon = 0.01

In [40]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [41]:
agent.get_parameters()

{'beta': 0.03, 'gamma': 0.9, 'epsilon': 0.01}

In [42]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -302.512
After 4000/20000 learning episodes - average reward: -123.432
After 6000/20000 learning episodes - average reward: -45.092
After 8000/20000 learning episodes - average reward: -30.12
After 10000/20000 learning episodes - average reward: -11.482
After 12000/20000 learning episodes - average reward: 7.596
After 14000/20000 learning episodes - average reward: 6.934
After 16000/20000 learning episodes - average reward: 7.764
After 18000/20000 learning episodes - average reward: 8.066
After 20000/20000 learning episodes - average reward: 7.968


In [43]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [44]:
beta = 0.05
gamma = 0.9
epsilon = 0.01

In [45]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [46]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -147.908
After 4000/20000 learning episodes - average reward: -40.524
After 6000/20000 learning episodes - average reward: 1.622
After 8000/20000 learning episodes - average reward: 7.556
After 10000/20000 learning episodes - average reward: 7.994
After 12000/20000 learning episodes - average reward: 7.898
After 14000/20000 learning episodes - average reward: 7.842
After 16000/20000 learning episodes - average reward: 8.168
After 18000/20000 learning episodes - average reward: 7.842
After 20000/20000 learning episodes - average reward: 8.074


In [47]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [48]:
beta = 0.1
gamma = 0.9
epsilon = 0.01

In [49]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [50]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -55.356
After 4000/20000 learning episodes - average reward: 7.654
After 6000/20000 learning episodes - average reward: 7.978
After 8000/20000 learning episodes - average reward: 7.924
After 10000/20000 learning episodes - average reward: 7.794
After 12000/20000 learning episodes - average reward: 7.858
After 14000/20000 learning episodes - average reward: 8.0
After 16000/20000 learning episodes - average reward: 7.956
After 18000/20000 learning episodes - average reward: 7.95
After 20000/20000 learning episodes - average reward: 7.678


In [51]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [52]:
beta = 0.001
gamma = 0.9
epsilon = 0.01

In [53]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [54]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -343.19
After 4000/20000 learning episodes - average reward: -382.952
After 6000/20000 learning episodes - average reward: -292.628
After 8000/20000 learning episodes - average reward: -267.482
After 10000/20000 learning episodes - average reward: -418.016
After 12000/20000 learning episodes - average reward: -278.876
After 14000/20000 learning episodes - average reward: -318.242
After 16000/20000 learning episodes - average reward: -264.62
After 18000/20000 learning episodes - average reward: -307.352
After 20000/20000 learning episodes - average reward: -264.242


In [55]:
results_beta.update_results(n_episodes, beta, gamma, epsilon, reward)

In [56]:
results_beta.sort_results("beta")

In [57]:
results_beta.results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 3 | 20000.0 | 0.001 | 0.9 | 0.01 | -264.242 |
| 0 | 20000.0 | 0.03 | 0.9 | 0.01 | 7.968 |
| 1 | 20000.0 | 0.05 | 0.9 | 0.01 | 8.074 |
| 2 | 20000.0 | 0.1 | 0.9 | 0.01 | 7.678 |


#### 3.2 Investigating the effect of the discount factor $\gamma$

In [58]:
results_gamma = Results()

In [59]:
beta = 0.03
gamma = 0.95
epsilon = 0.01

In [60]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [61]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -306.65
After 4000/20000 learning episodes - average reward: -131.518
After 6000/20000 learning episodes - average reward: -63.67
After 8000/20000 learning episodes - average reward: 2.188
After 10000/20000 learning episodes - average reward: 6.752
After 12000/20000 learning episodes - average reward: 7.808
After 14000/20000 learning episodes - average reward: 8.07
After 16000/20000 learning episodes - average reward: 7.812
After 18000/20000 learning episodes - average reward: 8.0
After 20000/20000 learning episodes - average reward: 8.006


In [62]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [63]:
beta = 0.03
gamma = 0.99
epsilon = 0.01

In [64]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [65]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -247.096
After 4000/20000 learning episodes - average reward: -110.636
After 6000/20000 learning episodes - average reward: -25.574
After 8000/20000 learning episodes - average reward: 7.82
After 10000/20000 learning episodes - average reward: 7.82
After 12000/20000 learning episodes - average reward: 8.086
After 14000/20000 learning episodes - average reward: 7.982
After 16000/20000 learning episodes - average reward: 7.988
After 18000/20000 learning episodes - average reward: 7.854
After 20000/20000 learning episodes - average reward: 7.824


In [66]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [67]:
beta = 0.03
gamma = 0.999
epsilon = 0.01

In [68]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [69]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -247.45
After 4000/20000 learning episodes - average reward: -92.066
After 6000/20000 learning episodes - average reward: -58.87
After 8000/20000 learning episodes - average reward: 4.662
After 10000/20000 learning episodes - average reward: 7.952
After 12000/20000 learning episodes - average reward: 7.856
After 14000/20000 learning episodes - average reward: 7.926
After 16000/20000 learning episodes - average reward: 8.018
After 18000/20000 learning episodes - average reward: 8.026
After 20000/20000 learning episodes - average reward: 7.934


In [70]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [71]:
beta = 0.03
gamma = 0.8
epsilon = 0.01

In [72]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [73]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -203.434
After 4000/20000 learning episodes - average reward: -153.17
After 6000/20000 learning episodes - average reward: -90.02
After 8000/20000 learning episodes - average reward: -47.838
After 10000/20000 learning episodes - average reward: -21.296
After 12000/20000 learning episodes - average reward: -12.718
After 14000/20000 learning episodes - average reward: -9.374
After 16000/20000 learning episodes - average reward: 1.662
After 18000/20000 learning episodes - average reward: 2.184
After 20000/20000 learning episodes - average reward: 1.814


In [74]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [75]:
beta = 0.03
gamma = 0.6
epsilon = 0.01

In [76]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [77]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -214.526
After 4000/20000 learning episodes - average reward: -154.838
After 6000/20000 learning episodes - average reward: -131.562
After 8000/20000 learning episodes - average reward: -117.354
After 10000/20000 learning episodes - average reward: -86.134
After 12000/20000 learning episodes - average reward: -67.1
After 14000/20000 learning episodes - average reward: -75.028
After 16000/20000 learning episodes - average reward: -47.654
After 18000/20000 learning episodes - average reward: -44.388
After 20000/20000 learning episodes - average reward: -38.094


In [78]:
results_gamma.update_results(n_episodes, beta, gamma, epsilon, reward)

In [79]:
results_gamma.sort_results("gamma")

In [80]:
results_gamma.results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 4 | 20000.0 | 0.03 | 0.6 | 0.01 | -38.094 |
| 3 | 20000.0 | 0.03 | 0.8 | 0.01 | 1.814 |
| 0 | 20000.0 | 0.03 | 0.95 | 0.01 | 8.006 |
| 1 | 20000.0 | 0.03 | 0.99 | 0.01 | 7.824 |
| 2 | 20000.0 | 0.03 | 0.999 | 0.01 | 7.934 |
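
One possible reading of the weak results for small $\gamma$, offered here as an assumption rather than something verified in this report, is that heavy discounting shrinks the difference between delivering the passenger and never delivering at all. The sketch below assumes the default Taxi-v3 rewards of -1 per step and +20 for a successful drop-off, and a 200-step episode limit; the function names are illustrative.

```python
def delivery_return(gamma, steps, step_reward=-1.0, final_reward=20.0):
    """Discounted return of an episode whose last step is a successful drop-off."""
    total = 0.0
    for t in range(steps - 1):
        total += (gamma ** t) * step_reward
    total += (gamma ** (steps - 1)) * final_reward
    return total


def wandering_return(gamma, steps=200, step_reward=-1.0):
    """Discounted return of an episode that only collects the per-step penalty."""
    return sum((gamma ** t) * step_reward for t in range(steps))


# Undiscounted, a 13-step delivery collects 12 * (-1) + 20 = 8.
undiscounted = delivery_return(1.0, 13)

# How much better delivering in 13 steps looks than never delivering.
gap_low_gamma = delivery_return(0.6, 13) - wandering_return(0.6)
gap_high_gamma = delivery_return(0.9, 13) - wandering_return(0.9)
```

Under these assumptions the gap is roughly 0.05 for $\gamma = 0.6$ and roughly 8.5 for $\gamma = 0.9$, so with a small discount factor the distant reward contributes very little to the value of the early states. The undiscounted value of 8 for a 13-step delivery is also close to the final averages reported in the runs that converged.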


#### 3.3 Investigating the effect of the parameter $\epsilon$

In [81]:
results_epsilon = Results()

In [82]:
beta = 0.03
gamma = 0.9
epsilon = 0.05

In [83]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [84]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -167.262
After 4000/20000 learning episodes - average reward: -104.718
After 6000/20000 learning episodes - average reward: -31.56
After 8000/20000 learning episodes - average reward: -20.166
After 10000/20000 learning episodes - average reward: 4.742
After 12000/20000 learning episodes - average reward: 7.618
After 14000/20000 learning episodes - average reward: 7.984
After 16000/20000 learning episodes - average reward: 7.87
After 18000/20000 learning episodes - average reward: 7.922
After 20000/20000 learning episodes - average reward: 8.006


In [85]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [86]:
beta = 0.03
gamma = 0.9
epsilon = 0.1

In [87]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [88]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -192.81
After 4000/20000 learning episodes - average reward: -88.974
After 6000/20000 learning episodes - average reward: -33.29
After 8000/20000 learning episodes - average reward: -8.498
After 10000/20000 learning episodes - average reward: 1.834
After 12000/20000 learning episodes - average reward: 4.466
After 14000/20000 learning episodes - average reward: 7.686
After 16000/20000 learning episodes - average reward: 7.712
After 18000/20000 learning episodes - average reward: 7.78
After 20000/20000 learning episodes - average reward: 7.868


In [89]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [90]:
beta = 0.03
gamma = 0.9
epsilon = 0.2

In [91]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [92]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -172.422
After 4000/20000 learning episodes - average reward: -86.99
After 6000/20000 learning episodes - average reward: -25.952
After 8000/20000 learning episodes - average reward: -11.02
After 10000/20000 learning episodes - average reward: 2.748
After 12000/20000 learning episodes - average reward: 8.03
After 14000/20000 learning episodes - average reward: 7.984
After 16000/20000 learning episodes - average reward: 7.856
After 18000/20000 learning episodes - average reward: 7.922
After 20000/20000 learning episodes - average reward: 7.772


In [93]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [94]:
beta = 0.03
gamma = 0.9
epsilon = 0.5

In [95]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [96]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -168.276
After 4000/20000 learning episodes - average reward: -71.186
After 6000/20000 learning episodes - average reward: -6.908
After 8000/20000 learning episodes - average reward: -0.346
After 10000/20000 learning episodes - average reward: 5.224
After 12000/20000 learning episodes - average reward: 7.834
After 14000/20000 learning episodes - average reward: 7.722
After 16000/20000 learning episodes - average reward: 7.936
After 18000/20000 learning episodes - average reward: 7.86
After 20000/20000 learning episodes - average reward: 7.902


In [97]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [98]:
beta = 0.03
gamma = 0.9
epsilon = 1

In [99]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [100]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -175.062
After 4000/20000 learning episodes - average reward: 7.608
After 6000/20000 learning episodes - average reward: 7.92
After 8000/20000 learning episodes - average reward: 7.976
After 10000/20000 learning episodes - average reward: 7.882
After 12000/20000 learning episodes - average reward: 7.93
After 14000/20000 learning episodes - average reward: 7.868
After 16000/20000 learning episodes - average reward: 7.802
After 18000/20000 learning episodes - average reward: 8.026
After 20000/20000 learning episodes - average reward: 7.906


In [101]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [102]:
beta = 0.03
gamma = 0.9
epsilon = 0.005

In [103]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [104]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -361.53
After 4000/20000 learning episodes - average reward: -123.074
After 6000/20000 learning episodes - average reward: -43.53
After 8000/20000 learning episodes - average reward: -4.774
After 10000/20000 learning episodes - average reward: 2.23
After 12000/20000 learning episodes - average reward: 5.758
After 14000/20000 learning episodes - average reward: 7.96
After 16000/20000 learning episodes - average reward: 7.888
After 18000/20000 learning episodes - average reward: 7.93
After 20000/20000 learning episodes - average reward: 8.062


In [105]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [106]:
beta = 0.03
gamma = 0.9
epsilon = 0.001

In [107]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [108]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -215.634
After 4000/20000 learning episodes - average reward: -113.16
After 6000/20000 learning episodes - average reward: -28.83
After 8000/20000 learning episodes - average reward: 0.902
After 10000/20000 learning episodes - average reward: 2.248
After 12000/20000 learning episodes - average reward: 2.378
After 14000/20000 learning episodes - average reward: 7.69
After 16000/20000 learning episodes - average reward: 7.952
After 18000/20000 learning episodes - average reward: 7.98
After 20000/20000 learning episodes - average reward: 8.022


In [109]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [110]:
beta = 0.03
gamma = 0.9
epsilon = 0.0001

In [111]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [112]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -247.76
After 4000/20000 learning episodes - average reward: -126.51
After 6000/20000 learning episodes - average reward: -34.206
After 8000/20000 learning episodes - average reward: -3.952
After 10000/20000 learning episodes - average reward: -15.384
After 12000/20000 learning episodes - average reward: 8.038
After 14000/20000 learning episodes - average reward: 6.818
After 16000/20000 learning episodes - average reward: 8.16
After 18000/20000 learning episodes - average reward: 7.714
After 20000/20000 learning episodes - average reward: 7.988


In [113]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [114]:
beta = 0.03
gamma = 0.9
epsilon = 0

In [115]:
agent = QlearningAgent(env=env, beta=beta, gamma=gamma, epsilon=epsilon)

In [116]:
reward = agent.learn(n_episodes=n_episodes, n_eval_episodes=n_eval_episodes, eval_period=eval_period)

After 2000/20000 learning episodes - average reward: -174.01
After 4000/20000 learning episodes - average reward: -68.62
After 6000/20000 learning episodes - average reward: -23.978
After 8000/20000 learning episodes - average reward: -0.548
After 10000/20000 learning episodes - average reward: 2.138
After 12000/20000 learning episodes - average reward: 7.476
After 14000/20000 learning episodes - average reward: 6.52
After 16000/20000 learning episodes - average reward: 8.012
After 18000/20000 learning episodes - average reward: 8.1
After 20000/20000 learning episodes - average reward: 7.748


In [117]:
results_epsilon.update_results(n_episodes, beta, gamma, epsilon, reward)

In [118]:
results_epsilon.sort_results("epsilon")

In [119]:
results_epsilon.results

| | Learning episodes | beta | gamma | epsilon | Average reward |
|---|---|---|---|---|---|
| 8 | 20000.0 | 0.03 | 0.9 | 0.0 | 7.748 |
| 7 | 20000.0 | 0.03 | 0.9 | 0.0001 | 7.988 |
| 6 | 20000.0 | 0.03 | 0.9 | 0.001 | 8.022 |
| 5 | 20000.0 | 0.03 | 0.9 | 0.005 | 8.062 |
| 0 | 20000.0 | 0.03 | 0.9 | 0.05 | 8.006 |
| 1 | 20000.0 | 0.03 | 0.9 | 0.1 | 7.868 |
| 2 | 20000.0 | 0.03 | 0.9 | 0.2 | 7.772 |
| 3 | 20000.0 | 0.03 | 0.9 | 0.5 | 7.902 |
| 4 | 20000.0 | 0.03 | 0.9 | 1.0 | 7.906 |


## 4. Summary

In [120]:
results = pd.concat([results_beta.results, results_gamma.results, results_epsilon.results])

In [121]:
results

Unnamed: 0,Learning episodes,beta,gamma,epsilon,Average reward
3,20000.0,0.001,0.9,0.01,-264.242
0,20000.0,0.03,0.9,0.01,7.968
1,20000.0,0.05,0.9,0.01,8.074
2,20000.0,0.1,0.9,0.01,7.678
4,20000.0,0.03,0.6,0.01,-38.094
3,20000.0,0.03,0.8,0.01,1.814
0,20000.0,0.03,0.95,0.01,8.006
1,20000.0,0.03,0.99,0.01,7.824
2,20000.0,0.03,0.999,0.01,7.934
8,20000.0,0.03,0.9,0.0,7.748
