# Tabular Q-learning Challenge

First, we import the required packages.

In [17]:
import gym
import numpy as np
import random
import sys

Now we create a class that holds our agent data. Our Q-learning algorithm will use this class.

Using a class like this lets us create multiple agent objects with different parameters, which is convenient for comparing the results of different training parameters in a parameter study.

As you can see, most parameters are already referenced in the code.

Fill in the gaps in the code using the Q-learning pseudocode.
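
For reference, the tabular Q-learning update is commonly written as Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]. The sketch below applies that rule once to a made-up two-state table, outside of Gym. The function name `q_learning_update` and the toy numbers are illustrative assumptions, not part of the assignment code.

```python
def q_learning_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    old_value = q_table[state][action]
    # TD error: bootstrapped target minus the current estimate.
    td_error = reward + gamma * max(q_table[next_state]) - old_value
    q_table[state][action] = old_value + alpha * td_error
    return td_error


# Hypothetical toy table: 2 states x 2 actions, stored as nested lists.
q_table = [[0.0, 0.0],
           [0.0, 2.0]]

# One transition: state 0, action 1, reward 1.0, next state 1.
td_error = q_learning_update(q_table, state=0, action=1, reward=1.0,
                             next_state=1, alpha=0.5, gamma=0.8)
print(td_error, q_table[0][1])
```

Here the target is 1.0 + 0.8 * 2.0, so the TD error is about 2.6 and the updated entry moves halfway toward the target because alpha is 0.5.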

In [14]:
class q_learning_agent:
    def __init__(self, alpha=1, epsilon=0.5, gamma=0.8, env_name="Taxi-v3"):
        self.alpha = alpha
        self.epsilon = epsilon
        self.gamma = gamma
        self.q_table = None           # Set by initialize_q_table(), which is called at the end of __init__
        self.env_name = env_name
        self.env = None
        self.initialize_q_table()
    
    def initialize_q_table(self):
        self.env = gym.make(self.env_name)
        self.q_table = None  #CODING ASSIGNMENT, 1 LINE: Replace None. Define the shape of the Q table based on the pseudocode, and the OpenAI gym documentation.
        self.env.close()
    
    def epsilon_greedy(self, observation):
        #CODING ASSIGNMENT, ~5 LINES: Implement epsilon greedy.
        raise NotImplementedError("epsilon_greedy is part of the coding assignment")
    
    def train(self, episodes=2000):
        self.env = gym.make(self.env_name)
        
        for _ in range(episodes):
            state = self.env.reset() #Initialize state
            done = False

            while not done:
                #CODING ASSIGNMENT, ~7 LINES: Implement the Q-learning algorithm.

                #Select action with epsilon greedy

                next_state, reward, done, _ = self.env.step(action) #The OpenAI Gym step function to execute an action and get the next state and reward.

                # Retrieve old value from the q-table.

                # Calculate td-error.

                # Update q-value for current state using td-error.

                # S <- S'

                if done:
                    state = self.env.reset()
                    break
        
        self.env.close()

    def test(self, episodes=200, render=False):
        correct, incorrect = 0, 0
        
        self.env = gym.make(self.env_name)
        for _ in range(episodes):
            observation = self.env.reset()
            done = False
            while not done:
                if render:
                    self.env.render()
                
                action = np.argmax(self.q_table[observation])
                observation, reward, done, _ = self.env.step(action)

                if done:

                    if (reward > 0):
                        correct = correct + 1
                    else:
                        incorrect = incorrect + 1
                    
                    observation = self.env.reset()

                    break

        print(f"Correct: {correct}, incorrect: {incorrect}")
        self.env.close()
        return correct, incorrect

    def show_episode(self):
        self.test(episodes=1, render=True)
        
    def train_test_sequence(self, train_episodes=1000, test_episodes=200, repetitions=5):
        self.initialize_q_table()
        results = []
        
        print(f"Training and testing in {repetitions} repetitions of {train_episodes} train episodes and {test_episodes} test episodes each.")
        print(f"Parameters: alpha: {self.alpha}, epsilon: {self.epsilon}, gamma: {self.gamma}")
        
        for i in range(repetitions):
            self.train(episodes=train_episodes)
            correct, incorrect = self.test(episodes=test_episodes)
            results.append([correct, incorrect])
        
        return results
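
Outside the class, the epsilon-greedy rule can be sketched on a single row of Q-values. This is one possible reading, using a plain Python list rather than the agent's q_table. The helper name `epsilon_greedy_action`, the `rng` parameter, and the example row are assumptions made for illustration.

```python
import random


def epsilon_greedy_action(q_row, epsilon, rng=random):
    """Return a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(len(q_row))
    # Exploit: index of the largest Q-value (the first one wins on ties).
    return max(range(len(q_row)), key=lambda action: q_row[action])


q_row = [0.1, 0.7, 0.3]   # made-up Q-values for one state
rng = random.Random(0)    # seeded so the run is repeatable

greedy_action = epsilon_greedy_action(q_row, epsilon=0.0, rng=rng)
random_actions = [epsilon_greedy_action(q_row, epsilon=1.0, rng=rng)
                  for _ in range(100)]
print(greedy_action, sorted(set(random_actions)))
```

With epsilon = 0 the choice is always greedy, and with epsilon = 1 it is always random. The values in between trade exploration against exploitation.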
        


Now we are ready to test our Q-learning agent.

In [None]:
agent = q_learning_agent(env_name="Taxi-v3") #We instantiate the agent with base parameters
agent.train()
_ = agent.test()
agent.show_episode() #Renders a single test episode

### Parameter Study
We are going to do a parameter study on our agent for the Taxi-v3 environment. This means taking a list of candidate values for each parameter and training the agent with each of them, to see which value works best for each parameter.

- Create lists of possible values for both gamma and epsilon.
- Create for-loops that set your agent's parameters to those values, then train and test the agent.
- Find the optimal combination of values for gamma and epsilon.

Use the train_test_sequence method for convenience.
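
The loop structure for such a study could look like the sketch below. To keep it runnable without Gym, the scoring function is passed in as a callable. The names `parameter_grid_search` and `dummy_evaluate` and the candidate values are assumptions chosen for illustration, not recommended settings.

```python
from itertools import product


def parameter_grid_search(evaluate, gammas, epsilons):
    """Score every (gamma, epsilon) pair; return all scores and the best pair."""
    scores = {}
    for gamma, epsilon in product(gammas, epsilons):
        scores[(gamma, epsilon)] = evaluate(gamma, epsilon)
    best = max(scores, key=scores.get)
    return scores, best


# Stand-in scoring function so the sketch runs on its own.
# In the notebook, evaluate could instead do something like:
#     agent = q_learning_agent(gamma=gamma, epsilon=epsilon)
#     results = agent.train_test_sequence()
#     return results[-1][0]   # correct count of the last repetition
def dummy_evaluate(gamma, epsilon):
    return -abs(gamma - 0.9) - abs(epsilon - 0.1)


scores, best = parameter_grid_search(dummy_evaluate,
                                     gammas=[0.5, 0.9, 0.99],
                                     epsilons=[0.1, 0.3, 0.5])
print(best)
```

Keeping the scores in a dictionary keyed by the parameter pair makes it easy to inspect the whole grid afterwards instead of only the winner.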

Now that you have found your (hopefully) optimal values, experiment with how few training episodes and repetitions you need to reach 100% correct when resetting and retraining your agent 3 times.
(The total number of training episodes is train_episodes * repetitions.)
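
One way to read the output of train_test_sequence is sketched below. It assumes the format returned above, a list of [correct, incorrect] pairs with one pair per repetition, and relies on training being cumulative across repetitions because the Q-table is only reset at the start of the sequence. The helper names and the example numbers are made up for illustration.

```python
def summarize(results, train_episodes):
    """Return (cumulative training episodes, success rate) per repetition."""
    summary = []
    for repetition, (correct, incorrect) in enumerate(results, start=1):
        total = correct + incorrect
        rate = correct / total if total else 0.0
        summary.append((repetition * train_episodes, rate))
    return summary


def episodes_until_perfect(results, train_episodes):
    """Return the training episodes used when 100% is first reached, or None."""
    for total_trained, rate in summarize(results, train_episodes):
        if rate == 1.0:
            return total_trained
    return None


# Made-up results for 3 repetitions of 1000 training episodes each.
example_results = [[120, 80], [190, 10], [200, 0]]
print(summarize(example_results, train_episodes=1000))
print(episodes_until_perfect(example_results, train_episodes=1000))
```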

### FrozenLake-v0
Repeat the parameter study for the FrozenLake-v0 environment.
This environment is stochastic: the movement direction depends only partially on the action taken by the agent.
Because the environment is stochastic, alpha = 1 is no longer optimal: with alpha = 1, each update completely overwrites the old Q-value with a single noisy sample. This is why you will now include alpha in your parameter study.
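
The effect of alpha on noisy targets can be seen in isolation with a running estimate of the same form as the Q-learning update. The sketch below is not FrozenLake itself: the alternating sample sequence and the name `running_estimate` are assumptions used to stand in for a target that is 1 half of the time and 0 otherwise.

```python
def running_estimate(samples, alpha, initial=0.0):
    """Repeatedly move the estimate toward each new sample by a step of size alpha."""
    estimate = initial
    for sample in samples:
        estimate += alpha * (sample - estimate)
    return estimate


# Made-up noisy target with a true mean of 0.5.
samples = [1.0, 0.0] * 100

estimate_alpha_1 = running_estimate(samples, alpha=1.0)      # equals the last sample
estimate_alpha_small = running_estimate(samples, alpha=0.1)  # stays near the mean
print(estimate_alpha_1, estimate_alpha_small)
```

With alpha = 1 the estimate simply equals the most recent sample, while a smaller alpha averages over many samples and settles close to the mean.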

Now that you have found your (hopefully) optimal values, experiment with how few training episodes and repetitions you need to reach 100% correct when resetting and retraining your agent 3 times.
(The total number of training episodes is train_episodes * repetitions.)

### Extra parameter study: FrozenLake8x8-v0
This is FrozenLake-v0 with a larger map. Because the map is larger, more steps are needed to reach a reward. Environments like this are called "sparse reward" environments. Because the agent currently only learns once it has found a reward, sparse reward environments can be quite a problem. You probably will not be able to solve this one yet with your current agent.
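
To illustrate why sparse rewards are hard, assume the Q-table starts at all zeros (an assumption, since the initialization is part of the assignment above). Every zero-reward transition then produces a TD error of zero, so the table does not change until a reward is actually seen. The sketch uses a made-up chain of 64 states rather than the real FrozenLake8x8-v0 dynamics, and `td_error` is a hypothetical helper.

```python
def td_error(q_table, state, action, reward, next_state, gamma):
    """TD error for one transition under the tabular Q-learning rule."""
    return reward + gamma * max(q_table[next_state]) - q_table[state][action]


n_states, n_actions = 64, 4
alpha, gamma = 0.5, 0.9
q_table = [[0.0] * n_actions for _ in range(n_states)]

# Walk along the chain with zero reward on every step.
errors = []
for state in range(n_states - 2):
    error = td_error(q_table, state, 0, 0.0, state + 1, gamma)
    q_table[state][0] += alpha * error
    errors.append(error)

table_unchanged = all(value == 0.0 for row in q_table for value in row)

# The single rewarded transition into the last state is the first informative update.
goal_error = td_error(q_table, n_states - 2, 0, 1.0, n_states - 1, gamma)
q_table[n_states - 2][0] += alpha * goal_error
print(table_unchanged, goal_error, q_table[n_states - 2][0])
```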

Now that you have found your (hopefully) optimal values, experiment with how few training episodes and repetitions you need to reach 100% correct when resetting and retraining your agent 3 times.
(The total number of training episodes is train_episodes * repetitions.)