## <center> Exercise with Continuous Q-Learning</center>

In this exercise we take a look at MountainCar-v0 (https://gym.openai.com/envs/MountainCar-v0/), an environment in which the goal is to drive the car to the top of the mountain within a time limit.



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import gym

**TASK: Create the gym mountain car environment** <br />


In [None]:
def recall():
    """
    Create a fresh copy of the environment.
    The environment has to be recreated each time before a reset or a render is called.
    This function sets the environment name and builds the environment with gym.make.
    """
    env_name = "MountainCar-v0"  # Use the exact same name as stated on gym.openai
    env = gym.make(env_name)  # Create the environment; pass render_mode here if you want it rendered

    return env

env = recall()

**TASK: Write code to create a numpy array holding the bins for the observations of the car (position and velocity).**

The function should take one argument, the number of bins per observation. Hint: You will probably need around 25 bins for good results, but feel free to use fewer to reduce training time.


In [None]:
NUM_BINS = 50

position_bin = np.linspace(-1.2, 0.6, NUM_BINS)
velocity_bin = np.linspace(-0.07, 0.07, NUM_BINS)

BINS = [position_bin, velocity_bin]
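
The task asks for a function that takes the number of bins as its argument. A minimal sketch of such a wrapper follows, assuming the same position limits (-1.2 to 0.6) and velocity limits (-0.07 to 0.07) used in the cell above. The name `create_bins` is hypothetical and not part of the original solution.

```python
import numpy as np

def create_bins(num_bins_per_observation):
    """Return [position_edges, velocity_edges] with the requested number of edges each."""
    # Limits assumed from the cell above (position and velocity ranges of the car)
    position_edges = np.linspace(-1.2, 0.6, num_bins_per_observation)
    velocity_edges = np.linspace(-0.07, 0.07, num_bins_per_observation)
    return [position_edges, velocity_edges]

example_bins = create_bins(25)
print(len(example_bins), example_bins[0].shape)
```

Wrapping the edges in a function makes it easy to compare training runs with different resolutions.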

**TASK: Create a function that will take in observations from the environment and the bins array and return the discretized version of the observation.**

In [None]:
def binner(observations, bins):
    binned_observations = []

    for ind, observation in enumerate(observations):
        binned_val = np.digitize(observation, bins[ind])
        binned_observations.append(binned_val)
    
    return tuple(binned_observations) # Important for later indexing

**TASK: Confirm that your *binner()* function works by running the following cell.**

In [None]:
bin1 =  [-1.2 , -0.75, -0.3 ,  0.15,  0.6]
bin2 = [-0.07 , -0.035,  0.   ,  0.035,  0.07 ]
test_bins = [bin1, bin2]

test_observation = np.array([-0.9, 0.03])
discretized_test_bins = binner(test_observation, test_bins)
assert discretized_test_bins == (1, 3)
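
It can help to look at how `np.digitize` treats values at the edges of the range. With its default `right=False`, it returns `0` for values below the first edge and `len(bins)` for values at or above the last edge, so `n` edges can produce `n + 1` different indices. The short sketch below illustrates this with the test edges from above; this is worth keeping in mind when sizing any array that is indexed by these values.

```python
import numpy as np

edges = np.array([-1.2, -0.75, -0.3, 0.15, 0.6])

below = np.digitize(-1.5, edges)   # smaller than every edge
inside = np.digitize(-0.9, edges)  # between edges[0] and edges[1]
at_top = np.digitize(0.6, edges)   # equal to the last edge

print(below, inside, at_top)
```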

**TASK: Create the Q-Table** <br />
Remember the shape that the Q-Table needs to have.

In [None]:
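# np.digitize can return any index from 0 to NUM_BINS (inclusive),
# so each state axis needs NUM_BINS + 1 entries to avoid an IndexError
q_table_shape = (NUM_BINS + 1, NUM_BINS + 1, env.action_space.n)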
q_table = np.zeros(q_table_shape)
print(q_table.shape)
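
The reason for this shape is that a discretized state tuple can index the table directly: indexing with the state alone gives the values of all actions, and appending the action selects a single entry. A small sketch with a hypothetical 4 x 4 x 3 table (the names `demo_table`, `demo_state` and `demo_action` are illustrative only):

```python
import numpy as np

demo_table = np.zeros((4, 4, 3))   # hypothetical small table: 4 x 4 states, 3 actions
demo_state = (1, 3)                # a discretized (position, velocity) pair
demo_action = 2

# Tuple concatenation builds the full index; same as demo_table[1, 3, 2]
demo_table[demo_state + (demo_action,)] = 5.0

action_values = demo_table[demo_state]   # shape (3,): one value per action
print(action_values, np.argmax(action_values))
```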

**TASK: Fill out the Epsilon Greedy Action Selection function:**

In [None]:
def action_selection(epsilon, q_table, discrete_state):
    random_number = np.random.random()
    
    # EXPLOITATION, USE BEST Q(s,a) Value
    if random_number > epsilon:
        action = np.argmax(q_table[discrete_state])

    # EXPLORATION, USE A RANDOM ACTION
    else:
        action = np.random.randint(0, env.action_space.n)

    return action
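
To see the two extremes of epsilon-greedy selection, the sketch below uses a hypothetical helper, `select_action`, that follows the same rule but takes the action values and a random generator as arguments instead of reading the global `env`. With `epsilon = 0` it should always pick the best action; with `epsilon = 1` it always explores.

```python
import numpy as np

def select_action(epsilon, q_values, rng):
    """Epsilon-greedy choice over a 1-D array of action values (hypothetical helper)."""
    if rng.random() > epsilon:
        return int(np.argmax(q_values))          # exploit: best known action
    return int(rng.integers(0, len(q_values)))   # explore: uniform random action

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])

greedy_choices = {select_action(0.0, q_values, rng) for _ in range(100)}
random_choices = {select_action(1.0, q_values, rng) for _ in range(100)}
print(greedy_choices, random_choices)
```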

**TASK: Fill out the function to compute the next Q value.**

In [None]:
def next_q_value(old_q_value, reward, next_optimal_q_value):
    
    return old_q_value +  ALPHA * (reward + GAMMA * next_optimal_q_value - old_q_value)
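
As a worked example of this update rule, the sketch below applies it twice to an entry that starts at zero, using the reward of -1 per step that MountainCar-v0 gives and the solution's values of 0.8 for the learning rate and 0.9 for the discount. The function `q_update` is a hypothetical stand-alone variant that takes alpha and gamma as arguments.

```python
def q_update(old_q, reward, next_max_q, alpha, gamma):
    """One tabular Q-learning update step."""
    td_target = reward + gamma * next_max_q      # estimate of the return from here
    return old_q + alpha * (td_target - old_q)   # move old_q a fraction alpha toward it

# Two consecutive visits to the same entry while the next state's values are still zero
first = q_update(old_q=0.0, reward=-1.0, next_max_q=0.0, alpha=0.8, gamma=0.9)
second = q_update(old_q=first, reward=-1.0, next_max_q=0.0, alpha=0.8, gamma=0.9)
print(first, second)
```

The first update gives -0.8 and the second roughly -0.96: each visit closes 80% of the remaining gap to the target.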


**TASK: Create a function to reduce epsilon. Feel free to choose any reduction method you want. We'll use a reduction with BURN_IN and EPSILON_END limits in the solution. We'll also show a way to reduce epsilon based on the number of epochs. Feel free to experiment here.**

In [None]:
def reduce_epsilon(epsilon, epoch):
    if BURN_IN <= epoch <= EPSILON_END:
        epsilon -= EPSILON_REDUCE
    
    return epsilon
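
As one possible alternative that depends only on the epoch number, epsilon can shrink geometrically and be clamped at a floor. This is a sketch, not the solution's method; the function name and the default values for `start`, `minimum` and `decay` are assumptions chosen for illustration.

```python
def exponential_epsilon(epoch, start=1.0, minimum=0.01, decay=0.999):
    """Hypothetical schedule: epsilon decays geometrically with the epoch, down to a floor."""
    return max(minimum, start * decay ** epoch)

print(exponential_epsilon(0), exponential_epsilon(1000), exponential_epsilon(10000))
```

Unlike the linear version, this schedule needs no running state: epsilon is computed directly from the epoch.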

**TASK: Define your hyperparameters. Note: we'll show our solution hyperparameters here, but depending on your *reduce_epsilon* function, your epsilon hyperparameters may be different.**

In [None]:
EPOCHS = 30000
BURN_IN = 100
epsilon = 1

EPSILON_END = 10000
EPSILON_REDUCE = 0.0001

ALPHA = 0.8
GAMMA = 0.9
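
A quick check of where these values leave epsilon, assuming the linear *reduce_epsilon* shown earlier (one subtraction per epoch from BURN_IN to EPSILON_END inclusive) and a starting epsilon of 1:

```python
BURN_IN = 100
EPSILON_END = 10000
EPSILON_REDUCE = 0.0001

reduction_steps = EPSILON_END - BURN_IN + 1           # epochs 100..10000 inclusive
final_epsilon = 1 - reduction_steps * EPSILON_REDUCE  # value held for the remaining epochs
print(reduction_steps, round(final_epsilon, 4))
```

With these numbers epsilon ends just below 0.01, so the agent keeps a small amount of exploration for the rest of training.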


**TASK: Create the training loop for the reinforcement learning agent and run the loop.**

In [None]:
# Lists

points = []
mean_points = []

for epoch in range(EPOCHS):

    # Reset the environment
    env = recall()
    state, _ = env.reset()  # reset returns (observation, info)
    binned_state = binner(state, BINS)
    
    done = False
    score = 0 
    
    while not done:
        action = action_selection(epsilon, q_table, binned_state)

        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # stop at the goal or when the time limit is hit
        score += reward

        old_q_value =  q_table[binned_state + (action,)]

        binned_next_state = binner(next_state, BINS) 
        next_optimal_q_value = np.max(q_table[binned_next_state])  

        next_q = next_q_value(old_q_value, reward, next_optimal_q_value)   

        q_table[binned_state + (action,)] = next_q
        
        binned_state = binned_next_state

    epsilon = reduce_epsilon(epsilon,epoch)

    points.append(score)
    running_mean = round(np.mean(points[-50:]), 2)
    mean_points.append(running_mean) 

    print(epoch)

env.close()


In [None]:
fig, ax = plt.subplots()
ax.scatter(np.arange(0, EPOCHS, 1), points)
ax.plot(np.arange(0, EPOCHS, 1), points)
ax.plot(np.arange(0, EPOCHS, 1), mean_points, label=f"Running Mean: {running_mean}")
plt.legend()

**TASK: Use your Q-Table to test your agent and render its performance.**

In [None]:
env = gym.make("MountainCar-v0", render_mode="human") 
observation, _ = env.reset()  # reset returns (observation, info)
rewards = 0

for _ in range(1000):
    env.render()
    discrete_state = binner(observation, BINS)  # get bins
    action = np.argmax(q_table[discrete_state])  # and choose the action from the Q-Table
    observation, reward, terminated, truncated, info = env.step(action)  # Finally perform the action
    rewards += reward
    if terminated or truncated:  # goal reached or time limit hit
        print(f"You got {rewards} points!")
        break


env.close()