# Large Notebook 6: Reinforcement Learning
## Ex. 6: Control problem MountainCar-v0

In the exercise class we will cover the control problem of a car at the bottom of a valley that has to pick up enough momentum to get over the hill. We will use the environment from Gymnasium, the maintained fork of OpenAI Gym, which allows you to play and visualize the 'game'. Use RL to train a policy that gets the car over the hill in the least amount of time.

**Before you can start this exercise you have to install the package Gymnasium. Start your anaconda environment with python3 and install:**

* pip install gymnasium[classic-control]


In [1]:
%pip install gymnasium[classic-control]

import gymnasium as gym
import numpy as np







If you have properly installed Gymnasium you should be able to import it. We will now run a DEMO to see if everything is working. The code below already simulates the MountainCar problem for the case that its actions are **random**. To be able to view the rendered video of the poor and helpless car, desperately trying to drive up the hill, you should run the code on your own computer.
For more info on this particular environment see e.g. the website: https://gymnasium.farama.org/environments/classic_control/mountain_car/



In [2]:
def demo():
    """run the MountainCar environment with random actions"""
    
    # create an instance of the environment; with render_mode='human' every step is drawn to the screen
    env = gym.make('MountainCar-v0', render_mode='human')

    state, info = env.reset()  # reset the current game; returns (observation, info)

    for _ in range(200):  # play 200 random actions
        a = env.action_space.sample()  # get a random action
        state, reward, terminated, truncated, info = env.step(a) # take the action and return the outcome
        if terminated or truncated:  # start a new game once the episode is over
            state, info = env.reset()
        
    env.close()
    
# run the demo 
demo()

## Building your RL player, i.e. training your policy.
We have to start by creating the game environment and checking some properties of the state and action space:

In [3]:
env = gym.make('MountainCar-v0')  # no rendering!

# get useful information about the environment:
state = env.reset()  # returns a tuple (observation, info)
print('start state:', state)
print('Number of actions in the action space:', env.action_space.n)
print('Lowest state in the state space:', env.observation_space.low)
print('Highest state in the state space:', env.observation_space.high)
#perform one step of the game for action a=1
a=1
state, reward, terminated, truncated, info = env.step(a)
print('After the step with (a=1):',state, reward, terminated, truncated, info )


start state: (array([-0.4962899,  0.       ], dtype=float32), {})
Number of actions in the action space: 3
Lowest state in the state space: [-1.2  -0.07]
Highest state in the state space: [0.6  0.07]
After the step with (a=1): [-4.9649450e-01 -2.0458747e-04] -1.0 False False {}


We see that the car starts out in a state made up of two floats, here roughly [-0.496, 0] (the start is randomized, so these numbers will differ each time you reset). Note that `env.reset()` returns a tuple of the state and an info dictionary. You can perform any of 3 actions (a = 0, 1 or 2). We don't know what the numbers in the state mean; they could be the $x$, $y$ coordinates of the car or the velocity and height, but **we also don't have to know!** We will let the RL algorithm learn how to drive the car regardless of the exact meaning of the state.

You should now code a function `s2q(s)` that links state `s` to a location in the Q-matrix. This can quickly be done by discretizing the state space into bins and determining the bin number that each state value falls into. The function should return a tuple (or list) `loc` that holds the two bin numbers.

In [4]:
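def s2q(s):
    # convert continuous state values to discrete location indices inside the Q matrix
    
    #----------ADD CODE HERE---------#
    position_bins = np.linspace(-1.2, 0.6, 20)    # 20 edges spanning the position range
    velocity_bins = np.linspace(-0.07, 0.07, 20)  # 20 edges spanning the velocity range
    # np.digitize returns 20 for values at or above the last edge (e.g. at maximum speed);
    # cap at 19 so the result is always a valid index for a Q-matrix axis of length 20
    pos_idx = min(int(np.digitize(s[0], position_bins)), 19)
    vel_idx = min(int(np.digitize(s[1], velocity_bins)), 19)
    return [pos_idx, vel_idx]

print(s2q(state))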



[8, 10]
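
To see where such bin numbers come from, here is a minimal sketch that applies `np.digitize` to a few hand-picked positions. The sample values and the names `edges`, `samples` and `capped` are illustrative assumptions and not part of the exercise.

```python
import numpy as np

# 20 equally spaced edges over the position range, as in s2q above
edges = np.linspace(-1.2, 0.6, 20)

# illustrative positions: lower bound, a typical start, upper bound
samples = np.array([-1.2, -0.5, 0.6])

# np.digitize returns i such that edges[i-1] <= x < edges[i]
idx = np.digitize(samples, edges)
print(idx)

# values at or above the last edge map to len(edges), so an axis of
# length 20 in the Q-matrix needs the index capped at 19
capped = np.minimum(idx, len(edges) - 1)
print(capped)
```

With edges placed on the bounds of the state space, index 0 is only returned for values below the lowest edge, so in practice one row of the Q-matrix stays unused with this binning.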


The next function `qLearn()` should train your Q-matrix by playing `num_games` games according to an 'epsilon-greedy' strategy (Google it!) and update the Q-matrix according to the following Bellman equation:

$$ \mathbf{Q}^{\rm new}[s_t,a_t]=(1-\alpha)\mathbf{Q}[s_t,a_t]+\alpha\left(R_t+\gamma\, \text{max}_a  \mathbf{Q}[s_{t+1},a]\right). $$

Here, $\alpha$ is the learning rate and $\gamma$ is the discount factor; both are bounded by $\alpha,\gamma\in[0,1]$. These parameters have to be set with care, as they influence the speed of convergence of the Q-matrix. The discount factor lets you weigh the importance of future rewards against immediate rewards. This is done by mixing in the term $\text{max}_a  \mathbf{Q}[s_{t+1},a]$, which gives the maximum Q value in the future state.
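
As a small worked example of the two ingredients, the sketch below shows one epsilon-greedy action choice and one Bellman update on a toy table. The helper names `epsilon_greedy` and `bellman_update`, the table `Q_toy` and all numbers in it are hypothetical and chosen only for illustration; this is not the solution to the exercise.

```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.uniform(0.0, 1.0) < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def bellman_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one update of the equation above to Q in place."""
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]

# toy table with 2 states and 3 actions (hypothetical numbers)
Q_toy = np.zeros((2, 3))
Q_toy[1] = [-1.0, -0.5, -2.0]

rng = np.random.default_rng(0)
a = epsilon_greedy(Q_toy[1], eps=0.0, rng=rng)  # eps=0 always acts greedily
new_q = bellman_update(Q_toy, s=0, a=2, r=-1.0, s_next=1, alpha=0.1, gamma=0.9)
print(a, new_q)
```

Here the greedy action in state 1 is the one with the largest Q value, and the updated entry moves a fraction $\alpha$ of the way from its old value towards $R_t+\gamma\,\text{max}_a\mathbf{Q}[s_{t+1},a]$.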

In [10]:
def qLearn(Q, α, γ, ϵ, ϵ_min, num_games):
    """ 
    learns the Q table by interacting with the environment and applying the Bellman equation 
    Q: q-table (n-dimensional ndarray), updated in place
    α: learning rate
    γ: discount factor
    ϵ: probability of taking a random action in the epsilon-greedy policy
    ϵ_min: minimum value ϵ can take when applying a reduction algorithm to ϵ
    num_games: number of games (episodes) to play
    """
    wins = 0

    #----------ADD CODE HERE---------#
    for i in range(num_games):
        state_pos, state_vel = s2q(env.reset()[0])  # reset() returns (observation, info)
        terminated = False
        truncated = False
        # a game ends when the goal is reached (terminated) or the step limit is hit (truncated)
        while not (terminated or truncated):
            # epsilon-greedy: explore with probability ϵ, otherwise exploit the current Q values
            if np.random.uniform(0, 1) < ϵ:
                a = env.action_space.sample()
            else:
                a = np.argmax(Q[state_pos, state_vel])

            next_state, reward, terminated, truncated, info = env.step(a)
            next_pos, next_vel = s2q(next_state)
            # Bellman update, algebraically the same as (1-α)Q + α(R + γ max_a Q[s_next, a])
            Q[state_pos, state_vel, a] += α * (reward + γ * np.max(Q[next_pos, next_vel]) - Q[state_pos, state_vel, a])
            state_pos, state_vel = next_pos, next_vel  # continue from the new state

        if terminated:
            wins += 1  # the car reached the flag before the step limit
        # decay ϵ after every game until it reaches ϵ_min
        if ϵ > ϵ_min:
            ϵ *= 0.99

    print(f'Training ended. Number of wins: {wins}')

Finally you put everything together. It is almost completely finished for you. What values for the hyperparameters do you choose?
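
One thing worth checking before choosing values is how quickly exploration dies out. The sketch below assumes the multiplicative decay of 0.99 per game used in `qLearn` above and counts how many games it takes until ϵ is no longer above ϵ_min. The helper name `games_until_floor` is hypothetical.

```python
def games_until_floor(eps, eps_min, decay=0.99):
    """Count decay steps until eps is no longer above eps_min."""
    games = 0
    while eps > eps_min:
        eps *= decay
        games += 1
    return games, eps

games, final_eps = games_until_floor(eps=0.1, eps_min=0.01)
print(games, final_eps)
```

Starting from ϵ = 0.1 with ϵ_min = 0.01, the floor is reached after 230 games under this assumption, so in a run of 1000 games most of the training happens with very little exploration.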

In [11]:
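# initialize the Q matrix as a numpy array with zeros
Qdim = (20, 20, 3)  # (position bins, velocity bins, actions)
Q = np.zeros(shape=Qdim)
# train on a fresh environment without rendering, so that training does not
# depend on a render window opened (and closed) by another cell
env = gym.make('MountainCar-v0')
# set the hyperparameters
α =   0.1
γ =   0.1
ϵ =   0.1
ϵ_min =   0.01
num_games = 1000
# train the agent and store results
qLearn(Q, α, γ, ϵ, ϵ_min, num_games)
np.save('qrun1.npy', Q)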

Once a Q-matrix has been trained we can use it as a policy and play a game. Write code that performs actions according to the input Q-matrix to play a single episode.

In [9]:
# replay the game using the trained Q matrix
Q = np.load('qrun1.npy')

# create and reset the environment with render mode on
env = gym.make('MountainCar-v0', render_mode='human')
state = env.reset()[0]
    
# play a single episode with max. 1000 actions
# (MountainCar-v0 itself truncates an episode after 200 steps)
for _ in range(1000):           
    state_pos, state_vel = s2q(state)
    a = np.argmax(Q[state_pos][state_vel])  # greedy action for the current state
    state, reward, terminated, truncated, info = env.step(a) 
        
    if terminated or truncated:  # goal reached or step limit hit
        print('Qplay Output:', reward, terminated, truncated, info)
        break

env.close()

# Experiments
Set up a couple of experiments to figure out the following things:
* How do $\alpha$ and $\gamma$ affect your learning performance?
* Are both elements of the state vector equally important, and can we reduce the dimensions of the Q-matrix for one (or both) of them?
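
One possible way to organize the first experiment is a simple grid sweep. The sketch below is only a scaffold under stated assumptions: `sweep` and `dummy_train` are hypothetical names, and `dummy_train` returns a placeholder score so that the code runs without the environment. For a real experiment it would be replaced by a function that trains a fresh Q-matrix for the given $\alpha$ and $\gamma$ and returns a score such as the number of wins, which `qLearn` above currently only prints.

```python
import itertools

def sweep(alphas, gammas, train_fn):
    """Call train_fn(alpha, gamma) for every combination and collect the scores."""
    results = {}
    for alpha, gamma in itertools.product(alphas, gammas):
        results[(alpha, gamma)] = train_fn(alpha, gamma)
    return results

def dummy_train(alpha, gamma):
    # placeholder score, NOT a real training result; swap in a function
    # that trains a fresh Q-matrix and returns e.g. the number of wins
    return round(alpha + gamma, 3)

results = sweep([0.1, 0.5], [0.5, 0.9, 0.99], dummy_train)
best = max(results, key=results.get)
print(best, results[best])
```

Because the start state and the exploration are random, repeating each setting a few times and comparing averages would give a more reliable picture than a single run.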