# Time to Practice ...

After going through the background on Reinforcement Learning and its taxonomy, it is time to practice. In this notebook, we will explore several problems and practice solving them with one of the most popular RL algorithms, Q-Learning. Over the course of this practical, we will go through:
1. A very simple implementation of Q-Learning, where you are required to implement the Q update function.
2. A few enhancements to that implementation: handling the exploration-exploitation trade-off and exploring the effect of different hyperparameter values on agent training.
3. An implementation of the Q function approximator used to run the Deep Q Network (DQN) model.


#### Objectives
- Experiment with basic to advanced environments in OpenAI Gym
- Implement a simple Q function for discrete and continuous state spaces
- Implement the epsilon-greedy exploration strategy with Q-Learning
- Experiment with Q function approximation using DQN on Atari games

Finally, ask your tutors if you need any help!


### Let's start ....



## First Exercise : Taxi Game

Taxi-v2 is an example of a toy text environment. There are four locations, labelled by different letters, and our job is to pick up the passenger at one location and drop them off at another. We receive +20 points for a successful drop-off and lose 1 point for every time step taken. There is also a 10-point penalty for illegal pick-up and drop-off actions.


The taxi is the only car in this parking lot.
We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. 
These 25 locations are one part of our state space. 

In the environment, there are four possible locations where the passenger can be picked up or dropped off: R, G, Y, and B, or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates.

The passenger can be at any of these four locations or, as one (1) additional state, inside the taxi. 
Taking all combinations of passenger locations and destination locations gives the total number of states for our taxi environment: 
there are four (4) destinations and five (4 + 1) passenger locations. 

So, our taxi environment has 5×5×5×4=500 total possible states. At each step, the agent observes one of these 500 states and takes an action. 
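
To make the count of 500 concrete, here is a minimal sketch of how the four state components can be packed into a single integer. The helper names `encode` and `decode` are hypothetical, and the ordering (taxi row, taxi column, passenger location, destination) is an assumption chosen to mirror the 5×5×5×4 product above; check the environment's own source before relying on a specific index.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, col, passenger, destination) into one index in [0, 500)."""
    index = taxi_row                    # 5 possible rows
    index = index * 5 + taxi_col        # 5 possible columns
    index = index * 5 + passenger_loc   # 4 locations + 1 "in the taxi"
    index = index * 4 + destination     # 4 possible destinations
    return index


def decode(index):
    """Invert encode(): recover the four components from a state index."""
    destination = index % 4
    index //= 4
    passenger_loc = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_loc, destination


# Enumerate every combination and confirm that each gets a distinct index.
all_states = {
    encode(r, c, p, d)
    for r in range(5) for c in range(5) for p in range(5) for d in range(4)
}
print(len(all_states))  # 500
```

Because every combination maps to a distinct integer, a Q-table with 500 rows (one per state) and 6 columns (one per action) covers the whole problem.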

We have six possible actions: pickup, drop-off, north, east, south, and west (the four directions are the moves by which the taxi is driven around the grid).


#### Let's start practicing ..
Import gym and load the environment.

In [0]:
import gym
import numpy as np
from IPython.display import clear_output

# Init Taxi-V2 Env
env = gym.make("Taxi-v2").env
env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



#### Let's check what the state space looks like

In [0]:
print("State space: ", env.observation_space)

State space:  Discrete(500)


#### And what actions are available?

In [0]:
print("Action space: ", env.action_space)

Action space:  Discrete(6)


#### Your task: Initialize the Q-table and implement the Q update function
In the following code, we have a simple implementation of the Q-Learning algorithm in which you are required to implement the Q update function.
Make sure you understand the flow of the algorithm well, as it is the basis for all the upcoming exercises.

Note: no exploration is considered in this implementation.
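
As a reminder, the standard tabular Q-learning update (the textbook rule, stated here for reference rather than as this notebook's own solution) is:

$$
Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right)
$$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor. In the code that follows, `old_value` plays the role of $Q(s_t, a_t)$, `reward` of $r_{t+1}$, and `next_max` of $\max_{a} Q(s_{t+1}, a)$.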

In [0]:
def QLearning(env, alpha, gamma, epsilon, episodes):
    
    # initialize Q values
    Q = # < Add your code here>  

    all_epochs = []
    all_penalties = []
    frames = []
    for i in range(1, episodes):
        state = env.reset()
        
        # initialize variables
        epochs, penalties, reward = 0, 0, 0
        done = False

        while not done:
            action = np.argmax(Q[state])

            next_state, reward, done, info = env.step(action)

            # Put each rendered frame into dict for animation
            frames.append({
                'frame': env.render(mode='ansi'),
                'state': next_state,
                'action': action,
                'reward': reward
                }
            )
            
            old_value = Q[state, action]
            next_max = np.max(Q[next_state])

            # Update q-table
            # < Add your code here>  
            
            
            if reward == -10:
                penalties += 1

            state = next_state
            epochs += 1

        if i % 100 == 0:
            clear_output(wait=True)
            print("Episode: %s" % i)
            env.render()
    return frames

import time
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print("Timestep: %d" % (i + 1))
        print("State: %s" % frame['state'])
        print("Action: %s" % frame['action'])
        print("Reward: %s" % frame['reward'])
        time.sleep(.1)

#### Now let's initialize the Q-Learning parameters and see the algorithm in action.

In [0]:
# Hyperparameter initialization
alpha = 0.1
gamma = 0.6
epsilon = 0.1
episodes = 100001
            
frames = QLearning(env, alpha, gamma, epsilon, episodes)

#### Let's see how the agent's actions evolve over multiple episodes
Printing the steps in slow motion:

In [0]:
# print the last 200 time steps to see whether the agent was able to learn anything
print_frames(frames[-200:])

In [0]:
env.close()

## Second Exercise : Mountain Car

A car is on a one-dimensional track, positioned between two mountains. The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.

Mountain Car, in contrast to Taxi, has states in a continuous space (a Box space). The state of the car is represented by two continuous elements:

1.   car position
2.   car velocity

However, the actions of the mountain car are discrete:

0: push left, 
1: no push, and 
2: push right

In [0]:
import gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('MountainCar-v0')

print("State space: ", env.observation_space)
print("low: %s \n high: %s" %(env.observation_space.low, env.observation_space.high))
print("Action space: ", env.action_space)

State space:  Box(2,)
low: [-1.2  -0.07] 
 high: [0.6  0.07]
Action space:  Discrete(3)
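
A Q-table needs discrete indices, so the continuous observation has to be mapped onto a grid before it can be used for lookup. The sketch below shows one way to do this, using the bounds printed above and a resolution of 10 bins per unit of position and 100 bins per unit of velocity. The helper name `discretize` is hypothetical, and the resolution is a tunable choice rather than a property of the environment.

```python
import numpy as np

# Observation bounds (position, velocity), as printed above.
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])

# Grid resolution: 10 bins per unit of position, 100 per unit of velocity.
scale = np.array([10, 100])


def discretize(state):
    """Map a continuous (position, velocity) pair to integer grid indices."""
    return np.round((np.asarray(state) - low) * scale, 0).astype(int)


# Number of grid cells along each dimension (the +1 covers both end points).
num_states = np.round((high - low) * scale, 0).astype(int) + 1
print(num_states)                # [19 15]
print(discretize([-0.5, 0.0]))   # [7 7]
```

With this grid, a Q-table of shape (19, 15, 3) holds one value per (position bin, velocity bin, action) triple.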


#### Your tasks:  
1.   Implement the epsilon-greedy strategy. 
2.   Reuse the Q update rule you implemented in the previous exercise.
3.   Track and plot the change in reward over the training episodes, e.g. plot the average reward every x episodes.
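
For the first task, the idea of epsilon-greedy selection can be sketched on a plain 1-D array of action values. This is a generic illustration, not the exercise solution: the function name `epsilon_greedy` and the seeded generator `rng` are assumptions made for the example, and you will still need to adapt the indexing to the discretized Q-table used in this exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the example repeatable


def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action), otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: best-known action


q_values = np.array([0.1, 0.9, 0.3])
print(epsilon_greedy(q_values, epsilon=0.0))  # 1, always greedy when epsilon is 0
```

With `epsilon=1.0` every action is chosen uniformly at random, and values in between trade exploration against exploitation.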


In [0]:
def QLearning(env, alpha, gamma, epsilon, min_eps, episodes):
    
    # Determine size of discretized state space
    num_states = (env.observation_space.high - env.observation_space.low)*\
                    np.array([10, 100])
    num_states = np.round(num_states, 0).astype(int) + 1
    
    # Initialize q-table
    Q = np.random.uniform(low = -1, high = 1, 
                          size = (num_states[0], num_states[1], 
                                  env.action_space.n))
    
    # Initialize variables to track rewards
    reward_list = []
    ave_reward_list = []
    
    # Calculate episodic reduction in epsilon
    reduction = (epsilon - min_eps)/episodes
    
    # Run Q learning algorithm
    for i in range(episodes):
        # Initialize parameters
        done = False
        tot_reward, reward = 0,0
        state = env.reset()
        
        # Discretize state
        state_adj = (state - env.observation_space.low) * np.array([10, 100])
        state_adj = np.round(state_adj, 0).astype(int)
    
        while not done:
            # Render environment for the last 20 episodes
            if i >= (episodes - 20):
                env.render()
                
            # Determine next action - epsilon greedy strategy
            # < Add your code here>
            
            
                
            # Get next state and reward
            # < Add your code here>

            
            # Discretize next state
            next_state_adj = (next_state - env.observation_space.low) * np.array([10, 100])
            next_state_adj = np.round(next_state_adj, 0).astype(int)
            
            # Allow for terminal states
            if done and next_state[0] >= 0.5:
                Q[state_adj[0], state_adj[1], action] = reward
                
            # Adjust Q value for current state
            else:
                
                # Reuse your implementation for the Q update function in the previous example
                # < Add your code here>
                
               
            
            # Update variables
            tot_reward += reward
            state_adj = next_state_adj
        
        # Decay epsilon
        if epsilon > min_eps:
            epsilon -= reduction
        
        # Track rewards
        reward_list.append(tot_reward)
        
        if (i+1) % 100 == 0:
            ave_reward = np.mean(reward_list)
            ave_reward_list.append(ave_reward)
            reward_list = []
            print('Episode {} Average Reward: {}'.format(i+1, ave_reward))
            env.render()
    
    return ave_reward_list

In [0]:
# Run Q-learning to train our agent; this will take a while, especially the rendering part
env.reset()

# Initialize the hyperparameters, then train the agent.

# Note: epsilon controls exploration. As training progresses, the agent
# should increasingly exploit what it has learned, so epsilon is reduced
# over the episodes, following a linear schedule from epsilon down to min_eps.
# epsilon and min_eps are probabilities, which should give you an intuition about their range.
# gamma is best chosen in the range 0.7-0.99, as a gamma of 1 loses the benefit of discounting.
# alpha is the learning rate; use a value in the range 0.15-0.5.
# Finally, set the number of episodes; in general, more episodes give better results.
# Use a number of episodes in the range 30,000 to 70,000.
# < Add your code here>



rewards =  # < Add your code here>

In [0]:
# close rendering window
env.close()

In [0]:
# Plot Rewards vs Episodes
# Note: you will need to modify the QLearning function to keep track of the reward value for each episode

# < Add your code here>



## Third Exercise: DQN

In this exercise, we are going to practice implementing DQN to learn an Atari game, in this case Pong. But before we dig into the details of the DQN model, let's first apply a few preprocessing steps to our data (the game frames).

For preprocessing, we are going to use OpenAI Gym wrappers. A wrapper lets us transform what an environment returns (here, its observations) without modifying the environment itself.

Starting simple, we are going to apply only two wrappers:

1.   frame processing (downsampling and greyscaling)
2.   image normalization

as you can see in the make_env() function below, which we will use to create our environment later.

In [0]:
import cv2
import gym
import random
import numpy as np
from itertools import combinations
from collections import deque

# Taken from OpenAI baseline wrappers
# https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py


class ProcessFrame84(gym.ObservationWrapper):
    """
    Downsamples image to 84x84
    Greyscales image

    Returns numpy array
    """
    def __init__(self, env=None):
        super(ProcessFrame84, self).__init__(env)
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)

    def observation(self, obs):
        return ProcessFrame84.process(obs)

    @staticmethod
    def process(frame):
        if frame.size == 210 * 160 * 3:
            img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
        elif frame.size == 250 * 160 * 3:
            img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
        else:
            assert False, "Unknown resolution."
        img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
        resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_AREA)
        x_t = resized_screen[18:102, :]
        x_t = np.reshape(x_t, [84, 84, 1])
        return x_t.astype(np.uint8)

    
class ScaledFloatFrame(gym.ObservationWrapper):
    """Normalize pixel values in frame --> 0 to 1"""
    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0

def make_env(env_name):
    env = gym.make(env_name)
    env = ProcessFrame84(env)
    return ScaledFloatFrame(env)

### Time to build our DQN approximator

In [0]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

class DQNAgent:
    
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=100000)
        self.gamma = 0.99    
        self.epsilon = 1.0
        self.epsilon_min = 0.02
        self.epsilon_decay = 0.995
        self.learning_rate = 0.0001
        self.model = self._build_model()

    def _build_model(self):
        
        # This function should return a model that learns the Q-function
        # and can be trained further when desired. In other words, build
        # your computational graph based on the input shape and the number
        # of actions. Using Keras is preferable.
        
        # < Add your code here>
        
        
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    #get action
    def act(self, state):
        # select random action with prob=epsilon else action=maxQ
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        
        # You have your model and your state; now use them
        # to get the action values
        # < Add your code here>
         
        
        return np.argmax(act_values[0])  # returns action

    def replay(self, batch_size):
        # experience replay
        # sample random transitions
        minibatch = random.sample(self.memory, batch_size)

        for state, action, reward, next_state, done in minibatch:
            target = reward

            if not done:
                # calculate the target for each transition in the minibatch
                Q_next = self.model.predict(next_state)[0]  # Q values of the next state
                target = reward + self.gamma * np.amax(Q_next)  # Bellman equation

            target_f = self.model.predict(state)
            target_f[0][action] = target

            #train network
            self.model.fit(state, target_f, epochs=1, verbose=0)

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        
    def learn(self, env, episodes, batch_size):
        
        for e in range(episodes):
            state = env.reset()
            state = np.reshape(state, [1, self.state_size])
            done = False
            
            total_reward = 0
            while not done:
                # epsilon-greedy action
                action = self.act(state)

                next_state, reward, done, _ = env.step(action)

                total_reward += reward 
                
                next_state = np.reshape(next_state, [1, self.state_size])

                #add to experience memory
                self.remember(state, action, reward, next_state, done)

                state = next_state

                if done:
                    print("episode: {}/{}, reward: {}"
                          .format(e, episodes, total_reward))
                    break
                    
            #experience replay
            if len(self.memory) > batch_size:
                self.replay(batch_size)


All together

In [0]:
import gym
import random
import numpy as np
from itertools import combinations 

env = make_env('PongNoFrameskip-v4')

env.reset()
state_size = np.prod(env.observation_space.shape)
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)
print(state_size)

batch_size = 32
episodes = 500

agent.learn(env, episodes, batch_size)


### Finally, let's see how an agent behaves when it is trained for much longer, say 5 million time steps (here, a PPO2 agent loaded with OpenAI Baselines)!


In [0]:
!python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=pong_5M_ppo2 --play