# COMP 484 - Practical Assignment 7

#### Ramraj Chimouriya
#### CE IV/I

## Book - Artificial Intelligence with Python
## Chapter 15 - Reinforcement Learning

___

### Reinforcement learning

Reinforcement learning is the process of learning what to do, that is, how to map situations to actions, in order to maximize a reward.

In most paradigms of machine learning, a learning agent is told what actions to take in order to achieve certain results. 

In the case of reinforcement learning, the learning agent is not told what actions to take. Instead, it must discover which actions yield the highest reward by trying them out. These actions tend to affect the immediate reward as well as the next situation, which means that all subsequent rewards are affected too.
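
The trial-and-error idea can be illustrated without any RL library. The following is a minimal sketch, not taken from the book: the three actions, their mean rewards, and the names `pull` and `discover_best_action` are hypothetical, chosen only for illustration. The agent is never told which action is best; it tries each one repeatedly and keeps an average of the rewards it sees.

```python
import random

# Hypothetical mean reward of each action; hidden from the agent
TRUE_MEANS = [0.2, 0.5, 0.8]


def pull(action, rng):
    """Return a noisy reward for the chosen action (noise bounded by +/-0.1)."""
    return TRUE_MEANS[action] + rng.uniform(-0.1, 0.1)


def discover_best_action(n_rounds=50, seed=0):
    """Try every action n_rounds times and return the one with the best average."""
    rng = random.Random(seed)
    totals = [0.0] * len(TRUE_MEANS)

    for _ in range(n_rounds):
        for action in range(len(TRUE_MEANS)):
            totals[action] += pull(action, rng)

    averages = [total / n_rounds for total in totals]
    best = max(range(len(averages)), key=lambda a: averages[a])
    return best, averages


best_action, estimates = discover_best_action()
print("Estimated rewards:", [round(e, 2) for e in estimates])
print("Best action found:", best_action)
```

Because the noise is bounded and the means are well separated, the agent settles on action 2. This sketch covers only immediate rewards; it does not model actions changing the next situation.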

### Creating an environment

In [5]:
import gym

In [27]:
def create_environment(input_env):
    name_map = {'cartpole': 'CartPole-v1', 
                'mountaincar': 'MountainCar-v0',
                'pendulum': 'Pendulum-v1',
                'taxi': 'Taxi-v3',
                'lake': 'FrozenLake-v1'}

    # Create the environment and reset it
    env = gym.make(name_map[input_env])
    env.reset()

    # Iterate 500 times
    for _ in range(500):
        # Render the environment
        env.render()

        # take a random action
        env.step(env.action_space.sample()) 

In [30]:
create_environment('cartpole')

___

### Building a learning agent

In [31]:
def learning_agent(input_env):
    name_map = {'cartpole': 'CartPole-v1',
                'mountaincar': 'MountainCar-v0',
                'pendulum': 'Pendulum-v1'}
    
    # Create the environment
    env = gym.make(name_map[input_env])
    
    # Start iterating
    for _ in range(20):
        # Reset the environment
        observation = env.reset()
        
        # Iterate 100 times
        for i in range(100):
            # Render the environment
            env.render()
            
            # Print the current observation
            print(observation)
            
            # Take action
            action = env.action_space.sample()
            
            # Extract the observation, reward, status and
            # other info based on the action taken
            observation, reward, done, info = env.step(action)
            
            # Check if it's done
            if done:
                print('Episode finished after {} timesteps'.format(i+1))
                break
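
The loop above unpacks four values from `env.step`, which matches older Gym releases. From Gym 0.26 onward (and in Gymnasium), `reset` returns an `(observation, info)` pair and `step` returns five values: `observation, reward, terminated, truncated, info`. The sketch below shows one way to accept both shapes. `reset_compat`, `step_compat`, and the stub `_FakeEnv` are hypothetical helpers written for illustration; they are not part of Gym.

```python
def reset_compat(env):
    """Return only the observation, whichever reset() shape the env uses."""
    result = env.reset()
    # Newer API: (observation, info), where info is a dict
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older API: the observation itself
    return result


def step_compat(env, action):
    """Return (observation, reward, done, info) for a 4- or 5-value step()."""
    result = env.step(action)
    if len(result) == 5:
        observation, reward, terminated, truncated, info = result
        # An episode is over if it terminated or was cut short
        return observation, reward, terminated or truncated, info
    return result


class _FakeEnv:
    """Stand-in environment that mimics the newer five-value API."""

    def reset(self):
        return [0.0, 0.0], {}

    def step(self, action):
        # observation, reward, terminated, truncated, info
        return [0.1, 0.0], 1.0, False, True, {}


env = _FakeEnv()
observation = reset_compat(env)
observation, reward, done, info = step_compat(env, 0)
print(observation, reward, done)
```

With helpers like these, the body of `learning_agent` could call `reset_compat(env)` and `step_compat(env, action)` in place of the direct calls and keep the rest of the loop unchanged.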

In [35]:
learning_agent('cartpole')

[ 0.02312142  0.02564569  0.0205996  -0.00056591]
[ 0.02363434 -0.16976553  0.02058828  0.29854462]
[ 0.02023903 -0.36517483  0.02655918  0.5976489 ]
[ 0.01293553 -0.56065816  0.03851216  0.898578  ]
[ 0.00172237 -0.7562803   0.05648372  1.2031134 ]
[-0.01340324 -0.95208526  0.08054598  1.5129498 ]
[-0.03244494 -0.7580257   0.11080498  1.2464591 ]
[-0.04760546 -0.95438045  0.13573416  1.5716951 ]
[-0.06669307 -1.1508359   0.16716807  1.9034512 ]
[-0.08970978 -1.3473247   0.20523709  2.2429945 ]
Episode finished after 10 timesteps
[ 0.04971329 -0.01266102  0.02794717 -0.03411588]
[ 0.04946008 -0.20817237  0.02726485  0.26725203]
[ 0.04529663 -0.4036726   0.03260989  0.56840825]
[ 0.03722318 -0.20902286  0.04397805  0.28617448]
[ 0.03304272 -0.0145548   0.04970154  0.00767981]
[ 0.03275162  0.17982043  0.04985514 -0.26891676]
[ 0.03634803  0.37419677  0.0444768  -0.5454677 ]
[ 0.04383197  0.178479    0.03356745 -0.23910947]
[ 0.04740155 -0.017106    0.02878526  0.06396974]
[ 0.04705943 -

## Book - Mastering Machine Learning with Python in Six Steps
## Chapter 6 - Reinforcement Learning

Reinforcement learning is a goal-oriented learning method based on an agent's interaction with its environment. The objective is to get the agent to act in that environment so as to maximize its rewards.

## Q-Learning

In [41]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from random import randint

%matplotlib inline

In [60]:
# defines the reward/link connection graph
R = np.matrix([[-1, -1, -1, -1,  0,  -1],
               [-1, -1, -1,  0, -1, 100],
               [-1, -1, -1,  0, -1,  -1],
               [-1,  0,  0, -1,  0,  -1],
               [ 0, -1, -1,  0, -1, 100],
               [-1,  0, -1, -1,  0, 100]]).astype("float32")
Q = np.zeros_like(R)

In [61]:
# Discount factor (gamma) applied to future rewards
gamma = 0.8
# Pick a random initial state from 0 to 4 (state 5 is the goal)
initial_state = randint(0,4)

Define a function that returns all available actions in the state given as an argument.

In [62]:
def available_actions(state):
    current_state_row = R[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act
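
As a quick check of what this function returns, the sketch below rebuilds the same reward matrix as a plain `np.array`. That is an assumption made here so the snippet is self-contained; the notebook itself uses `np.matrix`, which is why it indexes the `np.where` result with `[1]`. With a 1-D row, `np.where` returns a one-element tuple, so the index becomes `[0]`. The names `R_arr` and `available_actions_arr` are hypothetical.

```python
import numpy as np

# Same reward/link graph as above, stored as a plain 2-D array
R_arr = np.array([[-1, -1, -1, -1,  0,  -1],
                  [-1, -1, -1,  0, -1, 100],
                  [-1, -1, -1,  0, -1,  -1],
                  [-1,  0,  0, -1,  0,  -1],
                  [ 0, -1, -1,  0, -1, 100],
                  [-1,  0, -1, -1,  0, 100]], dtype="float32")


def available_actions_arr(state):
    """Return the column indices whose reward is non-negative for this state."""
    return np.where(R_arr[state] >= 0)[0]


# List the reachable states from every state
for state in range(R_arr.shape[0]):
    print(state, "->", available_actions_arr(state).tolist())
```

State 1, for instance, has entries `0` and `100` in columns 3 and 5, so its available actions are `[3, 5]`.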

Define another function that randomly chooses one action from the available actions.

In [63]:
def sample_next_action(available_actions_range):
    # Draw from the argument, not from the global available_act
    next_action = int(np.random.choice(available_actions_range))
    return next_action

Define a function that updates the Q matrix for the selected path according to the Q-learning algorithm.

In [64]:
def update(current_state, action, gamma):
    # Taking `action` moves the agent to state `action`; find the column(s)
    # holding the largest Q value in that next state's row
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]
    
    # Break ties at random
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    max_value = Q[action, max_index]
    
    # Q learning formula
    Q[current_state, action] = R[current_state, action] + gamma * max_value
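
Written out, the update implemented above is the following, where $s$ is `current_state`, $a$ is `action`, and $s'$ is the state reached by taking $a$ (in this graph, $s' = a$). There is no learning-rate term in this version; the new estimate simply overwrites the old one.

$$
Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a')
$$

For example, with $\gamma = 0.8$ and $Q$ still all zeros, the first update for $s = 1$, $a = 5$ gives $Q(1, 5) = 100 + 0.8 \times 0 = 100$.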

In [65]:
# Get available actions in the current state
available_act = available_actions(initial_state)

In [66]:
# Sample next action to be performed
action = sample_next_action(available_act)

In [67]:
# Train over 100 iterations (re-iterate the process above).
for i in range(100):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state,action,gamma)

In [68]:
# Normalize the "trained" Q matrix
print ("Trained Q matrix: \n", Q/np.max(Q)*100)

Trained Q matrix: 
 [[  0.           0.           0.           0.          73.04227948
    0.        ]
 [  0.           0.           0.          42.77895093   0.
  100.        ]
 [  0.           0.          34.49273705  42.77895093   0.
    0.        ]
 [  0.          49.8554796   34.22316015   0.          53.47368717
    0.        ]
 [ 39.88438547   0.           0.          39.88438547   0.
   91.3028419 ]
 [  0.          64.34512734   0.           0.          73.04227948
   91.3028419 ]]


In [69]:
# Testing
current_state = 2
steps = [current_state]

while current_state != 5:
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
    steps.append(next_step_index)
    current_state = next_step_index

In [70]:
# Print selected sequence of steps
print ("Best sequence path: ", steps)

Best sequence path:  [2, 3, 4, 5]
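
The `while` loop above runs until state 5 is reached, so it would not terminate if the Q matrix were too poorly trained to lead there. The sketch below is one hedged way to guard against that. `greedy_path` and its `max_steps` parameter are hypothetical names introduced here, and ties are resolved by `np.argmax` (first maximum) rather than at random. The matrix is the normalized output printed earlier, rounded to two decimals.

```python
import numpy as np

# Normalized Q matrix from the printed output, rounded to two decimals
Q_trained = np.array([
    [ 0.00,  0.00,  0.00,  0.00, 73.04,   0.00],
    [ 0.00,  0.00,  0.00, 42.78,  0.00, 100.00],
    [ 0.00,  0.00, 34.49, 42.78,  0.00,   0.00],
    [ 0.00, 49.86, 34.22,  0.00, 53.47,   0.00],
    [39.88,  0.00,  0.00, 39.88,  0.00,  91.30],
    [ 0.00, 64.35,  0.00,  0.00, 73.04,  91.30],
])


def greedy_path(Q, start, goal, max_steps=20):
    """Follow the highest-valued action from start until goal or max_steps."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        # Move to the state with the largest Q value in the current row
        state = int(np.argmax(Q[state]))
        path.append(state)
    return path


print("Best sequence path: ", greedy_path(Q_trained, start=2, goal=5))
```

On this matrix the function follows the same route as the loop above, `[2, 3, 4, 5]`. On an untrained all-zero matrix it stops after `max_steps` moves instead of looping indefinitely.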
