# Assignment 5 - Q-Learning Agent by Abdullah Karagøz

In [1]:
import numpy as np


The QLearningCalculator class provides only a few helper functions for answering the assignment questions. States and actions are numbered from 1, as described in the assignment, but the code indexes them from 0 internally.

We assume the number of states equals the number of actions, where an action simply names the state to move to. Not every state can be reached from every other state. The assignment text does not specify a penalty for hitting a wall, so that reward is simply 0. Unavailable actions (such as moving into a wall) are marked as None in the reward matrix (nan in NumPy); taking one sets the corresponding Q value to 0 and leaves the agent where it is.

States 1 to 6, and likewise the actions, map to indices 0 to 5 of the NumPy matrices. We assume all states and available actions are known in advance.
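The indexing convention above can be illustrated with a small sketch. It uses plain Python lists and `math.nan` so that it runs without NumPy; the helper name `available_actions` is hypothetical and not part of the assignment code. The reward values are copied from the R matrix defined later in this notebook.

```python
import math

NAN = math.nan  # marks an unavailable action (e.g. hitting a wall)

# Reward table: row = state index (0..5), column = action index (0..5).
R = [
    [NAN, 0, NAN, 0, NAN, NAN],
    [0, NAN, -10, NAN, 0, NAN],
    [NAN, 0, NAN, NAN, NAN, 100],
    [0, NAN, NAN, NAN, 0, NAN],
    [NAN, 0, NAN, 0, NAN, 100],
    [NAN, NAN, -10, NAN, 0, NAN],
]


def available_actions(state):
    """Return the 1-based actions that are available from a 1-based state."""
    row = R[state - 1]  # shift to the 0-based index used internally
    return [a + 1 for a, reward in enumerate(row) if not math.isnan(reward)]


print(available_actions(2))  # [1, 3, 5]
print(available_actions(4))  # [1, 5]
```

Because an action is just the number of the target state, the returned list is also the list of states reachable in one step.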

In [2]:
class QLearningCalculator():
    '''
    This class only calculates some steps of the Q-learning algorithm. It is not a class that fully
    implements Q-learning. States and actions start from 1.
    
    Attributes:
        R (NumPy matrix): Reward matrix, one row per state, one column per action. 
            There are equal numbers of actions and states.
        Q (NumPy matrix): Q matrix
        gamma (float): discounting factor to discount future rewards
        lr (float): learning rate
        state (int): the state our agent is in
    '''
    
    def __init__(self, R, nr_of_states, discount_factor, learning_rate, init_state):
        '''
        Initializing the QLearning Calculator class.
        
        Parameters:
            R (NumPy matrix): Reward matrix, one row per state, one column per action.
            nr_of_states (int): How many states and actions there should be (used to initialize Q array)
            discount_factor (float): discounting factor to discount future rewards
            learning_rate (float): learning rate
            init_state (int): the state our agent is in at the start    
        '''
        self.R = R
        # Q matrix, initialized to zeros. The number of states is known in advance.
        self.Q = np.zeros((nr_of_states, nr_of_states), dtype=np.float64) 
        self.gamma = discount_factor
        self.lr = learning_rate
        self.state = init_state-1
    
    
    def set_state(self, state):
        '''
        Function to set state of the agent.
        
        Parameters:
            state (int): The state of the agent. Starts from 1.
        '''
        self.state = state-1
    
    
    def action(self, action):
        '''
        Function to apply action.
        
        Parameters:
            action (int): The action to take, starts from 1.
            
        Returns:
            int: the next state that action put the agent into. Starts from 1.
        '''
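        action -= 1  # convert to the 0-based index used internally
        if np.isnan(self.R[self.state, action]):
            # Unavailable action (e.g. a wall): Q value is 0 and the agent stays put.
            q_value = 0
            next_state = self.state
        else:
            # The action index is also the index of the state it leads to.
            next_state = action
            q_value = ((1 - self.lr) * self.Q[self.state, action]
                       + self.lr * (self.R[self.state, action] + self.gamma * np.max(self.Q[next_state])))
        
        self.Q[self.state, action] = q_value
        self.state = next_state
        # Convert back to the 1-based numbering promised in the docstring.
        return self.state + 1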
        
    def next_action(self, state):
        '''
        Function to return the greedy next action (the one with the highest Q value) for a given state.
        
        Parameters:
            state (int): The given state. Starts from 1.
            
        Returns:
            int: the action with the highest Q value in the given state. Starts from 1.
        '''
        next_action = np.argmax(self.Q[state-1]) + 1
        return next_action
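
For reference, the update that the `action` method applies for an available action can be written as follows, where $s$ is the current state, $a$ the chosen action, $s'$ the resulting state, $\alpha$ the learning rate (`lr`) and $\gamma$ the discount factor (`gamma`). The symbols are notation introduced here for illustration, not taken from the assignment text.

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') \right]
$$

With $\alpha = 1$, as used in the runs in this notebook, the old estimate drops out and the update reduces to:

$$
Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a')
$$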
    
    
    
        

### Initializing the R matrix, discount factor and learning rate

In [3]:
# Reward matrix
R = np.array([[None, 0,    None, 0,    None, None],
              [0,    None, -10,  None, 0,    None],
              [None, 0,    None, None, None, 100],
              [0,    None, None, None, 0,    None],
              [None, 0,    None, 0,    None, 100],
              [None, None, -10,  None, 0,    None]], dtype=np.float64)

dis_fac = 0.9 # Discount factor
lr = 1.0 # Learning rate

In [4]:
ql_calc = QLearningCalculator(R, len(R), dis_fac, lr, 4)

### Running the actions

In [5]:
ql_calc.action(5)
ql_calc.action(6)
ql_calc.set_state(4)
ql_calc.action(5)
ql_calc.set_state(1)
ql_calc.action(4)
print("The Q matrix looks like this:\n", ql_calc.Q) 

The Q matrix looks like this:
 [[  0.   0.   0.  81.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.  90.   0.]
 [  0.   0.   0.   0.   0. 100.]
 [  0.   0.   0.   0.   0.   0.]]
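
The non-zero entries above can be checked by replaying the four updates by hand. The following is a minimal sketch in plain Python (no NumPy), assuming the same discount factor and learning rate as the notebook; the dictionary-based `Q` and the helpers `max_q` and `update` are illustrative names, not part of QLearningCalculator.

```python
GAMMA = 0.9  # discount factor used in the notebook
ALPHA = 1.0  # learning rate used in the notebook

# Rewards for the transitions that are actually taken: (state, action) -> reward.
rewards = {(4, 5): 0, (5, 6): 100, (1, 4): 0}

# Q table as a dict; missing entries count as 0, like the zero-initialized matrix.
Q = {}


def max_q(state):
    """Largest Q value over the actions of `state`, never below the initial 0."""
    return max([0.0] + [v for (s, _), v in Q.items() if s == state])


def update(state, action):
    """Apply one Q-learning update; the action number is also the next state."""
    old = Q.get((state, action), 0.0)
    target = rewards[(state, action)] + GAMMA * max_q(action)
    Q[(state, action)] = (1 - ALPHA) * old + ALPHA * target


# Same sequence as the cell above: 4->5, 5->6, reset to 4, 4->5, reset to 1, 1->4.
for s, a in [(4, 5), (5, 6), (4, 5), (1, 4)]:
    update(s, a)
    print(f"Q({s},{a}) = {Q[(s, a)]:.1f}")
```

The first 4->5 step yields 0 because nothing is known about state 5 yet. Only after 5->6 has earned the reward of 100 does the second 4->5 step pick up 0.9 * 100 = 90, and 1->4 then gets 0.9 * 90 = 81.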


### Printing the most probable path from state 2 
Below we start from state 2 and repeatedly pick the next action with np.argmax over the Q values of the current state.

In [6]:
# We start from state 2
state = 2
ql_calc.set_state(2)
path = str(2)
while state != 6:
    # The greedy action is also the number of the next state.
    state = ql_calc.next_action(state)
    path += " --> " + str(state)

print("The most probable path is:", path)

The most probable path is: 2 --> 1 --> 4 --> 5 --> 6


In state 2, every action has a Q value of zero, so all actions are equally likely to be chosen.
But if it chooses 1 (which is what np.argmax() does, because it returns the first index when values are tied), it will move to 4, then 5, and then 6.

So the path will probably be: 2 --> 1 --> 4 --> 5 --> 6
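
The first-index tie-breaking can be seen in a small stand-alone sketch. The helper `argmax_first` is a hypothetical plain-Python stand-in that mimics how np.argmax resolves ties; the two rows are the Q values of states 2 and 1 after the updates above.

```python
def argmax_first(values):
    """Index of the largest value; the first one wins on ties (like np.argmax)."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:  # strict comparison keeps the earliest maximum
            best = i
    return best


q_row_state_2 = [0.0] * 6  # state 2 has no learned values yet
q_row_state_1 = [0.0, 0.0, 0.0, 81.0, 0.0, 0.0]

print(argmax_first(q_row_state_2) + 1)  # 1
print(argmax_first(q_row_state_1) + 1)  # 4
```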

But if it breaks ties between the maximal Q values at random, other paths are possible, such as the following:

If it chooses 5, it then moves directly to 6: 2 --> 5 --> 6

If it chooses 3, it learns that choosing 3 from 2 is a bad choice. From 3, however, moving to 2 and moving to 6 are equally likely.
So the path can then be: 

2 --> 3 --> 6 

or 

2 --> 3 --> 2 --> 1 --> 4 --> 5 --> 6 

or 

2 --> 3 --> 2 --> 5 --> 6

So the probable paths are:

2 --> 1 --> 4 --> 5 --> 6

2 --> 5 --> 6

2 --> 3 --> 6

2 --> 3 --> 2 --> 1 --> 4 --> 5 --> 6

2 --> 3 --> 2 --> 5 --> 6
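
Paths like these can be sampled with a random tie-break. The sketch below is only an illustration under stated assumptions: it uses plain Python instead of NumPy, restricts the choice to available actions (which `next_action` does not do), and keeps Q fixed while walking, so nothing is learned along the way and a path may revisit states 2 and 3 more than once. The names `random_greedy_action` and `sample_path` are hypothetical.

```python
import math
import random

NAN = math.nan  # marks an unavailable action

# Reward table from the notebook (row = state index, column = action index).
R = [
    [NAN, 0, NAN, 0, NAN, NAN],
    [0, NAN, -10, NAN, 0, NAN],
    [NAN, 0, NAN, NAN, NAN, 100],
    [0, NAN, NAN, NAN, 0, NAN],
    [NAN, 0, NAN, 0, NAN, 100],
    [NAN, NAN, -10, NAN, 0, NAN],
]

# Q values after the four updates shown earlier (all other entries are 0).
Q = [[0.0] * 6 for _ in range(6)]
Q[0][3] = 81.0
Q[3][4] = 90.0
Q[4][5] = 100.0


def random_greedy_action(state, rng):
    """Pick uniformly among the available actions with the highest Q value."""
    row_r, row_q = R[state - 1], Q[state - 1]
    candidates = [a for a in range(6) if not math.isnan(row_r[a])]
    best = max(row_q[a] for a in candidates)
    return rng.choice([a for a in candidates if row_q[a] == best]) + 1


def sample_path(start, goal, rng, max_steps=1000):
    """Follow random-tie-break greedy actions until the goal (or a step cap)."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        path.append(random_greedy_action(path[-1], rng))
    return path


rng = random.Random(0)  # fixed seed so repeated runs print the same paths
for _ in range(5):
    print(" --> ".join(map(str, sample_path(2, 6, rng))))
```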