# Little Mario Game - Using Q-learning


## Chapter 21 - Reinforcement Learning - Q-learning Algorithm

In this notebook we will build a simple Mario game and solve it using the Q-learning algorithm.

<img src="mario.jpg" alt="Drawing" style="width: 300px;"/>

Our Mario game won't be as complex as the actual Mario game. It will look something like this:

<img src="mario_game.png" alt="Drawing" style="width: 600px;"/>

In this game, Little Mario starts at cell 0 and has to clear every obstacle to reach the winning cell, which is the 10th position. The obstacles are at cells 3, 5 and 8, so when Mario is on cell 2 he needs to jump onto cell 3, and likewise from cell 4 onto cell 5 and from cell 7 onto cell 8, to win the game.

The possible actions he can take are:
0. MOVE STRAIGHT
1. JUMP UP

The state he receives after taking an action is simply the number of the cell he lands on.

Now let's define the rewards:
1. When Mario needs to jump:
    a) If he jumps, the reward is -0.04 and the game continues.
    b) If he does not jump, the reward is -1 and the game is over.
2. When Mario needs to go straight:
    a) If he goes straight, the reward is -0.04 and the game continues.
    b) If he jumps, the reward is -0.1 (to make sure the agent does not jump unnecessarily) and the game continues.
3. On reaching the 10th state, the goal cell, the reward is +1 and the game is over.

In [None]:
import random


class Mario:
    """Environment for the Little Mario game.

    The method get_state_and_reward() returns whether the game is over,
    the next state, and the reward received.
    """

    def __init__(self):
        self.obstacles = [3, 5, 8]
        self.terminal_state = 10

    def get_state_and_reward(self, state, action):
        # Mario must jump whenever the next cell holds an obstacle.
        need_to_jump = (state + 1) in self.obstacles

        if need_to_jump:
            if action == 1:
                # Jumped onto the obstacle cell: step cost, game continues.
                return False, state + 1, -0.04
            # Did not jump: game over.
            return True, state + 1, -1
        elif state == self.terminal_state - 1:
            # Either action from the last cell reaches the goal.
            return True, state + 1, 1
        else:
            if action == 0:
                return False, state + 1, -0.04
            # Unnecessary jump: larger penalty, game continues.
            return False, state + 1, -0.1


## Q-Learning Algorithm

In the Q-learning algorithm, we learn a Q-value for each action that can be taken from a state. The Q-value of an action is the expected discounted sum of future rewards we can get if that action is taken from the current state.

$$Q(s_t, a_t) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid s_t, a_t\right]$$

Here $\gamma$ is the discount factor, $R_{t+1}$ is the reward at time step $t+1$, and so on.
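
To make the definition concrete, here is a small sketch that evaluates the discounted sum for one particular episode of this game: Mario starts at cell 0 and makes the correct move at every step, so he collects nine rewards of -0.04 followed by +1 at the goal. The helper `discounted_return` and the constant `GAMMA = 0.9` (the same value as the `discount` setting used in the training cell) are illustrative names, not part of the original notebook code.

```python
GAMMA = 0.9  # assumed discount factor, matching `discount` in the training cell

# Rewards for a flawless run from cell 0:
# nine moves at -0.04 each (cells 0 through 8), then +1 on entering cell 10.
rewards = [-0.04] * 9 + [1.0]


def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


g = discounted_return(rewards, GAMMA)
print(round(g, 4))
```

With these numbers the return comes to roughly 0.142: seen from the start, the +1 at the goal is only worth about 0.9^9 ≈ 0.387, and the accumulated step costs take away the rest.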

The Q-function takes two inputs, a state and an action, and returns the expected future reward. In this algorithm we experience the environment again and again, like playing the game several times. Every time an action is taken, we update its Q-value, which was initially set at random. The update is performed according to the following equation:

$$Q(S_t, A_t) = Q(S_t, A_t) + \alpha \left[R + \gamma \max_a Q(S', a) - Q(S_t, A_t)\right]$$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor.
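
As a quick worked example of this update, the sketch below applies it once. The function name `q_update` and the input numbers (a current Q-value of 0.5 and a best next-state Q-value of 0.8) are made up purely for illustration; only the values of alpha and gamma are taken from the training cell.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-learning update and return the new Q(s, a)."""
    td_target = reward + gamma * max_q_next   # R + gamma * max_a Q(S', a)
    return q_sa + alpha * (td_target - q_sa)  # move a fraction alpha toward the target


# Hypothetical numbers: Q(s, a) = 0.5, step reward -0.04, best next Q-value 0.8.
new_q = q_update(q_sa=0.5, reward=-0.04, max_q_next=0.8, alpha=0.2, gamma=0.9)
print(round(new_q, 3))
```

The target here is -0.04 + 0.9 × 0.8 = 0.68, so the Q-value moves one fifth of the way from 0.5 toward 0.68, giving 0.536.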

In [None]:
'''
First we initialize the Q-table.
In this implementation QValues[i][0] -> Q-value for the action Going Straight from state i.
                       QValues[i][1] -> Q-value for the action Jumping from state i.
'''

QValues = [[0, 0] for i in range(11)]
alpha = 0.2      # learning rate
epsilon = 0.2    # probability of taking a random (exploratory) action
discount = 0.9   # discount factor (gamma)
game = Mario()
for i in range(1, 11):
    QValues[i][0] = random.uniform(0, 1)
    QValues[i][1] = random.uniform(0, 1)
    print('Initial Q-values for state', i, 'is: ', "{0:.3f}".format(QValues[i][0]), "{0:.3f}".format(QValues[i][1]))

# Play the game numgames times
numgames = 1000
for i in range(numgames):
    state = 0
    done = False
    while not done:
        # Epsilon-greedy action selection: explore with probability epsilon,
        # otherwise take the action with the highest current Q-value.
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, 1)
        else:
            action = QValues[state].index(max(QValues[state]))
        done, next_state, reward = game.get_state_and_reward(state, action)
        # When the game ends there is no future reward, so the target is just
        # the reward; otherwise bootstrap from the best Q-value of the next state.
        if done:
            target = reward
        else:
            target = reward + discount * max(QValues[next_state])
        currval = QValues[state][action]
        QValues[state][action] = currval + alpha * (target - currval)
        state = next_state

print()
print('Q-values after playing', numgames, 'games')
for i in range(1, 11):
    print('Q-values for state', i, 'is: ', "{0:.3f}".format(QValues[i][0]), "{0:.3f}".format(QValues[i][1]))



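So we see that after playing 1000 games, the Q-value for jumping is clearly the higher one in states 2, 4 and 7, whereas in the other states going straight has a slightly higher value than jumping. The last two states are exceptions: from state 9 either action reaches the goal, and state 10 is terminal, so their values do not express a preference.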
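
Because each state here leads deterministically to the next one, the optimal Q-values can also be computed exactly by working backwards from the goal, which gives a reference to compare the learned table against. The sketch below does this under the assumption that a finished game contributes no future reward. The names `step`, `q_star`, `value` and `greedy` are hypothetical helpers written for this example, and `step` restates the reward rules of the `Mario` class so that the block runs on its own.

```python
OBSTACLES = {3, 5, 8}  # same obstacle cells as in the Mario class
GOAL = 10
GAMMA = 0.9            # same value as `discount` in the training cell


def step(state, action):
    """Return (done, next_state, reward) following the reward rules above."""
    if state + 1 in OBSTACLES:
        if action == 1:
            return False, state + 1, -0.04
        return True, state + 1, -1.0
    if state == GOAL - 1:
        return True, state + 1, 1.0
    return False, state + 1, (-0.04 if action == 0 else -0.1)


# Backward induction: state s only ever leads to s + 1,
# so one sweep from the goal back to cell 0 is enough.
q_star = {}
value = {GOAL: 0.0}  # nothing left to collect once the goal is reached
for s in range(GOAL - 1, -1, -1):
    q_star[s] = []
    for a in (0, 1):
        done, nxt, r = step(s, a)
        q_star[s].append(r if done else r + GAMMA * value[nxt])
    value[s] = max(q_star[s])

# Greedy policy: index of the larger Q-value (0 = straight, 1 = jump).
greedy = {s: q_star[s].index(max(q_star[s])) for s in q_star}
for s in sorted(q_star):
    label = "JUMP" if greedy[s] == 1 else "STRAIGHT"
    print(s, [round(q, 3) for q in q_star[s]], label)
```

In this reference table, jumping is preferred only in states 2, 4 and 7, and everywhere else up to state 8 going straight is ahead by 0.06, the difference between the two step penalties. The learned Q-values may differ from these depending on the number of games played, the random initialisation, and how the final step of each game is handled in the update.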

Feel free to move the obstacles to other positions, or even increase the number of cells beyond 10, and observe the Q-values of the different states.

Go ahead and try the Practise Notebook for a slightly more involved environment.