#### Reinforcement Learning Agent to Play the **Frozen Lake** Game!

**Game Rules:**
- The world is a 3x3 grid, 0-indexed, with cells numbered 0 to 8 in row-major order.
- Starting from cell 0, i.e. (0,0), the player moves through the grid in order to maximise the total reward.
- The player receives a reward of +1 on entering cell 4 or cell 6 (Treasure).
- The player receives a reward of -10 and the game terminates on entering cell 5 (Danger!).
- The player receives a reward of +10 and the game terminates on entering cell 8 (End Point).
- Each step costs -0.1, which pushes the RL agent toward the shortest rewarding path.
- At each turn, the player can move in one of 4 directions: Up, Down, Left, Right.
- As implemented in the `Environment` class, the grid wraps around: moving off one edge re-enters from the opposite edge.
- The environment is stochastic:
 **Once the player chooses an action (direction), the player takes one step in that direction with probability 0.5 and slips, taking 2 steps in that direction, with probability 0.5.**
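
The cell numbering and rewards in the rules can be sketched as a small lookup. This is only an illustration, assuming row-major numbering (cell = 3 * row + col), which matches the reward matrix used by the `Environment` class; the names `REWARD_GRID`, `TERMINAL_CELLS`, `cell_to_rc` and `reward_of` are hypothetical and not part of the notebook.

```python
# Reward for entering each cell, laid out as the 3x3 grid (row-major).
REWARD_GRID = [
    [0, 0, 0],
    [0, 1, -10],
    [1, 0, 10],
]
TERMINAL_CELLS = {5, 8}   # Danger and End Point end the game


def cell_to_rc(cell):
    """Convert a cell number 0..8 to (row, col); hypothetical helper."""
    return divmod(cell, 3)


def reward_of(cell):
    """Reward for entering `cell`, before any step penalty is applied."""
    row, col = cell_to_rc(cell)
    return REWARD_GRID[row][col]


# Print cell number, grid position, entry reward, and whether it is terminal.
for cell in range(9):
    print(cell, cell_to_rc(cell), reward_of(cell), cell in TERMINAL_CELLS)
```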

### Basic Game Environment:

In [2]:
# Importing all the required libraries.
import random
import numpy as np


In [3]:
# Defining a class for all possible actions taken by the Player.
class Action:
  
    def __init__(self):
        self.L=0
        self.R=1
        self.U=2
        self.D=3

In [4]:
# Defining a class for simulating the randomness of the environment & its impact on actions of the player!
class Environment:

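    def __init__(self):
        self.action_space=4        # number of actions: L, R, U, D
        self.observation_space=9   # number of cells in the 3x3 grid
        self.state=0               # the player starts in cell 0, i.e. (0,0)
        self.done=False
        # reward[row][col] is the reward for entering that cell
        self.reward=[[0,0,0],[0,1,-10],[1,0,10]]
        
    def reset(self):
        # Start a new episode from cell 0
        self.state=0
        self.done=False
        return self.state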
    
    def step(self,action):

        self.Reward=0
        Act = Action()

        # Player is in cell 0
        if(self.state==0):

            # Chosen action is to move LEFT.
            if(action==Act.L):

                # The grid wraps around, so LEFT from cell 0 lands in cell 2 (1 step)
                # or in cell 1 (slip, 2 steps), each with probability 0.5.
                self.state=np.random.choice([1,2],p=[0.5,0.5])

                # Cell 2 is reached in 1 step, so subtract 0.1 from the reward
                if ( self.state==2 ):
                    self.Reward=self.reward[0][self.state] -0.1

                # Cell 1 is reached in 2 steps, so subtract 0.2 from the reward
                else:
                    self.Reward=self.reward[0][self.state] -0.2
                
                # Note: the per-step penalty of 0.1 favours shorter paths.

                self.done=False
                
            
            elif (action==Act.R):
                self.state=np.random.choice([1,2],p=[0.5,0.5])
                # RIGHT from cell 0: cell 1 is 1 step away, cell 2 is 2 steps away
                if(self.state==1):
                    self.Reward=self.reward[0][self.state] -0.1
                else:
                    self.Reward=self.reward[0][self.state] -0.2
                self.done=False
                
                
            elif(action==Act.U):
                self.state=np.random.choice([3,6],p=[0.5,0.5])
                if self.state==3:
                    self.Reward=self.reward[1][0] -0.2
                else :
                    self.Reward=self.reward[2][0] -0.1
                self.done=False
                
            elif(action==Act.D):
                self.state=np.random.choice([3,6],p=[0.5,0.5])
                if self.state==3:
                    self.Reward=self.reward[1][0] -0.1
                else :
                    self.Reward=self.reward[2][0] -0.2
                self.done=False
                
        elif(self.state==1):
            if(action==Act.L):
                self.state=np.random.choice([0,2],p=[0.5,0.5])
                if(self.state==2):
                    self.Reward=self.reward[0][self.state] -0.2
                else:
                    self.Reward=self.reward[0][self.state] -0.1
                self.done=False
                
            elif (action==Act.R):
                self.state=np.random.choice([0,2],p=[0.5,0.5])
                if(self.state==2):
                    self.Reward=self.reward[0][self.state] -0.1
                else:
                    self.Reward=self.reward[0][self.state] -0.2
                self.done=False
                
                
            elif(action==Act.U):
                self.state=np.random.choice([4,7],p=[0.5,0.5])
                if self.state==4:
                    self.Reward=self.reward[1][1] -0.2
                else :
                    self.Reward=self.reward[2][1] -0.1
                self.done=False
                
            elif(action==Act.D):
                self.state=np.random.choice([4,7],p=[0.5,0.5])
                if self.state==4:
                    self.Reward=self.reward[1][1] -0.1
                else :
                    self.Reward=self.reward[2][1] -0.2
                self.done=False
                
        elif(self.state==2):
            if(action==Act.L):
                self.state=np.random.choice([1,0],p=[0.5,0.5])
                if(self.state==1):
                    self.Reward=self.reward[0][self.state] -0.1
                else:
                    self.Reward=self.reward[0][self.state] -0.2
                self.done=False
                
            elif (action==Act.R):
                self.state=np.random.choice([1,0],p=[0.5,0.5])
                if(self.state==0):
                    self.Reward=self.reward[0][self.state] -0.1
                else:
                    self.Reward=self.reward[0][self.state] -0.2
                self.done=False
                
                
            elif(action==Act.U):
                self.state=np.random.choice([5,8],p=[0.5,0.5])
                if self.state==5:
                    self.Reward=self.reward[1][2] 
                else :
                    self.Reward=self.reward[2][2] 
                self.done=True
                
            elif(action==Act.D):
                self.state=np.random.choice([5,8],p=[0.5,0.5])
                if self.state==5:
                    self.Reward=self.reward[1][2] 
                else :
                    self.Reward=self.reward[2][2] 
                self.done=True
                
        elif(self.state==3):
            if(action==Act.L):
                self.state=np.random.choice([5,4],p=[0.5,0.5])
                if(self.state==5):
                    self.Reward=self.reward[1][2] 
                    self.done=True
                else:
                    self.Reward=self.reward[1][1] -0.2
                    self.done=False
                
            elif (action==Act.R):
                self.state=np.random.choice([5,4],p=[0.5,0.5])
                if(self.state==4):
                    self.Reward=self.reward[1][1] -0.1
                    self.done=False
                else:
                    self.Reward=self.reward[1][2] 
                    self.done=True
                
                
            elif(action==Act.U):
                self.state=np.random.choice([0,6],p=[0.5,0.5])
                if self.state==6:
                    self.Reward=self.reward[2][0] -0.2
                else :
                    self.Reward=self.reward[0][0] -0.1
                self.done=False
                
            elif(action==Act.D):
                self.state=np.random.choice([0,6],p=[0.5,0.5])
                if self.state==6:
                    self.Reward=self.reward[2][0] -0.1
                else :
                    self.Reward=self.reward[0][0] -0.2
                self.done=False
                   
        elif(self.state==4):
            if(action==Act.L):
                self.state=np.random.choice([5,3],p=[0.5,0.5])
                if(self.state==5):
                    self.Reward=self.reward[1][2] 
                    self.done=True
                else:
                    # LEFT from cell 4 reaches cell 3 in 1 step
                    self.Reward=self.reward[1][0] -0.1
                    self.done=False
                
            elif (action==Act.R):
                self.state=np.random.choice([5,3],p=[0.5,0.5])
                if(self.state==3):
                    # RIGHT from cell 4 reaches cell 3 only by slipping past cell 5 (2 steps)
                    self.Reward=self.reward[1][0] -0.2
                    self.done=False
                else:
                    self.Reward=self.reward[1][2] 
                    self.done=True
                
                
            elif(action==Act.U):
                self.state=np.random.choice([1,7],p=[0.5,0.5])
                if self.state==7:
                    self.Reward=self.reward[2][1] -0.2
                else :
                    self.Reward=self.reward[0][1] -0.1
                self.done=False
                
            elif(action==Act.D):
                self.state=np.random.choice([1,7],p=[0.5,0.5])
                if self.state==7:
                    self.Reward=self.reward[2][1] -0.1
                else :
                    self.Reward=self.reward[0][1] -0.2
                self.done=False
                
        elif(self.state==6):
            if(action==Act.L):
                self.state=np.random.choice([7,8],p=[0.5,0.5])
                if(self.state==8):
                    self.Reward=self.reward[2][2] 
                    self.done=True
                else:
                    self.Reward=self.reward[2][1] -0.2
                    self.done=False
                
            elif (action==Act.R):
                self.state=np.random.choice([7,8],p=[0.5,0.5])
                if(self.state==7):
                    self.Reward=self.reward[2][1] -0.1
                    self.done=False
                else:
                    self.Reward=self.reward[2][2] 
                    self.done=True
                
                
            elif(action==Act.U):
                self.state=np.random.choice([0,3],p=[0.5,0.5])
                if self.state==3:
                    self.Reward=self.reward[1][0] -0.1
                else :
                    self.Reward=self.reward[0][0] -0.2
                self.done=False
                
            elif(action==Act.D):
                self.state=np.random.choice([0,3],p=[0.5,0.5])
                if self.state==0:
                    self.Reward=self.reward[0][0] -0.1
                else :
                    self.Reward=self.reward[1][0] -0.2
                self.done=False
                
        elif(self.state==7):
            if(action==Act.L):
                self.state=np.random.choice([6,8],p=[0.5,0.5])
                if(self.state==6):
                    self.Reward=self.reward[2][0] -0.1
                    self.done=False
                else:
                    self.Reward=self.reward[2][2] 
                    self.done=True
                
            elif (action==Act.R):
                self.state=np.random.choice([6,8],p=[0.5,0.5])
                if(self.state==8):
                    self.Reward=self.reward[2][2] 
                    self.done=True
                else:
                    self.Reward=self.reward[2][0] -0.2
                    self.done=False
                
                
            elif(action==Act.U):
                self.state=np.random.choice([1,4],p=[0.5,0.5])
                if self.state==4:
                    self.Reward=self.reward[1][1] -0.1
                else :
                    self.Reward=self.reward[0][1] -0.2
                self.done=False
                
            elif(action==Act.D):
                self.state=np.random.choice([1,4],p=[0.5,0.5])
                # DOWN from cell 7 wraps to cell 1 in 1 step; cell 4 takes 2 steps
                if self.state==1:
                    self.Reward=self.reward[0][1] -0.1
                else :
                    self.Reward=self.reward[1][1] -0.2
                self.done=False
                
        return self.state,self.Reward,self.done
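
The branches of `step` all follow one pattern: move 1 or 2 cells in the chosen direction with wrap-around, charge 0.1 per cell moved, and end the game without a step charge on cells 5 and 8. A compact sketch of that pattern is shown below. It is not the class above: `torus_step`, `DELTAS` and the `rng` parameter are hypothetical names, and it assumes a normal move is charged 0.1 and a slip 0.2, as the comments in `step` describe.

```python
import numpy as np

REWARD = [[0, 0, 0], [0, 1, -10], [1, 0, 10]]
TERMINAL = {5, 8}
# (row delta, col delta) per action: 0=LEFT, 1=RIGHT, 2=UP, 3=DOWN
DELTAS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}


def torus_step(state, action, rng):
    """Return (next_state, reward, done) for one move on the wrapping 3x3 grid."""
    n_cells = int(rng.choice([1, 2]))      # 1 = normal move, 2 = slip
    d_row, d_col = DELTAS[action]
    row, col = divmod(state, 3)
    row = (row + n_cells * d_row) % 3      # wrap around the edges
    col = (col + n_cells * d_col) % 3
    next_state = 3 * row + col
    done = next_state in TERMINAL
    reward = REWARD[row][col]
    if not done:
        reward -= 0.1 * n_cells            # step penalty only on non-terminal moves
    return next_state, reward, done


rng = np.random.default_rng(0)
print(torus_step(0, 2, rng))   # UP from cell 0 lands in cell 6 or cell 3
```

Writing the transition this way removes the per-cell branches, so a step penalty cannot be mistyped for one particular cell.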
                       

### Moves by Human Player

In [5]:
env = Environment()
state = env.reset()
done = False
total_r = 0

while True:

    action = int(input("Enter a direction: 0 (LEFT), 1 (RIGHT), 2 (UP), 3(DOWN) ->"))

    if (action==0):
        print("You took Left turn ← ")
        
    elif(action==1):
        print("You took Right turn → ")
        
    elif(action==2):
        print("You took Upward turn ↑ ")
        
    else:
        print("You took Downward turn ↓ ")

    new_state, reward, done = env.step(action)

    total_r += reward

    print("Your new state is ", new_state)
    print("Your current reward is", reward )
    print("Your Total reward is", total_r )

    if (done):
        print("You entered the termination. Bye!")
        break

Enter a direction: 0 (LEFT), 1 (RIGHT), 2 (UP), 3(DOWN) ->2
You took Upward turn ↑ 
Your new state is  6
Your current reward is 0.9
Your Total reward is 0.9
Enter a direction: 0 (LEFT), 1 (RIGHT), 2 (UP), 3(DOWN) ->1
You took Right turn → 
Your new state is  7
Your current reward is -0.1
Your Total reward is 0.8
Enter a direction: 0 (LEFT), 1 (RIGHT), 2 (UP), 3(DOWN) ->1
You took Right turn → 
Your new state is  6
Your current reward is 0.8
Your Total reward is 1.6
Enter a direction: 0 (LEFT), 1 (RIGHT), 2 (UP), 3(DOWN) ->3
You took Downward turn ↓ 
Your new state is  3
Your current reward is -0.2
Your Total reward is 1.4000000000000001
Enter a direction: 0 (LEFT), 1 (RIGHT), 2 (UP), 3(DOWN) ->1
You took Right turn → 
Your new state is  5
Your current reward is -10
Your Total reward is -8.6
You entered the termination. Bye!


### Training RL agent using **Monte Carlo Control**

In [6]:
class MC_agent:

    def __init__(self,environment):
        
        self.num_episodes = 1000
        self.steps_per_episode = 500
        self.env = environment
        self.episode_rewards = []

        # N(s,a) is the number of times that action a has been selected from state s.
        self.N = np.zeros((self.env.observation_space,self.env.action_space))
        
        # Q-table for Q(s,a) and Initialise the value function to zero. 
        self.Q = np.zeros((self.env.observation_space,self.env.action_space))


    # Choose an action with an ε-greedy exploration strategy,
    # with ε_k = 1/k where k is the episode number
    def get_action(self, state, episode):

        curr_epsilon = 1/(episode)

        # With probability ε explore with a uniformly random action,
        # otherwise exploit the current greedy action
        if np.random.uniform(0, 1) < curr_epsilon:
            r_action = np.random.choice([0,1,2,3],p=[0.25,0.25,0.25,0.25])
            return r_action
        else:
            action =  np.argmax(self.Q[state, :])
            return action
        
    def train(self):

        # Loop episodes
        for episode in range(1,self.num_episodes+1):

            # for storing each state, action pair in each step of episodes
            history = []
            
            # get initial state for current episode
            state = self.env.reset()
            
            rewards_current_episode = 0
            
            done = False
            
            for step in range(self.steps_per_episode):

                action = self.get_action(state, episode)
               
                # store action state pairs
                history.append((state,action))
                
                # update visits
                # N(s,a) is the number of times that action a has been selected from state s. 
                self.N[state, action] += 1
                
                # execute action
                state,reward,done = self.env.step(action)
                
                rewards_current_episode += reward
                
                # When Termination state is reached
                if(done == True):
                    break
             
            self.episode_rewards.append(rewards_current_episode)

            # Update the action-value function. Every visited (state, action) pair
            # is moved toward the total reward of the whole episode.
            for curr_state, curr_action in history:
                
                # Alpha (learning rate) = 1 / N(s,a)
                alpha = 1.0 / self.N[curr_state, curr_action]   
            
                error = rewards_current_episode - self.Q[curr_state, curr_action]
                
                self.Q[curr_state, curr_action] += alpha * error
                
    def ShowActionValueFunc(self):
        
        print("\n <-- Q(s,a) in Q-Table Format --> \n")
        print(self.Q)

        # Saving Q table
        np.save('Q_table_MC',self.Q)
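
The update in `train` uses the total reward of the episode as the target for every visited pair. A common alternative in Monte Carlo control is to use the return that follows each visit instead. The sketch below shows that variant on a hand-written episode. It is not the method used in this notebook: `returns_from_rewards`, `mc_update`, the discount `gamma`, and the toy episode are all assumptions made for illustration.

```python
import numpy as np


def returns_from_rewards(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate from the last step backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_update(Q, N, history, rewards, gamma=1.0):
    """Every-visit Monte Carlo update with step size 1 / N(s, a)."""
    for (state, action), g in zip(history, returns_from_rewards(rewards, gamma)):
        N[state, action] += 1
        Q[state, action] += (g - Q[state, action]) / N[state, action]


# Toy episode: UP from cell 0 to 6, RIGHT to 7, RIGHT to 8 (End Point).
history = [(0, 2), (6, 1), (7, 1)]
rewards = [0.9, -0.1, 10.0]

Q = np.zeros((9, 4))
N = np.zeros((9, 4))
mc_update(Q, N, history, rewards)
print(Q[0, 2], Q[6, 1], Q[7, 1])
```

With per-visit returns, a pair visited late in an episode is not credited with reward collected before the visit, which is the main difference from using the episode total.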


In [7]:
agent = MC_agent(Environment())
agent.train()
agent.ShowActionValueFunc()


 <-- Q(s,a) in Q-Table Format --> 

[[ 78.73662132  82.68464661  72.80275221  91.30770945]
 [ 95.84514047  91.83810813 138.53331373  51.93213272]
 [ 42.525       78.68186506  -4.2          3.8       ]
 [ 62.97681267  68.67493173  82.84229395  91.34397048]
 [ 70.33991224  74.26241993 138.46304624  61.39756979]
 [  0.           0.           0.           0.        ]
 [ 80.40656579  89.75822416  73.09673854  91.66867301]
 [ 83.0732932   85.21336562 106.46771132 138.52376344]
 [  0.           0.           0.           0.        ]]
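
The greedy policy implied by a Q-table is the arg-max over each row. The sketch below applies that to the table printed above, with the values copied from the output. The `ACTION_NAMES` list is a hypothetical helper that follows the action encoding 0 = Left, 1 = Right, 2 = Up, 3 = Down.

```python
import numpy as np

# Q-values copied from the printed table (rows = cells 0..8, columns = actions).
Q = np.array([
    [78.73662132, 82.68464661, 72.80275221, 91.30770945],
    [95.84514047, 91.83810813, 138.53331373, 51.93213272],
    [42.525, 78.68186506, -4.2, 3.8],
    [62.97681267, 68.67493173, 82.84229395, 91.34397048],
    [70.33991224, 74.26241993, 138.46304624, 61.39756979],
    [0.0, 0.0, 0.0, 0.0],
    [80.40656579, 89.75822416, 73.09673854, 91.66867301],
    [83.0732932, 85.21336562, 106.46771132, 138.52376344],
    [0.0, 0.0, 0.0, 0.0],
])
ACTION_NAMES = ["Left", "Right", "Up", "Down"]

greedy = np.argmax(Q, axis=1)   # best action index for each cell

for state, action in enumerate(greedy):
    if not Q[state].any():
        # Cells 5 and 8 are terminal, so no action is ever taken from them.
        print(state, "no learned values")
    else:
        print(state, ACTION_NAMES[action])
```

For this table the greedy action in cells 0, 3 and 6 is Down, so a greedy rollout that starts in cell 0 stays in the first column and never reaches a terminal cell.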


### Moves by RL Agent!

In [1]:
# Load the Q-Values from the saved table.
q_table  = np.load('Q_table_MC.npy')

env = Environment()
state = env.reset()
done = False
total_r = 0

while True:

    action = np.argmax(q_table[state,:])

    if (action==0):
        print("You took Left turn ← ")
        
    elif(action==1):
        print("You took Right turn → ")
        
    elif(action==2):
        print("You took Upward turn ↑ ")
        
    else:
        print("You took Downward turn ↓ ")

    new_state, reward, done = env.step(action)

    # Move to the new state so the next greedy action is chosen from it
    state = new_state

    total_r += reward

    print("Your new state is ", new_state)
    print("Your current reward is", reward )
    print("Your Total reward is", total_r )

    if (done):
        print("You entered the termination. Bye!")
        break

### Interpretation:

**When the game is played by the RL agent trained with Monte Carlo control, the Total reward keeps growing without bound (it diverges), because the Treasure cells pay +1 on every entry. This suggests that our RL agent is working well!**