# Introduction
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It provides a wide variety of environments where agents can interact and learn from their experiences. In this report, we'll explore the basic functionality of OpenAI Gym, with a focus on two key methods: reset() and step().

# Installation
Before diving into the functionality, make sure you have OpenAI Gym installed. You can install it via pip:

In [1]:
!pip install gym tqdm pygame swig



# Creating our first gym environment

Let's start with one of the simplest environments, the frozen lake environment, shown below. In this environment, the goal of the agent is to start from the initial state S and reach the goal state G.

![title](Images/1.png)

In the above environment, the following applies:

* S denotes the starting state
* F denotes the frozen state
* H denotes the hole state
* G denotes the goal state

So, the agent has to start from the state S and reach the goal state G. The catch is that if the agent visits a hole state H, it falls into the hole and dies, as shown below:


![title](Images/2.png)

So, we need to make sure that the agent starts from S and reaches G without falling into the hole state H as shown below:


![title](Images/3.png)

Each grid cell in the above environment is called a state, so we have 16 states (S to G), and we have 4 possible actions: up, down, left and right. Our goal is to reach the state G from S without visiting H. So, we assign a reward of 0 to every state except the goal state G, which gets a reward of +1.
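To make the reward rule above concrete, here is a minimal pure-Python sketch that does not use Gym at all. The names `LAKE_MAP` and `tile_reward` are hypothetical, and the layout is assumed to be the standard 4x4 frozen lake map shown in the figures.

```python
# Assumed 4x4 layout, written row by row (S = start, F = frozen, H = hole, G = goal).
LAKE_MAP = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

def tile_reward(tile):
    """Reward for arriving on a tile: +1 for the goal G, 0 for everything else."""
    return 1.0 if tile == "G" else 0.0

# Flatten the grid so that each character corresponds to one state.
tiles = "".join(LAKE_MAP)
n_states = len(tiles)
rewards = [tile_reward(t) for t in tiles]

print(n_states)      # 16
print(sum(rewards))  # 1.0, because only the goal state carries a reward
```

Only one of the 16 states yields a non-zero reward, which is why a random agent rarely sees any learning signal in this environment.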

Thus, we learned how the frozen lake environment works. To train our agent in this environment, we would first need to create the environment, which could mean coding it from scratch in Python. But luckily we don't have to do that! Since the gym provides a variety of environments, we can import the gym toolkit and create the frozen lake environment directly.


Now, we will learn how to create our frozen lake environment using the gym. Before running any code, make sure that you have activated our virtual environment, universe. First, let's import the gym library:




In [2]:
import gym


Next, we can create a gym environment using the make function. The make function requires the environment id as a parameter. In the gym version used here, the id of the frozen lake environment is `FrozenLake-v1`. So, we can create our frozen lake environment as shown below:





In [3]:
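# Create the frozen lake environment; render_mode="human" displays it in a window
env = gym.make("FrozenLake-v1", render_mode="human")
# Alternative: a non-slippery (deterministic) lake with the map size given explicitly
# environment = gym.make("FrozenLake-v1", is_slippery=False, render_mode="human", map_name="4x4")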


After creating the environment, we can see what it looks like using the render function. Before rendering, the environment has to be reset, which is covered next.

# Reset() Function
In OpenAI Gym, the `reset` method is a crucial part of the environment lifecycle, serving multiple purposes within the context of reinforcement learning. It is primarily used to initiate a new episode, resetting the environment to its initial state. This is essential because each episode in reinforcement learning starts with the agent in a specific state, and the `reset` method ensures that this starting condition is met. Additionally, `reset` is called after an episode ends, either when the agent reaches a terminal state or when the maximum number of time steps is reached. This allows for the continuation of training by starting a new episode from the initial state.

Moreover, `reset` can also be used at the very beginning of training to establish the initial state for the first episode. This practice helps in setting up the environment correctly before the training loop begins. The `reset` method returns the initial observation and any auxiliary information required for the environment, which is crucial for the agent to understand its initial state and make informed decisions.



In [4]:
env.reset()
env.render()



As we can observe, the frozen lake environment consists of 16 states (S to G) as we learned. The state S is highlighted, indicating that it is our current state, that is, the agent is in the state S. So whenever we create and reset an environment, the agent begins from the initial state, which in our case is the state S.

That's it! Creating the environment using the gym is that simple. In the next section, we will understand more about the gym environment by relating all the concepts we have learned in the previous chapter. 





## Exploring the environment

In the previous chapter, we learned that a reinforcement learning environment can be modeled as a Markov decision process (MDP), and that an MDP consists of the following:

* __States__ - A set of states present in the environment.
* __Actions__ - A set of actions that the agent can perform in each state.
* __Transition probability__ - The transition probability is denoted by $P(s'|s,a)$. It is the probability of moving from a state $s$ to the state $s'$ while performing an action $a$.
* __Reward function__ - The reward function is denoted by $R(s,a,s')$. It is the reward the agent obtains when moving from a state $s$ to the state $s'$ while performing an action $a$.

Let's now understand how to obtain all the above information from the frozen lake environment we just created using the gym.



## States
A state space consists of all of our states. We can obtain the number of states in our environment by just typing `env.observation_space` as shown below:

In [5]:
print(env.observation_space)

Discrete(16)


It implies that we have 16 discrete states in our state space, from the state S to the state G. Note that, in the gym, each state is encoded as a number. The states are numbered row by row from the top-left corner, so the state S is encoded as 0, the state F next to it is encoded as 1, and so on up to the state G, which is encoded as 15, as shown below:


![title](Images/5.png)
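The numbering in the figure can be reproduced with a little arithmetic. The sketch below is an illustration only, assuming a 4-column grid numbered row by row; the helper names `to_state` and `to_row_col` are hypothetical and not part of the gym API.

```python
N_COLS = 4  # width of the 4x4 frozen lake grid

def to_state(row, col):
    """Convert a (row, col) grid position into its state number."""
    return row * N_COLS + col

def to_row_col(state):
    """Convert a state number back into its (row, col) grid position."""
    return divmod(state, N_COLS)

print(to_state(0, 0))   # 0  -> the start state S
print(to_state(1, 2))   # 6  -> second row, third column
print(to_row_col(15))   # (3, 3) -> the goal state G in the bottom-right corner
```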

## Actions

We learned that the action space consists of all the possible actions in the environment. We can obtain the action space by `env.action_space` as shown below:

In [6]:
print(env.action_space)

Discrete(4)


It implies that we have 4 discrete actions in our action space, which are left, down, right and up. Note that, similar to states, actions are also encoded as numbers: left is 0, down is 1, right is 2 and up is 3, as shown below:


![title](Images/6.PNG)
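As a rough illustration of what these action numbers mean on the grid, the following sketch applies an action deterministically, that is, without the slipperiness discussed later. It does not use Gym; `MOVES` and `move` are hypothetical names, and the grid is assumed to be the 4x4 layout numbered row by row.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

# Change in (row, col) for each action number.
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}
N = 4  # the grid is N x N

def move(state, action):
    """Apply an action with no randomness; moving into a wall leaves the agent in place."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    return row * N + col

print(move(2, DOWN))   # 6
print(move(0, LEFT))   # 0, blocked by the left wall
print(move(3, RIGHT))  # 3, blocked by the right wall
```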

# Step() Function
The `step()` function in OpenAI Gym is the method the agent uses to interact with the environment. When an agent takes an action, `step()` is called to execute that action and return the resulting state of the environment. In recent Gym versions (0.26 and later, which is the API used by the code in this report), `step()` returns a tuple containing five values: observation, reward, terminated, truncated, and info. Older versions returned four values, with a single `done` flag in place of `terminated` and `truncated`.

- **Observation**: This is the new state of the environment after the action has been taken. It represents the agent's view of the environment at the current time step.
- **Reward**: This is the immediate reward given to the agent after taking the action. Rewards are used to guide the agent towards achieving its goal.
- **Terminated**: This is a boolean value indicating whether the agent has reached a terminal state. In the frozen lake environment, this happens when the agent reaches the goal or falls into a hole.
- **Truncated**: This is a boolean value indicating whether the episode was cut short for a reason outside the task itself, typically because a maximum number of steps has been reached.
- **Info**: This is a dictionary containing additional information about the environment. This information can be useful for debugging or for providing extra details about the state of the environment.
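Because tutorials written for different Gym versions unpack the result of `step()` differently, it can help to normalize it. The helper below is a hypothetical sketch, not part of Gym: it accepts either a four-value or a five-value tuple and always returns five values. It assumes that older environments signal a time-limit cutoff through the `"TimeLimit.truncated"` key in `info`.

```python
def normalize_step(result):
    """Return (obs, reward, terminated, truncated, info) for either tuple shape."""
    if len(result) == 5:
        # Already in the five-value form.
        return tuple(result)
    if len(result) == 4:
        obs, reward, done, info = result
        # Assumption: older time-limit wrappers mark a cutoff with this info key.
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = bool(done) and not truncated
        return obs, reward, terminated, truncated, info
    raise ValueError(f"unexpected step result of length {len(result)}")

# Five-value result: passed through unchanged.
print(normalize_step((6, 0.0, False, False, {})))
# Four-value result where the episode ended by reaching a terminal state.
print(normalize_step((15, 1.0, True, {})))
```

With this shape, a training loop only needs to check `terminated or truncated` to decide when to call `reset()`.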

## Transition probability and Reward function

Now, let's look at how to obtain the transition probability and the reward function. We learned that in a stochastic environment, performing an action $a$ in a state $s$ does not always lead to the same next state $s'$. Because of the randomness in the environment, the agent reaches each possible next state only with some probability.

Let's suppose we are in state 2 (F). Now if we perform action 1 (down) in state 2, we can reach the state 6 as shown below:


![title](Images/7.png)

Our frozen lake environment is a stochastic environment. In a stochastic environment, we won't always reach state 6 by performing action 1 (down) in state 2; we may also reach other states with some probability. So when we perform action 1 (down) in state 2, we reach state 1 with probability 0.33333, state 6 with probability 0.33333 and state 3 with probability 0.33333 as shown below:


![title](Images/8.png)
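The three outcomes above follow a simple pattern: the agent moves either in the intended direction or in one of the two perpendicular directions, each with probability 1/3. The sketch below reproduces that pattern in pure Python under this assumption. The names `move` and `slippery_outcomes` are hypothetical, and the sketch ignores the special handling of holes and the goal as terminal states.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}
N = 4  # the grid is N x N

def move(state, action):
    """Deterministic move; walls keep the agent in place."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    return row * N + col

def slippery_outcomes(state, action):
    """Map each reachable next state to its probability on slippery ice."""
    probs = {}
    # The intended action and its two perpendicular neighbours are equally likely.
    for actual in [(action - 1) % 4, action, (action + 1) % 4]:
        nxt = move(state, actual)
        probs[nxt] = probs.get(nxt, 0.0) + 1.0 / 3.0
    return probs

print(sorted(slippery_outcomes(2, DOWN)))   # [1, 3, 6]
print(sorted(slippery_outcomes(0, RIGHT)))  # [0, 1, 4]
```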


As we can notice, in the stochastic environment we reach the next states with some probability. Now, let's learn how to obtain this transition probability using the gym environment.  

We can obtain the transition probability and the reward function by just typing `env.P[state][action]`. So, in order to obtain the transition probability of moving from the state S to the other states by performing the action right, we would type `env.P[S][right]`. But we cannot type the state S and the action right directly, since they are encoded as numbers. We learned that the state S is encoded as 0 and the action right is encoded as 2, so we type `env.P[0][2]` as shown below:






In [7]:
print(env.P[0][2])

[(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)]


What does this imply? Our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]`. It implies that if we perform action 2 (right) in state 0 (S) then:

* We reach the state 4 (F) with probability 0.33333 and receive 0 reward. 
* We reach the state 1 (F) with probability 0.33333 and receive 0 reward.
* We reach the same state 0 (S) with probability 0.33333 and receive 0 reward.

The transition probability is shown below:



![title](Images/9.png)

Thus, when we type `env.P[state][action]` we get the result in the form of `[(transition probability, next state, reward, Is terminal state?)]`. The last value is a boolean indicating whether the next state is a terminal state. Since 4, 1 and 0 are not terminal states, it is False for all three entries.

The output of `env.P[0][2]` is shown in the below table for more clarity:


![title](Images/10.PNG)
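A list in this format is enough to simulate one step by hand. The sketch below copies the output of `env.P[0][2]` shown above into a plain Python list and samples a next state from it. The function name `sample_transition` is hypothetical, and this is only an illustration of how such a table can be used, not how the gym itself is implemented.

```python
import random

# Output of env.P[0][2] from above: (probability, next state, reward, is terminal?)
transitions = [
    (1.0 / 3.0, 4, 0.0, False),
    (1.0 / 3.0, 1, 0.0, False),
    (1.0 / 3.0, 0, 0.0, False),
]

def sample_transition(transitions, rng):
    """Pick one entry at random, weighted by its transition probability."""
    weights = [prob for prob, _, _, _ in transitions]
    _, next_state, reward, done = rng.choices(transitions, weights=weights, k=1)[0]
    return next_state, reward, done

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
next_state, reward, done = sample_transition(transitions, rng)
print(next_state, reward, done)

# Expected immediate reward: sum of probability * reward over all outcomes.
expected_reward = sum(prob * r for prob, _, r, _ in transitions)
print(expected_reward)  # 0.0, since no outcome from state 0 carries a reward
```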

Let's understand this with one more example. Let's suppose we are in the state 3 (F) as shown below:


![title](Images/11.png)

Say we perform action 1 (down) in the state 3 (F). The transition probability for the state 3 (F) and action 1 (down) can be obtained as shown below:



In [8]:
print(env.P[3][1])

[(0.3333333333333333, 2, 0.0, False), (0.3333333333333333, 7, 0.0, True), (0.3333333333333333, 3, 0.0, False)]


As we learned, our output is in the form of `[(transition probability, next state, reward, Is terminal state?)]`. It implies that if we perform action 1 (down) in state 3 (F) then:

* We reach the state 2 (F) with probability 0.33333 and receive 0 reward. 
* We reach the state 7 (H) with probability 0.33333 and receive 0 reward.
* We reach the same state 3 (F) with probability 0.33333 and receive 0 reward.


The transition probability is shown below:



![title](Images/12.png)


The output of `env.P[3][1]` is shown in the below table for more clarity:


![title](Images/13.PNG)

As we can observe, the second row of our output is `(0.33333, 7, 0.0, True)`, and the last value here is True. It implies that state 7 is a terminal state. That is, if we perform action 1 (down) in state 3 (F), we reach the state 7 (H) with probability 0.33333, and since 7 (H) is a hole, the agent dies if it reaches that state. Thus 7 (H) is a terminal state and so it is marked as True.

Thus, we learned how to obtain the state space, action space, transition probability and the reward function using the gym environment. In the next section, we will learn how to generate an episode. 

# Looping Through Episodes

In [9]:
from tqdm import tqdm

observation, info = env.reset(seed=82)
n = 10  # number of steps to take; an episode ends when terminated or truncated is True
for _ in tqdm(range(n)):
    # Pick a random action and apply it to the environment
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    print("Observation: ", observation, "\nReward: ", reward, "\nTerminated: ", terminated, "\nTruncated: ", truncated, "\nInfo: ", info)

    # Start a new episode once the current one has ended
    if terminated or truncated:
        observation, info = env.reset()

env.close()


Observation:  1 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 20%|██        | 2/10 [00:00<00:01,  4.13it/s]

Observation:  1 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 30%|███       | 3/10 [00:00<00:01,  4.07it/s]

Observation:  1 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 40%|████      | 4/10 [00:00<00:01,  4.04it/s]

Observation:  2 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 50%|█████     | 5/10 [00:01<00:01,  4.02it/s]

Observation:  2 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 60%|██████    | 6/10 [00:01<00:00,  4.00it/s]

Observation:  6 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 70%|███████   | 7/10 [00:01<00:00,  4.00it/s]

Observation:  10 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 80%|████████  | 8/10 [00:01<00:00,  3.99it/s]

Observation:  6 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


 90%|█████████ | 9/10 [00:02<00:00,  3.99it/s]

Observation:  10 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}


100%|██████████| 10/10 [00:02<00:00,  4.02it/s]

Observation:  6 
Reward:  0.0 
Terminated:  False 
Truncated:  False 
Info:  {'prob': 0.3333333333333333}
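The loop above takes a fixed number of steps, so none of the ten steps shown happens to end an episode. To count whole episodes instead, the stepping can be wrapped in a function that runs until `terminated` or `truncated` is True. The sketch below shows that structure with a hypothetical stand-in environment, `CorridorEnv`, so that it runs without Gym installed; `run_episode` only assumes an object whose `reset()` returns two values and whose `step()` returns five.

```python
import random

class CorridorEnv:
    """Hypothetical stand-in: the agent walks right along cells 0..3 to reach the goal."""

    def __init__(self, max_steps=20):
        self.max_steps = max_steps
        self.pos = 0
        self.t = 0

    def reset(self, seed=None):
        self.pos, self.t = 0, 0
        return self.pos, {}

    def step(self, action):
        # Action 1 moves one cell to the right; any other action stays in place.
        self.t += 1
        if action == 1:
            self.pos += 1
        terminated = self.pos == 3
        truncated = (not terminated) and self.t >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}

def run_episode(env, policy):
    """Run one full episode and return (total reward, number of steps)."""
    observation, info = env.reset()
    total_reward, steps = 0.0, 0
    while True:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            return total_reward, steps

rng = random.Random(0)
toy_env = CorridorEnv()
# Five complete episodes with a random policy.
results = [run_episode(toy_env, lambda obs: rng.choice([0, 1])) for _ in range(5)]
print(results)
```

With the frozen lake environment, the policy could be `lambda obs: env.action_space.sample()`, which matches the random agent used above.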



