<a href="https://colab.research.google.com/github/Alian3785/game-of-ur/blob/main/%D0%A3%D0%A0%20%D0%B4%D0%BB%D1%8F%20%D0%B2%D1%81%D0%B5%D1%85.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines Tutorial - Creating a custom Gym environment

Github repo: https://github.com/araffin/rl-tutorial-jnrr19

Stable-Baselines: https://github.com/hill-a/stable-baselines

Documentation: https://stable-baselines.readthedocs.io/en/master/

RL Baselines zoo: https://github.com/araffin/rl-baselines-zoo


## Introduction

In this notebook, you will learn how to use your own environment following the OpenAI Gym interface.
Once it is done, you can easily use any compatible (depending on the action space) RL algorithm from Stable Baselines on that environment.

## Install Dependencies and Stable Baselines Using Pip



In [None]:
# Stable-Baselines3 runs on PyTorch; the [extra] option also installs optional dependencies

!pip install "stable-baselines3[extra]>=2.0.0a4"
!pip install sb3-contrib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stable-baselines3[extra]>=2.0.0a4
  Downloading stable_baselines3-2.0.0a10-py3-none-any.whl (177 kB)
Collecting gymnasium==0.28.1 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading gymnasium-0.28.1-py3-none-any.whl (925 kB)
Collecting shimmy[atari]~=0.2.1 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading Shimmy-0.2.1-py3-none-any.whl (25 kB)
Collecting autorom[accept-rom-license]~=0.6.0 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading AutoROM-0.6.1-py3-none-any.whl (9.4 kB)
Collecting jax-jumpy>=1.0.0 (from gymnasium==0.28.1->stable-baselines3[extra]>=2.0.0a4)
  Downloading jax_jumpy-1.0.0-py3-none-any.whl (20 kB)
...

## First steps with the gym interface

As you have noticed in the previous notebooks, an environment that follows the gym interface is quite simple to use.
It provides the user with three main methods:
- `reset()`, called at the beginning of an episode; it returns an observation
- `step(action)`, called to take an action in the environment; it returns the next observation, the immediate reward, whether the episode is over, and additional information
- (Optional) `render(mode='human')`, which allows you to visualize the agent in action. Note that graphical interfaces do not work on Google Colab, so we cannot use it directly (we have to rely on `mode='rgb_array'` to retrieve an image of the scene)

Under the hood, it also contains two useful properties:
- `observation_space`, which is one of the gym spaces (`Discrete`, `Box`, ...) and describes the type and shape of the observation
- `action_space`, which is also a gym space object and describes the type of action that can be taken
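
To make the contract of these methods concrete, here is a minimal sketch in plain Python. `CountdownEnv` and `run_episode` are hypothetical names written for this illustration; the class only mimics the `reset()`/`step()` signatures described above and does not inherit from `gym.Env`.

```python
class CountdownEnv:
    """Toy stand-in for a gym-style environment (hypothetical, not a gym.Env)."""

    def __init__(self, start=3):
        self.start = start
        self.counter = start

    def reset(self):
        # Called at the beginning of an episode, returns the first observation
        self.counter = self.start
        return self.counter

    def step(self, action):
        # Action 1 decrements the counter, any other action leaves it unchanged
        if action == 1:
            self.counter -= 1
        done = self.counter == 0
        reward = 1.0 if done else 0.0
        info = {}  # free-form debugging information
        return self.counter, reward, done, info


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the sum of the rewards."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(CountdownEnv(start=3), policy=lambda obs: 1))
```

The loop in `run_episode` is the same pattern an RL algorithm follows internally: reset once, then call `step` until the environment reports that the episode is over.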

The best way to learn about gym spaces is to look at the [source code](https://github.com/openai/gym/tree/master/gym/spaces), but you need to know at least the main ones:
- `gym.spaces.Box`: A (possibly unbounded) box in $R^n$. Specifically, a Box represents the Cartesian product of n closed intervals. Each interval has the form of one of [a, b], (-oo, b], [a, oo), or (-oo, oo). Example: A 1D-Vector or an image observation can be described with the Box space.
```python
# Example for using image as input:
observation_space = spaces.Box(low=0, high=255, shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
```                                       

- `gym.spaces.Discrete`: A discrete space in $\{ 0, 1, \dots, n-1 \}$
  Example: if you have two actions ("left" and "right") you can represent your action space using `Discrete(2)`, the first action will be 0 and the second 1.



[Documentation on custom env](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html)

In [None]:
import gym

env = gym.make("CartPole-v1")

# Box(4,) means that it is a Vector with 4 components
print("Observation space:", env.observation_space)
print("Shape:", env.observation_space.shape)
# Discrete(2) means that there are two discrete actions
print("Action space:", env.action_space)

# The reset method is called at the beginning of an episode
obs = env.reset()
# Sample a random action
action = env.action_space.sample()
print("Sampled action:", action)
obs, reward, done, info = env.step(action)
# Note the obs is a numpy array
# info is an empty dict for now but can contain any debugging info
# reward is a scalar
print(obs.shape, reward, done, info)


##  Gym env skeleton

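In practice, this is what a gym environment looks like.
Here, the skeleton is filled in with a two-player board game (the Royal Game of Ur, going by the repository name): the agent plays against a bot that moves at random and must learn to bring its seven pieces to the finish square first. The class keeps the name `GoLeftEnv` from the original grid-world tutorial.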

In [None]:
import numpy as np
import gym
from gym import spaces
import random


class GoLeftEnv(gym.Env):
  """
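  Custom Environment that follows gym interface.
  Two-player race game: the agent (player 2) plays against a random bot (player 1).

  self.fields is a 17x4 table. Rows 0-15 are squares:
  [index, extra-turn square flag, bot pieces, agent pieces];
  row 0 is the start square and row 15 the finish square.
  Row 16 holds [bot roll, agent roll, bot extra-turn flag, agent extra-turn flag].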
  """
  # Because of google colab, we cannot implement the GUI ('human' render mode)
  metadata = {'render.modes': ['console']}
  # Define constants for clearer code
  FIRST = 0
  SECOND = 1
  THIRD = 2
  FORTH = 3
  FIFTH= 4
  SIXTH = 5
  SEVENTH = 6

  def __init__(self, grid_size=10):
    super(GoLeftEnv, self).__init__()

    self.fields = [[0, 0, 7, 7,], [1, 0, 0, 0,], [2, 0, 0, 0,], [3, 0, 0, 0,], [4, 1, 0, 0,], [5, 0, 0, 0], [6, 0, 0, 0], [7, 0, 0, 0], [8, 1, 0, 0], [9, 0, 0, 0], [10, 0, 0, 0], [11, 0, 0, 0], [12, 0, 0, 0], 
                   [13, 0, 0, 0], [14, 1, 0, 0], [15, 0, 0, 0], [0, 0, 0, 0]]

    # Define action and observation space
    # They must be gym.spaces objects
    # The action is an index (0-6) into the list of legal moves for the current roll
    n_actions = 7
    self.action_space = spaces.Discrete(n_actions)
    # The observation is the whole 17x4 self.fields table,
    # described here by a Box space
    self.observation_space = spaces.Box(low=-100, high=100,
                                        shape=(17,4), dtype=np.float32)

  def reset(self):
    """
    Important: the observation must be a numpy array
    :return: (np.array) 
    """

    # Reset the board: both players start with seven pieces on square 0
    self.fields = [[0, 0, 7, 7,], [1, 0, 0, 0,], [2, 0, 0, 0,], [3, 0, 0, 0,], [4, 1, 0, 0,], [5, 0, 0, 0], [6, 0, 0, 0], [7, 0, 0, 0], [8, 1, 0, 0], [9, 0, 0, 0], [10, 0, 0, 0], [11, 0, 0, 0], [12, 0, 0, 0], 
                   [13, 0, 0, 0], [14, 1, 0, 0], [15, 0, 0, 0], [0, 0, 0, 0]]

    # here we convert to float32 to make it more general (in case we want to use continuous actions)
    return np.array(self.fields).astype(np.float32)

  def step(self, action):

     def first():
            numberforcycles = 15

            balcycle1 = 0

            roll = self.fields[16][0]

            lst = [0, 1, 2, 3, 4]
            weights = [6.25, 25, 37.5, 25, 6.25]
            nextroll = random.choices(lst, weights=weights, k=1)
            nextroll = nextroll[0]
            self.fields[16][0] = nextroll
            #print("Rolled", roll)
            arrplayer1 = []
            while balcycle1 < numberforcycles:
              ballnow = self.fields[balcycle1][2]
              ballindex = self.fields[balcycle1][0]
          # positions of player 1's pieces
              if ballnow != 0:  
               arrplayer1.append(ballindex)
              balcycle1 += 1
            if self.fields[8][3] == 1:
                arrplayer1.append(8)
            #print(arrplayer1) 

            balcycle1 = 0
            arrchoice1 = []
            while balcycle1 < numberforcycles:
              ballnow = self.fields[balcycle1][2]
              ballindex = self.fields[balcycle1][0]
              if ballnow != 0:  
                ballfuture = ballindex + roll
                if ballfuture <= numberforcycles:
                 if ballfuture not in arrplayer1:
                  arrchoice1.append(ballfuture)
              # print("Rolled", self.roll)
                  #print("current position", ballindex)
                  #print("possible position", ballfuture)
              balcycle1 += 1



            if arrchoice1:

              # random bot:
              playeronemove = random.choice(arrchoice1)

              # very greedy bot:
              #playeronemove = len(arrchoice1)
              #playeronemove = arrchoice1[playeronemove-1]

              # greedy bot:
              #playeronemove = 33
              #for arrchoice in arrchoice1:
              #  if self.fields[arrchoice][1] == 1 or (self.fields[arrchoice][3] == 1 and self.fields[arrchoice][0] >= 5 and self.fields[arrchoice][0] <= 12):
              #    playeronemove = self.fields[arrchoice][0]                
              #if playeronemove == 33:    
              #  playeronemove = len(arrchoice1)
              #  playeronemove = arrchoice1[playeronemove-1]

              #playeronemove = random.choice(arrchoice1)
              playeronepos = playeronemove - roll
              #print(playeronepos)
              self.fields[playeronepos][2] = self.fields[playeronepos][2] - 1
              self.fields[playeronemove][2] = self.fields[playeronemove][2] + 1
              if self.fields[playeronemove][3] == 1 and playeronemove >= 5 and playeronemove <= 12:
                self.fields[playeronemove][3] = 0
                self.fields[0][3] = self.fields[0][3] + 1
                #print("we captured a piece on", self.fields[playeronemove][0])
              #print("bot rolled", roll)
              #print("bot moves to", playeronemove)
              return playeronemove

     def second():
            numberforcycles = 15

            agentroll = self.fields[16][1] 

            lst = [0, 1, 2, 3, 4]
            weights = [6.25, 25, 37.5, 25, 6.25]
            nextagentroll = random.choices(lst, weights=weights, k=1)
            nextagentroll = nextagentroll[0]
            self.fields[16][1] = nextagentroll
            #print("Agent rolled", agentroll)

            agentcycle = 0
            arrplayeragent = []

            while agentcycle < numberforcycles:
              agentballnow = self.fields[agentcycle][3]
              agentballindex = self.fields[agentcycle][0]
          # positions of player 2's pieces
              if agentballnow != 0:  
               arrplayeragent.append(agentballindex)    
              agentcycle += 1
            if self.fields[8][2] == 1:
                arrplayeragent.append(8)

            #print("agent pieces", arrplayeragent)

            agentcycle = 0
            arrplayeragent2 = []

            while agentcycle < numberforcycles:
              agentballnow = self.fields[agentcycle][3]
              agentballindex = self.fields[agentcycle][0]
          # future positions of player 2's pieces
              if agentballnow != 0:  
               agentballfuture = agentballindex + agentroll  
               if agentballfuture <= numberforcycles:
                if agentballfuture not in arrplayeragent:
                  arrplayeragent2.append(agentballfuture) 
              agentcycle += 1

            #print(arrplayeragent2)

            

            if arrplayeragent2:
              playeragentmove = len(arrplayeragent2)
              playeragentmove = arrplayeragent2[0]#arrplayeragent2[playeragentmove-1]

              if action == self.FIRST:
                if len(arrplayeragent2) > 0:
                 playeragentmove = arrplayeragent2[0]
                else:
                 playeragentmove = playeragentmove
              elif action == self.SECOND:
               if len(arrplayeragent2) > 1:
                playeragentmove = arrplayeragent2[1]
               else:
                playeragentmove = playeragentmove
              elif action == self.THIRD:
               if len(arrplayeragent2) > 2:
                playeragentmove = arrplayeragent2[2]
               else:
                playeragentmove = playeragentmove
              elif action == self.FORTH:
               if len(arrplayeragent2) > 3:
                playeragentmove = arrplayeragent2[3]
               else:
                playeragentmove = playeragentmove
              elif action == self.FIFTH:
               if len(arrplayeragent2) > 4:
                playeragentmove = arrplayeragent2[4]
               else:
                playeragentmove = playeragentmove
              elif action == self.SIXTH:
               if len(arrplayeragent2) > 5:
                playeragentmove = arrplayeragent2[5]
               else:
                playeragentmove = playeragentmove
              elif action == self.SEVENTH:
               if len(arrplayeragent2) > 6:
                playeragentmove = arrplayeragent2[6]
               else:
                playeragentmove = playeragentmove 
              else:
                raise ValueError("Received invalid action={} which is not part of the action space".format(action))



              #print("Agent chooses", playeragentmove)

              #playeragentmove = random.choice(arrplayeragent)
              playeragentpos = playeragentmove - agentroll
              self.fields[playeragentpos][3] = self.fields[playeragentpos][3] - 1
              self.fields[playeragentmove][3] = self.fields[playeragentmove][3] + 1
              if self.fields[playeragentmove][2] == 1 and playeragentmove >= 5 and playeragentmove <= 12:
                self.fields[playeragentmove][2] = 0
                self.fields[0][2] = self.fields[0][2] + 1
                #print("Agent captured a piece on", self.fields[playeragentmove][0])
              #print("Agent moves to", playeragentmove)
              return playeragentmove
            else:          
              return None  # no legal move (for example after rolling 0): the agent does not move
     #while True:
     # thisbotmove = first()      
     # if thisbotmove != 4 or thisbotmove != 8 or thisbotmove != 14:
     #    #print("bot gets no second move")
     #    break

     if self.fields[16][3] != 1:
      thisbotmove = first()
      #print(first())
      if thisbotmove == 4 or thisbotmove == 8 or thisbotmove == 14:
       #print("Bot's second move", thisbotmove)
       self.fields[16][2] = 1
      else:
       self.fields[16][2] = 0 
    # if thisbotmove == 4 or thisbotmove == 8 or thisbotmove == 14:
     if self.fields[16][2] != 1:
      thisagentmove = second()
      #print(second())
      if thisagentmove == 4 or thisagentmove == 8 or thisagentmove == 14:
        #print("Agent's second move", thisagentmove)
        self.fields[16][3] = 1
      else:
        self.fields[16][3] = 0 
     #while True:
     # thisagentmove =  second()
     # if thisagentmove != 4 or thisagentmove != 8 or thisagentmove != 14:
     #    #print("agent gets no second move")
     #    break


     done = bool(self.fields[15][2] >= 7 or self.fields[15][3] >= 7)


     reward = 0
     if self.fields[15][3] >= 7:
       reward = 100

    # Optionally we can pass additional info, we are not using that for now
     info = {}

     return np.array(self.fields).astype(np.float32), reward, done, info


  def render(self, mode='console'):
    if mode != 'console':
      raise NotImplementedError()
    # Nothing is drawn here; the test cells print the board themselves

    #print(self.fields[0])

  def close(self):
    pass
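
In `second()`, the action is an index into the list of currently legal target squares, and an index beyond the end of that list falls back to the first candidate. The sketch below expresses that mapping compactly. `select_move` is a hypothetical helper name introduced here for illustration; it is not part of the class above.

```python
def select_move(candidates, action, n_actions=7):
    """Map a discrete action to one of the candidate target squares.

    candidates: list of legal target squares (may be empty)
    action: integer expected to lie in [0, n_actions)
    Returns the chosen square, or None when there is no legal move.
    """
    if not candidates:
        return None
    if not 0 <= action < n_actions:
        raise ValueError(
            f"Received invalid action={action} which is not part of the action space"
        )
    # Out-of-range indices fall back to the first candidate, as in the if/elif chain
    return candidates[action] if action < len(candidates) else candidates[0]


print(select_move([3, 5, 9], 1))  # second candidate
print(select_move([3, 5, 9], 6))  # index too large, falls back to the first candidate
```

One consequence of this fallback is that several action indices can lead to the same move. If that turns out to hurt learning, invalid-action masking (for example `MaskablePPO` from the `sb3-contrib` package installed at the top of this notebook) might be worth a try; this is a suggestion, not something the notebook does.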
    

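The roll weights `[6.25, 25, 37.5, 25, 6.25]` used in `first()` and `second()` coincide with the distribution of the number of successes among four fair binary dice, that is, a binomial distribution with n = 4 and p = 0.5. That interpretation is an assumption on our part (the code only lists the weights), but the numbers can be checked in a few lines:

```python
from math import comb

# Probability (in percent) of rolling k with four fair binary dice
weights = [100 * comb(4, k) / 2**4 for k in range(5)]
print(weights)  # [6.25, 25.0, 37.5, 25.0, 6.25]
```
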
### Testing the environment

In [None]:
# Board symbols and printing helper for this cell
RED, BLUE, WHITE, BLACK = "\U0001F7E5", "\U0001F7E6", "\u2B1C", "\u2B1B"


def print_board(obs):
    """Print the board: bot pieces (column 2) in red, agent pieces (column 3) in blue."""
    def lane(i, col, colour):
        # Private lane square: coloured if one piece of that player stands on it
        return colour if obs[i][col] == 1 else WHITE

    def shared(i):
        # Middle lane square: can hold a piece of either player
        if obs[i][2] == 1:
            return RED
        return BLUE if obs[i][3] == 1 else WHITE

    # (left/bot square, middle square, right/agent square); None marks a gap in the board
    rows = [(4, 5, 4), (3, 6, 3), (2, 7, 2), (1, 8, 1),
            (None, 9, None), (None, 10, None), (14, 11, 14), (13, 12, 13)]
    for left, mid, right in rows:
        if left is None:
            print(BLACK + shared(mid) + BLACK)
        else:
            print(lane(left, 2, RED) + shared(mid) + lane(right, 3, BLUE))


n_episodes = 10
n_steps = 200
episode_rewards = []
for episode in range(n_episodes):
    env = GoLeftEnv()
    obs = env.reset()
    for _ in range(n_steps):
        # Random action
        action = env.action_space.sample()
        print("Action: ", action)
        obs, reward, done, info = env.step(action)
        print('obs=', obs, 'reward=', reward, 'done=', done)
        print("observation space shape:", env.observation_space.shape)
        print_board(obs)
        env.render(mode='console')
        if done:
            print("Goal reached!", "reward=", reward)
            episode_rewards.append(reward)
            break

print(episode_rewards)
# Episodes won by the agent have a positive reward
win = [r for r in episode_rewards if r > 0]
notwin = [r for r in episode_rewards if r <= 0]
print(win)
print(notwin)
print(len(win))
print(len(notwin))
# sample action:
print("sample action:", env.action_space.sample())

# observation space shape:
print("observation space shape:", env.observation_space.shape)

# sample observation:
print("sample observation:", env.observation_space.sample())

Streaming output truncated to the last 5000 lines.
⬜⬜⬜
⬜⬜⬜
⬛⬜⬛
⬛⬜⬛
⬜⬜⬜
⬜⬜⬜
Action:  6
obs= [[ 0.  0.  6.  7.]
 [ 1.  0.  1.  0.]
 [ 2.  0.  0.  0.]
 [ 3.  0.  0.  0.]
 [ 4.  1.  0.  0.]
 [ 5.  0.  0.  0.]
 [ 6.  0.  0.  0.]
 [ 7.  0.  0.  0.]
 [ 8.  1.  0.  0.]
 [ 9.  0.  0.  0.]
 [10.  0.  0.  0.]
 [11.  0.  0.  0.]
 [12.  0.  0.  0.]
 [13.  0.  0.  0.]
 [14.  1.  0.  0.]
 [15.  0.  0.  0.]
 [ 2.  1.  0.  0.]] reward= 0 done= False
observation space shape: (17, 4)
⬜⬜⬜
⬜⬜⬜
⬜⬜⬜
🟥⬜⬜
⬛⬜⬛
⬛⬜⬛
⬜⬜⬜
⬜⬜⬜
Action:  2
obs= [[ 0.  0.  6.  6.]
 [ 1.  0.  0.  1.]
 [ 2.  0.  0.  0.]
 [ 3.  0.  1.  0.]
 [ 4.  1.  0.  0.]
 [ 5.  0.  0.  0.]
 [ 6.  0.  0.  0.]
 [ 7.  0.  0.  0.]
 [ 8.  1.  0.  0.]
 [ 9.  0.  0.  0.]
 [10.  0.  0.  0.]
 [11.  0.  0.  0.]
 [12.  0.  0.  0.]
 [13.  0.  0.  0.]
 [14.  1.  0.  0.]
 [15.  0.  0.  0.]
 [ 1.  2.  0.  0.]] reward= 0 done= False
observation space shape: (17, 4)
⬜⬜⬜
🟥⬜⬜
⬜⬜⬜
⬜⬜🟦
⬛⬜⬛
⬛⬜⬛
⬜⬜⬜
⬜⬜⬜
Action:  3
obs= [[ 0.  0.  6.  6.]
 [ 1.

### Try it with Stable-Baselines

Once your environment follows the gym interface, it is quite easy to plug in any algorithm from Stable-Baselines3.

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
#from stable_baselines3.common.cmd_util import make_vec_env

# Instantiate the env
env = GoLeftEnv(grid_size=10)
# wrap it
env = make_vec_env(lambda: env, n_envs=1)

In [None]:
# Train the agent
model = PPO('MlpPolicy', env, verbose=1).learn(100000)

Using cpu device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 96.9     |
|    ep_rew_mean     | 9.52     |
| time/              |          |
|    fps             | 893      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 101          |
|    ep_rew_mean          | 7.5          |
| time/                   |              |
|    fps                  | 793          |
|    iterations           | 2            |
|    time_elapsed         | 5            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0098211365 |
|    clip_fraction        | 0.0753       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.94        |
|    explained_variance   | 0.00666      

In [None]:
import os
PPO_path = os.path.join('Training', 'Saved Models', '5millionppo')
model.save(PPO_path)

In [None]:
from stable_baselines3.common.evaluation import evaluate_policy
#evaluate_policy(model, env, n_eval_episodes=1000)
#run.finish()
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

In [None]:
# Board symbols and printing helper for this cell
RED, BLUE, WHITE, BLACK = "\U0001F7E5", "\U0001F7E6", "\u2B1C", "\u2B1B"


def print_board(obs):
    """Print the board: bot pieces (column 2) in red, agent pieces (column 3) in blue."""
    def lane(i, col, colour):
        # Private lane square: coloured if one piece of that player stands on it
        return colour if obs[i][col] == 1 else WHITE

    def shared(i):
        # Middle lane square: can hold a piece of either player
        if obs[i][2] == 1:
            return RED
        return BLUE if obs[i][3] == 1 else WHITE

    # (left/bot square, middle square, right/agent square); None marks a gap in the board
    rows = [(4, 5, 4), (3, 6, 3), (2, 7, 2), (1, 8, 1),
            (None, 9, None), (None, 10, None), (14, 11, 14), (13, 12, 13)]
    for left, mid, right in rows:
        if left is None:
            print(BLACK + shared(mid) + BLACK)
        else:
            print(lane(left, 2, RED) + shared(mid) + lane(right, 3, BLUE))


n_episodes = 50
n_steps = 200
episode_rewards = []
for episode in range(n_episodes):
    env = GoLeftEnv()
    obs = env.reset()
    for _ in range(n_steps):
        # Action chosen by the trained model
        action, _states = model.predict(obs, deterministic=True)
        print("Action: ", action)
        obs, reward, done, info = env.step(action)
        print('obs=', obs, 'reward=', reward, 'done=', done)
        print("observation space shape:", env.observation_space.shape)
        print_board(obs)
        env.render(mode='console')
        if done:
            print("Goal reached!", "reward=", reward)
            episode_rewards.append(reward)
            break

print(episode_rewards)
# Episodes won by the agent have a positive reward
win = [r for r in episode_rewards if r > 0]
notwin = [r for r in episode_rewards if r <= 0]
print(win)
print(notwin)
print(len(win))
print(len(notwin))
# sample action:
print("sample action:", env.action_space.sample())

# observation space shape:
print("observation space shape:", env.observation_space.shape)

# sample observation:
print("sample observation:", env.observation_space.sample())

Streaming output truncated to the last 5000 lines.
⬜🟥⬜
⬛🟥⬛
⬛⬜⬛
⬜⬜🟦
⬜⬜⬜
Action:  2
obs= [[ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 2.  0.  0.  0.]
 [ 3.  0.  0.  0.]
 [ 4.  1.  0.  0.]
 [ 5.  0.  0.  1.]
 [ 6.  0.  0.  0.]
 [ 7.  0.  0.  1.]
 [ 8.  1.  1.  0.]
 [ 9.  0.  1.  0.]
 [10.  0.  0.  0.]
 [11.  0.  0.  0.]
 [12.  0.  0.  0.]
 [13.  0.  0.  0.]
 [14.  1.  0.  1.]
 [15.  0.  5.  4.]
 [ 1.  2.  0.  0.]] reward= 0 done= False
observation space shape: (17, 4)
⬜🟦⬜
⬜⬜⬜
⬜🟦⬜
⬜🟥⬜
⬛🟥⬛
⬛⬜⬛
⬜⬜🟦
⬜⬜⬜
Action:  2
obs= [[ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 2.  0.  0.  0.]
 [ 3.  0.  0.  0.]
 [ 4.  1.  0.  0.]
 [ 5.  0.  0.  1.]
 [ 6.  0.  0.  0.]
 [ 7.  0.  0.  0.]
 [ 8.  1.  1.  0.]
 [ 9.  0.  0.  1.]
 [10.  0.  1.  0.]
 [11.  0.  0.  0.]
 [12.  0.  0.  0.]
 [13.  0.  0.  0.]
 [14.  1.  0.  1.]
 [15.  0.  5.  4.]
 [ 2.  3.  0.  0.]] reward= 0 done= False
observation space shape: (17, 4)
⬜🟦⬜
⬜⬜⬜
⬜⬜⬜
⬜🟥⬜
⬛🟦⬛
⬛🟥⬛
⬜⬜🟦
⬜⬜⬜
Action:  2
obs= [[ 0.  0.  1.  0.]
 [ 1.  0.

## It is your turn now, be creative!

As an exercise, it is now your turn to build a custom gym environment.
There is no constraint on what to do, be creative! (but not too creative, there is not enough time for that)

If you don't have any ideas, here is a list of environments you can implement:
- Transform the discrete grid world into a continuous one; you will need to change the logic and the action space a bit
- Create a 2D grid world and add walls
- Create a tic-tac-toe game
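
For the tic-tac-toe idea, one possible starting point is sketched below in plain Python. Everything in it (the `TicTacToe` class name, the 0/1/2 cell encoding, the random opponent, the reward values) is an assumption made for illustration. To use it with Stable-Baselines3 you would still need to subclass the gym `Env` class, declare `action_space` and `observation_space`, and return NumPy arrays as observations.

```python
import random

# All eight winning lines of the 3x3 board, as cell indices 0..8
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 1 or 2 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


class TicTacToe:
    """Agent (player 1) against a random opponent (player 2)."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.board = [0] * 9

    def reset(self):
        self.board = [0] * 9
        return list(self.board)

    def step(self, action):
        # action is the index (0..8) of the cell the agent wants to mark
        if self.board[action] != 0:
            # Illegal move: end the episode with a penalty (a design choice)
            return list(self.board), -1.0, True, {"illegal": True}
        self.board[action] = 1
        if winner(self.board) == 1:
            return list(self.board), 1.0, True, {}
        free = [i for i, cell in enumerate(self.board) if cell == 0]
        if not free:
            return list(self.board), 0.0, True, {}  # draw
        # The opponent answers with a random free cell
        self.board[self.rng.choice(free)] = 2
        if winner(self.board) == 2:
            return list(self.board), -1.0, True, {}
        return list(self.board), 0.0, False, {}
```

Ending the episode on an illegal move keeps the sketch short; a different choice, such as ignoring the move or masking invalid actions, may work better in practice.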
