# Learn Reinforcement Learning in Python: Step-by-step Tutorial

```
$ pip install "gymnasium[atari]"
$ pip install "autorom[accept-rom-license]"
$ AutoROM --accept-license
```

In [7]:
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")

In [8]:
env.action_space

Discrete(4)

In [12]:
env.action_space.sample()

1

In [9]:
env.observation_space

Box(0, 255, (210, 160, 3), uint8)

In [11]:
state = env.observation_space.sample()

state.shape

(210, 160, 3)

In [18]:
state, info = env.reset()

state[:3, :3, :3]

array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]], dtype=uint8)

In [19]:
info

{'lives': 5, 'episode_frame_number': 0, 'frame_number': 0}

In [23]:
observation, reward, terminated, truncated, info = env.step(1)

In [27]:
info

{'lives': 5, 'episode_frame_number': 12, 'frame_number': 12}

In [34]:
import gymnasium as gym

env = gym.make("ALE/Breakout-v5", render_mode="human")
observation, info = env.reset()

for _ in range(100):
    action = (
        env.action_space.sample()
    )  # agent policy that uses the observation and info
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()

In [39]:
env.render?

Signature: env.render(*args, **kwargs)
Docstring: Renders the environment with `kwargs`.
File:      ~/.local/lib/python3.8/site-packages/gymnasium/wrappers/order_enforcing.py
Type:      method


In [40]:
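epochs = 0
penalties, reward = 0, 0

frames = []  # per-step records, used later for the animation

done = False

env = gym.make("ALE/Breakout-v5", render_mode="rgb_array")
observation, info = env.reset()

while not done:
    action = env.action_space.sample()  # random policy
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # the episode ends on either signal

    # Penalty check carried over from a penalty-based environment;
    # Breakout is not expected to return -10, so this should stay at 0.
    if reward == -10:
        penalties += 1

    # Store the rendered frame and the step data in a dict for the animation
    frames.append(
        {
            "frame": env.render(),
            "state": observation,
            "action": action,
            "reward": reward,
        }
    )

    epochs += 1
    break  # debugging aid: stop after one step; remove to play a full episode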


print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 1
Penalties incurred: 0



In [41]:
frames

[{'frame': array([[[0, 0, 0],
          [0, 0, 0],
          [0, 0, 0],
          ...,
          [0, 0, 0],
          [0, 0, 0],
          [0, 0, 0]],
  
         ...,
  
         [[0, 0, 0],
          [0, 0, 0],
          [0, 0, 0],
          ...,
          [0, 0, 0],
          [0, 0, 0],
          [0, 0, 0]]], dtype=uint8),
  'state': array([[[0, 0, 0],
          [0, 0, 0],
          [0, 0, 0],
          ...,

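`env.reset()` starts a new episode and returns the initial observation together with an `info` dict.
`done = False` is a flag that tracks whether the episode is over; it should become `True` once `env.step` reports `terminated` or `truncated`.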
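
The per-step records collected in `frames` could be turned into an animation roughly as follows. This is a minimal sketch, assuming `imageio` is installed and that each record keeps its rendered image under the `"frame"` key as in the loop above. The helper names `extract_rgb_frames` and `save_gif` and the output filename are hypothetical, chosen only for illustration.

```python
def extract_rgb_frames(frames):
    """Pull the rendered image out of each per-step record.

    Records whose "frame" entry is None (for example, when the environment
    was created without a render mode) are skipped.
    """
    return [record["frame"] for record in frames if record.get("frame") is not None]


def save_gif(frames, path="breakout_random.gif"):
    """Write the rendered frames to an animated GIF at `path` (hypothetical name)."""
    import imageio  # imported lazily so extract_rgb_frames works without it

    imageio.mimsave(path, extract_rgb_frames(frames))
    return path


# Usage, once `frames` has been filled by the random-policy loop:
# save_gif(frames)
```

Keeping the extraction separate from the writing step makes it easy to inspect or subsample the images before saving, which helps when a full episode produces many frames.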


### Article plan

1. Define what reward, state and actions are for the current problem
2. Show how to install gymnasium with cmake and scipy
3. Show how to render the env in both human and rgb_array mode
4. Explain env.reset, step and render methods
5. Pseudo-code for solving the environment without RL:
   - Initialize epochs, penalties, reward and an empty list to store frames
   - Define the `done` variable
   - While not done, get a random action and execute with step
   - Increase or decrease the penalty based on the reward
   - Append the current frame to frames using rgb_array mode of render
   - Increase the number of epochs
6. Pseudo-code to display the frames as a GIF
   - Using imageio, collect all rgb-arrays in frames and put them together as a gif
7. Pseudo-code to solve the environment with Q-learning
   - Define the hyperparameters - alpha, gamma, epsilon
   - Define the q_table with shape (number of states, number of actions)
   - For a large number of epochs:
     - Reset the environment
     - Initialize epochs, penalties and reward with 0 values
     - While not done:
       - Generate a random value to compare with epsilon (the exploration vs. exploitation trade-off)
       - If the random value is smaller than epsilon, choose a random action; otherwise, take the argmax of q_table for the current state, i.e. the action with the highest estimated value
       - Take the action
       - Find the old value for the current state and action
       - Find the max value for the next state
       - Create a new value using the Q-learning formula
       - Update the q_table for the current state and action with new_value
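
The Q-learning steps in item 7 can be sketched as below. A tabular `q_table` needs a small, discrete state space, while Breakout observations are `(210, 160, 3)` images, so this sketch swaps in a hypothetical five-state corridor, `ChainEnv`, that only mimics the `reset`/`step` return signature used earlier. The class, the function name `train_q_table`, and the hyperparameter values are all assumptions made for illustration, not part of gymnasium.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor mimicking the gymnasium reset/step API.

    Action 1 moves right, action 0 moves left. Reaching the last state
    terminates the episode with reward 1; episodes are truncated at 100 steps.
    """

    n_states = 5
    n_actions = 2
    max_steps = 100

    def reset(self):
        self.state = 0
        self.steps = 0
        return self.state, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        self.steps += 1
        terminated = self.state == self.n_states - 1
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, truncated, {}


def train_q_table(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # One row per state, one column per action, all starting at zero
    q_table = [[0.0] * env.n_actions for _ in range(env.n_states)]

    for _ in range(episodes):
        state, info = env.reset()
        done = False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(env.n_actions)  # explore
            else:
                # Exploit: pick the best-valued action, breaking ties randomly
                best = max(q_table[state])
                action = rng.choice(
                    [a for a, q in enumerate(q_table[state]) if q == best]
                )

            next_state, reward, terminated, truncated, info = env.step(action)

            old_value = q_table[state][action]
            # A terminal state has no future value
            next_max = 0.0 if terminated else max(q_table[next_state])
            # Q-learning update: move old_value toward reward + gamma * next_max
            new_value = old_value + alpha * (reward + gamma * next_max - old_value)
            q_table[state][action] = new_value

            state = next_state
            done = terminated or truncated

    return q_table


q_table = train_q_table(ChainEnv())
# Greedy action per state, read off the learned table
greedy_policy = [row.index(max(row)) for row in q_table]
```

On this toy corridor the greedy policy read from the table should prefer action 1 (move right) in every non-terminal state. Applying the same loop to Breakout would first require mapping image observations to discrete table indices, which is outside what this sketch covers.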