# Gym by OpenAI
Gym is a toolkit for developing and comparing reinforcement learning algorithms. <br>
It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano. <br>

The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. <br>
These environments have a shared interface, allowing you to write general algorithms.
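
To make the idea of a shared interface concrete, here is a minimal sketch in plain Python. `CoinFlipEnv` and `run_episode` are hypothetical names made up for this illustration and are not part of Gym. The toy class only mimics the `reset()` / `step()` signatures that the Gymnasium-based cells in this notebook rely on, so the same rollout function would work with any object that follows that shape.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment (not part of Gym) with a Gym-style interface."""

    def __init__(self, episode_length=10):
        self.episode_length = episode_length
        self.steps = 0
        self.rng = random.Random()
        self.coin = 0

    def reset(self, seed=None):
        # Start a new episode and return (observation, info)
        if seed is not None:
            self.rng.seed(seed)
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin, {}

    def step(self, action):
        # Reward 1.0 when the action matches the coin shown in the last observation
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        terminated = False                              # this toy task has no failure state
        truncated = self.steps >= self.episode_length   # time limit reached
        return self.coin, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Generic rollout: works with any object exposing reset() and step()."""
    obs, info = env.reset(seed=seed)
    total_reward = 0.0
    while True:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            return total_reward


# A policy that copies the observed coin matches it on every step
print(run_episode(CoinFlipEnv(episode_length=10), policy=lambda obs: obs, seed=0))  # 10.0
```

Because `run_episode` never looks inside the environment, swapping in a different task only means passing a different object and policy.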

# Installation
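To get started, you'll need a recent Python 3 release (the original gym supported Python 3.5+; check the Gymnasium documentation for the versions it currently supports). The code cells in this notebook import `gymnasium`, the maintained successor to gym, so install that package using pip or conda (a conda command is given in the cell below). <br>
Please note that this step works well on Mac and Linux. <br>
If you are using Windows, you may need a workaround; see this guide: https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30 <br>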

Please remember, you will need to install the following libraries: <br>
1. pystan
2. swig
3. Box2D

You may need to install other libraries depending on the environments that you are using.

In [None]:
#conda install conda-forge::gymnasium


## The environments
We will be looking at a few environments:
1. CartPole-v1
2. MountainCar-v0
3. BipedalWalker-v3
4. LunarLander-v3
5. CarRacing-v3
6. Pendulum-v1
7. Acrobot-v1
8. Taxi-v3
9. Copy-v0 (not demonstrated in this notebook)

We will look at the problems posed. <br>
As this is an introductory course, we will not go into the solutions in depth. <br>
You are encouraged to find the answers online by yourself.

## 1. Cart Pole
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. <br>
The system is controlled by applying a force of +1 or -1 to the cart. <br>
The pendulum starts upright, and the goal is to prevent it from falling over. <br>
A reward of +1 is provided for every timestep that the pole remains upright. <br>
The episode ends when the pole is more than 12 degrees from vertical or the cart moves more than 2.4 units from the center.

In [2]:
# Random-action demo using the maintained Gymnasium package
import gymnasium as gym

# Create the environment with a display window
env = gym.make("CartPole-v1", render_mode="human")

# Start a new episode
obs, info = env.reset(seed=0)

# Run for 200 time steps using random actions
for _ in range(200):
    # Sample a random action from the action space
    action = env.action_space.sample()
    
    # Take a step in the environment
    obs, reward, terminated, truncated, info = env.step(action)
    
    # Episode ends if either condition is True
    if terminated or truncated:
        obs, info = env.reset()

# Always close the environment to release resources
env.close()


In [4]:
# CartPole: simple heuristic "solution" (better than random)
# Uses Gymnasium API and proper render_mode

import gymnasium as gym
import numpy as np

# Make an on-screen environment. Use render_mode="rgb_array" if you're headless.
env = gym.make("CartPole-v1", render_mode="human")
obs, info = env.reset(seed=0)

total_reward = 0.0

for _ in range(200):
    # Observation format: [cart_x, cart_v, pole_angle, pole_ang_vel]
    cart_x, cart_v, theta, theta_dot = obs

    # Heuristic controller:
    # push right if pole leans right (theta > 0), else push left.
    # A tiny PD-ish nudge using angular velocity helps stability:
    action = int(theta + 0.1 * theta_dot > 0)   # 0 = left, 1 = right

    # Take a step
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

    # End episode if done or time limit reached
    if terminated or truncated:
        break

env.close()
print(f"Episode return (heuristic): {total_reward:.1f}")


Episode return (heuristic): 200.0


## 2. Mountain Car
A car is on a one-dimensional track, positioned between two "mountains". <br>
The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. <br>
Therefore, the only way to succeed is to drive back and forth to build up momentum.
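
The momentum argument can be checked without a display by simulating the car directly. The sketch below is plain Python and does not call Gym. The physics constants are the values I recall from the MountainCar-v0 source and should be treated as assumptions, and the 1000-step budget is an arbitrary choice for the illustration (the real environment truncates episodes earlier). It compares always pushing right against pushing in the direction the car is already moving.

```python
import math

# Constants as recalled from the MountainCar-v0 source; treat them as assumptions
FORCE = 0.001
GRAVITY = 0.0025
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5


def step(position, velocity, action):
    """Advance one time step; action is 0 (push left), 1 (no push), or 2 (push right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def steps_to_goal(policy, start=-0.5, max_steps=1000):
    """Return the number of steps needed to reach the goal, or None if the budget runs out."""
    position, velocity = start, 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POS:
            return t
    return None


def always_right(position, velocity):
    return 2


def with_momentum(position, velocity):
    # Push in the direction of travel; push left when standing still
    return 2 if velocity > 0 else 0


print("always right:", steps_to_goal(always_right))
print("follow momentum:", steps_to_goal(with_momentum))
```

Under these assumed dynamics the always-right policy settles into a small oscillation near the valley floor, while the momentum policy adds energy on every swing until the car clears the hill.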

In [7]:
# MountainCar: random action demo (Gymnasium version)
import gymnasium as gym

# Create the environment with rendering
env = gym.make("MountainCar-v0", render_mode="human")

# Reset to start a new episode
obs, info = env.reset(seed=0)

# Run for 500 steps with random actions
for _ in range(500):
    action = env.action_space.sample()        # choose random action (0, 1, or 2)
    obs, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        obs, info = env.reset()               # restart if the episode ends

env.close()


In [8]:
# MountainCar: simple heuristic demo using Gymnasium
import gymnasium as gym
import numpy as np

# Create the environment (set render_mode="rgb_array" if no display)
env = gym.make("MountainCar-v0", render_mode="human")
obs, info = env.reset(seed=0)

total_reward = 0.0

for t in range(5100):
    # Observation: [position, velocity]
    position, velocity = obs

    # Heuristic policy:
    # If the car is moving right, keep accelerating right (action=2)
    # If it's moving left, accelerate left (action=0)
    # This builds momentum to reach the flag at the top.
    action = 2 if velocity > 0 else 0

    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

    if terminated or truncated:
        break

env.close()
print(f"Episode return (heuristic): {total_reward:.1f}")


Episode return (heuristic): -101.0


## 3. Bipedal Walker
Reward is given for moving forward, totaling 300+ points by the far end. <br>
If the robot falls, it gets -100. <br>
Applying motor torque costs a small number of points, so a more efficient agent will get a better score. <br>
State consists of hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints and joints angular speed, legs contact with ground, and 10 lidar rangefinder measurements. <br>
There are no coordinates in the state vector.

In [3]:
# BipedalWalker: random action demo using Gymnasium
import gymnasium as gym

# Create the environment (requires gymnasium[box2d] and pygame)
env = gym.make("BipedalWalker-v3", render_mode="human")

# Reset to start a new episode
obs, info = env.reset(seed=0)

# Run for 1000 steps with random continuous actions
for _ in range(1000):
    action = env.action_space.sample()  # sample random torque values (4D)
    obs, reward, terminated, truncated, info = env.step(action)
    
    # Restart if the episode ends
    if terminated or truncated:
        obs, info = env.reset()

env.close()



In [2]:
# BipedalWalker: random actions with exponential smoothing
import gymnasium as gym
import numpy as np

env = gym.make("BipedalWalker-v3", render_mode="human")

# Smoothing factor: higher values keep more of the previous action.
# I experimented with values from 0.8 to 0.95; this one worked best in my runs.
alpha = 0.85
prev_action = np.zeros(env.action_space.shape, dtype=np.float32)

obs, info = env.reset(seed=0)
ep_reward = 0.0
for t in range(1000):
    # Blend a fresh random action with the previous one to reduce jitter
    raw = env.action_space.sample()
    action = alpha * prev_action + (1 - alpha) * raw
    action = np.clip(action, env.action_space.low, env.action_space.high)

    obs, reward, terminated, truncated, info = env.step(action)
    prev_action = action
    ep_reward += reward

    # Report the return and start over when the episode ends
    if terminated or truncated:
        print(f"Episode reward with smoothing: {ep_reward:.1f}")
        obs, info = env.reset()
        prev_action[:] = 0
        ep_reward = 0.0

env.close()



Episode reward with smoothing: -116.8
Episode reward with smoothing: -120.3


## 4. Lunar Lander
The landing pad is always at coordinates (0,0). <br>
The coordinates are the first two numbers in the state vector. <br>
The reward for moving from the top of the screen to the landing pad and reaching zero speed is about 100 to 140 points. <br>
If the lander moves away from the landing pad, it loses that reward. <br>
The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points. <br>
Each leg ground contact is +10. Firing the main engine costs 0.3 points each frame. The task counts as solved at 200 points. <br>
Landing outside the landing pad is possible. <br>
Fuel is infinite, so an agent can learn to fly and then land on its first attempt. <br>
Four discrete actions are available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.
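
As a small worked example of the fuel cost, the sketch below tallies the main-engine penalty for a sequence of discrete actions. It assumes the actions are numbered from 0 in the order listed above, so the main engine is action 2, and the action sequence itself is made up for the illustration. `main_engine_penalty` is a hypothetical helper, not a Gym function.

```python
MAIN_ENGINE = 2            # third action in the order listed above (0-based index)
MAIN_ENGINE_COST = 0.3     # points lost per frame, from the description above


def main_engine_penalty(actions):
    """Total points lost to main-engine firing over a sequence of discrete actions."""
    return -MAIN_ENGINE_COST * sum(1 for a in actions if a == MAIN_ENGINE)


episode_actions = [0, 2, 2, 1, 2, 3, 0, 2]   # made-up action sequence for illustration
print(f"{main_engine_penalty(episode_actions):.1f}")   # four main-engine frames -> -1.2
```

A policy that fires the main engine on every frame of a long episode gives up a noticeable share of the 200 points needed, which is one reason to sample that action sparingly.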

In [5]:
# LunarLander: random action demo using Gymnasium
import gymnasium as gym

# Create the environment (requires gymnasium[box2d] and pygame)
env = gym.make("LunarLander-v3", render_mode="human")

# Reset to start a new episode
obs, info = env.reset(seed=0)

# Run for 500 steps with random actions
for _ in range(500):
    action = env.action_space.sample()            # 0: do nothing, 1: left, 2: main, 3: right
    obs, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:                   # restart if the episode ends
        obs, info = env.reset()

env.close()



In [6]:
# LunarLander: cost-aware random sampling (fires the main engine less often)
import gymnasium as gym
import numpy as np

env = gym.make("LunarLander-v3", render_mode="human")

# Actions: 0 = do nothing, 1 = left engine, 2 = main engine, 3 = right engine
A = np.array([0, 1, 2, 3])
# Sampling probabilities: firing the main engine costs points, so it is chosen rarely
P = np.array([0.45, 0.25, 0.05, 0.25], dtype=float)

obs, info = env.reset(seed=0)
ep_reward = 0.0
for t in range(500):
    # Draw an action according to the probabilities in P
    action = int(np.random.choice(A, p=P))
    obs, reward, terminated, truncated, info = env.step(action)
    ep_reward += reward

    # Report the return and start over when the episode ends
    if terminated or truncated:
        print(f"Episode reward (cost-aware sampling): {ep_reward:.1f}")
        obs, info = env.reset()
        ep_reward = 0.0

env.close()



Episode reward (cost-aware sampling): -117.6
Episode reward (cost-aware sampling): -156.2
Episode reward (cost-aware sampling): -130.6
Episode reward (cost-aware sampling): -151.8
Episode reward (cost-aware sampling): -123.4
Episode reward (cost-aware sampling): -156.6
Episode reward (cost-aware sampling): -151.2


## 5. Car Racing
The easiest continuous control task to learn from pixels: a top-down racing environment. <br>
Discrete control is reasonable in this environment as well; on/off discretisation is fine. <br>
State consists of 96x96 pixels. <br>
Reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. <br>
For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points. <br>
The episode finishes when all tiles are visited. Some indicators are shown at the bottom of the window and in the state RGB buffer. <br>
From left to right: true speed, four ABS sensors, steering wheel position, gyroscope.
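
The reward rule above can be written as a short function. This is only a sketch of the arithmetic in the description; `car_racing_return` is a hypothetical helper, and the tile count of 300 is an arbitrary value chosen for the example (the result for a full lap does not depend on it).

```python
def car_racing_return(frames, tiles_visited, total_tiles):
    """Episode return: +1000/N per visited tile, -0.1 per frame."""
    tile_reward = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_reward - frame_penalty


# Full lap in 732 frames, matching the worked example in the text
print(round(car_racing_return(frames=732, tiles_visited=300, total_tiles=300), 1))  # 926.8

# Half the tiles in the same number of frames
print(round(car_racing_return(frames=732, tiles_visited=150, total_tiles=300), 1))  # 426.8
```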

In [7]:
# CarRacing: random action demo using Gymnasium
import gymnasium as gym

# Create the environment (requires gymnasium[box2d] and pygame)
env = gym.make("CarRacing-v3", render_mode="human")

# Reset to start a new episode
obs, info = env.reset(seed=0)

# Run for 200 steps with random continuous actions
for _ in range(200):
    action = env.action_space.sample()  # shape (3,) = [steering, gas, brake]
    obs, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        obs, info = env.reset()

env.close()


In [8]:
# CarRacing: random actions held for several frames and smoothed
import gymnasium as gym
import numpy as np

env = gym.make("CarRacing-v3", render_mode="human")

hold_frames = 5          # keep each sampled action for this many frames
frame_count = 0

obs, info = env.reset(seed=0)
ep_reward = 0.0
for t in range(2000):
    # Sample a new random action at the start of each hold window
    if frame_count % hold_frames == 0:
        current_action = env.action_space.sample()
    frame_count += 1

    # Blend with the previous action so the controls change gradually
    if t > 0:
        current_action = 0.8 * current_action + 0.2 * prev_action
    prev_action = current_action

    obs, reward, terminated, truncated, info = env.step(current_action)
    ep_reward += reward

    # Report the return and start over when the episode ends
    if terminated or truncated:
        print(f"Episode reward (persistent actions): {ep_reward:.1f}")
        obs, info = env.reset()
        frame_count, ep_reward = 0, 0.0

env.close()


Episode reward (persistent actions): -37.3
Episode reward (persistent actions): -25.3


## 6. Pendulum
The inverted pendulum swing-up problem is a classic problem in the control literature. <br>
In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright.
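
To help interpret the episode returns printed for this environment, the sketch below computes the per-step reward. The formula is not stated in the text above; it is the one I understand the Gymnasium Pendulum documentation to give, with the angle measured from upright, so treat it as an assumption. `pendulum_reward` is a hypothetical helper name.

```python
import math


def pendulum_reward(theta, theta_dot, torque):
    """Per-step reward, assuming the formula from the Gymnasium Pendulum docs."""
    # Wrap the angle into [-pi, pi) so that 0 means upright
    theta = (theta + math.pi) % (2 * math.pi) - math.pi
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)


# Best case: upright, not moving, no torque applied (reward of zero)
best = pendulum_reward(0.0, 0.0, 0.0)

# Worst case: hanging down at maximum speed (8) with maximum torque (2)
worst = pendulum_reward(math.pi, 8.0, 2.0)
print(round(worst, 4))   # -16.2736
```

Under this formula the reward is never positive, so a return near zero means the pendulum spent most of the episode balanced upright.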

In [17]:
# Pendulum: random action demo using Gymnasium
import gymnasium as gym

# Create the environment (requires gymnasium[classic-control] and pygame)
env = gym.make("Pendulum-v1", render_mode="human")

# Reset to start a new episode
obs, info = env.reset(seed=0)

# Run for 1000 steps with random continuous actions
for _ in range(1000):
    action = env.action_space.sample()      # one continuous torque value in [-2, 2]
    obs, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        obs, info = env.reset()

env.close()


In [3]:
import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1", render_mode="human")

obs, info = env.reset(seed=0)

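def angle_and_rate(obs):
    # Observation is [cos(theta), sin(theta), angular velocity]
    c, s, thdot = obs
    return np.arctan2(s, c), thdot

Kp, Kd = 2.0, 0.5        # proportional and derivative gains
noise_std = 0.05         # standard deviation of the exploration noise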

obs, info = env.reset(seed=0)
ep_reward = 0.0
for t in range(1000):
    theta, theta_dot = angle_and_rate(obs)
    u = -(Kp * theta + Kd * theta_dot) + np.random.randn() * noise_std
    action = np.array([np.clip(u, env.action_space.low[0], env.action_space.high[0])], dtype=np.float32)

    obs, reward, terminated, truncated, info = env.step(action)
    ep_reward += reward

    if terminated or truncated:
        print(f"Episode reward (PD control): {ep_reward:.1f}")
        obs, info = env.reset()
        ep_reward = 0.0

env.close()


Episode reward (PD control): -956.3
Episode reward (PD control): -964.0
Episode reward (PD control): -1381.2
Episode reward (PD control): -1001.4
Episode reward (PD control): -930.6


## 7. Acrobot
The acrobot system includes two joints and two links, where the joint between the two links is actuated. <br>
Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height.
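
The goal condition can be sketched with a little trigonometry. The code below assumes both links have length 1, that angles are measured from the hanging position, and that the target height is 1.0 above the pivot, which is how I understand the Gymnasium Acrobot documentation. The function names are hypothetical.

```python
import math


def tip_height(theta1, theta2):
    """Height of the free end above the pivot, assuming both links have length 1."""
    return -math.cos(theta1) - math.cos(theta1 + theta2)


def reached_goal(theta1, theta2, target_height=1.0):
    """True once the free end is above the target height (1.0 is an assumed default)."""
    return tip_height(theta1, theta2) > target_height


print(tip_height(0.0, 0.0), reached_goal(0.0, 0.0))          # hanging straight down: -2.0 False
print(tip_height(math.pi, 0.0), reached_goal(math.pi, 0.0))  # pointing straight up: 2.0 True
```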

In [18]:
# Acrobot: random action demo using Gymnasium
import gymnasium as gym

# Create the environment (requires gymnasium[classic-control] and pygame)
env = gym.make("Acrobot-v1", render_mode="human")

# Reset to start a new episode
obs, info = env.reset(seed=0)

# Run for 500 steps with random actions
for _ in range(500):
    action = env.action_space.sample()           # sample a random action (0, 1, or 2)
    obs, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        obs, info = env.reset()

env.close()


In [None]:
# can we introduce a solution here?

## 8. Taxi
This task was introduced to illustrate some issues in hierarchical reinforcement learning. <br>
There are 4 locations (labeled by different letters), and your job is to pick up the passenger at one location and drop them off at another. <br>
You receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. <br>
There is also a 10 point penalty for illegal pick-up and drop-off actions.
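
The scoring rules above can be tallied with a few lines of Python. This sketch assumes every timestep yields exactly one of the three rewards (so an illegal action replaces the usual -1 rather than adding to it), which is my reading of how Taxi-v3 behaves. The event labels and the example episode are made up for the illustration.

```python
STEP_REWARDS = {
    "move": -1,       # ordinary timestep
    "illegal": -10,   # illegal pick-up or drop-off
    "dropoff": 20,    # successful drop-off
}


def taxi_return(events):
    """Sum the per-step rewards for a sequence of event labels."""
    return sum(STEP_REWARDS[event] for event in events)


# Hypothetical episode: 12 ordinary steps, one illegal pick-up, then a successful drop-off
episode = ["move"] * 12 + ["illegal"] + ["dropoff"]
print(taxi_return(episode))   # -12 - 10 + 20 = -2
```

The tally shows why random play scores so poorly here: each illegal action cancels half of the drop-off bonus.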

In [20]:
# Taxi: random action demo using Gymnasium
import gymnasium as gym

# Create the environment (requires gymnasium[toy-text])
env = gym.make("Taxi-v3", render_mode="human")

# Reset to start a new episode
obs, info = env.reset(seed=0)

# Run for 100 steps with random actions
for _ in range(100):
    action = env.action_space.sample()            # pick a random discrete action
    obs, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        obs, info = env.reset()

env.close()


In [None]:
# Acrobot energy heuristic (a follow-up to section 7; it needs Acrobot's six-value observation)
import gymnasium as gym
import numpy as np

env = gym.make("Acrobot-v1", render_mode="human")

def get_angles(obs):
    # Observation: [cos(theta1), sin(theta1), cos(theta2), sin(theta2), theta1_dot, theta2_dot]
    c1, s1, c2, s2, th1dot, th2dot = obs
    theta1, theta2 = np.arctan2(s1, c1), np.arctan2(s2, c2)
    return theta1, theta2, th1dot, th2dot

obs, info = env.reset(seed=0)
ep_reward = 0.0
for t in range(500):
    theta1, theta2, th1dot, th2dot = get_angles(obs)

    # Rough energy estimate: a kinetic term minus the cosine terms for the two links
    energy = 0.5*(th1dot**2 + th2dot**2) - np.cos(theta1) - np.cos(theta1 + theta2)

    # Action 1 applies zero torque; action 0 applies a torque of -1
    action = 1 if energy < 1.0 else 0

    obs, reward, terminated, truncated, info = env.step(action)
    ep_reward += reward

    # Report the return and start over when the episode ends
    if terminated or truncated:
        print(f"Episode reward (energy heuristic): {ep_reward:.1f}")
        obs, info = env.reset()
        ep_reward = 0.0

env.close()


In [4]:
# Reflection Questions
# 1. I noticed that, mainly, the agent's actions (on average) became more consistent over time. It would take a lot longer to train one to become "better" than the time we are given here; however, this was enough for them to become more consistent.
# 2. This one would change specifically depending on the model (duh...), but I noticed the main change in the Bipedal model. As I stated there, I ended up experimenting with the alpha value. I noticed that when I increased the value, the model got smoother. However, smoother doesn't always mean better, so after running it at a few different values, I found 0.85 to be the best.
# 3. Simply put, it is in my opinion a very fun way of using reinforcement learning. You essentially let it loose in an environment and watch it learn the best ways to use each of
#    its given actions. This makes it interesting to watch and my favorite part of AI. Which is ironically also why this part took so long - I wanted to absolutely ensure I went
#    through it meticulously so that I know what I am doing.