# **Step-by-Step Tutorial: Solving CartPole using Reinforcement Learning**

### **Step 1: Install Necessary Libraries**
Install the Gymnasium library, which provides the CartPole environment.

In [1]:
!pip install gymnasium



### **Step 2: Import Required Libraries**

In [2]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

### **Step 3: Initialize the CartPole Environment**
Create the CartPole environment, then reset it to obtain the initial observation.

In [3]:
env = gym.make("CartPole-v1", render_mode="rgb_array")

In [4]:
observation, info = env.reset()
print("Initial Observation:", observation)

Initial Observation: [-0.0095345  -0.04972725  0.03152118  0.00246065]


The observation is four floating-point numbers: the cart's position, the cart's velocity, the pole's angle from vertical (in radians), and the pole's angular velocity.
Remember, however, that our goal is to balance the pole based on rewards only, not on physics!
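
To make the four values easier to read, they can be wrapped in a named tuple. This is only an illustrative sketch: `CartPoleObservation` and its field names are hypothetical helpers, not part of Gymnasium, and the numbers are copied from the initial observation printed above.

```python
from typing import NamedTuple


class CartPoleObservation(NamedTuple):
    """Hypothetical helper (not part of Gymnasium) that names the four values."""

    cart_position: float          # horizontal position of the cart
    cart_velocity: float          # velocity of the cart
    pole_angle: float             # pole angle from vertical, in radians
    pole_angular_velocity: float  # angular velocity of the pole


# Values copied from the "Initial Observation" printed above.
initial = CartPoleObservation(-0.0095345, -0.04972725, 0.03152118, 0.00246065)
print("Pole angle (rad):", initial.pole_angle)
```

An observation returned by `env.reset()` or `env.step()` could be wrapped the same way with `CartPoleObservation(*observation)`.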


### **Step 4: Understand Action and Observation Space**
Print the action and observation spaces.

In [5]:
print("Action Space:", env.action_space)  # Discrete(2): 0 = Push cart left, 1 = Push cart right

Action Space: Discrete(2)


In [6]:
print("Observation Space:", env.observation_space)  # Box(4,): VECTOR of size 4

Observation Space: Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)


# Understanding the Observation Space in Reinforcement Learning

In **Reinforcement Learning (RL)**, the **observation space** defines the possible states an agent can observe. The output above is the observation space of the **CartPole** environment in **Gymnasium**, the maintained successor to **OpenAI Gym**.


## **Breaking it Down**
- **Box**: Represents a continuous space (real-valued numbers).
- **Lower bounds**: `[-4.8, -inf, -0.41887903, -inf]` - The minimum values for each of the four state variables.
- **Upper bounds**: `[4.8, inf, 0.41887903, inf]` - The maximum values for each state variable.
- **(4,)**: The observation space consists of **4 continuous state variables**.
- **float32**: The datatype of the values in this space.
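
The bounds listed above can be checked by hand. The following is a minimal plain-Python sketch, assuming the bounds printed earlier; `LOW`, `HIGH`, and `in_bounds` are hypothetical names introduced here for illustration.

```python
import math

# Bounds copied from the printed observation space above.
LOW = [-4.8, -math.inf, -0.41887903, -math.inf]
HIGH = [4.8, math.inf, 0.41887903, math.inf]


def in_bounds(observation, low=LOW, high=HIGH):
    """Return True if every component lies within its [low, high] interval."""
    return all(lo <= value <= hi for value, lo, hi in zip(observation, low, high))


print(in_bounds([-0.0095345, -0.04972725, 0.03152118, 0.00246065]))  # True
print(in_bounds([5.0, 0.0, 0.0, 0.0]))  # False: cart position exceeds 4.8
```

Gymnasium spaces also expose a `contains` method for this kind of membership check.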

---

## **Key Insights**
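- The observed cart position ranges over **±4.8**, but the episode ends as soon as the cart leaves **±2.4**, which is half the observation bound.
- The observed pole angle ranges over **±0.41887903 radians (±24 degrees)**, but the episode ends as soon as the pole tilts beyond **±12 degrees (about ±0.2095 radians)**.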
- Velocity variables (`cart velocity` and `pole angular velocity`) are unbounded (`-inf` to `inf`), meaning they can take any real value.
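
According to the Gymnasium documentation for CartPole, an episode terminates when the cart position leaves ±2.4 or the pole angle leaves ±12 degrees, which is half of each observation bound. The sketch below restates that rule in plain Python; `is_failure` and the constant names are hypothetical and do not reproduce the environment's own code.

```python
import math

# Observation bounds copied from the printed observation space above.
POSITION_BOUND = 4.8
ANGLE_BOUND = 0.41887903

# Assumption, based on the CartPole documentation: termination happens
# at half of each observation bound (2.4 and about 12 degrees).
POSITION_LIMIT = POSITION_BOUND / 2
ANGLE_LIMIT = ANGLE_BOUND / 2


def is_failure(observation):
    """Return True if the cart or the pole is past its termination limit."""
    cart_position, _, pole_angle, _ = observation
    return abs(cart_position) > POSITION_LIMIT or abs(pole_angle) > ANGLE_LIMIT


print(round(math.degrees(ANGLE_BOUND), 1))  # 24.0
print(round(math.degrees(ANGLE_LIMIT), 1))  # 12.0
print(is_failure([0.0, 0.0, 0.25, 0.0]))    # True: 0.25 rad is more than 12 degrees
```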

---

## **Usage in Reinforcement Learning**
This observation space defines the **input state** that the agent observes in the environment. The agent uses this information to decide which **action** to take, aiming to balance the pole upright.


In [7]:
# Let's push the cart to the left
env.step(0)

(array([-0.01052904, -0.24528675,  0.03157039,  0.30491987], dtype=float32),
 1.0,
 False,
 False,
 {})

# Understanding the Step Output Tuple in Gymnasium

When calling `env.step(action)` in a **Gymnasium** environment, the function returns a tuple with five elements: the next observation, the reward, a `terminated` flag, a `truncated` flag, and an `info` dictionary.


In [8]:
# Take an action (e.g., action = 1)
action = 1  # Push cart right
next_state, reward, done, truncated, info = env.step(action)  # `done` holds the `terminated` flag

print("Next State:", next_state)
print("Reward:", reward)
print("Done:", done)
print("Truncated:", truncated)
print("Info:", info)

Next State: [-0.01543478 -0.05062859  0.03766878  0.02235829]
Reward: 1.0
Done: False
Truncated: False
Info: {}
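
The third and fourth elements both signal the end of an episode: `terminated` is set when the pole falls or the cart leaves the track, and `truncated` is set when the time limit is reached (500 steps for `CartPole-v1`). A small sketch of combining them follows; `episode_over` is a hypothetical helper, and the tuple is copied from the `env.step(0)` output above.

```python
def episode_over(step_result):
    """Return True if a 5-element step result signals the end of an episode."""
    _, _, terminated, truncated, _ = step_result
    return terminated or truncated


# Step result copied from the output of env.step(0) above.
result = ([-0.01052904, -0.24528675, 0.03157039, 0.30491987], 1.0, False, False, {})
print(episode_over(result))  # False: the episode continues
```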


In [9]:
# Draw random actions using sample()
print(env.action_space.sample())
print(env.action_space.sample())
print(env.action_space.sample())
print(env.action_space.sample())
print(env.action_space.sample())

0
1
1
0
1
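
The sampled integers are easier to interpret with labels attached. This is a small sketch, assuming the action meanings given in the comment in Step 4 (0 pushes the cart left, 1 pushes it right); `ACTION_NAMES` is a hypothetical lookup table, and the sampled values are copied from the output above.

```python
from collections import Counter

# Hypothetical lookup table; the meanings follow the comment in Step 4.
ACTION_NAMES = {0: "push cart left", 1: "push cart right"}

# Actions copied from the sample() output above.
sampled_actions = [0, 1, 1, 0, 1]

counts = Counter(ACTION_NAMES[action] for action in sampled_actions)
for name, count in counts.items():
    print(f"{name}: {count}")
```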


In [10]:
# Draw random observations using sample()
print(env.observation_space.sample())
print(env.observation_space.sample())
print(env.observation_space.sample())
print(env.observation_space.sample())

[-0.28419644 -0.88558966  0.13893068 -0.5007224 ]
[-2.8329377  -1.4832586   0.06538605  0.75585717]
[-4.229765    1.6419837   0.14979577 -1.3968955 ]
[-1.8428878e+00 -3.2399336e-01  7.0862332e-04  1.5177377e-01]


### **Step 5: Run the Environment for N Episodes**

In [11]:
# Number of episodes
num_episodes = 5  # Adjust as needed

In [12]:
# Create the environment
env = gym.make("CartPole-v1")
num_episodes = 5  # Define the number of episodes

for episode in range(num_episodes):
    state, info = env.reset(seed=episode)  # Reset environment at the start of each episode
    terminated = False
    truncated = False
    total_reward = 0
    step_count = 0

    print(f"\nEpisode {episode + 1} starting...")

    # Stop on failure (terminated) or on the time limit (truncated)
    while not (terminated or truncated):
        #render(env)
        action = env.action_space.sample()  # Random action (0 or 1)
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        step_count += 1

    print(f"Episode {episode + 1} ended after {step_count} steps with total reward: {total_reward}")
env.close()


Episode 1 starting...
Episode 1 ended after 20 steps with total reward: 20.0

Episode 2 starting...
Episode 2 ended after 11 steps with total reward: 11.0

Episode 3 starting...
Episode 3 ended after 34 steps with total reward: 34.0

Episode 4 starting...
Episode 4 ended after 12 steps with total reward: 12.0

Episode 5 starting...
Episode 5 ended after 14 steps with total reward: 14.0
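
The episode loop can be factored into a reusable function, which makes it easier to compare a random policy with a learned one later. The sketch below assumes only the Gymnasium-style `reset` and `step` calls used in this notebook; `run_episode` and its `policy` argument are hypothetical names, and the returns are copied from the output above.

```python
import statistics


def run_episode(env, policy, seed=None):
    """Run one episode and return (step_count, total_reward).

    `env` is assumed to follow the API used in this notebook:
    reset() returns (observation, info) and step() returns a 5-element tuple.
    `policy` is any callable that maps an observation to an action.
    """
    observation, info = env.reset(seed=seed)
    total_reward, step_count = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        step_count += 1
    return step_count, total_reward


# Total rewards copied from the five random episodes printed above.
returns = [20.0, 11.0, 34.0, 12.0, 14.0]
print("Mean return of the random policy:", statistics.mean(returns))  # 18.2
```

With the environment from this step, a random episode would be `run_episode(env, lambda obs: env.action_space.sample())`. A mean return of about 18 is the baseline that a trained agent should clearly exceed.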
