In [2]:
! pip install gym

Collecting gym
  Downloading gym-0.26.2.tar.gz (721 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting gym_notices>=0.0.4 (from gym)
  Downloading gym_notices-0.1.0-py3-none-any.whl.metadata (1.2 kB)
Downloading gym_notices-0.1.0-py3-none-any.whl (3.3 kB)
Building wheels for collected packages: gym
  Building wheel for gym (pyproject.toml): started
  Building wheel for gym (pyproject.toml): finished with status 'done'
  Created wheel for gym: filename=gym-0.26.2-py3-none-any

In [4]:
!pip uninstall -y gym && pip install gymnasium

Found existing installation: gym 0.26.2
Uninstalling gym-0.26.2:
  Successfully uninstalled gym-0.26.2
Collecting gymnasium
  Downloading gymnasium-1.2.0-py3-none-any.whl.metadata (9.9 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-1.2.0-py3-none-any.whl (944 kB)
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium

Successfully installed farama-notifications-0.0.4 gymnasium-1.2.0


The Python library gym is a toolkit developed by OpenAI for building and experimenting with reinforcement learning (RL) environments. It provides a standardized API to interact with a wide variety of environments, making it easier to develop and compare RL algorithms.

🔧 Key Features of gym:
* Unified interface for different environments (e.g., games, robotics, control tasks).
* Easy integration with RL libraries such as Stable Baselines and RLlib, and with deep learning frameworks such as TensorFlow and PyTorch.
* Extensible: You can create custom environments.
* Benchmarking: Includes classic control problems and Atari games for algorithm comparison.

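The gymnasium library, maintained by the Farama Foundation, is the actively maintained successor to OpenAI's original gym library, which is why the install cell above replaces gym with gymnasium. It serves the same purpose: a standardized API and a rich set of environments for developing and benchmarking reinforcement learning (RL) algorithms. In its API, reset() returns (observation, info) and step() returns (observation, reward, terminated, truncated, info).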
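
To make the reset/step contract concrete, here is a minimal sketch of a toy environment that mirrors those signatures. It is an illustration only: CoinFlipEnv and run_episode are hypothetical names, the class does not inherit from gymnasium.Env, and it defines no observation or action spaces, so it is not a drop-in Gymnasium environment.

```python
import numpy as np

class CoinFlipEnv:
    """Hypothetical toy environment that mirrors the Gymnasium reset/step signatures."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.rng = np.random.default_rng()
        self.t = 0
        self.coin = 0

    def _obs(self) -> np.ndarray:
        # The observation is the current coin face (0 or 1).
        return np.array([self.coin], dtype=np.float32)

    def reset(self, seed=None):
        if seed is not None:
            self.rng = np.random.default_rng(seed)
        self.t = 0
        self.coin = int(self.rng.integers(2))
        return self._obs(), {}                      # (observation, info)

    def step(self, action: int):
        reward = 1.0 if action == self.coin else 0.0  # reward for matching the coin
        self.t += 1
        self.coin = int(self.rng.integers(2))         # flip a new coin
        terminated = False                            # no failure state in this toy task
        truncated = self.t >= self.max_steps          # time limit reached
        return self._obs(), reward, terminated, truncated, {}

def run_episode(env, policy, seed=None) -> float:
    """Roll out one episode and return the undiscounted total reward."""
    obs, _ = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total

env = CoinFlipEnv(max_steps=5)
# A policy that simply repeats the observed coin face matches on every step.
total = run_episode(env, policy=lambda obs: int(obs[0]), seed=0)
print(total)
```

The loop in run_episode has the same shape as the training loop used for CartPole: an episode ends when either terminated or truncated is true.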

In [5]:
# ============================
# REINFORCE (Vanilla Policy Gradient) — Minimal NumPy version
# Task: OpenAI Gym CartPole-v1
# Policy: linear + softmax  π(a|s) = softmax(W @ s)
# Update rule: W ← W + α * G_t * ∇_W log π(a_t|s_t)
# where G_t is the discounted return from time t.
# ============================

import numpy as np

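# --- Support both gymnasium and legacy gym ---
try:
    import gymnasium as gym
    NEW_API = True
except ImportError:
    import gym
    # gym >= 0.26 already uses the gymnasium-style reset/step signatures
    NEW_API = tuple(int(x) for x in gym.__version__.split(".")[:2]) >= (0, 26)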

# ============ Hyperparameters ============
ENV_NAME      = "CartPole-v1"
GAMMA         = 0.99   # discount factor
LR            = 0.02   # learning rate (small for stability with linear policy)
NUM_EPISODES  = 600    # number of training episodes
SEED          = 42     # RNG seed for reproducibility

np.random.seed(SEED)

# ============ Utilities ============

def softmax(z: np.ndarray) -> np.ndarray:
    """
    Numerically stable softmax:
      softmax(z)_i = exp(z_i - max(z)) / sum_j exp(z_j - max(z))
    """
    z = z - np.max(z)           # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)        # sum >= 1 after the shift, so no epsilon is needed

def choose_action(W: np.ndarray, state: np.ndarray) -> tuple[int, np.ndarray]:
    """
    Compute action probabilities p = softmax(W @ s) and sample an action a ~ p.
    Returns:
      a: sampled action (int)
      p: probability vector of shape (n_actions,)
    """
    logits = W @ state                # shape: (n_actions,)
    p = softmax(logits)
    a = np.random.choice(len(p), p=p) # sample according to the stochastic policy
    return a, p

def discounted_returns(rewards, gamma=GAMMA):
    """
    Compute discounted returns G_t for a single episode:
      G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
    Implemented via a backward pass in O(T).
    Also standardizes G to reduce variance (helps learning stability).
    """
    G = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    # Standardize (optional but recommended for variance reduction)
    if len(G) > 1:
        G = (G - G.mean()) / (G.std() + 1e-8)
    return G

def grad_log_pi(s: np.ndarray, a: int, p: np.ndarray, n_actions: int) -> np.ndarray:
    """
    For a linear + softmax policy, the gradient of log π(a|s) w.r.t. W is:
      ∇_W log π(a|s) = (one_hot(a) - p)[:, None] * s[None, :]
    This is an outer product producing a matrix with shape (n_actions, n_features).
    """
    one_hot = np.zeros(n_actions, dtype=np.float32)
    one_hot[a] = 1.0
    diff = one_hot - p                       # shape: (n_actions,)
    return diff[:, None] * s[None, :]        # outer product -> (n_actions, n_features)

# ============ Env & Parameters ============
env = gym.make(ENV_NAME)
# new-style reset (gymnasium, gym >= 0.26) returns (obs, info); older gym returns obs
if NEW_API:
    obs, _ = env.reset(seed=SEED)
else:
    obs = env.reset(seed=SEED)

n_features = env.observation_space.shape[0]  # CartPole has 4-D state
n_actions  = env.action_space.n              # CartPole has 2 actions

# Initialize linear policy weights: shape (n_actions, n_features)
W = np.random.randn(n_actions, n_features).astype(np.float32) * 0.01

# ============ Training Loop ============
best_reward = -np.inf
reward_history = []

for episode in range(1, NUM_EPISODES + 1):
    # Collect one full episode (trajectory) before updating
    states, actions, rewards, probs = [], [], [], []

    if NEW_API:
        s, _ = env.reset()
    else:
        s = env.reset()

    done = False
    ep_reward = 0.0

    while not done:
        s = np.asarray(s, dtype=np.float32)     # ensure NumPy 1-D array
        a, p = choose_action(W, s)              # sample action from current policy

        if NEW_API:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
        else:
            s_next, r, done, _ = env.step(a)

        # Store transition pieces
        states.append(s)
        actions.append(a)
        rewards.append(r)
        probs.append(p)

        ep_reward += r
        s = s_next

    reward_history.append(ep_reward)
    best_reward = max(best_reward, ep_reward)

    # ----- REINFORCE update after the episode -----
    G = discounted_returns(rewards, gamma=GAMMA)  # (optionally standardized)

    # Accumulate policy gradients over the trajectory
    grad_sum = np.zeros_like(W)
    for s_t, a_t, p_t, G_t in zip(states, actions, probs, G):
        grad = grad_log_pi(s_t, a_t, p_t, n_actions)  # (n_actions, n_features)
        grad_sum += G_t * grad

    # Gradient ASCENT (we maximize return)
    W += LR * grad_sum

    # ----- Logging -----
    if episode % 20 == 0:
        avg_last_20 = np.mean(reward_history[-20:])
        print(f"Ep {episode:4d} | R={ep_reward:6.1f} | avg(20)={avg_last_20:6.1f} | best={best_reward:6.1f}")

env.close()

# After training, the moving-average reward should trend upward, although
# REINFORCE is noisy and individual episodes can still collapse.
# CartPole-v1 caps an episode at 500 steps and registers a reward threshold of 475.


Ep   20 | R=  15.0 | avg(20)=  26.9 | best=  60.0
Ep   40 | R=  17.0 | avg(20)=  38.4 | best=  76.0
Ep   60 | R=  66.0 | avg(20)=  56.8 | best= 163.0
Ep   80 | R=  83.0 | avg(20)=  55.5 | best= 163.0
Ep  100 | R=  83.0 | avg(20)=  81.0 | best= 163.0
Ep  120 | R= 177.0 | avg(20)= 100.8 | best= 200.0
Ep  140 | R=  71.0 | avg(20)=  93.4 | best= 200.0
Ep  160 | R=  41.0 | avg(20)= 106.9 | best= 200.0
Ep  180 | R= 276.0 | avg(20)= 133.7 | best= 276.0
Ep  200 | R= 467.0 | avg(20)= 184.7 | best= 467.0
Ep  220 | R=  67.0 | avg(20)= 244.1 | best= 500.0
Ep  240 | R= 165.0 | avg(20)= 150.2 | best= 500.0
Ep  260 | R= 243.0 | avg(20)= 217.8 | best= 500.0
Ep  280 | R= 378.0 | avg(20)= 309.6 | best= 500.0
Ep  300 | R= 500.0 | avg(20)= 324.2 | best= 500.0
Ep  320 | R= 126.0 | avg(20)= 306.9 | best= 500.0
Ep  340 | R= 500.0 | avg(20)= 390.9 | best= 500.0
Ep  360 | R= 343.0 | avg(20)= 346.8 | best= 500.0
Ep  380 | R= 239.0 | avg(20)= 376.7 | best= 500.0
Ep  400 | R= 109.0 | avg(20)= 235.3 | best= 500.0
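
The backward pass in discounted_returns can be checked by hand on a tiny episode. The sketch below uses a hypothetical helper, raw_discounted_returns, that repeats the same recursion without the standardization step, with three rewards of 1 and gamma = 0.5 chosen only to keep the arithmetic easy.

```python
import numpy as np

def raw_discounted_returns(rewards, gamma):
    """Backward recursion G_t = r_t + gamma * G_{t+1}, with no standardization."""
    G = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

G = raw_discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(G)  # [1.75 1.5  1.  ]

# Standardizing as in the training code gives zero mean and unit standard deviation.
G_std = (G - G.mean()) / (G.std() + 1e-8)
print(G_std.mean(), G_std.std())
```

By hand: G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, and G_0 = 1 + 0.5 * 1.5 = 1.75. After standardization, early steps of an episode get positive weights and late steps negative ones, which is what the update loop multiplies into the gradient.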


What to notice

* Policy: p(a|s) = softmax(W @ s) — a simple linear model is enough to learn CartPole.

* Stochastic actions: sampled from p, ensuring exploration.

* Returns: we compute discounted returns G_t for each time step; standardization reduces gradient variance.

* Gradient: (one_hot(a) - p) ⊗ s (outer product) is all you need for the REINFORCE update.

* Ascent vs. descent: we maximize expected return → do gradient ascent on W.
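
The gradient formula above can be sanity-checked numerically. The following is a minimal sketch, assuming float64 weights and a central finite difference; analytic_grad, numeric_grad, and log_pi are illustrative names, not part of the training code.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def log_pi(W, s, a):
    """log pi(a|s) for the linear + softmax policy."""
    return np.log(softmax(W @ s)[a])

def analytic_grad(W, s, a):
    """(one_hot(a) - p) outer s, the closed-form gradient of log pi(a|s)."""
    p = softmax(W @ s)
    one_hot = np.zeros(W.shape[0])
    one_hot[a] = 1.0
    return np.outer(one_hot - p, s)

def numeric_grad(W, s, a, eps=1e-6):
    """Central finite-difference estimate, one weight at a time."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (log_pi(W_plus, s, a) - log_pi(W_minus, s, a)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))        # 2 actions, 4 features, as in CartPole
s = rng.normal(size=4)
a = 1
max_err = np.max(np.abs(analytic_grad(W, s, a) - numeric_grad(W, s, a)))
print(f"max abs difference: {max_err:.2e}")
```

A difference many orders of magnitude below the gradient entries themselves indicates that the outer-product expression matches the true derivative of log π(a|s).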

Quick tuning tips

* If learning is unstable: try smaller LR (e.g., 0.01 or 0.005) or increase NUM_EPISODES.

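* For even lower variance: the standardization in discounted_returns already subtracts the episode mean return from G_t, so the next step is a baseline that carries information across episodes or states (e.g., a running average of returns or a learned state-value estimate), or a move to Actor-Critic later.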
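
One way to try a baseline is sketched below. It assumes raw (unstandardized) discounted returns as input, and RunningBaseline is a hypothetical helper, not part of the notebook's code: it keeps an exponential moving average of the mean return over past episodes and subtracts that value from every G_t.

```python
import numpy as np

class RunningBaseline:
    """Hypothetical helper: moving-average baseline over past episodes' mean returns."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.value = None          # no history before the first episode

    def advantages(self, G):
        G = np.asarray(G, dtype=np.float64)
        episode_mean = G.mean()
        # First episode: fall back to this episode's own mean (no history yet).
        b = episode_mean if self.value is None else self.value
        adv = G - b
        # Update the baseline only after it has been used.
        if self.value is None:
            self.value = episode_mean
        else:
            self.value = self.decay * self.value + (1 - self.decay) * episode_mean
        return adv

baseline = RunningBaseline(decay=0.9)
adv1 = baseline.advantages([3.0, 2.0, 1.0])   # baseline = 2.0 (own mean)
adv2 = baseline.advantages([6.0, 4.0, 2.0])   # baseline = 2.0 (from episode 1)
print(adv1, adv2, baseline.value)
```

In the training loop, the advantages would take the place of G_t in the accumulation grad_sum += G_t * grad. Because the baseline from the second episode on depends only on earlier episodes, subtracting it does not bias the gradient estimate.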