# Reinforcement Learning for Inverted Pendulum Control

## Introduction

The inverted pendulum problem is a classic challenge in control theory and robotics and a standard benchmark for control strategies. In this project, students use reinforcement learning (RL) algorithms to solve the single inverted pendulum task and, as a bonus, the double inverted pendulum task. See [inverted_pendulum](https://gymnasium.farama.org/environments/mujoco/inverted_pendulum/) and [inverted_double_pendulum](https://gymnasium.farama.org/environments/mujoco/inverted_double_pendulum/) for details of the environments.

## Project Targets

The project is divided into two main parts:

### Part 1: Single Inverted Pendulum

**1. Stabilization Task:**

The pendulum starts with a slight inclination.
The goal is to stabilize the pendulum in the upright position by controlling the movement of the cart.
    
**2. Swing-Up Task:**

The pendulum begins in the downward position.
The objective is to swing the pendulum up and stabilize it in the upright position.

![part1](part1.png)

### Part 2: Double Inverted Pendulum (Bonus)

The task is to swing up the double inverted pendulum from a downward initial phase and stabilize it in any other stable phase (a double inverted pendulum has more than one such phase).

![part2](part2.png)

## Demo Code

In [1]:
from __future__ import annotations

import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions.normal import Normal




You may need an `Agent` class and a `Policy_Network` class to help you build the RL framework.

In [2]:
class Agent:
    """Agent that learns to solve the Inverted Pendulum task using a policy gradient algorithm.
    The agent utilizes a policy network to sample actions and update its policy based on
    collected rewards.
    """
    
    def __init__(self, obs_space_dims: int, action_space_dims: int):
        """Initializes the agent with a neural network policy.
        
        Args:
            obs_space_dims (int): Dimension of the observation space.
            action_space_dims (int): Dimension of the action space.
        """
        self.policy_network = Policy_Network(obs_space_dims, action_space_dims)
    
    def sample_action(self, state: np.ndarray) -> np.ndarray:
        """Samples an action according to the policy network given the current state.
        
        Args:
            state (np.ndarray): The current state observation from the environment.
        
        Returns:
            np.ndarray: The action sampled from the policy distribution,
                shaped to match the environment's action space.
        """
        return np.array([0.0])  # Placeholder: replace with an action sampled from the policy
    
    def update(self, rewards, log_probs):
        """Updates the policy network using the REINFORCE algorithm based on collected rewards and log probabilities.
        
        Args:
            rewards (list): Collected rewards from the environment.
            log_probs (list): Log probabilities of the actions taken.
        """
        # The actual implementation of the REINFORCE update will be done here.
        pass
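
The REINFORCE update referred to in `Agent.update` usually weights each action's log-probability by the discounted return that followed it. The sketch below shows that arithmetic in plain Python, without any network. The helper names `discounted_returns` and `reinforce_loss` and the default `gamma = 0.99` are assumptions made for illustration and are not part of the demo code.

```python
from __future__ import annotations


def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns: list[float] = []
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs: list[float], returns: list[float]) -> float:
    """Scalar REINFORCE objective: minus the sum of log-probability times return."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

In a PyTorch implementation the same sum would be built from tensors so that calling `backward()` on it yields the policy gradient; subtracting a baseline from the returns is a common variance-reduction option.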

In [3]:
class Policy_Network(nn.Module):
    """Neural network to parameterize the policy by predicting action distribution parameters."""
    
    def __init__(self, obs_space_dims: int, action_space_dims: int):
        """Initializes layers of the neural network.
        
        Args:
            obs_space_dims (int): Dimension of the observation space.
            action_space_dims (int): Dimension of the action space.
        """
        super().__init__()
        # Define the neural network layers here
        pass

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """Predicts parameters of the action distribution given the state.
        
        Args:
            x (torch.Tensor): The state observation.
        
        Returns:
            tuple[torch.Tensor, torch.Tensor]: Predicted mean and standard deviation of the action distribution.
        """
        # Implement the prediction logic here
        return torch.tensor(0.0), torch.tensor(1.0)  # Example placeholders
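
Once `forward` returns a mean and a standard deviation, an action is drawn from the corresponding normal distribution and its log-probability is stored for the update. The following is a minimal plain-Python sketch of those two steps for a one-dimensional action. It is a stand-in for what `torch.distributions.normal.Normal` provides through `sample()` and `log_prob()`, and the function names here are hypothetical.

```python
from __future__ import annotations

import math
import random


def gaussian_log_prob(action: float, mean: float, std: float) -> float:
    """Log-density of a univariate normal distribution evaluated at `action`."""
    var = std ** 2
    return (
        -((action - mean) ** 2) / (2.0 * var)
        - math.log(std)
        - 0.5 * math.log(2.0 * math.pi)
    )


def sample_gaussian_action(
    mean: float, std: float, rng: random.Random
) -> tuple[float, float]:
    """Draw an action from N(mean, std^2) and return it with its log-probability."""
    action = rng.gauss(mean, std)
    return action, gaussian_log_prob(action, mean, std)
```

The standard deviation must stay strictly positive, so a network that predicts it typically passes the raw output through a positive function such as softplus or an exponential.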

The inverted pendulum and inverted double pendulum environments from Gymnasium are loaded by the following code. Step into the source code to see how the `"InvertedPendulum-v4"` and `"InvertedDoublePendulum-v4"` models work. To open the simulation window, you need to modify the Gymnasium code, for example by adding the `render_mode` parameter.
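
As an alternative to editing the installed package, `gym.make` also accepts `render_mode` as a keyword argument and forwards it to the environment. The sketch below wraps that call, assuming a Gymnasium installation with MuJoCo support that still registers the v4 environments. The helper name `make_pendulum_env` and its `show_window` flag are hypothetical.

```python
def make_pendulum_env(env_id: str = "InvertedPendulum-v4", show_window: bool = False):
    """Create the environment, optionally with an on-screen MuJoCo viewer."""
    # Imported lazily so the helper can be defined even where MuJoCo is missing.
    import gymnasium as gym

    # "human" opens the interactive viewer; None runs without rendering.
    render_mode = "human" if show_window else None
    return gym.make(env_id, render_mode=render_mode)
```

Calling `make_pendulum_env(show_window=True)` should open the viewer on a machine with a display. Do not combine this with a hard-coded `render_mode="human"` in the environment source, because the argument would then be passed twice.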

In [5]:
env = gym.make("InvertedPendulum-v4")  # Initialize the environment
# env = gym.make("InvertedDoublePendulum-v4")

In [None]:
# In inverted_double_pendulum_v4.py (excerpt from the Gymnasium source; this cell cannot be run here)

def __init__(self, **kwargs):
    observation_space = Box(low=-np.inf, high=np.inf, shape=(11,), dtype=np.float64)
    MujocoEnv.__init__(
        self,
        "inverted_double_pendulum.xml",
        5,
        observation_space=observation_space,
        default_camera_config=DEFAULT_CAMERA_CONFIG,
        render_mode="human", # you need to add this code
        **kwargs,
    )
    utils.EzPickle.__init__(self, **kwargs)

![pendulum_model](pendulum_model.png)

Then you can run the following code to run training and simulation.

In [6]:
wrapped_env = gym.wrappers.RecordEpisodeStatistics(env, 50)  # Wrap the environment to record statistics

obs_space_dims = env.observation_space.shape[0]  # Dimension of the observation space
action_space_dims = env.action_space.shape[0]  # Dimension of the action space
agent = Agent(obs_space_dims, action_space_dims)  # Instantiate the agent

total_num_episodes = int(5e3)  # Total number of episodes

# Simulation main loop
for episode in range(total_num_episodes):
    obs, info = wrapped_env.reset()  # Reset the environment at the start of each episode
    done = False
    while not done:
        action = agent.sample_action(obs)  # Sample an action based on the current observation
        obs, reward, terminated, truncated, _ = wrapped_env.step(action)  # Take the action in the environment
        done = terminated or truncated  # Check if the episode has terminated
    # The collection of rewards and log probabilities should happen within the loop.
    # agent.update(rewards, log_probs)  # Update the policy based on the episode's experience
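
The comment in the loop notes that rewards and log-probabilities have to be collected at every step before `agent.update` can be called. One possible way to structure that collection is sketched below. It assumes a hypothetical `act` callable that returns both the action and its log-probability, which the demo `sample_action` does not yet do, and it works with any object that follows the Gymnasium `reset`/`step` interface.

```python
from __future__ import annotations

from typing import Any, Callable


def collect_episode(
    env: Any, act: Callable[[Any], tuple[Any, float]]
) -> tuple[list[float], list[float]]:
    """Run one episode and return the per-step rewards and log-probabilities."""
    rewards: list[float] = []
    log_probs: list[float] = []
    obs, _info = env.reset()
    done = False
    while not done:
        # `act` is assumed to return (action, log-probability of that action).
        action, log_prob = act(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        rewards.append(float(reward))
        log_probs.append(log_prob)
        # An episode ends on failure (terminated) or on the time limit (truncated).
        done = terminated or truncated
    return rewards, log_probs
```

With such a helper, the body of the episode loop could reduce to calling `collect_episode(wrapped_env, ...)` and passing the two lists to `agent.update`.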

