## Part 2 : REINFORCE algorithm

In [1]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pygame
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributions as dist

For the second part of this project, we are going to work on the **Pendulum**. 

This environment is part of the Ici ce qui suit [Classic Control environments](https://gymnasium.farama.org/environments/classic_control/).

Please read that page first for general information.

![Ceci est un exemple d’image](https://gymnasium.farama.org/_images/pendulum.gif)
   
    | Action space     | Box(-2.0, 2.0, (1,), float32) |
    |Observation Shape | (3,)                          |
    |Observation High  | [1. 1. 8.]                    |
    |Observation low   | [-1. -1. -8.]                 |
    |Import            | gymnasium.make("Pendulum-v1") |

### Description 

The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, and the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point.

The diagram below specifies the coordinate system used for the implementation of the pendulum’s dynamic equations.

![Ceci est un exemple d’image](https://gymnasium.farama.org/_images/pendulum.png)

* x-y: cartesian coordinates of the pendulum’s end in meters.

* theta: angle in radians.

* tau: torque in N.m. Defined as positive *counter-clockwise*.

### Action space 

The action is a ndarray with shape (1,) representing the torque applied to free end of the pendulum.


    | Num | Action | Min  | Max |
    |-----|--------|------|-----|
    | 0   | Torque | -2.0 | 2.0 |
    
### Observation space 

The observation is a ndarray with shape (3,) representing the x-y coordinates of the pendulum’s free end and its angular velocity.

    | Num | Observation      | Min  | Max |
    |-----|------------------|------|-----|
    | 0   | x = cos(theta)   | -1.0 | 1.0 |
    | 1   | y = sin(theta)   | -1.0 | 1.0 |
    | 2   | Angular Velocity | -8.0 | 8.0 |

### Rewards 

he reward function is defined as:

$ r = -(theta^2 + 0.1 * theta_{dt}^2 + 0.001*torque^2) $

where $theta$ is the pendulum’s angle normalized between $[-\pi, \pi]$ (with 0 being in the upright position). Based on the above equation, the minimum reward that can be obtained is $-(\pi^2 + 0.1 * 8^2 + 0.001 * 2^2) = -16.2736044$, while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).

### Starting state 

The starting state is a random angle in $[-\pi, \pi]$ and a random angular velocity in [-1,1].

### Episode Truncation
The episode truncates at 200 time steps.

### Arguments
* g: acceleration of gravity measured in $(m.s^{-2})$ used to calculate the pendulum dynamics. The default value is g = 10.0 .

On reset, the options parameter allows the user to change the bounds used to determine the new random state.

In [2]:
env = gym.make('Pendulum-v1', g=9.81, render_mode = "human")
print(f'Number of possible actions: {env.action_space} that correspond to the torque applied to the free end of the pendulum (positive counter-clock wise)')
print(f'Number of states: {env.observation_space} that correspond to the x-y coordinates (Cartesian basis) and the angular velocity of the free end of the pendulum')

Number of possible actions: Box(-2.0, 2.0, (1,), float32) that correspond to the torque applied to the free end of the pendulum (positive counter-clock wise)
Number of states: Box([-1. -1. -8.], [1. 1. 8.], (3,), float32) that correspond to the x-y coordinates (Cartesian basis) and the angular velocity of the free end of the pendulum


In [4]:
def theta_init(x):
    theta=[0]*len(x)
    return np.array(theta)

def policy(action,state,theta):
    p=state @ theta
    q=1/(1+np.exp(-p))
    if action==0:
        return 1-q
    else : 
        return q
def gradient_function(action,state,theta):
    if action==0:
        return -state*policy(1,state,theta)
    else :
        return state*policy(0,state,theta)

In [4]:
def action(state,theta):
    if state[1]>=0:
        action=np.random.uniform(-2,0)
    else :
        action=np.random.uniform(0,2)
    return action

def gradient_function(action,state,theta):
    if action==0:
        return -state*policy(1,state,theta)
    else :
        return state*policy(0,state,theta)

In [8]:
def reinforce(theta_0,lr,gamma,n_episode):
    env = gym.make("Pendulum-v1", render_mode="human")
    theta=theta_0
    
    for i in range(n_episode):
        X=[] #list of states
        A=[] #list of actions
        R=[] #list of rewards
        x,_=env.reset()
        n_move = 0 
        terminated=False
        while not terminated: #episode to fill the lists
            if n_move > 500:
                env.close()
                raise Exception("Too many attempts, failed")
            n_move += 1
            X.append(x)
            
            u= np.random.uniform(0,1)
            if u<=policy(1,x,theta):
                action=1
            else :
                action=0
            A.append(action)
            x, r, terminated, truncated, info = env.step([action])
            R.append(r)
        
        print("Total rewards :",np.sum(R))
        n=0 
        while n<n_move: #list run for the adjustment of theta
          
            G=0
            for i in range(n+1,n_move):
                G=G+gamma**(i-n-1)*R[i]
                
            grad=gradient_function(A[n],X[n],theta)
            theta=theta+lr*gamma**n*G*grad
            
            n += 1
    env.close()
    return theta

In [6]:
lr=0.001
n_episode=100
gamma=1
theta_0=[0,0,0]

In [9]:
reinforce(theta_0,lr,gamma,n_episode)

Exception: Too many attempts, failed

In [None]:
for i in range(n_episode):
    env = gym.make("CartPole-v1", render_mode = "human")
    x,info=env.reset()
    n_move=0
    action = env.action_space.sample()
    while not terminated:
        if n_move > 10000:
            raise Exception("Too many attempts, failed")
        n_move += 1
        X.append(x)
        policy=policyparam(action,x,theta)
        action = np.argmax(policy)
        A.append(action)
        x, reward, terminated, truncated, info = env.step(action)
        print(reward)
        R.append(reward)
    env.close()