# Classic Control Environments

In this notebook, we explore the classic-control environments from the **gymnasium** Python library. This set includes the following five environments:

1. Acrobot

2. Cart Pole

3. Mountain Car Continuous

4. Mountain Car

5. Pendulum

We will first attempt each of these environments with a simple classical agent (random sampling), and then use the `BattleEnv` wrapper from the `qrl` library to stage a battle between the classical agent and a quantum agent.

## Acrobot



<p align="center">
  <img src="images/acrobot.gif" alt="Description" width="400"/>
</p>

The Acrobot environment is based on Sutton’s work in “Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding” and Sutton and Barto’s book. The system consists of two links connected linearly to form a chain, with one end of the chain fixed. The joint between the two links is actuated. The goal is to apply torques on the actuated joint to swing the free end of the linear chain above a given height while starting from the initial state of hanging downwards.

As seen in the GIF, the system is drawn as two blue links connected by two green joints. The joint between the two links is actuated. The goal is to swing the free end of the outer link up to the target height (the black horizontal line above the system) by applying torque at the actuated joint.

**Action Space**: The action is discrete and deterministic, and represents the torque applied to the actuated joint between the two links. The available torques are {-1, 0, 1}.
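As a minimal sketch of how a sampled action relates to a torque, assuming the Gymnasium convention that `Acrobot-v1` exposes a `Discrete(3)` space whose indices 0, 1, 2 select torques of -1, 0, +1. The names `TORQUES` and `action_to_torque` are hypothetical helpers for illustration, not part of any library:

```python
# Torque selected by each discrete action index. Assumed to follow the
# Gymnasium Acrobot-v1 convention: 0 -> -1, 1 -> 0, 2 -> +1.
TORQUES = (-1.0, 0.0, 1.0)


def action_to_torque(action: int) -> float:
    """Hypothetical helper: translate an action index into a torque value."""
    if action not in range(len(TORQUES)):
        raise ValueError(f"action must be 0, 1 or 2, got {action!r}")
    return TORQUES[action]


for a in range(3):
    print(a, action_to_torque(a))
```

A random agent that calls `action_space.sample()` therefore picks one of these three torques uniformly at each step.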

**Observation Space**: The observation is an `ndarray` of shape `(6,)` that encodes the two rotational joint angles and their angular velocities:

* Cosine of $\theta_1$, in $[-1, 1]$

* Sine of $\theta_1$, in $[-1, 1]$

* Cosine of $\theta_2$, in $[-1, 1]$

* Sine of $\theta_2$, in $[-1, 1]$

* Angular velocity of $\theta_1$, in $[-4\pi, 4\pi]$ (about $\pm 12.57$)

* Angular velocity of $\theta_2$, in $[-9\pi, 9\pi]$ (about $\pm 28.27$)

Here, 

* $\theta_1$ : angle of the first joint, where an angle of 0 indicates the first link is pointing directly downwards.

* $\theta_2$: angle of the second joint, measured relative to the first link. An angle of 0 means the two links point in the same direction.
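Because the observation carries sines and cosines rather than the angles themselves, the angles can be recovered with `atan2`. The sketch below assumes the six-element ordering listed above; `angles_from_observation` is a hypothetical helper name:

```python
import math


def angles_from_observation(obs):
    """Return (theta1, theta2) in radians from a 6-element Acrobot-v1 observation."""
    cos1, sin1, cos2, sin2, _vel1, _vel2 = obs
    # atan2 uses both components, so each angle lands in the correct quadrant.
    return math.atan2(sin1, cos1), math.atan2(sin2, cos2)


# Illustrative observation: first link rotated 90 degrees from straight down,
# second link aligned with the first, both joints at rest.
obs = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
theta1, theta2 = angles_from_observation(obs)
print(theta1, theta2)
```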

**Rewards**

* Objective: The goal is to have the free end reach a designated target height in as few steps as possible.

* All steps that do not reach the goal incur a reward of -1. 

* Achieving the target height results in termination with a reward of 0. 

* The reward threshold is -100.
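The reward rules above fix the undiscounted return of an episode once its length and outcome are known. The following is a small worked sketch under those rules only (every step costs -1 except a final goal-reaching step, which gives 0); `episode_return` is a hypothetical helper, not a Gymnasium function:

```python
def episode_return(steps: int, terminated: bool) -> float:
    """Undiscounted return of an episode lasting `steps` steps.

    Each step yields -1, except a final step that reaches the goal, which yields 0.
    """
    if steps < 1:
        raise ValueError("an episode has at least one step")
    penalised_steps = steps - 1 if terminated else steps
    return float(-penalised_steps)


# Truncated after 500 steps without reaching the goal.
print(episode_return(500, terminated=False))
# Goal reached on step 101: exactly at the -100 reward threshold.
print(episode_return(101, terminated=True))
```

Under this rule, a return at or above the -100 threshold corresponds to reaching the target height within 101 steps.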

**How can the episode end?**

The episode ends if one of the following occurs:

1. **Termination**: The free end reaches the target height, which is expressed as $-\cos(\theta_1) - \cos(\theta_2 + \theta_1) > 1.0$

2. **Truncation**: Episode length is greater than 500 (200 for v0)
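The termination condition can be evaluated directly from an observation without recovering the angles, using the identity $\cos(\theta_1 + \theta_2) = \cos\theta_1 \cos\theta_2 - \sin\theta_1 \sin\theta_2$. This is a minimal sketch assuming the v1 observation layout; `tip_height` and `reached_goal` are hypothetical helper names:

```python
def tip_height(obs):
    """Height term -cos(theta1) - cos(theta1 + theta2) from a v1 observation."""
    cos1, sin1, cos2, sin2, _vel1, _vel2 = obs
    # Angle-addition identity for cos(theta1 + theta2).
    cos12 = cos1 * cos2 - sin1 * sin2
    return -cos1 - cos12


def reached_goal(obs, threshold=1.0):
    """True when the free end is above the target height."""
    return tip_height(obs) > threshold


hanging = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # theta1 = 0, theta2 = 0
upright = [-1.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # theta1 = pi, theta2 = 0
print(tip_height(hanging), reached_goal(hanging))
print(tip_height(upright), reached_goal(upright))
```

With both links hanging straight down the height term is at its minimum of -2, and with both links pointing straight up it is at its maximum of 2, so the threshold of 1.0 sits three quarters of the way up that range.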

**Available Versions**

* **v1**: Maximum number of steps increased from 200 to 500. The observation space for v0 provided direct readings of $\theta_1$ and $\theta_2$ in radians, with a range of $[-\pi, \pi]$. The v1 observation space described here provides the sine and cosine of each angle instead.

* **v0**: Initial version release

<u>**WE SOLVE THE LATEST v1 VERSION IN THIS EXAMPLE**</u>

### Classical Agent (Random Sampling) in Acrobot environment

import os

import cv2
import gymnasium as gym
from moviepy import ImageSequenceClip

env_id = "Acrobot-v1"
agent_type = "classical"  # plain string (a trailing comma here would create a tuple)
episodes = 10
MAX_STEPS = 500  # matches the v1 truncation limit
env = gym.make(env_id, render_mode="rgb_array")
action_space = env.action_space
classical_agent_frames = {}  # {episode_num: list_of_frames}
reward_history = {}  # {episode_num: reward_value}

best_reward = float("-inf")
best_actions = []

# Run episodes and track best
for ep in range(episodes):
    obs, _ = env.reset()
    total_reward = 0
    actions = []
    frames_list = []
    for _ in range(MAX_STEPS):
        action = action_space.sample()
        obs, reward, terminated, truncated, _ = env.step(action)
        frame = env.render()

        actions.append(action)
        frames_list.append(frame)

        total_reward += reward
        if terminated or truncated:
            break

    print(f"Episode {ep+1}/{episodes} - Total Reward: {total_reward}")
    classical_agent_frames[ep+1] = frames_list
    reward_history[ep+1] = total_reward

    if total_reward > best_reward:
        best_reward = total_reward
        best_actions = actions.copy()

env.close()


# Get the frames for the best episode (highest reward)
best_episode = max(reward_history, key=reward_history.get)
best_frames = classical_agent_frames.get(best_episode)
annotated_best_frames = []

# post processing
for frame in best_frames:
    frame_bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)     # Convert to BGR (OpenCV format)

    # Add episode number on the frames
    cv2.putText(img=frame_bgr,
                text=f"Classical Agent - Episode {best_episode}",
                org=(10, 30),
                fontFace=cv2.FONT_HERSHEY_SIMPLEX,
                fontScale=1,
                color=(0, 0, 0),
                thickness=2,
                lineType=cv2.LINE_AA)

    # Add total reward value on the frames
    cv2.putText(img=frame_bgr,
                text=f"Total Reward: {best_reward}",
                org=(10, 65),
                fontFace=cv2.FONT_HERSHEY_SIMPLEX,
                fontScale=1,
                color=(0, 0, 0),
                thickness=2,
                lineType=cv2.LINE_AA)
    
    # Convert back to RGB
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    annotated_best_frames.append(frame_rgb)

video_path = "videos_best/Acrobot-v1/classical.mp4"
os.makedirs(os.path.dirname(video_path), exist_ok=True)  # make sure the output directory exists
clip = ImageSequenceClip(annotated_best_frames, fps=30)
clip.write_videofile(video_path, codec="libx264")


Episode 1/10 - Total Reward: -500.0
Episode 2/10 - Total Reward: -500.0
Episode 3/10 - Total Reward: -500.0
Episode 4/10 - Total Reward: -500.0
Episode 5/10 - Total Reward: -500.0
Episode 6/10 - Total Reward: -500.0


Episode 7/10 - Total Reward: -500.0
Episode 8/10 - Total Reward: -500.0
Episode 9/10 - Total Reward: -500.0
Episode 10/10 - Total Reward: -500.0



MoviePy - Building video videos_best/Acrobot-v1/classical.mp4.
MoviePy - Writing video videos_best/Acrobot-v1/classical.mp4




MoviePy - Done !
MoviePy - video ready videos_best/Acrobot-v1/classical.mp4
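In the run above every episode was truncated at -500.0, so the "best" episode is decided purely by tie-breaking. The sketch below reproduces that reward history by hand (the values are copied from the printed output, not re-simulated) to show that `max` over a dict returns the first key holding the maximum, and how one might check whether any episode reached the goal. The `solved` list is a hypothetical addition for illustration:

```python
MAX_STEPS = 500

# Reward history copied from the printed output: ten truncated episodes.
reward_history = {ep: -500.0 for ep in range(1, 11)}

# max() keeps the first maximal key it encounters, and dicts preserve
# insertion order, so ties resolve to the earliest episode.
best_episode = max(reward_history, key=reward_history.get)

# An episode that reached the goal within the limit has a return above -MAX_STEPS.
solved = [ep for ep, total in reward_history.items() if total > -MAX_STEPS]

print(best_episode, solved)
```

An empty `solved` list indicates that random sampling never swung the free end above the target height in these ten episodes, so the saved video shows a truncated episode rather than a successful one.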


## Cart Pole

## Mountain Car Continuous

## Mountain Car

## Pendulum