# **[Stable-Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/)** 
### **What is Stable-Baselines3?**
Stable-Baselines3 (SB3) is an **open-source, PyTorch-based library** that provides reliable implementations of **Reinforcement Learning (RL) algorithms**.

## **Installation**

### **Stable version**
To install the stable version with all optional packages (TensorBoard, OpenCV, ale-py for Atari):

In [1]:
%pip install stable-baselines3[extra]

Collecting opencv-python (from stable-baselines3[extra])
  Using cached opencv_python-4.11.0.86-cp37-abi3-win_amd64.whl.metadata (20 kB)
Collecting pygame (from stable-baselines3[extra])
  Downloading pygame-2.6.1-cp310-cp310-win_amd64.whl.metadata (13 kB)
Collecting ale-py>=0.9.0 (from stable-baselines3[extra])
  Downloading ale_py-0.10.2-cp310-cp310-win_amd64.whl.metadata (8.4 kB)
Downloading ale_py-0.10.2-cp310-cp310-win_amd64.whl (1.5 MB)
Using cached opencv_python-4.11.0.8

If you **don't need the extra dependencies**, install only the base library:

In [None]:
%pip install stable-baselines3
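
After either install command, it can be useful to confirm that the package is visible to the current kernel. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of SB3.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="stable-baselines3"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


sb3_version = installed_version()
if sb3_version is None:
    print("stable-baselines3 is not installed in this environment")
else:
    print(f"stable-baselines3 {sb3_version} is installed")
```

If the check reports that the package is missing right after installation, restarting the notebook kernel is usually the first thing to try.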

## Stable-Baselines3 (SB3) **Quick Start Guide**
**Key Concepts**
- SB3 wraps every environment in a **vectorized environment (VecEnv)**. A VecEnv can step **multiple instances of an environment in a single call**, which usually makes data collection more efficient than stepping a single Gymnasium environment.
- The library follows a scikit-learn-like syntax for training and using Reinforcement Learning algorithms: create a model, call `learn()`, then call `predict()`.
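
To make the idea of a vectorized environment concrete, here is a toy sketch in plain Python. It is not SB3's implementation: `ToyEnv` and `ToyVecEnv` are hypothetical classes that only illustrate how several environment copies are stepped together and how a finished episode is reset automatically, as SB3's VecEnvs do.

```python
class ToyEnv:
    """Hypothetical environment whose episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return float(self.t)

    def step(self, action):
        self.t += 1
        reward = 1.0
        done = self.t >= self.horizon
        return float(self.t), reward, done


class ToyVecEnv:
    """Steps several ToyEnv copies in lockstep and batches the results."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d = env.step(action)
            if d:
                # Auto-reset: the returned observation starts the next episode.
                o = env.reset()
            obs.append(o)
            rewards.append(r)
            dones.append(d)
        return obs, rewards, dones


vec_env = ToyVecEnv([ToyEnv(horizon=2), ToyEnv(horizon=3)])
obs = vec_env.reset()
obs, rewards, dones = vec_env.step([0, 0])
obs, rewards, dones = vec_env.step([0, 0])
print(obs, rewards, dones)  # [0.0, 2.0] [1.0, 1.0] [True, False]
```

One call to `step` takes one action per environment and returns one observation, reward, and done flag per environment. The first environment finished its episode on the second step, so its observation is already the reset value.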


**Example: Train A2C on CartPole**
The following code shows how to train and test an A2C agent on CartPole-v1:

**Steps**:
1. Import the necessary **libraries**.
2. Create the **Gymnasium environment**.
3. Initialize the **A2C model with MlpPolicy**.
4. **Train the model** for 10_000 steps.
5. **Test the model** for 1000 steps.

In [2]:
import gymnasium as gym
from stable_baselines3 import A2C

In [3]:
# Create the Gymnasium environment
env = gym.make("CartPole-v1", render_mode="rgb_array")

In [4]:
# Initialize the A2C model with MlpPolicy
model = A2C("MlpPolicy", env, verbose=1)
# Train the model for 10_000 steps
model.learn(total_timesteps=10_000)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 26.1     |
|    ep_rew_mean        | 26.1     |
| time/                 |          |
|    fps                | 478      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.671   |
|    explained_variance | -0.13    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.76     |
|    value_loss         | 10.8     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 32.8     |
|    ep_rew_mean        | 32.8     |
| time/                 |          |
|    fps                | 490      |
|    iterations         | 200      |
|    time_elapsed 

<stable_baselines3.a2c.a2c.A2C at 0x24d6a25e080>

In [5]:
# Getting the vectorized environment
vec_env = model.get_env()
obs = vec_env.reset()

In [None]:
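# Test the trained model for 1000 steps
for i in range(1000):
    # Pick the most likely action instead of sampling from the policy
    action, _states = model.predict(obs, deterministic=True)
    # VecEnv.step returns batched values and resets finished episodes automatically
    obs, reward, done, info = vec_env.step(action)
    vec_env.render("human")  # Show the environment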

**What Happens During Training?**
- The **agent observes the current state** of the environment (`obs`).
- It **predicts an action** with its current **policy**.
- It **executes the action** and **receives a reward**.
- It **updates the policy** from the accumulated experience; A2C does this after every `n_steps` environment steps (5 by default, which is why the log above reports 100 iterations at 500 timesteps).
- It **repeats the process** until the defined number of steps (`total_timesteps`) is reached.
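
The loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not A2C: the "policy" is a single probability, the environment has one constant observation, and the update rule is a simple nudge toward rewarded actions. The function name `train_toy_agent` and its parameters are hypothetical; `n_steps=5` is chosen to mirror A2C's default rollout length.

```python
import random


def train_toy_agent(total_timesteps=1000, n_steps=5, learning_rate=0.05, seed=0):
    """Run the observe/act/update loop on a toy problem.

    Returns the final probability of choosing action 1 and the number of updates.
    """
    rng = random.Random(seed)
    p_action1 = 0.5   # toy "policy": probability of picking action 1
    obs = 0           # the toy environment has one constant observation
    rollout = []      # experience accumulated since the last update
    n_updates = 0

    for _ in range(total_timesteps):
        # Observe the state and sample an action from the current policy.
        action = 1 if rng.random() < p_action1 else 0
        # Execute the action and receive a reward (action 1 pays 1, action 0 pays 0).
        reward = float(action)
        rollout.append((obs, action, reward))

        # Update the policy once n_steps transitions have accumulated.
        if len(rollout) == n_steps:
            for _obs, past_action, past_reward in rollout:
                # Move the policy toward actions that were rewarded.
                p_action1 += learning_rate * past_reward * (past_action - p_action1)
            p_action1 = min(p_action1, 0.95)  # keep some exploration
            rollout.clear()
            n_updates += 1

    return p_action1, n_updates


final_p, n_updates = train_toy_agent()
print(f"P(action 1) = {final_p:.2f} after {n_updates} updates")
```

With 1000 timesteps and 5 steps per rollout, the loop performs 200 updates, the same ratio of timesteps to iterations that appears in the A2C training log.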