### What is Reinforcement Learning?

Reinforcement learning is a branch of machine learning in which a model learns to solve a problem by making a sequence of decisions on its own.

We model an environment after the problem statement. The model, called the agent, interacts with this environment and finds solutions on its own, without human intervention. To push it in the right direction, we give it a positive reward when an action brings it closer to its goal and a negative reward when an action takes it further away.
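
The reward loop described above can be sketched in a few lines. This is a minimal, hypothetical example that is not tied to any library: `LineWorld`, its `goal`, and the two hand-written policies are illustrative assumptions. The environment returns `+1` when an action moves the agent closer to the goal and `-1` when it moves away.

```python
import random


class LineWorld:
    """Hypothetical toy environment: the agent walks along a number line."""

    def __init__(self, start=0, goal=5):
        self.position = start
        self.goal = goal

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (position, reward, done)."""
        old_distance = abs(self.goal - self.position)
        self.position += action
        new_distance = abs(self.goal - self.position)
        # Positive reward for moving closer to the goal, negative otherwise.
        reward = 1 if new_distance < old_distance else -1
        done = self.position == self.goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the policy act until the goal is reached or max_steps is hit."""
    total_reward = 0
    position = env.position
    for _ in range(max_steps):
        action = policy(position, env.goal)
        position, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(position, goal):
    """Ignore the state and move left or right at random."""
    return random.choice([-1, 1])


def towards_goal(position, goal):
    """Always step in the direction of the goal."""
    return 1 if goal > position else -1


random.seed(0)
print("random policy return:", run_episode(LineWorld(), random_policy))
print("goal-seeking policy return:", run_episode(LineWorld(), towards_goal))
```

Both policies here are written by hand. A reinforcement learning algorithm would instead have to discover behavior like `towards_goal` from the rewards alone.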

To understand reinforcement learning better, consider a dog that we have to house train. Here, the dog is the agent and the house, the environment.

![image.png](attachment:daab1b44-f8bd-40ed-ad46-6f5f604bb4f3.png)

Figure 1: Agent and Environment

We can get the dog to perform various actions by offering incentives such as dog biscuits as a reward.

![image.png](attachment:6f568e1e-05af-459f-bdd5-8a913741b4d6.png)

Figure 2: Performing an Action and getting Reward

The dog will follow a policy that maximizes its reward, so it will tend to obey commands and might even learn a new action, such as begging, all by itself.

![image.png](attachment:ad781118-c069-4f1e-ad69-7ba06d1d5ec5.png)

Figure 3: Learning new actions

The dog will also want to run around, play, and explore its environment. This behavior is called exploration. The tendency of the dog to maximize rewards using what it already knows is called exploitation. There is always a tradeoff between the two, since exploratory actions may lead to lower rewards.

![image.png](attachment:a9cbef88-753c-454b-995a-50e62a00054c.png)

Figure 4: Exploration vs Exploitation
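
One common way to balance exploration and exploitation is an epsilon-greedy rule: with probability `epsilon` the agent explores by picking a random action, and otherwise it exploits the action with the best estimated reward so far. The sketch below is a minimal illustration on a made-up three-action problem. The reward probabilities, the `epsilon` value, and all function names are assumptions for the example and are not part of this notebook's later code.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-looking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


def run_bandit(true_probs, epsilon=0.1, steps=2000, seed=0):
    """Estimate each action's average reward while acting epsilon-greedily."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_probs)
    counts = [0] * len(true_probs)
    total_reward = 0
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        # Reward is 1 with the action's (hidden) success probability, else 0.
        reward = 1 if rng.random() < true_probs[action] else 0
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total = run_bandit([0.2, 0.5, 0.8])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("times each action was tried:", counts)
print("total reward:", total)
```

With `epsilon=0` this agent never explores and keeps choosing the first action it tried. Larger values of `epsilon` spend more steps on random actions, which costs reward in the short run but gives better estimates of every action.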

In [None]:
# !pip install stable-baselines3[extra]

### Stable-Baselines3 Docs - Reliable Reinforcement Learning Implementations
Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines.

Github repository: https://github.com/DLR-RM/stable-baselines3

Paper: https://jmlr.org/papers/volume22/20-1364/20-1364.pdf

RL Baselines3 Zoo (training framework for SB3): https://github.com/DLR-RM/rl-baselines3-zoo

RL Baselines3 Zoo provides a collection of pre-trained agents, scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

SB3 Contrib (experimental RL code, latest algorithms): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

#### Main Features

- Unified structure for all algorithms
- PEP8 compliant (unified code style)
- Documented functions and classes
- Tests, high code coverage and type hints
- Clean code
- Tensorboard support
- The performance of each algorithm was tested (see Results section in their respective page)

### Import Dependencies

In [2]:
import gym 
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy



### Description

    This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in
    ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems"](https://ieeexplore.ieee.org/document/6313077).
    
    A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
    
    The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces
     in the left and right direction on the cart.
     
    ### Action Space
    
    The action is a `ndarray` with shape `(1,)` which can take values `{0, 1}` indicating the direction
     of the fixed force the cart is pushed with.
     
    | Num | Action                 |
    |-----|------------------------|
    | 0   | Push cart to the left  |
    | 1   | Push cart to the right |
    
    **Note**: The change in velocity produced by the applied force is not fixed; it depends on the angle
     the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.
     
    ### Observation Space
    
    The observation is a `ndarray` with shape `(4,)` with the values corresponding to the following positions and velocities:
    
    | Num | Observation           | Min                 | Max               |
    |-----|-----------------------|---------------------|-------------------|
    | 0   | Cart Position         | -4.8                | 4.8               |
    | 1   | Cart Velocity         | -Inf                | Inf               |
    | 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
    | 3   | Pole Angular Velocity | -Inf                | Inf               |
    
    **Note:** While the ranges above denote the possible values of each element of the observation space,
        they do not reflect the allowed values of the state space in an unterminated episode. In particular:
    -  The cart x-position (index 0) can take values between `(-4.8, 4.8)`, but the episode terminates
       if the cart leaves the `(-2.4, 2.4)` range.
       
    -  The pole angle can be observed between  `(-.418, .418)` radians (or **±24°**), but the episode terminates
       if the pole angle is not in the range `(-.2095, .2095)` (or **±12°**)
       
    ### Rewards
    
    Since the goal is to keep the pole upright for as long as possible, a reward of `+1` is given for every step taken,
    including the termination step. The reward threshold is 475 for v1 (195 for v0).
    
    ### Starting State
    
    All observations are assigned a uniformly random value in `(-0.05, 0.05)`
    
    ### Episode End
    The episode ends if any one of the following occurs:
    1. Termination: Pole angle is outside the ±12° range
    2. Termination: Cart position is outside the ±2.4 range (center of the cart reaches the edge of the display)
    3. Truncation: Episode length is greater than 500 (200 for v0)
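
The episode-end rules listed above can be expressed as a small check. This is an illustrative sketch rather than the gym source code: the function name `episode_status`, the constant names, and the exact comparison operators are assumptions, while the thresholds (±2.4, ±12°, 500 steps for v1 and 200 for v0) come from the description above.

```python
import math

CART_POSITION_LIMIT = 2.4             # episode terminates beyond ±2.4
POLE_ANGLE_LIMIT = math.radians(12)   # ±12°, about 0.2095 rad
MAX_EPISODE_STEPS = {"CartPole-v0": 200, "CartPole-v1": 500}


def episode_status(observation, steps_taken, env_id="CartPole-v1"):
    """Return 'terminated', 'truncated' or 'running' for a CartPole observation.

    observation is [cart position, cart velocity, pole angle, pole angular velocity].
    """
    cart_position, _, pole_angle, _ = observation
    if abs(cart_position) > CART_POSITION_LIMIT or abs(pole_angle) > POLE_ANGLE_LIMIT:
        return "terminated"
    if steps_taken >= MAX_EPISODE_STEPS[env_id]:
        return "truncated"
    return "running"


print(episode_status([0.0, 0.1, 0.05, -0.2], steps_taken=10))
print(episode_status([0.0, 0.1, 0.30, -0.2], steps_taken=10))
print(episode_status([0.0, 0.1, 0.05, -0.2], steps_taken=200, env_id="CartPole-v0"))
```

Because every step earns `+1`, the total reward of an episode equals the number of steps it lasted, so the step limit also caps the score an agent can reach.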


### Load Environment

In [3]:
environment_name = "CartPole-v0"

In [4]:
env = gym.make(environment_name)

In [5]:
env

<TimeLimit<CartPoleEnv<CartPole-v0>>>

In [6]:
# !pip install pyglet==1.5.27

In [7]:
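# Uses the pre-0.26 gym API: reset() returns the state, step() returns 4 values
episodes = 5
for episode in range(1, episodes+1):
    state = env.reset()  # start a new episode
    done = False
    score = 0

    while not done:
        env.render()
        # Take a random action; no learning happens here
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward  # +1 for every step the episode lasts
    print('Episode:{} Score:{}'.format(episode, score))
env.close()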

Episode:1 Score:32.0
Episode:2 Score:11.0
Episode:3 Score:19.0
Episode:4 Score:12.0
Episode:5 Score:14.0


### Understanding The Environment
Environment source code: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

In [9]:
# 0: push cart to the left, 1: push cart to the right
env.action_space.sample()

1

In [10]:
# [cart position, cart velocity, pole angle, pole angular velocity]
env.observation_space.sample()

array([ 3.8712859e+00, -7.7311789e+37,  4.0943214e-01, -2.9295817e+38],
      dtype=float32)
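
Note that `observation_space.sample()` draws from the declared observation bounds rather than from states reached during an episode, which is why the two velocities above are astronomically large. As a small illustration, the sketch below labels the four sampled values and compares them with the termination limits from the environment description. The `CartPoleObservation` type is a hypothetical helper, not part of gym.

```python
from typing import NamedTuple


class CartPoleObservation(NamedTuple):
    """Hypothetical helper giving names to the four observation entries."""
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_angular_velocity: float


# The values printed by env.observation_space.sample() above.
sample = CartPoleObservation(3.8712859e+00, -7.7311789e+37, 4.0943214e-01, -2.9295817e+38)

for name, value in sample._asdict().items():
    print(f"{name:>22}: {value: .4g}")

# Compare with the termination limits from the environment description.
print("cart outside ±2.4:", abs(sample.cart_position) > 2.4)
print("pole beyond ±0.2095 rad:", abs(sample.pole_angle) > 0.2095)
```

Both checks are true for this sample, so an episode in this state would already have terminated. A random sample is a quick way to inspect the shape and dtype of the observation, but it says little about the values an agent will see in practice.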