# What is Reinforcement Learning?

### Reinforcement learning is about an agent learning to act so as to maximize the reward it collects. An agent is anyone or anything that interacts with the environment by executing actions, making observations, and receiving rewards for those actions.
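
The interaction loop described above can be sketched without any library. This is only an illustration: `CoinFlipEnv`, `run_episode`, and the reward rule are hypothetical names and choices made up for this example, not part of Gymnasium.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a coin flip."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        # Start a new episode and return a dummy first observation.
        self.steps = 0
        return 0

    def step(self, action):
        # The environment flips a coin; a correct guess earns reward 1.0.
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        return coin, reward, done


def run_episode(env, agent_rng):
    """Run one episode with an agent that guesses at random."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent_rng.randint(0, 1)               # agent executes an action
        observation, reward, done = env.step(action)   # observes and is rewarded
        total_reward += reward
    return total_reward


env = CoinFlipEnv(episode_length=10)
total = run_episode(env, random.Random(1))
print(f"total reward: {total}")
```

The random agent ignores its observations; a learning agent would use them, together with the rewards, to choose better actions over time.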

# Supervised Learning
### Learning a function that maps one or more random variables (features) to another random variable (target).
##### i.e., $f : X \to \widetilde{Y}$, where $X = \{X^{(1)}, X^{(2)}, \dots, X^{(d)}\}$ is the set of features and $\widetilde{Y}$ is the approximation of $Y$ produced by $f$. The function $f$ is learned by minimizing a cost function $C = \mathrm{kernel}(Y, \widetilde{Y})$ that measures the mismatch between $Y$ and $\widetilde{Y}$.
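
As a small worked example of the definition above, the sketch below fits a linear $f$ to a single feature and evaluates a cost on the result. The choice of a linear model and of mean squared error as the cost is an assumption made for illustration; `fit_line` and `mse` are hypothetical helper names, and the data is synthetic.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y ~ w * x + b for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mse(ys, preds):
    """Mean squared error between targets Y and predictions Y-tilde."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic data generated from y = 2x + 1 (assumed, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
preds = [w * x + b for x in xs]   # Y-tilde = f(X)
cost = mse(ys, preds)             # C(Y, Y-tilde)
print(f"w={w}, b={b}, cost={cost}")
```

Because the data is noise-free, the fitted line recovers the generating parameters and the cost is zero; with real data the cost would stay positive.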

## Markov Decision Processes (MDPs)
### A process is Markovian if the probability of the next state depends only on the current state and the action taken, not on the earlier history.
#### i.e., $$p(S_{t+1} = s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = p(S_{t+1} = s_{t+1} \mid s_t, a_t)$$
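
A minimal sketch of the Markov property: the transition table below is indexed only by the current state and action, so nothing about earlier states can influence the next state. The two-state "cold"/"warm" MDP, its probabilities, and the name `sample_next_state` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical MDP: (state, action) -> {next_state: probability}
TRANSITIONS = {
    ("cold", "heat"): {"warm": 0.9, "cold": 0.1},
    ("cold", "wait"): {"cold": 1.0},
    ("warm", "heat"): {"warm": 1.0},
    ("warm", "wait"): {"warm": 0.7, "cold": 0.3},
}


def sample_next_state(state, action, rng):
    """Sample s_{t+1} from p(. | s_t, a_t); no earlier history is needed."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist)
    weights = list(dist.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
state = "cold"
trajectory = [state]
for action in ["heat", "wait", "wait", "heat"]:
    state = sample_next_state(state, action, rng)
    trajectory.append(state)
print(trajectory)
```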

## Policy ($\pi$)
### A policy maps states to actions: it defines the probability of taking a specific action in a specific state.
#### i.e., $$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$
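
A stochastic policy can be sketched as a lookup table of action probabilities per state. The states, actions, and probabilities below are hypothetical, as are the helper names `pi` and `sample_action`; the loop checks empirically that sampled actions follow $\pi(a \mid s)$.

```python
import random

# Hypothetical tabular policy: state -> {action: probability}
POLICY = {
    "cold": {"heat": 0.8, "wait": 0.2},
    "warm": {"heat": 0.1, "wait": 0.9},
}


def pi(action, state):
    """Return pi(a | s), the probability of taking `action` in `state`."""
    return POLICY[state].get(action, 0.0)


def sample_action(state, rng):
    """Draw an action A_t according to pi(. | S_t = state)."""
    actions = list(POLICY[state])
    probs = [POLICY[state][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)
n_samples = 10_000
counts = {"heat": 0, "wait": 0}
for _ in range(n_samples):
    counts[sample_action("cold", rng)] += 1

heat_frequency = counts["heat"] / n_samples
print(f"pi(heat | cold) = {pi('heat', 'cold')}, empirical = {heat_frequency:.3f}")
```

A deterministic policy is the special case in which one action per state has probability 1.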

# Gymnasium

In [3]:
import gymnasium as gym

In [12]:
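env = gym.make("CartPole-v1")
env.observation_space.sample()  # a random point from the observation space, not the current state
env.reset()                     # start a new episode; required before the first step()
env.step(1)                     # action 1 pushes the cart right; returns (obs, reward, terminated, truncated, info)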

(array([-0.03027839,  0.24236609,  0.04314105, -0.24068168], dtype=float32),
 1.0,
 False,
 False,
 {})

In [16]:
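env.reset()
total_reward = 0
# Run one episode with uniformly random actions.
while True:
    action = env.action_space.sample()
    # step() returns (observation, reward, terminated, truncated, info)
    _, reward, is_done, is_trunc, _ = env.step(action)
    total_reward += reward
    if is_done or is_trunc:
        break
print(f'total reward :  {total_reward}')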

total reward :  29.0


In [4]:
import random

class RandomActionWrapper(gym.ActionWrapper):
    """With probability epsilon, replace the requested action with a random one."""

    def __init__(self, env: gym.Env, epsilon: float = 0.8):
        super().__init__(env)
        self.epsilon = epsilon

    def action(self, action: gym.core.WrapperActType) -> gym.core.WrapperActType:
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()
            print(f"Random action {action}")
        return action

env = RandomActionWrapper(gym.make("CartPole-v1"), epsilon=0.1)
env.reset()
total_step = 0
while True:
    # Always request action 0; the wrapper occasionally overrides it.
    _, rew, done, truncated, _ = env.step(0)
    total_step += 1
    # Stop on termination or truncation (time limit), not termination alone.
    if done or truncated:
        break
print("Total step %d" % (total_step))

Random action 0
Total step 9
