# What is reinforcement learning?

Reinforcement Learning (RL) is a computational approach to goal-directed learning from interaction with an environment, studied through idealized learning situations. *Computational approach* means that learning is carried out by algorithms; *goal-directed* means that the learner has a goal it is trying to achieve; and *idealized learning situations* means that the problem is deliberately narrowed so that it can be stated and analyzed precisely.

### General Reinforcement Learning Problem

In a general RL problem, we have an environment and an agent. The environment exposes its state to the agent; the agent senses that state and takes an action. The environment processes the action and then produces two things: a reward and a new state (*state+1*). The cycle then continues: the agent senses the new state, produces a new action, and so on.

<img src="images/reinforcement_learning_problem.svg" width="40%" />
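
The cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch: `CoinFlipEnv`, `run`, and the echo policy are hypothetical names invented for this example and are not part of any course library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent is rewarded for repeating the bit it is shown."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # Produce the initial state that the agent will sense.
        self.state = self.rng.choice([0, 1])
        return self.state

    def step(self, action):
        # Process the action, then emit the next state and a reward.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.choice([0, 1])
        return self.state, reward


def run(env, policy, n_steps):
    """Run the sense-act-reward cycle for n_steps and return the total reward."""
    state = env.reset()
    total = 0.0
    for _ in range(n_steps):
        action = policy(state)            # the agent selects an action
        state, reward = env.step(action)  # the environment responds
        total += reward
    return total


# A policy that echoes the observed state is always rewarded in this toy setup.
total_reward = run(CoinFlipEnv(), policy=lambda s: s, n_steps=10)
```

The point of the sketch is the shape of the loop (state in, action out, reward and next state back), not the toy task itself.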

Historically, RL borrows an idea from the psychology of animal learning: an animal learns behaviors not through insight but through trial and error, trying different things at random. From his studies of animals, Edward Thorndike formulated the *law of effect*, which states that any behavior followed by pleasant consequences is likely to be repeated, and any behavior followed by unpleasant consequences is likely to be stopped. From this perspective, an RL agent tries different actions and selects among them by comparing their consequences (the *selectional* aspect), and each selection is associated with a particular situation (the *associative* aspect).

### Elements of Reinforcement Learning

In the image below, we have the elements of RL, which are explained as follows:

<img src="images/elements_of_rl.svg" width="40%" />

The **time step** divides time into discrete steps ($t$), where each step corresponds to one cycle of the environment-agent interaction. The **environment** defines the world that the agent interacts with. Its basic loop starts by producing a state and a reward for the agent to sense and process. It then accepts an action from the agent and cycles back to produce the next state and reward.

The **agent** learns to achieve goals by interacting with the environment. In its loop, it senses the state and the reward from the environment and then selects an action to pass back to the environment. The **state** represents the situation in the environment on which the agent bases its actions.

**Reward** ($r_t$) is a scalar value (a floating-point number) returned by the environment after the agent selects an action. An **action** is what the agent does at each time step. Actions can be discrete, meaning one of a fixed number of choices as in a video game, or continuous, such as turning a steering wheel to a certain angle or changing the angle of the gas pedal to feed in more or less gas.

A **policy** ($\pi$) is a mapping from a state to the action to take in that state. It can be deterministic, where the same action is chosen every time the state is visited, or stochastic, where actions are drawn from a probability distribution (for example, action one 70% of the time and action two 30% of the time).
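
The difference between the two kinds of policy can be illustrated with a short sketch. The action names and the function names below are hypothetical, chosen only to mirror the 70%/30% example in the text.

```python
import random

ACTIONS = ["action_one", "action_two"]


def deterministic_policy(state):
    """Always return the same action for a given state (here: for every state)."""
    return ACTIONS[0]


def stochastic_policy(state, rng=random):
    """Sample an action: action one with probability 0.7, action two with 0.3."""
    return rng.choices(ACTIONS, weights=[0.7, 0.3], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic policy reproducible, which is convenient when debugging.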

A **value function** measures how good a state is in the long run, as estimated by the agent. More precisely, it is the expected long-term accumulation of reward when starting from state $s$ and following policy $\pi$. As an analogy, the reward is like the immediate pleasure or pain experienced by a person, while the value function represents a more farsighted judgment of the worth of a state. Value functions come in two basic flavors: the state-value function $V^{\pi}(s)$, which represents the goodness of state $s$ when following policy $\pi$; and the $Q$ function $Q^{\pi}(s,a)$, which is the goodness of taking action $a$ in state $s$ and following policy $\pi$ thereafter.
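
One common way to write these two quantities is shown below. The discount factor $\gamma \in [0, 1]$ and the convention that $r_{t+k+1}$ is the reward received $k+1$ steps after time $t$ are assumptions of this sketch; the text above does not introduce them.

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]
$$

The only difference between the two is the extra conditioning on the first action $a_t = a$ in the $Q$ function.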

A **model** of the environment gives the transition probability to each next state and the probability of the reward received for that transition. Methods are classified by whether they use a model: *model-free* methods have no model and learn purely by trial and error, while *model-based* methods are considered planning learners because they typically do not take actions in the environment. Instead, they follow what would happen through the model itself to find out the next state and the reward.
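
As a rough illustration of what "following what would happen through the model" can mean, the sketch below stores a model as a table of outcomes and plans one step ahead without touching the environment. The two-state `MODEL` and both function names are hypothetical and made up for this example.

```python
# Hypothetical model: MODEL[(state, action)] lists the possible outcomes
# as (probability, next_state, reward) tuples.
MODEL = {
    ("cold", "heat"): [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
    ("cold", "off"): [(1.0, "cold", 0.0)],
    ("warm", "heat"): [(1.0, "warm", 0.0)],
    ("warm", "off"): [(0.8, "warm", 1.0), (0.2, "cold", 0.0)],
}


def expected_reward(model, state, action):
    """Expected immediate reward of an action, computed from the model alone."""
    return sum(p * r for p, _next_state, r in model[(state, action)])


def plan_one_step(model, state, actions):
    """Pick the action with the highest expected reward by querying the model."""
    return max(actions, key=lambda a: expected_reward(model, state, a))
```

A model-free learner would instead have to try `heat` and `off` in the real environment and estimate these quantities from the rewards it observes.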

Tasks come in two types. **Episodic** tasks come to a natural end and are typically repeated over and over (*e.g.*, tic-tac-toe and Pac-Man). In a **continuing task**, the agent senses and controls at every time step, with no natural endpoint (*e.g.*, an air conditioner).

### Examples of RL

- **Tic-tac-toe game**

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Tic_tac_toe.svg/800px-Tic_tac_toe.svg.png" width="20%" align="center" />

This is a game for two players, "X" and "O", who take turns marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game.

| Elements    | Description                                    |
| :---------- | :--------------------------------------------- |
| Agent       | "X" player                                     |
| Environment | "O" player, general rules                      |
| State       | 9 squares, each holding X, O, or blank         |
| Actions     | Place "X" in one of the 9 squares              |
| Rewards     | At the end of the game: 1=win, 0=tie, -1=loss  |
| Task type   | Episodic                                       |
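
The table can be made concrete with a small sketch. The board encoding (a sequence of 9 marks, with `" "` for a blank square, indexed row by row) and all function names are assumptions made for this illustration.

```python
# All eight winning lines on a 3x3 board indexed 0..8, row by row.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_actions(board):
    """Actions for the agent: the indices of the blank squares."""
    return [i for i, mark in enumerate(board) if mark == " "]


def reward_for_x(board):
    """Reward from the table for the "X" agent: 1=win, -1=loss, 0 otherwise."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    return 0
```

Note that `reward_for_x` also returns 0 for unfinished games, which matches the idea that the only informative reward arrives at the end of an episode.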


- **Smart Thermostat**

<img src="images/thermostat.svg" width="20%" align="center" />

This thermostat is a little different from most. In this example, it has only a readout of the current temperature and two buttons that the user can press: a smiley face if they are happy with the way things are going, and a frown face if they are unhappy.

| Elements    | Description                                    |
| :---------- | :--------------------------------------------- |
| Agent       | Air conditioning / Heat control                |
| Environment | House and its occupants                        |
| State       | Current temperature, day, time                 |
| Actions     | Heat, Cool, Off                                |
| Rewards     | Smile=+1, Frown=-1, None=0                     |
| Task type   | Continuing                                     |


### Approaches to solving RL problems

**Value function methods** estimate the value of states or of state-action pairs. The policy then selects the actions that lead to the states with the largest value.
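
A minimal sketch of how a policy can be derived from estimated values is given below. The table `Q`, its numbers, and the function names are hypothetical; how the estimates are learned is outside the scope of this sketch.

```python
# Hypothetical value estimates for state-action pairs.
Q = {
    ("s0", "left"): 0.2,
    ("s0", "right"): 0.7,
    ("s1", "left"): 0.5,
    ("s1", "right"): 0.1,
}


def greedy_action(q, state, actions):
    """Select the action with the largest estimated value in this state."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def state_value(q, state, actions):
    """Value of a state under the greedy policy: the best action value."""
    return max(q.get((state, a), 0.0) for a in actions)
```

Unseen state-action pairs default to a value of 0.0 here, which is an arbitrary choice made to keep the example short.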

**Direct policy search** models the policy itself. The input is typically a state, or something we approximate as a state, and the output is the action to take, either discrete or continuous.
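
One possible way to model a policy directly, for the discrete-action case, is a parameterized function that maps state features to action probabilities. The linear-softmax form, the parameter layout `theta`, and the function names below are assumptions chosen for illustration, not a method prescribed by the text.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(theta, state_features):
    """Map state features to action probabilities.

    theta holds one weight vector per action (hypothetical layout).
    """
    scores = [
        sum(w * f for w, f in zip(weights, state_features))
        for weights in theta
    ]
    return softmax(scores)
```

Searching for a good policy then means adjusting `theta` directly, rather than estimating values first and deriving the policy from them.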

# Exercise: Simple Rooms Environment

For the exercises, we have a file called `simple_rooms.py`. This file, located in the `lab/LabFiles/lib/envs/` folder, contains the classes for a simple rooms environment. The `Environment` class is provided as an interface. An environment must have some representation of the state with which the agent is interacting. In addition, an environment must be able to reset itself and step to the next state; these behaviors are implemented in the `reset()` and `step()` functions. The `reset()` function should return the initial state, while the `step()` function should take in an action and, at a minimum, return the next state and the reward. The `actions()` function reports how many types of actions the environment has. It is used in conjunction with the `ActionSpace` class.
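
To make the contract concrete, here is a hedged sketch of what such an interface could look like. It is not the contents of `simple_rooms.py`: the class names `EnvironmentSketch` and `CorridorEnv`, the exact signatures, and the tuple returned by `step()` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class EnvironmentSketch(ABC):
    """Hypothetical stand-in for an environment interface."""

    @abstractmethod
    def reset(self):
        """Return the initial state."""

    @abstractmethod
    def step(self, action):
        """Apply an action and return (next_state, reward)."""

    @abstractmethod
    def actions(self):
        """Return the number of action types."""


class CorridorEnv(EnvironmentSketch):
    """Hypothetical 1-D corridor: start at cell 0, goal at the last cell."""

    def __init__(self, length=4):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the move.
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        reward = 1 if self.position == self.length - 1 else 0
        return self.position, reward

    def actions(self):
        return 2
```

The real `Environment` class in the lab files may differ in names and return values, so treat this only as a reading aid.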

The `SimpleRoomsEnv` class implements the `Environment` interface, and we examine it in more detail here. `SimpleRoomsEnv` is a simple environment of 4x4 rooms bounded by walls. In the initial state, the agent is in the room at the top left corner, and its goal is to reach the room at the bottom right corner. The 4x4 rooms are illustrated in the image below:

<img src="https://prod-edxapp.edx-cdn.org/assets/courseware/v1/e93aa646b1bd0e176dd7fee8b765a9b4/asset-v1:Microsoft+DAT257x+1T2018+type@asset+block/SimpleRooms.PNG" width="30%"/>

In this environment, four actions can be performed (0: north, 1: east, 2: west, and 3: south). Each room therefore has its own set of possible transitions. The rooms are identified by ids 0 to 15, as follows:

<img src="images/environment.svg" width="20%" />

From `id=0`, only two transitions are possible (go to `id=1` or go to `id=4`). From `id=1`, there are three (go to `id=0`, `id=2`, or `id=5`). Note that the blue room (`id=0`) is the start room and the red room (`id=15`) is the goal room (`reward=1`).
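
The transitions listed above can be derived mechanically, assuming the rooms are numbered row by row from the top left (which is consistent with `id=0` neighboring `id=1` and `id=4`). The helper below is a hypothetical sketch and is not taken from `simple_rooms.py`; it simply omits moves that would run into a wall.

```python
N = 4  # the grid is N x N rooms

# Action ids as given in the text: 0 north, 1 east, 2 west, 3 south.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (0, -1), 3: (1, 0)}


def neighbors(room_id, n=N):
    """Map each action that stays inside the walls to the resulting room id."""
    row, col = divmod(room_id, n)
    result = {}
    for action, (d_row, d_col) in MOVES.items():
        r, c = row + d_row, col + d_col
        if 0 <= r < n and 0 <= c < n:
            result[action] = r * n + c
    return result
```

For example, `neighbors(0)` contains only the east and south moves, while an interior room such as `id=5` has all four.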

# Exercise: The CliffWalking Environment

The `CliffWalking` environment is a simple 4x12 grid of tiles, illustrated in the image below, that contains "cliffs", which are terminal states. In the initial state, the agent is on the tile at the bottom left corner (yellow tile), and its goal is to reach the tile at the bottom right corner (green tile) while avoiding the cliffs (navy tiles).

<img src="https://prod-edxapp.edx-cdn.org/assets/courseware/v1/a24018ccbfbee5fb1ef116b70d46450c/asset-v1:Microsoft+DAT257x+1T2019+type@asset+block/CliffWalking.PNG" width="40%"/>

In this environment, you have to avoid the cliff tiles and get to the goal as fast as possible.
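
The layout can be sketched as follows. The start and goal positions come from the description above; the assumption that the cliff occupies every bottom-row tile between them follows the classic cliff-walking layout and should be checked against the image. All names are hypothetical, and rewards are deliberately left out because the text does not specify them.

```python
ROWS, COLS = 4, 12

START = (ROWS - 1, 0)        # bottom left corner
GOAL = (ROWS - 1, COLS - 1)  # bottom right corner

# Assumption: the cliff fills the bottom row between start and goal.
CLIFF = {(ROWS - 1, col) for col in range(1, COLS - 1)}


def tile_type(position):
    """Classify a (row, col) tile as start, goal, cliff, or safe."""
    if position == START:
        return "start"
    if position == GOAL:
        return "goal"
    if position in CLIFF:
        return "cliff"
    return "safe"


def render():
    """Draw the grid as text: S=start, G=goal, C=cliff, .=safe."""
    symbols = {"start": "S", "goal": "G", "cliff": "C", "safe": "."}
    lines = []
    for row in range(ROWS):
        lines.append("".join(symbols[tile_type((row, col))] for col in range(COLS)))
    return "\n".join(lines)
```

Under this assumed layout, the shortest safe route goes up one row, runs along the edge of the cliff, and drops down onto the goal.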