# 1. Components of an RL System

<img src="images/sar-flow.png" width="500">

Let's quickly recall what we had discussed earlier concerning the components of an RL system. We talked about the following:

> * **Agent**: The thing that plays the game, and that we want to program the RL algorithm into.
<br>
<br>
* **Environment**: The thing that the agent interacts with; the agent's world.
<br>
<br> 
* **State**: Specific configuration of the environment that the agent is sensing. Note, the state only represents that which the agent can sense. 
<br>
<br> 
* **Actions**: Things that an agent can do that will affect its state. In Tic-Tac-Toe, that's placing a piece on the board. Performing an action always brings us to the next state, which also comes with a possible reward.
<br>
<br> 
* **Rewards**: Tell you how good your action was, not whether it was correct or incorrect. A reward does not tell you whether it was the best or worst action; it is just a number. The rewards you have received over the course of your existence don't necessarily represent the rewards you could get in the future. For example, you could search a bad part of the state space and hit a local max of 10 pts, while the global max was actually 1000 pts. The agent does not know that (but in our case we will, since we designed the game).

## 1.1 Notation
We know that being in a state, $S(t)$, and taking an action $A(t)$, will lead us to the reward $R(t+1)$ and the state $S(t+1)$:

#### $$S(t), A(t) \rightarrow R(t+1), S(t+1)$$

When we drop the time indices, we represent this as the 4-tuple:

#### $$(s,a,r,s')$$

The $r$ is not given a prime, even though it refers to the reward at time $t+1$, but this is standard notation. So the prime symbol doesn't strictly mean "at time $t+1$".
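
To make the notation concrete, here is a minimal sketch of how $(s, a, r, s')$ tuples arise when an agent interacts with an environment. Everything in it is an assumption made up for illustration: the `ToyEnv` class, its `reset` and `step` methods, and the random policy are hypothetical and not part of any library or of the game used in these notes.

```python
import random


class ToyEnv:
    """Hypothetical 1-D walk: states 0..4, start at 2; states 0 and 4 end the game."""

    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.state += action
        done = self.state in (0, 4)
        reward = 1 if self.state == 4 else 0  # reward only for reaching the right end
        return reward, self.state, done


def run_episode(env, rng):
    """Play until the game ends with a random policy; return the (s, a, r, s') tuples."""
    transitions = []
    s = env.reset()
    done = False
    while not done:
        a = rng.choice([-1, 1])
        r, s_next, done = env.step(a)
        transitions.append((s, a, r, s_next))
        s = s_next  # the next state becomes the current state
    return transitions


transitions = run_episode(ToyEnv(), random.Random(0))
for s, a, r, s_next in transitions:
    print(f"s={s}, a={a:+d}, r={r}, s'={s_next}")
```

Note how the reward stored in each tuple is the one received on arriving in $s'$, which matches $R(t+1)$ in the time-indexed form.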

## 1.2 New Terms 
The first new term we want to discuss is **Episode**. This represents one run of the game. For example, we will start a game of tic-tac-toe with an empty board, and as soon as one player gets 3 pieces in a row, that's the end of the episode. As you may imagine, our RL agent will learn across many episodes. For example, after playing 1000, 10000, or 100000 episodes, we may have trained an intelligent agent. The number of episodes we will use is a hyperparameter and will depend on the game being played, the number of states, how random the game is, etc.

Playing tic-tac-toe is an **episodic task** because you can play it again and again. This is different from a **continuous task** (also called a continuing task), which never ends. We will not be looking at continuous tasks in this course.

Now, the next question we ask is: when is the end of an episode? Well, there are certain states in the state space that tell us when the episode is over. These are states from which no further action can be taken. They are referred to as **terminal states**. For tic-tac-toe, these are the states where one player has 3 in a row, or where the board is full (a draw).
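
The terminal-state check for tic-tac-toe can be sketched as follows. The board representation (a list of 9 cells in row-major order holding `'x'`, `'o'`, or `' '`) and the function names `winner` and `is_terminal` are assumptions chosen for this illustration.

```python
def winner(board):
    """Return 'x' or 'o' if that player has three in a row, else None.

    `board` is a list of 9 cells in row-major order, each 'x', 'o', or ' '.
    """
    lines = [
        (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
        (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
        (0, 4, 8), (2, 4, 6),             # diagonals
    ]
    for i, j, k in lines:
        if board[i] != ' ' and board[i] == board[j] == board[k]:
            return board[i]
    return None


def is_terminal(board):
    """A state is terminal if someone has won or the board is full (a draw)."""
    return winner(board) is not None or ' ' not in board


empty = [' '] * 9
x_wins = list("xxxoo    ")  # x has completed the top row
draw = list("xoxxoooxx")    # full board with no three in a row

print(is_terminal(empty), is_terminal(x_wins), is_terminal(draw))
```

An episode loop would call `is_terminal` after every move and stop as soon as it returns `True`.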

## 1.3 Cart-Pole / Inverted Pendulum
This problem comes up all the time in RL and control systems. If you search Google for "inverted pendulum", you will see research papers concerning control systems, and if you search for "cart-pole" you will see all kinds of RL research papers. At the beginning of an episode, the cart is stationary and the pole is perpendicular to the ground. Because the system is unstable, the pole will then begin to fall, and the job of the cart is to move so that the pole does not fall down.

Any angle past the threshold beyond which it is impossible to get the pole back up is a terminal state. Note that because the angle in this example is a real number, the state space is continuous (infinite). We will not deal with such state spaces.

## 1.4 Assigning Rewards
One difficult problem in reinforcement learning that we will come across is defining rewards. We, the programmers, can be thought of as coaches to the AI. The reward is something that we give to the agent. So, we can define how we are going to reward the agent, which will drive how it learns.

For example, if we just give it the same reward no matter what it does, then the agent will probably just end up acting randomly, since any action will lead to the same value. You don't want that, because the reward then gives the agent no signal that separates good behavior from bad.

A more realistic example is a robot trying to solve a maze. If it manages to exit the maze it receives a reward of 1, otherwise it receives 0. Most people would think that this seems reasonable, and it is _semi-reasonable_. However, with this reward structure, the robot may never actually solve the maze: if it has only ever received a reward of 0, it may think that 0 is the best it can do. A better solution would be to give the robot a -1 for every step it takes, so that it is encouraged to solve the maze as quickly as possible.
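
The difference between the two reward structures can be sketched by comparing the total reward of a few maze runs. This is only an illustration under stated assumptions: rewards are summed without discounting, and the function name `episode_return`, the scheme labels, and the step counts are all hypothetical.

```python
def episode_return(num_steps, reached_exit, scheme):
    """Total (undiscounted) reward collected in one maze episode."""
    if scheme == "sparse":
        # 1 for exiting the maze, 0 otherwise
        return 1 if reached_exit else 0
    if scheme == "step_penalty":
        # -1 for every step taken
        return -num_steps
    raise ValueError(f"unknown scheme: {scheme}")


# Two successful runs: one fast (10 steps) and one slow (100 steps).
fast_sparse = episode_return(10, True, "sparse")
slow_sparse = episode_return(100, True, "sparse")
fast_penalty = episode_return(10, True, "step_penalty")
slow_penalty = episode_return(100, True, "step_penalty")

print("sparse:      ", fast_sparse, slow_sparse)
print("step penalty:", fast_penalty, slow_penalty)
```

Under the sparse scheme both successful runs look identical, and every unsuccessful run scores 0 no matter what the robot did. Under the step penalty the faster run is clearly better, which is the signal we want the agent to learn from.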

### 1.4.1 Caution
One point of caution is to not build your own prior knowledge into the AI. For example, in a game such as chess, an agent should be rewarded only for winning, not for taking the opponent's pieces or for implementing some strategy that you read about in a chess book. You want to leave the agent free to come up with its own solution. The danger of rewarding the agent for achieving subgoals is that it may find a novel way to maximize the reward for the subgoals without actually winning the game. For example, it could take all of the opponent's chess pieces and then still lose the game.

So to summarize, we can say that:

> "The reward signal is your way of telling the agent what you want it to achieve, not how you want it to be achieved."

---

# 3. The Value Function 
Take a moment to consider the following scenario. You have an exam tomorrow. You would like to hang out with your friends. You know that if you hang out with your friends you will most likely have a dopamine hit and feel happy. On the other hand, if you study for your exam you will feel tired and potentially bored. So why study? Well, this is the idea of **planning**.

In particular, we can describe a *value function*:

> **Value Function**: We don't just think about immediate rewards, but future rewards too. We want to assign some value to the current state that reflects the future too.

Now, we can think of this in the reverse direction as well. Let's say you receive a reward: getting hired for your dream job. Now, if you look back at your career and the things you did in school, what would you attribute your success to? What state of being in your past led you to get your dream job today? This is referred to as the *credit assignment problem*:

> **Credit Assignment Problem**: What did you do in the past that led to the reward you are receiving now? In other words, what action gets the credit?

The credit assignment problem shows up in online advertising as well, but the concept is referred to as **attribution**. The idea is that if a user is shown the same ad 10 different times before they buy, which ad gets the credit? 

Now, closely related to the credit assignment problem is the idea of **delayed rewards**. Note that these are all just different ways of saying the same thing, and the solution is also the same. Delayed rewards is just looking at the problem from the other direction. With credit assignment, we are looking into the past and asking "What action led to the reward we are getting now?". With delayed rewards, we are asking "How is the action I am taking now related to rewards I may potentially receive in the future?"

The idea of delayed rewards tells us that an AI must have the ability of foresight, or planning. 

## 3.1 Example Scenario

<img src="images/scenario-1.png">

Imagine the following: you are in state A, which is the second-to-last state in a game. There are only two possible next states, both of which are terminal states. B gives you a reward of 1, and C gives you a reward of 0. You have a 50% probability of going to either state; perhaps your agent doesn't know which one is best. We can think of the value of state B as 1, and the value of state C as 0. So, what is the value of state A? We can think of it as 0.5, since that is the expected value of your final reward, given that you have a 50% chance of ending up in either state:

#### $$Value(A) = 0.5*1 + 0.5*0 = 0.5$$

Now, let's say that you are in state A, and state A can only lead to state B, and there is no other possible next state:

<img src="images/scenario-2.png">

If B gives you a reward of 1, then perhaps A's value should also be 1, since the only possible final scenario is to have a final reward of 1, once you reach A:

#### $$Value(A) = 1 * 1 = 1$$

Thus, the value tells us about the "future goodness" of a state. We can make this a little more formal; we call this value the value function.
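
Both scenarios above are the same computation: a probability-weighted average over the possible next states. A minimal sketch follows, where the function name `state_value` and the list-of-pairs input format are assumptions made for this illustration.

```python
def state_value(next_states):
    """Expected final reward, given (probability, value) pairs for each possible next state."""
    total_prob = sum(p for p, _ in next_states)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in next_states)


value_b, value_c = 1.0, 0.0

# Scenario 1: A leads to B or C with equal probability.
scenario_1 = state_value([(0.5, value_b), (0.5, value_c)])

# Scenario 2: A can only lead to B.
scenario_2 = state_value([(1.0, value_b)])

print(scenario_1, scenario_2)  # 0.5 1.0
```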

## 3.2 Value Function
The value function is a measure of the future rewards that we may get:

> **V(s)** = the value (taking into account the probability of all possible future rewards) of a state

The name "value" is not quite ideal, since it is very ambiguous. However, it is part of the RL nomenclature, so we will deal with it.

### 3.2.1 Rewards vs. Values
It is easy to get rewards and values mixed up. The difference is that the value of a state is a measure of the possible future rewards we may get from being in that state. Rewards, on the other hand, are immediate.

We therefore choose actions based on the **values of states**, and not on the reward we would get by going to a state! The reward is the main goal, but we can't use the reward to tell us how good a state is, since it doesn't tell us anything about future rewards.
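
As a small illustration of why this matters, suppose two next states are available. The state names and all numbers below are hypothetical, chosen so that the state with the larger immediate reward has the smaller value.

```python
# Hypothetical next states: immediate reward for entering each one,
# and its value (a measure of the future rewards obtainable from it).
candidates = {
    "shortcut": {"reward": 1.0, "value": -5.0},
    "detour": {"reward": 0.0, "value": 10.0},
}

# Picking by immediate reward versus picking by value.
by_reward = max(candidates, key=lambda s: candidates[s]["reward"])
by_value = max(candidates, key=lambda s: candidates[s]["value"])

print("greedy on reward:", by_reward)
print("greedy on value: ", by_value)
```

Choosing by reward takes the shortcut and ends up somewhere with poor prospects, while choosing by value gives up the immediate reward for a better future.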

### 3.2.2 Efficiency
One way to think about the value function is that it is a fast and efficient way to determine how good it is to be in a state, without needing to search the rest of the game tree. You could try to enumerate every possible state transition and its probability of occurring; however, this would be computationally inefficient. In fact, tic-tac-toe is easy since it is only a 3x3 board, so the number of states is at most 3^(3x3) = 19683 (each cell is empty, X, or O). However, that number grows exponentially with the number of cells on the board! For example, a 4x4 board already gives 3^(4x4) = 43,046,721, about 43 million. Exponential growth is never good, and hence searching the state space is only possible in the simplest of cases. A value function that can tell you how well you will do in the future, given only your current state, is therefore *extremely* helpful. Once we have it, $V(s)$ gives an answer instantly, in only $O(1)$ time! The only question is whether it is even possible to find a value function...
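
The growth in the number of board configurations is easy to check numerically. The sketch below counts every way of filling an n x n board with three symbols, which is an upper bound because it includes boards that can never occur in a real game. It also shows what a constant-time lookup looks like if the value function is stored as a table; the function name `max_board_states` and the entries in `V` are hypothetical placeholders, not learned values.

```python
def max_board_states(n, symbols_per_cell=3):
    """Upper bound on the number of configurations of an n x n board."""
    return symbols_per_cell ** (n * n)


for n in (3, 4, 5):
    print(f"{n}x{n} board: at most {max_board_states(n):,} states")

# A tabular value function is a lookup table, so a query is a single
# dictionary access (constant time on average) instead of a tree search.
empty_board = (" ",) * 9
V = {
    ("x", "x", "x", "o", "o", " ", " ", " ", " "): 1.0,  # hypothetical entry
    empty_board: 0.5,                                     # hypothetical entry
}
print(V[empty_board])
```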

### 3.2.3 Value Functions in RL
Estimating the value function is a central task in RL, but it should be noted that not all RL algorithms require it. For instance, a genetic algorithm simply spawns offspring that have random genetic mutations, and the ones that survive the longest go on to spawn offspring in the next generation. So, by pure evolution and natural selection, we can breed agents that get iteratively better at the game! However, most of the time this is not the type of algorithm that we are interested in for RL.
