# Math

In this notebook of the series, we are going to review all the important concepts of reinforcement learning that we'll need for the upcoming notebooks 

---

<h3>Review Section Introduction</h3>

Here's an outline of everything we're going to cover in this review

---

<h3>The Explore-Exploit Dilemma</h3>

First, we're going to talk about the all-important explore-exploit dilemma

The basic idea behind this is that we have to collect data in order for our estimates to be accurate

But in order to collect data, we must spend resources to take actions and receive the corresponding rewards

The problem is we want to maximize our reward and hence we should try to take the optimal action each time

But in order to know what the optimal action is, we have to have enough data

So it's a circular problem

<img src="extras/65.0.PNG" width="400">

---

<h3>Markov Decision Process (MDP)</h3>

The second thing we're going to talk about is the Markov decision process, or MDP

<img src="extras/65.1.PNG" width="400">

The MDP is a framework that our reinforcement learning environments fit into

This is where we get the concept of transition probabilities and the Bellman equation

Transition probabilities are our model of the environment, and they tell us probabilistically where we will end up next, given where we are now and what action we take while we are here

The Bellman equation is the centerpiece of all the algorithms we study

Essentially, what it boils down to is that all the algorithms we study, whether it's Q-learning, SARSA or Monte Carlo, are solutions to the Bellman equation

---

<h3>Monte Carlo (MC)</h3>

Next, we are going to discuss what is perhaps the most straightforward method of solving MDP problems known as Monte Carlo 

The basic idea is, we know that the value function represents the expected return given a state

We can think of this as the average sum of future rewards once we end up in some state

Well, instead of taking this expected value, which we can't because we don't know the true underlying probability distribution, let's instead just calculate the sample mean 

$$E(X) \approx \frac{1}{N} \sum^N_{i=1} X_i$$
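As a minimal sketch of this idea, suppose we have observed many returns after visiting some state. The noisy returns and the true value of 5.0 below are made-up assumptions for illustration, not part of any real environment. We simply average the samples to approximate the expected value:

```python
import random

def sample_mean(samples):
    """Approximate the expected value E(X) by the average of N observed samples."""
    return sum(samples) / len(samples)

# Hypothetical setup: returns observed from one state are noisy around a true value of 5.0
rng = random.Random(0)
observed_returns = [rng.gauss(5.0, 2.0) for _ in range(10_000)]

value_estimate = sample_mean(observed_returns)
print(f"estimated value: {value_estimate:.2f}")
```

With more samples the estimate tends to land closer to the true mean, which is all the Monte Carlo approach relies on.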

We do this everywhere in machine learning, starting from just basic linear classification

So it's an important technique to consider

---

<h3>Temporal Difference Learning (TD)</h3>

Next, we are going to review temporal difference learning or TD learning 

Temporal difference learning is important because it gives rise to some of our most popular reinforcement learning algorithms, such as Q-learning

The basic idea behind TD learning is that, instead of taking the Monte Carlo approach where we gather a bunch of samples and take the mean, we use a bootstrapped estimate

That probably sounds like a meaningless word, but what it means is that we use our estimates to make other estimates

This leads to all sorts of weird issues, like when we're doing gradient descent, we're not really doing gradient descent because we don't know the true target

Usually when we take the squared error between the target and a prediction, we're given the target

$$Error = (target-prediction)^2$$

But in TD Learning, we use our own estimates to estimate the target, which is kind of an odd approach

Luckily, despite this seemingly bad idea, it actually works and it forms the basis for many of the algorithms we know and love in reinforcement learning, including a few from the following notebooks
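To make "using our estimates to make other estimates" concrete, here is a rough sketch of the standard one-step TD update for a state-value table. The function name `td0_update`, the learning rate `alpha`, the discount `gamma`, and the two-state example are all assumptions chosen for illustration. Notice that the target is built from `V[s_next]`, which is itself only an estimate:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Move V[s] a small step toward a bootstrapped target."""
    # The target uses our own current estimate V[s_next] instead of a true, known value
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V[s]

# Hypothetical value table with two states
V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=0.5, s_next="B")
print(V["A"])
```

Here the target is 0.5 + 0.9 * 1.0 = 1.4, so `V["A"]` moves one tenth of the way from 0 toward 1.4.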

# Math

In this section, we are going to review the concepts of the explore-exploit dilemma

---

<h3>The Explore-Exploit Dilemma</h3>

A good example of this is slot machines

We're going to consider a very simplified slot machine for illustrative purposes

First, we are going to assume that the reward is binary, one or zero 

One means we win and zero means we lose

Second, we're going to assume that the casino we're visiting has a set of slot machines that we have the option to play, but they all have different win rates

For example, with one machine, we might win 20% of the time, but with another machine, we might win 30% of the time

Obviously, we want to pick the machine with the highest win rate

---

<h3>Calculating Win Rate</h3>

Well, here's a question to consider 

How do you know which machine has the highest win rate?

Are we going to ask the manager?

He probably won't tell us :|

The only way to determine the win rate of each slot machine would be to estimate it using data

In fact, this is pretty intuitive

The win rate is just the number of times we won, divided by the total number of times we played

$$win\ rate = \frac{\# \ times \ won}{\# \ times \ played}$$

Incidentally, this is also equal to the sample mean, pretty simple, right?
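Here is a small sketch of estimating a win rate from data. The `play` function is a hypothetical stand-in for a real slot machine, and its true win probability of 0.3 is an assumption that we would not know in practice:

```python
import random

def play(win_prob, rng):
    """Hypothetical slot machine: returns 1 on a win and 0 on a loss."""
    return 1 if rng.random() < win_prob else 0

rng = random.Random(42)
results = [play(0.3, rng) for _ in range(1000)]

# Number of times won divided by number of times played, i.e. the sample mean of the rewards
win_rate = sum(results) / len(results)
print(f"estimated win rate: {win_rate:.3f}")
```

Because the rewards are only ever one or zero, summing them counts the wins, which is why the win rate and the sample mean are the same thing.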

---

So let's say we play 100 times and we win 30 times, then our win rate is 30%

But what if we play 10 times and we win three times?

Well, we might say our win rate is still 30%

And if we play 1000 times and we win 300 times, our win rate is still 30%

But what is different about these three estimates?

---

The answer is, our confidence in each of these estimates is different 

If we play three times and we win once, we might say that's a win rate of 33%, but we wouldn't be so confident about that estimate

If we wrote a scientific paper and we said, yeah, we tested this drug on three people, it works great, nobody would believe us

In statistics, they would say that our sample size is too small

As we may recall from our studies in probability, the only way for an estimate to be exact is for us to collect an infinite number of samples
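One way to put a number on this confidence is the standard error of a proportion, $\sqrt{\hat{p}(1-\hat{p})/N}$. This formula is not derived in this notebook, so treat the sketch below as an illustration that assumes the usual normal approximation. It compares the three estimates of 30% from earlier:

```python
import math

def standard_error(wins, plays):
    """Standard error of an estimated win rate (normal approximation)."""
    p_hat = wins / plays
    return math.sqrt(p_hat * (1 - p_hat) / plays)

for wins, plays in [(3, 10), (30, 100), (300, 1000)]:
    print(plays, round(standard_error(wins, plays), 3))
```

All three estimates are 30%, but the uncertainty shrinks as the number of plays grows, and it only reaches zero in the limit of infinitely many samples.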

Now, obviously, that's impossible

Why?

Well, first of all, we don't have an infinite amount of time

But secondly, as we may know, if we go to a casino and we play a slot machine, it costs money and surely we do not have an infinite amount of money

Oddly, a common mistake students make is to assume that collecting data is free

Collecting data costs us time and resources

Therefore, for all practical purposes, we cannot collect an infinite number of samples or even a really large number of samples

The question then becomes how many samples is enough samples?

---

<h3>Multiple Slot Machines!</h3>

There is another complication to this problem, remember that there are multiple slot machines

We want to play the best one, but in order to figure out which one is best, we need to collect enough data from each machine

However, we must remember that only one of these machines is the best machine

So while we are off collecting data from all the other machines, we are wasting money by not playing the best machine

We might have had the idea earlier that, yeah, well, we'll just collect as many samples as we can, but even this is not a good idea because the more samples we collect, the more time and resources we've spent playing suboptimal slot machines

In other words, we are losing lots of money

Collecting as many samples as we can is not a good strategy
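To see roughly how much this costs, here is a small worked sketch using the 20% and 30% win rates from the earlier example. The function name `expected_wins_lost` is made up for illustration, and it assumes we play every machine the same number of times:

```python
def expected_wins_lost(plays_per_machine, win_rates):
    """Expected wins given up by playing each machine equally instead of only the best one."""
    best = max(win_rates)
    return sum(plays_per_machine * (best - p) for p in win_rates)

win_rates = [0.2, 0.3]
for n in (10, 100, 1000):
    print(n, round(expected_wins_lost(n, win_rates), 2))
```

Every play on the 20% machine gives up 0.1 expected wins compared to the 30% machine, so the loss grows in direct proportion to how long we keep exploring.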

---

<h3>The Dilemma</h3>

By now, hopefully we realize what the dilemma is 

On one hand, we need to play all the machines to collect a sufficient amount of data in order to calculate win rates that we are confident about

On the other hand, we would like to play only the machine with the highest win rate so that we can maximize our winnings

In other words, these two criteria are fundamentally at odds with one another

By trying to collect more data, we play suboptimal machines

By trying to only play the best machine, we forgo collecting data from the other machines

But moreover, we don't know which machine is the best unless we've collected enough data

---

<h3>Common Mistake!</h3>

Now, one common mistake students make is that they think this is some ridiculous abstract concept and that there's no reason they should learn about it

In reality, it's one of the most real world applicable concepts in machine learning

How many people these days own a website?

The answer is probably everyone: well, not every person in the entire world, but certainly businesses that are trying to make money

In other words, our profit is directly tied to the success of our website

If people find our content interesting, they will go to our website, click on our pages, sign up, buy things and so forth

If our website is bad, then people might leave or ignore it, even if they would have loved our product

In other words, it's very important to make sure we have a decent website in order to attract customers

---

<h3>Practical Application</h3>

Well, how do we do this and what does it have to do with slot machines?

Let's say we just created a new design for our website

We would like to know, is our new design better or is our old design better?

Well, how do we define better?

Well, one typical way of defining better is how well each page converts

We might consider a conversion to be a user signing up for an account let's say

So now we can speak about this in concrete terms

In this scenario, the win rate is the rate at which users sign up for our website

So if for every 100 users that visit our site, we get 30 sign ups, then our win rate is 30%

In online business, we usually call this a conversion rate rather than a win rate, win rate being the more generic term

---

So now our problem becomes, we have two designs, design A and design B, we would like to choose the design that has the highest conversion rate

Of course, we don't know the conversion rate unless we collect data, which means we have to show design A to a number of users and design B to a number of users

We need to show these designs to a sufficiently high number of users so that we have enough data to be confident about the conversion rates we've calculated

On the other hand, we don't want this number to be too high because that means we'll be showing the suboptimal design unnecessarily

So, again, we have a dilemma between exploration and exploitation 

Exploration is collecting data and exploitation is showing the highest converting design

---

<h3>Reinforcement Learning</h3>

Now, those are simple examples

So how does this apply to reinforcement learning?

Well in reinforcement learning, we will find ourselves in some state and in that state we must choose which action to perform

<img src="extras/65.2.PNG" width="200">

How will we choose what actions to perform?

The answer is we choose the action that yields the highest future reward

But how do we know which action yields the highest future reward?

Of course, we would have had to have played this game before

Let's say we've played this game many times and we found ourselves in this state many times

Well, then we know based on the data we've collected, what the future rewards should be

This data collection is what we're calling exploration, because we've actually explored this grid and found out what all the future rewards are

As usual, we don't want to collect too much data, because that would mean we spent too much time doing something suboptimal and ended up with less reward, but we also don't want to collect too little data, because then our estimates won't be accurate enough for us to be confident about our decision

---

Now we're skipping ahead a little bit, but remember, this is review, so we should be familiar with all this stuff anyway

So the next question we want to consider is, how does this play out in an actual reinforcement learning algorithm?

Let's say we have an agent which is trying to learn how to navigate its environment

As we know, we're going to keep track of $Q$ values in something called a $Q$-table

We write this as $Q(s,a)$

It's a table because the row is specified by the state $s$ and the column is specified by the action $a$

The meaning of this value is that it's the expected sum of future rewards, given that we were in state $s$ and took the action $a$ from that state

The question is, how do we know the value of $Q(s,a)$?

Well, we're not just given these values, we have to measure them

---

<h3>Exploration</h3>

And how do we measure $Q(s,a)$? 

As before, the solution is to collect data

That means we must enter state $s$ and perform action $a$ 

Then we keep track of the sum of rewards we get after doing that action

Of course, we have to do this many times in order to get an accurate estimate

And also we have to do this for every state and every action while we're in every state

$$Q(s,a) = AvgReward(s,a), \forall s \in S, \forall a \in A$$
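A minimal sketch of this bookkeeping might look like the following. The state and action names, and the helper `record_return`, are hypothetical. The value `G` stands for the sum of rewards we observed after taking action `a` in state `s`:

```python
from collections import defaultdict

returns_sum = defaultdict(float)  # total of all observed returns per (state, action)
visit_count = defaultdict(int)    # how many times each (state, action) was tried
Q = {}

def record_return(s, a, G):
    """Update Q(s, a) to be the average of every return observed for that pair."""
    returns_sum[(s, a)] += G
    visit_count[(s, a)] += 1
    Q[(s, a)] = returns_sum[(s, a)] / visit_count[(s, a)]

# Hypothetical data: three observed returns
record_return("s0", "left", 1.0)
record_return("s0", "left", 3.0)
record_return("s0", "right", 0.0)
print(Q)
```

Pairs we have never tried have no entry at all, which is exactly why we need to visit every state and action to fill in the table.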

So that's the exploration side of things

We must collect data in order to have an accurate $Q$-table


---

<h3>Exploitation</h3>

But remember, that's not all a reinforcement learning agent has to do

The goal of the reinforcement learning agent is to maximize its reward

So it's always trying to take the best action whenever it arrives in some state $s$

The way we choose an action given a $Q$-table is to take the $\arg \max$ over all possible actions $a$

$$a^* = \arg \max_a Q(s,a)$$
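As a quick sketch, assume the table is stored as a list of rows, one row per state and one column per action. The numbers below are invented for illustration:

```python
# Hypothetical Q-table: 2 states (rows) by 3 actions (columns)
Q = [
    [0.1, 0.5, 0.2],
    [0.7, 0.3, 0.0],
]

def greedy_action(Q, s):
    """Return the index of the action with the largest Q value in state s."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

print(greedy_action(Q, 0), greedy_action(Q, 1))
```

The choice is only as good as the numbers in the row, so a poorly estimated entry can make the wrong action look best.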

But therein lies the problem

In order to really choose the best action, we must have accurate estimates of $Q$ for all the actions we want to consider

The problem is, if we've collected lots of data for all those actions, then we must have been acting suboptimally for some time

So we again have this dilemma between exploration and exploitation

We want to exploit by taking what we believe to be the best action, but we also want to explore so we can collect more data and be more certain that the action we choose is actually the best

---

<h3>How is this Dilemma Addressed?</h3>

The next question is, how do we address this dilemma in actual reinforcement learning algorithms?

The first method is very simple and it's called Epsilon-greedy

As a side note, what do we mean by greedy?

A greedy algorithm is one that is purposefully shortsighted

In our case, we choose the $\arg \max$ over all actions from $Q(s,a)$

Greedy in this context means we only look at the $Q$ value for the current state

We don't look at future states

We just ask ourselves what's the best thing to do right now using only the information we have right now

So then Epsilon-greedy says instead of taking the greedy action all the time, take a random action a fraction epsilon of the time

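```python
import random

def epsilon_greedy(Q, s, epsilon):
    # Q is assumed to be a list of rows, one row of action values per state
    n_actions = len(Q[s])
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: select a random action
    return max(range(n_actions), key=lambda a: Q[s][a])  # exploit: argmax of Q(s, :)
```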

Epsilon is usually set to a small value like 10%, although another thing we can do is start with a very high epsilon, like 100%, and slowly decrease it as the agent's estimates become more and more accurate
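One possible way to decrease epsilon over time is an exponential decay with a floor. The schedule itself and the constants `start`, `end`, and `decay` below are assumptions for illustration; other schedules are just as valid:

```python
def decayed_epsilon(step, start=1.0, end=0.1, decay=0.99):
    """Start fully random, shrink epsilon each step, and never go below `end`."""
    return max(end, start * decay ** step)

for step in (0, 100, 1000):
    print(step, round(decayed_epsilon(step), 3))
```

Keeping a floor above zero means the agent never stops exploring entirely, which helps if its early estimates turned out to be wrong.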

---

<h3>Option 2 : Probabilistic Policies</h3>

The second way we can handle exploration is by using probabilistic policies

So we can imagine for instance, that we have an input state $s$ and a weight matrix $W$, where

$$size(W) = \vert S \vert \times \vert A \vert$$

To get a probability distribution over all actions, we create a logistic regression model where we dot the weights with the state and take the softmax 

$$\pi(a \vert s) = softmax(W^Ts)$$

By having a probabilistic policy, we ensure that we are always taking random actions because we can sample from this distribution as opposed to taking the $\arg \max$
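Here is a small sketch of such a policy in plain Python. It assumes the state $s$ is a one-hot vector of length $\vert S \vert$, which matches the shape of $W$ above but is otherwise our assumption, and the weights are made-up numbers rather than learned ones:

```python
import math
import random

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(W, s):
    """pi(a | s) = softmax(W^T s), with W of shape |S| x |A| and s of length |S|."""
    n_actions = len(W[0])
    logits = [sum(W[i][a] * s[i] for i in range(len(s))) for a in range(n_actions)]
    return softmax(logits)

W = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]  # hypothetical weights: 3 states, 2 actions
s = [0, 1, 0]                             # one-hot encoding of the second state

probs = policy(W, s)
# Sample an action from the distribution instead of taking the argmax
action = random.choices(range(len(probs)), weights=probs)[0]
print(probs, action)
```

The action with the larger score is chosen most of the time, but the other action still gets sampled occasionally, so exploration is built into the policy.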

There are a few reasons why this might be advantageous compared to epsilon-greedy

First is that with Epsilon-greedy, we have to set epsilon ourselves

We don't know if that value of Epsilon is good or bad until we've tried it

With probabilistic policies, our policy is learned automatically through experience

The second reason is that our environment can be stochastic

In other words, it could be the case that acting probabilistically actually is the best policy and it's not actually possible to assign a single best action to a given state

# Math