The content of the $\text{Return of the Multi-Armed Bandit}$ sections was covered before in $\text{notebook 53}$

So we skip to the $\text{High Level Overview of Reinforcement Learning}$

<h1>Math</h1>

In this section, we are going to answer the question, what is reinforcement learning?

How is it different from supervised and unsupervised learning? 

What are its applications?


---

<h3>What is Reinforcement Learning?</h3>

The first thing we'll notice is how different reinforcement learning is from supervised and unsupervised learning

If we were to show graphically how close these paradigms are to each other, we would see that supervised and unsupervised learning aren't that different

<img src='extras/53.1.PNG' width='800'></img>

One example of supervised learning is spam detection

When an email arrives in our inbox, our e-mail application tries to classify whether it's spam or not spam

Another example is image classification: given an image, we might want to determine what kind of object is in the image

For example, a car, a truck, a traffic light, a pedestrian, a bicycle, and so forth

<img src='extras/53.2.PNG' width='500'></img>

We can imagine how that might be useful for a self-driving car

How about unsupervised learning? One example is clustering genetic sequences so that we can determine the ancestry of different families or different types of animals

Another example is topic modeling: given a set of documents, we can determine which documents discuss the same or similar topics

With the amount of data on the Internet growing every day, we can imagine that hand labeling everything would be an infeasible task 

Unsupervised learning is very useful in this case

We've drawn supervised and unsupervised learning close together on the left, whereas reinforcement learning is way out to the right, to give us some idea of how different these paradigms are

Some examples of reinforcement learning are playing strategy games that humans play, such as tic-tac-toe, Go, and chess

Another example is playing video games such as StarCraft, Super Mario, and Doom

So already we can see that reinforcement learning does things that sound a lot like what humans can do, and these tasks can be very dynamic

By contrast, supervised and unsupervised learning sound more like simple, static tasks that never change

---

<h3>Supervised / Unsupervised Interfaces</h3>

With supervised and unsupervised learning, we always imagine the same interface, which we've modeled around scikit-learn

For a supervised learning interface, we usually have the functions ```fit(X,Y)``` which takes in the input samples ```X``` and the targets ```Y``` and ```predict(X)``` which takes in input samples ```X``` and tries to accurately predict ```Y``` 

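```python
class SupervisedModel:
    def fit(self, X, Y):
        # Learn from input samples X and targets Y
        ...

    def predict(self, X):
        # Return predicted targets for input samples X
        ...
```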
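
To make the interface concrete, here is a minimal sketch of a toy model that follows the same `fit` / `predict` shape. The class name `NearestNeighborClassifier` and the tiny dataset are made up for illustration, and the nearest-neighbour rule is just one simple choice, not how any particular library implements it.

```python
class NearestNeighborClassifier:
    """Toy supervised model: predicts the target of the closest training sample."""

    def fit(self, X, Y):
        # "Training" here just memorizes the samples and their targets.
        self.X = [list(row) for row in X]
        self.Y = list(Y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            # Squared Euclidean distance from this row to every training sample.
            distances = [
                sum((a - b) ** 2 for a, b in zip(row, train_row))
                for train_row in self.X
            ]
            nearest = distances.index(min(distances))
            predictions.append(self.Y[nearest])
        return predictions


# N = 4 samples, D = 2 features, with one target per sample (hypothetical data).
X_train = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
Y_train = ["not spam", "not spam", "spam", "spam"]

model = NearestNeighborClassifier().fit(X_train, Y_train)
predictions = model.predict([[0.2, 0.1], [4.8, 5.1]])
print(predictions)
```
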
For an unsupervised learning interface, we usually just have a ```fit``` function which only takes in some input samples ```X``` 

Remember that there are no targets in unsupervised learning

Sometimes we have a ```transform``` function which takes in some input samples ```X``` and turns it into a different representation that we call ```Z```

Some examples of that might be to return a mapping to some vector or a cluster identity

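```python
class UnsupervisedModel:
    def fit(self, X):
        # Learn structure from input samples X (there are no targets)
        ...

    def transform(self, X):
        # Map input samples X to a new representation Z
        ...
```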
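
As a rough illustration of the unsupervised interface, the sketch below implements a tiny clustering model whose `transform` returns a cluster identity for each sample. The class name `TwoMeansClustering`, the fixed choice of two clusters, and the deterministic starting centroids are all assumptions made to keep the example short.

```python
class TwoMeansClustering:
    """Toy unsupervised model: k-means with k = 2 and a deterministic start."""

    def fit(self, X, n_iterations=10):
        # Assumption: use the first and last samples as the starting centroids.
        self.centroids = [list(X[0]), list(X[-1])]
        for _ in range(n_iterations):
            labels = self.transform(X)
            for k in range(2):
                members = [row for row, label in zip(X, labels) if label == k]
                if members:  # keep the old centroid if a cluster is empty
                    self.centroids[k] = [
                        sum(column) / len(members) for column in zip(*members)
                    ]
        return self

    def transform(self, X):
        # Z holds the index of the closest centroid for every sample.
        Z = []
        for row in X:
            distances = [
                sum((a - b) ** 2 for a, b in zip(row, centroid))
                for centroid in self.centroids
            ]
            Z.append(distances.index(min(distances)))
        return Z


# Only inputs X are given; there are no targets Y (hypothetical data).
X = [[0.0, 0.1], [0.2, 0.0], [4.9, 5.0], [5.1, 5.2]]
Z = TwoMeansClustering().fit(X).transform(X)
print(Z)
```
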

The main point of this is, supervised and unsupervised learning are actually so similar that it makes sense to put them in the same library in the first place, and it makes sense for their APIs to take on this very simple and neat format

The common theme is that the interface to both of these is training data

We take in some training data either $X$ and $Y$ or just $X$ and we call a ```fit``` function

In the case of supervised learning we can then make predictions on future data 

But in both these cases, our data $X$ and our targets $Y$ are very simple 

$X$ is just an $N \times D$ matrix of input data and $Y$ is just an $N$ length vector of targets

This is why we say all data is the same

This generic format doesn't change whether we're doing biology, finance, economics or any other subject

Data is just data: a table of numbers

We can fit most of our algorithms in one neat library called scikit-learn

While it might seem that we're trying to make supervised and unsupervised learning seem very simplistic, these methods can actually be quite useful

Using these algorithms we can do things like face detection so that we can unlock our phone and speech recognition so that we can talk to our phone

---

<h3>Reinforcement Learning</h3>

But reinforcement learning is different

Reinforcement learning can guide an agent on how to act in the world

So the interface to a reinforcement learning agent is much broader than just data: it's the entire environment

That environment can be the real world, or it can be a simulated world like a video game

As an example we could create a reinforcement learning agent to vacuum our house

Then it would be interacting with the real world

We could also create a reinforcement learning agent to learn how to walk

That would also be interacting with the real world

We can be sure that the military is interested in such technologies

They want reinforcement learning agents that can replace soldiers: agents that can not only walk, but also fight, defuse bombs, and make important decisions while out on a mission

So we can see now why reinforcement learning is such a big leap from basic supervised and unsupervised learning

The interface isn't just tables of data; it could potentially be the entire world

Our agent is going to have sensors: some cameras, some microphones, an accelerometer, a GPS, and so forth

A continuous stream of data is coming in, and the agent is constantly reading this data to make a decision about what to do in that moment

It has to take into account both past and future

It doesn't just statically classify or label things

In other words a reinforcement learning agent is a thing that has a lifetime and in each step of its lifetime it has to make a decision about what to do 

A static supervised or unsupervised model is not like that

It has no concept of time

We give it an input and it produces a corresponding output
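
The contrast can be sketched as a loop: at every step of its lifetime the agent observes a state, chooses an action, and the environment responds. Everything below is a hypothetical toy, not a real library API. The `Corridor` environment, its `reset` / `step` methods, and the fixed-rule `AlwaysRightAgent` are assumptions chosen only to show the shape of the interaction.

```python
class Corridor:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (move left) or +1 (move right); walls clamp the position.
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done


class AlwaysRightAgent:
    """Placeholder agent with a fixed rule; a learning agent would adapt it."""

    def act(self, state):
        return +1


env = Corridor()
agent = AlwaysRightAgent()

state = env.reset()
total_reward = 0.0
steps = 0
done = False
while not done:
    action = agent.act(state)               # decide based on the current state
    state, reward, done = env.step(action)  # the environment responds
    total_reward += reward
    steps += 1

print(steps, total_reward)
```

Unlike a single call to `predict`, the loop has a notion of time: each decision changes the state the agent sees next.
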

---

<h3>Isn't it still supervised learning?</h3>

Now some of us, if we are creative, might think that supervised algorithms should still be able to solve reinforcement learning tasks

For example if $X$ represents the state we're in, then $Y$ the target should just be the correct action to take for that state

So whether we are driving a car or playing a video game or playing chess we will always do the right thing

Here's the problem with that: a game like Go has $8 \times 10^{100}$ possible board positions

In case it isn't obvious, that is an infeasible amount of input data

For comparison, ImageNet, our largest image benchmark has about $10^6$ samples 

So the number of samples for Go would be $94$ orders of magnitude larger than ImageNet, which can already take about one day to train if we have state-of-the-art hardware

To give ourselves some idea, one order of magnitude larger would take 10 days to train, and two orders of magnitude larger would take 100 days to train

So now imagine 94 orders of magnitude larger 
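
To get a feel for the scale, the arithmetic can be checked in a few lines. This is only a back-of-the-envelope sketch: it takes the figures quoted above at face value and assumes training time grows linearly with the number of samples, which is a simplification.

```python
import math

go_positions = 8e100    # board positions, as quoted above
imagenet_samples = 1e6  # approximate ImageNet size, as quoted above

ratio = go_positions / imagenet_samples
orders_of_magnitude = math.floor(math.log10(ratio))

# Assumption: one day for an ImageNet-sized dataset, scaling linearly with size.
training_days = 10 ** orders_of_magnitude
training_years = training_days // 365

print(orders_of_magnitude)
print(training_years)
```
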

Also keep in mind that there may not be such a thing as a correct action to take at all times

We don't want our AI to play the same way every single time; we want to allow for creativity and stochastic behavior

A supervised model, even if it were feasible to train, would only have one target per input

So it would never be able to do human-like things such as generating poetry

<h1>Math</h1>

In this section, we're going to discuss how to transition from the simple bandit problem we looked at earlier to the full reinforcement learning problem, to which the rest of the notebooks are devoted

---

<h3>From Bandits to RL</h3>

Let's start by looking at the bandit problem again but perhaps from a slightly different perspective with different terminology

The basic analogy in the bandit problem is that we are playing a bunch of slot machines and we're trying to decide which arm to pull

---

<h3>Actions</h3>

At this point we are going to make things even more abstract

Instead of relying on an analogy which will never actually be used in real life we're going to start using terminology that is more like what we use in full reinforcement learning 

The first concept we're going to introduce is that of an action

Previously we referred to this as pulling the Bandit arm

Now we simply call it an action 

An action is something that we do or something that the agent does

If we have three advertisements we can possibly show to a user then that means there are three possible actions

Action one means show advertisement one, action two means show advertisement two, and action three means show advertisement three

Note that actions can also be continuous, although that will be discussed later

For example, a continuous action could be how much torque to apply to a motor or how many degrees to rotate a steering wheel
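
The three-advertisement case can be sketched as a small simulation with discrete actions. The click-through rates below are invented, and epsilon-greedy is used only as one simple way to pick actions; it is an assumption of this sketch, not something prescribed by the text above.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical click-through rates, unknown to the agent.
TRUE_CTR = [0.1, 0.3, 0.6]
ACTIONS = [0, 1, 2]  # action k means "show advertisement k + 1"
EPSILON = 0.1        # probability of exploring a random action

estimates = [0.0, 0.0, 0.0]  # running mean reward per action
counts = [0, 0, 0]

for _ in range(5000):
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)            # explore
    else:
        action = estimates.index(max(estimates))   # exploit the best estimate
    reward = 1.0 if random.random() < TRUE_CTR[action] else 0.0
    counts[action] += 1
    # Incremental update of the running mean for the chosen action.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = estimates.index(max(estimates))
print(best_action, [round(e, 2) for e in estimates])
```
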

---

<h3>Expanding our Example (online Ads)</h3>

Since online advertising and running A/B tests on websites are actual real-world examples, let's continue using those

Recognize that in our previous formulation all users are treated equally

We don't differentiate between a senior who rarely uses a computer and a teenager who uses multiple computers for most of the day

Of course that's unrealistic

Most probably an older person would be more interested in an advertisement for home insurance than a teenager who doesn't have any concept of owning a home

Similarly, it should be obvious that certain demographics would be more interested in a celebrity's new makeup line than, say, online courses about machine learning

Clearly this information can't be captured in a single estimate of the click-through rate for each advertisement

So what's the solution?

---

<h3>Contextual Bandits</h3>

The solution is contextual bandits, although we should be more concerned with the technical details rather than what we name it

Basically we are going to introduce the concept of context

Context is exactly what it sounds like

We include contextual information in our decision making process 

For example based on someone's previous search history, we might be able to guess that they are a male teenager and not a soccer mom 

Although obviously we would never encode such labels exactly

They are inherent in some latent representation of past observations

In some cases we might have true and explicit information where perhaps the user entered that information themselves

For example they might enter their birthday or their favorite hobby

In addition, things like the time of day and the day of the week might matter, and so might our location in the world

If we've ever worked in the online advertising industry then we are very familiar with having to deal with these input features
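
One simple way to picture this is to keep a separate estimate for every combination of context and action, instead of one estimate per action. The sketch below does that with two made-up contexts and two made-up advertisements. The click-through rates are invented, and epsilon-greedy is again just an assumed selection rule.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical click-through rates per (context, advertisement), unknown to the agent.
TRUE_CTR = {
    "teenager": [0.05, 0.50],  # [home insurance ad, online course ad]
    "senior": [0.40, 0.10],
}
N_ACTIONS = 2
EPSILON = 0.1

# One running-mean estimate per context and action.
estimates = {context: [0.0] * N_ACTIONS for context in TRUE_CTR}
counts = {context: [0] * N_ACTIONS for context in TRUE_CTR}

for _ in range(10000):
    context = random.choice(list(TRUE_CTR))  # a new visitor arrives
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)                          # explore
    else:
        action = estimates[context].index(max(estimates[context]))    # exploit
    reward = 1.0 if random.random() < TRUE_CTR[context][action] else 0.0
    counts[context][action] += 1
    estimates[context][action] += (
        reward - estimates[context][action]
    ) / counts[context][action]

best = {c: estimates[c].index(max(estimates[c])) for c in estimates}
print(best)
```

A lookup table like this only works for a handful of discrete contexts; with many input features, a model over a feature vector is the more natural choice.
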

---

<h3>States</h3>

Luckily we can abstract this idea too

We call this the state

State is also a pretty generic term so it might take some getting used to 

But in general the idea of state is very flexible

As mentioned previously, it can represent attributes of a user and of the world itself, like the time and position

But in more general terms it can represent any measurement made by a computer

For example, we might have a controller reading the temperature and humidity of the room and deciding whether it needs to be warmed or cooled

Those readings of the temperature and humidity make up the state

---

<h3>Using Machine Learning</h3>

As an example of how we might build an algorithm that takes the state into account in order to decide on an action that maximises reward, we can use machine learning

Suppose we represent our state as a feature vector $x$

Now suppose we have some parameters or weights called $w$

Then as we know from our study of linear regression, we can take the dot product of $x$ and $w$ to get a predicted reward $\hat y$

We can compare the predicted reward $\hat y$ to the true reward that we eventually get, called $y$, and then use the error between the two to update our model, in particular the weight vector $w$

$$\large x: \text{ feature representation of state}$$

$$\large w: \text{ model parameters}$$

$$\large \hat y = w^Tx : \text{ expected reward}$$

$$\large y : \text{true reward}$$

Now this is just a rough example

We're actually going to move on to something a bit more complex very soon
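
The update described above can be sketched in a few lines, assuming a squared-error loss and a plain gradient step; the text does not fix these choices. The helper names `predict` and `update`, the learning rate, and the noiseless data generated from a hidden `true_w` are all hypothetical.

```python
def predict(w, x):
    """Predicted reward: the dot product of the weights and the state features."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def update(w, x, y, learning_rate=0.1):
    """One gradient step on the squared error (y - w^T x)^2 / 2."""
    error = y - predict(w, x)
    return [w_i + learning_rate * error * x_i for w_i, x_i in zip(w, x)]


# Hypothetical data: rewards generated by a hidden weight vector, without noise.
true_w = [0.5, -0.2]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0]]

w = [0.0, 0.0]
for _ in range(500):
    for x in states:
        y = predict(true_w, x)  # stands in for the observed true reward
        w = update(w, x, y)

print([round(w_i, 3) for w_i in w])
```
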

---

<h3>What's the difference between these?</h3>

There's a big difference between the two previous examples of state that we just gave 

What is the difference between a state vector representing the attributes of a user, such as their age and gender and so forth, and a state vector representing the temperature and humidity of the room?

Well, the answer is that in one case the sequence of states that we see depends on one another, and in the other case it doesn't

As an example of that, pretend we are a website like The New York Times

Suppose that the user who just visited our website was a 40 year old soccer mom

Does that say anything about who the next visitor to our website might be?

Our next visitor might be a user from Russia who simply wants to read American news

But more importantly, the fact that the previous visitor to our website was a 40 year old soccer mom has nothing at all to do with what articles and advertisements we show to our Russian user

There is simply no relationship between the two

---

<h3>State Sequences</h3>

Now compare that to a situation where the sequence of observations that we measure is highly interconnected

If we're looking at the temperature and humidity of the room there is most certainly some structure in the sequence of observations that will help us determine how to control heating and cooling

Obviously the temperature isn't going to be 70 degrees in one instant and then zero degrees the next

If we think about something like stock prices we clearly need to look at not just the snapshot of the stock price at a single point in time but rather the sequence of stock prices

Obviously we're interested in things like is the stock going up or down

And obviously we don't know whether something is going up or down from just a single price

Another example of this is board games like chess and Go

Obviously the state of the board, such as where each of the pieces is, depends on where the pieces were previously, and has a strong effect on where they will be in the future

So overall we hope we're getting the idea that there are some environments where it's not just the state by itself that matters but rather the sequence of states
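
The stock price example can be sketched by building the state from a short window of recent observations instead of a single snapshot. The prices, the window length, and the helper `trend` are made up for illustration; a sliding window is just one simple way to carry history.

```python
from collections import deque


def trend(prices):
    """Return +1 if the window rose overall, -1 if it fell, and 0 otherwise."""
    change = prices[-1] - prices[0]
    return (change > 0) - (change < 0)


WINDOW = 3                      # hypothetical window length
history = deque(maxlen=WINDOW)  # keeps only the most recent observations

states = []
for price in [100.0, 101.5, 103.0, 102.0, 99.0]:
    history.append(price)
    # The state is the recent sequence plus its direction, not just the latest price.
    states.append((tuple(history), trend(history)))

print(states[-1])
```

With only one price in the window the direction is undefined, which is why the first state reports a trend of 0.
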

This is what will bring us to our next topic, Markov Decision Processes

---

<h3>Summary</h3>

To summarize this lecture, here's what we talked about

First we looked at the multi-armed bandit problem once again and we defined some new terms to help us think about the problem more abstractly

Specifically, the multi-armed bandit problem is a problem of choosing an action to obtain the best reward

Next we discussed the contextual bandit problem where, instead of just having to choose an action, we also have to pay attention to the state, which gives us more fine-grained control over which action to choose

Finally we did some foreshadowing and talked about the situation where instead of just random states which are not related to one another in terms of predicting a reward we can have states which are interdependent

This brings us to our next notebook on Markov Decision Processes

<img src='extras/53.3.PNG' width='700'></img>