# 1. Approximation Methods
We are now going to look at approximation methods. Recall that in the last section we discussed a major disadvantage shared by all of the methods we have studied so far: they require us to estimate the value function for each state, and in the case of the action-value function, for each state-action pair. We learned early on that the state space can grow very large, very quickly. This makes _all_ of the tabular methods we have studied impractical for large problems.

> * $V$: need to estimate $|S|$ values
> * $Q$: need to estimate $|S| \times |A|$ values
> * $|S|$ and $|A|$ can be very large
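
To make these counts concrete, here is a minimal sketch that tallies how many entries a tabular $V$ and $Q$ would need. The helper `table_sizes` and both example environments are our own illustrative assumptions, and the tic-tac-toe figure is only a loose upper bound (many of those boards are unreachable).

```python
def table_sizes(n_states, n_actions):
    """Return how many entries tabular V and Q would need."""
    return {"V": n_states, "Q": n_states * n_actions}

# Hypothetical example: a 10x10 grid world with 4 moves per cell.
small = table_sizes(n_states=10 * 10, n_actions=4)

# Upper bound for tic-tac-toe: each of the 9 cells is empty, X, or O,
# and there are at most 9 cells to play in.
ttt = table_sizes(n_states=3 ** 9, n_actions=9)

print(small)  # {'V': 100, 'Q': 400}
print(ttt)    # {'V': 19683, 'Q': 177147}
```

Even a 3x3 board produces tens of thousands of table entries, so anything with a richer state description quickly outgrows a lookup table.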

The solution to this is _**approximation**_. 

## 1.1 Approximation Theory
Recall from our earlier work on deep learning that neural networks are universal function approximators. That means that, given the right architecture, a neural network can approximate any continuous function on a bounded domain to an arbitrary degree of accuracy. In practice, they do not perform perfectly, but they do perform very well.

Mathematically, the first step is feature extraction: from the state $s$ we extract a feature vector $x$:

$$x = \varphi (s)$$

Our goal is then to find a _function_ $\hat{f}$ that takes a feature vector $x$ and a set of parameters $\theta$, and faithfully approximates the value function $V(s)$:

$$\hat{f}(x, \theta) \approx V(s)$$
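
The two steps above can be sketched in a few lines. This is only an illustration under our own assumptions: the state is taken to be a grid position `(row, col)`, the particular features in `phi` are hypothetical, and `f_hat` is chosen to be a dot product purely for simplicity.

```python
import numpy as np

def phi(s):
    """Hypothetical feature map for a grid state s = (row, col)."""
    row, col = s
    # Raw coordinates, an interaction term, and a constant bias feature.
    return np.array([row, col, row * col, 1.0], dtype=float)

def f_hat(x, theta):
    """Parametric approximator; here simply a dot product of theta and x."""
    return float(np.dot(theta, x))

theta = np.zeros(4)           # parameters start at zero
x = phi((2, 3))               # feature vector for one state
v_estimate = f_hat(x, theta)  # stays 0.0 until theta is trained
```

Note that the number of parameters depends on the length of the feature vector, not on the number of states.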

## 1.2 Linear Approximation
In this section, we are going to focus specifically on linear methods. We will see that function approximation methods require us to use models that are _differentiable_, so we won't be able to use something like a decision tree or k-nearest neighbors. In the next set of notebooks (RL with deep learning) we will look at deep learning methods, which are also differentiable. Unlike with linear models, we won't need to do feature engineering beforehand, although we could. Models like convolutional neural networks will allow us to use raw pixel data as the state, and the neural network will do its own automatic feature extraction and selection. However, those models are harder to implement and distract from the fundamentals of RL, so we will hold off on them for now. For now, all we need to know is linear regression and gradient descent.
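
As a refresher on those two prerequisites, here is a minimal sketch of stochastic gradient descent on a linear model with a squared-error loss. The data is synthetic and generated from a known `true_theta` purely for illustration, and the step size `alpha=0.05` is an arbitrary choice, not a recommendation.

```python
import numpy as np

def sgd_step(theta, x, target, alpha):
    """One stochastic gradient descent step on 0.5 * (target - theta.x)^2."""
    error = target - np.dot(theta, x)
    # The gradient of the prediction theta.x with respect to theta is x.
    return theta + alpha * error * x

# Toy data generated from a known linear function (illustrative only).
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_theta

theta = np.zeros(3)
for epoch in range(50):
    for x, target in zip(X, y):
        theta = sgd_step(theta, x, target, alpha=0.05)
```

Because the model is differentiable in `theta`, the update only needs the prediction error and the feature vector. This is the property that a decision tree or k-nearest neighbors would not give us.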

## 1.3 Section 8 Outline
We are going to proceed with the following outline for this section:

> * We are first going to apply approximation methods to Monte Carlo prediction. That means we will be estimating the value function given a fixed policy. But instead of representing the value function as a dictionary indexed by state, we will use a linear function approximator. Recall that MC methods require us to play entire episodes and calculate the returns before doing any updates. So next we will...
> * Apply approximation methods to `TD(0)` prediction. Remember, `TD(0)` takes aspects of both MC sampling and the bootstrap method of DP.
> * After working on the prediction problem, we will move to the control problem, and we will use SARSA for this. But we will of course be replacing $Q$ with a linear function approximator.
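
As a preview, the three algorithms in this outline differ mainly in the target they move the estimate toward. The sketch below is a simplified illustration under our own assumptions (function names, argument order, and the `done` flag are hypothetical), not the full implementations developed later.

```python
import numpy as np

def v_hat(theta, x):
    """Linear estimate: dot product of parameters and features."""
    return float(np.dot(theta, x))

def mc_update(theta, x, G, alpha):
    """Monte Carlo: the target is the sampled return G, known only once the episode ends."""
    return theta + alpha * (G - v_hat(theta, x)) * x

def td0_update(theta, x, r, x_next, done, alpha, gamma):
    """TD(0): the target bootstraps from the current estimate of the next state."""
    target = r if done else r + gamma * v_hat(theta, x_next)
    return theta + alpha * (target - v_hat(theta, x)) * x

def sarsa_update(theta, x_sa, r, x_sa_next, done, alpha, gamma):
    """SARSA: same form as TD(0), but the features encode (state, action) pairs."""
    target = r if done else r + gamma * v_hat(theta, x_sa_next)
    return theta + alpha * (target - v_hat(theta, x_sa)) * x_sa

# Tiny usage example with two hand-picked feature vectors.
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta_mc = mc_update(np.zeros(2), x, G=1.0, alpha=0.1)
theta_td = td0_update(np.array([0.0, 1.0]), x, r=0.0, x_next=x_next,
                      done=False, alpha=0.1, gamma=0.9)
```

In the TD(0) and SARSA updates the target itself depends on `theta`, but it is treated as a constant when taking the gradient. Updates of this kind are usually called semi-gradient methods.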

## 1.4 Sanity Checking
One thing to keep in mind in this section is that we can always sanity check our results by comparing them to the non-approximated version. We expect our approximation to be close, but not perfect. One obstacle we may encounter is that our algorithm is implemented correctly, but our model is poor. Remember, linear models are _not_ very expressive. So, if we extract a poor set of features, the model won't be able to learn the value function well. In other words, the model will have a large error. To avoid this, we need to think proactively about which features are good for mapping states to values. We will need to put manual work into feature engineering in order to improve our results.
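
One way to run such a check is to compute the error between the tabular values and the approximated ones over every state. The sketch below assumes the tabular result is a dictionary `V_tabular` indexed by state and that the approximation is linear in `phi(s)`. The three-state chain and its values are made up for illustration.

```python
import numpy as np

def compare_to_tabular(V_tabular, phi, theta):
    """Return (rmse, worst_state) between a tabular V and a linear approximation."""
    states = list(V_tabular)
    errors = np.array([np.dot(theta, phi(s)) - V_tabular[s] for s in states])
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    worst_state = states[int(np.argmax(np.abs(errors)))]
    return rmse, worst_state

# Hypothetical values for a 3-state chain, where V happens to be linear in s.
V_tabular = {0: 0.0, 1: 0.5, 2: 1.0}
phi = lambda s: np.array([s, 1.0])

rmse_good, _ = compare_to_tabular(V_tabular, phi, theta=np.array([0.5, 0.0]))
rmse_bad, worst = compare_to_tabular(V_tabular, phi, theta=np.array([0.0, 0.0]))
```

A large error that does not shrink with more training is a hint that the features, rather than the algorithm, are the problem. Looking at the worst state can suggest which feature is missing.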