### The need for a complete model of the environment

In the methods we have used so far, we have relied on the transition probabilities of the
environment in our policy evaluation, policy iteration, and value iteration algorithms to
obtain optimal policies. This is a luxury that we usually don't have in practice. Either
these probabilities are very difficult to calculate for each possible transition (the transitions
are often impossible to even enumerate), or we simply don't know them. You know what is much
easier to obtain? A sample trajectory of transitions, either from the environment itself
or from its simulation. In fact, simulation is a particularly important component in RL,
as we will discuss separately towards the end of this chapter.
The question then becomes how to use sample trajectories to learn near-optimal policies.
This is exactly what the rest of this chapter covers, with Monte Carlo and
temporal-difference (TD) methods. The concepts you will learn are at the center of many of
the advanced RL algorithms.

## Monte Carlo - model-free method

We can estimate the state values and action values in an MDP from random samples.
Monte Carlo (MC) estimation is a general concept that refers to estimating quantities through repeated random sampling. In the context of RL, it refers to
**a collection of methods that estimate state values and action values using sample trajectories of complete episodes**.
Using random samples is essential because the environment dynamics are often:
- Too complex to deal with
- Not known in the first place

Summing up:
- MC methods learn directly from episodes of experience
- MC is *model-free*: no knowledge of MDP transitions/rewards
- MC learns from *complete episodes*
- MC uses the simplest idea: value = mean return
- Caveat: can only apply MC to episodic MDPs (all episodes must terminate)
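
To make "value = mean return" concrete, here is a minimal sketch that computes the return of each complete episode and averages the results. The reward lists and the discount factor `gamma` are made-up illustration values, and the function names `discounted_return` and `mean_return` are hypothetical helpers, not part of any library.

```python
def discounted_return(rewards, gamma=1.0):
    """Return G_0 = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one complete episode."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mean_return(episodes, gamma=1.0):
    """MC estimate of the start-state value: the average return over episodes."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Illustrative data (assumption): three terminated episodes from the same start state.
episodes = [[0.0, 0.0, 1.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
estimate = mean_return(episodes, gamma=0.5)
print(estimate)
```

Note that the return can only be computed once an episode has terminated, which is why MC is restricted to episodic MDPs.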

### Monte Carlo prediction

We need to be able to evaluate a given policy to be able to improve it. MC prediction suggests simply observing (many) sample trajectories, that is, sequences of state-action-reward tuples, starting in state $s$, to estimate the expectation $v_\pi(s) = E_{\pi}[G_{t}|S_{t}=s]$.
MC policy evaluation uses the *empirical mean* return instead of the *expected* return.
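
As an illustration, the following is a minimal sketch of MC prediction, assuming the first-visit variant (only the first occurrence of a state in each episode contributes a return). The environment is a hypothetical five-state random walk invented for this example: states 0 and 4 are terminal, every episode starts in state 2, the policy moves left or right with equal probability, and reaching state 4 pays a reward of 1. The names `sample_episode` and `first_visit_mc_prediction` are illustrative, not from any library.

```python
import random
from collections import defaultdict


def sample_episode(rng, start=2, n_states=5):
    """Hypothetical environment: random walk on states 0..n_states-1.

    States 0 and n_states-1 are terminal; reaching the right end pays reward 1.
    Returns a list of (state, action, reward) tuples for one complete episode.
    """
    episode = []
    state = start
    while 0 < state < n_states - 1:
        action = rng.choice([-1, 1])  # uniform random policy: left or right
        next_state = state + action
        reward = 1.0 if next_state == n_states - 1 else 0.0
        episode.append((state, action, reward))
        state = next_state
    return episode


def first_visit_mc_prediction(n_episodes=5000, gamma=1.0, seed=0):
    """Estimate v_pi(s) as the empirical mean of first-visit returns."""
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode(rng)
        # Record the time step at which each state is first visited.
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Accumulate returns backwards: G_t = r_{t+1} + gamma * G_{t+1}.
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, _, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:
                returns_sum[s] += g
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


values = first_visit_mc_prediction()
for s in sorted(values):
    print(s, round(values[s], 3))
```

For this particular walk with `gamma=1.0`, the true values of states 1, 2, and 3 are 0.25, 0.5, and 0.75, so the estimates should approach those numbers as the number of episodes grows. No transition probabilities are used anywhere in the estimator; it only consumes sampled episodes.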