## What is reinforcement learning?

Reinforcement learning (RL) is an area of machine learning that focuses on how you, or some other agent, might act in an environment in order to maximize some given reward. Reinforcement learning algorithms study the behavior of agents in such environments and learn to optimize that behavior.


## Table of Contents

- Part 1: Introduction to Reinforcement Learning
    - Section 1: Markov Decision Processes (MDPs)
        - Introduction to MDPs
        - Policies and value functions
        - Learning optimal policies and value functions
    - Section 2: Q-learning
        - Introduction to Q-learning with value iteration
        - Implementing an epsilon greedy strategy
    - Section 3: Code project - Implement Q-learning with pure Python to play a game
        - Environment set up and intro to OpenAI Gym
        - Write Q-learning algorithm and train agent to play game
        - Watch trained agent play game
- Part 2: Deep Reinforcement Learning
    - Section 1: Deep Q-networks (DQNs)
        - Introduction to DQNs
        - Experience replay
    - Section 2: Code project - Implement deep Q-network with PyTorch to play a game
        - Environment set up
        - Create and train DQN to play game
        - Watch trained DQN play game
    - Section 3: Policy gradients
        - More details to come



$$v_{∗}(s)=max_{a∈A(s)}q_{π_{∗}}(s,a)$$
$$=max_{a}E_{π_{∗}}[G_{t}∣S_{t}=s,A_{t}=a]$$
$$=max_{a}E_{π_{∗}}[R_{t+1}+γG_{t+1}∣S_{t}=s,A_{t}=a]$$
$$=max_{a}E_{π_{∗}}[R_{t+1}+γv_{∗}(S_{t+1})∣S_{t}=s,A_{t}=a]$$
$$=max_{a}∑_{s′,r}p(s′,r∣s,a)[r+γv_{∗}(s′)]$$


## 1. Markov Decision Processes (MDPs) - Structuring a Reinforcement Learning Problem

### Components of an MDP 
Markov decision processes give us a way to formalize sequential decision making. This formalization is the basis for structuring problems that are solved with reinforcement learning. 

In an MDP, we have a decision maker, called an *agent*, that interacts with the *environment* it's placed in. These interactions occur sequentially over time. At each time step, the agent will get some representation of the environment’s *state*. Given this representation, the agent selects an *action* to take. The environment is then transitioned into a *new state*, and the agent is given a *reward* as a consequence of the previous action.

Components of an MDP:
- Agent
- Environment
- State
- Action
- Reward

This process of selecting an action from a given state, transitioning to a new state, and receiving a reward happens sequentially over and over again, which creates something called a *trajectory* that shows the sequence of states, actions, and rewards.

Throughout this process, it is the agent’s goal to maximize the total amount of rewards that it receives from taking actions in given states. This means that the agent wants to maximize not just the immediate reward, but the cumulative rewards it receives over time.

**It is the agent’s goal to maximize the cumulative rewards.**


### MDP Notation

In an MDP, we have a set of states $S$, a set of actions $A$, and a set of rewards $R$. We'll assume that each of these sets has a finite number of elements.

At each time step $t=0,1,2,⋯,$ the agent receives some representation of the environment’s state $S_{t}∈S$. Based on this state, the agent selects an action $A_{t}∈A$. This gives us the state-action pair $(S_{t},A_{t})$.

Time is then incremented to the next time step $t+1$, and the environment is transitioned to a new state $S_{t+1}∈S$. At this time, the agent receives a numerical reward $R_{t+1}∈R$ for the action $A_{t}$ taken from state $S_{t}$.

We can think of the process of receiving a reward as an arbitrary function $f$ that maps state-action pairs to rewards. At each time $t$, we have $f(S_{t},A_{t})=R_{t+1}$.

The trajectory representing the sequential process of selecting an action from a state, transitioning to a new state, and receiving a reward can be represented as $S_{0},A_{0},R_{1},S_{1},A_{1},R_{2},S_{2},A_{2},R_{3},⋯$

This diagram nicely illustrates this entire idea. 

![title](../docs/images/DLiz/1_MDP-diagram.png)

Breaking the diagram down into steps:
1. At time $t$, the environment is in state $S_{t}$.
2. The agent observes the current state and selects action $A_{t}$.
3. The environment transitions to state $S_{t+1}$ and grants the agent reward $R_{t+1}$.
4. This process then starts over for the next time step, $t+1$.
    Note, $t+1$ is no longer in the future, but is now the present. When we cross the dotted line on the bottom left, the diagram shows $t+1$ transforming into the current time step $t$ so that $S_{t+1}$ and $R_{t+1}$ are now $S_{t}$ and $R_{t}$.
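
The loop described in the steps above can be sketched in a few lines of Python. This is only an illustration under assumed names: the `step` function, the two-state environment, and the random action choice are all hypothetical and are not the lizard game we set up later.

```python
import random

# Hypothetical toy environment: two states and two actions (assumed for illustration).
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

def step(state, action):
    """Return (next_state, reward) for the toy environment."""
    if action == "move":
        next_state = "s1" if state == "s0" else "s0"
    else:
        next_state = state
    reward = 1 if next_state == "s1" else 0
    return next_state, reward

def run(num_steps, seed=0):
    """Collect the trajectory S0, A0, R1, S1, A1, R2, ... as (S_t, A_t, R_{t+1}) tuples."""
    rng = random.Random(seed)
    state = "s0"
    trajectory = []
    for _ in range(num_steps):
        action = rng.choice(ACTIONS)              # agent selects A_t from S_t
        next_state, reward = step(state, action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state                        # t+1 becomes the current time step
    return trajectory

trajectory = run(5)
```

Each tuple in `trajectory` holds one $(S_{t},A_{t},R_{t+1})$ triple, so reading the list from left to right reproduces the sequence of states, actions, and rewards.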

### Transition probabilities

Since the sets $S$ and $R$ are finite, the random variables $R_{t}$ and $S_{t}$ have well defined probability distributions. In other words, all the possible values that can be assigned to $R_{t}$ and $S_{t}$ have some associated probability. These distributions depend on the preceding state and action that occurred in the previous time step $t−1$.

For example, suppose $s′∈S$ and $r∈R$. Then there is some probability that $S_{t}=s′$ and $R_{t}=r$. This probability is determined by the particular values of the preceding state $s∈S$ and action $a∈A(s)$. Note that $A(s)$ is the set of actions that can be taken from state $s$.

Let’s define this probability.

For all $s′∈S$, $s∈S$, $r∈R$, and $a∈A(s)$, we define the **probability of the transition to state s′ with reward r from taking action a in state s** as $$p(s′,r∣s,a) = \Pr\{S_{t}=s′,R_{t}=r∣S_{t−1}=s,A_{t−1}=a\}$$
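
For finite sets, one simple way to hold $p(s′,r∣s,a)$ in code is a nested dictionary. The sketch below assumes a made-up two-state MDP. The state names, action names, and probabilities are all hypothetical.

```python
# Hypothetical dynamics for a two-state MDP; all probabilities are made up.
# p[(s, a)][(s_next, r)] plays the role of p(s', r | s, a).
p = {
    ("s0", "move"): {("s1", 1): 0.8, ("s0", 0): 0.2},
    ("s0", "stay"): {("s0", 0): 1.0},
    ("s1", "move"): {("s0", 0): 0.9, ("s1", 1): 0.1},
    ("s1", "stay"): {("s1", 1): 1.0},
}

def expected_reward(s, a):
    """Expected immediate reward: sum over (s', r) of p(s', r | s, a) * r."""
    return sum(prob * r for (_, r), prob in p[(s, a)].items())

# For each (s, a), the probabilities over all (s', r) pairs must sum to 1.
totals = {sa: sum(dist.values()) for sa, dist in p.items()}
```

Because $p$ is a probability distribution over $(s′,r)$ pairs for each fixed $(s,a)$, every entry in `totals` should equal 1.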

## 2. Expected Return - What Drives a Reinforcement Learning Agent in an MDP

### Expected return
The goal of an agent in an MDP is to maximize its cumulative rewards. In order to aggregate and formalize these cumulative rewards, we introduce the concept of the expected return of the rewards at a given time step.

For now, we can think of the return simply as the sum of future rewards. Mathematically, we define the return $G$ at time $t$ as $$G_{t}=R_{t+1}+R_{t+2}+R_{t+3}+⋯+R_{T},$$ where $T$ is the final time step.

**It is the agent’s goal to maximize the expected return of rewards.**

This concept of the expected return is super important because it's the agent's objective to maximize the expected return. The expected return is what's driving the agent to make the decisions it makes.


### Episodic vs. continuing tasks
In our definition of the expected return, we introduced $T$, the final time step. When the notion of having a final time step makes sense, the agent-environment interaction naturally breaks up into subsequences, called **episodes**. For example, think about playing a game of pong. Each new round of the game can be thought of as an episode, and the final time step of an episode occurs when a player scores a point.

Each episode ends in a terminal state at time $T$, which is followed by resetting the environment to some standard starting state or to a random sample from a distribution of possible starting states. The next episode then begins independently from how the previous episode ended.

Formally, tasks with episodes are called **episodic tasks**.

There exist other types of tasks, though, where the agent-environment interactions don’t break up naturally into episodes, but instead continue without limit. These types of tasks are called **continuing tasks**.

Continuing tasks make our definition of the return at each time $t$ problematic because our final time step would be $T=∞$, and therefore the return itself could be infinite since we have $$G_{t}=R_{t+1}+R_{t+2}+R_{t+3}+⋯+R_{T}$$

Because of this, we need to refine the way we're working with the return.

### Discounted return

Our revision of the way we think about return will make use of discounting. Rather than the agent’s goal being to maximize the expected return of rewards, it will instead be to maximize the expected discounted return of rewards. Specifically, the agent will be choosing action $A_{t}$ at each time $t$ to maximize the expected discounted return.

**It is the agent’s goal to maximize the expected discounted return of rewards.**

To define the discounted return, we first define the **discount rate**, $γ$, to be a number between 0 and 1. The discount rate is the rate at which we discount future rewards, and it determines the present value of future rewards. With this, we define the discounted return as $$G_{t}=R_{t+1}+γR_{t+2}+γ^{2}R_{t+3}+⋯$$$$=∑_{k=0}^{∞}γ^{k}R_{t+k+1}$$

This definition of the discounted return means that our agent will care more about the immediate reward than about future rewards, since rewards further in the future are more heavily discounted. So, while the agent does consider the rewards it expects to receive in the future, the more immediate rewards have more influence when it comes to the agent making a decision about taking a particular action.
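
To make the discounting concrete, here is a minimal sketch that evaluates the sum for a finite list of rewards. The function name `discounted_return` and the reward values are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k+1}, with rewards = [R_{t+1}, R_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 1, 1]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
g_half = discounted_return(rewards, 0.5)  # 1 + 0.5 + 0.25
g_one = discounted_return(rewards, 1.0)   # no discounting: the plain sum of rewards
```

With $γ=0.5$, the third reward only contributes a quarter of its value, which is exactly the sense in which later rewards count for less.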

Now, check out this relationship below showing how returns at successive time steps are related to each other. We’ll make use of this relationship later. $$G_{t}=R_{t+1}+γR_{t+2}+γ^{2}R_{t+3}+γ^{3}R_{t+4}+⋯$$$$ =R_{t+1}+γ(R_{t+2}+γR_{t+3}+γ^{2}R_{t+4}+⋯)$$$$=R_{t+1}+γG_{t+1}$$
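
The recursive relationship $G_{t}=R_{t+1}+γG_{t+1}$ also suggests a convenient way to compute the return for every time step of a finished episode: work backward from the end. The sketch below assumes a short, made-up reward sequence and a hypothetical helper name.

```python
def returns_backward(rewards, gamma):
    """Compute G_t for every t using G_t = R_{t+1} + gamma * G_{t+1}.

    rewards[t] holds R_{t+1}, the reward received after acting at time t.
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # the return after the final time step is zero
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns

rewards = [-1, -1, 10]  # hypothetical episode: two empty tiles, then a big reward
returns = returns_backward(rewards, 0.5)
```

Here `returns[0]` matches the direct sum $-1+0.5(-1)+0.25(10)=1$, so the backward pass and the original definition agree.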

Also, check this out. Even though the return at time $t$ is a sum of an infinite number of terms, the return is actually finite as long as $γ<1$ and the rewards are bounded, for example when the reward is a nonzero constant.

For example, if the reward at each time step is a constant 1 and $γ<1$, then the return is $$G_{t}=∑_{k=0}^{∞}γ^{k}=1/(1−γ)$$

This infinite sum yields a *finite result*. If you want to understand this concept more deeply, then research infinite series convergence. For our purposes though, you’re free to just trust the fact that this is true, and understand that the infinite sum of discounted rewards is finite *if the conditions we outlined are met*.
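
If you'd rather see the convergence than take it on trust, a quick numerical check is easy to write. The value $γ=0.9$ below is an arbitrary assumption. Any $γ<1$ behaves the same way.

```python
gamma = 0.9  # arbitrary discount rate below 1, chosen for illustration
closed_form = 1 / (1 - gamma)

# Partial sums of 1 + gamma + gamma**2 + ... for a constant reward of 1.
partial_sums = {n: sum(gamma ** k for k in range(n)) for n in (10, 100, 1000)}
```

As the number of terms grows, the partial sums climb toward `closed_form` and never exceed it, which is what a convergent series looks like in practice.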

## 3. Policies and Value Functions - Good Actions for a Reinforcement Learning Agent

### How good is a state or action? 
With all the possible actions that an agent may be able to take in all the possible states of an environment, there are a couple of things that we might be interested in understanding.

First, we'd probably like to know how likely it is for an agent to take any given action from any given state. In other words, what is the probability that an agent will select a specific action from a specific state? This is where the notion of policies comes into play, and we'll expand on this in just a moment.

Secondly, in addition to understanding the probability of selecting an action, we'd probably also like to know how good a given action or a given state is for the agent. In terms of rewards, selecting one action over another in a given state may increase or decrease the agent's rewards, so knowing this in advance will probably help our agent out with deciding which actions to take in which states. This is where value functions become useful, and we'll also expand on this idea in just a bit.

| Question | Addressed by |  
|----- |-- |  
| How probable is it for an agent to select any action from a given state? | Policies |  
| How good is any given action or any given state for an agent? | Value functions |  



### Policies
A policy is a function that maps a given state to probabilities of selecting each possible action from that state. We will use the symbol $π$ to denote a policy.

When speaking about policies, formally we say that an agent “follows a policy.” For example, if an agent follows policy $π$ at time $t$, then $π(a|s)$ is the probability that $A_{t}=a$ if $S_{t}=s$. This means that, at time $t$, under policy $π$, the probability of taking action $a$ in state $s$ is $π(a|s)$.

Note that, for each state $s∈S$, $π$ is a probability distribution over $a∈A(s)$. 
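
As a rough sketch, a policy over finite states and actions can be stored as one probability distribution per state and sampled from directly. The states, actions, and probabilities below are made up for illustration.

```python
import random

# Hypothetical stochastic policy: policy[s][a] plays the role of pi(a | s).
policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng):
    """Draw A_t from the distribution pi(. | S_t = state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [sample_action(policy, "s0", rng) for _ in range(1000)]
right_fraction = samples.count("right") / len(samples)
```

Since $π(\text{right}|s0)=0.9$ in this made-up policy, `right_fraction` should land close to 0.9 over many samples.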

### Value functions
Value functions are functions of *states*, or of *state-action pairs*, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.

This notion of how good a state or state-action pair is is given in terms of **expected return**. Remember, the rewards an agent expects to receive are dependent on what actions the agent takes in given states. So, value functions are defined with respect to specific ways of acting. Since the way an agent acts is influenced by the policy it's following, then we can see that value functions are defined with respect to policies.


#### State-value function
The state-value function for policy $π$, denoted as $v_{π}$, tells us how good any given state is for an agent following policy $π$. In other words, it gives us the value of a state under $π$.

Formally, the value of state $s$ under policy $π$ is the expected return from starting from state $s$ at time $t$ and following policy $π$ thereafter. Mathematically we define $v_{π}(s)$ as $$v_{π}(s)=E_{π}[G_{t}∣S_{t}=s]$$$$=E_{π}[∑_{k=0}^{∞}γ^{k}R_{t+k+1}∣S_{t}=s]$$

#### Action-value function
Similarly, the action-value function for policy $π$, denoted as $q_{π}$, tells us how good it is for the agent to take any given action from a given state while following policy $π$. In other words, it gives us the value of an action under $π$.

Formally, the value of action $a$ in state $s$ under policy $π$ is the expected return from starting from state $s$ at time $t$, taking action $a$, and following policy $π$ thereafter. Mathematically, we define $q_{π}(s,a)$ as $$q_{π}(s,a)=E_{π}[G_{t}∣S_{t}=s,A_{t}=a]$$$$=E_{π}[∑_{k=0}^{∞}γ^{k}R_{t+k+1}∣S_{t}=s,A_{t}=a]$$

Conventionally, the action-value function $q_{π}$ is referred to as the **Q-function**, and the output from the function for any given *state-action pair* is called a **Q-value**. The letter “Q” is used to represent the *quality of taking a given action in a given state*. 

## 4. What do Reinforcement Learning Algorithms Learn - Optimal Policies

### Optimality
It is the goal of reinforcement learning algorithms to find a policy that will yield a lot of rewards for the agent if the agent indeed follows that policy. Specifically, reinforcement learning algorithms seek to find a policy that will yield more return to the agent than all other policies. 

#### Optimal policy
In terms of return, a policy $π$ is considered to be better than or the same as policy $π′$ if the expected return of $π$ *is greater than or equal to the expected return of* $π′$ for all states. In other words, $π≥π′$ if and only if $v_{π}(s)≥v_{π′}(s)$ for all $s∈S$.

Remember, $v_{π}(s)$ gives the expected return for starting in state $s$ and following $π$ thereafter. A policy that is better than or at least the same as all other policies is called the **optimal policy**.

#### Optimal state-value function
The optimal policy has an associated *optimal state-value function*. We denote the optimal state-value function as $v_{∗}$ and define it as $v_{∗}(s) =max_{π}v_{π}(s)$ for all $s∈S$. In other words, $v_{∗}$ gives the largest expected return achievable by any policy $π$ for each state.

#### Optimal action-value function
Similarly, the optimal policy has an *optimal action-value function, or optimal Q-function*, which we denote as $q_{∗}$ and define as $q_{∗}(s,a)=max_{π}q_{π}(s,a)$ for all $s∈S$ and $a∈A(s)$. In other words, $q_{∗}$ gives the largest expected return achievable by any policy $π$ for each possible state-action pair.


### Bellman optimality equation for $q_{∗}$

One fundamental property of $q_{∗}$ is that it must satisfy the following equation. $$q_{∗}(s,a)=E[R_{t+1}+γmax_{a′}q_{∗}(s′,a′)]$$

This is called the **Bellman optimality equation**. It states that, for any state-action pair $(s,a)$ at time $t$, the expected return from starting in state $s$, selecting action $a$ and following the optimal policy thereafter (AKA the Q-value of this pair) is going to be the expected reward we get from taking action $a$ in state $s$, which is $R_{t+1}$, plus the maximum expected discounted return that can be achieved from any possible next state-action pair $(s′,a′)$.

Since the agent is following an *optimal policy*, the following state $s′$ will be the state from which the best possible next action $a′$ can be taken at time $t+1$.

We're going to see how we can use the Bellman equation to find $q_{∗}$. Once we have $q_{∗}$, we can determine the optimal policy because, with $q_{∗}$, for any state $s$, a reinforcement learning algorithm can find the action $a$ that maximizes $q_{∗}(s,a)$.
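
Reading a policy off of $q_{∗}$ amounts to taking an argmax over actions in each state. Here is a minimal sketch, assuming the Q-values are stored in a nested dictionary. The numbers stand in for $q_{∗}$ and are made up.

```python
# Hypothetical Q-values standing in for q_*; the numbers are made up.
q_star = {
    "s0": {"left": 0.2, "right": 1.5, "up": -0.3, "down": 0.0},
    "s1": {"left": 2.0, "right": 0.1, "up": 0.4, "down": -1.0},
}

def greedy_action(q, state):
    """Return the action a that maximizes q[state][a]."""
    return max(q[state], key=q[state].get)

optimal_policy = {s: greedy_action(q_star, s) for s in q_star}
```

In this example the resulting policy moves right from `s0` and left from `s1`, since those actions carry the largest Q-values in their states.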

We’re going to use this Bellman equation a lot going forward, so it will continue to materialize for us more as we progress. 

## 5. Q-Learning Explained - A Reinforcement Learning Technique

### Q-learning objective
Q-learning is the first technique we’ll discuss that can solve for the optimal policy in an MDP.

The objective of Q-learning is to find a policy that is optimal in the sense that the expected value of the total reward over all successive steps is the maximum achievable. So, in other words, the goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.


### Q-learning with value iteration
First, as a quick reminder, remember that the Q-function for a given policy accepts a state and an action and returns the expected return from taking the given action in the given state and following the given policy thereafter.

Also, remember this Bellman optimality equation for $q_{∗}$ $$q_{∗}(s,a)=E[R_{t+1}+γ max_{a′}q_{∗}(s′,a′)]$$

#### Value iteration
The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the Bellman equation until the Q-function converges to the optimal Q-function, $q_{∗}$. This approach is called **value iteration**.

To see exactly how this happens, let’s set up an example, appropriately called The Lizard Game.


#### An example: The Lizard Game
##### The set up

Suppose we have the following environment shown below. The agent in our environment is the *lizard*. The lizard wants to eat as many crickets as possible in the least amount of time without stumbling across a bird, which will, itself, eat the lizard.

![title](../docs/images/DLiz/2_TheLizardGameQ_Learning.png)

The lizard can move left, right, up, or down in this environment. These are the actions. The states are determined by the individual tiles and where the lizard is on the board at any given time.

If the lizard lands on a tile that has 1 cricket, the reward is +1 point. Landing on an empty tile is -1 point. A tile with 5 crickets is +10 points and will end the episode. A tile with a bird is -10 points and will also end the episode.  

| State | Reward |  
| --- | --- |  
| One cricket | +1 |  
| Empty | -1 |  
| Five crickets | +10 Game over |  
| Bird | -10 Game over |  

Now, **at the start of the game**, the lizard has no idea how good any given action is from any given state. It’s not aware of anything besides the current state of the environment. In other words, it doesn’t know from the start whether navigating left, right, up, or down will result in a positive reward or negative reward.

Therefore, the Q-values for each state-action pair will all be **initialized to zero** since the lizard knows nothing about the environment at the start. Throughout the game, though, the Q-values will be iteratively updated using value iteration.

##### Storing Q-values in a Q-table
We'll be making use of a table, called a Q-table, to store the Q-values for each state-action pair. The horizontal axis of the table represents the actions, and the vertical axis represents the states. So, the dimensions of the table are the number of states by the number of actions.

![title](../docs/images/DLiz/3_TheLizardGameQ_TableZeros.png)


As just mentioned, since the lizard knows nothing about the environment or the expected rewards for any state-action pair, **all the Q-values in the table are first initialized to zero**. Over time, though, as the lizard plays several episodes of the game, the Q-values produced for the state-action pairs that the lizard experiences will be used to update the Q-values stored in the Q-table.
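
In code, a zero-initialized Q-table can be as simple as a list of rows. The sketch below uses plain Python lists. The state count of 9 is an assumption (one state per tile on a 3x3 board) and should be adjusted to the actual environment.

```python
ACTIONS = ["left", "right", "up", "down"]
NUM_STATES = 9  # assumption: one state per tile on a 3x3 board

# One row per state, one column per action, all initialized to zero.
q_table = [[0.0 for _ in ACTIONS] for _ in range(NUM_STATES)]
```

Building each row with its own list comprehension matters here: writing `[[0.0] * 4] * 9` would make every row the same list object, so an update to one state would silently change all of them.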

As the **Q-table becomes updated, in later moves and later episodes**, the lizard can look in the Q-table and base its next action on the highest Q-value for the current state. This will make more sense once we actually start playing the game and updating the table.

##### Episodes

Now, we’ll set some standard number of episodes that we want the lizard to play. Let’s say we want the lizard to play 5 episodes. It is during these episodes that the learning process will take place.

In each episode, the lizard starts out by choosing an action from the starting state **based on the current Q-values in the table**. The lizard chooses the action based on which action has the highest Q-value in the Q-table for the current state.

But, wait... That’s kind of weird for the first actions in the first episode, right? Because all the Q-values are set to zero at the start, so there’s no way for the lizard to differentiate between them to discover which one is considered better. So, what action does it start with?

To answer this question, we'll introduce the **trade-off between exploration and exploitation**. This will help us understand not just how an agent takes its first actions, but how exactly it chooses actions in general.

### Exploration vs. exploitation

**Exploration** is the act of exploring the environment to find out information about it. **Exploitation** is the act of exploiting the information that is already known about the environment in order to maximize the return.

**The goal of an agent is to maximize the expected return**, so you might think that we want our agent to use exploitation all the time and not worry about doing any exploration. This strategy, however, isn’t quite right.

Think of our game. If our lizard got to the single cricket before it got to the group of five crickets, then by only making use of exploitation going forward, the lizard would just learn to exploit what it knows about the location of the single cricket and collect single points indefinitely. It would also lose single points indefinitely, since it has to back out of the tile before it can come back in to get the cricket again.

If the lizard was able to explore the environment, however, *it would have the opportunity* to find the group of five crickets that would immediately win the game. If the lizard only explored the environment with no exploitation, however, then it would miss out on making use of known information that could help to maximize the return.

Given this, we need a balance of both exploitation and exploration.

## 6. Exploration vs. Exploitation - Learning the Optimal Reinforcement Learning Policy

Q-learning - Choosing actions with an epsilon greedy strategy

### Review: The lizard game
Remember, in each episode, the agent (the lizard in our case) starts out by choosing an action from the starting state based on the current Q-value estimates in the Q-table. The lizard chooses its action based on the highest Q-value for the given state.

Since we know that all of the Q-values are first initialized to zero, there’s no way for the lizard to differentiate between them at the starting state of the first episode. So, the question remains, what action does it start with? Furthermore, for subsequent states, is it really as straight-forward as just selecting the action with the highest Q-value for the given state?

Additionally, we know that we need a balance of exploration and exploitation to choose our actions. This balance is achieved with an epsilon greedy strategy, so let's explore that now.

### Epsilon greedy strategy

To get this balance between exploitation and exploration, we use what is called an **epsilon greedy strategy**. With this strategy, we define an exploration rate $ϵ$ that we initially set to 1. This **exploration rate** is the probability that our agent will explore the environment rather than exploit it. 
    With $ϵ=1$, it is 100% certain that the agent will start out by exploring the environment.

As the agent learns more about the environment, at the start of each new episode, $ϵ$ will *decay by some rate* that we set so that exploration becomes less and less probable as the agent learns more and more about the environment. The agent will become *“greedy”* in terms of exploiting the environment once it has had the opportunity to explore and learn more about it.
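
The text above leaves the decay rule open, so the following is only one possible sketch: an exponential decay from a maximum to a minimum exploration rate. The schedule itself and the constants `MAX_EPSILON`, `MIN_EPSILON`, and `DECAY_RATE` are all assumptions chosen for illustration.

```python
import math

# All of these values are assumptions chosen for illustration.
MAX_EPSILON = 1.0
MIN_EPSILON = 0.01
DECAY_RATE = 0.01

def epsilon_for_episode(episode):
    """Exponentially decay epsilon from MAX_EPSILON toward MIN_EPSILON."""
    return MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-DECAY_RATE * episode)

schedule = [epsilon_for_episode(e) for e in (0, 100, 1000)]
```

With this schedule the agent starts out fully exploratory at episode 0 and still keeps a small chance of exploring late in training, because $ϵ$ never drops below `MIN_EPSILON`.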

To determine whether the agent will choose exploration or exploitation at each time step, *we generate a random number between 0 and 1*. If this number is greater than epsilon, then the agent will choose its next action via exploitation, i.e. it will choose the action with the highest Q-value for its current state from the Q-table. Otherwise, its next action will be chosen via exploration, i.e. randomly choosing its action and exploring what happens in the environment.

            if random_num > epsilon:
                choose action via exploitation
            else:
                choose action via exploration
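
The pseudocode above translates almost line for line into Python. This is a minimal sketch, assuming the Q-values for the current state are available as a list with one entry per action. The function name `choose_action` and the example values are hypothetical.

```python
import random

def choose_action(q_row, epsilon, rng):
    """Pick an action index for one state using an epsilon greedy rule.

    q_row is the list of Q-values for the current state, one per action.
    """
    if rng.random() > epsilon:
        # Exploitation: the action with the highest Q-value (first one on ties).
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Exploration: a uniformly random action.
    return rng.randrange(len(q_row))

rng = random.Random(0)
q_row = [0.0, 2.0, -1.0, 0.5]  # hypothetical Q-values for one state
greedy_pick = choose_action(q_row, epsilon=0.0, rng=rng)
random_picks = [choose_action(q_row, epsilon=1.0, rng=rng) for _ in range(200)]
```

With $ϵ=1$ the random number can never exceed epsilon, so every pick is exploratory, which matches the behavior described for the start of training.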



### Choosing an action
So, recall, we first started talking about the exploration-exploitation trade-off last time because we were discussing how the lizard should choose its very first action since all the actions have a Q-value of 0.

Well now, we should know that the action will be chosen randomly via exploration since our exploration rate is set to 1 initially. Meaning, with 100% probability, the lizard will explore the environment during the first episode of the game, rather than exploit it.

Alright, so after the lizard takes an action, it observes the next state, the reward gained from its action, and updates the Q-value in the Q-table for the action it took from the previous state.

Let’s suppose the lizard chooses to move right as its action from the starting state. We can see the reward we get in this new state is −1 since, recall, empty tiles have a reward of −1 point.

### Updating the Q-value
To update the Q-value for the action of moving right taken from the previous state, we use the Bellman equation that we highlighted previously $$q_{∗}(s,a)=E[R_{t+1}+γ max_{a′}q_{∗}(s′,a′)]$$

We want to make the Q-value for the given state-action pair as close as we can to the right hand side of the Bellman equation so that the Q-value will eventually converge to the optimal Q-value $q_{∗}$.

This will happen over time by iteratively comparing the loss between the Q-value and the optimal Q-value for the given state-action pair and then updating the Q-value over and over again each time we encounter this same state-action pair to reduce the loss.

$$q_{∗}(s,a)−q(s,a)=loss$$
$$E[R_{t+1}+γ max_{a′}q_{∗}(s′,a′)]−E[∑_{k=0}^{∞}γ^{k}R_{t+k+1}]=loss$$

To actually see how we update the Q-value, we first need to introduce the idea of a learning rate.


### The learning rate
The learning rate is a number between 0 and 1, which can be thought of as **how quickly the agent abandons the previous Q-value** in the Q-table for a given state-action pair for the new Q-value.

So, for example, suppose we have a Q-value in the Q-table for some arbitrary state-action pair that the agent has experienced in a previous time step. Well, if the agent experiences that same state-action pair at a later time step once it's learned more about the environment, the Q-value will need to be updated to reflect the change in expectations the agent now has for the future returns.

We don't want to just overwrite the old Q-value, but rather, we use the learning rate as a tool to determine how much information we keep about the previously computed Q-value for the given state-action pair versus the new Q-value calculated for the same state-action pair at a later time step. We’ll denote the learning rate with the symbol $α$, and we’ll arbitrarily set $α=0.7$ for our lizard game example.

**The higher the learning rate, the more quickly the agent will adopt the new Q-value.**

For example, if the learning rate is 1, the estimate for the Q-value for a given state-action pair would be the straight up newly calculated Q-value and would not consider previous Q-values that had been calculated for the given state-action pair at previous time steps. 

### Calculating the new Q-value

The formula for calculating the new Q-value for state-action pair $(s,a)$ at time $t$ is this:
$$q^{new}(s,a)=(1−α) q(s,a) + α (R_{t+1}+γ max_{a′}q(s′,a′))$$

where $q(s,a)$ is the old value and $R_{t+1}+γ max_{a′}q(s′,a′)$ is the learned value.

So, our **new Q-value is equal to a weighted sum of our old value and the learned value**. The old value in our case is $0$ since this is the first time the agent is experiencing this particular state-action pair, and we multiply this old value by $(1−α)$.

Our learned value is the reward the agent receives from moving right from the starting state plus the discounted estimate of the optimal future Q-value for the next state-action pair $(s′,a′)$ at time $t+1$. This entire learned value is then multiplied by our learning rate.
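
The weighted sum can be written as a small helper. This is a sketch with hypothetical names. `max_next_q` stands for $max_{a′}q(s′,a′)$, which the caller is assumed to look up in the Q-table.

```python
def q_update(old_value, reward, max_next_q, alpha, gamma):
    """New Q-value as a weighted sum of the old value and the learned value."""
    learned_value = reward + gamma * max_next_q
    return (1 - alpha) * old_value + alpha * learned_value

# The two extremes of the learning rate, using made-up numbers:
only_new = q_update(5.0, 1.0, 2.0, alpha=1.0, gamma=0.5)  # old value ignored
only_old = q_update(5.0, 1.0, 2.0, alpha=0.0, gamma=0.5)  # learned value ignored
```

The two extreme calls illustrate the role of $α$: at $α=1$ the result is just the learned value, and at $α=0$ the old Q-value is left untouched.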

All of the math for this calculation of our concrete example state-action pair of moving right from the starting state is shown below. Suppose the discount rate $γ=0.99$. We have

$$q^{new}(s,a)=(1−α) q(s,a) +α (R_{t+1}+γ max_{a′}q(s′,a′))$$
$$=(1−0.7)(0)+0.7(−1+0.99(max_{a′}q(s′,a′)))$$

where $q(s,a)$ is the old value and $R_{t+1}+γ max_{a′}q(s′,a′)$ is the learned value.


Let's pause for a moment and focus on the term $max_{a′}q(s′,a′)$. Since all the Q-values are currently initialized to $0$ in the Q-table, we have 

$$max_{a′}q(s′,a′) = max(q(empty6, left),q(empty6, right),q(empty6, up),q(empty6, down))$$
$$=max(0,0,0,0)$$
$$=0$$

Now, we can substitute the value $0$ in for $max_{a′}q(s′,a′)$ in our earlier equation to solve for $q^{new}(s,a)$.
  
$$q^{new}(s,a)=(1−α) q(s,a)+α (R_{t+1}+γ max_{a′} q(s′,a′))$$
$$=(1−0.7)(0)+0.7(−1+0.99(max_{a′}q(s′,a′)))$$
$$=(1−0.7)(0)+0.7(−1+0.99(0))$$
$$=0+0.7(−1)$$
$$=−0.7$$
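
The same arithmetic can be replayed in code as a sanity check. In this sketch the Q-table is restricted to the two states involved. The label `"start"` for the starting tile is a hypothetical name, while `"empty6"` is the tile the lizard lands on.

```python
ALPHA = 0.7   # learning rate from the example
GAMMA = 0.99  # discount rate from the example
ACTIONS = ["left", "right", "up", "down"]

# Q-table restricted to the two states in the example, all zeros to begin with.
q_table = {
    "start": {a: 0.0 for a in ACTIONS},   # "start" is a hypothetical label
    "empty6": {a: 0.0 for a in ACTIONS},
}

state, action, reward, next_state = "start", "right", -1, "empty6"

max_next_q = max(q_table[next_state].values())  # max(0, 0, 0, 0) = 0
learned_value = reward + GAMMA * max_next_q
q_table[state][action] = (1 - ALPHA) * q_table[state][action] + ALPHA * learned_value
```

After this single update, the entry for moving right from the starting state holds $-0.7$, and every other entry in the table is still zero.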

Alright, so now we'll take this new Q-value we just calculated and store it in our Q-table for this particular state-action pair.

We’ve now done everything needed for a single time step. This same process will happen for each time step until termination in each episode.

Once the Q-function converges to the optimal Q-function, we will have our optimal policy.

### Max steps

Oh, and speaking of termination, we can also specify a *max number of steps* that our agent can take before the episode auto-terminates. With the way the game is set up right now, termination will only occur if the lizard reaches the state with five crickets or the state with the bird.

We could define some condition that states if the lizard hasn’t reached termination by either one of these two states after 100 steps, then terminate the game after the 100th step.
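
A step cap like this is usually just an extra condition on the episode loop. The sketch below assumes a hypothetical `step_fn` that returns the next state and a flag saying whether a terminal state was reached. The two toy step functions exist only to exercise the loop.

```python
MAX_STEPS = 100  # cap on steps per episode, as in the text

def run_episode(step_fn, start_state, max_steps=MAX_STEPS):
    """Run until step_fn reports termination or max_steps is reached."""
    state, steps, done = start_state, 0, False
    while not done and steps < max_steps:
        state, done = step_fn(state)
        steps += 1
    return steps, done

# Hypothetical step functions: one never terminates, one ends after 3 steps.
def never_ends(state):
    return state + 1, False

def ends_at_three(state):
    return state + 1, state + 1 >= 3

capped = run_episode(never_ends, 0)
finished = run_episode(ends_at_three, 0)
```

The first call stops only because of the cap, while the second stops because the environment itself reported a terminal state.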

## Summary: Q-Learning

1. Initialize all Q-values in the Q-table to 0.
2. For each time-step in each episode:
    - Choose an action (considering the exploration-exploitation trade-off).
    - Observe the reward and next state.
    - Update the Q-value function (using the formula we gave that will, over time, make the Q-value function converge to the right hand side of the Bellman equation).
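
Putting the summary together, here is a compact sketch of the whole loop in pure Python. To keep it short it does not use the lizard board. Instead it assumes a hypothetical one-dimensional corridor with a bird at one end and five crickets at the other. The episode count, step cap, and epsilon decay schedule are also assumptions.

```python
import math
import random

# Hypothetical 1-D stand-in for the lizard game: a corridor of 5 tiles.
# Tile 0 holds the bird, tile 4 holds five crickets, tiles 1-3 are empty.
REWARDS = {0: -10, 1: -1, 2: -1, 3: -1, 4: 10}
TERMINAL = {0, 4}
ACTIONS = {"left": -1, "right": +1}
START = 2

ALPHA, GAMMA = 0.7, 0.99                   # values used in the text
NUM_EPISODES, MAX_STEPS = 500, 100         # assumptions for this toy problem
MIN_EPS, MAX_EPS, DECAY = 0.01, 1.0, 0.01  # assumed decay schedule

rng = random.Random(0)

# 1. Initialize all Q-values in the Q-table to 0.
q = {s: {a: 0.0 for a in ACTIONS} for s in REWARDS}

for episode in range(NUM_EPISODES):
    epsilon = MIN_EPS + (MAX_EPS - MIN_EPS) * math.exp(-DECAY * episode)
    state = START
    for _ in range(MAX_STEPS):
        # 2a. Choose an action (exploration vs. exploitation).
        if rng.random() > epsilon:
            action = max(q[state], key=q[state].get)
        else:
            action = rng.choice(list(ACTIONS))
        # 2b. Observe the reward and next state.
        next_state = state + ACTIONS[action]
        reward = REWARDS[next_state]
        # 2c. Update the Q-value; terminal states keep Q-values of 0.
        learned = reward + GAMMA * max(q[next_state].values())
        q[state][action] = (1 - ALPHA) * q[state][action] + ALPHA * learned
        state = next_state
        if state in TERMINAL:
            break

greedy_policy = {s: max(q[s], key=q[s].get) for s in (1, 2, 3)}
```

After training, the greedy policy read from the table should head right toward the five crickets from every non-terminal tile, and the Q-value for stepping onto that tile should be close to its reward of 10.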



## Topics in Other Notebook

### 7. OpenAI Gym and Python for Q-learning - Reinforcement Learning Code Project
### 8. Train Q-learning Agent with Python - Reinforcement Learning Code Project
### 9. Watch Q-learning Agent Play Game with Python - Reinforcement Learning Code Project
### 10. Deep Q-Learning - Combining Neural Networks and Reinforcement Learning
### 11. Replay Memory Explained - Experience for Deep Q-Network Training
### 12. Training a Deep Q-Network - Reinforcement Learning
### 13. Training a Deep Q-Network with Fixed Q-targets - Reinforcement Learning