# Chapter 2

***k-armed Bandit Problem***: An analogy to having several attempts at a slot machine with k levers. Each lever has an expected (mean) reward that we receive if we decide to pull it (for example, selecting lever 5 as our action at timestep 3). The action selected at timestep t is denoted A_t, the reward from this action is denoted R_t, an arbitrary action is denoted a, and the expected reward of action a is thus: q*(a) = E[R_t | A_t = a]
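
The bandit setup above can be sketched in a few lines. This is only an illustrative sketch: the class name `GaussianBandit`, the Gaussian reward noise, and the true values are assumptions made for the example, not something these notes specify.

```python
import random


class GaussianBandit:
    """k levers; lever a pays a noisy reward centred on its true value q*(a)."""

    def __init__(self, true_values, noise_std=1.0, seed=0):
        self.true_values = list(true_values)  # q*(a) per lever (hypothetical numbers)
        self.noise_std = noise_std            # assumed reward noise
        self.rng = random.Random(seed)        # seeded so runs are repeatable

    def pull(self, action):
        """Return a sampled reward R_t for A_t = action."""
        return self.rng.gauss(self.true_values[action], self.noise_std)


bandit = GaussianBandit([0.1, -0.4, 1.5])
rewards = [bandit.pull(2) for _ in range(5_000)]
# The average of many pulls of lever 2 should sit close to q*(2) = 1.5.
mean_reward = sum(rewards) / len(rewards)
```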

***Exploration vs Exploitation***: A core tradeoff in RL. Exploration is trying out other levers (whose estimated reward may currently be lower) in the hope of finding one with a much higher reward. Exploitation is using our current knowledge of expected rewards and selecting the action with the highest estimated reward. We use different methods to switch between these two efforts as our agent works through its environment trying to collect rewards.

***Sample-average***: A method for estimating the expected value of an action. The estimate is simply the average reward of the action up to this point: we sum all the rewards we received whenever we took this specific action, then divide by the number of times we took it.
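
A minimal sketch of the sample-average estimate, assuming the rewards seen for each action are kept in a list. The `history` data and the default of 0.0 for an action that was never taken are assumptions made for the example.

```python
def sample_average(rewards_for_action):
    """Average of all rewards observed for one action.

    Returns 0.0 when the action was never taken (an assumed default).
    """
    if not rewards_for_action:
        return 0.0
    return sum(rewards_for_action) / len(rewards_for_action)


# Hypothetical reward history: action index -> rewards received so far.
history = {0: [1.0, 0.0, 2.0], 1: [0.5], 2: []}
estimates = {action: sample_average(rs) for action, rs in history.items()}
```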

***Greedy action selection***: A_t = argmax_a Q_t(a) (select the action a that maximizes Q_t, our current estimate of the expected reward).

***epsilon-greedy methods***: Methods with a preset value epsilon, the probability that our agent selects an action at random (all actions with equal probability) instead of the action with the highest estimated reward. Ex: eps=0.5 for the case of two actions at timestep 1: 50% chance of selecting the action with the highest estimated reward and 50% chance of selecting randomly between actions 1 and 2 (thus 0.50 + 0.25 = 0.75 chance we select the greedy option (exercise 2.1)).
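
The selection rule and the 0.75 figure from exercise 2.1 can be checked with a short sketch. The function names and the estimates in `q` are made up for the illustration, and ties are broken by taking the first maximal action, which is one choice among several.

```python
import random


def epsilon_greedy(q_estimates, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))  # explore: all actions equally likely
    # exploit: index of the largest estimate (first one wins on ties)
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])


def prob_greedy_selected(epsilon, num_actions):
    """Chance the greedy action is picked, assuming exactly one greedy action."""
    return (1 - epsilon) + epsilon / num_actions


rng = random.Random(0)
q = [0.2, 0.9]  # hypothetical estimates; action 1 is the greedy one
picks = [epsilon_greedy(q, 0.5, rng) for _ in range(10_000)]
empirical = picks.count(1) / len(picks)  # should land near the analytic value
analytic = prob_greedy_selected(0.5, 2)
```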


The advantage of eps-greedy over greedy depends on the task. For example, if the rewards have higher variance then it takes more exploration to find the truly best action. If there is no variance then a single pull reveals an action's true value, so far less exploration is needed and greedy can do well. We can also run into ***nonstationary tasks***, where the true reward value of an action changes over time.

***deterministic case***: The case where outcomes follow predefined logic and rules, with no randomness involved.

There is a memory-efficient way to calculate the sample-average reward without storing all of the previous reward values. The update formula is: Q_n+1 = Q_n + (1/n)[R_n - Q_n] ==> NewEst = OldEst + StepSize[Target - OldEst]
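
As a quick check that the incremental rule reproduces the plain average, here is a small sketch. The reward values are arbitrary, and the starting estimate of 0.0 is an assumption (with step size 1/n it is overwritten by the first reward anyway).

```python
def incremental_update(old_estimate, target, step_size):
    """NewEst = OldEst + StepSize * (Target - OldEst)."""
    return old_estimate + step_size * (target - old_estimate)


rewards = [1.0, 0.0, 2.0, 5.0]  # hypothetical rewards for one action
q = 0.0
for n, reward in enumerate(rewards, start=1):
    # Step size 1/n turns the rule into the running sample average.
    q = incremental_update(q, reward, 1.0 / n)

batch_mean = sum(rewards) / len(rewards)  # the same value, computed from the full list
```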


In nonstationary tasks we want to give more weight to the most recently observed rewards (as we are trying to get the closest representation of the action's current expected reward, assuming it has changed). We can change our step size from the sample-average form alpha(a) = (1/n) to a constant value alpha: alpha(a) = alpha. With the sample-average step size the estimate is guaranteed to converge to the true expected value (when that value is fixed), while with a constant step size it is not. This is good because we don't want convergence on an expected value that is constantly changing.
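
A small sketch of why the constant step size tracks a changing reward better. The reward sequence (exactly 0 for 100 steps, then exactly 10 for 100 steps) and alpha = 0.1 are assumptions chosen to make the effect easy to see.

```python
# Hypothetical nonstationary action: its reward jumps from 0 to 10 halfway through.
rewards = [0.0] * 100 + [10.0] * 100

alpha = 0.1      # assumed constant step size
q_average = 0.0  # estimate using step size 1/n
q_constant = 0.0 # estimate using constant step size alpha

for n, reward in enumerate(rewards, start=1):
    q_average += (reward - q_average) / n
    q_constant += alpha * (reward - q_constant)

# q_average weights all 200 rewards equally and ends at their mean (5.0).
# q_constant weights recent rewards more and ends close to the current value (10.0).
```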


***Optimistic initial values***: Setting the initial reward estimate of each action much higher than what you think it will be. This encourages the agent to explore more at the start, which can help for stationary problems. (It doesn't help for nonstationary problems, as eventually the expected rewards will change and the initial values won't matter.)
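
The exploration effect can be seen with a purely greedy agent on a toy bandit. This sketch assumes deterministic rewards and sample-average updates to keep the trace easy to follow; the function name `run_greedy` and all numbers are made up for the illustration.

```python
def run_greedy(true_rewards, initial_q, steps):
    """Run a purely greedy agent and return the list of actions it chose."""
    q = [initial_q] * len(true_rewards)
    counts = [0] * len(true_rewards)
    chosen = []
    for _ in range(steps):
        action = max(range(len(q)), key=lambda a: q[a])  # greedy, first wins on ties
        counts[action] += 1
        # Sample-average update with a deterministic reward (assumed for clarity).
        q[action] += (true_rewards[action] - q[action]) / counts[action]
        chosen.append(action)
    return chosen


rewards = [1.0, 2.0, 3.0]                   # hypothetical true rewards; action 2 is best
pessimistic = run_greedy(rewards, 0.0, 10)  # gets stuck on action 0
optimistic = run_greedy(rewards, 5.0, 10)   # tries every action, then settles on 2
```

With the initial estimate of 5.0 every real reward is a disappointment, so the greedy agent moves on to the next untried action until it has sampled them all.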

# Chapter 3

***Markov decision process***: A model in which the next step depends only on the previous state. In terms of RL, the new state and current reward depend only on the previous state and the action taken:

    p(s', r | s, a) = Pr(S_t = s', R_t = r | S_t-1 = s, A_t-1 = a).
    Creating the sequence: S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3,...

The function p defines a probability distribution. For each possible state (s) and action (a), it gives the probability of each possible next-state and reward pair occurring, given the previous state and action taken, so these probabilities sum to 1:

    SUM:s'_in_S SUM:r_in_R p(s', r | s, a) = 1, For_All s in S, a in A(s)
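
One way to hold the four-argument function p in code is a nested dictionary, which makes the sum-to-one condition easy to check. The layout, the state and action names, and the probabilities below are all assumptions made for this sketch.

```python
# Hypothetical dynamics: (state, action) -> {(next_state, reward): probability}
DYNAMICS = {
    ("s0", "go"): {("s1", 1.0): 0.5, ("s1", 0.0): 0.25, ("s0", 0.0): 0.25},
    ("s0", "stay"): {("s0", 0.0): 1.0},
}


def is_normalised(dynamics, tol=1e-9):
    """True if, for every (s, a), the probabilities over (s', r) pairs sum to 1."""
    return all(
        abs(sum(outcomes.values()) - 1.0) <= tol
        for outcomes in dynamics.values()
    )


valid = is_normalised(DYNAMICS)
broken = is_normalised({("s0", "go"): {("s1", 1.0): 0.5}})  # mass is missing here
```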

***Markov property***: The likelihood of changing to a specific state depends only on the current state (and the current time elapsed), and not at all on any of the previous states. "The state must include information about all aspects of the past agent–environment interaction that make a difference for the future".

***State transition probabilities***: The probability of transitioning from the current state to some specific next state. This is found by summing the probabilities of every reward that can accompany the move from the current state to this specific next state. In other words, we don't care which reward we are going to get, only the probability that we move to this next state:
    
    p(s' | s, a) = SUM:r_in_R p(s', r | s, a)
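
A minimal sketch of this marginalisation over rewards, assuming the dynamics are stored as a dictionary. The layout and all the numbers are hypothetical.

```python
# Hypothetical dynamics: (state, action) -> {(next_state, reward): probability}
DYNAMICS = {
    ("s0", "go"): {("s1", 1.0): 0.5, ("s1", 0.0): 0.25, ("s0", 0.0): 0.25},
    ("s0", "stay"): {("s0", 0.0): 1.0},
}


def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): sum p(s', r | s, a) over every reward r."""
    return sum(
        p for (candidate, _reward), p in dynamics[(s, a)].items()
        if candidate == s_next
    )


p_to_s1 = transition_prob(DYNAMICS, "s0", "go", "s1")  # 0.5 + 0.25
p_to_s0 = transition_prob(DYNAMICS, "s0", "go", "s0")
```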

***Expected reward for state-action pairs***: The expected reward we get for taking a specific action (a) at our current state (s):

    r(s, a) = E[R_t | S_t-1=s, A_t-1 = a] = SUM:r_in_R *r* SUM:s'_in_S p(s', r | s, a)

(Each reward (r) is multiplied by its total probability, which is the sum of p(s', r | s, a) over all possible next states (s'); these products are then summed over all rewards.)
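
The expected reward r(s, a) can be computed directly from the same kind of table. This is a sketch under the assumption that the dynamics are stored as a dictionary; the names and numbers are made up.

```python
# Hypothetical dynamics: (state, action) -> {(next_state, reward): probability}
DYNAMICS = {
    ("s0", "go"): {("s1", 1.0): 0.5, ("s1", 0.0): 0.25, ("s0", 0.0): 0.25},
    ("s0", "stay"): {("s0", 0.0): 1.0},
}


def expected_reward(dynamics, s, a):
    """r(s, a): each reward weighted by its probability, over all (s', r) pairs."""
    return sum(reward * p for (_s_next, reward), p in dynamics[(s, a)].items())


r_go = expected_reward(DYNAMICS, "s0", "go")      # 1.0 * 0.5, plus zero rewards
r_stay = expected_reward(DYNAMICS, "s0", "stay")
```

Summing `reward * p` over every (s', r) pair is the same double sum as in the formula, just taken in one pass.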


***Expected reward for state–action–next-state***: The expected reward we can expect to get if we take action (a) at our current state (s) and arrive at the new state (s'):

    r(s, a, s') = E[R_t | S_t-1=s, A_t-1=a, S_t=s'] = SUM:r_in_R *r* p(s', r | s, a) / p(s' | s, a)

(Each reward (r) is multiplied by the probability of getting that reward together with the new state (s'). The products are summed over all rewards, and the result is divided by the probability of reaching the new state (s') from our current state (s) and action taken (a).)

(The probability of getting a specific reward (r) together with a specific next state (s') can be low simply because that next state is unlikely. Dividing by the probability of reaching this next state removes that effect. In other words, what are the chances of getting our reward (r) in a specific scenario, given that this specific scenario actually happens?)
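
A sketch of r(s, a, s') on a toy table, assuming the dynamics are stored as a dictionary (all names and numbers are hypothetical). It also checks that weighting r(s, a, s') by p(s' | s, a) and summing over s' gives back r(s, a).

```python
# Hypothetical dynamics: (state, action) -> {(next_state, reward): probability}
DYNAMICS = {
    ("s0", "go"): {("s1", 1.0): 0.5, ("s1", 0.0): 0.25, ("s0", 0.0): 0.25},
}


def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): total probability of landing in s_next."""
    return sum(p for (sp, _r), p in dynamics[(s, a)].items() if sp == s_next)


def expected_reward_given_next(dynamics, s, a, s_next):
    """r(s, a, s'): expected reward, given that we actually arrive in s_next."""
    p_next = transition_prob(dynamics, s, a, s_next)
    if p_next == 0:
        raise ValueError("s_next cannot be reached from (s, a)")
    joint = sum(r * p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)
    return joint / p_next


r_s1 = expected_reward_given_next(DYNAMICS, "s0", "go", "s1")  # 0.5 / 0.75
r_s0 = expected_reward_given_next(DYNAMICS, "s0", "go", "s0")
# Consistency check: averaging over next states recovers r(s, a) = 0.5.
recombined = sum(
    transition_prob(DYNAMICS, "s0", "go", sp)
    * expected_reward_given_next(DYNAMICS, "s0", "go", sp)
    for sp in ("s0", "s1")
)
```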

***Agent–environment boundary***: Anything that is outside of the agent's control and "cannot be changed arbitrarily by the agent" is considered part of the environment. Thus the boundary represents the limit of the agent's absolute control. For example, a robot (which is our agent) has mechanical arms and sensory devices; these are considered part of its environment and not part of the agent, as they are reactive to the current environment and actions taken so far.

***Reward hypothesis***: The agent's goal is to maximize the total amount of reward it receives. That means maximizing not immediate reward but cumulative reward in the long run. "That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)".

***Episodes***: "when the agent–environment interaction breaks naturally into subsequences, ... such as plays of a game, trips through a maze, or any sort of repeated interaction". 

Each episode ends in a special state called the ***terminal state***.

"the next episode begins independently of how the previous one ended. Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes. Tasks with episodes of this kind are called ***episodic tasks***"

***Continuing tasks***: "in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit". Ex: Robot with a long life span.

The issue with continuing tasks is that our final time step is T=infinity, so the sum of rewards can itself easily be infinite. The agent could then perform suboptimally and still reach a total reward of infinity, no matter how many suboptimal actions it takes. To solve this we use discounting in our return calculation.

***Discounting***: Adding a hyperparameter (gamma) that shrinks rewards according to how far in the future they arrive. The next reward is left as is, and the reward arriving k timesteps after it is multiplied by this preset constant raised to the power k:

    G_t = R_t+1 + gamma*R_t+2 + gamma^2*R_t+3 + ... = inf_SUM_k=0 gamma^k * R_t+k+1

gamma is in the range [0, 1] and is referred to as the discount rate. "Gamma determines the present value of future rewards, a reward received k time steps in the future is worth only gamma^(k-1) times what it would be worth if it were received immediately". The further in the future a reward arrives, the less it is worth now (this creates a sense of urgency for the agent).

If gamma = 0 then none of the future rewards enter the discounted return, so the agent only maximizes the immediate reward and becomes greedy (myopic). As gamma gets closer to 1, more weight is put on the future reward terms and the agent becomes more farsighted.
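
To see how discounting keeps the return finite, the sketch below sums a constant reward of 1 over a long (but finite) horizon. The horizon of 1000 steps stands in for infinity and is an assumption of the example, as is the helper name `truncated_return`.

```python
def truncated_return(reward, gamma, horizon):
    """Sum of gamma^k * reward for k = 0 .. horizon - 1."""
    return sum(reward * gamma**k for k in range(horizon))


farsighted = truncated_return(1.0, 0.9, 1000)  # approaches 1 / (1 - 0.9) = 10
myopic = truncated_return(1.0, 0.0, 1000)      # only the immediate reward counts
undiscounted = truncated_return(1.0, 1.0, 1000)  # grows with the horizon
```

With gamma below 1 the sum levels off at reward / (1 - gamma), whereas with gamma = 1 it simply keeps growing with the horizon.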

    G_t = R_t+1 + gamma*R_t+2 + gamma^2*R_t+3 + gamma^3*R_t+4 + ...
    = R_t+1 + gamma(R_t+2 + gamma*R_t+3 + ...)
    = R_t+1 + gamma*G_t+1
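
The recursion G_t = R_t+1 + gamma*G_t+1 gives a simple way to compute every return of a finished reward sequence by working backwards. This is a minimal sketch; the reward list and gamma = 0.5 are arbitrary, and the return after the last reward is taken to be 0.

```python
def discounted_returns(rewards, gamma):
    """rewards[t] holds R_t+1; returns a list whose entry t is G_t."""
    returns = [0.0] * len(rewards)
    g = 0.0  # return after the final reward is assumed to be 0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_t+1 + gamma * G_t+1
        returns[t] = g
    return returns


returns = discounted_returns([1.0, 2.0, 3.0], gamma=0.5)
# Direct sum for G_0, for comparison: 1 + 0.5*2 + 0.25*3
direct_g0 = 1.0 + 0.5 * 2.0 + 0.25 * 3.0
```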

*In these continuing tasks we have estimates of the future reward to be received given a specific action a. These future rewards are based on the actions we expect the agent to take and the rewards we expect to receive (based on those future actions). And these future actions are the ones that will be available to us only if we take this certain action a. Thus this is a forward-looking model, and gamma is our hyperparameter for how much trust it places in these expected future values.*


***Policy***: An agent's strategy of behaviour (which actions it chooses) at a given time. "Formally, a policy is a mapping from states to probabilities of selecting each possible action". Example: in a maze problem, a dumb agent's policy is to wander around, while a smart agent's policy is to plan a path in its head first and then go straight to the end goal.

If the agent is following policy pi() at time t, then pi(a|s) is the probability that A_t = a if S_t = s.
pi(a|s) returns a single probability, but for a given state s the values pi(a|s) over all actions a form a probability distribution over the actions the agent may take.

Exercise 3.11: If the current state is S_t, and actions are selected according to stochastic policy pi, then what is the expectation of R_t+1 in terms of pi() and the four-argument function p (3.2)? -> SUM_a pi(a|s) * r(s, a) over all actions a in the action set of state s: the probability of taking each action multiplied by the expected reward of taking that action at the current state s. Since r(s, a) = SUM_r r SUM_s' p(s', r | s, a), this can be written with the four-argument function (3.2) as: E[R_t+1 | S_t = s] = SUM_a pi(a|s) SUM_s' SUM_r r * p(s', r | s, a).
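
The expectation in this exercise can be computed numerically on a toy problem. This is a sketch only: the dictionary layouts for the policy and for p, and every number in them, are assumptions made for the illustration.

```python
# Hypothetical dynamics: (state, action) -> {(next_state, reward): probability}
DYNAMICS = {
    ("s0", "go"): {("s1", 1.0): 0.5, ("s1", 0.0): 0.25, ("s0", 0.0): 0.25},
    ("s0", "stay"): {("s0", 0.0): 1.0},
}
# Hypothetical stochastic policy: state -> {action: pi(a | s)}
POLICY = {"s0": {"go": 0.8, "stay": 0.2}}


def expected_next_reward(policy, dynamics, s):
    """E[R_t+1 | S_t = s]: sum over a of pi(a|s) * sum over (s', r) of r * p(s', r | s, a)."""
    total = 0.0
    for action, action_prob in policy[s].items():
        for (_s_next, reward), p in dynamics[(s, action)].items():
            total += action_prob * reward * p
    return total


value = expected_next_reward(POLICY, DYNAMICS, "s0")  # 0.8 * 0.5 + 0.2 * 0.0
```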


***State-value function for policy pi()***: The expected return (the discounted sum of rewards, here for a continuing environment) we can expect to get if we start at state s and follow our specific policy pi().

    v_pi(s) = E_pi[G_t| S_t = s] = E_pi[inf_SUM_k=0 gamma^k * R_t+k+1 | S_t=s ], forAll s in S

***Action-value function for policy pi()***: The expected return (again for a continuing environment) we can expect to get if we start at state s, take action a, and then follow our specific policy pi() afterwards (after action a is taken at state s, follow the policy: what is the expected return?).

    q_pi(s, a) = E_pi[G_t| S_t = s, A_t = a] = E_pi[inf_SUM_k=0 gamma^k * R_t+k+1 | S_t=s, A_t = a ]

***Monte Carlo method***: Estimating an expected value by averaging many samples of a random variable. The average converges to the true expected value as the number of samples goes to infinity.
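
As an illustration of averaging sampled returns, consider a made-up one-state task: every step pays a reward of 1, and after each step the episode ends with probability 0.5. Its exact value is 1 / (1 - 0.5 * gamma), so the Monte Carlo estimate can be compared against it. The task, the function names, and the sample count are all assumptions of this sketch.

```python
import random


def sample_return(rng, gamma, p_end=0.5):
    """Sample one discounted return from the toy task described above."""
    g, discount = 0.0, 1.0
    while True:
        g += discount * 1.0        # reward of 1 on every step
        if rng.random() < p_end:   # episode terminates after this step
            return g
        discount *= gamma


def mc_state_value(num_episodes, gamma, seed=0):
    """Average of many sampled returns: a Monte Carlo estimate of the state value."""
    rng = random.Random(seed)
    return sum(sample_return(rng, gamma) for _ in range(num_episodes)) / num_episodes


estimate = mc_state_value(20_000, gamma=0.9)
exact = 1.0 / (1.0 - 0.5 * 0.9)  # from v = 1 + 0.5 * gamma * v
```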

***Optimal policy***: The optimal policy is the policy that gives us the highest summed reward over the long run. More formally, as a form of ranking, a policy pi_1 is said to be better than or equal to a policy pi_2 if the expected return of pi_1 is greater than or equal to the expected return of pi_2 for every state. Or: pi_1 >= pi_2 iff v_pi_1(s) >= v_pi_2(s) forAll s in S. Due to the greater than or equal to, there may be more than one optimal policy; all of them are denoted pi_*.

All optimal policies share the same:

***optimal state-value function***: 
    
    v_*(s) = max_pi v_pi(s) forAll s in S. 

***optimal action-value function***: 

    q_*(s, a) = max_pi q_pi(s, a), forAll s in S, a in A(s)
    = E[R_t+1 + gamma*v_*(S_t+1) | S_t = s, A_t = a]

***Bellman optimality equation***: The value of a state under an optimal policy is equal to the expected return for the best action we can take from that state s.

    v_*(s) = max_a_in_A(s) q_pi_*(s, a)
    = max_a E_pi_*[G_t | S_t = s, A_t = a]
    = max_a SUM_s'_r p(s', r | s, a)[r + gamma*v_*(s')]

(Select the action (a) with the largest value of the following sum: for each possible next state and reward pair, the probability of that pair multiplied by the reward received plus gamma times the optimal state value of that next state.)

    

Meaning, starting at a state s, if we are following an optimal policy we will select the best possible action (in other words, the action that has the highest expected return).
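
One way to get a feel for the Bellman optimality equation is to apply its right-hand side repeatedly as an update until the values stop changing. This procedure is usually called value iteration and is not covered in these notes; it is swapped in here only as an illustration. The two-state MDP, its rewards, and gamma = 0.5 are all made up.

```python
GAMMA = 0.5
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]
# Hypothetical deterministic dynamics: (state, action) -> {(next_state, reward): probability}
DYNAMICS = {
    ("A", "stay"): {("A", 0.0): 1.0},
    ("A", "move"): {("B", 1.0): 1.0},
    ("B", "stay"): {("B", 2.0): 1.0},
    ("B", "move"): {("A", 0.0): 1.0},
}


def backup(dynamics, v, s, a, gamma):
    """SUM over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(p * (r + gamma * v[sn]) for (sn, r), p in dynamics[(s, a)].items())


def solve_optimal_values(dynamics, states, actions, gamma, sweeps=200):
    """Repeatedly apply v(s) <- max_a backup(s, a), starting from all zeros."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        v = {s: max(backup(dynamics, v, s, a, gamma) for a in actions) for s in states}
    return v


v_star = solve_optimal_values(DYNAMICS, STATES, ACTIONS, GAMMA)
# By hand: v_*(B) = 2 + 0.5 * v_*(B) = 4, and v_*(A) = 1 + 0.5 * 4 = 3.
```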

The same Bellman optimality logic applies for the action-value function: 

    q_*(s, a) = E[R_t+1 + gamma*max_a' q_*(S_t+1, a') | S_t = s, A_t = a]
    = SUM_s'_r p(s', r | s, a)[r + gamma* max_a' q_*(s', a')]


The beauty of having the optimal state-value function is that we can create an agent that selects actions greedily, evaluating only the immediate reward and the value of the next state for each action available in the current state (a one-step look ahead, with no deeper search into the future). But the agent is not short-sighted, because the optimal values from the state-value function already account for the expected future rewards, so it is effectively looking into the future.
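
A sketch of this one-step look ahead on a made-up two-state MDP. The dynamics and gamma = 0.5 are assumptions, and the values in `V_STAR` are the optimal values for this particular toy problem (they satisfy its Bellman optimality equation), not general constants.

```python
GAMMA = 0.5
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]
# Hypothetical deterministic dynamics: (state, action) -> {(next_state, reward): probability}
DYNAMICS = {
    ("A", "stay"): {("A", 0.0): 1.0},
    ("A", "move"): {("B", 1.0): 1.0},
    ("B", "stay"): {("B", 2.0): 1.0},
    ("B", "move"): {("A", 0.0): 1.0},
}
V_STAR = {"A": 3.0, "B": 4.0}  # optimal state values for this toy MDP


def one_step_value(dynamics, v, s, a, gamma):
    """Immediate reward plus discounted value of the next state, in expectation."""
    return sum(p * (r + gamma * v[sn]) for (sn, r), p in dynamics[(s, a)].items())


def greedy_policy(dynamics, v, states, actions, gamma):
    """For each state, pick the action with the best one-step look ahead."""
    return {
        s: max(actions, key=lambda a: one_step_value(dynamics, v, s, a, gamma))
        for s in states
    }


policy = greedy_policy(DYNAMICS, V_STAR, STATES, ACTIONS, GAMMA)
```

In state A the immediate reward for moving (1) is smaller than the reward for staying in B (2), yet the look ahead still picks "move" because the value of B already includes those later rewards.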

Oftentimes we don't have either 1. complete knowledge of the environment and its states, or 2. enough memory and computation power to calculate the optimal expected value for each state and so obtain an optimal policy to follow.

***Tabular case***: An environment whose state set is small enough that we can store an estimated value for each state (or each state-action pair) as an entry in a table or array in a computer's memory.

"In many cases of practical interest, however, there are far more states than could possibly be entries in a table. In these cases the functions must be approximated, using some sort of more compact parameterized function representation."

There can exist states that have such an extremely low probability of being encountered by an agent that taking a suboptimal action in them likely won't affect the overall expected total reward. Ex: taking the suboptimal action: (0.000002)*(2) vs (0.000002)*(4). (The probability of this state occurring (after a certain action) multiplied by the reward we get from it. Even though we took reward 2 instead of 4, with the probability being so low, our expected reward changes very little, if at all.)

"The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states. This is one key property that distinguishes reinforcement learning from other approaches to approximately solving MDPs."

"In reinforcement learning we are very much concerned with cases in which optimal solutions cannot be found but must be approximated in some way."