---
layout: post
title:  "Reinforcement Learning"
date:   2023-03-13 10:14:54 +0700
categories: jekyll update
---

# Introduction

In a Markov decision process (MDP), we know both the transition probabilities and the reward function. In reinforcement learning (RL), we know neither, yet we still try to maximize expected utility. This can be viewed as online learning: data arrives incrementally and we learn on the go. In other words, reinforcement learning is a setting where an agent takes actions and receives feedback in the form of a reward and the new state it ends up in. That is how it learns about the environment and updates its parameters. There are two ways to proceed in this setting: either we estimate the Markov decision process (its transition probabilities and rewards) and use it to compute the optimal policy, or we estimate the optimal policy directly. The first is called model-based learning and the second is called model-free learning.

In model-based learning, we estimate the transition probabilities as follows:

$$ \hat{P}(s,a,s') = \frac{\text{number of times } (s,a,s') \text{ occurs}}{\text{number of times } (s,a) \text{ occurs}} $$

If a pair $$ (s,a) $$ has never been observed, the estimate is $$ \frac{0}{0} $$. In that case, set it to $$ \frac{1}{\text{total number of states}} $$ so that every next state is equally probable.
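The counting estimate and its uniform fallback can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `estimate_transitions`, the state names, and the action name `"go"` are all made up for the example.

```python
from collections import Counter

def estimate_transitions(transitions, states):
    """Return a function p_hat(s, a, s_next) built from observed (s, a, s_next) triples.

    `states` is the full list of states; it is only needed for the uniform fallback.
    """
    sas_counts = Counter()  # how often (s, a, s_next) occurred
    sa_counts = Counter()   # how often (s, a) occurred
    for s, a, s_next in transitions:
        sas_counts[(s, a, s_next)] += 1
        sa_counts[(s, a)] += 1

    def p_hat(s, a, s_next):
        if sa_counts[(s, a)] == 0:
            # (s, a) was never tried: fall back to a uniform distribution.
            return 1.0 / len(states)
        return sas_counts[(s, a, s_next)] / sa_counts[(s, a)]

    return p_hat

# Hypothetical experience: three observed steps from state "A" with action "go".
states = ["A", "B"]
observed = [("A", "go", "B"), ("A", "go", "B"), ("A", "go", "A")]
p_hat = estimate_transitions(observed, states)

print(p_hat("A", "go", "B"))  # 2 of the 3 observed ("A", "go") steps led to "B"
print(p_hat("B", "go", "A"))  # ("B", "go") was never observed, so uniform over 2 states
```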

The rewards can be estimated from the reward $$ r $$ observed in each transition $$ (s,a,r,s') $$, averaging over repeated observations of the same transition:

$$ \hat{R}(s,a,s') = r \text{ in } (s,a,r,s') $$
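As a rough sketch of the reward estimate, the snippet below averages the rewards seen for each $$ (s,a,s') $$ triple. The helper name `estimate_rewards` and the sample experience are assumptions made up for illustration; with deterministic rewards the average is just the observed $$ r $$.

```python
from collections import defaultdict

def estimate_rewards(experience):
    """Average the observed reward r for every (s, a, s_next) seen in the experience."""
    totals = defaultdict(float)  # sum of rewards per (s, a, s_next)
    counts = defaultdict(int)    # number of observations per (s, a, s_next)
    for s, a, r, s_next in experience:
        totals[(s, a, s_next)] += r
        counts[(s, a, s_next)] += 1
    return {key: totals[key] / counts[key] for key in counts}

# Hypothetical experience as (s, a, r, s_next) tuples.
experience = [
    ("A", "go", 1.0, "B"),
    ("A", "go", 3.0, "B"),
    ("A", "go", 0.0, "A"),
]
r_hat = estimate_rewards(experience)
print(r_hat[("A", "go", "B")])  # average of the two rewards observed for this transition
```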

Now we can use these estimates to solve the MDP through policy iteration or value iteration. In general, the algorithm runs as follows:

- Initialize a random policy $$ \pi $$

- Run $$ \pi $$ a number of times to collect experience

- Using the accumulated experience, estimate P and R

- Use value iteration to estimate V

- Update $$ \pi $$ greedily (choose the action that looks best under the current estimates)
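The last two steps of the loop (value iteration on the estimated model, then a greedy policy update) might look like the sketch below. The toy two-state model, the discount factor `gamma=0.9`, the iteration count, and every name in the code are assumptions chosen for illustration. In a full loop, `p_hat` and `r_hat` would come from experience rather than being written by hand.

```python
def value_iteration(states, actions, p_hat, r_hat, gamma=0.9, n_iters=100):
    """Run value iteration on an estimated model, then extract a greedy policy.

    p_hat and r_hat are dicts keyed by (s, a, s_next); missing keys count as 0.
    """
    def q_value(s, a, V):
        # Expected one-step reward plus discounted value of the next state.
        return sum(
            p_hat.get((s, a, s2), 0.0) * (r_hat.get((s, a, s2), 0.0) + gamma * V[s2])
            for s2 in states
        )

    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: max(q_value(s, a, V) for a in actions) for s in states}

    # Greedy update: in every state pick the action with the highest estimated value.
    policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
    return V, policy

# Hypothetical estimated model: "move" switches state, "stay" keeps it,
# and landing in "B" pays a reward of 1.
states = ["A", "B"]
actions = ["stay", "move"]
p_hat = {
    ("A", "stay", "A"): 1.0, ("A", "move", "B"): 1.0,
    ("B", "stay", "B"): 1.0, ("B", "move", "A"): 1.0,
}
r_hat = {("A", "move", "B"): 1.0, ("B", "stay", "B"): 1.0}

V, policy = value_iteration(states, actions, p_hat, r_hat)
print(policy)  # the greedy policy heads to "B" and stays there
```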

In model-free learning, we estimate values such as $$ V^* $$ directly, without first estimating P and R. Suppose we are in state s, take action a, and end up in state s' with reward r. $$ Q_{\pi}(s,a) $$ is the expected utility of starting at s, taking action a, and then following policy $$ \pi $$. The utility $$ u_t $$ is simply the discounted sum of rewards from time t onward. In this setting, the estimate $$ \hat{Q}_{\pi}(s,a) $$ is the average of the utilities $$ u_t $$ observed at the times t where the agent was in s and took action a. Equivalently, for each observation (s,a,u), $$ \hat{Q}_{\pi}(s,a) $$ can be updated as a convex combination of itself and u:

$$ \hat{Q}_{\pi} (s,a) \leftarrow (1 - \alpha) \hat{Q}_{\pi} (s,a) + \alpha u $$

with $$ \alpha = \frac{1}{1+(\text{number of updates to (s,a)})} $$

This is equivalent to:

$$ \hat{Q}_{\pi}(s,a) \leftarrow \hat{Q}_{\pi}(s,a) - \alpha (\hat{Q}_{\pi}(s,a) - u) $$
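To see that this update really computes a running average of the observed utilities, here is a small sketch. The class `QEstimator`, the helper `discounted_utility`, the discount factor `gamma`, and the sample numbers are all hypothetical and only meant to illustrate the rule.

```python
from collections import defaultdict

def discounted_utility(rewards, gamma):
    """Discounted sum r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... (gamma is assumed)."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

class QEstimator:
    """Keeps Q_hat(s, a) as a running average of the utilities observed for (s, a)."""

    def __init__(self):
        self.q = defaultdict(float)
        self.n_updates = defaultdict(int)

    def update(self, s, a, u):
        # alpha = 1 / (1 + number of previous updates to (s, a))
        alpha = 1.0 / (1 + self.n_updates[(s, a)])
        # Same as: q - alpha * (q - u)
        self.q[(s, a)] = (1 - alpha) * self.q[(s, a)] + alpha * u
        self.n_updates[(s, a)] += 1
        return self.q[(s, a)]

est = QEstimator()
utilities = [4.0, 8.0, 6.0]  # hypothetical utilities observed after (s0, a0)
for u in utilities:
    est.update("s0", "a0", u)

print(est.q[("s0", "a0")])                # matches the plain average, up to rounding
print(sum(utilities) / len(utilities))
print(discounted_utility([1.0, 2.0, 4.0], gamma=0.5))
```

With this choice of $$ \alpha $$ the first update simply overwrites the initial value, so the initialization of $$ \hat{Q}_{\pi} $$ does not matter.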

This update looks like a (stochastic) gradient descent step for linear regression. The objective is to minimize the squared error $$ (\hat{Q}_{\pi}(s,a) - u)^2 $$, where $$ \hat{Q}_{\pi}(s,a) $$ is treated as the prediction and u as the ground truth. The regression is updated as data comes in (hence the name online learning).
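To make the connection explicit, here is a short derivation. The factor $$ \frac{1}{2} $$ in front of the loss is a convention assumed here for convenience; without it, the constant 2 is simply absorbed into the step size $$ \alpha $$.

$$ \frac{\partial}{\partial \hat{Q}_{\pi}(s,a)} \, \frac{1}{2} (\hat{Q}_{\pi}(s,a) - u)^2 = \hat{Q}_{\pi}(s,a) - u $$

A gradient descent step with step size $$ \alpha $$ subtracts $$ \alpha $$ times this gradient from the current estimate:

$$ \hat{Q}_{\pi}(s,a) \leftarrow \hat{Q}_{\pi}(s,a) - \alpha (\hat{Q}_{\pi}(s,a) - u) $$

which is exactly the update rule for $$ \hat{Q}_{\pi}(s,a) $$.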
