# Reinforcement Learning

<img src="../assets/imgs/dl/Page1.jpg" alt="RL basic idea" width="500" height="333">

## Chapter 1

Learning from interaction
map situations to actions 
trial and error search 
delayed reward

...ing, like machine learning, is simultaneously a problem, a class of solution methods that work 
well on the problem, and the field that studies this problem and its solution methods.

It is convenient to use a single name for all three things, but at the same time essential to keep
the three conceptually separate.

In particular, the distinction between problems and solution methods is very important in 
reinforcement learning; failing to make this distinction is the source of many confusions.

the real problem facing a learning agent interacting over time with its environment to achieve a goal

A learning agent must be able to **`sense`** the state of its environment to some extent and must be able to take **`actions`** that affect the state.

The agent also must have a **`goal`** or goals relating to the state of the environment.

The object of this kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly in situations not present in the training set.

an agent must be able to learn from its own experience.

In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act.

Although one might be tempted to think of reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure.

the third paradigm of machine learning

## Challenges of Reinforcement Learning

1. trade-off between exploration and exploitation
 <!-- exploitation -->
 To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward.
 <!-- exploration -->
 But to discover such actions, it has to try actions that it has not selected before.
 
 The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future.

 The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task.

 The agent must try a variety of actions and progressively favor those that appear to be best.

 On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward.

2. it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment.

Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking agent.  

All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments.

* When reinforcement learning involves planning, it has to address the interplay between planning and real-time action selection, as well as the question of how environment models are acquired and improved.

When reinforcement learning involves supervised learning, it does so for specific reasons that determine which capabilities are critical and which are not.

For learning research to make progress, important subproblems have to be isolated and studied, but they should be subproblems that play clear roles in complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be filled in.

<!-- handler of solving a problem -->

## Exploration-Exploitation Trade-off

The exploration-exploitation trade-off is a fundamental concept in reinforcement learning (RL).

1. Exploitation:

This means choosing the action that maximizes the immediate reward based on your current knowledge. Essentially, you’re exploiting what you’ve already learned.
+ For example, if you’re playing a slot machine and you know that Machine A tends to give higher rewards, you keep pulling Machine A’s lever.

2. Exploration:

This means trying new actions to discover potentially better rewards in the long run.
+ For example, even though Machine A gives good rewards, you still try Machine B or C to find out if there’s an even better option.


Why is it important?
+ If you only exploit, you might miss better opportunities (getting stuck in a local optimum).
+ If you only explore, you waste time trying everything without benefiting from what you’ve already learned.

3. Balancing the Trade-off:
+ $ε$-greedy strategy: With probability $ε$, you explore randomly; otherwise, you choose the best-known action (exploit).
+ Softmax exploration: Assigns probabilities to actions based on their expected rewards and samples accordingly.
+ Upper Confidence Bound (UCB): Balances exploration and exploitation by considering both the reward and the uncertainty of an action.
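
The three strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the `temperature` parameter, and the exploration constant `c` are assumptions made for this example, and the UCB bonus uses the common form $c\sqrt{\ln t / N(a)}$.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

def softmax_probs(q_values, temperature=1.0):
    """Turn value estimates into action probabilities (higher estimate, higher probability)."""
    top = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action maximizing estimate + uncertainty bonus; untried actions go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return scores.index(max(scores))

q = [1.0, 2.0, 1.5]  # hypothetical value estimates for three actions
print(epsilon_greedy(q, epsilon=0.0))            # purely greedy when epsilon is 0
print(softmax_probs(q))                          # probabilities that sum to 1
print(ucb_action(q, counts=[10, 10, 1], t=21))   # rarely tried action gets a large bonus
```

Note how UCB prefers the third action here even though its estimate is not the highest: it has been tried only once, so its uncertainty bonus dominates.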




4. Real-world analogy:

Imagine you’re traveling in a new city.
+ Exploitation: You go to the same highly-rated restaurant every day because you know it’s good.
+ Exploration: You try different places to discover hidden gems.

## Elements of Reinforcement Learning

a policy, a reward signal, a value function, and, optionally, a model of the environment.

+ 1. A policy defines the learning agent’s way of behaving at a given time. 

Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.


+ 2. A reward signal defines the goal of a reinforcement learning problem. 

On each time step, the environment sends to the reinforcement learning agent a single number called the reward. 

The agent’s sole objective is to maximize the total reward it receives over the long run.

The reward signal thus defines what are the good and bad events for the agent.

+ 3. Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.

Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. 

Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.

For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true.
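
A tiny worked example may make the difference concrete. The chain of states and the reward numbers below are hypothetical, and the value is taken as the plain (undiscounted) sum of future rewards to match the description above.

```python
# Hypothetical deterministic chain: A -> B -> C -> (end).
# rewards[s] is the reward received on leaving state s.
rewards = {"A": 0.0, "B": 1.0, "C": 10.0}
next_state = {"A": "B", "B": "C", "C": None}

def value(state):
    """Total reward accumulated from `state` until the end of the chain."""
    if state is None:
        return 0.0
    return rewards[state] + value(next_state[state])

for s in rewards:
    print(s, "reward:", rewards[s], "value:", value(s))
```

State `A` yields the lowest immediate reward (0), yet its value (11) is as high as that of `B`, because it is always followed by the rewarding states `B` and `C`.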

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. 
+ Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments.

+ We seek actions that bring about states of highest value, not highest reward, 
+ because these actions obtain the greatest amount of reward for us over the long run

Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime.

In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.


+ 4. A model of the environment is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave.

For example, given a state and action, the model might predict the resultant next state and next reward. 

Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.

Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning.
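
As a rough sketch of what "using a model for planning" can look like, the snippet below stores a model as a lookup table and searches it a few steps ahead before acting. The states, actions, rewards, and the `plan` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical model: (state, action) -> (predicted next state, predicted reward).
model = {
    ("start", "left"): ("pit", -5.0),
    ("start", "right"): ("hall", 0.0),
    ("hall", "left"): ("start", 0.0),
    ("hall", "right"): ("goal", 10.0),
}
actions = ["left", "right"]

def plan(state, depth):
    """Return (best total predicted reward, best first action) by searching the model."""
    if depth == 0:
        return 0.0, None
    best_total, best_action = float("-inf"), None
    for action in actions:
        if (state, action) not in model:
            continue  # the model has no prediction for this pair
        nxt, reward = model[(state, action)]
        future, _ = plan(nxt, depth - 1)
        if reward + future > best_total:
            best_total, best_action = reward + future, action
    if best_action is None:
        return 0.0, None  # terminal or unknown state
    return best_total, best_action

print(plan("start", depth=2))  # (10.0, 'right')
```

The agent decides to go right without ever having to fall into the pit: the future situations are considered inside the model rather than experienced. A model-free learner would have to discover the same thing by trial and error.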

### Limitations and Scope

state action reward value environment 

Our focus is on reinforcement learning methods that learn while interacting with the environment, which evolutionary methods do not do.

## Chapter 2 Multi-armed Bandits

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.

+ Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible. 

+ Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken.

+ Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.

## How to Compute Q(a)

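In the bandit setting, the action value Q(a) is the expected reward for choosing action a. In the full reinforcement learning problem it generalizes to Q(s, a): the expected cumulative reward when choosing action a in state s and following a policy thereafter.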

1. In Tabular Methods (e.g., Q-Learning):

If the state and action spaces are small, you can maintain a Q-table, where each entry Q(s, a) is updated iteratively based on experience.

Q-Learning Update Rule:
$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

$s$, $a$: Current state and action.

$r$: Immediate reward.

$s'$: Next state.

$a'$: Candidate action in the next state (the maximum is taken over all of them).

$\alpha$: Learning rate (how much to update).

$\gamma$: Discount factor (how much future rewards matter).


```python
import numpy as np

# Initialize Q-table with zeros
Q = np.zeros((5, 2))  # 5 states, 2 actions

# Sample experience (s, a, r, s')
state = 2
action = 1
reward = 10
next_state = 3

# Hyperparameters
alpha = 0.1  # learning rate
gamma = 0.9  # discount factor

# Q-Learning update
Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state, :]) - Q[state, action])

print(Q)
```

2. In Deep Q-Network (DQN):

When the state space is large or continuous (e.g., images from Atari games), a neural network with parameters $\theta$ is used to approximate the Q-function:
$$
Q(s, a; \theta) \approx Q^*(s, a)
$$
where $Q^*$ denotes the optimal action-value function.

+ Input: State $s$.
+ Output: Q-values for all possible actions.

⸻

DQN Loss Function:
$$
L(\theta) = \left( y_{\text{target}} - Q(s, a; \theta) \right)^2
$$
where the target is:
$$
y_{\text{target}} = r + \gamma \max_{a'} Q(s', a'; \theta^{-})
$$

$\theta^{-}$ denotes the parameters of the target network, a periodically updated copy of the current Q-network that keeps the target stable.
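
The loss and target above can be sketched with NumPy for a single transition. To keep the example short, a linear map stands in for the neural network; that substitution, the feature size, and the `done` flag (no bootstrapping from terminal states) are assumptions of this sketch, not part of the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2

# Stand-in for a neural network: a linear map from state features to Q-values.
theta = rng.normal(size=(n_features, n_actions))  # online parameters
theta_target = theta.copy()                        # target parameters (theta^-)

def q_values(state, params):
    """Q-values for every action in `state` under the given parameters."""
    return state @ params

def dqn_loss(s, a, r, s_next, done, gamma=0.9):
    """Squared error between the target and the current estimate for one transition."""
    bootstrap = 0.0 if done else gamma * np.max(q_values(s_next, theta_target))
    y_target = r + bootstrap            # uses the frozen target parameters
    prediction = q_values(s, theta)[a]  # uses the online parameters
    return (y_target - prediction) ** 2

s = rng.normal(size=n_features)
s_next = rng.normal(size=n_features)
print(dqn_loss(s, a=1, r=1.0, s_next=s_next, done=False))
```

In a real agent the gradient of this loss is taken with respect to `theta` only, and `theta_target` is refreshed from `theta` every so often.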

⸻

3. Exploration Strategy:

To estimate Q(a) accurately, the agent needs to explore the environment. This is where strategies like ε-greedy come in, which balance exploration (trying new actions) and exploitation (choosing the best action so far).

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest.

The "estimated value" here is the action-value estimate Q(a), the estimated expected reward for each action a.


1. Action-Value Estimation:
   + As your RL agent interacts with the environment, it updates the value estimates for each action based on the rewards it receives.
   + For example, in a simple multi-armed bandit problem (like slot machines), you estimate the average reward for each machine based on past trials.
2. At Every Time Step:
   + At any point in time, there will always be one or more actions with the highest estimated value.
   + For instance, if your estimates are $Q(a_1) = 5$, $Q(a_2) = 8$, and $Q(a_3) = 3$, the action $a_2$ is currently the best because it has the highest estimated value.
3. Handling Ties:
   + If multiple actions have the same highest value, you can break ties randomly or apply a small perturbation to distinguish them.
4. Why Is This Important?
   + In greedy policies, you select the action with the highest estimated value (exploitation).
   + In ε-greedy policies, you sometimes explore (choose random actions) to avoid getting stuck in local optima.
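
The estimation and tie-breaking steps above might look as follows in code. The class name and the reward sequence are made up for this sketch; the estimates are plain sample averages, updated incrementally so that no reward history needs to be stored.

```python
import random

class ActionValueEstimator:
    """Sample-average estimates Q(a), updated incrementally."""

    def __init__(self, n_actions):
        self.q = [0.0] * n_actions  # current estimate per action
        self.n = [0] * n_actions    # number of times each action was tried

    def update(self, action, reward):
        self.n[action] += 1
        # Incremental mean: Q <- Q + (reward - Q) / N
        self.q[action] += (reward - self.q[action]) / self.n[action]

    def greedy_action(self, rng=random):
        best = max(self.q)
        ties = [a for a, q in enumerate(self.q) if q == best]
        return rng.choice(ties)     # break ties uniformly at random

est = ActionValueEstimator(3)
for action, reward in [(0, 4.0), (0, 6.0), (1, 8.0), (2, 3.0)]:
    est.update(action, reward)
print(est.q)                # [5.0, 8.0, 3.0]
print(est.greedy_action())  # 1
```

The resulting estimates reproduce the example values 5, 8, and 3, so the greedy choice is the second action (index 1).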

**Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.**

🎯 The Problem:

You just moved to a new city and want to find the best restaurant.
+ You have three options:
1. Restaurant A (you’ve been there once and had a great experience, so you believe it’s the best).
2. Restaurant B (you’ve heard it’s good but haven’t tried it).
3. Restaurant C (you know little about it).

⸻

✅ Exploitation (Greedy Action):

If you always go to Restaurant A, you are exploiting your current knowledge. This maximizes your immediate reward (a good meal), but ignores potential better options (B or C).

⸻

🔍 Exploration (Nongreedy Action):

If you try Restaurant B or C, you are exploring. You may get a worse meal in the short term, but gather new information. If one of them is actually better than A, you can exploit it repeatedly in the future, leading to higher total rewards in the long run.

⸻

⚖️ The Conflict:
+ If you always exploit, you miss better opportunities.
+ If you only explore, you waste time on bad choices.

⸻

🎲 The Strategy (ε-greedy):
+ With 90% probability, go to the best-known restaurant (exploitation).
+ With 10% probability, try a random one (exploration).

⸻

🧠 In Reinforcement Learning Terms:
+ Q(a): The estimated value of action a (like the quality of each restaurant).
+ $\epsilon$-greedy strategy: A balance between exploration and exploitation.
+ Goal: Refine your action-value estimates over time and maximize long-term rewards.

Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation.

Greedy action selection always exploits current knowledge to maximize immediate reward;
it spends no time at all sampling apparently inferior actions to see if they might really be better. 

Why, in $\epsilon$-greedy methods, does the probability of selecting the optimal action converge to a value greater than $1 - \epsilon$?

1. Understanding $\epsilon$-Greedy Policy

In ε-greedy action selection:
 + With probability $\epsilon$, the agent explores (chooses a random action).
 + With probability $1 - \epsilon$, the agent exploits (chooses the best-known action).

Let’s define:
 + $a^*$ as the true optimal action.
 + $\hat{a}^*$ as the agent’s current best estimate of the optimal action (which improves over time).

---

2. Behavior as Learning Progresses

As the agent gathers more experience:
 1. Initially, the agent doesn’t know $a^*$, so it explores and refines $Q(s,a)$.
 2. Over time, the Q-values converge, and the agent correctly estimates $Q(s, a^*)$.
 3. Eventually, $\hat{a}^* \to a^*$ (i.e., the agent’s best action matches the optimal action).

---

3. Convergence to a Probability $> 1 - \epsilon$

Once the agent has correctly learned the optimal action:
 + With probability $1 - \epsilon$, it chooses the optimal action (exploitation).
 + With probability $\epsilon$, it chooses randomly among all actions.

If there are n actions:
 + The chance of picking $a^*$ during exploration is $\frac{1}{n}$.
 + So, the probability of picking the optimal action overall is:

$$
P(\text{selecting } a^*) = (1 - \epsilon) + \epsilon \cdot \frac{1}{n}
$$
Since $\frac{1}{n}$ is always positive, we get:
$$
P(\text{selecting } a^*) > 1 - \epsilon
$$

---

4. Why This Matters

This means that as the agent learns better Q-values, the probability of choosing the optimal action remains high (> $1 - \epsilon$).

Even with exploration, the agent still frequently picks the optimal action. This ensures that:
 + The agent exploits most of the time.
 + It still explores enough to refine Q-values.

 ---

Example (5 Actions, $\epsilon = 0.1$)
 + With probability 0.9, the agent picks the best-known action.
 + With probability 0.1, it randomly picks one of 5 actions ($\frac{1}{5} = 0.2$).
 + Total probability of picking the optimal action:
$$
P(a^*) = 0.9 + 0.1 \times 0.2 = 0.92
$$
So, the probability is greater than 1 - 0.1 = 0.9.
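
The formula can also be checked numerically. The sketch below assumes the agent has already identified the optimal action (labelled action 0 here, an arbitrary choice) and compares the closed form with a simple simulation; the function names are illustrative.

```python
import random

def p_optimal(epsilon, n_actions):
    """Closed form: exploit with 1 - epsilon, or hit the optimal action while exploring."""
    return (1 - epsilon) + epsilon / n_actions

def simulate(epsilon, n_actions, trials=100_000, seed=0):
    """Empirical frequency of picking action 0, assumed to be the optimal action."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: may still land on action 0
        else:
            action = 0                         # exploit
        hits += action == 0
    return hits / trials

print(p_optimal(0.1, 5))  # 0.92, up to floating-point rounding
print(simulate(0.1, 5))   # close to 0.92
print(p_optimal(0.5, 2))  # 0.75
```

The simulated frequency should land close to the closed-form value, and both stay above $1 - \epsilon$.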

---

Conclusion

The key reason $P(a^*) > 1 - \epsilon$ is that, even when exploring, there’s always a small chance of choosing the optimal action. Over time, as Q-values converge, the agent exploits more effectively, keeping the probability of selecting the optimal action high.



Exercise 2.1 In $\epsilon$-greedy action selection, for the case of two actions and $\epsilon=0.5$, what is the probability that the greedy action is selected?

$\because$ The greedy action can be selected in two ways:
1. With probability $1 - \epsilon$ the agent exploits and always picks the greedy action.
2. With probability $\epsilon$ the agent explores and picks the greedy action with probability $\frac{1}{2}$, since the size of the action space is $2$.

$\therefore P(\hat{a}^*) = (1 - \epsilon) + \frac{1}{2} \times \epsilon = 0.5 + 0.25 = 0.75$