# Reinforcement Learning

Reinforcement learning (RL) is a framework in which an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards, and its goal is to learn a strategy that maximizes the total accumulated reward.

## Fundamental Components

### State ($s$)
The **state** represents the current situation or configuration of the system. It may include various factors such as:
- **Position and Orientation:** For an autonomous vehicle or robot, the state includes its current position, orientation, and speed.
- **Environment Details:** In a game, the state could be the arrangement of pieces on a board or the configuration of a game level.

### Action ($a$)
An **action** is any decision or move that the agent can make. Actions affect the state of the system. For example:
- **Control Inputs:** For an autonomous helicopter, actions might involve adjusting the control sticks.
- **Movement Decisions:** For a Mars rover, actions could include moving left or right to reach a desired location.
- **Game Moves:** In chess, an action is one of the legal moves available from a given board position.

### Reward ($R(s)$)
The **reward** is a numerical value that provides feedback on the outcome of an action. It indicates how favorable a particular state or action is:
- **Positive Reward:** Signals that the agent is performing well (e.g., maintaining stable flight or achieving a winning move in a game).
- **Negative Reward:** Indicates poor performance (e.g., crashing an aircraft or making a losing move).

The reward function helps guide the learning process by reinforcing behaviors that lead to higher rewards.

### Discount Factor ($\gamma$)
The **discount factor** ($\gamma$) is a number between 0 and 1 that determines the importance of future rewards relative to immediate rewards. A discount factor close to 1 means future rewards are nearly as valuable as immediate ones, while a lower value makes the agent focus more on immediate rewards.

The idea is captured mathematically when computing the **return**.

### Return ($G$)
The **return** is the total accumulated reward over time, but with future rewards discounted to reflect their delayed benefit. It is given by the formula:

$$
G = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \cdots
$$

This formula shows that, whenever $\gamma < 1$, rewards received later are worth less than rewards received immediately, and the smaller $\gamma$ is, the faster their value shrinks. For example, if $\gamma = 0.9$, the reward $R_3$, received two steps after the first, is multiplied by $0.9^2 = 0.81$.
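
As a minimal sketch of the formula above, the return of a finite reward sequence can be computed by working backwards through the rewards. The function name `discounted_return` and the sample reward list are illustrative assumptions, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Compute G = R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite list of rewards."""
    total = 0.0
    # Walk backwards so that each step applies G_t = R_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward sequence: three steps, each worth 1.
rewards = [1.0, 1.0, 1.0]
g = discounted_return(rewards, gamma=0.9)  # 1 + 0.9 + 0.81
```

Working backwards avoids computing powers of $\gamma$ explicitly, but a forward loop that multiplies a running discount by $\gamma$ at each step would give the same result.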

### Policy ($\pi$)
A **policy** is a function that maps each state to an action. It defines the agent’s strategy for decision-making. In mathematical terms, a policy is written as:

$$
\pi(s) = a
$$

The objective in reinforcement learning is to find an optimal policy: one that maximizes the expected return from every state.
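
For a small, discrete problem, a deterministic policy can be sketched as a plain lookup table. The state names and actions below are hypothetical and chosen only for illustration.

```python
# Hypothetical states and actions for a tiny corridor world.
policy = {
    "start": "right",
    "middle": "right",
    "near_goal": "right",
}


def pi(state):
    """Deterministic policy pi(s) = a: look up the action chosen for this state."""
    return policy[state]


action = pi("middle")  # the action this policy prescribes in state "middle"
```

In larger problems the table is usually replaced by a function approximator, but the interface stays the same: a state goes in, an action comes out.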

## The Markov Decision Process (MDP)

Reinforcement learning problems are commonly formulated as a **Markov decision process (MDP)**, which provides a formal framework consisting of:

- **States:** All possible situations the agent can encounter.
- **Actions:** All available decisions or moves.
- **Rewards:** Immediate feedback for actions taken.
- **Transition Dynamics:** Rules, possibly probabilistic, that determine how actions change the state.
- **Markov Property:** The principle that the future state depends only on the current state and the action taken, not on the history of past states.

This framework allows for a clear mathematical treatment of decision-making problems and helps in designing algorithms that learn effective policies.
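
The components listed above can be gathered into a single container. The sketch below assumes deterministic transitions and a reward that depends only on the state, which is a simplification (general MDPs allow probabilistic transitions). The class name `MDP`, the helper `clamp_move`, and the three-state toy problem are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MDP:
    states: Sequence[int]
    actions: Sequence[str]
    reward: Callable[[int], float]         # R(s)
    transition: Callable[[int, str], int]  # next state from (s, a) only
    gamma: float


def clamp_move(state: int, action: str) -> int:
    """Move one cell left or right, staying inside the states 1..3."""
    step = -1 if action == "left" else 1
    return min(3, max(1, state + step))


# Toy problem: three cells in a row, with a reward of 1 in the rightmost cell.
toy = MDP(
    states=[1, 2, 3],
    actions=["left", "right"],
    reward=lambda s: 1.0 if s == 3 else 0.0,
    transition=clamp_move,
    gamma=0.9,
)

next_state = toy.transition(2, "right")
```

The Markov property shows up in the signature of `transition`: it receives only the current state and action, never the history of earlier states.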

## Examples for Better Understanding

### Autonomous Systems (e.g., Helicopter)
- **State:** The helicopter’s position, speed, orientation, and sensor data (such as GPS and accelerometer readings).
- **Action:** Adjustments to control sticks.
- **Reward:** A small positive reward (e.g., +1 per second) when flying stably, and a large negative reward (e.g., -1000) if it crashes.
- **Goal:** To learn a policy that keeps the helicopter flying safely and performing maneuvers like aerobatic stunts.
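
The reward scheme in this example could be written as a small function. This is only a sketch: the function name, the `crashed` flag, and the one-second time step are assumptions, and a real controller would derive the crash condition from sensor data.

```python
def helicopter_reward(crashed: bool, seconds_elapsed: float = 1.0) -> float:
    """Reward for one time step: +1 per second of stable flight, -1000 on a crash."""
    if crashed:
        return -1000.0
    return 1.0 * seconds_elapsed


# Undiscounted total for a hypothetical episode: 60 stable seconds, then a crash.
episode_total = sum(helicopter_reward(False) for _ in range(60)) + helicopter_reward(True)
```

Because the crash penalty is far larger than the per-second reward, even a long stable flight that ends in a crash scores worse than not flying at all, which pushes the agent toward avoiding crashes.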

### Mars Rover Navigation
Imagine a simplified rover that can be in one of six positions (states 1 through 6):
- **States:** Six positions with specific rewards assigned to some states (e.g., state 1 gives a reward of 100, state 6 gives 40, while states 2–5 provide zero reward).
- **Actions:** The rover can choose to move left or right.
- **Return Calculation:**  
  For instance, if the rover starts at state 4 and keeps moving left with a discount factor of $\gamma = 0.5$, the rewards collected at states 4, 3, 2, and 1 are weighted as follows:

$$
G = 0 + 0.5 \times 0 + 0.5^2 \times 0 + 0.5^3 \times 100 = 12.5
$$

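- **Policy:** A good policy picks, in each state, the direction with the higher return rather than always heading for the largest reward. From state 4, moving left (return 12.5) beats moving right, which yields $0.5^2 \times 40 = 10$ if the episode also ends at state 6. From state 5, however, moving right yields $0.5 \times 40 = 20$, more than the $0.5^4 \times 100 = 6.25$ obtained by moving left.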
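
The return calculation for the rover can be reproduced with a short script. It is a sketch under two assumptions not stated explicitly in the text: states 1 and 6 end the episode, and the rover collects the reward of every state it visits, including the starting state (which is how the equation above is laid out). The names `REWARDS`, `TERMINAL`, and `rollout_return` are hypothetical.

```python
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}  # assumption: the episode ends at either end of the track


def rollout_return(start, direction, gamma):
    """Discounted return from moving in one fixed direction until a terminal state."""
    step = -1 if direction == "left" else 1
    state, discount, total = start, 1.0, 0.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:
            return total
        state += step
        discount *= gamma


left_from_4 = rollout_return(4, "left", gamma=0.5)    # 0 + 0.5*0 + 0.25*0 + 0.125*100
right_from_4 = rollout_return(4, "right", gamma=0.5)  # 0 + 0.5*0 + 0.25*40
```

Comparing the two values for each starting state shows which direction the rover should prefer there under this discount factor.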

### Game Playing (e.g., Chess)
- **State:** The configuration of the chess board.
- **Action:** A legal move from the current board position.
- **Reward:** Typically defined as +1 for winning, -1 for losing, and 0 for a draw.
- **Discount Factor:** Often chosen very close to 1 (e.g., $\gamma = 0.99$) to emphasize long-term outcomes.
- **Policy:** A function that selects the best move from a given board state to maximize the chance of winning the game.
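
When the only non-zero reward arrives at the end of the game, the return reduces to a single discounted term. The sketch below assumes the outcome reward is received on the final move and that every earlier move has a reward of zero. The function name and the move counts are hypothetical.

```python
def terminal_only_return(num_moves, outcome, gamma=0.99):
    """Return seen from the first move when only the last move carries a reward.

    outcome is +1 for a win, -1 for a loss, and 0 for a draw.
    """
    return (gamma ** (num_moves - 1)) * outcome


quick_win = terminal_only_return(20, 1.0)
slow_win = terminal_only_return(60, 1.0)
```

With $\gamma$ close to 1, a win after many moves is still worth nearly as much as a quick one, yet `quick_win` remains slightly larger than `slow_win`, so the agent keeps a mild preference for winning sooner.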

## Summary

In reinforcement learning, an agent learns to maximize its accumulated reward through:
- Observing the **state** of the environment.
- Taking an **action** based on a **policy**.
- Receiving a **reward** that provides feedback.
- Considering future rewards using a **discount factor**.
- Evaluating the overall success using the **return**.

The problem is typically modeled as a **Markov decision process (MDP)**, where the next state depends solely on the current state and the chosen action. This framework is versatile and can be applied to robotics, autonomous systems, game playing, financial trading, and many other complex decision-making tasks.
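
The interaction cycle summarized above can be sketched as a loop. Everything here is a hypothetical toy: the `CorridorEnv` class, its `step` method, and the `always_right` policy are assumptions made for illustration, and in this sketch the reward is given on arrival in a state.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor with states 0..4 and a reward of 1.0 on reaching state 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        """Apply the action and return (next_state, reward, done)."""
        move = 1 if action == "right" else -1
        self.state = min(4, max(0, self.state + move))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def always_right(state):
    """A fixed policy that ignores the state and always moves right."""
    return "right"


def run_episode(env, policy, gamma, max_steps=100):
    """Observe the state, act, receive a reward, and accumulate the discounted return."""
    state, total, discount = env.state, 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total


episode_return = run_episode(CorridorEnv(), always_right, gamma=0.5)
```

Learning algorithms differ in how they use the rewards gathered in such a loop to improve the policy, but the loop itself has the same shape in most of them.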