# Understanding Reinforcement Learning in N Easy Steps

## Why learn Reinforcement Learning (RL)?

To me, the most basic reinforcement learning model resembles science-fiction AI more than any large language model of today. Consider an RL agent playing (and finishing) an insanely difficult level of Super Mario.

In the beginning, this agent has no idea of what the controls are, how to progress through the game, what the obstacles are or what finishes the game. The agent must learn all these things without any human intervention - all through the power of reinforcement learning algorithms. 

RL agents excel in situations where traditional machine learning algorithms struggle. They can solve problems without predefined solutions or explicitly programmed actions and, most importantly, without mounds and mounds of pre-collected, labeled data. That's why RL is having a significant impact on many fields. For instance, it's used in:

- Self-driving cars: RL agents can learn optimal driving strategies based on traffic conditions and road rules.
- Robotics: Robots can be trained to perform complex tasks in dynamic environments through RL.
- Game playing: AI agents can learn complex strategies in games like Go or StarCraft II using RL techniques.

Reinforcement learning is a rapidly evolving field with vast potential. As research progresses, we can expect even more groundbreaking applications in areas like resource management, healthcare, and personalized learning. 

That's why now is the best time to learn this fascinating field of computer science. This tutorial will help you get started with the fundamental ideas and concepts in RL step-by-step.

## 1. Agent and environment

Imagine you just got your cat, Bob, a fancy new scratching post. You want Bob to learn to use it instead of clawing up your furniture. This situation is a great way to understand the basics of reinforcement learning (RL), a type of problem where an agent learns from trial and error.

Bob, the curious cat, is the **agent** in this RL scenario. The agent is the learner and decision-maker. Bob needs to learn which things are okay to scratch (the post) and which are not (the expensive drapes!).

The room where Bob explores his scratching desires is the **environment**. It's everything outside the agent that it can interact with. The environment provides challenges (like that comfy-looking couch) and opportunities (the satisfying-on-the-nails scratching post!).

There are two main types of environments in RL:

* **Discrete Environments:** Imagine a classic video game where the world is like a grid, and Bob can only move up, down, left, or right. These environments have a limited number of options for both Bob (his actions) and the room (its states, like where Bob and the post are).
* **Continuous Environments:** Now picture a super high-tech room where Bob can move in any direction, and the scratching post can be placed at any position and height. This is a continuous environment, with endless possibilities for both Bob and the room.

Our current room with furniture is a **static environment**. The furniture doesn't move, and the scratching post stays put. But imagine if the furniture and scratching post magically switched places every few hours. That would be a **dynamic environment**, which is trickier for an agent to learn in because things keep changing.

## 2. Actions and states

Everything Bob can see, smell, and hear - the furniture, the scratching post, even the dangling string on your curtains - makes up the current **state**. The set of all states Bob could possibly find himself in is the **state space**.

The size of this state space depends on the environment:

* **Discrete Environments:** In classic video games with grids, Bob can only be in a limited number of places (states), like in front of the post or next to the couch. This means the state space, and the information Bob gets, is also limited.
* **Continuous Environments:** Now picture a super high-tech room where Bob can be anywhere and even move the scratching post. This creates a **continuous state space** with endless possibilities for Bob to explore.

The **action space** is all the things Bob can do in the environment. In our scratching post example, Bob's actions could be scratching the post, napping on the couch, or even chasing his tail.

Similar to the state space, the number of actions Bob can take depends on the environment:

* **Discrete Environments:** In a grid-world game, Bob might have a limited number of actions, like moving up, down, left, or right and scratching.
* **Continuous Environments:** In our high-tech room, Bob might have a wider range of actions, like moving in any direction, jumping, or even (hopefully not) chewing on wires.

When Bob starts his scratching post adventure, the environment is in a default state, let's call it state 0. In our case, this might be the room with the scratching post all set up.

Everything gets interesting when Bob takes an action. Walking towards the post, napping on the couch, scratching furniture, or chasing butterflies - each action changes the environment and moves Bob to a new state.

So, Bob scratches the post (action) - this makes the environment change (new state).
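The cycle of "start in a default state, take an action, land in a new state" can be sketched in a few lines of Python. This is only a toy illustration, not a standard API: the class name `CatRoom`, the one-dimensional corridor layout, and the action names are all assumptions made for this example.

```python
class CatRoom:
    """Hypothetical toy environment: a corridor of cells with the post in the last one."""

    ACTIONS = ("left", "right", "scratch")  # a small, discrete action space

    def __init__(self, size=5):
        self.size = size        # number of cells Bob can stand in (the state space)
        self.post = size - 1    # the scratching post sits in the last cell
        self.position = 0

    def reset(self):
        """Put the environment back into its default state (state 0)."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply one action and return the new state."""
        if action == "left":
            self.position = max(0, self.position - 1)
        elif action == "right":
            self.position = min(self.size - 1, self.position + 1)
        elif action != "scratch":
            raise ValueError(f"unknown action: {action}")
        # "scratch" leaves Bob where he is
        return self.position


room = CatRoom()
state = room.reset()        # default state: cell 0
state = room.step("right")  # Bob walks one cell towards the post
print(state)
```

Here the state is just Bob's position, which keeps the state space tiny. A continuous version would replace the integer cell with real-valued coordinates.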

## 3. Rewards, time steps and episodes

Once we let Bob loose in the environment (the room), a single training **episode** starts. The episode unfolds as a sequence of **time steps**: at each one, Bob takes an action, and the environment responds with a new state and possibly a reward. The following scenarios might happen:
- __The episode may run forever or too long__: Bob may not be interested in scratching his nails at all, so he just keeps sleeping and playing. He doesn't receive any reward for doing this.
- __Bob scratches furniture__: he receives a negative reward (the water squirt described below).

These time steps link together to form the episode. We see how Bob's actions in one moment (scratching) affect the environment and lead to a new state in the next. Episode length varies based on Bob's success. A quick learner might master the post in a short episode, while others can take longer.

**Rewards**, like points in a game, tell Bob whether an action was good (a fish treat) or bad (water squirted in the face). By experiencing rewards across episodes, Bob learns which actions lead to success (and hopefully fewer furniture casualties). This understanding of episodes, time steps, and rewards is key to reinforcement learning.
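To make episodes, time steps, and rewards concrete, here is a minimal sketch of one episode in which Bob acts completely at random. The corridor layout, the reward values (+1 for the post, -1 for furniture), and the step cap `MAX_STEPS` are assumptions chosen for illustration, not values from any particular library.

```python
import random

POST = 4         # hypothetical: the post is in cell 4 of a 5-cell corridor
MAX_STEPS = 20   # cap so that an episode cannot run forever

def step(state, action):
    """Advance one time step and return (next_state, reward, done)."""
    if action == "scratch":
        if state == POST:
            return state, 1.0, True    # fish treat, and the episode ends
        return state, -1.0, False      # scratched furniture: water squirt
    move = 1 if action == "right" else -1
    next_state = min(POST, max(0, state + move))
    return next_state, 0.0, False      # just wandering: no reward

def run_episode(rng):
    """Run one episode with random actions; return (total_reward, length)."""
    state, total_reward, steps, done = 0, 0.0, 0, False
    while not done and steps < MAX_STEPS:
        action = rng.choice(["left", "right", "scratch"])
        state, reward, done = step(state, action)
        total_reward += reward
        steps += 1
    return total_reward, steps

total, length = run_episode(random.Random(0))
print(f"episode length: {length}, total reward: {total}")
```

Each pass through the `while` loop is one time step, and the loop as a whole is one episode. The cap handles the "runs forever" scenario by cutting the episode off.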

## 4. Exploration vs. exploitation

Bob's on his way to becoming a scratching post pro! But there's one more challenge to overcome: the exploration-exploitation dilemma. Let's break it down.

Imagine Bob is finally getting the hang of using the scratching post. It feels good and it earns treats: a win-win! But what if there's an even better scratching spot he hasn't discovered yet, like the one with sleeping shelves? This is the essence of the exploration-exploitation dilemma.

* **Exploration:** This is like Bob venturing out, trying new things (like scratching the curtains) and seeing what happens (water in the face or a fish treat). It helps him discover potentially better options in the environment.
* **Exploitation:** This is like Bob sticking to what works - the scratching post! It guarantees a reward (praise and satisfaction) he already knows about.

The challenge is finding a balance. Exploring too much might waste time, especially in continuous environments, while exploiting too much might make Bob miss out on something even better.

There are a few tricks to help Bob explore smartly:

* **Epsilon-greedy:** Imagine Bob flips a coin (or has some internal feline process) before taking an action. With a small chance (epsilon), he'll explore and try something new. But most of the time (1-epsilon), he'll exploit and go for the reliable scratching post.

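* **Boltzmann Exploration:** Instead of switching between "random" and "best", Bob picks each action with a probability that grows with its estimated value (a softmax over the value estimates). A temperature parameter controls how random he is: at a high temperature all actions are almost equally likely, while at a low temperature he almost always picks the best-looking one. As he gets more rewards from the post, its estimated value rises and he becomes more likely to exploit that winning strategy.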

By using these strategies (and many others we haven't explained), Bob can find a balance between exploring the unknown and sticking to the good stuff.
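Both strategies fit in a few lines of Python. The sketch below assumes Bob already has a list of value estimates, one per action. Boltzmann exploration is written in its usual form, a softmax with a temperature parameter. The function names and the example numbers are made up for illustration.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probabilities(q_values, temperature):
    """Softmax over value estimates; a higher temperature gives more uniform choices."""
    highest = max(q_values)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann(q_values, temperature, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

estimates = [1.0, 0.2, -0.5]  # hypothetical values for: post, couch, curtains
print(epsilon_greedy(estimates, epsilon=0.1))
print(boltzmann_probabilities(estimates, temperature=1.0))
print(boltzmann(estimates, temperature=1.0))
```

One practical difference: epsilon-greedy explores all non-best actions equally, while Boltzmann exploration tries a slightly worse action more often than a terrible one.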

## 5. The discount factor

Let's talk about the __discount factor__ now. It lowers the value of future rewards compared to immediate ones.

Imagine Bob can get a small treat right now for scratching the post, or a bigger treat later for finding an even better scratching spot. How much is that later treat worth to him today? The discount factor, usually written as gamma and set between 0 and 1, answers that question: a reward that arrives several steps in the future is scaled down once for every step Bob has to wait.

A high discount factor (close to 1) prioritizes future rewards, so Bob is willing to pass up a quick treat for a bigger payoff later. A low discount factor (close to 0) makes him short-sighted: he might stick to the good-but-not-great scratching post and never work toward the ultimate spot!

Changing the discount factor is how you balance short-term against long-term rewards. A high value encourages long-term planning but might pass up immediate wins. A low one keeps Bob in the present but hinders rewards that take many steps to reach.
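A short calculation shows the effect. The sketch below computes the discounted return, the sum of rewards in which a reward arriving `t` steps from now is multiplied by `gamma ** t`. The reward sequence is a made-up example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each earlier step discounts everything that follows it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [0.0, 0.0, 0.0, 10.0]  # hypothetical: a big treat after three empty steps

print(discounted_return(rewards, gamma=0.9))  # roughly 7.29
print(discounted_return(rewards, gamma=0.5))  # 1.25
```

With gamma at 0.9 the delayed treat keeps most of its value, so waiting for it is worthwhile. With gamma at 0.5 the same treat is worth only an eighth of its face value by the time it is three steps away.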

## 6. Q-learning

Our curious cat Bob is well on his way to scratching post mastery! But how exactly does he learn which actions lead to the most praise and treats (and the fewest water squirts)? This is where Q-learning comes in. Imagine Q-learning as Bob's internal strategy guide, constantly updated based on his scratching adventures.

Let's say Bob discovers the scratching post behind a couch or something (state 1). Q-learning assigns a value, called a Q-value, to each possible action Bob can take in that state. Scratching the post (action 1) might have a high Q-value because it leads to rewards (praise, treats). On the other hand, scratching the couch (action 2) might have a low Q-value because it leads to no rewards (or worse, punishment).

These Q-values are like a currency for Bob. The higher the Q-value for an action in a particular state, the more attractive that action seems to him. So, initially, Bob might explore by scratching both the post and the couch (trial and error). But as he receives rewards (or punishments), the Q-values get updated. The good scratching action (scratching the post) gets a higher and higher Q-value, while the bad scratching action (couch) gets a lower and lower Q-value.

Imagine a notebook where Bob keeps track of all his scratching experiences. This notebook is like a special Q-table, a table that stores all the Q-values for every state-action pair Bob encounters. Each row in the Q-table represents a state (like seeing the scratching post), and each column represents an action (like scratching it or the couch). The cells of the table hold the Q-values, constantly being updated as Bob explores.
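In code, such a notebook can be as simple as a nested dictionary. The sketch below is one possible layout, with made-up state names, action names, and Q-values standing in for what Bob might have learned.

```python
STATES = ["sees_post", "sees_couch"]  # hypothetical state names (the rows)
ACTIONS = ["scratch", "walk_away"]    # hypothetical action names (the columns)

# Every Q-value starts at zero: Bob has no experience yet.
q_table = {s: {a: 0.0 for a in ACTIONS} for s in STATES}

# Made-up values standing in for what Bob has learned after a while.
q_table["sees_post"]["scratch"] = 0.8
q_table["sees_couch"]["scratch"] = -0.6

def best_action(q_table, state):
    """Look up the row for `state` and return the action with the highest Q-value."""
    row = q_table[state]
    return max(row, key=row.get)

print(best_action(q_table, "sees_post"))   # scratch
print(best_action(q_table, "sees_couch"))  # walk_away
```

A table like this only works when states and actions can be enumerated, which is why it suits discrete environments.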

So how exactly does Bob update these Q-values in his personal Q-table? Here's the core idea of Q-learning:

1. **Bob takes an action (let's say scratching the post).**
2. **He observes the new state (maybe the post visible behind a curtain).**
3. **He receives a reward (treats from you based on how close Bob is to the post).**
4. **Based on these four things (state, action, reward, and new state), Bob updates the Q-value for the action he just took in the previous state (scratching the post in state 1).**
5. **The update considers the reward he received, the Q-value of the best action he could take in the new state (which might be exploring other parts of the post), and a discount factor.**
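Written as a formula, these five steps are the standard Q-learning update rule. Here $s$ is the previous state, $a$ the action taken, $r$ the reward, $s'$ the new state, and $\gamma$ the discount factor. The symbol $\alpha$ is the learning rate, a step size between 0 and 1 that controls how far each update moves the old value. It is not named in the steps above but is part of the usual formulation.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The term in brackets is the gap between what Bob now believes the action is worth (the reward plus the discounted value of the best follow-up action) and what his table said before.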

This update rule ensures that Bob learns from his experiences. Good scratching actions in previous states (like scratching the post in state 1) get their Q-values boosted because they led to rewards. Over time, the Q-table becomes a treasure trove of knowledge for Bob, guiding him towards the most rewarding scratches and away from the dreaded water squirts. 

__The role of discount factor in Q-learning__

Remember how a treat promised for later seems less exciting than one you can have right now? That's the discount factor at play. It tells Q-learning to value future rewards slightly less than immediate ones. This discourages Bob from dawdling over every tiny scratch mark on the post (and delaying his treats) and keeps him focused on the overall goal of getting the most rewards.

By following these steps, Bob (and our Q-learning agent) continuously learns and improves.
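Putting the pieces together, here is a minimal sketch of tabular Q-learning on a toy problem. Everything specific in it is an assumption made for illustration: the 5-cell corridor with the post in the last cell, the reward values, the hyperparameters, and the choice to break ties between equal Q-values at random.

```python
import random

N_STATES, POST = 5, 4  # hypothetical corridor; the post is in the last cell
ACTIONS = [0, 1, 2]    # 0 = left, 1 = right, 2 = scratch

def step(state, action):
    """Return (next_state, reward, done) for one time step."""
    if action == 2:
        return (state, 1.0, True) if state == POST else (state, -1.0, False)
    move = 1 if action == 1 else -1
    return min(POST, max(0, state + move)), 0.0, False

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice, breaking ties between equal Q-values at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]  # the Q-table
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:  # cap the episode length
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # No future value to bootstrap from once the episode has ended.
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state, steps = next_state, steps + 1
    return q

q = train()
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES)]
print(policy)
```

With these settings the learned greedy policy should be "move right" in every cell before the post and "scratch" once Bob reaches it. The random tie-breaking matters early on: with an all-zero table, always taking the first action would keep Bob walking into the left wall.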


## 7. Policy

But how does Bob use the Q-table? After all, he is just a smart cat. This is where __policy__ comes in - his personal "playbook" for action. And policies can be **stochastic** or **deterministic**.

First, imagine a confident Bob, always picking the highest Q-value action in each state. This is a **deterministic policy**, a clear-cut playbook: see the scratching post, scratch it (assuming it leads to the most rewards)! This ensures consistency and avoids needless exploration.

Now, imagine an adventurous Bob. Enter **stochastic policies**. These introduce randomness. With a small probability, Bob tries something new (like the curtains) even if its Q-value is low. But most of the time, he'll still rely on his Q-table knowledge and give the post a good rub.

The key is balance between **exploitation** (sticking to proven good actions) and **exploration** (trying new things). A purely deterministic policy might miss out on better rewards. Stochastic policies, with their randomness, help Bob explore while leveraging his knowledge. This balance is crucial for long-term success.

**Epsilon-Greedy: A Simple Balance**

One way to achieve this is the **epsilon-greedy policy**. Imagine Bob flips a coin (epsilon) before acting. With a small probability (epsilon), he explores. But most of the time (1-epsilon), he exploits his knowledge. This way, Bob can learn about new possibilities while still reaping the rewards of his past experiences.
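One way to see the difference between the two policy types is to write them down as functions of a single Q-table row. In this sketch the deterministic policy returns one action, while the epsilon-greedy policy returns a probability for every action. It assumes the common formulation in which the exploring step picks uniformly among all actions, including the best one. The Q-values are made up.

```python
def greedy_policy(q_row):
    """Deterministic policy: always the index of the highest Q-value."""
    return max(range(len(q_row)), key=lambda a: q_row[a])

def epsilon_greedy_distribution(q_row, epsilon):
    """Stochastic policy: the probability of each action under epsilon-greedy."""
    n = len(q_row)
    probs = [epsilon / n] * n                      # exploration spreads epsilon evenly
    probs[greedy_policy(q_row)] += 1.0 - epsilon   # the rest goes to the best action
    return probs

q_row = [0.8, -0.6, 0.1]  # hypothetical Q-values for: post, curtains, couch

print(greedy_policy(q_row))                     # 0
print(epsilon_greedy_distribution(q_row, 0.1))
```

Setting `epsilon` to 0 turns the stochastic policy back into the deterministic one, which is why epsilon is often reduced as training progresses.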

Understanding these policy types and the exploration-exploitation balance is key to Bob (and our agents) continuously improving their decision-making and navigating their environments effectively.

## 8. Other types of reinforcement learning algorithms

From a big picture perspective, reinforcement learning algorithms can be broadly categorized based on how they interact with the environment and learn from experience. Here's a breakdown of the two main categories and some popular algorithms within each:

1. Model-based Reinforcement Learning:

In this approach, the agent builds an internal model of the environment. This model represents the dynamics of the environment, including state transitions and reward probabilities. The agent can then use this model to plan and evaluate different actions before taking them in the real environment.

- Advantages:
    - Can be more sample-efficient, especially in complex environments.
    - Allows for planning and evaluation before taking actions.
- Disadvantages:
    - Building an accurate model can be challenging, especially for complex environments.
    - The model may not perfectly reflect the real environment, leading to suboptimal behavior.

Common model-based RL algorithms:

- Dyna-Q: This algorithm combines model-based and model-free learning. It learns a model of the environment and uses it to plan actions while also directly learning from experience through Q-learning (explained under Model-Free).

2. Model-Free Reinforcement Learning:

This approach focuses on learning directly from interaction with the environment without explicitly building an internal model. The agent learns the value of states and actions or the optimal policy through trial and error.

- Advantages:
    - Easier to implement in complex environments where building a model is difficult.
    - More flexible and adaptable to changes in the environment.
- Disadvantages:
    - Can be less sample-efficient, requiring more interaction with the environment to learn effectively.

Common Model-Free RL Algorithms:

- Q-Learning: This is the popular algorithm we focused on in this article. It learns a Q-value for each state-action pair. The Q-value represents the expected cumulative (discounted) future reward of taking a specific action in a particular state. The agent can then choose the action with the highest Q-value to maximize its long-term reward.

- SARSA (State-Action-Reward-State-Action): Similar to Q-learning in that it learns a value for each state-action pair, but it is on-policy. Its update uses the reward received, the next state observed, and the action the agent actually takes in that next state, rather than the best available action.

- Policy Gradient Methods: These algorithms directly learn the policy function, which maps states to actions. They use gradients to update the policy in the direction that is expected to lead to higher rewards. Examples include REINFORCE and Proximal Policy Optimization (PPO).

- Deep Q-Networks (DQN): This algorithm combines Q-learning with deep neural networks to handle high-dimensional state spaces, often encountered in complex environments like video games.
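The difference between Q-learning and SARSA in the list above comes down to one term in the update target. The sketch below compares the two targets for a single transition. The function names and numbers are illustrative assumptions.

```python
def q_learning_target(reward, next_q_row, gamma):
    """Off-policy: bootstrap from the best action available in the next state."""
    return reward + gamma * max(next_q_row)

def sarsa_target(reward, next_q_row, next_action, gamma):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * next_q_row[next_action]

next_q_row = [0.8, -0.6]  # hypothetical Q-values in the next state
reward, gamma = 0.0, 0.9
explored_action = 1       # suppose exploration picked the worse action

print(q_learning_target(reward, next_q_row, gamma))             # about 0.72
print(sarsa_target(reward, next_q_row, explored_action, gamma)) # about -0.54
```

When the next action happens to be the greedy one, the two targets are identical. They only diverge on exploratory steps, where SARSA accounts for the cost of exploring and Q-learning ignores it.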

The choice of reinforcement learning algorithm depends on various factors, including the complexity of the environment, the availability of resources, and the desired level of interpretability. Model-based approaches might be preferable for simpler environments where building an accurate model is feasible. On the other hand, model-free approaches are often more practical for complex, real-world scenarios.

Additionally, with the rise of deep learning, Deep Q-Networks (DQN) and other deep reinforcement learning algorithms are becoming increasingly popular for tackling complex tasks with high-dimensional state spaces.

## Conclusion