# Learn Reinforcement Learning in Python: Step-by-step Tutorial

## Why learn Reinforcement Learning (RL)?

To me, even the most basic reinforcement learning model resembles science-fiction AI more than any large language model of today. Just take a look at how an RL agent plays (and finishes) an insanely difficult level of Super Mario:

![](images/super_mario.gif)

In the beginning, this agent has no idea what the controls are, how to progress through the game, what the obstacles are, or what finishes the game. The agent learns all of these things without any human intervention, purely through the power of reinforcement learning algorithms.

RL agents excel in situations where traditional machine learning algorithms struggle. They can solve problems without predefined solutions or explicitly programmed actions and, most importantly, without mounds and mounds of data. That's why RL is having a significant impact on many fields. For instance, it's used in:

- Self-driving cars: RL agents can learn optimal driving strategies based on traffic conditions and road rules.
- Robotics: Robots can be trained to perform complex tasks in dynamic environments through RL.
- Game playing: AI agents can learn complex strategies in games like Go or StarCraft II using RL techniques.

Reinforcement learning is a rapidly evolving field with vast potential. As research progresses, we can expect even more groundbreaking applications in areas like resource management, healthcare, and personalized learning. 

That's why now is the best time to learn this fascinating field of computer science. This tutorial will help you get started with the fundamental ideas and concepts in RL and introduce how to apply it in practice using Python. 

## 1. Agent and environment

Imagine you just got your cat, Bob, a fancy new scratching post. You want Bob to learn to use it instead of clawing up your furniture. This situation is a great way to understand the basics of reinforcement learning (RL), a type of AI where an agent learns from trial and error.

Bob, the curious cat, is the **agent** in this RL scenario. The agent is the learner and decision-maker. In fancier terms, it's the one who interacts with the world and figures things out. Bob needs to learn which things are okay to scratch (the post) and which are not (the expensive drapes!).


The room where Bob explores his scratching desires is the **environment**. It's everything outside the agent that it can interact with. The environment provides challenges (like that comfy-looking couch) and opportunities (the satisfying-on-the-nails scratching post!). Here, the room has furniture and, of course, the all-important scratching post.

There are two main types of environments in RL:

* **Discrete Environments:** Imagine a classic video game where the world is like a grid, and Bob can only move up, down, left, or right. These environments have a limited number of options for both Bob (his actions) and the room (its states, like where Bob and the post are).
* **Continuous Environments:** Now picture a super high-tech room where Bob can move in any direction, and maybe even the scratching post can be moved around! This is a continuous environment, with endless possibilities for both Bob and the room.
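
To make the difference concrete, here is a minimal sketch in plain Python. The 3x3 grid, the room dimensions, and every name in it (`discrete_states`, `sample_continuous_position`, and so on) are assumptions made up for this illustration, not part of any RL library.

```python
import itertools
import random

# Hypothetical discrete room: a 3x3 grid of cells Bob can stand on.
GRID_ROWS, GRID_COLS = 3, 3
discrete_states = list(itertools.product(range(GRID_ROWS), range(GRID_COLS)))
discrete_actions = ["up", "down", "left", "right"]

# Hypothetical continuous room: Bob can be at any (x, y) point, in metres.
ROOM_WIDTH, ROOM_DEPTH = 4.0, 5.0

def sample_continuous_position(rng):
    """Return one of the infinitely many possible (x, y) positions."""
    return (rng.uniform(0.0, ROOM_WIDTH), rng.uniform(0.0, ROOM_DEPTH))

print(len(discrete_states))  # 9 grid cells: the discrete room can be listed in full
print(sample_continuous_position(random.Random(0)))  # the continuous room can only be sampled
```

The discrete room can be written out as a complete list, while the continuous room has no such list, only a range to draw positions from.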

Our current room with furniture is a **static environment**. The furniture doesn't move, and the scratching post stays put. But imagine if the furniture and scratching post magically switched places every few hours! That would be a **dynamic environment**, which is trickier for an agent to learn in because things keep changing.

## 2. Actions and states

Imagine Bob's world like a giant video game. Everything Bob can see, smell, and hear (the furniture, the scratching post, even the dangling string on your curtains) describes the current **state** of his world. The set of all states Bob could possibly find himself in is the **state space**.

The size of this state space depends on the environment:

* **Discrete Environments:** In classic video games with grids, Bob can only be in a limited number of places (states), like in front of the post or next to the couch. This means the state space, and the information Bob gets, is also limited.
* **Continuous Environments:** Now picture a super high-tech room where Bob can be anywhere and even move the scratching post. This creates a **continuous state space** with endless possibilities for Bob to explore.

The **action space** is the set of all things Bob can do in the environment. In our scratching post example, Bob's actions could be scratching the post, napping on the couch (lazy), or even chasing butterflies (distracted kitty).

Similar to the state space, the number of actions Bob can take depends on the environment:

* **Discrete Environments:** In a grid-world game, Bob might have a limited number of actions, like moving up, down, left, or right.
* **Continuous Environments:** In our high-tech room, Bob might have a wider range of actions, like moving in any direction, jumping, or even (hopefully not) chewing on wires.
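
As a rough sketch of a discrete action space, the four grid moves can be written as a Python `Enum`. The `Action` class, the `move` helper, and the 3x3 room size are hypothetical names and values chosen for this example only.

```python
from enum import Enum

class Action(Enum):
    """Hypothetical discrete action space for a grid-world Bob."""
    UP = (-1, 0)
    DOWN = (1, 0)
    LEFT = (0, -1)
    RIGHT = (0, 1)

GRID_ROWS, GRID_COLS = 3, 3  # assumed room size

def move(state, action):
    """Apply an action to a (row, col) state, keeping Bob inside the room."""
    d_row, d_col = action.value
    row = min(max(state[0] + d_row, 0), GRID_ROWS - 1)
    col = min(max(state[1] + d_col, 0), GRID_COLS - 1)
    return (row, col)

action_space = list(Action)
print(len(action_space))           # 4 possible actions
print(move((0, 0), Action.RIGHT))  # (0, 1)
print(move((0, 0), Action.UP))     # (0, 0): the wall blocks the move
```

A continuous action space would replace the four fixed members with real-valued quantities, such as a direction angle and a speed.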

**The Starting Point: state 0**

When Bob starts his scratching post adventure, the environment is in a default state, let's call it state 0. In our case, this might be the room with the scratching post all set up.

Everything gets interesting when Bob takes an action. Walking towards the post, napping on the couch, scratching furniture, or chasing butterflies: each action changes the environment and moves Bob to a new state.

So when Bob scratches the post (action), the environment changes (new state).
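
The idea of a default state followed by action-driven transitions can be sketched as a tiny class. Everything here is an assumption for illustration: the `ScratchingRoom` class, the state and action labels, and the simplified transition table, in which each action always leads to the same state. The method names `reset` and `step` follow a naming convention common in RL libraries, but this sketch does not use or depend on any of them.

```python
class ScratchingRoom:
    """Toy environment; states and actions are made-up labels."""

    # Assumed transition table: the state each action leads to.
    TRANSITIONS = {
        "walk_to_post": "at_post",
        "scratch_post": "at_post",
        "nap_on_couch": "on_couch",
        "scratch_couch": "on_couch",
    }

    def reset(self):
        """Put the environment back into its default state (state 0)."""
        self.state = "room_start"
        return self.state

    def step(self, action):
        """Apply Bob's action and return the new state."""
        self.state = self.TRANSITIONS[action]
        return self.state

env = ScratchingRoom()
print(env.reset())               # room_start
print(env.step("walk_to_post"))  # at_post
```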

## 3. Time step and rewards

Imagine Bob's scratching post adventure like a movie. This movie isn't shown all at once, but rather in short snippets called **time steps**. Each time step is like a single frame in the movie, capturing a snapshot of what Bob is doing (action) and what the environment is like (state).

These time steps help us understand the flow of events. We can see how Bob's actions in one time step (scratching something) affect the environment (maybe some satisfying furniture shredding) and lead to a new state in the next time step. The number of time steps can vary depending on the situation. Maybe Bob learns the joy of scratching posts quickly, or maybe it takes a while. 

Learning is all about getting good at something, and reinforcement learning is no different. But how does Bob know if he's doing a good job scratching that post (good kitty) or a bad job clawing the couch (naughty kitty)? This is where **rewards** come in.

Rewards are like little treats Bob gets for taking the right actions. In our example, when Bob scratches the post (action), he might receive a positive reward. This reward could be a feeling of accomplishment, a yummy treat, or even just your pat on the head (good kitty!).

On the other hand, if Bob scratches the couch, he might not get a reward, or he might even get a negative reward (like a water squirt in the face). These rewards help Bob learn which actions are good and which ones to avoid.
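
In code, a reward is just a number handed back after an action. The sketch below assumes illustrative values (+1 for the post, -1 for the couch, 0 for anything else); the numbers and the `get_reward` name are hypothetical, and real problems need rewards designed for the task at hand.

```python
# Assumed reward values, chosen only for illustration.
REWARDS = {
    "scratch_post": 1.0,    # a treat or a pat on the head
    "scratch_couch": -1.0,  # a water squirt in the face
}

def get_reward(action):
    """Return the reward for an action; anything unlisted earns nothing."""
    return REWARDS.get(action, 0.0)

for action in ["scratch_post", "scratch_couch", "nap_on_couch"]:
    print(action, get_reward(action))
```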

By understanding time steps and rewards, we can see the big picture of Bob's learning process. Each time step captures Bob's action, the state of the environment, and the reward he receives. Over many time steps, Bob learns through trial and error, figuring out which actions lead to the most rewards (and hopefully, fewer scratches on your furniture!).
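
Putting the pieces together, the sketch below shows one possible trial-and-error loop. It uses a simple strategy known as epsilon-greedy: most of the time Bob repeats the action that has paid off best so far, and occasionally he tries a random one. This is just one easy-to-read option picked for illustration; the action labels, reward values, and parameters (`num_steps`, `epsilon`, `seed`) are all assumptions.

```python
import random

ACTIONS = ["scratch_couch", "nap_on_couch", "scratch_post"]
REWARDS = {"scratch_post": 1.0, "scratch_couch": -1.0}  # assumed values

def run_episode(num_steps=500, epsilon=0.2, seed=0):
    """Let Bob act for num_steps time steps and track the average reward per action."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # Bob's running reward estimate per action
    count = {a: 0 for a in ACTIONS}    # how often each action was tried
    total_reward = 0.0
    for _ in range(num_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore: try something random
        else:
            action = max(ACTIONS, key=value.get)  # exploit: best action so far
        reward = REWARDS.get(action, 0.0)
        count[action] += 1
        # Incremental mean of the rewards seen for this action.
        value[action] += (reward - value[action]) / count[action]
        total_reward += reward
    return value, total_reward

value, total_reward = run_episode()
print("Best action found:", max(value, key=value.get))
print("Total reward:", total_reward)
```

Each pass through the loop is one time step: Bob picks an action, receives a reward, and updates his estimate of how good that action is. Because the post is the only action with a positive reward, Bob's estimates steer him towards it once exploration has led him to try it.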