# Q-Learning
## A reinforcement learning implementation

This small program implements a basic agent that trains to solve Gym's FrozenLake environment.
It relies solely on the Q-Learning algorithm, without any neural network (NN) components.
Still, it is a helpful way to understand both the basics behind the more advanced DQN algorithm and the limits of classic Q-Learning.

It was made by following Maxime Labonne's article ([Q-Learning for beginners](https://mlabonne.github.io/blog/reinforcement%20learning/q-learning/frozen%20lake/gym/tutorial/2022/02/13/Q_learning.html)).

In [3]:
import gym
import numpy as np
import matplotlib.pyplot as plt

### The Frozen Lake Environment

By default, the frozen lake is a 4x4 grid with 3 types of tiles:
- Neutral tiles (ice), on which the agent can stand without any consequence
- Death tiles (holes), on which the agent dies and must restart from the beginning
- Win tiles (present), on which the agent gains a reward and then restarts from the beginning

Furthermore, every action in the environment is uncertain because the ground is slippery.  
With the default settings, the agent moves in the intended direction only about 33% of the time; otherwise it slips in one of the two perpendicular directions.
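
The slipperiness can be sketched without Gym. The helper below, `slippery_outcomes`, is a hypothetical name and not part of Gym. It assumes FrozenLake's action numbering (0 = left, 1 = down, 2 = right, 3 = up) and that the intended direction and the two perpendicular directions are equally likely:

```python
from fractions import Fraction

# Assumed FrozenLake action numbering: 0 = left, 1 = down, 2 = right, 3 = up
ACTION_NAMES = ["left", "down", "right", "up"]


def slippery_outcomes(action):
    """Return {direction actually taken: probability} on slippery ice.

    Assumption: the intended direction and both perpendicular
    directions each happen with probability 1/3.
    """
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return {ACTION_NAMES[a]: Fraction(1, 3) for a in candidates}


# The agent asks to go right, but may end up going down or up instead
for direction, probability in slippery_outcomes(2).items():
    print(direction, probability)
```

With this model the agent never slips backwards, only sideways, which is why a hole next to the path is dangerous even when the chosen action points away from it.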

In [4]:
env = gym.make('FrozenLake-v1')
env.reset()

0

### The Q-Table

The agent learns by recording the value of every action in every state.
FrozenLake has 4 actions and 16 states, so the Q-Table holds 64 values to discover during training.
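
To make the table layout concrete, here is a minimal sketch using plain nested lists instead of NumPy. It assumes states are numbered row by row (`state = row * 4 + col`); the names `state_to_position`, `qtable_sketch` and `greedy_action` are illustrative only:

```python
N_ROWS, N_COLS, N_ACTIONS = 4, 4, 4


def state_to_position(state):
    """Convert a state index into (row, col), assuming row-by-row numbering."""
    return divmod(state, N_COLS)


# One row per state, one column per action, all values start at zero
qtable_sketch = [[0.0] * N_ACTIONS for _ in range(N_ROWS * N_COLS)]


def greedy_action(table, state):
    """Return the index of the highest-valued action (first one on ties)."""
    row = table[state]
    return row.index(max(row))


print(state_to_position(5))                  # row and column of state 5
print(sum(len(row) for row in qtable_sketch))  # total number of entries
```

While every value is still zero, `greedy_action` always returns action 0, which is also how `np.argmax` breaks ties. This is one reason the agent needs some random exploration at the start.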

In [5]:
nb_actions = env.action_space.n
nb_states = env.observation_space.n
qtable = np.zeros((nb_states, nb_actions))

### Hyperparameters

The RL model takes multiple **hyperparameters** that define the agent's learning behavior.

- episodes -> the number of attempts the agent gets to earn rewards
- alpha -> the learning rate
- gamma -> the discount factor
- epsilon -> the agent's curiosity (Exploration vs Exploitation)
- epsilon_decay -> the amount subtracted from epsilon after each episode
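
As a rough illustration of how the exploration rate evolves, the sketch below computes epsilon after a given number of episodes, assuming a linear decay clipped at zero. `epsilon_after` is a hypothetical helper, and the closed form only matches a step-by-step subtraction up to floating-point rounding:

```python
def epsilon_after(n_episodes, start=1.0, decay=0.001):
    """Exploration rate after n_episodes of linear decay, never below 0."""
    return max(start - decay * n_episodes, 0.0)


# Probability of picking a random action at different points of training
for n in (0, 250, 500, 750, 1000):
    print(n, round(epsilon_after(n), 3))
```

With a start value of 1.0 and a decay of 0.001, the agent acts fully at random in the first episode and is almost purely greedy by episode 1000.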

In [6]:
episodes = 1000
alpha = 0.1
gamma = 0.9
epsilon = 1.0
epsilon_decay = 0.001

In [None]:
# list that stores every episode's outcome so we can draw a fancy graph with matplotlib
outcomes = []

### Backpropagation

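The agent discovers every value of the qtable by propagating the value of each tile backwards.
After each action, the agent updates the value of the previous (state, action) pair using the reward and the current maximum value of the new state, scaled by the hyperparameters alpha and gamma. Note that this is not the gradient backpropagation used to train neural networks; here the word only describes how values spread back from the goal.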
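
Written as a formula, the update applied after each step is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$

A small worked example may help to see how a reward spreads backwards. The function `q_update` is an illustrative helper, and the two-step scenario is made up for the example:

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update to a single (state, action) value."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)


# Stepping onto the goal: reward of 1, and the new state's values are still 0
q_last = q_update(0.0, reward=1.0, max_q_next=0.0)

# One tile earlier, later on: no reward, but the next state now has a best value > 0
q_before = q_update(0.0, reward=0.0, max_q_next=q_last)

print(q_last, q_before)
```

The value next to the goal grows first, and the tile before it only picks up a smaller, discounted share on a later visit. It takes many episodes for the values to reach the starting tile.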

In [None]:
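# Assumes the classic Gym API (gym < 0.26): reset() returns the state, step() returns 4 values
for t in range(episodes):
    state = env.reset()
    done = False

    # assume failure until a reward is collected
    outcomes.append("Failure")

    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit the qtable
        rnd = np.random.random()

        if rnd < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(qtable[state])

        new_state, reward, done, info = env.step(action)
        # Q-Learning update of the value of the (state, action) pair just played
        qtable[state, action] = qtable[state, action] + alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])
        state = new_state

        if reward:
            outcomes[-1] = "Success"
    # reduce exploration a little after each episode
    epsilon = max(epsilon - epsilon_decay, 0)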

In [None]:
# draw a fancy graph of the outcomes (a blue bar for each successful episode)
plt.figure(figsize=(12, 5))
plt.xlabel("Run number")
plt.ylabel("Outcome")
plt.bar(range(len(outcomes)), outcomes, color="#0A047A", width=1.0)
plt.show()
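
Alongside the graph, a single number can summarize how training went. The sketch below uses a made-up `sample` list in place of the real `outcomes`, and `success_rate` is a hypothetical helper:

```python
def success_rate(results):
    """Fraction of episodes labelled "Success" (0.0 for an empty list)."""
    if not results:
        return 0.0
    return results.count("Success") / len(results)


# Made-up outcomes, only for illustration
sample = ["Failure", "Failure", "Success", "Failure", "Success", "Success"]

print(success_rate(sample))       # over the whole run
print(success_rate(sample[-3:]))  # over the most recent episodes only
```

Comparing the rate over the last episodes with the rate over the whole run gives a quick hint of whether the agent improved as epsilon decreased.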

## Conclusion

The Q-Learning algorithm works pretty well on small and simple environments like FrozenLake, but it meets its limits in more complex or larger environments (more actions, more states) because the size of the qtable is the product of those two factors.  
And our poor computers can't handle a 10000 x 10000 qtable that easily.  
For better performance **we need approximation**.  
An approximation that **NN** can give us.
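
To put a rough number on the table size, the sketch below estimates the memory a dense qtable would need, assuming 8 bytes per value (the default `float64` of `np.zeros`). `qtable_bytes` is a hypothetical helper:

```python
BYTES_PER_FLOAT64 = 8  # assumption: values stored as float64, the np.zeros default


def qtable_bytes(nb_states, nb_actions):
    """Memory needed for a dense table with one value per (state, action) pair."""
    return nb_states * nb_actions * BYTES_PER_FLOAT64


print(qtable_bytes(16, 4))                 # FrozenLake
print(qtable_bytes(10000, 10000) / 1e6)    # 10000 x 10000 table, in megabytes
```

Memory is only part of the problem: every one of those entries also has to be visited many times before its value becomes reliable.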