In this notebook we will try to understand a few basics of **Q-Learning**.

**Q-Learning** is a `model-free`, `off-policy` `reinforcement learning (RL)` technique.
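
For reference, the standard tabular Q-learning update can be written as follows. The symbols are not defined elsewhere in this notebook and are introduced here as assumptions: $\alpha$ is a learning rate, $\gamma$ a discount factor, $s_t$ and $a_t$ the state and action at step $t$, and $r_{t+1}$ the reward received after taking $a_t$.

```latex
Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right]
```

The update uses only a sampled transition $(s_t, a_t, r_{t+1}, s_{t+1})$, with no model of the environment's dynamics, which is what `model-free` refers to. The $\max_a$ term evaluates the best action in the next state regardless of which action the agent actually takes next, which is why the method is called `off-policy`.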

In [39]:
import warnings
warnings.filterwarnings('ignore')

In [1]:
# install the OpenAI Gym module
!pip install gym

Collecting gym
  Downloading gym-0.23.1.tar.gz (626 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Collecting gym-notices>=0.0.4
  Downloading gym_notices-0.0.6-py3-none-any.whl (2.7 kB)
Building wheels for collected packages: gym
  Building wheel for gym (PEP 517): started
  Building wheel for gym (PEP 517): finished with status 'done'
  Created wheel for gym: filename=gym-0.23.1-py3-none-any.whl size=701377 sha256=4277bb28a628bec136c06c1ce26f17b796f667ed190fbf69d78804fa4b473cbf
  Stored in directory: c:\users\syeda\appdata\local\pip\cache\wheels\78\28\77\b0c74e80a2a4faae0161d5c53bc4f8e436e77aedc79136ee13
Successfully built gym
Installing collected packages: gym-notices, gym
Successfully installed gym-0.23.1 gym-notices-

In [4]:
!pip install pygame

Collecting pygame
  Downloading pygame-2.1.2-cp38-cp38-win_amd64.whl (8.4 MB)
Installing collected packages: pygame
Successfully installed pygame-2.1.2


In [15]:
import gym  # toolkit that provides environments for testing RL algorithms

In [29]:
env = gym.make('MountainCar-v0') # initialise the environment

In [17]:
env.reset() # reset the environment 

array([-0.565038,  0.      ], dtype=float32)

In [12]:
# env.render() # render the environment
# env.close() # close the environment

In [18]:
# the actions that the agent in the environment can take
env.action_space # total action space

Discrete(3)

In [19]:
# as per the environment description, the car (agent) can take one of these actions at each step:
# accelerate to the left - 0
# do not accelerate - 1
# accelerate to the right - 2
env.action_space.n

3

In [20]:
# the range of the agent's observations, from low to high
# each observation is [position along x-axis, velocity]
env.observation_space

Box([-1.2  -0.07], [0.6  0.07], (2,), float32)

In [21]:
# lower bounds of the observation space
env.observation_space.low

array([-1.2 , -0.07], dtype=float32)

In [22]:
# upper bounds of the observation space
env.observation_space.high

array([0.6 , 0.07], dtype=float32)

In [30]:
env.reset()

done = False
while not done:
    action = 2 # always go right
    # each call to env.step(action) returns:
    # the new state of the agent
    # the reward for taking the step
    # whether the episode has ended (goal reached or step limit exhausted)
    # miscellaneous other attributes (ignored here)
    new_state, reward, done, _ = env.step(action)
    env.render()
    print(reward, new_state, done)

env.close()

-1.0 [-0.49461326  0.00078878] False
-1.0 [-0.4930416   0.00157166] False
-1.0 [-0.4906988  0.0023428] False
-1.0 [-0.48760235  0.00309645] False
-1.0 [-0.48377535  0.003827  ] False
-1.0 [-0.47924632  0.00452904] False
-1.0 [-0.47404894  0.00519737] False
-1.0 [-0.4682218   0.00582712] False
-1.0 [-0.46180812  0.0064137 ] False
-1.0 [-0.4548552   0.00695292] False
-1.0 [-0.44741422  0.00744099] False
-1.0 [-0.43953964  0.00787457] False
-1.0 [-0.43128887  0.00825078] False
-1.0 [-0.4227216   0.00856727] False
-1.0 [-0.4138994   0.00882219] False
-1.0 [-0.40488517  0.00901422] False
-1.0 [-0.39574262  0.00914257] False
-1.0 [-0.3865356   0.00920699] False
-1.0 [-0.3773279   0.00920774] False
-1.0 [-0.36818233  0.00914558] False
-1.0 [-0.3591606   0.00902173] False
-1.0 [-0.35032272  0.00883786] False
-1.0 [-0.3417267   0.00859603] False
-1.0 [-0.33342803  0.00829867] False
-1.0 [-0.32547954  0.00794851] False
-1.0 [-0.31793097  0.00754857] False
-1.0 [-0.31082886  0.00710208] False
-1.

**Q-Learning** is based on updating `Q-values` in a `Q-table`, which stores one value per possible action for each combination of observation values.
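
To make the table update concrete, here is a minimal sketch of a single tabular Q-learning update. It is an illustration under stated assumptions, not the notebook's own code: the function name `q_update`, the hyperparameter values, and the toy 4 x 4 x 3 table are all hypothetical.

```python
import numpy as np

# Hypothetical hyperparameters, chosen only for illustration.
LEARNING_RATE = 0.1  # how far each update moves the old estimate
DISCOUNT = 0.95      # weight given to future rewards

def q_update(q_table, state, action, reward, new_state, done):
    """Apply one tabular Q-learning update in place and return the new value."""
    current_q = q_table[state + (action,)]
    # A finished episode has no future return to bootstrap from.
    max_future_q = 0.0 if done else np.max(q_table[new_state])
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    q_table[state + (action,)] = new_q
    return new_q

# Toy table: 4 x 4 discrete states and 3 actions, all values starting at zero.
q_table = np.zeros((4, 4, 3))
new_value = q_update(q_table, state=(1, 2), action=2, reward=-1.0,
                     new_state=(1, 3), done=False)
print(new_value)
```

States are tuples of indices, so `q_table[state]` yields the vector of Q-values for every action in that state and `q_table[state + (action,)]` selects a single entry.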

As the output above shows, the `reward` is **-1** on every step before the goal is reached. Reaching the flag ends the episode, so no further penalty accumulates.


Also, the **state** of the agent is reported to about 8 decimal places. If we kept one Q-table entry for every possible combination of observation values, the table would be huge and most entries would rarely be visited. Instead, we group each observation dimension into a fixed number of windows (bins).

In [41]:
# convert granular (continuous) observation samples to discrete values
DISCRETE_OS_SIZE = [20] * len(env.observation_space.low)
DISCRETE_OS_SIZE

[20, 20]

In [43]:
# the window (bin) width along each observation dimension
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE
discrete_os_win_size

array([0.09 , 0.007])

Each `position` bin is 0.09 wide and each `velocity` bin is 0.007 wide.
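
With these bin widths, a continuous observation can be mapped to a pair of bin indices. The sketch below is one possible way to do it. The helper name `get_discrete_state` and the clipping step are assumptions added for illustration, and the bounds are copied from the `observation_space` output above so that the block runs without `gym`.

```python
import numpy as np

# Bounds copied from the observation_space output above: [position, velocity].
OBS_LOW = np.array([-1.2, -0.07])
OBS_HIGH = np.array([0.6, 0.07])
DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (OBS_HIGH - OBS_LOW) / DISCRETE_OS_SIZE

def get_discrete_state(state):
    """Map a continuous [position, velocity] observation to a tuple of bin indices."""
    scaled = (np.asarray(state) - OBS_LOW) / discrete_os_win_size
    # Assumption: clip so that an observation on the upper bound lands in the last bin.
    indices = np.clip(scaled.astype(int), 0, np.array(DISCRETE_OS_SIZE) - 1)
    return tuple(int(i) for i in indices)

print(get_discrete_state([-0.5, 0.01]))  # → (7, 11)
```

The returned tuple can index a Q-table of shape `DISCRETE_OS_SIZE + [number of actions]` directly, giving the Q-values of all actions for that bin.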