<font size=5>**Project - Reinforcement Learning - How to make a computer learn to play the CartPole game**</font>

<font size=4>**Step 1 - Import the Modules**</font>

- <font size=3>**Importing the Modules**</font>

In [1]:
import sys
assert sys.version_info >= (3, 5)

In [2]:
import numpy as np
import tensorflow as tf
assert tf.__version__ >= "2.0"

In [3]:
from tensorflow import keras
import sklearn
assert sklearn.__version__ >= "0.20"

In [4]:
np.random.seed(42)
tf.random.set_seed(42)

In [5]:
import matplotlib as mpl
import matplotlib.pyplot as plt

In [6]:
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

<font size=4>**Step 2 - Understanding OpenAI Gym**</font>

- <font size=3>**Introduction to OpenAI gym environment**</font>

In [7]:
import gym

In [8]:
gym.envs.registry.all()

dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v3), EnvSpec(BipedalWalkerHardcore-v3), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0), EnvSpec(KellyCoinflip-v0), EnvSpec(KellyCoinflipGeneralized-v0), EnvSpec(FrozenLake-v0), EnvSpec(FrozenLake8x8-v0), EnvSpec(CliffWalking-v0), EnvSpec(NChain-v0), EnvSpec(Roulette-v0), EnvSpec(Taxi-v3), EnvSpec(GuessingGame-v0), EnvSpec(HotterColder-v0), EnvSpec(Reacher-v2), EnvSpec(Pusher-v2), EnvSpec(Thrower-v2), EnvSpec(Striker-v2), EnvSpec(InvertedPendulum-v2), EnvSpec(InvertedDoublePendulum-v2), EnvSpec(HalfCheetah-v2), EnvSpec(HalfCheetah-v3), EnvSpec(Hopper-v2), EnvSpec(Hopper-v3), EnvSpec(Swimmer-v2), EnvSp

- <font size=3>**Understanding CartPole game of OpenAI gym**</font>

In [9]:
# Let's get the CartPole environment from gym:
env = gym.make('CartPole-v1')

In [10]:
# Set the seed for env:
env.seed(42)

[42]

In [11]:
obs = env.reset()

In [12]:
print(obs)

[-0.01258566 -0.00156614  0.04207708 -0.00180545]


The observation `obs` returned by CartPole has 4 values:

1. The first value, -0.01258566, is the horizontal position of the cart.

2. The second value, -0.00156614, is the velocity of the cart.

3. The third value, 0.04207708, is the angle of the pole.

4. The fourth value, -0.00180545, is the angular velocity of the pole.
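The four values above are easier to work with when they have names. The following is a minimal sketch, not part of the gym API: the `CartPoleObs` name and its field names are hypothetical labels chosen here, and the numbers are copied from the printed observation.

```python
from collections import namedtuple

# Hypothetical container; gym itself returns a plain NumPy array.
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# Values copied from the observation printed above.
obs = [-0.01258566, -0.00156614, 0.04207708, -0.00180545]
named = CartPoleObs(*obs)

# Index 2 is the pole angle, which the hard-coded policy in Step 3 reads.
print(named.pole_angle)
```

Reading `named.pole_angle` is equivalent to reading `obs[2]`.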

In [13]:
print(env.action_space)

Discrete(2)


`Discrete(2)` means that CartPole has two possible actions: 0 pushes the cart to the left, and 1 pushes it to the right.

In [14]:
# help(env)  # Uncomment for more information on env

- <font size=3>**Let's make the agent play the game!**</font>

In [15]:
action = 0
obs, reward, done, info = env.step(action)
print(obs)

[-0.01261699 -0.19726549  0.04204097  0.30385076]


In [16]:
print(reward)
print(done)
print(info)

1.0
False
{}


<font size=4>**Step 3 - Hard-code simple Policy**</font>

- <font size=3>**A simple hard-coded policy**</font>

In [17]:
env.seed(42)
def basic_policy(obs):
    angle = obs[2]
    if angle < 0:
        return 0
    else:
        return 1

In [18]:
totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

In [19]:
print(np.mean(totals), np.std(totals), np.min(totals), np.max(totals))

41.718 8.858356280936096 24.0 68.0


<font size=4>**Step 4 - Neural Network for the Game**</font>

- <font size=3>**Building the Neural Network**</font>

In [20]:
keras.backend.clear_session()

In [21]:
tf.random.set_seed(42)
np.random.seed(42)

In [22]:
n_inputs = 4

model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])

In [23]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 5)                 25        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 6         
Total params: 31
Trainable params: 31
Non-trainable params: 0
_________________________________________________________________


In [24]:
env.seed(42)

[42]

In [25]:
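def basic_policy_untrained(obs):
    # The network outputs the probability of pushing left (action 0).
    left_proba = model.predict(obs.reshape(1, -1))[0, 0]
    # Sample the action: 0 (left) with probability left_proba, else 1 (right).
    action = int(np.random.rand() > left_proba)
    return action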

In [26]:
totals = []
for episode in range(50):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy_untrained(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)
    
np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

(27.16, 15.652935826866473, 9.0, 88.0)
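The sampling rule used in `basic_policy_untrained` can be checked in isolation. The sketch below is only an illustration: it replaces the network output with a fixed, made-up probability of 0.8, and `sample_action` is a hypothetical helper name. It shows that comparing a uniform random number against `left_proba` picks action 0 (left) about `left_proba` of the time.

```python
import random

def sample_action(left_proba, rng):
    # rng.random() is uniform on [0, 1), so it exceeds left_proba with
    # probability 1 - left_proba; that case maps to action 1 (right).
    return int(rng.random() > left_proba)

rng = random.Random(42)
n_samples = 10_000
actions = [sample_action(0.8, rng) for _ in range(n_samples)]

# Fraction of samples that chose action 0 (left); it should be near 0.8.
left_fraction = actions.count(0) / n_samples
print(left_fraction)
```

Sampling, rather than always taking the most probable action, lets the agent keep trying both actions.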

- <font size=3>**Training the Neural Network**</font>

In [27]:
# Set np seed to 42:
np.random.seed(42)

In [28]:
n_environments = 50
n_iterations = 5000

In [29]:
# Initialize 50 different CartPole environments:
envs = [gym.make("CartPole-v1") for _ in range(n_environments)]

In [30]:
for index, env in enumerate(envs):
    env.seed(index)

In [31]:
observations = [env.reset() for env in envs]

In [32]:
optimizer = keras.optimizers.RMSprop()

In [33]:
loss_fn = keras.losses.binary_crossentropy

In [34]:
for iteration in range(n_iterations):
    # if angle < 0, we want proba(left) = 1., or else proba(left) = 0.
    target_probas = np.array([([1.] if obs[2] < 0 else [0.])
                              for obs in observations])

    with tf.GradientTape() as tape:
        left_probas = model(np.array(observations))
        loss = tf.reduce_mean(loss_fn(target_probas, left_probas))
    print("\rIteration: {}, Loss: {:.3f}".format(iteration, loss.numpy()), end="")
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    actions = (np.random.rand(n_environments, 1) > left_probas.numpy()).astype(np.int32)
    for env_index, env in enumerate(envs):
        obs, reward, done, info = env.step(actions[env_index][0])
        observations[env_index] = obs if not done else env.reset()

Iteration: 4999, Loss: 0.094

<font size=4>**Step 5 - Understanding Policy Gradients**</font>

- <font size=3>**Policy Gradients**</font>

- To train this neural network we will need to define the target probabilities y. If an action is good we should increase its probability, and conversely, if it is bad we should reduce it. But how do we know whether an action is good or bad? The problem is that most actions have delayed effects, so when you win or lose points in an episode, it is not clear which actions contributed to this result: was it just the last action? Or the last 10? Or just one action 50 steps earlier? This is called the credit assignment problem.

- The Policy Gradients algorithm tackles this problem by first playing multiple episodes, then making the actions in good episodes slightly more likely, while actions in bad episodes are made slightly less likely. First we play, then we go back and think about what we did.
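To make "slightly more likely" and "slightly less likely" concrete, here is a minimal sketch for a policy with a single parameter instead of a neural network. The names `update_logit`, `advantage` and `learning_rate` are assumptions made for this illustration. The advantage stands for how much better (positive) or worse (negative) than average the outcome after the action was.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_logit(logit, action, advantage, learning_rate=0.1):
    """One policy-gradient step for a one-parameter policy.

    p_left = sigmoid(logit). The target treats the chosen action as if it
    were correct (1.0 for left, 0.0 for right), as in binary cross-entropy.
    The gradient of that loss with respect to the logit is p_left - target;
    scaling it by the advantage decides whether the chosen action is
    reinforced or discouraged.
    """
    p_left = sigmoid(logit)
    target = 1.0 - action  # action 0 (left) -> target 1.0
    grad = advantage * (p_left - target)
    return logit - learning_rate * grad

logit = 0.0  # start with p_left = 0.5
after_good = update_logit(logit, action=0, advantage=+1.0)
after_bad = update_logit(logit, action=0, advantage=-1.0)

# The same "left" action becomes more likely after a good outcome
# and less likely after a bad one.
print(sigmoid(after_good), sigmoid(after_bad))
```

The full implementation in Step 6 follows the same pattern, with the gradients computed by TensorFlow for every weight of the network.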



<font size=4>**Step 6 - Implementing Policy Gradients**</font>

- <font size=3>**Defining play_one_step function**</font>

In [35]:
def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

- <font size=3>**Defining play_multiple_episodes function**</font>

In [36]:
def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done:
                break
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
    return all_rewards, all_grads

- <font size=3>**Defining the discount function and normalizing function**</font>

In [37]:
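def discount_rewards(rewards, discount_rate):
    # Use a float array so that integer rewards are not truncated in place.
    discounted = np.array(rewards, dtype=float)
    # Walk backwards, adding the discounted future return to each step.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted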


In [38]:
def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                            for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
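A quick check of the two functions on small, made-up reward lists helps to see what they compute. The sketch below restates both functions so that it runs on its own, with an explicit float dtype so the integer rewards in the example are not truncated. The reward values and the discount rate of 0.8 are arbitrary illustration choices, not values from the CartPole runs.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    # Float dtype so the integer rewards below are not truncated.
    discounted = np.array(rewards, dtype=float)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted = [discount_rewards(rewards, discount_rate)
                      for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted - reward_mean) / reward_std
            for discounted in all_discounted]

# One three-step episode: the last step is heavily penalized.
# Step 1: 0 + 0.8 * (-50) = -40; step 0: 10 + 0.8 * (-40) = -22.
single = discount_rewards([10, 0, -50], discount_rate=0.8)
print(single)  # [-22. -40. -50.]

# Two episodes normalized together: every action in the first (bad)
# episode ends up below zero, every action in the second above zero.
normalized = discount_and_normalize_rewards(
    [[10, 0, -50], [10, 20]], discount_rate=0.8)
print(normalized)
```

After normalization, the sign of each value indicates whether the corresponding action should be made more or less likely.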

- <font size=3>**Training with Policy Gradients**</font>

In [39]:
keras.backend.clear_session()

In [40]:
tf.random.set_seed(42)
np.random.seed(42)

In [41]:
n_episodes_per_update = 10
n_iterations = 150
n_max_steps = 200
discount_rate = 0.95
n_inputs = 4

In [42]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.binary_crossentropy

In [43]:
# Define the nn_policy_gradient() function.
# Note: it relies on the global `optimizer` and `discount_rate` defined above.
def nn_policy_gradient(model, n_iterations, n_episodes_per_update, n_max_steps, loss_fn):
    env = gym.make("CartPole-v1")
    env.seed(42)

    for iteration in range(n_iterations):
        all_rewards, all_grads = play_multiple_episodes(
            env, n_episodes_per_update, n_max_steps, model, loss_fn)
        # Report the mean total reward per episode for this iteration.
        total_rewards = sum(map(sum, all_rewards))
        print("\rIteration: {}, mean rewards: {:.1f}".format(
            iteration, total_rewards / n_episodes_per_update), end="")
        all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                        discount_rate)
        # Weight each step's gradients by its normalized reward, then average.
        all_mean_grads = []
        for var_index in range(len(model.trainable_variables)):
            mean_grads = tf.reduce_mean(
                [final_reward * all_grads[episode_index][step][var_index]
                for episode_index, final_rewards in enumerate(all_final_rewards)
                    for step, final_reward in enumerate(final_rewards)], axis=0)
            all_mean_grads.append(mean_grads)
        optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

    # Release the environment's resources before returning the trained model.
    env.close()
    return model

In [44]:
# Now, let us build the neural network 
model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])

In [45]:
# Call the function:
model = nn_policy_gradient(model, n_iterations, n_episodes_per_update, n_max_steps, loss_fn)

Iteration: 149, mean rewards: 191.4

In [46]:
totals = []
for episode in range(20):
    print("Episode:",episode)
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy_untrained(obs)  # despite its name, this now uses the trained model
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

Episode: 0
Episode: 1
Episode: 2
Episode: 3
Episode: 4
Episode: 5
Episode: 6
Episode: 7
Episode: 8
Episode: 9
Episode: 10
Episode: 11
Episode: 12
Episode: 13
Episode: 14
Episode: 15
Episode: 16
Episode: 17
Episode: 18
Episode: 19


(196.3, 16.12792609110049, 126.0, 200.0)

<font size=4>**Author:**</font>

- ***Prince Raj***