**Assignment-3 – Q-Learning & Deep-Q Networks**

---

**Environment: CartPole-v1 gym**

### **Implementation**

#### **Q-Learning**

In [None]:
import gym
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
env = gym.make('CartPole-v1')

In [None]:
# Setting the number of episodes, learning rate, discount factor, and exploration parameters
num_episodes = 3500
learning_rate = 0.1
discount_factor = 0.99
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.0 # note: with a rate of 0.0 the exploration rate never decays and stays at max_exploration_rate
SHOW_EVERY = 10

In [None]:
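# Setting the number of state dimensions and actions
num_states = 4 # number of observation dimensions (cart position, cart velocity, pole angle, pole angular velocity)
num_actions = 2 # number of possible actions (push left, push right)

# Defining a function to map a continuous state to a discrete state
def discretize_state(state):
    if isinstance(state, tuple): # newer gym versions return (observation, info) from env.reset()
        state = state[0]
    state_array = np.asarray(state, dtype=float) # using the full 4-element observation
    discrete_state = tuple(np.round(state_array * [10, 100, 10, 100]).astype(int)) # scaling, rounding to the nearest integer, and converting to int
    return discrete_state # note: negative values become negative indices, which wrap around in the Q-Table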

In [None]:
# Creating the Q-Table
Q = np.zeros((20, 200, 20, 200, num_actions)) #Storing Q-values while initializing zero
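A dense table of this shape holds 20 × 200 × 20 × 200 × 2 = 32,000,000 entries, most of which are never visited. As a sketch of one alternative, the Q-values can be kept in a dictionary keyed by the discrete state. The names `q_table` and `q_update` are hypothetical, and the terminal-state handling (`done`) is an assumption that the notebook's own loop does not include.

```python
from collections import defaultdict

NUM_ACTIONS = 2
LEARNING_RATE = 0.1
DISCOUNT_FACTOR = 0.99

# Each unseen discrete state starts with a zero Q-value per action.
q_table = defaultdict(lambda: [0.0] * NUM_ACTIONS)


def q_update(q, state, action, reward, next_state, done):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # No future value is bootstrapped from a terminal state.
    max_next_q = 0.0 if done else max(q[next_state])
    td_target = reward + DISCOUNT_FACTOR * max_next_q
    q[state][action] += LEARNING_RATE * (td_target - q[state][action])
    return q[state][action]


first = q_update(q_table, (0, -3, 1, 4), 1, 1.0, (0, -2, 1, 3), done=False)
second = q_update(q_table, (0, -3, 1, 4), 1, 1.0, (0, -2, 1, 3), done=False)
print(first, second)
```

Tuple keys also accept negative components, so the output of `discretize_state` can be used directly without worrying about array bounds.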

In [None]:
#Creating empty lists to store rewards
rl_rewards = []
cum_rewards = [0]
reward_per_move = []
episode = 0

while episode < num_episodes:
    # Resetting the environment
    state = env.reset()
    done = False
    t = 0
    
    # Decaying the exploration rate
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    
    while not done:
        # Choosing an action
        if np.random.uniform() < exploration_rate:
            action = env.action_space.sample()
        else:
            discrete_state = discretize_state(state)
            action = np.argmax(Q[discrete_state])
        
        # Taking the chosen action
        next_state, reward, done, info = env.step(action)[:4]
        
        # Updating the Q-Table
        discrete_state = discretize_state(state)
        next_discrete_state = discretize_state(next_state)
        max_next_q = np.max(Q[next_discrete_state])
        Q[discrete_state + (action,)] += learning_rate * (reward + discount_factor * max_next_q - Q[discrete_state + (action,)]) #Learning process
        
        # Updating the current state
        state = next_state
        t += 1
    rl_rewards.append(t)
    cum_rewards.append(cum_rewards[-1] + t) #adding the cumulative reward to the list

    if episode % SHOW_EVERY == 0 and episode > 0:
      avg_reward = sum(rl_rewards[-10:]) / 10
      avg_reward_per_move = t / env.spec.max_episode_steps
      reward_per_move.append(avg_reward_per_move)
      print("Average Reward per move: ", avg_reward_per_move)
      print("Episodes:", episode-9, "-", episode, "Average Reward:", avg_reward)

    # Printing the total reward for the episode
    #print("Episode:", episode, "Total Reward:", t)
    episode += 1

Average Reward per move:  0.03
Episodes: 1 - 10 Average Reward: 20.4
Average Reward per move:  0.046
Episodes: 11 - 20 Average Reward: 22.8
Average Reward per move:  0.076
Episodes: 21 - 30 Average Reward: 22.3
Average Reward per move:  0.062
Episodes: 31 - 40 Average Reward: 22.5
Average Reward per move:  0.038
Episodes: 41 - 50 Average Reward: 21.7
Average Reward per move:  0.036
Episodes: 51 - 60 Average Reward: 23.9
Average Reward per move:  0.082
Episodes: 61 - 70 Average Reward: 20.7
Average Reward per move:  0.044
Episodes: 71 - 80 Average Reward: 23.6
Average Reward per move:  0.036
Episodes: 81 - 90 Average Reward: 20.7
Average Reward per move:  0.018
Episodes: 91 - 100 Average Reward: 19.1
Average Reward per move:  0.044
Episodes: 101 - 110 Average Reward: 17.9
Average Reward per move:  0.034
Episodes: 111 - 120 Average Reward: 21.0
Average Reward per move:  0.09
Episodes: 121 - 130 Average Reward: 24.2
Average Reward per move:  0.03
Episodes: 131 - 140 Average Reward: 21.7
A

In [None]:
# Running episodes with random agent
print("Random Agent", end="\n\n")
# Creating empty lists
random_rewards = []
random_cum_rewards = [0]
random_reward_per_move = []
episode = 0

while episode < 3500: #Setting up loop
    state = env.reset()
    done = False
    t = 0
    
    while not done:
        action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)[:4]
        state = next_state
        t += 1
    random_rewards.append(t)
    random_cum_rewards.append(random_cum_rewards[-1] + t)

    if episode % SHOW_EVERY == 0 and episode > 0: #Printing average reward earned per move
      avg_reward = sum(random_rewards[-10:]) / 10
      avg_reward_per_move = t / env.spec.max_episode_steps
      random_reward_per_move.append(avg_reward_per_move)
      print("Average Reward per move: ", avg_reward_per_move)
      print("Episodes:", episode-9, "-", episode, "Average Reward:", avg_reward)
    #print("Episode:", episode, "Total Reward:", t)
    episode += 1

Random Agent

Average Reward per move:  0.028
Episodes: 1 - 10 Average Reward: 23.4
Average Reward per move:  0.03
Episodes: 11 - 20 Average Reward: 23.4
Average Reward per move:  0.03
Episodes: 21 - 30 Average Reward: 23.4
Average Reward per move:  0.032
Episodes: 31 - 40 Average Reward: 23.4
Average Reward per move:  0.032
Episodes: 41 - 50 Average Reward: 23.4
Average Reward per move:  0.054
Episodes: 51 - 60 Average Reward: 23.4
Average Reward per move:  0.062
Episodes: 61 - 70 Average Reward: 23.4
Average Reward per move:  0.024
Episodes: 71 - 80 Average Reward: 23.4
Average Reward per move:  0.03
Episodes: 81 - 90 Average Reward: 23.4
Average Reward per move:  0.038
Episodes: 91 - 100 Average Reward: 23.4
Average Reward per move:  0.032
Episodes: 101 - 110 Average Reward: 23.4
Average Reward per move:  0.032
Episodes: 111 - 120 Average Reward: 23.4
Average Reward per move:  0.026
Episodes: 121 - 130 Average Reward: 23.4
Average Reward per move:  0.04
Episodes: 131 - 140 Average R

1.   Display the performance of the random agent





In [None]:
import plotly.graph_objects as go

# Create the figure for rewards of RL-agent
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(len(reward_per_move))), y=reward_per_move))
fig.update_layout(
    title="RL-Agent Rewards per move",
    xaxis_title="Episode",
    yaxis_title="Rewards of RL-Agent per move"
)
fig.show()

# Create the figure for cumulative rewards of RL-agent
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, len(cum_rewards))), y=cum_rewards[1:]))
fig.update_layout(
    title="RL-Agent Cumulative Rewards",
    xaxis_title="Episode",
    yaxis_title="Cumulative Rewards of RL-Agent"
)
fig.show()

# Create the figure for rewards of Random-agent
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(len(random_reward_per_move))), y=random_reward_per_move))
fig.update_layout(
    title="Random Agent Rewards per move",
    xaxis_title="Episode",
    yaxis_title="Rewards of Random RL-Agent per move"
)
fig.show()

# Create the figure for cumulative rewards of Random-agent
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, len(random_cum_rewards))), y=random_cum_rewards[1:]))
fig.update_layout(
    title="Random Agent Cumulative Rewards",
    xaxis_title="Episode",
    yaxis_title="Cumulative Rewards of Random RL-Agent"
)
fig.show()

Cumulative rewards rise continuously over episodes for both agents, which is expected because every step yields a reward of 1. The per-move rewards of the two agents are nearly indistinguishable.
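The exploration schedule from the training loop can be evaluated on its own to see how much of the time the RL-Agent acts randomly. The sketch below reuses the notebook's formula and its minimum and maximum rates; the decay rate of 0.001 is a hypothetical alternative added only for contrast.

```python
import math

MAX_EXPLORATION_RATE = 1.0
MIN_EXPLORATION_RATE = 0.01


def exploration_rate(episode, decay_rate):
    """Exponentially decayed epsilon, same formula as the training loop."""
    span = MAX_EXPLORATION_RATE - MIN_EXPLORATION_RATE
    return MIN_EXPLORATION_RATE + span * math.exp(-decay_rate * episode)


for decay_rate in (0.0, 0.001):  # 0.001 is a hypothetical alternative
    rates = [exploration_rate(ep, decay_rate) for ep in (0, 1000, 3499)]
    print(decay_rate, [round(r, 3) for r in rates])
```

With a decay rate of 0.0 the exponential term is always 1, so the rate stays at its maximum for all episodes; any positive rate moves it toward the minimum as training progresses.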

In [None]:
# Print the average total reward for each agent
print("Average Total Reward of RL-Agent:", np.mean(rl_rewards))
print("Average Cumulative Reward of RL-Agent:", np.mean(cum_rewards))
print("Average Reward Per Move:", np.mean(reward_per_move))

print()
print("Average Total Reward of Random Agent:", np.mean(random_rewards))
print("Average Cumulative Reward of Random Agent:", np.mean(random_cum_rewards))
print("Average Reward Per Move:", np.mean(random_reward_per_move))

Average Total Reward of RL-Agent: 21.717142857142857
Average Cumulative Reward of RL-Agent: 37945.87717794916
Average Reward Per Move: 0.04358739255014327

Average Total Reward of Random Agent: 22.176
Average Cumulative Reward of Random Agent: 38691.828620394175
Average Reward Per Move: 0.04405730659025788


Compare the performance of the random agent to the RL-Agent 
   

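The two agents perform almost identically. In the run shown above, the Random Agent is marginally ahead: its average total reward is 22.18 against 21.72 for the RL-Agent (a gap of about 0.46), and its average reward per move is 0.0441 against 0.0436. The gap of about 745.95 in average cumulative reward is negligible relative to totals near 38,000.

This is expected with the current settings: because exploration_decay_rate is 0.0, the exploration rate stays at 1.0 for every episode, so the RL-Agent always samples random actions during training and never acts on its Q-Table. Which agent comes out slightly ahead therefore varies from run to run.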
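To judge whether a gap this small is more than run-to-run noise, one option is to express the difference of the two means in units of their combined standard error. The sketch below is illustrative only: the function names are hypothetical and the reward lists are made-up numbers, not results from this notebook.

```python
import math
import statistics


def mean_and_standard_error(rewards):
    """Return the sample mean and its standard error."""
    mean = statistics.mean(rewards)
    sem = statistics.stdev(rewards) / math.sqrt(len(rewards))
    return mean, sem


def gap_in_standard_errors(rewards_a, rewards_b):
    """Difference of means divided by the combined standard error."""
    mean_a, sem_a = mean_and_standard_error(rewards_a)
    mean_b, sem_b = mean_and_standard_error(rewards_b)
    return (mean_a - mean_b) / math.sqrt(sem_a ** 2 + sem_b ** 2)


# Made-up episode rewards, for illustration only.
agent_a = [20, 25, 18, 30, 22, 19, 27, 21]
agent_b = [21, 24, 19, 28, 23, 20, 26, 22]
print(gap_in_standard_errors(agent_a, agent_b))
```

In the notebook, `rl_rewards` and `random_rewards` could be passed in place of the made-up lists; a value well inside the range of -2 to 2 would suggest the two agents are not distinguishable.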







--------

#### **Deep-Q Network**

In [None]:
#Importing Libraries
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.optimizers import Adam
from keras import backend as K
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory
from keras.layers import Flatten

In [None]:
state = env.reset() #Resetting environment

In [None]:
def build_seq_model(states, num_actions): #Defining Sequential Model
    model = Sequential()
    model.add(Dense(16, input_shape=(1,) + states, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Flatten())
    model.add(Dense(num_actions, activation='linear'))
    return model

seq_model = build_seq_model(env.observation_space.shape, num_actions) #Creating model without shadowing the builder function

print(seq_model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 1, 16)             80        
                                                                 
 dense_1 (Dense)             (None, 1, 32)             544       
                                                                 
 dense_2 (Dense)             (None, 1, 16)             528       
                                                                 
 flatten (Flatten)           (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 2)                 34        
                                                                 
Total params: 1,186
Trainable params: 1,186
Non-trainable params: 0
_________________________________________________________________
None


We have created a sequential model that takes the state (observation) as input and outputs one Q-value for each possible action.
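The parameter counts in the model summary can be reproduced by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is hypothetical; the layer sizes are the ones used in the model above.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units


layer_sizes = [4, 16, 32, 16, 2]  # observation size, hidden layers, actions
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))  # [80, 544, 528, 34] 1186
```

The Flatten layer only reshapes its input, so it contributes no parameters.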

In [None]:
memory = SequentialMemory(limit=10000, window_length=1) # Creating variable to store memory
policy = BoltzmannQPolicy() # Initializing Boltzmann Q policy
deep_q_network = DQNAgent(model=seq_model, nb_actions=num_actions, memory=memory, nb_steps_warmup=10, target_model_update=1e-2, policy=policy) #Creating DQN Agent
deep_q_network.compile(Adam(lr=1e-3), metrics=['mae']) #Compiling the model
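A Boltzmann policy picks actions at random, with probabilities given by a softmax over the Q-values, so higher-valued actions are chosen more often. The sketch below illustrates the idea with the standard library only; it is not the keras-rl implementation, which may differ in details, and the function names and the temperature default `tau=1.0` are assumptions.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values; tau is the temperature (assumed default 1.0)."""
    scaled = [q / tau for q in q_values]
    highest = max(scaled)                       # subtract max for numerical stability
    weights = [math.exp(s - highest) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


print(boltzmann_probabilities([1.0, 1.0]))
print(boltzmann_action([0.5, 2.0], rng=random.Random(0)))
```

A larger temperature flattens the distribution toward uniform exploration, while a smaller one makes the choice close to greedy.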

In [None]:
deep_q_network.fit(env, nb_steps=3500, visualize=False, verbose=2) # Training DQN agent with 3500 steps 

Training for 3500 steps ...
   18/3500: episode: 1, duration: 0.617s, episode steps:  18, steps per second:  29, episode reward: 18.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.444 [0.000, 1.000],  loss: 0.549746, mae: 0.550902, mean_q: 0.065598
   55/3500: episode: 2, duration: 0.307s, episode steps:  37, steps per second: 121, episode reward: 37.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.514 [0.000, 1.000],  loss: 0.466259, mae: 0.542098, mean_q: 0.190023
   76/3500: episode: 3, duration: 0.164s, episode steps:  21, steps per second: 128, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.429 [0.000, 1.000],  loss: 0.203747, mae: 0.552071, mean_q: 0.566925
   88/3500: episode: 4, duration: 0.109s, episode steps:  12, steps per second: 110, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.750 [0.000, 1.000],  loss: 0.068152, mae: 0.615992, mean_q: 0.940580
   98/3500: episode: 5, duration: 0.091s, episod

<keras.callbacks.History at 0x7f86e3db1c40>

In [None]:
scores = deep_q_network.test(env, nb_episodes=1000, visualize=False) #Finding scores
rl_rewards2 = scores.history['episode_reward'] #Extracting episode reward from the history
cum_rewards2 = [0] # Creating a list to store cumulative rewards
for r in rl_rewards2: # Looping through rewards to generate cumulative rewards
    cum_rewards2.append(cum_rewards2[-1] + r)
print(np.mean(rl_rewards2))

avg_reward_per_move2 = np.mean(scores.history['episode_reward']) / env.spec.max_episode_steps # Calculating average reward per move

Testing for 1000 episodes ...
Episode 1: reward: 189.000, steps: 189
Episode 2: reward: 378.000, steps: 378
Episode 3: reward: 187.000, steps: 187
Episode 4: reward: 305.000, steps: 305
Episode 5: reward: 156.000, steps: 156
Episode 6: reward: 185.000, steps: 185
Episode 7: reward: 215.000, steps: 215
Episode 8: reward: 169.000, steps: 169
Episode 9: reward: 185.000, steps: 185
Episode 10: reward: 173.000, steps: 173
Episode 11: reward: 179.000, steps: 179
Episode 12: reward: 163.000, steps: 163
Episode 13: reward: 198.000, steps: 198
Episode 14: reward: 162.000, steps: 162
Episode 15: reward: 224.000, steps: 224
Episode 16: reward: 157.000, steps: 157
Episode 17: reward: 178.000, steps: 178
Episode 18: reward: 179.000, steps: 179
Episode 19: reward: 207.000, steps: 207
Episode 20: reward: 160.000, steps: 160
Episode 21: reward: 189.000, steps: 189
Episode 22: reward: 183.000, steps: 183
Episode 23: reward: 440.000, steps: 440
Episode 24: reward: 193.000, steps: 193
Episode 25: reward:

In [None]:
reward_per_move2 = [x/env.spec.max_episode_steps for x in scores.history['episode_reward']] # Creating a list for getting rewards per move

In [None]:
# Creating lists to store rewards
random_rewards2 = []
random_cum_rewards2 = [0]
random_reward_per_move2 = []

# Looping 1000 times to calculate random rewards
for i in range(1000):
    state = env.reset() # Resetting the environment
    done = False
    score = 0
    while not done:
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)[:4] # Taking action and storing its values into variables
        score += reward # Storing reward in score var
    random_rewards2.append(score) # Appending score into random_rewards2
    random_cum_rewards2.append(random_cum_rewards2[-1] + score) # Appending cumulative rewards using this episode's score
    
    avg_reward_per_move2_random = score / env.spec.max_episode_steps # Calculating average reward per move
    random_reward_per_move2.append(avg_reward_per_move2_random) # Appending reward per move into a list
    print("Average Reward per move of Random Agent: ", avg_reward_per_move2_random) # Printing Average Reward per move of Random Agent

Average Reward per move of Random Agent:  0.08
Average Reward per move of Random Agent:  0.034
Average Reward per move of Random Agent:  0.024
Average Reward per move of Random Agent:  0.088
Average Reward per move of Random Agent:  0.036
Average Reward per move of Random Agent:  0.026
Average Reward per move of Random Agent:  0.028
Average Reward per move of Random Agent:  0.038
Average Reward per move of Random Agent:  0.026
Average Reward per move of Random Agent:  0.04
Average Reward per move of Random Agent:  0.058
Average Reward per move of Random Agent:  0.022
Average Reward per move of Random Agent:  0.042
Average Reward per move of Random Agent:  0.022
Average Reward per move of Random Agent:  0.04
Average Reward per move of Random Agent:  0.026
Average Reward per move of Random Agent:  0.042
Average Reward per move of Random Agent:  0.038
Average Reward per move of Random Agent:  0.03
Average Reward per move of Random Agent:  0.02
Average Reward per move of Random Agent:  0.0

In [None]:
# Visualizing the performance of agents
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(len(reward_per_move2))), y=reward_per_move2))
fig.update_layout(
    title="DQN RL-Agent Rewards per move",
    xaxis_title="Episode", 
    yaxis_title="Rewards of DQN RL-Agent per move"
)
fig.show()

# Create the figure for cumulative rewards of RL-agent
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, len(cum_rewards2))), y=cum_rewards2[1:]))
fig.update_layout(
    title="DQN RL-Agent Cumulative Rewards",
    xaxis_title="Episode",
    yaxis_title="Cumulative Rewards of DQN RL-Agent"
)
fig.show()

# Create the figure for rewards of Random-agent
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(len(random_reward_per_move2))), y=random_reward_per_move2))
fig.update_layout(
    title="Random Agent Rewards per move",
    xaxis_title="Episode",
    yaxis_title="Rewards of Random RL-Agent per move"
)
fig.show()

# Create the figure for cumulative rewards of Random-agent
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, len(random_cum_rewards2))), y=random_cum_rewards2[1:]))
fig.update_layout(
    title="Random Agent Cumulative Rewards",
    xaxis_title="Episode",
    yaxis_title="Cumulative Rewards of Random RL-Agent"
)
fig.show()

Cumulative rewards again rise continuously over episodes for both agents. This time there is a large gap in reward per move: about 0.38 for the DQN agent against about 0.045 for the random agent.

In [None]:
# Print the average total reward for each agent
print("Average Total Reward of DQN RL-Agent:", np.mean(rl_rewards2))
print("Average Cumulative Reward of DQN RL-Agent:", np.mean(cum_rewards2))
print("Average Reward Per Move of DQN RL-Agent: ", avg_reward_per_move2)

print()
print("Average Total Reward of Random Agent:", np.mean(random_rewards2))
print("Average Cumulative Reward of Random Agent:", np.mean(random_cum_rewards2))
print("Average Reward Per Move of Random Agent: ", np.mean(random_reward_per_move2))

Average Total Reward of DQN RL-Agent: 189.678
Average Cumulative Reward of DQN RL-Agent: 95256.32867132867
Average Reward Per Move of DQN RL-Agent:  0.37935599999999997

Average Total Reward of Random Agent: 22.736
Average Cumulative Reward of Random Agent: 13500.0
Average Reward Per Move of Random Agent:  0.045472000000000005


 *   Compare the performance of the random agent to the RL-Agent

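The DQN agent performs significantly better than the random agent. Its average total reward per episode is 189.68, against 22.74 for the random agent, and its average reward per move is 0.379, against 0.045. For cumulative rewards, the DQN agent averages 95256, while the random agent's reported value is 13500. That value is exactly 500 × 27, which is what a fixed increment of 27 per episode would produce, so it should be read with caution. Overall, the DQN agent has learned a far better balancing policy than random action selection.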
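The averaging behind the cumulative figures can be checked with a short calculation. The sketch below builds a running total with a leading 0, as the notebook does, and averages it; the helper name `mean_cumulative` is hypothetical. A constant increment of 27 over 1000 episodes reproduces the random agent's reported 13500.0, which suggests that figure does not reflect each episode's own score.

```python
from itertools import accumulate
from statistics import mean


def mean_cumulative(increments):
    """Average of the running total, including the leading 0 entry."""
    running = [0] + list(accumulate(increments))
    return mean(running)


# A constant increment c over n episodes averages to c * n / 2.
print(mean_cumulative([27] * 1000))  # 13500
```

Because the average of a running total grows with the number of episodes, it is only comparable between agents evaluated over the same episode count; the per-episode mean is the more direct measure.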

**Contribution**

**Nisha Siwach**

---


 

1.   Setting up the libraries and hyper-parameters
2.   Creating necessary functions
3.   Processed data that could be used for model training


**Akshaykumar Thakare**

---



1.  Created code for Q-learning and setting up Q-table
2.  Performed testing and training for random and RL agent.
3.  Plotting Graphs for comparison


**Jay Sureshbhai Mangukiya**

---
1.  Created code for Deep-Q learning
2.  Plotted graphs and performed testing and training for random and RL agent
3.  Performed testing and training for random and RL agent

