In [1]:
# Reinforcement Learning
# Deep Q Networks - DQN (used for the cartpole scenario below where action space is discrete and small)

'''
High-level summary:
An AGENT takes ACTIONS in the ENV, which returns REWARDS (or penalties) based on the corresponding STATE/ACTION

Background info:
state - a snapshot of the environment at a given time; the "environment" is the entire system or context in which the agent operates

agent - a deep NN here; in the general formulation it takes state and action(*) as inputs and outputs/predicts the corresponding optimal Q-value; this is “value learning” (vs “policy learning”)
(*) alternatively (as in the implementation below for cartpole, which has only 2 possible actions - left or right): action is NOT an input, and the deep NN outputs one Q-value for each possible action
 -> during training, the “actual” action taken may or may NOT be the action with the highest Q-value; explore vs exploit, see hyper-parameters epsilon/epsilon_decay

Process:
using “state” from the ENV, the DQNAgent either explores or exploits [DQNAgent’s act()] and provides an “action” to the ENV -> the ENV then returns “reward, next_state, done”
.. the DQNAgent saves this state/action/reward/next_state/done combo into its memory, which is randomly sampled to train its DNN model
.. during training (using the mse loss function), the target is “reward + gamma * (highest Q-value the model predicts for next_state, over all possible next actions)” .. as the agent's policy(*) here is to choose the action with the highest Q-value for a given state
(*) policy, pi(s), takes state as input and action is the output

Q-value - the output of the Q-function, Q(s, a): the expected total of the immediate reward plus discounted (see gamma, the discount factor) future rewards for taking a certain action in a certain state

Q-values of the actions not taken are NOT updated - only the Q-value for the action actually taken gets a new target, based on the reward received and the maximum Q-value of the next state. For the other actions, the target is the network's own current prediction, so they contribute no loss in that iteration. This holds whether the action was chosen by exploring or exploiting; the Q-values of the actions not taken remain in use for future exploitation.

Gamma (0 to 1) is the discount factor - it determines how much the agent values future rewards compared to immediate rewards. A higher gamma (closer to 1) means the agent prioritizes long-term rewards, while a lower gamma (closer to 0) emphasizes immediate gains.

Discounted total rewards - theoretically the sum of all current and future rewards, up to infinity; the implementation below uses a one-step target instead: the immediate reward plus gamma times the model's highest Q-value estimate for the next state, which stands in for all later rewards
'''
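
The two quantities described in the notes above (the discounted return and the one-step Q-learning target) can be sketched in plain Python. The function names `discounted_return` and `td_target` are hypothetical helpers for illustration only, not part of the agent defined later; the reward and Q-value numbers are made up.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (illustrative helper)."""
    total = 0.0
    for r in reversed(rewards):  # fold from the last reward back to the first
        total = r + gamma * total
    return total


def td_target(reward, next_q_values, gamma, done):
    """One-step target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


gamma = 0.95  # illustrative discount factor
three_step_return = discounted_return([1.0, 1.0, 1.0], gamma)  # 1 + 0.95 + 0.95**2
non_terminal_target = td_target(1.0, [0.5, 2.0], gamma, done=False)  # 1 + 0.95 * 2.0
terminal_target = td_target(-10.0, [0.5, 2.0], gamma, done=True)  # no bootstrap term
print(three_step_return, non_terminal_target, terminal_target)
```

The one-step target never sums the future rewards explicitly; the `max(next_q_values)` term is the model's estimate of everything that follows.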



In [2]:
# Warning control
import warnings
warnings.filterwarnings('ignore') #this does NOT suppress the many *ms/step logs from keras, see verbose=0 below

In [3]:
'''
!pip freeze | grep numpy
!pip freeze | grep gym
!pip freeze | grep tensorflow
'''
!pip freeze | findstr numpy
!pip freeze | findstr gym
!pip freeze | findstr tensorflow
!pip freeze | findstr keras

numpy==1.21.3
gym==0.26.2
gym-notices==0.0.8
gymnasium==1.1.1
tensorflow==2.10.0
tensorflow-estimator==2.10.0
tensorflow-io-gcs-filesystem==0.31.0
keras==2.10.0


In [4]:
# downgraded to tensorflow 2.10 in order to use the GPU: TF versions above 2.10 do not support the GPU on native Windows (see install notes below), hence NOT using these imports:
'''
from tensorflow import keras
from keras import Sequential
from keras import layers
from keras.api.layers import Dense
from keras.optimizers import Nadam
'''

# from https://www.tensorflow.org/install/pip
'''
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
# Anything above 2.10 is not supported on the GPU on Windows Native
python -m pip install "tensorflow<2.11"
# Verify the installation:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
'''



In [5]:
import random
import gym
#import gymnasium as gym
import numpy as np
import tensorflow as tf
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Nadam
import os # for creating directories

In [6]:
print(tf.config.list_physical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


#### Set hyperparameters

In [7]:
import logging
# Set the logging level to suppress "ms/step" messages
#logging.getLogger("gymnasium").setLevel(logging.CRITICAL) #this does NOT help though
#logging.getLogger("gym").setLevel(logging.CRITICAL) #this does NOT help though
logging.getLogger("openai").setLevel(logging.ERROR)


In [8]:
env = gym.make('CartPole-v0') # initialise environment

In [9]:
state_size = env.observation_space.shape[0]
state_size

4

In [10]:
action_size = env.action_space.n
action_size

2

In [11]:
batch_size = 32

In [12]:
n_episodes = 1000 # n games we want agent to play

In [13]:
output_dir = 'model_output/cartpole/'

In [14]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

#### Define agent

In [15]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000) # replay memory: a deque with maxlen drops the oldest experience automatically once full; order is otherwise irrelevant since train() samples at random
        self.gamma = 0.95 # discount rate: agent takes future rewards into account in addition to the immediate one, but discounted at this rate per time step
        self.epsilon = 1.0 # exploration rate: how much to act randomly (1.0 -> starting 100% randomly); more initially than later due to epsilon decay
        self.epsilon_decay = 0.995 # decrease number of random explorations as the agent's performance (hopefully) improves over time
        self.epsilon_min = 0.01 # minimum amount of random exploration permitted
        self.learning_rate = 0.001 # rate at which the optimizer (Nadam, see _build_model) adjusts the model's parameters to reduce cost
        self.model = self._build_model() # private method

    def _build_model(self):
        # neural net to approximate Q-value function:
        model = Sequential()
        model.add(Dense(32, activation='relu',
                        input_dim=self.state_size)) # 1st hidden layer; states as input
        model.add(Dense(32, activation='relu')) # 2nd hidden layer
        model.add(Dense(self.action_size, activation='linear')) # 2 actions, so 2 output neurons: 0 and 1 (L/R)
        model.compile(loss='mse', optimizer=Nadam(learning_rate=self.learning_rate))
                      #optimizer=Nadam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done)) # list of previous experiences, enabling re-training later

    def train(self, batch_size): # method that trains NN with experiences sampled from memory
        minibatch = random.sample(self.memory, batch_size) # sample a minibatch from memory
        for state, action, reward, next_state, done in minibatch: # extract data for each minibatch sample
            target = reward # if done (boolean whether game ended or not, i.e., whether final state or not), then target = reward
            if not done: # if not done, then predict future discounted reward
                target = (reward + self.gamma * np.amax(self.model.predict(next_state, verbose=0)[0]))
                # target Q-value = reward + (discount rate gamma) * (maximum target Q-value based on future state s' and future action a')
            target_f = self.model.predict(state, verbose=0) # current Q-value predictions for every action in this state
            target_f[0][action] = target # update ONLY the Q-value for the action taken; the other entries keep the network's own predictions, so they add no loss
            self.model.fit(state, target_f, epochs=1, verbose=0) # single epoch of training with x=state, y=target_f; fit reduces the mse loss between target_f and the prediction y_hat
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def act(self, state):
        if np.random.rand() <= self.epsilon: # if acting randomly, take random action
            return random.randrange(self.action_size)
        act_values = self.model.predict(state, verbose=0) # if not acting randomly, predict the Q-value of each action for the current state
        return np.argmax(act_values[0]) # pick the action with the highest predicted Q-value (here: push the cart left or right)

    def save(self, name):
        self.model.save_weights(name)

    def load(self, name):
        self.model.load_weights(name)
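
The `train` method above calls `predict` and `fit` once per sampled experience. A possible alternative, sketched here under assumptions, is to build all targets for the minibatch first and fit once. `build_batch_targets` and `toy_q` are hypothetical names; `q_fn` stands in for something like `model.predict` and is assumed to map an array of shape (N, state_size) to Q-values of shape (N, action_size). Note that this is not identical to the per-sample loop: every target is computed from the same weights, before any gradient step.

```python
import numpy as np


def build_batch_targets(q_fn, minibatch, gamma):
    """Return (states, targets) for a single batched fit call (illustrative sketch)."""
    # Each stored state has shape (1, state_size), so vstack gives (N, state_size).
    states = np.vstack([s for s, _, _, _, _ in minibatch])
    next_states = np.vstack([ns for _, _, _, ns, _ in minibatch])
    actions = np.array([a for _, a, _, _, _ in minibatch])
    rewards = np.array([r for _, _, r, _, _ in minibatch], dtype=float)
    dones = np.array([d for _, _, _, _, d in minibatch], dtype=bool)

    targets = np.array(q_fn(states), dtype=float)    # start from current predictions
    max_next_q = np.max(q_fn(next_states), axis=1)   # best Q-value in each next state
    updated = rewards + gamma * max_next_q * (~dones)  # no bootstrap term when done
    targets[np.arange(len(minibatch)), actions] = updated  # only the action taken changes
    return states, targets


# Toy stand-in for the network: Q-values are the first two state features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])


def toy_q(x):
    return x @ W


batch = [
    (np.array([[1.0, 2.0, 0.0, 0.0]]), 0, 1.0, np.array([[3.0, 4.0, 0.0, 0.0]]), False),
    (np.array([[5.0, 6.0, 0.0, 0.0]]), 1, -10.0, np.array([[7.0, 8.0, 0.0, 0.0]]), True),
]
states, targets = build_batch_targets(toy_q, batch, gamma=0.95)
print(targets)
```

Inside the agent, the resulting arrays could then be passed to a single `self.model.fit(states, targets, epochs=1, verbose=0)` call per minibatch, which typically cuts the per-sample Keras call overhead.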

#### Interact with environment

In [16]:
agent = DQNAgent(state_size, action_size) # initialise agent
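
On the deque-versus-list question raised in the agent's `memory` comment: a small standalone sketch of what `deque(maxlen=...)` does. The capacity of 3 and the integer "experiences" are illustrative only.

```python
import random
from collections import deque

memory = deque(maxlen=3)      # tiny capacity so the eviction is easy to see
for step in range(5):
    memory.append(step)       # once full, each append drops the oldest item

contents = list(memory)       # the oldest entries (0 and 1) have been evicted
sample = random.sample(memory, 2)  # random sampling works on a deque directly
print(contents, sample)
```

A plain list would need an explicit `pop(0)` to cap its size, which shifts every remaining element; the deque drops the oldest item in constant time. Indexing into the middle of a deque is slower than into a list, but at a capacity of 2000 that cost is likely negligible next to the model calls.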

In [17]:
for e in range(1, n_episodes + 1): # iterate over episodes of gameplay
    state, _ = env.reset() # reset state at start of each new episode of the game; gym 0.26 returns (observation, info)
    state = np.reshape(state, [1, state_size])

    done = False
    time = 0 # time represents a frame of the episode; goal is to keep pole upright as long as possible
    while not done:
        # env.render()
        action = agent.act(state) # action is either 0 or 1 (move cart left or right); decide on one or other here
        next_state, reward, terminated, truncated, _ = env.step(action) # agent interacts with env, gets feedback; 4 state data points: cart position, cart velocity, pole angle, pole angular velocity
        done = terminated or truncated
        reward = reward if not done else -10 # env already returns +1 for each additional frame with pole upright; penalty of -10 when the episode ends
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done) # remember the previous timestep's state, actions, reward, etc.
        state = next_state # set "current state" for upcoming iteration to the current next state
        if done: # if episode ends:
            print("episode: {}/{}, score: {}, epsilon: {:.2}".format(e, n_episodes, time, agent.epsilon)) # print the episode's score and agent's epsilon
        time += 1
    if len(agent.memory) > batch_size:
        agent.train(batch_size) # train the agent on a random minibatch sampled from replay memory (experiences from all episodes so far, not just this one)
    if e % 100 == 0:
        agent.save(output_dir + '{:04d}'.format(e) + ".weights" + ".h5")

episode: 1/1000, score: 36, epsilon: 1.0
episode: 2/1000, score: 21, epsilon: 0.99
episode: 3/1000, score: 13, epsilon: 0.99
episode: 4/1000, score: 42, epsilon: 0.99
episode: 5/1000, score: 20, epsilon: 0.98
episode: 6/1000, score: 10, epsilon: 0.98
episode: 7/1000, score: 37, epsilon: 0.97
episode: 8/1000, score: 14, epsilon: 0.97
episode: 9/1000, score: 13, epsilon: 0.96
episode: 10/1000, score: 12, epsilon: 0.96
episode: 11/1000, score: 15, epsilon: 0.95
episode: 12/1000, score: 14, epsilon: 0.95
episode: 13/1000, score: 15, epsilon: 0.94
episode: 14/1000, score: 18, epsilon: 0.94
episode: 15/1000, score: 22, epsilon: 0.93
episode: 16/1000, score: 39, epsilon: 0.93
episode: 17/1000, score: 13, epsilon: 0.92
episode: 18/1000, score: 24, epsilon: 0.92
episode: 19/1000, score: 29, epsilon: 0.91
episode: 20/1000, score: 14, epsilon: 0.91
episode: 21/1000, score: 15, epsilon: 0.9
episode: 22/1000, score: 32, epsilon: 0.9
episode: 23/1000, score: 13, epsilon: 0.9
episode: 24/1000, score:
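
The epsilon column in the log above falls slowly because `train` multiplies epsilon by `epsilon_decay` once per call, and the loop calls `train` once per episode. A minimal sketch of how many decays it takes to reach the floor, using the agent's values (1.0, 0.995, 0.01); `decays_until_floor` is a hypothetical helper that mirrors the `if self.epsilon > self.epsilon_min` check.

```python
def decays_until_floor(epsilon=1.0, decay=0.995, floor=0.01):
    """Count multiplicative decays until epsilon is no longer above the floor."""
    steps = 0
    while epsilon > floor:   # same condition as in DQNAgent.train
        epsilon *= decay
        steps += 1
    return steps, epsilon


steps, final_epsilon = decays_until_floor()
print(steps, final_epsilon)
```

With these settings the floor is reached only after roughly nine hundred training calls, so the agent keeps exploring for most of a 1000-episode run. Because the check happens before the multiplication, the final value ends up slightly below `epsilon_min` rather than exactly at it.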

In [20]:
# saved agents can be loaded with agent.load("./path/filename.h5")

In [21]:
!pip list

Package                      Version
---------------------------- -----------
absl-py                      2.2.2
asttokens                    3.0.0
astunparse                   1.6.3
cachetools                   5.5.2
certifi                      2025.4.26
charset-normalizer           3.4.2
cloudpickle                  3.1.1
colorama                     0.4.6
comm                         0.2.1
debugpy                      1.8.11
decorator                    5.1.1
exceptiongroup               1.2.0
executing                    0.8.3
Farama-Notifications         0.0.4
flatbuffers                  25.2.10
gast                         0.4.0
google-auth                  2.40.2
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.71.0
gym                          0.26.2
gym-notices                  0.0.8
gymnasium                    1.1.1
h5py                         3.13.0
idna                         3.10
ipykernel                    6.29.5
i