## Q-learning

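The Q-function is updated according to the rule
### $Q_{k+1}(s, a) = (1-\alpha) \cdot Q_{k}(s, a) + \alpha \cdot \hat{Q}(s,a) =  Q_{k}(s, a) + \alpha \cdot (\hat{Q}(s,a) - Q_{k}(s, a))$
Here $\alpha$ is the learning rate and $\hat{Q}(s,a)$ is the new target estimate for the pair $(s, a)$.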
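As a small illustration, the sketch below applies this update to a tabular Q-function stored as a NumPy array. The target $\hat{Q}(s,a)$ is assumed to be the one-step Q-learning target $r + \gamma \max_{a'} Q_k(s', a')$, the same form used in the neural-network update later in this notebook. The function name `q_update` and the toy numbers are hypothetical.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning step in place and return the new Q[s, a]."""
    # Assumed form of the target Q-hat: reward plus discounted best next value.
    target = r + gamma * np.max(Q[s_next])
    # Incremental form of the rule: Q + alpha * (target - Q).
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
    return Q[s, a]

# Toy table with 2 states and 2 actions.
Q = np.zeros((2, 2))
Q[1] = [1.0, 2.0]

old_value = Q[0, 1]
new_value = q_update(Q, s=0, a=1, r=1.0, s_next=1)

# The convex-combination form of the rule gives the same number.
convex_form = (1 - 0.5) * old_value + 0.5 * (1.0 + 0.9 * 2.0)
print(new_value, convex_form)
```

With $\alpha = 1$ the old value is discarded entirely, and with $\alpha = 0$ nothing is learned. Intermediate values average the target over many visits.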


## Q-learning with a neural network
Now the $Q$-function depends on the parameters $\theta$ of the neural network that approximates it, so each iteration performs the update

### $Q_{\theta}(s, a) \leftarrow Q_{{\theta}}(s, a) + \alpha \cdot (r(s,a) + \gamma \cdot \max_{a'}{Q_{\theta}(s',a')} - Q_{\theta}(s, a))$

Suppose we have the current version $Q_{\theta_k}(s, a)$ and want to perform one step of fixed-point iteration for the Bellman optimality equation:
### $Q^*(s,a) =  r(s,a) + \gamma E_{s'}[\max_{a'}{Q^*(s',a')}]$

This amounts to solving the following regression problem:

- the input is the pair $s, a$

- the target value at the pair $s, a$ is the right-hand side of the Bellman optimality equation, that is
### $f(s, a) \leftarrow r(s,a) + \gamma E_{s'}[\max_{a'}{Q_{\theta_k}(s',a')}]$

- the observed value is
### $y(s, a) \leftarrow r(s,a) + \gamma \max_{a'}{Q_{\theta_k}(s',a')},$
where $s' \sim p(s'|s,a).$

- the loss function is defined as
### $Loss(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$
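The difference between the target $f(s,a)$ and the observed value $y(s,a)$ can be made concrete on a toy example. The sketch below fixes one pair $(s, a)$ with two possible next states, computes $f$ exactly, and compares it with the average of many sampled values of $y$. All numbers, the two-state transition distribution, and the helper `loss` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
r = 1.0

# Assumed transition distribution p(s' | s, a) for one fixed pair (s, a).
p_next = np.array([0.25, 0.75])

# Assumed current estimates Q_{theta_k}(s', a') for the two next states.
q_next = np.array([[0.0, 2.0],
                   [4.0, 1.0]])

# max over a' for each next state.
max_q = q_next.max(axis=1)

# Exact regression target: the expectation over s'.
f = r + gamma * (p_next @ max_q)

# Observed values: one sampled next state per observation.
s_samples = rng.choice(2, size=100_000, p=p_next)
y = r + gamma * max_q[s_samples]

def loss(y_obs, y_hat):
    """Squared loss from the text: 0.5 * (y - y_hat)^2."""
    return 0.5 * (y_obs - y_hat) ** 2

print("exact target f:", f)
print("mean of sampled y:", y.mean())
print("mean loss when predicting f:", loss(y, f).mean())
```

Each single $y$ is noisy, but the squared loss is minimized by the conditional mean of $y$, so regressing on sampled values still pulls the prediction toward $f$.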

In [1]:
# https://github.com/gsurma/cartpole
    
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import adam_v2

### Implementation 1

In [2]:
ENV_NAME = "CartPole-v1"

GAMMA = 0.95
LEARNING_RATE = 0.001

MEMORY_SIZE = 1000000
BATCH_SIZE = 20

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
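To get a feel for the exploration constants, the sketch below computes how many multiplicative decays it takes for the exploration rate to fall from `EXPLORATION_MAX` to `EXPLORATION_MIN`. The helper `steps_until_min` is hypothetical. It assumes the rate is multiplied by `EXPLORATION_DECAY` once per training call, which is how the solver defined below applies it.

```python
import math

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995

def steps_until_min(eps_max, eps_min, decay):
    """Smallest n such that eps_max * decay**n <= eps_min."""
    return math.ceil(math.log(eps_min / eps_max) / math.log(decay))

n_steps = steps_until_min(EXPLORATION_MAX, EXPLORATION_MIN, EXPLORATION_DECAY)
print(n_steps)  # 919
```

Since the decay is applied per environment step rather than per episode, the agent becomes almost fully greedy after under a thousand steps of training.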

In [3]:
class DQNSolver:

    def __init__(self, observation_space, action_space):
        self.exploration_rate = EXPLORATION_MAX

        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)

        self.model = Sequential()
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))
        self.model.add(Dense(24, activation="relu"))
        self.model.add(Dense(self.action_space, activation="linear"))
        self.model.compile(loss="mse", optimizer=adam_v2.Adam(learning_rate=LEARNING_RATE))

        
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

        
    def act(self, state):
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        
        q_values = self.model.predict(state)
        
        return np.argmax(q_values[0])

    
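    def experience_replay(self):
        # Wait until the buffer holds at least one full batch.
        if len(self.memory) < BATCH_SIZE:
            return
        
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # For a terminal transition the target is just the reward.
            q_update = reward
            
            if not terminal:
                # Otherwise bootstrap from the best next-state Q-value.
                q_update = reward + GAMMA * np.amax(self.model.predict(state_next)[0])
                
            # Overwrite only the taken action's Q-value. The other outputs keep
            # their predicted values and so contribute zero error.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
            
        # Decay epsilon once per replay call, but never below the minimum.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)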
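`experience_replay` calls `predict` and `fit` once per sampled transition, which is slow. As a sketch, the same regression targets could be assembled for the whole batch at once. The helper `build_targets` below is hypothetical, and the plain NumPy arrays stand in for the outputs of `model.predict` on stacked states and next states.

```python
import numpy as np

def build_targets(q_current, q_next, actions, rewards, terminals, gamma=0.95):
    """Return a copy of q_current where each taken action's entry
    is replaced by the one-step Q-learning target."""
    terminals = np.asarray(terminals, dtype=bool)
    targets = q_current.copy()
    # No bootstrapping from the next state for terminal transitions.
    bootstrap = gamma * q_next.max(axis=1) * (~terminals)
    targets[np.arange(len(actions)), actions] = rewards + bootstrap
    return targets

# Stand-ins for network predictions on a batch of two transitions.
q_current = np.array([[0.5, 0.2],
                      [0.1, 0.9]])
q_next = np.array([[1.0, 3.0],
                   [2.0, 0.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, -1.0])
terminals = np.array([False, True])

targets = build_targets(q_current, q_next, actions, rewards, terminals)
print(targets)
```

Fitting on such a batch in a single call would replace `BATCH_SIZE` single-sample gradient steps with one minibatch step, so it is faster but not strictly equivalent to the loop above.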

In [17]:
def cartpole():
    env = gym.make(ENV_NAME)
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    dqn_solver = DQNSolver(observation_space, action_space)
    
    # Training: 100 episodes, one replay update after every environment step.
    for i in range(100):
        state = env.reset()
        state = np.reshape(state, [1, observation_space])
        episode_len = 0
        
        while True:
            action = dqn_solver.act(state)
            state_next, reward, terminal_state, info = env.step(action)
            # Penalize the transition that ends the episode.
            reward = reward if not terminal_state else -reward
            state_next = np.reshape(state_next, [1, observation_space])
            dqn_solver.remember(state, action, reward, state_next, terminal_state)
            dqn_solver.experience_replay()
            state = state_next
            episode_len += 1
            
            if terminal_state:
                print(i, '-', episode_len)
                break
    
    # Run one more episode with the trained agent and print the visited states.
    # Note: act() is still epsilon-greedy here, so a small share of actions is random.
    state = env.reset()
    state = np.reshape(state, [1, observation_space])
    
    while True:
        #env.render()
        action = dqn_solver.act(state)
        state_next, reward, terminal, info = env.step(action)
        state = np.reshape(state_next, [1, observation_space])
        print(state)
          
        if terminal:
            break

In [18]:
cartpole()

0 - 15
1 - 12
2 - 10
3 - 17
4 - 11
5 - 15
6 - 15
7 - 13
8 - 10
9 - 11
10 - 19
11 - 12
12 - 10
13 - 23
14 - 10
15 - 11
16 - 18
17 - 17
18 - 37
19 - 25
20 - 33
21 - 42
22 - 52
23 - 30
24 - 58
25 - 81
26 - 48
27 - 73
28 - 86
29 - 73
30 - 95
31 - 112
32 - 124
33 - 125
34 - 92
35 - 117
36 - 119
37 - 117
38 - 140
39 - 202
40 - 132
41 - 130
42 - 148
43 - 168
44 - 177
45 - 139
46 - 159
47 - 135
48 - 240
49 - 182
50 - 180
51 - 150
52 - 175
53 - 151
54 - 172
55 - 180
56 - 167
57 - 175
58 - 159
59 - 185
60 - 164
61 - 144
62 - 142
63 - 148
64 - 155
65 - 166
66 - 140
67 - 153
68 - 192
69 - 137
70 - 186
71 - 169
72 - 155
73 - 151
74 - 140
75 - 148
76 - 133
77 - 134
78 - 144
79 - 150
80 - 147
81 - 157
82 - 182
83 - 139
84 - 140
85 - 135
86 - 158
87 - 176
88 - 170
89 - 169
90 - 158
91 - 158
92 - 144
93 - 134
94 - 177
95 - 141
96 - 194
97 - 169
98 - 186
99 - 129
[[ 0.00334804 -0.2178733  -0.04252609  0.26176864]]
[[-0.00100943 -0.02217093 -0.03729071 -0.04401844]]
[[-0.00145285 -0.21673885 -0.03817108 

### Implementation 2

In [2]:
import tensorflow as tf
from tensorflow import keras
import time

In [3]:
RANDOM_SEED = 5
tf.random.set_seed(RANDOM_SEED)

env = gym.make('CartPole-v1')
env.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print("Action Space: {}".format(env.action_space))
print("State space: {}".format(env.observation_space))

# An episode is one full game
train_episodes = 300
test_episodes = 100

Action Space: Discrete(2)
State space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)


In [4]:
def agent(state_shape, action_shape):
    """ Build the Q-network, which maps a state to one Q-value per action.
    E.g. if the network output is [.1, .7, .1, .3], the highest value, 0.7,
    is the Q-value of the greedy action, and its index, 1, is that action.
    """
    learning_rate = 0.001
    init = tf.keras.initializers.HeUniform()
    model = keras.Sequential()
    model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(action_shape, activation='linear', kernel_initializer=init))
    model.compile(loss=tf.keras.losses.Huber(), optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])
    return model

def get_qs(model, state, step):
    return model.predict(state.reshape([1, state.shape[0]]))[0]

def train(env, replay_memory, model, target_model, done):
    # Step size for blending the old Q-value with the new target
    # (the alpha of the update rule, not the optimizer's learning rate).
    learning_rate = 0.7
    discount_factor = 0.618

    # Do not train until enough transitions have been collected.
    MIN_REPLAY_SIZE = 1000
    if len(replay_memory) < MIN_REPLAY_SIZE:
        return

    batch_size = 64 * 2
    mini_batch = random.sample(replay_memory, batch_size)
    # Current-state Q-values come from the main network,
    # next-state Q-values from the target network.
    current_states = np.array([transition[0] for transition in mini_batch])
    current_qs_list = model.predict(current_states)
    new_current_states = np.array([transition[3] for transition in mini_batch])
    future_qs_list = target_model.predict(new_current_states)

    X = []
    Y = []
    for index, (observation, action, reward, new_observation, done) in enumerate(mini_batch):
        # One-step target: reward plus the discounted best next-state value.
        if not done:
            target_q = reward + discount_factor * np.max(future_qs_list[index])
        else:
            target_q = reward

        # Move only the taken action's Q-value toward the target.
        current_qs = current_qs_list[index]
        current_qs[action] = (1 - learning_rate) * current_qs[action] + learning_rate * target_q

        X.append(observation)
        Y.append(current_qs)
    model.fit(np.array(X), np.array(Y), batch_size=batch_size, verbose=0, shuffle=True)
    
def play(model):
    env = gym.make('CartPole-v1')
    observation = env.reset()
    acc_reward = 0
    done = False
    
    while not done:
        env.render()
        
        encoded = observation
        encoded_reshaped = encoded.reshape([1, encoded.shape[0]])
        predicted = model.predict(encoded_reshaped).flatten()
        action = np.argmax(predicted)        
        observation, reward, done, info = env.step(action)
        acc_reward += reward
        
    env.close()    
    return acc_reward    
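Unlike the squared loss $\frac{1}{2}(y - \hat{y})^2$ from the theory section, `agent` compiles the network with `tf.keras.losses.Huber()`. The NumPy sketch below shows the per-element Huber loss with threshold `delta=1.0`, which is the Keras default. The function `huber` is a hypothetical stand-in written for illustration and does not reproduce the averaging that Keras applies over a batch.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Per-element Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear)

errors = np.array([0.5, 3.0])
huber_values = huber(np.zeros(2), errors)
squared_values = 0.5 * errors ** 2

print(huber_values)    # small error matches the squared loss, large error grows linearly
print(squared_values)
```

For large errors, the gradient of the Huber loss is bounded by `delta`, which tends to make training less sensitive to occasional very wrong bootstrapped targets.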

In [5]:
epsilon = 1 
max_epsilon = 1 
min_epsilon = 0.01 
decay = 0.01

model = agent(env.observation_space.shape, env.action_space.n)

target_model = agent(env.observation_space.shape, env.action_space.n)
target_model.set_weights(model.get_weights())

replay_memory = deque(maxlen=50_000)

target_update_counter = 0

# X = states, y = actions
X, y = [], []

steps_to_update_target_model = 0

for episode in range(train_episodes):
    total_training_rewards = 0
    observation = env.reset()
    done = False
    
    while not done:
        steps_to_update_target_model += 1
        if True:
            env.render()

        random_number = np.random.rand()
        # 2. Explore using the Epsilon Greedy Exploration Strategy
        if random_number <= epsilon:
            # Explore
            action = env.action_space.sample()
        else:
            # Exploit best known action
            # model input has shape (batch, env.observation_space.shape[0])
            encoded = observation
            encoded_reshaped = encoded.reshape([1, encoded.shape[0]])
            predicted = model.predict(encoded_reshaped).flatten()
            action = np.argmax(predicted)
        
        new_observation, reward, done, info = env.step(action)
        replay_memory.append([observation, action, reward, new_observation, done])

        # 3. Update the Main Network using the Bellman Equation
        if steps_to_update_target_model % 4 == 0 or done:
            train(env, replay_memory, model, target_model, done)

        observation = new_observation
        total_training_rewards += reward

        if done:
            print('Total training rewards: {} after n steps = {} with final reward = {}'.format(total_training_rewards, episode, reward))
            total_training_rewards += 1

            if steps_to_update_target_model >= 100:
                print('Copying main network weights to the target network weights')
                target_model.set_weights(model.get_weights())
                steps_to_update_target_model = 0
            break

    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay * episode)
env.close()

Total training rewards: 19.0 after n steps = 0 with final reward = 1.0
Total training rewards: 13.0 after n steps = 1 with final reward = 1.0
Total training rewards: 11.0 after n steps = 2 with final reward = 1.0
Total training rewards: 14.0 after n steps = 3 with final reward = 1.0
Total training rewards: 25.0 after n steps = 4 with final reward = 1.0
Total training rewards: 20.0 after n steps = 5 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 20.0 after n steps = 6 with final reward = 1.0
Total training rewards: 24.0 after n steps = 7 with final reward = 1.0
Total training rewards: 15.0 after n steps = 8 with final reward = 1.0
Total training rewards: 17.0 after n steps = 9 with final reward = 1.0
Total training rewards: 20.0 after n steps = 10 with final reward = 1.0
Total training rewards: 11.0 after n steps = 11 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 34.0

Total training rewards: 10.0 after n steps = 102 with final reward = 1.0
Total training rewards: 11.0 after n steps = 103 with final reward = 1.0
Total training rewards: 13.0 after n steps = 104 with final reward = 1.0
Total training rewards: 8.0 after n steps = 105 with final reward = 1.0
Total training rewards: 12.0 after n steps = 106 with final reward = 1.0
Total training rewards: 10.0 after n steps = 107 with final reward = 1.0
Total training rewards: 9.0 after n steps = 108 with final reward = 1.0
Total training rewards: 11.0 after n steps = 109 with final reward = 1.0
Total training rewards: 11.0 after n steps = 110 with final reward = 1.0
Total training rewards: 11.0 after n steps = 111 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 9.0 after n steps = 112 with final reward = 1.0
Total training rewards: 8.0 after n steps = 113 with final reward = 1.0
Total training rewards: 10.0 after n steps = 114 with final reward = 

Total training rewards: 22.0 after n steps = 204 with final reward = 1.0
Total training rewards: 25.0 after n steps = 205 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 30.0 after n steps = 206 with final reward = 1.0
Total training rewards: 22.0 after n steps = 207 with final reward = 1.0
Total training rewards: 69.0 after n steps = 208 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 30.0 after n steps = 209 with final reward = 1.0
Total training rewards: 52.0 after n steps = 210 with final reward = 1.0
Total training rewards: 25.0 after n steps = 211 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 31.0 after n steps = 212 with final reward = 1.0
Total training rewards: 28.0 after n steps = 213 with final reward = 1.0
Total training rewards: 44.0 after n steps = 214 with final reward = 1.0
Copying main network

Total training rewards: 124.0 after n steps = 285 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 44.0 after n steps = 286 with final reward = 1.0
Total training rewards: 237.0 after n steps = 287 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 264.0 after n steps = 288 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 81.0 after n steps = 289 with final reward = 1.0
Total training rewards: 406.0 after n steps = 290 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 113.0 after n steps = 291 with final reward = 1.0
Copying main network weights to the target network weights
Total training rewards: 97.0 after n steps = 292 with final reward = 1.0
Total training rewards: 47.0 after n steps = 293 with final reward = 1.0
Copying main network weights to the target 
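The training loop above anneals epsilon per episode with an exponential schedule, in contrast to the per-step multiplicative decay of Implementation 1. The sketch below evaluates that schedule at a few episode indices. The function name `epsilon_at` is hypothetical, and its defaults copy `min_epsilon`, `max_epsilon` and `decay` from the cell above.

```python
import numpy as np

def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay=0.01):
    """Value assigned to epsilon at the end of the given episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay * episode)

for episode in (0, 100, 200, 299):
    print(episode, round(float(epsilon_at(episode)), 3))
```

Around episode 100 more than a third of the actions are still random, which may partly explain the short episodes in that part of the log. By the end of the 300 training episodes, epsilon is still above its floor of 0.01.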

In [6]:
play(target_model)

326.0