In [1]:
import retro
import torch
from torch import nn
from copy import deepcopy

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ACTION-VALUE FUNCTION / Q-LEARNING

The sources for this notebook are almost the same as the ones for PPO:

- https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/deepq/dqn.py - Deep Q-Learning code from Stable Baselines
- https://github.com/saashanair/rl-series/blob/master/dqn/dqn_agent.py - Implementation of Deep Q-Learning from scratch in PyTorch, by Saasha Nair. Code written by people outside OpenAI (or big tech in general) is almost always far more readable and easier to understand.
- https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html - PyTorch tutorial on Reinforcement Learning using Deep Q-Learning. One of the few PyTorch tutorials that really teaches you something other than downloading and importing ready-made functions.
- https://spinningup.openai.com/en/latest/spinningup/rl_intro.html - OpenAI's Spinning Up, by Josh Achiam. However, it is not a good resource for learning Q-Learning: its treatment is still counter-intuitive, and it focuses on on-policy algorithms such as PPO's predecessors (TRPO and Vanilla Policy Gradient).
- https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf - The original Deep Q-Learning paper. Pay attention to the algorithm and the 3 equations there; the paper itself is quite clean and concise.
- https://lilianweng.github.io/posts/2018-02-19-rl-overview/#q-learning-off-policy-td-control - Lilian Weng's blog about Reinforcement Learning.
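
For orientation, the quantities behind those equations can be written as follows. This is a sketch in common notation rather than a transcription of the paper: $\theta$ are the weights of the online network and $\theta^-$ (a symbol assumed here) the weights of the target network.

```latex
Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]

y = r + \gamma \max_{a'} Q(s', a'; \theta^-) \quad (\text{or } y = r \text{ if } s' \text{ is terminal})

L(\theta) = \mathbb{E}\left[ \left( y - Q(s, a; \theta) \right)^2 \right]
```

The first line is the Bellman optimality equation, the second the TD target, and the third the squared-error loss that the training cells minimise with `MSELoss`.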

In [54]:
class SubjectActorCritic(nn.Module):

    '''
    Simple Neural Network for testing
    '''

    def __init__(self, mode='exploration', epsilon=0.99):

        super(SubjectActorCritic, self).__init__()

        self.mode = mode
        self.epsilon = epsilon

        # Input = 3x200x256

        self.neuron1 = nn.Linear(3*200*256, 128)
        self.neuron2 = nn.Linear(128, 128)

        # The environment is MultiBinary, with 12 elements that can be 0 or 1
        # Since this layer provides both actions and values, a sigmoid would imply that
        # each action is either high-value or low-value.
        # Parallel linear layers are the best solution I could think of.

        self.output_neuron = nn.ModuleList([nn.Linear(128, 2) for _ in range(12)])

        self.relu = nn.ReLU()

    def forward(self, obs):

        x = obs.contiguous().view(obs.size(0), -1)

        x = self.neuron1(x)
        x = self.relu(x)
        x = self.neuron2(x)
        x = self.relu(x)

        actions_values = []

        for layer in range(12):

            actions_values.append(self.output_neuron[layer](x))

        if self.mode == 'exploration':
            # Just collecting data

            for i in range(len(actions_values)):

                # Uniform sample in [0, 1): explore with probability epsilon
                if torch.rand((1,)) < self.epsilon:

                    action = actions_values[i]
                    action = torch.zeros_like(action)
                    random_idx = torch.randint(0, 2, ([])).item()
                    action[0, random_idx] = 1.0

                    actions_values[i] = action

                    del action, i

        del x

        return actions_values
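
The epsilon-greedy rule for a MultiBinary action can be shown in isolation. The sketch below is plain Python and is not the network's `forward`: `epsilon_greedy` and the toy values in `q` are hypothetical names and data used only for illustration. Each button has a pair of action values (off, on); with probability `epsilon` the button is chosen at random, otherwise the higher-valued option wins.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick 0 or 1 for each button from its pair of action values.

    q_values: list of (q_off, q_on) pairs, one pair per button.
    """
    action = []
    for q_off, q_on in q_values:
        if rng.random() < epsilon:          # uniform sample in [0, 1)
            action.append(rng.randint(0, 1))  # explore: random button state
        else:
            action.append(int(q_on > q_off))  # exploit: higher-valued option
    return action

q = [(0.2, 0.9), (1.5, -0.3), (0.0, 0.1)]   # toy values for 3 buttons
print(epsilon_greedy(q, epsilon=0.0))        # [1, 0, 1]
```

With `epsilon=0.0` the choice is purely greedy; with `epsilon=1.0` every button is random, which is close to what the exploration phase does with `epsilon=0.99`.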

In [69]:
# Here, we'll use a single function to determine the value of each action
# And also a target network to determine the value states for the loss

action_value = SubjectActorCritic().to(device)
target_network = deepcopy(action_value).to(device).eval()

In [70]:
'''
Unfortunately, RL algorithms tend to be very sensitive to hyperparameters,
so we'll spend quite some time adjusting them

Even more unfortunately, it seems that research in deep RL 
is mostly done by OpenAI exclusively, and they were in love with PPO

Some parameters used in some Gym's environments, which are too simple:

https://github.com/saashanair/rl-series/blob/master/dqn/main.py
https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

For complex games, like the ones we want for Hakisa, we unfortunately don't have any baseline.
Maybe we could stick to something close to PPO2 parameters:

https://github.com/liuruoze/HierNet-SC2/blob/396646056dbe5f8f20e43e0ef35e59db09e907c0/param.py

"Even Ignoring Generalization Issues, The Final Results Can be Unstable and Hard to Reproduce" - Alex Irpan

"[Supervised learning] wants to work. Even if you screw something up you'll usually get something non-random back.
RL must be forced to work.
If you screw something up or don't tune something well enough you're exceedingly likely to get a policy that is even worse than random.
And even if it's all well tuned you'll get a bad policy 30% of the time, just because." - Andrej Karpathy
'''

gamma = 0.99 # Gamma for the Discount Rewards.
BATCH_SIZE = 16 # In reality, all tensors are batch 1, so we'll use gradient accumulation to simulate multiple batches.
EPOCHS = 10
lr = 1e-4
target_delay = BATCH_SIZE * 16 # Steps before applying update to the target network
action_value.epsilon = 0.99 # Epsilon for the epsilon-greedy strategy
# Pytorch tutorial uses a Tau weight to apply soft update to target network.
#update_weight = 0.005
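
The soft update mentioned in the last comment can be sketched as below. This is only an illustration under assumptions: `soft_update` is a hypothetical helper, the two `nn.Linear` layers stand in for the real networks, and `tau=0.005` is taken from the commented-out `update_weight`. A hard update copies the weights every `target_delay` steps, while a soft update moves the target a small fraction `tau` towards the online network at every step.

```python
import torch
from torch import nn

def soft_update(target, online, tau=0.005):
    """Blend the online weights into the target network, in place."""
    with torch.no_grad():
        for t_param, o_param in zip(target.parameters(), online.parameters()):
            # target <- (1 - tau) * target + tau * online
            t_param.mul_(1.0 - tau).add_(o_param, alpha=tau)

online = nn.Linear(4, 2)                       # stand-in for action_value
target = nn.Linear(4, 2)                       # stand-in for target_network
target.load_state_dict(online.state_dict())    # hard update: exact copy

with torch.no_grad():
    online.weight.add_(1.0)                    # pretend a training step happened

soft_update(target, online, tau=0.005)         # target moves 0.5% of the way
```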

In [71]:
# In reality, the AI will only play the game to acquire data
# The magic really happens during her training, which is done offline.

# While in PPO the AI plays the game to acquire data and then
# trains offline, DQN usually applies training online.
# We'll try training offline first, then going online.

'''
"BATCH_SIZE is the number of transitions sampled from the replay buffer" - Pytorch's DQN tutorial.

"A trajectory is a sequence of states and actions in the world. [...]
Trajectories are also frequently called episodes or rollouts."
- OpenAI's Spinning Up: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

In DQN, the Replay Buffer (or Memory) plays the same role as the Rollout in PPO,
but with a more intuitive name.
'''

# Creating lists to store data.
# Using a separate cell in case of storing multiple episodes (playthroughs)
# TO CONSIDER: Using pickle or torch to save and load playthroughs

states = []
actions = []
log_probs = []
values = []
rewards = []
rewards_to_go = []
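
As an alternative to separate lists, the transitions could be kept in a single fixed-size buffer. The sketch below is a minimal illustration, not what this notebook uses: `Transition` and `ReplayBuffer` are hypothetical names, and the integers pushed at the end are dummy data standing in for real observations.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size memory: the oldest transitions are dropped when it is full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for t in range(5):                       # dummy transitions
    buffer.push(t, [0] * 12, 0.0, t + 1, False)

print(len(buffer))                       # 3
```

Storing `next_state` and `done` with each transition removes the need to look up `states[batch+1]` during training.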

In [72]:
# Exploration Phase - Let her learn how the environment is.

env = retro.make(game="StreetFighterIISpecialChampionEdition-Genesis", state="ChunLiVsBlanka.1star")
obs = env.reset()
obs = torch.from_numpy(obs)
obs = obs/255
obs = obs.permute(2, 1, 0).unsqueeze(0).float().to(device)
steps = 0

while steps < 1000:
    env.render()

    # Collecting State --> Must be done at the beginning

    states.append(obs.cpu())

    with torch.no_grad():

        # The action is provided by the DQN network,
        # while the value, by the target network

        prob = action_value(obs)
        value = target_network(obs)

    # MultiBinary Environment --> Only 0.0 or 1.0 accepted
    bin = []
    for action in prob:
        bin.append(action.argmax().item())
    
    action = bin # This is the actual action

    del bin

    obs, reward, end, info = env.step(action)
    obs = torch.from_numpy(obs)
    obs = obs/255
    obs = obs.permute(2, 1, 0).unsqueeze(0).float().to(device)
    # The environment reward is discarded; a custom reward is built from the health values.

    reward = (info['health']**(1+info['matches_won'])) - (info['enemy_health']**(1+info['enemy_matches_won']))
    reward = torch.tensor(reward, dtype=torch.float32, device=device)
    reward = -(10.0/(torch.exp(reward) + 1.0)) + 5.0 # Squashing to the range -5 to +5 (scaled sigmoid function)

    # Collecting variables for the previous (collected) state
    
    actions.append(action) # List of indices
    log_probs.append(prob) # List of tensors
    values.append(value) # List of tensors
    rewards.append(reward.cpu()) # Pure tensors

    steps += 1

env.render(close=True)
env.close()
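
The reward expression used above is algebraically `10 * sigmoid(x) - 5`, where `x` is the raw health difference. The plain-Python sketch below makes that explicit; `sigmoid` and `shaped_reward` are hypothetical helper names, and the health values in the examples are made up.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def shaped_reward(health, matches_won, enemy_health, enemy_matches_won):
    raw = health ** (1 + matches_won) - enemy_health ** (1 + enemy_matches_won)
    # -(10 / (exp(raw) + 1)) + 5  ==  10 * sigmoid(raw) - 5
    return 10.0 * sigmoid(raw) - 5.0

print(shaped_reward(176, 0, 176, 0))   # 0.0  (tie)
print(shaped_reward(176, 0, 100, 0))   # 5.0  (ahead by 76 health points)
```

Because the sigmoid saturates after a few units, the reward is effectively +5 or -5 unless the two health terms are almost equal. Scaling `raw` before the sigmoid would be one way to get a smoother signal.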

In [73]:
# Processing collected data

states = torch.cat(states, 0)

discounted_reward = 0 # The final rewards provide greater impact on the algorithm
for r_t in reversed(rewards):
    
    discounted_reward = r_t + discounted_reward * gamma
    rewards_to_go.insert(0, discounted_reward)

rewards_to_go = torch.tensor(rewards_to_go)

# There's no need to use advantage here --> Temporal-Difference (TD) Learning
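
To make the backward recursion concrete, here is the same computation on a three-step toy episode. `rewards_to_go_list` is a hypothetical helper name and the rewards are made up. Note that the training cells in this notebook use the per-step `rewards`, not these discounted sums.

```python
def rewards_to_go_list(rewards, gamma):
    """Discounted sum of future rewards for every time step."""
    out = []
    running = 0.0
    for r in reversed(rewards):          # walk the episode backwards
        running = r + gamma * running
        out.append(running)
    out.reverse()                        # restore chronological order
    return out

print(rewards_to_go_list([1.0, 1.0, 1.0], gamma=0.5))   # [1.75, 1.5, 1.0]
```

Appending and reversing once avoids the repeated `insert(0, ...)`, which shifts the whole list on every step.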

In [98]:
# CONSOLIDATION PHASE - https://en.wikipedia.org/wiki/Memory_consolidation
# She remembers what she saw, and learns from it.

action_value.mode = 'consolidation'
target_network.mode = 'target' # Remember to turn off the target network exploration mode

# In RL, an Epoch consists of an entire episode + training from that episode,
# while a Batch consists of an episode.
# DQN also uses Batch as a number of samples extracted from memory.
# https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html
# We'll stick to traditional terminology of ML, to avoid needless confusion.

optimizer = torch.optim.Adam(action_value.parameters(), lr=lr)
criterion = torch.nn.MSELoss()

for epoch in range(EPOCHS):

    steps = 0
    batches = torch.randperm(len(states))

    for batch in batches:

        obs = states[batch].to(device).unsqueeze(0)

        if batch + 1 < len(states):
            next_obs = states[batch+1].to(device).unsqueeze(0)

        else: # Last stored transition: treated as terminal state
            next_obs = None

        action = actions[batch]
        reward = rewards[batch].to(device)

        # Compute Q(s_t, a)
        # The model computes Q(s_t), the value of that state
        # Then we select the index of the action that was taken.

        Q_s = action_value(obs)

        Q_s_a = []

        for item in range(len(Q_s)):

            a = Q_s[item][0, action[item]]

            Q_s_a.append(a)

        # Compute target Q(s', a'), such that
        # yi = r_t + gamma * Q(s', a'), with the maximum values
        # for each action
        # However, if reached terminal state, yi = r_t
            
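        if next_obs is None:

            Q_t_a = [0.0]*12

        else:

            # The target network only provides targets, so no gradients are needed
            with torch.no_grad():

                Q_target = target_network(next_obs)

            Q_t_a = []

            for item in range(len(Q_target)):

                # Maximum action value (not its index) for each of the 12 outputs
                a = Q_target[item].max(-1).values.item()

                Q_t_a.append(a)

        Q_t_a = torch.tensor(Q_t_a, device=device)

        r_t = torch.full_like(Q_t_a, reward.item(), device=device)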

        y_i = r_t + gamma * Q_t_a

        '''# TO CONSIDER: adding an entropy factor to the loss
        # PS: Doesn't seem much needed here, and it's too complicated in MultiBinary Environment

        entropy = []

        for item in range(len(Q_s)):

            current_prob = Q_s[item]

            e = (current_prob * torch.log(torch.clamp(current_prob, 1e-10, 1.0))).sum()
            e = -e.mean()
            entropy.append(e)
            
        entropy = torch.tensor(entropy, device=device)'''

        for item in range(len(Q_s_a)):

            loss = criterion(Q_s_a[item], y_i[item]) #- (entropy * 1e-3)
            loss.backward(retain_graph=True) 

        # Using Gradient Accumulation to avoid using batch size greater than 1 -> Lower computation cost

        if steps % BATCH_SIZE == 0:

            optimizer.step()
            action_value.zero_grad()

        if steps % target_delay == 0:

            target_network.load_state_dict(action_value.state_dict())

        steps += 1

        if steps % 100 == 0:

            print(f"{epoch+1}/{EPOCHS}")
            print(f"Current step: {steps}")
            print(f"Current Loss: {loss.item()}")
            print(f"Model Gradients: {action_value.neuron1.weight.grad.mean()}")

1/2
Current step: 100
Current Loss: 1.3783857822418213
Model Gradients: -0.0040936581790447235




1/2
Current step: 200
Current Loss: 0.25955578684806824
Model Gradients: -0.022637929767370224
1/2
Current step: 300
Current Loss: 2.609018325805664
Model Gradients: -0.0007244820590130985
1/2
Current step: 400
Current Loss: 2.49064564704895
Model Gradients: 0.040356554090976715
1/2
Current step: 500
Current Loss: 9.377496719360352
Model Gradients: -0.007127691991627216
1/2
Current step: 600
Current Loss: 2.7017438411712646
Model Gradients: 0.026180408895015717
1/2
Current step: 700
Current Loss: 0.3448813259601593
Model Gradients: 0.020731082186102867
1/2
Current step: 800
Current Loss: 0.6616409420967102
Model Gradients: -0.04040137678384781
1/2
Current step: 900
Current Loss: 1.0524049997329712
Model Gradients: -0.021784093230962753
1/2
Current step: 1000
Current Loss: 4.267786502838135
Model Gradients: -0.03505820780992508
2/2
Current step: 100
Current Loss: 0.04128098115324974
Model Gradients: 0.005391942802816629
2/2
Current step: 200
Current Loss: 0.005966844968497753
Model Grad
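
The TD target from the training loop, `y_i = r_t + gamma * max Q(s', a')`, uses the maximum value of each output head, not the index returned by `argmax`. The sketch below computes it for all heads at once. It is an illustration under assumptions: `td_targets` is a hypothetical helper and `next_q` is assumed to be a `(heads, 2)` tensor, which for this notebook's model could be built with `torch.cat(target_network(next_obs), 0)`.

```python
import torch

def td_targets(next_q, reward, gamma, done):
    """TD target for every output head.

    next_q: tensor of shape (heads, 2) with target-network action values.
    """
    max_next = next_q.max(dim=-1).values        # best value per head
    if done:
        max_next = torch.zeros_like(max_next)   # terminal state: y = r
    return reward + gamma * max_next

next_q = torch.tensor([[0.0, 1.0], [2.0, -1.0]])   # toy values for 2 heads
y = td_targets(next_q, reward=0.5, gamma=0.9, done=False)
print(y)   # 0.5 + 0.9 * 1.0 and 0.5 + 0.9 * 2.0
```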

In [97]:
# New cell to adjust parameters and restart training

action_value = SubjectActorCritic().to(device)
target_network = deepcopy(action_value).to(device).eval()

gamma = 0.99 # Gamma for the Discount Rewards.
BATCH_SIZE = 16 # In reality, all tensors are batch 1, so we'll use gradient accumulation to simulate multiple batches.
EPOCHS = 2
lr = 2e-4
target_delay = BATCH_SIZE * 16 # Steps before applying update to the target network
action_value.epsilon = 0.99 # Epsilon for the epsilon-greedy strategy

In [92]:
# Gameplay Mode

env = retro.make(game="StreetFighterIISpecialChampionEdition-Genesis", state="ChunLiVsBlanka.1star")
obs = env.reset()
obs = torch.from_numpy(obs)
obs = obs/255
obs = obs.permute(2, 1, 0).unsqueeze(0).float().to(device)
steps = 0

# If you'd like to save and train even more

'''states = []
actions = []
rewards = []
deltas = []'''

while steps < 1000:
    env.render()

    with torch.no_grad():

        prob = action_value(obs)
        value = target_network(obs)

    # MultiBinary Environment --> Only 0.0 or 1.0 accepted
    bin = []
    for action in prob:
        bin.append(action.argmax().item())
    
    action = bin # This is the actual action

    del bin

    obs, reward, end, info = env.step(action)
    obs = torch.from_numpy(obs)
    obs = obs/255
    obs = obs.permute(2, 1, 0).unsqueeze(0).float().to(device)
    #reward = torch.tensor(reward, device=device)

    reward = (info['health']**(1+info['matches_won'])) - (info['enemy_health']**(1+info['enemy_matches_won']))
    reward = torch.tensor(reward, dtype=torch.float32, device=device)
    reward = -(10.0/(torch.exp(reward) + 1.0)) + 5.0 # Normalizing to -5 to +5 (sigmoid function)

    '''states.append(obs.cpu())
    actions.append(action.cpu())
    rewards.append(reward.cpu())
    deltas.append(delta.cpu())'''

    steps += 1

env.render(close=True)
env.close()
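
The same four preprocessing lines appear after every `env.reset()` and `env.step()`. They could be wrapped in a small helper, sketched below. `preprocess` is a hypothetical name, and the dummy frame size `(200, 256, 3)` is assumed from the `Input = 3x200x256` comment in the model.

```python
import numpy as np
import torch

def preprocess(frame, device="cpu"):
    """Turn an (H, W, 3) uint8 frame into a (1, 3, W, H) float tensor in [0, 1]."""
    obs = torch.from_numpy(frame) / 255
    return obs.permute(2, 1, 0).unsqueeze(0).float().to(device)

frame = np.full((200, 256, 3), 255, dtype=np.uint8)   # dummy white frame
obs = preprocess(frame)
print(obs.shape)   # torch.Size([1, 3, 256, 200])
```

`permute(2, 1, 0)` yields width before height (3x256x200). This does not matter for the linear model, which flattens the input, but it would matter for convolutional layers.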

In [93]:
# New cell to adjust parameters and restart training

action_value = SubjectActorCritic().to(device)
target_network = deepcopy(action_value).to(device).eval()

gamma = 0.99 # Gamma for the Discount Rewards.
lamb = 0.95 # Lambda for Generalized Advantage Estimation. Leftover from the PPO notebook: not used by the DQN code below
BATCH_SIZE = 16 # In reality, all tensors are batch 1, so we'll use gradient accumulation to simulate multiple batches.
EPOCHS = 2
lr = 1e-4
target_delay = BATCH_SIZE * 16 # Steps before applying update to the target network
action_value.epsilon = 0.99 # Epsilon for the epsilon-greedy strategy

In [99]:
# Play and Learn
# This method is more unstable than using Exploration + Consolidation.
# It's recommended to use Exploration + Consolidation and, only then, this method.

'''
"Use reinforcement learning just as the fine-tuning step:
The first AlphaGo paper started with supervised learning, and then did RL fine-tuning on top of it.
This is a nice recipe, since it lets you use a faster-but-less-powerful method to speed up initial learning.
It's worked in other contexts - see Sequence Tutor (Jaques et al, ICML 2017).
You can view this as starting the RL process with a reasonable prior, instead of a random one,
where the problem of learning the prior is offloaded to some other approach."
- Alex Irpan, https://www.alexirpan.com/2018/02/14/rl-hard.html

"In SL training, we found a learning rate of 1e-4 and 10 training epochs achieve the
best result. The best model achieves a 0.15 win rate against the level-1 built-in AI.
Note that though this result is not as good as that we acquire in the HRL method, the training
here faces 564 actions, thus is much difficult."
- Liu, Ruo-Ze et al. On Efficient Reinforcement Learning for Full-length Game of StarCraft II

'''

env = retro.make(game="StreetFighterIISpecialChampionEdition-Genesis", state="ChunLiVsBlanka.1star")
obs = env.reset()
obs = torch.from_numpy(obs)
obs = obs/255
obs = obs.permute(2, 1, 0).unsqueeze(0).float().to(device)
steps = 0

action_value.mode = 'consolidation'
target_network.mode = 'target' # Remember to turn off the target network exploration mode

optimizer = torch.optim.Adam(action_value.parameters(), lr=lr)
criterion = torch.nn.MSELoss()

while steps < 1000:
    env.render()

    # The model computes Q(s_t): one pair of action values for each of the 12 buttons

    Q_s = action_value(obs)

    # MultiBinary Environment --> Only 0 or 1 accepted
    # Greedy choice: the higher-valued option for each button
    bin = []
    for q in Q_s:
        bin.append(q.argmax().item())

    action = bin # This is the actual action

    del bin

    # Compute Q(s_t, a)
    # We select the value at the index of the action that was taken.

    Q_s_a = []

    for item in range(len(Q_s)):

        a = Q_s[item][0, action[item]]

        Q_s_a.append(a)

    # Getting next observation and its consequences

    obs, reward, end, info = env.step(action)
    obs = torch.from_numpy(obs)
    obs = obs/255
    obs = obs.permute(2, 1, 0).unsqueeze(0).float().to(device)
    # The environment reward is discarded; a custom reward is built from the health values.

    reward = (info['health']**(1+info['matches_won'])) - (info['enemy_health']**(1+info['enemy_matches_won']))
    reward = torch.tensor(reward, dtype=torch.float32, device=device)
    reward = -(10.0/(torch.exp(reward) + 1.0)) + 5.0 # Squashing to the range -5 to +5 (scaled sigmoid function)

    # Compute target Q(s', a'), such that
    # yi = r_t + gamma * Q(s', a'), with the maximum values
    # for each action
    # However, if reached terminal state, yi = r_t
        
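    if end:

        Q_t_a = [0.0]*12

    else:

        # The target network only provides targets, so no gradients are needed
        with torch.no_grad():

            Q_target = target_network(obs)

        Q_t_a = []

        for item in range(len(Q_target)):

            # Maximum action value (not its index) for each of the 12 outputs
            a = Q_target[item].max(-1).values.item()

            Q_t_a.append(a)

    Q_t_a = torch.tensor(Q_t_a, device=device)

    r_t = torch.full_like(Q_t_a, reward.item(), device=device)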

    y_i = r_t + gamma * Q_t_a

    '''# TO CONSIDER: adding an entropy factor to the loss
    # Here, such factor may be interesting.

    entropy = []

    for item in range(len(Q_s)):

        current_prob = Q_s[item]

        e = (current_prob * torch.log(torch.clamp(current_prob, 1e-10, 1.0))).sum()
        e = -e.mean()
        entropy.append(e)
        
    entropy = torch.tensor(entropy)'''

    for item in range(len(Q_s_a)):

        loss = criterion(Q_s_a[item], y_i[item])
        loss.backward(retain_graph=True) 

    # Using Gradient Accumulation to avoid using batch size greater than 1 -> Lower computation cost

    if steps % BATCH_SIZE == 0:

        optimizer.step()
        action_value.zero_grad()

    if steps % target_delay == 0:

        target_network.load_state_dict(action_value.state_dict())

    steps += 1

    if steps % 100 == 0:

        print(f"Current step: {steps}")
        print(f"Current Loss: {loss.item()}")

env.render(close=True)
env.close()

Current step: 100
Current Loss: 0.01855369098484516
Current step: 200
Current Loss: 0.6386929154396057
Current step: 300
Current Loss: 0.007749685551971197
Current step: 400
Current Loss: 0.05683543160557747
Current step: 500
Current Loss: 0.13423769176006317
Current step: 600
Current Loss: 0.09188412874937057
Current step: 700
Current Loss: 0.0023667553905397654
Current step: 800
Current Loss: 0.004315526224672794
Current step: 900
Current Loss: 9.851335525512695
Current step: 1000
Current Loss: 1.447236180305481


### SPECIAL: https://arxiv.org/pdf/2102.04518.pdf - A* search applied to Deep Q-Learning, giving birth to Q*-Learning. Supposedly the RL method behind OpenAI's next step towards AGI, capable of expressing logical reasoning. Meanwhile, PPO was used to fine-tune the GPT models with a reward model trained on human feedback (RLHF).

"A* search selects the node with the lowest cost for expansion and computes the cost of all of its children.
This process continues until a node associated with a goal state is selected for expansion.
Expanding a node requires that every possible action be applied to the state associated with that node,
thereby generating new states and, subsequently, new nodes."
- This is somewhat reminiscent of how generation works in Transformers: Greedy or Beam Search, both trying to select
the most likely tokens given an input token (or sequence of tokens).
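
The classic A* loop described in the quote can be sketched in a few lines. This is a generic textbook version on a toy grid, not the method from the paper: `a_star`, `grid_neighbors` and `manhattan` are hypothetical names, and the 4x4 grid with unit step costs is made up for the example.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return the cost of the cheapest path from start to goal, or None."""
    frontier = [(heuristic(start), 0, start)]    # entries are (g + h, g, node)
    best_g = {start: 0}
    while frontier:
        f, g, node = heapq.heappop(frontier)     # node with the lowest cost
        if node == goal:
            return g
        if g > best_g.get(node, float("inf")):
            continue                             # stale queue entry
        for nxt, cost in neighbors(node):        # expand: apply every action
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt))
    return None

def grid_neighbors(node):
    """Four-way moves on a 4x4 grid, each with cost 1."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < 4 and 0 <= ny < 4:
            yield (nx, ny), 1

def manhattan(node):
    """Admissible heuristic: grid distance to the goal (3, 3)."""
    return abs(3 - node[0]) + abs(3 - node[1])

print(a_star((0, 0), (3, 3), grid_neighbors, manhattan))   # 6
```

Every expansion calls `neighbors` and generates all child nodes, which is the cost that a learned Q-function is meant to avoid.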

"DQNs are DNNs that map a single state to the sum of the transition cost and the heuristic
value for each of its successor states. This allows us to only generate one node per iteration as we can store tuples
of nodes and actions in a priority queue whose priority is determined by the DQN. When removing a tuple of a node
and action from the queue, we can then generate a new node by applying the action to the state associated with that
node."
- In this case, a Transformer *(or a Generative Transformer)* using the DQN method
doesn't seem to be that different from its classic counterpart, right?