# deep-significance Demo

**Note: [REDACTED] indicates links that were temporarily removed for peer-reviewing purposes.**

In this demo, we will demonstrate some of the functionalities in the deep-significance package using the Cart Pole problem (Barto et al. 1983) as implemented in OpenAI gym. 

Since this is a demo, we will use an extremely simple approach to tackling reinforcement learning problems with neural networks, namely *Deep Q-networks* (Mnih et al., 2015). Back in 2015, Deep Q-networks where the first approach to obtain competitive scores on many Atari games. In this demo, we will specificly use the package to determine the effect of replay memory on the model. 

Deep Q-Learning tries to approximate the optimal action-value function defined as 

\begin{equation*}
    Q^*(s, a) = \max_\pi \mathbb{E}\big[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \big| s_t = s, a_t = a, \pi \big]
\end{equation*}

The definition above reads as follow: The optimal action-value function is the policy $\pi$ that maximizes the future reward $r_t$ at a state $s_t$ by performing an action $a_t$, with subsequent rewards being increasingly discounted by a factor $\gamma$. The model weights are updated using the following $l_2$ loss:

\begin{equation*}
    \mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s^\prime) \sim U(\text{Buffer})}\bigg[\Big(r + \max_{a^\prime} Q(s^\prime, a^\prime; \theta^\text{target}) - Q(s, a; \theta)\Big)^2\bigg]
\end{equation*}

Two aspects of this loss function are especially noteworthy: First of all, since we do not know the true value of the $Q$-function in most cases, the predicted value $Q(s, a; \theta)$ is compared against the reward plus outcome of the greedy action chosen by a *target* network: To avoid having to ``hit a moving target'' (Van Hasselt et al., 2018), the target network is only updated every couple of training steps by copying the main networks parameters. Secondly, the state, action and reward used to compute the loss are not the ones just observed by the model, but instead are uniformly sampled from a *replay buffer*, a sort of memory that past experiences gets added to during training.

For that purpose, let us first define the environment along with some project requirements:

In [1]:
# STD
import random

# EXT
import gym
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

# Import package functions
# To use deepsig in your project, simply use pip install deepsig
import sys
sys.path.insert(0, "../")
from deepsig import aso, multi_aso, aso_uncertainty_reduction, bootstrap_power_analysis, bootstrap_test




In [2]:
# Set constants
SEED = 42

# Set hyperparameters
NUM_EPISODES = 100
TARGET_UPDATE_FREQ = 10
MAX_STEPS = 1000
BATCH_SIZE = 128
DISCOUNT_FACTOR = 0.8
LEARN_RATE = 1e-3
NUM_HIDDEN = 256
MEMORY_SIZE = 10000
SHOW_AGENT = False  # Set this to true if you want to see the agent learning


In [3]:
env = gym.envs.make("CartPole-v1")

# Seed for replicability
env.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x1073f1bf0>

Next, we define a super simple Deep Q-network and replay memory class:

In [4]:
class QNetwork(nn.Module):

    def __init__(self, n_in, n_out, num_hidden=128):
        nn.Module.__init__(self)
        self.l1 = nn.Linear(n_in, num_hidden)
        self.l2 = nn.Linear(num_hidden, n_out)

    def forward(self, x):
        out = self.l1(x)
        out = F.relu(out)
        out = self.l2(out)
        return out


class ReplayMemory:

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []

    def push(self, transition):
        if self.capacity == len(self.memory):
            self.memory.pop(0)
        self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

Next, we implement some utility functions:

In [5]:
def select_action(model, state, epsilon):
    with torch.no_grad():
        action = model(torch.Tensor(state))
        return torch.argmax(action).item() if random.random() > epsilon else random.choice([0,1])

def get_epsilon(it):
    return 0.05 if it >= 1000 else - 0.00095 * it + 1
    
def compute_target(model, reward, next_state, done, discount_factor, target_net):

    targets = reward + (target_net(next_state).max(1)[0] * discount_factor) * (1 - done.float())
    
    return targets.unsqueeze(1)

def compute_q_val(model, state, action):
    q_val = model(state)
    q_val = q_val.gather(1, action.unsqueeze(1).view(-1, 1))
    return q_val


In [6]:
def train(model, memory, optimizer, batch_size, discount_factor, target_net):
    # don't learn without some decent experience
    if len(memory) < batch_size:
        return None

    # random transition batch is taken from experience replay memory
    transitions = memory.sample(batch_size)

    # transition is a list of 4-tuples, instead we want 4 vectors (as torch.Tensor's)
    state, action, reward, next_state, done = zip(*transitions)

    # convert to PyTorch and define types
    state = torch.tensor(state, dtype=torch.float)
    action = torch.tensor(action, dtype=torch.int64)  # Need 64 bit to use them as index
    next_state = torch.tensor(next_state, dtype=torch.float)
    reward = torch.tensor(reward, dtype=torch.float)
    done = torch.tensor(done, dtype=torch.uint8)  # Boolean
    action = action.squeeze()

    # compute the q value
    q_val = compute_q_val(model, state, action)

    with torch.no_grad():  # Don't compute gradient info for the target (semi-gradient)
        target = compute_target(model, reward, next_state, done, discount_factor, target_net)

    # loss is measured from error between current and newly expected Q values
    loss = F.smooth_l1_loss(q_val, target)

    # backpropagation of loss to Neural Network (PyTorch magic)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()  # Returns a Python scalar, and releases history (similar to .detach())

In [7]:
def run_episodes(train, model, memory, env, num_episodes, batch_size, discount_factor, learn_rate, target_net,
                 target_update_freq, max_steps, show_agent):
    optimizer = optim.Adam(model.parameters(), learn_rate)
    global_steps = 0  # Count the steps (do not reset at episode start, to compute epsilon)
    episode_rewards = []

    for i in range(num_episodes):
        steps = 0
        state = env.reset()
        cum_reward = 0
        done = False

        while not done:
            
            if show_agent:
                env.render()
            
            steps += 1

            eps = get_epsilon(global_steps)

            action = select_action(model, state, eps)

            if steps % target_update_freq == 0:
                target_net.load_state_dict(model.state_dict())

            train(model, memory, optimizer, batch_size, discount_factor, target_net)
            next_state, reward, done, _ = env.step(action)
            cum_reward += reward

            memory.push((state, action, reward, next_state, done))
            state = next_state

            if steps >= max_steps:
                done = True

        global_steps += steps
        episode_rewards.append(cum_reward)
    
    if show_agent:
        env.close()

    return episode_rewards


def run_dqn(env, num_episodes, memory_size, num_hidden, batch_size, discount_factor, learn_rate, target_update_freq,
            max_steps, show_agent):
    memory = ReplayMemory(memory_size)
    n_out = env.action_space.n

    n_in = len(env.observation_space.low)
    model = QNetwork(n_in, n_out, num_hidden)
    target_net = QNetwork(n_in, n_out, num_hidden)

    cum_reward = run_episodes(
        train=train, model=model, memory=memory, env=env, num_episodes=num_episodes, batch_size=batch_size,
        discount_factor=discount_factor, learn_rate=learn_rate, target_net=target_net,
        target_update_freq=target_update_freq, max_steps=max_steps, show_agent=show_agent
    )
    return cum_reward

With the main code ready, we would now like to perform some experiments. Namely, we would like to find out what kind of effect the number of steps to update the target network has on the cumulative rewards. A first way to do this is to run one agent for two different setting (namely 10 and 20) and compare the distributions over rewards obtained during training:

In [8]:
rewards_freq_10 = run_dqn(
    env, 
    batch_size=BATCH_SIZE,
    num_episodes=NUM_EPISODES, 
    memory_size=MEMORY_SIZE, 
    num_hidden=NUM_HIDDEN, 
    discount_factor=DISCOUNT_FACTOR, 
    learn_rate=LEARN_RATE, 
    target_update_freq=10, 
    max_steps=MAX_STEPS,
    show_agent=SHOW_AGENT
)

rewards_freq_20 = run_dqn(
    env, 
    batch_size=BATCH_SIZE,
    num_episodes=NUM_EPISODES, 
    memory_size=MEMORY_SIZE, 
    num_hidden=NUM_HIDDEN, 
    discount_factor=DISCOUNT_FACTOR, 
    learn_rate=LEARN_RATE, 
    target_update_freq=10, 
    max_steps=MAX_STEPS,
    show_agent=SHOW_AGENT
)

# Print the last 20 rewards for both
print(rewards_freq_10[-20:])
print(rewards_freq_20[-20:])

[244.0, 175.0, 343.0, 244.0, 371.0, 230.0, 251.0, 284.0, 268.0, 267.0, 231.0, 60.0, 173.0, 303.0, 310.0, 224.0, 500.0, 465.0, 249.0, 308.0]
[194.0, 250.0, 213.0, 162.0, 203.0, 337.0, 76.0, 170.0, 208.0, 207.0, 290.0, 212.0, 232.0, 217.0, 175.0, 207.0, 179.0, 209.0, 212.0, 189.0]


This looks relatively similar, so which approach was more successful? We can try to answer this question using the Almost Stochastic Order test (ASO). Roughly, it works by comparing the overlap of the two empricial cumulative distribution functions of scores and checking for their overlap - if one approach approach is yielding consistently higher rewards compared to the other one, they should not overlap (and the test score should be close to 0).

In [9]:
aso(rewards_freq_10, rewards_freq_20, num_jobs=4, seed=SEED)

Bootstrap iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [00:21<00:00, 46.02it/s]


0.02028519498045956

Since the test scores is quite low, this gives us an indication that waiting 20 steps to update the target network might be beneficial! Nevertheless, this comes with a caveat - we only checked one model run per value, and neural networks are infamous for being sensitive to their random initialization. Therefore, instead of comparing the reward distributions of two single models per run, let us compare the **distribution over average rewards over multiple runs**. We start by doing 5 runs each.

In [10]:
reward_dist_freq_10, reward_dist_freq_20 = [], []

for i in range(5):
    print(f"Performing run #{i+1}...")
    reward_dist_freq_10.append(
        np.mean(run_dqn(
            env, 
            batch_size=BATCH_SIZE,
            num_episodes=NUM_EPISODES, 
            memory_size=MEMORY_SIZE, 
            num_hidden=NUM_HIDDEN, 
            discount_factor=DISCOUNT_FACTOR, 
            learn_rate=LEARN_RATE, 
            target_update_freq=10, 
            max_steps=MAX_STEPS,
            show_agent=SHOW_AGENT
        ))
    )
    reward_dist_freq_20.append(
        np.mean(run_dqn(
            env, 
            batch_size=BATCH_SIZE,
            num_episodes=NUM_EPISODES, 
            memory_size=MEMORY_SIZE, 
            num_hidden=NUM_HIDDEN, 
            discount_factor=DISCOUNT_FACTOR, 
            learn_rate=LEARN_RATE, 
            target_update_freq=10, 
            max_steps=MAX_STEPS,
            show_agent=SHOW_AGENT
        ))
    )
    
print(reward_dist_freq_10)
print(reward_dist_freq_20)

Performing run #1...
Performing run #2...
Performing run #3...
Performing run #4...
Performing run #5...
[152.92, 107.4, 135.69, 138.38, 170.26]
[151.41, 129.96, 146.2, 96.06, 159.39]


It can sometimes be a tricky question to decide whether one has collected enough scores to allow for meaningful comparisons, especially when this question has to be balanced against the cost of compute. When the variance in our scores is too high, we might be faced with misleading results, if it is sufficient, we run more models for no apparent reason. For this purpose, deepsig implements two different functions.

First, we will take a look at bootstrap power analysis: It increases all scores in the sample by a certain factor, and then use bootstrapped versions of both samples and perform a significance test. Since the modified, new sample received a lift, the result should come out significant in most cases. If not, this is an indication that the original sample contains too much variance. Let's check that for our scores:

In [11]:
print(bootstrap_power_analysis(reward_dist_freq_10, seed=SEED))
print(bootstrap_power_analysis(reward_dist_freq_20, seed=SEED))

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 1816.21it/s]


0.6594


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 1847.71it/s]

0.5616





These scores have a direct statistical interpretation, since they signify the *statistical power*. The higher the statistical power, the lower the probability of a Type II error or false negative, i.e. not rejecting the null hypothesis when it should be! A common rule of thumb is to thrive for a power of ~0.8, therefore we might want to collect more samples here. For instance, we could decide to collect 10 or 15 samples in total. 

Another, ASO-specific way to help us make that decision is the following function:


In [12]:
print(aso_uncertainty_reduction(m_old=5, n_old=5, m_new=10, n_new=10))
print(aso_uncertainty_reduction(m_old=5, n_old=5, m_new=15, n_new=15))

1.4142135623730951
1.7320508075688772


Since ASO only computes the "true" test score value in the limit of infinitely large samples, the estimate obtained using bootstrapping has some inherent variance, which can be reduced by adding more scores to the sample. The function above compute by what factor the uncertainty in the test result is being reduced. 

We can thus read the above as adding five more samples reducing the uncertainty by a factor of 1.41, while adding ten more sample only reduces it by 1.73. To strike a compromise with our computational budget, we thus only add five more samples each.

In [13]:
for i in range(5):
    print(f"Performing run #{i+6}...")
    reward_dist_freq_10.append(
        np.mean(run_dqn(
            env, 
            batch_size=BATCH_SIZE,
            num_episodes=NUM_EPISODES, 
            memory_size=MEMORY_SIZE, 
            num_hidden=NUM_HIDDEN, 
            discount_factor=DISCOUNT_FACTOR, 
            learn_rate=LEARN_RATE, 
            target_update_freq=10, 
            max_steps=MAX_STEPS,
            show_agent=SHOW_AGENT
        ))
    )
    reward_dist_freq_20.append(
        np.mean(run_dqn(
            env, 
            batch_size=BATCH_SIZE,
            num_episodes=NUM_EPISODES, 
            memory_size=MEMORY_SIZE, 
            num_hidden=NUM_HIDDEN, 
            discount_factor=DISCOUNT_FACTOR, 
            learn_rate=LEARN_RATE, 
            target_update_freq=10, 
            max_steps=MAX_STEPS,
            show_agent=SHOW_AGENT
        ))
    )

Performing run #6...
Performing run #7...
Performing run #8...
Performing run #9...
Performing run #10...


As a sanity check, we repeat the bootstrap analysis again:

In [14]:
print(bootstrap_power_analysis(reward_dist_freq_10, seed=SEED))
print(bootstrap_power_analysis(reward_dist_freq_20, seed=SEED))

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 1798.92it/s]


0.8356


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 1776.61it/s]

0.9248





The power has increased! We now come back to the comparison:

In [15]:
print(aso(rewards_freq_10, rewards_freq_20, num_jobs=4, seed=SEED))
print(bootstrap_test(rewards_freq_10, rewards_freq_20, num_jobs=4, seed=SEED))

Bootstrap iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999/1000 [00:23<00:00, 42.62it/s]


0.02028519498045956
0.005


Now we can be fairly confident in our assessment! The last part of this demo wants to demonstrate how we could facilitate comparisons between multiple models at once, for which the package also implements a specific function. Let us first train a third kind of model for a number of runs. This time, we do not vary the update frequency of the target network, but instead the discount factor. Not that there is no specific reason we test eight runs here other than two demonstrate that ASO does not require equally-sized samples:

In [16]:
reward_dist_discount_06 = []

for i in range(8):
    print(f"Performing run #{i+1}...")
    reward_dist_discount_06.append(
        np.mean(run_dqn(
            env, 
            batch_size=BATCH_SIZE,
            num_episodes=NUM_EPISODES, 
            memory_size=MEMORY_SIZE, 
            num_hidden=NUM_HIDDEN, 
            discount_factor=0.6, 
            learn_rate=LEARN_RATE, 
            target_update_freq=10, 
            max_steps=MAX_STEPS,
            show_agent=SHOW_AGENT
        ))
    )

Performing run #1...
Performing run #2...
Performing run #3...
Performing run #4...
Performing run #5...
Performing run #6...
Performing run #7...
Performing run #8...


In [17]:
multi_aso([reward_dist_freq_10, reward_dist_freq_20, reward_dist_discount_06], num_jobs=4, seed=SEED)

Model comparisons: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌| 2990/3000 [00:50<00:00, 48.45it/s]

array([[1., 1., 0.],
       [0., 1., 0.],
       [1., 1., 1.]])

We can read this result as the violation ratio of <row> compared  to <column> is value. Note that by suppling a dictionary as an argument and using `return_df=True`, we can output the result in a more readable form:

In [18]:
res_df = multi_aso(
    {
        "update freq = 10": reward_dist_freq_10, 
        "update freq = 20": reward_dist_freq_20, 
        "discount factor = 0.6": reward_dist_discount_06
    },
    num_jobs=4, seed=SEED, return_df=True
)


Model comparisons:   0%|                                                                                                                                       | 0/3000 [00:00<?, ?it/s][A
Model comparisons:   0%|▎                                                                                                                              | 8/3000 [00:00<00:55, 53.72it/s][A
Model comparisons:   1%|█                                                                                                                             | 24/3000 [00:00<00:49, 59.80it/s][A
Model comparisons:   1%|█▎                                                                                                                            | 32/3000 [00:00<00:55, 53.49it/s][A
Model comparisons:   1%|█▋                                                                                                                            | 40/3000 [00:00<00:57, 51.10it/s][A
Model comparisons:   2%|██                                 

Model comparisons:  12%|██████████████▋                                                                                                              | 352/3000 [00:07<01:02, 42.70it/s][A
Model comparisons:  12%|███████████████                                                                                                              | 360/3000 [00:08<01:02, 42.48it/s][A
Model comparisons:  12%|███████████████▎                                                                                                             | 368/3000 [00:08<01:02, 41.88it/s][A
Model comparisons:  13%|███████████████▋                                                                                                             | 376/3000 [00:08<01:06, 39.34it/s][A
Model comparisons:  13%|████████████████                                                                                                             | 384/3000 [00:08<01:16, 34.37it/s][A
Model comparisons:  13%|████████████████▎                   

Model comparisons:  23%|█████████████████████████████                                                                                                | 696/3000 [00:16<00:57, 39.76it/s][A
Model comparisons:  23%|█████████████████████████████▎                                                                                               | 704/3000 [00:16<00:57, 40.17it/s][A
Model comparisons:  24%|█████████████████████████████▋                                                                                               | 712/3000 [00:16<00:55, 41.12it/s][A
Model comparisons:  24%|██████████████████████████████                                                                                               | 720/3000 [00:17<00:53, 42.63it/s][A
Model comparisons:  24%|██████████████████████████████▎                                                                                              | 728/3000 [00:17<00:52, 43.28it/s][A
Model comparisons:  25%|██████████████████████████████▋     

Model comparisons:  35%|███████████████████████████████████████████▎                                                                                | 1047/3000 [00:25<00:42, 45.64it/s][A
Model comparisons:  35%|███████████████████████████████████████████▌                                                                                | 1055/3000 [00:25<00:42, 45.88it/s][A
Model comparisons:  35%|███████████████████████████████████████████▉                                                                                | 1063/3000 [00:25<00:42, 45.36it/s][A
Model comparisons:  36%|████████████████████████████████████████████▎                                                                               | 1071/3000 [00:26<00:42, 45.92it/s][A
Model comparisons:  36%|████████████████████████████████████████████▌                                                                               | 1079/3000 [00:26<00:40, 47.63it/s][A
Model comparisons:  36%|████████████████████████████████████

Model comparisons:  46%|█████████████████████████████████████████████████████████▍                                                                  | 1391/3000 [00:33<00:36, 44.23it/s][A
Model comparisons:  47%|█████████████████████████████████████████████████████████▊                                                                  | 1399/3000 [00:33<00:35, 45.70it/s][A
Model comparisons:  47%|██████████████████████████████████████████████████████████▏                                                                 | 1407/3000 [00:33<00:33, 46.90it/s][A
Model comparisons:  47%|██████████████████████████████████████████████████████████▍                                                                 | 1415/3000 [00:33<00:34, 46.40it/s][A
Model comparisons:  47%|██████████████████████████████████████████████████████████▊                                                                 | 1423/3000 [00:33<00:37, 41.61it/s][A
Model comparisons:  48%|████████████████████████████████████

Model comparisons:  58%|███████████████████████████████████████████████████████████████████████▋                                                    | 1735/3000 [00:40<00:27, 46.70it/s][A
Model comparisons:  58%|████████████████████████████████████████████████████████████████████████                                                    | 1743/3000 [00:40<00:28, 44.79it/s][A
Model comparisons:  58%|████████████████████████████████████████████████████████████████████████▎                                                   | 1751/3000 [00:40<00:29, 42.81it/s][A
Model comparisons:  59%|████████████████████████████████████████████████████████████████████████▋                                                   | 1759/3000 [00:41<00:29, 42.02it/s][A
Model comparisons:  59%|█████████████████████████████████████████████████████████████████████████                                                   | 1767/3000 [00:41<00:28, 43.76it/s][A
Model comparisons:  59%|████████████████████████████████████

Model comparisons:  70%|██████████████████████████████████████████████████████████████████████████████████████▏                                     | 2086/3000 [00:48<00:19, 47.22it/s][A
Model comparisons:  70%|██████████████████████████████████████████████████████████████████████████████████████▌                                     | 2094/3000 [00:48<00:19, 47.52it/s][A
Model comparisons:  70%|██████████████████████████████████████████████████████████████████████████████████████▉                                     | 2102/3000 [00:48<00:18, 47.37it/s][A
Model comparisons:  70%|███████████████████████████████████████████████████████████████████████████████████████▏                                    | 2110/3000 [00:48<00:18, 47.69it/s][A
Model comparisons:  71%|███████████████████████████████████████████████████████████████████████████████████████▌                                    | 2118/3000 [00:48<00:18, 48.36it/s][A
Model comparisons:  71%|████████████████████████████████████

Model comparisons:  81%|████████████████████████████████████████████████████████████████████████████████████████████████████▍                       | 2430/3000 [00:55<00:09, 60.96it/s][A
Model comparisons:  81%|████████████████████████████████████████████████████████████████████████████████████████████████████▊                       | 2438/3000 [00:55<00:09, 62.25it/s][A
Model comparisons:  82%|█████████████████████████████████████████████████████████████████████████████████████████████████████                       | 2446/3000 [00:55<00:08, 63.26it/s][A
Model comparisons:  82%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍                      | 2454/3000 [00:55<00:08, 63.46it/s][A
Model comparisons:  82%|█████████████████████████████████████████████████████████████████████████████████████████████████████▊                      | 2462/3000 [00:55<00:08, 63.13it/s][A
Model comparisons:  82%|████████████████████████████████████

Model comparisons:  92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋         | 2774/3000 [01:00<00:03, 63.63it/s][A
Model comparisons:  93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉         | 2782/3000 [01:01<00:03, 60.11it/s][A
Model comparisons:  93%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎        | 2790/3000 [01:01<00:03, 57.56it/s][A
Model comparisons:  93%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋        | 2798/3000 [01:01<00:03, 58.51it/s][A
Model comparisons:  94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉        | 2806/3000 [01:01<00:03, 59.70it/s][A
Model comparisons:  94%|████████████████████████████████████

In [19]:
res_df

Unnamed: 0,update freq = 10,update freq = 20,discount factor = 0.6
update freq = 10,1.0,1.0,0.0
update freq = 20,0.0,1.0,0.0
discount factor = 0.6,1.0,1.0,1.0



Model comparisons: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 2997/3000 [01:18<00:00, 57.00it/s][A

Thus, we can conclude here that lowering the discount factor actually seems to have a negative impact on the obtained rewards. 

First of all, thank for following this demo so far! Before letting you play with the different functions yourself, here a few disclaimers:

1. This demo didn't try to put forth a realistic experimental pipeline in Reinforcement learning - the cart pole problem is just a cute problem for demonstration purposes.
2. The use of significance threshold is very controversial, and ASO is no exception - instead of marking your results as significant / non-significant, report the output of the scores along with your effect size.
3. Significance tests aren't perfect and come with a certain degree of uncertainty, and ASO is no exception
    
For more information on the functions, check out the documentation under [REDACTED] or leave an issue on the Github repository [REDACTED].


### Bibliography

* Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that cansolve difficult learning control problems.IEEE transactions on systems, man, and cybernetics, (5):834–846, 1983.

* Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level controlthrough deep reinforcement learning.nature, 518(7540):529–533, 2015

* Hado Van Hasselt,  Yotam Doron,  Florian Strub,  Matteo Hessel,  Nicolas Sonnerat,  and JosephModayil.  Deep reinforcement learning and the deadly triad.arXiv preprint arXiv:1812.02648,2018.

