[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Paulescu/hands-on-rl/blob/main/03_cart_pole/notebooks/11_train_deep_q_agent_in_google_colab.ipynb)

# 00 Deep Q agent with full experimentation tracking

#### 👉 Let's train a Deep Q agent to solve the [`Cart Pole`](https://gym.openai.com/envs/CartPole-v1/) environment.

#### 👉 This is the network we will use to represent the optimal Q-function 👇🏽👇🏽👇🏽

![nn](https://raw.githubusercontent.com/Paulescu/hands-on-rl/4c8e064d56c22828c97aa2e3f773050fa36d7d6f/04_cart_pole_tune_hparams_like_a_pro/images/deep_q_net.svg)
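
As a reminder of what such a network is trained to approximate, here is a minimal sketch of the standard one-step Q-learning target. The function name `td_target` and its arguments are illustrative assumptions, not part of the `src` package, and the actual loss used by `QAgent` is not shown in this notebook. In practice `next_q_values` would be the network's (or a target network's) outputs for the next state.

```python
def td_target(reward, next_q_values, done, discount_factor=0.99):
    """One-step Q-learning target (hypothetical helper, for illustration only).

    If the episode ended, there is no future value to bootstrap from,
    so the target is just the immediate reward.
    """
    if done:
        return reward
    # Bootstrap from the best action value in the next state.
    return reward + discount_factor * max(next_q_values)


# Example: reward of 1 per step, two actions available in the next state.
print(td_target(1.0, [10.0, 12.0], done=False))
print(td_target(1.0, [10.0, 12.0], done=True))
```

The network's prediction for the action actually taken is then regressed toward this target.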

## Setup for Google Colab 🐍⚒️

In [3]:
if 'google.colab' in str(get_ipython()):
    
    !git clone https://github.com/Paulescu/hands-on-rl.git

    # navigate to lesson directory
    %cd /content/hands-on-rl/03_cart_pole

    # install exact package versions
    %pip install -r requirements.txt
    print('Go to Runtime > Restart runtime to make sure Python uses the exact package versions we just installed.')

In [4]:
if 'google.colab' in str(get_ipython()):
    %cd /content/hands-on-rl/04_cart_pole_tune_hparams_like_a_pro
    !python setup.py install
    print('Local package installed!')

In [None]:
if 'google.colab' in str(get_ipython()):
    
    !git clone https://github.com/Paulescu/hands-on-rl.git

    # navigate to lesson directory
    %cd /content/hands-on-rl/03_cart_pole

    # install exact package versions
    %pip install .
    print('Go to Runtime > Restart runtime to make sure Python uses the exact package versions we just installed.')

-----

In [5]:
%load_ext autoreload
%autoreload 2
%pylab inline
%config InlineBackend.figure_format = 'svg'

Populating the interactive namespace from numpy and matplotlib


## Environment 🌎

In [6]:
import gym
env = gym.make('CartPole-v1')

## Log in to your W&B account

In [7]:
import wandb

wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mpaulescu[0m (use `wandb login --relogin` to force relogin)


True

## Start a W&B run

In [8]:
run = wandb.init(
    project="deep-q-learning-hyperparameters",
    entity="paulescu"
)

## Hyperparameters

In [9]:
# Good hyper-parameters
# make you feel great!
hparams = {
    'learning_rate': 0.00016151809562265122,
    'discount_factor': 0.99,
    'batch_size': 32,
    'memory_size': 10000,
    'freq_steps_train': 8,
    'freq_steps_update_target': 10,
    'n_steps_warm_up_memory': 1000,
    'n_gradient_steps': 16,
    'nn_hidden_layers': [256, 256],
    'max_grad_norm': 10,
    'normalize_state': False,
    'epsilon_start': 0.9,
    'epsilon_end': 0.14856584122699473,
    'steps_epsilon_decay': 10000,
}

SEED = 2386916045
# SEED = 0
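
To make the three exploration hyperparameters concrete, here is a minimal sketch of an epsilon schedule. It assumes a linear decay from `epsilon_start` to `epsilon_end` over `steps_epsilon_decay` environment steps; the schedule actually implemented in `QAgent` is not shown in this notebook and may differ. The function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(
    step,
    epsilon_start=0.9,
    epsilon_end=0.14856584122699473,
    steps_epsilon_decay=10000,
):
    """Exploration rate after `step` environment steps (assumed linear decay)."""
    # Fraction of the decay period completed, clipped to [0, 1].
    fraction = min(max(step, 0) / steps_epsilon_decay, 1.0)
    return epsilon_start + fraction * (epsilon_end - epsilon_start)


for step in (0, 2500, 5000, 10000, 20000):
    print(step, round(linear_epsilon(step), 4))
```

Under this assumption the agent stops reducing exploration after 10,000 steps and keeps acting randomly about 15% of the time for the rest of training.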

### Log hyperparameters

In [10]:
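# Update the run's config in place. Reassigning `wandb.config` would replace
# it with a plain dict that is never synced to W&B.
wandb.config.update(hparams)
wandb.config.update({"seed": SEED})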

## ⚠️ Fix random seeds to ensure reproducible runs

In [11]:
from src.utils import set_seed
set_seed(env, SEED)
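
The body of `src.utils.set_seed` is not shown here. As a rough sketch of what such a helper typically does, the hypothetical function below seeds every random number generator the training loop might touch. It assumes NumPy and PyTorch may be installed and that the environment exposes the older `env.seed()` method of Gym; newer Gym and Gymnasium releases pass the seed to `env.reset(seed=...)` instead.

```python
import random


def seed_everything(seed, env=None):
    """Seed the common sources of randomness (illustrative, not the repo's code)."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)  # requires 0 <= seed < 2**32
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

    if env is not None:
        # Assumption: older Gym API.
        env.seed(seed)
        env.action_space.seed(seed)


seed_everything(2386916045)
first_draw = random.random()
seed_everything(2386916045)
print(first_draw == random.random())  # same seed, same draw
```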

## Deep Q-Agent

In [12]:
from src.q_agent import QAgent
agent = QAgent(env, **hparams, run=wandb) #run=run)

67,586 parameters
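
The parameter count printed above can be checked by hand. The sketch below assumes a plain fully connected network with bias terms: 4 inputs (the CartPole observation), the two hidden layers of 256 units from `nn_hidden_layers`, and 2 outputs (one Q-value per action). The helper `count_mlp_parameters` is hypothetical and not part of the `src` package.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out  # weight matrix + bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


n_params = count_mlp_parameters([4, 256, 256, 2])
print(f'{n_params:,} parameters')  # 67,586 parameters
```

The layers contribute 1,280 + 65,792 + 514 parameters, which matches the agent's output.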


## Train the agent 🏋️

In [13]:
from src.loops import train
train(agent, env, n_episodes=200, run=run)

100%|██████████████████████████████████████████████████████████████████████████| 200/200 [08:32<00:00,  2.56s/it]


## Evaluate the agent ⏱️

In [14]:
from src.loops import evaluate
rewards, steps = evaluate(
    agent, env,
    n_episodes=1000,
    epsilon=0.00
)

import numpy as np
reward_avg = np.array(rewards).mean()
reward_std = np.array(rewards).std()
print(f'Reward average {reward_avg:.2f}, std {reward_std:.2f}')

100%|████████████████████████████████████████████████████████████████████████| 1000/1000 [03:38<00:00,  4.57it/s]

Reward average 500.00, std 0.00





### Log evaluation metrics

In [15]:
# log both metrics in one call so they share the same W&B step
wandb.log({
    'eval/reward_avg': reward_avg,
    'eval/reward_std': reward_std,
})

## Plot reward distribution and save it to W&B

In [16]:
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots(figsize=(10, 4))
ax.set_title("Rewards")
# draw the histogram on the axes created above
pd.Series(rewards).plot(kind='hist', bins=100, ax=ax)

wandb.log({"chart": plt})

plt.show()

## End of the experiment

In [17]:
run.finish()


| Metric | Run history |
|---|---|
| eval/reward_avg | ▁ |
| eval/reward_std | ▁ |
| train/epsilon | ██████▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▃▃▂▁▁▁▁▁▁▁▁▁ |
| train/loss | ▁▁▁▂▂▄▃▄▇▅▃▇██▇▅▃▇▅▃▁▂▁▂▁▁▁▁▁▁▁▁▁▁▃▁▄▁▁▁ |
| train/replay_memory_size | ▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▆▆▇█████████ |
| train/reward | ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▂▂▂▃▃▃▄▄▄▄▅▅▄▆▆██ |
| train/steps | ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▂▂▂▃▃▃▄▄▄▄▅▅▄▆▆██ |

| Metric | Run summary (final value) |
|---|---|
| eval/reward_avg | 500.0 |
| eval/reward_std | 0.0 |
| train/epsilon | 0.14857 |
| train/loss | 27.68528 |
| train/replay_memory_size | 10000.0 |
| train/reward | 500.0 |
| train/steps | 500.0 |
