# Deep Q-Network & Atari
*Prepared by Yashwanthi Anand*

For this section, we follow the implementations available at [CleanRL](https://github.com/vwxyzjn/cleanrl).

CleanRL is an open-source library of clean, single-file implementations of RL algorithms such as DQN, PPO, and A2C. Its simple implementations make it accessible to newcomers to RL while remaining robust enough for research.

# Working with CleanRL

There are two ways to experiment with CleanRL: (1) clone it to your local machine and run the code, or (2) use a prebuilt environment on Gitpod (note that you will have to authorize Gitpod's integration with GitHub).



## 1.   Cloning it to your local machine

Please follow the installation steps to get started with CleanRL [here](https://github.com/vwxyzjn/cleanrl?tab=readme-ov-file#get-started).


Since this session focuses only on running DQN on Atari, it suffices to install the following:
```
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-atari.txt
```

Alternatively, you can use `poetry` to install all the dependencies.

## 2.   Hosting on Gitpod

Please click [here](https://gitpod.io/#https://github.com/vwxyzjn/cleanrl) to launch the repository on Gitpod. This will load the entire repository.
Note that if you are new to Gitpod, you will have to authorize Gitpod's integration with GitHub.

Since this session focuses only on running DQN on Atari, it suffices to install the following:
```
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-atari.txt
```

Alternatively, you can use `poetry` to install all the dependencies.



## DQN Implementation

The file we will be using is located at `cleanrl/dqn_atari.py`.

### To run the code
Use `python cleanrl/dqn_atari.py --env-id <ENVIRONMENT_NAME>`.
For example, to run the DQN code on Atari Breakout, use `python cleanrl/dqn_atari.py --env-id BreakoutNoFrameskip-v4`.

### To run the code using `poetry`
```
poetry shell
python cleanrl/dqn_atari.py --env-id BreakoutNoFrameskip-v4
```

## Weights and Biases

CleanRL uses [Weights and Biases](https://wandb.ai/site) (`wandb`) for experiment tracking and model monitoring. `wandb` lets users log and visualize metrics, hyperparameters, and outputs in real time.

Please refer to [this page](https://docs.cleanrl.dev/get-started/experiment-tracking/) to learn how to set up `wandb`.



## Pseudocode



1. Select action a<sub>t</sub> with an ε-greedy policy: a random action with probability ε, otherwise the action with the highest Q-value
2. Store transitions (s<sub>t</sub>, a<sub>t</sub>, r<sub>t+1</sub>, s<sub>t+1</sub>) in replay buffer D
3. Sample a random mini-batch of transitions (s, a, r, s’) from D
4. Compute Q-learning targets w.r.t. the old, fixed weights **w<sup>-</sup>** of the target network
5. Optimize the MSE between the Q-network and the Q-learning targets \\
J(**w<sub>i</sub>**) = 𝐄<sub>s,a,r,s’~D</sub>[(r + γ max<sub>a’</sub> Q(s’, a’; **w<sup>-</sup>**) – Q(s, a; **w<sub>i</sub>**))<sup>2</sup>]
6. Update **w** using stochastic gradient descent
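
To see the six steps working together before reading the full implementation, here is a minimal tabular sketch. It is not CleanRL's code: the five-state chain environment, the `step` and `train` functions, and all hyperparameter values are hypothetical and chosen only for illustration. A table of Q-values stands in for the neural network, and a periodically copied second table stands in for the target network.

```python
import random
from collections import deque

# Hypothetical toy problem: a chain of states 0..4, actions 0 = left, 1 = right.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9


def step(state, action):
    """Move along the chain; reaching the last state pays reward 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(total_steps=5000, epsilon=0.5, lr=0.1, batch_size=16, target_every=50, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # online weights w
    q_target = [row[:] for row in q]                  # old, fixed weights w^-
    buffer = deque(maxlen=1000)                       # replay buffer D
    state = 0
    for t in range(1, total_steps + 1):
        # Step 1: epsilon-greedy action selection
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        next_state, reward, done = step(state, action)
        # Step 2: store the transition
        buffer.append((state, action, reward, next_state, done))
        state = 0 if done else next_state
        if len(buffer) >= batch_size:
            # Step 3: sample a random mini-batch
            for s, a, r, s2, d in rng.sample(list(buffer), batch_size):
                # Step 4: target computed from the frozen table
                target = r + GAMMA * max(q_target[s2]) * (1 - d)
                # Steps 5-6: gradient step on 0.5 * (target - Q(s, a))^2
                q[s][a] += lr * (target - q[s][a])
        # Periodically refresh the frozen weights
        if t % target_every == 0:
            q_target = [row[:] for row in q]
    return q


q_table = train()
greedy_actions = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_actions)
```

The deep version replaces the table lookup with a forward pass of the Q-network and the per-entry update with an optimizer step on the network weights.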






## The Code

The details of [CleanRL's DQN implementation](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py) are provided in the following code snippets.

DQN uses two networks: the Q-network and the target network. The following cell defines the architecture shared by both.



```
# Lines 106-124

# ALGO LOGIC: initialize agent here:
class QNetwork(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4),  # 4 input channels (stacked frames)
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, env.single_action_space.n),  # one Q-value per action
        )

    def forward(self, x):
        # Scale raw pixel values from [0, 255] to [0, 1]
        return self.network(x / 255.0)
```
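
The input size of `nn.Linear(3136, 512)` follows from the convolution arithmetic. The sketch below reproduces it without PyTorch, assuming 84x84 input frames (that resolution is an assumption, not stated in the snippet above) and no padding. The helper names `conv2d_out` and `flattened_features` are hypothetical.

```python
def conv2d_out(size, kernel, stride, padding=0):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(height=84, width=84):
    """Number of features left after the three conv layers of QNetwork and Flatten."""
    # (out_channels, kernel_size, stride) for each conv layer in QNetwork
    layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
    channels = 4  # input channels of the first conv layer
    for out_channels, kernel, stride in layers:
        height = conv2d_out(height, kernel, stride)
        width = conv2d_out(width, kernel, stride)
        channels = out_channels
    return channels * height * width


# 84 -> 20 -> 9 -> 7 per spatial dimension, so 64 * 7 * 7 features
print(flattened_features())  # 3136
```

If the input resolution changes, the first argument of that `nn.Linear` layer has to change with it.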





```
# Lines 177-189

# Initialize the Q-network and the target network (same architecture, same initial weights)
q_network = QNetwork(envs).to(device)
optimizer = optim.Adam(q_network.parameters(), lr=args.learning_rate)
target_network = QNetwork(envs).to(device)
target_network.load_state_dict(q_network.state_dict())

# Initialize the replay buffer to keep a record of (s, a, r, s') tuples
rb = ReplayBuffer(
    args.buffer_size,
    envs.single_observation_space,
    envs.single_action_space,
    device,
    optimize_memory_usage=True,
    handle_timeout_termination=False,
)
```





```
# Step 1: Action selection (epsilon-greedy): with probability epsilon take a random action,
# otherwise take the action with the highest Q-value
# Lines 196-201
# ALGO LOGIC: put action logic here

epsilon = linear_schedule(args.start_e, args.end_e, args.exploration_fraction * args.total_timesteps, global_step)
if random.random() < epsilon:
    actions = np.array([envs.single_action_space.sample() for _ in range(envs.num_envs)])
else:
    q_values = q_network(torch.Tensor(obs).to(device))
    actions = torch.argmax(q_values, dim=1).cpu().numpy()

```
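
The snippet above calls `linear_schedule` without showing it. A minimal sketch of such a helper, consistent with how it is called there (start value, end value, duration in steps, current step), could look like the following. The numbers in the usage lines are illustrative, not the script's defaults.

```python
def linear_schedule(start_e, end_e, duration, t):
    """Decay linearly from start_e to end_e over `duration` steps, then hold end_e."""
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)


# Illustrative values: decay from 1.0 to 0.01 over the first 1000 steps
for t in (0, 500, 1000, 2000):
    print(t, linear_schedule(1.0, 0.01, 1000, t))
```

Early in training epsilon is close to `start_e`, so almost every action is random. Once `duration` steps have passed, the agent acts greedily except for a small residual exploration rate `end_e`.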





```
# Step 2: Store transitions in replay buffer
# Line 219

rb.add(obs, real_next_obs, actions, rewards, terminations, infos)

```
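
Conceptually, a replay buffer is a fixed-size ring of transitions with uniform random sampling. The class below is a simplified sketch of that idea only. It is not the `ReplayBuffer` used in the snippet above, and the name `SimpleReplayBuffer` and its method signatures are hypothetical.

```python
import random


class SimpleReplayBuffer:
    """Fixed-capacity ring buffer of (obs, next_obs, action, reward, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index that the next write goes to
        self.rng = random.Random(seed)

    def add(self, obs, next_obs, action, reward, done):
        transition = (obs, next_obs, action, reward, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling with replacement
        return self.rng.choices(self.storage, k=batch_size)

    def __len__(self):
        return len(self.storage)


buffer = SimpleReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(obs=i, next_obs=i + 1, action=0, reward=0.0, done=False)
print(len(buffer), sorted(t[0] for t in buffer.storage))
```

Sampling at random from a large buffer breaks the correlation between consecutive frames, which is the reason DQN does not train directly on the most recent transition.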





```
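# Steps 3-4: Sample a mini-batch and compute the TD targets with the target network
# Lines 225-232

if global_step > args.learning_starts:
    if global_step % args.train_frequency == 0:
        # Step 3: sample a random mini-batch from the replay buffer
        data = rb.sample(args.batch_size)
        # Step 4: targets come from the target network, so no gradients are tracked
        with torch.no_grad():
            target_max, _ = target_network(data.next_observations).max(dim=1)
            # (1 - done) drops the bootstrap term for terminal transitions
            td_target = data.rewards.flatten() + args.gamma * target_max * (1 - data.dones.flatten())
        # Q(s, a) for the actions that were actually taken
        old_val = q_network(data.observations).gather(1, data.actions).squeeze()
        # MSE between targets and current estimates (the objective in step 5)
        loss = F.mse_loss(td_target, old_val)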
```
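
To make the target computation concrete, here is a small worked example in plain Python with made-up numbers. The helper names `td_targets` and `mse` are hypothetical. They mirror the arithmetic of `td_target` and `F.mse_loss` above on lists instead of tensors.

```python
GAMMA = 0.99  # illustrative discount factor


def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """r + gamma * max_a' Q(s', a'), with the bootstrap term zeroed on terminal steps."""
    return [
        r + gamma * max(q_next) * (1.0 - d)
        for r, q_next, d in zip(rewards, next_q_values, dones)
    ]


def mse(targets, predictions):
    """Mean squared error between two equally long lists."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# A batch of two transitions; the second one ends its episode.
rewards = [1.0, 0.0]
next_q = [[0.5, 2.0], [3.0, 1.0]]  # target-network Q-values for s'
dones = [0.0, 1.0]
old_val = [2.5, 0.5]               # Q-network values Q(s, a) for the actions taken

targets = td_targets(rewards, next_q, dones)  # [1.0 + 0.99 * 2.0, 0.0]
loss = mse(targets, old_val)                  # (0.48**2 + 0.5**2) / 2
print(targets, loss)
```

Note how the second target ignores the large next-state value 3.0: once an episode has terminated there is no future return to bootstrap from.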





```
# Steps 5-6: Minimize the loss with one gradient step on the Q-network weights
# Lines 241-243

optimizer.zero_grad()
loss.backward()
optimizer.step()
```





```
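# Target network update: every args.target_network_frequency steps, move the
# target weights toward the Q-network weights (tau = 1.0 gives a full copy)
# Lines 246-250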

if global_step % args.target_network_frequency == 0:
    for target_network_param, q_network_param in zip(target_network.parameters(), q_network.parameters()):
        target_network_param.data.copy_(
            args.tau * q_network_param.data + (1.0 - args.tau) * target_network_param.data
        )
```
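
The update above is a Polyak (soft) update that blends the two sets of weights. The sketch below applies the same formula to plain lists of numbers to show the effect of `tau`. The function name `soft_update` and the values are hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target for each pair of parameters."""
    return [
        tau * w + (1.0 - tau) * w_target
        for w_target, w in zip(target_params, online_params)
    ]


target = [0.0, 0.0]
online = [1.0, 2.0]

print(soft_update(target, online, tau=1.0))  # [1.0, 2.0]: hard update, a full copy
print(soft_update(target, online, tau=0.5))  # [0.5, 1.0]: halfway between the two
```

With `tau = 1.0` the target network is simply overwritten at each update, which matches the "old, fixed weights" of the pseudocode. Smaller values make the target network trail the Q-network more smoothly.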

