Distributed Training
====================

AgileRL can also be used for distributed training if you have multiple devices you want to take advantage of. We use the HuggingFace `Accelerate
<https://github.com/huggingface/accelerate>`_ library to implement this in an open manner, without hiding it behind too many layers of abstraction.
This keeps implementations simple, but also highly customisable, because the PyTorch training loop remains exposed beneath it all.
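
The snippet below is a minimal, generic sketch of this pattern rather than part of the AgileRL API: the model, optimiser and data are placeholders, and only the ``Accelerator`` calls are real. It shows how the ordinary PyTorch loop stays visible, with Accelerate handling device placement and gradient synchronisation.

.. code-block:: python

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()

    # Placeholder model and optimiser; any PyTorch module works here
    model = torch.nn.Linear(8, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # prepare() moves objects to the correct device on each process
    model, optimizer = accelerator.prepare(model, optimizer)

    for _ in range(10):
        batch = torch.randn(32, 8, device=accelerator.device)
        loss = model(batch).pow(2).mean()  # Placeholder loss
        optimizer.zero_grad()
        accelerator.backward(loss)  # Replaces loss.backward()
        optimizer.step()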

To launch distributed training scripts from the command line, use ``accelerate launch``. To customise the distributed training properties, pass the ``--config_file`` argument. An example
config file has been provided at ``configs/accelerate/accelerate.yaml``.

Putting this all together, launching a distributed training script can be done as follows:

.. code-block:: bash

    accelerate launch --config_file configs/accelerate/accelerate.yaml demo_online_distributed.py
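
If you do not yet have a config file, Accelerate can generate one through its interactive questionnaire, and individual properties can also be overridden at launch time. The commands below are illustrative; the ``--num_processes 4`` value is an assumption for a four-GPU machine.

.. code-block:: bash

    # Generate a config file by answering Accelerate's interactive questions
    accelerate config --config_file configs/accelerate/accelerate.yaml

    # Override individual properties when launching
    accelerate launch --config_file configs/accelerate/accelerate.yaml --num_processes 4 demo_online_distributed.py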

There are some key considerations to bear in mind when implementing a distributed training run:

* If you only want to execute something once, rather than repeating it for each process, e.g. printing a statement or logging to W&B, wrap it in ``if accelerator.is_main_process:``.
* Training happens in parallel on each device, so steps in an RL environment are also taken on each device. To count the number of global training steps, multiply the number of steps taken on a single device by the number of devices (assuming each device takes the same number). This also means distributed training speeds things up: if you would normally train for 100,000 steps on one device, you only need to train for 25,000 steps per device when using four devices. Both points are illustrated in the sketch below.
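
The following sketch illustrates both points; ``local_steps`` is a hypothetical counter that your own training loop would maintain on each process.

.. code-block:: python

    from accelerate import Accelerator

    accelerator = Accelerator()

    # Hypothetical per-device step counter maintained by your training loop
    local_steps = 25_000

    # Each process runs its own copy of the loop, so the global number of
    # environment steps is the per-device count times the number of processes
    global_steps = local_steps * accelerator.num_processes

    # Only print/log once, from the main process
    if accelerator.is_main_process:
        print(f"{global_steps} global steps across {accelerator.num_processes} devices")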

Example distributed training loop:

.. code-block:: python

    from agilerl.utils.utils import makeVectEnvs, initialPopulation
    from agilerl.components.replay_buffer import ReplayBuffer
    from agilerl.components.replay_data import ReplayDataset
    from agilerl.components.sampler import Sampler
    from agilerl.hpo.tournament import TournamentSelection
    from agilerl.hpo.mutation import Mutations
    from accelerate import Accelerator
    import numpy as np
    import os
    from torch.utils.data import DataLoader
    from tqdm import trange

    if __name__ == '__main__':
        accelerator = Accelerator()

        NET_CONFIG = {
            'arch': 'mlp',            # Network architecture
            'hidden_size': [32, 32],  # Actor hidden size
        }

        INIT_HP = {
            'POPULATION_SIZE': 4,     # Population size
            'DOUBLE': True,           # Use double Q-learning in DQN or CQN
            'BATCH_SIZE': 128,        # Batch size
            'LR': 1e-3,               # Learning rate
            'GAMMA': 0.99,            # Discount factor
            'LEARN_STEP': 1,          # Learning frequency
            'TAU': 1e-3,              # For soft update of target network parameters
            'POLICY_FREQ': 2,         # DDPG target network update frequency vs policy network
            # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
            'CHANNELS_LAST': False
        }

        env = makeVectEnvs('LunarLander-v2', num_envs=8)   # Create environment
        try:
            state_dim = env.single_observation_space.n       # Discrete observation space
            one_hot = True                                   # Requires one-hot encoding
        except Exception:
            state_dim = env.single_observation_space.shape   # Continuous observation space
            one_hot = False                                   # Does not require one-hot encoding
        try:
            action_dim = env.single_action_space.n            # Discrete action space
        except Exception:
            action_dim = env.single_action_space.shape[0]     # Continuous action space

        if INIT_HP['CHANNELS_LAST']:
            state_dim = (state_dim[2], state_dim[0], state_dim[1])

        pop = initialPopulation(algo='DQN',                 # Algorithm
                                state_dim=state_dim,        # State dimension
                                action_dim=action_dim,      # Action dimension
                                one_hot=one_hot,            # One-hot encoding
                                net_config=NET_CONFIG,      # Network configuration
                                INIT_HP=INIT_HP,            # Initial hyperparameters
                                population_size=INIT_HP['POPULATION_SIZE'],  # Population size
                                accelerator=accelerator)    # Accelerator

        field_names = ["state", "action", "reward", "next_state", "done"]
        memory = ReplayBuffer(action_dim=action_dim,    # Number of agent actions
                              memory_size=10000,        # Max replay buffer size
                              field_names=field_names)  # Field names to store in memory
        replay_dataset = ReplayDataset(memory, INIT_HP['BATCH_SIZE'])
        replay_dataloader = DataLoader(replay_dataset, batch_size=None)
        replay_dataloader = accelerator.prepare(replay_dataloader)
        sampler = Sampler(distributed=True,
                          dataset=replay_dataset,
                          dataloader=replay_dataloader)

        tournament = TournamentSelection(tournament_size=2,  # Tournament selection size
                                         elitism=True,       # Elitism in tournament selection
                                         population_size=INIT_HP['POPULATION_SIZE'],  # Population size
                                         evo_step=1)         # Evaluate using last N fitness scores

        mutations = Mutations(algo='DQN',                            # Algorithm
                              no_mutation=0.4,                       # No mutation
                              architecture=0.2,                      # Architecture mutation
                              new_layer_prob=0.2,                    # New layer mutation
                              parameters=0.2,                        # Network parameters mutation
                              activation=0,                          # Activation layer mutation
                              rl_hp=0.2,                             # Learning HP mutation
                              rl_hp_selection=['lr', 'batch_size'],  # Learning HPs to choose from
                              mutation_sd=0.1,                       # Mutation strength
                              arch=NET_CONFIG['arch'],               # Network architecture
                              rand_seed=1,                           # Random seed
                              accelerator=accelerator)               # Accelerator

        max_episodes = 1000  # Max training episodes
        max_steps = 500      # Max steps per episode

        # Exploration params
        eps_start = 1.0      # Max exploration
        eps_end = 0.1        # Min exploration
        eps_decay = 0.995    # Decay per episode
        epsilon = eps_start

        evo_epochs = 5       # Evolution frequency
        evo_loop = 1         # Number of evaluation episodes

        accel_temp_models_path = 'models/{}'.format('LunarLander-v2')
        if accelerator.is_main_process:
            if not os.path.exists(accel_temp_models_path):
                os.makedirs(accel_temp_models_path)

        print(f'\nDistributed training on {accelerator.device}...')

        # TRAINING LOOP
        for idx_epi in trange(max_episodes):
            accelerator.wait_for_everyone()
            for agent in pop:  # Loop through population
                state = env.reset()[0]  # Reset environment at start of episode
                score = 0
                for idx_step in range(max_steps):
                    # Get next action from agent
                    action = agent.getAction(state, epsilon)
                    next_state, reward, done, _, _ = env.step(
                        action)  # Act in environment
                    # Save experience to replay buffer
                    memory.save2memoryVectEnvs(
                        state, action, reward, next_state, done)
                    # Learn according to learning frequency
                    if memory.counter % agent.learn_step == 0 and len(
                            memory) >= agent.batch_size:
                        # Sample dataloader
                        experiences = sampler.sample(agent.batch_size)
                        # Learn according to agent's RL algorithm
                        agent.learn(experiences)
                    state = next_state
                    score += reward

            # Update epsilon for exploration
            epsilon = max(eps_end, epsilon * eps_decay)

            # Now evolve population if necessary
            if (idx_epi + 1) % evo_epochs == 0:
                # Evaluate population
                fitnesses = [
                    agent.test(
                        env,
                        swap_channels=False,
                        max_steps=max_steps,
                        loop=evo_loop) for agent in pop]

                if accelerator.is_main_process:
                    print(f'Episode {idx_epi+1}/{max_episodes}')
                    print(f'Fitnesses: {["%.2f"%fitness for fitness in fitnesses]}')
                    print(f'100 fitness avgs: {["%.2f"%np.mean(agent.fitness[-100:]) for agent in pop]}')

                # Tournament selection and population mutation
                accelerator.wait_for_everyone()
                for model in pop:
                    model.unwrap_models()
                accelerator.wait_for_everyone()
                if accelerator.is_main_process:
                    elite, pop = tournament.select(pop)
                    pop = mutations.mutation(pop)
                    for pop_i, model in enumerate(pop):
                        model.saveCheckpoint(f'{accel_temp_models_path}/DQN_{pop_i}.pt')
                accelerator.wait_for_everyone()
                if not accelerator.is_main_process:
                    for pop_i, model in enumerate(pop):
                        model.loadCheckpoint(f'{accel_temp_models_path}/DQN_{pop_i}.pt')
                accelerator.wait_for_everyone()
                for model in pop:
                    model.wrap_models()

        env.close()