In [1]:
from stable_baselines3 import PPO, SAC
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

import gym
import numpy as np
from datetime import datetime

from KBMproject import ATLA
import KBMproject.utilities as utils

from citylearn.data import DataSet

Basic constants

In [2]:
DATASET_NAME = 'citylearn_challenge_2022_phase_2'
SAVE_DIR = 'Models/ATLA/'
LOG_DIR = 'logs/Phase3/ATLA/'
VERBOSITY = 0
DEVICE = 'cuda'

The max mean difference is, for each feature, the largest change between two samples minus the mean change. It is the maximum perturbation size for our adversary. The max difference represents the worst-case scenario we expect to encounter based on our training data. Because the bound is derived from the difference between samples, we subtract the mean difference so that, on average, the inter-sample change will not exceed the max recorded value. This is the boundary for the adversary's perturbation.
See bline obs analysis.ipynb in the PPO 500 results.
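The bound described above can be sketched as follows. This is a minimal illustration, not the code from the analysis notebook: the function name `max_minus_mean_diff` is hypothetical, and it assumes that "change between two samples" means the absolute difference between consecutive rows of an observation matrix.

```python
import numpy as np

def max_minus_mean_diff(observations):
    """Per-feature (max - mean) of the absolute change between consecutive samples.

    Assumption: rows are consecutive timesteps and columns are features.
    """
    obs = np.asarray(observations, dtype=float)
    step_change = np.abs(np.diff(obs, axis=0))  # |x[t+1] - x[t]| for every feature
    return step_change.max(axis=0) - step_change.mean(axis=0)

# Toy data: 3 timesteps, 2 features
toy_obs = np.array([[0.0, 1.0],
                    [0.5, 1.0],
                    [0.6, 0.0]])
toy_bound = max_minus_mean_diff(toy_obs)
print(toy_bound)  # approximately [0.2, 0.5]
```

For the first toy feature the changes are 0.5 and 0.1, so the bound is 0.5 - 0.3 = 0.2. A perturbation of that size added to an average change of 0.3 reaches the largest recorded change of 0.5.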

In [3]:
MAX_MEAN_DIFF = np.array([0.24977164, 0.24977164, 0.34341758, 0.69515118, 0.04606484,
                        0.04608573, 0.26690566, 0.26690266, 0.2669048 , 0.26690781,
                        0.62865948, 0.62865314, 0.62865568, 0.62865948, 0.52596206,
                        0.52596487, 0.52598294, 0.52596206, 0.75557218, 0.75558416,
                        0.75558188, 0.75557218, 0.28202381, 0.61189055, 0.00253725,
                        0.47459565, 0.0052361 , 0.89720221, 0.89720221, 0.89720221,
                        0.89720221])

MEAN_DIFF = np.array([0.12511418, 0.12511418, 0.18184461, 0.35953119, 0.10637713,
                     0.10636668, 0.15978021, 0.15978171, 0.15978064, 0.15977914,
                     0.36344801, 0.36345118, 0.36344991, 0.36344801, 0.3260062 ,
                     0.3260048 , 0.32599576, 0.3260062 , 0.44802713, 0.44802114,
                     0.44802228, 0.44802713, 0.16781362, 0.36620854, 0.00152669,
                     0.31896562, 0.00326229, 0.52109586, 0.52109586, 0.52109586,
                     0.52109586]) #NOTE: due to a typo, these values are the average of the max and mean differences, not the mean difference

# BUG

Whenever ATLA was resumed with a trained agent, the adversary was not given additional timesteps before training, so it never changed its policy. This explains why the only successful training involved multiple iterations.
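One plausible reading of this bug, consistent with the adversary gaining only one timestep per alternation in the training log, is that the adversary's pause threshold does not account for timesteps a loaded model has already accumulated. The sketch below only illustrates the arithmetic. The helper `pause_threshold` is hypothetical, and it assumes the pause callback stops training once `num_timesteps` reaches the threshold.

```python
T = 8759  # timesteps per episode, as printed by this notebook

def pause_threshold(alt, episodes_per_alt, steps_per_episode, loaded_timesteps=0):
    """Cumulative timestep count at which training pauses after alternation `alt` (0-indexed)."""
    return loaded_timesteps + episodes_per_alt * (alt + 1) * steps_per_episode

loaded = 100 * T  # an adversary resumed after 100 episodes of training

naive = pause_threshold(0, 20, T)                              # ignores the loaded count
with_offset = pause_threshold(0, 20, T, loaded_timesteps=loaded)

print(naive, naive <= loaded)              # 175180 True -> already past the threshold
print(with_offset, with_offset - loaded)   # 1051080 175180 -> 20 more episodes
```

With the offset, the adversary gets the same treatment the agent receives through `agent_n_ts`.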

#### Trials below used:
adversary with BScaledSum bounded by max_mean_diff with normalized action space

##### Trial 1 (1-9-21?):
- ALT_EPISODES = 20
- PRE_TRAINING_EPISODES = 50
- N_ALT = 10

The agent's training reward appeared close to converging after 20 episodes in ATLA, as per previous work, but the reward was not flat, so ALT_EPISODES could be increased. Reward was exponentially increasing after 50 episodes, so PRE_TRAINING_EPISODES could be increased. ATLA rewards and evals converged after 8 alternations.

##### Trial 2 (1-10-15?)
- N_ALT = 10
- ALT_EPISODES = 20
- PRE_TRAINING_EPISODES = 300

Agent was closer to converging with longer pre-training. Final eval/training rewards did not exceed trial 1. Agent did not appear to converge after 10 alternations; perhaps N_ALT must be increased along with PRE_TRAINING_EPISODES. The same was true within alternations; perhaps ALT_EPISODES must also increase. KPIs for this agent were worse than trial 1.

##### Trial 3 (1-11-14?)
- ALT_EPISODES = 30
- N_ALT = 15
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 300 results\default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_0.zip'

Agent KPIs were similar to no controller, which is minimally exploitable but also useless.

Since 20 was too few and 300 too many, could 100 pre-training episodes be better?

#### Trials below used:
adversary with BScaledSum bounded by mean_diff with normalized action space

##### Trial 4 (1-12-21)
- N_ALT = 10
- ALT_EPISODES = 20
- PRE_TRAINED_AGENT = '20 bin PPO 300 results\default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_0.zip'

With the smaller adv action space this agent is the best performer yet, slightly outperforming trial 1. It seems that the capability of the adversary pushes the agent into taking minimal actions to avoid being manipulated, which is why more training led to worse performance.

##### Trial 5 (1-13-14) used half the mean difference in the adv action space
- N_ALT = 10
- ALT_EPISODES = 20
- PRE_TRAINED_AGENT = '20 bin PPO 300 results\default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_0.zip'
 
While its KPIs were lower than the pre-ATLA model, under the ACG attack the adversarial regret was reduced and its KPIs were higher in the presence of the adversary. It's unclear if more ATLA alternations would improve convergence.


##### Trial 6 (1-14-15) Uses the pre-ATLA PPO 500 as the pre-trained agent, instead of the previous 300

#### mean diff/2 does not seem powerful enough as an attack; it only reduces performance by a few hundred points

##### Trial 7 (1-16-10)
- N_ALT = 10
- ALT_EPISODES = 20
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 500 results/default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_500.zip'
- PERTURBATION_SCALE = 1
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

Trained with mean diff; the eval scores were still increasing slowly when training ended. Notably, the CityLearn+ATLA paper showed the agent converging during each alt of 20 episodes; perhaps longer and fewer alts will perform better.

##### Trial 8 (1-17-9)
- N_ALT = 3*
- AGENT_ALT_EPISODES = 100* #PPO takes longer to converge than SAC
- ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 500 results/default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_500.zip'
- PERTURBATION_SCALE = 1
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

Agent plateaued after the third iteration; adding alternations may improve performance

##### Trial 9 (1-17-21)
- N_ALT = 5*
- AGENT_ALT_EPISODES = 100 #PPO takes longer to converge than SAC
- ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 500 results/default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_500.zip'
- PERTURBATION_SCALE = 1
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

Evals and mean reward were flat for the 5th alternation. Produced the highest rewards so far, comparable to 1-14-15, which had half the perturbation space.

##### Trial 10 (1-18-21)
- N_ALT = 5
- AGENT_ALT_EPISODES = 100 #PPO takes longer to converge than SAC
- ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 500 results/default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_500.zip'
- PERTURBATION_SCALE = 2*
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

Reward was in the -7000s at the end of training, meaning that the agent's performance was too far below the baseline.

##### All above masked time features

##### Trial 11 (1-25-*)
Same as trial 9 without any features masked
- N_ALT = 5
- AGENT_ALT_EPISODES = 100 #PPO takes longer to converge than SAC
- ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 500 results/default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_500.zip'
- PERTURBATION_SCALE = 1
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE
- MASK=np.arange(0,31)*



##### Trial 12 (1-30-15)
Same hyper-params as the most successful trial, but the training order of the adversary and agent is reversed to match the ATLA paper implementation. The agent will start training against the random perturbations of an untrained adversary.
- N_ALT = 5
- AGENT_ALT_EPISODES = 100 #PPO takes longer to converge than SAC
- ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 500 results/default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_500.zip'
- MASK=np.arange(6,31) #only features 7-31 will be perturbed, temporal features left alone
- PERTURBATION_SCALE = 1
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

Agent might need another alt to converge, since one was "lost" training against the random adversary. It's interesting that the rewards IMPROVED with the untrained/random adversary.

##### Trial 13 (2-1-9)

Increase the number of alts so eval scores are flat/converged

- N_ALT = 7*
- AGENT_ALT_EPISODES = 100 #PPO takes longer to converge than SAC
- ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = '20 bin PPO 500 results/default_PPO_citylearn_challenge_2022_phase_2_Building_6_20_bins_500.zip'
- MASK=np.arange(6,31) #only features 7-31 will be perturbed, temporal features left alone
- PERTURBATION_SCALE = 1
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

Results were worse than the previous trial. Instead, try loading the previous agent and adversary to continue alternations.

# BEST
##### Trial 14 (2-3-03)

- Continuation of trial 12, by loading agent and adversary
- N_ALT = 2
- AGENT_ALT_EPISODES = 100 #PPO takes longer to converge than SAC
- ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
- PRE_TRAINING_EPISODES = 0
- PRE_TRAINED_AGENT = 'Models\ATLA\PPO agent 100 alts over 500+500 1-30-15.zip'
- PRE_TRAINED_ADV = 'Models\ATLA\SAC adversary 20 alts over 100 1-30-15.zip'
- MASK=np.arange(6,31) #only features 7-31 will be perturbed, temporal features left alone
- PERTURBATION_SCALE = 1
- PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

Power consumption is the same as trial 12, but other metrics improved

In [4]:
BINS = 20
R_EXP = 3 #for norm distance reward

N_ALT = 2
AGENT_ALT_EPISODES = 100 #PPO takes longer to converge than SAC
ADV_ALT_EPISODES = 20 #SAC adequately converges in this time
PRE_TRAINING_EPISODES = 0
PRE_TRAINED_AGENT = r'Models\ATLA\PPO agent 100 alts over 500+500 1-30-15.zip' #raw strings keep the Windows backslashes literal
PRE_TRAINED_ADV = r'Models\ATLA\SAC adversary 20 alts over 100 1-30-15.zip'
MASK=np.arange(6,31) #only features 7-31 will be perturbed, temporal features left alone
PERTURBATION_SCALE = 1
PERTURBATION_SPACE = MEAN_DIFF*PERTURBATION_SCALE

EVAL_PER_ALT = 1
ADV_TOTAL_EPISODES = ADV_ALT_EPISODES*N_ALT
AGENT_TOTAL_EPISODES = AGENT_ALT_EPISODES*N_ALT + PRE_TRAINING_EPISODES


Define SB3 environments. Note that the eval and training environments must be different objects

In [5]:
kwargs = dict(
    schema=DataSet.get_schema(DATASET_NAME),
    action_bins=BINS,
    T=None #intended to shorten evaluations, but kwargs is never passed to the environments below, so it has no effect
)
agent_env = utils.make_discrete_env(schema=DataSet.get_schema(DATASET_NAME),  
                        action_bins=BINS,
                        seed=0)

agent_eval_env = utils.make_discrete_env(schema=DataSet.get_schema(DATASET_NAME),  
                        action_bins=BINS,
                        seed=42)

adv_env = utils.make_discrete_env(schema=DataSet.get_schema(DATASET_NAME),  
                        action_bins=BINS,
                        seed=0)

adv_eval_env = utils.make_discrete_env(schema=DataSet.get_schema(DATASET_NAME),  
                        action_bins=BINS,
                        seed=42)
if kwargs['T'] is not None:
    print('T should be None unless this is a test')

In [6]:
T = agent_env.time_steps - 1
print(f'Each episode is {T} timesteps')

Each episode is 8759 timesteps


Define agent (could load/save pretrained agent)

In [7]:
if PRE_TRAINED_AGENT is None:
    policy_kwargs = dict(net_arch=[256, 256])
    agent = PPO('MlpPolicy', 
                agent_env,
                device=DEVICE,
                policy_kwargs=policy_kwargs,
                tensorboard_log=LOG_DIR,
                verbose=VERBOSITY,
                )
    print('new agent defined')
else:
    agent = PPO.load(path=PRE_TRAINED_AGENT,
                     env=agent_env,
                     device=DEVICE,
                     tensorboard_log=LOG_DIR,
                     verbose=VERBOSITY,
                     print_system_info=True,
                     #force_reset=False, #default is true for continued training ref: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#stable_baselines3.ppo.PPO.load
                     )
    print('agent loaded from storage')

== CURRENT SYSTEM INFO ==
- OS: Windows-10-10.0.22631-SP0 10.0.22631
- Python: 3.10.12
- Stable-Baselines3: 1.8.0
- PyTorch: 1.12.1
- GPU Enabled: True
- Numpy: 1.25.1
- Gym: 0.21.0

== SAVED MODEL SYSTEM INFO ==
- OS: Windows-10-10.0.19045-SP0 10.0.19045
- Python: 3.10.12
- Stable-Baselines3: 1.8.0
- PyTorch: 1.12.1
- GPU Enabled: True
- Numpy: 1.23.5
- Gym: 0.21.0

agent loaded from storage


The number of timesteps an agent has trained for is non-zero when it is loaded from storage. This must be added to the pause and total timesteps so training is not prematurely aborted

In [8]:
agent_n_ts = agent.num_timesteps

In [9]:
now = datetime.now()
dtg = f'{now.month}-{now.day}-{now.hour}'

Name contains the RL algorithm, episodes per alternation and total episodes, followed by the date-time with hour precision

In [10]:
agent_name = f'{agent.__class__.__name__} agent {AGENT_ALT_EPISODES} alts over '
if PRE_TRAINING_EPISODES > 0:
    agent_name += f'{PRE_TRAINING_EPISODES}+'
else:
    agent_name += f'{agent_n_ts//T}+'
agent_name += f'{AGENT_ALT_EPISODES*N_ALT} {dtg}'


Agent pre-training

In [11]:
if PRE_TRAINING_EPISODES > 0:
    print(f'Pre-training for {PRE_TRAINING_EPISODES*T} timesteps ({PRE_TRAINING_EPISODES} episodes)')
    agent.learn(total_timesteps=AGENT_TOTAL_EPISODES*T + agent_n_ts,
                callback=[EvalCallback(Monitor(agent_eval_env),
                                       eval_freq=PRE_TRAINING_EPISODES//EVAL_PER_ALT*T,
                                       verbose=VERBOSITY),
                          ATLA.HParamCallback(),
                          ATLA.PauseOnStepCallback(PRE_TRAINING_EPISODES*T + agent_n_ts)], #stops training before ts budget expended
                tb_log_name=agent_name,
                reset_num_timesteps=False, #allows training to continue where it left off
                progress_bar=True,
                log_interval=1 #start logging after first episode, for debugging
                )
    print(f'Agent pretrained for {agent.num_timesteps - agent_n_ts} timesteps, or {(agent.num_timesteps - agent_n_ts)/T} episodes')
else:
    print('No pretraining specified')

No pretraining specified


Define adversary's reward

In [12]:
rwd = ATLA.NormScaleReward(adv_env, 
                            np.inf,
                            exp=R_EXP,
                            )

Define an adv action space in [-1,1] for ATLA.BScaledSumPrevProj, which scales a maximum perturbation

In [13]:
normalized_a_space = gym.spaces.Box(low=-1*np.ones(MASK.shape),
                                    high=np.ones(MASK.shape),
                                    dtype='float32',)



##### Parameterize the B function
- The adversary adds a bounded perturbation to the current observation with B(s) as BScaledSum

Max perturbation is MEAN_DIFF*PERTURBATION_SCALE (trial 5 reduced it to the mean diff divided by 2)

In [14]:
B_params = dict(
    #clip_bound=np.ones(agent_env.observation_space.shape)*0.33,
    max_perturbation=np.ones(MASK.shape)*PERTURBATION_SPACE[MASK]
                  )
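As a rough illustration of how a normalized action in [-1, 1] could be turned into a bounded perturbation on the masked features, consider the sketch below. It is not the implementation of `ATLA.BScaledSum`. The function `scaled_sum_perturbation` and its clipping step are assumptions made for this example.

```python
import numpy as np

def scaled_sum_perturbation(obs, adv_action, max_perturbation, mask):
    """Add adv_action (in [-1, 1]) scaled by max_perturbation to the masked features only.

    Hypothetical stand-in for a scaled-sum B(s); unmasked features are left unchanged.
    """
    perturbed = np.array(obs, dtype=float)           # copy so the input is not modified
    action = np.clip(adv_action, -1.0, 1.0)          # assumption: out-of-range actions are clipped
    perturbed[mask] += action * max_perturbation     # per-feature scaling to the allowed bound
    return perturbed

toy_obs = np.zeros(5)
toy_mask = np.arange(2, 5)                   # perturb only the last three features
toy_max = np.array([0.1, 0.2, 0.4])          # one bound per masked feature
toy_action = np.array([1.0, -0.5, 2.0])      # the last entry is clipped to 1.0

toy_out = scaled_sum_perturbation(toy_obs, toy_action, toy_max, toy_mask)
print(toy_out)  # approximately [0, 0, 0.1, -0.1, 0.4]
```

Keeping the adversary's action space normalized means the same policy output range can be reused when PERTURBATION_SCALE changes, since only the scaling vector differs.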

Define adversary's environment

In [15]:
kwargs = dict(
    #adv_reward=rwd, #use default negative agent reward
    victim=agent,
    B=ATLA.BScaledSum,
    action_space=normalized_a_space, #[-1,1] for scaled B defined above
    feature_mask=MASK, 
    B_kwargs=B_params,
)
adv_eval_env = ATLA.AdversaryATLAWrapper(env=adv_eval_env, **kwargs)
adv_eval_env = Monitor(adv_eval_env)

adv_env = ATLA.AdversaryATLAWrapper(env=adv_env, **kwargs)


In [16]:
#check_env(adv_env,)

Define adversary

In [17]:
if PRE_TRAINED_ADV is None:
    policy_kwargs = dict(net_arch=[256, 256])
    adversary = SAC('MlpPolicy', 
            Monitor(adv_env),
            device=DEVICE,
            policy_kwargs=policy_kwargs,
            tensorboard_log=LOG_DIR,
            verbose=VERBOSITY,
            )
    print('new adversary defined')
else:
    adversary = SAC.load(path=PRE_TRAINED_ADV,
                     env=adv_env,
                     device=DEVICE,
                     tensorboard_log=LOG_DIR,
                     verbose=VERBOSITY,
                     print_system_info=True,
                     #force_reset=False, #default is true for continued training ref: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#stable_baselines3.ppo.PPO.load
                     )
    print('agent loaded from storage')


== CURRENT SYSTEM INFO ==
- OS: Windows-10-10.0.22631-SP0 10.0.22631
- Python: 3.10.12
- Stable-Baselines3: 1.8.0
- PyTorch: 1.12.1
- GPU Enabled: True
- Numpy: 1.25.1
- Gym: 0.21.0

== SAVED MODEL SYSTEM INFO ==
- OS: Windows-10-10.0.19045-SP0 10.0.19045
- Python: 3.10.12
- Stable-Baselines3: 1.8.0
- PyTorch: 1.12.1
- GPU Enabled: True
- Numpy: 1.23.5
- Gym: 0.21.0

agent loaded from storage


In [18]:
adv_name = f'{adversary.__class__.__name__} adversary {ADV_ALT_EPISODES} alts over {ADV_TOTAL_EPISODES} {dtg}'

Define the adversary's perturbation function for the victim environment. We use a function which applies the corresponding B(s) to the adversary's prediction 

In [19]:
perturbation = ATLA.sb3_perturbation(adversary,)

Wrap agent's environments for ATLA

In [20]:
#perturbation=adversary.predict
agent_env = ATLA.VictimATLAWrapper(agent_env,
                                   perturbation,)
agent_eval_env = ATLA.VictimATLAWrapper(agent_eval_env,
                                        perturbation,)
agent_eval_env = Monitor(agent_eval_env)


In [21]:
check_env(agent_env)

Replace pre-training environment with ATLA environment

In [22]:
agent.set_env(agent_env)

Define ATLA evaluation callbacks

In [23]:
kwargs = dict(
    verbose=VERBOSITY
)

adv_eval_callback = EvalCallback(adv_eval_env, 
                                 eval_freq=ADV_ALT_EPISODES//EVAL_PER_ALT*T, 
                                 **kwargs)
agent_eval_callback = EvalCallback(agent_eval_env,
                                   eval_freq=AGENT_ALT_EPISODES//EVAL_PER_ALT*T,
                                   **kwargs)

Conduct ATLA. Note:
- The agents are not reset between iterations; this prevents attributes like scaled exploration and learning rates from resetting.
- A callback pauses training after a number of episodes has elapsed but before the max training budget is reached (does this work better than resetting?).
- The adversary was originally trained first, so that the agent would start training against a trained adversary. However, the agent is trained first in the ATLA paper, implying it first faces a randomly initialized adversary. Try reversing them?

In [24]:
kwargs = dict(
    reset_num_timesteps=False, #allows training to continue where it left off between .learn() calls
    progress_bar=False, # progress bar really slows cell execution
    log_interval=1 #start logging after first episode, useful for debugging
)
print(f'ATLA for {N_ALT}')
for alt in range(N_ALT):
    #first trial had the agent train first, so these were reversed
    agent.learn(total_timesteps=AGENT_TOTAL_EPISODES*T + agent_n_ts,
                callback=[agent_eval_callback,
                          ATLA.HParamCallback(),
                          ATLA.PauseOnStepCallback(T*(AGENT_ALT_EPISODES*(1 + alt) + PRE_TRAINING_EPISODES) + agent_n_ts)],
                    tb_log_name=agent_name,
                    **kwargs)
    print(f'Agent trained has for {agent.num_timesteps} ts ({agent.num_timesteps/T} episodes) up to iteration {alt}')

    adversary.learn(total_timesteps=ADV_TOTAL_EPISODES*T,
                    callback=[adv_eval_callback,
                              ATLA.AdvDistanceTensorboardCallback(),
                              ATLA.HParamCallback(),
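                              ATLA.PauseOnStepCallback(ADV_ALT_EPISODES*(1 + alt)*T)], #pauses training before the max ts, updates each iter; NOTE: no offset for a loaded adversary's num_timesteps (unlike agent_n_ts for the agent), see BUG above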
                    tb_log_name=adv_name,
                    **kwargs)
    print(f'Adversary has trained for {adversary.num_timesteps} ts ({adversary.num_timesteps/T} episodes) up to iteration {alt}')

    

ATLA for 2
Agent trained has for 9636072 ts (1100.1338052289075 episodes) up to iteration 0
Adversary has trained for 875901 ts (100.00011416828406 episodes) up to iteration 0
Agent trained has for 10511972 ts (1200.1338052289075 episodes) up to iteration 1
Adversary has trained for 875902 ts (100.0002283365681 episodes) up to iteration 1


Save models

In [25]:
if SAVE_DIR is not None:
    agent.save(SAVE_DIR + agent_name)
    adversary.save(SAVE_DIR + adv_name)