
DQN can't find a good policy #11

Open
frostyduck opened this issue Jun 11, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@frostyduck

frostyduck commented Jun 11, 2020

Following your advice, I switched from the OpenAI Baselines implementation to Stable-Baselines for the Kundur system training.

import tensorflow as tf
from stable_baselines import DQN

# CustomDQNPolicy, SaveOnBestTrainingRewardCallback, storedData, savedModel
# and model_name are defined elsewhere in my script.
def main(learning_rate, env):
    tf.reset_default_graph()   # clear any previous TF1 graph
    graph = tf.get_default_graph()

    model = DQN(CustomDQNPolicy, env, learning_rate=learning_rate, verbose=0)
    callback = SaveOnBestTrainingRewardCallback(check_freq=1000, storedData=storedData)
    time_steps = 900000
    model.learn(total_timesteps=int(time_steps), callback=callback)

    model_path = savedModel + "/" + model_name + "_lr_%s_90w.pkl" % str(learning_rate)
    print("Saving final model to: " + model_path)
    model.save(model_path)

However, after 900,000 training steps the DQN agent still cannot find a good policy. Please see the average-reward progress plot:

https://www.dropbox.com/preview/DQN_adaptivenose.png?role=personal

I used the following environment settings:

case_files_array.append(folder_dir +'/testData/Kundur-2area/kunder_2area_ver30.raw')
case_files_array.append(folder_dir+'/testData/Kundur-2area/kunder_2area.dyr')
dyn_config_file = folder_dir+'/testData/Kundur-2area/json/kundur2area_dyn_config.json'
rl_config_file = folder_dir+'/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json'

My guess is that in the baseline scenario kunder_2area_ver30.raw (without increased system loading), the short circuit might not lead to loss of stability during the simulation. Therefore, the DQN agent perhaps settles on a "no action" policy so as not to incur the actionPenalty = 2.0. According to the reward progress plot, during training the agent cannot find a policy better than a mean reward of 603.05, and when testing, mean_reward = 603.05 corresponds to the "no action" policy (please see the figure below):

https://www.dropbox.com/preview/no%20actions%20case.png?role=personal

However, this is only my guess; I may be wrong. I am thinking of trying scenarios with increased loading so that the simulation is certain to lose stability.

Originally posted by @frostyduck in #9 (comment)
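
As a side note, one hedged way to check whether a trained model has collapsed to such a no-action policy is to tally the actions its greedy policy takes (a minimal sketch, assuming a saved Stable-Baselines DQN model and a gym-style environment; the load path is a hypothetical placeholder):

from stable_baselines import DQN

def count_actions(model, env, n_episodes=10):
    """Tally which discrete actions the greedy (deterministic) policy takes."""
    counts = {}
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            counts[int(action)] = counts.get(int(action), 0) + 1
            obs, _, done, _ = env.step(action)
    return counts

# model = DQN.load("previous_model/some_saved_model.pkl")  # hypothetical path
# print(count_actions(model, env))  # a single key 0 would indicate a pure "no action" policy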

@qhuang-pnl
Collaborator

Sorry, I cannot open the figures in your Dropbox; you probably did not make them publicly accessible. If possible, please post them directly here or send them to my email: qiuhua dot huang at pnnl dot gov.

Is the result based on only one random seed? You may also try different random seeds; it can make a huge difference.

@qhuang-pnl
Collaborator

qhuang-pnl commented Jun 15, 2020

I went through your code and results; the input and configuration files (*.raw and *.json) and the NN structure are different from our original testing code:
https://github.com/RLGC-Project/RLGC/blob/master/src/py/trainKundur2areaGenBrakingAgent.py

I would suggest changing them to match our original testing code, because we don't know the performance for other combinations/settings.

And at least 3 random seeds should be tried.

@frostyduck
Author

Thank you! I initially tried the code with your original settings (raw and json files, NN structure) and only then began changing them. However, I will carefully rerun the training with the original settings.

And at least 3 random seeds should be tried.

Do you mean trying different values of np.random.seed()?

@qhuang-pnl
Collaborator

Set the 'seed' parameter of the DQN class; see https://stable-baselines.readthedocs.io/en/master/modules/dqn.html
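
For illustration, a minimal sketch of running the same training with several seeds (assuming Stable-Baselines 2.x on TF1; gym's CartPole-v0 stands in for the RLGC Kundur environment, which additionally needs the Java simulation server running):

import gym
import tensorflow as tf
from stable_baselines import DQN
from stable_baselines.deepq.policies import MlpPolicy

for seed in (1, 2, 3):                # at least 3 random seeds
    tf.reset_default_graph()          # clear the TF1 graph between runs
    env = gym.make('CartPole-v0')     # placeholder for the RLGC environment
    model = DQN(MlpPolicy, env, learning_rate=1e-4, verbose=0, seed=seed)
    model.learn(total_timesteps=10000)
    model.save('dqn_seed_%d.pkl' % seed)
    env.close()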

@frostyduck
Author

I have repeated the Kundur system training using the original settings, with Stable-Baselines (DQN agent) instead of OpenAI Baselines.

import tensorflow as tf
from stable_baselines import DQN
from stable_baselines.deepq.policies import FeedForwardPolicy

case_files_array.append(folder_dir + '/testData/Kundur-2area/kunder_2area_ver30.raw')
case_files_array.append(folder_dir + '/testData/Kundur-2area/kunder_2area.dyr')
dyn_config_file = folder_dir + '/testData/Kundur-2area/json/kundur2area_dyn_config.json'
rl_config_file = folder_dir + '/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json'

class CustomDQNPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomDQNPolicy, self).__init__(*args, **kwargs,
                                              layers=[128, 128],
                                              layer_norm=False,
                                              feature_extraction="mlp")

def main(learning_rate, env):
    tf.reset_default_graph()   # clear any previous TF1 graph
    graph = tf.get_default_graph()
    model = DQN(CustomDQNPolicy, env, learning_rate=learning_rate, verbose=0, seed=5)
    callback = SaveOnBestTrainingRewardCallback(check_freq=1000, storedData=storedData)
    time_steps = 900000
    model.learn(total_timesteps=int(time_steps), callback=callback)

However, I got the same result. For some reason, the DQN agent cannot get past the mean-reward mark of ~603.

https://photos.app.goo.gl/SSJyQQsA3vDhz1nt7

I then decided to run your full original testing code with the OpenAI Baselines DQN model. However, I got the same "~603 problem" policy.

Case id: 0, Fault bus id: Bus3, fault start time: 1,000000, fault duration: 0,585000

--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 3.27e+03 |
| mean 100 episode reward | -940     |
| steps                   | 9e+05    |
--------------------------------------

Restored model with mean reward: -602.8
Saving final model to: ./previous_model/kundur2area_multistep_581to585_bus2_90w_lr_0.0001_90w.pkl
total running time is -99249.84962964058
Java server terminated with PID: 12763
Finished!!

@frostyduck
Author

Sorry for the slow response. Do you still need help on this issue?

Yes, I still need your help on this issue.

RLGC-Project deleted a comment from thuang Aug 3, 2020
@qhuang-pnl
Collaborator

qhuang-pnl commented Aug 3, 2020

Hi,

I believe we did not correctly commit one RL training configuration file. Please use this updated one: https://github.com/RLGC-Project/RLGC/blob/master/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json

The settings correspond to those in our paper.

@frostyduck
Author

Hi,

I believe we did not correctly commit one RL training configuration file. Please use this updated one: https://github.com/RLGC-Project/RLGC/blob/master/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json

The settings correspond to those in our paper.

I tried training the DQN agent with these settings; however, I again ran into the "~603 problem" policy.

Case id: 0, Fault bus id: Bus3, fault start time: 1,000000, fault duration: 0,583000
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 3.36e+03 |
| mean 100 episode reward | -709     |
| steps                   | 9e+05    |
--------------------------------------
Restored model with mean reward: -603.0
Saving final model to: ./previous_model/kundur2area_multistep_581to585_bus2_90w_lr_0.0001_90w.pkl
total running time is 15627.888870954514

I think that, with your simulation settings, the duration of the short circuits is not long enough to cause loss of stability (penalty = -1000). Therefore, the agent chooses a no-action policy, which is probably consistent with this "~603 problem". Perhaps, in this case, the RL agent has no incentive to find a better policy that reduces the negative rewards.

RL4Grid added the bug label Dec 24, 2020
@frostyduck
Author

@qhuang-pnl, I have probably found the bug that keeps the agent from getting past the reward boundary of -602 during training. It turns out that, during both training and testing, short circuits are not actually simulated in the environment (Kundur's scheme); I checked this. In other words, the agent learns purely on the normal operating conditions of the system, and in that case the optimal policy is to never apply the dynamic brake, i.e. the action is always 0.

My guess is that this has something to do with the PowerDynSimEnvDef modifications: you originally used PowerDynSimEnvDef_v2, while I am working with PowerDynSimEnvDef_v7.
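
As a rough check of whether the fault ever appears in the data, here is a hedged sketch of a no-action rollout; it only assumes the standard gym reset/step interface, and constructing the RLGC environment itself (PowerDynSimEnvDef_v7) is left out as in the training script:

import numpy as np

def no_action_rollout(env, max_steps=200, no_op_action=0):
    """Roll out the no-op action and record per-step observations and rewards."""
    obs_log, reward_log = [], []
    obs = env.reset()
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(no_op_action)
        obs_log.append(np.asarray(obs, dtype=float).ravel())
        reward_log.append(reward)
        if done:
            break
    return np.array(obs_log), np.array(reward_log)

# If the fault is actually applied, the observations (e.g. bus voltages) should
# deviate noticeably during the fault-on window; an essentially flat trajectory
# would suggest the agent only ever sees normal operating conditions.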
