
DQN can't find a good policy #11

Open
frostyduck opened this issue Jun 11, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@frostyduck

frostyduck commented Jun 11, 2020

Following your advice, I switched from the OpenAI Baselines implementation to Stable-Baselines for the Kundur system training.

import tensorflow as tf
from stable_baselines import DQN

# CustomDQNPolicy, SaveOnBestTrainingRewardCallback, storedData, savedModel
# and model_name are defined elsewhere in my script.
def main(learning_rate, env):
    tf.reset_default_graph()   # clear any previous TF1 graph
    graph = tf.get_default_graph()

    model = DQN(CustomDQNPolicy, env, learning_rate=learning_rate, verbose=0)
    callback = SaveOnBestTrainingRewardCallback(check_freq=1000, storedData=storedData)
    time_steps = 900000
    model.learn(total_timesteps=int(time_steps), callback=callback)

    model_path = savedModel + "/" + model_name + "_lr_%s_90w.pkl" % str(learning_rate)
    print("Saving final model to: " + model_path)
    model.save(model_path)

However, after 900,000 training steps the DQN agent still cannot find a good policy. Please see the average-reward progress plot:

https://www.dropbox.com/preview/DQN_adaptivenose.png?role=personal

I used the following environment settings:

case_files_array.append(folder_dir +'/testData/Kundur-2area/kunder_2area_ver30.raw')
case_files_array.append(folder_dir+'/testData/Kundur-2area/kunder_2area.dyr')
dyn_config_file = folder_dir+'/testData/Kundur-2area/json/kundur2area_dyn_config.json'
rl_config_file = folder_dir+'/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json'

My guess is that in the baseline scenario kunder_2area_ver30.raw (without increased system loading), the short circuit might not lead to loss of stability during the simulation. Therefore, the DQN agent perhaps settles on a "no action" policy so as not to incur the actionPenalty = 2.0. According to the reward progress plot, during training the agent cannot find a policy better than a mean reward of 603.05, and when testing, mean_reward = 603.05 corresponds to the "no action" policy (please see the figure below):

https://www.dropbox.com/preview/no%20actions%20case.png?role=personal

However, this is only my guess; I may be wrong. I am thinking of trying scenarios with increased loading so that the simulation is certain to lose stability.

Originally posted by @frostyduck in #9 (comment)
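
As a side note, one hedged way to check whether a trained model has collapsed to such a no-action policy is to tally the actions its greedy policy takes (a minimal sketch, assuming a saved Stable-Baselines DQN model and a gym-style environment; the load path is a hypothetical placeholder):

from stable_baselines import DQN

def count_actions(model, env, n_episodes=10):
    """Tally which discrete actions the greedy (deterministic) policy takes."""
    counts = {}
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            counts[int(action)] = counts.get(int(action), 0) + 1
            obs, _, done, _ = env.step(action)
    return counts

# model = DQN.load("previous_model/some_saved_model.pkl")  # hypothetical path
# print(count_actions(model, env))  # a single key 0 would indicate a pure "no action" policy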

@qhuang-pnl
Collaborator

Sorry, I cannot open the figures in your Dropbox; you probably did not make them publicly accessible. If possible, please post them directly here or send them to my email: qiuhua dot huang at pnnl dot gov.

Is the result based on only one random seed? You may also try different random seeds; it can make a huge difference.

@qhuang-pnl
Collaborator

qhuang-pnl commented Jun 15, 2020

I went through your code and results; the input and configuration files (*.raw and *.json) and the NN structure are different from our original testing code:
https://github.com/RLGC-Project/RLGC/blob/master/src/py/trainKundur2areaGenBrakingAgent.py

I would suggest changing them to match our original testing code, because we don't know the performance for other combinations/settings.

And at least 3 random seeds should be tried.

@frostyduck
Author

Thank you! I initially tried the code with your original settings (raw and json files, NN structure) and only then began changing them. However, I will carefully rerun the training with the original settings.

And at least 3 random seeds should be tried.

Do you mean trying different values of np.random.seed()?

@qhuang-pnl
Collaborator

Set the 'seed' parameter of the DQN class; see https://stable-baselines.readthedocs.io/en/master/modules/dqn.html
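
For illustration, a minimal sketch of running the same training with several seeds (assuming Stable-Baselines 2.x on TF1; gym's CartPole-v0 stands in for the RLGC Kundur environment, which additionally needs the Java simulation server running):

import gym
import tensorflow as tf
from stable_baselines import DQN
from stable_baselines.deepq.policies import MlpPolicy

for seed in (1, 2, 3):                # at least 3 random seeds
    tf.reset_default_graph()          # clear the TF1 graph between runs
    env = gym.make('CartPole-v0')     # placeholder for the RLGC environment
    model = DQN(MlpPolicy, env, learning_rate=1e-4, verbose=0, seed=seed)
    model.learn(total_timesteps=10000)
    model.save('dqn_seed_%d.pkl' % seed)
    env.close()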

@frostyduck
Author

I have repeated the Kundur system training using the original settings, with Stable-Baselines (DQN agent) instead of OpenAI Baselines.

import tensorflow as tf
from stable_baselines import DQN
from stable_baselines.deepq.policies import FeedForwardPolicy

case_files_array.append(folder_dir + '/testData/Kundur-2area/kunder_2area_ver30.raw')
case_files_array.append(folder_dir + '/testData/Kundur-2area/kunder_2area.dyr')
dyn_config_file = folder_dir + '/testData/Kundur-2area/json/kundur2area_dyn_config.json'
rl_config_file = folder_dir + '/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json'

class CustomDQNPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomDQNPolicy, self).__init__(*args, **kwargs,
                                              layers=[128, 128],
                                              layer_norm=False,
                                              feature_extraction="mlp")

def main(learning_rate, env):
    tf.reset_default_graph()   # clear any previous TF1 graph
    graph = tf.get_default_graph()
    model = DQN(CustomDQNPolicy, env, learning_rate=learning_rate, verbose=0, seed=5)
    callback = SaveOnBestTrainingRewardCallback(check_freq=1000, storedData=storedData)
    time_steps = 900000
    model.learn(total_timesteps=int(time_steps), callback=callback)

However, I got the same result. For some reason, the DQN agent cannot get past the mean-reward mark of ~603.

https://photos.app.goo.gl/SSJyQQsA3vDhz1nt7

I then decided to run your full original testing code with the OpenAI Baselines DQN model. However, I got the same "~603 problem" policy.

Case id: 0, Fault bus id: Bus3, fault start time: 1,000000, fault duration: 0,585000

--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 3.27e+03 |
| mean 100 episode reward | -940     |
| steps                   | 9e+05    |
--------------------------------------

Restored model with mean reward: -602.8
Saving final model to: ./previous_model/kundur2area_multistep_581to585_bus2_90w_lr_0.0001_90w.pkl
total running time is -99249.84962964058
Java server terminated with PID: 12763
Finished!!

@frostyduck
Author

Sorry for the slow response. Do you still need help on this issue?

Yes, I still need your help on this issue.

RLGC-Project deleted a comment from thuang Aug 3, 2020
@qhuang-pnl
Collaborator

qhuang-pnl commented Aug 3, 2020

Hi,

I believe we did not correctly commit one RL training configuration file. Please use this updated one: https://github.com/RLGC-Project/RLGC/blob/master/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json

The settings correspond to those in our paper.

@frostyduck
Author

Hi,

I believe we did not correctly commit one RL training configuration file. Please use this updated one: https://github.com/RLGC-Project/RLGC/blob/master/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json

The settings correspond to those in our paper.

I tried training the DQN agent with these settings; however, I again ran into the "~603 problem" policy.

Case id: 0, Fault bus id: Bus3, fault start time: 1,000000, fault duration: 0,583000
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 3.36e+03 |
| mean 100 episode reward | -709     |
| steps                   | 9e+05    |
--------------------------------------
Restored model with mean reward: -603.0
Saving final model to: ./previous_model/kundur2area_multistep_581to585_bus2_90w_lr_0.0001_90w.pkl
total running time is 15627.888870954514

I think that, with your simulation settings, the duration of the short circuits is not long enough to cause loss of stability (penalty = -1000). Therefore, the agent chooses a no-action policy, which is probably consistent with this "~603 problem". Perhaps, in this case, the RL agent has no incentive to find a better policy that reduces the negative rewards.

RL4Grid added the bug label Dec 24, 2020
@frostyduck
Author

@qhuang-pnl, I have probably found the bug that keeps the agent from getting past the reward boundary of -602 during training. It turns out that, during both training and testing, short circuits are not actually simulated in the environment (Kundur's scheme); I checked this. In other words, the agent learns purely on the normal operating conditions of the system, and in that case the optimal policy is to never apply the dynamic brake, i.e. the action is always 0.

My guess is that this has something to do with the PowerDynSimEnvDef modifications: you originally used PowerDynSimEnvDef_v2, while I am working with PowerDynSimEnvDef_v7.
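
As a rough check of whether the fault ever appears in the data, here is a hedged sketch of a no-action rollout; it only assumes the standard gym reset/step interface, and constructing the RLGC environment itself (PowerDynSimEnvDef_v7) is left out as in the training script:

import numpy as np

def no_action_rollout(env, max_steps=200, no_op_action=0):
    """Roll out the no-op action and record per-step observations and rewards."""
    obs_log, reward_log = [], []
    obs = env.reset()
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(no_op_action)
        obs_log.append(np.asarray(obs, dtype=float).ravel())
        reward_log.append(reward)
        if done:
            break
    return np.array(obs_log), np.array(reward_log)

# If the fault is actually applied, the observations (e.g. bus voltages) should
# deviate noticeably during the fault-on window; an essentially flat trajectory
# would suggest the agent only ever sees normal operating conditions.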
