
AngelJavierSalazar/dynamic_programming



dynamic_programming

Note: Cloned from stefan-jansen/machine-learning-for-trading

Notebook to formulate an RL problem, applying model-based and model-free methods.

It applies value iteration and policy iteration to a toy environment, a 3 x 4 grid depicted in the diagram below, with the following features:

States: 11 states represented as two-dimensional coordinates. One field is not accessible, and the top two states in the rightmost column are terminal, that is, they end the episode.
Actions: One movement per step: up, down, left, or right. The environment is stochastic, so actions can have unintended outcomes. Each action moves the agent in the intended direction with 80% probability and in each of the two adjacent directions with 10% probability (for example, left or right instead of up, or up or down instead of right).
Rewards: As depicted in the diagram, each state yields a reward of -0.02, except for the terminal states, which yield +1 and -1:

[Figure: GridWorld diagram]
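The state and action descriptions above can be sketched as a small transition model. This is a minimal illustration, not the notebook's code: the position of the inaccessible field (`BLOCKED`) and the rule that the agent stays in place when it runs into a wall or the blocked field are assumptions, and all names are hypothetical.

```python
from collections import defaultdict

N_ROWS, N_COLS = 3, 4
BLOCKED = (1, 1)  # ASSUMPTION: position of the inaccessible field (row, col)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
# The two directions adjacent (perpendicular) to each intended move
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

# All accessible fields: 3 * 4 cells minus the blocked one
STATES = [(r, c) for r in range(N_ROWS) for c in range(N_COLS)
          if (r, c) != BLOCKED]


def step(state, direction):
    """Deterministic move. ASSUMPTION: the agent stays put when the move
    would leave the grid or enter the blocked field."""
    r = state[0] + MOVES[direction][0]
    c = state[1] + MOVES[direction][1]
    if not (0 <= r < N_ROWS and 0 <= c < N_COLS) or (r, c) == BLOCKED:
        return state
    return (r, c)


def transition_probs(state, action):
    """Map each reachable next state to its probability:
    80% intended direction, 10% for each adjacent direction."""
    probs = defaultdict(float)
    probs[step(state, action)] += 0.8
    for side in SIDEWAYS[action]:
        probs[step(state, side)] += 0.1
    return dict(probs)


print(len(STATES))
print(transition_probs((2, 0), "up"))
```

Probabilities for outcomes that lead to the same field are added together, so every distribution sums to one even next to walls.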


The right panel of the preceding GridWorld diagram shows the optimal value estimate that's produced by Value Iteration and the corresponding greedy policy. The negative rewards, combined with the uncertainty in the environment, produce an optimal policy that involves moving away from the negative terminal state.
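To make the link between value estimates and the greedy policy concrete, the following is a minimal value iteration sketch for this grid. It is not the notebook's implementation. Several details are assumptions: the blocked field at `(1, 1)`, the +1 terminal state above the -1 terminal state, a discount factor of 0.99, rewards collected on entering a state, and in-place updates within each sweep.

```python
import numpy as np

N_ROWS, N_COLS = 3, 4
BLOCKED = (1, 1)                         # ASSUMPTION: inaccessible field
TERMINAL = {(0, 3): 1.0, (1, 3): -1.0}   # ASSUMPTION: +1 sits above -1
STEP_REWARD = -0.02
GAMMA = 0.99                             # ASSUMPTION: discount factor
ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left
ARROWS = "^>v<"                          # one symbol per action index


def move(state, a):
    """Move in direction a; stay put at walls and the blocked field."""
    r, c = state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1]
    inside = 0 <= r < N_ROWS and 0 <= c < N_COLS
    return (r, c) if inside and (r, c) != BLOCKED else state


def q_value(V, state, a):
    """Expected return of action a: 80% intended move, 10% for each
    neighbouring direction (a - 1 and a + 1 in clockwise order)."""
    total = 0.0
    for prob, actual in ((0.8, a), (0.1, (a - 1) % 4), (0.1, (a + 1) % 4)):
        nxt = move(state, actual)
        reward = TERMINAL.get(nxt, STEP_REWARD)  # reward on entering nxt
        total += prob * (reward + GAMMA * V[nxt])
    return total


def value_iteration(tol=1e-8):
    V = np.zeros((N_ROWS, N_COLS))  # terminal values stay at zero
    states = [(r, c) for r in range(N_ROWS) for c in range(N_COLS)
              if (r, c) != BLOCKED and (r, c) not in TERMINAL]
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(V, s, a) for a in range(4))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            break
    # Greedy policy: pick the action with the highest expected return
    policy = {s: ARROWS[int(np.argmax([q_value(V, s, a) for a in range(4)]))]
              for s in states}
    return V, policy


V, policy = value_iteration()
print(np.round(V, 3))
print(policy)
```

Under these assumptions, the printed policy can be compared against the arrows in the diagram, in particular for the fields next to the negative terminal state.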

The results are sensitive to both the rewards and the discount factor. The cost of the negative state affects the policy in the surrounding fields; try modifying the example in the corresponding notebook to identify the threshold levels at which the optimal action changes.
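One way to explore this sensitivity is to re-solve the grid for several penalties and watch the greedy action in a field next to the negative terminal state. The sketch below is a hypothetical, compact stand-in for the notebook's code. It assumes the blocked field at `(1, 1)`, the +1 state at `(0, 3)`, the -1 state at `(1, 3)`, a default discount factor of 0.99, and a fixed number of sweeps instead of a convergence test.

```python
import numpy as np

ROWS, COLS = 3, 4
BLOCKED, GOAL, TRAP = (1, 1), (0, 3), (1, 3)   # ASSUMPTION: grid layout
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]     # up, right, down, left
ARROWS = "^>v<"


def move(state, a):
    """Move in direction a; stay put at walls and the blocked field."""
    r, c = state[0] + MOVES[a][0], state[1] + MOVES[a][1]
    ok = 0 <= r < ROWS and 0 <= c < COLS and (r, c) != BLOCKED
    return (r, c) if ok else state


def greedy_action(state, trap_reward, gamma=0.99, step_reward=-0.02,
                  sweeps=500):
    """Solve the grid for the given penalty and return the greedy action
    symbol for `state`. Uses a fixed number of in-place sweeps."""
    rewards = {GOAL: 1.0, TRAP: trap_reward}
    states = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) != BLOCKED and (r, c) not in rewards]
    V = np.zeros((ROWS, COLS))

    def q(s, a):
        # 80% intended move, 10% for each neighbouring direction
        outcomes = ((0.8, move(s, a)),
                    (0.1, move(s, (a - 1) % 4)),
                    (0.1, move(s, (a + 1) % 4)))
        return sum(p * (rewards.get(n, step_reward) + gamma * V[n])
                   for p, n in outcomes)

    for _ in range(sweeps):
        for s in states:
            V[s] = max(q(s, a) for a in range(4))
    return ARROWS[int(np.argmax([q(state, a) for a in range(4)]))]


# Greedy action in the field directly left of the negative terminal state
for trap_reward in (-0.1, -1.0, -10.0):
    print(trap_reward, greedy_action((1, 2), trap_reward))
```

Narrowing the range of `trap_reward` values around a point where the printed action changes gives an estimate of the threshold. The same loop can vary `gamma` or `step_reward` instead.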
