## Value Iteration, Bellman Optimization, and Policy Iteration in R
**Author:** Lauren Washington

## Environment Exploration

In [7]:
#start server in terminal with $python gym_http_server.py
#install.packages("gym")
library(gym)
#install.packages("MDPtoolbox")
library(MDPtoolbox)

In [8]:
#remote_base is the server address, e.g. "http://127.0.0.1:5000"
rl_env_setup_func <- function(remote_base, env_name) {
    #start the server in a terminal first with $python gym_http_server.py
    #results are assigned with <<- so that later cells can reach them as globals
    client <<- create_GymClient(remote_base)
    #create an environment instance
    instance_id <<- env_create(client, env_name)
    #list the environments running on the server
    list_envs <<- env_list_all(client)
    #set up a random agent from the size of the discrete action space
    action_space_info <- env_action_space_info(client, instance_id)
    agent <<- random_discrete_agent(action_space_info[["n"]])
}

## Model Exploration

### Value Iteration


mdp_value_iteration(P, R, discount, epsilon, max_iter, V0)  
**Description**  
Solves a discounted MDP with the value iteration algorithm.  
**Arguments**  
**P** transition probability array. P can be a 3-dimensional array [S,S,A] or a list [[A]], each element containing a sparse matrix [S,S].  
**R** reward array. R can be a 3-dimensional array [S,S,A] or a list [[A]], each element containing a sparse matrix [S,S], or a 2-dimensional matrix [S,A], possibly sparse.  
**discount** discount factor. discount is a real number in [0, 1). If discount equals 1, a warning reminds you to check the conditions of convergence.  
**epsilon** (optional) search for an epsilon-optimal policy. epsilon is a real number in (0, 1]. By default, epsilon = 0.01.  
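**max_iter** (optional) maximum number of iterations to be done. max_iter is an integer greater than 0.  
**V0** (optional) starting value function. V0 is a vector of length S.  
**Details**  
**Value**  
**V** value function. V is a vector of length S.  
**policy** policy is a vector of length S. Each element is an integer corresponding to an action.  
**iter** number of iterations.  
**cpu_time** CPU time used to run the program.  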
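
To make the algorithm behind this function concrete, here is a minimal base-R sketch of value iteration on a two-state toy MDP. It assumes the same P [S,S,A] and R [S,A] layout described above. The helper `value_iteration_sketch`, the toy arrays, and the simple `tol` stopping rule are illustrative assumptions, not MDPtoolbox's own implementation or its epsilon-based stopping criterion.

```r
# Hypothetical helper (not part of MDPtoolbox): plain value iteration.
value_iteration_sketch <- function(P, R, discount, tol = 1e-8, max_iter = 1000) {
  S <- dim(P)[1]
  A <- dim(P)[3]
  V <- numeric(S)
  Q <- matrix(0, S, A)
  for (iter in seq_len(max_iter)) {
    # Q[s, a] = R[s, a] + discount * sum over s' of P[s, s', a] * V[s']
    for (a in seq_len(A)) Q[, a] <- R[, a] + discount * (P[, , a] %*% V)
    V_new <- apply(Q, 1, max)
    delta <- max(abs(V_new - V))
    V <- V_new
    # Assumed stopping rule: largest change in V falls below tol.
    if (delta < tol) break
  }
  list(V = V, policy = apply(Q, 1, which.max), iter = iter)
}

# Toy MDP (made up for illustration): action 1 stays put, action 2 switches state.
toy_P <- array(0, dim = c(2, 2, 2))
toy_P[, , 1] <- diag(2)
toy_P[, , 2] <- matrix(c(0, 1, 1, 0), nrow = 2)
# Only "stay in state 2" pays a reward of 1.
toy_R <- matrix(c(0, 1, 0, 0), nrow = 2)

vi_out <- value_iteration_sketch(toy_P, toy_R, discount = 0.9)
print(vi_out$V)
print(vi_out$policy)
```

With a discount of 0.9, staying in state 2 forever is worth 1 / (1 - 0.9) = 10, and switching from state 1 is worth 0.9 × 10 = 9, so the sketch should approach those values.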

### Bellman Optimization

mdp_bellman_operator(P, PR, discount, Vprev)  
**Arguments**  
**P** transition probability array. P can be a 3-dimensional array [S,S,A] or a list [[A]], each element containing a sparse matrix [S,S].  
**PR** reward array. PR can be a 2-dimensional array [S,A], possibly sparse.  
**discount** discount factor. discount is a real number in (0, 1].  
**Vprev** value function. Vprev is a vector of length S.  
**Details**  
**Value**  
**V** new value function. V is a vector of length S.  
**policy** policy is a vector of length S. Each element is an integer corresponding to an action.  
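
Written out, the operator described by these arguments is one Bellman backup. The formula below is a sketch in the argument names used above; the symbols $s$, $s'$ and $a$ for states and actions are notation introduced here, not taken from the package manual.

$$
V(s) = \max_{a} \Big[ PR(s,a) + \text{discount} \sum_{s'} P(s,s',a)\, V_{\text{prev}}(s') \Big],
\qquad
\text{policy}(s) = \arg\max_{a} \Big[ PR(s,a) + \text{discount} \sum_{s'} P(s,s',a)\, V_{\text{prev}}(s') \Big]
$$

Value iteration amounts to applying this backup repeatedly until V stops changing by more than a tolerance.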

### Policy Iteration

mdp_policy_iteration(P, R, discount, policy0, max_iter, eval_type)  
**Arguments**  
**P** transition probability array. P can be a 3-dimensional array [S,S,A] or a list [[A]], each element containing a sparse matrix [S,S].  
**R** reward array. R can be a 3-dimensional array [S,S,A] or a list [[A]], each element containing a sparse matrix [S,S], or a 2-dimensional matrix [S,A], possibly sparse.  
**discount** discount factor. discount is a real number in (0, 1).  
**policy0** (optional) starting policy. policy0 is a vector of length S. By default, policy0 is the policy which maximizes the expected immediate reward.  
**max_iter** (optional) maximum number of iterations to be done. max_iter is an integer greater than 0. By default, max_iter is set to 1000.  
**eval_type** (optional) function used to evaluate a policy. If eval_type is 0, mdp_eval_policy_matrix is used; in all other cases mdp_eval_policy_iterative is used. By default, eval_type is set to 0.  
**Details**  
**Value**  
**V** optimal value function. V is a vector of length S.  
**policy** optimal policy. policy is a vector of length S. Each element is an integer corresponding to an action which maximizes the value function.  
**iter** number of iterations.  
**cpu_time** CPU time used to run the program.  
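
The two phases of policy iteration (evaluate the current policy, then improve it greedily) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `policy_iteration_sketch` and the toy arrays are hypothetical, and the evaluation step solves the linear system (I - discount × P_policy) V = R_policy directly, which is the general idea behind a matrix-based evaluation.

```r
# Hypothetical helper (not part of MDPtoolbox): policy iteration with exact evaluation.
policy_iteration_sketch <- function(P, R, discount, max_iter = 1000) {
  S <- dim(P)[1]
  A <- dim(P)[3]
  # Start from the policy that maximizes the immediate reward.
  policy <- apply(R, 1, which.max)
  Q <- matrix(0, S, A)
  for (iter in seq_len(max_iter)) {
    # Policy evaluation: transition rows and rewards under the current policy.
    P_pol <- t(sapply(seq_len(S), function(s) P[s, , policy[s]]))
    R_pol <- R[cbind(seq_len(S), policy)]
    V <- solve(diag(S) - discount * P_pol, R_pol)
    # Policy improvement: act greedily with respect to V.
    for (a in seq_len(A)) Q[, a] <- R[, a] + discount * (P[, , a] %*% V)
    new_policy <- apply(Q, 1, which.max)
    if (all(new_policy == policy)) break
    policy <- new_policy
  }
  list(V = V, policy = policy, iter = iter)
}

# Toy MDP (made up for illustration): action 1 stays put, action 2 switches state.
toy_P <- array(0, dim = c(2, 2, 2))
toy_P[, , 1] <- diag(2)
toy_P[, , 2] <- matrix(c(0, 1, 1, 0), nrow = 2)
# Only "stay in state 2" pays a reward of 1.
toy_R <- matrix(c(0, 1, 0, 0), nrow = 2)

pi_out <- policy_iteration_sketch(toy_P, toy_R, discount = 0.9)
print(pi_out$V)
print(pi_out$policy)
```

Because each evaluation is exact, the loop stops as soon as the greedy policy repeats, which usually takes far fewer sweeps than value iteration needs.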

In [9]:
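#define value iteration function
value_iteration_func <- function(S, p) {
    #NOTE: arguments are matched by position, and mdp_example_forest appears to be
    #declared as (S, r1, r2, p), so the value passed as p here fills r1 instead;
    #the output recorded at the end of the notebook comes from this call as written
    param <- mdp_example_forest(S, p)
    #store the transition array and reward matrix as globals for the later cells
    P <<- param$P
    R <<- param$R
    #run value iteration with a discount factor of 0.9
    vi <- mdp_value_iteration(P, R, 0.9)
    V <<- vi$V
    print(vi$V)
    print(vi$policy)
}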


In [10]:
#define Bellman optimality function
bellman_func <- function(Vprev) {
    #apply one Bellman backup to the value function passed in as Vprev
    bellman_e <- mdp_bellman_operator(P, R, 0.9, Vprev = Vprev)
    #keep the greedy policy as a global to seed policy iteration
    bellman_policy <<- bellman_e$policy

    print(bellman_e$V)
    print(bellman_e$policy)
}

In [11]:
#define policy iteration function
policy_iteration_func <- function(policy0) {
    #named pi_result to avoid masking R's built-in constant pi
    pi_result <- mdp_policy_iteration(P, R, 0.9, policy0 = policy0)
    V <<- pi_result$V
    policy <<- pi_result$policy
    print(V)
    print(policy)
}

In [13]:
#run env_setup_func
rl_env_setup_func("http://127.0.0.1:5000", "Pendulum-v0")

#run value iteration, feed its value function into the Bellman function,
#then feed the resulting policy into the policy iteration function
#list() evaluates its arguments in order, so V and bellman_policy exist when they are needed
rl_models <- list(
    value_iteration = value_iteration_func(S = 3, p = 0.1),
    bellman_equation = bellman_func(V),
    policy_iteration = policy_iteration_func(bellman_policy)
)

[1] "MDP Toolbox: iterations stopped, epsilon-optimal policy found"
[1] 5.148382 5.804728 6.616064
[1] 1 1 2
[1] 5.165184 5.822367 6.633544
[1] 1 1 2
[1] 5.320952 5.977860 6.788857
[1] 1 1 2
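
As a sanity check on this output, the second pair of printed lines (the Bellman step) can be reproduced by hand from the first printed value vector. The sketch below rebuilds the three-state forest example in base R under my reading of its structure: waiting returns to state 1 with probability p and otherwise ages the forest, cutting always returns to state 1, and the rewards are r1 for waiting in the oldest state and 0, 1, r2 for cutting. The values r1 = 0.1, r2 = 2 and p = 0.1 are assumptions chosen because they are consistent with the printed numbers; the names `P_hand`, `R_hand` and `V_vi` are hypothetical.

```r
# Forest example for S = 3 rebuilt by hand (assumed: r1 = 0.1, r2 = 2, p = 0.1).
p <- 0.1
r1 <- 0.1
r2 <- 2
P_hand <- array(0, dim = c(3, 3, 2))
# Action 1 (wait): back to state 1 with probability p, otherwise one state older.
P_hand[, , 1] <- matrix(c(p, 1 - p, 0,
                          p, 0,     1 - p,
                          p, 0,     1 - p), nrow = 3, byrow = TRUE)
# Action 2 (cut): always return to state 1.
P_hand[, , 2] <- matrix(c(1, 0, 0,
                          1, 0, 0,
                          1, 0, 0), nrow = 3, byrow = TRUE)
# R_hand[s, a]: waiting pays r1 only in the oldest state; cutting pays 0, 1, r2.
R_hand <- cbind(c(0, 0, r1), c(0, 1, r2))

# First printed vector (value iteration output), used as Vprev.
V_vi <- c(5.148382, 5.804728, 6.616064)

# One Bellman backup: Q[s, a] = R[s, a] + 0.9 * sum over s' of P[s, s', a] * V_vi[s'].
Q <- sapply(1:2, function(a) R_hand[, a] + 0.9 * as.vector(P_hand[, , a] %*% V_vi))
V_backup <- apply(Q, 1, max)
policy_backup <- apply(Q, 1, which.max)

print(V_backup)
print(policy_backup)
```

The backup should agree with the second printed vector to about five decimal places (the inputs are themselves rounded) and give the same policy. For state 3, for example, cutting is worth 2 + 0.9 × 5.148382 ≈ 6.633544.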
