# AA228/CS238 Optional Final Project: Escape Roomba

This notebook tests the QMDP + Monte Carlo Tree Search implementation.

In [1]:
# activate project environment
# include these lines of code in any future scripts/notebooks
#---
import Pkg
if !haskey(Pkg.installed(), "AA228FinalProject")
    jenv = joinpath(dirname(@__FILE__()), ".") # this assumes the notebook is in the same dir
    # as the Project.toml file, which should be in top level dir of the project. 
    # Change accordingly if this is not the case.
    Pkg.activate(jenv)
end
#---

"/home/colasg/Documents/AA228FinalProject/Project.toml"

In [3]:
# import necessary packages
using AA228FinalProject
using POMDPs
using POMDPPolicies
using BeliefUpdaters
using ParticleFilters
using POMDPSimulators
using BasicPOMCP
using Cairo
using Gtk
using Random
using Statistics # to evaluate policy
using Printf
using JLD # to save alpha vectors

## Define the POMDP

### Create state space, action space, sensor and construct POMDP

The QMDP offline method computes one alpha vector $\alpha_a$ per action $a \in \mathcal{A}$, with components $\alpha_a(s)$ for each state $s \in \mathcal{S}$.
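
For reference, a sketch of the standard QMDP recursion follows. The symbols $R(s,a)$ (reward), $T(s' \mid s, a)$ (transition probability), and $\gamma$ (discount factor) are notation assumed here; the notebook itself only loads precomputed alpha vectors.

$$
\alpha_a^{(k+1)}(s) = R(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \max_{a' \in \mathcal{A}} \alpha_{a'}^{(k)}(s')
$$

Given a belief $b$, the QMDP policy then selects $\arg\max_{a \in \mathcal{A}} \sum_{s \in \mathcal{S}} \alpha_a(s)\, b(s)$. Both the sum over $s'$ and the maximum over $a'$ range over finite sets.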

This method only works with finite state and action spaces, so we first define the discretization.

Then we instantiate a bump sensor (```Bumper```), which indicates when contact has been made between any part of the Roomba and any wall.

Next, we instantiate the MDP, which defines the underlying simulation environment, assuming full observability. The MDP takes many arguments to specify details of the problem. One argument we must specify here is ```config```, which takes the value 1, 2, or 3 and selects the room configuration. Each configuration corresponds to a different location for the goal and stairs.

Finally, we instantiate the POMDP. The POMDP takes as an argument the underlying MDP as well as the sensor, which it uses to define the observation model. 

In [4]:
# discrete state space
num_x_pts = 50
num_y_pts = 50
num_th_pts = 20
sspace = DiscreteRoombaStateSpace(num_x_pts,num_y_pts,num_th_pts);

In [5]:
# discrete action space
vlist = [0, 5, 10]
omlist = [-1, 0, 1]
aspace = vec(collect(RoombaAct(v, om) for v in vlist, om in omlist));

In [6]:
sensor = Bumper()
config = 1 # 1,2, or 3
m = RoombaPOMDP(sensor=sensor, mdp=RoombaMDP(config=config, sspace=sspace, aspace=aspace));
println("Number of discrete states:", n_states(m))
println("Number of discrete actions:", n_actions(m))

Number of discrete states:150000
Number of discrete actions:9


### Setting up a Particle Filter

Because the state space is large (150,000 discrete states), we track the belief with a particle filter.

First, we instantiate a resampler, which is responsible for updating the belief state given an observation. The first argument of each resampler (bumper or lidar) is the number of particles that represent the belief state. The lidar resampler, which is not used in this notebook, takes a low-variance resampler as an additional argument; that resampler efficiently resamples a weighted set of particles.

Next, we instantiate a ```SimpleParticleFilter```, which enables us to perform our belief updates.

Finally, we pass this particle filter into a custom struct called a ```RoombaParticleFilter```, which takes two additional arguments. These arguments specify the noise in the velocity and turn-rate, used when propagating particles according to the action taken. These can be tuned depending on the type of sensor used.
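
To make the propagate, weight, and resample cycle concrete, here is a minimal sketch of one bootstrap particle filter step on a toy 1-D problem. It is not the implementation used by ```SimpleParticleFilter``` or ```RoombaParticleFilter```; the function `pf_step`, the wall at `x = 10`, and the likelihood `bump_likelihood` are all hypothetical names chosen for illustration.

```julia
using Random

# One step of a toy 1-D bootstrap particle filter (illustrative only).
# particles  : vector of positions
# v, dt      : commanded velocity and time step
# v_noise    : standard deviation of the noise added to the velocity
# likelihood : function (o, x) -> probability of observing o in state x
function pf_step(rng, particles, v, dt, v_noise, likelihood, o)
    # 1. propagate each particle with a noisy copy of the action
    predicted = [x + (v + v_noise * randn(rng)) * dt for x in particles]
    # 2. weight each particle by the observation likelihood
    w = [likelihood(o, x) for x in predicted]
    total = sum(w)
    total == 0 && return predicted   # no particle explains o; keep the prediction
    # 3. low-variance (systematic) resampling of the weighted set
    n = length(predicted)
    step = total / n
    u = rand(rng) * step
    out = similar(predicted)
    c = w[1]
    i = 1
    for j in 1:n
        while u > c && i < n
            i += 1
            c += w[i]
        end
        out[j] = predicted[i]
        u += step
    end
    return out
end

# Hypothetical bump sensor: reports contact once the robot reaches a wall at x = 10.
bump_likelihood(o, x) = ((x >= 10.0) == o) ? 1.0 : 0.0

rng = MersenneTwister(1)
belief = 10 .* rand(rng, 1000)   # unlocalized belief over [0, 10)
updated = pf_step(rng, belief, 1.0, 1.0, 0.2, bump_likelihood, false)
println("largest surviving position: ", maximum(updated))
```

After observing no bump, only particles that stayed short of the wall survive resampling, which is how a binary sensor can still shrink the belief over time.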

In [7]:
num_particles = 5000
resampler = BumperResampler(num_particles)

spf = SimpleParticleFilter(m, resampler)

v_noise_coefficient = 2.0
om_noise_coefficient = 0.5

belief_updater = RoombaParticleFilter(spf, v_noise_coefficient, om_noise_coefficient);

## Solve the POMDP

### Load the QMDP alpha vectors

In [8]:
QMDP_alphas = load("QMDP_alphas.jld")["QMDP_alphas"];

### Define a policy: Monte Carlo Tree Search

First, we create a struct that subtypes the ```Policy``` abstract type defined in the package ```POMDPPolicies.jl```. The struct can also hold parameters of the policy; here it stores the QMDP alpha vectors.

Next, we define a function that takes our policy and returns the desired action. We do this by adding a new method to ```POMDPs.action``` that works with our policy. Because this heuristic is used as a rollout policy inside the tree search, its method takes a state rather than a belief.
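
The difference between choosing an action from a state and choosing one from a belief can be sketched with plain arrays. The alpha vectors below are made-up numbers for a toy problem with 2 actions and 3 states, and the helper names `best_action_for_state` and `best_action_for_belief` are hypothetical, not part of any package.

```julia
using LinearAlgebra

# Toy example: 2 actions, 3 states (values are made up for illustration).
alphas = [[1.0, 0.0, 2.0],   # alpha vector for action 1
          [0.5, 3.0, 1.0]]   # alpha vector for action 2

# State-based choice, as a rollout policy would do: compare alpha_a(s) at state index k.
best_action_for_state(alphas, k) = argmax([alpha[k] for alpha in alphas])

# Belief-based choice, as in the QMDP policy: compare the dot products of alpha_a and b.
best_action_for_belief(alphas, b) = argmax([dot(alpha, b) for alpha in alphas])

println(best_action_for_state(alphas, 2))                  # action index for state 2
println(best_action_for_belief(alphas, [0.5, 0.5, 0.0]))   # action index for a belief over states 1 and 2
```

The state-based version only needs a lookup per action, which keeps it cheap enough to call at every step of a rollout.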

In [10]:
# use QMDP alphas as a starting heuristic
struct QMDPHeuristic <: Policy
    alphas::Array{Array{Float64,1},1} # store the alpha vectors
end

# define what the policy actually does in POMDPs.action : note this is a function of state, not belief
function POMDPs.action(p::QMDPHeuristic, s::RoombaState) 
    k = stateindex(m, s)
    # find the action associated with the highest alpha vector value in state s
    idx = findmax([p.alphas[i][k] for i in 1:length(actions(m))])[2]
    a = actions(m)[idx]
    return a # this may need to be different for discrete actions?
end

# QMDP heuristic policy
p = QMDPHeuristic(QMDP_alphas)

# MC started with Random heuristic
#solver = POMCPSolver()
# MC started with QMDP heuristic
solver = POMCPSolver(estimate_value=FORollout(p))

# corresponding policy
planner = solve(solver, m);


### Simulation and rendering

Here, we will demonstrate how to seed the environment, run a simulation, and render the simulation. To render the simulation, we use the ```Gtk``` package. 

The simulation is carried out using the ```stepthrough``` function defined in the package ```POMDPSimulators.jl```. During a simulation, a window will open that renders the scene. It may be hidden behind other windows on your desktop.

In [17]:
# first seed the environment
Random.seed!(1)

# run the simulation
c = @GtkCanvas()
win = GtkWindow(c, "Roomba Environment", 600, 600)
for (t, step) in enumerate(stepthrough(m, planner, belief_updater, max_steps=100))
    @guarded draw(c) do widget
        
        # the following lines render the room, the particles, and the roomba
        ctx = getgc(c)
        set_source_rgb(ctx,1,1,1)
        paint(ctx)
        render(ctx, m, step)

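        # per-component standard deviation of the particles, scaled by fixed constants;
        # note that the printed label says "Variance" but the values are standard deviations
        std_pos = std(particles(step.b)) ./ [40., 20., 6., 1.]
        println("Time: ", t, " Variance:", std_pos, sum(std_pos))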
        
        # render some information that can help with debugging
        # here, we render the time-step, the state, the observation, and the action
        move_to(ctx,300,400)
        show_text(ctx, @sprintf("t=%d, s=%s, o=%.3f, a=%s",t,string(step.s),step.o, string(step.a)))
    end
    show(c)
    sleep(0.01) # to slow down the simulation
end

Time: 2 Variance:[0.272878, 0.273344, 0.307988, 0.0]0.854209718366848
Time: 3 Variance:[0.272878, 0.273344, 0.30797, 0.0]0.8541914089354211
Time: 4 Variance:[0.272878, 0.273344, 0.305867, 0.0]0.8520879728224384
Time: 5 Variance:[0.272878, 0.273344, 0.305793, 0.0]0.8520141901536573
Time: 6 Variance:[0.272878, 0.273344, 0.305753, 0.0]0.8519739540348571
Time: 7 Variance:[0.272878, 0.273344, 0.302867, 0.0]0.8490885058943773
Time: 8 Variance:[0.272878, 0.273344, 0.300476, 0.0]0.8466974867544068
Time: 9 Variance:[0.272878, 0.273344, 0.300356, 0.0]0.8465778761423972
Time: 10 Variance:[0.272878, 0.273344, 0.301411, 0.0]0.8476320859549853
Time: 11 Variance:[0.271107, 0.269059, 0.304815, 0.0]0.8449808994758743
Time: 12 Variance:[0.271107, 0.269059, 0.304839, 0.0]0.845004841052792
Time: 13 Variance:[0.268354, 0.261814, 0.309464, 0.0]0.8396317886719785
Time: 14 Variance:[0.306998, 0.330953, 0.304633, 0.231571]1.1741552386257552
Time: 15 Variance:[0.284478, 0.341959, 0.316308, 0.104708]1.0474528371

### Specifying initial states and beliefs
If, for debugging purposes, you would like to hard-code a starting location or initial belief state for the Roomba, you can do so by taking the following steps.

First, we define the initial state using the following line of code:
```
is = RoombaState(x,y,th,0.)
```
Here, ```x``` and ```y``` are the coordinates of the starting location and ```th``` is the heading in radians. The last entry, ```0.```, represents whether the state is terminal and should remain unchanged.

If you would like to initialize the Roomba's belief as perfectly localized, you can do so with the following line of code:
```
b0 = Deterministic(is)
```
If you would like to initialize the standard unlocalized belief, use these lines:
```
dist = initialstate_distribution(m)
b0 = initialize_belief(belief_updater, dist)
```
Finally, we call the ```stepthrough``` function using the initial state and belief as follows:
```
stepthrough(m,planner,belief_updater,b0,is,max_steps=300)
```

### Evaluation 

Here, we demonstrate a simple evaluation of the policy's performance over several random seeds. This is meant to serve only as an example, and we encourage you to develop your own evaluation metrics.

We initialize the robot using 50 different random seeds and simulate its performance for 100 time-steps each. We then sum the rewards experienced during its interaction with the environment and track this total reward across the 50 trials.
Finally, we report the mean and standard error for the total reward. The standard error is the standard deviation of a sample set divided by the square root of the number of samples, and represents the uncertainty in the estimate of the mean value.
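
As a small sketch of the standard error computation described above, the helper below divides the sample standard deviation by the square root of the actual sample count. The name `standard_error` and the reward values are made up for illustration and are not results from this notebook.

```julia
using Statistics

# Standard error of the mean: sample standard deviation over sqrt(sample size).
standard_error(xs) = std(xs) / sqrt(length(xs))

rewards = [1.0, 2.0, 3.0, 4.0]   # made-up totals, not results from this notebook
println("mean = ", mean(rewards), ", stderr = ", standard_error(rewards))
```

Using `length(xs)` rather than a hard-coded count keeps the estimate correct when the number of trials changes.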

In [12]:
total_rewards = Float64[] # total reward of each trial

for exp = 1:50
    
    Random.seed!(exp)
    
    traj_rewards = sum([step.r for step in stepthrough(m, planner, belief_updater, max_steps=100)])
    
    println("Experience: ", string(exp), " Reward: ", traj_rewards)

    push!(total_rewards, traj_rewards)
end

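# standard error uses the actual number of trials rather than a hard-coded count
@printf("Mean Total Reward: %.3f, StdErr Total Reward: %.3f", mean(total_rewards), std(total_rewards)/sqrt(length(total_rewards)))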

Experience: 1 Reward: 0.0
Experience: 2 Reward: 3.200000000000003
Experience: 3 Reward: -0.6999999999999975
Experience: 4 Reward: 6.1
Experience: 5 Reward: 2.8
Experience: 6 Reward: 2.700000000000003
Experience: 7 Reward: -4.8000000000000025
Experience: 8 Reward: -0.4999999999999982
Experience: 9 Reward: 2.0000000000000036
Experience: 10 Reward: 0.7000000000000028
Experience: 11 Reward: 4.500000000000002
Experience: 12 Reward: -18.000000000000004
Experience: 13 Reward: 9.9
Experience: 14 Reward: -4.399999999999999
Experience: 15 Reward: -15.0
Experience: 16 Reward: 5.3999999999999995
Experience: 17 Reward: 1.2000000000000046
Experience: 18 Reward: 6.4
Experience: 19 Reward: -1.1999999999999975
Experience: 20 Reward: -2.299999999999999
Experience: 21 Reward: 2.1000000000000014
Experience: 22 Reward: 1.700000000000001
Experience: 23 Reward: 1.3000000000000007
Experience: 24 Reward: 8.4
Experience: 25 Reward: -0.09999999999999964
Experience: 26 Reward: -11.2
Experience: 27 Reward: -13.2
E