Hi,
Fantastic job with this library, it looks really nice. I am trying to implement a POMDP that is essentially a scaled-down version of the battleship problem from the original Silver & Veness POMCP paper (https://papers.nips.cc/paper/2010/hash/edfbe1afcf9246bb0d40eb4d8027d90f-Abstract.html).
Each state in the state space is a 3x3 grid (an array of 9 numbers) of ones and zeros denoting the locations of the battleships, where 1 marks a "hit" tile and 0 a "miss" tile. There are 9 actions, one for each tile in the 3x3 grid, and two observations, "hit" or "miss."
This should be a simple POMDP to implement, but the catch is that the agent cannot click on a tile twice until the state has changed (i.e. as long as the agent is in state s, each action can only be chosen once). I've implemented this as a field of the POMDP struct called board, which is an array of -1 (tile not yet chosen by the agent), 0 (tile chosen, miss), and 1 (tile chosen, hit). I am using the gen function for the transition so that I can generate the next state, observation, and reward at the same time. In gen, once the agent has found all the ship tiles, the board is reset to all -1s and the state changes. The code is short and shown below.
I tried using BasicPOMCP on this problem (roughly as shown in the snippet right after my model code below), but I am noticing that the board isn't getting cleared across state changes. This is probably because of the rather "hack-y" way I have implemented checking whether the agent has already taken a specific action.
What would be the best way to implement this problem? I briefly considered making the entire board the observation, but I am not sure whether I can pass the current observation into the gen function; as far as I can tell it only receives the state and action.
The larger question is: what is the best way to implement constraints on the agent's action space based on the current history/observations? (A rough sketch of the kind of thing I am imagining is at the end of this post.)
Thanks for the help in advance!
using POMDPs
using POMDPModelTools: SparseCat
using Distributions: Categorical
using NPZ: npzread
using Random

struct GridClickPOMDP <: POMDP{Array{Int64},Int64,Bool}
    board::Array{Int64}    # -1 = tile not chosen yet, 0 = chosen & miss, 1 = chosen & hit
    discount::Float64
    GridClickPOMDP() = new([-1 for n in 1:9], 1.0)
end

# Load the state space (each state is a length-9 vector of 0s and 1s)
# and its prior probabilities from disk.
state_matrix = npzread("data/state_space.npy")
state_space = [r[:] for r in eachrow(state_matrix)]
state_idxs = Dict()
for i = 1:length(state_space)
    state_idxs[state_space[i]] = i
end
state_probs = npzread("data/state_probs.npy")

POMDPs.states(pomdp::GridClickPOMDP) = state_space
POMDPs.stateindex(pomdp::GridClickPOMDP, s::Array{Int64,1}) = state_idxs[s]
POMDPs.actions(pomdp::GridClickPOMDP) = collect(1:9)
POMDPs.actionindex(pomdp::GridClickPOMDP, a::Int64) = a

function POMDPs.gen(pomdp::GridClickPOMDP, s, a, rng)
    if pomdp.board[a] != -1
        # Tile already clicked in this state: heavy penalty, nothing changes.
        return (sp=s, o=false, r=-200)
    else
        pomdp.board[a] = s[a]    # record the click on the board
        obs = (s[a] == 1)        # observation: hit or miss
        if length(pomdp.board[pomdp.board .== 1]) == length(s[s .== 1])
            # All ship tiles have been hit: draw a new state and clear the board.
            s_next = state_space[rand(Categorical(state_probs))]
            for i = 1:9
                pomdp.board[i] = -1
            end
            return (sp=s_next, o=obs, r=10)
        else
            # Otherwise stay in the same state; +1 for a hit, -1 for a miss.
            rew = obs ? 1 : -1
            return (sp=s, o=obs, r=rew)
        end
    end
end

POMDPs.initialstate_distribution(pomdp::GridClickPOMDP) = SparseCat(state_space, state_probs)
POMDPs.discount(pomdp::GridClickPOMDP) = 1.0
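In case it helps, this is roughly how I am running the solver on it. The solver parameters here are just placeholders I happened to try, nothing principled:

using BasicPOMCP
using POMDPSimulators: stepthrough
using Random

pomdp = GridClickPOMDP()
solver = POMCPSolver(tree_queries=1000, rng=MersenneTwister(1))   # placeholder parameters
planner = solve(solver, pomdp)

# Step through a short simulated episode with the POMCP planner.
for (s, a, o, r) in stepthrough(pomdp, planner, "s,a,o,r", max_steps=20)
    println("a = $a, o = $o, r = $r")
end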
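To make the larger question a bit more concrete, here is the kind of refactor I have been imagining as one possible answer: fold the clicked-tile board into the state itself, so it travels with s through the solver instead of living as a mutable field on the POMDP struct, and then express the "no clicking the same tile twice" rule through the state-dependent actions(pomdp, s) method. Everything below is just my guess; the BoardState type and all the names are mine, and I don't know whether BasicPOMCP actually consults actions(pomdp, s), so please correct me if this is the wrong direction:

# Hypothetical sketch only. Reuses state_space and state_probs from the code above.
using POMDPs
using Distributions: Categorical
using Random

struct BoardState
    ships::Vector{Int64}      # 1 = ship tile, 0 = empty tile (same encoding as my states above)
    clicked::Vector{Int64}    # -1 = not clicked yet, 0 = clicked & miss, 1 = clicked & hit
end

struct GridClickPOMDP2 <: POMDP{BoardState,Int64,Bool} end

POMDPs.actions(pomdp::GridClickPOMDP2) = 1:9
# State-dependent action space: only tiles that have not been clicked in this state.
POMDPs.actions(pomdp::GridClickPOMDP2, s::BoardState) = [a for a in 1:9 if s.clicked[a] == -1]

function POMDPs.gen(pomdp::GridClickPOMDP2, s::BoardState, a::Int64, rng)
    clicked = copy(s.clicked)    # never mutate the incoming state
    clicked[a] = s.ships[a]
    obs = s.ships[a] == 1
    if count(==(1), clicked) == count(==(1), s.ships)
        # All ship tiles found: draw a fresh layout and reset the clicked mask.
        ships_next = state_space[rand(rng, Categorical(state_probs))]
        return (sp=BoardState(ships_next, fill(-1, 9)), o=obs, r=10)
    else
        return (sp=BoardState(s.ships, clicked), o=obs, r=obs ? 1 : -1)
    end
end

POMDPs.discount(pomdp::GridClickPOMDP2) = 1.0

I left out the initial state distribution and the == / hash definitions for BoardState that a real implementation would presumably need; this is just meant to show what I mean by constraining the action space through the state.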