Using Monte Carlo learning and policy evaluation/iteration methods to chart a path for an agent to reach a goal across icy terrain, avoiding intermittent holes and coping with a stochastic wind that makes the agent's actions unreliable: the observed state transition may not correspond to the action the agent chose.
| Original Policy | Policy after 30 episodes |
|---|---|
The policy converges within ~30 episodes, while the state values take ~1000 episodes to converge under the Monte Carlo method.
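The first-visit Monte Carlo evaluation used to estimate these state values can be sketched as below. This is a minimal illustration only: the 4x4 layout, hole positions, 0.2 slip ("wind") probability, discount factor, and hand-made policy are assumptions for the sketch, not the repository's actual environment or hyperparameters.

```python
import random

N = 4                                  # 4x4 icy grid, states 0..15
GOAL, HOLES = 15, {5, 7, 11, 12}       # assumed layout for illustration
ACTIONS = ['U', 'D', 'L', 'R']

def step(state, action, rng):
    """Apply an action; with probability 0.2 the 'wind' replaces it with a
    random action, so the observed transition may not match the choice."""
    if rng.random() < 0.2:
        action = rng.choice(ACTIONS)
    r, c = divmod(state, N)
    if action == 'U' and r > 0:
        state -= N
    elif action == 'D' and r < N - 1:
        state += N
    elif action == 'L' and c > 0:
        state -= 1
    elif action == 'R' and c < N - 1:
        state += 1
    if state == GOAL:
        return state, 1.0, True
    return state, 0.0, state in HOLES

def run_episode(policy, rng, max_steps=100):
    """Roll out one episode under a fixed policy; record (state, reward)."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        nxt, reward, done = step(state, policy[state], rng)
        trajectory.append((state, reward))
        state = nxt
        if done:
            break
    return trajectory

def mc_evaluate(policy, episodes=1000, gamma=0.95, seed=0):
    """First-visit Monte Carlo estimate of the state values V."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(N * N)}
    counts = {s: 0 for s in range(N * N)}
    for _ in range(episodes):
        G, first_visit_return = 0.0, {}
        for state, reward in reversed(run_episode(policy, rng)):
            G = gamma * G + reward
            first_visit_return[state] = G  # overwriting keeps the FIRST visit
        for state, G in first_visit_return.items():
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]  # incremental mean
    return V

# A hand-made policy that walks 0 -> 1 -> 2 -> 6 -> 10 -> 14 -> 15,
# skirting the assumed holes; every other state defaults to 'D'.
policy = {s: 'D' for s in range(N * N)}
policy.update({0: 'R', 1: 'R', 2: 'D', 14: 'R'})
values = mc_evaluate(policy)
```

Iterating the trajectory backwards and overwriting the per-state return keeps only the return from each state's first visit, and the incremental-mean update avoids storing every return, which is why the value estimates need many more episodes to settle than the (greedy) policy does.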
Final state values: