Two reinforcement learning implementations in Python, one based on Q-learning and one on deep reinforcement learning, using:
- OpenAI Gym
- PyTorch
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:
This grid is our environment where S is the agent’s starting point, and it’s safe. F represents the frozen surface and is also safe. H represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that’s not good. Finally, G represents the goal, which is the space on the grid where the prized frisbee is located.
The agent can navigate left, right, up, and down, and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of one if it reaches the goal, and zero otherwise.
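As a rough illustration of these rules, the grid can be encoded as a small pure-Python environment. This is a minimal sketch and not Gym's implementation: the 4x4 layout in `LAKE_MAP`, the action numbering, the helper names `tile`, `move`, and `step`, and the slip model (the intended direction or one of its two perpendicular neighbours, each with probability 1/3) are all assumptions made for the example.

```python
import random

# Assumed layout: a 4x4 map in the S/F/H/G notation described above.
LAKE_MAP = ["SFFF",
            "FHFH",
            "FFFH",
            "HFFG"]
N_ROWS, N_COLS = len(LAKE_MAP), len(LAKE_MAP[0])
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action numbering


def tile(state):
    """Return the map letter for a state index (row-major order)."""
    row, col = divmod(state, N_COLS)
    return LAKE_MAP[row][col]


def move(state, action):
    """Move one cell in the given direction, staying put at the lake's edge."""
    row, col = divmod(state, N_COLS)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N_ROWS - 1)
    elif action == RIGHT:
        col = min(col + 1, N_COLS - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * N_COLS + col


def step(state, action, slippery=True, rng=random):
    """Return (next_state, reward, done) for one move on the lake."""
    if slippery:
        # Assumed slip model: the intended direction or one of the two
        # perpendicular directions, each chosen with probability 1/3.
        action = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    next_state = move(state, action)
    done = tile(next_state) in "HG"          # holes and the goal end the episode
    reward = 1.0 if tile(next_state) == "G" else 0.0
    return next_state, reward, done
```

Calling `step(0, RIGHT, slippery=False)` moves the agent from the start cell to the cell on its right; with `slippery=True` the same call may instead leave it in place or move it down.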
- Initialize all Q-values in the Q-table to 0.
- For each time step in each episode:
  - Choose an action (considering the exploration-exploitation trade-off).
  - Observe the reward and next state.
  - Update the Q-value function (using the Q-learning update rule, which over time makes the Q-value function converge to the right-hand side of the Bellman equation).
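The steps above can be sketched as a tabular Q-learning loop. The update used here is the standard Q-learning rule, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max Q(s', a') - Q(s, a)). The hyperparameter values, the exponential epsilon decay, the flattened 4x4 map, and the non-slippery `step` helper are assumptions chosen to keep the example short and self-contained, so treat it as an illustration rather than the repository's code.

```python
import math
import random

LAKE = "SFFFFHFHFFFHHFFG"  # assumed 4x4 map, flattened row by row
N_STATES, N_ACTIONS, SIDE = 16, 4, 4


def step(state, action):
    """Non-slippery move: 0=left, 1=down, 2=right, 3=up (assumed numbering)."""
    row, col = divmod(state, SIDE)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, SIDE - 1)
    elif action == 2:
        col = min(col + 1, SIDE - 1)
    else:
        row = max(row - 1, 0)
    next_state = row * SIDE + col
    # Reward is 1 only on the goal tile; holes and the goal end the episode.
    return next_state, float(LAKE[next_state] == "G"), LAKE[next_state] in "HG"


def train_q_table(episodes=5000, max_steps=100, alpha=0.1, gamma=0.99,
                  min_epsilon=0.01, decay=0.001, seed=0):
    rng = random.Random(seed)
    # Initialize all Q-values in the Q-table to 0.
    q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for episode in range(episodes):
        # Exploration rate decays from 1 toward min_epsilon over the episodes.
        epsilon = min_epsilon + (1.0 - min_epsilon) * math.exp(-decay * episode)
        state = 0
        for _ in range(max_steps):
            # Choose an action: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = q_table[state].index(max(q_table[state]))
            # Observe the reward and next state.
            next_state, reward, done = step(state, action)
            # Move Q(s, a) toward reward + gamma * max_a' Q(s', a').
            td_target = reward + gamma * max(q_table[next_state])
            q_table[state][action] += alpha * (td_target - q_table[state][action])
            state = next_state
            if done:
                break
    return q_table


q_table = train_q_table()
greedy = [row.index(max(row)) for row in q_table]  # best known action per state
print(greedy)
```

Rows for hole and goal states stay at zero because an episode ends as soon as the agent enters them, so no update is ever made from those states.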
The cart and pole problem consists of a cart that can move left and right along a frictionless track. A pole is attached to the top of the cart and starts out in a vertical, upright position; by design, it will fall to the left or right when not balanced. The goal is to prevent the pole from falling over. A reward of +1 is given for each time step that the pole remains upright, and an episode is deemed over when the pole is more than 15 degrees from vertical or when the cart moves more than 2.4 units from the center of the screen.
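The reward and termination rules can be written down as a small check. This is only a sketch of the description above, not the Gym environment's code: the function names are hypothetical, the pole angle is assumed to be given in radians, and the return is simplified to count only the steps before the episode ends.

```python
import math

MAX_ANGLE_DEG = 15.0  # pole angle limit from the description above
MAX_POSITION = 2.4    # cart position limit from the description above


def episode_over(cart_position, pole_angle_rad):
    """True once the pole or the cart leaves the allowed range."""
    too_tilted = abs(math.degrees(pole_angle_rad)) > MAX_ANGLE_DEG
    too_far = abs(cart_position) > MAX_POSITION
    return too_tilted or too_far


def episode_return(trajectory):
    """Sum the +1 rewards over (cart_position, pole_angle_rad) pairs,
    stopping at the first pair that ends the episode."""
    total = 0
    for cart_position, pole_angle_rad in trajectory:
        if episode_over(cart_position, pole_angle_rad):
            break
        total += 1
    return total
```

Because every surviving time step is worth +1, the return of an episode is simply its length, which is why a longer-lasting balance means a higher score.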
- Initialize replay memory capacity.
- Initialize the policy network with random weights.
- Clone the policy network, and call it the target network.
- For each episode:
  - Initialize the starting state.
  - For each time step:
    - Select an action.
      - Via exploration or exploitation.
    - Execute selected action in an emulator.
    - Observe reward and next state.
    - Store experience in replay memory.
    - Sample random batch from replay memory.
    - Preprocess states from batch.
    - Pass batch of preprocessed states to policy network.
    - Calculate loss between output Q-values and target Q-values.
      - Requires a pass to the target network for the next state.
    - Gradient descent updates weights in the policy network to minimize loss.
    - After x time steps, weights in the target network are updated to the weights in the policy network.
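The loop above can be sketched end to end in plain Python. To keep the example runnable without PyTorch or Gym, the neural network is swapped for a linear Q-function over one-hot state features with a hand-written gradient step, and the emulator is a hypothetical five-cell corridor rather than CartPole. The hyperparameters, the epsilon schedule, and all function names are likewise assumptions; only the order of the steps follows the list.

```python
import random
from collections import deque

N_STATES, N_ACTIONS, GOAL = 5, 2, 4  # hypothetical corridor: 0=left, 1=right


def env_step(state, action):
    """Toy emulator: reaching the right end pays 1 and ends the episode."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, float(done), done


def preprocess(state):
    """One-hot encode a state index."""
    return [1.0 if i == state else 0.0 for i in range(N_STATES)]


def q_values(weights, features):
    """Linear stand-in for a network: one weight row per action."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def train(episodes=200, max_steps=50, capacity=1000, batch_size=16,
          gamma=0.9, lr=0.5, sync_every=20, seed=0):
    rng = random.Random(seed)
    memory = deque(maxlen=capacity)              # replay memory with fixed capacity
    policy = [[rng.uniform(-0.01, 0.01) for _ in range(N_STATES)]
              for _ in range(N_ACTIONS)]         # random initial weights
    target = [row[:] for row in policy]          # clone: the target "network"
    total_steps = 0
    for episode in range(episodes):
        epsilon = max(0.05, 0.97 ** episode)     # assumed exploration schedule
        state = 0                                # starting state
        for _ in range(max_steps):
            # Select an action via exploration or exploitation.
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                q = q_values(policy, preprocess(state))
                action = q.index(max(q))
            # Execute the action, observe reward and next state, store experience.
            next_state, reward, done = env_step(state, action)
            memory.append((state, action, reward, next_state, done))
            state = next_state
            total_steps += 1
            if len(memory) >= batch_size:
                # Sample a random batch and accumulate the loss gradient.
                batch = rng.sample(list(memory), batch_size)
                grads = [[0.0] * N_STATES for _ in range(N_ACTIONS)]
                for s, a, r, s2, d in batch:
                    x = preprocess(s)
                    # The target Q-value needs a pass through the target weights.
                    y = r if d else r + gamma * max(q_values(target, preprocess(s2)))
                    error = q_values(policy, x)[a] - y
                    for i, xi in enumerate(x):
                        grads[a][i] += error * xi / batch_size
                # One gradient-descent step on the batch mean of 0.5 * error^2.
                for a in range(N_ACTIONS):
                    for i in range(N_STATES):
                        policy[a][i] -= lr * grads[a][i]
            # Every sync_every steps, copy the policy weights into the target.
            if total_steps % sync_every == 0:
                target = [row[:] for row in policy]
            if done:
                break
    return policy


policy = train()
greedy_actions = [q.index(max(q))
                  for q in (q_values(policy, preprocess(s)) for s in range(GOAL))]
print(greedy_actions)
```

In a PyTorch version the weight lists would become two `torch.nn.Module` instances, the manual gradient step would be handled by an optimizer, and the target sync would copy the policy network's state into the target network.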