
Reinforcement Learning in 3x3 Tic-Tac-Toe, learning by random self-playing

Implementation in Python (2 or 3), forked from tansey/rl-tictactoe.

A quick Python implementation of the 3x3 Tic-Tac-Toe value function learning agent, as described in Chapter 1 of "Reinforcement Learning: An Introduction" by Sutton and Barto 📖.


Usage of this program 📝

This implementation is simply one Python file (tictactoe.py):

# Run this program and keep its output
python2 tictactoe.py | tee ./tictactoe.log

The code is pylint-valid with both Python 2 and 3 (2.7+ and 3.4+).


Example 💪

See the figure below for an example of what is achieved with this rather simple implementation:

[Figure: Main example]

Numerical results are also available in the (long) results.csv file.

Limitation 💫


Explanations 👍

The agent contains a lookup table that maps states to values, where initial values are 1 for a win, 0 for a draw or loss, and 0.5 otherwise. At every move, the agent chooses either the maximum-value move (greedy) or, with some probability epsilon, a random move (exploratory); by default epsilon=0.1.
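For illustration, here is a minimal sketch of such a lookup table and of the epsilon-greedy move selection, from a single agent's point of view. This is not the actual code of tictactoe.py: the board representation (a 9-tuple of ' ', 'X', 'O') and the helper names are assumptions.

import random

EPSILON = 0.1  # default exploration probability mentioned above

# Lookup table mapping board states to values (hypothetical name).
values = {}

def state_value(state):
    # States not seen yet default to 0.5; terminal states would be stored
    # as 1 (win) or 0 (draw or loss) when first encountered.
    return values.get(state, 0.5)

def afterstates(state, player):
    # All states reachable by placing `player`'s mark in an empty cell.
    return [state[:i] + (player,) + state[i + 1:]
            for i, cell in enumerate(state) if cell == ' ']

def choose_move(state, player):
    candidates = afterstates(state, player)
    if random.random() < EPSILON:
        return random.choice(candidates)      # exploratory (random) move
    return max(candidates, key=state_value)   # greedy (maximum-value) move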

The agent updates its value function (the lookup table) after every greedy move, following the equation:

V(s) <- V(s) + alpha * ( V(s') - V(s) )
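Continuing the sketch above, that update could look like the following (ALPHA is an illustrative step-size value, not necessarily the one used in tictactoe.py); exploratory moves do not trigger it:

ALPHA = 0.5  # step-size parameter alpha (illustrative value)

def update(state, next_state):
    # V(s) <- V(s) + alpha * ( V(s') - V(s) ), applied after each greedy move
    values[state] = state_value(state) + ALPHA * (state_value(next_state) - state_value(state))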

Why? 💥

This particular implementation addresses the question posed in Exercise 1.1:

What would happen if the RL agent taught itself via self-play?

The result is that the agent learns only how to maximize its own potential payoff, without consideration for whether it is playing to a win or a draw. Even more to the point, the agent learns a myopic strategy where it basically has a single path that it wants to take to reach a winning state.

If the path is blocked by the opponent, the values will then usually all become 0.5 and the player is effectively moving randomly (so we did nothing clever in this case).

Note that if you use a loss value of -1, then the agent learns to play the optimal strategy in the minimax sense.
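With the sketch above, that change only affects how terminal states are valued, along these lines (again an illustrative sketch: winner and terminal_value are assumed helpers, not functions of tictactoe.py):

WIN_VALUE, DRAW_VALUE, LOSS_VALUE = 1.0, 0.0, -1.0   # loss at -1 instead of 0

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),    # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),    # columns
         (0, 4, 8), (2, 4, 6)]               # diagonals

def winner(state):
    # Return 'X' or 'O' if that player has three in a row, else None.
    for a, b, c in LINES:
        if state[a] != ' ' and state[a] == state[b] == state[c]:
            return state[a]
    return None

def terminal_value(state, player):
    # Value used to seed a terminal state in the lookup table.
    w = winner(state)
    if w == player:
        return WIN_VALUE
    if w is None:
        return DRAW_VALUE
    return LOSS_VALUE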


📜 Authors

Original implementation by Wesley Tansey (tansey/rl-tictactoe); this fork is maintained by Naereen.

📜 License ?

MIT Licensed (file LICENSE).

