
Reinforcement Learning in 3x3 Tic-Tac-Toe, learning by random self-playing

Implementation in Python (2 or 3), forked from tansey/rl-tictactoe.

A quick Python implementation of the 3x3 Tic-Tac-Toe value function learning agent, as described in Chapter 1 of "Reinforcement Learning: An Introduction" by Sutton and Barto 📖.


Usage of this program 📝

This implementation is simply one Python file (tictactoe.py):

# Run this program and keep its output
python2 tictactoe.py | tee ./tictactoe.log

The code is pylint-valid with both Python 2 and 3 (2.7+ and 3.4+).


Example 💪

See the figure below for an example of what is achieved with this rather simple implementation:

[Figure: Main example]

Numerical results are also available in the (long) results.csv file.

Limitation 💫


Explanations 👍

The agent contains a lookup table that maps states to values, where initial values are 1 for a win, 0 for a draw or loss, and 0.5 otherwise. At every move, the agent chooses either the maximum-value move (greedy) or, with some probability epsilon, a random move (exploratory); by default epsilon=0.1.
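For illustration, here is a minimal sketch of such a lookup table and of the epsilon-greedy move selection, from a single agent's point of view. This is not the actual code of tictactoe.py: the board representation (a 9-tuple of ' ', 'X', 'O') and the helper names are assumptions.

import random

EPSILON = 0.1  # default exploration probability mentioned above

# Lookup table mapping board states to values (hypothetical name).
values = {}

def state_value(state):
    # States not seen yet default to 0.5; terminal states would be stored
    # as 1 (win) or 0 (draw or loss) when first encountered.
    return values.get(state, 0.5)

def afterstates(state, player):
    # All states reachable by placing `player`'s mark in an empty cell.
    return [state[:i] + (player,) + state[i + 1:]
            for i, cell in enumerate(state) if cell == ' ']

def choose_move(state, player):
    candidates = afterstates(state, player)
    if random.random() < EPSILON:
        return random.choice(candidates)      # exploratory (random) move
    return max(candidates, key=state_value)   # greedy (maximum-value) move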

The agent updates its value function (the lookup table) after every greedy move, following the equation:

V(s) <- V(s) + alpha * ( V(s') - V(s) )
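Continuing the sketch above, that update could look like the following (ALPHA is an illustrative step-size value, not necessarily the one used in tictactoe.py); exploratory moves do not trigger it:

ALPHA = 0.5  # step-size parameter alpha (illustrative value)

def update(state, next_state):
    # V(s) <- V(s) + alpha * ( V(s') - V(s) ), applied after each greedy move
    values[state] = state_value(state) + ALPHA * (state_value(next_state) - state_value(state))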

Why? 💥

This particular implementation addresses the question posed in Exercise 1.1:

What would happen if the RL agent taught itself via self-play?

The result is that the agent learns only how to maximize its own potential payoff, without consideration for whether it is playing to a win or a draw. Even more to the point, the agent learns a myopic strategy where it basically has a single path that it wants to take to reach a winning state.

If the path is blocked by the opponent, the values will then usually all become 0.5 and the player is effectively moving randomly (so we did nothing clever in this case).

Note that if you use a loss value of -1, then the agent learns to play the optimal strategy in the minimax sense.
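With the sketch above, that change only affects how terminal states are valued, along these lines (again an illustrative sketch: winner and terminal_value are assumed helpers, not functions of tictactoe.py):

WIN_VALUE, DRAW_VALUE, LOSS_VALUE = 1.0, 0.0, -1.0   # loss at -1 instead of 0

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),    # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),    # columns
         (0, 4, 8), (2, 4, 6)]               # diagonals

def winner(state):
    # Return 'X' or 'O' if that player has three in a row, else None.
    for a, b, c in LINES:
        if state[a] != ' ' and state[a] == state[b] == state[c]:
            return state[a]
    return None

def terminal_value(state, player):
    # Value used to seed a terminal state in the lookup table.
    w = winner(state)
    if w == player:
        return WIN_VALUE
    if w is None:
        return DRAW_VALUE
    return LOSS_VALUE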


📜 Authors

Original implementation by Wesley Tansey (tansey/rl-tictactoe); this fork is maintained by Naereen.

📜 License ?

MIT Licensed (file LICENSE).

