# REINFORCEMENT LEARNING FROM SCRATCH

In this exercise, we build a Reinforcement Learning (RL) pipeline from scratch and train a model that helps us play Tic Tac Toe. Use the provided module `tic_tac_toe.py` for this exercise.

**a) Get familiar with the provided class `TicTacToe` by creating a game and playing a few rounds with random choices. The AI player is supposed to be the player 'o'.**

**b) Collect the information of 1000 games that can be used to train a model later on:**

*   Set up a loop that plays 1000 games.
*   In each game, both players take turns. Player 'x' (you) always begins with a random move; afterward, player 'o' makes a random move.
*   For each of your turns, i.e., the turns of player 'x', store the board state before the turn and the cell that was chosen (the chosen cell is returned by the random play method, and the board state can be retrieved from the board object).
*   Always check whether the game is finished after a player’s turn and, if so, compute the rewards for this game (use the provided function `get_rewards()` in the file). This function expects the winner of the game and a list of player x’s plays, each given as a tuple of chosen cell and board state before the play.
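
The provided `get_rewards()` is not reproduced in this notebook. Purely as an illustration of the idea in the last bullet, the sketch below turns a game outcome into one reward per move of player 'x'. The function name `toy_rewards`, the reward values, the discount factor `gamma`, and the feature layout (chosen cell first, then the flattened board) are all assumptions and may differ from the provided implementation.

```python
import numpy as np

def toy_rewards(winner, plays, win=1.0, loss=-1.0, draw=0.0, gamma=0.9):
    """Hypothetical stand-in for get_rewards(); NOT the provided implementation.

    winner: 'x', 'o', or anything else for a draw (assumed encoding).
    plays:  list of (chosen_cell, board_before) tuples for player 'x'.
    """
    if winner == 'x':
        final = win
    elif winner == 'o':
        final = loss
    else:
        final = draw

    n = len(plays)
    # The last move gets the full outcome; earlier moves get a discounted share.
    rewards = np.array([final * gamma ** (n - 1 - i) for i in range(n)])

    # One feature row per move: chosen cell followed by the flattened board.
    features = np.array(
        [np.concatenate([np.ravel(cell), np.ravel(board)]) for cell, board in plays],
        dtype=float,
    )
    return rewards, features
```

Discounting is one common way to give earlier moves less credit for the final result; whether the provided function does this should be checked in `tic_tac_toe.py`.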

**c) `get_rewards()` returns a tuple: the first value contains the rewards (targets) and the second value contains the training data in the form of chosen cell and board state.**

*   Train a model of your choice (can be a Deep Learning Model or another model) to predict the expected reward given the provided training data.
*   Split the available data before training (test ratio of 0.2).

**d) Evaluate the trained model on the test set in terms of prediction deviation. Explain how the trained model could be used to help human players win Tic Tac Toe games.**
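
One possible way to quantify "prediction deviation" is to report the mean absolute error and the root mean squared error, and to compare them against a trivial baseline that always predicts the mean training reward. The helper below is a minimal sketch using only NumPy; the name `deviation_report` and the choice of metrics are assumptions, not part of the exercise.

```python
import numpy as np

def deviation_report(y_true, y_pred, y_train_mean):
    """Summarize prediction deviation (hypothetical helper, not from the exercise)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    errors = y_pred - y_true
    # Baseline: always predict the mean reward seen during training.
    baseline_errors = y_train_mean - y_true

    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "baseline_rmse": float(np.sqrt(np.mean(baseline_errors ** 2))),
    }
```

If the model's RMSE is not clearly below the baseline RMSE, the model has learned little beyond the average reward, which is worth knowing before using it to suggest moves.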


In [9]:
import numpy as np
from tic_tac_toe import TicTacToe, get_rewards, transform_move

# Initialize a TicTacToe game instance
game = TicTacToe()

# Play a few rounds with random choices
for _ in range(3):  # Play 3 random games
    game.reset_game_state()
    game.print_board()

    while not game.check_for_winner() and not game.check_for_board_filled():
        if game.get_last_move() != game._mappings['x']:
            # Player 'x' (human) makes a random move
            game.random_play_x()
        if not game.check_for_winner() and not game.check_for_board_filled():
            if game.get_last_move() != game._mappings['o']:
                # Player 'o' (AI) makes a random move
                game.random_play_o()
        game.print_board()

    winner = game.check_for_winner()
    if winner:
        print(f"Winner is {winner}")
    else:
        print("It's a draw!")


_____________
|   |   |   |
|   |   |   |
|   |   |   |
_____________
_____________
|   | x |   |
|   |   |   |
| o |   |   |
_____________
_____________
|   | x |   |
| o | x |   |
| o |   |   |
_____________
_____________
| o | x |   |
| o | x | x |
| o |   |   |
_____________
Winner is o
_____________
|   |   |   |
|   |   |   |
|   |   |   |
_____________
_____________
|   | x |   |
|   |   |   |
| o |   |   |
_____________
_____________
|   | x |   |
|   | x |   |
| o |   | o |
_____________
_____________
| o | x |   |
| x | x |   |
| o |   | o |
_____________
_____________
| o | x | x |
| x | x | o |
| o |   | o |
_____________
_____________
| o | x | x |
| x | x | o |
| o | x | o |
_____________
Winner is x
_____________
|   |   |   |
|   |   |   |
|   |   |   |
_____________
_____________
|   | o |   |
|   |   |   |
|   |   | x |
_____________
_____________
|   | o | o |
|   |   |   |
|   | x | x |
_____________
_____________
| x | o | o |
|   |   | o |
|   | x | x |
__________

In [10]:
# Initialize the TicTacToe game
game = TicTacToe()

# Parameters
num_games = 1000

# Data collection
collected_data = []

for _ in range(num_games):
    game.reset_game_state()
    game_data = []

    while not game.check_for_winner() and not game.check_for_board_filled():
        # Player 'x' plays
        board_state_before = game.get_board().copy()
        chosen_cell_x = game.random_play_x()
        game_data.append((chosen_cell_x, board_state_before))
        
        # Check for winner after x's move
        winner = game.check_for_winner()
        if winner or game.check_for_board_filled():
            break
        
        # Player 'o' plays
        game.random_play_o()

    # Determine the winner and compute rewards
    winner = game.check_for_winner()
    rewards, training_data = get_rewards(winner, game_data)
    collected_data.append((rewards, training_data))

# Prepare the data
all_rewards = np.concatenate([data[0] for data in collected_data])
all_training_data = np.concatenate([data[1] for data in collected_data])

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from tic_tac_toe import TicTacToe, get_rewards, transform_move
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    all_training_data, all_rewards, test_size=0.2, random_state=42
)

# Normalize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define a deep learning model
def create_model():
    model = Sequential([
        Dense(128, input_dim=X_train.shape[1], activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='linear')  # Linear activation for regression
    ])
    
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
    return model

# Create and train the model
model = create_model()
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=1)



Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x2cf933036a0>

In [19]:
# Evaluate the model
y_pred = model.predict(X_test).flatten()
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on Test Set: {mse:.2f}")

# Function to suggest the best move using the trained model
def suggest_move(board_state, model, scaler, game):
    """Return the legal move with the highest predicted reward for player 'x'."""
    legal_moves = game.get_possible_moves()

    # Build one feature vector per legal move.
    # NOTE: the feature order must match the layout of the training data
    # returned by get_rewards(); verify it against tic_tac_toe.py.
    candidates = []
    for move in legal_moves:
        transformed_move = transform_move(move[0], move[1])
        state = np.concatenate([board_state.flatten(), transformed_move])
        candidates.append(state)

    # Apply the same scaling that was fitted on the training data
    candidates = scaler.transform(np.array(candidates))

    # Predict the reward of each candidate move
    rewards = model.predict(candidates, verbose=0).flatten()

    # Choose the move with the highest predicted reward
    return legal_moves[int(np.argmax(rewards))]


Mean Squared Error on Test Set: 4504.95

In [20]:
def play_game(model, scaler):
    game = TicTacToe()
    game.reset_game_state()
    
    # Print initial board
    print("Initial board:")
    game.print_board()
    
    # Game loop
    while not game.check_for_winner() and not game.check_for_board_filled():
        # Player 'x' plays the move suggested by the model
        if game.get_last_move() != game._mappings['x']:
            suggested_move = suggest_move(game.get_board(), model, scaler, game)
            # Ensure the suggested move is valid
            if suggested_move in game.get_possible_moves():
                game.play_x(*suggested_move)
            else:
                # Fall back to a random move; retrying would loop forever
                # because the suggestion is deterministic for a given board
                print("Suggested move is invalid. Falling back to a random move.")
                game.random_play_x()
            print("Player 'x' makes a move:")
            game.print_board()
        
        # Check for winner after player 'x' move
        if game.check_for_winner() or game.check_for_board_filled():
            break
        
        # Computer player (random 'o')
        if game.get_last_move() != game._mappings['o']:
            game.random_play_o()
            print("Player 'o' (computer) makes a move:")
            game.print_board()
    
    # Determine the winner
    winner = game.check_for_winner()
    if winner:
        print(f"Winner is {winner}")
    else:
        print("It's a draw!")


In [22]:
play_game(model, scaler)

Initial board:
_____________
|   |   |   |
|   |   |   |
|   |   |   |
_____________
Player 'x' makes a move:
_____________
|   |   |   |
|   | x |   |
|   |   |   |
_____________
Player 'o' (computer) makes a move:
_____________
| o |   |   |
|   | x |   |
|   |   |   |
_____________
Player 'x' makes a move:
_____________
| o |   |   |
|   | x |   |
| x |   |   |
_____________
Player 'o' (computer) makes a move:
_____________
| o |   |   |
| o | x |   |
| x |   |   |
_____________
Player 'x' makes a move:
_____________
| o |   |   |
| o | x | x |
| x |   |   |
_____________
Player 'o' (computer) makes a move:
_____________
| o |   |   |
| o | x | x |
| x | o |   |
_____________
Player 'x' makes a move:
_____________
| o |   | x |
| o | x | x |
| x | o |   |
_____________
Winner is x
