*Cell with all imports for notebook*

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
import torch.nn.functional as F
import copy
import math
import pygame

# Training a neural network to play the 2048 game with PyTorch

*by Chad Schwenke and Logan Cadman, December, 2023* 

## Introduction

Utilizing machine learning, our goal was to train a neural network to play the 2048 game. In 2048 you start out with a 4x4 game board containing two tiles, each either a 2 or a 4, in random positions. On each turn the player slides all tiles up, down, left, or right. Each valid move adds another 2 or 4 tile in a random empty position, and two tiles merge into their sum only if they hold the same number. It's a strategic movement game in which you combine tiles to create higher tiles on the board, which increases your score.

Training a neural network to play 2048 seemed like an interesting endeavor for multiple reasons. First, our inspiration for this project comes from assignment 5 in class, where we trained a reinforcement learning algorithm to play tic-tac-toe. That assignment showed us that training a neural network to play a simple game could be quite interesting, because it gives us a practical and fun way to apply the machine learning concepts we've learned in class.

2048 is a relatively simple game at first glance, but it is more complex than tic-tac-toe. It has clear objectives and easily measurable outcomes such as the score, the highest tile, and the number of tiles merged. It seems to provide a good platform for experimenting with a multitude of network structures, learning strategies, optimization methods, and reward systems. The game requires strategy and some planning to achieve high scores. Reinforcement learning agents typically use a 'discount factor', which we call 'gamma'; it determines whether the agent favors immediate rewards, future rewards, or a balance between the two. This makes for an interesting experiment because the agent must plan strategically: lowering gamma shows whether short-term thinking is more beneficial, while raising it shows whether long-term thinking is better.

While looking at the simple Q-function learning algorithms first introduced in class, we realized they would not be feasible for 2048. Each cell has 18 possible states and the 4x4 board has 16 cells, which led us to estimate 18^16 possible board combinations, far too many to store in a table. We realized it would be possible to store a partial set of boards, but we did not end up doing this. Instead, as the agent plays 2048 it generates its own data and then learns from that data to improve its actions using deep Q-learning.
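To give a sense of scale for that estimate, the count can be computed directly. This is a quick sketch that assumes, as in the paragraph above, 18 possible values per cell (empty, or a power of two from 2 up to 131,072). The result is an upper bound, since not every combination can actually be reached in play.

```python
# Rough size of the 2048 state space under the assumption of 18 values per cell.
values_per_cell = 18      # empty cell plus 2**1 ... 2**17 (131,072)
cells = 4 * 4             # 4x4 board

num_boards = values_per_cell ** cells

# Print in scientific notation to show the order of magnitude
print(f"Upper bound on distinct boards: {num_boards:.3e}")
```

The value is on the order of 10^20, which is why a lookup table over all boards is impractical and a function approximator is used instead.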

After doing a bit of work on this project we determined that we would not be adapting the A5 code for the 2048 game. This was due to a combination of factors. We would not have learned as much by adapting that code; writing our own code in PyTorch helped us learn and better understand machine learning. We also didn't want to reinvent the wheel with a TensorFlow implementation, because it would have been nearly identical and our goal was to learn RL, not to compare frameworks. We did, however, use the A5 code as a guide, as well as the official PyTorch tutorial [Paszke, 2023].

We searched online forums, and it appears that the maximum tile possible in 2048 is believed to be either 65,536 or 131,072. Since these are so high and neither of us has ever achieved a tile even close to that, it piqued our interest even more. Would it be possible for a neural network agent to reach such a tile? This is an interesting question because that tile would be extremely hard for a human to achieve.

For our neural network we ended up using a few different network designs. These include a simple fully connected network, 'QNetworkSimple'; a network similar to the simple one but with dropout layers, 'QNetworkDropout'; and lastly 'QNetworkConv', a convolutional neural network. We tried a few other designs too, but these are the ones we did the majority of our tests with.

In [None]:
# Set device to GPU if available
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

In [None]:
class QNetworkSimple(nn.Module):
    def __init__(self):
        super(QNetworkSimple, self).__init__()
        self.fc1 = nn.Linear(16, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 4)

    def forward(self, x):
        x = x.to(device)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

This simple network is fully connected and has three layers: fc1, fc2, and fc3. It takes as input the original 4x4 game board flattened into 16 values. The first layer maps the 16 inputs to 128 outputs, the second maps 128 to 64, and the last maps 64 to 4, one output per move direction. The forward function first calls 'x = x.to(device)' to ensure the input is on the GPU if one is available. We then apply the relu activation after the first two layers to make the network non-linear, and leave the last layer as the raw output.

In [None]:
class QNetworkDropout(nn.Module):
    def __init__(self):
        super(QNetworkDropout, self).__init__()
        self.fc1 = nn.Linear(16, 64)
        self.fc2 = nn.Linear(64, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 4)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = x.to(device)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

This network design is very similar to our last network but modified to use dropout, which helps with regularization. There are four linear layers declared, along with one dropout layer that is reused. During training, the dropout layer randomly zeros out elements of its input tensor with probability 0.5. In the forward function we again use relu to add non-linearity, and dropout is applied after each of the first two activations to regularize the output of that layer.
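To make the effect of a dropout layer concrete, here is a minimal plain-Python sketch of inverted dropout, the scheme `nn.Dropout` uses: during training each element is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, while in evaluation mode the input passes through unchanged. The `dropout` helper below is a hypothetical illustration on lists, not part of the notebook's code.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Illustrative inverted dropout on a plain list (hypothetical helper)."""
    # In evaluation mode (or with p == 0) dropout is the identity
    if not training or p == 0.0:
        return list(values)
    # Survivors are scaled so the expected value of each element is unchanged
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

random.seed(0)
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, p=0.5, training=True)
eval_out = dropout(activations, p=0.5, training=False)
print("training:", train_out)
print("evaluation:", eval_out)
```

Because dropout is only active in training mode, an evaluation run with 'QNetworkDropout' would typically call `q_network.eval()` first so that the Q values are deterministic.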

In [None]:
class QNetworkConv(nn.Module):
    def __init__(self):
        super(QNetworkConv, self).__init__()
        self.conv_block = ConvBlock(input_dim=1, output_dim=32)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(512, 128)
        self.fc2 = nn.Linear(128, 4)

    def forward(self, x):
        x = x.to(device)
        x = self.conv_block(x)
        x = self.flatten(x)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

class ConvBlock(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(ConvBlock, self).__init__()
        self.input_dim = input_dim
        d = output_dim // 4
        self.conv1 = nn.Conv2d(input_dim, d, kernel_size=1, padding='same')
        self.conv2 = nn.Conv2d(input_dim, d, kernel_size=2, padding='same')
        self.conv3 = nn.Conv2d(input_dim, d, kernel_size=3, padding='same')
        self.conv4 = nn.Conv2d(input_dim, d, kernel_size=4, padding='same')
        self.relu = nn.ReLU()

    def forward(self, x):
        if len(x.shape) == 1:
            x = x.view(1, self.input_dim, 4, 4)
        elif len(x.shape) == 2:
            x = x.view(-1, self.input_dim, 4, 4)

        x1 = self.relu(self.conv1(x))
        x2 = self.relu(self.conv2(x))
        x3 = self.relu(self.conv3(x))
        x4 = self.relu(self.conv4(x))
        return torch.cat((x1, x2, x3, x4), dim=1)

This section of code contains two things: 'QNetworkConv', a convolutional neural network, and the second class it requires, 'ConvBlock'. The network consists of a convolutional block, a flatten layer, and two fully connected layers. The forward method passes the input through the convolutional block and then follows steps similar to those discussed for the other designs. The convolutional block uses four parallel convolutional layers, conv1 through conv4, with kernel sizes 1 through 4. The forward function in 'ConvBlock' reshapes its input to a 4x4 board if needed, applies each convolution followed by the relu activation, and concatenates the four results along the channel dimension.

This ConvBlock was inspired by [Chan, 2022]. We made modifications to accept our 1D input tensor and added activations throughout. We chose to build a network with one of these blocks after seeing their strong performance.
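The input size of 512 for the first fully connected layer in 'QNetworkConv' follows from the shapes inside the block. The arithmetic below is a small bookkeeping sketch, assuming a 4x4 board and that 'same' padding preserves the spatial size in every branch, as configured in the code above.

```python
# Shape bookkeeping for QNetworkConv (plain arithmetic, no PyTorch needed).
board_height, board_width = 4, 4
output_dim = 32                                   # ConvBlock(output_dim=32)
num_branches = 4                                  # kernel sizes 1, 2, 3 and 4

channels_per_branch = output_dim // num_branches  # d in ConvBlock
concat_channels = channels_per_branch * num_branches
flattened_features = concat_channels * board_height * board_width

print(channels_per_branch, concat_channels, flattened_features)  # 8 32 512
```

If `output_dim` or the board size were changed, the first argument of `nn.Linear(512, 128)` would need to change to match.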

## Methods

To create this project there were a few things we needed to build. First, we needed a 2048 game board, defined as 'class Board', which contains everything related to the game. We then created a class 'ReplayBuffer', which stores the data generated by the agent so that it can be revisited. After that we have our training loop, an evaluation model, and lastly a random evaluation model to serve as a baseline. The code and explanations for these are below.

## 2048 Game board

In [None]:
class Board:
    def __init__(self):
        # Declare variables for score, last added tile position, merges made in last move, and game over flag
        self.score = 0
        self.last_added_tile = None
        self.merges_in_last_move = 0
        self.game_over = False

        # Initialize board with all zeros (4x4 grid) and then add two tiles
        self.board = [[0] * 4 for _ in range(4)]
        self._add_new_tile()
        self._add_new_tile()

    def print_board(self):
        # Iterates through each row of the board and prints it
        for row in self.board:
            print(' '.join(map(str, row)))

    def print_score(self):
        # Outputs the current score
        print(f"Score: {self.score}")

    def print_highest_tile(self):
        # Outputs the current highest tile
        print(f"Highest Tile: {self.highest_tile()}")

    def move_tiles(self, direction):
        # Check if game is already over
        if self.game_over:
            return False

        # Reset merge count to zero
        self.merges_in_last_move = 0

        # Make a copy of the original board
        original_board = [row[:] for row in self.board]

        # Transpose board for up and down moves
        if direction in ('u', 'd'):
            self._transpose_board()

        # Run logic for all rows or columns in move
        for i in range(4):
            # Up and left are treated the same after transpose
            if direction in ('u', 'l'):
                shifted_row = self._shift(self.board[i])
                self.board[i] = self._merge(shifted_row)
            # Down and right must be reversed first since functions treat everything as left
            elif direction in ('d', 'r'):
                reversed_row = list(reversed(self.board[i]))
                shifted_row = self._shift(reversed_row)
                merged_row = self._merge(shifted_row)
                # Undo reverse
                self.board[i] = list(reversed(merged_row))

        # Run transpose again to reset
        if direction in ('u', 'd'):
            self._transpose_board()

        # Check if board changed from original
        if self.board != original_board:
            # Add a new tile and then check if game is over
            self._add_new_tile()
            if not self._moves_available():
                self.game_over = True
            return True
        # If board did not change from original, the move was invalid but game is not yet over
        return False

    def possible_moves(self):
        # Create lists to store moves and try all moves
        possible_moves = []
        directions = ['u', 'd', 'l', 'r']

        # Try all possible directions
        for direction in directions:
            if self._simulate_move(direction):
                possible_moves.append(direction)

        # Return list of directions
        return possible_moves

    def get_flattened_board(self):
        flattened_board = []
        for row in self.board:
            for value in row:
                # Traverse board and append each value
                flattened_board.append(value)
        return flattened_board

    def get_normalized_flattened_board(self):
        normalized_flattened_board = []
        for row in self.board:
            for value in row:
                # Traverse board and append each value normalized to base 2 (except zeros)
                normalized_value = math.log(value, 2) if value != 0 else 0
                normalized_flattened_board.append(normalized_value)
        return normalized_flattened_board

    def _transpose_board(self):
        # Converts rows to columns and columns to rows
        transposed = []
        for col_index in range(4):
            new_row = []
            for row in self.board:
                new_row.append(row[col_index])
            transposed.append(new_row)
        self.board = transposed

    def _shift(self, row):
        # Shifts non-zero elements to the left in a row
        shifted_row = []
        for value in row:
            if value != 0:
                shifted_row.append(value)
        # Append zeros to the end of the row to maintain its size
        while len(shifted_row) < 4:
            shifted_row.append(0)
        return shifted_row

    def _merge(self, row):
        # Merge adjacent tiles with the same value
        for i in range(3):
            if row[i] == row[i + 1] and row[i] != 0:
                row[i] *= 2
                row[i + 1] = 0
                self.score += row[i]
                # Update the number of merges
                self.merges_in_last_move += 1
        # Shift again to ensure tiles are properly aligned
        return self._shift(row)

    def _add_new_tile(self):
        # Check for possible empty positions
        empty_positions = []
        for i in range(4):
            for j in range(4):
                # If cell is zero it is empty
                if self.board[i][j] == 0:
                    empty_positions.append((i, j))
        if empty_positions:
            i, j = random.choice(empty_positions)
            # Add a 2 or 4 to a random empty position (10% chance to be 4, 90% chance to be 2)
            self.board[i][j] = 4 if random.random() < 0.1 else 2
            self.last_added_tile = (i, j)

    def _moves_available(self):
        # Check if there are any moves available
        for i in range(4):
            for j in range(4):
                # Check for empty spot
                if self.board[i][j] == 0:
                    return True
                # Check for a possible merge with the tile below (same column)
                if i < 3 and self.board[i][j] == self.board[i + 1][j]:
                    return True
                # Check for a possible merge with the tile to the right (same row)
                if j < 3 and self.board[i][j] == self.board[i][j + 1]:
                    return True
        return False

    def _simulate_move(self, direction):
        # Make a copy of the whole object before simulating move
        board_copy = copy.deepcopy(self)

        # Simulate the move on the copy
        board_copy.move_tiles(direction)

        # Return whether or not the board changed
        return board_copy.board != self.board
    
    def highest_tile(self):
        # Returns the highest tile on the board
        highest_tile = 0
        for row in self.board:
            for value in row:
                if value > highest_tile:
                    highest_tile = value
        return highest_tile

The class above, 'class Board', defines all the logic for our 2048 game. This code can be called by a human-playable version of the game or by the neural network. Later we include what we call our 'interface', a playable GUI version of the game built on the board, as well as a terminal version that is also human playable.

To run the game, you must first create a game object. This calls the class's init function, which declares variables for the score, the last added tile position, the number of merges from the last move, and the game over flag. It then initializes the game board, a 4x4 grid stored as a 2-dimensional list, and calls the add tile function twice to place the initial two tiles on the board.

From there we have a 'possible_moves' function, which tests each direction and returns the moves that are possible as a list. It calls the '_simulate_move' function, which takes the direction you wish to move in as a parameter. That function creates a copy of the game object, simulates a move in that direction on the copy, and reports whether the board changed.

This brings us to 'move_tiles', which is the longest but most interesting function in the class. It first checks whether the game is over, resets the merge counter for the last move to 0, and makes a copy of the board. If the move is up or down, the board is transposed, so the first row of the original board becomes the first column of the new board and so on; for left and right, the board is left alone. This lets us handle only two cases rather than all four. In the loop, if the direction is up or left, each row is shifted and merged. If the direction is down or right, we first reverse the row, shift and merge it, and then undo the reverse. As a result, up, down, left, and right all reuse the same shift and merge logic, with the board itself being modified instead. Finally, the board is transposed back if needed and compared with the original copy. If it has changed, a new tile is added and the game over flag is updated; if it hasn't changed, the move was invalid.

The 'move_tiles' function brings up our next few functions to discuss: transpose board, shift, and merge.

The transpose function takes the 2D list and flips it along its diagonal. We first create an empty transposed list and then loop over each column index. Within that loop we go through each row of the board and append the value at that column index to a new row. We then append the new row to the transposed board. At the end of the function, the board is replaced by the transposed board.
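As a compact illustration of the same idea, a transpose can also be written with `zip`. This is a standalone sketch; the free function `transpose` is a hypothetical stand-in for the '_transpose_board' method and returns a new board rather than modifying one in place.

```python
def transpose(board):
    """Return a new board whose rows are the columns of the input board."""
    return [list(column) for column in zip(*board)]

board = [
    [2, 0, 0, 0],
    [4, 0, 0, 0],
    [8, 0, 0, 0],
    [16, 0, 0, 0],
]

flipped = transpose(board)
print(flipped[0])  # the first column has become the first row

# Transposing twice restores the original board, which is why move_tiles
# can transpose, operate on rows, and transpose back for up/down moves.
restored = transpose(flipped)
```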

The shift function takes all the non-zero tiles in a given row and shifts them to the left. It does this by first making an empty list and then looping over each element of the row; if the value is non-zero, it is appended to the shifted row. Once the loop is done, a while loop appends zeros to the end of the row to maintain its size.

The merge function merges adjacent tiles with the same value in a given row. It iterates over the first three indices and compares each tile with the next one. If the two tiles are equal and non-zero, it doubles the value of the current tile, resets the other tile to 0, adds the new tile's value to the score, and increments the merges in last move. After the loop finishes we shift again to ensure the tiles are aligned correctly.
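The shift and merge steps, and the reverse trick used for right and down moves, can be traced on single rows. The functions below are standalone stand-ins for '_shift' and '_merge' written for illustration; unlike the class methods they return the score gained instead of updating `self.score`, and the names `slide_left` and `slide_right` are hypothetical.

```python
def shift(row):
    """Move all non-zero tiles to the left, padding with zeros."""
    tiles = [value for value in row if value != 0]
    return tiles + [0] * (len(row) - len(tiles))

def merge(row):
    """Merge equal neighbours left to right; return (new_row, score_gained)."""
    row = list(row)
    gained = 0
    for i in range(len(row) - 1):
        if row[i] != 0 and row[i] == row[i + 1]:
            row[i] *= 2
            row[i + 1] = 0      # the consumed tile cannot merge again this move
            gained += row[i]
    return shift(row), gained

def slide_left(row):
    return merge(shift(row))

def slide_right(row):
    # Reverse, treat as a left move, then undo the reverse
    merged, gained = slide_left(row[::-1])
    return merged[::-1], gained

print(slide_left([2, 0, 2, 4]))    # ([4, 4, 0, 0], 4)
print(slide_left([2, 2, 2, 2]))    # ([4, 4, 0, 0], 8)
print(slide_right([2, 2, 2, 0]))   # ([0, 0, 2, 4], 4)
```

Note that in the first example the newly created 4 does not merge with the existing 4 in the same move, which matches the rules of the game.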

We have get_flattened_board and get_normalized_flattened_board, which are used to transform the game board. These were introduced specifically for the neural network because it is easier to work with a 1-dimensional list than with our 2D list. The flattened board function traverses the game board row by row and appends each value to a new list, resulting in a flattened 1D board. The normalized version does practically the same thing, but it appends the base-2 log of each non-zero value instead and leaves empty cells as 0. The result is a 1D list of the same length with normalized values.
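The effect of the normalization is easiest to see on a small example. The sketch below is a standalone version of the idea; the helper name `normalize` is hypothetical, and it uses `math.log2` rather than `math.log(value, 2)` because the Python documentation notes that `log2` is usually more accurate for this purpose.

```python
import math

def normalize(flat_board):
    """Replace each non-zero tile with its base-2 log; keep empty cells as 0."""
    return [math.log2(value) if value != 0 else 0 for value in flat_board]

flat_board = [0, 2, 4, 8,
              0, 0, 16, 2048,
              0, 0, 0, 0,
              0, 0, 0, 0]

normalized = normalize(flat_board)
print(normalized[:8])
```

Raw tile values span several orders of magnitude (2 up to 2048 and beyond), whereas the normalized inputs grow by one per doubling, which keeps the network inputs in a small range.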

The moves available function checks whether any moves remain. It loops over every cell of the game board and checks whether the cell is empty, whether it can merge with the cell below it, and whether it can merge with the cell to its right. If any condition is true, it returns true, meaning a move is possible; if none are true for any cell, it returns false.
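The check can be exercised on two full boards, one with no legal move and one with a single possible merge. This is a standalone sketch of the same logic as '_moves_available', written as a free function for illustration.

```python
def moves_available(board):
    """Return True if any cell is empty or equals its lower or right neighbour."""
    size = len(board)
    for i in range(size):
        for j in range(size):
            if board[i][j] == 0:
                return True
            # Neighbour below (same column)
            if i + 1 < size and board[i][j] == board[i + 1][j]:
                return True
            # Neighbour to the right (same row)
            if j + 1 < size and board[i][j] == board[i][j + 1]:
                return True
    return False

# Checkerboard of 2s and 4s: full, and no two neighbours match
stuck = [
    [2, 4, 2, 4],
    [4, 2, 4, 2],
    [2, 4, 2, 4],
    [4, 2, 4, 2],
]

# Same board, but the bottom-right pair now matches
playable = [row[:] for row in stuck]
playable[3][3] = 4

print(moves_available(stuck), moves_available(playable))  # False True
```

Checking only the lower and right neighbours is enough, because every adjacent pair is visited once from its upper or left cell.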

The highest tile function just takes the board and returns the highest tile. It loops over each value in each row and determines which tile is highest and returns it.

## Main for running the game in terminal

In [None]:
# The main function for running the game
def main():
    # Make game object and set of allowed moves
    game = Board()
    move_commands = {'u': 'up', 'd': 'down', 'l': 'left', 'r': 'right'}

    # Loop until game over
    while not game.game_over:
        # Print the board and score, prompt user for move
        game.print_board()
        game.print_score()
        game.print_highest_tile()

        # Print possible moves
        print(game.possible_moves())

        move = input("Enter your move (u, d, l, r): ").lower()
        # Check if move was not possible and notify user if so
        if move in move_commands:
            if not game.move_tiles(move):
                print("Move not possible. Try a different direction.")
        else:
            print("Invalid input. Please enter 'u', 'd', 'l', or 'r'.")
        # Check if game is over and notify user if so
        if game.game_over:
            print("Game over! Your final score is:", game.score)

The code above is the main function we originally used to test the game in the terminal. It is a basic loop that lets you play.

## Interface for using pygame to play game with GUI

In [None]:
pygame.init()

# Set up the display and caption
width, height = 400, 450
screen = pygame.display.set_mode((width, height))
pygame.display.set_caption("2048 Game")

# Set styling for colors and font
background_color = (250, 245, 240)
tile_colors = {0: (205, 193, 180), 2: (238, 228, 218), 4: (237, 224, 200), 
               8: (242, 177, 121), 16: (245, 149, 99), 32: (246, 124, 95), 
               64: (246, 94, 59), 128: (237, 207, 114), 256: (237, 204, 97), 
               512: (237, 200, 80), 1024: (237, 197, 63), 2048: (237, 194, 46)}
font = pygame.font.SysFont("arial", 40)

# Create a game instance
game = Board()

# Function to draw game board
def draw_board():
    # Fill screen with background color
    screen.fill(background_color)
    # Loop over each value in the game board
    for i, row in enumerate(game.board):
        for j, value in enumerate(row):
            # Get the color for this value (white if missing) and draw rectangle on screen
            tile_color = tile_colors.get(value, (255, 255, 255))
            pygame.draw.rect(screen, tile_color, (j * 100, i * 100, 100, 100))
            # If the value was not zero, draw the numerical value on this tile's center as well
            if value != 0:
                # If we are at the index of the last added tile, draw it red for distinction
                if (i, j) == game.last_added_tile:
                    text_surface = font.render(str(value), True, (255, 0, 0))
                else:
                    text_surface = font.render(str(value), True, (0, 0, 0))
                text_rect = text_surface.get_rect(center=(j * 100 + 50, i * 100 + 50))
                screen.blit(text_surface, text_rect)

    # Draw the score below the game board
    score_text = font.render(f"Score: {game.score}", True, (0, 0, 0))
    score_rect = score_text.get_rect(center=(width // 2, height - 25))
    screen.blit(score_text, score_rect)

# Main game loop
running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        # Get input from arrow keys or WASD on user's keyboard for moves
        if event.type == pygame.KEYDOWN:
            if event.key in (pygame.K_w, pygame.K_UP):
                game.move_tiles('u')
            elif event.key in (pygame.K_s, pygame.K_DOWN):
                game.move_tiles('d')
            elif event.key in (pygame.K_a, pygame.K_LEFT):
                game.move_tiles('l')
            elif event.key in (pygame.K_d, pygame.K_RIGHT):
                game.move_tiles('r')

    # Draw the game board
    draw_board()

    # Update the display
    pygame.display.flip()

# Quit Pygame
pygame.quit()

The code above is a simple interface using pygame that allows the game to be played with a GUI. The instructions for this were obtained from the pygame documentation [PyGame, 2023].

## ReplayBuffer

In [None]:
class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

Above is our ReplayBuffer class, which is a crucial component of our training process. Since it is infeasible to store all possible 2048 game boards, we need a system that stores recent boards and overwrites the oldest ones as new ones arrive. The most important variable in this object is the capacity, which determines how many samples are kept at any given time. The push function checks whether there is still capacity available and, if so, appends an empty slot. Then the next set of information from our training loop is stored at the current position and the position counter is incremented. If this counter reaches the capacity, the modulo by the capacity wraps it back to the beginning. Lastly, the sample function simply returns a random sample of transitions from the buffer. Our training loop uses this function to obtain training samples for the neural network. The usage and structure of our buffer were inspired by [Paszke, 2023].
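The overwrite behaviour can be seen with a tiny capacity. The sketch below is an alternative formulation, not the class used in this notebook: it assumes a `collections.deque` with `maxlen`, which discards the oldest entry automatically once full. The class name `DequeReplayBuffer` is hypothetical.

```python
import random
from collections import deque

class DequeReplayBuffer:
    """Illustrative replay buffer backed by a bounded deque."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one is appended
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

replay = DequeReplayBuffer(capacity=3)
for step in range(5):
    # Dummy transitions: the state is just the step number
    replay.push(state=step, action='u', reward=0, next_state=step + 1, done=False)

kept_states = [transition[0] for transition in replay.buffer]
print(kept_states)        # [2, 3, 4]: the two oldest transitions were dropped
batch = replay.sample(2)
```

Either formulation keeps at most `capacity` transitions; the list-plus-position version in the notebook simply makes the wrap-around explicit.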

## Variables for network

In [None]:
q_network = QNetworkSimple().to(device)
optimizer = optim.Adam(q_network.parameters(), lr=5e-5)
criterion = nn.MSELoss()
replay_buffer = ReplayBuffer(capacity=10000)
num_episodes = 1000
batch_size = 100
epsilon = 0.9
epsilon_decay = .998
gamma = 0.90
action_indices = {'u': 0, 'd': 1, 'l': 2, 'r': 3}
all_scores = []
all_highest_tiles = []

Above is an example set of the variables needed to run our training loop. First, q_network is the object that holds our model and is set to one of the three model types above. Next, we chose Adam and MSELoss for our optimizer and loss function. We ran some tests with SGD and with L1-based loss functions, but found that they did not give us results as good.

Next we declare our replay buffer with a desired capacity. In our tests we found 10,000 to give good results. Then we configure our number of episodes and batch size. In our training loop, each episode is one playthrough of the 2048 game, and after every move one batch is sampled and trained through our network once the buffer holds enough samples. 1,000 episodes and a batch size of 100 worked well in our testing.

Epsilon, epsilon_decay, and gamma are the three variables that control the reinforcement learning process. Epsilon is the probability that a random action is taken instead of the action chosen by the q_network. At the beginning of training we generally want the network to explore possibilities, so epsilon starts off high at 0.9. As our model learns, however, we want it to start using the Q values from our network to make decisions, so epsilon is multiplied by the decay factor of 0.998 after each episode. Over a thousand episodes (0.9 * (0.998 ** 1000)) epsilon decays to a comfortable 0.1216, at which point our network uses the Q values much more often. Gamma controls how strongly the network weighs immediate rewards against long-term rewards. Since 2048 needs both long- and short-term rewards (mostly long, in our opinion as players), we chose to keep gamma at a relatively high 0.9, which prioritizes longer-term rewards.
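The numbers quoted above can be checked with a few lines of arithmetic. This sketch uses the same values as the variable cell; the `1 / (1 - gamma)` figure is a common rule of thumb for the effective planning horizon and is included here as an illustrative assumption, not something measured in our experiments.

```python
epsilon_start = 0.9
epsilon_decay = 0.998
num_episodes = 1000
gamma = 0.9

# Epsilon after one decay step per episode
epsilon_final = epsilon_start * epsilon_decay ** num_episodes
print(f"epsilon after {num_episodes} episodes: {epsilon_final:.4f}")

# Weight that gamma places on a reward received k moves in the future
discount_weights = {k: gamma ** k for k in (0, 1, 5, 10, 20)}
for k, weight in discount_weights.items():
    print(f"reward {k:2d} moves ahead is weighted by {weight:.3f}")

# Rule-of-thumb horizon over which rewards still matter noticeably
effective_horizon = 1.0 / (1.0 - gamma)
print(f"effective horizon: about {effective_horizon:.0f} moves")
```

With gamma at 0.9, a reward ten moves ahead still counts for roughly a third of an immediate one, whereas a lower gamma would make the agent far more short-sighted.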

The action_indices dictionary maps each move key to the index of its corresponding network output. The all_scores and all_highest_tiles lists hold the score and highest tile achieved in each episode for tracking performance.

## Training loop

In [None]:
# Loop through episodes
for episode in range(num_episodes):
    # Start a new game board and get the initial state
    game = Board()
    state = game.get_normalized_flattened_board()

    # Loop through steps until the game is over
    while not game.game_over:
        # Get the possible moves
        possible_moves = game.possible_moves()

        # Choose an action using epsilon-greedy
        if np.random.uniform() > epsilon:
            with torch.no_grad():
                # Get action values from the network
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                action_values = q_network(state_tensor)

                # Filter action values for only possible actions
                masked_action_values = torch.full(action_values.shape, float('-inf'))
                for action in possible_moves:
                    index = action_indices[action]
                    masked_action_values[0][index] = action_values[0][index]
                action = torch.argmax(masked_action_values).item()

                # Convert action index to action key
                for action_key, index in action_indices.items():
                    if index == action:
                        action = action_key
                        break
        # Otherwise choose a random action
        else:
            action = random.choice(possible_moves)

        # Move the tiles using the action and store the reward, state, and done
        game.move_tiles(action)
        reward = game.merges_in_last_move
        next_state = game.get_normalized_flattened_board()
        done = game.game_over
        replay_buffer.push(state, action, reward, next_state, done)

        # Sample and update network if buffer is large enough
        if len(replay_buffer.buffer) > batch_size * 10:
            batch = replay_buffer.sample(batch_size)
            # Split batch into separate components, convert actions to indices
            states, actions, rewards, next_states, dones = zip(*batch)
            actions_modified = [action_indices[action] for action in actions]

            # Convert to tensors
            states = torch.FloatTensor(states).to(device)
            actions = torch.LongTensor(actions_modified).to(device)
            rewards = torch.FloatTensor(rewards).to(device)
            next_states = torch.FloatTensor(next_states).to(device)
            dones = torch.BoolTensor(dones).to(device)

            # Compute current Q values
            current_q_values = q_network(states).gather(1, actions.unsqueeze(1))

            # Compute next Q values
            next_q_values = q_network(next_states).max(1)[0]

            # Zero out Q values that will lead to done state
            next_q_values[dones] = 0.0

            # Compute target Q values
            target_q_values = rewards + (gamma * next_q_values)

            # Compute loss
            loss = criterion(current_q_values, target_q_values.unsqueeze(1))

            # Optimize the network
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Update the state
        state = next_state

    # Update epsilon
    epsilon *= epsilon_decay

    # Store this game's score and highest tile
    all_scores.append(game.score)
    all_highest_tiles.append(game.highest_tile())

    # Display progress
    if episode % 100 == 0:
        print(f"Ran {episode} episodes...")

The training loop above is where the learning happens. It uses the variables and objects initialized earlier and runs once per episode. Each episode creates a new game board and reads its state as a normalized 1D list for our network. An inner loop then runs until the game is over. On every step, the agent decides whether to make a random move or a move based on the Q values, using the epsilon variable described earlier: it draws a random value between 0 and 1, and if that value is greater than epsilon it takes the "greedy" (Q-value-based) action; otherwise it takes a random one. In both cases ONLY a valid move can be chosen (i.e. our training will not let the agent keep trying to move right if moving right does nothing). For the greedy choice this is achieved by setting the Q values of invalid moves to -infinity, and for the random choice by sampling only from the possible moves. After each move, the state before and after the move, the action taken, the reward (the number of merges in that move), and a flag for whether or not the game is over are stored in the replay buffer.
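
The action selection described above can be sketched in plain Python. This is a minimal illustration rather than the notebook's exact code: the `ACTIONS` ordering and the `choose_action` helper are hypothetical names, and plain lists stand in for the network's output tensor.

```python
import random

# Hypothetical action order; the notebook keeps its real mapping in `action_indices`.
ACTIONS = ["up", "down", "left", "right"]

def choose_action(q_values, valid_moves, epsilon, rng=random):
    """Epsilon-greedy choice restricted to the currently valid moves."""
    if rng.random() > epsilon:
        # Greedy branch: invalid actions get -inf so they can never be the maximum
        masked = [q if move in valid_moves else float("-inf")
                  for move, q in zip(ACTIONS, q_values)]
        return ACTIONS[masked.index(max(masked))]
    # Exploration branch: sample only from the valid moves
    return rng.choice(valid_moves)

# "down" has the highest Q value but is invalid here, so the greedy pick is "left"
print(choose_action([0.1, 0.9, 0.3, 0.2], ["up", "left"], epsilon=0.0,
                    rng=random.Random(0)))
```

Masking with -infinity instead of re-sampling guarantees that the greedy branch returns a valid move in a single pass, as long as at least one move is possible.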

Once the replay buffer holds more than 10 batches' worth of transitions, training begins. Our training logic draws a random batch from the replay buffer and converts each component of the batch into a tensor for easy processing. Next, the current Q values of the actions taken and the maximum Q values of the next states are calculated for each sample. Next Q values for transitions that ended the game are zeroed. The gamma variable is then used to calculate the target Q values (the reward plus gamma times the next Q value), and the loss is calculated between the current and target Q values. Finally, the loss is backpropagated and the optimizer steps.
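
Written as a formula, the update described above looks roughly like this. The symbols are our own shorthand rather than names from the code: $s_i$, $a_i$, $r_i$, $s'_i$ are the state, action, reward, and next state of sample $i$, and $d_i$ is 1 if the game ended on that transition and 0 otherwise. The loss function itself is whatever `criterion` was set to earlier.

$$
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q(s'_i, a'), \qquad \text{loss} = \text{criterion}\big(Q(s_i, a_i),\; y_i\big)
$$

As a worked example with our gamma of 0.90: a move that merges two pairs of tiles ($r_i = 2$) and leads to a state whose best Q value is 5 gets the target $y_i = 2 + 0.90 \cdot 5 = 6.5$. If the same move had ended the game, the target would simply be $y_i = 2$.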

Another reason we did not write this code in TensorFlow or build it on the A5 code is that this function would have been essentially the same in all three cases, differing only slightly in syntax. Because of this, we chose to stick with PyTorch, which we see becoming the largest framework, so we can use this code and knowledge moving forward.

This training loop and its functionality were inspired by the PyTorch docs [Paszke, 2023].

## Evaluate model

In [None]:
def evaluate_model(num_games=100):
    top_scores = []
    total_score = 0
    for game_num in range(num_games):
        game = Board()
        state = game.get_normalized_flattened_board()
        while not game.game_over:
            with torch.no_grad():
                # Add a batch dimension and keep the input on the same device as the network
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
                action_values = q_network(state_tensor)
                possible_moves = game.possible_moves()
                # Mask invalid moves with -inf so argmax can only pick a valid one
                masked_action_values = torch.full_like(action_values, float('-inf'))
                for move in possible_moves:
                    index = action_indices[move]
                    masked_action_values[0][index] = action_values[0][index]
                action = torch.argmax(masked_action_values).item()
                # Map the chosen index back to its move name
                chosen_action = list(action_indices.keys())[list(action_indices.values()).index(action)]
            game.move_tiles(chosen_action)
            state = game.get_normalized_flattened_board()
        total_score += game.score
        top_scores.append(game.highest_tile())
        if game_num % 100 == 0:
            print(f"Played {game_num} games...")
    avg_score = total_score / num_games
    print(f"Average Score: {avg_score}")
    return top_scores

This function evaluates the performance of our trained model by having it play a set number of games, determined by the input parameter 'num_games', and then calculating the average score. It first initializes 'top_scores', which collects the highest tile reached in each game, and 'total_score', which accumulates the score of all games. It then loops once per game. At the start of each game we initialize a new game board and immediately read its state as a normalized, flattened board, because this is easier for the network to work with and matches what we did when training our model. We then enter an inner loop that continues until the game ends. Inside this loop, the current state is converted to 'state_tensor' and passed through the q_network, which returns 'action_values'. These values are then masked: in the loop 'for move in possible_moves', only the Q values of valid moves are copied over, while every other entry stays at -infinity. The action with the highest masked value is chosen and performed, and the state of the game board is updated. This repeats until the game is over, at which point the game's score is added to the total and its highest tile is appended to 'top_scores'. After all games have been played, the average score is printed and the list of highest tiles is returned.
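
Since evaluate_model returns the highest tile from every game, the list can be summarized as a distribution. The sketch below is one possible way to do that; `summarize_highest_tiles` is a hypothetical helper that is not part of our notebook, and the sample list is made-up illustration data, not a result.

```python
from collections import Counter

def summarize_highest_tiles(top_scores):
    """Return the fraction of games that ended on each highest tile."""
    counts = Counter(top_scores)
    total = len(top_scores)
    return {tile: counts[tile] / total for tile in sorted(counts)}

# Made-up example input: four games ending on 128, 256, 256, and 512
print(summarize_highest_tiles([128, 256, 256, 512]))
```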

## Evaluate random

In [None]:
from game import Board
import random

def evaluate_random_actions(num_games=1000):
    top_scores = []
    total_score = 0
    for game_num in range(num_games):
        game = Board()
        while not game.game_over:
            possible_moves = game.possible_moves()
            action = random.choice(possible_moves)
            game.move_tiles(action)
        total_score += game.score
        top_scores.append(game.highest_tile())
        if game_num % 100 == 0:
            print(f"Played {game_num} games...")
    avg_score = total_score / num_games
    print(f"Average Score with Random Actions: {avg_score}")
    return top_scores

scores = evaluate_random_actions()

This function does the same as our evaluate_model, except that it does not use the network. Instead, every move is picked at random from the currently possible moves (as random as a computer's pseudorandom generator can be).

## Results

*Note: in the graphs below, the index axis represents the 1,000 games and the value axis represents the highest tile achieved. The values were sorted before plotting.*

In our experiments, we tried to train the best possible network for each of the three network structures described earlier. Additionally, we ran some tests using a random agent to provide a baseline comparison against our trained models. Because we found the models to perform quite differently from one training run to the next, we kept the hyperparameters the same across all three models to keep comparisons consistent, and trained each model multiple times until a desirable one was achieved. We found the hyperparameters below to work well for our experiments:

Replay buffer capacity = 10,000

Number of Episodes = 1,000

Batch Size = 100

Epsilon = 0.9

Epsilon Decay = 0.998

Gamma = 0.90

Learning Rate = 5e-5
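
To get a feel for the exploration schedule these values imply, the short calculation below applies the per-episode update `epsilon *= epsilon_decay` from the training loop. The helper name `epsilon_after` is only for illustration.

```python
initial_epsilon = 0.9
epsilon_decay = 0.998
num_episodes = 1000

def epsilon_after(episodes, start=initial_epsilon, decay=epsilon_decay):
    """Epsilon after multiplying by the decay factor once per episode."""
    return start * decay ** episodes

final_epsilon = epsilon_after(num_episodes)
print(f"Epsilon after {num_episodes} episodes: {final_epsilon:.3f}")
```

With these settings epsilon ends at roughly 0.12, so even in the last episodes about one move in eight is still chosen at random.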

For our random baseline, we found that randomly selecting moves is not a good strategy in the 2048 game. This baseline performed much worse than any of our trained networks did. Despite the poor performance, it provides a useful benchmark to see whether our neural networks are in fact learning moves that provide an advantage to the player.

Average score of a random player across 1,000 games: 1086.104

![random player](random.png)

Of the three model structures we tested, the QNetworkDropout performed the worst. We hypothesize that this could be due to the dropout probability being quite high at 0.5. This means that, on average, 50% of the activations passing through that particular layer were zeroed out during training. While we thought this might improve performance in case overfitting was an issue, we saw that it only made things worse. Further testing would need to be done to see whether a network with dropout can help on this task.

Average score of the QNetworkDropout model across 1,000 games: 1618.652

![dropout](droupout.png)
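
To make the effect of a 0.5 dropout probability concrete, here is a minimal pure-Python sketch of "inverted" dropout, the scheme PyTorch's `nn.Dropout` uses: in training mode each activation is zeroed with probability p and the survivors are scaled by 1/(1-p), while in evaluation mode the layer passes values through unchanged. The function name `inverted_dropout` is hypothetical, and the sketch assumes p < 1.

```python
import random

def inverted_dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p and rescale the survivors."""
    if not training or p == 0.0:
        # Evaluation mode: dropout does nothing
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)  # assumes p < 1
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

layer_output = [0.2, 0.4, 0.6, 0.8]
print(inverted_dropout(layer_output, p=0.5, training=True, rng=random.Random(0)))
print(inverted_dropout(layer_output, p=0.5, training=False))
```

In PyTorch the switch between the two modes is made with `model.train()` and `model.eval()`. A follow-up experiment could check which mode the network was in during evaluation, since leaving dropout active while playing would add noise to the Q values.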

The QNetworkConv experiment performed in between our two models with linear layers. While we expected convolutional layers to model this game best due to the relevance of spatial structure in 2048, they did not have the best performance in our experiments. Nonetheless, the model still performed quite well in comparison to the baseline, with an average score roughly 2x that of the random player.

Average score of the QNetworkConv model across 1,000 games: 2137.132

![conv](conv.png)

In our experiment with the QNetworkSimple model, which consists of three linear layers, we achieved the highest average score across all experiments. This model averaged roughly 2.6x the random baseline over a thousand games and was also able to get the 1024 tile in two cases! We suspect that the inputs from the game 2048 are more easily modeled by multiple linear layers than by convolutional ones, which would explain the better score here.

Average score of the QNetworkSimple model across 1,000 games: 2794.344

![simple](simple.png)
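
The comparisons against the baseline can be reproduced directly from the average scores reported above. The snippet below is just that arithmetic; the dictionary keys are labels chosen for this example.

```python
# Average scores over 1,000 games, as reported in this section
average_scores = {
    "random": 1086.104,
    "QNetworkDropout": 1618.652,
    "QNetworkConv": 2137.132,
    "QNetworkSimple": 2794.344,
}

baseline = average_scores["random"]
ratios = {name: score / baseline
          for name, score in average_scores.items() if name != "random"}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the random baseline")
```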

## Conclusions

Throughout this project we learned how to take a problem and apply reinforcement learning agents to it. We gained familiarity with the PyTorch framework, learned how it can be leveraged to provide simple-to-understand deep Q-learning logic, and applied it to the 2048 game.

The most difficult part was determining how to get from our 2048 game board class to the deep Q-learning logic. We first had to think about how the network would be rewarded, how it could be kept from making invalid moves that go nowhere, and how we could track metrics to evaluate our networks. Additionally, it was difficult to adapt the game board into a tensor that was compatible with our models. Next, we had to define how the neural network would be trained and how each 'episode' would function. Thankfully, PyTorch has a nice guide that gave us a great place to start [Paszke, 2023].

After we had our training logic down, it was interesting to run various experiments with different parameters. What surprised us the most was that, in comparison to many other online attempts at this problem [Baluja, 2021; Virdee, 2018; Pan, 2019; Goenawan, 2020; Chan, 2022], our models did not perform nearly as well. Despite not having as much success, we were able to train models that far outperform random play, which was amazing and fun to see!

We think these differences could be attributed to our model structures or hyperparameters, but because we are still early in learning RL, we are unsure of the exact cause. This project was really fun and taught us a lot about a concept that we were not familiar with. As we continue to learn more about RL, we would like to come back to this work someday to investigate how we can get a top-tier model!

### Who did what

Chad did: QNetworkSimple, get_flattened_board, get_normalized_flattened_board, _transpose_board, _moves_available, ReplayBuffer, evaluate_model, created the terminal interface, and collaboratively programmed the training loop.

Logan did: QNetworkDropout, QNetworkConv, move_tiles, possible_moves, _shift, _merge, _add_new_tile, _simulate_move, evaluate_random_actions, created the interface for playing with PyGame, and collaboratively programmed the training loop.

Some portions of this project were done collaboratively and some were done individually. The training loop was largely done in a pair-programming manner, allowing both of us to bounce ideas back and forth effectively, although Chad was a bit more experienced here, and Logan (myself) learned a lot.

### References

* [Goodfellow, et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, [Deep Learning](http://www.deeplearningbook.org), MIT Press, 2016
* [Baluja, 2021] Michael Baluja, [Reinforcement Learning for 2048](https://github.com/michaelbaluja/rl-2048), 2021
* [Virdee, 2018] Navjinder Virdee, [Trained A Neural Network To Play 2048 using Deep-Reinforcement Learning](https://github.com/navjindervirdee/2048-deep-reinforcement-learning), 2018
* [Pan, 2019] Tianyi Pan, [Applied Reinforcement Learning with 2048](https://www.linkedin.com/pulse/part-1-applied-reinforcement-learning-2048-tianyi-pan), 2019
* [Goenawan, 2020] Nathaniel Goenawan, Simon Tao, and Katherine Wu, [What’s in a Game: Solving 2048 with Reinforcement Learning](https://web.stanford.edu/class/aa228/reports/2020/final41.pdf), Stanford, 2020
* [Paszke, 2023] Adam Paszke and Mark Towers, [Reinforcement Learning (DQN) Tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html), PyTorch, 2023
* [Chan, 2022] LH Chan, [2048_rl](https://github.com/qwert12500/2048_rl/blob/main/2048_Reinforcement_Learning.ipynb), 2022
* [PyGame, 2023] PyGame Docs, [pygame documentation](https://www.pygame.org/docs/), 2023

In [1]:
import io
import nbformat
import glob
nbfile = glob.glob('TermProjectChadLogan.ipynb')
if len(nbfile) > 1:
    print('More than one ipynb file. Using the first one.  nbfile=', nbfile)
with io.open(nbfile[0], 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, nbformat.NO_CONVERT)
word_count = 0
for cell in nb.cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print('Word count for file', nbfile[0], 'is', word_count)

Word count for file TermProjectChadLogan.ipynb is 4286
