# Submission for Coursework 3
Team: Billie-Jo Powers, Jasmine Burgess, Mark Holcroft & $\tiny{Harry Ellington}$  

In [None]:
# imports 
import sys
import pickle
import numpy as np 
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
sys.path.append('..')

## Introduction

Pig is a simple dice game of chance whose interesting probabilistic features have been well studied in a number of academic papers. This report re‐examines one such paper, by Todd W. Neller and Clifton G. M. Presser. The aim of their paper is to find the optimal policy for a player of Pig, which they compute using the value iteration algorithm. At the time the paper was published, classical value iteration was considered too slow because states took too long to converge, so a ‘layered’ approach of working backwards was used instead. This applies value iteration to subsets of the state space, ensuring that each subset converges before moving on to the next. The paper found that the optimal policy is non‐smooth and includes unusual and unintuitive features.

## Simplified Game of Pig

Neller and Presser initially work with a simplified version of Pig, named "Piglet". In this version, the die is replaced with a coin, where landing on tails is the equivalent of rolling a one. Traditionally, the goal of Piglet is to reach 10 points first; however, the number of equations required to calculate the probability of winning from each possible state is still too large for a simple example, so we instead set a goal of 2. 

Let $P_{i,j,k}$ be the probability of the player winning given that their score is $i$, the opponent's score is $j$, and the current turn total is $k$. Below we recreate Figure $2$ from Neller and Presser, which shows the result of applying value iteration to Piglet with a goal score of $2$. We were able to confirm that our results give the same final probabilities as Neller and Presser. 
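
As an illustration of the computation, here is a minimal sketch of plain value iteration for Piglet. It is independent of our `PigletSolver` class; the function name `piglet_value_iteration` is hypothetical, and the update rule (flip versus hold) is our reading of the equations in Neller and Presser.

```python
def piglet_value_iteration(goal=2, epsilon=1e-6):
    """Sketch of value iteration for Piglet (coin-flip Pig).

    Returns a dict mapping (i, j, k) to the estimated win probability
    of the player about to act.
    """
    # Non-terminal states: both scores below the goal, and the turn
    # total not yet large enough to win by holding.
    states = [(i, j, k)
              for i in range(goal)
              for j in range(goal)
              for k in range(goal - i)]
    p = {s: 0.0 for s in states}

    def win_prob(i, j, k):
        # Banked score plus turn total reaching the goal is a win.
        if i + k >= goal:
            return 1.0
        return p[(i, j, k)]

    while True:
        delta = 0.0
        for (i, j, k) in states:
            # Flip: tails loses the turn total, heads adds one to it.
            flip = 0.5 * ((1.0 - win_prob(j, i, 0)) + win_prob(i, j, k + 1))
            # Hold: bank the turn total and pass the coin.
            hold = 1.0 - win_prob(j, i + k, 0)
            new_value = max(flip, hold)
            delta = max(delta, abs(new_value - p[(i, j, k)]))
            p[(i, j, k)] = new_value
        if delta < epsilon:
            return p


piglet_probs = piglet_value_iteration(goal=2)
print(piglet_probs[(0, 0, 0)])
```

For a goal of 2 the six equations can also be solved by hand, which gives $P_{0,0,0} = 4/7$ and provides a simple check on the iteration.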

In [None]:
from piglet import PigletSolver
# Note: 10e-6 is equal to 1e-5
piglet_plotter = PigletSolver(goal=2, epsilon=10e-6)
piglet_plotter(convergence_plots=True)

# Early states have to wait for later states to converge before they can converge themselves, which motivates the layered procedure.



## Surface Plot of Optimal Policy

We can now reproduce the result given for the full game of Pig. As with Piglet, layered value iteration is used to find an optimal policy. Here we work with a six-sided die and a goal score of $100$, as in the traditional rules, though our code can adapt the rules to user preference. Figure 3 in Neller and Presser gives a plot of the optimal policy, and our recreation is given below. The graph is interpreted as follows: for states below the surface, one should roll again, and for states above the surface, the optimal choice is to hold. 
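
The layered idea can be sketched in pure Python for a small goal. This is not our numba-optimised `pig_layered_value_iteration`; the name `layered_pig_value_iteration` is hypothetical, and layering by decreasing score sum $i + j$ is our reading of the partitioning in Neller and Presser. Each layer only depends on itself and on layers with a larger score sum, so it can be iterated to convergence before moving on.

```python
def layered_pig_value_iteration(goal=10, die_sides=6, epsilon=1e-9):
    """Sketch of layered value iteration for Pig with a small goal.

    Returns a dict mapping (i, j, k) to the win probability of the
    player about to act.
    """
    p = {}

    def win_prob(i, j, k):
        if i + k >= goal:
            return 1.0
        return p.get((i, j, k), 0.0)

    # Work backwards: the highest score sums are solved first.
    for total in range(2 * (goal - 1), -1, -1):
        pairs = [(i, total - i) for i in range(goal) if 0 <= total - i < goal]
        layer = [(i, j, k) for (i, j) in pairs for k in range(goal - i)]
        while True:
            delta = 0.0
            for (i, j, k) in layer:
                # Roll: a 1 ends the turn with nothing banked,
                # any other face is added to the turn total.
                roll = (1.0 - win_prob(j, i, 0)) + sum(
                    win_prob(i, j, k + r) for r in range(2, die_sides + 1))
                roll /= die_sides
                # Hold: bank the turn total and pass the die.
                hold = 1.0 - win_prob(j, i + k, 0)
                new_value = max(roll, hold)
                delta = max(delta, abs(new_value - p.get((i, j, k), 0.0)))
                p[(i, j, k)] = new_value
            if delta < epsilon:
                break  # this layer has converged; move to the next one
    return p


small_game = layered_pig_value_iteration(goal=10)
print(small_game[(0, 0, 0)])
```

A goal of 10 keeps the pure-Python run short; the full game with a goal of 100 has far more states, which is why our own implementation pre-compiles its loops.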

Our recreation has distinct similarities to the plot produced by Neller and Presser, which indicates a good level of reproducibility in this regard. 
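
To read the roll/hold surface off a policy array numerically, something like the following sketch could be used. It assumes the policy is indexed as `policy[i, j, k]` and encodes roll as `1` and hold as `0`; this encoding and the function name `first_hold_turn_total` are assumptions for illustration and may not match our `policy` array exactly.

```python
import numpy as np


def first_hold_turn_total(policy, roll_value=1):
    """For each (player score, opponent score), return the smallest
    turn total k at which the policy does not roll.

    Entries equal to policy.shape[2] mean the policy rolls at every
    turn total in the array.
    """
    holds = policy != roll_value
    # argmax returns the index of the first True along the k axis
    # (and 0 when there is no True, which is handled separately).
    first_hold = holds.argmax(axis=2)
    never_holds = ~holds.any(axis=2)
    return np.where(never_holds, policy.shape[2], first_hold)


# Toy example: roll everywhere, except hold from turn total 2 at (0, 0).
toy_policy = np.ones((2, 2, 4), dtype=int)
toy_policy[0, 0, 2:] = 0
print(first_hold_turn_total(toy_policy))
```

Because the optimal policy is non-smooth, the first hold along $k$ need not describe the whole boundary at every score pair, so this is only a summary of the surface.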



In [None]:
# Because the loops are pre-compiled with the numba package, the optimal policy can be derived in minimal time
from notebook_writeup.optimised_layered_vi import pig_layered_value_iteration
 
die_size = 6
target_score = 100
max_turn = 100
values, policy = pig_layered_value_iteration(target_score=target_score, 
                                             die_sides=die_size, 
                                             max_turn=max_turn, 
                                             epsilon=1e-6)

In [None]:
# We can display our policy using the following plotting module
from notebook_writeup.plotting_tools import generate_box_plots
generate_box_plots(policy, title = 'Plot of optimal Policy', pad=True)

### Extracting Reachable States 

Figures 5 and 6 of the original paper depict the optimal policy at all reachable states. Note that the reachable states are calculated on the assumption that the opponent could be playing any strategy; if the opponent were playing optimally, these figures would be symmetric about the axis Player Score = Opponent Score. These figures have been reproduced below. They show clear similarities with those of the original paper, particularly the omission of states with player scores between 1 and 19, and the tendency for the maximum turn score to increase for larger values of opponent score.

Our recreation does exhibit slight deviations from Neller and Presser's results in regions where both players have high scores. This is likely due to those states being more difficult to reach, and so they appear less often in our simulations. 
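
The idea of collecting reachable states by simulation can be sketched as follows. This is not the experiment that produced `reachable_states.pkl`; the names `hold_at` and `visited_states` are hypothetical, policies are assumed to be functions of $(i, j, k)$ that return `True` to roll, and a "hold at 20" player stands in for the optimal policy to keep the example self-contained.

```python
import random


def hold_at(n):
    """Stand-in policy: keep rolling while the turn total is below n."""
    return lambda i, j, k: k < n


def visited_states(policy, opponent, goal=100, die_sides=6, games=500, seed=0):
    """Simulate games and record every decision state (i, j, k)
    encountered by the first player."""
    rng = random.Random(seed)
    visited = set()
    for _ in range(games):
        scores = [0, 0]
        current = 0  # player 0 is the player whose states are recorded
        while max(scores) < goal:
            i, j = scores[current], scores[1 - current]
            decide = policy if current == 0 else opponent
            k = 0
            while i + k < goal:
                if current == 0:
                    visited.add((i, j, k))
                if not decide(i, j, k):
                    break  # hold and bank the turn total
                roll = rng.randint(1, die_sides)
                if roll == 1:
                    k = 0  # rolling a one forfeits the turn total
                    break
                k += roll
            scores[current] = i + k
            current = 1 - current
    return visited


states = visited_states(hold_at(20), hold_at(20), games=200, seed=1)
print(len(states))
```

With this stand-in policy the player only banks once the turn total reaches 20, so no recorded state has a player score between 1 and 19, which mirrors the gap seen in the figures. States that are rarely visited may be missed unless many games are simulated.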

In [None]:
# reachable states are extractable from experiment (PLACEHOLDER)
with open('pickle_and_config_files/reachable_states.pkl', 'rb') as d:
    reachable_array = pickle.load(d)
generate_box_plots(reachable_array, title = 'Plot of Reachable States')

Another graph which was successfully replicated is Figure 4. Shown below, it depicts a cross‐section of the reachable states for an opponent score of 30, and includes the optimal boundary and the “hold at 20” heuristic as a benchmark for comparison. The graph was produced using simulation, and shares the same features and properties as that of the original paper. Note crucially that the reachable states are those reachable during any stage of the game, not only when the opponent is on a score of 30. 

In [None]:
cross_section_reachable = np.array([[reachable_array[(i, 30, k)] for i in range(101)] for k in range(100, -1, -1)])
cross_section_policy = policy[:, 30, ::-1]

In [None]:
plt.figure(figsize=(8, 8))

# Show the two imshow layers
plt.imshow(cross_section_policy.T, cmap='Blues', alpha=0.6)
plt.imshow(cross_section_reachable, cmap='Reds', alpha=0.4)

# Axis labels
plt.xlabel("Player Score")
# The opponent score is fixed at 30 in this cross-section, so the vertical axis is the turn total
plt.ylabel("Turn Total")

# Custom legend using colored patches
policy_patch = mpatches.Patch(color='blue', alpha=0.6, label='Policy Region')
reachable_patch = mpatches.Patch(color='red', alpha=0.4, label='Reachable Region')
plt.legend(handles=[policy_patch, reachable_patch], loc='upper right')

plt.title("Policy vs Reachable State Overlay")

plt.show()


Note on the above plot: the reachable region is currently noisy and the plot is not yet satisfactory. We will rerun the code which maps the state space for many more iterations, which should give a much smoother red region.

### Contours of winning probability

The final graph we reproduce from Neller and Presser is Figure $7$, which gives the win probability contours for optimal play. This gives a visualisation of the probability of the optimal player winning given that they are in state $(i,j,k)$. In particular, the contours show the states for which the probability of winning is $3\%, 9\%, 27\%$, and $81\%$. Our reproduction of the graph is given below and is consistent with that found in the original paper. 

In [None]:
# make use of the final values returned by the layered value iteration above
from notebook_writeup.plotting_tools import plot_isosurface_from_array

plot_isosurface_from_array(values, isovalues=[0.03, 0.09, 0.27, 0.81])

### Reproducing Score Analytics 

Neller and Presser conclude that when both players play optimally, the player who goes first will win $53.06\%$ of the time (i.e. $P_{0,0,0} = 0.5306$). To reproduce this result, we used a simulation of the game of Pig with 100,000 replications and a starting seed of 123. The code used for these simulations is available in the files called below, and in this case gives that the first player wins $53.047\%$ of the games. 
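
For readers without access to those files, a minimal sketch of this kind of simulation is given here. It does not use our `Competition` class; the names `hold_at`, `play_game` and `first_player_win_rate` are hypothetical, and policies are assumed to be functions of $(i, j, k)$ that return `True` to roll rather than policy arrays.

```python
import random


def hold_at(n):
    """Policy that keeps rolling while the turn total is below n."""
    return lambda i, j, k: k < n


def play_game(first, second, rng, goal=100, die_sides=6):
    """Play one game of Pig; return 0 if the first player wins, else 1."""
    policies = (first, second)
    scores = [0, 0]
    current = 0
    while True:
        k = 0
        # Keep rolling while the goal is not yet reached and the policy says roll.
        while (scores[current] + k < goal
               and policies[current](scores[current], scores[1 - current], k)):
            roll = rng.randint(1, die_sides)
            if roll == 1:
                k = 0  # a one forfeits the turn total
                break
            k += roll
        scores[current] += k
        if scores[current] >= goal:
            return current
        current = 1 - current


def first_player_win_rate(first, second, replications=10000, seed=123):
    """Estimate the proportion of games won by the player who goes first."""
    rng = random.Random(seed)
    wins = sum(play_game(first, second, rng) == 0 for _ in range(replications))
    return wins / replications


rate = first_player_win_rate(hold_at(20), hold_at(20), replications=4000)
print(f"First player wins {rate:.3f} of games")
```

Seeding a dedicated `random.Random` instance keeps each estimate repeatable without touching the global random state.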

In [None]:
# The section 'The Solution' of the paper reports several win percentages; we try to reproduce them here
from competition import Competition
competition_function = Competition(policy, policy, replications=100000, seed = 123)
win_percentage = competition_function()
print(f'For two optimal policies playing against each other, player 1 wins {round(win_percentage*100, 3)}% of the time.')

We also note that our value of $P_{0,0,0}$, as given below, is $0.5306$, which is consistent with Neller and Presser's results. 

In [None]:
# note that this win percentage is the same as given in the value function
float(values[0,0,0] * 100)

Neller and Presser also reach some conclusions about what to expect when one player uses the optimal strategy and the other employs the "hold at 20" policy. To recreate these results, we again used the simulation technique above, where one player employs the "hold at 20" policy according to the following code:

In [None]:
# The Opponents class can build policies for the optimal player to play against
from competition import Opponents
hold_at_20_policy = Opponents.hold_at_n(20)  # the threshold can be tuned to try other "hold at n" values

Neller and Presser conclude that when the optimal player goes first in this scenario, they should win $58.74\%$ of the time, and when the "hold at 20" player goes first, they should win $47.76\%$ of the time. Below we give our simulations of the same scenarios, again with 100,000 replications each. The seeds are set in the code cells below: 10000 when the optimal player goes first, and 123 when the "hold at 20" player goes first. 

In our simulations, when the optimal player goes first, we find that they win $57.21\%$ of the games. Since this is below the value expected from Neller and Presser's results, we include a 95% confidence interval for this result. We also found in our simulation that when the "hold at 20" player goes first, they win $49.518\%$ of the time, which is again a slight deviation from the results found by Neller and Presser; a confidence interval computed in the same way can be used to assess this. 

**To do: add more discussion here using the confidence intervals, as it seems we will have to conclude that we could not completely reproduce these results from the paper.**
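
One way to quantify the gap is sketched here, using the figures quoted above ($57.21\%$ observed, $58.74\%$ reported, 100,000 replications). The helper names `wald_interval` and `proportion_z_score` are hypothetical, and the calculation assumes independent replications and a normal approximation to the binomial.

```python
import math


def wald_interval(observed, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(observed * (1 - observed) / n)
    return observed - z * se, observed + z * se


def proportion_z_score(observed, expected, n):
    """Number of standard errors between an observed proportion and
    a reference value, using the reference value for the variance."""
    se = math.sqrt(expected * (1 - expected) / n)
    return (observed - expected) / se


low, high = wald_interval(0.5721, 100_000)
z = proportion_z_score(0.5721, 0.5874, 100_000)
print(f"95% interval: [{low:.4f}, {high:.4f}], z-score: {z:.2f}")
```

Under these assumptions the reported value of $0.5874$ lies above the upper end of the interval, and the z-score is several standard errors below zero, which suggests the difference is unlikely to be explained by sampling noise alone.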

In [None]:
competition_function_h20 = Competition(policy, hold_at_20_policy, replications=100000, seed = 10000)
win_percentage_h20 = competition_function_h20()
print(f'For optimal policy playing against hold at 20, optimal player wins {round(win_percentage_h20*100, 3)}% of the time.')

In [None]:
# Standard error of the estimated win proportion (optimal player going first against "hold at 20")
se = np.sqrt(win_percentage_h20 * (1 - win_percentage_h20) / 100000)

# 95% confidence interval using the normal approximation
ci_lower = win_percentage_h20 - 1.96 * se
ci_upper = win_percentage_h20 + 1.96 * se

print(f"95% Confidence Interval: [{ci_lower:.3f}, {ci_upper:.3f}]")

In [None]:
competition_function_h20_2 = Competition(hold_at_20_policy, policy, replications=100000, seed = 123)
win_percentage_h20_2 = competition_function_h20_2()
print(f'For optimal policy playing against hold at 20, h20 player wins {round(win_percentage_h20_2*100, 3)}% of the time.')